-
-
Notifications
You must be signed in to change notification settings - Fork 701
Fix controller waitpoint resolution, suspendable state, and snapshot race conditions #2006
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🦋 Changeset detectedLatest commit: ed1a44c The changes in this PR will be included in the next version bump. Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
WalkthroughThis update refactors and enhances the workflow engine's handling of snapshot and waitpoint state, focusing on reliability and concurrency control. It introduces a new Changes
Sequence Diagram(s)sequenceDiagram
participant Controller
participant RunExecution
participant SnapshotManager
participant TaskRunProcess
Controller->>RunExecution: enqueueSnapshotChangeAndWait(runData)
RunExecution->>SnapshotManager: handleSnapshotChange(runData)
SnapshotManager->>SnapshotManager: Queue runData, process in order
SnapshotManager-->>RunExecution: onSnapshotChange(runData)
RunExecution->>TaskRunProcess: (as needed, e.g., suspend)
SnapshotManager-->>RunExecution: onSuspendable(suspendableSnapshot)
RunExecution->>TaskRunProcess: cleanup and suspend if needed
sequenceDiagram
participant Worker
participant SharedRuntimeManager
participant ExternalSystem
Worker->>SharedRuntimeManager: waitForTask / waitForBatch / waitForWaitpoint
ExternalSystem-->>SharedRuntimeManager: resolveWaitpoints([waitpoint])
SharedRuntimeManager->>Worker: Resolve promise(s) for completed waitpoints
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🔭 Outside diff range comments (1)
packages/cli-v3/src/entryPoints/managed/execution.ts (1)
767-783
:⚠️ Potential issueKeep
SnapshotManager
in-sync aftercontinueRunExecution
continueRunExecution
usually returns an updated snapshot (often with a new ID and status"EXECUTING"
).
Because the localSnapshotManager
is not updated here, subsequent calls (heartbeats, completions, etc.) may reference a stalesnapshotId
, causing 404 / 409 errors server-side until the poller/web-socket pushes the next change.-const continuationResult = await this.httpClient.continueRunExecution( - this.runFriendlyId, - this.snapshotManager.snapshotId -); +const continuationResult = await this.httpClient.continueRunExecution( + this.runFriendlyId, + this.snapshotManager.snapshotId +); if (!continuationResult.success) { throw new Error(continuationResult.error); } -// Track restore count +// 🔧 Immediately reflect the new snapshot locally +this.updateSnapshot( + continuationResult.data.snapshot.friendlyId, + continuationResult.data.snapshot.executionStatus +); + +// Track restore count this.restoreCount++;
🧹 Nitpick comments (13)
.changeset/plenty-dolphins-act.md (1)
6-8
: Summarize core fixes clearly.The bullet points accurately capture the patch’s purpose—early waitpoint resolution, pre-suspend state checks, and snapshot race-condition fixes.
Consider adding a trailing period on each line for consistency with the header style (optional).apps/webapp/app/v3/runEngineHandlers.server.ts (1)
510-510
: Consistent[engine]
prefix for checkpoint discards.Marking the discard event as
[engine] Checkpoint discarded: ${checkpoint.discardReason}
aligns it with other engine debug logs.
For uniformity with the execution snapshot message, you might consider using a dash (-
) instead of a colon (:
).packages/cli-v3/src/entryPoints/managed/controller.ts (3)
404-413
: Use a collision-safe UUID instead ofMath.random
for notification IDs
Math.random()
is not guaranteed to be unique and is not cryptographically secure.
Because the notification ID is subsequently logged and could be used for correlating log lines, a collision would make debugging harder.-const notificationId = Math.random().toString(36).substring(2, 15); +import { randomUUID } from "crypto"; // move import to top of file +const notificationId = randomUUID();If you want to avoid an extra import, at least use
Date.now()
together withMath.random()
to reduce the chance of collisions.
415-419
:controller
variable is misleadingly namedThe object holds captured IDs, not a controller instance. A more explicit name (e.g.
capturedIds
) would improve readability and avoid confusion with the actual controller class.
11-12
: RedundantRunLogger
import / type mismatch
this.logger
is typed asRunLogger
, but an instance ofManagedRunLogger
is assigned.
BecauseManagedRunLogger
already implementsRunLogger
, you can:
- Remove the unused
RunLogger
import.- Annotate the property more explicitly if you need concrete methods of
ManagedRunLogger
.-import { ManagedRunLogger, RunLogger, SendDebugLogOptions } from "./logger.js"; +import { ManagedRunLogger, SendDebugLogOptions } from "./logger.js"; ... -private readonly logger: RunLogger; +private readonly logger: ManagedRunLogger;Also applies to: 29-31, 50-54
packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1)
208-272
: Flaky timing–based concurrency assertionThe test relies on wall-clock
Date.now()
comparisons andsetTimeout
delays to prove handlers never overlap.
On a loaded CI runner or a very fast machine thecurrent.start >= previous.end
check may still intermittently fail because of timer coalescing or clock granularity.Consider replacing this with an atomic counter or a
Mutex
inside the handler that asserts mutual exclusion synchronously, e.g.:let inHandler = false; ... if (inHandler) { throw new Error("Parallel execution"); } inHandler = true; await setTimeout(20); inHandler = false;This removes dependence on wall-clock ordering.
packages/cli-v3/src/executions/taskRunProcess.ts (1)
193-198
: Unhandled rejections in new handlersBoth new handlers are
async
, but they forward the message to anEvt
.
If downstream listeners throw, the promise is rejected and silently ignored.
Wrap the body intry/catch
orvoid
-cast the awaited call to avoid unhandled rejection warnings in Node 18+.SEND_DEBUG_LOG: async (message) => { - this.onSendDebugLog.post(message); + try { + this.onSendDebugLog.post(message); + } catch (err) { + logger.debug("Unhandled error in onSendDebugLog listener", { err }); + } },packages/core/src/v3/runtime/sharedRuntimeManager.ts (2)
309-318
: Guard against non-stringoutput
when slicing for debug logs
output.slice(0, 100)
assumesoutput
is a string. If it is aBuffer
or other type, this will throw.- output: output?.slice(0, 100), + output: + typeof output === "string" + ? output.slice(0, 100) + : undefined,
38-43
: Memory leak: interval never cleared
setInterval
is created in the constructor but never cleared (evendisable()
is a no-op).
If a worker runs many executions, intervals accumulate and keep the event-loop alive.Store the interval ID and clear it in
disable()
.- setInterval(() => { + this.statusInterval = setInterval(() => { this.log("[DEBUG] SharedRuntimeManager status", this.status); }, 300_000); ... disable(): void { - // do nothing + if (this.statusInterval) clearInterval(this.statusInterval); }packages/cli-v3/src/entryPoints/managed/snapshot.ts (2)
90-97
: Lexicographical<
is unreliable for CUID / ULID orderingSnapshot IDs are compared with simple
<
/>
operators (string lexicographical order).
For CUIDs, this is usually okay, but for ULIDs or other formats ordering can break.
Safer: compare the numeric timestamp part (first 10 chars of ULID) or keep a monotonic counter.At minimum, clarify the invariant in comments and add tests to lock behaviour.
160-174
: Queue superseding resolves promises silently – callers may misinterpretWhen a pending
suspendable
change is superseded, youresolve()
the promise even though the change never reachedapplyChange
.
Callers awaiting that promise may assume the handler ran.Consider rejecting with a specific
AbortError
, or resolve with a boolean flag indicating it was skipped, to avoid false positives.packages/cli-v3/src/entryPoints/managed/execution.ts (2)
894-898
: Surface unhandled rejections when settingsuspendable
The setter silently swallows errors by only logging them. In production this can mask critical bugs (e.g. network failures) and make incident triage harder.
Consider propagating the promise (or at least returning it) so that callers mayawait
if they need to guarantee the suspendable flag is processed.
952-1023
: Concurrency: ensure Snapshot still current after suspend API callThere is a very small race window between the cleanup finishing and the HTTP
suspendRun
call returning.
If another snapshot change sneaks in (e.g. cancel) the runner could suspend an outdated snapshot ID, leaving the run in an inconsistent state.
A second equality check after the suspend response and before logging “suspending, any day now” would make this bullet-proof.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (23)
.changeset/plenty-dolphins-act.md
(1 hunks)apps/webapp/app/runEngine/validators/triggerTaskValidator.ts
(1 hunks)apps/webapp/app/v3/runEngineHandlers.server.ts
(4 hunks)packages/cli-v3/package.json
(2 hunks)packages/cli-v3/src/entryPoints/dev-run-worker.ts
(3 hunks)packages/cli-v3/src/entryPoints/managed-run-worker.ts
(3 hunks)packages/cli-v3/src/entryPoints/managed/controller.ts
(4 hunks)packages/cli-v3/src/entryPoints/managed/env.ts
(0 hunks)packages/cli-v3/src/entryPoints/managed/execution.ts
(36 hunks)packages/cli-v3/src/entryPoints/managed/logger.ts
(2 hunks)packages/cli-v3/src/entryPoints/managed/poller.ts
(1 hunks)packages/cli-v3/src/entryPoints/managed/snapshot.test.ts
(1 hunks)packages/cli-v3/src/entryPoints/managed/snapshot.ts
(1 hunks)packages/cli-v3/src/executions/taskRunProcess.ts
(4 hunks)packages/cli-v3/tsconfig.json
(1 hunks)packages/cli-v3/tsconfig.src.json
(1 hunks)packages/cli-v3/tsconfig.test.json
(1 hunks)packages/core/src/v3/runEngineWorker/supervisor/schemas.ts
(1 hunks)packages/core/src/v3/runtime/managedRuntimeManager.ts
(0 hunks)packages/core/src/v3/runtime/sharedRuntimeManager.ts
(1 hunks)packages/core/src/v3/schemas/messages.ts
(3 hunks)packages/core/src/v3/schemas/schemas.ts
(0 hunks)packages/core/src/v3/workers/index.ts
(1 hunks)
💤 Files with no reviewable changes (3)
- packages/cli-v3/src/entryPoints/managed/env.ts
- packages/core/src/v3/schemas/schemas.ts
- packages/core/src/v3/runtime/managedRuntimeManager.ts
🧰 Additional context used
🧬 Code Graph Analysis (4)
packages/cli-v3/src/entryPoints/dev-run-worker.ts (2)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (1)
SharedRuntimeManager
(25-349)packages/core/src/v3/workers/index.ts (1)
SharedRuntimeManager
(24-24)
packages/cli-v3/src/entryPoints/managed-run-worker.ts (1)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (1)
SharedRuntimeManager
(25-349)
packages/cli-v3/src/entryPoints/managed/logger.ts (4)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
DebugLogPropertiesInput
(141-141)DebugLogPropertiesInput
(142-142)packages/core/src/v3/runEngineWorker/workload/http.ts (1)
WorkloadHttpClient
(21-181)packages/cli-v3/src/entryPoints/managed/env.ts (1)
RunnerEnv
(53-219)packages/core/src/v3/index.ts (1)
flattenAttributes
(43-43)
packages/core/src/v3/schemas/messages.ts (1)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
DebugLogPropertiesInput
(141-141)DebugLogPropertiesInput
(142-142)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (34)
packages/cli-v3/src/entryPoints/managed/poller.ts (1)
87-87
: Good clarification on the snapshot ID purpose.This comment effectively communicates the limited role of the snapshot ID within the refactored architecture, making it clear that it serves as a diagnostic tool rather than a functional component in the execution flow.
packages/cli-v3/tsconfig.json (1)
8-11
: LGTM: Test configuration integration.Adding the reference to
tsconfig.test.json
properly integrates the new test configuration into the TypeScript project structure, creating a clearer separation between source and test configurations.apps/webapp/app/runEngine/validators/triggerTaskValidator.ts (1)
49-49
: Improved entitlement validation logic.The condition has been refined to only return an error on explicit denial of access (
result.hasAccess === false
), rather than also failing whenresult
is null/undefined. This ensures validation errors only occur on explicit denial, which aligns with the broader runtime management refactoring in this PR.packages/cli-v3/package.json (2)
42-44
: LGTM: Test file exclusion from build.Excluding test files from the
tshy
build process is the correct approach, ensuring they don't get included in the published package while still being available for development and CI workflows.
76-76
: LGTM: Test script addition.Adding the
test
script for Vitest complements the existing E2E test script and provides a standardized way to run unit tests, improving the developer experience and CI integration.packages/cli-v3/tsconfig.src.json (1)
4-4
: Exclude test files from source compilation.Adding
"exclude": ["./src/**/*.test.ts"]
ensures that test files are omitted from the primarytsconfig.src.json
, delegating test compilation to the dedicated test config. This change aligns perfectly with the newtsconfig.test.json
.packages/core/src/v3/workers/index.ts (1)
24-24
: Export the new SharedRuntimeManager.Replacing the legacy
ManagedRuntimeManager
export withSharedRuntimeManager
from../runtime/sharedRuntimeManager.js
follows the PR’s refactor. Please verify thatSharedRuntimeManager
implements theRuntimeManager
interface and that no stale imports ofManagedRuntimeManager
remain.packages/cli-v3/tsconfig.test.json (1)
1-11
: Introduce dedicated test TS config.This new
tsconfig.test.json
properly extends the base config, referencestsconfig.src.json
, and includesvitest/globals
. It cleanly separates test compilation from production builds.apps/webapp/app/v3/runEngineHandlers.server.ts (3)
404-404
: Add[engine]
prefix for execution snapshot logs.Updating the debug message to
[engine] ${snapshot.executionStatus} - ${snapshot.description}
makes engine-related logs explicit and consistent with other handlers.
453-454
: Userun:notify
prefix for worker notifications.The inline comment and
run:notify platform -> supervisor: ${snapshot.executionStatus}
message correctly adhere to the established notification prefix, avoiding the[engine]
tag here.
483-484
: Maintainrun:notify
prefix on error notifications.The error-path log
run:notify ERROR platform -> supervisor: ${snapshot.executionStatus}
also properly omits[engine]
, preserving consistency across notification events.packages/cli-v3/src/entryPoints/dev-run-worker.ts (3)
38-38
: LGTM: Import replacement aligns with runtime manager refactoringThe replacement of
ManagedRuntimeManager
withSharedRuntimeManager
in the imports is consistent with the PR's objective to refactor waitpoint and suspendable state handling.
458-460
: LGTM: Simplified waitpoint resolutionThe implementation now uses a single
RESOLVE_WAITPOINT
handler that callsresolveWaitpoints([waitpoint])
on the shared runtime manager. This simplifies the IPC messaging architecture and aligns with the PR goal of fixing waitpoint resolution issues, particularly for waitpoints that arrive early.
531-532
: LGTM: Runtime manager implementation replacementReplacing
managedWorkerRuntime
withsharedWorkerRuntime
and using theSharedRuntimeManager
class is aligned with the architectural changes in this PR. TheSharedRuntimeManager
centralizes runtime management with improved logic for handling waitpoints and suspensions.packages/cli-v3/src/entryPoints/managed-run-worker.ts (3)
37-37
: LGTM: Consistent runtime manager import replacementSimilar to the changes in dev-run-worker.ts, this import change from
ManagedRuntimeManager
toSharedRuntimeManager
keeps the codebase consistent with the refactoring approach.
451-453
: LGTM: Simplified waitpoint resolution handlerThe implementation now uses a single
RESOLVE_WAITPOINT
handler that callsresolveWaitpoints([waitpoint])
on the shared runtime manager. This simplification is consistent with the changes in dev-run-worker.ts and improves the handling of waitpoints.
559-561
: LGTM: Runtime manager instantiation updatedCreating
sharedWorkerRuntime
withSharedRuntimeManager
and setting it as the global runtime manager is consistent with the architectural changes. Note that theshowLogs
parameter is hardcoded totrue
here, unlike in dev-run-worker.ts where it uses theshowInternalLogs
variable.packages/cli-v3/src/entryPoints/managed/logger.ts (6)
2-3
: LGTM: Enhanced debug logging importsAdding imports for
DebugLogPropertiesInput
andflattenAttributes
supports the improved debug logging capabilities, providing better type definitions and utilities for processing log properties.Also applies to: 7-7
9-15
: LGTM: Improved debug log optionsThe
SendDebugLogOptions
type now usesDebugLogPropertiesInput
for properties and adds a
17-19
: LGTM: Abstraction with RunLogger interfaceCreating a
RunLogger
interface is a good design decision that allows for different logger implementations while maintaining a consistent API.
26-26
: LGTM: Class renamed for clarityRenaming the class from
RunLogger
toManagedRunLogger
better reflects its specific implementation role and allows for other implementations of theRunLogger
interface.
35-64
: LGTM: Enhanced debug log handlingThe updated
sendDebugLog
method now:
- Conditionally prints to the console based on the
- Uses
flattenAttributes
to ensure properties are in the correct format for the API- Merges additional context (runId, runnerId, workerName) into the properties
These improvements provide more flexibility and ensure consistent formatting of log data.
67-79
: LGTM: New console logger implementationAdding the
ConsoleRunLogger
class provides a simpler alternative for logging that only writes to the console. This is useful for testing or environments where the full managed logger isn't needed.packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (5)
129-136
: LGTM: Improved type definitions for debug log propertiesRenaming
AttributeValue
toDebugLogPropertiesValue
makes the purpose clearer, and updating array types to usenullish()
instead ofnullable()
improves type safety by handling bothnull
andundefined
values.
138-140
: LGTM: Renamed schema for clarityRenaming
Attributes
toDebugLogProperties
makes the schema's purpose more obvious and aligns with the other debug log-related schema changes.
141-142
: LGTM: New input schema for flexibilityAdding the
DebugLogPropertiesInput
schema withz.unknown()
values provides more flexibility for input validation, allowing arbitrary property values that can be properly validated and processed later.
144-149
: LGTM: Input validation schema for debug logsThe new
WorkerApiDebugLogBodyInput
schema provides clear input validation for debug log bodies, using the more permissiveDebugLogPropertiesInput
for properties while maintaining strict typing for required fields.
151-155
: LGTM: Updated schema for consistencyUpdating
WorkerApiDebugLogBody
to useDebugLogProperties
maintains consistency with the renamed schemas and ensures proper validation of debug log data.packages/cli-v3/src/entryPoints/managed/controller.ts (1)
470-485
: 🛠️ Refactor suggestion
⚠️ Potential issuePotential race:
currentExecution
may have changed before the enqueue
this.currentExecution
is checked for existence at L470-479, but there is aawait
before the call toenqueueSnapshotChangeAndWait
.
Another run could start in the meantime, replacingcurrentExecution
with an execution for a different run, leading to a snapshot from run A being applied to run B.Consider capturing the execution reference early and validating it before use:
-const [error] = await tryCatch( - this.currentExecution.enqueueSnapshotChangeAndWait(runExecutionData) -); +const execution = this.currentExecution; +if (!execution || execution.runFriendlyId !== controller.runFriendlyId) { + return; // execution replaced – ignore stale notification +} + +const [error] = await tryCatch( + execution.enqueueSnapshotChangeAndWait(runExecutionData) +);Likely an incorrect or invalid review comment.
packages/cli-v3/src/executions/taskRunProcess.ts (2)
36-45
: Confirm new catalog keys existThe new message types
"SEND_DEBUG_LOG"
and"SET_SUSPENDABLE"
must be present inExecutorToWorkerMessageCatalog
.
A missing key will cause a run-time validation error inZodIpcConnection
.Please verify with a quick grep or unit test.
277-278
: Message rename check
waitpointCompleted()
now sends"RESOLVE_WAITPOINT"
.
Make sure all workers listening for"WAITPOINT_COMPLETED"
were updated accordingly, otherwise waitpoint resolution will silently fail.packages/core/src/v3/schemas/messages.ts (1)
21-21
:✅ Verification successful
Possible broken import path for
DebugLogPropertiesInput
DebugLogPropertiesInput
is exported fromrunEngineWorker/supervisor/schemas.ts
(per relevant snippet), but this file imports it from../runEngineWorker/index.js
.
Unlessindex.js
re-exports the symbol, TypeScript will fail to resolve the type and the build will break.-import { DebugLogPropertiesInput } from "../runEngineWorker/index.js"; +// Re-export exists? If not, point directly at the schema file +import { DebugLogPropertiesInput } from "../runEngineWorker/supervisor/schemas.js";Please verify the re-export or adjust the path accordingly.
I’d like to confirm whether
DebugLogPropertiesInput
is re-exported inrunEngineWorker/index.js
(or.ts
). Let’s list the files in that folder and search for exports:
🏁 Script executed:
#!/bin/bash set -e echo "1) Files in runEngineWorker directory:" find packages/core/src/v3/runEngineWorker -maxdepth 1 -type f echo echo "2) All occurrences of DebugLogPropertiesInput in packages/core:" rg -n "DebugLogPropertiesInput" packages/core echo echo "3) Contents of index.js (if present):" sed -n '1,200p' packages/core/src/v3/runEngineWorker/index.js 2>/dev/null || echo "→ index.js not found" echo echo "4) Contents of index.ts:" sed -n '1,200p' packages/core/src/v3/runEngineWorker/index.ts 2>/dev/null || echo "→ index.ts not found"Length of output: 2132
No broken import—
DebugLogPropertiesInput
is correctly re-exported
Thepackages/core/src/v3/runEngineWorker/index.ts
file includesexport * from "./supervisor/schemas.js";and that module defines and exports
DebugLogPropertiesInput
. The import in
packages/core/src/v3/schemas/messages.ts
import { DebugLogPropertiesInput } from "../runEngineWorker/index.js";will resolve correctly at build time. No changes needed.
packages/cli-v3/src/entryPoints/managed/snapshot.ts (1)
76-80
:crypto.randomUUID()
requires Node ≥ 16.17If the managed worker is executed in older runtimes (e.g. AWS Lambda Node 14), this will throw at runtime.
Either polyfill (import { v4 as uuid } from "uuid"
) or assert minimum Node version in package engines.packages/cli-v3/src/entryPoints/managed/execution.ts (1)
672-687
: Verify mapping completeness for new attempt statuses
convertAttemptStatusToSnapshotStatus
deliberately omits"RUN_PENDING_EXECUTING"
and any future statuses.
If the server starts returning a new status, compilation will fail thanks toassertExhaustive
, but deployment pipelines running with"skipLibCheck": true
or older build artefacts may silently mis-map.
Make sure the exhaustive check is covered by unit tests so CI catches any upstream additions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (1)
291-313
:⚠️ Potential issueAdd defensive JSON parsing for error output
The
JSON.parse
call has no error handling, which could crash the worker ifwaitpoint.output
contains malformed JSON.- error: waitpoint.output - ? JSON.parse(waitpoint.output) - : { + error: waitpoint.output + ? (() => { + try { + return JSON.parse(waitpoint.output); + } catch (e) { + return { + type: "STRING_ERROR", + message: `Error parsing output: ${String(waitpoint.output).slice(0, 100)}`, + }; + } + })() + : { type: "STRING_ERROR", message: "Missing error output", },
🧹 Nitpick comments (1)
packages/cli-v3/src/entryPoints/managed/execution.ts (1)
900-904
: Add error handling for suspendable state setterThe current implementation swallows errors when setting suspendable state. While logging is good, consider whether errors should be propagated or if retries are appropriate.
private set suspendable(suspendable: boolean) { this.snapshotManager?.setSuspendable(suspendable).catch((error) => { this.sendDebugLog("failed to set suspendable", { error: error.message }); + // Consider whether this failure is critical enough to propagate or retry }); }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
internal-packages/testcontainers/src/index.ts
(3 hunks)packages/cli-v3/src/entryPoints/managed/execution.ts
(36 hunks)packages/cli-v3/src/entryPoints/managed/snapshot.ts
(1 hunks)packages/cli-v3/src/executions/taskRunProcess.ts
(5 hunks)packages/core/src/v3/runtime/sharedRuntimeManager.ts
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- internal-packages/testcontainers/src/index.ts
- packages/cli-v3/src/entryPoints/managed/snapshot.ts
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (18)
packages/cli-v3/src/executions/taskRunProcess.ts (5)
36-44
: IPC message type definitions for debug logs and suspendable stateGood addition of typed IPC message interfaces for the new debug logging and suspendable state functionality. This provides strong type safety for message handlers.
81-82
: Event emitters for runtime state managementThese new event emitters replace the previous wait-related events with a more streamlined approach focusing on debug logging and suspendable state. This simplifies the IPC communication model.
100-105
: Good addition of event handler cleanup methodAdding this method helps prevent memory leaks by providing a centralized way to detach all event handlers. Using this in the shutdown flow will ensure proper cleanup.
200-205
: New IPC handlers for debug logs and suspendable stateWell-structured handlers that simply post the messages to the appropriate event emitters. This maintains the separation between IPC communication and event handling.
284-284
: Simplified waitpoint resolution messagingReplacing
WAITPOINT_COMPLETED
withRESOLVE_WAITPOINT
aligns with the new centralized waitpoint handling architecture, making the communication protocol more consistent.packages/core/src/v3/runtime/sharedRuntimeManager.ts (6)
19-23
: Good type definitions for resolvers and resolver IDsThe use of branded types for
ResolverId
is an excellent practice that provides type safety against accidental misuse of string IDs.
25-42
: Well-structured class initialization with status loggingThe class structure with clear private fields and appropriate logging interval helps with debugging long-running executions. The 5-minute interval for status logging is suitable for production environments while avoiding log spam.
48-72
: Consistent implementation of wait methodsAll three wait methods (
waitForTask
,waitForBatch
,waitForWaitpoint
) follow the same pattern:
- Use prevention of multiple waits
- Create promise with resolver
- Register resolver
- Resolve any pending waitpoints
- Call lifecycle hooks
- Wait with suspendable wrapper
- Process results
- Call resume hooks
This consistency makes the code more maintainable and easier to understand.
Also applies to: 74-114, 116-163
261-272
: Safe iteration over maps during mutationGood implementation of
resolvePendingWaitpoints()
that avoids mutation-during-iteration issues by cloning the keys first. This prevents potential skipped entries when mutating a collection during iteration.
278-289
: Robust suspendable state wrapperGood implementation of the suspendable wrapper that:
- Sets the suspendable state to true at the start
- Uses tryCatch for error handling
- Ensures suspendable is set to false even if an error occurs
- Properly logs errors
- Propagates errors to the caller
This ensures the runtime always exits the suspendable state, even in error scenarios.
315-326
: Safely format waitpoints for debug logsGood implementation that:
- Destructures to avoid modifying the original waitpoint
- Truncates output to avoid excessive log sizes
- Converts dates to ISO strings for readability
- Preserves original date objects for potential further processing
This ensures logs are readable and manageable without losing important information.
packages/cli-v3/src/entryPoints/managed/execution.ts (7)
57-57
: Introduction of SnapshotManagerGood architectural change to delegate snapshot and suspendable state management to a dedicated class. This improves separation of concerns.
149-156
: Event handlers for runtime communicationEffective implementation of debug log and suspendable state event handlers that integrate with the snapshot manager.
178-190
: Improved snapshot change handlingRenaming
handleSnapshotChange
toenqueueSnapshotChangeAndWait
better reflects the asynchronous nature of the operation. Good use of the snapshot manager for actual processing.
389-397
: Well-structured SnapshotManager initializationThe initialization provides all necessary parameters including callbacks for state changes, making dependencies clear and explicit.
673-693
: Helper methods for snapshot state managementThe introduction of
updateSnapshotAfterCompletion
andconvertAttemptStatusToSnapshotStatus
centralizes conversion logic and ensures consistent snapshot state updates.
957-957
: Proper cleanup of event handlersGood use of the new
unsafeDetachEvtHandlers
method to ensure event handlers are cleaned up during shutdown. This prevents memory leaks when processes are reused.
960-1031
: Comprehensive suspendable state handlingThe
handleSuspendable
method properly:
- Verifies snapshot consistency before and after cleanup
- Handles errors during process cleanup
- Gracefully handles API failures
- Provides detailed logging
This ensures robust suspension handling with appropriate error cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
.github/workflows/publish.yml (1)
72-78
:⚠️ Potential issueMissing reusable workflow file
publish-worker-v4.yml
publish-worker-v4
references./.github/workflows/publish-worker-v4.yml
, but the file does not exist (confirmed by action-lint).
This will cause the entire workflow run to fail.If the new job is not ready, remove the block; otherwise, add the referenced workflow file before merging.
🧰 Tools
🪛 actionlint (1.7.4)
74-74: could not read reusable workflow file for "./.github/workflows/publish-worker-v4.yml": open /home/jailuser/git/.github/workflows/publish-worker-v4.yml: no such file or directory
(workflow-call)
♻️ Duplicate comments (2)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (2)
261-265
:⚠️ Potential issueMap is mutated during iteration – entries may be skipped
resolvePendingWaitpoints()
iterates overwaitpointsByResolverId.entries()
whileresolveWaitpoint()
can delete the very key being iterated (line 257-259).
Although V8 currently tolerates this, it is undefined behaviour per ECMA-262 and other JS engines may skip elements.- for (const [resolverId, waitpoint] of this.waitpointsByResolverId.entries()) { - this.resolveWaitpoint(waitpoint, resolverId); - } + // Clone keys first to avoid mutation-while-iterating + for (const resolverId of Array.from(this.waitpointsByResolverId.keys())) { + const waitpoint = this.waitpointsByResolverId.get(resolverId)!; + this.resolveWaitpoint(waitpoint, resolverId); + }This was pointed out in an earlier review but is still present.
284-305
:⚠️ Potential issue
JSON.parse
can crash the worker – add defensive parsingA malformed
waitpoint.output
will throw and bring down the entire run.
Wrap the parse in atry/catch
(or usesafeJsonParse
) so a single bad payload does not terminate the process.- error: waitpoint.output - ? JSON.parse(waitpoint.output) - : { - type: "STRING_ERROR", - message: "Missing error output", - }, + error: (() => { + if (!waitpoint.output) { + return { + type: "STRING_ERROR", + message: "Missing error output", + }; + } + try { + return JSON.parse(waitpoint.output); + } catch { + return { + type: "STRING_ERROR", + message: "Unparseable error output", + }; + } + })(),
🧹 Nitpick comments (7)
scripts/publish-prerelease.sh (1)
58-59
: Redirect error output to stderr for better logging
Currently, both the command output and the error message are printed to stdout. To clearly distinguish errors in CI logs or user terminals, send these to stderr.Proposed diff:
- echo "$output" - echo "Error running changeset version command, detailed output above" + echo "$output" >&2 + echo "Error running changeset version command, detailed output above" >&2packages/cli-v3/src/commands/deploy.ts (2)
217-237
: Avoid instantiating spinners when running in CI
$spinner
is created unconditionally, yet in CI mode (isCI === true
) it is never used – all progress is reported vialog.step
andconsole.log
.
This results in unnecessary object creation and (more importantly) an extra TTY escape sequence on some CI providers the moment the spinner is instantiated, which can pollute logs.- const $spinner = spinner(); + // Only create a spinner for interactive terminals + const $spinner = isCI ? undefined : spinner();…and wrap subsequent
$spinner.*
calls with a guard ($spinner?.start(...)
,$spinner?.message(...)
, etc.).[nitpick, performance]
Also applies to: 333-343
330-343
:cliLink()
is called when links are unsupportedYou already construct
rawDeploymentLink/rawTestLink
for the fallback case, yetcliLink()
is still invoked even whenterminalLink
support is absent.
WhilecliLink
has a built-in fallback, skipping the call altogether avoids the extra formatting step and prevents ANSI escape codes from leaking into plain-text CI logs on some shells.-const deploymentLink = cliLink("View deployment", rawDeploymentLink); -const testLink = cliLink("Test tasks", rawTestLink); +const deploymentLink = isLinksSupported + ? cliLink("View deployment", rawDeploymentLink) + : rawDeploymentLink; +const testLink = isLinksSupported + ? cliLink("Test tasks", rawTestLink) + : rawTestLink;packages/cli-v3/src/entryPoints/managed/controller.ts (2)
405-413
: Prefercrypto.randomUUID()
over manual random string IDs
Math.random().toString(36)
has lower entropy and can collide in high-throughput scenarios.
Node ≥ 14 supportscrypto.randomUUID()
which yields RFC 4122 v4 IDs without additional deps.-const notificationId = Math.random().toString(36).substring(2, 15); +import { randomUUID } from "node:crypto"; +const notificationId = randomUUID();
527-573
:currentEnv
/newEnv
capture identical data
processEnvOverrides()
presumably mutatesthis.env
; however both snapshots are built fromthis.env
after the call, so they will always be identical.
CapturecurrentEnv
before invoking the mutation to give meaningful diff logging.- if (this.currentExecution) { - const currentEnv = { ...snip... }; - await this.currentExecution.processEnvOverrides("socket disconnected"); - const newEnv = { ...snip... }; + if (this.currentExecution) { + const currentEnv = { + workerInstanceName: this.env.TRIGGER_WORKER_INSTANCE_NAME, + runnerId: this.env.TRIGGER_RUNNER_ID, + supervisorApiUrl: this.env.TRIGGER_SUPERVISOR_API_URL, + }; + + await this.currentExecution.processEnvOverrides("socket disconnected"); + + const newEnv = { + workerInstanceName: this.env.TRIGGER_WORKER_INSTANCE_NAME, + runnerId: this.env.TRIGGER_RUNNER_ID, + supervisorApiUrl: this.env.TRIGGER_SUPERVISOR_API_URL, + };packages/cli-v3/src/entryPoints/managed/execution.ts (2)
390-399
: Verify the appropriateness of the initial statusThe comment "We're just guessing here, but 'PENDING_EXECUTING' is probably fine" indicates uncertainty about the correct initial status for the SnapshotManager. This could potentially lead to issues if the actual state should be different.
Consider having a more systematic way to determine the initial status rather than guessing. For example, you could:
- Pass the initial status from the caller where it's known with certainty
- Make the status parameter optional and have the SnapshotManager derive it from other parameters
- Add validation logic in the SnapshotManager constructor to ensure the initial status is compatible with the given context
211-212
: Consider using a debug flag instead of commented codeThe comment "DO NOT REMOVE (very noisy, but helpful for debugging)" indicates that this line might be re-enabled for debugging.
Consider controlling this with a debug flag or log level setting rather than commenting and uncommenting code. This would make it easier to enable debugging when needed without code changes:
- // DO NOT REMOVE (very noisy, but helpful for debugging) - // this.sendDebugLog(`processing snapshot change: ${snapshot.executionStatus}`, snapshotMetadata); + if (this.debugVerbose) { + this.sendDebugLog(`processing snapshot change: ${snapshot.executionStatus}`, snapshotMetadata); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.changeset/sweet-dolphins-invent.md
(1 hunks).github/workflows/publish-worker-re2.yml
(1 hunks).github/workflows/publish-worker.yml
(1 hunks).github/workflows/publish.yml
(1 hunks)packages/cli-v3/src/commands/deploy.ts
(7 hunks)packages/cli-v3/src/entryPoints/managed/controller.ts
(6 hunks)packages/cli-v3/src/entryPoints/managed/execution.ts
(35 hunks)packages/cli-v3/src/entryPoints/managed/poller.ts
(2 hunks)packages/core/src/v3/runtime/sharedRuntimeManager.ts
(1 hunks)packages/core/src/v3/utils/interval.ts
(1 hunks)scripts/publish-prerelease.sh
(1 hunks)
✅ Files skipped from review due to trivial changes (3)
- .changeset/sweet-dolphins-invent.md
- .github/workflows/publish-worker.yml
- .github/workflows/publish-worker-re2.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- packages/cli-v3/src/entryPoints/managed/poller.ts
🧰 Additional context used
🧬 Code Graph Analysis (1)
packages/cli-v3/src/commands/deploy.ts (1)
packages/cli-v3/src/utilities/cliOutput.ts (2)
cliLink
(140-145)isLinksSupported
(7-7)
🪛 actionlint (1.7.4)
.github/workflows/publish.yml
74-74: could not read reusable workflow file for "./.github/workflows/publish-worker-v4.yml": open /home/jailuser/git/.github/workflows/publish-worker-v4.yml: no such file or directory
(workflow-call)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (6)
packages/core/src/v3/utils/interval.ts (1)
43-56
: Improved state reporting on interval service stopThis change enhances the
stop()
method to return the execution state at the time of stopping, providing valuable information for callers to handle cleanup correctly. This is a good improvement that helps address the race conditions mentioned in the PR objectives.The implementation correctly captures the execution state before any state changes are made, ensuring accurate reporting regardless of the enabled state.
packages/cli-v3/src/entryPoints/managed/execution.ts (5)
637-638
: Good approach for maintaining snapshot consistencyThe code explicitly updates the snapshot ID after completion to ensure any subsequent API calls use the correct snapshot. This helps prevent race conditions where another snapshot update arrives during completion processing.
912-916
: Error handling approach for setSuspendable is reasonableThe error from
setSuspendable
is caught and logged but not propagated. Based on the previous review discussion, this is an acceptable approach since failures here are rare and only impact checkpointing.
972-1043
: Well-designed suspension handling with robust race condition protectionThe
handleSuspendable
method is very thorough in its approach to handling suspension requests:
- It verifies snapshot consistency before and after cleanup
- Handles API failure gracefully
- Contains appropriate logging throughout the process
- Aborts execution when consistency checks fail
This comprehensive approach helps prevent race conditions during the suspension process.
966-970
: Good service cleanup with proper resource managementThe
stopServices
method now properly cleans up all resources and detaches event handlers. This addresses the memory leak concern raised in previous reviews.
149-156
:✅ Verification successful
Properly detach event listeners to prevent memory leaks
New event listeners are attached to the TaskRunProcess in lines 149-156, but I notice they must be properly detached when the process is stopped to prevent memory leaks with warm starts or retries.
Let's verify if the detachment is properly implemented in
unsafeDetachEvtHandlers()
:
🏁 Script executed:
#!/bin/bash # Check what the unsafeDetachEvtHandlers method does rg -A 10 "unsafeDetachEvtHandlers" packages/cli-v3/src/executions/Length of output: 866
Event listeners are already detached
The
unsafeDetachEvtHandlers()
method inpackages/cli-v3/src/executions/taskRunProcess.ts
callsdetach()
ononSendDebugLog
andonSetSuspendable
(along with the other handlers), so these listeners are properly cleaned up.
Quite a few things in here:
We now require two things before suspending a run:
EXECUTING_WITH_WAITPOINTS
orQUEUED_EXECUTING
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Tests
Chores