Fix controller waitpoint resolution, suspendable state, and snapshot race conditions #2006

Draft
wants to merge 49 commits into
base: main

nicktrn
Collaborator

@nicktrn nicktrn commented May 1, 2025

Quite a few things in here:

  • Fix testcontainer unit tests 🥳
  • Remove a lot of dead code
  • Add snapshot queue tests
  • Add debug logs to runtime manager
  • Fix entitlement validation when no client exists
  • Correctly resolve waitpoints that come in early
  • Ensure correct state before requesting suspension
  • Fix race conditions in snapshot processing

We now require two things before suspending a run:

  1. The execution status needs to be either EXECUTING_WITH_WAITPOINTS or QUEUED_EXECUTING
  2. The runtime manager needs to signal it's "suspendable"

These events can happen out of order.
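
The two-condition rule above can be sketched as a small gate that re-checks both conditions whenever either one changes, so arrival order does not matter. This is only an illustration: `SuspensionGate`, `setStatus`, `setSuspendable`, and `requestSuspension` are hypothetical names, and only the two status strings come from the description.

```typescript
// Hypothetical sketch; only the two suspendable status strings come from the PR description.
type ExecutionStatus = "EXECUTING" | "EXECUTING_WITH_WAITPOINTS" | "QUEUED_EXECUTING";

const SUSPENDABLE_STATUSES: ReadonlySet<ExecutionStatus> = new Set<ExecutionStatus>([
  "EXECUTING_WITH_WAITPOINTS",
  "QUEUED_EXECUTING",
]);

class SuspensionGate {
  private status: ExecutionStatus = "EXECUTING";
  private suspendable = false;

  constructor(private readonly requestSuspension: () => void) {}

  // Called when a new execution status is observed.
  setStatus(status: ExecutionStatus): void {
    this.status = status;
    this.maybeSuspend();
  }

  // Called when the runtime manager flips its "suspendable" signal.
  setSuspendable(suspendable: boolean): void {
    this.suspendable = suspendable;
    this.maybeSuspend();
  }

  // Fires only when both conditions hold, whichever one arrived last.
  // A real implementation would also need to avoid requesting suspension twice.
  private maybeSuspend(): void {
    if (this.suspendable && SUSPENDABLE_STATUSES.has(this.status)) {
      this.requestSuspension();
    }
  }
}
```

Because both setters funnel into the same check, a "suspendable" signal that arrives before the status change is simply remembered until the status catches up.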

Summary by CodeRabbit

  • New Features

    • Introduced a Snapshot Manager to streamline and centralize snapshot and suspendable state handling.
    • Added enhanced debug logging with structured context and improved log message consistency.
  • Bug Fixes

    • Improved reliability of waitpoint handling, resolving early-arriving waitpoints and preventing race conditions during snapshot processing.
    • Ensured correct system state before initiating suspension requests.
  • Refactor

    • Replaced the previous runtime manager with a new shared runtime manager for improved waitpoint and task execution management.
    • Simplified and unified event and message handling for task execution and waitpoints.
    • Centralized snapshot and suspendable state management, simplifying execution flow and improving separation of concerns.
  • Tests

    • Added comprehensive tests for the new Snapshot Manager, covering state transitions, concurrency, and error scenarios.
  • Chores

    • Updated build and test configurations to exclude test files from production builds and add support for new testing tools.

changeset-bot bot commented May 1, 2025

🦋 Changeset detected

Latest commit: ed1a44c

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

Contributor

coderabbitai bot commented May 1, 2025

Walkthrough

This update refactors and enhances the workflow engine's handling of snapshot and waitpoint state, focusing on reliability and concurrency control. It introduces a new SnapshotManager class to centralize and serialize snapshot state transitions and suspendable state changes, replacing decentralized logic in the execution flow. The legacy ManagedRuntimeManager is removed and replaced with SharedRuntimeManager, which improves the resolution of waitpoints, especially those arriving early. The event and IPC message system is streamlined, removing obsolete wait-related messages and notifications, and adding support for debug logging and suspendable state. Logging, configuration, and test infrastructure are updated to align with these architectural changes.

Changes

| File(s) | Change Summary |
| --- | --- |
| .changeset/plenty-dolphins-act.md | Documents the patch update addressing early waitpoint handling, suspension state checks, and snapshot race condition fixes. |
| apps/webapp/app/runEngine/validators/triggerTaskValidator.ts | Refines entitlement validation logic to only error on explicit access denial. |
| apps/webapp/app/v3/runEngineHandlers.server.ts | Adds "[engine]" prefix to engine-related debug logs and clarifies log prefix usage in comments. |
| packages/cli-v3/package.json, packages/cli-v3/tsconfig.src.json, packages/cli-v3/tsconfig.test.json, packages/cli-v3/tsconfig.json | Updates test exclusion patterns, adds test script, and introduces a test-specific TypeScript config. |
| packages/cli-v3/src/entryPoints/dev-run-worker.ts, packages/cli-v3/src/entryPoints/managed-run-worker.ts | Switches from ManagedRuntimeManager to SharedRuntimeManager, updates IPC handlers, and removes waitpoint association logic. |
| packages/cli-v3/src/entryPoints/managed/controller.ts | Switches logger to ManagedRunLogger, tracks notification IDs and controller state in logs, and queues snapshot changes. |
| packages/cli-v3/src/entryPoints/managed/env.ts | Removes TRIGGER_PRE_SUSPEND_WAIT_MS from environment schema and class. |
| packages/cli-v3/src/entryPoints/managed/execution.ts | Refactors to use SnapshotManager for snapshot and suspendable state, updates event handling, and centralizes snapshot logic. |
| packages/cli-v3/src/entryPoints/managed/logger.ts | Introduces RunLogger interface, adds ConsoleRunLogger, enhances debug log property handling, and refactors logging logic. |
| packages/cli-v3/src/entryPoints/managed/poller.ts | Adds a clarifying comment about snapshot ID usage in the poller. |
| packages/cli-v3/src/entryPoints/managed/snapshot.ts | Adds new SnapshotManager class for serialized snapshot and suspendable state management. |
| packages/cli-v3/src/entryPoints/managed/snapshot.test.ts | Adds comprehensive tests for SnapshotManager covering ordering, concurrency, error handling, and edge cases. |
| packages/cli-v3/src/executions/taskRunProcess.ts | Removes wait-related events and notifications, adds debug log and suspendable events, and updates waitpoint completion logic. |
| packages/core/src/v3/runEngineWorker/supervisor/schemas.ts | Renames and exports debug log property schemas, adds input validation types for debug log messages. |
| packages/core/src/v3/runtime/managedRuntimeManager.ts | Removes the ManagedRuntimeManager class and its methods. |
| packages/core/src/v3/runtime/sharedRuntimeManager.ts | Introduces SharedRuntimeManager class for managing waitpoints and task execution asynchronously. |
| packages/core/src/v3/schemas/messages.ts | Removes wait-related and notification messages, adds debug log and suspendable state messages to message catalogs. |
| packages/core/src/v3/schemas/schemas.ts | Removes the RuntimeWait schema and type alias. |
| packages/core/src/v3/workers/index.ts | Changes export from ManagedRuntimeManager to SharedRuntimeManager. |

Sequence Diagram(s)

sequenceDiagram
    participant Controller
    participant RunExecution
    participant SnapshotManager
    participant TaskRunProcess

    Controller->>RunExecution: enqueueSnapshotChangeAndWait(runData)
    RunExecution->>SnapshotManager: handleSnapshotChange(runData)
    SnapshotManager->>SnapshotManager: Queue runData, process in order
    SnapshotManager-->>RunExecution: onSnapshotChange(runData)
    RunExecution->>TaskRunProcess: (as needed, e.g., suspend)
    SnapshotManager-->>RunExecution: onSuspendable(suspendableSnapshot)
    RunExecution->>TaskRunProcess: cleanup and suspend if needed
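
The "queue runData, process in order" step in the diagram above can be approximated with a promise chain, where each change waits for the previous handler to finish. This is a rough sketch of the serialization idea only, not the PR's SnapshotManager; `SerializedChangeQueue`, `SnapshotChange`, and `enqueueAndWait` are hypothetical names.

```typescript
// Hypothetical sketch of serialized processing; not the PR's SnapshotManager.
type SnapshotChange = { snapshotId: string; status: string };

class SerializedChangeQueue {
  // Tail of the promise chain; every new change is appended behind it.
  private tail: Promise<void> = Promise.resolve();

  constructor(private readonly handler: (change: SnapshotChange) => Promise<void>) {}

  // Enqueue a change and resolve (or reject) once its own handler has finished.
  enqueueAndWait(change: SnapshotChange): Promise<void> {
    const run = this.tail.then(() => this.handler(change));
    // Swallow the error on the chain itself so one failure does not block later changes.
    this.tail = run.catch(() => undefined);
    return run;
  }
}
```

The caller still sees a rejection from its own `enqueueAndWait` call, but later changes keep flowing because the chain's tail never rejects.
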
sequenceDiagram
    participant Worker
    participant SharedRuntimeManager
    participant ExternalSystem

    Worker->>SharedRuntimeManager: waitForTask / waitForBatch / waitForWaitpoint
    ExternalSystem-->>SharedRuntimeManager: resolveWaitpoints([waitpoint])
    SharedRuntimeManager->>Worker: Resolve promise(s) for completed waitpoints
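
One way to handle waitpoints that are resolved before anything waits on them is to buffer the early completions and hand them out on the first matching wait. The sketch below assumes that approach; it is not SharedRuntimeManager itself. `WaitpointRegistry`, `waitFor`, and `CompletedWaitpoint` are hypothetical, while the `resolveWaitpoints` name mirrors the call shown in the diagram.

```typescript
// Hypothetical sketch of buffering early completions; not SharedRuntimeManager itself.
type CompletedWaitpoint = { id: string; output?: string };

class WaitpointRegistry {
  // Resolvers for waits that are currently pending.
  private readonly resolvers = new Map<string, (wp: CompletedWaitpoint) => void>();
  // Completions that arrived before anyone waited on them.
  private readonly early = new Map<string, CompletedWaitpoint>();

  waitFor(id: string): Promise<CompletedWaitpoint> {
    const buffered = this.early.get(id);
    if (buffered) {
      // The completion beat the wait: hand it over immediately.
      this.early.delete(id);
      return Promise.resolve(buffered);
    }
    return new Promise<CompletedWaitpoint>((resolve) => {
      this.resolvers.set(id, resolve);
    });
  }

  resolveWaitpoints(waitpoints: CompletedWaitpoint[]): void {
    for (const wp of waitpoints) {
      const resolve = this.resolvers.get(wp.id);
      if (resolve) {
        this.resolvers.delete(wp.id);
        resolve(wp);
      } else {
        // Nobody is waiting yet: keep the completion for a later waitFor call.
        this.early.set(wp.id, wp);
      }
    }
  }
}
```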

Possibly related PRs

Suggested reviewers

  • matt-aitken

Poem

In the warren where snapshots leap and bound,
A rabbit refactored the state all around.
Waitpoints no longer lost in the race,
With managers shared, each finds its place.
Logs are more clever, the tests are robust—
In this garden of code, reliability’s a must!
🐇✨

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🔭 Outside diff range comments (1)
packages/cli-v3/src/entryPoints/managed/execution.ts (1)

767-783: ⚠️ Potential issue

Keep SnapshotManager in-sync after continueRunExecution

continueRunExecution usually returns an updated snapshot (often with a new ID and status "EXECUTING").
Because the local SnapshotManager is not updated here, subsequent calls (heartbeats, completions, etc.) may reference a stale snapshotId, causing 404 / 409 errors server-side until the poller/web-socket pushes the next change.

 const continuationResult = await this.httpClient.continueRunExecution(
   this.runFriendlyId,
   this.snapshotManager.snapshotId
 );
 
 if (!continuationResult.success) {
   throw new Error(continuationResult.error);
 }
 
-// Track restore count
+// 🔧  Immediately reflect the new snapshot locally
+this.updateSnapshot(
+  continuationResult.data.snapshot.friendlyId,
+  continuationResult.data.snapshot.executionStatus
+);
+
+// Track restore count
 this.restoreCount++;
🧹 Nitpick comments (13)
.changeset/plenty-dolphins-act.md (1)

6-8: Summarize core fixes clearly.

The bullet points accurately capture the patch’s purpose—early waitpoint resolution, pre-suspend state checks, and snapshot race-condition fixes.
Consider adding a trailing period on each line for consistency with the header style (optional).

apps/webapp/app/v3/runEngineHandlers.server.ts (1)

510-510: Consistent [engine] prefix for checkpoint discards.

Marking the discard event as [engine] Checkpoint discarded: ${checkpoint.discardReason} aligns it with other engine debug logs.
For uniformity with the execution snapshot message, you might consider using a dash (-) instead of a colon (:).

packages/cli-v3/src/entryPoints/managed/controller.ts (3)

404-413: Use a collision-safe UUID instead of Math.random for notification IDs

Math.random() is not guaranteed to be unique and is not cryptographically secure.
Because the notification ID is subsequently logged and could be used for correlating log lines, a collision would make debugging harder.

-const notificationId = Math.random().toString(36).substring(2, 15);
+import { randomUUID } from "crypto"; // move import to top of file
+const notificationId = randomUUID();

If you want to avoid an extra import, at least use Date.now() together with Math.random() to reduce the chance of collisions.


415-419: controller variable is misleadingly named

The object holds captured IDs, not a controller instance. A more explicit name (e.g. capturedIds) would improve readability and avoid confusion with the actual controller class.


11-12: Redundant RunLogger import / type mismatch

this.logger is typed as RunLogger, but an instance of ManagedRunLogger is assigned.
Because ManagedRunLogger already implements RunLogger, you can:

  1. Remove the unused RunLogger import.
  2. Annotate the property more explicitly if you need concrete methods of ManagedRunLogger.
-import { ManagedRunLogger, RunLogger, SendDebugLogOptions } from "./logger.js";
+import { ManagedRunLogger, SendDebugLogOptions } from "./logger.js";
...
-private readonly logger: RunLogger;
+private readonly logger: ManagedRunLogger;

Also applies to: 29-31, 50-54

packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1)

208-272: Flaky timing–based concurrency assertion

The test relies on wall-clock Date.now() comparisons and setTimeout delays to prove handlers never overlap.
On a loaded CI runner or a very fast machine the current.start >= previous.end check may still intermittently fail because of timer coalescing or clock granularity.

Consider replacing this with an atomic counter or a Mutex inside the handler that asserts mutual exclusion synchronously, e.g.:

let inHandler = false;
...
if (inHandler) {
  throw new Error("Parallel execution");
}
inHandler = true;
await setTimeout(20);
inHandler = false;

This removes dependence on wall-clock ordering.
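
A self-contained variant of the flag-based check might look like the following. It is a sketch rather than the test from the PR: `sleep` and `assertNoOverlap` are hypothetical helpers, and the `try`/`finally` is an addition so the flag is reset even if the wrapped handler throws.

```typescript
// Promise-based sleep so the sketch does not depend on a particular timers import.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Wraps a handler so overlapping invocations throw instead of going unnoticed.
function assertNoOverlap<T>(
  handler: (arg: T) => Promise<void>
): (arg: T) => Promise<void> {
  let inHandler = false;
  return async (arg: T): Promise<void> => {
    if (inHandler) {
      throw new Error("Parallel execution");
    }
    inHandler = true;
    try {
      await handler(arg);
    } finally {
      // Reset even when the handler throws, so one failure does not poison later calls.
      inHandler = false;
    }
  };
}
```

Passing the wrapped handler to the code under test turns any overlap into a rejected promise, independent of timer resolution.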

packages/cli-v3/src/executions/taskRunProcess.ts (1)

193-198: Unhandled rejections in new handlers

Both new handlers are async, but they forward the message to an Evt.
If downstream listeners throw, the promise is rejected and silently ignored.
Wrap the body in try/catch or void-cast the awaited call to avoid unhandled rejection warnings in Node 18+.

SEND_DEBUG_LOG: async (message) => {
-  this.onSendDebugLog.post(message);
+  try {
+    this.onSendDebugLog.post(message);
+  } catch (err) {
+    logger.debug("Unhandled error in onSendDebugLog listener", { err });
+  }
},
packages/core/src/v3/runtime/sharedRuntimeManager.ts (2)

309-318: Guard against non-string output when slicing for debug logs

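output.slice(0, 100) assumes output is a string. If it is a value without a slice method (for example a plain object), this will throw; if it is a Buffer or an array, slice succeeds but the logged value is not a truncated string.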

-      output: output?.slice(0, 100),
+      output:
+        typeof output === "string"
+          ? output.slice(0, 100)
+          : undefined,

38-43: Memory leak: interval never cleared

setInterval is created in the constructor but never cleared (even disable() is a no-op).
If a worker runs many executions, intervals accumulate and keep the event-loop alive.

Store the interval ID and clear it in disable().

-    setInterval(() => {
+    this.statusInterval = setInterval(() => {
       this.log("[DEBUG] SharedRuntimeManager status", this.status);
     }, 300_000);
 ...
 disable(): void {
-    // do nothing
+    if (this.statusInterval) clearInterval(this.statusInterval);
 }
packages/cli-v3/src/entryPoints/managed/snapshot.ts (2)

90-97: Lexicographical < is unreliable for CUID / ULID ordering

Snapshot IDs are compared with simple < / > operators (string lexicographical order).
For CUIDs, this is usually okay, but for ULIDs or other formats ordering can break.
Safer: compare the numeric timestamp part (first 10 chars of ULID) or keep a monotonic counter.

At minimum, clarify the invariant in comments and add tests to lock behaviour.
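
The general pitfall is easy to reproduce with IDs whose ordered part is not fixed-width. The IDs below are made up for illustration (they are not the PR's snapshot ID format), and `sequenceOf` is a hypothetical helper.

```typescript
// Hypothetical IDs, chosen only to show how string comparison can disagree with creation order.
const olderId: string = "snapshot_9";
const newerId: string = "snapshot_10";

// Lexicographic comparison walks characters left to right, so "1" < "9" decides the result.
const lexicographicSaysNewerIsLater = newerId > olderId; // false

// Comparing an explicit numeric sequence avoids the problem.
function sequenceOf(id: string): number {
  return Number(id.slice(id.lastIndexOf("_") + 1));
}
const numericSaysNewerIsLater = sequenceOf(newerId) > sequenceOf(olderId); // true
```

String comparison is only a safe proxy for creation order when the ID format guarantees it, which is the invariant worth documenting and testing.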


160-174: Queue superseding resolves promises silently – callers may misinterpret

When a pending suspendable change is superseded, you resolve() the promise even though the change never reached applyChange.
Callers awaiting that promise may assume the handler ran.

Consider rejecting with a specific AbortError, or resolve with a boolean flag indicating it was skipped, to avoid false positives.
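
A sketch of the "resolve with a flag" option, under the assumption that a single pending slot is enough to show the idea. `LatestOnlySlot`, `ChangeOutcome`, `enqueue`, and `flush` are hypothetical names and do not reflect the PR's queue.

```typescript
// Hypothetical sketch: a single-slot queue whose promises report whether the change ran.
type ChangeOutcome = "applied" | "skipped";

class LatestOnlySlot<T> {
  private pending: { value: T; resolve: (outcome: ChangeOutcome) => void } | undefined;

  // Queue a value; any value still waiting is superseded and told so explicitly.
  enqueue(value: T): Promise<ChangeOutcome> {
    if (this.pending) {
      this.pending.resolve("skipped");
    }
    return new Promise<ChangeOutcome>((resolve) => {
      this.pending = { value, resolve };
    });
  }

  // Apply whatever is pending, if anything, and report "applied" to its caller.
  flush(apply: (value: T) => void): void {
    const current = this.pending;
    if (!current) return;
    this.pending = undefined;
    apply(current.value);
    current.resolve("applied");
  }
}
```

Callers that care can branch on the outcome, while callers that only need to wait can ignore it.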

packages/cli-v3/src/entryPoints/managed/execution.ts (2)

894-898: Surface unhandled rejections when setting suspendable

The setter silently swallows errors by only logging them. In production this can mask critical bugs (e.g. network failures) and make incident triage harder.
Consider propagating the promise (or at least returning it) so that callers may await if they need to guarantee the suspendable flag is processed.


952-1023: Concurrency: ensure Snapshot still current after suspend API call

There is a very small race window between the cleanup finishing and the HTTP suspendRun call returning.
If another snapshot change sneaks in (e.g. cancel) the runner could suspend an outdated snapshot ID, leaving the run in an inconsistent state.
A second equality check after the suspend response and before logging “suspending, any day now” would make this bullet-proof.
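
The re-check described above could be shaped roughly like this. The function and parameter names are hypothetical stand-ins, not the PR's API; the point is only that state read before an `await` has to be compared again after it.

```typescript
// Hypothetical sketch: names are illustrative, not the PR's API.
async function suspendIfStillCurrent(
  getCurrentSnapshotId: () => string,
  suspendRun: (snapshotId: string) => Promise<void>
): Promise<"suspended" | "stale"> {
  const snapshotId = getCurrentSnapshotId();
  await suspendRun(snapshotId);
  // The await above yields to the event loop, so the snapshot may have moved on meanwhile.
  return getCurrentSnapshotId() === snapshotId ? "suspended" : "stale";
}
```

A caller would treat "stale" as a signal to skip any follow-up that assumes the suspended snapshot is still the latest one.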

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c0807ad and b17a947.

📒 Files selected for processing (23)
  • .changeset/plenty-dolphins-act.md (1 hunks)
  • apps/webapp/app/runEngine/validators/triggerTaskValidator.ts (1 hunks)
  • apps/webapp/app/v3/runEngineHandlers.server.ts (4 hunks)
  • packages/cli-v3/package.json (2 hunks)
  • packages/cli-v3/src/entryPoints/dev-run-worker.ts (3 hunks)
  • packages/cli-v3/src/entryPoints/managed-run-worker.ts (3 hunks)
  • packages/cli-v3/src/entryPoints/managed/controller.ts (4 hunks)
  • packages/cli-v3/src/entryPoints/managed/env.ts (0 hunks)
  • packages/cli-v3/src/entryPoints/managed/execution.ts (36 hunks)
  • packages/cli-v3/src/entryPoints/managed/logger.ts (2 hunks)
  • packages/cli-v3/src/entryPoints/managed/poller.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/snapshot.test.ts (1 hunks)
  • packages/cli-v3/src/entryPoints/managed/snapshot.ts (1 hunks)
  • packages/cli-v3/src/executions/taskRunProcess.ts (4 hunks)
  • packages/cli-v3/tsconfig.json (1 hunks)
  • packages/cli-v3/tsconfig.src.json (1 hunks)
  • packages/cli-v3/tsconfig.test.json (1 hunks)
  • packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (1 hunks)
  • packages/core/src/v3/runtime/managedRuntimeManager.ts (0 hunks)
  • packages/core/src/v3/runtime/sharedRuntimeManager.ts (1 hunks)
  • packages/core/src/v3/schemas/messages.ts (3 hunks)
  • packages/core/src/v3/schemas/schemas.ts (0 hunks)
  • packages/core/src/v3/workers/index.ts (1 hunks)
💤 Files with no reviewable changes (3)
  • packages/cli-v3/src/entryPoints/managed/env.ts
  • packages/core/src/v3/schemas/schemas.ts
  • packages/core/src/v3/runtime/managedRuntimeManager.ts
🧰 Additional context used
🧬 Code Graph Analysis (4)
packages/cli-v3/src/entryPoints/dev-run-worker.ts (2)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (1)
  • SharedRuntimeManager (25-349)
packages/core/src/v3/workers/index.ts (1)
  • SharedRuntimeManager (24-24)
packages/cli-v3/src/entryPoints/managed-run-worker.ts (1)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (1)
  • SharedRuntimeManager (25-349)
packages/cli-v3/src/entryPoints/managed/logger.ts (4)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • DebugLogPropertiesInput (141-141)
  • DebugLogPropertiesInput (142-142)
packages/core/src/v3/runEngineWorker/workload/http.ts (1)
  • WorkloadHttpClient (21-181)
packages/cli-v3/src/entryPoints/managed/env.ts (1)
  • RunnerEnv (53-219)
packages/core/src/v3/index.ts (1)
  • flattenAttributes (43-43)
packages/core/src/v3/schemas/messages.ts (1)
packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (2)
  • DebugLogPropertiesInput (141-141)
  • DebugLogPropertiesInput (142-142)
⏰ Context from checks skipped due to timeout of 90000ms (7)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (34)
packages/cli-v3/src/entryPoints/managed/poller.ts (1)

87-87: Good clarification on the snapshot ID purpose.

This comment effectively communicates the limited role of the snapshot ID within the refactored architecture, making it clear that it serves as a diagnostic tool rather than a functional component in the execution flow.

packages/cli-v3/tsconfig.json (1)

8-11: LGTM: Test configuration integration.

Adding the reference to tsconfig.test.json properly integrates the new test configuration into the TypeScript project structure, creating a clearer separation between source and test configurations.

apps/webapp/app/runEngine/validators/triggerTaskValidator.ts (1)

49-49: Improved entitlement validation logic.

The condition has been refined to only return an error on explicit denial of access (result.hasAccess === false), rather than also failing when result is null/undefined. This ensures validation errors only occur on explicit denial, which aligns with the broader runtime management refactoring in this PR.
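
The difference between "falsy result" and "explicit denial" can be shown in a few lines. The `EntitlementResult` shape and the `isExplicitlyDenied` helper are assumptions for illustration; only the `hasAccess === false` comparison is taken from the comment above.

```typescript
// Hypothetical shape; only the `hasAccess === false` check comes from the review comment.
type EntitlementResult = { hasAccess: boolean } | null | undefined;

function isExplicitlyDenied(result: EntitlementResult): boolean {
  // A missing result (for example, when no client exists) is not treated as a denial.
  return result?.hasAccess === false;
}
```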

packages/cli-v3/package.json (2)

42-44: LGTM: Test file exclusion from build.

Excluding test files from the tshy build process is the correct approach, ensuring they don't get included in the published package while still being available for development and CI workflows.


76-76: LGTM: Test script addition.

Adding the test script for Vitest complements the existing E2E test script and provides a standardized way to run unit tests, improving the developer experience and CI integration.

packages/cli-v3/tsconfig.src.json (1)

4-4: Exclude test files from source compilation.

Adding "exclude": ["./src/**/*.test.ts"] ensures that test files are omitted from the primary tsconfig.src.json, delegating test compilation to the dedicated test config. This change aligns perfectly with the new tsconfig.test.json.

packages/core/src/v3/workers/index.ts (1)

24-24: Export the new SharedRuntimeManager.

Replacing the legacy ManagedRuntimeManager export with SharedRuntimeManager from ../runtime/sharedRuntimeManager.js follows the PR’s refactor. Please verify that SharedRuntimeManager implements the RuntimeManager interface and that no stale imports of ManagedRuntimeManager remain.

packages/cli-v3/tsconfig.test.json (1)

1-11: Introduce dedicated test TS config.

This new tsconfig.test.json properly extends the base config, references tsconfig.src.json, and includes vitest/globals. It cleanly separates test compilation from production builds.

apps/webapp/app/v3/runEngineHandlers.server.ts (3)

404-404: Add [engine] prefix for execution snapshot logs.

Updating the debug message to [engine] ${snapshot.executionStatus} - ${snapshot.description} makes engine-related logs explicit and consistent with other handlers.


453-454: Use run:notify prefix for worker notifications.

The inline comment and run:notify platform -> supervisor: ${snapshot.executionStatus} message correctly adhere to the established notification prefix, avoiding the [engine] tag here.


483-484: Maintain run:notify prefix on error notifications.

The error-path log run:notify ERROR platform -> supervisor: ${snapshot.executionStatus} also properly omits [engine], preserving consistency across notification events.

packages/cli-v3/src/entryPoints/dev-run-worker.ts (3)

38-38: LGTM: Import replacement aligns with runtime manager refactoring

The replacement of ManagedRuntimeManager with SharedRuntimeManager in the imports is consistent with the PR's objective to refactor waitpoint and suspendable state handling.


458-460: LGTM: Simplified waitpoint resolution

The implementation now uses a single RESOLVE_WAITPOINT handler that calls resolveWaitpoints([waitpoint]) on the shared runtime manager. This simplifies the IPC messaging architecture and aligns with the PR goal of fixing waitpoint resolution issues, particularly for waitpoints that arrive early.


531-532: LGTM: Runtime manager implementation replacement

Replacing managedWorkerRuntime with sharedWorkerRuntime and using the SharedRuntimeManager class is aligned with the architectural changes in this PR. The SharedRuntimeManager centralizes runtime management with improved logic for handling waitpoints and suspensions.

packages/cli-v3/src/entryPoints/managed-run-worker.ts (3)

37-37: LGTM: Consistent runtime manager import replacement

Similar to the changes in dev-run-worker.ts, this import change from ManagedRuntimeManager to SharedRuntimeManager keeps the codebase consistent with the refactoring approach.


451-453: LGTM: Simplified waitpoint resolution handler

The implementation now uses a single RESOLVE_WAITPOINT handler that calls resolveWaitpoints([waitpoint]) on the shared runtime manager. This simplification is consistent with the changes in dev-run-worker.ts and improves the handling of waitpoints.


559-561: LGTM: Runtime manager instantiation updated

Creating sharedWorkerRuntime with SharedRuntimeManager and setting it as the global runtime manager is consistent with the architectural changes. Note that the showLogs parameter is hardcoded to true here, unlike in dev-run-worker.ts where it uses the showInternalLogs variable.

packages/cli-v3/src/entryPoints/managed/logger.ts (6)

2-3: LGTM: Enhanced debug logging imports

Adding imports for DebugLogPropertiesInput and flattenAttributes supports the improved debug logging capabilities, providing better type definitions and utilities for processing log properties.

Also applies to: 7-7


9-15: LGTM: Improved debug log options

The SendDebugLogOptions type now uses DebugLogPropertiesInput for properties and adds a print flag, allowing more control over how debug logs are processed and displayed.


17-19: LGTM: Abstraction with RunLogger interface

Creating a RunLogger interface is a good design decision that allows for different logger implementations while maintaining a consistent API.


26-26: LGTM: Class renamed for clarity

Renaming the class from RunLogger to ManagedRunLogger better reflects its specific implementation role and allows for other implementations of the RunLogger interface.


35-64: LGTM: Enhanced debug log handling

The updated sendDebugLog method now:

  1. Conditionally prints to the console based on the print flag
  2. Uses flattenAttributes to ensure properties are in the correct format for the API
  3. Merges additional context (runId, runnerId, workerName) into the properties

These improvements provide more flexibility and ensure consistent formatting of log data.


67-79: LGTM: New console logger implementation

Adding the ConsoleRunLogger class provides a simpler alternative for logging that only writes to the console. This is useful for testing or environments where the full managed logger isn't needed.

packages/core/src/v3/runEngineWorker/supervisor/schemas.ts (5)

129-136: LGTM: Improved type definitions for debug log properties

Renaming AttributeValue to DebugLogPropertiesValue makes the purpose clearer, and updating array types to use nullish() instead of nullable() improves type safety by handling both null and undefined values.


138-140: LGTM: Renamed schema for clarity

Renaming Attributes to DebugLogProperties makes the schema's purpose more obvious and aligns with the other debug log-related schema changes.


141-142: LGTM: New input schema for flexibility

Adding the DebugLogPropertiesInput schema with z.unknown() values provides more flexibility for input validation, allowing arbitrary property values that can be properly validated and processed later.


144-149: LGTM: Input validation schema for debug logs

The new WorkerApiDebugLogBodyInput schema provides clear input validation for debug log bodies, using the more permissive DebugLogPropertiesInput for properties while maintaining strict typing for required fields.


151-155: LGTM: Updated schema for consistency

Updating WorkerApiDebugLogBody to use DebugLogProperties maintains consistency with the renamed schemas and ensures proper validation of debug log data.

packages/cli-v3/src/entryPoints/managed/controller.ts (1)

470-485: 🛠️ Refactor suggestion

⚠️ Potential issue

Potential race: currentExecution may have changed before the enqueue

this.currentExecution is checked for existence at L470-479, but there is an await before the call to enqueueSnapshotChangeAndWait.
Another run could start in the meantime, replacing currentExecution with an execution for a different run, leading to a snapshot from run A being applied to run B.

Consider capturing the execution reference early and validating it before use:

-const [error] = await tryCatch(
-  this.currentExecution.enqueueSnapshotChangeAndWait(runExecutionData)
-);
+const execution = this.currentExecution;
+if (!execution || execution.runFriendlyId !== controller.runFriendlyId) {
+  return; // execution replaced – ignore stale notification
+}
+
+const [error] = await tryCatch(
+  execution.enqueueSnapshotChangeAndWait(runExecutionData)
+);

Likely an incorrect or invalid review comment.

packages/cli-v3/src/executions/taskRunProcess.ts (2)

36-45: Confirm new catalog keys exist

The new message types "SEND_DEBUG_LOG" and "SET_SUSPENDABLE" must be present in ExecutorToWorkerMessageCatalog.
A missing key will cause a run-time validation error in ZodIpcConnection.

Please verify with a quick grep or unit test.


277-278: Message rename check

waitpointCompleted() now sends "RESOLVE_WAITPOINT".
Make sure all workers listening for "WAITPOINT_COMPLETED" were updated accordingly, otherwise waitpoint resolution will silently fail.

packages/core/src/v3/schemas/messages.ts (1)

21-21:

✅ Verification successful

Possible broken import path for DebugLogPropertiesInput

DebugLogPropertiesInput is exported from runEngineWorker/supervisor/schemas.ts (per relevant snippet), but this file imports it from ../runEngineWorker/index.js.
Unless index.js re-exports the symbol, TypeScript will fail to resolve the type and the build will break.

-import { DebugLogPropertiesInput } from "../runEngineWorker/index.js";
+// Re-export exists?  If not, point directly at the schema file
+import { DebugLogPropertiesInput } from "../runEngineWorker/supervisor/schemas.js";

Please verify the re-export or adjust the path accordingly.


I’d like to confirm whether DebugLogPropertiesInput is re-exported in runEngineWorker/index.js (or .ts). Let’s list the files in that folder and search for exports:


🏁 Script executed:

#!/bin/bash
set -e

echo "1) Files in runEngineWorker directory:"
find packages/core/src/v3/runEngineWorker -maxdepth 1 -type f

echo
echo "2) All occurrences of DebugLogPropertiesInput in packages/core:"
rg -n "DebugLogPropertiesInput" packages/core

echo
echo "3) Contents of index.js (if present):"
sed -n '1,200p' packages/core/src/v3/runEngineWorker/index.js 2>/dev/null || echo "→ index.js not found"

echo
echo "4) Contents of index.ts:"
sed -n '1,200p' packages/core/src/v3/runEngineWorker/index.ts 2>/dev/null || echo "→ index.ts not found"

Length of output: 2132


No broken import—DebugLogPropertiesInput is correctly re-exported
The packages/core/src/v3/runEngineWorker/index.ts file includes

export * from "./supervisor/schemas.js";

and that module defines and exports DebugLogPropertiesInput. The import in
packages/core/src/v3/schemas/messages.ts

import { DebugLogPropertiesInput } from "../runEngineWorker/index.js";

will resolve correctly at build time. No changes needed.

packages/cli-v3/src/entryPoints/managed/snapshot.ts (1)

76-80: crypto.randomUUID() requires Node ≥ 14.17

crypto.randomUUID() was added in Node 14.17.0 (and 15.6.0). If the managed worker is executed in older runtimes (e.g. Node 12 or an early Node 14 release), this will throw at runtime.
Either use a fallback (import { v4 as uuid } from "uuid") or assert the minimum Node version in the package's engines field.
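
If adding a dependency is undesirable, a fallback can also be built on crypto.randomBytes. The following is a minimal sketch, not code from this PR; the helper names `uuidV4` and `uuidV4FromRandomBytes` are made up for illustration.

```typescript
import * as nodeCrypto from "crypto";

// Builds an RFC 4122 version 4 UUID from 16 random bytes.
function uuidV4FromRandomBytes(): string {
  const bytes = nodeCrypto.randomBytes(16);
  // Set the version nibble to 4 and the variant bits to 10xxxxxx.
  bytes.writeUInt8((bytes.readUInt8(6) & 0x0f) | 0x40, 6);
  bytes.writeUInt8((bytes.readUInt8(8) & 0x3f) | 0x80, 8);
  const hex = bytes.toString("hex");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

// Prefers the built-in when the runtime provides it, else falls back.
function uuidV4(): string {
  return typeof nodeCrypto.randomUUID === "function"
    ? nodeCrypto.randomUUID()
    : uuidV4FromRandomBytes();
}
```

The feature check happens at call time, so the same build can run on both old and new runtimes.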

packages/cli-v3/src/entryPoints/managed/execution.ts (1)

672-687: Verify mapping completeness for new attempt statuses

convertAttemptStatusToSnapshotStatus deliberately omits "RUN_PENDING_EXECUTING" and any future statuses.
If the server starts returning a new status, compilation will fail thanks to assertExhaustive, but builds that skip type checking (for example transpile-only pipelines) or older build artefacts may silently mis-map.
Make sure the exhaustive check is covered by unit tests so CI catches any upstream additions.
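
To illustrate what such a test would exercise, here is a minimal sketch of an exhaustive switch guarded by a `never` check. The status unions and `toSnapshotStatus` are hypothetical; only the name assertExhaustive comes from the review, and its real signature may differ.

```typescript
// Hypothetical status unions; the real ones are defined in the project schemas.
type AttemptStatus = "FINISHED" | "CANCELLED" | "RETRYING";
type SnapshotStatus = "DONE" | "QUEUED";

// Compile-time: unreachable if every case is handled.
// Runtime: throws if an unknown value slips through anyway.
function assertExhaustive(value: never): never {
  throw new Error(`Unhandled status: ${String(value)}`);
}

function toSnapshotStatus(status: AttemptStatus): SnapshotStatus {
  switch (status) {
    case "FINISHED":
    case "CANCELLED":
      return "DONE";
    case "RETRYING":
      return "QUEUED";
    default:
      return assertExhaustive(status);
  }
}
```

A unit test can cast an unknown string to the union and assert that the call throws, which covers the case where the type checker was bypassed.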

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (1)

291-313: ⚠️ Potential issue

Add defensive JSON parsing for error output

The JSON.parse call has no error handling, which could crash the worker if waitpoint.output contains malformed JSON.

-    error: waitpoint.output
-      ? JSON.parse(waitpoint.output)
-      : {
+    error: waitpoint.output
+      ? (() => {
+          try {
+            return JSON.parse(waitpoint.output);
+          } catch (e) {
+            return {
+              type: "STRING_ERROR",
+              message: `Error parsing output: ${String(waitpoint.output).slice(0, 100)}`,
+            };
+          }
+        })()
+      : {
         type: "STRING_ERROR",
         message: "Missing error output",
       },
🧹 Nitpick comments (1)
packages/cli-v3/src/entryPoints/managed/execution.ts (1)

900-904: Add error handling for suspendable state setter

The current implementation swallows errors when setting suspendable state. While logging is good, consider whether errors should be propagated or if retries are appropriate.

  private set suspendable(suspendable: boolean) {
    this.snapshotManager?.setSuspendable(suspendable).catch((error) => {
      this.sendDebugLog("failed to set suspendable", { error: error.message });
+     // Consider whether this failure is critical enough to propagate or retry
    });
  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c36e274 and 55b835d.

📒 Files selected for processing (5)
  • internal-packages/testcontainers/src/index.ts (3 hunks)
  • packages/cli-v3/src/entryPoints/managed/execution.ts (36 hunks)
  • packages/cli-v3/src/entryPoints/managed/snapshot.ts (1 hunks)
  • packages/cli-v3/src/executions/taskRunProcess.ts (5 hunks)
  • packages/core/src/v3/runtime/sharedRuntimeManager.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • internal-packages/testcontainers/src/index.ts
  • packages/cli-v3/src/entryPoints/managed/snapshot.ts
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: typecheck / typecheck
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (18)
packages/cli-v3/src/executions/taskRunProcess.ts (5)

36-44: IPC message type definitions for debug logs and suspendable state

Good addition of typed IPC message interfaces for the new debug logging and suspendable state functionality. This provides strong type safety for message handlers.


81-82: Event emitters for runtime state management

These new event emitters replace the previous wait-related events with a more streamlined approach focusing on debug logging and suspendable state. This simplifies the IPC communication model.


100-105: Good addition of event handler cleanup method

Adding this method helps prevent memory leaks by providing a centralized way to detach all event handlers. Using this in the shutdown flow will ensure proper cleanup.


200-205: New IPC handlers for debug logs and suspendable state

Well-structured handlers that simply post the messages to the appropriate event emitters. This maintains the separation between IPC communication and event handling.


284-284: Simplified waitpoint resolution messaging

Replacing WAITPOINT_COMPLETED with RESOLVE_WAITPOINT aligns with the new centralized waitpoint handling architecture, making the communication protocol more consistent.

packages/core/src/v3/runtime/sharedRuntimeManager.ts (6)

19-23: Good type definitions for resolvers and resolver IDs

The use of branded types for ResolverId is an excellent practice that provides type safety against accidental misuse of string IDs.


25-42: Well-structured class initialization with status logging

The class structure with clear private fields and appropriate logging interval helps with debugging long-running executions. The 5-minute interval for status logging is suitable for production environments while avoiding log spam.


48-72: Consistent implementation of wait methods

All three wait methods (waitForTask, waitForBatch, waitForWaitpoint) follow the same pattern:

  1. Guard against multiple concurrent waits
  2. Create promise with resolver
  3. Register resolver
  4. Resolve any pending waitpoints
  5. Call lifecycle hooks
  6. Wait with suspendable wrapper
  7. Process results
  8. Call resume hooks

This consistency makes the code more maintainable and easier to understand.

Also applies to: 74-114, 116-163
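
The core of that pattern (register a resolver, then drain any waitpoints that arrived early, then wait while suspendable) can be sketched as follows. This is a simplified illustration under assumed names (`MiniRuntimeManager`, `waitFor`, `pendingCount`), not the actual SharedRuntimeManager; lifecycle hooks and result processing are omitted.

```typescript
// Hypothetical, simplified types; the real manager tracks more state.
type Waitpoint = { id: string; output?: string };
type Resolver = (waitpoint: Waitpoint) => void;

class MiniRuntimeManager {
  private resolversById = new Map<string, Resolver>();
  private pendingById = new Map<string, Waitpoint>();
  suspendable = false;

  get pendingCount(): number {
    return this.pendingById.size;
  }

  async waitFor(id: string): Promise<Waitpoint> {
    // Create the promise and register its resolver first.
    const promise = new Promise<Waitpoint>((resolve) => {
      this.resolversById.set(id, resolve);
    });
    // Then drain waitpoints that arrived before anyone was waiting.
    this.resolvePending();
    this.suspendable = true;
    try {
      return await promise;
    } finally {
      this.suspendable = false;
    }
  }

  resolveWaitpoint(waitpoint: Waitpoint): void {
    const resolve = this.resolversById.get(waitpoint.id);
    if (!resolve) {
      // Nobody is waiting yet: park the waitpoint until a resolver registers.
      this.pendingById.set(waitpoint.id, waitpoint);
      return;
    }
    this.resolversById.delete(waitpoint.id);
    this.pendingById.delete(waitpoint.id);
    resolve(waitpoint);
  }

  private resolvePending(): void {
    // Iterate over a copy of the keys because resolveWaitpoint mutates the map.
    for (const id of Array.from(this.pendingById.keys())) {
      const waitpoint = this.pendingById.get(id);
      if (waitpoint) this.resolveWaitpoint(waitpoint);
    }
  }
}
```

With this ordering, a waitpoint delivered before `waitFor` is called is parked and then resolved as soon as the matching resolver is registered.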


261-272: Safe iteration over maps during mutation

Good implementation of resolvePendingWaitpoints() that avoids mutation-during-iteration issues by cloning the keys first. This prevents potential skipped entries when mutating a collection during iteration.


278-289: Robust suspendable state wrapper

Good implementation of the suspendable wrapper that:

  1. Sets the suspendable state to true at the start
  2. Uses tryCatch for error handling
  3. Ensures suspendable is set to false even if an error occurs
  4. Properly logs errors
  5. Propagates errors to the caller

This ensures the runtime always exits the suspendable state, even in error scenarios.
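
A wrapper with those properties can be sketched as below. The review says the real code uses tryCatch; this illustration swaps in plain try/catch/finally, and `withSuspendable` and its parameters are assumed names.

```typescript
// Runs `work` with the suspendable flag raised, and always lowers it again.
async function withSuspendable<T>(
  setSuspendable: (value: boolean) => void,
  work: () => Promise<T>
): Promise<T> {
  setSuspendable(true);
  try {
    return await work();
  } catch (error) {
    // Log, then rethrow so the caller still sees the failure.
    console.error("error while suspendable", error);
    throw error;
  } finally {
    // Runs on both the success and the error path.
    setSuspendable(false);
  }
}
```

The `finally` block is what guarantees the flag is lowered even when `work` rejects.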


315-326: Safely format waitpoints for debug logs

Good implementation that:

  1. Destructures to avoid modifying the original waitpoint
  2. Truncates output to avoid excessive log sizes
  3. Converts dates to ISO strings for readability
  4. Preserves original date objects for potential further processing

This ensures logs are readable and manageable without losing important information.

packages/cli-v3/src/entryPoints/managed/execution.ts (7)

57-57: Introduction of SnapshotManager

Good architectural change to delegate snapshot and suspendable state management to a dedicated class. This improves separation of concerns.


149-156: Event handlers for runtime communication

Effective implementation of debug log and suspendable state event handlers that integrate with the snapshot manager.


178-190: Improved snapshot change handling

Renaming handleSnapshotChange to enqueueSnapshotChangeAndWait better reflects the asynchronous nature of the operation. Good use of the snapshot manager for actual processing.


389-397: Well-structured SnapshotManager initialization

The initialization provides all necessary parameters including callbacks for state changes, making dependencies clear and explicit.


673-693: Helper methods for snapshot state management

The introduction of updateSnapshotAfterCompletion and convertAttemptStatusToSnapshotStatus centralizes conversion logic and ensures consistent snapshot state updates.


957-957: Proper cleanup of event handlers

Good use of the new unsafeDetachEvtHandlers method to ensure event handlers are cleaned up during shutdown. This prevents memory leaks when processes are reused.


960-1031: Comprehensive suspendable state handling

The handleSuspendable method properly:

  1. Verifies snapshot consistency before and after cleanup
  2. Handles errors during process cleanup
  3. Gracefully handles API failures
  4. Provides detailed logging

This ensures robust suspension handling with appropriate error cases.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
.github/workflows/publish.yml (1)

72-78: ⚠️ Potential issue

Missing reusable workflow file publish-worker-v4.yml

publish-worker-v4 references ./.github/workflows/publish-worker-v4.yml, but the file does not exist (confirmed by actionlint).
This will cause the entire workflow run to fail.

If the new job is not ready, remove the block; otherwise, add the referenced workflow file before merging.

🧰 Tools
🪛 actionlint (1.7.4)

74-74: could not read reusable workflow file for "./.github/workflows/publish-worker-v4.yml": open /home/jailuser/git/.github/workflows/publish-worker-v4.yml: no such file or directory

(workflow-call)

♻️ Duplicate comments (2)
packages/core/src/v3/runtime/sharedRuntimeManager.ts (2)

261-265: ⚠️ Potential issue

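Map is mutated during iteration

resolvePendingWaitpoints() iterates over waitpointsByResolverId.entries() while resolveWaitpoint() can delete the very key being iterated (lines 257-259).
Deleting entries from a Map during iteration is well-defined in ECMA-262 (removing the current key is safe, and entries removed before they are reached are simply not visited), but iterating a live collection that a callee mutates is easy to break as the code evolves, so taking a snapshot of the keys is the more defensive choice.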

-    for (const [resolverId, waitpoint] of this.waitpointsByResolverId.entries()) {
-      this.resolveWaitpoint(waitpoint, resolverId);
-    }
+    // Clone keys first to avoid mutation-while-iterating
+    for (const resolverId of Array.from(this.waitpointsByResolverId.keys())) {
+      const waitpoint = this.waitpointsByResolverId.get(resolverId)!;
+      this.resolveWaitpoint(waitpoint, resolverId);
+    }

This was pointed out in an earlier review but is still present.
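
The snapshot-the-keys pattern from the diff can be tried on a plain Map. This is a standalone sketch with made-up data; `resolveAndDelete` stands in for resolveWaitpoint().

```typescript
// Made-up data: resolver IDs mapped to waitpoint IDs.
const waitpoints = new Map<string, string>([
  ["resolver_1", "waitpoint_1"],
  ["resolver_2", "waitpoint_2"],
  ["resolver_3", "waitpoint_3"],
]);

const visited: string[] = [];

// Stand-in for resolveWaitpoint(): handles the entry, then removes it.
function resolveAndDelete(resolverId: string): void {
  visited.push(resolverId);
  waitpoints.delete(resolverId);
}

// Array.from copies the keys, so deletions cannot affect the loop itself.
for (const resolverId of Array.from(waitpoints.keys())) {
  // Re-check membership in case an earlier callback removed this entry.
  if (waitpoints.has(resolverId)) {
    resolveAndDelete(resolverId);
  }
}
```

After the loop every entry has been visited once and the map is empty.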


284-305: ⚠️ Potential issue

JSON.parse can crash the worker – add defensive parsing

A malformed waitpoint.output will throw and bring down the entire run.
Wrap the parse in a try/catch (or use safeJsonParse) so a single bad payload does not terminate the process.

-        error: waitpoint.output
-          ? JSON.parse(waitpoint.output)
-          : {
-              type: "STRING_ERROR",
-              message: "Missing error output",
-            },
+        error: (() => {
+          if (!waitpoint.output) {
+            return {
+              type: "STRING_ERROR",
+              message: "Missing error output",
+            };
+          }
+          try {
+            return JSON.parse(waitpoint.output);
+          } catch {
+            return {
+              type: "STRING_ERROR",
+              message: "Unparseable error output",
+            };
+          }
+        })(),
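
The same logic can be pulled into a small named helper, which keeps the call site readable. A minimal sketch, assuming the output is an optional string; `parseErrorOutput` is a hypothetical name and the error shape mirrors the diff above.

```typescript
type StringError = { type: "STRING_ERROR"; message: string };

// Parses a waitpoint's error output without ever throwing.
function parseErrorOutput(output: string | undefined): unknown {
  if (!output) {
    const missing: StringError = {
      type: "STRING_ERROR",
      message: "Missing error output",
    };
    return missing;
  }
  try {
    return JSON.parse(output);
  } catch {
    const unparseable: StringError = {
      type: "STRING_ERROR",
      message: "Unparseable error output",
    };
    return unparseable;
  }
}
```

Valid JSON is returned as parsed; both failure modes collapse to a STRING_ERROR object instead of an exception.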
🧹 Nitpick comments (7)
scripts/publish-prerelease.sh (1)

58-59: Redirect error output to stderr for better logging
Currently, both the command output and the error message are printed to stdout. To clearly distinguish errors in CI logs or user terminals, send these to stderr.

Proposed diff:

-    echo "$output"
-    echo "Error running changeset version command, detailed output above"
+    echo "$output" >&2
+    echo "Error running changeset version command, detailed output above" >&2
packages/cli-v3/src/commands/deploy.ts (2)

217-237: Avoid instantiating spinners when running in CI

$spinner is created unconditionally, yet in CI mode (isCI === true) it is never used – all progress is reported via log.step and console.log.
This results in unnecessary object creation and (more importantly) an extra TTY escape sequence on some CI providers the moment the spinner is instantiated, which can pollute logs.

-  const $spinner = spinner();
+  // Only create a spinner for interactive terminals
+  const $spinner = isCI ? undefined : spinner();

…and wrap subsequent $spinner.* calls with a guard ($spinner?.start(...), $spinner?.message(...), etc.).

[nitpick, performance]

Also applies to: 333-343


330-343: cliLink() is called when links are unsupported

You already construct rawDeploymentLink/rawTestLink for the fallback case, yet cliLink() is still invoked even when terminalLink support is absent.
While cliLink has a built-in fallback, skipping the call altogether avoids the extra formatting step and prevents ANSI escape codes from leaking into plain-text CI logs on some shells.

-const deploymentLink = cliLink("View deployment", rawDeploymentLink);
-const testLink       = cliLink("Test tasks",    rawTestLink);
+const deploymentLink = isLinksSupported
+  ? cliLink("View deployment", rawDeploymentLink)
+  : rawDeploymentLink;
+const testLink = isLinksSupported
+  ? cliLink("Test tasks", rawTestLink)
+  : rawTestLink;
packages/cli-v3/src/entryPoints/managed/controller.ts (2)

405-413: Prefer crypto.randomUUID() over manual random string IDs

Math.random().toString(36) has lower entropy and can collide in high-throughput scenarios.
Node ≥ 14.17 provides crypto.randomUUID(), which yields RFC 4122 v4 IDs without additional deps.

-const notificationId = Math.random().toString(36).substring(2, 15);
+import { randomUUID } from "node:crypto"; // place with the other imports at the top of the file
+const notificationId = randomUUID();

527-573: currentEnv/newEnv capture identical data

processEnvOverrides() presumably mutates this.env; however both snapshots are built from this.env after the call, so they will always be identical.
Capture currentEnv before invoking the mutation to give meaningful diff logging.

-      if (this.currentExecution) {
-        const currentEnv = { ...snip... };
-        await this.currentExecution.processEnvOverrides("socket disconnected");
-        const newEnv = { ...snip... };
+      if (this.currentExecution) {
+        const currentEnv = {
+          workerInstanceName: this.env.TRIGGER_WORKER_INSTANCE_NAME,
+          runnerId: this.env.TRIGGER_RUNNER_ID,
+          supervisorApiUrl: this.env.TRIGGER_SUPERVISOR_API_URL,
+        };
+
+        await this.currentExecution.processEnvOverrides("socket disconnected");
+
+        const newEnv = {
+          workerInstanceName: this.env.TRIGGER_WORKER_INSTANCE_NAME,
+          runnerId: this.env.TRIGGER_RUNNER_ID,
+          supervisorApiUrl: this.env.TRIGGER_SUPERVISOR_API_URL,
+        };
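
The before/after capture can be reduced to a small standalone sketch. The three environment keys come from the diff above; `snapshotEnv`, `diffEnv`, and the sample values are assumptions, and the direct assignment stands in for processEnvOverrides().

```typescript
type Env = Record<string, string | undefined>;

const TRACKED_KEYS = [
  "TRIGGER_WORKER_INSTANCE_NAME",
  "TRIGGER_RUNNER_ID",
  "TRIGGER_SUPERVISOR_API_URL",
] as const;

// Copies only the tracked keys, so later mutations do not affect the snapshot.
function snapshotEnv(env: Env): Env {
  const copy: Env = {};
  for (const key of TRACKED_KEYS) {
    copy[key] = env[key];
  }
  return copy;
}

// Lists the tracked keys whose values differ between two snapshots.
function diffEnv(before: Env, after: Env): string[] {
  return TRACKED_KEYS.filter((key) => before[key] !== after[key]);
}

const env: Env = {
  TRIGGER_RUNNER_ID: "runner-a",
  TRIGGER_SUPERVISOR_API_URL: "http://old.example",
};

const currentEnv = snapshotEnv(env); // captured BEFORE the mutation
env["TRIGGER_SUPERVISOR_API_URL"] = "http://new.example"; // stand-in for processEnvOverrides()
const newEnv = snapshotEnv(env); // captured AFTER the mutation
const changedKeys = diffEnv(currentEnv, newEnv);
```

If both snapshots were taken after the mutation, `changedKeys` would always be empty, which is the problem the comment describes.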
packages/cli-v3/src/entryPoints/managed/execution.ts (2)

390-399: Verify the appropriateness of the initial status

The comment "We're just guessing here, but 'PENDING_EXECUTING' is probably fine" indicates uncertainty about the correct initial status for the SnapshotManager. This could potentially lead to issues if the actual state should be different.

Consider having a more systematic way to determine the initial status rather than guessing. For example, you could:

  1. Pass the initial status from the caller where it's known with certainty
  2. Make the status parameter optional and have the SnapshotManager derive it from other parameters
  3. Add validation logic in the SnapshotManager constructor to ensure the initial status is compatible with the given context

211-212: Consider using a debug flag instead of commented code

The comment "DO NOT REMOVE (very noisy, but helpful for debugging)" indicates that this line might be re-enabled for debugging.

Consider controlling this with a debug flag or log level setting rather than commenting and uncommenting code. This would make it easier to enable debugging when needed without code changes:

-  // DO NOT REMOVE (very noisy, but helpful for debugging)
-  // this.sendDebugLog(`processing snapshot change: ${snapshot.executionStatus}`, snapshotMetadata);
+  if (this.debugVerbose) {
+    this.sendDebugLog(`processing snapshot change: ${snapshot.executionStatus}`, snapshotMetadata);
+  }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55b835d and ed1a44c.

📒 Files selected for processing (11)
  • .changeset/sweet-dolphins-invent.md (1 hunks)
  • .github/workflows/publish-worker-re2.yml (1 hunks)
  • .github/workflows/publish-worker.yml (1 hunks)
  • .github/workflows/publish.yml (1 hunks)
  • packages/cli-v3/src/commands/deploy.ts (7 hunks)
  • packages/cli-v3/src/entryPoints/managed/controller.ts (6 hunks)
  • packages/cli-v3/src/entryPoints/managed/execution.ts (35 hunks)
  • packages/cli-v3/src/entryPoints/managed/poller.ts (2 hunks)
  • packages/core/src/v3/runtime/sharedRuntimeManager.ts (1 hunks)
  • packages/core/src/v3/utils/interval.ts (1 hunks)
  • scripts/publish-prerelease.sh (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • .changeset/sweet-dolphins-invent.md
  • .github/workflows/publish-worker.yml
  • .github/workflows/publish-worker-re2.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/cli-v3/src/entryPoints/managed/poller.ts
🧰 Additional context used
🧬 Code Graph Analysis (1)
packages/cli-v3/src/commands/deploy.ts (1)
packages/cli-v3/src/utilities/cliOutput.ts (2)
  • cliLink (140-145)
  • isLinksSupported (7-7)
🪛 actionlint (1.7.4)
.github/workflows/publish.yml

74-74: could not read reusable workflow file for "./.github/workflows/publish-worker-v4.yml": open /home/jailuser/git/.github/workflows/publish-worker-v4.yml: no such file or directory

(workflow-call)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: typecheck / typecheck
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (6)
packages/core/src/v3/utils/interval.ts (1)

43-56: Improved state reporting on interval service stop

This change enhances the stop() method to return the execution state at the time of stopping, providing valuable information for callers to handle cleanup correctly. This is a good improvement that helps address the race conditions mentioned in the PR objectives.

The implementation correctly captures the execution state before any state changes are made, ensuring accurate reporting regardless of the enabled state.
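
As an illustration of the idea, the sketch below shows an interval service whose stop() reports whether a callback was in flight when it was called. `MiniIntervalService`, `runOnce`, and the `isExecuting` field are assumed names; the real service in interval.ts may expose a different shape.

```typescript
class MiniIntervalService {
  private readonly onInterval: () => Promise<void>;
  private readonly intervalMs: number;
  private timer: ReturnType<typeof setTimeout> | undefined;
  private enabled = false;
  private executing = false;

  constructor(onInterval: () => Promise<void>, intervalMs: number) {
    this.onInterval = onInterval;
    this.intervalMs = intervalMs;
  }

  start(): void {
    if (this.enabled) return;
    this.enabled = true;
    this.schedule();
  }

  stop(): { isExecuting: boolean } {
    // Capture the state first, before anything below changes it.
    const state = { isExecuting: this.executing };
    this.enabled = false;
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    return state;
  }

  async runOnce(): Promise<void> {
    this.executing = true;
    try {
      await this.onInterval();
    } finally {
      this.executing = false;
    }
  }

  private schedule(): void {
    this.timer = setTimeout(() => {
      // Callback errors are swallowed here to keep the sketch short.
      void this.runOnce()
        .catch(() => undefined)
        .then(() => {
          if (this.enabled) this.schedule();
        });
    }, this.intervalMs);
  }
}
```

A caller can use the returned flag to decide whether it must wait for in-flight work before tearing down shared resources.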

packages/cli-v3/src/entryPoints/managed/execution.ts (5)

637-638: Good approach for maintaining snapshot consistency

The code explicitly updates the snapshot ID after completion to ensure any subsequent API calls use the correct snapshot. This helps prevent race conditions where another snapshot update arrives during completion processing.


912-916: Error handling approach for setSuspendable is reasonable

The error from setSuspendable is caught and logged but not propagated. Based on the previous review discussion, this is an acceptable approach since failures here are rare and only impact checkpointing.


972-1043: Well-designed suspension handling with robust race condition protection

The handleSuspendable method is very thorough in its approach to handling suspension requests:

  1. It verifies snapshot consistency before and after cleanup
  2. Handles API failure gracefully
  3. Contains appropriate logging throughout the process
  4. Aborts execution when consistency checks fail

This comprehensive approach helps prevent race conditions during the suspension process.
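
The before/after consistency check can be sketched in isolation. Everything here is hypothetical (`suspendIfConsistent`, the `SuspendDeps` callbacks, the outcome strings) and the sketch leaves out logging; it only shows why the snapshot ID is compared again after the asynchronous cleanup.

```typescript
type SuspendOutcome = "suspended" | "aborted" | "failed";

// All three callbacks are hypothetical stand-ins for the real dependencies.
interface SuspendDeps {
  getSnapshotId: () => string;
  cleanupProcess: () => Promise<void>;
  requestSuspend: (snapshotId: string) => Promise<boolean>;
}

async function suspendIfConsistent(deps: SuspendDeps): Promise<SuspendOutcome> {
  const snapshotBefore = deps.getSnapshotId();

  try {
    await deps.cleanupProcess();
  } catch {
    return "failed"; // cleanup error: do not request suspension
  }

  // A newer snapshot may have arrived while cleanup was awaiting.
  if (deps.getSnapshotId() !== snapshotBefore) {
    return "aborted";
  }

  try {
    const ok = await deps.requestSuspend(snapshotBefore);
    return ok ? "suspended" : "failed";
  } catch {
    return "failed"; // API failure is reported, not thrown
  }
}
```

Because the check is repeated after the await, a snapshot change that lands during cleanup results in an abort rather than a suspension request for a stale snapshot.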


966-970: Good service cleanup with proper resource management

The stopServices method now properly cleans up all resources and detaches event handlers. This addresses the memory leak concern raised in previous reviews.


149-156:

✅ Verification successful

Properly detach event listeners to prevent memory leaks

New event listeners are attached to the TaskRunProcess in lines 149-156. They must be detached when the process is stopped to prevent memory leaks across warm starts or retries.

Let's verify if the detachment is properly implemented in unsafeDetachEvtHandlers():


🏁 Script executed:

#!/bin/bash
# Check what the unsafeDetachEvtHandlers method does
rg -A 10 "unsafeDetachEvtHandlers" packages/cli-v3/src/executions/

Length of output: 866


Event listeners are already detached

The unsafeDetachEvtHandlers() method in packages/cli-v3/src/executions/taskRunProcess.ts calls detach() on onSendDebugLog and onSetSuspendable (along with the other handlers), so these listeners are properly cleaned up.

@nicktrn nicktrn marked this pull request as draft May 2, 2025 13:57