-
Notifications
You must be signed in to change notification settings - Fork 793
feat(debug): add comprehensive crash tracing system for Windows debugging #571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
…ging Adds a lightweight, off-by-default tracing system to help diagnose Windows crashes. Features: - Ring buffer-based tracer with JSONL output - Traces all spawn operations (session-notification, LSP, subagents) - Crash handlers (uncaughtException, unhandledRejection, signals) - debug-bundle CLI command for diagnostic collection - Automatic sensitive data redaction - Enable via OMO_DEBUG=1 environment variable This helps identify the root cause of crashes that persist after PR code-yeongyu#543. Refs: oven-sh/bun#25798, anomalyco/opencode#3267
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6 issues found across 16 files
Confidence score: 2/5
- The
src/shared/debug-tracer/tracer.tshandler flushes logs but keeps running after an uncaught exception, leaving the process in an undefined state and risking corrupt behavior for every request that follows. src/shared/debug-tracer/traced-spawn.tsresolves silently when a spawn fails, so callers can’t distinguish between a broken child process and a clean exit, which will mask failures in production pipelines.- Additional correctness gaps (unbalanced span handling, hardcoded CLI version) show instrumentation logic isn’t closing the loop yet, compounding the overall risk in this PR.
- Pay close attention to
src/shared/debug-tracer/tracer.ts,src/shared/debug-tracer/traced-spawn.ts- both contain regression-level reliability issues that need fixes before merging.
Prompt for AI agents (all issues)
Check if these issues are valid — if so, understand the root cause of each and fix them.
<file name="src/cli/debug-bundle/index.ts">
<violation number="1" location="src/cli/debug-bundle/index.ts:39">
P2: Version is hardcoded instead of being dynamically retrieved. This will get out of sync with the actual package version. Consider using `getCachedVersion()` from the auto-update-checker like other parts of the codebase.</violation>
</file>
<file name="src/shared/debug-tracer/tracer.ts">
<violation number="1" location="src/shared/debug-tracer/tracer.ts:435">
P2: The redaction check `lowerKey.includes("key")` is overly broad and will redact legitimate non-sensitive fields like `primaryKey`, `sortKey`, `keyboardLayout`, `keyframe`, etc. Consider using more specific patterns like `apiKey`, `secretKey`, or checking for `key` as a suffix/word boundary.</violation>
<violation number="2" location="src/shared/debug-tracer/tracer.ts:489">
P1: The `uncaughtException` handler doesn't terminate the process after flushing. After an uncaught exception, Node.js considers the application to be in an undefined state, and continuing execution is dangerous. The handler should call `process.exit(1)` after the emergency flush to prevent undefined behavior.</violation>
</file>
<file name="src/tools/call-omo-agent/tools.ts">
<violation number="1" location="src/tools/call-omo-agent/tools.ts:43">
P2: Span is started but never ended. The `spanId` returned by `startSpan()` should be passed to `endSpan()` when the operation completes (or errors). Without this, the span remains orphaned in `activeSpans` and no duration/completion event is recorded, defeating the purpose of the tracing system.</violation>
</file>
<file name="docs/DEBUG_TRACING.md">
<violation number="1" location="docs/DEBUG_TRACING.md:172">
P3: Inconsistent path separator: earlier in the doc uses backslash (`\`) for Windows paths, but this table entry uses forward slash (`/`). Should use `%TEMP%\oh-my-opencode-trace.jsonl` for consistency with line 53.</violation>
</file>
<file name="src/shared/debug-tracer/traced-spawn.ts">
<violation number="1" location="src/shared/debug-tracer/traced-spawn.ts:118">
P1: Spawn errors are silently swallowed - the error handler discards the error and resolves with null values. The caller cannot distinguish between a failed spawn and a process that exited normally. Consider rejecting the promise on error so callers can handle spawn failures appropriately.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
|
|
||
| private setupCrashHandlers(): void { | ||
| // uncaughtException | ||
| process.on("uncaughtException", (error) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: The uncaughtException handler doesn't terminate the process after flushing. After an uncaught exception, Node.js considers the application to be in an undefined state, and continuing execution is dangerous. The handler should call process.exit(1) after the emergency flush to prevent undefined behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/shared/debug-tracer/tracer.ts, line 489:
<comment>The `uncaughtException` handler doesn't terminate the process after flushing. After an uncaught exception, Node.js considers the application to be in an undefined state, and continuing execution is dangerous. The handler should call `process.exit(1)` after the emergency flush to prevent undefined behavior.</comment>
<file context>
@@ -0,0 +1,549 @@
+
+ private setupCrashHandlers(): void {
+ // uncaughtException
+ process.on("uncaughtException", (error) => {
+ this.trace("crash.handler", "uncaughtException", undefined, error)
+ this.emergencyFlush("uncaughtException")
</file context>
| resolve({ exitCode: code, signal }) | ||
| }) | ||
|
|
||
| proc.on("error", () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: Spawn errors are silently swallowed - the error handler discards the error and resolves with null values. The caller cannot distinguish between a failed spawn and a process that exited normally. Consider rejecting the promise on error so callers can handle spawn failures appropriately.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/shared/debug-tracer/traced-spawn.ts, line 118:
<comment>Spawn errors are silently swallowed - the error handler discards the error and resolves with null values. The caller cannot distinguish between a failed spawn and a process that exited normally. Consider rejecting the promise on error so callers can handle spawn failures appropriately.</comment>
<file context>
@@ -0,0 +1,237 @@
+ resolve({ exitCode: code, signal })
+ })
+
+ proc.on("error", () => {
+ resolve({ exitCode: null, signal: null })
+ })
</file context>
| // Create bundle | ||
| const bundle: Record<string, unknown> = { | ||
| createdAt: new Date().toISOString(), | ||
| version: "2.14.1-debug.1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Version is hardcoded instead of being dynamically retrieved. This will get out of sync with the actual package version. Consider using getCachedVersion() from the auto-update-checker like other parts of the codebase.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/cli/debug-bundle/index.ts, line 39:
<comment>Version is hardcoded instead of being dynamically retrieved. This will get out of sync with the actual package version. Consider using `getCachedVersion()` from the auto-update-checker like other parts of the codebase.</comment>
<file context>
@@ -0,0 +1,178 @@
+ // Create bundle
+ const bundle: Record<string, unknown> = {
+ createdAt: new Date().toISOString(),
+ version: "2.14.1-debug.1",
+ tracing: {
+ wasEnabled: tracingWasEnabled,
</file context>
| lowerKey.includes("token") || | ||
| lowerKey.includes("secret") || | ||
| lowerKey.includes("password") || | ||
| lowerKey.includes("key") || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: The redaction check lowerKey.includes("key") is overly broad and will redact legitimate non-sensitive fields like primaryKey, sortKey, keyboardLayout, keyframe, etc. Consider using more specific patterns like apiKey, secretKey, or checking for key as a suffix/word boundary.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/shared/debug-tracer/tracer.ts, line 435:
<comment>The redaction check `lowerKey.includes("key")` is overly broad and will redact legitimate non-sensitive fields like `primaryKey`, `sortKey`, `keyboardLayout`, `keyframe`, etc. Consider using more specific patterns like `apiKey`, `secretKey`, or checking for `key` as a suffix/word boundary.</comment>
<file context>
@@ -0,0 +1,549 @@
+ lowerKey.includes("token") ||
+ lowerKey.includes("secret") ||
+ lowerKey.includes("password") ||
+ lowerKey.includes("key") ||
+ lowerKey.includes("auth") ||
+ lowerKey.includes("credential")
</file context>
| log(`[call_omo_agent] Starting with agent: ${args.subagent_type}, background: ${args.run_in_background}`) | ||
|
|
||
| // Trace agent invocation | ||
| const spanId = isTracingEnabled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P2: Span is started but never ended. The spanId returned by startSpan() should be passed to endSpan() when the operation completes (or errors). Without this, the span remains orphaned in activeSpans and no duration/completion event is recorded, defeating the purpose of the tracing system.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/tools/call-omo-agent/tools.ts, line 43:
<comment>Span is started but never ended. The `spanId` returned by `startSpan()` should be passed to `endSpan()` when the operation completes (or errors). Without this, the span remains orphaned in `activeSpans` and no duration/completion event is recorded, defeating the purpose of the tracing system.</comment>
<file context>
@@ -37,6 +38,16 @@ export function createCallOmoAgent(
log(`[call_omo_agent] Starting with agent: ${args.subagent_type}, background: ${args.run_in_background}`)
+
+ // Trace agent invocation
+ const spanId = isTracingEnabled()
+ ? startSpan("subagent.start", `call_omo_agent.${args.subagent_type}`, {
+ agent: args.subagent_type,
</file context>
| | Variable | Description | Default | | ||
| |----------|-------------|---------| | ||
| | `OMO_DEBUG` | Enable tracing (`1` or `true`) | Not set (disabled) | | ||
| | `OMO_DEBUG_LOG` | Custom log file path | `%TEMP%/oh-my-opencode-trace.jsonl` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P3: Inconsistent path separator: earlier in the doc uses backslash (\) for Windows paths, but this table entry uses forward slash (/). Should use %TEMP%\oh-my-opencode-trace.jsonl for consistency with line 53.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/DEBUG_TRACING.md, line 172:
<comment>Inconsistent path separator: earlier in the doc uses backslash (`\`) for Windows paths, but this table entry uses forward slash (`/`). Should use `%TEMP%\oh-my-opencode-trace.jsonl` for consistency with line 53.</comment>
<file context>
@@ -0,0 +1,176 @@
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `OMO_DEBUG` | Enable tracing (`1` or `true`) | Not set (disabled) |
+| `OMO_DEBUG_LOG` | Custom log file path | `%TEMP%/oh-my-opencode-trace.jsonl` |
+
+## Version
</file context>
- Add custom error classes (LSPServerUnavailableError, LSPServerExitedError) for graceful failure handling - Track server availability with cooldown-based retry mechanism (5 min cooldown, 3 max retries) - Wrap spawn() in try-catch to catch ENOENT/EACCES errors immediately - Add safeWrite() method to handle stdin errors without crashing - Track stdinWritable state to prevent writes to destroyed streams - Handle warmup failures gracefully with console.warn instead of throwing - Skip unavailable servers on subsequent requests to avoid repeated spawn failures This prevents the crash discovered via debug tracing (PR code-yeongyu#571): spawn ENOENT -> uncaughtException -> ERR_STREAM_DESTROYED Refs: code-yeongyu#571
Greptile SummaryAdds a comprehensive off-by-default crash tracing system to help diagnose persistent Windows crashes. The implementation uses a ring buffer with JSONL output, automatic crash handlers, and sensitive data redaction. Key changes:
Architecture highlights:
Issues found:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant App as oh-my-opencode
participant Tracer as DebugTracer
participant RingBuffer as Ring Buffer
participant LogFile as trace.jsonl
participant CrashHandler as Crash Handlers
Note over App,Tracer: Initialization (OMO_DEBUG=1)
App->>Tracer: initTracer()
Tracer->>Tracer: Check OMO_DEBUG env var
Tracer->>LogFile: Create WriteStream
Tracer->>Tracer: Setup periodic flush (5s)
Tracer->>CrashHandler: Register handlers (uncaughtException, etc)
Tracer->>RingBuffer: trace("system.init")
Note over App,LogFile: Normal Operation
App->>Tracer: tracedSpawn(command, args)
Tracer->>Tracer: startSpan("spawn.start")
Tracer->>RingBuffer: Add event with memory stats
App->>App: Execute spawn
App->>Tracer: Process exits/errors
Tracer->>Tracer: endSpan() with duration
Tracer->>RingBuffer: Add completion event
App->>Tracer: trace("notification.send")
Tracer->>RingBuffer: Add event
App->>Tracer: Background agent lifecycle
Tracer->>Tracer: startSpan("subagent.start")
Tracer->>RingBuffer: Add event
loop Every 5 seconds
Tracer->>Tracer: Periodic flush timer
Tracer->>RingBuffer: Copy all events
Tracer->>LogFile: Write JSONL (async)
Tracer->>RingBuffer: Clear buffer
end
Note over App,CrashHandler: Crash Scenario
App->>CrashHandler: uncaughtException/unhandledRejection
CrashHandler->>Tracer: emergencyFlush()
Tracer->>RingBuffer: Add crash marker
Tracer->>LogFile: Synchronous writeFileSync
CrashHandler->>App: Process exits
Note over App,LogFile: Debug Bundle Creation
App->>Tracer: debug-bundle command
Tracer->>LogFile: Read JSONL
Tracer->>Tracer: Parse last 500 events
Tracer->>Tracer: Redact sensitive data
Tracer->>App: Generate diagnostic JSON
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
src/tools/call-omo-agent/tools.ts, line 44-51 (link)logic: span started but never ended - causes memory leak in
activeSpansMap
15 files reviewed, 1 comment
Update: Root Cause DiscoveryAfter further investigation, we discovered that the Windows crash was not caused by Bun's ShellInterpreter GC bug as originally suspected. The Real Root CauseThe crash occurs when oh-my-opencode's session-notification runs alongside external notification plugins (e.g., Evidence:
What This Means for This PRThis debug tracing system is still valuable for future debugging, but:
The Proper FixPR #575 implements the actual fix:
See: #575 and RecommendationThis PR can still be merged if the tracing functionality is desired, but the context in the description should be updated to reflect that it's a general debugging tool, not a fix for the specific crash. |
Summary
Adds a lightweight, off-by-default tracing system to help diagnose Windows crashes that persist after PR #543.
Background
After merging #543 (which fixed the Bun ShellInterpreter GC crash), crashes continued on Windows. Without visibility into what happens before crashes, debugging was difficult. This PR adds structured tracing to capture high-signal events.
Features
OMO_DEBUG=1environment variableUsage
Log file location:
%TEMP%\oh-my-opencode-trace.jsonlFiles Changed
src/shared/debug-tracer/*(5 files)src/cli/debug-bundle/index.tsdocs/DEBUG_TRACING.mdTesting
Related
Summary by cubic
Adds an off-by-default crash tracing system for Windows to capture high-signal events and produce JSONL logs for faster diagnosis of crashes persisting after PR #543. It records key spawn and lifecycle events and sets up crash handlers to preserve trace history.
New Features
Migration
Written for commit f07f100. Summary will update on new commits.