fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (fixes #2457) #2458
Conversation
I'm not sure I understand the fix. Isn't the LSP toolset startable? Wouldn't re-starting the LSP toolset be enough?
Good question — the LSP toolset is Startable, but there are two separate problems that a naive re-start doesn't cover, plus a timing issue:
1. When the command is wrapped in `env` (the `env -C <dir> gopls` pattern), the toolset's start logic doesn't see the real server command, so a subsequent call to `Start()` doesn't bring the server back up correctly.
2. The watcher race: `Toolset.Start()` spawns `go watchConnection(...)` unconditionally, so re-starting during reconnect backoff would create a second watcher. The fix adds a `watcherAlive` guard; the new watcher goroutine clears the flag on all exit paths.
3. The reprobe loop ties it together within a single turn. Even with the above, without the reprobe the agent would only see new tools at the start of the next user turn (when `getTools()` runs again).
simonferquel
left a comment
With #2460 about to get merged, the env gopls thing is moot. This PR should not try to cover this pattern anymore
Done — removed everything that was specifically for the env-wrapped `gopls` pattern.
Retained (the actual fix from this PR): the reprobe loop after each tool-call batch, the `watcherAlive` double-watcher guard, and the deduplicated failure/recovery warnings.
use the proper notification component when surfacing things to the TUI. A warning one in this case might be appropriate |
Done — dropped the separate event type; all toolset lifecycle messages now surface via `WarningEvent` → `notification.WarningCmd()`, which produces persistent TUI notifications that stay until dismissed. The key distinction: both failure warnings and recovery notices use the warning notification component rather than transient status text. Changes: `pkg/runtime/event.go`, `pkg/tui/page/chat/runtime_events.go`.
  // Cancel the process-lifetime context to stop the readNotifications
- // goroutine and (if the process didn't exit cleanly) kill the process.
+ // goroutine and force-kill the process if it hasn't exited cleanly.
NIT: revert that comment-only change
Done — lsp.go reverted to upstream/main in commit 65d42382. Both lsp.go and lsp_test.go are now identical to upstream.
t.Parallel()
...
tool := NewLSPTool("gopls", nil, nil, "/tmp")
tools, err := tool.Tools(t.Context())
all these file changes should be reverted
Done — lsp_test.go fully reverted to upstream/main in commit 65d42382. No changes from this PR remain in pkg/tools/builtin/.
// Use WithoutCancel so the watcher outlives the caller's context;
// the only way to stop it is via Stop() setting ts.stopping.
go ts.watchConnection(context.WithoutCancel(ctx))
// Spawn the connection watcher only when no watcher is already running.
are these file changes still needed with the env -C concern removed?
Yes — this change is independent of the env -C concern and is still needed.
The watcherAlive flag fixes a race introduced by the reprobe loop itself:
- `watchConnection` sets `ts.started = false` at the top of its reconnect iteration (before sleeping its backoff), while the goroutine is still alive.
- If `reprobe()` fires during that window, `StartableToolSet.Start()` sees `started == false` and calls `Toolset.Start()`.
- Without the guard, `Toolset.Start()` would unconditionally spawn a second `go watchConnection(...)`, resulting in two goroutines racing to call `doStart()` and unsafely close/recreate `ts.restarted`.
The fix: Toolset.Start() only spawns watchConnection when !watcherAlive. The goroutine clears the flag on all exit paths via defer. While an existing watcher is mid-backoff, reprobe skips the spawn. Once the watcher truly gives up, the next reprobe starts a fresh one.
// WorkingDir returns the working directory of the LSP server process.
// This is intended for testing only.
func (t *LSPTool) WorkingDir() string {
if it is test-only, it should be in lsp_test.go
After rebasing onto upstream/main (which now includes PR #2460 that added WorkingDir()), our branch has no diff in lsp.go or lsp_test.go — both files are identical to upstream. Nothing to move.
… turn

When an MCP server is configured but unavailable at session start, the agent now retries automatically after every tool-call batch — within the same user turn — without requiring a new user message.

## Core changes

### MCP double-watcher race (pkg/tools/mcp/mcp.go)

Added `watcherAlive bool` to `Toolset`. `Toolset.Start()` only spawns `go watchConnection(...)` when `!watcherAlive`. The goroutine clears the flag on all exit paths via `defer`. This prevents `reprobe()` from spawning a second watcher while an existing one is mid-backoff (`ts.started == false` but goroutine alive), which would cause racing `doStart()` calls and unsafe close/recreate of `ts.restarted`.

### Failure deduplication + recovery notices (pkg/tools/startable.go)

`ShouldReportFailure()` returns true exactly once per failure streak, suppressing repeated "start failed" warnings on every retry. `ConsumeRecovery()` returns true exactly once when a previously-failed toolset successfully starts, triggering a "now available" warning. Both surface via `WarningEvent` → `notification.WarningCmd()` (persistent TUI notifications that stay until dismissed).

### Reprobe after each tool-call batch (pkg/runtime/loop.go)

`reprobe()` is called after every tool-call batch. It re-runs `ensureToolSetsAreStarted()` without emitting `MCPInitStarted`/`Finished` events (no TUI spinner flicker), emits any pending warnings, and emits a `ToolsetInfo` event when new tools appear. The updated tool list is picked up by the top-of-loop `getTools()` on the next iteration, so the model sees new tools in its very next response within the same user turn.

### TUI (pkg/agent/agent.go, pkg/runtime/loop.go, pkg/runtime/event.go)

`DrainWarnings()` now includes both failure and recovery messages. `WarningEvent` is used for all toolset lifecycle notifications.
## Tests

- `pkg/tools/startable_test.go`: `ShouldReportFailure`/`ConsumeRecovery` behaviour (one warning per streak, recovery fires once, Stop resets)
- `pkg/agent/agent_test.go`: `TestAgentReProbeEmitsWarningThenNotice`, `TestAgentNoDuplicateStartWarnings`
- `pkg/runtime/runtime_test.go`: `TestReprobe_NewToolsAvailableAfterToolCall`, `TestReprobe_NoChangeMeansNoExtraEvents`

Fixes docker#2457

Assisted-By: docker-agent
Force-pushed 65d4238 to 0176f95
simonferquel
left a comment
Approving my own agent's work doesn't feel like enough. cc @krissetto for another approval :)
Summary
Fixes #2457.
When an MCP server is configured but not available at session start, the agent now retries within the same user turn — after each tool-call batch — without requiring a new user message.
Root cause and fix
MCP double-watcher race (B1)
`Toolset.Start()` previously spawned `go watchConnection(...)` unconditionally. With the reprobe loop calling `Toolset.Start()` during reconnect backoff (when `ts.started == false` but the goroutine is alive), a second watcher goroutine could be spawned, causing racing `doStart()` calls and unsafe close/recreate of `ts.restarted`.

Fix: added `watcherAlive bool` to `Toolset`; `Start()` only spawns the watcher when `!watcherAlive`, and the goroutine clears the flag on all exit paths via `defer`.

Reprobe after each tool-call batch
Added `reprobe()`, called after every tool-call batch in the runtime loop. It re-runs `ensureToolSetsAreStarted()` without emitting `MCPInitStarted`/`Finished` events (no TUI spinner flicker). If new tools appear (by name-set diff), it emits a `ToolsetInfo` update immediately. The updated tool list is picked up by the next iteration's `getTools()`, so the model sees new tools within the same user turn, without a new user message.

Deduplicated warnings + recovery notices
`StartableToolSet` tracks failure streaks with `ShouldReportFailure()` (emits exactly one warning per streak) and `ConsumeRecovery()` (emits exactly one "now available" warning when a toolset recovers). Both surface via `WarningEvent` → `notification.WarningCmd()`: persistent TUI notifications that stay until dismissed.

Changes
- `pkg/tools/mcp/mcp.go`: `watcherAlive` flag; gated watcher spawn; `watchConnection` clears flag on exit
- `pkg/tools/startable.go`: `ShouldReportFailure()`, `ConsumeRecovery()`, failure-streak tracking fields
- `pkg/agent/agent.go`: `ensureToolSetsAreStarted` uses dedup methods; `DrainWarnings` includes recovery notices
- `pkg/runtime/loop.go`: `reprobe()` after each tool-call batch; `emitLifecycleEvents` param on `getTools()`
- `pkg/runtime/event.go`: `WarningEvent` used for both failures and recovery
- `pkg/tui/page/chat/runtime_events.go`: `WarningEvent` → `notification.WarningCmd()`

Tests
- `pkg/tools/startable_test.go`: `ShouldReportFailure`/`ConsumeRecovery` behaviour (one warning per streak, recovery fires once)
- `pkg/agent/agent_test.go`: `TestAgentReProbeEmitsWarningThenNotice`, `TestAgentNoDuplicateStartWarnings`
- `pkg/runtime/runtime_test.go`: `TestReprobe_NewToolsAvailableAfterToolCall`, `TestReprobe_NoChangeMeansNoExtraEvents`
- tmux e2e (MCP, validated locally)
Single user turn: agent runs `touch /flag` → reprobe fires → next model response uses new MCP tool natively.