
fix: retry LSP/MCP toolsets after tool calls, covering env-wrapped commands (fixes #2457) #2458

Merged
simonferquel merged 1 commit into docker:main from
simonferquel-clanker:fix/retry-missing-lsp-mcp-2457
Apr 20, 2026

Conversation

Contributor

simonferquel-clanker commented Apr 17, 2026

Summary

Fixes #2457.

When an MCP server is configured but not available at session start, the agent now retries within the same user turn — after each tool-call batch — without requiring a new user message.

Root cause and fix

MCP double-watcher race (B1)

Toolset.Start() previously spawned go watchConnection(...) unconditionally. With the reprobe loop calling Toolset.Start() during reconnect backoff (when ts.started==false but the goroutine is alive), a second watcher goroutine could be spawned, causing racing doStart() calls and unsafe close/recreate of ts.restarted.

Fix: added watcherAlive bool to Toolset. Start() only spawns the watcher when !watcherAlive; the goroutine clears the flag on all exit paths via defer.
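A minimal sketch of the guard described above, assuming simplified `Toolset` internals (the `stop` channel and `spawned` counter are stand-ins for illustration, not the real fields):

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative sketch of the watcherAlive guard. Field names follow the
// PR description; everything else is simplified.
type Toolset struct {
	mu           sync.Mutex
	watcherAlive bool
	spawned      int           // counts watcher goroutines, for the demo
	stop         chan struct{} // keeps the watcher alive until stopped
}

func (ts *Toolset) Start() {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if ts.watcherAlive {
		return // an existing watcher may be mid-backoff; don't double-spawn
	}
	ts.watcherAlive = true
	ts.spawned++
	go ts.watchConnection()
}

func (ts *Toolset) watchConnection() {
	defer func() { // clear the flag on every exit path
		ts.mu.Lock()
		ts.watcherAlive = false
		ts.mu.Unlock()
	}()
	<-ts.stop // reconnect/backoff loop elided
}

func main() {
	ts := &Toolset{stop: make(chan struct{})}
	ts.Start()
	ts.Start() // reprobe during backoff: no second watcher spawned
	fmt.Println(ts.spawned)
}
```

Because the flag is set synchronously under the mutex before `Start()` returns, a reprobe that races with the watcher's backoff window can never spawn a second goroutine.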

Reprobe after each tool-call batch

Added reprobe() called after every tool-call batch in the runtime loop. It re-runs ensureToolSetsAreStarted() without emitting MCPInitStarted/Finished events (no TUI spinner flicker). If new tools appear (by name-set diff), it emits a ToolsetInfo update immediately. The updated tool list is picked up by the next iteration's getTools() — so the model sees new tools within the same user turn, without a new user message.
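The name-set diff can be sketched as follows; `newToolNames` is a hypothetical helper for illustration, not the PR's actual identifier:

```go
package main

import "fmt"

// newToolNames returns the tool names present in after but not in
// before. reprobe() emits a ToolsetInfo update only when this diff is
// non-empty, so unchanged toolsets produce no extra events.
func newToolNames(before, after []string) []string {
	seen := make(map[string]bool, len(before))
	for _, n := range before {
		seen[n] = true
	}
	var added []string
	for _, n := range after {
		if !seen[n] {
			added = append(added, n)
		}
	}
	return added
}

func main() {
	before := []string{"shell", "edit"}
	after := []string{"shell", "edit", "lsp_definition"}
	fmt.Println(newToolNames(before, after)) // [lsp_definition]
}
```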

Deduplicated warnings + recovery notices

StartableToolSet tracks failure streaks with ShouldReportFailure() (emits exactly one warning per streak) and ConsumeRecovery() (emits exactly one "now available" warning when a toolset recovers). Both surface via WarningEvent -> notification.WarningCmd() — persistent TUI notifications that stay until dismissed.
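A minimal sketch of the streak-based deduplication, assuming just two boolean fields (the real StartableToolSet tracks more state):

```go
package main

import "fmt"

// Illustrative streak tracking: one warning per failure streak, one
// recovery notice when the streak ends.
type StartableToolSet struct {
	failing  bool // currently inside a failure streak
	reported bool // warning already emitted for this streak
}

// ShouldReportFailure returns true exactly once per failure streak,
// suppressing repeated warnings on every retry.
func (s *StartableToolSet) ShouldReportFailure() bool {
	s.failing = true
	if s.reported {
		return false
	}
	s.reported = true
	return true
}

// ConsumeRecovery returns true exactly once when a previously failing
// toolset comes back, and resets the streak state.
func (s *StartableToolSet) ConsumeRecovery() bool {
	if !s.failing {
		return false
	}
	s.failing = false
	s.reported = false
	return true
}

func main() {
	var s StartableToolSet
	fmt.Println(s.ShouldReportFailure()) // true: first failure warns
	fmt.Println(s.ShouldReportFailure()) // false: same streak, suppressed
	fmt.Println(s.ConsumeRecovery())     // true: recovery fires once
	fmt.Println(s.ConsumeRecovery())     // false: already recovered
}
```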

Changes

| File | What changed |
| --- | --- |
| pkg/tools/mcp/mcp.go | watcherAlive flag; gated watcher spawn; watchConnection clears flag on exit |
| pkg/tools/startable.go | ShouldReportFailure(), ConsumeRecovery(), failure-streak tracking fields |
| pkg/agent/agent.go | ensureToolSetsAreStarted uses dedup methods; DrainWarnings includes recovery notices |
| pkg/runtime/loop.go | reprobe() after each tool-call batch; emitLifecycleEvents param on getTools() |
| pkg/runtime/event.go | WarningEvent used for both failures and recovery |
| pkg/tui/page/chat/runtime_events.go | WarningEvent -> notification.WarningCmd() wiring |

Tests

  • pkg/tools/startable_test.go: ShouldReportFailure/ConsumeRecovery behaviour (one warning per streak, recovery fires once)
  • pkg/agent/agent_test.go: TestAgentReProbeEmitsWarningThenNotice, TestAgentNoDuplicateStartWarnings
  • pkg/runtime/runtime_test.go: TestReprobe_NewToolsAvailableAfterToolCall, TestReprobe_NoChangeMeansNoExtraEvents

tmux e2e (MCP, validated locally)

Single user turn: agent runs touch /flag → reprobe fires → next model response uses new MCP tool natively.

✓ Shell  touch /tmp/e2e-test/mcp-ready
✓ hello  name=World  Hello, World! MCP is working.
Tools sidebar: 6 tools available  ← was 5

simonferquel-clanker requested a review from a team as a code owner on April 17, 2026 13:13
Member

rumpl commented Apr 17, 2026

I'm not sure I understand the fix. Isn't the LSP toolset Startable? Wouldn't re-starting the LSP toolset be enough?

@simonferquel-clanker
Contributor Author

Good question — the LSP toolset is Startable, but there are two separate problems that a naive re-start doesn't cover:

1. env -C ./backend gopls looks like a successful start

When the command is env -C ./backend gopls, exec.Command("env", ...).Start() always returns nil, because env itself is always present. The fact that env then immediately runs gopls, finds it missing, and exits 127 is invisible to cmd.Start(). By the time we return from LSPTool.Start() / startLocked(), h.cmd has been set and the call has returned nil, so StartableToolSet sets started = true and considers the toolset live.

A subsequent call to Start() hits the if s.started { return nil } short-circuit and never retries — even though the underlying process died in milliseconds.

2. The HealthChecker interface closes that gap

The fix adds a HealthChecker interface (Healthy() bool) that StartableToolSet checks on every Start() call even when started == true. LSPTool.Healthy() returns h.cmd != nil.

The new watchExit goroutine is the mechanism that transitions h.cmd from non-nil to nil: it is the sole owner of cmd.Wait(), blocks until the process exits, then resets h.cmd = nil. So after env gopls exits 127, Healthy() returns false, StartableToolSet resets started = false, and the next Start() call actually retries — which will succeed once gopls has been installed.
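A sketch of the shape described above (this HealthChecker machinery was later dropped from the PR), with a placeholder standing in for *exec.Cmd:

```go
package main

import (
	"fmt"
	"sync"
)

// HealthChecker is the interface StartableToolSet would check on every
// Start() call, even when started == true.
type HealthChecker interface{ Healthy() bool }

// Simplified stand-in for LSPTool: cmd is non-nil while the server
// process is believed to be running.
type LSPTool struct {
	mu  sync.Mutex
	cmd *struct{} // placeholder for *exec.Cmd
}

func (h *LSPTool) Healthy() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.cmd != nil
}

// watchExit is the sole owner of the exit transition: it blocks until
// the process exits, then resets cmd to nil so Healthy() flips false.
func (h *LSPTool) watchExit(exited <-chan struct{}) {
	<-exited // stands in for cmd.Wait()
	h.mu.Lock()
	h.cmd = nil
	h.mu.Unlock()
}

func main() {
	h := &LSPTool{cmd: &struct{}{}}
	fmt.Println(h.Healthy()) // true while "running"

	exited := make(chan struct{})
	done := make(chan struct{})
	go func() { h.watchExit(exited); close(done) }()
	close(exited) // simulate env/gopls exiting with 127
	<-done

	fmt.Println(h.Healthy()) // false: the next Start() will retry
}
```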

3. The reprobe loop ties it together within a single turn

Even with the above, without the reprobe the agent would only see new tools at the start of the next user turn (when getTools() is called at the top of the run loop). The reprobe() call after each tool-call batch re-runs ensureToolSetsAreStarted immediately, so if a shell tool call installed gopls, the very next model response in the same turn already has the LSP tools in its available-tool list — no new user message needed.

Contributor

simonferquel left a comment


With #2460 about to get merged, the env gopls thing is moot. This PR should not try to cover this pattern anymore

@simonferquel-clanker
Contributor Author

Done — removed in commit 0c65de91.

Removed (everything that was specifically for the env -C wrapper pattern):

  • HealthChecker interface and the health-check block in StartableToolSet.Start()
  • LSPTool.Healthy(), watchExit goroutine, dead atomic.Bool, exited chan
  • stopAndWait reverted to stopLocked with direct cmd.Wait()
  • Toolset.Healthy() on the MCP side (no longer needed without the interface)
  • All related tests (TestLSPTool_Healthy_*, TestLSPTool_StartableToolSet_RetriesAfterProcessExits, TestStartableToolSet_HealthChecker_*)

Retained (the actual fix from this PR):

  • watcherAlive flag in MCP Toolset + gated watcher spawn (B1: prevents double watchConnection goroutine on reprobe)
  • reprobe() in the runtime loop (fires after each tool-call batch, no MCPInit* spinner noise)
  • ShouldReportFailure / ConsumeRecovery deduplication and recovery notices
  • All agent-level and runtime-level reprobe tests

simonferquel self-requested a review on April 20, 2026 11:00
@krissetto
Contributor

Use the proper notification component when surfacing things to the TUI. A warning one might be appropriate in this case.

@simonferquel-clanker
Contributor Author

Done in commit ad1b1734.

Dropped the separate NoticeEvent/InfoCmd path entirely. Both start-failure warnings and toolset recovery messages now go through the existing WarningEvent -> notification.WarningCmd() path.

The key distinction: TypeInfo notifications auto-hide after 3 seconds; TypeWarning notifications are persistent and stay until the user dismisses them — which is the right behaviour for toolset availability state changes that the user should not miss.

Changes:

  • NoticeEvent, Notice(), formatToolNotice removed from runtime
  • DrainNotices, addToolNotice, pendingNotices removed from agent
  • ConsumeRecovery() now calls addToolWarning("X is now available")
  • emitAgentWarnings() simplified back to draining warnings only
  • NoticeEvent TUI handler and "notice" client type registration removed

Comment thread pkg/tools/builtin/lsp.go Outdated

// Cancel the process-lifetime context to stop the readNotifications
// goroutine and (if the process didn't exit cleanly) kill the process.
// goroutine and force-kill the process if it hasn't exited cleanly.
Contributor


NIT: revert that comment only change

Contributor Author


Done — lsp.go reverted to upstream/main in commit 65d42382. Both lsp.go and lsp_test.go are now identical to upstream.

t.Parallel()

tool := NewLSPTool("gopls", nil, nil, "/tmp")
tools, err := tool.Tools(t.Context())
Contributor


all these files changes should be reverted

Contributor Author


Done — lsp_test.go fully reverted to upstream/main in commit 65d42382. No changes from this PR remain in pkg/tools/builtin/.

Comment thread pkg/tools/mcp/mcp.go
// Use WithoutCancel so the watcher outlives the caller's context;
// the only way to stop it is via Stop() setting ts.stopping.
go ts.watchConnection(context.WithoutCancel(ctx))
// Spawn the connection watcher only when no watcher is already running.
Contributor


is this file changes still needed with the env -C concern removal ?

Contributor Author


Yes — this change is independent of the env -C concern and is still needed.

The watcherAlive flag fixes a race introduced by the reprobe loop itself:

  1. watchConnection sets ts.started = false at the top of its reconnect iteration (before sleeping its backoff), while the goroutine is still alive.
  2. If reprobe() fires during that window, StartableToolSet.Start() sees started == false and calls Toolset.Start().
  3. Without the guard, Toolset.Start() would unconditionally spawn a second go watchConnection(...), resulting in two goroutines racing to call doStart() and unsafely close/recreate ts.restarted.

The fix: Toolset.Start() only spawns watchConnection when !watcherAlive. The goroutine clears the flag on all exit paths via defer. While an existing watcher is mid-backoff, reprobe skips the spawn. Once the watcher truly gives up, the next reprobe starts a fresh one.

Comment thread pkg/tools/builtin/lsp.go

// WorkingDir returns the working directory of the LSP server process.
// This is intended for testing only.
func (t *LSPTool) WorkingDir() string {
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is test only, it should be in lsp_test.go

Contributor Author


After rebasing onto upstream/main (which now includes PR #2460 that added WorkingDir()), our branch has no diff in lsp.go or lsp_test.go — both files are identical to upstream. Nothing to move.

… turn

When an MCP server is configured but unavailable at session start, the
agent now retries automatically after every tool-call batch — within the
same user turn — without requiring a new user message.

## Core changes

### MCP double-watcher race (pkg/tools/mcp/mcp.go)
Added watcherAlive bool to Toolset. Toolset.Start() only spawns
go watchConnection(...) when !watcherAlive. The goroutine clears the
flag on all exit paths via defer. This prevents reprobe() from spawning
a second watcher while an existing one is mid-backoff (ts.started==false
but goroutine alive), which would cause racing doStart() calls and unsafe
close/recreate of ts.restarted.

### Failure deduplication + recovery notices (pkg/tools/startable.go)
ShouldReportFailure() returns true exactly once per failure streak,
suppressing repeated 'start failed' warnings on every retry.
ConsumeRecovery() returns true exactly once when a previously-failed
toolset successfully starts, triggering a 'now available' warning.
Both surface via WarningEvent -> notification.WarningCmd() (persistent
TUI notifications that stay until dismissed).

### Reprobe after each tool-call batch (pkg/runtime/loop.go)
reprobe() is called after every tool-call batch. It re-runs
ensureToolSetsAreStarted() without emitting MCPInitStarted/Finished
events (no TUI spinner flicker), emits any pending warnings, and emits
a ToolsetInfo event when new tools appear. The updated tool list is
picked up by the top-of-loop getTools() on the next iteration, so the
model sees new tools in its very next response within the same user turn.

### TUI (pkg/agent/agent.go, pkg/runtime/loop.go, pkg/runtime/event.go)
DrainWarnings() now includes both failure and recovery messages.
WarningEvent used for all toolset lifecycle notifications.

## Tests
- pkg/tools/startable_test.go: ShouldReportFailure/ConsumeRecovery
  behaviour (one warning per streak, recovery fires once, Stop resets)
- pkg/agent/agent_test.go: TestAgentReProbeEmitsWarningThenNotice,
  TestAgentNoDuplicateStartWarnings
- pkg/runtime/runtime_test.go: TestReprobe_NewToolsAvailableAfterToolCall,
  TestReprobe_NoChangeMeansNoExtraEvents

Fixes docker#2457

Assisted-By: docker-agent
simonferquel-clanker force-pushed the fix/retry-missing-lsp-mcp-2457 branch from 65d4238 to 0176f95 on April 20, 2026 11:37
Contributor

simonferquel left a comment


Approving my own agent feels not enough. cc @krissetto for another approve :)

simonferquel merged commit f41413d into docker:main on Apr 20, 2026
5 checks passed
Linked issue: Missing LSPs/MCPs should be retried on next turn / after tool calls, with AGENTS.md install instructions (#2457)