
NemoClaw OpenShell Issue #409: The 120-Second Wall Quietly Breaking Every Long-Lived MCP Integration

Liam McCarthy

Apr 8, 2026

7 min read

OpenShell's egress proxy hardcodes a ~120-second TCP idle timeout. That single number is the difference between NemoClaw 'preview' and 'enterprise-ready' — and almost nobody is talking about it.

120 seconds. That is the number that decides whether NVIDIA NemoClaw goes from "preview" to "enterprise-ready" this quarter — and almost nobody is talking about it yet.

Three weeks after NemoClaw's GTC alpha (March 16, 2026; NC-03) and four weeks after NeMo Agent Toolkit 1.5 shipped (March 12; NC-124), I am watching regulated-industry architects run NemoClaw labs in spite of the "do not use in production" label on the release notes. They are doing it because NAT 1.5 finally gave them the cleanest NVIDIA-blessed path to publish enterprise agent workflows over MCP — first-class A2A authentication, a Safety & Security Engine, and a Dynamo Runtime Intelligence layer (NC-124, NC-125, NC-126, NC-127).

Then they hit Issue #409. And nothing works.

The number that breaks the integration story

OpenShell — the sandbox NemoClaw uses for tool execution — runs an egress proxy in front of every container. That proxy enforces a hardcoded TCP idle timeout of approximately two minutes (NemoClaw GitHub Issue #409, OPEN as of 2026-04-08; NC-133). Any connection that goes idle longer than that gets reset by the proxy. PR #438 is in flight to make the timeout configurable, but it has not merged at the time of writing. Issue #481 corroborates the same failure mode on M4 Apple Silicon (NC-140).

~120s — Hardcoded TCP idle timeout in the OpenShell egress proxy
NemoClaw GitHub Issue #409 (OPEN as of 2026-04-08; NC-133)

Two minutes sounds harmless. It is not. Two minutes is shorter than:

A WebSocket heartbeat interval for Discord, Telegram, or any common chat bridge that wants the NemoClaw sandbox to reach out to a long-lived gateway.
A typical MCP Server-Sent Events (SSE) idle window — the interface most production MCP servers use today.
An A2A streaming session — exactly the protocol NAT 1.5 just shipped first-class authentication for (NC-126).
A WhatsApp Business webhook keepalive.
Most database session pools that were tuned for human-scale latency, not human-scale boredom.

In practice, once your agent inside NemoClaw stops sending on a connection for about two minutes (say, while it is thinking, or waiting for a slow upstream), the egress proxy resets the socket and the integration silently breaks. You see partial output. You see retries. You see nothing at all. You do not see a stack trace that screams "two-minute idle timeout in the egress proxy." You file a ticket with NVIDIA, who very politely points you at Issue #409.

The canonical reproduction is a few lines of Python:

import asyncio
import websockets

async def main():
    # ping_interval=None turns off the library's automatic pings so the socket really goes idle
    async with websockets.connect("ws://your-host:8765", ping_interval=None) as ws:
        await ws.send("hello")
        await asyncio.sleep(300)       # idle for 5 minutes
        await ws.send("still-alive?")  # fails with websockets.exceptions.ConnectionClosed once the proxy has reset the socket

asyncio.run(main())

Run that from inside an OpenShell container against any host you control. The kill lands somewhere between 110 and 140 seconds. That is your fingerprint for Issue #409.
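
To see where in that window the kill actually lands, it may help to time the reset rather than sleep for a fixed five minutes. The following is a minimal sketch using only the standard library. The names `measure_idle_kill` and `matches_issue_409` are illustrative, the 110 to 140 second window is the one quoted above, and the sketch assumes the target host would otherwise leave an idle connection open.

```python
import asyncio
import time
from typing import Optional

# Kill window quoted in the article for Issue #409, in seconds.
KILL_WINDOW = (110.0, 140.0)


async def _wait_for_close(reader: asyncio.StreamReader) -> None:
    # read() returns b"" once the peer (or a proxy in between) closes the stream.
    while await reader.read(1024):
        pass


async def measure_idle_kill(host: str, port: int, max_wait: float = 300.0) -> Optional[float]:
    """Send one line, go idle, and return the seconds until the connection dies.

    Returns None if the connection is still open after max_wait seconds.
    """
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"hello\n")
    await writer.drain()
    start = time.monotonic()
    try:
        await asyncio.wait_for(_wait_for_close(reader), timeout=max_wait)
    except asyncio.TimeoutError:
        return None  # survived the whole observation period
    except ConnectionError:
        pass  # a TCP reset surfaces here as ConnectionResetError
    finally:
        writer.close()
    return time.monotonic() - start


def matches_issue_409(elapsed: Optional[float]) -> bool:
    """True when the measured kill time falls inside the quoted window."""
    return elapsed is not None and KILL_WINDOW[0] <= elapsed <= KILL_WINDOW[1]


# Hypothetical usage from inside an OpenShell container:
#   elapsed = asyncio.run(measure_idle_kill("your-host", 8765))
#   print(elapsed, matches_issue_409(elapsed))
```

Running the same measurement from outside the sandbox against the same host gives a baseline: if the connection survives there and dies inside the window here, the proxy is the likelier suspect.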

110–140s — Measured kill window for idle WebSocket repro inside OpenShell
Reality ADAS-Evolved internal measurement, 2026-04-08

If the kill lands between 110 and 140 seconds, you are looking at Issue #409 — not a network bug.

Why this matters more than another CVE roundup

I have spent the last week reading Reality's security advisories. Q1 2026 has been the loudest CVE quarter in the agent ecosystem's short history: a CVSS 9.6 RCE in mcp-remote, a package downloaded ~500K times (CVE-2025-6514, NC-128); a chained RCE in Anthropic's mcp-server-git (CVE-2025-68143/68144/68145, NC-129); a DNS rebinding exposure in the MCP Go SDK's defaults (CVE-2026-34742, patched in 1.4.0, NC-130); a path traversal in the Claude SDK Memory Tool (CVE-2026-34451, NC-131); a settings.json sandbox escape in Claude Code (CVE-2026-25725, NC-132); plus an open path traversal in Anthropic's Filesystem MCP server (NC-139). On top of that, ClawMoat's analysis still puts 82% of MCP servers in the wild vulnerable to path traversal (NC-122 carry-forward), and Koi Security's ClawHavoc tracker is up to 1,184 malicious skills (NC-123 carry-forward).

~500K — mcp-remote downloads (CVSS 9.6 RCE — CVE-2025-6514)
NC-128

82% — MCP servers in the wild vulnerable to path traversal (ClawMoat)
NC-122 (carry-forward)

1,184 — Known malicious agent skills (Koi Security ClawHavoc tracker)
NC-123 (carry-forward)

That is the loud half of the story. The quiet half is operational. CVEs get patched on a Tuesday. A 120-second timeout in your sandbox proxy is not a CVE. It is not in the National Vulnerability Database. It will not show up in any threat feed your CISO subscribes to. It will simply make your enterprise NemoClaw pilot fail in front of a procurement committee, and you will not know why.

This is the part of the agent security story that the analyst firms are not yet covering. Gartner's first-ever Market Guide for Guardian Agents (G00836388, Feb 25, 2026; TL-FRESH-07) projects guardian agents will capture 10–15% of the agentic AI market by 2030, and 17% of CIOs have already deployed AI agents with another 42% planning to within a year (TL-FRESH-08, single-source — directional). But the same Gartner guide projects that through 2028, at least 80% of unauthorized agent transactions will come from internal policy violations, oversharing, and misguided behavior — not external attacks (TL-FRESH-09). The quiet failure modes are the ones that matter.

10–15% — Guardian agents' projected share of agentic AI market by 2030 (Gartner G00836388)
TL-FRESH-07

80% — Unauthorized agent transactions from internal policy violations through 2028 (Gartner)
TL-FRESH-09

Issue #409 is the canonical quiet failure mode for the NemoClaw era.

The NAT 1.5 contrast

NeMo Agent Toolkit 1.5, also shipped in March, gives you the cleanest path NVIDIA has ever published for FastMCP workflow publishing, automatic LangGraph wrapping, and authenticated A2A streaming (NC-124, NC-125, NC-126, NC-127, NC-142). The two pieces are meant to fit together, and today they do not quite. NAT 1.5 says: "Publish your enterprise workflows over MCP and A2A, get auth and runtime intelligence for free." OpenShell says: "Cool, but if any of those connections idle for two minutes I will silently kill them." NVIDIA is shipping the two halves of the same stack out of two different orgs at slightly different speeds. That is a totally normal pattern at NVIDIA's scale, but it means integrators need to do the seam-stitching for at least the next quarter.

This is a 3-to-5 day first-mover window (Reality NemoClaw Early-Adopter Fleet, 2026-04-08). No tracked competitor — Stormap, Repello, Penligent, Apigene, Yotta Labs, GLB GPT, FlowHunt — has published an OpenShell-aware integration audit. Stormap is closest at the architecture/sandbox layer (NC-141) but has not crossed into the integration story yet. Anthropic just mass-shipped 10+ enterprise MCP connectors (Google Drive/Calendar/Gmail/DocuSign/Apollo/Clay/Outreach/SimilarWeb/MSCI/LegalFly, NC-135), which compresses the analyst-firm category-definition deadline by roughly a week. The window is real; it is also short.

Two halves of the same NVIDIA stack shipping out of two orgs at different speeds — integrators own the seam.

What enterprises piloting NemoClaw can do today

You do not have to wait for PR #438 to merge.

First, assume idle is the enemy. Inventory every long-lived connection your agent stack opens through the sandbox: WebSockets (Discord, Telegram, WhatsApp, custom chat bridges), MCP SSE servers, A2A streaming sessions, database connection pools, message-bus subscribers, anything streaming. Anything that does not heartbeat every 60 seconds is a candidate to break.
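
One lightweight way to keep that inventory honest is to hold it as data and flag anything that misses the 60-second rule of thumb. The sketch below is illustrative only: `LongLivedConnection`, `at_risk`, and the example entries are hypothetical, and the heartbeat values would come from your own configuration.

```python
from dataclasses import dataclass
from typing import List, Optional

HEARTBEAT_BUDGET_S = 60.0  # rule of thumb from the paragraph above


@dataclass
class LongLivedConnection:
    name: str
    kind: str                     # e.g. "websocket", "mcp-sse", "a2a", "db-pool"
    heartbeat_s: Optional[float]  # None means no heartbeat is configured


def at_risk(inventory: List[LongLivedConnection],
            budget_s: float = HEARTBEAT_BUDGET_S) -> List[LongLivedConnection]:
    """Return connections with no heartbeat, or one slower than the budget."""
    return [c for c in inventory
            if c.heartbeat_s is None or c.heartbeat_s > budget_s]


# Hypothetical inventory; replace with the connections your stack really opens.
inventory = [
    LongLivedConnection("chat-bridge", "websocket", 45.0),
    LongLivedConnection("docs-mcp", "mcp-sse", None),
    LongLivedConnection("orders-db", "db-pool", 300.0),
]

for conn in at_risk(inventory):
    print(f"candidate to break: {conn.name} ({conn.kind})")
```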

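Second, force keepalives, preferably at the application layer. For WebSockets, set ping intervals to ≤45 seconds. For MCP SSE, send a comment line (: keepalive\n\n) at least every 60 seconds; most MCP server implementations support this. For A2A streams under NAT 1.5, the protocol-level keepalive in the v0.3 spec (Linux Foundation A2A Project, NS-19) runs on a 30-second cadence by default; verify your client respects it. On raw sockets where there is no application protocol to lean on, enable SO_KEEPALIVE and lower the idle time well below two minutes (on Linux the default wait before the first probe is 7,200 seconds, so the flag alone changes nothing). Treat this as a fallback: TCP keepalive probes carry no payload, so confirm in your environment that the proxy counts them as activity.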
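
As a rough illustration of the SSE and raw-socket cases, the sketch below wraps an event stream so that a comment frame goes out whenever the source is silent, and sets TCP keepalive options on a socket. `with_keepalive` and `enable_tcp_keepalive` are hypothetical helper names, not part of any MCP or NAT API. The 45-second defaults are assumptions chosen to sit under the limits above, and the `TCP_KEEP*` options are only applied where the platform exposes them.

```python
import asyncio
import socket
from typing import AsyncIterator

SSE_KEEPALIVE = ": keepalive\n\n"  # SSE comment frame; clients ignore it


async def _next_frame(it: AsyncIterator[str]) -> str:
    return await it.__anext__()


async def with_keepalive(events: AsyncIterator[str],
                         interval: float = 45.0) -> AsyncIterator[str]:
    """Relay frames from `events`, adding a comment frame after `interval` idle seconds."""
    it = events.__aiter__()
    pending = None
    try:
        while True:
            if pending is None:
                pending = asyncio.ensure_future(_next_frame(it))
            done, _ = await asyncio.wait({pending}, timeout=interval)
            if not done:
                yield SSE_KEEPALIVE  # source is quiet; keep the connection warm
                continue
            task, pending = pending, None
            try:
                frame = task.result()
            except StopAsyncIteration:
                return  # source stream finished
            yield frame
    finally:
        if pending is not None:
            pending.cancel()


def enable_tcp_keepalive(sock: socket.socket, idle: int = 45,
                         interval: int = 15, count: int = 4) -> None:
    """Turn on TCP keepalive and, where supported, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):  # present on Linux; absent on some platforms
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

For the `websockets` client used in the reproduction, the equivalent is passing a `ping_interval` of 45 or less to `connect()` instead of `None`.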

Third, push long-running work out of the sandbox. If a workflow needs to wait on a slow upstream, hand the wait off to a runner outside OpenShell, store the correlation token, and re-enter the sandbox only when there is real work to do. NAT 1.5's Dynamo Runtime Intelligence (NC-127) makes this pattern much more pleasant than it was a quarter ago.
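
The hand-off can be sketched as a small token store. This is an in-memory stand-in, assuming the real store and runner live outside OpenShell. `ParkedWaits` and its methods are hypothetical names and do not correspond to any NAT 1.5 or Dynamo API.

```python
import uuid
from typing import Any, Dict, Set, Tuple


class ParkedWaits:
    """Stand-in for a store outside the sandbox that holds correlation tokens."""

    def __init__(self) -> None:
        self._pending: Set[str] = set()
        self._results: Dict[str, Any] = {}

    def park(self) -> str:
        """Called from the sandbox: register a wait and get a token back."""
        token = uuid.uuid4().hex
        self._pending.add(token)
        return token

    def complete(self, token: str, result: Any) -> None:
        """Called by the external runner when the slow upstream finally answers."""
        if token not in self._pending:
            raise KeyError(f"unknown or already completed token: {token}")
        self._pending.discard(token)
        self._results[token] = result

    def collect(self, token: str) -> Tuple[bool, Any]:
        """Called from the sandbox in a short request: (ready, result)."""
        if token in self._results:
            return True, self._results.pop(token)
        return False, None


# Hypothetical flow: the sandbox parks the wait and closes its connection,
# the runner completes it later, and the sandbox collects in a short call.
waits = ParkedWaits()
token = waits.park()
waits.complete(token, {"status": "upstream finished"})
ready, result = waits.collect(token)
```

The point of the design is that no socket crossing the egress proxy has to stay open while the upstream is slow; each sandbox call is short.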

Fourth, instrument the egress path. If you cannot see the proxy's reset events, you cannot debug them. ADAS-Evolved's standard NemoClaw audit profile installs a tcpdump-on-egress sidecar by default, plus a Prometheus counter on egress_proxy_idle_resets_total — the same counter that PR #438 will surface upstream once it merges. We have read the OpenShell egress proxy source and the timeout is in the connection-handling path, but we have deliberately not committed a fork patch to Reality's customer-facing tooling until we re-verify against current OpenShell main.
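
Until the proxy exposes its own counter, a client-side approximation is possible: count connection failures whose idle time before the failure falls in the 110 to 140 second window. The sketch below is an assumption-laden stand-in, not the ADAS-Evolved sidecar or the PR #438 implementation. `IdleResetCounter` is a hypothetical class that reuses the metric name mentioned above and renders it in the Prometheus text format.

```python
import threading
from typing import Tuple


class IdleResetCounter:
    """Counts failures that look like the ~120 s idle reset, seen from the client."""

    def __init__(self, name: str = "egress_proxy_idle_resets_total",
                 window: Tuple[float, float] = (110.0, 140.0)) -> None:
        self.name = name
        self.window = window
        self._value = 0
        self._lock = threading.Lock()

    def observe_failure(self, idle_seconds: float) -> bool:
        """Record a failed connection; returns True if it matched the window."""
        low, high = self.window
        suspected = low <= idle_seconds <= high
        if suspected:
            with self._lock:
                self._value += 1
        return suspected

    def render(self) -> str:
        """Render the counter in Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return f"# TYPE {self.name} counter\n{self.name} {value}\n"


# Hypothetical usage: call observe_failure() from your reconnect handler with
# the seconds since the last byte was sent or received on that connection.
counter = IdleResetCounter()
counter.observe_failure(125.0)  # looks like the idle reset
counter.observe_failure(3.0)    # some other failure; not counted
print(counter.render())
```

A client-side count is a heuristic: it can miss resets on connections nobody touches again, and it can miscount unrelated failures that happen to land in the window.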

Fifth — and this is where Reality is investing — script the audit. Reality's NemoClaw Integration Audit is a $25K fixed-fee, two-week engagement that runs the inventory, the timeout reproductions, and the keepalive remediation across a customer's full agent stack, with a published runbook the customer keeps. The audit is built on ADAS-Evolved and the Microsoft Agent Governance Toolkit primitives (NS-18). Email lm@aireality.io if you want to be one of the next three design partners.

Five moves — inventory, keepalives, externalize waits, instrument egress, script the audit. None require PR #438 to merge.

The bigger pattern

Issue #409 is not really about a 120-second timeout. It is about the fact that the agent stack is finally mature enough that the failures are getting boring. Boring is good. Boring means the conversation has moved on from "can agents do anything?" to "can agents stay connected for the length of a real customer interaction?" Boring is what comes right before production.

If you are a builder operating an agent fleet today, my advice is to spend the next 30 days mapping every long-lived connection you depend on, every silent reset you cannot explain, every retry loop your team has learned to ignore. That map is the operational moat. That map is what the analyst firms will reward in 2027. That map is what gets you out of "preview" and into "running this for a customer who pays us."

Reality is publishing the formal NemoClaw WebSocket integration audit later this week. If you want the pre-print, the audit script, or a 30-minute walkthrough of how to run it against your own pilot, reach out: lm@aireality.io.

Boring is the new fast.

Liam McCarthy is the founder of Reality (aireality.io), an AI agency for SMBs and enterprise security teams, and the maintainer of ADAS-Evolved, a self-evolving multi-agent fleet framework launching open-source April 1, 2026.
