Last week, a platform team copied the exact Supervisor + Worker shape from our multi-agent pipeline architecture write-up and deployed it on a Linux VPS in one long night. The graph looked correct, prompts were tuned, and the demo passed. Then the host rebooted after kernel updates. Slack still received events, but approvals vanished, old checkpoint branches resumed by mistake, and two Cron-triggered workers replayed stale tasks. The common root cause was not model quality. They had packed LangGraph orchestration and OpenClaw Gateway execution into one fragile terminal session and treated production as a bigger localhost.
The practical lesson is simple: multi-agent in production is not "run more Python processes". Responsibilities must split cleanly. LangGraph decides state transitions, node routing, checkpoint persistence, and human interrupt resumption. OpenClaw Gateway decides channel ingress, schedules, policy boundaries, and long-running operator visibility. This guide is intentionally operational: decision matrix, minimal topology, handoff contract, systemd baseline, and an L0-L3 troubleshooting path you can use during incidents. If you need Gateway runbook depth, start with OpenClaw Linux production troubleshooting FAQ; for budget guardrails and token accounting, pair this with the AI agent cost ledger article.
Why VPS deployments must split orchestration and execution
A single process can chain Planner -> Coder -> Reviewer for local demos. On a Linux VPS serving real users, that pattern breaks quickly. You will face operational pressure from four directions at once:
- Many ingress types - slash commands, direct chat, webhook callbacks, and scheduled jobs. Hard-coding all entry logic in a graph node becomes brittle and unreviewable.
- Durable pauses - approvals can wait hours. If checkpoint state is process-local or ephemeral, restart events silently corrupt human-in-the-loop workflows.
- Privilege boundaries - the gateway handling public messaging should not hold full MCP credentials for deployment and data-plane writes.
- Observable failure domains - channel auth and TLS errors belong to Gateway; dead edges and runaway loops belong to LangGraph. Interleaving both layers in one log stream slows triage and inflates mean time to recovery.
Our default production recommendation is stable across team sizes: run LangGraph on a private listener and expose only a trusted API surface; run OpenClaw Gateway as a separate service to translate inbound events into controlled graph invocations keyed by thread_id. This is not tool rivalry. It is layered system design, similar to scheduler vs runner in CI platforms. Once the boundary is explicit, ownership is clearer, security review is simpler, and rollback scope is dramatically smaller.
openclaw logs. Do not re-implement Supervisor logic in Gateway templates; that usually creates a second hidden planner that cannot be tested or retired.
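One way to keep graph invocations "keyed by thread_id" is to derive the key deterministically from channel identifiers, so a restart on either side maps the same conversation to the same checkpoint lineage. The sketch below is an illustration only: the function name `derive_thread_id` and the three input fields are assumptions, not part of the LangGraph or OpenClaw APIs.

```python
import hashlib


def derive_thread_id(channel: str, workspace: str, conversation: str) -> str:
    """Map a channel conversation to a stable thread_id (hypothetical helper).

    The same inputs always yield the same id, so a Gateway restart cannot
    fork a conversation into a new checkpoint lineage.
    """
    parts = (channel, workspace, conversation)
    if not all(parts):
        raise ValueError("channel, workspace, and conversation are all required")
    # The unit separator keeps ("a", "bc") and ("ab", "c") from colliding.
    raw = "\x1f".join(parts)
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    # A readable channel prefix makes ids easier to spot in logs.
    return f"{channel}-{digest[:32]}"
```

Because the id is a pure function of its inputs, Gateway does not need to store a mapping table that could be lost on restart.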
Topology matrix: three practical VPS patterns
On Linux VPS, teams typically converge on one of these patterns. Choose based on always-on channel needs, checkpoint durability, and where heavy tools execute:
| Pattern | LangGraph | OpenClaw | Best for | Main risk |
|---|---|---|---|---|
| A - Single host, two services | 127.0.0.1:8123 | Public reverse proxy + channels | PoC and small pilot teams | RAM contention; dual-service upgrade choreography |
| B - Private orchestration, external heavy workers | VPS private network | VPS Gateway + cloud Mac workers | Xcode, browsers, long toolchains | Cross-host MCP/SSH latency, token rotation drift |
| C - OpenClaw-only lightweight agents | Optional or batch-only | Multi-profile child agents | Sequential FAQ and support flows | Limited expression for parallel review topology |
Pattern A is the fastest reliable baseline. A 2 vCPU / 4 GB VPS can run both services if checkpoint persistence is truly durable and heavy execution is kept outside the host. Pattern B usually wins once workers need browser automation, repo indexing, mobile builds, or sustained CPU bursts. In that model, Gateway and graph stay lean on Linux while heavyweight execution nodes run on cloud Mac capacity. Pattern C is valid when your workflow is mostly linear and low risk; forcing graph complexity too early adds fragility without benefit.
LangGraph models multi-agent collaboration as graph transitions and handoffs in its multi-agent concepts documentation. OpenClaw channel ingress and gateway controls are documented in the Gateway configuration guide. Read both side by side before defining production ownership boundaries.
Minimal reproducible topology (Pattern A)
This is the shortest production-safe path we recommend for first rollout. Assume Linux VPS, TLS-ready domain, Python 3.11+, and basic reverse proxy operation already in place:
- LangGraph API service - wrap graph execution in FastAPI, bind 127.0.0.1:8123, and persist checkpoint state to a durable directory or managed database.
- OpenClaw Gateway service - install from the OpenClaw CLI documentation, keep listener private when possible, and terminate TLS at Nginx/Caddy.
- Bridge rule - inbound channel event -> Gateway normalization -> HTTP call to graph run endpoint with stable thread_id semantics and strict intent mapping.
- systemd separation - run langgraph-api.service and openclaw.service independently with restart policy; never hide two production daemons under one process supervisor shell script.
# /etc/systemd/system/langgraph-api.service
[Unit]
Description=LangGraph multi-agent API
After=network-online.target
Wants=network-online.target
[Service]
User=deploy
WorkingDirectory=/opt/langgraph-app
Environment=CHECKPOINT_DIR=/var/lib/langgraph-checkpoints
ExecStart=/opt/langgraph-app/.venv/bin/uvicorn main:app --host 127.0.0.1 --port 8123
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
After enabling both units, verify with socket inspection that only your reverse proxy listens publicly on 443 and that graph API remains private. This single check prevents a surprising number of breach paths. Treat public ingress as Gateway-only, and keep orchestration API reachable only through trusted internal routing, VPN, or controlled SSH tunnel. Separation here also makes incident replay safer because you can restart one layer without destroying the other.
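The socket check above can be scripted so it runs after every deploy. The following is a minimal sketch that parses the text printed by `ss -ltn` and reports every TCP listener that is not bound to a loopback address; the function name `public_listeners` is an assumption made for illustration.

```python
import ipaddress


def public_listeners(ss_output: str) -> list[tuple[str, int]]:
    """Return (host, port) for each TCP listener not bound to loopback.

    Expects the text printed by `ss -ltn`; header lines are skipped.
    """
    exposed = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != "LISTEN":
            continue
        # The local address is the 4th column, e.g. 127.0.0.1:8123 or [::]:22.
        host, _, port = fields[3].rpartition(":")
        # Drop IPv6 brackets and interface suffixes such as "%lo".
        host = host.strip("[]").split("%", 1)[0]
        try:
            is_loopback = ipaddress.ip_address(host).is_loopback
        except ValueError:
            is_loopback = False  # "*" means every interface
        if not is_loopback:
            exposed.append((host, int(port)))
    return exposed
```

On a Pattern A host, the expectation is that port 8123 never appears in the result and that the reverse proxy on 443 does. Any other entry deserves a look before go-live.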
Handoff contract: what Gateway sends to the graph
Most production regressions in split architecture come from an unstable handoff envelope, not from model behavior. Define a strict JSON schema and version it. Avoid sending raw transcript blobs as state by default:
- thread_id - stable conversation key, deterministic mapping from channel thread to graph state lineage.
- intent - constrained enum (new_task, approve, cancel, status) to avoid planner guesswork for button semantics.
- payload - structured task context (repo URL, path scope, explicit operator instruction). Large files should pass object storage references, not in-band blobs.
- caller_scope - channel identity and user scope for in-graph authorization checks, even if Gateway already validated ingress.
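The four fields above can be enforced at the graph API boundary before any node runs. This is a sketch using only the standard library; the `schema_version` field, the `HandoffEnvelope` class, and the `parse_envelope` function are illustrative assumptions, and a real service might use a validation library instead.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

SCHEMA_VERSION = 1  # hypothetical version number for this envelope


class Intent(str, Enum):
    NEW_TASK = "new_task"
    APPROVE = "approve"
    CANCEL = "cancel"
    STATUS = "status"


@dataclass(frozen=True)
class HandoffEnvelope:
    thread_id: str
    intent: Intent
    payload: dict[str, Any]
    caller_scope: dict[str, Any]
    schema_version: int = SCHEMA_VERSION


def parse_envelope(raw: dict[str, Any]) -> HandoffEnvelope:
    """Validate a Gateway -> graph request body; raise ValueError on any drift."""
    version = raw.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    thread_id = raw.get("thread_id")
    if not isinstance(thread_id, str) or not thread_id:
        raise ValueError("thread_id must be a non-empty string")
    try:
        intent = Intent(raw.get("intent"))
    except ValueError:
        raise ValueError(f"unknown intent: {raw.get('intent')!r}") from None
    payload = raw.get("payload", {})
    if not isinstance(payload, dict):
        raise ValueError("payload must be an object")
    caller_scope = raw.get("caller_scope")
    if not isinstance(caller_scope, dict) or not caller_scope:
        raise ValueError("caller_scope is required")
    return HandoffEnvelope(thread_id, intent, payload, caller_scope, version)
```

Rejecting unknown intents at the boundary means a renamed button in a channel template fails loudly instead of being interpreted by the planner.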
LangGraph memory and checkpoint semantics are designed for persistent world-state over long-running flows. Gateway should remain stateless enough to restart safely. If both layers invent independent session machines, operators eventually see split-brain behavior: channel says done, graph keeps running workers, and no one can prove which state is canonical.
Keep approval state in the LangGraph checkpoint, keyed by thread_id. If approval state lives in temporary Gateway memory, a routine restart severs release control flow and forces unsafe manual recovery.
L0-L3 troubleshooting: isolate by layer, not by panic
When users report "bot is silent", avoid changing both sides at once. Follow this layer order and collect evidence at each step:
L0: systemd and sockets
Confirm systemctl status openclaw langgraph-api is active for both units, then verify listeners. If graph API is down, Gateway can still ingest channel events and return upstream errors that look like messaging failures. Resolve service health first before touching channel config.
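For repeatable L0 checks, the same question can be asked with `systemctl is-active`, which prints a one-word state per unit. The sketch below wraps that call; the helper names are assumptions, and the probe is injectable so the logic can be exercised without systemd.

```python
import subprocess
from typing import Callable, Iterable


def systemctl_is_active(unit: str) -> str:
    """Return the state printed by `systemctl is-active <unit>`."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit just means "not active"
    )
    return result.stdout.strip()


def inactive_units(
    units: Iterable[str],
    probe: Callable[[str], str] = systemctl_is_active,
) -> dict[str, str]:
    """Map each unit that is not 'active' to its reported state."""
    states = {unit: probe(unit) for unit in units}
    return {unit: state for unit, state in states.items() if state != "active"}
```

Calling `inactive_units(["openclaw", "langgraph-api"])` on the VPS returns an empty dict when both services are healthy; anything else is the first thing to fix before moving to L1.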
L1: OpenClaw channel and bridge logs
Use openclaw logs to validate webhook reception, signature checks, and bridge endpoint mapping. TLS issues, 403 callbacks, and delivery retries belong here. Resist the common reflex to patch graph prompts when ingress transport is failing upstream.
L2: LangGraph trace and checkpoint lineage
Query graph state with the same thread_id and inspect the last stable node transition. If checkpoint permissions or storage mounts are wrong, every message appears as a fresh session. For parallel workers, inspect file locks and state merge policies before blaming Gateway behavior.
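A quick preflight on the checkpoint directory can rule out the permission and mount problems described above before digging into graph traces. This sketch assumes file-based checkpoints; the function name and the list of ephemeral prefixes are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

# Assumed locations that are commonly tmpfs or cleaned on reboot.
EPHEMERAL_PREFIXES = ("/tmp", "/dev/shm", "/run")


def checkpoint_dir_problems(path, ephemeral_prefixes=EPHEMERAL_PREFIXES) -> list[str]:
    """Return human-readable problems with a checkpoint directory."""
    problems = []
    resolved = Path(path).resolve()
    for prefix in ephemeral_prefixes:
        root = Path(prefix).resolve()
        if resolved == root or root in resolved.parents:
            problems.append(f"{resolved} is under ephemeral location {root}")
    if not resolved.is_dir():
        problems.append(f"{resolved} is not an existing directory")
        return problems
    try:
        # Creating and removing a temp file proves the service user can write.
        with tempfile.NamedTemporaryFile(dir=resolved):
            pass
    except OSError as exc:
        problems.append(f"{resolved} is not writable: {exc}")
    return problems
```

Run it as the service user against the value of `CHECKPOINT_DIR` from the unit file; an empty list means the storage layer is not the reason messages look like fresh sessions.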
L3: cross-host workers and MCP edges
In Pattern B, heavy workers run remotely over MCP or SSH. At L3, inspect connectivity, NO_PROXY, credential expiry, and host-level quotas. A common anti-pattern is reinstalling Gateway while the true issue is a remote execution timeout or stale tool token on the worker host.
Pre-production checklist (30-minute self-audit)
- LangGraph binds private interface only; internet-facing traffic terminates at Gateway proxy.
- Checkpoint storage is durable and backed up; no state written to ephemeral filesystem layers.
- Gateway-to-graph timeout is below channel retry window to prevent duplicate invocation storms.
- Node-level tool allowlists are explicit; reviewer nodes cannot mutate production systems.
- Reboot simulation confirms paused interrupt can resume and scheduled tasks do not double-run.
- A single trace id flows from channel ingress through graph run and worker execution logs.
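Timeouts alone do not stop duplicates if a channel or Cron trigger retries anyway, so an idempotency guard in front of graph invocation is a useful complement to the checklist. The sketch below is an in-memory illustration with an assumed `event_id` supplied by the channel; the class name is hypothetical, and a production version would need a durable store so the guard itself survives a restart.

```python
import time
from typing import Callable


class InvocationDeduper:
    """Drop repeat deliveries of the same event within a time window."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: dict[tuple[str, str], float] = {}

    def should_invoke(self, thread_id: str, event_id: str) -> bool:
        """Return True the first time a (thread_id, event_id) pair is seen."""
        now = self._clock()
        # Forget entries older than the window so memory stays bounded.
        self._seen = {
            key: seen_at
            for key, seen_at in self._seen.items()
            if now - seen_at < self._ttl
        }
        key = (thread_id, event_id)
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

The window should be at least as long as the channel's retry window, so a retried delivery is recognized as a repeat rather than a new task.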
Writing checkpoints to /tmp, sharing one API key across ingress and privileged tools, and running browser-heavy workers on the same 4 GB VPS are each sufficient to destabilize the stack. In combination they guarantee weekend incidents.
FAQ
Do all multi-agent pipelines need LangGraph from day one?
No. If your flow is linear, short-lived, and has no pause/resume requirement, OpenClaw child agents or a single deterministic worker may be enough. Introduce graph orchestration once you need checkpoint durability, parallel branches, or explicit human approval interrupts.
Can OpenClaw and LangGraph share one Linux VPS?
Yes, that is Pattern A. Keep strict memory headroom and file descriptor limits, and move browser automation or mobile build workloads to cloud Mac workers. The VPS should stay focused on Gateway ingress and orchestration API reliability.
How do we migrate gradually from an existing OpenClaw gateway?
Start with one low-risk route (for example a single Cron task or one command family) bridged into LangGraph. Keep the rest unchanged. After state durability and routing are proven under load, migrate supervisor decisions incrementally instead of cutover-by-switch.
Run orchestration on VPS, heavy work on cloud Mac
Reliable multi-agent delivery on Linux VPS depends less on prompt cleverness and more on layered architecture: LangGraph for state and topology, OpenClaw for ingress and 24/7 operations. Keep VPS responsibilities lightweight and always-on; move high-variance execution to dedicated worker hosts where resource spikes are expected and observable.
Production resilience comes from boring discipline: explicit service boundaries, durable checkpoints, and an L0-L3 incident flow. That discipline is cheaper than emergency rewrites after one reboot-induced split brain.
If you are building your first VPS multi-agent pipeline, start from the VPSSpark homepage, compare plans on cloud Mac pricing, and continue with the launchd vs Linux residency FAQ to align cross-platform operations.