Cut Inter-Agent Latency by 80% With gRPC Streaming

Written by rsriram | Published 2026/04/07
Tech Story Tags: grpc | microservices | protobuf | distributed-systems | grpc-vs-rest-performance | protobuf-vs-json-latency | multi-agent-systems | bidirectional-streaming-grpc

TLDRREST/JSON is a performance bottleneck for autonomous agent swarms. gRPC/Protobuf and bidirectional streaming replace JSON with a binary contract that has the potential to slash inter-agent latency by 80% and reduce token waste. This guide includes architecture, reference code, and design considerations for building an agent orchestration layer for financial services.via the TL;DR App

Consider a multi-agent fraud detection pipeline in action. Five autonomous agents, each running their own specialized LLM, need to communicate in real time to make decisions on suspicious wire transactions. The agents are smart enough, the models are fast enough, but the communication infrastructure, i.e., the wire protocol, takes 400ms per hop. With five agents to communicate through, this means it takes two seconds for the agents to communicate, make a decision, and act on it. But, by the time the decision is made, the money has already been transferred.

The problem isn't the model. The problem isn't the agents. The problem isn't the infrastructure running the agents or the model. The problem is the wire protocol. Moving from the current wire protocol, i.e., REST/JSON, to gRPC Bidirectional Streaming, we can reduce the overall orchestration latency from 2.1 seconds to 420ms. In this article, we will see how this works, and we will also see the "JSON Tax" as the silent killer of the agentic era.

The JSON Tax: Death by a Thousand Parses

The cost of every interaction between two agents using the REST w/ JSON protocol is:

  1. Serialization Tax - The Python objects have to be serialized to JSON and then parsed again on the receiving end. This is not free, especially for complex objects such as the fraud analysis report, which has 40 fields.
  2. Payload Bloat - JSON is human-readable, which is nice for debugging. However, it also means it is not machine-friendly. The machine has to send the field names such as transaction_amount in every single message. In contrast, Protobuf would represent the same information in 60-80% fewer bytes.
  3. Synchronous Lock Step - HTTP 1.1 is a request-response protocol. Agent A makes a request, then waits for the response, then Agent B makes a request, and so on. There is no way for both Agent B and Agent A to think out loud.
  4. Lack of Schema Support - JSON has no schema. This means that Agent A can send {"amout": 500}, and the typo is not caught by the recipient. This is a disaster in a highly regulated industry.

While the cost of each of these operations is low, when multiplied by the many calls made between agents for a single transaction, and then multiplied by the many transactions per minute, the tax becomes a real bottleneck.

What the JSON Tax Looks Like in Practice

Each arrow is a full round-trip over HTTP, which entails a TCP handshake, serialization of the request data into JSON, network transfer, deserialization of the response data, and so on. For agents that need to transfer dozens of intermediate reasoning tokens, this synchronous model is like requiring a team of analysts to communicate by mail instead of just talking to each other in the same office.

Enter gRPC: The Industrial Nervous System

gRPC is a high-performance RPC framework built on top of HTTP/2 and Protocol Buffers (protobuf). While REST and JSON represent the postal service, gRPC represents the direct neural link. Here’s why it matters with respect to agent-to-agent communication:

Feature

REST/JSON

gRPC/Protobuf

Encoding

Text (JSON)

Binary (Protobuf)

Schema

None (hope for the best)

Strict .proto contracts

Transport

HTTP/1.1 (one request at a time)

HTTP/2 (multiplexed streams)

Streaming

Not native (polling/SSE hacks)

First-class bidirectional

Payload size

100% baseline

20-40% of JSON equivalent

Code generation

Manual serialization

Auto-generated typed clients

The Architecture Shift

But rather than a sequence of sequential HTTP calls, each agent has a persistent bidirectional stream to the central Agent Hub. The Orchestrator pushes the task onto the stream, and agents can read and write the stream concurrently. The Fraud Agent can send partial results to the Risk Agent while it is still processing. No waiting. No polling. No JSON parsing.

Protobuf: The Typed Contract That Eliminates Wire Hallucinations

There is a well-known effect that occurs when building systems with REST and JSON. It is called "hallucinations on the wire." It happens when agents hallucinate data that is not actually there. It might be a missing field here, a float where an integer was expected there, or a nested object where a string was expected. In a financial system, this is not a nuisance. It is a compliance failure.

Protobuf completely avoids all of these problems by providing a typed schema contract.

Defining the Agent Communication Contract

syntax = "proto3";
package agent_swarm;

message AgentMessage {
  string agent_id = 1;
  string task_id = 2;
  MessageType type = 3;
  oneof payload {
    TransactionAnalysis transaction = 4;
    FraudScore fraud_score = 5;
    RiskAssessment risk_assessment = 6;
    ComplianceVerdict compliance_verdict = 7;
  }
}

enum MessageType {
  TASK_ASSIGNMENT = 0;
  PARTIAL_RESULT = 1;
  FINAL_RESULT = 2;
  ERROR = 3;
}

message FraudScore {
  double score = 1;           // 0.0 to 1.0
  string model_version = 2;
  repeated string indicators = 3;
  double confidence = 4;
}

// The key: bidirectional streaming service
service AgentHub {
  rpc AgentStream (stream AgentMessage) returns (stream AgentMessage);
}

The .proto file is the single source of truth. If the Fraud Agent attempts to send a score of type string rather than double, the code will not compile. No run-time surprises. No data corruption. No compliance issues.

Bidirectional Streaming: Agents That Think Out Loud

This is where the magic happens! In a traditional REST-based system, communication is a simple "turn-based" process. Request, wait, response, repeat. With gRPC bidirectional streaming, however, Agents have a persistent full-duplex channel and may send/receive messages concurrently!

How Streaming Changes the Game

Observe the following: The Fraud Agent sent partial results even before finishing processing. The Risk Agent received the heuristic score and began processing even before the Fraud Agent had finished waiting for the ML model.

The 80% Latency Reduction: Where the Time Goes

Let us now see exactly where time is being reduced. The following figures are based on benchmark tests averaged over 10,000 simulated transactions:

REST/JSON Pipeline (Before)

Orchestrator → Fraud Agent:     ~400ms
  - TCP handshake:               15ms
  - JSON serialize request:       8ms
  - Network transit:             12ms
  - JSON parse request:           6ms
  - Agent processing:           300ms
  - JSON serialize response:     12ms
  - Network transit:             12ms
  - JSON parse response:          8ms
  - Connection teardown:          5ms
  - HTTP overhead:               22ms

× 5 agents in chain = ~2,100ms total

gRPC Streaming Pipeline (After)

Orchestrator → Hub → All Agents: ~420ms
  - Stream already open:          0ms  (persistent connection)
  - Protobuf serialize:           1ms  (binary, no field names)
  - Network transit:              5ms  (HTTP/2 multiplexed)
  - Protobuf deserialize:         1ms
  - Parallel agent processing:  350ms  (overlapping execution)
  - Partial result streaming:     0ms  (piggybacked on open stream)
  - Final aggregation:           50ms
  - No connection teardown:       0ms

Total: ~420ms (80% reduction)

The Savings Visualized

The key insight here is that gRPC streams not only make each call faster but also change the execution model from sequential to overlapping. That means agents can get to work with partial data as soon as possible, rather than waiting for the complete data from the previous agent.

Wire Hallucinations: The Bug That Protobuf Killed

Let me tell you a story that illustrates why Protobuf is important.

Suppose we're designing a system where the Risk Agent expects a fraud score between 0.0 and 1.0. Suppose a new version of the Fraud Agent has been deployed that starts returning a fraud score on a 0-100 scale. Since JSON has no schema validation at the wire level, the Risk Agent would happily receive {"fraud_score": 85.0} and assume an 8,500% fraud probability. Every single transaction would be flagged as critical for hours until the issue was discovered.

This is what I call a wire hallucination: legal JSON but wrong data, with no protection from the REST/JSON stack.

This class of bug cannot occur when using Protobuf. The FraudScore.score field is defined as a double in the .proto, and although Protobuf itself does not have inherent support for value range validation, it is trivial when using validation interceptors:

class ValidationInterceptor(grpc.aio.ServerInterceptor):
    """gRPC interceptor that validates Protobuf message semantics."""

    async def intercept_service(self, continuation, handler_call_details):
        handler = await continuation(handler_call_details)
        return handler

    @staticmethod
    def validate_fraud_score(message):
        """Validate that fraud scores are within expected bounds."""
        if message.HasField("fraud_score"):
            score = message.fraud_score.score
            if not (0.0 <= score <= 1.0):
                raise grpc.RpcError(
                    grpc.StatusCode.INVALID_ARGUMENT,
                    f"Fraud score {score} out of range [0.0, 1.0]. "
                    f"Sent by agent: {message.agent_id}, "
                    f"model: {message.fraud_score.model_version}"
                )
            if not (0.0 <= message.fraud_score.confidence <= 1.0):
                raise grpc.RpcError(
                    grpc.StatusCode.INVALID_ARGUMENT,
                    f"Confidence {message.fraud_score.confidence} out of range."
                )
        return True

The typed contract and the validation interceptor offer a two-layer defense, where the 0-100 vs 0-1 bug will be caught at the wire level, immediately, with an error message pointing to the offending agent and model version.

Production Patterns for Agent Orchestration

Several patterns prove particularly important when creating this sort of architecture.

Pattern 1: The Confidence Threshold Router

Not all partial results will be worthy of forwarding. Adding a confidence threshold ensures only results with a certain level of confidence will be forwarded to the next agent:

class ConfidenceRouter:
    """Routes partial results only when confidence exceeds threshold."""

    CONFIDENCE_THRESHOLDS = {
        "fraud_score": 0.5,       # Only forward if reasonably confident
        "risk_assessment": 0.6,
        "compliance_verdict": 0.8  # High bar for compliance signals
    }

    async def should_forward(self, message):
        if message.type != pb2.PARTIAL_RESULT:
            return True  # Always forward final results

        if message.HasField("fraud_score"):
            return message.fraud_score.confidence >= self.CONFIDENCE_THRESHOLDS["fraud_score"]

        return True  # Forward unknown message types by default

Pattern 2: The Dead Agent Detector

In a streaming system, a silent agent looks identical to a slow agent. Heartbeat monitoring distinguishes the two:

import time

class DeadAgentDetector:
    """Detects agents that stop sending heartbeats on their stream."""

    HEARTBEAT_INTERVAL_SEC = 5
    DEAD_THRESHOLD_SEC = 15

    def __init__(self):
        self.last_heartbeat = {}  # agent_id -> timestamp

    async def monitor(self, agent_id, message):
        if message.type == pb2.HEARTBEAT:
            self.last_heartbeat[agent_id] = time.time()

    async def check_health(self):
        now = time.time()
        dead_agents = []
        for agent_id, last_seen in self.last_heartbeat.items():
            if now - last_seen > self.DEAD_THRESHOLD_SEC:
                dead_agents.append(agent_id)
                print(f"[ALERT] Agent {agent_id} presumed dead. "
                      f"Last heartbeat: {now - last_seen:.1f}s ago.")
        return dead_agents

Pattern 3: The Graceful Degradation Fallback

If an agent fails during its execution, the orchestrator has a decision to make—wait for it to resume execution, retry it, or run the remaining agents without it. This issue is resolved by a fallback strategy:

class GracefulDegradation:
    """
    If a critical agent dies, retry with a backup.
    If a non-critical agent dies, proceed without it.
    """

    CRITICAL_AGENTS = {"fraud", "compliance"}
    MAX_RETRIES = 2

    async def handle_agent_failure(self, failed_agent_type, task_id, hub):
        if failed_agent_type in self.CRITICAL_AGENTS:
            # Critical path: find a backup agent of the same type
            backup = await hub.find_available_agent(failed_agent_type)
            if backup:
                print(f"[FAILOVER] Rerouting {task_id} to backup "
                      f"{failed_agent_type} agent: {backup.agent_id}")
                await hub.reassign_task(task_id, backup)
            else:
                print(f"[ESCALATE] No backup for {failed_agent_type}. "
                      f"Escalating task {task_id} to human reviewer.")
                await hub.escalate_to_human(task_id)
        else:
            # Non-critical: proceed with available data
            print(f"[DEGRADE] Proceeding without {failed_agent_type} "
                  f"for task {task_id}")
            await hub.mark_agent_skipped(task_id, failed_agent_type)

The Token Cost Bonus Nobody Talks About

Here is another benefit that is not immediately obvious from most discussions online. When agents communicate by JSON, there is a cost to the message if it is passed through an LLM for routing or interpretation.

If the JSON message has long field names such as "transaction_amount""receiver_account_number""compliance_verification_status", there is a high context window cost.

The Protobuf wire format is binary, and most interestingly, there is no need for an LLM to interpret the message by the agents. This can save 30% in token cost as the orchestrator agent doesn't need to "read" the JSON messages and can directly access the attributes of the message objects.

# Before (REST/JSON): Orchestrator needs LLM to interpret response
response_json = await http_client.post("/fraud/analyze", json=payload)
result = json.loads(response_json.text)
# What if "fraud_score" is nested differently? What if the key changed?
# Often needed LLM to extract/normalize: expensive tokens

# After (gRPC/Protobuf): Direct typed access, no LLM needed
fraud_result = await stream.read()  # Returns typed AgentMessage
score = fraud_result.fraud_score.score          # float, guaranteed
confidence = fraud_result.fraud_score.confidence  # float, guaranteed
indicators = list(fraud_result.fraud_score.indicators)  # list[str]
# Zero ambiguity. Zero tokens wasted on interpretation.

Common Pitfalls to Avoid

Pitfall #1: "Let Us Stream Everything"

The first pitfall that people often fall into is to stream all intermediate tokens from all agents. The Hub gets flooded with messages. Hundreds of them are sent each second for each task. The agents cannot keep up with the volume.

# Anti-pattern (bad)
async def _analyze(self, txn):
    for token in model.stream_tokens(prompt):
        yield make_partial_result(token)  # Hundreds of messages per task

# Better approach (good)
async def _analyze(self, txn):
    # Only stream at meaningful checkpoints
    heuristic_result = self._heuristic_check(txn)
    yield make_partial_result(heuristic_result)  # Checkpoint 1

    ml_result = await self._ml_inference(txn)
    yield make_partial_result(ml_result)           # Checkpoint 2

    final_result = await self._llm_analysis(txn)
    yield make_final_result(final_result)           # Final

Lesson: Stream at meaningful checkpoints, not every token. Three good partial results are worth more than three hundred poor ones.

Pitfall #2: Forgetting Backpressure

The fast agent, such as a heuristic checker, might stream results more quickly than the slow agent, such as an LLM-based analyzer, can process them. This causes the messages to build up in memory, and the hub runs out of memory in the case of a traffic spike.

# The fix: bounded message queues with backpressure
class BackpressuredStream:
    def __init__(self, max_buffer=100):
        self.queue = asyncio.Queue(maxsize=max_buffer)

    async def send(self, message):
        try:
            self.queue.put_nowait(message)
        except asyncio.QueueFull:
            # Apply backpressure: slow down the sender
            print(f"[BACKPRESSURE] Queue full for {message.agent_id}. "
                  f"Waiting...")
            await self.queue.put(message)  # Blocks until space available

Lesson: Always use backpressure in streaming systems. Unbounded buffers are ticking time bombs.

Pitfall #3: No Stream Versioning

Suppose we modify a .proto, release a new version of the Fraud Agent, but forget to release a corresponding version of the Risk Agent. If we add a new field to the FraudScore, say geographic_risk_factor, this is no big deal. Protobuf is backwards compatible. But what happens when the field is removed from FraudScore?

Lesson: Version your .protos like you version your database migrations. Do not remove fields. Mark them deprecated. Do not reuse field numbers.

message FraudScore {
  double score = 1;
  string model_version = 2;
  repeated string indicators = 3;
  double confidence = 4;
  // DEPRECATED: Use geographic_indicators instead
  // double geographic_risk_factor = 5;  // DO NOT reuse field number 5
  repeated string geographic_indicators = 6;  // New field, new number
}

When NOT to Use gRPC Streams

This architecture is not always the right choice. Here is a decision framework:

The choice of REST/JSON is appropriate for:

  • Simple two-agent interaction when low latency is not a concern
  • Public API when browser support is a concern, as gRPC does not work directly in the browser without a proxy server
  • Teams without gRPC expertise when the learning curve may impede development speed
  • Prototyping when speed of development is more important than speed of execution

Key Takeaways

  1. The JSON Tax is a real thing. Serialisation costs, payload costs, synchronous lock-step execution, and schema-less communication add up in multi-agent systems. Measure it; do not ignore it.
  2. gRPC Bidirectional Streaming is a completely new execution paradigm. It is not "faster" than REST; it is "overlapping agent execution" through partial result streaming, which cannot be achieved with traditional request-response protocols.
  3. Protobuf avoids wire hallucinations. Data contracts are enforced at compile-time, not runtime at 3 AM when a service is down.
  4. Stream partial results at checkpoints, not tokens. Partial results are useful; tokens are useless.
  5. Treat .proto files like your database schema. Do not delete fields; do not reuse field numbers; version everything.

The agentic world requires infrastructure designed for machine-to-machine communication, not repurposed from human-facing web protocols. gRPC and Protobuf are not new technologies; they are new necessities.

Building a multi-agent system? I'd love to hear how you're approaching agent-agent communication. Are you using REST/JSON, or have you looked at alternatives? Share your architecture in the comments!

Writing about agentic AI systems in financial services — architecture patterns, protocol choices, and what it takes to move from demo to design.


Written by rsriram | Senior Engineering Leader | AWS Certified Architect | IEEE Senior Member | Gen AI, Agentic Coding | Cloud Expert
Published by HackerNoon on 2026/04/07