Large Language Models (LLMs) are everywhere, from everyday apps to advanced tools. Using them is easy. But what if you need to run your own model? Whether you've fine-tuned one or are dealing with privacy-sensitive data, the complexity increases. In this post, we'll share what we learned while building our own LLM inference system. We'll cover storing and deploying models, designing the service architecture, and solving real-world issues like routing, streaming, and managing microservices. The process involved challenges, but ultimately, we built a reliable system and gathered lessons worth sharing.

Introduction

LLMs are powering a wide range of applications — from chatbots and workflow agents to smart automation tools. While retrieval-augmented generation, tool-calling, and multi-agent protocols are important, they operate at a level above the core engine: a foundational LLM.

Many projects rely on external providers such as OpenAI, Gemini, or Anthropic, which is sufficient for most use cases. But for certain applications, this quickly becomes a problem. What if the provider goes down? What if you need full control over latency, pricing, or uptime? Most importantly — what if you care about privacy and can't afford to send user data to a third party?

That's where self-hosting becomes essential. Serving a pretrained or fine-tuned model provides control, security, and the ability to tailor the model to specific business needs. Building such a system doesn't require a large team or extensive resources: we built ours with a modest budget, a small team, and just a few nodes. This constraint shaped our architectural decisions, forcing us to focus on practicality and efficiency. In the following sections, we'll cover the challenges we faced, the solutions we implemented, and the lessons we learned along the way.

General overview

These are the core components that form the backbone of the system:

- Formats and Encoding. A shared language across services is essential. That means consistent request/response formats, generation parameter schemas, dialogue history structures, and serialization that works everywhere — from frontend to backend to model runners.
- Streaming and Routing. Handling multiple models, request types, and host priorities requires deliberate routing decisions. We'll outline how incoming user requests are routed through the system — from the initial entry point to the appropriate worker node — and how responses are streamed back.
- Model storage and deployment. Where do models live, and how are they prepared for production use?
- Inference. We'll discuss the key tests to perform, including those that ensure the model's reliability.
- Observability. How do you know things are working? We'll show what metrics we track, how we monitor for failures, and the probes we use to ensure system health and reliability.
Schema and Data Encoding

Choosing the right schema for data transfer is crucial. A shared format across services simplifies integration, reduces errors, and improves adaptability. We aimed to design the system to work seamlessly with both self-hosted models and external providers — without exposing their differences to the user.

Why Schema Design Matters

There's no universal standard for LLM data exchange. Many providers follow schemas similar to OpenAI's, while others — like Claude or Gemini — introduce subtle but important differences. Many of these providers offer OpenAI-compatible SDKs that retain the same schema, though often with limitations or reduced feature sets (e.g., Anthropic's OpenAI-compatible SDK, Gemini's OpenAI compatibility layer). Other projects such as OpenRouter aim to unify these variations by wrapping them into an OpenAI-compatible interface.

Sticking to a single predefined provider's schema has its benefits:

- You get a well-tested, stable API.
- You can rely on existing SDKs and tools.

But there are real downsides too:

- It creates vendor lock-in, making it harder to support multiple providers.
- It limits flexibility to extend the schema with custom features required by business needs or data science teams.
- You're exposed to breaking changes or deprecations outside your control.
- These schemas often carry legacy constraints that restrict fine-grained control.

To address this, we chose to define our own internal data model — a schema designed around our needs, which we can then map to various external formats when necessary.

Internal Schema Design

Before addressing the challenges, let's define the problem and outline our expectations for the solution:

- Easy conversion to the formats required by external providers, and back.
- Full support for features specific to our business and data science teams.
- A schema that is easily extendable to accommodate future requirements.
We began by reviewing major LLM schemas to understand how providers structure messages, parameters, and outputs. This allowed us to extract the core domain entities common across most systems, including:

- Messages (e.g., prompt, history)
- Generation Parameters (e.g., temperature, top_p, beam_search)

We identified certain parameters, such as service_tier, usage_metadata, or reasoning_mode, as being specific to a provider's internal configuration and business logic. These elements lie outside the core LLM domain and are not part of the shared schema. Instead, they are treated as optional extensions. Whenever a feature becomes widely adopted or necessary for broader interoperability, we evaluate integrating it into the core schema.

At a high level, our input schema is structured around these key components:

- Model — acts as a routing identifier, allowing the system to direct the request to the appropriate worker node.
- Generation Parameters — core model settings (e.g., temperature, top_p, max_tokens).
- Messages — conversation history and prompt payloads.
- Tools — definitions of tools that the model may use.

This leads us to the following schema, represented in a Pydantic-like format. It illustrates the structure and intent of the design, though some implementation details are omitted for simplicity.

```python
class ChatCompletionRequest(BaseModel):
    model: str                                   # Routing key to select the appropriate model or service
    messages: list[Message]                      # Prompt and dialogue history
    generation_parameters: GenerationParameters  # Core generation settings
    tools: list[Tool]                            # Optional tool definitions


class GenerationParameters(BaseModel):
    temperature: float
    top_p: float
    max_tokens: int
    beam_search: BeamSearchParams
    # Optional, non-core fields specific to certain providers
    provider_extensions: dict[str, Any] = {}
    ...  # Other parameters
```

We deliberately moved generation parameters into a separate nested field instead of placing them at the root level. This design choice distinguishes between constant parameters (e.g., temperature, top_p, model settings) and variable components (e.g., messages, tools). Many teams in our ecosystem store these constant parameters in external configuration systems, making this separation both practical and necessary.
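As a quick illustration of how this schema is used, here is a minimal, self-contained sketch of constructing a request. The Message class, the default values, and the model name are simplified assumptions for the example; the real shared-library models carry more fields (tools, beam search settings, and so on).

```python
from typing import Any
from pydantic import BaseModel


# Simplified stand-ins for the real shared-library models.
class Message(BaseModel):
    role: str       # "system", "user", or "assistant"
    content: str


class GenerationParameters(BaseModel):
    temperature: float = 1.0
    top_p: float = 1.0
    max_tokens: int = 1024
    # Optional, non-core fields specific to certain providers
    provider_extensions: dict[str, Any] = {}


class ChatCompletionRequest(BaseModel):
    model: str
    messages: list[Message]
    generation_parameters: GenerationParameters


request = ChatCompletionRequest(
    model="qwen2.5-32b-instruct",  # routing key; hypothetical model name
    messages=[
        Message(role="system", content="You are a helpful assistant."),
        Message(role="user", content="Summarize this document."),
    ],
    generation_parameters=GenerationParameters(
        temperature=0.2,
        max_tokens=512,
        # Passed through untouched; validated only by the final provider module.
        provider_extensions={"service_tier": "default"},
    ),
)
print(request.model_dump_json(indent=2))
```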
We include an additional field called provider_extensions within the GenerationParameters class. Because these parameters vary significantly across LLM providers, their validation and interpretation are delegated to the final module that handles model inference — the component that knows how to communicate with a specific model provider. This way, we avoid unnecessary pass-through coupling caused by redundant data validation across multiple services.

To ensure backward compatibility, new output schema features are introduced as explicit, optional fields in the request schema. These fields act as feature flags — users must set them to opt into specific behaviors. This approach keeps the core schema stable while enabling incremental evolution. For example, reasoning traces will only be included in the output if the corresponding field is set in the request.

These schemas are maintained in a shared Python library and used across services to ensure consistent request and response handling.

Working with Third-Party providers

We began by outlining how we built our own platform — so why bother with compatibility across external providers? Despite relying on our internal infrastructure, there are still several scenarios where external models play a role:

- Synthetic data generation for prototyping and experimentation by our data science teams.
- General-purpose tasks where some proprietary models perform better out of the box.
- Non-sensitive use cases where privacy, latency, or infrastructure control are less critical.

The overall communication flow with external providers involves the following steps:

1. A dedicated LLM-Gateway Service, responsible for communication with the provider, receives the user request in our schema format.
2. The request is converted into the provider-specific format, including any provider_extensions.
3. The external provider processes the request and returns a response.
4. The LLM-Gateway Service receives the response and maps it back into our standardized response schema.

This is a high-level schematic that abstracts away some individual microservices. Details about specific components and the streaming response format will be covered in the following sections.
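As a rough sketch of step 2, the conversion inside the LLM-Gateway might look like the following. It builds on the simplified models from the previous sketch; the helper name is hypothetical, and the payload keys follow the public OpenAI Chat Completions API.

```python
def to_openai_payload(request: ChatCompletionRequest) -> dict:
    """Hypothetical gateway helper: map our internal request into an
    OpenAI-style Chat Completions payload."""
    params = request.generation_parameters
    payload = {
        "model": request.model,
        "messages": [{"role": m.role, "content": m.content} for m in request.messages],
        "temperature": params.temperature,
        "top_p": params.top_p,
        "max_tokens": params.max_tokens,
    }
    # Provider-specific knobs are passed through as-is; the provider module,
    # not the gateway, is responsible for validating them.
    payload.update(params.provider_extensions)
    return payload
```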
Streaming format

LLM responses are generated incrementally — token by token — and then aggregated into chunks for efficient transmission. From the user's perspective, whether through a browser, mobile app, or terminal, the experience must remain fluid and responsive. This requires a transport mechanism that supports low-latency, real-time streaming.

There are two primary options for achieving this:

- WebSockets: a full-duplex communication channel that allows continuous two-way interaction between client and server.
- Server-Sent Events (SSE): a unidirectional, HTTP-based streaming protocol that is widely used for real-time updates.

Why SSE over WebSockets?

While both options are viable, SSE is the more commonly used solution for standard LLM inference — particularly for OpenAI-compatible APIs and similar systems. This is due to several practical advantages:

- Simplicity: SSE runs over standard HTTP, requiring no special upgrades or negotiation.
- Compatibility: it works natively in all major browsers without additional libraries.
- Unidirectional flow: most LLM responses flow only from server to client, which aligns with SSE's design.
- Proxy-friendliness: SSE plays well with standard HTTP infrastructure, including reverse proxies.

Because of these benefits, SSE is typically chosen for text-only, prompt-response streaming use cases.

However, some emerging use cases require richer, low-latency, bidirectional communication — such as real-time transcription or speech-to-speech interactions. OpenAI's Realtime API addresses these needs using WebSockets (for server-to-server communication). Such protocols are better suited for continuous multimodal input and output.

Since our system focuses exclusively on text-based interactions, we stick with SSE for its simplicity, compatibility, and alignment with our streaming model.

Response Stream Content

With SSE selected as the transport layer, the next step was defining what data to include in the stream. Effective streaming requires more than just raw text — it needs to provide sufficient structure, metadata, and context to support downstream consumers such as user interfaces and automation tools. The stream must include the following information:

- Header-Level Metadata. Basic identifying information such as the request ID.
- Actual Content Chunks. The core output — the tokens or strings generated by the model — delivered incrementally, chunk by chunk. Each generation can consist of multiple sequences (e.g., n=2, n=4). These sequences are generated independently and streamed in parallel, each broken into its own set of incremental chunks.
- Usage and Token-Level Metadata. This includes the number of tokens generated, timing data, and optional diagnostics like logprobs or reasoning traces. These may be used for billing, debugging, or model evaluation.
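On the wire, each chunk becomes one SSE event: a JSON payload on a data: line followed by a blank line. The sketch below shows the general shape using FastAPI; the endpoint path, event names, and placeholder generator are illustrative assumptions rather than our production code, and the [DONE] sentinel mirrors the convention used by OpenAI-compatible APIs.

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_chunks(request_id: str):
    # Placeholder for the real model stream; yields already-structured chunk dicts.
    yield {"event": "generation_start", "request_id": request_id}
    yield {"event": "sequence_delta", "sequence_id": 0, "text": "Hello"}
    yield {"event": "generation_finish", "usage": {"completion_tokens": 1}}


@app.post("/v1/chat/completions")
async def chat_completions(request_id: str = "req-123"):
    async def sse_stream():
        async for chunk in generate_chunks(request_id):
            # One SSE event per chunk: a "data:" line followed by a blank line.
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse_stream(), media_type="text/event-stream")
```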
After defining the structure of the streamed response, we also considered several non-functional requirements essential for reliability and future evolution. Our stream design is intended to be:

- Structured — clearly distinguishing between content types and event boundaries.
- Extensible — capable of carrying optional metadata without breaking existing clients.
- Robust — resilient to malformed, delayed, or partial data.

In many applications — such as side-by-side comparison or diverse sampling — multiple sequences (completions) are generated in parallel as part of a single generation request.

The most comprehensive format for streaming responses is defined in the OpenAI API reference. According to the specification, a single generation chunk may include multiple sequences in the choices array:

"A list of chat completion choices. Can contain more than one element if n is greater than 1. Can also be empty for the last chunk."

Although, in practice, individual chunks usually contain only a single delta, the format allows for multiple sequence updates per chunk. It's important to account for this, as future updates might make broader use of this capability. Notably, even the official Python SDK is designed to support this structure.
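In practice, this means a client consuming an OpenAI-compatible stream should index deltas by choice rather than assume one sequence per chunk. A small example using the official openai Python SDK (the model name and prompt are arbitrary, and OPENAI_API_KEY is assumed to be set):

```python
from collections import defaultdict

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Give me two short taglines."}],
    n=2,            # two sequences generated in parallel
    stream=True,
)

completions: dict[int, str] = defaultdict(str)
for chunk in stream:
    # A single chunk may carry deltas for several choices, or none at all.
    for choice in chunk.choices:
        completions[choice.index] += choice.delta.content or ""

for index, text in sorted(completions.items()):
    print(f"[{index}] {text}")
```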
We chose to follow the same structure to ensure compatibility with a wide range of potential features. The diagram below illustrates an example from our implementation, where a single generation consists of three sequences, streamed in six chunks over time:

- Chunk 1 — Generation Start. This chunk marks the beginning of the entire generation. It doesn't contain any actual content but includes shared metadata, such as the generation ID, timestamp, and role (e.g., assistant).
- Chunk 2 — Sequence Start (Green & Purple). Two sequences begin streaming in parallel. Each is tagged with a unique identifier to distinguish it from the others.
- Chunk 3 — Sequence Start (Blue) & Sequence Delta. The third sequence starts (blue), while the first two sequences (green and purple) stream incremental content via delta events.
- Chunk 4 — Midstream Updates & Finish (Purple). The green and blue sequences continue streaming deltas. The purple sequence finishes — this includes a structured finish_reason (like stop or length).
- Chunk 5 — Remaining Sequences Finish. Both the green and blue sequences complete. Each sequence's lifecycle is now fully enclosed between its respective start and finish markers.
- Chunk 6 — Generation Finish. This chunk closes the generation and may include global usage statistics, final token counts, latency info, or other diagnostics.

As you can see, to make the stream robust and easier to parse, we opted to explicitly signal Start and Finish events for both the overall generation and each individual sequence, rather than relying on implicit mechanisms such as null checks, EOFs, or magic tokens. This structured approach simplifies downstream parsing, especially in environments where multiple completions are streamed in parallel, and also improves debuggability and fault isolation during development and runtime inspection.

Moreover, we introduce an additional Error chunk that carries structured information about failures. Some errors — such as malformed requests or authorization issues — can be surfaced directly via standard HTTP response codes. However, if an error occurs during the generation process, we have two options: either abruptly terminate the HTTP stream or emit a well-formed SSE error event. We chose the latter. Abruptly closing the connection makes it hard for clients to distinguish between network issues and actual model or service failures. By using a dedicated error chunk, we enable more reliable detection and propagation of issues during streaming. The resulting event model is sketched below.
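Put together, the stream events described above could be modeled roughly as follows. This is an illustrative sketch rather than our exact internal schema; the type names and fields are assumptions that mirror the behavior described in this section.

```python
from typing import Literal, Optional

from pydantic import BaseModel


class GenerationStart(BaseModel):
    type: Literal["generation_start"] = "generation_start"
    generation_id: str
    created_at: int
    role: str = "assistant"


class SequenceStart(BaseModel):
    type: Literal["sequence_start"] = "sequence_start"
    generation_id: str
    sequence_id: int


class SequenceDelta(BaseModel):
    type: Literal["sequence_delta"] = "sequence_delta"
    generation_id: str
    sequence_id: int
    text: str


class SequenceFinish(BaseModel):
    type: Literal["sequence_finish"] = "sequence_finish"
    generation_id: str
    sequence_id: int
    finish_reason: str            # e.g. "stop" or "length"


class GenerationFinish(BaseModel):
    type: Literal["generation_finish"] = "generation_finish"
    generation_id: str
    usage: Optional[dict] = None  # token counts, timings, other diagnostics


class ErrorChunk(BaseModel):
    type: Literal["error"] = "error"
    generation_id: str
    code: str
    message: str
```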
Backend Services and Request Flow

At the center of the system is a single entrypoint: the LLM-Gateway. It handles basic concerns like authentication, usage tracking and quota enforcement, request formatting, and routing based on the specified model. While it may look like the Gateway carries a lot of responsibility, each task is intentionally simple and modular. For external providers, it adapts requests to their APIs and maps responses back into a unified format. For self-hosted models, requests are routed directly to internal systems using our own unified schema. This design allows seamless support for both external and internal models through a consistent interface.

As mentioned earlier, Server-Sent Events (SSE) is well-suited for streaming responses to end users, but it's not a practical choice for internal backend communication. When a request arrives, it must be routed to a suitable worker node for model inference, and the result streamed back. While some systems handle this using chained HTTP proxies and header-based routing, in our experience this approach becomes difficult to manage and evolve as the logic grows in complexity.

Our internal infrastructure needs to support:

- Priority-aware scheduling — requests may have different urgency levels (e.g., interactive vs. batch), and high-priority tasks must be handled first.
- Hardware-aware routing — certain nodes run on higher-performance GPUs and should be preferred; others serve as overflow capacity.
- Model-specific dispatching — each worker is configured to support only a subset of models, based on hardware compatibility and resource constraints.

To address these requirements, we use a message broker to decouple task routing from result delivery. This design provides better flexibility and resilience under varying load and routing conditions. We use RabbitMQ for this purpose, though other brokers could also be viable depending on your latency, throughput, and operational preferences. RabbitMQ was a natural fit given its maturity and alignment with our existing tooling.

Now let's take a closer look at how this system is implemented in practice. We use dedicated queues per model, allowing us to route requests based on model compatibility and node capabilities. The process is as follows:

1. Client Sends Request. The LLM-Gateway service (represented as the user) initiates an HTTP request to trigger a text generation task. The Scheduler service starts a new Request Handler to manage this request.
2. Task Routing via the Scheduler. The Scheduler selects the appropriate queue (marked in green on the image) based on the requested model and appends the message to it.
3. Worker Picks Up the Task. An appropriate Inference Worker (only one worker is shown for simplicity, but there are many) subscribed to the queue picks up the task and begins processing. This worker runs the selected model locally.
4. Streaming the Response. The worker streams the response chunk by chunk into the Response Queue, to which the Scheduler replica handling the request is subscribed.
5. Receiving Response Chunks. The Scheduler listens to the reply queue and receives the response chunks as they arrive.
6. SSE Streaming. The chunks are converted to SSE format and streamed to the client.

To handle large payloads without overwhelming the message broker, we avoid embedding large input or output data directly in the task. Instead, we upload it to an external S3-compatible store; a reference (such as a URL or resource ID) is included in the task metadata, and the worker retrieves the actual content when needed.

Applying the Design with RabbitMQ

When it comes to routing and publishing messages, each Request Queue is a regular RabbitMQ queue dedicated to a single model type. We require priority-aware scheduling, which can be achieved using message priorities: messages with higher priority values are delivered and processed before lower-priority ones. For hardware-aware routing, where messages should be directed to the most performant available nodes first, consumer priorities can help: consumers with higher priority receive messages as long as they are active, while lower-priority consumers only receive messages when the higher-priority ones are blocked or unavailable.

If message loss is unacceptable, the following must be in place (see the sketch after this list):

- Publisher confirms, to ensure the broker has received and stored the message.
- Durable queues and persistent messages, so data survives restarts.
- Quorum queues, for stronger durability through replication. These also support simplified message and consumer priorities as of RabbitMQ 4.0.
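Below is a condensed sketch of how these pieces fit together using the pika client. The queue name and task payload are hypothetical, the publishing and consuming sides would normally live in separate services (the Scheduler and an Inference Worker), and a classic priority queue is used here for brevity, whereas a production setup might prefer quorum queues.

```python
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.confirm_delivery()  # publisher confirms: publish fails loudly if not accepted

# One durable request queue per model; classic priority queue in this sketch.
channel.queue_declare(
    queue="requests.llama-3-70b",  # hypothetical per-model queue name
    durable=True,
    arguments={"x-max-priority": 10},
)

# Publishing side (Scheduler): higher message priority is delivered first.
task = {"request_id": "req-123", "payload_ref": "s3://bucket/req-123.json"}
channel.basic_publish(
    exchange="",
    routing_key="requests.llama-3-70b",
    body=json.dumps(task).encode(),
    properties=pika.BasicProperties(priority=5, delivery_mode=2),  # persistent message
)


# Consuming side (Inference Worker): "x-priority" makes this consumer preferred,
# so faster GPU nodes can register with a higher value than overflow nodes.
def handle_task(ch, method, properties, body):
    print("processing", json.loads(body)["request_id"])
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_qos(prefetch_count=1)
channel.basic_consume(
    queue="requests.llama-3-70b",
    on_message_callback=handle_task,
    arguments={"x-priority": 10},
)
channel.start_consuming()
```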
So far, we've covered how tasks are published — but how is the streamed response handled? The first step is to understand how temporary queues work in RabbitMQ. The broker supports a concept called exclusive queues, which are bound to a single connection and automatically deleted when that connection closes. This makes them a natural fit for our setup.

We create one exclusive queue per Scheduler service replica, ensuring it's automatically cleaned up when the replica shuts down. However, this introduces a challenge: while each service replica has a single RabbitMQ queue, it must handle many requests in parallel.

To address this, we treat the RabbitMQ queue as a transport layer that routes responses to the correct Scheduler replica. Each user request is assigned a unique identifier, which is included in every response chunk. Inside the Scheduler, we maintain an additional in-memory routing layer with short-lived in-memory queues — one per active request. Incoming chunks are matched to these queues based on the identifier and forwarded accordingly. These in-memory queues are discarded once the request completes, while the RabbitMQ queue persists for the lifetime of the service replica.

Schematically, this looks as follows: a central dispatcher within the Scheduler routes chunks to the appropriate in-memory queue, each managed by a dedicated handler. Handlers then stream the chunks to users over SSE.
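In Python, this routing layer can be sketched with asyncio queues keyed by request ID. The sketch below is illustrative only and omits the timeouts, backpressure, and error propagation a real Scheduler would need.

```python
import asyncio


class ChunkDispatcher:
    """Routes chunks from the replica's single reply queue to per-request queues."""

    def __init__(self) -> None:
        self._streams: dict[str, asyncio.Queue] = {}

    def register(self, request_id: str) -> asyncio.Queue:
        # Called by the request handler when a new request starts.
        queue: asyncio.Queue = asyncio.Queue()
        self._streams[request_id] = queue
        return queue

    def dispatch(self, chunk: dict) -> None:
        # Called for every message consumed from the RabbitMQ reply queue.
        queue = self._streams.get(chunk["request_id"])
        if queue is not None:
            queue.put_nowait(chunk)

    def complete(self, request_id: str) -> None:
        # Short-lived in-memory queues are discarded once the request completes.
        self._streams.pop(request_id, None)


async def stream_to_client(dispatcher: ChunkDispatcher, request_id: str):
    """Request handler: drains the in-memory queue and yields SSE-ready chunks."""
    queue = dispatcher.register(request_id)
    try:
        while True:
            chunk = await queue.get()
            yield chunk
            if chunk.get("type") in ("generation_finish", "error"):
                break
    finally:
        dispatcher.complete(request_id)
```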
Inference

There are several mature frameworks available for efficient LLM inference, such as vLLM and SGLang. These systems process multiple sequences in parallel and generate response tokens in real time, often with features like continuous batching and GPU memory optimization. In our setup, we use vLLM as the core inference engine, with a few custom modifications:

- Custom beam search implementation — to better suit our generation logic and support structured constraints.
- Support for structured output schemas — allowing models to return outputs conforming to business-specific formats.

Through experience, we've learned that even minor library updates can significantly alter model behavior — whether in output quality, determinism, or concurrency. Because of this, we've established a robust testing pipeline (a simplified example follows this list):

- Stress testing to uncover concurrency issues, memory leaks, or stability regressions.
- Determinism testing to ensure consistent outputs for fixed seeds and parameter sets.
- Parameter grid testing to cover a wide range of generation settings, without going overboard.
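To illustrate the last two items, determinism and grid checks can be written as plain pytest tests. The generate helper below is a stub standing in for a call to the deployed service; its signature, the prompts, and the parameter values are assumptions for the example.

```python
import itertools

import pytest


def generate(prompt: str, *, seed: int, temperature: float, top_p: float, max_tokens: int) -> str:
    """Hypothetical helper that would call the inference service under test.

    In a real suite this issues an HTTP request; here it is a stub so the
    example stays self-contained.
    """
    return f"stubbed output for seed={seed}"


def test_fixed_seed_is_deterministic():
    outputs = {
        generate("Describe RabbitMQ in one sentence.",
                 seed=42, temperature=0.7, top_p=0.9, max_tokens=64)
        for _ in range(5)
    }
    # The same seed and parameters should always produce the same text.
    assert len(outputs) == 1


# A small grid: wide enough to catch regressions without exploding CI time.
GRID = list(itertools.product([0.0, 0.7, 1.2], [0.5, 0.9, 1.0]))


@pytest.mark.parametrize("temperature,top_p", GRID)
def test_parameter_grid_smoke(temperature, top_p):
    text = generate("Say hello.", seed=0, temperature=temperature,
                    top_p=top_p, max_tokens=16)
    assert isinstance(text, str) and text.strip()
```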
Storage and deployment

Most modern systems run in containerized environments — either in the cloud or within Kubernetes (K8s). While this setup works well for typical backend services, it introduces challenges around model weight storage. LLM models can be tens or even hundreds of gigabytes in size, and baking model weights directly into Docker images quickly becomes problematic:

- Slow builds — even with multi-stage builds and caching, transferring large model files during the build phase can dramatically increase CI time.
- Slow deployments — each rollout requires pulling massive images, which can take several minutes and cause downtime or delays.
- Resource inefficiency — neither Docker registries nor Kubernetes nodes are optimized for handling extremely large images, resulting in bloated storage usage and bandwidth strain.

To solve this, we separate model storage from the Docker image lifecycle. Our models are stored in an external S3-compatible object storage and fetched just before inference service startup. To improve startup time and avoid redundant downloads, we also use local persistent volumes (PVCs) to cache model weights on each node.

Observability

A system like this — built on streaming, message queues, and real-time token generation — requires robust observability to ensure reliability and performance at scale. In addition to standard service-level metrics (CPU, memory, error rates, etc.), we found it essential to monitor the following:

- Queue depth, message backlog, and consumer count — monitoring the number of pending messages, current queue size, and number of active consumers helps detect task distribution bottlenecks and imbalances in worker utilization.
- Token/chunk throughput — tracking the number of tokens or response chunks generated per second helps identify latency or throughput regressions.
- Distributed tracing — to pinpoint where requests fail or stall across components (gateway, broker, workers, etc.).
- Inference engine health checks — since inference processes can crash under rare conditions (e.g., bad input or extreme parameter values), proactive monitoring of liveness and readiness is critical.

Further Improvements

While our system is production-ready, there are still important challenges and opportunities for optimization:

- Using a distributed KV-cache to boost inference performance.
- Supporting request cancellation to conserve compute when outputs are no longer needed.
- Creating a simple model delivery pipeline for data science teams.

Conclusion

While building a reliable and provider-independent LLM serving system can seem complex at first, it doesn't require reinventing the wheel. Each component — streaming via SSE, task distribution through message brokers, and inference handled by runtimes like vLLM — serves a clear purpose and is grounded in existing, well-supported tools. With the right structure in place, it's possible to create a maintainable and adaptable setup that meets production requirements without unnecessary complexity.

In the next post, we'll explore more advanced topics such as distributed KV-caching, handling multiple models across replicas, and deployment workflows suited to ML-oriented teams.

Authors

Stanislav Shimovolos, Tochka
Maxim Afanasyev, Tochka

Acknowledgments

Dmitry Kryukov, work done at Tochka