Serving Encoder LLMs

In the current ML landscape, the roar of massive decoder models like GPT, Qwen, or DeepSeek is deafening. The engineering world is rightly obsessed with solving the complex challenges of serving them: managing KV caches, streaming tokens, and optimizing for mind-boggling parameter counts.


But at Pi Labs, a significant portion of our modeling needs fall on the other, often-overlooked side of the Transformer architecture: the encoder.


Encoder models are the workhorses of tasks like text classification, semantic search, and similarity scoring.  This post dives into how these architectures differ, and why those differences lead to significantly different tradeoffs when assembling a good inference stack.

The Blessing of Simplicity: Encoders vs. Decoders

To understand our design choices, it's crucial to appreciate the difference between serving an encoder and a decoder.


  • Decoder models (like GPT) are autoregressive. They take an input and generate an output one token at a time. Each new token depends on all the previous ones. Serving them efficiently requires complex state management (the KV cache), token streaming logic, sampling schemes, and tolerating variable response lengths.
  • Encoder models (like BERT or the sentence-transformers family) are much more straightforward. They process the entire input sequence at once and produce a fixed-size output. For our use cases, this is typically a single classification score or a dense vector embedding.
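To make this concrete, here is a minimal sketch of both flavors of encoder inference using the sentence-transformers library. The model names are illustrative examples, not the models we run in production:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer

# Embedding model: one input -> one fixed-size dense vector (384 dims for this model).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
vector = embedder.encode("how do encoder models differ from decoders?")

# Cross-encoder: one (query, document) pair -> one relevance score.
scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
score = scorer.predict([
    ("how do encoder models differ from decoders?",
     "Encoders process the whole input at once and emit a fixed-size output."),
])
```

Either way, the shape of the work is the same: the full input goes in, a fixed-size result comes out, and there is nothing to generate step by step.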


This "one input -> one fixed-size output" eliminates a lot of complexity:


  • Stateless Requests: Each request is independent. There's no conversational history or KV cache to maintain between calls.
  • Predictable Compute: The computational load for a given input size is consistent, making performance tuning and capacity planning much easier.
  • Simple Output: We don't need to stream tokens back to the client. The entire result is available at once.
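Concretely, a scoring service can be close to trivial. Below is a minimal sketch of what a stateless batch-scoring endpoint might look like, using FastAPI and sentence-transformers. The route, request schema, and model are illustrative assumptions, not our production API:

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import CrossEncoder

app = FastAPI()
# Loaded once at startup; every request after that is independent.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

class ScoreRequest(BaseModel):
    pairs: list[tuple[str, str]]  # (query, document) pairs

class ScoreResponse(BaseModel):
    scores: list[float]

@app.post("/score", response_model=ScoreResponse)
def score(req: ScoreRequest) -> ScoreResponse:
    # No KV cache, no streaming, no session state: one batch in, one batch of scores out.
    return ScoreResponse(scores=[float(s) for s in model.predict(req.pairs)])
```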


This simplicity is great because it eliminates a lot of the modules that bloat inference offerings, but there is still a problem.

The Wrinkle: Performance Expectations

Decoders hide their latency behind token streaming in chat interfaces: as long as they’re faster than a human can read, or the generation is exceptionally valuable, it’s good enough.  Encoders have nowhere to hide, because they sit in the critical path of latency-sensitive systems.  These include:


  • Annotating queries or reranking documents in RAG search systems
  • Guardrails or safety checks in realtime systems
  • Making decisions in agentic workflows
  • Lots more…


The much better baseline speed of encoders also means that a different set of bottlenecks starts to dominate latency profiles.


As an example, consider reranking 200 query-document pairs in a RAG search application.  You need to worry about the following (a code sketch follows the list):

  • Chunking: Break the pairs into chunks that can be executed on your model servers. Larger chunks are almost always more efficient per request, but fewer chunks means less work that can run in parallel.  The optimal size depends on document length, fleet size, and current load.
  • Load Distribution: “Stragglers”, chunks that are still outstanding after the rest have finished, can dominate your overall latency because they hold up completion of the reranking operation.  Poor load balancing makes it likely that one of your chunks lands on a “hot” server.
  • Fault Recovery: Dropped requests are inevitable, so some strategy for quickly retrying failed chunks is important.
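Here is a rough sketch of how these pieces might fit together: fixed-size chunks, fan-out across a small fleet, and retries that move a failed chunk to a different server. The server URLs, chunk size, and retry policy are all illustrative assumptions, and a real system would use load-aware routing rather than index-based placement:

```python
import asyncio

import httpx

# Hypothetical fleet of encoder servers exposing the /score endpoint sketched above.
SERVERS = ["http://encoder-0:8000/score", "http://encoder-1:8000/score"]
CHUNK_SIZE = 32    # tune against document length, fleet size, and current load
MAX_RETRIES = 2

async def score_chunk(client: httpx.AsyncClient, idx: int,
                      chunk: list[tuple[str, str]], attempt: int = 0) -> list[float]:
    # Naive placement: the chunk index picks a server; a retry shifts to a different one.
    url = SERVERS[(idx + attempt) % len(SERVERS)]
    try:
        resp = await client.post(url, json={"pairs": chunk}, timeout=1.0)
        resp.raise_for_status()
        return resp.json()["scores"]
    except httpx.HTTPError:
        if attempt < MAX_RETRIES:
            # Dropped or straggling chunk: retry quickly rather than waiting it out.
            return await score_chunk(client, idx, chunk, attempt + 1)
        raise

async def rerank(pairs: list[tuple[str, str]]) -> list[float]:
    chunks = [pairs[i:i + CHUNK_SIZE] for i in range(0, len(pairs), CHUNK_SIZE)]
    async with httpx.AsyncClient() as client:
        per_chunk = await asyncio.gather(
            *(score_chunk(client, i, c) for i, c in enumerate(chunks))
        )
    # Flatten back into the original pair order.
    return [s for scores in per_chunk for s in scores]

# e.g. scores = asyncio.run(rerank(two_hundred_query_document_pairs))
```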


Another key bottleneck is tokenization.  This cost is negligible in decoder workflows, but it can be a significant fraction of the time in encoder workflows, especially if not pipelined with GPU operations.
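One way to hide that cost is to overlap tokenization of the next chunk with GPU execution of the current one. The sketch below does this with a single background thread and Hugging Face transformers; it assumes a fast (Rust-backed) tokenizer, which releases the GIL during batch encoding, and the model name is again illustrative:

```python
import torch
from concurrent.futures import ThreadPoolExecutor
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # illustrative reranker
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).to(device).eval()

def tokenize(batch: list[tuple[str, str]]):
    # CPU-bound work; running it in a thread lets it overlap with GPU execution.
    queries, docs = zip(*batch)
    return tokenizer(list(queries), list(docs),
                     padding=True, truncation=True, return_tensors="pt")

@torch.inference_mode()
def score_batches(batches: list[list[tuple[str, str]]]) -> list[float]:
    scores: list[float] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tokenize, batches[0])
        for i, _ in enumerate(batches):
            inputs = future.result()
            if i + 1 < len(batches):
                # Start tokenizing the next batch while the GPU runs this one.
                future = pool.submit(tokenize, batches[i + 1])
            outputs = model(**{k: v.to(device) for k, v in inputs.items()})
            scores.extend(outputs.logits.squeeze(-1).tolist())
    return scores
```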

What to do?

Model quality is likely your most important problem at first, so just use an off-the-shelf inference solution while you iterate on quality.


Next, worry about some of the problems highlighted here.  In our experience, simple solutions yield performance 2-3x better than the naive approach, making them well worth the effort without unduly increasing complexity.


Next time we’ll detail some of the solutions we deployed in this area.

© 2025, Pi Labs Inc.