Building a transparent LLM proxy with Cloudflare Workers and Hono
The core idea behind an LLM inference gateway is deceptively simple: intercept LLM API calls, forward them transparently, and capture data along the way. Implementing it cleanly took more thought than expected.
Here's the core of how we built the proxy layer with Cloudflare Workers and Hono.
Why Cloudflare Workers
The requirements were:
- Minimal latency overhead (we're in the critical path of every inference request)
- Globally distributed (customers are worldwide)
- Support for streaming responses (most LLM APIs use SSE)
- Easy to deploy and iterate
Workers checks all of these. The cold start story is good (sub-millisecond), the network edge distribution is exactly what we want for latency, and streaming support is solid.
The basic proxy structure
import { Hono } from 'hono';

const app = new Hono();

app.all('/v1/*', async (c) => {
  const upstreamUrl = resolveUpstream(c.req.path);

  // Forward the raw request: method, translated headers, and the
  // unread body stream (c.req.raw is the underlying Request object,
  // so the body is passed through without buffering).
  const response = await fetch(upstreamUrl, {
    method: c.req.method,
    headers: buildUpstreamHeaders(c.req.raw.headers),
    body: c.req.raw.body,
  });

  // Pass the upstream body through untouched; only the headers are
  // rewritten on the way back down.
  return new Response(response.body, {
    status: response.status,
    headers: buildDownstreamHeaders(response.headers),
  });
});
The customer just changes their base_url from https://api.openai.com to https://your-worker.workers.dev and everything else stays the same. No SDK changes.
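resolveUpstream isn't shown above; a minimal single-vendor sketch might look like the following. The real routing logic picks a vendor first, and the hard-coded base URL here is purely an assumption for illustration:

```typescript
// Hypothetical single-vendor version of resolveUpstream. In the
// multi-vendor setup, the base URL would come from routing config.
const UPSTREAM_BASE = 'https://api.openai.com';

function resolveUpstream(path: string): string {
  // The proxy mounts at /v1/* and upstream paths mirror it 1:1,
  // so forwarding is plain concatenation.
  return `${UPSTREAM_BASE}${path}`;
}
```

Because the paths map 1:1, the proxy never has to parse or rewrite the URL, which keeps it out of the way of SDK version changes.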
Handling streaming
This is where it gets interesting. SSE streaming means the response body is a ReadableStream that the client consumes progressively. You can't buffer the whole thing before forwarding — that kills the streaming UX.
The solution is ReadableStream's tee(): fork the body into two identical streams, forward one to the client, and consume the other for logging.
const [clientStream, logStream] = response.body!.tee();

// Forward to client immediately
const clientResponse = new Response(clientStream, {
  status: response.status,
  headers: response.headers,
});

// Consume log stream asynchronously (non-blocking)
c.executionCtx.waitUntil(
  consumeAndLog(logStream, requestContext)
);

return clientResponse;
The key here is waitUntil. Cloudflare Workers have a strict execution model: once you return from the fetch handler, the response is sent and the worker is free to shut down. waitUntil tells the runtime to keep the worker alive until the background work (logging) finishes, even after the response has gone out.
Without it, the runtime could cancel the log consumer as soon as the response completed, and you'd lose data.
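consumeAndLog isn't shown above either. A minimal sketch, assuming requestContext carries a request ID and with logSink standing in for whatever durable log pipeline sits behind the proxy:

```typescript
// Hypothetical helper: drain a byte stream into a string.
async function drainToText(stream: ReadableStream<Uint8Array>): Promise<string> {
  const decoder = new TextDecoder();
  const reader = stream.getReader();
  let text = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true handles multi-byte characters split across chunks
    text += decoder.decode(value, { stream: true });
  }
  return text + decoder.decode(); // flush any buffered bytes
}

// Placeholder sink; in production this would write to a queue, KV,
// or an analytics pipeline rather than the console.
async function logSink(entry: { requestId: string; body: string }): Promise<void> {
  console.log(JSON.stringify(entry));
}

async function consumeAndLog(
  stream: ReadableStream<Uint8Array>,
  ctx: { requestId: string }
): Promise<void> {
  const body = await drainToText(stream);
  await logSink({ requestId: ctx.requestId, body });
}
```

Note that the raw SSE payload is captured here as-is; parsing the data: events back into token deltas can happen later, off the critical path.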
Multi-vendor routing
OpenAI and Anthropic have different API shapes. We normalize at the edge:
function buildUpstreamHeaders(
  incoming: Headers,
  vendor: Vendor
): Headers {
  const headers = new Headers(incoming);

  if (vendor === 'anthropic') {
    // Anthropic expects the key in x-api-key rather than a Bearer
    // token, and requires an explicit API version header.
    const apiKey = headers.get('authorization')?.replace('Bearer ', '');
    headers.set('x-api-key', apiKey ?? '');
    headers.delete('authorization');
    headers.set('anthropic-version', '2023-06-01');
  }

  return headers;
}
The client sends OpenAI-format requests. If they're routed to Anthropic, the worker translates the headers before forwarding.
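Concretely, here's what that translation does to a request's headers. The helper is repeated so the snippet runs standalone, with Vendor assumed to be a simple string union:

```typescript
// Vendor as a string union is an assumption; the post doesn't show
// its definition.
type Vendor = 'openai' | 'anthropic';

function buildUpstreamHeaders(incoming: Headers, vendor: Vendor): Headers {
  const headers = new Headers(incoming);
  if (vendor === 'anthropic') {
    const apiKey = headers.get('authorization')?.replace('Bearer ', '');
    headers.set('x-api-key', apiKey ?? '');
    headers.delete('authorization');
    headers.set('anthropic-version', '2023-06-01');
  }
  return headers;
}

// The client sends an OpenAI-style Authorization header...
const incoming = new Headers({ authorization: 'Bearer sk-test-123' });
const out = buildUpstreamHeaders(incoming, 'anthropic');
// ...and Anthropic receives x-api-key plus anthropic-version,
// with the Authorization header stripped.
```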
Automatic failover
If the primary vendor returns a 429 or 5xx, we retry with a secondary:
async function fetchWithFailover(
  request: ProxyRequest,
  vendors: Vendor[]
): Promise<Response> {
  for (const vendor of vendors) {
    const response = await forwardToVendor(request, vendor);

    // Fail over on rate limits (429) and server errors (5xx);
    // anything else is returned to the client as-is.
    if (response.status !== 429 && response.status < 500) {
      return response;
    }
  }
  throw new Error('All vendors failed');
}
Simple but effective. Rate limits and transient errors from a single provider no longer surface to the customer.
What surprised me
The Cloudflare Workers runtime is more constrained than I expected — no filesystem, limited CPU time per request, memory limits. But the constraints pushed us toward cleaner design. The proxy is stateless by design; all durability is in the async log pipeline. That's a feature, not a limitation.
Hono's request handling is lightweight and fast. The middleware ecosystem is good enough for our needs. Deploying a new version takes seconds. For this use case, it's been a great fit.