Cutting LLM costs by 20%: what actually moved the needle
We were burning through LLM API budget faster than expected. After a few weeks of profiling and experiments, we got costs down by about 20%. Here's what actually worked versus what sounded good in theory but didn't.
What didn't work
Switching to a smaller model entirely — We tried routing all conversations to a cheaper model. Response quality dropped noticeably for complex tasks. Users complained. We switched back.
Aggressive response length limits — Setting max_tokens low caused the model to truncate mid-thought. Worse UX than the cost savings were worth.
Compressing system prompts with abbreviations — We replaced common words with shortened forms. It didn't help enough (tokenizers work on subword units, not whole words, so abbreviations often cost the same or more) and it made the prompts unmaintainable.
What actually worked
Token counting before sending — We started logging exact input and output token counts per request. Before doing this, we were guessing. With real data, the high-cost outliers became obvious: a few edge cases were sending 6x the average context size.
Once we identified those patterns, we could handle them specifically instead of applying blunt optimizations everywhere.
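The logging itself can be sketched roughly like this. The `TokenLog` shape and the outlier threshold are illustrative, not our exact code; most LLM APIs return per-request usage counts you can record directly.

```typescript
// Illustrative per-request token logging and outlier detection.
interface TokenLog {
  requestId: string;
  inputTokens: number;
  outputTokens: number;
}

class TokenTracker {
  private logs: TokenLog[] = [];

  record(log: TokenLog): void {
    this.logs.push(log);
  }

  // Requests whose input context exceeds `factor` times the average input
  // size. This is how the 6x-context edge cases showed up for us.
  outliers(factor: number): TokenLog[] {
    if (this.logs.length === 0) return [];
    const avg =
      this.logs.reduce((sum, l) => sum + l.inputTokens, 0) / this.logs.length;
    return this.logs.filter(l => l.inputTokens > factor * avg);
  }
}
```

In practice we dumped these logs into our existing metrics pipeline; the class above is just the shape of the data and the query.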
Conversation history pruning — We were naively appending every turn to the conversation history. Some conversations ran 50+ turns. The middle turns were mostly not relevant to the current query.
We implemented a sliding window: keep the system prompt, the last N turns, and a compressed summary of earlier context. The summary is generated once when the window slides and cached.
function buildContext(history: Turn[], systemPrompt: string): Message[] {
  const recentTurns = history.slice(-WINDOW_SIZE);
  const olderTurns = history.slice(0, -WINDOW_SIZE);
  // getSummary caches per conversation, so the summary is only
  // regenerated when the window actually slides.
  const summary = olderTurns.length > 0
    ? [{ role: 'system', content: `Earlier context: ${getSummary(olderTurns)}` }]
    : [];
  return [
    { role: 'system', content: systemPrompt },
    ...summary,
    ...recentTurns,
  ];
}
This alone cut context size by 30-40% for long conversations.
Caching identical prompts — Some users were sending the same system configurations repeatedly. We added a hash-based cache at the session level. Not prompt caching in the API sense — just avoiding redundant full roundtrips for duplicate inputs.
Removing tool definitions for inapplicable tools — We had 15+ tools defined in the tool manifest. For most conversations, fewer than 5 were relevant. Each tool definition adds tokens to the system prompt.
We added a routing step: a lightweight classifier checks the conversation intent and filters the tool manifest to only include likely-relevant tools. The classifier itself is cheap (fast model + simple rules). Net result: smaller prompts for the main model.
async function selectRelevantTools(
  userMessage: string,
  allTools: ToolDefinition[]
): Promise<ToolDefinition[]> {
  const intent = await classifyIntent(userMessage); // fast, cheap call
  return allTools.filter(tool => tool.categories.includes(intent));
}
The unglamorous reality
Most of the savings came from better observability first. You can't optimize what you can't measure. Token counting, per-conversation cost tracking, and identifying outlier patterns — that's where the work was.
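The per-conversation cost tracking reduces to trivial arithmetic once you have the token counts. A sketch, with placeholder rates (the numbers below are illustrative, not any provider's actual pricing):

```typescript
// Per-million-token pricing, the way most providers quote it.
interface Rates {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Cost of a single request given its token usage.
function requestCost(
  inputTokens: number,
  outputTokens: number,
  rates: Rates
): number {
  return (
    (inputTokens / 1_000_000) * rates.inputPerMillion +
    (outputTokens / 1_000_000) * rates.outputPerMillion
  );
}

// Roll requests up to a conversation total.
function conversationCost(
  usage: Array<{ inputTokens: number; outputTokens: number }>,
  rates: Rates
): number {
  return usage.reduce(
    (sum, u) => sum + requestCost(u.inputTokens, u.outputTokens, rates),
    0
  );
}
```

The arithmetic is nothing; the point is attributing it per conversation, which is what made the outlier patterns visible.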
The actual optimizations are not that complex. The hard part is knowing which ones to apply where.