Cutting LLM costs by 20%: what actually moved the needle
We were burning through LLM API budget faster than expected. After a few weeks of profiling and experiments, we got costs down by about 20%. Here's what actually worked versus what sounded good in theory but didn't.
What didn't work
Switching to a smaller model entirely — We tried routing all conversations to a cheaper model. Response quality dropped noticeably for complex tasks. Users complained. We switched back.
Aggressive response length limits — Setting max_tokens low caused the model to truncate mid-thought. Worse UX than the cost savings were worth.
Compressing system prompts with abbreviations — We replaced common words with shortened forms. It didn't help enough (tokenizers work on subword units, not whole words, so abbreviations often cost the same or more) and it made the prompts unmaintainable.
What actually worked
Token counting before sending — We started logging exact input and output token counts per request. Before doing this, we were guessing. With real data, the high-cost outliers became obvious: a few edge cases were sending 6x the average context size.
Once we identified those patterns, we could handle them specifically instead of applying blunt optimizations everywhere.
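The logging itself can be sketched roughly like this. The `TokenLog` shape and the outlier threshold are illustrative, not our exact code; most LLM APIs return per-request usage counts you can record directly.

```typescript
// Illustrative per-request token logging and outlier detection.
interface TokenLog {
  requestId: string;
  inputTokens: number;
  outputTokens: number;
}

class TokenTracker {
  private logs: TokenLog[] = [];

  record(log: TokenLog): void {
    this.logs.push(log);
  }

  // Requests whose input context exceeds `factor` times the average input
  // size. This is how the 6x-context edge cases showed up for us.
  outliers(factor: number): TokenLog[] {
    if (this.logs.length === 0) return [];
    const avg =
      this.logs.reduce((sum, l) => sum + l.inputTokens, 0) / this.logs.length;
    return this.logs.filter(l => l.inputTokens > factor * avg);
  }
}
```

In practice we dumped these logs into our existing metrics pipeline; the class above is just the shape of the data and the query.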
Conversation history pruning — We were naively appending every turn to the conversation history. Some conversations ran 50+ turns. The middle turns were mostly not relevant to the current query.
We implemented a sliding window: keep the system prompt, the last N turns, and a compressed summary of earlier context. The summary is generated once when the window slides and cached.
function buildContext(history: Turn[], systemPrompt: string): Message[] {
  const recentTurns = history.slice(-WINDOW_SIZE);
  const olderTurns = history.slice(0, -WINDOW_SIZE);
  // getSummary caches per conversation, so the summary is only
  // regenerated when the window actually slides.
  const summary = olderTurns.length > 0
    ? [{ role: 'system', content: `Earlier context: ${getSummary(olderTurns)}` }]
    : [];
  return [
    { role: 'system', content: systemPrompt },
    ...summary,
    ...recentTurns,
  ];
}
This alone cut context size by 30-40% for long conversations.
Caching identical prompts — Some users were sending the same system configurations repeatedly. We added a hash-based cache at the session level. Not prompt caching in the API sense — just avoiding redundant full roundtrips for duplicate inputs.
Removing tool definitions for inapplicable tools — We had 15+ tools defined in the tool manifest. For most conversations, fewer than 5 were relevant. Each tool definition adds tokens to the system prompt.
We added a routing step: a lightweight classifier checks the conversation intent and filters the tool manifest to only include likely-relevant tools. The classifier itself is cheap (fast model + simple rules). Net result: smaller prompts for the main model.
async function selectRelevantTools(
  userMessage: string,
  allTools: ToolDefinition[]
): Promise<ToolDefinition[]> {
  const intent = await classifyIntent(userMessage); // fast, cheap call
  return allTools.filter(tool => tool.categories.includes(intent));
}
The unglamorous reality
Most of the savings came from better observability first. You can't optimize what you can't measure. Token counting, per-conversation cost tracking, and identifying outlier patterns — that's where the work was.
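The per-conversation cost tracking reduces to trivial arithmetic once you have the token counts. A sketch, with placeholder rates (the numbers below are illustrative, not any provider's actual pricing):

```typescript
// Per-million-token pricing, the way most providers quote it.
interface Rates {
  inputPerMillion: number;  // USD per 1M input tokens
  outputPerMillion: number; // USD per 1M output tokens
}

// Cost of a single request given its token usage.
function requestCost(
  inputTokens: number,
  outputTokens: number,
  rates: Rates
): number {
  return (
    (inputTokens / 1_000_000) * rates.inputPerMillion +
    (outputTokens / 1_000_000) * rates.outputPerMillion
  );
}

// Roll requests up to a conversation total.
function conversationCost(
  usage: Array<{ inputTokens: number; outputTokens: number }>,
  rates: Rates
): number {
  return usage.reduce(
    (sum, u) => sum + requestCost(u.inputTokens, u.outputTokens, rates),
    0
  );
}
```

The arithmetic is nothing; the point is attributing it per conversation, which is what made the outlier patterns visible.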
The actual optimizations are not that complex. The hard part is knowing which ones to apply where.