At 47 tools, the IntelliSupply agent started picking wrong. Not occasionally — 30% of the time. And the API bill that month was $2,700.
I went through the logs expecting a prompt bug or a bad model response. The actual cause was simpler and more embarrassing: I had given the agent 47 tool definitions on every single call, whether it needed them or not. The model was drowning. Selection accuracy collapses under high tool density — Google's research (arXiv:2512.08296) puts the degradation at 39–70% on sequential tasks with 16+ tools. I was at 47. I was also paying for every one of those schemas on every call.
The fix dropped costs to $27/month and brought selection accuracy back above 97%. Here's exactly what I changed and why.
The Bill That Made Me Pay Attention
Call it 50 tools for round numbers: 50 × ~300 tokens per schema = 15,000 tokens before the conversation even starts. At Claude Sonnet pricing, with 10,000 agentic calls per month:
| Scenario | Tool Tokens/Call | Monthly Cost |
|---|---|---|
| 50 tools, no optimization | 15,000 | ~$2,700 |
| Dynamic selection (top 5) | 1,500 | ~$270 |
| Dynamic selection + caching | ~150 after warmup | ~$27 |
| Combined reduction | 99% | 99% |
The $2,700 number assumes 10K calls per month. If your system makes fewer calls, scale linearly; the ratio holds regardless of volume. The structural waste is the same whether you're at 100 calls or 100,000.
The fix is three layers stacked on top of each other. You can ship each layer independently and measure the improvement before adding the next.
The Three-Layer Architecture
- Layer 1 — Semantic retrieval: embed the query, find the top-K relevant tools from your registry. The LLM never sees the other 45.
- Layer 2 — Prompt caching: cache the selected schemas. Reads cost 10% of normal price. Break-even at 1.4 reads per write.
- Layer 3 — Schema compression: trim every schema to its minimum valid form. PA-Tool (arXiv:2510.07248) shows 40% size reduction without accuracy loss.
Layer 1 — Semantic Tool Retrieval
The research result that changed how I think about this: with K=3 dynamic selection from a 121-tool library, you get 99.6% token reduction while maintaining a 97.1% hit rate (arXiv:2603.20313). Three tools out of 121. Nearly every token saved, nearly every correct tool found.
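The selection layer itself is small. Here's a minimal sketch of top-K retrieval by cosine similarity, assuming tool embeddings were precomputed at registration time (the `IndexedTool` shape and `selectTools` name are mine, not from any library):

```typescript
interface IndexedTool {
  name: string;
  embedding: number[]; // precomputed at registration from the tool's enriched text
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Rank every registered tool against the query embedding, keep the top K.
// The other tools never reach the LLM.
function selectTools(queryEmbedding: number[], registry: IndexedTool[], k = 5): string[] {
  return registry
    .map((tool) => ({ name: tool.name, score: cosine(queryEmbedding, tool.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((t) => t.name);
}
```

In production you'd delegate the similarity search to pgvector or Pinecone rather than scanning in memory, but the contract is the same: query embedding in, K tool names out.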
The accuracy of retrieval lives or dies on what you embed. A bare description like "Gets weather data." matches nothing specific and retrieves poorly. An enriched entry works:
```typescript
// Shape assumed by the method below.
interface ToolDefinition {
  name: string;
  description: string;
  tags?: string[];
  useCases?: string[];
  category?: string;
}

private buildEmbeddingText(tool: ToolDefinition): string {
  return [
    `Tool: ${tool.name}`,
    `Description: ${tool.description}`,
    tool.tags ? `Tags: ${tool.tags.join(', ')}` : '',
    tool.useCases ? `Use when: ${tool.useCases.join('; ')}` : '',
    tool.category ? `Category: ${tool.category}` : '',
  ].filter(Boolean).join('\n');
}
```
The difference between sparse and enriched for get_weather:
| Field | Sparse | Enriched |
|---|---|---|
| description | "Gets weather data." | "Fetch current weather and forecast for any city." |
| tags | — | weather, forecast, temperature, outdoor |
| useCases | — | "User asks about weather", "planning outdoor activities" |
| Retrieval accuracy | Poor | Strong cross-query generalization |
arXiv:2412.03573 shows LLM-generated query expansion at registration time — asking the model to produce 5 different phrasings of a tool's purpose — yields the largest retrieval improvement. Run it once per tool. One-time cost, permanent gain.
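A sketch of that registration-time expansion step. The LLM call is injected as a plain function so the sketch stays client-agnostic; `generate` is hypothetical, standing in for whatever model client you already use:

```typescript
// `generate` stands in for an LLM call (an assumption; wire in your own client
// and parse its output into a string array).
type Generate = (prompt: string) => string[];

// Run once per tool at registration: ask for alternate phrasings of the
// tool's purpose and fold them into the text that gets embedded.
function expandedEmbeddingText(name: string, description: string, generate: Generate): string {
  const phrasings = generate(
    `Give 5 different ways a user might phrase a request handled by this tool.\n` +
    `Tool: ${name}\nDescription: ${description}`
  );
  return [
    `Tool: ${name}`,
    `Description: ${description}`,
    ...phrasings.map((p) => `Ask: ${p}`),
  ].join("\n");
}
```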
Dependency Resolution
Some tools can't be called without another one in the set. When the retriever picks analyze_document, it must also pull in upload_document; otherwise the agent plans a call that can't succeed.
Implement this as a BFS walk of a requires: string[] field on each tool. The walk adds dependencies until the set is closed. I've been burned by missing this — the model confidently plans a two-step call sequence, the first step works, the second fails because the dependency tool wasn't available.
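The BFS walk described above, as a sketch (the `ToolNode` shape is an assumption; adapt it to your registry):

```typescript
interface ToolNode {
  name: string;
  requires?: string[]; // names of tools this one depends on
}

// Expand the retrieved set until it is closed under `requires`:
// a plain breadth-first walk over the dependency edges.
function resolveDependencies(selected: string[], registry: Map<string, ToolNode>): string[] {
  const closed = new Set<string>(selected);
  const queue = [...selected];
  while (queue.length > 0) {
    const current = registry.get(queue.shift()!);
    for (const dep of current?.requires ?? []) {
      if (!closed.has(dep)) {
        closed.add(dep);
        queue.push(dep); // dependencies can have dependencies of their own
      }
    }
  }
  return [...closed];
}
```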
Layer 2 — Prompt Caching
Tool schemas are ideal cache targets: static, expensive, and reused across thousands of calls.
| Event | Cost |
|---|---|
| Cache write (1st call) | 1.25× input price |
| Cache read (all subsequent) | 0.10× input price |
| Break-even reads | 1.4 per write |
| Achievable hit rate in production | 80–84% |
The one rule that matters: static content before dynamic content. System prompt and tool schemas must come before the user's message. If you put the user message first, every call misses the cache entirely because the prefix changes each time.
ProjectDiscovery achieved 59% total cost reduction with this pattern alone. Cache hit rates jump from ~7% to 80%+ once you fix the content order.
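Concretely, the rule maps to a request shape like this. A sketch assuming Anthropic's Messages API, where a cache_control breakpoint on the last tool marks everything before it as cacheable; the model name and schemas are illustrative, so check the current docs for exact field names:

```typescript
// Request shape only; no SDK call is made here.
const request = {
  model: "claude-sonnet-4-5", // illustrative model name
  max_tokens: 1024,
  // Static prefix first: system prompt, then tools. The cache breakpoint
  // goes on the LAST tool so the entire prefix up to it is cached.
  system: "You are a supply-chain assistant.",
  tools: [
    {
      name: "get_weather",
      description: "Fetch current weather and forecast for any city.",
      input_schema: { type: "object", properties: { city: { type: "string" } }, required: ["city"] },
    },
    {
      name: "get_location",
      description: "Resolve a place name to coordinates.",
      input_schema: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
      cache_control: { type: "ephemeral" }, // breakpoint: everything above is cacheable
    },
  ],
  // Dynamic content last, so it never invalidates the cached prefix.
  messages: [{ role: "user", content: "Will it rain in Rotterdam tomorrow?" }],
};
```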
Layer 3 — Schema Compression
Every field the LLM ignores is wasted tokens. PA-Tool (arXiv:2510.07248) found 40% size reduction is achievable without selection accuracy loss. I've seen 83% on individual schemas with heavy examples arrays.
| Schema Field | Keep? | Reason |
|---|---|---|
| name | Always | Core identifier |
| description | Always — cap at 150 chars | Selection signal |
| parameters.type | Always | Required |
| parameters.description | Yes — cap at 80 chars | Intent |
| enum values | Yes | Hard constraints |
| required array | Yes | Prevents missing args |
| $schema URL | Drop | LLMs ignore it |
| title on params | Drop | Redundant with name |
| additionalProperties | Drop | LLMs ignore it |
| examples array | Drop | High token cost, marginal gain |
| default values | Drop | Explain in description instead |
Don't shorten parameter names to save tokens. PA-Tool found names matching the model's pretraining vocabulary outperform abbreviated names — get_user_data beats fetchUsrDt even though it's longer. Compress descriptions, not names.
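The keep/drop table translates to a small recursive filter. A sketch with one description cap for simplicity; the 150/80 split by nesting level is an easy extension, and the function name is mine:

```typescript
// Strip fields the model ignores and cap description length.
// The drop list mirrors the table above.
function compressSchema(schema: Record<string, unknown>, maxDesc = 150): Record<string, unknown> {
  const DROP = new Set(["$schema", "title", "examples", "additionalProperties", "default"]);
  const walk = (node: unknown): unknown => {
    if (Array.isArray(node)) return node.map(walk);
    if (node === null || typeof node !== "object") return node;
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(node)) {
      if (DROP.has(key)) continue; // noise field: drop it
      out[key] = key === "description" && typeof value === "string"
        ? value.slice(0, maxDesc) // keep the selection signal, bounded
        : walk(value);
    }
    return out;
  };
  return walk(schema) as Record<string, unknown>;
}
```

Note that `enum` and `required` pass through untouched; they're hard constraints, not decoration.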
The AutoTool Pattern: Co-Occurrence Graphs
Once semantic retrieval is working, layer on the AutoTool approach (arXiv:2511.14650): track which tools get called together and use co-occurrence to pre-expand the retrieval set.
When the retriever selects get_weather, the graph knows it co-occurs heavily with get_location and get_timezone — so those get added without a second embedding lookup. AutoTool reports 10–40% additional token reduction and 15–25% fewer LLM calls on top of semantic retrieval alone.
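A minimal sketch of that graph: record which tools fired together per run, then expand a retrieved set from the counts. Class and method names are mine, and the threshold is a knob to tune, not a recommendation:

```typescript
// Count how often tool pairs appear in the same run, then use those counts
// to pre-expand a retrieved set without a second embedding lookup.
class CoOccurrenceGraph {
  private counts = new Map<string, Map<string, number>>();

  // Call once per completed run with the tools the agent actually used.
  record(toolsUsedInRun: string[]): void {
    for (const a of toolsUsedInRun) {
      for (const b of toolsUsedInRun) {
        if (a === b) continue;
        const row = this.counts.get(a) ?? new Map<string, number>();
        row.set(b, (row.get(b) ?? 0) + 1);
        this.counts.set(a, row);
      }
    }
  }

  // Add any tool that co-occurred with a selected tool at least `minCount` times.
  expand(selected: string[], minCount = 3): string[] {
    const result = new Set(selected);
    for (const name of selected) {
      for (const [neighbor, count] of this.counts.get(name) ?? []) {
        if (count >= minCount) result.add(neighbor);
      }
    }
    return [...result];
  }
}
```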
What to Monitor
You can't optimize what you don't measure. The numbers I track per agent run:
| Metric | Target | What it tells you |
|---|---|---|
| Tool schema % of input tokens | < 20% | If higher: schemas too verbose or too many tools |
| Tool utilization rate | > 60% | If lower: you're injecting irrelevant tools |
| Cache hit rate | > 70% | If lower: static/dynamic content order is wrong |
| Tool selection accuracy | > 95% | Did the right tools appear in retrieved set? |
| Avg tools injected per call | 3–6 | Baseline for optimization comparison |
The utilization rate is the most useful signal. If you're injecting 5 tools and the agent uses 1 of them, either your retrieval is wrong or the task doesn't need tools at all.
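Two of those signals fall straight out of per-run logs. A sketch, assuming you already record token counts and tool names per run (the `RunStats` shape is mine):

```typescript
interface RunStats {
  toolSchemaTokens: number;  // tokens spent on tool schemas this call
  totalInputTokens: number;  // total input tokens this call
  toolsInjected: string[];   // tools placed in the context
  toolsCalled: string[];     // tools the agent actually invoked
}

// Schema share of input (target < 20%) and utilization rate (target > 60%).
function runMetrics(run: RunStats): { schemaShare: number; utilization: number } {
  const used = run.toolsCalled.filter((t) => run.toolsInjected.includes(t)).length;
  return {
    schemaShare: run.toolSchemaTokens / run.totalInputTokens,
    utilization: used / Math.max(run.toolsInjected.length, 1),
  };
}
```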
How to Roll This Out
I wouldn't build all three layers simultaneously. Each layer has measurable impact and its own failure modes — debug one at a time.
Start this week: audit all tool schemas, remove $schema, title, examples, additionalProperties. Measure baseline token count per run. Add cache_control on the last tool in your existing tool array. These two changes together often cut costs 40–50% before you touch retrieval.
Next two weeks: build the tool registry with pgvector or Pinecone, enrich tool metadata (tags, useCases, category on every tool), implement top-K cosine retrieval, add BFS dependency resolution. Measure hit rate. If it's below 90%, your tool descriptions are too sparse — enrich them.
After that: add the co-occurrence graph (record which tools were used per run), implement graph expansion, A/B test K=3 vs K=5 vs K=8 for your query distribution.
Research Summary
| Finding | Source | Impact |
|---|---|---|
| K=3 retrieval from 121 tools: 99.6% token reduction, 97.1% hit rate | arXiv:2603.20313 | Core technique |
| Agents collapse 39–70% on sequential tasks with 16+ tools | arXiv:2512.08296 | Why this matters |
| Co-occurrence graph: 10–40% more reduction, 15–25% fewer calls | arXiv:2511.14650 | AutoTool layer |
| 40% schema compression without accuracy loss | arXiv:2510.07248 | PA-Tool |
| Cache hit rates of 80–84% achievable in multi-user systems | Anthropic docs | Layer 2 |
| LLM query expansion at registration improves retrieval | arXiv:2412.03573 | Embedding quality |
Resources
- Semantic Tool Discovery for LLMs (arXiv:2603.20313)
- AutoTool: Efficient Tool Selection (arXiv:2511.14650)
- PA-Tool: Adapt Schemas to Models (arXiv:2510.07248)
- Scaling Agent Systems — Google (arXiv:2512.08296)
- Anthropic Prompt Caching docs
- ProjectDiscovery: 59% cost reduction case study
- Berkeley Function Calling Leaderboard
The most expensive mistake in agent development isn't using the wrong model — it's treating your tool registry as a flat list. A flat list scales to 5 tools. A retrieval system scales to 500 on the same token budget. The moment you have more than 15 tools, your registry is a search problem. Treat it like one.