At 47 tools, the IntelliSupply agent started picking wrong. Not occasionally — 30% of the time. And the API bill that month was $2,700.
I went through the logs expecting a prompt bug or a bad model response. The actual cause was simpler and more embarrassing: I had given the agent 47 tool definitions on every single call, whether it needed them or not. The model was drowning. Selection accuracy collapses under high tool density — Google's research (arXiv:2512.08296) puts the degradation at 39–70% on sequential tasks with 16+ tools. I was at 47. I was also paying for every one of those schemas on every call.
The fix dropped costs to $27/month and brought selection accuracy back above 97%. Here's exactly what I changed and why.
The Bill That Made Me Pay Attention
Call it 50 tools for round numbers: 50 × ~300 tokens per schema = 15,000 tokens before the conversation even starts. At Claude Sonnet pricing, with 10,000 agentic calls per month:
| Scenario | Tool Tokens/Call | Monthly Cost |
|---|---|---|
| 50 tools, no optimization | 15,000 | ~$2,700 |
| Dynamic selection (top 5) | 1,500 | ~$270 |
| Dynamic selection + caching | ~150 after warmup | ~$27 |
| Combined reduction | 99% | 99% |
The $2,700 number assumes 10K calls per month. If your system makes fewer calls, scale linearly; the ratio holds regardless of volume. The structural waste is the same whether you're at 100 calls or 100,000.
The fix is three layers stacked on top of each other. You can ship each layer independently and measure the improvement before adding the next.
The Three-Layer Architecture
- Layer 1 — Semantic retrieval: embed the query, find the top-K relevant tools from your registry. The LLM never sees the other 45.
- Layer 2 — Prompt caching: cache the selected schemas. Reads cost 10% of normal price. Break-even at 1.4 reads per write.
- Layer 3 — Schema compression: trim every schema to its minimum valid form. PA-Tool (arXiv:2510.07248) shows 40% size reduction without accuracy loss.
Layer 1 — Semantic Tool Retrieval
The research result that changed how I think about this: with K=3 dynamic selection from a 121-tool library, you get 99.6% token reduction while maintaining a 97.1% hit rate (arXiv:2603.20313). Three tools out of 121. Nearly every token saved, nearly every correct tool found.
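The selection layer itself is small. Here's a minimal sketch of top-K retrieval by cosine similarity, assuming tool embeddings were precomputed at registration time (the `IndexedTool` shape and `selectTools` name are mine, not from any library):

```typescript
interface IndexedTool {
  name: string;
  embedding: number[]; // precomputed at registration from the tool's enriched text
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Rank every registered tool against the query embedding, keep the top K.
// The other tools never reach the LLM.
function selectTools(queryEmbedding: number[], registry: IndexedTool[], k = 5): string[] {
  return registry
    .map((tool) => ({ name: tool.name, score: cosine(queryEmbedding, tool.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((t) => t.name);
}
```

In production you'd delegate the similarity search to pgvector or Pinecone rather than scanning in memory, but the contract is the same: query embedding in, K tool names out.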
The accuracy of retrieval lives or dies on what you embed. A bare description like "Gets weather data." matches nothing specific and retrieves poorly. An enriched entry works:
```typescript
// Shape assumed by the method below.
interface ToolDefinition {
  name: string;
  description: string;
  tags?: string[];
  useCases?: string[];
  category?: string;
}

private buildEmbeddingText(tool: ToolDefinition): string {
  return [
    `Tool: ${tool.name}`,
    `Description: ${tool.description}`,
    tool.tags ? `Tags: ${tool.tags.join(', ')}` : '',
    tool.useCases ? `Use when: ${tool.useCases.join('; ')}` : '',
    tool.category ? `Category: ${tool.category}` : '',
  ].filter(Boolean).join('\n');
}
```
The difference between sparse and enriched for get_weather:
| Field | Sparse | Enriched |
|---|---|---|
| description | "Gets weather data." | "Fetch current weather and forecast for any city." |
| tags | — | weather, forecast, temperature, outdoor |
| useCases | — | "User asks about weather", "planning outdoor activities" |
| Retrieval accuracy | Poor | Strong cross-query generalization |
arXiv:2412.03573 shows LLM-generated query expansion at registration time — asking the model to produce 5 different phrasings of a tool's purpose — yields the largest retrieval improvement. Run it once per tool. One-time cost, permanent gain.
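A sketch of that registration-time expansion step. The LLM call is injected as a plain function so the sketch stays client-agnostic; `generate` is hypothetical, standing in for whatever model client you already use:

```typescript
// `generate` stands in for an LLM call (an assumption; wire in your own client
// and parse its output into a string array).
type Generate = (prompt: string) => string[];

// Run once per tool at registration: ask for alternate phrasings of the
// tool's purpose and fold them into the text that gets embedded.
function expandedEmbeddingText(name: string, description: string, generate: Generate): string {
  const phrasings = generate(
    `Give 5 different ways a user might phrase a request handled by this tool.\n` +
    `Tool: ${name}\nDescription: ${description}`
  );
  return [
    `Tool: ${name}`,
    `Description: ${description}`,
    ...phrasings.map((p) => `Ask: ${p}`),
  ].join("\n");
}
```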
Dependency Resolution
Some tools can't be called without another one in the set. When the retriever picks analyze_document, it must also pull in upload_document; otherwise the agent plans a call that can't succeed.
Implement this as a BFS walk of a requires: string[] field on each tool. The walk adds dependencies until the set is closed. I've been burned by missing this — the model confidently plans a two-step call sequence, the first step works, the second fails because the dependency tool wasn't available.
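The BFS walk described above, as a sketch (the `ToolNode` shape is an assumption; adapt it to your registry):

```typescript
interface ToolNode {
  name: string;
  requires?: string[]; // names of tools this one depends on
}

// Expand the retrieved set until it is closed under `requires`:
// a plain breadth-first walk over the dependency edges.
function resolveDependencies(selected: string[], registry: Map<string, ToolNode>): string[] {
  const closed = new Set<string>(selected);
  const queue = [...selected];
  while (queue.length > 0) {
    const current = registry.get(queue.shift()!);
    for (const dep of current?.requires ?? []) {
      if (!closed.has(dep)) {
        closed.add(dep);
        queue.push(dep); // dependencies can have dependencies of their own
      }
    }
  }
  return [...closed];
}
```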
Layer 2 — Prompt Caching
Tool schemas are ideal cache targets: static, expensive, and reused across thousands of calls.
| Event | Cost |
|---|---|
| Cache write (1st call) | 1.25× input price |
| Cache read (all subsequent) | 0.10× input price |
| Break-even reads | 1.4 per write |
| Achievable hit rate in production | 80–84% |
The one rule that matters: static content before dynamic content. System prompt and tool schemas must come before the user's message. If you put the user message first, every call misses the cache entirely because the prefix changes each time.
ProjectDiscovery achieved 59% total cost reduction with this pattern alone. Cache hit rates jump from ~7% to 80%+ once you fix the content order.
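Concretely, the rule maps to a request shape like this. A sketch assuming Anthropic's Messages API, where a cache_control breakpoint on the last tool marks everything before it as cacheable; the model name and schemas are illustrative, so check the current docs for exact field names:

```typescript
// Request shape only; no SDK call is made here.
const request = {
  model: "claude-sonnet-4-5", // illustrative model name
  max_tokens: 1024,
  // Static prefix first: system prompt, then tools. The cache breakpoint
  // goes on the LAST tool so the entire prefix up to it is cached.
  system: "You are a supply-chain assistant.",
  tools: [
    {
      name: "get_weather",
      description: "Fetch current weather and forecast for any city.",
      input_schema: { type: "object", properties: { city: { type: "string" } }, required: ["city"] },
    },
    {
      name: "get_location",
      description: "Resolve a place name to coordinates.",
      input_schema: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
      cache_control: { type: "ephemeral" }, // breakpoint: everything above is cacheable
    },
  ],
  // Dynamic content last, so it never invalidates the cached prefix.
  messages: [{ role: "user", content: "Will it rain in Rotterdam tomorrow?" }],
};
```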
Layer 3 — Schema Compression
Every field the LLM ignores is wasted tokens. PA-Tool (arXiv:2510.07248) found 40% size reduction is achievable without selection accuracy loss. I've seen 83% on individual schemas with heavy examples arrays.
| Schema Field | Keep? | Reason |
|---|---|---|
| name | Always | Core identifier |
| description | Always — cap at 150 chars | Selection signal |
| parameters.type | Always | Required |
| parameters.description | Yes — cap at 80 chars | Intent |
| enum values | Yes | Hard constraints |
| required array | Yes | Prevents missing args |
| $schema URL | Drop | LLMs ignore it |
| title on params | Drop | Redundant with name |
| additionalProperties | Drop | LLMs ignore it |
| examples array | Drop | High token cost, marginal gain |
| default values | Drop | Explain in description instead |
Don't shorten parameter names to save tokens. PA-Tool found names matching the model's pretraining vocabulary outperform abbreviated names — get_user_data beats fetchUsrDt even though it's longer. Compress descriptions, not names.
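The keep/drop table translates to a small recursive filter. A sketch with one description cap for simplicity; the 150/80 split by nesting level is an easy extension, and the function name is mine:

```typescript
// Strip fields the model ignores and cap description length.
// The drop list mirrors the table above.
function compressSchema(schema: Record<string, unknown>, maxDesc = 150): Record<string, unknown> {
  const DROP = new Set(["$schema", "title", "examples", "additionalProperties", "default"]);
  const walk = (node: unknown): unknown => {
    if (Array.isArray(node)) return node.map(walk);
    if (node === null || typeof node !== "object") return node;
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(node)) {
      if (DROP.has(key)) continue; // noise field: drop it
      out[key] = key === "description" && typeof value === "string"
        ? value.slice(0, maxDesc) // keep the selection signal, bounded
        : walk(value);
    }
    return out;
  };
  return walk(schema) as Record<string, unknown>;
}
```

Note that `enum` and `required` pass through untouched; they're hard constraints, not decoration.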
The AutoTool Pattern: Co-Occurrence Graphs
Once semantic retrieval is working, layer on the AutoTool approach (arXiv:2511.14650): track which tools get called together and use co-occurrence to pre-expand the retrieval set.
When the retriever selects get_weather, the graph knows it co-occurs heavily with get_location and get_timezone — so those get added without a second embedding lookup. AutoTool reports 10–40% additional token reduction and 15–25% fewer LLM calls on top of semantic retrieval alone.
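A minimal sketch of that graph: record which tools fired together per run, then expand a retrieved set from the counts. Class and method names are mine, and the threshold is a knob to tune, not a recommendation:

```typescript
// Count how often tool pairs appear in the same run, then use those counts
// to pre-expand a retrieved set without a second embedding lookup.
class CoOccurrenceGraph {
  private counts = new Map<string, Map<string, number>>();

  // Call once per completed run with the tools the agent actually used.
  record(toolsUsedInRun: string[]): void {
    for (const a of toolsUsedInRun) {
      for (const b of toolsUsedInRun) {
        if (a === b) continue;
        const row = this.counts.get(a) ?? new Map<string, number>();
        row.set(b, (row.get(b) ?? 0) + 1);
        this.counts.set(a, row);
      }
    }
  }

  // Add any tool that co-occurred with a selected tool at least `minCount` times.
  expand(selected: string[], minCount = 3): string[] {
    const result = new Set(selected);
    for (const name of selected) {
      for (const [neighbor, count] of this.counts.get(name) ?? []) {
        if (count >= minCount) result.add(neighbor);
      }
    }
    return [...result];
  }
}
```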
What to Monitor
You can't optimize what you don't measure. The numbers I track per agent run:
| Metric | Target | What it tells you |
|---|---|---|
| Tool schema % of input tokens | < 20% | If higher: schemas too verbose or too many tools |
| Tool utilization rate | > 60% | If lower: you're injecting irrelevant tools |
| Cache hit rate | > 70% | If lower: static/dynamic content order is wrong |
| Tool selection accuracy | > 95% | Did the right tools appear in retrieved set? |
| Avg tools injected per call | 3–6 | Baseline for optimization comparison |
The utilization rate is the most useful signal. If you're injecting 5 tools and the agent uses 1 of them, either your retrieval is wrong or the task doesn't need tools at all.
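Two of those signals fall straight out of per-run logs. A sketch, assuming you already record token counts and tool names per run (the `RunStats` shape is mine):

```typescript
interface RunStats {
  toolSchemaTokens: number;  // tokens spent on tool schemas this call
  totalInputTokens: number;  // total input tokens this call
  toolsInjected: string[];   // tools placed in the context
  toolsCalled: string[];     // tools the agent actually invoked
}

// Schema share of input (target < 20%) and utilization rate (target > 60%).
function runMetrics(run: RunStats): { schemaShare: number; utilization: number } {
  const used = run.toolsCalled.filter((t) => run.toolsInjected.includes(t)).length;
  return {
    schemaShare: run.toolSchemaTokens / run.totalInputTokens,
    utilization: used / Math.max(run.toolsInjected.length, 1),
  };
}
```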
How to Roll This Out
I wouldn't build all three layers simultaneously. Each layer has measurable impact and its own failure modes — debug one at a time.
Start this week: audit all tool schemas, remove $schema, title, examples, additionalProperties. Measure baseline token count per run. Add cache_control on the last tool in your existing tool array. These two changes together often cut costs 40–50% before you touch retrieval.
Next two weeks: build the tool registry with pgvector or Pinecone, enrich tool metadata (tags, useCases, category on every tool), implement top-K cosine retrieval, add BFS dependency resolution. Measure hit rate. If it's below 90%, your tool descriptions are too sparse — enrich them.
After that: add the co-occurrence graph (record which tools were used per run), implement graph expansion, A/B test K=3 vs K=5 vs K=8 for your query distribution.
Research Summary
| Finding | Source | Impact |
|---|---|---|
| K=3 retrieval from 121 tools: 99.6% token reduction, 97.1% hit rate | arXiv:2603.20313 | Core technique |
| Agents collapse 39–70% on sequential tasks with 16+ tools | arXiv:2512.08296 | Why this matters |
| Co-occurrence graph: 10–40% more reduction, 15–25% fewer calls | arXiv:2511.14650 | AutoTool layer |
| 40% schema compression without accuracy loss | arXiv:2510.07248 | PA-Tool |
| Cache hit rates of 80–84% achievable in multi-user systems | Anthropic docs | Layer 2 |
| LLM query expansion at registration improves retrieval | arXiv:2412.03573 | Embedding quality |
Resources
- Semantic Tool Discovery for LLMs (arXiv:2603.20313)
- AutoTool: Efficient Tool Selection (arXiv:2511.14650)
- PA-Tool: Adapt Schemas to Models (arXiv:2510.07248)
- Scaling Agent Systems — Google (arXiv:2512.08296)
- Anthropic Prompt Caching docs
- ProjectDiscovery: 59% cost reduction case study
- Berkeley Function Calling Leaderboard
The most expensive mistake in agent development isn't using the wrong model — it's treating your tool registry as a flat list. A flat list scales to 5 tools. A retrieval system scales to 500 on the same token budget. The moment you have more than 15 tools, your registry is a search problem. Treat it like one.