Research: Architecting Tools for AI Agents at Scale

Loading all available tools into an LLM’s context simultaneously is one of the most consequential architectural mistakes teams make when building AI integrations. The solution isn’t bigger context windows; it’s progressive tool exposure: dynamically presenting only the tools relevant to each interaction. This post surveys the major patterns for doing so, drawn from production servers, academic research, and the emerging MCP ecosystem.

As a WordPress Core Committer contributing to the Abilities API, MCP Adapter, and related AI integrations, I see these patterns as directly relevant to where WordPress core needs to go. The Abilities API already registers abilities with per-ability schemas, permissions, and namespace/category/meta properties, and the MCP Adapter translates them into MCP tools. But as the ecosystem grows — core abilities plus WooCommerce plus Jetpack plus third-party plugins — the architecture needs to stay ahead of the scaling curve. The last two sections, A decision framework for WordPress and Connecting to WordPress Core AI, map these patterns directly to WordPress 7.1 work and open Trac tickets.


The problem: more tools, worse results

The core issue is not theoretical. Anthropic’s Advanced Tool Use blog (November 2025) provided concrete numbers: a typical 5-server MCP setup consumes ~55,000 tokens in tool definitions alone before any conversation begins. Internally, Anthropic observed tool definitions consuming 134K tokens before optimization. At that scale, Opus 4 achieved only 49% accuracy on tool selection in Anthropic’s internal evaluation.

This matches what I’ve observed in the WordPress ecosystem: a stress test that exposed WooCommerce’s REST API endpoints as individual abilities hit 40k tokens in tool definitions alone, immediately raising the question every platform eventually faces: how do you limit the tools based on context?

Every major benchmark tells the same directional story: accuracy drops as tool count rises. TaskBench (NeurIPS 2024), ToolkenGPT (NeurIPS 2023), MCPVerse (2025), and WildToolBench (ICLR 2026) each use different setups, metrics, and task formats, but taken together the picture is consistent: individual-tool accuracy can be as high as 96% but drops to under 15% in the most challenging multi-turn, large-toolset scenarios. Model quality helps but doesn’t solve the problem: in MCPVerse, Claude-4-Sonnet barely degraded between minimal and full tool loading, while most other models showed clear drops. GPT-4o and GPT-5 results in Standard mode used prompt-based function calling due to API tool-count limits, making their degradation harder to attribute solely to tool count.

Three categories of errors increase with tool count: incorrect tool selection (choosing the wrong tool), parameter hallucination (invoking the right tool with invented arguments), and tool interference (confusion between similarly-named or overlapping capabilities).

The business cost is also quantifiable. The “MCP Tax” analysis by MMNTM estimated that for 1,000 developers running 5 sessions per day, annual waste on unused tool definitions approaches $4 million (a directional estimate from a practitioner analysis, not peer-reviewed research).

Practical thresholds for engineering teams, synthesized from the evidence above: 1-5 tools is generally safe for any model. 5-10 tools is workable with optimized descriptions. 10+ tools is where measurable degradation starts requiring architectural intervention. 50+ tools without progressive disclosure likely means unreliable tool selection. These are heuristics, not published benchmarks.


Strategy 1: Design fewer, better tools

The simplest response to tool overload is to ship fewer tools that each do more. This requires no infrastructure changes, just deliberate API design.

Consolidation patterns

Several practitioner-documented patterns have emerged for reducing tool count through consolidation.

The Single Tool Resource Action Pattern (STRAP), coined by Alma Tuck in January 2026, addresses the most common form of tool bloat: one-tool-per-CRUD-operation-per-entity. Instead of exposing create_email_list, get_email_list, update_email_list, and delete_email_list as separate tools, STRAP creates one tool per domain with resource and action parameters. A call becomes email(resource: "list", action: "create", name: "Newsletter"). Applied to the Outlet email platform, this reportedly reduced 96 tools to 10 domain tools with ~80% less context overhead. The pattern works best for systems with uniform CRUD operations across multiple entities, precisely the shape of WordPress’s API surface.
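The dispatch shape STRAP describes can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical handler registry and made-up handler bodies, not any published STRAP implementation:

```python
# Hypothetical STRAP-style facade: one tool per domain, with
# resource/action parameters replacing per-entity CRUD tools.
HANDLERS = {
    ("list", "create"): lambda args: {"created": args["name"]},
    ("list", "get"): lambda args: {"list": args["name"]},
    ("list", "delete"): lambda args: {"deleted": args["name"]},
}

def email(resource: str, action: str, **args):
    """Single 'email' domain tool dispatching on (resource, action)."""
    handler = HANDLERS.get((resource, action))
    if handler is None:
        raise ValueError(f"unsupported combination: {resource}/{action}")
    return handler(args)
```

A call such as `email(resource="list", action="create", name="Newsletter")` replaces a dedicated `create_email_list` tool; the model sees one schema per domain instead of four per entity.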

The Six-Tool Pattern, documented by MCP Bundles, takes a different approach for data-oriented servers: reduce everything to six core tools organized by operation type: fetch, search, list_collections, list_objects, upsert, delete. The key insight is encoding routing information into ID formats (weaviate:object:CollectionName:uuid-123) so a single fetch tool can retrieve any resource type. Applied to a Weaviate MCP server, it collapsed 12 tools to 6 with identical functionality.
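The ID-based routing is the part worth copying. A minimal sketch of parsing such routed IDs follows; the `weaviate:object:...` format comes from the pattern’s write-up, while the helper name and dict shape are illustrative:

```python
def parse_resource_id(resource_id: str):
    """Split a routed ID like 'weaviate:object:CollectionName:uuid-123'
    into its components so one generic fetch tool can serve any type.
    maxsplit=3 keeps any colons inside the object ID intact."""
    backend, kind, collection, object_id = resource_id.split(":", 3)
    return {"backend": backend, "kind": kind,
            "collection": collection, "id": object_id}
```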

Anthropic’s own guidance reinforces this direction. Their “Writing Effective Tools” blog (September 2025) recommends consolidating multi-step operations into single high-level tools: instead of list_users, list_events, create_event, build schedule_event. Their litmus test: “If a human engineer can’t definitively say which tool should be used in a given situation, an AI agent can’t be expected to do better.” Claude Code itself caps its built-in toolset at ~12 core tools.

Tool description quality

Before reaching for architectural solutions, there’s a surprisingly high-impact lever: improving the text that describes each tool. Paragon tested GPT-4o across 50 tasks and found baseline tool correctness at only 74.8%, with end-to-end task completion at 53%. Anthropic’s internal testing confirms the payoff: adding tool use examples improved accuracy from 72% to 90% on complex parameter handling.

A few concrete practices stand out. OpenAI recommends the “intern test”: could someone unfamiliar with the system use this tool correctly given only its description? Beyond that, use enums to constrain parameters to valid values, pre-fill arguments the system already knows rather than asking the model to guess them, and include concrete input examples that show realistic usage patterns.
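Put together, a definition following these practices might look like the sketch below. The tool name, schema, and validator are illustrative, not taken from any vendor’s SDK:

```python
publish_post_tool = {
    "name": "publish_post",
    "description": (
        "Publish or schedule an existing WordPress post. "
        "Example: publish_post(post_id=42, status='publish')"
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "post_id": {"type": "integer",
                        "description": "ID of an existing post"},
            # Enum constrains the model to valid statuses instead of
            # letting it invent values like 'published' or 'live'.
            "status": {"type": "string",
                       "enum": ["publish", "future", "draft"]},
        },
        "required": ["post_id", "status"],
    },
}

def validate_args(tool, args):
    """Reject enum violations before they reach the backend."""
    props = tool["parameters"]["properties"]
    for key, value in args.items():
        allowed = props.get(key, {}).get("enum")
        if allowed and value not in allowed:
            raise ValueError(f"{key}={value!r} not one of {allowed}")
    return True
```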


Strategy 2: Show tools progressively

When consolidation alone isn’t enough, typically beyond 15-30 tools, the next strategy is revealing tools incrementally rather than loading everything upfront. This is the approach that has seen the most innovation in the MCP ecosystem.

Static filtering: decide at startup what to expose

The simplest form of progressive disclosure is configuration-based: decide before the session starts which tools to include.

GitHub’s MCP server organizes its tools into named toolsets (repos, issues, pull_requests, actions, code_security, etc.) that can be enabled via CLI flags, environment variables, or HTTP headers. The exact tool count varies by configuration and has been trending downward through deliberate consolidation. GitHub’s own blog describes the server as having “over a hundred tools,” though independent audits have found lower counts. Additional layers include read-only mode (filtering all write operations), token scope filtering (auto-hiding tools the user’s OAuth token can’t access), and tool description overrides via environment variables.
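The mechanics of startup-time filtering are simple enough to sketch. The toolset names below echo GitHub’s, but the registry contents and the write-operation heuristic are illustrative:

```python
TOOLSETS = {
    "repos": ["get_file", "push_files"],
    "issues": ["list_issues", "create_issue", "update_issue"],
    "actions": ["list_workflow_runs", "rerun_workflow"],
}

def tools_for_config(enabled_toolsets, read_only=False):
    """Expose only tools from toolsets enabled at startup; optionally
    strip anything that looks like a write operation."""
    write_prefixes = ("create_", "update_", "delete_", "push_", "rerun_")
    tools = [t for name in enabled_toolsets for t in TOOLSETS.get(name, [])]
    if read_only:
        tools = [t for t in tools if not t.startswith(write_prefixes)]
    return tools
```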

Dynamic discovery: let the model load what it needs

The more powerful approach is letting the LLM itself decide which tools to activate at runtime. GitHub pioneered this with dynamic toolset discovery (--dynamic-toolsets), which starts the server with only 3 meta-tools: list_available_toolsets, get_toolset_tools, and enable_toolset. The LLM discovers and activates toolsets as needed.
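A toy version of that three-meta-tool flow, with method names mirroring GitHub’s meta-tools but everything else hypothetical:

```python
class DynamicToolsets:
    """Sketch of dynamic discovery: the session starts with only
    meta-tools, and the model enables toolsets on demand."""

    def __init__(self, toolsets):
        self.toolsets = toolsets  # name -> list of tool names
        self.enabled = set()

    def list_available_toolsets(self):
        return sorted(self.toolsets)

    def get_toolset_tools(self, name):
        return self.toolsets[name]

    def enable_toolset(self, name):
        self.enabled.add(name)
        return sorted(self.enabled)

    def active_tools(self):
        """Only tools from enabled toolsets enter the context."""
        return [t for n in sorted(self.enabled) for t in self.toolsets[n]]
```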

Anthropic’s Tool Search Tool (November 2025) generalizes this into a platform feature. Tools marked with defer_loading: true load only a ~500-token stub initially instead of full schemas. When the model needs a deferred tool, it searches by description and loads the full schema on demand. Results: 85% token reduction (77K → 8.7K) and accuracy improving from 49% to 74% (Opus 4) and 79.5% to 88.1% (Opus 4.5). Anthropic’s best practice: keep 3-5 most-used tools always loaded, defer the rest. OpenAI’s native tool_search (GPT-5.4+) implements a similar pattern: functions marked defer_loading: true stay out of the initial context and load on demand. OpenAI separately recommends keeping each namespace to fewer than 10 functions for better model performance.
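The defer_loading idea reduces to a registry that serves short stubs until a search promotes a tool to fully loaded. A rough sketch, with substring matching standing in for the real semantic search and all field names assumed:

```python
class ToolRegistry:
    """Deferred tools expose only a truncated description stub
    until a search matches them and loads the full schema."""

    def __init__(self, tools):
        self.tools = tools  # name -> {description, schema, defer_loading}
        self.loaded = {n for n, t in tools.items()
                       if not t["defer_loading"]}

    def initial_context(self):
        return {name: (tool["schema"] if name in self.loaded
                       else tool["description"][:80])  # stub only
                for name, tool in self.tools.items()}

    def search(self, query):
        """Promote any deferred tool whose description mentions
        the query term; a real system would use semantic search."""
        hits = [n for n, t in self.tools.items()
                if query.lower() in t["description"].lower()]
        self.loaded.update(hits)
        return hits
```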

Anthropic’s Agent Skills pattern extends this to organizational knowledge: only YAML frontmatter metadata loads at startup, full SKILL.md content loads when the skill is activated, and referenced files load only when needed. Several community MCP implementations have emerged around the same idea, each reporting significant context savings: Synaptic Labs’ Meta-Tool Pattern (2 tools, reported 85-95% token reduction), Klavis Strata’s four-stage discovery, MCPrism’s search→schema→execute flow, and Resource-Based Progressive Disclosure (moving documentation to MCP resources the LLM must fetch before executing).

The core tradeoff across all implementations is latency vs. context savings: extra discovery steps add tool calls, and the traditional load-everything approach completes the same tasks in 3-4× fewer calls.

Code-as-tools: the radical alternative

Cloudflare arrived at the most extreme position. Their API has 2,594 endpoints, and even minimal tool schemas total 244,047 tokens. Their solution: 2 tools (search and execute) that accept JavaScript code. The agent writes code against a typed API client, executing in isolated sandboxes. Token reduction: 99.9% (from 1,170,523 to 1,069 tokens). Rather than choosing from thousands of tool definitions, the agent explores the API through code and composes calls directly, a pattern Cloudflare describes as letting the code act as “a compact plan.”

Anthropic’s Programmatic Tool Calling pushes in the same direction: letting agents invoke tools from a code execution sandbox reduced average token usage by 37% (from 43,588 to 27,297 tokens on complex research tasks). Their “Code Execution with MCP” blog demonstrated that combining Tool Search with code execution can achieve a 98.7% reduction (150K → 2K tokens). The broader insight, from Anthropic’s Agent Skills work: “Generalist agents like Claude Code use remarkably few tools (~12). The key is giving agents access to a computer.”


Strategy 3: Filter and route at the infrastructure layer

The first two strategies operate within a single server or agent. When the problem grows to multiple servers, multiple teams, or hundreds of tools across an organization, filtering moves to infrastructure.

Pre-call filtering

The most direct approach: don’t send tools the model doesn’t need for this particular request.

Two-stage LLM filtering runs a fast, cheap model (e.g., GPT-4o-mini) as a first pass. LangChain’s LLMToolSelectorMiddleware reads all tool names and descriptions, picks the most relevant ones, and forwards only those to the primary model. The tradeoff is an extra LLM call per turn. That means added latency and cost, plus the risk that the filter model itself makes a bad selection.

Text similarity matching avoids LLM overhead entirely. Libraries like Aurelio AI’s Semantic Router define sample phrases for each tool and match incoming requests using fast text similarity, achieving ~90% accuracy in under a millisecond without any LLM inference. These work best as a first-pass filter combined with LLM-based final selection for ambiguous cases.
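A first-pass filter of this kind need not use embeddings at all; even cheap string similarity can shortlist candidates. In the sketch below, the sample phrases and threshold are made up, and Python’s difflib stands in for a proper similarity model:

```python
from difflib import SequenceMatcher

# Hypothetical route table: each tool lists sample phrases that
# typical requests for it resemble.
ROUTES = {
    "create_post": ["write a new blog post", "draft an article"],
    "manage_orders": ["refund an order", "check order status"],
}

def route(query, threshold=0.4):
    """Return the best-matching tool, or None when no sample is
    similar enough -- those ambiguous queries would fall through
    to a slower LLM-based selector."""
    best_tool, best_score = None, 0.0
    for tool, samples in ROUTES.items():
        for sample in samples:
            score = SequenceMatcher(None, query.lower(), sample).ratio()
            if score > best_score:
                best_tool, best_score = tool, score
    return best_tool if best_score >= threshold else None
```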

At larger scale, several research projects treat tool selection as a search problem, indexing tool descriptions and retrieving the best matches for a given request. The Toolshed project found that search-based tool retrieval consistently outperforms presenting all tools directly. ToolNet builds graphs of tools where edges encode which tools are commonly used together, narrowing the search space. Stacklok’s MCP Optimizer reports 94% accuracy at 2,792 tools using a combination of keyword and semantic search, suggesting that discovery-based approaches can scale well beyond what static tool lists can handle.

Gateways: centralized control across servers

When an organization connects agents to 8-10 servers simultaneously, the management problem shifts from “how does one server organize its tools” to “who decides which servers and tools are available to which agents, and how do we monitor what’s happening?” MCP gateways sit between agents and servers, providing a centralized layer for tool filtering, access control, and observability. A practical heuristic for when a gateway becomes necessary: you’re running ~8-10 servers, have 5+ developers building agents, or someone asks “who called what, and when?” for the first time.

Lunar MCPX introduces Tool Groups: named collections of tools bundled across servers into workflow-specific sets like “Sales Outreach” or “Frontend Dev.” Instead of an agent reasoning over every tool from every connected server, it sees only the tools in its assigned group. Beyond filtering, Lunar provides centralized policy enforcement, per-tool-call metrics, and an evaluation sandbox for testing MCP servers before production.

The mcp-gateway-registry takes a different approach: instead of manually curating tool groups, it lets agents find tools by describing what they need in plain language. All registered tools are indexed by their descriptions, and incoming queries are matched against that index using text similarity, enabling discovery across hundreds of MCP servers without prior configuration.


What production servers actually do

The spectrum of approaches across major production servers reveals a clear pattern: among the servers surveyed, none exposes its full API surface as individual MCP tools. GitHub, Cloudflare, Supabase, and Playwright are documented from primary repos. The remaining entries are illustrative examples from repository inspection and community implementations, not verified primary-source claims. Tool counts and strategies may shift as these servers evolve.

| Server | Approximate tool count | Strategy |
| --- | --- | --- |
| Cloudflare | 2 (covering ~2,600 endpoints) | Code Mode — agent writes JavaScript |
| Figma | Under 10 | Extreme consolidation |
| Shopify (Dev) | Under 10 per server | Multi-server split by persona |
| Slack | ~10 | Flat list, OAuth scope filtering |
| Notion (community) | ~20 | Env var filtering with wildcards |
| Playwright | ~20 | Flat namespace, browser_ prefix |
| Stripe | ~25 | Curated subset + meta-tools |
| Supabase | ~25-30 | Feature groups with URL param filtering |
| Atlassian | ~45 | Organized by product (Jira, Confluence, Compass) |
| GitHub | Varies by config | Toolsets + dynamic discovery + static filtering |

Shopify splits by persona rather than consolidating: separate MCP servers for developers, storefronts, customer accounts, and checkout. Each server stays small and focused.

The community Notion MCP server uses env-var-based filtering with wildcard support (NOTION_MCP_TOOLS_INCLUDE="get-*,retrieve-*", NOTION_MCP_TOOLS_EXCLUDE="delete-*"). Another community project, better-notion-mcp, reports ~77% token reduction via 10 composite tools replacing 28+ endpoint calls. Supabase implements feature group filtering via URL parameters (?features=database,docs) combined with a three-tier safety system (safe/write/destructive).
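Wildcard include/exclude filtering of this kind takes only a few lines with fnmatch. The semantics below are an illustration in the spirit of those environment variables, not the Notion server’s actual code:

```python
from fnmatch import fnmatch

def filter_tools(tools, include="*", exclude=""):
    """Keep tools matching any include pattern, then drop any
    matching an exclude pattern, mirroring env-var style
    INCLUDE/EXCLUDE wildcard lists."""
    include_patterns = [p for p in include.split(",") if p]
    exclude_patterns = [p for p in exclude.split(",") if p]
    selected = [t for t in tools
                if any(fnmatch(t, p) for p in include_patterns)]
    return [t for t in selected
            if not any(fnmatch(t, p) for p in exclude_patterns)]
```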


A decision framework for WordPress

For an engineering team building AI integrations into a large platform, the research points to a layered strategy combining multiple patterns based on the scale of the integration.

Start with deliberate consolidation

The risk for WordPress isn’t that core registers too many abilities; 20-30 well-scoped abilities is manageable. The risk is ecosystem growth: WooCommerce adds 30 more, Jetpack adds 20, a forms plugin adds 10, an SEO plugin adds 10, and suddenly a single site has 100+ abilities without anyone individually doing anything wrong. The answer is to consolidate where the data model naturally supports it, while keeping operations that have genuinely different shapes as separate abilities. Posts, pages, and custom post types all share the same underlying wp_posts model, the same meta system, and the same status workflow, so core/create-content with a post_type parameter is more natural than separate core/create-post and core/create-page abilities plus per-CPT abilities for every registered type. That turns 10 post types × 4 CRUD operations = 40 abilities into 4. But settings, site health, and plugin management have fundamentally different input shapes. Forcing them under one mega-tool would mean a complex conditional schema that hurts parameter accuracy more than the tool count reduction helps. The principle: consolidate similarities, but don’t sacrifice schema clarity. Parameter hallucination — where the model invokes the right tool but with invented or mismatched arguments — is one of the primary failure modes that increases with schema complexity.
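As a sketch of what consolidation buys, assuming a hypothetical core/create-content schema shape (the Abilities API is PHP; Python is used here only for illustration):

```python
# Hypothetical consolidated ability: one create-content tool with a
# post_type parameter instead of per-post-type CRUD abilities.
create_content = {
    "name": "core/create-content",
    "input_schema": {
        "type": "object",
        "properties": {
            "post_type": {"type": "string",
                          "enum": ["post", "page", "product"]},
            "title": {"type": "string"},
            "status": {"type": "string", "enum": ["draft", "publish"]},
        },
        "required": ["post_type", "title"],
    },
}

def ability_count(post_types, crud_ops, consolidated):
    """The tool-count arithmetic from the paragraph above."""
    return len(crud_ops) if consolidated else len(post_types) * len(crud_ops)
```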

Add static filtering and feature groups

WordPress already has a natural grouping boundary: plugins. WooCommerce abilities only exist if WooCommerce is installed, Jetpack abilities only if Jetpack is active. The plugin itself acts as a coarse filter. But within an active site’s abilities, further filtering is still needed. A content editor doesn’t need commerce tools, a store manager doesn’t need site health tools, and a third-party agent connecting via MCP may only need a narrow slice of what’s registered. The filtering layer should expose abilities based on user permissions and task context, not dump everything that’s installed into every session.
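The filtering itself can be as simple as matching each ability’s required capability against the current user. Field names below are illustrative, not the Abilities API’s actual structure:

```python
def abilities_for_user(abilities, user_caps):
    """Expose only abilities whose required capability the
    current user holds, so a content editor never sees
    commerce or site-health tools."""
    return [a["name"] for a in abilities if a["requires"] in user_caps]
```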

Implement progressive disclosure for advanced scenarios

As the platform grows, adopt Anthropic’s Tool Search pattern: keep 3-5 core tools always loaded, defer the rest. The always-loaded set should be orientation abilities: tools any agent needs regardless of task, like understanding what the site is, what the user can do, and what other abilities are available. Task-specific abilities load on demand once the agent knows what it’s working with. Variations of this are already being implemented in the WordPress ecosystem, including orchestrators that pass context about which abilities are available and STRAP facades that keep the LLM context lean. The same principle will apply when WordPress acts as an MCP client, where external tool surfaces should also be loaded only when a user’s task requires them, not registered into every session upfront.

Invest in tool descriptions and evaluation

Tool description quality has an outsized impact on accuracy, often more than architectural changes. Build evaluation suites that test tool selection across realistic WordPress workflows. Track three metrics obsessively: tool correctness rate, parameter accuracy, and tokens consumed per session. Every tool added to the system should justify its context cost with measurable utility.
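A minimal harness for those three metrics might aggregate per-task results like this; the result record shape is an assumption:

```python
def eval_metrics(results):
    """Aggregate the three metrics worth tracking per evaluation run.
    Each result: {"expected_tool", "called_tool", "params_ok", "tokens"}."""
    n = len(results)
    return {
        "tool_correctness": sum(r["expected_tool"] == r["called_tool"]
                                for r in results) / n,
        "parameter_accuracy": sum(r["params_ok"] for r in results) / n,
        "avg_tokens_per_session": sum(r["tokens"] for r in results) / n,
    }
```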


Connecting to WordPress Core AI — WordPress 7.1 and beyond

The strategies above map directly to work already in progress or planned for the WordPress Core AI roadmap. The goal is to identify where core needs tactical improvements so that the broader ecosystem can build on solid primitives.

Abilities filtering is the most immediate enabler. Extending wp_get_abilities() with category, namespace, and meta filtering (Trac #64990) is what makes static filtering and feature groups possible. Without it, every consumer, whether an MCP server, an orchestrator, or a third-party agent, has to solve ability scoping on its own, and those ad hoc solutions won’t scale as plugins add more abilities. This is the foundational piece: get it right and both static and dynamic disclosure patterns become buildable on top.

Consolidation guidance for the ecosystem is a gap. Core can lead by example with content-type-based abilities (core/create-content with post_type), but plugin authors need clear patterns for when to consolidate and when to register separate abilities. The path of least resistance is mirroring REST API endpoints as abilities, the natural thinking we already see in the plugin ecosystem, and exactly how tool definitions balloon beyond what any context window can sustain. The right defaults and documentation can steer the ecosystem toward fewer, well-scoped abilities rather than thin wrappers around every endpoint.

Orientation abilities (core/get-site-info, core/get-theme, core/list-plugins, core/get-site-health) are near-term candidates for WordPress 7.1 (WordPress/ai#40). This research reinforces their priority: they’re the always-loaded foundation that makes progressive disclosure work, the tools an agent uses to understand the environment before requesting task-specific abilities on demand. Since they pay their schema cost on every session, it’s worth asking whether core/get-site-info could be extended to cover theme, plugin, and site health context, reducing the always-loaded set without introducing new abilities.

Schema quality at the output boundary connects directly to the JSON Schema normalization work (Trac #64955). The research on tool description quality, where refinements alone moved accuracy from 72% to 90%, argues that wp_compile_ability_schema_for_ai() isn’t just a nice-to-have. It’s one of the highest-leverage improvements core can ship for agent reliability.

The execution lifecycle filters (Trac #64989) become more important in a progressive disclosure world. When abilities are loaded dynamically, the filters in normalize_input(), check_permissions(), and execute() are the safety net. They ensure that an ability loaded on demand still goes through the same validation path as one loaded at startup.

The proposed next step is to evaluate the abilities filtering Trac ticket and the WordPress 7.1 orientation abilities against the progressive disclosure patterns documented here, and use the production evidence (GitHub’s toolset model in particular) to inform the API design. Early adopters within the WordPress ecosystem — including Automattic’s AI products, the WordPress AI plugin, and the MCP Adapter — can validate whether the core primitives are sufficient before the broader plugin ecosystem adopts them.


Conclusion

Leading implementations increasingly converge on a shared conclusion: progressive tool exposure is not optional at scale, but a primary determinant of agent reliability. Anthropic’s own data showing accuracy nearly doubling with tool search, combined with production evidence from GitHub (dynamic toolsets), Cloudflare (2 tools covering ~2,600 endpoints), and Shopify (persona-based server splitting), establishes that the most successful AI integrations invest heavily in tool architecture rather than simply exposing API endpoints.

The most underappreciated insight from this research is that tool consolidation has a sweet spot: consolidate where the underlying data model is shared (as WordPress can do across post types), but don’t sacrifice schema clarity by forcing dissimilar operations into a single tool. Complex conditional schemas degrade the parameter accuracy that consolidation was meant to improve. The most actionable insight is Anthropic’s “3-5 always loaded, defer the rest” heuristic, applied to orientation abilities that let the agent discover what else is available. And the most strategic insight is that tool architecture is not a one-time decision but an ongoing engineering discipline requiring evaluation infrastructure, description optimization, and continuous measurement of the context tax every tool imposes on every session.


This post was produced with assistance from Claude. Primary sources were validated directly with additional fact-checking via ChatGPT and Gemini: Anthropic’s engineering blogs, OpenAI’s API documentation, Cloudflare’s Code Mode blog, GitHub’s MCP server repository, and the benchmark papers (TaskBench, ToolkenGPT, MCPVerse, WildToolBench). Several supporting examples (STRAP, Six-Tool Pattern, MCPrism, better-notion-mcp, Lunar MCPX, mcp-gateway-registry, Stacklok) are practitioner case studies or vendor documentation rather than peer-reviewed research. They are presented as documented patterns, not as equivalent evidence. Production server tool counts are approximate and evolving. Exact figures should be verified against current repositories before citing externally.

