AI agents with many similar tools pick the wrong one and waste tokens. This demo builds a travel agent with Strands Agents and uses FAISS to filter 29 tools down to the top 3 most relevant, comparing filtered vs unfiltered tool selection accuracy.
Based on research: "Internal Representations as Indicators of Hallucinations in Agent Tool Selection"
Research (Internal Representations, 2025) identifies 5 critical agent failure modes when tools scale:
- Function selection errors - Calling non-existent tools
- Function appropriateness errors - Choosing semantically wrong tools
- Parameter errors - Malformed or invalid arguments
- Completeness errors - Missing required parameters
- Tool bypass behavior - Generating outputs instead of calling tools
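Two of these failure modes can be caught mechanically. As a purely illustrative sketch (not part of the demo), a proposed tool call can be validated against a registry mapping tool names to required parameters; `classify_call` and the registry shape here are hypothetical:

```python
def classify_call(call: dict, registry: dict) -> str:
    """Return the failure mode a proposed tool call exhibits, or 'ok'.

    registry maps tool name -> set of required parameter names (hypothetical).
    """
    if call["name"] not in registry:
        return "function selection error"  # calling a non-existent tool
    missing = registry[call["name"]] - set(call.get("args", {}))
    if missing:
        return "completeness error"  # missing required parameters
    return "ok"

registry = {"get_hotel_pricing": {"hotel_name"}}
print(classify_call({"name": "get_hotel_price"}, registry))
print(classify_call({"name": "get_hotel_pricing", "args": {}}, registry))
print(classify_call({"name": "get_hotel_pricing",
                     "args": {"hotel_name": "Marriott"}}, registry))
```

Appropriateness errors and tool bypass are harder: they need semantics, which is where the filtering below comes in.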
The dual problem:
- ❌ Hallucination risk: More tools = more inappropriate selections
- ❌ Token waste: Sending all tool descriptions on every call (29 tools = ~4,000 tokens per query)
Semantic tool selection filters tools before the agent sees them:
Results: Improved accuracy, fewer tokens
Strands Agents provides native capabilities that enable semantic tool selection in production:
1. Dynamic Tool Swapping
# Add/remove tools at runtime without recreating the agent
agent.tool_registry.register_tool(new_tool)
agent.tool_registry.unregister_tool(old_tool)

2. Conversation Memory Preservation
# Swap tools between queries while keeping conversation history
swap_tools(agent, new_tools)  # agent.messages preserved

3. Runtime Tool Discovery
- Agent picks up tool changes automatically at each event loop cycle
- No manual refresh needed; just modify tool_registry
- Zero-downtime tool updates in production
Traditional frameworks require agent recreation to change tools, losing conversation state. Strands maintains memory while tools change dynamically.
Learn more: Strands Tool Registry
- Python 3.9+
- Strands Agents — AI agent framework
- Optional: Neo4j connection for real hotel data (from ../01-hotel-rag-demo)
This demo uses OpenAI with GPT-4o-mini by default (requires OPENAI_API_KEY environment variable).
You can swap the model for any provider supported by Strands — Amazon Bedrock, Anthropic, Ollama, etc. See Strands Model Providers for configuration.
Create a .env file with your OpenAI API key:
# OpenAI API Key (required)
OPENAI_API_KEY=your_openai_api_key_here

How to get your API key: platform.openai.com/api-keys
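A minimal fail-fast check before constructing the model saves a confusing downstream error; `require_api_key` is a hypothetical helper, not part of the demo code:

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Fail fast with a clear message if the key is missing."""
    key = os.getenv(name)
    if not key:
        raise RuntimeError(f"{name} is not set; create the .env file described above")
    return key
```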
uv venv && uv pip install -r requirements.txt

| File | Purpose |
|---|---|
| `test_semantic_tools_hallucinations.ipynb` | Main demo - comprehensive notebook with 29 tools and ground truth verification |
| `token_comparison_app.py` | Token savings verification - standalone script to measure token reduction |
| `enhanced_tools.py` | 31 travel agent tools (29 generic + 2 with optional Neo4j data) |
| `registry.py` | FAISS-based semantic tool filtering |
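The filtering in `registry.py` can be sketched as follows. This stand-in uses hashed bag-of-words vectors and a brute-force NumPy search instead of real sentence embeddings and a FAISS index (both substitutions are assumptions about the actual implementation; on unit vectors, the brute-force inner-product top-k is the same search FAISS's `IndexFlatIP` performs):

```python
import hashlib

import numpy as np

DIM = 64  # toy dimensionality; real sentence embeddings are much larger

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a sentence embedding: hashed bag-of-words, L2-normalized."""
    v = np.zeros(DIM)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def build_index(tools: dict):
    """Embed each tool description once; FAISS would store these vectors in an index."""
    names = list(tools)
    matrix = np.stack([embed(tools[name]) for name in names])
    return names, matrix

def search_tools(query: str, names, matrix, top_k: int = 3):
    """Inner-product top-k over unit vectors, i.e. cosine similarity."""
    scores = matrix @ embed(query)
    return [names[i] for i in np.argsort(scores)[::-1][:top_k]]

tools = {
    "get_hotel_pricing": "get nightly price and total cost for a hotel stay",
    "search_hotels": "search hotels by city, country, and rating",
    "book_flight": "book a flight between two airports",
    "get_weather": "get the current weather forecast for a city",
}
names, matrix = build_index(tools)
print(search_tools("How much does a hotel room cost per night", names, matrix))
```

The index is built once at startup; only the cheap query embedding and search run per request.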
Open `test_semantic_tools_hallucinations.ipynb` in Jupyter, Kiro, or your preferred notebook environment.

What it does:
- Tests 13 travel queries on 29 tools
- Compares Traditional (all 29 tools) vs Semantic (top 3 filtered)
- Verifies against ground truth (real hotel database)
- Shows token savings and error reduction
Key features:
- Real hotel data from Neo4j graph database
- Objective accuracy measurement
- Detailed error analysis
- Token cost comparison
Run the standalone token comparison script to verify the savings claimed in Part 3 of the notebook:
uv run token_comparison_app.py

What it measures:
- Compares 3 approaches: Traditional, Semantic, Semantic+Memory
- Shows actual token usage per query
- Demonstrates memory accumulation cost
- Verifies swap_tools() preserves conversation history
Expected output:
Token breakdown:
- Traditional: 29 tools × 50 tokens = ~1450 tokens/query (constant)
- Semantic: 3 tools × 50 tokens = ~150 tokens/query (constant)
- Memory: ~150 tokens + conversation history (~400 tokens/turn, accumulates)
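The arithmetic behind these figures, using the rough 50-tokens-per-description estimate from above:

```python
TOKENS_PER_TOOL = 50  # rough per-description estimate used in the breakdown above

traditional = 29 * TOKENS_PER_TOOL  # all tool descriptions sent every query
semantic = 3 * TOKENS_PER_TOOL      # only the filtered top 3
savings = 1 - semantic / traditional

print(traditional, semantic, f"{savings:.1%}")  # 1450 150 89.7%
```

This matches the ~89% token reduction reported for production systems later in this README.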
# Agent sees ALL 31 tools on every query
agent = Agent(tools=ALL_TOOLS, model=model)
agent("How much does Hotel Marriott cost?")
# Token cost: ~4,500 tokens (31 tool descriptions)
# Risk: Picks wrong tool from 31 options

# 1. Build FAISS index once
build_index(ALL_TOOLS)
# 2. Filter tools per query
query = "How much does Hotel Marriott cost?"
relevant_tools = search_tools(query, top_k=3)
# Returns: [get_hotel_pricing, get_hotel_details, search_hotels]
# 3. Agent sees only 3 relevant tools
agent = Agent(tools=relevant_tools, model=model)
agent(query)
# Token cost: ~500 tokens (3 tool descriptions)
# Risk: Picks correct tool from 3 focused options

For multi-turn conversations, use Strands' native tool swapping to maintain conversation history:
def swap_tools(agent, new_tools):
    """Swap the agent's tools without losing conversation memory."""
    agent.tool_registry.registry.clear()
    agent.tool_registry.dynamic_tools.clear()
    for tool in new_tools:
        agent.tool_registry.register_tool(tool)
# Create agent once
agent = Agent(tools=initial_tools, model=model)
# Multi-turn conversation with dynamic tool filtering
for query in queries:
    selected = search_tools(query, top_k=3)
    swap_tools(agent, selected)  # Tools change, agent.messages preserved
    agent(query)  # Full conversation history intact

Why this works: Strands calls tool_registry.get_all_tools_config() at each event loop cycle, automatically picking up runtime changes. No agent recreation needed.
Key advantages:
- Zero conversation loss across tool swaps
- Same agent instance handles all queries
- Add/remove tools between any two queries
- Production-ready for long conversations
Learn more: Strands Agent Architecture
The notebook includes 6 tools connected to the Neo4j hotel database, for example:
@tool
def search_real_hotels(country: str, min_rating: float = 0.0) -> str:
    """Search real hotels in a specific country from our database."""
    # Executes Cypher query on Neo4j
    # Returns actual hotel data from 515K reviews

@tool
def get_top_hotels(country: str, limit: int = 5) -> str:
    """Get top-rated hotels in a country."""
    # Real aggregation from graph database

These tools provide ground truth for objective accuracy measurement.
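The Cypher such a tool might run can be sketched as a parameterized query builder. The node labels, relationship type, and property names below are assumptions for illustration, not taken from the demo's actual schema:

```python
def build_hotel_query(country: str, min_rating: float = 0.0):
    """Build a parameterized Cypher query for a hotel search.

    Hypothetical schema: (Hotel)-[:LOCATED_IN]->(Country), with an
    avg_rating property on Hotel. Parameters keep the query injection-safe.
    """
    cypher = (
        "MATCH (h:Hotel)-[:LOCATED_IN]->(c:Country {name: $country}) "
        "WHERE h.avg_rating >= $min_rating "
        "RETURN h.name AS name, h.avg_rating AS rating "
        "ORDER BY rating DESC"
    )
    return cypher, {"country": country, "min_rating": min_rating}

cypher, params = build_hotel_query("Netherlands", 8.0)
# A neo4j driver session would then execute: session.run(cypher, **params)
```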
This demo implements findings from:
- Internal Representations as Indicators of Hallucinations - Tool selection hallucinations increase with tool count
- Production systems report 89% token reduction (rconnect.tech)
- Demo 03 - Multi-Agent Validation — Cross-validate tool selections with Executor → Validator → Critic
- Demo 04 - Neurosymbolic Guardrails — Add symbolic rules to block invalid tool calls
Contributions are welcome! See CONTRIBUTING for more information.
If you discover a potential security issue in this project, notify AWS/Amazon Security via the vulnerability reporting page. Please do not create a public GitHub issue.
This library is licensed under the MIT-0 License. See the LICENSE file for details.