I built a fully working voice-controlled AI agent that transcribes speech, classifies intent, and executes local tools, all powered by Groq's free AI APIs.
🎯 What It Does
You speak → It listens → It understands → It acts.
- Say "Write a Python retry function" → generates code and saves it to a file
- Say "Summarize this text" → returns a clean summary
- Say "What is machine learning?" → responds conversationally
- Say "Create a file called notes.txt" → creates the file safely
🏗️ Architecture
Audio Input → Groq Whisper (STT) → Groq LLaMA 3.3 70B (Intent) → Tool Execution → Streamlit UI
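As a rough sketch, that chain can be wired as a pipeline of pluggable stages. The names below (`run_pipeline`, `transcribe`, `classify`, `tools`) are illustrative placeholders, not the project's actual code; in the real app the first two stages would call Groq Whisper and LLaMA 3.3 70B.

```python
# Minimal pipeline skeleton with injected stages, so the STT and LLM
# calls can be swapped out or stubbed for testing. All names here are
# illustrative assumptions, not the repo's API.
from typing import Callable, Dict

def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],         # audio file -> transcript
    classify: Callable[[str], dict],          # transcript -> intent dict
    tools: Dict[str, Callable[[dict], str]],  # intent name -> handler
) -> str:
    text = transcribe(audio_path)             # Groq Whisper in the real app
    intent = classify(text)                   # Groq LLaMA 3.3 70B in the real app
    handler = tools.get(intent["intent"], tools["chat"])  # unknown -> chat
    return handler(intent)
```

Keeping the stages injectable also makes the fallback behavior easy: an unrecognized intent simply routes to the conversational handler.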
🧠 Models I Chose & Why
Speech-to-Text: Groq Whisper large-v3
- Transcribes audio in under 2 seconds
- Free tier available, no GPU needed
LLM: Groq LLaMA 3.3 70B Versatile
- Accurately classifies intent from natural speech
- Handles compound commands like "write X and save to Y.py"
⚙️ Tech Stack
- Streamlit: Web UI
- Groq API: STT + LLM
- Python: Backend logic
🔧 Challenges I Faced
1. Model Deprecation
During development, llama3-8b-8192 was decommissioned by Groq, so I switched to llama-3.3-70b-versatile, which is more powerful and still available on the free tier.
2. Compound Commands
Handling commands like "Write a bubble sort and save it to sort.py" required careful prompt engineering to extract both the intent and filename simultaneously.
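One way to make that extraction robust (a sketch under my own assumptions, not the post's actual prompt or parser) is to have the model answer with a small JSON object and then parse the reply defensively:

```python
import json
import re

# Hypothetical reply format: the LLM is prompted to answer with JSON like
# {"intent": "code", "task": "write a bubble sort", "filename": "sort.py"}.
def parse_intent_reply(reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating code
    fences or chatter around it; fall back to plain chat on failure."""
    fallback = {"intent": "chat", "task": reply, "filename": None}
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    data.setdefault("filename", None)  # simple commands carry no filename
    return data
```

With a schema like this, "Write a bubble sort and save it to sort.py" yields both the coding task and the target filename in a single round trip.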
3. Safe File Operations
All file writes are sandboxed to an output/ folder with path traversal protection so no system files can be accidentally overwritten.
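A guard along these lines is enough to keep writes inside output/. The function name and exact checks are my assumption, not the repo's code:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_write(filename: str, content: str) -> Path:
    """Write content under output/ only; reject '../' traversal and
    absolute paths that would escape the sandbox."""
    target = (OUTPUT_DIR / filename).resolve()
    if not target.is_relative_to(OUTPUT_DIR):  # Python 3.9+
        raise ValueError(f"unsafe path outside output/: {filename}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target
```

Resolving the joined path first is the key step: it normalizes any `..` segments before the containment check, so tricks like `sub/../../etc/passwd` are caught.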
✨ Bonus Features
- ✅ Human-in-the-loop confirmation before file operations
- ✅ Session memory: last 4 turns passed as context
- ✅ Auto fallback if the API fails
- ✅ Compound command support
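The session-memory feature can be sketched as a simple window over the chat history. The function name and message shape are assumptions; the 4-turn window comes from the post:

```python
def build_messages(system_prompt: str, history: list[dict],
                   user_msg: str, max_turns: int = 4) -> list[dict]:
    """Assemble a chat request keeping only the last `max_turns`
    user/assistant exchanges (2 messages per turn) as context."""
    recent = history[-2 * max_turns:]  # drop everything older
    return ([{"role": "system", "content": system_prompt}]
            + recent
            + [{"role": "user", "content": user_msg}])
```

Capping the window keeps latency and token usage flat no matter how long the session runs.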
🔗 Links
Thanks for reading! Feel free to star the repo if you found it useful ⭐