This project implements a local AI agent that processes user input through voice or text, understands intent, performs actions, and returns results through a structured interface.
The system uses local models for speech recognition and language understanding, ensuring it works without paid APIs.
- Audio input via file upload (.wav, .mp3)
- Text input support
- Speech-to-text using Whisper (local)
- Intent detection using rule-based logic with an LLM fallback (Ollama)
- Tool execution:
  - File creation
  - Python code generation and saving
  - Text summarization
  - General chat responses
- Compound command handling (multiple actions in one input)
- Persistent chat history using JSON storage
- Clean Streamlit UI displaying:
  - Transcription
  - Intent
  - Final output
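Rule-based intent detection with an LLM fallback can be sketched as below. The patterns and the `llm_fallback` hook are illustrative assumptions, not the project's actual rules in intent.py; in the real app the fallback would wrap a call to the local Ollama model.

```python
import re

# Illustrative rule patterns mapped to intent labels (not the project's
# actual rule set). Order matters: more specific patterns come first.
RULES = [
    (r"\bcreate\b.*\.py\b", "create_python_file"),
    (r"\bcreate\b.*\bfile\b", "create_file"),
    (r"\bsummar(y|ize)\b", "summarize"),
]

def detect_intent(text: str, llm_fallback=None) -> str:
    """Return an intent label; defer to an LLM only when no rule matches."""
    lowered = text.lower()
    for pattern, intent in RULES:
        if re.search(pattern, lowered):
            return intent
    # Hypothetical hook: a function wrapping the Ollama chat API would go here.
    if llm_fallback is not None:
        return llm_fallback(text)
    return "chat"  # default: treat unmatched input as general chat
```

Keeping the rules first makes the common commands fast and deterministic; the LLM is only consulted for inputs the rules cannot classify.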
Input (Audio/Text)
↓
Speech-to-Text (Whisper)
↓
Intent Detection
↓
Action Execution
↓
Output + Storage
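The four stages above can be wired together in a few lines. Everything here is a stand-in for the real modules (stt.py, intent.py, tools.py, memory.py); the point is only to show the control flow.

```python
# Stand-ins for the real modules; each comment names the module it mimics.

def transcribe(audio_path):            # stt.py: Whisper in the real app
    return "create a file notes.txt"   # stubbed transcription result

def detect_intent(text):               # intent.py: rules + LLM fallback
    return "create_file" if "create" in text and "file" in text else "chat"

def execute(intent, text):             # tools.py: action execution
    return f"[{intent}] handled: {text}"

history = []

def save(entry):                       # memory.py: JSON persistence
    history.append(entry)

def run_pipeline(audio_path=None, text=None):
    # Step 1: speech-to-text, only when audio is supplied
    if audio_path is not None:
        text = transcribe(audio_path)
    # Step 2: intent detection
    intent = detect_intent(text)
    # Step 3: action execution
    output = execute(intent, text)
    # Step 4: output + storage
    save({"input": text, "intent": intent, "output": output})
    return output
```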
voice-ai-agent/
│
├── app.py # Streamlit application
├── stt.py # Speech-to-text (Whisper)
├── intent.py # Intent detection logic
├── tools.py # Action execution
├── memory.py # Persistent chat storage
├── output/ # Generated files
├── chat_history.json # Saved chat history
└── README.md
git clone <your-repository-link>
cd voice-ai-agent

Install dependencies:
uv add streamlit openai-whisper ollama

Whisper requires FFmpeg.
Download: https://www.gyan.dev/ffmpeg/builds/
Add to PATH:
C:\ffmpeg\bin
Verify:
ffmpeg -version

Install Ollama.
Download: https://ollama.com
Pull model:
ollama pull llama3
Verify:
ollama run llama3

Run the app:
uv run streamlit run app.py

Example inputs:
what is artificial intelligence
create a python file hello.py with a hello function
create a file notes.txt
summarize artificial intelligence is transforming industries worldwide
create a file test.txt and write a python file hello.py
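The last example is a compound command. One simple way to handle it is to split the input on " and " whenever an action verb follows; the verb list below is an assumption for illustration, not the project's actual logic in intent.py.

```python
import re

# Illustrative action verbs; the real project may recognize a different set.
ACTION_VERBS = ("create", "write", "summarize", "make")

def split_compound(command: str) -> list:
    """Split one input into sub-commands at ' and ' before a known verb."""
    pattern = r"\s+and\s+(?=(?:%s)\b)" % "|".join(ACTION_VERBS)
    parts = re.split(pattern, command, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]
```

Splitting only before a verb avoids breaking inputs like "summarize cats and dogs", where "and" is part of the content rather than a command separator.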
Chat history is stored locally in:
chat_history.json
This allows the application to retain conversation history across sessions without using a database.
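A minimal version of this JSON persistence looks like the following; the function names are illustrative and the real logic lives in memory.py.

```python
import json
from pathlib import Path

HISTORY_FILE = Path("chat_history.json")  # name taken from the project layout

def load_history(path=HISTORY_FILE):
    """Return the saved history, or an empty list for a fresh session."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))

def append_entry(entry, path=HISTORY_FILE):
    """Read, append, and rewrite the whole file (fine for small histories)."""
    history = load_history(path)
    history.append(entry)
    path.write_text(json.dumps(history, indent=2), encoding="utf-8")
```

Rewriting the full file on every message is simple and robust at this scale; a database only becomes worth the complexity once histories grow large or concurrent sessions are needed.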
All generated files are saved only inside:
output/
This prevents unintended modifications to system files.
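One way to enforce that restriction is to resolve every requested filename against output/ and reject anything that escapes it. This is a sketch of the idea, not the project's actual check in tools.py.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # directory name taken from the project layout

def safe_output_path(filename: str) -> Path:
    """Resolve filename inside output/ and reject path traversal outside it."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    target = (OUTPUT_DIR / filename).resolve()
    # After resolving symlinks and "..", output/ must be an ancestor of target.
    if OUTPUT_DIR.resolve() not in target.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}/: {filename}")
    return target
```

Resolving before checking is the important step: a naive string prefix check can be defeated by inputs like `../output2/x` or `..\..\x`, while `Path.resolve()` normalizes them first.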
- Handling noisy or inaccurate speech transcription
- Ensuring correct intent classification for short or ambiguous inputs
- Cleaning LLM outputs for consistent formatting
- Managing compound commands reliably
- Maintaining persistence without a database
- Real-time microphone input
- Improved intent classification using structured prompts
- Support for document upload and summarization
- Multi-session chat management
- More robust error handling for unclear inputs
Author: Thamizha
This project demonstrates a complete local AI pipeline combining speech recognition, intent understanding, and action execution with persistent memory and modular design.