
DEV Community

THAMIZHAMUDHU GOPALAN

Voice-Controlled AI Agent Using Whisper and Local LLM

Overview

I recently built a Voice-Controlled AI Agent that processes both audio and text inputs, understands user intent, and performs meaningful actions through a structured pipeline.

The goal of this project was to design a complete AI system that works locally without relying on paid APIs, while maintaining simplicity and reliability.


Architecture

The system follows this pipeline:

Input → Speech-to-Text → Intent Detection → Action Execution → Output
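The pipeline above can be sketched as a chain of small functions, so each stage can be tested and swapped independently. This is a minimal sketch, not the project's actual code: the function names and keyword rules are illustrative, and the speech-to-text stage is stubbed where Whisper would plug in.

```python
def transcribe(audio_path: str) -> str:
    """Speech-to-text stage (stub; Whisper would be called here)."""
    raise NotImplementedError("plug in Whisper for audio inputs")

def detect_intent(text: str) -> str:
    """Tiny rule-based intent detector (illustrative keywords only)."""
    lowered = text.lower()
    if "summarize" in lowered:
        return "summarize"
    if "create file" in lowered:
        return "create_file"
    return "chat"

def execute(intent: str, text: str) -> str:
    """Action execution stage; real actions would dispatch from here."""
    return f"[{intent}] {text}"

def run_pipeline(user_input: str, is_audio: bool = False) -> str:
    """Input → Speech-to-Text → Intent Detection → Action Execution → Output."""
    text = transcribe(user_input) if is_audio else user_input
    return execute(detect_intent(text), text)
```

For example, `run_pipeline("summarize this article")` routes through the `summarize` branch, while audio inputs would first pass through the transcription stage.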


Key Features

  • Supports both audio (.wav, .mp3) and text input
  • Speech-to-text using Whisper (local model)
  • Intent detection using a hybrid approach (rule-based + LLM fallback)
  • Actions supported:
    • File creation
    • Python code generation
    • Text summarization
    • Chat responses
  • Compound commands (multiple actions in one input)
  • Persistent memory using JSON
  • Safe file handling within a dedicated output directory
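The safe file handling mentioned above can be enforced by resolving every target path against the dedicated output directory and rejecting anything that escapes it. A minimal sketch, assuming an `outputs` directory (the directory name is my assumption, not necessarily the project's):

```python
from pathlib import Path

# Assumed output directory name; all generated files must land inside it.
OUTPUT_DIR = Path("outputs").resolve()

def safe_path(filename: str) -> Path:
    """Resolve filename inside OUTPUT_DIR and reject escapes like '../'."""
    target = (OUTPUT_DIR / filename).resolve()
    if OUTPUT_DIR not in target.parents and target != OUTPUT_DIR:
        raise ValueError(f"path escapes output directory: {filename}")
    return target
```

Resolving before the containment check is the important detail: it normalizes `..` segments so a path like `../escape.txt` cannot slip past a naive string prefix test.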

Tech Stack

  • Python
  • Streamlit
  • Whisper
  • Ollama (Llama3)
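Ollama serves models over a local HTTP API, which is what makes the no-paid-APIs goal workable. A hedged sketch of building a non-streaming request to its `/api/generate` endpoint using only the standard library; the default host/port and the `llama3` model name follow the stack above, but treat the details as assumptions to verify against your Ollama setup:

```python
import json
import urllib.request

# Ollama's default local endpoint (assumption: default install, default port).
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )

def ask_llm(prompt: str) -> str:
    """Send the prompt and return the model's response text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the fallback-call code in the agent simple.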

Challenges

One of the key challenges was handling noisy or unclear speech input. This was addressed by combining rule-based logic with LLM-based intent detection.

Another challenge was ensuring correct intent classification for short inputs, which required prioritizing rules over model responses.
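The fix described in the two paragraphs above can be sketched as: rules run first, and the LLM is consulted only when no rule fires, so short, unambiguous commands never depend on a model response. The keyword table and function names here are illustrative, and the LLM fallback is stubbed:

```python
# Illustrative keyword → intent table; rules are checked before the LLM.
RULES = {
    "summarize": "summarize",
    "create": "create_file",
    "code": "generate_code",
}

def llm_classify(text: str) -> str:
    """Fallback classifier (stub; the real agent queries local Llama3 here)."""
    return "chat"

def classify_intent(text: str) -> str:
    """Rules take priority, so short commands bypass the LLM entirely."""
    lowered = text.lower()
    for keyword, intent in RULES.items():
        if keyword in lowered:
            return intent
    return llm_classify(text)
```

Because the rule pass is deterministic, it also absorbs some transcription noise: as long as the keyword survives in the Whisper output, the intent still resolves correctly.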


Learnings

This project helped me understand how real-world AI systems are built beyond just using models — including pipeline design, validation, and system reliability.


Links

https://github.com/thamizhamudhu/voice-ai-agent/blob/main/README.md
