Amratanshu-d/voice-ai-agent

🎙️ Voice-Controlled Local AI Agent

Built as part of the Mem0 AI — MLOps and AI Infra Internship Assignment

A voice-controlled AI agent that accepts audio input, classifies intent, executes local tools, and displays results in a clean UI — all powered by Groq's free API.


📸 Demo

Voice AI Agent UI

🎥 Video Demo: [YouTube Link Here]
📝 Article: [Dev.to/Medium Article Link Here]


✨ Features

  • 🎤 Dual Audio Input — Record from microphone OR upload .wav/.mp3 files
  • 🔊 Speech-to-Text — Powered by Groq Whisper Large V3
  • 🧠 Intent Detection — LLaMA3-70B classifies user commands into 4 intents
  • Tool Execution — Creates files, writes code, summarizes text, or chats
  • 🧩 Session Memory — Tracks all actions taken during the session
  • 🖥️ Clean Gradio UI — Shows transcription, intent, action, and output

🏗️ Architecture

User speaks
    ↓
Audio file (.wav/.mp3)
    ↓
[stt.py] Groq Whisper → Transcribed Text
    ↓
[intent.py] LLaMA3-70B → Intent Classification
    ↓
[tools.py] Tool Router
    ├── create_file  → Creates file in output/
    ├── write_code   → Generates code, saves to output/
    ├── summarize    → Summarizes text, saves to output/summary.txt
    └── general_chat → LLM conversation
    ↓
[app.py] Gradio UI → Displays all results
    ↓
Session History (In-memory)
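
The flow above can be sketched as one function that chains the three stages and records each run. This is a minimal illustration, not the repository's code: `run_pipeline`, the `transcribe` and `classify` callables, the `TOOLS` table, and the stub handlers are hypothetical names standing in for whatever `stt.py`, `intent.py`, and `tools.py` actually expose.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the handlers in tools.py; each takes the
# transcribed text and returns a string to show in the UI.
def create_file(text: str) -> str:
    return f"[create_file] {text}"

def write_code(text: str) -> str:
    return f"[write_code] {text}"

def summarize(text: str) -> str:
    return f"[summarize] {text}"

def general_chat(text: str) -> str:
    return f"[general_chat] {text}"

# Tool router: intent label -> handler.
TOOLS: Dict[str, Callable[[str], str]] = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

# In-memory session history; it is lost when the process exits.
session_history: List[dict] = []

def run_pipeline(audio_path: str,
                 transcribe: Callable[[str], str],
                 classify: Callable[[str], str]) -> dict:
    """Audio file -> text -> intent -> tool output, recorded in history."""
    text = transcribe(audio_path)
    intent = classify(text)
    # Unknown labels fall back to chat so the UI always gets a response.
    if intent not in TOOLS:
        intent = "general_chat"
    result = {
        "transcription": text,
        "intent": intent,
        "output": TOOLS[intent](text),
    }
    session_history.append(result)
    return result
```

Passing the speech-to-text and classification steps in as callables keeps the routing logic testable without network access.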

🛠️ Tech Stack

| Component | Technology | Why Chosen |
| --- | --- | --- |
| UI | Gradio 4.x | Fast to build, professional look |
| STT | Groq Whisper Large V3 | Free, fast, no GPU required |
| LLM | Groq LLaMA3-70B | Free API, excellent performance |
| File Ops | Python pathlib | Built-in, reliable |
| Env Vars | python-dotenv | Secure API key management |

🔧 Hardware Note

This project uses Groq's API for both Speech-to-Text and LLM inference instead of running models locally. This was chosen because:

  1. Groq provides free tier access
  2. Its hosted inference is fast, typically faster than running models of this size on local consumer hardware
  3. It makes the project hardware-agnostic — runs on any machine
  4. Whisper Large V3 and LLaMA3-70B are available on Groq for free
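
For reference, a hosted transcription call might look like the sketch below. It assumes the official `groq` Python SDK, whose client exposes `audio.transcriptions.create`; the `transcribe` helper and its parameters are illustrative names, not taken from the repository's `stt.py`. The client is passed in as an argument so the function can be exercised without a network connection.

```python
from pathlib import Path

SUPPORTED_FORMATS = {".wav", ".mp3"}

def transcribe(client, audio_path: str, model: str = "whisper-large-v3") -> str:
    """Send an audio file to a hosted Whisper model and return the text.

    `client` is assumed to be a `groq.Groq` instance, or any object with
    the same `audio.transcriptions.create` method.
    """
    path = Path(audio_path)
    if path.suffix.lower() not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported audio format: {path.suffix!r}")
    with path.open("rb") as f:
        response = client.audio.transcriptions.create(
            file=(path.name, f.read()),  # (filename, raw bytes)
            model=model,
        )
    return response.text.strip()
```

With the SDK installed and `GROQ_API_KEY` set, calling `transcribe(Groq(), "command.wav")` should return the transcribed text; `command.wav` is a placeholder file name.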

🚀 Setup Instructions

Step 1 — Prerequisites

Step 2 — Clone the Repository

git clone https://github.com/Amratanshu-d/voice-ai-agent.git
cd voice-ai-agent

Step 3 — Install Dependencies

pip install -r requirements.txt

Step 4 — Set Up API Key

  1. Go to console.groq.com
  2. Sign up for free
  3. Click API Keys → Create API Key
  4. Open the .env file and replace your_groq_api_key_here with your actual key:
GROQ_API_KEY=gsk_xxxxxxxxxxxxxxxxxxxxx
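
A config loader along these lines would catch a missing or unedited key before the first API call. This is a sketch, not the contents of the repository's `config.py`: the `load_api_key` name is hypothetical, and `python-dotenv` is imported defensively so the snippet also works when the variable is already set in the environment.

```python
import os

PLACEHOLDER = "your_groq_api_key_here"

def load_api_key() -> str:
    """Return GROQ_API_KEY, reading .env first if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv
        load_dotenv()  # by default, does not override variables already set
    except ImportError:
        pass  # fall back to the plain process environment
    key = os.environ.get("GROQ_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "GROQ_API_KEY is missing or still set to the placeholder; "
            "edit .env and add your key."
        )
    return key
```

Failing early with a specific message is easier to diagnose than an authentication error surfacing later from inside the UI.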

Step 5 — Run the App

python app.py

Then open your browser and go to: http://localhost:7860


📁 Project Structure

voice-ai-agent/
├── app.py              ← Main Gradio UI application
├── stt.py              ← Speech-to-Text (Groq Whisper)
├── intent.py           ← Intent classification (LLaMA3)
├── tools.py            ← Tool execution (files, code, summarize, chat)
├── config.py           ← Loads .env file
├── requirements.txt    ← Python dependencies
├── .env                ← Your API key (never commit this!)
├── .gitignore          ← Excludes .env from git
└── output/             ← All generated files go here (auto-created)

🎯 Supported Intents

| Intent | Example Command | Action |
| --- | --- | --- |
| Create File | "Create a text file called notes" | Creates empty file in output/ |
| Write Code | "Write Python code for a retry function" | Generates & saves code to output/ |
| Summarize | "Summarize the benefits of machine learning" | Generates summary, saves to output/summary.txt |
| General Chat | "What is artificial intelligence?" | Responds conversationally |
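
One way to turn a free-form model reply into one of these four intents is to ask the model for JSON and validate each label before routing. The sketch below is an assumption about how that could work, not the logic in `intent.py`: the prompt wording, the `parse_intents` helper, and the JSON shape are all hypothetical. Returning a list leaves room for compound commands.

```python
import json
from typing import List

VALID_INTENTS = ("create_file", "write_code", "summarize", "general_chat")

# Hypothetical system prompt; the repository's actual prompt may differ.
SYSTEM_PROMPT = (
    "Classify the user's command. Reply with JSON only, in the form "
    '{"intents": ["<label>", ...]}, using labels from: '
    + ", ".join(VALID_INTENTS)
)

def parse_intents(llm_reply: str) -> List[str]:
    """Extract known intent labels from a reply, in order, without duplicates."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        data = None
    labels = data.get("intents") if isinstance(data, dict) else None
    if not isinstance(labels, list):
        labels = []
    found: List[str] = []
    for label in labels:
        if isinstance(label, str):
            name = label.strip().lower()
            if name in VALID_INTENTS and name not in found:
                found.append(name)
    # Malformed or empty replies degrade to plain conversation.
    return found or ["general_chat"]
```

Validating against a fixed tuple means an unexpected label from the model can never reach the tool router.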

🌟 Bonus Features Implemented

  • Session Memory — All actions tracked throughout the session
  • Compound Commands — Handles multiple intents in one command
  • Graceful Degradation — Clear error messages for all failure cases
  • Safety Constraint — ALL file operations restricted to output/ folder
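
The safety constraint can be enforced by resolving every requested path and rejecting anything that lands outside `output/`. A minimal sketch, assuming Python 3.9+ for `Path.is_relative_to`; `safe_output_path` is a hypothetical helper and not necessarily how `tools.py` does it.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_output_path(filename: str, base: Path = OUTPUT_DIR) -> Path:
    """Return an absolute path inside `base`, or raise ValueError."""
    root = base.resolve()
    candidate = (root / filename).resolve()
    # resolve() collapses ".." and follows symlinks, so traversal attempts
    # such as "../app.py" end up outside `root` and are rejected here.
    if candidate == root or not candidate.is_relative_to(root):
        raise ValueError(f"Refusing to write outside {base}/: {filename!r}")
    # Create output/ (and any subfolders) on first use.
    candidate.parent.mkdir(parents=True, exist_ok=True)
    return candidate
```

Because the check runs on the resolved path, absolute paths and `..` segments supplied through a voice command are rejected before any file is touched.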

🙏 Acknowledgements

Built for the Mem0 AI MLOps & AI Infra Internship assignment. Inspired by the concept of persistent memory in AI agents — a core problem that Mem0 is solving.
