
DEV Community

Lycore Development


Building a RAG Pipeline on Your Existing Django App: A Practical Guide

You have a Django app. You want to add AI-powered search or a chatbot that actually knows about your data — not just a generic LLM that hallucinates answers. The solution is RAG: Retrieval-Augmented Generation.

This guide walks through exactly how we add RAG to existing Django projects. No rebuilding from scratch. No switching stacks. Just a clean integration on top of what you already have.

What RAG actually does

An off-the-shelf LLM like GPT-4 or Claude knows only what it was trained on. Ask it about your product catalogue, your customers, or your internal docs — and it will either make something up or tell you it doesn't know.

RAG fixes this by retrieving relevant chunks of your actual data and injecting them into the prompt as context. The model then answers based on that context, not just training data.

The flow looks like this:

  1. User asks a question
  2. Your app converts the question to a vector embedding
  3. You search your vector store for the most similar chunks of your data
  4. Those chunks get injected into the LLM prompt as context
  5. The LLM answers based on your data

What you need

  • A Django app with a PostgreSQL database (we'll use pgvector)
  • An OpenAI API key (for embeddings and completions)
  • pgvector Postgres extension installed
  • Python packages: openai, pgvector, django

We use pgvector here because it runs inside your existing Postgres instance — no separate vector database to manage or pay for. For larger scale, Pinecone or Chroma are worth evaluating, but for most Django projects pgvector is the right call.

Step 1: Install pgvector

pip install pgvector openai

Enable the extension in Postgres:

CREATE EXTENSION IF NOT EXISTS vector;

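If you prefer keeping the extension setup in version control rather than running SQL by hand, pgvector's Django integration ships a VectorExtension migration operation. A sketch, assuming a documents app (the migration name and dependency are illustrative):

```python
# migrations/0002_enable_pgvector.py -- hypothetical migration name
from django.db import migrations
from pgvector.django import VectorExtension


class Migration(migrations.Migration):
    # Adjust to your app's actual migration history
    dependencies = [("documents", "0001_initial")]

    # Runs CREATE EXTENSION IF NOT EXISTS vector for you
    operations = [VectorExtension()]
```

The database role running migrations needs permission to create extensions; on managed Postgres (RDS, Cloud SQL) that is usually already the case.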
Step 2: Add a vector field to your model

# models.py
from django.db import models
from pgvector.django import VectorField

class Document(models.Model):
    title = models.CharField(max_length=255)
    content = models.TextField()
    embedding = VectorField(dimensions=1536, null=True, blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.title

Run your migration:

python manage.py makemigrations
python manage.py migrate

Step 3: Generate and store embeddings

# utils/embeddings.py
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

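For bulk backfills, note that the embeddings endpoint also accepts a list of inputs, so you can embed many documents per request instead of one at a time. A small batching helper (the batch size of 100 is an arbitrary choice here, not an API limit):

```python
def batch(items: list[str], size: int = 100) -> list[list[str]]:
    """Group items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Usage sketch with the client defined above:
# for group in batch(texts):
#     response = client.embeddings.create(
#         model="text-embedding-3-small", input=group
#     )
#     vectors = [item.embedding for item in response.data]
```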
Hook this into a Django signal:

# signals.py
from django.db.models.signals import post_save
from django.dispatch import receiver
from .models import Document
from .utils.embeddings import get_embedding

@receiver(post_save, sender=Document)
def generate_embedding(sender, instance, created, **kwargs):
    # Compare against None: pgvector returns a numpy array for saved rows,
    # and `not array` raises a ValueError for multi-element arrays.
    if instance.embedding is None:
        embedding = get_embedding(instance.content)
        # .update() bypasses save() and signals, so this does not recurse
        Document.objects.filter(pk=instance.pk).update(embedding=embedding)

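One gotcha: receivers in signals.py only register if the module actually gets imported. The standard place to do that is your AppConfig.ready() hook. A sketch, assuming the app is named documents (substitute your real app name):

```python
# apps.py
from django.apps import AppConfig


class DocumentsConfig(AppConfig):
    default_auto_field = "django.db.models.BigAutoField"
    name = "documents"

    def ready(self):
        # Imported for its side effect: registering the post_save receiver
        from . import signals  # noqa: F401
```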
Step 4: Semantic search with pgvector

# utils/search.py
from pgvector.django import CosineDistance

from ..models import Document          # models.py lives one package up
from .embeddings import get_embedding  # sibling module inside utils/

def semantic_search(query: str, top_k: int = 5) -> list[Document]:
    query_embedding = get_embedding(query)
    results = (
        Document.objects
        .exclude(embedding__isnull=True)  # skip rows not yet embedded
        .annotate(distance=CosineDistance('embedding', query_embedding))
        .order_by('distance')[:top_k]
    )
    return list(results)

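For intuition: cosine distance is 1 minus the cosine similarity of the two vectors, so 0 means the same direction and values near 1 mean unrelated. pgvector computes this inside the database; the pure-Python sketch below is only for understanding, not something you'd run over your table:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0.0 for identical directions,
    up to 2.0 for exactly opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```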
Step 5: Build the RAG response

# utils/rag.py
from openai import OpenAI
from .search import semantic_search

client = OpenAI()

def rag_response(user_query: str) -> dict:
    relevant_docs = semantic_search(user_query, top_k=4)
    context = "\n\n---\n\n".join([
        f"Title: {doc.title}\n{doc.content}"
        for doc in relevant_docs
    ])
    system_prompt = """You are a helpful assistant. Answer the user's question
using ONLY the context provided below. If the answer is not in the context,
say so clearly — do not make up information.

Context:
{context}""".format(context=context)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query}
        ],
        temperature=0.2
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [{"id": doc.id, "title": doc.title} for doc in relevant_docs]
    }

Step 6: Wire it up as a Django view

# views.py
import json
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST
from .utils.rag import rag_response

@csrf_exempt  # for demo purposes only -- add CSRF/auth before shipping
@require_POST
def ask(request):
    try:
        data = json.loads(request.body)
    except json.JSONDecodeError:
        return JsonResponse({"error": "invalid JSON"}, status=400)
    query = data.get("query", "").strip()
    if not query:
        return JsonResponse({"error": "query is required"}, status=400)
    result = rag_response(query)
    return JsonResponse(result)

Test it:

curl -X POST http://localhost:8000/api/ask/ \
  -H "Content-Type: application/json" \
  -d '{"query": "What is your refund policy?"}'

Things to watch in production

Chunking strategy matters. Split long documents into 300–500 token chunks with 50–100 token overlap before embedding. Add a DocumentChunk model alongside Document for this.
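A minimal word-based chunker along these lines (word counts are a rough proxy for tokens here; a tokenizer such as tiktoken would give token-accurate sizes):

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 80) -> list[str]:
    """Split text into overlapping word-based chunks.

    Consecutive chunks share `overlap` words so that sentences
    straddling a boundary still appear intact in one chunk.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks
```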

Cache embeddings aggressively. Never regenerate an embedding for content that hasn't changed.
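A sketch of content-hash caching, with an in-process dict standing in for whatever store you actually use (a column on the model or Redis in production); embed_fn would be get_embedding from Step 3:

```python
import hashlib

# In-process stand-in for a persistent cache (Redis, a model field, etc.)
_embedding_cache: dict[str, list[float]] = {}

def cached_embedding(text: str, embed_fn) -> list[float]:
    """Key the cache by a hash of the content, so unchanged text
    never triggers a second embedding call."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(text)
    return _embedding_cache[key]
```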

Monitor retrieval quality. Log cosine distance scores. If the top result has a distance above ~0.4, handle this gracefully rather than feeding irrelevant context to the LLM.
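A small guard along those lines, taking (doc, distance) pairs (the 0.4 cutoff is a starting point to tune against your own data, not a universal rule):

```python
def filter_by_distance(scored_docs, max_distance: float = 0.4):
    """Keep only documents whose cosine distance is under the threshold.

    scored_docs: iterable of (doc, distance) pairs.
    """
    return [doc for doc, distance in scored_docs if distance <= max_distance]
```

If the filtered list comes back empty, answer with "I couldn't find anything relevant" directly instead of paying for an LLM call that will be prompted with noise.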

Rate limit your LLM calls. Use the tenacity library for retry logic with exponential backoff.
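The logic tenacity gives you looks roughly like this hand-rolled sketch (in practice, prefer the library and retry only on rate-limit/transient errors rather than bare Exception):

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn, retrying with exponential backoff plus a little jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the original error
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)
```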

What this gets you

A working RAG pipeline on your existing Django app, using your existing Postgres database, with no new infrastructure. The whole integration is around 100 lines of code.


At Lycore we build AI integrations on top of existing Django, React, and .NET applications. If you're looking to add RAG, agents, or other AI capabilities to your stack, get in touch.
