Building a Production-Ready RAG System from Scratch: An Architectural Deep Dive
Retrieval-Augmented Generation (RAG) has emerged as a dominant architectural pattern for building sophisticated LLM-based applications. By grounding a model on an external, verifiable knowledge base, RAG mitigates hallucinations, enables access to private or real-time data, and provides a clear mechanism for source attribution. For CTOs and engineering leaders, mastering the RAG pipeline is not merely an academic exercise; it is a strategic imperative for unlocking reliable, enterprise-grade generative AI.
This article provides a comprehensive, from-scratch guide to designing and implementing a production-ready RAG system. We will bypass high-level frameworks to expose the core mechanics, focusing on the architectural decisions, performance trade-offs, and practical code required to build a robust solution. We will implement this system using Python, leveraging the ubiquitous PostgreSQL database with the pgvector extension for vector search and the OpenAI API for its powerful models.

The Core Architecture: Two Distinct Workflows
A RAG system is best understood as two separate but connected pipelines: the Offline Indexing Pipeline and the Online Inference Pipeline.
- Offline Indexing Pipeline: This is a preparatory, asynchronous process responsible for ingesting source documents, converting them into a searchable format (vector embeddings), and storing them in a specialized database. This process is executed whenever the knowledge base needs to be created or updated.
- Online Inference Pipeline: This is the real-time, user-facing workflow. It takes a user query, searches the indexed knowledge base for relevant context, and uses that context along with the original query to generate a grounded response from an LLM.
Key architectural choices at this stage include:
- Embedding Model: This model translates text into high-dimensional vectors. The choice impacts retrieval quality and cost. We will use OpenAI's text-embedding-3-small for its balance of performance and cost-efficiency. (A brief similarity sketch follows this list.)
- Vector Database: This database must efficiently store and query high-dimensional vectors. While dedicated vector databases like Pinecone or Weaviate are excellent, using PostgreSQL with pgvector allows many organizations to leverage existing infrastructure and operational expertise, significantly reducing architectural complexity.
- LLM: The generative component that synthesizes the final answer. We will use OpenAI's gpt-4o for its advanced reasoning and instruction-following capabilities.
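To make the embedding choice concrete, here is a minimal sketch that embeds two related phrases and computes their cosine similarity, the same measure our vector search will rely on. It assumes an OpenAI API key is available to the client; the example strings are illustrative.
# Minimal sketch: semantic similarity between two embeddings.
# Assumes OPENAI_API_KEY is configured for the openai client.
import math
import openai

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

response = openai.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I reset my password?", "Steps to recover account access"],
)
vec_a, vec_b = (item.embedding for item in response.data)
print(f"Cosine similarity: {cosine_similarity(vec_a, vec_b):.3f}")
Semantically related phrases score closer to 1.0 than unrelated ones, and this is exactly the signal the retrieval step exploits.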
Implementation Part 1: The Offline Indexing Pipeline
The goal of this pipeline is to populate our PostgreSQL vector store. This involves loading documents, breaking them into manageable chunks, generating embeddings, and storing them.
1. Database Setup with pgvector
First, ensure you have PostgreSQL installed with the pgvector extension enabled.
-- Connect to your PostgreSQL instance and run this command
CREATE EXTENSION IF NOT EXISTS vector;
-- Create a table to store the document chunks and their embeddings
CREATE TABLE document_chunks (
id SERIAL PRIMARY KEY,
document_name TEXT NOT NULL,
chunk_text TEXT NOT NULL,
embedding VECTOR(1536) -- 1536 is the dimension for text-embedding-3-small
);
-- Create an index for efficient similarity search
-- HNSW (Hierarchical Navigable Small World) is generally preferred for its speed-accuracy trade-off.
-- The build parameters m and ef_construction are tunable for performance.
CREATE INDEX ON document_chunks
USING HNSW (embedding vector_cosine_ops);
Architectural Note: We chose an HNSW index. Compared to an IVFFlat index, HNSW typically offers superior query performance (lower latency) at the cost of a slower, more memory-intensive build process. For most real-time applications, this is the correct trade-off.
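If you need to adjust that trade-off explicitly, pgvector exposes both build-time and query-time knobs. The sketch below shows the same index with its build parameters spelled out, plus the session-level setting that controls search breadth; the values are illustrative starting points, not tuned recommendations.
-- Build-time options (pgvector defaults shown): larger values improve recall
-- at the cost of longer builds and more memory.
CREATE INDEX ON document_chunks
USING HNSW (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Query-time knob (default 40): the number of candidates examined per search;
-- raise it for better recall, lower it for lower latency.
SET hnsw.ef_search = 100;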
2. Data Loading and Chunking
Effective chunking is critical for retrieval quality. Chunks that are too small lack context, while chunks that are too large introduce noise. A RecursiveCharacterTextSplitter is a robust strategy because it attempts to split text along semantic boundaries (paragraphs, sentences) first.
Here is a Python implementation for loading and chunking text files.
# requirements: pip install langchain openai psycopg2-binary
import os

import openai
import psycopg2
from langchain.text_splitter import RecursiveCharacterTextSplitter

# --- Configuration ---
OPENAI_API_KEY = "YOUR_OPENAI_API_KEY"
DB_CONNECTION_STRING = "postgresql://user:password@host:port/dbname"
DOCUMENTS_PATH = "./source_documents/"
EMBEDDING_MODEL = "text-embedding-3-small"

# --- Initialize Clients ---
openai.api_key = OPENAI_API_KEY

def process_and_embed_documents():
    """
    Loads documents, chunks them, generates embeddings, and stores them in PostgreSQL.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # The character length of each chunk
        chunk_overlap=200,  # The number of characters to overlap between chunks
        length_function=len,
    )
    conn = psycopg2.connect(DB_CONNECTION_STRING)
    cur = conn.cursor()
    for filename in os.listdir(DOCUMENTS_PATH):
        if filename.endswith(".txt"):
            filepath = os.path.join(DOCUMENTS_PATH, filename)
            with open(filepath, 'r') as f:
                document_text = f.read()
            print(f"Processing {filename}...")
            chunks = text_splitter.split_text(document_text)
            # Generate embeddings in batches for efficiency
            response = openai.embeddings.create(
                input=chunks,
                model=EMBEDDING_MODEL
            )
            embeddings = [item.embedding for item in response.data]
            # Insert into database; each embedding is passed as a pgvector
            # literal string ("[0.1, 0.2, ...]") and explicitly cast to vector.
            for i, chunk in enumerate(chunks):
                cur.execute(
                    "INSERT INTO document_chunks (document_name, chunk_text, embedding) VALUES (%s, %s, %s::vector)",
                    (filename, chunk, str(embeddings[i]))
                )
    conn.commit()
    cur.close()
    conn.close()
    print("Indexing complete.")

if __name__ == '__main__':
    process_and_embed_documents()
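After the script finishes, a quick sanity check (illustrative) confirms the table was populated and shows how chunks are distributed across documents:
-- Illustrative check: number of stored chunks per source document.
SELECT document_name, count(*) AS chunk_count
FROM document_chunks
GROUP BY document_name
ORDER BY chunk_count DESC;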
Implementation Part 2: The Online Inference Pipeline
This pipeline executes in real-time when a user submits a query. It involves embedding the query, retrieving relevant context from the database, constructing a precise prompt, and calling the LLM.

1. Query Embedding and Context Retrieval
The user's query must be converted into a vector using the exact same embedding model used for indexing. We then use this vector to perform a similarity search in our document_chunks table. The <=> operator from pgvector calculates the cosine distance.
import openai
import psycopg2

# --- Configuration (reuse from previous section) ---
# ...

def retrieve_context(query: str, top_k: int = 5) -> list[str]:
    """
    Embeds the query and retrieves the top_k most relevant document chunks.
    """
    # 1. Embed the user's query
    response = openai.embeddings.create(
        input=[query],
        model=EMBEDDING_MODEL
    )
    query_embedding = response.data[0].embedding
    # 2. Retrieve relevant context from PostgreSQL
    conn = psycopg2.connect(DB_CONNECTION_STRING)
    cur = conn.cursor()
    # Find the most similar chunks using cosine distance (<=>); the query
    # embedding is passed as a pgvector literal string and cast to vector.
    cur.execute(
        """
        SELECT chunk_text FROM document_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (str(query_embedding), top_k)
    )
    results = cur.fetchall()
    cur.close()
    conn.close()
    # Return the text of the chunks
    return [row[0] for row in results]
Performance Note: The LIMIT (top-k) parameter is a critical tuning knob. A smaller k is faster but risks missing relevant information. A larger k provides more context but can increase noise and LLM token costs. Starting with k=5 is a reasonable baseline.
2. Augmented Prompt Generation
This is the "augmentation" step. We construct a new prompt that explicitly instructs the LLM to answer based only on the context we just retrieved. This is the primary mechanism for preventing hallucination.
def construct_prompt(query: str, context: list[str]) -> str:
    """
    Constructs a prompt for the LLM with the retrieved context.
    """
    context_str = "\n\n---\n\n".join(context)
    prompt = f"""
You are a highly intelligent AI assistant. Your task is to answer the user's question based exclusively on the provided context.
- Do not use any external knowledge.
- If the answer is not present within the context, you must state: "I cannot answer this question based on the provided information."

Provided Context:
{context_str}

User's Question:
{query}

Answer:
"""
    return prompt
3. Final Answer Generation
The final step is to send the augmented prompt to the LLM.
def generate_response(query: str):
    """
    The main RAG pipeline function.
    """
    # 1. Retrieve context
    retrieved_context = retrieve_context(query, top_k=5)
    # 2. Construct the prompt
    final_prompt = construct_prompt(query, retrieved_context)
    # 3. Generate response from LLM
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": final_prompt}
        ],
        temperature=0.0  # Set to 0 for deterministic, fact-based answers
    )
    return response.choices[0].message.content

# --- Example Usage ---
if __name__ == '__main__':
    user_query = "What are the key performance considerations for the HNSW index?"
    final_answer = generate_response(user_query)
    print(f"Query: {user_query}\n")
    print(f"Answer: {final_answer}")
Architectural Decision: Setting temperature=0.0 is crucial for fact-based Q&A systems. It forces the model to be more deterministic and stick closely to the provided context, reducing creative (and potentially inaccurate) outputs.
Production Considerations and Advanced Optimizations
While the above implementation is functional, deploying it at scale requires further consideration.
- Evaluation: A RAG system is only as good as its retrieval quality. Implement an evaluation pipeline using a "golden dataset" of (question, expected answer, context) tuples. Key metrics include Context Precision (is the retrieved context relevant?), Context Recall (was all the necessary context retrieved?), and Faithfulness (does the final answer stay within the context?).
- Hybrid Search: Pure vector search can sometimes fail on queries containing specific keywords, acronyms, or codes. Augmenting vector search with a traditional keyword search algorithm like BM25 can provide a more robust retrieval system. This involves running two searches in parallel and combining the results.
- Re-ranking: The initial top-k retrieval is optimized for speed. To improve relevance, a second-stage re-ranking model (typically a cross-encoder) can be used. This model takes the query and each of the top-k retrieved documents and computes a more accurate relevance score, re-ordering the results before they are passed to the LLM (see the sketch after this list).
- Scalability:
- Database: For PostgreSQL, use connection pooling (e.g., PgBouncer) and consider read replicas to handle high query loads.
- Inference: LLM APIs are a bottleneck. Implement caching for identical queries. For very high throughput, investigate hosting open-source models on dedicated GPU infrastructure using tools like Triton Inference Server.
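As an illustration of the re-ranking step, the sketch below scores each retrieved chunk against the query with an off-the-shelf cross-encoder. It assumes the sentence-transformers package is installed; the checkpoint name and the rerank helper are illustrative additions, not part of the pipeline above.
# Minimal re-ranking sketch. Assumes `pip install sentence-transformers`;
# the cross-encoder checkpoint below is a publicly available example.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Score every (query, chunk) pair, then keep the top_n highest-scoring chunks.
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

# Typical usage: retrieve a wider candidate set, then re-rank before prompting the LLM, e.g.
#   candidates = retrieve_context(query, top_k=20)
#   context = rerank(query, candidates, top_n=5)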
Conclusion
Building a RAG system from scratch reveals the intricate interplay between data processing, vector search, and language modeling. By deconstructing the pipeline into its core indexing and inference components, engineering leaders can make informed architectural decisions that balance performance, cost, and maintainability.
The stack presented here (Python, pgvector, and the OpenAI API) offers a powerful and accessible starting point. However, the true art of productionizing RAG lies in continuous evaluation and the iterative application of advanced techniques like re-ranking and hybrid search to meet the specific demands of your use case.