RAG Simplified: A Deep Dive into Intelligent AI Responses

Abhishek Biswas
6 min read · Dec 20, 2024


Simple RAG Implementation (source: NVIDIA)

Retrieval-Augmented Generation (RAG) is an advanced technique that enhances the performance of Large Language Models (LLMs) by incorporating external data sources into the text generation process. This approach addresses inherent limitations in LLMs, such as outdated information and the tendency to produce plausible-sounding but incorrect outputs, commonly known as “hallucinations.”

Core Components of RAG:

Document Ingestion and Indexing:

Data Ingestion Process (Step 1) of RAG Implementation
  • Data Collection: Data is gathered in various formats, including unstructured, semi-structured, and structured types, from diverse sources such as databases, documents, and live feeds. Semi-structured data, like PDFs or JSON files, has some level of organization but does not follow a strict schema. Structured data, such as SQL databases or Excel files, conforms to a well-defined schema and format. In contrast, unstructured data, including images, videos, or HTML content, lacks a predefined organizational structure and requires specialized techniques for processing and analysis.
  • Embedding Generation: Transform the collected data into vector representations (embeddings) that encapsulate its semantic meaning. A vector is typically represented as a 1×n numerical array, where n denotes the embedding dimension. Models such as BERT, Text-Embedding-Small, and similar architectures are designed to process text, images, or other data types, converting them into vectors based on their semantic relationships. These embeddings enable efficient comparison and retrieval tasks.
  • Indexing: Once embeddings are generated, they are stored in a vector database optimized for efficient similarity search. Vector databases such as Milvus, Pinecone, or FAISS are designed to handle high-dimensional embeddings and provide approximate nearest neighbor (ANN) search, so that when a query is made, the system can rapidly locate the most relevant embeddings by comparing their vector representations. A minimal sketch of this ingestion flow follows this list.
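To make the ingestion flow concrete, here is a minimal sketch using sentence-transformers for embedding generation and FAISS (mentioned above) for indexing. The model name and sample documents are illustrative only and are not part of the pipeline implemented later in this post.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Illustrative corpus; in practice these would be chunks from ingested documents.
documents = [
    "RAG combines retrieval with text generation.",
    "Vector databases support approximate nearest neighbor search.",
    "Embeddings capture the semantic meaning of text.",
]

# Each document becomes a 1 x n vector (n = embedding dimension).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents, normalize_embeddings=True)

# Index the embeddings; inner product on normalized vectors behaves like cosine similarity.
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(np.asarray(embeddings, dtype="float32"))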

Retrieval Mechanism:

Retrieval Mechanism and Generation Process (Step 2) of RAG Implementation
  • Query Embedding: Convert the user’s input into a vector representation (embedding) that captures its semantic content. This process ensures that the query is translated into a numerical format comparable to the indexed embeddings, enabling meaningful similarity comparisons.
  • Similarity Search: Use the query embedding to perform a similarity search against the indexed embeddings in the vector database. By comparing the semantic relationships between vectors, the system identifies and retrieves the most relevant documents or data points, which is crucial for efficient, meaningful retrieval in applications such as search engines, recommendation systems, and question-answering frameworks. The sketch started above continues below with this retrieval step.
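Continuing the sketch from the ingestion section, the query is embedded with the same model and matched against the FAISS index; the query text and top_k value are illustrative.

# Embed the user's query with the same model used at ingestion time.
query = "How does a vector database find relevant text?"
query_embedding = model.encode([query], normalize_embeddings=True)

# Retrieve the top-k most similar documents from the FAISS index.
top_k = 2
scores, indices = index.search(np.asarray(query_embedding, dtype="float32"), top_k)

for score, idx in zip(scores[0], indices[0]):
    print(f"score={score:.3f}  doc={documents[idx]}")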

Augmentation and Generation:

  • Contextual Integration: The retrieved documents are seamlessly integrated into the input context provided to the LLM. This step enriches the model’s understanding by supplementing its internal knowledge with up-to-date, domain-specific, or detailed information from external sources, ensuring a comprehensive context for processing.
  • Response Generation: The LLM leverages both its internal knowledge and the enriched input context, which includes the retrieved external information, to generate a response. This combination yields outputs that are more accurate, contextually relevant, and tailored to the user’s query, addressing limitations such as outdated information or knowledge gaps (see the sketch after this list).
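A hedged sketch of the augmentation and generation step, reusing the retrieval results from the sketch above: the retrieved chunks are stuffed into the prompt, and the LLM answers from that context. ChatOpenAI mirrors the client used later in this post; the prompt wording is illustrative.

from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# Stuff the retrieved chunks into the prompt as context.
retrieved_chunks = [documents[i] for i in indices[0]]
context = "\n\n".join(retrieved_chunks)

llm = ChatOpenAI(model="gpt-3.5-turbo")
response = llm.invoke([
    SystemMessage(content="Answer the question using only the provided context."),
    HumanMessage(content=f"Context:\n{context}\n\nQuestion: {query}"),
])
print(response.content)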

Advantages of RAG:

  • Enhanced Accuracy: Retrieval-Augmented Generation (RAG) improves response reliability by grounding outputs in factual, retrieved data. This approach minimizes the risk of hallucinations, ensuring the generated responses are more trustworthy and precise.
  • Real-Time Information Access: RAG empowers LLMs to access and incorporate the most up-to-date information during the generation process. This adaptability makes the model well-suited for dynamic environments with rapidly changing data or emerging trends.
  • Data Privacy Preservation: When implemented with self-hosted LLMs, RAG ensures that sensitive data remains on-premises, mitigating privacy concerns and facilitating compliance with data protection regulations. This feature is particularly valuable for organizations handling confidential or regulated information.

Implementing a RAG Pipeline:

Data Preparation:

  • Document Collection and Preprocessing: Gather relevant documents, such as PDFs and text files, extract their content using tools like PyPDF2 or pdfplumber, clean the text by removing noise and normalizing it, and segment it into manageable chunks for further processing.
  • Vector Embedding Generation: Transform the preprocessed text into semantic vector embeddings using transformer-based models like BERT, Sentence-Transformer, or Text-Embedding-Large, capturing the meaningful context of the data.
  • Embedding Indexing: Store the embeddings in a vector database like Pinecone, optimized for similarity search, and enhance it with metadata for efficient, scalable, and accurate retrieval.
from dataclass_utility import ChatResponse
from langchain.schema import HumanMessage, SystemMessage
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from pathlib import Path
import PyPDF2
import os
import re
import tempfile
import time
from pinecone import Pinecone
from langchain_pinecone import PineconeVectorStore
from dotenv import load_dotenv

load_dotenv()

# Embedding model used for both ingestion and querying.
embedding = OpenAIEmbeddings(
    show_progress_bar=True,
    model="text-embedding-3-large"
)

def create_db() -> PineconeVectorStore:
    """Connect to the Pinecone index and wrap it as a LangChain vector store."""
    try:
        index_name = os.getenv("INDEX_NAME")
        pc = Pinecone()
        index = pc.Index(index_name)
        vector_store = PineconeVectorStore(
            index=index,
            embedding=embedding,
            pinecone_api_key=os.getenv("PINECONE_API_KEY"),
        )
        return vector_store
    except Exception as e:
        print(e)
        return None

def handle_image_response(res: str, user_question: str) -> ChatResponse:
    """Extract the generated file path from the model output and clean up the reply."""
    from llm_calls import llm
    pattern = r'FILE_PATH:\s*"([^"]+)"'
    match = re.search(pattern, res)
    file_path = match.group(1) if match else None
    cleaned_content = llm.invoke([
        SystemMessage("Remove the file path and make the response read as 'The image of .. has been generated'"),
        HumanMessage(res),
    ]).content
    return ChatResponse(question=user_question, answer=cleaned_content, document_path=file_path)

def handle_pdf_ingestion_tempfile(uploaded_file):
    """Persist an uploaded file to a temporary path, chunk it, and index it in Pinecone."""
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=Path(uploaded_file.name).suffix) as temp_file:
            temp_file.write(uploaded_file.read())
            temp_path = temp_file.name

        # Explicitly check if the file exists and is readable
        if not os.path.exists(temp_path):
            print(f"Temporary file {temp_path} does not exist.")
            return False

        file_ext = Path(temp_path).suffix.lower()
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=100,
            chunk_overlap=20,
        )
        vector_store = create_db()

        if vector_store is None:
            print("Vector DB setup failed")
            return False

        print("Vector DB is successfully set up")

        try:
            if file_ext == ".txt":
                # Add error handling for TextLoader
                try:
                    loader = TextLoader(temp_path, encoding='utf-8')  # Specify encoding
                    documents = loader.load_and_split(text_splitter)
                    vector_store.add_documents(documents)
                    print("Text file successfully processed and indexed.")
                    return True
                except Exception as txt_error:
                    print(f"Error loading text file: {txt_error}")
                    return False

            elif file_ext == ".pdf":
                pdf_reader = PyPDF2.PdfReader(temp_path)
                total_pages = len(pdf_reader.pages)
                if total_pages < 5:
                    loader = PyMuPDFLoader(temp_path)
                    documents = loader.load_and_split(text_splitter)
                    vector_store.add_documents(documents)
                    print("PDF file successfully processed and indexed.")
                    return True
                else:
                    print("PDF has 5 or more pages, skipping processing.")
                    return False
            else:
                print("Unsupported file format.")
                return False
        except Exception as e:
            print(f"An error occurred while processing the file: {e}")
            return False
        finally:
            # Ensure the temporary file is deleted
            if os.path.exists(temp_path):
                os.unlink(temp_path)
    except Exception as overall_error:
        print(f"Overall error in file processing: {overall_error}")
        return False

Query Processing and Response Generation:

  • Query Encoding: Convert the user’s query into a semantic vector embedding using a transformer-based model to capture its contextual meaning.
  • Document Retrieval: Perform a similarity search in the vector database to identify and retrieve the top-k relevant documents based on the query embedding.
  • Augmented Generation: Combine the retrieved documents with the original query, and use the augmented input to prompt the large language model for response generation, as in the chain below.
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def get_retriever():
    """Expose the Pinecone vector store as a LangChain retriever."""
    db = create_db()
    retriever = db.as_retriever()
    return retriever

llm = ChatOpenAI(model="gpt-3.5-turbo")

retriever = get_retriever()
qa_chain = create_retrieval_chain(
    retriever=retriever,
    combine_docs_chain=create_stuff_documents_chain(
        llm=llm,
        prompt=ChatPromptTemplate.from_messages([
            ("system", "You are a helpful AI assistant. Answer the user's question from the given context."),
            ("human", "Use the following context to answer the question:\n\nContext: {context}\n\nQuestion: {input}"),
        ]),
    ),
)

user_question = "what is llm?"
result = qa_chain.invoke({"input": user_question})
print(result)

Challenges and Considerations:

  • Retrieval Quality: The effectiveness of RAG depends on the relevance of the retrieved documents. Implementing advanced retrieval techniques and fine-tuning the retriever are crucial for optimal performance.
  • System Complexity: Integrating retrieval mechanisms with LLMs adds complexity to the system architecture, necessitating careful design and maintenance.
  • Scalability: Handling large-scale data requires efficient indexing and retrieval strategies to maintain system responsiveness.

In summary, Retrieval-Augmented Generation represents a significant advancement in the field of natural language processing, enabling LLMs to produce more accurate, contextually relevant, and up-to-date responses by leveraging external data sources. Its implementation, while complex, offers substantial benefits in enhancing the reliability and applicability of AI-generated content.

References

Here are the references used in writing this RAG blog:

  1. https://python.langchain.com/docs/tutorials/rag/
  2. https://python.langchain.com/v0.2/docs/tutorials/rag/
  3. https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/
  4. https://github.com/langchain-ai/rag-from-scratch

Thanks for reading! My name is Abhishek, and I have a passion for building apps and learning new technologies.
