RAG – Retrieval Augmented Generation¶

CIML Summer 2025¶

Setup: LangChain¶

In this RAG tutorial, we'll be working with LangChain, which is a powerful framework for building applications with language models. LangChain provides utilities for working with various language model providers, integrating embeddings, and creating chains for more complex applications. Below are the necessary imports for this notebook:

In [1]:
import os
import random
import glob
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.memory import ConversationSummaryMemory
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from chromadb.utils import embedding_functions
from langchain_community.embeddings import HuggingFaceEmbeddings

import warnings
warnings.filterwarnings('ignore')

Part 1: Retrieval¶

  • In this section, we'll focus on the retrieval aspect of RAG. We'll start by understanding vectorization, followed by storing and retrieving vectors efficiently.

Vectorizing¶

  • Vectorization is the process of converting text into vectors in an embedding space. These vectors capture the semantic meaning of the text, enabling us to perform operations like similarity calculations. We'll use HuggingFaceEmbeddings for this task; see the LangChain documentation for this object for more details.
In [2]:
vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

This vectorizer converts text into vectors in embedding space. Let's see how we can use it.

In [3]:
vectorizer.embed_query("dog")[0:10]
Out[3]:
[-0.05314699560403824,
 0.014194400049746037,
 0.007145748008042574,
 0.06860868632793427,
 -0.07848034799098969,
 0.01016747672110796,
 0.10228313505649567,
 -0.01206485740840435,
 0.09521342068910599,
 -0.030350159853696823]
  • As you can see from above, this converts text into a series of numbers.
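A quick sanity check, not part of the original exercise: the embedding returned by all-MiniLM-L6-v2 should be a 384-dimensional vector.

In [ ]:
# Check the embedding dimensionality (all-MiniLM-L6-v2 produces 384-dimensional vectors)
vec = vectorizer.embed_query("dog")
print(len(vec))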

Task 1¶

Your job is to write a function that takes in two strings, vectorizes them, and returns their cosine similarity. Implement the following function.

similarity_two_queries¶

In [4]:
def similarity_two_queries(word1, word2):
    # HINT:
    # Use vectorizer.embed_query(<text>) to embed text.
    # Use np.dot to find the cosine similarity/dot product of 2 vectors
    # TODO

    return None
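If you get stuck, here is one possible implementation (a sketch): it computes the full cosine similarity by dividing the dot product by the vector norms, so it works whether or not the embeddings are already unit-length.

In [ ]:
# Possible solution (sketch): embed both strings and compute cosine similarity
def similarity_two_queries(word1, word2):
    v1 = np.array(vectorizer.embed_query(word1))
    v2 = np.array(vectorizer.embed_query(word2))
    # Cosine similarity = dot product divided by the product of the vector norms
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))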
  • Observe the similarity scores of both 'cat' and 'dog' to the word 'kitten'
In [ ]:
print("Similarity of 'kitten' and 'cat': ",similarity_two_queries("kitten","cat"))
print("Similarity of 'kitten' and 'dog': ",similarity_two_queries("kitten","dog"))
  • By using the previously defined function, we can take pairs of texts and quantify how similar they are.

Task 2¶

Which of the words in the list words are most related to the word 'color'? The function similarity_list takes a word and a list of words, and outputs each word with its similarity score, ordered from highest to lowest.

In [ ]:
def similarity_list(word, words):
    # Compare `word` against every candidate and sort by similarity, highest first
    similarities = [(w, similarity_two_queries(word, w)) for w in words]
    return sorted(similarities, key=lambda x: x[1], reverse=True)
In [ ]:
words = ["rainbow","car","black","red","cat","tree"]
In [ ]:
# TODO: Which words are most similar to color?
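One way to answer this (assuming similarity_two_queries from Task 1 is implemented):

In [ ]:
# Rank the candidate words by their similarity to 'color'
similarity_list("color", words)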

Task 3¶

Each query below has an appropriate text that allows you to answer the question. The function match_queries_with_texts matches a query with its most related text. Come up with 3 more questions and 3 suitable answers and add them to the list below.

In [8]:
def match_queries_with_texts(queries, texts):
    # Calculate similarities between each query and text
    similarities = np.zeros((len(queries), len(texts)))
    
    for i, query in enumerate(queries):
        for j, text in enumerate(texts):
            similarities[i, j] = similarity_two_queries(query, text)
    
    # Match each query to the text with the highest similarity
    matches = {}
    for i, query in enumerate(queries):
        best_match_idx = np.argmax(similarities[i])
        matches[query] = texts[best_match_idx]
    
    return matches
In [9]:
# TODO: Fill in the list to make suitable question-text pairs.

queries = ["What are the 7 colors of the rainbow?", 
           "What does Elsie do for work?", 
           "Which country has the largest population?",
           "-- INSERT QUERY 1 HERE--",
           "-- INSERT QUERY 2 HERE--",
           "-- INSERT QUERY 3 HERE--"]
texts = ["China has 1.4 billion people.",
         "Elsie works the register at Arby's.", 
         "The colors of the rainbow are ROYGBIV.",
         "-- INSERT TEXT 1 HERE--",
         "-- INSERT TEXT 2 HERE--",
         "-- INSERT TEXT 3 HERE--"]
  • Now we shuffle the queries and texts. Let's see if we can match them!
In [10]:
import random
random.shuffle(queries)
random.shuffle(texts)

match_queries_with_texts(queries, texts)
Out[10]:
{'What are the 7 colors of the rainbow?': 'China has 1.4 billion people.',
 'What does Elsie do for work?': 'China has 1.4 billion people.',
 'Which country has the largest population?': 'China has 1.4 billion people.',
 '-- INSERT QUERY 2 HERE--': 'China has 1.4 billion people.',
 '-- INSERT QUERY 3 HERE--': 'China has 1.4 billion people.',
 '-- INSERT QUERY 1 HERE--': 'China has 1.4 billion people.'}

Vector Store¶

  • Now let's look at how we can store these vectors for efficient retrieval. There are many storage options, but in this exercise we use ChromaDB, an open-source vector database.
  • Through LangChain, we can expose the database as a *retriever* object, which essentially allows us to perform queries similarly to what we have done before.
  • Taking the texts and queries that you defined before, we can load them into ChromaDB and perform the same operations.
In [11]:
ids = list(range(len(texts)))
random_id = random.randint(100000, 999999)
db = Chroma.from_texts(texts, vectorizer, metadatas=[{"id": id} for id in ids],collection_name=f"temp_{random_id}")
retriever = db.as_retriever(search_kwargs={"k": 1})
In [12]:
texts
Out[12]:
['China has 1.4 billion people.',
 'The colors of the rainbow are ROYGBIV.',
 '-- INSERT TEXT 1 HERE--',
 "Elsie works the register at Arby's.",
 '-- INSERT TEXT 3 HERE--',
 '-- INSERT TEXT 2 HERE--']
In [ ]:
retriever.invoke("Which country has the largest population?")
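The retriever returns a list of LangChain Document objects; a small sketch of how you might pull out just the text and metadata:

In [ ]:
# Inspect the retrieved Document objects: page_content holds the text, metadata holds the id
docs = retriever.invoke("Which country has the largest population?")
for doc in docs:
    print(doc.metadata, doc.page_content)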
  • workplaces.txt contains names and workplaces of several people. Now let’s apply the same retrieval process to a file we read in.
In [ ]:
with open("workplaces.txt", 'r') as file:
    lines = file.readlines()
lines = [line.strip() for line in lines]
print(lines[0:4])

workplace_retriever is a function that reads the workplaces.txt file and returns the database as a retriever that you can use to find out the workplaces of people in the file. You can specify the number of top-k results via the function's argument.

In [ ]:
def workplace_retriever(k=3):
    with open("workplaces.txt", 'r') as file:
        lines = file.readlines()
    lines = [line.strip() for line in lines]
    
    db = Chroma.from_texts(
        lines,
        vectorizer,
        metadatas=[{"id": id} for id in range(len(lines))],
        collection_name=f"temp_{id(lines)}"
    )
    
    retriever = db.as_retriever(search_kwargs={"k": k})
    return retriever

Task 4¶

Using workplace_retriever, find out who works at Starbucks and McDonald's.

In [ ]:
# TODO: Find out who works at Starbucks and McDonald's. Use workplace_retriever(k=3).invoke(<query>) to do this.
# Remember to experiment with the value of k to make sure you find all the people that work in one place.
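A possible approach (a sketch; the query wording and the value of k are just examples to experiment with):

In [ ]:
# Retrieve the top-k lines most similar to each workplace query
retriever = workplace_retriever(k=3)
for query in ["Who works at Starbucks?", "Who works at McDonald's?"]:
    print(query)
    for doc in retriever.invoke(query):
        print("  ", doc.page_content)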

Chunking¶

The workplaces.txt data we just looked at was conveniently split into lines, with each line representing a distinct and meaningful chunk of information. This straightforward structure makes it easier to process and analyze the text data.

However, it is usually not so straightforward:

  • When dealing with text data, especially from large or complex documents, it's essential to handle the formatting and structure efficiently.
  • If a file is not so neatly formatted, we can break it down into manageable chunks using LangChain's TextLoader and RecursiveCharacterTextSplitter.
  • This allows us to preprocess and chunk the data effectively for further use in our RAG pipeline.

Let's take a look at some of the Expanse documentation. We have downloaded the contents of the documentation webpage into two text files named expanse_doc_1.txt and expanse_doc_2.txt.

In [ ]:
with open("expanse_doc_1.txt", 'r') as file:
    lines = file.readlines()
lines = [line.strip() for line in lines]
print(lines[20:35])
  • We see that the data is not split into meaningful chunks of information by default, so we need to do our best to format it in a way that is useful. This is why we use chunks, which group local and neighboring text together.
  • When using the RecursiveCharacterTextSplitter, the chunk size determines the maximum size of each text chunk. This is particularly useful when dealing with large documents that need to be split into smaller, manageable pieces for better retrieval and analysis.
In [ ]:
def expanse_retriever(chunk_size):
    loader = TextLoader('expanse_doc_1.txt')
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=10, separators=[" ", ",", "\n"])
    texts = text_splitter.split_documents(documents)
    db = Chroma(embedding_function=vectorizer, collection_name=f"expanse_temp_{id(texts)}")
    db.add_documents(texts)
    retriever = db.as_retriever(search_kwargs={"k": 3})
    return retriever

Task 5¶

A function that chunks expanse_doc_1.txt has been provided above. Experiment with different chunk sizes and pick a size that captures enough information to answer the question: *"How do you run jobs on expanse?"* Try sizes 10, 100, and 1000 and observe what information is returned.

In [ ]:
# TODO: Think about how many characters would be needed to contain useful information for such a complex task
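A minimal sketch for comparing chunk sizes (it simply prints the start of each retrieved chunk so you can judge how much useful context each size captures):

In [ ]:
# Compare what the retriever returns for different chunk sizes
question = "How do you run jobs on expanse?"
for size in [10, 100, 1000]:
    print(f"--- chunk_size = {size} ---")
    for doc in expanse_retriever(size).invoke(question):
        print(doc.page_content[:200], "\n")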

Multiple Document Chunking¶

When we have more than one document we want to use in our database, we can simply chunk them iteratively. Metadata for the text source is added by default, but we can also add our own metadata, such as chunk IDs.

Task 6¶

expanse_all_retriever, a function that chunks both expanse_doc_1.txt and expanse_doc_2.txt, has been provided below. Using a chunk size of 1000 characters, find which document the information for *"Compiling Codes"* is most likely located in. Hint: look at the metadata.

In [ ]:
def expanse_all_retriever(chunk_size):
    random_id = random.randint(100000, 999999)  # random 6-digit ID for uniqueness

    db = Chroma(
        embedding_function=vectorizer,
        collection_name=f"expanse_all_temp_{random_id}"
    )

    pattern = 'expanse_doc_*.txt'
    file_list = glob.glob(pattern)

    for file_name in file_list:
        loader = TextLoader(file_name)
        documents = loader.load()

        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=10,
            separators=[" ", ",", "\n"]
        )
        texts = text_splitter.split_documents(documents)

        for i, text in enumerate(texts):
            text.metadata["chunk_number"] = i

        db.add_documents(texts)

    retriever = db.as_retriever(search_kwargs={"k": 3})
    return retriever
In [ ]:
# TODO: Find the relevant source for the query "Compiling Codes"
In [ ]:
chunks = expanse_all_retriever(1000).invoke("Compiling Codes")
for chunk in chunks:
    print(chunk.metadata)

Part 2: Basic RAG¶

Ollama is an open-source LLM platform that allows us to use a plethora of different LLMs.

We first need to launch the Ollama instance. In a terminal window in the JupyterLab instance, run:

export OLLAMA_HOST="0.0.0.0:$(shuf -i 3000-65000 -n 1)"; echo ${OLLAMA_HOST##*:} > ~/.ollama_port

ollama serve

In [ ]:
import os
from ollama import Client

# Read the port from the file
with open(os.path.expanduser('~/.ollama_port')) as f:
    port = f.read().strip()

# Connect to 127.0.0.1:<port>
host = f"http://127.0.0.1:{port}"

client = Client(host=host)
In [ ]:
# Get LLM
client.pull("gemma3:4b")
In [ ]:
llm = Ollama(
    model="gemma3:4b",
    base_url=f"http://127.0.0.1:{port}",  # CRITICAL: Use your custom port
    temperature=0
)
In [ ]:
llm.invoke("How are you?")

Task 7¶

Write a function that takes your question, retrieves relevant context using workplace_retriever, and then sends that context to Ollama so it can answer your question in natural language. Fill in workplace_question below to accomplish this.

In [ ]:
# TODO
def workplace_question(question):
    retriever = #TODO: assign the retriever
    context = #TODO: invoke the retriever here
    llm = Ollama(model="gemma3:4b",base_url=f"http://127.0.0.1:{port}",temperature="0.2")
    prompt = f"Based on the following context: {context}, answer the question: "
    response = #TODO: invoke ollama with the prompt and question
    return response
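One possible way to fill in workplace_question (a sketch; the exact prompt wording and value of k are up to you):

In [ ]:
# Possible solution (sketch): retrieve context, then ask the LLM to answer using that context
def workplace_question(question):
    retriever = workplace_retriever(k=3)
    context = retriever.invoke(question)  # list of Document objects used as context
    llm = Ollama(model="gemma3:4b", base_url=f"http://127.0.0.1:{port}", temperature=0.2)
    prompt = f"Based on the following context: {context}, answer the question: "
    response = llm.invoke(prompt + question)
    return response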
In [ ]:
print(workplace_question("Who are the people that work at Starbucks?"))

Part 3: LangChain RAG¶

The above is a very simple example of RAG. Now, using LangChain, we can put everything together in a cleaner, all-inclusive way in one go. Let's combine everything we've learned into the function generate_rag.

  • The implementation below includes a custom class that lets us view which chunks are being used for our queries.
In [ ]:
def generate_rag(verbose=False, chunk_info=False):
    import glob
    random_id = random.randint(100000, 999999)
    vectorizer = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    db = Chroma(embedding_function=vectorizer, collection_name=f"expanse_all_temp_{random_id}")
    pattern = 'expanse_doc_*.txt'
    file_list = glob.glob(pattern)
    for file_name in file_list:
        loader = TextLoader(file_name)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=10, separators=[" ", ",", "\n"])
        texts = text_splitter.split_documents(documents)
        for i, text in enumerate(texts):
            text.metadata["chunk_number"] = i
        db.add_documents(texts)
    
    template = """<s>[INST] Given the context - {context} </s>[INST] [INST] Answer the following question - {question}[/INST]"""
    pt = PromptTemplate(
                template=template, input_variables=["context", "question"]
            )
    # Let's retrieve the top 3 chunks for our results
    retriever = db.as_retriever(search_kwargs={"k": 3})
    class CustomRetrievalQA(RetrievalQA):
        def invoke(self, *args, **kwargs):
            result = super().invoke(*args, **kwargs)
            if chunk_info:
                # Print out the chunks that were retrieved
                print("Chunks being looked at:")
                chunks = retriever.invoke(*args, **kwargs)
                for chunk in chunks:
                    print(f"Source: {chunk.metadata['source']}, Chunk number: {chunk.metadata['chunk_number']}")
                    print(f"Text snippet: {chunk.page_content[:200]}...\n")  # Print the first 200 characters
            return result
    rag = CustomRetrievalQA.from_chain_type(
        llm=Ollama(model="gemma3:4b", base_url=f"http://127.0.0.1:{port}", temperature=0),
        retriever=retriever,
        memory=ConversationSummaryMemory(llm=Ollama(model="gemma3:4b", base_url=f"http://127.0.0.1:{port}")),
        chain_type_kwargs={"prompt": pt, "verbose": verbose},
    )

    return rag

Task 8¶

Compare how Gemma performs without context and with context, i.e., without RAG and with RAG.

In [ ]:
print(llm.invoke("How can a user check their resource allocations and the resources they can use on the Expanse supercomputer"))
#Try "How can a user check their resource allocations and the resources they can use on the Expanse supercomputer"
In [ ]:
expanse_rag = generate_rag()
result = expanse_rag.invoke("How can a user check their resource allocations and the resources they can use on the Expanse supercomputer")
print(result["result"])
  • When we set verbose to True, we can see exactly what is being passed into the LLM, highlighted in green.
In [ ]:
expanse_rag = generate_rag(verbose=True)
result = expanse_rag.invoke("How can a user check their resource allocations and the resources they can use on the Expanse supercomputer")
print(result["result"])
  • For more concise information, setting chunk_info=True lets us see individual chunk details as well as their sources.
In [ ]:
expanse_rag = generate_rag(chunk_info=True)
result = expanse_rag.invoke("How can a user check their resource allocations and the resources they can use on the Expanse supercomputer")
In [ ]:
print(result["result"])

Great work! We've officially made a chatbot that can help us out with all things Expanse, at least according to the 2 .txt files we have access to!