LangChain RAG
What You'll Build Today
Welcome to Day 52! Today, we are going to give you superpowers.
Up until now, building a chatbot that knows about your specific data (RAG) has been a manual, heavy-lifting process. You had to write code to open files, clean text, split it into chunks, send it to an embedding model, save it to a database, search that database, and finally paste the results into a prompt for the LLM.
Today, we are going to replace hundreds of lines of manual logic with LangChain.
You will build a Universal Document Q&A System. By the end of this lesson, you will be able to point your Python script at a website or a PDF, and in about 30 lines of code, have a fully functioning AI assistant that can answer questions based on that content.
Here is what you will learn and why:
* Document Loaders: Because writing custom parsing logic for PDFs, HTML, Word docs, and text files is tedious and error-prone.
* Text Splitters: Because LLMs have context limits, and you need a smart way to break large documents into digestible pieces without cutting sentences in half.
* Vector Store Integrations: Because manually managing database connections and insertion loops is unnecessary work.
* Retrievers & Chains: Because you shouldn't have to manually glue the "search" step to the "generation" step every single time.
Let's turn that pain into power.
---
The Problem
Let's look at what life is like without a framework like LangChain.
Imagine your boss asks you to build a tool that answers questions about a company policy PDF. You think, "Okay, I know Python." You start writing.
First, you need a library just to read the PDF. Then you realize the text comes out messy, so you write regex to clean it. Then you need to chunk it, but you can't just slice the string every 500 characters because you might cut a word in half. Then you have to loop through every chunk to get embeddings...
Here is a glimpse of the "Manual Way" (Do not run this, just look at how painful it is):
# The "Painful Manual Way"
import pypdf
import requests
import os
# 1. Boilerplate just to read a PDF
def read_pdf(path):
reader = pypdf.PdfReader(path)
text = ""
for page in reader.pages:
text += page.extract_text() + "\n"
return text
# 2. Fragile chunking logic
def terrible_chunker(text, size=500):
# This is bad because it splits words in half!
chunks = []
for i in range(0, len(text), size):
chunks.append(text[i:i+size])
return chunks
# 3. Manual embedding loop
def get_embeddings_manually(chunks):
vectors = []
for chunk in chunks:
# Imagine calling an API here manually for every single chunk
# handling retries, errors, and rate limits yourself...
pass
return vectors
# 4. The main logic is messy
raw_text = read_pdf("policy.pdf")
clean_text = raw_text.replace("\n", " ") # Basic cleaning
chunks = terrible_chunker(clean_text)
# ... 50 more lines to set up a database and query it ...
The Frustration:
* What if you want to switch from PDF to a Website? You have to rewrite the read_pdf function entirely.
* The terrible_chunker breaks sentences and words, confusing the AI.
* You are spending 90% of your time on plumbing (reading files, API loops) and only 10% on the AI.
There has to be a better way. This is exactly why LangChain exists.
---
Let's Build It
We are going to build a "Chat with a Website" tool. We will use LangChain to load a URL, split the text, index it, and ask questions.
Prerequisites
You will need to install the LangChain community packages and ChromaDB (a local vector database).
```bash
pip install langchain langchain-openai langchain-community langchain-chroma chromadb bs4
```
Step 1: The Imports
We need to bring in the specific modules for the "ETL" pipeline (Extract, Transform, Load) of RAG.
import os
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
# Set your API key
os.environ["OPENAI_API_KEY"] = "sk-..." # Replace with your actual key
Step 2: Load the Data (The "Extract" Step)
Instead of writing requests and BeautifulSoup code manually, we use a Loader. LangChain has loaders for almost everything (YouTube, CSV, PDF, Notion, etc.).
We will use WebBaseLoader. It fetches the HTML and strips away the tags, giving us just the text.
# 1. Load
print("Loading content...")
url = "https://lilianweng.github.io/posts/2023-06-23-agent/"
loader = WebBaseLoader(url)
data = loader.load()
# Let's see what we got
print(f"Loaded {len(data)} document(s).")
print(f"Content preview: {data[0].page_content[:200]}...")
Why this matters: You didn't have to worry about HTTP status codes, headers, or parsing HTML tags. It just worked.
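If you are curious what the loader actually returned, each item in data is a LangChain Document with two fields: page_content (the extracted text) and metadata (for example, the source URL). A quick optional peek, reusing data from above:

```python
# Optional: inspect the structure of a loaded Document
doc = data[0]
print(doc.metadata)           # e.g. {'source': 'https://...', 'title': '...'}
print(len(doc.page_content))  # how many characters of text were extracted
```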
Step 3: Split the Text (The "Transform" Step)
We cannot feed the entire website into the LLM at once (it might be too big), and for search to work well, we want specific snippets.
We use RecursiveCharacterTextSplitter. This is smarter than simple splitting: it tries to split on paragraphs (\n\n) first, then single line breaks (\n), then spaces, keeping related text together.
# 2. Split
print("Splitting text...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # Characters per chunk
    chunk_overlap=200  # Overlap to maintain context between chunks
)
splits = text_splitter.split_documents(data)
print(f"Split into {len(splits)} chunks.")
print(f"First chunk: {splits[0].page_content[:100]}...")
Why this matters: The chunk_overlap is crucial. If a sentence is cut off at the end of chunk 1, the overlap ensures it appears complete at the start of chunk 2.
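To see that overlap in action, here is a tiny optional sketch that runs the splitter's split_text method on a made-up paragraph, with deliberately small sizes so the repeated text is easy to spot:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Toy example: small chunk_size and chunk_overlap so the overlap is visible
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=10)
demo_text = (
    "LangChain loaders pull text in. Splitters break it up. "
    "Vector stores index it. Retrievers find it again when you ask."
)
for i, piece in enumerate(demo_splitter.split_text(demo_text)):
    print(f"Chunk {i}: {piece!r}")
```

Notice how the end of one chunk reappears at the start of the next.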
Step 4: Index and Store (The "Load" Step)
Now we need to turn those text chunks into numbers (embeddings) and store them in a database so we can search them. We will use Chroma (a vector store) and OpenAIEmbeddings.
In the manual way, this would be a loop. In LangChain, it is one line.
# 3. Vector Store
print("Creating vector store...")
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings()
)
print("Vector store created.")
Why this matters: This single command handled:
* Calling the OpenAI API for every chunk.
* Creating a local database.
* Saving the vectors and the text together.
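Before wiring up the full chain, you can sanity-check the index on its own. This optional sketch queries the vector store directly with similarity_search (the question here is just an example about the article):

```python
# Optional: query the vector store directly (no LLM involved yet)
hits = vectorstore.similarity_search("What is task decomposition?", k=2)
for i, doc in enumerate(hits):
    print(f"--- Result {i} ---")
    print(doc.page_content[:150], "...")
```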
Step 5: The Retrieval Chain
This is the final piece of magic. We need a system that:
1. Takes your question.
2. Searches the vector store for relevant chunks.
3. Sends the question + relevant chunks to the LLM.
4. Returns the answer.
LangChain wraps this entire workflow into RetrievalQA.
# 4. Retrieval Chain
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" means "stuff all found documents into the prompt"
    retriever=vectorstore.as_retriever()
)
# 5. Ask a question
question = "What are the key components of an autonomous agent system?"
print(f"\nQuestion: {question}")
response = qa_chain.invoke(question)
print(f"Answer: {response['result']}")
Full Runnable Code
Here is the entire system in one block.
import os
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain.chains import RetrievalQA
# SETUP
os.environ["OPENAI_API_KEY"] = "your-key-here"
# 1. LOAD
# We are loading a technical article about AI Agents
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()
# 2. SPLIT
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
# 3. STORE
# This creates a temporary in-memory vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
# 4. RETRIEVE & GENERATE
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
# Create the chain that connects the database to the LLM
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
# 5. RUN
query = "What is the difference between short-term and long-term memory in agents?"
print("Thinking...")
response = qa_chain.invoke(query)
print("-" * 50)
print(f"Query: {query}")
print(f"Answer: {response['result']}")
print("-" * 50)
---
Now You Try
You have the base system. Now let's tweak the knobs to understand how it works.
1. Change the Loader (PDF Support)
The WebBaseLoader is great for URLs, but businesses run on PDFs.
* Task: Install pypdf (pip install pypdf).
* Action: Import PyPDFLoader from langchain_community.document_loaders.
* Goal: Point the loader at a PDF file on your computer instead of a URL. The rest of the code (splitting, embedding, querying) should remain exactly the same. A sketch of the swap follows below.
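A minimal sketch of that swap, assuming a local file such as policy.pdf (the path is a placeholder; use your own PDF):

```python
from langchain_community.document_loaders import PyPDFLoader

# Requires: pip install pypdf
loader = PyPDFLoader("policy.pdf")  # placeholder path
docs = loader.load()                # one Document per page
print(f"Loaded {len(docs)} page(s)")
# The split / embed / query code from the main example works unchanged on `docs`.
```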
2. Peek Under the Hood
Right now, the AI gives you an answer, but you don't know which part of the document it used.
* Task: Modify the qa_chain creation.
* Action: Add return_source_documents=True to the RetrievalQA.from_chain_type arguments.
* Goal: When you print response, look at response['source_documents']. It will show you the exact text chunks the Retriever found. A sketch follows below.
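A quick sketch of that change, reusing the llm and vectorstore from the main example:

```python
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,  # also return the chunks the answer was based on
)

response = qa_chain.invoke("What are the key components of an autonomous agent system?")
print(response["result"])
for doc in response["source_documents"]:
    print("SOURCE:", doc.page_content[:100], "...")
```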
3. Adjust Retrieval Sensitivity
By default, the retriever fetches 4 chunks. Sometimes you want more context; sometimes fewer, more focused chunks give better answers.
* Task: Modify the retriever argument.
* Action: Change retriever=vectorstore.as_retriever() to retriever=vectorstore.as_retriever(search_kwargs={"k": 2}).
* Goal: Run the query again. It will now only use the top 2 most similar chunks. Does the answer quality change?
---
Challenge Project: The Multi-Doc Researcher
Your challenge is to build a script that can ingest three different types of information at once and answer questions based on the combined knowledge.
Requirements:
1. Create a list of "sources". One should be a website URL, one should be a local text file (.txt), and one should be a local PDF.
2. Use the correct Loader for each type (WebBaseLoader, TextLoader, PyPDFLoader).
3. Combine all loaded documents into a single list called all_docs.
4. Split, embed, and store all_docs in a single Chroma vector store.
5. Allow the user to input a question that requires knowledge from at least two of the sources to answer correctly.
Example Scenario:
* Text File: Contains a list of employee names and their IDs.
* PDF: Contains the company holiday policy.
* Web: Contains the current stock price of the company.
* Question: "Can employee John Doe (ID 123) take a holiday today given the stock price?"
Hints:
* Loaders return a list of documents. You can combine lists in Python using + or extend.
* Start with an empty list: all_docs = []
* Then add each loader's output, e.g. all_docs.extend(pdf_loader.load()) and all_docs.extend(web_loader.load())
* A small ingestion sketch follows below.
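A minimal ingestion sketch for that hint. The file names (employees.txt, policy.pdf) and the URL are placeholders; swap in your own sources:

```python
from langchain_community.document_loaders import WebBaseLoader, TextLoader, PyPDFLoader

# Placeholder sources: replace with your own URL and files
web_loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
text_loader = TextLoader("employees.txt")
pdf_loader = PyPDFLoader("policy.pdf")

all_docs = []
all_docs.extend(web_loader.load())
all_docs.extend(text_loader.load())
all_docs.extend(pdf_loader.load())

print(f"Loaded {len(all_docs)} documents from 3 sources")
# From here, the split -> embed -> store -> ask pipeline is identical to the main example.
```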
---
What You Learned
Today you moved from "manual labor" to "architecting systems."
* Loaders: You learned to swap out data sources (Web vs PDF) without rewriting your core logic.
* Text Splitters: You learned how to prepare text for LLMs using recursive splitting.
* Vector Stores: You learned to ingest and index data in a single line of code.
* RetrievalQA: You learned how to chain search and generation together automatically.
Why This Matters: In the real world, data lives everywhere: SharePoint, Google Drive, Slack, and websites. LangChain lets you build a "Brain" that connects to all these data sources through standard connectors, rather than writing a custom script for each one.
Tomorrow: We explore LlamaIndex. While LangChain is a general-purpose framework for everything, LlamaIndex was built specifically for RAG and data indexing. We will see how it compares.