Building an AI Chatbot With LangChain: Practical Developer Guide
Most "build a chatbot" tutorials stop at calling the OpenAI API and printing the response. That's not a chatbot — that's a fancy curl wrapper. A real chatbot needs conversation memory, context retrieval, streaming responses, and guardrails. LangChain gives us the plumbing to wire all of that together without reinventing the wheel.
In this guide, we'll build a chatbot from scratch using LangChain 0.3+, Python, and an LLM of your choice. We'll go from a basic chain to a production-grade setup with retrieval-augmented generation (RAG), persistent memory, and streaming output.
TL;DR — What you'll build:
- A conversational AI chatbot with LangChain and Python
- Conversation memory that persists across sessions
- RAG pipeline to ground responses in your own documents
- Streaming responses for better UX
- Deployment-ready architecture with FastAPI
What Is LangChain and Why Use It for Chatbots?
LangChain is a Python (and JavaScript) framework for building applications powered by large language models. It abstracts away the boilerplate of chaining prompts, managing memory, and integrating retrieval — letting you focus on the logic that matters. As of 2026, LangChain is the most widely adopted LLM orchestration framework with over 100K GitHub stars and a mature ecosystem of integrations.
You should use LangChain when your chatbot needs more than a single API call — conversation history, document retrieval, tool use, or multi-step reasoning. If you're just wrapping a single prompt, you don't need it.
Prerequisites
Before we start, make sure you have:
- Python 3.11+
- An OpenAI API key (or Anthropic, Google, etc.)
- Basic familiarity with async Python
pip install langchain langchain-openai langchain-community faiss-cpu python-dotenv fastapi uvicorn
Set your API key in a .env file:
OPENAI_API_KEY=sk-your-key-here
Step 1: Build a Basic Conversational Chain
Let's start with the simplest possible chatbot — a chain that takes user input and returns a response with conversation history.
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
load_dotenv()
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise and direct."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
chain = prompt | llm
# Simple in-memory history
history = []
def chat(user_input: str) -> str:
    response = chain.invoke({"input": user_input, "history": history})
    history.append(HumanMessage(content=user_input))
    history.append(AIMessage(content=response.content))
    return response.content
This works, but it has problems. The history grows forever (you'll blow past the context window), it's not persistent, and there's no retrieval. Let's fix each of these.
Step 2: Add Conversation Memory That Actually Works
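LangChain's older memory classes, such as ConversationBufferWindowMemory (keeps the last N exchanges) and ConversationSummaryMemory (summarizes older messages), are legacy APIs that are deprecated in the 0.3 line. The approach used here is RunnableWithMessageHistory, which attaches a per-session message store to any chain. The wrapper on its own stores every message, so you still need to cap what reaches the model. For most use cases, a sliding window of the last 10-20 messages works well.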
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
store = {}
def get_session_history(session_id: str):
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]
chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)
# Now each session maintains its own history
response = chain_with_history.invoke(
    {"input": "What's the best way to deploy a Python app?"},
    config={"configurable": {"session_id": "user-123"}},
)
print(response.content)
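The wrapper above stores every message it sees, so the history still grows without bound. The sliding window mentioned earlier can be sketched in plain Python. This is a minimal illustration, not LangChain's own trimming utility: the `trim_history` name and the `(role, content)` tuple format are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical message format for this sketch: (role, content)
Message = Tuple[str, str]


def trim_history(history: List[Message], max_messages: int = 20) -> List[Message]:
    """Return the most recent messages, starting on a human turn."""
    if max_messages <= 0:
        return []
    window = history[-max_messages:]
    # Drop leading assistant messages so the model never sees a reply
    # whose question was cut off by the window.
    while window and window[0][0] != "human":
        window = window[1:]
    return window


if __name__ == "__main__":
    turns = [("human", "q1"), ("ai", "a1"), ("human", "q2"), ("ai", "a2")]
    print(trim_history(turns, max_messages=3))
```

Starting the window on a human turn costs at most one message of context, and it avoids sending the model an orphaned answer.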
For production, swap InMemoryChatMessageHistory for a persistent store. LangChain's community integrations cover Redis, PostgreSQL, and MongoDB. Here's the Redis version (it needs the redis Python package installed):
from langchain_community.chat_message_histories import RedisChatMessageHistory
def get_session_history(session_id: str):
    return RedisChatMessageHistory(session_id, url="redis://localhost:6379")
If you're running Redis and Postgres locally, a Docker Compose setup makes this painless to manage.
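If it helps to see what a persistent history store boils down to, here is a rough stand-in built on the standard library's sqlite3 module. It is not a LangChain class: `SQLiteHistory`, its methods, and the table layout are all assumptions for illustration. The point is only that messages are keyed by session ID and read back in insertion order.

```python
import sqlite3
from typing import List, Tuple


class SQLiteHistory:
    """Hypothetical per-session message store backed by SQLite."""

    def __init__(self, conn: sqlite3.Connection, session_id: str):
        self.conn = conn
        self.session_id = session_id
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT NOT NULL, "
            "role TEXT NOT NULL, "
            "content TEXT NOT NULL)"
        )

    def add(self, role: str, content: str) -> None:
        # "with conn" commits on success and rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
                (self.session_id, role, content),
            )

    def messages(self) -> List[Tuple[str, str]]:
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
            (self.session_id,),
        )
        return [(role, content) for role, content in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for real persistence
    history = SQLiteHistory(conn, "user-123")
    history.add("human", "hello")
    history.add("ai", "hi there")
    print(history.messages())
```

Opening a new `SQLiteHistory` with the same session ID returns the same messages, which is the property that lets a conversation survive a process restart.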
Step 3: Add RAG — Ground Your Chatbot in Real Data
A chatbot that only relies on the LLM's training data will hallucinate. Retrieval-augmented generation (RAG) fixes this by fetching relevant documents before generating a response. This is the single most impactful improvement you can make to a chatbot's accuracy.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
# 1. Load your documents
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()
# 2. Split into chunks (sizes are in characters, not tokens)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(documents)
# 3. Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# 4. Build RAG chain
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the user's question based on the context below.
If the context doesn't contain relevant info, say so honestly.
Context: {context}"""),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
document_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, document_chain)
# Note: rag_prompt expects a "history" list. Pass one explicitly ("history": [])
# or wrap rag_chain in RunnableWithMessageHistory with output_messages_key="answer",
# since the retrieval chain returns a dict whose reply is under "answer".
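To build intuition for what the retriever does with k=4, here is a toy version of top-k retrieval in plain Python. It is a conceptual sketch only: word counts stand in for learned embeddings, and the `embed`, `cosine`, and `top_k` helpers are invented for this example. They are not how FAISS or OpenAIEmbeddings work internally.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts. Real systems use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: List[str], k: int = 4) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    docs = [
        "redis stores chat history",
        "faiss indexes vectors",
        "uvicorn serves fastapi apps",
    ]
    print(top_k("how do I store chat history", docs, k=1))
```

The retrieved chunks are what ends up in the prompt's {context} slot. Everything else in the pipeline exists to make this ranking step fast and accurate.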
A few practical tips on RAG that tutorials rarely mention:
- Chunk size matters a lot. Start with 1000 characters and tune from there. Too small and you lose context; too large and you dilute relevance.
- Overlap prevents cutting sentences in half. 200 characters of overlap is a solid default.
- Embedding model choice affects cost and quality. text-embedding-3-small costs roughly 80% less than text-embedding-3-large with minimal quality loss for most use cases.
- FAISS is fine for prototyping but switch to Pinecone, Weaviate, or pgvector for production workloads over 100K documents.
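To make the chunk size and overlap tips concrete, here is a naive fixed-width splitter. It is deliberately simpler than RecursiveCharacterTextSplitter, which tries to break on separators such as paragraphs and sentences. The `chunk_text` function is a made-up helper that shows only how size and overlap interact.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-width chunks that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be between 0 and chunk_size - 1")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(chunk)
```

With the defaults, each chunk repeats the last 200 characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.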
Step 4: Add Streaming for Better UX
Users hate waiting 5-10 seconds for a complete response. Streaming tokens as they're generated makes the chatbot feel responsive even when the full response takes time. This is non-negotiable for production chatbots.
from langchain_core.callbacks import StreamingStdOutCallbackHandler
# Prints tokens to the console as they arrive (handy for local testing)
streaming_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
)
# For programmatic streaming (e.g., SSE to frontend)
# astream() yields message chunks as the model produces them
async def stream_response(user_input: str, session_id: str):
    async for chunk in chain_with_history.astream(
        {"input": user_input},
        config={"configurable": {"session_id": session_id}},
    ):
        if hasattr(chunk, "content") and chunk.content:
            yield chunk.content
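To exercise code that consumes a token stream without calling the API, a fake async generator is often enough. The sketch below is an assumption-laden stand-in: `fake_token_stream` is hypothetical and simply yields words, while `collect` shows the consuming side of the same `async for` pattern.

```python
import asyncio
from typing import AsyncIterator


async def fake_token_stream(text: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for stream_response(): yields one word at a time."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield word + " "


async def collect(stream: AsyncIterator[str]) -> str:
    """Consume an async token stream and return the concatenated text."""
    parts = []
    async for token in stream:
        parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    print(asyncio.run(collect(fake_token_stream("streaming feels faster"))))
```

Because `collect` only depends on the async iterator interface, the same function works on the real generator once an API key is configured.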
Step 5: Wrap It in a FastAPI Server
Let's make this chatbot accessible over HTTP with server-sent events (SSE) for streaming.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
app = FastAPI()
class ChatRequest(BaseModel):
    message: str
    session_id: str
@app.post("/chat")
async def chat_endpoint(req: ChatRequest):
    async def generate():
        async for token in stream_response(req.message, req.session_id):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
@app.get("/health")
async def health():
    return {"status": "ok"}
Run it:
uvicorn main:app --host 0.0.0.0 --port 8000
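One detail worth checking in the endpoint above: model tokens can contain newlines, and in the SSE wire format a blank line ends an event. A token like "\n\n" interpolated straight into `data: {token}` would be cut short on the client. A small formatter that emits one `data:` line per line of payload avoids this. The `format_sse` helper is an illustrative name, not part of FastAPI.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode a payload as one server-sent event, handling embedded newlines."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # Each payload line gets its own "data:" field; the client rejoins them with "\n".
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"  # the blank line terminates the event


if __name__ == "__main__":
    print(repr(format_sse("line one\nline two")))
```

Inside `generate()`, this would replace the f-string with `yield format_sse(token)`.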
When you're ready to deploy, you can containerize this with Docker and push it to an EC2 instance — the deployment pattern is the same for Python services. For hosting options in the region, we've compared cloud providers for Singapore-based deployments.
Common Pitfalls (and How to Avoid Them)
1. Context window overflow. If your chatbot uses gpt-4o's 128K context, you might think you're safe. You're not. Long conversations plus retrieved documents stack up fast. Always cap your history and limit how many retrieved chunks you send with each request.
2. Not handling rate limits. OpenAI and other providers will rate-limit you in production. Use LangChain's built-in retry logic:
llm = ChatOpenAI(model="gpt-4o", max_retries=3, request_timeout=30)
3. Ignoring cost. A chatbot with RAG that retrieves 4 chunks of 1000 tokens each, plus 20 messages of history, is sending roughly 8K-10K tokens per request. At GPT-4o pricing, that's around $0.03-0.05 per conversation turn. For a chatbot handling 10,000 messages/day, that's $300-500 per day, or roughly $9,000-15,000 per month, in API costs alone. Track your usage.
4. No evaluation. You can't improve what you don't measure. Log every interaction and periodically review response quality. LangSmith (LangChain's tracing tool) is excellent for this.
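The cost arithmetic is easy to script so you can rerun it whenever token counts or prices change. In this sketch the function names are invented and the prices in the usage example are placeholder values, not current pricing. Take the real per-million-token rates from your provider's pricing page.

```python
def cost_per_turn(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


def monthly_cost(per_turn: float, turns_per_day: int, days: int = 30) -> float:
    """Scale a per-turn cost up to a monthly figure."""
    return per_turn * turns_per_day * days


if __name__ == "__main__":
    # Placeholder prices for illustration only.
    per_turn = cost_per_turn(10_000, 500, input_price_per_million=2.0, output_price_per_million=10.0)
    print(f"per turn: ${per_turn:.4f}")
    print(f"per month at 10,000 turns/day: ${monthly_cost(per_turn, 10_000):,.0f}")
```

Input tokens usually dominate for RAG chatbots, so trimming history and retrieved context tends to be the most direct way to cut the bill.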
How Does LangChain Compare to Building From Scratch?
For a simple Q&A bot, you don't need LangChain — just call the API directly. But once you need memory, retrieval, streaming, and multiple LLM providers, LangChain saves roughly 60-70% of the integration code you'd otherwise write yourself. The tradeoff is an extra dependency and some abstraction overhead.
Alternatives worth considering: LlamaIndex (better for pure RAG use cases), Haystack (strong on document processing), or just raw API calls with your own thin wrapper if you want full control.
If you're using AI code assistants to help build your chatbot, tools like Claude Code can scaffold LangChain projects quickly — especially for boilerplate like prompt templates and chain definitions. For teams looking to build custom AI chatbot solutions for business use, Adaptels offers end-to-end development services for Singapore-based companies.
Production Checklist
Before shipping your LangChain chatbot, make sure you've covered:
- Rate limiting on your API endpoints (not just the LLM provider)
- Input validation — sanitize user messages, cap length at 2000-4000 characters
- Content filtering — add a moderation chain or use OpenAI's moderation endpoint
- Logging and tracing — LangSmith or your own observability stack
- Graceful degradation — what happens when the LLM provider is down?
- Cost alerts — set billing thresholds with your LLM provider
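The input validation item can start as a small function that runs before anything reaches the chain. This is a minimal sketch under assumptions: the `clean_message` name and the 4000-character cap are illustrative, and it does not replace content moderation.

```python
MAX_CHARS = 4000  # assumed cap, taken from the upper end of the checklist's range


def clean_message(raw: str, max_chars: int = MAX_CHARS) -> str:
    """Strip control characters and reject empty or oversized messages."""
    # Keep newlines and tabs; drop other non-printable characters such as NUL.
    text = "".join(ch for ch in raw if ch in "\n\t" or ch.isprintable()).strip()
    if not text:
        raise ValueError("message is empty")
    if len(text) > max_chars:
        raise ValueError(f"message exceeds {max_chars} characters")
    return text


if __name__ == "__main__":
    print(clean_message("  What is RAG?\x00  "))
```

In the FastAPI endpoint, a ValueError raised here would typically be turned into a 4xx response rather than passed on to the LLM.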
Wrapping Up
We went from a basic LLM wrapper to a production-grade chatbot with conversation memory, RAG, streaming, and a REST API. LangChain handles the orchestration so you can focus on the parts that matter — your data, your prompts, and your user experience.
The code in this guide is a starting point. For your specific use case, you'll likely need to tune chunk sizes, experiment with different retrieval strategies, and iterate on your system prompt. The good news is LangChain makes all of that swappable without rewriting your core logic.
Start simple, measure everything, and add complexity only when you need it.