
Ollama guide: Building local RAG chatbots with LangChain

Saif Ali
Nov 04, 2024
14 min read

In the rapidly evolving world of artificial intelligence (AI) and natural language processing (NLP), Ollama has emerged as a game-changer for developers and enthusiasts looking to run large language models (LLMs) locally. By enabling the deployment of LLMs on personal computers, Ollama offers significant advantages such as enhanced privacy, cost-efficiency, and reduced latency. This powerful, open-source tool simplifies the process of downloading, running, and managing LLMs, making advanced AI capabilities more accessible than ever before.

This blog explores Ollama’s features, functionalities, and potential impact. It explains what Ollama offers and how to use it to build a Retrieval-Augmented Generation (RAG) chatbot using Streamlit.

What is Ollama?#

Ollama is an open-source project allowing users to run LLMs locally on their machines. It provides a simple command-line interface for downloading, running, and managing various LLMs, including popular models like Llama 3, Mistral, Gemma 2, and LLaVA.

Key features of Ollama#

The following are some key features of Ollama:

  • Easy installation: Ollama can be installed with a single command on macOS and Linux systems.

  • Wide model support: It supports a variety of models, from smaller, faster options to larger, more capable ones. The complete list of supported models is available in the Ollama model library.

  • Custom model creation: Users can create and share their own custom models using Modelfiles.

  • API access: Ollama provides a RESTful API, allowing integration with other applications and services.

  • Efficient resource management: It optimizes resource usage, making it possible to run models on consumer-grade hardware.

  • Cross-platform compatibility: Originally designed for macOS and Linux, Ollama now offers Windows support as well.

Getting started with Ollama#

Ollama is a powerful tool designed for efficiently running LLMs on your local machine. Whether you’re a developer looking to integrate AI capabilities into your application or someone interested in experimenting with language models, Ollama provides a user-friendly experience.

We’ll walk you through the installation process, running models, managing them, and even creating custom models tailored to your needs. Let’s dive in!

Installation#

Ollama is now available for Windows, macOS, and Linux! You can download it from the Ollama website.

Running a model#

Once installed, you can run a model with a simple command:

ollama run llama3.1

This downloads and runs the Llama 3.1 model. You can replace llama3.1 with any other supported model name.

Model management#

Ollama provides commands for listing, removing, and updating models (a short example session follows the list):

  • List models: ollama list

  • Remove a model: ollama rm modelname

  • Update models: ollama pull modelname
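
Putting these together, a typical session might look like the following. The listing output shown here is illustrative; the exact names, IDs, and sizes depend on which models you have installed:

# Download (or update) a model
ollama pull llama3.1

# See what is installed locally (example output; yours will differ)
ollama list
# NAME               ID              SIZE      MODIFIED
# llama3.1:latest    42182419e950    4.7 GB    2 days ago

# Remove a model you no longer need
ollama rm llama3.1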

Creating custom models#

One of Ollama’s most powerful features is the ability to create custom models using Modelfiles. These are similar to Dockerfiles and allow you to define a model’s base, add extra data, and set various parameters.

Here’s a simple example of a Modelfile:

FROM llama3.1
# Set the temperature to 0.7 [balancing coherence and creativity]
PARAMETER temperature 0.7
# Set the system message
SYSTEM """
You are an AI assistant specialized in Ollama, an open-source project for running large language models
locally. Your primary role is to assist users with Ollama-related queries, including installation,
usage, model management, and troubleshooting. Provide accurate, helpful, and concise information
specifically about Ollama. If you're unsure about something, acknowledge it and guide the user to
potential sources for more information. Always ensure that your responses reflect the most recent and
verified information about Ollama.
"""

Let’s break down this Modelfile:

  • FROM llama3.1: This specifies that we’re using the Llama 3.1 model as our base. Llama 3.1 is a powerful language model that can handle complex queries and provide detailed responses.

  • PARAMETER temperature 0.7: This sets the temperature parameter to 0.7, which balances coherence and creativity. It allows the AI to provide varied responses while maintaining accuracy, which is crucial for technical assistance.

  • SYSTEM "...": This block defines the AI assistant’s role and behavior. It instructs the AI to:

    • Specialize in Ollama-related information

    • Assist with various aspects of Ollama (installation, usage, model management, troubleshooting)

    • Provide accurate and concise information

    • Acknowledge when unsure and guide users to other resources

    • Prioritize up-to-date information about Ollama

To use this Modelfile with Ollama:

  1. Save the Modelfile in a text file named Modelfile (without any file extension).

  2. Open a terminal and navigate to the directory containing the Modelfile.

  3. Run the following command to create the custom Ollama assistant model: ollama create ollama-assistant -f Modelfile

  4. Once created, you can run the Ollama AI assistant using: ollama run ollama-assistant

This will start an interactive session where you can ask questions and get assistance related to Ollama. The AI will respond with helpful information about Ollama, its features, usage, and any other relevant topics.

Remember that the assistant’s knowledge will be based on the training data of the underlying Llama 3.1 model, so verifying critical information from official Ollama documentation or resources is always a good idea.
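
Modelfiles support more than just the temperature setting. The sketch below shows a few other commonly used parameters; the values are illustrative, so check the Modelfile documentation for the full list and sensible defaults for your model:

FROM llama3.1
# Larger context window for working with longer prompts
PARAMETER num_ctx 4096
# Nucleus sampling cutoff (consider tokens covering the top 90% of probability)
PARAMETER top_p 0.9
# Only sample from the 40 most likely tokens
PARAMETER top_k 40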

Ollama API#

Ollama provides a powerful RESTful API that allows developers to directly integrate LLM capabilities into their applications. This API opens up a world of possibilities for creating AI-powered features without the need for complex setups or cloud services.

Understanding RESTful APIs#

Before diving into the Ollama API, let’s briefly explain what a RESTful API is:

  • REST stands for Representational State Transfer.

  • It’s an architectural style for designing networked applications.

  • RESTful APIs use HTTP requests to perform CRUD (create, read, update, delete) operations on resources.

  • They typically use JSON for data formatting.

Ollama API overview#

The Ollama API provides several endpoints for different functionalities:

  • Generating text

  • Managing models (listing, creating, deleting)

  • Embedding generation

  • Model information retrieval

For this example, we’ll focus on the text generation endpoint.
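
Before writing any Python, you can sanity-check the endpoint from the command line. The request below is a minimal sketch, assuming Ollama is running locally on its default port and the llama3.1 model has already been pulled:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?"
}'

By default, the endpoint streams its output as one JSON object per line. The Python example below calls the same endpoint with the requests library and reassembles that stream into a single answer.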

import requests
import json

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.1',
    'prompt': 'Why is the sky blue?'
})

# Print the raw response content
print("Raw response content:")
print(response.text)

# Check the content type
content_type = response.headers.get('Content-Type')
print("Content-Type:", content_type)

# Split the content into lines
lines = response.text.strip().split('\n')

# Parse each line as JSON and extract the response
responses = [json.loads(line)["response"] for line in lines]

# Combine the responses into a single string
combined_response = ''.join(responses)

print(combined_response)
  • Lines 1–2: We import the requests library, a popular Python package for making HTTP requests, and the json library, which provides methods for parsing and handling JSON data.

  • Lines 4–7: We send a POST request to the specified URL:

    • URL: 'http://localhost:11434/api/generate'

      • localhost: This indicates that the API is running on our local machine.

      • 11434: This is the default port for the Ollama API.

      • /api/generate: This is the endpoint for text generation.

    • json={}: This sends the data in JSON format in the request body.

      • 'model': 'llama3.1': This specifies the model to use for generation.

      • 'prompt': 'Why is the sky blue?': This is the input text for the model to respond to.

  • Lines 10–11: We output the raw text of the response for debugging purposes.

  • Lines 14–15: We retrieve and print the content type from the response headers.

  • Line 18: We break the response text into individual lines, since the API streams its output as one JSON object per line by default (a non-streaming variant is shown after this list).

  • Line 21: We parse each line as JSON and extract the "response" field, which contains the generated text.

  • Line 24: We join the extracted responses into a single string.

  • Line 26: We output the complete generated text to the console.
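
If you would rather receive one complete JSON object instead of a line-by-line stream, the API accepts a "stream": false flag. Here is a minimal sketch using the same endpoint and model as above:

import requests

# Ask for the complete response in a single JSON object instead of a stream
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.1',
    'prompt': 'Why is the sky blue?',
    'stream': False
})

# With streaming disabled, the body contains a single "response" field
print(response.json()["response"])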

Running the example#

To run this example:

  • Ensure Ollama is installed and running on your machine.

  • Make sure you have the requests library installed (pip install requests).

  • Save the code in a Python file (e.g., ollama_api_example.py).

  • Run the script using python3 ollama_api_example.py.

Building a local RAG-based chatbot with Streamlit and Ollama#

Let’s create an advanced Retrieval-Augmented Generation (RAG) based chatbot using Streamlit, Ollama, and other powerful libraries. For instance, a customer service team can deploy this chatbot to handle frequently asked questions by accessing and referencing internal documents such as FAQs, product manuals, and support guides. By doing so, the chatbot ensures that customers receive accurate and consistent responses quickly, reducing the workload on human agents and improving overall response times.


Let’s break down the process step by step.

Step 1: Setting up the environment#

Before we begin coding, we must ensure that our development environment has all the necessary libraries. These libraries will enable us to build our chatbot with advanced natural language processing capabilities.

First, let’s install the required packages:

pip install streamlit PyPDF2 langchain-community langchain pillow PyMuPDF chromadb

This command installs Streamlit for our web interface, PyPDF2 for PDF processing, LangChain for our language model interactions, Pillow for image processing, PyMuPDF for PDF rendering, and ChromaDB for the vector store.

Before moving to the next step, pull the Llama 3.1 and nomic-embed-text models. The nomic-embed-text model converts text into numerical representations (embeddings) for tasks like search, clustering, and classification; it excels at handling long text passages and outperforms similar models. Pull both models using the following commands:

ollama pull llama3.1
ollama pull nomic-embed-text

Step 2: Importing the required libraries#

Now that our environment is set up, let’s start by importing all the necessary libraries. These imports will give us access to the tools we need for building our chatbot.

import streamlit as st
import PyPDF2
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain_community.chat_models import ChatOllama
from langchain.schema import HumanMessage, AIMessage
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain.memory import ConversationBufferMemory
from PIL import Image
import fitz # PyMuPDF
  • Line 1: We import the streamlit library for creating web applications aliased as st.

  • Line 2: We import PyPDF2, a library for reading and manipulating PDF files.

  • Line 3: We import OllamaEmbeddings from LangChain’s community embeddings module.

  • Line 4: We import RecursiveCharacterTextSplitter from LangChain for splitting text into smaller chunks.

  • Line 5: We import Chroma, a vector store from LangChain’s community module, to efficiently manage and query vector embeddings.

  • Line 6: We import ConversationalRetrievalChain from LangChain for creating a conversational AI chain.

  • Line 7: We import ChatOllama, a chat model from LangChain’s community module for generating conversational responses.

  • Line 8: We import the HumanMessage and AIMessage classes from LangChain’s schema to structure and handle chat interactions between users and AI.

  • Line 9: We import ChatMessageHistory from LangChain’s community module for storing chat history.

  • Line 10: We import ConversationBufferMemory from LangChain for maintaining conversation context.

  • Line 11: We import the Image module from PIL (Python Imaging Library) for image processing.

  • Line 12: We import fitz from PyMuPDF, a library for working with PDF documents.

Step 3: Configuring the Streamlit page#

Let’s set up our Streamlit page with a custom configuration. This will give our chatbot a professional look and feel.

st.set_page_config(page_title="Ollama RAG Chatbot", page_icon="🤖", layout="wide")

The above code configures the Streamlit page with a custom title "Ollama RAG Chatbot", sets a robot emoji as the page icon, and uses a wide layout for better space utilization.

Step 4: Creating the sidebar for PDF upload and preview#

Next, we’ll create a sidebar for PDF upload and preview functionality. This allows users to easily upload documents and view their content.

with st.sidebar:
    st.title("PDF Upload")
    uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

    if uploaded_file is not None:
        st.success("PDF uploaded successfully!")

        # PDF preview
        st.subheader("PDF Preview")
        pdf_document = fitz.open(stream=uploaded_file.read(), filetype="pdf")
        num_pages = len(pdf_document)
        page_num = st.number_input("Page", min_value=1, max_value=num_pages, value=1)
        page = pdf_document.load_page(page_num - 1)
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        st.image(img, caption=f"Page {page_num}", use_column_width=True)
  • Lines 1–3: We create a sidebar with a title "PDF Upload" and add a file uploader specifically for PDF files.

  • Lines 5–6: We display a success message when a PDF is uploaded.

  • Lines 9–12: We add a "PDF Preview" section in the sidebar. We open the uploaded PDF using PyMuPDF (fitz) and create a number input for page selection.

  • Lines 13–15: We load the selected page, render it as a pixmap with 2x scaling, and convert it to a PIL Image object.

  • Line 17: We display the rendered page image in the sidebar with a caption showing the page number.

This code snippet creates a sidebar for PDF upload and preview functionality. It allows users to upload PDFs and interactively view their pages within the Streamlit application.

Step 5: Setting up the main content area#

Now, let’s set up the main content area of our chatbot interface. This is where we’ll display the chat history and input field.

st.title("Ollama RAG Chatbot with Latest Llama Model")
if "chain" not in st.session_state:
st.session_state.chain = None
if "chat_history" not in st.session_state:
st.session_state.chat_history = []
  • Line 1: We display a large, bold title at the top of the main Streamlit app area, introducing the chatbot.

  • Line 3: We check if a 'chain' key exists in Streamlit’s session state. This key stores the conversation chain.

  • Line 4: If 'chain' doesn’t exist in the session state, it’s initialized to None.

  • Line 6: We check if a 'chat_history' key exists in the session state. This is used to maintain conversation history across reruns.

  • Line 7: If 'chat_history' doesn’t exist, it’s initialized as an empty list to store conversation messages.

Step 6: Implementing PDF processing#

To work with the uploaded PDF, we need a function to extract its text content. Here’s how we can do that:

def process_pdf(file):
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_text = ""
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()
    return pdf_text
  • Lines 1–6: This function processes a PDF file, extracting text from all pages and combining it into a single string. It uses PyPDF2 to read the PDF, iterate through each page, extract the text, and concatenate it. The function returns the entire PDF text content as a single string.

Step 7: Setting up the RAG pipeline#

The heart of our chatbot is the RAG pipeline. This system processes the PDF content, creates embeddings, and sets up the conversational chain.

if uploaded_file is not None:
    if st.session_state.chain is None:
        with st.spinner("Processing PDF..."):
            pdf_text = process_pdf(uploaded_file)

            text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
            texts = text_splitter.split_text(pdf_text)

            metadatas = [{"source": f"chunk_{i}"} for i in range(len(texts))]

            embeddings = OllamaEmbeddings(model="nomic-embed-text")
            docsearch = Chroma.from_texts(texts, embeddings, metadatas=metadatas)

            message_history = ChatMessageHistory()
            memory = ConversationBufferMemory(
                memory_key="chat_history",
                output_key="answer",
                chat_memory=message_history,
                return_messages=True,
            )

            st.session_state.chain = ConversationalRetrievalChain.from_llm(
                ChatOllama(model="llama3.1", temperature=0.7),
                chain_type="stuff",
                retriever=docsearch.as_retriever(search_kwargs={"k": 1}),
                memory=memory,
                return_source_documents=True,
            )

            st.success("PDF processed successfully!")
  • Lines 1–2: We check if a PDF file is uploaded and if the conversation chain hasn’t been initialized yet.

  • Lines 3–4: We display a spinner while processing the PDF, then extract text from the uploaded PDF using the process_pdf function.

  • Lines 6–7: We split the extracted text into smaller chunks using RecursiveCharacterTextSplitter from LangChain.

  • Line 9: We create metadata for each text chunk, associating it with a source identifier.

  • Lines 11–12: We initialize OllamaEmbeddings with the "nomic-embed-text" model and create a Chroma vector store from the text chunks (a variation that persists this index to disk is sketched after this list).

  • Lines 14–20: We set up the conversation history and memory components using LangChain’s ChatMessageHistory and ConversationBufferMemory.

  • Lines 22–28: We create a ConversationalRetrievalChain using the ChatOllama model, the Chroma vector store as a retriever, and the previously set-up memory. This chain is stored in the Streamlit session state.

  • Line 30: We display a success message indicating that the PDF has been processed successfully.
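
Note that this index lives only in memory, so it is rebuilt every time the app restarts. As a rough sketch (not part of the original tutorial), Chroma can also persist the index to a local folder such as the hypothetical ./chroma_db below, so a previously processed PDF can be reloaded without re-embedding it:

# Variation on the step above: reuses texts, embeddings, and metadatas from Step 7
docsearch = Chroma.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
    persist_directory="./chroma_db",  # illustrative path, not from the tutorial
)

# On a later run, reload the persisted index instead of re-embedding the PDF
# (older Chroma versions may also require an explicit docsearch.persist() call)
docsearch = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)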

Step 8: Implementing the chat interface#

With our RAG pipeline set up, we can now create a chat interface where users can interact with the AI.

st.subheader("Chat with your PDF")
user_input = st.text_input("Ask a question about the document:")
if user_input:
if st.session_state.chain is None:
st.warning("Please upload a PDF file first.")
else:
with st.spinner("Thinking..."):
response = st.session_state.chain.invoke({"question": user_input})
answer = response["answer"]
source_documents = response["source_documents"]
st.session_state.chat_history.append(HumanMessage(content=user_input))
st.session_state.chat_history.append(AIMessage(content=answer))
  • Lines 1–2: We display a subheader for the chat section and create a text input for user questions.

  • Lines 4–6: We check if the user has entered a question, and if the conversation chain hasn’t been initialized (i.e., no PDF uploaded), then we display a warning.

  • Lines 7–11: If a chain exists, we use a spinner to indicate processing, then invoke the chain with the user’s question and extract the answer and source documents.

  • Lines 13–14: We append the user’s question and the AI’s answer to the chat history in the session state.

Step 9: Displaying chat history#

Finally, display the chat history and source documents for each AI response.

chat_container = st.container()
with chat_container:
    for message in reversed(st.session_state.chat_history):
        if isinstance(message, HumanMessage):
            st.markdown(f'👤 {message.content}')
        elif isinstance(message, AIMessage):
            st.markdown(f'🤖 {message.content}')

        if isinstance(message, AIMessage):
            with st.expander("View Sources"):
                for idx, doc in enumerate(source_documents):
                    st.write(f"Source {idx + 1}:", doc.page_content[:150] + "...")
  • Lines 1–2: We create a container for displaying the chat history.

  • Lines 3–7: We iterate through the chat history in reverse order, displaying human messages with a person emoji and AI messages with a robot emoji using Streamlit’s markdown function.

  • Lines 9–12: For AI messages, we create an expandable section to show source documents. Each source is displayed with a snippet of its content (the first 150 characters).

This code snippet handles the conversation history display in a chat-like format, distinguishing between user and AI messages and providing the option to view the sources used for AI responses.
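
One caveat: source_documents in the loop above is a local variable from the most recent query, so every expander shows the latest sources, and the name is undefined on a rerun that happens before a question is asked. A minimal way to tighten this, sketched below rather than taken from the original code, is to attach the sources to each AI message when it is stored in Step 8 and read them back here:

# Step 8 variation: keep the sources with the answer they belong to
st.session_state.chat_history.append(HumanMessage(content=user_input))
st.session_state.chat_history.append(
    AIMessage(content=answer, additional_kwargs={"sources": source_documents})
)

# Step 9 variation: read the sources attached to each AI message
if isinstance(message, AIMessage):
    with st.expander("View Sources"):
        for idx, doc in enumerate(message.additional_kwargs.get("sources", [])):
            st.write(f"Source {idx + 1}:", doc.page_content[:150] + "...")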

Running the chatbot#

To run your newly created chatbot, save all the code in a file (e.g., rag_chatbot.py) and execute it using the following command:

streamlit run rag_chatbot.py

This will launch a web interface where users can upload a PDF, preview it, and engage in a conversation with the AI about its contents.

Following these steps, we’ve created an advanced RAG-based chatbot that can process PDF documents and answer questions based on their content, all within a user-friendly Streamlit interface. This chatbot demonstrates the power of combining local language models with retrieval-augmented generation for document-based question answering.

Use cases for Ollama#

Here are several use cases for Ollama:

  • Local development: Test and prototype AI applications without relying on cloud services.

  • Privacy-focused applications: Run AI models locally to ensure data privacy.

  • Educational tools: Learn about and experiment with LLMs in a controlled environment.

  • Offline AI capabilities: Develop applications that can function without internet connectivity.

  • Custom assistants: Create specialized AI assistants for specific domains or tasks.

Limitations and considerations#

While Ollama is powerful, it’s important to note some limitations:

  • Hardware requirements: Running large models locally, such as Llama 3.1 405B, requires significant computational resources.

  • Model availability: Not all state-of-the-art models are available or optimized for Ollama; Google’s PaLM 2, for example, is not in Ollama’s library.

  • Continuous updates: Keeping open-source models up-to-date requires consistent effort. To ensure optimal performance, new versions and improvements must be integrated into the Ollama environment.

The future of Ollama#

As the field of AI continues to advance, tools like Ollama are likely to play an increasingly important role in democratizing access to powerful language models. Future developments may include support for more models, improved performance optimizations, and enhanced integration capabilities.

Conclusion#

Ollama represents a significant step forward in making LLMs accessible to developers and enthusiasts. By simplifying the process of running these models locally, it opens up new possibilities for AI application development, research, and education. As we’ve seen with our RAG-based chatbot example, Ollama can be easily integrated into practical applications, allowing for the creation of powerful, privacy-preserving AI tools. As the field continues to evolve, tools like Ollama will undoubtedly play a crucial role in shaping the future of AI development and deployment.

Next steps#

To further enhance your understanding of RAG and its applications, consider exploring additional resources and hands-on projects on retrieval-augmented generation, LangChain, and Ollama.

Frequently Asked Questions

How do we make a chatbot efficient?

To make a chatbot efficient, focus on technical aspects like selecting the right platform, optimizing NLP, utilizing machine learning, and implementing caching. Additionally, design efficiency is crucial, involving clear goals, limited scope, prioritized conversations, quick responses, relevant information, clear language, and human handoff options. Continuous improvement is essential, requiring monitoring, feedback gathering, and iteration. By combining these elements, we can develop a chatbot that effectively meets user needs and provides a positive experience.
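
In the Streamlit app built above, one concrete example of this is caching: heavyweight resources such as the embedding model can be created once and reused across reruns instead of being rebuilt on every interaction. Here is a minimal sketch using Streamlit’s cache_resource decorator (the helper function name is illustrative):

import streamlit as st
from langchain_community.embeddings import OllamaEmbeddings

@st.cache_resource  # created once per process and reused across reruns
def get_embeddings():
    return OllamaEmbeddings(model="nomic-embed-text")

embeddings = get_embeddings()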
