
Ollama guide: Building local RAG chatbots with LangChain

Saif Ali
Nov 04, 2024
14 min read

In the rapidly evolving world of artificial intelligence (AI) and natural language processing (NLP), Ollama has emerged as a game-changer for developers and enthusiasts looking to run large language models (LLMs) locally. By enabling the deployment of LLMs on personal computers, Ollama offers significant advantages such as enhanced privacy, cost-efficiency, and reduced latency. This powerful, open-source tool simplifies the process of downloading, running, and managing LLMs, making advanced AI capabilities more accessible than ever before.

This blog explores Ollama’s features, functionalities, and potential impact. It explains what Ollama offers and how to use it to build a Retrieval-Augmented Generation (RAG) chatbot using Streamlit.

What is Ollama?#

Ollama is an open-source project allowing users to run LLMs locally on their machines. It provides a simple command-line interface for downloading, running, and managing various LLMs, including popular models like Llama 3, Mistral, Gemma 2, and LLaVA.

Key features of Ollama#

The following are some key features of Ollama:

  • Easy installation: Ollama can be installed with a single command on macOS and Linux systems.

  • Wide model support: It supports a variety of models, from smaller, faster options to larger, more capable ones. The complete list of supported models is available in the Ollama model library.

  • Custom model creation: Users can create and share their own custom models using Modelfiles.

  • API access: Ollama provides a RESTful API, allowing integration with other applications and services.

  • Efficient resource management: It optimizes resource usage, making it possible to run models on consumer-grade hardware.

  • Cross-platform compatibility: Originally designed for macOS and Linux, Ollama now offers Windows support as well.

Getting started with Ollama#

Ollama is a powerful tool designed for efficiently running LLMs on your local machine. Whether you’re a developer looking to integrate AI capabilities into your application or someone interested in experimenting with language models, Ollama provides a user-friendly experience.

We’ll walk you through the installation process, running models, managing them, and even creating custom models tailored to your needs. Let’s dive in!

Installation#

Ollama is now available for Windows, macOS, and Linux! You can download it from the Ollama website.

Running a model#

Once installed, you can run a model with a simple command:

ollama run llama3.1

This downloads and runs the Llama 3.1 model. You can replace llama3.1 with any other supported model name.

Model management#

Ollama provides commands for listing, removing, and updating models (a short example session follows the list):

  • List models: ollama list

  • Remove a model: ollama rm modelname

  • Update models: ollama pull modelname
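
Putting these together, a typical session might look like the following. The listing output shown here is illustrative; the exact names, IDs, and sizes depend on which models you have installed:

# Download (or update) a model
ollama pull llama3.1

# See what is installed locally (example output; yours will differ)
ollama list
# NAME               ID              SIZE      MODIFIED
# llama3.1:latest    42182419e950    4.7 GB    2 days ago

# Remove a model you no longer need
ollama rm llama3.1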

Creating custom models#

One of Ollama’s most powerful features is the ability to create custom models using Modelfiles. These are similar to Dockerfiles and allow you to define a model’s base, add extra data, and set various parameters.

Here’s a simple example of a Modelfile:

FROM llama3.1
# Set the temperature to 0.7 [balancing coherence and creativity]
PARAMETER temperature 0.7
# Set the system message
SYSTEM """
You are an AI assistant specialized in Ollama, an open-source project for running large language models
locally. Your primary role is to assist users with Ollama-related queries, including installation,
usage, model management, and troubleshooting. Provide accurate, helpful, and concise information
specifically about Ollama. If you're unsure about something, acknowledge it and guide the user to
potential sources for more information. Always ensure that your responses reflect the most recent and
verified information about Ollama.
"""

Let’s break down this Modelfile:

  • FROM llama3.1: This specifies that we’re using the Llama 3.1 model as our base. Llama 3.1 is a powerful language model that can handle complex queries and provide detailed responses.

  • PARAMETER temperature 0.7: This sets the temperature parameter to 0.7, which balances coherence and creativity. It allows the AI to provide varied responses while maintaining accuracy, which is crucial for technical assistance.

  • SYSTEM "...": This block defines the AI assistant’s role and behavior. It instructs the AI to:

    • Specialize in Ollama-related information

    • Assist with various aspects of Ollama (installation, usage, model management, troubleshooting)

    • Provide accurate and concise information

    • Acknowledge when unsure and guide users to other resources

    • Prioritize up-to-date information about Ollama

To use this Modelfile with Ollama:

  1. Save the Modelfile in a text file named Modelfile (without any file extension).

  2. Open a terminal and navigate to the directory containing the Modelfile.

  3. Run the following command to create the custom Ollama assistant model: ollama create ollama-assistant -f Modelfile

  4. Once created, you can run the Ollama AI assistant using: ollama run ollama-assistant

This will start an interactive session where you can ask questions and get assistance related to Ollama. The AI will respond with helpful information about Ollama, its features, usage, and any other relevant topics.

Remember that the assistant’s knowledge will be based on the training data of the underlying Llama 3.1 model, so verifying critical information from official Ollama documentation or resources is always a good idea.
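
Modelfiles support more than just the temperature setting. The sketch below shows a few other commonly used parameters; the values are illustrative, so check the Modelfile documentation for the full list and sensible defaults for your model:

FROM llama3.1
# Larger context window for working with longer prompts
PARAMETER num_ctx 4096
# Nucleus sampling cutoff (consider tokens covering the top 90% of probability)
PARAMETER top_p 0.9
# Only sample from the 40 most likely tokens
PARAMETER top_k 40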

Ollama API#

Ollama provides a powerful RESTful API that allows developers to directly integrate LLM capabilities into their applications. This API opens up a world of possibilities for creating AI-powered features without the need for complex setups or cloud services.

Understanding RESTful APIs#

Before diving into the Ollama API, let’s briefly explain what a RESTful API is:

  • REST stands for Representational State Transfer.

  • It’s an architectural style for designing networked applications.

  • RESTful APIs use HTTP requests to perform CRUD (create, read, update, delete) operations on resources.

  • They typically use JSON for data formatting.

Ollama API overview#

The Ollama API provides several endpoints for different functionalities:

  • Generating text

  • Managing models (listing, creating, deleting)

  • Embedding generation

  • Model information retrieval

For this example, we’ll focus on the text generation endpoint.
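
Before writing any Python, you can sanity-check the endpoint from the command line. The request below is a minimal sketch, assuming Ollama is running locally on its default port and the llama3.1 model has already been pulled:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?"
}'

By default, the endpoint streams its output as one JSON object per line. The Python example below calls the same endpoint with the requests library and reassembles that stream into a single answer.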

import requests
import json

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.1',
    'prompt': 'Why is the sky blue?'
})

# Print the raw response content
print("Raw response content:")
print(response.text)

# Check the content type
content_type = response.headers.get('Content-Type')
print("Content-Type:", content_type)

# Split the content into lines
lines = response.text.strip().split('\n')

# Parse each line as JSON and extract the response
responses = [json.loads(line)["response"] for line in lines]

# Combine the responses into a single string
combined_response = ''.join(responses)

print(combined_response)
  • Lines 1–2: We import the requests library, a popular Python package for making HTTP requests, and the json library, which provides methods for parsing and handling JSON data.

  • Lines 4–7: We send a POST request to the specified URL:

    • URL: 'http://localhost:11434/api/generate'

      • localhost: This indicates that the API is running on our local machine.

      • 11434: This is the default port for the Ollama API.

      • /api/generate: This is the endpoint for text generation.

    • json={}: This sends the data in JSON format in the request body.

      • 'model': 'llama3.1': This specifies the model to use for generation.

      • 'prompt': 'Why is the sky blue?': This is the input text for the model to respond to.

  • Lines 10–11: We output the raw text of the response for debugging purposes.

  • Lines 14–15: We retrieve and print the content type from the response headers.

  • Line 18: We break the response text into individual lines, since the API streams its output as one JSON object per line by default (a non-streaming variant is shown after this list).

  • Line 21: We parse each line as JSON and extract the "response" field, which contains the generated text.

  • Line 24: We join the extracted responses into a single string.

  • Line 26: We output the complete generated text to the console.
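
If you would rather receive one complete JSON object instead of a line-by-line stream, the API accepts a "stream": false flag. Here is a minimal sketch using the same endpoint and model as above:

import requests

# Ask for the complete response in a single JSON object instead of a stream
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.1',
    'prompt': 'Why is the sky blue?',
    'stream': False
})

# With streaming disabled, the body contains a single "response" field
print(response.json()["response"])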

Running the example#

To run this example:

  • Ensure Ollama is installed and running on your machine.

  • Make sure you have the requests library installed (pip install requests).

  • Save the code in a Python file (e.g., ollama_api_example.py).

  • Run the script using python3 ollama_api_example.py.

Building a local RAG-based chatbot with Streamlit and Ollama#

Let’s create an advanced Retrieval-Augmented Generation (RAG) based chatbot using Streamlit, Ollama, and other powerful libraries. For instance, a customer service team can deploy this chatbot to handle frequently asked questions by accessing and referencing internal documents such as FAQs, product manuals, and support guides. By doing so, the chatbot ensures that customers receive accurate and consistent responses quickly, reducing the workload on human agents and improving overall response times.


Let’s break down the process step by step.

Step 1: Setting up the environment#

Before we begin coding, we must ensure that our development environment has all the necessary libraries. These libraries will enable us to build our chatbot with advanced natural language processing capabilities.

First, let’s install the required packages:

pip install streamlit PyPDF2 langchain-community langchain pillow PyMuPDF chromadb

This command installs Streamlit for our web interface, PyPDF2 for PDF processing, LangChain for our language model interactions, Pillow for image processing, PyMuPDF for PDF rendering, and ChromaDB for the vector store.

Before moving to the next step, pull the Llama 3.1 and nomic-embed-text models. The nomic-embed-text model converts text into numerical representations (embeddings) for tasks like search, clustering, and classification; it excels at handling long text passages and outperforms similar models. Pull both models using the following commands:

ollama pull llama3.1
ollama pull nomic-embed-text

Step 2: Importing the required libraries#

Now that our environment is set up, let’s start by importing all the necessary libraries. These imports will give us access to the tools we need for building our chatbot.

import streamlit as st
import PyPDF2
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain_community.chat_models import ChatOllama
from langchain.schema import HumanMessage, AIMessage
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain.memory import ConversationBufferMemory
from PIL import Image
import fitz # PyMuPDF
  • Line 1: We import the streamlit library for creating web applications aliased as st.

  • Line 2: We import PyPDF2, a library for reading and manipulating PDF files.

  • Line 3: We import OllamaEmbeddings from LangChain’s community embeddings module.

  • Line 4: We import RecursiveCharacterTextSplitter from LangChain for splitting text into smaller chunks.

  • Line 5: We import Chroma, a vector store from LangChain’s community module, to efficiently manage and query vector embeddings.

  • Line 6: We import ConversationalRetrievalChain from LangChain for creating a conversational AI chain.

  • Line 7: We import ChatOllama, a chat model from LangChain’s community module for generating conversational responses.

  • Line 8: We import the HumanMessage and AIMessage classes from LangChain’s schema to structure and handle chat interactions between users and AI.

  • Line 9: We import ChatMessageHistory from LangChain’s community module for storing chat history.

  • Line 10: We import ConversationBufferMemory from LangChain for maintaining conversation context.

  • Line 11: We import the Image module from PIL (Python Imaging Library) for image processing.

  • Line 12: We import fitz from PyMuPDF, a library for working with PDF documents.

Step 3: Configuring the Streamlit page#

Let’s set up our Streamlit page with a custom configuration. This will give our chatbot a professional look and feel.

st.set_page_config(page_title="Ollama RAG Chatbot", page_icon="🤖", layout="wide")

The above code configures the Streamlit page with a custom title "Ollama RAG Chatbot", sets a robot emoji as the page icon, and uses a wide layout for better space utilization.

Step 4: Creating the sidebar for PDF upload and preview#

Next, we’ll create a sidebar for PDF upload and preview functionality. This allows users to easily upload documents and view their content.

with st.sidebar:
    st.title("PDF Upload")
    uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

    if uploaded_file is not None:
        st.success("PDF uploaded successfully!")

        # PDF preview
        st.subheader("PDF Preview")
        pdf_document = fitz.open(stream=uploaded_file.read(), filetype="pdf")
        num_pages = len(pdf_document)
        page_num = st.number_input("Page", min_value=1, max_value=num_pages, value=1)
        page = pdf_document.load_page(page_num - 1)
        pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        st.image(img, caption=f"Page {page_num}", use_column_width=True)
  • Lines 1–3: We create a sidebar with a title "PDF Upload" and add a file uploader specifically for PDF files.

  • Lines 5–6: We display a success message when a PDF is uploaded.

  • Lines 9–12: We add a "PDF Preview" section in the sidebar. We open the uploaded PDF using PyMuPDF (fitz) and create a number input for page selection.

  • Lines 13–15: We load the selected page, render it as a pixmap with 2x scaling, and convert it to a PIL Image object.

  • Line 17: We display the rendered page image in the sidebar with a caption showing the page number.

This code snippet creates a sidebar for PDF upload and preview functionality. It allows users to upload PDFs and interactively view their pages within the Streamlit application.

Step 5: Setting up the main content area#

Now, let’s set up the main content area of our chatbot interface. This is where we’ll display the chat history and input field.

st.title("Ollama RAG Chatbot with Latest Llama Model")
if "chain" not in st.session_state:
st.session_state.chain = None
if "chat_history" not in st.session_state:
st.session_state.chat_history = []
  • Line 1: We display a large, bold title at the top of the main Streamlit app area, introducing the chatbot.

  • Line 3: We check if a 'chain' key exists in Streamlit’s session state. This key stores the conversation chain.

  • Line 4: If 'chain' doesn’t exist in the session state, it’s initialized to None.

  • Line 6: We check if a 'chat_history' key exists in the session state. This is used to maintain conversation history across reruns.

  • Line 7: If 'chat_history' doesn’t exist, it’s initialized as an empty list to store conversation messages.

Step 6: Implementing PDF processing#

To work with the uploaded PDF, we need a function to extract its text content. Here’s how we can do that:

def process_pdf(file):
    pdf_reader = PyPDF2.PdfReader(file)
    pdf_text = ""
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()
    return pdf_text
  • Lines 1–6: This function processes a PDF file, extracting text from all pages and combining it into a single string. It uses PyPDF2 to read the PDF, iterate through each page, extract the text, and concatenate it. The function returns the entire PDF text content as a single string.

Step 7: Setting up the RAG pipeline#

The heart of our chatbot is the RAG pipeline. This system processes the PDF content, creates embeddings, and sets up the conversational chain.

if uploaded_file is not None:
    if st.session_state.chain is None:
        with st.spinner("Processing PDF..."):
            pdf_text = process_pdf(uploaded_file)

            text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
            texts = text_splitter.split_text(pdf_text)

            metadatas = [{"source": f"chunk_{i}"} for i in range(len(texts))]

            embeddings = OllamaEmbeddings(model="nomic-embed-text")
            docsearch = Chroma.from_texts(texts, embeddings, metadatas=metadatas)

            message_history = ChatMessageHistory()
            memory = ConversationBufferMemory(
                memory_key="chat_history",
                output_key="answer",
                chat_memory=message_history,
                return_messages=True,
            )

            st.session_state.chain = ConversationalRetrievalChain.from_llm(
                ChatOllama(model="llama3.1", temperature=0.7),
                chain_type="stuff",
                retriever=docsearch.as_retriever(search_kwargs={"k": 1}),
                memory=memory,
                return_source_documents=True,
            )

            st.success("PDF processed successfully!")
  • Lines 1–2: We check if a PDF file is uploaded and if the conversation chain hasn’t been initialized yet.

  • Lines 3–4: We display a spinner while processing the PDF, then extract text from the uploaded PDF using the process_pdf function.

  • Lines 6–7: We split the extracted text into smaller chunks using RecursiveCharacterTextSplitter from LangChain.

  • Line 9: We create metadata for each text chunk, associating it with a source identifier.

  • Lines 11–12: We initialize OllamaEmbeddings with the "nomic-embed-text" model and create a Chroma vector store from the text chunks (a variation that persists this index to disk is sketched after this list).

  • Lines 14–20: We set up the conversation history and memory components using LangChain’s ChatMessageHistory and ConversationBufferMemory.

  • Lines 22–28: We create a ConversationalRetrievalChain using the ChatOllama model, the Chroma vector store as a retriever, and the previously set-up memory. This chain is stored in the Streamlit session state.

  • Line 30: We display a success message indicating that the PDF has been processed successfully.
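
Note that this index lives only in memory, so it is rebuilt every time the app restarts. As a rough sketch (not part of the original tutorial), Chroma can also persist the index to a local folder such as the hypothetical ./chroma_db below, so a previously processed PDF can be reloaded without re-embedding it:

# Variation on the step above: reuses texts, embeddings, and metadatas from Step 7
docsearch = Chroma.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
    persist_directory="./chroma_db",  # illustrative path, not from the tutorial
)

# On a later run, reload the persisted index instead of re-embedding the PDF
# (older Chroma versions may also require an explicit docsearch.persist() call)
docsearch = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)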

Step 8: Implementing the chat interface#

With our RAG pipeline set up, we can now create a chat interface where users can interact with the AI.

st.subheader("Chat with your PDF")
user_input = st.text_input("Ask a question about the document:")
if user_input:
if st.session_state.chain is None:
st.warning("Please upload a PDF file first.")
else:
with st.spinner("Thinking..."):
response = st.session_state.chain.invoke({"question": user_input})
answer = response["answer"]
source_documents = response["source_documents"]
st.session_state.chat_history.append(HumanMessage(content=user_input))
st.session_state.chat_history.append(AIMessage(content=answer))
  • Lines 1–2: We display a subheader for the chat section and create a text input for user questions.

  • Lines 4–6: We check if the user has entered a question, and if the conversation chain hasn’t been initialized (i.e., no PDF uploaded), then we display a warning.

  • Lines 7–11: If a chain exists, we use a spinner to indicate processing, then invoke the chain with the user’s question and extract the answer and source documents.

  • Lines 13–14: We append the user’s question and the AI’s answer to the chat history in the session state.

Step 9: Displaying chat history#

Finally, display the chat history and source documents for each AI response.

chat_container = st.container()
with chat_container:
    for message in reversed(st.session_state.chat_history):
        if isinstance(message, HumanMessage):
            st.markdown(f'👤 {message.content}')
        elif isinstance(message, AIMessage):
            st.markdown(f'🤖 {message.content}')

        if isinstance(message, AIMessage):
            with st.expander("View Sources"):
                for idx, doc in enumerate(source_documents):
                    st.write(f"Source {idx + 1}:", doc.page_content[:150] + "...")
  • Lines 1–2: We create a container for displaying the chat history.

  • Lines 3–7: We iterate through the chat history in reverse order, displaying human messages with a person emoji and AI messages with a robot emoji using Streamlit’s markdown function.

  • Lines 9–12: For AI messages, we create an expandable section to show source documents. Each source is displayed with a snippet of its content (the first 150 characters).

This code snippet handles the conversation history display in a chat-like format, distinguishing between user and AI messages and providing the option to view the sources used for AI responses.
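
One caveat: source_documents in the loop above is a local variable from the most recent query, so every expander shows the latest sources, and the name is undefined on a rerun that happens before a question is asked. A minimal way to tighten this, sketched below rather than taken from the original code, is to attach the sources to each AI message when it is stored in Step 8 and read them back here:

# Step 8 variation: keep the sources with the answer they belong to
st.session_state.chat_history.append(HumanMessage(content=user_input))
st.session_state.chat_history.append(
    AIMessage(content=answer, additional_kwargs={"sources": source_documents})
)

# Step 9 variation: read the sources attached to each AI message
if isinstance(message, AIMessage):
    with st.expander("View Sources"):
        for idx, doc in enumerate(message.additional_kwargs.get("sources", [])):
            st.write(f"Source {idx + 1}:", doc.page_content[:150] + "...")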

Running the chatbot#

To run your newly created chatbot, save all the code in a file (e.g., rag_chatbot.py) and execute it using the following command:

streamlit run rag_chatbot.py

This will launch a web interface where users can upload a PDF, preview it, and engage in a conversation with the AI about its contents.

Following these steps, we’ve created an advanced RAG-based chatbot that can process PDF documents and answer questions based on their content, all within a user-friendly Streamlit interface. This chatbot demonstrates the power of combining local language models with retrieval-augmented generation for document-based question answering.

Use cases for Ollama#

Here are several use cases for Ollama:

  • Local development: Test and prototype AI applications without relying on cloud services.

  • Privacy-focused applications: Run AI models locally to ensure data privacy.

  • Educational tools: Learn about and experiment with LLMs in a controlled environment.

  • Offline AI capabilities: Develop applications that can function without internet connectivity.

  • Custom assistants: Create specialized AI assistants for specific domains or tasks.

Limitations and considerations#

While Ollama is powerful, it’s important to note some limitations:

  • Hardware requirements: Running large models locally, such as Llama 3.1 405B, requires significant computational resources.

  • Model availability: Not all state-of-the-art models are available or optimized for Ollama; Google’s PaLM 2, for example, is not in Ollama’s library.

  • Continuous updates: Keeping open-source models up-to-date requires consistent effort. To ensure optimal performance, new versions and improvements must be integrated into the Ollama environment.

The future of Ollama#

As the field of AI continues to advance, tools like Ollama are likely to play an increasingly important role in democratizing access to powerful language models. Future developments may include support for more models, improved performance optimizations, and enhanced integration capabilities.

Conclusion#

Ollama represents a significant step forward in making LLMs accessible to developers and enthusiasts. By simplifying the process of running these models locally, it opens up new possibilities for AI application development, research, and education. As we’ve seen with our RAG-based chatbot example, Ollama can be easily integrated into practical applications, allowing for the creation of powerful, privacy-preserving AI tools. As the field continues to evolve, tools like Ollama will undoubtedly play a crucial role in shaping the future of AI development and deployment.

Next steps#

To further enhance your understanding of RAG and its applications, consider exploring additional resources and hands-on projects on retrieval-augmented generation, LangChain, and Ollama.

Frequently Asked Questions

How do we make a chatbot efficient?

To make a chatbot efficient, focus on technical aspects like selecting the right platform, optimizing NLP, utilizing machine learning, and implementing caching. Additionally, design efficiency is crucial, involving clear goals, limited scope, prioritized conversations, quick responses, relevant information, clear language, and human handoff options. Continuous improvement is essential, requiring monitoring, feedback gathering, and iteration. By combining these elements, we can develop a chatbot that effectively meets user needs and provides a positive experience.
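
In the Streamlit app built above, one concrete example of this is caching: heavyweight resources such as the embedding model can be created once and reused across reruns instead of being rebuilt on every interaction. Here is a minimal sketch using Streamlit’s cache_resource decorator (the helper function name is illustrative):

import streamlit as st
from langchain_community.embeddings import OllamaEmbeddings

@st.cache_resource  # created once per process and reused across reruns
def get_embeddings():
    return OllamaEmbeddings(model="nomic-embed-text")

embeddings = get_embeddings()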
