...

Deploying Our Chatbot to Hugging Face

Showcase your work to the world by deploying the chatbot to Hugging Face.

Gradio offers an easy way to share our application. By simply adding the parameter share and setting it to True in the launch() method, Gradio creates a temporary, publicly reachable URL for the application. Under the hood, this opens a tunnel to our locally running app, so we need to keep the application running for the URL to remain accessible.
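For example, assuming the Blocks app is stored in a variable named demo (as in the listing below), enabling the temporary public link is a one-line change:

demo.launch(share=True)  # Gradio prints a temporary public URL that works as long as the app keeps running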

The complete chatbot

Now that we have built our chatbot features, we can combine them into a single application. While merging the multimodal chatbot with the RAG-based chatbot, we had to make one code change: we import Groq from both the groq library and from LlamaIndex, and the two names would collide. To keep the code working properly, we aliased the import from the original groq library as GroqOrg and updated its usage on line 26 accordingly. Feel free to test the application before we begin the deployment process.
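For reference, the two imports in question, taken from the listing below, now read:

from groq import Groq as GroqOrg          # the original Groq API client, aliased to avoid the name clash
from llama_index.llms.groq import Groq    # LlamaIndex's Groq LLM wrapper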

Update the launch method to demo.launch(server_name="0.0.0.0", share=True) to access the application on a public URL.

import os
import PIL.Image
import gradio as gr
from groq import Groq as GroqOrg
import google.generativeai as genai
from llama_index.llms.groq import Groq
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex

# Configure the Gemini client used for image understanding
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-flash",
  generation_config=generation_config,
)

# Groq client (original library), used for Whisper transcription and chat completions
client = GroqOrg()

# LlamaIndex Groq LLM and local embedding model used by the RAG query engine
llm = Groq(model="llama-3.3-70b-versatile")
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = llm
Settings.embed_model = embed_model

class RAG():
    """Wraps a LlamaIndex query engine built from an uploaded document."""
    def __init__(self):
        self.query_engine = None
        self.input_file = None

    def process_file(self, input_file):
        # Load the uploaded PDF, build a vector index over it, and create a query engine
        self.input_file = input_file
        model_card = SimpleDirectoryReader(input_files=[input_file]).load_data()
        index = VectorStoreIndex.from_documents(model_card)
        query_engine = index.as_query_engine(similarity_top_k=3)
        self.query_engine = query_engine

    def process_rag_query(self, user_query, chatbot):
        # Answer with the RAG query engine when available; otherwise report the current status
        if self.query_engine:
            response = self.query_engine.query(user_query)
        elif self.input_file:
            response = "Query engine is not ready yet. Please try again."
        else:
            response = "No query engine found. Please upload a file."
        chatbot.append({'role': 'user', 'content': user_query}) 
        chatbot.append({'role': 'assistant', 'content': str(response)})
        return "", chatbot

with gr.Blocks() as demo:
    rag_instance = RAG()
    
    with gr.Tab("HTML Learning Assistant"):
        chat_context = [
            {
                "role": "system",
                "content": '''
                    You are a friendly and helpful educational chatbot named “HTML Guide.” Your purpose is to assist users in learning and understanding HyperText Markup Language (HTML). You excel at providing clear explanations, practical examples, and interactive exercises.

                    You are working alongside Gemini, a model specialized in understanding and describing images.

                    The user can input either text or an image.
                    If the user sends an image, Gemini will analyze it and provide a concise description. This description will be saved in the chat history, which you have access to.
                    You will receive the user's text input and the chat history, including any image descriptions generated by Gemini. Use this information to understand the context and formulate your response.
                    Your goals:

                    Engage in natural and coherent conversation with the user.
                    Utilize the image descriptions provided by Gemini to understand the visual context of the conversation.
                    Respond to user queries and requests accurately and comprehensively.
                    Remember:

                    Focus on the user's text input and the chat history.
                    Gemini does not receive any chat context.
                    Your role is to provide text-based responses while leveraging the visual insights from Gemini.'''
            }
        ]
        chatbot = gr.Chatbot(type="messages", height=320)
        with gr.Group():
            msg = gr.MultimodalTextbox(file_types=['image'], placeholder="Enter message or upload an image...", show_label=False) 
            audio_input = gr.Audio(sources=["microphone"], type="filepath")

        def process_audio(audio):
            # Transcribe the recorded audio file with Whisper via the Groq API
            with open(audio, "rb") as audio_file:
                transcription = client.audio.transcriptions.create(
                    file=(audio, audio_file.read()),
                    model="whisper-large-v3",
                    language="en",
                    response_format="verbose_json",
                )
            return {"text": transcription.text}

        def add_message(user_message, chat_history):
            # Record the user's text and/or image in the Groq context and the visible chat history
            file = False
            text = False

            if user_message["text"] and user_message["text"].strip():
                text = True
                chat_context.append({"role": "user", "content": user_message["text"]})
                
            if user_message["files"]:
                file = True
                chat_context.append({'role': 'user', 'content': 'The user sent an image'}) 
                chat_history.append({"role": "user", "content": {"path": user_message["files"][0]}})  
    
            # Only text
            if text and not file:    
                chat_history.append({"role": "user", "content": user_message["text"]})
            # Only image
            elif file and not text:
                chat_history.append({"role": "user", "content": "What do you see in this image?"})
                chat_history.append({"role": "assistant", "content": "Processing image ..."})
            # Both text and image
            else:
                chat_history.append({"role": "user", "content": user_message["text"]})
                chat_history.append({"role": "assistant", "content": "Processing image ..."})

            return gr.MultimodalTextbox(value=None, interactive=False), chat_history

        def respond(chat_history):
            # Image messages are handled by Gemini; plain text goes to the Groq chat model
            if chat_history[-1]["content"] == "Processing image ...":         
                img = PIL.Image.open(chat_history[-3]["content"][0])
                response = model.generate_content([chat_history[-2]["content"], img], stream=True)
                chat_history[-1]["content"] = ""
                for chunk in response:
                    chat_history[-1]["content"] += chunk.text
                    yield chat_history
                chat_context.append({"role": "assistant", "content": chat_history[-1]["content"]})
            else:
                completion = client.chat.completions.create(
                    model="llama-3.2-90b-text-preview",
                    messages=chat_context,
                    temperature=1,
                    max_tokens=1024,
                    top_p=1,
                    stream=True,
                    stop=None,
                )
        
                chat_history.append({"role": "assistant", "content": ""})
                for chunk in completion:
                    chat_history[-1]["content"] += chunk.choices[0].delta.content or ""
                    yield chat_history
                chat_context.append({"role": "assistant", "content": chat_history[-1]["content"]})
        
        msg.submit(add_message, [msg, chatbot], [msg, chatbot]).then(
            respond, chatbot, chatbot).then(
            lambda: gr.MultimodalTextbox(interactive=True), None, [msg])
        
        audio_input.stop_recording(process_audio, audio_input, msg).then(
            add_message, [msg, chatbot], [msg, chatbot]).then(
            respond, chatbot, chatbot).then(
            lambda: gr.MultimodalTextbox(interactive=True), None, [msg])
    
    with gr.Tab("Document Expert"):
        gr.Markdown("# Ask questions about your PDF")
        
        file_widget = gr.File(file_types=[".pdf"])
        file_widget.upload(rag_instance.process_file, file_widget)
        
        chatbot = gr.Chatbot(type="messages", height=280)
        user_query = gr.Textbox(placeholder="Type your query here", show_label=False, interactive=True)
        user_query.submit(rag_instance.process_rag_query, [user_query, chatbot], [user_query, chatbot])
        
demo.launch(server_name="0.0.0.0")

Usually, deploying an application is a cumbersome process, often involving packaging the code, purchasing cloud computing, ...