Evaluating Chatbots

Learn how to evaluate chatbots that are integrated with LLMs.

Introduction to LLM evaluation

Evaluating the performance of large language models is complex because of the wide range of metrics involved. These metrics aim to quantify different aspects of an LLM’s capabilities: general knowledge, logical reasoning, practical skills, and text quality. Understanding these metrics and their relevance to specific use cases is essential for choosing the right one when evaluating large language models.

Generally, we can divide the metrics into four major categories:

  • Knowledge assessment: This category evaluates the breadth and depth of an LLM’s knowledge across various domains and its ability to understand and apply that knowledge. It includes general knowledge benchmarks such as MMLU and TriviaQA, and reasoning benchmarks such as HellaSwag (commonsense reasoning) and GSM8K (grade-school math). One particularly useful Python library for LLM evaluation in this category is RAGAS (Retrieval-Augmented Generation Assessment). RAGAS scores model-generated answers against the retrieved context and a reference answer, using LLM-judged metrics such as faithfulness and answer relevancy, so it can be used to assess how correct and well grounded an LLM’s responses are (see the RAGAS sketch after this list).

  • Functional capabilities: This category measures the practical abilities of an LLM in specific functional areas, such as coding and problem-solving. Examples include coding benchmarks such as MBPP. One particularly useful toolkit for LLM evaluation in this category is HumanEval. OpenAI’s HumanEval benchmark ships with an evaluation harness that executes model-generated Python code against unit tests, making it well suited to evaluating an LLM’s ability to solve Python programming problems (see the HumanEval sketch after this list). Another useful tool for LLM evaluation is LangSmith from LangChain. LangSmith is a suite of tools designed to assist in ...
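The sketch below shows how a small RAGAS evaluation might be wired up. It assumes the classic ragas API (an evaluate function plus metric objects such as faithfulness and answer_relevancy) and a Hugging Face Dataset with question/answer/contexts/ground_truth columns; exact names and required columns vary between ragas versions, and the LLM-judged metrics need an API key for a judge model, so treat this as an illustrative sketch rather than a drop-in script. The example rows are made up for illustration.

```python
# Illustrative sketch of a RAGAS evaluation run (classic ragas API; names may
# differ between versions, and an LLM API key is required for the judge model).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# A tiny hand-made evaluation set: each row pairs a question with the model's
# answer, the retrieved context passages, and a reference ("ground truth") answer.
eval_data = {
    "question": ["Which benchmark targets grade-school math reasoning?"],
    "answer": ["GSM8K is a benchmark of grade-school math word problems."],
    "contexts": [[
        "GSM8K contains grade-school math word problems used to test multi-step reasoning."
    ]],
    "ground_truth": ["GSM8K"],
}
dataset = Dataset.from_dict(eval_data)

# Score the answers for faithfulness to the retrieved context and relevance
# to the question; the result maps each metric to an aggregate score.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
print(result)
```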
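For HumanEval, the loop typically looks like the following: load the benchmark problems, ask the model under test to complete each function prompt, write the completions to a JSONL file, and then run the harness’s functional-correctness check, which executes the generated code against unit tests. In this sketch, generate_one_completion is a placeholder for whatever model call you use, and samples.jsonl is just an example filename; the read_problems and write_jsonl helpers come from OpenAI’s human-eval package.

```python
# Illustrative sketch using OpenAI's human-eval harness. generate_one_completion
# is a placeholder for your own model call (an API request or a local model).
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call your LLM here and return the code that completes the prompt.
    raise NotImplementedError("Plug in your model call")

problems = read_problems()  # dict of task_id -> problem (prompt, tests, etc.)

# One completion per task; the harness supports multiple samples per task
# if you want to estimate pass@k for k > 1.
samples = [
    {"task_id": task_id, "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
]
write_jsonl("samples.jsonl", samples)

# Then, from the command line, score the completions by executing them against
# the unit tests (run inside a sandbox, since this executes model-written code):
#   evaluate_functional_correctness samples.jsonl
```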
