...
Retrieval Strategies: Web and Multimedia Document Loaders
Learn how to use LangChain online document loaders.
Document loaders
Document loaders, also called connectors, load documents from many different sources.
LangChain provides over 100 different document loaders as well as integrations with other major providers in the space, such as Arxiv, AWS, Azure, Dropbox, GitHub, Google, Snowflake, Stripe, Telegram, Twitter, Wikipedia, and YouTube. LangChain provides integrations to load all types of documents, such as html, pdf, csv, xlsx, docx, and code from all types of locations, such as private S3 buckets and public websites.
Document loaders are designed to handle both structured and unstructured documents. Structured documents, such as pandas DataFrames and Excel files, adhere to a specific format or schema, which makes them easy to search and places each value in a fixed field within a record or file. Such documents are typically stored in relational databases or spreadsheets, where the data model defines how the data is stored, processed, and accessed. Unstructured documents, such as Word and GitHub files, have no predefined data model, and they may be multimodal, combining text, images, and video. Handling structured documents is generally more direct, whereas unstructured documents require specific transformations before they can be processed effectively.
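To make the structured versus unstructured distinction concrete, here is a minimal sketch that uses only the Python standard library. It is not LangChain's implementation: the Document class, load_csv_text, and load_html_text are hypothetical names chosen for illustration. The CSV loader can rely on the header row as a schema, while the HTML loader first has to strip markup to recover usable text.

```python
import csv
import io
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_text(text: str, source: str) -> list[Document]:
    """Structured input: the header row is the schema; one Document per record."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Document(
            page_content="\n".join(f"{key}: {value}" for key, value in row.items()),
            metadata={"source": source, "row": index},
        )
        for index, row in enumerate(reader)
    ]


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def load_html_text(html: str, source: str) -> list[Document]:
    """Unstructured input: no schema, so markup is stripped to recover the text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return [Document(page_content="\n".join(parser.chunks), metadata={"source": source})]
```

In this sketch the structured path needs no cleanup because every value already sits in a named field, whereas the unstructured path needs a transformation step (tag and script removal) before the text is usable.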
Types of document loaders
The development of advanced chatbots requires integrating diverse sources of data to enhance functionality and ensure relevance. The LangChain framework offers a variety of document loaders designed to facilitate this integration by extracting and processing data from different types of documents and media. These loaders enable chatbots to access and utilize information from web pages, emails, encyclopedias, and even video transcripts, adapting their responses to meet specific user needs. Below is a table detailing the different types of document loaders available in LangChain:
Loader Type | Description | Primary Use Case |
HTML Loader | Extracts text and other elements from HTML documents using the UnstructuredHTMLLoader | Used for scraping specific information from web pages to provide up-to-date data such as stock prices |
Website Loader | Utilizes the WebBaseLoader to extract textual content from webpages | Ideal for content analysis, web scraping, and data mining from various websites |
Email Loader | Uses the UnstructuredEmailLoader to extract data from email files, including headers and attachments | Facilitates the analysis or automated processing of email content for generating responses |
Wikipedia Loader | Employs the WikipediaLoader to search for and retrieve content from Wikipedia | Allows chatbots to fact-check and reference information from Wikipedia for accurate user responses |
YouTube Transcripts Loader | Accesses and retrieves text from YouTube video transcripts via the YoutubeLoader | Useful for extracting audio content as text for accessibility, content analysis, and educational use |
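The table can be summarized as a small lookup from loader type to the class it names. The sketch below is a convenience wrapper, not part of LangChain: LOADER_CLASS_NAMES and make_loader are hypothetical helpers, and the import path langchain_community.document_loaders is an assumption that may differ between LangChain versions. Several of these loaders also depend on extra packages that must be installed separately.

```python
import importlib

# Loader type (as in the table above) -> class name.
# Assumption: these classes are importable from langchain_community.document_loaders.
LOADER_CLASS_NAMES = {
    "html": "UnstructuredHTMLLoader",
    "website": "WebBaseLoader",
    "email": "UnstructuredEmailLoader",
    "wikipedia": "WikipediaLoader",
    "youtube": "YoutubeLoader",
}


def make_loader(kind: str, source: str):
    """Build a loader for `source`, which is a file path, URL, or search query.

    The import happens lazily, so LangChain is only required once a loader
    is actually requested.
    """
    if kind not in LOADER_CLASS_NAMES:
        raise ValueError(f"unknown loader type: {kind!r}")
    module = importlib.import_module("langchain_community.document_loaders")
    loader_class = getattr(module, LOADER_CLASS_NAMES[kind])
    if kind == "youtube":
        # YoutubeLoader is built from a video URL through its class method.
        return loader_class.from_youtube_url(source)
    # The other four take the path, URL, or query as their first argument.
    return loader_class(source)
```

With LangChain and the relevant extras installed, a call such as make_loader("wikipedia", "LangChain").load() would be expected to return a list of documents, since each loader exposes a load() method.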
Let’s now go a step further into each of these LangChain web-based loaders: ...