Aquileo | LangChain Document Loaders

LangChain Document Loaders convert data from various formats such as CSV, PDF, HTML and JSON into standardized Document objects. These objects contain the raw content, metadata and optional identifiers, allowing LLMs to process and analyze the data efficiently.

Document loaders also let developers manage and standardize content across workflows, and they support a wide range of file types and sources, including YouTube, Wikipedia and GitHub.

Document Object in LangChain

Before exploring loaders, we must understand the Document object which stores the content and metadata.

page_content: Stores the textual content of the document.
metadata: Provides additional information about the document like source or category.
id: Optional string that uniquely identifies the document object.

Python

from langchain_core.documents import Document

data = Document(
    page_content='This is an article about LangChain Document Loaders',
    id='1',  # id is an optional string identifier
    metadata={'source': 'AV'}
)

print(data)

print(data.page_content)

# Fields can be reassigned after creation
data.id = '2'

This structure allows loaders to consistently format data regardless of its original source.

Types of Document Loaders

LangChain provides over 200 document loaders, categorized as follows:

By File Type: CSV, PDF, HTML, Markdown, MS Office documents, JSON.
By Data Source: YouTube, Wikipedia, GitHub.

Data sources can be public, requiring no authentication, or private, requiring credentials (for example, content stored in AWS or Azure).

1. CSV (Comma-Separated Values)

CSVLoader loads each row of a CSV file as a separate Document object.

file_path: Path to the CSV file.
metadata_columns: Columns to include in the metadata of each Document.
csv_args: Arguments for CSV parsing (e.g., delimiter).
source_column (optional): Column whose value is used as the source identifier instead of the file path.

Python

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./iris.csv",
    metadata_columns=['species'],
    csv_args={"delimiter": ","}
)
data = loader.load()

print(len(data))
print(data[0].metadata)
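
To make the row-to-Document mapping concrete, here is a minimal standard-library sketch of the idea. It is not CSVLoader's implementation: the function name rows_to_documents, the in-memory sample data and the plain dicts standing in for Document objects are all assumptions made for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, metadata_columns=(), source="in-memory.csv", delimiter=","):
    """Turn each CSV row into a dict shaped like a Document (illustrative stand-in)."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    documents = []
    for row_index, row in enumerate(reader):
        # Columns not listed in metadata_columns become "key: value" lines of content.
        content = "\n".join(
            f"{key}: {value}" for key, value in row.items() if key not in metadata_columns
        )
        # Metadata records where the row came from, plus the selected columns.
        metadata = {"source": source, "row": row_index}
        metadata.update({column: row[column] for column in metadata_columns})
        documents.append({"page_content": content, "metadata": metadata})
    return documents


# Hypothetical sample data, not the iris.csv file used above.
sample = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n4.9,3.0,setosa\n"
docs = rows_to_documents(sample, metadata_columns=["species"])

print(len(docs))
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

The point of the sketch is that one row yields one document, and that columns moved into metadata no longer appear in the content.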

2. HTML (HyperText Markup Language)

HTML pages can be loaded either from a saved file or directly from a URL.

urls: List of URLs to load.
mode='single': loads the entire HTML as one Document.
mode='elements': splits the page into multiple documents based on HTML tags.
Metadata includes file type, URL and parent element identifiers.

Python

from langchain_community.document_loaders import UnstructuredHTMLLoader, UnstructuredURLLoader

loader = UnstructuredURLLoader(urls=['https://www.geeksforgeeks.org'], mode='elements')
data_html = loader.load()

print(len(data_html))

print(data_html[0].page_content)
print(data_html[1].page_content)
print(data_html[2].page_content)

print(data_html[0].metadata)
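
The difference between mode='single' and mode='elements' can be illustrated with a small standard-library sketch. It is only an approximation of the idea: the ElementCollector class, the load_html function, the chosen set of block tags and the inline sample page are assumptions for illustration, and the real loaders delegate parsing to the unstructured library, which handles far more cases (nested blocks, for instance, are not handled here).

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect the text of a few block-level tags (illustrative only)."""

    BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}

    def __init__(self):
        super().__init__()
        self.elements = []        # list of (tag, text) pairs
        self._current_tag = None  # block tag currently being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            text = "".join(self._buffer).strip()
            if text:
                self.elements.append((tag, text))
            self._current_tag = None


def load_html(html, mode="single", source="inline"):
    """Return dicts shaped like Documents: one per element, or one for the page."""
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    if mode == "elements":
        return [
            {"page_content": text, "metadata": {"source": source, "tag": tag}}
            for tag, text in parser.elements
        ]
    full_text = "\n\n".join(text for _, text in parser.elements)
    return [{"page_content": full_text, "metadata": {"source": source}}]


# Hypothetical page used only for this sketch.
page = "<html><body><h1>Loaders</h1><p>Load <b>many</b> formats.</p><p>Then split.</p></body></html>"

docs_single = load_html(page, mode="single")
docs_elements = load_html(page, mode="elements")
print(len(docs_single), len(docs_elements))
```

Element mode keeps finer-grained metadata per chunk, which is useful when later steps need to know which part of the page a passage came from.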

3. Markdown

UnstructuredMarkdownLoader loads Markdown files, optionally splitting them by elements or pages.

file_path: Path to the Markdown file.
mode: 'single', 'elements' or 'paged' for splitting.
Metadata includes last modified date, file type and section category.

Python

from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader('/content/readme.md', mode='elements')
data_md = loader.load()

print(f"Loaded elements: {len(data_md)}")

print(data_md[1].page_content)
print(data_md[1].metadata)

4. JSON

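JSONLoader parses JSON files into Document objects using a jq query, so the jq Python package must be installed.

file_path: Path to the JSON file.
jq_schema: jq query that selects the content to load. '.' selects the entire file.
text_content: Whether the selected content is expected to be plain text. Set it to False when the query returns objects or arrays.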

Python

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path='chat.json', jq_schema='.', text_content=False)
data = loader.load()
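
To show what a jq_schema selects, the sketch below reproduces two queries with plain Python on an in-memory string. The structure of the data (a "messages" list with "sender" and "content" fields) is a hypothetical assumption, since the contents of chat.json are not shown here, and the to_text helper is only an approximation of how non-string selections end up as text when text_content=False.

```python
import json

# Hypothetical contents standing in for chat.json.
raw = """
{"channel": "support",
 "messages": [
   {"sender": "ana", "content": "Hi, my order is late."},
   {"sender": "bo", "content": "Sorry about that, let me check."}
 ]}
"""
data = json.loads(raw)


def to_text(value):
    # Strings pass through unchanged; objects and arrays are serialized to JSON text.
    return value if isinstance(value, str) else json.dumps(value)


# Equivalent of jq_schema='.': the whole file is a single selected value.
whole_file = [to_text(data)]

# Equivalent of jq_schema='.messages[].content': one selected value per message.
per_message = [to_text(message["content"]) for message in data["messages"]]

print(len(whole_file), len(per_message))
print(per_message[0])
```

A narrower query usually gives more useful documents than '.', because each document then holds one record rather than the whole file.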

5. MS Office Documents

MS Word documents can be loaded using Docx2txtLoader.

file_path: Path to the Word document.

Python

from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader(file_path='/content/GenAI.docx')
data_word = loader.load()

print(data_word[0].page_content[:150])

6. PDF (Portable Document Format)

LangChain offers several PDF loaders backed by different parsers. The example below uses UnstructuredPDFLoader.

file_path: Path to the PDF.
mode: 'single' or 'elements'.
strategy: 'hi_res', 'ocr_only', 'fast' or 'auto' for parsing method.

Python

from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('business_strategy.pdf', mode='elements', strategy="auto")
data = loader.load()

print(len(data))

7. Loading Multiple Files

We can load all files in a directory using DirectoryLoader.

glob: File pattern to match.
loader_cls: Loader class for matched files.
loader_kwargs: Arguments for the loader class.
show_progress: Show loading progress.
use_multithreading: Enable multithreaded loading.

Python

from langchain_community.document_loaders import DirectoryLoader, JSONLoader

loader = DirectoryLoader(
    ".",
    glob="**/*.json",
    loader_cls=JSONLoader,
    loader_kwargs={'jq_schema': '.', 'text_content': False},
    show_progress=True,
    use_multithreading=True
)
docs = loader.load()
print(f"Loaded documents from directory: {len(docs)}")
if docs:
    print(f"First loaded document content: {docs[0].page_content}")
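
The glob pattern decides which files are picked up. The standard-library sketch below shows how a recursive pattern such as "**/*.json" matches files in the top-level directory and in subdirectories while skipping other extensions. The load_directory function, the temporary files and the dict-based documents are assumptions for illustration, not DirectoryLoader's implementation.

```python
import json
import tempfile
from pathlib import Path


def load_directory(root, glob="**/*.json"):
    """Read every file matching the pattern into a dict shaped like a Document."""
    documents = []
    for path in sorted(Path(root).glob(glob)):
        if not path.is_file():
            continue
        documents.append(
            {
                "page_content": path.read_text(encoding="utf-8"),
                "metadata": {"source": str(path)},
            }
        )
    return documents


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    # Two JSON files at different depths and one file that should be ignored.
    (root / "a.json").write_text(json.dumps({"id": 1}), encoding="utf-8")
    (root / "nested" / "b.json").write_text(json.dumps({"id": 2}), encoding="utf-8")
    (root / "notes.txt").write_text("ignored", encoding="utf-8")

    docs = load_directory(root)
    names = [Path(doc["metadata"]["source"]).name for doc in docs]

print(len(docs))
print(names)
```

Recording the file path in metadata, as done here, makes it possible to trace each document back to its file after a bulk load.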

8. Wikipedia

WikipediaLoader fetches articles based on search queries.

query: Search term for Wikipedia articles.
load_max_docs: Maximum number of articles to load.
doc_content_chars_max: Maximum characters per article content.
load_all_available_meta: Include metadata like categories, references and image URLs.

Python

from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query='Generative AI', load_max_docs=5, doc_content_chars_max=5000, load_all_available_meta=True)
data = loader.load()

for doc in data:
    print(doc.metadata['title'])
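
The two limits, load_max_docs and doc_content_chars_max, can be pictured as a cap on the number of results followed by a cap on each article's length. The sketch below applies both to made-up in-memory results so it runs without network access. The apply_limits function and the fake_results data are hypothetical, and the loader's actual internals may differ.

```python
def apply_limits(articles, load_max_docs=5, doc_content_chars_max=5000):
    """Keep at most load_max_docs articles and cap the text length of each one."""
    limited = []
    for article in articles[:load_max_docs]:
        limited.append(
            {
                "page_content": article["content"][:doc_content_chars_max],
                "metadata": {"title": article["title"]},
            }
        )
    return limited


# Hypothetical search results standing in for fetched Wikipedia pages.
fake_results = [
    {"title": f"Article {i}", "content": "x" * (1000 * i)} for i in range(1, 8)
]

docs = apply_limits(fake_results, load_max_docs=5, doc_content_chars_max=3000)
print(len(docs))
print([len(doc["page_content"]) for doc in docs])
```

Lower limits keep requests fast and documents small, at the cost of possibly cutting an article off mid-section.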

With these loaders, we can load and process many types of documents in LangChain and then use them for tasks such as text analysis, question answering, summarization and building retrieval-based applications.
