LangChain Document Loaders

Last Updated : 6 Nov, 2025

LangChain Document Loaders convert data from various formats such as CSV, PDF, HTML and JSON into standardized Document objects. These objects contain the raw content, metadata and optional identifiers, allowing LLMs to process and analyze the data efficiently.

Document loaders also let developers manage and standardize content across multiple workflows, supporting a wide range of file types and sources including YouTube, Wikipedia and GitHub.

Document Object in LangChain

Before exploring loaders, we must understand the Document object, which stores the content and its metadata. It has three fields:

  • page_content: Stores the textual content of the document.
  • metadata: Provides additional information about the document like source or category.
  • id: Optional string that uniquely identifies the document object.
Python
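from langchain_core.documents import Document

# id is an optional string identifier, so pass it as a string
data = Document(
    page_content='This is an article about LangChain Document Loaders',
    id='1',
    metadata={'source': 'AV'}
)

print(data)               # full object: content, metadata and id

print(data.page_content)  # only the text

data.id = '2'             # fields can be reassigned after creation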

This structure allows loaders to consistently format data regardless of its original source.
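
To make this concrete, here is a minimal sketch that does not use LangChain itself. `SimpleDoc` is a hypothetical stand-in for the Document class, and `from_text_lines`, `from_records` and `total_characters` are helper names invented for this illustration. It only shows why a shared shape is convenient: downstream code reads `page_content` without caring where a record came from.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleDoc:
    """Hypothetical stand-in mirroring the three fields of Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None


def from_text_lines(lines, source):
    # One record per non-empty line of plain text.
    return [SimpleDoc(line.strip(), {"source": source}) for line in lines if line.strip()]


def from_records(records, source):
    # One record per dict; the "body" key becomes the content.
    return [
        SimpleDoc(r["body"], {"source": source, "author": r["author"]})
        for r in records
    ]


def total_characters(docs):
    # Only page_content is touched, so the origin of each record is irrelevant.
    return sum(len(d.page_content) for d in docs)


docs = from_text_lines(["first note", "", "second note"], source="notes.txt")
docs += from_records([{"body": "hello", "author": "sam"}], source="api")
print(len(docs), total_characters(docs))
```

Both helpers return the same kind of object, so `total_characters` works on the combined list without any source-specific branches.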

Types of Document Loaders

LangChain provides over 200 document loaders, categorized as follows:

  • By File Type: CSV, PDF, HTML, Markdown, MS Office documents, JSON.
  • By Data Source: YouTube, Wikipedia, GitHub.

Data sources can also be public, meaning no authentication is needed, or private, such as AWS or Azure, which require credentials.

1. CSV (Comma-Separated Values)

CSVLoader loads each row of a CSV file as a separate Document object.

  • file_path: Path to the CSV file.
  • metadata_columns: Columns to include in the metadata of each Document.
  • csv_args: Arguments for CSV parsing (e.g., delimiter).
  • source_column (optional): Can replace the file name as the source identifier.
Python
from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    file_path="./iris.csv",
    metadata_columns=['species'],
    csv_args={"delimiter": ","}
)
data = loader.load()

print(len(data))
print(data[0].metadata)
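
The row-to-Document mapping described above can be approximated with the standard library alone. The sketch below is not CSVLoader's implementation: `rows_to_docs` is a hypothetical function, plain dicts stand in for Document objects, and the two sample rows are made up. It assumes that columns listed in `metadata_columns` move into the metadata while the remaining columns become `name: value` lines of content.

```python
import csv
import io


def rows_to_docs(csv_text, source, metadata_columns=(), source_column=None, delimiter=","):
    docs = []
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for i, row in enumerate(reader):
        # Columns not reserved for metadata become "name: value" lines of content.
        content = "\n".join(
            f"{key}: {value}" for key, value in row.items() if key not in metadata_columns
        )
        # source_column, when given, replaces the file name as the source.
        metadata = {"source": row[source_column] if source_column else source, "row": i}
        metadata.update({col: row[col] for col in metadata_columns})
        docs.append({"page_content": content, "metadata": metadata})
    return docs


# Made-up sample in the spirit of iris.csv.
sample = "sepal_length,sepal_width,species\n5.1,3.5,setosa\n7.0,3.2,versicolor\n"
docs = rows_to_docs(sample, source="./iris.csv", metadata_columns=["species"])

print(len(docs))
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Keeping a label column such as `species` in the metadata keeps it out of the text that is later embedded or searched, while still making it available for filtering.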

2. HTML (HyperText Markup Language)

HTML pages can be loaded either from a saved file with UnstructuredHTMLLoader or directly from a URL with UnstructuredURLLoader. The URL loader is used here.

  • urls: List of URLs to load.
  • mode='single': loads the entire HTML as one Document.
  • mode='elements': splits the page into multiple documents based on HTML tags.
  • Metadata includes file type, URL and parent element identifiers.
Python
from langchain_community.document_loaders import UnstructuredHTMLLoader, UnstructuredURLLoader

loader = UnstructuredURLLoader(urls=['https://www.geeksforgeeks.org'], mode='elements')
data_html = loader.load()

print(len(data_html))

print(data_html[0].page_content)
print(data_html[1].page_content)
print(data_html[2].page_content)

print(data_html[0].metadata)
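
The difference between the two modes can be sketched with Python's built-in `html.parser`. This is a rough approximation, not how the Unstructured library partitions a page: `BlockTextParser`, `split_html` and the `BLOCK_TAGS` set are assumptions made for this example, and plain dicts stand in for Document objects.

```python
from html.parser import HTMLParser

# Assumption for this sketch: only these tags count as separate elements.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}


class BlockTextParser(HTMLParser):
    """Collects the text of each block-level tag as a (tag, text) pair."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._tag = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse whitespace so inline tags such as <b> leave no gaps.
            text = " ".join("".join(self._buf).split())
            if text:
                self.elements.append((tag, text))
            self._tag = None


def split_html(html, mode="single"):
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    if mode == "elements":
        # One record per block, tagged with the element it came from.
        return [{"page_content": text, "metadata": {"tag": tag}} for tag, text in parser.elements]
    # One record for the whole page, blocks separated by blank lines.
    joined = "\n\n".join(text for _, text in parser.elements)
    return [{"page_content": joined, "metadata": {}}]


page = (
    "<html><body><h1>Loaders</h1><p>Load <b>many</b> formats.</p>"
    "<ul><li>CSV</li><li>PDF</li></ul></body></html>"
)
print(len(split_html(page, mode="single")))
print(len(split_html(page, mode="elements")))
```

Element-level records are handy when headings or list items need different treatment, whereas a single record is simpler when the page is short enough to process as a whole.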

3. Markdown

UnstructuredMarkdownLoader loads Markdown files, optionally splitting them by elements or pages.

  • file_path: Path to the Markdown file.
  • mode: 'single', 'elements' or 'paged' for splitting.
  • Metadata includes last modified date, file type and section category.
Python
from langchain_community.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader('/content/readme.md', mode='elements')
data_md = loader.load()

print(f"Loaded elements: {len(data_md)}")

print(data_md[1].page_content)
print(data_md[1].metadata)
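
Once a file is split into elements, the category stored in the metadata can drive simple filtering. The sketch below is a simplified stand-in, not the loader's parsing logic: `markdown_elements` is a hypothetical function that treats every non-blank line as one element, the category names ("Title", "NarrativeText", "ListItem") are assumed to resemble those reported in elements mode, and the README text is made up.

```python
def markdown_elements(text):
    elements = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate elements
        if line.startswith("#"):
            category, content = "Title", line.lstrip("#").strip()
        elif line.startswith(("- ", "* ")):
            category, content = "ListItem", line[2:].strip()
        else:
            category, content = "NarrativeText", line
        elements.append({"page_content": content, "metadata": {"category": category}})
    return elements


readme = "# Project\n\nA short description.\n\n## Install\n\n- pip install project\n- run the tests\n"
elements = markdown_elements(readme)

# Build an outline by keeping only heading elements.
titles = [e["page_content"] for e in elements if e["metadata"]["category"] == "Title"]
print(len(elements))
print(titles)
```

The same filter pattern applies to the loader's real output: inspect `metadata` on a few loaded elements first to see which keys and category values your installed version produces.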

4. JSON

JSONLoader parses JSON files into Document objects.

  • file_path: Path to the JSON file.
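  • jq_schema: jq expression that selects the content to load. '.' selects the entire file.
  • text_content: Whether the selected content is expected to be a string. Set it to False when the expression returns objects or arrays, as '.' usually does.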
Python
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path='chat.json', jq_schema='.', text_content=False)
data = loader.load()
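
To show what a jq expression selects and why `text_content` matters, here is a small sketch that avoids the jq dependency. `select` is a hypothetical helper that understands only a tiny subset of jq ('.', '.key' and '[]'), `to_text` mirrors the string check as an assumption about the loader's behavior, and the chat data is invented.

```python
import json


def select(data, path):
    """Evaluate a tiny subset of jq: '.', '.key' and '[]' iteration."""
    results = [data]
    # Turn ".messages[].content" into ["messages", "[]", "content"].
    tokens = [t for t in path.replace("[]", ".[]").split(".") if t]
    for token in tokens:
        selected = []
        for item in results:
            if token == "[]":
                selected.extend(item)         # fan out over list elements
            else:
                selected.append(item[token])  # descend into an object key
        results = selected
    return results


def to_text(value, text_content=True):
    # With text_content=True, anything that is not already a string is rejected.
    if text_content and not isinstance(value, str):
        raise ValueError("selected value is not a string; pass text_content=False")
    return value if isinstance(value, str) else json.dumps(value)


chat = {"messages": [{"sender": "a", "content": "hi"}, {"sender": "b", "content": "hello"}]}

# A narrow expression yields one string per message.
print([to_text(v) for v in select(chat, ".messages[].content")])

# '.' yields the whole object, which needs text_content=False.
print(to_text(select(chat, ".")[0], text_content=False))
```

A narrow expression usually gives cleaner documents than '.', because each selected value becomes its own piece of content instead of one large serialized object.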

5. MS Office Documents

MS Word documents can be loaded using Docx2txtLoader.

  • file_path: Path to the Word document.
Python
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader(file_path='/content/GenAI.docx')
data_word = loader.load()

print(data_word[0].page_content[:150])

6. PDF (Portable Document Format)

PDF files can be loaded with several different parsers. The example here uses UnstructuredPDFLoader.

  • file_path: Path to the PDF.
  • mode: 'single' or 'elements'.
  • strategy: 'hi_res', 'ocr_only', 'fast' or 'auto' for parsing method.
Python
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('business_strategy.pdf', mode='elements', strategy="auto")
data = loader.load()

print(len(data))

7. Loading Multiple Files

We can load all files in a directory using DirectoryLoader.

  • glob: File pattern to match.
  • loader_cls: Loader class for matched files.
  • loader_kwargs: Arguments for the loader class.
  • show_progress: Show loading progress.
  • use_multithreading: Enable multithreaded loading.
Python
from langchain_community.document_loaders import DirectoryLoader, JSONLoader

loader = DirectoryLoader(
    ".",
    glob="**/*.json",
    loader_cls=JSONLoader,
    loader_kwargs={'jq_schema': '.', 'text_content': False},
    show_progress=True,
    use_multithreading=True
)
docs_dir = loader.load()
print(f"Loaded documents from directory: {len(docs_dir)}")
if len(docs_dir) > 0:
    print(f"First loaded document content: {docs_dir[0].page_content}")
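
The way these parameters fit together can be sketched with the standard library. This is an illustration under stated assumptions, not DirectoryLoader's code: `load_directory` and `PlainTextLoader` are hypothetical names, plain dicts stand in for Document objects, and the files are created in a temporary directory just for the demo.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class PlainTextLoader:
    """Hypothetical stand-in for a loader_cls: one file in, a list of records out."""

    def __init__(self, file_path, encoding="utf-8"):
        self.file_path = Path(file_path)
        self.encoding = encoding

    def load(self):
        text = self.file_path.read_text(encoding=self.encoding)
        return [{"page_content": text, "metadata": {"source": str(self.file_path)}}]


def load_directory(root, glob, loader_cls, loader_kwargs=None, use_multithreading=False):
    loader_kwargs = loader_kwargs or {}
    # glob decides which files are picked up; "**/" also descends into subfolders.
    paths = sorted(p for p in Path(root).glob(glob) if p.is_file())

    def load_one(path):
        # Each matched file gets its own loader instance with the shared kwargs.
        return loader_cls(str(path), **loader_kwargs).load()

    if use_multithreading:
        with ThreadPoolExecutor(max_workers=4) as pool:
            batches = list(pool.map(load_one, paths))  # map preserves input order
    else:
        batches = [load_one(p) for p in paths]
    return [doc for batch in batches for doc in batch]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(tmp) / "c.md").write_text("ignored", encoding="utf-8")

    docs = load_directory(
        tmp, "**/*.txt", PlainTextLoader, {"encoding": "utf-8"}, use_multithreading=True
    )
    contents = [d["page_content"] for d in docs]
    print(contents)
```

Threads help here because loading is mostly file I/O; the per-file loader stays unaware that it is being run in parallel.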

8. Wikipedia

WikipediaLoader fetches articles based on search queries.

  • query: Search term for Wikipedia articles.
  • load_max_docs: Maximum number of articles to load.
  • doc_content_chars_max: Maximum characters per article content.
  • load_all_available_meta: Include metadata like categories, references and image URLs.
Python
from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query='Generative AI', load_max_docs=5, doc_content_chars_max=5000, load_all_available_meta=True)
data = loader.load()

for doc in data:
    print(doc.metadata['title'])
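
The two limits act at different levels: `load_max_docs` caps how many articles are kept, while `doc_content_chars_max` truncates the text of each one. The sketch below shows that effect offline, assuming simple slicing. `cap_results` is a hypothetical function, the (title, content) pairs are placeholders rather than real Wikipedia text, and dicts stand in for Document objects.

```python
def cap_results(pages, load_max_docs, doc_content_chars_max):
    docs = []
    # Keep at most load_max_docs pages, in the order the search returned them.
    for title, content in pages[:load_max_docs]:
        docs.append({
            # Truncate each article independently of the others.
            "page_content": content[:doc_content_chars_max],
            "metadata": {"title": title},
        })
    return docs


# Placeholder search results, not real article text.
pages = [
    ("Article A", "a" * 120),
    ("Article B", "b" * 30),
    ("Article C", "c" * 500),
]

docs = cap_results(pages, load_max_docs=2, doc_content_chars_max=50)
for doc in docs:
    print(doc["metadata"]["title"], len(doc["page_content"]))
```

Short articles pass through unchanged, so the character limit is an upper bound per document and not a target length.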

With these loaders, we can bring many types of documents into LangChain and then use them for tasks like text analysis, question answering, summarization and building retrieval-based applications.
