Aquileo | Semantic Document Search

Semantic search lets a system match on the meaning of a user query rather than relying only on exact keyword matching. Using FAISS (Facebook AI Similarity Search) together with sentence embeddings, we can build a system that searches hundreds or even thousands of documents by meaning and not just by text overlap, which makes retrieval more context-aware.

Implementation

The pipeline works as follows:

Document Ingestion: PDF, DOCX and TXT files are read and converted into text.
Chunking: Each document is split into smaller, meaningful parts.
Embedding Generation: Text chunks are converted into dense vector representations.
FAISS Indexing: Embeddings are stored in FAISS for efficient similarity search.
Query Encoding: A user query is embedded into the same vector space.
Similarity Search: FAISS finds top matches based on cosine similarity.
Results Display: The most relevant document snippets are shown.
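To make the query-encoding and similarity-search steps concrete, here is a minimal sketch in plain Python. It does not use FAISS or an embedding model: the three-dimensional vectors and the helper names (l2_normalize, inner_product, top_k) are hypothetical, chosen only for illustration. It shows that once vectors are scaled to unit length, ranking by inner product is the same as ranking by cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, chunk_vecs, k=2):
    """Return (chunk_index, cosine_score) pairs, best match first."""
    q = l2_normalize(query_vec)
    scores = [(i, inner_product(q, l2_normalize(v)))
              for i, v in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings"; a real model produces hundreds of dimensions.
toy_chunks = [
    [1.0, 0.0, 0.0],  # same direction as the query
    [0.6, 0.8, 0.0],  # partly aligned with the query
    [0.0, 0.0, 1.0],  # orthogonal to the query
]
toy_query = [2.0, 0.0, 0.0]

ranked = top_k(toy_query, toy_chunks, k=2)
print(ranked)  # chunk 0 ranks first, chunk 1 second
```

The query has a different length than chunk 0 but points in the same direction, so after normalization it scores highest. This is the reason the tutorial normalizes embeddings before adding them to an inner-product index.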

Let's implement this pipeline step by step.

The sample documents used in this walkthrough can be downloaded from here.

Step 1: Install Dependencies

We need to install the required dependencies: faiss-cpu, sentence-transformers, python-docx and PyPDF2 (the PDF reader imported in Step 2).

Python

!pip install faiss-cpu sentence-transformers python-docx PyPDF2

Step 2: Import Libraries

We import the necessary libraries: os, numpy, faiss, SentenceTransformer, PdfReader from PyPDF2 and Document from python-docx.

Python

import os
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from PyPDF2 import PdfReader
from docx import Document

Step 3: Extract Text from Documents

We define a function for document loading that:

Reads different file formats: .pdf, .docx and .txt.
Ensures content is extracted as plain text for embedding generation.

Python

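def extract_text_from_file(file_path):
    """Return the plain text of a .pdf, .docx or .txt file ("" for other types)."""
    text = ""
    lower_path = file_path.lower()  # match extensions case-insensitively
    if lower_path.endswith(".pdf"):
        reader = PdfReader(file_path)
        for page in reader.pages:
            # Guard against pages that yield no text (e.g. scanned images)
            text += (page.extract_text() or "") + "\n"
    elif lower_path.endswith(".docx"):
        doc = Document(file_path)
        for para in doc.paragraphs:
            text += para.text + "\n"
    elif lower_path.endswith(".txt"):
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
    return text.strip()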

Step 4: Split Text into Chunks

Divides long documents into smaller segments (chunks).
Improves search accuracy and performance by focusing on smaller text units.

Python

def chunk_text(text, chunk_size=300):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
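Fixed-size chunks can cut a sentence in half at a chunk boundary, so a passage that straddles two chunks may match poorly in both. One common variation, not part of the original code, is to let consecutive chunks share a few words. The sketch below assumes a hypothetical helper name (chunk_text_with_overlap) and an overlap parameter chosen for illustration.

```python
def chunk_text_with_overlap(text, chunk_size=300, overlap=50):
    """Split text into word chunks where each chunk repeats the last
    `overlap` words of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text,
        # so no trailing chunk made only of repeated words is produced.
        if start + chunk_size >= len(words):
            break
    return chunks


# Small demonstration with ten placeholder words.
sample = ' '.join(f"w{i}" for i in range(10))
demo_chunks = chunk_text_with_overlap(sample, chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

With overlap=0 this behaves like chunk_text above. A larger overlap produces more chunks, and therefore more embeddings to store and search, so it trades index size for boundary coverage.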

Step 5: Load and Process Documents

Here:

Reads all files in the documents/ folder.
Splits them into manageable text chunks and stores the source file name for each.

Python

folder_path = "documents/"
documents = []    # text chunks from all files
doc_sources = []  # source file name for each chunk, aligned with documents
files_read = 0    # count only the files that were actually processed

for file in sorted(os.listdir(folder_path)):
    if file.lower().endswith((".pdf", ".docx", ".txt")):
        path = os.path.join(folder_path, file)
        print(f"Reading file: {file}")
        content = extract_text_from_file(path)
        chunks = chunk_text(content)
        documents.extend(chunks)
        doc_sources.extend([file] * len(chunks))
        files_read += 1

print(f"\nLoaded {len(documents)} text chunks from {files_read} files.")

Step 6: Generate Text Embeddings

Here we will:

Convert text chunks into vector representations using SentenceTransformer.
Normalize the vectors so that inner product in FAISS equals cosine similarity.
Show embedding progress with a progress bar.

Python

model = SentenceTransformer('all-MiniLM-L6-v2')
print("\nGenerating embeddings... (this may take a minute)")

embeddings = model.encode(
    documents, convert_to_numpy=True, show_progress_bar=True)
embeddings = embeddings.astype('float32')
faiss.normalize_L2(embeddings)
print(f"Embeddings shape: {embeddings.shape}")

Step 7: Create FAISS Index

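Initializes a FAISS IndexFlatIP index, which scores by inner product. Because the embeddings were L2-normalized in Step 6, the inner product equals cosine similarity.
Adds all text embeddings to the index. IndexFlatIP compares the query against every stored vector, so results are exact rather than approximate.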

Python

dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)
print(f"FAISS index created with {index.ntotal} vectors.")

Output:

FAISS index created with 3 vectors.

Step 8: Define Cleaning and Search Functions

1. clean_text(): Removes unwanted formatting and extra spaces.

2. semantic_search_best():

Converts the query into a vector.
Searches the FAISS index for similar embeddings.
Displays the best matches with readable snippets.

Python

import re
import textwrap


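def clean_text(text):
    """Strip markdown-style formatting and collapse whitespace."""
    # Unwrap **bold** first; stripping '*' beforehand would leave nothing to match
    text = re.sub(r'\*\*(.*?)\*\*', r'\1', text)
    # Drop markup symbols and rule-like runs, but keep single '-' and '_' inside words
    text = re.sub(r'[#=*`~]+|[_-]{2,}', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text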


def semantic_search_best(query, top_k=1, wrap_width=100, similarity_threshold=0.35, snippet_length=300):

    query_embedding = model.encode([query]).astype('float32')
    faiss.normalize_L2(query_embedding)

    D, I = index.search(query_embedding, top_k)

    print("\nTop Semantic Search Result(s):")
    print("=" * 120)

    results_shown = 0

    for rank, idx in enumerate(I[0]):
        score = D[0][rank]
        if score < similarity_threshold:
            continue

        snippet = clean_text(documents[idx])[:snippet_length]
        wrapped_snippet = textwrap.fill(snippet, width=wrap_width)

        print(f"\nRank {rank + 1}")
        print(f"Source File     : {doc_sources[idx]}")
        print(f"Similarity Score: {score:.4f}")
        print("-" * 120)
        print(f"Preview Snippet:\n{wrapped_snippet}")
        print("=" * 120)
        results_shown += 1

    if results_shown == 0:
        print("No strong semantic matches found for your query.")
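semantic_search_best() prints its results, which is convenient in a notebook but hard to reuse elsewhere. As a hedged sketch, the filtering logic could be separated into a function that returns structured results. The name collect_results and the dictionary keys below are hypothetical and not part of the original code. The function takes the score and id rows for one query (for example D[0] and I[0] from index.search) and also skips ids of -1, which FAISS uses to pad results when fewer than top_k vectors are available.

```python
def collect_results(scores, ids, documents, doc_sources, similarity_threshold=0.35):
    """Turn the score/id rows for one query into a list of result dicts."""
    results = []
    for score, idx in zip(scores, ids):
        # Skip padding entries (id -1) and matches below the threshold.
        if idx < 0 or score < similarity_threshold:
            continue
        results.append({
            "source": doc_sources[idx],
            "score": float(score),
            "text": documents[idx],
        })
    return results


# Toy stand-ins for the tutorial's documents, doc_sources, D[0] and I[0].
toy_documents = ["Databases store structured data.", "AI is used in healthcare."]
toy_sources = ["a.txt", "b.txt"]
toy_scores = [0.82, 0.20, -1.0]
toy_ids = [1, 0, -1]

hits = collect_results(toy_scores, toy_ids, toy_documents, toy_sources)
for hit in hits:
    print(hit["source"], round(hit["score"], 2), hit["text"])
```

In the tutorial's setting the call would look like collect_results(D[0], I[0], documents, doc_sources), and the printing code could then loop over the returned list.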

Step 9: Run Semantic Search

Retrieves top semantically relevant chunks for each query.
Displays source document name, similarity score and wrapped text preview.

Python

semantic_search_best("applications of artificial intelligence")

Python

semantic_search_best("database systems and AI", top_k=3)

The source code can be downloaded from here.

Advantages

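High Performance: FAISS is built for fast vector similarity search, so retrieval stays quick across thousands of vectors.
Semantic Understanding: Embeddings capture meaning beyond exact keywords.
Multi-format Support: Works with PDF, DOCX and TXT files.
Scalable: Can be extended to larger datasets, for example by switching from the exact IndexFlatIP to one of FAISS's approximate index types.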
