In a DBMS, indexing speeds up data retrieval by organizing how data is accessed. Two key index types in information retrieval are the Forward Index and the Inverted Index. This article explains how each is structured, how each is built, and what role each plays in efficient retrieval.
Forward Index
A Forward Index (also known as a document index) maps each document to the list of terms (words or tokens) it contains.
Structure:
- Key: Document ID
- Value: List of terms (optionally with frequency, position, etc.)
Example:
Doc1 → [“apple”, “banana”, “fruit”]
Doc2 → [“banana”, “smoothie”, “milk”]
Uses:
- Initial stage of indexing
- Document-level analysis
- Source data for building the inverted index
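A minimal sketch of the structure described above, assuming plain whitespace tokenization and using the two example documents. The names `documents`, `build_forward_index`, and `docs_containing` are illustrative and not part of any library.

```python
from collections import Counter

# Toy corpus matching the example above; the document IDs are illustrative.
documents = {
    "Doc1": "apple banana fruit",
    "Doc2": "banana smoothie milk",
}


def build_forward_index(docs):
    """Map each document ID to its ordered list of tokens."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def docs_containing(term, index):
    """Term lookup on a forward index has to scan every document."""
    return [doc_id for doc_id, tokens in index.items() if term in tokens]


forward_index = build_forward_index(documents)
print(forward_index["Doc1"])  # ['apple', 'banana', 'fruit']

# Optional per-document metadata: how often each term occurs in the document.
term_counts = {doc_id: Counter(tokens) for doc_id, tokens in forward_index.items()}
```

Note that `docs_containing` touches every document to answer a single term query, which is the cost the inverted index is designed to avoid.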
Inverted Index
An Inverted Index (also called an inverted file) maps each term to the list of documents (and optionally positions) where that term appears. The per-term list is called a postings list.
Structure:
- Key: Term (word)
- Value: List of Document IDs (postings)
Example:
“banana” → [Doc1, Doc2]
“milk” → [Doc2]
Enhanced Form:
Postings can also store positions, term frequency (TF), etc. For example, with 1-based token positions:
“banana” → [(Doc1, pos=2), (Doc2, pos=1)]
Uses:
- Core of search engines (e.g., Google)
- Fast keyword lookups
- Efficient document retrieval
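To illustrate why keyword lookups are fast, here is a small sketch of a positional inverted index for the two example documents. The index is written by hand, positions are 1-based as in the example above, and the helper names `lookup` and `search_all` are hypothetical.

```python
# Hand-written positional inverted index: term -> {document ID: [positions]}.
inverted_index = {
    "apple": {"Doc1": [1]},
    "banana": {"Doc1": [2], "Doc2": [1]},
    "fruit": {"Doc1": [3]},
    "smoothie": {"Doc2": [2]},
    "milk": {"Doc2": [3]},
}


def lookup(term):
    """Return the sorted document IDs in the term's postings list."""
    return sorted(inverted_index.get(term, {}))


def search_all(terms):
    """AND query: keep only documents that appear in every term's postings."""
    if not terms:
        return []
    result = set(lookup(terms[0]))
    for term in terms[1:]:
        result &= set(lookup(term))
    return sorted(result)


print(lookup("banana"))                # ['Doc1', 'Doc2']
print(search_all(["banana", "milk"]))  # ['Doc2']
```

A single-term query is one dictionary access, and a multi-term query only intersects the postings of the query terms rather than scanning the whole collection.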
How They're Built
Forward Index Construction:
- Tokenize and preprocess documents (stop-word removal, stemming)
- Store the document-to-term mapping.
Inverted Index Construction:
- Read the forward index
- Invert the mappings so that each term points to the documents containing it (term → documents)
- Add extra metadata such as term frequency and position
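The construction steps above can be sketched end to end as follows. This is an illustration under stated assumptions: the stop-word list is a tiny made-up sample, stemming is omitted, and the function names (`tokenize`, `build_forward_index`, `invert`) are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "and", "the", "with", "of"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_forward_index(docs):
    """Step 1: store the document -> terms mapping."""
    return {doc_id: tokenize(text) for doc_id, text in docs.items()}


def invert(forward_index):
    """Step 2: flip to term -> {document: [positions]}.

    Positions are 1-based and counted after stop-word removal.
    """
    inverted = defaultdict(dict)
    for doc_id, tokens in forward_index.items():
        for pos, term in enumerate(tokens, start=1):
            inverted[term].setdefault(doc_id, []).append(pos)
    return dict(inverted)


docs = {
    "Doc1": "An apple and a banana fruit",
    "Doc2": "Banana smoothie with milk",
}
forward = build_forward_index(docs)
inverted = invert(forward)

print(inverted["banana"])  # {'Doc1': [2], 'Doc2': [1]}
# Term frequency falls out of the positions list.
print(len(inverted["banana"]["Doc1"]))  # 1
```

Storing positions makes term frequency free (it is the length of the positions list), at the cost of larger postings.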
Forward vs Inverted Index: Key Differences
| Feature | Forward Index | Inverted Index |
|---|---|---|
| Primary Key | Document ID | Term / Keyword |
| Purpose | Stores document contents | Enables fast term-based lookup |
| Search Efficiency | Inefficient for term-to-document queries | Highly efficient for keyword searches |
| Construction Stage | Built first (used to create inverted index) | Built from forward index |
| Space Efficiency | Typically less compact | Typically more compact, since sorted postings lists compress well |
| Application | Document processing, updates | Searching, ranking, retrieval |
Key Concepts Associated
- Tokenization: Breaking documents into words or tokens
- Stemming/Lemmatization: Reducing words to their base form
- Stop Word Removal: Removing common but non-informative words
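- Term Frequency (TF): How often a term appears in a document
- Document Frequency (DF): The number of documents in which a term appears
- TF-IDF: A weight that combines TF with inverse document frequency, so terms that are frequent in a document but rare across the collection score highest; used to rank documents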
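As a worked example of TF, DF, and TF-IDF, the sketch below uses one common variant, raw term count multiplied by log(N / DF), where N is the number of documents. Other weightings exist, and the function names are illustrative.

```python
import math

# Forward index for the two example documents.
forward_index = {
    "Doc1": ["apple", "banana", "fruit"],
    "Doc2": ["banana", "smoothie", "milk"],
}


def term_frequency(term, doc_id):
    """TF: number of times the term occurs in the document."""
    return forward_index[doc_id].count(term)


def document_frequency(term):
    """DF: number of documents that contain the term."""
    return sum(1 for tokens in forward_index.values() if term in tokens)


def tf_idf(term, doc_id):
    """TF * log(N / DF); returns 0.0 for terms that appear nowhere."""
    df = document_frequency(term)
    if df == 0:
        return 0.0
    idf = math.log(len(forward_index) / df)
    return term_frequency(term, doc_id) * idf


print(tf_idf("banana", "Doc1"))  # 0.0, since "banana" is in every document
print(round(tf_idf("milk", "Doc2"), 3))  # 0.693
```

Because "banana" appears in both documents it carries no discriminating weight, while "milk" appears only in Doc2 and therefore scores higher there.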
Real-Life Applications
- Web Search Engines (Google, Bing)
- Document Management Systems
- E-commerce product search
- Legal and academic research engines
- Spam detection and classification