Aquileo | Tokenization Using Spacy

Tokenization is a preprocessing step in NLP where text is divided into smaller units called tokens such as words, punctuation marks or special characters. This makes text easier for machines to process and analyze.

Figure: Tokenization in natural language processing

Let's see an example:

Sentence: “I love natural language processing!”
After tokenization: ["I", "love", "natural", "language", "processing", "!"]
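
To see why a dedicated tokenizer helps, the short sketch below compares Python's built-in str.split(), which splits only on whitespace, with a blank SpaCy pipeline on the same sentence. The variable names are illustrative and not part of any library.

```python
import spacy

sentence = "I love natural language processing!"

# Whitespace splitting leaves the "!" attached to the last word
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# A blank English pipeline contains only the rule-based tokenizer
nlp = spacy.blank("en")
spacy_tokens = [token.text for token in nlp(sentence)]
print(spacy_tokens)
```

The SpaCy result separates the exclamation mark into its own token, matching the list shown above.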

Features of SpaCy Tokenizer

Performs fast and efficient tokenization on large text datasets
Treats punctuation marks as separate tokens for accurate text processing
Supports language-specific tokenization rules and grammar patterns
Handles spaces and newlines, and typically keeps URLs and email addresses as single tokens (hashtags are split at the # by default unless a custom rule is added)
Allows customization through user-defined tokenization rules
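
Two of these features can be checked with a short sketch, assuming a blank English pipeline. The sample sentences are made up for illustration; like_url is a token attribute, and add_special_case is the tokenizer method for registering a user-defined rule.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Punctuation becomes its own token, while the URL stays in one piece
doc = nlp("Visit https://spacy.io today!")
tokens = [token.text for token in doc]
url_tokens = [token.text for token in doc if token.like_url]
print(tokens)
print(url_tokens)

# Before adding a rule, "gimme" is a single token
before = [token.text for token in nlp("gimme that")]

# User-defined special case: always split "gimme" into "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [token.text for token in nlp("gimme that")]
print(before, "->", after)
```

A special case must cover the exact string it replaces, so the pieces "gim" and "me" join back into "gimme".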

Implementation

Here, we’ll see how to implement tokenization using SpaCy.

1. Blank Model Tokenization

In this method, SpaCy’s blank model spacy.blank("en") is used to perform basic tokenization without pre-trained NLP components such as POS tagging or named entity recognition.

Initializes a minimal NLP pipeline
Performs basic text tokenization
Does not include pre-trained language processing features
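
These points can be verified with a minimal sketch: a blank pipeline reports no components, yet each token still carries lexical attributes such as its character offset and whether it is punctuation. The sample text is an assumption chosen for illustration.

```python
import spacy

nlp = spacy.blank("en")

# A blank pipeline has a tokenizer but no trained components
print(nlp.pipe_names)

doc = nlp("Hello, world!")

# Lexical attributes need no trained model:
# text, character offset, alphabetic flag, punctuation flag
rows = [(t.text, t.idx, t.is_alpha, t.is_punct) for t in doc]
for row in rows:
    print(row)
```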

Python

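import spacy

# Create a blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")

# Adjacent string literals are joined, keeping the space after "stop"
doc = nlp("GeeksforGeeks is a one stop "
          "learning destination for geeks")

# Print each token on its own line
for token in doc:
    print(token)

Output:

GeeksforGeeks
is
a
one
stop
learning
destination
for
geeks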

2. Displaying the Pipeline Components

In this step, the pre-trained en_core_web_sm model is loaded to access SpaCy’s NLP components. The pipeline components can then be displayed to view the available processing modules.

Loads the pre-trained SpaCy model
Displays available NLP pipeline components
Includes components for tasks like tagging and parsing

Python

import spacy

# Load the pre-trained pipeline and print its component names
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)

Output:

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
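
pipe_names reflects whatever components the pipeline currently holds. As a sketch that needs no model download, the example below starts from a blank pipeline and adds SpaCy's built-in rule-based sentencizer component by name (this assumes SpaCy v3). It is a different setup from the en_core_web_sm pipeline above and is used here only for illustration.

```python
import spacy

nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)

# Add the built-in rule-based sentence splitter by its registered name
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)
print(names_before, "->", names_after)

# has_pipe checks for a component by name
print(nlp.has_pipe("sentencizer"), nlp.has_pipe("ner"))

# The new component sets sentence boundaries on the Doc
doc = nlp("This is a sentence. This is another one.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```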

3. Tokenization with Part-of-Speech Tagging and Lemmatization

In this step, text is processed to perform tokenization, identify part-of-speech tags and extract the lemmatized form of each token.

Splits text into tokens
Assigns grammatical POS tags
Extracts root or dictionary forms of words

Python

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("If you want to be an excellent programmer, "
          "be consistent to practice daily on GFG.")

for token in doc:
    print(token, " | ",
          spacy.explain(token.pos_),
          " | ", token.lemma_)

Output:

Figure: Example with POS tagging and lemmatization
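
For further processing it is often handier to collect the annotations than to print them. The helper below, token_table, is a hypothetical function written for this sketch. It assumes that spacy.load raises OSError when the model package is not installed and falls back to a blank pipeline in that case, where the POS and lemma fields are not filled in.

```python
import spacy

def token_table(text, model="en_core_web_sm"):
    """Return (text, pos, lemma) tuples for each token in `text`."""
    try:
        nlp = spacy.load(model)
    except OSError:
        # Model package not installed: tokenizer-only fallback,
        # so no POS tags or lemmas are assigned
        nlp = spacy.blank("en")
    doc = nlp(text)
    return [(token.text, token.pos_, token.lemma_) for token in doc]

rows = token_table("Geeks are practicing daily.")
for text, pos, lemma in rows:
    print(f"{text:<12} | {pos:<6} | {lemma}")
```

The token texts are the same in both cases because the tokenizer runs before any trained component.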


Advantages

Provides fast and efficient tokenization for large text datasets
Offers accurate rule-based tokenization whose output feeds directly into pre-trained NLP models
Simple and user-friendly API for easy implementation
Supports integration with tasks like lemmatization, parsing and NER
Useful for building custom NLP pipelines and applications
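
For larger collections, SpaCy's nlp.pipe method processes texts as a stream in batches rather than one call per text. A minimal sketch, assuming a blank English pipeline and a small made-up list of texts; the batch size is an arbitrary illustrative value.

```python
import spacy

nlp = spacy.blank("en")

texts = [
    "Tokenization splits text into tokens.",
    "SpaCy handles punctuation, too!",
    "Batches keep processing efficient.",
]

# nlp.pipe yields Doc objects lazily, processing the texts in batches
token_counts = [len(doc) for doc in nlp.pipe(texts, batch_size=2)]
print(token_counts)
```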

Limitations

Larger pre-trained models can require a lot of memory, which may not suit memory-constrained environments
Domain-specific text may need custom tokenization rules or additional fine-tuning
Accuracy can vary for languages with limited linguistic resources
