Aquileo | Normalizing Textual Data with Python

Text normalization converts textual data into a clean, consistent format before it is processed in a Natural Language Processing (NLP) pipeline. It improves text quality and makes downstream analysis more accurate and efficient. It involves several preprocessing steps:

1. Text String

Take the input text string

Python

# input string 
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)

Output:

" Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

2. Case Conversion

Case conversion converts all text into lowercase format using the lower() method in Python.

Converts uppercase letters to lowercase
Improves consistency in text data
Helps standardize similar words like “Python” and “python”
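
For text that may contain non-English characters, `str.casefold()` is a more aggressive alternative to `lower()`. The short sketch below compares the two; the sample words are illustrative assumptions, not part of the article's running example.

```python
# Compare lower() with casefold() on a few sample words (illustrative inputs).
samples = ["Python", "WINDOWS", "Straße"]

for word in samples:
    print(word, "->", word.lower(), "|", word.casefold())

# lower() keeps the German sharp s, while casefold() expands it to "ss",
# so "Straße" and "STRASSE" compare equal only after casefold().
print("Straße".lower() == "STRASSE".lower())        # False
print("Straße".casefold() == "STRASSE".casefold())  # True
```

For plain English text the two methods give the same result, so `lower()` is sufficient for the example that follows.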

Python

# input string
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()
print(lower_string)

Output:

" python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows)."

3. Removing Numbers

Removing numbers is a text normalization step used when numerical values are not important for analysis. Regular expressions (Regex) are commonly used to detect and remove numbers from text.

Removes unnecessary numerical values from text
Helps simplify text preprocessing
Commonly performed using regular expressions (Regex)
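
A common pattern for this is `\d+`, which matches runs of digits. Because it matches digits only, a dotted number such as "3.0" leaves a stray "." behind. The sketch below contrasts it with a variant that also consumes the dotted parts; the second pattern and the sample sentence are assumptions for illustration.

```python
import re

text = "python 3.0, released in 2008, only python 3.6.x and 64-bit windows"

# Digits only: "3.0" becomes ".", and "3.6.x" becomes "..x".
digits_only = re.sub(r'\d+', '', text)

# Assumed alternative: also consume ".<digits>" groups so that
# dotted numbers such as "3.0" are removed as one unit.
dotted_numbers = re.sub(r'\d+(?:\.\d+)*', '', text)

print(digits_only)     # python ., released in , only python ..x and -bit windows
print(dotted_numbers)  # python , released in , only python .x and -bit windows
```

Either way, leftover dots and hyphens are cleaned up by the punctuation step, so the simpler pattern is usually enough when that step follows.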

Python

# import regex
import re

# input string 
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)

Output:

" python ., released in , was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows (and old installers not restricted to -bit windows)."

4. Removing Punctuation

Removing punctuation helps clean text by eliminating unnecessary symbols. Regular expressions (Regex) are commonly used to replace punctuation marks with an empty string.

Removes punctuation symbols from text
Simplifies text preprocessing and analysis
Commonly performed using regular expressions (Regex)

Python

# import regex
import re

# input string 
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove every character that is not a word character or whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string) 
print(no_punc_string)

Output:

' python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
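
One detail of the pattern `[^\w\s]` is that the underscore counts as a word character, so it is not removed. The sketch below compares the regex with `str.translate` and `string.punctuation`, a standard-library alternative that covers ASCII punctuation only; the sample text is an assumption for illustration.

```python
import re
import string

text = "python's end-of-life, e.g. snake_case (64-bit)!"

# Regex approach: drop anything that is not a word character or whitespace.
# The underscore is a word character, so it survives.
regex_clean = re.sub(r'[^\w\s]', '', text)

# Alternative: delete every character listed in string.punctuation
# (ASCII only), which includes the underscore.
table = str.maketrans('', '', string.punctuation)
translate_clean = text.translate(table)

print(regex_clean)      # pythons endoflife eg snake_case 64bit
print(translate_clean)  # pythons endoflife eg snakecase 64bit
```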

5. Removing White Space

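Removing white space cleans text by eliminating unnecessary spaces from the beginning and end of a string. In Python, the strip() string method is used for this purpose. strip() leaves whitespace inside the string untouched, so any repeated spaces left in the middle of the text (for example where numbers were removed) remain until the text is split into words.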

Removes leading and trailing spaces
Helps clean and standardize text
Improves text preprocessing consistency

Python

# import regex
import re

# input string 
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove every character that is not a word character or whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)

# remove leading and trailing white space
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)

Output:

'python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
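
If interior runs of spaces, tabs, or newlines should also be reduced to a single space, `split()` and `join()` can do this in one step. The following is a minimal sketch on an assumed sample string, not part of the article's pipeline.

```python
import re

text = "   python  released in  was a major\trevision \n"

# strip() only trims the two ends; interior runs of whitespace are untouched.
stripped = text.strip()

# split() with no arguments splits on any run of whitespace, so joining the
# pieces with one space both trims the ends and collapses interior runs.
collapsed = " ".join(text.split())

# Equivalent regex form: replace each whitespace run with one space, then trim.
collapsed_re = re.sub(r'\s+', ' ', text).strip()

print(repr(stripped))   # 'python  released in  was a major\trevision'
print(repr(collapsed))  # 'python released in was a major revision'
```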

6. Removing Stop Words

Stop words are common words such as “the”, “is”, “a”, and “on” that usually do not carry significant meaning in text analysis. These words are commonly removed using the NLTK library during text preprocessing.

Removes commonly used unnecessary words
Helps focus on meaningful words in text
Improves efficiency of NLP tasks
Commonly performed using the NLTK library

Python

# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

# assign string
no_wspace_string='python  released in  was a major revision of the language that is not completely backward compatible and much python  code does not run unmodified on python  with python s endoflife only python x and later are supported with older versions still supporting eg windows  and old installers not restricted to bit windows'

# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and rejoin the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)
print(no_stpwords_string)

Output:
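
To see the filtering step in isolation without downloading the NLTK corpus, the sketch below uses a small hand-picked stop-word set. Both `DEMO_STOP_WORDS` and `remove_stop_words` are hypothetical names introduced for illustration; NLTK's English list is considerably longer.

```python
# Small hand-picked stop-word set, assumed for illustration only;
# it is not NLTK's list.
DEMO_STOP_WORDS = {"in", "was", "a", "of", "the", "that", "is", "not", "and"}

def remove_stop_words(text, stop_words):
    """Return text with every whitespace-separated word found in stop_words dropped."""
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)

sample = "python  released in  was a major revision of the language"
print(remove_stop_words(sample, DEMO_STOP_WORDS))
# python released major revision language
```

Keeping the stop words in a set makes each membership test fast, which is why the NLTK list is wrapped in `set(...)` above.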

In this way, we can normalize textual data using Python. Below is the complete Python program:

Python

# import regex
import re

# download stopwords
import nltk
nltk.download('stopwords')

# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


# input string 
string = "       Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."

# convert to lower case
lower_string = string.lower()

# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)

# remove every character that is not a word character or whitespace
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)

# remove leading and trailing white space
no_wspace_string = no_punc_string.strip()

# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)

# remove stopwords and rejoin the remaining words with single spaces
no_stpwords_string = ' '.join(word for word in lst_string if word not in stop_words)

# output
print(no_stpwords_string)

Output:
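
For reuse, the same steps can be wrapped in a single function. This is a minimal sketch, not a definitive implementation: `normalize_text` and `demo_stop_words` are hypothetical names, and the stop-word set is passed in as a parameter so the example runs without an NLTK download.

```python
import re

def normalize_text(text, stop_words=frozenset()):
    """Lowercase, drop digits and punctuation, trim, and filter stop words."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)      # remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    words = text.split()                 # split() also discards surrounding whitespace
    return " ".join(word for word in words if word not in stop_words)

# Tiny illustrative stop-word set (an assumption, not NLTK's list).
demo_stop_words = {"in", "was", "a", "of", "the"}
print(normalize_text("   Python 3.0, released in 2008, was a major revision of the language.", demo_stop_words))
# python released major revision language
```

In practice, `set(stopwords.words('english'))` from NLTK can be passed as the `stop_words` argument instead of the demo set.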
