One Hot Encoding in Machine Learning

Last Updated : 29 May, 2026

One-Hot Encoding is a data preprocessing technique used to convert categorical data into a numerical format that machine learning models can understand. It creates separate binary columns for each category, where 1 represents the presence of a category and 0 represents its absence.

  • Converts categorical values into binary columns
  • Prevents models from assuming an incorrect order between categories
  • Can improve performance for models that would otherwise read integer codes as magnitudes, such as linear and distance-based models
  • Lets a model learn a separate effect for each category
  • Required for many machine learning algorithms that accept numerical input only

Working of One-Hot Encoding

One-Hot Encoding creates a separate column for each category in the dataset. In the fruit example below, when the fruit is apple, the Fruit_apple column gets the value 1 while the other fruit columns contain 0. Similarly, for mango and orange, their respective columns contain 1 and the remaining columns contain 0.

  • Each category gets its own binary column
  • 1 indicates the presence of a category
  • 0 indicates the absence of a category
  • Converts categorical values into numerical format for machine learning models

Fruit     Categorical value of fruit    Price
apple     1                             5
mango     2                             10
apple     1                             15
orange    3                             20

The output after applying one-hot encoding to the data is as follows:

Fruit_apple    Fruit_mango    Fruit_orange    Price
1              0              0               5
0              1              0               10
1              0              0               15
0              0              1               20
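
The fruit table above can be reproduced with a few lines of pandas. This is a minimal sketch rather than the only way to do it; the variable names fruit_df and fruit_encoded are chosen here for illustration, and dtype=int is passed so the result shows 0 and 1 instead of the boolean values that newer pandas versions return by default.

```python
import pandas as pd

# The fruit table from the example above (without the integer-code column)
fruit_df = pd.DataFrame({
    "Fruit": ["apple", "mango", "apple", "orange"],
    "Price": [5, 10, 15, 20],
})

# One binary column per fruit; dtype=int gives 0/1 rather than True/False
fruit_encoded = pd.get_dummies(fruit_df, columns=["Fruit"], dtype=int)

print(fruit_encoded)
```

Note that get_dummies() places the untouched Price column first and the new Fruit_ columns after it, so the column order differs from the table above while the values are the same.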

Implementation

One-Hot Encoding can be implemented in Python using libraries such as Pandas and Scikit-learn, which provide simple and efficient methods for converting categorical data into binary columns.

1. Using Pandas

Pandas provides the get_dummies() function to perform one-hot encoding on categorical columns.

  • Converts categorical values into binary columns
  • Easy and efficient for preprocessing datasets
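  • drop_first=True removes the first category's column, which can be inferred from the remaining ones; this avoids multicollinearity in linear models
  • Example: Gender with values M and F becomes Gender_F and Gender_M columns; with drop_first=True only Gender_M is kept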
Python
import pandas as pd

data = {
    'Employee_ID': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

df = pd.DataFrame(data)

print("Original Data:")
print(df)

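# Encode both categorical columns; drop_first=True drops the first
# category of each column (Gender_F and Remarks_Good) as redundant
encoded_df = pd.get_dummies(
    df,
    columns=['Gender', 'Remarks'],
    drop_first=True
)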

print("\nOne-Hot Encoded Data:")
print(encoded_df)

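Output:

The encoded DataFrame keeps Employee_ID and adds the columns Gender_M, Remarks_Great and Remarks_Nice. The first category of each column (Gender_F and Remarks_Good) is absent because drop_first=True was set.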
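
To see what drop_first changes, the sketch below encodes only the Gender column twice, once with and once without the option. The names gender, full and reduced are illustrative, and dtype=int is added here so the columns hold 0 and 1.

```python
import pandas as pd

gender = pd.DataFrame({"Gender": ["M", "F", "F", "M", "F"]})

# Default: one column per category
full = pd.get_dummies(gender, columns=["Gender"], dtype=int)

# drop_first=True: the first category (F, in sorted order) is dropped
reduced = pd.get_dummies(gender, columns=["Gender"], drop_first=True, dtype=int)

print(list(full.columns))     # ['Gender_F', 'Gender_M']
print(list(reduced.columns))  # ['Gender_M']

# The dropped column carries no extra information: Gender_F is always 1 - Gender_M
print((full["Gender_F"] == 1 - reduced["Gender_M"]).all())  # True
```

Because the dropped column is fully determined by the others, removing it loses nothing for models that fit an intercept, which is why it is commonly done for linear models.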

2. Using Scikit-learn

Scikit-learn (sklearn) provides the OneHotEncoder class to convert categorical variables into binary columns for machine learning models.

  • Converts categorical data into binary format
  • Learns the categories of each column when it is fitted
  • Can be fitted once on training data and reused to transform new data
  • The pandas select_dtypes() method picks out the text (object) columns to encode
Python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {
    'Employee_ID': [10, 20, 15, 25, 30],
    'Gender': ['M', 'F', 'F', 'M', 'F'],
    'Remarks': ['Good', 'Nice', 'Good', 'Great', 'Nice']
}

df = pd.DataFrame(data)

print("Original Data:")
print(df)

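# Select the text columns to encode
categorical_columns = df.select_dtypes(include=['object']).columns

# sparse_output=False returns a dense NumPy array
# (this parameter requires scikit-learn 1.2 or later)
encoder = OneHotEncoder(sparse_output=False)

encoded_data = encoder.fit_transform(df[categorical_columns])

# Wrap the array in a DataFrame, reusing df's index so rows align in the concat below
encoded_df = pd.DataFrame(
    encoded_data,
    columns=encoder.get_feature_names_out(categorical_columns),
    index=df.index
)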

final_df = pd.concat(
    [df.drop(columns=categorical_columns), encoded_df],
    axis=1
)

print("\nOne-Hot Encoded Data:")
print(final_df)

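Output:

The result keeps Employee_ID and adds one column per category: Gender_F, Gender_M, Remarks_Good, Remarks_Great and Remarks_Nice, each holding 0.0 or 1.0.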


Advantages

  • Converts categorical data into numerical format for machine learning models
  • Gives each category its own feature, so a model can weight categories independently
  • Prevents incorrect ordinal relationships between categories
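
The point about ordinal relationships can be checked numerically. The sketch below, which assumes the integer codes from the fruit table (apple=1, mango=2, orange=3) and uses NumPy only for the distance calculation, compares how far apart the fruits appear under each representation.

```python
import numpy as np

# Integer codes as in the fruit table
codes = {"apple": 1, "mango": 2, "orange": 3}

# One-hot vectors for the same three categories
one_hot = {
    "apple": np.array([1, 0, 0]),
    "mango": np.array([0, 1, 0]),
    "orange": np.array([0, 0, 1]),
}

# With integer codes, orange looks twice as far from apple as mango does
code_gap_am = abs(codes["apple"] - codes["mango"])
code_gap_ao = abs(codes["apple"] - codes["orange"])
print(code_gap_am, code_gap_ao)  # 1 2

# With one-hot vectors, every pair of distinct fruits is equally far apart
dist_am = np.linalg.norm(one_hot["apple"] - one_hot["mango"])
dist_ao = np.linalg.norm(one_hot["apple"] - one_hot["orange"])
print(np.isclose(dist_am, dist_ao))  # True
```

A distance-based or linear model given the integer codes would treat that 1-versus-2 gap as meaningful, even though the numbering is arbitrary.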

Limitations

  • Increases the number of columns in the dataset
  • Can create sparse datasets with many 0 values
  • May lead to overfitting when categories are too many and data is limited