Machine Learning Model Evaluation

Last Updated : 13 May, 2026

Model evaluation is the process of assessing how well a machine learning model performs on unseen data using different metrics and techniques. It checks whether the model has merely memorised the training data or can also generalise to new situations. By applying various techniques, we can identify whether a model has truly learned patterns or not.

1. Cross-Validation

Cross-validation tests the model on multiple subsets of the data. This gives a more reliable estimate of its generalisation ability than a single split and makes overfitting easier to detect.

(a) Holdout Method

In the Holdout method the dataset is split into train and test sets (commonly 7:3 or 8:2). Let's implement it where:

  • load_iris() loads the Iris dataset (flower measurements with 3 species).
  • train_test_split() divides data into training and testing sets.
  • test_size=0.20: 20% for testing, 80% for training.
  • random_state=42 makes results reproducible.
Python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))

Output:

Training set size: 120
Testing set size: 30
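
The split above only partitions the data. A minimal sketch of completing the holdout evaluation is shown below. It assumes a DecisionTreeClassifier (the model used later in this article), and random_state=0 on the model is an illustrative choice that is not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Assumption: any classifier could be used here; random_state=0 is illustrative
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)  # learn only from the training portion

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on the held-out portion

print("Train accuracy:", train_acc)
print("Holdout accuracy:", test_acc)
```

A large gap between the two numbers would suggest that the model has memorised the training set rather than generalised.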

(b) K-Fold Cross-Validation

In K-Fold Cross-Validation the dataset is divided into k folds. Each fold is used once as a test set and the model is trained on the remaining k-1 folds. Let's implement it where:

  • DecisionTreeClassifier(): A decision tree model is created.
  • KFold(n_splits=5): Data is divided into 5 folds.
  • cross_val_score(): Runs training/testing across folds.
  • scores: Accuracy for each fold.
  • scores.mean(): Average accuracy across all folds.
Python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=kfold)

print("Cross-validation scores:", scores)
print("Average CV Score:", scores.mean())

Output:

Cross-validation scores: [1. 1. 0.93333333 0.93333333 0.93333333]
Average CV Score: 0.9600000000000002
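
To make the fold-by-fold procedure concrete, the sketch below writes the loop out by hand and compares it with cross_val_score(). It fixes random_state=0 on the model (an assumption added here so that both runs are comparable) and uses clone() from sklearn.base to get a fresh model for each fold.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumption: random_state is fixed on the model so both runs are comparable
model = DecisionTreeClassifier(random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    fold_model = clone(model)                   # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # train on the other k-1 folds
    # score() returns accuracy on the single held-out fold
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(model, X, y, cv=kfold)

print("Manual scores: ", manual_scores)
print("Library scores:", library_scores)
```

Both arrays should agree, since cross_val_score() performs the same split, fit and score steps internally.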

2. Evaluation Metrics for Classification Tasks

Classification models assign inputs to predefined labels. Their performance can be measured using accuracy, precision, recall, F1 score, confusion matrix and AUC-ROC. We’ll demonstrate these metrics using a Decision Tree Classifier on the Iris dataset.

Step 1: Importing Libraries, Loading Dataset, Splitting Dataset

Python
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    precision_score, recall_score, f1_score, accuracy_score,
    confusion_matrix, ConfusionMatrixDisplay, roc_auc_score
)
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=20
)

Step 2: Training Model

  • DecisionTreeClassifier(): Creates a decision tree model.
  • .fit(X_train, y_train): Trains the model on training data.
  • .predict(X_test): Generates predictions on test data.
Python
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

Step 3: Accuracy

We will calculate the accuracy:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

  • TP: True Positives
  • TN: True Negatives
  • FP: False Positives
  • FN: False Negatives

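accuracy_score() computes the proportion of correct predictions. For a multi-class problem such as Iris, this is the number of correctly classified samples divided by the total number of samples.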

Python
print("Accuracy:", accuracy_score(y_test, y_pred))

Output:

Accuracy: 0.9333333333333333
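
The same calculation can be sketched by hand. The labels below are made up for illustration and are not taken from the Iris run above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels, made up for illustration
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 2, 1, 0])

# Correct predictions divided by total predictions (5 of 6 match here)
manual_accuracy = np.mean(y_true == y_pred)

print("Manual accuracy:", manual_accuracy)
print("accuracy_score: ", accuracy_score(y_true, y_pred))
```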

Step 4: Precision and Recall

Precision: Precision measures how many predicted positives are actually positive.

\text{Precision} = \frac{TP}{TP + FP}

  • Focuses on the correctness of positive predictions.
  • High precision: few false positives.

Recall: Recall measures how many actual positives are correctly predicted.

\text{Recall} = \frac{TP}{TP + FN}

  • Focuses on capturing all positives.
  • High recall: few false negatives.
Python
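# average="weighted": per-class scores are averaged, weighted by each class's number of true samples
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall:", recall_score(y_test, y_pred, average="weighted"))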

Output:

Precision: 0.9435897435897436
Recall: 0.9333333333333333
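
With three classes, precision and recall are computed per class and then averaged. The sketch below derives both from a confusion matrix and reproduces the weighted average. The labels are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, made up for illustration
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
tp = np.diag(cm)                       # correct predictions per class
precision = tp / cm.sum(axis=0)        # TP / (TP + FP): divide by column totals
recall = tp / cm.sum(axis=1)           # TP / (TP + FN): divide by row totals
support = cm.sum(axis=1)               # number of actual samples per class

weighted_precision = np.average(precision, weights=support)
weighted_recall = np.average(recall, weights=support)

print("Per-class precision:", precision)
print("Per-class recall:   ", recall)
print("Weighted precision: ", weighted_precision)
print("Weighted recall:    ", weighted_recall)
```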

For more details regarding Precision and Recall please refer to: Precision and Recall in Machine Learning

Step 5: F1 Score

We will calculate the F1 score, which is the harmonic mean of precision and recall and balances both metrics.

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

  • Combines precision and recall into one metric.
  • Useful when we need a balance between false positives and false negatives.
Python
print("F1 score:", f1_score(y_test, y_pred, average="weighted"))

Output:

F1 score: 0.9327777777777778
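
Note that with average="weighted", the F1 score is the support-weighted mean of the per-class F1 values, not the harmonic mean of the weighted precision and weighted recall. This is why the reported F1 can fall below both of them. The sketch below illustrates this with hypothetical labels.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels, made up for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 1]

p = precision_score(y_true, y_pred, average=None)  # one value per class
r = recall_score(y_true, y_pred, average=None)
f1_per_class = 2 * p * r / (p + r)                 # harmonic mean, class by class
support = np.bincount(y_true)                      # actual samples per class

weighted_f1 = np.average(f1_per_class, weights=support)
weighted_p = np.average(p, weights=support)
weighted_r = np.average(r, weights=support)
# Harmonic mean of the already-averaged scores, computed only for comparison
harmonic_of_averages = 2 * weighted_p * weighted_r / (weighted_p + weighted_r)

print("Weighted F1 (per-class, then averaged):", weighted_f1)
print("Harmonic mean of weighted P and R:     ", harmonic_of_averages)
print("f1_score(average='weighted'):          ", f1_score(y_true, y_pred, average="weighted"))
```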

Step 6: Confusion Matrix

We will create a confusion matrix:

  • confusion_matrix(): Creates matrix of actual vs predicted values.
  • Each cell shows correct/misclassified predictions.
Python
cm = confusion_matrix(y_test, y_pred)
cm_display = ConfusionMatrixDisplay(
    confusion_matrix=cm, display_labels=[0, 1, 2])
cm_display.plot()
plt.show()

Output:

[Figure: Confusion Matrix]
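
The matrix can also be read as plain numbers. The following sketch uses hypothetical labels (not the Iris predictions above) and prints, for each actual class, how many samples landed on the diagonal and how many did not.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, made up for illustration
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 2, 0]

labels = [0, 1, 2]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Row i holds the samples whose actual class is i; column j is the predicted class
for actual, row in zip(labels, cm):
    correct = row[actual]          # diagonal entry
    wrong = row.sum() - correct    # everything else in the row
    print(f"class {actual}: {correct} correct, {wrong} misclassified, row = {row.tolist()}")
```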

Step 7: AUC-ROC Curve

AUC-ROC measures the area under the ROC curve, which plots TPR against FPR across classification thresholds and indicates the model’s ability to distinguish classes.

\text{TPR} = \frac{TP}{TP + FN}

\text{FPR} = \frac{FP}{FP + TN}

  • TPR: True Positive Rate
  • FPR: False Positive Rate
  • AUC: Area under curve; higher is better (0.5 random, 1 perfect).
Python
y_pred_probs = tree.predict_proba(X_test)
auc = roc_auc_score(y_test, y_pred_probs, multi_class="ovr", average="weighted")
print("AUC:", auc)

Output:

AUC: 0.9473684210526316
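
For a multi-class target, multi_class="ovr" treats each class in turn as "positive" against the rest. The sketch below reproduces the weighted one-vs-rest AUC by hand. The max_depth=2 and random_state=0 settings are assumptions made here so that the tree is deterministic and yields graded probabilities; they are not part of the example above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=20
)

# Assumption: max_depth and random_state are illustrative choices
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
probs = tree.predict_proba(X_test)  # one probability column per class

per_class_auc, support = [], []
for col, k in enumerate(tree.classes_):
    is_k = (y_test == k).astype(int)  # class k versus the rest
    per_class_auc.append(roc_auc_score(is_k, probs[:, col]))
    support.append(is_k.sum())        # number of actual samples of class k

manual_auc = np.average(per_class_auc, weights=support)
library_auc = roc_auc_score(y_test, probs, multi_class="ovr", average="weighted")

print("Per-class AUC:", per_class_auc)
print("Manual weighted OvR AUC:", manual_auc)
print("roc_auc_score:          ", library_auc)
```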

3. Evaluation Metrics for Regression Tasks

Regression predicts continuous values (e.g., temperature). We use error-based metrics to measure accuracy.

We will use the weather dataset which can be downloaded from here.

Step 1: Importing Data and Training Model

  • Dataset contains Temperature (independent variable) and Relative Humidity (dependent variable).
  • Data split into training (80%) and testing (20%).
  • LinearRegression().fit() trains the regression model.
  • Predictions are stored in Y_pred.
Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

df = pd.read_csv('weather.csv')

X, Y = df.iloc[:, 2].values.reshape(-1, 1), df.iloc[:, 3].values

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0
)

regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)

Step 2: Mean Absolute Error (MAE)

Mean absolute error is the average absolute difference between actual and predicted values.

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

  • y_i: Actual value
  • \hat{y}_i: Predicted value
  • n: Number of samples
Python
mae = mean_absolute_error(Y_test, Y_pred)
print("MAE:", mae)

Output:

MAE: 2.2349999999999977
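
As a small worked example of the formula, the sketch below computes MAE by hand on four made-up values (not the weather data) and checks it against mean_absolute_error().

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values, made up for illustration
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([52.0, 57.0, 70.0, 85.0])

abs_errors = np.abs(y_true - y_pred)  # [2, 3, 0, 5]
manual_mae = abs_errors.mean()        # (2 + 3 + 0 + 5) / 4 = 2.5

print("Manual MAE: ", manual_mae)
print("sklearn MAE:", mean_absolute_error(y_true, y_pred))
```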

Step 3: Mean Squared Error (MSE)

We will calculate the mean squared error, which is the average squared difference between predicted and actual values.

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

  • Penalizes larger errors more than MAE.
  • Commonly used for regression model loss functions.
Python
mse = mean_squared_error(Y_test, Y_pred)
print("MSE:", mse)

Output:

MSE: 6.470558999999991
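
To see how squaring penalizes large errors, the sketch below compares two hypothetical sets of predictions with the same MAE: one spreads the error evenly, the other concentrates it in a single sample.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical values, made up for illustration
y_true = np.array([50.0, 60.0, 70.0, 80.0])
pred_spread = np.array([52.0, 58.0, 72.0, 78.0])   # every prediction is off by 2
pred_outlier = np.array([50.0, 60.0, 70.0, 72.0])  # one prediction is off by 8

mae_spread = mean_absolute_error(y_true, pred_spread)
mae_outlier = mean_absolute_error(y_true, pred_outlier)
mse_spread = mean_squared_error(y_true, pred_spread)    # mean of [4, 4, 4, 4]
mse_outlier = mean_squared_error(y_true, pred_outlier)  # mean of [0, 0, 0, 64]

print("MAE spread / outlier:", mae_spread, mae_outlier)
print("MSE spread / outlier:", mse_spread, mse_outlier)
```

Both cases have the same MAE, yet the MSE of the single large error is four times higher.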

Step 4: Root Mean Squared Error (RMSE)

We will calculate RMSE, which is the square root of MSE.

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

  • Brings error back to the same units as the target variable.
  • Easier to interpret than MSE.
  • Still penalizes larger errors more than MAE.
Python
rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
print("RMSE:", rmse)

Output:

RMSE: 2.54372934881052
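
A short sketch of the relationship between the three metrics, using made-up values rather than the weather data: RMSE is in the same units as the target, and it is never smaller than MAE on the same predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical values, made up for illustration
y_true = np.array([50.0, 60.0, 70.0, 80.0])
y_pred = np.array([52.0, 57.0, 70.0, 85.0])

mse = mean_squared_error(y_true, y_pred)                 # (4 + 9 + 0 + 25) / 4 = 9.5
rmse = np.sqrt(mse)                                      # back in the target's units
manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # formula written out
mae = mean_absolute_error(y_true, y_pred)

print("MSE: ", mse)
print("RMSE:", rmse)
print("MAE: ", mae)
```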

Step 5: Mean Absolute Percentage Error (MAPE)

MAPE expresses the prediction error as a percentage of the actual value.

\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|

  • Tells us how “off” predictions are in percentage terms.
  • Useful for business metrics (e.g., sales forecast error).
  • Sensitive when actual values are very small (can blow up).
Python
# scikit-learn returns MAPE as a fraction (not multiplied by 100), so 0.039 means about 3.9%
mape = mean_absolute_percentage_error(Y_test, Y_pred)
print("MAPE:", mape)

Output:

MAPE: 0.03925255003024494
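
The sketch below uses hypothetical values to show two practical points: converting scikit-learn's fractional result to the percentage form used in the formula, and how one very small actual value can dominate the metric.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical values, made up for illustration
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

mape_fraction = mean_absolute_percentage_error(y_true, y_pred)  # mean of [0.1, 0.1, 0.0]
mape_percent = 100 * mape_fraction                              # matches the 100/n formula

# Same setup, but the first actual value is tiny while its error is 10 units
y_true_small = np.array([0.1, 200.0, 400.0])
y_pred_small = np.array([10.1, 180.0, 400.0])
mape_small = mean_absolute_percentage_error(y_true_small, y_pred_small)

print("MAPE (fraction):", mape_fraction)
print("MAPE (percent): ", mape_percent)
print("MAPE with a tiny actual value (fraction):", mape_small)
```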
