
Student Performance Prediction: Binary Classification with Logistic Regression

Predicting student pass/fail outcomes using behavioral and demographic features to enable early intervention strategies



Project Overview

This project demonstrates my approach to end-to-end data science methodology by developing a binary classification model to predict whether students will pass (Grades A-C) or fail (Grades D-F) based on 12 behavioral and demographic features.

Key Metrics

Dataset


Problem Statement

Challenge

Educational institutions often adopt reactive rather than proactive approaches to student support, with interventions occurring only after poor grades have already been recorded. This results in:

Solution

A logistic regression classifier that:

  1. Identifies at-risk students before their assessments
  2. Quantifies feature importance to guide targeted interventions (e.g. whether absences or study time matter more)
  3. Provides probability scores for passing (0-1)
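As a sketch of how the probability outputs in point 3 could map to risk tiers, the function below uses hypothetical cut-offs (0.33 and 0.66) that are not specified in this project:

```python
def risk_tier(p_pass):
    """Map a predicted probability of passing to an intervention tier."""
    # Hypothetical thresholds for illustration only
    if p_pass < 0.33:
        return "high risk"
    if p_pass < 0.66:
        return "medium risk"
    return "low risk"

# In the notebook this would consume model_scaled.predict_proba(X)[:, 1]
print(risk_tier(0.12))   # a student with a 12% chance of passing
```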

Impact


Technical Architecture

Why Python?

Python is an industry standard tool with many packages to deal with data science and machine learning. Here are some of the main ones I used in this project:

Model Choice: Logistic Regression

Why not other algorithms?

  * Linear Regression - Requires a continuous target variable; this project uses a binary pass/fail target
  * Random Forest - A black-box model; stakeholders (educators/administrators in this case) require interpretable coefficients to justify targeted interventions
  * Neural Networks - Overkill for tabular data with 12 features, with overfitting risk on only 2,392 samples

Why Logistic Regression?

  1. Interpretable coefficients: Each feature’s impact on failure risk is quantifiable (“1 additional absence increases failure odds by x%”)
  2. Probability outputs: Returns probabilities, not just binary predictions, enabling risk tiers such as low, medium, and high
  3. Statistical rigor: Compatible with hypothesis testing (p-values and confidence intervals via statsmodels)
  4. Computational efficiency: Trains in seconds, making it lightweight enough to embed in dashboards
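To illustrate point 1, a coefficient on a standardised feature converts to an odds ratio via exp(β); the coefficient values below are taken from the feature-importance table later in this README:

```python
import numpy as np

# Coefficients from the fitted model (on standardised features)
coefs = {"Absences": -2.7047, "StudyTimeWeekly": 0.5534}

for name, beta in coefs.items():
    # exp(beta) = multiplicative change in the odds of passing
    # per one standard-deviation increase in the feature
    print(f"{name}: odds ratio = {np.exp(beta):.3f}")
```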

Trade-offs Acknowledged


Installation & Setup

Prerequisites

Python 3.8+
pip

Clone Repository

git clone https://github.com/yourusername/student-performance-prediction.git
cd student-performance-prediction

Install the required packages with pip:

pip install -r requirements.txt

requirements.txt:

pandas
numpy
scikit-learn
statsmodels
matplotlib
seaborn
jupyter

Open and run the notebook using a tool such as VS Code or Jupyter.

Project Structure

dspp/
├── student_performance_analysis.ipynb  # Main analysis notebook
├── Student_performance_data _.csv      # Raw dataset
├── requirements.txt                    # Python dependencies
├── README.md                           # This file

Data Engineering Approach

1. Data Quality Assessment (Gov UK Framework)

Before any modeling, I conducted a data quality audit against the [UK Government Data Quality Framework](https://www.gov.uk/government/publications/the-government-data-quality-framework/the-government-data-quality-framework#Data-quality-dimensions) to ensure trustworthy results.

Why This Framework?

Results

| Dimension | Target | Achieved | Test Method |
|---|---|---|---|
| Completeness | 95%+ | 100% | `.isnull().sum()` across 16 features |
| Accuracy | No invalid ranges | 0 issues | Domain validation (GPA: 0-4.0, Age: 14-19, etc.) |
| Validity | Schema compliance | 0 issues | Binary fields contain only 0/1 |
| Consistency | Unique StudentIDs | 100% unique | No duplicate IDs via `.value_counts()` |
| Uniqueness | No duplicate rows | 100% unique records | `.duplicated().sum() == 0` |

Outcome: Perfect data quality. No imputation or outlier treatment was required, so I proceeded directly to feature engineering.

Code Example: Completeness Check

# Check for missing values using dual methods
print("NULL values:\n", df.isnull().sum())
print("\nNA values:\n", df.isna().sum())

# Result: 0 missing values across all features
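The accuracy and validity dimensions can be tested similarly. A minimal sketch of the domain-range checks, with a toy dataframe standing in for the real dataset (column names match the project's):

```python
import pandas as pd

# Toy stand-in for the real dataset
df = pd.DataFrame({
    "GPA": [3.2, 0.0, 4.0],
    "Age": [15, 17, 19],
    "Tutoring": [0, 1, 1],
})

# Domain validation: each value must fall in its documented range
checks = {
    "GPA": df["GPA"].between(0, 4.0).all(),      # GPA: 0-4.0
    "Age": df["Age"].between(14, 19).all(),      # Age: 14-19
    "Tutoring": df["Tutoring"].isin([0, 1]).all(),  # binary field
}
print(checks)  # all True => 0 accuracy/validity issues
```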

2. Feature Engineering: Target Variable Creation

Challenge: Original dataset contained 5-tier GradeClass (A/B/C/D/F). Binary classification requires two classes.

Solution: Engineered PassFail target column using pandas:

# Pass = Grades A, B, C (GradeClass 0, 1, 2)
# Fail = Grades D, F (GradeClass 3, 4)
df['PassFail'] = (df['GradeClass'] < 3).astype(int)  # 1=Pass, 0=Fail

Rationale:

Class Distribution:

Imbalance Noted: Used stratified sampling in train-test split to preserve the natural distribution from the sourced data.
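The class balance can be inspected with `value_counts`. The sketch below uses toy data; the real distribution is roughly 68% fail / 32% pass:

```python
import pandas as pd

# Toy stand-in for the engineered target
df = pd.DataFrame({"GradeClass": [0, 1, 2, 3, 4, 3, 4, 4, 3, 4]})
df["PassFail"] = (df["GradeClass"] < 3).astype(int)  # 1=Pass, 0=Fail

# Normalised counts show the class proportions
print(df["PassFail"].value_counts(normalize=True))
```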

*Figure: Pass/Fail class distribution*


3. Feature Selection: Preventing Target Leakage

Dropped 3 columns to avoid data leakage:

| Feature | Reason for Removal |
|---|---|
| StudentID | Non-predictive identifier (random assignment) |
| GPA | Near-deterministic of PassFail; it would achieve extremely high accuracy but defeat the purpose of detecting at-risk students from features other than the grade itself |
| GradeClass | Source of the engineered target; retaining it causes perfect multicollinearity |

Retained 12 features to predict PassFail:

['Age', 'Gender', 'Ethnicity', 'ParentalEducation', 'StudyTimeWeekly', 
 'Absences', 'Tutoring', 'ParentalSupport', 'Extracurricular', 
 'Sports', 'Music', 'Volunteering']

4. Train-Test Split Strategy

df_train, df_test = train_test_split(
    student_df, 
    test_size=0.2,          # 80/20 split (industry standard for ML)
    random_state=1234,      # Reproducibility
    stratify=student_df['PassFail']  # Preserve class distribution with stratify
)

Why Stratification? If I opted for random sampling, it could create a test set with 75% fails (vs true 68%), leading to:

Verification:

Train Set Distribution:    Test Set Distribution:
0 (Fail):  67.9%           0 (Fail):  67.9%
1 (Pass):  32.1%           1 (Pass):  32.1%
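This verification can be reproduced with `value_counts`. The sketch below uses a synthetic 100-row target with the same ~68/32 imbalance rather than the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with a 68% fail / 32% pass split
student_df = pd.DataFrame({"PassFail": [0] * 68 + [1] * 32})

df_train, df_test = train_test_split(
    student_df,
    test_size=0.2,
    random_state=1234,
    stratify=student_df["PassFail"],  # preserve class distribution
)

# Both splits should show roughly the same proportions
print(df_train["PassFail"].value_counts(normalize=True))
print(df_test["PassFail"].value_counts(normalize=True))
```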

5. Feature Scaling: StandardScaler

Applied: Z-score normalisation (mean=0, std=1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # Fits ONLY on training data

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Apply train parameters

Why StandardScaler?

| Consideration | StandardScaler | MinMaxScaler |
|---|---|---|
| Outlier sensitivity | Robust (uses standard deviation) | Sensitive (uses min/max) |
| Feature distribution | Works with any shape | Assumes bounded range |
| Logistic regression compatibility | Preferred | Can compress important signals |

Critical Detail: The scaler was fitted only on training data, preventing data leakage. Scaling the test set with the training parameters simulates real-world deployment, where future data statistics are unknown.


Model Development & Rationale

Exploratory Data Analysis (EDA)

Correlation Analysis

*Figure: correlation heatmap of features with PassFail*

Key Finding: Absences shows strongest negative correlation with PassFail (-0.66)

Business Implication: Attendance monitoring systems should trigger alerts at 5 absences (statistically significant threshold from logistic coefficients).
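A minimal sketch of the correlation check, using toy data rather than the real dataset:

```python
import pandas as pd

# Toy data illustrating the expected relationships
df = pd.DataFrame({
    "Absences":        [0, 2, 5, 12, 20, 25],
    "StudyTimeWeekly": [15, 12, 10, 6, 3, 1],
    "PassFail":        [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each feature with the target
print(df.corr()["PassFail"].sort_values())
```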


Model Training

from sklearn.linear_model import LogisticRegression

# Initialise the model with default hyperparameters
model_scaled = LogisticRegression()

# Train the model on the scaled features
model_scaled.fit(X_train_scaled, y_train)

Hyperparameters: Used defaults (no regularisation tuning) as:

  1. High data quality (no noise to filter)
  2. Only 12 features so there is a low overfitting risk
  3. Baseline model prioritises interpretability

Results & Performance

Evaluation Metrics

Why Multiple Metrics?

| Metric | What It Measures | Why It Matters |
|---|---|---|
| Accuracy | Overall correctness | High-level performance (but misleading with class imbalance) |
| Balanced Accuracy | Average of sensitivity/specificity | Corrects for 68% fail bias, giving a truer measure of model quality |
| F1-Score | Harmonic mean of precision/recall | Balances false positives vs false negatives |
| ROC-AUC | Discrimination ability across thresholds | Measures separability of classes (0.5=random, 1.0=perfect) |
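All four metrics are available in scikit-learn. A self-contained sketch with toy predictions (in the notebook they would come from `model_scaled.predict` / `model_scaled.predict_proba` on the test set):

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)

# Toy labels and predictions for illustration only
y_test = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.4, 0.2]  # P(pass)

print("Accuracy:         ", accuracy_score(y_test, y_pred))
print("Balanced Accuracy:", balanced_accuracy_score(y_test, y_pred))
print("F1-Score:         ", f1_score(y_test, y_pred))
print("ROC-AUC:          ", roc_auc_score(y_test, y_prob))
```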

Results

Accuracy:          89.1%
Balanced Accuracy: 87.39%  # Key metric for imbalanced data
F1-Score:          83.01%
ROC-AUC:           0.93   # Excellent discrimination (0.5 = random guessing, 1.0 = perfect discrimination)

Interpretation:


Confusion Matrix Analysis

*Figure: confusion matrix on the test set*

Error Breakdown:

Model Behavior: Slightly favours false negatives (5.64% vs 5.22%), which likely results from training on a 68% fail distribution.

Acceptable for educational context where:

Unacceptable for educational context where:


ROC Curve

*Figure: ROC curve (AUC = 0.93)*

AUC = 0.93 indicates model is much better than random guessing at distinguishing pass/fail students across all probability thresholds.


Feature Importance (Coefficients)

*Figure: logistic regression coefficients by feature*

| Rank | Feature | Coefficient | Interpretation | P-value |
|---|---|---|---|---|
| 1 | Absences | -2.7047 | More absences = strong negative effect on passing (strongest predictor) | <0.001 |
| 2 | StudyTimeWeekly | +0.5534 | More study time = moderate positive effect on passing | <0.001 |
| 3 | ParentalSupport | +0.4688 | Higher support level = moderate positive effect on passing | <0.001 |
| 4 | Tutoring | +0.3649 | Having a tutor = moderate positive effect on passing | <0.001 |
| 5 | Extracurricular | +0.3430 | Participation = moderate positive effect on passing | <0.001 |

Statistical Significance (via statsmodels logistic regression):


Key Insights & Business Value

Actionable Insights:

  1. Attendance is critical: Absences have a dramatically larger effect than all other predictors combined (around 5x stronger than the next predictor)
  2. Study time matters: Significant positive effect on outcomes
  3. Parental support helps: Clear measurable benefit
  4. Tutoring shows improvement: Meaningful improvement in passing likelihood
  5. Demographics show fairness: Age (p=0.871), Gender (p=0.372), and Ethnicity (p=0.852) are not statistically significant, suggesting predictions are driven by behaviour rather than demographics

Future Improvements

Model Enhancements

Feature Engineering