ution
1. Train/Test Split Implementation
The foundational rule requires partitioning data before any model interaction. The training set drives parameter optimization, while the test set remains strictly sealed until final evaluation.
from sklearn.model_selection import train_test_split
import numpy as np
# Fake dataset: 1000 examples, 5 features
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)
# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 20% goes to testing
random_state=42 # Makes the split reproducible
)
print(f"Training size: {X_train.shape[0]}") # 800
print(f"Testing size: {X_test.shape[0]}") # 200
The random_state=42 just means: every time you run this, you get the same split. Without it, you'd get a different random split each time, and your results would change every run. That makes debugging a nightmare.
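To see the effect of the seed directly, here is a minimal sketch. The toy arrays `X_demo` and `y_demo` and the second seed value `7` are made up for illustration and are not part of the example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # seeded only so the toy data is stable
X_demo = rng.random((100, 5))
y_demo = rng.integers(0, 2, 100)

# Same seed twice: identical partitions
_, X_test_a, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_b, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Different seed: a different shuffle, so a different test set
_, X_test_c, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=7)

same_seed_matches = np.array_equal(X_test_a, X_test_b)
other_seed_matches = np.array_equal(X_test_a, X_test_c)
print(same_seed_matches)   # True
print(other_seed_matches)  # False
```

The value 42 itself has no special meaning. Any fixed integer gives a repeatable split.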
2. Split Size Strategy
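Split ratios should scale with dataset size. A small dataset needs a proportionally larger test set so that the score rests on enough samples to be trusted. A large dataset can hold out a smaller fraction, because even 10% of it is still thousands of rows.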
# Small dataset - use 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Big dataset - 90/10 is fine
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
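One rough way to reason about these ratios is the binomial standard error of an accuracy score, sqrt(p(1 - p) / n_test). The sketch below assumes test samples are independent and uses a hypothetical accuracy of 0.90. The helper name `accuracy_standard_error` and the row counts are illustrative, not a rule from the text above.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# (total rows, test fraction) pairs chosen only for illustration
for n_rows, test_size in [(500, 0.3), (1_000, 0.2), (100_000, 0.1)]:
    n_test = round(n_rows * test_size)
    se = accuracy_standard_error(0.90, n_test)
    print(f"{n_rows:>7} rows, test_size={test_size}: n_test={n_test:>6}, SE ~ {se:.4f}")
```

With 150 test rows the score can easily move by a couple of percentage points from noise alone, while 10,000 test rows pin it down to a fraction of a point. That is why a 90/10 split is usually enough once the dataset is large.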
3. Preventing Data Leakage
Data leakage occurs when test-set information contaminates the training pipeline. The two primary vectors are direct evaluation overlap and premature preprocessing.
Leakage Type 1: Training on test data directly
# WRONG - never do this
model.fit(X, y) # trained on ALL data
score = model.score(X, y) # tested on same data
print(score) # Looks amazing. Means nothing.
# RIGHT
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score) # This number actually tells you something
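The gap between the two scores is easiest to see when there is nothing to learn at all. The sketch below is an illustration, not part of the original example: it assumes purely random labels and uses an unpruned `DecisionTreeClassifier`, chosen here only because it memorizes its training data readily.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_noise = rng.random((1000, 5))
y_noise = rng.integers(0, 2, 1000)  # labels are pure noise: nothing to learn

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_score = tree.score(X_tr, y_tr)
test_score = tree.score(X_te, y_te)

print(f"Scored on training data: {train_score:.3f}")  # 1.000, the tree memorized the labels
print(f"Scored on held-out data: {test_score:.3f}")   # close to chance (about 0.5)
```

A perfect score on the data the model was fit on says nothing about whether it learned anything useful.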
Leakage Type 2: Preprocessing before splitting
from sklearn.preprocessing import StandardScaler
# WRONG - scaling before split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # uses ALL data to calculate mean/std
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
# The test set influenced the scaler. Leakage.
# RIGHT - split first, then scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # learns from train only
X_test_scaled = scaler.transform(X_test) # applies same scaling to test
See the difference? In the wrong version, when you called fit_transform(X), the scaler calculated mean and standard deviation using the test data too. That information then flowed into how your model was trained. The test set is no longer truly unseen.
Always: split first, preprocess second.
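One way to make this ordering hard to get wrong is to put the scaler inside a scikit-learn `Pipeline`, so that fitting the pipeline fits the scaler on whatever rows it is given and nothing else. This is a sketch under assumptions: the iris data and `LogisticRegression` are stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# The scaler lives inside the pipeline, so fit() only ever sees training rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# During cross-validation the scaler is refit on the training part of each fold
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)

pipe.fit(X_tr, y_tr)
holdout_score = pipe.score(X_te, y_te)

# Sanity check: the fitted scaler's means are the training-set means
scaler_means = pipe.named_steps["standardscaler"].mean_
print(np.allclose(scaler_means, X_tr.mean(axis=0)))  # True
print(f"CV mean on training data: {cv_scores.mean():.3f}")
print(f"Held-out score: {holdout_score:.3f}")
```

Calling `pipe.score(X_te, y_te)` applies the already-fitted scaler to the test rows with `transform`, which matches the RIGHT version above.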
4. Cross-Validation Upgrade
A score from a single split is a high-variance estimate. K-fold cross-validation rotates the test partition K times, so every sample is used for evaluation exactly once and for training in the other K-1 folds.
Example with K=5 (called 5-fold cross-validation):
Fold 1: [TEST ] [train] [train] [train] [train]
Fold 2: [train] [TEST ] [train] [train] [train]
Fold 3: [train] [train] [TEST ] [train] [train]
Fold 4: [train] [train] [train] [TEST ] [train]
Fold 5: [train] [train] [train] [train] [TEST ]
Final score = average of all 5 test scores
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
model = KNeighborsClassifier(n_neighbors=3)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Scores per fold: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f}")
print(f"Std deviation: {scores.std():.3f}")
Output:
Scores per fold: [0.967 1. 0.933 0.967 1. ]
Mean accuracy: 0.973
Std deviation: 0.025
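The mean is your best single estimate of performance on unseen data. The standard deviation tells you how much the score moves from fold to fold. Small std = a stable estimate. Large std = the model is sensitive to which samples it trains on, or the folds are too small to measure it reliably.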
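To make the rotation concrete, the loop below is a rough sketch of what `cross_val_score` does for a classifier when given `cv=5`: scikit-learn resolves the integer to a `StratifiedKFold` without shuffling. The names `test_counts` and `fold_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X_iris, y_iris = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)
folds = StratifiedKFold(n_splits=5)

test_counts = np.zeros(len(X_iris), dtype=int)  # how often each sample is tested
fold_scores = []
for train_idx, test_idx in folds.split(X_iris, y_iris):
    # A fresh copy per fold, so no fitted state carries over between folds
    fold_model = clone(model).fit(X_iris[train_idx], y_iris[train_idx])
    fold_scores.append(fold_model.score(X_iris[test_idx], y_iris[test_idx]))
    test_counts[test_idx] += 1

print(np.unique(test_counts))  # [1]: every sample was in a test fold exactly once
print(f"Mean accuracy: {np.mean(fold_scores):.3f}")
```

In everyday use `cross_val_score` is the better choice. The explicit loop is mainly useful when you need per-fold predictions or custom logging.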
5. Stratified Splits for Imbalanced Data
Random partitioning on skewed distributions can create unrepresentative test sets. Stratification preserves class proportions across all partitions.
from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
# Imbalanced dataset: 950 negatives, 50 positives
X = np.random.rand(1000, 4)
y = np.array([0]*950 + [1]*50)
# stratify=y makes sure both sets keep the 95/5 ratio
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # <-- this line
)
print(f"Train class 1 ratio: {y_train.mean():.3f}") # ~0.050
print(f"Test class 1 ratio: {y_test.mean():.3f}") # ~0.050
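The same idea carries over to cross-validation through `StratifiedKFold`. The sketch below reuses the 950/50 class counts from above on freshly generated toy features. The names `X_imb`, `y_imb`, and `positives_per_fold` are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_imb = rng.random((1000, 4))
y_imb = np.array([0] * 950 + [1] * 50)

# Each of the 5 test folds keeps the 95/5 class ratio
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positives_per_fold = [
    int(y_imb[test_idx].sum()) for _, test_idx in skf.split(X_imb, y_imb)
]
print(positives_per_fold)  # [10, 10, 10, 10, 10]
```

Passing the `skf` object as the `cv` argument of `cross_val_score` applies these folds during evaluation.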
Pitfall Guide
- Evaluating on Training Data: Training and testing on the same dataset produces an inflated score. A model that has memorized noise in the training data is rewarded for it, so the number says nothing about how it will perform on unseen data.
- Preprocessing Before Splitting: Fitting scalers, imputers, or encoders on the full dataset leaks test statistics into training parameters. Always partition first, then fit transformers exclusively on the training fold.
- Ignoring Dataset Size in Split Ratios: Applying a rigid 80/20 ratio to small datasets (<1k rows) can leave too few test samples for a reliable score. Conversely, using 70/30 on massive datasets (>100k rows) holds back far more data than a stable estimate needs, data the model could have trained on.
- Single-Split Dependency: Relying on one random partition introduces high variance in performance estimates. A single lucky or unlucky split can mislead architecture decisions. Cross-validation mitigates this by averaging across multiple partitions.
- Neglecting Class Imbalance: Random splits on skewed targets (e.g., fraud detection) can produce test sets with near-zero positive samples, rendering accuracy metrics meaningless. Stratification enforces proportional representation.
- Omitting random_state: Failing to seed random splits breaks reproducibility. Without a deterministic partition, runs cannot be compared fairly during model tracking, hyperparameter tuning, or debugging, because a change in score may come from the split rather than from the model.
Deliverables
- 📘 Evaluation Pipeline Blueprint: A step-by-step architectural diagram for ML evaluation workflows, detailing data isolation boundaries, preprocessing ordering, and resampling strategies for tabular, time-series, and NLP datasets.
- ✅ Train/Test Split Checklist: A 12-point validation checklist covering split ratio selection, stratification requirements, leakage prevention, seed management, and cross-validation configuration before model training begins.
- ⚙️ Configuration Templates: Ready-to-use scikit-learn pipeline templates including Pipeline + ColumnTransformer setups that enforce split-first preprocessing, StratifiedKFold cross-validation wrappers, and reproducibility locks for production-grade model evaluation.