Overview of the ML Workflow
The machine learning workflow you’ve outlined is accurate and represents the standard pipeline in ML projects. The complete flow is:
Data → Feature Representation → Model Family → Parameters → Prediction Function → Loss → Optimization → Regularization → Training Procedure → Evaluation → Inference
This is indeed the core workflow, though in practice it’s often iterative with feedback loops back to earlier stages based on evaluation results.
1. Data
The foundation of any ML project. This stage involves acquiring, understanding, and preparing raw information.
Key Concepts and Techniques:
Data Collection & Sources
- Web scraping
- APIs and databases
- Sensors and IoT devices
- User-generated content
- Public datasets (Kaggle, UCI Machine Learning Repository, etc.)
Data Types
- Structured data (tabular, relational databases)
- Unstructured data (text, images, audio, video)
- Time series data
- Graph data
- Semi-structured data (JSON, XML)
Data Quality Assessment
- Missing values detection
- Outlier identification
- Data validation
- Class imbalance analysis
- Data profiling and statistical summaries
Data Cleaning
- Handling missing values (deletion, imputation, forward-fill)
- Removing duplicates
- Correcting inconsistencies
- Standardizing formats
- Removing irrelevant features
Data Splitting
- Train/Validation/Test split (typically 70/15/15, or 80/20 for a two-way train/test split)
- Stratified sampling (for imbalanced classes)
- Time-based splits (for time series)
- K-fold cross-validation preparation
- Group-based splits (for grouped data)
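The splits above can be sketched with scikit-learn's `train_test_split`; the dataset, split sizes, and random seed here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 100 samples, 2 features, 20% positive class
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([1] * 20 + [0] * 80)

# First carve out the test set, then split the remainder into train/validation.
# stratify=... keeps the class ratio roughly equal in every split.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=15, stratify=y_temp, random_state=42)
```

Without `stratify`, a rare class can end up badly under-represented in the validation or test set purely by chance.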
2. Feature Representation
The process of transforming raw data into meaningful features that algorithms can learn from effectively.
Key Concepts and Techniques:
Feature Extraction
- Dimensionality reduction techniques (PCA, ICA, Factor Analysis)
- Text feature extraction (TF-IDF, Word2Vec, BERT embeddings)
- Image feature extraction (SIFT, HOG, CNN features)
- Audio feature extraction (MFCC, spectrograms)
- Domain-specific feature extraction
Feature Engineering
- Polynomial features
- Interaction terms
- Domain-driven features
- Temporal features (day, month, hour from timestamps)
- Statistical features (mean, variance, skewness)
- Binning and bucketing
- One-hot encoding for categorical variables
- Ordinal encoding
- Target encoding
- Frequency encoding
Feature Normalization & Scaling
- Standardization (z-score normalization)
- Min-Max scaling (normalization)
- Robust scaling (resistant to outliers)
- Log scaling
- Unit vector scaling (L2 normalization)
- Quantile transformation
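A minimal NumPy sketch of three of these scalers, using a made-up vector with one outlier to show why robust scaling exists:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # note the outlier

# Standardization (z-score): zero mean, unit variance
z = (x - x.mean()) / x.std()

# Min-Max scaling: squashes values into [0, 1]
mm = (x - x.min()) / (x.max() - x.min())

# Robust scaling: median and IQR, far less sensitive to the outlier
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)
```

The outlier drags the mean to 22, so standardization compresses the four typical values; the robust version keeps them spread out because median and IQR ignore the extreme.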
Feature Selection
- Filter methods (correlation, chi-square, ANOVA)
- Wrapper methods (RFE - Recursive Feature Elimination)
- Embedded methods (L1/L2 regularization, tree feature importance)
- Univariate statistical tests
- Mutual information
- Permutation importance
Handling Categorical Variables
- One-hot encoding
- Label encoding
- Ordinal encoding
- Binary encoding
- Hashing
- Embeddings (for high-cardinality features)
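One-hot encoding can be sketched directly in NumPy (the `colors` column is hypothetical):

```python
import numpy as np

# Hypothetical categorical column with three levels
colors = np.array(["red", "green", "blue", "green", "red"])

# Map each category to an integer code, then expand codes to one-hot rows.
# np.unique returns the sorted categories and, with return_inverse=True,
# the code of each original value.
categories, codes = np.unique(colors, return_inverse=True)
one_hot = np.eye(len(categories))[codes]
```

In practice scikit-learn's `OneHotEncoder` is preferable because it remembers the category-to-column mapping for reuse at inference time.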
Feature Creation from Raw Data
- Text: word counts, n-grams, sentiment scores
- Time series: lags, rolling statistics, Fourier features
- Images: color histograms, texture descriptors
- Mixed: cross-features, aggregations
Tools/Frameworks: Scikit-learn’s feature tools (StandardScaler, PCA, feature selectors); TensorFlow/Keras preprocessing layers; libraries like Featuretools for automated feature engineering.
3. Model Family
The choice of which type of algorithm architecture to use for your problem.
Key Concepts and Techniques:
Regression Models (for continuous output)
- Linear Regression
- Polynomial Regression
- Ridge Regression
- Lasso Regression
- Elastic Net
- Support Vector Regression (SVR)
- Decision Trees (Regression)
- Random Forest (Regression)
- Gradient Boosting Regression
- Neural Networks (Regression output)
Classification Models (for categorical output)
- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- Naive Bayes
- K-Nearest Neighbors (KNN)
- Gradient Boosting Classifiers (XGBoost, LightGBM, CatBoost)
- Neural Networks (Classification output)
Ensemble Methods
- Bagging (Bootstrap Aggregating)
- Boosting (AdaBoost, Gradient Boosting)
- Stacking
- Voting Classifiers/Regressors
- Blending
Deep Learning Models
- Convolutional Neural Networks (CNNs)
- Recurrent Neural Networks (RNNs, LSTM, GRU)
- Transformer models
- Autoencoders
- Generative models (GANs, VAE)
Unsupervised Learning Models
- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Gaussian Mixture Models (GMM)
- Principal Component Analysis (PCA)
- Isolation Forest (anomaly detection)
Specialized Models
- Time Series: ARIMA, Prophet, SARIMA
- Recommendation: Collaborative Filtering, Matrix Factorization
- Ranking: Learning to Rank methods
- Reinforcement Learning: Q-Learning, Policy Gradient
Model Selection Criteria
- Problem type (regression vs classification vs clustering)
- Data size and dimensionality
- Interpretability requirements
- Computational constraints
- Latency requirements
- Training time considerations
4. Parameters
The specific configuration and hyperparameters of the chosen model family.
Key Concepts and Techniques:
Model Hyperparameters (set before training)
- Learning rate (for gradient-based optimization)
- Number of hidden layers and units (for neural networks)
- Tree depth and node splitting criteria (for tree-based models)
- Kernel type and kernel coefficient (for SVM)
- Regularization strength (C, alpha)
- Number of neighbors K (for KNN)
- Number of estimators/trees
- Batch size
- Number of epochs
- Dropout rate
- Activation functions
- Optimizer type
Model Parameters (learned during training)
- Weights and biases (for neural networks)
- Tree structure and split thresholds
- Support vectors and their coefficients (for SVM)
- Feature coefficients (for linear models)
Hyperparameter Tuning Methods
- Grid Search (exhaustive search over specified parameter values)
- Random Search (random sampling of parameter space)
- Bayesian Optimization
- Hyperband
- Optuna (framework implementing TPE, CMA-ES, and other samplers)
- Tree-structured Parzen Estimator (TPE)
- Genetic algorithms
- Manual tuning with domain expertise
Parameter Ranges and Defaults
- Understanding default values
- Common ranges for popular models
- Parameter dependencies
- Parameter scaling (linear vs logarithmic scale)
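Log-scale sampling of a hyperparameter such as the regularization strength `C` can be sketched with `RandomizedSearchCV` (synthetic data; the search budget and bounds are arbitrary):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# C spans several orders of magnitude, so sample it on a log scale
# rather than uniformly
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},
    n_iter=10, cv=3, random_state=0)
search.fit(X, y)
best_C = search.best_params_["C"]
```

Sampling uniformly would waste almost the whole budget on large values of `C`; `loguniform` spends it evenly across magnitudes.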
5. Prediction Function
The mathematical function that maps input features to predictions using the model parameters.
Key Concepts and Techniques:
Linear Prediction Functions
- ( y = w^T x + b ) (linear regression/classification)
- Weighted sum of features
Non-linear Prediction Functions
- Polynomial: ( y = \beta_0 + \sum \beta_i x_i + \sum \beta_{ij} x_i x_j + … )
- Neural Network: compositions of non-linear activation functions
- RBF (Radial Basis Function): ( f(x) = \sum_{i=1}^{n} w_i \phi(||x - x_i||) )
- Decision Trees: hierarchical if-then-else rules
Output Transformations
- Sigmoid function (for binary classification): ( \sigma(z) = \frac{1}{1 + e^{-z}} )
- Softmax function (for multi-class): ( p_i = \frac{e^{z_i}}{\sum_j e^{z_j}} )
- Tanh activation
- ReLU and variants
- Linear output (for regression)
Prediction Types
- Point predictions (single value)
- Probability predictions (confidence scores)
- Quantile predictions (for quantile regression)
- Interval predictions (prediction intervals)
- Ranked predictions (for ranking tasks)
Ensemble Predictions
- Averaging (mean of predictions)
- Weighted averaging
- Majority voting (classification)
- Stacking with meta-learners
- Blending combinations
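The linear prediction function and output transformations above can be sketched in NumPy (the weights and input values are made up):

```python
import numpy as np

def sigmoid(z):
    # Maps any real score to (0, 1); used for binary class probabilities
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Linear prediction function y = w^T x + b followed by a sigmoid
w, b = np.array([0.5, -0.25]), 0.1
x = np.array([2.0, 4.0])
p = sigmoid(w @ x + b)   # probability of the positive class
```

The same `w @ x + b` score feeds a sigmoid for binary classification, a softmax over several scores for multi-class, or is used directly for regression.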
6. Loss Function
Quantifies the discrepancy between predicted and actual values, guiding the model’s learning.
Key Concepts and Techniques:
Regression Loss Functions
- Mean Squared Error (MSE): ( L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 )
- Sensitive to outliers, differentiable
- Mean Absolute Error (MAE): ( L = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| )
- Robust to outliers, less smooth
- Root Mean Squared Error (RMSE): ( L = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} )
- Interpretable scale
- Huber Loss: Combination of MSE and MAE, robust yet smooth
- Log-Cosh Loss: Smooth loss that behaves like MSE for small errors and like MAE for large ones
- Quantile Loss: For quantile regression
Classification Loss Functions
- Binary Cross-Entropy (Log Loss): ( L = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)] )
- For binary classification
- Categorical Cross-Entropy: ( L = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log(\hat{y}_{ik}) )
- For multi-class classification
- Focal Loss: Addresses class imbalance by down-weighting easy examples
- Hinge Loss: ( L = \sum_{i=1}^{n} \max(0, 1 - y_i \hat{y}_i) )
- Used in SVM
- Squared Hinge Loss: Smoother version of hinge loss
Specialized Loss Functions
- Triplet Loss: For metric learning and embeddings
- Contrastive Loss: For similarity learning
- Ranking Loss: For ranking problems (pairwise and listwise)
- Ordinal Loss: For ordinal regression
- AUC surrogate losses: Differentiable surrogates that approximately optimize AUC (AUC itself is not differentiable)
Weighted Loss Functions
- Class weights (to handle imbalanced datasets)
- Sample weights
- Focal weight (down-weight easy negatives)
Custom Loss Functions
- Business-specific metrics as loss
- Multi-task learning losses
- Adversarial losses (for GANs)
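Several of the losses above reduce to a few lines of NumPy; this sketch mirrors the formulas given earlier:

```python
import numpy as np

def mse(y, y_hat):
    # Mean Squared Error: penalizes large residuals quadratically
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: robust to outliers
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear beyond delta
    r = np.abs(y - y_hat)
    return np.mean(np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)))

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Evaluating these on a residual vector like `[0, 0, 2]` makes the outlier-sensitivity difference concrete: MSE sees 4/3, MAE 2/3, and Huber (delta=1) only 0.5.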
7. Optimization
The algorithm used to minimize the loss function and find optimal model parameters.
Key Concepts and Techniques:
Gradient-Based Optimization
- Gradient Descent Variants
- Batch Gradient Descent (BGD): Uses entire dataset per update
- Stochastic Gradient Descent (SGD): Uses single sample per update
- Mini-batch Gradient Descent: Uses subset of data per update
- Cyclical learning rates
Adaptive Learning Rate Methods
- Momentum: Accelerates convergence by accumulating gradient direction
- Standard Momentum (SGD with momentum)
- Nesterov Accelerated Gradient (NAG)
- Adagrad: Adapts learning rate based on historical gradients
- RMSprop: Modifies Adagrad with exponential moving average
- Adam: Combines momentum and RMSprop (most popular)
- AdaMax: Variant of Adam
- Nadam: Nesterov-accelerated Adam
- AdamW: Adam with decoupled weight decay
Momentum Accumulation
- Exponential moving average of gradients
- Velocity and friction concepts
- Gradient clipping
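A minimal sketch of gradient descent with momentum, minimizing the toy quadratic ( f(w) = (w - 3)^2 ); the learning rate and momentum coefficient are chosen arbitrarily:

```python
def grad(w):
    # Gradient of f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9          # learning rate and momentum coefficient
for _ in range(200):
    # Accumulate an exponentially decaying sum of past gradients ("velocity"),
    # then step in that direction
    velocity = beta * velocity + grad(w)
    w -= lr * velocity
```

With `beta = 0`, this reduces to plain gradient descent; momentum lets consistent gradient directions accelerate while oscillating components partially cancel.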
Learning Rate Scheduling
- Fixed learning rate
- Step decay (reduce at fixed intervals)
- Exponential decay
- Polynomial decay (linear, quadratic, cubic)
- Cosine annealing
- Warm restarts
- Adaptive scheduling based on validation metrics
Second-Order Methods
- Newton’s Method
- Quasi-Newton (BFGS, L-BFGS)
- Natural Gradient Descent
- Hessian-free optimization
Gradient Descent Challenges
- Local minima vs global minima
- Saddle points
- Vanishing and exploding gradients
- Plateau regions
Alternative Optimization Methods
- Coordinate Descent
- Alternating Least Squares (ALS)
- Expectation-Maximization (EM)
- Simulated Annealing
- Genetic Algorithms
- Particle Swarm Optimization
Distributed Optimization
- Data parallelism
- Model parallelism
- Asynchronous SGD
- Parameter servers
- Federated learning
8. Regularization
Techniques to prevent overfitting and improve model generalization.
Key Concepts and Techniques:
Parameter Norm Penalties
- L1 Regularization (Lasso): ( L_{total} = L_{original} + \lambda \sum |w_i| )
- Encourages sparsity (feature selection)
- Used in Lasso regression and L1-penalized linear models
- L2 Regularization (Ridge): ( L_{total} = L_{original} + \lambda \sum w_i^2 )
- Penalizes large weights, smoother solutions
- Used in Ridge regression
- Elastic Net: Combination of L1 and L2
- ( L_{total} = L_{original} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2 )
- Regularization strength ((\lambda)): Controls trade-off between fit and complexity
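The sparsity-vs-shrinkage contrast between L1 and L2 can be seen directly with scikit-learn's Lasso and Ridge on synthetic data where only two features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features carry signal; the other eight are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))  # L1 drives noise coefficients to exactly 0
n_zero_ridge = int(np.sum(ridge.coef_ == 0))  # L2 only shrinks, essentially never to exactly 0
```

This is why L1 regularization doubles as a feature selector while L2 merely smooths the solution.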
Dropout
- Randomly deactivating neurons during training
- Effective for neural networks
- Dropout rate as hyperparameter (typically 0.2-0.5)
- Variants: standard dropout, spatial dropout, recurrent dropout
Early Stopping
- Monitor validation performance
- Stop training when validation loss plateaus or increases
- Preserves weights from best iteration
- Prevents overfitting without explicit penalty
Data Augmentation
- Image: rotation, flipping, cropping, color jittering, elastic deformations
- Text: back-translation, paraphrasing, synonym replacement
- General: noise injection, mixup, cutmix
- Creates additional training samples from existing data
Batch Normalization
- Normalizes layer inputs
- Originally motivated by reducing internal covariate shift (the mechanism is still debated)
- Allows higher learning rates
- Has regularization effect
- Variants: layer norm, group norm, instance norm
Weight Decay
- Similar to L2 regularization
- Decoupled weight decay (AdamW)
- Reduces magnitude of weights over time
Noise Injection
- Gaussian noise to inputs
- Label smoothing (soft targets instead of hard 0/1)
- Stochastic depth
Ensemble Methods as Regularization
- Averaging multiple models reduces overfitting
- Bagging with random subsets
- Boosting with sequential weak learners
Model Complexity Constraints
- Tree depth limits
- Number of basis functions
- Number of layers/units in networks
- Sparsity constraints
Architectural Regularization
- Skip connections
- Bottleneck layers
- Attention mechanisms
- Structured pruning
9. Training Procedure
The systematic process of iteratively updating model parameters to minimize loss.
Key Concepts and Techniques:
Training Loop Components
- Forward pass: compute predictions
- Loss computation: calculate loss value
- Backward pass: compute gradients
- Parameter update: modify weights/biases
- Validation evaluation: monitor progress
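The loop components above can be sketched as full-batch logistic regression in plain NumPy (toy linearly separable data; the learning rate and epoch count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy separable data: label is 1 when x0 + x1 > 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = np.zeros(2), 0.0
lr = 0.5
for epoch in range(300):
    # Forward pass: predicted probabilities
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Loss computation: binary cross-entropy (monitored each epoch)
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    # Backward pass: gradient of the loss w.r.t. w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Parameter update
    w -= lr * grad_w
    b -= lr * grad_b

# Accuracy from the last forward pass (a real loop would use a held-out set)
accuracy = np.mean((p > 0.5) == y)
```

A production loop adds the missing pieces listed above: mini-batching, a separate validation evaluation, and a stopping criterion instead of a fixed epoch count.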
Epoch and Batch Management
- Epoch: full pass through entire training dataset
- Batch/Mini-batch: subset of data for one update
- Batch size selection: trade-off between speed and stability
- Number of epochs: determined by early stopping or fixed
- Iteration: single update step
Initialization Strategies
- Random initialization
- Xavier/Glorot initialization (maintains variance across layers)
- He initialization (for ReLU networks)
- Pre-trained weights (transfer learning)
- Layer-wise initialization
- Orthogonal initialization
Training Techniques
- Curriculum learning: gradually increase difficulty
- Multi-task learning: simultaneous training on related tasks
- Meta-learning: learning to learn
- Online learning: streaming data updates
- Continual learning: learning sequentially without forgetting
Validation During Training
- Validation set evaluation at regular intervals
- Early stopping based on validation metrics
- Learning curves: plotting train/validation loss over epochs
- Overfitting detection
- Hyperparameter adjustment mid-training
Convergence Criteria
- Loss below threshold
- Gradient magnitude below threshold
- No improvement for N iterations
- Maximum iterations reached
- Validation metric plateau
Debugging Training Issues
- NaN/Inf loss detection and handling
- Gradient clipping to prevent explosions
- Layer-wise learning rate adaptation
- Mixed precision training
- Gradient accumulation (for large batches)
Distributed Training
- Data parallel training (replicate model, split data)
- Model parallel training (split model across devices)
- Synchronized vs asynchronous updates
- Gradient averaging across devices
- Communication patterns
Training Time Optimization
- GPU/TPU acceleration
- Mixed precision (float16 and float32)
- Gradient checkpointing (recompute activations instead of storing them, trading compute for memory)
- Knowledge distillation
- Quantization-aware training
Training Monitoring
- Loss trajectories
- Learning rate schedules
- Weight/gradient statistics
- Activation histograms
- Training time estimation
10. Evaluation
Assessing model performance on held-out test data and validation metrics.
Key Concepts and Techniques:
Regression Metrics
- Mean Squared Error (MSE): Penalizes larger errors quadratically
- Root Mean Squared Error (RMSE): Interpretable in original scale
- Mean Absolute Error (MAE): Robust to outliers
- Mean Absolute Percentage Error (MAPE): Percentage-based error
- R² Score: Proportion of variance explained (1 is perfect; can be negative for models worse than predicting the mean)
- Explained Variance Score: Similar to R²
- Median Absolute Error: Robust to outliers
- Spearman/Pearson Correlation: Measures monotonic/linear relationships
Binary Classification Metrics
- Accuracy: Proportion of correct predictions
- ( \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} )
- Precision: Of predicted positives, how many are correct
- ( \text{Precision} = \frac{TP}{TP + FP} )
- Recall/Sensitivity: Of actual positives, how many are detected
- ( \text{Recall} = \frac{TP}{TP + FN} )
- F1-Score: Harmonic mean of precision and recall
- ( F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} )
- Specificity: Of actual negatives, how many are correctly rejected
- ( \text{Specificity} = \frac{TN}{TN + FP} )
- ROC Curve & AUC: Trade-off between TPR and FPR across thresholds
- PR Curve & AP: Precision-recall relationship (better for imbalanced data)
- Log Loss: Cross-entropy for probability predictions
- Brier Score: Mean squared difference between predicted probabilities and outcomes
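The binary metrics above follow directly from the four confusion counts; a NumPy sketch with made-up labels:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Confusion counts
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)           # of predicted positives, fraction correct
recall = tp / (tp + fn)              # of actual positives, fraction detected
f1 = 2 * precision * recall / (precision + recall)
```

scikit-learn's `precision_score`, `recall_score`, and `f1_score` compute the same quantities and additionally handle edge cases like zero denominators.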
Multi-class Classification Metrics
- Macro-averaging: Calculate metric for each class, average equally
- Micro-averaging: Calculate metric globally by counting TP/FP/TN/FN
- Weighted-averaging: Account for class imbalance in averaging
- Confusion Matrix: Cross-tabulation of actual vs predicted classes
- Per-class metrics: Precision, recall, F1 for each class
Clustering Metrics
- Silhouette Score: Measures how similar points are to own cluster vs others
- Davies-Bouldin Index: Average similarity ratio of clusters
- Calinski-Harabasz Index: Ratio of between-cluster to within-cluster variance
- Homogeneity, Completeness, V-measure: For labeled clustering evaluation
- Adjusted Rand Index (ARI): Similarity of clustering to ground truth
- Normalized Mutual Information (NMI): Information-theoretic cluster quality
Ranking Metrics
- Mean Reciprocal Rank (MRR): Position of first relevant item
- Mean Average Precision (MAP): Average precision at each rank
- Normalized Discounted Cumulative Gain (NDCG): Relevance-weighted ranking quality
- Click-Through Rate (CTR): For online ranking systems
Anomaly Detection Metrics
- Precision, Recall, F1 (same as classification)
- ROC-AUC for outlier scores
- Contamination rate: Assumed proportion of anomalies (a model setting rather than a metric)
Calibration Metrics
- Calibration curves: Predicted probability vs observed frequency
- Brier score: Measures calibration
- Expected Calibration Error (ECE): Average calibration deviation
- Platt scaling: Post-hoc calibration
Domain-Specific Metrics
- BLEU, ROUGE (for NLP/translation)
- Mean Average Precision (for information retrieval)
- Diversity metrics (for recommendation systems)
- Fairness metrics (group parity, individual fairness)
Cross-Validation Strategies
- K-Fold Cross-Validation: Partition data into k folds, train k times
- Stratified K-Fold: Maintains class distribution in each fold
- Time Series Cross-Validation: Forward-chaining to respect temporal order
- Leave-One-Out Cross-Validation (LOOCV): Each sample is test set once
- Repeated K-Fold: Run K-fold multiple times with different splits
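Stratified K-fold evaluation can be sketched with scikit-learn (synthetic data; the model choice is incidental):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# 5-fold stratified CV: each fold preserves the overall class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_score, std_score = scores.mean(), scores.std()
```

Reporting the fold standard deviation alongside the mean is a cheap way to satisfy the "report confidence intervals or standard deviations" practice below.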
Evaluation Best Practices
- Test data must be completely held out
- No information leakage from test set to training
- Multiple metrics to capture different aspects
- Consider business impact and costs of different error types
- Document evaluation protocol
- Report confidence intervals or standard deviations
- Compare against baselines and state-of-the-art
Error Analysis
- Examine misclassified examples
- Identify common error patterns
- Slice performance by data subgroups
- False positive and false negative analysis
- Confusion matrix interpretation
11. Inference
Deploying the trained model to make predictions on new, unseen data.
Key Concepts and Techniques:
Inference Modes
- Batch Inference: Process multiple samples at once
- Efficient for large-scale processing
- Typical in offline analytics
- Real-time/Online Inference: Single or small number of predictions
- Low-latency requirements
- Typical in production APIs
- Edge Inference: Running models on edge devices
- Mobile phones, IoT devices
- Model compression required
- Stream Inference: Continuous processing of streaming data
Model Deployment Formats
- Model Serialization: Save trained model to disk
- Pickle (Python)
- SavedModel (TensorFlow)
- ONNX (interoperable format)
- H5 (Keras)
- joblib (scikit-learn)
- Containerization: Docker for reproducible deployment
- Versioning: Track model versions and rollbacks
- Configuration files: Store hyperparameters and preprocessing steps
Preprocessing at Inference
- Apply same transformations used during training
- Feature normalization with training statistics (mean, std)
- Feature selection (use same features as training)
- Encoding of categorical variables
- Handling missing values with same strategy
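One common way to guarantee identical preprocessing at inference is to serialize the entire Pipeline, not just the model; a sketch with joblib (the temp path and toy data are illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Bundle preprocessing and model so inference applies the exact same
# transformations (including the training-set mean/std used by the scaler)
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Serialize, then reload as a deployed service would
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(pipe, path)
restored = joblib.load(path)

same = np.array_equal(pipe.predict(X), restored.predict(X))
```

Shipping the scaler inside the pipeline removes a whole class of train-serve skew bugs, since its fitted statistics travel with the model artifact.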
Prediction Output Handling
- Point predictions
- Confidence scores/probabilities
- Uncertainty quantification
- Prediction explanations
Inference Optimization
- Model Compression
- Quantization: Use lower precision (int8 instead of float32)
- Pruning: Remove less important weights
- Distillation: Train smaller model from larger one
- Architecture search for efficient models
- Batching: Group predictions for efficiency
- Caching: Store results for repeated queries
- Hardware Acceleration: GPU/TPU deployment
- Model Serving Frameworks
- TensorFlow Serving
- TorchServe
- Seldon Core
- BentoML
- KServe
Performance Monitoring in Inference
- Latency: Response time per prediction
- Throughput: Predictions per second
- Resource Usage: CPU, memory, GPU utilization
- Model Performance Drift: Monitor metrics over time
- Data Distribution Shift: Detect when input distribution changes
- Concept Drift: Detect when target variable behavior changes
Handling Prediction Uncertainty
- Confidence intervals
- Bayesian predictions with uncertainty
- Monte Carlo Dropout for uncertainty estimation
- Ensemble-based uncertainty
- Calibrated probabilities
Production Considerations
- Error handling and fallbacks
- Monitoring and alerting
- A/B testing new models
- Gradual rollout (canary deployments)
- Model explainability and debugging
- Compliance with regulations (GDPR, fairness)
- Logging predictions for analysis
- Feedback loops for retraining
Serving Patterns
- Single model serving
- Multi-model serving
- Ensemble serving (multiple model outputs combined)
- Cascade/fallback models
- Shadow models (running new model alongside current)
Scalability
- Horizontal scaling: Multiple servers
- Vertical scaling: More powerful hardware
- Load balancing
- Auto-scaling based on demand
Key Relationships Between Components
- Loss Function & Optimization: Loss guides optimization; different losses suit different optimizers
- Regularization & Overfitting: Directly combats overfitting that evaluation detects
- Features & Model Family: Different models benefit from different feature representations
- Hyperparameters & Training: Hyperparameters drastically affect training dynamics
- Evaluation & Inference: Evaluation metrics guide inference performance expectations
- Data Quality & All Stages: Poor data quality impacts every downstream stage
Common Pitfalls to Avoid
- Data Leakage: Test data information flowing into training
- Not Scaling Features: Different scales can affect distance-based algorithms
- Using Same Data for Validation and Testing: Leads to optimistic evaluation
- Ignoring Class Imbalance: Accuracy can be misleading with imbalanced data
- Not Monitoring Inference: Performance can degrade silently in production
- Overfitting to Validation Set: Excessive hyperparameter tuning on validation data
- Not Reproducing Preprocessing in Inference: Train-test mismatch causes issues
- Ignoring Model Interpretability: Black-box models are risky in sensitive domains
- Not Considering Computational Constraints: Beautiful models that are too slow to serve
- Forgetting Baseline Comparisons: Always compare against simple baselines
Tools & Frameworks by Stage
- Data Handling: Pandas, Polars, Apache Spark, Dask
- Feature Engineering: Scikit-learn, Featuretools, tsfresh
- Model Development: Scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM
- Hyperparameter Tuning: Optuna, Ray Tune, Hyperopt, Scikit-optimize
- Evaluation & Monitoring: Scikit-learn, TensorBoard, Wandb, Neptune
- Explainability: SHAP, LIME, Captum, Integrated Gradients
- Model Serving: TensorFlow Serving, TorchServe, BentoML, Ray Serve
- MLOps: MLflow, DVC, Kubeflow, Airflow, Prefect