Feature Selection Methods

Feature selection means choosing the most useful features (columns) from your dataset that best help your model make predictions.

There are five main categories of feature selection methods:

| Category | Uses Statistics? | Uses Model? | Captures Interaction? | Cost |
| --- | --- | --- | --- | --- |
| Filter | ✅ | ❌ | ❌ | 🟢 Fast |
| Wrapper | ❌ | ✅ | ✅ | 🔴 Slow |
| Embedded | ❌ | ✅ | ✅ | 🟡 Balanced |
| Hybrid | ✅ | ✅ | ✅ | 🟠 Moderate |
| Ensemble | Multiple combined | Multiple combined | ✅ | 🟠 Moderate |

Filter Method (Statistical Base)

The Filter Method uses statistical tests between each feature and the target to measure how informative that feature is.

It filters out unhelpful features before model training.

It is called “model-independent” because it does not rely on any ML algorithm, only on the statistics of the data.

How It Works

  1. For each feature Xᵢ:
    • Compute a statistical score showing how related Xᵢ is to the target Y.
  2. Rank features by their score.
  3. Select the top N features or those above a threshold.
| Data Type | Technique | What It Measures |
| --- | --- | --- |
| Continuous–Continuous | Correlation Coefficient (Pearson, Spearman) | Linear or monotonic relationship strength |
| Categorical–Categorical | Chi-Square (χ²) | Independence between variables |
| Continuous–Categorical | ANOVA F-test | Mean differences between groups |
| Any type | Mutual Information (MI) | Shared entropy between feature and label; captures non-linear relationships |

Methods

  • Correlation-based (for numeric features)
  • Chi-Square Test (for categorical data)
  • Mutual Information
  • ANOVA F-Test
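
A minimal sketch of filter-based selection with scikit-learn's `SelectKBest`, using the ANOVA F-test. The dataset here is synthetic (the target is constructed to depend only on the first two features), so the selector should keep those and discard the noise columns:

```python
# Filter-method sketch: score each feature against the target with the
# ANOVA F-test, keep the top k. Data is synthetic for illustration.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                   # 4 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # target depends only on features 0 and 1

selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 2)
print(selector.get_support())  # boolean mask of the top-2 features
```

Swapping `f_classif` for `chi2` (non-negative features) or `mutual_info_classif` applies the other techniques from the table above without changing the rest of the code.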

Limitations

  • Treats each feature independently → ignores feature interactions.
  • Correlation ≠ causation.
  • Not tailored to a specific ML algorithm.

Example: Imagine we have the features [Age, Salary, ZIP Code, Favorite Color] to predict “Loan Default”. A chi-square or correlation test might show that “Favorite Color” has no relationship with the target, so you remove it before training.

Wrapper Method (Model-Based Iteration)

Instead of statistics, the Wrapper Method uses the model’s performance to decide which features are best.

You “wrap” the selection process around the model, repeatedly training and evaluating it on different subsets of features.

How It Works

  1. Choose a base model (e.g., Logistic Regression, Decision Tree).
  2. Use a search strategy:
    • Forward Selection → start with none, keep adding features that improve accuracy.
    • Backward Elimination → start with all, keep removing features that hurt accuracy.
    • RFE (Recursive Feature Elimination) → train the model, drop the least important feature each time.
  3. Stop when performance no longer improves.
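
The RFE variant of the steps above can be sketched with scikit-learn (synthetic data; in practice you would plug in your own model and dataset):

```python
# Wrapper-method sketch: Recursive Feature Elimination repeatedly trains the
# base model and drops the least important feature until 3 remain.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # rank 1 = selected; higher = eliminated earlier
```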

Embedded Method (Integrated in Model)

The Embedded Method performs feature selection while the model is being trained. It “learns” which features matter most as part of the optimization process.

How It Works

Many ML algorithms naturally assign importance to features:

  • Regression coefficients
  • Tree-based split importance
  • Regularization penalties (L1/L2)

Embedded methods penalize complexity to shrink or zero-out unimportant feature weights.

| Algorithm | Technique | How It Selects Features |
| --- | --- | --- |
| Lasso Regression | L1 Regularization | Pushes some coefficients to 0 (removes features) |
| Ridge Regression | L2 Regularization | Shrinks coefficients (reduces importance but not to 0) |
| Elastic Net | L1 + L2 mix | Balances both behaviors |
| Decision Trees / Random Forests / XGBoost | Tree-based splitting | Selects features with high information gain or Gini reduction |
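
A short sketch of the Lasso row: with L1 regularization, the coefficients of uninformative features are driven to exactly zero during training, so selection falls out of the fit itself. The data is synthetic, with only the first two features carrying signal:

```python
# Embedded-method sketch: Lasso (L1) zeroes out coefficients of
# uninformative features as part of training.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)  # only features 0, 1 matter

lasso = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso.coef_)    # indices of surviving (non-zero) features
print(lasso.coef_.round(2))
print(kept)                           # [0 1]
```

Raising `alpha` strengthens the penalty and prunes more aggressively; Ridge (L2) under the same setup would only shrink the noise coefficients toward zero, never exactly to it.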

Hybrid Method (Two-Stage Approach)

Combine multiple methods — usually Filter + Wrapper or Filter + Embedded.

Goal: get the speed of Filter + the accuracy of Wrapper or Embedded.

| Dataset Size | Feature Count | Model Type | Recommended Method |
| --- | --- | --- | --- |
| Small (< 10K samples) | Few (< 50 features) | Linear | Wrapper |
| Large (> 100K samples) | Many features | Any | Filter or Embedded |
| Medium | Moderate | Any | Hybrid |
| Complex / Noisy | Many features | Any | Ensemble |
| Sparse / High-dimensional (text, TF-IDF) | Huge (> 10K features) | Linear/SVM | Filter (Chi-square, MI) |
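
A two-stage hybrid can be sketched by chaining a cheap filter pass with a wrapper pass; the stage sizes (20, then 5) are illustrative choices, not fixed rules:

```python
# Hybrid-method sketch: Stage 1 (Filter) cheaply prunes 50 features to 20,
# then Stage 2 (Wrapper, RFE) refines those 20 down to 5.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Stage 1 (Filter): keep the 20 highest-scoring features by ANOVA F-test.
stage1 = SelectKBest(f_classif, k=20).fit(X, y)
X_filtered = stage1.transform(X)

# Stage 2 (Wrapper): RFE runs only on the 20 survivors, which is far
# cheaper than wrapping all 50 original features.
stage2 = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
stage2.fit(X_filtered, y)
print(stage2.support_.sum())  # 5
```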

Ensemble

Ensemble Feature Selection (EFS) borrows from ensemble learning, where we combine multiple weak learners to produce a stronger, more stable model.

Similarly, EFS combines multiple feature selection outcomes to get a robust, stable, and generalizable set of features.

There are two main forms:

  1. Data-Level Ensembles (Resampling-Based)

    We repeat the feature selection process on different data subsets and then aggregate the results.

    Example:

    1. Use bootstrapped samples of your dataset (like bagging).
    2. Apply any selection method (e.g., Mutual Information or Recursive Feature Elimination) to each sample.
    3. Count how often each feature is selected across all samples.
    4. Keep the features that are selected most frequently.

    Why this helps:

    • Random data variations are averaged out → reduces variance.
    • You get stable feature importance scores that generalize better.
  2. Model-Level Ensembles (Multi-Model Based)

    We apply different models or selectors on the same dataset and aggregate their selected features.

    Example:

    • Run Logistic Regression (Embedded)
    • Run Random Forest (Embedded)
    • Run Mutual Information (Filter)
    • Combine their feature importance scores or selected feature sets.

    Aggregation strategies:

    • Voting / Ranking: Count how often a feature appears in top-K lists.
    • Weighted Averaging: Weight by model accuracy or feature importance magnitude.
    • Consensus Threshold: Select features that appear in at least X% of models.

    Why this helps:
    Different models capture different kinds of feature relationships (linear, nonlinear, tree-based splits, etc.). Combining them improves generalization across model types.
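
A model-level ensemble with a consensus threshold can be sketched as below: three selectors each nominate their top-K features, and we keep any feature picked by at least 2 of the 3. The selectors, K, and the 2-of-3 threshold are illustrative choices:

```python
# Model-level ensemble sketch: a filter (mutual information) and two embedded
# selectors (logistic regression, random forest) vote on features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=1)
K = 4  # each selector nominates its top-4 features

# Selector 1: mutual information scores (Filter)
mi_top = np.argsort(mutual_info_classif(X, y, random_state=1))[-K:]
# Selector 2: absolute logistic-regression coefficients (Embedded)
lr = LogisticRegression(max_iter=1000).fit(X, y)
lr_top = np.argsort(np.abs(lr.coef_[0]))[-K:]
# Selector 3: random-forest importances (Embedded)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
rf_top = np.argsort(rf.feature_importances_)[-K:]

# Consensus threshold: keep features nominated by >= 2 of the 3 selectors.
votes = np.zeros(X.shape[1], dtype=int)
for top in (mi_top, lr_top, rf_top):
    votes[top] += 1
consensus = np.flatnonzero(votes >= 2)
print(consensus)
```

Replacing the vote count with a weighted average of normalized importance scores gives the "Weighted Averaging" strategy from the list above.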

Normalizing Training Sets

Feature Normalization:

The image shows two features (x₁ and x₂) and how normalization of features can be done before training a machine learning model.

  1. Data in the Image:
  • The left part shows data where feature x₁ has a much larger range than x₂. Here, x₁ spans from 0 to 5, while x₂ has a smaller range (e.g., 1 to 3).
  • The goal is to normalize this data, so both features have similar scales, which helps algorithms like gradient descent converge faster.
  2. Subtraction of Mean:
  • Mean Subtraction: The blue text on the bottom left indicates the calculation for normalizing the data by subtracting the mean:

μ = (1/m) Σᵢ x⁽ⁱ⁾,   x := x − μ

where m is the number of data points and x⁽ⁱ⁾ is the value of the feature for the i-th data point.

  • By subtracting the mean μ, we center the data around 0. This helps make the training more stable.

Normalizing the Variance:

  • Variance Normalization: The blue text on the bottom right indicates the formula for normalizing the variance (standardization):

σ² = (1/m) Σᵢ (x⁽ⁱ⁾ − μ)²,   x_norm = (x − μ) / σ

where σ is the standard deviation of the feature and x_norm is the normalized feature. This scales the data to have a mean of 0 and a variance of 1.

  • By normalizing both the mean and the variance, each feature will have a similar scale, making it easier for algorithms to process the data effectively.

Visualizing Normalization:

  • After normalization, the data will be spread out more evenly along the axes of x₁ and x₂, making the dataset more balanced in terms of the features’ ranges.

Why Normalize Data?

  • Improves convergence in gradient descent: When features have very different scales, gradient descent might take longer to converge. Normalization helps in faster convergence by giving each feature equal importance.

  • Prevents bias: Without normalization, features with larger values (e.g., x₁) could dominate the model’s learning process. Normalizing ensures that each feature contributes equally.

  • Improves performance: Some algorithms (like SVMs, K-means, and neural networks) assume that the data is normalized or standardized. This ensures that the algorithm performs better.

Methods of Normalization:

  • Standardization (Z-score normalization): Subtract the mean and divide by the standard deviation for each feature.
  • Min-Max Scaling: Scale the data to a fixed range, usually [0, 1].

Normalization is especially important for models like SVMs, k-NN, and neural networks where the scale of the input features affects model performance.
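
The z-score formulas above can be sketched directly in NumPy (the two-feature training set here is synthetic, with deliberately mismatched scales):

```python
# Standardization sketch: subtract the per-feature mean and divide by the
# per-feature standard deviation, both computed on the training set.
import numpy as np

rng = np.random.default_rng(0)
# Feature 0 has a much larger scale than feature 1, as in the example above.
X_train = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.5], size=(1000, 2))

mu = X_train.mean(axis=0)      # per-feature mean μ
sigma = X_train.std(axis=0)    # per-feature standard deviation σ
X_norm = (X_train - mu) / sigma

print(X_norm.mean(axis=0).round(6))  # ~[0, 0]
print(X_norm.std(axis=0).round(6))   # ~[1, 1]
```

Note that the same μ and σ computed on the training set should also be applied to validation and test data, so all splits share one scale.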

Hashing

Feature hashing, or the hashing trick, is a machine learning technique that converts high-dimensional, categorical data into a fixed-size numerical vector by using a hash function.

How it works

  1. Hashing:  A hash function is applied to each categorical feature (e.g., a word or a user ID). 

  2. Indexing:  The hash function outputs an integer, which is used as an index in a pre-defined, fixed-size vector. 

  3. Updating:  The value at that index in the vector is updated based on the feature. A signed hash function is often used to help mitigate collisions and preserve some information. 

  4. Vector Creation:  The resulting fixed-size vector is then used as input for the machine learning model.
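
The four steps above can be sketched with scikit-learn's `FeatureHasher`, which applies a signed hash to each categorical token and accumulates values into a fixed-size vector (the vector size of 8 is an illustrative choice; real applications use far larger sizes to limit collisions):

```python
# Hashing-trick sketch: variable-length lists of categorical tokens become
# fixed-size vectors, with no vocabulary dictionary kept in memory.
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=8, input_type="string")
docs = [["red", "circle"],
        ["blue", "circle", "circle"]]   # repeated tokens accumulate
X = hasher.transform(docs).toarray()

print(X.shape)  # (2, 8) -- fixed size regardless of vocabulary
```

Because the hash is one-way, the original token names cannot be recovered from the vector, which is the usual trade-off of this technique.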