Essential Feature Selection Techniques in Scikit-Learn

Introduction to Feature Selection

Feature selection plays a vital role in improving model performance by identifying and retaining only the most relevant features. While having more features can initially enhance a model's predictive capabilities, an excessive number can lead to the "Curse of Dimensionality," where the model's performance deteriorates once a certain threshold is crossed. Therefore, selecting only the most predictive features is essential.

It’s important to note that feature selection differs from dimensionality reduction. While both aim to reduce the dataset's complexity, feature selection focuses on choosing which features to keep, whereas dimensionality reduction transforms the data into a new set of input features. For more insights on dimensionality reduction, check out my other articles.

Understanding Feature Selection Methods

In this section, we will explore five effective feature selection techniques available in Scikit-Learn. These methods are user-friendly and highly beneficial for refining your models.

1. Variance Threshold Feature Selection

Variance Thresholding is a straightforward method that discards features whose variance falls below a given threshold. Features that barely vary carry little information and are unlikely to contribute much to the model's predictive power. Because the method never looks at the target variable, it can also be used in unsupervised settings.

For instance, consider a dataset where we want to apply this method:

import pandas as pd
import seaborn as sns

# Load the mpg dataset, keep only the numerical columns, and drop rows with missing values
mpg = sns.load_dataset('mpg').select_dtypes('number').dropna()
mpg.head()

Example of a dataset with numerical features

Next, we can use the VarianceThreshold class to eliminate low-variance features:

from sklearn.feature_selection import VarianceThreshold

# Keep only the features whose variance exceeds 1
selector = VarianceThreshold(threshold=1)
selector.fit(mpg)
mpg.columns[selector.get_support()]

Selected features after applying Variance Threshold
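
One caveat: variance depends on each feature's scale, so columns measured in large units will dominate the threshold. Below is a minimal sketch of one workaround, dividing each column by its mean before thresholding (the normalization and the 0.01 cutoff are illustrative assumptions, not part of the example above):

# Variance is scale-dependent, so put features on a comparable scale before thresholding
mpg_normalized = mpg / mpg.mean()
selector = VarianceThreshold(threshold=0.01)
selector.fit(mpg_normalized)
mpg_normalized.columns[selector.get_support()]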

2. Univariate Feature Selection with SelectKBest

This method uses univariate statistical tests to score each feature's relationship with the target variable. The SelectKBest class keeps the K features with the highest scores and discards the rest.

To illustrate, we can use the following code:

# Prepare the data: numerical features only, missing values dropped
mpg = sns.load_dataset('mpg').select_dtypes('number').dropna()
X = mpg.drop('mpg', axis=1)
y = mpg['mpg']

from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Keep the two features with the highest mutual information with the target
selector = SelectKBest(mutual_info_regression, k=2)
selector.fit(X, y)
X.columns[selector.get_support()]

Top features selected using SelectKBest
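
SelectKBest accepts other scoring functions as well; for instance, f_regression scores each feature by its linear relationship with the target (this variation is illustrative, not part of the example above):

from sklearn.feature_selection import f_regression

# Score features by linear dependence on the target instead of mutual information
selector = SelectKBest(f_regression, k=2)
selector.fit(X, y)
X.columns[selector.get_support()]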

3. Recursive Feature Elimination (RFE)

RFE is a feature selection technique that repeatedly fits the chosen model, ranks the features by its coefficients or feature importances, and removes the least important ones. It continues until the desired number of features remains.

Here’s how to implement RFE using a logistic regression model:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Prepare the Titanic data: the target ('survived') plus a handful of numerical features
titanic = sns.load_dataset('titanic')[['survived', 'pclass', 'age', 'parch', 'sibsp', 'fare']].dropna()
X = titanic.drop('survived', axis=1)
y = titanic['survived']

# Recursively eliminate one feature per step until two remain
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=2, step=1)
rfe_selector.fit(X, y)
X.columns[rfe_selector.get_support()]

Features selected using Recursive Feature Elimination
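
Beyond the boolean support mask, the fitted RFE object also exposes a ranking_ attribute that records the elimination order, which can be handy for inspection (shown here as an illustrative extra step):

# Rank of each feature: 1 means kept; larger numbers were eliminated earlier
pd.Series(rfe_selector.ranking_, index=X.columns)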

4. Feature Selection via SelectFromModel

Like RFE, SelectFromModel relies on the importance attributes of a fitted estimator, such as coef_ or feature_importances_. Unlike RFE, it fits the estimator once and retains every feature whose importance exceeds a threshold (by default, the mean importance).

Example:

from sklearn.feature_selection import SelectFromModel

# Keep features whose coefficient magnitude exceeds the default threshold
sfm_selector = SelectFromModel(estimator=LogisticRegression())
sfm_selector.fit(X, y)
X.columns[sfm_selector.get_support()]

Features selected using SelectFromModel
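
SelectFromModel also accepts an explicit threshold and works with any estimator that exposes feature importances. Here is a sketch using a random forest with the median importance as the cutoff (the estimator and threshold are illustrative choices, not from the example above):

from sklearn.ensemble import RandomForestClassifier

# Keep features whose importance is at or above the median importance
sfm_selector = SelectFromModel(estimator=RandomForestClassifier(random_state=0), threshold='median')
sfm_selector.fit(X, y)
X.columns[sfm_selector.get_support()]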

5. Sequential Feature Selection (SFS)

Introduced in Scikit-Learn 0.24, SFS is a greedy algorithm that adds (forward) or removes (backward) one feature at a time, choosing at each step the feature whose addition or removal yields the best cross-validation score, until the requested number of features remains.

Here’s how to perform backward SFS:

from sklearn.feature_selection import SequentialFeatureSelector

# Backward selection: start with all features and drop the weakest one at each step
sfs_selector = SequentialFeatureSelector(estimator=LogisticRegression(), n_features_to_select=3, cv=10, direction='backward')
sfs_selector.fit(X, y)
X.columns[sfs_selector.get_support()]

Features selected using Sequential Feature Selection
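
Forward selection works the same way, starting from an empty feature set and adding one feature at a time. Either direction can be slow, because every step refits the estimator once per remaining candidate feature and per cross-validation fold (this forward variant is illustrative):

# Forward selection: start with no features and add the best one at each step
sfs_forward = SequentialFeatureSelector(estimator=LogisticRegression(), n_features_to_select=3, cv=10, direction='forward')
sfs_forward.fit(X, y)
X.columns[sfs_forward.get_support()]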

Conclusion

Feature selection is crucial in machine learning to ensure that only the most relevant features are utilized, thereby enhancing model accuracy and reducing complexity. In this article, we've covered five essential techniques available in Scikit-Learn: Variance Threshold, Univariate Selection with SelectKBest, Recursive Feature Elimination, SelectFromModel, and Sequential Feature Selection.

I hope you find these methods helpful in your projects! Feel free to connect with me on social media.
