Feature selection is a very important step in machine learning. In this step we select the features that improve the model score and drop the features responsible for reducing it.
There are several techniques for feature selection:
- Use features that have a correlation (either positive or negative) with the target (see the sketch after this list)
- Train a model and select the features with high importance
- Use dedicated feature selection algorithms
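As a minimal sketch of the correlation-based approach, assuming a pandas DataFrame and an arbitrary illustrative threshold of 0.2 (not a standard value):

import pandas as pd
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df["target"] = y

# Absolute Pearson correlation of each feature with the target
corr = df.corr()["target"].drop("target").abs()

# Keep features whose absolute correlation exceeds the illustrative threshold
selected = corr[corr > 0.2].index.tolist()
print(selected)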
RFECV
Feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
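A quick sketch of plain RFE, assuming we already know we want to keep 5 features (RFECV, shown next, removes the need to pick this number by hand):

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Drop the least important feature at each iteration until 5 remain
rfe = RFE(SVR(kernel="linear"), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_)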
RFECV performs RFE in a cross-validation loop to find the optimal number of features.
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
Create a friedman1 dataset for regression. In this dataset the inputs X are independent features uniformly distributed on the interval [0, 1]. The output y is computed using only the first 5 features; the remaining features are independent of y.
So the feature selection algorithm should tell us the same, i.e. only the first five features are important.
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)
The mask of selected features
selector.support_
array([ True, True, True, True, True, False, False, False, False,
False])
Ranking of the features; rank 1 corresponds to the selected (best) features
selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
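Once fitted, the selector can also report how many features it kept and reduce X to just those columns; the values shown below follow from the mask above (5 selected features out of 10, with 50 samples).

The number of features selected by cross-validation

selector.n_features_

5

Reduce X to only the selected columns

selector.transform(X).shape

(50, 5)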