Feature Clustering
This module implements the clustering of features to generate a feature subset, as described in the book Machine Learning for Asset Managers (Snippet 6.5.2.1, page 85). These subsets can then be used to compute Clustered Feature Importance by passing them to the clustered_subsets argument of the Mean Decreased Impurity (MDI) and Mean Decreased Accuracy (MDA) algorithms.
The algorithm projects the observed features into a metric space by applying a dependence metric function, either correlation-based or information-theory-based (see the Codependence section). Information-theoretic metrics have the advantage of recognizing redundant features that are the result of nonlinear combinations of informative features.
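As a quick illustration of that advantage, the following sketch (not part of the module; it assumes scikit-learn's mutual_info_regression is available) shows a purely nonlinear relation that a linear correlation metric misses but an information-theoretic one captures:

>>> import numpy as np
>>> from sklearn.feature_selection import mutual_info_regression
>>> rng = np.random.default_rng(42)
>>> x = rng.normal(size=10_000)
>>> y = x ** 2  # deterministic but nonlinear function of x
>>> lin_corr = np.corrcoef(x, y)[0, 1]  # close to zero: a linear metric misses the link
>>> mi = mutual_info_regression(x.reshape(-1, 1), y)[0]  # clearly positive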
Next, we need to determine the optimal number of clusters. The user can either specify the number of clusters, in which case hierarchical clustering is applied to the distance matrix derived from the dependence matrix using the chosen linkage method, or leave the choice to the ONC algorithm, which uses K-Means clustering to automate this task.
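The following is a minimal sketch of the hierarchical route, not the library's internal code: it assumes a linear (correlation) dependence metric, the angular distance transform, and SciPy's linkage/fcluster for the clustering step.

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.cluster.hierarchy import fcluster, linkage
>>> from scipy.spatial.distance import squareform
>>> rng = np.random.default_rng(0)
>>> features = pd.DataFrame(rng.normal(size=(500, 10)),
...                         columns=[f"f_{i}" for i in range(10)])
>>> corr = features.corr()  # dependence matrix (linear metric)
>>> dist = np.sqrt(0.5 * (1.0 - corr))  # angular distance
>>> link = linkage(squareform(dist.values, checks=False), method="single")
>>> labels = fcluster(link, t=4, criterion="maxclust")  # cut the tree into 4 clusters
>>> clusters = [list(corr.columns[labels == k]) for k in np.unique(labels)]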
The caveat of this process is that some silhouette scores may be low due to one feature being a combination of multiple features across clusters. This is a problem, because ONC cannot assign one feature to multiple clusters. Hence, the following transformation may help reduce the multicollinearity of the system:
For each cluster \(k = 1, \dots, K\), replace the features included in that cluster with residual features, so that they do not contain any information outside cluster \(k\). That is, let \(D_{k}\) be the subset of index features \(D = \{1,...,F\}\) included in cluster \(k\), where:

\[D_{k} \subset \{1,...,F\}, \quad \lVert D_{k} \rVert > 0 \ \forall k; \quad D_{k} \cap D_{l} = \emptyset \ \forall k \neq l; \quad \bigcup_{k} D_{k} = \{1,...,F\}\]
Then, for a given feature \(X_{i}\) where \(i \in D_{k}\), we compute the residual feature \(\hat{\varepsilon}_{i}\) by fitting the following regression:

\[X_{n,i} = \alpha_{i} + \sum_{j \in \bigcup_{l<k} D_{l}} \beta_{i,j} X_{n,j} + \varepsilon_{n,i}\]
where \(n = 1,\dots,N\) is the index of observations per feature. Note that if the number of degrees of freedom in the above regression is too low, one option is to use as regressors linear combinations of the features within each cluster, following a minimum-variance weighting scheme, so that only \(K-1\) betas need to be estimated. This transformation is not necessary if the silhouette scores clearly indicate that features belong to their respective clusters.
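A minimal sketch of this residualization, assuming the clusters are given as lists of column names and that plain OLS has enough degrees of freedom (an illustration, not the library's implementation):

>>> import numpy as np
>>> def residual_features(features, clusters):
...     """Replace each feature with its OLS residual on all earlier clusters."""
...     res = features.copy()
...     prior = []  # union of D_l for l < k
...     for cluster in clusters:
...         if prior:  # the first cluster has nothing to regress out
...             regressors = np.column_stack([np.ones(len(features)),
...                                           features[prior].values])
...             for col in cluster:
...                 beta, *_ = np.linalg.lstsq(regressors, features[col].values,
...                                            rcond=None)
...                 res[col] = features[col].values - regressors @ beta
...         prior.extend(cluster)
...     return res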
Note
Underlying Literature
The following sources describe this method in more detail:
- Machine Learning for Asset Managers by Marcos Lopez de Prado.
- Clustered Feature Importance (Presentation Slides) by Marcos Lopez de Prado.
Implementation
This module creates clustered subsets of features described in the presentation slides: Clustered Feature Importance by Marcos Lopez de Prado.
- get_feature_clusters(X: DataFrame, dependence_metric: str, distance_metric: str | None = None, linkage_method: str | None = None, n_clusters: int | None = None, check_silhouette_scores: bool = True, critical_threshold: float = 0.0) list
Machine Learning for Asset Managers, Snippet 6.5.2.1, page 85. Step 1: Features Clustering.
Gets clustered feature subsets from the given set of features.
- Parameters:
  - X – (pd.DataFrame) Dataframe of features.
  - dependence_metric – (str) Method used for generating the dependence matrix: ‘linear’, ‘information_variation’, ‘mutual_information’, or ‘distance_correlation’.
  - distance_metric – (str) Distance operator used for generating the distance matrix: ‘angular’, ‘squared_angular’, or ‘absolute_angular’. Set to None if the feature clusters are to be generated by the ONC algorithm.
  - linkage_method – (str) Method of linkage used for clustering: ‘single’, ‘ward’, ‘complete’, ‘average’, ‘weighted’, or ‘centroid’. Set to None if the feature clusters are to be generated by the ONC algorithm.
  - n_clusters – (int) Number of clusters to form. Must be less than the total number of features. If None, the optimal number of clusters is determined by the ONC algorithm.
  - check_silhouette_scores – (bool) Flag to check whether X contains features with low silhouette scores and to modify it accordingly.
  - critical_threshold – (float) Threshold for determining a low silhouette score in the dataset. It can be any real number in [-1, +1]; the default is 0, meaning any feature with a silhouette score below 0 is identified as having a low silhouette score and the required transformation is applied to correct it.
- Returns:
  - (list) Feature subsets.
Example
An example showing how to generate feature subsets or clusters for a given feature DataFrame. The example generates 4 clusters by hierarchical clustering for the given specification.
>>> # Import Packages
>>> import pandas as pd
>>> # Import MlFinLab tools
>>> from mlfinlab.clustering.feature_clusters import get_feature_clusters
>>> from mlfinlab.util.generate_dataset import generate_classification_dataset
>>> # Generate toy dataset
>>> X, y = generate_classification_dataset(
... n_features=40, n_informative=5, n_redundant=30, n_samples=1000, sigma=0.1
... )
>>> feat_subs = get_feature_clusters(
... X,
... dependence_metric="information_variation",
... distance_metric="angular",
... linkage_method="single",
... n_clusters=4,
... )
N...
>>> feat_subs
[...]
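Alternatively, the number of clusters can be left to the ONC algorithm by setting the distance, linkage, and cluster-count arguments to None, and the resulting subsets can then be passed to the feature importance functions via the clustered_subsets argument. The snippet below is a hedged sketch: the MDI import path, the fitted classifier clf, and the exact signature are assumptions that may differ across mlfinlab versions.

>>> # Let the ONC algorithm decide the number of clusters
>>> onc_subsets = get_feature_clusters(
...     X,
...     dependence_metric="linear",
...     distance_metric=None,
...     linkage_method=None,
...     n_clusters=None,
... )
>>> # Hypothetical follow-up: Clustered MDI (check your version's exact signature)
>>> from mlfinlab.feature_importance import mean_decrease_impurity
>>> clustered_mdi = mean_decrease_impurity(
...     clf, X.columns, clustered_subsets=onc_subsets
... )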
Research Notebook
For a better understanding of its implementation, see the notebook on Clustered Feature Importance.
Clustered Feature Importance