Clustered MDA and MDI

Clustered Feature Importance was introduced in the book Machine Learning for Asset Managers as an approach to deal with substitution effects. It clusters similar features and applies feature importance analysis (such as MDA and MDI) at the cluster level. Because the clusters are mutually dissimilar, measuring importance per cluster rather than per feature reduces substitution effects.

It can be implemented in two steps as described in the book:

Features Clustering: As a first step, we need to generate the clusters, or subsets, of features that we want to analyse with feature importance methods. This can be done using the feature cluster module, which implements the method of generating feature clusters described in the book.
Clustered Importance: Once the number and composition of the feature clusters have been identified, we can use this information to apply MDI and MDA to groups of similar features rather than to individual features. Clustered Feature Importance can be implemented by simply passing the feature clusters obtained in step 1 to the clustered_subsets argument of the MDI or MDA feature importance algorithm.
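To make the first step more concrete, here is a minimal sketch of grouping features by correlation. It is not the book's algorithm and not the feature cluster module: it assumes a plain NumPy feature matrix and swaps in a simple distance threshold with connected components for the clustering itself. The helper names `correlation_distance` and `threshold_clusters`, the `max_dist` value and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np


def correlation_distance(X):
    """Turn feature correlations into the distance sqrt(0.5 * (1 - rho))."""
    rho = np.corrcoef(X, rowvar=False)
    return np.sqrt(np.clip(0.5 * (1.0 - rho), 0.0, None))


def threshold_clusters(dist, max_dist):
    """Group features linked by distances below max_dist (connected components).

    Illustrative stand-in for a proper clustering algorithm.
    """
    unvisited = set(range(dist.shape[0]))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        stack, members = [seed], [seed]
        while stack:
            i = stack.pop()
            close = [j for j in sorted(unvisited) if dist[i, j] < max_dist]
            for j in close:
                unvisited.remove(j)
                stack.append(j)
                members.append(j)
        clusters.append(sorted(members))
    return clusters


# Synthetic data (assumption): two independent signals with noisy copies.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Columns 0-2 are noisy copies of the first signal, columns 3-4 of the second.
X = np.column_stack(
    [base[:, 0] + 0.1 * rng.normal(size=500) for _ in range(3)]
    + [base[:, 1] + 0.1 * rng.normal(size=500) for _ in range(2)]
)

feature_clusters = threshold_clusters(correlation_distance(X), max_dist=0.3)
print(feature_clusters)
```

The result is a list of lists of column positions, one inner list per cluster. Whether clustered_subsets expects positions or column names should be checked against the library before passing such a structure to it.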

How Clustered Feature Importance can be applied:

Clustered MDI (code Snippet 6.4, page 86): We compute the clustered MDI as the sum of the MDI values of the features that constitute that cluster. If there is one feature per cluster, then MDI and clustered MDI are the same.
Clustered MDA (code Snippet 6.5, page 87): An extension of standard MDA that tackles multicollinearity and (linear or non-linear) substitution effects. Instead of shuffling one feature at a time, it shuffles all the features of a cluster at once. Its implementation was also discussed by Dr. Marcos Lopez de Prado in the Clustered Feature Importance (Presentation Slides).
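The two ideas above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the book's snippets or the library's implementation: the per-feature MDI values are made-up numbers, the model is a least-squares linear scorer used as a stand-in for a fitted classifier, and importance is reported as the raw drop in out-of-sample accuracy rather than a normalised change in log loss. All function names (`clustered_mdi`, `clustered_mda`, `fit_linear`, `accuracy`) are hypothetical.

```python
import numpy as np


def clustered_mdi(feature_mdi, clusters):
    """Clustered MDI: sum the per-feature MDI values inside each cluster."""
    return {
        name: sum(feature_mdi[f] for f in members)
        for name, members in clusters.items()
    }


def fit_linear(X, y):
    """Least-squares linear scorer (stand-in for any classifier)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w


def accuracy(w, X, y):
    """Share of samples whose predicted class matches y."""
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ w > 0).astype(int) == y))


def clustered_mda(X, y, clusters, n_splits=2, seed=0):
    """Mean drop in out-of-sample accuracy when a whole cluster is shuffled."""
    rng = np.random.default_rng(seed)
    all_idx = np.arange(len(X))
    drops = {name: [] for name in clusters}
    for test_idx in np.array_split(all_idx, n_splits):
        train_idx = np.setdiff1d(all_idx, test_idx)
        w = fit_linear(X[train_idx], y[train_idx])
        base = accuracy(w, X[test_idx], y[test_idx])
        for name, cols in clusters.items():
            X_perm = X[test_idx].copy()
            # Shuffle every column of the cluster, so that no member
            # can substitute for another shuffled member.
            for c in cols:
                X_perm[:, c] = rng.permutation(X_perm[:, c])
            drops[name].append(base - accuracy(w, X_perm, y[test_idx]))
    return {name: float(np.mean(v)) for name, v in drops.items()}


# Two clusters of column positions (assumption: C_0 informative, C_1 noise).
clusters = {"C_0": [0, 1], "C_1": [2, 3]}

# Made-up per-feature MDI values, used only to show the summation.
mdi = clustered_mdi([0.40, 0.35, 0.15, 0.10], clusters)

# Synthetic data: columns 0-1 are noisy copies of one signal, 2-3 pure noise.
rng = np.random.default_rng(1)
signal = rng.normal(size=1000)
X = np.column_stack([
    signal + 0.1 * rng.normal(size=1000),
    signal + 0.1 * rng.normal(size=1000),
    rng.normal(size=1000),
    rng.normal(size=1000),
])
y = (signal > 0).astype(int)

mda = clustered_mda(X, y, clusters)
print(mdi)
print(mda)
```

Shuffling only one of the two informative columns would leave its near-duplicate intact and understate the importance of the signal. Shuffling the cluster as a whole is what removes that substitution.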

Note

The implementation of Clustered Feature Importance is included in the functions for MDI and MDA, through their clustered_subsets argument.

Example

>>> # Import packages
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.metrics import accuracy_score, log_loss
>>> from sklearn.model_selection import KFold
>>> # Import MlFinLab tools
>>> from mlfinlab.util.generate_dataset import generate_classification_dataset
>>> from mlfinlab.feature_importance.importance import (
...     mean_decrease_impurity,
...     mean_decrease_accuracy,
...     plot_feature_importance,
... )
>>> from mlfinlab.cross_validation.cross_validation import ml_cross_val_score
>>> from mlfinlab.clustering.feature_clusters import get_feature_clusters
>>> # Create Clusters
>>> X, y = generate_classification_dataset(
...     n_features=40, n_informative=5, n_redundant=30, n_samples=1000, sigma=0.1
... )
>>> feature_clusters = get_feature_clusters(
...     X, dependence_metric="linear", n_clusters=None
... )  
N...
>>> # Fit model
>>> clf_base = DecisionTreeClassifier(
...     criterion="entropy",
...     max_features=1,
...     class_weight="balanced",
...     min_weight_fraction_leaf=0,
... )
>>> clf = BaggingClassifier(
...     base_estimator=clf_base,  # named `estimator` in scikit-learn >= 1.2
...     n_estimators=1000,
...     max_features=1.0,
...     max_samples=1.0,
...     oob_score=True,
... )
>>> clf = clf.fit(X, y)
>>> # Score model
>>> cv_gen = KFold(n_splits=2)
>>> oos_score = ml_cross_val_score(
...     clf,
...     X,
...     y,
...     cv_gen=cv_gen,
...     sample_weight_train=None,
...     scoring=accuracy_score,
...     require_proba=False,
... ).mean()
>>> # Feature Importance
>>> clustered_mdi = mean_decrease_impurity(
...     clf, X.columns, clustered_subsets=feature_clusters
... )
>>> clustered_mda = mean_decrease_accuracy(
...     clf, X, y, cv_gen, clustered_subsets=feature_clusters, scoring=log_loss
... )
>>> # Plot
>>> plot_feature_importance(
...     clustered_mdi,
...     oob_score=clf.oob_score_,
...     oos_score=oos_score,
...     save_fig=True,
...     output_path="clustered_mdi.png",
... )  
<Figure...>
>>> plot_feature_importance(
...     clustered_mda,
...     oob_score=clf.oob_score_,
...     oos_score=oos_score,
...     save_fig=True,
...     output_path="clustered_mda.png",
... )  
<Figure...>

                      

The following are the resulting images from the Clustered MDI & Clustered MDA feature importances respectively:

Research Notebook

The following research notebook can be used to better understand Clustered Feature Importance and its implementation.

Clustered Feature Importance

Presentation Slides

Note

These slides are a collection of lectures, so you need to do a bit of scrolling to find the correct sections.

pg 19-29: Feature Importance + Clustered Feature Importance.

pg 109: Feature Importance Analysis

pg 131: Feature Selection

pg 141-173: Clustered Feature Importance

pg 176-198: Shapley Values
