mlfinlab.clustering.feature_clusters

This module creates clustered subsets of features, as described in the paper Clustered Feature Importance (Presentation Slides) by Dr. Marcos Lopez de Prado (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3517595). The approach is also explained in the book Machine Learning for Asset Managers, Snippet 6.5.2, page 84.

Module Contents

Functions

get_feature_clusters(→ list)

Machine Learning for Asset Managers

get_feature_clusters(X: pandas.DataFrame, dependence_metric: str, distance_metric: str = None, linkage_method: str = None, n_clusters: int = None, check_silhouette_scores: bool = True, critical_threshold: float = 0.0) → list

Machine Learning for Asset Managers, Snippet 6.5.2.1, page 85. Step 1: Features Clustering.

Gets clustered feature subsets from the given set of features.

Parameters:
  • X – (pd.DataFrame) Dataframe of features.

  • dependence_metric – (str) Method to be used for generating the dependence_matrix: ‘linear’, ‘information_variation’, ‘mutual_information’, or ‘distance_correlation’.

  • distance_metric – (str) The distance operator to be used for generating the distance matrix. Available methods: ‘angular’, ‘squared_angular’, ‘absolute_angular’. Set it to None if the feature clusters are to be generated as they are by the ONC algorithm.

  • linkage_method – (str) Method of linkage to be used for clustering. Methods include: ‘single’, ‘ward’, ‘complete’, ‘average’, ‘weighted’, and ‘centroid’. Set it to None if the feature clusters are to be generated as they are by the ONC algorithm.

  • n_clusters – (int) Number of clusters to form. Must be less than the total number of features. If None, the optimal number of clusters is determined by the ONC algorithm.

  • check_silhouette_scores – (bool) Flag to check whether X contains features with low silhouette scores and, if so, modify it.

  • critical_threshold – (float) Threshold for determining a low silhouette score in the dataset. It can be any real number in [-1, +1]. The default is 0, which means any feature with a silhouette score below 0 is identified as having a low silhouette score, and the required transformation is applied to correct it.
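To make the role of critical_threshold concrete, the following is a minimal sketch of how per-feature silhouette scores can be computed from a precomputed distance matrix and compared against a threshold. It is not the library's implementation: the helper silhouette_scores, the toy distance matrix, and the labels are all hypothetical, and the corrective transformation applied to flagged features is not shown because it is not described here.

```python
def silhouette_scores(dist, labels):
    """Per-item silhouette score from a precomputed distance matrix.

    Hypothetical helper for illustration; assumes at least two clusters.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy example: five features; feature 2 is placed in cluster 0
# although it sits much closer to the members of cluster 1.
dist = [
    [0.0, 0.1, 0.8, 0.9, 0.9],
    [0.1, 0.0, 0.8, 0.9, 0.9],
    [0.8, 0.8, 0.0, 0.2, 0.2],
    [0.9, 0.9, 0.2, 0.0, 0.1],
    [0.9, 0.9, 0.2, 0.1, 0.0],
]
labels = [0, 0, 0, 1, 1]

critical_threshold = 0.0  # same default as in the signature above
scores = silhouette_scores(dist, labels)
low_silhouette = [i for i, s in enumerate(scores) if s < critical_threshold]
print(low_silhouette)
```

In this toy setup only the misplaced feature falls below the threshold. Raising critical_threshold toward +1 would flag more features, and lowering it toward -1 would flag fewer.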

Returns:

(list) Feature subsets.
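The distance_metric options can be illustrated with the correlation-based distances commonly used in Lopez de Prado's work. The sketch below assumes those textbook definitions; whether the library applies exactly these formulas for each option name is an assumption, and the helpers pearson and correlation_distance are hypothetical names introduced only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def correlation_distance(rho, distance_metric="angular"):
    """Map a correlation in [-1, 1] to a distance in [0, 1].

    Assumed textbook definitions, not taken from the library source.
    """
    rho = max(-1.0, min(1.0, rho))  # guard against rounding outside [-1, 1]
    if distance_metric == "angular":
        return math.sqrt(0.5 * (1.0 - rho))
    if distance_metric == "absolute_angular":
        return math.sqrt(1.0 - abs(rho))
    if distance_metric == "squared_angular":
        return math.sqrt(1.0 - rho ** 2)
    raise ValueError(f"Unknown distance_metric: {distance_metric!r}")


# Two perfectly anti-correlated toy features.
f1 = [1.0, 2.0, 3.0, 4.0]
f2 = [-1.0, -2.0, -3.0, -4.0]
rho = pearson(f1, f2)

for name in ("angular", "absolute_angular", "squared_angular"):
    print(name, correlation_distance(rho, name))
```

Under these definitions, the angular distance treats anti-correlated features as maximally distant, while the absolute and squared variants treat them as identical. That difference matters when strongly negatively correlated features should end up in the same cluster.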