Optimal Number of Clusters (ONC)
The ONC algorithm detects the optimal number of K-Means clusters using a correlation matrix as input.
Clustering is a process of grouping a set of elements where elements within a group (cluster) are more similar than elements from different groups. A popular clustering algorithm is the K-means algorithm that guarantees the convergence in a finite number of steps.
The K-means algorithm has two caveats. It requires the user to set the number of clusters in advance and the initialization of clusters is random. Consequently, the effectiveness of the algorithm is random. The ONC algorithm proposed by Marcos Lopez de Prado addresses these two issues.
Note
Underlying Literature
The following sources elaborate extensively on the topic:
-
Detection of false investment strategies using unsupervised learning methods by Marcos Lopez de Prado and Lewis, M.J. Describes the ONC algorithm in detail. The code in this module is based on the code written by the researchers.
-
Machine Learning for Asset Managers by Marcos Lopez de Prado. Features additional descriptions of the algorithm and includes exercises to understand the topic in more detail.
-
Clustering (Presentation Slides) by Marcos Lopez de Prado and Lewis, M.J. Briefly describes the logic behind the ONC algorithm.
-
Codependence (Presentation Slides) by Marcos Lopez de Prado and Lewis, M.J. Explains why the angular distance metric is used to get distances between elements.
Based on the Detection of false investment strategies using unsupervised learning methods paper, this is how the distances and scores are calculated:
Distances between the elements in the ONC algorithm are calculated using the same angular distance used in the HRP algorithm:
where \(\rho_{i,j}\) is the correlation between elements \(i\) and \(j\) .
Distances between distances in the clustering algorithm are calculated as:
Silhouette scores are calculated as:
where \(a_i\) the average distance between element \(i\) and all other components in the same cluster, and \(b_i\) is the smallest average distance between \(i\) and all the nodes in any other cluster.
The measure of clustering quality \(q\) or \(t-score\):
where \(E[\{S_i\}]\) is the mean of the silhouette scores for each cluster, and \(V[\{S_i\}]\) is the variance of the silhouette scores for each cluster.
The ONC algorithm structure is described in the work Detection of false investment strategies using unsupervised learning methods using the following diagrams:
In the base clustering stage first, the distances between the elements are calculated, then the algorithm iterates through a set of possible number of clusters \(N\) times to find the best clustering from \(N\) runs of K-means for every possible number of clusters. For each iteration, a clustering result is evaluated using the t-statistic of the silhouette scores.
The clustering result with the best t-statistic is picked, the correlation matrix is reordered so that clustered elements are positioned close to each other.
On a higher level, the average t-score of the clusters from the base clustering stage is calculated. If more than two clusters have a t-score below average, these clusters go through the base clustering stage again. This process is recursively repeated.
Then, based on the t-statistic of the old and new clusterings it is checked whether the new clustering is better than the original one. If not, the old clustering is kept, otherwise, the new one is taken.
The output of the algorithm is the reordered correlation matrix, clustering result, and silhouette scores.
Implementation
- get_onc_clusters(corr_mat: DataFrame, repeat: int = 10) DataFrame | dict | Series
-
Optimal Number of Clusters (ONC) algorithm described in the following paper: Marcos Lopez de Prado, Michael J. Lewis, Detection of False Investment Strategies Using Unsupervised Learning Methods, 2015; The code is based on the code provided by the authors of the paper.
The algorithm searches for the optimal number of clusters using the correlation matrix of elements as an input.
The correlation matrix is transformed to a matrix of distances, the K-Means algorithm is applied multiple times with a different number of clusters to use. The results are evaluated on the t-statistics of the silhouette scores.
The output of the algorithm is the reordered correlation matrix (clustered elements are placed close to each other), optimal clustering, and silhouette scores.
- Parameters:
-
-
corr_mat – (pd.DataFrame) Correlation matrix of features.
-
repeat – (int) Number of clustering algorithm repetitions
-
- Returns:
-
(tuple) [correlation matrix, optimized clusters, silh scores].
Example
An example showing how the ONC algorithm is used can be seen below:
>>> # Import packages
>>> import pandas as pd
>>> # Import MlFinLab tools
>>> from mlfinlab.clustering import onc
>>> # Import dataframe of returns for assets
>>> asset_returns = pd.read_csv(
... "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/stock_prices.csv",
... index_col="Date",
... parse_dates=True,
... )
>>> # Calculate correlation matrix of returns
>>> assets_corr = asset_returns.corr()
>>> # Output of the ONC algorithm with 3 simulations for each number of clusters tested
>>> assets_corr_onc, clusters, silh_scores = onc.get_onc_clusters(assets_corr, repeat=3)