Hierarchical Clustering
This module implements hierarchical clustering algorithms. Clustering groups similar data points into sets called clusters. Similarity can be measured in multiple ways, such as Pearson's correlation, Spearman rank correlation, or Euclidean distance.
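As a quick illustration of these similarity measures, the sketch below computes all three for two toy series with scipy. The series themselves are made-up data, not from this module.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from scipy.spatial.distance import euclidean

# Two toy series (hypothetical data, for illustration only)
x = np.array([0.1, 0.3, 0.2, 0.5, 0.4])
y = np.array([0.2, 0.4, 0.1, 0.6, 0.5])

pearson_corr, _ = pearsonr(x, y)    # linear similarity in [-1, 1]
spearman_corr, _ = spearmanr(x, y)  # rank-based similarity in [-1, 1]
dist = euclidean(x, y)              # Euclidean distance (smaller = more similar)

print(pearson_corr, spearman_corr, dist)
```

Note that correlations measure similarity (higher is more similar), while Euclidean distance measures dissimilarity; clustering code typically converts one into the other.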
Hierarchical clustering is a technique that arranges a set of nested clusters as a tree. It can be agglomerative or divisive. Agglomerative hierarchical clustering iteratively merges the most similar clusters into larger ones. Divisive hierarchical clustering works in the opposite direction: at each iteration, larger clusters are split into smaller, dissimilar clusters. A dendrogram is a common way to visualize the resulting nested clusters. Hierarchical clustering is useful for discerning similar properties in datasets.
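The agglomerative variant can be sketched in a few lines with scipy's hierarchy tools (this is a generic illustration, not the MlFinLab implementation; the toy points are an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two well-separated groups (hypothetical data)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])

# Agglomerative clustering: repeatedly merge the closest clusters.
# The linkage matrix encodes the full merge tree (the dendrogram).
link = linkage(points, method="ward")

# Cut the tree into two flat clusters
labels = fcluster(link, t=2, criterion="maxclust")
print(labels)
```

Passing `link` to `scipy.cluster.hierarchy.dendrogram` would draw the nested-cluster tree described above.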
Note
Underlying Literature
The following sources describe this method in more detail:
- TF 2.0 DCGAN for 100x100 financial correlation matrices by Gautier Marti.
Implementation
This module creates an optimal leaf hierarchical clustering, as shown in Marti, G. (2020) TF 2.0 DCGAN for 100x100 financial correlation matrices, by rearranging a matrix with hierarchical clustering so that the sum of the similarities between adjacent leaves is maximized.
optimal_hierarchical_cluster(mat: array, method: str = 'ward') -> array
Calculates the optimal clustering of a matrix.
It computes the hierarchical clusters from the distance matrix of the input, then calculates the optimal leaf ordering of those clusters, and returns the optimally ordered matrix.
It is reproduced with modifications from the following blog post: Marti, G. (2020) TF 2.0 DCGAN for 100x100 financial correlation matrices [Online]. Available at: https://marti.ai/ml/2019/10/13/tf-dcgan-financial-correlation-matrices.html. (Accessed: 17 Aug 2020)
This method relies on, and acts as a wrapper for, the scipy.cluster.hierarchy module. https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
Parameters:
- mat – (np.array/pd.DataFrame) Correlation matrix.
- method – (str) Method used to calculate the hierarchy clusters. Can take the values ["single", "complete", "average", "weighted", "centroid", "median", "ward"].
Returns:
- (np.array) Optimal hierarchy cluster matrix.
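The optimal leaf ordering step can be reproduced directly with scipy, which this function wraps. The sketch below is an illustration under stated assumptions: the toy correlation matrix, its shuffling, and the correlation-to-distance transform sqrt(0.5 * (1 - rho)) are assumptions for the example, not necessarily the exact transform used internally.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import squareform

# Hypothetical block-structured correlation matrix, then shuffled
corr = np.array([[1.0, 0.9, 0.2, 0.1],
                 [0.9, 1.0, 0.1, 0.2],
                 [0.2, 0.1, 1.0, 0.8],
                 [0.1, 0.2, 0.8, 1.0]])
perm = np.array([0, 2, 1, 3])
shuffled = corr[perm][:, perm]

# Convert correlation to a distance (assumed transform), cluster, then
# reorder the leaves so adjacent leaves are as similar as possible
dist = np.sqrt(0.5 * (1.0 - shuffled))
condensed = squareform(dist, checks=False)
link = linkage(condensed, method="ward")
order = leaves_list(optimal_leaf_ordering(link, condensed))

# Reindex rows and columns by the optimal leaf order
reordered = shuffled[order][:, order]
print(reordered)
```

After reordering, the two highly correlated pairs sit next to each other again, which is what makes the block structure visible in the plotted matrix.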
Example
# Import packages
import matplotlib.pyplot as plt
# Import MlFinLab tools
from mlfinlab.data_generation.data_verification import plot_optimal_hierarchical_cluster, optimal_hierarchical_cluster
from mlfinlab.data_generation.hcbm import generate_hcmb_mat
# Initialize parameters
samples = 1
dim = 200
rho_low = 0.1
rho_high = 0.9
# Generate HCBM matrix
hcbm_mat = generate_hcmb_mat(t_samples=samples,
                             n_size=dim,
                             rho_low=rho_low,
                             rho_high=rho_high,
                             permute=True)[0]
# Plot it
plt.figure(figsize=(6, 4))
plt.pcolormesh(hcbm_mat, cmap='viridis')
plt.colorbar()
plt.title("Original Correlation Matrix")
# Obtain optimal clusters from HCBM matrix
ordered_corr = optimal_hierarchical_cluster(hcbm_mat, method="ward")
# Plot it
plt.figure(figsize=(6, 4))
plt.pcolormesh(ordered_corr, cmap='viridis')
plt.colorbar()
plt.title("Optimal Clustering Correlation Matrix")
plt.show()