Hierarchical Clustering

This module implements hierarchical clustering algorithms. Clustering groups similar data points into sets called clusters. Similarity can be measured in several ways, such as Pearson's correlation, Spearman's rank correlation, or Euclidean distance.
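
As a minimal illustration of these similarity measures, the snippet below computes them with plain NumPy/SciPy (this is only an illustration, not part of this module's API):

# Minimal illustration of the similarity measures mentioned above
# (plain NumPy/SciPy, not part of this module's API)
import numpy as np
from scipy.stats import pearsonr, spearmanr
from scipy.spatial.distance import euclidean

x = np.random.normal(size=100)
y = np.random.normal(size=100)

pearson_corr, _ = pearsonr(x, y)     # Linear similarity in [-1, 1]
spearman_corr, _ = spearmanr(x, y)   # Rank-based (monotonic) similarity
euclidean_dist = euclidean(x, y)     # Distance: smaller means more similar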

Hierarchical clustering is a technique that arranges a set of nested clusters as a tree. It can be agglomerative or divisive. Agglomerative hierarchical clustering iteratively merges the most similar smaller clusters into bigger ones; divisive hierarchical clustering works in the opposite direction, splitting bigger clusters into smaller, dissimilar ones at each iteration. A dendrogram is a common way to visualize the nested clusters, as sketched below. Hierarchical clustering is useful for discerning shared properties in datasets.
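
The following sketch illustrates the agglomerative case: it builds a Ward linkage over toy 2-D points with scipy.cluster.hierarchy and plots the dendrogram. It uses SciPy directly and is only an illustration, not part of this module's API:

# A minimal sketch of agglomerative clustering and its dendrogram,
# using scipy.cluster.hierarchy directly (illustration only)
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy observations: two loose groups of points in 2-D
points = np.vstack([np.random.normal(0, 0.5, size=(5, 2)),
                    np.random.normal(3, 0.5, size=(5, 2))])

# Agglomerative step: iteratively merge the two closest clusters (Ward linkage)
link = linkage(points, method="ward")

# The dendrogram visualizes the nested merges as a tree
dendrogram(link)
plt.title("Agglomerative Clustering Dendrogram")
plt.show()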

Note

Underlying Literature

The following sources describe this method in more detail:

  • Marti, G. (2020) TF 2.0 DCGAN for 100x100 financial correlation matrices [Online]. Available at: https://marti.ai/ml/2019/10/13/tf-dcgan-financial-correlation-matrices.html.

Implementation

This module creates an optimal leaf hierarchical clustering, as shown in Marti, G. (2020) TF 2.0 DCGAN for 100x100 financial correlation matrices, by rearranging a matrix with hierarchical clustering so that the sum of the similarities between adjacent leaves is maximized.

optimal_hierarchical_cluster(mat: array, method: str = 'ward') → array

Calculates the optimal clustering of a matrix.

It computes the hierarchical clusters from the distance matrix, then calculates the optimal leaf ordering of those clusters and returns the optimally ordered matrix.

It is reproduced with modifications from the following blog post: Marti, G. (2020) TF 2.0 DCGAN for 100x100 financial correlation matrices [Online]. Available at: https://marti.ai/ml/2019/10/13/tf-dcgan-financial-correlation-matrices.html. (Accessed: 17 Aug 2020)

This method relies on, and acts as a wrapper for, the scipy.cluster.hierarchy module: https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html

Parameters:
  • mat – (np.array/pd.DataFrame) Correlation matrix.

  • method – (str) Method to calculate the hierarchy clusters. Can take the values ["single", "complete", "average", "weighted", "centroid", "median", "ward"].

Returns:

(np.array) Optimal hierarchy cluster matrix.
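
Conceptually, the wrapper's pipeline can be reproduced directly with scipy.cluster.hierarchy. The sketch below is an assumption about the equivalent steps (a common correlation-to-distance transform, linkage, optimal leaf ordering, and reindexing); the exact internals of optimal_hierarchical_cluster may differ:

# A hedged sketch of the steps the wrapper performs, built directly on
# scipy.cluster.hierarchy; the actual internals of
# optimal_hierarchical_cluster may differ
import numpy as np
from scipy.cluster.hierarchy import linkage, optimal_leaf_ordering, leaves_list
from scipy.spatial.distance import squareform

def optimal_cluster_sketch(corr_mat, method="ward"):
    # Map correlations to distances (a common transform; assumed here),
    # then condense the square distance matrix for scipy
    dist = np.sqrt(2 * (1 - corr_mat))
    np.fill_diagonal(dist, 0)
    condensed = squareform(dist, checks=False)

    # Cluster hierarchically, then reorder the leaves so that the
    # similarity between adjacent leaves is maximized
    link = optimal_leaf_ordering(linkage(condensed, method=method), condensed)
    order = leaves_list(link)

    # Permute rows and columns into the optimal leaf order
    # (assumes corr_mat is an np.array; a pd.DataFrame would need .values)
    return corr_mat[order, :][:, order]

Reordering the rows and columns this way places highly correlated variables next to each other, which is what makes the block structure of the HCBM matrix visible in the example below.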


Example

Figure: (Left) HCBM matrix. (Right) Optimal clustering of the HCBM matrix found by the function optimal_hierarchical_cluster.

# Import packages
import matplotlib.pyplot as plt

# Import MlFinLab tools
from mlfinlab.data_generation.data_verification import optimal_hierarchical_cluster
from mlfinlab.data_generation.hcbm import generate_hcmb_mat

# Initialize parameters
samples = 1
dim = 200
rho_low = 0.1
rho_high = 0.9

# Generate HCBM matrix
hcbm_mat = generate_hcmb_mat(t_samples=samples,
                             n_size=dim,
                             rho_low=rho_low,
                             rho_high=rho_high,
                             permute=True)[0]

# Plot the original (permuted) correlation matrix
plt.figure(figsize=(6, 4))
plt.pcolormesh(hcbm_mat, cmap='viridis')
plt.colorbar()
plt.title("Original Correlation Matrix")

# Obtain optimal clusters from HCBM matrix
ordered_corr = optimal_hierarchical_cluster(hcbm_mat, method="ward")

# Plot the optimally clustered correlation matrix
plt.figure(figsize=(6, 4))
plt.pcolormesh(ordered_corr, cmap='viridis')
plt.colorbar()
plt.title("Optimal Clustering Correlation Matrix")

plt.show()

References

Marti, G. (2020) TF 2.0 DCGAN for 100x100 financial correlation matrices [Online]. Available at: https://marti.ai/ml/2019/10/13/tf-dcgan-financial-correlation-matrices.html. (Accessed: 17 Aug 2020)