Average Linkage Minimum Spanning Tree (ALMST)

Average Linkage Minimum Spanning Tree (ALMST) shows better results in recognising the economic sectors and sub-sectors than the Minimum Spanning Tree (Tumminello et al. 2007). The ALMST, as defined by Tumminello et al. (2007), is a variation of the MST.

Just as the MST is associated with the Single Linkage Clustering Algorithm (SCLA), the ALMST is based on the UPGMA (unweighted pair group method with arithmetic mean) algorithm for hierarchal clustering. This is also known as Average Linkage Cluster Algorithm (ALCA) method. The UPGMA method is frequently used in biology for phylogenetic tree as a diagrammatic representation of the evolutionary relatedness between organisms. A step by step example of UPGMA or ALCA method for trees can be found here.


Note

Underlying Literature

The following sources describe this method in more detail:

ALMST

ALMST vs. MST

Instead of choosing the next edge by the minimum edge (distance based MST), the next edge is chosen by the minimum average distance between two existing clusters. Tumminello et al. (2007) give an example MST algorithm to show how the ALMST algorithm differs from the MST algorithm.

Where \(g\) is the graph, \(S_{i}\) is the set of vertices, \(n\) is the number of elements, \(C\) is the correlation matrix of elements \(ρ_{ij}\), connected component of a graph \(g\) containing a given vertex \(i\). The starting point of the procedure is an empty graph \(g\) with \(n\) vertices.

  1. Set \(Q\) as the matrix of elements \(q_{ij}\) such that \(Q = C\), where \(C\) is the estimated correlation matrix.

  2. Select the maximum correlation \(q_{hk}\) between elements belonging to different connected components \(S_{h}\) and \(S_{k}\) in \(g^{2}\).

  3. Find elements \(u\), \(p\) such that \(p_{up} = \max\{{\rho_{ij}}, \forall i \in S_{h}\) and \(\forall j \in S_{k} \}\)

  4. Add to \(g\) the link between elements \(u\) and \(p\) with weight \(\rho_{up}\). Once the link is added to \(g\), \(u\) and \(p\) will belong to the same connected component \(S = S_{h} \cup S_{k}\).

  5. Redefine the matrix \(Q\):

\[ \begin{align}\begin{aligned}\begin{cases} q_{ij} = q_{hk}, & \text{ if i} \in S_{h} \text{ and j} \in S_{k}\\q_{ij} = Max \{q_{pt}, p \in S \text{ and } t \in S_{j}, \text{ with } S_{j} \neq S\}, & \text{ if i} \in S \text{ and j } \in S_{j}\\q_{ij} = q_{ij}, & \text{ otherwise}; \end{cases}\end{aligned}\end{align} \]
  1. If \(g\) is still a disconnected graph then go to step (2) else stop.

This is the case for a correlation matrix (taking the maximum correlation value for the MST edges).

However, for distance matrices, the MST algorithm orders the edges by the minimum distance, by replacing step (5) with:

\[ \begin{align}\begin{aligned}\begin{cases} q_{ij} = q_{hk}, & \text{ if i} \in S_{h} \text{ and j} \in S_{k}\\q_{ij} = Min \{q_{pt}, p \in S \text{ and } t \in S_{j}, \text{ with } S_{j} \neq S\}, & \text{ if i} \in S \text{ and j } \in S_{j}\\q_{ij} = q_{ij}, & \text{ otherwise}; \end{cases}\end{aligned}\end{align} \]

By replacing eq. in step (5) with

\[ \begin{align}\begin{aligned}\begin{cases} q_{ij} = q_{hk}, & \text{ if i} \in S_{h} \text{ and j} \in S_{k}\\q_{ij} = Mean \{q_{pt}, p \in S \text{ and } t \in S_{j}, \text{ with } S_{j} \neq S\}, & \text{ if i} \in S \text{ and j } \in S_{j}\\q_{ij} = q_{ij}, & \text{ otherwise}; \end{cases}\end{aligned}\end{align} \]

we obtain an algorithm performing the ALCA. We call the resulting tree \(g\) an ALMST.

User Interface

The ALMST can be generated in the same way as the MST, but by creating an ALMST class object instead. Since MST and ALMST are both subclasses of Graph, MST and ALMST are the input to the initialisation method of class Dash. However, the recommended way to create visualisations, is to use the methods from visualisations file, unless you would like to input a custom matrix.

Creating the ALMST visualisation is a similar process to creating the MST. You can replace:

mst = MST(input_matrix, 'input matrix type')

With:

almst = ALMST(input_matrix, 'input matrix type')

This creates the ALMST object containing self.graph as the graph of the ALMST. The ALMST object can then be inputted to the Dash interface.

Implementation

class ALMST(matrix, matrix_type, mst_algorithm='kruskal')

ALMST is a subclass of Graph which creates a ALMST Graph object. The ALMST class converts a distance matrix input into a ALMST matrix. This is then used to create a nx.Graph object.

__init__(matrix, matrix_type, mst_algorithm='kruskal')

Initialises the ALMST and sets the self.graph attribute as the ALMST graph.

Parameters:
  • matrix – (pd.DataFrame) Input matrices such as a distance or correlation matrix.

  • matrix_type – (str) Name of the matrix type (e.g. “distance” or “correlation”).

  • mst_algorithm – (str) Valid MST algorithm types include ‘kruskal’, ‘prim’. By default, MST algorithm uses Kruskal’s.

__weakref__

list of weak references to the object (if defined)

static create_almst(matrix)

Creates and returns a ALMST given an input matrix using Prim’s algorithm.

Parameters:

matrix – (pd.DataFrame) Input distance matrix of all edges.

Returns:

(pd.DataFrame) Returns the ALMST in matrix format.

static create_almst_kruskals(matrix)

This method converts the input matrix into a ALMST matrix.

Currently only works with distance input matrix.

Parameters:

matrix – (pd.DataFrame) Input matrix.

Returns:

(pd.DataFrame) ALMST matrix with all other edges as 0 values.

get_difference(input_graph_two)

Given two Graph with the same nodes, return a set of differences in edge connections.

Parameters:

input_graph_two – (Graph) A graph to compare self.graph against.

Returns:

(List) A list of unique tuples showing different edge connections.

get_graph()

Returns the Graph stored as an attribute.

Returns:

(nx.Graph) Returns a NetworkX graph object.

get_graph_plot()

Returns the graph of the MST with labels. Assumes that the matrix contains stock names as headers.

Returns:

(AxesSubplot) Axes with graph plot. Call plt.show() to display this graph.

get_matrix_type()

Returns the matrix type set at initialisation.

Returns:

(str) String of matrix type (eg. “correlation” or “distance”).

get_node_colours()

Returns a map of industry group matched with list of nodes.

Returns:

(Dict) Dictionary of industry name to list of node indexes.

get_node_sizes()

Returns the node sizes as a list.

Returns:

(List) List of numbers representing node sizes.

get_pos()

Returns the dictionary of the nodes coordinates.

Returns:

(Dict) Dictionary of node coordinates.

set_node_groups(industry_groups)

Sets the node industry group, by taking in a dictionary of industry group to a list of node indexes.

Parameters:

industry_groups – (Dict) Dictionary of the industry name to a list of node indexes.

set_node_size(market_caps)

Sets the node sizes, given a list of market cap values corresponding to node indexes.

Parameters:

market_caps – (List) List of numbers corresponding to node indexes.

ALMST Algorithms

When you initialise the ALMST object, the ALMST is generated and stored as attribute an self.graph. Kruskal’s algorithms is used as a default.

almst_graph = ALMST(custom_matrix, 'custom')

To use Prim’s algorithm, pass the string ‘prim’.

almst_graph = ALMST(custom_matrix, 'custom', 'prim')

Example Code

# Import packages
import pandas as pd

# Import ALMST class
from mlfinlab.networks.almst import ALMST

# Import Dash Graph class
from mlfinlab.networks.dash_graph import DashGraph

# Import Codependence Matrix
from mlfinlab.codependence.codependence_matrix import get_distance_matrix


# Import file containing stock log returns
url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/logReturns.csv"
# Need to make sure all other columns are removed and leave numerical data only
log_return_dataframe = pd.read_csv(url, index_col='Date')

# Create a smaller smaple for testing
log_return_dataframe = log_return_dataframe.loc[:, ['NVDA', 'TWTR', 'MSFT', 'JNJ', 'TSLA', 'GE']]

# Create your custom matrix
correlation_matrix = log_return_dataframe.corr(method='pearson')

# Create your custom matrix
custom_matrix = get_distance_matrix(correlation_matrix, metric='angular')

# Creates MST graph class objects from custom input matrix
graph = ALMST(custom_matrix, 'distance')

# Create and get the server
dash_graph = DashGraph(graph)
server = dash_graph.get_server()

# Run server
server.run_server()

Customizing the Graphs

To further customize the ALMST when it is displayed in the Dash UI, you can add colours and change the sizes to represent for example industry groups and market cap of stocks.

These features are optional and only work for the Dash interface (not the comparison interface). As the comparison interface, highlights the central nodes (with degree greater than or equal to 5).

Adding Colours to Nodes

The colours can be added by passing a dictionary of group name to list of node names corresponding to the nodes input. You then pass the dictionary to the set_node_groups method.

# Optional - add industry groups for node colour
industry = {"tech": ['NVDA', 'TWTR', 'MSFT'], "utilities": ['JNJ'], "automobiles": ['TSLA', 'GE']}
graph.set_node_groups(industry)

Adding Sizes to Nodes

The sizes can be added in a similar manner, via a list of numbers which correspond to the node indexes. The UI of the graph will then display the nodes indicating the different sizes.

# Optional - adding market cap for node size
market_caps = [2000, 2500, 3000, 1000, 5000, 3500]
graph.set_node_size(market_caps)

Comparison Interface

dualinterface.png

ALMST and MST can be compared easily using the DualDashGraph interface, where you can see the MST and ALMST side by side. You can also create the dual interface using generate_mst_almst_comparison method in the Visualisations section. This is the recommended way as it reduces the number of steps needed to create the interface.

# Create MST graph
mst = MST(input_matrix, 'input matrix type')

# Create ALMST graph
almst = ALMST(input_matrix, 'input matrix type')

# Get DualDash server
server = DualDashGraph(mst, almst)
server.run_server()

References