First Generation Models


Market microstructure studies the process and outcomes of exchanging assets under explicit trading rules - O’Hara, 1995

Microstructural datasets include primary information about the auctioning process, such as order cancellations, the double auction book, queues, partial fills, aggressor side, corrections, and replacements. These datasets let researchers study how market participants conceal and reveal their true preferences, which makes them very useful for engineering the features of an ML model.

This module concerns itself with the so-called “first generation” of microstructural models and a series of transformations of their outputs - namely:

  • The Tick Rule

  • The Roll Model

  • Fractional Differentiation

  • Wald-Wolfowitz Runs Randomness

It also includes a generate_feature_matrix function, which applies all of the features contained in this module to a dataset of ticks on a rolling basis, producing a ready-to-use input dataframe for an ML model.

Note

Underlying Literature

The following sources elaborate extensively on the topic:

  • Advances in Financial Machine Learning, Chapter 19, Section 3 by Marcos Lopez de Prado. Describes the emergence and modern day uses of the first generation of microstructural features in more detail


The Tick Rule

The following description is based on Section 19.3.1 of Advances in Financial Machine Learning:

“In a double auction book, quotes are placed for selling a security at various price levels or for buying a security at various price levels. Offer prices always exceed bid prices, because otherwise there would be an instant match. A trade occurs whenever a buyer matches an offer or a seller matches a bid. Every trade has a buyer and a seller, but only one side initiates the trade.

The Tick Rule is an algorithm used to determine a trade’s aggressor side. A buy-initiated trade is labeled (1) and a sell-initiated trade is labeled (-1), according to the following logic:

\[\begin{split}b_{t}=\left\{\begin{array}{ll} 1 & \text { if } \Delta p_{t}>0 \\ -1 & \text { if } \Delta p_{t}<0 \\ b_{t-1} & \text { if } \Delta p_{t}=0 \end{array}\right.\end{split}\]

where \(p_{t}\) is the price of the trade indexed by \(t = 1,...T\), and \(b_{0}\) is arbitrarily set to 1.”
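The logic above can be sketched in a few lines of plain Python. This is a minimal illustration and not the library's implementation: the name tick_rule_sketch and the initial_side parameter are hypothetical, and the sketch assumes that the first trade, which has no preceding price, simply inherits \(b_{0}\).

```python
from typing import List, Optional, Sequence


def tick_rule_sketch(prices: Sequence[float], initial_side: int = 1) -> List[int]:
    """Label each trade +1 (buy-initiated) or -1 (sell-initiated) by the tick rule."""
    labels: List[int] = []
    previous_label = initial_side  # b_0, arbitrarily set to 1 by default
    previous_price: Optional[float] = None
    for price in prices:
        if previous_price is None or price == previous_price:
            # No price change (or no previous trade): carry the last label forward.
            label = previous_label
        elif price > previous_price:
            label = 1  # uptick
        else:
            label = -1  # downtick
        labels.append(label)
        previous_label, previous_price = label, price
    return labels


print(tick_rule_sketch([10.0, 10.1, 10.1, 10.0, 10.0, 10.2]))
# prints [1, 1, 1, -1, -1, 1]
```

Trades at an unchanged price inherit the previous label, which is why the third and fifth entries repeat the classification before them.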

Implementation

tick_rule(prices: Series) -> Series

Generates a series of classifications that indicate whether a trade was buy-initiated (denoted by a 1) or sell-initiated (denoted by a -1) using the “tick rule” or “tick test”.

Parameters:

prices – (pd.Series) Time series of prices.

Returns:

(pd.Series) Time series of classifications.

Example

>>> import pandas as pd
>>> from mlfinlab.microstructural_features import first_generation
>>> # Read in the tick data, keeping only the closing price
>>> url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/tick_bars.csv"
>>> tick_prices = pd.read_csv(url)["close"]
>>> # Generate trade classifications for our tick data
>>> tick_classifications = first_generation.tick_rule(prices=tick_prices)
>>> tick_classifications  
 0        1.0
 1       -1.0
 2        1.0
 3        1.0
 4        1.0...

The Roll Model

The following description is based on Section 19.3.2 of Advances in Financial Machine Learning:

“The Roll Model (1984) is a market microstructure model that aims to estimate the effective bid-ask spread of a security from observed transaction prices. That said, the Roll model does not include any information on the underlying bid-ask price quotes and order flow.

Consider a mid-price series \(\{m_{t}\}\), where prices follow a Random Walk with no drift as follows:

\[m_{t} = m_{t-1} + u_{t}\]

hence price changes \(\Delta m_{t} = m_{t} - m_{t-1}\) are independently and identically drawn from a Normal distribution:

\[\Delta m_{t} \sim N\left[0, \sigma_{u}^2\right]\]

The observed prices, \(\{p_{t}\}\), are the result of sequential trading against the bid-ask spread:

\[p_{t} = m_{t} + b_{t}c\]

where \(c\) is half the bid-ask spread, and \(b_{t} \in\{-1, 1\}\) is the aggressor side. The Roll model assumes that buys and sells are equally likely, \(P\left[b_{t}=1\right]=P\left[b_{t}=-1\right]=\frac{1}{2}\), serially independent, \(E\left[b_{t}b_{t-1}\right]=0\), and independent from the noise, \(E\left[b_{t}u_{t}\right]=0\). Given these assumptions, Roll derives the values of \(c\) and \(\sigma_{u}^2\) as follows:

\[\begin{split}\begin{array}{c} \sigma^{2}\left[\Delta p_{t}\right]=\mathrm{E}\left[\left(\Delta p_{t}\right)^{2}\right]-\left(\mathrm{E}\left[\left(\Delta p_{t}\right)\right]\right)^{2}=2 c^{2}+\sigma_{u}^{2} \\ \sigma\left[\Delta p_{t}, \Delta p_{t-1}\right]=-c^{2} \end{array}\end{split}\]

resulting in:

\[\begin{split}\begin{array}{c} c=\sqrt{\max \left\{0,-\sigma\left[\Delta p_{t}, \Delta p_{t-1}\right]\right\}} \\ \sigma_{u}^{2}=\sigma^{2}\left[\Delta p_{t}\right]+2 \sigma\left[\Delta p_{t}, \Delta p_{t-1}\right] \end{array}\end{split}\]

As a result, we can conclude that the bid-ask spread is a function of the serial covariance of price changes, and the true (unobserved) price’s noise, excluding microstructural noise, is a function of the observed noise and the serial covariance of price changes.”
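As an illustration of the two closing formulas, the sketch below estimates \(c\) and \(\sigma_{u}^2\) from a list of trade prices using only the Python standard library. It is not the library's roll_spread implementation: the function names are hypothetical, the choice of sample (rather than population) moments is an assumption, and simulate_roll_prices is a helper added here only to generate data that satisfies the model's assumptions.

```python
import math
import random
from statistics import fmean, variance
from typing import List, Sequence, Tuple


def roll_estimates_sketch(prices: Sequence[float]) -> Tuple[float, float]:
    """Estimate (c, sigma_u^2) from observed trade prices under Roll's assumptions."""
    changes = [later - earlier for earlier, later in zip(prices[:-1], prices[1:])]
    if len(changes) < 3:
        raise ValueError("need at least four prices to estimate a serial covariance")
    lagged, current = changes[:-1], changes[1:]
    mean_lagged, mean_current = fmean(lagged), fmean(current)
    # Sample covariance between consecutive price changes.
    serial_cov = sum(
        (x - mean_lagged) * (y - mean_current) for x, y in zip(lagged, current)
    ) / (len(current) - 1)
    half_spread = math.sqrt(max(0.0, -serial_cov))  # c
    noise_variance = variance(changes) + 2.0 * serial_cov  # sigma_u^2
    return half_spread, noise_variance


def simulate_roll_prices(n: int, c: float, sigma_u: float, seed: int = 0) -> List[float]:
    """Hypothetical helper: simulate p_t = m_t + b_t * c with a random-walk mid-price."""
    rng = random.Random(seed)
    mid = 100.0
    prices = []
    for _ in range(n):
        mid += rng.gauss(0.0, sigma_u)  # m_t = m_{t-1} + u_t
        prices.append(mid + rng.choice((-1, 1)) * c)  # buys and sells equally likely
    return prices


simulated = simulate_roll_prices(n=20_000, c=0.05, sigma_u=0.01)
estimated_c, estimated_noise_variance = roll_estimates_sketch(simulated)
print(f"estimated half-spread c: {estimated_c:.4f}")
```

The max with zero guards against a positive sample covariance, for which the square root would be undefined. In small samples the estimate of \(\sigma_{u}^2\) can also come out negative, so it is best read as a rough diagnostic.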

Implementation

roll_spread(prices: Series) -> float

Calculates the bid-ask spread for a given time interval using the methodology proposed by Richard Roll (1984): https://www.bauer.uh.edu/rsusmel/phd/roll1984.pdf

Parameters:

prices – (Option: pd.Series) A sequence of prices.

Returns:

(float) The bid-ask spread.

Example

>>> import pandas as pd
>>> from mlfinlab.microstructural_features import first_generation
>>> # Read in the tick data, keeping only the closing price
>>> url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/tick_bars.csv"
>>> tick_prices = pd.read_csv(url)["close"]
>>> # Calculate the Roll spread of the security based on the tick data
>>> spread = first_generation.roll_spread(prices=tick_prices)
>>> spread  
0.369...

Feature Transformations

As previously mentioned, many transformations can be applied to the trade classifications yielded by the Tick Rule to produce useful feature inputs for an ML model. The transformations contained in this module include applying the Wald-Wolfowitz runs test to the classification series to determine how random the classifications are, fractional differencing of the classification series to achieve stationarity while preserving a high degree of information, and various entropy measures that quantify the amount of information contained in the classification sequence.

Fractional Differentiation

For a detailed explanation of fractional differentiation and why it’s useful for machine learning in finance, please consult the Fractionally Differentiated Features section of the MLFinLab documentation and the paper Fractional differentiation and its use in machine learning.

In the context of microstructural features, the cumulative sum of tick-rule generated trade classifications can be fractionally differenced in order to produce an information-rich time series that is also stationary (as determined by the Augmented Dickey-Fuller test).
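To make the idea concrete, here is a minimal sketch of fractional differencing applied to the cumulative sum of classifications. It assumes the fixed-width window scheme described in Chapter 5 of Advances in Financial Machine Learning, where weights follow \(w_{k} = -w_{k-1}(d-k+1)/k\) and are dropped once their magnitude falls below a threshold. The library's internals may differ, the function names are hypothetical, and the stationarity and correlation checks are omitted.

```python
from itertools import accumulate
from typing import List, Sequence


def ffd_weights_sketch(d: float, threshold: float) -> List[float]:
    """Binomial-series weights for differencing order d, truncated at the threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive so the weight series terminates")
    weights = [1.0]  # w_0
    k = 1
    while True:
        next_weight = -weights[-1] * (d - k + 1) / k
        if abs(next_weight) < threshold:
            break  # remaining weights are treated as negligible
        weights.append(next_weight)
        k += 1
    return weights


def frac_diff_sketch(
    classifications: Sequence[float], d: float, threshold: float
) -> List[float]:
    """Fractionally difference the cumulative sum of trade classifications."""
    cumulative = list(accumulate(classifications))
    weights = ffd_weights_sketch(d, threshold)
    # Only positions with a full window of history produce an output value.
    return [
        sum(w * cumulative[t - k] for k, w in enumerate(weights))
        for t in range(len(weights) - 1, len(cumulative))
    ]


print(ffd_weights_sketch(d=0.5, threshold=0.1))
# prints [1.0, -0.5, -0.125]
print(frac_diff_sketch([1, -1, -1, 1], d=1.0, threshold=0.01))
# prints [-1.0, -1.0, 1.0]
```

With d = 1.0 the weights collapse to [1.0, -1.0], so the output is just the first difference of the cumulative sum, which recovers the original classifications. Values of d below 1 keep a longer tail of weights and therefore more memory of past trades.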

fractional_differencing(classifications: Series, differencing_amount: float = 0.73, threshold: float = 0.01) -> Dict[str, Series | float]

Uses fractional calculus to difference the cumulative sum of a series of trade classifications as generated by the tick rule function in this module. Fractional calculus can be used to produce a series that is more stationary than the original while preserving more memory/information than outright integer differencing. More on fractional calculus and its application to trade classifications: (1) https://en.wikipedia.org/wiki/Fractional_calculus (2) Advances in Financial Machine Learning, Lopez de Prado, Chapter 19, section 19.3.1, page 282.

The “optimal” values for differencing amount and threshold are those that yield a resulting series that is both stationary (i.e. ADF p-value < 0.05) and still highly correlated to the original series (i.e. Pearson r value > 0.9). We’ve found that a differencing amount of around 0.730 and a threshold of 0.01 yield such a result.

Parameters:
  • classifications – (pd.Series) A series of trade classifications as generated by the tick rule.

  • differencing_amount – (float) The differencing amount (note: a value of 1.0 is the same as taking the first difference of the series).

  • threshold – (float) The cut-off weight for the window to be used in the differencing process.

Returns:

(dictionary) A dictionary containing the differenced series, the p-value resulting from an Augmented Dickey-Fuller test, and the Pearson r value relating the original series to the differenced one.

Example

>>> import pandas as pd
>>> from mlfinlab.microstructural_features import first_generation
>>> # Read in the tick data, keeping only the closing price
>>> url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/tick_bars.csv"
>>> tick_prices = pd.read_csv(url)["close"]
>>> # Generate trade classifications for our tick data
>>> tick_classifications = first_generation.tick_rule(prices=tick_prices)
>>> # Generate the fractionally differenced series and its associated test statistics
>>> fractionally_differenced_classifications = first_generation.fractional_differencing(
...     classifications=tick_classifications, differencing_amount=0.453, threshold=0.01
... )
>>> fractionally_differenced_classifications  
{...}

Wald-Wolfowitz Runs Randomness

The Wald–Wolfowitz runs test is a statistical test that determines whether or not a two-valued data sequence is random. The test can be used to test the hypothesis that the elements of a sequence are mutually independent. In the context of microstructural features, the p-value of a Wald-Wolfowitz test applied to a window of classifications can be used to gauge whether buy-initiated and sell-initiated trades arrive in random order or cluster in runs, which in turn sheds light on a security’s liquidity.
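The test statistic can be sketched with the standard library alone. The version below uses the usual normal approximation: with \(n_{+}\) buys and \(n_{-}\) sells out of \(n\) trades, the expected number of runs is \(\mu = 2n_{+}n_{-}/n + 1\) and its variance is \((\mu-1)(\mu-2)/(n-1)\). The function name is hypothetical, the two-sided p-value is an assumption about how the result is reported, and the library's implementation may differ.

```python
import math
from typing import Sequence, Tuple


def runs_test_sketch(classifications: Sequence[int]) -> Tuple[float, float]:
    """Return (z-statistic, two-sided p-value) of a Wald-Wolfowitz runs test."""
    n_buys = sum(1 for side in classifications if side > 0)
    n_sells = len(classifications) - n_buys
    if n_buys == 0 or n_sells == 0:
        raise ValueError("both buy- and sell-initiated trades must be present")
    n = n_buys + n_sells
    # A new run starts whenever the side differs from the previous trade's side.
    runs = 1 + sum(
        1 for prev, curr in zip(classifications[:-1], classifications[1:]) if prev != curr
    )
    expected_runs = 2.0 * n_buys * n_sells / n + 1.0
    runs_variance = (expected_runs - 1.0) * (expected_runs - 2.0) / (n - 1.0)
    if runs_variance <= 0:
        raise ValueError("sequence is too short for the normal approximation")
    z_stat = (runs - expected_runs) / math.sqrt(runs_variance)
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))
    return z_stat, p_value


clustered = [1] * 10 + [-1] * 10  # two long runs: far fewer runs than expected
alternating = [1, -1] * 10  # twenty runs: far more runs than expected
print(runs_test_sketch(clustered))
print(runs_test_sketch(alternating))
```

A strongly negative z-statistic indicates clustering (too few runs), a strongly positive one indicates excessive alternation, and in both cases a small p-value argues against a random ordering of trade sides.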

wald_wolfowitz_runs_test(classifications: Series) -> float

Runs a Wald-Wolfowitz runs test on a sequence of trade classifications and returns the resulting p-value.

The Wald-Wolfowitz runs test checks a randomness hypothesis for a two-valued data sequence. More specifically, it checks whether the order of the runs in a sequence is purely random (H0) or not (H1). More on the Wald-Wolfowitz runs test: https://en.wikipedia.org/wiki/Wald%E2%80%93Wolfowitz_runs_test

Parameters:

classifications – (Option: pd.Series/np.array/list) A sequence of runs represented by 1.0s and -1.0s.

Returns:

(float) The p-value of the Wald-Wolfowitz runs test.

Example

>>> import pandas as pd
>>> from mlfinlab.microstructural_features import first_generation
>>> # Read in the tick data, keeping only the closing price
>>> url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/tick_bars.csv"
>>> tick_prices = pd.read_csv(url)["close"]
>>> # Generate trade classifications for our tick data
>>> tick_classifications = first_generation.tick_rule(prices=tick_prices)
>>> # Generate the p-value of the Wald-Wolfowitz runs test
>>> wald_wolfowitz_randomness = first_generation.wald_wolfowitz_runs_test(
...     classifications=tick_classifications
... )
>>> wald_wolfowitz_randomness  
7.49...

The Feature Matrix

The feature matrix is a dataframe that contains the output of every function in this module and in the Entropy Measures module, applied to a tick bar dataset over a rolling window specified by the user (e.g. 5 ticks).

Before using this function, we recommend running the fractional_differencing function on its own to calibrate the differencing amount and differencing threshold that this function requires as inputs.
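The rolling-window idea can be illustrated with a small standard-library sketch. It is not generate_feature_matrix: the function name is hypothetical, the feature functions passed in are placeholders, and whether the library drops or pads the first incomplete windows is not specified here, so this sketch simply skips them.

```python
from typing import Callable, Dict, List, Sequence

FeatureFn = Callable[[Sequence[float]], float]


def rolling_features_sketch(
    values: Sequence[float], lookback: int, features: Dict[str, FeatureFn]
) -> List[Dict[str, float]]:
    """Apply each feature function to every full trailing window of `lookback` entries."""
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    rows = []
    for end in range(lookback, len(values) + 1):
        window = values[end - lookback:end]  # the most recent `lookback` entries
        rows.append({name: fn(window) for name, fn in features.items()})
    return rows


# Placeholder features, chosen only to keep the example easy to verify by hand.
example_features = {
    "change": lambda window: window[-1] - window[0],
    "highest": max,
}
for row in rolling_features_sketch([1, 2, 4, 7, 11], lookback=3, features=example_features):
    print(row)
# prints {'change': 3, 'highest': 4}
# prints {'change': 5, 'highest': 7}
# prints {'change': 7, 'highest': 11}
```

Each output row depends only on the window ending at that position, so no feature uses information from later ticks.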

Implementation

generate_feature_matrix(tick_prices: Series, lookback_period: int, fractional_differencing_amount: float, fractional_differencing_threshold: float) -> DataFrame

Generates a dataframe that has as columns each of the features outlined in this module for easy input into a machine learning model.

The user must provide a lookback_period, which is the window of entries that the user wants each of the feature transformation functions included in this module to be computed over on a rolling basis.

Parameters:
  • tick_prices – (pd.Series) Time series of prices.

  • lookback_period – (int) The size of the rolling window over which each feature is computed.

  • fractional_differencing_amount – (float) The differencing amount to be used in the fractional differencing process.

  • fractional_differencing_threshold – (float) The threshold to be used in the fractional differencing process.

Returns:

(pd.DataFrame) A dataframe with all of the features in this module.

Example

>>> import pandas as pd
>>> from mlfinlab.microstructural_features import first_generation
>>> # Read in the tick data, keeping only the closing price
>>> url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/tick_bars.csv"
>>> tick_prices = pd.read_csv(url)["close"][
...     :500
... ]  # reduce number of ticks for example to run quickly
>>> # Generate our feature matrix
>>> feature_matrix = first_generation.generate_feature_matrix(
...     tick_prices=tick_prices,
...     lookback_period=5,
...     fractional_differencing_amount=0.453,
...     fractional_differencing_threshold=0.01,
... )
>>> feature_matrix  
roll_spread...

Research Notebook

The following research notebook can be used to better understand the so-called “first generation” of microstructural models and the series of transformations of their outputs covered in this module:


References