mlfinlab.data_structures.imbalance_data_structures

Advances in Financial Machine Learning, Marcos Lopez de Prado, Chapter 2: Financial Data Structures: Imbalance Bars

This module contains the functions to help users create structured financial data from raw unstructured data, in the form of tick, volume, and dollar imbalance bars.

These bars are used throughout the textbook (Advances in Financial Machine Learning, by Marcos Lopez de Prado, 2018, pg 29) to build the more interesting features for predicting financial time series data.

These financial data structures have better statistical properties when compared to those based on fixed time interval sampling. A great paper to read more about this is titled: The Volume Clock: Insights into the high frequency paradigm, Lopez de Prado, et al. These ideas are then extended in another paper: Flow toxicity and liquidity in a high-frequency world.

We have introduced two types of imbalance bars: one where the expected number of ticks is estimated through an EMA (the book implementation), and one where the expected number of ticks is held constant.

A good blog post, which helped us a lot with the implementation here, was written by Maksim Ivanov: https://towardsdatascience.com/financial-machine-learning-part-0-bars-745897d4e4ba
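To make the three imbalance metrics concrete, here is a minimal sketch of the signed per-tick contribution defined in Chapter 2 of the book: the tick rule assigns each trade a sign b_t (the sign of the price change, carried forward when the price is unchanged), and the contribution is b_t for tick bars, b_t * v_t for volume bars, and b_t * p_t * v_t for dollar bars. The function name signed_imbalances and the choice of a zero sign for the very first tick are assumptions made for illustration, not part of this module's API.

```python
from typing import Iterable, List, Tuple

Tick = Tuple[str, float, float]  # (date_time, price, volume)


def signed_imbalances(ticks: Iterable[Tick], metric: str) -> List[float]:
    """Return the signed contribution b_t * x_t of every tick (hypothetical helper)."""
    contributions = []
    prev_price = None
    sign = 0.0  # assumption: the first tick has no previous price, so its sign is 0
    for _, price, volume in ticks:
        # Tick rule: take the sign of the price change, keep the old sign if unchanged.
        if prev_price is not None and price != prev_price:
            sign = 1.0 if price > prev_price else -1.0
        prev_price = price
        if metric == "tick":
            size = 1.0
        elif metric == "volume":
            size = volume
        elif metric == "dollar":
            size = price * volume
        else:
            raise ValueError(f"unknown metric: {metric}")
        contributions.append(sign * size)
    return contributions


ticks = [("t0", 100.0, 5.0), ("t1", 101.0, 2.0), ("t2", 101.0, 3.0), ("t3", 99.0, 4.0)]
print(signed_imbalances(ticks, "tick"))    # [0.0, 1.0, 1.0, -1.0]
print(signed_imbalances(ticks, "volume"))  # [0.0, 2.0, 3.0, -4.0]
```

The running sum of these contributions within a bar is the imbalance that gets compared against the sampling threshold.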

Module Contents

Classes

EMAImbalanceBars

Contains all of the logic to construct the imbalance bars from chapter 2. This class shouldn't be used directly.

ConstImbalanceBars

Contains all of the logic to construct the imbalance bars with fixed expected number of ticks. This class shouldn't be used directly.

Functions

get_ema_dollar_imbalance_bars(file_path_or_df[, ...])

Creates the EMA dollar imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

get_ema_volume_imbalance_bars(file_path_or_df[, ...])

Creates the EMA volume imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

get_ema_tick_imbalance_bars(file_path_or_df[, ...])

Creates the EMA tick imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

get_const_dollar_imbalance_bars(file_path_or_df[, ...])

Creates the Const dollar imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

get_const_volume_imbalance_bars(file_path_or_df[, ...])

Creates the Const volume imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

get_const_tick_imbalance_bars(file_path_or_df[, ...])

Creates the Const tick imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

class EMAImbalanceBars(metric: str, num_prev_bars: int, expected_imbalance_window: int, exp_num_ticks_init: int, exp_num_ticks_constraints: List, batch_size: int, analyse_thresholds: bool)

Bases: mlfinlab.data_structures.base_bars.BaseImbalanceBars

Contains all of the logic to construct the imbalance bars from chapter 2. This class shouldn’t be used directly. We have added functions to the package such as get_ema_dollar_imbalance_bars which will create an instance of this class and then construct the imbalance bars, to return to the user.

This is because we wanted to simplify the logic as much as possible, for the end user.

__slots__ = ()
batch_run(file_path_or_df: str | Iterable[str] | pandas.DataFrame, verbose: bool = True, to_csv: bool = False, output_path: str | None = None) pandas.DataFrame | None

Reads csv file(s) or pd.DataFrame in batches and then constructs the financial data structure in the form of a DataFrame. The csv file or DataFrame must have only 3 columns: date_time, price, & volume.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • verbose – (bool) Flag whether to print message on each processed batch or not.

  • to_csv – (bool) Flag for writing the results of bars generation to local csv file, or to in-memory DataFrame.

  • output_path – (str) Path to results file, if to_csv = True.

Returns:

(pd.DataFrame or None) Financial data structure.
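The batching described for batch_run can be pictured with a small standard-library sketch. It reads a three-column csv source and yields the rows in chunks of at most batch_size, so only one chunk is held in memory at a time. The helper iter_batches, and the choice to identify columns by position rather than by header name, are assumptions for illustration; this is not the library's own reader.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Tick = Tuple[str, float, float]  # (date_time, price, volume)


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[Tick]]:
    """Yield ticks from csv lines in batches of at most batch_size rows (hypothetical)."""
    reader = csv.reader(lines)
    header = next(reader)
    # The input must have exactly three columns: date_time, price, volume.
    if len(header) != 3:
        raise ValueError(f"expected 3 columns, got {len(header)}")
    while True:
        batch = [
            (row[0], float(row[1]), float(row[2]))
            for row in islice(reader, batch_size)
        ]
        if not batch:
            return
        yield batch


csv_text = (
    "date_time,price,volume\n"
    "2020-01-02 09:30:00,100.0,5\n"
    "2020-01-02 09:30:01,100.5,2\n"
    "2020-01-02 09:30:02,100.5,3\n"
)
for number, batch in enumerate(iter_batches(io.StringIO(csv_text), batch_size=2)):
    print("Batch number:", number, "rows:", len(batch))
```

A smaller batch_size lowers peak memory use at the cost of more iterations, which is the trade-off the batch_size parameter of the wrapper functions exposes.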

run(data: list | tuple | pandas.DataFrame) list

Reads a List, Tuple, or DataFrame and then constructs the financial data structure in the form of a list. The List, Tuple, or DataFrame must have only 3 fields: date_time, price, & volume.

Parameters:

data – (list/tuple/pd.DataFrame) List, tuple, or DataFrame containing raw tick data in the format [date_time, price, volume].

Returns:

(list) Financial data structure.

class ConstImbalanceBars(metric: str, expected_imbalance_window: int, exp_num_ticks_init: int, batch_size: int, analyse_thresholds: bool)

Bases: mlfinlab.data_structures.base_bars.BaseImbalanceBars

Contains all of the logic to construct the imbalance bars with fixed expected number of ticks. This class shouldn’t be used directly. We have added functions to the package such as get_const_dollar_imbalance_bars which will create an instance of this class and then construct the imbalance bars, to return to the user.

This is because we wanted to simplify the logic as much as possible, for the end user.

__slots__ = ()
batch_run(file_path_or_df: str | Iterable[str] | pandas.DataFrame, verbose: bool = True, to_csv: bool = False, output_path: str | None = None) pandas.DataFrame | None

Reads csv file(s) or pd.DataFrame in batches and then constructs the financial data structure in the form of a DataFrame. The csv file or DataFrame must have only 3 columns: date_time, price, & volume.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • verbose – (bool) Flag whether to print message on each processed batch or not.

  • to_csv – (bool) Flag for writing the results of bars generation to local csv file, or to in-memory DataFrame.

  • output_path – (str) Path to results file, if to_csv = True.

Returns:

(pd.DataFrame or None) Financial data structure.

run(data: list | tuple | pandas.DataFrame) list

Reads a List, Tuple, or DataFrame and then constructs the financial data structure in the form of a list. The List, Tuple, or DataFrame must have only 3 fields: date_time, price, & volume.

Parameters:

data – (list/tuple/pd.DataFrame) List, tuple, or DataFrame containing raw tick data in the format [date_time, price, volume].

Returns:

(list) Financial data structure.

get_ema_dollar_imbalance_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)

Creates the EMA dollar imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).

  • expected_imbalance_window – (int) EMA window used to estimate expected imbalance.

  • exp_num_ticks_init – (int) Initial expected number of ticks per bar.

  • exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.

  • batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.

  • verbose – (bool) Print out batch numbers (True or False).

  • to_csv – (bool) Save bars to csv after every batch run (True or False).

  • analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.

  • output_path – (str) Path to csv file, if to_csv is True.

Returns:

(pd.DataFrame) DataFrame of dollar imbalance bars.

  • If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.

  • If to_csv=True, None will be returned.
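The interaction between num_prev_bars, exp_num_ticks_init, and exp_num_ticks_constraints can be sketched as follows: after each bar closes, the expected number of ticks E[T] is re-estimated from the tick counts of the last num_prev_bars bars and then clipped to the [minimum, maximum] range, which keeps the bar size from collapsing or exploding. The helper names and the specific EMA recursion (alpha = 2 / (window + 1), seeded with the oldest value) are assumptions for illustration; the library's own weighting may differ.

```python
from typing import Optional, Sequence


def ema_last(values: Sequence[float], window: int) -> float:
    """Last value of a recursive EMA with alpha = 2 / (window + 1) (assumed weighting)."""
    alpha = 2.0 / (window + 1)
    estimate = float(values[0])
    for value in values[1:]:
        estimate = alpha * value + (1.0 - alpha) * estimate
    return estimate


def update_expected_ticks(
    ticks_per_bar: Sequence[float],
    num_prev_bars: int,
    constraints: Optional[Sequence[float]] = None,
) -> float:
    """Re-estimate E[T] from recent bars and clip it to [min, max] (hypothetical helper)."""
    recent = ticks_per_bar[-num_prev_bars:]
    expected = ema_last(recent, num_prev_bars)
    if constraints is None:
        return expected
    lower, upper = constraints
    return min(max(expected, lower), upper)


history = [100, 200, 400]  # tick counts of the three most recent bars
print(update_expected_ticks(history, num_prev_bars=3))                          # 275.0
print(update_expected_ticks(history, num_prev_bars=3, constraints=[50, 250]))   # 250
```

Before any bar has been formed there is no history to average, which is the role exp_num_ticks_init plays as the starting value of E[T].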

get_ema_volume_imbalance_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)

Creates the EMA volume imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).

  • expected_imbalance_window – (int) EMA window used to estimate expected imbalance.

  • exp_num_ticks_init – (int) Initial expected number of ticks per bar.

  • exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.

  • batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.

  • verbose – (bool) Print out batch numbers (True or False).

  • to_csv – (bool) Save bars to csv after every batch run (True or False).

  • analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.

  • output_path – (str) Path to csv file, if to_csv is True.

Returns:

(pd.DataFrame) DataFrame of volume imbalance bars.

  • If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.

  • If to_csv=True, None will be returned.

get_ema_tick_imbalance_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)

Creates the EMA tick imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).

  • expected_imbalance_window – (int) EMA window used to estimate expected imbalance.

  • exp_num_ticks_init – (int) Initial expected number of ticks per bar.

  • exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.

  • batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.

  • verbose – (bool) Print out batch numbers (True or False).

  • to_csv – (bool) Save bars to csv after every batch run (True or False).

  • analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.

  • output_path – (str) Path to csv file, if to_csv is True.

Returns:

(pd.DataFrame) DataFrame of tick imbalance bars.

  • If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.

  • If to_csv=True, None will be returned.

get_const_dollar_imbalance_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)

Creates the Const dollar imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • expected_imbalance_window – (int) EMA window used to estimate expected imbalance.

  • exp_num_ticks_init – (int) Initial expected number of ticks per bar.

  • batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.

  • verbose – (bool) Print out batch numbers (True or False).

  • to_csv – (bool) Save bars to csv after every batch run (True or False).

  • analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.

  • output_path – (str) Path to csv file, if to_csv is True.

Returns:

(pd.DataFrame) DataFrame of dollar imbalance bars.

  • If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.

  • If to_csv=True, None will be returned.
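The sampling rule behind the Const variant can be sketched in a few lines: a bar closes once the absolute cumulative imbalance reaches the threshold E[T] * |E[imbalance]|, where E[T] stays fixed at exp_num_ticks_init and only the expected imbalance is re-estimated when a bar closes. For readability this sketch swaps the EMA for a plain moving average over the last expected_imbalance_window contributions, and it uses a warm-up of exp_num_ticks_init ticks before the first estimate; both choices, and the function names, are assumptions rather than the library's exact logic.

```python
from typing import List, Optional, Sequence


def mean_of_last(values: Sequence[float], window: int) -> float:
    """Plain moving average of the last `window` values (stand-in for the EMA)."""
    recent = values[-window:]
    return sum(recent) / len(recent)


def sample_const_imbalance_bars(
    contributions: Sequence[float],
    exp_num_ticks_init: int,
    expected_imbalance_window: int,
) -> List[int]:
    """Return the index of the closing tick of each bar (hypothetical sketch)."""
    bar_ends: List[int] = []
    history: List[float] = []
    theta = 0.0  # cumulative signed imbalance of the bar being built
    expected_imbalance: Optional[float] = None
    for index, value in enumerate(contributions):
        history.append(value)
        theta += value
        # Warm-up: wait for enough ticks before the first imbalance estimate.
        if expected_imbalance is None and len(history) >= exp_num_ticks_init:
            expected_imbalance = mean_of_last(history, expected_imbalance_window)
        if expected_imbalance is None:
            continue
        # E[T] is constant here, so only |E[imbalance]| moves the threshold.
        threshold = exp_num_ticks_init * abs(expected_imbalance)
        if abs(theta) >= threshold:
            bar_ends.append(index)
            theta = 0.0
            expected_imbalance = mean_of_last(history, expected_imbalance_window)
    return bar_ends


signs = [1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
print(sample_const_imbalance_bars(signs, exp_num_ticks_init=4, expected_imbalance_window=4))
# [3, 5, 7, 11]
```

In the example the threshold doubles from 2 to 4 after the third bar, because the last four contributions are all +1 and the expected imbalance rises from 0.5 to 1.0.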

get_const_volume_imbalance_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)

Creates the Const volume imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • expected_imbalance_window – (int) EMA window used to estimate expected imbalance.

  • exp_num_ticks_init – (int) Initial expected number of ticks per bar.

  • batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.

  • verbose – (bool) Print out batch numbers (True or False).

  • to_csv – (bool) Save bars to csv after every batch run (True or False).

  • analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.

  • output_path – (str) Path to csv file, if to_csv is True.

Returns:

(pd.DataFrame) DataFrame of volume imbalance bars.

  • If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.

  • If to_csv=True, None will be returned.

get_const_tick_imbalance_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)

Creates the Const tick imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.

Parameters:
  • file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].

  • expected_imbalance_window – (int) EMA window used to estimate expected imbalance.

  • exp_num_ticks_init – (int) Initial expected number of ticks per bar.

  • batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.

  • verbose – (bool) Print out batch numbers (True or False).

  • to_csv – (bool) Save bars to csv after every batch run (True or False).

  • analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.

  • output_path – (str) Path to csv file, if to_csv is True.

Returns:

(pd.DataFrame) DataFrame of tick imbalance bars.

  • If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.

  • If to_csv=True, None will be returned.