Information-driven bars sample a new bar when new information arrives in the market. Two types of information-driven bars are implemented: imbalance bars and run bars. Each type is available as tick, volume, and dollar bars.
For those new to the topic, it is discussed in the graduate-level textbook Advances in Financial Machine Learning, Chapter 2.
Warning
This is a very advanced financial data structure, and few if any academic papers have been written about it. Our team has analysed its statistical properties, and it is not clear to us how to use this structure.
We highly recommend that you read the literature on it, as well as the literature on microstructural features, before committing to this data structure.
Run Bars
Run bars share the same mathematical structure as imbalance bars; however, instead of looking at each individual trade, they look at sequences of trades in the same direction. The idea is to detect order flow imbalance caused by actions such as large traders sweeping the order book or iceberg orders.
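To make the idea concrete, here is a minimal sketch of the run statistic for tick run bars as formulated in the textbook: each trade is signed with the tick rule, and the run is the larger of the buy count and the sell count accumulated since the bar opened. The function names and the choice to label the first trade as a buy are assumptions made for illustration; this is not the MlFinLab code.

```python
def tick_rule(prices):
    """Sign each trade: +1 on an uptick, -1 on a downtick, previous sign if unchanged."""
    signs = []
    prev_sign = 1  # assumption: the first trade is labelled a buy
    prev_price = prices[0]
    for price in prices:
        if price > prev_price:
            prev_sign = 1
        elif price < prev_price:
            prev_sign = -1
        signs.append(prev_sign)
        prev_price = price
    return signs


def tick_run(signs):
    """Run statistic: the larger of the buy count and the sell count."""
    buys = sum(1 for s in signs if s == 1)
    sells = sum(1 for s in signs if s == -1)
    return max(buys, sells)


# Illustrative prices only, not real market data.
prices = [100.0, 100.5, 100.5, 100.2, 100.1, 100.1, 100.4]
signs = tick_rule(prices)
theta = tick_run(signs)
print(signs, theta)
```

A bar is sampled once this statistic exceeds an expected value, so persistent one-sided flow closes a bar sooner than balanced flow does.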
Two types of run bars are implemented in MlFinLab:
Expected number of ticks defined as an EWMA (book implementation).
Constant expected number of ticks.
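The difference between the two variants can be sketched as follows. In the EWMA variant the expected number of ticks is re-estimated from the lengths of the most recent bars and optionally clipped to a [min, max] range; in the constant variant it stays at its initial value. The helper names, the smoothing convention (alpha = 2 / (span + 1) with a plain recursive update), and the sample numbers are assumptions for illustration and may differ from what the library does internally.

```python
def ewma_last(values, span):
    """Last value of a simple recursive EWMA (assumed convention: alpha = 2 / (span + 1))."""
    alpha = 2.0 / (span + 1.0)
    avg = float(values[0])
    for value in values[1:]:
        avg = alpha * value + (1.0 - alpha) * avg
    return avg


def expected_num_ticks_ema(prev_bar_lengths, num_prev_bars, constraints=None):
    """EWMA variant: estimate E[T] from the last `num_prev_bars` bar lengths."""
    window = prev_bar_lengths[-num_prev_bars:]
    estimate = ewma_last(window, num_prev_bars)
    if constraints is not None:
        lower, upper = constraints
        estimate = min(max(estimate, lower), upper)  # keep E[T] inside the allowed range
    return estimate


def expected_num_ticks_const(exp_num_ticks_init):
    """Constant variant: E[T] never moves away from its initial value."""
    return float(exp_num_ticks_init)


# Hypothetical bar lengths (ticks per bar), for illustration only.
bar_lengths = [500, 120, 80, 100]
ema_estimate = expected_num_ticks_ema(bar_lengths, num_prev_bars=3)
clipped_estimate = expected_num_ticks_ema(bar_lengths, num_prev_bars=3, constraints=[150, 1000])
const_estimate = expected_num_ticks_const(20000)
print(ema_estimate, clipped_estimate, const_estimate)
```

Clipping of this kind is what the exp_num_ticks_constraints parameter is described as providing: it stops the estimate from drifting to very small or very large bar sizes.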
Implementations
The two implementations discussed in the previous section are documented below.
EMA Version
Tick Bars
- get_ema_tick_run_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] | None = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the EMA tick run bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).
expected_imbalance_window – (int) EMA window used to estimate expected run.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample run bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of tick run bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample run bars will also be returned.
If to_csv=True, None will be returned.
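As a sketch of the input these functions expect, the snippet below writes a small csv file with the three columns [date_time, price, volume] using only the standard library. The tick values are made up, and the header names and timestamp format are assumptions; check them against your own data before passing the path as file_path_or_df.

```python
import csv
import os
import tempfile

# Made-up ticks for illustration only: (date_time, price, volume).
ticks = [
    ("2023-01-03 09:30:00.100", 100.00, 50),
    ("2023-01-03 09:30:00.250", 100.25, 10),
    ("2023-01-03 09:30:01.000", 100.25, 25),
    ("2023-01-03 09:30:01.400", 100.00, 40),
]


def write_tick_csv(path, rows):
    """Write rows in the column order date_time, price, volume."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date_time", "price", "volume"])  # assumed header names
        writer.writerows(rows)


tick_csv_path = os.path.join(tempfile.mkdtemp(), "ticks.csv")
write_tick_csv(tick_csv_path, ticks)

# Read the file back to confirm the column order.
with open(tick_csv_path, newline="") as handle:
    reader = csv.reader(handle)
    header = next(reader)
    num_rows = sum(1 for _ in reader)
print(header, num_rows)
```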
Volume Bars
- get_ema_volume_run_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] | None = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the EMA volume run bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).
expected_imbalance_window – (int) EMA window used to estimate expected run.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample run bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of volume run bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample run bars will also be returned.
If to_csv=True, None will be returned.
Dollar Bars
- get_ema_dollar_run_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] | None = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the EMA dollar run bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).
expected_imbalance_window – (int) EMA window used to estimate expected run.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample run bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of dollar run bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample run bars will also be returned.
If to_csv=True, None will be returned.
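The tick, volume, and dollar variants differ only in how much each trade contributes to the run. A minimal sketch, following the textbook formulation: a tick contributes 1, a volume run contributes the traded volume, and a dollar run contributes price times volume. The function name, the kind argument, and the sample data are hypothetical and used for illustration only.

```python
def run_statistic(signs, prices, volumes, kind="tick"):
    """Larger of the buy-side and sell-side totals, weighted by the bar type."""
    buy_total = 0.0
    sell_total = 0.0
    for sign, price, volume in zip(signs, prices, volumes):
        if kind == "tick":
            weight = 1.0
        elif kind == "volume":
            weight = float(volume)
        elif kind == "dollar":
            weight = price * volume
        else:
            raise ValueError(f"unknown kind: {kind}")
        if sign == 1:
            buy_total += weight
        else:
            sell_total += weight
    return max(buy_total, sell_total)


# Illustrative trades only: two large buys followed by three small sells.
signs = [1, 1, -1, -1, -1]
prices = [10.0, 11.0, 10.0, 9.0, 8.0]
volumes = [5, 5, 1, 1, 1]

tick_theta = run_statistic(signs, prices, volumes, kind="tick")
volume_theta = run_statistic(signs, prices, volumes, kind="volume")
dollar_theta = run_statistic(signs, prices, volumes, kind="dollar")
print(tick_theta, volume_theta, dollar_theta)
```

In this sample the sell side dominates by trade count, while the buy side dominates by volume and by dollar value, which is why the three bar types can sample at different times on the same data.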
Constant Version
Tick Bars
- get_const_tick_run_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the Const tick run bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for estimating buy ticks proportion (number of previous bars to use in EWMA).
expected_imbalance_window – (int) EMA window used to estimate expected run.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample run bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of tick run bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample run bars will also be returned.
If to_csv=True, None will be returned.
Volume Bars
- get_const_volume_run_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the Const volume run bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for estimating buy ticks proportion (number of previous bars to use in EWMA).
expected_imbalance_window – (int) EMA window used to estimate expected run.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample run bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of volume run bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample run bars will also be returned.
If to_csv=True, None will be returned.
Dollar Bars
- get_const_dollar_run_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the Const dollar run bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for estimating buy ticks proportion (number of previous bars to use in EWMA).
expected_imbalance_window – (int) EMA window used to estimate expected run.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample run bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of dollar run bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample run bars will also be returned.
If to_csv=True, None will be returned.
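For intuition about the constant variant, the sampling rule can be sketched in a few lines. Following the textbook formulation for tick run bars, a bar closes when the run reaches the expected number of ticks multiplied by max(P[buy], 1 - P[buy]). In this simplified sketch both the expected number of ticks and the buy probability are held fixed, whereas a full implementation would update the probability as bars are formed. The function name and inputs are hypothetical.

```python
def sample_tick_run_bars(signs, exp_num_ticks, prob_buy=0.5):
    """Return the index of the last tick in each sampled bar."""
    # Simplification: the threshold is fixed for the whole series.
    threshold = exp_num_ticks * max(prob_buy, 1.0 - prob_buy)
    bar_ends = []
    buys = 0
    sells = 0
    for i, sign in enumerate(signs):
        if sign == 1:
            buys += 1
        else:
            sells += 1
        if max(buys, sells) >= threshold:
            bar_ends.append(i)  # close the bar and start a new one
            buys = 0
            sells = 0
    return bar_ends


# Illustrative trade signs only.
one_sided = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
balanced = [1, -1, 1, -1, 1, -1, 1, -1]

one_sided_bars = sample_tick_run_bars(one_sided, exp_num_ticks=6)
balanced_bars = sample_tick_run_bars(balanced, exp_num_ticks=6)
print(one_sided_bars, balanced_bars)
```

With the same threshold, the one-sided sequence closes three bars in ten ticks while the balanced sequence closes only one in eight, which is the behaviour run bars are designed to capture.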
Example
>>> from mlfinlab.data_structures.imbalance_data_structures import (
... get_ema_dollar_imbalance_bars,
... get_const_dollar_imbalance_bars,
... )
>>> tick_data_url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/processed_tick_data.csv"
>>> # EMA Dollar Imbalance Bars
>>> dollar_imbalance_ema, thresholds_df = get_ema_dollar_imbalance_bars(
... tick_data_url,
... num_prev_bars=3,
... exp_num_ticks_init=10000,
... exp_num_ticks_constraints=[100, 1000],
... expected_imbalance_window=1000,
... batch_size=10000,
... verbose=False,
... analyse_thresholds=True,
... )
>>> len(dollar_imbalance_ema)
130
>>> len(thresholds_df)
548575
>>> # Const Dollar Imbalance Bars
>>> dollar_imbalance_const = get_const_dollar_imbalance_bars(
... tick_data_url,
... exp_num_ticks_init=10000,
... expected_imbalance_window=1000,
... batch_size=10000,
... verbose=False,
... )
>>> len(dollar_imbalance_const)
3