Information-driven bars are based on the notion of sampling a bar when new information arrives at the market. The two types of information-driven bars implemented are imbalance bars and run bars. For each type, tick, volume, and dollar bars are included.
For those new to the topic, it is discussed in the graduate level textbook: Advances in Financial Machine Learning, Chapter 2.
Warning
This is a very advanced financial data structure, and few if any academic papers have been written about it. Our team has analysed its statistical properties, and it is not clear to us how to use this structure.
We highly recommend that you read the literature, as well as the literature on microstructural features, before committing to this data structure.
Imbalance Bars
Two types of imbalance bars are implemented in MlFinLab. They differ in how the expected number of ticks is obtained:
Expected number of ticks estimated as an EMA (book implementation).
Expected number of ticks held constant.
The Generation Algorithm
Let’s walk through the generation of imbalance bars using volume imbalance bars as an example, following the description in the book Advances in Financial Machine Learning.
First, let’s define the tick rule:
For any given \(t\), where \(p_t\) is the price associated with \(t\) and \(v_t\) is the volume, the tick rule \(b_t\) is defined as:
\[b_t = \begin{cases} b_{t-1}, & \text{if } \Delta p_t = 0 \\ |\Delta p_t| / \Delta p_t, & \text{if } \Delta p_t \neq 0 \end{cases}\]
where \(\Delta p_t = p_t - p_{t-1}\).
The tick rule is used as a proxy for trade direction. Some data providers already supply the tick direction; in that case there is no need to calculate the tick rule, and the provided tick direction can be used instead.
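The tick rule can be illustrated with a minimal sketch. The function name `tick_rule` and the `initial_direction` argument are assumptions made for this illustration and are not part of MlFinLab; the value used for the very first tick, where no previous price exists, is a free choice.

```python
from typing import Iterable, List


def tick_rule(prices: Iterable[float], initial_direction: int = 1) -> List[int]:
    """Return b_t for each price: the sign of the price change,
    carried forward from the previous tick when the price is unchanged."""
    directions: List[int] = []
    prev_price = None
    # Assumption: direction assigned before any price change has been seen.
    prev_direction = initial_direction
    for price in prices:
        if prev_price is not None and price != prev_price:
            prev_direction = 1 if price > prev_price else -1
        directions.append(prev_direction)
        prev_price = price
    return directions


signs = tick_rule([100.0, 100.5, 100.5, 100.2, 100.2, 100.7])
print(signs)  # [1, 1, 1, -1, -1, 1]
```

Unchanged prices inherit the direction of the last price move, which is why the third and fifth ticks repeat the sign of the tick before them.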
The cumulative volume imbalance from \(1\) to \(T\) is defined as:
\[\theta_T = \sum_{t=1}^{T} b_t v_t\]
where \(T\) is the time when the bar is sampled.
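As a small worked example of the cumulative volume imbalance, the sketch below accumulates signed volumes tick by tick. The tick directions and volumes are made-up illustrative values.

```python
from itertools import accumulate

tick_directions = [1, 1, -1, -1, 1]      # b_t, e.g. obtained from the tick rule
volumes = [10.0, 5.0, 20.0, 5.0, 15.0]   # v_t

# Signed volume b_t * v_t for every tick.
signed_volumes = [b * v for b, v in zip(tick_directions, volumes)]

# theta[k] is the cumulative volume imbalance after k + 1 ticks.
theta = list(accumulate(signed_volumes))
print(theta)  # [10.0, 15.0, -5.0, -10.0, 5.0]
```

A large sell at the third tick flips the sign of the imbalance, even though most ticks so far were buys.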
Next we need to define \(E_0[T]\), the expected number of ticks per bar. The book suggests using an exponentially weighted moving average (EWMA) of the number of ticks in previously generated bars. This introduces the first hyperparameter for imbalance bar generation: num_prev_bars, which is the window used for the EWMA calculation.
Here we face the problem of generating the first bar, because the expected number of ticks is not known upfront. To solve this we introduce the second hyperparameter: expected_num_ticks_init, which is the initial guess for the expected number of ticks before the first imbalance bar is generated.
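A minimal sketch of how the two hyperparameters could interact is shown below. The helper names, the smoothing factor `alpha = 2 / (window + 1)`, and the choice to feed only the last `num_prev_bars` bar lengths into the EWMA are assumptions made for illustration; the library's own EWMA may weight observations differently.

```python
from typing import List, Sequence


def ewma_last(values: Sequence[float], window: int) -> float:
    """Last value of a simple recursive EWMA with alpha = 2 / (window + 1)."""
    if not values:
        raise ValueError("values must not be empty")
    alpha = 2.0 / (window + 1)
    estimate = float(values[0])
    for value in values[1:]:
        estimate += alpha * (value - estimate)
    return estimate


def expected_num_ticks(bar_lengths: List[int], num_prev_bars: int,
                       expected_num_ticks_init: int) -> float:
    """Initial guess before the first bar, EWMA of recent bar lengths afterwards."""
    if not bar_lengths:
        return float(expected_num_ticks_init)
    return ewma_last(bar_lengths[-num_prev_bars:], num_prev_bars)


print(expected_num_ticks([], 3, 1000))               # 1000.0
print(expected_num_ticks([100, 200, 400], 3, 1000))  # 275.0
```

With a window of 3 the smoothing factor is 0.5, so the estimate moves from 100 to 150 to 275 as the two longer bars arrive.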
A bar is sampled when:
\[|\theta_t| \geq E_0[T] * |2v^+ - E_0[v_t]|\]
To estimate \(2v^+ - E_0[v_t]\) (the expected imbalance) we simply calculate the EWMA of the volume imbalance from previous bars, which is why the volume imbalances need to be stored in an imbalance array. The window for the estimate is expected_num_ticks_init before the first bar is sampled, and the expected number of ticks (\(E_0[T]\)) * num_prev_bars once the first bar has been generated.
Note that once at least one imbalance bar has been generated, \(2v^+ - E_0[v_t]\) is updated only when the next bar is sampled, not on every trade observed.
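The sampling condition itself is a single comparison. The sketch below uses a hypothetical helper name, `should_sample_bar`, and made-up numbers to show that only the magnitudes of the cumulative and expected imbalance matter.

```python
def should_sample_bar(cum_theta: float, expected_num_ticks: float,
                      expected_imbalance: float) -> bool:
    """True when |theta_t| reaches E_0[T] * |expected imbalance|."""
    threshold = expected_num_ticks * abs(expected_imbalance)
    return abs(cum_theta) >= threshold


# Threshold is 100 * |-10| = 1000 in all three calls.
print(should_sample_bar(1200.0, 100.0, -10.0))   # True
print(should_sample_bar(-900.0, 100.0, -10.0))   # False
print(should_sample_bar(-1000.0, 100.0, 10.0))   # True
```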
Algorithm Logic
Now that we have understood the logic of the imbalance bar generation, let’s understand the process in further detail by examining some pseudo-code of the implementation below:
# Pseudo code
num_prev_bars = 3
expected_num_ticks_init = 100000
expected_num_ticks = expected_num_ticks_init
expected_imbalance = None  # unknown until enough ticks have been observed
cum_theta = 0
num_ticks = 0
imbalance_array = []
imbalance_bars = []
bar_length_array = []
prev_price = None
for row in data.rows:
    # Track high, low, close, volume info
    num_ticks += 1
    tick_rule = get_tick_rule(row["price"], prev_price)
    prev_price = row["price"]
    volume_imbalance = tick_rule * row["volume"]
    imbalance_array.append(volume_imbalance)
    cum_theta += volume_imbalance
    # Before the first bar, re-estimate the expected imbalance on every tick
    if len(imbalance_bars) == 0 and len(imbalance_array) >= expected_num_ticks_init:
        expected_imbalance = ewma(imbalance_array, window=expected_num_ticks_init)
    if expected_imbalance is None:
        continue  # not enough ticks yet to evaluate the threshold
    if abs(cum_theta) >= expected_num_ticks * abs(expected_imbalance):
        bar = form_bar(open, high, low, close, volume)
        imbalance_bars.append(bar)
        bar_length_array.append(num_ticks)
        cum_theta, num_ticks = 0, 0
        # Expectations are updated only when a bar is sampled
        expected_num_ticks = ewma(bar_length_array, window=num_prev_bars)
        expected_imbalance = ewma(
            imbalance_array, window=num_prev_bars * expected_num_ticks
        )
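For readers who want something they can execute, the pseudo-code can be turned into a small self-contained sketch. Everything here is an illustrative assumption rather than the MlFinLab implementation: the function name `volume_imbalance_bars`, the simple recursive EWMA, the `(price, volume)` tuple input, treating the first tick as a buy, and using only the last `num_prev_bars` bar lengths.

```python
from typing import List, Optional, Sequence, Tuple


def ewma_last(values: Sequence[float], window: int) -> float:
    """Last value of a simple recursive EWMA with alpha = 2 / (window + 1)."""
    alpha = 2.0 / (window + 1)
    estimate = float(values[0])
    for value in values[1:]:
        estimate += alpha * (value - estimate)
    return estimate


def volume_imbalance_bars(ticks: Sequence[Tuple[float, float]],
                          num_prev_bars: int = 3,
                          expected_num_ticks_init: int = 4) -> List[dict]:
    """Sample volume imbalance bars from (price, volume) ticks."""
    expected_num_ticks = float(expected_num_ticks_init)
    expected_imbalance: Optional[float] = None
    cum_theta, num_ticks = 0.0, 0
    prev_price, direction = None, 1  # assumption: first tick counts as a buy
    imbalances: List[float] = []
    bar_lengths: List[int] = []
    bars: List[dict] = []
    bar_prices: List[float] = []
    bar_volume = 0.0

    for price, volume in ticks:
        # Tick rule: sign of the price change, carried forward if unchanged.
        if prev_price is not None and price != prev_price:
            direction = 1 if price > prev_price else -1
        prev_price = price

        num_ticks += 1
        bar_prices.append(price)
        bar_volume += volume
        imbalance = direction * volume
        imbalances.append(imbalance)
        cum_theta += imbalance

        # Warm-up: before the first bar, re-estimate on every tick.
        if not bars and len(imbalances) >= expected_num_ticks_init:
            expected_imbalance = ewma_last(
                imbalances[-expected_num_ticks_init:], expected_num_ticks_init)
        if expected_imbalance is None:
            continue

        if abs(cum_theta) >= expected_num_ticks * abs(expected_imbalance):
            bars.append({"open": bar_prices[0], "high": max(bar_prices),
                         "low": min(bar_prices), "close": bar_prices[-1],
                         "volume": bar_volume, "num_ticks": num_ticks})
            bar_lengths.append(num_ticks)
            # Reset theta and per-bar state once the bar is formed.
            cum_theta, num_ticks = 0.0, 0
            bar_prices, bar_volume = [], 0.0
            # Update both expectations only when a bar is sampled.
            expected_num_ticks = ewma_last(bar_lengths[-num_prev_bars:], num_prev_bars)
            window = max(1, int(num_prev_bars * expected_num_ticks))
            expected_imbalance = ewma_last(imbalances[-window:], window)
    return bars


ticks = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0), (5.0, 1.0), (6.0, 1.0)]
bars = volume_imbalance_bars(ticks, num_prev_bars=3, expected_num_ticks_init=2)
print([bar["num_ticks"] for bar in bars])  # [2, 2, 2]
```

With a steady one-sided flow of unit volume the expected imbalance stays at 1, so each bar closes after exactly the expected number of ticks.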
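Note that in the pseudo-code we reset \(\theta_t\) when a bar is formed, so in our case the formula for \(\theta_t\) is:
\[\theta_t = \sum_{i=t^*}^{t} b_i v_i\]
where \(t^*\) is the first tick after the previous bar was sampled.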
Let’s look at the dynamics of \(|\theta_t|\) and \(E_0[T] * |2v^+ - E_0[v_t]|\) to understand why we decided to reset \(\theta_t\) when a bar is formed. The following figure highlights the dynamics when the theta value is reset:
Note that over the first set of ticks the threshold is not stable. Before the first bar is generated, the expected imbalance is recalculated on every tick with window = expected_num_ticks_init, which is why it changes with every tick. After the first bar has been generated, both the expected number of ticks (\(E_0[T]\)) and the expected volume imbalance (\(2v^+ - E_0[v_t]\)) are updated only when the next bar is generated.
When theta is not reset:
When the theta value is not reset, it keeps accumulating across bars \(\Rightarrow\) the condition is met after a small number of ticks \(\Rightarrow\) the length of the next bar converges to 1 \(\Rightarrow\) a bar is sampled on every consecutive tick.
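The effect of the reset can be reproduced with a deliberately simplified sketch. The expected imbalance is held fixed at 1 to isolate the role of theta, and the function name, EWMA form, and input values are assumptions for illustration only.

```python
from typing import List, Sequence


def ewma_last(values: Sequence[float], window: int) -> float:
    """Last value of a simple recursive EWMA with alpha = 2 / (window + 1)."""
    alpha = 2.0 / (window + 1)
    estimate = float(values[0])
    for value in values[1:]:
        estimate += alpha * (value - estimate)
    return estimate


def sample_bar_lengths(signed_volumes: Sequence[float], expected_num_ticks_init: int,
                       reset_theta: bool, num_prev_bars: int = 3) -> List[int]:
    """Return the number of ticks in each sampled bar."""
    expected_num_ticks = float(expected_num_ticks_init)
    expected_imbalance = 1.0  # assumption: fixed, to isolate the effect of the reset
    cum_theta, num_ticks = 0.0, 0
    lengths: List[int] = []
    for imbalance in signed_volumes:
        num_ticks += 1
        cum_theta += imbalance
        if abs(cum_theta) >= expected_num_ticks * abs(expected_imbalance):
            lengths.append(num_ticks)
            num_ticks = 0
            if reset_theta:
                cum_theta = 0.0
            expected_num_ticks = ewma_last(lengths[-num_prev_bars:], num_prev_bars)
    return lengths


flow = [1.0] * 12  # twelve buy ticks of unit volume
with_reset = sample_bar_lengths(flow, expected_num_ticks_init=3, reset_theta=True)
without_reset = sample_bar_lengths(flow, expected_num_ticks_init=3, reset_theta=False)
print(with_reset)     # [3, 3, 3, 3]
print(without_reset)  # [3, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Without the reset, theta already exceeds the threshold when the next bar starts, so every subsequent tick closes a bar and the expected number of ticks decays towards 1.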
The logic described above is implemented in the MlFinLab package under ImbalanceBars.
Implementations
The two implementations discussed in the previous section, EMA and constant expected number of ticks, are documented below.
EMA Version
Tick Bars
- get_ema_tick_imbalance_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] | None = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the EMA tick imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).
expected_imbalance_window – (int) EMA window used to estimate expected imbalance.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of tick imbalance bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.
If to_csv=True, None will be returned.
Volume Bars
- get_ema_volume_imbalance_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] | None = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the EMA volume imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).
expected_imbalance_window – (int) EMA window used to estimate expected imbalance.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of volume imbalance bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.
If to_csv=True, None will be returned.
Dollar Bars
- get_ema_dollar_imbalance_bars(file_path_or_df: str | Iterable[str] | DataFrame, num_prev_bars: int = 3, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, exp_num_ticks_constraints: List[float] | None = None, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the EMA dollar imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
num_prev_bars – (int) Window size for E[T]s (number of previous bars to use for expected number of ticks estimation).
expected_imbalance_window – (int) EMA window used to estimate expected imbalance.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
exp_num_ticks_constraints – (list) Minimum and maximum possible number of expected ticks. Used to control bars sampling convergence.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of dollar imbalance bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.
If to_csv=True, None will be returned.
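The exp_num_ticks_constraints parameter is described above as a minimum and maximum for the expected number of ticks. One plausible way such bounds could be applied is a simple clamp after each EWMA update, sketched below. This is an assumption about the mechanism, and `constrain_expected_ticks` is a hypothetical helper, not a library function.

```python
from typing import Optional, Sequence


def constrain_expected_ticks(expected_num_ticks: float,
                             constraints: Optional[Sequence[float]] = None) -> float:
    """Clamp the expected number of ticks to [min_ticks, max_ticks] if bounds are given."""
    if constraints is None:
        return expected_num_ticks
    min_ticks, max_ticks = constraints
    return min(max(expected_num_ticks, min_ticks), max_ticks)


print(constrain_expected_ticks(37.5, [100, 1000]))    # 100
print(constrain_expected_ticks(250.0, [100, 1000]))   # 250.0
print(constrain_expected_ticks(5000.0, [100, 1000]))  # 1000
print(constrain_expected_ticks(37.5))                 # 37.5
```

A lower bound of this kind keeps the expected number of ticks from collapsing towards 1, and an upper bound keeps it from growing until bars are almost never sampled.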
Constant Version
Tick Bars
- get_const_tick_imbalance_bars(file_path_or_df: str | Iterable[str] | DataFrame, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the Const tick imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
expected_imbalance_window – (int) EMA window used to estimate expected imbalance.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of tick imbalance bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.
If to_csv=True, None will be returned.
Volume Bars
- get_const_volume_imbalance_bars(file_path_or_df: str | Iterable[str] | DataFrame, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the Const volume imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
expected_imbalance_window – (int) EMA window used to estimate expected imbalance.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of volume imbalance bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.
If to_csv=True, None will be returned.
Dollar Bars
- get_const_dollar_imbalance_bars(file_path_or_df: str | Iterable[str] | DataFrame, expected_imbalance_window: int = 10000, exp_num_ticks_init: int = 20000, batch_size: int = 20000000.0, analyse_thresholds: bool = False, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
Creates the Const dollar imbalance bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or Pandas Data Frame containing raw tick data in the format [date_time, price, volume].
expected_imbalance_window – (int) EMA window used to estimate expected imbalance.
exp_num_ticks_init – (int) Initial expected number of ticks per bar.
batch_size – (int) The number of rows per batch. Less RAM = smaller batch size.
verbose – (bool) Print out batch numbers (True or False).
to_csv – (bool) Save bars to csv after every batch run (True or False).
analyse_thresholds – (bool) Flag to save and return thresholds used to sample imbalance bars.
output_path – (str) Path to csv file, if to_csv is True.
- Returns:
(pd.DataFrame) DataFrame of dollar imbalance bars.
If analyse_thresholds=True, an additional DataFrame of thresholds used to sample imbalance bars will also be returned.
If to_csv=True, None will be returned.
Example
>>> from mlfinlab.data_structures.imbalance_data_structures import (
...     get_ema_dollar_imbalance_bars,
...     get_const_dollar_imbalance_bars,
... )
>>> # Get processed tick data csv from url
>>> tick_data_url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/processed_tick_data.csv"
>>> # EMA Dollar Imbalance Bars
>>> dollar_imbalance_ema, df_thresholds = get_ema_dollar_imbalance_bars(
...     tick_data_url,
...     num_prev_bars=3,
...     exp_num_ticks_init=10000,
...     exp_num_ticks_constraints=[100, 1000],
...     expected_imbalance_window=1000,
...     batch_size=10000,
...     verbose=False,
...     analyse_thresholds=True,
... )  # Also return thresholds
>>> len(dollar_imbalance_ema)
130
>>> dollar_imbalance_ema.columns
Index(['tick_num', 'open', 'high', 'low', 'close', 'volume', 'cum_buy_volume', 'cum_ticks', 'cum_dollar_value'],...
>>> # Const Dollar Imbalance Bars
>>> dollar_imbalance_const = get_const_dollar_imbalance_bars(
...     tick_data_url,
...     exp_num_ticks_init=10000,
...     expected_imbalance_window=1000,
...     batch_size=10000,
...     verbose=False,
... )
>>> len(dollar_imbalance_const)
3
>>> dollar_imbalance_const.columns
Index(['tick_num', 'open', 'high', 'low', 'close', 'volume', 'cum_buy_volume', 'cum_ticks', 'cum_dollar_value'],...
Research Notebooks
The following research notebooks can be used to better understand the previously discussed data structures.