mlfinlab.data_structures.standard_data_structures
Advances in Financial Machine Learning, Marcos Lopez de Prado, Chapter 2: Financial Data Structures
This module contains functions that help users build structured financial data from raw, unstructured tick data, in the form of time, tick, volume, and dollar bars.
These bars are used throughout the textbook (Advances in Financial Machine Learning, by Marcos Lopez de Prado, 2018, p. 25) to build more informative features for predicting financial time series.
Tick, volume, and dollar bars have better statistical properties than bars sampled at fixed time intervals. For background, see the paper "The Volume Clock: Insights into the High Frequency Paradigm" by Lopez de Prado et al.
Many later projects require dollar and volume bars.
Module Contents
Classes
StandardBars | Contains all of the logic to construct the standard bars from chapter 2. This class shouldn't be used directly.
Functions
get_dollar_bars | Creates the dollar bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
get_volume_bars | Creates the volume bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value, average_volume.
get_tick_bars | Creates the tick bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- class StandardBars(metric: str, threshold: int = 50000, batch_size: int = 20000000)
-
Bases:
mlfinlab.data_structures.base_bars.BaseBars
Contains all of the logic to construct the standard bars from chapter 2. This class shouldn't be used directly. Instead, use the package functions such as get_dollar_bars, which create an instance of this class, construct the standard bars, and return them to the user.
This keeps the logic as simple as possible for the end user.
- __slots__ = ()
- batch_run(file_path_or_df: str | Iterable[str] | pandas.DataFrame, verbose: bool = True, to_csv: bool = False, output_path: str | None = None) pandas.DataFrame | None
-
Reads csv file(s) or a pd.DataFrame in batches and then constructs the financial data structure in the form of a DataFrame. The csv file or DataFrame must have exactly 3 columns: date_time, price, and volume.
- Parameters:
-
-
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or pandas DataFrame containing raw tick data in the format [date_time, price, volume].
-
verbose – (bool) Whether to print a message for each processed batch.
-
to_csv – (bool) If True, write the generated bars to a local csv file; otherwise return them as an in-memory DataFrame.
-
output_path – (str) Path to the results file, used if to_csv is True.
-
- Returns:
-
(pd.DataFrame or None) Financial data structure.
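The batching idea can be sketched with the standard library alone. This is only an illustration under stated assumptions, not mlfinlab's implementation: `iter_batches` is a hypothetical helper, and the sample rows are made up for the example.

```python
import csv
import io
from itertools import islice


def iter_batches(handle, batch_size):
    """Yield lists of (date_time, price, volume) rows, each at most batch_size long.

    Hypothetical helper for illustration; not part of mlfinlab.
    """
    reader = csv.reader(handle)
    header = next(reader)  # expected header: date_time, price, volume
    if len(header) != 3:
        raise ValueError("expected exactly 3 columns: date_time, price, volume")
    while True:
        # islice pulls at most batch_size rows without loading the whole file.
        batch = [(row[0], float(row[1]), float(row[2]))
                 for row in islice(reader, batch_size)]
        if not batch:
            return
        yield batch


# Made-up tick data standing in for a csv file on disk.
raw_csv = io.StringIO(
    "date_time,price,volume\n"
    "2020-01-01 09:30:00,100.0,10\n"
    "2020-01-01 09:30:01,101.0,20\n"
    "2020-01-01 09:30:02,99.0,30\n"
    "2020-01-01 09:30:03,100.0,5\n"
    "2020-01-01 09:30:04,102.0,40\n"
)

batches = list(iter_batches(raw_csv, batch_size=2))
batch_lengths = [len(batch) for batch in batches]
```

With five rows and a batch size of 2, the rows arrive in three batches. Any real batch processor also has to carry a bar that is still open at the end of one batch into the next one, which this sketch leaves out.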
- run(data: list | tuple | pandas.DataFrame) list
-
Reads a list, tuple, or DataFrame and then constructs the financial data structure in the form of a list. The list, tuple, or DataFrame must have exactly 3 fields: date_time, price, and volume.
- Parameters:
-
data – (list/tuple/pd.DataFrame) List, tuple, or DataFrame containing raw tick data in the format [date_time, price, volume].
- Returns:
-
(list) Financial data structure.
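To make the sampling rule concrete, here is a minimal pure-Python sketch of threshold-based bar construction. It is not the library's code: `build_bars` is a hypothetical function, the metric names are assumed labels, the comparison is assumed to be inclusive (`>=`), and `cum_buy_volume` is omitted for brevity.

```python
def build_bars(ticks, metric, threshold):
    """Group (date_time, price, volume) ticks into bars.

    metric is one of "cum_ticks", "cum_volume", "cum_dollar_value" (assumed
    names). A bar closes once the running metric reaches the threshold.
    Each bar is [date_time, open, high, low, close, volume, cum_ticks,
    cum_dollar_value].
    """
    totals = {"cum_ticks": 0, "cum_volume": 0.0, "cum_dollar_value": 0.0}
    if metric not in totals:
        raise ValueError(f"unknown metric: {metric}")
    bars = []
    open_price = high = low = None
    for date_time, price, volume in ticks:
        if open_price is None:  # first tick of a new bar
            open_price = high = low = price
        high = max(high, price)
        low = min(low, price)
        totals["cum_ticks"] += 1
        totals["cum_volume"] += volume
        totals["cum_dollar_value"] += price * volume
        if totals[metric] >= threshold:  # inclusive boundary is an assumption
            bars.append([date_time, open_price, high, low, price,
                         totals["cum_volume"], totals["cum_ticks"],
                         totals["cum_dollar_value"]])
            # Reset the running state for the next bar.
            totals = {"cum_ticks": 0, "cum_volume": 0.0, "cum_dollar_value": 0.0}
            open_price = high = low = None
    return bars


# Made-up ticks: (date_time, price, volume).
ticks = [
    ("t1", 100.0, 10),
    ("t2", 101.0, 20),
    ("t3", 99.0, 30),
    ("t4", 100.0, 5),
    ("t5", 102.0, 40),
]

volume_bars = build_bars(ticks, "cum_volume", threshold=30)
tick_bars = build_bars(ticks, "cum_ticks", threshold=2)
```

The same loop serves all three bar types; only the metric that is compared with the threshold changes. Ticks left over after the last completed bar (here `t5` for the tick bars) are not emitted.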
- get_dollar_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, threshold: float | pandas.Series = 70000000, batch_size: int = 20000000, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
-
Creates the dollar bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
Following the paper "The Volume Clock: Insights into the High Frequency Paradigm" by Lopez de Prado et al., a threshold of 1/50 of the average daily dollar value is suggested, as it results in more desirable statistical properties.
- Parameters:
-
-
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or pandas DataFrame containing raw tick data in the format [date_time, price, volume].
-
threshold – (float/pd.Series) A cumulative value above this threshold triggers a sample to be taken. If a series is given, then at each sampling time the closest previous threshold is used. The series should contain values only at the times when the threshold changes, not for every observation.
-
batch_size – (int) The number of rows per batch. Use a smaller batch size when less RAM is available.
-
verbose – (bool) Print out batch numbers (True or False).
-
to_csv – (bool) Save bars to csv after every batch run (True or False).
-
output_path – (str) Path to csv file, if to_csv is True.
-
- Returns:
-
(pd.DataFrame) Dataframe of dollar bars.
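The 1/50 rule of thumb mentioned above can be worked through as a small calculation. The sketch below is an illustration, not a library function: `suggested_dollar_threshold` is a hypothetical helper, the ticks are made up, and it assumes timestamps are ISO strings whose first 10 characters are the date.

```python
from collections import defaultdict


def suggested_dollar_threshold(ticks, bars_per_day=50):
    """Return (average daily dollar value) / bars_per_day.

    Hypothetical helper; assumes date_time strings start with "YYYY-MM-DD".
    """
    daily_dollar_value = defaultdict(float)
    for date_time, price, volume in ticks:
        day = date_time[:10]
        daily_dollar_value[day] += price * volume
    if not daily_dollar_value:
        raise ValueError("no ticks supplied")
    average_daily = sum(daily_dollar_value.values()) / len(daily_dollar_value)
    return average_daily / bars_per_day


# Made-up ticks covering two trading days.
sample_ticks = [
    ("2020-01-01 09:30:00", 100.0, 1000),  # 100,000 dollars
    ("2020-01-01 15:59:00", 100.0, 1500),  # 150,000 dollars
    ("2020-01-02 09:30:00", 50.0, 3000),   # 150,000 dollars
]

dollar_threshold = suggested_dollar_threshold(sample_ticks)
```

Here the two days trade 250,000 and 150,000 dollars, so the average is 200,000 and the suggested threshold is 4,000. The result could then be passed as the threshold argument.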
- get_volume_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, threshold: float | pandas.Series = 70000000, batch_size: int = 20000000, verbose: bool = True, to_csv: bool = False, output_path: str | None = None, average: bool = False)
-
Creates the volume bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value, average_volume.
Following the paper "The Volume Clock: Insights into the High Frequency Paradigm" by Lopez de Prado et al., a threshold of 1/50 of the average daily volume is suggested, as it results in more desirable statistical properties.
- Parameters:
-
-
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or pandas DataFrame containing raw tick data in the format [date_time, price, volume].
-
threshold – (float/pd.Series) A cumulative value above this threshold triggers a sample to be taken. If a series is given, then at each sampling time the closest previous threshold is used. The series should contain values only at the times when the threshold changes, not for every observation.
-
batch_size – (int) The number of rows per batch. Use a smaller batch size when less RAM is available.
-
verbose – (bool) Print out batch numbers (True or False).
-
to_csv – (bool) Save bars to csv after every batch run (True or False).
-
output_path – (str) Path to csv file, if to_csv is True.
-
average – (bool) If set to True, the average volume traded per candle is added to the output.
-
- Returns:
-
(pd.DataFrame) Dataframe of volume bars.
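The "closest previous threshold" rule for a time-varying threshold can be illustrated with a sorted lookup. This is a sketch of the idea only, using the standard library in place of a pd.Series; `threshold_at`, the change times, and the values are all hypothetical.

```python
from bisect import bisect_right

# Times at which the threshold changes, sorted ascending, with the value that
# applies from each time onward. ISO timestamp strings sort chronologically.
change_times = ["2020-01-01 00:00:00", "2020-01-03 00:00:00"]
change_values = [1000.0, 2500.0]


def threshold_at(timestamp):
    """Return the most recent threshold set at or before timestamp."""
    idx = bisect_right(change_times, timestamp) - 1
    if idx < 0:
        raise ValueError("no threshold defined at or before " + timestamp)
    return change_values[idx]


before_change = threshold_at("2020-01-02 12:00:00")
at_change = threshold_at("2020-01-03 00:00:00")
after_change = threshold_at("2020-01-05 09:30:00")
```

Only the two change points are stored, which matches the note that the series should not hold a value for every observation. With pandas, `Series.asof` offers a comparable lookup on a sorted index.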
- get_tick_bars(file_path_or_df: str | Iterable[str] | pandas.DataFrame, threshold: float | pandas.Series = 70000000, batch_size: int = 20000000, verbose: bool = True, to_csv: bool = False, output_path: str | None = None)
-
Creates the tick bars: date_time, open, high, low, close, volume, cum_buy_volume, cum_ticks, cum_dollar_value.
- Parameters:
-
-
file_path_or_df – (str/iterable of str/pd.DataFrame) Path to the csv file(s) or pandas DataFrame containing raw tick data in the format [date_time, price, volume].
-
threshold – (float/pd.Series) A cumulative value above this threshold triggers a sample to be taken. If a series is given, then at each sampling time the closest previous threshold is used. The series should contain values only at the times when the threshold changes, not for every observation.
-
batch_size – (int) The number of rows per batch. Use a smaller batch size when less RAM is available.
-
verbose – (bool) Print out batch numbers (True or False).
-
to_csv – (bool) Save bars to csv after every batch run (True or False).
-
output_path – (str) Path to csv file, if to_csv is True.
-
- Returns:
-
(pd.DataFrame) Dataframe of tick bars.