Sequential Bootstrapping
The key power of ensemble learning techniques is bagging (bootstrap aggregating, in which each model is trained on a sample drawn with replacement). The key idea behind bagging is to randomly choose the samples that each decision tree is trained on. The trees then become diverse, and averaging the predictions of diverse trees built on randomly selected samples and random subsets of features makes the algorithm much less prone to overfitting.
However, in our case, we would not only like to randomly choose samples but also choose samples which are unique and non-concurrent. But how can we solve this problem? Here comes the Sequential Bootstrapping algorithm.
The key idea behind Sequential Bootstrapping is to select samples in such a way that on each iteration, we maximize the average uniqueness of selected subsamples.
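The probability update behind this idea can be sketched in a few lines. The block below is a dependency-free illustration, not MlFinLab's implementation: the function name `next_draw_probabilities` and the list-of-lists matrix are assumptions made for this example. It uses a three-label toy indicator matrix (rows are bars, columns are labels) and computes the draw probabilities before any draw and after label 1 has been drawn.

```python
from fractions import Fraction


def next_draw_probabilities(ind_mat, selected):
    """Draw probabilities for the next sample, given the labels drawn so far.

    ind_mat  -- list of rows (bars), each row a list of 0/1 flags per label
    selected -- list of label indexes already drawn (repeats allowed)
    Assumes every label spans at least one bar.
    """
    n_labels = len(ind_mat[0])
    # Concurrency per bar contributed by the labels drawn so far.
    drawn_concurrency = [sum(row[j] for j in selected) for row in ind_mat]

    avg_uniqueness = []
    for candidate in range(n_labels):
        # Uniqueness of the candidate at each bar it spans, counting itself.
        per_bar = [
            Fraction(1, drawn_concurrency[t] + 1)
            for t, row in enumerate(ind_mat)
            if row[candidate]
        ]
        avg_uniqueness.append(sum(per_bar) / len(per_bar))

    # Normalise the average uniqueness values into probabilities.
    total = sum(avg_uniqueness)
    return [u / total for u in avg_uniqueness]


# Toy indicator matrix: label 0 spans bars 0-2, label 1 bars 2-3, label 2 bars 4-5.
toy_ind_mat = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]

first_draw = next_draw_probabilities(toy_ind_mat, selected=[])
second_draw = next_draw_probabilities(toy_ind_mat, selected=[1])
print(first_draw)   # uniform: every label is equally unique before any draw
print(second_draw)  # label 1 becomes the least likely, label 2 the most likely
```

After label 1 has been drawn, the probabilities are 5/14, 3/14 and 6/14 for labels 0, 1 and 2: the label that does not overlap with the earlier draw is favoured, while a repeat of label 1 is penalised the most.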
Implementation
The core functions behind Sequential Bootstrapping are implemented in MlFinLab and can be seen below:
- get_ind_matrix(samples_info_sets, price_bars)
  Advances in Financial Machine Learning, Snippet 4.3, page 65.
  Build an Indicator Matrix.
  Get indicator matrix. The book implementation uses bar_index as input; however, it does not explain how to form it. We decided that using triple_barrier_events and price bars, by analogy with concurrency, is the best option.
  - Parameters:
    - samples_info_sets – (pd.Series): Triple barrier events (t1) from labeling.get_events.
    - price_bars – (pd.DataFrame): Price bars which were used to form triple barrier events.
  - Returns:
    (np.array) Indicator binary matrix indicating what (price) bars influence the label for each observation.
- get_sparse_ind_matrix(samples_info_sets, price_bars)
  Get sparse, transposed (for performance increase) indicator matrix.
  - Parameters:
    - samples_info_sets – (pd.Series): Triple barrier events (t1) from labeling.get_events.
    - price_bars – (pd.DataFrame): Price bars which were used to form triple barrier events.
  - Returns:
    (tuple) Sparse, transposed indicator binary matrix indicating what (price) bars influence the label for each observation, together with the denominators that sparse_seq_bootstrap takes as input.
- get_ind_mat_average_uniqueness(ind_mat)
  Advances in Financial Machine Learning, Snippet 4.4, page 65.
  Compute Average Uniqueness.
  Average uniqueness from indicator matrix.
  - Parameters:
    - ind_mat – (np.matrix) Indicator binary matrix.
  - Returns:
    (float) Average uniqueness.
- get_ind_mat_label_uniqueness(ind_mat)
  Advances in Financial Machine Learning, an adaptation of Snippet 4.4, page 65.
  Returns the indicator matrix element uniqueness.
  - Parameters:
    - ind_mat – (np.matrix) Indicator binary matrix.
  - Returns:
    (np.matrix) Element uniqueness.
- seq_bootstrap(ind_mat, sample_length=None, warmup_samples=None, compare=False, verbose=False, random_state=np.random.RandomState())
  Advances in Financial Machine Learning, Snippet 4.5, Snippet 4.6, page 65.
  Return Sample from Sequential Bootstrap.
  Generate a sample via sequential bootstrap. Note: Moved from pd.DataFrame to np.matrix for performance increase.
  - Parameters:
    - ind_mat – (pd.DataFrame) Indicator matrix from triple barrier events.
    - sample_length – (int) Length of bootstrapped sample.
    - warmup_samples – (list) List of previously drawn samples.
    - compare – (bool) Flag to print standard bootstrap uniqueness vs sequential bootstrap uniqueness.
    - verbose – (bool) Flag to print updated probabilities on each step.
    - random_state – (np.random.RandomState) Random state.
  - Returns:
    (np.array) Bootstrapped samples indexes.
- sparse_seq_bootstrap(sparse_ind_mat: csr_matrix, denominators: csr_matrix, n_samples: int, random_state=None) -> list
  Sequential Bootstrap implementation through sparse matrices.
  - Parameters:
    - sparse_ind_mat – (scipy.sparse.csr_matrix) Sparse indicator matrix.
    - denominators – (scipy.sparse.csr_matrix) Denominators output from mlfinlab.sampling.get_sparse_ind_matrix.
    - n_samples – (int) Number of matrix samples (shape[0]).
    - random_state – (int) Random seed.
  - Returns:
    (list) List of bootstrapped samples.
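To make the indicator matrix and the two uniqueness measures concrete, here is a small dependency-free sketch. It is not MlFinLab's code: the helper names (`build_ind_matrix`, `element_uniqueness`, `average_uniqueness`) and the use of integer bar positions instead of timestamps are assumptions for illustration, and reducing to a single float by averaging all positive entries is only one plausible reading of the "(float) Average uniqueness" return value above.

```python
def build_ind_matrix(spans, n_bars):
    """Binary matrix with one row per bar and one column per label.

    spans -- list of (first_bar, last_bar) positions, inclusive, one per label
    """
    return [
        [1 if first <= bar <= last else 0 for first, last in spans]
        for bar in range(n_bars)
    ]


def element_uniqueness(ind_mat):
    """Divide every entry by its bar's concurrency (the row sum)."""
    result = []
    for row in ind_mat:
        concurrency = sum(row)
        result.append([x / concurrency if concurrency else 0.0 for x in row])
    return result


def average_uniqueness(ind_mat):
    """Mean of all positive element-uniqueness values (assumed reduction)."""
    positive = [u for row in element_uniqueness(ind_mat) for u in row if u > 0]
    return sum(positive) / len(positive)


# Hypothetical events: label 0 spans bars 0-2, label 1 bars 2-3, label 2 bars 4-5.
ind_mat = build_ind_matrix([(0, 2), (2, 3), (4, 5)], n_bars=6)
for row in ind_mat:
    print(row)

print(element_uniqueness(ind_mat)[2])  # bar 2 is shared by labels 0 and 1
print(average_uniqueness(ind_mat))
```

Only bar 2 is shared between two labels, so its entries are halved while every other non-zero entry keeps a uniqueness of 1.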
Example
An example of Sequential Bootstrap using a toy example from the book can be seen below.
Consider a set of labels \(\left\{y_i\right\}_{i=0,1,2}\) where:
- label \(y_0\) is a function of return \(r_{0,2}\)
- label \(y_1\) is a function of return \(r_{2,3}\)
- label \(y_2\) is a function of return \(r_{4,5}\)
>>> import numpy as np
>>> import pandas as pd
>>> # Build an indicator matrix.
>>> # Columns correspond to samples; rows correspond to the price-return timestamps
>>> # used when labelling each sample.
>>> ind_mat = pd.DataFrame(index=range(0, 6), columns=range(0, 3))
>>> ind_mat.loc[:, 0] = [1, 1, 1, 0, 0, 0]
>>> ind_mat.loc[:, 1] = [0, 0, 1, 1, 0, 0]
>>> ind_mat.loc[:, 2] = [0, 0, 0, 0, 1, 1]
However, instead of specifying the indicator matrix manually, we can use the get_ind_matrix function from MlFinLab:
>>> import numpy as np
>>> import pandas as pd
>>> from mlfinlab.util import volatility
>>> from mlfinlab.filters import filters
>>> from mlfinlab.labeling import labeling
>>> from mlfinlab.sampling import bootstrapping
>>> from mlfinlab.sampling import concurrent
>>> # Use dollar bars example dataset to generate an indicator matrix
>>> data = pd.read_csv(
... "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/dollar_bars.csv"
... )
>>> data = data.iloc[:2000, :] # slice the dataset so example doesn't run too long
>>> data.index = pd.to_datetime(data["date_time"])
>>> data = data.drop("date_time", axis=1)
>>> # Select the data from 1st September 2011
>>> data = data["2011-09-01":]
>>> # Based on the simple moving average cross-over strategy.
>>> # Compute moving averages
>>> fast_window = 20
>>> slow_window = 50
>>> data["fast_mavg"] = (
... data["close"]
... .rolling(window=fast_window, min_periods=fast_window, center=False)
... .mean()
... )
>>> data["slow_mavg"] = (
... data["close"]
... .rolling(window=slow_window, min_periods=slow_window, center=False)
... .mean()
... )
>>> # Compute sides
>>> data["side"] = np.nan
>>> long_signals = data["fast_mavg"] >= data["slow_mavg"]
>>> short_signals = data["fast_mavg"] < data["slow_mavg"]
>>> data.loc[long_signals, "side"] = 1
>>> data.loc[short_signals, "side"] = -1
>>> # Remove look-ahead bias by lagging the signal
>>> data["side"] = data["side"].shift(1)
>>> # Duplicate the raw data
>>> raw_data = data.copy()
>>> # Drop the NaN values from our data set
>>> data.dropna(axis=0, how="any", inplace=True)
>>> # Compute daily volatility
>>> daily_vol = volatility.get_daily_vol(close=data["close"], lookback=50)
>>> # Apply Symmetric CUSUM filter and get timestamps for events
>>> # Note: Only the CUSUM filter needs a point estimate for volatility
>>> cusum_events = filters.cusum_filter(
... data["close"], threshold=daily_vol["2011-09-01":"2018-01-01"].mean() * 0.5
... )
>>> # Compute (triple barrier labeling) vertical barrier
>>> vertical_barriers = labeling.add_vertical_barrier(
... t_events=cusum_events, close=data["close"], num_days=1
... )
>>> pt_sl = [1, 2]
>>> min_ret = 0.005
>>> barrier_events = labeling.get_events(
... close=data["close"],
... t_events=cusum_events,
... pt_sl=pt_sl,
... target=daily_vol,
... min_ret=min_ret,
... num_threads=3,
... vertical_barrier_times=vertical_barriers,
... side_prediction=data["side"],
... )
>>> barrier_events
t1...
>>> # Use the close prices from dollar bars dataset as the price bars for the indicator matrix.
>>> close_prices = pd.read_csv(
... "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/dollar_bars.csv",
... index_col=0,
... parse_dates=[0, 2],
... )
>>> # Create the indicator matrix
>>> triple_barrier_ind_mat = bootstrapping.get_ind_matrix(barrier_events, close_prices)
>>> # MlFinLab can also get average label uniqueness on the indicator matrix
>>> ind_mat_uniqueness = bootstrapping.get_ind_mat_average_uniqueness(
... triple_barrier_ind_mat
... )
>>> av_unique = concurrent.get_av_uniqueness_from_triple_barrier(
... pd.DataFrame(barrier_events), close_prices, num_threads=1
... )
>>> # Draw sequential bootstrap
>>> bootstrapping.seq_bootstrap(
... triple_barrier_ind_mat, sample_length=4, warmup_samples=[1]
... )
[...]
Sparse Indicator Matrix and Bootstrapping
Using the sparse, transposed indicator matrix, it’s possible to achieve a significant performance increase over the standard method, which constructs the indicator matrix as in the Advances in Financial Machine Learning book.
The example below shows how to use the sparse indicator matrix for bootstrapping:
>>> from mlfinlab.sampling.bootstrapping import get_sparse_ind_matrix
>>> from mlfinlab.sampling.bootstrapping import sparse_seq_bootstrap
>>> triple_barrier_sparse_ind_mat, denominators = get_sparse_ind_matrix(
... barrier_events, close_prices
... )
>>> bootstrapped_samples = sparse_seq_bootstrap(
... triple_barrier_sparse_ind_mat, denominators, n_samples=4
... )
>>> bootstrapped_samples
[...]
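The gain comes from storing only the non-zero entries: each label touches a handful of bars, so a label-by-bar (transposed) matrix is almost entirely zeros. The sketch below shows the compressed sparse row (CSR) layout in plain Python so that it runs without SciPy. The function names `to_transposed_csr` and `bars_of` are hypothetical, and how MlFinLab builds its sparse matrix and denominators internally is not shown here.

```python
def to_transposed_csr(ind_mat):
    """Return (indptr, indices) for the transposed matrix in CSR form.

    ind_mat -- dense list of rows (bars) by columns (labels), 0/1 entries.
    Row i of the transposed matrix is label i; its bar positions are
    indices[indptr[i]:indptr[i + 1]]. All stored values are 1, so no
    separate data array is kept.
    """
    n_labels = len(ind_mat[0])
    indptr, indices = [0], []
    for label in range(n_labels):
        indices.extend(bar for bar, row in enumerate(ind_mat) if row[label])
        indptr.append(len(indices))
    return indptr, indices


def bars_of(label, indptr, indices):
    """Bars spanned by a label, read straight from the CSR arrays."""
    return indices[indptr[label]:indptr[label + 1]]


dense = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
indptr, indices = to_transposed_csr(dense)
print(indptr)
print(indices)
print(bars_of(1, indptr, indices))
```

Looking up the bars of one label is a single slice, whereas the dense layout needs a scan over every bar; with thousands of bars and short-lived labels, that difference dominates the cost of each bootstrap step.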
Monte-Carlo Experiment
Let’s see how sequential bootstrapping increases average label uniqueness on this example. We generate 3 samples using sequential bootstrapping and 3 samples using standard random choice, repeat the experiment many times (100 in the code below; a larger number such as 10000 gives a smoother estimate), and record the corresponding label uniqueness in each experiment.
>>> standard_unq_list = [] # List of random sampling uniqueness
>>> seq_unq_list = [] # List of Sequential Bootstrapping uniqueness
>>> for i in range(0, 100): # Can set this to a larger number to get a smoother estimate
... ind_mat = triple_barrier_ind_mat
... bootstrapped_samples = bootstrapping.seq_bootstrap(ind_mat, sample_length=3)
... random_samples = np.random.choice(ind_mat.shape[1], size=3)
... random_unq = bootstrapping.get_ind_mat_average_uniqueness(
... ind_mat[:, random_samples]
... )
... random_unq_mean = random_unq[random_unq > 0].mean()
... sequential_unq = bootstrapping.get_ind_mat_average_uniqueness(
... ind_mat[:, bootstrapped_samples]
... )
... sequential_unq_mean = sequential_unq[sequential_unq > 0].mean()
... standard_unq_list.append(random_unq_mean)
... seq_unq_list.append(sequential_unq_mean)
...
KDE plots of label uniqueness support the fact that sequential bootstrapping gives higher average label uniqueness.
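The same comparison can be reproduced without MlFinLab on the three-label toy matrix. The following is a minimal sketch under stated assumptions: the helper names (`mean_uniqueness`, `sequential_draw`), the seed, and the number of runs are all chosen for illustration, and the draw weights follow the average-uniqueness idea described at the top of this page rather than the library's internal code.

```python
import random

TOY = [  # rows are bars, columns are labels
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
# Bars spanned by each label, precomputed once.
LABEL_BARS = [[t for t, row in enumerate(TOY) if row[j]] for j in range(3)]


def mean_uniqueness(sample):
    """Average uniqueness of each drawn label, averaged over the draws."""
    concurrency = [sum(row[j] for j in sample) for row in TOY]
    per_label = [
        sum(1 / concurrency[t] for t in LABEL_BARS[j]) / len(LABEL_BARS[j])
        for j in sample
    ]
    return sum(per_label) / len(per_label)


def sequential_draw(size, rng):
    """Draw labels one at a time, weighting by uniqueness against prior draws."""
    sample = []
    for _ in range(size):
        concurrency = [sum(row[j] for j in sample) for row in TOY]
        weights = [
            sum(1 / (concurrency[t] + 1) for t in bars) / len(bars)
            for bars in LABEL_BARS
        ]
        sample.append(rng.choices(range(3), weights=weights)[0])
    return sample


rng = random.Random(42)  # arbitrary seed, for repeatability only
n_runs = 2000
standard = [
    mean_uniqueness([rng.randrange(3) for _ in range(3)]) for _ in range(n_runs)
]
sequential = [mean_uniqueness(sequential_draw(3, rng)) for _ in range(n_runs)]

print(sum(standard) / n_runs)    # mean uniqueness, standard bootstrap
print(sum(sequential) / n_runs)  # mean uniqueness, sequential bootstrap
```

The sequential draws repeat or overlap less often, so their mean uniqueness should come out higher than that of the uniform draws; plotting the two lists as histograms or KDEs gives the visual comparison.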
Research Notebook
The following research notebooks can be used to better understand the previously discussed sampling methods