Sequentially Bootstrapped Ensembles
In the sampling section we showed that, because triple-barrier labels overlap in time, bootstrap samples should be drawn by Sequential Bootstrapping rather than by standard random sampling.
The SequentiallyBootstrappedBaggingClassifier and SequentiallyBootstrappedBaggingRegressor extend sklearn's BaggingClassifier/BaggingRegressor by using Sequential Bootstrapping instead of random sampling.
Building the indicator matrix requires the Triple Barrier Events (samples_info_sets) and the price bars used to label the training data set, which is why samples_info_sets and price bars are input parameters for the classifier/regressor.
To better understand the underlying method, you may be interested in reading the Sequential Bootstrapping page of MlFinLab documentation.
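To see what the ensemble does internally, the short sketch below builds the indicator matrix from labeled events and draws one sequentially bootstrapped sample. It assumes the get_ind_matrix and seq_bootstrap helpers from MlFinLab's Sequential Bootstrapping module, together with a triple_barrier_events DataFrame and price bars like those produced in the example further down; treat it as a minimal illustration, not the canonical API.

# Minimal sketch: assumes the get_ind_matrix/seq_bootstrap helpers and the
# triple_barrier_events / price data from a labeling step such as the one
# in the example below.
from mlfinlab.sampling.bootstrapping import get_ind_matrix, seq_bootstrap

# Rows of the indicator matrix correspond to price-bar timestamps, columns
# to labels; a cell is 1 when the label's lifespan covers that bar.
ind_mat = get_ind_matrix(samples_info_sets=triple_barrier_events.t1.dropna(),
                         price_bars=data["Close"])

# Each draw favours the label whose lifespan overlaps least with the draws
# made so far (i.e., the label with the highest average uniqueness).
sample_indices = seq_bootstrap(ind_mat, sample_length=50)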
Note
Underlying Literature
The following sources describe this method in more detail:
- Advances in Financial Machine Learning, Chapter 4 by Marcos Lopez de Prado.
Implementation
Warning
This model is computationally expensive and may take some time to run.
The model in the MlFinLab package was already orders of magnitude faster than the original described in Advances in Financial Machine Learning, even before version 1.4.0.
We have improved it further: starting from MlFinLab version 1.5.0, execution is up to 200 times quicker than in versions 1.4.0 and earlier. (The exact speed-up depends on the size of the input dataset.)
Implementation of the Sequentially Bootstrapped Bagging Classifier and Regressor, using sklearn's bagging estimators as base classes.
class SequentiallyBootstrappedBaggingClassifier(samples_info_sets, price_bars, base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
A Sequentially Bootstrapped Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset generated with the Sequential Bootstrapping sampling procedure, and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator is typically used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then building an ensemble out of it.
Variables:

- base_estimator – (estimator) The base estimator from which the ensemble is grown.
- estimators – (list of estimators) The collection of fitted base estimators.
- estimators_samples – (list of arrays) The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the selected indices.
- estimators_features – (list of arrays) The subset of drawn features for each base estimator.
- classes – (np.array) The class labels. (Shape = [n_classes])
- n_classes – (int/list) The number of classes.
- oob_score – (float) Score of the training dataset obtained using an out-of-bag estimate.
- oob_decision_function – (np.array) Decision function computed with out-of-bag estimates on the training set. If n_estimators is small, it is possible that a data point was never left out during the bootstrap; in this case, oob_decision_function_ might contain NaN. (Shape = [n_samples, n_classes])
class SequentiallyBootstrappedBaggingRegressor(samples_info_sets, price_bars, base_estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
A Sequentially Bootstrapped Bagging regressor is an ensemble meta-estimator that fits base regressors, each on a random subset of the original dataset generated with the Sequential Bootstrapping sampling procedure, and then aggregates their individual predictions (by averaging) to form a final prediction. Such a meta-estimator is typically used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and then building an ensemble out of it. A usage sketch follows the variable list below.
Variables:

- estimators – (list of estimators) The collection of fitted sub-estimators.
- estimators_samples – (list of arrays) The subset of drawn samples (i.e., the in-bag samples) for each base estimator. Each subset is defined by an array of the selected indices.
- estimators_features – (list of arrays) The subset of drawn features for each base estimator.
- oob_score – (float) Score of the training dataset obtained using an out-of-bag estimate.
- oob_prediction – (np.array) Prediction computed with out-of-bag estimates on the training set. If n_estimators is small, it is possible that a data point was never left out during the bootstrap; in this case, oob_prediction_ might contain NaN. (Shape = [n_samples])
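The worked example below uses only the classifier, so here is a minimal usage sketch for the regressor. It assumes a feature matrix X, a continuous target y (for example, forward returns), and triple_barrier_events and close prices from a labeling step like the one below; X and y are illustrative names, not part of the library.

# Minimal sketch: X, y, triple_barrier_events and data["Close"] are assumed
# to come from a triple-barrier labeling step like the example below.
from sklearn.tree import DecisionTreeRegressor
from mlfinlab.ensemble.sb_bagging import SequentiallyBootstrappedBaggingRegressor

base_est = DecisionTreeRegressor(max_depth=5)  # weak base learner
reg = SequentiallyBootstrappedBaggingRegressor(samples_info_sets=triple_barrier_events.t1.dropna(),
                                               price_bars=data["Close"],
                                               base_estimator=base_est,
                                               n_estimators=100,
                                               oob_score=True)
reg.fit(X, y)
print(reg.oob_score_)  # out-of-bag R^2 on the training set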
Example
An example of using the SequentiallyBootstrappedBaggingClassifier:
# Import packages
import pandas as pd
import numpy as np
import yfinance as yf
# Import MlFinLab tools
from mlfinlab.ensemble.sb_bagging import SequentiallyBootstrappedBaggingClassifier
from mlfinlab.util.volatility import get_daily_vol
from mlfinlab.filters.filters import cusum_filter
from mlfinlab.labeling.labeling import add_vertical_barrier, get_events, get_bins
from sklearn.ensemble import RandomForestClassifier
# Fetch data from Yahoo Finance
sp500 = yf.Ticker("^GSPC") # S&P 500
# Get historical market data
data = sp500.history(period="1mo", interval="2m")
# Filter events using the CUSUM filter
daily_vol = get_daily_vol(close=data["Close"], lookback=50)
cusum_events = cusum_filter(data["Close"], threshold=daily_vol.mean() * 0.5)
# Do triple-barrier labelling
vertical_barriers = add_vertical_barrier(t_events=data.index, close=data["Close"], num_hours=1)
pt_sl = [1, 1]
triple_barrier_events = get_events(
close=data["Close"],
t_events=cusum_events,
pt_sl=pt_sl,
target=daily_vol,
num_threads=1,
vertical_barrier_times=vertical_barriers)
labels = get_bins(triple_barrier_events, data["Close"])
# Feature Engineering
x = pd.DataFrame(index=data.index)
# Volatility
data["log_ret"] = np.log(data["Close"]).diff()
x["volatility_50"] = (data["log_ret"].rolling(window=50, min_periods=50, center=False).std())
x["volatility_31"] = (data["log_ret"].rolling(window=31, min_periods=31, center=False).std())
x["volatility_15"] = (data["log_ret"].rolling(window=15, min_periods=15, center=False).std())
# Autocorrelation
window_autocorr = 50
x["autocorr_1"] = (data["log_ret"].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=1), raw=False))
x["autocorr_2"] = (data["log_ret"].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=2), raw=False))
x["autocorr_3"] = (data["log_ret"].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=3), raw=False))
x["autocorr_4"] = (
data["log_ret"].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=4), raw=False))
x["autocorr_5"] = (data["log_ret"].rolling(window=window_autocorr, min_periods=window_autocorr, center=False).apply(lambda x: x.autocorr(lag=5), raw=False))
# Log-return momentum
x["log_t1"] = data["log_ret"].shift(1)
x["log_t2"] = data["log_ret"].shift(2)
x["log_t3"] = data["log_ret"].shift(3)
x["log_t4"] = data["log_ret"].shift(4)
x["log_t5"] = data["log_ret"].shift(5)
x.dropna(inplace=True)
labels = labels.loc[x.index.min():x.index.max()]
triple_barrier_events = triple_barrier_events.loc[x.index.min():x.index.max()]
x = x.loc[labels.index]
x_train = x # We'll use all examples in this particular case
y_train = labels.loc[x_train.index, "bin"]
# Use tools
base_est = RandomForestClassifier(n_estimators=1, criterion='entropy', bootstrap=False,
class_weight='balanced_subsample')
clf = SequentiallyBootstrappedBaggingClassifier(base_estimator=base_est,
samples_info_sets=triple_barrier_events.t1.dropna(),
price_bars=data['Close'], oob_score=True)
clf.fit(x_train, y_train)
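Once fitted, the attributes documented above can be inspected; attribute names carry the usual sklearn trailing underscore:

# Inspect the fitted ensemble from the snippet above
print(clf.oob_score_)  # out-of-bag accuracy estimate
print(len(clf.estimators_))  # number of fitted base estimators
print(clf.estimators_samples_[0][:10])  # in-bag indices drawn for the first estimator
predictions = clf.predict(x_train)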
Research Notebook
The following research notebooks can be used to better understand Ensemble Methods.