Sample Uniqueness
Let’s look at an example of three samples: A, B, and C. Imagine that:

- A was generated at \(t_1\) and triggered at \(t_8\)
- B was generated at \(t_3\) and triggered at \(t_6\)
- C was generated at \(t_7\) and triggered at \(t_9\)
In this case we see that A used information about returns over \([t_1, t_8]\) to generate its label, a window which overlaps both \([t_3, t_6]\) (used by B) and \([t_7, t_9]\) (used by C), while B and C do not share any return information with each other. Here we would like to introduce the concept of concurrency.

We say that labels \(y_i\) and \(y_j\) are concurrent at \(t\) if they are both a function of at least one common return, \(r_{t-1,t}\).
In terms of concurrency, C is the most ‘pure’ label, as it shares the least return information with the other labels, while A is the ‘dirtiest’, as it uses information from both B and C. By measuring average label uniqueness you can quantify how ‘pure’ your dataset is with respect to label concurrency. Average label uniqueness can be computed with the get_av_uniqueness_from_triple_barrier function from the mlfinlab package.
Implementation
- get_av_uniqueness_from_triple_barrier(triple_barrier_events, close_series, num_threads, verbose=True)

  This function is the orchestrator to derive average sample uniqueness from a dataset labeled by the triple barrier method.

  Parameters:

  - triple_barrier_events – (pd.DataFrame) Events from labeling.get_events().
  - close_series – (pd.Series) Close prices.
  - num_threads – (int) The number of threads concurrently used by the function.
  - verbose – (bool) Flag to report progress on async jobs.

  Returns:

  - (pd.Series) Average uniqueness over the event’s lifespan for each index in triple_barrier_events.
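To make the computation concrete, here is a minimal single-threaded pandas sketch of what such an orchestrator computes for an events frame shaped like the labeling.get_events() output (an index of event start times and a t1 column of end times). This is a simplified illustration, not mlfinlab’s implementation:

```python
import pandas as pd

# Illustrative sketch (not mlfinlab's implementation): count how many events
# are concurrent at each close-price bar, then average 1 / c_t over each
# event's lifespan to get its average uniqueness.
close_idx = pd.date_range("2024-01-01", periods=10, freq="h")
events = pd.DataFrame(
    {"t1": [close_idx[5], close_idx[4], close_idx[8]]},  # event end times
    index=[close_idx[0], close_idx[2], close_idx[6]],    # event start times
)

# Concurrency: number of events whose [start, t1] window covers each bar
concurrency = pd.Series(0, index=close_idx)
for start, end in events["t1"].items():
    concurrency.loc[start:end] += 1

# Average uniqueness per event: mean of 1 / c_t over [start, t1]
av_uniqueness = pd.Series(
    {
        start: (1.0 / concurrency.loc[start:end]).mean()
        for start, end in events["t1"].items()
    }
)
print(av_uniqueness)  # 0.75, 0.5 and 1.0 for the three toy events
```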
Example
An example of calculating average uniqueness, given that we already have our barrier events, can be seen below:
>>> # Import packages
>>> import pandas as pd
>>> # Import MlFinLab tools
>>> from mlfinlab.util import volatility
>>> from mlfinlab.labeling import labeling
>>> from mlfinlab.sampling.concurrent import get_av_uniqueness_from_triple_barrier
>>> # Load data
>>> url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/sample_dollar_bars.csv"
>>> close_prices = pd.read_csv(url, index_col=0, parse_dates=[0])["close"]
>>> # Calculate the volatility that will be used to dynamically set the barriers
>>> vol = volatility.get_daily_vol(close=close_prices, lookback=50)
>>> # Compute vertical barrier using timedelta
>>> vertical_barriers = labeling.add_vertical_barrier(
... t_events=close_prices.index, close=close_prices, num_hours=1
... )
>>> # Set profit taking and stop loss levels
>>> pt_sl = [1, 2]
>>> triple_barrier_events = labeling.get_events(
... close=close_prices,
... t_events=close_prices.index,
... pt_sl=pt_sl,
... target=vol,
... num_threads=3,
... vertical_barrier_times=vertical_barriers,
... )
>>> # Calculate average uniqueness
>>> av_unique = get_av_uniqueness_from_triple_barrier(
... triple_barrier_events, close_prices, num_threads=3
... )
>>> av_unique.mean()
tW    0.201233
dtype: float64
We would like to build our model in such a way that it takes into account label concurrency (overlapping samples). In order to do that we need to look at the bootstrapping algorithm of a Random Forest.
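For context, the standard bootstrap used by bagging and Random Forests draws observations uniformly with replacement and is blind to this overlap: on average only about \(1 - e^{-1} \approx 63.2\%\) of the distinct samples appear in each draw, and heavily overlapping samples are picked just as readily as unique ones. A quick numpy illustration (not mlfinlab code):

```python
import numpy as np

# A standard bootstrap draw: sample n indices uniformly with replacement,
# ignoring any overlap between the underlying samples.
rng = np.random.default_rng(42)
n = 10_000
draw = rng.integers(0, n, size=n)

# Fraction of distinct observations that made it into the draw;
# in expectation this is 1 - (1 - 1/n)**n, which tends to 1 - 1/e
unique_frac = np.unique(draw).size / n
print(round(unique_frac, 3))  # close to 0.632
```

Sequential Bootstrapping, covered next, modifies this draw so that samples which overlap heavily with already-drawn samples become less likely to be picked.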
Let’s move on to the next section on Sequential Bootstrapping.
Research Notebook
The following research notebook can be used to better understand the previously discussed sampling method.
Sample Uniqueness and Weights