Sample Uniqueness

Let’s look at an example with three samples: A, B, and C.

Imagine that:

  • A was generated at \(t_1\) and triggered at \(t_8\)

  • B was generated at \(t_3\) and triggered at \(t_6\)

  • C was generated at \(t_7\) and triggered at \(t_9\)

In this case we see that A used information about returns on \([t_1, t_8]\) to generate its label, which fully overlaps with the interval \([t_3, t_6]\) used by B, and shares its final return with C’s interval \([t_7, t_9]\). Here we would like to introduce the concept of concurrency.

We say that labels \(y_i\) and \(y_j\) are concurrent at \(t\) if they are both a function of at least one common return, \(r_{t-1,t}\).

In terms of concurrency, label C is the most ‘pure’, as it shares almost no information with the other labels, while A is the ‘dirtiest’, as it uses information from both B and C. By measuring average label uniqueness you can quantify how ‘pure’ your dataset is, based on the concurrency of its labels. We can measure average label uniqueness using the get_av_uniqueness_from_triple_barrier function from the mlfinlab package.
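The computation behind this can be sketched by hand for the toy example above. This is an illustrative reimplementation, not the mlfinlab code: the label names and bar indices come from the A, B, C example, and it assumes the convention that a label spanning \((t_0, t_1]\) is a function of the returns \(r_{t-1,t}\) for \(t_0 < t \le t_1\).

```python
import pandas as pd

# Toy spans from the example: (t0, t1) bar indices for each label.
spans = {"A": (1, 8), "B": (3, 6), "C": (7, 9)}

# Concurrency c_t: how many labels are a function of the return r_{t-1,t}.
timestamps = range(2, 10)
concurrency = pd.Series(
    [sum(1 for t0, t1 in spans.values() if t0 < t <= t1) for t in timestamps],
    index=timestamps,
)

# Average uniqueness of a label: the mean of 1 / c_t over its lifespan.
avg_uniqueness = {
    name: (1.0 / concurrency.loc[t0 + 1 : t1]).mean()
    for name, (t0, t1) in spans.items()
}
print(avg_uniqueness)  # A ~= 0.714, B = 0.5, C = 0.75
```

Note that C ends up with the highest average uniqueness (it shares only the single return \(r_{7,8}\) with A), while B, whose entire lifespan is contained in A’s, ends up with the lowest.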



Implementation

get_av_uniqueness_from_triple_barrier(triple_barrier_events, close_series, num_threads, verbose=True)

This function is the orchestrator to derive average sample uniqueness from a dataset labeled by the triple barrier method.

Parameters:
  • triple_barrier_events – (pd.DataFrame) Events from labeling.get_events().

  • close_series – (pd.Series) Close prices.

  • num_threads – (int) The number of threads concurrently used by the function.

  • verbose – (bool) Flag to report progress on async jobs.

Returns:

(pd.Series) Average uniqueness over event’s lifespan for each index in triple_barrier_events.


Example

An example of calculating average uniqueness, given that we already have our barrier events, can be seen below:

>>> # Import packages
>>> import pandas as pd
>>> # Import MlFinLab tools
>>> from mlfinlab.util import volatility
>>> from mlfinlab.labeling import labeling
>>> from mlfinlab.sampling.concurrent import get_av_uniqueness_from_triple_barrier
>>> # Load data
>>> url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/sample_dollar_bars.csv"
>>> close_prices = pd.read_csv(url, index_col=0, parse_dates=[0])["close"]
>>> # Calculate the volatility that will be used to dynamically set the barriers
>>> vol = volatility.get_daily_vol(close=close_prices, lookback=50)
>>> # Compute vertical barrier using timedelta
>>> vertical_barriers = labeling.add_vertical_barrier(
...     t_events=close_prices.index, close=close_prices, num_hours=1
... )
>>> # Set profit taking and stop loss levels
>>> pt_sl = [1, 2]
>>> triple_barrier_events = labeling.get_events(
...     close=close_prices,
...     t_events=close_prices.index,
...     pt_sl=pt_sl,
...     target=vol,
...     num_threads=3,
...     vertical_barrier_times=vertical_barriers,
... )
>>> # Calculate average uniqueness
>>> av_unique = get_av_uniqueness_from_triple_barrier(
...     triple_barrier_events, close_prices, num_threads=3
... )
>>> av_unique.mean()
tW    0.201233
dtype: float64

We would like to build our model in such a way that it takes label concurrency (overlapping samples) into account. To do that, we need to look at the bootstrapping algorithm of a Random Forest.
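To see why the standard bootstrap is a problem here, note that it draws observations uniformly with replacement and is therefore blind to overlap. The sketch below is purely illustrative (the label names come from the toy example above, not from the mlfinlab API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels from the example above: A and B overlap heavily, C is mostly unique.
labels = np.array(["A", "B", "C"])

# A standard (i.i.d.) bootstrap, as used inside a Random Forest, draws
# observations uniformly with replacement, ignoring concurrency:
draws = rng.choice(labels, size=(1000, 3), replace=True)

# Every label is drawn with equal probability, so the redundant, highly
# overlapping labels (A, B) enter bootstrap samples just as often as the
# nearly unique label C.
freq = {lbl: (draws == lbl).mean() for lbl in labels}
print(freq)  # each frequency is close to 1/3
```

Sequential Bootstrapping, discussed next, modifies these draw probabilities so that less concurrent (more unique) observations are favoured.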

Let’s move on to the next section, on Sequential Bootstrapping.


Research Notebook

The following research notebook can be used to better understand the previously discussed sampling method.

  • Sample Uniqueness and Weights

