# Labeling: Tail Sets

## Abstract

Tail set labels are a classification labeling technique introduced in the following paper: [Huerta, R., Corbacho, F. and
Elkan, C., 2013. Nonlinear support vector machines can systematically identify stocks with high and low future returns.
Algorithmic Finance, 2(1), pp.45-58.](https://content.iospress.com/download/algorithmic-finance/af016?id=algorithmic-finance%2Faf016)

A tail set is defined to be a group of assets whose volatility-adjusted price change is in the highest or lowest
quantile, for example the highest or lowest 5%.

A classification model is then fit using these labels to determine which stocks to buy and sell, for a long / short
portfolio.

## How it works

We label the y variable using the tail set labeling technique, which makes up the positive and negative (1, -1) classes
of the training data. The original paper investigates the performance of 3 types of metrics on which the tail sets are
built:

1. Real returns
2. Residual alpha after regression on the sector index
3. Volatility-adjusted returns

For our particular implementation, we have focused on the volatility-adjusted returns.

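An input DataFrame of prices is converted to returns, to which a volatility adjustment can be applied. The volatility-adjusted return is:

$$r(t - t', t) = \frac{R(t-t',t)}{vol(t)}$$

where $R(t-t', t)$ is the return between times $t - t'$ and $t$, and $vol(t)$ is the volatility estimate at time $t$.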

We provide two volatility estimates: the exponential moving average of the mean absolute returns (`vol_adj='mean_abs_dev'`) and the traditional standard deviation (`vol_adj='stdev'`). The paper suggests a 180-day window.
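
As a rough illustration of the two volatility estimates, the sketch below computes volatility-adjusted returns with plain pandas. It is not MlFinLab's implementation: the helper name `vol_adjusted_returns`, the use of simple one-period returns, and the synthetic prices are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def vol_adjusted_returns(prices, window=180, vol_adj="mean_abs_dev"):
    """Hypothetical helper: one-period returns divided by a volatility estimate."""
    # Simple one-period returns (an assumption; log returns would work as well)
    rets = prices / prices.shift(1) - 1
    if vol_adj == "mean_abs_dev":
        # Exponential moving average of the absolute returns
        vol = rets.abs().ewm(span=window, min_periods=window).mean()
    elif vol_adj == "stdev":
        # Standard deviation of the returns over a rolling window
        vol = rets.rolling(window).std()
    else:
        raise ValueError("vol_adj must be 'mean_abs_dev' or 'stdev'")
    return rets / vol


# Synthetic prices for three made-up tickers (illustration only)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=30)
steps = rng.normal(0.0, 0.01, size=(30, 3))
prices = pd.DataFrame(
    100 * np.exp(steps.cumsum(axis=0)), index=dates, columns=["A", "B", "C"]
)

adj_mad = vol_adjusted_returns(prices, window=5, vol_adj="mean_abs_dev")
adj_std = vol_adjusted_returns(prices, window=5, vol_adj="stdev")
print(adj_mad.dropna().head())
```

Rows that fall before the window has filled are NaN, because no volatility estimate exists for them yet.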

The volatility-adjusted return of each stock is assigned to a quantile relative to the other returns in the same row, that is, at the same timestamp. The top and bottom quantiles are then labeled as the positive and negative classes, respectively.
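
The cross-sectional quantile step can be sketched for a single timestamp as follows. The helper `tail_set_row` is hypothetical, and binning ranks with `pd.qcut` is a choice made for this example, not necessarily what the library does internally. The input values are the 2020-01-06 volatility-adjusted returns shown in the example output later in this notebook.

```python
import pandas as pd


def tail_set_row(row, n_bins):
    """Hypothetical helper: label one timestamp's returns as 1, -1 or 0."""
    # Bin the ranks so that tied returns cannot create duplicate bin edges
    bins = pd.qcut(row.rank(method="first"), q=n_bins, labels=False)
    labels = pd.Series(0, index=row.index)
    labels[bins == n_bins - 1] = 1  # top quantile -> positive class
    labels[bins == 0] = -1          # bottom quantile -> negative class
    return labels


# Volatility-adjusted returns for 2020-01-06, copied from the notebook output
row = pd.Series({
    "AAPL": 0.745104, "AMD": -0.227461, "BABA": -0.121334, "CCL": -2.423722,
    "COST": 0.039694, "F": -0.510354, "GE": 0.905577, "GPS": 2.396431,
    "JPM": -0.098566, "KO": -0.056502, "MSFT": 0.311692, "NVDA": 0.266653,
    "PFE": -0.163218, "SYY": -0.217696, "WFC": -0.65039, "ZM": 1.907438,
})

labels = tail_set_row(row, n_bins=10)
positive = sorted(labels.index[labels == 1])
negative = sorted(labels.index[labels == -1])
print(positive, negative)
```

With 16 stocks and 10 bins, each tail holds two names. For this row the sketch selects GPS and ZM as the positive set and CCL and WFC as the negative set, which agrees with the labels the notebook reports for that date.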

## How to use these labels in practice?

The tail set labeler returns the names of the assets that should receive a positive or negative label at each
timestamp. It's important to note that the model you would develop is a many-to-one model, in that it has many
x variables and only one y variable. The model is a binary classifier.
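
One possible way to turn a label matrix into a training set is sketched below. Everything here is illustrative: the toy label matrix, the feature names `momentum` and `value`, and the decision to drop assets labeled 0 are assumptions for the example.

```python
import pandas as pd

# Toy label matrix in the same layout as the labeler's output (values are made up)
dates = pd.to_datetime(["2020-01-06", "2020-01-07"])
matrix = pd.DataFrame(
    [[1, 0, -1, 0], [0, -1, 0, 1]],
    index=pd.Index(dates, name="Date"),
    columns=["A", "B", "C", "D"],
)

# Long format: one row per (Date, ticker), keeping only tail-set members
labels_long = matrix.reset_index().melt(
    id_vars="Date", var_name="ticker", value_name="label"
)
labels_long = labels_long[labels_long["label"] != 0]

# Hypothetical feature table with one row per (Date, ticker)
features = pd.DataFrame({
    "Date": dates.repeat(4),
    "ticker": ["A", "B", "C", "D"] * 2,
    "momentum": [0.8, 0.1, -0.5, 0.0, 0.2, -0.7, 0.1, 0.9],
    "value": [1.2, 0.9, 0.4, 1.0, 1.1, 0.3, 0.8, 1.5],
})

# Attach the features to each labeled observation
train = labels_long.merge(features, on=["Date", "ticker"], how="inner")
X = train[["momentum", "value"]]  # many x variables
y = train["label"]                # a single y variable taking values -1 or 1
print(train)
```

Any binary classifier can then be fit on `X` and `y`.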

The model is trained on the training data and then used to score every security in the test data (on a given day).
Example: on December 1st 2019, the strategy needs to rebalance its positions. We score all 100 securities in our tradable
universe and then rank the outputs in a top-down fashion. We form a long / short portfolio by going long the top 10
stocks and short the bottom 10 (equally weighted). We then hold the position until the next rebalance date.
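
The ranking and portfolio formation step might look like the sketch below. The helper `long_short_weights` is hypothetical, and the scores are random numbers standing in for classifier output.

```python
import numpy as np
import pandas as pd


def long_short_weights(scores, n_long=10, n_short=10):
    """Hypothetical helper: equal-weight long the top scores, short the bottom ones."""
    ranked = scores.sort_values(ascending=False)
    weights = pd.Series(0.0, index=scores.index)
    weights.loc[ranked.index[:n_long]] = 1.0 / n_long
    weights.loc[ranked.index[-n_short:]] = -1.0 / n_short
    return weights


# Stand-in scores for a 100-stock universe (random numbers, not model output)
rng = np.random.default_rng(42)
universe = [f"STOCK_{i:03d}" for i in range(100)]
scores = pd.Series(rng.normal(size=100), index=universe)

weights = long_short_weights(scores)
print(weights[weights != 0].sort_values())
```

The long and short legs each sum to one in absolute value, so the portfolio is dollar neutral under these assumptions.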

---
## Examples of use

In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf

# Import MlFinLab tools
from mlfinlab.labeling.tail_sets import TailSetLabels

MLFINLAB_API_KEY is valid.


In [2]:
# Load price data for 16 stocks
tickers = "AAPL MSFT COST PFE SYY F GE BABA AMD CCL ZM WFC JPM NVDA GPS KO"

data = yf.download(tickers, start="2018-01-01", end="2021-01-01")
data = data['Adj Close']
data.index = pd.to_datetime(data.index)
data.head()

[*********************100%***********************]  16 of 16 completed


Date,AAPL,AMD,BABA,CCL,COST,F,GE,GPS,JPM,KO,MSFT,NVDA,PFE,SYY,WFC,ZM
2018-01-02,40.831593,10.98,183.649994,61.440411,175.129379,9.74965,103.178162,28.234156,91.977737,38.548206,80.562042,49.326302,28.254044,52.787338,52.136971,
2018-01-03,40.824478,11.55,184.0,61.578476,177.231079,9.826661,104.153732,27.650633,92.071472,38.463558,80.936974,52.572647,28.463398,52.900761,52.538078,
2018-01-04,41.014103,12.12,185.710007,61.532452,175.854706,9.996088,106.334366,27.292185,93.390442,39.005299,81.649323,52.84977,28.525427,53.401054,53.195244,
2018-01-05,41.481064,11.88,190.699997,61.026203,174.599289,10.165512,106.391724,27.133797,92.790909,38.996838,82.661636,53.29763,28.5797,53.936474,53.553688,
2018-01-08,41.326992,12.28,190.330002,60.796089,175.278137,10.127007,104.899719,26.925398,92.927933,38.937584,82.745987,54.930717,28.261805,54.050587,52.947742,


In [3]:
# Create tail set labels with mean absolute deviation as the volatility adjustment
labels = TailSetLabels(data, n_bins=10, vol_adj='mean_abs_dev', window=180)
pos_set, neg_set, matrix_set = labels.get_tail_sets()

In [4]:
# Get the positive set: the stocks in the top 10% of returns for each day
pos_set.head()

Date
2020-01-06      [GPS, ZM]
2020-01-07        [F, ZM]
2020-01-08    [MSFT, SYY]
2020-01-09     [COST, KO]
2020-01-10     [GPS, PFE]
dtype: object

In [5]:
# Get the negative set: the stocks in the lowest 10% of returns for each day
neg_set.head()

Date
2020-01-06    [CCL, WFC]
2020-01-07    [JPM, SYY]
2020-01-08     [AMD, GE]
2020-01-09    [GPS, PFE]
2020-01-10     [GE, JPM]
dtype: object

In [6]:
# All labels for the day
matrix_set.head()

Date,AAPL,AMD,BABA,CCL,COST,F,GE,GPS,JPM,KO,MSFT,NVDA,PFE,SYY,WFC,ZM
2020-01-06,0,0,0,-1,0,0,0,1,0,0,0,0,0,0,-1,1
2020-01-07,0,0,0,0,0,1,0,0,-1,0,0,0,0,-1,0,1
2020-01-08,0,-1,0,0,0,0,-1,0,0,0,1,0,0,1,0,0
2020-01-09,0,0,0,0,1,0,0,-1,0,1,0,0,-1,0,0,0
2020-01-10,0,0,0,0,0,0,-1,1,-1,0,0,0,1,0,0,0


In [7]:
# See the numerical returns
labels.vol_adj_rets.dropna().head()

Date,AAPL,AMD,BABA,CCL,COST,F,GE,GPS,JPM,KO,MSFT,NVDA,PFE,SYY,WFC,ZM
2020-01-06,0.745104,-0.227461,-0.121334,-2.423722,0.039694,-0.510354,0.905577,2.396431,-0.098566,-0.056502,0.311692,0.266653,-0.163218,-0.217696,-0.65039,1.907438
2020-01-07,-0.44573,-0.153854,0.336044,0.246498,-0.230719,0.917596,-0.481032,-0.020029,-2.095531,-1.188178,-1.104471,0.769062,-0.428513,-1.346063,-0.901816,0.959547
2020-01-08,1.499162,-0.467444,0.126608,0.379514,1.652929,0.0,-0.595931,0.115628,0.950279,0.286349,1.883811,0.121106,1.019218,1.894693,0.33163,0.391693
2020-01-09,1.950335,1.255253,1.276731,0.723627,2.271283,0.103904,-0.165243,-1.860656,0.448857,2.743676,1.471031,0.708617,-0.56109,0.196104,-0.188375,0.04249
2020-01-10,0.211695,-0.879143,0.686095,-0.5782,-1.041758,-0.1051,-1.33137,1.082708,-1.230117,0.524128,-0.552667,0.349013,1.94553,0.570532,-0.486028,0.286838


### Error Handling

Errors will be raised if inputs are invalid.

In [8]:
# If number of bins is greater than the width of the price data i.e. exceeds the number of stocks
try:
    TailSetLabels(data[:100], n_bins=50)
except Exception as exc:
    print(exc)

# If window is either not an int or exceeds the length of the price data.
try:
    TailSetLabels(data[:100], n_bins=10, vol_adj='stdev', window='str')
except Exception as exc:
    print(exc)
try:
    TailSetLabels(data[:100], n_bins=10, vol_adj='stdev', window=200)
except Exception as exc:
    print(exc)

n_bins exceeds the number of stocks!
If vol_adj is not None, window must be int.
Length of price data must be greater than the window.


---
## Conclusion

This notebook presents the tail set labeling method, which identifies the stocks with the most extreme volatility-adjusted returns within a group on a given day. The user chooses the number of quantiles, and the top and bottom quantiles are labeled as the positive and negative tail sets, respectively. These labels can serve as training data for a classifier, and a strategy can then go long the predicted positive tail set and short the predicted negative one.

## References

1. Huerta, R., Corbacho, F. and Elkan, C., 2013. Nonlinear support vector machines can systematically identify stocks with high and low future returns. Algorithmic Finance, 2(1), pp.45-58.