Raw Returns


Labeling by raw returns is the simplest method of labeling financial data for machine learning. Raw returns can be calculated on either a simple or a logarithmic basis. Returns are usually preferred over prices for financial time series because returns are typically stationary, whereas prices are not. As a result, returns on different assets, or on the same asset at different times, can be compared directly with each other. The same cannot be said of price differences, because the magnitude of a price change depends heavily on the starting price, which varies over time.

The simple return for an observation with price \(p_t\) at time \(t\) relative to its price at time \(t-1\) is as follows:

\[R_t = \frac{p_{t}}{p_{t-1}} - 1\]

And the logarithmic return is:

\[r_t = \log(p_t) - \log(p_{t-1})\]
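As a minimal sketch of the two formulas above, both kinds of return can be computed directly with pandas and NumPy. The price values and the variable names here are hypothetical and chosen only for illustration; this is not the library's own implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices, for illustration only
prices = pd.Series(
    [100.0, 102.0, 101.0, 101.0],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# Simple return: R_t = p_t / p_{t-1} - 1
simple_returns = prices.pct_change()

# Logarithmic return: r_t = log(p_t) - log(p_{t-1})
log_returns = np.log(prices).diff()

# The first observation has no previous price, so both series start with NaN.
# For every other observation the two are related by r_t = log(1 + R_t).
```

For small price changes the two measures are close to each other, which is why either can serve as a label.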

The label \(L_t\) is either \(r_t\) itself or, if binary labeling is desired, the sign of \(r_t\):

\[L_{t} = \begin{cases} -1 & \text{if} \ r_t < 0\\ 0 & \text{if} \ r_t = 0\\ 1 & \text{if} \ r_t > 0 \end{cases}\]
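The sign-based labeling can be sketched with `numpy.sign`, which maps negative values to -1, zero to 0, and positive values to 1. The return values below are hypothetical, and this is only an illustration of the case definition, not necessarily how the library computes it.

```python
import numpy as np
import pandas as pd

# Hypothetical logarithmic returns: one gain, one loss, one unchanged price
log_returns = pd.Series([0.0198, -0.0099, 0.0])

# Label each observation by the sign of its return
labels = np.sign(log_returns)
```

Note that an unchanged price produces a label of 0, so the output can contain three distinct values.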

If desired, the user can specify a resampling period to apply to the price data prior to calculating returns. The user can also lag the returns to make them forward-looking.
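The sketch below shows one way resampling and lagging could be done with plain pandas. It assumes that "lagging" means shifting the returns back by one period, so that the value at time \(t\) is the return from \(t\) to \(t+1\); that interpretation, the prices, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices over ten business days (two full weeks)
index = pd.bdate_range("2020-01-06", periods=10)
prices = pd.Series(np.arange(100.0, 110.0), index=index)

# Resample to weekly frequency, keeping the last observation in each period
weekly_prices = prices.resample("W").last()

# Backward-looking simple return of each week relative to the previous one
returns = weekly_prices.pct_change()

# Assumed meaning of "lag": shift by one period so that each row holds the
# return over the following period, which leaves the final row as NaN
forward_returns = returns.shift(-1)
```

With forward-looking returns, features observed at time \(t\) are paired with a label that is only known afterwards, which is the usual setup for a prediction task.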

The following figure shows the distribution of logarithmic daily returns on Microsoft (MSFT) stock between January 2010 and May 2020.


Distribution of logarithmic returns on MSFT.


Implementation

Labeling Raw Returns.

The most basic form of labeling, based on the raw return of each observation relative to its previous value.

raw_return(prices, binary=False, logarithmic=False, resample_by=None, lag=True)

Raw returns labeling method.

This is the most basic and ubiquitous labeling method, used as a precursor to almost any kind of financial data analysis or machine learning. The user can specify simple or logarithmic returns, numerical or binary labels, a resampling period, and whether returns are lagged to be forward-looking.

Parameters:
  • prices – (pd.Series/pd.DataFrame) Time-indexed price data on stocks from which to calculate returns.

  • binary – (bool) If False, will return numerical returns. If True, will return the sign of the raw return.

  • logarithmic – (bool) If False, will calculate simple returns. If True, will calculate logarithmic returns.

  • resample_by – (str) If not None, the resampling period for price data prior to calculating returns. ‘B’ = per business day, ‘W’ = week, ‘M’ = month, etc. Will take the last observation for each period. For full details, see the pandas documentation on offset aliases.

  • lag – (bool) If True, returns will be lagged to make them forward-looking.

Returns:

(pd.Series/pd.DataFrame) Raw returns on market data. User can specify whether returns will be based on simple or logarithmic return, and whether the output will be numerical or categorical.


Example

Below is an example of how to use the raw returns labeling method.

import yfinance as yf

# Import MlFinLab tools
from mlfinlab.labeling.raw_return import raw_return

# Load SPY adjusted close prices
data = yf.download(tickers="SPY", start="2020-01-01", end="2022-01-01", interval="1d")[
    "Adj Close"
]

# Create labels numerically based on simple returns
returns = raw_return(prices=data, lag=True)

# Create labels categorically based on logarithmic returns
returns = raw_return(prices=data, binary=True, logarithmic=True, lag=True)

# Create labels categorically on weekly data with forward looking log returns
returns = raw_return(
    prices=data, binary=True, logarithmic=True, resample_by="W", lag=True
)

Research Notebook

The following research notebook can be used to better understand the raw return labeling technique.

  • Raw Return Example


Presentation Slides