mlfinlab.regression.history_weight_regression
Implementation of the historically weighted regression method based on relevance.
Module Contents
Classes
HistoryWeightRegression: The class that houses all related methods for the historically weighted regression tool.
- class HistoryWeightRegression(Y_train: numpy.array, X_train: numpy.array, check_condi_num: bool = False)
The class that houses all related methods for the historically weighted regression tool.
- get_fit_result() → dict
Fit result and statistics using the training data.
- Returns:
(dict) The fit result and associated statistics.
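Example: a minimal usage sketch, assuming the class is importable from the module path above (the synthetic data and its shapes are illustrative, not required):

```python
import numpy as np
from mlfinlab.regression.history_weight_regression import HistoryWeightRegression

# Synthetic training data: 100 instances, 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
Y_train = X_train @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=100)

# Note the argument order: Y_train first, then X_train, per the signature above.
model = HistoryWeightRegression(Y_train, X_train, check_condi_num=False)

# Fit result and statistics computed from the training data.
print(model.get_fit_result())
```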
- predict(X_t: numpy.array, relev_ratio_threshold: float = 1) → numpy.array
Predict the results using the fitted model from a subsample chosen by the ratio of relevance.
For example, if relev_ratio_threshold = 0.4, then the top 40 percentile of the training data ranked by relevance to x_t is used. This method returns the predictions in column 0 and the associated prediction standard deviations in column 1.
For each row element x_t in X_t we have: y_t := y_avg + 1/(n-1) * sum{relevance(x_i, x_t) * (y_i - y_avg), subsample}, where the y_i, x_i are taken from the subsample. The equivalent matrix form is: y_t := y_avg + 1/(n-1) * (x_t - x_avg).T @ fisher_info_mtx @ (X_sub - x_avg).T @ (y_sub - y_avg)
- Parameters:
  - X_t – (np.array) The 2D (n_t-by-k) test data, where n_t is the number of instances and k is the number of variables or features.
  - relev_ratio_threshold – (float) Optional. The subsample ratio to use for predicting values ranked by relevance; must be a number in [0, 1]. For example, 0.6 corresponds to the top 60 percentile of the data ranked by relevance to x_t. Defaults to 1.
- Returns:
(np.array) The predicted results in column 0 and the standard deviations in column 1.
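Example: a sketch of batch prediction on a relevance-ranked subsample; the column layout follows the Returns description above:

```python
import numpy as np
from mlfinlab.regression.history_weight_regression import HistoryWeightRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
Y_train = X_train @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=100)
model = HistoryWeightRegression(Y_train, X_train)

# Predict 5 test rows using only the top 40 percentile of training
# instances ranked by relevance to each test row.
X_t = rng.normal(size=(5, 3))
result = model.predict(X_t, relev_ratio_threshold=0.4)

y_hat = result[:, 0]   # column 0: point predictions
y_std = result[:, 1]   # column 1: prediction standard deviations
```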
- predict_one_val(x_t: numpy.array, relev_ratio_threshold: float = 1) → Tuple[float, float]
Predict one value using the fitted model from a subsample chosen by the ratio of relevance.
For example, if relev_ratio_threshold = 0.4, then the top 40 percentile of the training data ranked by relevance to x_t is used. This method also returns the associated prediction standard deviation.
y_t := y_avg_sub + 1/(n-1) * sum{relevance(x_i, x_t) * (y_i - y_avg_sub), subsample}, where the y_i, x_i are taken from the subsample. The equivalent matrix form is: y_t := y_avg_sub + 1/(n-1) * (x_t - x_avg).T @ fisher_info_mtx @ (X_sub - x_avg).T @ (y_sub - y_avg_sub)
- Parameters:
  - x_t – (np.array) A single test instance, 1D (k, ), where k is the number of features.
  - relev_ratio_threshold – (float) Optional. The subsample ratio to use for predicting values ranked by relevance; must be a number in [0, 1]. For example, 0.6 corresponds to the top 60 percentile of the data ranked by relevance to x_t. Defaults to 1.
- Returns:
(Tuple[float, float]) The predicted result and the associated standard deviation.
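Example: a sketch of the single-instance variant under the same assumed import path; it returns a plain (prediction, standard deviation) tuple:

```python
import numpy as np
from mlfinlab.regression.history_weight_regression import HistoryWeightRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
Y_train = X_train @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=100)
model = HistoryWeightRegression(Y_train, X_train)

# Single 1D test instance; returns (prediction, standard deviation).
x_t = rng.normal(size=3)
y_hat, y_std = model.predict_one_val(x_t, relev_ratio_threshold=0.4)
```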
- find_subsample(x_t: numpy.array, relev_ratio_threshold: float = 1, above: bool = True) → Tuple[numpy.array, numpy.array, numpy.array, float]
Find the subsamples of X and Y in the training set whose relevance to x_t is above or below a given threshold.
For example, if relev_ratio_threshold=0.3 and above=True, it finds the top 30 percentile; if relev_ratio_threshold=0.3 and above=False, it finds the bottom 70 percentile.
The standard deviation is calculated as the square root of the variance of y_t hat, the prediction w.r.t. x_t: var_yt_hat = [(n-1)/n^2 * var_y] + [1/n * y_mean^2] + [var_y/n + y_mean^2/(n-1)] * var_r, where var_y is the subsample variance of Y, y_mean is the subsample average of Y, and var_r is the subsample variance of relevance.
- Parameters:
  - x_t – (np.array) A single test instance, 1D (k, ), where k is the number of features.
  - relev_ratio_threshold – (float) Optional. The subsample ratio to use for predicting values ranked by relevance; must be a number in [0, 1]. Defaults to 1.
  - above – (bool) Optional. Whether to find the subsample above the threshold or below it. Defaults to True.
- Returns:
(Tuple[np.array, np.array, np.array, float]) The subsample of X, the subsample of Y, the indices selecting the subsample, and the standard deviation.
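Example: a sketch showing the complementary above/below splits described above:

```python
import numpy as np
from mlfinlab.regression.history_weight_regression import HistoryWeightRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
Y_train = X_train @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=100)
model = HistoryWeightRegression(Y_train, X_train)
x_t = rng.normal(size=3)

# Top 30 percentile of training rows by relevance to x_t ...
X_sub, y_sub, idx, std = model.find_subsample(x_t, relev_ratio_threshold=0.3, above=True)

# ... and the complementary bottom 70 percentile.
X_low, y_low, idx_low, std_low = model.find_subsample(x_t, relev_ratio_threshold=0.3, above=False)
```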
- calc_relevance(x_i: numpy.array, x_j: numpy.array, fisher_info_mtx: numpy.array = None) → float
Calculate the relevance of x_i and x_j: r(x_i, x_j).
r(x_i, x_j) := sim(x_i, x_j) + info(x_i) + info(x_j)
- Parameters:
  - x_i – (np.array) 1D (k, ) data vector for an instance, where k is the number of features.
  - x_j – (np.array) 1D (k, ) data vector for an instance, where k is the number of features.
  - fisher_info_mtx – (np.array) Optional. 2D (k, k) matrix for the whole training data. Defaults to the Fisher info matrix stored in the class, calculated from the training data.
- Returns:
(float) The relevance value.
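Example: a sketch verifying the decomposition of relevance into similarity plus the two informativeness terms, via the class methods documented below:

```python
import numpy as np
from mlfinlab.regression.history_weight_regression import HistoryWeightRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
Y_train = rng.normal(size=100)
model = HistoryWeightRegression(Y_train, X_train)

# r(x_i, x_j) := sim(x_i, x_j) + info(x_i) + info(x_j)
x_i, x_j = X_train[0], X_train[1]
lhs = model.calc_relevance(x_i, x_j)
rhs = model.calc_sim(x_i, x_j) + model.calc_info(x_i) + model.calc_info(x_j)
assert np.isclose(lhs, rhs)
```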
- calc_sim(x_i: numpy.array, x_j: numpy.array, fisher_info_mtx: numpy.array = None) → float
Calculate the similarity of x_i and x_j: sim(x_i, x_j).
sim(x_i, x_j) := -1/2 * (x_i - x_j).T @ fisher_info_mtx @ (x_i - x_j)
- Parameters:
  - x_i – (np.array) 1D (k, ) data vector for an instance, where k is the number of features.
  - x_j – (np.array) 1D (k, ) data vector for an instance, where k is the number of features.
  - fisher_info_mtx – (np.array) Optional. 2D (k, k) matrix for the whole training data. Defaults to the Fisher info matrix stored in the class, calculated from the training data.
- Returns:
(float) The similarity value.
- calc_info(x_i: numpy.array, fisher_info_mtx: numpy.array = None) → float
Calculate the informativeness of x_i: info(x_i).
info(x_i) := 1/2 * (x_i - x_avg).T @ fisher_info_mtx @ (x_i - x_avg), where x_avg is the column-wise average of the training data.
- Parameters:
  - x_i – (np.array) 1D (k, ) data vector for an instance, where k is the number of features.
  - fisher_info_mtx – (np.array) Optional. 2D (k, k) matrix for the whole training data. Defaults to the Fisher info matrix stored in the class, calculated from the training data.
- Returns:
(float) The informativeness value.
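For intuition, a standalone NumPy sketch of the three quantities above. It assumes fisher_info_mtx is the inverse of the sample covariance matrix of the training features, the usual choice in the relevance-based regression literature; the class may compute it differently:

```python
import numpy as np

def sim(x_i, x_j, fisher_info_mtx):
    # sim(x_i, x_j) := -1/2 * (x_i - x_j).T @ fisher_info_mtx @ (x_i - x_j)
    d = x_i - x_j
    return -0.5 * d @ fisher_info_mtx @ d

def info(x_i, x_avg, fisher_info_mtx):
    # info(x_i) := 1/2 * (x_i - x_avg).T @ fisher_info_mtx @ (x_i - x_avg)
    d = x_i - x_avg
    return 0.5 * d @ fisher_info_mtx @ d

def relevance(x_i, x_j, x_avg, fisher_info_mtx):
    # r(x_i, x_j) := sim(x_i, x_j) + info(x_i) + info(x_j)
    return (sim(x_i, x_j, fisher_info_mtx)
            + info(x_i, x_avg, fisher_info_mtx)
            + info(x_j, x_avg, fisher_info_mtx))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
x_avg = X_train.mean(axis=0)
# Assumption: Fisher information matrix as the inverse sample covariance.
fisher_info_mtx = np.linalg.inv(np.cov(X_train, rowvar=False))

print(relevance(X_train[0], X_train[1], x_avg, fisher_info_mtx))
```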