Modules

regressors.stats

regressors.stats.sse(clf, X, y)

Calculate the standard squared error of the model.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:

The standard squared error of the model.

Return type:

float

regressors.stats.adj_r2_score(clf, X, y)

Calculate the adjusted R^2 of the model.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:

The adjusted R^2 of the model.

Return type:

float
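As a rough illustration, the usual textbook definition of adjusted R^2 can be sketched with NumPy alone. The helper name adjusted_r2 and the sample values are hypothetical, and the sketch assumes the formula 1 - (1 - R^2)(n - 1)/(n - p - 1), which may differ in detail from what adj_r2_score computes.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 under the textbook formula (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.shape[0]
    # Ordinary R^2: one minus residual over total sum of squares.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Penalize R^2 for the number of predictors used.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

# Made-up targets and predictions from a one-predictor model.
value = adjusted_r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0], n_features=1)
print(value)
```

Because the penalty grows with the number of predictors, the adjusted value is never larger than the plain R^2 for the same predictions.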

regressors.stats.coef_se(clf, X, y)

Calculate standard error for beta coefficients.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:

An array of standard errors for the beta coefficients.

Return type:

numpy.ndarray
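For intuition, the classical OLS standard errors can be sketched as the square roots of the diagonal of sigma^2 (X'X)^-1. This is a NumPy-only sketch on synthetic data, not necessarily the computation coef_se performs; the helper name ols_coef_se and the choice of n - p - 1 degrees of freedom are assumptions.

```python
import numpy as np

def ols_coef_se(X, y):
    """Standard errors for the OLS intercept and slopes (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Prepend a column of ones so the intercept is estimated too.
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    # Residual variance with n - p - 1 degrees of freedom (an assumption).
    sigma2 = resid @ resid / (n - p - 1)
    cov = sigma2 * np.linalg.inv(X1.T @ X1)
    return beta, np.sqrt(np.diag(cov))

# Synthetic data: intercept 1.0, slopes 2.0 and -1.0, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=50)
beta, se = ols_coef_se(X, y)
print(beta, se)
```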

regressors.stats.coef_tval(clf, X, y)

Calculate t-statistic for beta coefficients.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:

An array of t-statistic values.

Return type:

numpy.ndarray
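A t-statistic is conventionally an estimate divided by its standard error. The short sketch below uses made-up estimates and standard errors (both hypothetical) purely to show the arithmetic; it does not call the library.

```python
import numpy as np

# Hypothetical estimates and standard errors (intercept first).
beta = np.array([1.2, 2.0, -0.5])
se = np.array([0.4, 0.5, 0.25])

# Each t-statistic is the estimate divided by its standard error.
t_values = beta / se

# Rank coefficients by the magnitude of their t-statistic, largest first.
order = np.argsort(-np.abs(t_values))
print(t_values, order)
```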

regressors.stats.coef_pval(clf, X, y)

Calculate p-values for beta coefficients.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:

An array of p-values.

Return type:

numpy.ndarray

regressors.stats.f_stat(clf, X, y)

Calculate summary F-statistic for beta coefficients.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:

The F-statistic value.

Return type:

float
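One common way to write the overall F-statistic is in terms of R^2, the number of samples n, and the number of predictors p. The sketch below assumes that form, F = (R^2 / p) / ((1 - R^2) / (n - p - 1)); the helper name f_statistic and the input values are hypothetical.

```python
def f_statistic(r_squared, n_samples, n_features):
    """Overall F-statistic from R^2 (hypothetical helper, textbook form)."""
    # Explained variation per predictor.
    explained = r_squared / n_features
    # Unexplained variation per residual degree of freedom.
    unexplained = (1.0 - r_squared) / (n_samples - n_features - 1)
    return explained / unexplained

f_value = f_statistic(r_squared=0.8, n_samples=23, n_features=2)
print(round(f_value, 6))
```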

regressors.stats.residuals(clf, X, y, r_type=u'standardized')

Calculate residuals or standardized residuals.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
  • r_type (str) –

    Type of residuals to return: ‘raw’, ‘standardized’, ‘studentized’. Defaults to ‘standardized’.

    • ‘raw’ will return the raw residuals.
    • ‘standardized’ will return the standardized residuals, also known as internally studentized residuals, which are calculated as the raw residuals divided by the square root of the MSE (or the standard deviation of the residuals).
    • ‘studentized’ will return the externally studentized residuals, which are calculated as the raw residuals divided by sqrt(LOO-MSE * (1 - leverage_score)), where LOO-MSE is the leave-one-out MSE.
Returns:

An array of residuals.

Return type:

numpy.ndarray
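The three residual types described above can be sketched directly with NumPy. This is an illustration on synthetic data, not the library's implementation: the helper name residual_types is hypothetical, MSE is taken as the plain mean of squared residuals, and the leave-one-out MSE is obtained by refitting without each observation. Degrees-of-freedom conventions vary between references.

```python
import numpy as np

def residual_types(X, y):
    """Raw, standardized, and externally studentized OLS residuals (sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    X1 = np.column_stack([np.ones(n), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    raw = y - X1 @ beta
    # Standardized: raw residuals divided by sqrt(MSE).
    standardized = raw / np.sqrt(np.mean(raw ** 2))
    # Leverage scores: diagonal of the hat matrix X1 (X1'X1)^-1 X1'.
    leverage = np.einsum("ij,jk,ik->i", X1, np.linalg.inv(X1.T @ X1), X1)
    studentized = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        # Refit without observation i to get the leave-one-out MSE.
        b_i, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        loo_mse = np.mean((y[keep] - X1[keep] @ b_i) ** 2)
        studentized[i] = raw[i] / np.sqrt(loo_mse * (1.0 - leverage[i]))
    return raw, standardized, studentized

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 0.5 + X @ np.array([1.5, -0.7]) + rng.normal(scale=0.3, size=30)
raw, standardized, studentized = residual_types(X, y)
```

The explicit refit loop keeps the definition visible; for large data sets a closed-form leave-one-out update would be cheaper.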

regressors.stats.summary(clf, X, y, xlabels=None)

Output summary statistics for a fitted regression model.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
  • xlabels (list, tuple) – The labels for the predictors.

regressors.plots

regressors.plots.plot_residuals(clf, X, y, r_type=u'standardized', figsize=(10, 8))

Plot residuals of a linear model.

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
  • r_type (str) –

    Type of residuals to plot: ‘raw’, ‘standardized’, ‘studentized’. Defaults to ‘standardized’.

    • ‘raw’ plots the raw residuals.
    • ‘standardized’ plots the standardized residuals, also known as internally studentized residuals, which are calculated as the raw residuals divided by the square root of the MSE (or the standard deviation of the residuals).
    • ‘studentized’ plots the externally studentized residuals, which are calculated as the raw residuals divided by sqrt(LOO-MSE * (1 - leverage_score)), where LOO-MSE is the leave-one-out MSE.
  • figsize (tuple) – A tuple indicating the size of the plot to be created, with format (x-axis, y-axis). Defaults to (10, 8).
Returns:

The Figure instance.

Return type:

matplotlib.figure.Figure

regressors.plots.plot_qq(clf, X, y, figsize=(7, 7))

Generate a Q-Q plot (a.k.a. normal quantile plot).

Parameters:
  • clf (sklearn.linear_model) – A scikit-learn linear regression model with a predict() method.
  • X (numpy.ndarray) – Training data used to fit the model.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
  • figsize (tuple) – A tuple indicating the size of the plot to be created, with format (x-axis, y-axis). Defaults to (7, 7).
Returns:

The Figure instance.

Return type:

matplotlib.figure.Figure

regressors.plots.plot_pca_pairs(clf_pca, x_train, y=None, n_components=3, diag=u'kde', cmap=None, figsize=(10, 10))

Create pairwise plots of principal components from x data.

Colors the components according to the y values.

Parameters:
  • clf_pca (sklearn.decomposition.PCA) – A fitted scikit-learn PCA model.
  • x_train (numpy.ndarray) – Training data used to fit clf_pca, either scaled or un-scaled, depending on how clf_pca was fit.
  • y (numpy.ndarray) – Target training values, of shape = [n_samples].
  • n_components (int) – Desired number of principal components to plot. Defaults to 3.
  • diag (str) –

    Type of plot to display on the diagonals. Default is ‘kde’.

    • ‘kde’: density curves
    • ‘hist’: histograms
  • cmap (str) – A string representation of a Seaborn color map. See available maps: https://stanford.edu/~mwaskom/software/seaborn/tutorial/color_palettes.
  • figsize (tuple) – A tuple indicating the size of the plot to be created, with format (x-axis, y-axis). Defaults to (10, 10).
Returns:

The Figure instance.

Return type:

matplotlib.figure.Figure

regressors.plots.plot_scree(clf_pca, xlim=[-1, 10], ylim=[-0.1, 1.0], required_var=0.9, figsize=(10, 5))

Create side-by-side scree plots for analyzing variance of principal components from PCA.

Parameters:
  • clf_pca (sklearn.decomposition.PCA) – A fitted scikit-learn PCA model.
  • xlim (list) – X-axis range. If required_var is supplied, the maximum x-axis value will automatically be set so that the required variance line is visible on the plot. Defaults to [-1, 10].
  • ylim (list) – Y-axis range. Defaults to [-0.1, 1.0].
  • required_var (float, int, None) – A variance value to mark on the scree plot. Set to None to omit it from the plot. Defaults to 0.9.
  • figsize (tuple) – A tuple indicating the size of the plot to be created, with format (x-axis, y-axis). Defaults to (10, 5).
Returns:

The Figure instance.

Return type:

matplotlib.figure.Figure
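A scree plot is typically read by asking how many components are needed before the cumulative explained variance reaches a threshold such as required_var. The sketch below works on a plain array of explained-variance ratios, of the kind a fitted scikit-learn PCA exposes as explained_variance_ratio_. The helper name components_for_variance and the ratios are hypothetical, and how plot_scree itself uses required_var is not assumed here.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, required_var=0.9):
    """Smallest number of components whose cumulative ratio reaches required_var."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first cumulative value that is >= required_var.
    idx = int(np.searchsorted(cumulative, required_var))
    # Cap at the total number of components if the threshold is never reached.
    return min(idx + 1, len(cumulative)), cumulative

# Made-up ratios for five components (they sum to 1).
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
n_needed, cumulative = components_for_variance(ratios, required_var=0.9)
print(n_needed, cumulative)
```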

regressors.regressors

class regressors.regressors.PCR(n_components=None, regression_type=u'ols', alpha=1.0, l1_ratio=0.5, n_jobs=1)

Principal components regression model.

This model fits a regression after standardizing the X data and applying PCA to reduce the dimensionality of X. The class simply creates a pipeline that uses:

  1. sklearn.preprocessing.StandardScaler
  2. sklearn.decomposition.PCA
  3. a supported sklearn.linear_model

Attributes of the class mimic what is provided by scikit-learn’s PCA and linear model classes. Additional attributes specifically relevant to PCR are also provided, such as PCR.beta_coef_.

Parameters:
  • n_components (int, float, None, str) –

    Number of components to keep when performing PCA. If n_components is not set, all components are kept:

    n_components == min(n_samples, n_features)
    

    If n_components == ‘mle’, Minka’s MLE is used to guess the dimension. If 0 < n_components < 1, selects the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components.

  • regression_type (str) – The type of regression model to use. Must be one of ‘ols’, ‘lasso’, ‘ridge’, or ‘elasticnet’.
  • n_jobs (int (optional)) – The number of jobs to use for the computation. If n_jobs=-1, all CPUs are used. This will only increase speed of computation for n_targets > 1 and sufficiently large problems.
  • alpha (float (optional)) – Used when regression_type is ‘lasso’, ‘ridge’, or ‘elasticnet’. Represents the constant that multiplies the penalty terms. Setting alpha=0 is equivalent to ordinary least squares; in that case, use regression_type='ols' instead. See the scikit-learn documentation for the chosen regression model for more information on this parameter.
  • l1_ratio (float (optional)) – Used when regression_type is ‘elasticnet’. The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
scaler

sklearn.preprocessing.StandardScaler, None

The StandardScaler object used to center the X data and scale to unit variance. Must have fit() and transform() methods. Can be overridden prior to fitting to use a different scaler:

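from sklearn.preprocessing import StandardScaler
from regressors.regressors import PCR

pcr = PCR()
# Change StandardScaler options
pcr.scaler = StandardScaler(with_mean=False, with_std=True)
# X and y are the training data and target values
pcr.fit(X, y)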

The scaler can also be removed prior to fitting (to not scale X during fitting or predictions) with pcr.scaler = None.

prcomp

sklearn.decomposition.PCA

The PCA object used to perform PCA. This can also be accessed in the same way as the scaler.

regression

sklearn.linear_model

The linear model object used to perform regression. Must have fit() and predict() methods. This defaults to OLS using scikit-learn’s LinearRegression model, but can be overridden either by using the regression_type parameter when instantiating the class, or by replacing the regression model with a different one prior to fitting:

from sklearn import linear_model
from regressors.regressors import PCR

pcr = PCR(regression_type='ols')
# Examine the current regression model
print(pcr.regression)
# Prints (the exact repr depends on the scikit-learn version):
#   LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
#       normalize=False)

# Use Lasso regression with cross-validation instead of OLS
pcr.regression = linear_model.LassoCV(n_alphas=200)
print(pcr.regression)
# Prints:
#   LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001,
#       fit_intercept=True, max_iter=1000, n_alphas=200, n_jobs=1,
#       normalize=False, positive=False, precompute='auto',
#       random_state=None, selection='cyclic', tol=0.0001,
#       verbose=False)

# X and y are the training data and target values
pcr.fit(X, y)

beta_coef_
Returns: Beta coefficients, corresponding to coefficients in the original space and dimension of X. These are calculated as B = A · P, where A is a vector of the coefficients obtained from regression on the principal components and P is the matrix of loadings from PCA.
Return type: numpy.ndarray
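The relation B = A · P can be checked numerically with a NumPy-only sketch of principal components regression. This does not use the PCR class; the data are synthetic, PCA is done via SVD, and B is expressed in terms of the standardized features. Whether the library additionally undoes the scaling is not covered here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=40)

# Standardize X, as the scaling step of the pipeline would.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal axes (loadings).
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
k = 2
P = Vt[:k]        # loadings, shape (k, n_features)
Z = Xs @ P.T      # component scores, shape (n_samples, k)

# Regress y on the scores (with an intercept) to obtain A.
Z1 = np.column_stack([np.ones(len(Z)), Z])
coef, *_ = np.linalg.lstsq(Z1, y, rcond=None)
intercept, A = coef[0], coef[1:]

# Map the coefficients back to feature space: B = A · P.
B = A @ P         # shape (n_features,)

# Predictions agree whether computed from scores or from features.
pred_from_scores = intercept + Z @ A
pred_from_features = intercept + Xs @ B
print(np.allclose(pred_from_scores, pred_from_features))
```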
fit(X, y)

Fit the PCR model.

Parameters:
  • X (numpy.ndarray) – Training data.
  • y (numpy.ndarray) – Target values.
Returns:

An instance of self.

Return type:

regressors.regressors.PCR

intercept_
Returns: The intercept for the regression model, both in PCA-space and in the original X-space.
Return type: float
predict(X)

Predict using the PCR model.

Parameters: X (numpy.ndarray) – Samples to predict values from.
Returns: Predicted values.
Return type: numpy.ndarray
score(X, y)

Return the coefficient of determination (R^2) of the predictions.

Parameters:
  • X (numpy.ndarray) – Training or test samples.
  • y (numpy.ndarray) – Target values.
Returns:

The R^2 value of the predictions.

Return type:

float