Modules

regressors.stats

regressors.stats.sse(clf, X, y)

Calculate the standard squared error of the model.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:	The standard squared error of the model.
Return type:	float

regressors.stats.adj_r2_score(clf, X, y)

Calculate the adjusted $R^2$ of the model.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:	The adjusted $R^2$ of the model.
Return type:	float
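
As a rough illustration, the conventional adjusted $R^2$ formula, $1 - (1 - R^2)(n - 1)/(n - p - 1)$, fits in a few lines. This is a sketch of the textbook definition, not the library's source; the helper name `adjusted_r2_sketch` and its arguments are hypothetical.

```python
def adjusted_r2_sketch(r2, n_samples, n_features):
    """Textbook adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Example: R^2 = 0.90 from 50 samples and 5 predictors.
print(adjusted_r2_sketch(0.90, n_samples=50, n_features=5))
```

The penalty grows with the number of predictors, so the adjusted value is never larger than the plain $R^2$ when at least one predictor is present.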

regressors.stats.coef_se(clf, X, y)

Calculate standard error for beta coefficients.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:	An array of standard errors for the beta coefficients.
Return type:	numpy.ndarray

regressors.stats.coef_tval(clf, X, y)

Calculate t-statistic for beta coefficients.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:	An array of t-statistic values.
Return type:	numpy.ndarray
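
The link between a coefficient, its standard error, and its t-statistic (t = coefficient / standard error) can be sketched for the single-predictor case using only the standard library. This uses the textbook formulas with n - 2 residual degrees of freedom; the library's own degrees-of-freedom convention may differ, and the function name `simple_ols_t_sketch` is hypothetical.

```python
import math


def simple_ols_t_sketch(x, y):
    """Return (slope, standard error of slope, t-statistic) for y ~ a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance estimate with n - 2 degrees of freedom (assumption).
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    se_slope = math.sqrt(mse / sxx)
    # Undefined for a perfect fit, where the standard error is zero.
    return slope, se_slope, slope / se_slope


x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, se_slope, t_value = simple_ols_t_sketch(x, y)
print(slope, se_slope, t_value)
```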

regressors.stats.coef_pval(clf, X, y)

Calculate p-values for beta coefficients.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:	An array of p-values.
Return type:	numpy.ndarray

regressors.stats.f_stat(clf, X, y)

Calculate summary F-statistic for beta coefficients.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
Returns:	The F-statistic value.
Return type:	float
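
For orientation, the overall F-statistic of a linear model with an intercept can be expressed through $R^2$ as $F = (R^2 / p) / ((1 - R^2) / (n - p - 1))$. The sketch below applies that standard identity; it is not taken from the library's source, and `f_statistic_sketch` is a hypothetical name.

```python
def f_statistic_sketch(r2, n_samples, n_features):
    """Overall F-statistic from R^2, assuming the model includes an intercept."""
    dof_residual = n_samples - n_features - 1
    if dof_residual <= 0 or not 0.0 <= r2 < 1.0:
        raise ValueError("need n_samples > n_features + 1 and 0 <= r2 < 1")
    explained_per_feature = r2 / n_features
    unexplained_per_dof = (1.0 - r2) / dof_residual
    return explained_per_feature / unexplained_per_dof


# Example: R^2 = 0.5 with 12 samples and a single predictor.
print(f_statistic_sketch(0.5, n_samples=12, n_features=1))
```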

regressors.stats.residuals(clf, X, y, r_type='standardized')

Calculate residuals or standardized residuals.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
r_type (str) – Type of residuals to return: ‘raw’, ‘standardized’, or ‘studentized’. Defaults to ‘standardized’.
‘raw’ returns the raw residuals.
‘standardized’ returns the standardized residuals, also known as internally studentized residuals, calculated as the raw residuals divided by the square root of the MSE (or the STD of the residuals).
‘studentized’ returns the externally studentized residuals, calculated as the raw residuals divided by sqrt(LOO-MSE * (1 - leverage_score)).
Returns:	An array of residuals.
Return type:	numpy.ndarray
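
The ‘raw’ and ‘standardized’ definitions above can be sketched directly from observed and predicted values. This is a minimal illustration of the stated formulas, assuming MSE means the average squared residual; `residuals_sketch` is a hypothetical helper, and the ‘studentized’ variant is omitted because it needs leave-one-out fits and leverage scores.

```python
import math


def residuals_sketch(y, y_hat, r_type="standardized"):
    """Raw or standardized residuals, following the definitions in the text."""
    raw = [yi - yh for yi, yh in zip(y, y_hat)]
    if r_type == "raw":
        return raw
    if r_type == "standardized":
        # Assumption: MSE is the mean of the squared raw residuals.
        mse = sum(r * r for r in raw) / len(raw)
        return [r / math.sqrt(mse) for r in raw]
    raise ValueError("r_type must be 'raw' or 'standardized' in this sketch")


y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.5, 1.5, 3.5, 3.5]
print(residuals_sketch(y, y_hat, "raw"))           # [-0.5, 0.5, -0.5, 0.5]
print(residuals_sketch(y, y_hat, "standardized"))  # [-1.0, 1.0, -1.0, 1.0]
```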

regressors.stats.summary(clf, X, y, xlabels=None)

Output summary statistics for a fitted regression model.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
xlabels (list, tuple) – The labels for the predictors.

regressors.plots

regressors.plots.plot_residuals(clf, X, y, r_type='standardized', figsize=(10, 8))

Plot residuals of a linear model.

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
r_type (str) – Type of residuals to plot: ‘raw’, ‘standardized’, or ‘studentized’. Defaults to ‘standardized’.
‘raw’ plots the raw residuals.
‘standardized’ plots the standardized residuals, also known as internally studentized residuals, calculated as the raw residuals divided by the square root of the MSE (or the STD of the residuals).
‘studentized’ plots the externally studentized residuals, calculated as the raw residuals divided by sqrt(LOO-MSE * (1 - leverage_score)).
figsize (tuple) – The size of the plot to be created, with format (x-axis, y-axis). Defaults to (10, 8).
Returns:	The Figure instance.
Return type:	matplotlib.figure.Figure

regressors.plots.plot_qq(clf, X, y, figsize=(7, 7))

Generate a Q-Q plot (a.k.a. normal quantile plot).

Parameters:

clf (sklearn.linear_model) – A fitted scikit-learn linear model with a predict() method.
X (numpy.ndarray) – Training data used to fit the model.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
figsize (tuple) – The size of the plot to be created, with format (x-axis, y-axis). Defaults to (7, 7).
Returns:	The Figure instance.
Return type:	matplotlib.figure.Figure

regressors.plots.plot_pca_pairs(clf_pca, x_train, y=None, n_components=3, diag='kde', cmap=None, figsize=(10, 10))

Create pairwise plots of principal components from x data.

Colors the components according to the y values.

Parameters:

clf_pca (sklearn.decomposition.PCA) – A fitted scikit-learn PCA model.
x_train (numpy.ndarray) – Training data used to fit clf_pca, either scaled or un-scaled, depending on how clf_pca was fit.
y (numpy.ndarray) – Target training values, of shape = [n_samples].
n_components (int) – Desired number of principal components to plot. Defaults to 3.
diag (str) – Type of plot to display on the diagonals. Defaults to ‘kde’.
‘kde’: density curves
‘hist’: histograms
cmap (str) – A string representation of a Seaborn color map. See available maps: https://stanford.edu/~mwaskom/software/seaborn/tutorial/color_palettes.
figsize (tuple) – The size of the plot to be created, with format (x-axis, y-axis). Defaults to (10, 10).
Returns:	The Figure instance.
Return type:	matplotlib.figure.Figure

regressors.plots.plot_scree(clf_pca, xlim=[-1, 10], ylim=[-0.1, 1.0], required_var=0.9, figsize=(10, 5))

Create side-by-side scree plots for analyzing variance of principal components from PCA.

Parameters:

clf_pca (sklearn.decomposition.PCA) – A fitted scikit-learn PCA model.
xlim (list) – X-axis range. If required_var is supplied, the maximum x-axis value will automatically be set so that the required variance line is visible on the plot. Defaults to [-1, 10].
ylim (list) – Y-axis range. Defaults to [-0.1, 1.0].
required_var (float, int, None) – A value of variance to distinguish on the scree plot. Set to None to not include on the plot. Defaults to 0.9.
figsize (tuple) – The size of the plot to be created, with format (x-axis, y-axis). Defaults to (10, 5).
Returns:	The Figure instance.
Return type:	matplotlib.figure.Figure
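
To make the role of required_var concrete, the sketch below finds how many leading components are needed for the cumulative explained variance to reach a threshold. It assumes the input is a list of explained-variance ratios (such as a fitted PCA's explained_variance_ratio_); whether the plot uses >= or > at the boundary is not stated here, and `components_for_variance` is a hypothetical helper.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, required_var=0.9):
    """Smallest number of leading components whose cumulative ratio >= required_var."""
    cumulative = accumulate(explained_variance_ratio)
    for k, total in enumerate(cumulative, start=1):
        if total >= required_var:
            return k
    return None  # the threshold is never reached


ratios = [0.5, 0.3, 0.15, 0.05]
print(components_for_variance(ratios, required_var=0.9))  # 3
```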

regressors.regressors

class regressors.regressors.PCR(n_components=None, regression_type='ols', alpha=1.0, l1_ratio=0.5, n_jobs=1)

Principal components regression model.

This model fits a regression after standardizing the X data and performing PCA to reduce the dimensionality of X. The class simply creates a pipeline that uses:

sklearn.preprocessing.StandardScaler

sklearn.decomposition.PCA

a supported sklearn.linear_model

Attributes of the class mimic what is provided by scikit-learn’s PCA and linear model classes. Additional attributes specifically relevant to PCR are also provided, such as PCR.beta_coef_.
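
The three-stage pipeline described above can be approximated with scikit-learn alone. The sketch below does not use the PCR class; it chains the same kinds of steps with `make_pipeline` on synthetic data, so the data, the choice of three components, and the variable names are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.1, size=100)

# Scale -> PCA -> OLS: the same three stages PCR is described as chaining.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pipeline.fit(X, y)

predictions = pipeline.predict(X)
r2 = pipeline.score(X, y)
print(predictions.shape, round(r2, 3))
```

Because only three of five components are kept, some signal is discarded and the fit is typically worse than a full OLS on X; that trade-off is the point of principal components regression.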

Parameters:

n_components (int, float, None, str) –
Number of components to keep when performing PCA. If n_components is not set, all components are kept:
```
n_components == min(n_samples, n_features)
```
If n_components == ‘mle’, Minka’s MLE is used to guess the dimension. If 0 < n_components < 1, the number of components is chosen so that the explained variance exceeds the fraction specified by n_components.
regression_type (str) – The type of regression model to use. Must be one of ‘ols’, ‘lasso’, ‘ridge’, or ‘elasticnet’.
n_jobs (int (optional)) – The number of jobs to use for the computation. If n_jobs=-1, all CPUs are used. This will only increase speed of computation for n_targets > 1 and sufficiently large problems.
alpha (float (optional)) – Used when regression_type is ‘lasso’, ‘ridge’, or ‘elasticnet’. Represents the constant that multiplies the penalty terms. Setting alpha=0 is equivalent to ordinary least squares, and in that case it is advised to use regression_type='ols' instead. See the scikit-learn documentation for the chosen regression model for more information on this parameter.
l1_ratio (float (optional)) – Used when regression_type is ‘elasticnet’. The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.

scaler

sklearn.preprocessing.StandardScaler, None

The StandardScaler object used to center the X data and scale to unit variance. Must have fit() and transform() methods. Can be overridden prior to fitting to use a different scaler:

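```
from regressors.regressors import PCR
from sklearn.preprocessing import StandardScaler

# X and y are assumed to be existing training arrays.
pcr = PCR()
# Change StandardScaler options before fitting
pcr.scaler = StandardScaler(with_mean=False, with_std=True)
pcr.fit(X, y)
```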

The scaler can also be removed prior to fitting (to not scale X during fitting or predictions) with pcr.scaler = None.

prcomp

sklearn.decomposition.PCA

The PCA object used to perform PCA. It can be accessed and replaced in the same way as the scaler.

regression

sklearn.linear_model

The linear model object used to perform regression. Must have fit() and predict() methods. This defaults to OLS using scikit-learn’s LinearRegression estimator, but can be overridden either with the regression_type parameter when instantiating the class, or by replacing the regression model with a different one prior to fitting:

```
from regressors.regressors import PCR
from sklearn import linear_model

# X and y are assumed to be existing training arrays.
pcr = PCR(regression_type='ols')

# Examine the current regression model
print(pcr.regression)
# Prints (exact repr depends on the scikit-learn version):
# LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1,
#     normalize=False)

# Use Lasso regression with cross-validation instead of OLS
pcr.regression = linear_model.LassoCV(n_alphas=200)
print(pcr.regression)
# Prints (exact repr depends on the scikit-learn version):
# LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001,
#     fit_intercept=True, max_iter=1000, n_alphas=200, n_jobs=1,
#     normalize=False, positive=False, precompute='auto',
#     random_state=None, selection='cyclic', tol=0.0001,
#     verbose=False)

pcr.fit(X, y)
```

beta_coef_

Returns:	Beta coefficients, corresponding to coefficients in the original space and dimension of X. These are calculated as $B = A \cdot P$, where $A$ is a vector of the coefficients obtained from regression on the principal components and $P$ is the matrix of loadings from PCA.
Return type:	numpy.ndarray
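
The product $B = A \cdot P$ can be illustrated with plain lists. The sketch assumes $A$ has one entry per principal component and $P$ has one row per component and one column per original feature (the layout of scikit-learn's components_); it ignores any rescaling tied to standardization, and `back_project_sketch` is a hypothetical name.

```python
def back_project_sketch(a, p):
    """Map component-space coefficients a (length k) through loadings p (k x m)."""
    if len(a) != len(p):
        raise ValueError("a needs one coefficient per row of p")
    n_features = len(p[0])
    # B_j = sum_i A_i * P_ij
    return [sum(a_i * row[j] for a_i, row in zip(a, p)) for j in range(n_features)]


a = [2.0, -1.0]            # coefficients on two principal components
p = [[1.0, 0.0, 0.0],      # loadings: 2 components x 3 original features
     [0.0, 1.0, 0.0]]
print(back_project_sketch(a, p))  # [2.0, -1.0, 0.0]
```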

fit(X, y)

Fit the PCR model.

Parameters:

X (numpy.ndarray) – Training data.
y (numpy.ndarray) – Target values.
Returns:	An instance of self.
Return type:	regressors.regressors.PCR

intercept_

Returns:	The intercept for the regression model, both in PCA-space and in the original X-space.
Return type:	float

predict(X)

Predict using the PCR model.

Parameters:

X (numpy.ndarray) – Samples to predict values from.
Returns:	Predicted values.
Return type:	numpy.ndarray

score(X, y)

Return the coefficient of determination, $R^2$, of the predictions.

Parameters:

X (numpy.ndarray) – Training or test samples.
y (numpy.ndarray) – Target values.
Returns:	The $R^2$ value of the predictions.
Return type:	float
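
For reference, the coefficient of determination is conventionally $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below computes it from observed and predicted values; it is an illustration of the standard definition rather than the method's source, and `r2_sketch` is a hypothetical name.

```python
def r2_sketch(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and predictions y_hat."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y is constant")
    return 1.0 - ss_res / ss_tot


print(r2_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0, a perfect fit
print(r2_sketch([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0, no better than the mean
```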