catlearn.regression

catlearn.regression.cost_function

Functions to calculate the cost statistics.

catlearn.regression.cost_function.get_error(prediction, target, metrics=None, epsilon=None, return_percentiles=True)

Return error for predicted data.

Discussed in: Rosasco et al, Neural Computation, (2004), 16, 1063-1076.

Parameters:
  • prediction (list) – A list of predicted values.
  • target (list) – A list of target values.
  • metrics (list) – List of additional cost functions to return. Currently supported: ‘log’ and ‘insensitive’.
  • epsilon (float) – Insensitivity value, i.e. the threshold for the insensitive error calculation.
  • return_percentiles (boolean) – Return some percentile statistics with the predictions.
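
The kind of statistics involved can be illustrated with a short NumPy sketch. It is not CatLearn's implementation: the function name `error_sketch` and the returned keys are assumptions made for illustration, and only the root-mean-square, mean absolute and epsilon-insensitive errors are shown.

```python
import numpy as np


def error_sketch(prediction, target, epsilon=None):
    """Illustrative error metrics (hypothetical helper, not CatLearn code)."""
    prediction = np.asarray(prediction, dtype=float)
    target = np.asarray(target, dtype=float)
    residual = prediction - target
    result = {
        "rmse": float(np.sqrt(np.mean(residual ** 2))),
        "mae": float(np.mean(np.abs(residual))),
    }
    if epsilon is not None:
        # Epsilon-insensitive loss: deviations smaller than epsilon cost nothing.
        insensitive = np.maximum(0.0, np.abs(residual) - epsilon)
        result["insensitive"] = float(np.mean(insensitive))
    return result


errors = error_sketch([1.0, 2.0, 4.0], [1.0, 2.5, 3.0], epsilon=0.6)
print(errors)
```

With `epsilon=0.6` only the last residual (1.0) exceeds the threshold, so it alone contributes to the insensitive error.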

catlearn.regression.gaussian_process

Functions to make predictions with Gaussian process machine learning.

class catlearn.regression.gaussian_process.GaussianProcess(train_fp, train_target, kernel_list, gradients=None, regularization=None, regularization_bounds=None, optimize_hyperparameters=False, scale_optimizer=False, scale_data=False)

Bases: object

Gaussian process functions for machine learning.

optimize_hyperparameters(global_opt=False, algomin='L-BFGS-B', eval_jac=False, loss_function='lml')

Optimize hyperparameters of the Gaussian Process.

This function assumes that the descriptors in the feature set remain the same. Optimization is performed with respect to the log marginal likelihood. Optimized hyperparameters are saved in the kernel dictionary. Finally, the covariance matrix is updated.

Parameters:
  • global_opt (boolean) – Flag whether to do basin hopping optimization of hyperparameters. Default is False.
  • algomin (str) – Define scipy minimizer method to call. Default is L-BFGS-B.
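
As a rough illustration of optimizing hyperparameters against the log marginal likelihood, the sketch below minimizes the negative log marginal likelihood of a small one-dimensional model with `scipy.optimize.minimize` and the L-BFGS-B method. The squared-exponential kernel, the log-scale parameterization, the bounds and all function names are assumptions of this sketch rather than CatLearn internals.

```python
import numpy as np
from scipy.optimize import minimize


def sq_exp_kernel(a, b, lengthscale, scaling):
    # Squared-exponential kernel for 1D inputs (assumed kernel choice).
    d2 = (a[:, None] - b[None, :]) ** 2
    return scaling * np.exp(-0.5 * d2 / lengthscale ** 2)


def negative_lml(log_theta, x, y):
    # Hyperparameters are optimized on a log scale to keep them positive.
    lengthscale, scaling, noise = np.exp(log_theta)
    cov = sq_exp_kernel(x, x, lengthscale, scaling)
    cov = cov + (noise + 1e-8) * np.eye(len(x))
    chol = np.linalg.cholesky(cov)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(chol)))
            + 0.5 * len(x) * np.log(2.0 * np.pi))


x = np.linspace(0.0, 5.0, 20)
y = np.sin(x)
start = np.log([1.0, 1.0, 0.1])
bounds = [(-5.0, 5.0)] * 3  # bounds on the log-scale hyperparameters

result = minimize(negative_lml, start, args=(x, y),
                  method="L-BFGS-B", bounds=bounds)
lengthscale, scaling, noise = np.exp(result.x)
print(lengthscale, scaling, noise, result.fun)
```

No analytical gradient is passed here, so the minimizer falls back on finite differences.
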
predict(test_fp, test_target=None, uncertainty=False, basis=None, get_validation_error=False, get_training_error=False, epsilon=None)

Perform the prediction on some training and test data.

Parameters:
  • test_fp (list) – A list of testing fingerprint vectors.
  • test_target (list) – A list of the test targets used to generate the prediction errors.
  • uncertainty (boolean) – Return data on the predicted uncertainty if True. Default is False.
  • basis (function) – Basis functions to assess the reliability of the uncertainty predictions. Must be a callable function that takes a list of descriptors and returns another list.
  • get_validation_error (boolean) – Return the error associated with the prediction on the test set of data if True. Default is False.
  • get_training_error (boolean) – Return the error associated with the prediction on the training set of data if True. Default is False.
  • epsilon (float) – Threshold for insensitive error calculation.
Returns:

data – Gaussian process predictions and meta data:

prediction : vector

Predicted mean.

uncertainty : vector

Predicted standard deviation of the Gaussian posterior.

training_error : dictionary

Error metrics on training targets.

validation_error : dictionary

Error metrics on test targets.

Return type:

dictionary
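
To make the `prediction` and `uncertainty` entries concrete, here is a minimal NumPy sketch of the standard Gaussian process posterior mean and standard deviation. It assumes a unit-scale squared-exponential kernel and a fixed regularization; the function names are hypothetical and the code is not taken from CatLearn.

```python
import numpy as np


def sq_exp_kernel(a, b, lengthscale=1.0):
    # Pairwise squared distances between rows of a and rows of b.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)


def gp_predict_sketch(train_fp, train_target, test_fp, regularization=1e-3):
    train_fp = np.asarray(train_fp, dtype=float)
    test_fp = np.asarray(test_fp, dtype=float)
    train_target = np.asarray(train_target, dtype=float)

    cov = sq_exp_kernel(train_fp, train_fp)
    cov = cov + regularization * np.eye(len(train_fp))
    k_star = sq_exp_kernel(test_fp, train_fp)

    # Posterior mean: k*^T C^-1 y
    mean = k_star @ np.linalg.solve(cov, train_target)
    # Posterior variance: k(x*, x*) - k*^T C^-1 k*, with k(x*, x*) = 1 here.
    v = np.linalg.solve(cov, k_star.T)
    var = 1.0 - np.sum(k_star * v.T, axis=1)
    return {"prediction": mean,
            "uncertainty": np.sqrt(np.clip(var, 0.0, None))}


train_fp = [[0.0], [1.0], [2.0]]
train_target = [0.0, 1.0, 0.0]
data = gp_predict_sketch(train_fp, train_target, [[1.0], [10.0]])
print(data["prediction"], data["uncertainty"])
```

The first test point coincides with a training point, so its uncertainty is small; the second lies far from all training data, so its uncertainty returns to the prior standard deviation.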

predict_uncertainty(test_fp)

Return uncertainty only.

Parameters:
  • test_fp (list) – A list of testing fingerprint vectors.
update_data(train_fp, train_target=None, gradients=None, scale_optimizer=False)

Update the training matrix, targets and covariance matrix.

This function assumes that the descriptors in the feature set remain the same and that only the number of data points is changing. For this reason the hyperparameters are not updated, so this update process should be fast.

Parameters:
  • train_fp (list) – A list of training fingerprint vectors.
  • train_target (list) – A list of training targets used to generate the predictions.
  • scale_optimizer (boolean) – Flag to define if the hyperparameters are log scale for optimization.
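
The idea of a data-only update can be sketched with a toy class: the hyperparameters are left untouched and only the quantities that depend on the data are rebuilt. `ToyGP` and its attributes are hypothetical names for illustration and do not reflect CatLearn's internals.

```python
import numpy as np


class ToyGP:
    """Toy model that mirrors the idea of a data-only update."""

    def __init__(self, train_fp, train_target,
                 lengthscale=1.0, regularization=1e-3):
        self.lengthscale = lengthscale
        self.regularization = regularization
        self.update_data(train_fp, train_target)

    def _kernel(self, a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * d2 / self.lengthscale ** 2)

    def update_data(self, train_fp, train_target):
        # Hyperparameters stay fixed; only the training matrix, targets
        # and covariance matrix are replaced.
        self.train_fp = np.asarray(train_fp, dtype=float)
        self.train_target = np.asarray(train_target, dtype=float)
        cov = self._kernel(self.train_fp, self.train_fp)
        self.cov = cov + self.regularization * np.eye(len(self.train_fp))


gp = ToyGP([[0.0], [1.0]], [0.0, 1.0], lengthscale=0.7)
gp.update_data([[0.0], [1.0], [2.0]], [0.0, 1.0, 0.5])
print(gp.cov.shape, gp.lengthscale)
```
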
update_gp(train_fp=None, train_target=None, kernel_list=None, scale_optimizer=False, gradients=None, regularization_bounds=(1e-06, None), optimize_hyperparameters=False)

Potentially optimize the full Gaussian Process again.

This allows for the definition of a new kernel as a result of changing descriptors in the feature space. Other parts of the model can also be changed. The hyperparameters will always be reoptimized.

Parameters:
  • train_fp (list) – A list of training fingerprint vectors.
  • train_target (list) – A list of training targets used to generate the predictions.
  • kernel_list (dict) – A dict that can contain several nested dictionaries, one per kernel. Each kernel dict holds information on that kernel, such as the ‘type’ key with the name of the kernel function and the hyperparameters, e.g. ‘scaling’, ‘lengthscale’, etc.
  • scale_optimizer (boolean) – Flag to define if the hyperparameters are log scale for optimization.
  • regularization_bounds (tuple) – Optional to change the bounds for the regularization.
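
The nested layout described for `kernel_list` might look like the following. The outer keys (`k1`, `k2`), the kernel type names and the numeric values are placeholders chosen for this example; only the ‘type’, ‘scaling’ and ‘lengthscale’ keys come from the description above.

```python
# Hypothetical kernel description: one nested dict per kernel.
kernel_list = {
    "k1": {"type": "gaussian", "scaling": 1.0, "lengthscale": 0.5},
    "k2": {"type": "linear", "scaling": 0.1},
}

for name, kernel in kernel_list.items():
    # Everything except the 'type' entry is treated as a hyperparameter.
    hyperparameters = {k: v for k, v in kernel.items() if k != "type"}
    print(name, kernel["type"], hyperparameters)
```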

catlearn.regression.ridge_regression

Modified ridge regression function from Keld Lundgaard.

class catlearn.regression.ridge_regression.RidgeRegression(W2=None, Vh=None, cv='loocv', Ns=100, wsteps=15, rsteps=3)

Bases: object

Ridge regression class to find an optimal model.

Regularization fitting can be performed with either the loocv or the bootstrap.632 method. The loocv method is faster, but it is better to use bootstrap when the training data is highly correlated.

RR(X, Y, omega2, p=0.0, featselect_featvar=False)

Ridge Regression (RR) solver.

The cost is (Xa-y)**2 + omega2*(a-p)**2. The solver uses the SVD of X.T X, where T denotes the transpose: V, W2, Vh = SVD(X.T*X).

Parameters:
  • X (array) – Feature matrix for the training data.
  • Y (list) – Target data for the training sample.
  • p (float) – Define the prior function.
  • omega2 (float) – Regularization strength.
Returns:

  • coefs (list) – Optimal coefficients.
  • neff (float) – Number of effective parameters.
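
The cost above has a closed-form minimizer, a = (X.T X + omega2*I)^-1 (X.T y + omega2*p), which can be evaluated through the SVD of X.T X. The following NumPy sketch shows one way to do this under the assumption that X.T X has full rank; `ridge_svd_sketch` is a hypothetical name and the code is not CatLearn's.

```python
import numpy as np


def ridge_svd_sketch(X, Y, omega2, p=0.0):
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # X.T @ X is symmetric, so its SVD is V @ diag(W2) @ Vh with Vh = V.T.
    V, W2, Vh = np.linalg.svd(X.T @ X)
    # Right-hand side of the normal equations, including the prior p.
    rhs = X.T @ Y + omega2 * p
    coefs = V @ ((Vh @ rhs) / (W2 + omega2))
    # Effective number of parameters shrinks as omega2 grows.
    neff = float(np.sum(W2 / (W2 + omega2)))
    return coefs, neff


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = rng.normal(size=30)
coefs, neff = ridge_svd_sketch(X, Y, omega2=0.5, p=0.1)
print(coefs, neff)
```

Because the decomposition does not depend on omega2, the same V, W2 and Vh can be reused when scanning many regularization strengths.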

bootstrap_calc(X, Y, p, omega2, samples, W2_samples, Vh_samples)

Calculate optimal omega2 from bootstrap.

Parameters:
  • X (array) – Feature matrix for the training data.
  • Y (list) – Target data for the training sample.
  • p (float) – Define the prior function.
  • omega2 (float) – Regularization strength.
  • samples (list) – Sample index for bootstrap.
  • W2_samples (array) – Singular values for samples.
  • Vh_samples (array) – Right hand side of singular matrix for samples.
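
For orientation, a simplified version of the .632 bootstrap error estimate is sketched below: the training error and the average out-of-bag error are blended with weights 0.368 and 0.632, and the regularization with the lowest estimate is picked from a grid. The helper names, the per-sample averaging of out-of-bag errors and the grid are assumptions of this sketch, not a description of how `bootstrap_calc` is implemented.

```python
import numpy as np


def ridge_fit(X, Y, omega2):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + omega2 * np.eye(d), X.T @ Y)


def bootstrap_632_error(X, Y, omega2, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(Y)
    # Training error: fit and evaluate on the full data set.
    coefs = ridge_fit(X, Y, omega2)
    err_train = np.mean((X @ coefs - Y) ** 2)
    oob_errors = []
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)           # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)      # points left out
        if oob.size == 0:
            continue
        coefs = ridge_fit(X[idx], Y[idx], omega2)
        oob_errors.append(np.mean((X[oob] @ coefs - Y[oob]) ** 2))
    err_oob = np.mean(oob_errors)
    return float(0.368 * err_train + 0.632 * err_oob)


rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
omega2_grid = np.logspace(-4, 2, 7)
errors = [bootstrap_632_error(X, Y, w) for w in omega2_grid]
best_omega2 = float(omega2_grid[int(np.argmin(errors))])
print(best_omega2)
```
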
find_optimal_regularization(X, Y, p=0.0)

Find the regularization value that minimizes the Expected Prediction Error.

Parameters:
  • X (array) – Feature matrix for the training data.
  • Y (list) – Target data for the training sample.
  • p (float) – Define the prior function. Default is zero.
Returns:

omega2_min – Regularization corresponding to the minimum EPE.

Return type:

float
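
A leave-one-out search of this kind can be sketched as follows, assuming a zero prior (p = 0) and no intercept. For ridge regression the leave-one-out residuals follow in closed form from the hat matrix, so no refitting is needed. The function names and the logarithmic grid are assumptions made for this example.

```python
import numpy as np


def loocv_error(X, Y, omega2):
    d = X.shape[1]
    inv = np.linalg.inv(X.T @ X + omega2 * np.eye(d))
    hat = X @ inv @ X.T
    residual = Y - hat @ Y
    # Closed-form leave-one-out residuals for ridge regression.
    loo_residual = residual / (1.0 - np.diag(hat))
    return float(np.mean(loo_residual ** 2))


def find_omega2_sketch(X, Y, grid):
    errors = [loocv_error(X, Y, w) for w in grid]
    return float(grid[int(np.argmin(errors))])


rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))
Y = X @ np.array([2.0, 0.0, -1.0]) + 0.2 * rng.normal(size=25)
grid = np.logspace(-4, 3, 15)
omega2_min = find_omega2_sketch(X, Y, grid)
print(omega2_min)
```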

get_coefficients(train_targets, train_features, reg=None, p=0.0)

Generate the omega2 and coef values.

Parameters:
  • train_targets (array) – Dependent data used for training.
  • train_features (array) – Independent data used for training.
  • reg (float) – Precomputed optimal regularization.
  • p (float) – Define the prior function. Default is zero.
predict(train_matrix, train_targets, test_matrix, test_targets=None, coefficients=None, reg=None, p=0.0)

Function to do ridge regression predictions.

regularization(train_targets, train_features, coef=None, featselect_featvar=False)

Generate the omega2 and coef values.

Parameters:
  • train_targets (array) – Dependent data used for training.
  • train_features (array) – Independent data used for training.
  • coef (int) – List of indices in the feature database.

catlearn.regression.scikit_wrapper

Regression models to assess features using scikit-learn framework.

class catlearn.regression.scikit_wrapper.RegressionFit(train_matrix, train_target, test_matrix=None, test_target=None, method='ridge', predict=False)

Bases: object

Class to perform a fit to specified regression model.

feature_select(size=None, iterations=100000.0, steps=None, line_search=False, min_alpha=1e-08, max_alpha=0.1, eps=0.001)

Find the indices of important features.

Parameters:
  • size (int) – Number of best features to return.
  • iterations (float) – Maximum number of iterations taken minimizing the regression function. Implemented in elastic net and lasso.
  • steps (int) – Number of steps to be taken in the penalty function of LASSO.
  • min_alpha (float) – Starting penalty when searching over range. Default is 1.e-8.
  • max_alpha (float) – Final penalty when searching over range. Default is 1.e-1.
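
The general idea of selecting features by sweeping a LASSO penalty can be sketched without scikit-learn. Below, a small coordinate-descent LASSO written in NumPy stands in for the scikit-learn estimators, and the penalty is walked from `min_alpha` to `max_alpha` until at most `size` coefficients remain non-zero. The stopping rule, the geometric penalty grid and the function names are assumptions of this sketch and may differ from what `feature_select` does.

```python
import numpy as np


def lasso_cd(X, y, alpha, iterations=200):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0) / n
    for _ in range(iterations):
        for j in range(d):
            # Residual with the contribution of feature j removed.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            # Soft-thresholding sets weak coefficients exactly to zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return w


def feature_select_sketch(X, y, size, min_alpha=1e-8, max_alpha=0.1, steps=10):
    # Increase the penalty until no more than `size` features stay active.
    for alpha in np.geomspace(min_alpha, max_alpha, steps):
        w = lasso_cd(X, y, alpha)
        if np.flatnonzero(w).size <= size:
            break
    # Rank features by coefficient magnitude at the final penalty.
    return np.argsort(-np.abs(w))[:size]


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=100)
selected = feature_select_sketch(X, y, size=2)
print(selected)
```

Only features 0 and 3 enter the synthetic target, so they carry by far the largest coefficients at every penalty in the sweep.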