catlearn.preprocess

catlearn.preprocess.clean_data

Functions to clean data.

catlearn.preprocess.clean_data.clean_infinite(train, test=None, targets=None, labels=None, mask=None, max_impute_fraction=0, strategy='mean')

Remove features that have non-finite values in the training data.

Optionally removes features in test data with non-finite values. Returns a dictionary with the clean ‘train’ and ‘test’ data and the ‘index’ of features that were removed from the original data.

Parameters:
  • train (array) – Feature matrix for the training data.
  • test (array) – Optional feature matrix for the test data. Default is None passed.
  • targets (array) – An array of training targets.
  • labels (array) – Optional list of feature labels. Default is None passed.
  • mask (list) – Indices of features that are not subject to cleaning.
  • max_impute_fraction (float) – Maximum fraction of values in a column that can be imputed. Columns with a higher fraction of NaN values will be discarded.
  • strategy (str) – Imputation strategy.
Returns:

data

key value pairs

  • ’train’ : array
    Clean training data matrix.
  • ’test’ : array
    Clean test data matrix
  • ’targets’ : list
    Boolean list on whether targets are finite.
  • ’labels’ : list
    Feature labels of clean data set.

Return type:

dict
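
The column-dropping behaviour described above can be sketched in plain NumPy. This is only an illustration, not CatLearn's implementation: the helper name drop_nonfinite_columns is hypothetical, and imputation (max_impute_fraction, strategy), targets, labels and mask are all ignored.

```python
import numpy as np


def drop_nonfinite_columns(train, test=None):
    """Drop every feature column that has a non-finite training value.

    Hypothetical helper for illustration only; nothing is imputed.
    """
    train = np.asarray(train, dtype=float)
    # A column is kept only if all of its training values are finite.
    keep = np.isfinite(train).all(axis=0)
    removed = np.where(~keep)[0]
    clean_test = None
    if test is not None:
        # The same columns are dropped from the test matrix.
        clean_test = np.asarray(test, dtype=float)[:, keep]
    return {'train': train[:, keep], 'test': clean_test, 'index': removed}


train = np.array([[1.0, np.nan, 3.0],
                  [4.0, 5.0, np.inf],
                  [7.0, 8.0, 9.0]])
test = np.array([[1.0, 2.0, 3.0]])
cleaned = drop_nonfinite_columns(train, test)
```

Here only the first column survives, because the second contains a NaN and the third an infinity.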

catlearn.preprocess.clean_data.clean_skewness(train, test=None, labels=None, mask=None, skewness=3.0)

Discards features that are excessively skewed.

Parameters:
  • train (array) – Feature matrix for the training data.
  • test (array) – Optional feature matrix for the test data. Default is None passed.
  • labels (array) – Optional list of feature labels. Default is None passed.
  • mask (list) – Indices of features that are not subject to cleaning.
  • skewness (float) – Maximum allowed skewness threshold.
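
As a rough sketch of what a skewness filter does, the snippet below computes the moment-based sample skewness of each column and keeps the columns that stay within the threshold. It assumes the absolute skewness is compared against the threshold; the function names are hypothetical and the library may compute skewness differently.

```python
import numpy as np


def sample_skewness(matrix):
    """Column-wise skewness m3 / m2**1.5 from central moments."""
    matrix = np.asarray(matrix, dtype=float)
    centered = matrix - matrix.mean(axis=0)
    m2 = (centered ** 2).mean(axis=0)
    m3 = (centered ** 3).mean(axis=0)
    return m3 / m2 ** 1.5


def drop_skewed_columns(train, threshold=3.0):
    """Keep columns whose absolute skewness does not exceed threshold."""
    train = np.asarray(train, dtype=float)
    keep = np.abs(sample_skewness(train)) <= threshold
    return train[:, keep], np.where(~keep)[0]


# Column 0 is symmetric; column 1 has a single extreme value.
train = np.column_stack([np.arange(20.0), np.r_[np.zeros(19), 100.0]])
reduced, removed = drop_skewed_columns(train, threshold=3.0)
```

Note that a zero-variance column would give an undefined skewness in this sketch, so such columns should be removed first.
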
catlearn.preprocess.clean_data.clean_variance(train, test=None, labels=None, mask=None)

Remove features that contribute nothing to the model.

Removes a feature if it has zero variance in the training data. Such a feature takes the same value for every training example, so the model won’t learn anything new from it.

Parameters:
  • train (array) – Feature matrix for the training data.
  • test (array) – Optional feature matrix for the test data. Default is None passed.
  • labels (array) – Optional list of feature labels. Default is None passed.
  • mask (list) – Indices of features that are not subject to cleaning.
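
A minimal NumPy sketch of the zero-variance test is shown below. It is not the library's code; it ignores labels and mask and only illustrates the selection rule.

```python
import numpy as np

train = np.array([[1.0, 5.0, 0.2],
                  [2.0, 5.0, 0.4],
                  [3.0, 5.0, 0.9]])
# Columns whose training values never change carry no information.
keep = np.var(train, axis=0) > 0.0
reduced = train[:, keep]
```

The same boolean index would be applied to the test matrix so that both matrices keep identical columns.
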
catlearn.preprocess.clean_data.remove_outliers(features, targets, con=1.4826, dev=3.0, constraint=None)

Preprocessing routine to remove outliers by median absolute deviation.

This will take the training feature and target arrays, calculate any outliers, then return the reduced arrays. It is possible to set a constraint key (‘high’, ‘low’, None) in order to allow for outliers that are e.g. very low in energy, as this may be the desired outcome of the study.

Parameters:
  • features (array) – Feature matrix for training data.
  • targets (list) – List of target values for the training data.
  • con (float) – Constant scale factor dependent on the distribution. Default is 1.4826 expecting the data is normally distributed.
  • dev (float) – The number of deviations from the median to account for.
  • constraint (str) – Can be set to ‘low’ to remove candidates with targets that are too small/negative or ‘high’ for outliers that are too large/positive. Default is to remove all.
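
The median absolute deviation (MAD) rule described above can be sketched as follows. The helper mad_outlier_mask is hypothetical and returns a keep-mask rather than reduced arrays; the exact boundary handling in the library may differ.

```python
import numpy as np


def mad_outlier_mask(targets, con=1.4826, dev=3.0, constraint=None):
    """Return a boolean mask that is True for targets to keep."""
    targets = np.asarray(targets, dtype=float)
    median = np.median(targets)
    # Scaled MAD; con=1.4826 makes it comparable to a standard
    # deviation when the data are normally distributed.
    mad = con * np.median(np.abs(targets - median))
    upper = median + dev * mad
    lower = median - dev * mad
    if constraint == 'high':
        # Only remove targets that are too large.
        return targets <= upper
    if constraint == 'low':
        # Only remove targets that are too small.
        return targets >= lower
    return (targets >= lower) & (targets <= upper)


targets = np.array([1.0, 1.1, 0.9, 1.2, 0.8, 10.0, -9.0])
features = np.arange(14.0).reshape(7, 2)
mask = mad_outlier_mask(targets)
reduced_features, reduced_targets = features[mask], targets[mask]
```

With constraint='high' only the value 10.0 is dropped, and with constraint='low' only -9.0 is dropped.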

catlearn.preprocess.feature_elimination

Functions to select features for the fingerprint vectors.

class catlearn.preprocess.feature_elimination.FeatureScreening(correlation='pearson', iterative=True, regression='ridge', random_check=False)

Bases: object

Class for feature elimination based on correlation screening.

eliminate_features(target, train_features, test_features, size=None, step=None, order=None)

Function to eliminate features from training/test data.

Parameters:
  • target (list) – The target values for the training data.
  • train_features (array) – Array of training data to eliminate features from.
  • test_features (array) – Array of test data to eliminate features from.
  • size (int) – Number of features after elimination.
  • step (int) – Number of features to eliminate at each step.
  • order (list) – Precomputed ordered indices for features.
Returns:

  • reduced_train (array) – Reduced training feature matrix, now n x size shape.
  • reduced_test (array) – Reduced test feature matrix, now m x size shape.

iterative_screen(target, feature_matrix, size=None, step=None)

Function to iteratively screen features.

Parameters:
  • target (list) – The target values for the training data.
  • feature_matrix (array) – The feature matrix for the training data.
  • size (int) – Number of features to be returned. Default is number of data.
  • step (int) – Step size by which to reduce the number of features. Default is n / log(n).
Returns:

  • index (list) – The ordered list of feature indices, top index[:size] will be indices for best features.
  • size (int) – Number of accepted features.

screen(target, feature_matrix)

Feature selection based on SIS.

Further discussion on this topic can be found in Fan, J., Lv, J., J. R. Stat. Soc.: Series B, 2008, 70, 849.

Parameters:
  • target (list) – The target values for the training data.
  • feature_matrix (array) – The feature matrix for the training data.
Returns:

  • index (list) – The ordered list of feature indices.
  • correlation (list) – The ordered list of correlations between features and targets.
  • size (int) – Number of accepted features following screening.
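
The core idea of correlation screening is to rank features by the magnitude of their marginal correlation with the target. The sketch below uses Pearson correlation and a hypothetical helper name; it does not reproduce how the class chooses the number of accepted features.

```python
import numpy as np


def correlation_ranking(target, feature_matrix):
    """Order feature indices by decreasing |Pearson correlation|."""
    X = np.asarray(feature_matrix, dtype=float)
    y = np.asarray(target, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of every column with the target.
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    order = np.argsort(-np.abs(corr))
    return order, corr[order]


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
# Only column 2 actually drives the target.
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=50)
order, corr = correlation_ranking(y, X)
```

The informative column is ranked first, and truncating order to the first few entries gives the screened feature set.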

catlearn.preprocess.feature_engineering

Functions for feature engineering.

catlearn.preprocess.feature_engineering.generate_features(p, max_num=2, max_den=1, log=False, sqrt=False, exclude=False, s=False)

Generate composite features from a combination of input features.

Developer note: This currently scales quite slowly with max_den. There’s surely a better way to do this, but it’s apparently currently functional.

Parameters:
  • p (list) – User-provided list of physical features to be combined.
  • max_num (integer) – The maximum order of the polynomial in the numerator of the composite features. Must be non-negative.
  • max_den (integer) – The maximum order of the polynomial in the denominator of the composite features. Must be non-negative.
  • log (boolean (not currently supported)) – Set to True to include terms involving the logarithm of the input features. Default is False.
  • sqrt (boolean (not currently supported)) – Set to True to include terms involving the square root of the input features. Default is False.
  • exclude (bool) – Set exclude=True to avoid returning 1 to represent the zeroth power. Default is False.
  • s (bool) – Set True to return a list of strings and False to evaluate each element in the list. Default is False.
Returns:

features – A list of combinations of the input features to meet the required specifications.

Return type:

list

catlearn.preprocess.feature_engineering.generate_positive_features(p, N, exclude=False, s=False)

Generate list of polynomial combinations in list p up to order N.

Example: p = (a,b,c) ; N = 3

returns (order not preserved) [a*a*a, a*a*b, a*a*c, a*b*b, a*b*c, a*c*c, b*b*b, b*b*c, b*c*c, c*c*c, a*a, a*b, a*c, b*b, b*c, c*c, a, b, c]

Parameters:
  • p (list) – Features to be combined.
  • N (integer) – The maximum polynomial order for combinations. Must be non-negative.
  • exclude (bool) – Set True to avoid returning 1 to represent the zeroth power. Default is False.
  • s (bool) – Set True to return a list of strings and False to evaluate each element in the list. Default is False.
Returns:

all_powers – A list of combinations of the input features to meet the required specifications.

Return type:

list
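
The example above can be reproduced in string form with the standard library. This is a sketch with a hypothetical helper name, polynomial_terms; it omits the zeroth-power term and is not the library's implementation.

```python
from itertools import combinations_with_replacement


def polynomial_terms(p, N):
    """List products of the entries of p for every order from N down to 1."""
    terms = []
    for order in range(N, 0, -1):
        # Unordered selections with repetition give each product once.
        for combo in combinations_with_replacement(p, order):
            terms.append('*'.join(combo))
    return terms


terms = polynomial_terms(['a', 'b', 'c'], 3)
```

For three features and N = 3 this yields 10 cubic, 6 quadratic and 3 linear terms.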

catlearn.preprocess.feature_engineering.get_ablog(A, a, b)

Get all combinations x_ij = a*log(x_i) + b*log(x_j).

The sorting order in dimension 0 is preserved.

Parameters:
  • A (array) – An n x m matrix, where n is the number of training examples and m is the number of features.
  • a (float) –
  • b (float) –
Returns:

new_features – The n x triangular(m) matrix of new features.

Return type:

array

catlearn.preprocess.feature_engineering.get_div_order_2(A)

Get all combinations x_ij = x_i / x_j, where x_i,j are features.

The sorting order in dimension 0 is preserved. If a denominator is 0, Inf is returned.

Parameters:
  • A (array) – n x m matrix, where n is the number of training examples and m is the number of features.
Returns:

new_features – The n x m**2 matrix of new features.

Return type:

array

catlearn.preprocess.feature_engineering.get_labels_ablog(l, a, b)

Get all combinations ij, where i,j are feature labels.

Parameters:
  • a (float) –
  • b (float) –
Returns:

new_features – List of new feature names.

Return type:

list

catlearn.preprocess.feature_engineering.get_labels_order_2(l, div=False)

Get all combinations ij, where i,j are feature labels.

Parameters:
  • l (list) – Length m vector, where m is the number of features.
Returns:

new_features – List of new feature names.

Return type:

list

catlearn.preprocess.feature_engineering.get_labels_order_2ab(l, a, b)

Get all combinations ij, where i,j are feature labels.

Parameters:
  • l (list) – Length m vector, where m is the number of features.
Returns:

new_features – List of new feature names.

Return type:

list

catlearn.preprocess.feature_engineering.get_order_2(A)

Get all combinations x_ij = x_i * x_j, where x_i,j are features.

The sorting order in dimension 0 is preserved.

Parameters:
  • A (array) – n x m matrix, where n is the number of training examples and m is the number of features.
Returns:

new_features – The n x triangular(m) matrix of new features.

Return type:

array
catlearn.preprocess.feature_engineering.get_order_2ab(A, a, b)

Get all combinations x_ij = x_i**a * x_j**b, where x_i,j are features.

The sorting order in dimension 0 is preserved.

Parameters:
  • A (array) – n x m matrix, where n is the number of training examples and m is the number of features.
  • a (float) –
  • b (float) –
Returns:

new_features – The n x triangular(m) matrix of new features.

Return type:

array

catlearn.preprocess.feature_engineering.single_transform(A)

Perform single variable transform x^2, x^0.5 and log(x).

Parameters:
  • A (array) – n x m matrix, where n is the number of training examples and m is the number of features.
Returns:

new_features – The n x m*3 matrix of new features.

Return type:

array
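
The three transforms can be sketched with NumPy as below. The column order (squares, then square roots, then logarithms) is an assumption, and the square root and logarithm are only well defined for positive feature values.

```python
import numpy as np

A = np.array([[1.0, 4.0],
              [9.0, 16.0]])
# Assumed column order: squares, then square roots, then logarithms.
new_features = np.hstack([A ** 2, np.sqrt(A), np.log(A)])
```
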

catlearn.preprocess.feature_extraction

Some feature extraction routines.

catlearn.preprocess.feature_extraction.catlearn_pca(components, train_features, test_features=None, cleanup=False, scale=False)

Principal component analysis variant that doesn’t require scikit-learn.

Parameters:
  • components (int) – Number of principal components to transform the feature set by.
  • test_features (array) – The feature matrix for the testing data.
catlearn.preprocess.feature_extraction.pca(components, train_matrix, test_matrix)

Principal component analysis routine.

Parameters:
  • components (int) – The number of components to be returned.
  • train_matrix (array) – The training features.
  • test_matrix (array) – The test features.
Returns:

  • new_train (array) – Extracted training features.
  • new_test (array) – Extracted test features.
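
For orientation, principal component analysis can be sketched with a singular value decomposition in NumPy. The function pca_sketch is hypothetical and is not the routine documented above; it only shows that the projection is fitted on the training data and then applied to the test data.

```python
import numpy as np


def pca_sketch(components, train_matrix, test_matrix):
    """Project both matrices onto the leading training components."""
    train = np.asarray(train_matrix, dtype=float)
    test = np.asarray(test_matrix, dtype=float)
    mean = train.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:components].T
    return (train - mean) @ basis, (test - mean) @ basis


rng = np.random.default_rng(1)
train = rng.normal(size=(30, 5))
test = rng.normal(size=(10, 5))
new_train, new_test = pca_sketch(2, train, test)
```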

catlearn.preprocess.feature_extraction.pls(components, train_matrix, target, test_matrix)

Projection of latent structure routine.

Parameters:
  • components (int) – The number of components to be returned.
  • train_matrix (array) – The training features.
  • test_matrix (array) – The test features.
Returns:

  • new_train (array) – Extracted training features.
  • new_test (array) – Extracted test features.

catlearn.preprocess.feature_extraction.spca(components, train_matrix, test_matrix)

Sparse principal component analysis routine.

Parameters:
  • components (int) – The number of components to be returned.
  • train_matrix (array) – The training features.
  • test_matrix (array) – The test features.
Returns:

  • new_train (array) – Extracted training features.
  • new_test (array) – Extracted test features.

catlearn.preprocess.greedy_elimination

Greedy feature selection routines.

class catlearn.preprocess.greedy_elimination.GreedyElimination(nprocs=1, verbose=True, save_file=None)

Bases: object

The greedy feature elimination class.

greedy_elimination(predict, features, targets, nsplit=2, step=1)

Greedy feature elimination.

Function to iterate through the feature set, eliminating the worst feature in each pass. This is the backwards greedy algorithm.

Parameters:
  • predict (object) –

    A function that will make the predictions. predict should accept the parameters:

    train_features : array
    test_features : array
    train_targets : list
    test_targets : list

    predict should return either a float or a list of floats. The float or the first value of the list will be used as the fitness score.

  • features (array) – An n, d array of features.
  • targets (list) – A list of the target values.
  • nsplit (int) – Number of folds in k-fold cross-validation.
Returns:

output – First column is the index of features in the order they were eliminated.

Second column holds the corresponding cost function values, averaged over the k-fold split.

Following columns are any additional values returned by predict, averaged over the k fold split.

Return type:

array
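
The backward greedy procedure can be sketched in a few lines. Everything below is illustrative: lstsq_predict is a hypothetical predict function that follows the argument order listed above, the folds are contiguous rather than shuffled, and the result is a list of tuples instead of the array the class returns.

```python
import numpy as np


def lstsq_predict(train_features, test_features, train_targets, test_targets):
    """Fit ordinary least squares and return the test RMSE."""
    coef, *_ = np.linalg.lstsq(train_features, train_targets, rcond=None)
    residual = test_features @ coef - test_targets
    return float(np.sqrt(np.mean(residual ** 2)))


def backward_greedy(predict, features, targets, nsplit=2):
    """Eliminate one feature per pass: the one whose removal gives
    the lowest cross-validated error."""
    rows = np.arange(len(targets))
    folds = np.array_split(rows, nsplit)
    remaining = list(range(features.shape[1]))
    eliminated = []
    while len(remaining) > 1:
        scores = []
        for candidate in remaining:
            cols = [c for c in remaining if c != candidate]
            errors = []
            for fold in folds:
                train_rows = np.setdiff1d(rows, fold)
                errors.append(predict(features[np.ix_(train_rows, cols)],
                                      features[np.ix_(fold, cols)],
                                      targets[train_rows], targets[fold]))
            scores.append(np.mean(errors))
        best = int(np.argmin(scores))
        eliminated.append((remaining.pop(best), scores[best]))
    return eliminated


rng = np.random.default_rng(2)
features = rng.normal(size=(40, 4))
# Only feature 1 carries signal, so it should survive to the end.
targets = 2.0 * features[:, 1] + 0.01 * rng.normal(size=40)
output = backward_greedy(lstsq_predict, features, targets)
```

Each entry of output pairs an eliminated feature index with the averaged error obtained after removing it.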

catlearn.preprocess.importance_testing

Functions to check feature significance.

class catlearn.preprocess.importance_testing.ImportanceElimination(transform, nprocs=1, verbose=True)

Bases: object

The feature importance elimination class.

importance_elimination(train_predict, test_predict, features, targets, nsplit=2, step=1)

Importance feature elimination.

Function to iterate through the feature set, eliminating the least important feature in each pass. This is the backwards elimination algorithm.

Parameters:
  • train_predict (object) –

    A function that will train a model. The function should accept the parameters:

    train_features : array
    train_targets : list

    train_predict should return a function that can be passed to test_predict.

  • test_predict (object) – A function that will accept a trained model object and return a float or a list of test metrics. The first returned metric will be used to eliminate features.
  • features (array) – An n, d array of features.
  • targets (list) – A list of the target values.
  • nsplit (int) – Number of folds in k-fold cross-validation.
  • step (int) – Optional number of features to eliminate in each round.
Returns:

output – First column is the index of features in the order they were eliminated.

Second column holds the corresponding cost function values, averaged over the k-fold split.

Following columns are any additional values returned by test_predict, averaged over the k fold split.

Return type:

array

catlearn.preprocess.importance_testing.feature_invariance(args)

Make a feature invariant.

Parameters:
  • args (list) –

    A list of arguments:

    index : int
        The index of the feature to be shuffled.
    train_features : array
        The original training data matrix.
    test_features : array
        The original test data matrix.
Returns:
  • train (array) – Feature matrix with a shuffled feature column in matrix.
  • test (array) – Feature matrix with a shuffled feature column in matrix.
catlearn.preprocess.importance_testing.feature_randomize(args)

Make a feature random noise.

Parameters:
  • args (list) –

    A list of arguments:

    index : int
        The index of the feature to be shuffled.
    train_features : array
        The original training data matrix.
    test_features : array
        The original test data matrix.
Returns:
  • train (array) – Feature matrix with a shuffled feature column in matrix.
  • test (array) – Feature matrix with a shuffled feature column in matrix.
catlearn.preprocess.importance_testing.feature_shuffle(args)

Shuffle a feature.

The method has a number of advantages for measuring feature importance. Notably the original values and scale of the feature are maintained.

Parameters:
  • args (list) –

    A list of arguments:

    index : int
        The index of the feature to be shuffled.
    train_features : array
        The original training data matrix.
    test_features : array
        The original test data matrix.
Returns:
  • train (array) – Feature matrix with a shuffled feature column in matrix.
  • test (array) – Feature matrix with a shuffled feature column in matrix.
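
Shuffling a single column can be sketched as below. The helper shuffle_column and its seed argument are hypothetical (the documented function takes a single args list); the point is that the column keeps its original values and scale while its relation to the targets is broken.

```python
import numpy as np


def shuffle_column(index, train_features, test_features, seed=0):
    """Return copies with column `index` permuted in each matrix."""
    rng = np.random.default_rng(seed)
    # np.array copies, so the input matrices are left untouched.
    train = np.array(train_features, dtype=float)
    test = np.array(test_features, dtype=float)
    train[:, index] = rng.permutation(train[:, index])
    test[:, index] = rng.permutation(test[:, index])
    return train, test


train0 = np.arange(12.0).reshape(6, 2)
test0 = np.arange(8.0).reshape(4, 2)
train, test = shuffle_column(1, train0, test0)
```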

catlearn.preprocess.scaling

Functions to process the raw feature matrix.

catlearn.preprocess.scaling.min_max(train_matrix, test_matrix=None, local=True)

Normalize each feature relative to the min and max.

Parameters:
  • train_matrix (list) – Feature matrix for the training dataset.
  • test_matrix (list) – Feature matrix for the test dataset.
  • local (boolean) – Define whether to scale locally or globally.
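
Min-max scaling can be sketched as follows, assuming that local scaling means the minimum and maximum are taken from the training matrix only and then reused for the test matrix. That reading of local is an assumption, not something stated above.

```python
import numpy as np

train_matrix = np.array([[1.0, 10.0], [2.0, 20.0], [5.0, 30.0]])
test_matrix = np.array([[3.0, 40.0]])
# Statistics come from the training data only.
low = train_matrix.min(axis=0)
span = train_matrix.max(axis=0) - low
train_scaled = (train_matrix - low) / span
test_scaled = (test_matrix - low) / span
```

Under this assumption, test values can fall outside the interval [0, 1] when they lie outside the training range.
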
catlearn.preprocess.scaling.normalize(train_matrix, test_matrix=None, mean=None, dif=None, local=True)

Normalize each feature relative to mean and min/max variance.

Parameters:
  • train_matrix (list) – Feature matrix for the training dataset.
  • test_matrix (list) – Feature matrix for the test dataset.
  • local (boolean) – Define whether to scale locally or globally.
  • mean (list) – List of mean values for each feature.
  • dif (list) – List of max-min values for each feature.
catlearn.preprocess.scaling.standardize(train_matrix, test_matrix=None, mean=None, std=None, local=True)

Standardize each feature relative to the mean and standard deviation.

Parameters:
  • train_matrix (array) – Feature matrix for the training dataset.
  • test_matrix (array) – Feature matrix for the test dataset.
  • mean (list) – List of mean values for each feature.
  • std (list) – List of standard deviation values for each feature.
  • local (boolean) – Define whether to scale locally or globally.
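
Standardization can be sketched in the same way, again assuming that local scaling uses the mean and standard deviation of the training matrix only. The variable names are illustrative and the population standard deviation (NumPy's default) is used.

```python
import numpy as np

train_matrix = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_matrix = np.array([[2.0, 40.0]])
# Statistics come from the training data only.
mean = train_matrix.mean(axis=0)
std = train_matrix.std(axis=0)
train_scaled = (train_matrix - mean) / std
test_scaled = (test_matrix - mean) / std
```

Reusing mean and std for later data keeps new samples on the same scale as the training set.
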
catlearn.preprocess.scaling.target_center(target)

Return a list of centered target values.

Parameters:
  • target (list) – A list of the target values.

catlearn.preprocess.scaling.target_normalize(target)

Return a list of normalized target values.

Parameters:
  • target (list) – A list of the target values.

catlearn.preprocess.scaling.target_standardize(target)

Return a list of standardized target values.

Parameters:
  • target (list) – A list of the target values.

catlearn.preprocess.scaling.unit_length(train_matrix, test_matrix=None, local=True)

Normalize each feature vector relative to the Euclidean length.

Parameters:
  • train_matrix (list) – Feature matrix for the training dataset.
  • test_matrix (list) – Feature matrix for the test dataset.
  • local (boolean) – Define whether to scale locally or globally.