catlearn.cross_validation

Cross validation functions.

catlearn.cross_validation.hierarchy_cv

Cross-validation routines to work with the feature database.

class catlearn.cross_validation.hierarchy_cv.Hierarchy(file_name, db_name, table='FingerVector', file_format='pickle')

Bases: object

Class to form the hierarchy cross-validation setup.

This class is used to cross-validate with respect to data size. The initial dataset is split in two and the resulting subsets are split further until a minimum size is reached. Predictions are made on all subsets of the data, giving an averaged error and certainty at each data size.
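
A minimal usage sketch is given below. The file names are hypothetical, and it is assumed that the feature database already exists (e.g. written with todb) and that split_index stores the generated split in file_name for load_split to read back:

    from catlearn.cross_validation.hierarchy_cv import Hierarchy

    # Point the hierarchy at an existing feature database (names hypothetical).
    hv = Hierarchy(file_name='hierarchy.pickle', db_name='fpv_store.sqlite',
                   table='FingerVector', file_format='pickle')

    # Build subsets between 50 and 1000 data points, then reload the saved split.
    hv.split_index(min_split=50, max_split=1000)
    index_split = hv.load_split()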

get_subset_data(index_split, indicies, split=None)

Make an array with training data according to the index.

Parameters:
  • index_split (array) – Array with the index data.
  • indicies (array) – Index used to generate data.
globalscaledata(index_split)

Make an array with all data.

Parameters:
  • index_split (array) – Array with the index data.
load_split()

Function to load the split from file.

split_index(min_split, max_split=None, all_index=None)

Function to split up the db index to form subsets of data.

Parameters:
  • min_split (int) – Minimum size of a data subset.
  • max_split (int) – Maximum size of a data subset.
  • all_index (list) – List of indices in the feature database.
split_predict(index_split, predict, **kwargs)

Function to make predictions looping over all subsets of data.

Parameters:
  • index_split (dict) – All data for the split.
  • predict (function) – The prediction function. Must return a dict with 'result' in it (see the sketch below).
Returns:

  • result (list) – A list of averaged errors for each subset of data.
  • size (list) – A list of data sizes corresponding to the errors list.

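A sketch of a compatible prediction function is shown below. Only the requirement that the returned dict contains 'result' comes from this documentation; the call signature with separate training and test data, and the trivial mean-value baseline, are assumptions for illustration:

    import numpy as np

    def predict(train_features, train_targets, test_features, test_targets):
        """Hypothetical prediction function for split_predict."""
        # Trivial baseline: predict the mean of the training targets.
        prediction = np.full(len(test_targets), np.mean(train_targets))
        error = np.abs(prediction - np.asarray(test_targets))
        return {'result': np.mean(error), 'size': len(train_targets)}

    # res = hv.split_predict(index_split, predict)
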
todb(features, targets)

Function to convert numpy arrays to a basic database.

transform_output(data)

Function to compile results into a format for plotting the average error; see the plotting sketch below.

Parameters:
  • data (dict) – The dictionary output from the split_predict function.
Returns:
  • size (list) – A list of the data sizes used in the CV.
  • error (list) – A list of the mean errors at each data size.

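A plotting sketch, assuming res holds the output of split_predict from the sketch above and that the size and error lists are returned as a pair:

    import matplotlib.pyplot as plt

    size, error = hv.transform_output(res)

    plt.plot(size, error, 'o-')
    plt.xlabel('Data size')
    plt.ylabel('Mean error')
    plt.show()
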
catlearn.cross_validation.k_fold_cv

Set up the k-fold array split for cross-validation.

catlearn.cross_validation.k_fold_cv.k_fold(features, targets=None, nsplit=3, fix_size=None)

Routine to split the feature matrix and return sublists; see the example below.

Parameters:
  • features (array) – An n, d feature array.
  • targets (list) – A list of target values.
  • nsplit (int) – The number of bins that the data should be divided into.
  • fix_size (int) – Define a fixed sample size, e.g. nsplit=5 with fix_size=100 generates 5 splits of 100 data points each. Default is None, meaning all available data is divided into nsplit parts.
Returns:

  • features (list) – A list of feature arrays of length nsplit.
  • targets (list) – A list of target lists of length nsplit.

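A minimal example on random data, assuming the feature and target splits are returned as a pair:

    import numpy as np
    from catlearn.cross_validation.k_fold_cv import k_fold

    features = np.random.random((90, 10))   # 90 data points, 10 descriptors
    targets = list(np.random.random(90))

    fsplit, tsplit = k_fold(features, targets=targets, nsplit=3)
    # fsplit and tsplit each contain 3 folds of 30 data points.
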
catlearn.cross_validation.k_fold_cv.read_split(fname, fformat='pickle')

Function to read the k-fold split from file.

Parameters:
  • fname (str) – The name of the file to read.
  • fformat (str) – File format to read from. Can be json or pickle, default is pickle.
Returns:

  • features (list) – A list of feature arrays of length nsplit.
  • targets (list) – A list of target lists of length nsplit.

catlearn.cross_validation.k_fold_cv.write_split(features, targets, fname, fformat='pickle')

Function to write the k-fold split to file.

Parameters:
  • features (array) – An n, d feature array.
  • targets (list) – A list of target values.
  • fname (str) – The name of the file to write.
  • fformat (str) – File format to write to. Can be json or pickle, default is pickle.

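A round-trip sketch is given below. The file name is hypothetical, and it is assumed that write_split simply stores the split passed to it, so that read_split returns the same feature and target lists:

    import numpy as np
    from catlearn.cross_validation.k_fold_cv import k_fold, read_split, write_split

    features = np.random.random((90, 10))
    targets = list(np.random.random(90))
    fsplit, tsplit = k_fold(features, targets=targets, nsplit=3)

    write_split(fsplit, tsplit, fname='kfold_split', fformat='pickle')
    fsplit_loaded, tsplit_loaded = read_split(fname='kfold_split', fformat='pickle')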