Simple k-means clustering.

catlearn.utilities.clustering.cluster_features(train_matrix, train_target, k=2, test_matrix=None, test_target=None)

Function to perform k-means clustering in the feature space.

  • train_matrix (list) – Feature matrix for the training dataset.
  • train_target (list) – List of target values for training data.
  • k (int) – Number of clusters to divide data into.
  • test_matrix (list) – Feature matrix for the test dataset.
  • test_target (list) – List of target values for test data.
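
As a rough illustration of what k-means clustering in feature space involves, the sketch below runs Lloyd's algorithm on a small feature matrix using only the standard library. The function name `kmeans`, its `n_iter` and `seed` arguments, and its return values are assumptions made for this example; they are not CatLearn's implementation or the return format of `cluster_features`.

```python
import math
import random


def kmeans(points, k=2, n_iter=50, seed=0):
    """Return (centroids, labels) from Lloyd's algorithm (hypothetical helper)."""
    rng = random.Random(seed)
    # Start from k distinct rows picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep the centroid of an empty cluster where it was.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Two well-separated groups of three points each.
train_matrix = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, labels = kmeans(train_matrix, k=2)
```

Target values would then be grouped with the same labels, so that each cluster carries its own feature rows and targets.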


Functions to create databases storing feature matrices.

class catlearn.utilities.database_functions.DescriptorDatabase(db_name='descriptor_store.sqlite', table='Descriptors')

Bases: object

Store sets of descriptors for a given atoms object assigned a unique ID.

The descriptors for a given system can be stored in the ase.atoms object, though we typically find this method to be slower.


Function to create a new column in the table.

The new column will be initialized with None values.

Parameters: new_column (str) – Name of new feature or target.

Function to setup a database storing descriptors.

Parameters: names (list) – List of heading names for features and targets.
fill_db(descriptor_names, data)

Function to fill the descriptor database.

  • descriptor_names (list) – List of descriptor names for features and targets.
  • data (array) – First row should contain string of UUIDs, thereafter array should contain floats corresponding to the descriptor names provided.

Function to get the column names of a supplied table.

query_db(unique_id=None, names=None)

Return a single row based on UUID, or all rows.

  • unique_id (str) – If specified, the data corresponding to the given UUID will be returned. If None, all rows will be returned.
  • names (list) – If specified, only the data corresponding to provided column names will be returned. If None, all columns will be returned.
update_descriptor(descriptor, new_data, unique_id)

Function to update a descriptor based on a given uuid.

  • descriptor (str) – Name of descriptor to be updated.
  • new_data (float) – New value to be entered into table.
  • unique_id (str) – The UUID of the entry to be updated.
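
The operations described for the descriptor database (create, fill, add a column, update, query, list column names) map onto ordinary SQLite statements. The sketch below shows that mapping with the standard `sqlite3` module on an in-memory database. The column names `f1`, `f2` and `target` are made up for illustration, and the SQL is an assumption about how such a store could be laid out, not the statements `DescriptorDatabase` actually issues.

```python
import sqlite3
import uuid

con = sqlite3.connect(":memory:")  # in-memory stand-in for the .sqlite file
cur = con.cursor()

# One row per atoms object, keyed by UUID, one column per descriptor.
cur.execute(
    "CREATE TABLE Descriptors (uuid TEXT PRIMARY KEY, f1 REAL, f2 REAL)")

ids = [str(uuid.uuid4()) for _ in range(2)]
cur.executemany("INSERT INTO Descriptors VALUES (?, ?, ?)",
                [(ids[0], 1.0, 2.0), (ids[1], 3.0, 4.0)])

# A new column starts out as NULL (None in Python) for every existing row.
cur.execute("ALTER TABLE Descriptors ADD COLUMN target REAL")

# Update one descriptor for the entry with a given UUID.
cur.execute("UPDATE Descriptors SET target = ? WHERE uuid = ?",
            (0.5, ids[0]))
con.commit()

# Query selected columns of a single row by UUID, then every row.
one_row = cur.execute(
    "SELECT f1, f2, target FROM Descriptors WHERE uuid = ?",
    (ids[0],)).fetchone()
all_rows = cur.execute("SELECT * FROM Descriptors").fetchall()

# Column names of the table.
columns = [row[1] for row in cur.execute("PRAGMA table_info(Descriptors)")]
```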
class catlearn.utilities.database_functions.FingerprintDB(db_name='fingerprints.db', verbose=False)

A class for accessing a temporary SQLite database.

This class works as a context manager and should be used as follows:

with FingerprintDB() as fpdb:
    (Perform operation here)

This syntax will automatically construct the temporary database, or access an existing one. Upon exiting the with block, the changes to the database will be automatically committed.
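
The commit-on-exit behaviour follows from Python's context manager protocol. The sketch below shows the general pattern with a hypothetical class `TempDB`; its attribute names `con` and `c` are assumptions, and it is not the `FingerprintDB` source.

```python
import sqlite3


class TempDB:
    """Hypothetical stand-in that commits and closes on exit."""

    def __init__(self, db_name=":memory:"):
        self.db_name = db_name

    def __enter__(self):
        # Opens the database, creating it if it does not exist yet.
        self.con = sqlite3.connect(self.db_name)
        self.c = self.con.cursor()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block ends, even if the block raised.
        self.con.commit()
        self.con.close()


with TempDB() as db:
    db.c.execute("CREATE TABLE t (x REAL)")
    db.c.execute("INSERT INTO t VALUES (1.5)")
    value = db.c.execute("SELECT x FROM t").fetchone()[0]
```

Because `__exit__` always runs, callers never need to remember an explicit commit or close.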


Create the database table framework used in SQLite.

This includes 3 tables: images, parameters, and fingerprints.

The images table currently stores ase_id information and a unique string. This can be adapted in the future to support atoms objects.

The parameters table stores a symbol (10 character maximum) for convenient reference and a description of the parameter.

The fingerprints table holds a unique image and parameter ID along with a float value for each. The ID pair must be unique.
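
One way to picture the three-table layout is the schema sketched below. The column names (`iid`, `pid`, `image_id` and so on) are assumptions chosen for this example and may not match the real tables. Note that SQLite does not enforce the length in `CHAR(10)`, so a 10-character limit would have to be checked separately.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE images (
    iid INTEGER PRIMARY KEY AUTOINCREMENT,
    ase_id INTEGER,
    identity TEXT
);
CREATE TABLE parameters (
    pid INTEGER PRIMARY KEY AUTOINCREMENT,
    symbol CHAR(10) UNIQUE,
    description TEXT
);
CREATE TABLE fingerprints (
    image_id INTEGER REFERENCES images(iid),
    param_id INTEGER REFERENCES parameters(pid),
    value REAL,
    UNIQUE (image_id, param_id)
);
""")

con.execute("INSERT INTO images (ase_id, identity) VALUES (1, 'slab')")
con.execute(
    "INSERT INTO parameters (symbol, description) "
    "VALUES ('Z', 'atomic number')")
con.execute("INSERT INTO fingerprints VALUES (1, 1, 29.0)")

# A second value for the same (image, parameter) pair violates UNIQUE.
try:
    con.execute("INSERT INTO fingerprints VALUES (1, 1, 30.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```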

fingerprint_entry(ase_id, param_id, value)

Enter fingerprint value to database for given ase and parameter ID.

  • ase_id (int) – The ase unique ID associated with an atoms object in the database.
  • param_id (int or str) – The parameter ID or symbol associated with an entry in the parameters table.
  • value (float) – The value of the parameter for the atoms object.
get_fingerprints(ase_ids, params=[])

Return values of provided parameters for each ase_id provided.

  • ase_ids (list) – The ase ID(s) associated with an atoms object in the database.
  • params (list) – List of symbols or int in parameters table to be selected.

Returns: fingerprint – An array of values associated with the given parameters (a fingerprint) for each ase_id.


get_parameters(selection=None, display=False)

Return integer values corresponding to parameter IDs.

The array returned will be for a set of provided symbols. If no selection is provided, return all symbols.

  • selection (list) – List of symbols in parameters table to be selected.
  • display (bool) – If True, print parameter descriptions.

Returns: res – The integer values of selected parameters.


image_entry(asedb_entry=None, identity=None)

Enter a single ase-db image into the fingerprint database.

This table can be expanded to contain atoms objects in the future.

  • asedb_entry (object) – An ase-db object which can be parsed.
  • identity (str) – An identifier of the user's choice.
Returns: The ase ID collected for the ase-db object.


parameter_entry(symbol=None, description=None)

Function for entering unique parameters into the database.

  • symbol (str) – A unique symbol the entry can be referenced by. If None, the symbol will be the ID of the parameter as a string.
  • description (str) – A description of the parameter.


Pair distribution function.

catlearn.utilities.distribution.pair_deviation(images, cutoffs, bins=33, bounds=None, mic=True, element=None)

Return distribution of deviations from atom-pair nominal bond length.

  • images (list) – List of atoms objects.
  • cutoffs (dictionary) – Subtract elemental cutoff radii from distances. This is useful for testing cutoff radii.
  • bins (int) – Number of bins.
  • bounds (tuple) – Optional upper and lower bound of distances.
  • mic (boolean) – Use minimum image convention. Set to False for non-periodic structures.
  • subset (list) – Optionally select a subset of atomic indices to include.
catlearn.utilities.distribution.pair_distribution(images, bins=101, bounds=None, mic=True, element=None)

Return the pair distribution function from a list of atoms objects.

  • images (list) – List of atoms objects.
  • bins (int) – Number of bins.
  • bounds (tuple) – Optional upper and lower bound of distances.
  • mic (boolean) – Use minimum image convention. Set to False for non-periodic structures.
  • subset (list) – Optionally select a subset of atomic indices to include.
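
The core of both functions is a histogram of interatomic distances. The sketch below computes such a histogram for plain Cartesian positions without periodic images (the `mic=False` case). The helper name `pair_histogram` is hypothetical, and the sketch omits the normalization a full pair distribution function would apply, so it should be read as an outline of the binning step only.

```python
import itertools
import math


def pair_histogram(positions, bins=10, bounds=(0.0, 5.0)):
    """Count interatomic distances per bin (no periodic images)."""
    lower, upper = bounds
    width = (upper - lower) / bins
    counts = [0] * bins
    for a, b in itertools.combinations(positions, 2):
        d = math.dist(a, b)
        # Distances outside the bounds are ignored.
        if lower <= d < upper:
            counts[int((d - lower) / width)] += 1
    edges = [lower + i * width for i in range(bins + 1)]
    return counts, edges


# Three atoms on a line, 1 unit apart: distances are 1, 1 and 2.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
counts, edges = pair_histogram(positions, bins=5, bounds=(0.0, 5.0))
```

For a deviation histogram, the sum of the two elemental cutoff radii would be subtracted from each distance before binning.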


Functions to generate the neighborlist.

catlearn.utilities.neighborlist.ase_connectivity(atoms, cutoffs=None, count_bonds=True)

Return a connectivity matrix calculated for an atoms object.

If no neighborlist or connectivity matrix is attached to the atoms object, a new one will be generated. Multiple connections are counted.

  • atoms (object) – An ase atoms object.
  • cutoffs (list) – A list of cutoff radii for the atoms, ordered by atom index.

Returns: conn – An n by n matrix, where n is len(atoms).
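
A connectivity matrix built from per-atom cutoff radii can be sketched as follows. This is a simplified, non-periodic version under the assumption that two atoms are connected when their cutoff spheres overlap; the helper name `connectivity` is hypothetical, and the counting of multiple connections through periodic images is not reproduced.

```python
import math


def connectivity(positions, cutoffs):
    """Return an n by n matrix with 1 where two atoms are connected."""
    n = len(positions)
    conn = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Connected when the cutoff spheres of the two atoms overlap.
            d = math.dist(positions[i], positions[j])
            if d < cutoffs[i] + cutoffs[j]:
                conn[i][j] = conn[j][i] = 1
    return conn


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
cutoffs = [0.6, 0.6, 0.6]  # ordered by atom index
conn = connectivity(positions, cutoffs)
```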


catlearn.utilities.neighborlist.ase_neighborlist(atoms, cutoffs=None)

Make dict of neighboring atoms using ase function.

This provides a wrapper for the ASE neighborlist generator. Currently default values are used.

  • atoms (object) – Target ase atoms object on which to get neighbor list.
  • cutoffs (list) – A list of radii for each atom in atoms.
  • rtol (float) – The tolerance factor to allow for small variation in the cutoff radii.

Returns: neighborlist – A dictionary containing the atom index and each neighbor index.


catlearn.utilities.neighborlist.catlearn_neighborlist(atoms, dx=None, max_neighbor=1, mic=True)

Make dict of neighboring atoms for discrete system.

Neighbors from a defined neighbor shell (e.g. 1st, 2nd, 3rd) can be returned by changing max_neighbor.

  • atoms (object) – Target ase atoms object on which to get neighbor list.
  • dx (dict) – Buffer to calculate nearest neighbor pairs in dict format: dx = {atomic_number: buffer}.
  • max_neighbor (int or str) – Maximum neighbor shell. If int is passed this will define how many shells to consider. If ‘full’ is passed then all neighbor combinations will be included. This might get expensive for particularly large systems.

Returns: connection_matrix – An array of the neighbor shell each atom index is located in.
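
One possible reading of "neighbor shell" is the number of bonds separating two atoms. Under that assumption, the shells can be found with a breadth-first search over a connectivity matrix, as sketched below. The helper `neighbor_shells` is hypothetical, and CatLearn may define shells differently (for example, from distances and the dx buffer).

```python
from collections import deque


def neighbor_shells(conn, max_neighbor=2):
    """Return a matrix whose (i, j) entry is the shell of atom j around i.

    A value of 0 means j is atom i itself or lies beyond max_neighbor.
    """
    n = len(conn)
    shells = [[0] * n for _ in range(n)]
    for start in range(n):
        depth = {start: 0}
        queue = deque([start])
        while queue:
            i = queue.popleft()
            if depth[i] == max_neighbor:
                continue  # do not expand past the outermost shell
            for j in range(n):
                if conn[i][j] and j not in depth:
                    depth[j] = depth[i] + 1
                    queue.append(j)
        for j, d in depth.items():
            shells[start][j] = d
    return shells


# A chain of four atoms: 0-1-2-3.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
shells = neighbor_shells(chain, max_neighbor=2)
```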



Class with penalty functions.

class catlearn.utilities.penalty_functions.PenaltyFunctions(targets=None, predictions=None, uncertainty=None, train_features=None, test_features=None)

Bases: object

Base class for penalty functions.

penalty_close(c_min_crit=100000.0, d_min_crit=1e-05)

Penalize data that is too close.

Takes arrays of test features and train features and returns an array of penalties for distances that are too short, ensuring no duplicates are added.

  • d_min_crit (float) – Critical distance.
  • c_min_crit (float) – Constant for penalty minimum distance.
Returns: penalty_min (array) – Array containing the penalty to add.
penalty_far(c_max_crit=100.0, d_max_crit=10.0)

Penalize data that is too far.

Takes arrays of test features and train features and returns an array of penalties for distances that are too large. This prevents exploration of unrealistic configurations.

  • d_max_crit (float) – Critical distance.
  • c_max_crit (float) – Constant for penalty maximum distance.
Returns: penalty_max (array) – Array containing the penalty to add.
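
To make the role of the critical distance and the constant concrete, the sketch below applies a simple step penalty based on each test point's distance to its nearest training point. The step form is an assumption chosen for clarity, and the free functions here are hypothetical; the functional form used by `PenaltyFunctions` may well differ.

```python
import math


def nearest_distance(point, train_features):
    """Euclidean distance from point to its closest training row."""
    return min(math.dist(point, x) for x in train_features)


def penalty_close(test_features, train_features,
                  c_min_crit=1e5, d_min_crit=1e-5):
    # Assumed form: constant penalty when a test point (nearly) duplicates
    # a training point, zero otherwise.
    return [c_min_crit if nearest_distance(t, train_features) < d_min_crit
            else 0.0 for t in test_features]


def penalty_far(test_features, train_features,
                c_max_crit=1e2, d_max_crit=10.0):
    # Assumed form: constant penalty when a test point is further than
    # d_max_crit from every training point, zero otherwise.
    return [c_max_crit if nearest_distance(t, train_features) > d_max_crit
            else 0.0 for t in test_features]


train = [[0.0, 0.0], [1.0, 1.0]]
test = [[0.0, 0.0], [0.5, 0.5], [20.0, 20.0]]
close = penalty_close(test, train)
far = penalty_far(test, train)
```

In this example the first test point duplicates a training point and the last one lies far from all training data, so each picks up exactly one of the two penalties.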


Function to compute Sammon’s error between original and reduced features.

catlearn.utilities.sammon.sammons_error(original, reduced)

Sammon error.

  • original (array) – The original feature set.
  • reduced (array) – The reduced feature set.

Returns: error – Sammon’s error value.
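
For reference, the textbook definition of Sammon's error compares pairwise distances d*_ij in the original space with distances d_ij in the reduced space: E = (1 / Σ d*_ij) · Σ (d*_ij − d_ij)² / d*_ij, with both sums over pairs i < j. The sketch below implements that definition with a hypothetical helper `sammon_stress`; whether `sammons_error` matches it term for term is not confirmed here. It assumes no two original points coincide, since that would give a zero denominator.

```python
import itertools
import math


def sammon_stress(original, reduced):
    """Sammon's error between two point sets with the same row order."""
    total = 0.0   # sum of original pairwise distances
    stress = 0.0  # sum of weighted squared distance mismatches
    for i, j in itertools.combinations(range(len(original)), 2):
        d_orig = math.dist(original[i], original[j])
        d_red = math.dist(reduced[i], reduced[j])
        total += d_orig
        stress += (d_orig - d_red) ** 2 / d_orig
    return stress / total


original = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]  # pair distances 5, 10, 5
exact = [[0.0], [5.0], [10.0]]                   # preserves every distance
shrunk = [[0.0], [4.0], [8.0]]                   # pair distances 4, 8, 4

error_exact = sammon_stress(original, exact)
error_shrunk = sammon_stress(original, shrunk)
```

A projection that preserves every pairwise distance gives an error of zero, and the error grows as distances are distorted.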



Some useful utilities.

catlearn.utilities.utilities.formal_charges(atoms, ion_number=8, ion_charge=-2)

Return a list of formal charges on atoms.

  • atoms (object) – ase.Atoms object representing a chalcogenide. The default parameters are relevant for an oxide.
  • ion_number (int) – Atomic number of the anion.
  • ion_charge (int) – Formal charge of the anion.

Returns: all_charges – Formal charges ordered by atomic index.



A hash based strictly on the geometry features of an atoms object.

Uses positions, cell, and symbols.

This is intended for planewave basis set calculations, so pbc is not considered.

Each element is sorted in the algorithm to help prevent new hashes for identical geometries.

catlearn.utilities.utilities.holdout_set(data, fraction, target=None, seed=None)

Return a dataset split into a holdout set and a training set.

  • data (array) – n by d array.
  • fraction (float) – fraction of data to hold out for testing.
  • target (list) – optional list of targets or separate feature.
  • seed (float) – optional float for reproducible splits.
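
A seeded holdout split can be sketched as below: shuffle the row indices with a seeded generator, then cut off the requested fraction for testing. The helper name `holdout_split` and the dictionary keys it returns are assumptions for this example and do not describe the return format of `holdout_set`.

```python
import random


def holdout_split(data, fraction, target=None, seed=None):
    """Split rows (and optional targets) into train and test parts."""
    rng = random.Random(seed)  # the same seed gives the same split
    index = list(range(len(data)))
    rng.shuffle(index)
    n_test = int(round(fraction * len(data)))
    test_idx, train_idx = index[:n_test], index[n_test:]
    split = {
        "train": [data[i] for i in train_idx],
        "test": [data[i] for i in test_idx],
    }
    if target is not None:
        # Targets follow the same shuffled indices as their rows.
        split["train_target"] = [target[i] for i in train_idx]
        split["test_target"] = [target[i] for i in test_idx]
    return split


data = [[i, i * 2] for i in range(10)]
target = list(range(10))
split = holdout_split(data, 0.2, target=target, seed=1)
```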
catlearn.utilities.utilities.target_correlation(train, target, correlation=['pearson', 'spearman', 'kendall'])

Return the correlation of all columns of train with a target feature.

  • train (array) – n by d training data matrix.
  • target (list) – target for correlation.

Returns: metric – len(metric) by d matrix of correlation coefficients.
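
The shape of the returned matrix (one row per correlation measure, one column per feature) can be illustrated with the sketch below. It covers only Pearson and Spearman, omits Kendall, and does not average tied ranks. The helper `column_correlations` is hypothetical and written with the standard library, so it is not the routine `target_correlation` uses internally.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks from 1 to n; ties are not averaged in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))


def column_correlations(train, target, correlation=("pearson", "spearman")):
    funcs = {"pearson": pearson, "spearman": spearman}
    columns = list(zip(*train))  # transpose the n by d matrix
    return [[funcs[name](col, target) for col in columns]
            for name in correlation]


train = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
target = [1.0, 4.0, 9.0, 16.0]
metric = column_correlations(train, target)
```

Here the target is a monotonic but nonlinear function of the first column, so its Spearman coefficient is 1 while its Pearson coefficient is slightly lower.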