Machine Learning

To exploit the data generated in high-throughput studies, this library also offers methods and interfaces to other python packages that facilitate the training of machine learning models. Generally, the workflow can be split up into four consecutive steps:

  1. Splitting up the initial data set into a training and test set

  2. Extracting features from the data sets

  3. Transforming data, model training and performance testing

  4. Applying the ML model on new unknown data

In the following, we give an overview of how the methods of this library can be used to accomplish the individual steps. As an example we use a subset of the data set used in TODO: Add reference:

[1]:
from aim2dat.strct import StructureCollection

strct_c = StructureCollection()
strct_c.import_from_hdf5_file(
    "../../tests/ml/train_test_split_crystals_ref/PBE_CSP_Cs-Te_crystal-preopt_wo_dup.h5"
)

Splitting up the initial data set into a training and test set

A well-established way to evaluate the ML model’s performance is to split the initial data set into a training and a test set. This has to be done before any feature extraction or transformation of the data set takes place in order to prevent data leakage. Many machine learning packages offer methods for this task (e.g. scikit-learn’s train_test_split function). Usually the split is performed randomly, which has the disadvantage that the training and test set may be unbalanced and may no longer represent the distribution of the initial data set.
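The ordering described above (split first, transform afterwards) can be illustrated with a minimal sketch that relies only on scikit-learn and synthetic data. The arrays `X` and `y` are made-up placeholders and not part of the example data set used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data (assumption, not the Cs-Te data set).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(200, 3))
y = rng.normal(size=200)

# 1. Random split into training and test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, test_size=0.35, random_state=0
)

# 2. Fit the transformation on the training set only ...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ... and merely apply it to the test set, so that no information
# about the test set leaks into the preprocessing step.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the scaler never sees the test set during fitting, the test score remains an honest estimate of the performance on unseen data.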

For classification problems, stratified splitting addresses this issue by maintaining the fraction of each target category in the training and test set. Based on this idea, the train_test_split_crystals function implements a stratified splitting for data sets of inorganic crystals by binning the elemental concentrations and target values using the numpy histogram function. The splitting process ensures that each bin of the training and test set contains the same relative number of crystals as the initial data set. The most important parameters of the function are:

  • structure_collection: The StructureCollection object that serves as a container for the crystal structures.

  • target_attribute: Key of the target value stored in the attributes of the structure.

  • train_size: Size of the training set. If smaller than 1.0, the input is interpreted as a fraction of the initial data set, otherwise as the absolute size of the data set.

  • test_size: Size of the test set. The input is interpreted in the same way as for the train_size parameter.

  • target_bins: Number of bins or list of bin edges for the target values.

  • composition_bins: Number of bins or list of bin edges for the elemental concentrations.

  • elements: List of elements that are considered for the binning (if set to None, all elements are considered).

  • return_structure_collections: Whether StructureCollection objects or lists are returned.

[2]:
from aim2dat.ml.utils import train_test_split_crystals

comp_bins = [
    -0.05, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1.05
]

train_set, test_set, train_target, test_target = train_test_split_crystals(
    strct_c,
    "stability",
    train_size=0.6,
    test_size=0.35,
    return_structure_collections=False,
    composition_bins=comp_bins,
    target_bins=126,
)
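The binning idea behind the stratified split can be sketched independently of this library. The following example is not the implementation of train_test_split_crystals; it is a simplified illustration that assumes a single elemental concentration and a scalar target (both synthetic), derives bin edges with numpy's histogram function and passes the combined bin labels to the stratify argument of scikit-learn's train_test_split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data (assumption): one concentration and one
# target value per "crystal".
rng = np.random.default_rng(42)
n_samples = 1000
concentration = rng.uniform(0.0, 1.0, size=n_samples)
target = rng.uniform(-1.0, 1.0, size=n_samples)

n_conc_bins, n_target_bins = 5, 4
_, conc_edges = np.histogram(concentration, bins=n_conc_bins)
_, target_edges = np.histogram(target, bins=n_target_bins)

# Use the interior edges only, so that the maximum value ends up in
# the last bin instead of an extra overflow bin.
conc_bin = np.digitize(concentration, conc_edges[1:-1])
target_bin = np.digitize(target, target_edges[1:-1])

# Combine both bin indices into one label per sample.
labels = conc_bin * n_target_bins + target_bin

indices = np.arange(n_samples)
train_idx, test_idx = train_test_split(
    indices, train_size=0.6, test_size=0.35, stratify=labels, random_state=0
)

# Relative bin populations of the full and the training set.
n_labels = n_conc_bins * n_target_bins
full_frac = np.bincount(labels, minlength=n_labels) / n_samples
train_frac = np.bincount(labels[train_idx], minlength=n_labels) / len(train_idx)
print(np.abs(full_frac - train_frac).max())
```

The printed value is the largest deviation between the relative bin populations of the full data set and the training set; with stratification it stays small, since each bin is split in approximately the same proportion.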

Extracting features from data sets

This library implements several methods to extract feature vectors from molecular or crystalline structures. The corresponding transformer classes are based on scikit-learn’s BaseEstimator class, which means that they can be combined with other transformer or estimator classes by defining pipelines (more details on scikit-learn’s pipeline framework can be found here). A list of the different classes is given here. The calculation of features is based on the StructureOperations class and can exploit its parallel implementations via the attributes n_procs and chunksize.
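To illustrate what building on scikit-learn’s BaseEstimator class buys, the following sketch defines a toy transformer and chains it with a regressor in a pipeline. The class PairDistanceHistogram and its parameters are hypothetical and only loosely inspired by fingerprint-type descriptors; they are not part of this library:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


class PairDistanceHistogram(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: histogram of pair distances per structure."""

    def __init__(self, r_max=5.0, delta_bin=0.5):
        self.r_max = r_max
        self.delta_bin = delta_bin

    def fit(self, X, y=None):
        # Nothing is learned from the data, the transformer is stateless.
        return self

    def transform(self, X):
        n_bins = int(round(self.r_max / self.delta_bin))
        edges = np.linspace(0.0, self.r_max, n_bins + 1)
        features = []
        for positions in X:
            positions = np.asarray(positions, dtype=float)
            diffs = positions[:, None, :] - positions[None, :, :]
            dists = np.linalg.norm(diffs, axis=-1)
            upper = np.triu_indices(len(positions), k=1)
            # Distances larger than r_max are ignored by the histogram.
            hist, _ = np.histogram(dists[upper], bins=edges)
            features.append(hist)
        return np.array(features, dtype=float)


# Synthetic "structures": lists of Cartesian positions with 3 to 7 atoms.
rng = np.random.default_rng(0)
structures = [rng.uniform(0.0, 4.0, size=(rng.integers(3, 8), 3)) for _ in range(30)]
energies = rng.normal(size=30)

pipeline = Pipeline(
    [
        ("hist", PairDistanceHistogram(r_max=5.0, delta_bin=0.5)),
        ("ridge", Ridge(alpha=1.0)),
    ]
)
pipeline.fit(structures, energies)
predictions = pipeline.predict(structures[:5])
print(predictions.shape)
```

Since the constructor arguments are stored as attributes of the same name, the parameters of the transformer are exposed to the pipeline as `hist__r_max` and `hist__delta_bin` and could be varied in a grid search.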

Here, we make use of the StructureFFPrintTransformer (with reduced numerical parameters) which uses the F-Fingerprint (doi:10.1063/1.3079326) to describe the crystals:

[3]:
from aim2dat.ml.transformers import StructureFFPrintTransformer

ffprint_transf = StructureFFPrintTransformer(
    r_max=10.0, delta_bin=0.5, sigma=2.0, add_header=True, verbose=False
)

Specifically for a parameter grid search, structural properties that demand large computational resources to calculate can be “precomputed” for different parameter sets and reused later on. The properties are stored in a StructureOperations object for each parameter set (note that we only precompute properties for the first 40 structures of the training set to reduce the run time):

[4]:
ffprint_transf.n_procs = 4
ffprint_transf.chunksize = 10
ffprint_transf.precompute_parameter_space(
    {"r_max": [5.0, 10.0], "sigma": [2.0]}, train_set[:40]
)
ffprint_transf.precomputed_properties
[4]:
(({'r_max': 5.0, 'delta_bin': 0.5, 'sigma': 2.0, 'distinguish_kinds': False},
  <aim2dat.strct.structure_operations.StructureOperations at 0x7f89974559c0>),
 ({'r_max': 10.0, 'delta_bin': 0.5, 'sigma': 2.0, 'distinguish_kinds': False},
  <aim2dat.strct.structure_operations.StructureOperations at 0x7f8997454b50>))

Transforming data, model training and performance testing

Most of the workload of this step can be handled by existing machine learning packages. This library merely adds custom metrics and kernels that can be used with models implemented in the scikit-learn package (a list of the methods is given here).

Warning

The custom metrics and kernel functions are experimental, and so far no improvement in performance over the standard scikit-learn implementations has been observed. It is therefore not recommended to use these methods without thorough testing and comparison.
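How a user-defined kernel plugs into a scikit-learn model can be shown with a small, self-contained sketch. The function laplace_kernel below is a hypothetical example written for this illustration (it is not one of the kernels shipped with this library); it is passed as a callable to KernelRidge and compared against scikit-learn's built-in "laplacian" kernel on synthetic data:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge


def laplace_kernel(x, y, gamma=0.5):
    """Hypothetical custom kernel: exp(-gamma * L1 distance) of two rows."""
    return np.exp(-gamma * np.sum(np.abs(x - y)))


# Synthetic placeholder data (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = np.sin(X).sum(axis=1)

# A callable kernel is evaluated on each pair of rows; additional
# arguments are passed via kernel_params.
custom = KernelRidge(
    kernel=laplace_kernel, kernel_params={"gamma": 0.5}, alpha=0.1
).fit(X, y)

# Reference model using the built-in Laplacian kernel.
builtin = KernelRidge(kernel="laplacian", gamma=0.5, alpha=0.1).fit(X, y)

max_diff = np.abs(custom.predict(X) - builtin.predict(X)).max()
print("Largest difference between both models:", max_diff)
```

Comparing a custom kernel against a standard implementation in this way is one simple form of the testing recommended in the warning above.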

As an example, we build a scikit-learn pipeline taking the F-Fingerprint transformer and the kernel ridge regression model in combination with the krr_ffprint_laplace kernel:

[5]:
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import Pipeline
from aim2dat.ml.kernels import krr_ffprint_laplace

pline = Pipeline(
    [
        ("ffprint", ffprint_transf),
        ("krr", KernelRidge(kernel=krr_ffprint_laplace)),
    ]
)

Now we can train the model via the fit function of the pipeline and test it on the test data set:

[6]:
pline.fit(train_set, train_target).score(test_set, test_target)
[6]:
-0.11929851791104018
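For regressors such as KernelRidge, the score function of scikit-learn returns the coefficient of determination R². A value of 1.0 corresponds to a perfect prediction, 0.0 to a model that always predicts the mean of the target values, and negative values to predictions that are worse than this constant baseline. The following toy numbers (made up for this illustration) show the three cases:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect prediction.
r2_perfect = r2_score(y_true, y_true)

# Constant prediction of the mean value.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Prediction that is worse than the constant baseline:
# residual sum of squares = 20, total sum of squares = 5.
r2_bad = r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))

print(r2_perfect, r2_mean, r2_bad)  # 1.0 0.0 -3.0
```

A slightly negative score as obtained above therefore indicates that the model, trained with the reduced numerical parameters of this example, does not yet outperform a constant prediction on the test set.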

Applying the ML model on new unknown data

Once the model is trained and, where possible, evaluated on a test set, it can be applied to new data with unknown target values. This library implements the CellGridSearch class, which varies the lattice parameters in order to minimize the target property predicted by, e.g., an ML model.

Taking the trained kernel ridge regression model we can use it to optimize the lattice parameters of a random crystal structure:

[7]:
from aim2dat.strct import StructureImporter
from aim2dat.ml.cell_grid_search import CellGridSearch

strct_imp = StructureImporter()
strct_c_csp = strct_imp.generate_random_crystals("Cs2Te", max_structures=1)

grid_search = CellGridSearch(
    length_scaling_factors=[0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3],
    angle_scaling_factors=[0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3],
)
grid_search.set_initial_structure(strct_c_csp[0])
grid_search.set_model(pline)
print("Initial score:", grid_search.return_initial_score())
fit_info = grid_search.fit()
print("Final score:", fit_info[0], "Scaling factors:", fit_info[1])
Initial score: 0.27811197338368643
Space group of initial crystal:  123 (tetragonal)
Final score: 0.25250442798287315 Scaling factors: [0.7, 0.7, 1.2, 1.0, 1.0, 1.0]
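The principle of such a grid search can be sketched in a few lines. The example below is not the implementation of CellGridSearch: it only scales the three cell lengths (no angles, no symmetry handling) and uses a hypothetical function toy_score as a stand-in for the prediction of a trained model:

```python
import itertools

import numpy as np


def toy_score(lengths):
    """Hypothetical stand-in for a model prediction (lower is better)."""
    optimum = np.array([4.0, 4.0, 7.2])
    return float(np.sum((np.asarray(lengths) - optimum) ** 2))


initial_lengths = np.array([5.0, 5.0, 6.0])
scaling_factors = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3]

# Start from the unscaled cell and keep the best combination found.
best_factors = (1.0, 1.0, 1.0)
best_score = toy_score(initial_lengths)

# Exhaustively evaluate all combinations of scaling factors.
for factors in itertools.product(scaling_factors, repeat=3):
    score = toy_score(initial_lengths * np.array(factors))
    if score < best_score:
        best_score, best_factors = score, factors

print("Initial score:", toy_score(initial_lengths))
print("Final score:", best_score, "Scaling factors:", best_factors)
```

The number of evaluations grows as the number of scaling factors to the power of the number of scaled parameters, which is why a coarse grid is used in the example above.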