9 changes: 9 additions & 0 deletions .gitignore
@@ -1 +1,10 @@
.idea*

**/.DS_Store
*.pyc

**/data/*
**/models/*

.vscode
api_key.txt
31 changes: 31 additions & 0 deletions alec-glisman/ML-Band-Gaps.md
@@ -0,0 +1,31 @@
# ML Band Gaps (Materials)

> Ideal candidate: skilled ML data scientist with solid knowledge of materials science.

# Overview

The aim of this task is to create a Python package that automatically predicts electronic band gaps for a set of materials based on training data.

# User story

As a user of this software, I can predict the value of an electronic band gap after passing training data and structural information about the target material.

# Requirements

- suggest the bandgap values for a set of materials designated by their crystallographic and stoichiometric properties
- the code shall be written in a way that can facilitate easy addition of other characteristics extracted from simulations (forces, pressures, phonon frequencies etc.)

# Expectations

- the code shall be able to suggest realistic values for slightly modified geometry sets - e.g. trained on Si and Ge it should suggest the value of bandgap for Si49Ge51 to be between those of Si and Ge
- modular and object-oriented implementation
- commit early and often - at least once per 24 hours

# Timeline

We leave exact timing to the candidate, but the work must fit within 5 days in total.

# Notes

- use a designated GitHub repository for version control
- suggested source of training data: materialsproject.org
66 changes: 66 additions & 0 deletions alec-glisman/README.md
@@ -0,0 +1,66 @@
# ReWoTes: ML Property Predict

Alec Glisman

## Overview

This directory contains files for the ML Property Predict project for Mat3ra.com.

Input data is accessed from the Materials Project and cleaned into pandas DataFrames inside `src/data_load.py`.
I chose to download all materials with a bandgap of less than 10 eV from the Materials Project and parsed all data related to the crystallographic and stoichiometric properties.
Categorical data is converted to numeric data using one-hot encoding and the data is then scaled using `sklearn.preprocessing.StandardScaler`.
The input to the machine-learning models can be augmented with additional Materials Project data through the `MaterialData` init method, and external data can be merged in with its `add_data_columns` method.
The cleaned data is archived using Pandas in conjunction with HDF5 to lower runtime costs for model development.
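
As a rough illustration of the encoding and scaling steps described above, the sketch below one-hot encodes a categorical column and standardizes the result. The column names and values are made up for the example and are not taken from the project's data-loading code.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; column names and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "crystal_system": ["cubic", "hexagonal", "cubic", "tetragonal"],
        "density": [2.33, 5.32, 3.51, 4.10],
    }
)

# One-hot encode the categorical column (one indicator column per category).
encoded = pd.get_dummies(df, columns=["crystal_system"], dtype=float)

# Scale every feature to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)
print(scaled.round(3))
```

In practice the scaler would usually be fit on the training split only and then applied to the test split, so that test-set statistics do not leak into training.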

I chose to pursue two machine-learning architectures: XGBoost and feed-forward, fully connected, neural networks.
XGBoost generally performs better than neural networks when the data set is not large, and XGBoost is also much faster to train.
Neural networks were included for their superior expressivity and serve as a useful comparison to XGBoost.
In both cases, I employed `KFold` and `RandomizedSearchCV` from `scikit-learn` to cross-validate and select hyperparameters, respectively.
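
The sketch below shows one way `KFold` and `RandomizedSearchCV` can be combined. It is not the project's training code: scikit-learn's `GradientBoostingRegressor` stands in for the XGBoost regressor, and the synthetic data and parameter grid are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data (assumption; stands in for the material features).
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Candidate hyperparameters to sample from (illustrative values).
param_distributions = {
    "n_estimators": [20, 40, 60],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}

# KFold defines the cross-validation splits; RandomizedSearchCV samples
# n_iter parameter combinations and scores each one on those splits.
cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=cv,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

Because the scorer is the negated MSE, higher (closer to zero) scores are better, and `search.best_estimator_` holds the model refit on all of the data passed to `fit`.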

The best XGBoost regressor that I trained is saved during runtime under the `models` directory and has a test-set MSE of 0.646 eV².
Similarly, the best fully connected neural network I trained is saved during runtime under the `models` directory and has a test-set MSE of 0.817 eV².
The seed used is provided in `main.py` for reproducibility.
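
A mean squared error carries the square of the target's units, so it can help to report its root alongside it. The sketch below uses made-up band-gap values, not results from this project, to show the relationship.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Mean of squared residuals; units are the square of the target's units."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up band gaps in eV (illustrative only).
y_true = [1.1, 0.7, 3.4]
y_pred = [1.0, 1.0, 3.0]

mse = mean_squared_error(y_true, y_pred)  # in eV^2
rmse = math.sqrt(mse)                     # in eV
print(f"MSE = {mse:.4f} eV^2, RMSE = {rmse:.4f} eV")
```

By the same arithmetic, if the figures quoted above are true mean squared errors, an MSE of 0.646 corresponds to an RMSE of about 0.80 eV.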

Areas for future work include:

1. Stratified sampling for test/train split or cross-validation to make sure different space groups are represented properly in each subset.
2. Explore the feed-forward neural networks further by experimenting with architecture, dropout, and regularization to optimize performance. Additionally, train for more than 40 epochs. I stopped at 40 due to computational constraints, but the loss was still noticeably decreasing.
3. Addition of more data from the Materials Project to improve how well the models generalize.
4. Attempt transfer-learning of these models and fine-tune to more specific databases, such as silicon semiconductors.
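
For the first item, a stratified split could look roughly like the sketch below, which uses the `stratify` argument of scikit-learn's `train_test_split`. The space-group labels and counts are hypothetical and only illustrate the mechanics.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical space-group label per material (assumption for illustration).
space_groups = [225] * 10 + [194] * 6 + [62] * 4
X = [[i] for i in range(len(space_groups))]  # placeholder features

# stratify=... keeps each label's share equal in the train and test subsets.
X_train, X_test, sg_train, sg_test = train_test_split(
    X,
    space_groups,
    test_size=0.5,
    stratify=space_groups,
    random_state=42,
)
print(Counter(sg_train), Counter(sg_test))
```

Stratification needs at least two samples per label, so rarely occurring space groups would have to be merged into a coarser category (for example, crystal system) before splitting.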

## Usage

A Conda environment file has been provided (`requirements.yml`) to set up a Python environment called `ml-band-gaps` with the following command

```bash
$ conda env create -f requirements.yml
```

The overall project can then be run with

```bash
$ python main.py
```

Unit tests can be run with pytest as

```bash
$ pytest tests
```

Data ingested is cached to the `data` directory, and machine-learning models are cached to the `models` directory.
Each of these directories is created automatically as part of the main script.

Note that the data is sourced from the Materials Project, which requires an API key to access it.
I have added `api_key.txt` to the `.gitignore` for security reasons, so users will need to generate their own key and save it in an `api_key.txt` file in the same directory as `main.py`.
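
A minimal sketch of loading the key is shown below. Reading `api_key.txt` mirrors what `main.py` does; the fallback to an environment variable, the variable name `MP_API_KEY`, and the helper name `load_api_key` are assumptions added for illustration.

```python
import os
from pathlib import Path


def load_api_key(directory: Path, env_var: str = "MP_API_KEY") -> str:
    """Return the API key from api_key.txt, else from an environment variable.

    The environment-variable fallback is a hypothetical convenience and is
    not part of the original script.
    """
    key_file = directory / "api_key.txt"
    if key_file.is_file():
        key = key_file.read_text(encoding="utf-8").strip()
        if key:
            return key

    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    raise RuntimeError(
        f"No API key found in {key_file} or in the {env_var} environment variable."
    )
```

A caller in the project directory could then use `load_api_key(Path(__file__).resolve().parent)`, which fails with a clear message instead of a bare file-not-found error when the key is missing.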

## Requirements

- suggest the bandgap values for a set of materials designated by their crystallographic and stoichiometric properties
- the code shall be written in a way that can facilitate easy addition of other characteristics extracted from simulations (forces, pressures, phonon frequencies etc.)
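
One possible way to support the second requirement is to join extra per-material columns onto the feature table by an identifier. The helper below is a hypothetical sketch using `pandas.DataFrame.merge`; its name, the `material_id` key, and the example columns are assumptions, and it does not reproduce the signature of `add_data_columns`.

```python
import pandas as pd


def merge_extra_features(
    features: pd.DataFrame, extra: pd.DataFrame, key: str = "material_id"
) -> pd.DataFrame:
    """Left-join extra per-material columns onto the feature table."""
    overlap = (set(features.columns) & set(extra.columns)) - {key}
    if overlap:
        raise ValueError(f"columns already present: {sorted(overlap)}")
    # validate="one_to_one" raises if the key is duplicated in either frame.
    return features.merge(extra, on=key, how="left", validate="one_to_one")


# Placeholder identifiers and values (illustrative only).
features = pd.DataFrame({"material_id": ["id-a", "id-b"], "density": [2.3, 5.3]})
phonons = pd.DataFrame({"material_id": ["id-a"], "max_phonon_freq_thz": [15.5]})

merged = merge_extra_features(features, phonons)
print(merged)
```

Materials without a matching row keep a missing value in the new column, so a downstream imputation or filtering step would still be needed before training.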

## Expectations

- the code shall be able to suggest realistic values for slightly modified geometry sets - e.g. trained on Si and Ge it should suggest the value of bandgap for Si49Ge51 to be between those of Si and Ge
- modular and object-oriented implementation
- commit early and often - at least once per 24 hours
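
The first expectation can be turned into a simple sanity check: a prediction for a mixed composition should fall between the two end-member values. The sketch below pairs that check with a linear-interpolation baseline. The band-gap numbers are round placeholders rather than Materials Project values, and both function names are hypothetical.

```python
def linear_mix(gap_a: float, gap_b: float, fraction_b: float) -> float:
    """Composition-weighted average of two end-member band gaps."""
    if not 0.0 <= fraction_b <= 1.0:
        raise ValueError("fraction_b must lie in [0, 1]")
    return (1.0 - fraction_b) * gap_a + fraction_b * gap_b


def is_between(value: float, bound_1: float, bound_2: float) -> bool:
    """True if value lies within the closed interval spanned by the bounds."""
    low, high = sorted((bound_1, bound_2))
    return low <= value <= high


# Round placeholder band gaps in eV (assumption, for illustration only).
gap_si, gap_ge = 1.1, 0.7

# Si49Ge51: 51% of the atoms are Ge.
baseline = linear_mix(gap_si, gap_ge, fraction_b=0.51)
print(f"baseline = {baseline:.3f} eV, in range: {is_between(baseline, gap_si, gap_ge)}")
```

A trained model's prediction could be passed to `is_between` in a unit test. Real alloys can deviate from the linear baseline, so the check only asserts the range and not agreement with the interpolated value.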
61 changes: 61 additions & 0 deletions alec-glisman/main.py
@@ -0,0 +1,61 @@
"""Main script that trains XGBoost and neural-network band gap models.

The `main()` function reads the API key, loads and splits the data, and then
trains and evaluates models with the `XGBoostModels` and `NeuralNetModels`
classes. It does not return any value.

Example:
    python main.py

Note: Before running the script, make sure to provide the API key in a file
named "api_key.txt" located in the same directory as this script.
"""

from pathlib import Path

from src.data_load import MaterialData
from src.models import XGBoostModels, NeuralNetModels


def main() -> None:
    """
    Main function that executes the script.

    This function performs the following steps:
    1. Reads the API key from a file.
    2. Loads data using the MaterialData class.
    3. Splits the data into training and testing sets.
    4. Trains models using the XGBoostModels class.
    5. Trains models using a Neural Network.
    6. Prints a completion message.

    Returns:
        None
    """
    file_path = Path(__file__).resolve().parent
    seed = 42

    # API key is not included in the code for security reasons
    with open(file_path / "api_key.txt", "r", encoding="utf-8") as f:
        api_key = f.read().strip()

    # Load data
    data = MaterialData(api_key, band_gap=(0.0, 10.0))
    x_train, x_test, y_train, y_test, _, _ = data.split_data(seed=seed)

    # Train models
    xgb = XGBoostModels(x_train, y_train, x_test, y_test, save=True)
    xgb.train_models(seed=seed)
    xgb.evaluate_model()
    nn = NeuralNetModels(x_train, y_train, x_test, y_test, save=True)
    nn.train_models(seed=seed)
    nn.evaluate_model()

    # Notify user that the script has finished
    print("Script completed successfully.")


if __name__ == "__main__":
    main()
35 changes: 35 additions & 0 deletions alec-glisman/requirements.yml
@@ -0,0 +1,35 @@
name: ml-band-gaps
channels:
  - conda-forge
dependencies:
  - pip
  - tqdm
  - joblib
  - numpy
  - pandas
  - pytables
  - scipy
  - scikit-learn
  - xgboost
  - pytorch
  - torchvision
  - skorch
  - matplotlib
  - pymatgen
  - phonopy
  - ipykernel
  - ipywidgets
  - ipympl
  - pandoc
  - notebook
  - jupyter_client
  - pytest
  - pytest-cov
  - pytest-xdist
  - coverage
  - autopep8
  - black
  - flake8
  - pip:
      - "--editable=git+https://github.com/materialsproject/api.git@main#egg=mp-api"
