Dataset download

The dataset can be downloaded here (available at hackathon startup):
https://docs.google.com/presentation/d/1ohGeGRV7fG8tISoTiU97VWO4n8-ha1Bl28THnV4tNPE/edit?usp=sharing

 

In order to access the computation resources provided by IBM Watson AI, 

 

Dataset description

  • 3 datasets are provided in the csv (comma-separated) format
    • train.csv   (9415 rows)
    • test_1.csv  (750 rows)
    • test_2.csv  (478 rows)
  • Each row of a dataset corresponds to a molecule
  • Each csv file contains the following columns
    • smiles : Chemical formula of the molecule in the SMILES format.
    • 199 molecular features computed with the rdkit package (from column BalabanJ to qed).
    • ecfc_0000 to ecfc_2047 (2048 features) : bit vector representation of Morgan fingerprints
    • fcfc_0000 to fcfc_2047 (2048 features) : bit vector representation of pharmacophore feature-based Morgan fingerprints
    • class (train.csv only) :  The label to predict (1 for hERG inhibitor, 0 otherwise)
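
The column layout above can be checked with a small helper. This is a minimal sketch using only the Python standard library. The helper names (group_columns, read_header) are hypothetical, and it assumes the column names match the description exactly ("smiles", "class", and the "ecfc_" / "fcfc_" prefixes) and that every other column is an rdkit descriptor.

```python
import csv


def group_columns(header):
    """Split a list of column names into the groups described above."""
    groups = {"smiles": [], "rdkit": [], "ecfc": [], "fcfc": [], "class": []}
    for name in header:
        if name in ("smiles", "class"):
            groups[name].append(name)
        elif name.startswith("ecfc_"):
            groups["ecfc"].append(name)
        elif name.startswith("fcfc_"):
            groups["fcfc"].append(name)
        else:
            # Assumption: any remaining column is an rdkit descriptor.
            groups["rdkit"].append(name)
    return groups


def read_header(path):
    """Return the first row (the column names) of a csv file."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))
```

Calling group_columns(read_header("train.csv")) and printing the length of each group is a quick way to confirm that the file matches the description (199 rdkit descriptors, 2048 ecfc and 2048 fcfc columns). An extra index column, if present, would land in the "rdkit" group under the assumption above.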

The best predictors will likely not need the complete set of 4295 features (199 + 2048 + 2048) provided in the datasets.
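
One simple way to start reducing the feature set is to drop columns that never vary in the training data, since they cannot help separate the two classes. The sketch below is only an illustration of that idea, not a method recommended by the organizers. The function name constant_columns is hypothetical, and rows are assumed to be dictionaries such as those produced by csv.DictReader.

```python
def constant_columns(rows, feature_names):
    """Return the features that take a single value across all rows.

    Values are compared as read from the file, so "0" and "0.0" would
    count as different values; convert to numbers first if that matters.
    """
    seen = {name: set() for name in feature_names}
    for row in rows:
        for name in feature_names:
            seen[name].add(row[name])
    return [name for name in feature_names if len(seen[name]) <= 1]
```

Fingerprint bits that are 0 for every training molecule are typical candidates. More elaborate selection (correlation filters, model-based importance) can follow the same pattern of computing a per-column statistic and keeping a subset.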

 

Cheminformatics resources

  • rdkit is the most widely used package for processing molecules and computing molecular properties (e.g. molecular weight, charge, ...).
  • Molecular fingerprints are commonly used features in the literature on molecular property prediction. They result from applying a local kernel at multiple positions of the molecule and aggregating the outputs into a fixed-length vector.
  • Pat Walters' tutorial on cheminformatics presents a wide variety of ML baseline predictors that jointly use rdkit, scikit-learn and other ML packages.
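
To give some intuition for the fingerprint description above, here is a toy circular fingerprint written with the standard library only. It is a simplified sketch, not rdkit's Morgan algorithm and not the procedure used to build the ecfc/fcfc columns. The function name, the graph encoding (a list of element symbols plus a list of bonded index pairs) and the use of CRC32 as a hash are all assumptions made for illustration.

```python
import zlib


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=2048):
    """Toy fixed-length bit fingerprint of a molecular graph.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) pairs of atom indices that are bonded
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom is identified by its element symbol only.
    ids = [zlib.crc32(symbol.encode()) for symbol in atoms]
    bits = [0] * n_bits
    for ident in ids:
        bits[ident % n_bits] = 1

    # Each iteration widens the local environment by one bond.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            # Local "kernel": combine the atom with its sorted neighbor ids.
            env = (ids[i], tuple(sorted(ids[j] for j in neighbors[i])))
            new_ids.append(zlib.crc32(repr(env).encode()))
        ids = new_ids
        # Aggregate: fold every environment id into the fixed-length vector.
        for ident in ids:
            bits[ident % n_bits] = 1
    return bits
```

For example, toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)]) returns a list of 2048 bits for an ethanol-like heavy-atom graph. Because neighbor identifiers are sorted before hashing, the result does not depend on the order in which atoms are listed. For real work, use the fingerprint functions provided by rdkit.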

 

Statistics / ML resources

You may use essentially any model or library you can find.

If you use any external model, you need to mention it in your submission file.

Hackathon presentation (Margo + Qubit Pharmaceutical):
https://docs.google.com/presentation/d/1ohGeGRV7fG8tISoTiU97VWO4n8-ha1Bl28THnV4tNPE/edit?usp=sharing