Predict MeSH Classes

The Medical Subject Headings (MeSH) database, maintained by the National Library of Medicine (NLM), provides a chemical classification derived from the published literature. MeSH was started about 60 years ago, and NLM curates and updates it regularly. About 130,000 compounds have been linked to MeSH terms, creating a rich chemical ontology database. We adopted the idea that chemically similar compounds are likely to belong to the same MeSH category. We therefore use the Tanimoto chemical similarity coefficient to find the most similar compounds that already have a MeSH annotation, and then assign their MeSH term to the new chemical structure.
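The similarity-based assignment described above can be illustrated with a minimal base R sketch. It is not the ChemRICH implementation: the fingerprints are toy logical vectors, the compound names and class labels are placeholders, and the names `tanimoto`, `annotated`, and `annotated_mesh` are hypothetical.

```r
# Tanimoto coefficient between two binary fingerprints (logical vectors):
# bits set in both, divided by bits set in at least one.
tanimoto <- function(a, b) {
  union_bits <- sum(a | b)
  if (union_bits == 0) return(0)  # convention chosen here for two empty fingerprints
  sum(a & b) / union_bits
}

# Toy fingerprints for compounds that already carry a MeSH class (illustrative only).
annotated <- list(
  cmpdA = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  cmpdB = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)
annotated_mesh <- c(cmpdA = "Class X", cmpdB = "Class Y")  # placeholder labels

# Fingerprint of the new, unannotated structure.
query <- c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Score the query against every annotated compound and take the best match.
scores <- sapply(annotated, tanimoto, b = query)
nearest <- names(which.max(scores))
predicted_class <- annotated_mesh[[nearest]]
```

Here the query shares two of three on-bits with `cmpdA` and none with `cmpdB`, so it inherits the placeholder label of `cmpdA`. A real workflow would compute fingerprints from the SMILES strings with a cheminformatics toolkit and would likely apply a minimum similarity cutoff before accepting a match.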

MeSH is organized as an ontology tree, whereas ChemRICH needs non-overlapping set definitions. We therefore traverse the tree from the most specialized nodes upward toward more generalized nodes. As soon as we reach a node whose term covers at least three compounds, we stop there and assign that MeSH class to the input structure.
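The upward traversal can be sketched in base R as follows. This is an illustration under stated assumptions, not the ChemRICH code: nodes are written as dot-separated MeSH-style tree numbers, the per-node compound counts are invented for the example, and the helper names `parent_node` and `assign_class` are hypothetical.

```r
# Parent of a MeSH-style tree number: drop the last dot-separated segment.
parent_node <- function(node) {
  if (!grepl(".", node, fixed = TRUE)) return(NA_character_)  # already at the top level
  sub("\\.[^.]*$", "", node)
}

# Walk upward from the most specific node until a node covers enough compounds.
assign_class <- function(node, counts, min_size = 3) {
  while (!is.na(node)) {
    n <- counts[node]  # NA when the node has no recorded count
    if (!is.na(n) && n >= min_size) return(node)
    node <- parent_node(node)
  }
  NA_character_  # no ancestor reached the threshold
}

# Invented compound counts per node (illustrative values only).
counts <- c("D02" = 40, "D02.033" = 5, "D02.033.100" = 1)

assign_class("D02.033.100", counts)  # "D02.033"
```

The most specific node holds only one compound, so the walk moves up one level and stops at the first node with at least three compounds. Stopping at the first sufficiently populated ancestor keeps each class as specific as possible.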

You can find the example file here. It has only three columns: 1) compound_name, 2) pubchem_id, and 3) smiles. Use the PubChem Identifier Exchange service to obtain the SMILES codes and/or PubChem IDs for your compound list.
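Before saving your own compound list as an Excel file, it can help to confirm that the table has the three expected columns. The base R sketch below shows one way to do this; the two-row table and the names `required_cols`, `input_table`, `missing_cols`, and `empty_smiles` are illustrative assumptions, not part of ChemRICH.

```r
# Column names expected in the input file, as listed above.
required_cols <- c("compound_name", "pubchem_id", "smiles")

# Small illustrative table in the expected layout.
input_table <- data.frame(
  compound_name = c("ethanol", "benzene"),
  pubchem_id = c(702, 241),
  smiles = c("CCO", "c1ccccc1"),
  stringsAsFactors = FALSE
)

# Report any required column that is absent.
missing_cols <- setdiff(required_cols, colnames(input_table))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Flag rows with an empty or missing SMILES string.
empty_smiles <- which(is.na(input_table$smiles) | input_table$smiles == "")
```

If the check passes, the table can be written to an .xlsx file with whichever Excel-writing package you prefer and passed to the prediction function.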

Instructions (local):

Create a new folder named "MeSh prediction" somewhere on your computer. Download the example file and copy it into this folder. In RStudio, set that folder as the working directory. Then run the code below to predict chemical classes for the input compound list.



predict_mesh_classes(inputfile = "chemrich_input_mesh_prediction.xlsx")

When the run completes, the code generates the file "MeSh_Prediction_Results.xlsx", which contains the MeSH Class column.

Warning: Because these MeSH classes are predicted, there will always be some errors. Review the predicted classes carefully with the help of a biochemistry expert to catch them.

To check whether the input file contains incorrect SMILES codes, you can use the function "checkSmiles()". It reports the line numbers that have a problematic SMILES, meaning one that rCDK cannot parse.

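checkSmiles <- function(input = "inputfile.xlsx") {
  # Read the input table; it is expected to have a "smiles" column.
  ndf <- readxl::read_xlsx(input)
  # Parse each SMILES on its own.
  # Assumption: rcdk::parse.smiles() is the parser, since the text says rCDK does the parsing.
  fps <- lapply(seq_len(nrow(ndf)), function(x) {
    rcdk::parse.smiles(ndf$smiles[x])
  })
  # A failed parse leaves a NULL entry, which nchar() coerces to the string "NULL" (4 characters).
  charvec <- sapply(fps, nchar)
  paste0("Lines with incorrect SMILES codes are: ", paste(as.integer(which(charvec == 4)), collapse = ","))
}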