Featured publication
We worked with Kasia Arturi and Juliane Hollender from Eawag, as well as Lili Gasser and others from the SDSC, on MLinvitroTox, an automated Python pipeline for high-throughput, hazard-driven prioritization of toxicologically relevant signals detected in complex environmental samples by high-resolution tandem mass spectrometry (HR-MS/MS).
HR-MS/MS is an "untargeted" environmental analysis technique: a sample (for example, water or soil) is measured, and every compound present produces a signal. Unravelling this mass of data is challenging, because some compounds can be identified but most remain unknown. The usual procedure is to first identify the structure of each compound and then judge its toxicity from that structure and identity. However, determining structures from HR-MS/MS data is difficult and succeeds for only a small proportion of compounds.
With MLinvitroTox, we determine a "fingerprint" for each compound from its HR-MS/MS data using the well-established SIRIUS tool. Each digit of the fingerprint relates to a small structural feature within the molecule. We relate this fingerprint directly to potential toxicity using machine learning models, without first identifying the compound, and so identify the chemicals and features most likely to cause adverse effects. These unknown compounds can then be prioritized for further analysis.
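The idea of going from fingerprint to prioritized list can be illustrated with a minimal Python sketch. It is not the MLinvitroTox implementation: the toy logistic models stand in for trained classifiers, the fingerprints are only four digits long, and ranking by the fraction of endpoints predicted active is an assumed aggregation rule chosen for illustration. All names (`make_toy_model`, `prioritise`, `endpoint_A`, `signal_1`, and so on) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Fingerprint = Sequence[int]             # one 0/1 digit per structural feature
Model = Callable[[Fingerprint], float]  # returns probability of activity


def make_toy_model(weights: Sequence[float], bias: float) -> Model:
    """Hypothetical stand-in for a trained classifier (logistic score)."""
    def predict(fp: Fingerprint) -> float:
        z = bias + sum(w * bit for w, bit in zip(weights, fp))
        return 1.0 / (1.0 + math.exp(-z))
    return predict


def prioritise(
    fingerprints: Dict[str, Fingerprint],
    models: Dict[str, Model],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank signals by the fraction of endpoints predicted active.

    The aggregation rule is an assumption made for this sketch.
    """
    ranked = []
    for signal_id, fp in fingerprints.items():
        probs = [model(fp) for model in models.values()]
        active_fraction = sum(p >= threshold for p in probs) / len(probs)
        ranked.append((signal_id, active_fraction))
    # Highest predicted hazard first; ties keep their input order.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# One toy model per (hypothetical) assay endpoint.
models = {
    "endpoint_A": make_toy_model([2.0, -1.0, 0.0, 1.5], bias=-1.0),
    "endpoint_B": make_toy_model([1.0, 3.0, -2.0, 0.5], bias=-1.0),
}

# Fingerprints for three unidentified signals (made-up values).
fingerprints = {
    "signal_1": [1, 0, 0, 1],
    "signal_2": [0, 0, 1, 0],
    "signal_3": [0, 1, 0, 0],
}

ranking = prioritise(fingerprints, models)
for signal_id, score in ranking:
    print(f"{signal_id}: {score:.2f} of endpoints predicted active")
```

The point of the sketch is that no structure is ever assigned: each signal is scored from its fingerprint alone, and only the top-ranked signals would be passed on for identification.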
Overview of the MLinvitroTox pipeline: (I) Training XGBoost classifiers for 490 viable assay endpoints from invitroDBv4.1; (II) Validating model performance on the ‘Internal’ test set (20% of the input data) and on the MassBank sets (1.5k compounds that are present in invitroDBv4.1 and also have HR-MS/MS spectra in MassBank for which molecular fingerprints could be predicted), using both structural (‘MB structure’) and spectral (‘MB spectra’) data via SIRIUS; (III) Applying the models to environmental HR-MS/MS data.