Building an AI Drug Discovery pipeline so HairDAO can cure hair loss
HairDAO is a decentralized research & development group devising novel treatments for hair loss in a transparent way, and assembling the requisite research community.
Recently, HairDAO raised funds to start investigating promising targets for hair loss drugs. HairDAO had identified a promising signaling pathway that contributed to this condition, and one molecule that interacted with it. But this interaction alone was not enough to turn the molecule into a new medicine, so HairDAO came up with a list of roughly 48,000 similar compounds that might also interact with the protein of interest (possibly even doing a better job while being safer). Since screening all ~48,000 compounds would be expensive, HairDAO wanted to reduce the search space to the five most promising candidates and make sure this new DAO was off to the best possible start.
The Talent
HairDAO needed an ML engineer who was also well versed in the best practices of building drug discovery pipelines.
The consultant had a long background in machine learning and had worked with many AI drug discovery pipelines. The consultant had just published a book on trustworthy AI practices, covering areas such as interpretability, robustness, and out-of-distribution errors. The consultant had also previously published research on stealing neural network weights using just noise (presented at ICML).
Project Challenges
Building the AI drug discovery pipeline posed a few unique challenges:
Deep Expertise Requirement: There are plenty of ML engineers with no biology background that mistakenly think “DNA is code. I do code. Therefore I can do DNA”, before discovering the hard way why there are so many unsolved problems in biology. A drug library investigation of this sort also requires deep knowledge of molecular biology to go with the ML expertise.
Similarity of ligands in question: The library in question was made of roughly 48,000 derivatives of one molecule. As such, the techniques for predicting drug-like properties and/or toxicity would need to be sensitive to sometimes subtle differences between very similar molecules.
Missing data about the protein of interest: The literature reviews and initial investigations done by HairDAO had revealed a very promising target protein for an inhibitory molecule. There was just one issue: only 6% of the protein’s tertiary structure had been verified by NMR. As for the rest, AlphaFold2’s default settings produced a predicted structure with large swaths marked as “low confidence” or “very low confidence”. Any investigation into this protein would need a LOT of planning around epistemic and aleatoric uncertainty.
MLOps and Data Engineering: Conceptually, setting up the computational drug discovery pipeline could be as simple as passing around CSV files, though this results in enormous time costs very quickly. With so many ligands to investigate, setting up an automated processing pipeline that’s easy to upload data to and track data through is paramount.
Computational overhead: Molecular docking and molecular simulation are not simple tasks to run on any old laptop. The simulations required more specialized hardware. This presented the added challenge of making sure there was enough cloud compute to build the pipeline on time while also keeping costs within budget.
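To make the AlphaFold2 confidence problem above concrete, here is a minimal sketch of how one might quantify how much of a predicted structure falls into each confidence band. AlphaFold writes its per-residue pLDDT score into the B-factor column of the PDB file, so the standard library is enough to read it. The band thresholds follow the commonly published pLDDT bands (90, 70, 50); the exact boundary handling, the function names, and the synthetic `make_ca_line` helper are assumptions for illustration, not part of the pipeline described here.

```python
from collections import Counter

BANDS = ("very high", "confident", "low", "very low")


def plddt_band(score):
    """Map a pLDDT score (0-100) to a confidence band (boundaries assumed)."""
    if score >= 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"


def ca_plddt(pdb_text):
    """Return {(chain, residue_number): pLDDT} read from C-alpha ATOM records."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns (1-indexed): atom name 13-16, chain 22,
        # residue number 23-26, B-factor 61-66 (pLDDT in AlphaFold output).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[(line[21], int(line[22:26]))] = float(line[60:66])
    return scores


def band_fractions(pdb_text):
    """Fraction of residues in each confidence band."""
    scores = ca_plddt(pdb_text)
    if not scores:
        raise ValueError("no C-alpha ATOM records found")
    counts = Counter(plddt_band(s) for s in scores.values())
    return {band: counts[band] / len(scores) for band in BANDS}


def make_ca_line(resseq, plddt, chain="A"):
    """Build a synthetic C-alpha ATOM record (illustration only)."""
    return (
        f"ATOM  {resseq:5d}  CA  ALA {chain}{resseq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


if __name__ == "__main__":
    demo = "\n".join(
        make_ca_line(i + 1, s) for i, s in enumerate([95.0, 80.0, 60.0, 40.0])
    )
    print(band_fractions(demo))
```

A summary like this only measures how much of the model is unreliable; deciding what to do with low-confidence regions (for example, restricting docking to better-supported regions) is a separate modeling choice.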
Technical Approach
Building and running an AI-powered drug discovery platform involves complex components. Fortunately, the consultant had expertise in all of these pipeline components. These components included, but were not limited to:
Data Collection and Analysis: HairDAO had already collected the initial data for analysis, but there was much more to generate and analyze downstream. This sort of work relies heavily on custom Python scripts, which can make heavy use of Python libraries such as RDKit, TorchDrug, DeepChem, PDB-tools, and other similar packages.
Molecular Docking: In molecular simulation, docking refers to predicting the preferred orientation of one molecule relative to a second when a ligand and a target are bound to each other to form a stable complex. A variety of tools exist for this task, such as AutoDock Vina, DiffDock, SMINA docking benchmark, and EquiBind.
Machine Learning Frameworks: Bioinformaticians have been making ML drug discovery tools in a litany of different ML frameworks, ranging from Scikit-learn, to TensorFlow, to PyTorch, to JAX. The choice of framework ultimately matters less than the choice of model.
Machine Learning Models: Bioinformaticians have created a lot of ML models for tasks such as molecular property prediction and even binding prediction. When the weights are available, these are incredibly handy as starting points. This comes with a cost, namely that many of the state-of-the-art architectures for some biomedical tasks are often 4-5 years behind the techniques used in the rest of the machine learning field. However easy it is to find a graph network trained on some QSAR datasets, the next big challenge is making sure such models have taken into account recent advances like attention mechanisms and diffusion models.
MLOps/Data Engineering: There are plenty of tools for managing bioinformatics workflows, such as Insitro’s Redun, Flyte, Snakemake, and Nextflow. It’s great to have a pipeline set up that’s more carefully assembled than just passing around CSV files by flash drive. That said, it’s also important not to fall into premature optimization traps here (I’ve seen too many drug discovery companies start using Kubernetes far too early in their development).
Research & Literature Review: Choices of algorithms and compute resources would all depend heavily on assumptions about the chemistry of the ligands, the protein pathways, and the environments in which they would interact. As such, this relied heavily on both online research and physical books on drug discovery (a few of which disappointingly aren’t available online anymore, meaning extreme caution with our remaining physical copies).
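As a small illustration of the middle ground between CSV files on a flash drive and a full workflow engine, the sketch below caches each pipeline stage's output under a hash of the stage name and its inputs, so rerunning a stage with unchanged inputs reads the stored result instead of recomputing it. This is only the general idea, written with the standard library; it is not how Redun, Flyte, Snakemake, or Nextflow are implemented, and `run_stage` and `fake_stage` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def run_stage(name, func, inputs, cache_dir):
    """Run func(inputs) once per unique (stage name, inputs) pair.

    Inputs and results must be JSON-serializable. The result is stored in a
    file named after a hash of the stage name and inputs, so the file name
    doubles as a record of exactly what was computed.
    """
    payload = json.dumps({"stage": name, "inputs": inputs}, sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    path = Path(cache_dir) / f"{name}-{key}.json"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(func(inputs)))
    # Always read back from disk so cached and fresh runs return the same type.
    return json.loads(path.read_text())


if __name__ == "__main__":
    def fake_stage(smiles_list):
        # Placeholder for an expensive step such as docking (illustration only).
        return {s: len(s) for s in smiles_list}

    with tempfile.TemporaryDirectory() as cache:
        first = run_stage("length", fake_stage, ["CCO", "CCN"], cache)
        second = run_stage("length", fake_stage, ["CCO", "CCN"], cache)  # cache hit
        print(first == second)
```

A scheme like this stops being adequate once stages need to run in parallel across machines, which is the point at which a dedicated workflow tool starts to pay for itself.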
Outcomes
For the given target, we were able to narrow a selection of 48,000 closely related candidates down to just five.
What’s more, compared to the original compound that inspired the research into this promising pathway, the five selected compounds had both better binding affinity and fewer predicted toxic side effects.
If you want to follow HairDAO’s work, you can learn more on HairDAO’s website, LinkedIn, Discord, Twitter/X, YouTube, and TikTok.
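The selection criterion described above (beat the original compound on both predicted binding and predicted toxicity, then keep the best few) can be sketched as a simple filter and sort. This is a hedged illustration rather than the actual scoring used in the project: the `Candidate` fields, the `shortlist` function, and all numbers are hypothetical, and it assumes a docking-style score in kcal/mol where more negative means stronger predicted binding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    binding_score: float  # assumed: predicted kcal/mol, more negative = stronger
    toxicity_flags: int   # assumed: count of predicted toxicity alerts


def shortlist(candidates, reference, top_n=5):
    """Keep candidates that beat the reference on both criteria, best first."""
    better = [
        c for c in candidates
        if c.binding_score < reference.binding_score
        and c.toxicity_flags < reference.toxicity_flags
    ]
    # Rank by binding score first, breaking ties on fewer toxicity flags.
    return sorted(better, key=lambda c: (c.binding_score, c.toxicity_flags))[:top_n]


if __name__ == "__main__":
    # Synthetic values for illustration only.
    reference = Candidate("original", -7.0, 3)
    pool = [
        Candidate("a", -9.1, 1),
        Candidate("b", -6.5, 0),
        Candidate("c", -8.0, 3),
        Candidate("d", -7.5, 2),
        Candidate("e", -9.1, 0),
    ]
    print([c.name for c in shortlist(pool, reference, top_n=2)])
```

Requiring improvement on both axes is the strictest reasonable reading of the outcome; a weighted score would trade the two criteria off against each other instead.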
“Matt was very clear in his processes, and produced several novel insights for us.”
- Andrew Bakst, Co-Founder of HairDAO