Most drug candidates fail. This is not for lack of skill on the part of researchers, but because chemistry is extraordinarily complex and the technologies we have largely depended on have inherent limitations. Data-driven AI has not changed this to the extent the media attention suggested it would. Physics-based AI might.
The difference between the two approaches matters. A statistical model trained on historical assay data can recognize patterns in molecules it has seen before, or in ones quite like them. However, developing a drug always involves exploring new chemical space, and the commonly cited estimate for the number of small molecules that are possible in principle is on the order of 10^60. No historical data set comes anywhere close to that, so when a statistical model meets a genuinely new structure, it is in unfamiliar territory and is effectively guessing.
Why Data-Driven Models Hit A Wall
The central issue is that machine-learning models in drug discovery most commonly fail when they are asked to make predictions for molecules that are “out of distribution,” meaning unlike anything in their training data. This is not caused by poor training data for a specific task; rather, the model has no notion of why a molecule behaves the way it does – a shortcoming that purely data-driven generative models share.
Furthermore, as we push the model to propose molecules progressively further from its training data (such as compounds with atypical scaffolds or binding modes, or compounds targeting ultra-rare diseases), its predictions of biological activity tend to become less accurate – not because the training data were faulty, but because a purely statistical learning method has no built-in way to generalize in a physically meaningful way to whatever molecule we might throw at it.
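One common way to make "out of distribution" concrete is to measure how similar a query molecule is to its nearest neighbor in the training set. The sketch below is a minimal illustration, assuming molecules have already been converted to fingerprints represented as sets of "on" bit indices. The toy fingerprints, the function names, and the 0.4 threshold are all hypothetical choices for illustration, not values taken from any particular toolkit.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def max_similarity_to_training(query: frozenset, training: list) -> float:
    """Similarity of the query to its nearest neighbor in the training set."""
    return max(tanimoto(query, fp) for fp in training)


def is_out_of_distribution(query: frozenset, training: list,
                           threshold: float = 0.4) -> bool:
    """Flag a query whose nearest training neighbor is below the threshold.

    The threshold is an assumed, illustrative cutoff.
    """
    return max_similarity_to_training(query, training) < threshold


# Toy fingerprints (hypothetical bit indices, not real molecules).
training_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({2, 3, 4, 5}),
    frozenset({1, 3, 5, 7}),
]
familiar = frozenset({1, 2, 3, 5})   # shares most bits with the training set
novel = frozenset({3, 10, 11, 12})   # shares a single bit with any training item

for name, fp in [("familiar", familiar), ("novel", novel)]:
    sim = max_similarity_to_training(fp, training_fps)
    print(f"{name}: nearest-neighbor similarity {sim:.2f}, "
          f"out of distribution: {is_out_of_distribution(fp, training_fps)}")
```

A check like this does not make the model's prediction any better. It only signals when a prediction is being made far from the data the model learned from, and should be trusted less.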
Bridging Quantum Accuracy and Computational Speed
The reason physics-based modeling hasn’t always been the default approach is straightforward: quantum chemistry calculations are expensive. Density Functional Theory (DFT), one of the most widely used methods for modeling electronic structure, can take hours per molecule even on serious hardware. Running virtual screens across millions of candidates isn’t feasible at that pace.
Machine learning potentials address this. Trained on DFT-quality data, these models can evaluate the potential energy of an atomic configuration in milliseconds rather than hours. Because their structure is physics-informed, they tend to generalize better than purely empirical models while running fast enough for real screening workflows. Advanced computational platforms, such as those developed by sandboxaq.com, aim to bridge this gap by deploying physics-informed AI models that simulate molecular structures with near-quantum accuracy at enterprise scale.
That combination – quantum-level grounding, machine learning speed – is what makes the approach worth taking seriously.
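To put the speed gap in perspective, here is a back-of-the-envelope calculation. The figures are illustrative assumptions chosen to match the orders of magnitude mentioned above (one hour per molecule for DFT, ten milliseconds for a machine learning potential, one million candidates). They are not benchmarks of any specific software.

```python
def screening_days(n_molecules: int, seconds_per_molecule: float,
                   n_workers: int = 1) -> float:
    """Wall-clock days needed to evaluate every molecule once."""
    total_seconds = n_molecules * seconds_per_molecule / n_workers
    return total_seconds / 86_400  # seconds in a day


N_CANDIDATES = 1_000_000   # assumed library size
DFT_SECONDS = 3600.0       # assumed: one hour per molecule
MLP_SECONDS = 0.010        # assumed: ten milliseconds per molecule

dft_days = screening_days(N_CANDIDATES, DFT_SECONDS)
mlp_days = screening_days(N_CANDIDATES, MLP_SECONDS)

print(f"DFT on one worker: {dft_days:,.0f} days (~{dft_days / 365:.0f} years)")
print(f"MLP on one worker: {mlp_days * 24:.1f} hours")
print(f"Speed-up factor:   {dft_days / mlp_days:,.0f}x")
```

Under these assumptions, a screen that would take roughly a century of single-worker DFT time finishes in under three hours. Parallel hardware shrinks both numbers, but the ratio between them stays the same.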
Generating Synthetic Data For Rare Disease Research
An often overlooked challenge in drug discovery is a lack of data. For common disease targets, there may be decades of assay results to use as training data. For rare diseases, that’s not the case, and data-driven models are of little use.
Physics-based AI gets around this with active learning. The model identifies the regions of chemical space where its predictions are least confident, and queries the physics engine to run targeted simulations on exactly those structures. This generates synthetic training data directly from the underlying physics, as accurate as the simulation method itself, without requiring any proprietary experimental datasets. The model bootstraps its own knowledge base from first principles rather than from historical assay results.
This is especially important for rare diseases, where small patient populations mean the traditional economics of drug development rarely work out. Better in silico prediction could make viable many programs that otherwise wouldn’t be.
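The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method of any specific platform: `physics_engine` is a hypothetical stand-in for an expensive simulation (here a one-dimensional Morse-like energy curve), the surrogate is a small ensemble of polynomial fits, and disagreement within the ensemble is used as the confidence measure.

```python
import numpy as np


def physics_engine(x):
    """Hypothetical stand-in for an expensive simulation (Morse-like curve)."""
    return (1.0 - np.exp(-(x - 1.5))) ** 2


def fit_ensemble(x, y, degree=2):
    """Leave-one-out ensemble: each polynomial surrogate omits one labelled point."""
    return [np.polyfit(np.delete(x, i), np.delete(y, i), degree)
            for i in range(len(x))]


def predict_with_uncertainty(models, x):
    """Mean prediction and ensemble spread (used here as the uncertainty)."""
    preds = np.array([np.polyval(m, x) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)


# Small initial labelled set and a pool of unlabelled candidates (toy values).
x_train = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_train = physics_engine(x_train)
pool = np.linspace(0.5, 4.0, 50)
queried = []

for step in range(5):
    models = fit_ensemble(x_train, y_train)
    _, std = predict_with_uncertainty(models, pool)
    pick = int(np.argmax(std))            # least-confident candidate
    x_new = float(pool[pick])
    queried.append(x_new)
    # Label only that candidate with the expensive physics calculation.
    x_train = np.append(x_train, x_new)
    y_train = np.append(y_train, physics_engine(x_new))
    pool = np.delete(pool, pick)

print("queried points:", [round(q, 2) for q in queried])
```

The design point is that the expensive calculation runs only where the surrogate is least sure of itself, so each simulation adds as much information as possible. In a real workflow the candidates would be molecular structures and the uncertainty estimate would come from the model architecture in use.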
Reducing False Positives In Binding Prediction
Conventional virtual docking tools view binding as a static geometric puzzle. Does this molecule fit the pocket? Unfortunately, this approach generates a tsunami of false positives: compounds that look promising computationally but fail to perform as expected in the lab or the clinic.
But binding is dynamic. The practical utility of a drug candidate is determined by thermodynamic and kinetic parameters – not just whether a compound can bind, but how strongly it binds, how fast it associates and dissociates, and how stable the complex is over time. Physics-informed neural networks can estimate these properties more faithfully, because they approximate the energy landscape of the interaction rather than returning a crude yes/no based on shape.
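The standard relations between these quantities show why a single "does it fit" score is not enough. The sketch below uses the textbook definitions K_d = k_off / k_on, ΔG° = RT ln(K_d / 1 M), and residence time = 1 / k_off. The two compounds and their rate constants are hypothetical values chosen for illustration.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def dissociation_constant(k_on: float, k_off: float) -> float:
    """K_d in mol/L, from k_on in 1/(M*s) and k_off in 1/s."""
    return k_off / k_on


def binding_free_energy(k_d: float, temperature: float = 298.15) -> float:
    """Standard binding free energy in kJ/mol (1 M reference state)."""
    return R * temperature * math.log(k_d) / 1000.0


def residence_time(k_off: float) -> float:
    """Mean lifetime of the drug-target complex in seconds."""
    return 1.0 / k_off


# Two hypothetical compounds with the same affinity but different kinetics.
compounds = {
    "A (fast on, fast off)": {"k_on": 1e6, "k_off": 1e-2},
    "B (slow on, slow off)": {"k_on": 1e4, "k_off": 1e-4},
}

for name, k in compounds.items():
    k_d = dissociation_constant(k["k_on"], k["k_off"])
    print(f"{name}: K_d = {k_d:.1e} M, "
          f"dG = {binding_free_energy(k_d):.1f} kJ/mol, "
          f"residence time = {residence_time(k['k_off']):,.0f} s")
```

Both compounds have a K_d of 10 nM and therefore the same binding free energy, yet compound B stays on its target a hundred times longer. An affinity-only or shape-only score cannot tell them apart, which is the gap that kinetic and thermodynamic modeling is meant to close.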
De-Risking The Clinical Transition
More than 90% of new drug candidates fail in clinical trials, and the average cumulative development cost for a new drug is estimated to be $2.6 billion (Tufts Center for the Study of Drug Development). For a drug that ultimately fails, the vast majority of development costs are incurred in the later stages of the pipeline.
Physics-based AI changes where in the pipeline attrition happens. By optimizing ADMET properties – absorption, distribution, metabolism, excretion, toxicity – computationally before a molecule ever reaches in vitro testing, researchers catch problems earlier when they’re cheaper to solve. Structure-based drug design combined with physics-informed simulation means lead compounds are better characterized going into clinical phases.
The goal isn’t to replace experimental work. It’s to make sure the candidates advancing to those expensive stages have already been stress-tested against physical reality.
Drug discovery has always involved uncertainty. Physics-based AI doesn’t eliminate that – but it gives researchers better tools to distinguish promising uncertainty from predictable failure, which changes the math considerably.



