Technology & Innovation

AI for science: creating a virtuous circle of discovery and innovation

08.22.22 | 9 min read | Text by Tom Kalil

In this interview, Tom Kalil discusses the opportunities for science agencies and the research community to use AI/ML to accelerate the pace of scientific discovery and technological advancement.

Q.  Why do you think that science agencies and the research community should be paying more attention to the intersection between AI/ML and science?

Recently, researchers have used DeepMind’s AlphaFold to predict the structures of more than 200 million proteins from roughly 1 million species, covering almost every known protein on the planet! Although not all of these predictions will be accurate, this is a massive step forward for the field of protein structure prediction.

The question that science agencies and the research communities they support should be actively exploring is: what were the preconditions for this result, and are there steps we can take to create those circumstances in other fields?


One partial answer to that question is that the protein structure community benefited from a large open database (the Protein Data Bank) and what linguist Mark Liberman calls the “Common Task Method.”

Q.  What is the Common Task Method (CTM), and why is it so important for AI/ML?

In a CTM, competitors share the common task of training a model on a challenging, standardized dataset, with the goal of achieving a better score.  One paper noted that common tasks typically have four elements:

  1. Tasks are formally defined with a clear mathematical interpretation
  2. Easily accessible gold-standard datasets are publicly available in a ready-to-go standardized format
  3. One or more quantitative metrics are defined for each task to judge success
  4. State-of-the-art methods are ranked in a continuously updated leaderboard
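The four elements above can be sketched in a few lines of code. This is a toy illustration only: the dataset, metric, and team names are hypothetical stand-ins, not part of any real benchmark.

```python
# Minimal sketch of a Common Task Method evaluation harness.
# The gold-standard data and submissions below are toy placeholders.

def accuracy(predictions, gold_labels):
    """Element 3: a quantitative metric defined for the task."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

def update_leaderboard(leaderboard, team, score):
    """Element 4: rank methods in a continuously updated leaderboard."""
    leaderboard[team] = max(score, leaderboard.get(team, 0.0))
    return sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True)

# Element 2: a shared, standardized gold-standard test set (toy stand-in).
gold = [1, 0, 1, 1, 0]

leaderboard = {}
submissions = {
    "team_a": [1, 0, 1, 0, 0],   # 4/5 correct
    "team_b": [1, 0, 1, 1, 0],   # 5/5 correct
}
for team, preds in submissions.items():
    ranking = update_leaderboard(leaderboard, team, accuracy(preds, gold))

print(ranking)  # [('team_b', 1.0), ('team_a', 0.8)]
```

Because every competitor is scored against the same held-out data with the same metric, improvements on the leaderboard are directly comparable, which is what makes the method work.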

Computational physicist and synthetic biologist Erika DeBenedictis has proposed adding a fifth component: that “new data can be generated on demand.”  Erika, who runs Schmidt Futures-supported competitions such as the 2022 BioAutomation Challenge, argues that creating extensible, living datasets has several advantages: this approach can detect and help prevent overfitting; active learning can be used to improve performance per new datapoint; and datasets can grow organically to a useful size.
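A minimal sketch of that fifth element, with active learning deciding which measurement to request next. Everything here is a hypothetical stand-in: `run_experiment` plays the role of a real data-generation pipeline, and the uncertainty heuristic is deliberately simplistic.

```python
# Sketch of "new data generated on demand": an active-learning loop that
# queries an oracle where the current dataset is least informative.

def run_experiment(x):
    """Hypothetical on-demand oracle, standing in for a real measurement."""
    return x * x

def uncertainty(dataset, x):
    """Toy heuristic: distance from x to the closest measured point."""
    return min(abs(measured_x - x) for measured_x, _ in dataset)

dataset = [(0.0, 0.0), (4.0, 16.0)]   # seed measurements
candidates = [1.0, 2.0, 3.0]          # points we could measure next

for _ in range(2):
    # Active learning: query where the model is least certain.
    x = max(candidates, key=lambda c: uncertainty(dataset, c))
    dataset.append((x, run_experiment(x)))   # dataset grows organically

print(len(dataset))  # 4 points after two on-demand queries
```

Because each new datapoint is chosen to be maximally informative, performance per datapoint improves faster than with a fixed, static dataset, which is the advantage DeBenedictis highlights.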

Common Task Methods have been critical to progress in AI/ML, as David Donoho noted in 50 Years of Data Science.

Q.  Why do you think that we may be under-investing in the CTM approach?

U.S. agencies have already started to invest in AI for Science.  Examples include NSF’s AI Institutes, DARPA’s Accelerated Molecular Discovery, NIH’s Bridge2AI, and DOE’s investments in scientific machine learning.  The NeurIPS conference (one of the largest scientific conferences on machine learning and computational neuroscience) now has an entire track devoted to datasets and benchmarks.

However, there are a number of reasons why we are likely to be under-investing in this approach.

  1. These open datasets, benchmarks and competitions are what economists call “public goods.”  They benefit the field as a whole, and often do not disproportionately benefit the team that created the dataset.  Also, the CTM requires some level of community buy-in.  No one researcher can unilaterally define the metrics that a community will use to measure progress. 
  2. Researchers don’t spend a lot of time coming up with ideas if they don’t see a clear and reliable path to getting them funded.  Researchers ask themselves, “what datasets already exist, or what dataset could I create with a $500,000 – $1 million grant?”  They don’t ask the question, “what dataset + CTM would have a transformational impact on a given scientific or technological challenge, regardless of the resources that would be required to create it?”  If we want more researchers to generate concrete, high-impact ideas, we have to make it worth the time and effort to do so.
  3. Many key datasets (e.g., in fields such as chemistry) are proprietary, and were designed prior to the era of modern machine learning.  Although researchers are supposed to include Data Management Plans in their grant applications, these requirements are not enforced, data is often not shared in a way that is useful, and data can be of variable quality and reliability. In addition, large dataset creation may sometimes not be considered academically novel enough to garner high impact publications for researchers. 
  4. Creation of sufficiently large datasets may be prohibitively expensive.  For example, experts estimate that the cost of recreating the Protein Data Bank would be $15 billion!  Science agencies may also need to explore the role that innovation in hardware or new techniques can play in reducing the cost and increasing the uniformity of data, using, for example, automation, massive parallelism, miniaturization, and multiplexing.  A good example is NIH’s $1,000 Genome project, led by Jeffrey Schloss.

Q.  Why is close collaboration between experimental and computational teams necessary to take advantage of the role that AI can play in accelerating science?

According to Michael Frumkin of Google Accelerated Science, what is even more valuable than a static dataset is a data generation capability with a good balance of latency, throughput, and flexibility.  That’s because researchers may not immediately identify the right “objective function” that will result in a useful model with real-world applications, or the most important problem to solve.  Arriving at both requires iteration between experimental and computational teams.

Q.  What do you think is the broader opportunity to enable the digital transformation of science?

I think there are different tools and techniques that can be mixed and matched in a variety of ways that will collectively enable the digital transformation of science and engineering. Some examples include:

There are many opportunities at the intersection of these different scientific and technical building blocks.  For example, use of prior knowledge can sometimes reduce the amount of data that is needed to train a ML model.  Innovation in hardware could lower the time and cost of generating training data.  ML can predict the answer that a more computationally-intensive simulation might generate.  So there are undoubtedly opportunities to create a virtuous circle of innovation.
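One of those building blocks, using ML to approximate an expensive simulation, can be sketched in a few lines. The “simulation” below is a toy function standing in for a real solver; the point is only that a cheap surrogate fit on a handful of expensive runs can then be queried near-instantly.

```python
# Sketch of a surrogate model: fit a cheap regressor on a few expensive
# simulation runs, then predict instead of re-simulating.
import numpy as np

def expensive_simulation(x):
    # Imagine minutes of compute per call in a real solver.
    return 3.0 * x**2 + 2.0 * x + 1.0

# A handful of expensive training runs.
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = expensive_simulation(xs)

# Cheap surrogate: quadratic least-squares fit.
coeffs = np.polyfit(xs, ys, deg=2)
surrogate = np.poly1d(coeffs)

# Near-instant prediction at a point never simulated.
print(surrogate(2.5))  # ~24.75, matching expensive_simulation(2.5)
```

In practice the surrogate would be a richer model and the simulation a PDE solver or molecular dynamics code, but the virtuous circle is the same: expensive runs train the model, and the model then steers which expensive runs are worth doing.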

Q.  Are there any risks of the common task method?

Some researchers are pointing to negative sociological impacts associated with “SOTA-chasing” – e.g. a single-minded focus on generating a state-of-the-art result.  These include reducing the breadth of the type of research that is regarded as legitimate, too much competition and not enough cooperation, and overhyping AI/ML results with claims of “super-human” levels of performance.  Also, a researcher who makes a contribution to increasing the size and usefulness of the dataset may not get the same recognition as the researcher who gets a state-of-the-art result.

Some fields that have become overly dominated by incremental improvements in a metric have had to introduce a “Wild and Crazy Ideas” track at their conferences to create space for more speculative research directions.

Q.  Which types of science and engineering problems should be prioritized?

One benefit to the digital transformation of science and engineering is that it will accelerate the pace of discovery and technological advances.  This argues for picking problems where time is of the essence, including:

Obviously, it also has to be a problem where AI and ML can make a difference, e.g. ML’s ability to approximate a function that maps between an input and an output, or to lower the cost of making a prediction.

Q.  Why should economic policy-makers care about this as well?

One of the key drivers of the long-run increases in our standard of living is productivity (output per worker), and one source of productivity is what economists call general purpose technologies (GPTs).  These are technologies that have a pervasive impact on our economy and our society, such as interchangeable parts, the electric grid, the transistor, and the Internet.  

Historically, GPTs have required other complementary changes (e.g., organizational changes, changes in production processes and the nature of work) before their economic and societal benefits could be realized.  The introduction of electricity eventually led to massive increases in manufacturing productivity, but not until factories and production lines were reorganized to take advantage of small electric motors.  There are similar challenges for fostering the role that AI/ML and complementary technologies will play in accelerating the pace of scientific and technological advances:

Q.  Why is this an area where it might make sense to “unbundle” idea generation from execution?

Traditional funding mechanisms assume that the individual or team who has an idea should also be the one who implements it.  I don’t think this is necessarily the case for datasets and CTMs.  A researcher may have a brilliant idea for a dataset but may not be in a position to liberate the data (if it already exists), rally the community, and raise the funds needed to create it.  There is still value in getting researchers to submit and publish their ideas, because a proposal could catalyze a larger-scale effort.

Agencies could sponsor white paper competitions with a cash prize for the best ideas. [A good example of a white paper competition is MIT’s Climate Grand Challenge, which had a number of features that made it catalytic.]  Competitions could motivate researchers to answer questions such as:

The views and opinions expressed in this blog are the author’s own and do not necessarily reflect the view of Schmidt Futures.