AI for science: creating a virtuous circle of discovery and innovation
In this interview, Tom Kalil discusses the opportunities for science agencies and the research community to use AI/ML to accelerate the pace of scientific discovery and technological advancement.
Q. Why do you think that science agencies and the research community should be paying more attention to the intersection between AI/ML and science?
Recently, researchers have used DeepMind’s AlphaFold to predict the structures of more than 200 million proteins from roughly 1 million species, covering almost every known protein on the planet! Although not all of these predictions will be accurate, this is a massive step forward for the field of protein structure prediction.
The question that science agencies and different research communities should be actively exploring is: what were the preconditions for this result, and are there steps we can take to create those circumstances in other fields?
One partial answer to that question is that the protein structure community benefited from a large open database (the Protein Data Bank) and what linguist Mark Liberman calls the “Common Task Method.”
Q. What is the Common Task Method (CTM), and why is it so important for AI/ML?
In a CTM, competitors share the common task of training a model on a challenging, standardized dataset, with the goal of achieving a better score. One paper noted that common tasks typically have four elements:
- Tasks are formally defined with a clear mathematical interpretation
- Easily accessible gold-standard datasets are publicly available in a ready-to-go standardized format
- One or more quantitative metrics are defined for each task to judge success
- State-of-the-art methods are ranked in a continuously updated leaderboard
Computational physicist and synthetic biologist Erika DeBenedictis has proposed adding a fifth component: that “new data can be generated on demand.” Erika, who runs Schmidt Futures-supported competitions such as the 2022 BioAutomation Challenge, argues that creating extensible, living datasets has several advantages: this approach helps detect and prevent overfitting; active learning can be used to improve performance per new datapoint; and datasets can grow organically to a useful size.
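To make the mechanics concrete, here is a minimal, self-contained sketch of a common task in Python. The task, dataset, metric, and team names below are hypothetical stand-ins, not any benchmark’s actual definitions; real efforts such as the CASP protein structure prediction assessments are far richer, but the ingredients are the same: a gold-standard test set in a standardized format, a quantitative metric, and a continuously updatable leaderboard.

```python
# Minimal sketch of the Common Task Method's mechanics with made-up
# submissions and a toy metric; purely illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CommonTask:
    name: str
    # Gold-standard data in a ready-to-go format: (input, reference) pairs.
    test_set: List[Tuple[float, float]]
    # A quantitative metric mapping (prediction, reference) -> score.
    metric: Callable[[float, float], float]


def evaluate(task: CommonTask, predict: Callable[[float], float]) -> float:
    """Score a competitor's model on the task's held-out gold standard."""
    scores = [task.metric(predict(x), y) for x, y in task.test_set]
    return sum(scores) / len(scores)


def leaderboard(task: CommonTask,
                submissions: Dict[str, Callable[[float], float]]) -> List[Tuple[str, float]]:
    """Continuously updatable ranking of all submitted models."""
    ranked = [(team, evaluate(task, model)) for team, model in submissions.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy task: predict y = 2x from a tiny "gold standard" test set.
    task = CommonTask(
        name="toy-regression",
        test_set=[(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
        metric=lambda pred, ref: -abs(pred - ref),  # higher (less negative) is better
    )
    submissions = {
        "team_linear": lambda x: 2.0 * x,
        "team_biased": lambda x: 2.0 * x + 0.5,
    }
    for team, score in leaderboard(task, submissions):
        print(f"{team}: {score:.3f}")
```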
Common Task Methods have been critical to progress in AI/ML. As David Donoho noted in 50 Years of Data Science,
The ultimate success of many automatic processes that we now take for granted—Google translate, smartphone touch ID, smartphone voice recognition—derives from the CTF (Common Task Framework) research paradigm, or more specifically its cumulative effect after operating for decades in specific fields. Most importantly for our story: those fields where machine learning has scored successes are essentially those fields where CTF has been applied systematically.
Q. Why do you think that we may be under-investing in the CTM approach?
U.S. agencies have already started to invest in AI for Science. Examples include NSF’s AI Institutes, DARPA’s Accelerated Molecular Discovery, NIH’s Bridge2AI, and DOE’s investments in scientific machine learning. The NeurIPS conference (one of the largest scientific conferences on machine learning and computational neuroscience) now has an entire track devoted to datasets and benchmarks.
However, there are a number of reasons why we are likely to be under-investing in this approach.
- These open datasets, benchmarks, and competitions are what economists call “public goods.” They benefit the field as a whole, and often provide no special advantage to the team that created the dataset. In addition, the CTM requires some level of community buy-in: no single researcher can unilaterally define the metrics that a community will use to measure progress.
- Researchers don’t spend a lot of time coming up with ideas if they don’t see a clear and reliable path to getting them funded. Researchers ask themselves, “what datasets already exist, or what dataset could I create with a $500,000 – $1 million grant?” They don’t ask the question, “what dataset + CTM would have a transformational impact on a given scientific or technological challenge, regardless of the resources that would be required to create it?” If we want more researchers to generate concrete, high-impact ideas, we have to make it worth the time and effort to do so.
- Many key datasets (e.g., in fields such as chemistry) are proprietary, and were designed prior to the era of modern machine learning. Although researchers are supposed to include Data Management Plans in their grant applications, these requirements are not enforced, data is often not shared in a way that is useful, and data can be of variable quality and reliability. In addition, large dataset creation may sometimes not be considered academically novel enough to garner high impact publications for researchers.
- Creation of sufficiently large datasets may be prohibitively expensive. For example, experts estimate that recreating the Protein Data Bank would cost roughly $15 billion. Science agencies may also need to explore the role that innovation in hardware and new techniques can play in reducing the cost and increasing the uniformity of data, using, for example, automation, massive parallelism, miniaturization, and multiplexing. A good example of this was NIH’s $1,000 Genome project, led by Jeffrey Schloss.
Q. Why is close collaboration between experimental and computational teams necessary to take advantage of the role that AI can play in accelerating science?
According to Michael Frumkin of Google Accelerated Science, what is even more valuable than a static dataset is a data generation capability with a good balance of latency, throughput, and flexibility. That’s because researchers may not immediately identify the right “objective function” that will result in a useful model with real-world applications, or the most important problem to solve; finding these requires iteration between experimental and computational teams.
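As a rough illustration of that iteration, the sketch below closes the loop between proposing candidates, measuring them, and revisiting what counts as “best” between rounds. The run_assay function stands in for the experimental team’s data generation capability, and the crude novelty heuristic stands in for model-based experiment selection; this is a toy, not a description of any real pipeline.

```python
# Toy closed loop between a computational step (propose candidates) and an
# experimental step (measure them). All functions and parameters are
# illustrative assumptions.
import random
from typing import Callable, Dict, List


def run_assay(candidate: float) -> float:
    """Stand-in for the experimental team's data generation capability."""
    return candidate ** 2 + random.gauss(0.0, 0.1)  # noisy "measurement"


def propose_batch(measured: Dict[float, float], pool: List[float], k: int) -> List[float]:
    """Pick the k candidates farthest from anything already measured (a crude
    uncertainty proxy); real pipelines would use model-based acquisition."""
    def novelty(c: float) -> float:
        return min((abs(c - x) for x in measured), default=float("inf"))
    return sorted(pool, key=novelty, reverse=True)[:k]


def campaign(pool: List[float], rounds: int, batch_size: int,
             objective: Callable[[float], float]) -> Dict[float, float]:
    measurements: Dict[float, float] = {}
    for r in range(rounds):
        remaining = [c for c in pool if c not in measurements]
        for candidate in propose_batch(measurements, remaining, batch_size):
            measurements[candidate] = run_assay(candidate)  # latency/throughput live here
        best = max(measurements, key=lambda c: objective(measurements[c]))
        print(f"round {r}: best candidate so far = {best:.2f}")
        # Between rounds, the teams can revisit the objective function itself.
    return measurements


if __name__ == "__main__":
    campaign(pool=[x / 10 for x in range(-20, 21)], rounds=3, batch_size=5,
             objective=lambda y: -abs(y - 1.0))  # e.g., target a measured value near 1.0
```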
Q. What do you think is the broader opportunity to enable the digital transformation of science?
I think there are different tools and techniques that can be mixed and matched in a variety of ways that will collectively enable the digital transformation of science and engineering. Some examples include:
- Self-driving labs (and eventually, fleets of networked, self-driving labs), where machine learning is not only analyzing the data but informing which experiment to do next.
- Scientific equipment that is high-throughput, low-latency, automated, programmable, and potentially remote (e.g. “cloud labs”).
- Novel assays and sensors.
- The use of “science discovery games” that allow volunteers and citizen scientists to more accurately label training data. For example, the game Mozak trains volunteers to collaboratively reconstruct complex 3D representations of neurons.
- Advances in algorithms (e.g. progress in areas such as causal inference, interpreting high-dimensional data, inverse design, uncertainty quantification, and multi-objective optimization).
- Software for orchestration of experiments, and open hardware and software interfaces to allow more complex scientific workflows.
- Integration of machine learning, prior knowledge, modeling and simulation, and advanced computing.
- New approaches to informatics and knowledge representation – e.g. a machine-readable scientific literature, and an increasing number of experiments that can be expressed as code and are therefore more replicable.
- Approaches to human-machine teaming that allow the best division of labor between human scientists and autonomous experimentation.
- Funding mechanisms, organizational structures and incentives that enable the team science and community-wide collaboration needed to realize the potential of this approach.
There are many opportunities at the intersection of these different scientific and technical building blocks. For example, prior knowledge can sometimes reduce the amount of data needed to train an ML model; innovation in hardware could lower the time and cost of generating training data; and ML can predict the answer that a more computationally intensive simulation would generate. So there are undoubtedly opportunities to create a virtuous circle of innovation.
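As a deliberately simplified example of that last point, the sketch below fits a cheap surrogate model to a handful of runs of a stand-in “expensive” simulation and uses it to screen a large candidate set. The expensive_simulation function, the polynomial surrogate, and the candidate grid are illustrative assumptions, not a recommendation of any particular method.

```python
# Surrogate-model sketch: learn a cheap approximation of an expensive
# simulation, screen many candidates, then verify only the top pick.
import numpy as np


def expensive_simulation(x: float) -> float:
    """Stand-in for a computationally intensive physics/chemistry code."""
    return np.sin(3 * x) + 0.5 * x


# 1. Run the expensive code at a few training points.
train_x = np.linspace(0.0, 2.0, 8)
train_y = np.array([expensive_simulation(x) for x in train_x])

# 2. Fit a cheap surrogate (here, a low-order polynomial).
surrogate = np.poly1d(np.polyfit(train_x, train_y, deg=4))

# 3. Screen a large candidate set with the surrogate, then verify the best one.
candidates = np.linspace(0.0, 2.0, 1000)
best = candidates[np.argmax(surrogate(candidates))]
print(f"surrogate's top candidate: x = {best:.3f}, "
      f"verified value = {expensive_simulation(best):.3f}")
```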
Q. Are there any risks of the common task method?
Some researchers point to negative sociological impacts associated with “SOTA-chasing,” i.e. a single-minded focus on generating a state-of-the-art result. These include narrowing the range of research that is regarded as legitimate, too much competition and not enough cooperation, and overhyping AI/ML results with claims of “super-human” performance. In addition, a researcher who contributes to increasing the size and usefulness of a dataset may not get the same recognition as the researcher who achieves a state-of-the-art result.
Some fields that have become overly dominated by incremental improvements in a metric have had to introduce Wild and Crazy Ideas as a separate track in their conferences to create a space for more speculative research directions.
Q. Which types of science and engineering problems should be prioritized?
One benefit of the digital transformation of science and engineering is that it will accelerate the pace of discovery and technological advances. This argues for prioritizing problems where time is of the essence, including:
- Developing and manufacturing the carbon-neutral and carbon-negative technologies we need for power, transportation, buildings, industry, and food and agriculture. Currently, it can take 17-20 years to discover and manufacture a new material, which is too long if we want to meet ambitious 2050 climate goals.
- Improving our response to future pandemics by being able to more rapidly design, develop and evaluate new vaccines, therapies, and diagnostics.
- Addressing new threats to our national security, such as engineered pathogens and the technological dimension of our economic and military competition with peer adversaries.
Obviously, it also has to be a problem where AI and ML can make a difference, e.g. by exploiting ML’s ability to approximate a function that maps an input to an output, or to lower the cost of making a prediction.
Q. Why should economic policy-makers care about this as well?
One of the key drivers of the long-run increases in our standard of living is productivity (output per worker), and one source of productivity is what economists call general purpose technologies (GPTs). These are technologies that have a pervasive impact on our economy and our society, such as interchangeable parts, the electric grid, the transistor, and the Internet.
Historically, GPTs have required other complementary changes (e.g. organizational changes, changes in production processes and the nature of work) before their economic and societal benefits could be realized. The introduction of electricity eventually led to massive increases in manufacturing productivity, but not until factories and production lines were reorganized to take advantage of small electric motors. There are similar challenges in fostering the role that AI/ML and complementary technologies will play in accelerating the pace of scientific and technological advances:
- Researchers and science funders need to identify and support the technical infrastructure (e.g. datasets + CTMs, self-driving labs) that will move an entire field forward, or solve a particularly important problem.
- A leading academic researcher involved in protein structure prediction noted that DeepMind made so much progress on the protein folding problem because “everyone was rowing in the same direction,” with “18 co-first authors .. an incentive structure wholly foreign to academia” and “a fast and focused research paradigm … [which] raises the question of what other problems exist that are ripe for a fast and focused attack.” Capitalizing on this opportunity is therefore likely to require greater experimentation in mechanisms for funding, organizing, and incentivizing research, such as Focused Research Organizations.
Q. Why is this an area where it might make sense to “unbundle” idea generation from execution?
Traditional funding mechanisms assume that the individual or team who has an idea should also be the one to implement it. I don’t think this is necessarily the case for datasets and CTMs. A researcher may have a brilliant idea for a dataset but may not be in a position to liberate the data (if it already exists), rally the community, and raise the funds needed to create the dataset. There is still value in getting researchers to submit and publish their ideas, because their proposals could catalyze a larger-scale effort.
Agencies could sponsor white paper competitions with a cash prize for the best ideas. (A good example of a white paper competition is MIT’s Climate Grand Challenge, which had a number of features that made it catalytic.) Competitions could motivate researchers to answer questions such as:
- What dataset and Common Task would have a significant impact on our ability to answer a key scientific question or make progress on an important use-inspired or technological problem? What preliminary work has been done or should be done prior to making a larger-scale investment in data collection?
- To the extent that industry would also find the data useful, would they be willing to share the cost of collecting it? They could also share existing data, including the results from failed experiments.
- What advance in hardware or experimental techniques would lower the time and cost of generating high-value datasets by one or more orders of magnitude?
- What self-driving lab would significantly accelerate progress in a given field or problem, and why?
The views and opinions expressed in this blog are the author’s own and do not necessarily reflect the view of Schmidt Futures.