Predicting Progress: A Pilot of Expected Utility Forecasting in Science Funding


The current process that federal science agencies use for reviewing grant proposals is known to be biased against riskier proposals. As such, the metascience community has proposed many alternate approaches to evaluating grant proposals that could improve science funding outcomes. One such approach was proposed by Chiara Franzoni and Paula Stephan in a paper on how expected utility — a formal quantitative measure of predicted success and impact — could be a better metric for assessing the risk and reward profile of science proposals. Inspired by their paper, the Federation of American Scientists (FAS) collaborated with Metaculus to run a pilot study of this approach. In this working paper, we share the results of that pilot and its implications for future implementation of expected utility forecasting in science funding review. 

Brief Description of the Study

In fall 2023, we recruited a small cohort of subject matter experts to review five life science proposals by forecasting their expected utility. For each proposal, this consisted of defining two research milestones in consultation with the project leads and asking reviewers to make three forecasts for each milestone:

  1. The probability of success;
  2. The scientific impact of the milestone, if it were reached; and
  3. The social impact of the milestone, if it were reached.

These predictions can then be used to calculate the expected utility, or likely impact, of a proposal and design and compare potential portfolios.

Key Takeaways for Grantmakers and Policymakers

The three main strengths of using expected utility forecasting to conduct peer review are that it simplifies the review criteria, disentangles feasibility from impact, and reduces reliance on biased metrics.

Despite the apparent complexity of this process, we found that first-time users were able to successfully complete their review according to the guidelines without any additional support. Most of the complexity occurs behind the scenes and either aligns with the responsibilities of the program manager (e.g., defining milestones and their dependencies) or can be automated (e.g., calculating the total expected utility). Thus, grantmakers and policymakers can have confidence in the user-friendliness of expected utility forecasting.

How Can NSF or NIH Run an Experiment on Expected Utility Forecasting?

An initial pilot study could be conducted by NSF or NIH by adding a short, non-binding expected utility forecasting component to a selection of review panels. In addition to the evaluation of traditional criteria, reviewers would be asked to predict the success and impact of select milestones for the proposals assigned to them. The rest of the review process and the final funding decisions would be made using the traditional criteria. 

Afterwards, study facilitators could take the expected utility forecasting results and construct an alternate portfolio of proposals that would have been funded if that approach was used, and compare the two portfolios. Such a comparison would yield valuable insights into whether—and how—the types of proposals selected by each approach differ, and whether their use leads to different considerations arising during review. Additionally, a pilot assessment of reviewers’ prediction accuracy could be conducted by asking program officers to assess milestone achievement and study impact upon completion of funded projects.

Findings and Recommendations

Reviewers in our study were new to the expected utility forecasting process and gave generally positive reactions. In their feedback, reviewers said that they appreciated how the framing of the questions prompted them to think about the proposals in a different way and pushed them to ground their assessments with quantitative forecasts. The focus on just three review criteria–probability of success, scientific impact, and social impact–was seen as a strength because it simplified the process, disentangled feasibility from impact, and eliminated biased metrics. Overall, reviewers found this new approach interesting and worth investigating further. 

In designing this pilot and analyzing the results, we identified several important considerations for planning such a review process. While complex, engaging with these considerations tended to provide value by making implicit project details explicit and encouraging clear definition and communication of evaluation criteria to reviewers. Two key examples are defining the proposal milestones and creating impact scoring systems. In both cases, reducing ambiguities in terms of the goals that are to be achieved, developing an understanding of how outcomes depend on one another, and creating interpretable and resolvable criteria for assessment will help ensure that the desired information is solicited from reviewers. 

Questions for Further Study

Our pilot only simulated the individual review phase of grant proposals and did not simulate a full review committee. The typical review process at a funding agency consists of first, individual evaluations by assigned reviewers, then discussion of those evaluations by the whole review committee, and finally, the submission of final scores from all members of the committee. This is similar to the Delphi method, a structured process for eliciting forecasts from a panel of experts, so we believe that it would work well with expected utility forecasting. The primary change would therefore be in the definition and approach for eliciting criterion scores, rather than the structure of the review process. Nevertheless, future implementations may uncover additional considerations that need to be addressed or better ways to incorporate forecasting into a panel environment. 

Further investigation into how best to define proposal milestones is also needed. This includes questions such as, who should be responsible for determining the milestones? If reviewers are involved, at what part(s) of the review process should this occur? What is the right balance between precision and flexibility of milestone definitions, such that the best outcomes are achieved? How much flexibility should there be in the number of milestones per proposal? 

Lastly, more thought should be given to how to define social impact and how to calibrate reviewers’ interpretation of the impact score scale. In our report, we propose a couple of different options for calibrating impact, in addition to describing the one we took in our pilot. 

Grantmakers, both public and private, and policymakers are welcome to reach out to our team to learn more or to receive assistance in implementing this approach.


Introduction

The fundamental concern of grantmakers, whether governmental or philanthropic, is how to make the best funding decisions. All funding decisions come with inherent uncertainties that may pose risks to the investment. Thus, a certain level of risk aversion is natural and even desirable in grantmaking institutions, especially federal science agencies, which are responsible for managing taxpayer dollars. However, without risk there is no reward, so the trade-off must be balanced. In mathematics and economics, expected utility is the common metric assumed to underlie all rational decision making. Expected utility has two components: the probability of an outcome occurring if an action is taken and the value of that outcome, which roughly correspond to risk and reward, respectively. Thus, expected utility would seem to be a logical choice for evaluating science funding proposals.

In debates around funding innovation, though, expected utility has largely flown under the radar compared to other ideas. Nevertheless, Chiara Franzoni and Paula Stephan have proposed using expected utility in peer review. Building on their paper, the Federation of American Scientists (FAS) developed a detailed framework for incorporating expected utility into a peer review process. We chose to frame the review criteria as forecasting questions, since determining the expected utility of a proposal inherently requires making predictions about the future. Forecasting questions also have the added benefit of being resolvable–i.e., the true outcome can be determined after the fact and compared to the prediction–which provides a learning opportunity for reviewers to improve their abilities and identify biases. In addition to forecasting, we incorporated other unique features, like an exponential scale for scoring impact, that we believe help reduce biases against risky proposals.

With the theory laid out, we conducted a small pilot in fall 2023. The pilot was run in collaboration with Metaculus, a crowd forecasting platform and aggregator, to leverage their expertise in designing resolvable forecasting questions and to use their platform to collect forecasts from reviewers. The purpose of the pilot was to test the mechanics of this approach in practice, identify any additional considerations that need to be thought through, and surface potential issues that would need to be resolved. We were also curious whether any interesting or unexpected results would arise from how we chose to calculate impact and total expected utility. It is important to note that this pilot was not an experiment, so we did not have a control group against which to compare the results of the review.

Since FAS is not a grantmaking institution, we did not have a ready supply of traditional grant proposals to use. Instead, we used a set of two-page research proposals for Focused Research Organizations (FROs) that we had sourced through separate advocacy work in that area.1 With the proposal authors’ permission, we recruited a cohort of twenty subject matter experts to each review one of five proposals. For each proposal, we defined two research milestones in consultation with the proposal authors. Reviewers were asked to make three forecasts for each milestone:

  1. The probability of success;
  2. The scientific impact, conditional on success; and
  3. The social impact, conditional on success.

Reviewers submitted their forecasts on Metaculus’ platform; in a separate form they provided explanations for their forecasts and responded to questions about their experience and impression of this new approach to proposal evaluation. (See Appendix A for details on the pilot study design.)
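For concreteness, the sketch below shows one way the per-milestone forecasts could be represented in code. It is purely illustrative: the field names are ours, not Metaculus’, and it assumes the 25th/50th/75th percentile format for impact forecasts described in Appendix A.

```python
from dataclasses import dataclass

@dataclass
class MilestoneForecast:
    """One reviewer's forecasts for a single milestone (illustrative field names)."""
    p_success: float            # probability the milestone is achieved, 0-1
    sci_impact_q25: float       # scientific impact score, 25th percentile (1-10 scale)
    sci_impact_median: float    # scientific impact score, median
    sci_impact_q75: float       # scientific impact score, 75th percentile
    soc_impact_q25: float       # social impact score, 25th percentile
    soc_impact_median: float    # social impact score, median
    soc_impact_q75: float       # social impact score, 75th percentile

# Hypothetical example of one reviewer's submission for one milestone
example = MilestoneForecast(
    p_success=0.7,
    sci_impact_q25=6.0, sci_impact_median=7.0, sci_impact_q75=8.0,
    soc_impact_q25=4.0, soc_impact_median=5.5, soc_impact_q75=6.5,
)
```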

Insights from Reviewer Feedback

Overall, reviewers liked the framing and criteria provided by the expected utility approach, while their main critique was of the structure of the research proposals. Excluding critiques of the research proposal structure, which are unlikely to apply to an actual grant program, two thirds of the reviewers expressed positive opinions of the review process and/or thought it was worth pursuing further given drawbacks with existing review processes. Below, we delve into the details of the feedback we received from reviewers and their implications for future implementation.

Feedback on Review Criteria

Disentangling Impact from Feasibility

Many of the reviewers said that this model prompted them to think differently about how they assess the proposals and that they liked the new questions. Reviewers appreciated that the questions focused their attention on what they think funding agencies really want to know and nothing more: “can it occur?” and “will it matter?” This approach explicitly disentangles impact from feasibility: “Often, these two are taken together, and if one doesn’t think it is likely to succeed, the impact is also seen as lower.” Additionally, the emphasis on big picture scientific and social impact “is often missing in the typical review process.” Reviewers also liked that this approach eliminates what they consider biased metrics, such as the principal investigator’s reputation, track record, and “excellence.” 

Reducing Administrative Burden

The small set of questions was seen as more efficient and less burdensome on reviewers. One reviewer said, “I liked this approach to scoring a proposal. It reduces the effort to thinking about perceived impact and feasibility.” Another reviewer said, “On the whole it seems a worthwhile exercise as the current review processes for proposals are onerous.” 

Quantitative Forecasting

Reviewers saw benefits to being asked to quantify their assessments, but also found it challenging at times. A number of reviewers enjoyed taking a quantitative approach and thought that it helped them be more grounded and explicit in their evaluations of the proposals. However, some reviewers were concerned that it felt like guesswork and expressed low confidence in their quantitative assessments, primarily due to proposals lacking details on their planned research methods, which is an issue discussed in the section “Feedback on Proposals.” Nevertheless, some of these reviewers still saw benefits to taking a quantitative approach: “It is interesting to try to estimate probabilities, rather than making flat statements, but I don’t think I guess very well. It is better than simply classically reviewing the proposal [though].” Since not all academics have experience making quantitative predictions, we expect that there will be a learning curve for those new to the practice. Forecasting is a skill that can be learned though, and we think that with training and feedback, reviewers can become better, more confident forecasters.

Defining Social Impact

Of the three types of questions that reviewers were asked to answer, the question about social impact seemed to be the hardest for reviewers to interpret. Reviewers noted that they would have liked more guidance on what was meant by social impact and whether it included indirect impacts. Since questions like these are ultimately subjective, the “right” definition of social impact and the types of outcomes considered most valuable will depend on the grantmaking institution, its domain area, and its theory of change, so we leave this for future implementers to clarify in their instructions.

Calibrating Impact

While the impact score scale (see Appendix A) defines the relative difference in impact between scores, it does not define the absolute impact conveyed by a score. For this reason, a calibration mechanism is necessary to provide reviewers with a shared understanding of the use and interpretation of the scoring system. Note that this is a challenge that rubric-based peer review criteria used by science agencies also face. Discussion and aggregation of scores across a review committee helps align reviewers and average out some of this natural variation.2

To address this, we surveyed a small, separate set of academics in the life sciences about how they would score the social and scientific impact of the average NIH R01 grant, which many life science researchers apply to and review proposals for. We then provided the average scores from this survey to reviewers to orient them to the new scale and help them calibrate their scores. 

One reviewer suggested an alternative approach: “The other thing I might change is having a test/baseline question for every reviewer to respond to, so you can get a feel for how we skew in terms of assessing impact on both scientific and social aspects.” One option would be to ask reviewers to score the social and scientific impact of the average grant proposal for a grant program that all reviewers would be familiar with; another would be to ask reviewers to score the impact of the average funded grant for a specific grant program, which could be more accessible for new reviewers who have not previously reviewed grant proposals. A third option would be to provide all reviewers on a committee with one or more sample proposals to score and discuss, in a relevant and shared domain area.

When deciding on an approach for calibration, a key consideration is the specific resolution criteria that are being used — i.e., the downstream measures of impact that reviewers are being asked to predict. One option, which was used in our pilot, is to predict the scores that a comparable, but independent, panel of reviewers would give the project some number of years following its successful completion. For a resolution criterion like this one, collecting and sharing calibration scores can help reviewers get a sense for not just their own approach to scoring, but also those of their peers.

Making Funding Decisions

In scoring the social and scientific impact of each proposal, reviewers were asked to assess the value of the proposal to society or to the scientific field. That alone would be insufficient to determine whether a proposal should be funded though, since it would need to be compared with other proposals in conjunction with its feasibility. To do so, we calculated the total expected utility of each proposal (see Appendix C). In a real funding scenario, this final metric could then be used to compare proposals and determine which ones get funded. Additionally, unlike a traditional scoring system, the expected utility approach allows for the detailed comparison of portfolios — including considerations like the expected proportion of milestones reached and the range of likely impacts.

In our pilot, reviewers were not informed that we would be doing this additional calculation based on their submissions. As a result, one reviewer thought that the questions they were asked failed to include other important questions, like “should it occur?” and “is it worth the opportunity cost?” Though these questions were not asked of reviewers explicitly, we believe that they would be answered once the expected utility of all proposals is calculated and considered, since the opportunity cost of one proposal would be the expected utility of the other proposals. Since each reviewer only provided input on one proposal, they may have felt like the scores they gave would be used to make a binary yes/no decision on whether to fund that one proposal, rather than being considered as a part of a larger pool of proposals, as it would be in a real review process.

Feedback on Proposals

Missing Information Impedes Forecasting

The primary critique that reviewers expressed was that the research proposals lacked details about their research plans, the methods and experimental protocols that would be used, and any preliminary research the author(s) had done so far. This hindered their ability to properly assess the technical feasibility of the proposals and their probability of success. A few reviewers also would have liked a better sense of who would be conducting the research and each team member’s responsibilities. These issues arose because the FRO proposals used in our pilot had not originally been submitted for funding purposes and thus were not subject to the content requirements of traditional grant proposals, as noted above. We assume this would not be an issue with proposals submitted to actual grantmakers.3

Improving Milestone Design

A few reviewers pointed out that some of the proposal milestones were too ambiguous or were not worded specifically enough, such that there were ways that researchers could technically say that they had achieved the milestone without accomplishing the spirit of its intent. This made it more challenging for reviewers to assess milestones, since they weren’t sure whether to focus on the ideal (i.e., more impactful) interpretation of the milestone or to account for these “loopholes.” Moreover, loopholes skew the forecasts, since they increase the probability of achieving a milestone, while lowering the impact of doing so if it is achieved through a loophole.

One reviewer suggested, “I feel like the design of milestones should be far more carefully worded – or broken up into sub-sentences/sub-aims, to evaluate the feasibility of each. As the questions are currently broken down, I feel they create a perverse incentive to create a vaguer milestone, or one that can be more easily considered ‘achieved’ for some ‘good enough’ value of achieved.” For example, they proposed that one of the proposal milestones, “screen a library of tens of thousands of phage genes for enterobacteria for interactions and publish promising new interactions for the field to study,” could be expanded to

  1. “Generate a library of tens of thousands of genes from enterobacteria, expressed in E. coli
  2. “Validate their expression under screenable conditions
  3. “Screen the library for their ability to impede phage infection with a panel of 20 type phages
  4. “Publish … 
  5. “Store and distribute the library, making it as accessible to the broader community”

We agree with the need for careful consideration and design of milestones, given that “loopholes” in milestones can detract from their intended impact and make it harder for reviewers to accurately assess their likelihood. In our theoretical framework for this approach, we identified three potential parties that could be responsible for defining milestones: (1) the proposal author(s), (2) the program manager, with or without input from proposal authors, or (3) the reviewers, with or without input from proposal authors. This critique suggests that the first approach of allowing proposal authors to be the sole party responsible for defining proposal milestones is vulnerable to being gamed, and the second or third approach would be preferable. Program managers who take on the task of defining milestones should have enough expertise to think through the different potential ways of fulfilling a milestone and make sure that they are sufficiently precise for reviewers to assess.

Benefits of Flexibility in Milestones

Some flexibility in milestones may still be desirable, especially with respect to the actual methodology, since experimentation may be necessary to determine the best technique to use. For example, speaking about the feasibility of a different proposal milestone – “demonstrate that Pro-AG technology can be adapted to a single pathogenic bacterial strain in a 300 gallon aquarium of fish and successfully reduce antibiotic resistance by 90%” – a reviewer noted that 

“The main complexity and uncertainty around successful completion of this milestone arises from the native fish microbiome and whether a CRISPR delivery tool can reach the target strain in question. Due to the framing of this milestone, should a single strain be very difficult to reach, the authors could simply switch to a different target strain if necessary. Additionally, the mode of CRISPR delivery is not prescribed in reaching this milestone, so the authors have a host of different techniques open to them, including conjugative delivery by a probiotic donor or delivery by engineered bacteriophage.”

Peer Review Results

Sequential Milestones vs. Independent Outcomes

In our expected utility forecasting framework, we defined two different ways that a proposal could structure its outcomes: as sequential milestones where each additional milestone builds off of the success of the previous one, or as independent outcomes where the success of one is not dependent on the success of the other(s). For proposals with sequential milestones in our pilot, we would expect the probability of success of milestone 2 to be less than the probability of success of milestone 1 and for the opposite to be true of their impact scores. For proposals with independent outcomes, we do not expect there to be a relationship between the probability of success and the impact scores of milestones 1 and 2. There are different equations for calculating the total expected utility, depending on the relationship between outcomes (see Appendix C).

We categorized each proposal in our study based on whether it had sequential milestones or independent outcomes. This information was not shared with reviewers. Table 1 presents the average reviewer forecasts for each proposal. In general, milestones received higher scientific impact scores than social impact scores, which makes sense given the primarily academic focus of the research proposals. For proposals 1 to 3, the probability of success of milestone 2 was roughly half that of milestone 1, and reviewers gave milestone 2 higher scientific and social impact scores than milestone 1. This is consistent with our categorization of proposals 1 to 3 as having sequential milestones.

Table 1. Mean forecasts for each proposal.
See next section for discussion about the categorization of proposal 4’s milestones.
Proposal | Milestone Category | M1 Probability of Success | M1 Scientific Impact Score | M1 Social Impact Score | M2 Probability of Success | M2 Scientific Impact Score | M2 Social Impact Score
1 | sequential | 0.80 | 7.83 | 7.35 | 0.41 | 8.22 | 8.25
2 | sequential | 0.88 | 6.41 | 3.72 | 0.36 | 8.21 | 7.62
3 | sequential | 0.68 | 7.07 | 6.45 | 0.34 | 8.20 | 7.50
4 | ? | 0.72 | 6.58 | 3.92 | 0.47 | 7.06 | 4.19
5 | independent | 0.55 | 7.14 | 2.37 | 0.40 | 6.66 | 2.25
(M1 = milestone 1; M2 = milestone 2.)

Further Discussion on Designing and Categorizing Milestones

We originally categorized proposal 4’s milestones as sequential, but one reviewer gave milestone 2 a lower scientific impact score than milestone 1 and two reviewers gave it a lower social impact score. One reviewer also gave milestone 2 roughly the same probability of success as milestone 1. This suggests that proposal 4’s milestones can’t be considered strictly sequential. 

The two milestones for proposal 4 were

The reviewer who gave milestone 2 a lower scientific impact score explained: “Given the wording of the milestone, I do not believe that if the scientific milestone was achieved, it would greatly improve our understanding of the brain.” Unlike proposals 1-3, in which milestone 2 was a scaled-up or improved-upon version of milestone 1, these milestones represent fundamentally different categories of output (general-purpose tool vs specific model). Thus, despite the necessity of milestone 1’s tool for achieving milestone 2, the reviewer’s response suggests that the impact of milestone 2 was being considered separately rather than cumulatively.

Milestone Design Recommendations
Recommendation 1: Explicitly define sequential milestones

To properly address this case of sequential milestones with different types of outputs, we recommend that for all sequential milestones, latter milestones should be explicitly defined as inclusive of prior milestones. In the above example, this would imply redefining milestone 2 as “Complete milestone 1 and develop a model of the C. elegans nervous system…” This way, reviewers know to include the impact of milestone 1 in their assessment of the impact of milestone 2.

Recommendation 2: Clarify milestone category with reviewers

To help ensure that reviewers are aligned with program managers in how they interpret the proposal milestones (if they aren’t directly involved in defining them), we suggest that reviewers either be informed of how program managers have categorized the proposal outputs, so that they can conduct their review accordingly, or be allowed to decide the category themselves (and thus how the total expected utility is calculated), whether individually, collectively, or both.

Recommendation 3: Allow for a flexible number of milestones

We chose to use only two of the goals that proposal authors provided because we wanted to standardize the number of milestones across proposals. However, this may have provided an incomplete picture of the proposals’ goals, and thus an incomplete assessment of the proposals. We recommend that future implementations be flexible and allow the number of milestones to be determined by each proposal’s needs. This would also help accommodate one reviewer’s suggestion that some milestones be broken down into intermediary steps.

Importance of Reviewer Explanations

As the above discussion shows, reviewers’ explanations of their forecasts were crucial to understanding how they interpreted the milestones. The explanations varied in length and detail, but the most insightful responses broke the reasoning down into detailed steps and addressed (1) any ambiguities in the milestone and how the reviewer chose to interpret them, (2) the state of the scientific field and the maturity of the different techniques that the authors propose to use, and (3) factors that improve the likelihood of success versus potential barriers or challenges that would need to be overcome.

Exponential Impact Scales Better Reflect the Real Distribution of Impact 

The distribution of NIH and NSF proposal peer review scores tends to be skewed such that most proposals are rated above the center of the scale and few proposals are rated poorly. However, other markers of scientific impact, such as citations (even with all their imperfections), suggest a long tail of studies with very high impact. This discrepancy suggests that traditional peer review scoring systems are not well-structured to capture the nonlinearity of scientific impact, resulting in score inflation. The bunching of scores at the top end of the scale also means that very negative scores have a greater effect than very positive scores when averaged together, since there is more room between the average score and the bottom end of the scale than the top. This can generate systematic bias against more controversial or risky proposals.
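As a purely illustrative calculation (the scores below are hypothetical, not data from our pilot, and assume a scale where higher is better), the following sketch shows how bunching near the top of a bounded scale makes a single dissenting low score far more influential than a single enthusiastic high score:

```python
def average(scores):
    """Simple mean of a list of review scores."""
    return sum(scores) / len(scores)

# Hypothetical scores on a 1-9 scale (higher is better), bunched near the top
# as is common in traditional peer review.
baseline = [8, 8, 8]

print(average(baseline))         # 8.0
print(average(baseline + [1]))   # 6.25: one very low score drags the mean down by 1.75
print(average(baseline + [9]))   # 8.25: one very high score lifts it by only 0.25
```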

In our pilot, we chose to use an exponential scale with a base of 2 for impact to better reflect the real distribution of scientific impact. Using this exponential impact scale, we surveyed a small pool of academics in the life sciences about how they would rate the impact of the average funded NIH R01 grant. They responded with an average scientific impact score of 5 and an average social impact score of 3, which are much lower on our scale compared to traditional peer review scores4, suggesting that the exponential scale may be beneficial for avoiding score inflation and bunching at the top. In our pilot, the distribution of scientific impact scores was centered higher than 5, but was still less skewed than NIH peer review scores for significance and innovation typically are. This partially reflects the fact that the proposals were expected to receive one to two orders of magnitude more funding than NIH R01 proposals, so their impact should also be greater. The distribution of social impact scores exhibited a much wider spread and a lower center.

Figure 1. Distribution of Impact scores for milestone 1 (top) and 2 (bottom)

Conclusion

In summary, expected utility forecasting presents a promising approach to improving the rigor of peer review and quantitatively defining the risk-reward profile of science proposals. Our pilot study suggests that this approach can be quite user-friendly for reviewers, despite its apparent complexity. Further study into how best to integrate forecasting into panel environments, define proposal milestones, and calibrate impact scales will help refine future implementations of this approach. 

More broadly, we hope that this pilot will encourage more grantmaking institutions to experiment with innovative funding mechanisms. Reviewers in our pilot were more open-minded and quicker to learn than one might expect and saw significant value in this unconventional approach. Perhaps this should not be so much of a surprise given that experimentation is at the heart of scientific research.

Grantmakers, both public and private, and policymakers are welcome to reach out to our team to learn more or to receive assistance in implementing this approach.

Acknowledgements

Many thanks to Jordan Dworkin for being an incredible thought partner in designing the pilot and providing meticulous feedback on this report. Your efforts made this project possible!


Appendix A: Pilot Study Design

Our pilot study consisted of five proposals for life science-related Focused Research Organizations (FROs). These proposals were solicited from academic researchers by FAS as part of our advocacy for the concept of FROs. As such, these proposals were not originally intended as proposals for direct funding, and did not have as strict content requirements as traditional grant proposals typically do. Researchers were asked to submit one- to two-page proposals discussing (1) their research concept, (2) the motivation and its expected social and scientific impact, and (3) the rationale for why this research cannot be accomplished through traditional funding channels and thus requires a FRO to be funded.

Permission was obtained from proposal authors to use their proposals in this study. We worked with proposal authors to define two milestones for each proposal that reviewers would assess: one that they felt confident that they could achieve and one that was more ambitious but that they still thought was feasible. In addition, due to the brevity of the proposals, we included an additional 1-2 pages of supplementary information and scientific context. Final drafts of the milestones and supplementary information were provided to authors to edit and approve. Because this pilot study could not provide any actual funding to proposal authors, it was not possible to solicit full length research proposals from proposal authors.

We recruited four to six reviewers for each proposal based on their subject matter expertise. Potential participants were recruited over email with a request to help review a FRO proposal related to their area of research. They were informed that the review process would be unconventional but were not informed of the study’s purpose. Participants were offered a small monetary compensation for their time.

Confirmed participants were sent instructions and materials for the review process on the same day and were asked to complete their review by the same deadline a month and a half later. Reviewers were told to assume that, if funded, each proposal would receive $50 million in funding over five years to conduct the research, consistent with the proposed model for FROs. Each proposal had two technical milestones, and reviewers were asked to answer the following questions for each milestone: 

  1. Assuming that the proposal is funded by 2025, will the milestone be achieved before 2031?
  2. What will be the average scientific impact score, as judged in 2032, of accomplishing the milestone?
  3. What will be the average social impact score, as judged in 2032, of accomplishing the milestone?

The impact scoring system was explained to reviewers as follows:

Please consider the following in determining the impact score: the current and expected long-term social or scientific impact of a funded FRO’s outputs if a funded FRO accomplishes this milestone before 2030.

The impact score we are using ranges from 1 (low) to 10 (high). It is base 2 exponential, meaning that a proposal that receives a score of 5 has double the impact of a proposal that receives a score of 4, and quadruple the impact of a proposal that receives a score of 3. In a small survey we conducted of SMEs in the life sciences, they rated the scientific and social impact of the average NIH R01 grant — a federally funded research grant that provides $1-2 million for a 3-5 year endeavor — on this scale to be 5.2 ± 1.5 and 3.1 ± 1.3, respectively. The median scores were 4.75 and 3.00, respectively.

Below is an example of how a predicted impact score distribution (left) would translate into an actual impact distribution (right). You can try it out yourself with this interactive version (in the menu bar, click Runtime > Run all) to get some further intuition on how the impact score works. Please note that this is meant solely for instructive purposes, and the interface is not designed to match Metaculus’ interface.

The choice of an exponential impact scale reflects the tendency in science for a small number of research projects to have an outsized impact. For example, studies have shown that the relationship between the number of citations for a journal article and its percentile rank scales exponentially.

Scientific impact aims to capture the extent to which a project advances the frontiers of knowledge, enables new discoveries or innovations, or enhances scientific capabilities or methods. Though each is imperfect, one could consider citations of papers, patents on tools or methods, or users of software or datasets as proxies of scientific impact. 

Social impact aims to capture the extent to which a project contributes to solving important societal problems, improving well-being, or advancing social goals. Some proxy metrics that one might use to assess a project’s social impact are the value of lives saved, the cost of illness prevented, the number of job-years of employment generated, economic output in terms of GDP, or the social return on investment. 

You may consider any or none of these proxy metrics as a part of your assessment of the impact of a FRO accomplishing this milestone.

Reviewers were asked to submit their forecasts on Metaculus’ website and to provide their reasoning in a separate Google form. For question 1, reviewers were asked to respond with a single probability. For questions 2 and 3, reviewers were asked to provide their median, 25th percentile, and 75th percentile predictions, in order to generate a probability distribution. Metaculus’ website also included information on the resolution criteria of each question, which provided guidance to reviewers on how to answer the question. Individual reviewers were blind to other reviewers’ responses until after the submission deadline, at which point the aggregated results of all of the responses were made public on Metaculus’ website. 

Additionally, in the Google form, reviewers were asked to answer a survey question about their experience: “What did you think about this review process? Did it prompt you to think about the proposal in a different way than when you normally review proposals? If so, how? What did you like about it? What did you not like? What would you change about it if you could?” 

Some participants did not complete their review. We received 19 complete reviews in the end, with each proposal receiving three to six reviews. 

Study Limitations

Our pilot study had certain limitations that should be noted. Since FAS is not a grantmaking institution, we could not completely reproduce the same types of research proposals that a grantmaking institution would receive nor the entire review process. We will highlight these differences in comparison to federal science agencies, which are our primary focus.

  1. Review Process: There are typically two phases to peer review at NIH and NSF. First, at least three individual reviewers with relevant subject matter expertise are assigned to read and evaluate a proposal independently. Then, a larger committee of experts is convened. There, the assigned reviewers present the proposal and their evaluation, and then the committee discusses and determines the final score for the proposal. Our pilot study only attempted to replicate the first phase of individual review.
  2. Sample Size: In our pilot, the sample size was quite small: only five proposals were reviewed, and because they were all in different subfields, different reviewers were assigned to each proposal. NIH and NSF peer review committees typically focus on one subfield and review on the order of twenty or so proposals. The number of reviewers per proposal–three to six–in our pilot was consistent with the number of reviewers typically assigned to a proposal by NIH and NSF. Peer review committees are typically larger, ranging from six to twenty people, depending on the agency and the field.
  3. Proposals: The FRO proposals plus supplementary information were only two to four pages long, which is significantly shorter than the 12 to 15 page proposals that researchers submit for NIH and NSF grants. Proposal authors were asked to generally describe their research concept, but were not explicitly required to describe the details of the research methodology they would use or any preliminary research. Some proposal authors volunteered more information on this for the supplementary information, but not all authors did. 
  4. Grant Size: For the FRO proposals, reviewers were asked to assume that funded proposals would receive $50 million over five years, which is one to two orders of magnitude more funding than typical NIH and NSF proposals.

Appendix B: Feedback on Study-Specific Implementation

In addition to feedback about the review framework, we received feedback on how we implemented our pilot study, specifically the instructions and materials for the review process and the submission platforms. This feedback isn’t central to this paper’s investigation of expected utility forecasting, but we include it in this appendix for transparency.

Reviewers were sent instructions over email that outlined the review process and linked to Metaculus’ webpage for this pilot. On Metaculus’ website, reviewers could find links to the proposals on FAS’ website and the supplementary information in Google docs. Reviewers were expected to read those first and then read through the resolution criteria for each forecasting question before submitting their answers on Metaculus’ platform. Reviewers were asked to submit the explanations behind their forecasts in a separate Google form.

Some reviewers had no problem navigating the review process and found Metaculus’ website easy to use. However, feedback from other reviewers suggested that the different components necessary for the review were spread out over too many different websites, making it difficult for reviewers to keep track of where to find everything they needed.

Some had trouble locating the different materials and pieces of information needed to conduct the review on Metaculus’ website. Others found it confusing to have to submit their forecasts and explanations in two separate places. One reviewer suggested that the explanation of the impact scoring system should have been included within the instructions sent over email rather than in the resolution criteria on Metaculus’ website so that they could have read it before reading the proposal. Another reviewer suggested that it would have been simpler to submit their forecasts through the same Google form that they used to submit their explanations rather than through Metaculus’ website. 

Based on this feedback, we recommend that future implementations streamline submissions to a single platform and provide a single, more comprehensive set of instructions rather than scattering information across different steps of the review process. Training sessions, which science funding agencies typically conduct, would be a good supplement to written instructions.

Appendix C: Total Expected Utility Calculations

To calculate the total expected utility, we first converted all of the impact scores into utility by raising two to the power of the impact score, since the impact scoring system is base 2 exponential:

Utility = 2^(Impact Score).

We then were able to average the utilities for each milestone and conduct additional calculations. 

To calculate the total utility of each milestone, ui, we averaged the social utility and the scientific utility of the milestone:

ui = (Social Utility + Scientific Utility)/2.

The total expected utility (TEU) of a proposal with two milestones can be calculated according to the general equation:

TEU = u1P(m1 ∩ not m2) + u2P(m2 ∩ not m1) + (u1+u2)P(m1 ∩ m2),

where P(mi) represents the probability of success of milestone i and

P(m1 ∩ not m2) = P(m1) – P(m1 ∩ m2)
P(m2 ∩ not m1) = P(m2) – P(m1 ∩ m2).

For sequential milestones, milestone 2 is defined as inclusive of milestone 1 and wholly dependent on the success of milestone 1, so this means that

u2,seq = u1 + u2
P(m2) = Pseq(m1 ∩ m2)
P(m2 ∩ not m1) = 0.

Thus, the total expected utility of sequential milestones can be simplified as

TEU = u1P(m1) – u1P(m2) + u2,seqP(m2)
TEU = u1P(m1) + (u2,seq – u1)P(m2)

This can be generalized to

TEUseq = Σi(ui,seq – ui-1,seq)P(mi).

Otherwise, the total expected utility can be simplified to 

TEU = u1P(m1) + u2P(m2) – (u1+u2)P(m1 ∩ m2).

For independent outcomes, we assume 

Pind(m1 ∩ m2) = P(m1)P(m2), 

so

TEUind = u1P(m1) + u2P(m2) – (u1+u2)P(m1)P(m2).

To present the results in Tables 1 and 2, we converted all of the utility values back into the impact score scale by taking the log base 2 of the results.
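For readers who prefer code to equations, the sketch below is a minimal illustration of the pipeline described above: scores are converted to utilities, combined, fed into the simplified sequential or independent TEU formulas, and converted back to the impact score scale. It is not the code used in our analysis; the example inputs are the rounded proposal 1 means from Table 1, used here only to make the arithmetic concrete.

```python
import math

def score_to_utility(score):
    """Impact score on the base 2 exponential scale -> utility."""
    return 2 ** score

def utility_to_score(utility):
    """Utility -> impact score scale (log base 2)."""
    return math.log2(utility)

def milestone_utility(sci_score, soc_score):
    """Total utility of a milestone: the average of its scientific and social utilities."""
    return (score_to_utility(sci_score) + score_to_utility(soc_score)) / 2

def teu_sequential(p1, u1, p2, u2_seq):
    """TEU for two sequential milestones, where u2_seq is inclusive of milestone 1:
    TEU = u1*P(m1) + (u2_seq - u1)*P(m2)."""
    return u1 * p1 + (u2_seq - u1) * p2

def teu_independent(p1, u1, p2, u2):
    """TEU for two independent outcomes:
    TEU = u1*P(m1) + u2*P(m2) - (u1 + u2)*P(m1)*P(m2)."""
    return u1 * p1 + u2 * p2 - (u1 + u2) * p1 * p2

# Example using the rounded mean forecasts for proposal 1 (sequential) from Table 1.
u1 = milestone_utility(7.83, 7.35)
u2_seq = u1 + milestone_utility(8.22, 8.25)   # milestone 2 utility, inclusive of milestone 1
teu = teu_sequential(p1=0.80, u1=u1, p2=0.41, u2_seq=u2_seq)
print(utility_to_score(teu))                  # TEU expressed back on the impact score scale
```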

Expected Utility Forecasting for Science Funding

The typical science grantmaker seeks to maximize their (positive) impact with a limited amount of money. The decision-making process for how to allocate that funding requires them to consider the different dimensions of risk and uncertainty involved in science proposals, as described in foundational work by economists Chiara Franzoni and Paula Stephan. The Von Neumann-Morgenstern utility theorem implies that there exists for the grantmaker — or the peer reviewer(s) assessing proposals on their behalf — a utility function whose expected value they will seek to maximize. 

Common frameworks for evaluating proposals leave this utility function implicit, often evaluating aspects of risk, uncertainty, and potential value independently and qualitatively. Empirical work has suggested that such an approach may lead to biases, resulting in funding decisions that deviate from grantmakers’ ultimate goals. An expected utility approach to reviewing science proposals aims to make that implicit decision-making process explicit, and thus reduce biases, by asking reviewers to directly predict the probability and value of different potential outcomes occurring. Implementing this approach through forecasting brings the added benefits of providing (1) a resolution and scoring process that could help incentivize reviewers to make better, more accurate predictions over time and (2) empirical estimates of reviewers’ accuracy and tendency to over or underestimate the value and probability of success of proposals.

At the Federation of American Scientists, we are currently piloting this approach on a series of proposals in the life sciences that we have collected for Focused Research Organizations (FROs), a new type of non-profit research organization designed to tackle challenges that neither academia nor industry is incentivized to work on. The pilot study was developed in collaboration with Metaculus, a forecasting platform and aggregator, and is hosted on their website. In this paper, we provide the detailed methodology for the approach that we have developed, which builds upon Franzoni and Stephan’s work, so that interested grantmakers may adapt it for their own purposes. The motivation for developing this approach and how we believe it may help address biases against risk in traditional peer review processes are discussed in our article “Risk and Reward in Peer Review”.

Defining Outcomes

To illustrate how an expected utility forecasting approach could be applied to scientific proposal evaluation, let us first imagine a research project consisting of multiple possible outcomes or milestones. In the most straightforward case, the outcomes that could arise are mutually exclusive (i.e., only a single one will be observed). Indexing each outcome with the letter 𝑖, we can define the expected value of each as the product of its value (or utility; 𝓊𝑖) and the probability of it occurring, 𝑃(𝑚𝑖). Because the outcomes in this example are mutually exclusive, the total expected utility (TEU) of the proposed project is the sum of the expected value of each outcome1:

𝑇𝐸𝑈 = 𝛴𝑖𝓊𝑖𝑃(𝑚𝑖).

However, in most cases, it is easier and more accurate to define the range of outcomes of a research project as a set of primary and secondary outcomes or research milestones that are not mutually exclusive, and can instead occur in various combinations.

For instance, science proposals usually highlight the primary outcome(s) that they aim to achieve, but may also involve important secondary outcome(s) that can be achieved in addition to or instead of the primary goals. Secondary outcomes can be a research method, tool, or dataset produced for the purpose of achieving the primary outcome; a discovery made in the process of pursuing the primary outcome; or an outcome that researchers pivot to pursuing as they obtain new information from the research process. As such, primary and secondary outcomes are not necessarily mutually exclusive. In the simplest scenario with just two outcomes (either two primary or one primary and one secondary), the total expected utility becomes

𝑇𝐸𝑈 = 𝓊1𝑃(𝑚1⋂ not 𝑚2) + 𝓊2𝑃(𝑚2⋂ not 𝑚1) + (𝓊1 + 𝓊2)𝑃(𝑚1⋂𝑚2),

𝑇𝐸𝑈 = 𝓊1(𝑃(𝑚1) – 𝑃(𝑚1⋂𝑚2)) + 𝓊2(𝑃(𝑚2) – 𝑃(𝑚1⋂𝑚2)) + (𝓊1 + 𝓊2)𝑃(𝑚1⋂𝑚2)

𝑇𝐸𝑈 = 𝓊1𝑃(𝑚1) + 𝓊2𝑃(𝑚2) – (𝓊1 + 𝓊2)𝑃(𝑚1⋂𝑚2).

As the number of outcomes increases, the number of joint probability terms increases as well. Assuming the outcomes are independent though, they can be reduced to the product of the probabilities of individual outcomes. For example,

𝑃(𝑚1⋂𝑚2) = 𝑃(𝑚1) * 𝑃(𝑚2)

On the other hand, milestones are typically designed to build upon one another, such that achieving later milestones necessitates the achievement of prior milestones. In these cases, the value of later milestones typically includes the value of prior milestones: for example, the value of demonstrating a complete pilot of a technology is inclusive of the value of demonstrating individual components of that technology. The total expected utility can thus be defined as the sum of the product of the marginal utility of each additional milestone and its probability of success:

𝑇𝐸𝑈 = 𝛴𝑖(𝓊𝑖 – 𝓊𝑖-1)𝑃(𝑚𝑖),
where 𝓊0 = 0.

Depending on the science proposal, either of these approaches — or a combination — may make the most sense for determining the set of outcomes to evaluate.
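To make the bookkeeping above concrete, here is a small illustrative sketch (ours, with hypothetical utilities and probabilities, not pilot data) of the mutually exclusive case and the general two-outcome case:

```python
def teu_mutually_exclusive(utilities, probabilities):
    """TEU when the outcomes are mutually exclusive: the sum of u_i * P(m_i)."""
    return sum(u * p for u, p in zip(utilities, probabilities))

def teu_two_outcomes(u1, u2, p1, p2, p_both):
    """TEU for two non-mutually-exclusive outcomes, given p_both = P(m1 and m2):
    TEU = u1*P(m1 only) + u2*P(m2 only) + (u1 + u2)*P(both)."""
    return u1 * (p1 - p_both) + u2 * (p2 - p_both) + (u1 + u2) * p_both

# Hypothetical primary and secondary outcomes, assumed independent,
# so P(both) = P(m1) * P(m2).
p1, p2 = 0.6, 0.3
print(teu_two_outcomes(u1=100.0, u2=40.0, p1=p1, p2=p2, p_both=p1 * p2))
```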

In our FRO Forecasting pilot, we worked with proposal authors to define two outcomes for each of their proposals. Depending on what made the most sense for each proposal, the two outcomes reflected either relatively independent primary and secondary goals, or sequential milestone outcomes that directly built upon one another (though for simplicity, we called all of the outcomes milestones).

Defining Probability of Success

Once the set of potential outcomes has been defined, the next step is to determine the probability of success, between 0% and 100%, for each outcome if the proposal is funded. A prediction of 50% would indicate the highest level of uncertainty about the outcome, whereas the closer the predicted probability of success is to 0% or 100%, the more certain the reviewer is about whether or not the outcome will occur.

Furthermore, Franzoni and Stephan decompose probability of success into two components: the probability that the outcome can actually occur in nature or reality and the probability that the proposed methodology will succeed in obtaining the outcome (conditional on it being possible in nature). The total probability is then the product of these two components:

𝑃(𝑚𝑖) = 𝑃nature(𝑚𝑖) * 𝑃proposal(𝑚𝑖)

Depending on the nature of the proposal (e.g., more technology-driven, or more theoretical/discovery driven), each component may be more or less relevant. For example, our forecasting pilot includes a proposal to perform knockout validation of renewable antibodies for 10,000 to 15,000 human proteins; for this project, 𝑃nature(𝑚𝑖) approaches 1 and 𝑃proposal(𝑚𝑖) drives the overall probability of success.

Defining Utility

Similarly, the value of an outcome can be separated into its impact on the scientific field and its impact on society at large. Scientific impact aims to capture the extent to which a project advances the frontiers of knowledge, enables new discoveries or innovations, or enhances scientific capabilities or methods. Social impact aims to capture the extent to which a project contributes to solving important societal problems, improving well-being, or advancing social goals. 

In both of these cases, determining the value of an outcome entails some subjective preferences, so there is no “correct” choice, at least mathematically speaking. However, proxy metrics may be helpful in considering impact. Though each is imperfect, one could consider citations of papers, patents on tools or methods, or users of methods, tools, and datasets as proxies of scientific impact. For social impact, some proxy metrics that one might consider are the value of lives saved, the cost of illness prevented, the number of job-years of employment generated, economic output in terms of GDP, or the social return on investment.

The approach outlined by Franzoni and Stephan asks reviewers to assess scientific and social impact on a linear scale (0-100), after which the values can be averaged to determine the overall impact of an outcome. However, we believe that an exponential scale better captures the tendency in science for a small number of research projects to have an outsized impact and provides more room at the top end of the scale for reviewers to increase the rating of the proposals that they believe will have an exceptional impact.

Figure: Exponential relationship between the impact score and actual impact; citation distribution of journal articles.

As such, for our FRO Forecasting pilot, we chose to use a framework in which a simple 1–10 score corresponds to real-world impact via a base 2 exponential scale. In this case, the overall impact score of an outcome can be calculated according to

𝓊𝑖 = log2[2^(scientific impact of 𝑖) + 2^(social impact of 𝑖)] – 1.

For an exponential scale with a different base, one would substitute that base for two in the above equation. Depending on each funder’s specific understanding of impact and the type(s) of proposals they are evaluating, different relationships between scores and utility could be more appropriate.
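As an illustrative check of this formula (our own sketch, with hypothetical scores), combining equal scientific and social scores returns that same score, while a score that is much larger on one dimension dominates the combination:

```python
import math

def combined_impact_score(sci_score, soc_score):
    """Overall impact score on the base 2 scale:
    u_i = log2(2**sci + 2**soc) - 1, i.e. the log2 of the mean of the two utilities."""
    return math.log2(2 ** sci_score + 2 ** soc_score) - 1

print(combined_impact_score(6.0, 6.0))   # equal scores -> 6.0
print(combined_impact_score(9.0, 3.0))   # the exponentially larger score dominates: ~8.02
```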

In order to capture reviewers’ assessment of uncertainty in their evaluations, we asked them to provide median, 25th percentile, and 75th percentile predictions for impact instead of a single prediction. High uncertainty would be indicated by a wide interval between the 25th and 75th percentile predictions, while low uncertainty would be indicated by a narrow interval.

Determining the “But For” Effect of Funding

The above approach aims to identify the highest impact proposals. However, a grantmaker may not want to simply fund the highest impact proposals; rather, they may be most interested in understanding where their funding would make the highest impact — i.e., their “but for” effect. In this case, the grantmaker would want to fund proposals with the maximum difference between the total expected utility of the research proposal if they chose to fund it versus if they chose not to:

“But For” Impact = 𝑇𝐸𝑈(funding) – 𝑇𝐸𝑈(no funding).

For TEU(funding), the probability of the outcome occurring with this specific grantmaker’s funding using the proposed approach would still be defined as above

𝑃(𝑚𝑖 | funding) = 𝑃nature(𝑚𝑖) * 𝑃proposal(𝑚𝑖),

but for 𝑇𝐸𝑈(no funding),  reviewers would need to consider the likelihood of the outcome being achieved through other means. This could involve the outcome being realized by other sources of funding, other researchers, other approaches, etc. Here, the probability of success without this specific grantmaker’s funding could be described as

𝑃(𝑚𝑖 | no funding) = 𝑃nature(𝑚𝑖) * 𝑃other mechanism(𝑚𝑖).

In our FRO Forecasting pilot, we assumed that 𝑃other mechanism(𝑚𝑖) ≈ 0. The theory of change for FROs is that there exists a set of research problems at the boundary of scientific research and engineering that are not adequately supported by traditional research and development models and are unlikely to be pursued by academia or industry. Thus, in these cases it is plausible to assume that,

𝑃(𝑚𝑖 | no funding) ≈ 0
𝑇𝐸𝑈(no funding) ≈ 0
“But For” Impact ≈ 𝑇𝐸𝑈(funding).

This assumption, while not generalizable to all contexts, can help reduce the number of questions that reviewers have to consider — a dynamic which we explore further in the next section.
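The sketch below illustrates the “but for” calculation for a single outcome using the probability decompositions above; the utilities and probabilities are hypothetical, not pilot data.

```python
def p_success(p_nature, p_mechanism):
    """Probability an outcome is achieved: P(m) = P_nature(m) * P_mechanism(m)."""
    return p_nature * p_mechanism

def but_for_impact(utility, p_nature, p_proposal, p_other_mechanism):
    """'But for' impact of funding a single-outcome proposal:
    TEU(funding) - TEU(no funding)."""
    teu_funding = utility * p_success(p_nature, p_proposal)
    teu_no_funding = utility * p_success(p_nature, p_other_mechanism)
    return teu_funding - teu_no_funding

# Hypothetical single-outcome example. If the outcome is very unlikely to be
# achieved by any other funder (p_other_mechanism near 0), the "but for" impact
# approaches TEU(funding), as assumed in our FRO pilot.
print(but_for_impact(utility=256.0, p_nature=0.9, p_proposal=0.5, p_other_mechanism=0.05))
```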

Designing Forecasting Questions

Once one has determined the total expected utility equation(s) relevant for the proposal(s) that they are trying to evaluate, the parameters of the equation(s) must be translated into forecasting questions for reviewers to respond to. In general, for each outcome, reviewers will need to answer the following four questions:

  1. If this proposal is funded, what is the probability that this outcome will occur?
  2. If this proposal is not funded, what is the probability that this outcome will still occur? 
  3. What will be the scientific impact of this outcome occurring?
  4. What will be the social impact of this outcome occurring?

For the probability questions, one could alternatively ask reviewers about the different probability components (𝑃nature(𝑚𝑖), 𝑃proposal(𝑚𝑖), 𝑃other mechanism(𝑚𝑖), etc.), but in most cases it will be sufficient — and simpler for the reviewer — to focus on the top-level probabilities that feed into the TEU calculation.

In order for the above questions to tap into the benefits of the forecasting framework, they must be resolvable. Resolving the forecasting questions means that at a set time in the future, reviewers’ predictions will be compared to a ground truth based on the actual events that have occurred (i.e., was the outcome actually achieved and, if so, what was its actual impact?). Consequently, reviewers will need to be provided with the resolution date and the resolution criteria for their forecasts. 

Resolution of the probability-based questions hinges mostly on a careful and objective definition of the potential outcomes, and is otherwise straightforward — though note that only one of the two probability questions will be resolved, since the funded and unfunded scenarios are mutually exclusive. The optimal resolution of the scientific and social impact questions may depend on the context of the project and the chosen approach to defining utility. A widely applicable approach is to resolve the utility forecasts by having either program managers or subject matter experts evaluate the results of the completed project and score its impact at the resolution date.

For our pilot, we asked forecasting questions only about the probability of success given funding (question 1 above) and the scientific and social impact of each outcome (questions 3 and 4); since we assumed that the probability of success without funding was zero, we did not ask question 2. Because outcomes for the FRO proposals were designed to be either independent or sequential, we did not have to ask additional questions on the joint probability of multiple outcomes being achieved. We chose to resolve our impact questions with a post-project panel of subject matter experts.
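As one way to make the bookkeeping concrete, the sketch below (Python) shows how the per-outcome questions and their resolution metadata could be represented; the class, field names, and example values are hypothetical and are not the pilot's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class OutcomeForecastQuestions:
    """Forecasting questions attached to a single proposal outcome (hypothetical schema)."""
    outcome: str                                      # objective definition of the outcome
    resolution_date: date                             # when forecasts are compared to ground truth
    resolution_criteria: str                          # how achievement and impact will be judged
    p_success_if_funded: Optional[float] = None       # question 1
    p_success_if_not_funded: Optional[float] = None   # question 2 (skipped in our pilot)
    scientific_impact: Optional[dict] = None          # question 3: {"p25": ..., "median": ..., "p75": ...}
    social_impact: Optional[dict] = None              # question 4: same percentile format

# Hypothetical example mirroring the pilot's setup (question 2 omitted because the
# probability of success without funding was assumed to be ~0).
q = OutcomeForecastQuestions(
    outcome="Milestone 1 is achieved and released as an open resource",
    resolution_date=date(2028, 12, 31),
    resolution_criteria="Post-project panel of subject matter experts scores achievement and impact",
    p_success_if_funded=0.55,
    scientific_impact={"p25": 4, "median": 5, "p75": 7},
    social_impact={"p25": 3, "median": 4, "p75": 6},
)
```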

Additional Considerations

In general, there is a tradeoff in implementing this approach between simplicity and thoroughness, efficiency and accuracy. Here are some additional considerations on that tradeoff for those looking to use this approach:

  1. The responsibility of determining the range of potential outcomes for a proposal could be assigned to three different parties: the proposal author, the proposal reviewers, or the program manager. First, grantmakers could ask proposal authors to comprehensively define within their proposal the potential primary and secondary outcomes and/or project milestones. Alternatively, reviewers could be allowed to individually — or collectively — determine what they see as the full range of potential outcomes. The third option would be for program managers to define the potential outcomes based on each proposal, with or without input from proposal authors. In our pilot, we chose the third approach with input from proposal authors, since it simplified the process for reviewers and allowed us to limit the number of outcomes under consideration to a manageable number.
  2. In many cases, a “failed” or null outcome may still provide meaningful value by informing other scientists that the research method doesn’t work or that the hypothesis is unlikely to be true. Considering the replication crises in multiple fields, this could be an important and unaddressed aspect of peer review. Grantmakers could choose to ask reviewers to consider the value of these null outcomes alongside other outcomes to obtain a more complete picture of the project’s utility. We chose not to address this consideration in our pilot for the sake of limiting the evaluation burden on reviewers.
  3. If grant recipients are permitted greater flexibility in their research agendas, this expected value approach could become more difficult to implement, since reviewers would have to consider a wider and more uncertain range of potential outcomes. This was not the case for our FRO Forecasting pilot, since FROs are designed to have specific and well-defined research goals.

Other Similar Efforts

Forecasting is currently a rarely used approach in grantmaking. Open Philanthropy is the only grantmaking organization we know of that has publicized its use of internal forecasts about grant-related outcomes, though its forecasts do not directly influence funding decisions and are not framed specifically in terms of expected value. Franzoni and Stephan are also currently piloting their Subjective Expected Utility approach with Novo Nordisk.

Conclusion

Our goal in publishing this methodology is for interested grantmakers to freely adapt it to their own needs and iterate on our approach. We hope that this paper will help start a conversation in the science research and funding communities that leads to further experimentation. A follow-up report sharing the results and learnings from the FRO Forecasting pilot will be published at the project’s conclusion.

Acknowledgements

We’d like to thank Peter Mühlbacher, former research scientist at Metaculus, for his meticulous feedback as we developed this approach and for his guidance in designing resolvable forecasting questions. We’d also like to thank the rest of the Metaculus team for being open to our ideas and working with us on piloting this approach, the process of which has helped refine our ideas to their current state. Any mistakes here are of course our own.

Risk and Reward in Peer Review

This article was written as a part of the FRO Forecasting project, a partnership between the Federation of American Scientists and Metaculus. This project aims to conduct a pilot study of forecasting as an approach for assessing the scientific and societal value of proposals for Focused Research Organizations. To learn more about the project, see the press release here. To participate in the pilot, you can access the public forecasting tournament here.

The United States federal government is the single largest funder of scientific research in the world. Thus, the way that science agencies like the National Science Foundation and the National Institutes of Health distribute research funding has a significant impact on the trajectory of science as a whole. Peer review is considered the gold standard for evaluating the merit of scientific research proposals, and agencies rely on peer review committees to help determine which proposals to fund. However, peer review has its own challenges. It is a difficult task to balance science agencies’ dual mission of protecting government funding from being spent on overly risky investments while also being ambitious in funding proposals that will push the frontiers of science, and research suggests that peer review may be designed more for the former than the latter. We at FAS are exploring innovative approaches to peer review to help tackle this challenge.

Biases in Peer Review

A frequently echoed concern across the scientific and metascientific community is that funding agencies’ current approach to peer review of science proposals tends to be overly risk-averse, leading to bias against proposals that entail high risk or high uncertainty about the outcomes. Reasons for this conservativeness include reviewer preferences for feasibility over potential impact, contagious negativity, and problems with the way that peer review scores are averaged together.

This concern, alongside studies suggesting that scientific progress is slowing down, has led to a renewed effort to experiment with new ways of conducting peer review, such as golden tickets and lottery mechanisms. While golden tickets and lottery mechanisms aim to complement traditional peer review with alternate means of making funding decisions — namely individual discretion and randomness, respectively — they don’t fundamentally change the way that peer review itself is conducted. 

Traditional peer review asks reviewers to assess research proposals based on a rubric of several criteria, which typically include potential value, novelty, feasibility, expertise, and resources. Each criterion is given a score on a numerical scale; for example, the National Institutes of Health uses a scale from 1 (best) to 9 (worst). Reviewers then provide an overall score that need not be calculated in any specific way from the criteria scores. Next, all of the reviewers convene to discuss the proposal and submit their final overall scores, which may differ from what they submitted prior to the discussion. The final overall scores are averaged across all of the reviewers for a specific proposal. Proposals are then ranked based on their average overall score, and funding is prioritized for those that score better than a certain cutoff, though depending on the agency, some discretion by program administrators is permitted.

The way that this process is designed allows the biases mentioned at the beginning—reviewer preferences for feasibility, contagious negativity, and averaging problems—to influence funding decisions. First, reviewer discretion in deciding overall scores allows them to weigh feasibility more heavily than potential impact and novelty in their final scores. Second, when evaluations are discussed, reviewers tend to adjust their scores to better align with their peers. This adjustment tends to be greater when correcting in the negative direction than in the positive direction, resulting in a stronger negative bias. Lastly, since funding tends to be quite limited, cutoff scores tend to be quite close to the best score. This means that even if almost all of the reviewers rate a proposal positively, one very negative review can potentially bring the average below the cutoff.
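A toy calculation illustrates the averaging problem; the scores and cutoff below are invented for illustration and use the NIH convention of 1 (best) to 9 (worst), where a lower average is better.

```python
# Hypothetical panel scores on a 1 (best) to 9 (worst) scale.
scores_without_outlier = [2, 2, 3, 2]        # broadly positive panel
scores_with_outlier = [2, 2, 3, 2, 8]        # same panel plus one very negative review

cutoff = 3.0  # hypothetical payline: only averages at or below this are funded

for scores in (scores_without_outlier, scores_with_outlier):
    avg = sum(scores) / len(scores)
    verdict = "funded" if avg <= cutoff else "not funded"
    print(scores, "-> average", round(avg, 2), "->", verdict)

# [2, 2, 3, 2] -> average 2.25 -> funded
# [2, 2, 3, 2, 8] -> average 3.4 -> not funded
```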

Designing a New Approach to Peer Review

In 2021, the researchers Chiara Franzoni and Paula Stephan published a working paper arguing that risk in science results from three sources of uncertainty: uncertainty of research outcomes, uncertainty of the probability of success, and uncertainty of the value of the research outcomes. To comprehensively and consistently account for these sources of uncertainty, they proposed a new expected utility approach to peer review evaluations, in which reviewers are asked to

  1. Identify the primary expected outcome of a research proposal and, optionally, a potential secondary outcome;
  2. Assess the probability, between 0 and 1, of achieving each expected outcome (P(j)); and
  3. Assess the value of achieving each expected outcome (uj) on a numerical scale (e.g., 0 to 100).

From this, the total expected utility can be calculated for each proposal and used to rank them.1 This systematic approach addresses the first bias we discussed by limiting the extent to which reviewers’ preferences for more feasible proposals would impact the final score of each proposal.
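On one plausible reading of this setup (a sketch, not Franzoni and Stephan's exact specification), a proposal's total expected utility is the sum over its identified outcomes of P(j) multiplied by u(j), and proposals are then ranked by that total, as in the illustrative Python below.

```python
# Hypothetical proposals, each a list of (probability, value) pairs per expected outcome.
proposals = {
    "A": [(0.9, 20)],             # safe proposal with a modest primary outcome
    "B": [(0.3, 80), (0.5, 10)],  # risky primary outcome plus a secondary outcome
}

def total_expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

ranking = sorted(proposals, key=lambda name: total_expected_utility(proposals[name]), reverse=True)
print(ranking)  # ['B', 'A']: B's expected utility (29.0) edges out A's (18.0)
```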

We at FAS see a lot of potential in Franzoni and Stephan’s expected value approach to peer review, and it inspired us to design a pilot study using a similar approach that aims to chip away at the other biases in review.

To explore potential solutions for negativity bias, we are taking a cue from forecasting by complementing the peer review process with a resolution and scoring process. This means that at a set time in the future, reviewers’ assessments will be compared to a ground truth based on the actual events that have occurred (i.e., was the outcome actually achieved and, if so, what was its actual impact?). Our theory is that if implemented in peer review, resolution and scoring could incentivize reviewers to make better, more accurate predictions over time and provide empirical estimates of a committee’s tendency to provide overly negative (or positive) assessments, thus potentially countering the effects of contagion during review panels and helping more ambitious proposals secure support. 

Additionally, we sought to design a new numerical scale for assessing the value or impact of a research proposal, which we call an impact score. Typically, peer reviewers are free to interpret the numerical scale for each criterion as they wish; Franzoni and Stephan’s design also did not specify how the numerical scale for the value of the research outcome should work. We decided to use a scale ranging from 1 (low) to 10 (high) that is base 2 exponential, meaning that a proposal that receives a score of 5 has double the impact of a proposal that receives a score of 4, and quadruple the impact of a proposal that receives a score of 3.
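As a quick check of the doubling property (Python; the absolute normalization of "impact" is a separate design choice, so only the ratios are meaningful here):

```python
def relative_impact(score: float, base: float = 2.0) -> float:
    """Impact implied by a 1-10 score on a base-2 exponential scale, up to a constant factor."""
    return base ** score

print(relative_impact(5) / relative_impact(4))  # 2.0: a score of 5 is double a score of 4
print(relative_impact(5) / relative_impact(3))  # 4.0: and quadruple a score of 3
```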

Figure 1. Plot demonstrating the exponential nature of the impact score: a score of 1 corresponds to roughly zero impact, while a score of 10 corresponds to an impact of roughly 1,000.
Table 1. Example of how to interpret the impact score.
Score | Impact
1 | None or negative
2 | Minimal
3 | Low or mixed
4 | Moderate
5 | High
6 | Very high
7 | Exceptional
8 | Transformative
9 | Revolutionary
10 | Paradigm-shifting

The choice of an exponential scale reflects the tendency in science for a small number of research projects to have an outsized impact (Figure 2), and provides more room at the top end of the scale for reviewers to increase the rating of the proposals that they believe will have an exceptional impact. We believe that this could help address the last bias we discussed, which is that currently, bad scores are more likely to pull a proposal’s average below the cutoff than good scores are likely to pull a proposal’s average above the cutoff.

Figure 2. Citation distribution of accepted and rejected journal articles

We are now piloting this approach on a series of proposals in the life sciences that we have collected for Focused Research Organizations, a new type of non-profit research organization designed to tackle challenges that neither academia nor industry is incentivized to work on. The pilot study was developed in collaboration with Metaculus, a forecasting platform and aggregator, and will be hosted on their website. We welcome subject matter experts in the life sciences — or anyone interested! — to participate in making forecasts on these proposals here. Stay tuned for the results of this pilot, which we will publish in a report early next year.

Enabling Faster Funding Timelines in the National Institutes of Health

Summary

The National Institutes of Health (NIH) funds some of the world’s most innovative biomedical research, but rising administrative burden and extended wait times—even in crisis—have shown that its funding system is in desperate need of modernization. Examples of promising alternative models exist: in the last two years, private “fast science funding” initiatives such as Fast Grants and Impetus Grants have delivered breakthroughs in pandemic response and aging research on timelines of days to one month, significantly faster than the NIH’s yearly funding cycles. In response to the COVID-19 pandemic, the NIH implemented a temporary fast funding program called RADx, indicating a willingness to adopt such practices during acute crises. Research on other critical health challenges like aging, the opioid epidemic, and pandemic preparedness deserves similar urgency. We therefore believe it is critical that the NIH formalize and expand its institutional capacity for rapid funding of high-potential research.

Using the learnings of these fast funding programs, this memo proposes actions that the NIH could take to accelerate research outcomes and reduce administrative burden. Specifically, the NIH director should consider pursuing one of the following approaches to integrate faster funding mechanisms into its extramural research programs: 

Future efforts by the NIH and other federal policymakers to respond to crises like the COVID-19 pandemic would also benefit from a clearer understanding of the impact of the decision-making process and actions taken by the NIH during the earliest weeks of the pandemic. To that end, we also recommend that Congress initiate a report from the Government Accountability Office to illuminate the outcomes and learnings of fast governmental programs during COVID-19, such as RADx.

Challenge and Opportunity

The urgency of the COVID-19 pandemic created adaptations not only in how we structure our daily lives but in how we develop therapeutics and fund science. Starting in 2020, the public saw a rapid emergence of nongovernmental programs like Fast Grants, Impetus Grants, and Reproductive Grants to fund both big clinical trials and proof-of-concept scientific studies within timelines that were previously thought to be impossible. Within the government, the NIH launched RADx, a program for the rapid development of coronavirus diagnostics with significantly accelerated approval timelines. Though the sudden onset of the pandemic was unique, we believe that an array of other biomedical crises deserve the same sense of urgency and innovation. It is therefore vital that the new NIH director permanently integrate fast funding programs like RADx into the NIH in order to better respond to these crises and accelerate research progress for the future. 

To demonstrate why, we must remember that the coronavirus is far from an outlier—in the last 20 years, humanity has gone through several major outbreaks, notably swine flu, SARS CoV-1, and Ebola. Based on the long-observed history of infectious diseases, the risk of a pandemic with an impact similar to that of COVID-19 is about two percent in any given year. A related challenge is the ongoing epidemic of opioid use and addiction. The rapidly changing landscape of opioid use—with overdose rates growing rapidly and synthetic opioid formulations becoming more common—makes slow, incremental grantmaking ill-suited for the task. The counterfactual impact of providing some awards via faster funding mechanisms in these cases is self-evident: having tests, trials, and interventions earlier saves lives and saves money, without requiring additional resources.

Beyond acute crises, there are strong longer-term public health motivations for achieving faster funding of science. In about 10 years, the United States will have more seniors (people aged 65+) than children. This will place substantial stress on the U.S. healthcare system, especially given that two-thirds of seniors suffer from more than one chronic disease. New disease treatments may help, but it often takes years to translate the results of basic research into approved drugs. The idiosyncrasies of drug discovery and clinical trials make them difficult to accelerate at scale, but we can reliably accelerate drug timelines on the front end by reducing the time researchers spend in writing and reviewing grants—potentially easing the long-term stress on U.S. healthcare.

The existing science funding system developed over time with the best intentions, but for a variety of reasons—partly because the supply of federal dollars has not kept up with demand—administrative requirements have become a major challenge for many researchers. According to surveys, working scientists now spend 44% of their research time on administrative activities and compliance, with roughly half of that time spent on pre-award activities. Over 60% of scientists say administrative burden compromises research productivity, and many fear it discourages students from pursuing science careers. In addition, the wait for funding can be extensive: one of the major NIH grants, the R01, takes more than three months to write and around 8–20 months to receive (see FAQ). Even proof-of-concept ideas face onerous review processes and take at least a year to fund. This can bottleneck potentially transformative ideas, as when Katalin Karikó famously struggled to get funding for her breakthrough mRNA vaccine work in its early stages. These issues have been of interest to science policymakers for more than two decades, but with little to show for it.

Though several nongovernmental organizations have attempted to address this need, the model of private citizens continuously fundraising to enable fast science is neither sustainable nor substantial enough compared to the impact of the NIH. We believe that a coordinated governmental effort is needed to revitalize American research productivity and ensure a prompt response to national—and international—health challenges like naturally occurring pandemics and imminent demographic pressure from age-related diseases. The new NIH director has an opportunity to take bold action by making faster funding programs a priority under their leadership and a keystone of their legacy. 

The government’s own track record with such programs gives grounds for optimism. In addition to the aforementioned RADx program at NIH, the National Science Foundation (NSF) runs the Early-Concept Grants for Exploratory Research (EAGER) and Rapid Response Research (RAPID) programs, which can have response times of a matter of weeks. Going back further in history, during World War II, the National Defense Research Committee maintained a one-week review process.

Faster grant review processes can either be integrated into existing grant programs or rolled out by institutes as temporary grant initiatives responding to pressing needs, as the RADx program was. For example, when faced with data falsification around the beta amyloid hypothesis, the National Institute on Aging (NIA) could leverage fast grant review infrastructure to quickly fund replication studies of key papers without waiting for the next funding cycle. In the case of threats to human health from toxins, the National Institute of Environmental Health Sciences (NIEHS) could rapidly fund studies on risk assessment and prevention, giving evidence-based public recommendations without delay. Finally, empowering the National Institute of Allergy and Infectious Diseases (NIAID) to quickly fund science would prepare us for many yet-to-come pandemics.

Plan of Action

The NIH is a decentralized organization, with institutes and centers (ICs) that each have their own mission and focus areas. While the NIH Office of the Director sets general policies and guidelines for research grants, individual ICs have the authority to create their own grant programs and define their goals and scope. The Center for Scientific Review (CSR) is responsible for the peer review process used to review grants across the NIH and recently published new guidelines to simplify the review criteria. Given this organizational structure, we propose that the NIH Office of the Director, particularly the Office of Extramural Research, assess opportunities for both NIH-wide and institute-specific fast funding mechanisms and direct the CSR, institutes, and centers to produce proposed plans for fast funding mechanisms within one year. The Director’s Office should consider the following approaches. 

Approach 1. Develop an expedited peer review process for the existing R21 grant mechanism to bring it more in line with the NIH’s own goals of funding high-reward, rapid-turnaround research. 

The R21 program is designed to support high-risk, high-reward, rapid-turnaround, proof-of-concept research. However, it has been historically less popular among applicants compared to the NIH’s traditional research mechanism, the R01. This is in part due to the fact that its application and review process is known to be only slightly less burdensome than the R01, despite providing less than half of the financial and temporal support. Therefore, reforming the application and peer review process for the R21 program to make it a fast grant–style award would both bring it more in line with its own goals and potentially make it more attractive to applicants. 

All ICs follow identical yearly cycles for major grant programs like the R21, and the CSR centrally manages the peer review process for these grant applications. Thus, changes to the R21 grant review process must be spearheaded by the NIH director and coordinated in a centralized manner with all parties involved in the review process: the CSR, program directors and managers at the ICs, and the advisory councils at the ICs. 

The track record of federal and private fast funding initiatives demonstrates that faster funding timelines can be feasible and successful (see FAQ). Among the key learnings and observations of public efforts that the NIH could implement are:

Pending the success of these changes, the NIH should consider applying similar changes to other major research grant programs.

Approach 2. Direct NIH institutes and centers to independently develop and deploy programs with faster funding timelines using Other Transaction Authority (OTA).

Compared to reforming an existing mechanism, the creation of institute-specific fast funding programs would allow for context-specific implementation and cross-institute comparison. This could be accomplished using OTA—the same authority used by the NIH to implement COVID-19 response programs. Since 2020, all ICs at the NIH have had this authority and may implement programs using OTA with approval from the director of NIH, though many have yet to make use of it.

As discussed previously, the NIA, NIDA, and NIAID would be prime candidates for the roll-out of faster funding. In particular, these new programs could focus on responding to time-sensitive research needs within each institute or center’s area of focus—such as health crises or replication of linchpin findings—that would provide large public benefits. To maintain this focus, these programs could restrict investigator-initiated applications and only issue funding opportunity announcements for areas of pressing need. 

To enable faster peer review of applications, ICs should establish (a) new study section(s) within their Scientific Review Branch dedicated to rapid review, similar to how the RADx program had its own dedicated review committees. Reviewers who join these study sections would commit to short meetings on a monthly or bimonthly basis rather than meeting three times a year for one to two days as traditional study sections do. Additionally, as recommended above, these new programs should have a three-page limit on applications to reduce the administrative burden on both applicants and reviewers. 

In this framework, we propose that the ICs be encouraged to direct at least one percent of their budgets to establishing new research programs with faster funding processes. We believe that even one percent of the annual budget is sufficient to launch initial fast grant programs within individual institutes. For example, the National Institute on Aging had an operating budget of $4 billion in the 2022 fiscal year. One percent of this budget would constitute $40 million for faster funding initiatives, which would be on the order of the initial budgets of Impetus Grants and Fast Grants ($25 million and $50 million, respectively).

NIH ICs should develop success criteria in advance of launching new fast funding programs. If the success criteria are met, they should gradually increase the budget and expand the scope of the program by allowing for investigator-initiated applications, making it a real alternative to R01 grants. A precedent for this type of grant program growth is the Maximizing Investigators’ Research Award (MIRA) (R35) grant program within the National Institute of General Medical Sciences (NIGMS), which set the goal of funding 60% of all R01 equivalent grants through MIRA by 2025. In the spirit of fast grants, we recommend setting a deadline on how long each institute can take to establish a fast grants program to ensure that the process does not extend for too many years.

Additional recommendation. Congress should initiate a Government Accountability Office report to illuminate the outcomes and learnings of governmental fast funding programs during COVID-19, such as RADx.

While a number of published papers cite RADx funding, the program’s overall impact and efficiency haven’t yet been assessed. We believe that the agency’s response during the pandemic isn’t yet well-understood but likely played an important role. Illuminating the learnings of these interventions would greatly benefit future emergency fast funding programs.

Conclusion

The NIH should become a reliable agent for quickly mobilizing funding to address emergencies and for accelerating solutions to longer-term pressing issues. At present, no funding mechanisms within the NIH or its institutes enable them to react to such matters rapidly. However, both private and governmental initiatives show that fast funding programs are not only possible but can also be extremely successful. Given this, we propose the creation of permanent fast grants programs within the NIH and its institutes based on learnings from past initiatives.

The changes proposed here are part of a larger effort from the scientific community to modernize and accelerate research funding across the U.S. government. In the current climate of rapidly advancing technology and increasing global challenges, it is more important than ever for U.S. agencies to stay at the forefront of science and innovation. A fast funding mechanism would enable the NIH to be more agile and responsive to the needs of the scientific community and would greatly benefit the public through the advancement of human health and safety.

Frequently Asked Questions
What actions, besides RADx, did the NIH take in response to the COVID-19 pandemic?

The NIH released a number of Notices of Special Interest to allow emergency revision to existing grants (e.g., PA-20-135 and PA-18-591) and a quicker path for commercialization of life-saving COVID technologies (NOT-EB-20-008). Unfortunately, repurposing existing grants reportedly took several months, significantly delaying impactful research.

What does the current review process look like?

The current scientific review process at the NIH involves multiple stakeholders. There are two stages of review, with the first stage conducted by a Scientific Review Group that consists primarily of nonfederal scientists. Typically, Center for Scientific Review committees meet three times a year for one or two days, so the initial review does not begin until roughly four months after proposal submission. Special Emphasis Panel meetings, which are not recurring, take even longer due to panel recruitment and scheduling. The Institute and Center National Advisory Councils or Boards are responsible for the second stage of review, which usually happens after revision and appeals, bringing the total timeline to approximately a year.

Is there evidence for the NIH’s current approach to scientific review?

Because of the difficulty of empirically studying drivers of scientific impact, there has been little research evaluating peer review’s effects on scientific quality. A Cochrane systematic review from 2007 found no studies directly assessing review’s effects on scientific quality, and a recent Rand review of the literature in 2018 found a similar lack of empirical evidence. A few more recent studies have found modest associations between NIH peer review scores and research impact, suggesting that peer review may indeed successfully identify innovative projects. However, such a relationship still falls short of demonstrating that the current model of grant review reliably leads to better funding outcomes than alternative models. Additionally, some studies have demonstrated that the current model leads to variable and conservative assessments. Taken together, we think that experimentation with models of peer review that are less burdensome for applicants and reviewers is warranted.

One concern with faster reviews is lower scientific quality. How do you ensure high-quality science while keeping response times fast and proposals short?

Intuitively, it seems that having longer grant applications and longer review processes ensures that both researchers and reviewers expend great effort to address pitfalls and failure modes before research starts. However, systematic reviews of the literature have found that reducing the length and complexity of applications has minimal effects on funding decisions, suggesting that the quality of resulting science is unlikely to be affected. 


Historical examples also suggest that the quality of an endeavor is largely uncorrelated with its planning time. It took Moderna 45 days from the publication of the COVID-19 genome to submit the mRNA-1273 vaccine to the NIH for use in its Phase 1 clinical study. Such examples exist within government too: during World War II, the National Defense Research Committee set a record by reviewing and authorizing grants within one week, which led to the DUKW, Project Pigeon, the proximity fuze, and radar.


Recent fast grant initiatives have produced high-quality outcomes. With its short applications and next-day response times, Fast Grants enabled:



  • detection of new concerning COVID-19 variants before other sources of funding became available.

  • work that showed saliva-based COVID-19 tests can work just as well as those using nasopharyngeal swabs.

  • drug-repurposing clinical trials, one of which identified a generic drug reducing hospitalization from COVID-19 by ~40%. 

  • research into “Long COVID,” which is now being followed up with a clinical trial on the ability of COVID-19 vaccines to improve symptoms.


Impetus Grants focused on projects with longer timelines but led to a number of important preprints less than a year from the moment applicants applied:



With the heavy toll that resource-intensive approaches to peer review take on the speed and innovative potential of science—and the early signs that fast grants lead to important and high-quality work—we feel that the evidentiary burden should be placed on current onerous methods rather than the proposed streamlined approaches. Without strong reason to believe that the status quo produces vastly improved science, we feel there is no reason to add years of grant writing and wait times to the process.

Why focus on the NIH, as opposed to other science funding agencies?

The adoption of faster funding mechanisms would indeed be valuable across a range of federal funding agencies. Here, we focus on the NIH because its budget for extramural research (over $30 billion per year) represents the single largest source of science funding in the United States. Additionally, the NIH’s umbrella of health and medical science includes many domains that would be well-served by faster research timelines for proof-of-concept studies—including pandemics, aging, opioid addiction, mental health, cancer, etc.

Supercharging Biomedical Science at the National Institutes of Health

Summary

For decades, the National Institutes of Health (NIH) has been the patron of groundbreaking biomedical research in the United States. NIH has paved the way for life-saving gene therapies, cancer treatments, and most recently, mRNA vaccines. More than 80% of NIH’s $42 billion budget supports extramural research, including nearly 50,000 grants disbursed to more than 300,000 researchers.

But the NIH has grown increasingly incremental in its funding decisions. The result is a U.S. biomedical-research enterprise discouraged from engaging in the risk-taking and experimentation needed to foster scientific breakthroughs. To maximize returns on its massive R&D budget, NIH should consider the following actions:

Challenge and Opportunity

Each year, federal science agencies allocate billions of dollars to launch new research initiatives and to create novel grant mechanisms. But an embarrassingly tiny amount is invested in discerning which funding policies are actually effective. Despite having the requisite data, methods, and technology, science agencies such as NIH do not subject science-funding policies to nearly the same rigor as the funded science itself.

Another problem plaguing science funding at NIH is that it is difficult for scientists to secure funding for risky but potentially transformative work. When NIH’s peer-review process was designed more than half a century ago, over half of grant applications to the agency were funded. NIH’s proposal-success rate has dropped to 15% today. Even credible researchers must submit an ever-growing number of proposals in order to have a reasonable chance of securing funding. The result is that scientists spend almost half of their working time on average writing grants—time that could otherwise be spent conducting research and training other scientists. Our nation has created a federally funded research ecosystem that makes scientists beg, fight, and rewrite to do the work they’ve spent years training to do.

Compounding the problem is the fact that fewer and fewer early-career researchers are getting adequate support to do their work. Indeed, it takes fewer years to become an experienced surgeon than it does to launch a biomedical research career and obtain a first R01 grant from NIH (the average age of R01 grantees in 2020 was 44 years). When we place hurdles in front of young scientists, we lose out on empowering them at a particularly innovative career stage.1 Limited access to funding early on hamstrings the ability of early-career scientists to set up labs, tackle interesting ideas, and train the next generation. And the early careers of young scientists are often judged by their publishing records, which has the pernicious effect of guiding young scientists to propose safe research that will easily pass peer review. 

A scientific ecosystem that incentivizes incrementalism instead of impact discourages scientists from bringing their best, most creative ideas to the table2 — an effect multiplied for women and underrepresented minorities. The risky research underpinning mRNA vaccines would struggle to be funded under today’s peer-review system. To catalyze groundbreaking biomedical research—and lead the way for other federal science-funding agencies to follow suit—NIH should reconsider how it funds research, what it funds, and who it funds. The Plan of Action presented below includes recommendations aligned with each of these policy questions.

Plan of Action

Recommendation 1. Diversify and assess NIH’s grant-funding mechanisms.

In 2020, privately funded COVID “Fast Grants” accelerated pandemic science by allocating over $50 million in grants awarded within 48 hours of proposal receipt. In a world where grant proposals typically take months to prepare and months more to receive a decision, Fast Grants offered a welcome departure from the norm. The success of Fast Grants signals that federal research funders like the NIH can and must adopt faster, more flexible approaches to scientific grantmaking—an approach that improves productivity and impact by getting scientists the resources they need when they need them. 

While Fast Grants have received a great deal of attention for their novelty and usefulness during a crisis, it’s unclear whether the wealth of experimental funding approaches that the NIH has tried—such as its R21 grant for developmental research, or its K99 grant for on-ramping postdoctoral researchers to traditional R01 grant funding—have positively impacted scientific productivity. Indeed, NIH has never rigorously assessed the efficacy of these approaches. NIH must institute mechanisms for evaluating the success of funding experiments to understand how to optimize its resources and stretch R&D dollars as far as possible. 

As such, the NIH Director should establish a “Science of Science Funding” Working Group within the NIH’s Advisory Committee to the Director. The Working Group should be tasked with (1) evaluating the efficacy of existing funding mechanisms at the NIH and (2) piloting three to five experimental funding mechanisms. The Working Group should also suggest a structure for evaluating existing and novel funding mechanisms through randomized controlled trials (RCTs), and should recommend ways in which the NIH can expand its capacity for policy evaluation (see FAQ for more on RCTs).

Novel funding mechanisms that the Working Group could consider include:

This Working Group should be chaired by the incoming Director of Extramural Research and should include other NIH leaders (such as the Director of the Office of Strategic Coordination and the Director of the Office of Research Reporting and Analysis) as participants. The Working Group should also include members from other federal science agencies such as NSF and NASA. The Working Group should include and/or consult with diverse faculty at all career stages as well. Buy-in from the NIH Director will be crucial for this group to enact transformative change.

Lastly, the Working Group should seek to open the NIH up to outside evaluation by the public. Full access to grantmaking data and the corresponding outcomes could unlock transformative insights that holistically uplift the biomedical community. While the NIH has a better track record of data sharing than some other science-funding agencies, there is still a long way to go. One key step is putting data on grant applicants in an open-access database (with privacy-preserving properties) so that it can be analyzed and merged with other relevant datasets, informing decision-making. Opening up data on grant applicants and their outcomes also supports external evaluation—paving the way for other groups to augment NIH evaluations conducted internally, as well as helping keep the NIH accountable for its programmatic outcomes.

Recommendation 2. Foster a culture of scientific risk-taking by funding more high-risk, high-reward grants.

Uncertainty is a hallmark of breakthrough scientific discovery. The research that led to rapid development of mRNA COVID vaccines, for instance, would have struggled to get funded through traditional funding channels.  NIH has taken some admirable steps to encourage risk-taking. Since 2004, NIH has rolled out a set of High-Risk, High-Reward (HRHR) grant-funding mechanisms (Table 1). The agency’s evaluations have found that its HRHR grants have led to increased scientific productivity relative to other grant types. Yet HRHR grants account for a vanishingly small percentage of NIH’s extramural R&D funding. Only 85 HRHR grants were awarded in all of 2020, compared to 7,767 standard R01 grants awarded in the same year.3 Such disproportionate allocation of funds to safe and incremental research largely yields safe and incremental results. Additionally, it should be noted that designating specific programs “high-risk, high-reward” does not necessarily guarantee that those programs are funding high-risk, high-reward research in reality.

Award | Purpose | Funding Amount | # Awarded in 2020
New Innovator Award | For exceptionally creative early-career scientists proposing innovative, high-impact projects | $1.5M/5 yrs | 53
Pioneer Award | For individuals of exceptional creativity proposing pioneering approaches, at all career stages | $3.5M/5 yrs | 10
Transformative Research Award | For individuals or teams proposing transformative research that may require very large budgets | No cap | 9
Early Independence Award | For outstanding junior scientists wishing to “skip the postdoc” and immediately begin independent research | $250K/yr | 12
R01 (NIH’s flagship grant) | For mature research projects that are hypothesis-driven with strong preliminary data | $250K/yr | 7,767
Table 1: NIH’s High-Risk, High-Reward grant mechanisms and its flagship R01 grant.

It is time for the NIH to actively foster a culture of scientific risk-taking. The agency can do this by balancing funding relatively predictable projects with projects that are riskier but have the potential to deliver greater returns.

Specifically, NIH should:

Recommendation 3. Better support early-career scientists.

NIH can supercharge the biomedical R&D ecosystem by better embracing newer investigators bringing bold, fresh approaches to science. In recent years, NIH allocated seven times more R01 funding to scientists older than 65 than it did to scientists under 35. The average age of R01 grantees in 2020 was 44 years. In other words, it takes fewer years to become an experienced surgeon than it does to launch a biomedical research career and obtain a first R01 grant. This paradigm leaves promising early-career researchers scrambling for alternative funding sources, or causes them to change careers entirely. Postdoctoral researchers in particular struggle to have their ideas funded.

NIH has attempted to alleviate funding disparities through some grants—R00, R03, K76, K99, etc.—targeted at younger scientists. However, these grants do not provide a clear onramp to NIH’s “bread and butter” R01 grants. 

NIH should better support early-career researchers by:

Conclusion

NIH funding forms the backbone of the American biomedical research enterprise. But if the NIH does not diversify its approach to research funding, progress in the field will stagnate. Any renewed commitment to biomedical innovation demands that NIH reconsider how it funds research, what it funds, and who it funds — and to rigorously evaluate its funding processes as well.

The federal government spent about $160 billion on scientific R&D in 2021. It is shocking that it does not routinely seek to optimize how those dollars are spent. While this memo focuses on the NIH, the analysis and recommendations contained herein are broadly applicable to other federal agencies with large extramural R&D funding operations, including the National Science Foundation; the Departments of Defense, Agriculture, and Commerce; NASA; and others. Increasing funding for science is a necessary but not sufficient part of catalyzing scientific progress. The other side of the coin is ensuring that research dollars are being spent effectively and optimizing return on investment.

Frequently Asked Questions
Are Randomized Controlled Trials (RCTs) the only way for the NIH to effectively evaluate funding mechanisms?

To really understand what works and what doesn’t, NIH must consider how to evaluate the success of existing and novel funding mechanisms. MIT economist Pierre Azoulay suggests that the NIH can systematically build out a knowledge base of what funding mechanisms are effective by “turning the scientific method on itself” using RCTs, the “gold standard” of evaluation methods. NIH could likely launch a suite of RCTs that would evaluate multiple funding mechanisms at scale with minimal disruption for around $250,000 per year for five years—a small investment relative to the value of knowing what types of funding work.


RCTs can be easier to implement than is often thought.[1] That said, NIH would be wise to couple RCTs with less ambitious evaluation approaches, such as a two-step process that filters out clearly sub-par applicants and then applies narrower criteria to the remaining pool to identify the most competitive or prioritized applicants. Even just collecting and comparing data on NIH grant applicants—data such as education level, career stage, and prior funding history—would provide insight into whether different funding interventions are affecting the composition of the applicant pool.


[1] For more on this topic, see Why Government Needs More Randomized Controlled Trials: Refuting the Myths from the Arnold Foundation.

How would the proposed “Science of Science Funding” Working Group differ from the ACD Working Group on High-Risk, High-Reward Programs?

The ACD Working Group on HRHR programs reviewed “the effectiveness of distinct NIH HRHR research programs that emphasize exceptional innovation.” This working group only focused on evaluating a couple of HRHR programs, which form a trivial portion of grantmaking compared to the rest of the extramural NIH funding apparatus. The Science of Science Funding Working Group would (i) build NIH’s capacity to evaluate the efficacy of different funding mechanisms, and (ii) oversee implementation of several (three to five) experimental funding mechanisms or substantial modifications to existing mechanisms.

How would the “Science of Science Funding” Working Group differ from the Science of Science Policy Approach to Analyzing and Innovating the Biomedical Research Enterprise (SCISIPBIO) Active Awards, jointly hosted by the NSF and the NIH?

SCISIPBIO isn’t focused on systematic change in the biomedical innovation ecosystem. Instead, it is a curiosity-driven grant program for individual PIs to conduct “science of science policy” research. NIH can build on SCISIPBIO to advance rigorous evaluation of science funding internally and agency-wide.

Isn’t the NIH one of the government’s premier research institutions? Is it really doing such a bad job funding research?

NIH funding certainly supports an extensive body of high-quality, high-impact work. But just because something is performing acceptably doesn’t mean that there are not still improvements to be made. As outlined in this memo, there is good reason to believe that static funding practices are preventing the NIH from maximizing returns on its investments in biomedical research. NIH is the nation’s crown jewel of biomedical research. We should seek to polish it to its fullest shine.

What are platform technologies?

Platform technologies are tools, techniques, and instruments that are applicable to many areas of research, enabling novel approaches for scientific investigation that were not previously possible. Platform technologies often generate orders-of-magnitude improvements over current abilities in fundamental aspects such as accuracy, precision, resolution, throughput, flexibility, breadth of application, costs of construction or operation, or user-friendliness. The following are examples of platform technologies:



  • Polymerase chain reaction (PCR)

  • CRISPR-Cas9

  • Cryo-electron microscopy

  • Phage display

  • Charge-coupled device (CCD) sensor

  • Fourier transforms

  • Atomic force microscopy (AFM) and scanning force microscopy (SFM)


There has been an appetite to fund more platform technologies. The recently announced ARPA-H seeks to achieve medical breakthroughs and directly impact clinical care by building new platform technologies. During the Obama Administration, the White House Office of Science and Technology Policy (OSTP) hosted a platform technologies ideation contest. Although multiple NIH-funded Nobel laureates won their prizes for platform technologies that have fundamentally shifted the way scientists approach problem solving, not enough emphasis is placed on the development of such technologies. Without investing deeply in platform technologies, our nation risks continuing its piecemeal approach to solving pressing challenges.