K-12 STEM Education For the Future Workforce: A Wish List for the Next Five Year Plan

This report was prepared in partnership with the Alliance for Learning Innovation (ALI) to advocate for building a better research and development (R&D) infrastructure in education. The Federation of American Scientists believes that STEM education must evolve to prepare today’s students for tomorrow’s in-demand scientific and technological careers, and that doing so is a matter of national security.

American STEM Education in Context

“This country is in the midst of a STEM and data literacy crisis,” opined Elena Gerstmann and Laura Albert in a recent piece for The Hill. Their sentiment represents a widely held concern that America’s global leadership in scientific and technological innovation, anchored in educational excellence, is being relinquished, thereby jeopardizing our economy and national security. Their message echoes a warning issued to U.S. policy makers, educators, and employers 65 years ago, when the USSR seemingly eclipsed our innovation pace with the launch of Sputnik.

Life magazine devoted its March 1958 edition to a scathing comparison of the playful approach to STEM education in U.S. schools versus the no-nonsense rigor of Russian classrooms. The issue’s theme, “Crisis in Education,” was summed up soberly: “The outcome of the arms race will depend eventually on our schools and those of the Russians.” America answered the bell and came out swinging. Under President Eisenhower, the National Aeronautics and Space Administration (NASA) and the Defense Advanced Research Projects Agency (DARPA) were both established in 1958, as was the National Defense Education Act, which channeled billions of dollars into K-12 and collegiate STEM education. By innumerable metrics (the Apollo program, the internet, GPS, and manufacturing dominance, all fueled by an internationally envied higher education system), the United States reclaimed preeminence in STEM innovation.

LIFE March 24, 1958

Over the next four decades, tectonic shifts in demographics, economics, and politics rearranged global competition such that complacent U.S. education systems were once again called on the carpet. In 2001, shortly before terrorists struck the World Trade Center and Pentagon, a U.S. Senate report on homeland vulnerability echoed that of Life magazine decades prior: “The inadequacies of our systems of research and education pose a greater threat to U.S. national security over the next quarter century than any potential conventional war that we might imagine.” The painfully prescient study, a product of the Hart-Rudman Commission on National Security/21st Century, identified the advancement of information technology, bioscience, energy production, and space science, all overlain by economic and geopolitical destabilization, as the nation’s greatest challenge and our new Sputnik. The Commission called on reformed education systems to quadruple the number of scientists and engineers and to dramatically increase the number and skills of science and mathematics teachers. As in 1958, leaders responded boldly, creating the Department of Homeland Security in 2002 and planting the seeds for the 2007 America Creating Opportunities to Meaningfully Promote Excellence in Technology, Education, and Science (COMPETES) Act.

Funding for research and development across federal agencies significantly increased over the decade, including a budget boost for the National Science Foundation’s grant programs supporting emergent scholars (Faculty Early Career Development Program, or CAREER), the research capacities of targeted jurisdictions (Established Program to Stimulate Competitive Research, or EPSCoR), Graduate Research Fellowships (GRF), the Robert Noyce Teacher Scholarships, the Advanced Technological Education (ATE) program, and others designed to bolster diverse talent pipelines to STEM careers. Despite increases in the number of students studying science and engineering in the U.S., a significant gap remains in diverse representation and equitable access to opportunities in the STEM field. Ensuring greater inclusion and diversity in the American science and engineering landscape is essential to engaging the “missing millions,” or persistently underrepresented minority groups and women, in the nation’s STEM workforce and education programs.

Nearly a quarter century later, America is once again in a STEM talent crisis. The solutions of Hart-Rudman and of the Eisenhower era need an update. This latest Sputnik moment, unlike the space race that motivated the National Defense Education Act and the terrorism that spawned Homeland Security, is more pervasive and profound, permeating every aspect of our lives: artificial intelligence and machine learning, CRISPR (clustered regularly interspaced short palindromic repeats), quantum computing, 6G and 7G communications, semiconductors, hydrogen and other energy sources, lithium and other ionic energy storage, robotics, big data, blockchain, biopharmaceuticals, and other emergent technologies.

To relinquish the lead in these arenas would put the U.S. economy, national security, and social fabric in the hands of other nations. Our new USSR is a roulette wheel of friends and foes vying for STEM supremacy, including Singapore, Japan, China, Germany, the UK, Taiwan, Saudi Arabia, India, South Korea, and many more. Not unlike the education crises that came to a head in 1958 and in 2001, our educational Achilles’ heel is a lack of exposure to, and underpreparedness for, STEM careers among the majority of diverse young Americans. Further, the U.S. Bureau of Labor Statistics projects that STEM career opportunities will grow 10.8% by 2032, more than four times faster than non-STEM occupations.

What the United States has going for it in 2024 (and was comparatively lacking in the 1950s and the early 2000s) is a wealth of STEM-rich local schools, communities, and states. Powered by investments from federal agencies (e.g., the Smithsonian, NSF, NASA, DOL, ED, and others), state governments (governors in Massachusetts, Iowa, and Alabama, for example), nonprofits (Project Lead The Way and the Teaching Institute for Excellence in STEM, for example), and industries (Regeneron, Collins Aerospace, John Deere, Google, etc.), STEM is now seen as an imperative field by most Americans.

Today’s STEM education landscape presents significant opportunities and challenges. Existing models of excellence demonstrate readiness to scale. To focus on what works and to channel resources in the direction of broader impacts for maximal benefit is to answer the call of our omnipresent 2024 Sputnik.

The Current State: Future STEM Workforce Cultivation

At its root, STEM education is about workforce cultivation for high-demand and high-skill occupations of fundamental importance to American economic vitality and national security. In the ideal state, STEM education also prepares all learners to be critical thinkers who make evidence-based decisions by equipping them with analytical, computational, and scientific ways of knowing. STEM students should learn effective collaboration and problem-solving skills with an interdisciplinary approach, and feel prepared to apply STEM skills and knowledge to everyday life as voters, consumers, parents, and citizens.1

Target Audiences and Service Providers 

The early childhood education community (pre-K-grade 3), both in school and out-of-school (at informal learning centers), has emerged over the last decade as a prime target for boosting STEM education as research findings accumulate around the importance of early exposure to and comfort with STEM concepts and processes. Popular providers of kits and activities, curricula, software platforms, and professional development for educators include Hand2Mind (Numberblocks), Robo Wunderkind, StoryTimeSTEM (Dragonland), NewBoCo (Tiny Techies), BirdBrain Tech (Finch robot), FIRST Lego League (Discover), Museum of Science Boston (Wee Engineer), Iowa Regents’ Center for Early Developmental Education (Light & Shadow), and Mind Research (Spatial-Temporal Math).  

Elementary and middle school students, both in and out of school, enjoy the richest menu of STEM programming on the market, reflecting stronger curricular freedom to integrate content compared to high schools. Popular STEM programs include Blackbird Code, Derivita Math, FUSE Studio, Positive Physics, Micro:bit, Nepris (now Pathful), Project Lead The Way (Launch and Gateway), FIRST Tech Challenge, Code.org (CS Discoveries), Bootstrap Data Science, and many more.

The secondary education STEM landscape differs from pre-K-8 in a significant way: although discrete STEM activities and programs are plentiful for integration into secondary science, mathematics, and other classes, the adoption of packaged courses or after-school enrichment opportunities is more common. Project Lead The Way and Code.org offer an array of stand-alone elective STEM courses2, as do local community colleges and universities. Nonprofits and industry sources offer STEM enrichment programs such as the Society of Women Engineers’ SWEnext Leadership Academy, Google’s CodeNext, the Society of Hispanic Professional Engineers’ Virtual STEM Labs, and Girls Who Code’s Summer Immersion. Finally, a number of federal, state, nonprofit and business organizations conduct future workforce programs for targeted students including the federal TRIO program, Advancement Via Individual Determination (AVID), Jobs for America’s Graduates (JAG), and Jobs For the Future (JFF). 

Investment in STEM Education

A modestly conservative estimate of the total American investment in STEM education annually is $12 billion, nearly the equivalent of the entire budget of the National Science Foundation or the Environmental Protection Agency. 

For fiscal year 2023, the White House budgeted $4.0424 billion for STEM education across the 16 agencies that make up the Subcommittee on Federal Coordination in STEM Education (FC-STEM). Total nonprofit and philanthropic investments are more elusive: there are many funders, the origins of their dollars often overlap with state or local government (grants, for example), and definitions of STEM investment vary wildly. That said, U.S. charitable giving to the education sector totaled $64 billion in 2019; a reasonable assumption that two percent made its way to STEM education adds over $1 billion to the overall funding pie. Business and industry in the United States contribute well over $5 billion annually, a conservatively estimated proportion of the total annual STEM education market share among ten nations, according to a recent study. K-12 schools spend well over $1 billion on STEM, a modest fraction of the $870 billion total spent on K-12 across the U.S. The same is likely true of America’s annual $700 billion higher education expenditure: minimally $1 billion to STEM. Elusive as definitive figures can be in this space, a glaring reality is that funds are streaming into STEM education at a level where measurable results should be expected. Are resources being distributed for maximal impact? Are measures capturing that impact? Is it enough money?

There are approximately 55.4 million K-12 students across the nation. At $12 billion per year on STEM, that comes to about $217 worth of STEM education annually per young American. Is that enough to move the needle? The answer is a qualified “yes” based on Iowa’s experience. The state launched a legislatively funded STEM education program in 2012, investing on average about $4.2 million annually to provide enrichment opportunities for about one-fifth of all K-12 students, or 100,000 per year. To date, about 1.2 million youth have been served through a total investment of about $50 million. That calculates to $42 per student. The result? Among participants: increased standardized test scores in math and science; increased interest in STEM study and careers; a near doubling of post-secondary STEM majors at community colleges and universities. Thus, from Iowa’s experience, the amount of funding toward American STEM education is adequate to expect systemic gains. The qualifier is that Iowa funds flow toward increased equity (most needy are top priority), school-work alignment (career linked curriculum, professional development), and proof of effectiveness (rigorously vetted and carefully monitored programs). Variance in these three factors can separate ambitions from realities.
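The per-student arithmetic above can be verified with a quick back-of-envelope sketch, using only the figures the report itself cites (the dollar amounts and head counts are the report's estimates, not independently sourced data):

```python
def per_student(total_dollars: float, students: float) -> float:
    """Annual STEM education dollars per student."""
    return total_dollars / students

# National estimate: ~$12 billion per year across ~55.4 million K-12 students
national = per_student(12e9, 55.4e6)
print(f"National: ${national:.0f} per student per year")  # ≈ $217

# Iowa: ~$4.2 million per year reaching ~100,000 students annually,
# and ~$50 million total across ~1.2 million youth served since 2012
iowa_annual = per_student(4.2e6, 100_000)
iowa_cumulative = per_student(50e6, 1.2e6)
print(f"Iowa: ${iowa_annual:.0f} per student-year, "
      f"${iowa_cumulative:.0f} total per youth served")  # ≈ $42 either way
```

Both the annual and the cumulative Iowa views converge on roughly $42 per participant, about one-fifth of the national per-student figure, which is what makes Iowa's documented gains notable.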

Ambitions vs. Realities

The federal STEM education strategic plan, Charting a Course for Success: America’s Strategy for STEM Education, identified three consensus goals for U.S. STEM education: a strong STEM literacy foundation for all Americans; increased diversity, equity, and inclusion in STEM study and work; and preparation of the STEM workforce of the future. Three challenges lie between those goals and reality.

Elusive equity. The provision of quality STEM education opportunities to Americans most in need is universally embraced yet difficult to achieve at the program level. Unequal funding of school STEM programs across urban, rural, and suburban public and private school districts equates to less experienced educators and diminished material resources (laboratories, computers, transportation to enrichment experiences) in socioeconomically disadvantaged communities. The challenge is compounded by the lack of role models to inspire and support youth of underserved subpopulations by race, ability, ethnicity, gender, and geography. Bias, whether implicit or explicit, fuels stereotype threat and identity doubt for too many individuals in schools, colleges, and workplaces, countering diversity and equity efforts.

School-work misalignment. For most learners, the school experience can seem quite different from the higher education that follows, and from the work and life experiences beyond. Employer and learner polls unearth misaligned priorities: employers value in new hires skills such as relationship building, dealing with complexity and ambiguity, balancing opposing views, collaboration, co-creativity, and cultural sensitivity, in addition to expecting work-related experience. Schools typically proclaim missions like “Educating each student to be a lifelong learner and a caring, responsible citizen,” omitting the importance of employability. Learners feel that school taught them time management, academic knowledge, and analytical skills, while experiential learning remains limited.

Elusive proof. Evidence of effect can be vexingly evasive. The 2022 progress report on the federal STEM plan clarified the difficulty of verifying reach to those most in need: the identification of participants in STEM programs can be restricted for privacy or legal reasons. The gathering of racial, ethnic, and demographic data on STEM participants is often unreliable, given self-reported or observational identifications as well as the fleeting, often anonymous encounters typical of “STEM Night” or informal experiences at science centers, zoos, and museums.

Participant profiles aside, variability in program assessments – design and objectives – makes meaningful meta-analysis challenging, which creates difficulties in scaling promising STEM programs. “We recommend that states and programs prioritize research and evaluation using a common framework, common language, and common tools,” advised a group of evaluators recently.

Exemplars 

Plentiful success stories exist at the local, regional, and national levels. The following six exemplars are each funded in whole or in part by federal and/or state grants. The first pair of examples profiles local education systems (one in-school, one out-of-school) masterfully aligning learning experiences to career preparation. The second pair profiles a regional out-of-school STEM program powerfully documenting its effects on participants and an in-school enrichment course demonstrating success. And the final pair profiles a nationwide equity program successfully preparing STEM educators to serve diverse students and an exciting consortium effort aimed at refocusing the entire educational enterprise on the skills that matter most.

1.a. School-work alignment at the local level

The Barrow Community School District (BCSD) in Georgia is strongly committed to work-based learning (WBL). All 15,000 students are required to take a sequence of exploratory STEM career classes beginning in ninth grade. Fifteen career pathways are available, ranging from computing to health and manufacturing to engineering. It all culminates in an optional senior-year internship serving 400 students annually. Interns earn dual-enrollment credits in partnership with local colleges and are paid by the employer host. Interns spend 7.5 to 15 hours per week at work experiences in a hospital, on a construction site, or in a production plant. The district employs a full-time WBL coordinator to oversee, administer, and evaluate, as well as to cultivate community employer partners. Teachers are expected to spend one week in an industry externship every three to five years. The BCSD commitment to a school experience aligned to future careers is something that every student in any district ought to be able to experience.

1.b. Diverse workforce of the future – local-to-global level

The World Smarts STEM Challenge is a community-based, after-school, real-world problem-solving experience for student workforce development. Funded by a 2021 National Science Foundation ITEST (Innovative Technology Experiences for Students and Teachers) grant in partnership with North Carolina State University, students in the Washington D.C. area are assigned bi-national groups (arranged through a partnership with the International Research and Exchanges Board) to collaborate in solving local/global STEM issues via virtual communications. Groups are mentored by industry professionals. In the process, students develop skills in innovation, investigation, problem-solving, and global citizenship for careers in STEM. Participant diversity is a primary objective. Learners of underrepresented backgrounds, including Black, Hispanic, economically disadvantaged, and female students, are actively recruited from local schools. Educator-facilitators are treated to professional development opportunities to build mentorship skills that support students. The end-product is a World Smarts STEM Activation Kit for implementing the model elsewhere.

2.a. Proof of effect at the regional level out-of-school

NE STEM 4U is an after-school program serving elementary school youth in the Omaha, Nebraska region. Programs are hands-on, problem-based challenges relevant to children. The staff were interested in the effect of their activities on participants’ excitement, curiosity, and STEM concept gains. The instrument they chose is the Dimensions of Success (DoS) observational tool of the P.E.A.R. Institute (Program in Education, Afterschool & Resiliency). The DoS is conducted by a certified administrator who observes and rates four groups of criteria: the learning environment itself, level of engagement in the activity, STEM knowledge and skills gained, and relevancy. Through multiple cohorts over two years, the DoS findings validated the learning approach at NE STEM 4U across dimensions, though with natural variations in positive effect. The upshot is not only that this after-school model is readily replicable, but that the DoS observation tool is a thoroughly vetted, powerful, and readily available instrument that could become a “common tool” in the STEM education program evaluation community.

2.b. Proof of effect at the regional school level

From a modest New York origin in 1997, Project Lead The Way (PLTW) has blossomed into a nationwide tour de force in STEM education, funded by the Kern Foundation, Chevron, and other philanthropies. Adopted at the community school level, where trained educators integrate units at the pre-K-5 and middle school levels (Launch and Gateway, respectively) or offer courses at the secondary level (Algebra, Computer Science, Engineering, Biomedical), all PLTW offerings share a common focus on developing in-demand, transportable skills like problem solving, critical and creative thinking, collaboration, and communication. Career connections are a mainstay. To that end, PLTW is notable for expecting schools to form advisory boards of local employers for feedback and connections. Attitudinal surveys attest to increased student interest in STEM careers.

3.a. Equity at the national level – diversity and inclusion

The National Alliance for Partnerships in Equity (NAPE) offers a wide array of professional development programs related to STEM equity. One module is called Micromessaging to Reach and Teach Every Student. Educators in and out of school convey micro-messages to students at every encounter. Micro-messages are subtle and typically unconscious. Sometimes they are helpful – a smile or eye contact. Sometimes they can be harmful towards individuals or reveal bias towards a group to which a student may belong – a furrowed brow or a stereotypical comment. Exceedingly rare is micro-message expertise in the teacher preparatory pipeline or in standard professional development. Yet micro-messaging is tremendously influential in the self-perceptions of learners as welcome in STEM. 

3.b. Equity at the national level – leveling the playing field

Durable skills – e.g., teamwork, collaboration, negotiation, empathy, critical thinking, initiative, risk-taking, creativity, adaptability, leadership, and problem-solving – define jobs of the future. AI and automation cannot replace durable skills. The nonprofit America Succeeds has championed a list of 100 durable skills grouped into 10 competencies, based on industry input. It studied state standards for college and career readiness against those competencies and prescribes remedies to states whose standards fall short (most U.S. states). Durable Skills, packaged by America Succeeds, is an equity service par excellence: every learner can command these 100 durable skills, setting them up for success.

Black and white photo of early 20th century science class

The Case for Increased Investment in STEM Education R&D at the Federal and State Level

Billions of dollars pour into American STEM education each year. Millions of learners and employers benefit from the investment. Outstanding programs produce undeniably successful results for individuals and organizations. And yet, “This country is in the midst of a STEM and data literacy crisis.” How can that be? Here are some of the factors in play.

Recent STEM Education/Workforce Investment Trends

The biennial Science and Engineering Indicators compiled by the National Science Board (NSB) were released in March 2024. Noteworthy findings (necessarily a couple of years old given the retrospective analysis) include:

The federal government funds 52% of all academic research and development taking place at colleges and universities (2021).

Contrasting the NSB findings against current federal budgets, the FY2024 appropriation for STEM education research and development is a work in progress. Compared to FY23, the budget presented to Congress by the executive branch called for increased STEM spending across many agencies, though not all, while the U.S. House and Senate generally propose reductions. The Defense Department’s STEM education line, the National Defense Education Program, is slated for significant reduction (-7.3 percent to -20 percent). The Department of Energy’s Office of Science, which funds STEM education, is slated for a slight increase (+1.7 percent), as are the NSF’s STEM education programs (+1.6 percent). NASA’s Office of STEM Engagement is on track for a slight decrease (-0.3 percent). The Department of Agriculture’s Research and Education budget is down slightly (-1.7 percent), as is the U.S. Geological Survey’s Science Support budget, which includes human capital development (-1.2 percent). The Department of Education’s Institute of Education Sciences was slated for significant increase by the executive branch, though for reduction in both the House and Senate budgets. The Department of Homeland Security’s Science and Technology budget, which includes funding for university-based centers and minority institution programs, is set for reduction (-1.3 percent to -19 percent).

Significant STEM education and workforce development support resides within the CHIPS and Science Act of 2022, which has yet to be fully funded by Congress. An overall trend of shifting R&D support, including education, from the federal government to the private sector means greater reliance on business and industry to invest in STEM program development. The NSB Indicators report highlights this shift: the federal government funded 19 percent of U.S. R&D in 2021 (down from 30 percent in 2011), while the business sector now funds 75 percent.

A bottom-line interpretation is that federal investment in STEM education/workforce development, though significant, can hardly be described as a generational response to an economic and national security crisis.

Emergent Frontiers

Meanwhile, economic Sputniks are circling the globe, all driven by semiconducting silicon and germanium chips. The chip itself is yet another testament to American STEM education as a home-grown invention, yet chips are built mostly elsewhere: Taiwan, South Korea, and Japan. Semiconductors lie at the heart of our communications (e.g., cell phones, satellites), transportation (e.g., planes, trains, automobiles), defense (e.g., guidance systems and risk analytics), health (e.g., pacemakers, insulin pumps), lifestyle (e.g., dishwashers, Siri and Alexa), and virtually every other aspect of life and commerce. The federal government committed $53 billion through the 2022 CHIPS and Science Act to expand semiconductor talent development, research, and manufacturing in the U.S., amplified by $231 billion in commitments to semiconductor development by business and industry. Guidance through the National Strategy on Microelectronics Research was recently released by the White House Office of Science and Technology Policy. When fully realized, the CHIPS Act may come to be a generational response to an international adversarial threat far more profound than Sputnik.

Equally compelling and weighty in terms of life, liberty, and the pursuit of happiness is leading in research and development, as well as governance, around artificial intelligence. An extraordinary evolution of the workplace and home life is underway as applications of this new technology spread. For example, AI dramatically increases precision, and thus reduces error, in health care: machine learning is far superior to human eyes at analyzing MRI or x-ray images to detect cancer early. On a lighter note, machine learning can sharpen predictions of a new movie’s box office appeal by distilling millions of historical data points and a sea of YouTube videos. Conversely, there are misuses of AI, both present and potential. The displacement of radiologists, movie script writers, and countless others whose routine, analytical, or creative skills can be performed by robots and neural-networked sensors is troublesome, yes, but a mild effect of AI compared to the vulnerability of our privacy, our democratic systems, business and financial integrity, and national defense structures, for starters.

The White House Blueprint for an AI Bill of Rights plants an important stake in the ground around AI safeguards. But it does not speak to the cultivation of future managers of AI. Similarly, the U.S. Department of Education report Artificial Intelligence and the Future of Teaching and Learning advises on risks of and uses for AI in diagnostics and descriptive statistics. However, guidance for preparing the upcoming generation to manage AI is not included. The National Science Foundation supports several AI-education studies that may prove worthy of scaling.

A potpourri of additional emergent trends fuels the current STEM crisis. Many are technological innovations, unearthing powers of manipulation and control that society is ill-prepared to manage. Quantum computing is one such innovation, using the quantum states of subatomic particles, or qubits, to store information. Computers will become exponentially faster and more powerful, possibly solving climate change while also deciphering everyone’s passwords. Relatedly, revolutions in cybersecurity and data analytics may be out ahead of societal grasp. Many educational programs at the local and national levels have emerged in this space, including eCybermission from the Army Educational Outreach Program (AEOP) and Data Science Foundations from EverFi, which uses sports, finance, and other contexts for sense-making.

Not everyone needs to know how a microwave oven works in order to use it effectively. But U.S. citizens bear the responsibility for weighing the ethical, equitable, and legal dimensions of STEM advancements as voters, educators, parents, and consumers. Whether it be CRISPR alterations of individuals’ genetics, the socioeconomic dimensions of factory automation, the morality of directed energy weaponry (DEW), or the cost/benefit balance of climate mitigation technologies such as carbon sequestration, STEM education and workforce development need to be out front. That requires additional investment.

Supply-Demand Imbalance

Emergent technologies will drive job opportunities in the STEM arena that are expected to grow at four times the rate of jobs in other sectors in the coming decade. While it is encouraging that post-secondary STEM certificates and degrees have increased over the last decade (growing from 982,000 in 2012 to 1,310,000 in 2021), this growth is a ripple when the field needs a wave. Further, significant subpopulations of Americans are underrepresented in STEM majors and jobs. Women make up just about one-third of the science and engineering workforce. And while racial and ethnic subgroups including Alaska Native, Black or African American, American Indian, and Hispanic or Latino comprise 30% of the total workforce, they hold just 23% of STEM jobs. Rural residency exacerbates those disparities for all subpopulations in the STEM education pipeline: while 40% of urban adults have at least a bachelor’s degree, only 25% of rural residents do.
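The "ripple versus wave" claim can be made concrete with a short sketch, again relying only on figures cited in this section (the credential counts and the BLS projection quoted earlier in this report):

```python
# Growth in post-secondary STEM certificates and degrees, 2012 to 2021
creds_2012, creds_2021 = 982_000, 1_310_000
cred_growth_pct = (creds_2021 - creds_2012) / creds_2012 * 100
print(f"STEM credentials grew {cred_growth_pct:.0f}% over the decade")  # ≈ 33%

# BLS projection cited above: STEM jobs grow 10.8% by 2032,
# more than four times the non-STEM rate
stem_job_growth_pct = 10.8
implied_non_stem_pct = stem_job_growth_pct / 4
print(f"Projected STEM job growth: {stem_job_growth_pct}% by 2032 "
      f"(non-STEM under ~{implied_non_stem_pct:.1f}%)")
```

A 33% rise in credentials over a decade sounds substantial, but spread across an economy where STEM occupations themselves are expanding faster than every other sector, it leaves the supply-demand gap described above largely intact.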

The commitment to diversify the STEM talent pipeline is a universal consensus across federal, state, local, corporate, nonprofit, and philanthropic investors in STEM education and workforce development. Numerous programs devoted to equity and inclusion are at work today with promising results, ripe for scaling.

Impact on Individuals and Society

Of all the arguments supporting increased investment in STEM education R&D to solve our current STEM crisis – tepid federal spending, ominously powerful inventions, and the dearth of talent for advancing and managing those inventions – a fourth argument eclipses each of them: STEM education improves the lives of individuals irrespective of their occupation. And in so doing, STEM education improves communities and the country at large.

Learners fortunate enough to enjoy quality STEM education develop creativity through imaginative design, interpretation, and representation of investigations. The tools they use strengthen technology literacy. The mode of discovery is highly social, honing communication and cooperation skills. With no sage-on-the-stage, they develop independence of thought. Failure happens, forging perseverance and resilience in its wake. Asking and answering questions nurtures curiosity. Defending and refuting ideas cultivates critical thinking. Truth and facts are evidence-based yet always tentative. Empathy is cultivated through alternative interpretations or points of view. And confidence to pursue STEM as a career comes from doing STEM.

The prospect of an entire population of Americans thus equipped is the most compelling case for strategically increased R&D investment in STEM education.

Photo of 2008 Ethics in the Science Classroom

Policy Recommendations for Increasing the Efficacy of Education R&D to Support STEM Education

Where do federal, state, local, corporate, nonprofit, and philanthropic STEM investors look for guidance in the alignment and leveraging of their dollars to nationwide priorities? The closest we have to a “master plan” is the federal STEM education strategic plan mandated by the America COMPETES Act. Updated every five years by the White House Office of Science and Technology Policy in close collaboration with federal agencies, the 2018-2023 plan is due for an update, and it is likely the next iteration will be released soon. 

While the STEM community waits, valuable input on the next iteration was recently provided to the OSTP by the STEM Education Coalition. Coalition members (numbering over 600) represent the spectrum of STEM advocates – business and industry, higher education, formal and informal K-12 education, nonprofits, and national/state policy groups – and collectively hold great sway in matters of STEM education nationally. The expiring federal STEM plan closely reflects their input, as its successor likely will as well.

Six of the following ten recommendations build upon the STEM Education Coalition’s priorities, while the remaining four recommendations address gaps in the pipeline from STEM education to workforce pathways.

In order to maximize research and development to improve STEM education, we have distilled ten recommendations:

  1. Devote resources (human and financial) to both the scaling of, and continued research and development in, interventions that disrupt the status quo when it comes to rural under-reach and under-service in STEM education.
  2. Devote resources to both the scaling of, and continued research and development in transdisciplinary (a.k.a. Convergent) STEM teaching and learning, formally and informally.
  3. STEM teacher recruitment and training to support learning characterized on page 11 is a high-value target for investment in both the scaling of existent models as well as research and development on this essential frontier.
  4. Expand student authentic career-linked or work-based learning experiences to all, earning credits while acquiring job skills, by improving coordination capacity, and crediting – especially earning core (graduation) credits. 
  5. Devote resources to research and development on coordination across components of the STEM education system – in school and out of school, educator preparation – at the local, state and national levels.
  6. Devote resources to research and development toward improved awareness/communication systems of Federal STEM education agencies.
  7. Devote resources to research and development on supporting the training of STEM teachers and professionals for career coaching on a real-time, as-needed basis for all youth.
  8. Devote resources to research and development on the expansion of local/global challenge-solution learning opportunities and how they influence student self-efficacy and STEM career trajectories.
  9. Devote resources to research and development of a readily accessible, easily navigable, and comprehensive digital platform from which education providers can harvest effective, vetted STEM programs from across the entire producer spectrum.
  10. Devote resources to the design and development of a catalog of STEM/workforce education “discoveries” funded by federal grant agencies (e.g., NSF’s I-Test, DR-K12, INCLUDES, CSforAll, etc.) to be used by STEM educators, developers and practitioners.

Recommendation 1. Devote resources (human and financial) to both the scaling of, and continued research and development in, interventions that disrupt the status quo when it comes to rural under-reach and under-service in STEM education.

Aligning to the STEM Ed Coalition’s priority of “Achieving Equity in STEM Education Must Be a National Priority,” this recommendation is central to the success of STEM education. The economic and moral imperative to broaden access to quality STEM education and to high-demand STEM careers is a national consensus. Lack of access and opportunity across rural America – where 20% of all youth attend half of all school districts and where persistent inequality hits members of racial and ethnic minority groups hardest – creates a high-value target.

STEM Excellence and Leadership Project

Identifying and nurturing STEM talent in rural K-12 settings can be a challenge. The Belin-Blank Center for Gifted Education and Talent Development successfully designed and implemented the “STEM Excellence and Leadership Project” at the middle school level. Funded by the NSF’s Advancing Informal STEM Learning program, the project’s flexible professional development, wide-net-casting of students, networking within the community, and career counseling resulted in increased creativity, critical thinking, and positive perceptions of mathematics and science.

Recommendation 2. Devote resources to both the scaling of, and continued research and development in transdisciplinary (a.k.a. Convergent) STEM teaching and learning, formally and informally. 

Aligning to the STEM Ed Coalition’s priority “Science Education Must Be Elevated as a National Priority within a Transdisciplinary Well-Rounded STEM Education,” we need more investment in R&D to understand the transdisciplinary STEM teaching and learning models that improve student outcomes. America’s formal education model remains largely reflective of the 1894 recommendations of the Committee of Ten: annually teach all students History, English, Mathematics, Physics, Chemistry, etc. This prevailing “layer cake” approach serves transdisciplinary education poorly. Even the Next Generation Science Standards, upon which state and district science standards are largely based, focus on developing “…an in-depth understanding of content and develop key skills…” All modern STEM-related challenges facing Generations Z, Alpha, and Beta require an entirely different brand of education – one of transdisciplinary inquiry.

USPTO Motivates Young Innovators and Entrepreneurs

The United States Patent and Trademark Office (USPTO)’s National Summer Teacher Institute (NSTI) on Innovation, STEM, and Intellectual Property (IP) trains teachers to incorporate concepts of making, inventing, and intellectual property creation and protection into classroom instruction, with the goal of inspiring and motivating young innovators and entrepreneurs. To date the program claims 22,000 hours of IP and invention education training for 444 teachers in 50 states – 110 of whom have inventions – now equipped to spread the power of invention education and IP to hundreds of thousands of learners across the country and the world. We should better understand the program components that enable this kind of transdisciplinary learning.

Recommendation 3. STEM teacher recruitment and training to support learning is a high-value target for investment in both the scaling of existent models as well as research and development on this essential frontier. 

Aligning to the STEM Ed Coalition’s priority “Increase the Number of STEM Teachers in Our Nation’s Classrooms,” we need to deploy more education R&D to address America’s well-documented STEM teacher shortage. But the shortage is only half of the challenge we face. The other half is equipping teachers to authentically teach STEM, not merely a discipline underneath the STEM umbrella. Efforts such as the NSF’s Robert Noyce Teacher Scholarship program and the UTeach model support the production of excellent teachers of mathematics and science, but not of STEM overall. Teaching in a convergent (transdisciplinary) fashion, through collaborative community partnerships, on local/global complex issues is beyond the scope and capacity of traditional teacher preparatory models.

Example Programs

Two means for equipping educators to teach STEM are (1) in their pre-professional preparation, and (2) as in-service professional development for disciplinary instructors. Promising examples are flourishing.

  1. STEM Teaching Certificate. A few U.S. states and some national organizations have built STEM licenses and endorsements. Georgia State University’s STEM Certificate program trains teachers to bring a convergent STEM approach to whatever course they teach: “[candidates] figure out how to work across their schools, with the arts, with connections to other subjects.”
  2. In-service STEM Externships. Teachers in industry externships discover workplace connections and durable skills important to build in classrooms. Numerous businesses (e.g., 3M), organizations (e.g. Aerospace/NASA), and states (e.g., Iowa’s NSF ITEST funded externships) conduct variations on the concept, with compelling results.

Recommendation 4. Expand student authentic career-linked or work-based learning experiences to all, earning credits while acquiring job skills, by improving coordination capacity, and crediting – especially earning core (graduation) credits.

Aligning to the STEM Ed Coalition’s priority to “Support Partnerships with Community Based STEM Organizations, Out of School Providers and Informal Learning Providers” education R&D needs to better understand career based learning models that work and deploy these evidence-based practices at scale.

Example Programs

With all 50 U.S. states aggressively pursuing work-based learning (WBL) policies and support, there is an opportunity to study and codify what states are learning in order to improve and iterate faster. According to the Education Commission of the States, 33 states have a definition for WBL, though definitions vary. Nearly all states report WBL as a state strategy in their Workforce Innovation and Opportunity Act (WIOA) profile. Twenty-eight states legislate funding to support WBL. Fewer than half of all states permit WBL to count toward graduation credits. Of all states, Tennessee presents a particularly aggressive WBL profile worthy of scaling/replication.

Recommendation 5. Devote resources to research and development on coordination across components of the STEM education system – in school and out of school, educator preparation – at the local, state and national levels.

Aligning to the STEM Ed Coalition’s priority to “Take a Systemic Approach to Future STEM Education Interventions,” more R&D should be deployed to study ecosystem models to understand the components that lead to student outcomes.

The STEM learning that takes place during the K-12 school day may or may not mesh well with the STEM learning that takes place at museum nights or at summer camp. In both instances, it may or may not align well with local, state, or national assessments. The preparation of educators is widely variable. The curricular content classroom-to-classroom and state-to-state varies. To drop novel grant-funded interventions into the mix is a random act of hope.

Example Programs

STEM Learning Ecosystems now number over 100 across the U.S., providing vertebral backbone to a national coordinative skeleton for STEM education. Formally designated by their membership in the STEM Learning Ecosystems Community of Practice supported by the Teaching Institute for Excellence in STEM (TIES), they each unite “…pre-K-16 schools; community-based organizations, such as after-school and summer programs; institutions of higher education; STEM-expert organizations, such as science centers, museums, corporations, intermediary and non-profit organizations and professional associations; businesses; funders; and informal experiences at home and in a variety of environments” to “…spark young people’s engagement, develop their knowledge, strengthen their persistence and nurture their sense of identity and belonging in STEM disciplines.” Every one of America’s 20,000 cities and towns ought to have a STEM Ecosystem. Just 19,900 to go.

Recommendation 6. Devote resources to research and development toward improved awareness/communication systems of Federal STEM education agencies.

Aligning to the STEM Ed Coalition’s priority to “Clarify and Define the Role of Federal Agencies and OSTP in Supporting STEM Education,” we should utilize R&D and inspiration from other fields to ensure we are propagating knowledge and systems in ways that foster increased transparency and evidence use.

Awareness is the weak link in the chain of federal STEM education outreach to consumers at local levels. Seventeen federal agencies engage in STEM education via 156 programs spanning pre-K-12 formal and informal, higher education, and adult education.

In 2018-19 a strong push was put forth by the OSTP and the Federal Coordination in STEM subcommittee (FC-STEM) to build STEM.gov or STEMeducation.gov in the spirit of AI.gov and Grants.gov: a one-stop clearinghouse through which Americans can explore and discover funding, programs, and expertise in STEM. To date, the closest analog is https://www.ed.gov/stem.

Example Programs

Discrete programs of various federal agencies have employed clever tactics for awareness and communication, as described in the 2022 Progress Report on the Implementation of the Federal STEM Education Strategic Plan. The AmeriCorps program, for example, partnered with Mathematica to build a web-based interactive SCALER tool usable by education professionals, local education agencies, state education agencies, nonprofits, state and local government agencies, universities and colleges, tribal nations, and others to request participants to address local challenges they have identified, including STEM. Similarly, the National Institute of Standards and Technology launched the NIST Educational STEM Resource registry (NEST-R) to provide wide access to NIST educational and workforce development content, including STEM resource records. Can the concept be broadened to a grand unifying collective?

Recommendation 7. Devote resources to research and development on supporting the training of STEM teachers and professionals for career coaching on a real-time, as-needed basis for all youth. 

Gen Z and Gen Alpha may end up in jobs like machine learning tech, molecular medical therapist, cryptocurrency auditor, big data distiller, climate change mitigator, or jetpack mechanic. From whom can they expect good career coaching? It is unrealistic to expect that their school counselors can keep up; with an average caseload of 385 students across all disciplines, their hands are full. STEM teachers, both the disciplinary and the integrated type, are best positioned to take on more responsibility for career coaching, with the help of counselors, administrators, and librarians – in fact, it is an all-hands-on-deck challenge.

Example Programs

Meaningful Career Conversations is a program begun in Colorado and now spreading to other states. It is a light, four-hour training experience that equips educators and others with whom youth come into contact to conduct conversations that steer students toward reflection, exploration, and consideration of career pathways of interest. Trainings are based upon starters and prompts that get students talking about and reflecting on their strengths and interests, such as “What activities or places make you feel safe and valued? Why?” It is not a silver bullet, but a model of distributed responsibility which, by engaging core teachers and other adults in career guidance, can help more students find their way toward a STEM career.

Recommendation 8. Devote resources to research and development on the expansion of local/global challenge-solution learning opportunities and how they influence student self-efficacy and STEM career trajectories.

The standardization of a vision for STEM in classrooms across America will take time and resources. In the meantime, programs like MIT Solve can fast-track authentic learning experiences in school and after school. It is the ultimate in student-centeredness to invite groups of youth to think big – to identify challenges that enthuse them, tap all imaginable resources in dreaming up solutions, and command their own learning.

Example Programs

Common in higher education are capstone projects, applied coursework, even entire college missions (e.g., Olin College) that center the student learning experience around local/global challenges and solutions. 

For citizens of all ages there are opportunities like Changemakers Challenges and the Gates Foundation’s “Reinvent the Toilet” competition.

At the K-12 level, FIRST LEGO League teams learn about robotics through humanitarian themes such as adaptive technologies for people with disabilities. The World Food Prize offers student group projects focused on global food security challenges. Future City and the Invention Convention follow a similar format. These well-evaluated programs are ripe for expansion or replication.

Recommendation 9. Devote resources to research and development of a readily accessible, easily navigable, and comprehensive digital platform from which education providers can harvest effective, vetted STEM programs from across the entire producer spectrum.

More than 50 different programs are named in this paper, each an exemplar, a mere snapshot of the STEM programs available to the pre-K-12 community in and out of school. Therein lies a challenge/opportunity uniquely defining this moment in American educational history compared to the 1958 and 2001 crises: an embarrassment of riches.

Example Programs

The number of databases and resource catalogs on STEM education programs available to educators is almost as overwhelming as the number of programs themselves. A few standouts help dampen the decibels (though none are perfect):  

  1. What Works Clearinghouse (WWC). Established in 2002 under the Institute of Education Sciences at the U.S. Department of Education, the WWC does the hard work for educators of reviewing the research to make evidence-based recommendations about instruction. A priceless service. The trick is distillation. Its goal of digesting and disseminating education research gets the material down to the level of curriculum developers, publishers, teacher-trainers, etc. – overwhelming, though, for casual-shopping educators.
  2. STEMworks Database. Born under Change the Equation in 2012 and acquired by WestEd in 2017, STEMworks is a tool for sifting through the noise, using a rigorous rubric (Design Principles) to present sure-fire winning STEM programs to educators and organizations. Program providers (of kits, courses, software, lessons) submit applications for expert review. The result is a searchable honor roll of high-quality STEM. The hitch? Relatively few providers apply, especially not the emergent or experimental programs yet to acquire robust impact evidence.

Recommendation 10. Devote resources to the design and development of a catalog of STEM/workforce education “discoveries” funded by federal grant agencies (e.g., NSF’s I-Test, DR-K12, INCLUDES, CSforAll, etc.) to be used by STEM educators, developers and practitioners.

This recommendation parallels recommendation #9, except it expressly concerns federal programs; it also relates to recommendation #6, except the goal is not a mere roster of offerings but a vetted (and user-friendly) What Works Clearinghouse for all prior grants yielding empirical support for preK-12 STEM, across all agencies. What a treasure-trove of proven interventions and innovations across NSF, DE, DOE, DoD, and beyond – mostly unknown to practitioners across the United States.

Each federal agency currently posts STEM opportunities at their websites (e.g., http://www.ed.gov/stem, http://dodstem.us/, http://www.nsf.gov/funding, http://www.nasa.gov/education, https://science.education.nih.gov/). These tools are valuable, but a desperate need remains for a single, STEM.gov-style searchable landing page.

There must be a way to view what worked across the thousands of R&D projects funded by these agencies – an online shopping mall for successful preK-12 STEM curricula, teaching approaches, equity practices, virtual platforms, etc. CoSTEM could create a “STEM Ideas that Work” landing page to ensure that emerging research insights are captured in systematic and accessible ways.

Example Programs

The Ideas That Work resource is an analog. Curated by the Office of Special Education Programs (OSEP) at the U.S. Department of Education, it is a searchable database that includes all past and current OSEP-funded grants. Special educators and families can search, e.g., “behavioral challenge,” yielding resources and toolkits, training modules, tip sheets, etc.

Black and white photo of early 20th century science class

Recommended Actions of ALI and Other Stakeholders

While we hope to see many of these recommendations in the forthcoming Five Year STEM Plan, actualizing them will take multiple actors working together to advance the STEM education field.

The Alliance for Learning Innovation has perhaps the most potent tool among STEM/workforce stakeholders to effect change: communication.

ALI should host events, publish white papers, develop convenings, and deploy mass media and other awareness and advocacy modes to rally its august collective of member organizations. Amplifying America’s rural STEM equity opportunity, career coaching capacity, educator-employer partnership potential, and convergence approach to learning – along with the six other recommendations – would do more to prepare the future STEM workforce than any other action, including investment.

Investment is a close second-most impactful action ALI can take. If all STEM investors – federal, state, corporate, and philanthropic – aligned around a finite array of pressing priorities served by a proven set of interventions (the very function of this report), the collective impact would transform systems. What it would take is an aggregator. ALI or a designee organization, functioning as an agent for businesses, philanthropies, and other STEM investors, could make funding recommendations (or, more ambitiously, pool investor funds) based on the consensus goals of the STEM cooperative, acting to focus investments accordingly.

Federal Agencies have made significant gains toward cooperative and complementary STEM education support through the sustenance of interagency working groups on Computational Literacy, Convergence, Strategic Partnerships, Transparency & Accountability, Inclusion in STEM, and Veterans and Military Spouses in STEM. As a result, improvements are being made in coordination and increased transparency about federal education R&D investments, especially between the National Science Foundation and the Department of Education. And yet, more needs to be done. 

Business, Industry, and Philanthropic Organizations have the ability to pilot or expand proven programs to national scale, as many examples herein attest. However, the impact of the investments of the private sector may fall short of systemic change due to a smorgasbord of pet programs chosen by each entity, leading to incremental rather than wholesale progress.

Business, industry, and philanthropic investors in STEM education should pool their resources around a finite array of proven programs for maximal, collective impact. A functional intermediary such as the Alliance for Learning Innovation could represent the interests of all non-government STEM funders by winnowing the horde of pre-K-12 STEM education programs to only those most effective at achieving consensus goals and priorities. The outcome might be a Consumer Reports-style top-rated performers menu that concentrates investments, amplifying impact. Like federal agencies, non-government funders should consider driving the advancement of transdisciplinary (convergent) STEM education, work-based or career-linked learning, the synchronization of in-school and out-of-school STEM education, educator career-coaching capacity, and the development of rural, diverse STEM workforce talent.     

States are best positioned to help local education/workforce organizations meet the human resource challenges and the material challenges inhibiting full production of future workers for high demand careers. It is state government that sets the policies that determine practices.

K-12 formal and informal education at the daily practical level bears the greatest responsibility to act on behalf of the future STEM workforce. Insofar as government and non-government funded programs support, and state policies empower, and preparatory trainings equip, educators should seize this moment in history to help American economic vitality and national security one student at a time.

Others at the table include post-secondary institutions, media outlets, faith communities, local trade and professional societies, social service providers, families, and citizens at-large. Each should contribute to the goal of producing a vibrant future workforce by advocating for education research and development policies at the state and federal levels and by partnering with formal and nonformal learning organizations to inspire tomorrow’s innovators in today’s classrooms. 

Students work in cell biology lab in Peckham Hall, 2012

Conclusion

American competitiveness through innovation is driven by leading-edge education systems. Legitimate concern for whether those systems can maintain their lead surfaces during periods of vulnerability – whether eclipsed in the space race, comparatively under-armored in military advancement, or surpassed in the advancement of information technology. To relinquish leadership in innovation is a threat to the U.S. economy and national security. In response to periodic threats to American innovation preeminence, bold investments in STEM education have produced waves of talent for securing the helm.

This era is different. Myriad fronts for innovation advancement – automation, machine learning, molecular medicine, energy transformation, cybersecurity – each harboring an existential challenge, heighten the imperative for action to an unprecedented level. And yet, the U.S. has never been more prepared to act. A wealth of pre-K-12 STEM programs and infrastructure stand in testament to legacy investments by the federal government and the private sector. This time, the challenge is to engage a broader swath of the population, especially those underserved and underrepresented in STEM programs of the past. And in tight budgetary times, broadened opportunities must utilize evidence-based solutions proven to work, whether they be in the realm of teacher preparation, equity and inclusion, early learning, informal education, community engagement, mathematics, coding, quantum physics, or all of the above and more.

The best time to invest is when the pathway to success is clear. The tools and the know-how for producing tomorrow’s STEM workforce reside within pre-K-12 systems today. For public and private investors alike, there is an opportunity for amplification through collective impact. By collectively identifying high-impact solutions transparent in design and indisputable in effect, aligning resources for surgical precision rather than shotgun spray, and scaling known winners to all young Americans, the current challenge to U.S. innovation leadership will be met. Enough with moving the needle. It is time to pin the needle, shattering the gauge.  

Predicting Progress: A Pilot of Expected Utility Forecasting in Science Funding

Read more about expected utility forecasting and science funding innovation here.

The current process that federal science agencies use for reviewing grant proposals is known to be biased against riskier proposals. As such, the metascience community has proposed many alternate approaches to evaluating grant proposals that could improve science funding outcomes. One such approach was proposed by Chiara Franzoni and Paula Stephan in a paper on how expected utility — a formal quantitative measure of predicted success and impact — could be a better metric for assessing the risk and reward profile of science proposals. Inspired by their paper, the Federation of American Scientists (FAS) collaborated with Metaculus to run a pilot study of this approach. In this working paper, we share the results of that pilot and its implications for future implementation of expected utility forecasting in science funding review. 

Brief Description of the Study

In fall 2023, we recruited a small cohort of subject matter experts to review five life science proposals by forecasting their expected utility. For each proposal, this consisted of defining two research milestones in consultation with the project leads and asking reviewers to make three forecasts for each milestone:

  1. the probability of success;
  2. the scientific impact of the milestone, if it were reached; and
  3. the social impact of the milestone, if it were reached.

These predictions can then be used to calculate the expected utility, or likely impact, of a proposal, and to design and compare potential portfolios.
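To make the arithmetic concrete, here is a minimal sketch of how the three forecasts per milestone might be combined into a proposal-level score. The aggregation rule below (probability of success times the sum of the two impact scores, summed over milestones) is an illustrative assumption, not the exact formula used in the FAS/Metaculus pilot, and the proposal names and scores are hypothetical.

```python
# Hypothetical sketch: combining per-milestone forecasts into a total
# expected utility. The additive impact aggregation is an assumption
# for illustration only.

def milestone_eu(p_success, scientific_impact, social_impact):
    """Expected utility of one milestone: probability times total impact."""
    return p_success * (scientific_impact + social_impact)

def proposal_eu(milestones):
    """Total expected utility of a proposal across its milestones."""
    return sum(milestone_eu(*m) for m in milestones)

# Each tuple: (probability of success, scientific impact, social impact)
proposal_a = [(0.8, 3.0, 2.0), (0.4, 7.0, 6.0)]   # riskier second milestone
proposal_b = [(0.9, 2.0, 1.0), (0.6, 4.0, 3.0)]   # safer, lower-impact

print(proposal_eu(proposal_a))  # 0.8*(3+2) + 0.4*(7+6) = 9.2
print(proposal_eu(proposal_b))  # 0.9*(2+1) + 0.6*(4+3) = 6.9
```

Under this scoring rule, the riskier proposal edges out the safer one, which is exactly the kind of comparison that expected utility forecasting is meant to make explicit when constructing portfolios.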

Key Takeaways for Grantmakers and Policymakers

The three main strengths of using expected utility forecasting to conduct peer review are

Despite the apparent complexity of this process, we found that first-time users were able to successfully complete their review according to the guidelines without any additional support. Most of the complexity occurs behind the scenes, and either aligns with the responsibilities of the program manager (e.g., defining milestones and their dependencies) or can be automated (e.g., calculating the total expected utility). Thus, grantmakers and policymakers can have confidence in the user-friendliness of expected utility forecasting.

How Can NSF or NIH Run an Experiment on Expected Utility Forecasting?

An initial pilot study could be conducted by NSF or NIH by adding a short, non-binding expected utility forecasting component to a selection of review panels. In addition to the evaluation of traditional criteria, reviewers would be asked to predict the success and impact of select milestones for the proposals assigned to them. The rest of the review process and the final funding decisions would be made using the traditional criteria. 

Afterwards, study facilitators could take the expected utility forecasting results and construct an alternate portfolio of proposals that would have been funded if that approach had been used, and compare the two portfolios. Such a comparison would yield valuable insights into whether—and how—the types of proposals selected by each approach differ, and whether their use leads to different considerations arising during review. Additionally, a pilot assessment of reviewers’ prediction accuracy could be conducted by asking program officers to assess milestone achievement and study impact upon completion of funded projects.

Findings and Recommendations

Reviewers in our study were new to the expected utility forecasting process and gave generally positive reactions. In their feedback, reviewers said that they appreciated how the framing of the questions prompted them to think about the proposals in a different way and pushed them to ground their assessments with quantitative forecasts. The focus on just three review criteria – probability of success, scientific impact, and social impact – was seen as a strength because it simplified the process, disentangled feasibility from impact, and eliminated biased metrics. Overall, reviewers found this new approach interesting and worth investigating further.

In designing this pilot and analyzing the results, we identified several important considerations for planning such a review process. While complex, engaging with these considerations tended to provide value by making implicit project details explicit and encouraging clear definition and communication of evaluation criteria to reviewers. Two key examples are defining the proposal milestones and creating impact scoring systems. In both cases, reducing ambiguities in terms of the goals that are to be achieved, developing an understanding of how outcomes depend on one another, and creating interpretable and resolvable criteria for assessment will help ensure that the desired information is solicited from reviewers. 

Questions for Further Study

Our pilot only simulated the individual review phase of grant proposals and did not simulate a full review committee. The typical review process at a funding agency consists of first, individual evaluations by assigned reviewers, then discussion of those evaluations by the whole review committee, and finally, the submission of final scores from all members of the committee. This is similar to the Delphi method, a structured process for eliciting forecasts from a panel of experts, so we believe that it would work well with expected utility forecasting. The primary change would therefore be in the definition and approach for eliciting criterion scores, rather than the structure of the review process. Nevertheless, future implementations may uncover additional considerations that need to be addressed or better ways to incorporate forecasting into a panel environment. 

Further investigation into how best to define proposal milestones is also needed. This includes questions such as, who should be responsible for determining the milestones? If reviewers are involved, at what part(s) of the review process should this occur? What is the right balance between precision and flexibility of milestone definitions, such that the best outcomes are achieved? How much flexibility should there be in the number of milestones per proposal? 

Lastly, more thought should be given to how to define social impact and how to calibrate reviewers’ interpretation of the impact score scale. In our report, we propose several options for calibrating impact, in addition to describing the approach we took in our pilot.

Interested grantmakers, both public and private, and policymakers are welcome to reach out to our team if interested in learning more or receiving assistance in implementing this approach.


Introduction

The fundamental concern of grantmakers, whether governmental or philanthropic, is how to make the best funding decisions. All funding decisions come with inherent uncertainties that may pose risks to the investment. Thus, a certain level of risk-aversion is natural and even desirable in grantmaking institutions, especially federal science agencies which are responsible for managing taxpayer dollars. However, without risk, there is no reward, so the trade-off must be balanced. In mathematics and economics, expected utility is the common metric assumed to underlie all rational decision making. Expected utility has two components: the probability of an outcome occurring if an action is taken and the value of that outcome, which roughly correspond to risk and reward. Thus, expected utility would seem to be a logical choice for evaluating science funding proposals.
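As a minimal illustration of the concept (our own toy numbers, not drawn from any agency’s process), expected utility weights each possible outcome’s value by its probability:

```python
def expected_utility(outcomes):
    """Sum of P(outcome) * value(outcome) over all possible outcomes."""
    return sum(p * v for p, v in outcomes)

# Hypothetical comparison: a safe project vs. a risky, high-reward one.
safe = [(0.9, 10.0), (0.1, 0.0)]    # 90% chance of a modest payoff
risky = [(0.2, 60.0), (0.8, 0.0)]   # 20% chance of a large payoff
print(expected_utility(safe))   # 9.0
print(expected_utility(risky))  # 12.0 -- higher despite the greater risk
```

On this metric, the risky project is the better bet, which is exactly the kind of trade-off a purely risk-averse review process can miss.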

In debates around funding innovation, though, expected utility has largely flown under the radar compared to other ideas. Nevertheless, Chiara Franzoni and Paula Stephan have proposed using expected utility in peer review. Building on their paper, the Federation of American Scientists (FAS) developed a detailed framework for how to implement expected utility in a peer review process. We chose to frame the review criteria as forecasting questions, since determining the expected utility of a proposal inherently requires making some predictions about the future. Forecasting questions also have the added benefit of being resolvable–i.e., the true outcome can be determined after the fact and compared to the prediction–which provides a learning opportunity for reviewers to improve their abilities and identify biases. In addition to forecasting, we incorporated other unique features, like an exponential scale for scoring impact, that we believe help reduce biases against risky proposals.

With the theory laid out, we conducted a small pilot in fall of 2023. The pilot was run in collaboration with Metaculus, a crowd forecasting platform and aggregator, to leverage their expertise in designing resolvable forecasting questions and to use their platform to collect forecasts from reviewers. The purpose of the pilot was to test the mechanics of this approach in practice, identify any additional considerations that need to be thought through, and surface potential issues that would need to be solved. We were also curious whether any interesting or unexpected results would arise based on how we chose to calculate impact and total expected utility. It is important to note that this pilot was not an experiment, so we did not have a control group against which to compare the results of the review.

Since FAS is not a grantmaking institution, we did not have a ready supply of traditional grant proposals to use. Instead, we used a set of two-page research proposals for Focused Research Organizations (FROs) that we had sourced through separate advocacy work in that area.1 With the proposal authors’ permission, we recruited a cohort of twenty subject matter experts to each review one of five proposals. For each proposal, we defined two research milestones in consultation with the proposal authors. Reviewers were asked to make three forecasts for each milestone:

  1. The probability of success;
  2. The scientific impact, conditional on success; and
  3. The social impact, conditional on success.

Reviewers submitted their forecasts on Metaculus’ platform; in a separate form they provided explanations for their forecasts and responded to questions about their experience and impression of this new approach to proposal evaluation. (See Appendix A for details on the pilot study design.)

Insights from Reviewer Feedback

Overall, reviewers liked the framing and criteria provided by the expected utility approach, while their main critique was of the structure of the research proposals. Excluding critiques of the research proposal structure, which are unlikely to apply to an actual grant program, two thirds of the reviewers expressed positive opinions of the review process and/or thought it was worth pursuing further given drawbacks with existing review processes. Below, we delve into the details of the feedback we received from reviewers and their implications for future implementation.

Feedback on Review Criteria

Disentangling Impact from Feasibility

Many of the reviewers said that this model prompted them to think differently about how they assess the proposals and that they liked the new questions. Reviewers appreciated that the questions focused their attention on what they think funding agencies really want to know and nothing more: “can it occur?” and “will it matter?” This approach explicitly disentangles impact from feasibility: “Often, these two are taken together, and if one doesn’t think it is likely to succeed, the impact is also seen as lower.” Additionally, the emphasis on big picture scientific and social impact “is often missing in the typical review process.” Reviewers also liked that this approach eliminates what they consider biased metrics, such as the principal investigator’s reputation, track record, and “excellence.” 

Reducing Administrative Burden

The small set of questions was seen as more efficient and less burdensome on reviewers. One reviewer said, “I liked this approach to scoring a proposal. It reduces the effort to thinking about perceived impact and feasibility.” Another reviewer said, “On the whole it seems a worthwhile exercise as the current review processes for proposals are onerous.” 

Quantitative Forecasting

Reviewers saw benefits to being asked to quantify their assessments, but also found it challenging at times. A number of reviewers enjoyed taking a quantitative approach and thought that it helped them be more grounded and explicit in their evaluations of the proposals. However, some reviewers were concerned that it felt like guesswork and expressed low confidence in their quantitative assessments, primarily due to proposals lacking details on their planned research methods, which is an issue discussed in the section “Feedback on Proposals.” Nevertheless, some of these reviewers still saw benefits to taking a quantitative approach: “It is interesting to try to estimate probabilities, rather than making flat statements, but I don’t think I guess very well. It is better than simply classically reviewing the proposal [though].” Since not all academics have experience making quantitative predictions, we expect that there will be a learning curve for those new to the practice. Forecasting is a skill that can be learned though, and we think that with training and feedback, reviewers can become better, more confident forecasters.

Defining Social Impact

Of the three types of questions that reviewers were asked to answer, the question about social impact seemed the hardest to interpret. Reviewers noted that they would have liked more guidance on what was meant by social impact and whether that included indirect impacts. Since questions like these are ultimately subjective, the “right” definition of social impact and what types of outcomes are considered most valuable will depend on the grantmaking institution, their domain area, and their theory of change, so we leave this open to future implementers to clarify in their instructions.

Calibrating Impact

While the impact score scale (see Appendix A) defines the relative difference in impact between scores, it does not define the absolute impact conveyed by a score. For this reason, a calibration mechanism is necessary to provide reviewers with a shared understanding of the use and interpretation of the scoring system. Note that this is a challenge that rubric-based peer review criteria used by science agencies also face. Discussion and aggregation of scores across a review committee helps align reviewers and average out some of this natural variation.2

To address this, we surveyed a small, separate set of academics in the life sciences about how they would score the social and scientific impact of the average NIH R01 grant, which many life science researchers apply to and review proposals for. We then provided the average scores from this survey to reviewers to orient them to the new scale and help them calibrate their scores. 

One reviewer suggested an alternative approach: “The other thing I might change is having a test/baseline question for every reviewer to respond to, so you can get a feel for how we skew in terms of assessing impact on both scientific and social aspects.” One option would be to ask reviewers to score the social and scientific impact of the average grant proposal for a grant program that all reviewers would be familiar with; another would be to ask reviewers to score the impact of the average funded grant for a specific grant program, which could be more accessible for new reviewers who have not previously reviewed grant proposals. A third option would be to provide all reviewers on a committee with one or more sample proposals to score and discuss, in a relevant and shared domain area.

When deciding on an approach for calibration, a key consideration is the specific resolution criteria that are being used — i.e., the downstream measures of impact that reviewers are being asked to predict. One option, which was used in our pilot, is to predict the scores that a comparable, but independent, panel of reviewers would give the project some number of years following its successful completion. For a resolution criterion like this one, collecting and sharing calibration scores can help reviewers get a sense for not just their own approach to scoring, but also those of their peers.

Making Funding Decisions

In scoring the social and scientific impact of each proposal, reviewers were asked to assess the value of the proposal to society or to the scientific field. That alone would be insufficient to determine whether a proposal should be funded though, since it would need to be compared with other proposals in conjunction with its feasibility. To do so, we calculated the total expected utility of each proposal (see Appendix C). In a real funding scenario, this final metric could then be used to compare proposals and determine which ones get funded. Additionally, unlike a traditional scoring system, the expected utility approach allows for the detailed comparison of portfolios — including considerations like the expected proportion of milestones reached and the range of likely impacts.

In our pilot, reviewers were not informed that we would be doing this additional calculation based on their submissions. As a result, one reviewer thought that the questions they were asked failed to include other important questions, like “should it occur?” and “is it worth the opportunity cost?” Though these questions were not asked of reviewers explicitly, we believe that they would be answered once the expected utility of all proposals is calculated and considered, since the opportunity cost of one proposal would be the expected utility of the other proposals. Since each reviewer only provided input on one proposal, they may have felt like the scores they gave would be used to make a binary yes/no decision on whether to fund that one proposal, rather than being considered as a part of a larger pool of proposals, as it would be in a real review process.

Feedback on Proposals

Missing Information Impedes Forecasting

The primary critique that reviewers expressed was that the research proposals lacked details about their research plans, what methods and experimental protocols would be used, and what preliminary research the author(s) had done so far. This hindered their ability to properly assess the technical feasibility of the proposals and their probability of success. A few reviewers expressed that they also would have liked to have had a better sense of who would be conducting the research and each team member’s responsibilities. These issues arose because the FRO proposals used in our pilot had not originally been submitted for funding purposes, and thus lacked the requirements of traditional grant proposals, as we noted above. We assume this would not be an issue with proposals submitted to actual grantmakers.3  

Improving Milestone Design

A few reviewers pointed out that some of the proposal milestones were too ambiguous or were not worded specifically enough, such that there were ways that researchers could technically claim to have achieved the milestone without accomplishing the spirit of its intent. This made it more challenging for reviewers to assess milestones, since they weren’t sure whether to focus on the ideal (i.e., more impactful) interpretation of the milestone or to account for these “loopholes.” Moreover, loopholes skew the forecasts: they increase the probability that a milestone is achieved while lowering its expected impact, since the milestone may be satisfied in a less meaningful way.

One reviewer suggested, “I feel like the design of milestones should be far more carefully worded – or broken up into sub-sentences/sub-aims, to evaluate the feasibility of each. As the questions are currently broken down, I feel they create a perverse incentive to create a vaguer milestone, or one that can be more easily considered ‘achieved’ for some ‘good enough’ value of achieved.” For example, they proposed that one of the proposal milestones, “screen a library of tens of thousands of phage genes for enterobacteria for interactions and publish promising new interactions for the field to study,” could be expanded to

  1. “Generate a library of tens of thousands of genes from enterobacteria, expressed in E. coli
  2. “Validate their expression under screenable conditions
  3. “Screen the library for their ability to impede phage infection with a panel of 20 type phages
  4. “Publish … 
  5. “Store and distribute the library, making it as accessible to the broader community”

We agree with the need for careful consideration and design of milestones, given that “loopholes” in milestones can detract from their intended impact and make it harder for reviewers to accurately assess their likelihood. In our theoretical framework for this approach, we identified three potential parties that could be responsible for defining milestones: (1) the proposal author(s), (2) the program manager, with or without input from proposal authors, or (3) the reviewers, with or without input from proposal authors. This critique suggests that the first approach of allowing proposal authors to be the sole party responsible for defining proposal milestones is vulnerable to being gamed, and the second or third approach would be preferable. Program managers who take on the task of defining milestones should have enough expertise to think through the different potential ways of fulfilling a milestone and make sure that they are sufficiently precise for reviewers to assess.

Benefits of Flexibility in Milestones

Some flexibility in milestones may still be desirable, especially with respect to the actual methodology, since experimentation may be necessary to determine the best technique to use. For example, speaking about the feasibility of a different proposal milestone – “demonstrate that Pro-AG technology can be adapted to a single pathogenic bacterial strain in a 300 gallon aquarium of fish and successfully reduce antibiotic resistance by 90%” – a reviewer noted that 

“The main complexity and uncertainty around successful completion of this milestone arises from the native fish microbiome and whether a CRISPR delivery tool can reach the target strain in question. Due to the framing of this milestone, should a single strain be very difficult to reach, the authors could simply switch to a different target strain if necessary. Additionally, the mode of CRISPR delivery is not prescribed in reaching this milestone, so the authors have a host of different techniques open to them, including conjugative delivery by a probiotic donor or delivery by engineered bacteriophage.”

Peer Review Results

Sequential Milestones vs. Independent Outcomes

In our expected utility forecasting framework, we defined two different ways that a proposal could structure its outcomes: as sequential milestones where each additional milestone builds off of the success of the previous one, or as independent outcomes where the success of one is not dependent on the success of the other(s). For proposals with sequential milestones in our pilot, we would expect the probability of success of milestone 2 to be less than the probability of success of milestone 1 and for the opposite to be true of their impact scores. For proposals with independent outcomes, we do not expect there to be a relationship between the probability of success and the impact scores of milestones 1 and 2. There are different equations for calculating the total expected utility, depending on the relationship between outcomes (see Appendix C).
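The exact equations are given in Appendix C. As a rough sketch of the distinction (our own simplified rendering, not the appendix’s formulas), assume linear utilities u1 and u2, a milestone-2 probability conditional on milestone 1, and, for the sequential case, that milestone 2’s utility subsumes milestone 1’s:

```python
def eu_independent(p1, u1, p2, u2):
    # Independent outcomes: each milestone contributes its own expected utility.
    return p1 * u1 + p2 * u2

def eu_sequential(p1, u1, p2_given_1, u2):
    # Sequential milestones: milestone 2 is reachable only via milestone 1,
    # and u2 is assumed to include milestone 1's impact.
    # Two terminal outcomes: stop at milestone 1, or reach milestone 2.
    return p1 * (1 - p2_given_1) * u1 + p1 * p2_given_1 * u2

# Hypothetical values: with p1 = 0.8 and p2_given_1 = 0.5, the unconditional
# probability of reaching milestone 2 is 0.4, i.e., half of p1 -- the same
# pattern seen for the sequential proposals in Table 1.
print(eu_sequential(0.8, 10.0, 0.5, 30.0))  # 16.0
```

Note that reviewers in our pilot forecast each milestone’s probability directly, so for sequential proposals the reported milestone-2 probabilities are naturally lower than milestone-1 probabilities.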

For each of the proposals in our study, we categorized them based on whether they had sequential milestones or independent outcomes. This information was not shared with reviewers. Table 1 presents the average reviewer forecasts for each proposal. In general, milestones received higher scientific impact scores than social impact scores, which makes sense given the primarily academic focus of research proposals. For proposals 1 to 3, the probability of success of milestone 2 was roughly half of the probability of success of milestone 1; reviewers also gave milestone 2 higher scientific and social impact scores than milestone 1. This is consistent with our categorization of proposals 1 to 3 as sequential milestones.

Table 1. Mean forecasts for each proposal.
See next section for discussion about the categorization of proposal 4’s milestones.
Proposal | Milestone Category | M1 Prob. of Success | M1 Scientific Impact | M1 Social Impact | M2 Prob. of Success | M2 Scientific Impact | M2 Social Impact
1 | sequential | 0.80 | 7.83 | 7.35 | 0.41 | 8.22 | 8.25
2 | sequential | 0.88 | 6.41 | 3.72 | 0.36 | 8.21 | 7.62
3 | sequential | 0.68 | 7.07 | 6.45 | 0.34 | 8.20 | 7.50
4 | ? | 0.72 | 6.58 | 3.92 | 0.47 | 7.06 | 4.19
5 | independent | 0.55 | 7.14 | 2.37 | 0.40 | 6.66 | 2.25

(M1 = milestone 1; M2 = milestone 2)

Further Discussion on Designing and Categorizing Milestones

We originally categorized proposal 4’s milestones as sequential, but one reviewer gave milestone 2 a lower scientific impact score than milestone 1 and two reviewers gave it a lower social impact score. One reviewer also gave milestone 2 roughly the same probability of success as milestone 1. This suggests that proposal 4’s milestones can’t be considered strictly sequential. 

The two milestones for proposal 4 were

The reviewer who gave milestone 2 a lower scientific impact score explained: “Given the wording of the milestone, I do not believe that if the scientific milestone was achieved, it would greatly improve our understanding of the brain.” Unlike proposals 1-3, in which milestone 2 was a scaled-up or improved-upon version of milestone 1, these milestones represent fundamentally different categories of output (general-purpose tool vs specific model). Thus, despite the necessity of milestone 1’s tool for achieving milestone 2, the reviewer’s response suggests that the impact of milestone 2 was being considered separately rather than cumulatively.

Milestone Design Recommendations

Recommendation 1: Explicitly define sequential milestones

To properly address this case of sequential milestones with different types of outputs, we recommend that for all sequential milestones, latter milestones should be explicitly defined as inclusive of prior milestones. In the above example, this would imply redefining milestone 2 as “Complete milestone 1 and develop a model of the C. elegans nervous system…” This way, reviewers know to include the impact of milestone 1 in their assessment of the impact of milestone 2.

Recommendation 2: Clarify milestone category with reviewers

To help ensure that reviewers are aligned with program managers in how they interpret the proposal milestones (if they aren’t directly involved in defining milestones), we suggest that reviewers either be informed of how program managers are categorizing the proposal outputs so they can conduct their review accordingly, or be allowed to decide the category themselves (and thus how the total expected utility is calculated), whether individually, collectively, or both.

Recommendation 3: Allow for a flexible number of milestones

We chose to use only two of the goals that proposal authors provided because we wanted to standardize the number of milestones across proposals. However, this may have provided an incomplete picture of the proposals’ goals, and thus an incomplete assessment of the proposals. We recommend that future implementations be flexible and allow the number of milestones to be determined based on each proposal’s needs. This would also help accommodate one reviewer’s suggestion that some milestones be broken down into intermediary steps.

Importance of Reviewer Explanations

As one can tell from the above discussion, reviewers’ explanations of their forecasts were crucial to understanding how they interpreted the milestones. Reviewers’ explanations varied in length and detail, but the most insightful responses broke down their reasoning into detailed steps and addressed (1) ambiguities in the milestone and how they chose to interpret any that existed, (2) the state of the scientific field and the maturity of the different techniques that the authors propose to use, and (3) factors that improve the likelihood of success versus potential barriers or challenges that would need to be overcome.

Exponential Impact Scales Better Reflect the Real Distribution of Impact 

The distribution of NIH and NSF proposal peer review scores tends to be skewed such that most proposals are rated above the center of the scale and few proposals are rated poorly. However, other markers of scientific impact, such as citations (even with all their imperfections), tend to suggest a long tail of studies with high impact. This discrepancy suggests that traditional peer review scoring systems are not well structured to capture the nonlinearity of scientific impact, resulting in score inflation. The bunching of scores at the top end of the scale also means that very negative scores carry more weight than very positive ones when averaged together, since there is more room between the average score and the bottom end of the scale. This can generate systemic bias against more controversial or risky proposals.

In our pilot, we chose to use an exponential scale with a base of 2 for impact to better reflect the real distribution of scientific impact. Using this exponential impact scale, we surveyed a small pool of academics in the life sciences about how they would rate the impact of the average funded NIH R01 grant. They responded with an average scientific impact score of 5 and an average social impact score of 3, which are much lower on our scale than traditional peer review scores4, suggesting that the exponential scale may help avoid score inflation and bunching at the top. In our pilot, the distribution of scientific impact scores was centered higher than 5, but still less skewed than NIH peer review scores for significance and innovation typically are. This partially reflects the fact that the proposals were expected to be funded at levels one to two orders of magnitude higher than NIH R01 grants, so their impact should also be greater. The distribution of social impact scores exhibits a much wider spread and a lower center.

Figure 1. Distribution of Impact scores for milestone 1 (top) and 2 (bottom)

Conclusion

In summary, expected utility forecasting presents a promising approach to improving the rigor of peer review and quantitatively defining the risk-reward profile of science proposals. Our pilot study suggests that this approach can be quite user-friendly for reviewers, despite its apparent complexity. Further study into how best to integrate forecasting into panel environments, define proposal milestones, and calibrate impact scales will help refine future implementations of this approach. 

More broadly, we hope that this pilot will encourage more grantmaking institutions to experiment with innovative funding mechanisms. Reviewers in our pilot were more open-minded and quicker to learn than one might expect and saw significant value in this unconventional approach. Perhaps this should not be much of a surprise, given that experimentation is at the heart of scientific research.

Interested grantmakers, both public and private, and policymakers are welcome to reach out to our team if interested in learning more or receiving assistance in implementing this approach. 

Acknowledgements

Many thanks to Jordan Dworkin for being an incredible thought partner in designing the pilot and providing meticulous feedback on this report. Your efforts made this project possible!


Appendix A: Pilot Study Design

Our pilot study consisted of five proposals for life science-related Focused Research Organizations (FROs). These proposals were solicited from academic researchers by FAS as part of our advocacy for the concept of FROs. As such, these proposals were not originally intended as proposals for direct funding and did not have content requirements as strict as those of traditional grant proposals. Researchers were asked to submit one- to two-page proposals discussing (1) their research concept, (2) the motivation and its expected social and scientific impact, and (3) the rationale for why this research cannot be accomplished through traditional funding channels and thus requires a FRO to be funded.

Permission was obtained from proposal authors to use their proposals in this study. We worked with proposal authors to define two milestones for each proposal that reviewers would assess: one that the authors felt confident they could achieve and one that was more ambitious but that they still considered feasible. In addition, due to the brevity of the proposals, we included an additional 1-2 pages of supplementary information and scientific context. Final drafts of the milestones and supplementary information were provided to authors to edit and approve. Because this pilot study could not provide any actual funding to proposal authors, it was not possible to solicit full-length research proposals.

We recruited four to six reviewers for each proposal based on their subject matter expertise. Potential participants were recruited over email with a request to help review a FRO proposal related to their area of research. They were informed that the review process would be unconventional but were not informed of the study’s purpose. Participants were offered a small monetary compensation for their time.

Confirmed participants were all sent instructions and materials for the review process on the same day and were asked to complete their reviews by a shared deadline a month and a half later. Reviewers were told to assume that, if funded, each proposal would receive $50 million in funding over five years to conduct the research, consistent with the proposed model for FROs. Each proposal had two technical milestones, and reviewers were asked to answer the following questions for each milestone:

  1. Assuming that the proposal is funded by 2025, will the milestone be achieved before 2031?
  2. What will be the average scientific impact score, as judged in 2032, of accomplishing the milestone?
  3. What will be the average social impact score, as judged in 2032, of accomplishing the milestone?

The impact scoring system was explained to reviewers as follows:

Please consider the following in determining the impact score: the current and expected long-term social or scientific impact of a funded FRO’s outputs if it accomplishes this milestone before 2030.

The impact score we are using ranges from 1 (low) to 10 (high). It is base 2 exponential, meaning that a proposal that receives a score of 5 has double the impact of a proposal that receives a score of 4, and quadruple the impact of a proposal that receives a score of 3. In a small survey we conducted of SMEs in the life sciences, they rated the scientific and social impact of the average NIH R01 grant — a federally funded research grant that provides $1-2 million for a 3-5 year endeavor — on this scale to be 5.2 ± 1.5 and 3.1 ± 1.3, respectively. The median scores were 4.75 and 3.00, respectively.

Below is an example of how a predicted impact score distribution (left) would translate into an actual impact distribution (right). You can try it out yourself with this interactive version (in the menu bar, click Runtime > Run all) to get some further intuition on how the impact score works. Please note that this is meant solely for instructive purposes, and the interface is not designed to match Metaculus’ interface.

The choice of an exponential impact scale reflects the tendency in science for a small number of research projects to have an outsized impact. For example, studies have shown that the relationship between the number of citations for a journal article and its percentile rank scales exponentially.

Scientific impact aims to capture the extent to which a project advances the frontiers of knowledge, enables new discoveries or innovations, or enhances scientific capabilities or methods. Though each is imperfect, one could consider citations of papers, patents on tools or methods, or users of software or datasets as proxies of scientific impact. 

Social impact aims to capture the extent to which a project contributes to solving important societal problems, improving well-being, or advancing social goals. Some proxy metrics that one might use to assess a project’s social impact are the value of lives saved, the cost of illness prevented, the number of job-years of employment generated, economic output in terms of GDP, or the social return on investment. 

You may consider any or none of these proxy metrics as a part of your assessment of the impact of a FRO accomplishing this milestone.

Reviewers were asked to submit their forecasts on Metaculus’ website and to provide their reasoning in a separate Google form. For question 1, reviewers were asked to respond with a single probability. For questions 2 and 3, reviewers were asked to provide their median, 25th percentile, and 75th percentile predictions, in order to generate a probability distribution. Metaculus’ website also included information on the resolution criteria of each question, which provided guidance to reviewers on how to answer the question. Individual reviewers were blind to other reviewers’ responses until after the submission deadline, at which point the aggregated results of all of the responses were made public on Metaculus’ website. 
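Metaculus generated these distributions on its platform, and the exact fitting method is theirs. Purely as an illustration of how quantile forecasts can induce a distribution, here is a sketch that fits a normal distribution to a reviewer's three reported percentiles using the Python standard library; the fitting method and the example numbers are our own assumptions, not Metaculus' algorithm:

```python
from statistics import NormalDist

def fit_normal(p25: float, median: float, p75: float) -> NormalDist:
    """Fit a normal distribution whose median and interquartile
    range match a reviewer's three reported percentiles."""
    z75 = NormalDist().inv_cdf(0.75)   # standard-normal 75th percentile, ~0.6745
    sigma = (p75 - p25) / (2 * z75)    # the IQR of a normal is 2 * z75 * sigma
    return NormalDist(mu=median, sigma=sigma)

# Hypothetical impact-score forecasts from one reviewer.
dist = fit_normal(p25=4.0, median=5.0, p75=6.0)
print(dist.cdf(7.0))  # implied probability the realized score falls below 7
```

A real aggregation would also need to combine distributions across reviewers and respect the bounded 1-10 scale; this sketch ignores both.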

Additionally, in the Google form, reviewers were asked to answer a survey question about their experience: “What did you think about this review process? Did it prompt you to think about the proposal in a different way than when you normally review proposals? If so, how? What did you like about it? What did you not like? What would you change about it if you could?” 

Some participants did not complete their review. We received 19 complete reviews in the end, with each proposal receiving three to six reviews. 

Study Limitations

Our pilot study had certain limitations that should be noted. Since FAS is not a grantmaking institution, we could not fully reproduce either the types of research proposals that a grantmaking institution would receive or its entire review process. We highlight these differences below in comparison to federal science agencies, which are our primary focus.

  1. Review Process: There are typically two phases to peer review at NIH and NSF. First, at least three individual reviewers with relevant subject matter expertise are assigned to read and evaluate a proposal independently. Then, a larger committee of experts is convened. There, the assigned reviewers present the proposal and their evaluations, and the committee discusses and determines the final score for the proposal. Our pilot study only attempted to replicate the first phase of individual review.
  2. Sample Size: The sample size in our pilot was quite small: only five proposals were reviewed, and because they were all in different subfields, different reviewers were assigned to each proposal. NIH and NSF peer review committees typically focus on one subfield and review on the order of twenty proposals. The number of reviewers per proposal in our pilot (three to six) was consistent with the number typically assigned to a proposal by NIH and NSF. Peer review committees are typically larger, ranging from six to twenty people, depending on the agency and the field.
  3. Proposals: The FRO proposals plus supplementary information were only two to four pages long, significantly shorter than the 12- to 15-page proposals that researchers submit for NIH and NSF grants. Proposal authors were asked to describe their research concept in general terms, but were not explicitly required to detail the research methodology they would use or any preliminary research. Some proposal authors volunteered more of this information in the supplementary materials, but not all did.
  4. Grant Size: For the FRO proposals, reviewers were asked to assume that funded proposals would receive $50 million over five years, which is one to two orders of magnitude more funding than typical NIH and NSF grants.

Appendix B: Feedback on Study-Specific Implementation

In addition to feedback about the review framework, we received feedback on how we implemented our pilot study, specifically the instructions and materials for the review process and the submission platforms. This feedback isn’t central to this paper’s investigation of expected value forecasting, but we wanted to include it in the appendix for transparency.

Reviewers were sent instructions over email that outlined the review process and linked to Metaculus’ webpage for this pilot. On Metaculus’ website, reviewers could find links to the proposals on FAS’ website and the supplementary information in Google docs. Reviewers were expected to read those first and then read through the resolution criteria for each forecasting question before submitting their answers on Metaculus’ platform. Reviewers were asked to submit the explanations behind their forecasts in a separate Google form.

Some reviewers had no problem navigating the review process and found Metaculus’ website easy to use. However, feedback from other reviewers suggested that the different components necessary for the review were spread out over too many different websites, making it difficult for reviewers to keep track of where to find everything they needed.

Some had trouble locating the different materials and pieces of information needed to conduct the review on Metaculus’ website. Others found it confusing to have to submit their forecasts and explanations in two separate places. One reviewer suggested that the explanation of the impact scoring system should have been included within the instructions sent over email rather than in the resolution criteria on Metaculus’ website so that they could have read it before reading the proposal. Another reviewer suggested that it would have been simpler to submit their forecasts through the same Google form that they used to submit their explanations rather than through Metaculus’ website. 

Based on this feedback, we would recommend that future implementations streamline submissions to a single platform and provide a more extensive set of instructions up front rather than scattering information across different steps of the review process. Training sessions, which science funding agencies typically conduct, would be a good supplement to written instructions.

Appendix C: Total Expected Utility Calculations

To calculate the total expected utility, we first converted all of the impact scores into utilities by raising two to the power of each impact score, since the impact scoring system is base 2 exponential:

Utility = 2^(Impact Score).

We then were able to average the utilities for each milestone and conduct additional calculations. 

To calculate the total utility of each milestone, ui, we averaged the social utility and the scientific utility of the milestone:

ui = (Social Utility + Scientific Utility)/2.

The total expected utility (TEU) of a proposal with two milestones can be calculated according to the general equation:

TEU = u1P(m1 ∩ not m2) + u2P(m2 ∩ not m1) + (u1+u2)P(m1 ∩ m2),

where P(mi) represents the probability of success of milestone i and

P(m1 ∩ not m2) = P(m1) – P(m1 ∩ m2)
P(m2 ∩ not m1) = P(m2) – P(m1 ∩ m2).

For sequential milestones, milestone 2 is defined as inclusive of milestone 1 and wholly dependent on the success of milestone 1, so this means that

u2,seq = u1 + u2
P(m2) = Pseq(m1 ∩ m2)
P(m2 ∩ not m1) = 0.

Thus, the total expected utility of sequential milestones can be simplified as

TEU = u1P(m1) – u1P(m2) + u2,seqP(m2)
TEU = u1P(m1) + (u2,seq – u1)P(m2).

This can be generalized to

TEUseq = Σi (ui,seq – ui-1,seq)P(mi), with u0,seq = 0.

Otherwise, the total expected utility can be simplified to 

TEU = u1P(m1) + u2P(m2) – (u1+u2)P(m1 ∩ m2).

For independent outcomes, we assume 

Pind(m1 ∩ m2) = P(m1)P(m2), 

so

TEUind = u1P(m1) + u2P(m2) – (u1+u2)P(m1)P(m2).

To present the results in Tables 1 and 2, we converted all of the utility values back into the impact score scale by taking the log base 2 of the results.
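As a worked illustration of the appendix's formulas, here is a minimal Python sketch; the two milestone scores (4 and 6) and success probabilities (0.8 and 0.3) are hypothetical, not values from the study:

```python
from math import log2

def to_utility(score):
    # Impact scores are base-2 exponential: utility = 2^score.
    return 2 ** score

def teu_independent(u1, p1, u2, p2):
    # TEU_ind = u1*P(m1) + u2*P(m2) - (u1 + u2)*P(m1)*P(m2),
    # using the independence assumption P(m1 ∩ m2) = P(m1)P(m2).
    return u1 * p1 + u2 * p2 - (u1 + u2) * p1 * p2

def teu_sequential(u_seq, p):
    # TEU_seq = sum_i (u_{i,seq} - u_{i-1,seq}) * P(m_i), with u_{0,seq} = 0.
    total, prev = 0.0, 0.0
    for u_i, p_i in zip(u_seq, p):
        total += (u_i - prev) * p_i
        prev = u_i
    return total

u1, u2 = to_utility(4), to_utility(6)  # utilities 16 and 64
teu = teu_independent(u1, 0.8, u2, 0.3)
print(f"Independent milestones, impact-score scale: {log2(teu):.2f}")

# Sequential case: milestone 2 is inclusive of milestone 1, so u2,seq = u1 + u2.
teu_seq = teu_sequential([u1, u1 + u2], [0.8, 0.3])
print(f"Sequential milestones, impact-score scale: {log2(teu_seq):.2f}")
```

The final `log2` calls mirror the conversion used for Tables 1 and 2, mapping utilities back onto the impact-score scale.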

Scaling AI Safely: Can Preparedness Frameworks Pull Their Weight?

A new class of risk mitigation policies has recently come into vogue among frontier AI developers. Known either as Responsible Scaling Policies or Preparedness Frameworks, these policies outline risk mitigations that developers of the most advanced AI models commit to implement as their models display increasingly risky capabilities. While the idea for these policies is less than a year old, two of the most advanced AI developers, Anthropic and OpenAI, have already published initial versions. The U.K. AI Safety Institute asked frontier AI developers about their “Responsible Capability Scaling” policies ahead of the November 2023 UK AI Safety Summit. It seems that these policies are here to stay.

The National Institute of Standards and Technology (NIST) recently sought public input on its assignments regarding generative AI risk management, AI evaluation, and red-teaming. The Federation of American Scientists was happy to provide input; this is the full text of our response. NIST’s request for information (RFI) highlighted several potential risks and impacts of potentially dual-use foundation models, including: “Negative effects of system interaction and tool use…chemical, biological, radiological, and nuclear (CBRN) risks…[e]nhancing or otherwise affecting malign cyber actors’ capabilities…[and i]mpacts to individuals and society.” This RFI presented a good opportunity for us to discuss the benefits and drawbacks of these new risk mitigation policies.

This report will provide some background on this class of risk mitigation policies (we use the term Preparedness Framework, for reasons described below). We outline suggested criteria for robust Preparedness Frameworks (PFs) and evaluate two key documents, Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework, against these criteria. We claim that these policies are net-positive and should be encouraged. At the same time, we identify shortcomings of current PFs, chiefly that they are underspecified and insufficiently conservative, and that they address structural risks poorly. Improvement in the state of the art of risk evaluation for frontier AI models is a prerequisite for a meaningfully binding PF. Most importantly, PFs, as unilateral commitments by private actors, cannot replace public policy.

Motivation for Preparedness Frameworks

As AI labs develop potentially dual-use foundation models (as defined by Executive Order No. 14110, the “AI EO”) with improvements in capability, compute, and efficiency, novel risks may emerge, some of them potentially catastrophic. Today’s foundation models can already cause harm and pose some risks, especially as they come into broader use. Advanced large language models at times display unpredictable behaviors.

To this point, these harms have not risen to the level of catastrophic risks, defined here broadly as “devastating consequences for vast numbers of people.” The capabilities of models at the current state of the art simply do not imply levels of catastrophic risk above current non-AI baselines.1 However, as these models continue to scale in training compute, some speculate that they may develop novel capabilities that could be misused. Which capabilities will emerge from further scaling remains difficult to predict. Some analysis indicates that as training compute for AI models has doubled approximately every six months since 2015, performance on capability benchmarks has steadily improved. And while bigger models may well perform better, it would not be surprising if smaller models emerged with stronger capabilities: despite years of research by machine learning theorists, just how the number of model parameters relates to model capabilities remains uncertain.

Nonetheless, as capabilities increase, risks may also increase, and new risks may appear. Executive Order 14110 detailed some novel risks of potentially dual-use foundation models, including chemical, biological, radiological, and nuclear (CBRN) risks and advanced cybersecurity risks. Other risks are more speculative, such as model autonomy, loss of control of AI systems, or negative impacts on users, including risks of persuasion.2 Without robust risk mitigations, it is plausible that increasingly powerful AI systems will eventually pose greater societal risks.

Other technologies that pose catastrophic risks, such as nuclear technologies, are heavily regulated in order to prevent those risks from resulting in serious harms. There is a growing movement to regulate development of potentially dual-use biotechnologies, particularly gain-of-function research on the most pathogenic microbes. Given the rapid pace of progress at the AI frontier, comprehensive government regulation has yet to catch up; private companies that develop these models are starting to take it upon themselves to prevent or mitigate the risks of advanced AI development.

Prevention of such novel and consequential risks requires developers to implement policies that address potential risks iteratively. That is where preparedness frameworks come in. A preparedness framework is used to assess risk levels across key categories and outline associated risk mitigations. As the introduction to OpenAI’s PF states, “The processes laid out in each version of the Preparedness Framework will help us rapidly improve our understanding of the science and empirical texture of catastrophic risk, and establish the processes needed to protect against unsafe development.” Without such processes and commitments, the tendency to prioritize speed over safety concerns might prevail. While the exact consequences of failing to mitigate these risks are uncertain, they could potentially be significant.

Preparedness frameworks are limited in scope to catastrophic risks. These policies aim to prevent the worst conceivable outcomes of the development of future advanced AI systems; they are not intended to cover risks from existing systems. We acknowledge that this is an important limitation of preparedness frameworks. Developers can and should address both today’s risks and future risks at the same time; preparedness frameworks attempt to address the latter, while other “trustworthy AI” policies attempt to address a broader swathe of risks. For instance, OpenAI’s “Preparedness” team sits alongside its “Safety Systems” team, which “focuses on mitigating misuse of current models and products like ChatGPT.”

A note about terminology: The term “Responsible Scaling Policy” (RSP) is the term that took hold first, but it presupposes scaling of compute and capabilities by default. “Preparedness Framework” (PF) is a term coined by OpenAI, and it communicates the idea that the company needs to be prepared as its models approach the level of artificial general intelligence. Of the two options, “Preparedness Framework” communicates the essential idea more clearly: developers of potentially dual-use foundation models must be prepared for and mitigate potential catastrophic risks from development of these models.

The Industry Landscape

In September of 2023, ARC Evals (now METR, “Model Evaluation & Threat Research”) published a blog post titled “Responsible Scaling Policies (RSPs).” This post outlined the motivation and basic structure of an RSP, and revealed that ARC Evals had helped Anthropic write its RSP (version 1.0) which had been released publicly a few days prior. (ARC Evals had also run pre-deployment evaluations on Anthropic’s Claude model and OpenAI’s GPT-4.) And in December 2023, OpenAI published its Preparedness Framework in beta; while using new terminology, this document is structurally similar to ARC Evals’ outline of the structure of an RSP. Both OpenAI and Anthropic have indicated that they plan to update their PFs with new information as the frontier of AI development advances.

Not every AI company should develop or maintain a preparedness framework. Since these policies relate to catastrophic risk from models with advanced capabilities, only those developers whose models could plausibly attain those capabilities should use PFs. Because these advanced capabilities are associated with high levels of training compute, a good interim threshold for who should develop a PF could be the same as the AI EO’s threshold for potentially dual-use foundation models; that is, developers of models trained on over 10^26 floating-point operations (or an October 2023-equivalent level of compute, adjusted for compute efficiency gains).3 Currently, only a handful of developers have models that even approach this threshold. This threshold should be subject to change, like that of the AI EO, as developers continue to push the frontier (e.g., by developing more efficient algorithms or realizing other compute efficiency gains).
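As a toy sketch of such a threshold rule (our own illustration; the function name and the treatment of efficiency gains are assumptions, not language from the AI EO):

```python
# Training-compute threshold from the AI EO for potentially dual-use
# foundation models: 10^26 operations.
EO_THRESHOLD_FLOP = 1e26

def needs_preparedness_framework(training_flop: float,
                                 efficiency_gain: float = 1.0) -> bool:
    """Return True if effective training compute (raw operations scaled
    by algorithmic efficiency gains since the threshold was set) meets
    the reporting threshold."""
    return training_flop * efficiency_gain >= EO_THRESHOLD_FLOP

print(needs_preparedness_framework(5e25))                       # below the raw threshold
print(needs_preparedness_framework(5e25, efficiency_gain=4.0))  # 2e26 effective, above
```

The `efficiency_gain` knob is the point of the "October 2023-equivalent" caveat: as algorithms improve, a smaller raw training run can match the capabilities the threshold was meant to capture.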

While several other companies published “Responsible Capability Scaling” documents ahead of the UK AI Safety Summit, including DeepMind, Meta, Microsoft, Amazon, and Inflection AI, the rest of this report focuses primarily on OpenAI’s PF and Anthropic’s RSP. 

Weaknesses of Preparedness Frameworks

Preparedness frameworks are not panaceas for AI-associated risks. Even with improvements in specificity, transparency, and strengthened risk mitigations, the use of PFs has important weaknesses. Here we outline two such weaknesses and possible responses to them.

1. Spirit vs. text: PFs are voluntary commitments whose success depends on developers’ faithfulness to their principles.

Current risk thresholds and mitigations are defined loosely. In Anthropic’s RSP, for instance, the jump from the current risk level posed by Claude 2 (its state of the art model) to the next risk level is defined in part by the following: “Access to the model would substantially increase the risk of catastrophic misuse, either by proliferating capabilities, lowering costs, or enabling new methods of attack….” A “substantial increase” is not well-defined. This ambiguity leaves room for interpretation; since implementing risk mitigations can be costly, developers could have an incentive to take advantage of such ambiguity if they do not follow the spirit of the policy.

This concern about the gap between following the spirit of the PF and following the text might be somewhat eased with more specificity about risk thresholds and associated mitigations, and especially with more transparency and public accountability to these commitments.

To their credit, OpenAI’s PF and Anthropic’s RSP show a serious approach to the risks of developing increasingly advanced AI systems. OpenAI’s PF includes a commitment to fine-tune its models to better elicit capabilities along particular risk categories, then evaluate “against these enhanced models to ensure we are testing against the ‘worst case’ scenario we know of.” They also commit to triggering risk mitigations “when any of the tracked risk categories increase in severity, rather than only when they all increase together.” And Anthropic “commit[s] to pause the scaling and/or delay the deployment of new models whenever our scaling ability outstrips our ability to comply with the safety procedures for the corresponding ASL [AI Safety Level].” These commitments are costly signals that these developers are serious about their PFs.

2. Private commitment vs. public policy: PFs are unilateral commitments that individual developers take on; we might prefer more universal policy (or regulatory) approaches.

Private companies developing AI systems may not fully account for broader societal risks. Consider an analogy to climate change—no single company’s emissions are solely responsible for risks like sea level rise or extreme weather. The risk comes from the aggregate emissions of all companies. Similarly, AI developers may not consider how their systems interact with others across society, potentially creating structural risks. Like climate change, the societal risks from AI will likely come from the cumulative impact of many different systems. Unilateral commitments are poor tools to address such risks.

Furthermore, PFs might reduce the urgency for government intervention. By appearing safety-conscious, developers could diminish the perceived need for regulatory measures. Policymakers might over-rely on self-regulation by AI developers, potentially compromising public interest for private gains.

Policy can and should step into the gap left by PFs. Policy is more aligned to the public good, and as such is less subject to competing incentives. And policy can be enforced, unlike voluntary commitments. In general, preparedness frameworks and similar policies help hold private actors accountable to their public commitments; this effect is stronger with more specificity in defining risk thresholds, better evaluation methods, and more transparency in reporting. However, these policies cannot and should not replace government action to reduce catastrophic risks (especially structural risks) of frontier AI systems.

Suggested Criteria for Robust Preparedness Frameworks

These criteria are adapted from the ARC Evals post, Anthropic’s RSP, and OpenAI’s PF. Broadly, they are aspirational; no existing preparedness framework meets all or most of these criteria.

For each criterion, we explain the key considerations for developers adopting PFs. We analyze OpenAI’s PF and Anthropic’s RSP to illustrate the strengths and shortcomings of their approaches. Again, these policies are net-positive and should be encouraged. They demonstrate costly unilateral commitments to measuring and addressing catastrophic risk from their models; they meaningfully improve on the status quo. However, these initial PFs are underspecified and insufficiently conservative. Improvement in the state of the art of risk evaluation and mitigation, and subsequent updates, would make them more robust.

Table 1: Summary of suggested criteria for robust preparedness frameworks.
Breadth: Preparedness frameworks should cover the breadth of potential catastrophic risks of developing frontier AI models. (“What risks are covered?”)
Risk appetite: Preparedness frameworks should define the developer’s acceptable risk level (“risk appetite”) in terms of likelihood and severity of risk. (“What is an acceptable level of risk?”)
Clarity: Preparedness frameworks should clearly define capability levels and risk thresholds. (“How will developers know they have hit capability levels associated with particular risks?”)
Evaluation: Preparedness frameworks should include detailed evaluation procedures for AI models, ensuring comprehensive risk assessment. (“What tests will developers run on their models?”)
Mitigation: For different risk thresholds, preparedness frameworks should identify and commit to pre-specified risk mitigations. (“What will developers do when their models reach particular levels of risk?”)
Robustness: Preparedness frameworks’ pre-specified risk mitigations must effectively address potentially catastrophic risks. (“How do developers know their risk mitigations will work?”)
Accountability: Preparedness frameworks should combine credible risk mitigation commitments with governance structures that ensure these commitments are fulfilled. (“How can developers hold themselves accountable to their commitment to safety?”)
Amendments: Preparedness frameworks should include a mechanism for regular updates to the framework itself, in light of ongoing research and advances in AI. (“How will developers change their PFs over time?”)
Transparency: For models with risk above the lowest level, both pre- and post-mitigation evaluation results and methods should be public, including any performed mitigations. (“How will developers communicate about their models’ capabilities and risks?”)

1. Preparedness frameworks should cover the breadth of potential catastrophic risks of developing frontier AI models. 

These risks may include:

Preparedness frameworks should apply to catastrophic risks in particular because they govern the scaling of capabilities of the most advanced AI models, and because catastrophic risks are of the highest consequence to such development. PFs are one tool among many that developers of the most advanced AI models should use to prevent harm. Developers of advanced AI models tend to also have other “trustworthy AI” policies, which seek to prevent and address already-existing risks such as harmful outputs, disinformation, and synthetic sexual content. Despite PFs’ focus on potentially catastrophic risks, faithfully applying PFs may help developers catch many other kinds of risks as well, since they involve extensive evaluation for misuse potential and adverse human impacts.

2. Preparedness frameworks should define the developer’s acceptable risk level (“risk appetite”) in terms of likelihood and severity of risk, in accordance with the NIST AI Risk Management Framework, section Map 1.5.

Neither OpenAI nor Anthropic has publicly declared its risk appetite. This is a nascent field of research, as these risks are novel and perhaps less predictable than, e.g., nuclear accident risk.5 NIST and other standard-setting bodies will be crucial in developing AI risk metrology. For now, PFs should state developers’ risk appetites as clearly as possible and update them regularly as research advances.6

AI developers’ risk appetites might be different than a regulatory risk appetite. Developers should elucidate their risk appetite in quantitative terms so their PFs can be evaluated accordingly. As in the case of nuclear technology, regulators may eventually impose risk thresholds on frontier AI developers. At this point, however, there is no standard, scientifically-grounded approach to measuring the potential for catastrophic AI risk; this has to start with the developers of the most capable AI models.

3. Preparedness frameworks should clearly define capability levels and risk thresholds. Risk thresholds should be quantified robustly enough to hold developers accountable to their commitments.

OpenAI and Anthropic both outline qualitative risk thresholds corresponding with different categories of risk. For instance, in OpenAI’s PF, the High risk threshold in the CBRN category reads: “Model enables an expert to develop a novel threat vector OR model provides meaningfully improved assistance that enables anyone with basic training in a relevant field (e.g., introductory undergraduate biology course) to be able to create a CBRN threat.” And Anthropic’s RSP defines the ASL-3 [AI Safety Level] threshold as: “Low-level autonomous capabilities, or access to the model would substantially increase the risk of catastrophic misuse, either by proliferating capabilities, lowering costs, or enabling new methods of attack, as compared to a non-LLM baseline of risk.”

These qualitative thresholds are under-specified; reasonable people are likely to differ on what “meaningfully improved assistance” looks like, or a “substantial increase [in] the risk of catastrophic misuse.” In PFs, these thresholds should be quantified to the extent possible.

To be sure, the AI development research community currently lacks a good empirical understanding of how to quantify frontier AI-related risks or their likelihood. Again, this is a novel science that needs to be developed with input from both the private and public sectors. While this science is still developing, it is natural to want to avoid too much quantification: a conceivable failure mode is that developers “check the boxes” of quantified thresholds that quickly become obsolete, in lieu of using their judgment to determine when capabilities are dangerous enough to warrant stronger risk mitigations. Still, as research improves, we should expect to see improvements in PFs’ specification of risk thresholds.

4. Preparedness frameworks should include detailed evaluation procedures for AI models, ensuring comprehensive risk assessment within a developer’s tolerance. 

Anthropic and OpenAI both have room for improvement on detailing their evaluation procedures. Anthropic’s RSP includes evaluation procedures for model autonomy and misuse risks. Its evaluation procedures for model autonomy are impressively detailed, including clearly defined tasks on which it will evaluate its models. Its evaluation procedures for misuse risk are much less well-defined, though it does include the following note: “We stress that this will be hard and require iteration. There are fundamental uncertainties and disagreements about every layer…It will take time, consultation with experts, and continual updating.” And OpenAI’s PF includes a “Model Scorecard,” a mock evaluation of an advanced AI model. This model scorecard includes the hypothetical results of various evaluations in all four of their tracked risk categories; it does not appear to be a comprehensive list of evaluation procedures.

Again, the science of AI model evaluation is young. The AI EO directs NIST to develop red-teaming guidance for developers of potentially dual-use foundation models. NIST, along with private actors such as METR and other AI evaluators, will play a crucial role in creating and testing red-teaming practices and model evaluations that elicit all relevant capabilities.

5. For different risk thresholds, preparedness frameworks should identify and commit to pre-specified risk mitigations.

Classes of risk mitigations may include:

Both OpenAI’s PF and Anthropic’s RSP commit to a number of pre-specified risk mitigations for different thresholds. For example, for what Anthropic calls “ASL-2” models (including its most advanced model, Claude 2), they commit to measures including publishing model cards, providing a vulnerability reporting mechanism, enforcing an acceptable use policy, and more. Models at higher risk thresholds (what Anthropic calls “ASL-3” and above) have different, more stringent risk mitigations, including “limit[ing] access to training techniques and model hyperparameters…” and “implement[ing] measures designed to harden our security…”

Risk mitigations can and should differ in approaches to development versus deployment. There are different levels of risk associated with possessing models internally and allowing external actors to interact with them. Both OpenAI’s PF and Anthropic’s RSP include different risk mitigation approaches for development and deployment. For example, OpenAI’s PF restricts deployment of models such that “Only models with a post-mitigation score of “medium” or below can be deployed,” whereas it restricts development of models such that “Only models with a post-mitigation score of “high” or below can be developed further.”

Mitigations should be defined as specifically as possible, with the understanding that as the state of the art changes, this too is an area that will require periodic updates. Developers should include some room for judgment here.

6. Preparedness frameworks’ pre-specified risk mitigations must effectively address potentially catastrophic risks.

Having confidence that the risk mitigations do in fact address potential catastrophic risks is perhaps the most important and difficult aspect of a PF to evaluate. Catastrophic risk from AI is a novel and speculative field; evaluating AI capabilities is a science in its infancy; and there are no empirical studies of the effectiveness of risk mitigations preventing such risks. Given this uncertainty, frontier AI developers should err on the side of caution.

Both OpenAI and Anthropic should be more conservative in their risk mitigations. Consider OpenAI’s commitment to restricting development: “[I]f we reach (or are forecasted to reach) ‘critical’ pre-mitigation risk along any risk category, we commit to ensuring there are sufficient mitigations in place…for the overall post-mitigation risk to be back at most to ‘high’ level.” To understand this commitment, we have to look at their threshold definitions. Under the Model Autonomy category, the “critical” threshold in part includes: “model can self-exfiltrate under current prevailing security.” Setting aside that this threshold is still quite vague and difficult to evaluate (and setting aside the novelty of this capability), a model that approaches or exceeds this threshold by definition can self-exfiltrate, rendering all other risk mitigations ineffective. A more robust approach to restricting development would not permit training or possessing a model that comes close to exceeding this threshold.

As for Anthropic, consider its threshold for “ASL-3,” which reads in part: “Access to the model would substantially increase the risk of catastrophic misuse…” The risk mitigations for ASL-3 models include the following: “Harden security such that non-state attackers are unlikely to be able to steal model weights and advanced threat actors (e.g. states) cannot steal them without significant expense.” This is an admirable approach to developing potentially dual-use foundation models, but if state actors are assumed to seek out tools whose misuse poses catastrophic risk, a more conservative mitigation would harden security such that it is unlikely that any actor, state or non-state, could steal the weights of such a model.9

7. Preparedness frameworks should combine credible risk mitigation commitments with governance structures that ensure these commitments are fulfilled.

Preparedness frameworks should detail governance structures that create incentives to actually undertake pre-committed risk mitigations when thresholds are met. Other incentives, including profit and shareholder value, can at times conflict with risk management.

Anthropic’s RSP includes a number of procedural commitments meant to enhance the credibility of its risk mitigation commitments. For example, Anthropic commits to proactively planning to pause scaling of its models,10 publicly sharing evaluation results, and appointing a “Responsible Scaling Officer.” However, Anthropic’s RSP also includes the following clause: “[I]n a situation of extreme emergency, such as when a clearly bad actor (such as a rogue state) is scaling in so reckless a manner that it is likely to lead to imminent global catastrophe if not stopped…we could envisage a substantial loosening of these restrictions as an emergency response…” This clause potentially undermines the credibility of Anthropic’s other commitments in the RSP, since at any time Anthropic could point to another actor that, in its view, is scaling recklessly.

OpenAI’s PF also outlines commendable governance measures, including procedural commitments, to enhance the credibility of its risk mitigations. It summarizes its operational structure: “(1) [T]here is a dedicated team ‘on the ground’ focused on preparedness research and monitoring (Preparedness team), (2) there is an advisory group (Safety Advisory Group) that has a sufficient diversity of perspectives and technical expertise to provide nuanced input and recommendations, and (3) there is a final decision-maker (OpenAI Leadership, with the option for the OpenAI Board of Directors to overrule).”

8. Preparedness frameworks should include a mechanism for regular updates to the framework itself, in light of ongoing research and advances in AI.

Both OpenAI’s PF and Anthropic’s RSP acknowledge the importance of regular updates. This is reflected in both of these documents’ names: Anthropic labels its RSP as “Version 1.0,” while OpenAI’s PF is labeled as “(Beta).”

Anthropic’s RSP includes an “Update Process” that reads in part: “We expect most updates to this process to be incremental…as we learn more about model safety features or unexpected capabilities…” This language directly commits Anthropic to changing its RSP as the state of the art changes. OpenAI references updates throughout its PF, notably committing to updating its evaluation methods and rubrics (“The Scorecard will be regularly updated by the Preparedness team to help ensure it reflects the latest research and findings”).

9. For models with risk above the lowest level, most evaluation results and methods should be public, including any mitigations applied.

Publishing model evaluations and mitigations is an important tool for holding developers accountable to their PF commitments. However, the level of transparency must be calibrated carefully: full information about evaluation methodology and risk mitigations could be exploited by malicious actors. Anthropic’s RSP takes a balanced approach, committing to “[p]ublicly share evaluation results after model deployment where possible, in some cases in the initial model card, in other cases with a delay if it serves a broad safety interest.” OpenAI’s PF does not commit to publishing its Model Scorecards, but OpenAI has since published related research on whether its models aid the creation of biological threats.

Conclusion

Preparedness frameworks represent a promising approach for AI developers to voluntarily commit to robust risk management practices. However, current versions have weaknesses—particularly their lack of specificity in risk thresholds, insufficiently conservative risk mitigation approaches, and inadequacy in addressing structural risks. Frontier AI developers without PFs should consider adopting them, and OpenAI and Anthropic should update their policies to strengthen risk mitigations and include more specificity.

Strengthening preparedness frameworks will require advancing AI safety science to enable precise risk quantification and to develop new mitigations. NIST, academics, and companies plan to collaborate to measure and model frontier AI risks. Policymakers have a crucial opportunity to adapt regulatory approaches from other high-risk technologies, such as nuclear power, to balance AI innovation against catastrophic risk prevention. Standards bodies could also develop more robust best practices for AI evaluations, including guidance for third-party auditors.

Overall, the AI community must treat safety as an intrinsic priority; private actors creating preparedness frameworks is not, by itself, enough. All stakeholders, including private companies, academics, policymakers, and civil society organizations, have roles to play in steering AI development toward societally beneficial outcomes. Preparedness frameworks are one tool, but they are not sufficient absent more comprehensive, multi-stakeholder efforts to scale AI safely and for the public good.

Many thanks to Madeleine Chang, Di Cooke, Thomas Woodside, and Felipe Calero Forero for providing helpful feedback.

Working with academics: A primer for U.S. government agencies

Collaboration between federal agencies and academic researchers is an important tool for public policy. By facilitating the exchange of knowledge, ideas, and talent, these partnerships can help address pressing societal challenges. But because it is rarely in either party’s job description to conduct outreach and build relationships with the other, many important dynamics are often hidden from view. This primer provides an initial set of questions and topics for agencies to consider when exploring academic partnership.

Why should agencies consider working with academics?

What considerations may arise when working with academics?

Table 1. Characteristics of discussed collaborative structures
| Structure | Primary need | Potential mechanisms | Structural complexity | Level of effort |
|---|---|---|---|---|
| Informal advising | Knowledge >> Capacity | Ad-hoc engagement; formal consulting agreement | Low | Occasional work, over the short- to long-term |
| Study groups | Knowledge > Capacity | Informal working group; formal extramural award | Moderate | Occasional to part-time work, over the short- to medium-term |
| Collaborative research | Capacity ~= Knowledge | Informal research partnership, formal grant, or cooperative agreement / contract | Variable | Part-time work, over the medium- to long-term |
| Short-term placements | Capacity > Knowledge | IPA, OPM Schedule A(r), or expert contract; either ad-hoc or through a formal program | Moderate | Part- to full-time work, over the short- to medium-term |
| Long-term rotations | Capacity >> Knowledge | IPA, OPM Schedule A(r), or SGE designation; typically through a formal program | High | Full-time work, over the medium- to long-term |
BOX 1. Key academic considerations
Academic career stages.

Academic faculty progress through different stages of professorship — typically assistant, associate, and full — that shape their research and teaching expectations and opportunities. Assistant professors are tenure-track faculty who must secure funding, publish papers, and meet the standards for tenure. Associate professors have greater job security and academic freedom, but also more mentoring and leadership responsibilities; they are typically tenured, though not always. Full professors are senior faculty with established reputations in their field, but also greater demands for service and supervision. The nature of agency-academic collaboration may depend on the academic’s seniority. For example, junior faculty may be more available to work with agencies, but primarily in contexts that lead to traditional academic outputs, while senior faculty may be more selective, but with the academic freedom to take on less formal, more impact-oriented work.

Soft vs. hard money positions.

Soft money positions are those that depend largely or entirely on external funding sources, typically research grants, to support the salary and expenses of the faculty. Hard money positions are those that are supported by the academic institution’s central funds, typically tied to more explicit (and more expansive) expectations for teaching and service than soft-money positions. Faculty in soft money positions may face more pressure to secure funding for research, while faculty in hard money positions may have more autonomy in their research agenda but more competing academic activities. Federal agencies should be aware of the funding situation of the academic faculty they collaborate with, as it may affect their incentives and expectations for agency engagement.

Sabbatical credits.

A sabbatical is a period of leave from regular academic duties, usually for one or two semesters, that allows faculty to pursue an intensive and unstructured scope of work — this can include research in their own field or others, as well as external engagements or tours of service with non-academic institutions. Faculty accrue sabbatical credits based on their length and type of service at the university and may apply for a sabbatical once they have enough credits. The amount of salary received during a sabbatical depends on the number of credits and the duration of the leave. Federal agencies may benefit from collaborating with academic faculty who are on sabbatical, as they may have more time and interest to devote to impact-focused work.

Consulting/outside activity limits.

Consulting limits & outside activity limits are policies that regulate the amount of time that academic faculty can spend on professional activities outside their university employment. These policies are intended to prevent conflicts of commitment or interest that may interfere with the faculty’s primary obligations to the university, such as teaching, research, and service, and the specific limits vary by university. Federal agencies may need to consider these limits when engaging academic faculty in ongoing or high-commitment collaborations.

9 vs. 12 month salaries.

Some academic faculty are paid on a 9-month basis, meaning that they receive their annual salary over nine months and have the option to supplement their income with external funding or other activities during the summer months. Other faculty are paid on a 12-month basis, meaning that they receive their annual salary over twelve months and have less flexibility to pursue outside opportunities. Federal agencies may need to consider the salary structure of the academic faculty they work with, as it may affect their availability to engage on projects and the optimal timing with which they can do so.

Advisory relationships consist of an academic providing occasional or periodic guidance to a federal agency on a specific topic or issue, without being formally contracted or compensated. This type of collaboration can be useful for agencies that need access to cutting-edge expertise or perspectives, but do not have a formal deliverable in mind.

Academic considerations

Regulatory & structural considerations

Box 2. Key structural considerations
Regulatory guidance.

Federal agencies and academic institutions are subject to various laws and regulations that affect their research collaboration, and the ownership and use of the research outputs. Key legislation includes the Federal Advisory Committee Act (FACA), which governs advisory committees and ensures transparency and accountability; the Federal Acquisition Regulation (FAR), which controls the acquisition of supplies and services with appropriated funds; and the Federal Grant and Cooperative Agreement Act (FGCAA), which provides criteria for distinguishing between grants, cooperative agreements, and contracts. Agencies should ensure that collaborations are structured in accordance with these and other laws.

Contracting mechanisms.

Federal agencies may use various contracting mechanisms to engage researchers from non-federal entities in collaborative roles. These mechanisms include the IPA Mobility Program, which allows the temporary assignment of personnel between federal and non-federal organizations; the Experts & Consultants authority, which allows the appointment of qualified experts and consultants to positions that require only intermittent and/or temporary employment; and Cooperative Research and Development Agreements (CRADAs), which allow agencies to enter into collaborative agreements with non-federal partners to conduct research and development projects of mutual interest.

University Office of Sponsored Programs.

Offices of Sponsored Programs are units within universities that provide administrative support and oversight for externally funded research projects. OSPs are responsible for reviewing and approving proposals, negotiating and accepting awards, ensuring compliance with sponsor and university policies and regulations, and managing post-award activities such as reporting, invoicing, and auditing. Federal agencies typically interact with OSPs as the authorized representative of the university in matters related to sponsored research.

Non-disclosure agreements.

When engaging with academics, federal agencies may use NDAs to safeguard sensitive information. Agencies each have their own rules and procedures for using and enforcing NDAs involving their grantees and contractors. These rules and procedures vary, but generally require researchers to sign an NDA outlining rights and obligations relating to classified information, data, and research findings shared during collaborations.

A study group is a type of collaboration where an academic participates in a group of experts convened by a federal agency to conduct analysis or education on a specific topic or issue. The study group may produce a report or hold meetings to present their findings to the agency or other stakeholders. This type of collaboration can be useful for agencies that need to gather evidence or insights from multiple sources and disciplines with expertise relevant to their work.

Academic considerations

Regulatory & structural considerations

Case study

In 2022, the National Science Foundation (NSF) awarded the National Bureau of Economic Research (NBER) a grant to create the EAGER: Place-Based Innovation Policy Study Group. This group, led by two economists with expertise in entrepreneurship, innovation, and regional development — Jorge Guzman from Columbia University and Scott Stern from MIT — aimed to provide “timely insight for the NSF Regional Innovation Engines program.” During Fall 2022, the group met regularly with NSF staff to i) assess the “state of knowledge” on place-based innovation ecosystems, ii) identify insights from this research to inform NSF staff’s policy design, and iii) surface potential means of measuring and evaluating place-based innovation ecosystems on a rigorous and ongoing basis. Several of the academic leads then completed a paper synthesizing the opportunities and design considerations of the regional innovation engine model, based on the collaborative exploration and insights developed throughout the year. In this case, the study group was structured as a grant, with funding provided to the organizing institution (NBER) for personnel and convening costs. Other approaches are possible, however; for example, NSF recently launched a broader study group with the Institute for Progress, structured as a no-cost Other Transaction Authority contract.

Active collaboration covers scenarios in which an academic engages in joint research with a federal agency, either as a co-investigator, a subrecipient, a contractor, or a consultant. This type of collaboration can be useful for agencies that need to leverage the expertise, facilities, data, or networks of academics to conduct research that advances their mission, goals, or priorities.

Academic considerations

Regulatory & structural considerations

Case studies

External collaboration between academic researchers and government agencies has repeatedly proven fruitful for both parties. For example, in May 2020, the Rhode Island Department of Health partnered with researchers at Brown University’s Policy Lab to conduct a randomized controlled trial evaluating the effectiveness of different letter designs in encouraging COVID-19 testing. This study identified design principles that improved uptake of testing by 25–60% without increasing cost, and led to follow-on collaborations between the institutions. The North Carolina Office of Strategic Partnerships provides a prime example of how government agencies can take steps to facilitate these collaborations. The office recently launched the North Carolina Project Portal, which serves as a platform for the agency to share their research needs, and for external partners — including academics — to express interest in collaborating. Researchers are encouraged to contact the relevant project leads, who then assess interested parties on their expertise and capacity, extend an offer for a formal research partnership, and initiate the project.

Short-term placements allow for an academic researcher to work at a federal agency for a limited period of time (typically one year or less), either as a fellow, a scholar, a detailee, or a special government employee. This type of collaboration can be useful for agencies that need to fill temporary gaps in expertise, capacity, or leadership, or to foster cross-sector exchange and learning.

Academic considerations

Regulatory & structural considerations

Case studies

Various programs exist throughout government to facilitate short-term rotations of outside experts into federal agencies and offices. One of the most well-known examples is the American Association for the Advancement of Science (AAAS) Science & Technology Policy Fellowship (STPF) program, which places scientists and engineers from various disciplines and career stages in federal agencies for one year to apply their scientific knowledge and skills to inform policy making and implementation. The Schedule A(r) hiring authority tends to be well-suited for these kinds of fellowships; it is used, for example, by the Bureau of Economic Analysis to bring on early career fellows through the American Economic Association’s Summer Economics Fellows Program. In some circumstances, outside experts are brought into government “on loan” from their home institution to do a tour of service in a federal office or agency; in these cases, the IPA program can be a useful mechanism. IPAs are used by the National Science Foundation (NSF) in its Rotator Program, which brings outside scientists into the agency to serve as temporary Program Directors and bring cutting-edge knowledge to the agency’s grantmaking and priority-setting. IPA is also used for more ad-hoc talent needs; for example, the Office of Evaluation Sciences (OES) at GSA often uses it to bring in fellows and academic affiliates.

Long-term rotations allow an academic to work at a federal agency for an extended period of time (more than one year), either as a fellow, a scholar, a detailee, or a special government employee. This type of collaboration can be useful for agencies that need to recruit and retain expertise, capacity, or leadership in areas that are critical to their mission, goals, or priorities.

Academic considerations

Regulatory & structural considerations

Case study

One example of a long-term rotation that draws experts from academia into federal agency work is the Advanced Research Projects Agency (ARPA) Program Manager (PM) role. ARPA PMs — across DARPA, IARPA, ARPA-E, and now ARPA-H — are responsible for leading high-risk, high-reward research programs, and have considerable autonomy and authority in defining their research vision, selecting research performers, managing their research budget, and overseeing their research outcomes. PMs are typically recruited from academia, industry, or government for a term of three to five years, and are expected to return to their academic institutions or pursue other career opportunities after their term at the agency. PMs coming from academia or nonprofit organizations are often brought on through the IPA mobility program, and some entities also have unique term-limited hiring authorities for this purpose. PMs can also be hired as full government employees; this mechanism is primarily used for candidates coming from the private sector.

Laying the Foundation for the Low-Carbon Cement and Concrete Industry

This report is part of a series on underinvested clean energy technologies, the challenges they face, and how the Department of Energy can use its Other Transaction Authority to implement programs custom tailored to those challenges.

Cement and concrete production is one of the hardest industries to decarbonize. Solutions for low-emissions cement and concrete are much less mature than those for other green technologies like solar and wind energy and electric vehicles. Nevertheless, over the past few years, young companies have achieved significant milestones in piloting their technologies and certifying their performance and emissions reductions. In order to finance new manufacturing facilities and scale promising solutions, companies will need to demonstrate consistent demand for their products at a financially sustainable price. Demand support from the Department of Energy (DOE) can help companies meet this requirement and unlock private financing for commercial-scale projects. Using its Other Transaction Authority, DOE could design a demand-support program involving double-sided auctions, contracts for difference, or price and volume guarantees. To fund such a program with existing funds, the DOE could incorporate it into the Industrial Demonstrations Program. However, additional funding from Congress would allow the DOE to implement a more robust program. Through such an initiative, the government would accelerate the adoption of low-emissions cement and concrete, providing emissions reductions benefits across the country while setting the United States up for success in the future clean industrial economy.

Besides water, concrete is the most consumed material in the world. It is the material of choice for construction thanks to its durability, versatility, and affordability. As of 2022, the cement and concrete sector accounted for nine percent of global carbon emissions. The vast majority of the embodied emissions of concrete come from the production of Portland cement. Cement production emits carbon through the burning of fossil fuels to heat kilns (40% of emissions) and the chemical process of turning limestone and clay into cement using that heat (60% of emissions). Electrifying production facilities and making them more energy efficient can help decarbonize the former but not the latter, which requires deeper innovation.
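
A back-of-the-envelope sketch makes the limits of electrification concrete, using the emission shares quoted above (40% fuel combustion, 60% process chemistry). The 90% figure below is a hypothetical efficiency assumption, not a sourced number:

```python
# Why electrification alone cannot fully decarbonize cement: only the
# fuel-combustion share of emissions is addressable that way.

FUEL_SHARE = 0.40      # emissions from burning fossil fuels to heat kilns
PROCESS_SHARE = 0.60   # emissions from calcining limestone and clay

def max_reduction_from_electrification(fuel_cut: float) -> float:
    """Fraction of total cement emissions removed if electrification
    eliminates `fuel_cut` (0.0-1.0) of fuel-related emissions."""
    return FUEL_SHARE * fuel_cut

# Even eliminating 90% of fuel emissions leaves most emissions intact:
reduction = max_reduction_from_electrification(0.90)
print(f"Total reduction: {reduction:.0%}")  # Total reduction: 36%
print(f"Remaining: {1 - reduction:.0%}")    # Remaining: 64%
```

Even in the limiting case of full kiln electrification, at most 40% of emissions are eliminated; the process-chemistry share is why deeper innovation is required.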

Current solutions on the market substitute a portion of the cement used in concrete mixtures with Supplementary Cementitious Materials (SCMs) like fly ash, slag, or unprocessed limestone, reducing the embodied emissions of the resulting concrete. But these SCMs cannot replace all of the cement in concrete, and currently there is an insufficient supply of readily usable fly ash and slag for wider adoption across the industry.

The next generation of ultra-low-carbon, carbon-neutral, and even carbon-negative solutions seeks to develop alternative feedstocks and processes for producing cement or cementitious materials that can replace cement entirely and to capture carbon in aggregates and wet concrete. The DOE reports that testing and scaling these new technologies is crucial to fully eliminate emissions from concrete by 2050. Bringing these new technologies to the market will not only help the United States meet its climate goals but also promote U.S. leadership in manufacturing. 

A number of companies have established pilot facilities or are in the process of constructing them. These companies have successfully produced near-carbon-neutral and even carbon-negative concrete. Building off of these milestones, companies will need to secure financing to build full-scale commercial facilities and increase their manufacturing capacity. 

A key requirement for accessing both private-sector and government financing for new facilities is that companies obtain long-term offtake agreements, which assure financiers that there will be a steady source of revenue once the facility is built. But the boom-and-bust nature of the construction industry discourages construction companies and intermediaries from entering into long-term financial commitments in case there won’t be a project to use the materials for. Cement, aggregates, and other concrete inputs also take up significant volume, so it would be difficult and costly for potential offtakers to store excess amounts during construction lulls. For these reasons, construction contractors procure concrete on an as-needed, project-specific basis. 

Adding to the complexity, structural features of the cement and concrete market further increase the difficulty of securing long-term offtake agreements.

Luckily, private construction is not the only customer for concrete. The U.S. government (federal, state, and local combined) accounts for roughly 50% of all concrete procurement in the country. Used correctly, the government’s purchasing power can be a powerful lever for spurring the adoption of decarbonized cement and concrete. However, the government faces barriers to long-term offtake agreements similar to those in the private sector. Government procurement of concrete goes through multiple intermediaries and operates on an as-needed, project-specific basis: government agencies like the General Services Administration (GSA) enter into agreements with construction contractors for specific projects, and the contractors or their subcontractors then make the ultimate purchasing decisions for concrete.

The Federal Buy Clean Initiative, enacted in 2021 by the Biden Administration, is starting to address the procurement challenge for low-carbon cement and concrete. Among the initiative’s programs is the allocation of $4.5 billion from the Inflation Reduction Act (IRA) for the GSA and the Department of Transportation (DOT) to use lower-carbon construction materials. Under the initiative, the GSA is piloting directly procuring low-embodied-carbon materials for federal construction projects. To qualify as low-embodied-carbon concrete under the GSA’s interim requirements, concrete mixtures only have to achieve a roughly 25–50% reduction in carbon content,1 depending on the compressive strength. The requirement may be even less stringent if no concrete meeting this standard is available near the project site. Since this bar sits only slightly below the emissions of traditional concrete, young companies developing solutions to fully decarbonize concrete will struggle to compete on price against producers of better-established but higher-emission solutions like fly ash, slag, and limestone concrete mixtures. Moreover, the just-in-time, project-specific nature of these procurement contracts means they still do not address young companies’ need for long-term price and customer security in order to scale up.

The ideal solution for this is a demand-support program. The DOE Office of Clean Energy Demonstrations (OCED) is developing a demand-support program for the Hydrogen Hubs initiative, setting aside $1 billion for demand-support to accompany the $7 billion in direct funding to regional Hydrogen Hubs. In its request for proposals, OCED says that the hydrogen demand-support program will address the “fundamental mismatch in [the market] between producers, who need long-term certainty of high-volume demand in order to secure financing to build a project, and buyers, who often prefer to buy on a short-term basis at more modest volumes, especially for products that have yet to be produced at scale and [are] expected to see cost decreases.” 

A demand-support program could do the same for low-carbon cement and concrete, addressing the market challenges that grants alone cannot. OCED is reviewing applications for the $6.3 billion Industrial Demonstrations Program. Similar to the Hydrogen Hubs, OCED could consider setting aside $500 million to $1 billion of the program funds to implement demand-support programs for the two highest-emitting heavy industries, low-carbon cement/concrete and steel, at $250 million to $500 million each.

Additional funding from Congress would allow the DOE to implement a more robust demand-support program. Federal investment in industrial decarbonization grew from $1.5 billion in FY21 to over $10 billion in FY23, thanks largely to new funding from the Bipartisan Infrastructure Law (BIL) and the IRA. However, the sector remains underfunded relative to its emissions, contributing 23% of the country’s emissions while receiving less than 12% of federal climate innovation funding. A promising piece of recently introduced legislation is the Concrete and Asphalt Innovation Act of 2023, which would, among other things, direct the DOE to establish a program of research, development, demonstration, and commercial application for low-emissions cement, concrete, asphalt binder, and asphalt mixture. This would include a demonstration initiative authorized at $200 million and a five-year strategic plan identifying the new programs and resources needed to carry out the mission. If the legislation passes, the DOE could propose a demand-support program in its strategic plan and request funding from Congress to set it up, though the faster route would be for Congress to add a section to the Act directly establishing, and authorizing funding for, a demand-support program within the DOE.

BIL and IRA gave DOE an expanded mandate to support innovative technologies from early-stage research through commercialization. In order to do so, DOE must be just as innovative in its use of its available authorities and resources. Tackling the challenge of bringing technologies from pilot to commercialization requires DOE to look beyond traditional grant, loan, and procurement mechanisms. Previously, we have identified the DOE’s Other Transaction Authority (OTA) as an underleveraged tool for accelerating clean energy technologies. 

OTA is defined in legislation as the authority to enter into transactions that are not government grants or contracts in order to advance an agency’s mission. This negative definition provides DOE with significant freedom to design and implement flexible financial agreements that can be tailored to address the unique challenges that different technologies face. DOE plans to use OTA to implement the hydrogen demand-support program, and it could also be used for a demand-support program for low-carbon cement and concrete. The DOE’s new Guide to Other Transactions provides official guidance on how DOE personnel can use the flexibilities provided by OTA. 

Before setting up a demand-support program, DOE first needs to define what qualifies as a low-carbon cement or concrete product and the value it provides in emissions avoided. This is not straightforward, due to (1) the heterogeneity of solutions, which prevents apples-to-apples comparisons in price, and (2) variations in the amount of avoided emissions that different solutions can provide. To address the first issue, for products that are not ready-mix concrete, the DOE should calculate the cost of a unit of concrete made using the product, based on a standardized mix ratio of a specific compressive strength and market prices for the other components of the concrete mix. To address the second issue, the DOE should then divide the calculated price per unit of concrete (e.g., $/m3) by the amount of CO2 emissions avoided per unit of concrete compared to the NRMCA’s industry average (e.g., kg/m3) to determine the effective price per unit of CO2 emissions avoided. The DOE can then fairly compare bids from different projects using this metric. Such an approach would result in the government providing demand support for the products that are most cost-effective at reducing carbon emissions, rather than simply the cheapest.
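
The normalization described above can be sketched in a few lines. The function and all numbers below are hypothetical illustrations, not figures from the report: a bid is first expressed as the price of a cubic meter of concrete made with the product, then divided by the tonnes of CO2 avoided per cubic meter relative to an industry-average baseline.

```python
# Illustrative sketch of the proposed bid-comparison metric (hypothetical
# function name and numbers). Bids are normalized to $ per tonne of CO2
# avoided relative to an industry-average baseline mix.

def effective_price_per_tonne_co2(
    concrete_price_per_m3: float,     # $/m3 of concrete made with the bid product
    baseline_emissions_kg_m3: float,  # industry-average embodied CO2 (kg/m3)
    bid_emissions_kg_m3: float,       # embodied CO2 of concrete using the product
) -> float:
    """Return $ per tonne of CO2 avoided relative to the baseline mix."""
    avoided_kg = baseline_emissions_kg_m3 - bid_emissions_kg_m3
    if avoided_kg <= 0:
        raise ValueError("bid must avoid emissions relative to the baseline")
    return concrete_price_per_m3 / (avoided_kg / 1000.0)

# Hypothetical comparison: the pricier product wins because it avoids more CO2.
bid_a = effective_price_per_tonne_co2(150.0, 300.0, 200.0)  # $1,500/tCO2
bid_b = effective_price_per_tonne_co2(170.0, 300.0, 120.0)  # ~$944/tCO2
assert bid_b < bid_a
```

Under this metric, a more expensive product that avoids substantially more emissions can outrank a cheaper one, which is exactly the behavior the report argues for.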

Furthermore, the DOE should set an upper limit on embodied carbon that the concrete product, or concrete made with the product, must not exceed in order to qualify as “low carbon.” We suggest that the DOE use the limits established by the First Movers Coalition, an international corporate advance market commitment for concrete and other hard-to-abate industries organized by the World Economic Forum. The limits were developed through conversations with incumbent suppliers, start-ups, nonprofits, and intergovernmental organizations on what would be achievable by 2030, and they were designed to help move the needle towards commercializing solutions that enable full decarbonization.

Companies that participate in a DOE demand-support program should be required after one or two years of operations to confirm that their product meets these limits through an Environmental Product Declaration.2 Using carbon offsets to reach that limit should not be allowed, since the goal is to spur the innovation and scaling of technologies that can eventually fully decarbonize the cement and concrete industry.

Below are some ideas for how DOE can set up a demand-support program for low-carbon cement and concrete.

Double-Sided Auction 

Double-sided auctions are designed to support the development of production capacity for green technologies and products and the creation of a market by providing long-term price certainty to suppliers and facilitating the sale of their products to buyers. As the name suggests, a double-sided auction consists of two phases: First, the government or an intermediary organization holds a reverse auction for long-term purchase agreements (e.g., 10 years) for the product from suppliers, who are incentivized to bid the lowest possible price in order to win. Next, the government conducts annual auctions of short-term sales agreements to buyers of the product. Once sales agreements are finalized, the product is delivered directly from the supplier to the buyer, with the government acting as a transparent intermediary. The government thus serves as a market maker by coordinating the purchase and sale of the product from producers to buyers. Government funding covers the difference between the original purchase price and the final sale price, reducing the impact of the green premium for buyers and sellers. 
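
The cash flow of the mechanism described above is simple to state: the government buys at the long-term price fixed in the reverse auction, resells at whatever the annual sales auctions clear, and public funding covers the gap. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
# Sketch of the double-sided auction's public outlay (hypothetical numbers):
# the government buys at a fixed long-term price from suppliers and resells
# annually to buyers; the subsidy covers the difference, i.e. the green premium.

def annual_subsidy(purchase_price: float, sale_price: float, volume_t: float) -> float:
    """Public outlay for one year: (long-term buy price - resale price) x tonnes."""
    return max(purchase_price - sale_price, 0.0) * volume_t

# Hypothetical: a 10-year purchase agreement at $180/t; this year's sales
# auction clears at $130/t for 50,000 t, so the program covers $50/t.
outlay = annual_subsidy(180.0, 130.0, 50_000)
assert outlay == 2_500_000.0
```

Note that as buyers' willingness to pay rises toward the purchase price over the life of the agreement, the required subsidy shrinks toward zero.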

While the federal government has not yet implemented a double-sided auction program, the Office of Clean Energy Demonstrations (OCED) is considering setting up the hydrogen demand-support measure as a “market maker” that provides a “ready purchaser/seller for clean hydrogen.” Such a market maker program could be implemented most efficiently through double-sided auctions.

Germany was the first to conceive of and develop the double-sided auction scheme. The H2Global initiative was established in 2021 to support the development of production capacity for green hydrogen and its derivative products. The program is implemented by Hintco, an intermediary company, which is currently evaluating bids for its first auction for the purchase of green ammonia, methanol, and e-fuels, with final contracts expected to be announced as soon as this month. Products will start to be delivered by the end of 2024.

A double-sided auction scheme for low-carbon cement and concrete would address producers’ need for long-term offtake agreements while matching buyers’ short-term procurement needs. The auctions would also help develop transparent market prices for low-carbon cement and concrete products.

(Source: H2Global)

All bids for purchase agreements should include detailed technical specifications and/or certifications for the product, the desired price per unit, and a robust, third-party life-cycle assessment of the amount of embodied carbon per unit of concrete made with the product, at different compressive strengths. Additionally, bids of ready-mix concrete should include the location(s) of their production facility or facilities, and bids of cement and other concrete inputs should include information on the locations of ready-mix concrete facilities capable of producing concrete using their products. The DOE should then select bids through a pure reverse auction using the calculated effective price per unit of CO2 emissions avoided. To account for regional fragmentation, the DOE could conduct separate auctions for each region of the country.

A double-sided auction presents similar benefits to the low-carbon cement and concrete industry as an advance market commitment would. However, the addition of an efficient, built-in system for the government to then sell that cement or concrete allotment to a buyer means that the government is not obligated to use the cement or concrete itself. This is important because the logistics of matching cement or concrete production to a suitable government construction project can be difficult due to regional fragmentation, and the DOE is not a major procurer of cement and concrete.3 Instead, under this scheme, federal, state, or local agencies working on a construction project or their contractors could check the double-sided auction program each year to see if there is a product offering in their region that matches their project needs and sustainability goals for that year, and if so, submit a bid to procure it. In fact, this should be encouraged as a part of the Federal Buy Clean Initiative, since the government is such an important consumer of cement and concrete products.

Contracts for Difference

Contract for difference (CfD) programs, sometimes called two-way CfD, aim to provide price certainty for green technology projects and close the gap between the price that producers need and the price that buyers are willing to offer. CfD have been used by the United Kingdom and France primarily to support the development of large-scale renewable energy projects. However, CfD can also be used to support the development of production capacity for other green technologies. OCED is considering CfD (also known as pay-for-difference contracts) for its hydrogen demand-support program.

CfD are long-term contracts signed between the government or a government-sponsored entity and companies looking to expand production capacity for a green product.4 The contract guarantees that once the production facility comes online, the government will ensure a steady price by paying suppliers the difference between the market price for which they are able to sell their product and a predetermined “strike price.” On the other hand, if the market price rises above the strike price, the supplier will pay the difference back to the government. This prevents the public from funding any potential windfall profits.

A CfD program could provide a source of demand certainty for low-carbon cement and concrete companies looking to finance the construction of pilot- and commercial-scale manufacturing plants or the retrofitting of existing plants. The selection of recipients and strike prices should be determined through annual reverse auctions. In a typical reverse auction for CfD, the government sets a cap on the maximum number of units of product and the maximum strike price it is willing to accept. Each project candidate then places a sealed bid specifying a unit price and the amount of product it plans to produce. The bids are ranked by unit price, and projects are accepted from lowest to highest price until either the total capacity cap or the maximum strike price is reached. The last project accepted sets the strike price for all accepted projects. The strike price is adjusted annually for inflation but otherwise fixed over the course of the contract. Compared to traditional subsidy programs, a CfD program can be much more cost-efficient thanks to the reverse auction process. The UK’s CfD program has seen the strike price fall with each successive round of auctions.
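
The clearing rule described above (rank bids by price, accept until a cap binds, let the last accepted bid set the common strike price) can be sketched as follows. The function name and the bids are hypothetical illustrations:

```python
# Minimal sketch of a pay-as-clear CfD reverse auction as described above
# (hypothetical bids). Bids are ranked by unit price and accepted until the
# capacity cap or maximum strike price would be exceeded; the last accepted
# bid sets the strike price for all winners.

def run_cfd_auction(bids, capacity_cap, max_strike):
    """bids: list of (bidder, unit_price, volume). Returns (winners, strike)."""
    accepted, total = [], 0.0
    for bidder, price, volume in sorted(bids, key=lambda b: b[1]):
        if price > max_strike or total + volume > capacity_cap:
            break
        accepted.append(bidder)
        total += volume
        strike = price  # last accepted bid sets the common strike price
    return accepted, (strike if accepted else None)

bids = [("A", 90.0, 40.0), ("B", 70.0, 30.0), ("C", 120.0, 50.0)]
winners, strike = run_cfd_auction(bids, capacity_cap=100.0, max_strike=100.0)
assert winners == ["B", "A"] and strike == 90.0
```

In the example, bidder C is excluded by the price cap, and both winners receive the $90 strike price even though B bid $70, which is what rewards aggressive bidding in pay-as-clear designs.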

Applying this to the low-carbon cement and concrete industry requires some adjustments, since there are a variety of products for decarbonizing cement and concrete. As discussed above, the DOE should compare project bids according to the effective price per unit of CO2 abated when the product is used to make concrete. The DOE should also set a cap on the maximum volume of CO2 it wishes to abate and the maximum effective price per unit of CO2 abated that it is willing to pay. Bids can then be accepted from low to high price until one of those caps is hit. Instead of establishing a single strike price, the DOE should use each accepted project’s bid price as its strike price to account for the variation in types of products.

Backstop Price Guarantee 

A CfD program could be designed as a backstop price guarantee if one removes the requirement that suppliers pay the government back when market prices rise above the strike price. In this case, the DOE would set a lower maximum strike price for CO2 abatement, knowing that suppliers will be willing to bid lower strike prices, since there is now the opportunity for unrestricted profits above the strike price. The DOE would then only pay in the worst-case scenario when the market price falls below the strike price, which would operate as an effective price floor.
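
The difference between the two instruments is just the sign convention on the upside. A minimal sketch with hypothetical function names and prices:

```python
# Payout sketch contrasting a two-way CfD with a one-way backstop price
# guarantee, per the mechanisms described above (hypothetical prices).

def cfd_payment(strike: float, market: float) -> float:
    """Two-way CfD: government pays when market < strike, is repaid when above."""
    return strike - market  # positive = government pays supplier

def backstop_payment(strike: float, market: float) -> float:
    """One-way guarantee: government pays only when the market price falls
    below the strike; upside above the strike stays with the supplier."""
    return max(strike - market, 0.0)

assert cfd_payment(100.0, 80.0) == 20.0       # government tops up the supplier
assert cfd_payment(100.0, 120.0) == -20.0     # supplier repays the windfall
assert backstop_payment(100.0, 120.0) == 0.0  # supplier keeps the upside
```

Because suppliers keep the upside under the backstop, they should rationally bid lower strike prices than under a two-way CfD, which is the trade-off the design exploits.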

Backstop Volume Guarantee

Alternatively, the DOE could address demand uncertainty by providing a volume guarantee. In this case, the DOE could conduct a reverse auction for volume guarantee agreements with manufacturers: the DOE would commit to purchasing, at a set price, any units up to the guaranteed volume that the company is unable to sell each year, and the company would commit to a ceiling on the price it will charge buyers.5 Using OTA, the DOE could implement such a program in collaboration with DOT or GSA, wherein DOE would purchase the materials and DOT or GSA would use them for their construction needs.
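
The government's exposure under such an agreement is capped at the guaranteed volume times the agreed price. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
# Sketch of the backstop volume guarantee cash flow (hypothetical numbers):
# DOE commits to buy, at an agreed price, any unsold units up to the
# guaranteed volume; the supplier commits to a price ceiling for buyers.

def volume_guarantee_purchase(guaranteed_t: float, sold_t: float,
                              guarantee_price: float) -> float:
    """DOE outlay: unsold tonnage up to the guarantee, bought at the set price."""
    shortfall = max(guaranteed_t - sold_t, 0.0)
    return shortfall * guarantee_price

# Supplier sells 35,000 t of a 50,000 t guarantee; DOE buys the 15,000 t gap.
assert volume_guarantee_purchase(50_000, 35_000, 140.0) == 2_100_000.0
# If the supplier sells out, the guarantee costs the government nothing.
assert volume_guarantee_purchase(50_000, 60_000, 140.0) == 0.0
```

Unlike the price instruments above, this design leaves the government holding physical product in the shortfall case, which is why pairing it with an agency that actually pours concrete (DOT or GSA) matters.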

Rather than directly managing a demand-support program, the DOE should enter into an OT agreement with an external nonprofit entity to administer the contracts.6 The nonprofit entity would then hold auctions and select, manage, and fulfill the contracts. DOE is currently in the process of doing this for the hydrogen demand-support program. 

A nonprofit entity could provide two main benefits. First, the logistics of implementing such a program would not be trivial, given the number of different suppliers, intermediaries, and offtakers involved. An external entity would have an easier and faster time hiring staff with the necessary expertise compared to the federal hiring process and limited budget for program direction that the DOE has to contend with. Second, the entity’s independent nature would make it easier to gain lasting bipartisan support for the demand-support program, since the entity would not be directly associated with any one administration.

The green premium for near-zero-carbon cement and concrete products is steep, and demand-support programs like the ones proposed in this report should not be considered a cure-all for the industry, since it may be difficult to secure a large enough budget for any one such program to fully address the green premium across the industry. Rather, demand-support programs can complement the multiple existing funding authorities within the DOE by closing the residual gap between emerging technologies and conventional alternatives after other programs have helped to lower the green premium. 

The DOE’s Loan Programs Office (LPO) received a significant increase in their lending authorities from the IRA and has the ability to provide loans or loan guarantees to innovative clean cement facilities, resulting in cheaper capital financing and providing an effective subsidy. In addition, the IRA and the Bipartisan Infrastructure Law provided substantial new funding for the demonstration of industrial decarbonization technologies through OCED. 

Policies like these can be chained together. For example, a clean cement start-up could simultaneously apply to OCED for funding to demonstrate its technology at scale and to LPO for a loan or loan guarantee following due diligence on its business plan. Together, these two programs drive down the green premium and derisk the companies that successfully receive their support, leaving a much more modest price premium that a mechanism like a double-sided auction could affordably cover with less risk.

Successfully chaining policies like this requires deep coordination across DOE offices. OCED and LPO would need to work in lockstep in conducting technical evaluations and due diligence of projects that apply to both and prioritize funding of projects that meet both offices’ criteria for success. The best projects should be offered both demonstration funding from OCED and conditional commitments from LPO, which would provide companies with the confidence that they will receive follow-on funding if the demonstration is successful and other conditions are met, while posing no added risk to LPO since companies will need to meet their conditions first before receiving funds. The assessments should also consider whether the project would be a strong candidate for receiving demand support through a double-sided auction, CfD program, or price/volume guarantee, which would help further derisk the loan/loan guarantee and justify the demonstration funding. 

Candidates for receiving support from all three public funding instruments would of course need to be especially rigorously evaluated, since the fiscal risk and potential political backlash of such a project failing is also much greater. If successful, such coordination would ensure that the combination of these programs substantially moves the needle on bringing emerging technologies in green cement and concrete to commercial scale. 

Demand support can help address the key barrier that low-carbon cement and concrete companies face in scaling their technologies and financing commercial-scale manufacturing facilities. Whichever approach the DOE chooses to take, the agency should keep in mind (1) the importance of setting an ambitious standard for what qualifies as low-carbon cement and concrete and comparing proposals using a metric that accounts for the range of different product types and embodied emissions, (2) the complex implementation logistics, and (3) the benefits of coordinating a demand-support program with the agency’s demonstration and loan programs. Implemented successfully, such a program would crowd in private investment, accelerate commercialization, and lay the foundation for the clean industrial economy in the United States.

Breaking Ground on Next-Generation Geothermal Energy

This report is part one of a series on underinvested clean energy technologies, the challenges they face, and how the Department of Energy can use its Other Transaction Authority to implement programs custom-tailored to those challenges.

The United States is endowed with an abundance of clean, firm geothermal energy lying below our feet – tens of thousands of times more than the country has in untapped fossil fuels. Geothermal technology is entering a new era, with innovative approaches on their way to commercialization that will unlock access to more types of geothermal resources. However, the development of commercial-scale geothermal projects is an expensive affair, and the U.S. government has severely underinvested in this technology. The Inflation Reduction Act and the Bipartisan Infrastructure Law concentrated clean energy investments in solar and wind, which are great near-term solutions for decarbonization, but neglected to invest sufficiently in solutions like geothermal energy, which are necessary to reach full decarbonization in the long term. With new funding from Congress, or potentially the creative (re)allocation of existing funding, the Department of Energy (DOE) could take a number of different approaches to accelerating progress in next-generation geothermal energy, from leasing agency land for project development to providing milestone payments for the costly drilling phases of development.

As the United States power grid transitions towards clean energy, the increasing mix of intermittent renewable energy sources like solar and wind must be balanced by sources of clean firm power that are available around the clock in order to ensure grid reliability and reduce the need to overbuild solar, wind, and battery capacity. Geothermal power is a leading contender for addressing this issue. 

Conventional geothermal (also known as hydrothermal) power plants tap into existing hot underground aquifers and circulate the hot water to the surface to generate electricity. Thanks to an abundance of geothermal resources close to the earth’s surface in the western part of the country, the United States currently leads the world in geothermal power generation. Conventional geothermal power plants are typically located near geysers and steam vents, which indicate the presence of hydrothermal resources belowground. However, these hydrothermal sites represent just a small fraction of the total untapped geothermal potential beneath our feet — more than the potential of fossil fuel and nuclear fuel reserves combined.

Next-generation geothermal technologies, such as enhanced geothermal systems (EGS), closed-loop or advanced geothermal systems (AGS), and other novel designs, promise to allow access to a wider range of geothermal resources. Some designs can potentially also serve double duty as long-duration energy storage. Rather than tapping into existing hydrothermal reservoirs underground, these technologies drill into hot dry rock, engineer independent reservoirs using either hydraulic stimulation or extensive horizontal drilling, and then introduce new fluids to bring geothermal energy to the surface. These new technologies have benefited from advances in the oil and gas industry, resulting in lower drilling costs and higher success rates. Furthermore, some companies have been developing designs for retrofitting abandoned oil and gas wells to convert them into geothermal power plants. The commonalities between these two sectors present an opportunity not only to leverage the existing workforce, engineering expertise, and supply chain from the oil and gas industry to grow the geothermal industry but also to support a just transition such that current workers employed by the oil and gas industry have an opportunity to help build our clean energy future. 

Over the past few years, a number of next-generation geothermal companies have had successful pilot demonstrations, and some are now developing commercial-scale projects. As a result of these successes and the growing demand for clean firm power, power purchase agreements (PPAs) for an unprecedented 1 GW of geothermal power have been signed with utilities, community choice aggregators (CCAs), and commercial customers in the United States in 2022 and 2023 combined. In 2023, PPAs for next-generation geothermal projects surpassed those for conventional geothermal projects in terms of capacity. While this is promising, barriers remain to the development of commercial-scale geothermal projects. To meet its goal of net-zero emissions by 2050, the United States will need to invest in overcoming these barriers for next-generation geothermal energy now, lest the technology fail to scale to the level necessary for a fully decarbonized grid.

Meanwhile, conventional hydrothermal still has a role to play in the clean energy transition. The United States needs all the clean firm power it can get, whether from conventional or next-generation geothermal, in order to retire baseload coal and natural gas plants. Conventional hydrothermal power plants are less expensive to build and cheaper to finance, since the technology is tried and tested, and there are still plenty of untapped hydrothermal resources in the western part of the country.

Funding is the biggest barrier to commercial development of next-generation geothermal projects. Private financing comes in two forms: equity and debt. Equity financing is more risk tolerant and is typically the source of funding for start-ups as they move from the R&D to demonstration phases of their technology. But because equity financing dilutes the company’s ownership, debt financing is preferred for the construction of commercial-scale projects. However, first-of-a-kind commercial projects are almost always precluded from accessing debt financing. It is commonly understood within the industry that private lenders will not take on technology risk, meaning that technologies must be at a Technology Readiness Level (TRL) of 9, where they have been proven to operate at commercial scale; government lenders like the DOE Loan Programs Office (LPO) generally will not take on any risk that private lenders won’t. Technology risk in next-generation geothermal includes the possibility of underproduction, which would hurt the plant’s profitability, or of capacity declining faster than expected, which would shorten the plant’s operating lifetime. Moving next-generation technologies from their current TRL-7 level to TRL-9 will be key to establishing the reliability of these emerging technologies and unlocking debt financing for future commercial-scale projects.

Underproduction will likely remain a risk, though to a lesser extent, for next-generation projects even after technologies reach TRL-9. This is because uncertainty in the exploration and subsurface characterization process makes it possible for developers to overestimate the temperature gradient and thus the production capacity of a project. Conventional hydrothermal projects share this risk: their production capacity depends not only on the temperature gradient but also on the flow rate and enthalpy of the natural reservoir. In the worst-case scenario, drilling can result in a dry hole that produces no hot fluids at all. Underproduction becomes a financial issue when the project cannot generate as much revenue as expected, or when additional wells must be drilled to compensate, driving up the total project cost. Research into improving the accuracy and cost of geothermal exploration and subsurface characterization can help mitigate this risk but may not eliminate it entirely, since there is a risk-cost trade-off in how much time is spent on these activities.

Another challenge for both next-generation and conventional geothermal projects is that they are more expensive to develop than solar or wind projects. Drilling requires significant upfront capital expenditures, making up about half of the total capital costs of developing a geothermal project, if not more. For example, in EGS projects, the first few wells can cost around $10 million each, while conventional hydrothermal wells, which are shallower, can cost around $3–7 million each. While conventional hydrothermal plants only consist of two to six wells on average, designs for commercial EGS projects can require several times that number of wells. Luckily, EGS projects benefit from the fact that wells can be drilled identically, so projects expect to move down the learning curve as they drill more wells, resulting in faster and cheaper drilling. Initial data from commercial-scale projects currently being developed suggest that the learning curves may be even steeper than expected. Nevertheless, this will need to be proven at scale across different locations. Some companies have managed to forgo expensive drilling costs by focusing on developing technologies that can be installed within idle hydrothermal wells or abandoned oil and gas wells to convert them into productive geothermal wells.
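
The learning-curve effect mentioned above is commonly modeled with Wright's law: each doubling of cumulative identical wells drilled cuts per-well cost by a fixed fraction. The sketch below is an illustration with assumed parameters (a $10M first well and a 15% learning rate), not data from any actual project:

```python
# Illustrative Wright's-law learning curve for repeated identical EGS wells.
# The learning rate and first-well cost are assumptions for illustration only.

import math

def well_cost(first_well_cost: float, well_number: int, learning_rate: float) -> float:
    """Cost of the nth identical well, where each doubling of cumulative
    wells drilled reduces per-well cost by `learning_rate` (a fraction)."""
    b = math.log2(1.0 - learning_rate)  # negative learning exponent
    return first_well_cost * well_number ** b

# With a $10M first well and a 15% learning rate, the 16th well (four
# doublings) costs 0.85^4 of the first, i.e. roughly $5.2M.
c16 = well_cost(10e6, 16, 0.15)
assert abs(c16 - 10e6 * 0.85**4) < 1e-6
```

This kind of model is why multi-well EGS projects can pencil out despite expensive early wells: the cost of the marginal well keeps falling as the drilling program proceeds.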

Beyond funding, geothermal projects need to obtain land where there are suitable geothermal resources, as well as permits for each stage of project development. The best geothermal resources in the United States are concentrated in the West, where the federal government owns most of the land. The Bureau of Land Management (BLM) manages much of that land, in addition to all subsurface resources on federal land. However, how consistently the BLM leases its land varies by state: while Nevada BLM has been very consistent about holding regular lease sales each year, California BLM has not held a lease sale since 2016. Adding to the complexity, although BLM manages all subsurface resources on federal land, the surface land may be managed by a different agency, in which case both agencies must be involved in the leasing and permitting process.

Last, next-generation geothermal companies face a green premium on electricity produced using their technology, though the green premium does not appear to be as significant of a challenge for next-generation geothermal as it is for other green technologies. In states with high renewables penetration, utilities and their regulators are beginning to recognize the extra value that clean firm power provides in terms of grid reliability. For example, the California Public Utility Commission has issued an order for utilities to procure 1 GW of clean, firm power by 2026, motivating a wave of new demand from utilities and community choice aggregators. As a result of this demand and California’s high electricity prices in general, geothermal projects have successfully signed a flurry of PPAs over the past year. These have included projects located in Nevada and Utah that can transmit electricity to California customers. In most other western states, however, electricity prices are much lower, so utility companies can be reluctant to sign PPAs for next-generation geothermal projects if they aren’t required to, due to the high cost and technology risk. As a result, next-generation geothermal projects in those states have turned to commercial customers, like those operating data centers, who are willing to pay more to meet their sustainability goals. 

The federal government is beginning to recognize the important role of next-generation geothermal power in the clean energy transition. In 2023, geothermal energy became eligible for the first time for the renewable energy investment and production tax credits, thanks to technology-neutral language introduced in the Inflation Reduction Act (IRA). Within the DOE, the agency launched the Enhanced Geothermal Shot in 2022, led by the Geothermal Technologies Office (GTO), to reduce the cost of EGS by 90% to $45/MWh by 2035 and make geothermal widely available. In 2020, the Frontier Observatory for Research in Geothermal Energy (FORGE), a dedicated underground field laboratory for EGS research, drilling, and technology testing established by GTO in 2014, drilled its first well using new approaches and tools the lab had developed. This year, GTO announced funding for seven EGS pilot demonstrations from the Bipartisan Infrastructure Law (BIL), for which GTO is currently reviewing the first round of applications. GTO also awarded the Geothermal Energy from Oil and gas Demonstrated Engineering (GEODE) grant to a consortium formed by Project Innerspace, the Society of Petroleum Engineers International, and Geothermal Rising, with over 100 partner entities, to transfer best practices from the oil and gas industry to geothermal, support demonstrations and deployments, identify barriers to growth in the industry, and encourage workforce adoption.

While these initiatives are a good start, significantly more funding from Congress is necessary to support the development of pilot demonstrations and commercial-scale projects and enable wider adoption of geothermal energy. The BIL notably expanded the DOE’s mission area in supporting the deployment of clean energy technologies, including establishing the Office of Clean Energy Demonstrations (OCED) and funding demonstration programs from the Energy Division of BIL and the Energy Act of 2020. However, the $84 million in funding authorized for geothermal pilot demonstrations was only a fraction of what other programs received from BIL and not commensurate with the actual cost of next-generation geothermal projects. Congress should be investing an order of magnitude more in next-generation geothermal projects, in order to maintain U.S. leadership in geothermal energy and reap the many benefits to the grid, the climate, and the economy.

Another key issue is that DOE has, both currently and historically, limited all of its funding for next-generation geothermal to EGS technologies. As a result, companies pursuing closed-loop/AGS and other next-generation technologies cannot qualify, leading some projects to move abroad. Given GTO’s historically limited budget, this may have been a strategic decision to concentrate funding on one technology rather than dilute it across several. However, given that none of these technologies has yet been commercialized at wide scale, DOE may be missing the opportunity to invest in the full range of viable approaches. DOE appears to be aware of this, as the agency currently has a working group on AGS. New funding from Congress would allow DOE to diversify its investments to support the demonstration and commercial application of other next-generation geothermal technologies.

Alternatively, there are a number of OCED programs with funding from BIL that have not yet been fully spent (Table 1). Congress could reallocate some of that funding towards a new program supporting next-generation geothermal projects within OCED. Though not ideal, this may be a more palatable near-term solution for the current Congress than appropriating new funding.

Table 1. OCED programs that have remaining unspent funding from BIL as of publication in January 2024.
OCED Program | Total Funding | Committed Funding | Unspent Funding
Carbon Capture Demonstration Projects | $2.547 billion | $1.889 billion | $658 million
Carbon Capture Large Scale Pilot Projects | $937 million | $820 million | $117 million
Energy Improvements in Rural and Remote Areas | $1 billion | $365 million | $635 million
Clean Energy Demonstration Program on Current and Former Mine Land | $500 million | $450 million | $50 million
Energy Storage Demonstration Projects and Pilot Grant Program | $355 million | $349 million | $6 million
Long-Duration Demonstration Program and Joint Initiative | $150 million | $30 million | $120 million
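As a quick arithmetic check on Table 1, each unspent figure is simply total minus committed funding; the short sketch below (figures transcribed from the table, in millions of dollars) also totals the remaining funds:

```python
# Figures from Table 1, in millions of USD: (total funding, committed funding).
programs = {
    "Carbon Capture Demonstration Projects": (2547, 1889),
    "Carbon Capture Large Scale Pilot Projects": (937, 820),
    "Energy Improvements in Rural and Remote Areas": (1000, 365),
    "Clean Energy Demonstration Program on Current and Former Mine Land": (500, 450),
    "Energy Storage Demonstration Projects and Pilot Grant Program": (355, 349),
    "Long-Duration Demonstration Program and Joint Initiative": (150, 30),
}

# Unspent funding is total minus committed, matching the table's last column.
unspent = {name: total - committed for name, (total, committed) in programs.items()}
total_unspent = sum(unspent.values())
print(f"Total unspent across OCED programs: ${total_unspent} million")
```

Roughly $1.59 billion remains uncommitted across these six programs, which gives a sense of the scale of funding Congress could potentially redirect.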

A third option is that DOE could use some of the funding for the Energy Improvements in Rural and Remote Areas program, of which $635 million remains unallocated, to support geothermal projects. Though the program’s authorization does not explicitly mention geothermal energy, geothermal is a good candidate given the abundance of geothermal production potential in rural and remote areas in the West. Moreover, as a clean firm power source, geothermal has a comparative advantage over other renewable energy sources in improving energy reliability. 

Other Transactions Authority

BIL and IRA gave DOE an expanded mandate to support innovative technologies from early stage research through commercialization. To do so, DOE will need to be just as innovative in its use of its available authorities and resources. Tackling the challenge of scaling technologies from pilot to commercialization will require DOE to look beyond traditional grant, loan, and procurement mechanisms. Previously, we identified the DOE’s Other Transaction Authority (OTA) as an underleveraged tool for accelerating clean energy technologies. 

OTA is defined in legislation as the authority to enter into any transaction that is not a government grant or contract. This negative definition provides DOE with significant freedom to design and implement flexible financial agreements that can be tailored to the unique challenges that different technologies face. OT agreements allow DOE to be more creative, and potentially more cost-effective, in how it supports the commercialization of new technologies, such as facilitating the development of new markets, mitigating risks and market failures, and providing innovative new types of demand-side “pull” funding and supply-side “push” funding. The DOE’s new Guide to Other Transactions provides official guidance on how DOE personnel can use the flexibilities provided by OTA. 

With additional funding from Congress, the DOE could use OT agreements to address the unique barriers that geothermal projects face in ways that may not be possible through other mechanisms. Below are four proposals for how the DOE can do so. We chose to focus on supporting next-generation geothermal projects, since the young industry currently requires more governmental support to grow, but we included ideas that would benefit conventional hydrothermal projects as well.

Geothermal Development on Agency Land

This year, the Defense Innovation Unit issued its first funding opportunity specifically for geothermal energy. The four winning projects will aim to develop innovative geothermal power projects on Department of Defense (DoD) bases for both direct consumption by the base and sale to the local grid. OT agreements were used for this program to develop mutually beneficial custom terms. For project developers, DoD provided funding for surveying, design, and proposal development in addition to land for the actual project development. The agreement terms also gave companies permission to use the technology and information gained from the project for other commercial use. For DoD, these projects are an opportunity to improve the energy resilience and independence of its bases while also reducing emissions. By implementing the prototype agreement using OTA, DoD will have the option to enter into a follow-on OT agreement with project developers without further competition, expediting future processes.

DOE could implement a similar program for its 2.4 million acres of land. In particular, the DOE’s land in Idaho and other western states has favorable geothermal resources, which the DOE has considered leasing. By providing funding for surveying and proposal development, as the DoD did, the DOE could increase the odds of successful project development compared to simply leasing the land without funding support. The DOE could also offer projects technical support from its national labs.

With such a program, much of the value the DOE would provide is the land itself, of which the agency currently has far more than it has funding for geothermal energy. The funding needed for surveying and proposal development is much less than what would be needed to support the actual construction of demonstration projects, so GTO could feasibly request funding for such a program through the annual appropriations process. Depending on the program outcomes and the resulting proposals, the DOE could then return to Congress to request follow-on funding to support actual project construction.

Drilling Cost-Share Program

To help defray the high cost of drilling, the DOE could implement a milestone-based cost-share program. There is precedent for government cost-share programs for geothermal: in 1973, before the DOE was even established, Congress passed the Geothermal Loan Guarantee Program to provide “investment security to the public and private sectors to exploit geothermal resources” in the early days of the industry. Later, the DOE funded the Cascades I and II Cost Shared Programs. Then, from 2000 to 2007, the DOE ran the Geothermal Resource Exploration and Definitions (GRED) I, II, and III Cost-Share Programs. This year, the DOE launched its EGS Pilot Demonstrations program.

A milestone payment structure could be favorable for supporting expensive, next-generation geothermal projects because the government takes on less risk than if it provided all of the funding upfront. Initial funding could be provided for drilling the first few wells. Successful, on-time completion of drilling could then unlock additional funding to drill more wells, and so on. In the past, both the DoD and the National Aeronautics and Space Administration (NASA) have structured their OT agreements using milestone payments, most famously between NASA and SpaceX for the development of the Falcon 9 launch vehicle. That agreement included not just technical milestones but also financial milestones for the investment of additional private capital into the project. The DOE could do the same and include both technical and financial milestones in a geothermal cost-share program.
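The milestone structure described above can be sketched as a simple ordered ledger: each funding tranche unlocks only when the prior milestone is certified complete and on time. The milestone names, tranche sizes, and dates below are illustrative inventions, not drawn from any actual DOE or DoD agreement.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Milestone:
    name: str
    tranche_millions: float  # funding released on completion of this milestone
    deadline: date

def disburse(milestones, completions):
    """Release tranches in order; stop at the first missed or late milestone.

    `completions` maps milestone name -> actual completion date (absent if
    not yet completed).
    """
    released = 0.0
    for m in milestones:
        done = completions.get(m.name)
        if done is None or done > m.deadline:
            break  # milestone missed or late: later tranches stay locked
        released += m.tranche_millions
    return released

# Illustrative schedule: initial wells funded first, later tranches unlock on success.
schedule = [
    Milestone("Drill first two wells", 10.0, date(2025, 6, 30)),
    Milestone("Stimulate reservoir", 15.0, date(2026, 3, 31)),
    Milestone("Drill production wells", 25.0, date(2027, 1, 31)),
]

# This hypothetical project completed the first two milestones on time.
actual = {
    "Drill first two wells": date(2025, 5, 15),
    "Stimulate reservoir": date(2026, 3, 1),
}
print(disburse(schedule, actual))  # → 25.0
```

Financial milestones (e.g., proof of additional private capital raised) could be added as further gating conditions alongside the technical ones.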

Risk Insurance Program

Longer term, the DOE could implement a risk insurance program for conventional hydrothermal and next-generation geothermal projects. Insuring against underproduction could make it easier and cheaper for projects to be financed, since the potential downside for investors would be capped. The DOE could initially offer insurance just for conventional hydrothermal, since there is already extensive data on past commercial projects that can inform how the insurance is designed. In order to design insurance for next-generation technologies, more commercial-scale projects will first need to be built to collect the data necessary to assess the underproduction risk of different approaches.

France has administered a successful Geothermal Public Risk Insurance Fund for conventional hydrothermal projects since 1982. The insurance originally consisted of two parts: a Short-Term Fund to cover the risk of underproduction and a Long-Term Fund to cover uncertain long-term behavior over the operating lifetime of the geothermal plant.

The Short-Term Fund charged project owners a premium of 1.5% of the maximum guaranteed amount. In return, it provided a 20% subsidy for the cost of drilling the first well and, in the case of reduced output or a dry hole, compensation of between 20% and 90% of the maximum guaranteed amount (inclusive of the subsidy already paid). The exact compensation was determined by a formula for the amount necessary to restore the project’s profitability at its reduced output. The Short-Term Fund relied on a high success rate, especially in the Paris Basin, where good hydrothermal resources are known to exist, to fund the costs of failures. Developers that chose coverage from the Short-Term Fund were also required to get coverage from the Long-Term Fund, which hedged against unexpected geological or geothermal changes within the wells over the plant’s operating lifetime, such as output declining faster than expected or severe corrosion or scaling. The Long-Term Fund ended in 2015, but a new iteration of the Short-Term Fund was approved in 2023.
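As a worked example of the Short-Term Fund mechanics, the sketch below applies the stated percentages (1.5% premium, 20% drilling subsidy, 20–90% compensation band) to hypothetical project figures; the guarantee amount, well cost, and compensation rate are invented for illustration, and the Fund's actual profitability-restoration formula is not reproduced.

```python
def short_term_fund_payout(max_guaranteed, first_well_cost, compensation_rate):
    """Sketch of the French Short-Term Fund described above.

    The 1.5% premium, 20% drilling subsidy, and 20-90% compensation band come
    from the text; the compensation_rate itself would be set by the Fund's
    formula for restoring project profitability, which is not modeled here.
    """
    if not 0.20 <= compensation_rate <= 0.90:
        raise ValueError("compensation must fall between 20% and 90%")
    premium = 0.015 * max_guaranteed   # paid upfront by the project owner
    subsidy = 0.20 * first_well_cost   # paid toward drilling the first well
    total_compensation = compensation_rate * max_guaranteed
    # Compensation is inclusive of the subsidy already paid out.
    additional_payout = total_compensation - subsidy
    return premium, subsidy, additional_payout

# Hypothetical project: $10M maximum guarantee, $5M first well,
# 60% compensation rate after underproduction.
premium, subsidy, payout = short_term_fund_payout(10e6, 5e6, 0.60)
```

In this hypothetical, the owner pays a $150,000 premium, receives a $1M drilling subsidy upfront, and collects roughly $5M more after underproduction, since the $6M total compensation is net of the subsidy already paid.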

The Netherlands has successfully run a similar program to the Short-Term Fund since the 2000s. Private-sector attempts at setting up geothermal risk insurance packages in Europe and around the world have mostly failed, though. The premiums were often too high, costing up to 25–30% of the cost of drilling, and were established in developing markets where not enough projects were being developed to mutualize the risk. 

To implement such a program at the DOE, projects seeking coverage would first submit an application consisting of the technical plan, timeline, expected costs, and expected output. The DOE would then conduct rigorous due diligence to ensure that the project’s proposal is reasonable. Once accepted, projects would pay a small premium upfront; the DOE should keep in mind the failed attempts at private-sector insurance packages and ensure that the premium is affordable. In the case that either the installed capacity is much lower than expected or the output capacity declines significantly over the course of the first year of operations, the Fund would compensate the project based on the level of underproduction and the amount necessary to restore the project’s profitability with a reduced output. The French Short-Term Fund calculated compensation based on characteristics of the hydrothermal wells; the DOE would need to develop its own formulas reflective of the costs and characteristics of different next-generation geothermal technologies once commercial data actually exists. 

Before setting up a geothermal insurance fund, the DOE should investigate whether enough geothermal projects are being developed across the country to mutualize the risk, and whether there is enough commercial data to evaluate that risk properly. Another concern for next-generation geothermal is that a high failure rate could deplete the fund. To mitigate this, the DOE will need to analyze future commercial data for different next-generation technologies to assess whether each is mature enough for a sustainable insurance program. Lastly, limited state capacity could impede implementation: the DOE will need staff sufficiently knowledgeable about the range of emerging technologies to evaluate technical plans, understand their risks, and design an appropriate insurance package.

Production Subsidy

While the green premium for next-generation geothermal has not been an issue in California, it may be slowing project development in states with lower electricity prices. The Inflation Reduction Act introduced a new clean energy Production Tax Credit that included geothermal energy for the first time. However, because next-generation geothermal projects cost more to develop than other renewable energy projects, that subsidy is insufficient to fully bridge the green premium. DOE could use OTA to introduce a production subsidy for next-generation geothermal energy, with rates varying by the state in which the electricity is sold and that state’s average baseload electricity price (e.g., the subsidy likely would not apply in California). This would address variations in the green premium across states and expand the number of states in which next-generation geothermal projects are financially viable.
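One way a state-varying subsidy could be sized is as the gap between a plant's levelized cost and the state's average baseload price, optionally capped. The sketch below illustrates that design; all dollar figures are hypothetical assumptions, not actual DOE, IRA, or market numbers.

```python
def production_subsidy(lcoe_per_mwh, avg_baseload_price_per_mwh, cap_per_mwh=None):
    """Per-MWh subsidy sized to the local green premium (a hypothetical design).

    The subsidy covers the gap between a next-generation plant's levelized
    cost of electricity and the state's average baseload price; states where
    baseload prices already exceed the plant's cost receive no subsidy.
    """
    gap = max(0.0, lcoe_per_mwh - avg_baseload_price_per_mwh)
    if cap_per_mwh is not None:
        gap = min(gap, cap_per_mwh)
    return gap

# Hypothetical figures: a $90/MWh plant in a high-price state ($95/MWh)
# needs no subsidy; in a low-price state ($45/MWh) the $45/MWh gap is
# capped here at $30/MWh.
print(production_subsidy(90.0, 95.0))                     # → 0.0
print(production_subsidy(90.0, 45.0, cap_per_mwh=30.0))   # → 30.0
```

A cap would limit the government's exposure per megawatt-hour while still widening the set of states where projects pencil out.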

The United States is well-positioned to lead the next-generation geothermal industry, with its abundance of geothermal resources and opportunities to leverage the knowledge and workforce of the domestic oil and gas industry. The responsibility is on Congress to ensure that DOE has the necessary funding to support the full range of innovative technologies being pursued by this young industry. With more funding, DOE can take advantage of the flexibility offered by OTA to create agreements tailored to the unique challenges that the geothermal industry faces as it begins to scale. Successful commercialization would pave the way to unlocking access to 24/7 clean energy almost anywhere in the country and help future-proof the transition to a fully decarbonized power grid. 

FAS Annual Report 2023

Friends and Colleagues,

In today’s political climate in Washington, it is sometimes hard to believe that change is possible. Yet, at the Federation of American Scientists (FAS), we know firsthand that progress happens when the science community has a seat at the policymaking table. At our core, we believe that when passionate advocates join forces and share a commitment to ongoing learning, adaptation, and a drive toward action – science and technology progress can both solve the toughest challenges and uncover new ways to deliver the greatest impact.

In 2023, we remained steadfast in our ability to spur collective action. FAS supported our federal partners on the most significant investments in science and technology in decades with the Creating Helpful Incentives to Produce Semiconductors and Science Act (CHIPS) and the Inflation Reduction Act. Our Talent Hub team placed 71 Impact Fellows on tours of service in government and secured a first-of-its-kind partnership with the U.S. Department of Agriculture (USDA) to place 35 Impact Fellows in key positions within USDA over the next five years. Our expert network published 47 actionable policy memos through our Day One Project platform and drove impact by working with the U.S. Department of Transportation (USDOT) to launch the new Advanced Research Projects Agency-Infrastructure (ARPA-I). And our renowned Nuclear Information Project continues to inform the public and challenge assumptions about nuclear weapons arsenals and trends with record-breaking public attention. I hope you’ll read more about all of our wins in this year’s FAS Impact Report.

FAS remains focused on honoring our 80-year legacy as a leading voice on global risk while seeking out new policy areas and domains that advance and support science and technology priorities. To support this new era for FAS, we completed a full rebrand—modernizing our look and retelling our story—and rolled out organization-wide strategic goals to drive and define the impact we seek to instill across government. Together, we focus on more than progress for its own sake—we intentionally create the systems and paradigms that make such progress sustainable and tangible.

We have continued to build our team and expertise, and with that growth we are inspired by the caliber of our new teammates. We also remain committed to fulfilling our expectations on Diversity, Equity, Inclusion and Belonging (DEIB) and continue to advocate for stronger commitments to social equality with all of our partners. 

It is impossible for me to fit the entire year’s successes into a single letter, but I hope our annual report brings my update to life.

Thank you for your continued support,

Dan Correa, FAS CEO

For several years, FAS has been evangelizing the power of policy entrepreneurship to galvanize policy change, helping an entire community of experts and practitioners embrace the tools, mindsets, and networks needed to get results.

In FY23, FAS advanced policy entrepreneurship across all of its core issue domains by convening change agents, crafting policy memos, curating policy ideas, and seeding countless actionable proposals. Below are just some of our highlights from the past year.

Championing Critical Funding across the Science and Technology (S&T) Ecosystem—FY23 Omnibus Spending Bill

Public investments in science and technology have declined precipitously since the Cold War, when two percent of the U.S. gross domestic product (GDP) went to research and development (R&D). With estimates of R&D investment currently below one percent of GDP and challenges from peer competitors like China threatening U.S. leadership in emerging technologies, FAS advocates for strong investments in critical and emerging technologies as well as science, technology, engineering, and math (STEM) education to maintain America’s edge in innovation.

In December 2022, President Biden signed the FY23 Omnibus appropriations package into law, funding a broad range of new science and technology priorities. This funding will strengthen our country’s ability to invest in better science and technology education, stay globally competitive, and ensure that innovation opportunities are available across the country. The bill included provisions that stemmed from a number of ideas that FAS staff and Day One Project contributors helped seed.

Reversing Megafire through Science and Data

Against a backdrop of the growing scourge of megafires, FAS has helped to put wildfires on the policy agenda in a bipartisan way that would have seemed impossible only a year ago. FAS organized more than 30 experts to contribute actionable policy ideas that have been shared directly with the Congressionally-mandated Wildland Fire Mitigation and Management Commission. Through this effort, we are advancing our goal of helping reduce the risks of catastrophic uncontrolled fires and protect people from the health risks of wildfire smoke while promoting beneficial controlled fire to improve ecosystem health. FAS policy recommendations influenced recommendations in the Commission’s report to Congress to guide a legislative implementation strategy which has included $1.6 billion in appropriations requests for smoke and public health.

Addressing Inequities in Medical Devices

The COVID-19 public health emergency revealed deep disparities in medical device use, specifically with pulse oximeters—devices widely used to measure oxygen saturation in blood. Medical researchers and policymakers had overlooked this issue for years until the COVID-19 pandemic revealed a large disparity in the diagnosis and treatment of severe respiratory conditions in Black and Brown communities. Through policy entrepreneurship, FAS identified an opportunity on a previously under-examined health policy issue and achieved two major wins. 

First, FAS brought together more than 60 stakeholders to highlight policy opportunities to address racial bias in pulse oximeters and to cultivate a comprehensive strategy to address biases and inequities in medical innovation from industry to philanthropy and government by hosting an in-person Forum on Bias in Pulse Oximetry in November 2022. 

Second, recognizing the importance of continuing the conversation on disparate impacts of technology and the COVID-19 pandemic on underrepresented communities, FAS developed a research and policy agenda for near-term mitigation of inequities in pulse oximetry and other medical technologies as well as the long-term solutions from the Bias in Pulse Oximetry Forum. FAS’s research and convening on this issue prompted the Veterans Health Administration (VHA)—a major health agency within the U.S. government—to evaluate the use of all pulse oximeters (~50 types) and to understand the impact of the technologies on the more than nine million patients served by the VHA system.

FAS experts frequently collaborate with stakeholders in Congress and the executive branch to help solve complex science and technology policy challenges that align with government priorities and needs. In FY23, FAS’s unique ability to coordinate actors across the legislative and executive branches and facilitate crucial discourse and planning efforts across government agencies yielded tangible successes as described below.

Accelerating Technology Deployment through Flexible Financial Mechanisms to Maximize Spending from the Bipartisan Infrastructure Law (BIL) and the Inflation Reduction Act (IRA)

Promising technologies and opportunities for innovation exist across health, clean energy, and other domains but often lack an existing market—or guarantee of a future market—to support their creation and commercialization. The federal government can play a unique role in signaling and even guaranteeing demand for these solutions, including using its power as a buyer. 

FAS worked with the DOE front office to promote flexible financial mechanisms that support and accelerate the deployment of novel clean energy technologies that lower greenhouse gas emissions, while supporting the implementation of BIL and IRA. FAS compiled a set of policy recommendations for how DOE could leverage its Other Transactions Authority (OTA) to accelerate commercialization and scale high-impact clean energy technologies. FAS recommended that DOE establish a formal internal process that encourages the formation of consortia to promote efficiency and collaboration across technology areas, while still appropriately mitigating risk.

These recommendations prompted DOE to release guidance in September 2023 on how program offices and leaders across the agency can leverage other transactions to catalyze demand for clean energy. DOE continues to engage FAS in ongoing discussions on deploying OT agreements and other flexible financial mechanisms to stimulate demand and accelerate deployment of promising technologies.

Creating stronger infrastructure through innovation

The United States faces multiple challenges in using innovation to not only deliver transportation infrastructure that is more resilient against climate change, but also to deliver on the clean energy transition and advance equity for communities that have historically been excluded from decision-making on these projects. To address these challenges, in November 2021 Congress passed the Infrastructure Investment and Jobs Act (IIJA), which included $550 billion in new funding for dozens of new programs across the USDOT.

The bill created the Advanced Research Projects Agency-Infrastructure (ARPA-I) and made historic investments in America’s roads and bridges. ARPA-I’s mission is to unlock the full potential of public and private innovation ecosystems to improve U.S. infrastructure by accelerating climate game-changers across the entire U.S. R&D ecosystem. Since its authorization, USDOT has invited FAS to use our expertise to scope advanced research priorities across diverse infrastructure topics where targeted research can yield innovative new infrastructure technologies, materials, systems, capabilities, or processes through ARPA-I.

For example, this year FAS has engaged more than 160 experts in ARPA-I program idea generation and created 50 wireframes for ARPA-I’s initial set of programs, leading to a powerful coalition of stakeholders and laying a strong foundation for the potential that ARPA-I can achieve as it evolves. ARPA-I’s authorization and subsequent initial appropriation in December 2022 provides an opportunity to tackle monumental challenges across transportation and infrastructure through breakthrough innovation. FAS’s programming is helping shape the future of the ARPA-I office.

Providing Government with the Tools to Assess Risks in Artificial Intelligence (AI) and Biosecurity

With increased warnings that AI may support the development of chemical and biological weapons, the federal government must act to protect the public from malicious actors. Senators Ed Markey (D-MA) and Ted Budd (R-NC) introduced the Artificial Intelligence and Biosecurity Risk Assessment Act and the Strategy for Public Health Preparedness and Response to Artificial Intelligence Threats Act with FAS’s technical assistance. These two pieces of legislation empower the federal government to better understand public health security risks associated with AI by directing the Assistant Secretary for Preparedness and Response of the U.S. Department of Health and Human Services (HHS) to conduct comprehensive risk assessments of advances in AI.

Helping International STEM Students and Workers in the United States

Sixty percent of computer science PhDs and nearly 50% of STEM PhDs are foreign born, and these workers have contributed to America’s continuing science and technological leadership. FAS has worked across the legislative and executive branches of government to keep the best and brightest science and technology minds in the United States.

In the legislative branch, interest in keeping talented scientific and technical workers in the United States has grown as a national security concern. Recognizing the importance of this moment, FAS provided technical assistance to the offices of Senators Dick Durbin (D-IL) and Mike Rounds (R-SD) and Representatives Bill Foster (D-IL11) and Mike Lawler (R-NY17) in introducing the Keep STEM Talent Act of 2023, a bill that would make it easier for international students with advanced STEM degrees to stay in the United States after graduation.

An executive branch rule states that most nonimmigrants (i.e., non-green card holders) must renew visas outside the United States at an American embassy or consulate overseas. This rule requires students and workers to leave the United States during school or employment and bear the costs of going back to their country of origin; it also creates an administrative burden for consular officers who have heavy caseloads. FAS experts published a policy document that provides specific recommendations for how to reinstate domestic visa renewal. The State Department implemented some of these recommendations through a pilot program. This pilot program, the first step to solving this challenge, allows high-skilled immigrants to renew their work visas in the United States rather than having to travel to their home country to do so.

At FAS, we are proud of our impact and realize there is still more to be done. While we are working to expand the breadth and depth of our work above, we also see three major opportunities for FAS in the next fiscal year.

Expanding Government’s Capacity 

The U.S. government is critical to solving the largest problems of the 21st century. While significant progress has been made, institutional complexity challenges the government’s ability to quickly innovate and deliver on its mission. Lackluster incentives, bureaucratic bottlenecks, and the lack of feedback loops slow progress and hinder capacity building across four key areas: financial mechanisms, evidence, talent, and culture. This work is especially important in an election year where either a second term or new administration will bring new people and ideas to Washington, DC, and the government’s ability to execute these ideas hinges on its capacity.

FAS is in a unique position to support the federal government in building federal capacity. Since delivering 100 implementation-ready policy proposals for the 2020 presidential transition, FAS has grown and matured, expanding our capabilities as an organization. We are working to diagnose key science and technology policy issues ripe for bipartisan innovation and support. As we move forward with our findings, FAS will use our Day One platform to publicize grand challenges in this space and gather the best ideas from experts across the country on how best to solve these issues.

Mitigating Global Risk

FAS was founded to address the new, human-created nuclear danger that threatened global extinction. Today, in a world vastly more complicated than the one into which nuclear weapons were introduced, FAS supports the development and execution of sound public policy based on proven and effective technical skills to improve the human condition and, increasingly, to reduce global risks. 

FAS’s new Global Risk program is focused on both the promise and peril posed by evolving AI capabilities in the nuclear landscape and beyond. Dedicated to reducing nuclear dangers and ensuring that qualified technical experts are integral parts of the policy process, FAS seeks to advance its work in support of U.S. and global security at the intersection between nuclear weapons, AI, and global risk. By drawing on technical experts, engaging the policy community, convening across multiple skill sets and sectors, and developing joint projects and collaborations with the government, FAS seeks to drive positive policy outcomes and shape the security landscape for the better.

Deepening Knowledge of Emerging Technologies across All Branches of Government

AI’s rapid evolution, combined with a lack of understanding of how it works, makes today’s policy decisions incredibly important but fraught with misconceptions. This is a pivotal moment, and FAS seeks to engage, educate, and inspire congressional staff, executive branch personnel, military decision makers, and state lawmakers on AI’s substantial potential—and risks. Our mission is to translate this transformative technology for lawmakers by advancing impactful policy development and promoting positive and productive discourse.  

FAS finds itself in an unprecedented position to directly inform and influence crucial decisions that will shape AI governance. Our nonpartisan expertise and ability to move rapidly have made us the go-to resource for members of Congress across party lines when they require technical advice on AI-related issues. In the 118th Congress, FAS’s AI team has provided support on six vital AI bills and received requests for assistance and briefings on AI-related topics from over 40 congressional offices.

We recognize that this momentum offers FAS a unique opportunity to not only continue guiding policymakers with much-needed perspectives but also strive for actionable and equitable policy change that addresses the challenges linked with advancements in artificial intelligence.

The Federation of American Scientists continued our fundraising momentum from FY22 into FY23, securing $51 million in new commitments across 47 total awards and 31 unique funders, representing a 46% increase in funding allocations from last year. These investments by FAS’s philanthropic and agency partners reflect a sustained focus by FAS staff to continue diversifying and expanding our funding portfolio while simultaneously deepening our connections with existing partners and positioning FAS as an indispensable voice for evidence-based, scientifically-driven policy analysis and research.

The majority of the funding FAS receives (99.6%) is restricted to specific projects and initiatives, while the remaining unrestricted funding (0.4%) bolsters the organization’s operational capacity.

The critical work being done at FAS would not be possible without the generous support of its philanthropic partners who continue to invest in the organization’s vision for the future.

Anonymous Donor
Arnold Ventures
Bayshore Global
Breakthrough Energy
Camelback Ventures
Carnegie Corporation of New York
The Catena Foundation
Chan Zuckerberg Initiative
The Dallas Foundation
The Energy Foundation
Future of Life Institute
The Gates Foundation
General Services Administration
Good Ventures Foundation
Horizon Institute for Public Service
The William and Flora Hewlett Foundation
Kapor Center
Korea Foundation
The Ewing Marion Kauffman Foundation
The Kresge Foundation
LEGO Foundation
Lincoln Network
Longview Philanthropy
The John D. and Catherine T. MacArthur Foundation
Mercatus Center
The Gordon and Betty Moore Foundation
National Center for Entrepreneurship and Innovation
National Philanthropic Trust
The New Land Foundation
Norwegian People’s Aid
Oceans 5
Open Philanthropy
The David and Lucille Packard Foundation
PIT Fund
The Ploughshares Fund
The Prospect Hill Foundation
Resource Legacy Fund
Schmidt Futures
Siegel Family Endowment
Silicon Valley Community Foundation
The Alfred P. Sloan Foundation
Thrive Together LLC
United States Department of Agriculture
United States Department of Transportation
United States Economic Development Administration
UnlockAid
The Walton Family Foundation

Nuclear Notebook: Nuclear Weapons Sharing, 2023

The FAS Nuclear Notebook is one of the most widely sourced reference materials worldwide for reliable information about the status of nuclear weapons and has been published in the Bulletin of the Atomic Scientists since 1987. The Nuclear Notebook is researched and written by the staff of the Federation of American Scientists’ Nuclear Information Project: Director Hans M. Kristensen, Senior Research Fellow Matt Korda, Research Associate Eliana Johns, and Scoville Peace Fellow Mackenzie Knight.

This issue’s column examines the current state of global nuclear sharing arrangements, which include non-nuclear countries that possess nuclear-capable delivery systems for employment of a nuclear-armed state’s nuclear weapons.

Read the full “Nuclear weapons sharing, 2023” Nuclear Notebook in the Bulletin of the Atomic Scientists. The complete archive of FAS Nuclear Notebooks can be found here.

This research was carried out with generous contributions from the New-Land Foundation, Ploughshares Fund, the Prospect Hill Foundation, Longview Philanthropy, and individual donors.

Unlocking American Competitiveness: Understanding the Reshaped Visa Policies under the AI Executive Order

Intensifying global competition for talent has made it necessary to evaluate and update U.S. policies concerning international visa holders. Recognizing this, President Biden has directed various agencies, through the upcoming AI Executive Order (EO), to consider policy changes aimed at improving processes and conditions for legal foreign workers, students, researchers, and scholars. The EO recognizes that attracting global talent is vital for continued U.S. economic growth and competitiveness.

Here we offer a comprehensive analysis of the potential impacts and beneficiaries of several key provisions of this EO, grouped into six categories: domestic revalidation for J-1 and F-1 visas; modernization of H-1B visa rules; updates to the J-1 Exchange Visitor Skills List; the introduction of a Global AI Talent Attraction Program; an RFI seeking updates to DOL’s Schedule A; and policy manual updates for O-1A, EB-1, EB-2, and the International Entrepreneur Rule. Each policy change could advance America’s ability to draw in the international experts who contribute enormously to our innovation-driven economy.

Domestic Revalidation for J-1 and F-1 Visas

The EO directive on expanding domestic revalidation for J-1 research scholars and F-1 STEM visa students simplifies and streamlines the renewal process for a large number of visa holders. 

There are currently approximately 900,000 international students in the U.S., nearly half of whom are enrolled in STEM fields. This policy change could therefore affect almost 450,000 international students, including those in optional practical training (OPT). Many of those affected hold advanced degrees: nearly half of all STEM PhDs are awarded to international students.

One of the significant benefits of this EO directive is the reduction in processing times and associated costs, along with improved convenience for students and scholars. For example, hundreds of thousands of STEM students would no longer be obliged to pay for travel to their home countries for a 10-minute interview at an embassy.

Aside from saving costs, this directive also allows students to attend international conferences more easily and enjoy hassle-free travel without being worried about having to spend a month away from their vital research waiting for visa renewal back home.

Expanding domestic revalidation to F and J visa holders was initially suggested by the Secure Borders and Open Doors Advisory Committee in January 2008, indicating its long-standing relevance and importance. By implementing it, we not only enhance efficiency but also foster a more supportive environment for international students contributing significantly to our scientific research community.

Modernization of H-1B Visa Rules

The EO directive to update the rules surrounding H-1B visas would positively affect the more than 500,000 current H-1B visa holders. The Department of Homeland Security recently released a Notice of Proposed Rulemaking to reform the H-1B program. The proposed rules would allow visa holders to transition more easily into new jobs, provide more predictability and certainty in the renewal process, give workers more flexibility to apply their skills, and let entrepreneurs access the H-1B visa more effectively. Last year, 206,002 initial and continuing H-1B visas were issued, and the new rules would apply to similar numbers in FY2025. What amplifies this modification’s impact is its potential crossover with EB-1 and EB-2 petitioners waiting on green cards: currently more than 400,000 petitions.

Additionally, the modernization would address the issue of multiple applications per applicant. This has been a controversial issue in the H-1B program, as companies would often file multiple registrations for the same employee, exhausting yearly quotas faster and reducing other applicants’ chances. The modernization could address this problem by introducing clear rules or restrictions on the number of applications per applicant. USCIS recently launched fraud investigations into several companies engaging in this practice.

Updates to J-1 Exchange Visitor Skills List

The EO directive to revamp the skills list will align it with evolving global labor market needs. Nearly 37,000 of the J-1 visas issued in 2022 went to professors, research scholars, and short-term scholars, mainly from China and India (together nearly 40% of the total). This update would thus not only expand opportunities for these participants but also help close critical skill gaps in fields like AI in the U.S. Once the J-1 skills list reflects the realities of today’s global labor market, thousands of additional high-skilled J-1 visa holders could apply for other visa categories immediately, without spending two years in their countries of origin, as laid out in this recent brief by the Federation of American Scientists.

Global AI Talent Attraction Program

Recognizing that AI talent is global, the EO directive on using the State Department’s public diplomacy function becomes strategically important. By hosting overseas events that appeal to crucial talent bases abroad, we can help meet the U.S. tech industry’s demand for talent, which has climbed steeply in recent years. While 59% of top-tier AI researchers work in the U.S., only 20% of them received their undergraduate degree in the U.S. Among the most elite (top 0.5%) of AI researchers, only 35% received their undergraduate degree in the U.S., yet 65% of them work here. Establishing a Global AI Talent Attraction Program at the State Department would double down on this uniquely American advantage.

Schedule A Update & DOL’s RFI

Schedule A is a list of occupations for which the U.S. Department of Labor (DOL) has determined there are not sufficient U.S. workers who are able, willing, qualified and available. Foreign workers in these occupations can therefore have a faster process to receive a Green Card because the employer does not need to go through the Labor Certification process. Schedule A Group I was created in 1965 and has remained unchanged since 1991. If the DOL were to update Schedule A, it would impact foreign workers and employers in several ways depending on how the list changes:

Foreign workers in occupations on Schedule A do not have to go through the PERM (Program Electronic Review Management) labor certification process, which otherwise takes on average 300 days to complete, because the Department of Labor has already determined there are not sufficient U.S. workers who are able, willing, qualified, and available. An updated Schedule A could significantly cut the number of PERM applications filed, which currently runs high (over 86,000 filed by the end of FY23 Q3). While the EO only calls for an RFI seeking information on the Schedule A list, this is a critical first step toward a badly needed update.

Policy Manual Updates for O-1A, EB-1, EB-2 and International Entrepreneur Rule

The EO’s directive to DHS to modernize pathways for experts in AI and other emerging technologies will have profound effects on the U.S. tech industry. Fields such as artificial intelligence, quantum computing, and biotechnology are increasingly crucial to global technology leadership and national security. According to the NSCAI report, the U.S. significantly lags in AI expertise due to severe immigration challenges.

The modernization would likely include clarifications and updates to the criteria defining ‘extraordinary ability’ and ‘exceptional ability’ under the O-1A, EB-1, and EB-2 visas, making them more inclusive of talent in emerging tech fields. For instance, the current ‘extraordinary ability’ category is restrictive toward researchers because it preferentially favors those who have received significant international awards or recognition, a rarity in most early-stage research careers. Similarly, although O-1A and EB-1 are both designed for aliens with extraordinary ability, the EB-1 criteria are more restrictive than those for O-1A; bringing the two in line would give O-1A holders a more predictable path to transition to an EB-1. Such updates also extend to the International Entrepreneur Rule, giving startup founders with critical-technology backgrounds more straightforward access to the U.S.

Altogether, these updates could lead to a surge in visa applications under O-1A, EB-1, EB-2 categories and increase entrepreneurship within emerging tech sectors. In turn, this provision would bolster the U.S.’ competitive advantage globally by attracting top-performing individuals working on critical technologies worldwide.

Enhanced Informational Resources and Transparency

The directives in Section 4 instruct an array of senior officials to create informational resources that demystify options for experts in critical technologies intending to work in the U.S. The provision’s ramifications include:

Streamlining Visa Services 

This area of the order directly addresses immigration policy with a view to accelerating access for talented individuals in emerging tech fields. 

Using Discretionary Authorities to Support and Attract AI Talent

The EO’s directive to the Secretary of State and Secretary of Homeland Security to use discretionary authorities—consistent with applicable law and implementing regulations—to support and attract foreign nationals with special skills in AI seeking to work, study, or conduct research in the U.S. could have enormous implications. 

One way this provision could be implemented is through public benefit parole. Offering parole to elite AI researchers who may otherwise be stuck in decades-long backlogs (or who are trying to escape authoritarian regimes) could significantly increase the inflow of intellectual talent into the U.S. Public benefit parole is also the basis for the International Entrepreneur Rule. Given that other countries are actively poaching talent from the U.S. because of our decades-long visa backlogs, creating a public benefit parole program for researchers in AI and other emerging technology areas could prove extremely valuable. These researchers could be allowed to stay and work in the U.S. provided they can demonstrate, on an individual basis, that their stay would provide a significant public benefit through their AI research and development efforts.

Another potential use of this discretionary authority would be for the Department of State to issue a memo announcing a one-time recapture of certain immigrant visa cap numbers to redress prior agency failures to issue visas. There is precedent for this: the government openly acknowledged errors that made immigrants from Western Hemisphere countries face longer wait times between 1968 and 1976, when it incorrectly charged Cuban refugees to the Western Hemisphere limitation. To remedy the situation, the government recaptured over 140,000 visas from prior fiscal years on its own authority and issued them to other immigrants caught in the Western Hemisphere backlog.

In the past, considerable numbers of green cards have gone unused due to administrative factors. Recapturing them could immediately benefit a sizable number of immigrants, including those with AI skills who are waiting for green card availability. For instance, if a hypothetical 300,000 green cards lost to administrative failures were recaptured, the immigration process could be expedited for a similar number of individuals.

Finally, as a brief from the Federation of American Scientists stated earlier, it is essential that the Secretary of State and the Secretary of Homeland Security extend the visa interview waivers indefinitely, considering the significant backlogs faced by the State Department at several consular posts that are preventing researchers from traveling to the U.S. 

In August 2020, Secretary Pompeo announced that applicants seeking a visa in the same category they previously held could receive an interview waiver if their visa had expired in the last 24 months; before this, the window was only 12 months. In December 2020, just two days before this policy was set to expire, DOS extended it through the end of March 2021. In March 2021, the window was doubled again, from 24 months to 48 months, and the policy was extended through December 31, 2021. In September 2021, DOS also approved waivers through the remainder of 2021 for applicants for F, M, and academic J visas from Visa Waiver Program countries who had previously been issued a visa.

In December 2021, DOS extended its then-existing policies (with some minor modifications) through December 2022. Moreover, the policy that individuals renewing a visa in the same category as one that expired in the preceding 48 months may be eligible for issuance without an interview was made a standing policy of the State Department and added to the department’s Foreign Affairs Manual for consular officers. In December 2022, DOS announced another extension of these policies, which are set to expire at the end of 2023.

As the State Department recently noted: “These interview waiver authorities have reduced visa appointment wait times at many embassies and consulates by freeing up in-person interview appointments for other applicants who require an interview. Nearly half of the almost seven million nonimmigrant visas the Department issued in Fiscal Year 2022 were adjudicated without an in-person interview. We are successfully lowering visa wait times worldwide, following closures during the pandemic, and making every effort to further reduce those wait times as quickly as possible, including for first-time tourist visa applicants. Embassies and consulates may still require an in-person interview on a case-by-case basis and dependent upon local conditions.”

These changes would also benefit U.S. companies and research institutions, which often struggle to attract and retain international AI talent due to the lengthy immigration process and uncertain outcomes. In addition, exercising parole authority can open a new gateway for attracting highly skilled AI talent that might otherwise have chosen other countries over the rigid U.S. immigration system.

The use of such authorities could result in transformational change for AI research and development in the U.S. However, these outcomes depend entirely on the actual changes made to existing policies, a task that many acknowledge will require care to strike a balance between selectivity and inclusiveness.

In summary, these provisions would have massive impacts, enabling the U.S. to retain foreign talent vital to sectors including education, technology, and healthcare, fueling national economic growth in turn.

Nuclear Notebook: Pakistan Nuclear Weapons, 2023

The FAS Nuclear Notebook is one of the most widely sourced reference materials worldwide for reliable information about the status of nuclear weapons and has been published in the Bulletin of the Atomic Scientists since 1987. The Nuclear Notebook is researched and written by the staff of the Federation of American Scientists’ Nuclear Information Project: Director Hans M. Kristensen, Senior Research Fellow Matt Korda, and Research Associate Eliana Johns.

This issue’s column finds that Pakistan is continuing to gradually expand its nuclear arsenal with more warheads, more delivery systems, and a growing fissile material production industry. We estimate that Pakistan now has a nuclear weapons stockpile of approximately 170 warheads.

Read the full “Pakistan Nuclear Weapons, 2023” Nuclear Notebook in the Bulletin of the Atomic Scientists. The complete archive of FAS Nuclear Notebooks can be found here.


This research was carried out with generous contributions from the New-Land Foundation, Ploughshares Fund, the Prospect Hill Foundation, Longview Philanthropy, and individual donors.

Trust Issues: An Analysis of NSF’s Funding for Trustworthy AI

Below, we analyze AI R&D grants from the National Science Foundation’s Computer and Information Science and Engineering (NSF CISE) directorate, estimating the share that supports “trustworthy AI” research. NSF has not published an overview of its specific funding for such work within AI. By reviewing a random sample of granted proposals from 2018-2022, we estimate that roughly 10-15% of annual AI funding supports trustworthy AI research areas, including interpretability, robustness, privacy preservation, and fairness, despite an increased focus on trustworthy AI in NSF’s strategic plan and in public statements by key NSF and White House officials. Robustness receives the largest allocation (~6% annually), while interpretability and fairness each receive ~2%. Funding for privacy-preserving machine learning has risen significantly, from ~0.1% to ~5%. We suggest that NSF increase funding for responsible AI, including specific programs and solicitations addressing critical AI trustworthiness issues. We also suggest that NSF consider trustworthiness in all AI grant application assessments and prioritize projects that enhance the safety of foundation models.

Background on Federal AI R&D

Federal R&D funding has been critical to AI research, especially a decade ago when machine learning (ML) tools had less potential for wide use and received limited private investment. Much of the early AI development occurred in academic labs that were mainly federally funded, forming the foundation for modern ML insights and attracting large-scale private investment. With private-sector investment now outstripping public funding and driving notable AI advances, federal funding agencies are reevaluating their role in this area. The key question is how public investment can complement private finance to advance AI research that benefits American wellbeing.

Figure 1.

Inspiration for chart from Our World in Data

The Growing Importance of Trustworthy AI R&D

A growing priority within national AI strategy discourse is the advancement of “trustworthy AI”. Per the National Institute of Standards and Technology, trustworthy AI refers to AI systems that are safe, reliable, interpretable, robust, respectful of privacy, and have harmful biases mitigated. Though terms such as “trustworthy AI”, “safe AI”, “responsible AI”, and “beneficial AI” are not precisely defined, they are an important part of the government’s characterization of high-level AI R&D strategy. We aim to elucidate these concepts in this report, focusing on specific research directions aimed at bolstering these desirable attributes in ML models. We start by discussing an increasing trend toward such goals that we observe in governmental strategies and certain program solicitations.

This increased focus has been reflected in many government strategy documents in recent years. Both the 2016 National AI R&D Strategic Plan and its 2019 update from the National Science and Technology Council pinpointed trustworthiness in AI as a crucial objective. This was reiterated even more emphatically in the recent 2023 revision, which stressed the confidence and reliability of AI systems as especially significant objectives. The plan also underlined how the burgeoning number of AI models has necessitated urgent efforts to enhance safety. Public feedback on previous versions of the plan highlights an expanded priority across academia, industry, and society for AI models that maintain safety, transparency, and equitable improvement without trespassing on privacy norms. The NSF’s FY2024 budget proposal articulated its primary intention of advancing “the frontiers of trustworthy AI”, deviating from earlier years’ emphasis on sowing seeds for future advancements across various realms of human pursuits.

Concrete manifestations of this increasing emphasis on trustworthy AI can be seen not only in high-level strategy discussions but also in specific programs designed to advance trustworthiness in AI models. One of the seven recently established NSF AI institutes focuses exclusively on “trustworthy AI”. Other programs, like NSF’s Fairness in Artificial Intelligence and Safe Learning-Enabled Systems, focus chiefly on cultivating dimensions of trustworthy AI research.

Despite their value, these individual trustworthiness-focused programs form only a small fraction of NSF’s total AI R&D funding: around $20 million per year against nearly $800 million per year for AI R&D overall. It remains unclear how much the mounting concern surrounding trustworthy and responsible AI influences NSF’s funding commitments. In this paper, we provide an initial investigation of this question by estimating the proportion of grants over the past five fiscal years (FY 2018-2022) from NSF’s CISE directorate (the primary funder of AI R&D within NSF) that support a few key research directions within trustworthy AI: interpretability, robustness, fairness, and privacy preservation.

Please treat our approximations cautiously; they are neither exact nor conclusive. Our methodology relies heavily on individual judgments in categorizing nebulous grant types within a sample of the overall grants. Our goal is to offer an initial look at federal funding trends directed toward trustworthy AI research.

Methodology

We utilized NSF’s online database of awarded grants from the CISE directorate. First, we identified a representative set of AI R&D-focused grants (“AI grants”) funded by NSF’s CISE directorate across fiscal years 2018-2022. We then drew a random sample of these grants and manually classified them according to predetermined research directions relevant to trustworthy AI. An overview of this process is given below, with details on each step of our methodology provided in the Appendix.

  1. Search: Using NSF’s online award search feature, we extracted a near-comprehensive collection of abstracts of grant applications approved by NSF’s CISE directorate during fiscal years 2018-2022. Since the search function relies on keywords, we prioritized high recall over high precision, yielding an overly inclusive result set of close to 1,000 grants annually. We believe this initial set encompasses nearly all AI grants from NSF’s CISE directorate while also incorporating numerous non-AI-centric R&D awards.
  2. Sample: For each fiscal year, we drew a random subset of 100 abstracts (approximately 10% of the total extracted). This sample size strikes a balance between manageability for manual categorization and sufficient numbers for reasonably accurate funding estimates.
  3. Sort: Based on prevailing definitions of trustworthy AI, we conceptualized four clusters of research directions: i) interpretability/explainability, ii) robustness/safety, iii) fairness, iv) privacy preservation. To provide useful contrasts with trustworthy AI funding numbers, we designated additional categories: v) capabilities and vi) applications of AI. Here, “capabilities” refers to work pushing forward model performance, and “applications of AI” refers to work leveraging existing AI techniques for progress in other domains. Non-AI-centric grants were sorted out of our sample and marked as “other” at this stage. Each grant in our sample was manually classified into one or more of these research directions based on its primary focus and, where applicable, secondary or tertiary objectives; additional specifics on this sorting process are given in the Appendix.
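As a concrete illustration of the sample-and-sort estimate, the dollar-weighted shares can be sketched in a few lines of Python. The field names and records below are invented for illustration (they are not NSF's actual award schema); this is a minimal sketch of the estimation step, not our full pipeline.

```python
from collections import defaultdict

# Hypothetical labeled sample: each grant carries its award amount (USD)
# and its manually assigned directions, primary first. Records are
# invented placeholders, not real NSF awards.
sample = [
    {"amount": 500_000, "directions": ["robustness"]},
    {"amount": 300_000, "directions": ["capabilities"]},
    {"amount": 450_000, "directions": ["privacy-preservation", "fairness"]},
    {"amount": 250_000, "directions": ["applications"]},
    {"amount": 100_000, "directions": ["other"]},  # non-AI grant
]

TRUSTWORTHY = {"interpretability", "robustness", "fairness", "privacy-preservation"}

def funding_shares(grants):
    """Dollar-weighted share of each direction among AI grants, crediting
    each grant's full amount to its primary (first-listed) direction.
    Grants marked 'other' (non-AI) are excluded from the denominator."""
    ai_grants = [g for g in grants if g["directions"] != ["other"]]
    total = sum(g["amount"] for g in ai_grants)
    shares = defaultdict(float)
    for g in ai_grants:
        shares[g["directions"][0]] += g["amount"] / total
    return dict(shares)

shares = funding_shares(sample)
trustworthy_share = sum(v for k, v in shares.items() if k in TRUSTWORTHY)
```

With these invented records, the robustness and privacy grants are the trustworthy ones, so `trustworthy_share` works out to (500,000 + 450,000) / 1,500,000, about 63%; on the real sample, this per-year figure is what is plotted below.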

Findings

Based on our sorting process, we estimate the proportion of AI grant funds from NSF’s CISE directorate which are primarily directed at our trustworthy AI research directions.

Figure 2.

As depicted in Figure 2, the collective proportion of CISE funds allocated to trustworthy AI research directions typically varies between approximately 10% and 15% of total AI funds per year. We observe no clear upward or downward trend in this overall metric, indicating no dramatic shifts in the funding proportion assigned to trustworthy AI projects over the five-year period examined.

Considering secondary and tertiary research directions

As previously noted, several grants under consideration appeared to have secondary or tertiary focuses, or pursued research goals that bridge different research directions. We estimate that over the five-year evaluation period, roughly 18% of grant funds were directed to projects having at least a partial focus on trustworthy AI.
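This "at least partial focus" figure can be computed the same way as the primary-focus estimate, except that a grant is credited whenever any of its labeled directions (primary, secondary, or tertiary) falls in the trustworthy set. A sketch under the same hypothetical labeling scheme, with invented records rather than real NSF data:

```python
# Invented labeled grants: award amount (USD) plus all assigned
# directions, primary first, then any secondary/tertiary focuses.
labeled = [
    {"amount": 500_000, "directions": ["robustness"]},
    {"amount": 300_000, "directions": ["capabilities", "fairness"]},
    {"amount": 200_000, "directions": ["applications"]},
]

TRUSTWORTHY = {"interpretability", "robustness", "fairness", "privacy-preservation"}

def partial_trustworthy_share(grants):
    """Dollar-weighted share of grants with at least one trustworthy-AI
    direction anywhere in their label list."""
    total = sum(g["amount"] for g in grants)
    hit = sum(g["amount"] for g in grants
              if any(d in TRUSTWORTHY for d in g["directions"]))
    return hit / total

partial_share = partial_trustworthy_share(labeled)
```

Here the capabilities grant counts because of its secondary fairness focus, so `partial_share` is 800,000 / 1,000,000 = 0.8; by construction this estimate can only exceed the primary-focus one, which is why the partial-focus figure sits above the primary-focus range.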

Figure 3.

Specific Research Directions

Robustness/safety

Presently, ML systems tend to fail unpredictably when confronted with situations considerably different from their training scenarios (non-IID settings). This propensity to fail can cause harm, especially in high-risk environments. To diminish such threats, robustness- and safety-related research aims to enhance system reliability across new domains and mitigate catastrophic failure in untrained situations.1 This category also encompasses projects that identify potential risks and failure modes to inform further safety improvements.

Over the past five years, our analysis shows that research pertaining to robustness is typically the most funded trustworthy AI direction, representing about 6% of the total funds allocated by CISE to AI research. However, no definite trends have been identified concerning funding directed at robustness over this period.

Figure 4.

Interpretability/explainability

Explaining why a machine learning model outputs certain predictions for a given input is still an unsolved problem.2 Research on interpretability or explainability aspires to devise methods for better understanding the decision-making processes of machine learning models and designing more easily interpretable decision systems.

Over the investigated years, funding supporting interpretability and explainability shows no substantial growth, accounting on average for approximately 2% of all AI funds.

Figure 5.

Fairness/non-discrimination

ML systems often reflect and exacerbate biases present in their training data. To circumvent these issues, research on fairness or non-discrimination works toward creating systems that sidestep such biases. This area of study frequently involves exploring ways to reduce dataset biases, developing bias-assessment metrics for current models, and devising other bias-reducing strategies for ML models.3

The funding allocated to this area also generally accounts for around 2% of annual AI funds. Our data did not reveal any discernible trend in fairness/non-discrimination-oriented funding over the examined period.

Figure 6.

Privacy-preservation

Training AI systems typically requires large volumes of data that can include personal information, so privacy preservation is crucial. In response, privacy-preserving machine learning research aims to develop methods that safeguard private information.4

Throughout the studied years, funding for privacy-preserving machine learning grows significantly, from under 1% in 2018 (the smallest share among our examined research directions) to over 6% in 2022 (the largest among our examined trustworthy AI research directions). The increase begins around fiscal year 2020; its cause remains unclear.

Figure 7.

Recommendations

NSF should continue to carefully consider the role that its funding can play in an overall AI R&D portfolio, taking into account both private and public investment. Trustworthy AI research presents a strong opportunity for public investment. Many of the lines of research within trustworthy AI may be under-incentivized within industry investments, and can be usefully pursued by academics. Concretely, NSF could: 


Appendix

Methodology

For this investigation, we aim to estimate the proportion of AI grant funding from NSF’s CISE directorate which supports research that is relevant to trustworthy AI. To do this, we rely on publicly-provided data of awarded grants from NSF’s CISE directorate, accessed via NSF’s online award search feature. We first aim to identify, for each of the examined fiscal years, a set of AI-focused grants (“AI grants”) from NSF’s CISE directorate. From this set, we draw a random sample of grants, which we manually sort into our selected trustworthy AI research directions. We go into more detail on each of these steps below. 

How did we choose this question? 

We touch on some of the motivation for this question in the introduction above. We investigate NSF’s CISE directorate because it is the primary directorate within NSF for AI research, and because focusing on one directorate (rather than some broader focus, like NSF as a whole) allows for a more focused investigation. Future work could examine other directorates within NSF or other R&D agencies for which grant awards are publicly available. 

We focus on estimating trustworthy AI funding as a proportion of total AI funding because our goal is to analyze how trustworthy AI is prioritized relative to other AI work, and because this information could be more action-guiding for funders like NSF choosing which research directions within AI to prioritize.

Search (identifying a list of AI grants from NSF’s CISE Directorate)

To identify a set of AI grants from NSF’s CISE directorate, we used the advanced award search feature on NSF’s website. We conducted the following search:

This search yielded a set of ~1,000 grants for each fiscal year. This set was over-inclusive, containing many grants that were not focused on AI, because we aimed for high recall rather than high precision when choosing our keywords; our goal was a set of grants that would include all of the relevant AI grants made by NSF’s CISE directorate. We sorted out false positives, i.e., grants not focused on AI, in the subsequent “sorting” phase. 

Sampling

We assigned a random number to each grant returned by our initial search, and then sorted the grants from smallest to largest assigned number. For each year, we copied the 100 grants with the smallest randomly assigned numbers into a new spreadsheet, which we used for the subsequent “sorting” step. 

We now had a random sample of 500 grants (100 for each FY) from the larger set of ~5,000 grants identified in the search phase. We chose this sample size because it was manageable for manual sorting, and we did not anticipate large shifts in the relative proportions were we to expand from a ~10% sample to, say, 20% or 30%. 
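The sampling procedure described above can be sketched in code. This is an illustrative reconstruction, not the authors' actual tooling; the function and variable names are assumptions:

```python
import random

def draw_sample(grants_by_fy, k=100, seed=0):
    """Assign each grant a random number, sort ascending, and keep the
    k smallest per fiscal year -- mirroring the spreadsheet procedure."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = {}
    for fy, grants in grants_by_fy.items():
        # Pair each grant with a random number, then sort by that number.
        keyed = [(rng.random(), g) for g in grants]
        keyed.sort(key=lambda pair: pair[0])
        sample[fy] = [g for _, g in keyed[:k]]
    return sample
```

Sorting on an independently drawn random key is equivalent to a uniform random shuffle, so taking the first 100 rows yields a simple random sample for each fiscal year.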

Identifying Trustworthy AI Research Directions

We aimed to identify a set of broad research directions which would be especially useful for promoting trustworthy properties in AI systems, which could serve as our categories in the subsequent manual sorting phase. We consulted various definitions of trustworthy AI, relying most heavily on the definition provided by NIST: “characteristics of trustworthy AI include valid and reliable, safe, secure and resilient, accountable and transparent, explainable and interpretable, privacy-enhanced, and fair with harmful bias managed.” We also consulted some lists of trustworthy AI research directions, identifying research directions which appeared to us to be of particular importance for trustworthy AI. Based on the above process, we identify the following clusters of trustworthy AI research:

It is important to note that none of these research areas is crisply defined, but we thought these clusters provided a useful, high-level way to break trustworthy AI research down into broad categories. 

In the subsequent steps, we aim to compare the amount of grant funds that are specifically aimed at promoting the above trustworthy AI research directions with the amount of funds which are directed towards improving AI systems’ capabilities in general, or simply applying AI to other classes of problems.
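The comparison described here reduces to a share-of-funding calculation over the sorted sample. A minimal sketch, under the assumption that each sampled grant is represented by its primary research direction and award amount (names are illustrative, not the authors' code):

```python
from collections import defaultdict

def funding_shares(sorted_grants):
    """Given (primary_direction, award_amount) pairs from the manual sort,
    return each direction's share of total AI grant funding. Grants sorted
    as "Other" (non-AI) are dropped from both numerator and denominator."""
    totals = defaultdict(float)
    for direction, amount in sorted_grants:
        if direction != "Other":
            totals[direction] += amount
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {d: amt / grand_total for d, amt in totals.items()}
```

Excluding "Other" from the denominator matches the report's framing: trustworthy AI funding is measured as a proportion of total *AI* funding, not of all CISE funding.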

Sorting

For our randomly sampled set of 500 grants, we aimed to sort each grant according to its intended research direction. 

For each grant, we a) read the title and abstract and b) assigned the grant a primary research direction and, where applicable, secondary and tertiary research directions. Secondary and tertiary directions were not assigned to every grant; we assigned them only to grants that stood out to us as pursuing several distinct objectives. We provide examples of some of these “overlapping” grants below.

We sorted grants into the following categories:

  1. Capabilities
    1. This category was used for projects that are primarily aimed at advancing the capabilities of AI systems, by making them more competent at some task, or for research which could be used to push forward the frontier of capabilities for AI systems. 
    2. This category also includes investments in resources that are generally useful for AI research, e.g. computing clusters at universities. 
    3. Example: A project which aims to develop a new ML model which achieves SOTA performance on a computer vision benchmark.
  2. Application of AI/ML.
    1. This category was used for projects which apply existing ML/AI techniques to research questions in other domains. 
    2. Example: A grant which uses some machine learning techniques to analyze large sets of data on precipitation, temperature, etc. to test a hypothesis in climatology.
  3. Interpretability/explainability.
    1. This category was used for projects which aim to make AI systems more interpretable or explainable by allowing for a better understanding of their decision-making processes. Here, we included both projects which offer methods for better interpreting existing models and projects which offer new training methods that produce models which are easier to interpret.
    2. Example: A project which determines the features of a resume that make it more or less likely to be scored positively by a resume-ranking algorithm.
  4. Robustness/safety
    1. This category was used for projects which aim to make AI systems more robust to distribution shifts and adversarial inputs, and more reliable in unfamiliar circumstances. Here, we include both projects which introduce methods for making existing systems more robust, and those which introduce new techniques that are more robust in general. 
    2. Example: A project which explores new methods for providing systems with training data that causes a computer vision model to learn robustly useful patterns from data, rather than spurious ones. 
  5. Fairness/non-discrimination
    1. This category was used for projects which aim to make AI systems less likely to entrench or reflect harmful biases. Here, we focus on work directly geared at making models themselves less biased. Many project abstracts described efforts to include researchers from underrepresented populations in the research process, which we chose not to include because of our focus on model behavior.
    2. Example: A project which aims to design techniques for “training out” certain undesirable racial or gender biases.
  6. Privacy preservation
    1. This category was used for projects which aim to make AI systems less privacy-invading. 
    2. Example: A project which provides a new algorithm that allows a model to learn desired behavior without using private data. 
  7. Other
    1. This category was used for grants which are not focused on AI. As mentioned above, the random sample included many grants which were not AI grants, and these could be removed as “other.”
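
The per-grant record behind this sorting scheme can be sketched as follows. The class and field names are illustrative assumptions, not the authors' actual tooling:

```python
from dataclasses import dataclass
from typing import Optional

# The seven categories used in the manual sort.
CATEGORIES = {
    "Capabilities",
    "Application of AI/ML",
    "Interpretability/explainability",
    "Robustness/safety",
    "Fairness/non-discrimination",
    "Privacy preservation",
    "Other",
}

@dataclass
class SortedGrant:
    title: str
    award_amount: float
    primary: str                     # every grant receives a primary direction
    secondary: Optional[str] = None  # only for multi-objective grants
    tertiary: Optional[str] = None

    def __post_init__(self):
        # Reject any direction outside the seven defined categories.
        for d in (self.primary, self.secondary, self.tertiary):
            if d is not None and d not in CATEGORIES:
                raise ValueError(f"unknown research direction: {d}")
```

Making secondary and tertiary directions optional mirrors the process described above, where most grants received only a primary direction.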

Some caveats and clarifications on our sorting process

This sorting focuses on the apparent intentions and goals of the research as stated in the abstracts and titles, as these are the aspects of each grant the NSF award search feature makes readily viewable. Our process may therefore miss research objectives which are outlined in the full grant application (and not within the abstract and title). 

A focus on specific research directions

We chose to focus on specific research agendas within trustworthy and responsible AI, rather than sorting grants into a binary of “trustworthy” or “not trustworthy,” in order to bring greater clarity to our sorting process. We still make judgment calls with regard to which individual research agendas a given grant promotes, but we hope this approach allows for greater agreement.

As mentioned above, we also assigned secondary and tertiary research directions to some of these grants. You can view the grants in the sample and how we sorted each here. Below, we offer some examples of the kinds of grants which we would sort into these categories.

Examples of Grants with Multiple Research Directions

To summarize: in the sorting phase, we read the title and abstract of each grant in our random sample, and assigned these grants to a research direction. Many grants received only a “primary” research direction, though some received secondary and tertiary research directions as well. This sorting was based on our understanding of the main goals of the project, based on the description provided by the project title and abstract.

State of the Federal Clean Energy Workforce

How Improved Talent Practices Can Help the Department of Energy Meet the Moment

This report aims to provide a snapshot of clean energy talent at the Department of Energy and its surrounding orbit: the challenges, successes, and opportunities that the workforce is experiencing at this once-in-a-generation moment.

To compile the findings in this report, FAS worked with nonprofit and philanthropic organizations, government agencies, advocacy and workforce coalitions, and private companies over the last year. We held events, including information sessions, recruitment events, and convenings; we conducted interviews with more than 25 experts from the public and private sector; we developed recommendations for improving talent acquisition in government, and helped agencies find the right talent for their needs.

Overall, we found that DOE has made significant progress towards its talent and implementation goals, taking advantage of the current momentum to bring in new employees and roll out new programs to accelerate the clean energy transition. The agency has made smart use of flexible hiring mechanisms like the Direct Hire Authority and Intergovernmental Personnel Act (IPA) agreements, ramped up recruitment to meet current capacity needs, and worked with partners to bring in high-quality talent.

But there are also ways to build on DOE’s current approaches. We offer recommendations for expanding the use of flexible hiring mechanisms: through expanding IPA eligibility to organizations vetted by other agencies, holding trainings for program offices through the Office of the Chief Human Capital Officer, and asking Congress to increase funding for human capital resources. Another recommendation encourages DOE to review its use to date of the Clean Energy Corps’ Direct Hire Authority and identify areas for improvement. We also propose ways to build on DOE’s recruitment successes: by partnering with energy sector affinity groups and clean energy membership networks to share opportunities; and by building closer relationships with universities and colleges to engage early career talent.

Some of these findings and recommendations are pulled from previous memos and reports, but many are new recommendations based on our experiences working and interacting with partners within the ecosystem over the past year. The goal of this report is to help federal and non-federal actors in the clean energy ecosystem grow talent and prepare for the challenges in clean energy in the coming decades.

The Moment

The climate crisis is not just a looming threat–it’s already here, affecting the lives of American citizens. The federal government has taken a central role in climate mitigation and adaptation, especially with the recent passage of several pieces of legislation. The bipartisan Infrastructure Investment and Jobs Act (IIJA), the CHIPS and Science Act, and the Inflation Reduction Act (IRA) all provide levers for federal agencies to address the crisis and reduce emissions.

The Department of Energy (DOE) is leading the charge and is the target of much of the funding from the above bills. The legislation provides DOE over $97 billion dollars of funding aimed at commercializing and deploying new clean energy technologies, expanding energy efficiency in homes and businesses, and decreasing emissions in a range of industries.

These are robust and much-needed investments in federal agencies, and the effects will ripple out across the whole economy. The Energy Futures Initiative, in a recent report, estimated that IRA investments will lead to 1.46 million more jobs over the next ten years than there would have been without the bill. Moreover, these jobs will be focused in key industries, like construction, manufacturing, and electric utilities.

But those jobs won’t magically appear–and the IIJA and IRA funding won’t magically be spent. That amount of money would be overwhelming for any large organization, and initiatives and benefits will take time to manifest.

When it passed these two bills, Congress recognized that the Department of Energy–and the federal government more broadly– would need new tools to use these new resources effectively. That is why it included new funding and expanded hiring authorities to allow the agencies to quickly find and hire expert staff. 

Now it is up to DOE to find the subject matter expertise, talent, partnerships, and cross-sector knowledge sharing from the larger clean energy ecosystem it needs to execute on Congress’s incredibly ambitious goals. Perhaps the most critical factor in DOE’s success will be ensuring that the agency has the staff it needs to meet the moment and implement the bold targets established in the recent legislation.

Why Talent?

To implement policy effectively and spend taxpayer dollars efficiently, the federal government needs people. Investing in a robust talent pipeline is important for all agencies, especially given that only about 8% of federal employees are under 30, and at DOE only 4% are. Building this pipeline is critical for the clean energy transition that’s already underway–not only for the federal government, but for the entire ecosystem. In order to meet clean energy deployment estimates across the country, clean energy jobs will need to increase threefold by 2025 and almost sixfold by 2030 relative to 2020 levels. This job growth will require cross-sector investment in workforce training and education, innovation ecosystems, and research and development of new technologies. Private firms, venture capital, and the civil sector can all play a role, but as the country’s largest employer, the government will need to lead the way.

To meet its ambitious policy goals, government agencies need to move beyond stale hiring playbooks and think creatively. Strategies like flexible hiring mechanisms can help the Department of Energy–and all federal agencies–meet urgent needs and begin to build a longer-term talent pipeline. Workforce development, recruitment, and hiring can take years to do right – but mechanisms like tour-of-service models (i.e. temporary or termed positions), direct hire authorities, and excepted service hiring allow agencies to bring on talent quickly, overcome administrative bottlenecks, and access individuals with technical expertise who may not otherwise consider working in the public sector. See the Appendix for more information on specific hiring authorities.

This paper outlines insights, strategies, and opportunities for DOE’s talent needs based on the Federation of American Scientists’ (FAS) one-year pilot partnership with the department. Non-federal actors in the clean energy ecosystem can also benefit from this report–by understanding the different avenues into the federal government, civil society and private organizations can work more effectively with DOE to shepherd in the clean energy revolution. 

Broadly, we hope that our experience working with DOE can serve as a case study for other federal agencies when considering the challenges and opportunities around talent recruitment, onboarding, and retention.

Where does DOE need talent? 

While the IRA and IIJA funded dozens of programs across DOE, there are several offices that received larger amounts of funding and have critical talent needs currently. 

A Pilot Partnership: FAS and DOE Talent Efforts

In January 2022, FAS established a partnership with DOE to support the implementation of a broad range of ambitious priorities to stimulate a clean energy transition. Through a partnership with DOE’s Office of Under Secretary for Science and Innovation (S4), our team discovered unmet talent needs and worked with S4 to develop strategies to address hiring challenges posed by DOE’s rapid growth through the IIJA. 

This included expanding FAS’s Impact Fellowship program to DOE. This program supports fellows who bring scientific and technical expertise to bear in the public domain, including within government. To date, through IPA (Intergovernmental Personnel Act) agreements, FAS has placed five fellows in high-impact positions in DOE, with another cohort of five fellows in the pipeline.

FAS Impact Fellows placed at DOE have proven that this mechanism can have a positive impact on government operations. Current Fellows work in a number of DOE offices, using their expertise to forward work on emerging clean energy technologies, facilitate the transition of energy communities from fossil fuels to clean energy, and ensure that DOE’s work is communicated strategically and widely, among other projects. In a short time, these fellows have had a large impact–they are bringing expertise from outside government to bear in their roles at the agency. 

In addition to placing fellows, FAS has worked to evangelize DOE’s Clean Energy Corps by actively recruiting, holding events, and advertising for specific roles within DOE. To more broadly support hiring and workforce development at the agency, we piloted a series of technical assistance projects in coordination with DOE, including hiring webinars and cross-sector roundtables with leaders in the agency and the larger clean energy ecosystem. 

From this work, FAS has learned more about the challenges and opportunities of talent acquisition–from flexible hiring mechanisms to recruitment–and has developed several recommendations for both Congress and DOE to strengthen the federal clean energy workforce.

Flexible Hiring Mechanisms

One key lesson from the past year of work is the importance of flexible hiring mechanisms broadly. This includes special authorities like the Direct Hire Authority, but also includes tour-of-service models of employment. A ‘tour-of-service’ position can take many forms, but generally is a termed or temporary position, often full-time and focused on a specific project or set of projects. In times of urgency, like the onset of the COVID-19 pandemic or following the passage of large pieces of new legislation, hiring managers may need high numbers of staff in a short amount of time to implement policy–a challenge often heightened by stringent federal hiring guidelines. 

Traditional federal hiring is frustrating for both sides. For applicants, filling out applications is complicated and jargony and the wait times are long and unpredictable. For offices, resources are scarce, there are seemingly endless legal and administrative hoops to jump through, and the wait times are still long and unpredictable. In general, tour-of-service hiring mechanisms offer a way to hire key staff for specific needs more quickly, while offering many other unique benefits, including, but not limited to, cross-sector knowledge sharing, professional development, recruitment tools, and relationship-building.

These mechanisms can also expand the potential talent pool for a particular position–highly trained technical professionals can prove difficult to recruit on a full-time basis, but temporary positions may be more attractive to them. IPA agreements, for example, can last for 1-2 years and take less time to execute than hiring permanent employees or contractors. More generally, all types of flexible hiring authorities can give agencies quicker ways of hiring highly qualified staff in sometimes niche fields. Flexible hiring mechanisms can also reduce the barrier to entry for professionals not as familiar with federal hiring processes–broadening offices’ reach and increasing the diversity of applicants.

FAS’s work with DOE has demonstrated these benefits. With FAS and other organizations, DOE has successfully used IPAs to staff high-impact positions. More recommendations on the use of IPAs specifically can be found in a later section. Through its Impact Fellowship, FAS has yielded successful case studies of how cross-sector talent can support impactful policy implementation in the department.

DOE should expand awareness and use of flexible hiring mechanisms.

DOE should work to expand the awareness and use of flexible hiring mechanisms in order to bring in more highly skilled employees with cross-sector knowledge and experience. This could be achieved in a number of ways. The Office of the Chief Human Capital Officer (CHCO) should continue to educate hiring managers across DOE about the hiring authorities available: it could offer additional trainings on different mechanisms and work with OPM to identify opportunities for new authorities. There are existing communities of practice for recruitment and other talent topics at DOE, and hiring officials can use these to discuss best practices and challenges around using hiring authorities effectively. 

DOE can also look to other agencies for ideas on innovative hiring. Agencies like the Department of Homeland Security, Department of Defense, and Department of Veterans Affairs run different forms of industry exchange programs that allow private sector experts to bring their skills and knowledge into government and vice versa. Another example is the Joint Statistical Research Program hosted by the Internal Revenue Service’s Statistics of Income Office. This program brings in tax policy experts on term appointments using the IPA mechanism, similar to the National Science Foundation’s Rotator program. Once developed, these programs can allow agencies to benefit from talent and expertise from a larger pool and access specialized skill sets while protecting against conflicts of interest.

DOE should partner with external organizations to champion tour-of-service programs.

There are other ways to expand flexible hiring mechanism use as well. Program offices and the Office of the CHCO can partner with outside organizations like FAS to champion tour-of-service programs in the wider clean energy community, in order to educate non-federal eligible parties on how they can get involved. Federal hiring processes can seem opaque to outside organizations, with additional paperwork, conflict of interest concerns, long timelines, and potential clearance hurdles. If outside organizations better understand the different ways they can partner with agencies and the benefits of doing so, agencies could increase enthusiasm for programs like tour-of-service hiring. At NSF, for example, the Rotator program is well known in the communities it operates within–both academia and government understand the benefits of participating. 

Although these mechanisms and authorities have significant medium- and long-term benefits for agencies, they require upfront administrative effort and cost. Even if staff are aware of potential tools they can use, understanding the logistics, funding mechanisms, conflict of interest regulations, and recruitment and placement of staff hired through these mechanisms often requires investment of time and money from the agency side and can overwhelm already stressed hiring managers. 

Congress should increase funding for DOE’s Office of the Chief Human Capital Officer.

In order to support DOE in using flexible hiring mechanisms more effectively, Congress should direct more funding to the agency’s Office of the Chief Human Capital Officer. In FY23, the office has not only continued to execute on mandates from the IIJA and the IRA, but has introduced new programs aimed at modernizing the office and improving hiring. These programs and tools, including standing up talent teams to better assess competency gaps across program offices and developing HR IT platforms to make more effective data-driven personnel decisions, are vital to the growth of the office and in turn to DOE’s ability to follow through on key executive priorities. Congress should increase funding to DOE’s Human Capital office by $10M in FY24 over FY23 levels. As IRA and IIJA priorities continue to be rolled out, the Human Capital office will remain pivotal to the agency’s success. 

Congress should increase DOE’s baseline program direction funds. 

A related recommendation is for Congress to further support hiring at DOE by increasing the base budget of program direction funds across agency offices. Restrictions on this funding limit the agency’s ability to hire and the number of employees it can bring on. When offices are limited in the number of staff they can hire, they have tended to bring on more senior employees. This helps achieve the agency’s mission but limits the overall growth of the agency – without early career talent, offices are unable to train a new generation of diverse clean energy leaders. Increasing program direction budgets through the annual appropriations process will give DOE more flexibility in who it hires, building a stronger workforce across the agency.

Clean Energy Corps and the Direct Hire Authority

Expanded Direct Hire Authority has been a boon for DOE, despite some implementation challenges. Congress included DHA in the IIJA, in order to help federal agencies quickly add staff to implement the legislation. In response, DOE set an initial goal of hiring over 1,000 new employees in its Clean Energy Corps, which encompasses all DOE staff who work on clean energy and climate. DOE also requested an additional authority for supporting implementation of the IRA through OPM. To date, the program has received almost 100,000 applications and has hired nearly 700 employees. We have heard positive feedback from offices across the agency about how the DHA has helped hire qualified staff more quickly than through traditional hiring. It has allowed DOE offices to take advantage of the momentum in the clean energy movement right now and made it easier for applicants to show their interest and move through the hiring process. To date, among federal agencies with IIJA/IRA direct hire authorities, DOE has been an exemplar in implementation.

The Direct Hire Authority has been successful so far in part because of its advertisement; there was public excitement about the climate impact of the IIJA and IRA, and DOE took advantage of the momentum and shared information about the Clean Energy Corps widely, including through partnerships with non-governmental entities. For example, FAS and Clean Energy for America held hiring webinars, and other organizations and individuals have continued to share the announcement. 

Congress should extend the Direct Hire Authority.

Congress should consider extending the authority past its current timeline. The agency’s direct hire authority under the IIJA expires in 2027, while its authority requested through OPM expires at the end of 2025 – and is capped at only 300 positions. With DOE taking on more demonstration and deployment activities as well as increased community and stakeholder engagement following the passage of the IIJA and IRA, the agency needs capacity–and the Direct Hire Authority can help it get the specialized resources it needs. Extending the authority beyond 2025 and requesting that OPM increase the cap on positions are the more urgent steps, but the authority should continue past 2027 as well, to ensure that DOE can continue to hire effectively. 

Congress should expand the breadth of DHA. 

Additionally, Congress should expand the authority to other offices across DOE. It is currently limited to certain roles and offices, but there are additional opportunities within the department to support the clean energy transition that don’t have access to DHA. This is especially important given that offices with the direct hire authority can pull employees from offices without–leaving the latter to backfill positions on a much longer timeline using conventional merit hiring practices. Expanding the authority would support the development of the agency as a whole. 

Beyond just removing the authority’s cap on roles supporting the IRA, expansions or extensions of the authority should increase the number of authorized positions to account for a baseline attrition rate. The authority limits the number of positions that can be filled – once that number of staff is hired, the authority can no longer be used for that office or agency. As with any workplace, federal agencies experience a normal amount of attrition, but the stakes are higher when direct hire employees leave the organization because of the authority’s constraints. Any authorization of the DHA in the future should consider how attrition will impact actual hires over the authorization period. 

In order to bolster support for expanding the authority, DOE can take steps to share the program’s successes. The DHA has been a huge win for federal clean energy hiring. Publicizing the programs, offices, funding opportunities, and hires that would not exist but for the support of the Clean Energy Corps would make the connection between flexible hiring and government effectiveness clear, and would generate excitement about DOE’s activities among the general public.

DOE should highlight success stories of the Clean Energy Corps.

As part of a larger external communications strategy, DOE should highlight success stories of current employees hired through the Clean Energy Corps portal. These spotlights could focus on projects, partnerships, or funding opportunities that employees contributed to, and put a face to the achievements of the Clean Energy Corps thus far. Not only would this encourage future high-quality applicants and ensure continued interest in the program, but it would also demonstrate to Congress and the general public that the authority is successful, increasing support for more flexible hiring authorities and clean energy funding. 

There are also some opportunities to improve DOE’s use of the authority and make it even more effective. With so many applications, hiring managers and program offices are often overwhelmed by sheer volume – leading to long wait times for applicants. Some offices at DOE have tried to address this bottleneck by building informal processes to screen and refer candidates–using their internal system to identify qualified applicants and sharing those applications with other program offices. But there may be additional ways to reduce the backlog of applications. 

DOE should conduct a review of DHA’s use thus far.

DOE should conduct an assessment of the use of the Direct Hire Authority in relevant offices. The program has been running for over a year, and there is enough data to review and better understand strengths and areas of growth of the authority. The review could be an opportunity to highlight and build on successful strategies like the informal process above–with program offices who currently use those strategies helping to scale them up. It could also assess attrition rates and compare them to agency-wide and non-DHA attrition rates to understand opportunities to improve or share out successes around retention. Finally, the review could also act as a resource for Congress to help justify the authority’s renewal in the future. 

Use of IPA Agreements

One of the most well-known tour-of-service programs is the Intergovernmental Personnel Act. When used effectively, it can allow agencies to share cross-sector knowledge, increase their capacity, and achieve their missions more fully. As noted previously, DOE has made use of IPAs in some capacities, but barriers to expanding the program still exist. First, DOE maintains a list of ‘IPA-certified’ organizations, including non-profits that must first certify their eligibility to participate in IPA agreements. According to OPM, if an organization has already been certified by an agency, this certification is permanent and may apply throughout the federal government. This is an effective practice that theoretically allows DOE to bring on IPAs from those organizations more quickly – without the additional administrative work necessary to research and vet each organization multiple times. 

However, when FAS engaged DOE to expand the Impact Fellowship to the agency, FAS was asked to re-certify its eligibility separately with DOE, despite already having conducted IPA agreements with other agencies. As of May 2021, DOE had approved only 22 organizations for IPA eligibility. With the clean energy ecosystem booming, this leaves a large pool of potential talent untapped. 

DOE should amend its IPA directive.

One solution to this issue would be for DOE to amend its IPA directive, which was last updated in 2000, to automatically approve IPA eligibility for organizations that have been certified by other agencies. Agencies such as NSF, USDA, GSA, and others also maintain lists of IPA-eligible organizations, providing DOE a readily available pool of potential IPA talent without certifying those organizations independently. This solution could expand the list of certified organizations and reduce DOE’s internal administrative burden. Organizations that know they will go through an initial vetting process once rather than multiple times could redouble efforts to build that partnership with DOE. 

DOE should work with outside organizations to share strategies. 

The previous recommendation on educating eligible non-federal organizations about tour-of-service mechanisms applies here as well. Organizations like FAS with a proven track record of setting up IPA agreements with agencies can share best practices and success stories and champion the program in the broader non-profit ecosystem. At the same time, agencies can develop externally facing IPA resources, sharing training and 'how-to' guides with nonprofits and academic institutions that could be good fits for the program but aren't aware of their eligibility or the requirements to participate.

Recruitment

Recruitment is another area where we learned lessons from our work alongside DOE. FAS and Clean Energy for America held recruitment information sessions for people interested in working for DOE, spotlighting offices that needed more staff. One strategy that helped target specific skill gaps within the agency was developing 'personas' based on certain skill sets, like finance and manufacturing. These personas were short descriptions of a specific skill set for an industry, consisting of several highlighted experiences, skills, or certifications that are key to roles in that industry. This enabled our team to develop a more tailored recruitment event, conduct targeted outreach, and execute the event with a more invested group of attendees. 

DOE should identify specific skills gaps to target recruitment efforts.

DOE hiring managers and program offices should identify skills gaps in their offices and recruit for those gaps using personas. Personas can help managers more intentionally target outreach and recruit in certain industries by allowing them to advertise to associations, academic programs, or on job boards that include potential applicants with those skills and experiences. This practice could bolster recruitment and reduce the time to hire by attracting more qualified candidates up front. It also helps offices take a more proactive approach to hiring–a difficult ask for hiring managers, who are often overworked. 

DOE should continue to utilize remote flexibilities.

Another successful recruitment strategy highlighted in our work with DOE has been the use of remote-flexible positions. DOE should continue to widely utilize remote flexibilities in job opportunities and recruitment in order to attract talent from all 50 states, not just those where DOE has a physical presence. As the desire for remote employment remains high across the public and private sectors, fully utilizing remote flexibilities can help federal employers stay competitive with the private sector and attract high-quality talent.

Another area where DOE could capitalize further is partnerships with non-federal organizations. Outside organizations can leverage their networks–helping expand the talent pool, increase diversity, and support candidates through the federal hiring process, competitive or otherwise. Networks like New York Climate Tech have been tirelessly organizing the climate tech community in New York City, and even plan to start expanding to other cities soon. This type of organizing is invigorating for climate professionals; it can energize existing advocates and evangelize to new ones. Helping connect those networks to government opportunities–whether prize competitions, job opportunities, or grants–can strengthen cross-sector relationships and the clean energy workforce overall. 

Such efforts would also support federal recruitment strategies, which are often not as visible as they could be given the sheer amount of work required for proactive outreach. Earth Partners, a climate tech venture capital firm, partnered with the Office of Clean Energy Deployment to hire for high-impact positions by leveraging its own network. 

DOE should use partner organizations to broadcast hiring needs. 

DOE Office of the Human Capital Officer, hiring managers, or program offices should consider how they can partner with other organizations to broadcast hiring needs. These can be larger clean energy associations and member organizations like Clean Energy for America, New York Climate Tech, FAS, and Climate Power, or they could be energy sector affinity groups like Women In Renewable Industries and Sustainable Energy (WRISE) and the American Association of Blacks in Energy (AABE). Coordinated social media campaigns, partnered recruitment events, or even sending out open positions in those organizations’ regular newsletters could help broaden DOE’s recruitment reach. Because of the momentum in the clean energy community, non-federal organizations have built out substantial recruitment infrastructure for potential applicants and can help publicize positions. 

DOE should build a presence at campus hiring events.

Similarly, DOE hiring managers should build and maintain a presence at higher education hiring events. There are a number of ways to bring more early career staff into government, but DOE can focus on recruiting more intentionally from universities and community colleges. The agency should cultivate relationships with university networks–especially those of Historically Black Colleges and Universities (HBCUs) and Minority Serving Institutions (MSIs)–and develop recruitment messages that appeal to younger populations. DOE could also focus on universities with strong clean energy curricula–in the form of recognized courses and programs or student associations. 

DOE should expand partnerships with external recruitment firms.

Some positions, of course, are harder to recruit for. In addition to mid-level employees, government also needs strong senior leaders–and candidates for these positions don't often come in droves to recruitment events. Some DOE offices have found success using private recruitment firms to identify candidates from the private sector and invite them to apply for Senior Executive Service (SES) level positions in government. In addition to sourcing executive talent, this practice helps career private-sector applicants navigate the government hiring process. 

DOE should learn from current strategies and continue to partner with private recruitment firms to identify potential SES candidates and invite them to apply. Recruitment firms can help simplify position description language and guide candidates through the process. DOE currently uses this practice successfully for certain skill set gaps, but should seek to expand it to more specialized recruitment needs. 

DOE should develop its own senior talent recruitment strategy. 

Longer term, DOE should develop its own senior talent recruitment strategy. This strategy can be developed using lessons learned from private recruitment firms or from meeting with other agencies to understand best practices in the space. SES positions require different candidate management strategies, and if DOE aims to attract more non-federal talent, developing in-house expertise is important.

DOE already has the infrastructure for strategies like this. Offices involved in IIJA implementation are building office-specific recruitment strategies. These strategies consider diversity, equity, inclusion and accessibility, as well as skill sets and high-need positions within offices. Incorporating senior talent needs into these strategies could help uncover best practices for attracting quality leaders, and expanding these recruitment strategies beyond just IIJA-oriented offices could support workforce development across the agency more broadly. 

The Path Forward

DOE has made significant progress on the road to implementation, hiring hundreds of new employees to support the clean energy transition and carry out programs from IIJA, IRA, and the CHIPS and Science Act. The agency still faces challenges, but it also has opportunities to grow its workforce, improve its hiring processes, and bring even more high-quality, skilled talent into the federal government. We hope DOE and Congress will consider these recommendations as they continue to work toward a stronger clean energy ecosystem in the years to come.


Appendix: Overview of hiring authorities

IPAs 

The Intergovernmental Personnel Act (IPA) Mobility Program allows temporary assignment of personnel between the federal government and state/local/tribal governments, colleges and universities, FFRDCs, and approved non-profit organizations. According to a 2022 Government Accountability Office report, IPAs are a high-impact mechanism for bringing talent into the federal government quickly, yet they are often underutilized. As detailed in the report, agencies (including DOE) can use the IPA Mobility Program to address skills gaps in highly technical or complex mission areas and provide talent with flexibility and opportunities for temporary commitments; when implemented correctly, the program can also be administratively light-touch and cost-effective. The report noted that agencies have struggled to use the program to its full effectiveness, and that there is an opportunity for agencies to increase their use of the program if they can tackle these challenges. 

Direct Hire

The Direct Hire Authority allows agencies to directly hire candidates for critical needs or when a severe shortage of candidates exists. This authority circumvents competitive rating and ranking and veterans' preference, allowing agencies to significantly reduce the time involved to hire candidates. It also presents an easier application process for candidates. DHA must be specially granted by OPM unless a governmentwide authority already exists–as it does for Information Technology Management, STEM, and Cybersecurity. For example, DOE was granted a DHA for positions related to implementing the IIJA and IRA.

Excepted Service

EJ and EK

EJ and EK hiring authorities are a form of "excepted service" unique to DOE. According to DOE, the EJ authority is used to "enhance the Department's recruitment and retention of highly qualified scientific, engineering, and professional and administrative personnel. Appointments and corresponding compensation determined under this authority can be made without regard to the civil service laws." The EK authority is similar, but specific to personnel whose duties relate to safety at the Department's defense nuclear facilities. The EK authority is time-limited by law and must be renewed.

Schedule A(r)

Also known as the "fellowship authority," Schedule A(r) facilitates term appointments of one to four years. This authority is especially well suited to fellowship programs and other term-limited appointments.

Experts and Consultants

According to the department's HR resources, DOE uses Experts and Consultants to "provide professional or technical expertise that does not exist or is not readily available within DOE or to perform services that are not of a continuing nature and/or could not be performed by DOE employees in competitive or other permanent full-time positions." Typically, Experts and Consultants can be used for intermittent, part-time, or term-limited full-time roles.

Understanding and using these flexible hiring authorities can help DOE expand its network of talent and hire the people it needs for this current moment. More details on flexible hiring mechanisms can be found here.