What exactly does “all lawful use” of AI mean? No one knows.
As a result of this weekend’s highly publicized Department of Defense (DoD)-Anthropic dispute, we’re hearing a lot about the “lawful use” of frontier AI systems in classified environments.
“Lawful” is a legal floor that will look increasingly shaky as AI capabilities advance. It doesn’t answer whether we have adequate civil liberties guardrails or technical safety standards in place. Company “red lines” only matter if they are backed by enforceable technical and contractual safeguards. Otherwise, they function primarily as signaling. From use to testing to deployment, the scaffolding for responsible integration of AI into high-risk use cases is just not there.
Privacy is a major concern for experts and the public alike. When increasingly capable models are paired with large-scale government data holdings—including commercially purchased data on Americans—the result could materially change the practical boundaries of surveillance, even if each underlying dataset was obtained legally. AI systems expand the possibility of large-scale inference, enabling automated link analysis, behavioral pattern detection, and probabilistic assessments about individuals’ networks or intent across disparate datasets.
Next, there’s the reliability problem. Frontier systems remain probabilistic and brittle, particularly in adversarial settings. The companies building this technology do not yet have a mature testing, evaluation, validation, and verification (TEVV) ecosystem for high-stakes national security uses. At the same time, DoD strategy documents are calling for a “wartime” posture toward eliminating blockers in testing and deployment. That tension should concern us all.
Then, there are the numerous cybersecurity risks. Agentic systems that access sensitive data, ingest untrusted inputs, and can take external actions create new attack surfaces that adversaries will probe and exploit. In classified environments, these risks might be mitigated, but they don’t disappear. Subtle manipulation or model failure inside a military workflow can propagate quickly.
Capability is advancing quickly, but policymakers shouldn’t adopt AI faster than we can test and govern it.
A National AI Laboratory to Support the Administration’s AI Agenda at the Department of Commerce
The United States faces intensifying international competition in Artificial Intelligence (AI). The Trump administration’s AI Action Plan places the Department of Commerce at the center of its agenda to strengthen international standards-setting, protect intellectual property, enforce export controls, and ensure the reliability of advanced AI systems. Yet no existing federal institution combines the flexibility, scale, and technical depth needed to fully support these functions.
To deliver on this agenda, Commerce should expand its AI capability by sponsoring a new Federally Funded Research and Development Center (FFRDC), the National AI Laboratory (NAIL). NAIL would:
- Advance the science of AI,
- Ensure that the United States leads in international AI standards and promotes the trusted adoption of U.S. AI products abroad,
- Identify and mitigate AI security risks,
- Protect U.S. technologies through effective export controls.
While the National Institute of Standards and Technology’s (NIST’s) Center for AI Standards and Innovation (CAISI) within Commerce provides a base of expertise to advance these goals, a dedicated FFRDC offers Commerce the scale, flexibility, and talent recruitment necessary to deliver on this broader commercial and strategic agenda. Together with complementary efforts to strengthen CAISI and expand public-private partnerships, NAIL would serve as the backbone of a more capable AI ecosystem within Commerce. By aligning with Commerce’s broader mission, NAIL will give the Administration a powerful tool to advance exports, protect American leadership, and counter foreign competition.
Challenge
AI’s breakneck pace is having a real-world impact. The Trump administration has made clear that widespread adoption of AI, backed by strong export promotion and international standards leadership, is essential for maintaining America’s position as the world’s technology leader. The Department of Commerce sits at the center of this agenda: advancing AI trade, developing international standards, advancing the science of AI, promoting exports, and ensuring effective export controls on critical technology.
Even as companies and countries race to adopt AI, the U.S. lacks the capacity to fully characterize the behavior and risks of AI systems and ensure leadership across the AI stack. This gap has direct consequences for Commerce’s core missions. First, advances in the science of AI are necessary to ensure that AI systems are sufficiently robust and well understood to be widely adopted at home and abroad. Second, without trusted methods for evaluating AI, the U.S. cannot credibly lead the development of international standards, an area where allies are seeking American leadership and where adversaries are pushing their own approaches. Third, this deep understanding of AI models is needed to identify and mitigate security concerns present in both foreign and domestic models. Fourth, deep technical expertise within the federal government is required to properly create and enforce export controls, ensuring that sensitive AI technologies and underlying hardware are not misused abroad. A deep bench of subject matter experts in AI models and infrastructure is increasingly critical to these efforts.
As AI systems become more capable, the lack of predictable and understandable behavior risks further eroding public trust in AI and inhibiting beneficial AI adoption. Jailbreaking attacks, in which carefully crafted prompts circumvent Large Language Model (LLM) guardrails, can produce unexpected model behavior. For example, jailbreaking can prime LLMs for use in cyberattacks that cause significant economic harm, or cause them to leak personal information or produce toxic content, exposing companies that use these models to legal liability and reputational harm. As companies deploy custom models built on top of LLMs, they need to know that medical assistants will not produce harmful recommendations and that agentic AI systems will not misspend personal funds. Addressing these concerns is an extremely challenging technical problem that requires more effective and consistent methods of evaluating and predicting model performance.
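To make the evaluation challenge concrete, the minimal sketch below scores a model’s refusal rate on a suite of adversarial prompts. It is illustrative only: `query_model` and the prompt file stand in for whatever model interface and red-team corpus an evaluator actually uses, and real guardrail testing requires far more sophisticated scoring than keyword matching.

```python
import json

def query_model(prompt: str) -> str:
    """Placeholder for a call to the model under test (API or local inference)."""
    raise NotImplementedError

# Crude refusal markers; a production evaluator would use a trained classifier.
REFUSAL_MARKERS = ("i can't help", "i cannot assist", "i won't provide")

def refusal_rate(prompts: list[str]) -> float:
    """Fraction of adversarial prompts the model declines to answer.

    Re-running the same fixed suite after every model update yields a
    comparable, repeatable measurement of guardrail robustness.
    """
    refused = sum(
        any(marker in query_model(p).lower() for marker in REFUSAL_MARKERS)
        for p in prompts
    )
    return refused / len(prompts)

if __name__ == "__main__":
    # jailbreak_prompts.json is a hypothetical red-team corpus of adversarial prompts.
    with open("jailbreak_prompts.json") as f:
        suite = json.load(f)
    print(f"Refusal rate: {refusal_rate(suite):.2%}")
```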
The ability to effectively characterize these models is central to the Trump administration’s AI Action Plan, which highlights widespread adoption of AI as a major policy priority, while also recognizing that the government has a key role to play in managing emerging national security threats. The AI Action Plan gives Commerce a central role in addressing these concerns; nearly two fifths of the plan’s recommendations involve Commerce. Commerce’s responsibilities include:
- Creating methods of AI model evaluation and developing international standards.
- Identifying security risks.
- Promoting research on AI interpretability, control and robustness.
- Recruiting leading AI researchers.
- Promoting exports of AI technology.
For a full list of AI Action Plan recommendations involving Commerce, see Appendix A.
While Commerce has an impressive track record in AI, including through its work at the National Institute of Standards and Technology and CAISI, it will face immense institutional challenges in delivering on the ambitions of the AI Action Plan, which require broad and deep expertise. Like other U.S. government entities, Commerce operates under federal hiring rules that make it difficult to quickly recruit and retain top technical talent. The government also struggles to match AI industry pay scales. For example, fresh PhDs joining AI companies frequently receive total compensation that is twice the cap set for the overwhelming majority of government workers, and senior researchers earn five times this cap or more. In some cases, top researchers may also hold equity in private companies, further complicating their employment by the government. Without a new institutional mechanism designed to attract and deploy world-class expertise, Commerce will struggle to execute on the ambitious goals of the AI Action Plan.
Opportunity
To deliver on the scope of the AI Action Plan, the Department of Commerce needs a dedicated institution with the resources, flexibility, and talent pipeline that existing structures cannot provide. A Federally Funded Research and Development Center (FFRDC) offers this capacity. Unlike traditional government offices, an FFRDC can recruit competitively from the same pools as industry, while remaining mission-driven and independent of commercial interests.
At its core, a new FFRDC, the National AI Laboratory (NAIL), would provide the technical expertise Commerce needs to carry out its central responsibilities. Specifically, NAIL would:
- Advance the science of AI, including the measurement and evaluation of AI models.
- Develop the methods and benchmarks that underpin international standards and ensure U.S. companies remain the trusted source for global AI solutions.
- Identify and mitigate AI security risks, ensuring U.S. technologies are not exploited by adversaries.
- Provide the technical expertise needed to support export promotion, export controls, and international trade negotiations.
NAIL would equip Commerce with the authoritative science and engineering base it needs to advance America’s commercial and strategic AI leadership.
FFRDCs are unique in combining the flexibility of private organizations with the mission focus of federal agencies. Their long-term partnership with a sponsoring agency ensures alignment with government priorities, while their independent status allows them to provide objective analysis and rapid technical response. This hybrid structure is particularly well-suited to the fast-moving and security-relevant domain of frontier AI. More background information on FFRDCs can be found in Appendix C.
The current talent landscape underscores the value of the FFRDC model. While industry salaries are high, many senior researchers are constrained by proprietary agendas and limited opportunities to pursue foundational, publishable work. To obtain greater freedom in their research, many top industry researchers have been seeking positions at universities, despite drastically lower salaries. An FFRDC focused on frontier model understanding, interpretability, and security offers a rare combination: freedom to pursue scientifically important problems, the ability to publish, and a mission anchored in national competitiveness and public service. This environment can attract researchers who would not join the civil service but are motivated by high-impact scientific and policy goals.
FFRDCs have repeatedly demonstrated their ability to deliver large-scale technical capability for federal sponsors. For example, NASA’s Jet Propulsion Laboratory has successfully built and landed multiple rovers on Mars, among many other achievements. The Departments of Energy and Defense have led much of the nation’s efforts in science and technology, assisted by more than two dozen FFRDCs. Their track record shows that FFRDCs are uniquely suited to problems where neither academia nor industry is structured to meet federal needs—exactly the situation Commerce now faces in AI. Commerce currently supports only one FFRDC, which is the fourth smallest of all current FFRDCs. As advanced AI technology grows even more central to Commerce’s mission, it makes sense to add to this capacity.
Plan of Action
Recommendation 1. Establish an FFRDC to support the AI Mission at Commerce.
Commerce should establish a new FFRDC within two years so that it can begin critical research and timely evaluations. Establishing a new FFRDC requires the sponsoring organization (Commerce in this case) to satisfy the criteria laid out in the Federal Acquisition Regulations (48 CFR 35.017-2) for creating a new FFRDC. Key requirements include demonstrating that existing sources cannot meet the need and that Commerce has sufficient expertise to evaluate the FFRDC’s work. The new FFRDC will require consistent government support through appropriations, and Commerce must identify an appropriate organization to manage it. The rapid pace of AI development makes it an urgent priority to move forward as soon as possible. Recent FFRDCs have taken about 18 months to establish after their initial announcement, a significant length of time in the AI field. Further details related to establishing an FFRDC can be found in Appendix D.
Recommendation 2. NAIL should focus on topics that will advance the Administration’s AI Agenda, including recommendations given to Commerce in the AI Action Plan.
These topics should include:
- Development of a standardized federal science of measurement that enables evaluation and comparison of models. These evaluations should be predictive of their performance on real-world tasks. NIST has already laid out how measurement science can advance AI innovation in this report.
- Use of these advances in the science of AI measurement for the development of unified AI standards. This would build greater confidence in models, promoting adoption and U.S. AI exports.
- Development of comprehensive methods to assess the security implications of models. This includes security concerns in foreign models as well as vulnerabilities such as jailbreaks, backdoors, leakage of sensitive data, and susceptibility to data poisoning attacks. Of particular note are attacks that can obtain dangerous information related to topics such as biological weapons. While much of this work can be done without access to classified information, NAIL workers may need security clearances, for example, to determine whether models could leak specific secure data. NAIL should also promote AI security by advancing technical work on AI interpretability, robustness, and control, which was highlighted as a priority in the AI Action Plan.
- Determination of whether AI models or hardware provide capabilities that might warrant export controls.
The proposed FFRDC should pursue activities that range from longer-term fundamental research to rapid response to new developments. Much of the knowledge needed to fulfill Commerce’s mandate lies at the heart of the most significant research questions in AI. This requires deep research, which is also important for attracting top-tier talent. On a shorter time scale, it will be important for the FFRDC to provide regular evaluations of models as they progress, including evaluations of security concerns in foreign models. NAIL can speed up these time-critical security evaluations. It will also need to use these evaluations to help create and update procurement guidelines for federal agencies and to assess the state of international AI competition. Finally, the FFRDC should be a source of expertise that can support Commerce on a wide range of topics, such as export controls and the development of a workforce trained to take full advantage of AI tools.
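One concrete ingredient of this measurement science is reporting evaluation results with uncertainty, so that comparisons between models reflect real differences rather than benchmark noise. The sketch below is a minimal illustration using only the Python standard library; the per-item scores are hypothetical stand-ins for results from an actual task suite.

```python
import random

def bootstrap_ci(per_item_scores: list[int], n_resamples: int = 2000,
                 alpha: float = 0.05) -> tuple[float, float, float]:
    """Mean benchmark score with a percentile bootstrap confidence interval.

    per_item_scores: 1 if the model answered an item correctly, else 0.
    Reporting the interval, not just the point estimate, lets evaluators
    say whether two models genuinely differ on a benchmark.
    """
    n = len(per_item_scores)
    point = sum(per_item_scores) / n
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(per_item_scores) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi

# Hypothetical per-item results for two notional models on a 1,000-item benchmark.
model_a = [1] * 780 + [0] * 220   # 78% correct
model_b = [1] * 800 + [0] * 200   # 80% correct
for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    acc, low, high = bootstrap_ci(scores)
    print(f"{name}: {acc:.3f} (95% CI {low:.3f}-{high:.3f})")
```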
The FFRDC will also need to work closely with industry to develop standards for the evaluation of models, and support efforts to create international standards. For example, it may seek to facilitate an industry consensus on the evaluation of new models for security concerns. NIST is well known for similar efforts in many technical areas. Finally, the FFRDC should provide a capacity for rapid response to significant AI developments, including possible urgent security concerns.
Recommendation 3. Provide a sufficient budget to cover the necessary scale of work.
There are different possible scales at which NAIL might be created. It is important to note that creating industry-scale models from scratch can cost tens or hundreds of millions of dollars. However, the task of evaluating models may be undertaken without this expense by experimenting on models that have already been trained. Much of the published work on model evaluation takes this course. Such evaluations and experiments still require significant computational resources, which can cost millions of dollars a year in compute depending on the size of the effort. The FFRDC’s research might also include experiments in which smaller models are built from scratch at a much lower expense than what is required to train industry-scale models.
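A rough back-of-the-envelope calculation (all figures are illustrative assumptions, not vendor quotes) shows why evaluation-scale compute sits in the millions of dollars rather than the hundreds of millions needed to train frontier models:

```python
# Illustrative compute budget for an evaluation-focused effort; every number
# here is an assumption for sizing purposes, not a price quote.
gpu_hourly_rate = 3.00   # assumed cloud cost per GPU-hour (USD)
gpus_reserved = 256      # assumed cluster size for evaluations and small-model experiments
hours_per_year = 24 * 365

annual_compute_cost = gpu_hourly_rate * gpus_reserved * hours_per_year
print(f"Annual compute at full utilization: ${annual_compute_cost / 1e6:.1f}M")
# -> roughly $6.7M per year under these assumptions, i.e., millions rather than
#    the tens to hundreds of millions needed to train frontier-scale models.
```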
We consider two alternatives as to the size and budget of the proposed FFRDC:
- Testbed for AI Competitiveness and Knowledge (TACK): A smaller, prototype effort involving dozens of researchers and support staff, including staff that will facilitate collaborations with industry and other agencies. Such a small-scale effort will not be able to address the full range of problems that Commerce has been tasked with, but will be able to contribute to important missions and demonstrate the value of such an FFRDC on an accelerated timeline. This might cost a few tens of millions of dollars per year, on the scale of Commerce’s current National Cybersecurity FFRDC (NCF).
- Full NAIL: A larger-scale effort could address the full range of tasks outlined above. At this scale, the FFRDC could also take the lead in shaping international standards. For comparison, the Software Engineering Institute (SEI) operates as an FFRDC with a staff of roughly 700 and an annual budget of about $130 million.
The figure in Appendix B lists all current FFRDCs and their 2023 annual budgets.
The budget of the FFRDC would need to cover several different costs:
- Research staff. This would consist of experienced researchers who would lead fundamental research and oversee shorter term technical work.
- Research support staff. This would include experienced developers, many with expertise in data collection and cleaning, model training, and evaluation.
- Administrative support.
- Policy experts skilled in interfacing with industry and other government agencies.
- Computing staff with experience supporting large-scale computing resources.
- Computing resources, including funds to purchase GPU clusters or to obtain them through cloud services.
- Other expenses such as travel, office space, and miscellaneous overhead.
Recommendation 4. Make NAIL the Backbone of a Broader AI Ecosystem at Commerce.
While an FFRDC offers a unique combination of technical depth and recruiting flexibility, other institutional approaches could also expand Commerce’s AI expertise. One option is to expand the Center for AI Standards and Innovation (CAISI) within NIST, leveraging its standards and measurement mission, though it remains bound by federal hiring and funding rules that slow recruitment and limit pay competitiveness.
A separate proposal envisions a NIST Foundation—a congressionally authorized nonprofit akin to the CDC Foundation or the newly created Foundation for Energy Security and Innovation (FESI)—to mobilize philanthropic and private funding, convene stakeholders, and run fellowships supporting NIST’s mission. Such a foundation could strengthen public-private engagement but would not provide the sustained, large-scale technical capacity needed for Commerce’s AI responsibilities.
Taken together, these models could form a complementary ecosystem: an expanded CAISI to coordinate standards and technical policy within government and provide oversight of the FFRDC; a NIST Foundation to channel flexible funding and external partnerships; and an FFRDC to serve as the enduring research and engineering backbone capable of executing large-scale technical work.
Conclusion
The Trump administration has set ambitious goals for advancing U.S. leadership in artificial intelligence, with the Department of Commerce at the center of this effort. Ensuring America’s continued leadership in AI requires technical expertise that existing institutions cannot provide at scale.
NAIL, a new Federally Funded Research and Development Center (FFRDC), offers Commerce the capacity to:
- Push forward our fundamental understanding of frontier AI models along axes that are central to Commerce’s mission, including measurement and evaluation.
- Build the trusted benchmarks and standards that can become the global default.
- Rapidly respond to new technical and security challenges, ensuring the U.S. stays ahead of competitors.
- Provide authoritative analysis for export promotion and control, ensuring U.S. technologies are widely adopted abroad while protected from adversaries, and strengthening America’s hand in international negotiations and trade forums.
By sponsoring this FFRDC, Commerce can secure the talent, flexibility, and independence needed to deliver on the Administration’s commercial AI agenda. While CAISI provides the technical anchor within NIST, the FFRDC will enable Commerce to act at the necessary scale—ensuring the U.S. leads the world in AI innovation, standards, and exports.
Appendix A. References to the Department of Commerce in America’s AI Action Plan
Appendix B. FFRDC Budgets
Appendix C. Further Background on FFRDCs
FFRDCs in Practice: Successes and Pitfalls
FFRDCs have been supporting US government institutions since World War II. Overviews can be found here and here. In this appendix we briefly describe the functioning of FFRDCs and lessons that can be drawn for the current proposal.
In a paper by the Institute for Defense Analyses (IDA) a panel of experts “expressed their belief that high-quality technical expertise and a trusting relationship between laboratory leaders and their sponsor agencies were important to the success of FFRDC laboratories” and felt that “The most effective customers and sponsors set only ‘the what’ (research objectives to be met) and allow the laboratories to determine ‘the how’ (specific research projects and procedures).” Frequent personnel exchange programs between the FFRDC and its sponsor are also suggested.
This and the experience of successful FFRDCs suggests that the proposed FFRDC be closely linked to relevant ongoing efforts in NIST, especially CAISI, with frequent exchanges of information and even personnel. At the same time, the proposed FFRDC should have the freedom to explore very challenging research questions that lie at the heart of its mission.
As an example of the relationship between agencies and associated FFRDCs, the Jet Propulsion Laboratory supports many of NASA’s priorities, addressing long-term goals such as understanding how life emerged on Earth, along with more immediate goals such as catalyzing economic growth and contributing to national security. Caltech manages operations of JPL. In general, NASA sets strategic goals, and JPL aligns its long-term quests with these goals. NASA may solicit proposals, and JPL may compete to lead or participate in appropriate missions. JPL may also propose missions to NASA. For example, in 2011 the National Academies recommended that NASA begin a mission to return samples from Mars. NASA decided to launch a new Mars rover mission and tasked JPL with building and managing operations of the Perseverance rover to accomplish it.
On a less positive note, after concerns about the Department of Energy’s (DOE) management of FFRDCs, DOE shifted from a “transactional model to a systems-based approach” offering greater oversight, but also leading to concerns about a loss of flexibility and micromanagement. Concerns have also been raised about the level of transparency and the assessment of alternatives when agencies renew FFRDC contracts, as well as about mission creep at existing FFRDCs.
Existing FFRDCs Relevant to AI Work
One of the most important criteria for establishing a new FFRDC is to demonstrate that this will fill a need that cannot be filled by existing entities. Many current FFRDCs are conducting work on AI, but this work does not adequately address the needs of Commerce, especially in light of the requirements of the AI Action Plan. For example, the Software Engineering Institute (SEI) run by CMU has deep expertise in the development of AI systems, along with software development and acquisition. However, their mission is to “execute applied research to drive systemic transition of new capabilities for the DoD.” Its AI work focuses on defense related capabilities, and not on the comprehensive evaluation of frontier models needed by NIST.
NIST does support the National Cybersecurity FFRDC (NCF), operated by MITRE. This unit focuses on security needs, not on general model evaluation (although it will be important to clearly delineate the scopes of a new Commerce FFRDC and the NCF). Other FFRDCs, such as Los Alamos or Lawrence Berkeley, have significant AI efforts aimed at using AI to enhance scientific discovery. Industry AI labs address some of the questions central to the proposed FFRDC, but it is important that the government have access to deep technical expertise that is able to act in the public interest.
Establishing a New FFRDC
A precedent on the establishment of FFRDCs comes from the Department of Homeland Security (DHS). Under Section 305 of the Homeland Security Act of 2002, DHS was authorized to establish one or more FFRDCs to provide independent technical analysis and systems engineering for critical homeland security missions. In April 2004, DHS created its first FFRDC, the Homeland Security Institute. Four years later, on April 3, 2008, it issued a notice of intent to establish a successor organization, the Homeland Security Systems Engineering and Development Institute (HSSEDI), and in 2009 selected the MITRE Corporation to operate it. HSSEDI—along with DHS’s other FFRDC, the Homeland Security Operational Analysis Center—is overseen by the Department’s FFRDC Program Management Office. This case illustrates both a procedural pathway (statutory authorization, public notice, operator selection) and the typical timeline for standing up such an entity: roughly 12–18 months from notice of intent to full operation. Similarly, the National Cybersecurity FFRDC had its first notice of intent filed April 22, 2013, with the final contract to operate the FFRDC awarded to MITRE on September 24, 2014, about 17 months later.
Appendix D. Requirements for Establishing an FFRDC
Establishing a new FFRDC requires the sponsoring organization (Commerce in this case) to satisfy the criteria laid out in the Federal Acquisition Regulations (48 CFR 35.017-2) for creating a new FFRDC.
These include:
- Requirement: Existing alternative sources for satisfying agency requirements cannot effectively meet the special research or development needs.
- Meeting the Requirement: The special research or development need for improved understanding, measurement and reliability of AI models is clearly highlighted by the Trump Administration’s AI Action Plan, and will only increase with more capable AI systems. As detailed in Appendix C, existing FFRDCs do not focus on problems central to Commerce’s mission, including the promotion of AI exports through model understanding and evaluation, model measurement to promote international standards, and identifying security issues central to export controls. Some work on this is done in industry and universities, but as noted, this is not comprehensive or sufficient to address Commerce’s mandate and the goals of the AI Action Plan.
- Requirement: There is sufficient Government expertise available to adequately and objectively evaluate the work to be performed by the FFRDC.
- Meeting the Requirement: CAISI would serve as a source of expertise within the government that can evaluate the work performed by the FFRDC.
- Requirement: A reasonable continuity in the level of support to the FFRDC is maintained, consistent with the agency’s need for the FFRDC and the terms of the sponsoring agreement.
- Meeting the Requirement: Satisfying this requirement may require ongoing support from Congressional appropriations committees, depending on the level of support needed.
- Requirement: The FFRDC is operated, managed, or administered by an autonomous organization or as an identifiably separate operating unit of a parent organization, and is required to operate in the public interest, free from organizational conflict of interest, and to disclose its affairs (as an FFRDC) to the primary sponsor.
- Meeting the Requirement: The DOC and NIST must identify the appropriate contractor to run the FFRDC. There are many non-profits and universities with relevant expertise, including non-profits devoted to AI measurement.
The establishment of an FFRDC must follow the notification process laid out in 48 CFR 5.205(b). The sponsoring agency must transmit at least three notices over a 90-day period to the GPE (Governmentwide point of entry) and the Federal Register, indicating the agency’s intention to sponsor an FFRDC, and its scope and nature, requesting comments. This plan must be reviewed by the Office of Federal Procurement Policy (OFPP) within the White House Office of Management and Budget (OMB).
A sponsoring agreement (described in 48 CFR 35.017-1) must be generated by Commerce for the new FFRDC. This agreement is required by regulations (48 CFR 35.017-1(e)) to last for no more than five years, but may be renewed. It outlines conditions for awarding contracts and methods of ensuring independence and integrity of the FFRDC. FFRDCs initiate work at the request of federal entities, which would then be approved by appropriate units within DOC. The proposed FFRDC should align its mission closely with Commerce and NIST, obtaining contracts from these sponsoring agencies that will determine its priorities. The FFRDC would hire top tier researchers who can both execute this research and provide bottom-up identification of important new research topics.
On the Precipice: Artificial Intelligence and the Climb to Modernize Nuclear Command, Control, and Communications
The United States’ nuclear command, control, and communications (NC3) system remains a foundational pillar of national security, ensuring credible nuclear deterrence under the most extreme conditions. Yet as the United States embarks on long-overdue NC3 modernization, this effort has received less scholarly and policy attention than the modernization of nuclear delivery systems. This paper addresses that gap by providing a critical assessment of the U.S. NC3 enterprise and its evolving role in a rapidly transforming strategic environment.
Geopolitically, U.S. NC3 modernization must now contend with issues including China’s rise as a nuclear near peer, Russia’s deployment of increasingly threatening hypersonic and counterspace capabilities, and the erosion of norms restraining limited nuclear use.
Technologically, the shift from legacy analog to digital architectures introduces both great opportunities for enhanced speed and resilience and unprecedented vulnerabilities across cyber, space, and electronic domains.
Bureaucratically, modernization efforts face challenges from fragmented acquisition responsibilities and the need to align with broader initiatives such as Combined Joint All-Domain Command and Control (CJADC2) and the deployment of hybrid space architectures.
This paper argues that successful NC3 modernization must do more than update hardware and software: it must integrate emerging technologies, particularly artificial intelligence (AI), in ways that enhance resilience, ensure meaningful human control, and preserve strategic stability. The study evaluates the key systems, organizational challenges, and operational dynamics shaping U.S. NC3 and offers policy recommendations to strengthen deterrence credibility in an era of accelerating geopolitical and technological change.
Read the complete publication here.
This publication was made possible by a grant from the Carnegie Corporation of New York. The statements made and views expressed are solely the responsibility of the author.
AI Implementation is Essential Education Infrastructure
State education agencies (SEAs) are poised to deploy federal funding for artificial intelligence tools in K–12 schools. Yet, the nation risks repeating familiar implementation failures that have limited educational technology for more than a decade. The July 2025 Dear Colleague Letter from the U.S. Department of Education (ED) establishes a clear foundation for responsible artificial intelligence (AI) use, and the next step is ensuring these investments translate into measurable learning gains. The challenge is not defining innovation—it is implementing it effectively. To strengthen federal–state alignment, upcoming AI initiatives should include three practical measures: readiness assessments before fund distribution, outcomes-based contracting tied to student progress, and tiered implementation support reflecting district capacity. Embedding these standards within federal guidance—while allowing states bounded flexibility to adapt—will protect taxpayer investments, support educator success, and ensure AI tools deliver meaningful, scalable impact for all students.
Challenge and Opportunity
For more than a decade, education technology investments have failed to deliver meaningful results—not because of technological limitations, but because of poor implementation. Despite billions of dollars in federal and local spending on devices, software, and networks, student outcomes have shown only minimal improvement. In 2020 alone, K–12 districts spent over $35 billion on hardware, software, curriculum resources, and connectivity—a 25 percent increase from 2019, driven largely by pandemic-related remote learning needs. While these emergency investments were critical to maintaining access, they also set the stage for continued growth in educational technology spending in subsequent years.
Districts that invest in professional development, technical assistance, and thoughtful integration planning consistently see stronger results, while those that approach technology as a one-time purchase do not. As the University of Washington notes, “strategic implementation can often be the difference between programs that fail and programs that create sustainable change.” Yet despite billions spent on educational technology over the past decade, student outcomes have remained largely unchanged—a reflection of systems investing in tools without building the capacity to understand their value, integrate them effectively, and use them to enhance learning. The result is telling: an estimated 65 percent of education software licenses go unused, and as Sarah Johnson pointed out in an EdWeek article, “edtech products are used by 5% of students at the dosage required to get an impact”.
Evaluation practices compound the problem. Too often, federal agencies measure adoption rates instead of student learning, leaving educators confused and taxpayers with little evidence of impact. As the CEO of the EdTech Evidence Exchange put it, poorly implemented programs “waste teacher time and energy and rob students of learning opportunities.” By tracking usage without outcomes, we perpetuate cycles of ineffective adoption, where the same mistakes resurface with each new wave of innovation.
Implementation Capacity is Foundational
A clear solution entails making implementation capacity the foundation of federal AI education funding initiatives. Other countries show the power of this approach. Singapore, Estonia, and Finland all require systematic teacher preparation, infrastructure equity, and outcome tracking before deploying new technologies, recognizing, as a Swedish edtech implementation study found, that access is necessary but not sufficient to achieve sustained use. These nations treat implementation preparation as essential infrastructure, not an optional add-on, and as a result, they achieve far better outcomes than market-driven, fragmented adoption models.
The United States can do the same. With only half of states currently offering AI literacy guidance, federal leadership can set guardrails while leaving states free to tailor solutions locally. Implementation-first policies would allow federal agencies to automate much of program evaluation by linking implementation data with existing student outcome measures, reducing administrative burden and ensuring taxpayer investments translate into sustained learning improvements.
The benefits would be transformational:
- Educational opportunity. Strong implementation support can help close digital skill gaps and reduce achievement disparities. Rural districts could gain greater access to technical assistance networks, students with disabilities could benefit from AI tools designed with accessibility at their core, and all students could build the AI literacy necessary to participate in civic and economic life. Recent research suggests that strategic implementation of AI in education holds particular promise for underserved and geographically isolated communities.
- Workforce development. Educators could be equipped to use AI responsibly, expanding coherent career pathways that connect classroom expertise to emerging roles in technology coaching, implementation strategy, and AI education leadership. Students graduating from systematically implemented AI programs would enter the workforce ready for AI-driven jobs, reducing skills gaps and strengthening U.S. competitiveness against global rivals.
In short, implementation is not a secondary concern; it is the primary determinant of whether AI in education strengthens learning or repeats the costly failures of past ed-tech investments. Embedding implementation capacity reviews before large-scale rollout—focused on educator preparation, infrastructure adequacy, and support systems—would help districts identify strengths and gaps early. Paired with outcomes-based vendor contracts and tiered implementation support that reflects district capacity, this approach would protect taxpayer dollars while positioning the United States as a global leader in responsible AI integration.
Plan of Action
AI education funding must shift to being both tool-focused and outcome-focused, reducing repeated implementation failures and ensuring that states and districts can successfully integrate AI tools in ways that strengthen teaching and learning. Federal guidance has made progress in identifying priority use cases for AI in education. With stronger alignment to state and local implementation capacity, investments can mitigate cycles of underutilized tools and wasted resources.
A hybrid approach is needed: federal agencies set clear expectations and provide resources for implementation, while states adapt and execute strategies tailored to local contexts. This model allows for consistency and accountability at the national level, while respecting state leadership.
Recommendation 1. Establish AI Education Implementation Standards Through Federal–State Partnership
To safeguard public investments and accelerate effective adoption, the Department of Education, working in partnership with state education agencies, should establish clear implementation standards that ensure readiness, capacity, and measurable outcomes.
- Implementation readiness benchmarks. Federal AI education funds should be distributed with expectations that recipients demonstrate the enabling systems necessary for effective implementation—including educator preparation, technical infrastructure, professional learning networks, and data governance protocols. ED should provide model benchmarks while allowing states to tailor them to local contexts.
- Dedicated implementation support. Funding streams should ensure AI education investments include not only tool procurement but also consistent, evidence-based professional development, technical assistance, and integration planning. Because these elements are often vendor-driven and uneven across states, embedding them in policy guidance helps SEAs and local education agencies (LEAs) build sustainable capacity and protect against ineffective or commodified approaches—ensuring schools have the human and organizational capacity to use AI responsibly and effectively.
- Joint oversight and accountability. ED and SEAs should collaborate to monitor and publicly share progress on AI education implementation and student outcomes. Metrics could be tied to observable indicators, such as completion of AI-focused professional development, integration of AI tools into instruction, and adherence to ethical and data governance standards. Transparent reporting builds public trust, highlights effective practices, and supports continuous improvement, while recognizing that measures of quality will evolve with new research and local contexts.
Recommendation 2. Develop a National AI Education Implementation Infrastructure
The U.S. Department of Education, in coordination with state agencies, should encourage a national infrastructure that helps and empowers states to build capacity, share promising practices, and align with national economic priorities.
- Regional implementation hubs. ED should partner with states to create regional AI education implementation centers that provide technical assistance, professional development, and peer learning networks. States would have flexibility to shape programming to their context while benefiting from shared expertise and federal support.
- Research and evaluation. ED, in coordination with the National Science Foundation (NSF), should conduct systematic research on AI education implementation effectiveness and share annual findings with states to inform evidence-based decision-making.
- Workforce alignment. Federal and state education agencies should continue to coordinate AI education implementation with existing workforce development initiatives (Department of Labor) and economic development programs (Department of Commerce) to ensure AI skills align with long-term economic and innovation priorities.
Recommendation 3. Adopt Outcomes-Based Contracting Standards for AI Education Procurement
The U.S. Department of Education should establish outcomes-based contracting (OBC) as a preferred procurement model for federally supported AI education initiatives. This approach ties vendor payment directly to demonstrated student success, with at least 40% of contract value contingent on achieving agreed-upon outcomes, ensuring federal investments deliver measurable results rather than unused tools. A simplified payment sketch follows the list below.
- Performance-based payment structures. ED should support contracts that include a base payment for implementation support and contingent payments earned only as students achieve defined outcomes. Payment should be based on individual student achievement rather than aggregate measures, ensuring every learner benefits while protecting districts from paying full price for ineffective tools.
- Clear outcomes and mutual accountability. Federal guidance should encourage contracts that specify student populations served, measurable success metrics tied to achievement and growth, and minimum service requirements for both districts and vendors (including educator professional learning, implementation support, and data sharing protocols).
- Vendor transparency and reporting. AI education vendors participating in federally supported programs should provide real-time implementation data, document effectiveness across participating sites, and report outcomes disaggregated by student subgroups to identify and address equity gaps.
- Continuous improvement over termination. Rather than automatic contract cancellation when challenges arise, ED should establish systems that prioritize joint problem-solving, technical assistance, and data-driven adjustments before considering more severe measures.
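The sketch below illustrates one way such a contract could be computed; the contract value, student counts, and outcome rates are hypothetical, and real contracts would layer in the service requirements and reporting obligations described above.

```python
def obc_payment(contract_value: float, students: list[bool],
                contingent_share: float = 0.40) -> float:
    """Vendor payment under an outcomes-based contract (illustrative only).

    contract_value: total contract value in dollars.
    students: one boolean per enrolled student, True if that student met
              the agreed achievement or growth target.
    contingent_share: fraction of the contract tied to outcomes (at least
              0.40 under the recommendation above).
    """
    base = contract_value * (1 - contingent_share)
    per_student = (contract_value * contingent_share) / len(students)
    earned = per_student * sum(students)
    return base + earned

# Hypothetical example: a $500,000 contract covering 1,000 students,
# 62% of whom hit their growth targets.
outcomes = [True] * 620 + [False] * 380
print(f"${obc_payment(500_000, outcomes):,.0f}")
# -> $424,000: a $300,000 base payment plus $124,000 earned on outcomes.
```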
Recommendation 4. Pilot Before Scaling
To ensure responsible, scalable, and effective integration of AI in education, ED and SEAs should prioritize pilot testing before statewide adoption while building enabling conditions for long-term success.
- Pilot-to-scale strategy. Federal and state agencies could jointly identify pilot districts representing diverse contexts (rural, urban, and suburban) to test AI implementation models before large-scale rollout. Lessons learned would inform future funding decisions, minimize risk, and increase effectiveness for states and districts.
- Enabling conditions for sustainability. States could build ongoing professional learning systems, technical support networks, and student data protections to ensure tools are used effectively over time.
- Continuous improvement loop. ED could coordinate with states to develop feedback systems that translate implementation data into actionable improvements for policy, procurement, and instruction, ensuring educators, leaders, and students all benefit.
Recommendation 5. Build a National AI Education Research & Development Network
To promote evidence-based practice, federal and state agencies should co-develop a coordinated research and development infrastructure that connects implementation data to policy and practice and supports global collaboration.
- Implementation research partnerships. Federal agencies (ED, NSF) should partner with states and research institutions to fund systematic studies on effective AI education implementation, with emphasis on scalability and outcomes across diverse student populations. Rather than creating a new standalone program, this would coordinate existing ED and NSF investments while expanding state-level participation.
- Testbed site networks. States should designate urban, suburban, and rural AI education implementation labs or “sandboxes”, modeled on responsible AI testbed infrastructure, where funding supports rigorous evaluation, cross-district peer learning, and local adaptation.
- Evidence-to-policy pipeline. Federal agencies should integrate findings from these research-practice partnerships into national AI education guidance, while states embed lessons learned into local technical assistance and professional development.
- National leadership and evidence sharing. Federal and state agencies should establish mechanisms to share evidence-based approaches and emerging insights, positioning the U.S. as a leader in responsible AI education implementation. This collaboration should leverage continuous, practice-informed research, called living evidence, which integrates real-world implementation data, including responsibly shared vendor-generated insights, to inform policy, guide best practices, and support scalable improvements.
Conclusion
The Department’s guidance on AI in education marks a pivotal step toward modernizing teaching and learning nationwide. To realize the promise of AI in education, funding must shift from supporting the acquisition of tools alone to supporting the strategies that ensure their effective implementation. Too often, technologies are purchased only to sit on the shelf while educators lack the support to integrate them meaningfully. International evidence shows that countries investing in teacher preparation and infrastructure before technology deployment achieve better outcomes and sustain them.
Early research also suggests that investments in professional development, infrastructure, and systems integration substantially increase the long-term impact of educational technology. Prioritizing these supports reduces waste and ensures federal dollars deliver measurable learning gains rather than unused tools. The choice before us is clear: continue the costly cycle of underused technologies or build the nation’s first sustainable model for AI in education—one that makes every dollar count, empowers educators, and delivers transformational improvements in student outcomes.
Clear implementation expectations don’t slow innovation—they make it sustainable. When systems know what effective implementation looks like, they can scale faster, reduce trial-and-error costs, and focus resources on what works to ultimately improve student outcomes.
Quite the opposite. Implementation support is designed to build capacity where it’s needed most. Embedding training, planning, and technical assistance ensures every district, regardless of size or resources, can participate in innovation on an equal footing.
AI education begins with people, not products. Implementation guidelines should help educators build on their existing skills to incorporate AI tools into instruction, gain access to relevant professional learning, and receive leadership support, so that AI enhances teaching and learning.
Implementation quality is multi-dimensional and may look different depending on local context. Common indicators could include: educator readiness and training, technical infrastructure, use of professional learning networks, integration of AI tools into instruction, and adherence to data governance protocols. While these metrics provide guidance, they are not exhaustive, and ED and SEAs will iteratively refine measures as research and best practices evolve. Transparent reporting on these indicators will help identify effective approaches, support continuous improvement, and build public trust.
Not when you look at the return. Billions are spent on tools that go underused or abandoned within a year. Investing in implementation is how we protect those investments and get measurable results for students.
The goal isn’t to add red tape—it’s to create alignment. States can tailor standards to local priorities while still ensuring transparency and accountability. Early adopters can model success, helping others learn and adapt.
Federation of American Scientists and 16 Tech Organizations Call on OMB and OSTP to Maintain Agency AI Use Case Inventories
The first Trump Administration’s E.O. 13960 commitment laid the foundation for increasing government accountability in AI use; this should continue
Washington, D.C. – March 6, 2025 – The Federation of American Scientists (FAS), a non-partisan, nonprofit science think tank dedicated to developing evidence-based policies to address national challenges, today released a letter to the White House Office of Management and Budget (OMB) and the Office of Science and Technology Policy (OSTP), signed by 16 additional scientific and technical organizations, urging the current Trump administration to maintain the federal agency AI use cases inventories at the current level of detail.
“The federal government has immense power to shape industry standards, academic research, and public perception of artificial intelligence,” says Daniel Correa, CEO of the Federation of American Scientists. “By continuing the work set forth by the first Trump administration in Executive Order 13960 and continued by the bipartisan 2023 Advancing American AI Act, OMB’s detailed use cases help us understand the depth and scope of AI systems used for government services.”
“FAS and our fellow organizations urge the administration to maintain these use case standards because these inventories provide a critical check on government AI use,” says Dr. Jedidah Isler, Chief Science Officer at FAS.
AI Guidance Update Mid-March
“Transparency is essential for public trust, which in turn is critical to maximizing the benefits of government AI use. That’s why FAS is leading a letter urging the administration to uphold the current level of agency AI use case detail—ensuring transparency remains a top priority,” says Oliver Stephenson, Associate Director of AI and Emerging Tech Policy at FAS.
“Americans want reassurances that the development and use of artificial intelligence within the federal government is safe; and that we have the ability to mitigate any adverse impacts. By maintaining guidance that federal agencies have to collect and publish information on risks, development status, oversight, data use and so many other elements, OMB will continue strengthening Americans’ trust in the development and use of artificial intelligence,” says Clara Langevin, AI Policy Specialist at FAS.
Surging Use of AI in Government
This letter follows the dramatic rise in the use of artificial intelligence across government, with further rapid growth anticipated. For example, at the end of 2024 the Department of Homeland Security (DHS) alone reported 158 active AI use cases. Of these, 29 were identified as high-risk, with detailed documentation on how 24 of those use cases are mitigating potential risks. OMB and OSTP have the ability and authority to set the guidelines that can address the growing pace of government innovation.
FAS and our signers believe that sustained transparency is crucial to ensuring responsible AI governance, fostering public trust, and enabling responsible industry innovation.
Signatories Urging AI Use Case Inventories at Current Level of Detail
Federation of American Scientists
Beeck Center for Social Impact + Innovation at Georgetown University
Bonner Enterprises, LLC
Center for AI and Digital Policy
Center for Democracy & Technology
Center for Inclusive Change
CUNY Public Interest Tech Lab
Electronic Frontier Foundation
Environmental Policy Innovation Center
Mozilla
National Fair Housing Alliance
NETWORK Lobby for Catholic Social Justice
New America’s Open Technology Institute
POPVOX Foundation
Public Citizen
SeedAI
The Governance Lab
###
ABOUT FAS
The Federation of American Scientists (FAS) works to advance progress on a broad suite of contemporary issues where science, technology, and innovation policy can deliver dramatic progress, and seeks to ensure that scientific and technical expertise have a seat at the policymaking table. Established in 1945 by scientists in response to the atomic bomb, FAS continues to work on behalf of a safer, more equitable, and more peaceful world. More information about FAS’s work is available at fas.org.
ABOUT THIS COALITION
Organizations signed on to this letter represent a range of technology stakeholders in industry, academia, and nonprofit realms. We share a commitment to AI transparency. We urge the current administration, OMB, and OSTP to retain the policies set forth in Trump’s Executive Order 13960 and continued in the bipartisan 2023 Advancing American AI Act.
A Quantitative Imaging Infrastructure to Revolutionize AI-Enabled Precision Medicine
Medical imaging, a non-invasive method to detect and characterize disease, stands at a crossroads. With the explosive growth of artificial intelligence (AI), medical imaging offers extraordinary potential for precision medicine yet lacks adequate quality standards to safely and effectively fulfill the promise of AI. Now is the time to create a quantitative imaging (QI) infrastructure to drive the development of precise, data-driven solutions that enhance patient care, reduce costs, and unlock the full potential of AI in modern medicine.
Medical imaging plays a major role in healthcare delivery and is an essential tool in diagnosing numerous health issues and diseases across clinical domains (e.g., oncology, neurology, cardiology, hepatology, nephrology, pulmonology, and musculoskeletal medicine). In 2023, there were more than 607 million imaging procedures in the United States, and, per a 2021 study, $66 billion (8.9% of the U.S. healthcare budget) is spent on imaging each year.
Despite the importance and widespread use of medical imaging modalities such as magnetic resonance imaging (MRI), X-ray, ultrasound, and computed tomography (CT), imaging is rarely standardized or quantitative. This leads to unnecessary costs from repeat scans needed to achieve adequate image quality, and to unharmonized and uncalibrated imaging datasets, which are often unsuitable for AI/machine learning (ML) applications. In the nascent yet exponentially expanding world of AI in medical imaging, a well-defined standards and metrology framework is required to establish robust imaging datasets for true precision medicine, thereby improving patient outcomes and reducing spiraling healthcare costs.
Challenge and Opportunity
The U.S. spends more on healthcare than any other high-income country yet performs worse on measures of health and healthcare. Research has demonstrated that medical imaging could help save money for the health system, with every $1 spent on inpatient imaging resulting in approximately $3 in total savings in healthcare delivered. However, to generate healthcare savings and improve outcomes, rigorous quality assurance (QA)/quality control (QC) standards are required for true QI and data integrity.
Today, medical imaging suffers two shortcomings inhibiting AI:
- Lack of standardization: Findings or measurements differ based on numerous factors such as system manufacturer, software version, or imaging protocol.
- A reliance on qualitative (subjective) measurements despite the technological capabilities to perform quantitative (objective) measurements.
Both introduce variability that skews assessments, reduces the generalizability of, and confidence in, imaging test results, and compromises the data quality required for AI applications.
The growing field of QI, however, provides accurate and precise (repeatable and reproducible) quantitative-image-based metrics that are consistent across different imaging devices and over time. This benefits patients (fewer scans, biopsies), doctors, researchers, insurers, and hospitals and enables safe, viable development and use of AI/ML tools.
Quantitative imaging metrology and standards are required as a foundation for clinically relevant and useful QI. A change from “this might be a stage 3 tumor” to “this is a stage 3 tumor” will affect how oncologists can treat a patient. Quantitative imaging also has the potential to remove the need for an invasive biopsy and, in some cases, provide valuable and objective information before even the most expert radiologist’s qualitative assessment. This can mean the difference between taking a nonresponding patient off a toxic chemotherapeutic agent or recognizing a strong positive treatment response before a traditional assessment.
Plan of Action
The incoming administration should develop and fund a Quantitative Imaging Infrastructure to provide medical imaging with a foundation of rigorous QA/QC methodologies, metrology, and standards—all essential for AI applications.
Coordinated leadership is essential to achieve such standardization. Numerous medical, radiological, and standards organizations support and recognize the power of QI and the need for rigorous QA/QC and metrology standards (see FAQs). Currently, no single U.S. organization has the oversight capabilities, breadth, mandate, or funding to effectively implement and regulate QI or a standards and metrology framework.
As set forth below, earlier successful approaches to quality and standards in other realms offer inspiration and guidance for medical imaging and this proposal:
- Clinical Laboratory Improvement Amendments of 1988 (CLIA)
- Mammographic Quality Standards Act of 1992 (MQSA)
- Centers for Disease Control’s (CDC) Clinical Standardization Program (CSP)
Recommendation 1. Create a Medical Metrology Center of Excellence for Quantitative Imaging.
Establishing a QI infrastructure would transform all medical imaging modalities and clinical applications. Our recommendation is that an autonomous organization be formed, possibly appended to existing infrastructure, with the mandate and responsibility to develop and operationally support the implementation of quantitative QA/QC methodologies for medical imaging in the age of AI. Specifically, this fully integrated QI Metrology Center of Excellence would need federal funding to:
- Define a metrology standards framework, in accord with international metrology standards;
- Implement a standards program to oversee:
- approach to sequence definition
- protocol development
- QA/QC methodology
- QIB profiles
- guidance and standards regarding “digital twins”
- applications related to radiomics
- clinical practice and operations
- vendor-neutral applications (where application is agnostic to manufacturer/machine)
- AI/ML validation
- data standardization
- training and continuing education for doctors and technologists.
Once implemented, the Center could focus on self-sustaining approaches such as testing and services provided for a fee to users.
Similar programs and efforts have resulted in funding (public and private) ranging from $90 million (e.g., Pathogen Genomics Centers of Excellence Network) to $150 million (e.g., Biology and Machine Learning – Broad Institute). Importantly, implementing a QI Center of Excellence would augment and complement federal funding currently being awarded through ARPA-H and the Cancer Moonshot, as neither have an overarching imaging framework for intercomparability between projects.
While this list is by no means exhaustive, any organization would need input and buy-in from:
- National Institutes of Health (NIH) (and related organizations such as National Institute of Biomedical Imaging and Bioengineering (NIBIB))
- National Institute of Standards and Technology (NIST)
- Centers for Disease Control (CDC)
- Centers for Medicare and Medicaid Services (CMS)
- U.S. Department of Defense (DoD)
- Radiological Society of North America (RSNA)
- American Association of Physicists in Medicine (AAPM)
- International Society for Magnetic Resonance in Medicine (ISMRM)
- Society of Nuclear Medicine & Molecular Imaging (SNMMI)
- American Institute of Ultrasound in Medicine (AIUM)
- American College of Radiology (ACR)
- ARPA-H
- HHS
International organizations also have relevant programs, guidance, and insight, including:
- European Society of Radiology (ESR)
- European Association of National Metrology Institutes (EURAMET)
- European Society of Breast Imaging (EUSOBI)
- European Society of Radiology’s Imaging Biomarkers Alliance (EIBALL)
- European Association of Nuclear Medicine (EANM)
- Institute of Physics and Engineering in Medicine (IPEM)
- Japan Quantitative Imaging Biomarker Alliance (JQIBA)
- National Physical Laboratory (NPL)
- National Imaging Facility (NIF)
Recommendation 2. Implement legislation and/or regulation providing incentives for standardizing all medical imaging.
The variability of current standard-of-care medical imaging (whether acquired across different sites or over a period of time) creates different “appearances.” This variability can result in different diagnoses or treatment response measurements, even though the underlying pathology for a given patient is unchanged. Real-world examples abound, such as one study that found 10 MRI studies over three weeks resulted in 10 different reports. This heterogeneity of imaging data can lead to a variable assessment by a radiologist (inter-reader variability), AI interpretation (“garbage-in-garbage-out”), or treatment recommendations from clinicians. Efforts are underway to develop “vendor-neutral sequences” for MRI and other methods (such as quantitative ground truth references, metrological standards, etc.) to improve data quality and ensure intercomparable results across vendors and over time.
To do so, however, requires coordination by all original equipment manufacturers (OEMs) or legislation to incentivize standards. The 1992 Mammography Quality Standards Act (MQSA) provides an analogous roadmap. MQSA’s passage implemented rigorous standards for mammography, and similar legislation focused on quality assurance of quantitative imaging, reducing or eliminating machine bias, and improved standards would reduce the need for repeat scans and improve datasets.
In addition, regulatory initiatives could also advance quantitative imaging. For example, in 2022, the Food and Drug Administration (FDA) issued Technical Performance Assessment of Quantitative Imaging in Radiological Device Premarket Submissions, recognizing the importance of ground truth references with respect to quantitative imaging algorithms. A mandate requiring the use of ground truth reference standards would change standard practice and be a significant step to improving quantitative imaging algorithms.
Recommendation 3. Ensure a funded QA component for federally funded research using medical imaging.
All federal medical research grant or contract awards should contain QA funds and require rigorous QA methodologies. The quality system aspects of such grants would fit the scope of the project; for example, a multiyear, multisite project would have a different scope than single-site, short-term work.
NIH spends the majority of its $48 billion budget on medical research. Projects include multiyear, multisite studies with imaging components. While NIH does have guidelines on research and grant funding (e.g., Guidance: Rigor and Reproducibility in Grant Applications), this guidance falls short in multisite, multiyear projects where clinical scanning is a component of the study.
To the extent NIH-funded programs fail to include ground truth references where clinical imaging is used, the resulting data cannot be accurately compared over time or across sites. Lack of standardization and failure to require rigorous and reproducible methods compromises the long-term use and applicability of the funded research.
By contrast, implementation of rigorous QA/QC and standardization improves research in terms of reproducibility, repeatability, and ultimate outcomes. Further, confidence in imaging datasets enables the use of existing and qualified research in future NIH-funded work and/or in imaging dataset repositories that are being leveraged for AI research and development, such as the Medical Imaging and Data Resource Center (MIDRC). (See also: Open Access Medical Imaging Repositories.)
Recommendation 4. Implement a Clinical Standardization Program (CSP) for quantitative imaging.
While not focused on medical imaging, the CDC’s CSPs have been incredibly successful and “improve the accuracy and reliability of laboratory tests for key chronic biomarkers, such as those for diabetes, cancer, and kidney, bone, heart, and thyroid disease.” By way of example, the CSP for Lipids Standardization has “resulted in an estimated benefit of $338M at a cost of $1.7M.” Given the breadth of use of medical imaging, implementing such a program for QI would have even greater benefits.
Although many people think of the images derived from clinical imaging scans as “pictures,” the pixel and voxel numbers that make up those images contain meaningful biological information. The objective biological information that is extracted by QI is conceptually the same as the biological information that is extracted from tissue or fluids by laboratory assay techniques. Thus, quantitative imaging biomarkers can be understood to be “imaging assays.”
The QA/QC standards that have been developed for laboratory assays can and should be adapted to quantitative imaging. (See also regulations, history, and standards of the Clinical Laboratory Improvement Amendment (CLIA) ensuring quality laboratory testing.)
Recommendation 5. Implement an accreditation program and reimbursement code for quantitative imaging starting with qMRI.
The American College of Radiology currently provides basic accreditation for clinical imaging scanners and concomitant QA for MRI. These requirements, however, have been in place for nearly two decades and do not address many newer quantitative aspects (e.g., relaxometry and apparent diffusion coefficient (ADC) mapping), nor do they account for the impact of image variability on effective AI use. Several new Current Procedural Terminology (CPT) codes focused on quantitative imaging have recently been adopted. An expansion of reimbursement codes for quantitative imaging could drive more widespread clinical adoption.
QI is analogous to the quantitative blood, serum and tissue assays done in clinical laboratories, subject to CLIA, one of the most impactful programs for improving the accuracy and reliability of laboratory assays. This CMS-administered mandatory accreditation program promulgates quality standards for all laboratory testing to ensure the accuracy, reliability, and timeliness of patient test results, regardless of where the test was performed.
Conclusion
These five proposals provide a range of actionable opportunities to modernize the approach to medical imaging to fit the age of AI, data integrity, and precision patient health. A comprehensive, metrology-based quantitative imaging infrastructure will transform medical imaging through:
- Improved clinical care due to an objective, quantitative understanding of a disease state(s)
- Accurate quantitative image datasets for (1) analysis and implementation to establish new imaging and composite biomarkers; and (2) training, evaluating/validating, and optimizing high-accuracy AI/ML applications;
- Reduced healthcare costs due to higher quality (fewer repeat) scans, more accurate diagnoses, and fewer invasive biopsies; and
- Greater accessibility of medical imaging with new to market technologies (e.g., low-field and/or organ-specific MRI) tied to quantitative imaging biomarkers.
With robust metrological underpinnings and a funded infrastructure, the medical community will have confidence in QI data, unlocking powerful health insights that until now have only been imaginable.
This action-ready policy memo is part of Day One 2025 — our effort to bring forward bold policy ideas, grounded in science and evidence, that can tackle the country’s biggest challenges and bring us closer to the prosperous, equitable and safe future that we all hope for whoever takes office in 2025 and beyond.
PLEASE NOTE (February 2025): Since publication several government websites have been taken offline. We apologize for any broken links to once accessible public data.
Yes. Using MRI as an example, numerous articles, papers, and publications acknowledge that qMRI scanner output can vary between manufacturers, over time, and after software or hardware maintenance or upgrades.
With in-vivo metrology, measurements are performed on the “body of living subjects (human or animal) without taking the sample out of the living subject (biopsy).” True in-vivo metrology will enable the diagnosis or understanding of tissue state before a radiologist’s visual inspection. Such measurement capabilities are objective, in contrast to the subjective, qualitative interpretation by a human observer. In-vivo metrology will enhance and support the practice of radiology in addition to reducing unnecessary procedures and associated costs.
Current digital imaging modalities provide the ability to measure a variety of biological and physical quantities with accuracy and reliability, e.g., tissue characterization, physical dimensions, temperature, body mass components, etc. However, consensus standards and corresponding certification or accreditation programs are essential to bring the benefits of these objective QI parameters to patient care. The CSP follows this paradigm as does the earlier CLIA, both of which have been instrumental in improving the accuracy and consistency of laboratory assays. This proposal aims to bring the same rigor to immediately improve the quality, safety and effectiveness of medical imaging in clinical care and to advance the input data needed to create, as well as safely and responsibly use, robust imaging AI tools for the benefit of all patients.
Phantoms are specialized test objects used as ground truth references for quantitative imaging and analysis. NIST plays a central role in measuring and testing solutions for phantoms. Phantoms are used in ultrasound, CT, MRI, and other imaging modalities for routine QA/QC and machine testing. Phantoms are key to harmonizing and standardizing data and to improving the data quality needed for AI applications.
Precision medicine is a popular term with many definitions/approaches applying to genetics, oncology, pharmacogenetics, etc. (See, e.g., NCI, FDA, NIH, National Human Genome Research Institute.) Generally, precision (or personalized) medicine focuses on the idea that treatment can be individualized (rather than generalized). While there have been exciting advances in personalized medicine (such as gene testing), the variability of medical imaging is a major limitation in realizing the full potential of precision medicine. Recognizing that medical imaging is a fundamental measurement tool from diagnosis through measurement of treatment response and toxicity assessment, this proposal aims to transition medical imaging practices to quantitative imaging to enable the realization of precision medicine and timely personalized approaches to patient care.
Radiologists need accurate and reliable data to make informed decisions. Improving standardization and advancing QI metrology will support radiologists by improving data quality. To the extent radiologists are relying on AI platforms, data quality is even more essential when it is used to drive AI applications, as the outputs of AI models rely on sound acquisition methods and accurate quantitative datasets.
Standardized data also helps patients by reducing the need for repeat scans, which saves time, money, and unnecessary radiation (for ionizing methods).
Yes! Using MRI as an example, qMRI can advance and support efforts to make MRI more accessible. Historically, MRI systems cost millions of dollars and are located in high-resource hospital settings. Numerous healthcare and policy providers are making efforts to create “accessible” MRI systems, including portable systems at lower field strengths and systems designed for organ-specific diseases. New low-field systems can reach patient populations historically absent from high-resource hospital settings. However, robust and reliable quantitative data are needed to ensure data collected in rural or nonhospital settings, or in low- and middle-income countries, can be objectively compared to data from high-resource hospital settings.
Further, accessibility can be limited by a lack of local expertise. AI could help fill the gap.
However, a QI infrastructure is needed for safe and responsible use of AI tools, ensuring adequate quality of the input imaging data.
The I-SPY 2 Clinical Breast Trials provide a prime example of the need for rigorous QA and scanner standardization. The I-SPY 2 trial is a novel approach to breast cancer treatment that closely monitors treatment response to neoadjuvant therapy. If there is no immediate/early response, the patient is switched to a different drug. MR imaging is acquired at various points during the treatment to determine the initial tumor size and functional characteristics and then to measure any tumor shrinkage/response over the course of treatment. One quantitative MRI tumor characteristic that has shown promise for evaluation of treatment response and is being evaluated in the trial is ADC, a measure of tissue water mobility which is calculated from diffusion-weighted imaging. It is essential for the trial that MR results can be compared over time as well as across sites. To truly know whether a patient is responding, the radiologist must have confidence that any change in the MR reading or measurement is due to a physiological change and not due to a scanner change such as drift, gradient failure, or software upgrade.
For the I-SPY 2 trial, breast MRI phantoms and a standardized imaging protocol are used to test and harmonize scanner performance and evaluate measurement bias over time and across sites. This approach then provides clear data/information on image quality and quantitative measurement (e.g., ADC) for both the trial (comparing data from all sites is possible) as well as for the individual imaging sites.
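To make these QA concepts concrete, here is a minimal sketch, assuming a mono-exponential diffusion signal model and hypothetical phantom values, of how ADC can be computed from diffusion-weighted images and compared against a phantom’s known reference value. It is an illustration only, not a description of the I-SPY 2 QA procedures.

```python
import numpy as np

# Illustrative sketch only. The mono-exponential model S(b) = S0 * exp(-b * ADC)
# is the standard basis for ADC mapping; the phantom values below are hypothetical.

def compute_adc(signal_b0: np.ndarray, signal_b: np.ndarray, b_value: float) -> np.ndarray:
    """ADC (mm^2/s) from a b=0 image and one diffusion-weighted image
    acquired at b = b_value (s/mm^2)."""
    return np.log(signal_b0 / signal_b) / b_value

def percent_bias(measured_adc: float, reference_adc: float) -> float:
    """Percent bias of a scanner's phantom measurement against the phantom's
    known (e.g., NIST-traceable) reference value."""
    return 100.0 * (measured_adc - reference_adc) / reference_adc

# Example: a phantom compartment with a known ADC of 1.10e-3 mm^2/s measured
# at 1.07e-3 mm^2/s on a given scanner corresponds to roughly -2.7% bias.
print(percent_bias(1.07e-3, 1.10e-3))
```

Tracking such bias values per scanner over time is one simple way to help distinguish a true physiological change in a patient from scanner drift, a gradient failure, or a software upgrade.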
Nonstandardized imaging results in variation that requires orders of magnitude more data to train an algorithm. More importantly, without reliable and standardized datasets, AI algorithms drift, resulting in degradation of both protocols and performance. Creating and supporting a standards-based framework for medical imaging will mitigate these issues (a minimal harmonization sketch follows the list below) as well as lead to:
- Integrated and coordinated system for establishing QIBs, screening, and treatment planning.
- Cost savings: Standardizing data and implementing quantitative methods results in superior datasets for clinical use or as part of large datasets for AI applications. Clinical Standardization Programs have focused on standardizing tests and have been shown to save “millions in health care costs.”
- Better health outcomes: Standardization reduces reader error and enables new AI applications to support current radiology practices.
- Support for radiologists’ diagnoses.
- Fewer incorrect diagnoses (false positives and false negatives).
- Elimination of millions of unnecessary invasive biopsies.
- Fewer repeat scans.
- Robust and reliable datasets for AI applications (e.g., preventing model collapse).
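As referenced above, here is a minimal harmonization sketch. Under invented numbers, it shows how a shared reference phantom could yield per-site correction factors so that quantitative values from different scanners become comparable; real harmonization programs involve far more than a single scale factor.

```python
# Illustrative sketch only: hypothetical per-site correction factors derived
# from a shared reference phantom. All values are invented for illustration.

reference_phantom_value = 1.10e-3        # known ground-truth ADC (mm^2/s)

site_phantom_measurements = {            # each site's reading of the same phantom
    "site_A": 1.07e-3,
    "site_B": 1.15e-3,
}

correction_factors = {
    site: reference_phantom_value / measured
    for site, measured in site_phantom_measurements.items()
}

# A patient measurement from site_B, rescaled toward the common reference frame.
patient_adc_site_B = 1.30e-3
harmonized = patient_adc_site_B * correction_factors["site_B"]
print(f"{harmonized:.3e}")               # ~1.243e-03 mm^2/s
```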
It benefits federal organizations such as the National Institutes of Health, Centers for Medicare and Medicaid Services, and Veterans Affairs as well as the private and nonprofit sectors (insurers, hospital systems, pharmaceutical, imaging software, and AI companies). The ultimate beneficiary, however, is the patient, who will receive an objective, reliable quantitative measure of their health—relevant for a point-in-time assessment as well as longitudinal follow-up.
Possible pushback from such a program may come from: (1) radiologists who are unfamiliar with the power of quantitative imaging for precision health and/or the importance and incredible benefits of clean datasets for AI applications; or (2) manufacturers (OEMs) who aim to improve output through differentiation and are focused on customers who are more interested in their qualitative practice.
Radiology practices: Radiology practices’ main objective is to provide the most accurate diagnosis possible in the least amount of time, as cost-effectively as possible. Standardization and calibration are generally perceived as requiring additional time and increased costs; however, these perceptions are often inaccurate, and the variability in today’s imaging itself consumes more time and creates more challenges. The existing standard of care relies on qualitative assessments of medical images.
While qualitative assessments are excellent for understanding a patient’s health at a single point in time (though even then subtle abnormalities can be missed), longitudinal monitoring is impossible without robust metrological standards for reproducibility and quantitative assessment of tissue health. While a move from qualitative to quantitative imaging may require additional education, understanding, and time, such an infrastructure will provide radiologists with improved capabilities and an opportunity to supplement and augment the existing standard of care.
Further, AI is undeniably being incorporated into numerous radiology applications, which will require accurate and reliable datasets. As such, it will be important to work with radiology practices to demonstrate that a move to standardization will, ultimately, reduce time and increase the ability to accurately diagnose patients.
OEMs: Imaging device manufacturers work diligently to improve their outputs. To the extent differentiation is seen as a business advantage, a move toward vendor-neutral and scanner-agnostic metrics may initially be met with resistance. However, all OEMs are investing resources to improve AI applications and patient health. All benefit from input data that is standard and robust and provides enough transparency to ensure FAIR data principles (findability, accessibility, interoperability, and reusability).
OEMs have plenty of areas for differentiation including improving the patient experience and shortening scan times. We believe OEMs, as part of their move to embrace AI, will find clear metrology and standards-based framework a positive for their own business and the field as a whole.
The first step is to convene a meeting of leaders in the field within three months to establish priorities and timelines for successful implementation and adoption of a Center of Excellence. Any Center must be well-funded with experienced leadership and will need the support and collaboration across the relevant agencies and organizations.
There are numerous potential pilots. The key is to identify an actionable study where results could be achieved within a reasonable time. For example, a pilot study to demonstrate the importance of quantitative MRI and sound datasets for AI could be implemented at the Veterans Administration hospital system. This study could focus on quantifying the benefits of standardizing and implementing quantitative diffusion MRI, an “imaging biopsy” modality, and could mirror advances and knowledge gained from the existing I-SPY 2 clinical breast trials.
The timing is right for three reasons: (1) quantitative imaging is doable; (2) AI is upon us; and (3) there is a desire and need to reduce healthcare costs and improve patient outcomes.
There is widespread agreement that QI methodologies have enormous potential benefits, and many government agencies and industry organizations have acknowledged this. Unfortunately, there has been no unifying entity with sufficient resources and professional leadership to coordinate and focus these efforts. Many of these efforts have been organized and run by volunteers. Finally, some previously funded efforts to support quantitative imaging (e.g., QIN and QIBA) have recently lost dedicated funding.
With rapid advances in technology, including the promise of AI, there is new and shared motivation across communities to revise our approach to data generation and collection at-large—focused on standardization, precision, and transparency. By leveraging the existing widespread support, along with dedicated resources for implementation and enforcement, this proposal will drive the necessary change.
Yes. Human health has no geographical boundaries, so a global approach to quantitative imaging would benefit all. QI is being studied, implemented, and adopted globally.
However, as is the case in the U.S., while standards have been proposed, there is no international body to govern the implementation, coordination, and maturation of this process. The initiatives put forth here could provide a roadmap for global collaboration (ever-more important with AI) and standards that would speed up development and implementation both in the U.S. and abroad.
Some of the organizations that recognize the urgent need for QI standards include:
- National Institute of Standards and Technology (quantitative MR project)
- National Cancer Institute’s Quantitative Imaging Network
- Radiological Society of North America (through its Quantitative Imaging Committee)
- Quantitative Imaging Biomarker Alliance
- International Society for Magnetic Resonance in Medicine
- Object Management Group
- American Association of Physicists in Medicine
- National Institutes of Health
- Food and Drug Administration (“guidance” – ground truth references for quantitative imaging algorithms)
FAS Receives $1.5 Million Grant on The Artificial Intelligence / Global Risk Nexus
Grant Funds Research of AI’s Impact on Nuclear Weapons, Biosecurity, Military Autonomy, Cyber, and other global issues
Washington, D.C. – September 11, 2024 – The Federation of American Scientists (FAS) has received a $1.5 million grant from the Future of Life Institute (FLI) to investigate the implications of artificial intelligence on global risk. The 18-month project supports FAS’s efforts to bring together the world’s leading security and technology experts to better understand and inform policy on the nexus between AI and several global issues, including nuclear deterrence and security, bioengineering, autonomy and lethality, and cyber security-related issues.
FAS’s CEO Daniel Correa noted that “understanding and responding to how new technology will change the world is why the Federation of American Scientists was founded. Against this backdrop, FAS has embarked on a critical journey to explore AI’s potential. Our goal is not just to understand these risks, but to ensure that as AI technology advances, humanity’s ability to understand and manage the potential of this technology advances as well.
“When the inventors of the atomic bomb looked at the world they helped create, they understood that without scientific expertise and broader perspectives, humanity would never realize the potential benefits they had helped bring about. They founded FAS to ensure the voice of objective science was at the policy table, and we remain committed to that effort after almost 80 years.”
“We’re excited to partner with FLI on this essential work,” said Jon Wolfsthal, who directs FAS’ Global Risk Program. “AI is changing the world. Understanding this technology and how humans interact with it will affect the pressing global issues that will determine the fate of all humanity. Our work will help policy makers better understand these complex relationships. No one fully understands what AI will do for us or to us, but having all perspectives in the room and working to protect against negative outcomes and maximizing positive ones is how good policy starts.”
“As the power of AI systems continues to grow unchecked, so too does the risk of devastating misuse and accidents,” writes FLI President Max Tegmark. “Understanding the evolution of different global threats in the context of AI’s dizzying development is instrumental to our continued security, and we are honored to support FAS in this vital work.”
The project will include a series of activities, including high-level focused workshops with world-leading experts and officials on different aspects of artificial intelligence and global risk, policy sprints and fellows, and directed research; it will conclude with a global summit on AI and global risk in Washington in 2026.
###
ABOUT FAS
The Federation of American Scientists (FAS) works to advance progress on a broad suite of contemporary issues where science, technology, and innovation policy can deliver dramatic progress, and seeks to ensure that scientific and technical expertise have a seat at the policymaking table. Established in 1945 by scientists in response to the atomic bomb, FAS continues to work on behalf of a safer, more equitable, and more peaceful world. More information at fas.org.
ABOUT FLI
Founded in 2014, the Future of Life Institute (FLI) is a leading nonprofit working to steer transformative technology towards benefiting humanity. FLI is best known for its 2023 open letter calling for a six-month pause on advanced AI development, endorsed by experts such as Yoshua Bengio and Stuart Russell, as well as its work on the Asilomar AI Principles and the recent EU AI Act.
Public Comment on the U.S. Artificial Intelligence Safety Institute’s Draft Document: NIST AI 800-1, Managing Misuse Risk for Dual-Use Foundation Models
Public comments serve the executive branch by informing more effective, efficient program design and regulation. As part of our commitment to evidence-based, science-backed policy, FAS staff leverage public comment opportunities to embed science, technology, and innovation into policy decision-making.
The Federation of American Scientists (FAS) is a non-partisan organization dedicated to using science and technology to benefit humanity through equitable and impactful policy. With a strong track record in AI governance, FAS has actively contributed to the development of AI standards and frameworks, including providing feedback on NIST AI 600-1, the Generative AI Profile. Our work spans advocating for federal AI testbeds, recommending policy measures for frontier AI developers, and evaluating industry adoption of the NIST AI Risk Management Framework. We are members of the U.S. AI Safety Institute Research Consortium, and we responded to NIST’s request for information earlier this year concerning its responsibilities under sections 4.1, 4.5, and 11 of the AI Executive Order.
We commend NIST’s U.S. Artificial Intelligence Safety Institute for developing the draft guidance on “Managing Misuse Risk for Dual-Use Foundation Models.” This document represents a significant step toward establishing robust practices for mitigating catastrophic risks associated with advanced AI systems. The guidance’s emphasis on comprehensive risk assessment, transparent decision-making, and proactive safeguards aligns with FAS’s vision for responsible AI development.
In our response, we highlight several strengths of the guidance, including its focus on anticipatory risk assessment and the importance of clear documentation. We also identify areas for improvement, such as the need for harmonized language and more detailed guidance on model development safeguards. Our key suggestions include recommending a more holistic socio-technical approach to risk evaluation, strengthening language around halting development for unmanageable risks, and expanding the range of considered safeguards. We believe these adjustments will further strengthen NIST’s crucial role in shaping responsible AI development practices.
Background and Context
The rapid advancement of AI foundation models has spurred novel industry-led risk mitigation strategies. Leading AI companies have voluntarily adopted frameworks like Responsible Scaling Policies and Preparedness Frameworks, outlining risk thresholds and mitigation strategies for increasingly capable AI systems. (Our response to NIST’s February RFI was largely an exploration of these policies, their benefits and drawbacks, and how they could be strengthened.)
Managing misuse risks in foundation models is of paramount importance given their broad applicability and potential for dual use. As these models become more powerful, they may inadvertently enable malicious actors to cause significant harm, including facilitating the development of weapons, enabling sophisticated cyber attacks, or generating harmful content. The challenge lies not only in identifying current risks but also in anticipating future threats that may emerge as AI capabilities expand.
NIST’s new guidance on “Managing Misuse Risk for Dual-Use Foundation Models” builds upon these industry initiatives, providing a more standardized and comprehensive approach to risk management. By focusing on objectives such as anticipating potential misuse, establishing clear risk thresholds, and implementing robust evaluation procedures, the guidance creates a framework that can be applied across the AI development ecosystem. This approach is crucial for ensuring that as AI technology advances, appropriate safeguards are in place to protect against potential misuse while still fostering innovation.
Strengths of the guidance
1. Comprehensive Documentation and Transparency
The guidance’s emphasis on thorough documentation and transparency represents a significant advancement in AI risk management. For every practice under every objective, the guidance indicates appropriate documentation; this approach is more thorough in advancing transparency than any comparable guidance to date. The creation of a paper trail for decision-making and risk evaluation is crucial for both internal governance and potential external audits.
The push for transparency extends to collaboration with external stakeholders. For instance, practice 6.4 recommends providing “safe harbors for third-party safety research,” including publishing “a clear vulnerability disclosure policy for model safety issues.” This openness to external scrutiny and feedback is essential for building trust and fostering collaborative problem-solving in AI safety. (FAS has published a legislative proposal calling for enshrining “safe harbor” protections for AI researchers into law.)
2. Lifecycle Approach to Risk Management
The guidance excels in its holistic approach to risk management, covering the entire lifecycle of foundation models from pre-development assessment through to post-deployment monitoring. This comprehensive approach is evident in the structure of the document itself, which follows a logical progression from anticipating risks (Objective 1) through to responding to misuse after deployment (Objective 6).
The guidance demonstrates a proactive stance by recommending risk assessment before model development. Practice 1.3 suggests to “Estimate the model’s capabilities of concern before it is developed…”, which helps anticipate and mitigate potential harms before they materialize. The framework for red team evaluations (Practice 4.2) is particularly robust, recommending independent external experts and suggesting ways to compensate for gaps between red teams and real threat actors. The guidance also emphasizes the importance of ongoing risk assessment. Practice 3.2 recommends to “Periodically revisit estimates of misuse risk stemming from model theft…” This acknowledgment of the dynamic nature of AI risks encourages continuous vigilance.
3. Strong Stance on Model Security and Risk Tolerance
The guidance takes a firm stance on model security and risk tolerance, particularly in Objective 3. It unequivocally states that models relying on confidentiality for misuse risk management should only be developed when theft risk is sufficiently mitigated. This emphasizes the critical importance of security in AI development, including considerations for insider threats (Practice 3.1).
The guidance also demonstrates a realistic approach to the challenges posed by different deployment strategies. In Practice 5.1, it notes, “For example, allowing fine-tuning via API can significantly limit options to prevent jailbreaking and sharing the model’s weights can significantly limit options to monitor for misuse (Practice 6.1) and respond to instances of misuse (Practice 6.2).” This candid discussion of the limitations of safety interventions for open weight foundation models is crucial for fostering realistic risk assessments.
Additionally, the guidance promotes a conservative approach to risk management. Practice 5.3 recommends to “Consider leaving a margin of safety between the estimated level of risk at the point of deployment and the organization’s risk tolerance.” It further suggests considering “a larger margin of safety to manage risks that are more severe or less certain.” This approach provides an extra layer of protection against unforeseen risks or rapid capability advancements, which is crucial given the uncertainties inherent in AI development.
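As a toy illustration of how the margin-of-safety idea might be operationalized, consider the sketch below; the numeric scores, margins, and tolerance are placeholders and are not specified by the guidance.

```python
# Illustrative sketch only: a hypothetical deployment gate reflecting the
# "margin of safety" idea in Practice 5.3. All numbers are placeholders.

def within_tolerance(estimated_risk: float,
                     risk_tolerance: float,
                     severe: bool,
                     uncertain: bool) -> bool:
    """Allow deployment only if estimated risk plus a safety margin stays
    within the organization's risk tolerance."""
    margin = 0.05
    if severe:
        margin += 0.10   # larger margin for more severe risks
    if uncertain:
        margin += 0.10   # larger margin for less certain estimates
    return estimated_risk + margin <= risk_tolerance

# Example: a severe, uncertain risk estimated at 0.25 against a tolerance of 0.40
print(within_tolerance(0.25, 0.40, severe=True, uncertain=True))  # False
```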
These elements collectively demonstrate NIST’s commitment to promoting realistic and robust risk management practices that prioritize safety and security in AI development and deployment. However, while the NIST guidance demonstrates several important strengths, there are areas where it could be further improved to enhance its effectiveness in managing misuse risks for dual-use foundation models.
Areas for improvement
1. Need for a More Comprehensive Socio-technical Approach to Measuring Misuse Risk
Objective 4 of the guidance demonstrates a commendable effort to incorporate elements of a socio-technical approach in measuring misuse risk. The guidance recognizes the importance of considering both technical and social factors, emphasizes the use of red teams to assess potential misuse scenarios, and acknowledges the need to consider different levels of access and various threat actors. Furthermore, it highlights the importance of avoiding harm during the measurement process, which is crucial in a socio-technical framework.
However, the guidance falls short in fully embracing a comprehensive socio-technical perspective. While it touches on the importance of external experts, it does not sufficiently emphasize the value of diverse perspectives, particularly from individuals with lived experiences relevant to specific risk scenarios. The guidance also lacks a structured approach to exploring the full range of potential misuse scenarios across different contexts and risk areas. Finally, the guidance does not mention measuring absolute versus marginal risks (i.e., how much total misuse risk a model poses in a specific context versus how much marginal risk it poses compared to existing tools). These gaps limit the effectiveness of the proposed risk measurement approach in capturing the full complexity of AI system interactions with human users and broader societal contexts.
Specific recommendations for improving socio-technical approach
The NIST guidance in Practice 1.3 suggests estimating model capabilities by comparison to existing models, but provides little direction on how to conduct these comparisons effectively. To improve this, NIST could incorporate the concept of “available affordances.” This concept emphasizes that an AI system’s risk profile depends not just on its absolute capabilities, but also on the environmental resources and opportunities for affecting the world that are available to it.
Additionally, Kapoor et al. (2024) emphasize the importance of assessing the marginal risk of open foundation models compared to existing technologies or closed models. This approach aligns with a comprehensive socio-technical perspective by considering not just the absolute capabilities of AI systems, but also how they interact with existing technological and social contexts. For instance, when evaluating cybersecurity risks, they suggest considering both the potential for open models to automate vulnerability detection and the existing landscape of cybersecurity tools and practices. This marginal risk framework helps to contextualize the impact of open foundation models within broader socio-technical systems, providing a more nuanced understanding of their potential benefits and risks.
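To make the absolute-versus-marginal distinction concrete, a trivial sketch follows; Kapoor et al. describe the framework qualitatively, and the numbers here are invented.

```python
# Illustrative sketch only: marginal risk as the additional misuse risk a model
# introduces over what existing tools already enable. Scores are hypothetical.

def marginal_risk(risk_with_model: float, risk_with_existing_tools: float) -> float:
    return risk_with_model - risk_with_existing_tools

# Example: if existing vulnerability-detection tools already imply a risk score
# of 0.50 and access to the model raises it to 0.75, the absolute risk is 0.75
# but the marginal risk attributable to the model is 0.25.
print(marginal_risk(0.75, 0.50))  # 0.25
```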
NIST could recommend that organizations assess both the absolute capabilities of their AI systems and the affordances available to them in potential deployment contexts. This approach would provide a more comprehensive view of potential risks than simply comparing models in isolation. For instance, the guidance could suggest evaluating how a system’s capabilities might change when given access to different interfaces, actuators, or information sources.
Similarly, Weidinger et al. (2023) argue that while quantitative benchmarks are important, they are insufficient for comprehensive safety evaluation. They suggest complementing quantitative measures with qualitative assessments, particularly at the human interaction and systemic impact layers. NIST could enhance its guidance by providing more specific recommendations for integrating qualitative evaluation methods alongside quantitative benchmarks.
NIST should acknowledge potential implementation challenges with a comprehensive socio-technical approach. Organizations may struggle to create benchmarks that accurately reflect real-world misuse scenarios, particularly given the rapid evolution of AI capabilities and threat landscapes. Maintaining up-to-date benchmarks in a fast-paced field presents another ongoing challenge. Additionally, organizations may face difficulties in translating quantitative assessments into actionable risk management strategies, especially when dealing with novel or complex risks. NIST could enhance the guidance by providing strategies for navigating these challenges, such as suggesting collaborative industry efforts for benchmark development or offering frameworks for scalable testing approaches.
OpenAI‘s approach of using human participants to evaluate AI capabilities provides both a useful model for more comprehensive evaluation and an example of quantification challenges. While their evaluation attempted to quantify biological risk increase from AI access, they found that, as they put it, “Translating quantitative results into a meaningfully calibrated threshold for risk turns out to be difficult.” This underscores the need for more research on how to set meaningful thresholds and interpret quantitative results in the context of AI safety.
2. Inconsistencies in Risk Management Language
There are instances where the guidance uses varying levels of strength in its recommendations, particularly regarding when to halt or adjust development. For example, Practice 2.2 recommends to “Plan to adjust deployment or development strategies if misuse risks rise to unacceptable levels,” while Practice 3.2 uses stronger language, suggesting to “Adjust or halt further development until the risk of model theft is adequately managed.” This variation in language could lead to confusion and potentially weaker implementation of risk management strategies.
Furthermore, while the guidance emphasizes the importance of managing risks before deployment, it does not provide clear criteria for what constitutes “adequately managed” risk, particularly in the context of development rather than deployment. More consistent and specific language around these critical decision points would strengthen the guidance’s effectiveness in promoting responsible AI development.
Specific recommendations for strengthening language on halting development for unmanageable risks
To address the inconsistencies noted above, we suggest the following changes:
1. Standardize the language across the document to consistently use strong phrasing such as “Adjust or halt further development” when discussing responses to unacceptable levels of risk.
The current guidance uses varying levels of strength in its recommendations regarding development adjustments. For instance, Recommendation 4 of Practice 2.2 uses the phrase “Plan to adjust deployment or development strategies,” while Recommendation 3 of Practice 3.2 more strongly suggests to “Adjust or halt further development.” Consistent language would emphasize the critical nature of these decisions and reduce potential confusion or weak implementation of risk management strategies. This could be accomplished by changing the language of Practice 2.2, Recommendation 4 to “Plan to adjust or halt further development or deployment if misuse risks rise to unacceptable levels before adequate security and safeguards are available to manage risk.”
The need for stronger language regarding halting development is reflected both in NIST’s other work and in commitments that many frontier AI developers have publicly agreed to. For instance, the NIST AI Risk Management Framework, section 1.2.3 (Risk Prioritization), suggests: “In some cases where an AI system presents the highest risk – where negative impacts are imminent, severe harms are actually occurring, or catastrophic risks are present – development and deployment should cease in a safe manner until risks can be sufficiently mitigated.” Further, the AI Seoul Summit frontier AI safety commitments explicitly state that organizations should “set out explicit processes they intend to follow if their model or system poses risks that meet or exceed the pre-defined thresholds.” Importantly, these commitments go on to specify that “In the extreme, organisations commit not to develop or deploy a model or system at all, if mitigations cannot be applied to keep risks below the thresholds.”
2. Add to the list of transparency documentation for Practice 2.2 the following: “A decision-making framework for determining when risks have become truly unmanageable, considering factors like the severity of potential harm, the likelihood of the risk materializing, and the feasibility of mitigation strategies.”
While the current guidance emphasizes the importance of managing risks before deployment (e.g., in Practice 5.3), it does not provide clear criteria for what constitutes “adequately managed” risk, particularly in the context of development rather than deployment. A decision-making framework would provide clearer guidance on when to take the serious step of halting development. This addition would help prevent situations where development continues despite unacceptable risks due to a lack of clear stopping criteria. This recommendation aligns with the approach suggested by Alaga and Schuett (2023) in their paper on coordinated pausing, where they emphasize the need for clear thresholds and decision criteria to determine when AI development should be halted due to unacceptable risks.
3. Gaps in Model Development Safeguards
The guidance’s treatment of safeguards, particularly those related to model development, lacks sufficient detail to be practically useful. This is most evident in Appendix B, which lists example safeguards. While this appendix is a valuable addition, the safeguards related to model training (“Improve the model’s training”) are notably lacking in detail compared to the safeguards around model security and detecting misuse.
While the guidance covers many aspects of risk management comprehensively, especially model security, it does not provide enough specific recommendations for technical approaches to building safer models during the development phase. This gap could limit the practical utility of the guidance for AI developers seeking to implement safety measures from the earliest stages of model creation.
Specific recommendations for additional safeguards for model development
For some safeguards, we recommend that the misuse risk guidance explicitly reference relevant sections of NIST 600-1, the Generative Artificial Intelligence Profile. Specifically, the GAI profile offers more comprehensive guidance on data-related and monitoring safeguards. For instance, the profile emphasizes documenting training data curation policies (MP-4.1-004) and establishing policies for data collection, retention, and quality (MP-4.1-005), which are crucial for managing misuse risk from the earliest stages of development. Additionally, the profile suggests implementing real-time monitoring processes for analyzing generated content performance and trustworthiness characteristics (MG-3.2-006), which could significantly enhance ongoing risk management during development. These references to the GAI Profile on model development safeguards could take the form of an additional item in Appendix B, or be incorporated into the relevant sections earlier in the guidance.
Beyond pointing to the model development safeguards included in the GAI Profile, we also recommend expanding Appendix B to include further safeguards for the model development phase. Both the GAI Profile and the current misuse risk guidance lack specific recommendations for two key model development safeguards: iterative safety testing throughout development and staged development/release processes. Below are two proposed additions to Appendix B:
The proposed safeguard “Implement iterative safety testing throughout development” addresses the current guidance’s limited detail on model training and development safeguards. This approach aligns with the emphasis on proactive and ongoing risk assessment in Barrett et al.’s AI Risk-Management Standards Profile for General-Purpose AI Systems and Foundation Models (the “GPAIS Profile”). Specifically, the Profile recommends identifying “GPAIS impacts…and risks (including potential uses, misuses, and abuses), starting from an early AI lifecycle stage and repeatedly through new lifecycle phases or as new information becomes available” (Barrett et al., 2023, p. 19). The GPAIS Profile further suggests that for larger models, developers should “analyze, customize, reanalyze, customize differently, etc., then deploy and monitor” (Barrett et al., 2023, p. 19), where “analyze” encompasses probing, stress testing, and red teaming. This iterative safety testing would integrate safety considerations throughout development, aligning with the guidance’s emphasis on proactive risk management and anticipating potential misuse risk.
Similarly, the proposed safeguard “Establish a staged development and release process” addresses a significant gap in the current guidance. While Practice 5.1 discusses pre-deployment risk assessment, it lacks a structured approach to incrementally increasing model capabilities or access. Solaiman et al. (2023) propose a “gradient of release” framework for generative AI, a phased approach to model deployment that allows for iterative risk assessment and mitigation. This aligns with the guidance’s emphasis on ongoing risk management and could enhance the ‘margin of safety’ concept in Practice 5.3. Implementing such a staged process would introduce multiple risk assessment checkpoints throughout development and deployment, potentially improving safety outcomes.
Conclusion
NIST’s guidance on “Managing Misuse Risk for Dual-Use Foundation Models” represents a significant step forward in establishing robust practices for mitigating catastrophic risks associated with advanced AI systems. The document’s emphasis on comprehensive risk assessment, transparent decision-making, and proactive safeguards demonstrates a commendable commitment to responsible AI development. However, to more robustly contribute to risk mitigation, the guidance must evolve to address key challenges, including a stronger approach to measuring misuse risk, consistent language on halting development, and more detailed model development safeguards.
As the science of AI risk assessment advances, this guidance should be iteratively updated to address emerging risks and incorporate new best practices. While voluntary guidance is crucial, it is important to recognize that it cannot replace the need for robust policy and regulation. A combination of industry best practices, government oversight, and international cooperation will be necessary to ensure the responsible development of high-risk AI systems.
We appreciate the opportunity to provide input on this important document. FAS stands ready to continue assisting NIST in refining and implementing this guidance, as well as in developing further resources for responsible AI development. We believe that close collaboration between government agencies, industry leaders, and civil society organizations is key to realizing the benefits of AI while effectively mitigating its most serious risks.
Recent Advances in Artificial Intelligence and the Department of Energy’s Role in Ensuring U.S. Competitiveness and Security in Emerging Technologies
Statement For The Record
Chairman Manchin, Ranking Member Barrasso, and members of the Senate Energy and Natural Resources Committee. I appreciate the opportunity to submit this statement on the Department of Energy’s vision for shaping our strategic investments in AI.
The Federation of American Scientists (FAS) is a catalytic, non-partisan, and nonprofit organization committed to using science and technology to benefit humanity by delivering on the promise of equitable and impactful policy. FAS believes that society benefits from a federal government that harnesses science, technology, and innovation to meet ambitious policy goals and deliver impact to the public.
I am the Associate Director for Emerging Technologies and National Security at FAS, where I lead our work on emerging technology policy through the lens of our national security innovation base, with a focus on the strategic competition between the United States and the Chinese Communist Party. I wish to commend your work in bringing the Committee together to discuss the Department of Energy (DOE)’s role in ensuring U.S. competitiveness and security in emerging technologies. This hearing could not have come at a more opportune time.
In March, the Chinese Communist Party (CCP) held its yearly “two sessions” meeting—referring to the coming together of China’s principal political bodies, the National People’s Congress (NPC) and the National Committee of the Chinese People’s Political Consultative Conference (CPPCC)—during which they not only confirmed Xi Jinping’s third term as president but also introduced a set of new policies and government appointments. During this meeting, Xi emphasized the importance of self-reliance in science and technology as a strategic goal to combat Western influence. Meanwhile, the Central Committee revealed plans to restructure the Chinese government to better position China’s national innovation system for driving advancements in both commercial and dual-purpose military-civilian technologies. This latest initiative underscores two decades of unwavering CCP commitment toward indigenous innovation, calibrated specifically to outflank its Western competitors like the United States. And it’s getting results: a recent analysis by the Australian Strategic Policy Institute found that China now leads in 37 out of 44 critical technology areas globally, while Chinese production of high-value patents in the global marketplace has increased by 400% over the past decade.
The Committee’s hearing is exploring a question that is of vital national interest. The two proposals—creating an Office of Critical and Emerging Technology within the DOE and launching the Frontiers in Artificial Intelligence for Science, Security, and Technology (FASST) initiative—could change this trajectory for the better.
First, the creation of an Office of Critical and Emerging Technology within the DOE. This office would enable a robust assessment of U.S. technological competitiveness and prepare us for emerging technology surprises that pose a potential threat to national security. This framework would refine our strategic direction, facilitate rapid threat-response coordination through interagency collaboration with entities such as DoD, DNI, and NSF, and advance proactive countermeasure strategies.
The Office should serve as a hub for innovative practices across all 17 National Labs and 34 user facilities that the DOE stewards. The DOE labs and user facilities have expertise and capabilities that are important in national and international science policy challenges. This office should promote greater participation from our labs to better inform these discussions, thereby effectively fostering a diversity of perspectives within national science policy discourse and international forums, which is ever-critical given the ascending competition from nations including China and Russia in domains like AI, quantum computing, and biotechnology.
Second, the FASST initiative is another imperative. AI’s transformative potential is undeniable, but realizing it demands substantial improvement in fundamental properties such as explainability, trustworthiness, and reliability, especially for mission-critical applications and privacy-sensitive settings.
The DOE, with its high-performance computing prowess, is uniquely positioned to deliver secure and dependable AI solutions to the challenging problems of the century. By leveraging DOE’s world-leading exascale computing capabilities and working closely with key stakeholders in academia, industry, and interagency groups, we can unlock groundbreaking AI innovations.
Efforts must be made to accelerate integrated mathematics and science R&D, particularly foundational AI research aimed at developing secure, trustworthy techniques. Rigorous verification and validation processes, grounded in sound science, can vet new technologies and their societal implications before widespread deployment.
Moreover, expanding foundational research in physics-informed AI could lead to better integration of AI models with our understanding of real-world phenomena. This requires cooperative research across diverse specialties, an endeavor the DOE labs and their partner universities are well equipped to undertake.
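To illustrate what “physics-informed” means in practice, the minimal sketch below, written for a toy problem, fits a small neural network to sparse data while penalizing the residual of an assumed governing equation (here, simple exponential decay). The equation, network, and loss weighting are hypothetical choices made purely for illustration, not elements of any DOE or FASST design.

```python
# Minimal, illustrative sketch of a physics-informed loss (hypothetical toy example):
# fit a small network u(t) to sparse data while penalizing the residual of an
# assumed governing equation du/dt = -k * u.
import torch

torch.manual_seed(0)
k = 1.0  # assumed decay constant for the illustrative ODE

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

# Sparse "measurements" of the true solution u(t) = exp(-k t)
t_data = torch.tensor([[0.0], [0.5], [1.0]])
u_data = torch.exp(-k * t_data)

# Collocation points where only the physics (ODE residual) is enforced
t_phys = torch.linspace(0.0, 2.0, 50).reshape(-1, 1).requires_grad_(True)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    # Data-fitting term on the few measured points
    loss_data = torch.mean((net(t_data) - u_data) ** 2)
    # Physics term: residual of du/dt + k*u = 0 at the collocation points
    u = net(t_phys)
    du_dt = torch.autograd.grad(u, t_phys, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    loss_phys = torch.mean((du_dt + k * u) ** 2)
    loss = loss_data + 1.0 * loss_phys  # the weighting is a tunable assumption
    loss.backward()
    opt.step()
```

The physics-residual term is what allows known governing equations to supplement sparse or expensive measurements, which is the core appeal of this line of research.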
The proposed multi-billion-dollar annual program, involving the DOE Office of Science, the National Nuclear Security Administration, and the applied energy programs, aims to leverage the Department’s unique leadership in computing to create transformative AI hubs focused on solving grand-challenge problems, developing world-class AI technologies, and building cutting-edge testbeds for energy-efficient AI hardware platforms in concert with U.S. industry.
I would also like to emphasize the pivotal role the FASST initiative will play in developing open and secure foundation models for scientific discovery and national security. The objective is to harness unique, highly curated datasets to drive advances and ensure that the United States remains at the helm of science and technology.
Models of this kind, possible only with supercomputing, can offer unprecedented insight into complex processes such as the molecular dynamics underlying additive manufacturing or the behavior of the power grid, leading to a more resilient energy infrastructure. It is also crucial for the DOE to develop classified models to manage threats to our national security, from maintaining space situational awareness to advancing biodefense, nuclear deterrence, and nonproliferation. However, I would urge caution: a frontier model trained on classified data could give adversaries a single point of attack for extracting classified information if they were to gain access to it.
We are observing an unprecedented deployment of large language models and other advanced AI systems, such as AlphaFold 2 and AlphaGo, across the country. Tools and foundation models developed by the DOE could be used to test and validate these systems. This capability is imperative to ensuring that deployed AI models meet safety and ethical standards aligned with our societal values. It would also allow the DOE to assess risks posed by AI models that fall outside U.S. regulatory jurisdiction.
In terms of tool and software development, FASST could develop common platforms for safe, trustworthy AI suitable for high-stakes usage scenarios. This would involve crafting tools and methodologies that enhance the trustworthiness and reliability of AI systems while preserving privacy. It would also require an acute focus on cybersecurity, including classified platforms capable of evaluating potentially adversarial AI systems.
Harnessing both classified and unclassified scientific datasets will be instrumental in this endeavor. By transforming DOE’s leading-edge facilities into a nationwide integrated research infrastructure, we can cultivate a common platform for training and evaluation and derive valuable findings from the world’s largest volumes of scientific data.
Furthermore, FASST will be instrumental in bolstering production capabilities for our nuclear stockpile by advancing the state of the art in foundation models to rapidly validate AI technologies for emerging nuclear security missions. In addition, FASST’s aim of developing new foundation models for unique data types, such as seismic and electromagnetic data, is worthy of support, as these are areas where current capabilities are lacking.
Through these concerted efforts, we aim to combine strides in AI innovation with critical missions in science, security, and technology—encompassing scientific discovery, energy sustainability, and national security. We can harness the rapid evolution of AI while staying ahead of the risks that could compromise our nation’s security and technological leadership.
Over time, transforming DOE facilities into a nationwide integrated research infrastructure can stimulate the deployment of advanced AI research across sectors, improve the use of shared resources, unlock new growth, and reinforce U.S. techno-economic leadership.
In conclusion, these proposed provisions address the urgent need for research, development, and deployment to sustain our global competitiveness in critical emerging technology fields. Proactive investments today promise substantial strategic dividends for our nation’s future by maintaining its vital role in technological innovation while robustly addressing the risks tied to these breakthroughs. At the same time, we must proceed with caution: our adversaries try to gain access to our classified information every hour of every day. Creating frontier models with classified information could provide significant benefits to our national security apparatus, but it could also give our adversaries an easier path to our secrets, so we must do it in a way that ensures these systems are safe, secure, and reliable.
In the end, this is not just about maintaining a competitive edge. It is about national security, about establishing ethical guidelines for the use of technology, about mission-critical deployments where failure is not an option, and about strengthening our global standing through technological leadership.
We believe this strategic investment in critical and emerging technologies will empower our nation to confront 21st-century challenges with solutions that are timely, scientifically rigorous, and security-enhancing. We express our unwavering support for these provisions and encourage their decisive endorsement.
Thank you for considering our views on these pressing topics.
If you have any questions, please reach out to me at dkaushik@fas.org.
Divyansh Kaushik
Associate Director for Emerging Technologies and National Security
Federation of American Scientists
Strengthening the Integrity of Government Payments Using Artificial Intelligence
Summary
Tens of billions of taxpayer dollars are lost every year to improper payments made by the federal government. These improper payments arise from agency and claimant errors as well as outright fraud. Data analytics can help identify errors and fraud, but often does so only after improper payments have already been issued.
Artificial intelligence (AI) in general—and machine learning (ML) in particular (AI/ML)—could substantially improve the accuracy of federal payment systems. The next administration should launch an initiative to integrate AI/ML into federal agencies’ payment processes. As part of this initiative, the federal government should work extensively with non-federal entities—including commercial firms, nonprofits, and academic institutions—to address major enablers and barriers pertaining to applications of AI/ML in federal payment systems. These include the incidence of false positives and negatives, perceived and actual fairness and bias issues, privacy and security concerns, and the use of ML for predicting the likelihood of future errors and fraud.
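To make the false positive/false negative tradeoff concrete, the sketch below shows, using entirely synthetic data and hypothetical feature names, how an ML model might score pending payments for improper-payment risk and how the choice of review threshold, a policy decision rather than a model output, shifts the balance between unnecessary reviews and missed improper payments. It is a minimal illustration, not a proposed agency system.

```python
# Illustrative sketch (synthetic data, hypothetical features): score pending
# payments for improper-payment risk and compare review thresholds that trade
# off false positives (unnecessary reviews) against false negatives (missed
# improper payments).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000
# Hypothetical features: payment amount, prior-error flag, cross-agency data-mismatch score
X = np.column_stack([
    rng.lognormal(mean=7, sigma=1, size=n),  # payment amount (dollars)
    rng.integers(0, 2, size=n),              # claimant prior-error flag
    rng.random(size=n),                      # data-mismatch score in [0, 1)
])
# Synthetic labels: improper payments are made more likely by mismatches and prior errors
p = 1 / (1 + np.exp(-(-3 + 4 * X[:, 2] + 2 * X[:, 1])))
y = rng.random(n) < p

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Scores are probabilities; the review threshold is a policy choice, not a model output.
scores = model.predict_proba(X_test)[:, 1]
for threshold in (0.3, 0.5, 0.7):
    flagged = scores >= threshold
    tn, fp, fn, tp = confusion_matrix(y_test, flagged).ravel()
    print(f"threshold={threshold}: flagged={int(flagged.sum())}, "
          f"false positives={fp}, false negatives={fn}")
```

In a real deployment, the features, labels, and thresholds would be chosen by agencies with careful attention to the fairness, privacy, and security concerns noted above.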