What the Metascience Community Should Learn From the Federal Evidence Movement Before Making Our Mistakes
There is a growing community of people inside and around the federal government who believe we should apply the scientific method to science itself: how grants are awarded, how peer review works, how labs are organized, how R&D portfolios are built. In some circles this is called metascience; in others it goes by science of science, or research on research. The label matters less than the conviction that how we fund and structure science isn’t fixed and that we could be doing it a lot better.
The political moment may be unusually open to acting on this conviction, as R&D institutions face pressures and disruptions not seen since the post-World War II era.
A quick orientation on where things stand: most metascience activity today is external researchers studying government R&D programs from the outside, and that community is growing. Inside the government, interest is picking up: a handful of agencies are starting to think seriously about what internal capacity might look like, with NSF’s proposed metascience unit in the FY2027 budget request as the most visible signal so far. Whether that momentum builds into something more structured, or stays scattered or administration-dependent, remains to be seen.
There’s no Evidence Act equivalent being seriously discussed, but it’s a great moment for laying the groundwork for what comes next. This piece is aimed at both audiences: researchers trying to make their work matter inside agencies, and the agency leaders and staff thinking about standing something up.
I want to be a serious champion for building this capacity inside the government. But I also want to make sure we don’t sleepwalk into a set of traps that I watched swallow another reform movement — one I was part of! — over the last decade. The federal evidence community, which grew dramatically following the Foundations for Evidence-Based Policymaking Act of 2018, had serious ambitions and major accomplishments. It also made structural mistakes that a metascience community could easily repeat. Here’s my take on how we can learn from each other (and what you should steal).
- Design around decisions people need (or want) to make, not just questions the research community finds interesting, and be useful early.
- Know the decision calendar; a finding that arrives late doesn’t exist.
- Co-design with program officers; make their success your success.
- Existence of evidence doesn’t equal use; figure out what motivates the people who need to act.
- Government needs in-house flexibility to do the work.
- Decide whether this is a destination or a waystation and build accordingly.
- Solve the structural problems first.
- External accountability, cross-agency champions, and Congressional relationships are survival infrastructure.
- Episodic engagement is a design failure.
What the evidence community got right (somewhat-evidence-based answer: quite a bit!)
The Evidence Act was a major achievement, both as legislation and as systems change that continues to make stronger policy possible. It normalized the idea that the government can admit knowledge gaps and curiosity, that agencies should be asking hard questions about whether their programs work, and that building the infrastructure to answer them is important (to me, this is a fundamental of democratic governance, something we owe the American people to maintain legitimacy). Asking “does this program actually do what we think it does?” could once read as hostile or politically threatening. The Evidence Act made it standard management practice, and that cultural shift, however incomplete, was not nothing!
The infrastructure that followed (Learning Agendas, Evaluation Officers, CDO Councils, OMB evaluation guidance) created shared vocabulary and accountability that hadn’t existed before. In the agencies where it took hold, it opened space for questions, roles, partnerships, and curiosity that previously had no institutional home. Giving someone a title that made clear their role was to facilitate knowledge generation and translation in a bureaucracy that knows how to build on structural opportunity is a big step. Setting a standard process to collect questions needed for effective governance is huge, culturally and administratively.
External accountability mattered too. OMB guidance, GAO oversight, and congressional interest created pressure that internal motivation alone couldn’t sustain. Compliance requirements work when someone is going to ask about them and care about the response (spoiler: I had to do this a lot, and occasionally explain the difference between, say, audits and evaluations). Where the evidence work shaped decisions, it was usually because someone with budget authority and leadership access wanted it, and because a community of practice built enough shared norms to carry the work across agencies and administrations.
What went wrong (or not as well as it should have) and why metascience can learn from our experiments
Insert here tremendous respect and awe for the evaluation officers and their colleagues who fought the hard fight without the support they should have had.
We built supply without equal attention to demand. Evaluation planning and learning agendas were sometimes produced because Congress and OMB required them, not just because program offices were always asking for answers. Carol Weiss has called this the “two communities” problem for ages: researchers and policymakers operating in parallel universes with different timelines, incentives, and languages. And while the community has iterated on that moniker and concept for a long time, we’ve never quite solved it. Too often the results landed in reports nobody read (if they were published at all!), or in inboxes where they became someone else’s problem, or on a timeline that didn’t match decisionmaking. The basic customer question (who needs this, when, and in what form) wasn’t asked enough, and when it was, we didn’t have great leverage to change course.
We got divorced from the workflow. Evaluations routinely finished after the budget cycles and policy windows they were meant to inform. The evidence community struggled to map its work to actual decision points: appropriations timelines, leadership transitions, program reauthorizations. While the evidence community would be well served by considering a range of flexible and timely evidence models, gold-standard evidence methods like Randomized Controlled Trials of major programs can and do take time (certainly more time than a single fiscal year). Unsurprisingly, format mattered too: the people who needed to act needed a two-pager or, better, a conversation, far more than a technical report delivered six months after the window had closed.
We (cringe) made ourselves hard to work with. The evidence community was often expert-centric rather than partner-centric, more focused on what constituted the highest quality legitimate evidence than on what would be useful and approachable, and on what timeline (see Jen Pahlka’s thinking on “stop energy” vs. “go energy”). The vocabulary was sometimes alienating and methodological gatekeeping was a real downer. More structurally, evaluation offices were sometimes poorly located organizationally, sitting outside program design and budget processes where leverage lived, and relationships upstream or downstream didn’t always come naturally.
We had a LOT of questions but buried them where no one could find them. On the other side of the equation, we too often made a reasonably good effort at compiling our research and evaluation questions in Learning Agendas and then did the government equivalent of post and pray, launching a PDF deep on a federal website without the requisite effort to connect it to researchers who would’ve loved to follow up. There were great exceptions: outside the government, I participated in a “matchmaking” session on the President’s Management Agenda Learning Agenda, connecting federal leaders with research teams excited to engage on their challenges. The OMB evidence lead I was privileged to work with created a Learning Agenda Questions Dashboard (on evaluation.gov, RIP), and the “evidence project portal” to consolidate opportunities for outside researchers.
We lost the hiring, funding, and buying battles. The Evidence Act directed OPM to develop a hiring classification to support building out the evaluation community. As the person at OMB responsible for pushing that effort (years after the deadline), I watched OPM’s underresourced and sometimes calcified approach to classification make this so challenging that colleagues described it as the worst professional experience of their careers. As an ongoing consequence, agencies defaulted to using generic job series for evidence functions that couldn’t elevate qualified people. Evaluation officers are frequently double- and triple-hatted as performance managers, data scientists, and learning officers, often with no dedicated staff, no protected budget, and no solid career path. Likewise, the paths to funding research were highly varied and full of dragons. I could not in good faith consistently tell an agency “here’s how to get your high priority research funded” because it was so variable across agencies. On the buying side, unwieldy procurement vehicles added unnecessary burden to a process that already struggled to get RFPs out the door.
We struggled with the theory of adoption. The simplistic foundational assumption was: create the requirement, do the study, policymakers use it; policymakers create a program, evidence is generated, change is made. It SOUNDS right but in practice so much was wrong in that chain because it didn’t consider incentives and timelines. Who needs this finding the most, and when? What would motivate them to change their behavior? What’s standing in their way? Am I asking a question they can act on? Even when the evidence was good, the pathway from finding to decision was assumed rather than designed.
We kept building administrative burden while assuming people wanted it. Learning Agendas and Annual Evaluation Plans and Policies are great concepts and valuable ways to bring learning and policy communities together. But even in the best of worlds these were still compliance requirements layered on top of staff who were already stretched, and in the worst, when done badly, they overcomplicated what should have been a culture-changing moment. A metascience function that responds to that history by adding more reporting requirements would be its own kind of failure. The goal should be fewer dragons and headaches on the path from question to useful answer.
And we struggled with politics. The truth is that many policy leaders don’t want to know if their idea won’t work or didn’t work. Publishing work that shows waste to taxpayers is politically costly, and that problem doesn’t disappear because a law requires evaluation plans. Likewise, sometimes programs do work well and the evidence shows it brilliantly, but politics means that success is less desirable to advertise.
But failures weren’t all inside the government. The academic communities best positioned to do rigorous, policy-relevant evaluation work faced their own incentive problems. Publishing in top journals rewards novelty, methodological elegance, and positive findings (even if you have to p-hack your way there); relevance to a policymaker’s actual questions is less important. The researcher who produces a technically brilliant study and never engages with the agency whose program they studied is likely more fully rewarded by their institution than one who supports policy design. Fortunately, there are researchers across disciplines who care about public impact, and there are organizations like the Evidence-to-Impact Collaborative at Penn State doing serious work to build the infrastructure that makes researcher-policymaker relationships function. But consistently orienting the research community toward the questions that matter inside agencies is a question metascience will inherit too.
Hark! There is a Fork in the Road!
The emerging federal metascience community is asking fascinating questions that are equally vital for democratic legitimacy: beyond “did this program work” to “how does the federal R&D enterprise itself work, and how could it work better?”
But it faces the same fork in the road at an even more disruptive moment. The institutions being studied are changing fast, and interest in metascience inside the government is emerging alongside real disruption to the research enterprise. That combination is an argument for urgency: the window to shape how internal metascience capacity gets built may be shorter than anyone expected. A unit stood up quickly, without a protected budget or independent authority, narrowly focused on politically convenient questions, and with no plan for continuity: that’s a real risk. The design choices that prevent it aren’t complicated, but they have to happen early.
A metascience function that produces insights about peer review and grant mechanisms without building serious demand from program officers is the evidence community’s supply problem in a new form. A “Metascience Officer” role with no potential for career path or growth, no protected budget, no customer or audience, and competing responsibilities is the Evaluation Officer problem with a different name. Learning agenda questions about R&D mechanisms that nobody follows up on become checkboxes. Evidence that never reaches the room where program design decisions happen, regardless of its quality, has no impact.
Part of what makes institutional design so hard is that the distance between “we produced amazing insights” and “that knowledge changed anything” can be enormous. Experts at the Institutional Architecture Lab have a great framework here. They distinguish between institutions that produce knowledge (authoritative but loosely coupled to action), institutions that have knowledge formally embedded in decision processes (where findings must be engaged), and institutions where specific evidence thresholds trigger changes in practice. The Evidence Act was designed for the middle category and often ended up in the first.
Before we tell you where to go next, a note that applies to both communities: the questions metascience is asking aren’t exactly new inside agencies. Learning Agendas have been wrestling with peer review design, funding mechanisms, and portfolio effectiveness for years: imperfectly, under-resourced, but with real interest and curiosity. Arriving like you’re the first person to notice the building is on fire is a real pattern in the good governance world, and it’s one the evidence community sometimes got too good at before metascience got here. Ask what’s already been tried before you propose what’s next. It’s faster and it might save you from reinventing something that already didn’t work.
What to do instead: a checklist we wish we’d had
- Start with demand AND supply. Map the actual decisions agency leadership faces, like peer review redesigns, new funding mechanisms, portfolio rebalancing, and build the research agenda around those decisions instead of around what the metascience community finds most interesting. Before you build anything, build relationships with the people who will act on what you find. Understand what questions keep them up at night.
- Master the workflow problem. Know the decision calendar and what inputs people will actually read, in what format, and when. A finding that arrives after the window has closed doesn’t exist for practical purposes.
- Embed partnership in the working model. Co-design questions with program officers and make their success your success. Whether metascience becomes a resource people seek out or an office people avoid is something you can shape now.
- Take incentives seriously. Just because a metascience function exists doesn’t mean program officers will care, or that agency leaders will act on what it produces, or Congress will be curious. What are program officers actually rewarded for? What are agency leaders trying to protect? What would make peer reviewers engage differently with evidence about their own processes?
- Develop in-house capacity in addition to solid relationships with the outside. While it’s vital to find consistent and reliable communication paths between government and external research institutions, the government also needs some internal capacity so it can be more responsive, flexible, and secure on time-sensitive and issue-sensitive questions.
- Design the talent model with purpose instead of happenstance. Is this a destination or a waystation? A fixed-term appointment that makes people more attractive when they leave, building an alumni network that carries the practice forward? Or a permanent career function that builds institutional memory? Both have a role: pure rotation and you lose institutional memory; pure permanence and you lose touch with the field. Think about where people come from, where they go, and what signal the function sends about whether this is meaningful work or a backwater.
- Build for durability. External accountability, cross-agency benchmarking, and champions in OMB and Congress are what keep a function alive across administrations. Build them early, when you have momentum and goodwill (by the way, though evidence work is still doing democracy good across government, the Evidence Team I led at OMB doesn’t exist anymore).
- Invest in relationships before you need them. One of the deepest structural failures in the evidence community was treating researcher-policymaker relationships as something that happened naturally, or that individual researchers could maintain on their own. Individual researchers can’t track decision-makers across election cycles, persist through staff turnover, or stay useful for years before they need anything back. And FAS research on local governments shows that policymakers often struggle to find the “front door” into research partnerships, even when they do want to build those relationships. The result is that academic engagement with policy tends to be episodic: it activates when someone needs something, fades when the grant or policy window ends, and depends entirely on who happens to know whom (there’s also interesting research by Max Crowley and colleagues that suggests all these ties are better built early in careers and levels of influence, on all sides). A well-designed metascience function has the ability to solve that “front door” problem, but it should treat relationship-building as a core function, rather than assuming it will happen automatically, and invest in presence before anyone needs help.
The steps taken over the last several years to build federal evaluation capacity were good ones. The people who did that work were serious, and they built something real under difficult conditions. We hope this piece lands as what it’s meant to be: a love letter to that work, and a friendly peer review of the structural choices that will determine whether metascience does better.