The Data We Take for Granted: Telling the Story of How Federal Data Benefits American Lives and Livelihoods

Across the nation, researchers, data scientists, policy analysts and other data nerds are anxiously monitoring the demise of their favorite federal datasets. Meanwhile, more casual users of federal data continue to analyze the deck chairs on the federal Titanic, unaware of the coming iceberg as federal cuts to staffing, contracting, advisory committees, and funding rip a giant hole in our nation’s heretofore unsinkable data apparatus. Many data users took note when the datasets they depend on went dark during the great January 31 purge of data to “defend women,” but then went on with their business after most of the data came back in the following weeks. 

Frankly, most of the American public doesn’t care about this data drama. 

However, like many things in life, we’ve been taking our data for granted and will miss it terribly when it’s gone. 

As the former U.S. Chief Data Scientist, I know first-hand how valuable and vulnerable our nation’s federal data assets are. However, it took one of the deadliest natural disasters in U.S. history to expand my perspective from that of just a data user to that of a data advocate for life.

Twenty years ago this August, Hurricane Katrina struck New Orleans. The failure of the federal flood-protection infrastructure flooded 80% of the city, resulting in devastating loss of life and property. As a data scientist working in New Orleans at the time, I’ll also note that Katrina rendered all of the federal data about the region instantly historical.

Our world had been turned upside down. Previous ways of making decisions were no longer relevant, and we were flying blind without any data to inform our long-term recovery. Public health officials needed to know where residents were returning so they could set up clinics for tetanus shots. Businesses needed to know the best locations to reopen. City Hall needed to know where families were returning so it could prioritize which parks to rehabilitate first.

Normally, federal data, particularly from the Census Bureau, would answer these basic questions about population, but I quickly learned that the federal statistical system isn’t designed for rapid, localized changes like those New Orleans was experiencing. 

We explored proxies for repopulation: night-lights data from NASA, traffic patterns from the local regional planning commission, and even water and electricity hookups from utilities. It turned out that our most effective proxy came from an unexpected source: a direct mail marketing company. In other words, we decided to use junk mail data to track repopulation.

Access to direct mail company Valassis’ monthly data updates was transformative, like switching on a light in a dark room. Spring break volunteers, previously tasked with surveying neighborhoods to identify which houses were occupied, could now focus on repairing damaged homes. Nonprofits used evidence of returning residents to secure grants for childcare centers and playgrounds.

Even the police chief used this “junk mail” data. The city’s crime rates were artificially inflated because their denominator, the annual Census population estimate, couldn’t keep pace with the rapid repopulation. Displaced residents had been afraid to return because of the sky-high crime rates, and the junk mail denominator offered a more timely, accurate picture.

I had two big realizations during this tumultuous period:

  1. Though we might be able to MacGyver some data to fill the immediate need, there are some datasets that only the federal government can produce, and
  2. I needed to expand my worldview from being just a data user to also being an advocate for the high-quality, timely, detailed data we need to run a modern society.

Today, we face similar periods of extreme change. Socio-technological shifts from AI are reshaping the workforce; climate-fueled disasters are coming at a rapid pace; and federal policies and programs are undergoing massive changes. All of these changes will affect American communities in different ways. We’ll need data to understand what’s working, what’s not, and what to do next.

For those of us who rely on federal data in small or large ways, it’s time to champion the federal data we often take for granted. It’s also going to be critical that we, as active participants in this democracy, take a close look at the downstream consequences of weakening or removing any federal data collections.

There are upwards of 300,000 federal datasets. Here are just three that demonstrate their value:

  1. Bureau of Justice Statistics’ National Crime Victimization Survey (NCVS): The NCVS is a sample survey produced through a collaboration between the Department of Justice and the Census Bureau that asks people whether they have been victims of crime. It’s essential because it reveals the degree to which different types of crimes are underreported. Knowing the degree to which crimes like intimate partner violence, sexual assault, and hate crimes tend to be underreported helps law enforcement agencies better interpret their own crime data and protect some of their most vulnerable constituents.
  2. NOAA’s Argo fleet of profiling floats: Innovators in the autonomous shipping industry depend on NOAA data such as that collected by the Argo fleet of profiling floats – an international collaboration that measures global ocean conditions. These detailed data train AI algorithms to find the safest and most fuel-efficient ocean routes.
  3. USGS’ North American Bat Monitoring Program: Bats save the American agricultural industry billions of dollars annually by consuming insects that damage crops. Protecting this essential service requires knowing where bats are. The USGS North American Bat Monitoring Program database is an essential resource for developers of projects that could disturb bat populations – projects such as highways, wind farms, and mining operations. This federal data not only protects bats but also helps streamline permitting and environmental impact assessments for developers.

If your work relies on federal data like these examples, it’s time to expand your role from a data user to a data advocate. Be explicit about the profound value this data brings to your business, your clients, and ultimately, to American lives and livelihoods.

That’s why I’m devoting my time as a Senior Fellow at FAS to building EssentialData.US to collect and share the stories of how specific federal datasets can benefit everyday Americans, the economy, and America’s global competitiveness.

EssentialData.US is different from a typical data use case repository. The focus is not on the user – researchers, data analysts, policymakers, and the like. The focus is on who ultimately benefits from the data, such as farmers, teachers, police chiefs, and entrepreneurs. 

A good example is the Department of Transportation’s T-100 Domestic Segment Data on airplane passenger traffic. Analysts in rural economic development offices use these data to make the case for airlines to expand to their market, or for state or federal investment to increase an airport’s capacity. But it’s not the data analysts who benefit from the T-100 data. The people who benefit are the cancer patient living in a rural county who can now fly directly from his local airport to a metropolitan cancer center for lifesaving treatment, and the college student who can make it back to her hometown for her grandmother’s 80th birthday without missing class.

Federal data may be largely invisible, but it powers so many products and services we depend on as Americans, starting with the weather forecast when we get up in the morning. The best way to ensure that these essential data keep flowing is to tell the story of their value to the American people and economy. Share the story of your favorite dataset with us at EssentialData.US.

Kickstarting Collaborative, AI-Ready Datasets in the Life Sciences with Government-funded Projects

In the age of Artificial Intelligence (AI), large, high-quality datasets are needed to move the life sciences forward. However, the research community lacks strategies to incentivize collaboration on high-quality data acquisition and sharing. The government should fund collaborative roadmapping, certification, collection, and sharing of large, high-quality datasets in the life sciences. In such a system, nonprofit research organizations would engage scientific communities to identify key types of data that would be valuable for building predictive models and would define quality control (QC) and open science standards for collecting those data. Funded projects would develop automated methods for data collection, certify data providers, and facilitate data collection in consultation with researchers across scientific communities. Hosting of the resulting open data would be subsidized and protected by security measures. This system would provide crucial incentives for the life science community to identify and amass large, high-quality open datasets of immense benefit to researchers.

Challenge and Opportunity 

Life science has left the era of “one scientist, one problem.” It is becoming a field in which collaboration on large-scale research initiatives is required to make meaningful scientific progress. A salient example is AlphaFold2, a machine learning (ML) model that was the first to predict how a protein folds with an accuracy meeting or exceeding that of experimental methods. AlphaFold2 was trained on the Protein Data Bank (PDB), a public data repository containing the standardized, highly curated results of more than 200,000 experiments collected over 50 years by thousands of researchers.

Though such a sustained effort is laudable, science need not wait another 50 years for the ‘next PDB’. With a strategic, collaborative approach built on careful problem specification and deliberate management, the data necessary to train ML models can be acquired more quickly, cheaply, and reproducibly than the PDB was. First, by leveraging organizations that are deeply connected with relevant experts, unified projects taking this approach can account for the needs of both the people producing the data and those consuming it. Second, by centralizing plans and accountability for data and metadata standards, these projects can enable rigorous and scalable multi-site data collection. Finally, by securely hosting the resulting open data, these projects can evaluate biosecurity risk and provide protected access to key scientific data and resources that might otherwise be siloed in industry. This approach complements efforts that collate existing data, such as the Human Cell Atlas and the UCSC Genome Browser, and satisfies the need for new data collection that adheres to QC and metadata standards.

In the past, mid-sized grants have allowed multi-investigator scientific centers, like the recently funded Science and Technology Center for Quantitative Cell Biology (QCB; $30M, awarded in 2023), to explore many areas in a given field. Here, we outline how the government can expand upon such schemes to catalyze the creation of impactful open life science data. In the proposed system, supported projects would allow well-positioned nonprofit organizations to facilitate the distributed, multidisciplinary collaborations necessary for assembling large, AI-ready datasets. This model would align research incentives and enable life science to create the ‘next PDBs’ faster and more cheaply than before.

Plan of Action 

Existing initiatives have developed processes for creating open science data and successfully engaged the scientific community to identify targets for the ‘next PDB’ (e.g., Chan Zuckerberg Initiative’s Open Science program, Align’s Open Datasets Initiative). The process generally occurs in five steps:

  1. A multidisciplinary group of scientific leaders identifies target datasets, assessing the scale of data required and the potential for standardization, and defines standards for data collection methods and corresponding QC metrics (a minimal illustrative sketch follows this list).
  2. Methods for data acquisition are collaboratively developed and certified to de-risk the cost per data point and establish the utility of the data.
  3. Data collection methods are onboarded at automation partner organizations, such as NSF BioFoundries and existing National Labs, and these automation partners are certified as meeting the defined data collection standards and QC metrics.
  4. Scientists throughout the community, including those at universities and for-profit companies, can request data acquisition, which is coordinated, subsidized, and analyzed for quality.
  5. Data is made publicly available and hosted in a stable, robustly maintained database, protected by biosecurity, cybersecurity, and privacy measures, for researchers to access in perpetuity.
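
To make Steps 1 and 3 more concrete, the sketch below shows one way a community-defined data standard and the corresponding batch-level checks could be encoded. It is a minimal illustration under assumed conventions: the field names, QC metrics, thresholds, and ontology choices are hypothetical, and nothing here is an existing specification of the proposed system.

from dataclasses import dataclass

# Hypothetical encoding of a community-defined dataset standard and a batch-level
# check against it. All names, fields, and thresholds are illustrative assumptions.

@dataclass
class DatasetStandard:
    name: str                        # e.g., a roadmapped dataset concept
    required_metadata: list[str]     # metadata fields every batch must report
    ontologies: dict[str, str]       # metadata field -> controlled vocabulary to use
    qc_thresholds: dict[str, float]  # QC metric -> minimum acceptable value

@dataclass
class DataBatch:
    metadata: dict[str, str]         # metadata reported by the collecting site
    qc_metrics: dict[str, float]     # QC metrics computed on the submitted data

def check_batch(batch: DataBatch, standard: DatasetStandard) -> list[str]:
    """Return a list of problems; an empty list means the batch meets the standard."""
    problems = []
    for field_name in standard.required_metadata:
        if field_name not in batch.metadata:
            problems.append(f"missing metadata field: {field_name}")
    for metric, minimum in standard.qc_thresholds.items():
        value = batch.qc_metrics.get(metric)
        if value is None or value < minimum:
            problems.append(f"QC metric {metric} below threshold {minimum}: {value}")
    return problems

In practice, the roadmapping community (Step 1) would own the contents of such a standard, while automation partners (Step 3) would be certified against it; the point is only that the standard be explicit, versioned, and machine-checkable.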

The U.S. Government should adapt this process for collaborative, AI-ready data collection in the life sciences by implementing the following recommendations:  

Recommendation 1. An ARPA-like agency — or agency division — should launch a Collaborative, AI-Ready Datasets program to fund large-scale dataset identification and collection.

This program should be designed to award two types of grants:

  1. A medium-sized “phase 1” award of $1M-$5M to fund new dataset identification and certification. To date, roadmapping dataset concepts (Steps 1-2 above) has been accomplished by small-scale, community-driven projects of $1M-$5M. Though successful in select cases, these projects have not been as comprehensive or inclusive as they could be. Government funding could more sustainably and systematically support iterative roadmapping and certification in areas of strategic importance.
  2. A large “phase 2” award of $10M-$50M to fund the collection of previously identified datasets. Currently, there are no funding mechanisms designed to scale up acquisition (Steps 3-4 above) for dataset concepts that have been deemed valuable and de-risked. To fill this gap, the government should leverage existing expertise and collaboration across the nonprofit research ecosystem by awarding grants of $10M-$50M for the coordination, acquisition, and release of mature dataset concepts. The Human Genome Project is a good analogy: a dataset concept was identified and collection was distributed among several facilities.

Recommendation 2. The Office of Management and Budget should direct the NSF and NIH to develop plans for funding academics and for-profit companies, with disbursement tranched on data deposition.

Once an open dataset is established, the government can advance the use and further development of that dataset by providing grants to academics that are tranched on data deposition. This approach would be in direct alignment with the government’s goals of supporting open, shared resources for AI innovation as laid out in Section 5.2 of the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.

Agencies’ approaches to meeting this priority could vary. In one scenario, a policy or program could be established in which grantees would use a portion of the funds disbursed to them to pay for open data acquisition at a certified data provider. Analogous structures have enabled scientists to access other types of shared scientific infrastructure, such as the NSF’s ACCESS program. In the same way that ACCESS offers academics access to compute resources, it could be expanded to offer academics access to data acquisition resources at certified facilities. Offering grants in this way would incentivize the scientific community to interact with and expand upon open datasets, and tranching would encourage compliance.

Efforts to support use and development of open, certified datasets could also be incorporated into existing programs, including the National AI Research Resource, for which complementary programs could be developed to provide funding for standardized data acquisition and deposition. Similar ideas could also be incorporated into core programs within NSF and NIH, which already disburse funds after completion of annual progress reports. Such programs could mandate checks for data deposition in these reports.

Conclusion 

Collaborative, AI-ready datasets would catalyze progress in many areas of life science, but realizing them requires innovative government funding. By supporting coordinated projects that span dataset roadmapping, methods and standards development, partner certification, distributed collection, and secure release at large scale, the government can coalesce stakeholders and deliver the next generation of powerful predictive models. To do so, it should combine small, mid-sized, and tranched grants in unified initiatives orchestrated by nonprofit research organizations, which are uniquely positioned to execute these initiatives end to end. These initiatives should balance intellectual property protection with data availability, and thereby help deliver key datasets upon which new scientific insights depend.

This action-ready policy memo is part of Day One 2025 — our effort to bring forward bold policy ideas, grounded in science and evidence, that can tackle the country’s biggest challenges and bring us closer to the prosperous, equitable, and safe future that we all hope for, whoever takes office in 2025 and beyond.

PLEASE NOTE (February 2025): Since publication, several government websites have been taken offline. We apologize for any broken links to once-accessible public data.

Frequently Asked Questions
What is involved in roadmapping dataset opportunities?

Roadmapping dataset opportunities, which can take up to a year, requires convening experts across multiple disciplines, including experimental biology, automation, and machine learning. Working in collaboration, these experts assess both the feasibility and the impact of candidate datasets, as well as the necessary QC standards. Roadmapping culminates in a determination of dataset value: whether the dataset can be used to train meaningful new machine learning models.

Why should data collection be centralized but redundant?

To mitigate single-facility risk and promote site-to-site interoperability, data should be collected across multiple sites. To ensure that standards and organization hold across sites, planning and documentation should be centralized.

How should automation partners be certified?

Automation partners will be evaluated according to the following criteria:

  • Commitment to open science
  • Rigor and consistency in methods and QC procedures
  • Standardization of data and metadata ontologies

More specifically, certification will depend upon partners’ ability to accommodate standardized ontologies, capture sufficient metadata, and reliably pass data QC checks. It will also require partners to demonstrate a commitment to data reusability and replicability and a willingness to share methods and data in the open science ecosystem.
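
As a rough illustration of how the QC-related criteria above could translate into an automated gate, the sketch below certifies a partner only if a sufficiently large share of its pilot submissions passes the checks defined by a dataset standard (see the sketch under the Plan of Action). The minimum batch count and pass rate are hypothetical assumptions, and criteria such as commitment to open science would still require human review.

# Hypothetical certification gate for an automation partner (illustrative only).
# Each pilot submission is assumed to have already been checked against the
# dataset standard, yielding a list of problems per submission (empty = pass).

def certify_partner(submission_problems: list[list[str]],
                    min_submissions: int = 20,        # assumed evidence threshold
                    required_pass_rate: float = 0.95  # assumed reliability bar
                    ) -> bool:
    """Certify a partner if enough pilot batches reliably pass the standard's checks."""
    if len(submission_problems) < min_submissions:
        return False  # not enough pilot batches to judge reliability
    passes = sum(1 for problems in submission_problems if not problems)
    return passes / len(submission_problems) >= required_pass_rate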

Should there be an embargo before data is made public?

Today, scientists have no obligation to publish every piece of data they collect. In an open data paradigm, all data must eventually be shared. For some types of data, a short, optional embargo period would enable scientists to participate in open data efforts without compromising their ability to file patents or publish papers. For example, in protein engineering, the patentable product is the sequence of a designed protein, making immediate release of data untenable. An embargo period of one to two years is sufficient to alleviate this concern and may even hasten data sharing by linking release to a fixed length of time after collection rather than to publication. Whether an embargo should be implemented, and how long it should last, should be determined for each data type and designed to encourage researchers to participate in the acquisition of open data.

How do we ensure biosecurity of the data?

Biological data is a strategic resource and requires stewardship and curation to ensure it has maximum impact. Thus, data generated through the proposed system should be hosted by high-quality providers that adhere to biosecurity standards and enforce embargo periods. Appropriate biosecurity standards will be specific to different types of data and should be formulated and periodically reevaluated by a multidisciplinary group of stakeholders. When access to certified, post-embargo data is requested, the same standards will apply, as will export controls. In some instances, restricting access for some users may be reasonable. Because they offer this suite of valuable services, hosting providers should be subsidized through reimbursements.