How many Americans have COVID-19?

Using CDC data about flu-like illnesses, we estimate that between March 1 and April 4, only 19% of symptomatic COVID-19 patients who visited their doctor were able to get tested. This percentage increased from nearly zero at the beginning of March to 60% at the end. While low, this estimate is consistent with the limited availability of testing during this period. We used the same data set to study the 2009-2010 H1N1 pandemic, and estimated that of the 61 million Americans who had H1N1, only 14% visited a primary care provider with symptoms. If we assume the same percentage applies to COVID-19, we can estimate that only one in 40 COVID-19 cases was detected between March 1 and April 4, and that over 12 million Americans were infected.

Using data and BigQuery ML to estimate hidden COVID-19

Over the last few weeks, the United States has experienced something not seen since 1957: a deadly, nationwide pandemic. Despite all the great work done by epidemiologists and journalists to document the outbreak, we still can't answer a critical question: How many people have contracted COVID-19? Only a tiny fraction of the population has tested positive, but this likely reflects the lack of testing rather than the true prevalence of the virus [1].

Until we have randomized serological surveys to assess the true scope of the pandemic, is there another way to estimate the true number of infected Americans? Perhaps [2]. For over 20 years, the Centers for Disease Control and Prevention (CDC) has been monitoring the prevalence of flu and flu-like illnesses through the Influenza-like Illness Surveillance Network (ILINet). Every week, the CDC publishes several key statistics:

How many people saw a participating primary care provider?
How many of these patients had flu-like symptoms, defined as a fever of at least 100°F and a cough or sore throat?
How many respiratory samples were tested by participating clinical and public health laboratories?
How many of these specimens tested positive for flu?

We can use these data points to calculate two key facts: the percentage of patients with flu-like symptoms, and the percentage of samples that test positive for flu. Over the last 20 years, these two metrics have been tightly correlated every year:

The x-axis of this figure goes from July of one year to June of the following year, so the peak of the flu season is in the center of the chart. Notably, in March of 2020, the relationship between the percentage of patients with ILI (in red) and the percentage of specimens that test positive for flu (in blue) breaks down:

The "extra peak" in the chart occurs at the same time as the COVID-19 pandemic. The percentage of patients presenting with flu-like symptoms in March is much higher than we should expect given the low prevalence of flu in tested specimens. Not only that, March is not typically a high flu month. The number of people presenting with flu-like symptoms this March (in red) is unprecedented, with the sole exception of the 2009-2010 H1N1 pandemic (in green):

To quantify how unexpected this peak is, we used BigQuery ML to fit a simple linear model, where the percentage of patients with flu-like symptoms is predicted by a seasonal trend + the percentage of specimens that test positive for flu. It's an excellent fit to the data, with the notable exception of March 2020, in red:

Looking at just the current flu season, you can see that the percentage of patients visiting primary care providers with flu-like symptoms (in red) is accurately predicted by the model (in blue), until it suddenly diverges at the beginning of March:

We used this model's prediction to calculate the number of "excess" patients with flu-like symptoms each week [3]. We determined that in the five weeks between March 1 and April 4, about 3.5% of patient visits to primary care providers can be attributed to COVID-19.
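
To make the "excess" calculation concrete, here is a minimal sketch in Python with NumPy rather than BigQuery ML. It is not our production query: the data is synthetic, the seasonal trend is assumed to be a single yearly sine/cosine pair, and the 95% bound is approximated from the residual standard deviation. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weekly data (NOT ILINet): 5 seasons of 52 weeks each.
n_weeks = 5 * 52
week = np.arange(n_weeks)
phase = 2 * np.pi * (week % 52) / 52
pct_flu_positive = 12 + 10 * np.cos(phase) + rng.normal(0, 1.0, n_weeks)
pct_ili = (1.0 + 0.3 * np.cos(phase) + 0.1 * pct_flu_positive
           + rng.normal(0, 0.1, n_weeks))

# Inject an unexplained bump into the last 5 weeks to mimic March 2020.
pct_ili[-5:] += 2.0


def design_matrix(week, pct_flu_positive):
    """Intercept + assumed yearly harmonic + flu-positive percentage."""
    phase = 2 * np.pi * (week % 52) / 52
    return np.column_stack(
        [np.ones(len(week)), np.sin(phase), np.cos(phase), pct_flu_positive]
    )


# Fit ordinary least squares on every week before the bump.
train = slice(0, n_weeks - 5)
X_train = design_matrix(week[train], pct_flu_positive[train])
coef, *_ = np.linalg.lstsq(X_train, pct_ili[train], rcond=None)

# Approximate upper edge of a 95% interval from the residual spread.
resid_std = np.std(pct_ili[train] - X_train @ coef, ddof=X_train.shape[1])
predicted = design_matrix(week, pct_flu_positive) @ coef
upper = predicted + 1.96 * resid_std

# "Excess ILI": the part of the observed percentage above the interval.
excess_ili = np.clip(pct_ili - upper, 0, None)
print("excess ILI, last 5 weeks (% of visits):", np.round(excess_ili[-5:], 2))
```

Measuring excess from the top of the interval rather than from the point prediction is the conservative choice: ordinary week-to-week noise is not counted as COVID-19.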

How do we get from here to the number of people with COVID-19 in the United States? We're going to need to make some additional assumptions.

First, we need to estimate the number of people who visit the doctor each week. According to the CDC, the average American sees a primary-care provider 1.5 times per year. We can verify in the ILINet data that primary-care providers see about the same number of patients per week throughout the year, so we'll assume 0.03 visits per person per week.
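
The weekly visit rate is simple arithmetic, sketched below. The population figure used for the scale-up is an illustrative assumption and does not come from the CDC source.

```python
WEEKS_PER_YEAR = 52
visits_per_person_per_year = 1.5  # CDC figure quoted above

# About 0.029, which we round to 0.03 in the text.
visits_per_person_per_week = visits_per_person_per_year / WEEKS_PER_YEAR
print(f"{visits_per_person_per_week:.3f} visits per person per week")

# Illustrative scale-up; the population value is an assumption.
us_population = 330_000_000
weekly_visits = 0.03 * us_population
print(f"about {weekly_visits / 1e6:.1f} million primary-care visits per week")
```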

Next, we need to estimate the percentage of people with flu-like symptoms who don't go to the doctor. To estimate this, we're going to take advantage of the 2009-2010 H1N1 pandemic. During the H1N1 pandemic, there was very little non-H1N1 flu, so we can use the same ILINet data to estimate the number of people with H1N1 who went to a primary care provider [4]. After the H1N1 pandemic, the CDC conducted a nationwide serological survey and determined that 60.8 million Americans had H1N1. Using these two numbers, we estimated that 86% of people with flu-like symptoms don't see the doctor.
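
As a rough sketch of what that percentage implies, the snippet below works backward from the two published figures. The implied visitor count is derived here for illustration and is not a number reported by the CDC.

```python
h1n1_infected = 60_800_000      # CDC serological estimate quoted above
share_not_seeing_doctor = 0.86  # our estimate quoted above

share_seeing_doctor = 1 - share_not_seeing_doctor

# Implied number of H1N1 patients who saw a primary care provider.
implied_h1n1_visitors = share_seeing_doctor * h1n1_infected

# Each patient seen by a provider stands in for this many infections.
multiplier = 1 / share_seeing_doctor

print(f"implied H1N1 visitors: {implied_h1n1_visitors / 1e6:.1f} million")
print(f"infections per patient seen: {multiplier:.1f}")
```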

Between March 1 and April 4, there were 311 thousand confirmed cases of COVID-19 in the United States. Using our model, we estimate that an additional 1.3 million COVID-19 patients visited their doctors during this time period but were not able to get tested, and an additional 10.7 million people were infected but never visited a primary care provider.
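
The headline figures follow from these three numbers. The sketch below uses only the rounded values quoted in this article, so the results are approximate.

```python
confirmed = 311_000            # confirmed cases, March 1 to April 4
untested_visitors = 1_300_000  # visited a doctor but were not tested
never_visited = 10_700_000     # infected, never saw a provider

visited = confirmed + untested_visitors
total_infected = visited + never_visited

tested_share = confirmed / visited            # share of visitors who got tested
detection_ratio = total_infected / confirmed  # infections per confirmed case
visit_share = visited / total_infected        # share of infected who saw a doctor

print(f"total infected: {total_infected / 1e6:.1f} million")
print(f"tested share of visitors: {tested_share:.0%}")
print(f"one confirmed case per {detection_ratio:.0f} infections")
print(f"share who visited a doctor: {visit_share:.0%}")
```

The visit share comes out near 13% rather than exactly 14%, presumably because the inputs above are rounded.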

These estimates rely on many assumptions, all of which are subject to uncertainty. We've done our best to validate them with data. Fortunately, randomized serological surveys are now being conducted, and over the next few weeks they'll tell us whether these estimates are accurate.

If you'd like to experiment with this model, our code is available at github.com/fivetran/covid and the data is publicly accessible in the BigQuery data set fivetran-covid:covid.

About the author

George Fraser is the CEO of Fivetran, the leading provider of automated data integration. Previously, he was a scientist at Emerald Therapeutics. He received his Ph.D. in neurobiology from the University of Pittsburgh in 2011.

Notes

[1] A few small-scale testing efforts have produced astonishing results: a hospital in New York tested every woman who gave birth and found 15% of them were positive for COVID-19. A homeless shelter in Boston tested every resident and found 36% of them were positive for COVID-19. These examples are not representative samples, and we still don't know whether they're outliers or whether they indicate COVID-19 is much more widespread than we've realized.

[2] This analysis was inspired by an earlier paper using the ILINet data to estimate COVID-19 prevalence. The main contribution of our analysis is the way we scale ILINet patient visits to all patient visits using the national number of patient visits per week, and the way we estimate patients who don't visit a doctor using data from the 2009-2010 H1N1 pandemic.

[3] We computed a 95% confidence interval for our model's prediction, and calculated "excess ILI" as the percentage of visits exceeding this interval.

[4] We assumed a 1% baseline level of non-H1N1 flu-like illnesses, based on historical norms for the time of year when the H1N1 pandemic happened. We then assumed that everyone above this threshold with flu-like symptoms had H1N1, and used the same "number of people who visit the doctor each week" assumption to extrapolate how many people nationwide with H1N1 saw a primary care provider.
