Since the world began its battle against the outbreak of Covid-19, mobile apps have promised to do it all: pinpoint infections, predict who may be at highest risk, learn how long the virus survives on surfaces, estimate the fraction of asymptomatic carriers, target medical resources, prevent people from being exposed; the list goes on. And while some mobile apps can indeed be useful as we adapt to life with this virus, there is also evidence that certain apps, by skewing our understanding of this disease, are more harmful than helpful.
Kaiser Fung has been the data science lead at various companies. He is the author of Numbers Rule Your World. You can find all three installments of his comprehensive review of this study on his blog, Junk Charts.
Currently, there are no fewer than seven major Covid-19-related apps in the US, if we count only those backed by governments or reputable health care organizations. Most will, of course, attract few users and fade away, but there is one mobile app that has already garnered attention for its surprising discoveries. The COVID Symptom Tracker was first launched in the United Kingdom by a research team at King’s College London, and it has been promoted in the US by Harvard and Stanford Medical Schools. The Symptom Tracker boasted over 1.6 million downloads in the first week after its launch in late March. The response was so rapid and remarkable that the researchers needed just five days of data to fire off the first preprint of scientific findings. But if the initial analytics coming out of the COVID Symptom Tracker are a sign of what’s to come, then app developers have much heavy lifting ahead as they battle a large volume of low-quality data.
Each day, users of the COVID Symptom Tracker are asked to file a report of their health. They can also see an estimate of cases in their area. The app offers a self-diagnosis of Covid-19, which is not necessarily accurate but undoubtedly useful while testing is triaged and rationed by governments. (Facing a shortage of diagnostic testing, both the UK and the US governments have restricted tests to people with severe symptoms.) An undesirable side effect of targeted testing, however, is that it contaminates the data feeding downstream analytics, such as estimates of the population prevalence of Covid-19 and the identification of the most relevant symptoms. This harm is laid bare in the preprint of scientific findings published by the Tracker app team. As this app and others like it become more popular, it’s critical that we understand what the data coming out of self-reporting symptom trackers can and cannot tell us.
At the outset, the researchers’ goal was to use the self-reported symptoms to predict test results. They began by assembling an analytical sample (also known as the training data) containing symptoms and self-reported diagnostic test results. Given test rationing, we can assume that all of the users included in the sample experienced severe symptoms.
The Tracker App study provided the first scientific evidence to support loss of smell and taste, or anosmia, as a symptom of Covid-19, which some doctors and patients had suspected since the early days of the pandemic. The researchers went further, though, declaring anosmia the single best predictor of infection—even better than the usual suspects like persistent cough. But anosmia’s predictive prowess is nothing more than a mirage of triage testing.
Notably, about one-third of the analytical sample tested positive for coronavirus while two-thirds tested negative, meaning a perfect predictive model would flag one in three of the app users as positive for the virus. The obvious symptoms, such as persistent cough, afflicted half of the analytical sample, so using cough as a single predictor would have meant flagging half of the users as likely to be infected. So, out of every 100 users, 50 are predicted positive, but since only about 33 have reported testing positive, we know immediately that at least 17 out of every 50 predicted positives are false positives. To have the best chance of correctly predicting all true positives while making the fewest possible false-positive errors, the symptom must affect about a third of the sample. What proportion of users are reporting loss of smell and taste? You guessed it: one out of three. No wonder anosmia is the Nate Silver of Covid-19 diagnosis.
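The arithmetic behind that comparison can be sketched in a few lines of Python. This is an illustration only, not the researchers' method: the function name `best_case_errors` is made up for this example, the counts are the rounded per-100 figures quoted above, and the results are best-case bounds that follow from the totals alone. The real error counts depend on how the symptom and the test results overlap, which these totals do not reveal.

```python
def best_case_errors(flagged: int, positives: int, total: int = 100) -> dict:
    """Best-case errors for a one-symptom rule, given only the totals.

    flagged   -- users who report the symptom (and so are predicted positive)
    positives -- users who actually reported a positive test
    """
    if not (0 <= flagged <= total and 0 <= positives <= total):
        raise ValueError("counts must lie between 0 and total")
    # Any flags beyond the number of real positives must land on negatives.
    min_false_positives = max(0, flagged - positives)
    # Any real positives beyond the number of flags must be missed.
    min_false_negatives = max(0, positives - flagged)
    best_accuracy = (total - min_false_positives - min_false_negatives) / total
    return {
        "min_false_positives": min_false_positives,
        "min_false_negatives": min_false_negatives,
        "best_accuracy": best_accuracy,
    }


# Rounded figures from the text: 33 of every 100 sampled users tested positive.
for symptom, flagged in [("persistent cough", 50), ("loss of smell and taste", 33)]:
    print(symptom, best_case_errors(flagged, positives=33))
```

A symptom reported by half the sample cannot avoid at least 17 false positives per 100 users, while a symptom reported by a third of the sample could, in the best case, make none. That ceiling is set by how common the symptom is in the sample, before any biology enters the picture.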
Anosmia appears to predict infection better than coughing only because it’s less common in the analytical sample. And it’s only less common because of triage testing. Unlike coughing, loss of smell and taste wasn’t yet recognized as a symptom of Covid-19 in March. So in order to get a test, you had to have a cough, but not necessarily loss of smell and taste. And since the analytical sample included only people who had been tested, it overrepresented the qualifying symptoms. But this finding is unlikely to apply to the larger population.
What we have witnessed is a case study of how even simple analytical frameworks are defeated by low-quality data. The analytical sample was contaminated with selection bias, which was a byproduct of triage testing. If the research team had access to a sample of random diagnostic tests, the same analysis could have perhaps yielded more useful results.
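To see how triage testing alone can manufacture this kind of result, consider a toy calculation. Every number in it is invented for illustration and none comes from the Tracker study: a 5 percent infection rate, two symptoms that are equally informative by construction and independent of each other given infection status, and a testing policy under which people with a cough are ten times as likely to be tested. The function name `precision_by_symptom` is likewise hypothetical. The sketch computes the share of people reporting each symptom who are infected, first in the whole population and then among those tested.

```python
from itertools import product


def precision_by_symptom(prevalence, sensitivity, false_alarm,
                         p_test_cough, p_test_no_cough):
    """Return {sample: {symptom: P(infected | symptom)}} for a toy population.

    Both symptoms share the same sensitivity and false-alarm rate, so they
    are equally informative in the population. Only cough affects who gets
    tested. All parameters are assumptions made up for this sketch.
    """
    # totals[sample][symptom] = [weight of infected reporters, weight of all reporters]
    totals = {
        sample: {"cough": [0.0, 0.0], "anosmia": [0.0, 0.0]}
        for sample in ("population", "tested")
    }
    for infected, cough, anosmia in product([True, False], repeat=3):
        p_symptom = sensitivity if infected else false_alarm
        weight = prevalence if infected else 1 - prevalence
        weight *= p_symptom if cough else 1 - p_symptom
        weight *= p_symptom if anosmia else 1 - p_symptom
        # Triage: the chance of being tested depends only on cough.
        tested_weight = weight * (p_test_cough if cough else p_test_no_cough)
        for sample, w in (("population", weight), ("tested", tested_weight)):
            for symptom, present in (("cough", cough), ("anosmia", anosmia)):
                if present:
                    totals[sample][symptom][0] += w * infected
                    totals[sample][symptom][1] += w
    return {
        sample: {symptom: hit / total for symptom, (hit, total) in by_symptom.items()}
        for sample, by_symptom in totals.items()
    }


result = precision_by_symptom(
    prevalence=0.05, sensitivity=0.6, false_alarm=0.1,
    p_test_cough=0.5, p_test_no_cough=0.05,
)
for sample, by_symptom in result.items():
    print(sample, {symptom: round(value, 3) for symptom, value in by_symptom.items()})
```

Under these assumptions the two symptoms are indistinguishable in the population, with about 24 percent of reporters infected for each. In the tested sample, anosmia appears roughly twice as predictive as cough, about 52 percent against 24 percent, even though nothing about either symptom has changed. The gap is created entirely by the rule that decides who gets tested.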
Around the same time, the research team grabbed headlines with a second result: They estimated that 13 percent of the UK population had been infected with the novel coronavirus by early April. To support this conclusion, they constructed a predictive model that assigned people their chance of testing positive for SARS-CoV-2 based on multiple self-reported symptoms. The model flagged as positive 13 percent of the “scoring sample,” the roughly 400,000 app users who reported symptoms but not test results.
To understand why this method is flawed, take the infamous model Target once deployed to predict which shoppers might be pregnant. Data scientists at the Minneapolis-based retailer used a random subset of female customers, the analytical sample, to discover which product purchases were highly correlated with pregnancy. The model was then able to tally orders of those relevant items, such as a blue rug, to compute the probability that any given customer was pregnant.
What happens if the scoring sample contains only male shoppers? Ask a human, and the answer is obvious. The chance of a pregnant cisgender man is always zero; however, the chance that Target’s pregnancy model predicted zero for all male shoppers was, well, zero. The model had no concept of gender, as it paid attention only to past purchases. As a result, a minority of the men would be flagged with a positive chance of being pregnant, just because they happened to have purchased that blue rug. Every such prediction is a false positive error.
When a model is trained using one type of people (women), and applied to a different type of people (men), it does not know to call foul; it dutifully extrapolates, producing false positive errors. In my extreme example, every man the model flags is a false positive, so the error rate among its positive predictions is 100 percent; when the same model scores women shoppers, the error rate is naturally much smaller.
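A minimal sketch makes the mechanics concrete. It is not Target's actual model, which is not public; the shopper counts are invented, the single feature (whether a shopper bought the rug) is a deliberate simplification, and the function names `fit_purchase_model` and `wrong_share_of_flags` are hypothetical. The model learns the pregnancy rate among women who did and did not buy the item, then flags anyone whose purchase history puts them above a threshold.

```python
def fit_purchase_model(rows):
    """Estimate P(pregnant | bought_item) from (bought_item, pregnant) pairs."""
    counts = {True: [0, 0], False: [0, 0]}  # bought -> [pregnant, total]
    for bought, pregnant in rows:
        counts[bought][0] += int(pregnant)
        counts[bought][1] += 1
    return {
        bought: (pregnant / total if total else 0.0)
        for bought, (pregnant, total) in counts.items()
    }


def wrong_share_of_flags(model, rows, threshold=0.5):
    """Share of flagged shoppers who are not pregnant (None if nobody is flagged)."""
    flagged = [pregnant for bought, pregnant in rows if model[bought] >= threshold]
    if not flagged:
        return None
    return flagged.count(False) / len(flagged)


# Invented training data: 100 women, 20 of whom bought the rug.
women = (
    [(True, True)] * 15 + [(True, False)] * 5
    + [(False, True)] * 5 + [(False, False)] * 75
)
# Invented scoring data: 50 men, 6 of whom bought the rug; none is pregnant.
men = [(True, False)] * 6 + [(False, False)] * 44

model = fit_purchase_model(women)  # the model sees purchases, never gender
print("wrong share among flagged women:", wrong_share_of_flags(model, women))
print("wrong share among flagged men:", wrong_share_of_flags(model, men))
```

With these made-up counts, a quarter of the flagged women are false positives, while every one of the six flagged men is. The model applies the same rule to both groups; only the population it is scoring has changed.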
In this comparison, the product purchases at Target are the equivalent of symptoms reported to the Covid Symptom Tracker. Just as Target used only women to build its predictive model, the Tracker App team included only users with prior test results in its analytical sample. And because of triage testing, filtering by testing status is equivalent to filtering by the severity of qualifying symptoms. The researchers took a model built on people with severe symptoms and used it to score people without severe symptoms. When the researchers concluded that 13 percent of the population may have been infected, based on their predictive model which found that 13 percent of the scoring sample was likely to test positive, they neglected this false-positive problem.
These first-to-market analyses of mobile app data raise serious concerns about our ability to derive useful insights from such apps. As discussions of reopening continue, often guided by the existence of such apps or the information they’ve provided, it’s important that policymakers and users alike understand their limitations. Until we fix the error of triage testing, the dragnet apps will continue to collect poor-quality data and fuel spurious analytics.
WIRED Opinion publishes articles by outside contributors representing a wide range of viewpoints. Read more opinions here. Submit an op-ed at firstname.lastname@example.org.