Joel Hellewell

Is COVID-19 forecasting bad, or are you just projecting?

Introduction

On March 23rd 2020, the Scientific Pandemic Influenza Group on Modelling (SPI-M) put together a framework for collating forecasts of COVID-19 hospital admissions, total hospital occupancy, and deaths from six different institutions, both government and academic. If you lived in England at the time of the announcement of the first national lockdown, you lived through the policy choices of a government that had been at least partly informed by the statistical analysis performed day and night by me and a few other people in my group at the London School of Hygiene and Tropical Medicine. We were not forecasting savants: you can see the inadequacies of our forecasts cruelly exposed by objective evaluation. It turns out that we worked roughly eighty-hour weeks for a few months only for our model to make forecasts a week ahead that were often worse than just assuming that the number of hospital admissions in one week's time will be exactly the same as it is now. In a very real and demonstrable way, we did not do a very good job of infectious disease forecasting.
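
To make that kind of failure concrete, here is a toy sketch of the comparison I mean: scoring a hypothetical model's one-week-ahead forecasts against the naive "persistence" forecast that just repeats the value from a week ago. Every number below is invented for illustration; this is not our actual model or data.

```python
import numpy as np

# Toy illustration (all numbers invented): observed daily hospital
# admissions, a hypothetical model's one-week-ahead forecasts, and the
# "persistence" baseline that simply repeats the value from seven days ago.
observed = np.array([120, 135, 150, 170, 190, 205, 230])
model_forecast = np.array([150, 175, 205, 240, 280, 325, 375])
persistence = np.array([100, 110, 118, 125, 140, 155, 165])  # admissions 7 days earlier

def mae(pred, obs):
    """Mean absolute error: the average size of the forecast miss."""
    return np.mean(np.abs(pred - obs))

print("Model MAE:      ", mae(model_forecast, observed))
print("Persistence MAE:", mae(persistence, observed))
# If the model's MAE is larger than the persistence MAE, the model has done
# worse than assuming next week looks exactly like this week.
```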

There have been hundreds, if not thousands, of other predictions made during this pandemic, and those from ever more complex models have continued to feed directly into the development of policy. Most of these predictions have not had the same clinical scrutiny that ours did. Instead, it has been left for people to notice instances where case numbers were orders of magnitude lower than a prediction said they would be, leading to an understandable scepticism over their accuracy and usefulness. Criticism of modelling has been both academic, such as that led by John Ioannidis, whose recent article is called simply "Forecasting for COVID-19 has failed", and non-academic, such as the general degree of negative feeling towards model predictions detectable in the general public. I've written before about how some criticism of the modelling response has been whipped up by the right-wing press at the behest of anti-lockdown employers who would rather you carried on working and earning them money, regardless of the cost to health. However, that doesn't mean that all criticism has been unjustified, and since I have stopped being so enveloped in COVID-19 work, the question of the validity of this criticism has engrossed me.

In this piece I will try to explain, for a general audience, how the state of infectious disease modelling before the pandemic anticipated some of the criticism that modelling has received during the pandemic, when it has had to switch quickly to a higher-pressure, policy-focused way of operating. My argument, broadly, is that prior to the pandemic infectious disease modellers had little experience of evaluating the quality of their predictions because of structural incentives within the field. Most models also never had to grapple with interventions that themselves caused incidental harm, since they mostly concerned the distribution of things such as life-saving drugs or insecticidal bed nets. The combination of these two things has meant that modellers have had to stray into uncomfortable territory, where models now generate forecasts of (at least initially) unknown accuracy that recommend policies causing widespread suffering, concentrated among people whose wellbeing is already the most precarious. It was entirely reasonable for people to ask whether lockdowns were absolutely necessary, especially in the punitive form in which they have been implemented in Britain, but this innocent-seeming question actually struck at the heart of quite an uncomfortable methodological blind spot in how we understand and evaluate models that are used to motivate policy. As a result, a frequent response to people asking about the trustworthiness of model predictions has been to tell them to "follow the science", rather than anyone being able to point them towards any demonstrable accuracy of infectious disease models.

Modelling before the pandemic

Prior to the pandemic, almost all modelling involved either an "archaeological" exploration of the factors that influenced historical outbreaks, an attempt to predict the outcomes of imagined intervention scenarios, or a combination of the two. I refer to it as an archaeological process because it aims to unearth what you think the relevant factors influencing transmission were in historical outbreaks that have finished. The models were mechanistic, replicating the physical process of disease transmission through the structure of the mathematical equations within the model. Improving your archaeological model was a case of introducing new complexity, perhaps parameters that rely on newly collated sources of data, and demonstrating a superior model fit compared to the old model. By adding this complexity you have produced a better picture of what happened, based on the data from the past. For instance, you might compare a malaria model with a constant mosquito population against a new model that ties the mosquito population to rainfall, to see if there is an improvement in model fit. There is a biological justification for exploring this new model, since mosquitoes lay eggs in water and an increase in ground puddles due to rainfall increases their viable breeding habitat and therefore the population.
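
As a toy sketch of what "demonstrating a superior model fit" can look like in practice, here is one common approach (my choice of method, not anything from a specific paper): fit both candidate models to the same case counts and compare them with an information criterion such as AIC. The case counts, rainfall values, and model forms below are all invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

# Toy illustration (all data invented): monthly malaria case counts and
# rainfall. We compare a model with a constant expected case count against
# one where the expected count scales with rainfall, using AIC.
cases = np.array([30, 42, 55, 80, 95, 70, 40, 25, 20, 35, 60, 85])
rainfall = np.array([10, 30, 60, 120, 150, 90, 40, 15, 5, 25, 70, 130])  # mm

def poisson_nll(log_mu, y):
    """Negative log-likelihood of counts y under Poisson rates exp(log_mu)."""
    mu = np.exp(log_mu)
    return np.sum(mu - y * log_mu + gammaln(y + 1))

# Model 1: constant expected count, one parameter (the log rate).
nll1 = lambda p: poisson_nll(np.full_like(cases, p[0], dtype=float), cases)
fit1 = minimize(nll1, x0=[np.log(cases.mean())], method="Nelder-Mead")

# Model 2: expected count depends log-linearly on rainfall, two parameters.
nll2 = lambda p: poisson_nll(p[0] + p[1] * rainfall, cases)
fit2 = minimize(nll2, x0=[np.log(cases.mean()), 0.0], method="Nelder-Mead")

aic1 = 2 * 1 + 2 * fit1.fun  # AIC = 2 * (number of parameters) + 2 * NLL
aic2 = 2 * 2 + 2 * fit2.fun
print(f"AIC, constant model: {aic1:.1f}   AIC, rainfall model: {aic2:.1f}")
# A lower AIC for the rainfall model would support adding that complexity.
```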

In an ideal process of using models in scientific enquiry, there is a back-and-forth between modelling and experimental work (in laboratories and field trials). Experiments test hypotheses about the causal workings of the biological processes of disease. The structure of a hypothesis is then added to a mechanistic model, and information on the goodness-of-fit is used to confirm whether the new structural assumptions better replicate observed data (perhaps over a wider geographical or temporal scale than the original experiment). The modelling then lends support to the hypothesis in question having a wider relevance than the particular setting in which the experiment took place, hopefully spurring on further experiments to test it more thoroughly or to test related hypotheses. In less extraordinary times, this job of generalisation and abstraction of the important factors relevant to the transmission of a disease is precisely the point of infectious disease modelling. This process of abstraction is important because field trials and laboratory work are expensive and time-consuming: it is not possible to run a new trial for every unique combination of the factors of transmission that you will come across within a country, let alone across a continent.

Introducing interventions

The mechanistic models used for archaeological analyses lend themselves naturally to being used to imagine counterfactual intervention scenarios: what would happen to the incidence of malaria if half of the population were given an insecticidal bed net? Experimental hut trials measure the reduction in mosquito bites on people sitting under insecticidal nets, and laboratory assays measure the probability of mosquito death as a result of exposure to the insecticide on the surface of the net. This information can be incorporated into a model of malaria transmission to estimate the impact of mass distribution of insecticidal nets. While performing a randomised controlled trial of net distribution also gives an estimate of how net distribution reduces malaria incidence, the advantage of incorporating a model is that it attempts to link the size of the reduction to different combinations of mosquito species, vegetation coverage, rainfall, and so on. It's not possible to perform a randomised controlled trial in every combination of these factors of malaria transmission, so a great deal of time and money could be saved if a model can do a decent job of predicting the expected impact of mass net distribution when fed information on the factors of transmission.
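
To give a flavour of what "incorporated into a model" means, here is a toy Ross-Macdonald-style sketch (not any specific published malaria model): an assumed hut-trial estimate of bite reduction and insecticide-induced mosquito mortality modifies the transmission parameters, and the model is run at different levels of net coverage. Every parameter value below is an invented assumption.

```python
# A minimal Ross-Macdonald-style sketch: net coverage reduces the effective
# biting rate and adds extra mosquito mortality, and we compare equilibrium
# human malaria prevalence across coverage levels. All values are invented.
def simulate(net_coverage, days=5000, dt=0.1):
    a, b, c = 0.25, 0.3, 0.3        # biting rate, mosquito->human and human->mosquito probs
    m, r, g = 2.0, 1 / 50, 1 / 10   # mosquito:human ratio, human recovery, mosquito death
    bite_reduction, kill_prob = 0.5, 0.3  # assumed "hut trial" net effects

    a_eff = a * (1 - net_coverage * bite_reduction)  # fewer successful bites
    g_eff = g + a * net_coverage * kill_prob         # extra deaths at the net

    X, Z = 0.3, 0.05  # infected proportions of humans and mosquitoes
    for _ in range(int(days / dt)):
        dX = m * a_eff * b * Z * (1 - X) - r * X
        dZ = a_eff * c * X * (1 - Z) - g_eff * Z
        X, Z = X + dX * dt, Z + dZ * dt
    return X

for cov in [0.0, 0.5, 0.8]:
    print(f"Net coverage {cov:.0%}: equilibrium human prevalence {simulate(cov):.2f}")
```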

As it turns out, this is a pretty big "if". Often, models happily demonstrate a good fit to historical data and then produce scenario estimates for a range of intervention coverages using nice round numbers like 60% and 80% (along with varying other parameters). These estimates justify the release of huge amounts of money from funders such as the Bill and Melinda Gates Foundation to implement the intervention. The problem is that very rarely does anyone ever go back to check how accurate their model estimates of intervention impact were compared to what happened when the nets were actually distributed. It is possible to re-run predictions using the exact parameter values that actually occurred, such as vaccine uptake, and see how well a model predicts the future once these unknowns are removed. Modellers don't really do this, because the best-case scenario would be a paper saying that their predictions were alright, which would be fairly uninteresting from the point of view of the journals that you need to relentlessly publish in to remain a scientist. The worst-case scenario would be showing that the predictions from the model that they (and probably their colleagues) base their careers upon, and that the funder used to justify spending millions of pounds, are not very accurate.

Publish or perish

It’s easier to understand this lack of interest in the accuracy of scenario modelling results within the wider context of infectious disease modelling as a field. In my experience, working as a scientist in infectious disease modelling has been a state of relentless and ever-increasing informal competition between you and everyone else around your level. While it’s not uncommon for scientists to spend a day selflessly helping each other, in the background lurks the fact that all of you will be in competition for a shrinking number of jobs and fellowships as you increase in seniority. As a postdoc you move between one-to-three-year contracts in which you hope to accrue enough decent publications to eventually be able to apply for funding to study your own research interests. This competition between scientists usually takes the form of who is willing to accept the worst work-life balance so that they can put out more publications, alongside all sorts of other activities such as public outreach, press appearances, networking, supervision, and organising journal clubs or symposia. As far as I can tell, the work-life balance of more senior academics only gets worse. There is very little time during this relentless fight to publish or perish to reflect on problems within your scientific practice, and very little reward (relative to the other work that you could have spent your time doing) if you do choose to engage with them.

To survive you fling yourself from one project to the next, barely pausing for breath, let alone to reflect on the merit of what you’ve done - achieving a publication is all that matters. You neglect to check the accuracy of your previous work, but at the same time no-one asks you to, and no-one else goes out of their way to check it either, because career-wise checking other people’s work is an even bigger waste of time than checking your own. There is little motivation for anyone to scrutinise the accuracy of your predictions when the end result is that people are given things like insecticidal bed nets, which, if they don’t want them, they might sell or use as fishing nets. If you do somehow find out that your predictions were bad, then you can console yourself with the fact that mechanistic models are always a work in progress, so failure in the past is excusable as long as you keep going with the process of iterative improvement. Unfortunately for modellers, we rapidly found out that if the intervention under consideration is not a net but is instead suddenly the criminalisation of most of your social interactions, then the accuracy of your predictions of intervention effectiveness becomes paramount.

The pandemic begins

The pandemic shortened the timelines of modelling work from months to weeks: work needed to be finished as quickly as possible so that it could be used by various advisory committees to help inform policy. SARS-CoV-2 is a fairly straightforward pathogen to model, in that none of its biological properties were unprecedented in the study of previously known infectious diseases. However, going beyond very simple models of spread required making initial assumptions about whether infection confers lasting immunity, what interventions might be put in place, or how people’s behaviour would change as hundreds of people were admitted to hospital. It’s hard to formulate reasonable assumptions about these sorts of things, especially near the start of the pandemic. The way to continue while acknowledging this uncertainty is usually to produce predictions for a range of assumptions and combinations of different assumptions; this process is referred to as making “projections” or “scenario modelling”. Making projections with different sets of assumptions shows policy-makers a range of scenarios. There will hopefully be further input from immunologists and virologists about the plausibility of the biological assumptions, and those in government can decide what are feasible assumptions in terms of policy response. A policy-maker can be shown projections of what will happen if we reach 20, 40, 60, and 80 percent COVID-19 vaccine coverage, and she can decide which one of those scenarios is most likely given her separate involvement with vaccination campaign planning.
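
Here is a deliberately minimal sketch of what scenario modelling looks like in code (this is not an SPI-M model): the same simple SIR model is run under each assumed level of vaccine coverage, and each run is one projection. All of the parameter values are illustrative assumptions.

```python
# A toy SIR model run under several assumed vaccine coverage scenarios,
# producing one projection per scenario. All parameters are invented.
def sir_projection(vaccine_coverage, R0=3.0, infectious_period=5.0,
                   population=1_000_000, days=365, dt=0.5):
    gamma = 1 / infectious_period
    beta = R0 * gamma
    # Assume vaccinated people start in the removed compartment.
    S = population * (1 - vaccine_coverage)
    I, R = 100.0, population * vaccine_coverage
    peak_infected = 0.0
    for _ in range(int(days / dt)):
        new_infections = beta * S * I / population * dt
        new_recoveries = gamma * I * dt
        S, I, R = S - new_infections, I + new_infections - new_recoveries, R + new_recoveries
        peak_infected = max(peak_infected, I)
    return peak_infected

for coverage in [0.2, 0.4, 0.6, 0.8]:
    print(f"Coverage {coverage:.0%}: projected peak infections "
          f"{sir_projection(coverage):,.0f}")
```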

A forecast is slightly different from a projection: a forecast is a prediction under the set of external assumptions that we think are true (or most likely to be true). A forecast says “this is what I predict will happen”, whereas a projection says “this is what I predict will happen in a world where these assumptions are true”. Implicit in a forecast is a set of assumptions that you think will actually occur. Forecasting infectious diseases is very difficult; over the long term it involves the kind of unruly fat-tailed probability distributions that Nassim Taleb focuses on. We tried to make short-term forecasts for hospital admissions by assuming that the number of new people that each infectious person would themselves go on to infect would remain relatively stable over a short space of time. The model results that SPI-M and SAGE consider when making policy recommendations have been projections, not forecasts. The projections have varying assumptions regarding factors such as lockdown adherence, vaccine uptake, and vaccine effectiveness. The distinction between forecasts and projections might seem pedantic, but it’s actually the source of much bad criticism of infectious disease modelling.
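
As a rough illustration of that kind of short-term forecast (again, not our actual model), here is a toy renewal-equation sketch: recent case counts are projected forward assuming the reproduction number stays at its current value, with an assumed serial interval distribution. All of the numbers are invented.

```python
import numpy as np

# Toy short-term forecast: project recent case counts forward with a renewal
# equation, assuming the reproduction number R stays fixed at its current
# value. The serial interval and case counts are illustrative assumptions.
recent_cases = np.array([200, 220, 250, 270, 310, 340, 380])  # last 7 days
R_current = 1.2                                               # assumed to stay fixed

# Discrete serial interval: probability mass that an infector generates a new
# infection 1, 2, ..., 7 days later (toy values summing to 1).
serial_interval = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.07, 0.03])

cases = list(recent_cases)
for _ in range(7):  # forecast one week ahead, one day at a time
    history = np.array(cases[-len(serial_interval):])[::-1]  # most recent first
    expected_today = R_current * np.sum(history * serial_interval)
    cases.append(expected_today)

print("One-week-ahead forecast:", [round(c) for c in cases[-7:]])
```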

Forecasting or projecting?

The previously mentioned paper “Forecasting for COVID-19 has failed” by John Ioannidis et al. cites a paper on bovine spongiform encephalopathy (BSE) transmission in sheep written by Neil “Professor Lockdown” Ferguson, among others. In humans, infection with the BSE agent can lead to variant Creutzfeldt-Jakob disease (vCJD), the human form of the brain disease that is often referred to as “mad cow disease” in Britain. Ioannidis et al. state: “Predictions for bovine spongiform encephalopathy expected up to 150,000 deaths in the UK. However, the lower bound predicted as low as 50 deaths, which is a figure close to eventual fatalities. Predictions may work in ‘ideal’, isolated communities with heterogeneous populations, not the complex current global world”. The BSE paper tried to model the consequences of different potential transmission pathways for BSE moving into the British sheep flock, stating that “The aim of this study was not to evaluate the probability that BSE has entered the sheep flock, but rather, given the pessimistic assumption that infection has occurred, to explore its potential extent and pattern of spread”. Since the profile of BSE infectivity in sheep is poorly understood, the modellers considered three different scenarios: that BSE was likely to spread within a flock of sheep but not between different flocks, that BSE could spread both within and between flocks, and that BSE could spread neither within a flock nor between flocks. The upper bound of the prediction for the scenario where an outbreak of BSE occurred within and between sheep flocks, meaning that the disease moved across the country and would require stringent testing and interventions, was 150,000 human deaths in a completely imagined outbreak. It’s hard to tell what figure Ioannidis et al. intend to compare this prediction to when they refer to “eventual fatalities”: the first death from vCJD in the UK was in 1995, and annual deaths increased year on year until they peaked in 2000, before the BSE sheep paper was even published in 2002. It might be possible to criticise the very speculative nature of considering BSE spread in sheep, but to claim that the paper even made a forecast that should be compared to reality is clearly not true.

The above is a particularly crude example, but unfortunately, if you dig into most claims of wildly inaccurate infectious disease forecasts, the original work is treated with a similar level of nuance and generosity. In the same paragraph, Ioannidis et al. write: “Modeling for swine flu predicted 3,100 - 65,000 deaths in the UK. Eventually, 457 deaths occurred”. They link an article quoting Liam Donaldson, who was Chief Medical Officer back in 2009. With a bit of effort it is possible to work out from a BBC news article how he likely arrived at those figures, which seems to be a process of multiplying several assumptions about the attack rate (the final percentage of the population infected) by several assumptions about the fatality rate (the percentage of infections that are fatal). This is an extremely simple model of an unmitigated outbreak, and it seems a little unfair to compare it to a reality in which just under 40% of the highest-risk groups were vaccinated against swine flu. The model should be thought of as saying that if we do nothing, and the case fatality rate is X and the attack rate is Y, then the expected number of deaths is Z. As it happens, the case fatality rate assumptions were higher than what was observed during the outbreak, so Liam Donaldson’s projections were very different from reality both because of this and because they did not account for any interventions. Such criticisms always present projections as forecasts, completely shorn of the particular assumptions attached to them that might help to provide context as to why they are so far off.
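
To show the shape of that calculation, here is a toy version: expected deaths are simply population × attack rate × case fatality rate, evaluated over a grid of assumptions. The attack rates and fatality rates below are illustrative values I have chosen, not the figures actually used in 2009.

```python
# A back-of-the-envelope grid of assumptions for an unmitigated outbreak.
# All attack rates and case fatality rates here are illustrative, not the
# values used by the Chief Medical Officer in 2009.
uk_population = 61_000_000

attack_rates = [0.05, 0.15, 0.30]        # final share of the population infected
case_fatality_rates = [0.001, 0.0035]    # share of infections that are fatal

for attack_rate in attack_rates:
    for cfr in case_fatality_rates:
        deaths = uk_population * attack_rate * cfr
        print(f"attack rate {attack_rate:.0%}, CFR {cfr:.2%}: ~{deaths:,.0f} deaths")
```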

The problem with projections

It matters that even the criticism of infectious disease modelling published in academic journals is bad, because it fails to consider a far better criticism of projections than merely treating them as forecasts. Only ever performing projections is quite evasive: you avoid criticism of your ability to forecast by saying that you are projecting, not forecasting. This has become the line of defence employed by the chair of SPI-M, who is also a member of SAGE. A better question than “how good are COVID-19 forecasts?” might be “how good are COVID-19 projections?”. A projection is a prediction tied to a specific combination of assumptions, A. In reality the specific combination A never occurs exactly; the most obvious example is that we are very unlikely to hit the nice round numbers for vaccine coverage used in projections. There is no easy way to tell how good a projection was, because A didn’t occur. One way would be to retrospectively make a “forecast” using the collection of assumptions, B, that actually occurred. This is the same as evaluating the accuracy of intervention impact projections, as discussed earlier, and means that (ideally) any discrepancy between the predictions and the ground truth is due to the predictive power of the model. I say “ideally” because in reality it is difficult to measure and characterise the assumptions B required to do this. It can also only be done retrospectively, since it requires knowing the correct assumptions to use by observing the past. Therefore, it would have been impossible to know the predictive power of the models projecting lockdown impact in March 2020.
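
Here is a minimal sketch of what that retrospective evaluation could look like, using a toy model rather than any real SPI-M one: the same model is run under the planned assumptions A and under the realised assumptions B, and only the run under B is a fair test of the model's predictive power. The model structure, assumption values, and "observed" data are all invented for illustration.

```python
import numpy as np

# Toy retrospective evaluation: compare projections made under planned
# assumptions A with a re-run under the assumptions B that actually
# occurred, scoring both against the observed data. Everything is invented.
def toy_projection(vaccine_uptake, contact_reduction, weeks=10):
    """Project weekly hospital admissions under one set of assumptions."""
    R = 1.8 * (1 - contact_reduction) * (1 - 0.7 * vaccine_uptake)
    admissions, current = [], 100.0
    for _ in range(weeks):
        current *= R
        admissions.append(current)
    return np.array(admissions)

assumptions_A = {"vaccine_uptake": 0.80, "contact_reduction": 0.30}  # planned
assumptions_B = {"vaccine_uptake": 0.55, "contact_reduction": 0.10}  # realised
observed = np.array([108, 112, 98, 105, 118, 110, 102, 115, 109, 120])

projection_A = toy_projection(**assumptions_A)      # what policy-makers saw
retrospective_B = toy_projection(**assumptions_B)   # re-run with what happened

mae = lambda pred, obs: np.mean(np.abs(pred - obs))
print("Error of original projection A (assumptions never occurred):",
      round(mae(projection_A, observed)))
print("Error under realised assumptions B (the model's actual skill):",
      round(mae(retrospective_B, observed)))
```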

Someone asking whether there was any proof of previous projection accuracy might have been convinced by a historical catalogue of accurate intervention impact predictions. As explained, due to the structural forces at play prior to the pandemic, no consistent evaluations of projection accuracy exist. Even after March 2020, the SPI-M models that produced previous projections have not been evaluated. It’s unlikely that this is because the SPI-M modellers are lazy or purposefully deceitful; it has more to do with a combination of the pre-existing practice of producing projections without evaluation, and the requirement on SPI-M’s part to spend all their time adding new features and making new projections during the ongoing outbreak. It’s worth thinking about how this process must have looked to an enthusiastic outsider: modellers have continued to deploy their essentially unverified models again and again at each new wave and reopening stage, seemingly oblivious to the accuracy of previous predictions. People might have been happy to let them get on with it if it weren’t for the fact that the strange charts they produce were used to justify the criminalisation of nearly all social interaction. The term “scientism” has been bandied about, denoting a strong belief in the truth of anything that is thought of as scientific. For many people, modelling must fit the bill for scientism, given its lack of demonstrable, consistent successes, and this seems to have made many deeply cynical of the modelling that has motivated the government response.

No return to normal

Before the COVID-19 pandemic, the field of infectious disease modelling was in the final stages of capture by billionaire philanthrocapitalists, pharmaceutical companies, the foreign policy objectives of imperialist nations, and the intergovernmental organisations that coordinate all of these interests. The need to keep demonstrating success as a means to continue your scientific career creates a worrying incentive towards self-reflection that is either relatively gentle or that involves only dry, technical solutions (e.g. better methodology) which don’t get to the heart of the issues involved. As argued here, weaknesses in the pandemic response can (and should) be traced back to tendencies in infectious disease modelling as a field prior to January 2020. This can only be done with careful and rigorous criticism of infectious disease modelling, a world away from showing people one plot of a poor forecast in the hope that it will finally make everyone realise that the emperor is not wearing any clothes. How modelling as a scientific discipline operates should also be understood in terms of the material, structural forces pulling scientists and their research topics in certain directions, and not through the perceived moral failings of specific individuals.

The pure scientific wonder that a child might feel looking up at the stars has to be subjugated to the discipline of grant applications, unpaid peer review, and the journal rejection lottery. These features of the daily working life of a scientist, often treated as separate problems with distinct causes, came into existence and continue to shift under our feet at the whim of a profession that is being continually reconfigured to extract maximum surplus value from our labour. Scientists need to recognise that this process does not necessarily produce an outcome that is beneficial to the interests of scientists or to the quality of science, just as, over the course of the pandemic, many people have become much more interested in the role that science plays in policymaking.