Summary: Statistical tests need to be paired with proper data and study design to yield valid results. A recent review paper on Long Covid in children provides a useful example of how researchers can get this wrong. We use causal diagrams to decompose the problem and illustrate where errors were made.

## Background

A recent review paper by Australian and Swiss doctors, How Common Is Long COVID in Children and Adolescents?, was widely discussed in the press, with 128 news stories from 103 outlets. The headlines were reassuring:

- “Global studies on long COVID and children ‘unnecessarily worrying’, say researchers”
- “Long Covid in children and adolescents is less common than previously feared”
- “Kids’ Covid-19 risk less than we feared, says study”.

The paper in question does not actually say any of these things, but rather concludes that “*the true incidence of this syndrome in children and adolescents remains uncertain*.” However, the challenges of accurate science journalism are not the topic for our article today. Rather, we will describe a critical flaw in the statistical analysis in this review, as an exercise in better understanding how to interpret statistical tests.

A key contribution of the review is that it separates those studies that use a “control group” from those that do not. The authors suggest we should focus our attention on the studies with a control group, because “*in the absence of a control group, it is impossible to distinguish symptoms of long COVID from symptoms attributable to the pandemic*.” The National Academy of Sciences warns that “*use of an inappropriate control group can make it impossible to draw meaningful conclusions from a study*.” As we will see, this is, unfortunately, what happened in this review. But first, let’s do a brief recap of control groups and statistical tests.

## Control groups and RCTs

When assessing the impact of an intervention, such as the use of a new drug, the gold standard is to use a *Randomised Controlled Trial (RCT)*. In an RCT, a representative sample is selected and randomly split into two groups, one of which receives the medical intervention (e.g. the drug), and one which doesn’t (normally that one gets a placebo instead). This can, when things go well, show clearly whether the drug made a difference. Generally, a “p value” is calculated, which is the probability that the effect seen in the data would be observed by chance if there were truly no difference between cases and controls (i.e. if the null hypothesis were true), along with a “confidence interval”, which is the range of outcomes that would be expected after considering random variation. If the p value is less than some threshold (often 0.05) the result is considered “statistically significant”. Without an RCT, it can be harder to tell whether two groups differ because of the intervention, or because of some other difference between the groups.
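As a concrete (and entirely hypothetical) illustration, the sketch below computes a p value and 95% confidence interval for a made-up two-group comparison, using the standard normal approximation for a difference in proportions. The counts are invented for illustration only:

```python
import math

def two_proportion_test(x1, n1, x2, n2):
    """Compare event rates in two groups (e.g. symptom improvement in
    drug vs placebo arms). Returns a two-sided p value and a 95%
    confidence interval for the difference in proportions, using a
    normal approximation."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled standard error under the null hypothesis (no difference)
    p_pool = (x1 + x2) / (n1 + n2)
    se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_null
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (p1 - p2 - 1.96 * se, p1 - p2 + 1.96 * se)
    return p_value, ci

# Hypothetical RCT: 60/100 improved on the drug, 45/100 on placebo
p, ci = two_proportion_test(60, 100, 45, 100)
```

Here the p value falls below 0.05 and the confidence interval excludes zero, so we would report a statistically significant effect for this (fictional) trial.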

We can represent this analysis as a diagram like so:

This is an example of a (simplified and informal) causal diagram. The black arrows show the direct relationships we can measure or control – in this case, our *selection* of control group vs experimental group is used to decide who gets the *drug*, and we then measure the *outcome* (e.g. do symptoms improve) for each group based on our group *selection*. Because the selection was random (since this is an RCT), we can infer the dotted line: how much does taking the drug change the outcome? If the size of the control or experimental group is small, then it is possible that the difference in outcomes between the two groups is entirely due to random chance. To handle that, we pop the effect size and sample size into statistical software such as R and it will tell us the p value and confidence interval of the effect.
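To see why randomisation licenses that inference, here’s a small simulated sketch (the ages and group sizes are invented): any trait that the assignment doesn’t cause ends up balanced between the two arms on average, so it can’t explain a difference in outcomes.

```python
import random

random.seed(0)

# Simulate a population with a trait (age, purely illustrative) that
# the trial does not control, then randomly assign everyone to an arm.
population = [{"age": random.gauss(45, 15)} for _ in range(10_000)]
for person in population:
    person["group"] = random.choice(["drug", "placebo"])

def mean_age(group):
    ages = [p["age"] for p in population if p["group"] == group]
    return sum(ages) / len(ages)

# The two arms end up with nearly identical age profiles, so any
# difference in outcomes can be attributed to the drug, not to age.
gap = abs(mean_age("drug") - mean_age("placebo"))
```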

Because RCTs are the gold standard for assessing the impact of a medical intervention, they are used whenever possible. Nearly all drugs on the market have been through multiple RCTs, and most medical education includes some discussion of the use and interpretation of RCTs.

## Control groups and observational studies

Sometimes, as discussed in The Planning of Observational Studies of Human Populations, “it is not feasible to use controlled experimentation”, but we want to investigate a causal relationship between variables, in which case we may decide to use an observational study. For instance, studying “the relationship between smoking and health”, risk factors for “injuries in motor accidents”, or “effects of new social programmes”. In cases like these, it isn’t possible to create a true “control group” like in an RCT, since we cannot generally randomly assign people, for instance, to a group that are told to start smoking.

Instead, we have to try to find two groups that are as similar as possible, but differ only in the variable under study – for instance, a group of smokers and a group of non-smokers that are of similar demographics, health, etc. This can be challenging. Indeed, the question “does smoking cause cancer” remained controversial for decades, despite many attempts at observational studies.

Researchers have noted that “*results from observational studies can confuse the effect of interest with other variables’ effects, leading to an association that is not causal. It would be helpful for clinicians and researchers to be able to visualize the structure of biases in a clinical study*”. They suggest using causal diagrams for this purpose, including to help avoid confounding bias in epidemiological studies. So, let’s give that a try now!

## Structure of the Long Covid review

In How Common Is Long COVID in Children and Adolescents? the authors suggest we focus on studies of Long Covid prevalence that include a control group. The idea is that we take one group that has (or had) COVID, and one group that didn’t, and then see if they have Long Covid symptoms a few weeks or months later. Here’s what the causal diagram would look like:

Here we are trying to determine if *COVID infection* causes *Long Covid symptoms*. Since *COVID infection* is the basis of the *Control group selection*, and we can compare the *Long Covid symptoms* for each group, that would allow us to infer the answer to our question. The statistical tests reported in the review paper only apply if this structure is correct.

However, it’s not quite this simple. We don’t directly know who has a COVID infection, but instead have to infer it using a test (e.g. serology, PCR, or rapid antigen). It is so easy nowadays to run a statistical test on a computer that it can be quite tempting to just use the software and report what it says, without checking that the statistical assumptions implicitly being made are met by the data and design.

We might hope that we could modify our diagram like so:

In this case, we could still directly infer the dotted line (i.e. “does *COVID infection* cause *Long Covid symptoms*?”), since there is just one unknown relationship, and all the arrows go in the same direction.

But unfortunately, this doesn’t work either. The link between test results and infection is not perfect. Some researchers, for instance, have estimated that PCR tests may miss half, or even 90% of infections. Part of the reason is that “*thresholds for SARS-CoV-2 antibody assays have typically been determined using samples from symptomatic, often hospitalised, patients*”. Others have found that 36% of infections do not seroconvert, and that children in particular may serorevert. It appears that false negative test results may be more common in children – tests are most sensitive when used for middle-aged men.
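To get a feel for the scale of the problem, here is some illustrative arithmetic. The attack rate and miss rate below are assumptions chosen for the example, not figures from the review:

```python
# Illustrative arithmetic (assumed numbers, not data from the review):
# suppose a study recruits 1,000 children, 30% of whom were truly
# infected, and the antibody test misses 36% of infections
# (non-seroconversion).
n_children = 1_000
attack_rate = 0.30          # assumed fraction truly infected
false_negative_rate = 0.36  # assumed fraction of infections missed

truly_infected = n_children * attack_rate
missed = truly_infected * false_negative_rate  # end up in "control" group

# Fraction of the test-negative "control" group that was actually infected
controls = (n_children - truly_infected) + missed
contamination = missed / controls
```

Under these assumptions, over a hundred infected children would be classified as controls, making up more than a tenth of the “uninfected” comparison group.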

To make things even more complicated, research shows that “*Long-COVID is associated with weak anti-SARS-CoV-2 antibody response*.”

Putting this all together, here’s what our diagram now looks like, using red arrows here to indicate negative relationships:

This shows that test results are not just associated with COVID infection, but also with Age and Long Covid symptoms, and that the association between COVID infection and test result is imperfect and not fully understood.

Because of this, we can’t now directly infer the relationship between *COVID infection* and *Long Covid symptoms*. We would first need to fully understand and account for the confounders and uncertainties. Simply reporting the results of a statistical test does not give meaningful information in this case.

In particular, we can see that the issues we have identified all bias the data in the same direction: they result in infected cases being incorrectly placed in the control group.
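A small simulation makes this bias concrete (all rates below are assumed, purely for illustration): as the test misses more infections, the observed gap between “cases” and “controls” shrinks, even though the true effect is unchanged.

```python
import random

random.seed(1)

def observed_rates(miss_rate, n=100_000, p_long_covid=0.10,
                   p_background=0.02, attack_rate=0.3):
    """Simulate a control-group study where the test misses a given
    fraction of true infections. All parameters are assumptions for
    illustration, not estimates from any study."""
    pos_symptoms = pos_total = neg_symptoms = neg_total = 0
    for _ in range(n):
        infected = random.random() < attack_rate
        # Symptoms arise from Long Covid (if infected) or background causes
        p = p_long_covid if infected else p_background
        symptoms = random.random() < p
        tests_positive = infected and random.random() > miss_rate
        if tests_positive:
            pos_total += 1
            pos_symptoms += symptoms
        else:
            neg_total += 1
            neg_symptoms += symptoms
    return pos_symptoms / pos_total, neg_symptoms / neg_total

perfect = observed_rates(miss_rate=0.0)
leaky = observed_rates(miss_rate=0.5)
# The case-vs-control gap shrinks when infections are misclassified,
# biasing the comparison towards "no difference".
gap_perfect = perfect[0] - perfect[1]
gap_leaky = leaky[0] - leaky[1]
```

The misclassified children raise the symptom rate in the control group, so the measured difference is always an underestimate of the true one: a bias towards the null.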

For more details about this issue, see the article Long Covid: issues around control selection, by Dr Nisreen Alwan MBE.

## The problem of p-values

The review claims that “*all studies to date have substantial limitations or do not show a difference between children who had been infected by SARS-CoV-2 and those who were not*”. This claim appears to be made on the basis of *p-values*, which are shown for each control group study in the review. All but one study did actually find a statistically significant difference between the groups being compared (at p<0.05, which is the usual cut-off for such analyses).

Regardless of what the results actually show, p-values are not being used in an appropriate way here. The American Statistical Association (ASA) has released a “Statement on Statistical Significance and P-Values” with six principles underlying the proper use and interpretation of the p-value. In particular, note the following principles:

- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

A p-value is lower when there is more data, or a stronger relationship in the data (and vice versa). A high p-value does not necessarily mean that there is no relationship in the data – it may simply mean that not enough data has been collected.
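This dependence on sample size is easy to demonstrate (the rates below are arbitrary): the very same observed effect can be “significant” or not, depending purely on how many participants were studied.

```python
import math

def p_value_two_proportions(p1, p2, n):
    """Two-sided p value for the same observed rates p1 vs p2 at
    different per-group sample sizes n (normal approximation).
    The rates are arbitrary, for illustration only."""
    p_pool = (p1 + p2) / 2
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = abs(p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Identical effect (10% vs 6% symptom rate), different sample sizes:
small = p_value_two_proportions(0.10, 0.06, n=100)    # not "significant"
large = p_value_two_proportions(0.10, 0.06, n=2_000)  # "significant"
```

With 100 children per group the difference fails the 0.05 threshold; with 2,000 per group the identical difference passes it comfortably.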

Because a p-value “does not measure the size of an effect or the importance of a result”, it doesn’t actually tell us about the prevalence of Long Covid. The use of p-values in studying drug efficacy is very common, since we do often want to answer the question “does this drug help at all?” But to assess what the range of prevalence levels may be, we instead need to look at *confidence intervals*, which unfortunately are not shown at all in the review.
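As a sketch of the difference (with invented counts), a small study can be “non-significant” while its confidence interval remains consistent with a substantial excess of symptoms:

```python
import math

def ci_diff(x1, n1, x2, n2, z=1.96):
    """95% CI for the difference in symptom prevalence between a
    COVID group and a control group (normal approximation). The
    counts are invented to make the point, not taken from the review."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) - z * se, (p1 - p2) + z * se

# A small, "non-significant" study: 8/60 vs 4/60 with symptoms.
lo, hi = ci_diff(8, 60, 4, 60)
# The interval spans zero (hence p > 0.05), yet it is also consistent
# with an excess prevalence of over 15 percentage points -- a range
# the p-value alone completely hides.
```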

Furthermore, we should not look at p-values out of context, but instead need to also consider the likelihood of alternative hypotheses. The alternative hypothesis provided in the review is that the symptoms may be due to “lockdown measures, including school closures”.

One of the included control group studies stood out as an outlier, in which 10% of Swiss children with negative tests were found to have Long Covid symptoms, many times higher than other similar studies. Was this because of the confounding effects discussed in the previous section, or was it due to lockdowns and school closures? Switzerland did not have a full lockdown, and schools were only briefly closed, reopening nearly a year before the Long Covid symptom tests in the study. On the other hand, Switzerland may have had a very high number of cases. Wikipedia notes that “*the Swiss government has had an official policy of not testing people with only mild symptoms*”, and has still recorded nearly 900 thousand cases in a population of just 8 million people.
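A rough back-of-envelope check (the undercount factor below is purely an assumption, used only to show the shape of the calculation):

```python
# Switzerland recorded ~900,000 cases in a population of ~8 million,
# and people with only mild symptoms were often not tested, so
# confirmed cases undercount infections by some unknown factor.
recorded_cases = 900_000
population = 8_000_000
undercount_factor = 2  # assumed: one unrecorded infection per recorded

infected_fraction = recorded_cases * undercount_factor / population
# Even a crude estimate like this suggests a sizeable share of the
# population -- and hence of any "test-negative" control group --
# may in fact have been infected.
```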

In a statistical design, an alternative hypothesis should not be considered the *null hypothesis* unless we are quite certain it represents the normal baseline behaviour. But assuming that the symptoms found in the control group are due to pandemic factors other than infection is itself a hypothesis that needs careful testing and does not seem to be fully supported by the data in the study. It is not an appropriate design to use this as the base case, as was done in the review.

## Conclusion and next steps

The problems with control group definition, incorrect use of statistical tests, and statistical design do not change the key conclusion of the review: “*the true incidence of this syndrome in children and adolescents remains uncertain*.” So, how do we resolve this uncertainty?

The review has a number of suggestions for future research to improve our understanding of Long Covid prevalence in children. As we’ve seen in this article, we also need to more carefully consider and account for confounding bias. It is often possible, mathematically, to infer an association even in more complex causal relationships such as we see above. However, doing so requires a full and accurate understanding of all of the relationships in the causal structure.
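For example, if a test’s error rates were fully characterised, the apparent prevalence could in principle be corrected with the classic Rogan–Gladen estimator. The sensitivity and specificity below are illustrative only:

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Classic Rogan-Gladen correction: recover true prevalence from
    test-based prevalence *if* sensitivity and specificity are known.
    The inputs here are illustrative; the point is that correction is
    only possible once the test's error rates are fully characterised."""
    return (apparent_prevalence + specificity - 1) / (sensitivity + specificity - 1)

# Suppose 20% of children test positive, with a test that catches 64%
# of infections (36% non-seroconversion) and has 98% specificity:
true_prev = rogan_gladen(0.20, sensitivity=0.64, specificity=0.98)
```

Under these assumed error rates, the true prevalence would be well above the apparent 20%, which is exactly why uncharacterised test performance undermines a naive case/control comparison.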

Furthermore, a more complete and rigorous assessment of confounders needs to be completed. We’ve only scratched the surface in this article on one aspect: bias in the control group. Bias in the “*Long Covid symptoms*” node also needs to be considered. For instance: are all Long Covid symptoms being considered; is there under-reporting due to difficulties of child communication or understanding; is there under-reporting due to gender bias; are “on again / off again” variable symptoms being tracked correctly; and so forth.

Whatever the solution turns out to be, it seems that for a while at least, the prevalence of Long Covid in children will remain uncertain. How parents, doctors, and policy makers respond to this risk and uncertainty will be a critical issue for children around the world.

## Acknowledgements

Many thanks to Hannah Davis, Dr Deepti Gurdasani, Dr Rachel Thomas, Dr Zoë Hyde, and Dr Nisreen Alwan MBE for invaluable help with research and review for this article.