Did you know that the number of epidemiologists grew exponentially after COVID started? Kidding. If you were a frequent social media user over the past two years, then you likely encountered an overwhelming number of people analyzing and interpreting COVID-related data.
While I admire society's renewed interest in data science, it is unfortunate that knowledge of statistics and probability was often left at the door once these discussions ensued. One of the most dangerous consequences of the democratization of data is the misuse of data and the misunderstanding of its limitations.
While the use (and misuse) of data is growing in society, data literacy is spreading much more slowly. Even though I have a strong background in statistics, I have been relatively quiet about COVID-related data on social media.
Why do you think that is? To draw sound conclusions from COVID-related data, you need to weigh many different questions at once. The study of statistics requires understanding uncertainty and how to draw inferences beyond what the data alone tells you. Below is the minimum set of questions that would get you close to a statistically sound conclusion about COVID data.
Is the frequency of testing consistent across states and cities?
What are the false-positive and false-negative rates of the different testing instruments? (A short sketch after this list shows why these rates matter.)
How does the quality of data coming from this source compare to other sources?
What is the efficacy of each type of vaccine?
What is the hospitalization rate in each state/city?
How much does population density affect the spread of the virus?
How much does the local vaccination rate affect the spread of the virus?
To what degree does this data lag in its collection (e.g., hospitalizations lag behind the spread of the virus)?
Will vaccines change the rate of virus mutations in society?
What is the level of uncertainty in each piece of data and each conclusion?
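To make the testing-rates question concrete, here is a minimal sketch of how a false-positive rate interacts with prevalence via Bayes' theorem. Every number in it is hypothetical, chosen only for illustration:

```python
# A minimal sketch of the testing question, using Bayes' theorem.
# All numbers here are hypothetical, chosen only for illustration.

def prob_infected_given_positive(prevalence, sensitivity, specificity):
    """P(infected | positive test) via Bayes' theorem."""
    false_positive_rate = 1.0 - specificity
    p_positive = (sensitivity * prevalence
                  + false_positive_rate * (1.0 - prevalence))
    return sensitivity * prevalence / p_positive

# Hypothetical instrument: 90% sensitivity, 95% specificity,
# applied in a population where 2% are currently infected.
ppv = prob_infected_given_positive(prevalence=0.02,
                                   sensitivity=0.90,
                                   specificity=0.95)
print(f"P(infected | positive) = {ppv:.2f}")  # roughly 0.27
```

Under these hypothetical numbers, roughly three out of four positive results are false positives, which is why raw positive counts cannot be compared across instruments or regions without knowing the rates behind them.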
Whenever you read your friends', doctors', family members', data scientists', or news stations' conclusions based on COVID-related data, do you think they took into consideration all of the questions listed above? Do you think they considered even two of these questions? Given enough time and the right data set, I could answer some of these questions.
However, I would never be able to answer all of these questions without the assistance of an actual epidemiologist and other statisticians to verify my findings. Regardless of how much analysis you do, your conclusions will never be 100 percent settled.
Uncertainty stemming from data quality, bias, and variance will always be present, no matter which methodology you choose. That is why qualitative thinking must be coupled with probabilistic thinking for any analysis or viewpoint to be trustworthy.
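One way to keep that uncertainty visible is to report an interval rather than a single number. Here is a minimal sketch using a percentile bootstrap on fabricated data (200 simulated test results, 1 = positive); the sample size and underlying rate are assumptions for illustration only:

```python
# A minimal sketch of quantifying uncertainty with a bootstrap,
# using fabricated illustrative data (200 test results, 1 = positive).
import random

random.seed(42)
results = [1 if random.random() < 0.10 else 0 for _ in range(200)]

def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    means = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples)]
    return lo, hi

point = sum(results) / len(results)
lo, hi = bootstrap_ci(results)
print(f"positivity rate: {point:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```

Even this clean, simulated sample leaves an interval several percentage points wide, and that is before any of the data-quality and bias problems above enter the picture.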
It comes as no surprise that some of the best wisdom I have heard about the virus has come from Nassim Taleb, an expert on uncertainty*. Next time you analyze data, I would caution you against being certain of your conclusion and swinging too far to the quantitative side of the pendulum. This becomes even more critical when the level of data quality is in question.
*I believe Taleb's point is that if vaccines were more dangerous than the virus, millions of adverse events would already have surfaced. Those events would sit in the left tail of a probability distribution describing all possible outcomes of taking the vaccine. Since they have not appeared, the vaccine is less risky than the virus.
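The footnote's logic can be made concrete with back-of-the-envelope arithmetic. In this sketch the dose count is a realistic order of magnitude, but the per-dose risk is purely assumed, picked only to show how the numbers behave at scale:

```python
# A back-of-the-envelope version of the footnote's tail argument.
# The dose count is a realistic order of magnitude; the per-dose
# risk is purely hypothetical, chosen only for illustration.
import math

doses = 500_000_000          # hundreds of millions of doses administered
hypothetical_risk = 1e-5     # assumed probability of a severe event per dose

expected = doses * hypothetical_risk
std = math.sqrt(doses * hypothetical_risk * (1 - hypothetical_risk))
print(f"expected severe events: {expected:,.0f} ± {std:,.0f}")
# ~5,000 ± 71: even a tiny per-dose risk would produce a signal
# far too concentrated and too large to hide.
```

The point of the arithmetic is that at this scale the count of severe events would be tightly concentrated around its expectation, so a genuinely dangerous vaccine could not stay hidden in the data.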