Phase II: A Look at Where the Covid-19 Data is and What it Means
The Week of Monday April 20, 2020
Dr. Joseph Robertson
Mato Ohitika Analytics LLC
Tuesday April 21, 2020
(Last Updated Tuesday April 21, 2020)
For the past month, I have spent countless hours examining the data structures produced as a result of worldwide reporting of the Covid-19 pandemic. There are two primary reasons I have spent so much time on this analysis:
I have outlined on the previous page the basic data structures Johns Hopkins University (JHU) and USA Facts are using to produce data on confirmed cases and the number of deaths. Since this pandemic is happening in real time, much is changing very quickly: how the scientific community interprets this information and makes tentative predictions about what is to come, and how public officials make decisions, some grounded partially in scientific fact and some ignoring the facts completely.
So What Does the Data Tell Us and What Does it Mean?
Let’s begin with the basics. In the early stages of the JHU dashboard, there were simply counts of confirmed cases and deaths. These counts represent aggregates for each area, defined by a region, province, or country.
Today the dashboard hosts many other types of data that have been useful in understanding other aspects of the pandemic, such as:
All of this information is a collaboration among many organizations to provide the best up-to-date data for the public, government officials, and scientists alike. As the data structures behind this information have continually evolved, I found it necessary to understand the current reporting limitations that hinder the ongoing analyses.
The datasets available provide a wealth of information; however, like most well-intentioned data collection processes, most are constructed without the foresight of design theory, and this allows a certain type of error or variance to occur if one attempts to make an inference about the population of study. This is not uncommon in data collection, but there are limitations to be considered once we form hypotheses based on this data.
How Do You Construct a Space Time Cube?
For a number of weeks, I have been conducting tests to construct a space time cube that would allow for the visualization of the Covid-19 data over time, specific to location. This requires a very specific data structure to visualize anything beyond global aggregates.
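Conceptually, a space time cube is just case counts aggregated into (location bin, time bin) cells. Here is a minimal sketch of that structure, using hypothetical point records and illustrative bin sizes rather than the actual JHU schema or tooling:

```python
from collections import defaultdict
from datetime import date

# Hypothetical point records: (latitude, longitude, date, confirmed count).
# These values are illustrative only, not real Covid-19 data.
records = [
    (44.0, -103.2, date(2020, 3, 8), 3),
    (44.4, -103.1, date(2020, 3, 8), 2),
    (44.0, -103.2, date(2020, 3, 15), 10),
]

def cube_cell(lat, lon, d, lat_step=1.0, lon_step=1.0):
    """Assign a point to a (lat bin, lon bin, ISO week) cell of the cube."""
    return (int(lat // lat_step), int(lon // lon_step), d.isocalendar()[1])

# Aggregate counts per cell: this is the "cube" a space-time
# visualization would slice by location and time.
cube = defaultdict(int)
for lat, lon, d, count in records:
    cube[cube_cell(lat, lon, d)] += count
```

The first two records fall into the same spatial bin and week, so their counts sum into one cell; the third lands in the next week’s cell. Real tools bin far more carefully, but the required structure is the same: every count must be tied to both a location and a time.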
Global aggregates do not convey a specific picture unless we are interested in statements such as “The US currently has 788,920 confirmed cases of Covid-19 as of April 21, 2020.” This number is staggering considering that at the beginning of the week of March 8, 2020 there were fewer than 1,000 reported cases nationwide, and by the end of the week of March 15, 2020 there were nearly 4,000 cases.
However shocking (or amazing) these numbers are, they tell us nothing about the diversity of reporting across all 50 US states at any given time. It took until approximately March 22, 2020 for JHU to begin a broader strategy of restructuring the Covid-19 data into more specific aggregates that would provide better results.
For instance, the original worldwide dataset contained one column that represented all of the world’s provinces and states with no apparent way to query different parts of the dataset:
Although there are accompanying GPS coordinates, as you can see, the province and country data was a mixed bag of possibly redundant or unstructured data. For instance, if you look at line 91, does Denmark have a state or province named Denmark? Does France?
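The kind of redundancy described above can be flagged mechanically. This sketch assumes the two-column layout of the early worldwide file (`Province/State`, `Country/Region`); the rows are illustrative stand-ins, not the actual JHU records:

```python
# Illustrative rows mimicking the early layout; actual JHU values
# and row order are not reproduced here.
rows = [
    {"Province/State": "Denmark", "Country/Region": "Denmark"},
    {"Province/State": "Faroe Islands", "Country/Region": "Denmark"},
    {"Province/State": "", "Country/Region": "Italy"},
    {"Province/State": "Hubei", "Country/Region": "China"},
]

# Flag rows where the province field simply repeats the country name --
# the possible redundancy noted above for Denmark and France.
redundant = [r for r in rows if r["Province/State"] == r["Country/Region"]]
```

A check like this separates rows that plausibly represent a country-level aggregate from rows that name a genuine sub-national unit, which is exactly the distinction the original single-column layout obscured.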
For the record, I am not pointing this out as an attack on JHU, the scientific community, or any other organization putting this data together; it reflects the need for well-structured data to make informed decisions, and as time went on, more work was put into this and it has greatly improved.
My primary interest was gathering county level data in the US to understand how the virus was spreading based on the numbers, the geographical location, and any specific point in time. To do this, I needed to create categorical variables.
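One way to derive those categorical variables is to split the combined entries into separate county and state fields. This sketch assumes the early “County, ST” pattern that appeared in the US rows; the entries and the helper function are hypothetical:

```python
# Hypothetical early-format US entries combining county and state in one
# field; values here are illustrative only.
entries = ["King County, WA", "Cook County, IL", "Washington"]

def split_county_state(value):
    """Split a 'County, ST' string into separate categorical fields.

    Entries without a comma are treated as state-level rows and get an
    empty county field."""
    if "," in value:
        county, state = value.split(",", 1)
        return county.strip(), state.strip()
    return "", value.strip()

parsed = [split_county_state(v) for v in entries]
```

With county and state as separate variables, the data can finally be queried by either one independently, which the original combined column did not allow.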
As you can see from the original dataset, the US county and state were also not separated into distinct fields. This was problematic because even if you could easily separate them, there is already data fragmentation at the state and county level; thus, today we have a less representative sample to work with beyond the global, state, and nationwide counts.
These are examples of the data fragmentation in the JHU dataset and the USA Facts dataset:
Before the data change, this fragmentation would have had to be rectified in order to create the queries necessary to produce a meaningful time series analysis. As you can see below, the data was restructured so that different categorical variables could be separated from the original province_state column.
The purpose of this restructuring was to put the time series into a row format for certain visual tasks that will be discussed in the next phase. Since the time series were in column rather than row format, it was important to prep the data so that, when converting these columns to rows, the count data was preserved and intact. Indices were also created for reference.
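The column-to-row conversion described above can be sketched as a simple “wide to long” reshape. The field names and values below are illustrative, not the exact JHU headers; the point is that each (place, date) count survives as its own row, with an index referencing the original record:

```python
# Wide format: one row per place, one column per date.
# Names and counts are illustrative only.
wide = [
    {"county": "Minnehaha", "state": "SD", "3/8/20": 0, "3/15/20": 2},
    {"county": "Pennington", "state": "SD", "3/8/20": 1, "3/15/20": 3},
]
id_fields = ("county", "state")

# Long format: one row per (place, date), count preserved intact.
long_rows = []
for idx, row in enumerate(wide):
    for col, value in row.items():
        if col in id_fields:
            continue
        long_rows.append({
            "index": idx,          # reference index back to the wide row
            "county": row["county"],
            "state": row["state"],
            "date": col,
            "confirmed": value,
        })
```

Because every count appears exactly once in the long format, the totals before and after the reshape must match; checking that sum is a quick way to confirm the count data was preserved.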
So Why is All of this Important?
Understanding how the data was collected is as important as understanding how much data has been omitted or is missing, intentionally or not. There are many nuances of data analysis that must be considered prior to building a statistical model, such as understanding the strength of the data to explain a stochastic process, or understanding which assumptions might have been violated after a model of a spatial process has been built and/or analyzed.
Thus, in examining the nature of the original and ongoing data collected at the US county level, any data that has not been incorporated could have consequences that make the space time cube spatial analysis less robust. However, since these are exploratory tools and the limitations have been stated, the construction of the cube depended on this exploratory process to assess the data structure in order to make sensible interpretations of the results.
Let’s recap some of these preliminary findings: