Covid 19 Page 2

For the past month, I have spent countless hours examining the data structures produced as a result of worldwide reporting of the Covid-19 pandemic. There are two primary reasons I have spent so much time on this analysis:

To examine the strengths and weaknesses of the data being produced
To develop a preliminary understanding of the risk associated with Covid-19 and how that applies to governmental and personal decisions made given that information

I have outlined on the previous page the basic data structures Johns Hopkins University (JHU) and USA Facts are using to produce data on confirmed cases and the number of deaths. Because this pandemic is happening in real time, many things are changing very quickly: the scientific community keeps revising how it interprets this information and makes pseudo-predictions about what is to come, while some public officials make decisions partially based on scientific facts and others ignore the facts completely.

So What Does the Data Tell Us and What Does it Mean?

Let’s begin with the basics. In the early stages of the JHU dashboard there were simply counts of confirmed cases and deaths. These counts represent aggregates for each area defined by a region, province, or country.

Today the dashboard hosts many other types of data that have been useful in understanding other aspects of the pandemic, such as:

Cases are separated by Country/Province/US County
Number of hospitalized by US state
Number of people tested by US state
Incident rates
Case-fatality ratio
Testing rate by US state
Hospitalization rate by US state
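Two of the derived measures in the list above can be sketched in a few lines. This is a minimal illustration that assumes the commonly used definitions (incidence as confirmed cases per 100,000 residents, case-fatality ratio as deaths as a percentage of confirmed cases); the dashboard's exact definitions are not restated on this page, and the function names and county figures below are hypothetical.

```python
def incidence_rate(confirmed: int, population: int, per: int = 100_000) -> float:
    """Confirmed cases per `per` residents (assumed definition)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return confirmed / population * per


def case_fatality_ratio(deaths: int, confirmed: int) -> float:
    """Deaths as a percentage of confirmed cases (assumed definition)."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return deaths / confirmed * 100


# Hypothetical county: 250 confirmed cases, 5 deaths, 190,000 residents.
print(round(incidence_rate(250, 190_000), 1))
print(round(case_fatality_ratio(5, 250), 1))
```

Both measures depend on the confirmed count, so any under-reporting or unallocated cases carry straight through to the rates.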

All of this information is a collaboration between many organizations to provide the best up-to-date data useful to the public, government officials, and scientists alike. As the data structures behind this information have continually evolved, I found it necessary to understand the current limitations of reporting that hinder ongoing analyses.

The available datasets provide a wealth of information. However, like most well-intentioned data collection processes, most were constructed without the foresight of design theory, which allows a certain type of error or variance to occur when one attempts to make inferences about the population under study. This is not uncommon in data collection, but there are limitations to consider once we form hypotheses based on this data.

How Do You Construct a Space Time Cube?

For a number of weeks, I have been conducting tests to construct a space time cube that would allow for the visualization of the Covid-19 data over time specific to location. This requires a very specific data structure to visualize beyond just global aggregates.

Global aggregates do not convey any specific picture unless we are interested in a statement such as “The US currently has 788,920 confirmed cases of Covid-19 as of April 21, 2020”. This number is staggering considering that at the beginning of the second week of March (March 8, 2020) there were fewer than 1,000 reported cases nationwide, and by the end of that week (March 15, 2020) there were nearly 4,000.
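To put the March figures in perspective, here is a rough sketch of the implied doubling time. It assumes simple exponential growth between the two dates and uses the rounded values 1,000 and 4,000 in place of the "fewer than 1,000" and "nearly 4,000" counts quoted above, so the result is an approximation only; the function name is hypothetical.

```python
import math
from datetime import date


def doubling_time_days(start_count, end_count, start_day, end_day):
    """Doubling time in days, assuming exponential growth between two observations."""
    if start_count <= 0 or end_count <= start_count:
        raise ValueError("counts must be positive and increasing")
    elapsed = (end_day - start_day).days
    if elapsed <= 0:
        raise ValueError("end_day must be after start_day")
    return elapsed * math.log(2) / math.log(end_count / start_count)


# Rounded stand-ins for the nationwide counts on March 8 and March 15, 2020.
estimate = doubling_time_days(1_000, 4_000, date(2020, 3, 8), date(2020, 3, 15))
print(f"Approximate doubling time: {estimate:.1f} days")
```

A fourfold increase in seven days is two doublings, or roughly 3.5 days per doubling under these assumptions.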

However shocking (or amazing) these numbers are, they tell us nothing about the diversity of reporting across all 50 states in the US at any given time. It took until approximately March 22, 2020 for JHU to begin a broader strategy of restructuring the Covid-19 data into more specific aggregates that would provide better results.

For instance, the original worldwide dataset contained one column that represented all of the world’s provinces and states with no apparent way to query different parts of the dataset:

JHU Original Data Structures

Although there are accompanying GPS coordinates, as you can see the province and country data were a mixed bag of possibly redundant or unstructured entries. For instance, if you look at line 91, does Denmark have a state or province named Denmark? Does France?

Note:

For the record, I am not pointing this out as an attack on JHU, the scientific community, or any other organization putting this data together; it is a reflection of the need for well-structured data to make informed decisions. As time went on, more work was put into this and the data has been greatly improved.

My primary interest was gathering county-level data in the US to understand how the virus was spreading based on the numbers, the geographical location, and any specific point in time. To do this I needed to create categorical variables.

As you can see from the original dataset, the US county and state were also not separated into their own fields. This was problematic because, even if you could easily separate them, the data was already fragmented at the state and county level; thus, today we have a less representative sample to work with beyond just the global, state, and nationwide counts.

JHU Original Data Structures

These are examples of the data fragmentation in the JHU dataset and the USA Facts dataset:

USA Facts Unallocated Data

The fragmented data outside the US county level that cannot be incorporated locally.

JHU Unallocated Data

The fragmented data outside the US county level that cannot be incorporated locally.
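One way to quantify this fragmentation is to separate the rows that cannot be tied to a county and total them. The sketch below is illustrative only: the labels ("Statewide Unallocated", "Unassigned"), the FIPS conventions, and the counts are assumptions about how such records might be marked, not values copied from either dataset.

```python
# Hypothetical rows; labels, FIPS values, and counts are assumptions for illustration.
rows = [
    {"county": "Minnehaha County", "state": "SD", "fips": "46099", "confirmed": 10},
    {"county": "Statewide Unallocated", "state": "SD", "fips": "0", "confirmed": 3},
    {"county": "Unassigned", "state": "SD", "fips": "", "confirmed": 2},
]

UNALLOCATED_LABELS = {"statewide unallocated", "unassigned"}


def is_unallocated(row):
    """True when a row cannot be tied to a specific county."""
    label = row["county"].strip().lower()
    fips = row["fips"].strip()
    # A county FIPS code is numeric and its last three digits are not all zero.
    has_county_fips = fips.isdigit() and int(fips) % 1000 != 0
    return label in UNALLOCATED_LABELS or not has_county_fips


allocated = [r for r in rows if not is_unallocated(r)]
unallocated_total = sum(r["confirmed"] for r in rows if is_unallocated(r))
total = sum(r["confirmed"] for r in rows)

print(f"{unallocated_total} of {total} confirmed cases cannot be mapped to a county")
```

Reporting the unallocated share alongside any county-level map makes the size of the gap explicit rather than silently dropping those cases.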

Before the data change, this fragmentation would have had to be rectified in order to create the queries necessary to produce a meaningful time series analysis. As you can see below, the data was restructured so that different categorical variables could be separated from the original province_state column.
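The kind of separation described here can be sketched as follows. This is a hedged example, not the exact procedure used: it assumes US county rows were written as "County, ST" in the province_state column, treats a province that merely repeats the country name (as in the Denmark example) as empty, and uses a hypothetical function name and sample values.

```python
def split_province_state(value, country):
    """Split a mixed province_state value into separate categorical fields."""
    value = (value or "").strip()
    # A missing value, or one that just repeats the country, is a country-level row.
    if not value or value == country:
        return {"county": None, "province_state": None}
    # Assumed convention for US county rows: "County Name, ST".
    if country == "US" and "," in value:
        county, state = (part.strip() for part in value.rsplit(",", 1))
        return {"county": county, "province_state": state}
    return {"county": None, "province_state": value}


# Hypothetical sample values for illustration.
samples = [("King County, WA", "US"), ("Denmark", "Denmark"), ("Hubei", "China")]
for value, country in samples:
    print(country, split_province_state(value, country))
```

With county and state in their own fields, each level can be queried or aggregated independently.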

JHU Data Restructuring

The purpose of this restructuring was to convert the time series into a row format for certain visual tasks that will be discussed in the next phase. Since the time series were stored in columns rather than rows, it was important to prep the data so that the counts were preserved intact when converting these columns to rows. Indices were also created for reference.
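The column-to-row conversion can be illustrated with a small sketch using only the Python standard library. The column names, the date header format, and the counts are assumptions made for the example; a data-frame library's melt operation would accomplish the same reshaping.

```python
import csv
import io
from datetime import datetime

# Hypothetical wide-format extract: one column per date (counts are made up).
WIDE_CSV = """fips,county,state,3/20/20,3/21/20,3/22/20
46099,Minnehaha County,SD,5,5,6
46103,Pennington County,SD,0,1,1
"""

ID_COLUMNS = ("fips", "county", "state")


def wide_to_long(wide_rows, id_columns=ID_COLUMNS):
    """Turn one row per county into one row per county per date."""
    long_rows = []
    for row in wide_rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            record = {key: row[key] for key in id_columns}
            record["index"] = len(long_rows)  # running index kept for reference
            record["date"] = datetime.strptime(column, "%m/%d/%y").date().isoformat()
            record["confirmed"] = int(value)
            long_rows.append(record)
    return long_rows


wide_rows = list(csv.DictReader(io.StringIO(WIDE_CSV)))
long_rows = wide_to_long(wide_rows)

# Check that no counts were lost or duplicated in the conversion.
wide_total = sum(int(v) for r in wide_rows for k, v in r.items() if k not in ID_COLUMNS)
long_total = sum(r["confirmed"] for r in long_rows)
assert wide_total == long_total

for record in long_rows[:3]:
    print(record)
```

Comparing the totals before and after the reshape is a cheap way to confirm the count data stayed intact.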

So Why is All of this Important?

Understanding how the data was collected is as important as understanding how much data has been omitted or is missing, intentionally or not. There are many nuances to data analysis that must be considered prior to building a statistical model, such as understanding the strength of the data to explain a stochastic process, or understanding which assumptions might have been violated after a model of a spatial process has been built and/or analyzed.

Thus, examining the original and ongoing data collected at the US county level shows that data that has not been incorporated could make the space time cube spatial analysis less robust. However, since these are exploratory tools and the limitations have been stated, the construction of the cube depended on this exploratory process to assess the data structure in order to make sensible interpretations of the results.

Discussion (4/21/2020)

Let’s recap some of these preliminary findings:

As described above, one of the primary directives of this investigation was to examine how good the data coming from our agencies is. In short, the data in the beginning was fragmented, and truthfully some of it continues to be fragmented. This is a concern because data that cannot be resolved below the aggregate state level is not very helpful in modeling or examining spatial events at the local level.
The data from JHU and USA Facts are similarly structured now and allow for county-level analysis of confirmed cases and deaths from Covid-19.
Shapefiles are a file format that stores geographic features such as points, lines, or polygons, and are used in GIS software such as ArcGIS Pro to visualize data such as the Covid-19 data. The data in its raw form could be joined to a specific US county shapefile provided the data had a linking ID that would allow for a join, such as the state and county FIPS code. This allowed for the visualization of one time step at a time, but using more advanced tools such as the space time cube toolbox required additional work to produce a working dataset/shapefile.
After extensive research and testing, it was determined that a row-normalized set of time series data would be required to examine the data further. This requires transposing the date columns so that each date becomes an individual bin for each of the 3,142 counties. This is part of the next phase, and these results will be examined very soon.
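The FIPS-based join mentioned in the findings above can be sketched in plain Python. This is an illustration under stated assumptions, not the GIS workflow itself: the dictionary stands in for a county shapefile's attribute table, the field names are hypothetical, and the counts are made up.

```python
def county_fips(state_code, county_code):
    """Build the 5-digit linking ID: 2-digit state code plus 3-digit county code."""
    return f"{int(state_code):02d}{int(county_code):03d}"


# Stand-in for a county shapefile's attribute table, keyed by 5-digit FIPS.
shapes = {
    "46099": "polygon for Minnehaha County",
    "46103": "polygon for Pennington County",
}

# Hypothetical row-format records (one per county per date).
records = [
    {"state_fips": "46", "county_fips": "99", "date": "2020-03-22", "confirmed": 6},
    {"state_fips": "46", "county_fips": "103", "date": "2020-03-22", "confirmed": 1},
    {"state_fips": "46", "county_fips": "0", "date": "2020-03-22", "confirmed": 3},
]

joined, unmatched = [], []
for record in records:
    key = county_fips(record["state_fips"], record["county_fips"])
    if key in shapes:
        joined.append({**record, "fips": key, "shape": shapes[key]})
    else:
        # Rows with no matching county (for example unallocated cases) cannot be mapped.
        unmatched.append({**record, "fips": key})

print(f"joined: {len(joined)}, unmatched: {len(unmatched)}")
```

Zero-padding the codes matters because FIPS values read as integers lose their leading zeros and then fail to match the shapefile's IDs.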

Next: Phase III: Building a Working Space Time Cube

Updated Tuesday April 21, 2020


Mato Ohitika Analytics LLC
