Phase III: Building a Working Space Time Cube
The Week of Monday April 20, 2020
Dr. Joseph Robertson
Mato Ohitika Analytics LLC
Thursday April 23, 2020
The previous exploratory analysis in Phase II consisted of examining optimal data structures in order to construct a Space Time Cube (STC) with the Covid-19 data available through JHU and USA Facts. There were multiple steps to achieve this task.
Achieving this required an algorithmic approach to accomplish the following:
*The term row normalized loosely refers to the action in which a column structured time series is transposed in such a way that it creates a conventional row form, which makes analyzing time series more efficient in software such as R.
So What Can a Space Time Cube Do?
The Create Space Time Cube tool (ESRI) takes time stamped point features and structures them into a netCDF data cube by aggregating the points into space-time bins. Within each bin the points are counted and the trend for bin values across time at each location is measured using the Mann-Kendall statistic.
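For intuition, here is a minimal sketch in R of the Mann-Kendall idea (my own illustration, not part of the ESRI tool): for a single bin's series of values, Kendall's tau between the values and their time index measures whether there is a monotonic trend, and base R's cor.test() computes it.
## Hypothetical illustration of the Mann-Kendall trend idea for one space-time bin
## (my own sketch; the Create Space Time Cube tool performs this internally for every bin)
set.seed(42)
counts <- cumsum(rpois(10, lambda = 3) + 1)   # a toy increasing series of counts for one location
steps  <- seq_along(counts)                   # the time index 1..10
cor.test(counts, steps, method = "kendall")   # Kendall's tau and p-value; tau > 0 indicates an increasing trend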
Below is a theoretical representation of time as the third dimension: a set of geographical bins that share boundaries each contain a value of interest at every time step, and these values stack on one another as time increases.
In studying the Covid-19 pandemic in the United States, it became apparent that this cube would allow a look at the spread of the virus over time as it relates to the cumulative totals contained in the JHU and USA Facts data reporting. This information is not only useful in visualizing the increasing trend in the number of confirmed cases and deaths at the county level; it can also support additional emerging hot spot analyses to examine and verify what the public has been told daily about the Covid-19 virus.
Creation of the Space Time Cube in Practice
This next section outlines the algorithmic steps I took to create a universal database file that could be instantly loaded into ArcGIS Pro to create a feature class containing all of the row normalized data. There were two fundamental pieces that had to be in place in order for the STC to function correctly:
Here are the steps taken:
First, you can either download the available data or pull it directly from JHU's GitHub page or the USA Facts webpage; in this example we pull from USA Facts:
#### Code Developed by Mato Ohitika Analytics LLC
####### Please cite this code properly if you use it.
All code was written in R.
#Pull data from USA Facts
library(RCurl)
y <- getURL("https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_confirmed_usafacts.csv")
x <- read.csv(text = y)
str(x)
summary(x)
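As a quick check of my own (not part of the original workflow), you can inspect the imported column names; read.csv converts the date headers into syntactically valid R names, typically prepending an X and replacing slashes with dots, which produces the pseudo dates discussed below.
# Inspect the renamed date columns (my own check; the exact column positions are an assumption)
names(x)[1:10]
# e.g. "countyFIPS" "County.Name" "State" "stateFIPS" "X1.22.20" "X1.23.20" ...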
There is one fundamental issue that exists in both the JHU and USA Facts data and makes pulling these data into R problematic: the time columns, as you can see, are read as categorical variables rather than as dates, as they are in the original document. That in itself is not the problem, though; the problem lies in how the machine recognizes the order of the digits when sorting them.
Now, the dates in the form below are actually intentional, to create a verifiable index when we place the final time series in row format. If you notice, the pseudo date pulled from the original data has the format 2.1.20. If you sort dates that are not in mm/dd/yyyy format, the sequence is arranged as 2.1.20, 2.10.20, 2.11.20, and so on. The format has to be at least 2.01.20 to create the proper date sequence.
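A small, hypothetical example of the problem: sorting the un-padded date strings as text puts the tenth of the month ahead of the second, while zero-padded strings sort correctly.
# Hypothetical example of sorting pseudo dates as text (not part of the original code)
sort(c("2.1.20", "2.2.20", "2.10.20", "2.11.20"))
# typically returns "2.1.20" "2.10.20" "2.11.20" "2.2.20" -- out of order
sort(c("2.01.20", "2.02.20", "2.10.20", "2.11.20"))
# returns "2.01.20" "2.02.20" "2.10.20" "2.11.20" -- the proper sequence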
There are two things you can do to solve this:
This example is from USA Facts, and the data structure is fundamentally based on county FIPS codes. To capture only US county data we need to remove any coding that is not a recognized county FIPS, in this case any FIPS coded as zero or representing a non-county entity such as the Diamond Princess. There was also a redundant flaw in the Alaska FIPS codes that needed to be removed for accuracy.
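As a value-based alternative of my own (the code below removes rows by position instead), the statewide unallocated rows can also be dropped by their FIPS value of zero; the handful of non-county rows such as the Diamond Princess still have to be identified and removed individually.
# Hypothetical value-based filter (assumes unallocated rows are coded with countyFIPS 0)
x_alt <- subset(x, countyFIPS != 0)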
The JHU data required a similar procedure to capture the county level data and remove unallocated or non-county FIPS entries.
Note: Removing these rows is necessary to create a clean, US-county-only time series. As I discussed in the last phase, the reason for laying out the fragmentation of this data is that constructing the STC entails a loss of data. How much data? Well, it depends; it could be a little or a lot. Look below at the USA Facts data from 4-22-20:
As you can see, a little could mean there is zero unallocated data, which suggests the county level counts are as accurate as they can be given the reporting. However, if we wanted to look at emerging hot spots in, say, Georgia on the heels of its attempt to open up the state prematurely, the 1,000 confirmed unallocated cases could matter, since we do not know which counties they belong to; because of the fragmentation, we can only guess where these cases are located.
Moving on, the following code sorts the data, checks for the identified rows not needed for the STC, and then writes a new file: a merged pre-file used to construct a row normalized STC.
#### Code Developed by Mato Ohitika Analytics LLC
####### Please cite this code properly if you use it.
# Order fips codes so the 0's are in the first position
x <- x[order(x$countyFIPS),]
head(x)
#next pull out all unallocated points
#check points for wade hampton and diamond princess row 94 and 188
##remove unallocated and the 2 non county points
x <- x[-c(1:51),]
x[94,]
x[188,]
x <- x[-c(94,188),]
#Check ranges to see if the off points have been removed
head(x[185:190,])
head(x[90:95,])
## Next let's pull the ID's file needed to create the shapefile
## This file I created has all the row data from the tl_2019_county shapefile from the US census
y <- read.csv("c://data//covid//covid_row_ID.csv")
#Examine the Structure
head(y)
str(y)
# We use the merge function to merge the two files, keeping all entries, by countyFIPS
z <- merge(x,y, by.x="countyFIPS", by.y="countyFIPS",all=TRUE)
#Examine the Structure
str(z)
length(z)
head(z)
names(z)
# We rearrange the merged columns so the metadata comes before the time step columns
z <- z[,c((97:104),(1:96))]
#drop redundant elements(if necessary)
#a <- z[,c(1:88)]
#head(a)
#finally write pre-long form csv
#write.table(a, file = "c://data//county_merge_ID_4_21_20.csv", sep = ",", col.names = NA, qmethod = "double")
write.table(z, file = "c://data//county_merge_ID_4_21_20.csv", sep = ",", col.names = NA, qmethod = "double")
The first table below shows the result of grabbing the data directly from the web. Notice the error in the countyFIPS column name (column 2); if you create the dates in the format I described above, you will not see that error, but you can also simply rename it using R.
The second table is the covid_row_ID file, which I created from the 2019 US Census TIGER shapefile to provide a linking ID for each county FIPS, as well as, in column 11, a unique ID for matching the locations when creating the space time cube.
The third table is the result of running the code above. As you can see, the metadata has all been moved to the left of the time series columns. If you notice that the first column has no name, this is because R writes an index every time it merges and writes tables. I always keep these indexes to ensure there is a way to check the data integrity at every step.
Most of those columns can be removed once you have created the row normalized file, but any good data scientist knows that creating an audit trail of the data system is paramount so the results can be reproduced or copied. This merge is a kind of master file that contains all 3,142 counties in the US, with a way to query each county separately.
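For example (a hypothetical query of my own), any single county can be pulled out of the merged master file by its FIPS code:
# Hypothetical query of the master merge by county FIPS (46099 is Minnehaha County, SD)
minnehaha <- z[z$countyFIPS == 46099, ]
head(minnehaha)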
The next step is to use this pre-long file to create a row normalized file with a date index. This is tricky, since the number of rows will go from about three thousand to around three hundred thousand.
Now that we have a working pre-file with all of the components, we need to row normalize the dates from columns to rows, which the next section of code does:
#### Code Developed by Mato Ohitika Analytics LLC
####### Please cite this code properly if you use it.
#reset workspace
rm(list = ls())
# Now call the merged ID file and prepare it for row transpose
#generic read.csv code
#x <- read.csv("c://data//county_merge_ID.csv")
x <- read.csv("c://data//county_merge_ID_4_21_20.csv")
#Examine the Structure
head(x)
str(x)
summary(x)
names(x)
# Call tidyr library to perform the gather function
library(tidyr)
# To transpose the data, we use the gather command in tidyr
## The gather function works as gather(a, b, c, d):
## a is the data frame you want to gather from; b is the name of the new key column that will hold
## the column headers containing the time steps; c is the name of the new value column that will hold
## the gathered counts from those columns, in this case the Covid-19 confirmed cases, or "confirmed"
## for short; and d is the set of columns you are telling R to gather into row format. The remaining
## columns in the data frame are duplicated for each time step, making the copies of each county's
## metadata needed for the STC.
# For the dates from Jan 22 to April 21, columns 15:105 are called
covid_long <- gather(x,time,confirmed,15:105)
# Data checks on how this process went
str(covid_long)
head(covid_long)
tail(covid_long$time)
## Now write the transposed time data to a csv called covid_long
write.csv(covid_long,"c:\\data\\covid\\covid_long_data_ID_4_21_20.csv")
As you can see, the covid_long_data_ID_4_21_20 file has successfully transposed the county data, with each column of the metadata neatly copied and repeated for each successive date in the sequence. The time column is the original categorical time variable from the USA Facts data that I purposely created as a time index to reference against the final step in the next code sample, which adds a time stamp to all of the confirmed cases and finalizes the STC data structure.
This transformation produced a row normalized time series 91 days in length per county and 285,922 rows long.
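As an aside of my own (not part of the original workflow), readers on newer versions of tidyr can write the same reshape with pivot_longer, which has since superseded gather:
# Equivalent reshape with tidyr::pivot_longer (my own sketch, assuming the same column positions 15:105)
library(tidyr)
covid_long <- pivot_longer(x, cols = 15:105, names_to = "time", values_to = "confirmed")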
#### Code Developed by Mato Ohitika Analytics LLC
####### Please cite this code properly if you use it.
## Now we create a time stamp of the row transposed dates.
##The time column will serve as an index to assure proper alignment
## The today variable will create a sequence of dates from 1/22 to 4/21 or whatever
## date you wish. The date1 variable formats the dates into the correct mm/dd/yyyy format.
today <- seq(as.Date("2020/1/22"), as.Date("2020/4/21"), "days")
date1 <- format(today, "%m/%d/%Y")
date1
length(date1)
## The rep() function allows the block of dates to be written in succession
## x number of times to achieve the correct time sequence for every county
## This will repeat in this example 91 days x 3142 counties
## Thus, date will be the correct sequenced time series for the STC
date <- rep(date1, times=3142)
summary(date)
## Read back in the transposed long form data
x <- read.csv("c://data//covid//covid_long_data_ID_4_21_20.csv")
## Order by county name (NAMELSAD), time, and county FIPS
x <- x[with(x, order(NAMELSAD, time, countyFIPS)), ]
##cut extra redundant columns
x <- x[,-c(1,3,12,13)]
colnames(x)[1] <- "xid"
x[1:91,]
head(x)
str(x)
summary(x)
names(x)
## Column Bind (cbind) dates with the final file
x <-cbind(x,date)
x <- x[with(x, order(ID, time)), ]
## Now write the final time-stamped data to a csv
write.csv(x,"c:\\data\\covid\\covid_long_data_ID_4_21_20_final.csv")
##Check the new csv written
#x <- read.csv("c://data//covid//covid_long_data_ID_4_18_20_final.csv")
#str(x)
The code above puts the finishing touches on converting a raw data table from JHU or USA Facts into a row normalized data file with the correct GPS coordinates, IDs, and Covid-19 data. One additional time-saving step is to cbind either the deaths or the confirmed cases into one of the files created by this technique, so that you can make two STCs using one shapefile.
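Here is a rough sketch of that extra step (my own, with an assumed file name for a deaths file that has been run through the identical pipeline and sorted in the same order):
# Hypothetical sketch: bind a deaths column onto the confirmed-case file
# Assumes a deaths file processed through the identical steps above (file and column names are assumptions)
deaths <- read.csv("c://data//covid//covid_long_deaths_ID_4_21_20_final.csv")
# Both files must be sorted identically (by ID and time, as above) before binding
x <- cbind(x, deaths = deaths$deaths)
write.csv(x, "c:\\data\\covid\\covid_long_combined_ID_4_21_20_final.csv")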
As you can see below, the date variable created a perfect sequence of dates for each county, 91 days in length, in ascending order by confirmed case. Notice that the time column was specifically designed as an audit for the last steps in the code above. If you open up the csv and scroll down the rows, you can verify that all of these procedures produced the necessary data structure to create a space time cube with literally one hit of the run key in R. Amazing stuff!
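A couple of quick checks of my own can confirm the structure before building the cube: every county should contribute exactly 91 rows, and the total row count should equal 91 x 3,142 = 285,922.
# Hypothetical integrity checks on the final file (assumes the countyFIPS column was retained)
nrow(x) == 91 * 3142            # expect TRUE (285,922 rows)
all(table(x$countyFIPS) == 91)  # expect TRUE: 91 time steps per county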
Discussion (4/23/2020)
These past few weeks have been difficult, to say the least. I had originally planned on having the bulk of these maps ready for public study. However, as the rhetoric has shifted both scientifically and politically, I felt an urgent need to go back and perform a battery of tests, not only on the code I wrote to create the STC(s) but also on my own footing as to why I was doing this in the first place.
I have had the once in a lifetime opportunity to extensively study spatial statistics under my advisor, Dr. Gary Hatfield at South Dakota State University, who assisted me in constructing a case study in my dissertation of the very type of information now being produced by the Covid-19 pandemic.
That is why county level data is so important: we can learn a lot about what is happening if we are not just looking for predictive models to solve the pandemic crisis. Often, we can use emerging hot spots and trends to strengthen the evidence against, say, state governors making false claims that the Covid-19 virus is no longer spreading and that we can therefore go back to normal, causing irreparable damage to others based not on data but on magical thinking.
I felt a strong call to duty after I read Wired Magazine's Electric Word (05/20) by Adam Rogers, a senior correspondent, when he wrote: "Good scientists give an honest accounting of their own uncertainties, but when scientists point out that powerful people are doing dangerous things, those people dilute the critique by emphasizing the uncertainties..."
He is right when he explains that we have to Hold the Line to buy scientists time. We must not forget that if any of this work we do ends up saving someone's life, then it was worth the risk to hold our fellow citizens accountable through the lens of science.
The emerging hot spot and trend maps that I have constructed through this process will, I hope, lend valuable information to the Covid-19 cause and allow for further discussion of next steps, perhaps in hours instead of days. Look for these maps very soon.
Next: Phase IV: Emerging Hot Spot and Trends Mapping (Preview)
Updated Thursday April 23, 2020