Sunday, May 24, 2020

COVID-19 May 25, 2020

Let's talk about data!  Not the data from yesterday, but the origin of the data.  A lot of you posted on my Facebook site questioning the veracity of the death data.  Several others contacted me privately. Let me say, "Thank you!"  Data is king and it is absolutely proper to question the data I'm providing.  I have been as transparent as possible about the source of my data from my very first post.  I even include the link on the plots I put out daily.  Here it is again: https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/United_States_medical_cases

"Wait! You're using Wikipedia as a data source?  That's not a reliable source!"  You're right.  It isn't. At least by itself. I'll come back to this at the end.  But first, let's look at some other possible data sources: the CDC, Johns Hopkins University, and Worldometer.  I'll say a little about each of these and why I have rejected them as a source in favor of Wikipedia.

The CDC (U.S. Centers for Disease Control and Prevention):  The CDC Covid-19 website is here.  If you poke around a little bit, you'll find a page with a graphic that looks like this:
If you compare these numbers to what I had for May 24 (which are actually numbers accumulated as of May 23), I have 1,612,018 for total cases and 87,372 for total deaths.  Neither of my numbers agrees with the CDC's, but the deaths appear to be substantially different, as many of you pointed out.  If you click on the little "About the Data" link below the Total Deaths statistic on the CDC site, you can learn a little bit more about the origin of their data.  There are a few important things.  First, it says, "Numbers reported on Saturdays and Sundays are preliminary and not yet confirmed by state and territorial health departments. These numbers may be modified when numbers are updated on Mondays."  I'll point out that today is Sunday.  That means yesterday was Saturday. Second, and more importantly, it says, "State and local public health departments are now testing and publicly reporting their cases. In the event of a discrepancy between CDC cases and cases reported by state and local public health officials, data reported by states should be considered the most up to date."  So, right there on their own site they tell you that the data from the states should take precedence.  Finally, the CDC says, "There are currently 55 U.S.-affiliated jurisdictions reporting cases of COVID-19. This includes 50 states, District of Columbia, Guam, the Northern Mariana Islands, Puerto Rico, and the U.S Virgin Islands."  So, at least we know where they claim to get their data from. They even have a map that you can click on that takes you to each of the 55 jurisdictions.  Basically, the CDC is a data aggregator and it's up to the user to go and check all the individual sources of data and make sure it's consistent.  I don't have the time for that, and further, the CDC doesn't make it easy to see all the daily data from all the 55 jurisdictions.  As I'll show when I get back to Wikipedia, I have reason to believe the CDC has not tallied things correctly.
Nevertheless, if one takes the CDC data without digging further to the actual source, you get the numbers above.

The Johns Hopkins University info is here.  It's a really fancy interface, albeit sometimes painfully slow even on my fairly beefy desktop.  There's a section on that site that shows total deaths.  The screenshot is below:


It shows a number closer, but still different than the CDC.  Perhaps they've incorporated numbers from today whereas the CDC has not yet done so? Who knows.  And I mean that literally.  Who knows!  I don't, and neither do you, because try as I might, I was unable to find any statement of where their data is coming from.  That's amazingly sloppy work for such a respectable institution.  Presumably, they aren't collecting it themselves, so like the CDC I assume they are a data aggregator.  Are they getting it from the same 55 jurisdictions as the CDC?  Are they getting it directly from the CDC?  Who knows.  I've rejected this site as a source of data because it provides no traceable way to verify the data.  I don't know where it came from nor can I verify that the data has been tabulated properly.

How about Worldometer?  They are currently showing 99,300 deaths.  Different and higher again.  At least they have some info on the source of their data: "Our sources include Official Websites of Ministries of Health or other Government Institutions and Government authorities' social media accounts. Because national aggregates often lag behind the regional and local health departments' data, part of our work consists in monitoring thousands of daily reports released by local authorities. Our multilingual team also monitors press briefings' live streams throughout the day. Occasionally, we can use a selection of leading and trusted news wires with a proven history of accuracy in communicating the data reported by Governments in live press conferences before it is published on the Official Websites."  So, if I want to verify the Worldometer numbers, then, as with the CDC, I have to go hunting for the data, but this time they don't even tell me exactly where.  I rejected this data for exactly that reason.  Like JHU, I don't know where the data came from and I am unable to verify that the data has been tabulated correctly.

Let's loop back to Wikipedia.  That site is an aggregator like the others.  Unlike the others, however, they have comparatively exquisite detail on all the data. It is broken down by state and territory and further provided on a daily basis.  There are direct links to the sources of the data--usually the state health departments.  Further, there are now hundreds of footnotes describing any irregularities or idiosyncrasies about the data, almost all of which can be traced directly to the source of that information.  This is a nice data set, but that doesn't mean it's correct.

I have selected Wikipedia as a data source because it is fully traceable to the origin of the data.  All the information is fully documented.  And, importantly, every time I do a spot check, I find the data to be exactly as reported by the individual states.  For example, this evening I see that New York is reporting 109 new deaths for a total of 23,391.  The New York health department shows the same exact values for the day and for the total.  I find the same agreement for Texas.  And Colorado.  And others.  Exact.  And it's not just this time, but every time I've checked.  To the extent I can do this for the CDC data and JHU data, I find that they DO NOT agree with the state data.  Why? I don't know, because they don't provide any information on their data sources.
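
For anyone who wants to repeat the spot check described above, here is a minimal sketch in Python. It assumes you have already copied the two sets of numbers by hand into dictionaries; the function name `spot_check` and the dictionary layout are my own choices for illustration, not something Wikipedia or the states provide. The New York figures are the ones quoted above.

```python
def spot_check(aggregator, state_reports):
    """Return the states whose aggregator figures differ from the state's own report.

    Both arguments map a state name to a (new_deaths, total_deaths) tuple.
    States missing from state_reports are skipped, since they were not checked.
    """
    mismatches = {}
    for state, agg_values in aggregator.items():
        if state not in state_reports:
            continue  # no state figure on hand, so nothing to compare
        if agg_values != state_reports[state]:
            mismatches[state] = (agg_values, state_reports[state])
    return mismatches


# New York figures quoted in this post: 109 new deaths, 23,391 total.
wikipedia = {"New York": (109, 23391)}
new_york_health_dept = {"New York": (109, 23391)}

# An empty dictionary means every checked state agrees exactly.
print(spot_check(wikipedia, new_york_health_dept))
```

The check is deliberately strict: any difference at all, in either the daily or the total figure, is reported as a mismatch.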

Now, if I take Wikipedia as a good and verifiable data set, I get the numbers I display every day.  For today that was 87,372.  As far as I can tell that matches the data provided by the states. Yes, it's lower than many other sources and is lower than what's been touted in the press (Note: the press is not a reliable source of data). But, as the CDC says,  you should trust the states' numbers over the CDC's numbers.  

Why such a difference?  I don't know for sure, because the CDC doesn't provide numbers in a way that I can easily check against the states' data.  I do have some ideas.  It is not uncommon for the states to adjust their data days and sometimes a week or more after the numbers are reported.  I adjust my numbers accordingly.  Sometimes states have double counted deaths for example.  Perhaps the CDC took those numbers, locked them in, and then never went back and fixed them once they were revised.  It seems unlikely that alone could explain the large discrepancy between the numbers I'm using and the CDC.  Your guess is as good as mine.  We'd know if all the institutions practiced good and transparent data management.  But they don't. 
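
To make the "locked in and never fixed" idea concrete, here is a small Python sketch. The numbers are made up purely for illustration and are not real state data, and the function names are hypothetical. It only shows how a tally that ignores later corrections drifts away from one that applies them; it says nothing about what the CDC actually does.

```python
def cumulative_total(daily_counts):
    """Sum a mapping of date -> deaths reported for that date."""
    return sum(daily_counts.values())


def apply_revisions(daily_counts, revisions):
    """Return a copy of daily_counts with later corrections applied.

    revisions maps a date to the corrected count for that date.
    """
    revised = dict(daily_counts)
    revised.update(revisions)
    return revised


# Made-up numbers for illustration only; not real state data.
as_first_reported = {"day 1": 50, "day 2": 60, "day 3": 55}
state_correction = {"day 2": 52}  # e.g. the state removes 8 double-counted deaths

# Tally that never revisits earlier days versus one that applies the correction.
locked_in = cumulative_total(as_first_reported)
kept_current = cumulative_total(apply_revisions(as_first_reported, state_correction))

print(locked_in, kept_current, locked_in - kept_current)
```

In this toy case the gap is exactly the size of the one correction, which is why uncorrected revisions alone seem unlikely to account for a discrepancy of many thousands.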

One of the tenets of science is reproducibility.  Anyone should be able to conduct your experiment or crunch the same numbers and get the same answer.  The only site that allows this is Wikipedia.  So, is my data correct?  I don't know.  I do know that it is reproducible.  I know that anyone can check my data and go straight to the source to verify the values. 

Having said all that, here's the data for today.  It came in while I was typing this up so I'll post it now instead of waiting until tomorrow.









2 comments:

Surfaholic said...

Thanks, Dr.

Even the MSM porn is likely a gross under-estimate. It's becoming more evident that the virus was here before we knew it.

Looking at the Florida CDC data for flu/pneumonia, the death data for the same period year over year shows 170% more cases. I recall that, early in flu season, Florida was reporting a higher than normal flu season. We also had a higher rate of unseasonal fevers being reported at hospitals/clinics. Our health care system is so costly and inefficient that many people were likely getting diagnosed with the flu without a test/screening, based on symptom index only. It's common, esp. in poorer communities. At least that is my hypothesis.

Scot Rafkin said...

It's sad when the best data is likely to come from statistically analyzing excess deaths rather than properly recording them in the first place. This is a multidimensional total failure by the federal government.