Open-Source COVID-19 ML Datasets; Models; Tools for Data Scientists

Sharmistha Chatterjee (@sharmi1206)


As Covid-19 has impacted almost all countries, data scientists and researchers have built a range of predictive models to forecast the spread of the disease, so that governments can devise suitable plans and policies to curb it.

In this context, it is essential for data scientists and analysts to know the most popular and useful models that have come out of this research, and to get familiar with the datasets available on the internet.

In this blog, I'll discuss the essential parameters that machine learning models use for Covid-19 disease spread, the different types of models, and the tools and datasets available.

Some of the project initiatives supported by Google are:

  • Monitoring and forecasting disease spread
  • Improving health equity and minimizing secondary effects of the pandemic
  • Supporting healthcare workers
  • Slowing transmission by advancing the science of contact tracing, environmental sensing
  • Devising effective vaccination plans.

Researchers, data scientists, and engineers have all come forward to apply existing AI/ML algorithms or to develop new ones through research, experimentation, and trials. The figures below show the percentage contribution of different algorithms to various Covid-19 predictions.



Age-based Mortality Models

This type of ML model is trained using age-stratified generalized linear models (GLMs) with component-wise gradient boosting. It predicts the probability of death from information available about patients before they contracted the virus. Stratifying the overall model by age group reduces the variability due to age and helps identify the risk factors specific to each age group.

In the overall model, 18 features were identified in at least 20% of the models (2 of 10) as being associated with increased mortality risk. The researchers used odds ratios (ORs) with interquartile ranges (IQRs) to compare the relative importance of the variables for predicting mortality. Of these features, age had the most prominent association with mortality (median OR: 2.82, IQR: 0.03).
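To make the odds-ratio summary concrete, here is a minimal sketch of how a median OR and its IQR could be derived from one feature's coefficient across several fitted logistic GLMs. It assumes each model reports a log-odds coefficient; the function name and the example coefficients are hypothetical and are not taken from the study.

```python
import math
import statistics


def odds_ratio_summary(log_odds_coefficients):
    """Summarize one feature's odds ratio across several fitted models.

    Each coefficient is assumed to be a log-odds value from a logistic GLM,
    so exp(coefficient) is the odds ratio reported by that model.
    """
    odds_ratios = [math.exp(b) for b in log_odds_coefficients]
    # Quartiles of the odds ratios; the IQR is the distance between Q1 and Q3.
    q1, _, q3 = statistics.quantiles(odds_ratios, n=4, method="inclusive")
    return statistics.median(odds_ratios), q3 - q1


# Hypothetical coefficients for the same feature in 10 models (illustrative only).
example_coefficients = [1.02, 1.03, 1.04, 1.03, 1.05, 1.04, 1.03, 1.02, 1.04, 1.03]
median_or, iqr = odds_ratio_summary(example_coefficients)
print(f"median OR: {median_or:.2f}, IQR: {iqr:.2f}")
```

A small IQR, as reported for age, means the models agree closely on the size of the association.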



Google Data Studio

Google has built a dashboard that gives an ML model-based forecast of the development of COVID-19 in each US state and county, to help responders in healthcare, the public sector, and other impacted organizations prepare for uncertainty. The data can be accessed directly from BigQuery or downloaded as CSV (state forecasts, county forecasts). The COVID Tracking Project supplies historical values for hospital, ICU, and ventilator usage. The Johns Hopkins Coronavirus Resource Center supplies historical data for confirmed cases and deaths, while vaccine distribution data are taken from Govex.



The figures below illustrate different predictions produced using BigQuery with data from the sources listed above.



For detailed monthly predictions for a given US state, explore the data in the BigQuery console.

An example filter condition for a sample query (FIPS code 48 is Texas):

state_fips_code = "48" AND prediction_date >= forecast_date
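As a rough illustration of how that filter fits into a complete query, the sketch below assembles a SQL string in Python. The table name is a made-up placeholder, since the text does not name the table, and the helper function is hypothetical; only the column names come from the filter shown above.

```python
import re


def build_state_forecast_query(table, state_fips_code):
    """Return a SQL string selecting forecast rows for one state.

    `table` is a placeholder for the fully qualified forecast table name.
    """
    # Accept only two-digit FIPS codes so nothing unexpected reaches the SQL text.
    if not re.fullmatch(r"\d{2}", state_fips_code):
        raise ValueError("state_fips_code must be a two-digit string such as '48'")
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f'WHERE state_fips_code = "{state_fips_code}"\n'
        "  AND prediction_date >= forecast_date\n"
        "ORDER BY prediction_date"
    )


# "your-project.your_dataset.state_forecasts" is a made-up placeholder name.
print(build_state_forecast_query("your-project.your_dataset.state_forecasts", "48"))
```

The resulting string can be pasted into the BigQuery console once the placeholder is replaced with the real table name.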

Covid19 datasets and models

Regression-based Time-Series Modeling with Covid19

This is a forecasting approach for COVID-19 case prediction that relies on Graph Neural Networks and mobility data. The approach uses a single large-scale spatio-temporal graph with the following structure:

  • Spatial domain – Edges represent direct location-to-location movement and are weighted by mobility flows, relative to the amount of flow internal to the location.
  • Temporal domain – Edges represent a binary connection to past days.
  • Each node contains features for the state, county, day, past cases, and past deaths.

The most important advantage of spatio-temporal graphs for COVID-19 prediction is that they do not make assumptions about the underlying disease dynamics and can learn from a variety of data, including inter-region interaction and region-level features.


COVID-19 Forecasting using Spatio-Temporal Graph Neural Networks

The above COVID-19 graph shows spatial and temporal edges (highlighted in red) across three days. Each slice represents spatial connections between counties, while the connections between slices represent temporal relationships. Every node in the graph has direct temporal edges to nodes in d previous days.
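The graph structure described here can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' code: nodes are (location, day) pairs, and dividing each flow by the destination's internal flow is an assumption about how the spatial weights are normalized. The function name and example flows are hypothetical.

```python
def build_spatiotemporal_edges(locations, num_days, flows, d):
    """Build spatial and temporal edge lists for a (location, day) graph.

    flows[(src, dst)] is the mobility flow between two locations, and
    flows[(loc, loc)] is the flow internal to a location.
    """
    spatial, temporal = [], []
    for day in range(num_days):
        # Spatial edges connect different locations within the same day.
        for (src, dst), flow in flows.items():
            if src == dst:
                continue
            internal = flows.get((dst, dst), 0.0)
            # Assumed normalization: flow relative to the destination's internal flow.
            weight = flow / internal if internal > 0 else 0.0
            spatial.append(((src, day), (dst, day), weight))
        # Temporal edges are binary links from each of the d previous days.
        for loc in locations:
            for lag in range(1, d + 1):
                if day - lag >= 0:
                    temporal.append(((loc, day - lag), (loc, day), 1))
    return spatial, temporal


# Made-up flows between two counties, for illustration only.
example_flows = {("A", "A"): 50.0, ("B", "B"): 100.0, ("A", "B"): 10.0, ("B", "A"): 5.0}
spatial_edges, temporal_edges = build_spatiotemporal_edges(["A", "B"], 3, example_flows, d=2)
print(len(spatial_edges), "spatial edges,", len(temporal_edges), "temporal edges")
```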


COVID-19 Forecasting using Spatio-Temporal Graph Neural Networks

The above figure represents the 2-hop Skip-Connection model. Multiple layers of spatial aggregations are used on temporal embedding vectors. At each layer, the embedding of the seed node (represented in blue) is concatenated and propagated up to the next embedding layer. The final embedding is passed through an MLP and used to predict P
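The skip-connection idea can be illustrated without any deep learning library. The sketch below is a simplification under stated assumptions: it uses a weighted mean over neighbors as the spatial aggregation and has no learned weights or final MLP, both of which the real model would include. All names are hypothetical.

```python
def aggregate_neighbors(adjacency, embeddings):
    """Weighted mean of neighbor embeddings for every node (assumed aggregator)."""
    dim = len(embeddings[0])
    aggregated = []
    for row in adjacency:
        total = sum(row)
        if total == 0:
            aggregated.append([0.0] * dim)
            continue
        aggregated.append(
            [sum(w * embeddings[j][k] for j, w in enumerate(row)) / total for k in range(dim)]
        )
    return aggregated


def skip_connection_embedding(seed_embeddings, adjacency, num_hops=2):
    """Stack spatial aggregations, re-attaching each node's own seed embedding per layer."""
    hidden = seed_embeddings
    for _ in range(num_hops):
        neighbors = aggregate_neighbors(adjacency, hidden)
        # Skip connection: concatenate the seed embedding with the aggregated neighbors.
        hidden = [seed + nb for seed, nb in zip(seed_embeddings, neighbors)]
    return hidden


# Two nodes connected to each other, each with a 1-dimensional temporal embedding.
print(skip_connection_embedding([[1.0], [3.0]], [[0, 1], [1, 0]]))
```

Because the seed embedding is concatenated at every layer, each node keeps direct access to its own history no matter how many hops of neighbor information are mixed in.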

Fairness on Covid-19 Datasets

As Google is committed to Responsible AI principles, it has come forward to study the disproportionate impact that the disease has had in the United States. As a pioneer of fairness in AI, its AI team can apply the principle “Avoid creating or reinforcing unfair bias” when studying the actual impact of the disease.

CDC research has shown that communities of color in the United States have been the hardest hit by COVID-19, with disproportionately high rates of cases and deaths. The causes are related to structural racism, systemic inequities in access to healthcare, inherent systemic bias, and adverse social determinants of health.

The figure below illustrates:

  • The absolute error in predicting Covid-19 deaths is significantly higher for counties with a higher proportion of younger and middle-aged people.
  • The demographic sections of these counties comprising younger and middle-aged groups have a higher proportion of COVID-19 case counts.
  • Further, after the absolute errors are normalized by actual death counts, there is less difference between the confidence intervals across the demographic groups.


For median income, the analysis was done by bucketing (binning) county populations according to their income. The results are shown in the bottom figure, which shows higher absolute errors for higher-income counties.


Similarly, for race and ethnicity, the figure shows a direct correlation between the absolute errors and death counts. This correlation is meaningfully reduced when the error is normalized by the death count, causing the confidence intervals to overlap.
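The difference between absolute and normalized error is easy to see in a small example. The following sketch groups county-level records by a demographic bucket and reports both metrics; the record layout, function name, and numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def errors_by_group(records):
    """Compute mean absolute and mean normalized error per group.

    Each record is (group, predicted_deaths, actual_deaths).
    """
    absolute = defaultdict(list)
    normalized = defaultdict(list)
    for group, predicted, actual in records:
        error = abs(predicted - actual)
        absolute[group].append(error)
        # Normalizing by the actual count is only defined when deaths were recorded.
        if actual > 0:
            normalized[group].append(error / actual)
    return {
        group: {
            "mean_absolute_error": mean(absolute[group]),
            "mean_normalized_error": mean(normalized[group]) if normalized[group] else None,
        }
        for group in absolute
    }


# Made-up records: one bucket has ten times the deaths of the other.
example_records = [("high", 110, 100), ("high", 220, 200), ("low", 11, 10), ("low", 22, 20)]
for group, metrics in errors_by_group(example_records).items():
    print(group, metrics)
```

In this toy data the "high" bucket has a much larger absolute error, yet both buckets have the same normalized error, which mirrors the pattern described above.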



Interpretable Sequence Learning with Covid19 datasets

This ML framework models how different compartments (composed of different direct and indirect factors that affect prediction coefficients) evolve. It uses interpretable encoders to incorporate covariates and improve model performance. The performance of the model has been further analyzed for different subgroups based on the subgroup distributions within the counties.



The model is based on an extension to the standard SEIR (susceptible-exposed-infectious-removed) model that includes additional compartments for undocumented cases and hospital resource usage. The end-to-end modeling framework can infer meaningful estimates for undocumented cases even if there is no direct supervision for them.

The model takes into account disease dynamics that vary over time – e.g. as mobility reduces, the spreading decays. Further, the framework has improved generalization while learning from limited training data, using:

  • Masked supervision from partial observations,
  • Partial teacher-forcing to minimize error propagation,
  • Regularization, and
  • Cross-location information-sharing
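Of the techniques listed above, masked supervision is the simplest to show in code. The sketch below computes a mean squared error only over the time steps where an observation exists, so missing values contribute nothing to the loss. Using None to mark a missing observation is an assumption made for this example, and the function name is hypothetical.

```python
def masked_mse(predictions, observations):
    """Mean squared error over observed entries only.

    Entries where the observation is None are masked out, so a compartment
    that is only partially observed still provides a training signal.
    """
    pairs = [(p, o) for p, o in zip(predictions, observations) if o is not None]
    if not pairs:
        return 0.0
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)


# The second observation is missing, so only the first and third steps count.
print(masked_mse([1.0, 2.0, 3.0], [1.0, None, 5.0]))
```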

The most important assumptions introduced for the model are:

  • Introduction of compartments for undocumented infected and recovered cases
  • Introduction of hospitalized, ICU, and ventilator compartments
  • Partial immunity
  • No death from undocumented infected cases
  • Invariant population
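A heavily simplified, discrete-time sketch of such a compartmental model is shown below. It keeps separate documented and undocumented compartments, allows deaths only from documented cases, and conserves the total population. Hospital, ICU, and ventilator compartments and partial immunity are left out for brevity, and all parameter names and values are hypothetical rather than taken from the paper.

```python
def seir_step(state, params):
    """Advance the compartments by one day."""
    s, e = state["S"], state["E"]
    i_doc, i_undoc = state["I_doc"], state["I_undoc"]
    population = sum(state.values())  # stays constant (invariant population)

    new_exposed = (params["beta_doc"] * i_doc + params["beta_undoc"] * i_undoc) * s / population
    new_infectious = params["sigma"] * e
    new_doc = params["doc_fraction"] * new_infectious
    new_undoc = new_infectious - new_doc
    recovered_doc = params["gamma_doc"] * i_doc
    recovered_undoc = params["gamma_undoc"] * i_undoc
    deaths = params["mu"] * i_doc  # no deaths from undocumented infected cases

    return {
        "S": s - new_exposed,
        "E": e + new_exposed - new_infectious,
        "I_doc": i_doc + new_doc - recovered_doc - deaths,
        "I_undoc": i_undoc + new_undoc - recovered_undoc,
        "R_doc": state["R_doc"] + recovered_doc,
        "R_undoc": state["R_undoc"] + recovered_undoc,
        "D": state["D"] + deaths,
    }


# Illustrative parameters and initial state only.
example_params = {"beta_doc": 0.2, "beta_undoc": 0.4, "sigma": 0.2, "doc_fraction": 0.3,
                  "gamma_doc": 0.1, "gamma_undoc": 0.1, "mu": 0.01}
current = {"S": 990.0, "E": 5.0, "I_doc": 3.0, "I_undoc": 2.0, "R_doc": 0.0, "R_undoc": 0.0, "D": 0.0}
for _ in range(30):
    current = seir_step(current, example_params)
print({name: round(value, 1) for name, value in current.items()})
```

In the framework described here the rates are learned from covariates and vary over time; in this sketch they are fixed constants.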


Explainable AI for Covid19

The most important characteristic of Interpretable Sequence-Learning is that it models the compartments explicitly, which provides an understanding of disease evolution. The figure below demonstrates how the fitted curves can be used to infer important insights, such as where the peak occurs or the current decay trends.



The ratio of undocumented to documented infected cases at different phases is computed, and the amount of increase or decrease in each compartment is analyzed. For intervention covariates, the largest weights (with significant changes in disease spread) are observed after a lag of a few days, suggesting that the interventions take effect after some delay. The positive weights of the mobility index and the negative weights of public interventions are also clearly observed.



The above figure shows the learned weights of the time-varying covariates for β (average contacts of documented and undocumented infected) for 7-day state-level forecasting models, for three weeks from 24th May 2020 to 7th June 2020. The mobility index consistently has a highly positive impact on β, while gathering bans, school closures, and shelter-in-place interventions have highly negative effects. In addition, the weight magnitude of the interventions gets larger after a lag of a few days.
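To show how signed covariate weights translate into a contact rate, here is a minimal sketch that maps a weighted sum of covariates to β through a sigmoid. The sigmoid form, the bound beta_max, and all weights are assumptions chosen for illustration; the text does not specify the exact encoder.

```python
import math


def contact_rate(covariates, weights, bias=0.0, beta_max=1.0):
    """Map time-varying covariates to a bounded contact rate beta.

    Positive weights push beta up and negative weights push it down.
    """
    z = bias + sum(weights[name] * value for name, value in covariates.items())
    return beta_max / (1.0 + math.exp(-z))


# Hypothetical weights: mobility raises beta, school closures lower it.
example_weights = {"mobility_index": 1.5, "school_closure": -1.0}
open_schools = contact_rate({"mobility_index": 0.8, "school_closure": 0.0}, example_weights)
closed_schools = contact_rate({"mobility_index": 0.8, "school_closure": 1.0}, example_weights)
print(f"beta with schools open: {open_schools:.3f}, closed: {closed_schools:.3f}")
```

Because each covariate has its own weight, the sign and size of that weight can be read directly, which is what makes this kind of encoder interpretable.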

Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand

The exponential rise in COVID-19 cases and deaths has forced governments in different countries to introduce interventions early. However, this poses the risk of transmission returning once the interventions are lifted (if insufficient herd immunity has developed).

It therefore became important for researchers to model the impact of different measures, the time period over which the interventions need to be maintained, and their effect on the critical care beds occupied per 100,000 of the population.

The figures below illustrate the impact of different measures on the number of critical care beds occupied.


Source – Mitigation strategy scenarios for GB showing critical care (ICU) bed requirements. The black line shows the unmitigated epidemic. The green line shows a mitigation strategy incorporating closure of schools and universities; the orange line shows case isolation; the yellow line shows case isolation and household quarantine; and the blue line shows case isolation, home quarantine, and social distancing of those aged over 70. The blue shading shows the 3-month period in which these interventions are assumed to remain in place.

AI/ML Model-based COVID-19 vaccine prioritization

With the ongoing Covid-19 vaccinations and high demand for the limited supplies of vaccine, researchers have come forward to build ML models that enable prioritization of vaccine distribution. The objective of this approach is to:

(i) directly vaccinate those at the highest risk for severe outcomes (risk of death, persons over 60 years of age, and those with comorbidities), and

(ii) protect them indirectly by vaccinating those who do the most transmitting.

This method involves building a mathematical SEIR model (susceptible, exposed, infectious, recovered) to compare five age-stratified prioritization strategies. The prioritization strategies remain consistent across countries, transmission rates, vaccination rollout speeds, and estimates of naturally acquired immunity. In addition, this ML-based framework allows comparing the impacts of prioritization strategies across contexts.



Figure A shows age-dependent vaccine efficacy, which decreases from a 90% baseline to 50% among individuals aged 80+ years, with the decline beginning at age 60. Figures B and C show the percent reduction in deaths, in comparison with an unmitigated outbreak, for transmission-blocking all-or-nothing vaccines with either constant 90% efficacy for all age groups (solid lines) or age-dependent efficacy.
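The age-dependent efficacy curve can be written as a small function. The sketch below assumes the decline between age 60 and age 80 is linear, which the caption does not state, and models an all-or-nothing vaccine as fully protecting a fraction of vaccinated people equal to the efficacy. Function names are hypothetical.

```python
def age_dependent_efficacy(age, baseline=0.9, floor=0.5, start_age=60, end_age=80):
    """Vaccine efficacy by age: baseline up to start_age, floor from end_age onward."""
    if age <= start_age:
        return baseline
    if age >= end_age:
        return floor
    # Assumed linear decline between start_age and end_age.
    return baseline - (baseline - floor) * (age - start_age) / (end_age - start_age)


def expected_protected(num_vaccinated, age):
    """All-or-nothing vaccine: a fraction equal to the efficacy is fully protected."""
    return num_vaccinated * age_dependent_efficacy(age)


for example_age in (40, 70, 85):
    print(example_age, round(age_dependent_efficacy(example_age), 2))
```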


Different research organizations and universities have come up with innovations to predict the spread of Covid-19 and to identify effective measures that could prevent it. As hospitals worldwide are faced with finite resources, new ML models are also being developed to help allocate therapies and equipment to those most at risk, maximizing survival. This will help clinicians predict which currently uninfected individuals might derive the greatest benefit from vaccination.

To increase the accuracy of forecasts, the COVID-19 Forecast Hub has been developed. It aggregates forecasts from over 30 models and sends them to the CDC each week to help inform public health decision making. The hub works in collaboration with the US CDC: it takes in data and builds a single ensemble forecast by assembling the forecasts of the trajectory of the COVID-19 pandemic that different modeling teams submit to the forecast data repository.
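As a simple illustration of ensembling, the sketch below combines point forecasts from several models by taking the median at each forecast horizon. The median is used here as one common way to combine forecasts; the text does not say how the hub builds its ensemble, and the model names and numbers are made up.

```python
import statistics


def ensemble_forecast(model_forecasts):
    """Combine forecasts by taking the median across models at each horizon.

    model_forecasts maps a model name to a list of predictions, one per
    horizon; all lists are assumed to have the same length.
    """
    per_horizon = zip(*model_forecasts.values())
    return [statistics.median(values) for values in per_horizon]


# Made-up weekly death forecasts from three hypothetical models.
example_forecasts = {
    "model_a": [100, 200],
    "model_b": [120, 180],
    "model_c": [110, 260],
}
print(ensemble_forecast(example_forecasts))
```

A median is less sensitive than a mean to a single model that produces an extreme forecast.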


