SIR-X

This notebook exemplifies how Open-SIR can be used to fit the SIR-X model by Maier and Dirk (2020) to existing data and make predictions. The SIR-X model is a standard generalization of the Susceptible-Infectious-Removed (SIR) model, which includes the influence of exogenous factors such as policy changes, lockdown of the whole population and quarantine of the infectious individuals.

The Open-SIR implementation of the SIR-X model will be validated reproducing the parameter fitting published in the supplementary material of the original article published by Maier and Brockmann (2020). For simplicity, the validation will be performed only for the city of Guangdong, China.

Import modules

[1]:
# Uncomment this cell to activate black code formatter in the notebook
# %load_ext nb_black
[2]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

Data sourcing

We will source data from the repository of the [John Hopkins University COVID-19 dashboard] (https://coronavirus.jhu.edu/map.html) published formally as a correspondence in The Lancet. This time series data contains the number of reported cases \(C(t)\) per day for a number of cities.

[3]:
# Source data from John Hokpins university reposotiry
# jhu_link = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/who_covid_19_situation_reports/who_covid_19_sit_rep_time_series/who_covid_19_sit_rep_time_series.csv"
jhu_link = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
jhu_df = pd.read_csv(jhu_link)
# Explore the dataset
jhu_df.head(10)
[3]:
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20 ... 7/2/20 7/3/20 7/4/20 7/5/20 7/6/20 7/7/20 7/8/20 7/9/20 7/10/20 7/11/20
0 NaN Afghanistan 33.0000 65.0000 0 0 0 0 0 0 ... 32022 32324 32672 32951 33190 33384 33594 33908 34194 34366
1 NaN Albania 41.1533 20.1683 0 0 0 0 0 0 ... 2662 2752 2819 2893 2964 3038 3106 3188 3278 3371
2 NaN Algeria 28.0339 1.6596 0 0 0 0 0 0 ... 14657 15070 15500 15941 16404 16879 17348 17808 18242 18712
3 NaN Andorra 42.5063 1.5218 0 0 0 0 0 0 ... 855 855 855 855 855 855 855 855 855 855
4 NaN Angola -11.2027 17.8739 0 0 0 0 0 0 ... 315 328 346 346 346 386 386 396 458 462
5 NaN Antigua and Barbuda 17.0608 -61.7964 0 0 0 0 0 0 ... 69 68 68 68 70 70 70 73 74 74
6 NaN Argentina -38.4161 -63.6167 0 0 0 0 0 0 ... 69941 72786 75376 77815 80447 83426 87030 90693 94060 97509
7 NaN Armenia 40.0691 45.0382 0 0 0 0 0 0 ... 26658 27320 27900 28606 28936 29285 29820 30346 30903 31392
8 Australian Capital Territory Australia -35.4735 149.0124 0 0 0 0 0 0 ... 108 108 108 108 108 111 112 113 113 113
9 New South Wales Australia -33.8688 151.2093 0 0 0 0 3 4 ... 3211 3405 3419 3429 3433 3440 3453 3467 3474 3478

10 rows × 176 columns

It is observed that the column “Province/States” contains the name of the cities, and since the forth column a time series stamp (or index) is provided to record daily data of reported cases. Additionally, there are many days without recorded data for a number of chinese cities. This won’t be an issue for parameter fitting as Open-SIR doesn’t require uniform spacement of the observed data.

Data preparation

In the following lines, the time series for Guangdong reported cases \(C(t)\) is extracted from the original dataframe. Thereafter, the columns are converted to a pandas date time index in order to perform further data preparation steps.

[4]:
China = jhu_df[jhu_df[jhu_df.columns[1]] == "China"]
city_name = "Guangdong"
city = China[China["Province/State"] == city_name]
city = city.drop(columns=["Province/State", "Country/Region", "Lat", "Long"])
time_index = pd.to_datetime(city.columns)
data = city.values
# Visualize the time
ts = pd.Series(data=city.values[0], index=time_index)

Using the function ts.plot() a quick visualization of the dataset is obtained:

[5]:
ts.plot()
plt.title("Guangdong COVID-19 cases")
plt.ylabel("$C(t)$: Number of reported cases", size=12)
plt.show()
../_images/tutorials_SIR-X_12_0.png

Data cleaning

[6]:
ts_clean = ts.dropna()
# Extract data
ts_fit = ts_clean["2020-01-21":"2020-02-12"]
# Convert index to numeric
ts_num = pd.to_numeric(ts_fit.index)
t0 = ts_num[0]
# Convert datetime to days
t_days = (ts_num - t0) / (10 ** 9 * 86400)
t_days = t_days.astype(int).values
# t_days is an input for SIR
[7]:
# Define the X number
nX = ts_fit.values  # Number of infected
N = 104.3e6  # Population size of Guangdong

Exploration of the dataset

[8]:
ts_fit.plot(style="ro")
plt.xlabel("Number of infected")
plt.show()
../_images/tutorials_SIR-X_17_0.png
[ ]:

Setting up SIR and SIR-X models

The population \(N\) of the city is a necessary input for the model. In this notebook, this was hardocded, but it can be sourced directly from a web source.

Note that whilst the SIR model estimates directly the number of infected people, \(N I(t)\), SIR-X estimates the number of infected people based on the number of tested cases that are in quarantine or in an hospital \(N X(t)\)

[9]:
# These lines are required only if opensir wasn't installed using pip install, or if opensir is running in the pipenv virtual environment
import sys

path_opensir = "../../"
sys.path.append(path_opensir)

# Import SIR and SIRX models
from opensir.models import SIR, SIRX

nX = ts_fit.values  # Number of observed infections of the time series
N = 104.3e6  # Population size of Guangdong
params = [0.95, 0.38]
w0 = (N - nX[0], nX[0], 0)

G_sir = SIR()
G_sir.set_params(p=params, initial_conds=w0)
G_sir.fit_input = 2
G_sir.fit(t_days, nX)
G_sir.solve(t_days[-1], t_days[-1] + 1)
t_SIR = G_sir.fetch()[:, 0]
I_SIR = G_sir.fetch()[:, 2]

Try to fit a SIR model to Guangdong data

[10]:
ax = plt.axes()
ax.tick_params(axis="both", which="major", labelsize=14)
plt.plot(t_SIR, I_SIR)
plt.plot(t_days, nX, "ro")
plt.show()
../_images/tutorials_SIR-X_23_0.png

The SIR model is clearly not appropriate to fit this data, as it cannot resolve the effect of exogeneous containment efforts such as quarantines or lockdown. We will repeat the process with a SIR-X model.

Fit SIR-X to Guangdong Data

[11]:
g_sirx = SIRX()
params = [6.2 / 8, 1 / 8, 0.05, 0.05, 5]
# X_0 can be directly ontained from the statistics
n_x0 = nX[0]  # Number of people tested positive
n_i0 = nX[0]

w0 = (N - n_x0 - n_i0, n_i0, 0, n_x0)
g_sirx.set_params(p=params, initial_conds=w0)
# Fit all parameters
fit_index = [False, False, True, True, True]
g_sirx.fit(t_days, nX, fit_index=fit_index)
g_sirx.solve(t_days[-1], t_days[-1] + 1)
t_sirx = g_sirx.fetch()[:, 0]
inf_sirx = g_sirx.fetch()[:, 4]
[12]:
plt.figure(figsize=[6, 6])
ax = plt.axes()
plt.plot(t_sirx, inf_sirx, "b-", linewidth=2)
plt.plot(t_SIR, I_SIR, "g-", linewidth=2)
plt.plot(t_days, nX, "ro")
plt.legend(
    ["SIR-X model fit", "SIR model fit", "Number of reported cases"], fontsize=13
)
plt.title("SARS-CoV-2 evolution in Guangdong, China", size=15)
plt.xlabel("Days", fontsize=14)
plt.ylabel("COVID-19 confirmed cases", fontsize=14)
ax.tick_params(axis="both", which="major", labelsize=14)
plt.show()
../_images/tutorials_SIR-X_27_0.png

After fitting the parameters, the effective infectious period \(T_{I,eff}\) and the effective reproduction rate \(R_{0,eff}\) can be obtained from the model properties

\[T_{I,eff} = (\beta + \kappa + \kappa_0)^{-1}\]
\[R_{0,eff} = \alpha T_{I,eff}\]

Aditionally, the Public containment leverage \(P\) and the quarantine probability \(Q\) can be calculated through:

\[P = \frac{\kappa_0}{\kappa_0 + \kappa}\]
\[Q = \frac{\kappa_0 + \kappa}{\beta + \kappa_0 + \kappa}\]
[13]:
print("Effective infectious period T_I_eff =  %.2f days " % g_sirx.t_inf_eff)
print(
    "Effective reproduction rate R_0_eff =  %.2f, Maier and Brockmann = %.2f"
    % (g_sirx.r0_eff, 3.02)
)
print(
    "Public containment leverage =  %.2f, Maier and Brockmann = %.2f"
    % (g_sirx.pcl, 0.75)
)
print(
    "Quarantine probability =  %.2f, Maier and Brockmann = %.2f" % (g_sirx.q_prob, 0.51)
)
Effective infectious period T_I_eff =  3.89 days
Effective reproduction rate R_0_eff =  3.01, Maier and Brockmann = 3.02
Public containment leverage =  0.80, Maier and Brockmann = 0.75
Quarantine probability =  0.51, Maier and Brockmann = 0.51

Make predictions using model.predict

[14]:
# Make predictions and visualize
# Obtain the results 14 days after the train data ends
sirx_pred = g_sirx.predict(14)
print("T n_S \t   n_I \tn_R \tn_X")
for i in sirx_pred:
    print(*i.astype(int))
T n_S      n_I  n_R     n_X
0 11450984 227 92847569 1219
1 10307571 190 93990991 1246
2 9278326 158 95020245 1269
3 8351856 130 95946723 1288
4 7517894 107 96780694 1304
5 6767209 87 97531385 1317
6 6091483 70 98207117 1327
7 5483230 57 98815376 1336
8 4935707 45 99362903 1342
9 4442871 36 99855743 1348
10 3999251 29 100299366 1352
11 3599916 23 100698704 1356
12 3240456 18 101058166 1358
13 2916888 14 101381735 1361
14 2625630 11 101672995 1362

Prepare date time index to plot predictions

[15]:
# Import datetime module from the standard library
import datetime

# Obtain the last day from the data used to train the model
last_time = ts_fit.index[-1]
# Create a date time range based on the number of rows of the prediction
numdays = sirx_pred.shape[0]
day_zero = datetime.datetime(last_time.year, last_time.month, last_time.day)
date_list = [day_zero + datetime.timedelta(days=x) for x in range(numdays)]

Plot predictions

[16]:
# Extract figure and axes
fig, ax = plt.subplots(figsize=[5, 5])
# Create core plot attributes
plt.plot(date_list, sirx_pred[:, 4], color="blue", linewidth=2)
plt.title("Prediction of Guangdong Cases", size=14)
plt.ylabel("Number of infected", size=14)
# Remove trailing space
plt.xlim(date_list[0], date_list[-1])
# Limit the amount of data displayed
ax.xaxis.set_major_locator(plt.MaxNLocator(3))
# Increase the size of the ticks
ax.tick_params(labelsize=12)
plt.show()
../_images/tutorials_SIR-X_35_0.png

Calculation of predictive confidence intervals

The confidence intervals on the predictions of the SIR-X model can be calculated using a block cross validation. This technique is widely used in Time Series Analysis. In the open-sir API, the function model.ci_block_cv calculates the average mean squared error of the predictions, a list of the rolling mean squared errors and the list of parameters which shows how much each parameter changes taking different number of days for making predictions.

The three first parameters are the same as the fit function, while the last two parameters are the lags and the min_sample. The lags parameter indicates how many periods in the future will be forecasted in order to calculate the mean squared error of the model prediction. The min_sample parameter indicates the initial number of observations and days that will be taken to perform the block cross validation.

In the following example, model.ci_block_cv is used to estimate the average mean squared error of 1-day predictions taking 6 observations as the starting point of the cross validation. For Guangdong, a min_sample=6 higher than the default 3 is required to handle well the missing data. This way, both the data on the four first days, and two days after the data starts again, are considered for cross validation.

[17]:
# Calculate confidence intervals
mse_avg, mse_list, p_list, pred_data = g_sirx.block_cv(lags=1, min_sample=6)

If it is assumed that the residuals distribute normally, then a good estimation of a 95% confidence interval on the one-day prediction of the number of confirmed cases is

\[\sigma \sim \mathrm{MSE} \rightarrow n_{X,{t+1}} \sim \hat{n}_{X,{t+1}} \pm 2 \sigma\]

Where \(n_{X,{t+1}}\) is the real number of confirmed cases in the next day, and \(\hat{n}_{X,{t+1}}\) is the estimation using the SIR-X model using cross validation. We can use the PredictionResults instance pred_data functionality to explore the mean-squared errors and the predictions confidence intervals:

[18]:
pred_data.print_mse()
Average MSE for 0-day predictions = 18.16, MSE sample size = 16
Average MSE for 1-day predictions = 28.81, MSE sample size = 15
Average MSE for 2-day predictions = 40.23, MSE sample size = 14
Average MSE for 3-day predictions = 54.97, MSE sample size = 13
Average MSE for 4-day predictions = 73.92, MSE sample size = 12
Average MSE for 5-day predictions = 95.56, MSE sample size = 11
Average MSE for 6-day predictions = 119.34, MSE sample size = 10
Average MSE for 7-day predictions = 145.10, MSE sample size = 9
Average MSE for 8-day predictions = 172.19, MSE sample size = 8
Average MSE for 9-day predictions = 198.89, MSE sample size = 7
Average MSE for 10-day predictions = 223.80, MSE sample size = 6
Average MSE for 11-day predictions = 245.12, MSE sample size = 5
Average MSE for 12-day predictions = 283.18, MSE sample size = 4
Average MSE for 13-day predictions = 266.35, MSE sample size = 3
Average MSE for 14-day predictions = 92.48, MSE sample size = 2
Average MSE for 15-day predictions = 196.84, MSE sample size = 1

The predictive accuracy of the model is quite impressive, even for 9-day predictions. Let’s take advantage of the relatively low mean squared error to forecast a 10 days horizon with confidence intervals using pred_data.plot_predictions(n_days=9)

[19]:
pred_data.plot_pred_ci(n_days=9)
../_images/tutorials_SIR-X_41_0.png

If it is assumed that the residuals distribute normally, then a good estimation of a 95% confidence interval on the one-day prediction of the number of confirmed cases is

\[\sigma \sim \mathrm{MSE} \rightarrow n_{X,{t+1}} \sim \hat{n}_{X,{t+1}} \pm 2 \sigma\]

Where \(n_{X,{t+1}}\) is the real number of confirmed cases in the next day, and \(\hat{n}_{X,{t+1}}\) is the estimation using the SIR-X model using cross validation. We use solve to make a 1-day prediction and append the 95% confidence interval.

[20]:
# Predict
g_sirx.solve(t_days[-1] + 1, t_days[-1] + 2)
n_X_tplusone = g_sirx.fetch()[-1, 4]
print("Estimation of n_X_{t+1} = %.0f +- %.0f " % (n_X_tplusone, 2 * mse_avg[0]))
Estimation of n_X_{t+1} = 1268 +- 36
[21]:
# Transform parameter list into a DataFrame
par_block_cv = pd.DataFrame(p_list)
# Rename dataframe columns based on SIR-X parameter names
par_block_cv.columns = g_sirx.PARAMS
# Add the day. Note that we take the days from min_sample until the end of the array, as days
# 0,1,2 are used for the first sampling in the block cross-validation
par_block_cv["Day"] = t_days[5:]
# Explore formatted dataframe for parametric analysis
par_block_cv.head(len(p_list))
[21]:
alpha beta kappa_0 kappa inf_over_test Day
0 0.775 0.125 0.115899 1.020877e-10 2.705780 5
1 0.775 0.125 0.109540 4.023231e-13 2.774667 6
2 0.775 0.125 0.068467 1.099828e-01 1.818714 7
3 0.775 0.125 0.082939 6.627930e-02 2.100077 8
4 0.775 0.125 0.104995 6.901938e-11 2.822296 9
5 0.775 0.125 0.090598 4.458775e-02 2.285169 10
6 0.775 0.125 0.092793 3.733649e-02 2.355333 11
7 0.775 0.125 0.104239 1.903129e-09 2.824648 12
8 0.775 0.125 0.105289 3.024668e-09 2.843301 13
9 0.775 0.125 0.106338 7.624121e-11 2.866025 14
10 0.775 0.125 0.107283 1.264505e-10 2.889912 15
11 0.775 0.125 0.108250 3.590963e-14 2.917548 16
12 0.775 0.125 0.108956 1.935729e-15 2.939909 17
13 0.775 0.125 0.110043 8.197801e-11 2.977366 18
14 0.775 0.125 0.109884 4.608366e-03 2.954701 19
15 0.775 0.125 0.106441 2.069015e-02 2.797521 20
16 0.775 0.125 0.104914 2.805141e-02 2.739572 21
[22]:
plt.figure(figsize=[5, 5])
ax = plt.axes()
ax.tick_params(axis="both", which="major", labelsize=14)
plt.plot(mse_list[0], "ro")
plt.xlabel("Number of days used to predict the next day", size=14)
plt.ylabel("MSE", size=14)
plt.show()
../_images/tutorials_SIR-X_45_0.png

There is an outlier on day 1, as this is when the missing date starts. A more reliable approach would be to take the last 8 values of the mean squared error to calculate a new average assuming that there will be no more missing data.

Variation of fitted parameters

Finally, it is possible to observe how the model parameters change as more days and number of confirmed cases are introduced in the block cross validation.

It is clear to observe that after day 15 all parameters except kappa begin to converge. Therefore, care must be taken when performing inference over the parameter kappa.

Long term prediction

Now we can use the model to predict when the peak will occur and what will be the maximum number of infected

[23]:
# Predict
plt.figure(figsize=[6, 6])
ax = plt.axes()
ax.tick_params(axis="both", which="major", labelsize=14)
g_sirx.solve(40, 41)
# Plot
plt.plot(g_sirx.fetch()[:, 4], "b-", linewidth=2)  # X(t)
plt.plot(g_sirx.fetch()[:, 2], "b--", linewidth=2)  # I(t)
plt.xlabel("Day", size=14)
plt.ylabel("Number of people", size=14)
plt.legend(["X(t): Confirmed", "I(t) = Infected"], fontsize=13)
plt.title(city_name)
plt.show()
../_images/tutorials_SIR-X_50_0.png

The model was trained with a limited amount of data. It is clear to observe that since the measures took place in Guangdong, at least 6 weeks of quarantine were necessary to control the pandemics. Note that a limitation of this model is that it predicts an equilibrium where the number of infected, denoted by the yellow line in the figure above, is 0 after a short time. In reality, this amount will decrease to a small number.

After the peak of infections is reached, it is necessary to keep the quarantine and effective contact tracing for at least 30 days more.

Validate long term plot using model.plot()

The function model.plot() offers a handy way to visualize model fitting and predictions. Custom visualizations can be validated against the model.plot() function.

[24]:
g_sirx.plot()
../_images/tutorials_SIR-X_53_0.png