Submission. Knit your Rmd file into HTML and submit it via Brightspace. Include a link to your git repository (GitHub or git.cs.dal.ca) as a comment on your submission.
library(tidyverse) # dplyr, ggplot2, tidyr, purrr, ...
library(tidymodels) # rsample, recipes, parsnip, workflows, tune, yardstick, ...
library(ranger) # random forest engine
library(xgboost) # gradient-boosted trees engine
library(glmnet) # penalised logistic regression engine
library(here)
library(fs)
theme_set(theme_minimal(base_size = 11) +
theme(panel.grid.minor = element_blank()))Sepsis is life-threatening organ dysfunction caused by a dysregulated immune response to infection. It accounts for roughly 20 % of global deaths, and each hour of delayed treatment increases mortality by 4–8 %.
The PhysioNet/CinC
Challenge 2019 dataset contains hourly ICU records for ~40 000
patients from two US teaching hospitals. Each record (row) has 8 vital
signs, 26 laboratory measurements, 6 demographic fields, and a binary
SepsisLabel (set to 1 up to 6 hours before
clinical onset so that early-warning models can be trained). We simplify
the temporal problem to a static one by taking one random hourly record
per patient, giving a standard tabular classification task with
realistic clinical challenges: class imbalance, high missingness, and
mixed data types.
Q1. (2 marks) Look up the PhysioNet variable descriptions and include an explanation of the dataset into this notebook (one row per variable: name, description, units).
Q2. (2 marks) Then choose three variables you hypothesise might be strong predictors of sepsis and write one sentence each explaining your reasoning. There is no right or wrong answer to this, but you should think about your and problem before diving into the EDA and prediction! (Hint: look up how sepsis is clinically diagnosed.)
To avoid inadvertent data leakage from your test set, we are going to perform our test-train split before even our EDA and cleaning related analyses. This is not always done but is a good practice if you aren’t using an external/independent test dataset.
I’ve already done the random sampling of the dataset to remove the time-series elements but the original dataset can be acquired directly from PhysioNet/CinC Challenge 2019
data_dir <- here("data")
cache_dir <- fs::path(data_dir, "cache")
fs::dir_create(cache_dir)
data_url <- "http://maguire-lab.github.io/health_data_science_research_2026/static_files/practicals/lab1_data/physionet_subsample_modified_1.0.0.rds"
rds_path <- fs::path(cache_dir, "physionet_subsample_1.0.0.rds")Rmd lets you cache expensive or slow operations (like downloading big files or complicated processing) so they only re-run if the code and variables have changed. This can be very useful although can sometimes cause some inadvertent issues that reduce reproducibility! You should also do a completely clean re-run of everything once your analyses are finalised if possible.
We split the data before any exploration so that analysis choices cannot be influenced by the test set - this is a key discipline in applied machine learning. Stratified splitting preserves the sepsis rate in both halves.
# will make later things easier to do!s
sepsis <- sepsis |>
mutate(SepsisLabel = factor(SepsisLabel,
levels = c(1, 0),
labels = c("Sepsis", "No Sepsis")))
# splitting
split <- initial_split(sepsis, prop = 0.75, strata = SepsisLabel)
train <- training(split)
test <- testing(split)
cat("Training rows:", nrow(train), "\n")## Training rows: 2250
## Test rows: 750
Q3. (1 mark) What percentage of our dataset is being assigned to test?
Again, do your EDA on train only - we do not look at
test until Section 6.
Q4. (2 marks). Uncomment and csomplete the code
below to check the class balance in the training set.
count() tallies rows per group; add a prop
column to show each class as a fraction of the total.
Q5. (2 mark) Using your solution to Q4, complete this plotting code to create a barplot of the relative sepsis label balance. Change the colour of the sepsis bar to green.
#code from your answer to Q3,
# |>
# ggplot(aes(SepsisLabel, prop, fill = SepsisLabel)) +
# geom_col() +
# geom_text(aes(label = scales::percent(prop, accuracy = 0.01)), vjust = -0.4) +
# scale_y_continuous(labels = scales::percent) +
# scale_fill_manual(values = c("No Sepsis" = "grey60",
# "Sepsis" = "firebrick")) +
# labs(x = NULL, y = "Percentage of Trainining Data", title = "Sepsis Class Balance") +
# theme(legend.position = "none")Q6. (3 marks) Given the class imbalance, which of the following 2 metrics might be the most misleading and why? accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC?
train |>
summarise(across(everything(), ~ mean(is.na(.x)))) |>
pivot_longer(everything(),
names_to = "variable", values_to = "pct_missing") |>
filter(pct_missing > 0) |>
ggplot(aes(pct_missing, fct_reorder(variable, pct_missing))) +
geom_col(fill = "steelblue") +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Missing data by variable (training set)",
x = "Percent missing", y = NULL)Q7. (1 mark) Modify the above to plot only missingness for Sepsis positive data
Q8. (3 marks) Which three variables have the highest missingness? In an ICU context, why might a lab value be missing - is this likely to be random, or informative? Does this mean you should drop or impute these values?
vitals <- c("HR", "O2Sat", "Temp", "SBP", "MAP", "DBP", "Resp")
train |>
select(all_of(vitals), SepsisLabel) |>
pivot_longer(-SepsisLabel) |>
filter(!is.na(value)) |>
ggplot(aes(value, fill = SepsisLabel)) +
geom_density(alpha = 0.55, colour = NA) +
facet_wrap(~ name, scales = "free", ncol = 4) +
scale_fill_manual(values = c("No Sepsis" = "grey60",
"Sepsis" = "firebrick")) +
labs(x = NULL, y = "Density", fill = NULL,
title = "Vital-sign distributions by sepsis status")Q9 (2 marks). Modify the code above to plot the distribution of three other lab variables of your choice.
Correlation plot of all variables by similarity of their correlation in Sepsis and Non-Sepsis
cor_mat <- function(df) {
df |>
select(where(is.numeric)) |>
cor(use = "pairwise.complete.obs")
}
m_ns <- cor_mat(train |> filter(SepsisLabel == "No Sepsis"))
m_sep <- cor_mat(train |> filter(SepsisLabel == "Sepsis"))
m_dif <- m_sep - m_ns
m_dif[is.na(m_dif)] <- 0
hc <- hclust(as.dist(1 - abs(m_dif)), method = "average")
var_order <- hc$labels[hc$order]
m_dif |>
as.data.frame() |>
tibble::rownames_to_column("var1") |>
tidyr::pivot_longer(-var1, names_to = "var2", values_to = "diff") |>
mutate(var1 = factor(var1, levels = var_order),
var2 = factor(var2, levels = var_order)) |>
ggplot(aes(var1, var2, fill = diff)) +
geom_tile() +
scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick",
midpoint = 0, limits = c(-0.5, 0.5),
oob = scales::squish, name = "Δ r") +
coord_fixed() +
labs(x = NULL, y = NULL,
title = "Change in pairwise correlations: Sepsis − No Sepsis") +
theme_minimal(base_size = 9) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
panel.grid = element_blank())PCA compresses all numeric features into a small number of axes and
lets us ask: do septic and non-septic patients separate at all in
the joint feature space? We build it inside a small
recipe so the preprocessing (imputation, normalisation) is
explicit and reproducible.
Recipes are a special type of function in tidymodels
that differ from normal R functions in key ways (internal state).
A plain R function executes immediately and is stateless i.e., every call recomputes from scratch.
rscale_hr <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
train$HR_z <- scale_hr(train$HR)
test$HR_z <- scale_hr(test$HR) # BUG: uses test set's own mean/sd
The test set is standardised using statistics computed from the
test set, which leaks information and means train and test are now
on different scales. To do it correctly you’d have to manually save
mean(trainHR) and sd(trainHR) and
sd(trainHR) and sd(trainHR) and reuse them and
then do the same for every other transformation, for every column, for
imputation, encoding, etc. It gets unmanageable fast.
A recipe on the otherhand is a two-phase object: first you declare
the steps, then you prep() it on training data (which
estimates and stores any required parameters means, sds, imputation
values, factor levels, PCA loadings…), then you bake() it
on any new data to apply the same learned transformations.
pca_prep <- recipe(SepsisLabel ~ ., data = train) |>
step_rm(ICULOS, HospAdmTime) |>
step_impute_median(all_numeric_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
step_zv(all_predictors()) |>
step_normalize(all_numeric_predictors()) |>
step_pca(all_numeric_predictors(), num_comp = 10, id = "pca") |>
prep()
pca_scores <- bake(pca_prep, new_data = NULL)pca_scores |>
ggplot(aes(PC01, PC02, colour = SepsisLabel)) +
geom_point(alpha = 0.75, size = 0.7) +
scale_colour_manual(values = c("No Sepsis" = "grey60",
"Sepsis" = "firebrick")) +
labs(title = "PCA: PC1 vs PC2", colour = NULL) +
guides(colour = guide_legend(override.aes = list(alpha = 1, size = 2)))tidy(pca_prep, id = "pca", type = "variance") |>
filter(terms == "percent variance") |>
ggplot(aes(component, value)) +
geom_col(fill = "steelblue") +
labs(x = "Principal component", y = "% variance explained",
title = "PCA scree plot")The two classes overlap heavily in the PC1/PC2 projection — this is common in clinical data where no single linear combination cleanly separates cases.
Q10 (2 marks). What does this tell us about the potential decision boundary our predictive model would have to fit? Which models easily support this type of boundary?
Let’s create a few new features likely to be informative for sepsis:
Q11. (1 mark) Modify the following code to also add
an additional feature called shock_index which is defined
as heart rate divided by the systolic blood pressure.
Now lets try to fit some predictive models to the training set
Five-fold stratified CV on the training set, used to select hyperparameters (see below) by repeatedly training the model with different hyperparameter values on 4/5ths of the data and evaluating on each of the held-out 1/5ths.s
Not all models can handle the same type of data and some models (like linear models) benefit from normalising values to similar ranges. Most models will ignore missing values but we are going to add a missingness-indicator columns so the model can learn from which values are absent.
We are going to use tidymodel recipes again so we can
apply the training normalisation etc to the test set to prevent
leakage.
base_recipe <- recipe(SepsisLabel ~ ., data = train) |>
step_rm(ICULOS, HospAdmTime) |>
step_indicate_na(all_numeric_predictors(), prefix = "miss") |>
step_impute_median(all_numeric_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_novel(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
step_zv(all_predictors())
glm_recipe <- base_recipe |>
step_normalize(all_numeric_predictors())Now let’s specify our 3 models: - Logistic regression: a linear
baseline. Outputs P(Sepsis∣X) via a logistic link function (i.e.,
squeezing our linear model through a sigmoid function that constrains
between 0 and 1). By default glmnet uses a regularised
model. This means we are penalising the model complexity (the size of
the weights/coefficients are added to the cost function). Specifically,
we are using an cross-validation optimied combination of L1 and L2
penalties (absolute and squared weight/coefficient sizes). This is
useful as many of our ~40 lab variables are correlated or weakly
informative.
Random forest: an ensemble of decision trees, each fit on a
bootstrap sample with a random subset of features at each split.
Captures non-linearities and interactions automatically, and handles
mixed-scale features without preprocessing. Robust to outliers and (via
ranger) to NAs.
XGBoost: gradient-boosted trees. Builds trees sequentially, each one correcting the residual errors of the previous ensemble. Typically the strongest tabular performer, but the most sensitive to model (hyper)parameters and easy to overfit without tuning.
Each of these has several hyperparameters (parts of the model we
don’t directly learn from the data during fitting; instead we choose
them before fitting and select the best values by cross-validation).
Marking a hyperparameter with tune() is a placeholder that
says “leave this blank for now — we’ll fill it in during the tuning
step.” The values will later be chosen by tune_grid() based on
cross-validated performance on the training folds.
Q12 4 marks: Using R’s documentation (e.g.,
?logistic_reg) look up and briefly explain each of the
hyperparamters we are tuning in cross-validation for each of the 3
models:s
glm_spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
set_engine("glmnet") |>
set_mode("classification")
# hyperparameters: penalty scaling factor and proportion L1 vs L2s
rf_spec <- rand_forest(trees = 500, min_n = tune()) |>
set_engine("ranger", importance = "permutation", num.threads = 1) |>
set_mode("classification")
xgb_spec <- boost_tree(
trees = 500, tree_depth = tune(), learn_rate = tune(),
loss_reduction = tune(), sample_size = tune(),
min_n = tune()
) |>
set_engine("xgboost", nthread = 1) |>
set_mode("classification")Now let’s specify our workflow to use cross-validation to tune all models and compare across them.
wf_set <- workflow_set(
preproc = list(glm = glm_recipe, rf = base_recipe, xgb = base_recipe),
models = list(glm = glm_spec, rf = rf_spec, xgb = xgb_spec),
cross = FALSE
)We are going to calculate a set of different performance metrics across CV folds.
pr_auc: Area under the Precision-Recall curve.
Sweeps the classification threshold from strict to lenient and plots
precision against recall at each. Single-number summary of how well the
model ranks septic patients above non-septic ones, without being
inflated by the large true-negative pool. Baseline ≈ class prevalence
(~0.02 here), so any value meaningfully above that is real
signal.
brier_class: Mean squared error between predicted
probabilities and the binary sepsis label 0/1 outcome. Measures
calibration + discrimination together a model that says
“80% sepsis risk” should be right 80% of the time. Lower is better; 0 is
perfect, 0.25 is random guessing at 50/50. Unlike AUC metrics, Brier
penalises over-confident probabilities
accuracy: (TP+TN)/N. Due to imbalance this is
included for familiarity and catching bugs! Reminder: Predicting “No
Sepsis” for everyone scores ~98% here.
sensitivity (= recall, = true positive rate)
TP/(TP+FN). Of patients who actually had sepsis, what fraction did we
flag? The clinically critical metric for the positive class: missed
sepsis (low sensitivity) means delayed antibiotics, which directly
increases mortality.
specificity: TN/(TN+FP). Of patients who did
not have sepsis, what fraction did we correctly leave
unflagged? The counterpart to sensitivity — high specificity means low
false-alarm rate, which matters for alarm fatigue in
the ICU. Sensitivity and specificity trade off as you move the
threshold, so they’re reported together.
Now we can run it, this will take a few minutes so we can use cache-ing again to avoid frustration!
autoplot(tuned, metric = "pr_auc", select_best = TRUE) +
labs(title = "Best CV PR-AUC per model family")Q13 (2 marks). Which model family achieves the highest CV PR-AUC? Is the gap between the best and worst model large relative to the standard errors? What does that suggest about model selection here?
rank_results(tuned, rank_metric = "pr_auc", select_best = TRUE) |>
filter(.metric %in% c("pr_auc", "brier_class", "accuracy", "sensitivity", "specificity")) |>
select(wflow_id, .metric, mean, std_err, rank)Q14 (3 marks). By summarising and comparing the metrics output by this code, describe how each model performed relative to one another.
In the real-world we’d try to optimise these models further and/or try some different architectures but lets just take our best model and continue to test evaluation.
best_id <- rank_results(tuned, rank_metric = "pr_auc",
select_best = TRUE) |>
filter(.metric == "pr_auc") |>
slice_head(n = 1) |>
pull(wflow_id)
final_wf <- extract_workflow(tuned, id = best_id) |>
finalize_workflow(
extract_workflow_set_result(tuned, id = best_id) |>
select_best(metric = "pr_auc")
)
split <- make_splits(train, test)
final_fit <- last_fit(final_wf, split = split, metrics = cls_metrics)
collect_metrics(final_fit)preds <- collect_predictions(final_fit)
preds |>
pr_curve(truth = SepsisLabel, .pred_Sepsis,
event_level = "first") |>
autoplot() + labs(title = "Precision–Recall curve (test set)")Q15 (1 marks). How many times were true sepsis cases classified as non-sepsis in the test set? Is this more or less than the correct sepsis predictions?
preds |>
conf_mat(truth = SepsisLabel, estimate = .pred_class) |>
autoplot(type = "heatmap") +
scale_fill_gradient(low = "white", high = "firebrick") +
labs(title = "Confusion matrix (threshold = 0.5)")Now let’s investigate whether the model work equally well/badly across patient groups?
errors <- augment(final_fit)
subgroup_auc <- function(df, group_var) {
df |>
group_by({{ group_var }}) |>
filter(n_distinct(SepsisLabel) == 2) |>
roc_auc(truth = SepsisLabel, .pred_Sepsis,
event_level = "second") |>
select({{ group_var }}, .estimate)
}
overall_auc <- collect_metrics(final_fit) |>
filter(.metric == "pr_auc") |>
pull(.estimate)
bind_rows(
errors |> subgroup_auc(Gender) |> rename(subgroup = 1) |>
mutate(subgroup = as.character(subgroup), group = "Sex"),
errors |> subgroup_auc(age_group) |> rename(subgroup = 1) |>
mutate(subgroup = as.character(subgroup), group = "Age group")
) |>
ggplot(aes(reorder(subgroup, .estimate), .estimate, fill = group)) +
geom_col() +
geom_hline(yintercept = overall_auc, linetype = 2, colour = "grey40") +
coord_flip() +
facet_grid(group ~ ., scales = "free_y", space = "free_y") +
labs(x = NULL, y = "PR-AUC",
title = "Subgroup PR-AUC",
subtitle = "Dashed line = overall test AUC") +
theme(legend.position = "none")Q15 (2 marks). Is there a subgroup where the model performs noticeably worse than the overall PR-AUC? Suggest one data-driven reason and one clinical reason that could explain the gap.
parsnip_fit <- extract_fit_parsnip(final_fit)
importance_tbl <- if (grepl("^rf", best_id)) {
imp <- parsnip_fit$fit$variable.importance
tibble(variable = names(imp), importance = unname(imp))
} else if (grepl("^xgb", best_id)) {
xgboost::xgb.importance(model = parsnip_fit$fit) |>
as_tibble() |>
transmute(variable = Feature, importance = Gain)
} else {
broom::tidy(parsnip_fit) |>
filter(term != "(Intercept)") |>
transmute(variable = term, importance = abs(estimate))
}
importance_tbl |>
slice_max(importance, n = 20) |>
ggplot(aes(importance, fct_reorder(variable, importance))) +
geom_col(fill = "steelblue") +
labs(title = paste("Top variable importances: ", best_id),
x = "Importance", y = NULL)Q16 (2 marks). Compare the top 3 most important variables to the three predictors you identified in Q1. Do they match? Write one sentence about any variable that surprised you.
Q17 (3 marks). In 3–5 sentences, address both of the following:
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.12.0
## LAPACK: /usr/lib/liblapack.so.3.12.0 LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Halifax
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] fs_2.1.0 here_1.0.2 glmnet_5.0 Matrix_1.7-5
## [5] xgboost_3.2.1.1 ranger_0.18.0 yardstick_1.4.0 workflowsets_1.1.1
## [9] workflows_1.3.0 tune_2.1.0 tailor_0.1.0 rsample_1.3.2
## [13] recipes_1.3.2 parsnip_1.6.0 modeldata_1.5.1 infer_1.1.0
## [17] dials_1.4.3 scales_1.4.0 broom_1.0.12 tidymodels_1.5.0
## [21] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
## [25] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
## [29] ggplot2_4.0.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4052.112 farver_2.1.2
## [4] S7_0.2.2 fastmap_1.2.0 digest_0.6.39
## [7] rpart_4.1.27 timechange_0.4.0 lifecycle_1.0.5
## [10] survival_3.8-6 magrittr_2.0.5 compiler_4.6.0
## [13] rlang_1.2.0 sass_0.4.10 tools_4.6.0
## [16] yaml_2.3.12 data.table_1.18.4 knitr_1.51
## [19] labeling_0.4.3 DiceDesign_1.10 RColorBrewer_1.1-3
## [22] withr_3.0.2 nnet_7.3-20 grid_4.6.0
## [25] sparsevctrs_0.3.6 future_1.70.0 iterators_1.0.14
## [28] globals_0.19.1 MASS_7.3-65 cli_3.6.6
## [31] rmarkdown_2.31 generics_0.1.4 rstudioapi_0.18.0
## [34] future.apply_1.20.2 tzdb_0.5.0 cachem_1.1.0
## [37] splines_4.6.0 parallel_4.6.0 vctrs_0.7.3
## [40] hardhat_1.4.3 jsonlite_2.0.0 hms_1.1.4
## [43] listenv_0.10.1 foreach_1.5.2 gower_1.0.2
## [46] jquerylib_0.1.4 glue_1.8.1 parallelly_1.47.0
## [49] codetools_0.2-20 shape_1.4.6.1 stringi_1.8.7
## [52] gtable_0.3.6 pillar_1.11.1 furrr_0.4.0
## [55] htmltools_0.5.9 ipred_0.9-15 lava_1.9.1
## [58] R6_2.6.1 rprojroot_2.1.1 evaluate_1.0.5
## [61] lattice_0.22-9 backports_1.5.1 bslib_0.10.0
## [64] class_7.3-23 Rcpp_1.1.1-1.1 prodlim_2026.03.11
## [67] xfun_0.57 pkgconfig_2.0.3
This practical is adapted from earlier versions of the course’s
medical databases lab and updated to use the PhysioNet/CinC
Challenge 2019 sepsis dataset (Reyna et al., Crit Care Med
2020) to use a more modern tidymodels workflow.