This practical is mostly reading and configuration, but you should complete the two parts labelled Exercise: and submit the knitted HTML and a link to your GitHub/GitLab repo via BrightSpace. This practical will not be graded, but it is your chance to make sure everything works properly and is submitted in the correct format for later practicals.
In terms of specific learning objectives:
.Rproj file, and a Git repository.
dplyr + tidyr).
.gitignore.
sessionInfo() footer so a reader knows exactly which package versions produced your results.
These skills underpin every subsequent practical in this course!
A “reproducible” analysis is one where another researcher (or future-you) can take your code and data and obtain the same results. In health research the stakes are higher than usual:
A practical reproducibility checklist for lab practicals:
.Rproj file. No absolute paths like C:/Users/me/Desktop/....
R/, data/, figures/, output/).
set.seed().
sessionInfo() is recorded at the bottom of the report.
install.packages() and loaded per session with library().
dplyr, tidyr, ggplot2, readr, tibble, purrr, stringr, forcats, lubridate) that share design principles and the pipe-friendly grammar.
data.frame but prints more cleanly and never silently changes types.
gitlab.cs.dal.ca.
|> is available): https://cran.r-project.org/
Verify Git from a terminal and add your details to the configuration:
By default RStudio has four panes (configurable in Tools → Global Options → Pane Layout):
.R, .Rmd, and other source files.
Quick sanity check: type the following in the Console and press Enter:
## [1] 2
You should see x appear in the Environment pane.
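As a minimal sketch of a sanity check consistent with the output above (the exact expression is an assumption; any assignment that evaluates to 2 would print the same result):

```r
# Assumed reconstruction: assign a value to x, then print it.
x <- 1 + 1   # the assignment creates x in the global environment
x            # typing the name on its own prints the value
```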
The recommended workflow is GitHub/GitLab-first:
arhds-labs (you can add a README and an R .gitignore template if prompted).
.Rproj file. Always open that .Rproj to start work - it sets the working directory automatically, which is the foundation of path reproducibility.
A sensible starter .gitignore for R projects:
.Rhistory
.RData
.Ruserdata
.Rproj.user/
*.html
*.pdf
# if data is large or non-redistributable
/data/raw/
# if using renv
/renv/library/
Dalhousie GitLab: when you create your repository, if you set the
visibility level to internal, it will be visible
to anyone logged into git.cs.dal.ca and you don’t need to
do any other configuration. If you want to limit access, set it to
private, create it, and then use the left-side menu
Manage -> Members -> Invite Members to invite my
csid finlaym to the repository.
GitHub: when you create your repository, if you
choose public visibility, anyone online
can see it and you don’t need to do any other configuration. If you want
to limit access, set it to private, create it, then click on invite
collaborators and add fmaguire to your repository.
renv
For coursework, plain library() calls are fine. For your
bigger research projects, you should consider using a utility like
renv (renv::init()) to pin
exact package versions in renv.lock so collaborators get
the same environment.
If you have never used R, have a look at the Harvard Chan Intro-R module material. We will go over the compressed key details.
A vector is an ordered collection of values of the same type. R indexes from 1 (not 0) and supports negative and logical indexing. Note: negative indexing works differently than in languages such as Python. In R a negative index drops that element rather than counting from the end.
## [1] 4
## [1] 1 3 2
## [1] 4 3
## [1] 4 3
## [1] 4
## [1] 2.5
## [1] 1.290994
## [1] 30 1 4 3 2 90
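One sketch that reproduces the eight printed results above, in order. The vector x <- c(1, 4, 3, 2) and the individual calls are assumptions chosen to match the output, not necessarily the original chunk:

```r
x <- c(1, 4, 3, 2)   # assumed example vector

x[2]            # positional index, counting from 1   -> 4
x[-2]           # negative index DROPS element 2      -> 1 3 2
x[x > 2]        # logical index keeps TRUE positions  -> 4 3
x[c(2, 3)]      # several positions at once           -> 4 3
length(x)       # number of elements                  -> 4
mean(x)         # arithmetic mean                     -> 2.5
sd(x)           # sample standard deviation           -> 1.290994
c(30, x, 90)    # c() concatenates values and vectors -> 30 1 4 3 2 90
```

For comparison, x[-2] on the equivalent Python list would return the second-to-last element (3), whereas R returns everything except the second element.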
A factor encodes a categorical variable. Levels are the allowed values; by default they are alphabetically ordered, which is rarely what you want.
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
# Default: alphabetical ordering - usually wrong for ordinal data
factor(expression)
## [1] low high medium high low medium high
## Levels: high low medium
## [1] low high medium high low medium high
## Levels: low medium high
In modern tidyverse code, prefer
forcats (fct_relevel,
fct_infreq, fct_lump) over base R for factor
manipulation.
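The second printout above shows the levels in their natural order (low, medium, high). A sketch of one way to get that result, assuming the levels are set explicitly with base R's factor(); the commented forcats line is an equivalent that needs the forcats package installed:

```r
expression <- c("low", "high", "medium", "high", "low", "medium", "high")

# Set the level order explicitly instead of relying on alphabetical sorting
expr_f <- factor(expression, levels = c("low", "medium", "high"))
expr_f
levels(expr_f)        # "low" "medium" "high"

# The integer codes follow the level order: low = 1, medium = 2, high = 3
as.integer(expr_f)    # 1 3 2 3 1 2 3

# forcats equivalent (assumes forcats is installed):
# forcats::fct_relevel(factor(expression), "low", "medium", "high")
```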
A data frame is a rectangular table whose columns may have different types. A tibble is the tidyverse drop-in replacement: same idea, better defaults.
# Base R
df_base <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  age = c(58, 64, 71),
  sex = c("F", "M", "F"),
  sbp_mmhg = c(132, 145, 128) # systolic blood pressure
)
df_base
# Tibble equivalent
library(tibble)
tb <- tibble(
  patient_id = c("P01", "P02", "P03"),
  age = c(58, 64, 71),
  sex = c("F", "M", "F"),
  sbp_mmhg = c(132, 145, 128)
)
tb
Notice the tibble prints column types (<chr>, <dbl>) - useful when debugging type coercion bugs.
R has two pipes:
|> - the native pipe, built into R ≥ 4.1. No package required.
%>% - the magrittr pipe, loaded with dplyr/tidyverse. Older code uses this almost exclusively.
Both pass the left-hand side as the first argument of the right-hand side. For new code, prefer |>.
## [1] 54.59815
## [1] 54.59815
## [1] 54.59815
The pipe lets you read data transformations left-to-right, top-to-bottom, like a recipe.
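The three identical results above are consistent with one calculation written three ways. A sketch, assuming the calculation was exp(sqrt(16)), which evaluates to the printed value (the original expression is not confirmed):

```r
# Nested call: read inside-out
exp(sqrt(16))

# Native pipe: read left-to-right (requires R >= 4.1)
16 |> sqrt() |> exp()

# magrittr pipe, same result (assumes magrittr, dplyr, or tidyverse is loaded):
# 16 %>% sqrt() %>% exp()
```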
This lab needs the following three packages. You can install new packages like this:
There is no need to run both library(dplyr) and
library(tidyverse) - the latter loads the former. Stick
with library(tidyverse) for analysis scripts; load
individual packages only when writing a package or a constrained Shiny
app.
dplyr
dplyr provides a small set of verbs that compose into
rich pipelines. We will use a tiny synthetic clinical dataset
throughout.
set.seed(2026) # reproducibility: any random draws below give identical results every run
clinic <- tibble(
  patient_id = sprintf("P%03d", 1:8),
  age = c(58, 64, 71, 49, 82, 33, 67, 55),
  sex = c("F", "M", "F", "M", "F", "M", "F", "M"),
  smoker = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  sbp_mmhg = c(132, 145, 128, 118, 162, 121, 150, 135), # systolic BP
  bmi = c(27.4, 31.2, 24.8, 22.0, 29.5, 21.3, 33.1, 26.6)
)
clinic
select() - pick columns
Python equivalent: df[["patient_id", "age", "sbp_mmhg"]] or df.select_dtypes(include="number").
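As a minimal sketch of select() on the clinic tibble (repeated here so the block runs on its own). The two calls are illustrative and mirror the Python equivalents above; they are not necessarily the original chunk:

```r
library(tibble)
library(dplyr)

clinic <- tibble(
  patient_id = sprintf("P%03d", 1:8),
  age = c(58, 64, 71, 49, 82, 33, 67, 55),
  sex = c("F", "M", "F", "M", "F", "M", "F", "M"),
  smoker = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  sbp_mmhg = c(132, 145, 128, 118, 162, 121, 150, 135),
  bmi = c(27.4, 31.2, 24.8, 22.0, 29.5, 21.3, 33.1, 26.6)
)

# Pick columns by name
by_name <- clinic |> select(patient_id, age, sbp_mmhg)

# Pick columns by type: where() applies a predicate to every column
numeric_only <- clinic |> select(where(is.numeric))

names(by_name)
names(numeric_only)
```

Note that the logical smoker column is not numeric, so where(is.numeric) leaves it out.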
filter() - pick rows
Python equivalent: df.query("sbp_mmhg >= 140 and smoker").
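As a minimal sketch of filter() matching the Python query above, again on the clinic tibble (repeated so the block runs on its own; the exact original call is an assumption):

```r
library(tibble)
library(dplyr)

clinic <- tibble(
  patient_id = sprintf("P%03d", 1:8),
  age = c(58, 64, 71, 49, 82, 33, 67, 55),
  sex = c("F", "M", "F", "M", "F", "M", "F", "M"),
  smoker = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  sbp_mmhg = c(132, 145, 128, 118, 162, 121, 150, 135),
  bmi = c(27.4, 31.2, 24.8, 22.0, 29.5, 21.3, 33.1, 26.6)
)

# Comma-separated conditions are combined with AND;
# a logical column such as smoker can be used directly as a condition
high_bp_smokers <- clinic |> filter(sbp_mmhg >= 140, smoker)
high_bp_smokers$patient_id
```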
mutate() - create or modify columns
clinic |>
  mutate(
    bp_category = case_when(
      sbp_mmhg < 120 ~ "normal",
      sbp_mmhg < 130 ~ "elevated",
      sbp_mmhg < 140 ~ "stage 1",
      TRUE ~ "stage 2"
    ),
    obese = bmi >= 30
  )
case_when() is the multi-branch if/else of
the tidyverse - much cleaner than nesting ifelse()
calls.
summarise() and group_by() - collapse rows
clinic |>
  group_by(sex) |>
  summarise(
    n = n(),
    mean_age = mean(age),
    mean_sbp = mean(sbp_mmhg),
    pct_smokers = mean(smoker) * 100,
    .groups = "drop"
  )
Python equivalent: df.groupby("sex").agg(...).
across() - apply a function to multiple columns
across() (introduced in dplyr 1.0) is the modern way to
compute the same summary for many columns:
clinic |>
  group_by(sex) |>
  summarise(across(c(age, sbp_mmhg, bmi), \(x) mean(x, na.rm = TRUE)),
            .groups = "drop")
The \(x) ... syntax is R 4.1’s anonymous-function
shorthand (equivalent to function(x) ... or
lambda x: in Python).
rename(new = old) - rename columns.
relocate(col, .before = other) - reorder columns.
distinct() - drop duplicate rows.
slice_max(col, n = 5) / slice_min() / slice_sample(n = 100) - pick rows by rank or randomly.
tidyr
Tidy data has three properties:
Most messy datasets violate one of these. Two verbs do most of the work:
pivot_longer() - wide → long
(pivot_longer() replaces the older
gather(). You may still see gather() in older
code; it works but is no longer recommended.)
life_expectancy <- tribble(
  ~country, ~`2010`, ~`2015`, ~`2020`,
  "Australia", 82.0, 82.4, 83.0,
  "Canada", 80.7, 81.5, 81.9,
  "France", 81.8, 82.3, 83.0
)
life_expectancy
le_long <- life_expectancy |>
  pivot_longer(
    cols = -country, # everything except country
    names_to = "year",
    values_to = "expectancy"
  ) |>
  mutate(year = as.integer(year))
le_long
pivot_wider() - long → wide
(pivot_wider() replaces spread().)
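As a minimal sketch of pivot_wider() reversing the reshaping above. It assumes a long table with the same columns as le_long, rebuilt here directly so the block runs on its own; the object name le_wide is illustrative:

```r
library(tibble)
library(tidyr)

le_long <- tribble(
  ~country,    ~year, ~expectancy,
  "Australia", 2010L, 82.0,
  "Australia", 2015L, 82.4,
  "Australia", 2020L, 83.0,
  "Canada",    2010L, 80.7,
  "Canada",    2015L, 81.5,
  "Canada",    2020L, 81.9,
  "France",    2010L, 81.8,
  "France",    2015L, 82.3,
  "France",    2020L, 83.0
)

# Each distinct year becomes its own column; expectancy fills the cells
le_wide <- le_long |>
  pivot_wider(names_from = year, values_from = expectancy)
le_wide
```

The new column names are the year values converted to strings ("2010", "2015", "2020"), so they need backticks when referred to by name, as in the tribble() call earlier.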
readr and here
Hard-coded paths break reproducibility. The here package
resolves paths relative to the project root (the folder containing your
.Rproj):
library(here)
# Write the clinic tibble to data/processed/
dir.create(here("data", "processed"), recursive = TRUE, showWarnings = FALSE)
write_csv(clinic, here("data", "processed", "clinic.csv"))
# Read it back - works on any machine, regardless of where the project lives
clinic2 <- read_csv(here("data", "processed", "clinic.csv"))
readr functions (read_csv,
read_tsv, read_delim) are faster than base R’s
read.csv and never silently coerce strings to factors.
ggplot2
ggplot2 implements the Grammar of Graphics:
every plot is a stack of layers built from data + aesthetic
mappings + geometric objects + scales + coordinate system +
theme.
data(mpg) # fuel-economy dataset that ships with ggplot2
# A plot is built up with `+`, NOT the pipe.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
Add aesthetic mappings - colour by class:
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point(alpha = 0.7) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway MPG",
    colour = "Vehicle class",
    title = "Larger engines deliver lower fuel economy"
  ) +
  theme_minimal()
Common geoms:
# Faceting - small multiples
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class) +
  theme_minimal()
This exercise demonstrates why we visualise data: thirteen datasets with nearly identical summary statistics but wildly different shapes.
The original Datasaurus is from Alberto Cairo’s blog post; the rest are from Matejka & Fitzmaurice’s Same Stats, Different Graphs (CHI 2017).
Q1. How many rows and columns does
datasaurus_dozen contain, and what are the variables?
Q2. Uncomment and complete the ggplot code to plot
y vs x for the dino subset, and
compute the Pearson correlation.
dino_data <- datasaurus_dozen |>
  filter(dataset == "dino")
#ggplot(dino_data, aes( # complete...
dino_data |> summarise(r = cor(x, y))
Q3. Repeat for the star dataset.
Compare its r to that of dino.
Q4. Repeat for the circle dataset.
Compare its r to that of dino.
Q5. Complete the following code to visualise all the
datasets at once with faceting and calculate the statistics as a single
grouped summarise command.
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point(size = 0.7) +
  facet_wrap(~ dataset, ncol = 3) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text = element_blank(),
        axis.ticks = element_blank())
datasaurus_dozen |>
  group_by(dataset) |>
  summarise(
    mean_x = mean(x),
    #... complete this to calculate the mean for y, standard deviation for x and y, and pearson correlation
    .groups = "drop"
  )
Q6. Write 2–3 sentences in your knitted document on why these summary statistics are nearly identical despite the obvious visual differences, and what this implies for exploratory data analysis on real clinical datasets.
R ships with airquality (daily air-quality measurements,
NY 1973). Despite its age it is a useful, mildly messy dataset for
practising the verbs above.
Q7. Using airquality:
Ozone is missing.
month_name with the month spelled out ("May", "Jun", …). Hint: month.abb is a built-in vector.
Ozone, mean Temp, and the count of complete days per month.
Ozone against Temp, coloured by month, with a smoothed trend line (geom_smooth(method = "lm")).
A minimal commit cycle from the RStudio Terminal pane (or a system shell):
git status # what changed?
git add lab0_reproducible_research_tidyverse.Rmd
git commit -m "Lab 0: complete tidyverse + ggplot exercises"
git push
You can also use the Git tab in RStudio’s top-right pane: tick the files to stage, click Commit, write a message, click Push.
Best practice:
.gitignore above already excludes them.
For each practical you will submit:
.Rmd in your Git repository (public, or shared with github:fmaguire / gitlab.cs.dal.ca:finlaym if private - see explanation above).
Due midnight before the next week’s practical.