Author | |
---|---|
Name | Claire Descombes |
Affiliation | Universitätsklinik für Neurochirurgie, Inselspital Bern |
Degree | MSc Statistics and Data Science, University of Bern |
Contact | claire.descombes@insel.ch |
The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.
The base R plotting functions are effective and user-friendly, so we will begin by exploring their features.
Next, we will take a look at the functions offered by
ggplot2
, a system for declarative graphics creation. You
provide the data and specify how to map variables to aesthetics and
which graphical primitives to use; ggplot2 then takes care of the
details. Although it is slightly more technical than base R, it produces
very aesthetically pleasing results.
To explore R’s plotting capabilities, we’ll continue working with the
NHANES datasets. Specifically, we’ll use the merged data frame that
combines the demo
, bpx
, bmx
and
smq
data sets. If you still have it loaded from Chapter 2,
you can use it directly.
To be able to plot survival curves, we will also need survival data
for the NHANES data set. Those can be found publicly as well (for adults
only), on the National
Death Index. I already merged this survival data with the NHANES
data set, please download it from the data_sets
folder and import it into R.
# Load the merged_nhanes CSV file
merged_nhanes <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/merged_nhanes.csv")
# Load the merged_nhanes_with_mort CSV file
merged_nhanes_with_mort <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/merged_nhanes_with_mort.csv")
plot(x = merged_nhanes$RIDAGEYR, y = merged_nhanes$BPXSY1,
xlab = "Age of participant [years]",
ylab = "Systolic blood pressure (1st reading) [mmHg]",
main = "Scatterplot of age vs. systolic BP",
col = c("steelblue"))
💡 The arguments xlab
and ylab
allow you to
define a label for the x and y axes respectively. The argument
main
defines the main title of the plot.
boxplot(merged_nhanes$BPXSY1 ~ merged_nhanes$RIAGENDR,
names = c("Male", "Female"),
xlab = "Gender",
ylab = "Systolic blood pressure (1st reading) [mmHg]",
main = "Boxplot of BP by gender",
col = c("salmon", "skyblue"))
# We define the age groups in which to calculate the mean systolic blood pressure and calculate the mean systolic blood pressure by age group
avg_bp <- merged_nhanes %>%
filter(!is.na(BPXSY1), !is.na(RIDAGEYR)) %>% # Remove NAs
mutate(age_group = cut(RIDAGEYR, breaks = seq(10, 80, 10))) %>% # Define age groups
group_by(age_group) %>%
summarise(mean_bp = mean(BPXSY1, na.rm = TRUE)) %>%
ungroup()
barplot(height = avg_bp$mean_bp,
names.arg = avg_bp$age_group,
xlab = "Age group [years]",
ylab = "Average systolic BP [mmHg]",
main = "Bar plot of average systolic BP by age group",
col = "steelblue")
💡 cut()
divides the range of the imputed vector into
intervals and codes its values according to which interval they fall.
The leftmost interval corresponds to level one, the next leftmost to
level two and so on.
hist(merged_nhanes$BMXBMI,
breaks = 30,
freq = TRUE,
col = "steelblue",
xlab = "Body Mass Index (BMI)",
main = "Histogram of BMI")
💡 The argument breaks
is a vector giving the
breakpoints between histogram cells or a function to compute the vector
of breakpoints.
💡 The argument freq
is a logical: if TRUE, the
histogram graphic is a representation of frequencies, the counts
component of the result; if FALSE, probability densities, component
density, are plotted.
# We define the age groups in which to calculate the mean systolic blood pressure and calculate the mean systolic blood pressure by age group
avg_bp_with_midpoints <- merged_nhanes %>%
filter(!is.na(BPXSY1), !is.na(RIDAGEYR)) %>%
mutate(age_group = cut(RIDAGEYR, breaks = seq(10, 80, 10))) %>%
group_by(age_group) %>%
summarise(mean_bp = mean(BPXSY1)) %>%
ungroup() %>%
mutate(
# Add midpoints
age_mid = seq(15, 75, 10)
)
plot(avg_bp_with_midpoints$age_mid, avg_bp_with_midpoints$mean_bp, type = "b",
xlab = "Age (midpoint of group) [years]",
ylab = "Mean systolic BP [mmHg]",
main = "Mean systolic BP by age group",
col = "steelblue")
💡 The argument type
is 1-character string giving the
type of plot desired. The following values are possible:
(gender_table <- table(merged_nhanes$RIAGENDR))
##
## Female Male
## 3298 3251
pie(x = gender_table,
labels = c("Female", "Male"),
col = c("skyblue", "salmon"),
main = "Gender Distribution")
💡 table()
uses cross-classifying factors to build a
contingency table of the counts at each combination of factor
levels.
Personally, I would recommend avoiding pie charts, as the human eye is not very good at recognising angle ratios. I favour bar plots, for example.
(smoking_gender_table <- table(merged_nhanes$RIAGENDR, merged_nhanes$SMQ020))
##
## Don't know No Refused Yes
## Female 2 1817 0 879
## Male 1 1220 1 1398
mosaicplot(smoking_gender_table[,c("Yes","No")], # we select only the 'yes/no' answers
main = "Mosaic plot of smoking status by gender",
xlab = "Gender",
ylab = "Smoked at least 100 cigarettes?",
col = c("steelblue"))
Base R does not include a built-in function for forest plots, but you
can use a package like forestplot
for that, or
ggplot2
(see below).
library(forestplot)
# Calculate summary statistics by gender
summary_stats <- merged_nhanes %>%
filter(!is.na(BPXSY1)) %>% # Remove NAs for BPXSY1
group_by(RIAGENDR) %>% # Group by gender
summarise(
mean = mean(BPXSY1),
sd = sd(BPXSY1),
n = n(),
lower = mean - 1.96 * sd / sqrt(n),
upper = mean + 1.96 * sd / sqrt(n)
) %>%
ungroup()
# Create the forest plot
forestplot(labeltext = c("Female","Male"),
mean = summary_stats$mean,
lower = summary_stats$lower,
upper = summary_stats$upper,
zero = 120,
grid = TRUE,
ci.vertices = TRUE,
xlab = "Systolic BP (95% CI) [mmHg]",
col = fpColors(box = "steelblue", line = "steelblue"),
title = "Forest plot of systolic BP by gender")
Base R requires the survival
package in order to draw
Kaplan-Meier survival curves.
library(survival)
# Filter out data outside of window (> 24 months)
merged_nhanes_no_outliers <- merged_nhanes_with_mort[merged_nhanes_with_mort$PERMTH_INT <= 24, ]
# Fit Kaplan–Meier survival curves
fit <- survfit(Surv(PERMTH_INT, MORTSTAT) ~ 1,
data = merged_nhanes_no_outliers)
# Plot in base R
plot(fit,
col = "steelblue",
lwd = 2,
xlab = "Time (months)",
ylab = "Survival probability",
ylim = c(0.7,1),
main = "Kaplan–Meier survival curve")
✏️ Exercise on plotting with base R n°1: Plot the first, second and third systolic blood pressure readings by gender using base R. Display all three plots (1st, 2nd and 3rd readings) in the same output panel.
💡 Hint: Try using the par()
function
(e.g. par(mfrow=c(2,3))
; do not forget to reset it to
par(mfrow=c(1,1))
afterwards!)
✏️ Exercise on plotting with base R n°2: Create a histogram showing the systolic blood pressure (1st reading) for each smoking status category (excluding those who refused to provide their smoking status). Set the number of breaks to 20. You should get three different plots. Display them all in the same output panel.
ggplot2
# Start by installing ggplot2
install.packages("ggplot2")
library(ggplot2) # after installation
# Note that the ggplot2 package is included in some R package collections, e.g. the tidyverse collection
install.packages("tidyverse")
library(tidyverse) # after installation
ggplot2
is based on the grammar of graphics — the idea
that every graph can be built from the same basic components:
Structure of a ggplot2
graph:
How to build a plot:
ggplot()
, where you provide the dataset (and
optionally the aesthetic mapping).geom_point()
(scatter plot)geom_histogram()
(histogram)geom_line()
(line plot)scales
(e.g., scale_y_log10())faceting
(e.g., facet_wrap())coordinate systems
(e.g., coord_flip())aes()
: these describe how
variables in the data are mapped to visual properties like: x, y, color,
fill, linetype, size, etc.💡 You can place aes()
either in ggplot()
(to apply it globally) or inside the individual
geom_*()
functions (to apply it locally).
The geom function must be chosen according to the type of data examined (two continuous variables, one categorical and one continuous variable, etc.).
ggplot(merged_nhanes, aes(x = RIDAGEYR, y = BPXSY1)) +
geom_point(alpha = 0.5, col = "steelblue") +
labs(title = "Scatterplot of age vs. systolic BP",
x = "Age of participant [years]",
y = "Systolic blood pressure (1st reading) [mmHg]") +
theme_minimal()
Or equivalently:
ggplot(merged_nhanes) +
geom_point(aes(x = RIDAGEYR, y = BPXSY1), alpha = 0.5, col = "steelblue") +
labs(title = "Scatterplot of age vs. systolic BP",
x = "Age of participant [years]",
y = "Systolic blood pressure (1st reading) [mmHg]") +
theme_minimal()
ggplot(data = merged_nhanes, aes(x = factor(RIAGENDR, labels = c("Male", "Female")), y = BPXSY1, fill = factor(RIAGENDR))) +
geom_boxplot() +
scale_fill_manual(values = c("salmon", "skyblue")) +
labs(title = "Box plot of systolic BP by gender",
x = "Gender",
y = "Systolic BP (1st reading) [mmHg]") +
theme_minimal() +
theme(legend.position = "none")
ggplot(avg_bp, aes(x = age_group, y = mean_bp)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Bar plot of mean systolic BP by age group",
x = "Age group [years]",
y = "Mean systolic BP [mmHg]") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(merged_nhanes, aes(x = BMXBMI)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
labs(title = "Histogram of BMI",
x = "Body Mass Index (BMI)",
y = "Count") +
theme_minimal()
💡 A quick reminder that the number of bins in a histogram is extremely important and should be given careful consideration, as the appearance of the data can vary significantly depending on this choice, which could potentially be misleading.
An example with the age of the household reference person:
ggplot(merged_nhanes, aes(x = DMDHRAGE)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
labs(title = "Histogram of age of the household reference person (40 bins)",
x = "Age [years]",
y = "Count") +
theme_minimal()
ggplot(merged_nhanes, aes(x = DMDHRAGE)) +
geom_histogram(bins = 20, fill = "steelblue", color = "white") +
labs(title = "Histogram of age of the household reference person (20 bins)",
x = "Age [years]",
y = "Count") +
theme_minimal()
ggplot(merged_nhanes, aes(x = DMDHRAGE)) +
geom_histogram(bins = 10, fill = "steelblue", color = "white") +
labs(title = "Histogram of age of the household reference person (10 bins)",
x = "Age [years]",
y = "Count") +
theme_minimal()
ggplot(avg_bp_with_midpoints, aes(x = age_mid, y = mean_bp)) +
geom_line(col = "steelblue") +
geom_point(col = "steelblue") +
labs(title = "Mean systolic BP by age group",
x = "Age (midpoint of group) [years]",
y = "Mean systolic BP [mmHg]") +
theme_minimal()
# Count proportions for each gender
gender_counts <- merged_nhanes %>%
filter(!is.na(RIAGENDR)) %>% # Remove NAs if any
count(Gender = RIAGENDR) %>% # Count by gender, rename column to 'Gender'
mutate(
prop = n / sum(n), # Compute proportion
ypos = cumsum(prop) - 0.5 * prop # Compute y position for label
)
ggplot(gender_counts, aes(x = "", y = prop, fill = Gender)) +
geom_col(width = 1) +
coord_polar(theta = "y") +
geom_text(aes(y = ypos, label = scales::percent(prop)), color = "white") +
scale_fill_manual(values = c("salmon", "skyblue")) +
labs(title = "Pie chart of gender distribution") +
theme_void()
ggplot2
does not include a built-in function for mosaic
plots, but can be extended with the ggmoisaic
package for
that.
library(ggmosaic)
ggplot(data = merged_nhanes[merged_nhanes$SMQ020 %in% c("Yes","No"),]) +
geom_mosaic(aes(x = product(RIAGENDR), fill = SMQ020)) +
scale_fill_manual(values = c("salmon", "skyblue")) +
labs(x = "Gender", y = "Smoked at least 100 cigarettes?", title = "Mosaic plot of smoking status by gender") +
theme_minimal() +
theme(legend.position = "none")
# Calculate summary statistics by gender
forest_df <- merged_nhanes %>%
filter(!is.na(BPXSY1), !is.na(RIAGENDR)) %>%
group_by(Gender = RIAGENDR) %>%
summarise(
mean = mean(BPXSY1),
sd = sd(BPXSY1),
n = n(),
lower = mean - 1.96 * sd / sqrt(n),
upper = mean + 1.96 * sd / sqrt(n),
.groups = "drop"
)
ggplot(forest_df, aes(x = mean, y = Gender)) +
geom_point(size = 3, color = "steelblue") +
geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.2, color = "steelblue") +
geom_vline(xintercept = 120, linetype = "dashed") +
labs(title = "Forest plot of systolic BP by gender",
x = "Systolic BP (95% CI) [mmHg]",
y = "") +
theme_minimal()
ggplot2
does not include a built-in function for KM
curves, but can be extended with the survminer
package for
that.
library(survminer)
# Fit Kaplan–Meier survival curves (already done)
fit <- survfit(Surv(PERMTH_INT, MORTSTAT) ~ 1,
data = merged_nhanes_no_outliers)
# Plot with ggsurvplot
ggsurvplot(fit,
data = merged_nhanes_no_outliers,
palette = "steelblue",
conf.int = TRUE,
xlab = "Time (months)",
ylab = "Survival probability",
ylim = c(0.7, 1),
risk.table = TRUE,
title = "Kaplan–Meier survival curve")
✏️ Exercise on plotting with ggplot2
: Using the same age
groups as in the examples, create a bar plot showing the median BMI by
age group. Then produce two further bar plots showing the same data for
each gender, and display them in the same output panel, using the same
y-axis to allow for better comparison.
💡 Hints: Try using the grid.arrange()
function
from the gridExtra
package
(e.g. grid.arrange(plot1, plot2, ncol=2)
), and the argument
ylim
in the ggplot()
function.
We would use the jpeg()
function to save a plot as a
JPEG image and the png()
function to save it as a PNG
image. Please note that after the plotting and saving, we need to call
the function dev.off()
to save the file.
We can specify the resolution we want with arguments width and height. We can also specify the full path of the file we want to save if we don’t want to save it in the current directory.
plotting_directory <- "/home/claire/Documents/GitHub/rforphysicians/docs/plots/"
jpeg(file = paste0(plotting_directory,"hist_BMI.jpeg"),
width = 10,
height = 6,
units = "in",
res=100)
hist(merged_nhanes$BMXBMI,
breaks = 30,
freq = TRUE,
col = "steelblue",
xlab = "Body Mass Index (BMI)",
main = "Histogram of BMI")
dev.off()
png(file= paste0(plotting_directory,"pie_chart_gender.png"),
width = 10,
height = 6,
units="in",
res=100)
pie(x = gender_table,
labels = c("Female", "Male"),
col = c("skyblue", "salmon"),
main = "Gender Distribution")
dev.off()
ggplot2
Saving plots created using the ggplot2
package is very
easy with the ggsave()
function. However, the plot must
first be stored in a variable; otherwise, the function will simply save
the last plot displayed. The function guesses the type of graphics from
the file extension.
We can specify the resolution we want with arguments width and height. We can also specify the full path of the file we want to save if we don’t want to save it in the current directory.
boxplot_syst_bp_by_gender <-
ggplot(data = merged_nhanes, aes(x = factor(RIAGENDR, labels = c("Male", "Female")), y = BPXSY1, fill = factor(RIAGENDR))) +
geom_boxplot() +
scale_fill_manual(values = c("salmon", "skyblue")) +
labs(title = "Box plot of systolic BP by gender",
x = "Gender",
y = "Systolic BP (1st reading) [mmHg]") +
theme_minimal() +
theme(legend.position = "none")
ggsave(filename = "boxplot_syst_bp_by_gender.png",
plot = boxplot_syst_bp_by_gender,
bg = 'white',
width = 10,
height = 6,
units = c("in"),
path = plotting_directory)
✏️ Exercise on saving plots n°1: Save the three box plots from the plotting exercise using base R n°1 as one JPEG image.
✏️ Exercise on saving plots n°2: Save the two bar plots from the
plotting exercise using ggplot2
as one PNG image.
☑️ Exercise on plotting with base R n°1
par(mfrow=c(1,3))
boxplot(merged_nhanes$BPXSY1 ~ merged_nhanes$RIAGENDR,
names = c("Male", "Female"),
xlab = "Gender",
ylab = "Systolic blood pressure (1st reading) [mmHg]",
main = "Boxplot of BP by gender",
col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY2 ~ merged_nhanes$RIAGENDR,
names = c("Male", "Female"),
xlab = "Gender",
ylab = "Systolic blood pressure (2nd reading) [mmHg]",
main = "Boxplot of BP by gender",
col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY3 ~ merged_nhanes$RIAGENDR,
names = c("Male", "Female"),
xlab = "Gender",
ylab = "Systolic blood pressure (3rd reading) [mmHg]",
main = "Boxplot of BP by gender",
col = c("salmon", "skyblue"))
par(mfrow=c(1,1))
☑️ Exercise on plotting with base R n°2
# We identify the different smoking status
table(merged_nhanes$SMQ040)
##
## Every day Not at all Refused Some days
## 878 1210 2 187
par(mfrow=c(3,1))
hist(merged_nhanes[merged_nhanes$SMQ040=="Not at all",]$BPXSY1,
breaks = 20,
freq = TRUE,
col = "darkgreen",
xlab = "Systolic blood pressure (1st reading) [mmHg]",
main = "Histogram of systolic BP: not smoking")
hist(merged_nhanes[merged_nhanes$SMQ040=="Some days",]$BPXSY1,
breaks = 20,
freq = TRUE,
col = "orange2",
xlab = "Systolic blood pressure (1st reading) [mmHg]",
main = "Histogram of systolic BP: smoking every other day")
hist(merged_nhanes[merged_nhanes$SMQ040=="Every day",]$BPXSY1,
breaks = 20,
freq = TRUE,
col = "coral3",
xlab = "Systolic blood pressure (1st reading) [mmHg]",
main = "Histogram of systolic BP: smoking every day")
par(mfrow=c(1,1))
# In case you were wondering why the BP is higher in the group of non-smokers
merged_nhanes$age_group <- cut(merged_nhanes$RIDAGEYR, breaks = seq(10, 80, 10))
table(merged_nhanes[merged_nhanes$SMQ040!="Refused",]$SMQ040, merged_nhanes[merged_nhanes$SMQ040!="Refused",]$age_group)
##
## (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
## Every day 16 165 187 179 172 111 48
## Not at all 3 84 152 148 228 283 312
## Some days 5 60 35 30 26 24 7
☑️ Exercise on plotting with ggplot2
# We define the age groups
age_groups <- cut(x = merged_nhanes$RIDAGEYR, breaks = seq(10, 80, 10))
head(age_groups)
## [1] (20,30] (10,20] (40,50] (10,20] (20,30] (10,20]
## Levels: (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
# We calculate the median BMI by age group
(med_bmi <- aggregate(BMXBMI ~ age_groups, data = merged_nhanes, FUN = function(x) median(x, na.rm = TRUE)))
ggplot(med_bmi, aes(x = age_groups, y = BMXBMI)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Bar plot of median BMI by age group",
x = "Age group [years]",
y = "Median BMI") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# We calculate the median BMI by age group and by gender
(med_bmi <- aggregate(BMXBMI ~ age_groups + RIAGENDR, data = merged_nhanes, FUN = function(x) median(x, na.rm = TRUE)))
med_bmi_fem <- med_bmi[med_bmi$RIAGENDR == "Female",]
med_bmi_male <- med_bmi[med_bmi$RIAGENDR == "Male",]
plot1 <- ggplot(med_bmi_fem, aes(x = age_groups, y = BMXBMI)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Median BMI, gender = female",
x = "Age group [years]",
y = "Median BMI") +
theme_minimal() +
ylim(c(0,30)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot2 <- ggplot(med_bmi_male, aes(x = age_groups, y = BMXBMI)) +
geom_bar(stat = "identity", fill = "salmon") +
labs(title = "Median BMI, gender = male",
x = "Age group [years]",
y = "Median BMI") +
theme_minimal() +
ylim(c(0,30)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(gridExtra)
grid.arrange(plot1, plot2, ncol=2)
☑️ Exercise on saving plots n°1
plotting_directory <- "/home/claire/Documents/GitHub/rforphysicians/docs/plots/" # your plotting directory here
jpeg(file = paste0(plotting_directory,"boxplots_bp_by_gender.jpeg"),
width = 10,
height = 6,
units = "in",
res=100)
par(mfrow=c(1,3))
boxplot(merged_nhanes$BPXSY1 ~ merged_nhanes$RIAGENDR,
names = c("Male", "Female"),
xlab = "Gender",
ylab = "Systolic blood pressure (1st reading) [mmHg]",
main = "Boxplot of BP by gender",
col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY2 ~ merged_nhanes$RIAGENDR,
names = c("Male", "Female"),
xlab = "Gender",
ylab = "Systolic blood pressure (2nd reading) [mmHg]",
main = "Boxplot of BP by gender",
col = c("salmon", "skyblue"))
boxplot(merged_nhanes$BPXSY3 ~ merged_nhanes$RIAGENDR,
names = c("Male", "Female"),
xlab = "Gender",
ylab = "Systolic blood pressure (3rd reading) [mmHg]",
main = "Boxplot of BP by gender",
col = c("salmon", "skyblue"))
dev.off()
par(mfrow=c(1,1))
☑️ Exercise on saving plots n°2
plotting_directory <- "/home/claire/Documents/GitHub/rforphysicians/docs/plots/" # your plotting directory here
bar_plot_med_bmi_by_gender <- grid.arrange(plot1, plot2, ncol=2)
ggsave(filename = "bar_plot_med_bmi_by_gender.png",
plot = bar_plot_med_bmi_by_gender,
bg = 'white',
width = 10,
height = 6,
units = c("in"),
path = plotting_directory)