Author | |
---|---|
Name | Claire Descombes |
Affiliation | Universitätsklinik für Neurochirurgie, Inselspital Bern |
Degree | MSc Statistics and Data Science, University of Bern |
Contact | claire.descombes@insel.ch |
The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.
To explore R’s statistical tools, we’ll continue working with the NHANES datasets. Specifically, we’ll use the merged data frame that combines the `demo`, `bpx`, `bmx` and `smq` data sets. If you still have it loaded from Chapters 2 and 3, you can use it directly. Otherwise, you can download it from the `data_sets` folder and import it into R.
To be able to use survival data, we will also need the NHANES data set complemented with mortality data. If you still have it loaded from Chapter 3, you can use it directly. Otherwise, you can download it from the `data_sets` folder and import it into R.
# Load the merged_nhanes CSV file
merged_nhanes <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/merged_nhanes.csv")
# Load the merged_nhanes_with_mort CSV file
merged_nhanes_with_mort <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/merged_nhanes_with_mort.csv")
This chapter will follow the structure outlined below. We will learn how to perform a selection of statistical tests and models in R, categorising them by their objective (e.g. comparing two means or analysing survival).
Section | Topic | Subtopics |
---|---|---|
4.1 Tests for comparing two groups | Tests comparing means or proportions between two groups | • Student’s t-test • Wilcoxon-Mann-Whitney test (Mann-Whitney U test) • Fisher’s exact test • McNemar test |
4.2 Tests for more than two groups | Tests for comparing multiple groups | • Kruskal-Wallis test • Friedman test • Pearson’s chi-square test |
4.3 Tests for distribution and normality | Tests for the distribution of data | • Lilliefors / Kolmogorov-Smirnov-Lilliefors test |
4.4 Tests for survival analysis | Tests for time-to-event data | • Log-rank test |
4.5 Correlation and association tests | Tests for relationships between variables | • Pearson correlation test • Spearman correlation test |
4.6 Predictive modelling and regression | Predictive models and regression techniques | • Generalized Linear Models (GLMs): linear regression, logistic regression, ordinal regression, Cox (proportional hazards) regression • Multivariable regression • Mixed Effects Models • Generalized Additive Models (GAMs) • Generalized Additive Mixed Models (GAMMs) |
Before we begin, here is an overview of the type of data required by each test/model. More tests and models are presented than will be discussed in the course, but I have included them for the sake of completeness. If you feel lost on that topic at any point in the chapter, just scroll up again to have a look at this table.
Data type | 1 sample | 2 paired samples | 2 unpaired samples | >2 paired samples | >2 unpaired samples | Continuous predictor |
---|---|---|---|---|---|---|
Binary | • Binomial test | • McNemar test | • Chi-square test • Fisher’s exact test | • Cochran’s Q test | • Chi-square test • Extensions of Fisher’s test | • Logistic regression |
Nominal | • Chi-square goodness-of-fit test | | • Chi-square test • Extensions of Fisher’s test | | • Chi-square test | • Multinomial regression |
Ordinal | • Wilcoxon signed-rank test • Sign test | • Sign test • Wilcoxon signed-rank test on differences | • Mann-Whitney U test (Wilcoxon rank-sum test) | • Friedman test | • Kruskal-Wallis test | • Ordinal regression |
Continuous | • One-sample t-test | • Paired t-test • Wilcoxon signed-rank test on differences | • Two-sample t-test | • Repeated measures ANOVA | • ANOVA | • Linear regression |
Time-to-event | • One-sample log-rank test | | • Log-rank test | | | • Cox regression • Weibull regression |
Before we start looking at many different statistical hypothesis tests, let us recall the basic structure they all share by exploring in more depth the Student’s t-test for two independent continuous samples.
Generally | Student’s t-test for two independent samples |
---|---|
A test aims to assess a claim we make about some parameter (a correlation, a difference in means, etc.) of the distribution of the population of interest. | Working with two patient cohorts, the smokers and the non-smokers, one could ask whether they share the same mean of a continuous variable, e.g. the BMI. |
The data itself is expected to fulfil some assumptions. | As the BMI data is continuous, the cohorts are large (> 100 patients each, of sizes n_1 and n_2) and assumed to be independent, and there is no reason to assume they would differ in the variance of the BMI, an unpaired Student’s t-test appears to be the right choice. |
The ‘status quo’/no-difference formulation of the question we are asking is called the null hypothesis. | The null hypothesis, in our example, is that our two cohorts (the smokers and the non-smokers) share the same mean BMI. |
The hypothesis we are usually interested in proving, the one showing a difference between two quantities, is called the alternative hypothesis. | The alternative hypothesis, in our example, is that our two cohorts (the smokers and the non-smokers) do not share the same mean BMI. |
Assuming the null hypothesis is true, a specific quantity, called the test statistic, is expected to follow a given distribution. | If the null hypothesis is true, then the t statistic t = \frac{\bar{X}_1-\bar{X}_2}{s_p \cdot \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} (where s_p is the pooled standard deviation) follows a Student’s t-distribution with n_1+n_2-2 degrees of freedom. |
For any quantity whose distribution is (assumed to be) known, it is possible to compute its probability of falling within a certain range; for a given significance level \alpha between 0 and 1, one can determine an interval in which the statistic falls with probability 1-\alpha. | Letting \alpha be 0.05, as is common in medicine, we can determine an interval in which our t statistic lies with a probability of 95% (it can be computed in R or using a table). |
If the computed statistic falls inside this 1-\alpha interval (p > \alpha), then we cannot reject the null hypothesis, and have to look for further evidence to prove our claim. | If the t statistic falls into the range we computed, then we cannot say that we saw a significant difference in the means. |
If the computed statistic falls outside this 1-\alpha interval (p \leq \alpha), then we can reject the null hypothesis, and say that the claim we made about the data is statistically significant at the 1-\alpha confidence level. | If the t statistic does not fall into the range we computed, then we can say that we observe a statistically significant difference in the means, with a confidence of 95%. |
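The interval mentioned above is easy to compute in R. Here is a minimal sketch using `qt`, the quantile function of the t-distribution, with the sample sizes from the BMI example later in this chapter:

```r
# Two-sided critical values of the t-distribution at alpha = 0.05
alpha <- 0.05
n1 <- 3239; n2 <- 3212   # sample sizes from the BMI example below
qt(c(alpha / 2, 1 - alpha / 2), df = n1 + n2 - 2)
# If the observed t statistic falls outside this interval,
# the null hypothesis is rejected at the 5% level.
```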
A few important comments:
1. Null hypothesis: the two groups share the same mean.
2. Type of data: continuous.
3. Requirements: independent, (approximately) normally distributed samples (or large sample sizes), and equal variances for the classical (pooled) t-test.
💡 Independent samples are randomly selected so that their observations are not dependent on the values of other observations.
4. R function
`t.test(x, ...)`: performs one- and two-sample t-tests on vectors of data.
5. Important arguments
t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
- `x, y`: the vector(s) that contain the data (only one vector in the case of a one-sample t-test, two vectors otherwise).
- `alternative`: a character string specifying the alternative hypothesis; must be one of “two.sided” (default), “greater” or “less”.
- `mu`: a number indicating the true value of the mean (or of the difference in means if you are performing a two-sample test).
- `paired`: a logical indicating whether you want a paired t-test.
- `var.equal`: a logical indicating whether to treat the two variances as being equal. If TRUE, the pooled variance is used to estimate the variance; otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.
- `conf.level`: confidence level of the interval (default: 0.95).

💡 “Paired” means there is an obvious and meaningful one-to-one correspondence between the data in the first set and the data in the second set, e.g. blood pressure in a set of patients before and after administering a medicine.
6. Example

Let us test if the mean BMI (`BMXBMI`) is significantly different between men and women (`RIAGENDR`).
# We quickly check the distribution of the BMI using a Q-Q-plot (even though,
# with such a large sample, this isn't required)
qqnorm(y = merged_nhanes$BMXBMI)
# We create two vectors, with the BMIs of women and men
bmi_women <- merged_nhanes$BMXBMI[merged_nhanes$RIAGENDR == "Female"]
bmi_men <- merged_nhanes$BMXBMI[merged_nhanes$RIAGENDR == "Male"]
# We start by checking how many non-NA entries we have, then compute the means
sum(!is.na(bmi_women));sum(!is.na(bmi_men))
## [1] 3239
## [1] 3212
mean(bmi_women, na.rm = TRUE);mean(bmi_men, na.rm = TRUE)
## [1] 28.35542
## [1] 27.34091
qqplot(x = bmi_women, y = bmi_men)
# We compare their means using a t-test
t.test(x = bmi_women, y = bmi_men, alternative = 'two.sided', var.equal = TRUE, paired = FALSE)
##
## Two Sample t-test
##
## data: bmi_women and bmi_men
## t = 5.8288, df = 6449, p-value = 5.854e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.6733092 1.3557093
## sample estimates:
## mean of x mean of y
## 28.35542 27.34091
- The `t` value is the statistic of the t-test.
- `df` are the degrees of freedom (a characteristic of the underlying distribution of the statistic); they depend on the test used (one- or two-sample, paired or unpaired). In this case it is the sum of the two sample sizes minus 2 (3239 + 3212 - 2 = 6449).
- The `p-value` is the probability of getting such a `t` value under the null hypothesis. It is very small (5.854e-09), so at an \alpha-level of 0.05 we say that there is a statistically significant difference between the means of the two groups.

💡 `qqnorm` is a function that produces a normal Q-Q plot of the values in `y`. `qqplot` produces a Q-Q plot of two datasets.
The z-test is similar to the t-test (continuous, independent and normally distributed variable(s)), and its null hypothesis is also that the means of two populations are equal, but it requires the standard deviation of the population(s) to be known.

You should favour a z-test over a t-test if the standard deviation of the population is known and the sample size is at least 30.

There is no z-test function in base R, but one is available in packages such as BSDA (`z.test(x, ...)`).
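For illustration only, here is a minimal sketch of a two-sample z-test, assuming the BSDA package is installed. The population standard deviations `sigma.x` and `sigma.y` must be supplied; the values used here are purely hypothetical.

```r
library(BSDA)

# Hypothetical known population standard deviations -- in practice
# these must come from outside the sample.
z.test(x = na.omit(bmi_women), y = na.omit(bmi_men),
       sigma.x = 7, sigma.y = 6,
       alternative = "two.sided")
```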
Non-parametric (= makes minimal assumptions about the underlying distribution of the data being studied) “alternative” to the t-test. Note that we do not test exactly the same thing: the t-test compares means, the Wilcoxon test compares distributions.
💡 If working on paired data, the test is called Wilcoxon signed-rank test, and on unpaired data it is called Wilcoxon rank-sum test (also called Mann–Whitney U test, Mann–Whitney–Wilcoxon test, or Wilcoxon–Mann–Whitney test).
1. Null hypothesis: the two samples come from the same distribution (no location shift between the groups).
2. Type of data: ordinal or continuous.
3. Requirements: independent observations (for the rank-sum test); no assumption of normality.
4. R function
`wilcox.test(x, ...)`: performs one- and two-sample Wilcoxon tests on vectors of data; the latter is also known as the ‘Mann-Whitney’ test.
5. Important arguments
wilcox.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, exact = NULL, correct = TRUE,
conf.int = FALSE, conf.level = 0.95,
tol.root = 1e-4, digits.rank = Inf, ...)
- `x, y`: the vector(s) that contain the data (only one vector in the case of a one-sample Wilcoxon test, two vectors otherwise).
- `alternative`: a character string specifying the alternative hypothesis; must be one of “two.sided” (default), “greater” or “less”.
- `mu`: a number specifying an optional parameter used to form the null hypothesis. Set `mu = 0` (default) to test if the samples have the same distribution.
- `paired`: a logical indicating whether you want a paired test. If only `x` is given, or if both `x` and `y` are given and `paired` is TRUE, a Wilcoxon signed-rank test of the null hypothesis that the distribution of `x` (in the one-sample case) or of `x - y` (in the paired two-sample case) is symmetric about `mu` is performed. Otherwise, if both `x` and `y` are given and `paired` is FALSE, a Wilcoxon rank-sum test (equivalent to the Mann-Whitney test) is carried out. In this case, the null hypothesis is that the distributions of `x` and `y` differ by a location shift of `mu`, and the alternative is that they differ by some other location shift.
- `conf.level`: confidence level of the interval (default: 0.95).

6. Example

Let us test if the distribution of the BMI (`BMXBMI`) is significantly different between men and women (`RIAGENDR`).
# We create two vectors, with the BMIs of women and men
bmi_women <- merged_nhanes$BMXBMI[merged_nhanes$RIAGENDR == "Female"]
bmi_men <- merged_nhanes$BMXBMI[merged_nhanes$RIAGENDR == "Male"]
# We compare their distributions using a Wilcoxon test
wilcox.test(x = bmi_women, y = bmi_men, alternative = 'two.sided', mu = 0, paired = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: bmi_women and bmi_men
## W = 5464414, p-value = 0.0004466
## alternative hypothesis: true location shift is not equal to 0
- The `W` value is the statistic of the Wilcoxon test (also called the `U` statistic).
- The `p-value` is the probability of getting such a `W` value under the null hypothesis. It is very small (0.0004466), so at an \alpha-level of 0.05 we say that there is a statistically significant difference between the distributions of the two groups.

Fisher’s exact test focuses on categorical variables and small sample sizes.
It tests whether two categorical variables are associated. Unlike the chi-square test, its p-value is exact rather than based on a large-sample approximation, which makes it particularly suitable for small samples and sparse contingency tables.
1. Null hypothesis: the two categorical variables are independent (no association between the rows and columns of the contingency table).
2. Type of data: categorical (classically binary, in a 2\times2 contingency table).
3. Requirements: independent observations; particularly suitable for small samples, where the chi-square approximation is unreliable.
4. R function
`fisher.test(x, ...)`: tests the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.
5. Important arguments
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
hybridPars = c(expect = 5, percent = 80, Emin = 1),
control = list(), or = 1, alternative = "two.sided",
conf.int = TRUE, conf.level = 0.95,
simulate.p.value = FALSE, B = 2000)
- `x`: either a two-dimensional contingency table in matrix form, or a factor object.
- `y`: a factor object; ignored if `x` is a matrix.
- `alternative`: a character string specifying the alternative hypothesis; must be one of “two.sided” (default), “greater” or “less”. Only used in the 2\times2 case.
- `conf.int`: logical indicating if a confidence interval for the odds ratio in a 2\times2 table should be computed (and returned).
- `conf.level`: confidence level for the returned confidence interval. Only used in the 2\times2 case and if `conf.int = TRUE`.

6. Example
Let us test whether smoking status (`SMQ020`, having smoked at least 100 cigarettes in one’s life) is associated with gender (`RIAGENDR`).
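Here is a sketch of how this can be done, assuming `SMQ020` is coded as “Yes”/“No” in the merged data set (the coding is an assumption; adapt the filter to your data):

```r
# Keep only the Yes/No answers so that we get a clean 2x2 table
smokers <- merged_nhanes[merged_nhanes$SMQ020 %in% c("Yes", "No"), ]

# factor() drops unused levels before building the contingency table
tab <- table(factor(smokers$RIAGENDR), factor(smokers$SMQ020))
tab

# Fisher's exact test of independence between gender and smoking status
fisher.test(tab)
```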
The McNemar test applies to paired categorical data, e.g. the same patients classified before and after a treatment.
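A minimal sketch with invented paired data (the vectors below are purely hypothetical): the same eight patients are classified before and after an intervention, and `mcnemar.test` checks whether the proportion of “Yes” answers changed.

```r
# Hypothetical paired classifications of the same 8 patients
before <- factor(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"))
after  <- factor(c("No",  "Yes", "No", "Yes", "No",  "No", "No", "Yes"))

# McNemar test on the 2x2 table of paired outcomes
mcnemar.test(table(before, after))
```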
The Kruskal-Wallis test generalizes the Wilcoxon-Mann-Whitney test to more than two groups.
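As a sketch, one could compare the distribution of the BMI across the education levels of the NHANES demographics data; the column name `DMDEDUC2` is an assumption about the merged data set.

```r
# Kruskal-Wallis test: does the BMI distribution differ across
# education levels? (DMDEDUC2 assumed present in merged_nhanes)
kruskal.test(BMXBMI ~ DMDEDUC2, data = merged_nhanes)
```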
The Friedman test generalizes paired tests (e.g. the Wilcoxon signed-rank test) to more than two related groups.
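A minimal sketch with invented data: pain scores of five patients, each measured under three treatments (an unreplicated complete block design, as `friedman.test` requires).

```r
# Hypothetical repeated measurements: 5 patients x 3 treatments
pain <- data.frame(
  score     = c(3, 5, 4,  2, 4, 3,  4, 6, 5,  3, 3, 2,  5, 7, 6),
  treatment = factor(rep(c("A", "B", "C"), times = 5)),
  patient   = factor(rep(1:5, each = 3))
)

# Friedman test: treatments as groups, patients as blocks
friedman.test(score ~ treatment | patient, data = pain)
```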
Pearson’s chi-square test complements Fisher’s exact test; it relies on a large-sample approximation and is therefore better suited for larger samples.
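As a sketch, it can be run on the same kind of contingency table as Fisher’s exact test (again assuming `SMQ020` is present in the merged data):

```r
# Pearson's chi-square test of independence between gender and smoking
chisq.test(table(merged_nhanes$RIAGENDR, merged_nhanes$SMQ020))
```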
The Lilliefors (Kolmogorov-Smirnov-Lilliefors) test checks for deviations from normality and helps decide when to use parametric vs. non-parametric tests.
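A minimal sketch, assuming the nortest package (which provides the Lilliefors test) is installed; `shapiro.test` in base R is an alternative for samples of at most 5000 observations.

```r
library(nortest)

# Lilliefors (Kolmogorov-Smirnov) test for normality of the BMI values
lillie.test(na.omit(merged_nhanes$BMXBMI))
```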
The log-rank test analyses time-to-event data by comparing the survival curves (e.g. Kaplan-Meier curves) of two or more groups.
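A sketch with the survival package: the column names `permth_int` (follow-up time in months) and `mortstat` (0 = alive, 1 = deceased) are assumptions about the mortality-augmented data set; adapt them to your own column names.

```r
library(survival)

# Kaplan-Meier curves by gender (column names are assumptions)
km <- survfit(Surv(permth_int, mortstat) ~ RIAGENDR,
              data = merged_nhanes_with_mort)
plot(km, col = c("red", "blue"), xlab = "Months", ylab = "Survival")

# Log-rank test: do the survival curves differ between men and women?
survdiff(Surv(permth_int, mortstat) ~ RIAGENDR,
         data = merged_nhanes_with_mort)
```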
Pearson’s correlation test assesses the linear relationship between two continuous, normally distributed variables.
Spearman’s correlation test is its non-parametric alternative, suitable for monotonic relationships.
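A sketch of both tests, correlating BMI with systolic blood pressure; the column name `BPXSY1` is an assumption about the merged data.

```r
# Pearson: linear relationship between BMI and systolic blood pressure
cor.test(merged_nhanes$BMXBMI, merged_nhanes$BPXSY1, method = "pearson")

# Spearman: monotonic relationship, no normality assumption
cor.test(merged_nhanes$BMXBMI, merged_nhanes$BPXSY1, method = "spearman")
```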
Purpose: GLMs are an extension of linear models that allow for non-normal distributions of the response variable (e.g., binary, count, or categorical outcomes). They offer more flexibility than traditional linear regression by using different link functions and error distributions.
Key Features:
Linear relationship: GLMs assume a linear relationship between the predictors and the transformed response variable.
Link function: links the linear predictor to the mean of the outcome distribution. Common link functions include the identity (linear regression), the logit (logistic regression) and the log (Poisson regression).
Error distributions: GLMs can be applied with various error distributions:
* Normal for continuous data (linear regression)
* Binomial for binary data (logistic regression)
* Poisson for count data
Assumptions
Common Applications
Purpose: used to model the relationship between a continuous dependent variable and one or more independent variables.
Assumptions: linearity, normality of residuals, homoscedasticity, independence.
Example Application: predicting the price of a house based on square footage, number of rooms, etc.
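A minimal sketch on the NHANES data, assuming the age column is `RIDAGEYR`:

```r
# Simple linear regression: BMI as a function of age
fit_lm <- lm(BMXBMI ~ RIDAGEYR, data = merged_nhanes)
summary(fit_lm)
```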
Purpose: used when the dependent variable is binary (e.g. yes/no, success/failure).
Assumptions: linear relationship between the log-odds of the outcome and the predictors.
Example Application: predicting the likelihood of a disease based on age, gender, and other factors.
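A sketch on the NHANES data, using a derived binary outcome (obesity, BMI >= 30) and again assuming the age column is `RIDAGEYR`:

```r
# Derived binary outcome: obese yes/no
merged_nhanes$obese <- as.integer(merged_nhanes$BMXBMI >= 30)

# Logistic regression: odds of obesity as a function of age and gender
fit_logit <- glm(obese ~ RIDAGEYR + RIAGENDR,
                 data = merged_nhanes, family = binomial)
summary(fit_logit)

# Exponentiated coefficients are odds ratios
exp(coef(fit_logit))
```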
Purpose: used for survival analysis, particularly when studying the time to an event (e.g. time to death or relapse).
Assumptions: proportional hazards, meaning the effect of a predictor on the hazard rate is constant over time.
Example Application: analysing the impact of age, treatment type, and other covariates on patient survival times.
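A sketch with the survival package on the mortality-augmented data; as above, `permth_int`, `mortstat` and `RIDAGEYR` are assumptions about the column names.

```r
library(survival)

# Cox proportional hazards model: mortality as a function of age and gender
fit_cox <- coxph(Surv(permth_int, mortstat) ~ RIDAGEYR + RIAGENDR,
                 data = merged_nhanes_with_mort)
summary(fit_cox)
```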
Purpose: an extension of linear or logistic regression with more than one predictor variable.
Assumptions: similar to linear and logistic regression, but more complex due to the multiple predictors.
Example Application: predicting a health outcome (e.g. cholesterol levels) based on multiple lifestyle factors (diet, exercise, genetics).
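A sketch extending the linear regression above to several predictors (`RIDAGEYR` and `SMQ020` are assumptions about the columns):

```r
# Multivariable linear regression: several predictors at once
fit_multi <- lm(BMXBMI ~ RIDAGEYR + RIAGENDR + SMQ020,
                data = merged_nhanes)
summary(fit_multi)
```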
Purpose: Mixed effects models allow for the inclusion of both fixed and random effects, providing flexibility for hierarchical or grouped data. They are especially useful when there is variation between groups or subjects.
Key Features
Fixed effects: the main predictors of interest (e.g., treatment, age), which are assumed to have the same effect across all groups.
Random effects: account for variability across groups or clusters (e.g., random intercepts for subjects, or random slopes for measurements over time).
Assumptions
Common Applications
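As a common application, here is a minimal sketch of a linear mixed effects model, assuming the lme4 package is installed and using its built-in `sleepstudy` data (reaction times of subjects measured over several days of sleep deprivation):

```r
library(lme4)

# Fixed effect: Days of sleep deprivation.
# Random effects: intercept and slope per Subject.
fit_mixed <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit_mixed)
```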
Purpose: GAMs extend GLMs by allowing for non-linear relationships between predictors and the outcome. This is useful when the relationship between the independent and dependent variables is not linear.
Key Features
Non-linear terms: Uses smooth functions (e.g., splines) for predictors, allowing for flexibility in modeling.
Additive structure: The model assumes that the total effect is an additive combination of linear and smooth non-linear terms.
Link function: Like GLMs, GAMs can use different link functions depending on the distribution of the outcome variable.
Common Applications: Modelling complex relationships in patient data where the effect of treatment or time may not be linear.
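A minimal sketch with the mgcv package (assumed installed), modelling a smooth, possibly non-linear effect of age on BMI (`RIDAGEYR` assumed present):

```r
library(mgcv)

# GAM: smooth effect of age on BMI
fit_gam <- gam(BMXBMI ~ s(RIDAGEYR), data = merged_nhanes)
summary(fit_gam)

# Visualise the fitted smooth term
plot(fit_gam)
```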
Purpose: GAMMs combine the flexibility of GAMs with random effects, useful for hierarchical or clustered data.
Key Features: Like GAMs, but with the inclusion of random effects to account for variability between groups.
Applications: Ideal for longitudinal studies or hierarchical data where both non-linear relationships and random effects are present.
Assumptions
Example Application: Analysing patient data where outcomes are influenced by both individual patient characteristics and random hospital-specific effects (e.g., variability between hospitals).
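A minimal sketch with mgcv’s `gamm`, again using lme4’s `sleepstudy` data for illustration: a smooth effect of days of sleep deprivation plus a random intercept per subject.

```r
library(mgcv)
library(lme4)   # only needed here for the sleepstudy example data

# GAMM: smooth fixed effect + random intercept per Subject
fit_gamm <- gamm(Reaction ~ s(Days, k = 5),
                 random = list(Subject = ~1),
                 data = sleepstudy)

# gamm returns both an lme and a gam object; summarise the gam part
summary(fit_gamm$gam)
```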