Author | |
---|---|
Name | Claire Descombes |
Affiliation | Universitätsklinik für Neurochirurgie, Inselspital Bern |
Degree | MSc Statistics and Data Science, University of Bern |
Contact | claire.descombes@insel.ch |
The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.
R is a free and open source statistical computing and graphics software. RStudio is a user-friendly environment for R, designed to facilitate its accessibility. R can technically be used without RStudio (although I wouldn’t advise it), but the reverse is not possible. To download both, follow the links below:
✏️ Download both softwares.
Once you have downloaded both programs and opened RStudio, you will be presented with a window similar to the one shown in the following figure.
In the Console tab, we first see information about the version of R we are using and some basic commands to try out. At the end of these descriptions, we can type our R code, press Enter and see the result below the code line.
2+2
## [1] 4
The help()
function and ?
help operator in
R provide access to the documentation pages for R functions, data sets,
and other objects, both for packages in the standard R distribution and
for contributed packages.
You can access help directly from the console or via the Help tab in the bottom right-hand corner.
help(c)
# or equivalently
? c
However, when we run our code directly in the console, it isn’t saved for being reproduced further. If we need (and we usually do) to write a reproducible code to solve a specific task, we have to record and regularly save it in a script file rather than in the console.
To start recording a script, click File – New File – R Script. This will open a text editor in the top-left corner of the RStudio interface (above the Console tab, see following figure).
💡 When your code starts to get long or complex, consider breaking it
down into separate scripts with clear and specific purposes — for
example: 1_data_import.R
, 2_data_cleaning.R
,
3_survival_analysis.R
, 4_qol_analysis.R
,
etc.
✏️ Exercise 1: Create your own script. Feel free to take notes directly in it. You’ll use this script as a working document to complete various small tasks and exercises.
☑️ All exercises have an example solution at the end of the chapter.
In R, everything is an object. This means that every piece of data you work with, from a single number to a complex dataset, is represented as an object with specific properties and behaviours. An object has attributes like class (data type) and dimensions.
Variables act as labels for objects. They are essentially pointers to the actual object stored in memory and appear in the Environment tab in RStudio.
Here’s an example to clarify the difference between variables and objects.
# We create an object (here: a vector) named 'vec' and assign a sequence of numbers to it.
vec <- 1:10
# 'vec' is the variable. The sequence of numbers (1, 2, 3, ..., 10) is the object.
💡 To assign values to an object, use the <-
or
=
symbols.
Before diving into data types and structures, it’s helpful to know how to inspect objects in R. Several built-in functions can help you understand the structure and content of an object.
Let’s define a simple data frame (more details about data frames below) to demonstrate the purpose of those functions.
# Example data frame
df <- data.frame(
ID = 1:5,
Name = c("Anna", "Ben", "Carla", "David", "Eva"),
Age = c(23, 31, 29, 40, 35)
)
Now let’s inspect thus object using a few useful functions.
# General object inspection
typeof(df) # Returns the internal storage type of the object
## [1] "list"
str(df) # Gives a compact, human-readable summary of the object's structure
## 'data.frame': 5 obs. of 3 variables:
## $ ID : int 1 2 3 4 5
## $ Name: chr "Anna" "Ben" "Carla" "David" ...
## $ Age : num 23 31 29 40 35
attributes(df) # Lists the object's attributes (e.g., names, dimensions, class)
## $names
## [1] "ID" "Name" "Age"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5
str(attributes(df)) # Displays the structure of the attributes
## List of 3
## $ names : chr [1:3] "ID" "Name" "Age"
## $ class : chr "data.frame"
## $ row.names: int [1:5] 1 2 3 4 5
class(df) # Returns the class of the object (e.g., data.frame)
## [1] "data.frame"
# Functions especially useful for matrices or data frames
nrow(df) # Number of rows
## [1] 5
ncol(df) # Number of columns
## [1] 3
dim(df) # Dimensions (rows, columns)
## [1] 5 3
colnames(df) # Column names
## [1] "ID" "Name" "Age"
rownames(df) # Row names
## [1] "1" "2" "3" "4" "5"
str(colnames(df)) # Structure of the column names (e.g., character vector)
## chr [1:3] "ID" "Name" "Age"
str(rownames(df)) # Structure of the row names (e.g., character vector)
## chr [1:5] "1" "2" "3" "4" "5"
💡 The function str()
provides a compact view of the
internal structure of an R object, helping you understand its components
and data types quickly.
In R, data types define the kind of information a variable can hold. Here are some of the most common data types:
Represents real numbers (e.g., 3.14, -2.5, 0).
typeof(3.14)
## [1] "double"
str(3.14)
## num 3.14
Represents whole numbers (e.g., 2L, -5L). The “L” suffix indicates an integer.
typeof(2L)
## [1] "integer"
str(2L)
## int 2
Represents Boolean values (TRUE or FALSE).
typeof(TRUE)
## [1] "logical"
str(TRUE)
## logi TRUE
# Example of a logical operation
values <- 1:10
above_five <- (values > 5)
above_five
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
Represents text or strings (e.g., “hello”, “world”).
typeof("hello")
## [1] "character"
str("hello")
## chr "hello"
Represents complex numbers (e.g., 1 + 2i).
typeof(1 + 2i)
## [1] "complex"
str(1 + 2i)
## cplx 1+2i
In some cases the components of a vector may not be known. When an
element or value is “not available” or a “missing value” in the
statistical sense, a place within a vector may be reserved for it by
assigning it the special value NA
. In general any operation
on an NA becomes an NA.
z <- c(1:3,NA)
print(z)
## [1] 1 2 3 NA
is.na(z)
## [1] FALSE FALSE FALSE TRUE
There is a second kind of “missing” values which are produced by
numerical computation, the so-called Not a Number, NaN
,
values.
0/0
## [1] NaN
Inf - Inf
## [1] NaN
Objects are the entities that R operates on. These can be:
c()
function.vec1 <- c(1,2,3)
str(vec1)
## num [1:3] 1 2 3
# Alternative ways of creating vectors:
vec2 <- 1:3 # Sequence of integers
vec3 <- seq(1, 3, by=1) # More general sequence
Vector elements can be accessed using []
brackets.
# Accessing elements of the vector by index (R uses 1-based indexing)
vec1[1] # First element
## [1] 1
vec2[c(2, 3)] # Elements at indices 2 and 3
## [1] 2 3
vec3[c(-2)] # All elements except for the element at index 2
## [1] 1 3
matrix()
function.(mat <- matrix(c(1,2,3,4), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
str(mat)
## num [1:2, 1:2] 1 2 3 4
💡 By enclosing the assignment in parentheses ()
, you
not only create the object but also automatically print its value to the
console — a useful shortcut. This is equivalent to writing
print(object)
or simply typing the object name (e.g.,
object
), but it saves you an extra line of code.
(array <- array(1:8, c(2,4,2)))
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
str(array)
## int [1:2, 1:4, 1:2] 1 2 3 4 5 6 7 8 1 2 ...
(list <- list(numb = 10:15, char = 'hello'))
## $numb
## [1] 10 11 12 13 14 15
##
## $char
## [1] "hello"
str(list)
## List of 2
## $ numb: int [1:6] 10 11 12 13 14 15
## $ char: chr "hello"
(fac <- factor(c("single", "married")))
## [1] single married
## Levels: married single
str(fac)
## Factor w/ 2 levels "married","single": 2 1
(d <- data.frame(id = 1:5,
val = c(4,5,2,6,5),
group = sample(c("exp","control"), size = 5, replace = TRUE)))
str(d)
## 'data.frame': 5 obs. of 3 variables:
## $ id : int 1 2 3 4 5
## $ val : num 4 5 2 6 5
## $ group: chr "exp" "control" "exp" "exp" ...
frac <- function(numerator, denominator) {
result <- numerator / denominator
return(result)
}
frac(numerator = 6, denominator = 2) # Calling the function
## [1] 3
💡 A function takes inputs called arguments. These can be:
frac(numerator = 6, denominator = 2)
, orfrac(6, 2)
.I strongly recommend explicitly naming arguments, because it
increases clarity, prevents order-related mistakes, and gives you
flexibility in argument order:
frac(denominator = 2, numerator = 6)
works just as well. In
contrast, when arguments are unnamed, order matters strictly — and a
small mistake can completely change the result: frac(2, 6)
is interpreted as frac(numerator = 2, denominator = 6)
!
✏️ Exercise 2: Given a vector of ages
(ages <- c(35, 45, 60, 15, 50, 8)
), determine which
patients are eligible for a treatment (age above 18). Return a Boolean
vector indicating whether each patient meets the age criteria.
✏️ Exercise 3: Create a data frame with 10 rows and the columns
id
, blood_pressure
, and group
. –
id
: integers from 1 to 10 – blood_pressure
:
random values from a normal distribution with mean 123 and standard
deviation 8 – group
: a factor with levels “drug 1”, “drug
2”, and “obs arm” (you decide how to assign them, e.g. by using the
function sample()
) After creating that data frame, add a
new column stand_score
where you calculate a standardized
score for each blood_pressure value. The standardized score is similar
to a z-score but is calculated based on the mean and standard deviation
of the blood_pressure
values in the dataset (standardized
score = (x−μ)/σ)).
💡 Hint: Use the function rnorm()
to simulate normal
values. Use the function scale()
to centre and scale a
vector, or alternatively the functions mean()
and
sd()
to compute mean and standard deviation of a vector.
You can use help()
to learn more about how these functions
work.
✏️ Exercise 4: Write a function sum_squared
that takes
two integers and returns the sum of their squared values.
# Example
sum_squared(int1 = 2, int2 = 3)
# The output should be:
13
Packages are collections of R functions, data, and compiled code in a well-defined format, created to add specific functionality.
There is a set of standard (or base) packages which are considered part of the R source code and automatically available as part of your R installation. Base packages contain the basic functions that allow R to work, and enable standard statistical and graphical functions on datasets.
On top of that, there are 10,000+ user contributed packages. You can install packages using the install.packages() function.
# To install a package, you can use the function.
install.packages("dplyr")
# Alternatively, you can click on 'Packages' in the bottom right pane, then click on 'Install' and type in the name of the package you want to install.
# To load a package you've already install, just load it using the library() function.
library(dplyr)
✏️ Exercise 5: install and load the tidyverse
package.
We will need it for Chapter 2.
Please not that those are only examples, there are always many ways to solve the same task!
☑️ Exercise 1:
☑️ Exercise 2:
ages <- c(35, 45, 60, 15, 50, 8)
eligible <- ages >= 18
eligible
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
☑️ Exercise 3:
# Create the data frame
(df <- data.frame(id = 1:10,
blood_pressure = rnorm(10, mean = 123, sd = 8),
group = factor(sample(c("drug 1", "drug 2", "obs arm"), 10, replace = TRUE))))
# Compute standardized scores for blood pressure
df$stand_score <- scale(df$blood_pressure)
# Alternatively, compute by hand (standardized score = (x - μ) / σ)
df$stand_score_2 <- (df$blood_pressure - mean(df$blood_pressure))/sd(df$blood_pressure)
# View the data frame with standardized scores
df
☑️ Exercise 4:
sum_squared <- function(int1, int2){
sum_of_squares <- int1^2 + int2^2
return(sum_of_squares)
}
sum_squared(int1 = 3, int2 = 5)
## [1] 34
☑️ Exercise 5:
# To install the package, run the line:
install.packages("tidyverse")
# To load the package once installed, run the line:
library(tidyverse)
1.4 Comments
Comments can be added to the code in a script using the hash symbol
#
.It is very, very important that you always comment every piece of your code, to make sure:
So, for scientific purposes, please comment your code!
Here’s an example of how I usually comment the scripts I use in my daily work:
💡 Use
----
after numbered headers in comments to make your code more navigable and readable in long scripts (this is a common R style convention).