Author
Name	Claire Descombes
Affiliation	Universitätsklinik für Neurochirurgie, Inselspital Bern
Degree	MSc Statistics and Data Science, University of Bern
Contact	claire.descombes@insel.ch

The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.

1 User interface

R is a free and open source statistical computing and graphics software. RStudio is a user-friendly environment for R, designed to facilitate its accessibility. R can technically be used without RStudio (although I wouldn’t advise it), but the reverse is not possible. To download both, follow the links below:

✏️ Download both softwares.

Once you have downloaded both programs and opened RStudio, you will be presented with a window similar to the one shown in the following figure.

Left pane: Contains the Console, Terminal and Background Jobs tabs.
Top right pane: contains the Environment, History, Connections and Tutorial tabs.
Bottom right pane: contains the Files, Plots, Packages, Help, Viewer and Presentation tabs.

1.1 Console

In the Console tab, we first see information about the version of R we are using and some basic commands to try out. At the end of these descriptions, we can type our R code, press Enter and see the result below the code line.

2+2

## [1] 4

1.2 Help

The help() function and ? help operator in R provide access to the documentation pages for R functions, data sets, and other objects, both for packages in the standard R distribution and for contributed packages.

You can access help directly from the console or via the Help tab in the bottom right-hand corner.

help(c)
# or equivalently
? c

1.3 Script file

However, when we run our code directly in the console, it isn’t saved for being reproduced further. If we need (and we usually do) to write a reproducible code to solve a specific task, we have to record and regularly save it in a script file rather than in the console.

To start recording a script, click File – New File – R Script. This will open a text editor in the top-left corner of the RStudio interface (above the Console tab, see following figure).

💡 When your code starts to get long or complex, consider breaking it down into separate scripts with clear and specific purposes — for example: 1_data_import.R, 2_data_cleaning.R, 3_survival_analysis.R, 4_qol_analysis.R, etc.

1.4 Comments

Comments can be added to the code in a script using the hash symbol #.

# Here is a comment.

It is very, very important that you always comment every piece of your code, to make sure:

that you will still be able to understand what you have written after a few months/years.
to facilitate sharing: without comments, it will take someone much longer to understand your code.

So, for scientific purposes, please comment your code!
Here’s an example of how I usually comment the scripts I use in my daily work:

################################
# TOSCAN 2.0: Matching algorithm
################################

# 1) Script info-header --------------------------------------------------------

# Project:                  TOSCAN 2.0
# Author:                   Claire Descombes
# Contact:                  claire.descombes@insel.ch
# Date last modification:   07/05/2025
# Purpose:                  Match TOSCAN cohort to Swiss population using BFS data
# Environment:              R version 4.4.2
#                           RStudio 2024.09.1+394 "Cranberry Hibiscus" Release

# 2) Packages & environment ----------------------------------------------------

library(duckdb)      # Interface to DuckDB, an in-process SQL OLAP database engine for fast queries on large datasets
library(dplyr)       # Grammar of data manipulation for data frames (select, filter, mutate, etc.)
library(lubridate)   # Makes working with dates and times easier (e.g., extracting year, month, parsing dates)
library(arrow)       # Provides access to Apache Arrow tools, including reading/writing Parquet files
library(readxl)      # Imports Excel files (.xls and .xlsx) into R
library(progress)    # Adds progress bars to loops and long operations in the console
library(glue)        # Facilitates string interpolation, especially useful for building SQL queries dynamically

options(scipen = 999)  # Prevents scientific notation (e.g., 1e+05) when printing numbers

# 3) Data import ---------------------------------------------------------------

# Set working directory to source file location
setwd("C:/Users/I0343303/Documents/Forschung/TOSCAN2.0")

# Etc.

💡 Use ---- after numbered headers in comments to make your code more navigable and readable in long scripts (this is a common R style convention).

1.5 Exercises

✏️ Exercise 1: Create your own script. Feel free to take notes directly in it. You’ll use this script as a working document to complete various small tasks and exercises.

☑️ All exercises have an example solution at the end of the chapter.

2 Objects, data types, variables

In R, everything is an object. This means that every piece of data you work with, from a single number to a complex dataset, is represented as an object with specific properties and behaviours. An object has attributes like class (data type) and dimensions.

Variables act as labels for objects. They are essentially pointers to the actual object stored in memory and appear in the Environment tab in RStudio.

Here’s an example to clarify the difference between variables and objects.

# We create an object (here: a vector) named 'vec' and assign a sequence of numbers to it.
vec <- 1:10 

# 'vec' is the variable. The sequence of numbers (1, 2, 3, ..., 10) is the object.

💡 To assign values to an object, use the <- or = symbols.

2.1 Inspect an object

Before diving into data types and structures, it’s helpful to know how to inspect objects in R. Several built-in functions can help you understand the structure and content of an object.

Let’s define a simple data frame (more details about data frames below) to demonstrate the purpose of those functions.

# Example data frame
df <- data.frame(
  ID = 1:5,
  Name = c("Anna", "Ben", "Carla", "David", "Eva"),
  Age = c(23, 31, 29, 40, 35)
)

Now let’s inspect thus object using a few useful functions.

# General object inspection
typeof(df)          # Returns the internal storage type of the object

## [1] "list"

str(df)             # Gives a compact, human-readable summary of the object's structure

## 'data.frame':    5 obs. of  3 variables:
##  $ ID  : int  1 2 3 4 5
##  $ Name: chr  "Anna" "Ben" "Carla" "David" ...
##  $ Age : num  23 31 29 40 35

attributes(df)      # Lists the object's attributes (e.g., names, dimensions, class)

## $names
## [1] "ID"   "Name" "Age" 
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5

str(attributes(df)) # Displays the structure of the attributes

## List of 3
##  $ names    : chr [1:3] "ID" "Name" "Age"
##  $ class    : chr "data.frame"
##  $ row.names: int [1:5] 1 2 3 4 5

class(df)           # Returns the class of the object (e.g., data.frame)

## [1] "data.frame"

# Functions especially useful for matrices or data frames
nrow(df)          # Number of rows

## [1] 5

ncol(df)          # Number of columns

## [1] 3

dim(df)           # Dimensions (rows, columns)

## [1] 5 3

colnames(df)      # Column names

## [1] "ID"   "Name" "Age"

rownames(df)      # Row names

## [1] "1" "2" "3" "4" "5"

str(colnames(df)) # Structure of the column names (e.g., character vector)

##  chr [1:3] "ID" "Name" "Age"

str(rownames(df)) # Structure of the row names (e.g., character vector)

##  chr [1:5] "1" "2" "3" "4" "5"

💡 The function str() provides a compact view of the internal structure of an R object, helping you understand its components and data types quickly.

2.2 Data types

In R, data types define the kind of information a variable can hold. Here are some of the most common data types:

2.2.1 Numeric

Represents real numbers (e.g., 3.14, -2.5, 0).

typeof(3.14)

## [1] "double"

str(3.14)

##  num 3.14

2.2.2 Integer

Represents whole numbers (e.g., 2L, -5L). The “L” suffix indicates an integer.

typeof(2L)

## [1] "integer"

str(2L)

##  int 2

2.2.3 Logical

Represents Boolean values (TRUE or FALSE).

typeof(TRUE)

## [1] "logical"

str(TRUE)

##  logi TRUE

# Example of a logical operation
values <- 1:10
above_five <- (values > 5)
above_five

##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

2.2.4 Character

Represents text or strings (e.g., “hello”, “world”).

typeof("hello")

## [1] "character"

str("hello")

##  chr "hello"

2.2.5 Complex

Represents complex numbers (e.g., 1 + 2i).

typeof(1 + 2i)

## [1] "complex"

str(1 + 2i)

##  cplx 1+2i

2.2.6 NAs/NANs

In some cases the components of a vector may not be known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general any operation on an NA becomes an NA.

z <- c(1:3,NA)
print(z)

## [1]  1  2  3 NA

is.na(z)

## [1] FALSE FALSE FALSE  TRUE

There is a second kind of “missing” values which are produced by numerical computation, the so-called Not a Number, NaN, values.

0/0

## [1] NaN

Inf - Inf

## [1] NaN

2.3 Objects

Objects are the entities that R operates on. These can be:

2.3.1 Vectors

The most fundamental data structure.
A one-dimensional array of elements of the same data type (e.g., numeric, character, logical).
Created using the c() function.

vec1 <- c(1,2,3)
str(vec1)

##  num [1:3] 1 2 3

# Alternative ways of creating vectors:
vec2 <- 1:3  # Sequence of integers
vec3 <- seq(1, 3, by=1)  # More general sequence

Vector elements can be accessed using [] brackets.

# Accessing elements of the vector by index (R uses 1-based indexing)
vec1[1]  # First element

## [1] 1

vec2[c(2, 3)]  # Elements at indices 2 and 3

## [1] 2 3

vec3[c(-2)]  # All elements except for the element at index 2

## [1] 1 3

2.3.2 Matrices

Two-dimensional arrays of elements of the same data type.
Can be created using the matrix() function.

(mat <- matrix(c(1,2,3,4), nrow = 2, ncol = 2))

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

str(mat)

##  num [1:2, 1:2] 1 2 3 4

💡 By enclosing the assignment in parentheses (), you not only create the object but also automatically print its value to the console — a useful shortcut. This is equivalent to writing print(object) or simply typing the object name (e.g., object), but it saves you an extra line of code.

2.3.3 Arrays

Generalization of matrices to more than two dimensions.

(array <- array(1:8, c(2,4,2)))

## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8

str(array)

##  int [1:2, 1:4, 1:2] 1 2 3 4 5 6 7 8 1 2 ...

2.3.4 Lists

Ordered collections of objects, which may be of different types.
Lists can contain other lists as elements.

(list <- list(numb = 10:15, char = 'hello'))

## $numb
## [1] 10 11 12 13 14 15
## 
## $char
## [1] "hello"

str(list)

## List of 2
##  $ numb: int [1:6] 10 11 12 13 14 15
##  $ char: chr "hello"

2.3.5 Factors

Categorical variables.
Represent data with a limited number of possible values.

(fac <- factor(c("single", "married")))

## [1] single  married
## Levels: married single

str(fac)

##  Factor w/ 2 levels "married","single": 2 1

2.3.6 Data Frames

Two-dimensional tabular data structure.
Can contain columns of different data types, such as numeric, character, logical, or factor.
The most common data structure for representing datasets in R.
Technically, a data frame is a special type of list where each element (i.e., column) is of the same length and can be a vector (numeric, character, logical), factor, matrix, list, or even another data frame.

(d <- data.frame(id = 1:5, 
                 val = c(4,5,2,6,5),
                 group = sample(c("exp","control"), size = 5, replace = TRUE)))

str(d)

## 'data.frame':    5 obs. of  3 variables:
##  $ id   : int  1 2 3 4 5
##  $ val  : num  4 5 2 6 5
##  $ group: chr  "exp" "control" "exp" "exp" ...

2.3.7 Functions

Reusable blocks of code that perform a specific task.
Also considered objects in R.

frac <- function(numerator, denominator) {
  result <- numerator / denominator
  return(result)
}

frac(numerator = 6, denominator = 2) # Calling the function

## [1] 3

💡 A function takes inputs called arguments. These can be:

Explicitly specified, e.g., frac(numerator = 6, denominator = 2), or
Implicitly given in order, e.g., frac(6, 2).

I strongly recommend explicitly naming arguments, because it increases clarity, prevents order-related mistakes, and gives you flexibility in argument order: frac(denominator = 2, numerator = 6) works just as well. In contrast, when arguments are unnamed, order matters strictly — and a small mistake can completely change the result: frac(2, 6) is interpreted as frac(numerator = 2, denominator = 6)!

2.4 Exercises

✏️ Exercise 2: Given a vector of ages (ages <- c(35, 45, 60, 15, 50, 8)), determine which patients are eligible for a treatment (age above 18). Return a Boolean vector indicating whether each patient meets the age criteria.

✏️ Exercise 3: Create a data frame with 10 rows and the columns id, blood_pressure, and group. – id: integers from 1 to 10 – blood_pressure: random values from a normal distribution with mean 123 and standard deviation 8 – group: a factor with levels “drug 1”, “drug 2”, and “obs arm” (you decide how to assign them, e.g. by using the function sample()) After creating that data frame, add a new column stand_score where you calculate a standardized score for each blood_pressure value. The standardized score is similar to a z-score but is calculated based on the mean and standard deviation of the blood_pressure values in the dataset (standardized score = (x−μ)/σ)).

💡 Hint: Use the function rnorm() to simulate normal values. Use the function scale() to centre and scale a vector, or alternatively the functions mean() and sd() to compute mean and standard deviation of a vector. You can use help() to learn more about how these functions work.

✏️ Exercise 4: Write a function sum_squared that takes two integers and returns the sum of their squared values.

# Example
sum_squared(int1 = 2, int2 = 3)
# The output should be:
13

3 Packages

Packages are collections of R functions, data, and compiled code in a well-defined format, created to add specific functionality.

There is a set of standard (or base) packages which are considered part of the R source code and automatically available as part of your R installation. Base packages contain the basic functions that allow R to work, and enable standard statistical and graphical functions on datasets.

On top of that, there are 10,000+ user contributed packages. You can install packages using the install.packages() function.

# To install a package, you can use the function.
install.packages("dplyr")

# Alternatively, you can click on 'Packages' in the bottom right pane, then click on 'Install' and type in the name of the package you want to install.

# To load a package you've already install, just load it using the library() function.
library(dplyr)

3.1 Exercises

✏️ Exercise 5: install and load the tidyverse package. We will need it for Chapter 2.

4 Solutions to the exercises

Please not that those are only examples, there are always many ways to solve the same task!

☑️ Exercise 1:

☑️ Exercise 2:

ages <- c(35, 45, 60, 15, 50, 8)
eligible <- ages >= 18
eligible

## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

☑️ Exercise 3:

# Create the data frame
(df <- data.frame(id = 1:10, 
                 blood_pressure = rnorm(10, mean = 123, sd = 8),
                 group = factor(sample(c("drug 1", "drug 2", "obs arm"), 10, replace = TRUE))))

# Compute standardized scores for blood pressure
df$stand_score <- scale(df$blood_pressure)

# Alternatively, compute by hand (standardized score = (x - μ) / σ)
df$stand_score_2 <- (df$blood_pressure - mean(df$blood_pressure))/sd(df$blood_pressure)

# View the data frame with standardized scores
df

☑️ Exercise 4:

sum_squared <- function(int1, int2){
  sum_of_squares <- int1^2 + int2^2
  return(sum_of_squares)
}

sum_squared(int1 = 3, int2 = 5)

## [1] 34

☑️ Exercise 5:

# To install the package, run the line:
install.packages("tidyverse")

# To load the package once installed, run the line:
library(tidyverse)

References

Alexander Henzi. 2021. “Programming and Data Analysis with R.” Lecture notes.

Burns, Patrick. n.d. The R Inferno. Accessed May 8, 2025. https://www.burns-stat.com/documents/books/the-r-inferno/.

CDC. 2025. “National Death Index.” Data Linkage. https://www.cdc.gov/nchs/linked-data/mortality-files/index.html.

“ChatGPT.” n.d. Accessed January 26, 2025. https://chatgpt.com.

Christopher J. Endres. 2025. “Introducing nhanesA.” https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html.

“Create Elegant Data Visualisations Using the Grammar of Graphics.” n.d. Accessed January 26, 2025. https://ggplot2.tidyverse.org/.

David, Author. 2016. “BIRT Joins.” MBSE Chaos. https://mbsechaos.wordpress.com/2016/05/24/birt-joins/.

Elena Kosourova. n.d. “RStudio Tutorial for Beginners: A Complete Guide.” Accessed January 26, 2025. https://www.datacamp.com/tutorial/r-studio-tutorial.

Grolemund, Hadley Wickham and Garrett. n.d. R for Data Science. Accessed May 8, 2025. https://r4ds.had.co.nz/introduction.html.

Mayer, Michael. 2025. “Mayer79/Statistical_computing_material.” https://github.com/mayer79/statistical_computing_material.

Patrick Burns. n.d. Impatient R. Accessed May 8, 2025. https://www.burns-stat.com/documents/tutorials/impatient-r/.

“P-Value.” 2025. Wikipedia. https://en.wikipedia.org/w/index.php?title=P-value&oldid=1305292611.

“Synthetic Dataset for AI in Healthcare.” n.d. Accessed May 9, 2025. https://www.kaggle.com/datasets/smmmmmmmmmmmm/synthetic-dataset-for-ai-in-healthcare.

“The Comprehensive R Archive Network.” n.d. Accessed January 26, 2025. https://stat.ethz.ch/CRAN/.

W. N. Venables, D. M. Smith and the R Core Team. n.d. “An Introduction to R.” Accessed May 8, 2025. https://cran.r-project.org/doc/manuals/r-release/R-intro.html.

Wickham, Hadley. n.d. Advanced R. Accessed May 8, 2025. https://adv-r.hadley.nz/introduction.html.

Chapter 1

R for physicians

2025-08-13