Author | |
---|---|
Name | Claire Descombes |
Affiliation | Universitätsklinik für Neurochirurgie, Inselspital Bern |
Degree | MSc Statistics and Data Science, University of Bern |
Contact | claire.descombes@insel.ch |
The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.
When you get a file from somewhere on your computer (e.g. a dataset), you can either
The advantage of putting the files in the folder that contains your script and is set as the working directory is that you can easily move the folder around on your computer without getting any problems with your script: just set the working directory to your source file every time you open it, and you’ll be fine.
# Example
setwd("~/path/to/your/folder/")
data <- read.csv("testdata.csv")
The advantage of always giving the full path to a file is that you can get data in different folders on your computer, avoiding things like copying the source data in every folder where you have a corresponding script.
# Example
data <- read.csv("~/path/to/your/folder/testdata.csv")
Working directory
To tell R which folder you are working in (e.g., where your data is stored), you have several options:
setwd("path/to/your/folder")
in your script.💡 Tip: To avoid file path errors and keep your project organized,
it’s best to store your script and data files in the same folder — or at
least place your data files in a subfolder like data_sets
within your project directory. Then, set that folder as your working
directory in R. This ensures that your code can reliably find and save
your files.
getwd() # Displays the current working directory
setwd("~/path/to/your/folder") # Sets the working directory
We will first look at how to import a CSV file into R as a data frame.
CSV stands for Comma-Separated Values. In a .csv
file,
the values are stored as plain text, separated by commas. This is a
simple and widely used format for storing tabular data.
After setting your working directory or determining the path to your
CSV file, you can use the read.csv()
function to import the
data. This will create a data frame, which is one of the most commonly
used structures in R for handling datasets.
💡 I recommend to use data frames as data type for your data — they are generally easier to work with than matrices, especially for beginners.
# Import a CSV file into a data frame
dataset <- read.csv("~/path/to/your/folder/data.csv")
The function read.csv()
has several useful arguments,
including:
sep
: The field separator used in the file. For
read.csv()
, the default is a comma (,
), which
is standard for CSV files.
row.names
: Specifies the row names of the data
frame. It can be:
If a header is present and the first row has one fewer field than the
number of columns, the first column is used as row names. Otherwise,
rows are automatically numbered. Use row.names = NULL
to
force default numbering.
col.names
: Optional vector of column names. If not
provided, default names like “V1”, “V2”, etc., are assigned.
stringsAsFactors
: TRUE/FALSE; should character
vectors be converted to factors?
Another widely used data format is the Excel file
(.xlsx
). For these, you can use the readxl
package to import the data:
# Load the readxl package (after installing it)
library(readxl)
# Read the first sheet of an Excel file
dataset <- read_excel("~/path/to/your/folder/data.xlsx")
⚠️ Note: If your file is actually a CSV but mistakenly has a .xlsx extension, you should rename it to .csv and use read.csv() instead. Mixing up formats can lead to import errors.
Let us now look at real data frames to learn how to call or modify
their elements. To do this, we will use multiple health data sets from
the National Health and Nutrition Examination (NHANES
) Survey
from 2011-2012. The survey assessed overall health and nutrition of
adults and children in the United States and was conducted by the
National Center for Health Statistics (NCHS). The data sets can be found
in the data_sets
folder. More details on these data sets can be found in the Appendix
A.
✏️ Exercise on the NHANES
data sets n°1: import the
demo
, bpx
, bmx
and
smq
data sets from the data_sets
folder into R.
tidyverse
Base R, without any additional packages, already provides many functions that are very handy for data handling. However, some contributed packages make data preparation much easier and more readable.
I’ll introduce two such packages here, before diving into concrete
data handling examples. Both are part of a larger and very powerful
collection of packages for data science called the
tidyverse
, which I use for nearly all my analyses.
💡 In the Appendix B, you will find a table containing useful
functions from both Base R and the tidyverse
that
facilitate efficient data handling.
One of the most downloaded contributed extension packages of all
times is magrittr
. It provides a very useful operator, the
forward pipe operator %>%
, which passes the object on
its left as the first argument to the function on its right. This is
much easier to understand with an example.
# The easiest way to get magrittr is to install the whole tidyverse
install.packages("tidyverse")
# Once installed, a package has to be loaded to be used
library(tidyverse)
library(tidyverse)
# Let's do the same operation twice: once using the pipe, once without
# No pipe:
str(c(1,2,3,4))
## num [1:4] 1 2 3 4
# With pipe:
c(1,2,3,4) %>%
str()
## num [1:4] 1 2 3 4
# Not too exciting yet, but consider a more complex case:
summary(log(sqrt(na.omit(c(1, 4, NA, 16, 25)))))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5199 1.0397 0.9222 1.4421 1.6094
# With the pipe, we can rewrite this more readably:
c(1, 4, NA, 16, 25) %>%
na.omit() %>%
sqrt() %>%
log() %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5199 1.0397 0.9222 1.4421 1.6094
The pipe helps turn nested function calls into a sequence of simpler,
linear steps. This makes code easier to read, write, and debug. The pipe
becomes especially powerful when used with functions from the
dplyr
package for data manipulation.
dplyr
Another helpful R package is dplyr
. It is a grammar of
data manipulation, providing a consistent set of verbs that helps solve
the most common data manipulation challenges.
Let’s illustrate this with a simple example. Our goal: Group the cars dataset (contained in base R) by speed groups (e.g. low/medium/high), and for each group, compute (1) the average stopping distance and (2) the number of observations.
# Base R (no dplyr, no pipe)
cars$speed_group <- cut(cars$speed, breaks = c(0, 10, 20, 30),
labels = c("Low", "Medium", "High"))
avg_dist <- aggregate(dist ~ speed_group, data = cars, mean)
n_obs <- aggregate(dist ~ speed_group, data = cars, length)
names(n_obs)[2] <- "n"
summary_df <- merge(avg_dist, n_obs, by = "speed_group")
summary_df
# With dplyr, no pipe:
cars <- mutate(cars, speed_group = cut(speed, breaks = c(0, 10, 20, 30), labels = c("Low", "Medium", "High")))
summary_df <- summarise(group_by(cars, speed_group),
avg_dist = mean(dist),
n = n())
summary_df
# With dplyr and the pipe
cars %>%
mutate(speed_group = cut(speed, breaks = c(0, 10, 20, 30),
labels = c("Low", "Medium","High"))) %>%
group_by(speed_group) %>%
summarise(
avg_dist = mean(dist),
n = n()
)
💡 cut(x, ...)
divides the range of x
into
intervals (the breaks
) and codes the values in x according
to which interval they fall. labels
are the levels of the
resulting category. If labels = FALSE
, simple integer codes
are returned instead of a factor.
As you can see, using dplyr
and the pipe can make your
life much easier.
In the following chapter, we’ll use both base R and
tidyverse
functions without always noting which package
they belong to. If you’re ever unsure, you can check the top-left corner
of the function’s help page.
Being able to access elements in a data frame is essential when working with data. Here are some common methods to select specific elements, rows, or columns.
# Look at the first respectively last few rows
head(demo)
tail(demo)
# Select columns by name
demo[, c("RIDAGEYR", "RIAGENDR")] # Selecting age in years and gender
vars <- c("RIDAGEYR", "RIAGENDR")
demo[, vars] # Alternative using variable `vars`
# Select elements by position
demo[1, 1] # Access the first element of the first column (the respondent sequence number of the 1st participant)
## [1] 62161
ind_mat <- cbind(c(1, 3, 5), c(2, 4, 6))
demo[ind_mat] # Access rows and columns using multiple indices
## [1] "NHANES 2011-2012 public release" "Male"
## [3] NA
# Select rows based on a condition
head(demo[, "RIDAGEYR"] > 50) # Logical condition for age greater than 50
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(!(demo[, "DMDHHSIZ"] > 3)) # Logical negation for total number of people in the household not greater than 3
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
demo[demo[, "RIDAGEYR"] > 50, ] # Rows where age > 50
demo[demo[, "DMDHHSIZ"] < 3, ] # Rows where total number of people in the household greater than 3
demo[demo[, "DMDHHSIZ"] >= 3, ] # Rows where total number of people in the household greater or equal 3
# Combine logical vectors using "&" (AND), "|" (OR), and "!" (NOT)
demo[(demo[, "RIDAGEYR"] > 50 & demo[, "RIAGENDR"] == "Female"), ] # Both conditions must be true
demo[(demo[, "DMDHHSIZ"] < 3 | demo[, "RIAGENDR"] == "Male"), ] # One condition must be true
💡 To inspect one column, you can also use the dollar $
symbol to access a column as a vector.
head(demo$RIDAGEYR) # Returns the age column as a vector
## [1] 22 3 14 44 14 9
💡 You can use the brackets []
to select specific rows
and columns. Since data frames are bi-dimensional, the first index
refers to rows and the second to columns. To select a particular column,
you can omit the row index. To select a particular row, omit the column
index.
head(demo[, "RIDAGEYR"]) # All rows in the age column
## [1] 22 3 14 44 14 9
demo[1, ] # Row 1 (all columns)
dplyr
syntaxWith dplyr
, you can achieve the same things more
readably, especially for filtering rows and selecting columns.
# Select columns
select(demo, RIDAGEYR, RIAGENDR)
# Filter rows based on a condition
filter(demo, RIDAGEYR > 50)
# Combine filtering and selecting
filter(demo, RIDAGEYR > 50 & RIAGENDR == "Female") %>%
select(RIDAGEYR, RIAGENDR)
💡 Using the pipe (%>%
), the structure becomes more
linear: first say what data you’re working on, then what to do with
it.
demo %>%
filter(DMDHHSIZ < 3 | RIAGENDR == "Male") %>%
select(DMDHHSIZ, RIAGENDR)
This syntax avoids nested brackets and makes your code easier to read, especially as operations grow more complex.
R makes it simple to compute basic descriptive statistics for exploring your dataset. Below are a few useful examples.
mean(demo$RIDAGEYR, na.rm = TRUE) # Average (mean) age of participants
## [1] 31.40262
median(demo$DMDHHSIZ, na.rm = TRUE) # Median household size
## [1] 4
💡 The na.rm
argument in those functions allows for
ignoring the NA values.
With dplyr
, you can combine summarise()
with the pipe for clearer syntax:
demo %>%
summarise(mean_age = mean(RIDAGEYR, na.rm = TRUE),
median_household_size = median(DMDHHSIZ, na.rm = TRUE))
sd(demo$RIDAGEYR, na.rm = TRUE) # Standard deviation of age
## [1] 24.57899
range(demo$DMDHHSZA, na.rm = TRUE) # Range of number of young children
## [1] 0 3
min(demo$DMDHRAGE, na.rm = TRUE) # Minimum age of household reference person
## [1] 18
max(demo$DMDHRAGE, na.rm = TRUE) # Maximum age of household reference person
## [1] 80
Also possible with summarise()
:
demo %>%
summarise(sd_age = sd(RIDAGEYR, na.rm = TRUE),
min_age = min(DMDHRAGE, na.rm = TRUE),
max_age = max(DMDHRAGE, na.rm = TRUE))
table(demo$RIAGENDR) # Gender distribution
##
## Female Male
## 4900 4856
table(demo$DMDHRMAR) # Marital status of household reference person
##
## Divorced Don't Know Living with partner Married
## 816 8 876 5282
## Never married Refused Separated Widowed
## 1539 82 436 581
table(demo$AIALANGA) # Language of interview
##
## Asian languages English Spanish
## 95 5192 467
prop.table(table(demo$AIALANGA)) # Proportional distribution of interview languages
##
## Asian languages English Spanish
## 0.01651025 0.90232881 0.08116093
With dplyr
, you get frequencies directly:
demo %>%
count(RIAGENDR)
demo %>%
count(AIALANGA) %>%
mutate(prop = n / sum(n)) # Proportional distribution
💡 n()
gives the current group size.
aggregate(DMDHRAGE ~ DMDHRMAR, data = demo, FUN = mean, na.rm = TRUE) # Mean age of household reference person by marital status
aggregate(DMDHHSIZ ~ DMDHRGND, data = demo, FUN = median, na.rm = TRUE) # Median household size by gender of reference person
Or using dplyr
:
demo %>%
group_by(DMDHRMAR) %>%
summarise(mean_age = mean(DMDHRAGE, na.rm = TRUE))
demo %>%
group_by(DMDHRGND) %>%
summarise(median_household_size = median(DMDHHSIZ, na.rm = TRUE))
summary(demo)[,1:5] # only for the first 5 variables
## SEQN SDDSRVYR RIDSTATR RIAGENDR
## Min. :62161 Length:9756 Length:9756 Length:9756
## 1st Qu.:64600 Class :character Class :character Class :character
## Median :67038 Mode :character Mode :character Mode :character
## Mean :67038
## 3rd Qu.:69477
## Max. :71916
##
## RIDAGEYR
## Min. : 0.0
## 1st Qu.: 9.0
## Median :26.0
## Mean :31.4
## 3rd Qu.:52.0
## Max. :80.0
##
Now, let us assume we want to modify/add/remove one or multiple entries/rows/columns in our data frame. The brackets really come in handy now. In this setting, I recommend defining a new data frame before modifying the original one.
Some examples follow.
# Modify one entry:
demo_mod <- demo # Create a copy to avoid modifying the original data set
demo_mod[1, 1:5]
demo_mod[1, "RIAGENDR"] <- 'Female' # Change gender of the first participant
demo_mod[1, 1:5]
# Modify multiple entries based on a condition:
demo_mod[1:10, 1:5]
demo_mod[!is.na(demo_mod$RIDAGEYR) & demo_mod$RIDAGEYR < 18, ]$RIDAGEYR <- 18 # Set minimum age to 18
demo_mod[1:10, 1:5]
dplyr
syntaxWith mutate()
and case_when()
from
dplyr
, the logic becomes slightly more readable, especially
for multiple conditions:
demo_mod <- demo %>%
mutate(RIDAGEYR = case_when(
!is.na(RIDAGEYR) & RIDAGEYR < 18 ~ 18,
TRUE ~ RIDAGEYR # leave other values unchanged
))
💡 case_when()
is like a vectorised
ifelse()
on steroids — it’s handy for multiple conditions
and clear logic.
Handling missing data (NAs) is a common task in data analysis. Before deciding how to treat them, it’s important to understand where and how often they occur.
colSums(is.na(demo)) # Number of NAs per column
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3
## 0 0 0 0 0 9130 0 0
## RIDEXMON RIDEXAGY RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS
## 418 6338 5747 3749 9205 0 5 7683
## DMDEDUC3 DMDEDUC2 DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG
## 7157 4196 4196 8548 0 6 0 105
## FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA WTINT2YR WTMEC2YR
## 105 105 3043 3043 3043 4002 0 0
## SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR DMDHHSIZ DMDFMSIZ DMDHHSZA
## 0 0 81 51 840 0 0 0
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU
## 0 0 0 0 365 362 136 4881
sum(complete.cases(demo)) # Number of rows without any NAs
## [1] 0
💡 is.na()
returns a logical matrix where
TRUE
indicates a missing value (NA
) and
FALSE
indicates a non-missing value. colSums()
takes this logical matrix and sums up the TRUE
values
(which are treated as 1), giving you the count of missing values for
each column.
💡 complete.cases()
returns a logical vector: TRUE if a
row has no missing values, and FALSE otherwise. Using
sum(complete.cases(...))
counts the number of rows with no
missing data.
One way to handle missing data is to remove rows containing NAs for the variable(s) you are interested in. This can be appropriate in some cases, but it should be done with care, as it may introduce bias or reduce sample size. We’ll discuss this further in Chapter 5.
# Remove rows with any missing values in the DMDHRMAR column
demo_DMDHRMAR <- demo[!is.na(demo$DMDHRMAR), ]
# Check for missing values
sum(is.na(demo_DMDHRMAR$DMDHRMAR))
## [1] 0
You can remove all rows with missing values across any of the columns
in the dataset using the function na.omit()
.
# Remove rows with missing values in any column
demo_no_na <- na.omit(demo)
# Check the resulting data frame and its structure
head(demo_no_na)
💡 For our demo
data set, this removes all the rows!
Another reminder to be very careful when removing NA values.
dplyr
syntaxThe filter()
function makes row-wise filtering more
intuitive:
# Keep only rows where DMDHRMAR is not missing
demo_DMDHRMAR <- demo %>%
filter(!is.na(DMDHRMAR))
# Remove all rows with any missing values
demo_no_na <- demo %>%
filter(complete.cases(.))
💡 Using filter()
keeps the logic clear and readable,
especially when chaining multiple conditions with the pipe.
✏️ Exercise on the NHANES
data sets n°2: inspect the
structure of the demo
data set, look at different entries,
get familiar with those commands.
✏️ Exercise on the NHANES
data sets n°3: generate a new
data frame selecting only the female patients that are above 18 years
old and that took the interview in Spanish.
In practice, data is often spread across multiple data frames that need to be combined. Depending on the structure and goal, there are different ways to combine data frames:
To add columns side-by-side, the data frames must have the same number of rows.
# Extract one column from demo to create an additional data frame with the same number of rows
extra_info <- demo$RIDRETH1
# Combine using cbind
combined_df <- cbind(demo, extra_info)
combined_df[1, ]
# Add a new column directly to demo with mutate() from dplyr
combined_df <- demo %>%
mutate(extra_info = RIDRETH1)
To stack data frames vertically, the data frames must have the same column names and types.
# Extract one column from demo to create an additional data frame with the same structure (column names and types)
(new_participant <- demo[18,])
# Combine using rbind
extended_df <- rbind(demo, new_participant)
extended_df[nrow(extended_df),]
# Alternative with dplyr (bind_rows())
extended_df <- bind_rows(demo, new_participant)
extended_df %>%
slice(n()) # Show last row
💡 n()
gives the current group size.
merge()
combines data frames based on a common column,
similar to SQL joins (see figure below for a reminder on the different
types of joins).
# Merge two data frames by participant ID `SEQN` (inner join by default)
merged_demo_bpx <- merge(demo, bpx, by = "SEQN")
# In base R, you can use the Reduce() function to iteratively merge a list of data frames
data_list <- list(demo, bpx, bmx) # list of data frames to merge
merged_demo_bpx_bmx <- Reduce(function(x, y) merge(x, y, by = "SEQN"), data_list) # perform an inner join on all data frames by 'SEQN'
You should specify the type of join you are looking for:
SEQN
in both data frames:
inner_join_df <- merge(demo, bmx, by = "SEQN")
demo
and only
matching rows from bmx
:
left_join_df <- merge(demo, bmx, by = "SEQN", all.x = TRUE)
bmx
and only
matching rows from demo
:
left_join_df <- merge(demo, bmx, by = "SEQN", all.y = TRUE)
demo
and
bmx
, filling with NA where there are no matches:
full_join_df <- merge(demo, bmx, by = "SEQN", all = TRUE)
💡 merge()
in base R is flexible but can be a bit
verbose. For simpler syntax, you can also use the dplyr
functions left_join()
, right_join()
,
inner_join()
, and full_join()
.
# Inner join
inner_join_df <- demo %>%
inner_join(bmx, by = "SEQN")
# Left join
left_join_df <- demo %>%
left_join(bmx, by = "SEQN")
# Right join
right_join_df <- demo %>%
right_join(bmx, by = "SEQN")
# Full join
full_join_df <- demo %>%
full_join(bmx, by = "SEQN")
✏️ Exercise on the NHANES
data sets n°4: merge the 4
data sets demo
, bpx
, bmx
and
smq
into a single data set merged_nhanes
by
performing an inner join on the SEQN
ID and using the
dplyr
package.
You can save data sets in different formats, depending on how you want to use them later:
Let us save a data frame in all those different formats.
# Specify the directory for the data
directory <- "~/Documents/GitHub/rforphysicians/data_sets/"
# Save as CSV
write.csv(merged_demo_bpx_bmx, file = file.path(directory,"merged_demo_bpx_bmx.csv"), row.names = FALSE)
# Save as RDS (recommended for re-loading in R)
saveRDS(merged_demo_bpx_bmx, file = file.path(directory,"merged_demo_bpx_bmx.rds"))
# Save as RData (can contain multiple objects)
save(merged_demo_bpx_bmx, file = file.path(directory,"merged_demo_bpx_bmx.RData"))
💡 file.path()
constructs the path to a file from
components in a platform-independent way. It concatenates paths and
filenames, and automatically uses the correct file separator (/ on
Unix/macOS, \ on Windows), making your code more robust and
portable.
You can load these later using:
# Read CSV
df_csv <- read.csv(file.path(directory, "merged_demo_bpx_bmx.csv"))
# Alternative: use readr for CSV (from tidyverse package)
df_csv2 <- read_csv(file.path(directory, "merged_demo_bpx_bmx.csv"))
# Read RDS
df_rds <- readRDS(file.path(directory, "merged_demo_bpx_bmx.rds"))
# Load RData (loads objects into the environment)
load(file.path(directory, "merged_demo_bpx_bmx.RData"))
💡 When loading RData, check which objects got loaded with
ls()
(returns a vector of character strings giving the
names of the objects in the specified environment).
✏️ Exercise on the NHANES
data sets n°5: Save the data
frame resulting from Exercise n°4 in your working directory in both .csv
and .rds formats.
Please not that those are only examples, there are always many ways to solve the same task!
☑️ Exercise on the NHANES
data sets n°1:
☑️ Exercise on the NHANES
data sets n°2: Did you manage
to select a specific column you were interested in? Were you able to
check how many men and women are included in the data set? Here are a
few examples of operations you can use to explore the data set.
# View the first few rows of the dataset
head(demo)
# Display the names of all columns
names(demo)
## [1] "SEQN" "SDDSRVYR" "RIDSTATR" "RIAGENDR" "RIDAGEYR" "RIDAGEMN"
## [7] "RIDRETH1" "RIDRETH3" "RIDEXMON" "RIDEXAGY" "RIDEXAGM" "DMQMILIZ"
## [13] "DMQADFC" "DMDBORN4" "DMDCITZN" "DMDYRSUS" "DMDEDUC3" "DMDEDUC2"
## [19] "DMDMARTL" "RIDEXPRG" "SIALANG" "SIAPROXY" "SIAINTRP" "FIALANG"
## [25] "FIAPROXY" "FIAINTRP" "MIALANG" "MIAPROXY" "MIAINTRP" "AIALANGA"
## [31] "WTINT2YR" "WTMEC2YR" "SDMVPSU" "SDMVSTRA" "INDHHIN2" "INDFMIN2"
## [37] "INDFMPIR" "DMDHHSIZ" "DMDFMSIZ" "DMDHHSZA" "DMDHHSZB" "DMDHHSZE"
## [43] "DMDHRGND" "DMDHRAGE" "DMDHRBR4" "DMDHREDU" "DMDHRMAR" "DMDHSEDU"
# View all unique values in the "AIALANGA" (Interview Language) column
unique(demo$AIALANGA)
## [1] "English" NA "Spanish" "Asian languages"
# Count how many men and women are in the dataset
table(demo$RIAGENDR)
##
## Female Male
## 4900 4856
# Get a quick statistical summary of a few columns
summary(demo[,c('RIDAGEYR','DMDHHSZA','DMDHRAGE')])
## RIDAGEYR DMDHHSZA DMDHRAGE
## Min. : 0.0 Min. :0.000 Min. :18.00
## 1st Qu.: 9.0 1st Qu.:0.000 1st Qu.:33.00
## Median :26.0 Median :0.000 Median :43.00
## Mean :31.4 Mean :0.531 Mean :45.39
## 3rd Qu.:52.0 3rd Qu.:1.000 3rd Qu.:56.00
## Max. :80.0 Max. :3.000 Max. :80.00
# View the number of missing values in each column
colSums(is.na(demo))
## SEQN SDDSRVYR RIDSTATR RIAGENDR RIDAGEYR RIDAGEMN RIDRETH1 RIDRETH3
## 0 0 0 0 0 9130 0 0
## RIDEXMON RIDEXAGY RIDEXAGM DMQMILIZ DMQADFC DMDBORN4 DMDCITZN DMDYRSUS
## 418 6338 5747 3749 9205 0 5 7683
## DMDEDUC3 DMDEDUC2 DMDMARTL RIDEXPRG SIALANG SIAPROXY SIAINTRP FIALANG
## 7157 4196 4196 8548 0 6 0 105
## FIAPROXY FIAINTRP MIALANG MIAPROXY MIAINTRP AIALANGA WTINT2YR WTMEC2YR
## 105 105 3043 3043 3043 4002 0 0
## SDMVPSU SDMVSTRA INDHHIN2 INDFMIN2 INDFMPIR DMDHHSIZ DMDFMSIZ DMDHHSZA
## 0 0 81 51 840 0 0 0
## DMDHHSZB DMDHHSZE DMDHRGND DMDHRAGE DMDHRBR4 DMDHREDU DMDHRMAR DMDHSEDU
## 0 0 0 0 365 362 136 4881
# Calculate the average age
mean(demo$RIDAGEYR, na.rm = TRUE)
## [1] 31.40262
☑️ Exercise on the NHANES
data sets n°3:
# Filter the dataset to include only female patients that are above 18 years old and that took the interview in Spanish.
demo_filtered <- demo[demo$RIAGENDR == "Female" &
demo$RIDAGEYR > 17 &
!is.na(demo$AIALANGA) & demo$AIALANGA == "Spanish", ]
ind <- c('RIAGENDR', 'RIDAGEYR', 'AIALANGA')
demo_filtered[, ind]
💡 Note: When filtering on a variable that contains missing values
(e.g., AIALANGA
), you must explicitly exclude NAs using
!is.na(...)
. This is because comparisons like demo$AIALANGA
== “Spanish” return NA for missing values, not FALSE, so those rows
aren’t properly excluded from the subset.
☑️ Exercise on the NHANES
data sets n°4:
# If necessary, install the dplyr package
install.packages("dplyr")
# Load the dplyr package
library(dplyr)
# Merge the datasets using inner joins by SEQN
data_list <- list(demo, bpx, bmx, smq)
merged_nhanes <- Reduce(function(x, y) inner_join(x, y, by = "SEQN"), data_list)
☑️ Exercise on the NHANES
data sets n°5:
# Save the merged dataset in your working directory
# Click on Session > Set Working Directory > To Source File Location, or use file.path()
write.csv(merged_nhanes, "merged_nhanes.csv", row.names = FALSE)
saveRDS(merged_nhanes, "merged_nhanes.rds")