This is a tutorial on how to write your first custom function in R.
For this tutorial, you will first need to load tidyverse for data wrangling and kableExtra for tables. Next, I am using rio::import to import the cleaned data file to my environment. Note, you may need to change the file path if your file structure is not the same as mine. Last, I am using mutate and case_when to label the levels of the hh_food_secure variable, storing the labels in a new food_security column.
If you would like to copy the code, you can hover over the code, and a “copy code” box should appear.
library(tidyverse)
library(kableExtra)
data <- rio::import(here::here("data", "nhanes_1999-2016.csv")) %>%
  mutate(food_security = case_when(hh_food_secure == 1 ~ "Fully food secure",
                                   hh_food_secure == 2 ~ "Marginally food secure",
                                   hh_food_secure == 3 ~ "Food insecure without hunger",
                                   hh_food_secure == 4 ~ "Food insecure with hunger"))
A function is code that carries out an operation. For example, + is a function that carries out addition.
2 + 3
[1] 5
In algebra, you may recall learning functions such as \(f(x,y) = x^2+y\): when you put in the inputs x = 2 and y = 1, the function computes \(2^2+1\) and outputs \(5\). Similarly, a function call in R takes a form like f(x, y, z, ...), where f is the function name and x, y, z, ... are the arguments of the function. Even + can be called this way:
`+`(2,3)
[1] 5
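To connect the algebra example to R, here is a minimal sketch of \(f(x,y) = x^2+y\) written as an R function. The name f is only illustrative.

```r
# f(x, y) = x^2 + y written as an R function
f <- function(x, y) {
  x^2 + y
}

f(2, 1)          # returns 5, matching 2^2 + 1
f(x = 2, y = 1)  # the same call with the arguments named
```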
Let’s create our first function in R called my_pet() that will print out a statement about your pet. As a best practice, give your function a descriptive name. Avoid reusing the name of a popular function such as mean, because your definition will mask the built-in mean function for the rest of your script.
my_pet <- function(pronoun, animal, verb){
paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
}
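On the naming advice above, this small sketch shows what happens if you do reuse the name mean. It uses only base R, and the replacement function is a deliberately silly placeholder.

```r
# Defining your own `mean` hides base::mean() in the current session
mean <- function(x) "not the average you wanted"

masked   <- mean(c(1, 2, 3))        # calls the new definition
original <- base::mean(c(1, 2, 3))  # the built-in is still reachable via base::

rm(mean)                            # remove the custom version
restored <- mean(c(1, 2, 3))        # the built-in is found again
```

The built-in function is never destroyed, but every unqualified call to mean() in your script would pick up your version until you remove it.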
The arguments of the function (which R calls formals) are what the user supplies to the function to get their desired output. You can see the formals of a function in R by using the formals() function.
formals(my_pet)
$pronoun
$animal
$verb
The body of the function is where the function takes in the formals and creates the output. You can see the body of a function in R by using the body() function.
body(my_pet)
{
paste0(str_to_title(pronoun), " is a ", animal, " who likes to ",
verb, ".")
}
Now, to use the function, supply it with your desired arguments.
my_pet(pronoun = "she",
animal = "dog",
verb = "play outside")
[1] "She is a dog who likes to play outside."
When you use your function, you can drop the argument names as long as you keep the same order.
my_pet("he", "cat", "sleep")
[1] "He is a cat who likes to sleep."
If you want to use arguments out of the order they were defined in, you will need to label them.
my_pet(animal = "lizard",
       pronoun = "she",
       verb = "eat")
[1] "She is a lizard who likes to eat."
You can set a “default” setting for an argument. This is the setting that occurs when the user does not specify the argument.
my_pet2 <- function(pronoun, animal, verb = "dance"){
paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
}
my_pet2(pronoun = "she",
animal = "dog")
[1] "She is a dog who likes to dance."
The user can override the default.
my_pet2(pronoun = "she",
animal = "dog",
verb = "cuddle")
[1] "She is a dog who likes to cuddle."
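Defaults are stored with the formals, so you can inspect them the same way as before. A short sketch, repeating the my_pet2() definition from above so that it runs on its own:

```r
library(stringr)  # provides str_to_title(); loaded via tidyverse in this tutorial

my_pet2 <- function(pronoun, animal, verb = "dance"){
  paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
}

arg_names    <- names(formals(my_pet2))  # "pronoun" "animal" "verb"
default_verb <- formals(my_pet2)$verb    # "dance"
```

Arguments without a default, such as pronoun and animal, must be supplied by the user.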
If someone else (or your future self) is going to use your function, it is helpful to embed errors with stop() and/or warnings with warning() into your code to explain why the code will not work (or if it will not work as expected).
# I am using an if else structure.
# If the user inputs "cat" for animal, the function will
# throw an error and say "Sorry, this function doesn't work
# for people who own cats."
# If they input any other animal, it will work.
my_pet3 <- function(pronoun, animal, verb = "stretch"){
  if(animal == "cat"){
    stop("Sorry, this function doesn't work for people who own cats.")
  } else {
    paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
  }
}
# This works
my_pet3("she", "dog")
[1] "She is a dog who likes to stretch."
# This throws an error
my_pet3("he", "cat")
Error in my_pet3("he", "cat"): Sorry, this function doesn't work for people who own cats.
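If you call a function that may stop with an error, you can catch the error instead of letting it halt your script. The following is a minimal sketch using base R's tryCatch(), with my_pet3() repeated so the example runs on its own.

```r
library(stringr)  # provides str_to_title()

my_pet3 <- function(pronoun, animal, verb = "stretch"){
  if(animal == "cat"){
    stop("Sorry, this function doesn't work for people who own cats.")
  } else {
    paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
  }
}

result <- tryCatch(
  my_pet3("he", "cat"),
  error = function(e) {
    # conditionMessage() extracts the text that was passed to stop()
    paste("Caught:", conditionMessage(e))
  }
)
result
```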
Instead of an error, you might just want to throw a warning, but still allow the function to work.
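The definition of my_pet4() is not shown here. A minimal sketch consistent with the output below, assuming it warns for every animal other than a dog and then builds the sentence as usual, could be:

```r
library(stringr)  # provides str_to_title()

# Hypothetical reconstruction of my_pet4(): warn unless the animal is a dog,
# then return the sentence exactly as my_pet3() would
my_pet4 <- function(pronoun, animal, verb = "stretch"){
  if(animal != "dog"){
    # warning() pastes its arguments together without a separator
    warning("Really? A ", animal, "? You should really consider getting a dog.")
  }
  paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
}
```

Unlike stop(), warning() does not end the function, so the paste0() line still runs and the sentence is returned.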
my_pet4("he", "cat")
Warning in my_pet4("he", "cat"): Really? A cat? You should really
consider getting a dog.
[1] "He is a cat who likes to stretch."
my_pet4("he", "fish", "swim")
Warning in my_pet4("he", "fish", "swim"): Really? A fish? You should
really consider getting a dog.
[1] "He is a fish who likes to swim."
# works as normal
my_pet4("she", "dog")
[1] "She is a dog who likes to stretch."
These are silly examples, but there are many reasons why you would want the function to output a warning or error. For example, you may want the function to warn someone if their input is not going to give them the output they expect, or if the function will not work for certain input. This will help the user work with your function to get their desired output.
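As a less silly illustration, here is a sketch of a function that stops on input it cannot handle and warns when the result may not be what the user expects. The function name safe_mean() and its messages are made up for this example.

```r
# Hypothetical helper: average of a numeric vector with explicit checks
safe_mean <- function(x){
  if(!is.numeric(x)){
    # the function cannot work at all, so stop with an explanation
    stop("`x` must be numeric, not ", class(x)[1], ".")
  }
  if(anyNA(x)){
    # the function still works, but the user should know what happened
    warning("`x` contains missing values; they are dropped before averaging.")
  }
  mean(x, na.rm = TRUE)
}

safe_mean(c(1, 2, 3))      # returns 2 silently
# safe_mean(c(1, NA, 3))   # warns, then returns 2
# safe_mean("a")           # stops with an error
```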
A particularly good candidate for a function is code that you write a lot. Additionally, you want to make a function that is simple and only does one thing. For this first function, I am going to create a function that adds a “total” column containing the total number of observations across all levels of a grouping variable.
Before I create a function, I first like to try to make it work in a single case.
a <- count(data, year, gender)
a
year gender n
1 1999-2000 female 5082
2 1999-2000 male 4883
3 2001-2002 female 5708
4 2001-2002 male 5331
5 2003-2004 female 5152
6 2003-2004 male 4970
7 2005-2006 female 5268
8 2005-2006 male 5080
9 2007-2008 female 5053
10 2007-2008 male 5096
11 2009-2010 female 5312
12 2009-2010 male 5225
13 2011-2012 female 4900
14 2011-2012 male 4856
15 2013-2014 female 5172
16 2013-2014 male 5003
17 2015-2016 female 5079
18 2015-2016 male 4892
b <- count(data, year)
b
year n
1 1999-2000 9965
2 2001-2002 11039
3 2003-2004 10122
4 2005-2006 10348
5 2007-2008 10149
6 2009-2010 10537
7 2011-2012 9756
8 2013-2014 10175
9 2015-2016 9971
# Join a and b by year
# There are two n's so I am also changing the suffix of n so that they are labeled more clearly
left_join(a,b, by = "year",
suffix = c("_group", "_total"))
year gender n_group n_total
1 1999-2000 female 5082 9965
2 1999-2000 male 4883 9965
3 2001-2002 female 5708 11039
4 2001-2002 male 5331 11039
5 2003-2004 female 5152 10122
6 2003-2004 male 4970 10122
7 2005-2006 female 5268 10348
8 2005-2006 male 5080 10348
9 2007-2008 female 5053 10149
10 2007-2008 male 5096 10149
11 2009-2010 female 5312 10537
12 2009-2010 male 5225 10537
13 2011-2012 female 4900 9756
14 2011-2012 male 4856 9756
15 2013-2014 female 5172 10175
16 2013-2014 male 5003 10175
17 2015-2016 female 5079 9971
18 2015-2016 male 4892 9971
Now, let’s try to generalize it.
total_grouping <- function(data, var, grouping){
a <- count(data, var, grouping)
b <- count(data, var)
c <- left_join(a,b,by = "year",
suffix = c("_group", "_total"))
}
total_grouping(data, year, gender)
Error: Must group by variables found in `.data`.
* Column `var` is not found.
* Column `grouping` is not found.
Unfortunately, running this code gives an error saying that the columns var and grouping are not found. This can happen when you use tidyverse functions inside your own functions, because the tidyverse uses what is called non-standard evaluation (NSE). NSE makes the functions really user friendly. For example, when you use dplyr functions like select(data, year) or group_by(data, year), the function knows that year refers to data$year and not to an object called year in your global environment. However, this causes trouble when you try to use select() or group_by() (or in our case, count()) in your own function. dplyr looks for columns literally named var and grouping in the data, rather than the columns the user passed in, and can’t find them. To work around this, we need to also use NSE. Here are two ways you can write the function with NSE:
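total_grouping <- function(data, var, grouping){
  a <- count(data, {{var}}, {{grouping}}) # wrap the variables in {{}}
  b <- count(data, {{var}})
  # join on whichever column was passed as var, rather than hard-coding "year"
  left_join(a, b, by = rlang::as_name(rlang::enquo(var)),
            suffix = c("_group", "_total"))
}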
total_grouping(data, year, gender) %>% head()
year gender n_group n_total
1 1999-2000 female 5082 9965
2 1999-2000 male 4883 9965
3 2001-2002 female 5708 11039
4 2001-2002 male 5331 11039
5 2003-2004 female 5152 10122
6 2003-2004 male 4970 10122
total_grouping <- function(data, var, grouping){
  v1 <- enquo(var) # quote the variables
  v2 <- enquo(grouping)
  a <- count(data, !!v1, !!v2) # use !! to unquote the quoted variables
  b <- count(data, !!v1)
  # join on the column named by var, rather than hard-coding "year"
  left_join(a, b, by = rlang::as_name(v1),
            suffix = c("_group", "_total"))
}
total_grouping(data, year, gender) %>% head()
year gender n_group n_total
1 1999-2000 female 5082 9965
2 1999-2000 male 4883 9965
3 2001-2002 female 5708 11039
4 2001-2002 male 5331 11039
5 2003-2004 female 5152 10122
6 2003-2004 male 4970 10122
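One thing to watch when generalizing: the join key should follow var, otherwise the function only works when var is year. The sketch below checks this on a small toy data frame (the column site and all values are made up) by deriving the key with rlang::as_name(), which turns a quoted column name into a string.

```r
library(dplyr)

# Toy data standing in for the NHANES file; the values are made up
toy <- data.frame(
  year   = c("A", "A", "A", "B", "B"),
  gender = c("female", "male", "male", "female", "female"),
  site   = c("x", "x", "y", "y", "y")
)

total_grouping <- function(data, var, grouping){
  key <- rlang::as_name(rlang::enquo(var))  # column name of `var` as a string
  a <- count(data, {{var}}, {{grouping}})
  b <- count(data, {{var}})
  left_join(a, b, by = key,
            suffix = c("_group", "_total"))
}

# `var` is now site rather than year
res <- total_grouping(toy, site, gender)
res
```

Here n_group holds the count for each site and gender combination, and n_total holds the count for the whole site.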
The great thing about functions is that we can now use our function with other variables, without having to copy, paste, and edit the code each time. This cuts down on mistakes and makes your code easier to read.
total_grouping(data, year, race_ethnic) %>%
head()
year race_ethnic n_group n_total
1 1999-2000 mexican-american 3393 9965
2 1999-2000 non-hispanic-black 2228 9965
3 1999-2000 non-hispanic-white 3367 9965
4 1999-2000 other-hispanic 589 9965
5 1999-2000 other-race 388 9965
6 2001-2002 mexican-american 2776 11039
total_grouping(data, year, food_security) %>%
head()
year food_security n_group n_total
1 1999-2000 Food insecure with hunger 499 9965
2 1999-2000 Food insecure without hunger 1209 9965
3 1999-2000 Fully food secure 7102 9965
4 1999-2000 Marginally food secure 889 9965
5 1999-2000 <NA> 266 9965
6 2001-2002 Food insecure with hunger 627 11039
You can call a function you defined earlier in a script from inside a new function! Here, I am extending the previous function to add a column with each group's percentage of the total count in a given year.
Let’s first try an example.
total_grouping(data, year, gender) %>%
mutate(percent = n_group/n_total * 100, #make a new variable called percent
percent = round(percent, 2),
percent = paste0(percent, "%")) #round it and add a % sign
year gender n_group n_total percent
1 1999-2000 female 5082 9965 51%
2 1999-2000 male 4883 9965 49%
3 2001-2002 female 5708 11039 51.71%
4 2001-2002 male 5331 11039 48.29%
5 2003-2004 female 5152 10122 50.9%
6 2003-2004 male 4970 10122 49.1%
7 2005-2006 female 5268 10348 50.91%
8 2005-2006 male 5080 10348 49.09%
9 2007-2008 female 5053 10149 49.79%
10 2007-2008 male 5096 10149 50.21%
11 2009-2010 female 5312 10537 50.41%
12 2009-2010 male 5225 10537 49.59%
13 2011-2012 female 4900 9756 50.23%
14 2011-2012 male 4856 9756 49.77%
15 2013-2014 female 5172 10175 50.83%
16 2013-2014 male 5003 10175 49.17%
17 2015-2016 female 5079 9971 50.94%
18 2015-2016 male 4892 9971 49.06%
Okay, now we’re ready to generalize. Note that I’m using NSE here too.
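The definition of percent_grouping() is not shown in this tutorial. A minimal sketch that would produce output of the same shape, assuming it simply wraps the mutate() steps from the example above around total_grouping(), might look like this. The toy data frame and its values are made up, and deriving the join key from var inside total_grouping() is an assumption of this sketch.

```r
library(dplyr)

# Toy data standing in for the NHANES file; the values are made up
toy <- data.frame(
  year   = c("A", "A", "A", "B", "B"),
  gender = c("female", "male", "male", "female", "male")
)

# A version of total_grouping() so that this example runs on its own
total_grouping <- function(data, var, grouping){
  key <- rlang::as_name(rlang::enquo(var))
  a <- count(data, {{var}}, {{grouping}})
  b <- count(data, {{var}})
  left_join(a, b, by = key, suffix = c("_group", "_total"))
}

# Hypothetical reconstruction of percent_grouping():
# {{ }} passes the user's column names through to total_grouping()
percent_grouping <- function(data, var, grouping){
  total_grouping(data, {{var}}, {{grouping}}) %>%
    mutate(percent = n_group/n_total * 100,
           percent = round(percent, 2),
           percent = paste0(percent, "%"))
}

res <- percent_grouping(toy, year, gender)
res
```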
percent_grouping(data, year, gender) %>% head()
year gender n_group n_total percent
1 1999-2000 female 5082 9965 51%
2 1999-2000 male 4883 9965 49%
3 2001-2002 female 5708 11039 51.71%
4 2001-2002 male 5331 11039 48.29%
5 2003-2004 female 5152 10122 50.9%
6 2003-2004 male 4970 10122 49.1%
percent_grouping(data, year, race_ethnic) %>% head()
year race_ethnic n_group n_total percent
1 1999-2000 mexican-american 3393 9965 34.05%
2 1999-2000 non-hispanic-black 2228 9965 22.36%
3 1999-2000 non-hispanic-white 3367 9965 33.79%
4 1999-2000 other-hispanic 589 9965 5.91%
5 1999-2000 other-race 388 9965 3.89%
6 2001-2002 mexican-american 2776 11039 25.15%
Now, let’s make a table with our output. These tables will tell us how the demographics of our sample changed from year to year.
# creates an ugly first draft
temp_table <- percent_grouping(data, year, gender) %>%
select(year, gender, percent) %>%
pivot_wider(names_from = gender,
values_from = percent) %>% t(.)
temp_table
[,1] [,2] [,3] [,4] [,5]
year "1999-2000" "2001-2002" "2003-2004" "2005-2006" "2007-2008"
female "51%" "51.71%" "50.9%" "50.91%" "49.79%"
male "49%" "48.29%" "49.1%" "49.09%" "50.21%"
[,6] [,7] [,8] [,9]
year "2009-2010" "2011-2012" "2013-2014" "2015-2016"
female "50.41%" "50.23%" "50.83%" "50.94%"
male "49.59%" "49.77%" "49.17%" "49.06%"
# moves the first row to the title
table <- temp_table[2:3,]
colnames(table) <- temp_table[1,]
rownames(table) <- rownames(table) %>% str_to_title()
#stylized table
table %>% kbl() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE) %>% # gives me a stylized striped table
  row_spec(0, angle = -45) # rotates column names
| | 1999-2000 | 2001-2002 | 2003-2004 | 2005-2006 | 2007-2008 | 2009-2010 | 2011-2012 | 2013-2014 | 2015-2016 |
|---|---|---|---|---|---|---|---|---|---|
| Female | 51% | 51.71% | 50.9% | 50.91% | 49.79% | 50.41% | 50.23% | 50.83% | 50.94% |
| Male | 49% | 48.29% | 49.1% | 49.09% | 50.21% | 49.59% | 49.77% | 49.17% | 49.06% |
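As an aside, the same layout can be reached without transposing by spreading the years into columns directly. This is an alternative sketch, not the approach used in this tutorial, and the toy values below are made up.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the output of percent_grouping(); values are made up
toy_pct <- data.frame(
  year    = c("A", "A", "B", "B"),
  gender  = c("female", "male", "female", "male"),
  percent = c("40%", "60%", "55%", "45%")
)

# One row per gender, one column per year
wide <- toy_pct %>%
  pivot_wider(names_from = year, values_from = percent)

wide
```

The result keeps gender as an ordinary column instead of row names, so it stays a data frame rather than becoming a character matrix.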
I can generalize this with a function so I can make tables for other variables!
my_table <- function(data, var, grouping){
  temp_table <- percent_grouping(data, {{var}}, {{grouping}}) %>%
    select({{var}}, {{grouping}}, percent) %>%
    pivot_wider(names_from = {{grouping}},
                values_from = percent) %>% t(.)
  # drop = FALSE keeps a matrix even if there is only one grouping level
  table <- temp_table[2:nrow(temp_table), , drop = FALSE]
  colnames(table) <- temp_table[1,]
  rownames(table) <- rownames(table) %>% str_to_title()
  table %>% kbl() %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
    row_spec(0, angle = -45)
}
my_table(data, year, gender)
| | 1999-2000 | 2001-2002 | 2003-2004 | 2005-2006 | 2007-2008 | 2009-2010 | 2011-2012 | 2013-2014 | 2015-2016 |
|---|---|---|---|---|---|---|---|---|---|
| Female | 51% | 51.71% | 50.9% | 50.91% | 49.79% | 50.41% | 50.23% | 50.83% | 50.94% |
| Male | 49% | 48.29% | 49.1% | 49.09% | 50.21% | 49.59% | 49.77% | 49.17% | 49.06% |
my_table(data, year, race_ethnic)
| | 1999-2000 | 2001-2002 | 2003-2004 | 2005-2006 | 2007-2008 | 2009-2010 | 2011-2012 | 2013-2014 | 2015-2016 |
|---|---|---|---|---|---|---|---|---|---|
| Mexican-American | 34.05% | 25.15% | 24.89% | 27.51% | 21.25% | 22.63% | 13.89% | 17% | 19.27% |
| Non-Hispanic-Black | 22.36% | 24.29% | 26.31% | 26.19% | 21.79% | 18.57% | 27.5% | 22.28% | 21.35% |
| Non-Hispanic-White | 33.79% | 41.72% | 40.83% | 37.96% | 40.55% | 41.95% | 30.47% | 36.11% | 30.75% |
| Other-Hispanic | 5.91% | 4.68% | 3.37% | 3.37% | 11.83% | 10.75% | 11.03% | 9.43% | 13.12% |
| Other-Race | 3.89% | 4.16% | 4.6% | 4.97% | 4.58% | 6.1% | 17.11% | 15.17% | 15.51% |
my_table(data, year, food_security)
| | 1999-2000 | 2001-2002 | 2003-2004 | 2005-2006 | 2007-2008 | 2009-2010 | 2011-2012 | 2013-2014 | 2015-2016 |
|---|---|---|---|---|---|---|---|---|---|
| Food Insecure With Hunger | 5.01% | 5.68% | 6.33% | 5.83% | 6.25% | 7.99% | 7.98% | 7.3% | 9% |
| Food Insecure Without Hunger | 12.13% | 11.68% | 12.17% | 13.02% | 13.34% | 14.87% | 15.05% | 14.47% | 16.53% |
| Fully Food Secure | 71.27% | 67.97% | 68.39% | 69.8% | 68.25% | 63.18% | 62.41% | 65.41% | 56.67% |
| Marginally Food Secure | 8.92% | 8.5% | 8.58% | 10.1% | 11.31% | 12.82% | 14.05% | 11.63% | 14.37% |
| Na | 2.67% | 6.18% | 4.53% | 1.26% | 0.85% | 1.14% | 0.5% | 1.2% | 3.43% |
Over time, as you use your function with different examples, you may want to tweak it. For example, using the food security variable, I noticed that I never explicitly removed NAs. You may want to always remove NAs. Or, you can give your user the option by creating an argument called “remove.nas” and setting it to TRUE so that NAs are removed by default. Then, if the user wants to see the NAs, they can set it to FALSE.
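A sketch of what that option could look like is below. The function name total_grouping_na(), the toy data, and the choice to drop NA rows before counting (which also removes them from the totals) are all assumptions made for illustration.

```r
library(dplyr)

# Toy data with missing values in the grouping column; values are made up
toy <- data.frame(
  year          = c("A", "A", "A", "B", "B"),
  food_security = c("secure", NA, "insecure", "secure", NA)
)

# Hypothetical variant of total_grouping() with a remove.nas argument
total_grouping_na <- function(data, var, grouping, remove.nas = TRUE){
  if(remove.nas){
    # drop rows where the grouping variable is missing
    data <- filter(data, !is.na({{grouping}}))
  }
  key <- rlang::as_name(rlang::enquo(var))
  a <- count(data, {{var}}, {{grouping}})
  b <- count(data, {{var}})
  left_join(a, b, by = key, suffix = c("_group", "_total"))
}

without_nas <- total_grouping_na(toy, year, food_security)                      # NAs dropped
with_nas    <- total_grouping_na(toy, year, food_security, remove.nas = FALSE)  # NAs kept
```

Whether the totals should include the NA rows is a design decision. If the percentages should be out of all observations, you would filter after counting instead.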
Additionally, I noticed that I made sure that my row names were capitalized, but I didn’t do that for the column names because it wasn’t relevant here (in all my examples the column names were numbers). This is why it is helpful to test your functions under as many different conditions as you can think of!