Writing Functions

custom functions

This is a tutorial for how to write your first custom function in R.

Sarah Dimakis
05-13-2021

Load libraries and read in data

For this tutorial, you will first need to load tidyverse for data wrangling and kableExtra for tables. Next, I am using rio::import to import the cleaned data file into my environment. Note that you may need to change the file path if your file structure is not the same as mine. Last, I am using mutate and case_when to create a food_security variable that labels the levels of hh_food_secure.

If you would like to copy the code, you can hover over the code, and a “copy code” box should appear.

library(tidyverse)
library(kableExtra)

data <- rio::import(here::here("data", "nhanes_1999-2016.csv")) %>% 
  mutate(food_security = case_when(hh_food_secure == 1 ~ "Fully food secure",
            hh_food_secure == 2 ~ "Marginally food secure",
            hh_food_secure == 3 ~ "Food insecure without hunger",
            hh_food_secure == 4 ~ "Food insecure with hunger"))

Functions

A function is code that carries out an operation. For example, + is a function that carries out addition.

2 + 3
[1] 5

In algebra, you may recall learning functions such as \(f(x,y) = x^2+y\), where you put in inputs x = 2 and y = 1, the function computes the operation, \(2^2+1\), and outputs \(5\). Similarly, a function in R takes a form like f(x, y, z, ...) where f is the function name and x, y, z, ... are the arguments of the function.

`+`(2,3)
[1] 5

Let’s create our first function in R, called my_pet(), which will print out a statement about your pet. As a best practice, give your function a descriptive name. Also avoid reusing the name of a popular function like mean, because your version will mask the default mean function for the rest of your script.

my_pet <- function(pronoun, animal, verb){
  paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
}

The arguments of the function (which are called formals) are what the user supplies to the function to get their desired output. You can see the formals of a function in R by using the formals() function.

formals(my_pet)
$pronoun


$animal


$verb

The body of the function is where the function takes in the formals and creates the output. You can see the body of a function in R by using the body() function.

body(my_pet)
{
    paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", 
        verb, ".")
}

Now, in order to use the function, you can supply it with values for its arguments.

my_pet(pronoun = "she", 
       animal = "dog", 
       verb = "play outside")
[1] "She is a dog who likes to play outside."

When you use your function, you can drop the argument names as long as you keep the same order.

my_pet("he", "cat", "sleep")
[1] "He is a cat who likes to sleep."

If you want to use arguments out of the order they were defined in, you will need to label them.

my_pet(animal = "lizard",
       pronoun ="she", 
       verb = "eat")
[1] "She is a lizard who likes to eat."
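
You can also mix the two styles. The sketch below is only an illustration: describe_pet() is a hypothetical base-R stand-in for my_pet() (it skips str_to_title so that it runs without stringr). R matches the named arguments first and then fills the remaining arguments by position.

```r
# Hypothetical stand-in for my_pet(); it does not capitalize the pronoun
describe_pet <- function(pronoun, animal, verb) {
  paste0(pronoun, " is a ", animal, " who likes to ", verb, ".")
}

# "She" is unnamed, so it fills the first argument that is still unmatched (pronoun)
mixed <- describe_pet("She", verb = "nap", animal = "rabbit")
print(mixed)
```

Labeling everything after the first argument or two tends to make calls easier to read later.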

Default settings

You can set a “default” setting for an argument. This is the setting that occurs when the user does not specify the argument.

my_pet2 <- function(pronoun, animal, verb = "dance"){
  paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
}
my_pet2(pronoun = "she",
        animal = "dog")
[1] "She is a dog who likes to dance."

The user can overwrite the default.

my_pet2(pronoun = "she",
        animal = "dog",
        verb = "cuddle")
[1] "She is a dog who likes to cuddle."
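
If you are unsure what a function's default is, formals() will show it. This is a small sketch using a hypothetical base-R stand-in for my_pet2() (called my_pet_default here, without str_to_title). It also shows that arguments without a default are still required.

```r
# Hypothetical stand-in for my_pet2(); only verb has a default
my_pet_default <- function(pronoun, animal, verb = "dance") {
  paste0(pronoun, " is a ", animal, " who likes to ", verb, ".")
}

# Look up the default value stored in the formals
default_verb <- formals(my_pet_default)$verb
print(default_verb)

# animal has no default, so leaving it out is an error once the function uses it
missing_animal <- tryCatch(my_pet_default("She"), error = function(e) e)
print(inherits(missing_animal, "error"))
```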

Errors and warnings

If someone else (or your future self) is going to use your function, it is helpful to embed errors with stop() and/or warnings with warning() into your code to explain why the code will not work (or if it will not work as expected).

# I am using an if else structure
# If the user inputs "cat" for animal, the function will 
# throw an error and say "Sorry, this function doesn't work
# for people who own cats."
# If they input any other animal, it will work

my_pet3 <- function(pronoun, animal, verb = "stretch"){
  if(animal == "cat"){
    stop("Sorry, this function doesn't work for people who own cats.")
  } else {
    paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
  }
}
# This works
my_pet3("she", "dog")
[1] "She is a dog who likes to stretch."
# This throws an error
my_pet3("he", "cat")
Error in my_pet3("he", "cat"): Sorry, this function doesn't work for people who own cats.

Instead of an error, you might just want to throw a warning, but still allow the function to work.

my_pet4 <- function(pronoun, animal, verb = "stretch"){
  if(animal != "dog"){
    warning(paste0("Really? A ", animal, "? You should really consider getting a dog."))
  }
  paste0(str_to_title(pronoun), " is a ", animal, " who likes to ", verb, ".")
    
}
my_pet4("he", "cat")
Warning in my_pet4("he", "cat"): Really? A cat? You should really
consider getting a dog.
[1] "He is a cat who likes to stretch."
my_pet4("he", "fish", "swim")
Warning in my_pet4("he", "fish", "swim"): Really? A fish? You should
really consider getting a dog.
[1] "He is a fish who likes to swim."
# works as normal
my_pet4("she", "dog")
[1] "She is a dog who likes to stretch."

These are silly examples, but there are many reasons why you would want the function to output a warning or error. For example, you may want the function to warn someone if their input is not going to give them the output they expect, or if the function will not work for certain input. This will help the user work with your function to get their desired output.
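
Here is a less silly sketch of the same idea: checking the type of an input before using it. The function name my_pet_checked and the wording of the message are made up for this illustration, and the function uses only base R. tryCatch() is used here just to capture the error message so the script keeps running.

```r
# Hypothetical variant of my_pet() that validates the animal argument
my_pet_checked <- function(pronoun, animal, verb = "stretch") {
  if (!is.character(animal) || length(animal) != 1) {
    stop("`animal` must be a single character string, such as \"dog\".")
  }
  paste0(pronoun, " is a ", animal, " who likes to ", verb, ".")
}

# Valid input works as usual
ok <- my_pet_checked("She", "dog")
print(ok)

# Invalid input (a number) triggers the error; capture its message
bad <- tryCatch(my_pet_checked("She", 3), error = function(e) conditionMessage(e))
print(bad)
```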

Example function 1

Code that you find yourself writing over and over is a particularly good candidate for a function. You also want each function to be simple and do only one thing. For this first function, I am going to add a “total” column that holds the total number of observations across all levels of a grouping variable.

Before I create a function, I first like to try to make it work in a single case.

a <- count(data, year, gender)
a
        year gender    n
1  1999-2000 female 5082
2  1999-2000   male 4883
3  2001-2002 female 5708
4  2001-2002   male 5331
5  2003-2004 female 5152
6  2003-2004   male 4970
7  2005-2006 female 5268
8  2005-2006   male 5080
9  2007-2008 female 5053
10 2007-2008   male 5096
11 2009-2010 female 5312
12 2009-2010   male 5225
13 2011-2012 female 4900
14 2011-2012   male 4856
15 2013-2014 female 5172
16 2013-2014   male 5003
17 2015-2016 female 5079
18 2015-2016   male 4892
b <- count(data, year)
b
       year     n
1 1999-2000  9965
2 2001-2002 11039
3 2003-2004 10122
4 2005-2006 10348
5 2007-2008 10149
6 2009-2010 10537
7 2011-2012  9756
8 2013-2014 10175
9 2015-2016  9971
# Join a and b by year
# There are two n's so I am also changing the suffix of n so that they are labeled more clearly

left_join(a,b, by = "year", 
          suffix = c("_group", "_total"))
        year gender n_group n_total
1  1999-2000 female    5082    9965
2  1999-2000   male    4883    9965
3  2001-2002 female    5708   11039
4  2001-2002   male    5331   11039
5  2003-2004 female    5152   10122
6  2003-2004   male    4970   10122
7  2005-2006 female    5268   10348
8  2005-2006   male    5080   10348
9  2007-2008 female    5053   10149
10 2007-2008   male    5096   10149
11 2009-2010 female    5312   10537
12 2009-2010   male    5225   10537
13 2011-2012 female    4900    9756
14 2011-2012   male    4856    9756
15 2013-2014 female    5172   10175
16 2013-2014   male    5003   10175
17 2015-2016 female    5079    9971
18 2015-2016   male    4892    9971
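
As an aside, the same two columns can be produced without a join. The sketch below is an alternative and not the approach this tutorial goes on to generalize. It uses a small made-up data frame (toy) instead of the NHANES data, and it assumes dplyr is installed.

```r
library(dplyr)

# Made-up data standing in for the real file
toy <- data.frame(
  year   = c("1999-2000", "1999-2000", "1999-2000", "2001-2002", "2001-2002"),
  gender = c("female", "male", "male", "female", "male")
)

totals <- toy %>%
  count(year, gender, name = "n_group") %>%  # one row per year and gender
  group_by(year) %>%
  mutate(n_total = sum(n_group)) %>%         # add up the groups within each year
  ungroup()

print(totals)
```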

Now, let’s try to generalize it.

total_grouping <- function(data, var, grouping){
  a <- count(data, var, grouping)
  b <- count(data, var)
  c <- left_join(a,b,by = "year", 
          suffix = c("_group", "_total"))
}
total_grouping(data, year, gender)
Error: Must group by variables found in `.data`.
* Column `var` is not found.
* Column `grouping` is not found.

Unfortunately, running this code gives an error saying that the columns var and grouping are not found. This can happen when you write functions around tidyverse functions, because the tidyverse uses what is called non-standard evaluation (NSE). NSE makes the functions really user friendly. For example, when you use dplyr functions like select(data, year) or group_by(data, year), the function knows that year refers to data$year and not to a year object in your global environment. However, this causes trouble when you try to use select() or group_by() (or in our case, count()) in your own function. Inside total_grouping(), count() takes var and grouping literally, looks for columns with those names in the data, and can’t find them. To work around this, we need to tell dplyr to use the column names the user passed in. Here are two ways you can write the function with NSE:

  1. {{}} Syntax
total_grouping <- function(data, var, grouping){
  a <- count(data, {{var}}, {{grouping}}) ##wrap the variables in {{}}
  b <- count(data, {{var}})
  left_join(a, b, by = rlang::as_label(enquo(var)), # join on the column passed as var, not a hard-coded "year"
          suffix = c("_group", "_total"))
}

total_grouping(data, year, gender) %>% head()
       year gender n_group n_total
1 1999-2000 female    5082    9965
2 1999-2000   male    4883    9965
3 2001-2002 female    5708   11039
4 2001-2002   male    5331   11039
5 2003-2004 female    5152   10122
6 2003-2004   male    4970   10122
  2. Quote the variables
total_grouping <- function(data, var, grouping){
  v1 <- enquo(var)  #quote the variables
  v2 <- enquo(grouping)
  
  a <- count(data, !!v1, !!v2) # use !! to evaluate the quoted variables
  b <- count(data, !!v1)
  left_join(a, b, by = rlang::as_label(v1), # join on the column passed as var, not a hard-coded "year"
          suffix = c("_group", "_total"))
}

total_grouping(data, year, gender) %>% head()
       year gender n_group n_total
1 1999-2000 female    5082    9965
2 1999-2000   male    4883    9965
3 2001-2002 female    5708   11039
4 2001-2002   male    5331   11039
5 2003-2004 female    5152   10122
6 2003-2004   male    4970   10122
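
To see the problem and the {{}} fix in isolation, here is a minimal sketch on a made-up data frame. The names pets, pet_species, count_naive, and count_embraced are all invented for this illustration, and the block assumes dplyr is installed. tryCatch() captures the error so the script keeps running.

```r
library(dplyr)

# Made-up data; the column names are illustrative only
pets <- data.frame(
  pet_species = c("dog", "dog", "cat"),
  pet_owner   = c("Ana", "Ben", "Ana")
)

# Without {{ }}, the call fails because dplyr cannot resolve the column the user meant
count_naive <- function(data, var) {
  count(data, var)
}

# With {{ }}, the user's column name is passed through to count()
count_embraced <- function(data, var) {
  count(data, {{ var }})
}

naive_result <- tryCatch(count_naive(pets, pet_species), error = function(e) e)
print(inherits(naive_result, "error"))

embraced_result <- count_embraced(pets, pet_species)
print(embraced_result)
```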

The great thing about functions is that we can now use our function with other variables without having to copy, paste, and edit the variable names by hand. This cuts down on mistakes and makes your code easier to read.

total_grouping(data, year, race_ethnic) %>% 
  head()
       year        race_ethnic n_group n_total
1 1999-2000   mexican-american    3393    9965
2 1999-2000 non-hispanic-black    2228    9965
3 1999-2000 non-hispanic-white    3367    9965
4 1999-2000     other-hispanic     589    9965
5 1999-2000         other-race     388    9965
6 2001-2002   mexican-american    2776   11039
total_grouping(data, year, food_security) %>% 
  head()
       year                food_security n_group n_total
1 1999-2000    Food insecure with hunger     499    9965
2 1999-2000 Food insecure without hunger    1209    9965
3 1999-2000            Fully food secure    7102    9965
4 1999-2000       Marginally food secure     889    9965
5 1999-2000                         <NA>     266    9965
6 2001-2002    Food insecure with hunger     627   11039

Example function 2

You can call a function you defined earlier in your script inside a new function! Here, I am extending the previous function to add a column with each group’s percentage of the total count in a given year.

Let’s first try an example.

total_grouping(data, year, gender) %>% 
  mutate(percent = n_group/n_total * 100, #make a new variable called percent 
         percent = round(percent, 2),
         percent = paste0(percent, "%")) #round it and add a % sign
        year gender n_group n_total percent
1  1999-2000 female    5082    9965     51%
2  1999-2000   male    4883    9965     49%
3  2001-2002 female    5708   11039  51.71%
4  2001-2002   male    5331   11039  48.29%
5  2003-2004 female    5152   10122   50.9%
6  2003-2004   male    4970   10122   49.1%
7  2005-2006 female    5268   10348  50.91%
8  2005-2006   male    5080   10348  49.09%
9  2007-2008 female    5053   10149  49.79%
10 2007-2008   male    5096   10149  50.21%
11 2009-2010 female    5312   10537  50.41%
12 2009-2010   male    5225   10537  49.59%
13 2011-2012 female    4900    9756  50.23%
14 2011-2012   male    4856    9756  49.77%
15 2013-2014 female    5172   10175  50.83%
16 2013-2014   male    5003   10175  49.17%
17 2015-2016 female    5079    9971  50.94%
18 2015-2016   male    4892    9971  49.06%

Okay, now we’re ready to generalize. Note that I’m using NSE here too.

percent_grouping <- function(data, var, grouping){
  total_grouping(data, {{var}}, {{grouping}}) %>% 
  mutate(percent = n_group/n_total * 100,
         percent = round(percent, 2),
         percent = paste0(percent, "%")) 
}
percent_grouping(data, year, gender) %>% head()
       year gender n_group n_total percent
1 1999-2000 female    5082    9965     51%
2 1999-2000   male    4883    9965     49%
3 2001-2002 female    5708   11039  51.71%
4 2001-2002   male    5331   11039  48.29%
5 2003-2004 female    5152   10122   50.9%
6 2003-2004   male    4970   10122   49.1%
percent_grouping(data, year, race_ethnic) %>% head()
       year        race_ethnic n_group n_total percent
1 1999-2000   mexican-american    3393    9965  34.05%
2 1999-2000 non-hispanic-black    2228    9965  22.36%
3 1999-2000 non-hispanic-white    3367    9965  33.79%
4 1999-2000     other-hispanic     589    9965   5.91%
5 1999-2000         other-race     388    9965   3.89%
6 2001-2002   mexican-american    2776   11039  25.15%
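
You may have noticed that some percentages print as 51% and others as 50.9% or 51.71%. That is because round() followed by paste0() drops trailing zeros. If you would rather always show two decimals, one option is sprintf(). This is a sketch of an alternative and is not what percent_grouping() above does. The counts are the first few n_group and n_total values from the output above.

```r
# First three n_group and n_total values from the gender table
n_group <- c(5082, 4883, 5152)
n_total <- c(9965, 9965, 10122)
percent <- n_group / n_total * 100

# What the tutorial's approach produces: trailing zeros are dropped
rounded_then_pasted <- paste0(round(percent, 2), "%")
print(rounded_then_pasted)

# Alternative: always two decimals; %% prints a literal percent sign
fixed_two_decimals <- sprintf("%.2f%%", percent)
print(fixed_two_decimals)
```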

Tables

Now, let’s make a table with our output. These tables will tell us how the demographics of our sample changed from year to year.

# creates an ugly first draft 
temp_table <- percent_grouping(data, year, gender) %>% 
  select(year, gender, percent) %>% 
  pivot_wider(names_from = gender,
              values_from = percent) %>% t(.)

temp_table
       [,1]        [,2]        [,3]        [,4]        [,5]       
year   "1999-2000" "2001-2002" "2003-2004" "2005-2006" "2007-2008"
female "51%"       "51.71%"    "50.9%"     "50.91%"    "49.79%"   
male   "49%"       "48.29%"    "49.1%"     "49.09%"    "50.21%"   
       [,6]        [,7]        [,8]        [,9]       
year   "2009-2010" "2011-2012" "2013-2014" "2015-2016"
female "50.41%"    "50.23%"    "50.83%"    "50.94%"   
male   "49.59%"    "49.77%"    "49.17%"    "49.06%"   
# moves the first row (the years) to the column names
table <- temp_table[2:3,]
colnames(table) <- temp_table[1,] 
rownames(table) <- rownames(table) %>% str_to_title()

#stylized table
table %>% kbl() %>% 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)  %>% #gives me a stylized striped table
  row_spec(0, angle = -45) #rotates column names
       1999-2000 2001-2002 2003-2004 2005-2006 2007-2008 2009-2010 2011-2012 2013-2014 2015-2016
Female       51%    51.71%     50.9%    50.91%    49.79%    50.41%    50.23%    50.83%    50.94%
Male         49%    48.29%     49.1%    49.09%    50.21%    49.59%    49.77%    49.17%    49.06%

I can generalize this with a function so I can make tables for other variables!

my_table <- function(data, var, grouping){
  temp_table <- percent_grouping(data, {{var}}, {{grouping}}) %>% 
    select({{var}}, {{grouping}}, percent) %>% 
    pivot_wider(names_from = {{grouping}},
                values_from = percent) %>% t(.)
  
  table <- temp_table[2:nrow(temp_table),]
  colnames(table) <- temp_table[1,] 
  rownames(table) <- rownames(table) %>% str_to_title()
  
  table %>% kbl() %>% 
    kable_styling(bootstrap_options = "striped", full_width = FALSE) %>% 
    row_spec(0, angle = -45)
}
my_table(data, year, gender)
       1999-2000 2001-2002 2003-2004 2005-2006 2007-2008 2009-2010 2011-2012 2013-2014 2015-2016
Female       51%    51.71%     50.9%    50.91%    49.79%    50.41%    50.23%    50.83%    50.94%
Male         49%    48.29%     49.1%    49.09%    50.21%    49.59%    49.77%    49.17%    49.06%
my_table(data, year, race_ethnic)
                   1999-2000 2001-2002 2003-2004 2005-2006 2007-2008 2009-2010 2011-2012 2013-2014 2015-2016
Mexican-American      34.05%    25.15%    24.89%    27.51%    21.25%    22.63%    13.89%       17%    19.27%
Non-Hispanic-Black    22.36%    24.29%    26.31%    26.19%    21.79%    18.57%     27.5%    22.28%    21.35%
Non-Hispanic-White    33.79%    41.72%    40.83%    37.96%    40.55%    41.95%    30.47%    36.11%    30.75%
Other-Hispanic         5.91%     4.68%     3.37%     3.37%    11.83%    10.75%    11.03%     9.43%    13.12%
Other-Race             3.89%     4.16%      4.6%     4.97%     4.58%      6.1%    17.11%    15.17%    15.51%
my_table(data, year, food_security)
                             1999-2000 2001-2002 2003-2004 2005-2006 2007-2008 2009-2010 2011-2012 2013-2014 2015-2016
Food Insecure With Hunger        5.01%     5.68%     6.33%     5.83%     6.25%     7.99%     7.98%      7.3%        9%
Food Insecure Without Hunger    12.13%    11.68%    12.17%    13.02%    13.34%    14.87%    15.05%    14.47%    16.53%
Fully Food Secure               71.27%    67.97%    68.39%     69.8%    68.25%    63.18%    62.41%    65.41%    56.67%
Marginally Food Secure           8.92%      8.5%     8.58%     10.1%    11.31%    12.82%    14.05%    11.63%    14.37%
Na                               2.67%     6.18%     4.53%     1.26%     0.85%     1.14%      0.5%      1.2%     3.43%

Over time, as you use your function with different examples, you may want to tweak it. For example, by using the food security variable, I noticed that I never explicitly removed NAs. You may want to always remove NAs. Or, you can give your user the option by creating an argument called “remove.nas” that is set to TRUE by default, so NAs are removed. Then, if the user wants to see the NAs, they can set it to FALSE.
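
Here is one way the “remove.nas” idea could look. This is a sketch on made-up data, not a change to the functions above: count_groups, toy, and the status column are invented for this illustration, and the block assumes dplyr is installed.

```r
library(dplyr)

# Made-up data with one missing value in the grouping column
toy <- data.frame(
  year   = c("1999-2000", "1999-2000", "1999-2000", "2001-2002"),
  status = c("a", "b", NA, "a")
)

# Hypothetical counting function with an option to drop NAs (TRUE by default)
count_groups <- function(data, var, grouping, remove.nas = TRUE) {
  if (remove.nas) {
    data <- filter(data, !is.na({{ grouping }}))  # drop rows where the group is missing
  }
  count(data, {{ var }}, {{ grouping }})
}

with_nas_removed <- count_groups(toy, year, status)
with_nas_kept    <- count_groups(toy, year, status, remove.nas = FALSE)

print(with_nas_removed)
print(with_nas_kept)
```

One design choice to be aware of: because the rows are dropped before counting, any totals or percentages computed afterwards would be based on the non-missing rows only.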

Additionally, I noticed that I made sure that my row names were capitalized, but I didn’t do that for the column names because it wasn’t relevant here (in all my examples the column names were numbers). This is why it is helpful to test your functions under as many different conditions as you can think of!