Understanding Data Wrangling in R: Summarizing and slicing your data
This is the fourth blog post in a series of dplyr tutorials. In this tutorial we will summarizing our data: i) counting cases and observations, ii) creating summaries using summarise()
and it’s summarise_all()
, _if()
and _at()
variants, and iii) pulling the maximum and minimum row values.
Content:
Note: as per previous blog posts, I will present everything in the form of a pipe. In some of the below cases, this might not be necessary and it would be just as easy to write it on a single line, but as I want to present options that you can use in your pipes, all below examples will be piped.
The data
As per previous blog posts many of these functions truly shine when you
have a lot of columns, but to make it easy on people to copy paste code
and experiment, I’m using a built-in dataset:
library(dplyr)
msleep <- ggplot2::msleep
glimpse(msleep)
## Observations: 83
## Variables: 11
## $ name <chr> "Cheetah", "Owl monkey", "Mountain beaver", "Grea...
## $ genus <chr> "Acinonyx", "Aotus", "Aplodontia", "Blarina", "Bo...
## $ vore <chr> "carni", "omni", "herbi", "omni", "herbi", "herbi...
## $ order <chr> "Carnivora", "Primates", "Rodentia", "Soricomorph...
## $ conservation <chr> "lc", NA, "nt", "lc", "domesticated", NA, "vu", N...
## $ sleep_total <dbl> 12.1, 17.0, 14.4, 14.9, 4.0, 14.4, 8.7, 7.0, 10.1...
## $ sleep_rem <dbl> NA, 1.8, 2.4, 2.3, 0.7, 2.2, 1.4, NA, 2.9, NA, 0....
## $ sleep_cycle <dbl> NA, NA, NA, 0.1333333, 0.6666667, 0.7666667, 0.38...
## $ awake <dbl> 11.9, 7.0, 9.6, 9.1, 20.0, 9.6, 15.3, 17.0, 13.9,...
## $ brainwt <dbl> NA, 0.01550, NA, 0.00029, 0.42300, NA, NA, NA, 0....
## $ bodywt <dbl> 50.000, 0.480, 1.350, 0.019, 600.000, 3.850, 20.4...
Counting cases and adding counts
Counting the number of observations
The easiest way to know how many observations you have for a specific
variable, is to use count()
. By adding the sort = TRUE
argument, it
immediately returns a sorted table with descending number of
observations:
msleep %>%
count(order, sort = TRUE)
## # A tibble: 19 x 2
## order n
## <chr> <int>
## 1 Rodentia 22
## 2 Carnivora 12
## 3 Primates 12
## 4 Artiodactyla 6
## 5 Soricomorpha 5
## 6 Cetacea 3
## 7 Hyracoidea 3
## 8 Perissodactyla 3
## 9 Chiroptera 2
## 10 Cingulata 2
## 11 Didelphimorphia 2
## 12 Diprotodontia 2
## 13 Erinaceomorpha 2
## 14 Proboscidea 2
## 15 Afrosoricida 1
## 16 Lagomorpha 1
## 17 Monotremata 1
## 18 Pilosa 1
## 19 Scandentia 1
You can add multiple variables to a count()
statement; the example
below is counting by order and vore:
msleep %>%
count(order, vore, sort = TRUE)
## # A tibble: 32 x 3
## order vore n
## <chr> <chr> <int>
## 1 Rodentia herbi 16
## 2 Carnivora carni 12
## 3 Primates omni 10
## 4 Artiodactyla herbi 5
## 5 Cetacea carni 3
## 6 Perissodactyla herbi 3
## 7 Rodentia <NA> 3
## 8 Soricomorpha omni 3
## 9 Chiroptera insecti 2
## 10 Hyracoidea herbi 2
## # ... with 22 more rows
Adding the number of observations in a column
If you’re only interested in counting the total number of cases for a
dataframe, you could use tally()
, which behaves simarly to nrow()
.
You can’t provide a variable to count with tally()
, it only works to
count the overall number of observations. In fact, as is described in
the dplyr
documentation, count()
is a short-hand for group_by()
and tally()
.
msleep %>%
tally()
## # A tibble: 1 x 1
## n
## <int>
## 1 83
More interesting is the add_tally()
function which automatically adds
a column with the overall number of observations. This would be the same
as mutate(n = n())
.
msleep %>%
select(1:3) %>%
add_tally()
## # A tibble: 83 x 4
## name genus vore n
## <chr> <chr> <chr> <int>
## 1 Cheetah Acinonyx carni 83
## 2 Owl monkey Aotus omni 83
## 3 Mountain beaver Aplodontia herbi 83
## 4 Greater short-tailed shrew Blarina omni 83
## 5 Cow Bos herbi 83
## 6 Three-toed sloth Bradypus herbi 83
## 7 Northern fur seal Callorhinus carni 83
## 8 Vesper mouse Calomys <NA> 83
## 9 Dog Canis carni 83
## 10 Roe deer Capreolus herbi 83
## # ... with 73 more rows
Even more interesting is add_count()
which takes a variable as
argument, and adds a column which the number of observations. This saves
the combination of grouping, mutating and ungrouping again.
msleep %>%
select(name:vore) %>%
add_count(vore)
## # A tibble: 83 x 4
## name genus vore n
## <chr> <chr> <chr> <int>
## 1 Cheetah Acinonyx carni 19
## 2 Owl monkey Aotus omni 20
## 3 Mountain beaver Aplodontia herbi 32
## 4 Greater short-tailed shrew Blarina omni 20
## 5 Cow Bos herbi 32
## 6 Three-toed sloth Bradypus herbi 32
## 7 Northern fur seal Callorhinus carni 19
## 8 Vesper mouse Calomys <NA> 7
## 9 Dog Canis carni 19
## 10 Roe deer Capreolus herbi 32
## # ... with 73 more rows
Summarising data
To note: for some functions, dplyr
foresees both an American English
and a UK English variant. The function summarise()
is the equivalent
of summarize()
.
If you just want to know the number of observations count()
does the
job, but to produce summaries of the average, sum, standard deviation,
minimum, maximum of the data, we need summarise()
. To use the function
you just add your new column name, and after the equal sign the
mathematics of what needs to happen: column_name = function(variable)
.
You can add multiple summary functions behind each other.
msleep %>%
summarise(n = n(), average = mean(sleep_total), maximum = max(sleep_total))
## # A tibble: 1 x 3
## n average maximum
## <int> <dbl> <dbl>
## 1 83 10.4 19.9
In most cases, we don’t just want to summarise the whole data table, but
we want to get summaries by a group. To do this, you first need to
specify by which variable(s) you want to divide the data using
group_by()
. You can add one of more variables as arguments in
group_by()
.
msleep %>%
group_by(vore) %>%
summarise(n = n(), average = mean(sleep_total), maximum = max(sleep_total))
## # A tibble: 5 x 4
## vore n average maximum
## <chr> <int> <dbl> <dbl>
## 1 carni 19 10.4 19.4
## 2 herbi 32 9.51 16.6
## 3 insecti 5 14.9 19.9
## 4 omni 20 10.9 18.0
## 5 <NA> 7 10.2 13.7
The summarise()
call works with nearly any aggregate function, and
allows for additional arithmetics:
n()
- gives the number of observationsn_distinct(var)
- gives the numbers of unique values ofvar
sum(var)
,max(var)
,min(var)
, …mean(var)
,median(var)
,sd(var)
,IQR(var)
, …- …
The sample code will average sleep_total and divide by 24, to get the amount of sleep as a fraction of a day.
msleep %>%
group_by(vore) %>%
summarise(avg_sleep_day = mean(sleep_total)/24)
## # A tibble: 5 x 2
## vore avg_sleep_day
## <chr> <dbl>
## 1 carni 0.432
## 2 herbi 0.396
## 3 insecti 0.622
## 4 omni 0.455
## 5 <NA> 0.424
Summarise all
Similarly to the filter, select and mutate functions, summarise()
comes with three additional functions for doing things to multiple
columns in one go:
summarise_all()
will summarise all columns based on your further instructionssummarise_if()
requires a function that returns a boolean. If that is true, the summary instructions will be followedsumarise_at()
requires you to specify columns inside avars()
argument for which the summary will be done.
Summarise all
The function summarise_all()
requires a function as argument, which it
will apply to all columns. The sample code calculates the mean for each column. I had to add the
na.rm = TRUE
argument to ignore NA
values.
msleep %>%
group_by(vore) %>%
summarise_all(mean, na.rm=TRUE)
## # A tibble: 5 x 11
## vore name genus order conservation sleep_total sleep_rem sleep_cycle
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 carni NA NA NA NA 10.4 2.29 0.373
## 2 herbi NA NA NA NA 9.51 1.37 0.418
## 3 insecti NA NA NA NA 14.9 3.52 0.161
## 4 omni NA NA NA NA 10.9 1.96 0.592
## 5 <NA> NA NA NA NA 10.2 1.88 0.183
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
The instructions for summarizing have to be a function. When there is no function available in base R or a package to do what you want, you can either make a function upfront, or make a function on the fly.
The sample code will add 5 to the mean of each column. The function on the
fly can be made by either using funs(mean(., na.rm = TRUE) + 5)
, or
via a tilde: ~mean(., na.rm = TRUE) + 5
.
msleep %>%
group_by(vore) %>%
summarise_all(~mean(., na.rm = TRUE) + 5)
## # A tibble: 5 x 11
## vore name genus order conservation sleep_total sleep_rem sleep_cycle
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 carni NA NA NA NA 15.4 7.29 5.37
## 2 herbi NA NA NA NA 14.5 6.37 5.42
## 3 insecti NA NA NA NA 19.9 8.52 5.16
## 4 omni NA NA NA NA 15.9 6.96 5.59
## 5 <NA> NA NA NA NA 15.2 6.88 5.18
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
Summarise if
The function summarise_if()
requires two arguments:
First it needs information about the columns you want it to consider. This information needs to be a function that returns a boolean value. The easiest cases are functions like
is.numeric
,is.integer
,is.double
,is.logical
,is.factor
,lubridate::is.POSIXt
orlubridate::is.Date
.Secondly, it needs information about how to summarise that data, which needs to be a function. If not a function, you can create a function on the fly using
funs()
or a tilde (see above).
The sample code below will return the average of all numeric columns:
msleep %>%
group_by(vore) %>%
summarise_if(is.numeric, mean, na.rm=TRUE)
## # A tibble: 5 x 7
## vore sleep_total sleep_rem sleep_cycle awake brainwt bodywt
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 carni 10.4 2.29 0.373 13.6 0.0793 90.8
## 2 herbi 9.51 1.37 0.418 14.5 0.622 367
## 3 insecti 14.9 3.52 0.161 9.06 0.0216 12.9
## 4 omni 10.9 1.96 0.592 13.1 0.146 12.7
## 5 <NA> 10.2 1.88 0.183 13.8 0.00763 0.858
One of the downsides of the aggregate summarise functions is that you do
not require a new column title. It therefore might not always be clear
what this new value is (average? median? minimum?). Luckily thanks to
similar rename_*()
functions, it only takes one line extra to rename
them all:
msleep %>%
group_by(vore) %>%
summarise_if(is.numeric, mean, na.rm=TRUE) %>%
rename_if(is.numeric, ~paste0("avg_", .))
## # A tibble: 5 x 7
## vore avg_sleep_total avg_sleep_rem avg_sleep_cycle avg_awake
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 carni 10.4 2.29 0.373 13.6
## 2 herbi 9.51 1.37 0.418 14.5
## 3 insecti 14.9 3.52 0.161 9.06
## 4 omni 10.9 1.96 0.592 13.1
## 5 <NA> 10.2 1.88 0.183 13.8
## # ... with 2 more variables: avg_brainwt <dbl>, avg_bodywt <dbl>
Summarise at
The function summarise_at()
also requires two arguments:
First it needs information about the columns you want it to consider. In this case you need to wrap them inside a
vars()
statement. Insidevars()
you can use anything that can be used inside aselect()
statement. Have a look here if you need more info.Secondly, it needs information about how to summarise that data, which as above needs to be a function. If not a function, you can create a function on the fly using
funs()
or a tilde (see above).
The sample code below will return the average of all columns which contain the word ‘sleep’, and also rename them to “avg_var” for clarity.
msleep %>%
group_by(vore) %>%
summarise_at(vars(contains("sleep")), mean, na.rm=TRUE) %>%
rename_at(vars(contains("sleep")), ~paste0("avg_", .))
## # A tibble: 5 x 4
## vore avg_sleep_total avg_sleep_rem avg_sleep_cycle
## <chr> <dbl> <dbl> <dbl>
## 1 carni 10.4 2.29 0.373
## 2 herbi 9.51 1.37 0.418
## 3 insecti 14.9 3.52 0.161
## 4 omni 10.9 1.96 0.592
## 5 <NA> 10.2 1.88 0.183
Arranging rows
It’s useful if your summary tables are arranged, which is arrange()
’s
job. The default format for numeric variables is to sort ascending, but
you can add the desc()
function in your call to change the default.
For string variables, it will sort alphabetically.
Sorting numeric variables:
arrange(sleep_total)
will arrange it from short sleepers to long
sleepers. In this case I wanted the opposite:
msleep %>%
group_by(vore) %>%
summarise(avg_sleep = mean(sleep_total)) %>%
arrange(desc(avg_sleep))
## # A tibble: 5 x 2
## vore avg_sleep
## <chr> <dbl>
## 1 insecti 14.9
## 2 omni 10.9
## 3 carni 10.4
## 4 <NA> 10.2
## 5 herbi 9.51
If you already grouped your data, you can refer to that group within the
arrange()
statement as well by adding a .by_group = TRUE
statement.
This will sort by descending total sleep time, but within each group.
msleep %>%
select(order, name, sleep_total) %>%
group_by(order) %>%
arrange(desc(sleep_total), .by_group = TRUE)
## # A tibble: 83 x 3
## # Groups: order [19]
## order name sleep_total
## <chr> <chr> <dbl>
## 1 Afrosoricida Tenrec 15.6
## 2 Artiodactyla Pig 9.10
## 3 Artiodactyla Goat 5.30
## 4 Artiodactyla Cow 4.00
## 5 Artiodactyla Sheep 3.80
## 6 Artiodactyla Roe deer 3.00
## 7 Artiodactyla Giraffe 1.90
## 8 Carnivora Tiger 15.8
## 9 Carnivora Lion 13.5
## 10 Carnivora Domestic cat 12.5
## # ... with 73 more rows
Showing only part of your data
In some cases, you don’t just want to show all rows available. Here are some nice shortcuts which can save time
The 5 lowest and highest values
In some cases, you don’t just want to show all rows available. You can
filter of course, but there are some shortcuts for specific needs: if
you want to select the highest 5 cases, you could combine an arrange
call with a head(n=5)
. Or you can use top_n(5)
which will retain
(unsorted) the 5 highest values.
msleep %>%
group_by(order) %>%
summarise(average = mean(sleep_total)) %>%
top_n(5)
## # A tibble: 5 x 2
## order average
## <chr> <dbl>
## 1 Afrosoricida 15.6
## 2 Chiroptera 19.8
## 3 Cingulata 17.8
## 4 Didelphimorphia 18.7
## 5 Pilosa 14.4
The five lowest values can be found using top_n(-5)
:
msleep %>%
group_by(order) %>%
summarise(average = mean(sleep_total)) %>%
top_n(-5)
## # A tibble: 5 x 2
## order average
## <chr> <dbl>
## 1 Artiodactyla 4.52
## 2 Cetacea 4.50
## 3 Hyracoidea 5.67
## 4 Perissodactyla 3.47
## 5 Proboscidea 3.60
If you have more than one column, you can add the variable you want it to use. The sample code will retain the 5 highest values of average_sleep.
msleep %>%
group_by(order) %>%
summarise(average_sleep = mean(sleep_total), max_sleep = max(sleep_total)) %>%
top_n(5, average_sleep)
## # A tibble: 5 x 3
## order average_sleep max_sleep
## <chr> <dbl> <dbl>
## 1 Afrosoricida 15.6 15.6
## 2 Chiroptera 19.8 19.9
## 3 Cingulata 17.8 18.1
## 4 Didelphimorphia 18.7 19.4
## 5 Pilosa 14.4 14.4
A random selection of rows
Using sample_n()
you can sample a random selection of rows.
Alternative is sample_frac()
allowing you to randomly select a
fraction of rows (here 10%).
msleep %>%
sample_frac(.1)
## # A tibble: 8 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Africa~ Rhab~ omni Rode~ <NA> 8.70 NA NA
## 2 Giraffe Gira~ herbi Arti~ cd 1.90 0.400 NA
## 3 Northe~ Onyc~ carni Rode~ lc 14.5 NA NA
## 4 Squirr~ Saim~ omni Prim~ <NA> 9.60 1.40 NA
## 5 Three-~ Brad~ herbi Pilo~ <NA> 14.4 2.20 0.767
## 6 Galago Gala~ omni Prim~ <NA> 9.80 1.10 0.550
## 7 "Vole " Micr~ herbi Rode~ <NA> 12.8 NA NA
## 8 Caspia~ Phoca carni Carn~ vu 3.50 0.400 NA
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>
A user-defined slice of rows
The head()
call will standard show the first 6 rows, which can be
modified by adding a n-argument: head(n=10)
. Similarly tail()
will
show the final 6 rows, which again can be modified by adding a
n-argument. If you want to slice somewhere in the middle, you can use
slice()
. The sample code will show rows 50 to 55.
msleep %>%
slice(50:55)
## # A tibble: 6 x 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chimp~ Pan omni Prim~ <NA> 9.70 1.40 1.42
## 2 Tiger Panth~ carni Carn~ en 15.8 NA NA
## 3 Jaguar Panth~ carni Carn~ nt 10.4 NA NA
## 4 Lion Panth~ carni Carn~ vu 13.5 NA NA
## 5 Baboon Papio omni Prim~ <NA> 9.40 1.00 0.667
## 6 Deser~ Parae~ <NA> Erin~ lc 10.3 2.70 NA
## # ... with 3 more variables: awake <dbl>, brainwt <dbl>, bodywt <dbl>