This page will show you how to aggregate data in R using the data.table package. Easily calculate mean, median, sum or any of the other built-in functions in R across any number of groups.
If you want to follow along with the examples below you will need the data that is used. To get this data, install and load the eeptools package and then open the sample data by running the following code in R:
# Install eeptools install.packages("eeptools") # Load eeptools library(eeptools) # Open data data(stulevel)
The first example shows how to aggregate data in R without using the package data.table. Why not use data.table? data.table is not necessary if you want to aggregate without grouping. The example dataset stulevel contains thousands of rows, with each row representing a student. Each student has a test score, represented by the column called ability. If I want to find out the mean score for all students I don’t need to use data table. I can use the command mean() from base R. The below code calculates the mean of ability (ignoring NAs by indicating na.rm=TRUE) and saves the mean in a new object called stulevel_agg_1.
# Average ability for all students stulevel_agg_1<- mean(stulevel$ability, na.rm=TRUE) # Show results stulevel_agg_1 [1] 100.1539
The second example shows how to use data.table to aggregate in R. This example will calculate the mean ability score grouped by grade. The first (optional) step is to install the package data.table if you don’t already have it installed. The second step is to load the data.table package. The third step is to convert the data frame to a data table. The fourth (optional) step is to declare which field you intend to group on. This step is optional but can speed calculations with large datasets. The fifth step is to actually aggregate with data.table.
One downside to this process is that the the names of the new fields are V1, V2, V3, etc. Go to the last example to see how to rename the fields within the data.table process.
# Install data.table (if is isn't already installed) install.packages("data.table")
# Initialize the data.table package library(data.table)
# Convert stulevel to data table stulevel <- data.table(stulevel)
# Declare which variable you want to group on (optional). # List the name of the data table first, then the name of the field(s). setkey(stulevel, grade)
# Average ability by grade stulevel_agg_2 <- as.data.frame(stulevel[, mean(ability, na.rm = TRUE),by = grade])
# Show results stulevel_agg_2
grade V1 1 3 99.91628 2 4 100.19533 3 5 100.52899 4 6 100.45084 5 7 100.12068 6 8 99.68494
The third example shows how to use data.table to aggregate more than one field. This example will calculate the mean ability score and number of attendance days grouped by grade. This is done by adding j=list() with the names of the fields you want to aggregate listed within the parentheses. See below for details. Note that in this and later examples I will leave out a few of the steps above assuming that you’ve already installed data.table, loaded the data, and setkey.
# Average ability by grade stulevel_agg_3 <- as.data.frame(stulevel[, j=list(mean(ability, na.rm = TRUE),mean(attday, na.rm = TRUE)),by = grade])
# Show results stulevel_agg_3
grade V1 V2 1 3 99.91628 167.6380 2 4 100.19533 167.2275 3 5 100.52899 167.6660 4 6 100.45084 168.3025 5 7 100.12068 168.4980 6 8 99.68494 168.2675
The fourth example shows how to use data.table to aggregate more than one field by more than one field. This example will calculate the mean ability score and number of attendance days grouped by year and grade. Grouping by more than one field is accomplished by adding list(year,grade) just after by =. See below for details.
# Average ability by grade stulevel_agg_4 <- as.data.frame(stulevel[, j=list(mean(ability, na.rm = TRUE),mean(attday, na.rm = TRUE)),by = list(year,grade)])
# Show results
stulevel_agg_4
year grade V1 V2 1 2000 3 100.5550 166.595 2 2001 3 98.2874 168.590 3 2002 3 100.0920 168.205 4 2000 4 101.3839 167.130 5 2001 4 100.5550 166.595 6 2002 4 98.2874 168.590 7 2000 5 100.0755 169.005 8 2001 5 101.3839 167.130 9 2002 5 100.5550 166.595 10 2000 6 100.2684 168.070 11 2001 6 100.0755 169.005 12 2002 6 101.3839 167.130 13 2000 7 100.0920 168.205 14 2001 7 100.2684 168.070 15 2002 7 100.0755 169.005 16 2000 8 98.2874 168.590 17 2001 8 100.0920 168.205 18 2002 8 100.2684 168.070
The fifth example shows how to name the fields within the data.table command. As shown in the above examples the fields that result from data.table are named V1, V2, V3, etc. It’s possible to rename columns after data.table (see http://rprogramming.net/rename-columns-in-r/) but this isn’t very convenient. To rename type the desired field name and then the equals sign in front of the method of aggregation (mean, median, max, etc.).
Note that this method of renaming seems to only work with the j=list() from examples three and four. If you try to rename without j=list() it returns an error.
# Average ability by grade and rename stulevel_agg_5 <- as.data.frame(stulevel[, j=list(mean_ability = mean(ability, na.rm = TRUE), mean_attendance = mean(attday, na.rm = TRUE)), by = list(year,grade)])
# Show results
stulevel_agg_5
year grade mean_ability mean_attendance 1 2000 3 100.5550 166.595 2 2001 3 98.2874 168.590 3 2002 3 100.0920 168.205 4 2000 4 101.3839 167.130 5 2001 4 100.5550 166.595 6 2002 4 98.2874 168.590 7 2000 5 100.0755 169.005 8 2001 5 101.3839 167.130 9 2002 5 100.5550 166.595 10 2000 6 100.2684 168.070 11 2001 6 100.0755 169.005 12 2002 6 101.3839 167.130 13 2000 7 100.0920 168.205 14 2001 7 100.2684 168.070 15 2002 7 100.0755 169.005 16 2000 8 98.2874 168.590 17 2001 8 100.0920 168.205 18 2002 8 100.2684 168.070
There are other options that can be used with data.table. See official R-manual page on data.table to learn more: http://cran.r-project.org/web/packages/data.table/data.table.pdf.
Practice
To practice aggregating data in R, try the exercises in this manipulating data tutorial.
Thanks for reading! This website took a great deal of time to create. If it was helpful to you, please show it by sharing with friends, liking, or tweeting! If you have any thoughts regarding this R code please post in the comments.
Thanks so much for this. It saved my day to a large extent
Great! I’m happy I could help.
Assignment “colnames(stulevel_agg_4) <-…" is not recommended in package data.table, because results in copying all table. Instead setnames() should be used, see "?setnames".
I’ve updated this page by removing the colnames command and renaming within the data.table command. See example five. I think this is the cleanest method.
Also see for methods for renaming columns.
Hi
Thanks for the article . Nice presentation. I was using data.tables found another method for renaming the column as follows
Test <- rossman[,.(Sales=sum(Sales)),by=SaleYear]
Notice the "." followed by the brace enclosure for the calculated column Sales by SaleYear
methods available in
https://www.datacamp.com/courses/data-analysis-the-data-table-way?utm_source=data.table%20cheat%20sheet&utm_medium=cheat%20sheet&utm_campaign=data.table%20cheat%20sheet
dataset.
I am attempting to use the scale function in data table and am totally perplexed. In my project I will be calculating z scores using scale on numerous columns so thought data table, .SD, and .SDcols would be a good way to this. Brute force by using a for loop would work, but that is so un R like. Sample code below
# attempting to use z scores
library(data.table)
dt <- data.table(a = seq(1,10, 1), b = 35:26, c = rep(c(30,40),5))
dt
colnames(dt)
typeof(dt)
class(dt)
dt.simple <- dt[, .(z.a = scale(a))]
dt.simple
dt <- dt[, list(z.a = scale(a), z.b = scale(b)), by = c]
dt
dt <- dt[, list("z.a", "z.b" := list(scale(a), scale(b))), by = c]
dt
dt <- dt[, .(':=' z.a = scale(a), z.b = scale(b)), by = c]
dt
dt <- dt[, lapply(.SD, scale()), by = c, .SDcols = c("a","b")]
dt
from help(":=")
Fast add, remove and modify subsets of columns, by reference.
Usage
# DT[i, LHS:=RHS, by=…]
# DT[i, c("LHS1","LHS2") := list(RHS1, RHS2), by=…]
# DT[i, `:=`(LHS1=RHS1,
# LHS2=RHS2,
# …), by=…]
any help is vastly appreciated
Carl
I am wondering why the computations start with “as.data.frame”. Stulevel has been a data.table since row 7. I have recreated agg_2 and agg_3 removing the “as.data.frame”, replacing j = list with a simple period, and named the column for each statement. It works on my machine.
I have not used the package eeptools so my next thing to do is read up on it.
Thanks for an inspiring post.
My code is below
# Agregate Data in R Using Data Table
library(eeptools)
library(data.table)
data(stulevel)
stulevel_agg_1 <- mean(stulevel$ability, na.rm = TRUE)
stulevel_agg_1
stulevel <- data.table(stulevel)
setkey(stulevel,grade)
stulevel_agg_2 <- as.data.frame(stulevel[, mean(ability,na.rm = TRUE), by = grade])
stulevel_agg_2
stulevel_agg_3 <- as.data.frame(stulevel[, j = list(mean(ability,na.rm = TRUE),
mean(attday,na.rm = TRUE)), by = grade])
stulevel_agg_3
stulevel_agg_4 <- as.data.frame(stulevel[, j=list(mean(ability,na.rm = TRUE),
mean(attday,na.rm = TRUE)), by = list(year,grade)])
stulevel_agg_4
stulevel_agg_5 <- as.data.frame(stulevel[,j = list(mean_ability = mean(ability, na.rm = TRUE),
mean_attendance = mean(attday, na.rm = TRUE)), by = list(year, grade)])
stulevel_agg_5
stulevel_agg_2 <- stulevel[, .(mean_ability = mean(ability,na.rm = TRUE)), by = grade]
stulevel_agg_2
stulevel_agg_3 <- stulevel[, .(mean_ability = mean(ability,na.rm = TRUE),
mean_attendance = mean(attday,na.rm = TRUE)), by = grade]
stulevel_agg_3