Aggregate Data in R Using data.table

This page will show you how to aggregate data in R using the data.table package. Easily calculate mean, median, sum or any of the other built-in functions in R across any number of groups.

If you want to follow along with the examples below you will need the data that is used. To get this data, install and load the eeptools package and then open the sample data by running the following code in R:

# Install eeptools
install.packages("eeptools")

# Load eeptools
library(eeptools)

# Open data
data(stulevel)

The first example shows how to aggregate data in R without using the package data.table. Why not use data.table? data.table is not necessary if you want to aggregate without grouping. The example dataset stulevel contains thousands of rows, with each row representing a student. Each student has a test score, represented by the column called ability. If I want to find out the mean score for all students I don’t need to use data table. I can use the command mean() from base R. The below code calculates the mean of ability (ignoring NAs by indicating na.rm=TRUE) and saves the mean in a new object called stulevel_agg_1.

# Average ability for all students
stulevel_agg_1<- mean(stulevel$ability, na.rm=TRUE)

# Show results
stulevel_agg_1
[1] 100.1539

The second example shows how to use data.table to aggregate in R. This example will calculate the mean ability score grouped by grade. The first (optional) step is to install the package data.table if you don’t already have it installed. The second step is to load the data.table package. The third step is to convert the data frame to a data table. The fourth (optional) step is to declare which field you intend to group on. This step is optional but can speed calculations with large datasets. The fifth step is to actually aggregate with data.table.

One downside to this process is that the the names of the new fields are V1, V2, V3, etc. Go to the last example to see how to rename the fields within the data.table process.

# Install data.table (if is isn't already installed)
install.packages("data.table")
# Initialize the data.table package
library(data.table)
# Convert stulevel to data table
stulevel <- data.table(stulevel)
# Declare which variable you want to group on (optional).
# List the name of the data table first, then the name of the field(s).
setkey(stulevel, grade)
# Average ability by grade
stulevel_agg_2 <- as.data.frame(stulevel[, mean(ability, na.rm = TRUE),by = grade])
# Show results
stulevel_agg_2
grade V1
1 3 99.91628
2 4 100.19533
3 5 100.52899
4 6 100.45084
5 7 100.12068
6 8 99.68494

The third example shows how to use data.table to aggregate more than one field. This example will calculate the mean ability score and number of attendance days grouped by grade. This is done by adding j=list() with the names of the fields you want to aggregate listed within the parentheses. See below for details. Note that in this and later examples I will leave out a few of the steps above assuming that you’ve already installed data.table, loaded the data, and setkey.

# Average ability by grade
stulevel_agg_3 <- as.data.frame(stulevel[, j=list(mean(ability, na.rm = TRUE),mean(attday, na.rm = TRUE)),by = grade])
# Show results
stulevel_agg_3
grade V1 V2
1 3 99.91628 167.6380
2 4 100.19533 167.2275
3 5 100.52899 167.6660
4 6 100.45084 168.3025
5 7 100.12068 168.4980
6 8 99.68494 168.2675

The fourth example shows how to use data.table to aggregate more than one field by more than one field. This example will calculate the mean ability score and number of attendance days grouped by year and grade. Grouping by more than one field is accomplished by adding list(year,grade) just after by =. See below for details.

# Average ability by grade
stulevel_agg_4 <- as.data.frame(stulevel[, j=list(mean(ability, na.rm = TRUE),mean(attday, na.rm = TRUE)),by = list(year,grade)])
# Show results
stulevel_agg_4
year grade V1 V2
1 2000 3 100.5550 166.595
2 2001 3 98.2874 168.590
3 2002 3 100.0920 168.205
4 2000 4 101.3839 167.130
5 2001 4 100.5550 166.595
6 2002 4 98.2874 168.590
7 2000 5 100.0755 169.005
8 2001 5 101.3839 167.130
9 2002 5 100.5550 166.595
10 2000 6 100.2684 168.070
11 2001 6 100.0755 169.005
12 2002 6 101.3839 167.130
13 2000 7 100.0920 168.205
14 2001 7 100.2684 168.070
15 2002 7 100.0755 169.005
16 2000 8 98.2874 168.590
17 2001 8 100.0920 168.205
18 2002 8 100.2684 168.070

The fifth example shows how to name the fields within the data.table command. As shown in the above examples the fields that result from data.table are named V1, V2, V3, etc. It’s possible to rename columns after data.table (see http://rprogramming.net/rename-columns-in-r/) but this isn’t very convenient. To rename type the desired field name and then the equals sign in front of the method of aggregation (mean, median, max, etc.).

Note that this method of renaming seems to only work with the j=list() from examples three and four. If you try to rename without j=list() it returns an error.

# Average ability by grade and rename
stulevel_agg_5 <- as.data.frame(stulevel[, j=list(mean_ability = mean(ability, na.rm = TRUE), mean_attendance = mean(attday, na.rm = TRUE)), by = list(year,grade)])
# Show results
stulevel_agg_5
year grade mean_ability mean_attendance
1 2000 3 100.5550 166.595
2 2001 3 98.2874 168.590
3 2002 3 100.0920 168.205
4 2000 4 101.3839 167.130
5 2001 4 100.5550 166.595
6 2002 4 98.2874 168.590
7 2000 5 100.0755 169.005
8 2001 5 101.3839 167.130
9 2002 5 100.5550 166.595
10 2000 6 100.2684 168.070
11 2001 6 100.0755 169.005
12 2002 6 101.3839 167.130
13 2000 7 100.0920 168.205
14 2001 7 100.2684 168.070
15 2002 7 100.0755 169.005
16 2000 8 98.2874 168.590
17 2001 8 100.0920 168.205
18 2002 8 100.2684 168.070

There are other options that can be used with data.table. See official R-manual page on data.table to learn more: http://cran.r-project.org/web/packages/data.table/data.table.pdf.

Thanks for reading! This website took a great deal of time to create. If it was helpful to you, please show it by sharing with friends, liking, or tweeting! If you have any thoughts regarding this R code please post in the comments.

JM

7 thoughts on “Aggregate Data in R Using data.table

  1. Alex Zolot

    Assignment “colnames(stulevel_agg_4) <-…" is not recommended in package data.table, because results in copying all table. Instead setnames() should be used, see "?setnames".

    Reply
    1. Justin@RProgramming.net Post author

      I’ve updated this page by removing the colnames command and renaming within the data.table command. See example five. I think this is the cleanest method.

      Also see for methods for renaming columns.

      Reply
  2. Rajesh Rajamani

    Hi

    Thanks for the article . Nice presentation. I was using data.tables found another method for renaming the column as follows

    Test <- rossman[,.(Sales=sum(Sales)),by=SaleYear]

    Notice the "." followed by the brace enclosure for the calculated column Sales by SaleYear

    methods available in
    https://www.datacamp.com/courses/data-analysis-the-data-table-way?utm_source=data.table%20cheat%20sheet&utm_medium=cheat%20sheet&utm_campaign=data.table%20cheat%20sheet

    dataset.

    Reply
  3. carl sutton

    I am attempting to use the scale function in data table and am totally perplexed. In my project I will be calculating z scores using scale on numerous columns so thought data table, .SD, and .SDcols would be a good way to this. Brute force by using a for loop would work, but that is so un R like. Sample code below

    # attempting to use z scores
    library(data.table)
    dt <- data.table(a = seq(1,10, 1), b = 35:26, c = rep(c(30,40),5))
    dt
    colnames(dt)
    typeof(dt)
    class(dt)
    dt.simple <- dt[, .(z.a = scale(a))]
    dt.simple
    dt <- dt[, list(z.a = scale(a), z.b = scale(b)), by = c]
    dt
    dt <- dt[, list("z.a", "z.b" := list(scale(a), scale(b))), by = c]
    dt
    dt <- dt[, .(':=' z.a = scale(a), z.b = scale(b)), by = c]
    dt
    dt <- dt[, lapply(.SD, scale()), by = c, .SDcols = c("a","b")]
    dt

    from help(":=")
    Fast add, remove and modify subsets of columns, by reference.

    Usage

    # DT[i, LHS:=RHS, by=…]

    # DT[i, c("LHS1","LHS2") := list(RHS1, RHS2), by=…]

    # DT[i, `:=`(LHS1=RHS1,
    # LHS2=RHS2,
    # …), by=…]

    any help is vastly appreciated

    Carl

    Reply
  4. Carl Sutton

    I am wondering why the computations start with “as.data.frame”. Stulevel has been a data.table since row 7. I have recreated agg_2 and agg_3 removing the “as.data.frame”, replacing j = list with a simple period, and named the column for each statement. It works on my machine.

    I have not used the package eeptools so my next thing to do is read up on it.

    Thanks for an inspiring post.

    My code is below
    # Agregate Data in R Using Data Table
    library(eeptools)
    library(data.table)
    data(stulevel)
    stulevel_agg_1 <- mean(stulevel$ability, na.rm = TRUE)
    stulevel_agg_1
    stulevel <- data.table(stulevel)
    setkey(stulevel,grade)
    stulevel_agg_2 <- as.data.frame(stulevel[, mean(ability,na.rm = TRUE), by = grade])
    stulevel_agg_2
    stulevel_agg_3 <- as.data.frame(stulevel[, j = list(mean(ability,na.rm = TRUE),
    mean(attday,na.rm = TRUE)), by = grade])
    stulevel_agg_3
    stulevel_agg_4 <- as.data.frame(stulevel[, j=list(mean(ability,na.rm = TRUE),
    mean(attday,na.rm = TRUE)), by = list(year,grade)])
    stulevel_agg_4
    stulevel_agg_5 <- as.data.frame(stulevel[,j = list(mean_ability = mean(ability, na.rm = TRUE),
    mean_attendance = mean(attday, na.rm = TRUE)), by = list(year, grade)])
    stulevel_agg_5
    stulevel_agg_2 <- stulevel[, .(mean_ability = mean(ability,na.rm = TRUE)), by = grade]
    stulevel_agg_2
    stulevel_agg_3 <- stulevel[, .(mean_ability = mean(ability,na.rm = TRUE),
    mean_attendance = mean(attday,na.rm = TRUE)), by = grade]
    stulevel_agg_3

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *