7 Use R efficiently with data.table!

## TL;DR

The data.table package, developed by Matt Dowle, is a game changer for many data scientists.

Learn about it, and use it always, by default. For any code you write, your first lines of code should be:

library(data.table)
dtIris <- data.table(iris)        # build a new data.table from iris, or
df <- iris; dtIris <- setDT(df)   # convert df to a data.table by reference
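As a minimal sketch of what this looks like in day-to-day use, the example below applies the general `dt[i, j, by]` form to the built-in `iris` data. The names `dtMeans`, `meanPetal`, `n` and `Sepal.Area` are made up for illustration and do not come from the text above.

```r
library(data.table)

dtIris <- data.table(iris)  # copy of the built-in iris data set

# dt[i, j, by]: filter rows (i), compute summaries (j), per group (by)
dtMeans <- dtIris[Sepal.Length > 5,
                  .(meanPetal = mean(Petal.Length), n = .N),
                  by = Species]

# := adds (or updates) a column by reference, without reassigning dtIris
dtIris[, Sepal.Area := Sepal.Length * Sepal.Width]

print(dtMeans)
print(names(dtIris))
```

Note that the last assignment needs no `dtIris <- ...` on the left: the column is added to the existing table.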

Where to learn:

7.1 data.table vs. dplyr

data.table (Computer language) way vs. dplyr (“English language”) way

  1. The best: no wasted computations and no new memory allocations. Only the matching rows are touched.
     dtLocations %>% .[WLOC == 4313, WLOC := 4312]
  2. The table is not copied, but computations are done on ALL rows.
     dtLocations %>% .[, WLOC := ifelse(WLOC == 4313, 4312, WLOC)]
  3. The worst: computations are done on ALL rows. Furthermore, the entire data set is copied from one memory location to another. (Imagine your data has 1 million cells, of which only 10 need to be changed!)
     dtLocations <- dtLocations %>% mutate(WLOC = ifelse(WLOC == 4313, 4312, WLOC))

NB: dtLocations %>% .[] is the same as dtLocations[], so you can use it in pipes.
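The three variants above can be tried side by side on a small table. This is only a sketch: the toy `dtLocations` below (with a made-up `id` column and six rows) is an assumption for illustration, each variant runs on its own `copy()` so the results can be compared, and the plain bracket form is used in place of the pipe. The dplyr variant runs only if dplyr happens to be installed.

```r
library(data.table)

# Hypothetical toy data; WLOC is stored as double so that := 4312 needs no coercion
dtLocations <- data.table(id   = 1:6,
                          WLOC = c(4313, 1000, 4313, 2000, 3000, 4313))

# 1. Update only the matching rows, by reference
dt1 <- copy(dtLocations)
dt1[WLOC == 4313, WLOC := 4312]

# 2. Update by reference, but ifelse() is evaluated for every row
dt2 <- copy(dtLocations)
dt2[, WLOC := ifelse(WLOC == 4313, 4312, WLOC)]

# 3. dplyr style: evaluate every row and assign the result back to the name
if (requireNamespace("dplyr", quietly = TRUE)) {
  dt3 <- dplyr::mutate(copy(dtLocations),
                       WLOC = ifelse(WLOC == 4313, 4312, WLOC))
}

# The first two variants end up with the same values
print(identical(dt1$WLOC, dt2$WLOC))
```

All variants produce the same column; they differ only in how much work is done to get there, which is what matters on large tables.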

7.2 Extensions of data.table

Considerable effort has gone into marrying the data.table package with the dplyr package. Here are some notable attempts: