7 Use R efficiently with data.table!
## TL;DR
- The data.table package, developed by Matt Dowle, is a game changer for many data scientists. Learn about it and use it always, by default.
- Whatever code you write, your first lines should, by default, be:
library(data.table)
dtIris <- data.table(iris) # build a data.table from a data.frame (makes a copy), or
df <- iris; dtIris <- setDT(df) # convert df to a data.table by reference
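The two conversions behave differently, which matters once tables get large. The following is a minimal sketch on a small, made-up data frame (the names `df` and `dtCopy` and the values are assumptions for illustration only): `data.table()` returns a new object, whereas `setDT()` converts the existing object in place.

```r
library(data.table)

# Hypothetical toy data frame, used only for illustration
df <- data.frame(id = 1:3, value = c(10, 20, 30))

# data.table() builds a new object; df itself stays a plain data.frame
dtCopy <- data.table(df)
is.data.table(df)      # FALSE

# setDT() converts df itself, by reference, without creating a second object
setDT(df)
is.data.table(df)      # TRUE
```

For small objects the difference is negligible; for large ones, `setDT()` avoids holding two copies of the data in memory.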
Where to learn:
http://r-datatable.com (https://rdatatable.gitlab.io/data.table/)
https://www.datacamp.com/courses/time-series-with-datatable-in-r
https://www.datacamp.com/courses/data-manipulation-in-r-with-datatable
https://rpubs.com/josemz/SDbf - Making .SD your best friend
7.1 data.table vs. dplyr
The data.table (“computer language”) way vs. the dplyr (“English language”) way of recoding a value, from best to worst:
- The best: no wasted computations and no new memory allocations. Only the matching rows are touched, in place: `dtLocations %>% .[WLOC == 4313, WLOC := 4312]`
- The table is not copied, but the computation is done on ALL rows: `dtLocations %>% .[, WLOC := ifelse(WLOC == 4313, 4312, WLOC)]`
- The worst: the computation is done on ALL rows and, furthermore, the data is copied from one memory location to another. (Imagine your data has 1 million cells, of which only 10 need to be changed!) `dtLocations <- dtLocations %>% mutate(WLOC = ifelse(WLOC == 4313, 4312, WLOC))`

NB: `dtLocations %>% .[]` is the same as `dtLocations[]`, so you can use data.table syntax in pipes.
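To check the in-place claim, here is a small sketch on made-up data (the table contents and the names `dtAllRows` and `addrBefore` are assumptions; only `WLOC` and its codes come from the text). It uses `address()` from data.table to confirm that the table is the same object before and after the update, and compares the first two approaches. The pipe is left out so the sketch needs only data.table.

```r
library(data.table)

# Hypothetical toy data; the WLOC codes mirror the example above
dtLocations <- data.table(id = 1:5, WLOC = c(4313, 1000, 4313, 2000, 3000))
dtAllRows <- copy(dtLocations)   # independent copy for the second approach

addrBefore <- address(dtLocations)

# Approach 1: the i filter selects matching rows, := updates only those, in place
dtLocations[WLOC == 4313, WLOC := 4312]

# Approach 2: ifelse() is evaluated for every row, then the whole column is reassigned
dtAllRows[, WLOC := ifelse(WLOC == 4313, 4312, WLOC)]

identical(address(dtLocations), addrBefore)  # same object, the table was not copied
all.equal(dtLocations, dtAllRows)            # both approaches give the same values
```

On five rows the cost difference is invisible; it grows with the number of rows that the first approach never has to touch.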
7.2 Extensions of data.table
There is considerable effort to marry the data.table package with the dplyr package. Here are some notable projects:
- https://github.com/tidyverse/dtplyr (Version: 1.1.0, Published: 2021-02-20, From Hadley himself - I found it quite cumbersome still though…)
- https://github.com/asardaes/table.express (Version: 0.3.1 Published: 2019-09-07 - somewhat easier?)
- https://github.com/markfairbanks/tidytable (Version: 0.6.2 Published: 2021-05-18 - seems to be the best supported of the three?)
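As an illustration of what this marriage looks like in practice, here is a minimal sketch using dtplyr, assuming the dplyr and dtplyr packages are installed. The toy table and the names `lazyLocations`, `recoded`, and `dtResult` are made up for the example; `lazy_dt()`, `show_query()`, and `as.data.table()` are the entry points as I understand them, so check the package documentation for your version.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Hypothetical toy data, reusing the WLOC codes from section 7.1
dtLocations <- data.table(id = 1:5, WLOC = c(4313, 1000, 4313, 2000, 3000))

# lazy_dt() wraps the table; dplyr verbs are recorded, not run yet
lazyLocations <- lazy_dt(dtLocations)

recoded <- lazyLocations %>%
  mutate(WLOC = ifelse(WLOC == 4313, 4312, WLOC)) %>%
  filter(WLOC == 4312)

show_query(recoded)                 # prints the generated data.table call
dtResult <- as.data.table(recoded)  # runs the translation and returns a data.table
```

Inspecting the output of `show_query()` is a convenient way to learn the data.table idiom behind a dplyr pipeline.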