1 Why R?

R is one of the fastest-growing programming languages and environments for data science, visualization and data processing. It integrates well with Python and with other free and commercial data science tools. It also offers capabilities that some other tools lack, and it is well supported by a growing international community.

It is also complemented by RStudio, a free Integrated Development Environment (IDE) that is now supported by most GC Agencies and Departments and is becoming one of the main tools for data-related problems worldwide. For Microsoft data users, anything you do with Excel, Access or Power BI can also be done with R and RStudio.

1.1 Top 10 R advantages for Data Science

A quick review of the presentations made at the 2021 International Methodology Symposium showed that about one third of the projects presented there were done using R and RStudio. If only the projects dealing with data visualization (including geo/spatial mapping) are counted, this share rises to about two thirds. Clearly, even though Python remains the most frequently used language for machine learning and complex data analysis tasks, R has become the de facto language of choice in many government departments for developing interactive reports, web applications and complex visualizations, thanks to popular packages such as “rmarkdown”, “ggplot2” and “shiny”.

Other advantages of R highlighted at the Symposium include: a common tidy data approach shared across R packages; peer review and curation of packages by CRAN (Comprehensive R Archive Network), which is facilitated by the “devtools” R package designed specifically for package development; and the RStudio-led movement for R education and deployment. All of these contrast with the ‘wild-west’ development of Python packages (a popular expression even among Python users), and they are very important for enabling collaborative development of data science code and knowledge bases.

To summarize, below are our own Top 10 reasons to use R for Data Science - filtered and refined by the members of this community.

  1. Advanced graphics with ggplot2 and its extensions
  2. Automated generation of reports, tutorials and textbooks with R Markdown
  3. Streamlined package development with devtools
  4. Streamlined development and deployment of interactive interfaces and dashboards with Shiny
  5. “Best for geocomputation”*
  6. Common tidy design shared across packages
  7. Curated peer-tested repo of packages at CRAN
  8. RStudio IDE (Integrated Development Environment) on desktop and cloud (rstudio.cloud)
  9. Full support and inter-operability with Python from the same IDE
  10. Global RStudio**-led movement for R education and advancement (rstudio.com)

1.2 Tackling the limitations of R

The main deficiency of R (in particular, when compared to Python) is that many R functions and packages were not designed to be efficient in memory or processing time. By default, i.e., unless special care is taken in developing the code, code written in R can be very slow and can consume a prohibitively large amount of memory, often making it impossible to run data processing code on ordinary laptops (with no more than 16 GB of RAM) such as those used by the majority of GC employees. This is because R functions behave as if every argument, including data arrays and data frames (which are naturally very large), were passed by value: R uses copy-on-modify semantics, so as soon as a function modifies its argument it works on a copy, and the result then has to be returned and assigned back to the original variable. In other programming languages, such as Python, C++, or Java, a function can instead receive only a reference to (or the memory address of) the object to be processed and modify it in place, which is much faster and uses far less memory. This deficiency can, however, be almost entirely eliminated if the R code is developed using the “efficiency-by-design” methodology offered by the ‘data.table’ R package, which can modify data by reference. This is why the R4GC community discussions put so much emphasis on encouraging community members to use the ‘data.table’ class by default (or always) instead of the base ‘data.frame’ class.
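The difference between copy-on-modify and modification by reference can be illustrated with a minimal sketch. It assumes the ‘data.table’ package is installed; the function names add_col_df and add_col_dt are hypothetical and used only for illustration.

```r
library(data.table)

# Base data.frame: the function modifies a local copy,
# so the caller must re-assign the returned value.
add_col_df <- function(df) {
  df$y <- df$x * 2
  df
}

df  <- data.frame(x = 1:5)
df2 <- add_col_df(df)
"y" %in% names(df)    # FALSE: the original data.frame is unchanged
"y" %in% names(df2)   # TRUE: only the returned copy has the new column

# data.table: `:=` adds the column by reference,
# so the caller's object is updated in place and no re-assignment is needed.
add_col_dt <- function(dt) {
  dt[, y := x * 2]
  invisible(dt)
}

dt <- data.table(x = 1:5)
add_col_dt(dt)
"y" %in% names(dt)    # TRUE: the original data.table now has column y
```

On five rows the difference is invisible, but with tables of millions of rows avoiding the intermediate copy is what keeps memory use within the limits of an ordinary laptop.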

Another deficiency of R is that object-oriented programming (OOP) is not as native to R as it is to languages such as Python. There are several ways to implement objects in R, using S3, S4 or R6 classes; however, each has its own limitations. This deficiency is not critical, because fast and memory-efficient code can be developed with or without OOP and, if needed, a portion of the OOP code can be built in Python or C++ and then used from within R through the ‘reticulate’ and ‘Rcpp’ packages, respectively.
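To make the S3 and S4 options concrete, here is a minimal sketch that implements the same toy object both ways using only base R. The class names (account, Account4) and the generics (deposit, deposit4) are hypothetical and chosen only for illustration; R6 is not shown because it requires an additional package.

```r
library(methods)  # S4 support; usually attached by default

# --- S3: a class is just an attribute, methods are found by naming convention ---
new_account <- function(owner, balance = 0) {
  structure(list(owner = owner, balance = balance), class = "account")
}

deposit <- function(x, amount) UseMethod("deposit")

deposit.account <- function(x, amount) {
  x$balance <- x$balance + amount
  x  # a modified copy is returned; the caller must re-assign it
}

acc <- new_account("Ana", 100)
acc <- deposit(acc, 50)

# --- S4: formal class definition with typed slots and explicit generics ---
setClass("Account4", representation(owner = "character", balance = "numeric"))

setGeneric("deposit4", function(x, amount) standardGeneric("deposit4"))

setMethod("deposit4", "Account4", function(x, amount) {
  x@balance <- x@balance + amount
  x
})

acc4 <- new("Account4", owner = "Ana", balance = 100)
acc4 <- deposit4(acc4, 50)
```

In both cases the method returns a modified copy rather than changing the object in place, which is one of the limitations that can make OOP in R feel less natural than in Python.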

1.3 R vs. Python

Main references:

The latter also has a nice summary of PROs vs. CONs for both languages, which I subjectively summarize below:

Python’s PROs:

  • Object-oriented language (this is a big one for me)
  • General purpose, i.e. you can use it elsewhere (e.g. with Raspberry Pi - www.raspberrypi.org, as I did for one of my daughters’ projects)
  • Simple and easy to understand and learn (this could be subjective, but I would generally agree that R is not taught the way I personally would teach it, i.e. based on computer science principles rather than on memorizing a collection of various heuristic tools and functions.)
  • Efficient (fast) packages for advanced machine learning activities, e.g. tensorflow or keras (which you can use from R too, but it could be less efficient), and also a large collection of audio manipulation / recognition packages (some of which I played with for one of my biometrics projects)

Python’s CONs:

  • Not really designed for advanced manipulation or visualization of data

R’s PROs:

  • It is designed specifically for data tasks
  • CRAN provides >10K peer-tested (!) packages for any data task you may think of. (I would also add: it provides great tools to get your OWN work tested and featured one day on CRAN, which is what we are doing now at our Learn R Meet-ups!)
  • ggplot2 graphics are unbeatable
  • Capable of standalone analyses with built-in packages. (Does anyone know what this means?)

R’s CONs:

  • Not the fastest or most memory-efficient (Hey, that’s without the data.table package! I don’t think this holds if you use data.table properly)
  • Object-oriented programming is not native or as easy in R, compared to Python. Actually, I added this one. To me, this is one of the main motivations to see how to use both Python and R, instead of R by itself. For someone else, though, this may not be a deciding factor