Preface
Introduction
Many Government of Canada groups are developing code to process and visualize various kinds of data, often duplicating each other’s efforts, with sub-optimal efficiency and limited code quality review. This book presents a pan-government initiative to address this problem. The idea is to collaboratively build a common repository of code and a knowledge base that anyone in the government can use to perform many common data science tasks, and, in doing so, to help each other master both data science coding skills and industry-standard collaborative practices. The book explains why the R language was chosen as the language for collaborative data science code development. It summarizes R’s advantages and addresses its limitations, establishes a taxonomy of the discussion topics of highest interest to GC data scientists working with R, provides an overview of the collaborative platforms used, and presents the results obtained to date. Even though the code knowledge base is developed mainly in R, it is meant to be valuable also for data scientists coding in Python and other development environments.
Data as ‘electricity’ of the 21st century
Data is the ‘electricity’ of the 21st century. It is everywhere. Everybody is using it, and one may not need special training or certification to work with data. However, in contrast to working with electricity, working with data relies on data processing tools (code) whose number and complexity increase daily, and which are developed by other data practitioners. In other words, data science - by its nature - is a science fueled by the collaboration of data scientists, one that relies heavily on the quality of shared data processing code.
Globally, and in the Government of Canada (GC) in particular, data practitioners come from many different backgrounds and may not have equal levels of programming training, which hinders the development of high-quality (efficient, scalable, and re-usable) code for data science problems. This gap, and the need for collaboration among data scientists, specifically those coding in the R language, was raised at the GC Data Conference Workshop on Data Engineering in February 2021. This book addresses this need.
Data Engineering challenge
This effort started by addressing the problem of data engineering, understood - in analogy with the IEEE definition of software engineering [2] - as the field of science and technology that deals with “developing scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation” of data-driven systems and solutions; or - in analogy with the definition of software engineering at Google [3] - as the field that “encompasses not just the act of writing code [for data analysis, in our case], but all of the tools and processes an organization uses to build and maintain that code over time”.
Towards open source and open science
The spectacular growth of data science tool development is overwhelmingly attributed to the collaborative, open nature of current data science code development practices. Consequently, the Canadian government is also adopting open-source industry standards for data coding and reporting. A major shift towards enabling and promoting such practices within the government started recently, when the Government of Canada adopted a number of policies in support of Digital, Open Science, and Open Government [4], when Shared Services Canada deployed a number of collaboration platforms that are now available to all GC organizations, and when the IT Security teams of most GC departments approved the use of open-source data science tools such as R and RStudio.
Pedagogical approach
The approach to teaching (mastering) R in this book is very different from that of most (all?) other open-source R tutorials one will find on the Web.
It is driven by the personal and professional backgrounds and experiences of the authors (one comes from Computing Science, the other from Open Government) working with R and open data.
Throughout this book, as throughout all the meetups and discussions of the authors in the R4GC community portal, a consistent, deliberate effort is made to move away from judging the quality of your R skills by how good your outputs (graphs, reports, etc.) are, and towards judging how good the code that generated those outputs is:
- Is it easy to understand (without having to read comments or documentation)?
- Is it easy to debug?
- Is it modular?
- Is it as simple as possible?
- Is it re-usable?
- Is it scalable (to larger, or different data)?
- Does it have the fewest possible dependencies (packages), and are those it does use the best (most efficient, robust, and well supported) for the purpose?
We focus on ensuring that, from the very first lines, your code is written properly, i.e., it conforms to all of the above criteria.
To achieve this, in contrast to all other tutorials (such as those provided by the fabulous RStudio team), the first package we always introduce is ‘data.table’, and not anything else. All other packages and code will be added later, and through the use of ‘data.table’ by default.
Why? Read the section on R limitations @ref{r_limitations}.
Importantly, you will soon see that your code will not only run much faster - so you can run it even on your regular 16 GB RAM laptop - it will also take less screen space and be easier to read!
And … you will never have to compare your code to an Italian pasta dish (“spaghetti”); it will get used, and possibly further extended, by others, and bring more joy to your life.
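As a small taste of this style, here is a minimal sketch of the kind of ‘data.table’ code promoted throughout this book (the file name sales.csv and the columns region and amount are hypothetical, used only for illustration):

```r
# A minimal 'data.table' sketch with hypothetical data: read a CSV
# quickly and compute a grouped summary in one short, readable line.
library(data.table)

# fread() is data.table's fast, memory-friendly file reader.
sales <- fread("sales.csv")  # hypothetical file

# Mean of 'amount' and row count (.N) for each 'region',
# all in a single expression.
sales[, .(mean_amount = mean(amount), n_rows = .N), by = region]
```

The equivalent base-R pipeline (read.csv() followed by aggregate()) typically takes more lines and, on large files, noticeably more time and memory.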
Book outline
The book is organized into several parts, according to the topics of interest discussed in the GCcollab R4GC group. Discussions around these topics are reviewed and updated regularly, commonly as part of the weekly community meetups.
Part I is dedicated to General discussions, which include the following:
- Why R?
- Learn R: Right way!
- Great open-source textbooks
- Events and forums for R users
- Using R with GC data infrastructure
- Open source policies and guidelines
Part II is dedicated to the Best Practices and Efficient Coding in R and includes the following:
- Use R efficiently with data.table!
- R and Python, Unite!
- From Excel to R
- Reading various kinds of data in R
- Other tips and rules for coding well in R
Part III is dedicated to Visualization and Reporting and includes the following:
- Literate programming and automated reports with ‘rmarkdown’
- Data visualization with ‘ggplot2’ and its extensions
- Interactive interfaces, applications and dashboards with ‘shiny’
- Interactive HTML with ‘plotly’, ‘Datatable’, ‘reactable’
- Geo/Spatial coding and visualization in R
Part IV is dedicated to the Advanced Use of Data Science, Machine Learning, and AI and includes:
- Entity resolution and record linking in R
- Text Analysis in R
- Machine Learning and Modeling in R
- Computer vision and Deep learning in R
- Simulation and Optimization in R
- Operationalizing in-house built tools and models
Part V contains the Tutorials developed by and for the community. These are:
- GCCode 101
- Packages 101
- R101: Building a COVID-19 Tracker App from scratch
- Geo/Spatial coding and visualization with R, Part 1
- Text Analysis with R, Part 1
and a number of short “How To” tutorials, such as:
- Dual Coding - Python and R unite!
- Working with ggtables
- Automating a common look and feel for your ggplot graphs, and others
Part VI provides information about Shiny web apps developed by community members.
Finally, the Appendix includes agendas, notes, and code from the R4GC community “Lunch and Learn” meetups.