The R4GC Book

30 Performance-related databases

30.1 PSES Results interactive analysis and visualization:

URL: https://open-canada.github.io/Apps/pses
Source: https://gccode.ssc-spc.gc.ca/r4gc/codes/pses

No other Open Canada data is of as much common interest across the government as the PSES results [^11]. These data contain information about all GC departments, their organizational structure, and their performance. A Shiny App prototype has been developed to perform the four most-requested tasks for these data (see Figure):

[^11] https://www.canada.ca/en/treasury-board-secretariat/services/innovation/public-service-employee-survey.html

  1. Vertical results tracking: results comparison across an organization, with automated detection and visualization of performance variation across the organization, for any given PSES question.
  2. Horizontal results tracking: results comparison over time, for any given unit, in comparison to the organization and Public Service averages.
  3. Performance summary by theme: automated generation of report cards that show the performance at each level of the organization for each theme, in relation to the rest of the organization and the Public Service average.
  4. Automated generation of a full report with detailed comparative analysis and recommendations, for each unit and each level of the organization.

Figure. Key functionalities of the PSES App prototype: a) vertical results tracking, for any question over the entire organization; b) horizontal results tracking, for any unit over time; and c) performance report summary, for each unit, by theme, in comparison to the Public Service average (shown as crosses) and other units within the organization (shown as small dots). The results can be displayed, filtered, and sorted by score, ranking percentile, organizational structure, number of responses, theme, or question.
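As a minimal sketch of the "vertical results tracking" idea above, per-question scores can be aggregated at each level of the organizational structure with data.table (the book's preferred package for this). The column names and values below are invented for illustration and do not reflect the actual PSES open-data schema:

```r
library(data.table)

# Toy data in the spirit of the PSES results (hypothetical schema)
pses <- data.table(
  branch   = c("A", "A", "A", "B", "B", "B"),
  unit     = c("A1", "A1", "A2", "B1", "B2", "B2"),
  question = "Q01",
  positive = c(70, 80, 60, 90, 50, 55)   # % positive answers
)

# Vertical tracking: mean score per unit and per branch for question Q01
by_unit   <- pses[question == "Q01", .(score = mean(positive)), by = unit]
by_branch <- pses[question == "Q01", .(score = mean(positive)), by = branch]

print(by_unit)
print(by_branch)
```

The same `by =` grouping extends naturally to deeper organizational levels and to all questions at once (`by = .(branch, unit, question)`).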

30.1.1 Other PSES tools and dashboards

TBS Official PSES Results viewer (developed in PowerBI): https://hranalytics-analytiquerh.tbs-sct.gc.ca/Home/ReportList/2019?GoCTemplateCulture=en-CA
Direct link to PowerBI here: https://hranalytics-analytiquerh.tbs-sct.gc.ca/Home/EmbedReport/6?GoCTemplateCulture=en-CA


30.2 ATIP requests dataset

Another Open Canada dataset that relates to the performance of many GC departments is the ATIP requests dataset [^22]. An interactive Natural Language Processing (NLP) application has been developed to enable the analysis and visualization of these requests for each participating department (see Figure 2). Its functionalities include statistics summaries and automated keyword and topic extraction using N-grams, document-term matrices, and Latent Dirichlet Allocation. The topics can be visualized as word clouds or as graphs that connect related words.

[^22] https://open.canada.ca/en/search/ati.

See “Text Mining with R” by Julia Silge and David Robinson (https://www.tidytextmining.com) for definitions of these terms.

Figure 2. Key functionalities of the ATIP App: a) department-specific bi-variable statistics (such as dispositions by year, shown in the image); b) department-specific key topics, visualized as correlated-term graphs; c) key topics for each participating department, visualized as N-gram frequency bars (such as 2-grams, or two-word combinations, shown in the image).
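The N-gram extraction mentioned above can be sketched in a few lines of base R (the tidytext package, used in the book cited just above, does the same more conveniently via `unnest_tokens(..., token = "ngrams", n = 2)`). The request summaries below are invented, not real ATIP records:

```r
# Toy request summaries (invented for illustration)
requests <- c(
  "records related to border wait times",
  "briefing notes on border wait times"
)

# Split each summary into words, then form 2-grams (pairs of adjacent words)
bigrams <- unlist(lapply(strsplit(tolower(requests), "\\s+"), function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Frequency table of the 2-grams, most common first
sort(table(bigrams), decreasing = TRUE)
```

The most frequent 2-grams ("border wait", "wait times" in this toy example) are the kind of terms the App surfaces as candidate key topics.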

30.3 Geo-mapped current, historical and predicted border wait times:

URL: https://open-canada.github.io/Apps/border (redirects to open.canada.ca).
Source: https://gccode.ssc-spc.gc.ca/gorodnichy/simborder

This App (shown in Figure 4) combines open geo-spatial data with open historical and current border wait-time data to predict and visualize delays at Canadian land border crossings. The App is included, under the name iTrack-Border, in the Open Canada Apps Gallery at https://open.canada.ca/en/apps, where more information about it can be found.