18 Data Engineering, Record Linking and Deduplication

18.1 TL;DL

  • Demo Tool: See http://rcanada.shinyapps.io/demo and the #GCData2021 Data Engineering workshop presentation for the backgrounder and the demonstration of various Data Engineering tasks and solutions. The source is in GCCode.

18.2 Intro: What is Data Engineering

Data Engineering refers to all tasks related to automation of data processing within the “insight-from-data” pipeline.

18.2.1 Data Engineering vs. Software Engineering

In analogy to Software Engineering (a field of Computer Science focused on developing “scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software”[IEEE Vocabulary], developing knowledge, methods robust scalable tools for computer programs),
Data Engineering may be treated as a field of Data Science that is focused on developing “scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of data-driven solutions”]

18.3 Data Engineering vs. ETL and ELT

DE Data Engineering with ETL (Extract - Transform - Load ) but is more than that.

We’ve also observed a general shift away from drag-and-drop ETL (Extract Transform and Load) tools towards a more programmatic approach [ (i.e. Software engineering,Cmputer science approaches) so that be scalable to big data] To a modern data engineer, traditional ETL tools are largely obsolete because logic cannot be expressed using code.

18.3.1 Taxonomy of Data Engineering tasks

  • Single variable
  • Multiple variable, no semantic relationship
  • Multiple variable, with semantic relationship

18.4 Useful packages

18.4.1 Single variable

18.5 0 > R base and data.table

18.5.1 Description

agrep {base}
adist {utils}

agrep(pattern, x, max.distance = 0.1, costs = NULL,
ignore.case = FALSE, value = FALSE, fixed = TRUE,
useBytes = FALSE)

18.5.2 Examples

## Cf. the examples for agrep:
adist("lasy", "1 lazy 2")
##      [,1]
## [1,]    5
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)
##      [,1]
## [1,]    1
x <- c("1 lazy", "1", "1 LAZY")
aregexec("laysy", x, max.distance = 2)
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 4
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
aregexec("(lay)(sy)", x, max.distance = 2)
## [[1]]
## [1] 3 3 5
## attr(,"match.length")
## [1] 4 2 2
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1

   # https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E

  dtStatcan <-  fread("datasets\\geoCa\\census-data/98-401-X2016044_English_CSV_data.csv") # 14 Mb file !
  numcols <- dtStatcan[, which(sapply(.SD, is.character))]
  dtStatcan[, (numcols):=lapply(.SD, as.factor), .SDcols=numcols]

  dtStatcan[str_detect(GEO_CODE, "(?i)^3506008")]
  dtStatcan[GEO_CODE %ilike% "^3506008"]
  dtStatcan[GEO_CODE %ilike% "(?i)^350600"]
  dtStatcan[GEO_CODE %ilike% "(?i)^3506" & GEO_LEVEL<4][order(GEO_LEVEL)]
  dtStatcan[GEO_CODE %ilike% "(?i)^350" & GEO_LEVEL<4][order(GEO_LEVEL)]

18.6 1 >textclean

18.6.1 Description

textclean: Text Cleaning Tools

Tools to clean and process text. Tools are geared at checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) with more analysis friendly substrings or extracting them into new variables.

Imports: data.table, english (≥ 1.0-2), glue (≥ 1.3.0), lexicon (≥ 1.0.0), mgsub (≥ 1.5.0), qdapRegex, stringi, textshape (≥ 1.0.1)

https://www.sciencedirect.com/science/article/abs/pii/S088523080190169X?via%3Dihub https://github.com/trinker/textclean

18.6.2 Example

18.7 2 > Package phonics

18.7.1 Description

https://github.com/k3jph/phonics-in-r ; https://jameshoward.us/phonics-in-r/articles/phonics.html

Provides a collection of phonetic algorithms including Soundex, Metaphone, NYSIIS, Caverphone, and others.

Phonetic Spelling Algorithm Implementations for R James P. Howard, II, Journal of Statistical Software, October 2020, Volume 95, Issue 8. https://www.jstatsoft.org/article/view/v095i08

All functions

caverphone() Caverphone

cologne() Cologne Phonetic Name Coding

lein() Lein Name Coding

metaphone() Generate phonetic versions of strings with Metaphone

mra_encode() mra_compare()Match Rating Approach Encoder

nysiis() New York State Identification and Intelligence System

onca() Oxford Name Compression Algorithm

phonex() Phonex Name Coding


Phonetic Spelling Algorithms


Roger Root Name Coding Procedure

soundex() refinedSoundex()



Statistics Canada Name Coding

18.7.2 Example

18.8 3 > Package stringdist

18.8.1 Description

CRAN: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf (Date/Publication 2020-10-09 10:30:03 UTC)
URL: https://github.com/markvanderloo/stringdist
Paper: https://journal.r-project.org/archive/2014-1/loo.pdf,

Implements an approximate string matching version of R’s native ‘match’ function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal sting alignment), qgrams (qgram, cosine, jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding or between integer vectors representing generic sequences.

18.8.2 Example

18.9 Multi-variable recording linking

Packages below are listed in the order of the speed and quality

18.11 >> Package RecordLinkage

18.11.1 Description

RecordLinkage: Record Linkage Functions for Linking and Deduplicating Data Sets

Provides functions for linking and deduplicating data sets. Methods based on a stochastic approach are implemented as well as classification algorithms from the machine learning domain. For details, see our paper “The RecordLinkage Package: Detecting Errors in Data” Sariyar M / Borg A (2010) doi:10.32614/RJ-2010-017.

Published: 2020-08-25
Depends: R (≥ 3.5.0), DBI, RSQLite (≥ 1.0.0), ff
Imports: Machine learning - e1071, rpart, ada, ipred, stats, evd, methods, nnet. Efficiency - data.table (≥ 1.7.8),

Reverse enhances: SoundexBR

18.11.2 Datasets: German names 500 and 10,000

The RLdata500 data consists of 500 records with 10 percent duplication.

18.13 > library(reclin)

18.13.1 Description

Title: Record Linkage Toolkit

Date: 2018-08-09

Description: Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, em-algorithm for estimating m- and u-probabilities, forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage.

Depends: lvec, ldat, R (≥ 3.4.0) Imports: stringdist, lpSolve, Rcpp

Comments: Used in DataCamp training

https://cran.r-project.org/web/packages/reclin/reclin.pdf https://github.com/djvanderlaan/reclin https://cran.r-project.org/web/packages/reclin/vignettes/introduction_to_reclin.html (with Two small artificial datasets: first and last names with address, sex, postcode) https://cran.r-project.org/web/packages/reclin/vignettes/deduplication.html

18.13.2 Included Datasets:

Two small (6 and 5 record) artificial names data Deduplication example

18.14 >> library(fuzzyjoin)

18.14.1 Description

fuzzyjoin: Join Tables Together on Inexact Matching
Join tables together based not on whether columns match exactly, but whether they are similar by some comparison. Implementations include string distance and regular expression matching.

Published: 2020-05-15

Imports: stringdist


18.14.2 Example 1: Joining with Common Mispelling

18.14.3 Example 2: from datacamp

18.14.4 Example 3: from stackoverflow

# https://stackoverflow.com/questions/52402768/r-fuzzy-merge-using-agrep-and-data-table
dt1 = data.table(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2))
dt2 = data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)

f <- Vectorize(function(x,y) agrepl(x, y,
                                    max.distance = 0.05, useBytes = TRUE))
dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)

18.16 Discusion - Other methods

LSTMs - Have you tried using LSTMs for record linkage for entity linkage? We were quiet sucessful in doing it compared to probalistic linkage.