18 Data Engineering, Record Linking and Deduplication
18.1 TL;DR
- Demo Tool: See http://rcanada.shinyapps.io/demo and the #GCData2021 Data Engineering workshop presentation for the backgrounder and the demonstration of various Data Engineering tasks and solutions. The source is in GCCode.
18.2 Intro: What is Data Engineering
Data Engineering refers to all tasks related to automation of data processing within the “insight-from-data” pipeline.
18.2.1 Data Engineering vs. Software Engineering
In analogy to Software Engineering, a field of Computer Science focused on applying
“scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software” [IEEE Vocabulary]
(that is, on developing knowledge, methods, and robust, scalable tools for building computer programs),
Data Engineering may be treated as a field of Data Science that is focused on applying
“scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of data-driven solutions”.
18.3 Data Engineering vs. ETL and ELT
Data Engineering overlaps with ETL (Extract - Transform - Load) but is more than that.
We have also observed a general shift away from drag-and-drop ETL tools towards a more programmatic approach (i.e., Software Engineering and Computer Science approaches), so that solutions scale to big data. To a modern data engineer, traditional ETL tools are largely obsolete because their logic cannot be expressed using code.
18.3.1 Taxonomy of Data Engineering tasks
- Single variable
- Multiple variable, no semantic relationship
- Multiple variable, with semantic relationship
18.4 Useful packages
18.4.1 Single variable
18.5 0 > R base and data.table
18.5.1 Description
agrep {base}: approximate string matching (fuzzy matching)
adist {utils}: approximate string distances

agrep(pattern, x, max.distance = 0.1, costs = NULL,
      ignore.case = FALSE, value = FALSE, fixed = TRUE,
      useBytes = FALSE)
18.5.2 Examples
## Cf. the examples for agrep:
adist("lasy", "1 lazy 2")
## [,1]
## [1,] 5
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)
## [,1]
## [1,] 1
x <- c("1 lazy", "1", "1 LAZY")
aregexec("laysy", x, max.distance = 2)
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 4
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
aregexec("(lay)(sy)", x, max.distance = 2)
## [[1]]
## [1] 3 3 5
## attr(,"match.length")
## [1] 4 2 2
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
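agrep() itself returns the indices (or, with value = TRUE, the elements) of approximate matches, complementing the adist()/aregexec() examples above:

```r
x <- c("1 lazy", "1", "1 LAZY")
agrep("lasy", x, max.distance = 2)                      # 1
agrep("lasy", x, max.distance = 2, value = TRUE)        # "1 lazy"
agrep("lasy", x, max.distance = 2, ignore.case = TRUE)  # 1 3
```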
# https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E
library(data.table)
library(stringr)
dtStatcan <- fread("datasets/geoCa/census-data/98-401-X2016044_English_CSV_data.csv") # 14 Mb file!
# convert character columns to factors so summary() is informative
charcols <- dtStatcan[, which(sapply(.SD, is.character))]
dtStatcan[, (charcols) := lapply(.SD, as.factor), .SDcols = charcols]
summary(dtStatcan)
# equivalent ways to filter GEO_CODE by prefix, case-insensitively:
dtStatcan[str_detect(GEO_CODE, "(?i)^3506008")]
dtStatcan[GEO_CODE %ilike% "^3506008"]   # %ilike% is already case-insensitive
dtStatcan[GEO_CODE %ilike% "^350600"]
dtStatcan[GEO_CODE %ilike% "^3506" & GEO_LEVEL < 4][order(GEO_LEVEL)]
dtStatcan[GEO_CODE %ilike% "^350" & GEO_LEVEL < 4][order(GEO_LEVEL)]
18.6 1 > textclean
18.6.1 Description
textclean: Text Cleaning Tools
Tools to clean and process text. The tools are geared at checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) in favour of more analysis-friendly substrings, or extracting them into new variables.
Imports: data.table, english (≥ 1.0-2), glue (≥ 1.3.0), lexicon (≥ 1.0.0), mgsub (≥ 1.5.0), qdapRegex, stringi, textshape (≥ 1.0.1)
https://www.sciencedirect.com/science/article/abs/pii/S088523080190169X?via%3Dihub https://github.com/trinker/textclean
18.6.2 Example
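A minimal sketch of common textclean normalizations (replace_contraction(), replace_white(), replace_html(), and strip() are part of the package's exported API; exact outputs depend on the installed lexicon version):

```r
library(textclean)

x <- c("I'm  sooo   gr8!!", "Mr. Jones &amp; Co.")

replace_contraction(x)  # expand contractions such as "I'm"
replace_white(x)        # collapse runs of whitespace into single spaces
replace_html(x)         # replace HTML entities/markup such as "&amp;"
strip(x)                # lower-case and strip punctuation and digits
```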
18.7 2 > Package phonics
18.7.1 Description
https://cran.r-project.org/web/packages/phonics/index.html
https://github.com/k3jph/phonics-in-r ; https://jameshoward.us/phonics-in-r/articles/phonics.html
Provides a collection of phonetic algorithms including Soundex, Metaphone, NYSIIS, Caverphone, and others.
Phonetic Spelling Algorithm Implementations for R James P. Howard, II, Journal of Statistical Software, October 2020, Volume 95, Issue 8. https://www.jstatsoft.org/article/view/v095i08
All functions:
- caverphone() - Caverphone
- cologne() - Cologne Phonetic Name Coding
- lein() - Lein Name Coding
- metaphone() - Generate phonetic versions of strings with Metaphone
- mra_encode(), mra_compare() - Match Rating Approach Encoder
- nysiis() - New York State Identification and Intelligence System
- onca() - Oxford Name Compression Algorithm
- phonex() - Phonex Name Coding
- phonics() - Phonetic Spelling Algorithms
- rogerroot() - Roger Root Name Coding Procedure
- soundex(), refinedSoundex() - Soundex
- statcan() - Statistics Canada Name Coding
18.7.2 Example
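Phonetic codes let differently spelled names match on pronunciation; a short sketch with the package's documented functions:

```r
library(phonics)

soundex("Smith")                      # "S530"
soundex("Smyth")                      # "S530": same code despite different spelling
metaphone(c("Catherine", "Kathryn"))  # near-identical phonetic encodings
nysiis("Johnathan")
```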
18.8 3 > Package stringdist
18.8.1 Description
CRAN: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf (Date/Publication 2020-10-09 10:30:03 UTC)
URL: https://github.com/markvanderloo/stringdist
Paper: https://journal.r-project.org/archive/2014-1/loo.pdf,
Implements an approximate string matching version of R’s native ‘match’ function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), qgrams (qgram, cosine, jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding or between integer vectors representing generic sequences.
18.8.2 Example
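A short sketch of the core stringdist API: pairwise distances, the approximate match() replacement amatch(), and a full distance matrix:

```r
library(stringdist)

stringdist("lasy", "lazy", method = "osa")       # 1: one substitution
stringdist("lasy", "lazy", method = "jw")        # Jaro-Winkler distance in [0, 1]
amatch("lasy", c("hello", "lazy"), maxDist = 1)  # approximate match(): 2
stringdistmatrix(c("foo", "bar"), c("boo", "baz"), method = "lv")
```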
18.9 Multi-variable record linking
The packages below are listed in order of speed and quality.
18.10 >> library(fastLink)
18.10.1 Description
Published: 2020-04-29
Implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. This includes functionalities to conduct a merge of two datasets under the Fellegi-Sunter model using the Expectation-Maximization algorithm. In addition, tools for preparing, adjusting, and summarizing data merges are included. The package implements methods described in Enamorado, Fifield, and Imai (2019) ”Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records”, American Political Science Review and is available at http://imai.fas.harvard.edu/research/linkage.html.
https://github.com/kosukeimai/fastLink
https://imai.fas.harvard.edu/research/files/linkage.pdf
Imports: parallel, foreach, doParallel, gtools, [data.table], [stringdist], stringr, stringi, Rcpp
18.10.2 Dataset
18.10.3 Example
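A minimal sketch using the sample data shipped with the package (data(samplematch) provides dfA and dfB; the variable names below follow the fastLink README):

```r
library(fastLink)

data(samplematch)  # loads dfA and dfB

out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname",
               "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname",
                       "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname")
)

matched <- getMatches(dfA = dfA, dfB = dfB, fl.out = out)
summary(out)  # match rate and posterior match probabilities
```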
18.11 >> Package RecordLinkage
18.11.1 Description
RecordLinkage: Record Linkage Functions for Linking and Deduplicating Data Sets
Provides functions for linking and deduplicating data sets. Methods based on a stochastic approach are implemented as well as classification algorithms from the machine learning domain. For details, see our paper “The RecordLinkage Package: Detecting Errors in Data” Sariyar M / Borg A (2010) doi:10.32614/RJ-2010-017.
Published: 2020-08-25
Depends: R (≥ 3.5.0), DBI, RSQLite (≥ 1.0.0), ff
Imports: Machine learning - e1071, rpart, ada, ipred, stats, evd, methods, nnet. Efficiency - data.table (≥ 1.7.8)
Reverse enhances: SoundexBR
18.11.2 Datasets: German names 500 and 10,000
The RLdata500 data consists of 500 records with 10 percent duplication.
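A deduplication sketch on RLdata500 using the package's Epilink pipeline (compare.dedup(), epiWeights(), epiClassify()); the classification threshold below is illustrative only:

```r
library(RecordLinkage)
data(RLdata500)

# build record pairs, blocking on the first and last name components
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                        blockfld = list(1, 3))
rpairs <- epiWeights(rpairs)                     # Epilink match weights
result <- epiClassify(rpairs, threshold.upper = 0.55)
summary(result)                                  # detected links vs. true duplicates
getPairs(result, show = "links", single.rows = TRUE)
```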
18.12 > library(blink)
18.12.1 Summary
2020-09-30. “Entity Resolution with Empirically Motivated Priors”, Bayesian Analysis, 10(4): 849-875. We will be using the RecordLinkage package in R and the RLdata500 data set. link
18.13 > library(reclin)
18.13.1 Description
Title: Record Linkage Toolkit
Date: 2018-08-09
Description: Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, em-algorithm for estimating m- and u-probabilities, forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage.
Depends: lvec, ldat, R (≥ 3.4.0) Imports: stringdist, lpSolve, Rcpp
Comments: Used in DataCamp training
https://cran.r-project.org/web/packages/reclin/reclin.pdf https://github.com/djvanderlaan/reclin https://cran.r-project.org/web/packages/reclin/vignettes/introduction_to_reclin.html (with Two small artificial datasets: first and last names with address, sex, postcode) https://cran.r-project.org/web/packages/reclin/vignettes/deduplication.html
18.13.2 Included Datasets:
Two small (6- and 5-record) artificial name datasets
18.13.2.1 Deduplication example
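For deduplication, reclin pairs a dataset with itself and then runs the usual compare/score/select pipeline; a sketch on the bundled linkexample1 data (the blocking variable and threshold are chosen for illustration, and filter_pairs_for_deduplication() is assumed to follow the package's deduplication vignette):

```r
library(reclin)
data("linkexample1", package = "reclin")

# pair the dataset with itself, blocking on postcode
pairs <- pair_blocking(linkexample1, linkexample1, "postcode")
# keep each unordered pair once and drop self-pairs
pairs <- filter_pairs_for_deduplication(pairs)
pairs <- compare_pairs(pairs, by = c("lastname", "firstname", "address", "sex"))
pairs <- score_simsum(pairs, var = "simsum")
# threshold chosen for illustration only
pairs <- select_threshold(pairs, "simsum", var = "select", threshold = 3)
```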
18.14 >> library(fuzzyjoin)
18.14.1 Description
fuzzyjoin: Join Tables Together on Inexact Matching
Join tables together based not on whether columns match exactly, but whether they are similar by some comparison. Implementations include string distance and regular expression matching.
Published: 2020-05-15
Imports: stringdist
https://cran.r-project.org/web/packages/fuzzyjoin/vignettes/stringdist_join.html
18.14.2 Example 1: Joining with Common Misspellings
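The vignette joins a table of common misspellings against a dictionary; a small self-contained sketch of the same idea (the two tables here are made up for illustration):

```r
library(fuzzyjoin)
library(dplyr)

typos <- data.frame(spelling = c("definately", "seperate", "untouched"))
dictionary <- data.frame(word = c("definitely", "separate"))

# keep rows whose strings are within 2 edits of a dictionary word;
# "untouched" has no close match and is dropped
typos %>%
  stringdist_inner_join(dictionary, by = c(spelling = "word"), max_dist = 2)
```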
18.14.3 Example 2: from datacamp
18.14.4 Example 3: from stackoverflow
# https://stackoverflow.com/questions/52402768/r-fuzzy-merge-using-agrep-and-data-table
library(data.table)
library(dplyr)
library(fuzzyjoin)

# vectorized approximate-matching predicate used as the fuzzy-join match function
f <- Vectorize(function(x, y) agrepl(x, y,
                                     ignore.case = TRUE,
                                     max.distance = 0.05, useBytes = TRUE))

dt1 <- data.table(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2))
dt2 <- data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))
dt1 %>% fuzzy_inner_join(dt2, by = "Name", match_fun = f)

# with an additional Date column that must match exactly:
dt1 <- data.frame(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2), Date = c(1, 2))
dt2 <- data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"),
                  B = c("p", "q", "r"), Date = c(1, 2, 3))
dt1 %>% fuzzy_inner_join(dt2, by = c("Date", "Name"), match_fun = f) %>% filter(Date.x == Date.y)
18.15 > Package blink
18.15.1 Description
blink: Record Linkage for Empirically Motivated Priors
An implementation of the model in Steorts (2015) doi:10.1214/15-BA965SI, which performs Bayesian entity resolution for categorical and text data, for any distance function defined by the user. In addition, the precision and recall are in the package to allow one to compare to any other comparable method such as logistic regression, Bayesian additive regression trees (BART), or random forests. The experiments are reproducible and illustrated using a simple vignette. LICENSE: GPL-3 + file license.
Depends: stringdist
Published: 2020-10-06
https://projecteuclid.org/euclid.ba/1441790411 https://cran.r-project.org/web/packages/blink/vignettes/introEBLink.html
18.15.2 Datasets: German names 500 and 10,000
18.16 Discussion - Other methods
LSTMs - Have you tried using LSTMs for record linkage / entity linkage? We were quite successful with them compared to probabilistic linkage.