18 Data Engineering, Record Linking and Deduplication
18.1 TL;DR
- Demo Tool: See http://rcanada.shinyapps.io/demo and the #GCData2021 Data Engineering workshop presentation for the backgrounder and the demonstration of various Data Engineering tasks and solutions. The source is in GCCode.
18.2 Intro: What is Data Engineering
Data Engineering refers to all tasks related to automation of data processing within the “insight-from-data” pipeline.
18.2.1 Data Engineering vs. Software Engineering
In analogy to Software Engineering, a field of Computer Science focused on applying
“scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software” [IEEE Vocabulary]
(that is, on developing knowledge, methods, and robust, scalable tools for building computer programs),
Data Engineering may be treated as a field of Data Science that is focused on applying
“scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of data-driven solutions”.
18.3 Data Engineering vs. ETL and ELT
Data Engineering overlaps with ETL (Extract - Transform - Load) but is more than that.
We have also observed a general shift away from drag-and-drop ETL tools towards a more programmatic approach (i.e., Software Engineering and Computer Science approaches), so that solutions scale to big data. To a modern data engineer, traditional ETL tools are largely obsolete because their logic cannot be expressed using code.
18.3.1 Taxonomy of Data Engineering tasks
- Single variable
- Multiple variable, no semantic relationship
- Multiple variable, with semantic relationship
18.4 Useful packages
18.4.1 Single variable
18.5 0 > R base and data.table
18.5.1 Description
agrep {base}: approximate string matching (fuzzy matching)
adist {utils}: approximate string distances

agrep(pattern, x, max.distance = 0.1, costs = NULL,
      ignore.case = FALSE, value = FALSE, fixed = TRUE,
      useBytes = FALSE)
18.5.2 Examples
## Cf. the examples for agrep:
adist("lasy", "1 lazy 2")
## [,1]
## [1,] 5
## For a "partial approximate match" (as used for agrep):
adist("lasy", "1 lazy 2", partial = TRUE)
## [,1]
## [1,] 1
x <- c("1 lazy", "1", "1 LAZY")
aregexec("laysy", x, max.distance = 2)
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 4
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
aregexec("(lay)(sy)", x, max.distance = 2)
## [[1]]
## [1] 3 3 5
## attr(,"match.length")
## [1] 4 2 2
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
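agrep() itself returns the indices (or, with value = TRUE, the elements) of approximate matches, complementing the adist()/aregexec() examples above:

```r
x <- c("1 lazy", "1", "1 LAZY")
agrep("lasy", x, max.distance = 2)                      # 1
agrep("lasy", x, max.distance = 2, value = TRUE)        # "1 lazy"
agrep("lasy", x, max.distance = 2, ignore.case = TRUE)  # 1 3
```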
# https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/prof/details/download-telecharger/comp/page_dl-tc.cfm?Lang=E
library(data.table)
library(stringr)
dtStatcan <- fread("datasets/geoCa/census-data/98-401-X2016044_English_CSV_data.csv") # 14 Mb file!
# convert character columns to factors so summary() is informative
charcols <- dtStatcan[, which(sapply(.SD, is.character))]
dtStatcan[, (charcols) := lapply(.SD, as.factor), .SDcols = charcols]
summary(dtStatcan)
# equivalent ways to filter GEO_CODE by prefix, case-insensitively:
dtStatcan[str_detect(GEO_CODE, "(?i)^3506008")]
dtStatcan[GEO_CODE %ilike% "^3506008"]   # %ilike% is already case-insensitive
dtStatcan[GEO_CODE %ilike% "^350600"]
dtStatcan[GEO_CODE %ilike% "^3506" & GEO_LEVEL < 4][order(GEO_LEVEL)]
dtStatcan[GEO_CODE %ilike% "^350" & GEO_LEVEL < 4][order(GEO_LEVEL)]
18.6 1 > textclean
18.6.1 Description
textclean: Text Cleaning Tools
Tools to clean and process text. The tools are geared at checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) in favour of more analysis-friendly substrings, or extracting them into new variables.
Imports: data.table, english (≥ 1.0-2), glue (≥ 1.3.0), lexicon (≥ 1.0.0), mgsub (≥ 1.5.0), qdapRegex, stringi, textshape (≥ 1.0.1)
https://www.sciencedirect.com/science/article/abs/pii/S088523080190169X?via%3Dihub https://github.com/trinker/textclean
18.6.2 Example
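A minimal sketch of common textclean normalizations (replace_contraction(), replace_white(), replace_html(), and strip() are part of the package's exported API; exact outputs depend on the installed lexicon version):

```r
library(textclean)

x <- c("I'm  sooo   gr8!!", "Mr. Jones &amp; Co.")

replace_contraction(x)  # expand contractions such as "I'm"
replace_white(x)        # collapse runs of whitespace into single spaces
replace_html(x)         # replace HTML entities/markup such as "&amp;"
strip(x)                # lower-case and strip punctuation and digits
```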
18.7 2 > Package phonics
18.7.1 Description
https://cran.r-project.org/web/packages/phonics/index.html
https://github.com/k3jph/phonics-in-r ; https://jameshoward.us/phonics-in-r/articles/phonics.html
Provides a collection of phonetic algorithms including Soundex, Metaphone, NYSIIS, Caverphone, and others.
Phonetic Spelling Algorithm Implementations for R James P. Howard, II, Journal of Statistical Software, October 2020, Volume 95, Issue 8. https://www.jstatsoft.org/article/view/v095i08
All functions:
- caverphone() - Caverphone
- cologne() - Cologne Phonetic Name Coding
- lein() - Lein Name Coding
- metaphone() - Generate phonetic versions of strings with Metaphone
- mra_encode(), mra_compare() - Match Rating Approach Encoder
- nysiis() - New York State Identification and Intelligence System
- onca() - Oxford Name Compression Algorithm
- phonex() - Phonex Name Coding
- phonics() - Phonetic Spelling Algorithms
- rogerroot() - Roger Root Name Coding Procedure
- soundex(), refinedSoundex() - Soundex
- statcan() - Statistics Canada Name Coding
18.7.2 Example
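Phonetic codes let differently spelled names match on pronunciation; a short sketch with the package's documented functions:

```r
library(phonics)

soundex("Smith")                      # "S530"
soundex("Smyth")                      # "S530": same code despite different spelling
metaphone(c("Catherine", "Kathryn"))  # near-identical phonetic encodings
nysiis("Johnathan")
```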
18.8 3 > Package stringdist
18.8.1 Description
CRAN: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf (Date/Publication 2020-10-09 10:30:03 UTC)
URL: https://github.com/markvanderloo/stringdist
Paper: https://journal.r-project.org/archive/2014-1/loo.pdf,
Implements an approximate string matching version of R’s native ‘match’ function. Also offers fuzzy text search based on various string distance measures. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), qgrams (qgram, cosine, jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding or between integer vectors representing generic sequences.
18.8.2 Example
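A short sketch of the core stringdist API: pairwise distances, the approximate match() replacement amatch(), and a full distance matrix:

```r
library(stringdist)

stringdist("lasy", "lazy", method = "osa")       # 1: one substitution
stringdist("lasy", "lazy", method = "jw")        # Jaro-Winkler distance in [0, 1]
amatch("lasy", c("hello", "lazy"), maxDist = 1)  # approximate match(): 2
stringdistmatrix(c("foo", "bar"), c("boo", "baz"), method = "lv")
```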
18.9 Multi-variable record linking
The packages below are listed in order of speed and quality.
18.10 >> library(fastLink)
18.10.1 Description
Published: 2020-04-29
Implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. This includes functionalities to conduct a merge of two datasets under the Fellegi-Sunter model using the Expectation-Maximization algorithm. In addition, tools for preparing, adjusting, and summarizing data merges are included. The package implements methods described in Enamorado, Fifield, and Imai (2019) ”Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records”, American Political Science Review and is available at http://imai.fas.harvard.edu/research/linkage.html.
https://github.com/kosukeimai/fastLink
https://imai.fas.harvard.edu/research/files/linkage.pdf
Imports: parallel, foreach, doParallel, gtools, [data.table], [stringdist], stringr, stringi, Rcpp
18.10.2 Dataset
18.10.3 Example
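A minimal sketch using the sample data shipped with the package (data(samplematch) provides dfA and dfB; the variable names below follow the fastLink README):

```r
library(fastLink)

data(samplematch)  # loads dfA and dfB

out <- fastLink(
  dfA = dfA, dfB = dfB,
  varnames = c("firstname", "middlename", "lastname",
               "housenum", "streetname", "city", "birthyear"),
  stringdist.match = c("firstname", "middlename", "lastname",
                       "streetname", "city"),
  partial.match = c("firstname", "lastname", "streetname")
)

matched <- getMatches(dfA = dfA, dfB = dfB, fl.out = out)
summary(out)  # match rate and posterior match probabilities
```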
18.11 >> Package RecordLinkage
18.11.1 Description
RecordLinkage: Record Linkage Functions for Linking and Deduplicating Data Sets
Provides functions for linking and deduplicating data sets. Methods based on a stochastic approach are implemented as well as classification algorithms from the machine learning domain. For details, see our paper “The RecordLinkage Package: Detecting Errors in Data” Sariyar M / Borg A (2010) doi:10.32614/RJ-2010-017.
Published: 2020-08-25
Depends: R (≥ 3.5.0), DBI, RSQLite (≥ 1.0.0), ff
Imports: Machine learning - e1071, rpart, ada, ipred, stats, evd, methods, nnet. Efficiency - data.table (≥ 1.7.8)
Reverse enhances: SoundexBR
18.11.2 Datasets: German names 500 and 10,000
The RLdata500 data consists of 500 records with 10 percent duplication.
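A deduplication sketch on RLdata500 using the package's Epilink pipeline (compare.dedup(), epiWeights(), epiClassify()); the classification threshold below is illustrative only:

```r
library(RecordLinkage)
data(RLdata500)

# build record pairs, blocking on the first and last name components
rpairs <- compare.dedup(RLdata500, identity = identity.RLdata500,
                        blockfld = list(1, 3))
rpairs <- epiWeights(rpairs)                     # Epilink match weights
result <- epiClassify(rpairs, threshold.upper = 0.55)
summary(result)                                  # detected links vs. true duplicates
getPairs(result, show = "links", single.rows = TRUE)
```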
18.12 > library(blink)
18.12.1 Summary
2020-09-30. “Entity Resolution with Empirically Motivated Priors”, Bayesian Analysis, 10(4): 849-875. We will be using the RecordLinkage package in R and the RLdata500 data set. link
18.13 > library(reclin)
18.13.1 Description
Title: Record Linkage Toolkit
Date: 2018-08-09
Description: Functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, em-algorithm for estimating m- and u-probabilities, forcing one-to-one matching. Can also be used for pre- and post-processing for machine learning methods for record linkage.
Depends: lvec, ldat, R (≥ 3.4.0) Imports: stringdist, lpSolve, Rcpp
Comments: Used in DataCamp training
https://cran.r-project.org/web/packages/reclin/reclin.pdf https://github.com/djvanderlaan/reclin https://cran.r-project.org/web/packages/reclin/vignettes/introduction_to_reclin.html (with Two small artificial datasets: first and last names with address, sex, postcode) https://cran.r-project.org/web/packages/reclin/vignettes/deduplication.html
18.13.2 Included Datasets:
Two small (6- and 5-record) artificial name datasets
18.13.2.1 Deduplication example
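For deduplication, reclin pairs a dataset with itself and then runs the usual compare/score/select pipeline; a sketch on the bundled linkexample1 data (the blocking variable and threshold are chosen for illustration, and filter_pairs_for_deduplication() is assumed to follow the package's deduplication vignette):

```r
library(reclin)
data("linkexample1", package = "reclin")

# pair the dataset with itself, blocking on postcode
pairs <- pair_blocking(linkexample1, linkexample1, "postcode")
# keep each unordered pair once and drop self-pairs
pairs <- filter_pairs_for_deduplication(pairs)
pairs <- compare_pairs(pairs, by = c("lastname", "firstname", "address", "sex"))
pairs <- score_simsum(pairs, var = "simsum")
# threshold chosen for illustration only
pairs <- select_threshold(pairs, "simsum", var = "select", threshold = 3)
```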
18.14 >> library(fuzzyjoin)
18.14.1 Description
fuzzyjoin: Join Tables Together on Inexact Matching
Join tables together based not on whether columns match exactly, but whether they are similar by some comparison. Implementations include string distance and regular expression matching.
Published: 2020-05-15
Imports: stringdist
https://cran.r-project.org/web/packages/fuzzyjoin/vignettes/stringdist_join.html
18.14.2 Example 1: Joining with Common Misspellings
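The vignette joins a table of common misspellings against a dictionary; a small self-contained sketch of the same idea (the two tables here are made up for illustration):

```r
library(fuzzyjoin)
library(dplyr)

typos <- data.frame(spelling = c("definately", "seperate", "untouched"))
dictionary <- data.frame(word = c("definitely", "separate"))

# keep rows whose strings are within 2 edits of a dictionary word;
# "untouched" has no close match and is dropped
typos %>%
  stringdist_inner_join(dictionary, by = c(spelling = "word"), max_dist = 2)
```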
18.14.3 Example 2: from datacamp
18.14.4 Example 3: from stackoverflow
# https://stackoverflow.com/questions/52402768/r-fuzzy-merge-using-agrep-and-data-table
library(data.table)
library(dplyr)
library(fuzzyjoin)

# vectorized approximate-matching predicate used as the fuzzy-join match function
f <- Vectorize(function(x, y) agrepl(x, y,
                                     ignore.case = TRUE,
                                     max.distance = 0.05, useBytes = TRUE))

dt1 <- data.table(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2))
dt2 <- data.table(Name = c("ASML HOLDING NV", "ABN AMRO GROUP"), B = c("p", "q"))
dt1 %>% fuzzy_inner_join(dt2, by = "Name", match_fun = f)

# with an additional Date column that must match exactly:
dt1 <- data.frame(Name = c("ASML HOLDING", "ABN AMRO GROUP"), A = c(1, 2), Date = c(1, 2))
dt2 <- data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"),
                  B = c("p", "q", "r"), Date = c(1, 2, 3))
dt1 %>% fuzzy_inner_join(dt2, by = c("Date", "Name"), match_fun = f) %>% filter(Date.x == Date.y)
18.15 > Package blink
18.15.1 Description
blink: Record Linkage for Empirically Motivated Priors
An implementation of the model in Steorts (2015) doi:10.1214/15-BA965SI, which performs Bayesian entity resolution for categorical and text data, for any distance function defined by the user. In addition, the precision and recall are in the package to allow one to compare to any other comparable method such as logistic regression, Bayesian additive regression trees (BART), or random forests. The experiments are reproducible and illustrated using a simple vignette. LICENSE: GPL-3 + file license.
Depends: stringdist
Published: 2020-10-06
https://projecteuclid.org/euclid.ba/1441790411 https://cran.r-project.org/web/packages/blink/vignettes/introEBLink.html
18.15.2 Datasets: German names 500 and 10,000
18.16 Discussion - Other methods
LSTMs - Have you tried using LSTMs for record linkage / entity linkage? We were quite successful with them compared to probabilistic linkage.