Pipeline Steps
This page gives a brief explanation of the major steps in the pipeline. Last updated 17/06/24.
Build-System Tasks
process_voter_data
This task processes the raw data from the L2 files that were extracted by 00_unzip_l2.R, and saves them as a series of parquet files, which are much faster to read. It also standardizes all the columns so that they are consistent between the different files in the collection. Internally, the code uses two schemas that we have developed, yale_schema and datavant_schema; these are legacy from another project, and are mostly kept around to ensure compatibility with another set of L2 analyses.
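As a rough illustration (not the task's actual code; the file path and column names below are invented), standardizing one raw file and saving it as parquet with the arrow package might look like this:

```r
library(arrow)
library(dplyr)

# Hypothetical sketch: read one raw L2 file, rename a couple of columns
# to a standard schema, and save the result as parquet. The real task
# applies the full yale_schema / datavant_schema mappings.
raw <- readr::read_tsv("data/l2/VM2--CT.tab")

standardized <- raw |>
  rename(any_of(c(
    first_name = "Voters_FirstName",   # assumed raw column names
    last_name  = "Voters_LastName"
  )))

write_parquet(standardized, "data/parquet/CT.parquet")
```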
clean_physician_data
This task is responsible for cleaning and consolidating data from the NPPES, NUCC, and taxonomy files. It reads all of the files and ensures that the columns are of the correct types, before joining all three files together using the NPI number as a join key. It also subsets the data to physicians labeled as Allopathic & Osteopathic Physicians, to ensure we only have the right kind of provider in our future analyses.
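A minimal sketch of the consolidation, assuming the files have already been converted to parquet (file paths, column names, and the grouping label's column are illustrative, not the task's actual ones):

```r
library(dplyr)

# Hypothetical sketch of clean_physician_data's join-and-filter step.
nppes    <- arrow::read_parquet("data/nppes.parquet")
nucc     <- arrow::read_parquet("data/nucc.parquet")
taxonomy <- arrow::read_parquet("data/taxonomy.parquet")

physicians <- nppes |>
  inner_join(taxonomy, by = "npi") |>   # NPI is the join key
  inner_join(nucc,     by = "npi") |>
  filter(provider_grouping == "Allopathic & Osteopathic Physicians")
```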
locality_sensitive_hash
This task runs locality-sensitive hashing, as implemented in the zoomerjoin package, to find all physician-voter pairs with similar names within each state. This is a 'blocking' step: it reduces the number of physician-voter pairs we have to classify as matches or non-matches by weeding out pairs whose names are too dissimilar to plausibly match.
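For illustration, the blocking step with zoomerjoin might look like the sketch below; the column names and tuning parameters are assumptions, not the pipeline's actual settings:

```r
library(zoomerjoin)

# Hypothetical sketch: find candidate physician-voter pairs whose last
# names have Jaccard similarity above a threshold, using LSH so we never
# have to compare every pair directly.
candidate_pairs <- jaccard_inner_join(
  physicians, voters,
  by           = c("last_name" = "last_name"),
  threshold    = 0.7,  # minimum Jaccard similarity for a pair to survive
  n_gram_width = 2,    # shingle width used to tokenize the names
  n_bands      = 50,   # more bands -> fewer false negatives, more work
  band_width   = 8
)
```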
add_rf_match_predictions_to_df
This task takes the LSH-ed data and the labeled training data as inputs. It uses the labeled data to train a Random Forest that predicts whether two records are matches, based on several predictors we generate (similarity of the two names, distance between the supposed birth date and the date of graduation from medical school, etc.). It then uses the Random Forest to predict whether each record in the larger corpus is a match or not. The task then returns the original dataframe with this vector of predictions added as a column named match.
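As a sketch of the train-then-score pattern, here is what this could look like using the ranger package (the pipeline may use a different Random Forest implementation, and the predictor names are invented):

```r
library(ranger)

# Hypothetical sketch: train on the hand-labeled pairs, then score the
# full blocked corpus and attach the predictions as a `match` column.
rf <- ranger(
  factor(match) ~ name_similarity + birth_grad_gap,  # illustrative predictors
  data = labeled_pairs
)

lsh_pairs$match <- predict(rf, data = lsh_pairs)$predictions
```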
Stand-Alone Scripts
code/00_unzip_l2.R
Note
If you don't have access to the network drive that houses the voter data, you will not be able to run this script. Instead, you can use Data Version Control (DVC) to pull a cached version of the files. To do this, install DVC using pip and the included requirements.txt file, then run dvc pull from within the repository.
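Concretely, from the repository root:

```sh
pip install -r requirements.txt   # installs dvc, among other dependencies
dvc pull                          # fetch the cached L2 files from remote storage
```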
This is a standalone script, and is not integrated into the build system. It is responsible for unzipping the raw L2 files, which are kept on a network drive, and copying them over to the data/ folder. If you are running the code somewhere other than the server, you are responsible for pointing this code to the correct location of the L2 datasets so they can be ingested by the pipeline.
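A minimal sketch of what that entails; the network path below is an assumption and will differ on your machine:

```r
# Hypothetical sketch of 00_unzip_l2.R: point l2_dir at your copy of
# the raw L2 archives before running.
l2_dir <- "/mnt/network_drive/l2"

zips <- list.files(l2_dir, pattern = "\\.zip$", full.names = TRUE)
for (z in zips) {
  unzip(z, exdir = "data/")
}
```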
code/make_training_data.R
This is a helper script that we used to create the training data for the supervised matching algorithms. It takes 1500 random rows from the LSH-ed dataset and divides them into three semi-overlapping partitions that were hand-coded by lab members. The semi-overlapping nature of the partitions allows us to collect many training data points while also calculating statistics such as inter-coder reliability.
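A sketch of the partitioning logic (the seed and the overlap offsets are invented; see code/make_training_data.R for the real scheme):

```r
set.seed(2024)  # assumed seed, for reproducibility of the sample

# Hypothetical sketch: sample 1500 pairs and split them into three
# partitions that overlap at the edges, so some pairs are coded by two
# lab members -- which is what makes inter-coder reliability computable.
sampled <- lsh_pairs[sample(nrow(lsh_pairs), 1500), ]

partitions <- list(
  coder_1 = sampled[1:600, ],
  coder_2 = sampled[451:1050, ],   # rows 451-600 shared with coder_1
  coder_3 = sampled[901:1500, ]    # rows 901-1050 shared with coder_2
)
```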
code/label.R
This is a 40-line helper script that we used to label some of the training data. It takes records from the partitioned training data, and asks the user whether they match or not. The output is then saved into the labelled_training_data directory.
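The core of such a script is an interactive prompt loop; a minimal sketch (column names and output file name are invented) might be:

```r
# Hypothetical sketch of the labeling loop in code/label.R.
labels <- logical(nrow(partition))
for (i in seq_len(nrow(partition))) {
  print(partition[i, c("physician_name", "voter_name")])
  labels[i] <- tolower(readline("Match? (y/n): ")) == "y"
}

partition$match <- labels
saveRDS(partition, "labelled_training_data/partition_01.rds")
```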