TMSIS Data Documentation
This website provides documentation for our TMSIS datasets for researchers within / collaborating with the Yale Medicaid Lab. On the site, you can find resources explaining the structure of the datasets, recommendations for working with large datasets, and example code from other projects that have used the same data.
Hive-Partitioned Datasets:
CMS gives us the data in a bespoke fixed-width file format called .fts
. Working with these files is generally slow (as the file format is not optimized for database queries), and cumbersome (as one must implement their own parsing logic to read the files). We parse and format the TMSIS data into a series of hive partitioned parquet files, a standardized file format that is fast to read, supports modern database operations, and enforces strong types, so you never have to worry that your files have been read in incorrectly.
To get started, with the files, you can log into Milgram and open the directory where they are stored on the server. The files are currently located at /gpfs/milgram/pi/medicaid_lab/data/cms/ingested/TMSIS_TAF
.
Opening this directory, you will see the following sub-directories:
TMSIS_taf/
├── taf_demog_elig_base
├── taf_demog_elig_dates
├── taf_demog_elig_disability
├── taf_demog_elig_hh_spo
├── taf_demog_elig_mngd_care
├── taf_demog_elig_mny_flw_prsn
├── taf_demog_elig_waiver
├── taf_inpatient_header
├── taf_inpatient_line
├── taf_inpatient_occurrence
├── taf_long_term_header
├── taf_long_term_line
├── taf_long_term_occurrence
├── taf_other_services_header
├── taf_other_services_line
├── taf_other_services_occurrence
├── taf_rx_header
└── taf_rx_line
The directories fall into 5 groups, the pharmacy files (taf_rx_.*), the long-term care files (taf_long_term_.*), the inpatient files (taf_inpatient_.*), the other care files (taf_other_services_.*) and the demographic and eligibility files (taf_demog_elig_.*). You can find quick links to the ResDAC documentation for each file type at the bottom of this page.
If we open the taf_demog_elig_base
directory, we can see that the data is
further partitioned at the state-year level. If you are performing data
analysis on a state in a single year, you can simply take one of these files
and start performing data analysis on it. Alternatively, if you are more
familiar with arrow
and DuckDB
, you can use this file structure to speed up
queries that are run on specific states or years. Check out the R
examples for more information on how to do
this if you are curious.
taf_demog_elig_base/
├── year=2016
│ ├── state=AK
│ │ └── data.paquet
│ ├── state=AL
│ │ └── data.paquet
│ └── ...
├── year=2017
│ ├── state=AK
│ │ └── data.paquet
│ ├── state=AL
│ │ └── data.paquet
│ └── ...
└── ...
ResDAC Documentation Links:
ResDAC provides extensive documentation for all of the files we use. No column
names or datatypes have been changed in formatting the dataset into parquet
format, so all the documentation presented on this website applies to the
standardized files as well as the raw files.