Best Practices for R
Using renv to manage dependencies:
Dependency management on the HPC can be very difficult. renv allows you to
set up a virtual environment in R that isolates the dependencies for each
individual project. This is useful for two reasons:
- It allows you to clearly communicate your code's required dependencies to your collaborators
- Updating dependencies for one project will never break dependencies for another project
Virtual environments have already been set up for all of the example projects, and the virtual environment will be activated whenever you open a new R session in these directories (so if you start your project by copying one of these directories, you are good to go).
To set up renv on a new project, simply call renv::init() in R while in the project home folder. You can then use either renv::install() or install.packages() to install packages as normal.
More info about getting started with renv can be found here.
Using duckplyr to conduct large-scale data analysis.
Another important skill is using a query engine like DuckDB to run queries on datasets without loading them into memory.
If you don't work with large-scale data, you are probably most familiar with a workflow that looks something like this:
- Read the data into R using read_csvor an alternative
- Run models and analysis on the data using dplyr
- Save out the results to another file
When datasets get large, this workflow begins to break down as it becomes very slow (if not impossible) to load the data into memory in order to analyze. This is where technologies like DuckDB can help - you can register data with DuckDB without reading it into memory, and then write queries that access and load only the relevant parts of your data for analysis, saving time and resources.
Importantly, new packages like duckplyr provide you a way to do this while
still using dplyr-syntax. The first example, eligibility_example/ is a great
example of this: using duckplyr allows us to compute summary statistics on a
massive dataset, using dplyr-syntax,  without reading the entire dataset into
memory.
To learn more about duckplyr check out these resources:
- This post announcing duckplyr on duckdb.org
- This presentation by Kirill Müller at posit::confintroducing duckplyr and demonstrating some of its' features
- This talk by Hannes Mühleisen at posit::conftalking about DuckDB, the query engine / database that backs up duckplyr