Introduction to blueprintr

blueprintr is a framework for managing your data assets in a reproducible fashion. Built on top of targets or drake, it adds automated steps for documenting and testing tabular datasets. This gives researchers a replicable workflow that catches programming issues before they affect analysis results.

Installation

# install.packages("blueprintr", repos = "https://nyuglobalties.r-universe.dev")

library(blueprintr)

Designed Use of blueprintr

blueprintr provides your data with guardrails typically found in software engineering workflows: just as software is tested and documented before it is deployed to production, your datasets are tested and documented before they are used in analysis.

The top level of the blueprintr workflow is a “blueprints” directory, consisting of .R and .csv files.

About blueprints

Each blueprint has two components to it:

  • Data Construction Spec: usually a .R file that instructs drake or targets on how to build a specific dataset.
  • Metadata: usually a .csv file that describes the dataset’s variables, incorporating information from any mapping files as well as checks that need to be run on the dataset.

To create a blueprint, we use the blueprint() function. This function takes three main arguments: name (the name of your generated dataset), description (a description of your dataset), and command (the code that builds the dataset).
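
Here is a minimal sketch of a blueprint. load_raw_cars() is a hypothetical function that you would define in your project’s R/ directory:

blueprint(
  "car_data",
  description = "A cleaned copy of the raw car data",
  command = load_raw_cars()  # hypothetical loader defined in R/
)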

A small project may need only a few blueprints, but more likely you’ll chain nested blueprints together to transform the data in stages.

blueprintr generates six “steps” (targets) per blueprint:

  • {blueprint}_initial: The result of running the blueprint’s command
  • {blueprint}_blueprint: A copy of the blueprint, to be used throughout the plan
  • {blueprint}_meta_path: Creates the metadata file if it doesn’t exist and returns its path
  • {blueprint}_meta: Loads the dataset metadata from the metadata file
  • {blueprint}_checks: Runs all tests on the {blueprint}_initial target
  • {blueprint}: The built dataset, after running some cleanup tasks
When writing other steps in your workflow (whether in targets or drake), avoid referring to the {blueprint}_initial target, since it could have problems that are only discovered in the {blueprint}_checks step. Depend on the final {blueprint} target instead, as in the sketch below.
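
For example, a downstream blueprint (or any other target) should depend on the finalized car_data target from the earlier sketch, never on car_data_initial. summarize_cars() is a hypothetical helper; referencing car_data by name creates the dependency, as in any other targets workflow:

blueprint(
  "car_summary",
  description = "Summary statistics derived from car_data",
  command = summarize_cars(car_data)  # depends on the checked, finalized dataset
)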

Example

Let’s take a well-known dataset, mtcars, and create a blueprint for it.

# Keeping the row names under the column `rn`
our_mtcars <- mtcars |> tidytable::as_tidytable(.keep_rownames = "rn")

# Inspecting our mtcars dataset
head(our_mtcars)
#> # A tidytable: 6 × 12
#>   rn                  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
#>   <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4          21       6   160   110  3.9   2.62  16.5     0     1     4
#> 2 Mazda RX4 Wag      21       6   160   110  3.9   2.88  17.0     0     1     4
#> 3 Datsun 710         22.8     4   108    93  3.85  2.32  18.6     1     1     4
#> 4 Hornet 4 Drive     21.4     6   258   110  3.08  3.22  19.4     1     0     3
#> 5 Hornet Sportabout  18.7     8   360   175  3.15  3.44  17.0     0     0     3
#> 6 Valiant            18.1     6   225   105  2.76  3.46  20.2     1     0     3
#> # ℹ 1 more variable: carb <dbl>

When we ingest data from various sources, it’s usually helpful to outline the expected metadata for each source. At TIES, we document this metadata in a user-created “mapping file,” which records any variable name changes as well as changes to categorical variable codings.

mapping_file <- system.file("mapping/mtcars_item_mapping.csv", package = "blueprintr", mustWork = TRUE)

# Read this csv file:
item_mapping <- mapping_file |>
  readr::read_csv(
    col_types = readr::cols(
      name_1 = readr::col_character(),
      description_1 = readr::col_character(),
      coding_1 = readr::col_character(),
      panel = readr::col_character(),
      homogenized_name = readr::col_character(),
      homogenized_coding = readr::col_character(),
      homogenized_description = readr::col_character()
    )
  )
item_mapping
#> # A tibble: 12 × 7
#>    name_1 description_1       coding_1 panel homogenized_name homogenized_coding
#>    <chr>  <chr>               <chr>    <chr> <chr>            <chr>             
#>  1 rn     Name of car          <NA>    MTCA… name              <NA>             
#>  2 mpg    Miles per gallon     <NA>    MTCA… mpg               <NA>             
#>  3 cyl    Number of cylinders  <NA>    MTCA… cyl               <NA>             
#>  4 disp   Displacement         <NA>    MTCA… disp              <NA>             
#>  5 hp     Gross horsepower     <NA>    MTCA… hp                <NA>             
#>  6 drat   Rear axle ratio      <NA>    MTCA… drat              <NA>             
#>  7 wt     Weight               <NA>    MTCA… wt                <NA>             
#>  8 qsec   Quarter mile time    <NA>    MTCA… qsec              <NA>             
#>  9 vs     Engine              "coding… MTCA… vs               "coding(code(\"1\…
#> 10 am     Transmission        "coding… MTCA… am               "coding(code(\"1\…
#> 11 gear   Number of forward …  <NA>    MTCA… gear              <NA>             
#> 12 carb   Number of carburet…  <NA>    MTCA… carb              <NA>             
#> # ℹ 1 more variable: homogenized_description <chr>

Then, we typically use a tool such as panelcleaner to attach our mapping file to the mtcars dataset. This command is executed in the data construction spec:

blueprint(
  "mt_cars",
  description = "mtcars database with attached metadata",
  annotate = TRUE,
  command = {
    pnl <- panelcleaner::enpanel("MTCARS_PANEL", our_mtcars) |>
      panelcleaner::add_mapping(item_mapping) |>
      panelcleaner::homogenize_panel() |>
      panelcleaner::bind_waves() |>
      as.data.frame()

    pnl_name <- get_attr(pnl, "panel_name")
    pnl_mapping <- get_attr(pnl, "mapping")

    class(pnl) <- c("mapped_df", class(pnl))
    set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
  }
) |>
  bp_include_panelcleaner_meta()
#> <blueprint: 'mt_cars'>
#> 
#> Description: mtcars database with attached metadata
#> Annotations: ENABLED
#> Metadata location: '/tmp/RtmpxPtsQS/Rbuild24a9529f0292/blueprintr/blueprints/mt_cars.csv'
#> 
#> -- Command --
#> Workflow command:
#> {
#>     pnl <- as.data.frame(panelcleaner::bind_waves(panelcleaner::homogenize_panel(panelcleaner::add_mapping(panelcleaner::enpanel("MTCARS_PANEL", 
#>         our_mtcars), item_mapping))))
#>     pnl_name <- get_attr(pnl, "panel_name")
#>     pnl_mapping <- get_attr(pnl, "mapping")
#>     class(pnl) <- c("mapped_df", class(pnl))
#>     set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
#> }
#> 
#> Raw command:
#> {
#>     pnl <- as.data.frame(panelcleaner::bind_waves(panelcleaner::homogenize_panel(panelcleaner::add_mapping(panelcleaner::enpanel("MTCARS_PANEL", 
#>         our_mtcars), item_mapping))))
#>     pnl_name <- get_attr(pnl, "panel_name")
#>     pnl_mapping <- get_attr(pnl, "mapping")
#>     class(pnl) <- c("mapped_df", class(pnl))
#>     set_attrs(pnl, mapping = pnl_mapping, panel_name = pnl_name)
#> }

Save this script with a filename of your choice inside the “blueprints” directory of your project. We’ll assume you are using targets for your project:

./
  _targets.R
  blueprints/
    ... all blueprint R and CSV files go here ...
  R/
    ... all associated R function definitions are here ...
  project.Rproj
  ...

It is not required to use panelcleaner, or even to document the source metadata; this is just a convention we developed at TIES. However, we strongly advise doing something similar to track your data sources over time.

When running this code with either targets or drake, the blueprint metadata is automatically created. For our mtcars example, this looks like:

#> # A tibble: 13 × 4
#>    name  type      description             coding                               
#>    <chr> <chr>     <chr>                   <chr>                                
#>  1 name  character Name of Car              <NA>                                
#>  2 mpg   double    Miles per gallon         <NA>                                
#>  3 cyl   double    Number of cylinders      <NA>                                
#>  4 disp  double    Displacement             <NA>                                
#>  5 hp    double    Gross horsepower         <NA>                                
#>  6 drat  double    Rear axle ratio          <NA>                                
#>  7 wt    double    Weight                   <NA>                                
#>  8 qsec  double    Quarter mile time        <NA>                                
#>  9 vs    character Engine                  "coding(code(\"straight\",\"1\"), co…
#> 10 am    character Transmission            "coding(code(\"manual\",\"1\"), code…
#> 11 gear  double    Number of forward gears  <NA>                                
#> 12 carb  double    Number of carburetors    <NA>                                
#> 13 wave  character <NA>                     <NA>

Manually editing this metadata file allows you to add documentation and tests that check each variable’s type and values.
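
For instance, the generated blueprints/mt_cars.csv could be edited by hand to fill in the missing description for the wave variable. This is an illustrative excerpt only, using the four columns (name, type, description, coding) shown above; consult the blueprintr documentation for the exact conventions for declaring additional checks:

name,type,description,coding
mpg,double,Miles per gallon,
wave,character,Panel wave identifier,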

The last step of our work is to load this blueprint into either targets or drake. For this example, we’ll use targets, as drake is superseded. A full discussion of targets is beyond the scope of this vignette, but the targets user manual provides an excellent walkthrough. The only detail needed here is to add blueprintr::tar_blueprints() to your _targets.R file:

# _targets.R
library(targets)

# ...

list(
  tar_target(
    item_mapping,
    readr::read_csv("where/your/mapping/file/is/stored.csv")
  ),
  
  blueprintr::tar_blueprints()

  # Other targets for your project!
)

This will load all blueprints in the “blueprints” directory. If you have a nested directory structure, use blueprintr::tar_blueprints(recurse = TRUE).

And there you have it! You have created your first blueprint on the mtcars dataset. When you run a pipeline with blueprintr, the checks warn you about data issues at an early stage, helping you produce replicable results.
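
To build everything, run the pipeline as you would in any targets project. tar_make() executes each blueprint’s six steps, and the finalized, checked dataset can then be read back by name:

# Build all targets, including the six steps generated for each blueprint
targets::tar_make()

# Read the finalized, checked dataset back into your session
targets::tar_read(mt_cars)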