Title: Automagically Document and Test Datasets Using Targets or Drake
Description: Documents and tests datasets in a reproducible manner so that data lineage is easier to comprehend for small to medium tabular data. Originally designed to aid data cleaning tasks for humanitarian research groups, specifically large-scale longitudinal studies.
Authors: Patrick Anker [aut, cre], Hillary Gao [ctb], Global TIES for Children [cph]
Maintainer: Patrick Anker <[email protected]>
License: MIT + file LICENSE
Version: 0.2.7
Built: 2024-11-12 04:27:55 UTC
Source: https://github.com/nyuglobalties/blueprintr
Access the blueprintr metadata at runtime
annotations(x)
annotation_names(x)
annotation(x, field)
super_annotation(x, field)
has_annotation(x, field)
has_super_annotation(x, field)
add_annotation(x, field, value, overwrite = FALSE)
set_annotation(x, field, value)
add_super_annotation(x, field, value)
remove_super_annotation(x, field)
x: An object, most likely a variable in a data.frame
field: The name of a metadata field
value: A value to assign to an annotation field
overwrite: If TRUE, overwrites an existing annotation value
annotations(): Gets a list of all annotations assigned to an object
annotation_names(): Gets the names of all annotations assigned to an object
annotation(): Gets an annotation for an object
super_annotation(): Gets an annotation that overrides existing annotations
has_annotation(): Checks whether an annotation exists for an object
has_super_annotation(): Checks whether an overriding annotation exists for an object
add_annotation(): Adds an annotation to an object, with the option of overwriting an existing value
set_annotation(): Alias for add_annotation(overwrite = TRUE)
add_super_annotation(): Adds an overriding annotation to an object. Note that overriding annotations will overwrite previous assignments!
remove_super_annotation(): Removes an overriding annotation
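As a hedged sketch of how these accessors might be used (the data frame, column, and "units" field below are illustrative, not from the package documentation):

```r
library(blueprintr)

df <- data.frame(height = c(1.6, 1.8))

# Attach a "units" annotation to the 'height' variable
df$height <- add_annotation(df$height, "units", "meters")

has_annotation(df$height, "units")
annotation(df$height, "units")
annotation_names(df$height)

# set_annotation() behaves like add_annotation(overwrite = TRUE)
df$height <- set_annotation(df$height, "units", "centimeters")
```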
Blueprints outline a sequence of checks and cleanup steps that run after a dataset is created. For these steps to be executed, the blueprint must be attached to a drake plan so that drake can run them properly.
attach_blueprints(plan, ...)
attach_blueprint(plan, blueprint)
plan: A drake plan
...: Multiple blueprints
blueprint: A blueprint object
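A minimal sketch of attaching a blueprint to a drake plan (the plan contents and file path are illustrative):

```r
library(drake)
library(blueprintr)

plan <- drake_plan(
  raw_data = read.csv(file_in("data/raw.csv"))  # illustrative input
)

bp <- blueprint(
  "clean_data",
  description = "Cleaned copy of raw_data",
  command = .TARGET("raw_data")
)

# Expands the plan with the blueprint's check and cleanup targets
plan <- attach_blueprint(plan, bp)
```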
Create a blueprint
blueprint(
  name,
  command,
  description = NULL,
  metadata = NULL,
  annotate = FALSE,
  metadata_file_type = c("csv"),
  metadata_file_name = NULL,
  metadata_directory = NULL,
  metadata_file_path = NULL,
  extra_steps = NULL,
  ...,
  class = character()
)
name: The name of the blueprint
command: The code to build the target dataset
description: An optional description of the dataset, used for codebook generation
metadata: The associated variable metadata for this dataset
annotate: If TRUE, annotations are applied to the dataset's variables
metadata_file_type: The kind of metadata file. Currently only CSV is supported.
metadata_file_name: The file name for the metadata file
metadata_directory: The directory where the metadata file will be stored
metadata_file_path: Overrides the metadata file path generated from metadata_file_name and metadata_directory
extra_steps: A list of extra bpstep objects to run during blueprint assembly
...: Any other parameters and settings for the blueprint
class: A subclass of blueprint, reserved for future extensions
A blueprint object
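For example, reusing the package's own mtcars example, a minimal blueprint looks like:

```r
library(blueprintr)

bp <- blueprint(
  "mtcars_dat",
  description = "The mtcars dataset",
  command = mtcars
)
```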
blueprintr offers some post-check tasks that attempt to match datasets to the metadata as closely as possible. Two default tasks always run:

- Reorders variables to match the metadata order.
- Drops variables marked with dropped == TRUE, if the dropped variable exists in the metadata.

The remaining tasks must be enabled by the user:

If labelled = TRUE in the blueprint() command, all columns will be converted to labelled() columns, provided that at least the description field is filled in. If the coding column is present in the metadata, the categorical levels specified by a coding() will be added to the column as well. If the description field is used for detailed column descriptions, a title field can be added to the metadata to act as a short title for each column.
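As a sketch of what such metadata might contain (the column set shown is an assumption based on the fields named above, and the coding() syntax is an assumption based on the rcoder package; blueprintr normally reads this table from a CSV file):

```r
# Illustrative metadata table with the fields described above
meta <- data.frame(
  name        = c("score", "group"),
  title       = c("Test score", "Treatment group"),
  description = c("Score on the assessment, 0-100",
                  "Assigned treatment group"),
  coding      = c(NA,
                  'coding(code("Control", 0), code("Treatment", 1))')
)
```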
blueprintr uses code inspection to identify and trace dataset dependencies. These macro functions signal a dependency to blueprintr and evaluate to symbols to be analyzed in the drake plan.
.TARGET(bp_name, .env = parent.frame())
.BLUEPRINT(bp_name, .env = parent.frame())
.META(bp_name, .env = parent.frame())
.SOURCE(dat_name)
mark_source(dat)
bp_name: Character string of the blueprint's name
.env: The environment in which to evaluate the macro. For internal use only!
dat_name: Character string of an object's name, used exclusively for marking "sources"
dat: A data.frame-like object
.TARGET(): Gets the symbol of the built and checked data
.BLUEPRINT(): Gets the symbol of the blueprint reference in the plan
.META(): Gets the symbol of the metadata reference in the plan
.SOURCE(): Gets a symbol for an object intended to be a "data source"
mark_source(): Marks a data.frame-like object as a source table
Generally speaking, the .BLUEPRINT and .META macros should be used in check functions, which frequently require context, e.g. configuration from the blueprint or coding expectations from the metadata. .TARGET is primarily used in blueprint commands, but there could be situations where a check depends on the content of another dataset.

It is important to note that the symbols generated by these macros are only understood in the context of a drake plan. The targets associated with the symbols are generated when blueprints are attached to a plan.
Sources provide a way to add variable UUIDs to objects that are not constructed using blueprints. This is often the case when the sourced table derives from some external HTTP query or a file on disk. Blueprints have limited ability to configure the underlying target behavior during the initial phase, so it is often easier to do that sort of fetching and pre-processing before using blueprints. However, you lose the benefit of variable lineage when you don't use blueprints. "Sources" are simply data.frame-like objects that carry the ".uuid" attribute on each variable so that variable lineage can cover the full data lifetime. Use blueprintr::mark_source() to add the UUID attributes, and then use .SOURCE() in your blueprints so lineage can be captured.
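A sketch of the source workflow (the file path and blueprint below are illustrative):

```r
library(blueprintr)

# Fetching and pre-processing done outside of blueprints
raw <- read.csv("data/registry.csv")  # illustrative input
raw <- mark_source(raw)               # injects ".uuid" attributes per variable

bp <- blueprint(
  "registry_clean",
  description = "Cleaned registry table",
  command = .SOURCE("raw")
)
```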
.TARGET("example_dataset")

.BLUEPRINT("example_dataset")

.META("example_dataset")

blueprint(
  "test_bp",
  description = "Blueprint with dependencies",
  command =
    .TARGET("parent1") %>%
    left_join(.TARGET("parent2"), by = "id") %>%
    filter(!is.na(id))
)
blueprint() objects store custom bpstep objects in the "extra_steps" element. This function adds a new step to that element.
bp_add_bpstep(bp, step)
bp: A blueprint
step: A bpstep object
if (FALSE) {
  # Based on the codebook export step
  step <- bpstep(
    step = "export_codebook",
    bp = bp,
    payload = bpstep_payload(
      target_name = blueprint_codebook_name(bp),
      target_command = codebook_export_call(bp),
      format = "file",
      ...
    )
  )

  bp_add_bpstep(bp, step)
}
Instruct blueprint to export codebooks
bp_export_codebook(
  blueprint,
  summaries = FALSE,
  file = NULL,
  template = NULL,
  title = NULL
)
blueprint: A blueprint
summaries: Whether or not variable summaries should be included in the codebook
file: Path to where the codebook should be saved
template: A path to an R Markdown template
title: Optional title of the codebook
An amended blueprint with the codebook export instructions
## Not run:
test_bp <- blueprint(
  "mtcars_dat",
  description = "The mtcars dataset",
  command = mtcars
)

new_bp <- test_bp %>%
  bp_export_codebook()
## End(Not run)
Instruct blueprint to generate kfa report
bp_export_kfa_report(
  bp,
  scale,
  path = NULL,
  path_pattern = NULL,
  format = NULL,
  title = NULL,
  kfa_args = list(),
  ...
)
bp: A blueprint
scale: Which scale(s) to analyze
path: Path(s) to where the report(s) should be saved
path_pattern: Overrides the default location to save files (always rooted to the project root with here::here())
format: The output format of the report(s)
title: Optional title of the report
kfa_args: Arguments forwarded to kfa::kfa()
...: Arguments forwarded to the executing engine, e.g. targets::tar_target_raw() or drake::target()
An amended blueprint with the kfa report export instructions
## Not run:
test_bp <- blueprint(
  "mtcars_dat",
  description = "The mtcars dataset",
  command = mtcars
)

new_bp <- test_bp %>%
  bp_export_kfa_report(scale = "some_scale")  # illustrative; must exist in the 'scale' field
## End(Not run)
blueprint() objects are essentially just list() objects that contain metadata on the data asset construction. Use bp_extend() to set or add new elements.
bp_extend(bp, ...)
bp: A blueprint
...: Keyword arguments forwarded to blueprint()
if (FALSE) {
  bp <- blueprint("some_blueprint", ...)

  adjusted_bp <- bp_extend(bp, new_option = TRUE)
  bp_with_annotation_set <- bp_extend(bp, annotate = TRUE)
}
panelcleaner defines a mapping structure used for data import of panel (or, more generally, longitudinal) surveys, which can be used as a source for some kinds of metadata (currently, only categorical coding information). If the blueprint constructs a mapped_df object, then this extension will signal to blueprintr to extract the mapping information and include it.
bp_include_panelcleaner_meta(blueprint)
blueprint: A blueprint that may create a mapped_df object
An amended blueprint with mapped_df metadata extraction enabled for metadata creation
The haven package has a handy tool called "labelled vectors", which are like factors that can be interpreted in other statistical software like Stata and SPSS. See haven::labelled() for more information on the type. Running this on a blueprint will instruct the blueprint to convert all variables with a non-NA title, description, or coding field to labelled vectors.
bp_label_variables(blueprint)
blueprint: A blueprint
An amended blueprint with variable labelling set for the cleanup phase
Each step in the blueprint assembly process is contained in a wrapper 'bpstep' object.
bpstep(step, bp, payload, ...)
step: The name of the step
bp: A blueprint object for which to create the assembled step
payload: A bpstep_payload object that outlines the code to be assembled, depending on the workflow executor
...: Extensions to the bpstep, like "allow_duplicates"
A 'bpstep' object
The bpstep payload is the object that contains the target name and command, along with any other metadata to be passed to the execution engine.
bpstep_payload(target_name, target_command, ...)
target_name: The target's name
target_command: The target's command
...: Arguments to be passed to the executing engine (e.g. arguments sent to targets::tar_target())
A bpstep payload object
if (FALSE) {
  bpstep(
    step = "some_step",
    bp = some_bp_object,
    payload = bpstep_payload(
      "payload_name",
      payload_command()
    )
  )
}
Create a quoted list of check calls
check_list(...)
...: A collection of calls to be used for checks
After building a dataset, it is beneficial (if not a requirement) to run tests on that dataset to ensure that it behaves as expected. blueprintr gives authors a framework to run these tests automatically, both for individual variables and for general dataset checks.

blueprintr provides three functions as models for developing these kinds of checks: one to verify that all expected variables are present, one to verify the variable types, and a generic function that checks whether variable values are contained within a known set.
all_variables_present(df, meta, blueprint)
all_types_match(df, meta)
df: The built dataset
meta: The dataset's metadata
blueprint: The dataset's blueprint
After checks pass, this step runs in the blueprint sequence. If any cleanup features are enabled, they will run on the dataset prior to setting the final blueprint target.
cleanup(results, df, blueprint, meta)
results: A reference to the check results, currently used to ensure that this step runs after the checks step
df: The built dataset
blueprint: The blueprint associated with the built dataset
meta: The metadata associated with the built dataset
One of the targets in the blueprint workflow target chain. If a metadata file does not exist, then this function will be added to the workflow.
create_metadata_file(df, blueprint, ...)
df: A dataframe that the metadata table describes
blueprint: The original blueprint for the dataframe
...: A variable list of metadata tables on which this metadata table depends
Runs all checks, both dataset and variable checks, on a blueprint to determine whether a built dataset passes all restrictions.
eval_checks(..., .env = parent.frame())
...: All quoted check calls
.env: The environment in which the calls are evaluated
Check functions are simple functions that take in either a data.frame or a variable at minimum, plus extra arguments if needed, and return a logical value: TRUE or FALSE. In blueprintr, the entire check passes or fails, unlike other testing frameworks such as pointblank. If you'd like to embed extra context in your test result, modify the "check.errors" attribute of the returned logical value with a character vector, which will be rendered into a bulleted list. Note: if you embed reasons alongside a TRUE, the check will produce a warning in the targets or drake pipeline.
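As a hedged sketch of a check following this convention (the column name and range are illustrative):

```r
# Illustrative check: all values of 'score' fall within 0-100
scores_in_range <- function(df) {
  bad <- df$score[!is.na(df$score) & (df$score < 0 | df$score > 100)]
  result <- length(bad) == 0

  if (!result) {
    # Extra context, rendered by blueprintr as a bulleted list
    attr(result, "check.errors") <- paste0("Out-of-range score: ", bad)
  }

  result
}
```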
Test if x is a subset of y
in_set(x, y)
x: A vector
y: A vector representing an entire set
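For instance (a small sketch, assuming in_set() follows the subset semantics described above):

```r
# Passes: every element of x appears in y
in_set(c(1, 2), c(1, 2, 3))

# Fails: 4 is not in the set
in_set(c(1, 4), c(1, 2, 3))
```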
Load a blueprint from a script file
load_blueprint(plan, file)
load_blueprints(plan, directory = here::here("blueprints"), recurse = FALSE)
plan: A drake plan
file: A path to a script file
directory: A path to a directory of blueprint script files. Defaults to the "blueprints" directory at the root of the current R project.
recurse: If TRUE, recursively loads blueprints from the directory
A drake_plan with attached blueprints
By default, blueprintr ignores empty blueprint folders. However, it may be beneficial to warn users if a folder is empty, particularly during project setup, to help identify any misconfiguration of drake plan attachment. To enable these warnings, set options(blueprintr.warn_empty_blueprints_dirs = TRUE).
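A minimal sketch (the plan contents are illustrative):

```r
library(drake)
library(blueprintr)

plan <- drake_plan(
  raw_data = read.csv(file_in("data/raw.csv"))  # illustrative input
)

# Attach every blueprint script found under ./blueprints
plan <- load_blueprints(plan)
```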
Read blueprints from folder and get lineage
load_table_lineage(
  directory = here::here("blueprints"),
  recurse = FALSE,
  script = here::here("_targets.R")
)
directory: A folder containing blueprint scripts
recurse: Should this function recursively load blueprints?
script: Where the targets/drake project script file is located. Defaults to using targets.
An igraph of the table lineage for the desired blueprints
Convert an input dataframe into a metadata object
metadata(df)
df: A dataframe that will be converted into a metadata object, once content checks pass
Usually, metadata should reflect what the data should represent and act as a check on the generation code. However, in the course of data aggregation, it is common to perform large transformations that would be cumbersome to document manually. This exposes a metadata-manipulation framework, in the style of tidytable::mutate, that runs prior to metadata file creation.
mutate_annotation(.data, .field, ..., .overwrite = TRUE)

mutate_annotation_across(
  .data,
  .field,
  .fn,
  .cols = tidyselect::everything(),
  .with_names = FALSE,
  ...,
  .overwrite = TRUE
)
.data: A data.frame
.field: The name of the annotation field to modify
...: For mutate_annotation(), name-value pairs of annotation expressions; for mutate_annotation_across(), extra arguments passed to .fn
.overwrite: If TRUE, overwrites existing annotation values
.fn: A function that takes in a vector and arbitrary arguments
.cols: A tidyselect-compatible selection of variables to be edited
.with_names: If TRUE, .fn also receives the column's name as its second argument
A data.frame with annotated columns
# Adds a "mean" annotation to 'mpg'
mutate_annotation(mtcars, "mean", mpg = mean(mpg))

# Adds a "mean" annotation to all variables in `mtcars`
mutate_annotation_across(mtcars, "mean", .fn = mean)

# Adds a "title" annotation that copies the column name
mutate_annotation_across(
  mtcars,
  "title",
  .fn = function(x, nx) nx,
  .with_names = TRUE
)
Creates a new drake plan from a blueprint
plan_from_blueprint(blueprint)
blueprint: A blueprint
A drake plan with all of the necessary blueprint steps
Render codebooks for datasets
render_codebook(
  blueprint,
  meta,
  file,
  title = glue::glue("{ui_value(blueprint$name)} Codebook"),
  dataset = NULL,
  template = bp_path("codebook_templates/default_codebook.Rmd"),
  ...
)
blueprint: A dataset blueprint
meta: A metadata object
file: Path to where the codebook should be saved
title: Title of the codebook
dataset: If included, the built dataset, used to generate variable summaries
template: Path to the knitr template
...: Extra parameters passed to the rendering engine
Generates a k-fold factor analysis report using the 'scale' field in the blueprintr data dictionaries. While not recommended, this function does allow a variable to load on multiple scales, delimited by commas. For example, 'var1' could have 'scale' set to "SCALE1,SCALE2".
render_kfa_report(
  dat,
  bp,
  meta,
  scale,
  path = NULL,
  path_pattern = "reports/kfa-{snakecase_scale}-{dat_name}.html",
  format = NULL,
  title = NULL,
  ...
)
dat: Source data
bp: The dataset's blueprint
meta: blueprintr data dictionary
scale: Scale identifier to be located in the 'scale' field
path: Where to output the report; defaults to the "reports" subfolder of the current working project folder
path_pattern: If path is NULL, a glue-style pattern used to construct the report path (always rooted to the project root with here::here())
format: The output format; defaults to 'html_document'
title: Optional title of the report
...: Arguments forwarded to kfa::kfa()
Path to where the generated report is saved
As of blueprintr 0.2.1, metadata files have the option to always overwrite annotations at runtime. Previously, this conflicted with mutate_annotation() and mutate_annotation_across(), since the annotation phase happens during the blueprint cleanup phase, whereas these annotation manipulation tools run during the blueprint initial phase. To resolve this, 0.2.1 introduces "super annotations", which are just annotations prefixed with "super.". Super annotations overwrite the normal annotations during cleanup, which gives the annotation manipulation tools a means of not losing their work when annotate_overwrite is effectively enabled. To enable this functionality, set options(blueprintr.use_improved_annotations = TRUE). This also has the side effect of always treating annotate = TRUE and annotate_overwrite = TRUE.
improved_annotation_option() using_improved_annotations()
improved_annotation_option(): Returns the option string for improved annotations
using_improved_annotations(): Checks if improved annotations are enabled
Unlike drake, which requires some extra metaprogramming to "attach" blueprint steps to a plan, targets pipelines allow for direct target construction. Blueprints can thus be added directly to a tar_pipeline() object using this function. The arguments for tar_blueprint() are exactly the same as those of blueprint(). tar_blueprints() behaves like load_blueprints() but is called, like tar_blueprint(), directly in a tar_pipeline() object.
tar_blueprint(...)
tar_blueprints(directory = here::here("blueprints"), recurse = FALSE)
tar_blueprint_raw(bp)
...: Arguments passed to blueprint()
directory: A folder containing R scripts that evaluate to blueprint objects
recurse: If TRUE, recursively loads blueprints from the directory
bp: A blueprint object
A list() of tar_target objects
By default, blueprintr ignores empty blueprint folders. However, it may be beneficial to warn users if a folder is empty, particularly during project setup, to help identify any misconfiguration of targets generation. To enable these warnings, set options(blueprintr.warn_empty_blueprints_dirs = TRUE).
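A minimal _targets.R sketch (the raw-data target and file path are illustrative):

```r
# _targets.R
library(targets)
library(blueprintr)

list(
  tar_target(raw_data, read.csv("data/raw.csv")),  # illustrative input
  tar_blueprint(
    "clean_data",
    description = "Cleaned copy of raw_data",
    command = .TARGET("raw_data")
  )
)
```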
This is an experimental feature that traces variable lineage through an injection of a ".uuid" attribute for each variable. Previous attempts at variable lineage were conducted using variable names and heuristics of known functions. This approach yields a more consistent lineage.
load_variable_lineage(
  directory = here::here("blueprints"),
  recurse = FALSE,
  script = here::here("_targets.R")
)

filter_variable_lineage(
  g,
  variables = NULL,
  tables = NULL,
  mode = "all",
  cutoff = -1
)

vis_variable_lineage(..., g = NULL, cluster_by_dataset = TRUE)
directory: A folder containing blueprint scripts
recurse: Should this function recursively load blueprints?
script: Where the targets/drake project script file is located. Defaults to using targets.
g: An igraph object. This defaults to a graph loaded with load_variable_lineage(). However, use this if you want to inspect subgraphs of the variable lineage.
variables: Character vector of patterns for variable names to match. The patterns are combined disjunctively (e.g. "variable pattern A or variable pattern B").
tables: Character vector of patterns for table names to match. The patterns are combined disjunctively (e.g. "table pattern A or table pattern B").
mode: Which sort of relationships to include. Defaults to "all" (includes relations both to and from the target node in the graph). See igraph::all_simple_paths() for more details.
cutoff: The number of node steps to consider in the graph traversal for filtering. Defaults to -1 (no limit on steps). See igraph::all_simple_paths() for more details.
...: Arguments passed to load_variable_lineage()
cluster_by_dataset: If TRUE, groups variables by their parent dataset in the visualization
To enable the variable UUID feature, set options(blueprintr.use_variable_uuids = TRUE).
load_variable_lineage(): Reads blueprints from a folder to get variable lineage. Returns an igraph of the variable lineage.
filter_variable_lineage(): Filters for specific variables to include in the lineage graph
vis_variable_lineage(): Visualizes variable lineage with visNetwork. Returns an interactive graph.
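A hedged usage sketch (the variable pattern is illustrative, and the UUID option described above must be enabled):

```r
library(blueprintr)

# Requires options(blueprintr.use_variable_uuids = TRUE)
g <- load_variable_lineage()

# Keep only lineage paths that touch variables matching "score"
g_scores <- filter_variable_lineage(g, variables = "score")

vis_variable_lineage(g = g_scores)
```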
Visualize table lineage with visNetwork
vis_table_lineage(..., g = NULL)
...: Arguments passed to load_table_lineage()
g: An igraph object, defaulting to the one created with load_table_lineage()
Interactive graph run by visNetwork