This vignette explains how to add new data-generating mechanisms
(DGMs) to the PublicationBiasBenchmark package. In the
following, we will use the no_bias DGM as an example. (See
the Using Presimulated
Datasets vignette for details on working with the already stored
simulated datasets.)
Overview
Each DGM in the package consists of three key components:
- Main DGM function: Implements the data-generating mechanism
- Validation function: Validates input parameters and settings
- Conditions function: Defines pre-specified conditions
All three functions must be implemented in a single file named
dgm-{DGM_NAME}.R in the R/ directory.
Implementation of these three functions allows users to generate data
from the DGM via the simulate_dgm()
function.
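As a starting point, such a file can be laid out as a skeleton containing the three required functions. The sketch below uses a hypothetical DGM called "my_dgm"; the bodies are placeholders to be filled in as described in the following sections.
# R/dgm-my_dgm.R -- skeleton for a hypothetical DGM called "my_dgm"

#' @export
dgm.my_dgm <- function(dgm_name, settings) {
  # ... simulate and return a data frame with the required columns ...
}

#' @export
validate_dgm_setting.my_dgm <- function(dgm_name, settings) {
  # ... check the entries of `settings`; stop() on invalid input ...
  invisible(TRUE)
}

#' @export
dgm_conditions.my_dgm <- function(dgm_name) {
  # ... return a data frame of pre-specified settings with a condition_id column ...
}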
File Structure and Naming
For a DGM called “no_bias”, you need to create a file named
R/dgm-no_bias.R containing three functions:
- dgm.no_bias(): The main data-generating mechanism implementation
- validate_dgm_setting.no_bias(): Parameter validation
- dgm_conditions.no_bias(): Pre-defined conditions
The naming pattern is crucial for the package’s S3 method dispatch system to work correctly.
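To see why the naming matters, the following is a minimal, self-contained sketch of how a generic could dispatch on the DGM name. It only illustrates the S3 pattern; it is not the package's actual simulate_dgm() implementation, and run_dgm() is a made-up name.
# Illustrative only: tag the DGM name with an S3 class so that
# UseMethod() selects dgm.no_bias(), validate_dgm_setting.no_bias(), etc.
run_dgm <- function(dgm_name, settings) {
  dgm_name <- structure(dgm_name, class = dgm_name)
  UseMethod("dgm", dgm_name)  # dispatches to dgm.<dgm_name>()
}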
1. Main DGM Function: dgm.{DGM_NAME}()
This is the core function that implements your data-generating
mechanism. Here is the no_bias implementation as an
example:
#' @title Normal Unbiased Data-Generating Mechanism
#'
#' @description
#' An example data-generating mechanism to simulate effect sizes without
#' publication bias.
#'
#' @param dgm_name DGM name (automatically passed)
#' @param settings List containing \describe{
#' \item{mean_effect}{Mean effect}
#' \item{heterogeneity}{Effect heterogeneity}
#' \item{n_studies}{Number of effect size estimates}
#' }
#'
#' @return Data frame with \describe{
#'   \item{yi}{effect size}
#'   \item{sei}{standard error}
#'   \item{ni}{sample size}
#' }
#'
#' @references
#' \insertAllCited{}
#'
#' @seealso [dgm()], [validate_dgm_setting()]
#' @export
dgm.no_bias <- function(dgm_name, settings) {

  # Extract settings
  n_studies     <- settings[["n_studies"]]
  mean_effect   <- settings[["mean_effect"]]
  heterogeneity <- settings[["heterogeneity"]]

  # Simulate sample sizes based on empirical distribution
  N_shape <- 2
  N_scale <- 58
  N_low   <- 25
  N_high  <- 500
  N_seq   <- seq(N_low, N_high, 1)
  N_den   <- stats::dnbinom(N_seq, size = N_shape, prob = 1/(N_scale+1)) /
    (stats::pnbinom(N_high, size = N_shape, prob = 1/(N_scale+1)) -
       stats::pnbinom(N_low - 1, size = N_shape, prob = 1/(N_scale+1)))
  N <- sample(N_seq, n_studies, TRUE, N_den)

  # Compute standard errors based on sample sizes (Cohen's d formula)
  standard_errors <- sqrt(4/N)

  # Simulate observed effect sizes: heterogeneity and sampling error
  # are combined into a single normal draw around the mean effect
  effect_sizes <- stats::rnorm(n_studies, mean_effect,
                               sqrt(heterogeneity^2 + standard_errors^2))

  # Return standardized data frame
  data <- data.frame(
    yi  = effect_sizes,
    sei = standard_errors,
    ni  = N
  )

  return(data)
}
Key Requirements for the Main Function:
Input Parameters:
- dgm_name: Automatically passed by the framework
- settings: Named list containing all DGM parameters, or the condition_id value of a pre-defined condition
Output: Must return a data frame with these required columns:
- yi: Effect sizes
- sei: Standard errors
- ni: Sample sizes
- es_type: Type of effect size (e.g., “SMD”, “logOR”, “none”)
Optional additional columns (commonly used):
- study_id: Unique identifier for each study/cluster (in the presence of multilevel/clustered data)
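While developing a new DGM, a quick sanity check of the returned data frame can catch missing columns early. The helper below is hypothetical (not part of the package) and checks for the columns that the no_bias example above produces; extend required to cover es_type and any other columns your DGM is expected to return.
# Hypothetical helper: verify that a DGM returns the expected columns
check_dgm_output <- function(data, required = c("yi", "sei", "ni")) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0)
    stop("DGM output is missing columns: ", paste(missing_cols, collapse = ", "))
  invisible(TRUE)
}

check_dgm_output(simulate_dgm("no_bias", list(
  mean_effect = 0.3, heterogeneity = 0.1, n_studies = 20
)))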
2. Validation Function: validate_dgm_setting.{DGM_NAME}()
This function validates that all required parameters are provided and have valid values:
#' @export
validate_dgm_setting.no_bias <- function(dgm_name, settings) {

  # Check that all required settings are specified
  required_params <- c("n_studies", "mean_effect", "heterogeneity")
  missing_params  <- setdiff(required_params, names(settings))
  if (length(missing_params) > 0)
    stop("Missing required settings: ", paste(missing_params, collapse = ", "))

  # Extract settings for validation
  n_studies     <- settings[["n_studies"]]
  mean_effect   <- settings[["mean_effect"]]
  heterogeneity <- settings[["heterogeneity"]]

  # Validate each parameter
  if (length(n_studies) != 1 || !is.numeric(n_studies) || is.na(n_studies) ||
      !is.wholenumber(n_studies) || n_studies < 1)
    stop("'n_studies' must be an integer larger than 0")

  if (length(mean_effect) != 1 || !is.numeric(mean_effect) || is.na(mean_effect))
    stop("'mean_effect' must be numeric")

  if (length(heterogeneity) != 1 || !is.numeric(heterogeneity) ||
      is.na(heterogeneity) || heterogeneity < 0)
    stop("'heterogeneity' must be non-negative")

  return(invisible(TRUE))
}
Key Points for Validation:
- Check for missing required parameters
- Validate parameter types (numeric, integer, character, etc.)
- Check parameter ranges and constraints
- Provide clear, informative error messages
- Return invisible(TRUE) on successful validation
- Use stop() for validation failures
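Note that is.wholenumber() used above is an internal package helper rather than a base R function; a typical definition checks abs(x - round(x)) against a small tolerance. Assuming simulate_dgm() runs the validation before generating any data, invalid settings surface as informative errors, for example:
# Missing parameters are reported before any data are generated
tryCatch(
  simulate_dgm("no_bias", list(mean_effect = 0.3, heterogeneity = 0.1)),
  error = function(e) conditionMessage(e)
)
#> "Missing required settings: n_studies"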
3. Conditions Function: dgm_conditions.{DGM_NAME}()
This function defines pre-specified conditions for benchmarking studies:
#' @export
dgm_conditions.no_bias <- function(dgm_name) {

  # Generate a grid of pre-specified settings
  settings <- data.frame(expand.grid(
    mean_effect   = c(0, 0.3),
    heterogeneity = c(0, 0.15),
    n_studies     = c(10, 100)
  ))

  # Attach unique condition identifiers
  settings$condition_id <- 1:nrow(settings)

  return(settings)
}
Always add a condition_id column with unique identifiers. This column is used for generating data from the pre-defined conditions.
Once defined, these conditions must not be changed retrospectively; this preserves the reproducibility and continuity of the benchmark.
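Because each row of the grid is identified by its condition_id, the parameter values behind a given condition can be looked up by subsetting the conditions data frame, for example:
conditions <- dgm_conditions("no_bias")
# Parameter values behind pre-defined condition 3
conditions[conditions$condition_id == 3, ]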
Using Your New DGM
Once implemented, your DGM can be used through a unified interface:
# Use with custom settings
data <- simulate_dgm("no_bias", list(
  mean_effect   = 0.2,
  heterogeneity = 0.1,
  n_studies     = 50
))
head(data)

# Use with pre-defined conditions
data <- simulate_dgm("no_bias", settings = 1)
head(data)
# View available conditions
conditions <- dgm_conditions("no_bias")
conditions
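Since the pre-defined conditions are fixed, a benchmark run can simply iterate over them. The sketch below generates one data set per condition; set.seed() is used here only to make the illustration reproducible and is not prescribed by the package.
# Generate one data set for every pre-defined condition
conditions <- dgm_conditions("no_bias")

set.seed(2025)
datasets <- lapply(conditions$condition_id, function(id) {
  simulate_dgm("no_bias", settings = id)
})
names(datasets) <- paste0("condition_", conditions$condition_id)

# Number of simulated studies in each condition
sapply(datasets, nrow)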