Example ldmppr Workflow on Simulated Data

library(ldmppr)

Data

To illustrate the use of the ldmppr package, we will be using the included small_example_data dataset. This dataset consists of locations and sizes for 121 points in a (25m x 25m) square domain.

# Load the data
data("small_example_data")

# Display the first few rows of the dataset
nrow(small_example_data)
#> [1] 121
head(small_example_data)
#>           x         y     size
#> 1 10.000000 14.000000 897.4857
#> 2  9.437449 10.841183 623.7816
#> 3 16.434584 10.734105 583.2180
#> 4 11.155480  8.432171 508.4971
#> 5 13.293043 22.866626 472.0684
#> 6  4.001424 13.901061 434.1340

We also include a collection of example raster files obtained from the following ESS-DIVE repository: https://data.ess-dive.lbl.gov/view/doi:10.15485/1652536.

Details related to generating and obtaining these datasets can be found in the help file for the small_example_data dataset.

Introduction

The ldmppr package provides tools for estimating marked point processes with regularity and dependence between the marks and the locations using self-correcting point processes, training mark models, assessing model fit, and simulating spatio-temporal datasets. This vignette demonstrates the core functionality of the package through an example workflow. The approach we take in fitting a location dependent marked point process is to equate it to a spatio-temporal process where the marks (i.e. sizes) can be mapped to arrival times of events. This allows us to use the machinery of spatio-temporal point processes to estimate the parameters of the marked point process using a likelihood based approach that is computationally feasible.

Workflow Overview

  1. Estimate the parameters of a self-correcting point process given a reference dataset.
  2. Train a mark model using simulated or real-world data.
  3. Check the fit of the model using various non-parametric summaries for point processes and global envelope tests.
  4. Simulate and visualize datasets from the fitted model.

Example: Analyzing Location Dependent Marked Point Processes Using Mechanistic Spatio-Temporal Point Process Models

Step 1: Estimating a Self-Correcting Point Process

We start by estimating the parameters of a self-correcting spatio-temporal point process using the estimate_parameters_sc function. This function makes use of a user specified optimization algorithm in the nloptr function to perform estimation through log-likelihood optimization. The function requires the observed locations and observed arrival times, which are generated from the observed sizes using a power law mapping function. The power law mapping function depends on the hyperparameter delta, which controls the shape of the mapping relationship between sizes and arrival times. The function also requires the grid values for the optimization process, the initial parameter values, the upper bounds for the parameters, and the optimization algorithm to use. The function returns the optimal parameter estimates for the self-correcting point process.

# Map the observed sizes to arrival times using the power_law_mapping function
parameter_estimation_data <- small_example_data %>%
  dplyr::mutate(time = power_law_mapping(size = size, delta = .5)) %>%
  dplyr::select(time, x, y)

# Define the grid values
x_grid <- seq(0, 25, length.out = 5)
y_grid <- seq(0, 25, length.out = 5)
t_grid <- seq(0, 1, length.out = 5)

# Define the parameter initialization values
parameter_inits <- c(1.5, 8.5, .015, 1.5, 3.2, .75, 3, .08)

# Define the upper bounds
upper_bounds <- c(1, 25, 25)

# Estimate the parameters
estimated_sc <- estimate_parameters_sc(
  data = parameter_estimation_data,
  x_grid = x_grid,
  y_grid = y_grid,
  t_grid = t_grid,
  parameter_inits = parameter_inits,
  upper_bounds = upper_bounds,
  opt_algorithm = "NLOPT_LN_SBPLX",
  nloptr_options = list(
    maxeval = 300,
    xtol_rel = 1e-3
  ),
  verbose = FALSE
)

# Obtain the optimal parameter estimates
optimal_parameters <- estimated_sc$solution
print(optimal_parameters)
#> [1] 2.07287954 8.51414057 0.02969347 1.84102470 3.34284409 1.06207190 2.66250000
#> [8] 0.16594370

In practice, we recommend using a denser grid for the optimization process to increase the likelihood that the global optimum is found. We also note that the function estimate_parameters_sc_parallel() can be used to fit the model for varying values of delta in parallel to identify the optimal mapping between sizes and arrival times.

Step 2: Training a Mark Model

Next, we use the train_mark_model function to train a suitably flexible mark model using location specific covariates and the generated arrival times to predict sizes. The function uses the xgboost or ranger engines to train the model and may be run in parallel if desired. The user has control over the choice of a Gradient Boosting Machine (GBM) or Random Forest (RF) model, the bounds of the spatial domain, the inclusion of competition indices, the competition radius, the correction method, the final model selection metric, the number of cross validation folds, and the size of the parameter tuning grid for the model.

# Load the example raster files
raster_paths <- list.files(system.file("extdata", package = "ldmppr"), pattern = "\\.tif$", full.names = TRUE)
rasters <- lapply(raster_paths, terra::rast)
scaled_rasters <- scale_rasters(rasters)

# Generate the time values using the power law mapping with optimal delta
model_training_data <- small_example_data %>%
  dplyr::mutate(time = power_law_mapping(size, .5))

# Train the model
example_mark_model <- train_mark_model(
  data = model_training_data,
  raster_list = scaled_rasters,
  model_type = "xgboost",
  xy_bounds = c(0, 25, 0, 25),
  parallel = FALSE,
  include_comp_inds = TRUE,
  competition_radius = 10,
  correction = "none",
  selection_metric = "rmse",
  cv_folds = 5,
  tuning_grid_size = 20,
  verbose = TRUE
)
#> Processing data...
#> Training XGBoost model...
#> Training complete!

# Unbundle the model
example_mark_model <- bundle::unbundle(example_mark_model$bundled_model)

The train_mark_model function returns a trained mark model object that can be used to predict sizes for new locations. The model object contains the trained model, the optimal hyperparameters, the cross-validated performance metrics, and the feature importance values. In practice, we recommend taking advantage of the parallelization capabilities of the function to speed up the training process, in addition to using more cross-validation folds and a denser tuning grid to obtain a more robust model.

Step 3: Checking the Fit of the Model

Now that we have an estimated self-correcting point process and a trained mark model, we can check the fit of the model using the check_model_fit function. This function provides a variety of non-parametric summaries for point processes and global envelope tests to assess the goodness of fit of the model. The package provides individual envelope tests for the L, F, G, J, E, and V statistics, or a combined envelope test for all statistics simultaneously by making use of the functionality of the spatstat and GET packages. By plotting the combined envelope test, we can visually assess the goodness of fit of the model and obtain a p-value measuring how well the estimated model captures the relationships observed in the reference data. Typically a p-value less than 0.05 indicates a poor fit of the model to the data, and the authors of the GET package recommend a minimum of 2500 simulations to obtain a valid p-value at the .05 level.

# Generate the reference pattern
reference_data <- generate_mpp(
  locations = small_example_data[, c("x", "y")],
  marks = small_example_data$size,
  xy_bounds = c(0, 25, 0, 25)
)
#> Registered S3 method overwritten by 'spatstat.geom':
#>   method       from     
#>   print.metric yardstick

# Define an anchor point
M_n <- as.matrix(small_example_data[1, c("x", "y")])

# Specify the estimated parameters of the self-correcting process
estimated_parameters <- optimal_parameters

# Check the model fit
example_model_fit <- check_model_fit(
  reference_data = reference_data,
  t_min = 0,
  t_max = 1,
  sc_params = estimated_parameters,
  anchor_point = M_n,
  raster_list = scaled_rasters,
  mark_model = example_mark_model,
  xy_bounds = c(0, 25, 0, 25),
  include_comp_inds = TRUE,
  thinning = TRUE,
  correction = "none",
  competition_radius = 10,
  n_sim = 100,
  save_sims = FALSE,
  verbose = TRUE,
  seed = 90210
)
#> Beginning Data Simulations...
#> Data Simulations Complete...

# Plot the combined global envelope test results
plot(example_model_fit$combined_env)

The combined envelope test provides a visual summary of the goodness of fit of the model to the reference data. The p-value of the test can be used to assess how well the fitted model reflects the reference data. If the p-value is less than 0.05, the model may not be a good fit for the data.

Step 4: Simulating and Visualizing Datasets

Finally, we can simulate datasets from the fitted model using the simulate_mpp function. This function generates a realization of a marked point process given the estimated parameters of the self-correcting process and a trained mark model. The function allows for the specification of the number of points to simulate, the time range to simulate over, the spatial domain to simulate over, and the inclusion of competition indices. The function returns a list containing the marked point process object and a data frame containing the simulated locations, marks, and arrival times. The realization can be visualized using the plot_mpp function.

# Simulate a marked point process
simulated_mpp <- simulate_mpp(
  sc_params = estimated_parameters,
  t_min = 0,
  t_max = 1,
  anchor_point = M_n,
  raster_list = scaled_rasters,
  mark_model = example_mark_model,
  xy_bounds = c(0, 25, 0, 25),
  include_comp_inds = TRUE,
  competition_radius = 10,
  correction = "none",
  thinning = TRUE
)

# Plot the simulated marked point process
plot_mpp(
  mpp_data = simulated_mpp$mpp,
  pattern_type = "simulated"
)

The plot_mpp function provides a visual representation of the simulated marked point process.