3  Train a Random Forest

3.1 Load data

In the previous chapter, we created a dataframe that holds information on the soil sampling locations and the covariates that we extracted for these positions. Let’s load this dataframe into our environment:

Code
data_clean <- readRDS(here::here("data/bern_sampling_locations_with_covariates.rds"))

head(data_clean) |> knitr::kable()
site_id_unique timeset x y dataset dclass waterlog.30 waterlog.50 waterlog.100 ph.0.10 ph.10.30 ph.30.50 ph.50.100 geo500h1id Se_n_aspect2m mt_rr_y Se_curvplan2m_std_5c vdcn25 lsf Se_conv2m Se_slope2m Se_MRVBF2m Se_curvprof2m_fmean_50c Se_diss2m_50c Se_vrm2m Se_curvplan2m Se_SAR2m Se_curv6m
4_26-In-005 d1968_1974_ptf 2571994 1203001 validation poor 0 0 1 6.071733 6.227780 7.109235 7.214589 6 -0.2402939 9931.120 0.6229440 65.62196 0.0770846 -40.5395088 1.1250136 6.950892 -0.0382753 0.3934371 0.0002450 -1.0857303 4.000910 -0.5886537
4_26-In-006 d1974_1978 2572149 1202965 calibration poor 0 1 1 6.900000 6.947128 7.203502 7.700000 6 0.4917848 9931.672 2.2502327 69.16074 0.0860347 19.0945148 1.3587183 6.984581 -0.0522900 0.4014700 0.0005389 -0.3522736 4.001326 0.1278165
4_26-In-012 d1974_1978 2572937 1203693 calibration moderate 0 1 1 6.200000 6.147128 5.603502 5.904355 6 -0.9633239 9935.438 0.2292406 63.57096 0.0737963 -9.1396294 0.7160403 6.990917 -0.0089129 0.6717541 0.0000124 -0.2168447 4.000320 -0.0183221
4_26-In-014 d1974_1978 2573374 1203710 validation well 0 0 0 6.600000 6.754607 7.200000 7.151129 6 -0.4677161 9939.923 0.1029889 64.60535 0.0859686 -0.9318936 0.8482135 6.964162 -0.0331309 0.4988544 0.0000857 -0.0272214 4.000438 -0.0706228
4_26-In-015 d1968_1974_ptf 2573553 1203935 validation moderate 0 0 1 6.272715 6.272715 6.718392 7.269008 6 0.5919228 9942.032 0.9816071 61.16533 0.0650000 4.2692256 1.2301254 6.945287 -0.0202268 0.6999696 0.0002062 0.2968794 4.000948 0.0476020
4_26-In-016 d1968_1974_ptf 2573310 1204328 calibration poor 0 0 1 6.272715 6.160700 5.559031 5.161655 6 0.5820994 9940.597 0.3455668 55.78354 0.0731646 -0.1732794 1.0906221 6.990967 -0.0014042 0.3157751 0.0001151 0.0100844 4.000725 0.0400775

3.2 Preparations

Before we can fit the model, we have to make a few preparations. First, we specify our response and predictor variables. Then, we split our dataset into a calibration and a validation set. Because Random Forest models cannot deal with NA values, we also have to remove these from the calibration and validation sets.

Code
# Specify response: The pH in the top 10cm
response <- "ph.0.10"

# Specify predictors: Remove soil sampling information
predictors <- 
  data_clean |> 
  dplyr::select(-response,                             # Remove response variable
                -site_id_unique,                       # Remove site ID
                -tidyr::starts_with("ph"),             # No pH information
                -tidyr::starts_with("waterlog"),       # No water-status information
                -dclass,                               # No water-status information
                -dataset) |>                           # No calib./valid information
  names()

cat("The response is:", response,
    "\nThe predictors are:", paste0(predictors[1:8], sep = ", "), "...")
The response is: ph.0.10 
The predictors are: timeset,  x,  y,  geo500h1id,  Se_n_aspect2m,  mt_rr_y,  Se_curvplan2m_std_5c,  vdcn25,  ...
Code
# Split dataset into calibration and validation
data_cal <- data_clean |> dplyr::filter(dataset == "calibration")
data_val <- data_clean |> dplyr::filter(dataset == "validation")

# Filter out any NA to avoid error when running a Random Forest
data_cal <- data_cal |> tidyr::drop_na()
data_val <- data_val |> tidyr::drop_na()

# A little bit of verbose output:
n_tot <- nrow(data_cal) + nrow(data_val)

perc_cal <- (nrow(data_cal) / n_tot) |> round(2) * 100
perc_val <- (nrow(data_val) / n_tot) |> round(2) * 100

cat("For model training, we have a calibration / validation split of: ",
    perc_cal, "/", perc_val, "%")
For model training, we have a calibration / validation split of:  75 / 25 %

Alright, this all looks good. We have our response and predictor variables saved for easy access later on, and the 75/25 split between calibration and validation data is reasonable. We can now move on to model fitting!

3.3 Model training

To fit a Random Forest model that predicts the soil pH in the top 10 cm, we look at different model setups. All of them train a Random Forest, but they differ in the complexity that we intentionally add to improve the final model. If you need a recap on Random Forests, have a look at the introduction given in AGDS 1.

3.3.1 Basic model

Let’s start with the basic model, fitted with the {ranger} package using its default hyperparameters.

Code
# ranger() crashes when using tibbles, so we are using the
# base R notation to enter the data

rf_basic <- ranger::ranger( 
  y = data_cal[, response],   # Response variable
  x = data_cal[, predictors], # Predictor variables
  seed = 42,                  # Specify the seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1) # Use all but one CPU core for quick model training

# Print a summary of fitted model
rf_basic |> print()
Ranger result

Call:
 ranger::ranger(y = data_cal[, response], x = data_cal[, predictors],      seed = 42, num.threads = parallel::detectCores() - 1) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      605 
Number of independent variables:  18 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       0.2821088 
R squared (OOB):                  0.5224995 
Predicting categories with Random Forests

If our response variable were categorical and not continuous, we would have to set the argument probability = TRUE. The output would then be a map of class probabilities from 0 to 100%.
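For illustration, here is a minimal sketch of such a probability forest, hypothetically re-using our calibration data with the binary waterlog.30 variable as the response (this model is not used further in this tutorial):

Code
# Sketch only: a probability forest for a categorical response
rf_prob <- ranger::ranger(
  y = as.factor(data_cal$waterlog.30), # Categorical responses must be factors
  x = data_cal[, predictors],          # Same predictors as before
  probability = TRUE,                  # Return class probabilities instead of classes
  seed = 42,
  num.threads = parallel::detectCores() - 1)

# Each row holds the predicted probability of each class (values between 0 and 1)
predict(rf_prob, data = data_val)$predictions |> head()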

Although we only used the default hyperparameters, we already get a fairly good out-of-bag (OOB) \(R^2\) of 0.52 and an MSE of 0.28. Let’s see how we can improve our model further.
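As a side note, these OOB metrics are also stored in the fitted ranger object itself, which is handy for comparing models programmatically instead of reading them off the printed summary:

Code
# OOB mean squared error and R-squared of the basic model
cat("OOB MSE:      ", round(rf_basic$prediction.error, 3),
    "\nOOB R squared:", round(rf_basic$r.squared, 3))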

3.3.2 Model with weights

Sometimes we know that a subset of our dataset is more trustworthy than the rest. For example, when you use a gap-filling technique to interpolate data, the gap-filled data is less trustworthy than the actually measured data. Informing the model algorithm that it should weigh certain data entries more than others can change the importance of variables and the final model performance.

In our dataset, we have information on whether the pH values were measured in the field - which is less precise - or in the lab. Also, we have two different lab methods. All of this information is held in the suffix of the timeset variable, so let’s assign weights according to the quality of the pH data as follows: 1 for CaCl\(_2\) lab measurements (no suffix), 0.9 for field data (suffix _field), and 0.7 for values derived via pedotransfer from another lab method (suffix _ptf). For this, we create a weight-matching vector:

Code
weights <- 
  data_cal |>
  dplyr::mutate(
    # Create a new variable 'w' which holds only 1's (the default weight)
    w = 1,
    # Check the timeset suffix in each row; if it matches, assign the new weight, otherwise keep the old one
    w = ifelse(stringr::str_detect(timeset, "_field"), 0.9, w),
    w = ifelse(stringr::str_detect(timeset, "_ptf"), 0.7, w)
  )

# Quality check if everything worked:
set.seed(42)

weights |> 
  dplyr::select(timeset, w) |> 
  dplyr::slice_sample(n = 8) |>   # Pick 8 random rows
  knitr::kable()
timeset w
d1979_2010 1.0
d1968_1974_field 0.9
d1968_1974_field 0.9
d1968_1974_field 0.9
d1968_1974_field 0.9
d1968_1974_ptf 0.7
d1968_1974_field 0.9
d1968_1974_field 0.9

It is always a good idea to do quality checks when wrangling! Here we see that our weight attribution code worked as expected, so we can move on to model fitting.

Code
rf_weighted <- ranger::ranger( 
  y = data_cal[, response],      # Response variable
  x = data_cal[, predictors],    # Predictor variables
  case.weights = weights[, "w"], # Add weights to input
  seed = 42,                     # Specify seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1) # Use all but one CPU core for quick model training

# Print a summary of fitted model
rf_weighted |> print()
Ranger result

Call:
 ranger::ranger(y = data_cal[, response], x = data_cal[, predictors],      case.weights = weights[, "w"], seed = 42, num.threads = parallel::detectCores() -          1) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      605 
Number of independent variables:  18 
Mtry:                             4 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       0.2809047 
R squared (OOB):                  0.5245376 

Not much has changed compared to our previous model. The \(R^2\) and MSE improved only negligibly, but on top of that we arguably gained more trust in our model because the more reliable measurements now carry more weight.

3.4 Model interpretation

3.4.1 Variable importance based on OOB-Error

Our model uses 18 predictor variables, but we have no idea whether each of them should really be in the model or whether we are just fitting noise in the dataset. To investigate this issue, the ranger() function takes an argument called importance. We can set this argument to the permutation method, whereby the algorithm randomly permutes the values of each variable and measures the resulting decrease in model accuracy. A larger decrease indicates a more important variable. If the code runs slowly, you can also use the faster impurity method (see more information here).

Assessing the variable importance gives us a feeling for which variables we should keep in or drop from the dataset. The ranger model stores this information if we specify an importance method. The code below accesses the model’s variable importance and sorts the variables by decreasing importance.
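Before refitting the weighted model with permutation-based importance below, here is a minimal sketch of what the faster impurity-based alternative would look like (using the same objects as above; this model is not used further in this tutorial):

Code
# Sketch only: impurity-based variable importance
rf_impurity <- ranger::ranger(
  y = data_cal[, response],
  x = data_cal[, predictors],
  importance = "impurity", # Node-impurity importance, cheaper to compute than permutation
  seed = 42,
  num.threads = parallel::detectCores() - 1)

# Named vector of importance scores, highest first
rf_impurity$variable.importance |> sort(decreasing = TRUE) |> head()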

Code
# Let's run the weighted model again but with recording the variable importance
rf_weighted <- ranger::ranger( 
  y = data_cal[, response],     # Response variable
  x = data_cal[, predictors],   # Predictor variables
  case.weights = weights[, "w"],# Add weights to input
  importance   = "permutation", # Pick permutation to calculate variable importance
  seed = 42,                    # Specify seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1) # Use all but one CPU core for quick model training

# Extract the variable importance and create a long tibble
vi_rf_weighted <- 
  rf_weighted$variable.importance |>
  dplyr::bind_rows() |> 
  tidyr::pivot_longer(cols = dplyr::everything(), names_to = "variable")

# Plot variable importance, ordered by decreasing value
p <- 
  vi_rf_weighted |> 
  ggplot2::ggplot(ggplot2::aes(x = reorder(variable, value), y = value)) +
  ggplot2::geom_bar(stat = "identity", fill = "grey50", width = 0.75) + 
  ggplot2::labs(
    y = "Change in OOB MSE after permutation", 
    x = "",
    title = "Change in OOB MSE after permutation") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5, hjust=1))

# Display plot
p

What do we see here? The y-axis shows the decrease in model performance when the respective variable is randomly permuted and, therefore, denotes the importance of a variable. The higher the value, the stronger the effect of permutation on the model performance, and the more important the variable. In our randomly picked subset of covariates, we see that the time of sampling, the location (x and y coordinates), and the mean annual precipitation are very strong predictors of soil pH. From a soil-forming perspective, this seems plausible. For example, soil weathering is strongly driven by water: the more water a location receives, the more weathered its soil is, and the more likely it is to have a low pH. To dig deeper into understanding how the model works, we could further investigate its partial dependence plots.
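For example, a minimal hand-rolled sketch of a partial dependence curve for the precipitation covariate mt_rr_y could look like the code below, which varies mt_rr_y over a grid while keeping all other covariates at their observed values and averages the model predictions. Dedicated packages such as {pdp} offer more complete implementations.

Code
# Sketch only: manual partial dependence of predicted pH on mt_rr_y
grid_mt_rr_y <- seq(min(data_cal$mt_rr_y), max(data_cal$mt_rr_y), length.out = 20)

pdp_df <- purrr::map_dfr(
  grid_mt_rr_y,
  function(val){
    newdata <- data_cal[, predictors]
    newdata$mt_rr_y <- val # Overwrite the variable of interest with the grid value
    tibble::tibble(
      mt_rr_y = val,
      ph_pred = mean(predict(rf_weighted, data = newdata)$predictions)
    )
  })

ggplot2::ggplot(pdp_df, ggplot2::aes(mt_rr_y, ph_pred)) +
  ggplot2::geom_line() +
  ggplot2::labs(x = "Mean annual precipitation (mt_rr_y)",
                y = "Mean predicted pH (0-10 cm)") +
  ggplot2::theme_classic()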

We can also see that a large part of our covariates have practically no power to predict the pH and could therefore be removed from the model. But how do we determine which variables to pick for our final model? Do we want to set a maximum number of predictors? Do we want the top n predictors that together make up a certain percentage of the total variable importance?
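For illustration, a minimal sketch of the latter option: keep the top predictors that together account for, say, 90% of the total permutation importance (the threshold being an arbitrary choice here):

Code
# Sketch only: select predictors by their cumulative share of the total importance
predictors_top <- 
  vi_rf_weighted |> 
  dplyr::arrange(dplyr::desc(value)) |> 
  dplyr::mutate(cum_share = cumsum(value) / sum(value)) |> 
  # Keep all variables up to and including the one that crosses the 90% mark
  dplyr::filter(dplyr::lag(cum_share, default = 0) < 0.9) |> 
  dplyr::pull(variable)

predictors_top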

3.4.2 Boruta Algorithm

One common option to generalize variable selection is the “Boruta-Algorithm”, which is itself based on Random Forests. In essence, the algorithm creates “shadows” of your original predictors, in which each predictor’s values are randomly permuted, which destroys its predictive power. The algorithm then iterates over these shadows and assesses for each variable whether it is consistently more important than its permuted shadow; if so, that variable is assessed as important. The algorithm categorizes all variables into “to reject”, “to consider”, and “to keep” (reported as “Rejected”, “Tentative”, and “Confirmed”). Luckily, we do not have to write this algorithm ourselves but can use the {Boruta} package:

Code
set.seed(42)
bor <- 
  Boruta::Boruta(
    y = data_cal[, response], 
    x = data_cal[, predictors],
    maxRuns = 50, # Number of iterations. Set to 30 or lower if it takes too long
    num.threads = parallel::detectCores()-1)

# Plot variable importance, the Boruta-output can be directly fed into base R plot()
par(oma = c(8,3,2,2)) # enlarge plot below for long variable labels
plot(bor, 
     xlab = "", 
     ylab = "Importance",
     las = 2,
     )

Code
# Check whether the most important variables from the Boruta-Algorithm are similar to the
# most important variables from the weighted Random Forest model
bor_top10 <- 
  Boruta::attStats(bor) |> 
  tibble::rownames_to_column() |> 
  dplyr::arrange(dplyr::desc(meanImp)) |> 
  dplyr::slice_head(n = 10) |> 
  dplyr::pull(rowname)

vi_top10 <- 
  vi_rf_weighted |> 
  dplyr::arrange(dplyr::desc(value)) |> # Sort by importance before taking the top 10
  dplyr::slice_head(n = 10) |> 
  dplyr::pull(variable)

cbind(vi_top10, bor_top10) |> 
  knitr::kable(col.names = c("RF Top 10", "Boruta Top 10"))
RF Top 10 Boruta Top 10
timeset timeset
x mt_rr_y
y x
geo500h1id y
Se_n_aspect2m Se_curvplan2m_std_5c
mt_rr_y vdcn25
Se_curvplan2m_std_5c lsf
vdcn25 Se_MRVBF2m
lsf geo500h1id
Se_conv2m Se_diss2m_50c

We see that, apart from the timeset variable, the variable ranking calculated by the Boruta-Algorithm differs quite a bit from the simple variable importance assessment built into the ranger() function. To move forward, we could keep only the variables that were classified as “to keep” (“Confirmed”) or “to consider” (“Tentative”) by the Boruta-Algorithm:

Code
predictors_bor <- 
  bor$finalDecision |> 
  as.data.frame(nm = "assessment") |> 
  tibble::rownames_to_column(var = "variable") |> 
  dplyr::filter(assessment != "Rejected") |> 
  dplyr::pull(variable)
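
For completeness, here is a minimal sketch of how such a reduced model could be refitted and compared against the full weighted model (it is not used further in this tutorial):

Code
# Sketch only: weighted Random Forest using only the Boruta-retained predictors
rf_bor <- ranger::ranger(
  y = data_cal[, response],
  x = data_cal[, predictors_bor], # Only variables not rejected by Boruta
  case.weights = weights[, "w"],
  seed = 42,
  num.threads = parallel::detectCores() - 1)

# Compare OOB R-squared against the full-predictor weighted model
c(full = rf_weighted$r.squared, boruta = rf_bor$r.squared)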

For this tutorial, we are happy to continue with our most basic Random Forest and save all necessary data as shown below. However, as demonstrated with the weighted model and the Boruta-based variable selection - and as could be taken further with hyperparameter tuning and cross-validation for a more robust model - there are many options to create a better model.
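To give one pointer, here is a minimal sketch of a simple grid search over two ranger hyperparameters, scored by the OOB error (the grid values are arbitrary choices for illustration; a more robust approach would combine this with cross-validation):

Code
# Sketch only: tune mtry and min.node.size via the OOB error
tuning_grid <- expand.grid(mtry = c(2, 4, 8), min.node.size = c(3, 5, 10))

tuning_grid$oob_mse <- apply(tuning_grid, 1, function(par){
  ranger::ranger(
    y = data_cal[, response],
    x = data_cal[, predictors],
    mtry = par[["mtry"]],
    min.node.size = par[["min.node.size"]],
    seed = 42,
    num.threads = parallel::detectCores() - 1)$prediction.error
})

# Best combination in terms of OOB MSE
tuning_grid[which.min(tuning_grid$oob_mse), ]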

Code
# Save relevant data for model testing in the next chapter.
# Pick rf to save:
rf <- rf_basic

# Extract predictors used in this rf
predictors <- rf$forest$independent.variable.names

saveRDS(rf,
        here::here("data/rf_for_ph0-10.rds"))

saveRDS(data_cal[, c(response, predictors)],
        here::here("data/cal_for_ph0-10.rds"))

saveRDS(data_val[, c(response, predictors)],
        here::here("data/val_for_ph0-10.rds"))