In the previous chapter, we created a dataframe that holds information on the soil sampling locations and the covariates that we extracted for these positions. Let's load this dataframe into our environment.
Before we can fit the model, we have to specify a few settings. First, we have to specify our response and predictor variables. Then, we have to split our dataset into a calibration and a validation set. Because Random Forest models cannot deal with NA values, we also have to remove these from both sets.
Code
```r
# Specify response: The pH in the top 10cm
response <- "ph.0.10"

# Specify predictors: Remove soil sampling information
predictors <- 
  data_clean |> 
  dplyr::select(
    -response,                        # Remove response variable
    -site_id_unique,                  # Remove site ID
    -tidyr::starts_with("ph"),        # No pH information
    -tidyr::starts_with("waterlog"),  # No water-status information
    -dclass,                          # No water-status information
    -dataset                          # No calib./valid. information
  ) |> 
  names()

cat("The response is:", response,
    "\nThe predictors are:", paste0(predictors[1:8], sep = ", "), "...")
```
The response is: ph.0.10
The predictors are: timeset, x, y, geo500h1id, Se_n_aspect2m, mt_rr_y, Se_curvplan2m_std_5c, vdcn25, ...
Code
```r
# Split dataset into calibration and validation
data_cal <- data_clean |> dplyr::filter(dataset == "calibration")
data_val <- data_clean |> dplyr::filter(dataset == "validation")

# Filter out any NA to avoid error when running a Random Forest
data_cal <- data_cal |> tidyr::drop_na()
data_val <- data_val |> tidyr::drop_na()

# A little bit of verbose output:
n_tot <- nrow(data_cal) + nrow(data_val)

perc_cal <- (nrow(data_cal) / n_tot) |> round(2) * 100
perc_val <- (nrow(data_val) / n_tot) |> round(2) * 100

cat("For model training, we have a calibration / validation split of: ",
    perc_cal, "/", perc_val, "%")
```
For model training, we have a calibration / validation split of: 75 / 25 %
Alright, this all looks good. We have our response and predictor variables saved for easy access later on, and the 75/25 split of calibration and validation data looks reasonable too. We can now move on to model fitting!
3.3 Model training
To fit a Random Forest model that predicts the soil pH in the top 10cm, we compare different model setups. All of these setups train a Random Forest model but differ in the complexity that we intentionally add to improve the final model. If you need a recap on Random Forests, have a look at the introduction given in AGDS 1.
3.3.1 Basic model
Let’s start with the basic model, where we use the {ranger} package with the pre-defined hyperparameters.
Code
```r
# ranger() crashes when using tibbles, so we are using the
# base R notation to enter the data
rf_basic <- ranger::ranger( 
  y = data_cal[, response],   # Response variable
  x = data_cal[, predictors], # Predictor variables
  seed = 42,                  # Specify the seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1 # Use all but one CPU core for quick model training
)

# Print a summary of fitted model
rf_basic |> print()
```
Ranger result
Call:
ranger::ranger(y = data_cal[, response], x = data_cal[, predictors], seed = 42, num.threads = parallel::detectCores() - 1)
Type: Regression
Number of trees: 500
Sample size: 605
Number of independent variables: 18
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.2821088
R squared (OOB): 0.5224995
Predicting categories with Random Forests
If our response variable were categorical rather than continuous, we would have to set the argument probability = TRUE. The output would then be a map of class probabilities ranging from 0 to 100%.
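A minimal sketch of what this could look like is given below. Here, waterlog.30 is used as an illustrative binary response; the variable name and the choice of response are assumptions for the example only.

```r
# A minimal sketch, assuming we wanted to predict a categorical variable such
# as a waterlog status instead of pH ('waterlog.30' is an illustrative name)
rf_prob <- ranger::ranger(
  y = as.factor(data_cal$waterlog.30),  # Categorical responses must be factors
  x = data_cal[, predictors],
  probability = TRUE,                   # Grow a probability forest
  seed = 42,
  num.threads = parallel::detectCores() - 1
)

# Predictions are now per-class probabilities between 0 and 1
# predict(rf_prob, data = data_val)$predictions |> head()
```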
Although we only used the pre-defined parameters, we already get a fairly good out-of-bag (OOB) \(R^2\) of 0.52 and an MSE of 0.28. Let's see how we can improve our model further.
3.3.2 Model with weights
Sometimes we know that a subset of our dataset is more trustworthy than the rest. For example, when you use a gap-filling technique to interpolate data, the gap-filled values are less trustworthy than the actually measured data. Informing the model algorithm that it should weight certain data entries more than others can change the importance of variables and the final model performance.
In our dataset, we have information on whether the pH values were measured in the field - which is less precise - or in the lab. Also, we have two different lab methods. All of this information is held in the suffix of the timeset variable, so let's assign weights according to the quality of the pH data as follows: 1 for CaCl\(_2\) lab measurements (no suffix), 0.9 for field measurements (suffix _field), and 0.7 for values derived from another lab method via pedotransfer functions (suffix _ptf). For this, we create a weight-matching vector:
Code
```r
weights <- data_cal |> 
  dplyr::mutate(
    # Create a new variable 'weight' which holds only 1's
    w = 1,
    # Check the suffix in each row and if true, give a new weight. If false, keep the old weight.
    w = ifelse(stringr::str_detect(timeset, "_field"), 0.9, w),
    w = ifelse(stringr::str_detect(timeset, "_ptf"), 0.7, w)
  )

# Quality check if everything worked:
set.seed(42)

weights |> 
  dplyr::select(timeset, w) |> 
  dplyr::slice_sample(n = 8) |>  # Pick 8 random rows
  knitr::kable()
```
| timeset          | w   |
|------------------|-----|
| d1979_2010       | 1.0 |
| d1968_1974_field | 0.9 |
| d1968_1974_field | 0.9 |
| d1968_1974_field | 0.9 |
| d1968_1974_field | 0.9 |
| d1968_1974_ptf   | 0.7 |
| d1968_1974_field | 0.9 |
| d1968_1974_field | 0.9 |
It is always a good idea to do quality checks when wrangling! Here we see that our weight attribution code worked as expected, so we can move on to model fitting.
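If we want to be extra thorough, we can also count every combination of timeset and weight; this is an optional additional check, not part of the workflow above.

```r
# Count all combinations of timeset and weight to verify that
# every suffix received the intended weight
weights |> 
  dplyr::count(timeset, w) |> 
  knitr::kable()
```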
Code
```r
rf_weighted <- ranger::ranger( 
  y = data_cal[, response],      # Response variable
  x = data_cal[, predictors],    # Predictor variables
  case.weights = weights[, "w"], # Add weights to input
  seed = 42,                     # Specify seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1 # Use all but one CPU core for quick model training
)

# Print a summary of fitted model
rf_weighted |> print()
```
Ranger result
Call:
ranger::ranger(y = data_cal[, response], x = data_cal[, predictors], case.weights = weights[, "w"], seed = 42, num.threads = parallel::detectCores() - 1)
Type: Regression
Number of trees: 500
Sample size: 605
Number of independent variables: 18
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.2809047
R squared (OOB): 0.5245376
Not much has changed compared to our previous model: the \(R^2\) and the MSE changed only negligibly (in fact, both improved slightly). In addition, we can put more trust in our model because it gives less weight to the less reliable measurements.
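To see this at a glance, we can pull the out-of-bag statistics directly from the two fitted ranger objects (an optional side-by-side comparison):

```r
# Compare the out-of-bag statistics of the basic and the weighted model
tibble::tibble(
  model   = c("basic", "weighted"),
  oob_mse = c(rf_basic$prediction.error, rf_weighted$prediction.error),
  oob_r2  = c(rf_basic$r.squared,        rf_weighted$r.squared)
) |> 
  knitr::kable()
```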
3.4 Model interpretation
3.4.1 Variable importance based on OOB-Error
Our model uses 18 predictor variables, but we have no idea whether each of them should really be in the model or whether we are just fitting noise in the dataset. To investigate this issue, the ranger() function takes an argument called importance. We can set this argument to the "permutation" method, whereby the algorithm randomly permutes the values of each variable and measures the resulting decrease in model accuracy. A larger decrease indicates a more important variable. If the code runs slowly, you can also use the faster "impurity" method (see more information here).
Assessing the variable importance gives us a feeling for which variables we should keep or drop from the dataset. The ranger model stores this information if we specify an importance method. The code below accesses the model's variable importance and sorts the variables by decreasing importance.
Code
```r
# Let's run the weighted model again but with recording the variable importance
rf_weighted <- ranger::ranger( 
  y = data_cal[, response],      # Response variable
  x = data_cal[, predictors],    # Predictor variables
  case.weights = weights[, "w"], # Add weights to input
  importance = "permutation",    # Pick permutation to calculate variable importance
  seed = 42,                     # Specify seed for randomization to reproduce the same model again
  num.threads = parallel::detectCores() - 1 # Use all but one CPU core for quick model training
)

# Extract the variable importance and create a long tibble
vi_rf_weighted <- rf_weighted$variable.importance |> 
  dplyr::bind_rows() |> 
  tidyr::pivot_longer(cols = dplyr::everything(), names_to = "variable")

# Plot variable importance, ordered by decreasing value
p <- vi_rf_weighted |> 
  ggplot2::ggplot(ggplot2::aes(x = reorder(variable, value), y = value)) +
  ggplot2::geom_bar(stat = "identity", fill = "grey50", width = 0.75) +
  ggplot2::labs(
    y = "Change in OOB MSE after permutation", 
    x = "",
    title = "Change in OOB MSE after permutation") +
  ggplot2::theme_classic() +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5, hjust = 1))

# Display plot
p
```
What do we see here? The y-axis shows the decrease in model performance when the respective variable is randomly permuted and, therefore, denotes the importance of a variable. The higher the value, the stronger the effect of permutation on the model performance, and the more important the variable. In our randomly picked subset of covariates, we see that the time of sampling, the location (x and y coordinates), and the mean annual precipitation are very strong predictors of soil pH. From a soil-forming perspective, this seems plausible. For example, soil weathering is strongly driven by water: the more water a location receives, the more weathered the soil is, and the more likely it is to have a low pH. To dig deeper into understanding how the model works, we could further investigate its partial dependence plots.
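Partial dependence plots are beyond the scope of this tutorial, but a minimal sketch is given below. It assumes the {pdp} package is installed, and the choice of mt_rr_y as the displayed covariate is purely illustrative.

```r
# A minimal sketch of a partial dependence plot for a single covariate,
# using the {pdp} package (assumed to be installed)
pdp::partial(
  object   = rf_weighted,
  pred.var = "mt_rr_y",               # Mean annual precipitation (illustrative choice)
  train    = data_cal[, predictors],
  # ranger stores its predictions in the 'predictions' element,
  # so we average them in a small custom prediction function
  pred.fun = function(object, newdata) {
    mean(predict(object, data = newdata)$predictions)
  },
  plot = TRUE
)
```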
Coming back to the variable importance plot, we can also see that a large part of our covariates have practically no power to predict pH and could therefore be removed from the model. But how do we decide which variables to keep for our final model? Do we simply cap the number of predictors? Do we keep the top n predictors that together account for a certain percentage of the total variable importance?
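The latter idea could be sketched as follows; the 80% cutoff is an arbitrary choice for illustration, not a recommendation.

```r
# Keep the smallest set of predictors that together account for 80% of the
# total permutation importance (the cutoff is chosen arbitrarily)
vi_selection <- vi_rf_weighted |> 
  dplyr::arrange(dplyr::desc(value)) |> 
  dplyr::mutate(cum_share = cumsum(value) / sum(value)) |> 
  dplyr::filter(dplyr::lag(cum_share, default = 0) < 0.80)

vi_selection$variable
```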
3.4.2 Boruta Algorithm
One common option to generalize variable selection is the Boruta algorithm, which itself is based on Random Forests. In essence, the algorithm creates "shadows" of your original data, in which the values of each predictor are randomly permuted, which destroys their predictive power. Then, the algorithm iterates over these "shadows" and assesses for each variable whether its permutation has a substantial effect on the model performance or not. For example, if a model trained on a variable's shadow consistently performs worse than one trained on the original values, that variable is assessed as important. The algorithm categorizes all variables into "to reject", "to consider", and "to keep". Luckily, we do not have to write this algorithm ourselves but can use the {Boruta} package:
Code
```r
set.seed(42)

bor <- Boruta::Boruta(
  y = data_cal[, response], 
  x = data_cal[, predictors],
  maxRuns = 50, # Number of iterations. Set to 30 or lower if it takes too long
  num.threads = parallel::detectCores() - 1)

# Plot variable importance, the Boruta-output can be directly fed into base R plot()
par(oma = c(8, 3, 2, 2)) # enlarge plot below for long variable labels
plot(bor, 
     xlab = "", 
     ylab = "Importance",
     las = 2)
```
Code
```r
# Check whether the most important variables from Boruta-Algorithm are similar as the
# important variables from the weighted Random Forest model
bor_top10 <- Boruta::attStats(bor) |> 
  tibble::rownames_to_column() |> 
  dplyr::arrange(dplyr::desc(meanImp)) |> 
  dplyr::slice_head(n = 10) |> 
  dplyr::pull(rowname)

vi_top10 <- vi_rf_weighted |> 
  dplyr::slice_head(n = 10) |> 
  dplyr::pull(variable)

cbind(vi_top10, bor_top10) |> 
  knitr::kable(col.names = c("RF Top 10", "Boruta Top 10"))
```
| RF Top 10            | Boruta Top 10        |
|----------------------|----------------------|
| timeset              | timeset              |
| x                    | mt_rr_y              |
| y                    | x                    |
| geo500h1id           | y                    |
| Se_n_aspect2m        | Se_curvplan2m_std_5c |
| mt_rr_y              | vdcn25               |
| Se_curvplan2m_std_5c | lsf                  |
| vdcn25               | Se_MRVBF2m           |
| lsf                  | geo500h1id           |
| Se_conv2m            | Se_diss2m_50c        |
We see that, apart from the timeset variable, the variable importance calculated by the Boruta algorithm differs quite a bit from the simple variable importance assessment built into the ranger() function. To move forward, we could keep only the variables that were classified as "to keep" or "to consider" by the Boruta algorithm.
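A minimal sketch of such a selection, using Boruta::getSelectedAttributes() and keeping both confirmed ("to keep") and tentative ("to consider") attributes, could look like this (whether to keep the tentative ones is a judgment call):

```r
# Keep variables that Boruta confirmed ("to keep") or left tentative ("to consider")
predictors_selected <- Boruta::getSelectedAttributes(bor, withTentative = TRUE)

# Reduce the calibration and validation data to the selected predictors
data_cal_selected <- data_cal[, c(response, predictors_selected)]
data_val_selected <- data_val[, c(response, predictors_selected)]
```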
For this tutorial, we are happy to continue with our most basic Random Forest and save the necessary data at the end of this section. However, there are many options to create a better model: using weights or the Boruta variable selection as shown above, or applying hyperparameter tuning and cross-validation for a more robust model.
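Hyperparameter tuning and cross-validation are not covered here, but a minimal sketch using the {caret} package as a wrapper around ranger could look like the code below. The package choice and the tuning grid are assumptions for illustration only.

```r
# A minimal sketch of 5-fold cross-validation with a small tuning grid,
# using {caret} as a wrapper around ranger (package and grid chosen for illustration)
set.seed(42)

rf_tuned <- caret::train(
  x = data_cal[, predictors],
  y = data_cal[, response],
  method    = "ranger",
  trControl = caret::trainControl(method = "cv", number = 5),
  tuneGrid  = expand.grid(
    mtry          = c(2, 4, 8),   # Number of variables tried at each split
    min.node.size = c(5, 10),     # Minimum terminal node size
    splitrule     = "variance"    # Split rule for regression
  ),
  num.trees = 500
)

# The best hyperparameter combination found by cross-validation
rf_tuned$bestTune
```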
Code
```r
# Save relevant data for model testing in the next chapter.

# Pick rf to save:
rf <- rf_basic

# Extract predictors used in this rf
predictors <- rf$forest$independent.variable.names

saveRDS(rf_basic,
        here::here("data/rf_for_ph0-10.rds"))

saveRDS(data_cal[, c(response, predictors)],
        here::here("data/cal_for_ph0-10.rds"))

saveRDS(data_val[, c(response, predictors)],
        here::here("data/val_for_ph0-10.rds"))
```