---
title: "orf: ordered random forests"
author: "Gabriel Okasa and Michael Lechner"
output: rmarkdown::html_vignette
bibliography: library.bib
vignette: >
  %\VignetteIndexEntry{orf: ordered random forests}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 7, fig.align = "center")
```

## orf: Introduction

The `R` package `orf` is an implementation of the Ordered Forest estimator
as in @Okasa2019. The Ordered Forest flexibly estimates the conditional
probabilities of models with ordered categorical outcomes (so-called ordered
choice models). Additionally to common machine learning algorithms the `orf`
package provides functions for estimating marginal effects as well as
statistical inference thereof and thus provides similar output as in standard
econometric models for ordered choice. The core forest algorithm relies on the
fast `C++` forest implementation from the `ranger` package [@Wright2017].

## orf: Installation

In order to install the latest `CRAN` released version use:

```{r install, eval = FALSE}
install.packages("orf", dependencies = c("Imports", "Suggests"))
```

to make sure all the needed packages are installed as well. Note that if you install
the package directly from the source a `C++` compiler is required. For Windows 
users `Rtools` collection is required too.

## orf: Algorithm

The main function of the package is `orf`, which implements the Ordered Forest
estimator as developed in @Okasa2019. The main idea is to
provide a flexible alternative to standard econometric ordered choice
models (categorical dependent variable *with* inherent ordering) such as ordered
logit or ordered probit while still being able to recover
essentially the same output as in the standard parametric models. As such the 
Ordered Forest estimator not only provides estimates of conditional
ordered choice probabilities, i.e. $P[Y=m|X=x]$, but also estimates of marginal
effects, i.e. how the conditional probabilities vary with changes in $X$. Further,
the `orf` estimator provides also inference for the marginal effects as well
as for the conditional probabilities.

More formally, consider an categorical ordered outcome variable $Y_i \in \{1,...,M \}$.
Then, the algorithm can be described as follows:

******  
**Algorithm**: Ordered Forest

******
**input:** Data ($X,Y$)  
**output:** Class Probabilities $\hat{P}[Y=m \mid X=x]$  

 * 1 **procedure** Ordered Forest  
    + 2 **subprocedure** Cumulative Probabilities    
        - 3 **for** $m=0,...,M$:  
            - 4 **create** binary indicator variables according to: $Y_{m,i}=\mathbf{1}(Y_i \leq m)$  
            - 5 **estimate** regression random forest for: $P[Y_{m,i}=1 \mid X_i=x]$  
            - 6 **predict** conditional probabilities as: $\hat{Y}_{m,i}=\hat{P}[Y_{m,i}=1 \mid X_i=x]$
        - 7 **endfor**  
    + 8 **end subprocedure**    
    
    + 9 **subprocedure** Class Probabilities  
        - 10 **for** $m=1,...,M$:  
            - 11 **compute** probabilities for each class as: $\hat{P}_{m,i}=\hat{Y}_{m,i}-\hat{Y}_{m-1,i}$  
            - 12 **if** $\hat{P}_{m,i}<0$  
                - 13 **set** $\hat{P}_{m,i}=0$ and $\hat{P}_{m,i}=\frac{\hat{P}_{m,i}}{\sum^{M}_{m=1}\hat{P}_{m,i}}$  
            - 14 **endif**  
        - 15 **endfor**  
    + 16 **end subprocedure**    
    
 * 17 **end procedure**


******

Hence, the main idea of the Ordered Forest is to firstly, transform the ordered 
model into multiple overlapping binary models which are estimated by regression
forests and thus yield predictions for the cumulative probabilities. Secondly,
the estimated cumulative probabilities are differenced to isolate the actual class 
probabilities. As such, the prediction for the conditional probability of a 
particular ordered class $m$ is given by subtracting two adjacent cumulative
probabilities. Notice that this procedure uses the fact that the cumulative
probability over all classes must sum up to unity by definition.

## orf: Ordered Forest

The Ordered Forest provided in the `orf` function estimates the conditional
ordered choice probabilities as described by the above algorithm. Additionally,
weight-based inference for the probability predictions can be conducted as well.
If inference is desired, the Ordered Forest must be estimated with honesty and
subsampling. Honesty is defined as in @Lechner2019 and thus refers to the
honest forest, instead of the honest tree as is the case in @Wager2018.
This means that the honest split takes place before the forest estimation and not
only before the tree estimations. This might somewhat reduce the efficiency of
the estimator. However, if prediction only is desired, estimation without honesty
and with bootstrapping as in classical random forests by @Breiman2001 is 
recommended for optimal prediction performance.

In order to estimate the Ordered Forest user must supply the data in form of
matrix of covariates ($X$) and a vector of outcomes ($Y$) to the `orf` function.
These data inputs are also the only inputs that must be specified by the user
without any defaults. Further optional arguments include the classical forest
hyperparameters such as number of trees, `num.trees`, number of randomly 
selected features, `mtry`, and the minimum leaf size, `min.node.size`.
The forest building scheme is regulated by the `replace` argument, meaning
bootstrapping if `replace = TRUE` or subsampling if `replace = FALSE`. For the
case of subsampling, `sample.fraction` argument regulates the subsampling rate.
Further, honest forest is estimated if the `honesty` argument is set to `TRUE`,
which is also the default. Similarly, the fraction of the sample used for the
honest estimation is regulated by the `honesty.fraction` argument. The default
setting conducts a 50:50 sample split, which is also generally advised to follow
for optimal performance. Inference procedure of the Ordered Forest is based on 
the forest weights as suggested in @Okasa2019 and is controlled by
the `inference` argument. Note that such weight-based inference is computationally
demanding exercise due to the estimation of the forest weights and as such longer
computation time is to be expected. Lastly, the `importance` argument turns on
and off the permutation based variable importance. The variable importance for
the Ordered Forest is a simple class-weighted importance of the underlying forests.

Additionally, standard `R` functions such as `summary`, `predict`,
or `plot` are provided as well to facilitate the classical `R` user experience.
Below you will find a few examples on how to use the `orf` function to estimate
the Ordered Forest.

```{r orf intro, include = FALSE}
library(orf)
set.seed(123)
```

### orf: `odata`

First, load an example data included in the `orf` package. This data includes an 
ordered categorical outcome variable with 3 distinct ordered classes $Y\in\{1,2,3\}$
with a set of four covariates $X \in \{X1, X2, X3, X4\}$ of different types. The first
covariate and the last covariate, i.e. $X1$ and $X4$ are continuous, the second
one, $X2$, is ordered categorical and the third one, $X3$, is binary. Furthermore,
within the data generating process, covariates $X1$, $X2$ and $X3$ enter in a 
linear form with a positive effect on the outcome, while $X4$ is without any effect
and thus serves as a noise variable in the dataset. For the exact DGP, see `?orf::odata`.

```{r data}
# load example data
data(odata)

# specify response and covariates
Y <- as.numeric(odata[, 1])
X <- as.matrix(odata[, -1])
```

### orf: `orf`, `print.orf`, `summary.orf`, `plot.orf`

Now, estimate the Ordered Forest using the `orf` function with the default
settings and supplying only the required data inputs. Print the output
of the estimation procedure with the S3 method `print.orf`.

```{r orf default}
# estimate Ordered Forest with default settings
orf_model <- orf(X, Y)

# print output of the orf estimation
print(orf_model)
```

Repeat the `orf` estimation with custom settings for the hyperparameters and 
summarize the estimation output with the S3 method `summary.orf`.

```{r orf custom}
# estimate Ordered Forest with custom settings
orf_model <- orf(X, Y,
                       num.trees = 1000, mtry = 2, min.node.size = 5,
                       replace = FALSE, sample.fraction = 0.5,
                       honesty = TRUE, honesty.fraction = 0.5,
                       inference = FALSE, importance = FALSE)

# show summary of the orf estimation
summary(orf_model)
```

The summary of the estimated Ordered Forest provides the basic information about
the estimation and its inputs as well as information about the out-of-bag prediction
accuracy measured in terms of the classical mean squared error (MSE) and the 
probabilistic ranked probability score (RPS). Furthermore, the `summary.orf` command
provides a `latex` argument which generates a LaTeX coded table for immediate
extraction of the results for the research documentation. In addition, the `orf`
object contains further elements that can be accessed with the $\$$ operator.

For a graphical representation of the estimated probabilities `plot.orf` command
plots the probability distributions estimated by the Ordered Forest. The plots 
visualize the estimated probability density of each outcome class, i.e.
$\hat{P}[Y=1\mid X=x]$, $\hat{P}[Y=2\mid X=x]$, and $\hat{P}[Y=3\mid X=x]$ in 
contrast to the actual observed outcome class and as such provides a visual 
inspection of the underlying probability predictions for the outcome classes.
The dashed lines within the density plots locate the means of the respective
probability distributions.

The example below demonstrates the usage of the `plot.orf` command.

```{r orf plot}
# plot the estimated probability distributions
plot(orf_model)
```

### orf: `predict.orf`, `print.predict.orf`, `summary.predict.orf`

The command `predict.orf` predicts the conditional choice probabilities for new 
data points based on the estimated Ordered Forest object. If no new data is
supplied to `newdata` argument, the in-sample fitted values will be returned.
The user can additionally specify the type of the predictions. If probability
predictions are desired, `type = "p"` or `type = "probs"` should be specified
(this is also the default). For class predictions, define `type = "c"` or
`type = "class"`. In this case, the predicted classes are obtained as classes
with the highest predicted probability. Furthermore, for the probability
predictions the weight-based inference can be conducted as well. If inference
is desired, the supplied Ordered Forest must be estimated with honesty and
subsampling. If prediction only is desired, estimation without honesty and with
bootstrapping is recommended for optimal prediction performance.

The example below illustrates the `predict.orf` command for in-sample predictions
and the subsequent information about the predictions printed to the console.

```{r orf predict}
# get fitted values with the estimated orf
orf_fitted <- predict(orf_model)

# print orf fitted values
print(orf_fitted)
```

Now, divide the data into train and test set for a out-of-sample prediction exercise
and summarize the prediction results. Similarly to the above, also for the prediction
summary a LaTeX table can be directly generated with the `latex` argument in the
`summary.predict.orf` command.

```{r orf predict test}
# specify response and covariates for train and test
idx <- sample(seq(1, nrow(odata), 1), 0.8*nrow(odata))

# train set
Y_train <- odata[idx, 1]
X_train <- odata[idx, -1]

# test set
Y_test <- odata[-idx, 1]
X_test <- odata[-idx, -1]

# estimate Ordered Forest
orf_train <- orf(X_train, Y_train)

# predict the probabilities with the estimated orf
orf_test <- predict(orf_train, newdata = X_test, type = "probs", inference = FALSE)

# summary of the orf predictions
summary(orf_test)
```

### orf: `margins.orf`, `print.margins.orf`, `summary.margins.orf`

Besides the estimation and prediction of the conditional choice probabilities,
the Ordered Forest enables also the estimation of the marginal effects, i.e.
how these probabilities vary with changes in covariates. `margins.orf` estimates 
marginal effects at the mean, at the median, or the mean marginal effects,
depending on the `eval` argument. The evaluation window for the marginal effects
can be regulated by the user through the `window` argument, which is defined as
the share of standard deviation of the particular covariate $X$ with default set
as `window = 0.1`. Furthermore, new data for which marginal effects should be
estimated can be supplied as well using the argument `newdata` as long as the
new data lies within the support of $X$. Additionally to the estimation of the
marginal effects, the weight-based inference for the effects is supported as well,
controlled by the `inference` argument. Note again that the inference procedure
is computationally exhausting exercise due to the estimation of the forest weights.

Furthermore, the marginal effect estimation procedure depends on the type of the
particular covariate $X$. On one hand, for continuous covariates such as $X1$ and
$X4$ in this example, the marginal effects are estimated as a derivative using
two-sided numeric approximation. On the other hand, for discrete covariates such
as $X2$ and $X3$ in this example, the marginal effects are estimated as a discrete
change. In case of a binary variables such as $X3$, the marginal effect is estimated
as a difference in the conditional probabilities evaluated at $X=1$ and $X=0$,
respectively. In case of categorical variables such as $X2$, the conditional
probabilities in the difference are evaluated at the mean of $X$ rounded up and
down, respectively. For a detailed discussion of these quantities see @Okasa2019.

The example below shows the usage of the `margins.orf` command with default 
settings and prints the basic estimation information together with the estimated
effects for each covariate and each outcome class.

```{r orf margins}
# estimate marginal effects of the orf
orf_margins <- margins(orf_model)

# print the results of the marginal effects estimation
print(orf_margins)
```

Now, estimate the mean marginal effects with weight-based inference and summarize
the estimation output as well as the estimated effects together with the inference
results. Additionally, `summary.margins.orf` also supports the LaTeX summary table
with the `latex` argument.

```{r orf margins custom}
# estimate marginal effects of the orf with inference
orf_margins <- margins(orf_model, eval = "mean", window = 0.1,
                                  inference = TRUE, newdata = NULL)

# summarize the results of the marginal effects estimation
summary(orf_margins)
```

## orf: Applications

The Ordered Forest estimator is currently used by the Swiss Institute for
Empirical Economic Research (SEW-HSG) of the University of St.Gallen, Switzerland
in the Soccer Analytics project for the probability predictions of win, draw and 
loss in soccer matches in the German Bundesliga and the Swiss Super League.
More details about the soccer predictions can be found in @Okasa2018 and the most
recent predictions are listed online at [SEW Soccer Analytics (GER)](https://sew.unisg.ch/en/lehrstuehle/empirische-wirtschaftsforschung/sports-economics-research-group/soccer-analytics),
[SEW Soccer Analytics (SUI)](https://sew.unisg.ch/en/lehrstuehle/empirische-wirtschaftsforschung/sports-economics-research-group/soccer-analytics---ch) and on [Twitter](https://twitter.com/SEW_Soccer).

## orf: References
 

