A4-MLpipeline.qmd

---
output: html_document
editor_options:
  chunk_output_type: console
---

# Machine learning pipeline {#workflow}

The Standard Machine Learning Pipeline at the example of the Titanic Data set

Before we specialize on any tuning, it is important to understand that machine learning always consists of a pipeline of actions.

The typical machine learning workflow consist of:

-   Data cleaning and exploration (EDA = explorative data analysis) for example with tidyverse.
-   Preprocessing and feature selection.
-   Splitting data set into training and test set for evaluation.
-   Model fitting.
-   Model evaluation.
-   New predictions.

Here is an (optional) video that explains the entire pipeline from a slightly different perspective:

```{r chunk_chapter4_39, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315" 
  src="https://www.youtube.com/embed/nKW8Ndu7Mjw"
  frameborder="0" allow="accelerometer; autoplay; encrypted-media;
  gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

In the following example, we use tidyverse, a collection of R packages for data science / data manipulation mainly developed by Hadley Wickham. A video that explains the basics can be found here :

```{r chunk_chapter4_40, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315" 
  src="https://www.youtube.com/embed/nRtp7wSEtJA"
  frameborder="0" allow="accelerometer; autoplay; encrypted-media;
  gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

Another good reference is "**R for data science**" by Hadley Wickham: <a href="https://r4ds.had.co.nz/" target="_blank" rel="noopener"></a>.

For this lecture you need the Titanic data set provided by us. You can find it in GRIPS (datasets.RData in the data set and submission section) or at <a href="http://rhsbio6.uni-regensburg.de:8500" target="_blank" rel="noopener">http://rhsbio6.uni-regensburg.de:8500</a>.

We have split the data set already into training and test/prediction data sets (the test/prediction split has one column less than the train split, as the result is not known a priori).

## Data preparation

Load necessary libraries:

```{r chunk_chapter4_41, message=FALSE}
library(tidyverse)
```

Load data set:

```{r chunk_chapter4_42}
library(EcoData)
data(titanic_ml)
data = titanic_ml
```

Standard summaries:

```{r chunk_chapter4_43}
str(data)
summary(data)
```

The name variable consists of 1309 unique factors (there are 1309 observations...) and could be now transformed. If you are interested in how to do that, take a look at the following box.

::: {.callout-tip collapse="true"}
## Feature engineering of the name variable

```{r chunk_chapter4_44}
length(unique(data$name))
```

However, there is a title in each name. Let's extract the titles:

1.  We will extract all names and split each name after each comma ",".
2.  We will split the second split of the name after a point "." and extract the titles.

```{r chunk_chapter4_45}
first_split = sapply(data$name,
                     function(x) stringr::str_split(x, pattern = ",")[[1]][2])
titles = sapply(first_split,
                function(x) strsplit(x, ".",fixed = TRUE)[[1]][1])
```

We get 18 unique titles:

```{r chunk_chapter4_46}
table(titles)
```

A few titles have a very low occurrence rate:

```{r chunk_chapter4_47}
titles = stringr::str_trim((titles))
titles %>%
 fct_count()
```

We will combine titles with low occurrences into one title, which we can easily do with the forcats package.

```{r chunk_chapter4_48}
titles2 =
  forcats::fct_collapse(titles,
                        officer = c("Capt", "Col", "Major", "Dr", "Rev"),
                        royal = c("Jonkheer", "Don", "Sir",
                                  "the Countess", "Dona", "Lady"),
                        miss = c("Miss", "Mlle"),
                        mrs = c("Mrs", "Mme", "Ms")
                        )
```

We can count titles again to see the new number of titles:

```{r chunk_chapter4_49}
titles2 %>%  
   fct_count()
```

Add new title variable to data set:

```{r chunk_chapter4_50}
data =
  data %>%
    mutate(title = titles2)
```
:::

### Imputation

NAs are a common problem in ML. For example, the age variable has 20% NAs:

```{r chunk_chapter4_51}
summary(data$age)
sum(is.na(data$age)) / nrow(data)
```

Either we remove all observations with NAs, or we impute (fill) the missing values, e.g. with the median age. However, age itself might depend on other variables such as sex, class and title. We want to fill the NAs with the median age of these groups. In tidyverse we can easily "group" the data, i.e. we will nest the observations (here: group_by after sex, pclass and title). After grouping, all operations (such as our median(age....)) will be done within the specified groups.

```{r chunk_chapter4_52}
data =
  data %>%
    select(survived, sex, age, fare, pclass) %>% 
    group_by(sex, pclass) %>%
    mutate(age2 = ifelse(is.na(age), median(age, na.rm = TRUE), age)) %>%
    mutate(fare2 = ifelse(is.na(fare), median(fare, na.rm = TRUE), fare)) %>%
    ungroup()
```

### Preprocessing and Feature Selection

Later (tomorrow), we want to use Keras in our example, but it cannot handle factors and requires the data to be scaled.

Normally, one would do this for all predictors, but as we only show the pipeline here, we have sub-selected a bunch of predictors and do this only for them. We first scale the numeric predictors and change the factors with only two groups/levels into integers (this can be handled by Keras).

```{r chunk_chapter4_53}
data_sub =
  data %>%
    select(survived, sex, age2, fare2, pclass) %>%
    mutate(age2 = scales::rescale(age2, c(0, 1)),
           fare2 = scales::rescale(fare2, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
```

::: {.callout-tip collapse="true"}
## Transforming factors with more than two levels

Factors with more than two levels should be **one hot encoded** (Make columns for every different factor level and write 1 in the respective column for every taken feature value and 0 else. For example: $\{red, green, green, blue, red\} \rightarrow \{(0,0,1), (0,1,0), (0,1,0), (1,0,0), (0,0,1)\}$):

```{r chunk_chapter4_54, eval = FALSE}
one_title = model.matrix(~0+as.factor(title), data = data)
colnames(one_title) = levels(data$title)

one_sex = model.matrix(~0+as.factor(sex), data = data)
colnames(one_sex) = levels(data$sex)

one_pclass = model.matrix(~0+as.factor(pclass), data = data)
colnames(one_pclass) = paste0("pclass", 1:length(unique(data$pclass)))
```

And we have to add the dummy encoded variables to the data set:

```{r chunk_chapter4_55, eval = FALSE}
data = cbind(data.frame(survived= data$survived),
                 one_title, one_sex, age = data$age2,
                 fare = data$fare2, one_pclass)
head(data)
```
:::

## Modelling

### Split data for final predictions

To tune our hyperparameters and evaluate our models, we need to split the data as we learned in the CV section. Before doing so, however, we must split off the new observations in the data set :

```{r}
summary(data_sub$survived)
```

655 observations have NAs in our response variable, these are the observations for which we want to make predictions at the end of our pipeline (we have no information about their actual values!).

```{r chunk_chapter4_56}
data_new = data_sub[is.na(data_sub$survived),]
data_obs = data_sub[!is.na(data_sub$survived),]
```

### Training and evaluation

Now, we can do a simple 10xCV with the observed_data:

```{r, warning=FALSE, message=FALSE, results='hide'}
library(glmnet)
library(glmnetUtils)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 10

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = 1:cv,
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  model = glmnet(survived~.,data = train_outer, family = "binomial",alpha = 0.2)
  results[i, 2] = Metrics::auc(test_outer$survived, predict(model, test_outer, 
                                                            alpha = 0.2,
                                                            s = 0.01,
                                                            type = "response"))

}
```

```{r}
print(results)
print(mean(results$AUC))
```

### Hyperparameter optimization

We did a simple 10xCV to evaluate our model but we didn't tune our hyperparameters ($\lambda$ and $\alpha$). If we want to tune them, we need do another CV within each split of the model evaulation CV, which is called nested CV.

We used only one split (the split for the submission server doesn't count) to evaluate the performance of the model before we made the final predictions. If we test many different hyperparameter combinations, how do we ensure that a certain hyperparameter is not only good for our training dataset but also good for the new data (our outer split on the submission server)? You may have guessed it already, we need to do another CV within the previous CV to check whether a certain hyperparameter solution generalizes to the whole data. To tune $\lambda$, we would need to split the data another time (called nested CV).

Why is it important to tune hyperparameters? Hyperparameters (configuration parameters of our ML algorithms that (mostly) control their complexity) are usually tuned (optimized) in an automatic / systematic way. A common procedure, called random search, is to sample random configuration combinations from the set of hyperparameters and test for each combination the prediction error.

Let's implement manually a nested CV to tune the $\alpha$. Let's start with a 5CVx5CV and 20x different alpha values:

```{r}
library(glmnet)
library(glmnetUtils)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
cv_inner = 5
hyper_alpha = runif(20,0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  best_alpha = NULL
  best_auc = NULL
  
  # inner split
  for(j in 1:cv_inner) {
    inner_split = as.integer(cut(1:nrow(train_outer), breaks = cv_inner))
    train_inner = train_outer[inner_split != j, ]
    test_inner = train_outer[inner_split == j, ]
    
    tuning_results_inner = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_inner, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_inner$survived, predict(model, test_inner, 
                                                         alpha = hyper_alpha[k],
                                                         s = 0.01,
                                                         type = "response")))
      })
    best_alpha[j] = hyper_alpha[which.max(tuning_results_inner)]
    best_auc[j] = max(tuning_results_inner)
  }
  best_alpha = best_alpha[which.max(best_auc)]
  model = glmnet(survived~., data = train_outer, alpha = best_alpha, family = "binomial")
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = Metrics::auc(test_outer$survived, predict(model, test_outer, s = 0.01, alpha = best_alpha, type = "response"))
}

print(results)
```

## Predictions and Submission

When we are satisfied with the performance of our model, we will create predictions for the new observations on the submission server. But first we will train now our model on the full observed dataset:

$\alpha = 0.915$ has the highest AUC, let's use it to train the model on the full dataset:

```{r}
model = glmnet(survived~.,data = data_obs, family = "binomial",alpha = 0.915)
```

We cannot assess the performance for the new observations because the true survival ratio is unknown, however, we can now submit our predictions to the submission server at <a href="http://rhsbio7.uni-regensburg.de:8500" target="_blank" rel="noopener">http://rhsbio7.uni-regensburg.de:8500</a>.

For the submission it is critical to change the predictions into a data.frame, select the second column (the probability to survive), and save it with the write.csv function:

```{r, results='hide', warning=FALSE, message=FALSE}
data_new = data_sub[is.na(data_sub$survived),]
write.csv(data.frame(y = predict(model, data_new, alpha = 0.915, s = 0.01, type = "response")[,1] ), file = "Max_1.csv")
```

We have now used the $\alpha$ value with the highest AUC here, but our tuning has shown that the best value of $\alpha$ depends on the partitioning, so it would probably be better to build ten models and combine their predictions (e.g., by averaging the predictions):

```{r, results='hide',warning=FALSE, message=FALSE}
prediction_ensemble = 
  sapply(results$alpha, function(alpha) {
    model = glmnet(survived~.,data = data_obs, family = "binomial",alpha = alpha)
    return(predict(model, data_new, alpha = alpha, s = 0.01, type = "response")[,1])
  })
write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_1.csv")
```

## Exercises

::: {.callout-caution icon="false"}
#### Task: Tuning $\alpha$ and $\lambda$

1.  Extend the code from above and tune $\alpha$ and $\lambda$ (Nested-CV or via a simple CV)

2.  Train the model with best set of hyperparameters and submit your predictions

3.  Compare the predictive performance from the single best model with the ensemble model

Submit both predictions (<http://rhsbio7.uni-regensburg.de:8500/>), which model has a higher AUC?

```{r}
library(EcoData)
library(dplyr)
library(missRanger)
library(glmnet)
library(glmnetUtils)
data(titanic_ml)
data = titanic_ml
data = 
  data %>% select(survived, sex, age, fare, pclass)

# missRanger uses a random forest to impute NAs (RF is trained on the data to predict values for the NAs)
data[,-1] = missRanger(data[,-1], verbose = 0)

data_sub =
  data %>%
    mutate(age = scales::rescale(age, c(0, 1)),
           fare = scales::rescale(fare, c(0, 1))) %>%
    mutate(sex = as.integer(sex) - 1L,
           pclass = as.integer(pclass - 1L))
data_new = data_sub[is.na(data_sub$survived),] # for which we want to make predictions at the end
data_obs = data_sub[!is.na(data_sub$survived),] # data with known response
```

Bonus:

-   Try different features
-   Try cito
-   Try different datasets (see @sec-datasets)

Code template for a simple CV (only $\alpha$ is tuned, add the tuning for $\lambda$:

```{r}
library(glmnet)
library(glmnetUtils)
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
hyper_alpha = runif(20,0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  tuning_results = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_outer, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_outer$survived, predict(model, test_outer, 
                                                         alpha = hyper_alpha[k],
                                                         s = 0.01,
                                                         type = "response")))
      })
  best_alpha = hyper_alpha[which.max(tuning_results)]
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = max(tuning_results)
}

print(results)
```

`r hide("Click here to see the solution for the single model")`

Nested CV:

```{r}
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
cv_inner = 5
hyper_alpha = runif(30,0, 1)
hyper_lambda = runif(30,0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  lambda = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  best_alpha = NULL
  best_lambda = NULL
  best_auc = NULL
  # inner split
  for(j in 1:cv_inner) {
    inner_split = as.integer(cut(1:nrow(train_outer), breaks = cv_inner))
    train_inner = train_outer[inner_split != j, ]
    test_inner = train_outer[inner_split == j, ]
    
    tuning_results_inner = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_inner, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_inner$survived, predict(model, test_inner, 
                                                         alpha = hyper_alpha[k],
                                                         s = hyper_lambda[k],
                                                         type = "response")))
      })
    best_alpha[j] = hyper_alpha[which.max(tuning_results_inner)]
    best_lambda[j] = hyper_lambda[which.max(tuning_results_inner)]
    best_auc[j] = max(tuning_results_inner)
  }
  best_alpha = best_alpha[which.max(best_auc)]
  best_lambda = best_lambda[which.max(best_auc)]
  
  model = glmnet(survived~., data = train_outer, alpha = best_alpha, family = "binomial")
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = best_lambda
  results[i, 4] = Metrics::auc(test_outer$survived, predict(model, test_outer, s = best_lambda, alpha = best_alpha, type = "response"))
}

print(results)

```

Simple CV:

```{r}
set.seed(42)
data_obs = data_sub[!is.na(data_sub$survived),] 
cv = 5
hyper_alpha = runif(20,0, 1)
hyper_lambda = runif(20, 0, 1)

outer_split = as.integer(cut(1:nrow(data_obs), breaks = cv))

results = data.frame(
  set = rep(NA, cv),
  alpha = rep(NA, cv),
  lambda = rep(NA, cv),
  AUC = rep(NA, cv)
)

for(i in 1:cv) {
  train_outer = data_obs[outer_split != i, ]
  test_outer = data_obs[outer_split == i, ]
  
  tuning_results = 
      sapply(1:length(hyper_alpha), function(k) {
        model = glmnet(survived~.,data = train_outer, family = "binomial",alpha = hyper_alpha[k])
        return(Metrics::auc(test_outer$survived, predict(model, test_outer, 
                                                         alpha = hyper_alpha[k],
                                                         s = hyper_lambda[k],
                                                         type = "response")))
      })
  best_alpha = hyper_alpha[which.max(tuning_results)]
  best_lambda = hyper_lambda[which.max(tuning_results)]
  results[i, 1] = i
  results[i, 2] = best_alpha
  results[i, 3] = best_lambda
  results[i, 4] = max(tuning_results)
}

print(results)
```

Predictions:

```{r, results='hide', message=FALSE, warning=FALSE}
prediction_ensemble = 
  sapply(1:nrow(results), function(i) {
    model = glmnet(survived~.,data = data_obs, family = "binomial",alpha = results$alpha[i])
    return(predict(model, data_new, alpha = results$alpha[i], s = results$lambda[i], type = "response")[,1])
  })

# Single predictions from the model with the highest AUC:
write.csv(data.frame(y = prediction_ensemble[,which.max(results$AUC)]), file = "Max_titanic_best_model.csv")

# Single predictions from the ensemble model:
write.csv(data.frame(y = apply(prediction_ensemble, 1, mean)), file = "Max_titanic_ensemble.csv")
```

`r unhide()`
:::

## Machine learning frameworks

As we have seen today, many of the machine learning algorithms are distributed over several packages but the general machine learning pipeline is very similar for all models: feature engineering, feature selection, hyperparameter tuning and cross-validation.

Machine learning frameworks such as `mlr3` or `tidymodels` provide a general interface for the ML pipeline, in particular the training and the hyperparameter tuning with nested CV. They support most ML packages/algorithms.

### mlr3 {#sec-mlr}

The key features of mlr3 are:

-   All common machine learning packages are integrated into mlr3, you can easily switch between different machine learning algorithms.
-   A common 'language'/workflow to specify machine learning pipelines.
-   Support for different cross-validation strategies.
-   Hyperparameter tuning for all supported machine learning algorithms.
-   Ensemble models.

Useful links:

-   <a href="https://mlr3book.mlr-org.com/" target="_blank" rel="noopener">mlr3-book</a> (still in work)
-   <a href="https://mlr3.mlr-org.com/" target="_blank" rel="noopener">mlr3 website</a>
-   <a href="https://cheatsheets.mlr-org.com/mlr3.pdf" target="_blank" rel="noopener">mlr3 cheatsheet</a>

#### mlr3 - The Basic Workflow

The mlr3 package actually consists of several packages for different tasks (e.g. mlr3tuning for hyperparameter tuning, mlr3pipelines for data preparation pipes). But let's start with the basic workflow:

```{r chunk_chapter4_65, message=FALSE}
library(EcoData)
library(tidyverse)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3tuning)
library(mlr3measures)
data(nasa)
str(nasa)
```

Let's drop time, name and ID variable and create a classification task:

```{r chunk_chapter4_66}
data = nasa %>% select(-Orbit.Determination.Date,
                       -Close.Approach.Date, -Name, -Neo.Reference.ID)
data$Hazardous = as.factor(data$Hazardous)

# Create a classification task.
task = TaskClassif$new(id = "nasa", backend = data,
                       target = "Hazardous", positive = "1")
```

Create a generic pipeline of data transformation (imputation $\rightarrow$ scaling $\rightarrow$ encoding of categorical variables):

```{r chunk_chapter4_67}
set.seed(123)

# Let's create the preprocessing graph.
preprocessing = po("imputeoor") %>>% po("scale") %>>% po("encode") 

# Run the task.
transformed_task = preprocessing$train(task)[[1]]

transformed_task$missings()
```

We can even visualize the preprocessing graph:

```{r chunk_chapter4_68}
preprocessing$plot()
```

To test our model (glmnet) with 10-fold cross-validated, we will do:

-   Specify the missing target rows as validation so that they will be ignored.
-   Specify the cross-validation, the learner (the machine learning model we want to use), and the measurement (AUC).
-   Run (benchmark) our model.

```{r chunk_chapter4_69__mlr1}
set.seed(123)

transformed_task$data()[1,]
transformed_task$set_row_roles((1:nrow(data))[is.na(data$Hazardous)],
                               "holdout")

cv10 = mlr3::rsmp("cv", folds = 10L)
EN = lrn("classif.glmnet", predict_type = "prob")
measurement =  msr("classif.auc")
```

```{r chunk_chapter4_70, eval=FALSE}
result = mlr3::resample(transformed_task,
                        EN, resampling = cv10, store_models = TRUE)

# Calculate the average AUC of the holdouts.
result$aggregate(measurement)
```

Very cool! Preprocessing + 10-fold cross-validation model evaluation in a few lines of code!

Let's create the final predictions:

```{r chunk_chapter4_71__mlr2, eval=FALSE}
pred = sapply(1:10, function(i) result$learners[[i]]$predict(transformed_task,
row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)
```

You could now submit the predictions <a href="http://rhsbio7.uni-regensburg.de:8500" target="_blank" rel="noopener">here</a>.

But we are still not happy with the results, let's do some hyperparameter tuning!

#### mlr3 - Hyperparameter Tuning

With mlr3, we can easily extend the above example to do hyperparameter tuning within nested cross-validation (the tuning has its own inner cross-validation).

Print the hyperparameter space of our glmnet learner:

```{r chunk_chapter4_72}
EN$param_set
```

Define the hyperparameter space of the random forest:

```{r chunk_chapter4_73__mlr3}
library(paradox)

EN_pars = 
    paradox::ParamSet$new(
      list(paradox::ParamDbl$new("alpha", lower = 0, upper = 1L),
           paradox::ParamDbl$new("lambda", lower = 0, upper = 0.5 )) )
print(EN_pars)
```

To set up the tuning pipeline we need:

-   Inner cross-validation resampling object.
-   Tuning criterion (e.g. AUC).
-   Tuning method (e.g. random or block search).
-   Tuning terminator (When should we stop tuning? E.g. after $n$ iterations).

```{r chunk_chapter4_74__mlr4}
set.seed(123)

inner3 = mlr3::rsmp("cv", folds = 3L)
measurement =  msr("classif.auc")
tuner =  mlr3tuning::tnr("random_search") 
terminator = mlr3tuning::trm("evals", n_evals = 5L)
EN = lrn("classif.glmnet", predict_type = "prob")

learner_tuner = AutoTuner$new(learner = EN, 
                              measure = measurement, 
                              tuner = tuner, 
                              terminator = terminator,
                              search_space = EN_pars,
                              resampling = inner3)
print(learner_tuner)
```

Now we can wrap it normally into the 10-fold cross-validated setup as done previously:

```{r chunk_chapter4_75, echo=FALSE, results='hide'}
set.seed(123)

outer3 = mlr3::rsmp("cv", folds = 3L)
result = mlr3::resample(transformed_task, learner_tuner,
                        resampling = outer3, store_models = TRUE)

```

```{r}
# Calculate the average AUC of the holdouts.
result$aggregate(measurement)
```

Let's create the final predictions:

```{r chunk_chapter4_76, eval=FALSE}
pred = sapply(1:3, function(i) result$learners[[i]]$predict(transformed_task,
row_ids = (1:nrow(data))[is.na(data$Hazardous)])$data$prob[, "1", drop = FALSE])
dim(pred)
predictions = apply(pred, 1, mean)
```

## Exercises

::: {.callout-caution icon="false"}
#### Question: Use mlr3 for the titanic dataset

1.  Use `mlr3` to tune glmnet for the titanic dataset using nested CV
2.  Submit single predictions and multiple predictions

If you need help, take a look at the solution, go through it line by line and try to understand it.

`r hide("Click here to see the solution")`

Prepare data

```{r}
data = titanic_ml %>% select(-name, -ticket, -name, -body)
data$pclass = as.factor(data$pclass)
data$sex = as.factor(data$sex)
data$survived = as.factor(data$survived)

# Change easy things manually:
data$embarked[data$embarked == ""] = "S"  # Fill in "empty" values.
data$embarked = droplevels(as.factor(data$embarked)) # Remove unused levels ("").
data$cabin = (data$cabin != "") * 1 # Dummy code the availability of a cabin.
data$fare[is.na(data$fare)] = mean(data$fare, na.rm = TRUE)
levels(data$home.dest)[levels(data$home.dest) == ""] = "unknown"
levels(data$boat)[levels(data$boat) == ""] = "none"

# Create a classification task.
task = TaskClassif$new(id = "titanic", backend = data,
                       target = "survived", positive = "1")
task$missings()

# Let's create the preprocessing graph.
preprocessing = po("imputeoor") %>>% po("scale") %>>% po("encode") 

# Run the task.
transformed_task = preprocessing$train(task)[[1]]

transformed_task$set_row_roles((1:nrow(data))[is.na(data$survived)], "holdout")
```

Hyperparameter tuning:

```{r, results='hide'}

cv10 = mlr3::rsmp("cv", folds = 10L)

inner3 = mlr3::rsmp("cv", folds = 3L)
measurement =  msr("classif.auc")
tuner =  mlr3tuning::tnr("random_search") 
terminator = mlr3tuning::trm("evals", n_evals = 5L)
EN = lrn("classif.glmnet", predict_type = "prob")
EN_pars = 
    paradox::ParamSet$new(
      list(paradox::ParamDbl$new("alpha", lower = 0, upper = 1L),
           paradox::ParamDbl$new("lambda", lower = 0, upper = 0.5 )) )

learner_tuner = AutoTuner$new(learner = EN, 
                              measure = measurement, 
                              tuner = tuner, 
                              terminator = terminator,
                              search_space = EN_pars,
                              resampling = inner3)


result = mlr3::resample(transformed_task, learner_tuner,
                        resampling = cv10, store_models = TRUE)
```

Evaluation:

```{r}
measurement =  msr("classif.auc")
result$aggregate(measurement)
```

Predictions:

We can extract a learner with optimized hyperparameters:

```{r}
model = result$learners[[1]]$learner$clone()
model$param_set$values
```

And we can fit it then on the full data set:

```{r}
model$train(transformed_task)
predictions = model$predict(transformed_task, row_ids = transformed_task$row_roles$holdout)
predictions = predictions$prob[,1]
head(predictions)
```

And submit to http://rhsbio7.uni-regensburg.de:8500

```{r}
write.csv(data.frame(y = predictions), file = "glmnet.csv")
```

`r unhide()`
:::