Classify Titanic passengers

In this example, I will show you how to analyze the Titanic dataset with mljar-supervised AutoML. The AutoML does all the work, and then we go through the results together.

All the code and results are available on GitHub.

The code

What the Python code does:

  • reads Titanic train dataset (the same data as in Kaggle platform),
  • trains AutoML object,
  • computes predictions and accuracy on the test dataset (the same test data as in Kaggle).

import pandas as pd
from sklearn.metrics import accuracy_score
from supervised import AutoML

# Load the training data (the same data as on Kaggle).
train = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/Titanic/train.csv")

# Features: every column after PassengerId and Survived. Target: Survived.
X = train[train.columns[2:]]
y = train["Survived"]

# Train AutoML; all results are saved in the AutoML_3 directory.
automl = AutoML(results_path="AutoML_3")
automl.fit(X, y)

# Predict on the test data and compare with the true labels.
test = pd.read_csv("https://raw.githubusercontent.com/pplonski/datasets-for-start/master/Titanic/test_with_Survived.csv")
predictions = automl.predict(test)
print(f"Accuracy: {accuracy_score(test['Survived'], predictions) * 100.0:.2f}%")

As you can see from the example above, the heavy lifting is done in exactly 2 lines of code:

automl = AutoML(results_path="AutoML_3")
automl.fit(X, y)

I will show you, step by step, what the above code produced from the training data.

The Explain mode

The default mode for mljar-supervised is Explain, which means that:

  • a 75% / 25% train / test split is used for model training and evaluation,
  • the following algorithms are trained: Baseline, Decision Tree, Linear, Random Forest, Xgboost, Neural Network, and Ensemble,
  • full explanations are created.

All results created during AutoML training are saved to the hard drive. Each model gets a Markdown report in its own README.md file (no black boxes!).

The AutoML leaderboard report

The main README.md in the report will contain:

  • a table with the performance of all models,
  • performance plotted as scatter plot and box plot.

The leaderboard:

| Best model | name | model_type | metric_type | metric_value | train_time | Link |
|---|---|---|---|---|---|---|
| | 1_Baseline | Baseline | logloss | 0.666775 | 0.26 | Results link |
| | 2_DecisionTree | Decision Tree | logloss | 0.648504 | 18 | Results link |
| | 3_Linear | Linear | logloss | 0.593649 | 12.2 | Results link |
| | 4_Default_RandomForest | Random Forest | logloss | 0.448691 | 22.24 | Results link |
| | 5_Default_Xgboost | Xgboost | logloss | 0.458922 | 12.63 | Results link |
| | 6_Default_NeuralNetwork | Neural Network | logloss | 0.733411 | 23.84 | Results link |
| the best | Ensemble | Ensemble | logloss | 0.436319 | 0.83 | Results link |

The table above shows how each model performed and how long it took to train (train_time is in seconds). Each row has a Results link (scroll the table sideways if you don't see it) that takes you to the model details 👍 🎉

The performance is presented in the plots:

AutoML Performance

AutoML Performance Boxplot

The Baseline

The Baseline algorithm is very important during initial analysis. It tells us about the quality of our data and helps to check whether we need Machine Learning to solve the problem at all.
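To make the Baseline score more concrete, here is a minimal sketch. It assumes that the Baseline predicts the same probability for every passenger, equal to the share of survivors in the training part of the split. That behavior is an assumption for illustration, not a description taken from the mljar-supervised source. The class counts are assumptions too: 549 / 342 (negative / positive) for the whole train file, and 137 / 86 for the validation part, which is what the rows of the confusion matrix in the Decision Tree section below add up to.

```python
import math


def constant_logloss(p, n_negative, n_positive):
    """Log loss when every sample gets the same predicted probability p
    of belonging to the positive class."""
    total = n_negative + n_positive
    return -(n_positive * math.log(p) + n_negative * math.log(1.0 - p)) / total


# Assumed class counts (the report does not print them):
# whole train file: 549 negatives, 342 positives;
# validation part:  137 negatives,  86 positives.
train_negative = 549 - 137
train_positive = 342 - 86

# The constant prediction is the positive rate in the training part.
prior = train_positive / (train_negative + train_positive)

baseline_logloss = constant_logloss(prior, n_negative=137, n_positive=86)
print(f"prior={prior:.4f}, baseline logloss={baseline_logloss:.4f}")
```

Under these assumptions the result lands close to the 0.667 reported for 1_Baseline, which is the score any real model has to beat.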

Let's compute the percentage difference between the best model (Ensemble) and the Baseline:

% difference = (0.667 - 0.436) / 0.667 * 100.0 = 34.6% 

The best model is 34.6% better than the Baseline, so the use of ML is justified and the data doesn't look random.

When does data look random?

As a personal rule of thumb, if the best model is less than 5% better than the Baseline, I treat the data as looking random and would reconsider using ML.
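The same check can be written as a few lines of Python. This is only a sketch: the function name and the constant are made up for this example, and the 5% cut-off is the personal rule of thumb from above, not a setting in mljar-supervised.

```python
def improvement_over_baseline(baseline_loss, best_loss):
    """Relative improvement in percent (lower loss is better)."""
    return (baseline_loss - best_loss) / baseline_loss * 100.0


baseline_loss = 0.666775  # 1_Baseline logloss from the leaderboard
best_loss = 0.436319      # Ensemble logloss from the leaderboard

improvement = improvement_over_baseline(baseline_loss, best_loss)

# Personal rule of thumb from the text, not a library setting.
RANDOM_DATA_THRESHOLD = 5.0
looks_random = improvement < RANDOM_DATA_THRESHOLD

print(f"Improvement over Baseline: {improvement:.1f}%")
print(f"Data looks random: {looks_random}")
```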

Decision Tree

Let's look closer at the Decision Tree report.

Part of the report is shown below:

Decision Tree hyperparameters

  • criterion: gini
  • max_depth: 3
  • explain_level: 2

Validation

  • validation_type: split
  • train_ratio: 0.75
  • shuffle: True
  • stratify: True
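These validation settings describe a shuffled, stratified 75% / 25% split. A rough equivalent can be sketched with scikit-learn's train_test_split. This is not the code mljar-supervised runs internally; the stand-in labels assume the class balance of the Kaggle train file (549 negatives, 342 positives), and the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the assumed class balance of the Kaggle train file
# (549 negatives, 342 positives); the report itself does not print it.
y = np.array([0] * 549 + [1] * 342)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-column features

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    train_size=0.75,  # train_ratio: 0.75
    shuffle=True,     # shuffle: True
    stratify=y,       # stratify: True (keep the class balance in both parts)
    random_state=42,  # arbitrary seed, only for repeatability
)

print(len(y_train), len(y_valid))
print(f"positive rate: train={y_train.mean():.3f}, valid={y_valid.mean():.3f}")
```

With 891 rows this leaves 223 rows for validation, which is also the total of the confusion matrix shown below (106 + 31 + 19 + 67).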

Optimized metric

logloss

Training time

17.1 seconds

Metric details

| metric | score | threshold |
|---|---|---|
| logloss | 0.648504 | nan |
| auc | 0.814293 | nan |
| f1 | 0.728261 | 0.351143 |
| accuracy | 0.775785 | 0.351143 |
| precision | 0.843137 | 0.597938 |
| recall | 0.965116 | 0 |
| mcc | 0.54213 | 0.351143 |

Confusion matrix (at threshold=0.351143)

| | Predicted as negative | Predicted as positive |
|---|---|---|
| Labeled as negative | 106 | 31 |
| Labeled as positive | 19 | 67 |

Many metrics and the confusion matrix are pre-computed.
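As a quick sanity check, the metrics reported at threshold 0.351143 (f1, accuracy, mcc) can be recomputed from the four confusion matrix counts. The snippet below is a plain-Python sketch using the standard textbook formulas, not the code that generates the report.

```python
import math

# Counts from the confusion matrix at threshold 0.351143.
tn, fp = 106, 31  # labeled as negative
fn, tp = 19, 67   # labeled as positive

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.6f}, f1={f1:.6f}, mcc={mcc:.5f}")
print(f"precision={precision:.6f}, recall={recall:.6f} (at this threshold)")
```

Accuracy, f1, and mcc agree with the table. Precision and recall come out differently because the table lists them at other thresholds (0.597938 and 0), not at 0.351143.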

Additionally, there is a Decision Tree visualization:

Decision Tree visualization

Many explanations are created for each model. Let's see what they look like for Xgboost, one of the two strongest single models in the leaderboard (Random Forest scored slightly better on logloss).

The Xgboost model

You can check the details of the Xgboost model in its Markdown report. Here I will show some parts of the report with short comments.

Learning curves

The vertical line indicates the optimal number of trees in the Xgboost model (found with early stopping). This number of trees is used when computing predictions.

Xgboost learning curve
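The idea behind early stopping can be sketched in a few lines of plain Python. This is an illustration only, not Xgboost's implementation: the function, the patience value, and the loss values are all made up for this example.

```python
def best_iteration(validation_losses, patience=3):
    """Return (index, loss) of the best boosting round, stopping once
    `patience` rounds in a row bring no improvement."""
    best_index, best_loss = 0, float("inf")
    for index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss = index, loss
        elif index - best_index >= patience:
            break  # no improvement for `patience` rounds, stop training
    return best_index, best_loss


# Made-up validation logloss per boosting round, for illustration only.
losses = [0.66, 0.58, 0.52, 0.48, 0.46, 0.47, 0.48, 0.49, 0.45]

index, loss = best_iteration(losses, patience=3)
print(f"use {index + 1} trees (validation logloss {loss})")
```

Training stops three rounds after the best round, so the later value 0.45 is never reached. The best round plays the role of the vertical line in the plot.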

Feature Importance

The permutation-based feature importance:

Permutation based feature importance

From the plot you can see that the most important feature is Name_mr. There was no such feature in the training data, only the Name feature. The AutoML used the TF-IDF transformation (scikit-learn TfidfVectorizer) to construct new features from the Name text feature.

SHAP dependence plots

SHAP dependence plots

The test accuracy

Finally, the AutoML predicts the labels for the test data samples. Below is the console output of the script; the last line shows the accuracy computed on the test data:

AutoML directory: AutoML_3
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble availabe models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
1_Baseline logloss 0.666775 trained in 0.23 seconds
2_DecisionTree logloss 0.648504 trained in 17.06 seconds
3_Linear logloss 0.593649 trained in 11.04 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_RandomForest logloss 0.448691 trained in 21.72 seconds
5_Default_Xgboost logloss 0.458922 trained in 17.47 seconds
6_Default_NeuralNetwork logloss 0.718124 trained in 22.08 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.436478 trained in 0.71 seconds
AutoML fit time: 96.77 seconds
Accuracy: 77.99%

Summary

The AutoML was used to analyze the Titanic dataset. I hope you see the advantages of AutoML (with 2 lines of code):

  • all needed preprocessing was done automatically: filling in missing values, converting categoricals, converting text to numbers,
  • many different algorithms were checked,
  • all results are saved to the hard drive, and Markdown reports are available for all models.
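To give a feeling for the preprocessing that AutoML takes off your hands, here is a hand-written sketch of two of these steps with pandas on a tiny made-up frame. The strategy shown (median for numbers, most frequent value plus integer codes for categoricals) is one simple choice for illustration and an assumption, not necessarily what mljar-supervised does internally.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with Titanic-like columns.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

processed = df.copy()

# Missing numeric values: fill with the column median.
processed["Age"] = processed["Age"].fillna(processed["Age"].median())

# Missing categorical values: fill with the most frequent value,
# then map each category to an integer code.
for column in ["Sex", "Embarked"]:
    most_frequent = processed[column].mode()[0]
    processed[column] = processed[column].fillna(most_frequent)
    processed[column] = processed[column].astype("category").cat.codes

print(processed)
```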

If you are still asking yourself whether AutoML will replace data scientists, I hope you have an answer now. Yes: AutoML will replace data scientists who are not using AutoML with the ones who are.