Algorithms¶
Baseline¶
Classification¶
The Baseline algorithm uses the scikit-learn DummyClassifier with the prior strategy, which returns the most frequent class as the label and the class prior for predict_proba().
Regression¶
The Baseline algorithm uses the scikit-learn DummyRegressor with the mean strategy, which returns the mean of the target from the training data.
Baseline is not tuned
There will be only one model for the Baseline algorithm. This algorithm has no hyperparameters.
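For reference, a minimal sketch of the equivalent scikit-learn constructions (these dummy estimators mirror what the Baseline models do):

```python
from sklearn.dummy import DummyClassifier, DummyRegressor

# Classification baseline: most frequent class as label,
# class priors from predict_proba().
clf = DummyClassifier(strategy="prior")

# Regression baseline: mean of the training target.
reg = DummyRegressor(strategy="mean")
```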
Decision Tree¶
Classification¶
The Decision Tree algorithm uses the scikit-learn DecisionTreeClassifier.
Decision Tree hyperparameters for classification
The allowed values of hyperparameters:

```python
dt_params = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3, 4],
}

classification_default_params = {"criterion": "gini", "max_depth": 3}
```
Regression¶
The Decision Tree algorithm uses the scikit-learn DecisionTreeRegressor.
Decision Tree hyperparameters for regression
The allowed values of hyperparameters:

```python
dt_params = {
    "criterion": ["mse", "friedman_mse"],
    "max_depth": [2, 3, 4],
}

regression_default_params = {"criterion": "mse", "max_depth": 3}
```
For the Decision Tree a visualization can be created with the dtreeviz package (you need to have explain_level > 0).
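As an illustration, a sketch of rendering such a visualization outside of AutoML, assuming the dtreeviz 1.x call signature (the 2.x package moved to a dtreeviz.model(...) API):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from dtreeviz.trees import dtreeviz  # dtreeviz 1.x API (assumption)

iris = load_iris()
# Default Decision Tree parameters from above.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3)
clf.fit(iris.data, iris.target)

# Render the fitted tree and save the decision-path diagram as SVG.
viz = dtreeviz(clf, iris.data, iris.target,
               target_name="species",
               feature_names=iris.feature_names,
               class_names=list(iris.target_names))
viz.save("decision_tree.svg")
```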
Linear¶
Classification¶
The Linear algorithm uses the scikit-learn LogisticRegression.
Linear hyperparameters for classification
There are no tuned hyperparameters for the Linear model. The parameters used in LogisticRegression initialization: max_iter=500, tol=5e-4, n_jobs=-1.
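For reference, the equivalent scikit-learn construction is a one-liner:

```python
from sklearn.linear_model import LogisticRegression

# Fixed initialization used for the Linear classifier (no tuning performed).
model = LogisticRegression(max_iter=500, tol=5e-4, n_jobs=-1)
```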
Regression¶
The Linear algorithm uses the scikit-learn LinearRegression.
Linear hyperparameters for regression
There are no tuned hyperparameters for the Linear model. The parameter used in LinearRegression initialization: n_jobs=-1.
The coefficients are saved in the Markdown report if explain_level > 0.
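A minimal sketch of how such coefficients can be pulled from a fitted model (the report layout itself is an AutoML detail; the pandas table here is only illustrative):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, random_state=0)
model = LinearRegression(n_jobs=-1).fit(X, y)

# Coefficients plus intercept as a table, similar in spirit to the report.
coefs = pd.DataFrame(
    {"feature": [f"feature_{i}" for i in range(X.shape[1])] + ["intercept"],
     "coefficient": list(model.coef_) + [model.intercept_]}
)
print(coefs)
```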
Random Forest¶
Classification¶
The Random Forest algorithm uses the scikit-learn RandomForestClassifier.
Random Forest hyperparameters for classification
The allowed hyperparameter values:

```python
rf_params = {
    "criterion": ["gini", "entropy"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}

classification_default_params = {
    "criterion": "gini",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
```
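Mapped onto scikit-learn, the defaults above correspond to the following sketch (the number of trees and other settings managed internally by AutoML are not listed here):

```python
from sklearn.ensemble import RandomForestClassifier

# Default configuration from classification_default_params above.
model = RandomForestClassifier(
    criterion="gini",
    max_features=0.6,
    min_samples_split=30,
    max_depth=6,
)
```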
Regression¶
The Random Forest algorithm uses the scikit-learn RandomForestRegressor.
Random Forest hyperparameters for regression
The allowed hyperparameter values:

```python
regression_rf_params = {
    "criterion": ["mse"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}

regression_default_params = {
    "criterion": "mse",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
```
Extra Trees¶
Classification¶
The Extra Trees algorithm uses the scikit-learn ExtraTreesClassifier.
Extra Trees hyperparameters for classification
The allowed hyperparameter values:

```python
et_params = {
    "criterion": ["gini", "entropy"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}

classification_default_params = {
    "criterion": "gini",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
```
Regression¶
The Extra Trees algorithm uses the scikit-learn ExtraTreesRegressor.
Extra Trees hyperparameters for regression
The allowed hyperparameter values:

```python
regression_et_params = {
    "criterion": ["mse"],
    "max_features": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "min_samples_split": [10, 20, 30, 40, 50],
    "max_depth": [4, 6, 8, 10, 12],
}

regression_default_params = {
    "criterion": "mse",
    "max_features": 0.6,
    "min_samples_split": 30,
    "max_depth": 6,
}
```
Xgboost¶
The AutoML uses the Xgboost package.
Binary Classification¶
Xgboost hyperparameters for binary classification
The allowed hyperparameter values:

```python
xgb_bin_class_params = {
    "objective": ["binary:logistic"],
    "eval_metric": ["logloss"],
    "eta": [0.05, 0.075, 0.1, 0.15],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subsample": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}

classification_bin_default_params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
```
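A minimal sketch of training with these defaults through the native Xgboost API (synthetic data; num_boost_round is an illustrative choice, not an AutoML setting):

```python
import numpy as np
import xgboost as xgb

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=100)  # round count is illustrative
preds = booster.predict(dtrain)  # probabilities for the positive class
```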
Multi-class Classification¶
Xgboost hyperparameters for multi-class classification
The allowed hyperparameter values:

```python
xgb_multi_class_params = {
    "objective": ["multi:softprob"],
    "eval_metric": ["mlogloss"],
    "eta": [0.05, 0.075, 0.1, 0.15],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subsample": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}

classification_multi_default_params = {
    "objective": "multi:softprob",
    "eval_metric": "mlogloss",
    "eta": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
```
Regression¶
Xgboost hyperparameters for regression
The allowed hyperparameter values:

```python
xgb_regression_params = {
    "objective": ["reg:squarederror"],
    "eval_metric": ["rmse"],
    "eta": [0.05, 0.075, 0.1, 0.15],
    "max_depth": [1, 2, 3, 4],
    "min_child_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "subsample": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}

regression_default_params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "eta": 0.1,
    "max_depth": 4,
    "min_child_weight": 1,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
}
```
CatBoost¶
The AutoML uses the CatBoost package.
Classification¶
CatBoost hyperparameters for classification
The allowed hyperparameter values:

```python
classification_params = {
    "learning_rate": [0.05, 0.1, 0.2],
    "depth": [2, 3, 4, 5, 6],
    "rsm": [0.7, 0.8, 0.9, 1],  # random subspace method
    "subsample": [0.7, 0.8, 0.9, 1],  # sample rate for bagging
    "min_data_in_leaf": [1, 5, 10, 15, 20, 30, 50],
}

classification_default_params = {
    "learning_rate": 0.1,
    "depth": 6,
    "rsm": 0.9,
    "subsample": 1.0,
    "min_data_in_leaf": 15,
}
```

The loss function depends on the task:
- for binary classification, loss_function=Logloss,
- for multiclass classification, loss_function=MultiClass.
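Mapped onto the CatBoost API, the binary-classification defaults correspond to the following sketch (note that subsample only applies with a compatible bootstrap_type, such as the MVS default on CPU in recent CatBoost versions):

```python
from catboost import CatBoostClassifier

# Default configuration from classification_default_params above,
# with the binary-classification loss; verbose=0 silences per-iteration logs.
model = CatBoostClassifier(
    learning_rate=0.1,
    depth=6,
    rsm=0.9,
    subsample=1.0,  # requires a Bernoulli/Poisson/MVS bootstrap type
    min_data_in_leaf=15,
    loss_function="Logloss",
    verbose=0,
)
```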
Regression¶
CatBoost hyperparameters for regression
The allowed hyperparameter values:

```python
regression_params = {
    "learning_rate": [0.05, 0.1, 0.2],
    "depth": [2, 3, 4, 5, 6],
    "rsm": [0.7, 0.8, 0.9, 1],  # random subspace method
    "subsample": [0.7, 0.8, 0.9, 1],  # sample rate for bagging
    "min_data_in_leaf": [1, 5, 10, 15, 20, 30, 50],
}

regression_default_params = {
    "learning_rate": 0.1,
    "depth": 6,
    "rsm": 0.9,
    "subsample": 1.0,
    "min_data_in_leaf": 15,
}
```
For regression, loss_function=RMSE is used.
LightGBM¶
The AutoML uses the LightGBM package.
Binary Classification¶
LightGBM hyperparameters for binary classification
The allowed hyperparameter values:

```python
lgbm_bin_params = {
    "objective": ["binary"],
    "metric": ["binary_logloss"],
    "num_leaves": [3, 7, 15, 31],
    "learning_rate": [0.05, 0.075, 0.1, 0.15],
    "feature_fraction": [0.8, 0.9, 1.0],
    "bagging_fraction": [0.8, 0.9, 1.0],
    "min_data_in_leaf": [5, 10, 15, 20, 30, 50],
}

classification_bin_default_params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "min_data_in_leaf": 10,
}
```
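A minimal sketch of training with these defaults through the native LightGBM API (synthetic data; num_boost_round is an illustrative choice, not an AutoML setting):

```python
import numpy as np
import lightgbm as lgb

# Synthetic binary classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "min_data_in_leaf": 10,
}
train_set = lgb.Dataset(X, label=y)
booster = lgb.train(params, train_set, num_boost_round=100)  # round count is illustrative
preds = booster.predict(X)  # probabilities for the positive class
```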
Multi-class Classification¶
LightGBM hyperparameters for multi-class classification
The allowed hyperparameter values:

```python
lgbm_multi_params = {
    "objective": ["multiclass"],
    "metric": ["multi_logloss"],
    "num_leaves": [3, 7, 15, 31],
    "learning_rate": [0.05, 0.075, 0.1, 0.15],
    "feature_fraction": [0.8, 0.9, 1.0],
    "bagging_fraction": [0.8, 0.9, 1.0],
    "min_data_in_leaf": [5, 10, 15, 20, 30, 50],
}

classification_multi_default_params = {
    "objective": "multiclass",
    "metric": "multi_logloss",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "min_data_in_leaf": 10,
}
```
Regression¶
LightGBM hyperparameters for regression
The allowed hyperparameter values:

```python
lgbm_regression_params = {
    "objective": ["regression"],
    "metric": ["l2"],
    "num_leaves": [3, 7, 15, 31],
    "learning_rate": [0.05, 0.075, 0.1, 0.15],
    "feature_fraction": [0.8, 0.9, 1.0],
    "bagging_fraction": [0.8, 0.9, 1.0],
    "min_data_in_leaf": [5, 10, 15, 20, 30, 50],
}

regression_default_params = {
    "objective": "regression",
    "metric": "l2",
    "num_leaves": 15,
    "learning_rate": 0.1,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.9,
    "min_data_in_leaf": 10,
}
```
Neural Network¶
The Neural Network algorithm uses Keras with TensorFlow. The same set of hyperparameters is used for all machine learning tasks (classification and regression); only the type of the output neurons and the loss function differ, depending on the task.
Neural Network hyperparameters
The allowed hyperparameter values:

```python
nn_params = {
    "dense_layers": [2],
    "dense_1_size": [16, 32, 64],
    "dense_2_size": [4, 8, 16, 32],
    "dropout": [0, 0.1, 0.25],
    "learning_rate": [0.01, 0.05, 0.08, 0.1],
    "momentum": [0.85, 0.9, 0.95],
    "decay": [0.0001, 0.001, 0.01],
}

default_nn_params = {
    "dense_layers": 2,
    "dense_1_size": 32,
    "dense_2_size": 16,
    "dropout": 0,
    "learning_rate": 0.05,
    "momentum": 0.9,
    "decay": 0.001,
}
```
Binary Classification¶
- There is a single output neuron with sigmoid activation.
- The loss function: binary_crossentropy.
Multi-class Classification¶
- The number of output neurons is equal to the number of unique classes in the target. The activation in the output layer is softmax.
- The loss function: categorical_crossentropy.
Regression¶
- There is a single output neuron with linear activation.
- The loss function: mean_squared_error.
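Putting the default hyperparameters together with, for example, the binary-classification head, a minimal Keras sketch (the hidden-layer activation and the exact wiring inside AutoML are assumptions here):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Two dense layers with the default sizes; dropout=0 keeps the layers as no-ops.
model = keras.Sequential([
    layers.Dense(32, activation="relu"),    # dense_1_size=32; relu is an assumption
    layers.Dropout(0.0),
    layers.Dense(16, activation="relu"),    # dense_2_size=16
    layers.Dropout(0.0),
    layers.Dense(1, activation="sigmoid"),  # single output neuron for the binary task
])

# learning_rate=0.05 and momentum=0.9 from default_nn_params; decay=0.001 maps
# to the older Keras SGD `decay` argument (a learning-rate schedule in newer
# TensorFlow versions) and is omitted here for portability.
opt = keras.optimizers.SGD(learning_rate=0.05, momentum=0.9)
model.compile(optimizer=opt, loss="binary_crossentropy")
```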
Nearest Neighbor¶
The Nearest Neighbor algorithm uses scikit-learn:
- the KNeighborsClassifier for classification,
- the KNeighborsRegressor for regression.
Nearest Neighbor hyperparameters
The allowed hyperparameter values:

```python
knn_params = {
    "n_neighbors": [3, 5, 7],
    "weights": ["uniform", "distance"],
}

default_params = {
    "n_neighbors": 5,
    "weights": "uniform",
}
```
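The search space is small enough to enumerate; a sketch using scikit-learn's GridSearchCV (AutoML runs its own tuner, so this is only an illustration of the space):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Exhaustive search over the allowed values listed above.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```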
Stacked Algorithm¶
The stacked algorithms are built with predictions from the previous (unstacked) models. The stacked algorithms reuse the hyperparameters of already-found good models.
- During stacking, up to 10 best models from each algorithm are used, except Baseline.
- The out-of-folds predictions are used to construct the extended training data (see the sketch after this list). Stacking only works for validation_strategy="kfold" (k-fold cross-validation).
- The stacked model can only be: Xgboost, LightGBM, CatBoost. The AutoML algorithm selects the best models from the unstacked Xgboost, LightGBM, and CatBoost models and reuses their hyperparameters to train the stacked models.
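A minimal sketch of the extended-training-data idea, using a placeholder base model and scikit-learn's cross_val_predict for the out-of-fold predictions (the AutoML builds these columns from its own k-fold models):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
base = RandomForestClassifier(n_estimators=50, random_state=0)

# Out-of-fold predictions: every row is predicted by a model that never
# saw it during training, so the new column does not leak the target.
oof = cross_val_predict(base, X, y, cv=5, method="predict_proba")[:, 1]

# Extended training data = original features + out-of-fold predictions.
X_stacked = np.column_stack([X, oof])
```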
Ensemble¶
The Ensemble algorithm is implemented based on the Caruana et al. article. The Ensemble uses the average method: a greedy search over all models, repeatedly trying to add a model (with repetition) to the ensemble to improve the ensemble's performance. The ensemble performance is computed based on the out-of-folds predictions of the used models.
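A compact sketch of the greedy selection loop under a mean-squared-error objective (the function name, the fixed step count, and the metric are illustrative; the AutoML optimizes its own evaluation metric):

```python
import numpy as np

def greedy_ensemble(oof_preds, y, n_steps=20):
    """Caruana-style greedy ensemble selection (illustrative sketch).

    oof_preds: dict mapping model name -> out-of-fold predictions (1D array)
    y: true targets; lower mean squared error is better here.
    """
    selected = []  # models chosen so far, with repetition
    current_sum = np.zeros_like(y, dtype=float)
    for _ in range(n_steps):
        best_name, best_score = None, np.inf
        for name, preds in oof_preds.items():
            # Score the ensemble as the plain average including this candidate.
            candidate = (current_sum + preds) / (len(selected) + 1)
            score = np.mean((candidate - y) ** 2)
            if score < best_score:
                best_name, best_score = name, score
        selected.append(best_name)
        current_sum += oof_preds[best_name]
    return selected  # a model's repetition count acts as its weight
```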