API documentation

AutoML class

Automated Machine Learning for supervised tasks (binary classification, multiclass classification, regression).

__init__(self, results_path=None, total_time_limit=3600, mode='Explain', ml_task='auto', model_time_limit=None, algorithms='auto', train_ensemble=True, stack_models='auto', eval_metric='auto', validation_strategy='auto', explain_level='auto', golden_features='auto', features_selection='auto', start_random_models='auto', hill_climbing_steps='auto', top_models_to_improve='auto', verbose=1, random_state=1234) special

Initialize AutoML object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
results_path | str | The path with results. If None, the directory name is generated from the template AutoML_{number}, where the number goes from 1 to 1,000, depending on which directory name is available. If results_path points to a directory with AutoML results (params.json must be present), then all models will be loaded. | None |
total_time_limit | int | The total time limit in seconds for AutoML training. It is not used when model_time_limit is set. | 3600 |
mode | str | Can be {`Explain`, `Perform`, `Compete`}. Defines the goal of AutoML and how intensive the search will be; see the docstring below for the validation scheme, models, and reports used by each mode. | 'Explain' |
ml_task | str | Can be {"auto", "binary_classification", "multiclass_classification", "regression"}. If left "auto", AutoML guesses the task from the target values. | 'auto' |
model_time_limit | int | The time limit for training a single model, in seconds. If set, total_time_limit is not respected. A single model can contain several learners (for example, 10 learners for 10-fold cross-validation); model_time_limit is the time for all of them. | None |
algorithms | list of str | The list of algorithms used in the training. The algorithms can be: `Baseline`, `Linear`, `Decision Tree`, `Random Forest`, `Extra Trees`, `LightGBM`, `Xgboost`, `CatBoost`, `Neural Network`, `Nearest Neighbors`. | 'auto' |
train_ensemble | boolean | Whether an ensemble gets created at the end of the training. | True |
stack_models | boolean | Whether a models stack gets created at the end of the training. Stack level is 1. | 'auto' |
eval_metric | str | The metric to be optimized. If "auto", logloss is used for classification tasks and rmse for regression tasks. Note: other values are still not implemented; please leave the default. | 'auto' |
validation_strategy | dict | Dictionary with the validation type. Right now train/test split and cross-validation are supported; see the docstring below for example dictionaries. | 'auto' |
explain_level | int | The level of explanations included for each model: 0 - no explanations; 1 - permutation importance plots, tree plots for decision trees, and coefficients for linear models; 2 - the same as 1 plus SHAP explanations. If left "auto", it is set based on the selected mode. | 'auto' |
golden_features | boolean | Whether to use golden features. If left "auto", it is set based on the selected mode: False for "Explain", True for "Perform" and "Compete". | 'auto' |
features_selection | boolean | Whether to do feature selection. If left "auto", it is set based on the selected mode: False for "Explain", True for "Perform" and "Compete". | 'auto' |
start_random_models | int | Number of starting random models to try. If left "auto", it is set based on the selected mode: 1 for "Explain", 5 for "Perform", 10 for "Compete". | 'auto' |
hill_climbing_steps | int | Number of steps to perform during hill climbing. If left "auto", it is set based on the selected mode: 0 for "Explain", 2 for "Perform" and "Compete". | 'auto' |
top_models_to_improve | int | Number of best models to improve in hill_climbing steps. If left "auto", it is set based on the selected mode: 0 for "Explain", 2 for "Perform", 3 for "Compete". | 'auto' |
verbose | int | Controls the verbosity when fitting and predicting. Note: still not implemented; please leave it set to 1. | 1 |
random_state | int | Controls the randomness of AutoML. | 1234 |
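The sketch below is not part of the library's own examples; it only illustrates how a few of the parameters above can be combined, and the specific values are arbitrary placeholders.
>>> from supervised import AutoML
>>> automl = AutoML(
...     mode="Compete",                                   # preset controlling search intensity
...     total_time_limit=2 * 3600,                        # overall training budget in seconds
...     algorithms=["LightGBM", "Xgboost", "CatBoost"],   # restrict the searched algorithms
...     train_ensemble=True,                              # build an ensemble at the end
...     random_state=42,                                  # reproducible runs
... )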
Examples:
Binary Classification Example:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> from supervised import AutoML
>>> df = pd.read_csv(
... "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
... skipinitialspace=True
... )
>>> X_train, X_test, y_train, y_test = train_test_split(
... df[df.columns[:-1]], df["income"], test_size=0.25
... )
>>> automl = AutoML()
>>> automl.fit(X_train, y_train)
>>> y_pred_prob = automl.predict_proba(X_test)
>>> print(f"AUROC: {roc_auc_score(y_test, y_pred_prob):.2f}%")
Multi-Class Classification Example:
>>> import pandas as pd
>>> from sklearn.datasets import load_digits
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.model_selection import train_test_split
>>> from supervised import AutoML
>>> digits = load_digits()
>>> X_train, X_test, y_train, y_test = train_test_split(
... digits.data, digits.target, stratify=digits.target, test_size=0.25,
... random_state=123
... )
>>> automl = AutoML(mode="Perform")
>>> automl.fit(X_train, y_train)
>>> y_pred = automl.predict(X_test)
>>> print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}%")
Regression Example:
>>> import pandas as pd
>>> from sklearn.datasets import fetch_california_housing  # load_boston has been removed from scikit-learn
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error
>>> from supervised import AutoML
>>> housing = fetch_california_housing()
>>> X_train, X_test, y_train, y_test = train_test_split(
... pd.DataFrame(housing.data, columns=housing.feature_names),
... housing.target,
... test_size=0.25,
... random_state=123,
... )
>>> automl = AutoML(mode="Compete")
>>> automl.fit(X_train, y_train)
>>> print("Test R^2:", automl.score(X_test, y_test))
Scikit-learn Pipeline Integration Example:
>>> from imblearn.over_sampling import RandomOverSampler
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from supervised import AutoML
>>> X, y = make_classification()
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> pipeline = make_pipeline(RandomOverSampler(), AutoML())
>>> print(pipeline.fit(X_train, y_train).score(X_test, y_test))
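As described for results_path above, an AutoML instance can also be pointed at an existing results directory to reuse previously trained models. A minimal sketch, assuming a hypothetical directory named AutoML_1 (containing params.json) produced by an earlier run, and X_test defined as in the binary classification example:
>>> from supervised import AutoML
>>> automl = AutoML(results_path="AutoML_1")  # hypothetical existing results directory
>>> predictions = automl.predict(X_test)      # reuses the already trained models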
Source code in supervised/automl.py
def __init__(
self,
results_path=None,
total_time_limit=60 * 60,
mode="Explain",
ml_task="auto",
model_time_limit=None,
algorithms="auto",
train_ensemble=True,
stack_models="auto",
eval_metric="auto",
validation_strategy="auto",
explain_level="auto",
golden_features="auto",
features_selection="auto",
start_random_models="auto",
hill_climbing_steps="auto",
top_models_to_improve="auto",
verbose=1,
random_state=1234,
):
"""
Initialize `AutoML` object.
Arguments:
results_path (str): The path with results. If None, then the name of the directory will be generated using the template AutoML_{number},
where the number goes from 1 to 1,000, depending on which directory name is available.
If the `results_path` will point to directory with AutoML results (`params.json` must be present),
then all models will be loaded.
total_time_limit (int): The total time limit in seconds for AutoML training.
It is not used when `model_time_limit` is not `None`.
mode (str): Can be {`Explain`, `Perform`, `Compete`}. This parameter defines the goal of AutoML and how intensive the AutoML search will be.
- `Explain` : To be used when the user wants to explain and understand the data.
- Uses 75%/25% train/test split.
- Uses the following models: `Baseline`, `Linear`, `Decision Tree`, `Random Forest`, `XGBoost`, `Neural Network`, and `Ensemble`.
- Has full explanations in reports: learning curves, importance plots, and SHAP plots.
- `Perform` : To be used when the user wants to train a model that will be used in real-life use cases.
- Uses 5-fold CV (Cross-Validation).
- Uses the following models: `Linear`, `Random Forest`, `LightGBM`, `XGBoost`, `CatBoost`, `Neural Network`, and `Ensemble`.
- Has learning curves and importance plots in reports.
- `Compete` : To be used for machine learning competitions (maximum performance).
- Uses 10-fold CV (Cross-Validation).
- Uses the following models: `Decision Tree`, `Random Forest`, `Extra Trees`, `XGBoost`, `CatBoost`, `Neural Network`,
`Nearest Neighbors`, `Ensemble`, and `Stacking`.
- Has only learning curves in the reports.
ml_task (str): Can be {"auto", "binary_classification", "multiclass_classification", "regression"}.
- If left `auto`, AutoML will try to guess the task based on the target values.
- If there are only 2 unique values in the target, the task is set to `"binary_classification"`.
- If the number of unique values in the target is between 2 and 20 (inclusive), the task is set to `"multiclass_classification"`.
- In all other cases, the task is set to `"regression"`.
model_time_limit (int): The time limit for training a single model, in seconds.
If `model_time_limit` is set, the `total_time_limit` is not respected.
The single model can contain several learners. The time limit for subsequent learners is computed based on `model_time_limit`.
For example, in the case of 10-fold cross-validation, one model will have 10 learners.
The `model_time_limit` is the time for all 10 learners.
algorithms (list of str): The list of algorithms that will be used in the training.
The algorithms can be:
- `Baseline`,
- `Linear`,
- `Decision Tree`,
- `Random Forest`,
- `Extra Trees`,
- `LightGBM`,
- `Xgboost`,
- `CatBoost`,
- `Neural Network`,
- `Nearest Neighbors`,
train_ensemble (boolean): Whether an ensemble gets created at the end of the training.
stack_models (boolean): Whether a models stack gets created at the end of the training. Stack level is 1.
eval_metric (str): The metric to be optimized.
If "auto", then:
- `logloss` is used for classification tasks.
- `rmse` is used for regression tasks.
Note:
Still not implemented; please leave the default value.
validation_strategy (dict): Dictionary with validation type. Right now train/test split and cross-validation are supported.
Example:
Cross-validation example:
{
"validation_type": "kfold",
"k_folds": 5,
"shuffle": True,
"stratify": True,
"random_seed": 123
}
Train/test example:
{
"validation_type": "split",
"train_ratio": 0.75,
"shuffle": True,
"stratify": True
}
explain_level (int): The level of explanations included to each model:
- if `explain_level` is `0` no explanations are produced.
- if `explain_level` is `1` the following explanations are produced: importance plot (with permutation method), for decision trees produce tree plots, for linear models save coefficients.
- if `explain_level` is `2` the following explanations are produced: the same as `1` plus SHAP explanations.
If left `auto` AutoML will produce explanations based on the selected `mode`.
golden_features (boolean): Whether to use golden features.
If left `auto` AutoML will use golden features based on the selected `mode`:
- If `mode` is "Explain", `golden_features` = False.
- If `mode` is "Perform", `golden_features` = True.
- If `mode` is "Compete", `golden_features` = True.
features_selection (boolean): Whether to do feature selection.
If left `auto` AutoML will do feature selection based on the selected `mode`:
- If `mode` is "Explain", `features_selection` = False.
- If `mode` is "Perform", `features_selection` = True.
- If `mode` is "Compete", `features_selection` = True.
start_random_models (int): Number of starting random models to try.
If left `auto` AutoML will select it based on the selected `mode`:
- If `mode` is "Explain", `start_random_models` = 1.
- If `mode` is "Perform", `start_random_models` = 5.
- If `mode` is "Compete", `start_random_models` = 10.
hill_climbing_steps (int): Number of steps to perform during hill climbing.
If left `auto` AutoML will select it based on the selected `mode`:
- If `mode` is "Explain", `hill_climbing_steps` = 0.
- If `mode` is "Perform", `hill_climbing_steps` = 2.
- If `mode` is "Compete", `hill_climbing_steps` = 2.
top_models_to_improve (int): Number of best models to improve in `hill_climbing` steps.
If left `auto` AutoML will select it based on the selected `mode`:
- If `mode` is "Explain", `top_models_to_improve` = 0.
- If `mode` is "Perform", `top_models_to_improve` = 2.
- If `mode` is "Compete", `top_models_to_improve` = 3.
verbose (int): Controls the verbosity when fitting and predicting.
Note:
Still not implemented; please leave it set to `1`.
random_state (int): Controls the randomness of the `AutoML`.
Examples:
Binary Classification Example:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> from supervised import AutoML
>>> df = pd.read_csv(
... "https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv",
... skipinitialspace=True
... )
>>> X_train, X_test, y_train, y_test = train_test_split(
... df[df.columns[:-1]], df["income"], test_size=0.25
... )
>>> automl = AutoML()
>>> automl.fit(X_train, y_train)
>>> y_pred_prob = automl.predict_proba(X_test)
>>> print(f"AUROC: {roc_auc_score(y_test, y_pred_prob):.2f}%")
Multi-Class Classification Example:
>>> import pandas as pd
>>> from sklearn.datasets import load_digits
>>> from sklearn.metrics import accuracy_score
>>> from sklearn.model_selection import train_test_split
>>> from supervised import AutoML
>>> digits = load_digits()
>>> X_train, X_test, y_train, y_test = train_test_split(
... digits.data, digits.target, stratify=digits.target, test_size=0.25,
... random_state=123
... )
>>> automl = AutoML(mode="Perform")
>>> automl.fit(X_train, y_train)
>>> y_pred = automl.predict(X_test)
>>> print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}%")
Regression Example:
>>> import pandas as pd
>>> from sklearn.datasets import fetch_california_housing  # load_boston has been removed from scikit-learn
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error
>>> from supervised import AutoML
>>> housing = fetch_california_housing()
>>> X_train, X_test, y_train, y_test = train_test_split(
... pd.DataFrame(housing.data, columns=housing.feature_names),
... housing.target,
... test_size=0.25,
... random_state=123,
... )
>>> automl = AutoML(mode="Compete")
>>> automl.fit(X_train, y_train)
>>> print("Test R^2:", automl.score(X_test, y_test))
Scikit-learn Pipeline Integration Example:
>>> from imblearn.over_sampling import RandomOverSampler
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from supervised import AutoML
>>> X, y = make_classification()
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>> pipeline = make_pipeline(RandomOverSampler(), AutoML())
>>> print(pipeline.fit(X_train, y_train).score(X_test, y_test))
"""
super(AutoML, self).__init__()
# Set user arguments
self.mode = mode
self.ml_task = ml_task
self.results_path = results_path
self.total_time_limit = total_time_limit
self.model_time_limit = model_time_limit
self.algorithms = algorithms
self.train_ensemble = train_ensemble
self.stack_models = stack_models
self.eval_metric = eval_metric
self.validation_strategy = validation_strategy
self.verbose = verbose
self.explain_level = explain_level
self.golden_features = golden_features
self.features_selection = features_selection
self.start_random_models = start_random_models
self.hill_climbing_steps = hill_climbing_steps
self.top_models_to_improve = top_models_to_improve
self.random_state = random_state
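As a usage sketch (not one of the library's own examples), the cross-validation dictionary shown in the docstring above can be passed directly to the constructor:
>>> from supervised import AutoML
>>> automl = AutoML(
...     validation_strategy={
...         "validation_type": "kfold",   # use k-fold cross-validation
...         "k_folds": 5,
...         "shuffle": True,
...         "stratify": True,
...         "random_seed": 123,
...     }
... )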
fit(self, X, y)

Fit the AutoML model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | list or numpy.ndarray or pandas.DataFrame | Training data | required |
y | list or numpy.ndarray or pandas.DataFrame | Training targets | required |

Returns:

Type | Description |
---|---|
AutoML object | Returns `self` |
Source code in supervised/automl.py
def fit(self, X, y):
"""Fit the AutoML model.
Arguments:
X (list or numpy.ndarray or pandas.DataFrame): Training data
y (list or numpy.ndarray or pandas.DataFrame): Training targets
Returns:
AutoML object: Returns `self`
"""
return self._fit(X, y)
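Because fit returns self (see Returns above), calls can be chained. A minimal sketch, assuming X_train, y_train, and X_test are defined as in the class-level examples:
>>> from supervised import AutoML
>>> predictions = AutoML(mode="Explain").fit(X_train, y_train).predict(X_test)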
predict(self, X)

Computes predictions from AutoML best model.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | list or numpy.ndarray or pandas.DataFrame | Input values to make predictions on. | required |

Returns:

Type | Description |
---|---|
numpy.ndarray | One-dimensional array of class labels for classification; one-dimensional array of predictions for regression. |

Exceptions:

Type | Description |
---|---|
AutoMLException | Model has not yet been fitted. |
Source code in supervised/automl.py
def predict(self, X):
"""
Computes predictions from AutoML best model.
Arguments:
X (list or numpy.ndarray or pandas.DataFrame):
Input values to make predictions on.
Returns:
numpy.ndarray:
- One-dimensional array of class labels for classification.
- One-dimensional array of predictions for regression.
Raises:
AutoMLException: Model has not yet been fitted.
"""
return self._predict(X)
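Per the Exceptions section above, calling predict before fitting raises AutoMLException. A sketch of handling it; the import path supervised.exceptions is assumed here, and the exact error message is not reproduced:
>>> from supervised import AutoML
>>> from supervised.exceptions import AutoMLException
>>> try:
...     AutoML().predict(X_test)           # X_test as in the class-level examples
... except AutoMLException as e:
...     print("model is not fitted yet:", e)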
predict_all(self, X)

Computes both class probabilities and class labels for classification tasks. Computes predictions for regression tasks.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | list or numpy.ndarray or pandas.DataFrame | Input values to make predictions on. | required |

Returns:

Type | Description |
---|---|
pandas.DataFrame | DataFrame (n_samples, n_classes + 1) containing both class probabilities and class labels of the input samples for classification tasks; DataFrame with predictions for regression tasks. |

Exceptions:

Type | Description |
---|---|
AutoMLException | Model has not yet been fitted. |
Source code in supervised/automl.py
def predict_all(self, X):
"""
Computes both class probabilities and class labels for classification tasks.
Computes predictions for regression tasks.
Arguments:
X (list or numpy.ndarray or pandas.DataFrame):
Input values to make predictions on.
Returns:
pandas.DataFrame:
DataFrame (n_samples, n_classes + 1) containing both class probabilities and class
labels of the input samples for classification tasks.
DataFrame with predictions for regression tasks.
Raises:
AutoMLException: Model has not yet been fitted.
"""
return self._predict_all(X)
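predict_all is not covered by the class-level examples; a minimal sketch, assuming a fitted classification automl and a feature frame X_test as in the examples above (the exact column names depend on the task and class labels):
>>> all_preds = automl.predict_all(X_test)   # pandas DataFrame
>>> all_preds.head()                         # per-class probability columns plus a label column for classification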
predict_proba(self, X)

Computes class probabilities from AutoML best model. This method can only be used for classification tasks.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | list or numpy.ndarray or pandas.DataFrame | Input values to make predictions on. | required |

Returns:

Type | Description |
---|---|
numpy.ndarray of shape (n_samples, n_classes) | Matrix containing the class probabilities of the input samples. |

Exceptions:

Type | Description |
---|---|
AutoMLException | Model has not yet been fitted. |
Source code in supervised/automl.py
def predict_proba(self, X):
"""
Computes class probabilities from AutoML best model.
This method can only be used for classification tasks.
Arguments:
X (list or numpy.ndarray or pandas.DataFrame):
Input values to make predictions on.
Returns:
numpy.ndarray of shape (n_samples, n_classes):
Matrix containing the class probabilities of the input samples.
Raises:
AutoMLException: Model has not yet been fitted.
"""
return self._predict_proba(X)
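A short sketch of the returned shape, again assuming a fitted classification automl and X_test from the examples above:
>>> proba = automl.predict_proba(X_test)   # numpy.ndarray of shape (n_samples, n_classes)
>>> proba.shape
>>> proba[0].sum()                         # class probabilities in each row sum to (approximately) 1.0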
score(self, X, y=None)

Calculates a goodness of fit for an AutoML instance.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
X | list or numpy.ndarray or pandas.DataFrame | Test values to make predictions on. | required |
y | list or numpy.ndarray or pandas.DataFrame | True labels for X. | None |

Returns:

Type | Description |
---|---|
float | Returns a goodness of fit measure (higher is better). For classification tasks: the mean accuracy on the given test data and labels. For regression tasks: the R^2 (coefficient of determination) on the given test data and labels. |
Source code in supervised/automl.py
def score(self, X, y=None):
"""Calculates a goodness of `fit` for an AutoML instance.
Arguments:
X (list or numpy.ndarray or pandas.DataFrame):
Test values to make predictions on.
y (list or numpy.ndarray or pandas.DataFrame):
True labels for X.
Returns:
float: Returns a goodness of fit measure (higher is better):
- For classification tasks: returns the mean accuracy on the given test data and labels.
- For regression tasks: returns the R^2 (coefficient of determination) on the given test data and labels.
"""
return self._score(X, y)
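A final sketch, assuming the fitted classification automl and the held-out X_test, y_test from the binary classification example above:
>>> acc = automl.score(X_test, y_test)   # mean accuracy for classification, R^2 for regression
>>> print(f"Test score: {acc:.3f}")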