Skip to content

Preprocessing

mljar-supervised applies preprocessing automatically as part of training the model pipeline.

This is the default user experience:

  • pass raw tabular data to AutoML.fit(X, y)
  • let AutoML detect what needs to be transformed
  • use the trained model later on raw rows again

In most cases, you do not need to manually encode categoricals, scale numeric columns, or convert text before training.

What is handled automatically

Depending on the data and the selected algorithms, mljar-supervised can automatically handle:

  • missing values
  • categorical features
  • text features
  • datetime features
  • numeric scaling
  • target preprocessing
  • removal of empty or constant columns

The learned preprocessing is stored with the trained model and reused at prediction time.

Missing values

Missing values are handled automatically.

For input features:

  • numeric columns are filled with the median value
  • categorical columns are filled with the most frequent value
  • datetime columns are filled with the most frequent value
  • text columns are filled with a placeholder value

For the target:

  • rows with missing target values are removed during training

This means you usually do not need to fill NaN values yourself before calling fit().

Categorical features

Categorical columns are detected automatically.

mljar-supervised chooses an encoding strategy based on the data and the model requirements. Depending on the situation, it can use:

  • integer encoding
  • one-hot encoding
  • mixed encoding for some categorical setups

You do not need to run pd.get_dummies() before training.

Text features

Text columns are detected automatically and transformed with TF-IDF.

This is useful when your data contains free-text fields such as:

  • names
  • descriptions
  • comments
  • titles

You can see this behavior in the Titanic tutorial, where text from the Name column becomes model features automatically.

Datetime features

Datetime columns are detected automatically and transformed into model-ready features.

You do not need to manually split them into year, month, day, and similar components before training.

Numeric scaling

Some algorithms need scaled numeric inputs, and some do not.

mljar-supervised applies scaling when it is needed by the training pipeline. This is handled automatically, so you do not need to scale numeric columns yourself first.

Target preprocessing

Target values can also be preprocessed automatically.

Examples:

  • classification targets can be converted to numeric labels
  • multiclass targets can be encoded automatically
  • regression targets can be scaled when needed

This is especially useful when the target is not already in the exact numerical form expected by a specific learner.

Empty and constant columns

Columns that contain only missing values or only one unique non-missing value are removed automatically.

This helps avoid training on features that carry no useful information.

What happens at prediction time

The same preprocessing learned during training is applied automatically when you call:

  • predict()
  • predict_proba()
  • score()
  • app generation and app execution flows

You should pass data with the same feature columns, but you do not need to manually repeat the fitted preprocessing steps yourself.

This is an important part of the contract:

  • train on raw tabular data
  • predict on raw tabular data
  • let the stored pipeline handle consistency

What you should still do yourself

Automatic preprocessing does not replace data understanding.

You should still:

  • choose the correct target column
  • remove obvious leakage columns
  • remove identifiers if they should not be used as predictors
  • make sure training data reflects the real prediction scenario
  • keep training and prediction columns aligned

For example, a customer ID or transaction ID might technically be usable by the model, but it is usually not a meaningful predictive feature.

Common questions

Do I need one-hot encoding before fit()?

No. Categorical encoding is handled automatically.

Do I need to fill missing values myself?

Usually no. Missing values are handled automatically for both training and prediction inputs.

Can I pass strings and categoricals directly?

Yes. Raw categorical and text columns are supported.

Can I use the model later on raw rows?

Yes. The fitted preprocessing is stored with the model and reused automatically.

What if new prediction data contains missing values?

mljar-supervised applies the learned preprocessing to new data as well. If new missing values appear, they are handled in the preprocessing pipeline before prediction.

Example

from supervised import AutoML

automl = AutoML(results_path="AutoML")
automl.fit(X, y)

predictions = automl.predict(new_data)

In this example:

  • X can contain missing values
  • X can contain categorical columns
  • X can contain text columns
  • new_data should have matching feature columns
  • preprocessing is applied automatically in both training and prediction