AutoML in Fabric (preview)

AutoML (Automated Machine Learning) is a collection of methods and tools that automate machine learning model training and optimization with little human involvement. The aim of AutoML is to simplify and speed up the process of choosing the best-performing model and hyperparameters for a given dataset, a task that otherwise demands significant skill and computing power.

Important

This feature is in preview.

In Fabric, data scientists can use flaml.AutoML to automate their machine learning tasks.

AutoML can help ML professionals and developers from different sectors to:

  • Build ML solutions with minimal coding
  • Reduce time and cost
  • Apply data science best practices
  • Solve problems quickly and efficiently

AutoML workflow

flaml.AutoML is a class for performing AutoML based on the task type. It can be used as a scikit-learn style estimator with the usual fit and predict methods.

To start an AutoML trial, users only need to provide the training data and the task type. With the integrated MLflow experiences in Fabric, users can also examine the different runs that were attempted in the trial to see how the final model was chosen.
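
For example, a minimal trial looks like the following sketch (X_train and y_train are placeholders for your own prepared features and labels):

from flaml import AutoML

automl = AutoML()
# Run a short classification trial with a 60-second time budget
automl.fit(X_train, y_train, task="classification", time_budget=60)
print(automl.best_estimator)  # name of the best learner found
print(automl.best_config)     # its tuned hyperparameters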

Training data

In Fabric, users can pass the following input types to the AutoML fit function:

  • NumPy array: When the input data is stored in a NumPy array, it's passed to fit() as X_train and y_train.

  • Pandas DataFrame: When the input data is stored in a pandas DataFrame, it's passed to fit() either as X_train and y_train, or as dataframe and label (see the sketch after this list).

  • Pandas-on-Spark DataFrame: When the input data is stored in a Spark DataFrame, it can be converted into a pandas-on-Spark DataFrame with to_pandas_on_spark() and then passed to fit() as dataframe and label.

    from flaml import AutoML
    from flaml.automl.spark.utils import to_pandas_on_spark

    automl = AutoML()
    psdf = to_pandas_on_spark(sdf)  # sdf is an existing Spark DataFrame
    automl.fit(dataframe=psdf, label="Bankrupt?", isUnbalance=True, **settings)
    
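As referenced above, the two pandas call styles look like the following sketch (X, y, and df are hypothetical placeholders for your own data):

# Style 1: separate features and labels
automl.fit(X_train=X, y_train=y, task="classification", time_budget=60)

# Style 2: one DataFrame plus the name of the label column
automl.fit(dataframe=df, label="Bankrupt?", task="classification", time_budget=60)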

Machine learning problem

Users can specify the machine learning task using the task argument. There are various supported machine learning tasks, including:

  • Classification: The main goal of classification models is to predict which categories new data falls into, based on what was learned from the training data. Common classification examples include fraud detection, handwriting recognition, and object detection.
  • Regression: Regression models predict numerical output values based on independent predictors. In regression, the objective is to establish the relationship among the predictor variables by estimating how one variable affects the others. For example, predicting automobile prices based on features like gas mileage and safety rating.
  • Time Series Forecasting: This is used to predict future values based on historical data points ordered by time. In a time series, data is collected and recorded at regular intervals over a specific period, such as daily, weekly, monthly, or yearly. The objective of time series forecasting is to identify patterns, trends, and seasonality in the data, and then use this information to make predictions about future values.

To learn more about the other tasks supported in FLAML, you can visit the documentation on AutoML tasks in FLAML.
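
For instance, a forecasting trial could be configured as in the following sketch (train_df and the "sales" label are hypothetical; FLAML takes the forecast horizon through the period argument):

from flaml import AutoML

automl = AutoML()
# Forecast 12 future periods of the "sales" column; the DataFrame should
# contain a time column and be ordered by time.
automl.fit(dataframe=train_df, label="sales", task="ts_forecast", period=12, time_budget=120)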

Optional inputs

Provide various constraints and inputs to configure your AutoML trial.

Constraints

When creating an AutoML trial, users can configure constraints on the AutoML process itself, on the constructor arguments of candidate estimators, on the types of models tried, and even on the metrics of the trial.
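
As an illustration, the following sketch (the setting names are FLAML configuration options; the values are arbitrary) caps the total tuning time, restricts the candidate learners, and passes a constructor argument to one of them:

settings = {
    "time_budget": 120,                # total tuning time, in seconds
    "estimator_list": ["lgbm", "rf"],  # only try LightGBM and random forest
    "fit_kwargs_by_estimator": {       # constructor arguments per estimator
        "lgbm": {"verbose": -1},
    },
}
automl.fit(X_train, y_train, task="classification", **settings)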

The code below specifies a metric constraint on the AutoML trial.

# Require both the training loss and the validation loss to stay at or below 0.1
metric_constraints = [("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)]
automl.fit(
    X_train,
    y_train,
    max_iter=100,        # try at most 100 configurations
    train_time_limit=1,  # cap each model's training time, in seconds
    metric_constraints=metric_constraints,
)

To learn more about these configurations, you can visit the documentation on configurations in FLAML.

Optimization metric

During training, the AutoML function runs many trials, each of which applies a different algorithm and hyperparameter configuration and produces a model with a training score. The better the score for the metric you want to optimize, the better the model is considered to "fit" your data. The optimization metric is specified via the metric argument. It can be either a string, which refers to a built-in metric, or a user-defined function.

[Figure: AutoML optimization metrics]
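
A user-defined metric is a function that returns the value to minimize together with a dictionary of metrics to log. The sketch below follows the custom-metric signature described in the FLAML documentation; the 0.5 penalty weight is an arbitrary choice for illustration:

def custom_metric(
    X_val, y_val, estimator, labels,
    X_train, y_train, weight_val=None, weight_train=None, *args,
):
    from sklearn.metrics import log_loss

    y_pred = estimator.predict_proba(X_val)
    val_loss = log_loss(y_val, y_pred, labels=labels, sample_weight=weight_val)
    y_pred = estimator.predict_proba(X_train)
    train_loss = log_loss(y_train, y_pred, labels=labels, sample_weight=weight_train)
    alpha = 0.5  # penalize the gap between validation and training loss
    # Return the value to minimize and the metrics to log
    return val_loss * (1 + alpha) - alpha * train_loss, {
        "val_loss": val_loss,
        "train_loss": train_loss,
    }

automl.fit(X_train, y_train, task="classification", metric=custom_metric, time_budget=60)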

Parallel tuning

In some cases, you might want to expedite your AutoML trial by using Apache Spark to parallelize your training. For Spark clusters, by default, FLAML launches one trial per executor. You can also customize the number of concurrent trials by using the n_concurrent_trials argument.

# Run up to four trials at the same time, each as a Spark job
automl.fit(X_train, y_train, n_concurrent_trials=4, use_spark=True)

To learn more about how to parallelize your AutoML trials, you can visit the FLAML documentation for parallel Spark jobs.

Track with MLflow

You can also use the Fabric MLflow integration to capture the metrics, parameters, and models of the explored trials.

import mlflow
import flaml

mlflow.autolog()

with mlflow.start_run(nested=True):
    automl.fit(dataframe=pandas_df, label="Bankrupt?", mlflow_exp_name="automl_spark_demo")

# You can also provide a run_name prefix for the child runs

automl_experiment = flaml.AutoML()
automl_settings = {
    "metric": "r2",
    "task": "regression",
    "use_spark": True,
    "mlflow_exp_name": "test_doc",
    "estimator_list": [
        "lgbm",
        "rf",
        "xgboost",
        "extra_tree",
        "xgb_limitdepth",
    ],  # catboost does not yet support mlflow autologging
}
with mlflow.start_run(run_name="automl_spark_trials"):
    automl_experiment.fit(X_train=train_x, y_train=train_y, **automl_settings)

Supported models

AutoML in Fabric supports the following models:

Classification

  • (PySpark) Gradient-Boosted Trees (GBT) Classifier
  • (PySpark) Linear SVM
  • (PySpark) Naive Bayes
  • (Synapse) LightGBM
  • CatBoost
  • Decision Tree
  • Extremely Randomized Trees
  • Gradient Boosting
  • K Nearest Neighbors
  • Light GBM
  • Linear SVC
  • Logistic Regression
  • Logistic Regression with L1/L2 Regularization
  • Naive Bayes
  • Random Forest
  • Random Forest on Spark
  • Stochastic Gradient Descent (SGD)
  • Support Vector Classification (SVC)
  • XGBoost
  • XGBoost with Limited Depth

Regression

  • (PySpark) Accelerated Failure Time (AFT) Survival Regression
  • (PySpark) Generalized Linear Regression
  • (PySpark) Gradient-Boosted Trees (GBT) Regression
  • (PySpark) Linear Regression
  • (Synapse) LightGBM
  • CatBoost
  • Decision Tree
  • Elastic Net
  • Extremely Randomized Trees
  • Gradient Boosting
  • K Nearest Neighbors
  • LARS Lasso
  • Light GBM
  • Logistic Regression with L1/L2 Regularization
  • Random Forest
  • Random Forest on Spark
  • Stochastic Gradient Descent (SGD)
  • XGBoost
  • XGBoost with Limited Depth

Time Series Forecasting

  • Arimax
  • AutoARIMA
  • Average
  • CatBoost
  • Decision Tree
  • ExponentialSmoothing
  • Extremely Randomized Trees
  • ForecastTCN
  • Gradient Boosting
  • Holt-Winters Exponential Smoothing
  • K Nearest Neighbors
  • LARS Lasso
  • Light GBM
  • Naive
  • Orbit
  • Prophet
  • Random Forest
  • SARIMAX
  • SeasonalAverage
  • SeasonalNaive
  • Temporal Fusion Transformer
  • XGBoost
  • XGBoost for Time Series
  • XGBoost with Limited Depth for Time Series
  • ElasticNet

Visualize results

The flaml.visualization module provides utility functions for plotting the optimization process using Plotly. By leveraging Plotly, users can interactively explore their AutoML experiment results. To use these plotting functions, provide your optimized flaml.AutoML or flaml.tune.tune.ExperimentAnalysis object as an input.

You can use the following functions within your notebook:

  • plot_optimization_history: Plot optimization history of all trials in the experiment.
  • plot_feature_importance: Plot importance for each feature in the dataset.
  • plot_parallel_coordinate: Plot the high-dimensional parameter relationships in the experiment.
  • plot_contour: Plot the parameter relationship as contour plot in the experiment.
  • plot_edf: Plot the objective value EDF (empirical distribution function) of the experiment.
  • plot_timeline: Plot the timeline of the experiment.
  • plot_slice: Plot the parameter relationship as slice plot in a study.
  • plot_param_importance: Plot the hyperparameter importance of the experiment.
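
For example, a minimal sketch (assuming automl is a fitted flaml.AutoML object, as in the earlier examples):

import flaml.visualization as fviz

# Each plot_* function returns a Plotly figure that renders inline in the notebook
fig = fviz.plot_optimization_history(automl)
fig.show()

fig = fviz.plot_feature_importance(automl)
fig.show()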