AutoML in Fabric (preview)
AutoML (Automated Machine Learning) is a collection of methods and tools that automate machine learning model training and optimization with little human involvement. The aim of AutoML is to simplify and speed up the process of choosing the best machine learning model and hyperparameters for a given dataset, which usually demands much skill and computing power.
Important
This feature is in preview.
In Fabric, data scientists can use flaml.AutoML to automate their machine learning tasks.
AutoML can help ML professionals and developers from different sectors to:
- Build ML solutions with minimal coding
- Reduce time and cost
- Apply data science best practices
- Solve problems quickly and efficiently
AutoML workflow
flaml.AutoML is a class for AutoML based on the task. It can be used as a Scikit-learn style estimator with the usual fit and predict methods.
To start an AutoML trial, users only need to provide the training data and the task type. With the integrated MLflow experiences in Fabric, users can also examine the different runs that were attempted in the trial to see how the final model was chosen.
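The fit/predict workflow described above can be sketched as a small helper. This is a minimal illustration, not Fabric's own code: the helper name run_automl_trial is hypothetical, and it assumes the object passed in is a flaml.AutoML instance (or anything with compatible fit and predict methods). The task and time_budget keyword arguments are passed through to fit.

```python
def run_automl_trial(automl, X_train, y_train, X_test,
                     task="classification", time_budget=60):
    """Fit an AutoML-style estimator and return its predictions on X_test.

    `automl` is assumed to be a flaml.AutoML instance or any object with
    compatible fit/predict methods. `time_budget` is in seconds.
    """
    # Only the training data and the task type are required to start a trial;
    # the time budget caps how long the search is allowed to run.
    automl.fit(X_train=X_train, y_train=y_train, task=task, time_budget=time_budget)
    # After fitting, the object behaves like a trained Scikit-learn estimator.
    return automl.predict(X_test)
```

In a notebook, this might be called as run_automl_trial(flaml.AutoML(), X_train, y_train, X_test), with the data variables supplied by the caller.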
Training data
In Fabric, users can pass the following input types to the AutoML fit function:
- NumPy array: When the input data is stored in a NumPy array, it's passed to fit() as X_train and y_train.
- Pandas dataframe: When the input data is stored in a Pandas dataframe, it's passed to fit() either as X_train and y_train, or as dataframe and label.
- Pandas on Spark dataframe: When the input data is stored as a Spark dataframe, it can be converted into a Pandas on Spark dataframe using to_pandas_on_spark() and then passed to fit() as a dataframe and label.
from flaml.automl.spark.utils import to_pandas_on_spark
psdf = to_pandas_on_spark(sdf)
automl.fit(dataframe=psdf, label='Bankrupt?', isUnbalance=True, **settings)
Machine learning problem
Users can specify the machine learning task using the task argument. There are various supported machine learning tasks, including:
- Classification: The main goal of classification models is to predict which categories new data fall into based on learnings from its training data. Common classification examples include fraud detection, handwriting recognition, and object detection.
- Regression: Regression models predict numerical output values based on independent predictors. In regression, the objective is to help establish the relationship among those independent predictor variables by estimating how one variable impacts the others. For example, a regression model can predict automobile prices from features such as gas mileage and safety rating.
- Time Series Forecasting: This is used to predict future values based on historical data points ordered by time. In a time series, data is collected and recorded at regular intervals over a specific period, such as daily, weekly, monthly, or yearly. The objective of time series forecasting is to identify patterns, trends, and seasonality in the data and then use this information to make predictions about future values.
To learn more about the other tasks supported in FLAML, you can visit the documentation on AutoML tasks in FLAML.
Optional inputs
Provide various constraints and inputs to configure your AutoML trial.
Constraints
When creating an AutoML trial, users can also configure constraints on the AutoML process, constructor arguments of potential estimators, types of models tried in AutoML, and even constraints on the metrics of the AutoML trial.
For example, the code below allows users to specify a metrics constraint on the AutoML trial.
metric_constraints = [("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)]
automl.fit(X_train, y_train, max_iter=100, train_time_limit=1, metric_constraints=metric_constraints)
To learn more about these configurations, you can visit the documentation on configurations in FLAML.
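Each metric constraint above is a (metric name, comparison sign, threshold) tuple. The following sketch illustrates how such tuples can be read. It is not FLAML's internal implementation, and the helper name satisfies_constraints is hypothetical. It only shows that a trial passes when every named metric meets its threshold.

```python
import operator

# Map the comparison signs used in the constraint tuples to Python operators.
_COMPARATORS = {"<=": operator.le, ">=": operator.ge}


def satisfies_constraints(metrics, constraints):
    """Return True if every (name, sign, threshold) constraint holds for `metrics`."""
    return all(
        _COMPARATORS[sign](metrics[name], threshold)
        for name, sign, threshold in constraints
    )


metric_constraints = [("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)]

# A trial whose losses are both within the thresholds passes.
passes = satisfies_constraints({"train_loss": 0.05, "val_loss": 0.08}, metric_constraints)
# A trial with a validation loss above 0.1 violates the second constraint.
violates = satisfies_constraints({"train_loss": 0.05, "val_loss": 0.20}, metric_constraints)
print(passes, violates)
```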
Optimization metric
During training, AutoML creates many trials, each of which tries a different combination of ML algorithm and hyperparameters. Each trial produces a model with a training score. The better the score for the metric you want to optimize, the better the model is considered to "fit" your data. The optimization metric is specified via the metric argument. It can be either a string, which refers to a built-in metric, or a user-defined function.
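As a sketch of the user-defined option, the function below follows the custom metric convention described in the FLAML documentation: it receives the validation and training data together with the fitted estimator, and returns the value to minimize plus a dictionary of metrics to log. The function name, the penalty weight alpha, and the ConstantModel stand-in are assumptions made for illustration. Check the FLAML documentation for the exact signature expected by your version.

```python
def mae_with_gap_penalty(
    X_val, y_val, estimator, labels, X_train, y_train,
    weight_val=None, weight_train=None, *args,
):
    """Hypothetical custom metric: validation MAE plus a penalty for overfitting.

    Returns (value_to_minimize, metrics_to_log).
    """

    def mean_absolute_error(y_true, y_pred):
        errors = [abs(float(t) - float(p)) for t, p in zip(y_true, y_pred)]
        return sum(errors) / len(errors)

    val_loss = mean_absolute_error(y_val, estimator.predict(X_val))
    train_loss = mean_absolute_error(y_train, estimator.predict(X_train))
    alpha = 0.5  # assumed penalty weight, chosen only for illustration
    gap = max(val_loss - train_loss, 0.0)  # penalize only when validation is worse
    return val_loss + alpha * gap, {"val_loss": val_loss, "train_loss": train_loss}


class ConstantModel:
    """Stand-in estimator used only to try the metric outside of AutoML."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


objective, logged = mae_with_gap_penalty(
    X_val=[[0], [1]], y_val=[1.0, 3.0], estimator=ConstantModel(2.0), labels=None,
    X_train=[[0], [1]], y_train=[2.0, 2.0],
)
print(objective, logged)
```

Under these assumptions, the function would be passed to a trial as automl.fit(X_train, y_train, metric=mae_with_gap_penalty, task="regression").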
Parallel tuning
In some cases, you might want to expedite your AutoML trial by using Apache Spark to parallelize your training. For Spark clusters, by default, FLAML launches one trial per executor. You can also customize the number of concurrent trials by using the n_concurrent_trials argument.
automl.fit(X_train, y_train, n_concurrent_trials=4, use_spark=True)
To learn more about how to parallelize your AutoML trials, you can visit the FLAML documentation for parallel Spark jobs.
Track with MLflow
You can also use the Fabric MLflow integration to capture the metrics and parameters of the explored trials.
import flaml
import mlflow

mlflow.autolog()

with mlflow.start_run(nested=True):
    automl.fit(dataframe=pandas_df, label='Bankrupt?', mlflow_exp_name="automl_spark_demo")

# You can also provide a run_name prefix for the child runs
automl_experiment = flaml.AutoML()
automl_settings = {
    "metric": "r2",
    "task": "regression",
    "use_spark": True,
    "mlflow_exp_name": "test_doc",
    "estimator_list": [
        "lgbm",
        "rf",
        "xgboost",
        "extra_tree",
        "xgb_limitdepth",
    ],  # catboost does not yet support mlflow autologging
}
with mlflow.start_run(run_name="automl_spark_trials"):
    automl_experiment.fit(X_train=train_x, y_train=train_y, **automl_settings)
Supported models
AutoML in Fabric supports the following models:
Classification | Regression | Time Series Forecasting |
---|---|---|
(PySpark) Gradient-Boosted Trees (GBT) Classifier | (PySpark) Accelerated Failure Time (AFT) Survival Regression | Arimax |
(PySpark) Linear SVM | (PySpark) Generalized Linear Regression | AutoARIMA |
(PySpark) Naive Bayes | (PySpark) Gradient-Boosted Trees (GBT) Regression | Average |
(Synapse) LightGBM | (PySpark) Linear Regression | CatBoost |
CatBoost | (Synapse) LightGBM | Decision Tree |
Decision Tree | CatBoost | ExponentialSmoothing |
Extremely Randomized Trees | Decision Tree | Extremely Randomized Trees |
Gradient Boosting | Elastic Net | ForecastTCN |
K Nearest Neighbors | Extremely Randomized Trees | Gradient Boosting |
Light GBM | Gradient Boosting | Holt-Winters Exponential Smoothing |
Linear SVC | K Nearest Neighbors | K Nearest Neighbors |
Logistic Regression | LARS Lasso | LARS Lasso |
Logistic Regression with L1/L2 Regularization | Light GBM | Light GBM |
Naive Bayes | Logistic Regression with L1/L2 Regularization | Naive |
Random Forest | Random Forest | Orbit |
Random Forest on Spark | Random Forest on Spark | Prophet |
Stochastic Gradient Descent (SGD) | Stochastic Gradient Descent (SGD) | Random Forest |
Support Vector Classification (SVC) | XGBoost | SARIMAX |
XGBoost | XGBoost with Limited Depth | SeasonalAverage |
| XGBoost with Limited Depth | | SeasonalNaive |
| | | Temporal Fusion Transformer |
| | | XGBoost |
| | | XGBoost for Time Series |
| | | XGBoost with Limited Depth for Time Series |
| | | ElasticNet |
Visualize results
The flaml.visualization module provides utility functions for plotting the optimization process using Plotly. By leveraging Plotly, users can interactively explore their AutoML experiment results. To use these plotting functions, provide your optimized flaml.AutoML or flaml.tune.tune.ExperimentAnalysis object as an input.
You can use the following functions within your notebook:
- plot_optimization_history: Plot optimization history of all trials in the experiment.
- plot_feature_importance: Plot importance for each feature in the dataset.
- plot_parallel_coordinate: Plot the high-dimensional parameter relationships in the experiment.
- plot_contour: Plot the parameter relationship as contour plot in the experiment.
- plot_edf: Plot the objective value EDF (empirical distribution function) of the experiment.
- plot_timeline: Plot the timeline of the experiment.
- plot_slice: Plot the parameter relationship as slice plot in a study.
- plot_param_importance: Plot the hyperparameter importance of the experiment.