What is AutoML?

Databricks AutoML simplifies the process of applying machine learning to your datasets by automatically finding the best algorithm and hyperparameter configuration for you.

Provide your dataset and specify the type of machine learning problem, then AutoML does the following:

  1. Cleans and prepares your data.
  2. Orchestrates distributed model training and hyperparameter tuning across multiple algorithms.
  3. Finds the best model using open source evaluation algorithms from scikit-learn, XGBoost, LightGBM, Prophet, and ARIMA.
  4. Presents the results. AutoML also generates source code notebooks for each trial, allowing you to review, reproduce, and modify the code as needed.

Get started with AutoML experiments through a low-code UI or the Python API.
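As a rough illustration of the Python API route, the sketch below starts a classification experiment. It assumes the code runs on a Databricks Runtime ML cluster where the databricks.automl module is preinstalled. The helper names build_automl_kwargs and run_classification, the default timeout of 30 minutes, and the column name used in the usage note are illustrative choices and do not come from this page.

```python
def build_automl_kwargs(target_col, timeout_minutes=30, primary_metric=None):
    """Collect keyword arguments for an AutoML run, dropping unset options."""
    kwargs = {"target_col": target_col, "timeout_minutes": timeout_minutes}
    if primary_metric is not None:
        kwargs["primary_metric"] = primary_metric
    return kwargs


def run_classification(dataset, **kwargs):
    """Start an AutoML classification experiment and return its summary.

    The import is deferred because databricks.automl is only available
    on a Databricks Runtime ML cluster.
    """
    from databricks import automl

    return automl.classify(dataset=dataset, **kwargs)
```

On a cluster, a call might look like summary = run_classification(df, **build_automl_kwargs("label")), where df and "label" are placeholders for your own dataset and target column. The returned summary exposes the best trial through summary.best_trial.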

Requirements

  • Databricks Runtime 9.1 LTS ML or above. The general availability (GA) version requires Databricks Runtime 10.4 LTS ML or above.
    • For time series forecasting, Databricks Runtime 10.0 ML or above.
    • With Databricks Runtime 9.1 LTS ML and above, AutoML depends on the databricks-automl-runtime package, which is available on PyPI. The package contains components that are useful outside of AutoML and helps simplify the notebooks generated by AutoML training.
  • Do not install any libraries on the cluster other than those preinstalled in Databricks Runtime for Machine Learning.
    • Any modification (removal, upgrade, or downgrade) to existing library versions results in run failures due to incompatibility.
  • AutoML is incompatible with shared access mode clusters.
  • To use Unity Catalog with AutoML, the cluster access mode must be Single User, and you must be the designated single user of the cluster.
  • To access files in your workspace, you must have network ports 1017 and 1021 open for AutoML experiments. To open these ports or confirm they are open, review your cloud VPN firewall configuration and security group rules or contact your local cloud administrator. For additional information on workspace configuration and deployment, see Create a workspace.
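The library requirement above (no additions, removals, or version changes) can be checked by comparing a snapshot of installed packages against a known-good baseline. The following is a minimal sketch using only the Python standard library. The function names and the idea of storing a baseline snapshot are assumptions for illustration and are not part of AutoML.

```python
from importlib import metadata


def installed_versions():
    """Return {distribution name: version} for the current environment."""
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with missing metadata
            versions[name.lower()] = dist.version
    return versions


def diff_versions(baseline, current):
    """Compare two {name: version} snapshots and report any drift."""
    shared = set(baseline) & set(current)
    return {
        "added": sorted(set(current) - set(baseline)),
        "removed": sorted(set(baseline) - set(current)),
        "changed": sorted(n for n in shared if baseline[n] != current[n]),
    }
```

One way to use this is to call installed_versions() on a freshly started cluster, keep the result as the baseline, and call diff_versions(baseline, installed_versions()) before launching an experiment. Three empty lists suggest the preinstalled libraries are untouched.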

AutoML algorithms

Databricks AutoML trains and evaluates models based on the algorithms in the following table.

Note

For classification and regression models, the decision tree, random forest, logistic regression, and linear regression with stochastic gradient descent algorithms are based on scikit-learn.

| Classification models | Regression models | Forecasting models |
|---|---|---|
| Decision trees | Decision trees | Prophet |
| Random forests | Random forests | Auto-ARIMA (available in Databricks Runtime 10.3 ML and above) |
| Logistic regression | Linear regression with stochastic gradient descent | |
| XGBoost | XGBoost | |
| LightGBM | LightGBM | |

Trial notebook generation

AutoML generates notebooks of the source code behind trials so you can review, reproduce, and modify the code as needed.

For forecasting experiments, AutoML-generated notebooks are automatically imported to your workspace for all trials of your experiment.

For classification and regression experiments, AutoML-generated notebooks for data exploration and the best trial in your experiment are automatically imported to your workspace. Generated notebooks for other experiment trials are saved as MLflow artifacts on DBFS instead of auto-imported into your workspace. For all trials besides the best trial, the notebook_path and notebook_url in the TrialInfo Python API are not set. If you need to use these notebooks, you can manually import them into your workspace with the AutoML experiment UI or the databricks.automl.import_notebook Python API.
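The manual import described above might be scripted as follows. This sketch assumes a Databricks Runtime ML cluster and a summary object returned by an AutoML run. The helper names trials_without_notebook and import_trial_notebook, and the workspace directory in the usage note, are hypothetical.

```python
def trials_without_notebook(trials):
    """Return trials whose notebook was not auto-imported (notebook_path unset)."""
    return [t for t in trials if not getattr(t, "notebook_path", None)]


def import_trial_notebook(trial, workspace_dir):
    """Import one trial's notebook from its MLflow artifacts into workspace_dir.

    The import is deferred because databricks.automl is only available
    on a Databricks Runtime ML cluster.
    """
    from databricks import automl

    return automl.import_notebook(trial.artifact_uri, workspace_dir)
```

For example, looping over trials_without_notebook(summary.trials) and calling import_trial_notebook(trial, "/Users/you@example.com/automl") would import each remaining notebook. The directory is a placeholder, and the returned result carries the imported notebook's path and url.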

If you only need the data exploration notebook or the best trial notebook generated by AutoML, the Source column in the AutoML experiment UI links to the generated notebook for the best trial.

Other generated notebooks are not automatically imported into the workspace. To find them, open each MLflow run from the AutoML experiment UI. The IPython notebook is saved in the Artifacts section of the run page. If your workspace administrators have enabled artifact downloads, you can download this notebook and import it into the workspace.
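Where artifact downloads are permitted, locating the notebook files in a run can also be done programmatically. The sketch below walks a run's artifact tree with the MLflow client and downloads every .ipynb file it finds. The function names are illustrative, and the layout of AutoML artifacts inside a run is not assumed, which is why the code searches recursively instead of using a fixed path.

```python
def notebook_paths(list_artifacts, root=None):
    """Recursively collect artifact paths ending in .ipynb.

    list_artifacts(path) must return objects with .path and .is_dir,
    as MlflowClient.list_artifacts does for a given run.
    """
    found = []
    for info in list_artifacts(root):
        if info.is_dir:
            found.extend(notebook_paths(list_artifacts, info.path))
        elif info.path.endswith(".ipynb"):
            found.append(info.path)
    return found


def download_run_notebooks(run_id, dst_dir):
    """Download every notebook artifact of an MLflow run into dst_dir."""
    from mlflow.tracking import MlflowClient  # deferred: needs MLflow installed

    client = MlflowClient()
    paths = notebook_paths(lambda p: client.list_artifacts(run_id, p))
    return [client.download_artifacts(run_id, p, dst_dir) for p in paths]
```

The downloaded files can then be imported into the workspace through the UI.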

Shapley values (SHAP) for model explainability

Note

For Databricks Runtime 11.1 ML and below, SHAP plots are not generated if the dataset contains a datetime column.

The notebooks produced by AutoML regression and classification runs include code to calculate Shapley values. Shapley values are based on game theory and estimate the importance of each feature to a model’s predictions.
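To make the game-theory idea concrete, here is a small worked example that computes exact Shapley values for a toy scoring function over three features. It is a textbook-style sketch in plain Python, not the code that AutoML notebooks run. The function toy_value and the feature names are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a set function value(frozenset) -> float."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when it joins coalition s.
                total += weight * (value(s | {f}) - value(s))
        result[f] = total
    return result


def toy_value(coalition):
    """Toy payoff: additive effects for a and b plus an a-b interaction."""
    score = 0.0
    if "a" in coalition:
        score += 2.0
    if "b" in coalition:
        score += 1.0
    if {"a", "b"} <= coalition:
        score += 1.0  # interaction term, shared equally by a and b
    return score


phi = shapley_values(["a", "b", "c"], toy_value)
```

Here phi assigns 2.5 to a, 1.5 to b, and 0 to c. The values sum to the payoff of the full feature set (4.0), and the interaction bonus is split evenly between the two features that produce it.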

AutoML notebooks calculate Shapley values using the SHAP package. Because these calculations are highly memory-intensive, they are not performed by default.

To calculate and display Shapley values:

  1. Go to the Feature importance section in an AutoML-generated trial notebook.
  2. Set shap_enabled = True.
  3. Re-run the notebook.

Next steps