Configure a Delta Live Tables pipeline

10/14/2024

This article describes the basic configuration for Delta Live Tables pipelines using the workspace UI.

Databricks recommends developing new pipelines using serverless. For configuration instructions for serverless pipelines, see Configure a serverless Delta Live Tables pipeline.

The configuration instructions in this article use Unity Catalog. For instructions for configuring pipelines with legacy Hive metastore, see Use Delta Live Tables pipelines with legacy Hive metastore.

Note

The UI has an option to display and edit settings in JSON. You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.

JSON configuration files are also helpful when deploying pipelines to new environments or using the CLI or REST API.

For a complete reference to the Delta Live Tables JSON configuration settings, see Delta Live Tables pipeline configurations.

Configure a new Delta Live Tables pipeline

To configure a new Delta Live Tables pipeline, do the following:

1. Click Delta Live Tables in the sidebar.
2. Click Create Pipeline.
3. Provide a unique Pipeline name.
4. Use the file picker to configure notebooks and workspace files as Source code.
   - You must add at least one source code asset.
   - Use the Add source code button to add additional source code assets.
5. Select a Catalog to publish data.
6. Select a Schema in the catalog. All streaming tables and materialized views defined in the pipeline are created in this schema.
7. In the Compute section, check the box next to Use Photon Acceleration. For additional compute configuration considerations, see Compute configuration options.
8. Click Create.

These recommended settings create a pipeline that runs in Triggered mode and uses the Current channel. This configuration suits many use cases, including development and testing, and is well suited to production workloads that run on a schedule. For details on scheduling pipelines, see Delta Live Tables pipeline task for jobs.
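
As a rough illustration, the steps above might correspond to a JSON specification like the following sketch. All values in angle brackets are placeholders. The field names (`catalog`, `target` for the schema, `photon`, `continuous`, `channel`) are assumptions based on the pipeline settings format, so confirm them against Delta Live Tables pipeline configurations before relying on them.

```json
{
  "name": "<pipeline-name>",
  "catalog": "<catalog-name>",
  "target": "<schema-name>",
  "libraries": [
    { "notebook": { "path": "<path-to-notebook>" } }
  ],
  "photon": true,
  "continuous": false,
  "channel": "CURRENT"
}
```

In this sketch, `"continuous": false` stands for Triggered mode and `"channel": "CURRENT"` for the Current channel.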

Compute configuration options

Databricks recommends always using Enhanced autoscaling. Default values for other compute configurations work well for many pipelines.

Serverless pipelines remove compute configuration options. For configuration instructions for serverless pipelines, see Configure a serverless Delta Live Tables pipeline.

Use the following settings to customize compute configurations:

Workspace admins can configure a Cluster policy. Cluster policies let admins control which compute options are available to users. See Select a cluster policy.
You can optionally configure Cluster mode to run with Fixed size or Legacy autoscaling. See Optimize the cluster utilization of Delta Live Tables pipelines with enhanced autoscaling.
For workloads with autoscaling enabled, set Min workers and Max workers to bound scaling behavior. See Configure compute for a Delta Live Tables pipeline.
You can optionally turn off Photon acceleration. See What is Photon?.
Use Cluster tags to help monitor costs associated with Delta Live Tables pipelines. See Configure cluster tags.
Configure Instance types to specify the type of virtual machines used to run your pipeline. See Select instance types to run a pipeline.
- Select a Worker type optimized for the workloads configured in your pipeline.
- You can optionally select a Driver type that differs from your worker type. A smaller driver can reduce costs in pipelines with large worker types and low driver compute utilization. A larger driver can help avoid out-of-memory issues in workloads with many small workers.
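
The compute settings above can also be expressed in the JSON view. The following is a minimal sketch, assuming a single `default` cluster entry. The policy ID, node types, tag name, and worker counts are hypothetical placeholders, and the field names should be checked against the pipeline configuration reference.

```json
{
  "clusters": [
    {
      "label": "default",
      "policy_id": "<policy-id>",
      "node_type_id": "<worker-node-type>",
      "driver_node_type_id": "<driver-node-type>",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      },
      "custom_tags": {
        "cost-center": "<team-name>"
      }
    }
  ],
  "photon": true
}
```

Here `min_workers` and `max_workers` mirror the Min workers and Max workers fields, and `custom_tags` mirrors Cluster tags. A fixed-size cluster would presumably replace the `autoscale` object with a fixed worker count.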

Other configuration considerations

The following configuration options are also available for pipelines:

The Advanced product edition gives you access to all Delta Live Tables features. You can optionally run pipelines using the Pro or Core product editions. See Choose a product edition.
You might choose to use the Continuous pipeline mode when running pipelines in production. See Triggered vs. continuous pipeline mode.
If your workspace is not configured for Unity Catalog or your workload needs to use legacy Hive metastore, see Use Delta Live Tables pipelines with legacy Hive metastore.
Add Notifications for email updates based on success or failure conditions. See Add email notifications for pipeline events.
Use the Configuration field to set key-value pairs for the pipeline. These configurations serve two purposes:
- Set arbitrary parameters you can reference in your source code. See Use parameters with Delta Live Tables pipelines.
- Configure pipeline settings and Spark configurations. See Delta Live Tables properties reference.
Use the Preview channel to test your pipeline against pending Delta Live Tables runtime changes and to try new features.
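
Several of the options listed above map to top-level fields in the JSON specification. The fragment below is a sketch under stated assumptions: the email address and the `mypipeline.start_date` parameter are hypothetical, and the alert names shown are two examples rather than a complete list.

```json
{
  "continuous": true,
  "channel": "PREVIEW",
  "notifications": [
    {
      "email_recipients": ["<user>@example.com"],
      "alerts": ["on-update-success", "on-update-failure"]
    }
  ],
  "configuration": {
    "mypipeline.start_date": "2024-01-01"
  }
}
```

The `configuration` object corresponds to the Configuration field in the UI. Each entry is a key-value pair that pipeline source code can reference as a parameter.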

Choose a product edition

Select the Delta Live Tables product edition whose features best match your pipeline requirements. The following product editions are available:

Core to run streaming ingest workloads. Select the Core edition if your pipeline doesn’t require advanced features such as change data capture (CDC) or Delta Live Tables expectations.
Pro to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data.
Advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions and includes data quality constraints with Delta Live Tables expectations.

You can select the product edition when you create or edit a pipeline. You can choose a different edition for each pipeline. See the Delta Live Tables product page.

Note: If your pipeline includes features not supported by the selected product edition, such as expectations, you receive an error message that explains the cause. You can then edit the pipeline to select the appropriate edition.
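
In the JSON view, the product edition is expected to appear as a single field. The sketch below assumes the field is named `edition` and accepts the values `CORE`, `PRO`, and `ADVANCED`. The pipeline name is a placeholder.

```json
{
  "name": "<pipeline-name>",
  "edition": "PRO"
}
```

With this setting, a pipeline that defines expectations would be expected to fail with an edition error until the value is changed to `ADVANCED`.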

Configure source code

You can use the file selector in the Delta Live Tables UI to configure the source code defining your pipeline. Pipeline source code is defined in Databricks notebooks or in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more notebooks, workspace files, or a combination of both.

Because Delta Live Tables automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code assets in any order.

You can edit the pipeline's JSON settings to include Delta Live Tables source code defined in SQL and Python scripts stored in workspace files. The following example includes both notebooks and workspace files:

{
  "name": "Example pipeline 3",
  "storage": "dbfs:/pipeline-examples/storage-location/example3",
  "libraries": [
    { "notebook": { "path": "/example-notebook_1" } },
    { "notebook": { "path": "/example-notebook_2" } },
    { "file": { "path": "/Workspace/Users/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.sql" } },
    { "file": { "path": "/Workspace/Users/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.py" } }
  ]
}

Manage external dependencies for pipelines that use Python

Delta Live Tables supports using external dependencies in your pipelines, such as Python packages and libraries. To learn about options and recommendations for using dependencies, see Manage Python dependencies for Delta Live Tables pipelines.

Use Python modules stored in your Azure Databricks workspace

In addition to implementing your Python code in Databricks notebooks, you can use Databricks Git folders or workspace files to store your code as Python modules. Storing your code as Python modules is especially useful when you have common functionality that you want to reuse across multiple pipelines or across notebooks in the same pipeline. To learn how to use Python modules with your pipelines, see Import Python modules from Git folders or workspace files.
