Azure Databricks concepts

06/27/2024

This article introduces fundamental concepts you need to understand in order to use Azure Databricks effectively.

Accounts and workspaces

In Azure Databricks, a workspace is an Azure Databricks deployment in the cloud that functions as an environment for your team to access Databricks assets. Your organization can choose to have either multiple workspaces or just one, depending on its needs.

An Azure Databricks account represents a single entity that can include multiple workspaces. Accounts enabled for Unity Catalog can be used to manage users and their access to data centrally across all of the workspaces in the account.

Billing: Databricks units (DBUs)

Azure Databricks bills based on Databricks units (DBUs), which are units of processing capability per hour based on VM instance type.

See the Azure Databricks pricing page.
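
As a rough illustration of how DBU-based billing adds up, the following sketch multiplies a DBU consumption rate by the runtime, the node count, and a price per DBU. The function name and every number in it (0.75 DBU per hour, 0.40 per DBU, 4 nodes) are hypothetical values chosen only to show the arithmetic; actual rates depend on the VM instance type and are listed on the pricing page.

```python
def estimate_dbu_cost(dbu_per_hour: float, hours: float,
                      price_per_dbu: float, nodes: int = 1) -> float:
    """Estimate the DBU charge for a workload.

    All inputs are hypothetical; real DBU rates depend on VM instance type.
    """
    if min(dbu_per_hour, hours, price_per_dbu) < 0 or nodes < 1:
        raise ValueError("rates and hours must be non-negative, nodes >= 1")
    # DBUs are units of processing capability per hour, so consumption
    # scales with both the runtime and the number of nodes.
    dbus_consumed = dbu_per_hour * hours * nodes
    return round(dbus_consumed * price_per_dbu, 2)


# Hypothetical example: 4 nodes at 0.75 DBU/hour each, running for 2 hours.
cost = estimate_dbu_cost(dbu_per_hour=0.75, hours=2.0, price_per_dbu=0.40, nodes=4)
```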

Authentication and authorization

This section describes concepts that you need to know when you manage Azure Databricks identities and their access to Azure Databricks assets.

User

A unique individual who has access to the system. User identities are represented by email addresses. See Manage users.

Service principal

A service identity for use with jobs, automated tools, and systems such as scripts, apps, and CI/CD platforms. Service principals are represented by an application ID. See Manage service principals.

Group

A collection of identities. Groups simplify identity management, making it easier to assign access to workspaces, data, and other securable objects. All Databricks identities can be assigned as members of groups. See Manage groups.

Access control list (ACL)

A list of permissions attached to a workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to an object, as well as which operations are allowed on it. Each entry in a typical ACL specifies a subject and an operation. See Access control lists.
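
The subject-and-operation structure of an ACL entry can be sketched as a small data model. This is a conceptual illustration only, not the Azure Databricks permissions API: the class name, the helper function, the subjects, and the operation labels ("manage", "view") are all hypothetical, and the check is an exact match with no permission hierarchy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    subject: str    # who is granted access (hypothetical user or group name)
    operation: str  # what they may do (hypothetical operation label)


def is_allowed(acl: list, subject: str, operation: str) -> bool:
    """Return True if the ACL contains an entry for this exact subject and operation."""
    return AclEntry(subject, operation) in acl


# Hypothetical ACL attached to a single job.
job_acl = [
    AclEntry("alice@example.com", "manage"),
    AclEntry("data-engineers", "view"),
]

can_manage = is_allowed(job_acl, "alice@example.com", "manage")
can_view = is_allowed(job_acl, "bob@example.com", "view")
```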

Personal access token (PAT)

A personal access token is a string used to authenticate REST API calls, technology partner connections, and other tools. See Azure Databricks personal access token authentication.

Microsoft Entra ID (formerly Azure Active Directory) tokens can also be used to authenticate to the REST API.
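
A minimal sketch of how a token is attached to a REST API call: the token travels in an `Authorization: Bearer` header. The workspace URL, the token value, and the helper name below are placeholders, and the endpoint path is used only as an example. The code builds the request object without sending it.

```python
import urllib.request


def build_authenticated_request(workspace_url: str, path: str,
                                token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries a bearer token."""
    url = workspace_url.rstrip("/") + path
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values: substitute your own workspace URL and token.
request = build_authenticated_request(
    "https://example.azuredatabricks.net",
    "/api/2.0/clusters/list",
    "dapi-EXAMPLE-TOKEN",
)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the token out of source code, for example by reading it from an environment variable, is generally preferable to hard-coding it as shown here.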

Azure Databricks interfaces

This section describes the interfaces for accessing your assets in Azure Databricks.

UI

The Azure Databricks UI is a graphical interface for interacting with features, such as workspace folders and their contained objects, data objects, and computational resources.

REST API

The Databricks REST API provides endpoints for modifying or requesting information about Azure Databricks account and workspace objects. See account reference and workspace reference.

SQL REST API

The SQL REST API allows you to automate tasks on SQL objects. See SQL API.

CLI

The Databricks CLI is hosted on GitHub. The CLI is built on top of the Databricks REST API.

Data management

This section describes the logical objects that store the data you feed into machine learning algorithms and on which you perform analytics. It also describes the in-platform UI for exploring and managing data objects.

Unity Catalog

Unity Catalog is a unified governance solution for data and AI assets on Azure Databricks that provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces. See What is Unity Catalog?.

DBFS root

Important

Storing and accessing data using DBFS root or DBFS mounts is a deprecated pattern and not recommended by Databricks. Instead, Databricks recommends using Unity Catalog to manage access to all data. See What is Unity Catalog?.

The DBFS root is a storage location available to all users by default. See What is DBFS?.

Catalog Explorer

Catalog Explorer allows you to explore and manage data and AI assets, including schemas (databases), tables, volumes (non-tabular data), functions, and registered ML models. You can use it to find data objects and owners, understand data relationships across tables, and manage permissions and sharing. See What is Catalog Explorer?.

Database

A collection of data objects, such as tables, views, and functions, that is organized so that it can be easily accessed, managed, and updated. See What are schemas in Azure Databricks?.

Table

A representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs. See What is a table?.

Delta table

By default, all tables created in Azure Databricks are Delta tables. Delta tables are based on the Delta Lake open source project, a framework for high-performance ACID table storage over cloud object stores. A Delta table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema.

Find out more about technologies branded as Delta.

Metastore

The component that stores all of the structure information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. See Metastores.

Every Azure Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. You also have the option to use an existing external Hive metastore.

Computation management

This section describes concepts that you need to know to run computations in Azure Databricks.

Cluster

A set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job. See Compute.

You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
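
The lifecycle difference between the two cluster types can be sketched as a toy model. The class and method names are hypothetical and do not correspond to any Azure Databricks API; the sketch only encodes the rule that an all-purpose cluster can be terminated and restarted, while a job cluster cannot be restarted.

```python
from enum import Enum


class ClusterType(Enum):
    ALL_PURPOSE = "all-purpose"
    JOB = "job"


class Cluster:
    """Toy model of a cluster's running state (hypothetical, not a real API)."""

    def __init__(self, cluster_type: ClusterType) -> None:
        self.cluster_type = cluster_type
        self.running = True

    def terminate(self) -> None:
        self.running = False

    def restart(self) -> None:
        # Only all-purpose clusters can be restarted after termination.
        if self.cluster_type is ClusterType.JOB:
            raise RuntimeError("a job cluster cannot be restarted")
        self.running = True


shared = Cluster(ClusterType.ALL_PURPOSE)
shared.terminate()
shared.restart()          # allowed: all-purpose clusters can be restarted

job_cluster = Cluster(ClusterType.JOB)
job_cluster.terminate()   # the job scheduler does this when the job completes
```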

Pool

A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. See Pool configuration reference.

If the pool does not have sufficient idle resources to accommodate the cluster’s request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
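
The allocation behavior described above can be sketched as a small simulation. The class and its attribute names are hypothetical, and the model ignores any capacity limits or idle timeouts a real pool may enforce; it shows only idle instances being handed out first, the pool expanding on a shortfall, and instances returning to the pool on termination.

```python
class InstancePool:
    """Toy model of a pool of idle, ready-to-use instances (hypothetical class)."""

    def __init__(self, idle_instances: int) -> None:
        self.idle = idle_instances   # instances waiting in the pool
        self.newly_allocated = 0     # instances requested from the instance provider

    def acquire(self, requested: int) -> int:
        """Give a cluster `requested` instances, expanding the pool if needed."""
        from_idle = min(requested, self.idle)
        shortfall = requested - from_idle
        self.idle -= from_idle
        self.newly_allocated += shortfall  # pool expands via the instance provider
        return requested

    def release(self, returned: int) -> None:
        """Return a terminated cluster's instances to the pool for reuse."""
        self.idle += returned


pool = InstancePool(idle_instances=2)
granted = pool.acquire(3)   # 2 come from idle instances, 1 is newly allocated
pool.release(granted)       # cluster terminated; all 3 instances become idle
```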

Databricks runtime

The set of core components that run on the clusters managed by Azure Databricks. See Compute. Azure Databricks has the following runtimes:

Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
Databricks Runtime for Machine Learning is built on Databricks Runtime and provides prebuilt machine learning infrastructure that is integrated with all of the capabilities of the Azure Databricks workspace. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.

Workflows

Frameworks to develop and run data processing pipelines:

Jobs: A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis.
Delta Live Tables: A framework for building reliable, maintainable, and testable data processing pipelines.

See Introduction to Azure Databricks Workflows.

Workload

Workload is the amount of processing capability needed to perform a task or group of tasks. Azure Databricks identifies two types of workloads: data engineering (job) and data analytics (all-purpose).

Data engineering: An (automated) workload runs on a job cluster, which the Azure Databricks job scheduler creates for each workload.
Data analytics: An (interactive) workload runs on an all-purpose cluster. Interactive workloads typically run commands within an Azure Databricks notebook. However, running a job on an existing all-purpose cluster is also treated as an interactive workload.
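
The classification rule above depends only on the type of cluster the work runs on, which can be sketched as follows. The function name and the string labels are illustrative, not part of any Azure Databricks API.

```python
def classify_workload(cluster_type: str) -> str:
    """Map a cluster type to the workload type described above."""
    if cluster_type == "job":
        # Automated workload on a job cluster created by the job scheduler.
        return "data engineering"
    if cluster_type == "all-purpose":
        # Interactive workload; this also covers a job that runs on an
        # existing all-purpose cluster.
        return "data analytics"
    raise ValueError(f"unknown cluster type: {cluster_type!r}")


scheduled_job = classify_workload("job")
notebook_session = classify_workload("all-purpose")
```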

Execution context

The state for a read–eval–print loop (REPL) environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.

Data engineering

Data engineering tools aid collaboration among data scientists, data engineers, data analysts, and machine learning engineers.

Workspace

A workspace is an environment for accessing all of your Azure Databricks assets. A workspace organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.

Notebook

A web-based interface for creating data science and machine learning workflows that can contain runnable commands, visualizations, and narrative text. See Introduction to Databricks notebooks.

Library

A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries, and you can also upload your own. See Libraries.

Git folder (formerly Repos)

A folder whose contents are co-versioned together by syncing them to a remote Git repository. Databricks Git folders integrate with Git to provide source and version control for your projects.

AI and machine learning

Databricks provides an integrated end-to-end environment with managed services for developing and deploying AI and machine learning applications.

Mosaic AI

The brand name for products and services from Databricks Mosaic AI Research, a team of researchers and engineers responsible for Databricks' biggest breakthroughs in generative AI. Mosaic AI products include the ML and AI features in Databricks. See Mosaic Research.

Machine learning runtime

To help you develop ML and AI models, Databricks provides Databricks Runtime for Machine Learning, which automates compute creation with pre-built machine learning and deep learning infrastructure, including the most common ML and DL libraries. It also has built-in, pre-configured GPU support, including drivers and supporting libraries. For information about the latest runtime releases, see Databricks Runtime release notes versions and compatibility.

Experiment

A collection of MLflow runs for training a machine learning model. See Organize training runs with MLflow experiments.

Features

Features are an important component of ML models. A feature store enables feature sharing and discovery across your organization and also ensures that the same feature computation code is used for model training and inference. See What is a feature store?.
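
The idea of using the same feature computation code for training and inference can be illustrated in plain Python. This sketch does not use the feature store API; the function, the field names, and the sample records are all hypothetical. The point is only that both paths call one shared function, so the features cannot drift apart.

```python
def compute_features(raw: dict) -> dict:
    """Single source of truth for feature logic (hypothetical fields)."""
    return {
        "amount_per_item": raw["amount"] / max(raw["items"], 1),
        "is_weekend": raw["day_of_week"] in (5, 6),
    }


# Training path: features computed over historical records.
historical_records = [
    {"amount": 30.0, "items": 3, "day_of_week": 6},
    {"amount": 12.0, "items": 4, "day_of_week": 2},
]
training_rows = [compute_features(record) for record in historical_records]

# Inference path: the same function is applied to a new incoming record.
incoming_record = {"amount": 30.0, "items": 3, "day_of_week": 6}
inference_row = compute_features(incoming_record)
```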

GenAI models

Databricks includes a set of pre-configured foundation models, which are large language models that are trained for use in a wide variety of use cases. See Generative AI and large language models (LLMs) on Azure Databricks.

AI playground

A chat-like environment in the workspace where you can test, prompt, and compare LLMs. See Chat with supported LLMs using AI Playground.

Model registry

Databricks provides a hosted version of MLflow Model Registry in Unity Catalog. Models registered in Unity Catalog inherit centralized access control, lineage, and cross-workspace discovery and access. See Manage model lifecycle in Unity Catalog.

Model serving

Mosaic AI Model Serving provides a unified interface to deploy, govern, and query AI models. Each model you serve is available as a REST API that you can integrate into your web or client application. With Mosaic AI Model Serving, you can deploy your own models, foundation models, or third-party models hosted outside of Databricks. See Model serving with Azure Databricks.
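
Because each served model is exposed as a REST API, a client can query it with an ordinary HTTP request. The sketch below builds such a request without sending it. The workspace URL, endpoint name, token, and input record are placeholders, and both the invocations path and the `dataframe_records` payload shape are assumptions to check against the Model Serving documentation for your model type.

```python
import json
import urllib.request


def build_scoring_request(workspace_url: str, endpoint_name: str,
                          token: str, records: list) -> urllib.request.Request:
    """Build (but do not send) a POST request to a model serving endpoint."""
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values: substitute your own workspace, endpoint, and token.
scoring_request = build_scoring_request(
    "https://example.azuredatabricks.net",
    "my-endpoint",
    "dapi-EXAMPLE-TOKEN",
    [{"feature_a": 1.0}],
)
# To actually send it: urllib.request.urlopen(scoring_request)
```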

Data warehousing

Data warehousing refers to collecting and storing data from multiple sources so it can be quickly accessed for business insights and reporting. Databricks SQL is the collection of services that bring data warehousing capabilities and performance to your existing data lakes. See What is data warehousing on Azure Databricks?.

Query

A query is a valid SQL statement that allows you to interact with your data. You can author queries using the in-platform SQL editor, or connect using a SQL connector, driver, or API tools. See Access and manage saved queries to learn more about how to work with queries.

SQL warehouse

A computation resource on which you run SQL queries. There are three types of SQL warehouses: Classic, Pro, and Serverless. Azure Databricks recommends using serverless warehouses where available. See SQL warehouse types to compare available features for each warehouse type.

Query history

A list of executed queries and their performance characteristics. Query history allows you to monitor query performance, helping you identify bottlenecks and optimize query runtimes. See Query history.

Visualization

A graphical presentation of the result of running a query. See Visualizations in Databricks notebooks.

Dashboard

A presentation of data visualizations and commentary. You can use dashboards to automatically send reports to anyone in your Azure Databricks account. Use the Databricks Assistant to help you build visualizations based on natural language prompts. See Dashboards. You can also create a dashboard from a notebook. See Dashboards in notebooks. For legacy dashboards, see Legacy dashboards.
