Compute system tables reference

Article
07/02/2024

Important

This article provides you with an overview of the compute system tables, including the schemas and example queries. There are two cluster system tables available now: clusters and node_types.

Cluster table schema

The cluster table is a slow-changing dimension table that contains the full history of cluster configurations over time for all-purpose and jobs clusters.

The clusters system table is located at system.compute.clusters and has the following schema:

Column name	Data type	Description	Example
`account_id`	string	ID of the account where this cluster was created.	`23e22ba4-87b9-` `4cc2-9770-d10b894b7118`
`workspace_id`	string	ID of the workspace where this cluster was created.	`1234567890123456`
`cluster_id`	string	ID of the cluster for which this record is associated.	`0000-123456-xxxxxxxx`
`cluster_name`	string	User defined name for the cluster.	`My cluster`
`owned_by`	string	Username of the cluster owner. Defaults to the cluster creator, but can be changed through the Clusters API.	`sample_user@email.com`
`create_time`	timestamp	Timestamp of the change to this compute definition.	`2023-01-09 11:00:00.000`
`delete_time`	timestamp	Timestamp of when the cluster was deleted. The value is `null` if the cluster is not deleted.	`2023-01-09 11:00:00.000`
`driver_node_type`	string	Driver node type name. This matches the instance type name from the cloud provider.	`Standard_D16s_v3`
`worker_node_type`	string	Worker node type name. This matches the instance type name from the cloud provider.	`Standard_D16s_v3`
`worker_count`	bigint	Number of workers. Defined for fixed-size clusters only.	`4`
`min_autoscale_workers`	bigint	The set minimum number of workers. This field is valid only for autoscaling clusters.	`1`
`max_autoscale_workers`	bigint	The set maximum number of workers. This field is valid only for autoscaling clusters.	`1`
`auto_termination_minutes`	bigint	The configured autotermination duration.	`120`
`enable_elastic_disk`	boolean	Autoscaling disk enablement status.	`true`
`tags`	map	User-defined tags for the cluster (does not include default tags).	`{"ResourceClass":"SingleNode"}`
`cluster_source`	string	Indicates the creator for the cluster: `UI`, `API`, `JOB`, etc.	`UI`
`init_scripts`	array	Set of paths for init scripts.	`"/Users/example@email.com` `/files/scripts/install-python-pacakges.sh"`
`aws_attributes`	struct	AWS specific settings.	`null`
`azure_attributes`	struct	Azure specific settings.	`{` `"first_on_demand": "0",` `"availability": "ON_DEMAND_AZURE",` `"spot_bid_max_price": "—1"` `}`
`gcp_attributes`	struct	GCP specific settings. This field will be empty.	`null`
`driver_instance_pool_id`	string	Instance pool ID if the driver is configured on top of an instance pool.	`1107-555555-crhod16-pool-DIdnjazB`
`worker_instance_pool_id`	string	Instance Pool ID if the worker is configured on top of an instance pool.	`1107-555555-crhod16-pool-DIdnjazB`
`dbr_version`	string	The Databricks Runtime of the cluster.	`14.x-snapshot-scala2.12`
`change_time`	timestamp	Timestamp of change to the compute definition.	`2023-01-09 11:00:00.000`
`change_date`	date	Change date. Used for retention.	`2023-01-09`

Node types table schema

The node type table captures the currently available node types with their basic hardware information. The node type system table is located at system.compute.node_types and has the following schema:

Column name	Data type	Description	Example
`account_id`	string	ID of the account where this cluster was created.	`23e22ba4-87b9-4cc2-9770-d10b894b7118`
`node_type_name`	string	Unique identifier for node type.	`Standard_D16s_v3`
`core_count`	double	Number of vCPUs for the instance.	`48.0`
`memory_mb`	long	Total memory for the instance.	`393216`
`gpu_count`	long	Number of GPUs for the instance.	`0`

Known limitations

Clusters that were marked deleted before October 23, 2023 do not appear in the clusters table. This might result in joins from the system.billing.usage table not matching cluster records in the clusters table. All active clusters have been backfilled.
The clusters table only includes records for all-purpose and jobs clusters. It does not contain Delta Live Tables clusters or SQL warehouses.

Sample queries

You can use the following sample queries to answer common questions about clusters:

Join cluster records with billable usage records
Attribute costs for a cluster to the cluster owner

Note

These examples join the cluster table with the system.billing.usage table. Since billing records are cross-regional and cluster records region-sepcific, billing records only match cluster records for the region in which you are querying. To see records from another region, please execute the query in that region.

Join cluster records with the most recent billing records

This query can help you understand spending over time. Once you update the usage_start_time to the most current billing period, it grabs the most recent updates to the billing records to join into clusters data.

Each record is associated with the cluster owner during that particular run. So, if the cluster owner changes, costs will roll up to the correct owner based on when the cluster was used.

SELECT
  u.record_id,
  c.cluster_id,
  c.owned_by,
  c.change_time,
  u.usage_start_time,
  u.usage_quantity
FROM
  system.billing.usage u
  JOIN system.compute.clusters c
  JOIN (SELECT u.record_id, c.cluster_id, max(c.change_time) change_time
    FROM system.billing.usage u
    JOIN system.compute.clusters c
    WHERE
      u.usage_metadata.cluster_id is not null
      and u.usage_start_time >= '2023-01-01'
      and u.usage_metadata.cluster_id = c.cluster_id
      and date_trunc('HOUR', c.change_time) <= date_trunc('HOUR', u.usage_start_time)
    GROUP BY all) config
WHERE
  u.usage_metadata.cluster_id is not null
  and u.usage_start_time >= '2023-01-01'
  and u.usage_metadata.cluster_id = c.cluster_id
  and u.record_id = config.record_id
  and c.cluster_id = config.cluster_id
  and c.change_time = config.change_time
ORDER BY cluster_id, usage_start_time desc;

Attribute costs for a cluster to the cluster owner

If you are looking to reduce compute costs, you can use this query to find out which cluster owners in your account are using the most DBUs.

SELECT
  u.record_id record_id,
  c.cluster_id cluster_id,
  max_by(c.owned_by, c.change_time) owned_by,
  max(c.change_time) change_time,
  any_value(u.usage_start_time) usage_start_time,
  any_value(u.usage_quantity) usage_quantity
FROM
  system.billing.usage u
  JOIN system.compute.clusters c
WHERE
  u.usage_metadata.cluster_id is not null
  and u.usage_start_time >= '2023-01-01'
  and u.usage_metadata.cluster_id = c.cluster_id
  and c.change_time <= u.usage_start_time
GROUP BY 1, 2
ORDER BY cluster_id, usage_start_time desc;

Share via