Getting CLUSTER_CREATION_TIMED_OUT issue

Question

Hi,

My jobs are failing before the Spark session starts, with a timeout error.

All my pools use the XXL configuration, but the pipeline requests Small or Medium.

Answer

This may be due to a mismatch between the pool configuration (XXL) and the size the jobs in the pipeline request (Small or Medium).

Check Pool Configuration: Ensure that the pools associated with your jobs can support the configurations required for small or medium clusters. If your jobs request a smaller cluster configuration, but the pool is set up for XXL instances, this mismatch may cause delays in provisioning, leading to timeouts.

Increase Timeout Settings: You may need to increase the cluster creation timeout in your pipeline configuration. Cluster creation has a default timeout, and if the cluster isn't provisioned within that window, the job fails. Increasing the timeout gives the pool more time to allocate the necessary resources.
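To illustrate why a longer timeout can help, here is a minimal sketch of a polling loop with a deadline. The function `wait_for_cluster`, the `get_state` callable, and the "RUNNING" state string are hypothetical and are not part of any Databricks or Synapse API; the real timeout setting lives wherever your pipeline or service exposes it.

```python
import time
from typing import Callable


def wait_for_cluster(get_state: Callable[[], str],
                     timeout_s: float,
                     poll_interval_s: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the cluster reports RUNNING or the timeout elapses.

    Returns True if the cluster started in time, False on timeout.
    get_state is a hypothetical stand-in for whatever status check
    your platform offers.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "RUNNING":
            return True
        if clock() >= deadline:
            # This is the situation that surfaces as a creation timeout.
            return False
        sleep(poll_interval_s)
```

If provisioning is slow but does eventually succeed, a larger `timeout_s` turns a failure into a success. If the cluster can never start (for example because of a quota limit), a longer timeout only delays the failure.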

Adjust Job Cluster Size: Ensure that the job configuration within the pipeline aligns with the cluster size supported by the pool. If the jobs are using a smaller size (small or medium), try adjusting the pool configuration to match this.

Cluster Autoscaling: If your pipeline jobs require dynamic scaling, ensure that autoscaling is enabled for your cluster pool. This can help in managing resources better when different job sizes (small, medium, XXL) are being executed.
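A quick way to spot the mismatch is to list each job's requested size next to the pool's size. The sketch below assumes you have copied those values out of your pipeline definition by hand; `JobRequest`, `find_size_mismatches`, and the job names are hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequest:
    name: str
    node_size: str  # size the pipeline activity asks for, e.g. "Small"


def find_size_mismatches(pool_node_size: str, jobs: list[JobRequest]) -> list[str]:
    """Return the names of jobs whose requested size differs from the pool's."""
    pool = pool_node_size.strip().lower()
    return [j.name for j in jobs if j.node_size.strip().lower() != pool]


# Hypothetical example values: an XXL pool and three pipeline jobs.
jobs = [
    JobRequest("ingest", "Small"),
    JobRequest("transform", "Medium"),
    JobRequest("report", "XXL"),
]
mismatched = find_size_mismatches("XXL", jobs)
print(mismatched)  # ['ingest', 'transform']
```

Any job that shows up in the result is a candidate for either a matching pool or an updated size in the pipeline.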

Azure Quota and Resource Availability: Check if there are any quota limitations or resource availability issues in your region. Sometimes, if the required instances (e.g., Small or Medium) are not available in sufficient quantity, it may cause cluster creation to time out.
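As a rough sanity check on quota, you can compare the cores a cluster would need against what your subscription allows. This is only arithmetic; the node count, vCores per node, and quota figures below are made-up example values, so substitute the numbers shown for your own region and pool.

```python
def cores_needed(node_count: int, vcores_per_node: int) -> int:
    """Total vCores a cluster of the given shape would consume."""
    return node_count * vcores_per_node


def fits_quota(node_count: int, vcores_per_node: int,
               quota_vcores: int, vcores_in_use: int = 0) -> bool:
    """True if the cluster fits in the remaining quota."""
    needed = cores_needed(node_count, vcores_per_node)
    return vcores_in_use + needed <= quota_vcores


# Hypothetical numbers: 10 nodes at 64 vCores each against a 500 vCore quota.
print(cores_needed(10, 64))     # 640
print(fits_quota(10, 64, 500))  # False
```

If the check fails, cluster creation cannot complete no matter how long the timeout is, so a quota increase or a smaller cluster is the fix.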

Logs and Diagnostics: Review the logs in your Databricks workspace for more detailed information about the cluster creation timeout. This will help pinpoint if the issue is related to resource availability, configuration mismatch, or some other factor.

If you can share more detailed logs or error messages, that will help narrow down the issue.
