Databricks cluster sizing

Vineet S 265 Reputation points
2024-06-26T06:13:37.5433333+00:00

Hey,

How do I calculate the cluster cores and worker nodes needed for a 10 GB data load every 2 hours? What is the calculation behind this?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

Accepted answer
  1. Smaran Thoomu 12,090 Reputation points Microsoft Vendor
    2024-06-26T10:54:39.5233333+00:00

    Hi @Vineet S

    Thanks for the question and using MS Q&A platform.

    To size a cluster for a 10 GB data load every 2 hours, you need to consider several factors: the size of the data, the complexity of the processing tasks, and the processing time you want to allow.

    Here are some general steps you can follow to calculate the cluster size:

    1. Determine the size of the data: In this case, you have a 10 GB data load every 2 hours.
    2. Determine the processing window: Decide how long the processing is allowed to take. For example, if you want each load processed within 1 hour, your processing window is 1 hour.
    3. Determine the required processing rate: Divide the size of the data by the processing window. For a 10 GB load and a 1-hour window, the required rate is 10 GB / 1 hour = 10 GB/hour.
    4. Determine the number of cores: Estimate the cores needed from the required rate and the per-core throughput of your workload. For example, if each core can process 1 GB/hour, you need 10 GB/hour ÷ 1 GB/hour per core = 10 cores.
    5. Determine the number of workers: Divide the cores by the cores per worker node and round up. For example, with 4 cores per worker, 10 cores / 4 cores per worker = 2.5 workers, rounded up to 3 workers.
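
    The steps above can be sketched as a small calculation. Note that the per-core throughput (1 GB/hour here) is an assumed placeholder figure; you should measure it for your own workload, since it depends heavily on transformation complexity, file format, and I/O.

    ```python
    import math

    def estimate_cluster_size(data_gb, window_hours,
                              gb_per_core_hour=1.0, cores_per_worker=4):
        """Estimate cores and workers for a batch load.

        gb_per_core_hour is an assumption -- benchmark your own job
        to replace it with a measured value.
        """
        required_rate = data_gb / window_hours              # GB/hour needed
        cores = math.ceil(required_rate / gb_per_core_hour) # step 4
        workers = math.ceil(cores / cores_per_worker)       # step 5
        return cores, workers

    cores, workers = estimate_cluster_size(data_gb=10, window_hours=1)
    print(cores, workers)  # 10 cores -> 3 workers of 4 cores each
    ```

    Running this with the numbers from the example gives 10 cores and 3 workers, matching the arithmetic above.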

    As a starting point, you can use a cluster with 2-4 worker nodes and 8 GB of memory per node, then monitor job performance and resource utilization and scale up if needed.
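
    For illustration, such a starting point could be expressed as an autoscaling cluster spec for the Databricks Clusters API. The cluster name, runtime version, and node type below are assumptions, not recommendations; pick a node type whose cores and memory match your sizing estimate.

    ```json
    {
      "cluster_name": "2-hourly-ingest",
      "spark_version": "14.3.x-scala2.12",
      "node_type_id": "Standard_DS3_v2",
      "autoscale": {
        "min_workers": 2,
        "max_workers": 4
      }
    }
    ```

    Autoscaling between 2 and 4 workers lets the cluster grow toward the upper bound only when the load actually needs it.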


    Keep in mind that these are general steps and the actual cluster size required may vary depending on the specific processing tasks and the complexity of the data. You may need to adjust the cluster size based on performance testing and monitoring.

    Hope this helps. Do let us know if you have any further queries.

    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".

