Configure Structured Streaming batch size on Azure Databricks
Limiting the input rate for Structured Streaming queries helps maintain a consistent batch size and prevents oversized batches from causing spill and cascading micro-batch processing delays.
Azure Databricks provides the same options to control Structured Streaming batch sizes for both Delta Lake and Auto Loader.
Limit input rate with maxFilesPerTrigger
Setting maxFilesPerTrigger (or cloudFiles.maxFilesPerTrigger for Auto Loader) specifies an upper bound for the number of files processed in each micro-batch. For both Delta Lake and Auto Loader the default is 1000. (This option also exists in Apache Spark for other file sources, where there is no maximum by default.)
Limit input rate with maxBytesPerTrigger
Setting maxBytesPerTrigger (or cloudFiles.maxBytesPerTrigger for Auto Loader) sets a "soft max" for the amount of data processed in each micro-batch. A batch processes approximately this amount of data, and may exceed the limit to keep the streaming query moving forward when the smallest input unit is larger than the limit. There is no default for this setting.
For example, if you specify a byte string such as 10g
to limit each micro-batch to 10 GB of data and you have files that are 3 GB each, Azure Databricks processes 12 GB in a micro-batch: three files total only 9 GB, which is under the limit, so a fourth file is included.
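The sketch below shows the same reads with a byte-based soft limit instead of a file count. Paths are hypothetical, and the "10g" value matches the example above.

```python
# Hypothetical paths; "10g" is a soft cap of roughly 10 GB per micro-batch.
delta_stream = (
    spark.readStream
    .format("delta")
    .option("maxBytesPerTrigger", "10g")
    .load("/delta/events")
)

autoloader_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxBytesPerTrigger", "10g")
    .load("/mnt/raw/events")
)
```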
Setting multiple input rates together
If you use maxBytesPerTrigger in conjunction with maxFilesPerTrigger, the micro-batch processes data until reaching the lower limit of either maxFilesPerTrigger or maxBytesPerTrigger.
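For illustration, both limits set on a single Delta Lake read; the path and values are hypothetical, and whichever cap is hit first ends the micro-batch.

```python
# Hypothetical path and values: the micro-batch stops at 1000 files
# or roughly 10 GB, whichever limit is reached first.
combined_stream = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 1000)
    .option("maxBytesPerTrigger", "10g")
    .load("/delta/events")
)
```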
Limiting input rates for other Structured Streaming sources
Streaming sources such as Apache Kafka each have custom input limits, such as maxOffsetsPerTrigger. For more details, see Configure streaming data sources.
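As a sketch, a Kafka read that caps each micro-batch at 10,000 offsets per trigger; the broker address, topic name, and limit are placeholders.

```python
# Placeholder broker and topic; maxOffsetsPerTrigger caps the total number
# of offsets consumed per micro-batch across all partitions.
kafka_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("maxOffsetsPerTrigger", 10000)
    .load()
)
```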