Structured Streaming concepts
This article provides an introduction to Structured Streaming on Azure Databricks.
What is Structured Streaming?
Apache Spark Structured Streaming is a near real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.
Read from a data stream
You can use Structured Streaming to incrementally ingest data from supported data sources. Common data sources include the following:
- Data files in cloud object storage. See What is Auto Loader?.
- Message buses and queues. See Configure streaming data sources.
- Delta Lake. See Delta table streaming reads and writes.
Each data source provides a number of options to specify how to load batches of data. During reader configuration, you might need to configure options to do the following:
- Specify the data source or format (for example, file type, delimiters, and schema).
- Configure access to source systems (for example, port settings and credentials).
- Specify where to start in a stream (for example, Kafka offsets or reading all existing files).
- Control how much data is processed in each batch (for example, max offsets, files, or bytes per batch). See Configure Structured Streaming batch size on Azure Databricks.
Write to a data sink
A data sink is the target of a streaming write operation. Common sinks used in Azure Databricks streaming workloads include the following:
- Delta Lake
- Message buses and queues
- Key-value databases
As with data sources, most data sinks provide a number of options to control how data is written to the target system. During writer configuration, you specify the following options:
- Output mode (append by default). See Select an output mode for Structured Streaming.
- A checkpoint location (required for each writer). See Structured Streaming checkpoints.
- Trigger intervals. See Configure Structured Streaming trigger intervals.
- Options that specify the data sink or format (for example, file type, delimiters, and schema).
- Options that configure access to target systems (for example, port settings and credentials).