Implement schema enforcement

Schema enforcement is one of the key features of Delta Lake, helping you to maintain data integrity and consistency by ensuring that the data written to a Delta table adheres to the expected schema. This feature prevents erroneous data from corrupting the dataset, which can be valuable in production environments where multiple processes or users might be writing to the same dataset. To implement schema enforcement in Delta Lake, follow these steps:

Create a Delta table with a defined schema

Create a Delta table with a specific schema either by using Spark SQL or programmatically through the DataFrame API. The following example creates a Delta table with a defined schema by using Spark SQL.

-- Create a Delta table using Spark SQL
CREATE TABLE my_delta_table_schema (
    id INT,
    name STRING,
    age INT
)
USING DELTA;
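
If you prefer the DataFrame API mentioned above, a minimal sketch like the following creates the same table. It assumes a Databricks notebook where the spark session is already available; the table name mirrors the SQL example.

# Create the same Delta table through the DataFrame API
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define the expected schema explicitly
table_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Write an empty DataFrame with that schema as a managed Delta table
spark.createDataFrame([], table_schema) \
    .write.format("delta") \
    .saveAsTable("my_delta_table_schema")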

Next, insert some data that conforms to the defined schema:

-- Insert valid data
INSERT INTO my_delta_table_schema (id, name, age)
VALUES
(1, 'Alice', 30),
(2, 'Bob', 25);
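
If you're writing with DataFrames instead of SQL, the equivalent append might look like the following sketch, which reuses the table_schema variable from the earlier DataFrame example.

# Append rows that conform to the table's schema via the DataFrame API
valid_df = spark.createDataFrame(
    [(1, "Alice", 30), (2, "Bob", 25)],
    schema=table_schema
)

valid_df.write.format("delta").mode("append").saveAsTable("my_delta_table_schema")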

Enforce schema on writes

Once a Delta table has a defined schema, Delta Lake automatically enforces it on all subsequent write operations. If incoming data doesn't match the table's schema, Delta Lake rejects the write and raises an error rather than silently storing malformed data. The following example attempts a write that violates the schema.

-- Attempt to insert data with an invalid schema (missing the 'age' value)
INSERT INTO my_delta_table_schema
VALUES
(3, 'Charlie');
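
The same enforcement applies to DataFrame writes. In the following sketch, a DataFrame with a column that isn't part of the table's schema (city, an illustrative name) is rejected when appended; Delta Lake typically surfaces this as an AnalysisException describing the schema mismatch.

# Attempt to append a DataFrame whose schema doesn't match the table
from pyspark.sql.utils import AnalysisException

bad_df = spark.createDataFrame(
    [(5, "Eve", "Seattle")],
    schema="id INT, name STRING, city STRING"  # 'city' isn't in the table schema
)

try:
    bad_df.write.format("delta").mode("append").saveAsTable("my_delta_table_schema")
except AnalysisException as error:
    print(f"Write rejected by schema enforcement: {error}")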

Handle schema mismatches

When incoming data overlaps with rows that already exist in the table, you can use the MERGE statement to handle updates and insertions while still conforming to the table's schema.

-- Define a temporary view with new data
CREATE OR REPLACE TEMP VIEW my_new_delta_table_schema AS
SELECT * FROM VALUES
(3, 'Charlie', 28),
(4, 'Diana', 35)
AS my_new_delta_table_schema(id, name, age);

-- Use MERGE to upsert data
MERGE INTO my_delta_table_schema AS target
USING my_new_delta_table_schema AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.age = source.age
WHEN NOT MATCHED THEN
  INSERT (id, name, age)
  VALUES (source.id, source.name, source.age);
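
The same upsert can also be expressed with the DeltaTable Python API. The following sketch assumes the delta-spark package that ships with Databricks runtimes and reuses the table_schema variable defined earlier.

# Upsert new rows with the DeltaTable API instead of SQL MERGE
from delta.tables import DeltaTable

source_df = spark.createDataFrame(
    [(3, "Charlie", 28), (4, "Diana", 35)],
    schema=table_schema
)

target = DeltaTable.forName(spark, "my_delta_table_schema")

(target.alias("target")
    .merge(source_df.alias("source"), "target.id = source.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())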

If the incoming data types are different but compatible, you can use CAST to align them with the table's schema.

-- Insert data with casting to match the schema
INSERT INTO my_delta_table_schema
SELECT
  CAST(id AS INT),
  CAST(name AS STRING),
  CAST(age AS INT)
FROM my_new_delta_table_schema;
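
With the DataFrame API, the equivalent approach is to cast each column to the target type before appending. The following sketch reads the temporary view defined above and aligns its types with the table's schema.

# Cast incoming columns to the target table's types before appending
from pyspark.sql.functions import col

incoming_df = spark.table("my_new_delta_table_schema")

aligned_df = (incoming_df
    .withColumn("id", col("id").cast("int"))
    .withColumn("name", col("name").cast("string"))
    .withColumn("age", col("age").cast("int")))

aligned_df.write.format("delta").mode("append").saveAsTable("my_delta_table_schema")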

By following these steps, you can effectively utilize Delta Lake's schema enforcement capabilities within Azure Databricks to maintain data quality and integrity.