This article provides guidance about how to use command-line tools to run Spark jobs on SQL Server Big Data Clusters.
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
Prerequisites
The Azure Data CLI (azdata)
A curl application to perform REST API calls to Livy
Spark jobs that use azdata or Livy
This article provides examples of how to use command-line patterns to submit Spark applications to SQL Server Big Data Clusters.
The Azure Data CLI azdata bdc spark commands surface all capabilities of SQL Server Big Data Clusters Spark on the command line. This article focuses on job submission. But azdata bdc spark also supports interactive modes for Python, Scala, SQL, and R through the azdata bdc spark session command.
If you need direct integration with a REST API, use standard Livy calls to submit jobs. This article uses the curl command-line tool in the Livy examples to run the REST API call. For a detailed example that shows how to interact with the Spark Livy endpoint by using Python code, see Use Spark from the Livy endpoint on GitHub.
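As a quick orientation, the following sketch shows the general shape of a Livy REST call made with curl. The gateway address, port, credentials, and endpoint path are placeholders that assume the default gateway routing for the cluster's Livy service; adjust them to match your deployment.

# List the Spark batches that Livy currently tracks (standard Livy GET /batches call).
# Replace <USER>, <PASSWORD>, and <GATEWAY> with values for your cluster.
curl -k -u <USER>:<PASSWORD> "https://<GATEWAY>:30443/gateway/default/livy/v1/batches"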
Simple ETL that uses Big Data Clusters Spark
This extract, transform, and load (ETL) application follows a common data engineering pattern. It loads tabular data from an Apache Hadoop Distributed File System (HDFS) landing-zone path. It then writes the data, in a table format, to an HDFS processed-zone path.
Download the sample application's dataset. Then create the Spark application by using PySpark, Spark Scala, or Spark SQL.
The following sections walk through a sample exercise for each solution. You'll run the application by using azdata or curl.
This example uses the following PySpark application. It's saved as a Python file named parquet_etl_sample.py on the local machine.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the clickstream data from storage pool HDFS into a Spark data frame and rename the columns.
df = spark.read.option("inferSchema", "true").csv('/securelake/landing/criteo/test.txt', sep='\t',
header=False).toDF("feat1","feat2","feat3","feat4","feat5","feat6","feat7","feat8",
"feat9","feat10","feat11","feat12","feat13","catfeat1","catfeat2","catfeat3","catfeat4",
"catfeat5","catfeat6","catfeat7","catfeat8","catfeat9","catfeat10","catfeat11","catfeat12",
"catfeat13","catfeat14","catfeat15","catfeat16","catfeat17","catfeat18","catfeat19",
"catfeat20","catfeat21","catfeat22","catfeat23","catfeat24","catfeat25","catfeat26")
# Print the data frame's inferred schema
df.printSchema()
tot_rows = df.count()
print("Number of rows:", tot_rows)
# Drop the managed table if it already exists, so the job can be rerun cleanly
spark.sql("DROP TABLE IF EXISTS dl_clickstream")
# Write the data frame to an HDFS managed table by using the Parquet format
df.write.format("parquet").mode("overwrite").saveAsTable("dl_clickstream")
print("Sample ETL pipeline completed")
Copy the PySpark application to HDFS
Store the application in HDFS so the cluster can access it for execution. As a best practice, standardize and govern application locations within the cluster to streamline administration.
In this example use case, all ETL pipeline applications are stored on the hdfs:/apps/ETL-Pipelines path. The sample application is stored at hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py.
Run the following command to upload parquet_etl_sample.py from the local development or staging machine to the HDFS cluster.
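One possible shape of that command is sketched below. The azdata form assumes you're already signed in to the cluster with azdata login. The curl alternative assumes the cluster's WebHDFS endpoint is reachable through the gateway on port 30443; <USER>, <PASSWORD>, and <GATEWAY> are placeholders for your environment.

# Upload with azdata; copies the local file into the governed HDFS location.
azdata bdc hdfs cp --from-path parquet_etl_sample.py --to-path "hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py"

# Alternative: upload through the WebHDFS REST API with curl.
curl -i -L -k -u <USER>:<PASSWORD> -X PUT -T parquet_etl_sample.py \
  "https://<GATEWAY>:30443/gateway/default/webhdfs/v1/apps/ETL-Pipelines/parquet_etl_sample.py?op=CREATE&overwrite=true"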
For Spark Scala applications, the Spark documentation recommends creating an assembly JAR (or bundle) that contains the application and all of its dependencies. This step is required before you can submit the application bundle to the cluster for execution.
Setting up a complete Scala Spark development environment is beyond the scope of this article. For more information, see the Spark documentation for creating self-contained applications.
This example assumes that an application JAR bundle named parquet-etl-sample.jar is compiled and available. Run the following command to upload the bundle from the local development or staging machine to the HDFS cluster.
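A minimal sketch of that upload with azdata, assuming you're signed in to the cluster, follows.

# Copy the compiled application bundle into the governed HDFS location.
azdata bdc hdfs cp --from-path parquet-etl-sample.jar --to-path "hdfs:/apps/ETL-Pipelines/parquet-etl-sample.jar"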
Run the PySpark application
The azdata command runs the application by using commonly specified parameters. For complete parameter options for azdata bdc spark batch create, see azdata bdc spark.
This application requires the spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation configuration parameter. So the command uses the --config option. This setup shows how to pass configurations into the Spark session.
You can use the --config option to specify multiple configuration parameters. You could also specify them inside the application session by setting the configuration in the SparkSession object.
The "name" parameter should be unique each time a new batch is created.
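A hedged sketch of the submission follows, shown both with azdata and with a direct Livy call through curl. The batch name, executor sizing, gateway address, and credentials are illustrative placeholders; the application path and the --config setting come from the walkthrough above.

# Submit the PySpark application with azdata, passing the required Spark configuration.
azdata bdc spark batch create -f hdfs:/apps/ETL-Pipelines/parquet_etl_sample.py \
  --config '{"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation":"true"}' \
  -n MyETLPipelinePySpark --executor-count 2 --executor-cores 2 --executor-memory 1664m

# Equivalent submission through the Livy batches REST API.
curl -k -u <USER>:<PASSWORD> -X POST "https://<GATEWAY>:30443/gateway/default/livy/v1/batches" \
  -H "Content-Type: application/json" \
  -d '{"file":"/apps/ETL-Pipelines/parquet_etl_sample.py",
       "name":"MyETLPipelinePySpark",
       "conf":{"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation":"true"}}'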
Run the Spark Scala application
The azdata command runs the application by using commonly specified parameters. For complete parameter options for azdata bdc spark batch create, see azdata bdc spark.
The application requires the spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation configuration parameter. So the command uses the --config option. This setup shows how to pass configurations into the Spark session.
You can use the --config option to specify multiple configuration parameters. You could also specify them inside the application session by setting the configuration in the SparkSession object.
The "name" parameter, which specifies the batch name, should be unique each time a new batch is created.
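A hedged sketch of the Scala submission through the Livy batches API follows. The main class name com.contoso.SampleETL is a made-up placeholder for whatever entry-point class your assembly JAR exposes; the gateway, credentials, and batch name are likewise illustrative. With azdata, the same submission uses azdata bdc spark batch create with the JAR path and the main class; check azdata bdc spark batch create --help for the exact option name in your version.

# Submit the Scala application bundle; "className" names the entry-point class inside the JAR.
curl -k -u <USER>:<PASSWORD> -X POST "https://<GATEWAY>:30443/gateway/default/livy/v1/batches" \
  -H "Content-Type: application/json" \
  -d '{"file":"/apps/ETL-Pipelines/parquet-etl-sample.jar",
       "className":"com.contoso.SampleETL",
       "name":"MyETLPipelineScala",
       "conf":{"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation":"true"}}'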
Run the Spark SQL application
The azdata command runs the application by using commonly specified parameters. For complete parameter options for azdata bdc spark batch create, see azdata bdc spark.
Like the PySpark example, this application also requires the spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation configuration parameter. So the command uses the --config option. This setup shows how to pass configurations into the Spark session.
You can use the --config option to specify multiple configuration parameters. You could also specify them inside the application session by setting the configuration in the SparkSession object.
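To illustrate the in-code alternative mentioned in each of the run sections, here's a minimal PySpark sketch that sets the same configuration on the SparkSession builder instead of passing it through --config at submission time.

from pyspark.sql import SparkSession

# Set the required configuration when the session is built,
# instead of passing it on the command line with --config.
spark = (SparkSession.builder
         .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true")
         .getOrCreate())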