Apache Spark guidelines

07/12/2024

This article provides guidelines for using Apache Spark on Azure HDInsight.

How do I run or submit Spark jobs?

Option	Documents
Visual Studio Code	Use Spark & Hive Tools for Visual Studio Code
Jupyter Notebook	Tutorial: Load data and run queries on an Apache Spark cluster in Azure HDInsight
IntelliJ	Tutorial: Use Azure Toolkit for IntelliJ to create Apache Spark applications for an HDInsight cluster
IntelliJ	Tutorial: Create a Scala Maven application for Apache Spark in HDInsight using IntelliJ
Zeppelin Notebook	Use Apache Zeppelin notebooks with an Apache Spark cluster on Azure HDInsight
Remote job submission with Livy	Use the Apache Spark REST API to submit remote jobs to an HDInsight Spark cluster
Apache Oozie	Oozie is a workflow and coordination system that manages Hadoop jobs.
Apache Livy	You can use Livy to run interactive Spark shells or to submit batch jobs to be run on Spark.
Azure Data Factory for Apache Spark	The Spark activity in a Data Factory pipeline runs a Spark program on your own or an on-demand HDInsight cluster.
Azure Data Factory for Apache Hive	The HDInsight Hive activity in a Data Factory pipeline runs Hive queries on your own or an on-demand HDInsight cluster.
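
As a rough illustration of the Livy batch submission option listed above, the following Python sketch builds (but does not send) a POST request to the Livy `/batches` endpoint of an HDInsight cluster. The function name `build_livy_batch_request`, the cluster name, the credentials, and the JAR path are all hypothetical placeholders; the sketch assumes the cluster login uses HTTP basic authentication and that the job file already resides in the cluster's default storage.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_name, user, password, app_file,
                             class_name=None, args=None):
    """Build, without sending, a Livy batch-submission request.

    All argument values are supplied by the caller; nothing here is
    specific to a real cluster.
    """
    # Livy's POST /batches body: "file" is required, the rest is optional.
    payload = {"file": app_file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)

    # Assumption: the cluster accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    return urllib.request.Request(
        url=f"https://{cluster_name}.azurehdinsight.net/livy/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Requested-By": user,
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_livy_batch_request(
    "mycluster", "admin", "example-password",
    "wasbs:///example/jars/app.jar",
    class_name="com.example.App", args=["10"],
)
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually submit the job. Building the request separately from sending it keeps the payload easy to inspect before anything reaches the cluster.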

How do I monitor and debug Spark jobs?

Option	Documents
Azure Toolkit for IntelliJ	Debug failed Spark jobs with Azure Toolkit for IntelliJ (preview)
Azure Toolkit for IntelliJ through SSH	Debug Apache Spark applications locally or remotely on an HDInsight cluster with Azure Toolkit for IntelliJ through SSH
Azure Toolkit for IntelliJ through VPN	Use Azure Toolkit for IntelliJ to debug Apache Spark applications remotely in HDInsight through VPN
Job graph on Apache Spark History Server	Use the extended Apache Spark History Server to debug and diagnose Apache Spark applications

How do I run Spark jobs more efficiently?

Option	Documents
IO Cache	Improve performance of Apache Spark workloads using Azure HDInsight IO Cache (preview)
Configuration options	Optimize Apache Spark jobs

How do I connect to other Azure services?

Option	Documents
Apache Hive on HDInsight	Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector
Apache HBase on HDInsight	Use Apache Spark to read and write Apache HBase data
Apache Kafka on HDInsight	Tutorial: Use Apache Spark Structured Streaming with Apache Kafka on HDInsight
Azure Cosmos DB	Azure Synapse Link for Azure Cosmos DB

What are my storage options?

Option	Documents
Azure Data Lake Storage Gen2	Use Data Lake Storage Gen2 with Azure HDInsight clusters
Azure Blob Storage	Use Azure Storage with Azure HDInsight clusters
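
To make the two storage options a little more concrete, here is a minimal Python sketch of the URI shapes that Spark jobs typically use to address them: `abfs://` for Data Lake Storage Gen2 and `wasbs://` for Azure Blob Storage. The helper names `abfs_uri` and `wasbs_uri`, and the container, account, and path values, are hypothetical and chosen only for illustration.

```python
def abfs_uri(container, account, path=""):
    """Return an abfs:// URI for a path in a Data Lake Storage Gen2 container."""
    return f"abfs://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def wasbs_uri(container, account, path=""):
    """Return a wasbs:// URI for a path in an Azure Blob Storage container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical container, account, and file names.
print(abfs_uri("data", "mystorage", "/logs/2024/events.csv"))
print(wasbs_uri("data", "mystorage", "logs/2024/events.csv"))
```

A string built this way could then be passed to a Spark reader such as `spark.read.csv(...)`, assuming the cluster has been granted access to the storage account.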

Next steps