
SparkJob Class

Reference

A standalone Spark job.

Inheritance

azure.ai.ml.entities._job.job.Job -> SparkJob
azure.ai.ml.entities._job.parameterized_spark.ParameterizedSpark -> SparkJob
azure.ai.ml.entities._job.job_io_mixin.JobIOMixin -> SparkJob
azure.ai.ml.entities._job.spark_job_entry_mixin.SparkJobEntryMixin -> SparkJob

Constructor

SparkJob(*, driver_cores: int | None = None, driver_memory: str | None = None, executor_cores: int | None = None, executor_memory: str | None = None, executor_instances: int | None = None, dynamic_allocation_enabled: bool | None = None, dynamic_allocation_min_executors: int | None = None, dynamic_allocation_max_executors: int | None = None, inputs: Dict | None = None, outputs: Dict | None = None, compute: str | None = None, identity: Dict[str, str] | ManagedIdentityConfiguration | AmlTokenConfiguration | UserIdentityConfiguration | None = None, resources: Dict | SparkResourceConfiguration | None = None, **kwargs)

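Parameters

driver_cores: Optional[int]

The number of cores to use for the driver process (cluster mode only).

driver_memory: Optional[str]

The amount of memory to use for the driver process, given as a string with a size unit suffix ("k", "m", "g" or "t"), for example "512m" or "2g".

executor_cores: Optional[int]

The number of cores to use on each executor.

executor_memory: Optional[str]

The amount of memory to use per executor process, given as a string with a size unit suffix ("k", "m", "g" or "t"), for example "512m" or "2g".

executor_instances: Optional[int]

The initial number of executors.

dynamic_allocation_enabled: Optional[bool]

Whether to use dynamic resource allocation, which scales the number of executors registered with this application up and down based on the workload.

dynamic_allocation_min_executors: Optional[int]

The lower bound for the number of executors if dynamic allocation is enabled.

dynamic_allocation_max_executors: Optional[int]

The upper bound for the number of executors if dynamic allocation is enabled.

inputs: Optional[dict[str, Input]]

The mapping of input data bindings used in the job.

outputs: Optional[dict[str, Output]]

The mapping of output data bindings used in the job.

compute: Optional[str]

The compute resource the job runs on.

identity: Optional[Union[dict[str, str], ManagedIdentityConfiguration, AmlTokenConfiguration, UserIdentityConfiguration]]

The identity that the Spark job will use while running on compute.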
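
The memory strings accepted by driver_memory and executor_memory can be checked locally before a job is submitted. The following is a minimal sketch, not part of azure-ai-ml: parse_memory is a hypothetical helper, and it assumes the suffixes are binary multiples (1k = 1024 bytes), as with JVM memory strings.

```python
import re

# Hypothetical helper, not part of azure-ai-ml. It checks the
# "<number><unit>" format described for driver_memory and executor_memory.
_MEMORY_RE = re.compile(r"(\d+)([kmgt])")

# Assumption: suffixes are binary multiples, as in JVM memory strings.
_UNIT_BYTES = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}


def parse_memory(value: str) -> int:
    """Return the size in bytes for strings such as "512m" or "2g"."""
    match = _MEMORY_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid memory string: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_BYTES[unit]


print(parse_memory("512m"))  # 536870912
print(parse_memory("2g"))    # 2147483648
```

A string without one of the four suffixes, such as "2gb" or "1024", raises ValueError in this sketch.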

Examples

Configuring a SparkJob.


   from azure.ai.ml import Input, Output
   from azure.ai.ml.entities import SparkJob

   spark_job = SparkJob(
       code="./sdk/ml/azure-ai-ml/tests/test_configs/dsl_pipeline/spark_job_in_pipeline/basic_src",
       entry={"file": "sampleword.py"},
       conf={
           "spark.driver.cores": 2,
           "spark.driver.memory": "1g",
           "spark.executor.cores": 1,
           "spark.executor.memory": "1g",
           "spark.executor.instances": 1,
       },
       environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu:33",
       inputs={
           "input1": Input(
               type="uri_file", path="azureml://datastores/workspaceblobstore/paths/python/data.csv", mode="direct"
           )
       },
       compute="synapsecompute",
       outputs={"component_out_path": Output(type="uri_folder")},
       args="--input1 ${{inputs.input1}} --output2 ${{outputs.output1}} --my_sample_rate ${{inputs.sample_rate}}",
   )
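
The dynamic allocation parameters describe a lower and an upper bound on the executor count, so a simple consistency check can catch mistakes before submission. The sketch below is illustrative only: check_dynamic_allocation is a hypothetical helper, and the rules it applies (both bounds present, minimum not above maximum, initial count within the bounds) are assumptions drawn from the parameter descriptions rather than the SDK's own validation.

```python
from typing import Optional


def check_dynamic_allocation(
    enabled: Optional[bool],
    min_executors: Optional[int],
    max_executors: Optional[int],
    executor_instances: Optional[int] = None,
) -> None:
    """Raise ValueError if the bounds are inconsistent (hypothetical helper)."""
    if not enabled:
        return  # the bounds only matter when dynamic allocation is enabled
    if min_executors is None or max_executors is None:
        raise ValueError("both executor bounds are needed when dynamic allocation is enabled")
    if min_executors > max_executors:
        raise ValueError("the minimum executor count must not exceed the maximum")
    if executor_instances is not None and not (
        min_executors <= executor_instances <= max_executors
    ):
        raise ValueError("the initial executor count must lie within the bounds")


# Consistent settings: returns without raising.
check_dynamic_allocation(True, 1, 4, executor_instances=2)
```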

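Methods

dump	Dumps the job content into a file in YAML format.
filter_conf_fields	Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in ~azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.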

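dump

Dumps the job content into a file in YAML format.

dump(dest: str | PathLike | IO, **kwargs) -> None

Parameters

dest: Union[PathLike, str, IO[AnyStr]]

Required

The local path or file stream to write the YAML content to. If dest is a file path, a new file will be created. If dest is an open file, the file will be written to directly.

kwargs: dict

Additional arguments to pass to the YAML serializer.

Exceptions

FileExistsError

Raised if dest is a file path and the file already exists.

IOError

Raised if dest is an open file and the file is not writable.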
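
The two forms of dest behave differently, which the following sketch tries to make concrete. It does not call azure-ai-ml: dump_text is a hypothetical stand-in that only mirrors the documented rules (a path creates a new file and fails if one exists, an open stream is written to directly), and the YAML text is placeholder content.

```python
import io
import os
import tempfile
from typing import IO, Union


def dump_text(dest: Union[str, os.PathLike, IO[str]], text: str) -> None:
    """Hypothetical stand-in that follows the documented rules for dest."""
    if isinstance(dest, (str, os.PathLike)):
        # Mode "x" creates a new file and raises FileExistsError if it exists.
        with open(dest, "x", encoding="utf-8") as handle:
            handle.write(text)
    else:
        if not dest.writable():
            raise IOError("dest is not writable")
        dest.write(text)


yaml_text = "type: spark\n"  # placeholder content, not a full job definition

# Case 1: an open stream is written to directly.
stream = io.StringIO()
dump_text(stream, yaml_text)

# Case 2: a path creates a new file; a second call on the same path fails.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spark_job.yml")
    dump_text(path, yaml_text)
    try:
        dump_text(path, yaml_text)
    except FileExistsError:
        print("file already exists")
```

With a real job object the equivalent call would be spark_job.dump(dest), using the signature shown above.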

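filter_conf_fields

Filters out the fields of the conf attribute that are not among the Spark configuration fields listed in ~azure.ai.ml._schema.job.parameterized_spark.CONF_KEY_MAP and returns them in their own dictionary.

filter_conf_fields() -> Dict[str, str]

Returns

A dictionary of the conf fields that are not Spark configuration fields.

Return type

dict[str, str]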
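
The filtering described for filter_conf_fields amounts to keeping only the conf entries whose keys are not recognized Spark configuration fields. The sketch below illustrates that idea in plain Python. KNOWN_SPARK_CONF_KEYS is a hypothetical stand-in for the keys in CONF_KEY_MAP (the real map may contain different entries), and filter_unknown_conf is not the SDK method.

```python
from typing import Dict

# Hypothetical stand-in for the keys listed in CONF_KEY_MAP; the real map may differ.
KNOWN_SPARK_CONF_KEYS = {
    "spark.driver.cores",
    "spark.driver.memory",
    "spark.executor.cores",
    "spark.executor.memory",
    "spark.executor.instances",
}


def filter_unknown_conf(conf: Dict[str, str]) -> Dict[str, str]:
    """Return only the conf entries whose keys are not known Spark fields."""
    return {
        key: value
        for key, value in conf.items()
        if key not in KNOWN_SPARK_CONF_KEYS
    }


conf = {
    "spark.driver.memory": "1g",
    "spark.executor.instances": "1",
    "spark.hypothetical.option": "on",  # made-up key for illustration
}
print(filter_unknown_conf(conf))  # {'spark.hypothetical.option': 'on'}
```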


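Attributes

base_path

The base path of the resource.

Return type

str

creation_context

The creation context of the resource.

Returns

The creation metadata for the resource.

Return type

Optional[SystemData]

entry

environment

The Azure ML environment to run the Spark component or job in.

Return type

Optional[Union[str, Environment]]

id

The resource ID.

Returns

The global ID of the resource, an Azure Resource Manager (ARM) ID.

Return type

Optional[str]

identity

The identity that the Spark job will use while running on compute.

Return type

Optional[Union[ManagedIdentityConfiguration, AmlTokenConfiguration, UserIdentityConfiguration]]

inputs

log_files

Job output files.

Returns

A dictionary of log names and URLs.

Return type

Optional[Dict[str, str]]

outputs

resources

The compute resource configuration for the job.

Return type

Optional[SparkResourceConfiguration]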

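status

The status of the job.

Commonly returned values include "Running", "Completed", and "Failed". All possible values are:

NotStarted - A temporary state that client-side Run objects are in before cloud submission.
Starting - The run has started being processed in the cloud. The caller has a run ID at this point.
Provisioning - On-demand compute is being created for a given job submission.
Preparing - The run environment is being prepared and is in one of two stages:
- Docker image build
- Conda environment setup
Queued - The job is queued on the compute target. For example, in BatchAI, the job is in a queued state while waiting for all the requested nodes to be ready.
Running - The job has started to run on the compute target.
Finalizing - User code execution has completed, and the run is in post-processing stages.
CancelRequested - Cancellation has been requested for the job.
Completed - The run has completed successfully. This includes both the user code execution and the run post-processing stages.
Failed - The run failed. Usually the Error property on the run provides details as to why.
Canceled - Follows a cancellation request and indicates that the run is now successfully canceled.
NotResponding - For runs that have heartbeats enabled, no heartbeat has been sent recently.

Returns

The status of the job.

Return type

Optional[str]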
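
When polling a job, it helps to know which status values are final. The sketch below assumes the service reports the English status strings and treats "Completed", "Failed" and "Canceled" as terminal. That grouping is inferred from the descriptions above and is not an SDK constant, and is_terminal is a hypothetical helper.

```python
from typing import Optional

# Terminal states inferred from the status descriptions (an assumption,
# not a constant exported by azure-ai-ml).
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def is_terminal(status: Optional[str]) -> bool:
    """Return True once a job can no longer change state."""
    return status in TERMINAL_STATUSES


for status in ["Queued", "Running", "Finalizing", "Completed"]:
    print(status, is_terminal(status))
```

A polling loop would re-read the job's status until is_terminal returns True, ideally with a delay between checks.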

studio_url

The Azure ML studio endpoint.

Returns

The URL of the job details page.

Return type

Optional[str]

type

The type of the job.

Return type

Optional[str]

CODE_ID_RE_PATTERN

CODE_ID_RE_PATTERN = re.compile('\\/subscriptions\\/(?P<subscription>[\\w,-]+)\\/resourceGroups\\/(?P<resource_group>[\\w,-]+)\\/providers\\/Microsoft\\.MachineLearningServices\\/workspaces\\/(?P<workspace>[\\w,-]+)\\/codes\\/(?P<co)
