Upgrade pipelines to SDK v2

Article
09/01/2024

In SDK v2, "pipelines" are consolidated into jobs.

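A job has a type. Most jobs are command jobs that run a command, such as python main.py. What runs in a job is agnostic to any programming language, so you can run bash scripts, invoke python interpreters, run a series of curl commands, or anything else.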

A pipeline is another type of job. It defines child jobs that may have input/output relationships, forming a directed acyclic graph (DAG).
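
To make the DAG idea concrete, here is a minimal standard-library sketch (not SDK code). It assumes three hypothetical child jobs named `train`, `score`, and `eval`, where each job lists the jobs whose outputs it consumes, and derives a valid execution order from those input/output relationships.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies):
    """Return child jobs in an order that respects their data dependencies.

    `dependencies` maps each child job to the set of jobs whose outputs
    it consumes. A cycle means the graph is not a DAG, so no valid
    order exists.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        raise ValueError(f"not a DAG: {err.args[1]}") from err


# Hypothetical pipeline: score consumes train's output,
# eval consumes score's output.
dependencies = {
    "train": set(),
    "score": {"train"},
    "eval": {"score"},
}

print(execution_order(dependencies))
```

In SDK v2 you don't declare this ordering yourself. It follows from which outputs you pass as inputs to which child jobs.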

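To upgrade, you need to change the code that defines and submits pipelines to use SDK v2. What you run within a child job doesn't need to be upgraded to SDK v2. However, we recommend removing any code specific to Azure Machine Learning from your model training scripts. This separation makes it easier to move between local and cloud execution and is considered a best practice for mature MLOps. In practice, this means removing azureml.* lines of code. Model logging and tracking code should be replaced with MLflow. For more information, see how to use MLflow in v2.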
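
The sketch below shows one way a training script might look once `azureml.*` imports are removed. It is an illustration under assumptions, not the article's actual train.py: the `train` function, its dummy loss update, and the print fallback used when MLflow isn't installed are all hypothetical. The MLflow calls used are `mlflow.log_param` and `mlflow.log_metric`.

```python
import argparse

try:
    import mlflow  # assumption: available in the training environment
except ImportError:
    mlflow = None  # fall back to printing so the script still runs locally


def log_param(name, value):
    if mlflow is not None:
        mlflow.log_param(name, value)
    else:
        print(f"param {name}={value}")


def log_metric(name, value, step):
    if mlflow is not None:
        mlflow.log_metric(name, value, step=step)
    else:
        print(f"metric {name}={value} (step {step})")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_epochs", type=int, default=5)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    return parser.parse_args(argv)


def train(max_epochs, learning_rate):
    """Dummy training loop: shrink a fake loss each epoch and log it."""
    log_param("learning_rate", learning_rate)
    loss = 1.0
    for epoch in range(max_epochs):
        loss *= 1 - learning_rate
        log_metric("loss", loss, step=epoch)
    return loss


# Explicit argument list for demonstration; a real script would call
# parse_args() with no argument to read the command line.
args = parse_args(["--max_epochs", "3", "--learning_rate", "0.5"])
final_loss = train(args.max_epochs, args.learning_rate)
print(final_loss)
```

Because nothing here depends on the Azure SDK, the same script can run on a laptop or inside a command job without modification.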

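This article compares scenarios in SDK v1 and SDK v2. In the following examples, we build three steps (train, score, and evaluate) into a dummy pipeline job. This demonstrates how to build pipeline jobs with SDK v1 and SDK v2, and how to consume data and pass data between steps.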

Run a pipeline

SDK v1

# import required libraries
import os
import azureml.core
from azureml.core import (
    Workspace,
    Dataset,
    Datastore,
    ComputeTarget,
    Experiment,
    ScriptRunConfig,
)
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline

# check core SDK version number
print("Azure Machine Learning SDK Version: ", azureml.core.VERSION)

# load workspace
workspace = Workspace.from_config()
print(
    "Workspace name: " + workspace.name,
    "Azure region: " + workspace.location,
    "Subscription id: " + workspace.subscription_id,
    "Resource group: " + workspace.resource_group,
    sep="\n",
)

# create an ML experiment
experiment = Experiment(workspace=workspace, name="train_score_eval_pipeline")

# create a directory
script_folder = "./src"

# create compute
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
amlcompute_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    aml_compute = ComputeTarget(workspace=workspace, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS12_V2',
                                                           max_nodes=4)
    aml_compute = ComputeTarget.create(workspace, amlcompute_cluster_name, compute_config)

aml_compute.wait_for_completion(show_output=True)

# define data set
data_urls = ["wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv"]
input_ds = Dataset.File.from_files(data_urls)

# define steps in pipeline
from azureml.data import OutputFileDatasetConfig
model_output = OutputFileDatasetConfig('model_output')
train_step = PythonScriptStep(
    name="train step",
    script_name="train.py",
    arguments=['--training_data', input_ds.as_named_input('training_data').as_mount() ,'--max_epocs', 5, '--learning_rate', 0.1,'--model_output', model_output],
    source_directory=script_folder,
    compute_target=aml_compute,
    allow_reuse=True,
)

score_output = OutputFileDatasetConfig('score_output')
score_step = PythonScriptStep(
    name="score step",
    script_name="score.py",
    arguments=['--model_input',model_output.as_input('model_input'), '--test_data', input_ds.as_named_input('test_data').as_mount(), '--score_output', score_output],
    source_directory=script_folder,
    compute_target=aml_compute,
    allow_reuse=True,
)

eval_output = OutputFileDatasetConfig('eval_output')
eval_step = PythonScriptStep(
    name="eval step",
    script_name="eval.py",
    arguments=['--scoring_result',score_output.as_input('scoring_result'), '--eval_output', eval_output],
    source_directory=script_folder,
    compute_target=aml_compute,
    allow_reuse=True,
)

# build pipeline
from azureml.pipeline.core import Pipeline

pipeline_steps = [train_step, score_step, eval_step]

pipeline = Pipeline(workspace = workspace, steps=pipeline_steps)
print("Pipeline is built.")

pipeline_run = experiment.submit(pipeline, regenerate_outputs=False)

print("Pipeline submitted for execution.")

SDK v2. Full sample link

# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
print(ml_client.compute.get(cluster_name))

# Import components that are defined with Python function
with open("src/components.py") as fin:
    print(fin.read())

# You need to install mldesigner package to use command_component decorator.
# Option 1: install directly
# !pip install mldesigner

# Option 2: install as an extra dependency of azure-ai-ml
# !pip install azure-ai-ml[designer]

# import the components as functions
from src.components import train_model, score_data, eval_model

cluster_name = "cpu-cluster"
# define a pipeline with component
@pipeline(default_compute=cluster_name)
def pipeline_with_python_function_components(input_data, test_data, learning_rate):
    """E2E dummy train-score-eval pipeline with components defined via Python function components"""

    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=input_data, max_epochs=5, learning_rate=learning_rate
    )

    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output, test_data=test_data
    )

    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output
    )

    # Return: pipeline outputs
    return {
        "eval_output": eval_with_sample_data.outputs.eval_output,
        "model_output": train_with_sample_data.outputs.model_output,
    }


pipeline_job = pipeline_with_python_function_components(
    input_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    test_data=Input(
        path="wasbs://demo@dprepdata.blob.core.windows.net/Titanic.csv", type="uri_file"
    ),
    learning_rate=0.1,
)

# submit job to workspace
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="train_score_eval_pipeline"
)
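
The listing above prints src/components.py but its contents aren't reproduced here. As a rough guide to what the three component bodies might do, the sketch below uses plain Python functions with the same parameter names as the pipeline (`training_data`, `model_input`, `test_data`, `scoring_result`, and the `model_output`, `score_output`, `eval_output` folders). The function bodies, file names, and the `run_locally` helper are hypothetical. With mldesigner, each function would additionally be wrapped with the `command_component` decorator; that wiring is omitted so the sketch runs with the standard library alone.

```python
import tempfile
from pathlib import Path


def train_model(training_data, max_epochs, learning_rate, model_output):
    """Write a dummy model file into the model_output folder."""
    out = Path(model_output)
    out.mkdir(parents=True, exist_ok=True)
    (out / "model.txt").write_text(
        f"trained on {training_data} for {max_epochs} epochs at lr={learning_rate}"
    )


def score_data(model_input, test_data, score_output):
    """Read the model folder produced by train_model and write dummy scores."""
    model = (Path(model_input) / "model.txt").read_text()
    out = Path(score_output)
    out.mkdir(parents=True, exist_ok=True)
    (out / "score.txt").write_text(f"scored {test_data} using [{model}]")


def eval_model(scoring_result, eval_output):
    """Read the scores folder produced by score_data and write a dummy report."""
    scores = (Path(scoring_result) / "score.txt").read_text()
    out = Path(eval_output)
    out.mkdir(parents=True, exist_ok=True)
    (out / "eval.txt").write_text(f"evaluated [{scores}]")


def run_locally(data_path, learning_rate=0.1):
    """Chain the three steps through temp folders, mimicking how each
    step's output folder becomes the next step's input."""
    root = Path(tempfile.mkdtemp())
    model_dir = root / "model_output"
    score_dir = root / "score_output"
    eval_dir = root / "eval_output"
    train_model(data_path, 5, learning_rate, str(model_dir))
    score_data(str(model_dir), data_path, str(score_dir))
    eval_model(str(score_dir), str(eval_dir))
    return (eval_dir / "eval.txt").read_text()


print(run_locally("Titanic.csv"))
```

In the pipeline itself, the folders are not created by hand: passing `train_with_sample_data.outputs.model_output` as `model_input` is what tells the service to hand the first step's output location to the second step.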

Mapping of key functionality in SDK v1 and SDK v2

Functionality in SDK v1	Rough mapping in SDK v2
azureml.pipeline.core.Pipeline	azure.ai.ml.dsl.pipeline
OutputDatasetConfig	Output
dataset as_mount	Input
StepSequence	Data dependency

Step and job/component type mapping

Step in SDK v1	Job type in SDK v2	Component type in SDK v2
`adla_step`	None	None
`automl_step`	`automl` job	`automl` component
`azurebatch_step`	None	None
`command_step`	`command` job	`command` component
`data_transfer_step`	None	None
`databricks_step`	None	None
`estimator_step`	`command` job	`command` component
`hyper_drive_step`	`sweep` job	None
`kusto_step`	None	None
`module_step`	None	`command` component
`mpi_step`	`command` job	`command` component
`parallel_run_step`	`Parallel` job	`Parallel` component
`python_script_step`	`command` job	`command` component
`r_script_step`	`command` job	`command` component
`synapse_spark_step`	`spark` job	`spark` component

Published pipelines

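Once a pipeline is up and running, you can publish it so that it runs with different inputs. This was known as published pipelines. Batch endpoints offer a similar but more powerful way to handle multiple assets running under a durable API, which is why the published pipelines functionality has moved to pipeline component deployments in batch endpoints.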

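Batch endpoints decouple the interface (endpoint) from the actual implementation (deployment) and let the user decide which deployment serves as the default implementation of the endpoint. Pipeline component deployments in batch endpoints let users deploy pipeline components instead of pipelines, which makes better use of reusable assets for organizations looking to streamline their MLOps practice.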
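
The separation between endpoint and deployment can be pictured with a small conceptual model. This is a toy sketch in plain Python, not the Azure SDK: the `Endpoint` class, its fields, and the two stand-in deployments are all hypothetical, and real deployments would be pipeline components rather than local functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Endpoint:
    """Stable interface; callers only ever address the endpoint by name."""

    name: str
    deployments: Dict[str, Callable[[dict], str]] = field(default_factory=dict)
    default_deployment: Optional[str] = None

    def invoke(self, inputs: dict, deployment_name: Optional[str] = None) -> str:
        # Use the default deployment unless the caller names one explicitly.
        target = deployment_name or self.default_deployment
        if target not in self.deployments:
            raise KeyError(f"no deployment named {target!r} under {self.name}")
        return self.deployments[target](inputs)


# Two hypothetical implementations behind the same endpoint.
endpoint = Endpoint(name="train-score-eval")
endpoint.deployments["v1"] = lambda inputs: f"v1 ran with {inputs['learning_rate']}"
endpoint.deployments["v2"] = lambda inputs: f"v2 ran with {inputs['learning_rate']}"
endpoint.default_deployment = "v1"

print(endpoint.invoke({"learning_rate": 0.1}))

# Switching the default changes the implementation without changing callers.
endpoint.default_deployment = "v2"
print(endpoint.invoke({"learning_rate": 0.1}))
```

The point of the model is that callers keep using the same endpoint name and inputs while the default deployment behind it changes.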

The following table compares each of the concepts:

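Concept	SDK v1	SDK v2
Pipeline's REST endpoint for invocation	Pipeline endpoint	Batch endpoint
Pipeline's specific version under the endpoint	Published pipeline	Pipeline component deployment
Pipeline's arguments on invocation	Pipeline parameter	Job inputs
Job generated from a published pipeline	Pipeline job	Batch job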

For specific guidance on how to migrate to batch endpoints, see Upgrade pipeline endpoints to SDK v2.

For more information, see the documentation here:
