Tutorial 6: Network isolation with feature store

09/04/2024

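With the Azure Machine Learning managed feature store, you can discover, create, and operationalize features. Features serve as the connective tissue of the machine learning lifecycle, which starts at the prototyping phase, where you experiment with various features. The lifecycle continues to the operationalization phase, where you deploy your models and the inference steps look up the feature data. For more information about feature stores, read the feature store concepts document.

This tutorial describes how to configure secure ingress through a private endpoint and secure egress through a managed virtual network.

Part 1 of this tutorial series showed how to create a feature set specification with custom transformations, and how to use that feature set to generate training data. Part 2 of the series showed how to enable materialization and perform a backfill. Part 2 also showed how to experiment with features as a way to improve model performance. Part 3 showed how a feature store increases agility in the experimentation and training flows. Part 3 also described how to run batch inference. Tutorial 4 explained how to use the feature store for online or real-time inference use cases. Tutorial 5 demonstrated how to develop a feature set with a custom data source. Tutorial 6 shows how to:

Set up the necessary resources for network isolation of a managed feature store.
Create a new feature store resource.
Set up your feature store to support network isolation scenarios.
Update your project workspace (current workspace) to support network isolation scenarios.

Prerequisites

Note

This tutorial uses an Azure Machine Learning notebook with serverless Spark compute.

Complete parts 1 through 4 of this tutorial series.
An Azure Machine Learning workspace, enabled with a managed virtual network for serverless Spark jobs.
To configure your project workspace:
1. Create a YAML file named network.yml: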
```
managed_network:
  isolation_mode: allow_internet_outbound
```
2. Run the following commands to update the workspace and provision the managed virtual network for serverless Spark jobs:
```
az ml workspace update --file network.yml --resource-group my_resource_group --name my_workspace_name
az ml workspace provision-network --resource-group my_resource_group --name my_workspace_name --include-spark
```
For more information, see "Configure for serverless Spark jobs".
Your user account must have the Owner or Contributor role assigned on the resource group where you create the feature store. Your user account also needs the User Access Administrator role.

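Important

For your Azure Machine Learning workspace, set isolation_mode to allow_internet_outbound. This is the only supported network isolation mode. This tutorial shows how to connect securely to sources, the materialization store, and observation data through private endpoints.

Setup

This tutorial uses the Python feature store core SDK (azureml-featurestore). The Python SDK is used only for feature set development and testing. The CLI is used for create, read, update, and delete (CRUD) operations on feature stores, feature sets, and feature store entities. This is useful in continuous integration and continuous delivery (CI/CD) or GitOps scenarios where CLI/YAML is preferred.

You don't need to explicitly install these resources for this tutorial, because the conda.yml file used in the setup instructions shown here covers them.

To prepare the notebook environment for development:

Run the following command to clone the azureml-examples repository to your local machine: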

git clone --depth 1 https://github.com/Azure/azureml-examples

You can also download a ZIP file from the azureml-examples repository. On that page, first select the [Code] dropdown, and then select [Download ZIP]. Then, unzip the contents into a folder on your local device.
Upload the feature store samples directory to the project workspace
1. In the Azure Machine Learning workspace, open the Azure Machine Learning studio UI
2. Select [Notebooks] in the left navigation panel
3. Select your user name in the directory listing
4. Select the ellipsis (...), and then select [Upload folder]
5. Select the feature store samples folder from the cloned directory path: azureml-examples/sdk/python/featurestore_sample
Run the tutorial
- Option 1: Create a new notebook, and execute the instructions in this document step by step
- Option 2: Open the existing notebook featurestore_sample/notebooks/sdk_and_cli/network_isolation/Network-isolation-feature-store.ipynb. Keep this document open, and refer to it for additional explanation and documentation links
  1. In the [Compute] dropdown in the top navigation, select [Serverless Spark Compute]. This operation might take one to two minutes. Wait for the status bar at the top to display [Configure session]
  2. Select [Configure session] in the top status bar
  3. Select [Python packages]
  4. Select [Upload conda file]
  5. Select the azureml-examples/sdk/python/featurestore_sample/project/env/conda.yml file located on your local device
  6. (Optional) Increase the session time-out (idle time in minutes) to reduce how often the serverless Spark cluster has to start up
This code cell starts the Spark session. It takes about 10 minutes to install all dependencies and start the Spark session.
```
# Run this cell to start the Spark session (any code block starts the session). This can take around 10 minutes.
print("start spark session")
```

Set up the root directory for the samples

import os

# Please update your alias below (or any custom directory you have uploaded the samples to).
# You can find the name from the directory structure in the left navigation.
root_dir = "./Users/<your_user_alias>/featurestore_sample"

if os.path.isdir(root_dir):
    print("The folder exists.")
else:
    print("The folder does not exist. Please create or fix the path")

Set up the Azure Machine Learning CLI:
- Install the Azure Machine Learning CLI extension
```
# install azure ml cli extension
!az extension add --name ml
```
- Authenticate
```
# authenticate
!az login
```
- Set the default subscription
```
# Set default subscription
import os

subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]

!az account set -s $subscription_id
```
Note

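A feature store workspace supports feature reuse across projects. A project workspace, the current workspace in use, uses features from a specific feature store to train and inference models. Many project workspaces can share and reuse the same feature store workspace.

Provision the necessary resources

You can create a new Azure Data Lake Storage (ADLS) Gen2 storage account and containers, or reuse existing storage account and container resources for the feature store. In a real-world situation, different storage accounts can host the ADLS Gen2 containers. Both options work, depending on your specific requirements.

For this tutorial, you create three separate storage containers in the same ADLS Gen2 storage account:

Source data
Offline store
Observation data

Create an ADLS Gen2 storage account for source data, offline store, and observation data.

First, provide the name of an Azure Data Lake Storage Gen2 storage account in the following code sample. You can run the code cell with the provided default settings. Optionally, you can override the default settings.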

## Default Setting
# We use the subscription, resource group, region of this active project workspace,
# We hard-coded default resource names for creating new resources

## Overwrite
# You can replace them if you want to create the resources in a different subscription/resourceGroup, or use existing resources
# At the minimum, provide an ADLS Gen2 storage account name for `storage_account_name`

storage_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
storage_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
storage_account_name = "<STORAGE_ACCOUNT_NAME>"

storage_location = "eastus"
storage_file_system_name_offline_store = "offline-store"
storage_file_system_name_source_data = "source-data"
storage_file_system_name_observation_data = "observation-data"
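
Because the later `az storage` commands fail on invalid names, it can help to check the values before running them. The following is a minimal sketch, assuming the standard Azure naming rules (storage account names: 3 to 24 lowercase letters and digits; container names: 3 to 63 lowercase letters, digits, and single hyphens, starting and ending with a letter or digit). The helper names are hypothetical and are not part of the tutorial notebook.

```python
import re

# Storage account names: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# Container names: lowercase alphanumeric groups separated by single hyphens.
_CONTAINER_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if the name satisfies the storage account naming rules."""
    return _ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Return True if the name satisfies the container (file system) naming rules."""
    return 3 <= len(name) <= 63 and _CONTAINER_RE.fullmatch(name) is not None
```

For example, the unreplaced placeholder `<STORAGE_ACCOUNT_NAME>` is rejected, while `offline-store`, `source-data`, and `observation-data` pass the container check.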

This code cell creates the ADLS Gen2 storage account defined in the preceding code cell.

# Create new storage account
!az storage account create --name $storage_account_name --enable-hierarchical-namespace true --resource-group $storage_resource_group_name --location $storage_location --subscription $storage_subscription_id

This code cell creates a new storage container for the offline store.

# Create a new storage container for offline store
!az storage fs create --name $storage_file_system_name_offline_store --account-name $storage_account_name --subscription $storage_subscription_id

This code cell creates a new storage container for the source data.

# Create a new storage container for source data
!az storage fs create --name $storage_file_system_name_source_data --account-name $storage_account_name --subscription $storage_subscription_id

This code cell creates a new storage container for the observation data.

# Create a new storage container for observation data
!az storage fs create --name $storage_file_system_name_observation_data --account-name $storage_account_name --subscription $storage_subscription_id

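Copy the sample data required for this tutorial series into the newly created storage containers.

To write data to the storage containers, follow these steps to ensure that the Contributor and Storage Blob Data Contributor roles are assigned to your user identity on the created ADLS Gen2 storage account in the Azure portal.

Important

After you confirm that the Contributor and Storage Blob Data Contributor roles are assigned to your user identity, wait a few minutes for the permissions to propagate from the role assignment before you proceed with the next steps. For more information about access control, see the page on role-based access control (RBAC) for Azure storage accounts.

The following code cell copies the sample source data for the transactions feature set used in this tutorial from a public storage account to the newly created storage account.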
```
# Copy sample source data for transactions feature set used in this tutorial series from the public storage account to the newly created storage account
transactions_source_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

transactions_src_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/"
)
```

For the accounts feature set used in this tutorial, copy the sample source data to the newly created storage account.

# Copy sample source data for account feature set used in this tutorial series from the public storage account to the newly created storage account
accounts_data_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/datasources/accounts-precalculated/*.parquet"
accounts_data_df = spark.read.parquet(accounts_data_path)

accounts_data_df.write.parquet(
    f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/accounts-precalculated/"
)

Copy the sample observation data used for training from a public storage account to the newly created storage account.

# Copy sample observation data used for training from the public storage account to the newly created storage account
observation_data_train_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/train/*.parquet"
observation_data_train_df = spark.read.parquet(observation_data_train_path)

observation_data_train_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/"
)

Copy the sample observation data used for batch inference from a public storage account to the newly created storage account.

# Copy sample observation data used for batch inference from a public storage account to the newly created storage account
observation_data_inference_path = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp/observation_data/batch_inference/*.parquet"
observation_data_inference_df = spark.read.parquet(observation_data_inference_path)

observation_data_inference_df.write.parquet(
    f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/batch_inference/"
)
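
The four copy cells above follow the same pattern: read Parquet files from the public `wasbs` location and write them to an `abfss` path in the new account. As a sketch only, the source and destination pairs could be built in one place. The helper names `abfss_uri` and `build_copy_plan` are hypothetical; the paths are the ones used in the cells above.

```python
PUBLIC_ROOT = "wasbs://data@azuremlexampledata.blob.core.windows.net/feature-store-prp"


def abfss_uri(file_system: str, account: str, path: str) -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 file system."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def build_copy_plan(account: str, source_fs: str, observation_fs: str) -> list:
    """Return (public source glob, private destination URI) pairs for the sample data."""
    # (sub-path under the public root, destination file system, destination folder)
    datasets = [
        ("datasources/transactions-source", source_fs, "transactions-source/"),
        ("datasources/accounts-precalculated", source_fs, "accounts-precalculated/"),
        ("observation_data/train", observation_fs, "train/"),
        ("observation_data/batch_inference", observation_fs, "batch_inference/"),
    ]
    return [
        (f"{PUBLIC_ROOT}/{src}/*.parquet", abfss_uri(fs, account, dest))
        for src, fs, dest in datasets
    ]


# In the notebook, each pair could then be copied with the existing Spark session:
# for src, dst in build_copy_plan(storage_account_name,
#                                 storage_file_system_name_source_data,
#                                 storage_file_system_name_observation_data):
#     spark.read.parquet(src).write.parquet(dst)
```

Keeping the mapping in a single list makes it easier to see which container each dataset lands in before public network access is disabled in the next step.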

Disable public network access on the newly created storage account.

This code cell disables public network access for the ADLS Gen2 storage account created earlier.

# Disable the public network access for the above created ADLS Gen2 storage account
!az storage account update --name $storage_account_name --resource-group $storage_resource_group_name --subscription $storage_subscription_id --public-network-access disabled

Set the ARM IDs for the offline store, source data, and observation data containers.

# set the container arm id
offline_store_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_offline_store,
)

print(offline_store_gen2_container_arm_id)

source_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_source_data,
)

print(source_data_gen2_container_arm_id)

observation_data_gen2_container_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/{account}/blobServices/default/containers/{container}".format(
    sub_id=storage_subscription_id,
    rg=storage_resource_group_name,
    account=storage_account_name,
    container=storage_file_system_name_observation_data,
)

print(observation_data_gen2_container_arm_id)
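
The three ARM IDs above differ only in the container name, so the format string could be factored into a small function. This is an illustrative sketch; `container_arm_id` is a hypothetical helper name, and the ID layout is copied from the cell above.

```python
def container_arm_id(sub_id: str, rg: str, account: str, container: str) -> str:
    """Build the ARM ID of a storage container, using the layout from the tutorial cell."""
    return (
        f"/subscriptions/{sub_id}/resourceGroups/{rg}"
        f"/providers/Microsoft.Storage/storageAccounts/{account}"
        f"/blobServices/default/containers/{container}"
    )


# Example with placeholder values:
example_id = container_arm_id("<sub-id>", "my-rg", "mystorage", "offline-store")
print(example_id)
```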

Create a feature store with materialization enabled

Set the feature store parameters

Set the feature store name, location, subscription ID, resource group name, and ARM ID values, as shown in this code cell sample:

# We use the subscription, resource group, region of this active project workspace.
# Optionally, you can replace them to create the resources in a different subscription/resourceGroup, or use existing resources
import os

# At the minimum, define a name for the feature store
featurestore_name = "<FEATURESTORE_NAME>"
# It is recommended to create featurestore in the same location as the storage
featurestore_location = storage_location
featurestore_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
featurestore_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]

feature_store_arm_id = "/subscriptions/{sub_id}/resourceGroups/{rg}/providers/Microsoft.MachineLearningServices/workspaces/{ws_name}".format(
    sub_id=featurestore_subscription_id,
    rg=featurestore_resource_group_name,
    ws_name=featurestore_name,
)

This code cell generates a YAML specification file for a feature store with materialization enabled.

# The below code creates a feature store with enabled materialization
import yaml

config = {
    "$schema": "http://azureml/sdk-2-0/FeatureStore.json",
    "name": featurestore_name,
    "location": featurestore_location,
    "compute_runtime": {"spark_runtime_version": "3.2"},
    "offline_store": {
        "type": "azure_data_lake_gen2",
        "target": offline_store_gen2_container_arm_id,
    },
}

feature_store_yaml = root_dir + "/featurestore/featurestore_with_offline_setting.yaml"

with open(feature_store_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

Create the feature store

This code cell uses the YAML specification file generated in the previous step to create a feature store with materialization enabled.

!az ml feature-store create --file $feature_store_yaml --subscription $featurestore_subscription_id --resource-group $featurestore_resource_group_name

Initialize the Azure Machine Learning feature store core SDK client

The SDK client initialized in this cell facilitates the development and consumption of features:

# feature store client
from azureml.featurestore import FeatureStoreClient
from azure.ai.ml.identity import AzureMLOnBehalfOfCredential

featurestore = FeatureStoreClient(
    credential=AzureMLOnBehalfOfCredential(),
    subscription_id=featurestore_subscription_id,
    resource_group_name=featurestore_resource_group_name,
    name=featurestore_name,
)

Assign roles to your user identity on the feature store

Follow these steps to get the Microsoft Entra object ID for your user identity. Then, use that object ID in the next command to assign the AzureML Data Scientist role to your user identity on the created feature store.

your_aad_objectid = "<YOUR_AAD_OBJECT_ID>"

!az role assignment create --role "AzureML Data Scientist" --assignee-object-id $your_aad_objectid --assignee-principal-type User --scope $feature_store_arm_id

Obtain the default storage account and key vault for the feature store, and disable public network access to the corresponding resources

The following code cell returns the feature store object for the next steps.

fs = featurestore.feature_stores.get()

This code cell returns the names of the default storage account and key vault for the feature store.

# Copy the properties storage_account and key_vault from the response returned in feature store show command respectively
default_fs_storage_account_name = fs.storage_account.rsplit("/", 1)[-1]
default_key_vault_name = fs.key_vault.rsplit("/", 1)[-1]
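
The `rsplit("/", 1)[-1]` calls above rely on the resource name being the last segment of an ARM ID. As an illustration of that structure, the following sketch also extracts the subscription and resource group. `parse_arm_id` is a hypothetical helper, and the example ID is made up.

```python
def parse_arm_id(arm_id: str) -> dict:
    """Extract the resource name, subscription ID, and resource group from an ARM ID."""
    parts = arm_id.strip("/").split("/")
    # ARM ID segment keys are matched case-insensitively (resourceGroups vs. resourcegroups).
    lowered = [p.lower() for p in parts]
    result = {"name": parts[-1]}
    for key, label in (("subscriptions", "subscription_id"), ("resourcegroups", "resource_group")):
        if key in lowered:
            idx = lowered.index(key)
            if idx + 1 < len(parts):
                result[label] = parts[idx + 1]
    return result


# Made-up example ID for illustration:
info = parse_arm_id(
    "/subscriptions/0000/resourceGroups/my-rg/providers/Microsoft.KeyVault/vaults/my-kv"
)
print(info["name"])
```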

This code cell disables public network access to the default storage account for the feature store.

# Disable the public network access for the above created default ADLS Gen2 storage account for the feature store
!az storage account update --name $default_fs_storage_account_name --resource-group $featurestore_resource_group_name --subscription $featurestore_subscription_id --public-network-access disabled

The following cell prints the name of the default key vault for the feature store.

print(default_key_vault_name)

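Disable public network access for the default feature store key vault created earlier

In the Azure portal, open the default key vault whose name the previous cell printed.
Select the [Networking] tab.
Select [Disable public access], and then select [Apply] at the bottom left of the page.

Enable the managed virtual network for the feature store workspace

Update the feature store with the necessary outbound rules

The following code cell creates a YAML specification file for the outbound rules defined for the feature store.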

# The below code creates a configuration for managed virtual network for the feature store
import yaml

config = {
    "public_network_access": "disabled",
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # You need to add multiple rules here if you have separate storage account for source, observation data and offline store.
            {
                "name": "sourcerulefs",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # This rule is added currently because serverless Spark doesn't automatically create a private endpoint to default key vault.
            {
                "name": "defaultkeyvault",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
        ],
    },
}

feature_store_managed_vnet_yaml = (
    root_dir + "/featurestore/feature_store_managed_vnet_config.yaml"
)

with open(feature_store_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)
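
Every outbound rule in this tutorial has the same shape: a name, the `private_endpoint` type, and a destination with a resource ID, a subresource target, and the `spark_enabled` flag. A minimal sketch of a builder for such rules follows. `private_endpoint_rule` is a hypothetical helper, and it keeps `spark_enabled` as the string `"true"` to mirror the cell above.

```python
def private_endpoint_rule(
    name: str,
    service_resource_id: str,
    subresource_target: str,
    spark_enabled: bool = True,
) -> dict:
    """Build one private-endpoint outbound rule in the shape used by the config dict."""
    return {
        "name": name,
        "destination": {
            # The tutorial config passes this flag as a string, so the sketch does too.
            "spark_enabled": "true" if spark_enabled else "false",
            "subresource_target": subresource_target,
            "service_resource_id": service_resource_id,
        },
        "type": "private_endpoint",
    }


# Example with a placeholder resource ID:
rule = private_endpoint_rule(
    "sourcerulefs",
    "/subscriptions/<sub-id>/resourcegroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
    "dfs",
)
```

With a helper like this, adding a rule for a separate storage account is one more function call in the `outbound_rules` list.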

This code cell uses the generated YAML specification file to update the feature store.

!az ml feature-store update --file $feature_store_managed_vnet_yaml --name $featurestore_name --resource-group $featurestore_resource_group_name

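Create private endpoints for the defined outbound rules

The provision-network command creates private endpoints from the managed virtual network, where the materialization job executes, to the source, offline store, observation data, default storage account, and default key vault for the feature store. This command might need about 20 minutes to complete.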

#### Provision network to create necessary private endpoints (it may take approximately 20 minutes)
!az ml feature-store provision-network --name $featurestore_name --resource-group $featurestore_resource_group_name --include-spark

This code cell confirms the creation of the private endpoints defined by the outbound rules.

### Check that managed virtual network is correctly enabled
### After provisioning the network, all the outbound rules should become active
### For this tutorial, you will see 6 outbound rules
!az ml feature-store show --name $featurestore_name --resource-group $featurestore_resource_group_name
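
Reading the rule statuses by eye in the `show` output is error-prone, so the check could be scripted. The sketch below assumes the command output is JSON, and that each entry under `managed_network.outbound_rules` carries `name` and `status` fields with `Active` as the provisioned state. These field names are assumptions; verify them against your own CLI output. `inactive_outbound_rules` is a hypothetical helper.

```python
import json


def inactive_outbound_rules(show_output: str) -> list:
    """Return the names of outbound rules whose status is not 'Active'."""
    doc = json.loads(show_output)
    # Tolerate a missing or null managed_network / outbound_rules section.
    rules = (doc.get("managed_network") or {}).get("outbound_rules") or []
    return [r.get("name") for r in rules if str(r.get("status", "")).lower() != "active"]


# Made-up sample output for illustration:
sample = json.dumps(
    {
        "managed_network": {
            "outbound_rules": [
                {"name": "sourcerulefs", "status": "Active"},
                {"name": "defaultkeyvault", "status": "Inactive"},
            ]
        }
    }
)
print(inactive_outbound_rules(sample))
```

An empty result means that every listed rule reports as active under these assumptions.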

Update the managed virtual network for the project workspace

Next, update the managed virtual network for the project workspace. First, get the subscription ID, resource group, and workspace name for the project workspace.

# lookup the subscription id, resource group and workspace name of the current workspace
project_ws_sub_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
project_ws_rg = os.environ["AZUREML_ARM_RESOURCEGROUP"]
project_ws_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]

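Update the project workspace with the necessary outbound rules

The project workspace needs access to these resources:

Source data
Offline store
Observation data
Feature store
Default storage account of the feature store

This code cell creates a YAML specification file with the required outbound rules for the project workspace.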

# The below code creates a configuration for managed virtual network for the project workspace
import yaml

config = {
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            # In case you have separate storage accounts for source, observation data and offline store, you need to add multiple rules here. No action needed otherwise.
            {
                "name": "projectsourcerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "dfs",
                    "service_resource_id": f"/subscriptions/{storage_subscription_id}/resourcegroups/{storage_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to default storage of feature store
            {
                "name": "defaultfsstoragerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "blob",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Storage/storageAccounts/{default_fs_storage_account_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to default key vault of feature store
            {
                "name": "defaultfskeyvaultrule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "vault",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.Keyvault/vaults/{default_key_vault_name}",
                },
                "type": "private_endpoint",
            },
            # Rule to create private endpoint to feature store
            {
                "name": "featurestorerule",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "amlworkspace",
                    "service_resource_id": f"/subscriptions/{featurestore_subscription_id}/resourcegroups/{featurestore_resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{featurestore_name}",
                },
                "type": "private_endpoint",
            },
        ],
    }
}

project_ws_managed_vnet_yaml = (
    root_dir + "/featurestore/project_ws_managed_vnet_config.yaml"
)

with open(project_ws_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

This code cell updates the project workspace using the generated YAML specification file with the outbound rules.

#### Update project workspace to create private endpoints for the defined outbound rules (it may take approximately 15 minutes)
!az ml workspace update --file $project_ws_managed_vnet_yaml --name $project_ws_name --resource-group $project_ws_rg

This code cell confirms the creation of the private endpoints defined by the outbound rules.

!az ml workspace show --name $project_ws_name --resource-group $project_ws_rg

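You can also verify the outbound rules from the Azure portal. Navigate to [Networking] from the left navigation panel of the project workspace, and then open the [Workspace managed outbound access] tab.

Prototype and develop a transaction rolling aggregation feature set

Explore the transactions source data

Note

The sample data used in this tutorial is hosted in a publicly accessible blob container. It can only be read in Spark via the wasbs driver. When you create feature sets using your own source data, host that data in an ADLS Gen2 account, and use the abfss driver in the data path.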

# remove the "." in the root directory path as we need to generate absolute path to read from Spark
transactions_source_data_path = f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/*.parquet"
transactions_src_df = spark.read.parquet(transactions_source_data_path)

display(transactions_src_df.head(5))
# Note: display(transactions_src_df.head(5)) displays the timestamp column in a different format. You can call transactions_src_df.show() to see correctly formatted values

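Locally develop a transactions feature set

A feature set specification is a self-contained feature set definition that you can develop and test locally.

Here, you create these rolling window aggregate features:

Transactions three-day count
Transactions amount three-day sum
Transactions amount three-day avg
Transactions seven-day count
Transactions amount seven-day sum
Transactions amount seven-day avg

Inspect the feature transformation code file featurestore/featuresets/transactions/spec/transformation_code/transaction_transform.py. This Spark transformer performs the rolling aggregation defined for the features.

For more information about feature sets and transformations, see the feature store concepts article.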
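
To make the rolling-window idea concrete, here is a plain-Python sketch of count, sum, and average over a trailing window per account. It is an illustration only, not the Spark transformer from the sample: the function name, the output column names, and the window boundary (open at the start, closed at the row's own timestamp) are assumptions.

```python
from datetime import datetime, timedelta


def rolling_aggregates(transactions, window_days):
    """Compute trailing-window count, sum, and average for each transaction row.

    transactions: list of (account_id, timestamp, amount) tuples.
    The window for a row covers (timestamp - window_days, timestamp] for the same account.
    """
    window = timedelta(days=window_days)
    rows = []
    for account_id, ts, _ in transactions:
        amounts = [
            amt
            for acc, t, amt in transactions
            if acc == account_id and ts - window < t <= ts
        ]
        total = sum(amounts)
        rows.append(
            {
                "accountID": account_id,
                "timestamp": ts,
                f"transaction_{window_days}d_count": len(amounts),
                f"transaction_amount_{window_days}d_sum": total,
                # The row itself is always inside its window, so len(amounts) >= 1.
                f"transaction_amount_{window_days}d_avg": total / len(amounts),
            }
        )
    return rows


sample_transactions = [
    ("A", datetime(2023, 1, 1), 10.0),
    ("A", datetime(2023, 1, 2), 20.0),
    ("A", datetime(2023, 1, 5), 30.0),
    ("B", datetime(2023, 1, 2), 5.0),
]
three_day = rolling_aggregates(sample_transactions, 3)
seven_day = rolling_aggregates(sample_transactions, 7)
```

The quadratic scan is fine for a handful of rows; the Spark transformer in the sample performs the equivalent aggregation at scale.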

from azureml.featurestore import create_feature_set_spec, FeatureSetSpec
from azureml.featurestore.contracts import (
    DateTimeOffset,
    FeatureSource,
    TransformationCode,
    Column,
    ColumnType,
    SourceType,
    TimestampColumn,
)


transactions_featureset_code_path = (
    root_dir + "/featurestore/featuresets/transactions/transformation_code"
)

transactions_featureset_spec = create_feature_set_spec(
    source=FeatureSource(
        type=SourceType.parquet,
        path=f"abfss://{storage_file_system_name_source_data}@{storage_account_name}.dfs.core.windows.net/transactions-source/*.parquet",
        timestamp_column=TimestampColumn(name="timestamp"),
        source_delay=DateTimeOffset(days=0, hours=0, minutes=20),
    ),
    transformation_code=TransformationCode(
        path=transactions_featureset_code_path,
        transformer_class="transaction_transform.TransactionFeatureTransformer",
    ),
    index_columns=[Column(name="accountID", type=ColumnType.string)],
    source_lookback=DateTimeOffset(days=7, hours=0, minutes=0),
    temporal_join_lookback=DateTimeOffset(days=1, hours=0, minutes=0),
    infer_schema=True,
)
# Generate a spark dataframe from the feature set specification
transactions_fset_df = transactions_featureset_spec.to_spark_dataframe()
# display few records
display(transactions_fset_df.head(5))

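Export a feature set specification

To register a feature set specification with the feature store, you must save that specification in a specific format.

To inspect the generated transactions feature set specification, open this file from the file tree:

featurestore/featuresets/transactions/spec/FeaturesetSpec.yaml

The specification has these elements:

source: a reference to a storage resource; in this case, a Parquet file in a blob storage resource
features: a list of features and their data types. If you provide transformation code, the code must return a DataFrame that maps to the features and data types
index_columns: the join keys required to access values from the feature set

An additional benefit of persisting a feature set specification as a YAML file is that the specification can be version controlled. For more information about feature set specifications, see the document on top-level feature store entities and the feature set specification YAML reference.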

import os

# create a new folder to dump the feature set specification
transactions_featureset_spec_folder = (
    root_dir + "/featurestore/featuresets/transactions/spec"
)

# check if the folder exists, create one if not
if not os.path.exists(transactions_featureset_spec_folder):
    os.makedirs(transactions_featureset_spec_folder)

transactions_featureset_spec.dump(transactions_featureset_spec_folder, overwrite=True)

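Register a feature store entity

Entities help enforce the use of the same join key definitions across feature sets that use the same logical entities. Examples of entities include account entities and customer entities. Entities are typically created once and then reused across feature sets. For more information, see the document on top-level feature store entities.

This code cell creates an account entity for the feature store.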

account_entity_path = root_dir + "/featurestore/entities/account.yaml"
!az ml feature-store-entity create --file $account_entity_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

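Register the transactions feature set with the feature store, and submit a materialization job

To share and reuse a feature set asset, you must first register that asset with the feature store. Feature set asset registration offers managed capabilities, including versioning and materialization. This tutorial series covers these topics.

The feature set asset references both the feature set specification that you created earlier and other properties, such as version and materialization settings.

Create a feature set

The following code cell uses a predefined YAML specification file to create a feature set.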

transactions_featureset_path = (
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_offline_enabled.yaml"
)
!az ml feature-set create --file $transactions_featureset_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

This code cell previews the newly created feature set.

# Preview the newly created feature set

!az ml feature-set show --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name -n transactions -v 1

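Submit a backfill materialization job

The following code cell defines the start and end time values for the feature materialization window, and submits a backfill materialization job.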

feature_window_start_time = "2023-02-01T00:00.000Z"
feature_window_end_time = "2023-03-01T00:00.000Z"

!az ml feature-set backfill --name transactions --version 1 --by-data-status "['None']" --workspace-name $featurestore_name --resource-group $featurestore_resource_group_name --feature-window-start-time $feature_window_start_time --feature-window-end-time $feature_window_end_time

This code cell checks the status of the backfill materialization job. Replace <JOB_ID_FROM_PREVIOUS_COMMAND> with the job ID returned by the previous command.

### Check the job status

!az ml job show --name <JOB_ID_FROM_PREVIOUS_COMMAND> -g $featurestore_resource_group_name -w $featurestore_name
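
When waiting for the backfill to finish, it helps to know whether the reported status is final. The sketch below assumes the `az ml job show` output is JSON with a top-level `status` field, and that `Completed`, `Failed`, and `Canceled` are the terminal states. Treat both as assumptions to confirm against your own output. `job_state` is a hypothetical helper.

```python
import json

# Assumed terminal job states; any other status is treated as still in progress.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def job_state(show_output: str) -> tuple:
    """Return (status, is_terminal) parsed from the JSON output of the job show command."""
    status = json.loads(show_output).get("status")
    return status, status in TERMINAL_STATES


# Made-up sample outputs for illustration:
print(job_state('{"name": "example-job", "status": "Running"}'))
print(job_state('{"name": "example-job", "status": "Completed"}'))
```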

This code cell lists all the materialization jobs for the current feature set.

### List all the materialization jobs for the current feature set

!az ml feature-set list-materialization-operation --name transactions --version 1 -g $featurestore_resource_group_name -w $featurestore_name

Attach Azure Cache for Redis as an online store

Create an Azure Cache for Redis

In the following code cell, define the name of the Azure Cache for Redis that you want to create or reuse. Optionally, you can override the other default settings.

redis_subscription_id = os.environ["AZUREML_ARM_SUBSCRIPTION"]
redis_resource_group_name = os.environ["AZUREML_ARM_RESOURCEGROUP"]
redis_name = "my-redis"
redis_location = storage_location

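You can select the Redis cache tier (Basic, Standard, or Premium). You must choose a SKU family that is available for the selected cache tier. For more information about how the choice of tier affects cache performance, see this documentation resource. For more information about pricing for the different SKU tiers and families of Azure Cache for Redis, see this documentation resource.

Run the following code cell to create an Azure Cache for Redis with the Premium tier, SKU family P, and cache capacity 2. It might take about 5 to 10 minutes to provision the Redis instance.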

# Create new redis cache
from azure.mgmt.redis import RedisManagementClient
from azure.mgmt.redis.models import RedisCreateParameters, Sku, SkuFamily, SkuName

management_client = RedisManagementClient(
    AzureMLOnBehalfOfCredential(), redis_subscription_id
)

# It usually takes about 5 - 10 min to finish the provision of the Redis instance.
# If the following begin_create() call still hangs for longer than that,
# please check the status of the Redis instance on the Azure portal and cancel the cell if the provision has completed.
# This sample uses a PREMIUM tier Redis SKU from family P, which may cost more than a STANDARD tier SKU from family C.
# Please choose the SKU tier and family according to your performance and pricing requirements.

redis_arm_id = (
    management_client.redis.begin_create(
        resource_group_name=redis_resource_group_name,
        name=redis_name,
        parameters=RedisCreateParameters(
            location=redis_location,
            sku=Sku(name=SkuName.PREMIUM, family=SkuFamily.P, capacity=2),
            public_network_access="Disabled",  # can only disable PNA to redis cache during creation
        ),
    )
    .result()
    .id
)
print(redis_arm_id)

Update the feature store with the online store

Attach the Azure Cache for Redis to the feature store, to use it as the online materialization store. The following code cell creates a YAML specification file with the online store outbound rule defined for the feature store.

# The following code cell creates a YAML specification file for outbound rules that are defined for the feature store.
## rule 1: PE to online store (redis cache): this is optional if online store is not used

import yaml

config = {
    "public_network_access": "disabled",
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            {
                "name": "sourceruleredis",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "redisCache",
                    # Use the Redis subscription and resource group, which can differ from the storage account's.
                    "service_resource_id": f"/subscriptions/{redis_subscription_id}/resourcegroups/{redis_resource_group_name}/providers/Microsoft.Cache/Redis/{redis_name}",
                },
                "type": "private_endpoint",
            },
        ],
    },
    "online_store": {"target": f"{redis_arm_id}", "type": "redis"},
}

feature_store_managed_vnet_yaml = (
    root_dir + "/featurestore/feature_store_managed_vnet_config.yaml"
)

with open(feature_store_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)

The following code cell updates the feature store using the generated YAML specification file with the outbound rule for the online store.

!az ml feature-store update --file $feature_store_managed_vnet_yaml --name $featurestore_name --resource-group $featurestore_resource_group_name

Update the project workspace outbound rules

The project workspace needs access to the online store. The following code cell creates a YAML specification file with the required outbound rule for the project workspace.

import yaml

config = {
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [
            {
                "name": "onlineruleredis",
                "destination": {
                    "spark_enabled": "true",
                    "subresource_target": "redisCache",
                    # Use the Redis subscription and resource group, which can differ from the storage account's.
                    "service_resource_id": f"/subscriptions/{redis_subscription_id}/resourcegroups/{redis_resource_group_name}/providers/Microsoft.Cache/Redis/{redis_name}",
                },
                "type": "private_endpoint",
            },
        ],
    }
}

project_ws_managed_vnet_yaml = (
    root_dir + "/featurestore/project_ws_managed_vnet_config.yaml"
)

with open(project_ws_managed_vnet_yaml, "w") as outfile:
    yaml.dump(config, outfile, default_flow_style=False)
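The feature store configuration and the project workspace configuration carry the same Redis private endpoint rule under different names. As a minimal sketch, a small helper can build that rule once. The helper names `build_redis_resource_id` and `private_endpoint_rule` are hypothetical and not part of the Azure SDK or CLI; the dictionary keys mirror the ones used in the cells above, and the IDs are placeholders.

```python
def build_redis_resource_id(subscription_id: str, resource_group: str, redis_name: str) -> str:
    """Return the ARM resource ID string for an Azure Cache for Redis instance."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/providers/Microsoft.Cache/Redis/{redis_name}"
    )


def private_endpoint_rule(name: str, service_resource_id: str) -> dict:
    """Build one private-endpoint outbound rule with the same keys as the cells above."""
    return {
        "name": name,
        "destination": {
            "spark_enabled": "true",
            "subresource_target": "redisCache",
            "service_resource_id": service_resource_id,
        },
        "type": "private_endpoint",
    }


# Placeholder values for illustration only.
example_redis_id = build_redis_resource_id("<subscription-id>", "<resource-group>", "<redis-name>")

project_ws_config = {
    "managed_network": {
        "isolation_mode": "allow_internet_outbound",
        "outbound_rules": [private_endpoint_rule("onlineruleredis", example_redis_id)],
    }
}
```

The resulting dictionary has the same shape as the hand-written `config` and can be passed to `yaml.dump` in the same way.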

Run the following code cell to update the project workspace by using the generated YAML specification file, which contains the outbound rule for the online store.

# Update the project workspace to create private endpoints for the defined outbound rules (this might take approximately 15 minutes)
!az ml workspace update --file $project_ws_managed_vnet_yaml --name $project_ws_name --resource-group $project_ws_rg

Materialize the transactions feature set to the online store

The following code cell enables online materialization for the transactions feature set.

# Update featureset to enable online materialization
transactions_featureset_path = (
    root_dir
    + "/featurestore/featuresets/transactions/featureset_asset_online_enabled.yaml"
)
!az ml feature-set update --file $transactions_featureset_path --resource-group $featurestore_resource_group_name --workspace-name $featurestore_name

The following code cell defines the start and end times of the feature materialization window and submits a backfill materialization job.

feature_window_start_time = "2024-01-24T00:00:00.000Z"
feature_window_end_time = "2024-01-25T00:00:00.000Z"

!az ml feature-set backfill --name transactions --version 1 --by-data-status "['None']" --feature-window-start-time $feature_window_start_time --feature-window-end-time $feature_window_end_time --feature-store-name $featurestore_name --resource-group $featurestore_resource_group_name
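The feature window bounds are plain UTC timestamp strings. As a sketch, they can be derived from `datetime` values rather than typed by hand, which makes a one-day window explicit. The helper name `to_utc_timestamp` is hypothetical, and the output format (seconds, milliseconds, and a `Z` suffix) is an assumption; confirm the format that the CLI accepts before relying on it.

```python
from datetime import datetime, timedelta, timezone


def to_utc_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC string with milliseconds."""
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


# A one-day window starting at midnight UTC on 2024-01-24.
window_start = datetime(2024, 1, 24, tzinfo=timezone.utc)
window_end = window_start + timedelta(days=1)

feature_window_start_time = to_utc_timestamp(window_start)
feature_window_end_time = to_utc_timestamp(window_end)
print(feature_window_start_time, feature_window_end_time)
```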

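Use the registered features to generate training data

Load observation data

First, explore the observation data. The core data used for training and inference typically includes observation data. That data is then joined with feature data to create the complete training data resource. Observation data is the data captured at the time of the event. In this case, it holds core transaction data, including transaction ID, account ID, and transaction amount values. Because the observation data is used for training here, it also includes the target variable (is_fraud).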

observation_data_path = f"abfss://{storage_file_system_name_observation_data}@{storage_account_name}.dfs.core.windows.net/train/*.parquet"
observation_data_df = spark.read.parquet(observation_data_path)
obs_data_timestamp_column = "timestamp"

display(observation_data_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call observation_data_df.show() to see the correctly formatted values

Get the registered feature set and list its features

Next, get a feature set by providing its name and version, and then list the features in that feature set. Also, print some sample feature values.

# look up the featureset by providing name and version
transactions_featureset = featurestore.feature_sets.get("transactions", "1")
# list its features
transactions_featureset.features

# print sample values
display(transactions_featureset.to_spark_dataframe().head(5))

Select features and generate training data

Select features for the training data, and use the feature store SDK to generate the training data.

from azureml.featurestore import get_offline_features

# you can select features in pythonic way
features = [
    transactions_featureset.get_feature("transaction_amount_7d_sum"),
    transactions_featureset.get_feature("transaction_amount_7d_avg"),
]

# you can also specify features in string form: featureset:version:feature
more_features = [
    "transactions:1:transaction_3d_count",
    "transactions:1:transaction_amount_3d_avg",
]

more_features = featurestore.resolve_feature_uri(more_features)
features.extend(more_features)

# generate training dataframe by using feature data and observation data
training_df = get_offline_features(
    features=features,
    observation_data=observation_data_df,
    timestamp_column=obs_data_timestamp_column,
)

# Ignore the message that says feature set is not materialized (materialization is optional). We will enable materialization in the next part of the tutorial.
display(training_df)
# Note: the timestamp column is displayed in a different format. Optionally, you can call training_df.show() to see the correctly formatted values

A point-in-time join appended the features to the training data.
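To make the point-in-time join concrete, here is a small pure-Python sketch of the idea: for each observation row, take the most recent feature row for the same key whose timestamp is not later than the observation timestamp. This illustrates the concept only. It is not how `get_offline_features` is implemented, it ignores any additional settings the feature store may apply, and the function name, account IDs, and values are invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def point_in_time_join(observations, feature_rows):
    """Attach to each observation the latest feature value at or before its timestamp.

    observations: list of (account_id, timestamp) tuples
    feature_rows: list of (account_id, timestamp, value) tuples
    """
    # Group feature rows per key, sorted by timestamp, so binary search can be used.
    by_key = defaultdict(list)
    for account_id, ts, value in sorted(feature_rows, key=lambda row: (row[0], row[1])):
        by_key[account_id].append((ts, value))

    joined = []
    for account_id, obs_ts in observations:
        rows = by_key.get(account_id, [])
        timestamps = [ts for ts, _ in rows]
        # Index of the first feature row strictly after the observation time.
        idx = bisect_right(timestamps, obs_ts)
        value = rows[idx - 1][1] if idx > 0 else None  # None: no feature value known yet
        joined.append((account_id, obs_ts, value))
    return joined


feature_rows = [
    ("A1", datetime(2024, 1, 1), 10.0),
    ("A1", datetime(2024, 1, 3), 25.0),
    ("A2", datetime(2024, 1, 2), 7.5),
]
observations = [
    ("A1", datetime(2024, 1, 2)),  # sees only the Jan 1 value
    ("A1", datetime(2024, 1, 4)),  # sees the Jan 3 value
    ("A2", datetime(2024, 1, 1)),  # earlier than any feature row for A2
]

for row in point_in_time_join(observations, feature_rows):
    print(row)
```

Restricting each training row to feature values available at or before its observation time keeps future information from leaking into the training data.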

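Optional next steps

Now that you have created a secured feature store and submitted a successful materialization run, you can continue through the tutorial series to deepen your understanding of the feature store.

This tutorial combines steps from tutorials 1 and 2 of this series. To preserve network isolation, replace the public storage containers that the other tutorial notebooks require with the ones created in this tutorial notebook.

This concludes the tutorial. Your training data uses features from the feature store. You can save that data to storage for later use, or run model training on it directly.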
