Perform hyperparameter tuning in Fabric (preview)

Article
04/03/2024

Hyperparameter tuning is the process of finding the best values for the parameters of a machine learning model that affect its performance. It can be challenging and time-consuming, especially with complex models and large datasets. In this article, we show how to perform hyperparameter tuning in Fabric.

In this tutorial, we use the California housing dataset, which contains the median house value and other features for different census blocks in California. After preparing the data, we train a SynapseML LightGBM model to predict house values from the features. Next, we use FLAML, a fast and lightweight AutoML library, to find the best hyperparameters for the LightGBM model. Finally, we compare the results of the tuned model with a baseline model that uses the default parameters.

Important

This feature is in preview.

Prerequisites

Get a Microsoft Fabric subscription, or sign up for a free Microsoft Fabric trial.
Sign in to Microsoft Fabric.
Use the experience switcher on the left side of your home page to switch to the Synapse Data Science experience.

Create a new Fabric environment, or make sure that you are running on Fabric Runtime 1.2 (Spark 3.4 or later, and Delta 2.4).
Create a new notebook.
Attach your notebook to a lakehouse. On the left side of your notebook, select Add to add an existing lakehouse or create a new one.

Prepare training and test datasets

In this section, we prepare the training and test datasets for the LightGBM model. We use the California housing dataset from Sklearn. We create a Spark DataFrame from the data and use VectorAssembler to combine the features into a single vector column.

from sklearn.datasets import fetch_california_housing
from pyspark.sql import SparkSession

# Load the Scikit-learn California Housing dataset
sklearn_dataset = fetch_california_housing()

# Convert the Scikit-learn dataset to a Pandas DataFrame
import pandas as pd
pandas_df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
pandas_df['target'] = sklearn_dataset.target

# Create a Spark DataFrame from the Pandas DataFrame
spark_df = spark.createDataFrame(pandas_df)

# Display the data
display(spark_df)

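We then randomly split the data in two stages. The first split sets aside about 15% of the data as the test set and keeps about 85% as the training set. The second split divides that training set again 85/15, which gives a training subset (about 72.25% of all data) and a validation subset (about 12.75% of all data). We use the training and validation subsets for hyperparameter tuning, and the test set for model evaluation.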

from pyspark.ml.feature import VectorAssembler

# Combine features into a single vector column
featurizer = VectorAssembler(inputCols=sklearn_dataset.feature_names, outputCol="features")
data = featurizer.transform(spark_df)["target", "features"]

# Split the data into training, validation, and test sets
train_data, test_data = data.randomSplit([0.85, 0.15], seed=41)
train_data_sub, val_data_sub = train_data.randomSplit([0.85, 0.15], seed=41)
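
Because the second randomSplit is applied to the training portion only, the share of the full dataset in each subset follows from multiplying the split weights. The arithmetic below is a small sketch of the expected proportions; randomSplit assigns rows at random, so the actual row counts will differ slightly from these figures.

```python
# Expected share of the full dataset in each subset.
# These are expectations only: randomSplit is a random, approximate split.
first_split = (0.85, 0.15)   # weights for train_data, test_data
second_split = (0.85, 0.15)  # weights for train_data_sub, val_data_sub

train_share = first_split[0]
test_share = first_split[1]
# The second split only divides the training portion, so multiply the weights.
train_sub_share = first_split[0] * second_split[0]
val_sub_share = first_split[0] * second_split[1]

print(f"train_data:     {train_share:.2%}")
print(f"train_data_sub: {train_sub_share:.2%}")
print(f"val_data_sub:   {val_sub_share:.2%}")
print(f"test_data:      {test_share:.2%}")
```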

Set up the ML experiment

Configure MLflow

Before we perform hyperparameter tuning, we need to define a train function that accepts different hyperparameter values and trains a LightGBM model on the training data. We also need to evaluate the model's performance on the validation data by using the R2 score, which measures how well the model fits the data.

To do this, we first import the necessary modules and set up the MLflow experiment. MLflow is an open source platform for managing the end-to-end machine learning lifecycle. It helps us track and compare the results of different models and hyperparameters.

# Import MLflow and set up the experiment name
import mlflow

mlflow.set_experiment("flaml_tune_sample")

# Enable automatic logging of parameters, metrics, and models
mlflow.autolog(exclusive=False)

Set the logging level

Here, we configure the logging level to suppress unnecessary output from the Synapse.ml library, which keeps the logs cleaner.

import logging
 
logging.getLogger('synapse.ml').setLevel(logging.ERROR)

Train the baseline model

Next, we define the train function, which takes four hyperparameters as inputs: alpha, learningRate, numLeaves, and numIterations. These are the hyperparameters that we want to tune later with FLAML.

The train function also takes two DataFrames as inputs: train_data and val_data, which are the training and validation datasets, respectively. The train function returns three outputs: the trained model, the R2 score on the validation data, and the ID of the MLflow run that tracked the training.

# Import LightGBM and RegressionEvaluator
from synapse.ml.lightgbm import LightGBMRegressor
from pyspark.ml.evaluation import RegressionEvaluator

def train(alpha, learningRate, numLeaves, numIterations, train_data=train_data_sub, val_data=val_data_sub):
    """
    This train() function:
     - takes hyperparameters as inputs (for tuning later)
     - returns the trained model, the R2 score on the validation dataset, and the MLflow run ID

    Wrapping code as a function makes it easier to reuse the code later for tuning.
    """
    with mlflow.start_run() as run:

        # Capture run_id for prediction later
        run_details = run.info.run_id

        # Create a LightGBM regressor with the given hyperparameters and target column
        lgr = LightGBMRegressor(
            objective="quantile",
            alpha=alpha,
            learningRate=learningRate,
            numLeaves=numLeaves,
            labelCol="target",
            numIterations=numIterations,
            dataTransferMode="bulk"
        )

        # Train the model on the training data
        model = lgr.fit(train_data)

        # Make predictions on the validation data
        predictions = model.transform(val_data)
        # Define an evaluator with R2 metric and target column
        evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="target", metricName="r2")
        # Compute the R2 score on the validation data
        eval_metric = evaluator.evaluate(predictions)

        mlflow.log_metric("r2_score", eval_metric)

    # Return the model, the R2 score, and the MLflow run ID
    return model, eval_metric, run_details
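
For reference, the R2 score that train() reports can be written out by hand. The sketch below uses the standard definition, 1 - SS_res / SS_tot, on plain Python lists; the helper name r2_score and the sample values are illustrative only, and we assume that RegressionEvaluator with metricName="r2" follows this same definition under its default settings.

```python
def r2_score(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    # Total sum of squares: error of always predicting the mean label.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

labels = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(labels, labels)         # predictions match the labels exactly
mean_only = r2_score(labels, [2.5] * 4)    # always predict the mean label
print(perfect, mean_only)
```

A score of 1 means the predictions match the labels exactly, and a score of 0 means the model does no better than always predicting the mean label.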

Finally, we use the train function to train a baseline model with the default hyperparameter values. We also evaluate the baseline model on the test data and print the R2 score.

# Train the baseline model with the default hyperparameters
init_model, init_eval_metric, init_run_id = train(alpha=0.2, learningRate=0.3, numLeaves=31, numIterations=100, train_data=train_data, val_data=test_data)
# Print the R2 score of the baseline model on the test data
print("R2 of initial model on test dataset is: ", init_eval_metric)

Perform hyperparameter tuning with FLAML

FLAML is a fast and lightweight AutoML library that can automatically find the best hyperparameters for a given model and dataset. It uses a low-cost search strategy that adapts to feedback from the evaluation metric. In this section, we use FLAML to tune the hyperparameters of the LightGBM model that we defined in the previous section.

Define the tune function

To use FLAML, we need to define a tune function that takes a config dictionary as input and returns a dictionary with the evaluation metric name as the key and the metric value as the value.

The config dictionary contains the hyperparameters that we want to tune and their values. The tune function uses the train function that we defined earlier to train and evaluate the model with the given config.

# Import FLAML
import flaml

# Define the tune function
def flaml_tune(config):
    # Train and evaluate the model with the given config
    _, metric, run_id = train(**config)
    # Return the evaluation metric and its value
    return {"r2": metric}
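
The contract between the config dictionary and the train function can be tried out without Spark. In the sketch below, toy_train and toy_tune are hypothetical stand-ins for train and flaml_tune, and the score is a made-up formula used only for illustration. It shows that **config expands the dictionary keys into keyword arguments, so the keys must match the parameter names of the training function exactly.

```python
def toy_train(alpha, learningRate, numLeaves, numIterations):
    """Hypothetical stand-in for train(): returns (model, metric, run_id)."""
    # Made-up score for illustration only; no model is trained here.
    metric = 1.0 - abs(learningRate - 0.1)
    return None, metric, "run-0"

def toy_tune(config):
    # **config turns {"alpha": 0.2, ...} into toy_train(alpha=0.2, ...).
    _, metric, _ = toy_train(**config)
    # The returned key must match the metric name given to the tuner.
    return {"r2": metric}

result = toy_tune(
    {"alpha": 0.2, "learningRate": 0.3, "numLeaves": 31, "numIterations": 100}
)
print(result)
```

A config with a misspelled or extra key would raise a TypeError at the toy_train(**config) call, which is a quick way to catch a mismatch between the search space and the function signature.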

Define the search space

Next, we need to define the search space for the hyperparameters that we want to tune. The search space is a dictionary that maps each hyperparameter name to the range of values that we want to explore. FLAML provides convenient functions for defining different types of ranges, such as uniform, loguniform, and randint.

In this case, we want to tune the following four hyperparameters: alpha, learningRate, numLeaves, and numIterations.

# Define the search space
params = {
    # Alpha is a continuous value between 0 and 1
    "alpha": flaml.tune.uniform(0, 1),
    # Learning rate is a continuous value between 0.001 and 1
    "learningRate": flaml.tune.uniform(0.001, 1),
    # Number of leaves is an integer from 30 up to, but not including, 100
    "numLeaves": flaml.tune.randint(30, 100),
    # Number of iterations is an integer from 100 up to, but not including, 300
    "numIterations": flaml.tune.randint(100, 300),
}
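
To make the three range types concrete, the sketch below draws values by using only the Python standard library. It is not FLAML's implementation; the sample_* helpers are hypothetical names, and we assume that randint excludes the upper bound and that loguniform samples uniformly in log space, which is our reading of the FLAML documentation.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def sample_uniform(low, high):
    # Every value between low and high is equally likely.
    return rng.uniform(low, high)

def sample_loguniform(low, high):
    # Uniform in log space: each order of magnitude is equally likely.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_randint(low, high):
    # Integer from low (inclusive) to high (exclusive).
    return rng.randrange(low, high)

samples = [
    {
        "alpha": sample_uniform(0, 1),
        "learningRate": sample_loguniform(0.001, 1),
        "numLeaves": sample_randint(30, 100),
        "numIterations": sample_randint(100, 300),
    }
    for _ in range(5)
]
for config in samples:
    print(config)
```

The sketch samples learningRate on a log scale to show the difference from uniform; the search space above uses uniform for that parameter.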

Define the hyperparameter trial

Finally, we need to define the hyperparameter trial that uses FLAML to optimize the hyperparameters. We pass the tune function, the search space, the time budget, the number of samples, the metric name, the mode, and the verbosity level to the flaml.tune.run function. We also need to start a nested MLflow run to track the results of the trial.

The flaml.tune.run function returns an analysis object that contains the best config and the best metric value.

# Start a nested MLflow run
with mlflow.start_run(nested=True, run_name="Child Run: "):
    # Run the hyperparameter trial with FLAML
    analysis = flaml.tune.run(
        # Pass the tune function
        flaml_tune,
        # Pass the search space
        params,
        # Set the time budget to 120 seconds
        time_budget_s=120,
        # Set the number of samples to 100
        num_samples=100,
        # Set the metric name to r2
        metric="r2",
        # Set the mode to max (we want to maximize the r2 score)
        mode="max",
        # Set the verbosity level to 5
        verbose=5,
        )
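
The roles of time_budget_s, num_samples, metric, and mode can be illustrated with a plain random search. The sketch below is a hypothetical stand-in, not FLAML's search strategy, which adapts to feedback instead of sampling blindly. We assume that the search stops at whichever limit is reached first, the time budget or the sample count; the function name random_search and the toy objective are illustrative.

```python
import random
import time

def random_search(objective, sample_config, time_budget_s, num_samples, mode="max"):
    """Plain random search that stops at the first limit reached."""
    start = time.monotonic()
    best_config, best_value = None, None
    trials = 0
    while trials < num_samples and time.monotonic() - start < time_budget_s:
        config = sample_config()
        value = objective(config)
        trials += 1
        # mode decides whether a larger or a smaller metric value is better.
        if best_value is None:
            better = True
        elif mode == "max":
            better = value > best_value
        else:
            better = value < best_value
        if better:
            best_config, best_value = config, value
    return best_config, best_value, trials

rng = random.Random(0)
best_config, best_value, trials = random_search(
    objective=lambda cfg: -(cfg["x"] - 0.5) ** 2,    # toy metric, peaks at x = 0.5
    sample_config=lambda: {"x": rng.uniform(0, 1)},  # toy one-parameter search space
    time_budget_s=5,
    num_samples=100,
    mode="max",
)
print(trials, best_config, best_value)
```

In the tutorial code, each trial trains a LightGBM model, so the 120-second time budget is likely to end the search well before 100 samples are evaluated.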

After the trial finishes, we can view the best config and the best metric value from the analysis object.

# Get the best config from the analysis object
flaml_config = analysis.best_config
# Print the best config
print("Best config: ", flaml_config)
print("Best score on validation data: ", analysis.best_result["r2"])

Compare results

After finding the best hyperparameters with FLAML, we need to evaluate how much they improve the model's performance. To do so, we use the train function to build a new model with the best hyperparameters on the full training dataset. We then use the test dataset to compute the R2 score for both the new model and the baseline model.

# Train a new model with the best hyperparameters 
flaml_model, flaml_metric, flaml_run_id = train(train_data=train_data, val_data=test_data, **flaml_config)

# Print the R2 score of the baseline model on the test dataset
print("On the test dataset, the initial (untuned) model achieved R^2: ", init_eval_metric)
# Print the R2 score of the new model on the test dataset
print("On the test dataset, the final flaml (tuned) model achieved R^2: ", flaml_metric)

Save the final model

Now that we have completed the hyperparameter trial, we can save the final, tuned model as an ML model in Fabric.

# Specify the model name and the path where you want to save it in the registry
model_name = "housing_model"  # Replace with your desired model name
model_path = f"runs:/{flaml_run_id}/model"

# Register the model to the MLflow registry
registered_model = mlflow.register_model(model_uri=model_path, name=model_name)

# Print the registered model's name and version
print(f"Model '{registered_model.name}' version {registered_model.version} registered successfully.")
