Batch inference using Foundation Model API provisioned throughput

This article provides an example notebook that performs batch inference on a provisioned throughput endpoint using Foundation Model APIs. It also includes an example notebook for determining the optimal concurrency for your endpoint based on your batch inference workload.

Requirements

Run batch inference

Setting up batch inference generally involves three steps (a sketch of the endpoint creation step follows the list):

  1. Prepare sample data and set up a benchmark endpoint.
  2. Run a load test with the sample data on the benchmark endpoint to determine the ideal endpoint configuration.
  3. Create the endpoint to be used for batch inference and send the batch inference requests.
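
As a minimal sketch of the endpoint creation step, and not the notebook's exact code, the following uses the Databricks Python SDK (`databricks-sdk`) to create a provisioned throughput endpoint. The endpoint name, model path, entity version, and throughput values are placeholder assumptions; valid minimum and maximum provisioned throughput values come in model-specific increments.

```python
# Minimal sketch (placeholder names and values): create a provisioned throughput
# serving endpoint with the Databricks Python SDK.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()

endpoint = w.serving_endpoints.create_and_wait(
    name="llama-batch-benchmark",  # hypothetical endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="system.ai.meta_llama_v3_1_70b_instruct",  # placeholder model path
                entity_version="1",                                    # placeholder version
                # Placeholders: valid values are model-specific throughput increments.
                min_provisioned_throughput=950,
                max_provisioned_throughput=1900,
            )
        ]
    ),
)
print(endpoint.state)
```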

The example notebook uses the Meta Llama 3.1 70B model and PySpark to set up batch inference and accomplish the following (a sketch of the request-sending step follows the list):

  • Sample the input data to build a representative dataset
  • Create a benchmark endpoint with the chosen model
  • Load test the benchmark endpoint using the sample data to determine latency and concurrency
  • Create a provisioned throughput endpoint for batch inference given load test results
  • Construct the batch requests and send them to the batch inference endpoint
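
As a rough illustration of the final step, and not the notebook's exact implementation, the sketch below sends chat completion requests from PySpark through a pandas UDF. The endpoint name, table names, prompt column, and concurrency value are assumptions; depending on your cluster configuration, you may also need to make workspace credentials available to the executors.

```python
# Minimal sketch (placeholder endpoint and table names): apply chat completion
# requests to a Spark DataFrame column with a pandas UDF.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()
ENDPOINT_NAME = "llama-batch-inference"  # hypothetical batch inference endpoint
CONCURRENCY = 16                         # placeholder; use the value from your load test

@pandas_udf("string")
def chat_complete(prompts: pd.Series) -> pd.Series:
    # Create the client on the executors rather than the driver.
    from mlflow.deployments import get_deploy_client

    client = get_deploy_client("databricks")

    def query(prompt: str) -> str:
        response = client.predict(
            endpoint=ENDPOINT_NAME,
            inputs={"messages": [{"role": "user", "content": prompt}], "max_tokens": 256},
        )
        return response["choices"][0]["message"]["content"]

    return prompts.apply(query)

df = spark.table("main.default.batch_inference_inputs")  # hypothetical input table
results = (
    df.repartition(CONCURRENCY)  # roughly bound the number of in-flight requests
      .withColumn("completion", chat_complete(col("prompt")))
)
results.write.mode("overwrite").saveAsTable("main.default.batch_inference_outputs")
```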

Batch inference with a provisioned throughput endpoint notebook


Determine optimal concurrency for your batch inference workload

The following notebook provides an alternative, PySpark-based tool for load testing the benchmark endpoint to determine the optimal concurrency for your batch inference workload.
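
For a sense of what such a load test measures, here is a minimal, hand-rolled sweep that times requests at a few concurrency levels using the MLflow Deployments client. The endpoint name, prompts, and concurrency levels are placeholder assumptions, and the notebook's PySpark-based approach is better suited to larger samples.

```python
# Minimal concurrency sweep sketch (placeholder endpoint name): measure end-to-end
# throughput of the benchmark endpoint at a few parallelism levels.
import time
from concurrent.futures import ThreadPoolExecutor
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
ENDPOINT_NAME = "llama-batch-benchmark"  # hypothetical benchmark endpoint name

def query(prompt: str) -> str:
    """Send one chat completion request and return the generated text."""
    response = client.predict(
        endpoint=ENDPOINT_NAME,
        inputs={"messages": [{"role": "user", "content": prompt}], "max_tokens": 256},
    )
    return response["choices"][0]["message"]["content"]

def sweep(prompts: list[str], concurrency_levels=(4, 8, 16, 32)) -> None:
    """Print wall-clock time and request throughput for each concurrency level."""
    for concurrency in concurrency_levels:
        start = time.time()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            responses = list(pool.map(query, prompts))
        elapsed = time.time() - start
        print(
            f"concurrency={concurrency}: {elapsed:.1f}s total, "
            f"{len(responses) / elapsed:.2f} requests/sec"
        )

sample_prompts = ["Summarize the following review: ..."] * 64  # placeholder sample prompts
sweep(sample_prompts)
```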

Determine optimal concurrency for batch inference notebook


Additional resources