Batch inference using Foundation Model API provisioned throughput

This article provides an example notebook that performs batch inference on a provisioned throughput endpoint using Foundation Model APIs. It also includes an example notebook for determining the optimal concurrency for your endpoint based on your batch inference workload.

Requirements

Run batch inference

Setting up batch inference generally involves three steps (a sketch of the endpoint creation step follows the list):

  1. Prepare sample data and set up a benchmark endpoint.
  2. Run a load test with the sample data on the benchmark endpoint to determine the ideal endpoint configuration.
  3. Create the endpoint to be used for batch inference and send the batch inference requests.
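
As a minimal sketch of the endpoint creation step, and not the notebook's exact code, the following uses the Databricks Python SDK (`databricks-sdk`) to create a provisioned throughput endpoint. The endpoint name, model path, entity version, and throughput values are placeholder assumptions; valid minimum and maximum provisioned throughput values come in model-specific increments.

```python
# Minimal sketch (placeholder names and values): create a provisioned throughput
# serving endpoint with the Databricks Python SDK.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()

endpoint = w.serving_endpoints.create_and_wait(
    name="llama-batch-benchmark",  # hypothetical endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="system.ai.meta_llama_v3_1_70b_instruct",  # placeholder model path
                entity_version="1",                                    # placeholder version
                # Placeholders: valid values are model-specific throughput increments.
                min_provisioned_throughput=950,
                max_provisioned_throughput=1900,
            )
        ]
    ),
)
print(endpoint.state)
```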

The example notebook uses the Meta Llama 3.1 70B model and PySpark to set up batch inference and accomplish the following (a sketch of the request-sending step follows the list):

  • Sample the input data to build a representative dataset
  • Create a benchmark endpoint with the chosen model
  • Load test the benchmark endpoint using the sample data to determine latency and concurrency
  • Create a provisioned throughput endpoint for batch inference given load test results
  • Construct the batch requests and send them to the batch inference endpoint
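
As a rough illustration of the final step, and not the notebook's exact implementation, the sketch below sends chat completion requests from PySpark through a pandas UDF. The endpoint name, table names, prompt column, and concurrency value are assumptions; depending on your cluster configuration, you may also need to make workspace credentials available to the executors.

```python
# Minimal sketch (placeholder endpoint and table names): apply chat completion
# requests to a Spark DataFrame column with a pandas UDF.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()
ENDPOINT_NAME = "llama-batch-inference"  # hypothetical batch inference endpoint
CONCURRENCY = 16                         # placeholder; use the value from your load test

@pandas_udf("string")
def chat_complete(prompts: pd.Series) -> pd.Series:
    # Create the client on the executors rather than the driver.
    from mlflow.deployments import get_deploy_client

    client = get_deploy_client("databricks")

    def query(prompt: str) -> str:
        response = client.predict(
            endpoint=ENDPOINT_NAME,
            inputs={"messages": [{"role": "user", "content": prompt}], "max_tokens": 256},
        )
        return response["choices"][0]["message"]["content"]

    return prompts.apply(query)

df = spark.table("main.default.batch_inference_inputs")  # hypothetical input table
results = (
    df.repartition(CONCURRENCY)  # roughly bound the number of in-flight requests
      .withColumn("completion", chat_complete(col("prompt")))
)
results.write.mode("overwrite").saveAsTable("main.default.batch_inference_outputs")
```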

Batch inference with a provisioned throughput endpoint notebook


Determine optimal concurrency for your batch inference workload

The following notebook provides an alternative, PySpark-based tool for load testing the benchmark endpoint to determine the optimal concurrency for your batch inference workload.
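
For a sense of what such a load test measures, here is a minimal, hand-rolled sweep that times requests at a few concurrency levels using the MLflow Deployments client. The endpoint name, prompts, and concurrency levels are placeholder assumptions, and the notebook's PySpark-based approach is better suited to larger samples.

```python
# Minimal concurrency sweep sketch (placeholder endpoint name): measure end-to-end
# throughput of the benchmark endpoint at a few parallelism levels.
import time
from concurrent.futures import ThreadPoolExecutor
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")
ENDPOINT_NAME = "llama-batch-benchmark"  # hypothetical benchmark endpoint name

def query(prompt: str) -> str:
    """Send one chat completion request and return the generated text."""
    response = client.predict(
        endpoint=ENDPOINT_NAME,
        inputs={"messages": [{"role": "user", "content": prompt}], "max_tokens": 256},
    )
    return response["choices"][0]["message"]["content"]

def sweep(prompts: list[str], concurrency_levels=(4, 8, 16, 32)) -> None:
    """Print wall-clock time and request throughput for each concurrency level."""
    for concurrency in concurrency_levels:
        start = time.time()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            responses = list(pool.map(query, prompts))
        elapsed = time.time() - start
        print(
            f"concurrency={concurrency}: {elapsed:.1f}s total, "
            f"{len(responses) / elapsed:.2f} requests/sec"
        )

sample_prompts = ["Summarize the following review: ..."] * 64  # placeholder sample prompts
sweep(sample_prompts)
```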

Determine optimal concurrency for batch inference notebook


Additional resources