This article provides general API information for Databricks Foundation Model APIs and the models they support. The Foundation Model APIs are designed to be similar to OpenAI’s REST API to make migrating existing projects easier. Both the pay-per-token and provisioned throughput endpoints accept the same REST API request format.
Endpoints
Each pay-per-token model has a single endpoint, and users can interact with these endpoints using HTTP POST requests. Provisioned throughput endpoints can be created using the API or the Serving UI. These endpoints also support multiple models per endpoint for A/B testing, as long as the served models expose the same API format; for example, both must be chat models.
Requests and responses use JSON; the exact JSON structure depends on an endpoint's task type. Chat and completion endpoints support streaming responses.
Responses include a usage sub-message which reports the number of tokens in the request and response. The format of this sub-message is the same across all task types.
| Field | Type | Description |
|---|---|---|
| completion_tokens | Integer | Number of generated tokens. Not included in embedding responses. |
| prompt_tokens | Integer | Number of tokens from the input prompt(s). |
| total_tokens | Integer | Number of total tokens. |
For models like llama-2-70b-chat, a user prompt is transformed using a prompt template before being passed into the model. For pay-per-token endpoints, a system prompt might also be added. prompt_tokens includes all text added by our server.
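For illustration, here is a minimal sketch of calling a pay-per-token chat endpoint over HTTP and reading the usage sub-message. The workspace URL, endpoint name, and token are placeholders, and the /serving-endpoints/&lt;name&gt;/invocations request path is assumed rather than taken from this article.

```python
# Minimal sketch: call a serving endpoint over HTTP and inspect the usage sub-message.
# The workspace URL, endpoint name, and token below are placeholders (assumptions).
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "databricks-llama-2-70b-chat"                    # example pay-per-token endpoint
API_TOKEN = "<databricks-personal-access-token>"                 # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
)
resp.raise_for_status()
body = resp.json()

# The usage sub-message has the same format for every task type.
usage = body["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])
```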
Chat task
Chat tasks are optimized for multi-turn conversations with a model. Each request describes the conversation so far, where the messages field must alternate between user and assistant roles, ending with a user message. The model response provides the next assistant message in the conversation.
Chat request

| Field | Default | Type | Description |
|---|---|---|---|
| messages | | List[ChatMessage] | Required. A list of messages representing the current conversation. |
| max_tokens | null | null, which means no limit, or an integer greater than zero | The maximum number of tokens to generate. |
| stream | true | Boolean | Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the Server-sent events standard. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | null | null, which means no limit, or an integer greater than zero | Defines the number of k most likely tokens to use for top-k filtering. Set this value to 1 to make outputs deterministic. |
| stop | [] | String or List[String] | Model stops generating further tokens when any one of the sequences in stop is encountered. |
| n | 1 | Integer greater than zero | The API returns n independent chat completions when n is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. |
| tool_choice | "none" | String or ToolChoiceObject | Used only in conjunction with the tools field. tool_choice supports a variety of keyword strings such as auto, required, and none. auto means that you let the model decide which (if any) tool is relevant to use; if the model doesn't believe any of the tools in tools are relevant, it generates a standard assistant message instead of a tool call. required means that the model picks the most relevant tool in tools and must generate a tool call. none means that the model does not generate any tool calls and instead must generate a standard assistant message. To force a tool call with a specific tool defined in tools, use a ToolChoiceObject. By default, if the tools field is populated, tool_choice is "auto"; otherwise, tool_choice defaults to "none". |
| tools | null | List[ToolObject] | A list of tools that the model can call. Currently, function is the only supported tool type and a maximum of 32 functions are supported. |
| logprobs | false | Boolean | This parameter indicates whether to provide the log probability of a token being sampled. |
| top_logprobs | null | Integer | This parameter controls the number of most likely token candidates to return log probabilities for at each sampling step. Can be 0-20. logprobs must be true if using this field. |
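As a non-authoritative sketch of the parameters above, the request below combines max_tokens, temperature, top_p, and stop in one chat request. The endpoint name and credentials are placeholders, and the OpenAI-style choices list in the response is assumed.

```python
# Sketch: a chat request exercising several of the optional parameters above.
# The workspace URL, endpoint name, and token are placeholders, as in the earlier sketch.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "<chat-model-endpoint>"                          # placeholder
API_TOKEN = "<databricks-personal-access-token>"                 # placeholder

payload = {
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what nucleus sampling does."},
    ],
    "max_tokens": 128,    # cap on generated tokens
    "temperature": 0.2,   # low randomness
    "top_p": 0.95,        # nucleus sampling threshold
    "stop": ["\n\n"],     # stop at the first blank line
}

resp = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
resp.raise_for_status()

# Assumes the OpenAI-style response shape, where the reply sits in choices[0].message.
print(resp.json()["choices"][0]["message"]["content"])
```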
ChatMessage
| Field | Type | Description |
|---|---|---|
| role | String | Required. The role of the author of the message. Can be "system", "user", "assistant" or "tool". |
| content | String | The content of the message. Required for chat tasks that do not involve tool calls. |
ToolObject

| Field | Type | Description |
|---|---|---|
| type | String | Required. The type of the tool. Currently, only "function" is supported. |
| function | Object | Required. The function definition associated with the tool. |

ToolChoiceObject

| Field | Type | Description |
|---|---|---|
| type | String | Required. The type of the tool. Currently, only "function" is supported. |
| function | Object | Required. An object defining which tool to call of the form {"type": "function", "function": {"name": "my_function"}}, where "my_function" is the name of a FunctionObject in the tools field. |
FunctionObject
| Field | Type | Description |
|---|---|---|
| name | String | Required. The name of the function to be called. |
| description | String | Required. The detailed description of the function. The model uses this description to understand the relevance of the function to the prompt and generate the tool calls with higher accuracy. |
| parameters | Object | The parameters the function accepts, described as a valid JSON Schema object. If the tool is called, then the tool call is fit to the JSON Schema provided. Omitting parameters defines a function without any parameters. The number of properties is limited to 15 keys. |
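To show how tools, tool_choice, and FunctionObject fit together, here is a hedged sketch that defines a single function and forces the model to call it. The get_weather function, endpoint name, and credentials are made up for illustration.

```python
# Sketch: define a single function tool and force the model to call it.
# The function, endpoint name, and credentials are illustrative placeholders.
import json
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "<chat-model-endpoint>"                          # placeholder
API_TOKEN = "<databricks-personal-access-token>"                 # placeholder

payload = {
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    # tools: a list of ToolObject entries; only the "function" type is supported.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical function
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    # ToolChoiceObject form: force a call to the specific function defined above.
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
}

resp = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
)
resp.raise_for_status()

# Assumes the OpenAI-style shape where tool calls appear on the assistant message.
message = resp.json()["choices"][0]["message"]
print(json.dumps(message.get("tool_calls", []), indent=2))
```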
Chat response
For non-streaming requests, the response is a single chat completion object. For streaming requests, the response is a text/event-stream where each event is a completion chunk object. The top-level structure of completion and chunk objects is almost identical: only choices has a different type.
In a chunk object, each entry in the choices list has the following fields.

| Field | Type | Description |
|---|---|---|
| delta | ChatMessage | A chat completion message part of generated streamed responses from the model. Only the first chunk is guaranteed to have role populated. |
| finish_reason | String | The reason the model stopped generating tokens. Only the last chunk will have this populated. |
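Because streaming responses arrive as server-sent events, the sketch below shows one way a client might read chunk objects. The data: line framing, the [DONE] sentinel, and the delta field follow the OpenAI-compatible convention and are assumptions here, not a prescribed client.

```python
# Sketch: consume a streaming chat response delivered as server-sent events.
# Endpoint and credentials are placeholders; the "data: ..." framing and the
# [DONE] sentinel follow the OpenAI-compatible convention and are assumptions.
import json
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "<chat-model-endpoint>"                          # placeholder
API_TOKEN = "<databricks-personal-access-token>"                 # placeholder

with requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "messages": [{"role": "user", "content": "Write a haiku about lakes."}],
        "stream": True,
    },
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue  # skip keep-alives and blank separator lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        # Each chunk carries a partial assistant message in choices[0].delta.
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content", ""), end="", flush=True)
```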
Completion task
Text completion tasks are for generating responses to a single prompt. Unlike Chat, this task supports batched inputs: multiple independent prompts can be sent in one request.
Completion request
| Field | Default | Type | Description |
|---|---|---|---|
| prompt | | String or List[String] | Required. The prompt(s) for the model. |
| max_tokens | null | null, which means no limit, or an integer greater than zero | The maximum number of tokens to generate. |
| stream | true | Boolean | Stream responses back to a client in order to allow partial results for requests. If this parameter is included in the request, responses are sent using the Server-sent events standard. |
| temperature | 1.0 | Float in [0,2] | The sampling temperature. 0 is deterministic and higher values introduce more randomness. |
| top_p | 1.0 | Float in (0,1] | The probability threshold used for nucleus sampling. |
| top_k | null | null, which means no limit, or an integer greater than zero | Defines the number of k most likely tokens to use for top-k filtering. Set this value to 1 to make outputs deterministic. |
| error_behavior | "error" | "truncate" or "error" | For timeouts and context-length-exceeded errors. One of: "truncate" (return as many tokens as possible) and "error" (return an error). This parameter is only accepted by pay-per-token endpoints. |
| n | 1 | Integer greater than zero | The API returns n independent completions when n is specified. Recommended for workloads that generate multiple completions on the same input for additional inference efficiency and cost savings. Only available for provisioned throughput endpoints. |
| stop | [] | String or List[String] | Model stops generating further tokens when any one of the sequences in stop is encountered. |
| suffix | "" | String | A string that is appended to the end of every completion. |
| echo | false | Boolean | Returns the prompt along with the completion. |
| use_raw_prompt | false | Boolean | If true, pass the prompt directly into the model without any transformation. |
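As a rough sketch of batched completion inputs, the request below sends two independent prompts in one call and echoes each prompt back with its completion. The endpoint name and credentials are placeholders, and the per-prompt choices parsing assumes the OpenAI-style response shape.

```python
# Sketch: a batched completion request with two independent prompts.
# The endpoint name and credentials are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "<completion-model-endpoint>"                    # placeholder
API_TOKEN = "<databricks-personal-access-token>"                 # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "prompt": [
            "The capital of France is",
            "The chemical symbol for gold is",
        ],
        "max_tokens": 16,
        "temperature": 0.0,  # deterministic sampling
        "echo": True,        # return each prompt along with its completion
    },
)
resp.raise_for_status()

# Assumes the OpenAI-style choices list, one entry per input prompt.
for choice in resp.json()["choices"]:
    print(choice["text"])
```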
Embedding task
Embedding tasks map input strings into embedding vectors. Many inputs can be batched together in each request.
Embedding request
| Field | Type | Description |
|---|---|---|
| input | String or List[String] | Required. The input text to embed. Can be a string or a list of strings. |
| instruction | String | An optional instruction to pass to the embedding model. |
Instructions are optional and highly model-specific. For instance, the BGE authors recommend no instruction when indexing chunks and recommend using the instruction "Represent this sentence for searching relevant passages:" for retrieval queries. Other models like Instructor-XL support a wide range of instruction strings.
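For a concrete picture of the instruction parameter, the sketch below batches two retrieval queries and passes the BGE-style instruction quoted above. The endpoint name is only an example, and reading the vectors from a data list assumes the OpenAI-compatible embeddings shape.

```python
# Sketch: an embedding request that batches two queries and passes an instruction.
# The workspace URL, endpoint name, and credentials are placeholders.
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "databricks-bge-large-en"                        # example embedding endpoint
API_TOKEN = "<databricks-personal-access-token>"                 # placeholder

resp = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "input": [
            "How do I create a provisioned throughput endpoint?",
            "What is nucleus sampling?",
        ],
        # Per the note above, BGE-style models accept this retrieval instruction.
        "instruction": "Represent this sentence for searching relevant passages:",
    },
)
resp.raise_for_status()
body = resp.json()

# Assumes the OpenAI-compatible embeddings shape: one vector per input in body["data"].
for item in body["data"]:
    print(len(item["embedding"]))
```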
Embeddings response
| Field | Type | Description |
|---|---|---|
| id | String | Unique identifier for the embedding. |
| object | String | The object type. Equal to "list". |
| model | String | The name of the embedding model used to create the embedding. |