MLflow Tracing for agents

Important

This feature is in Public Preview.

This article describes MLflow Tracing and the scenarios where it helps you evaluate generative AI applications in your AI system.

In software development, tracing involves recording sequences of events, such as user sessions or request flows. In the context of AI systems, tracing refers to recording the interactions between your application and an AI system. For example, a trace of a RAG application might instrument the inputs and parameters of each step: the user message and prompt, the vector store lookup, and the call to the generative AI model.

What is MLflow Tracing?

Using MLflow Tracing you can log, analyze, and compare traces across different versions of generative AI applications. It allows you to debug your generative AI Python code and keep track of inputs and responses. Doing so can help you discover conditions or parameters that contribute to poor performance of your application. MLflow Tracing is tightly integrated with Databricks tools and infrastructure, allowing you to store and display all your traces in Databricks notebooks or the MLflow experiment UI as you run your code.

When you develop AI systems on Databricks using libraries such as LangChain, LlamaIndex, OpenAI, or custom PyFunc, MLflow Tracing allows you to see all the events and intermediate outputs from each step of your agent. You can easily see the prompts, which models and retrievers were used, which documents were retrieved to augment the response, how long things took, and the final output. For example, if your model hallucinates, you can quickly inspect each step that led to the hallucination.

Why use MLflow Tracing?

MLflow Tracing provides several benefits to help you track your development workflow. For example, you can:

  • Review an interactive trace visualization and use the investigation tool for diagnosing issues in development.
  • Verify that prompt templates and guardrails are producing reasonable results.
  • Explore and minimize the latency impact of different frameworks, models, chunk sizes, and software development practices.
  • Measure application costs by tracking token usage by different models.
  • Establish benchmark (“golden”) datasets to evaluate the performance of different versions.
  • Store traces from production model endpoints to debug issues, and perform offline review and evaluation.

Install MLflow Tracing

MLflow Tracing is available in MLflow versions 2.13.0 and above.

%pip install "mlflow>=2.13.0" -qqqU
%restart_python

Alternatively, you can %pip install databricks-agents to install the latest version of databricks-agents that includes a compatible MLflow version.

Use MLflow Tracing in development

MLflow Tracing helps you analyze performance issues and accelerate the agent development cycle. The following sections assume you are developing your agent and using MLflow Tracing from a notebook.

Note

In the notebook environment, MLflow Tracing might add up to a few seconds of overhead to the agent run time. This primarily comes from the latency of logging traces to the MLflow experiment. In a production model endpoint, MLflow Tracing has a much smaller impact on performance. See Use MLflow Tracing in production.

Add traces to your agent

MLflow Tracing provides three different ways to add traces to your generative AI application. See Add traces to your agents for examples of using these methods. For API reference details, see the MLflow documentation.

MLflow autologging
  Recommended use case: Development with integrated GenAI libraries.
  Description: Autologging automatically instruments traces for popular open source frameworks such as LangChain, LlamaIndex, and OpenAI. When you add mlflow.<library>.autolog() at the start of the notebook, MLflow automatically records traces for each step of your agent execution.

Fluent APIs
  Recommended use case: Custom agents with PyFunc.
  Description: Low-code APIs for instrumenting AI systems without worrying about the tree structure of the trace. MLflow determines the appropriate parent-child tree structure (spans) based on the Python call stack.

MLflow Client APIs
  Recommended use case: Advanced use cases that require more control, such as multi-threaded applications or callback-based instrumentation.
  Description: MlflowClient implements more granular, thread-safe APIs for advanced use cases. These APIs do not manage the parent-child relationships of spans, so you must specify them manually to construct the desired trace structure. This requires more code but gives you finer control over the trace lifecycle, particularly for multi-threaded use cases.

Reviewing traces

After you run the instrumented agent, you can review the generated traces in different ways:

  • The trace visualization is rendered inline in the cell output.
  • The traces are logged to your MLflow experiment. You can review and search the full list of historical traces in the Traces tab on the Experiment page. When the agent runs under an active MLflow run, you can also find the traces on the Run page.
  • Programmatically retrieve traces by using the mlflow.search_traces() API.

Use MLflow Tracing in production

MLflow Tracing is also integrated with Mosaic AI Model Serving, allowing you to debug issues efficiently, monitor performance, and create a golden dataset for offline evaluation. When MLflow Tracing is enabled for your serving endpoint, traces are recorded in an inference table under the response column.

To enable MLflow Tracing for your serving endpoint, you must set the ENABLE_MLFLOW_TRACING environment variable in the endpoint configuration to True. See Add plain text environment variables for how to deploy an endpoint with custom environment variables. If you deployed your agent using the deploy() API, traces are automatically logged to an inference table. See Deploy an agent for generative AI application.
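As a sketch, an endpoint configuration with tracing enabled might look like the following. The entity name and version are hypothetical placeholders, and the exact config shape depends on how you create the endpoint (for example, through the Serving UI or a deployment client):

```python
# Hypothetical serving endpoint configuration with MLflow Tracing enabled.
# "catalog.schema.my_agent" is a placeholder for your registered agent model.
endpoint_config = {
    "served_entities": [
        {
            "entity_name": "catalog.schema.my_agent",  # placeholder
            "entity_version": "1",
            "workload_size": "Small",
            "scale_to_zero_enabled": True,
            "environment_vars": {
                # Enables trace logging to the endpoint's inference table.
                "ENABLE_MLFLOW_TRACING": "True",
            },
        }
    ],
}
print(endpoint_config["served_entities"][0]["environment_vars"])
```

A configuration like this could then be passed to a deployment client, for example mlflow.deployments.get_deploy_client("databricks").create_endpoint(). If you deploy with the deploy() API instead, tracing is enabled for you.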

Note

Writing traces to an inference table is done asynchronously, so it does not add the same overhead as in the notebook environment during development. However, it might still introduce some overhead to the endpoint’s response speed, particularly when the trace size for each inference request is large. Databricks does not guarantee any service level agreement (SLA) for the actual latency impact on your model endpoint, as it heavily depends on the environment and the model implementation. Databricks recommends testing your endpoint performance and gaining insights into the tracing overhead before deploying to a production application.

The following table provides a rough indication of the impact on inference latency for different trace sizes.

Trace size per request    Impact to latency
~10 KB                    ~1 ms
~1 MB                     50 to 100 ms
10 MB                     150 ms or more

Limitations

  • MLflow Tracing is available in Databricks notebooks, notebook jobs, and Model Serving.

  • LangChain autologging may not support all LangChain prediction APIs. See the MLflow documentation for the full list of supported APIs.