Transparency Note: healthcare agent service - Copilot Features
Important
Customer Disclaimer: AI tools and technologies, including healthcare agent service, can make mistakes and do not always provide accurate or complete information. It is your responsibility to: (1) thoroughly test and evaluate whether its use is fit for purpose, (2) identify and mitigate any risks or harms to end users associated with its use, and (3) ensure that any decisions made using output from healthcare agent service are made with human oversight and not based solely on the output. healthcare agent service is not intended, designed, or made available to be: (1) a medical device, or (2) a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment. You are solely responsible for displaying and/or obtaining appropriate consents, warnings, disclaimers, and acknowledgements, including informing end users that they are interacting with an AI system. Output from healthcare agent service does not reflect the opinions of Microsoft, and the accuracy and reliability of the information provided by healthcare agent service may vary and are not guaranteed.
What is a Transparency Note?
An AI system includes not only the technology, but also the people who use it, the people who will be affected by it, and the environment in which it's deployed. Creating a system that is fit for its intended purpose requires an understanding of how the technology works, what its capabilities and limitations are, and how to achieve the best performance. Microsoft's Transparency Notes are intended to help you understand how our AI technology works, the choices system owners can make that influence system performance and behavior, and the importance of thinking about the whole system, including the technology, the people, and the environment. You can use Transparency Notes when developing or deploying your own system or share them with the people who use or are affected by your system.
Microsoft's Transparency Notes are part of a broader effort at Microsoft to put our AI Principles into practice. To find out more, see the Microsoft AI principles.
Introduction
The healthcare agent service infused with Generative AI expands the service's current functionality by enabling new Generative AI features. These features include a new orchestrator powered by Large Language Models (LLMs) that can work with customer-defined sources, OpenAI Plugins, or built-in Healthcare Intelligence sources. Customers can ground generative answers on their own sources, either through Azure AI Search, powered by vector search over customer documents, or via Bing Custom Search over selected customer websites. The healthcare agent service also provides Healthcare Intelligence features, a set of selected credible healthcare sources. These sources can be used to ground the LLM when no answer has been found in the customer sources. All generated answers are checked through our Healthcare Safeguards, which validate every Generative AI answer in several ways. For the public preview we're adding evidence detection, provenance, and clinical code validation. The healthcare agent service also includes a consistent way to present every Generative AI answer through the Chat Safeguards, which include a disclaimer, attribution to source, a feedback mechanism, and abuse monitoring.
All of these new features are powered by Azure OpenAI Service. To learn more about Azure OpenAI Service, please review the Transparency Note.
Customers can quickly begin using these new features by enabling them through the healthcare agent service template catalog. They can import one or more relevant Generative Answer templates into their bot instance, providing a simple way to use these new capabilities. The features are accessible through the Healthcare Orchestrator or via dedicated steps in the scenario editor: "Generative Answers on Customer Sources" for customer-provided data and "Healthcare Intelligence" for built-in credible sources.
As part of the healthcare chat guardrails, all generative answers include by default a disclaimer [This message is generated by AI and doesn't provide nor replace professional medical advice. Make sure it's accurate and appropriate before relying on this response.], evidence showing which data the answer was grounded on, and a feedback mechanism so customers can see whether end users rate the answers positively or negatively.
The use of the Healthcare agent service is governed by the licensing contracts agreed to by the customer, including the Product Terms.
Key terms
Term | Definition |
---|---|
Generative Answers on Customer sources | A healthcare agent service feature that allows customers to use their own data (documents or websites) as inputs to their bot scenarios in the healthcare agent service. This feature can be used through the orchestrator or via the scenario editor. |
Healthcare Intelligence | A healthcare agent service feature that includes a range of healthcare sources which have traditionally been deemed credible in the clinical field. This may include medical guidelines, drug instructions, and known drug interactions. Some of these sources have been processed through Azure OpenAI Service. This feature can be enabled either through the scenario editor or via the Orchestrator. Microsoft provides no express or implied warranties and assumes no liability regarding the information provided by Healthcare Intelligence and Customer is solely responsible for any use of information from these sources. |
Azure OpenAI Data Connection | The customer data connection to their Azure OpenAI Endpoint. |
Customer Source Data Connection | The configured connection to the customer Vector Store. Currently we support Azure AI Search service as a vector store. |
Evidence | The information from the original source content used to ground the model. |
Customer Source | Information from the customer that is used to ground the Large Language Models. This can be data from Azure AI Search, or customer selected websites that are queried through Bing Custom Search. |
Abuse Monitoring | Abuse monitoring refers to the systematic observation of activities, interactions, or content to identify and address instances of misuse, harm, or inappropriate behavior, with the primary goal of ensuring a safe and respectful environment for users. |
Ungrounded Content | Ungrounded content refers to generated text or responses that may lack proper context, factual accuracy, or real-world basis. It highlights instances where the model generates information without being anchored in established knowledge or verifiable data. |
Tokens | Tokens refer to units of text that language models process during generation. A token can be as short as one character or as long as one word. In English, for example, the sentence "ChatGPT is great!" consists of six tokens: ["Chat", "G", "PT", " is", " great", "!"]. Tokens help models understand and manipulate language by breaking down text into manageable units. |
RAG | Retrieval-Augmented Generation (RAG): A framework integrating information retrieval and language generation. It fetches relevant context from external sources before generating responses, enhancing the model's accuracy and contextual relevance. |
XPIA | XPIA, also known as Cross-domain Prompt Injection, is a new vulnerability affecting Large Language Models that use prompt-based learning. When using content from websites, you need to be sure the content doesn't contain malicious code or injected prompts that can exploit the XPIA vulnerability. |
Chat Safeguards | Chat Safeguards are created to support end users in consuming answers created by generative AI. These features include a disclaimer, evidence, feedback, and abuse monitoring. |
Clinical Safeguards | Clinical Safeguards are created to validate every generated answer from an LLM. Currently, the Clinical Safeguards validate the answer in several ways: evidence verification, clinical code verification, and provenance. |
Provenance | Provenance tracks the origin of the data, ensuring its authenticity, reliability, and trustworthiness. It provides transparency and accountability by showing how the LLM used the evidence to create an answer. This helps verify its integrity and value. |
Capabilities
System behavior
When using these new Generative AI features, you have access to new capabilities. You can utilize the new healthcare-adapted orchestrator or use the already available scenario editor to chat with your sources, with our credible healthcare sources, or with OpenAI Plugins. When utilizing the Generative Answers on your sources feature, you'll be able to chat with your healthcare data in a secure and safe way. This data can be documents, PDFs, text files, or even public websites that you manage. The healthcare agent service finds the most relevant source information that you provided to answer end-user queries and infuses it with Generative AI through Azure OpenAI Large Language Models such as gpt-35-turbo or GPT-4(o). You can also choose to utilize one of our credible sources via our “Healthcare Intelligence“ feature. Both of these features can be used in the scenario editor or via the Healthcare Orchestrator powered by Generative AI. All of these features are protected by our Healthcare Safeguards: every generated answer is checked by our clinical safeguards, including evidence verification and clinical code verification, and is presented in a compliant and efficient way via our Chat Safeguards.
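To make this grounding flow more concrete, the following is a minimal, illustrative sketch of the underlying Azure OpenAI "on your data" pattern a customer could use to ground a chat completion on an Azure AI Search index. It is not the healthcare agent service's internal implementation; the deployment name, index name, API version, and environment variable names are placeholder assumptions.

```python
"""Illustrative only: grounding an Azure OpenAI chat completion on an
Azure AI Search index ("on your data" pattern). All names are placeholders."""
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # your Azure OpenAI deployment name (assumption)
    messages=[{"role": "user", "content": "What are the visiting hours at the clinic?"}],
    # Ground the answer on your own documents indexed in Azure AI Search.
    extra_body={
        "data_sources": [
            {
                "type": "azure_search",
                "parameters": {
                    "endpoint": os.environ["AZURE_AI_SEARCH_ENDPOINT"],
                    "index_name": "customer-documents",  # placeholder index name
                    "authentication": {
                        "type": "api_key",
                        "key": os.environ["AZURE_AI_SEARCH_KEY"],
                    },
                },
            }
        ]
    },
)

print(response.choices[0].message.content)
```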
The various features can be easily enabled through the healthcare agent service Template Catalog. When you import one of these scenarios, the healthcare orchestrator is automatically configured, providing a ready-to-use scenario for immediate testing. Depending on the template, the bot will also establish secure connections to your Azure OpenAI endpoint or set up new connections with Azure AI Search or Bing Custom Search. Customers can further customize these scenarios to suit their specific needs. Additionally, all features can be enabled independently of the Template Catalog if preferred.
These templates give you insights on how to chat with your data or how to use our credible healthcare sources in your healthcare agent service. All templates can be used through the Healthcare Orchestrator or via the Scenario Editor.
For “Customer Sources”, we're using Azure OpenAI Studio to provide you with an easy way to upload and vectorize your data sources. To use your public websites as an input source, we use Bing Custom Search, where you can configure and include or exclude relevant websites. It's important to select the right Bing SKU when using it with your LLM. You can find more information on the Bing pricing page.
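For reference, here is a minimal sketch of how a Bing Custom Search instance can be queried against the websites you configured. It only illustrates the underlying search call, not how the healthcare agent service invokes it; the environment variable names and result handling are assumptions to adapt.

```python
"""Illustrative only: querying a Bing Custom Search instance for configured websites.
Check the Bing Custom Search documentation for your SKU and region."""
import os
import requests

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/custom/search"

def search_custom_sites(query: str) -> list[dict]:
    """Return web page results from the websites configured in your Custom Search instance."""
    response = requests.get(
        BING_ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": os.environ["BING_CUSTOM_SEARCH_KEY"]},
        params={
            "q": query,
            "customconfig": os.environ["BING_CUSTOM_CONFIG_ID"],  # your Custom Search instance ID
            "count": 5,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("webPages", {}).get("value", [])

for page in search_custom_sites("flu vaccination schedule"):
    print(page["name"], page["url"])
```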
The healthcare agent service has Abuse Monitoring enabled out of the box. Abuse Monitoring automatically logs and blocks problematic prompts in your healthcare agent service instance. If the healthcare agent service detects multiple problematic prompts in a short period, the end user is automatically blocked and not able to use the Generative Answer features in your healthcare agent service instance. You can view all the problematic prompts and blocked users in the healthcare agent service Portal. It's also possible to unblock users. Microsoft provides no express or implied warranties and assumes no liability and Customer is solely responsible for any consequences of disabling the Abuse Monitoring functionality.
Use cases
Intended uses
Healthcare organizations can use the new "Generative Answers on Customer Sources" feature to enhance the experiences they build for their end users by combining Generative AI with their own sources, or use the "Healthcare Intelligence" features to utilize our built-in healthcare knowledge combined with Generative AI.
Patients can use this intelligence service, typically incorporated in the provider's patient portal or available through the provider's app, for self-serve functionality, such as answering healthcare-related questions or finding care facilities. Thanks to the Healthcare Safeguards, every answer is always checked and the end user is always made aware that the answer was created by Generative AI.
Clinicians and medical professionals can utilize the healthcare agent service infused with Generative AI to ask detailed questions and look up guidelines or instructions based on credible healthcare content and the customer's healthcare sources. Thanks to the Healthcare Safeguards, every answer is always checked and the end user is always made aware that the answer was generated by a Generative AI model.
Customers can utilize the orchestrator to connect one or more healthcare plugins such as the healthcare intelligence plugin, Generative Answers on Customer Sources Plugin, Conversational Plugin (scenario) or their own OpenAPI Plugin to let the Generative AI model decide which plugin to trigger, which will provide relevant answers based on the end-user questions. With the orchestrator, customers don’t need to create a decision tree but can apply the intelligence of Generative AI to find the most relevant plugin that will provide an answer.
Considerations when choosing other use cases
The healthcare agent service infused with generative answers is a valuable extension of the platform. However, given the sensitive nature of health-related data, it's important to consider your use cases carefully. In all cases, a human should be aware that the answers are coming from an AI and act with caution on all provided information.
Avoid scenarios that use this service as a substitute for professional medical advice, diagnosis, treatment, or judgment, or as a medical device, to provide clinical support.
Avoid using patient information that hasn't been de-identified or anonymized with this generative AI feature. You're solely responsible for any regulatory obligations that may arise from using this feature for patient-specific purposes.
Avoid scenarios that use personal health information for a purpose not permitted by patient consent or applicable law. Health information has special protections regarding privacy and consent. Make sure that all data you use has patient consent for the way you use the data in your system or you're otherwise compliant with applicable law as it relates to the use of health information.
Carefully choose the sources you're using to ground the generative answer. You're solely responsible for any regulatory obligations that may arise from using this feature on your grounded sources.
When using the Healthcare Intelligence feature, you're solely responsible for any regulatory obligations that may arise from using this feature.
Limitations
When it comes to large-scale natural language models, there are particular fairness and responsible AI issues to consider. People use language to describe the world and to express their beliefs, assumptions, attitudes, and values. As a result, publicly available text data typically used to train large-scale natural language processing models contains societal biases relating to race, gender, religion, age, and other groups of people, and other undesirable content. These societal biases are reflected in the distributions of words, phrases, and syntactic structures.
When using your own sources, use only trusted documents and verify they don’t include any harmful or inaccurate content. You're solely responsible for any use of inappropriate content or any use of content from a customer source.
Generative Answers may perform differently in non-English languages. Our service has been tested and validated only in English. Customers are responsible for measuring and evaluating the performance of the generated answers in other languages and whether it fits their needs.
Customers who want to use their trusted documents should store them in Azure AI Search Service. The fastest way to index this data is via Azure OpenAI Studio.
It's the customer’s responsibility to implement the bot such that it only draws from trusted documents and doesn't include incorrect or inaccurate information in the generated answers.
Microsoft provides no express or implied warranties and assumes no liability regarding the information provided by Healthcare Intelligence. The customer is solely responsible for any use of the information from Healthcare Intelligence.
It's the customer’s responsibility to evaluate the service such that it only answers relevant questions.
It's the customer’s responsibility to measure potentially ungrounded content to ensure that the service doesn't respond outside of the provided trusted documents.
It's the customer’s responsibility to mitigate harms arising from their end-users attempting to bypass the safety system (for example, breaking the metaprompt).
Customers should always enable the Azure Content Safety on their Azure OpenAI Endpoint.
Customers should always enable the Jailbreak mitigation feature on their Azure OpenAI Endpoint.
It's the customer’s responsibility to implement the service in such a way that it only draws from the trusted documents and doesn't make up answers. The safeguards aren't foolproof, but rather another layer to assist the customer in providing accurate generative AI responses.
Technical limitations, operational factors, and ranges
When using your own Azure OpenAI Service resources, make sure to use the Azure OpenAI built-in content filtering system to enable safer and more responsible question answering and to filter potentially harmful content.
When the response is grounded on sources, the answer will always include the evidence that was used to generate it. It also includes references linking the generated sentences to the sources used. It's best practice to include this in the response presented to the user.
Any answer provided by generative AI should include a clear disclaimer notifying the end user that the answer was generated by an AI model and should be treated as such. You may modify the default disclaimer, but you can't remove it. Always display a disclaimer to the end user. It is your responsibility to display proper warnings and disclaimers.
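As a purely illustrative sketch of the two practices above, the snippet below shows one way a client application could render a generated answer together with its evidence attributions and the default disclaimer. The answer structure and field names are hypothetical; map them to the payload your channel actually delivers.

```python
"""Illustrative only: rendering a generative answer with evidence and the AI disclaimer.
The `answer` dictionary and its fields are hypothetical, not the service's schema."""
DISCLAIMER = (
    "This message is generated by AI and doesn't provide nor replace professional "
    "medical advice. Make sure it's accurate and appropriate before relying on this response."
)

def render_answer(answer: dict) -> str:
    # Hypothetical fields: "text" (generated answer) and "evidence" (list of cited sources).
    lines = [answer["text"], ""]
    for i, source in enumerate(answer.get("evidence", []), start=1):
        lines.append(f"[{i}] {source['title']} - {source['url']}")
    lines += ["", DISCLAIMER]  # always keep the disclaimer visible to the end user
    return "\n".join(lines)

example = {
    "text": "According to the clinic guidelines, adults should schedule a check-up once a year.",
    "evidence": [{"title": "Clinic guidelines", "url": "https://example.org/guidelines"}],
}
print(render_answer(example))
```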
For more information about the Azure OpenAI Service functionality, please see: Use cases for Azure OpenAI - Azure Cognitive Services | Microsoft Learn.
When adding new plugins to the orchestrator, it's highly recommended to test and validate the answers. When adding new plugins, the customer should test the behavior of the service to make sure there are no conflicts with other plugins and no issues with certain parameters.
When incorporating content from websites, exercising caution is crucial to ensure that the content is free from any malicious code or injected prompts that may exploit the XPIA vulnerability. This newly identified vulnerability poses a risk to Large Language Models, particularly those employing prompt-based learning. It's essential to mitigate potential risks and safeguard against unwanted consequences.
When utilizing the Healthcare Orchestrator in combination with one or more plugins, it's important to have a clear and distinct description for every plugin to avoid overlaps in the selection process. When starting with the Healthcare Orchestrator, it's important to start with a few plugins and evaluate the responses and accuracy before adding more. Starting with 5+ plugins can result in unexpected behaviors.
System performance
The performance of new features using customer sources is heavily dependent on how your documents are organized. When using Generative Answers on Customer Sources, your documents should be chunked, with chunks no bigger than 2,000-3,000 tokens, and the chunked information should include as much metadata or context-aware information as possible to provide relevant results when you search the vector database. As the Customer Sources feature takes a dependency on Azure OpenAI Studio's Bring Your Own Data capability, you can use the built-in functionality of Azure OpenAI Studio to upload your documents via the studio. Azure OpenAI Studio will automatically extract the relevant information and split it into usable chunks. If you want to further improve the quality of your Azure AI Search Service index, you can manually chunk, enrich, and upload your own documents to the Azure AI Search Service index with relevant metadata to further improve vector search results. When using your websites as a source, we'll include as much information from the websites as possible, based on the token limitation of your Azure OpenAI model.
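The following is a minimal sketch of one way to chunk a document into token-bounded pieces with context-preserving metadata before indexing it in Azure AI Search. The chunk size, overlap, and metadata fields are assumptions to adapt to your own data; Azure OpenAI Studio can perform comparable chunking for you automatically.

```python
"""Illustrative only: token-bounded chunking with metadata prior to indexing.
Chunk size, overlap, and metadata fields are assumptions, not service defaults."""
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def chunk_document(text: str, title: str, source_url: str,
                   max_tokens: int = 2000, overlap: int = 200) -> list[dict]:
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        piece = encoding.decode(tokens[start:start + max_tokens])
        chunks.append({
            "content": piece,
            # Metadata that helps vector search return context-aware results.
            "title": title,
            "source_url": source_url,
            "chunk_id": len(chunks),
        })
        start += max_tokens - overlap  # overlap keeps context from being cut mid-sentence
    return chunks
```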
When using the Bing Custom Search Service, make sure your website is publicly accessible and only contains relevant and credible information. Don’t use websites that could contain wrong or harmful information.
When using websites that aren't managed by you or your organization, always make sure you're allowed to utilize this information into your service.
As these new features are built upon the Azure OpenAI Service functionality, for more information, please see: Use cases for Azure OpenAI - Azure Cognitive Services | Microsoft Learn.
Best practices for improving system performance
It's recommended to use the feedback system to understand if the AI-generated responses are useful for the end user.
It's recommended that customers keep the built-in symptom checker turned on. When end-users want to go through triage, it's advised to utilize the clinical triage engines instead.
It's recommended to always enable Generative Credible Fallback in the Healthcare Orchestrator, as this feature helps increase the rate of successfully found answers in the service.
For customer sources, customers can easily configure the Relevance Score threshold - the minimum relevance score that retrieved content must meet to be selected as the most relevant grounding content. The default threshold is 0.5. To further improve relevance for document retrieval, customers can decide to increase it to 0.8 or 0.9. Viewing and editing the Relevance Score can be done by entering the Data Connections setting under the Integration section. Select and edit the specific Customer Source Connection to find the Relevance Score under the Azure Cognitive Search Service configuration section. An illustrative sketch of the effect of this threshold appears after this list.
We currently don’t advise using more than 16K tokens, as this can incur latency and a bad user experience. We advise using the 16K model, which is sufficient in most cases for RAG.
We advise against using Generative AI models with large token sizes. Generative AI models struggle to use information found in the middle of a long input context, a phenomenon also known as “lost in the middle”. When using the Customer Source option, make sure your data is chunked with the right metadata. Don’t chunk your data with large token sizes, as this will decrease performance.
If you have different customer data sources in your healthcare agent service scenarios, it is advised to split up the vector indexes. Every vector index can be used as a step in the Scenario Editor or as a plugin in the orchestrator, meaning you have more control over what the model is grounded on and avoid overlapping documents for certain articles.
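The sketch below illustrates, in purely conceptual terms, how a relevance score threshold trades recall for precision when selecting grounding content. It does not change or reflect the actual portal setting described above; the scores and data are invented for illustration.

```python
"""Illustrative only: effect of a relevance score threshold on retrieved chunks.
The real threshold is configured in the healthcare agent service portal."""
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    content: str
    relevance: float  # similarity / re-ranker score normalized to 0..1

def select_grounding(chunks: list[RetrievedChunk], threshold: float = 0.5) -> list[RetrievedChunk]:
    """Keep only chunks that meet the configured relevance threshold."""
    return [c for c in chunks if c.relevance >= threshold]

chunks = [
    RetrievedChunk("Adult flu vaccination guidance ...", 0.91),
    RetrievedChunk("Pediatric dosage table ...", 0.62),
    RetrievedChunk("Unrelated press release ...", 0.31),
]
print(len(select_grounding(chunks, 0.5)))  # 2 chunks pass at the default threshold
print(len(select_grounding(chunks, 0.8)))  # 1 chunk passes at a stricter threshold
```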
Evaluating groundedness
When utilizing Customer Sources, it's important that you evaluate the resulting system for groundedness. To ensure optimal performance, conduct your own manual evaluations of the data you plan to use:
(1) Experiment with potential sources to decide on the optimal source for your specific use-case.
(2) Run a preliminary sample of use-case specific questions and use internal stakeholders to evaluate results.
(3) Consider incorporating the following metrics: Answer Relevance – is the answer relevant to the question asked? Groundedness – is the answer based on the retrieved evidence? Evidence Relevance – are the selected evidence sources and provided links relevant to the question asked? (A simple scoring harness for these metrics is sketched after this list.)
(4) Adjust the semantic ranker configuration and/or index definition if needed. Our service utilizes Azure AI Search to provide search functionality on customer documents, so customizing its search capabilities is essential to improving evidence retrieval quality. This can impact the healthcare agent service response quality and related metrics. In order to optimize the search process, it's advised to review the best practices described in the Azure AI Search documentation. You can also review the System Performance section in this document for recommendations on performance improvements.
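As an illustration of step (3), the following sketch shows a simple harness for recording manual ratings of Answer Relevance, Groundedness, and Evidence Relevance over a sample of use-case questions and summarizing them. The rating scale and record structure are assumptions; adapt them to your own evaluation process.

```python
"""Illustrative only: a minimal harness for manual groundedness evaluations.
The 0/1 rating scale and record fields are assumptions, not a prescribed method."""
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class EvaluationRecord:
    question: str
    answer: str
    evidence: list[str] = field(default_factory=list)
    answer_relevance: int = 0   # 1 if the answer addresses the question
    groundedness: int = 0       # 1 if every claim is supported by the evidence
    evidence_relevance: int = 0 # 1 if the cited evidence relates to the question

def summarize(records: list[EvaluationRecord]) -> dict[str, float]:
    return {
        "answer_relevance": mean(r.answer_relevance for r in records),
        "groundedness": mean(r.groundedness for r in records),
        "evidence_relevance": mean(r.evidence_relevance for r in records),
    }

sample = [
    EvaluationRecord(
        question="What are the symptoms of diabetes?",
        answer="Common symptoms include increased thirst and frequent urination.",
        evidence=["https://example.org/diabetes-overview"],
        answer_relevance=1, groundedness=1, evidence_relevance=1,
    ),
]
print(summarize(sample))
```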
The primary focus of your evaluation efforts should be on selecting the appropriate model for your specific use case, understanding the limitations and biases of the model, and rigorously testing the end-to-end customer sources experience.
Evaluation of healthcare agent service
The evaluation process for healthcare agent service quality included a multidimensional red-teaming process:
Harmful content evaluation – was conducted to assess the risk of propagation of harmful content into the healthcare agent service answers. This evaluation was performed by a diverse team of evaluators in terms of gender, geography, and cultural perspectives. The bot's answers were reviewed for the presence of harmful content using a large sample of potentially sensitive questions, across various risk categories such as hate, violence, sexual content, and self-harm. The team was instructed to test different sources (for example, credible and customer sources, including specific scenarios of using a malicious customer source). Multiturn conversations were also tested within this scope.
The assessment process yielded promising results with a minimal number of harms exposed. This is due to the effective combination of Azure’s Content Safety mechanism and the bot’s internal filtering processes, which together intercept most harmful content.
User prompt injection attacks (UPIA) and Cross domain prompt injection (XPIA) were evaluated using automated and manual testing on different grounding sources – credible, customer and a harmful adversarial public website. Testing yielded satisfactory safe results. These potential harms have been mitigated by enabling the Jailbreak content filter from Azure Content Filter, in combination with specific meta prompt instructions.
RAG assessment
Our evaluation method was devised to assess the healthcare agent service's capability to use the RAG framework to provide answers that are grounded solely on the retrieved evidence sources. The different aspects of the RAG framework that were inspected include the following:
a. Groundedness and Fabrications – assessing the healthcare agent service's capability to ground its responses on the provided evidence sources, while refraining from using its own “internal knowledge” or providing factually incorrect information. An example of an ungrounded answer for the user query “what are the symptoms of diabetes?” would be an answer that lists symptoms not present in the retrieved evidence source on which the answer should be based.
b. Answer Relevance – the ability to provide a response that is relevant to the user query.
An example of an irrelevant answer for the user query “what are the symptoms of diabetes?” would be an answer focused on the treatment of diabetes.
c. Evidence Relevance – assessing whether the selected evidence sources are relevant to the user query.
An example of irrelevant evidence for the user query “what are the symptoms of diabetes?” would be retrieved evidence that focuses on other diseases, such as lupus.
The assessment was done by a diverse group of female and male healthcare professionals from diverse geographies and age groups. Medical questions were created and tested by the team, and the team was instructed to query four different bot endpoints – three set up with credible medical sources (FDA, CDC, MedlinePlus) and one with customer healthcare data. Multiturn conversations were also tested within this scope.
Evaluation results
Overview of Evaluation Results
We evaluated our healthcare agent service using a detailed red-teaming process to check the system's safety and groundedness. A diverse team reviewed how the system handled potentially harmful content by analyzing sensitive questions across different risk areas. The results were safe with minimal instances of harmful outputs, thanks to Azure’s Content Safety tools, our healthcare safeguards and the bot's internal filters. Additionally, tests on user injection attacks showed that potential risks were well-managed through strong content filters and internal safety mechanisms.
Manual groundedness evaluation indicated a low frequency of ungrounded information when testing credible sources. However, when utilizing customer sources, it's important to only use trusted documents and evaluate the system’s provided outputs prior to implementation as the nature and complexity of different customer sources can highly vary. Review the following sections for more information: Best practices for improving system performance and Evaluating groundedness.
UX-level mitigations, via our Chat Safeguards, include the generated AI response, signature, and disclaimer stating that the answer is provided by generative AI and should be treated as such. Moreover, each answer from credible or customer sources is accompanied by the relevant evidence, for explainability and transparency, allowing end users to verify the correctness of the provided answer.
The evaluation results suggest a promising level of generalizability across different use cases that weren't explicitly tested in the evaluation process. However, it's advised to conduct preliminary testing when introducing new customer specific grounding sources.
In summary, while the evaluation has yielded encouraging results, ongoing monitoring and further investigations into untested scenarios are essential.
Evaluating and integrating Azure OpenAI templates into the healthcare agent service for your use
The intended uses of the new OpenAI features include enhancing experiences for healthcare professionals or patients by leveraging generative AI scenarios that allow clinicians and medical professionals to surface information from credible healthcare content. However, it is important to carefully consider use cases and avoid using the service as a substitute for professional medical advice or for purposes not permitted by patient consent or applicable law.
There are limitations to consider, including societal biases in the data used to train the AI models and technical limitations related to performance and system behavior. It is recommended to use trusted healthcare documents, evaluate the performance of Azure OpenAI Service for different languages, implement mechanisms to draw responses from trusted documents, and display proper warnings and disclaimers.
For improving system performance, it is recommended to use a feedback system, avoid using OpenAI models with high token counts, chunk data appropriately, and split vector indexes for different data sources used in the service. Always utilize public data that you manage and know what sources are included for Bing Custom Search scenarios.
In all cases, it is important to do an evaluation on what type of answers you want your service to answer. It is strongly advised to combine the Azure OpenAI feature with the available built-in capabilities such as Healthcare triage, or the built-in healthcare information.
For more information about how to responsibly integrate Azure OpenAI Service backed features for your use, please see the overview document from Azure OpenAI Service.