Learn about search and analytics settings in eDiscovery (preview) cases
You can configure settings for each eDiscovery (preview) case to control the following functionality:
- Near duplicates and email threading
- Themes
- Autogenerated review set query
- Ignore text
- Optical character recognition
Configure analytics settings for a case
To configure search and analytics settings for a case:
- Go to the Microsoft Purview portal and sign in using the credentials for a user account assigned eDiscovery permissions.
- Select the eDiscovery solution card and then select Cases (preview) in the left nav.
- Select a case, then select Case settings.
- On the Case settings page, select Search & analytics.
- The case Search & analytics page is displayed. These settings are applied to all review sets in a case.
- After selecting the applicable search and analytics options, select Save.
The following sections in this article describe the analytics settings that you can configure for a case.
Near duplicates and email threading
In this section, you can set parameters for duplicate detection, near duplicate detection, and email threading.
- Near duplicates/email threading: When turned on, duplicate detection, near duplicate detection, and email threading are included as part of the workflow when you run analytics on the data in a review set.
- Document and email similarity threshold: If the similarity level for two documents is above the threshold, both documents are put in the same near duplicate set.
- Minimum/maximum number of words: These settings specify that near duplicates and email threading analysis are performed only on documents that have at least the minimum number of words and at most the maximum number of words.
Near duplicate detection
Consider a set of documents to be reviewed in which a subset is based on the same template and has mostly the same boilerplate language, with a few differences here and there. If a reviewer could identify this subset, review one of the documents thoroughly, and review only the differences for the rest, they wouldn't miss any unique information and would spend only a fraction of the time it would take to read all the documents cover to cover. Near duplicate detection groups textually similar documents together to help you make your review process more efficient.
When near duplicate detection is run, the system parses every document with text. Then, it compares each document against every other document to determine whether their similarity is greater than the set threshold. If it is, the documents are grouped together. Once all documents have been compared and grouped, a document from each group is marked as the "pivot". When reviewing your documents, you can review a pivot first and then review the other documents in the same near duplicate set, focusing on the differences between the pivot and the document that is in review.
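The grouping described above can be illustrated with a minimal Python sketch. It is not the algorithm that eDiscovery uses: the similarity metric (`difflib.SequenceMatcher` over words), the greedy comparison against each group's pivot instead of a full pairwise comparison, the choice of the first document in a group as its pivot, and all names and default values are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def word_count(text):
    return len(text.split())


def similarity(a, b):
    """Word-level similarity ratio in [0, 1]. A stand-in metric, not the product's."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def group_near_duplicates(docs, threshold=0.65, min_words=10, max_words=500_000):
    """Greedy grouping: a document joins the first group whose pivot it resembles.

    Assumption: the first document placed in a group acts as its pivot.
    Documents outside the min/max word range are skipped entirely.
    """
    groups = []  # each group: {"pivot": doc_id, "members": [doc_id, ...]}
    for doc_id, text in docs.items():
        if not (min_words <= word_count(text) <= max_words):
            continue
        for group in groups:
            if similarity(docs[group["pivot"]], text) > threshold:
                group["members"].append(doc_id)
                break
        else:
            # No existing pivot is similar enough, so start a new group.
            groups.append({"pivot": doc_id, "members": [doc_id]})
    return groups


# Hypothetical sample documents.
docs = {
    "contract_a": "This agreement is made between Contoso and Fabrikam "
                  "for the supply of widgets in 2023",
    "contract_b": "This agreement is made between Contoso and Northwind "
                  "for the supply of widgets in 2024",
    "menu": "Lunch menu for Friday includes soup salad pasta and a "
            "selection of desserts for everyone",
    "note": "too short",
}
groups = group_near_duplicates(docs)
```

In this sample, the two contracts share a template and land in one group with `contract_a` as the pivot, the menu forms its own group, and the two-word note is skipped because it falls below the minimum word count.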
Email threading
Consider an email conversation that has been going on for a while. In most cases, the last message in the email thread includes the contents of all the preceding messages. Therefore, reviewing the last message gives a complete context of the conversation that happened in the thread. Email threading identifies such messages so that reviewers can review a fraction of collected documents without losing any context.
Email threading parses each email thread and deconstructs it into individual messages. Each email thread is a chain of individual messages. eDiscovery (preview) analyzes all email messages in the review set to determine whether an email message has unique content or if the chain (parent messages) is wholly contained in the final message in the email thread. Each email message is assigned one of four inclusive values:
- Inclusive: An Inclusive email is the final email message in an email thread and contains all the previous content of that email thread.
- Inclusive minus: An email message is designated as Inclusive minus if there are one or more attachments associated with the specific message within the email thread. A reviewer can use the Inclusive minus value to determine which specific email message within the thread has associated attachments.
- Inclusive copy: An email message is considered an Inclusive copy if it's an exact copy of an Inclusive or Inclusive minus message.
- None: The None value indicates that the content of the message is wholly contained in at least one other email message that is marked as Inclusive or Inclusive minus.
How is it different from conversations in Outlook?
At a glance, this sounds similar to conversation groupings in Outlook. However, there are some important distinctions. Consider an email conversation that got forked into two conversations; for instance, someone responded to an email that isn't the latest in the conversation so the last two emails in the conversation both have unique content.
Outlook would still group the emails into a single conversation. Reading only the last email may miss the context of the second-to-last email, which also contains unique content. Because email threading parses each email into individual components and compares them, email threading would mark both of the last two emails as Inclusive, ensuring that you won't miss any context as long as you read all emails marked as Inclusive.
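The forked conversation can be illustrated with a small Python sketch. It assumes that each email has already been deconstructed into its chain of individual messages, and it only distinguishes Inclusive from None; attachments (Inclusive minus) and exact copies (Inclusive copy) are left out. The function name and the data layout are hypothetical and don't reflect how eDiscovery stores or processes threads.

```python
def classify_inclusives(emails):
    """Label each email as "Inclusive" or "None".

    `emails` maps an email ID to its chain: a tuple of the individual
    message segments the email contains, oldest first. An email is "None"
    when its whole chain is the beginning of another, longer chain, which
    means its content is wholly contained in that other email.
    """
    labels = {}
    for email_id, chain in emails.items():
        contained = any(
            other_id != email_id
            and len(other) > len(chain)
            and other[:len(chain)] == chain
            for other_id, other in emails.items()
        )
        labels[email_id] = "None" if contained else "Inclusive"
    return labels


# Hypothetical thread: e3 and e4 are both replies to e2, so the thread forks.
thread = {
    "e1": ("m1",),
    "e2": ("m1", "m2"),
    "e3": ("m1", "m2", "m3"),
    "e4": ("m1", "m2", "m4"),
}
labels = classify_inclusives(thread)
```

Here `e1` and `e2` are labeled None because later emails contain them in full, while both `e3` and `e4` are labeled Inclusive because each carries content that no other email contains.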
Themes
In this section, you can set the following parameters for themes:
- Themes: When turned on, themes clustering is performed as part of the workflow when you run analytics on the data in a review set.
- Maximum number of themes: Specifies the maximum number of themes that can be generated when you run analytics on the data in a review set.
- Include numbers in themes: When turned on, numbers (that identify a theme) are included when generating themes.
- Adjust maximum number of themes dynamically: In certain situations, there may not be enough documents in a review set to produce the desired number of themes. When this setting is enabled, eDiscovery adjusts the maximum number of themes dynamically rather than attempting to enforce the maximum number of themes.
When you create a new document, you generally start with one or more ideas that you want to convey in the document, and then compose the document using words that align with these ideas. The more prevalent an idea is, the more frequently the words related to that idea tend to appear. This also aligns with how readers consume documents. The important things to understand from reading a document are the main ideas that the document is trying to convey, including which ideas appear where and what the relationships between the ideas are.
This process can be extended to how an eDiscovery reviewer wants to consume a set of documents in a case. They want to see which ideas are present in the review sets and which documents are talking about those ideas. If they find a particular document of interest, they want to be able to see documents that discuss similar ideas.
The Themes functionality in eDiscovery attempts to mimic how humans reason about documents, by analyzing the themes that are discussed in a review set and assigning a theme to documents in the review set. In eDiscovery, Themes goes one step further and identifies the dominant theme in each review set and document. The dominant theme is the one that appears the most often in a document.
How do themes work?
The Themes functionality analyzes documents with text in a review set to parse out common themes that appear across all the documents in the review set. eDiscovery assigns those themes to the documents in which they appear. It also labels each theme with the words used in the documents that are representative of the theme. Because a document can contain various types of subject matter, eDiscovery often assigns multiple themes to review sets and documents. This is referred to as the Themes list. The theme that appears most prominently in a review set or document is designated as its dominant theme.
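The relationship between the Themes list and the dominant theme can be illustrated with a short Python sketch. It is deliberately simplified: in eDiscovery the themes are discovered by analytics, whereas this sketch starts from a hypothetical, hand-written mapping of theme names to representative words and only counts word occurrences. All names and data are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical themes, each labeled with a few representative words.
THEMES = {
    "budget": {"budget", "cost", "invoice"},
    "hiring": {"candidate", "interview", "offer"},
}


def themes_for(text, themes=THEMES):
    """Return (themes_list, dominant_theme) for one document.

    The themes list holds every theme whose words appear in the text,
    ordered from most to least frequent. The dominant theme is the first
    entry, or None when no theme matches.
    """
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {name: sum(words[w] for w in kws) for name, kws in themes.items()}
    ranked = sorted(
        (name for name, score in scores.items() if score > 0),
        key=lambda name: -scores[name],
    )
    return ranked, (ranked[0] if ranked else None)


text = ("The budget review flagged one invoice and a cost overrun "
        "before the candidate interview.")
themes_list, dominant = themes_for(text)
```

For this sample text, both themes appear in the themes list, and `budget` is the dominant theme because its words occur three times compared with two for `hiring`.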
Configuring Themes
Themes are supported for cases and apply to all the review sets within them. You can configure the settings for themes when you create a new case or you can update the theme settings for an existing case.
To configure themes in a case, complete the following steps:
- Go to the Microsoft Purview portal and sign in using the credentials for a user account assigned eDiscovery permissions.
- Select the eDiscovery solution card and then select Cases (preview) in the left nav.
- Select a case, then select Case settings.
- On the Case settings page, select Search & analytics.
- Select the following theme options as applicable:
- Max number of themes: Specifies the maximum number of themes that can be generated when you run analytics on the data in review sets included in a case. For more information on limits, see Limits in eDiscovery.
- Include numbers in themes: Numbers (that identify a theme) are included when generating themes.
- Adjust maximum number of themes dynamically: In certain situations, there may not be enough documents in a review set to produce the desired number of themes for the case. When this setting is enabled, the maximum number of themes is adjusted dynamically rather than attempting to enforce the maximum number of themes.
- If you need to exclude keywords associated with themes, enter the text or regular expression needed in the Ignore text field. In the Apply to field, select Themes to apply the text or regular expression to all themes.
- Select Save.
After a new case is created, analytics are automatically run on the data when the review sets are added to the case. Themes for the review sets are generated as part of the analytics processing.
Review set query
If you select the Automatically create a For Review saved search after analytics checkbox, eDiscovery autogenerates a review set query named For Review.
This query filters out duplicate items from the review set, allowing you to quickly review the unique items in the review set. This query is created only when you run analytics for a review set in the case. For more information about review set queries, see Query the data in a review set.
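The effect of filtering out duplicates can be sketched in Python. This is only an illustration of the idea of keeping one item per set of exact duplicates; it doesn't show the conditions that the For Review query actually uses. The hash-based comparison, the function name, and the sample items are assumptions.

```python
import hashlib


def unique_items(items):
    """Keep the first item for each distinct text, in original order.

    `items` is a list of (item_id, text) pairs. Two items count as
    duplicates here only when their text is byte-for-byte identical.
    """
    seen = set()
    unique = []
    for item_id, text in items:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(item_id)
    return unique


# Hypothetical review set: items "a" and "c" have identical text.
items = [("a", "Quarterly report"), ("b", "Meeting notes"), ("c", "Quarterly report")]
to_review = unique_items(items)
```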
Ignore text
There are situations where certain text diminishes the quality of analytics, such as lengthy disclaimers that get added to email messages regardless of the content of the email. If you know of text that should be ignored, you can exclude it from analytics by specifying the text string and the analytics functionality (near-duplicates, email threading, themes, and relevance) that the text should be excluded for. Using regular expressions (RegEx) for ignored text is also supported.
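As an illustration of how a regular expression can target a recurring disclaimer, the following Python sketch removes a hypothetical disclaimer from message text before any further analysis. The disclaimer wording and function name are assumptions, and the sketch uses Python's `re` syntax, which may differ from the regular expression syntax accepted in the Ignore text field.

```python
import re

# Hypothetical disclaimer pattern. DOTALL lets "." span line breaks, and the
# non-greedy ".*?" stops at the first "delete it." after the opening phrase.
IGNORE_PATTERNS = [
    re.compile(
        r"This email and any attachments are confidential.*?delete it\.",
        re.IGNORECASE | re.DOTALL,
    ),
]


def strip_ignored_text(text, patterns=IGNORE_PATTERNS):
    """Remove every match of every ignore pattern, then trim whitespace."""
    for pattern in patterns:
        text = pattern.sub("", text)
    return text.strip()


body = (
    "Please review the contract.\n\n"
    "This email and any attachments are confidential. "
    "If you received it in error, delete it."
)
cleaned = strip_ignored_text(body)
```

After the disclaimer is removed, only the sentence written by the sender remains, so two messages with different content no longer look similar just because they share the same footer.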
Optical character recognition (OCR)
When this setting is turned on, OCR processing runs on image files. OCR processing runs in the following situations:
- When data sources are added to a case: When OCR is applied to image files, the text in those files is available in search results. OCR processing is performed during the Advanced indexing process (if this option is selected in the search query), and only on items that are processed, or reindexed, during Advanced indexing. For example, if a large PDF file that is partially indexed or had other indexing errors is processed during Advanced indexing, the file has OCR applied. As a result, there may be situations where data sources are added to a case, but some email attachments aren't processed for OCR because those files aren't processed during Advanced indexing.
- When content is added from other data sources: This applies to data sources that aren't associated with a case, when the search results are added to a review set.
After data is added to a review set, image text can be reviewed, searched, tagged, and analyzed. You can view the extracted text in the Text viewer of the selected image file in the review set. For more information, see: