Classify data using trainable classifiers

Most organizations have content to protect that can't be detected using the methods used by sensitive information types. Data classification using trainable classifiers is useful when content isn't easily identified using pattern matching. Trainable classifiers apply the power of artificial intelligence (AI) and machine learning (ML) to find data to track, protect, and govern. You train a classifier to identify sensitive content based on what it is, rather than the elements in the item.

Note

This feature is a capability included with:

  • Microsoft 365 E5
  • Microsoft 365 E5 Compliance
  • Microsoft 365 E5 Information Protection and Governance

Please review Microsoft 365 licensing guidance for security & compliance to identify required licenses for your organization.

Types of classifiers

  • Pretrained classifiers: Microsoft provides pretrained classifiers that are ready to use without training. These classifiers appear with the status Ready to use.
  • Custom trainable classifiers: If the pretrained classifiers don't cover your needs, you can create and train your own.

For a complete list of pretrained classifiers, refer to Trainable Classifiers Definitions.

Custom trainable classifiers

You can create and train your own classifiers to look for data unique to your organization such as customer records, human resources data, and contracts. Creating a trainable classifier can take a significant amount of time and requires careful preparation.

Once the one-time setup process (described below) is complete, you can begin configuring trainable classifiers. The configuration process has three steps:

  1. Seed: Prepare your sample data and create the trainable classifier.
  2. Test: Prepare test data, test the predictive model, and evaluate the results.
  3. Publish: Make the trainable classifier available for use in your compliance solutions.

Once published, the classifier can classify content in locations like SharePoint Online, Exchange, and OneDrive. You can continue training it through a feedback process similar to the initial training.

Examples of custom trainable classifiers include legal documents, strategic business documents, pricing information, and financial information.

Here's each step explained in more detail.

One-time setup

A one-time scan must be completed before creating any custom trainable classifiers. This scan is needed so Microsoft 365 can learn more about the content in your organization. This process takes 7 to 14 days. The image shows the message you receive when attempting to create a custom trainable classifier for the first time.

Screenshot showing One time setup alert.

Seed

Step 1: Prepare sample data: Prepare content to seed your predictive model. The seed content should consist of known positive samples of the content you want to classify. Store the sample content in a SharePoint Online document library or folder. You need at least 50 and as many as 500 samples that strongly represent the type of content you want the trainable classifier to detect.
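The sample-count guidance above can be expressed as a simple pre-flight check before you upload seed content. This is a minimal sketch, not part of Microsoft Purview: the function name `check_seed_count` is hypothetical, and it assumes that 500 is treated as an upper bound.

```python
# Hypothetical helper; the limits come from the guidance above.
MIN_SEED_SAMPLES = 50
MAX_SEED_SAMPLES = 500  # assumption: treated here as an upper bound


def check_seed_count(sample_count: int) -> str:
    """Return a short verdict on whether a seed set size fits the guidance."""
    if sample_count < 0:
        raise ValueError("sample_count cannot be negative")
    if sample_count < MIN_SEED_SAMPLES:
        missing = MIN_SEED_SAMPLES - sample_count
        return f"too few: add at least {missing} more sample(s)"
    if sample_count > MAX_SEED_SAMPLES:
        extra = sample_count - MAX_SEED_SAMPLES
        return f"too many: remove at least {extra} sample(s)"
    return "ok"
```

For example, `check_seed_count(120)` returns `"ok"`, while a folder with only 30 documents would be flagged as too small.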

Step 2: Create trainable classifier: Create the classifier by navigating to Microsoft Purview compliance portal > Data classification > Trainable classifiers > Create trainable classifier. Give the classifier a name, a description, and provide the location of the seed content. The next image shows the Provide seed content from SharePoint page in the Create new classifier wizard.

Screenshot shows Provide seed content from SharePoint screen.

Within 24 hours, the trainable classifier processes the seed data and builds a prediction model. The classifier status is In progress while it processes the seed data. When the classifier is finished processing the seed data, the status changes to Need test items. You can now view the details page by choosing the classifier.

Screenshot shows trainable classifier ready for testing.

Test

Step 1: Prepare test data: To test the predictive model, assemble a set of test content items selected by people. The set should include a mix of strong positives, strong negatives, and less obvious examples, so that you can see how accurately the classifier distinguishes items that belong to the category from unrelated items. The set should consist of at least 200 items, with a maximum of 10,000. Make sure this set of content is different from the initial seed data. Create a second SharePoint document library or folder for the test content, move your content there, and wait for the folder to be indexed. After the classifier processes the test data, you manually review the results to determine the correctness of each prediction. This feedback helps improve the trainable classifier's prediction model.
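The size limits and the requirement that test content differ from seed content can be checked before you move files into the test library. The sketch below is illustrative only: `check_test_set` is a hypothetical helper, and it assumes each item is identified by a string such as a file name or content hash.

```python
# Hypothetical helper; the limits come from the guidance above.
MIN_TEST_ITEMS = 200
MAX_TEST_ITEMS = 10_000


def check_test_set(test_items: set[str], seed_items: set[str]) -> list[str]:
    """Return a list of problems found; an empty list means the set looks usable."""
    problems = []
    if len(test_items) < MIN_TEST_ITEMS:
        problems.append(f"only {len(test_items)} item(s); need at least {MIN_TEST_ITEMS}")
    if len(test_items) > MAX_TEST_ITEMS:
        problems.append(f"{len(test_items)} item(s); maximum is {MAX_TEST_ITEMS}")
    # Test content should not reuse anything from the seed set.
    overlap = test_items & seed_items
    if overlap:
        problems.append(f"{len(overlap)} item(s) also appear in the seed set")
    return problems
```

An empty result does not say anything about the quality of the mix of positives and negatives; that still needs human judgment.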

Step 2: Test predictive model: To start the test wizard, choose Add items to test and enter the SharePoint Online site, library, and folder URL for the test content location you created in the previous step. Then, select Add sites to include them. Complete the wizard by choosing Done. It may take up to an hour for the trainable classifier to process the test files. Once processing is complete, the status on the details page changes to Ready to review. If you need to increase the test sample size, select Add items to test and allow the trainable classifier to process the additional items.

Step 3: Evaluate predictions: When the trainable classifier is done processing your test data, you need to tell the model whether it's accurately predicting the relevance of the test content. The status of the Review items to improve the classifier accuracy step changes to Ready to review when it's ready for you to conduct the evaluation.

The next image shows the review process is currently underway with eight test items reviewed so far. The status is Not available because not enough test content has been evaluated yet.

Screenshot shows Test and review items to improve the classifier's accuracy.

Choose the Tested items to review tab to review and evaluate items. Microsoft 365 presents 30 items at a time. For each item, in the box labeled We predict this item is "Relevant". Do you agree?, choose Yes, No, or Not sure, skip to next item. Model accuracy is automatically updated every 30 items. You should review at least 200 items.
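To make the review loop concrete, here is a rough sketch of how reviewer answers could be tallied in batches of 30. It does not reproduce how Microsoft 365 computes accuracy. The class `ReviewTally`, the answer labels, and the formula are all assumptions: accuracy is taken to be the share of Yes answers among Yes and No answers, and skipped items count toward the batch but not toward accuracy.

```python
from dataclasses import dataclass
from typing import Optional

BATCH_SIZE = 30  # from the text: accuracy is updated every 30 items


@dataclass
class ReviewTally:
    """Hypothetical tally of reviewer answers; not a Microsoft Purview API."""
    agreed: int = 0
    disagreed: int = 0
    skipped: int = 0
    reported_accuracy: Optional[float] = None  # refreshed at batch boundaries

    @property
    def reviewed(self) -> int:
        return self.agreed + self.disagreed + self.skipped

    def current_accuracy(self) -> Optional[float]:
        # Assumption: skipped items are excluded from the accuracy figure.
        answered = self.agreed + self.disagreed
        return self.agreed / answered if answered else None

    def record(self, answer: str) -> None:
        if answer == "yes":
            self.agreed += 1
        elif answer == "no":
            self.disagreed += 1
        elif answer == "skip":
            self.skipped += 1
        else:
            raise ValueError(f"unknown answer: {answer!r}")
        # Refresh the reported figure only when a full batch is complete.
        if self.reviewed % BATCH_SIZE == 0:
            self.reported_accuracy = self.current_accuracy()
```

In this model, the reported figure stays unchanged in the middle of a batch, which mirrors the behavior described above where accuracy updates only after every 30 items.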

Screenshot shows Publish trainable classifier.

Once you have reviewed enough items and accuracy reaches at least 70%, you can publish the trainable classifier. You can also choose to continue improving the accuracy of the model by conducting more testing and evaluation.
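The publishing condition can be summarized as a two-part check. This is a sketch under stated assumptions: `can_publish` is a hypothetical function, and "enough items" is interpreted as the 200-item review recommendation given earlier, which the portal may apply differently.

```python
MIN_REVIEWED_ITEMS = 200  # assumption: "enough items" means the 200-item recommendation
MIN_ACCURACY = 0.70       # from the text: accuracy must reach at least 70%


def can_publish(reviewed_items: int, accuracy: float) -> bool:
    """Return True when both the review-count and accuracy conditions hold."""
    return reviewed_items >= MIN_REVIEWED_ITEMS and accuracy >= MIN_ACCURACY
```

Under these assumptions, `can_publish(210, 0.74)` is true, while `can_publish(210, 0.65)` and `can_publish(90, 0.95)` are both false. Meeting the threshold only makes publishing possible; you can still keep testing to raise accuracy first.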

Screenshot shows Publish trainable classifier with status Ready to use.

Publish

Publish the trainable classifier when you're satisfied with the results from the predictive model. Once published, your custom trainable classifier is available in selected compliance solutions. The status for a published trainable classifier is Ready to use.
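The statuses mentioned throughout this unit form a simple progression. The sketch below models it as a linear state machine for illustration only. The names `ClassifierStatus` and `advance` are hypothetical, and the portal may show other statuses or transitions that are not covered here.

```python
from enum import Enum


class ClassifierStatus(Enum):
    """Statuses named in this unit, in the order they are described."""
    IN_PROGRESS = "In progress"          # seed data is being processed
    NEED_TEST_ITEMS = "Need test items"  # seed processing finished
    READY_TO_REVIEW = "Ready to review"  # test items processed
    READY_TO_USE = "Ready to use"        # classifier published


# Assumption: a strictly linear progression, one step at a time.
NEXT_STATUS = {
    ClassifierStatus.IN_PROGRESS: ClassifierStatus.NEED_TEST_ITEMS,
    ClassifierStatus.NEED_TEST_ITEMS: ClassifierStatus.READY_TO_REVIEW,
    ClassifierStatus.READY_TO_REVIEW: ClassifierStatus.READY_TO_USE,
}


def advance(status: ClassifierStatus) -> ClassifierStatus:
    """Return the next status, or raise if the classifier is already published."""
    if status not in NEXT_STATUS:
        raise ValueError(f"{status.value!r} is the final status")
    return NEXT_STATUS[status]
```

For instance, `advance(ClassifierStatus.IN_PROGRESS)` returns `ClassifierStatus.NEED_TEST_ITEMS`, which corresponds to the point where seed processing finishes and test content is required.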

Trainable classifiers interactive guide

Use the Identify content using trainable classifiers interactive guide for a walkthrough on using pretrained classifiers and creating custom trainable classifiers.

Cover for an interactive guide that says How to: Identify content using trainable classifiers.
