AI/ML Pivots to the Security Development Lifecycle Bug Bar

By Andrew Marshall, Jugal Parikh, Emre Kiciman, and Ram Shankar Siva Kumar

November 2019

This article is a deliverable of the Microsoft AETHER Engineering Practices for AI Working Group. It supplements the existing SDL bug bar used to triage traditional security vulnerabilities and is intended as a reference for triaging AI/ML-related security issues. The Vulnerability Severity Classification for AI Systems, published by the Microsoft Security Response Center, defines common vulnerability types and severity levels for systems involving AI.

This guidance is organized around the Adversarial Machine Learning Threat Taxonomy, created by Ram Shankar Siva Kumar, David O'Brien, Kendra Albert, Salome Viljoen, and Jeffrey Snover, and titled Failure Modes in Machine Learning. While the research this content is based on addresses both intentional/malicious and accidental behaviors in ML failure modes, this bug bar supplement focuses entirely on intentional/malicious behaviors that would result in a security incident and/or deployment of a fix.

Threat Descriptions, Business Risks, and Examples
Data Poisoning

Corrupting the training data – The attacker's end goal is to contaminate the machine learning model generated in the training phase, so that predictions on new data are modified in the testing phase.

In targeted poisoning attacks, the attacker wants to misclassify specific examples to cause specific actions to be taken or omitted.

Submitting AV software as malware to force its misclassification as malicious and eliminate the use of targeted AV software on client systems.

A company scrapes a well-known and trusted website for futures data to train its models. The data provider's website is then compromised via a SQL injection attack. The attacker can poison the dataset at will, and the model being trained has no notion that the data is tainted.
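
As a rough illustration of the mechanics, the following sketch shows a label-flipping poisoning attack in which an attacker who controls 20 percent of the training labels contaminates the model produced by training. The dataset, classifier, and poisoning rate are illustrative assumptions, not a reference to any specific pipeline.

```python
# Minimal sketch of a label-flipping poisoning attack against a
# scikit-learn classifier trained on a synthetic dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The attacker flips the labels of 20% of the training rows they control.
rng = np.random.default_rng(0)
poison_idx = rng.choice(len(y_train), size=int(0.2 * len(y_train)), replace=False)
y_poisoned = y_train.copy()
y_poisoned[poison_idx] = 1 - y_poisoned[poison_idx]

clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)

# Compare held-out accuracy of the clean and poisoned models.
print("clean accuracy:   ", clean_model.score(X_test, y_test))
print("poisoned accuracy:", poisoned_model.score(X_test, y_test))
```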

Model Stealing

Recreation of the underlying model by legitimately querying it. The functionality of the new model is the same as that of the underlying model. Once the model is recreated, it can be inverted to recover feature information or make inferences on training data.

Equation solving – For a model that returns class probabilities via API output, an attacker can craft queries to determine the unknown variables of the model.
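
A minimal sketch of equation solving against a hypothetical binary logistic-regression API that returns class probabilities: each query's returned probability yields one linear equation in the model's unknown weights and bias. The victim model and dataset below are synthetic placeholders.

```python
# Minimal sketch: recover a logistic-regression model's weights from
# probability outputs by solving a linear system of log-odds equations.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Victim model standing in for a remote prediction API.
X, y = make_classification(n_samples=500, n_features=5, random_state=3)
target = LogisticRegression(max_iter=1000).fit(X, y)

d = X.shape[1]
queries = np.eye(d + 1, d)                  # d + 1 linearly independent queries
probs = target.predict_proba(queries)[:, 1]
log_odds = np.log(probs / (1 - probs))      # log-odds = w·x + b for each query

# Solve [x | 1] @ [w, b] = log_odds for the d weights and the bias.
A = np.hstack([queries, np.ones((d + 1, 1))])
recovered = np.linalg.solve(A, log_odds)

# Recovery is exact up to floating-point error.
print("recovered weights:", recovered[:d])
print("true weights:     ", target.coef_.ravel())
print("recovered bias:   ", recovered[d], " true bias:", target.intercept_[0])
```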

Path finding – An attack that exploits API particularities to extract the "decisions" taken by a tree when classifying an input.

Transferability attack – An adversary can train a local model, possibly by issuing prediction queries to the targeted model, and use it to craft adversarial examples that transfer to the target model. If your model is extracted and found to be vulnerable to a type of adversarial input, new attacks against your production-deployed model can be developed entirely offline by the attacker who extracted a copy of your model.
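
A minimal sketch of the extraction step behind a transferability attack: the attacker only sends queries to the victim model and trains a local surrogate on the responses, which can then be probed offline. The victim model, query distribution, and surrogate choice are all illustrative assumptions.

```python
# Minimal sketch of model extraction via prediction queries: the
# attacker never sees the training data, only the victim's outputs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Victim model standing in for a remote prediction API.
X, y = make_classification(n_samples=5000, n_features=10, random_state=1)
target_model = RandomForestClassifier(random_state=1).fit(X, y)

# The attacker synthesizes queries and records the returned labels.
rng = np.random.default_rng(1)
queries = rng.normal(size=(5000, 10))
stolen_labels = target_model.predict(queries)

# Train a local surrogate on the query/response pairs.
surrogate = LogisticRegression(max_iter=1000).fit(queries, stolen_labels)

# Agreement on fresh inputs approximates how faithfully the model
# was extracted; the surrogate can now be attacked entirely offline.
probe = rng.normal(size=(1000, 10))
agreement = np.mean(surrogate.predict(probe) == target_model.predict(probe))
print("surrogate/target agreement:", agreement)
```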

In settings where an ML model serves to detect adversarial behavior, such as identification of spam, malware classification, and network anomaly detection, model extraction can facilitate evasion attacks.

Model Inversion

The private features used in machine learning models can be recovered. This includes reconstructing private training data that the attacker doesn't have access to. This is accomplished by finding the input which maximizes the confidence level returned, subject to the classification matching the target.

Example: Reconstruction of facial recognition data from guessed or known names and API access to query the model.
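
A minimal sketch of the core inversion loop, assuming white-box access to a small PyTorch classifier for brevity (the attack described above needs only the returned confidence scores): gradient ascent on the input to maximize the confidence of a chosen class. The architecture, input size, and target class are placeholders.

```python
# Minimal sketch of model inversion: optimize an input until the model
# is maximally confident it belongs to the target class.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
model.eval()
target_class = 3  # placeholder class whose training data is being inverted

# Start from noise and push the input toward whatever the model
# considers a high-confidence example of the target class.
x = torch.randn(1, 64, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    logits = model(x)
    loss = -torch.log_softmax(logits, dim=1)[0, target_class]
    loss.backward()
    optimizer.step()

print("reconstructed input confidence:",
      torch.softmax(model(x), dim=1)[0, target_class].item())
```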

Adversarial Example in Physical Domain

These examples can manifest in the physical domain, like a self-driving car tricked into running a stop sign because a certain color of light (the adversarial input) is shone on the sign, forcing the image recognition system to no longer recognize it as a stop sign.

Attack ML Supply Chain

Owing to the large resources (data and computation) required to train algorithms, the current practice is to reuse models trained by large corporations and modify them slightly for the task at hand (for example, ResNet is a popular image recognition model from Microsoft).

These models are curated in a Model Zoo (Caffe hosts popular image recognition models).

In this attack, the adversary tampers with the models hosted in Caffe, thereby poisoning the well for anyone else.
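
One common countermeasure is to pin and verify the digest of any pretrained model pulled from a model zoo before loading it. The sketch below is a minimal illustration only; the file name and expected digest are placeholders, not references to any real artifact.

```python
# Minimal sketch: verify a downloaded pretrained model against a pinned
# SHA-256 digest before loading it into a training or serving pipeline.
import hashlib

# Placeholder digest; in practice this is pinned from a trusted source.
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

model_path = "pretrained_model.bin"  # placeholder filename
if sha256_of(model_path) != EXPECTED_SHA256:
    raise RuntimeError("Pretrained model does not match the pinned digest; refusing to load.")
```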

Backdoored Algorithm from Malicious ML Provider

Compromising the underlying algorithm

A malicious ML-as-a-Service provider presents a backdoored algorithm through which the private training data can be recovered. This gives the attacker the ability to reconstruct sensitive data such as faces and text, given only the model.

Neural Net Reprogramming

With a specially crafted query from an attacker, ML systems can be reprogrammed to perform a task that deviates from the creator's original intent.

Weak access controls on a facial recognition API enable third parties to incorporate it into apps designed to harm users, such as a deepfake generator.

This is an abuse/account takedown scenario.

Adversarial Perturbation

In perturbation-style attacks, the attacker stealthily modifies the query to get a desired response from a production-deployed model. This is a breach of model input integrity that leads to fuzzing-style attacks where the end result isn't necessarily an access violation or elevation of privilege (EoP); instead, it compromises the model's classification performance.

This can manifest as trolls using certain target words in a way that causes the AI to ban them, effectively denying service to legitimate users whose names match a "banned" word.

Forcing benign emails to be classified as spam or causing a malicious example to go undetected. These are also known as model evasion or mimicry attacks.

An attacker can craft inputs to reduce the confidence level of correct classification, especially in high-consequence scenarios. This can also take the form of a large number of false positives meant to overwhelm administrators or monitoring systems with fraudulent alerts indistinguishable from legitimate alerts.
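
A minimal sketch of a perturbation attack using the fast gradient sign method (FGSM), assuming white-box access to a small PyTorch classifier. The model, input, label, and perturbation budget are placeholders for illustration.

```python
# Minimal sketch of an FGSM perturbation: nudge the input in the
# direction that increases the classification loss, within a budget.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
model.eval()

x = torch.randn(1, 64)           # a legitimate input the attacker controls
true_label = torch.tensor([3])   # its correct class (placeholder)
epsilon = 0.25                   # perturbation budget

# Compute the gradient of the loss with respect to the input.
x_adv = x.clone().requires_grad_(True)
loss = F.cross_entropy(model(x_adv), true_label)
loss.backward()

# Step in the sign of the gradient, keeping the change small and stealthy.
x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

print("original prediction: ", model(x).argmax(dim=1).item())
print("perturbed prediction:", model(x_adv).argmax(dim=1).item())
```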

Membership Inference

Infer individual membership in a group used to train a model.

Example: prediction of surgical procedures based on age/gender/hospital.
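
A minimal sketch of a confidence-threshold membership inference test against an overfit model: records the model was trained on tend to receive higher confidence than unseen records, and that gap leaks membership. The data, model, and threshold are synthetic stand-ins.

```python
# Minimal sketch of membership inference via a confidence threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=2)
X_member, X_nonmember, y_member, y_nonmember = train_test_split(X, y, random_state=2)

# Victim model trained only on the "member" split.
model = RandomForestClassifier(random_state=2).fit(X_member, y_member)

def top_confidence(samples):
    # Highest class probability the model assigns to each record.
    return model.predict_proba(samples).max(axis=1)

# The attacker guesses "member" whenever confidence exceeds a threshold.
threshold = 0.9
member_hits = np.mean(top_confidence(X_member) > threshold)
nonmember_hits = np.mean(top_confidence(X_nonmember) > threshold)
print("flagged as members (true members):    ", member_hits)
print("flagged as members (true non-members):", nonmember_hits)
```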