Best practices for using the Multivariate Anomaly Detector API
Important
Starting on 20 September 2023, you won't be able to create new Anomaly Detector resources. The Anomaly Detector service will be retired on 1 October 2026.
This article provides guidance on recommended practices to follow when using the multivariate Anomaly Detector (MVAD) APIs. In this article, you'll learn about:
- API usage: Learn how to use MVAD without errors.
- Data engineering: Learn how to best prepare your data so that MVAD performs with better accuracy.
- Common pitfalls: Learn how to avoid common pitfalls that customers encounter.
- FAQ: Learn answers to frequently asked questions.
API usage
Follow the instructions in this section to avoid errors while using MVAD. If you still get errors, refer to the full list of error codes for explanations and actions to take.
Input parameters
Required parameters
These three parameters are required in training and inference API requests:
- `source` - The link to your zip file located in Azure Blob Storage with Shared Access Signatures (SAS).
- `startTime` - The start time of the data used for training or inference. If it's earlier than the actual earliest timestamp in the data, the actual earliest timestamp is used as the starting point.
- `endTime` - The end time of the data used for training or inference, which must be later than or equal to `startTime`. If `endTime` is later than the actual latest timestamp in the data, the actual latest timestamp is used as the ending point. If `endTime` equals `startTime`, it means inference of one single data point, which is often used in streaming scenarios.
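To make these parameters concrete, here's a minimal sketch of a training request in Python. The request path, API version, and response handling are assumptions that vary by API version and region; substitute your own resource endpoint, key, and blob SAS URL.

```python
import requests

# Assumed placeholders -- replace with your own resource endpoint and key.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
API_KEY = "<your-api-key>"

# The three required parameters described above.
body = {
    "source": "https://<account>.blob.core.windows.net/<container>/series.zip?<sas-token>",
    "startTime": "2021-01-01T00:00:00Z",
    "endTime": "2021-01-02T00:00:00Z",
}

# The request path below is an assumption; check the API reference for the
# version your resource uses.
resp = requests.post(
    f"{ENDPOINT}/anomalydetector/v1.1-preview/multivariate/models",
    headers={"Ocp-Apim-Subscription-Key": API_KEY},
    json=body,
)
resp.raise_for_status()
# The new model's ID is typically returned in the Location header.
print(resp.headers.get("Location"))
```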
Optional parameters for training API
Other parameters for the training API are optional:
- `slidingWindow` - How many data points are used to determine anomalies. An integer between 28 and 2,880. The default value is 300. If `slidingWindow` is `k` for model training, then at least `k` points should be accessible from the source file during inference to get valid results.

  MVAD takes a segment of data points to decide whether the next data point is an anomaly. The length of the segment is `slidingWindow`. Keep two things in mind when choosing a `slidingWindow` value:

  - The properties of your data: whether it's periodic, and the sampling rate. When your data is periodic, you could set the length of 1 - 3 cycles as the `slidingWindow`. When your data is at a high frequency (small granularity) like minute-level or second-level, you could set a relatively higher value of `slidingWindow`.
  - The trade-off between training/inference time and potential performance impact. A larger `slidingWindow` may cause longer training/inference time, and there is no guarantee that a larger `slidingWindow` will lead to accuracy gains. A small `slidingWindow` may make it difficult for the model to converge to an optimal solution. For example, it is hard to detect anomalies when `slidingWindow` has only two points.
- `alignMode` - How to align multiple variables (time series) on timestamps. There are two options for this parameter, `Inner` and `Outer`, and the default value is `Outer`.

  This parameter is critical when there is misalignment between the timestamp sequences of the variables. The model needs to align the variables onto the same timestamp sequence before further processing. `Inner` means the model will report detection results only on timestamps on which every variable has a value, that is, the intersection of all variables. `Outer` means the model will report detection results on timestamps on which any variable has a value, that is, the union of all variables.

  Here is an example to explain different `alignMode` values (both join behaviors are also demonstrated in the pandas sketch after this list).

  Variable-1

  | timestamp | value |
  |---|---|
  | 2020-11-01 | 1 |
  | 2020-11-02 | 2 |
  | 2020-11-04 | 4 |
  | 2020-11-05 | 5 |

  Variable-2

  | timestamp | value |
  |---|---|
  | 2020-11-01 | 1 |
  | 2020-11-02 | 2 |
  | 2020-11-03 | 3 |
  | 2020-11-04 | 4 |

  `Inner` join of the two variables:

  | timestamp | Variable-1 | Variable-2 |
  |---|---|---|
  | 2020-11-01 | 1 | 1 |
  | 2020-11-02 | 2 | 2 |
  | 2020-11-04 | 4 | 4 |

  `Outer` join of the two variables:

  | timestamp | Variable-1 | Variable-2 |
  |---|---|---|
  | 2020-11-01 | 1 | 1 |
  | 2020-11-02 | 2 | 2 |
  | 2020-11-03 | `nan` | 3 |
  | 2020-11-04 | 4 | 4 |
  | 2020-11-05 | 5 | `nan` |
- `fillNAMethod` - How to fill `nan` values in the merged table. There might be missing values in the merged table, and they should be properly handled. We provide several methods to fill them. The options are `Linear`, `Previous`, `Subsequent`, `Zero`, and `Fixed`, and the default value is `Linear`. Each option's behavior is shown in the pandas sketch after this list.

  | Option | Method |
  |---|---|
  | `Linear` | Fill `nan` values by linear interpolation. |
  | `Previous` | Propagate the last valid value to fill gaps. Example: `[1, 2, nan, 3, nan, 4]` -> `[1, 2, 2, 3, 3, 4]` |
  | `Subsequent` | Use the next valid value to fill gaps. Example: `[1, 2, nan, 3, nan, 4]` -> `[1, 2, 3, 3, 4, 4]` |
  | `Zero` | Fill `nan` values with 0. |
  | `Fixed` | Fill `nan` values with a specified valid value that should be provided in `paddingValue`. |

- `paddingValue` - The padding value is used to fill `nan` when `fillNAMethod` is `Fixed`, and must be provided in that case. In other cases, it is optional.

- `displayName` - An optional parameter used to identify models. For example, you can use it to mark parameters, data sources, and any other metadata about the model and its input data. The default value is an empty string.
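The following pandas sketch reproduces the semantics of `alignMode` and `fillNAMethod` described above, using the Variable-1/Variable-2 example. It illustrates the documented behavior; it is not the service's actual implementation.

```python
import pandas as pd

v1 = pd.DataFrame(
    {"timestamp": ["2020-11-01", "2020-11-02", "2020-11-04", "2020-11-05"],
     "Variable-1": [1, 2, 4, 5]}
).set_index("timestamp")
v2 = pd.DataFrame(
    {"timestamp": ["2020-11-01", "2020-11-02", "2020-11-03", "2020-11-04"],
     "Variable-2": [1, 2, 3, 4]}
).set_index("timestamp")

# alignMode: "Inner" keeps the intersection of timestamps, "Outer" the union.
inner = v1.join(v2, how="inner")
outer = v1.join(v2, how="outer")

# fillNAMethod analogues applied to the outer-joined table:
linear = outer.interpolate(method="linear")  # Linear (the default)
previous = outer.ffill()                     # Previous
subsequent = outer.bfill()                   # Subsequent
zero = outer.fillna(0)                       # Zero
fixed = outer.fillna(42)                     # Fixed, with paddingValue = 42

print(outer)   # matches the Outer join table above, with NaN gaps
print(linear)  # gaps filled by linear interpolation
```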
Input data schema
MVAD detects anomalies from a group of metrics, and we call each metric a variable or a time series.
You can download a sample data file from Microsoft to check the accepted schema: https://aka.ms/AnomalyDetector/MVADSampleData

- Each variable must have two and only two fields, `timestamp` and `value`, and should be stored in a comma-separated values (CSV) file.
- The column names of the CSV file should be precisely `timestamp` and `value`, case-sensitive.
- The `timestamp` values should conform to ISO 8601; the `value` field can contain integers or decimals with any number of decimal places. A good example of the content of a CSV file:

  | timestamp | value |
  |---|---|
  | 2019-04-01T00:00:00Z | 5 |
  | 2019-04-01T00:01:00Z | 3.6 |
  | 2019-04-01T00:02:00Z | 4 |
  | ... | ... |

  Note

  If your timestamps have hours, minutes, and/or seconds, ensure that they're properly rounded before calling the APIs. For example, if your data frequency is supposed to be one data point every 30 seconds, but you're seeing timestamps like "12:00:01" and "12:00:28", it's a strong signal that you should pre-process the timestamps to new values like "12:00:00" and "12:00:30". For details, refer to the "Timestamp round-up" section in this article.

- The name of the CSV file will be used as the variable name and should be unique. For example, "temperature.csv" and "humidity.csv".
- Variables for training and variables for inference should be consistent. For example, if you are using `series_1`, `series_2`, `series_3`, `series_4`, and `series_5` for training, you should provide exactly the same variables for inference.
- CSV files should be compressed into a zip file and uploaded to an Azure blob container. The zip file can have whatever name you want.
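As an illustration of the schema, here's a small sketch that writes one variable ("temperature", a hypothetical name with hypothetical values) to a CSV file with exactly two columns, `timestamp` and `value`, and ISO 8601 timestamps:

```python
import csv
from datetime import datetime, timedelta, timezone

# Hypothetical one-minute series for a variable named "temperature".
start = datetime(2019, 4, 1, tzinfo=timezone.utc)
values = [5, 3.6, 4]

with open("temperature.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "value"])  # column names must match exactly
    for i, v in enumerate(values):
        ts = start + timedelta(minutes=i)
        writer.writerow([ts.strftime("%Y-%m-%dT%H:%M:%SZ"), v])
```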
Folder structure
A common mistake in data preparation is extra folders in the zip file. For example, assume the name of the zip file is `series.zip`. Then after decompressing the files to a new folder `./series`, the correct path to the CSV files is `./series/series_1.csv`, and a wrong path could be `./series/foo/bar/series_1.csv`.
A correct example of the directory tree after decompressing the zip file in Windows:

```
.
└── series
    ├── series_1.csv
    ├── series_2.csv
    ├── series_3.csv
    ├── series_4.csv
    └── series_5.csv
```

An incorrect example of the directory tree after decompressing the zip file in Windows:

```
.
└── series
    └── series
        ├── series_1.csv
        ├── series_2.csv
        ├── series_3.csv
        ├── series_4.csv
        └── series_5.csv
```
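To avoid the extra-folder mistake, zip the CSV files at the root of the archive. A minimal sketch, assuming the CSV files sit in a local `series` folder:

```python
import zipfile
from pathlib import Path

csv_files = sorted(Path("series").glob("*.csv"))

with zipfile.ZipFile("series.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in csv_files:
        # arcname keeps only the file name, so no sub-folders end up in the zip.
        zf.write(path, arcname=path.name)

# Sanity check: every entry should be a bare file name like "series_1.csv".
with zipfile.ZipFile("series.zip") as zf:
    assert all("/" not in name for name in zf.namelist())
```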
Data engineering
Now you're able to run your code with the MVAD APIs without any errors. What can be done to improve your model's accuracy?
Data quality
- As the model learns normal patterns from historical data, the training data should represent the overall normal state of the system. It's hard for the model to learn these types of patterns if the training data is full of anomalies. Empirically, keeping the abnormal rate at or below 1% gives good accuracy.
- In general, the missing value ratio of the training data should be under 20%. Too much missing data may end up with automatically filled values (usually linear values or constant values) being learned as normal patterns. That may result in real (not missing) data points being detected as anomalies.
Data quantity
The underlying model of MVAD has millions of parameters. It needs a minimum number of data points to learn an optimal set of parameters. The empirical rule is that you need to provide 5,000 or more data points (timestamps) per variable to train the model for good accuracy. In general, the more training data, the better the accuracy. However, in cases when you're not able to accrue that much data, we still encourage you to experiment with less data and see whether the compromised accuracy is still acceptable.
Every time you call the inference API, you need to ensure that the source data file contains just enough data points. That is normally `slidingWindow` + the number of data points that actually need inference results. For example, in a streaming case, when you want to run inference on ONE new timestamp each time, the data file could contain only the leading `slidingWindow` plus ONE data point; then you could create another zip file with the same number of data points (`slidingWindow` + 1) but moved ONE step to the "right" side, and submit it for another inference job. A sketch of this slicing follows below.

Anything beyond that or "before" the leading sliding window won't impact the inference result at all and may only cause a performance downgrade. Anything below that may lead to a `NotEnoughInput` error.
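For the streaming case, here's a sketch of how you might slice exactly `slidingWindow` + 1 rows per inference call from a larger local history (the file name and granularity are hypothetical):

```python
import pandas as pd

SLIDING_WINDOW = 300  # must match the value the model was trained with

# Hypothetical full history of one variable at one-minute granularity.
df = pd.read_csv("series_1_full.csv", parse_dates=["timestamp"])

def window_for(target_ts: pd.Timestamp) -> pd.DataFrame:
    """Return the slidingWindow leading points plus the target point itself."""
    upto = df[df["timestamp"] <= target_ts]
    return upto.tail(SLIDING_WINDOW + 1)

# Each inference zip would contain just these 301 rows per variable; the next
# call shifts the window one step to the "right".
window = window_for(pd.Timestamp("2021-01-02T00:00:00Z"))
print(len(window))
```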
Timestamp round-up
In a group of variables (time series), each variable may be collected from an independent source. The timestamps of different variables may be inconsistent with each other and with the known frequencies. Here's a simple example.
Variable-1

| timestamp | value |
|---|---|
| 12:00:01 | 1.0 |
| 12:00:35 | 1.5 |
| 12:01:02 | 0.9 |
| 12:01:31 | 2.2 |
| 12:02:08 | 1.3 |

Variable-2

| timestamp | value |
|---|---|
| 12:00:03 | 2.2 |
| 12:00:37 | 2.6 |
| 12:01:09 | 1.4 |
| 12:01:34 | 1.7 |
| 12:02:04 | 2.0 |
We have two variables collected from two sensors, each of which sends one data point every 30 seconds. However, the sensors aren't sending data points at a strictly even frequency: a point arrives sometimes earlier and sometimes later. Because MVAD takes into consideration correlations between different variables, timestamps must be properly aligned so that the metrics can correctly reflect the condition of the system. In the above example, the timestamps of variable 1 and variable 2 must be properly 'rounded' to their frequency before alignment.
Let's see what happens if they're not pre-processed. If we set `alignMode` to `Outer` (which means the union of the two timestamp sets), the merged table is:

| timestamp | Variable-1 | Variable-2 |
|---|---|---|
| 12:00:01 | 1.0 | `nan` |
| 12:00:03 | `nan` | 2.2 |
| 12:00:35 | 1.5 | `nan` |
| 12:00:37 | `nan` | 2.6 |
| 12:01:02 | 0.9 | `nan` |
| 12:01:09 | `nan` | 1.4 |
| 12:01:31 | 2.2 | `nan` |
| 12:01:34 | `nan` | 1.7 |
| 12:02:04 | `nan` | 2.0 |
| 12:02:08 | 1.3 | `nan` |

`nan` indicates missing values. Obviously, the merged table isn't what you might have expected. Variable 1 and variable 2 interleave, and the MVAD model can't extract information about correlations between them. If we set `alignMode` to `Inner`, the merged table is empty, as there is no common timestamp in variable 1 and variable 2.
Therefore, the timestamps of variable 1 and variable 2 should be pre-processed (rounded to the nearest 30-second timestamps) and the new time series are:
Variable-1

| timestamp | value |
|---|---|
| 12:00:00 | 1.0 |
| 12:00:30 | 1.5 |
| 12:01:00 | 0.9 |
| 12:01:30 | 2.2 |
| 12:02:00 | 1.3 |

Variable-2

| timestamp | value |
|---|---|
| 12:00:00 | 2.2 |
| 12:00:30 | 2.6 |
| 12:01:00 | 1.4 |
| 12:01:30 | 1.7 |
| 12:02:00 | 2.0 |

Now the merged table is more reasonable.

| timestamp | Variable-1 | Variable-2 |
|---|---|---|
| 12:00:00 | 1.0 | 2.2 |
| 12:00:30 | 1.5 | 2.6 |
| 12:01:00 | 0.9 | 1.4 |
| 12:01:30 | 2.2 | 1.7 |
| 12:02:00 | 1.3 | 2.0 |
Values of different variables at close timestamps are well aligned, and the MVAD model can now extract correlation information.
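Here's a pandas sketch of that pre-processing step, rounding a variable's timestamps to the nearest 30 seconds before the data is zipped and uploaded (the file name is hypothetical):

```python
import pandas as pd

# Load one variable's CSV in the schema from "Input data schema".
df = pd.read_csv("variable_1.csv", parse_dates=["timestamp"])

# Round each timestamp to the nearest 30-second boundary, matching the example
# above: 12:00:01 -> 12:00:00, 12:00:35 -> 12:00:30, 12:01:09 -> 12:01:00, ...
df["timestamp"] = df["timestamp"].dt.round("30s")

df.to_csv("variable_1.csv", index=False, date_format="%Y-%m-%dT%H:%M:%SZ")
```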
Limitations
There are some limitations in both the training and inference APIs. You should be aware of these limitations to avoid errors.

General limitations
- Sliding window: 28-2,880 timestamps; the default is 300. For periodic data, set the length of 2-4 cycles as the sliding window.
- Variable numbers: For training and batch inference, at most 301 variables.

Training limitations
- Timestamps: At most 1,000,000. Too few timestamps may decrease model quality. We recommend having more than 5,000 timestamps.
- Granularity: The minimum granularity is `per_second`.

Batch inference limitations
- Timestamps: At most 20,000; at least 1 sliding window length.

Streaming inference limitations
- Timestamps: At most 2,880; at least 1 sliding window length.
- Detecting timestamps: From 1 to 10.
Model quality
How to deal with false positives and false negatives in real scenarios?
We provide severity, which indicates the significance of anomalies. False positives may be filtered out by setting a threshold on the severity. Sometimes too many false positives may appear when there are pattern shifts in the inference data. In such cases, the model may need to be retrained on new data. If the training data contains too many anomalies, there could be false negatives in the detection results. This is because the model learns patterns from the training data, and anomalies may bring bias to the model. Thus, proper data cleaning may help reduce false negatives.
How to estimate which model is best to use according to training loss and validation loss?
Generally speaking, it's hard to decide which model is the best without a labeled dataset. However, we can leverage the training and validation losses to get a rough estimate and discard bad models. First, we need to observe whether the training losses converge. Divergent losses often indicate poor model quality. Second, loss values may help identify whether underfitting or overfitting occurs. Models that are underfitting or overfitting may not have the desired performance. Third, although the definition of the loss function doesn't reflect the detection performance directly, loss values may be an auxiliary tool to estimate model quality. A low loss value is a necessary condition for a good model; thus, we may discard models with high loss values.
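As an illustration of these checks, here's a sketch that applies the heuristics above to per-epoch training and validation losses. How you retrieve the losses depends on your API version, so the inputs are plain Python lists here:

```python
def assess_losses(train_losses, valid_losses, tol=1e-2):
    """Rough heuristics: convergence, over/underfitting gap, absolute level."""
    # Converged if the training loss has stopped changing meaningfully.
    converged = abs(train_losses[-1] - train_losses[-2]) < tol
    # A validation loss far above the training loss suggests overfitting;
    # both losses staying high suggests underfitting.
    gap = valid_losses[-1] - train_losses[-1]
    return {
        "converged": converged,
        "final_train_loss": train_losses[-1],
        "final_valid_loss": valid_losses[-1],
        "train_valid_gap": gap,
    }

# Example: a model whose losses converge to low values with a small gap.
print(assess_losses([0.9, 0.4, 0.21, 0.205], [0.95, 0.5, 0.33, 0.32]))
```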
Common pitfalls
Apart from the error code table, we've learned from customers like you some common pitfalls while using MVAD APIs. This table will help you to avoid these issues.
| Pitfall | Consequence | Explanation and solution |
|---|---|---|
| Timestamps in training data and/or inference data weren't rounded to align with the respective data frequency of each variable. | The timestamps of the inference results aren't as expected: either too few timestamps or too many timestamps. | Refer to Timestamp round-up. |
| Too many anomalous data points in the training data. | Model accuracy is impacted negatively, because the model treats anomalous data points as normal patterns during training. | Empirically, keeping the abnormal rate at or below 1% helps. |
| Too little training data. | Model accuracy is compromised. | Empirically, training an MVAD model requires 15,000 or more data points (timestamps) per variable to keep good accuracy. |
| Taking all data points with `isAnomaly` = `true` as anomalies. | Too many false positives. | You should use both `isAnomaly` and `severity` (or `score`) to sift out anomalies that aren't severe, and (optionally) use grouping to check the duration of the anomalies to suppress random noise. Refer to the FAQ section below for the difference between `severity` and `score`. |
| Sub-folders are zipped into the data file for training or inference. | The CSV data files inside sub-folders are ignored during training and/or inference. | No sub-folders are allowed in the zip file. Refer to Folder structure for details. |
| Too much data in the inference data file: for example, compressing all historical data into the inference data zip file. | You may not see any errors, but you'll experience degraded performance when you try to upload the zip file to Azure Blob as well as when you try to run inference. | Refer to Data quantity for details. |
| Creating Anomaly Detector resources on Azure regions that don't support MVAD yet and calling MVAD APIs. | You'll get a "resource not found" error while calling the MVAD APIs. | During the preview stage, MVAD is available in limited regions only. Bookmark What's new in Anomaly Detector to keep up to date with MVAD region roll-outs. You can also file a GitHub issue or contact us at AnomalyDetector@microsoft.com to request specific regions. |
FAQ
How does the MVAD sliding window work?
Let's use two examples to learn how MVAD's sliding window works. Suppose you have set `slidingWindow` = 1,440, and your input data is at one-minute granularity.

- Streaming scenario: You want to predict whether the ONE data point at "2021-01-02T00:00:00Z" is anomalous. Your `startTime` and `endTime` will be the same value ("2021-01-02T00:00:00Z"). Your inference data source, however, must contain at least 1,440 + 1 timestamps, because MVAD takes the leading data before the target data point ("2021-01-02T00:00:00Z") to decide whether the target is an anomaly. The length of the needed leading data is `slidingWindow`, or 1,440 in this case. 1,440 = 60 * 24, so your input data must start from "2021-01-01T00:00:00Z" at the latest.

- Batch scenario: You have multiple target data points to predict. Your `endTime` will be greater than your `startTime`. Inference in such scenarios is performed in a "moving window" manner. For example, MVAD will use data from `2021-01-01T00:00:00Z` to `2021-01-01T23:59:00Z` (inclusive) to determine whether the data at `2021-01-02T00:00:00Z` is anomalous. Then it moves forward and uses data from `2021-01-01T00:01:00Z` to `2021-01-02T00:00:00Z` (inclusive) to determine whether the data at `2021-01-02T00:01:00Z` is anomalous. It moves on in the same manner (taking 1,440 data points to compare) until the last timestamp specified by `endTime` (or the actual latest timestamp). Therefore, your inference data source must contain data starting from `startTime` - `slidingWindow`, and ideally contains a total of `slidingWindow` + (`endTime` - `startTime`) data points.
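The required span of the inference data source can be computed directly. Here's a sketch for the batch scenario above, assuming one-minute granularity and counting both endpoints of the detection range:

```python
from datetime import datetime, timedelta

SLIDING_WINDOW = 1440
GRANULARITY = timedelta(minutes=1)

start_time = datetime.fromisoformat("2021-01-02T00:00:00+00:00")
end_time = datetime.fromisoformat("2021-01-02T12:00:00+00:00")

# The source file must contain data starting no later than this timestamp:
earliest_needed = start_time - SLIDING_WINDOW * GRANULARITY

# Ideal total number of data points per variable (counting both endpoints):
total_points = SLIDING_WINDOW + int((end_time - start_time) / GRANULARITY) + 1

print(earliest_needed)  # 2021-01-01 00:00:00+00:00
print(total_points)     # 1440 + 720 + 1 = 2161
```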
What's the difference between `severity` and `score`?
Normally, we recommend using `severity` as the filter to sift out 'anomalies' that aren't so important to your business. Depending on your scenario and data pattern, anomalies that are less important often have relatively lower `severity` values or standalone (discontinuous) high `severity` values like random spikes.

In cases where you need more sophisticated rules than thresholds against `severity` or the duration of continuous high `severity` values, you may want to use `score` to build more powerful filters. Understanding how MVAD uses `score` to determine anomalies may help:

We consider whether a data point is anomalous from both a global and a local perspective. If the `score` at a timestamp is higher than a certain threshold, the timestamp is marked as an anomaly. If the `score` is lower than the threshold but is relatively high within a segment, it's also marked as an anomaly.
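As an example of such filtering, here's a sketch over hypothetical inference results. The per-timestamp record shape (`isAnomaly` and `severity` fields) mirrors the response discussed above, but adjust the names and values to your API version and scenario:

```python
SEVERITY_THRESHOLD = 0.5  # tune per scenario and data pattern
MIN_DURATION = 2          # suppress standalone spikes (consecutive points)

# Hypothetical per-timestamp results from batch inference.
results = [
    {"timestamp": "2021-01-02T00:00:00Z", "isAnomaly": True,  "severity": 0.82},
    {"timestamp": "2021-01-02T00:01:00Z", "isAnomaly": True,  "severity": 0.78},
    {"timestamp": "2021-01-02T00:02:00Z", "isAnomaly": False, "severity": 0.0},
    {"timestamp": "2021-01-02T00:03:00Z", "isAnomaly": True,  "severity": 0.10},
]

# Step 1: keep only severe anomalies instead of everything with isAnomaly=True.
severe_idx = [i for i, r in enumerate(results)
              if r["isAnomaly"] and r["severity"] >= SEVERITY_THRESHOLD]

# Step 2: group consecutive indices and drop short-lived groups (random noise).
groups, current = [], []
for i in severe_idx:
    if current and i != current[-1] + 1:
        groups.append(current)
        current = []
    current.append(i)
if current:
    groups.append(current)

alerts = [g for g in groups if len(g) >= MIN_DURATION]
print(alerts)  # [[0, 1]] -- one anomaly lasting two consecutive minutes
```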