Generating artificial anomalies of a specific dataset

Hi to all,
I am working on anomaly detection, but I have only normal data, so I need to generate artificial anomalies based on the characteristics of the normal data. I have searched for a way to do that and found the following:
1. Using the randn function in MATLAB with the mean and std of the normal data, i.e. Anomalies=randn(m,n)*std+mean. I do not know to what extent this formula is valid, or whether it is enough to generate anomalies that could be used to evaluate the performance of my detector.
2. A friend advised me to use the probability density function (PDF) to generate anomalous data, but I do not know how. Can anybody give me some hints to get started with this approach (preferably with an example)?
Appreciate your help!!!
Thanks.

Answers (1)

Ayush Aniket 2025-6-3
Both methods can be used to generate anomalous data, with some care. Let's look at each in turn:
Using randn:
Assuming your normal data are roughly Gaussian, the formula generates data that follow the same distribution as your normal data, because randn produces samples from a standard normal distribution (mean 0, standard deviation 1). Multiplying by σ and then adding μ maps the samples to the normal distribution with mean μ and standard deviation σ. If you generate data with exactly these parameters, you are essentially generating more “normal” data, not anomalies.
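As a quick check of this point, the sketch below draws samples with the formula from the question and compares their sample statistics with the parameters used. The values mu = 10 and sigma = 2 are illustrative assumptions, not values taken from the question:

```matlab
rng(0);                                % fix the seed so the run is repeatable
mu = 10;                               % assumed mean of the normal data
sigma = 2;                             % assumed standard deviation of the normal data
x = randn(10000, 1) * sigma + mu;      % same formula as in the question

sampleMean = mean(x);                  % should land close to mu
sampleStd = std(x);                    % should land close to sigma
% Fraction of samples within 3*sigma of mu (about 0.997 for a Gaussian)
fracWithin3Sigma = mean(abs(x - mu) <= 3*sigma);

fprintf('mean = %.3f, std = %.3f, within 3 sigma = %.4f\n', ...
    sampleMean, sampleStd, fracWithin3Sigma);
```

Almost all of these samples sit inside the usual 3σ band, which is why they would not be useful as anomalies.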
To simulate anomalies, you must instead increase the spread or shift the mean. Scaling σ by a factor greater than 1 widens the distribution so that more points land in the tails, although many samples will still fall in the normal range. Adding an offset of several standard deviations to μ moves most generated points away from the normal mean. See the example code below:
% Assume you have computed the mean and std from your normal data:
mu = 10;
sigma = 2;
% Generate m x n anomalous samples that are, say, 3 times more extreme
m = 100;
n = 1;
% Option 1: Inflate the standard deviation (3x here) to widen the spread.
% Note: roughly half of these samples still fall within 2*sigma of mu.
anomalies_extreme1 = randn(m, n) * (3*sigma) + mu;
% Option 2: Alternatively, shift the mean by an offset (here 6 = 3*sigma)
offset = 6;
anomalies_extreme2 = randn(m, n) * sigma + (mu + offset);
% Visualize the results
figure;
histogram(anomalies_extreme1,30);
hold on;
histogram(anomalies_extreme2,30);
xlabel('Value');
ylabel('Frequency');
legend('Scaled Std','Shifted Mean');
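To see how far each option actually pushes samples, one can measure the fraction that falls more than k standard deviations from mu, and optionally keep only those points. This is a sketch under assumptions: the threshold k = 2, the larger sample size, and the filtering step are added for illustration and are not part of the code above:

```matlab
rng(1);                                   % repeatable run
mu = 10; sigma = 2;                       % same illustrative parameters as above
m = 10000;                                % larger sample for stable fractions
offset = 6;                               % 3*sigma shift, as in Option 2

a1 = randn(m, 1) * (3*sigma) + mu;        % Option 1: inflated spread
a2 = randn(m, 1) * sigma + (mu + offset); % Option 2: shifted mean

k = 2;                                    % assumed cutoff, in units of sigma
frac1 = mean(abs(a1 - mu) > k*sigma);     % share of Option 1 samples beyond k*sigma
frac2 = mean(abs(a2 - mu) > k*sigma);     % share of Option 2 samples beyond k*sigma

% Keep only the Option 1 samples that really lie outside the normal band
a1_tail = a1(abs(a1 - mu) > k*sigma);

fprintf('Option 1: %.2f beyond %g sigma, Option 2: %.2f\n', frac1, k, frac2);
```

Filtering like this trades sample count for a guarantee that every retained point is outside the chosen band.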
Using PDF:
Another approach is to generate anomalies directly from the tail regions of a probability distribution fitted to your normal data. The idea is to:
  1. Fit a distribution to your normal data.
  2. Choose probability levels that lie in the tails (for instance, below 0.05 or above 0.95).
  3. Generate new samples by inverting these tail probability values using the inverse cumulative distribution function (ICDF). See the following documentation to read more about the function: https://www.mathworks.com/help/stats/prob.normaldistribution.icdf.html
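% Assume normalData is your 1-D normal data vector (synthetic data is used here for illustration).
% normrnd, fitdist, and icdf require Statistics and Machine Learning Toolbox.
normalData = normrnd(10,2,1000,1);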
% Fit a normal distribution to the data
pd = fitdist(normalData, 'Normal');
% Decide how many anomaly samples you want to generate
m = 100;
% Generate probability values for the tails
% For instance, half of the anomalies from the lower tail and half from the upper tail:
m_half = round(m/2);
u_lower = rand(m_half, 1) * 0.05; % lower 5% tail
u_upper = 0.95 + rand(m - m_half, 1) * 0.05; % upper 5% tail
% Use the inverse cumulative distribution function (ICDF) to generate anomaly values
anomalies_lower = icdf(pd, u_lower);
anomalies_upper = icdf(pd, u_upper);
% Combine the anomalies
anomalies_pdf = [anomalies_lower; anomalies_upper];
% Visualize the normal data and the anomalies
figure;
histogram(normalData, 30, 'Normalization', 'pdf');
hold on;
y_limits = ylim;
% Plot anomalies as red dots along the x-axis at a low probability level
plot(anomalies_pdf, repmat(y_limits(2)*0.05, numel(anomalies_pdf), 1), 'ro');
xlabel('Value');
ylabel('Probability Density');
legend('Normal Data PDF','Artificial Anomalies');
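If Statistics and Machine Learning Toolbox is not available, the same inverse-CDF idea can be sketched in base MATLAB, since the normal quantile function has the closed form x = mu + sigma*sqrt(2)*erfinv(2*u - 1). The sketch below assumes the data are Gaussian and uses synthetic normalData as a stand-in for real data; the helper name normInv is hypothetical:

```matlab
rng(2);                                      % repeatable run
normalData = randn(1000, 1) * 2 + 10;        % synthetic stand-in for your normal data
mu = mean(normalData);                       % estimated mean
sigma = std(normalData);                     % estimated standard deviation

% Hypothetical helper: inverse CDF (quantile function) of N(mu, sigma^2)
normInv = @(u) mu + sigma * sqrt(2) * erfinv(2*u - 1);

m = 100;
m_half = round(m/2);
u_lower = rand(m_half, 1) * 0.05;            % probabilities in the lower 5% tail
u_upper = 0.95 + rand(m - m_half, 1) * 0.05; % probabilities in the upper 5% tail
anomalies = [normInv(u_lower); normInv(u_upper)];

% Sanity check: every generated point should lie at or beyond the 5%/95% quantiles
q05 = normInv(0.05);
q95 = normInv(0.95);
fracOutside = mean(anomalies <= q05 | anomalies >= q95);
```

The tail cutoffs 0.05 and 0.95 control how extreme the anomalies are; moving them toward 0 and 1 produces rarer, more distant points.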
Both methods can be insightful when testing your anomaly detection algorithm. The choice depends on the assumptions you have about the nature of anomalies and how they relate to the normal data.
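As a rough illustration of how such synthetic anomalies might be used for testing, the sketch below mixes held-out normal data with shifted-mean anomalies and scores a simple threshold rule. Everything here is an assumption for illustration: the data are synthetic, the 4*sigma shift is arbitrary, and the 3-sigma rule is a hypothetical stand-in for your own detector:

```matlab
rng(3);                                            % repeatable run
mu = 10; sigma = 2;                                % assumed parameters of the normal data

normalTest = randn(1000, 1) * sigma + mu;          % stand-in for held-out normal data
anomalies = randn(200, 1) * sigma + (mu + 4*sigma);% shifted-mean synthetic anomalies

x = [normalTest; anomalies];                       % combined test set
isAnomaly = [false(1000, 1); true(200, 1)];        % ground-truth labels

% Hypothetical detector: flag points more than 3*sigma from mu
flagged = abs(x - mu) > 3*sigma;

detectionRate = mean(flagged(isAnomaly));          % share of anomalies that were flagged
falseAlarmRate = mean(flagged(~isAnomaly));        % share of normal points that were flagged

fprintf('Detection rate: %.2f, false alarm rate: %.4f\n', detectionRate, falseAlarmRate);
```

Results from such a test only reflect how well the detector handles the kind of anomaly that was simulated, so it is worth repeating the evaluation with several shifts and spreads.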
