Out-of-sample normalization problem

Hi. I'm working on a binary classification system with 21 financial ratios and variables as inputs; the output is a financial criterion that can be 0 or 1. Before feeding data to my classification model (MLP, SVM or ELM), I normalize it (min-max mapping or whitening). The financial ratios come from companies' statements, so the data cover companies of various sizes.
I'm also using 5-fold cross-validation to design my model. Now that the model is designed, I want to use it on new data, so I must normalize those data too. I found that for min-max mapping I must use the maximum and minimum of the design-phase data set, and for whitening I must use its mean and variance.
Suppose that with (x - min)/(max - min), my new data set contains a sample whose value x for some feature is lower than the previous minimum, so the normalized feature for that sample is negative. Is this a problem? Is the output (1 or 0) still valid for this sample? The same problem can arise with the whitening method.
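The situation described above can be reproduced with a small sketch. The data values and the names Xtrain, xnew and xnewNorm are made up for illustration, and the layout (one row per feature, one column per sample) is an assumption. It only shows that reusing the design-phase minimum and maximum maps a below-range value to a negative number; it does not say how any particular classifier will respond to it.

```matlab
% Hypothetical design-phase data: rows are features, columns are samples.
Xtrain = [2 4 6 8; 10 20 30 40];
xmin = min(Xtrain, [], 2);   % per-feature minimum from the design data
xmax = max(Xtrain, [], 2);   % per-feature maximum from the design data

% Hypothetical new sample whose first feature is below the design-phase minimum.
xnew = [1; 25];

% Reuse the design-phase statistics; they are not recomputed on the new data.
xnewNorm = (xnew - xmin) ./ (xmax - xmin);

disp(xnewNorm)   % first entry is negative, second lies inside [0, 1]
```

A value outside [0, 1] here simply signals that the sample lies outside the range seen during design for that feature.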
Thanks.

Accepted Answer

Greg Heath 2014-4-3
Edited: Greg Heath 2014-4-3
Regardless of what you use in the model, I always standardize pre-modelling using zscore or mapstd to identify outliers for removal or modification.
Warning: Each dimension should be normalized separately.
P.S. If you use neural nets, the default normalization is mapminmax to [-1,1], and the hidden-layer transfer function is tanh, which is an odd function.
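As a rough sketch of standardizing each dimension separately, the snippet below uses only base MATLAB functions rather than zscore or mapstd, so it runs without any toolbox. The data and the names Xtrain, Xnew, mu and sigma are hypothetical, and the layout (one row per variable, one column per sample) is an assumption. The new data reuse the statistics from the design data.

```matlab
% Hypothetical design-phase inputs: one row per variable, one column per sample.
Xtrain = [1 2 3 4 5; 100 200 300 400 500];

mu    = mean(Xtrain, 2);      % per-row mean
sigma = std(Xtrain, 0, 2);    % per-row standard deviation (N-1 normalization)

% Standardize each row separately; bsxfun expands mu and sigma across the columns.
Ztrain = bsxfun(@rdivide, bsxfun(@minus, Xtrain, mu), sigma);

% Hypothetical new data: standardized with mu and sigma from the design phase.
Xnew = [6 0; 250 900];
Znew = bsxfun(@rdivide, bsxfun(@minus, Xnew, mu), sigma);

disp(Ztrain)
disp(Znew)
```

After this step each row of Ztrain has zero mean and unit standard deviation, while the rows of Znew generally do not, since they are measured against the design-phase statistics.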
Hope this helps
Thank you for formally accepting my answer
Greg
6 Comments
Image Analyst 2014-4-7
Jack's second so-called "Answer" moved here:
Thank you again Greg.
I don't use k-means clustering after applying other outlier detection techniques. Outlier detection using k-means clustering is an option in my system alongside your proposed technique, so I can choose either of the two. With regard to the above discussion, what is your opinion of k-means clustering?
You mentioned that I can use '(x-meanx)/std > threshold of your choice', so your proposed technique does not consider all inputs (in my case, 21 variables) simultaneously, and I can only analyze one feature at a time with it. Is this true?
Thanks.
Greg Heath 2014-4-11
No. You consider all at once using matrix coding. I consider a 21 dimensional vector an outlier if one or more components is an outlier.
All MATLAB code is matrix based. So if you find one or more outlying components in a column of an input or target matrix, either modify or delete the column. Any target column corresponding to a deleted input must also be deleted and vice versa.
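The matrix-based check described above might look like the following minimal sketch. The data, the threshold of 2 and the names X, T, Z and isOutlier are assumptions chosen for illustration; a sample (column) is flagged if any of its standardized components exceeds the threshold in magnitude, and the matching target column is deleted along with it.

```matlab
% Hypothetical data: 3 variables (rows) x 10 samples (columns), with a 1 x 10 target.
X = [ 1  2  3  2  1  2  3  2  1 50;
     10 11  9 10 11 10  9 11 10 10;
      5  5  6  4  5  5  6  4  5  5];
T = [0 1 0 1 0 1 0 1 0 1];

% Standardize every row at once.
mu    = mean(X, 2);
sigma = std(X, 0, 2);
Z = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);

threshold = 2;                              % assumed cutoff, "threshold of your choice"
isOutlier = any(abs(Z) > threshold, 1);     % true where any component is outlying

% Delete flagged columns from the inputs and the corresponding targets together.
X(:, isOutlier) = [];
T(:, isOutlier) = [];

disp(find(isOutlier))
disp(size(X))
```

Modifying the flagged columns instead of deleting them (for example, clipping the offending component) would follow the same indexing pattern.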
