cvpartition grouped data for stratified hold-out classification?

Hi,
My ultimate objective is to build a machine learning classifier that takes student academic results and their school as input features and an alphabetic grade as the output.
I have tried using cvpartition to partition a 100 x N array into a stratified 70% training set and a 30% hold-out test set for machine learning classification. The stratification is important because I want both the training and test sets to keep a class distribution similar to that of the whole dataset.
cvpartition works when each row (sample) in my array is independent of all other rows (samples). For instance, suppose the data are 100 students randomly picked from 100 different schools. The N variables are their average academic performance across all the subjects they study and their final alphabetic grade (A+, A, A-, B+, B, B-, C, D or F). In this situation, cvpartition readily partitions my original data for hold-out testing, where none of the test data was used in training.
However, I now have a 500 x (N+1) array whose rows (samples) are grouped, and the new variable is the student's school. In this new dataset, 5 students are picked at random from each of 100 different schools, and I want to train a learner that classifies students based also on their school. If I use cvpartition here, it will bias my results, because data from the same school (but for different students) may be used both for training and for testing.
So, I want to create a stratified 70% training set and a 30% hold-out test set where none of the GROUPED data in testing was used for training, and I want to repeat this training/testing loop, say, 1000 times with random training and hold-out test sets. In each loop, the schools whose students are used for testing must not overlap with the schools whose students are used for training.
Is there a MATLAB command that lets us easily partition grouped data, cvpartition-style, in the manner I require? For instance, does the command dividerand help in this case?
Thank you in advance.

Answers (1)

ahmed nebli 2018-9-2
Edited: ahmed nebli 2018-9-2
I think you should split your data into training and test sets manually, without using cvpartition, because, as you said, cvpartition can place data from the same school in both the training and test sets.
To do so, create a function that splits your data, and save the training and test sets to a .mat file.
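As a rough illustration of the manual split suggested above, here is a minimal sketch. It is not a built-in grouped equivalent of cvpartition: it draws several random school-level 70/30 splits and keeps the one whose test-set grade distribution is closest to the overall distribution, so the stratification is only approximate. The toy data, the variable names (school, grade, isTrain, isTest), the integer coding of grades as 1..9, and the number of candidate splits are all assumptions made for the example.

```matlab
% Hypothetical toy data: 100 schools, 5 students each.
nSchools  = 100;
perSchool = 5;
nClasses  = 9;                                     % A+ ... F, assumed coded as 1..9
school = kron((1:nSchools)', ones(perSchool, 1));  % 500x1 school ID per student
grade  = randi(nClasses, nSchools*perSchool, 1);   % 500x1 random class labels

% Class proportions over the whole dataset.
overall = accumarray(grade, 1, [nClasses 1]) / numel(grade);

% Split at the school level, so no school can land in both sets.
schoolIDs = unique(school);
nTest     = round(0.3 * numel(schoolIDs));   % 30% of schools held out
nTrials   = 200;                             % candidate splits to try (assumed value)
bestGap   = Inf;

for t = 1:nTrials
    shuffled = schoolIDs(randperm(numel(schoolIDs)));
    candTest = ismember(school, shuffled(1:nTest));   % rows of held-out schools
    testDist = accumarray(grade(candTest), 1, [nClasses 1]) / nnz(candTest);
    gap = sum(abs(testDist - overall));      % L1 distance between distributions
    if gap < bestGap                         % keep the best-matching split so far
        bestGap = gap;
        isTest  = candTest;
    end
end
isTrain = ~isTest;   % logical row masks for the training and test sets
```

To repeat the experiment many times, the search loop could be wrapped in an outer loop, training on the rows selected by isTrain and testing on the rows selected by isTest in each repetition. With unequal numbers of students per school, the row-level split would only be approximately 70/30.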
1 Comment
Cuong Quang 2018-9-3
Hi Ahmed,
Yes, as you say, there are more manual methods available.
What I'm wondering is whether there is an equivalent of a single cvpartition command for grouped data.
Thanks,
Cuong
