Why do I get this message: "Error using kmeans: X must have more rows than the number of clusters"?
clear all; close all
load BRI
rng(0,'twister');
% A_train is 2055 x 88: columns 1-66 and 68-89 of D_num (column 67 is left out)
A_train = D_num(:,[1:66,68:89]);
all_factors={'recommended_for_research','umbrella','year','donor','Gov''t_funding_agency'...
'State-owned_funding_company','Other_private_funding_company','implementing_agency_china'...
'Pipeline: Pledge','Pipeline: Commitment','Implementation','Completion','Suspended','Cancelled'...
'Debt forgiveness','Export Credits','Grant','Strategic/Supplier Credit','Debt Rescheduling'...
'Free-standing Technical Assistance','Scholarship/Training in Donor Country','Joint Venture with Recipient'...
'Loan','Foreign Direct Investment','ODA-like_flow class','OOF-like_flow class','Vague_flow class','Other flow '...
'Development Intent','Commercial Intent','Representational Intent','Mixed Intent','amount','Cash/physical_money'...
'USD_currency','CMY_currency','Other currency','usd_defl_2014','usd_current_publish','usd_current_2019','crs_sector_code'...
'sources_count','cofinancing_agency','Gov''t_recepient_agency','State-owned_recepient_company','Other_private_agency'...
'recipient_agencies_count','deflators_used','exchange_rates_used','start_actual','start_planned','end_actual','end_planned'...
'Beginning_date_since_2000','End_date_since_2014','Planned_start > Actual_start','Planned_start > Actual_start','Planned_start = Actual_start'...
'Planned_end > Actual_end','Planned_end < Actual_end','Planned_end = Actual_end','Planned_duration','Actual_Duration','year_uncertain'...
'2019 population','GDP(IMF)_of _reipient ','GDP_per_capita','recipient_count','recipient_cow_code','recipient_oecd_code'...
'recipient_un_code','recipient_imf_code','Africa ','Middle East','Asia','The Pacific','Latin America and the Caribbean','Central and Eastern Europe'...
'line_of_credit','is_cofinanced','is_ground_truthing','loan_type','interest_rate','maturity','grace_period','grant_element','source_triangulation'...
'field_completeness'};
% B_train is 2055 x 1 (1 if debt distressed, 0 if not)
B_train = D_num(:,90);
% Deal with missing GDP (IMF)
no_GDP=(A_train(:,66)==0); % rows missing a GDP value (stored as 0 instead)
avg_age=nanmean(A_train(no_GDP==0,66)); % average GDP over the rows that have a value
A_train(no_GDP==1,66)=avg_age; % fill in the missing GDP values with that average
% Deal with missing GDP per capita
no_GDP2=(A_train(:,67)==0); % rows missing a GDP-per-capita value (stored as 0 instead)
avg_age2=nanmean(A_train(no_GDP2==0,67)); % average GDP per capita over the rows that have a value
A_train(no_GDP2==1,67)=avg_age2; % fill in the missing GDP-per-capita values with that average
The error occurs here:
k=8; % Number of clusters
dist_type='sqeuclidean'; % Distance metric (others include 'cityblock' (L1), 'cosine', and 'correlation')
[clust,centr]=kmeans(A_train,k,'dist',dist_type); % returns cluster assignments & centroid of each cluster
I have not been able to get any further. The rest of the script is:
figure(1)
colstyle = {'cs','rd','b^','go','k+','d',':bs','-mo'}; %define 8 color/style combos for this plot
attribs=[1 2 3]; %categories for x, y, and z axes
for j=1:k
q=find(clust==j); %ID numbers of the items in this cluster
nsample(j)=length(q); %Sample size in the cluster
survival(j)=mean(B_train(q)); %Share of debt-distressed rows within this cluster
plot3(A_train(q,attribs(1)),A_train(q,attribs(2)),A_train(q,attribs(3)),colstyle{j}) % 3-D plot with marker types by cluster
hold on
end
hold off
legend('Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5','Cluster 6','Cluster 7','Cluster 8');
xlabel(all_factors(attribs(1)));
ylabel(all_factors(attribs(2)));
zlabel(all_factors(attribs(3)));
figure(2);
silhouette(A_train,clust,dist_type)
Try various numbers of clusters
nn=100; dist_type='sqeuclidean';
for j=2:nn
[clust,centr,sumd]=kmeans(A_train,j,'dist',dist_type);
Dtot(j,1)=sum(sumd);
end
figure(3)
plot(2:nn,Dtot(2:nn),'b-');
Accepted Answer
Adam Danz
2019-4-10
Edited: Adam Danz
2019-4-10
Possibility 1
Your variable 'A_train' does not have enough rows. You are requesting 8 clusters (k=8) and, as the error indicates, 'A_train' needs to have at least k+1 rows.
[clust,centr]=kmeans(A_train,k,'dist',dist_type);
To confirm this is the problem, call this just prior to the kmeans() function.
size(A_train)
Possibility 2
Your variable 'A_train' has too many rows that contain at least one NaN value. kmeans() ignores any row that contains at least one NaN value. To check whether enough rows are free of NaN values, run this line:
sum(all(~isnan(A_train), 2))
ans =
5 % only 5 rows are free of NaN values, which is fewer than k (8)
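As a rough sketch of that check, the snippet below counts the complete rows and also shows which columns hold the NaN values. The matrix `X` and the value `k = 4` are made up for illustration and stand in for `A_train` and the real cluster count.

```matlab
% Hypothetical stand-in for A_train: 6 rows, 3 columns, some NaNs
X = [1   2   3;
     4   NaN 6;
     7   8   9;
     NaN NaN 12;
     13  14  15;
     16  NaN 18];
k = 4;                                % requested number of clusters (illustrative)

completeRows = all(~isnan(X), 2);     % true where a row contains no NaN at all
nComplete    = sum(completeRows);     % number of rows kmeans can actually use
nanPerColumn = sum(isnan(X), 1);      % NaN count for each column

fprintf('%d of %d rows are complete; k = %d\n', nComplete, size(X, 1), k);
disp(nanPerColumn)
if nComplete <= k
    warning('Too few complete rows for %d clusters.', k);
end
```

Running the same three lines on `A_train` would show whether the NaN values are spread across many columns or concentrated in a few.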
4 Comments
Adam Danz
2019-4-11
To answer that, I'd step back from how to implement the analysis and look at the more fundamental problem of classifying data with missing values. There is no easy solution to this problem.
Sometimes missing data only accounts for a small portion of the dataset and those samples can just be ignored. That doesn't seem to be the case with your data.
If you're classifying a matrix with many variables (columns) and there's just one variable that contains most of the missing data, you could run the analysis without that variable as long as it's not an influential variable.
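A minimal sketch of that idea, assuming a made-up matrix `X` in place of `A_train`: find the column with the most NaN values, drop it, and compare how many complete rows remain. Whether the dropped variable is influential is a separate judgment that this code does not make.

```matlab
% Hypothetical stand-in for A_train; column 2 holds most of the NaNs
X = [1   NaN 3;
     4   NaN 6;
     7   8   9;
     NaN NaN 12;
     13  NaN 15;
     16  NaN 18];

nanPerColumn  = sum(isnan(X), 1);         % NaN count per column
[~, worstCol] = max(nanPerColumn);        % index of the column with the most NaNs

Xreduced = X;
Xreduced(:, worstCol) = [];               % remove that column

nBefore = sum(all(~isnan(X), 2));         % complete rows with every column kept
nAfter  = sum(all(~isnan(Xreduced), 2));  % complete rows once worstCol is dropped
fprintf('Dropping column %d: %d -> %d complete rows\n', worstCol, nBefore, nAfter);
```

If enough complete rows remain, `Xreduced` could be passed to kmeans() in place of the full matrix.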
You could determine the number of rows that contain a complete set of data and reduce the number of clusters accordingly, but that's usually a poor solution since the number of clusters should be chosen with intention.
Some sources suggest that you could fill in missing values with means or random values, but such arbitrary decisions are bad practice and can throw off the results so badly that they no longer represent the underlying unknown reality.
A simple search on Google Scholar lists these two papers with >100 citations. They discuss the problem of missing data in classification models and potential solutions.