How can I reassign clusters based on similarity or any other method?

Question

Med Future 2024-8-1

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/2141951-how-can-i-reassign-clusters-based-on-similarity-or-any-other-method

Commented: Umar 2024-8-10

DataCluster.mat

Description: I have data in cell format obtained from K-means clustering. The main issue is that similar clusters are being split into two separate clusters. I need to reassign the clusters based on similarity or any other method. Specifically, if two clusters have the same features, they should be combined into one cluster. Conversely, if a cluster has two different features, it should be split so that each subcluster has similar features. Each cell has subcells whose first four columns represent the important features. For example, clusters 3 and 6 have almost the same features, while cluster 2 has two different features. How can I achieve this reassignment?

23 comments

Umar 2024-8-1

Edited: Walter Roberson 2024-8-6

Hi @Med Future ,

To reassign clusters based on similarity in K-means clustering results, we can compare the features of each cluster and merge or split them accordingly. First, load the DataCluster.mat file containing the K-means clustering results. Extract the features of each cluster from the cell array. Compare the features of all clusters to identify similarities and differences. Then, merge clusters with similar features and split clusters with different features. Finally, update the cluster assignments based on the merging and splitting. Here is a partial code snippet:

% Load the data from DataCluster.mat
load('DataCluster.mat');
% Assuming 'clusters' is the cell array containing the clustering results
num_clusters = numel(clusters);
% Initialize a matrix to store the features of each cluster
cluster_features = cell(num_clusters, 1);
% Extract features of each cluster
for i = 1:num_clusters
    cluster_data = clusters{i};
    features = cluster_data(:, 1:4); % Assuming the first four columns are features
    cluster_features{i} = features;
end
% Compare features and reassign clusters
for i = 1:num_clusters
    for j = i+1:num_clusters
        % Compare features of clusters i and j
        if isequal(cluster_features{i}, cluster_features{j})
            % Merge clusters i and j
            % Update cluster assignments accordingly
        else
            % Check for differences and split clusters if needed
            % Update cluster assignments accordingly
        end
    end
end
% Display or save the updated cluster assignments
disp('Clusters reassigned based on similarity.');
% Additional code to save or display the updated cluster assignments

So, the above partial code snippet provides a framework for reassigning clusters based on feature similarity in K-means clustering results. You will need to implement the merging and splitting logic based on your specific data and requirements. Also, make sure to adapt the code to match the structure of your DataCluster.mat file and the features of your clusters. I hope this helps you get started with your project. Please let me know if you have any further questions.
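One way the merging step left open in the skeleton could be filled in is sketched below. It is only an illustration under stated assumptions: the cell array is a small synthetic stand-in for DataCluster.mat, clusters are compared by the Euclidean distance between their mean feature vectors (columns 1:4), and mergeThreshold is a made-up cutoff that would need tuning for real data.

```matlab
% Sketch: merge clusters whose mean feature vectors are close.
% Synthetic stand-in for the cell array in DataCluster.mat (assumed layout:
% one matrix per cluster, features in columns 1:4).
clusters = { [1 1 1 1; 1.1 1 0.9 1], ...
             [5 5 5 5; 5.2 5 5.1 4.9], ...
             [1 1.1 1 1; 0.9 1 1 1.1] };
mergeThreshold = 0.5;   % assumed cutoff in feature units; tune for your data

% One centroid (row) per cluster, computed from the first four columns
centroids = cell2mat(cellfun(@(c) mean(c(:, 1:4), 1), clusters(:), ...
    'UniformOutput', false));

% Greedy merge: fold cluster j into cluster i when their centroids are close.
% While loops are used because the list shrinks as clusters are merged.
i = 1;
while i <= numel(clusters)
    j = i + 1;
    while j <= numel(clusters)
        if norm(centroids(i, :) - centroids(j, :)) < mergeThreshold
            clusters{i} = [clusters{i}; clusters{j}];      % append rows of j to i
            clusters(j) = [];                              % remove cluster j
            centroids(j, :) = [];
            centroids(i, :) = mean(clusters{i}(:, 1:4), 1); % refresh centroid i
        else
            j = j + 1;
        end
    end
    i = i + 1;
end
fprintf('%d clusters remain after merging.\n', numel(clusters));
```

Comparing centroids rather than using isequal on the raw feature matrices means that clusters with different numbers of rows, or with nearly but not exactly equal values, can still be recognized as similar.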

Med Future 2024-8-2


% Normalize the rows of the cells for cosine similarity
cell1_norm = cell1 ./ sqrt(sum(cell1.^2, 2));
cell2_norm = cell2 ./ sqrt(sum(cell2.^2, 2));
% Compute the cosine similarity matrix
similarity_matrix = cell1_norm * cell2_norm';
% Average similarity score
similarity_score = mean(similarity_matrix(:));
% Display the similarity score
fprintf('Average Cosine Similarity Score: %f\n', similarity_score);
% Define the number of clusters
k = 1; % Number of clusters for the combined data
if similarity_score > 0.9
    % Combine the data from both cells
    combinedData = [cell1; cell2];
    
    % Apply K-means clustering
    [idx, C] = kmeans(combinedData, k);
    
    % Display results
    % figure;
    % gscatter(combinedData(:,1), combinedData(:,2), idx);
    % title('K-means Clustering for Combined Data');
    
    % Save the clustering results
    save('merged_clustered_data.mat', 'idx', 'C', 'combinedData');
else
    fprintf('Similarity score is less than 0.9, not merging the cells.\n');
end

But here we have to give the cluster number manually. This is the main issue.
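If the sticking point is having to supply the cluster count by hand, one option is to let a cluster-evaluation criterion choose it. The sketch below is not from the thread: it uses synthetic two-blob data and the Statistics and Machine Learning Toolbox function evalclusters with the silhouette criterion. The candidate range 2:5 and all variable names are assumptions.

```matlab
% Sketch: choose k from the data instead of hard-coding it.
rng(1);                                   % make the example reproducible
X = [randn(50, 2); randn(50, 2) + 8];     % synthetic data: two separated blobs

% Evaluate k-means solutions for k = 2..5 using the silhouette criterion
eva = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:5);
k = eva.OptimalK;                         % k with the best silhouette value

% Cluster with the selected k; replicates reduce sensitivity to initialization
idx = kmeans(X, k, 'Replicates', 5);
fprintf('Selected k = %d\n', k);
```

The selected k is only as good as the criterion, so it is worth checking the result against scatter plots before relying on it.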

Umar 2024-8-2

Hi @Med Future ,

I have modified the code you shared on the forum so that it can reassign clusters based on similarity.

% Define cell1 and cell2
cell1 = [1, 2, 3; 4, 5, 6]; % Example data for cell1
cell2 = [7, 8, 9; 10, 11, 12]; % Example data for cell2
% Normalize the rows of the cells for cosine similarity
cell1_norm = cell1 ./ sqrt(sum(cell1.^2, 2));
cell2_norm = cell2 ./ sqrt(sum(cell2.^2, 2));
% Compute the cosine similarity matrix
similarity_matrix = cell1_norm * cell2_norm';
% Average similarity score
similarity_score = mean(similarity_matrix(:));
% Display the similarity score
fprintf('Average Cosine Similarity Score: %f\n', similarity_score);
% Define the threshold for similarity to reassign clusters
similarity_threshold = 0.9;
if similarity_score > similarity_threshold
    % Combine the data from both cells
    combinedData = [cell1; cell2];
    % Apply K-means clustering
    k = 2; % Define the number of clusters 'k'
    [idx, C] = kmeans(combinedData, k);
    % Calculate centroid distances for cluster reassignment
    centroid_distances = pdist(C); % Calculate pairwise distances between centroids
    avg_distance = mean(centroid_distances); % Calculate the average centroid distance
    % Reassign clusters if centroid distances exceed a certain threshold
    centroid_threshold = 5; % Define a threshold for centroid distances
    if avg_distance > centroid_threshold
        % Calculate the pairwise distances between data points and centroids
        distances = pdist2(combinedData, C);
        % Find the minimum distance for each data point
        [~, min_indices] = min(distances, [], 2);
        % Update the cluster assignments in 'idx' based on the minimum distances
        idx = min_indices;
    end
    % Iterate over the clusters and check for different features
    unique_clusters = unique(idx); % Get the unique cluster labels
    num_clusters = numel(unique_clusters); % Get the number of clusters
    for i = 1:num_clusters
        cluster_data = combinedData(idx == unique_clusters(i), :); % Get the data points for the current cluster
        % Check for different features within the cluster. The range is taken
        % down the rows (dimension 1), and a cluster needs at least 2 points
        % before kmeans can split it in two.
        if size(cluster_data, 1) >= 2 && any(range(cluster_data, 1) > 1)
            % Split the cluster into subclusters with similar features
            subclusters = kmeans(cluster_data, 2);
            % Update the cluster assignments in 'idx' for the subclusters
            idx(idx == unique_clusters(i)) = subclusters + max(idx);
        end
    end
    % Merge clusters with similar features
    unique_clusters = unique(idx); % Get the updated unique cluster labels
    num_clusters = numel(unique_clusters); % Get the updated number of clusters
    for i = 1:num_clusters
        cluster_data = combinedData(idx == unique_clusters(i), :); % Get the data points for the current cluster
        % Check for similar features with other clusters
        for j = i+1:num_clusters
            other_cluster_data = combinedData(idx == unique_clusters(j), :); % Get the data points for the other cluster
            % Check for similar features using a threshold
            if max(pdist2(cluster_data, other_cluster_data)) < 1
                % Merge the clusters into a single cluster
                idx(idx == unique_clusters(j)) = unique_clusters(i);
            end
        end
    end
    % Display the updated clustering results
    figure;
    gscatter(combinedData(:,1), combinedData(:,2), idx);
    title('Modified Clustering Results');
    % Save the modified clustering results
    save('modified_clustered_data.mat', 'idx', 'combinedData');
else
    fprintf('Similarity score is less than %f, not reassigning clusters.\n', similarity_threshold);
end

I will go through the code step by step so you can see how it achieves this. First, the code defines two matrices, cell1 and cell2, which contain example data for clustering. They stand in for the clusters that need to be reassigned based on similarity.

cell1 = [1, 2, 3; 4, 5, 6]; % Example data for cell1
cell2 = [7, 8, 9; 10, 11, 12]; % Example data for cell2

Next, the code normalizes each row of the cells to unit length, so that the dot product of two normalized rows equals their cosine similarity.

cell1_norm = cell1 ./ sqrt(sum(cell1.^2, 2));
cell2_norm = cell2 ./ sqrt(sum(cell2.^2, 2));

After normalizing the cells, the code computes the cosine similarity matrix between cell1_norm and cell2_norm. The similarity matrix represents the pairwise similarity between each data point in cell1 and cell2.

similarity_matrix = cell1_norm * cell2_norm';

To determine the average similarity score between the clusters, the code calculates the mean of all elements in the similarity matrix.

similarity_score = mean(similarity_matrix(:));
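A caveat worth keeping in mind when reading this score: cosine similarity compares direction only, not magnitude. The small example below (the vectors are made up for illustration) shows two points that are far apart in Euclidean terms yet have a cosine similarity of essentially 1, which is why a high average score does not by itself mean two clusters occupy the same region of feature space.

```matlab
% Cosine similarity ignores magnitude: b points the same way as a
a = [1 2 3];
b = [10 20 30];                               % same direction, ten times larger
cosSim = dot(a, b) / (norm(a) * norm(b));     % cosine of the angle between a and b
euclid = norm(a - b);                         % straight-line distance between a and b
fprintf('cosine = %.3f, Euclidean distance = %.3f\n', cosSim, euclid);
```

If absolute feature values matter for deciding whether clusters are "the same", a distance between cluster centroids may be a more suitable comparison than cosine similarity.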

The code then displays the average cosine similarity score.

fprintf('Average Cosine Similarity Score: %f\n', similarity_score);

Next, the code defines a similarity threshold. If the similarity score is greater than the threshold, the clusters will be reassigned based on similarity.

similarity_threshold = 0.9;

The code checks if the similarity score exceeds the threshold. If it does, the clusters will be reassigned.

if similarity_score > similarity_threshold
    % Combine the data from both cells
    combinedData = [cell1; cell2];
    % Apply K-means clustering
    k = 2; % Define the number of clusters 'k'
    [idx, C] = kmeans(combinedData, k);

The code then calculates the centroid distances between the clusters. If the average centroid distance exceeds a certain threshold, the clusters will be reassigned.

    centroid_distances = pdist(C); % Calculate pairwise distances between centroids
    avg_distance = mean(centroid_distances); % Calculate the average centroid distance
    % Reassign clusters if centroid distances exceed a certain threshold
    centroid_threshold = 5; % Define a threshold for centroid distances
    if avg_distance > centroid_threshold
        % Calculate the pairwise distances between data points and centroids
        distances = pdist2(combinedData, C);
        % Find the minimum distance for each data point
        [~, min_indices] = min(distances, [], 2);
        % Update the cluster assignments in 'idx' based on the minimum distances
        idx = min_indices;
    end

The code then iterates over the clusters and checks for different features within each cluster. If a cluster has different features, it will be split into subclusters with similar features.

    unique_clusters = unique(idx); % Get the unique cluster labels
    num_clusters = numel(unique_clusters); % Get the number of clusters
    for i = 1:num_clusters
        cluster_data = combinedData(idx == unique_clusters(i), :); % Get the data points for the current cluster
        % Check for different features within the cluster. The range is taken
        % down the rows (dimension 1), and a cluster needs at least 2 points
        % before kmeans can split it in two.
        if size(cluster_data, 1) >= 2 && any(range(cluster_data, 1) > 1)
            % Split the cluster into subclusters with similar features
            subclusters = kmeans(cluster_data, 2);
            % Update the cluster assignments in 'idx' for the subclusters
            idx(idx == unique_clusters(i)) = subclusters + max(idx);
        end
    end

After splitting clusters with different features, the code merges clusters with similar features. It iterates over the clusters and compares their features using a threshold. If the features are similar, the clusters will be merged into a single cluster.

    unique_clusters = unique(idx); % Get the updated unique cluster labels
    num_clusters = numel(unique_clusters); % Get the updated number of clusters
    for i = 1:num_clusters
        cluster_data = combinedData(idx == unique_clusters(i), :); % Get the data points for the current cluster
        % Check for similar features with other clusters
        for j = i+1:num_clusters
            other_cluster_data = combinedData(idx == unique_clusters(j), :); % Get the data points for the other cluster
            % Check for similar features using a threshold
            if max(pdist2(cluster_data, other_cluster_data)) < 1
                % Merge the clusters into a single cluster
                idx(idx == unique_clusters(j)) = unique_clusters(i);
            end
        end
    end

Finally, the code displays the updated clustering results by plotting the data points with their assigned clusters.

    % Display the updated clustering results
    figure;
    gscatter(combinedData(:,1), combinedData(:,2), idx);
    title('Modified Clustering Results');
    % Save the modified clustering results
    save('modified_clustered_data.mat', 'idx', 'combinedData');
else
    fprintf('Similarity score is less than %f, not reassigning clusters.\n', similarity_threshold);
end

In a nutshell, this modified code is capable of reassigning clusters based on similarity. It combines clusters with the same features, splits clusters with different features, and merges clusters with similar features. The code uses the K-means clustering algorithm and cosine similarity to achieve this. Please see the attached plot along with the test results.

I hope this answers your question.

Umar 2024-8-2

Edited: Walter Roberson 2024-8-6

@Med Future,

To address your comment, “Means this solution does not work as well”:

In my opinion, DBSCAN is a valuable clustering algorithm, but custom reassignment logic based on feature comparison may be more suitable for your specific scenario. Here's a simplified example to illustrate the concept:

% Example code for reassigning clusters based on feature similarity.
% calculateSimilarity, mergeClusters, splitClusters, and needsSplit are
% placeholder helpers that you must write; threshold is a cutoff you choose.
% Pass 1: merge similar clusters. While loops are used because the list
% shrinks as clusters are merged.
i = 1;
while i <= numel(clusters)
    j = i + 1;
    while j <= numel(clusters)
        % Compare features of clusters i and j
        similarity = calculateSimilarity(clusters{i}, clusters{j});
        if similarity > threshold
            % Merge cluster j into cluster i and drop j from the list
            clusters{i} = mergeClusters(clusters{i}, clusters{j});
            clusters(j) = [];
        else
            j = j + 1;
        end
    end
    i = i + 1;
end
% Pass 2: split clusters that contain dissimilar features
for i = 1:numel(clusters)
    if needsSplit(clusters{i})
        [subCluster1, subCluster2] = splitClusters(clusters{i});
        % Update clusters list
        clusters{i} = subCluster1;
        clusters{end+1} = subCluster2;
    end
end

Besides that, there are a number of other clustering algorithms to explore.

Walter Roberson 2024-8-8

@Med Future

I merely formatted the code on behalf of @Umar

Umar 2024-8-8

Hi @Walter Roberson,

I highly appreciate your help and support.


Answer 1

Image Analyst 2024-8-2

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/2141951-how-can-i-reassign-clusters-based-on-similarity-or-any-other-method#answer_1494111

Sorry, I didn't delve into this lengthy discussion in detail but for what it's worth, I'm attaching a demo that lets you relabel clusters found with kmeans. As you know kmeans assigns cluster labels randomly and the label for a given cluster may change from run to run, and it might be more convenient if they were labeled according to some predefined criteria. In the demo, it does not change the number of clusters but merely gives the clusters numbers/labels that correspond to some other attribute, like distance of centroid to the origin. kmeans assumes that you know in advance the number of clusters, though there are functions that can help you determine the best choice for that number.
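The attached demo is not reproduced in this thread. As a rough sketch of the relabeling idea described (same clusters, new numbers ordered by the distance of each centroid from the origin), assuming synthetic 2-D data and hypothetical variable names:

```matlab
% Sketch: renumber kmeans clusters by centroid distance from the origin.
rng(3);                                          % reproducible example
X = [randn(30, 2) + 10; randn(30, 2); randn(30, 2) + 5];
[idx, C] = kmeans(X, 3, 'Replicates', 5);        % labels 1..3 in arbitrary order

% Rank centroids by their Euclidean norm (distance from the origin)
[~, order] = sort(vecnorm(C, 2, 2));             % order(r) = old label with rank r

% Build a lookup table from old label to new label
newLabel = zeros(size(order));
newLabel(order) = 1:numel(order);

idxSorted = newLabel(idx);                       % relabeled assignments
CSorted = C(order, :);                           % centroids in the new order
```

After this, label 1 always refers to the cluster nearest the origin regardless of how kmeans happened to number the clusters on a given run. The cluster memberships themselves are unchanged.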

Another demo I'm attaching is for dbscan. https://en.wikipedia.org/wiki/DBSCAN It does not need you to tell it how many clusters there are, since it finds clusters based on how many points can be connected without exceeding a specified distance.
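The dbscan demo is likewise not included here. A minimal sketch of the call, assuming synthetic data and guessed values for the neighborhood radius (epsilon) and minimum point count (minpts), could look like this:

```matlab
% Sketch: density-based clustering without specifying the number of clusters.
rng(2);                                           % reproducible example
X = [0.3*randn(40, 2); 0.3*randn(40, 2) + 5];     % synthetic data: two dense blobs

epsilon = 1;   % assumed neighborhood radius; depends on the scale of your data
minpts = 5;    % assumed minimum number of neighbors for a core point

labels = dbscan(X, epsilon, minpts);              % -1 marks noise points
numFound = numel(unique(labels(labels > 0)));     % count clusters, ignoring noise
fprintf('dbscan found %d cluster(s).\n', numFound);
```

The number of clusters falls out of epsilon and minpts, so those two parameters take over the tuning burden that k carries in kmeans.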

19 comments

Image Analyst 2024-8-2

Edited: Image Analyst 2024-8-2

If you think that some clusters should be combined even though it said the optimal number of clusters was 7 (but you think clusters 3 and 6 should be one cluster so that there should only be 6 clusters) you can reassign their cluster number in the ClassID column. For example (untested)

% Get distance of every cluster center to every other cluster center.
% x and y are the 7 cluster center coordinates
xy = [x(:), y(:)]; % 7 by 2 matrix
distances = pdist2(xy, xy);
% Find out which distances are less than, say 5 (You have to decide how
% close is too close for you). triu(..., 1) keeps each pair once (i < j)
% and drops the zero self-distances on the diagonal.
closeClusters = triu(distances < 5, 1);
[cluster1Index, cluster2Index] = find(closeClusters);
% Loop over each close pair, setting the other cluster to this cluster's
% number (label ID)
for k = 1 : numel(cluster1Index)
    thisClusterNumber = cluster1Index(k);
    % Set all data points with cluster 2 number equal to the cluster 1
    % number because they are so close together that they should have the same
    % cluster number.
    ClusterID(ClusterID == cluster2Index(k)) = thisClusterNumber;
end

Now if cluster numbers 3 and 6 were too close (closer than 5), those data points that were labeled 6 will now be labeled 3 in your list of cluster IDs.
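When more than two centers are mutually close (say 1 is close to 2, and 2 is close to 3), pairwise relabeling depends on the order in which the pairs are processed. A possible alternative, sketched here with hypothetical centers, labels, and threshold, treats close centers as edges of a graph and gives every connected group a single label.

```matlab
% Sketch: merge chains of close clusters using connected components.
centers = [0 0; 1 0; 2 0; 20 20];        % hypothetical cluster centers
ClusterID = [1 1 2 3 3 4 4 2]';          % hypothetical per-point labels
tooClose = 1.5;                          % assumed "too close" distance

% Symmetric matrix of center-to-center distances
D = squareform(pdist(centers));

% Adjacency: centers closer than the threshold, excluding self-pairs
adjacency = (D < tooClose) & ~eye(size(D, 1));

% Each connected component becomes one merged cluster
G = graph(adjacency);
group = conncomp(G);                     % group(c) = merged label of old cluster c

% Map every data point's old label to its merged label (keep column shape)
newID = reshape(group(ClusterID), size(ClusterID));
```

Here clusters 1, 2, and 3 end up with one shared label even though centers 1 and 3 are farther apart than the threshold, because they are linked through center 2. Whether that chaining behavior is desirable depends on the data.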

By the way, have you tried the Classification Learner app on the Apps tab of the tool ribbon?

Image Analyst 2024-8-3


Go to the Apps tab of the tool ribbon. In the list of applets there you should see the Regression Learner, if you have the Statistics and Machine Learning Toolbox installed. Run the Regression Learner applet. Tell it what columns of your data are your input (predictors), and which column of your data matrix or table is the output (response variable), or your response could be its own column vector.

Then go to the next tab and pick several models to try out. Might as well select all of them, however I don't select "Stepwise regression" since that seems to take forever. Then tell it to run and it will try to model your response variables from your predictor variables for all the models you have chosen. It will also give a goodness of fit metric like MSE or RMS. Tell it to sort the results by that metric and pick the one with the lowest error. I find the neural network model is never the best. In my experience it seems like one of the Gaussian Process Regression models will be the best. You can click on a model and tell it to give you a scatter plot of predicted vs "true" if you want to visualize the results.

Once you've decided on a model, click the Export button and export your model to a workspace variable called "trainedModel". Then in the command window save that model to a file

save("trainedModel.mat", "trainedModel");

Then, when you want to use that model on some new, non-training data, you can load it from the file:

s = load("trainedModel.mat")

If you leave off the semicolon, it should give you instructions on how to pass your data to the model and get the predicted results back. I think it will tell you to call the predict function, something like

predictedResults = predict(trainedModel, yourData)

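or something like that. For a model exported from the Learner apps and loaded as above, the call is usually exposed as s.trainedModel.predictFcn(yourData).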

That's if you want to do regression to get a number. If you want to do classification (to get a class like "normal" or "abnormal" or "defective") then do basically the same thing but use the Classification Learner applet instead of the Regression Learner applet.

==============================================================

If you know (for some weird reason) that you want to use kmeans, then kmeans returns two variables. One is a vector that says what each multi-dimensional data point got classified into (a vector of class numbers), and the other output is the (x,y,z,.....) coordinates of the centroids of the clusters that it decided to use. For example, if your cluster is like a fuzzy ball (like points in a spherical galaxy), then the centroid would be the middle coordinate of that fuzzy ball.

Like if you had input1 and input2 and input3, each with 1000 measurements.

data = [ input1(:), input2(:), input3(:)];
[assignedClassNumbers, classCentroids] = kmeans(data, 5); % Find 5 classes (clusters)

assignedClassNumbers would be a vector of 1000 class numbers from 1 to 5 that it thinks that row of your data best belongs to. classCentroids is a list of 5 coordinates (with x,y,z values corresponding to positions on the input1, input2, and input3 axes) where the coordinate is the center/centroid of the 5 different clusters.

Image Analyst 2024-8-5

@Med Future You will have a cluster center given for each cluster you request. So if you tell it you have 7 clusters (regardless of how many data points you have) then k=7 and it will return to you the coordinates of 7 cluster centers. If you tell it 5 clusters, you will get 5 center locations, if you tell it 3 clusters you will get 3 centers, etc.

Yes, k is a "hard value" in that you must specify it, though to figure out what is the best k, there are things you can do that help with that, and it sounds like you've looked at the examples and know how to figure out the best k.

Personally for my work in color image analysis I tend to like discriminant analysis rather than k means. It's one where you basically train it by giving it some examples. The problem with kmeans, and why I don't really like it, is that you're forcing it to give exactly that number of clusters. I like my algorithms to be robust, even in the extremes. What happens if your data doesn't have points representing one or more of the clusters? Like what if one set of data had only data points in the region of cluster 1, and no data points where clusters 2 and 3 live? Well, it would force it to find 3 clusters, but they are all in the region of cluster 1. That's bad! Likewise, what if your new data set had 5 natural clusters but you told it 3? Well, some of those data points would have to be grouped into one of the other 3 clusters and actually some clusters might be split/divided along weird lines, not where you think they'd be. It's not like all of cluster 4 and 5 would just be grouped into cluster 3. So for that reason kmeans is not robust to the entire range of data. You won't have that problem with discriminant analysis. As long as your training data had all 3 clusters, if you had only 1 cluster, it will tell you that all your data is in cluster 1 and it won't try to invent two more clusters.
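To make the contrast concrete, here is a small sketch of the discriminant analysis workflow using fitcdiscr and predict from the Statistics and Machine Learning Toolbox. The training data, class layout, and variable names are invented for illustration; the new data deliberately contains points from only one of the three trained classes.

```matlab
% Sketch: train a discriminant classifier once, then classify new data
% that only contains points from one of the three known classes.
rng(4);                                           % reproducible example
trainX = [randn(40, 2); ...                       % class 1 around (0, 0)
          randn(40, 2) + 6; ...                   % class 2 around (6, 6)
          randn(40, 2) + [6 -6]];                 % class 3 around (6, -6)
trainY = [ones(40, 1); 2*ones(40, 1); 3*ones(40, 1)];

model = fitcdiscr(trainX, trainY);                % linear discriminant by default

newX = 0.5 * randn(25, 2);                        % new data near class 1 only
predicted = predict(model, newX);

% Every new point is assigned to class 1; no extra clusters are invented
disp(unique(predicted)');
```

Running kmeans with k = 3 on the same newX would still return three clusters, all carved out of the single blob, which is the robustness problem described above.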

Image Analyst 2024-8-6

With discriminant analysis you don't have to collect training data for each set of new data. You collect it once, however it does use that data as part of its new prediction so it's probably a little slower than some super simple model like just plugging data into a polynomial with known coefficients. That said, it can process millions of samples (e.g. pixels) in near real time.

I'm attaching my discriminant analysis demo that classifies any picture into some known number of classes that you define in advance when you outline image regions during the part where you define ground truth. For example if you have a yellow flower and blue sky, you'd draw regions around the sky, yellow, and green plant matter and define those as your ground truth. Then it passes that training data into the function along with your image to be classified. There are several different types of discriminant analysis you can pick from and the demo shows you the results of all the different types you can select from so you can use whichever one you think works best. That demo is Classify_RGB_Image.m.

I'm also attaching two other classification demos, one that uses decision trees and one that uses K Nearest Neighbors.

Image Analyst 2024-8-6

Edited: Image Analyst 2024-8-7

DataCluster.mat

I'm just trying to visualize the clusters. Here is what I have so far:

% Initialization steps.
clc;    % Clear the command window.
close all;  % Close all figures (except those of imtool.)
clear;  % Erase all existing variables. Or clearvars if you want.
workspace;  % Make sure the workspace panel is showing.
format long g;
format compact;
fontSize = 18;
% Read in the clusters the kmeans decided upon.
s = load('DataCluster.mat');
clusters = s.clusters
numClusters = numel(clusters)
for k = 1 : numClusters
    thisClustersData = clusters{k};
    % Append data
    if k == 1
        allData = thisClustersData;
    else
        allData = [allData; thisClustersData];
    end
end
% Scatter every predictor against every other predictor.
numPredictors = size(allData, 2) - 1;
for k1 = 1 : numPredictors
    for k2 = 1 : numPredictors
        plotNumber = (k1-1) * numPredictors + k2;
        fprintf('Plotting plot #%d.\n', plotNumber)
        subplot(numPredictors, numPredictors, plotNumber);
        gscatter(allData(:, k1), allData(:, k2), allData(:, end));
        xAxisLabel = sprintf('Predictor #%d', k1);
        yAxisLabel = sprintf('Predictor #%d', k2);
        xlabel(xAxisLabel);
        ylabel(yAxisLabel);
        grid on;
        drawnow;
    end
end
%helpdlg('Done!')

Please tell me what each column means in your data. It looks like column 6 is the cluster number (either ground truth or predicted), but what do the other columns mean? Because some of the scatterplots look odd. And the data looks highly quantized so I think you can just classify probably based on thresholds (essentially a decision tree but you specify the thresholds that separate each). Or you might just try K Nearest Neighbors.

Umar 2024-8-8

Hi @Med Future ,

Addressing your query regarding, “if data is highly quantized then how can i processed with it.?”

As suggested, using thresholds can be effective if clear boundaries exist in your feature space. For instance, you could implement a decision tree whose splits are set at specific feature values that separate the clusters. If you still see odd scatterplots or unexpected cluster shapes, consider experimenting with different values of K (the number of clusters) or trying other clustering algorithms, such as hierarchical clustering, that might handle your data's characteristics better.

Here are some additional points to consider. Use scatter plots not just for visualization but also to identify patterns or anomalies in how clusters form based on the features. You have already used silhouette analysis to determine the optimal number of clusters (K); consider plotting silhouette scores for different K values to assess visually how well defined each cluster is. Depending on what you observe, you may want to engineer new features or transform existing ones (e.g., normalization or scaling) to improve clustering performance.

Also, try modifying the existing code snippet so that it iterates through each data point and reassigns it to the cluster with the closest centroid. Use only the relevant features (the first four columns) and discard irrelevant columns such as the fifth column with random numbers. That is, calculate the distance from each data point to each cluster centroid over the first four features, then assign the point to the nearest one. Check that the quantization levels do not introduce significant distortions in the clustering; you may need to adjust the clustering algorithm parameters or apply dimensionality reduction techniques to mitigate the effects of quantization.
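The nearest-centroid reassignment described above could be sketched as follows. The data matrix is a tiny invented example that follows the column layout discussed in this thread (columns 1:4 features, column 5 irrelevant, column 6 cluster label); the last row is deliberately given the wrong label so the effect is visible.

```matlab
% Sketch: reassign each point to the nearest centroid using features 1:4 only.
allData = [0.0 0.0 0.0 0.0 0.3 1;
           0.2 0.0 0.1 0.0 0.9 1;
           5.0 5.0 5.0 5.0 0.1 2;
           5.1 5.0 4.9 5.0 0.5 2;
           0.1 0.1 0.0 0.0 0.7 2];     % last row: looks like cluster 1, labeled 2

features = allData(:, 1:4);            % keep the informative columns only
labels = allData(:, 6);                % current cluster labels
ids = unique(labels);                  % distinct cluster labels

% Centroid of each cluster in the 4-D feature space
centroids = zeros(numel(ids), 4);
for c = 1:numel(ids)
    centroids(c, :) = mean(features(labels == ids(c), :), 1);
end

% Distance from every point to every centroid, then pick the nearest
[~, nearest] = min(pdist2(features, centroids), [], 2);
newLabels = ids(nearest);
disp([labels, newLabels]);             % old label next to new label
```

This is a single pass. Recomputing the centroids from newLabels and repeating until nothing changes is essentially what kmeans itself does, so the pass is mainly useful when the starting labels come from somewhere else.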

By following these suggestions and clarifying your dataset structure, you should be better positioned to refine your clustering approach and draw meaningful conclusions from your analysis. If you still need further assistance with specific implementations or concepts, please feel free to ask us!

Umar 2024-8-8

Hi @Med Future,

At this point, I will suggest two options to resolve your problem.

Option A: Since you have explored discriminant analysis and other classification methods but are still struggling to process and classify the data effectively, I suggest starting fresh, taking small steps first and bigger steps later as you work on the quantized data.

OR

You can try Option B, a structured approach, if you are willing to stick to it and ready to take the risk.

Option B:

Step 1: Understand your data structure. From your description, the first four columns hold the significant features for clustering, column six is the cluster index assigned by K-means, and column five appears to be irrelevant for clustering. It is crucial that only meaningful features are used in any clustering or classification algorithm.

Step 2: Revisit clustering techniques. Given that you have encountered limitations with K-means and DBSCAN, consider these alternatives: hierarchical clustering, which gives a more nuanced view of how clusters form and lets you visualize potential merges or splits, or Gaussian Mixture Models (GMM), which can handle overlapping clusters better than K-means by assuming that the data points are generated from a mixture of several Gaussian distributions.

Step 3: Dynamic reassignment of clusters. To manage real-time data, use online learning: Mini-Batch K-means or incremental versions of GMM allow clusters to be updated as new data arrives without retraining from scratch. You can also use threshold-based reassignment, defining thresholds for each feature from your domain knowledge so that clusters are reassigned as new data points arrive.

Step 4: Use decision trees or ensemble methods. If your clusters remain inconsistent, train a decision tree classifier on your initial training data (the first four features) and use it to predict cluster membership for new incoming data. An ensemble method such as Random Forests can improve accuracy by averaging multiple decision trees, which helps reduce overfitting.

Step 5: Feature engineering, which is where you are struggling. Given the quantized nature of your data, try Min-Max scaling or Z-score normalization to mitigate the impact of quantization. For dimensionality reduction, techniques like PCA (Principal Component Analysis) can help reduce noise and highlight essential features.

Step 6: Visualization of results, the most important step of all. While real-time visualization might not be feasible, periodically generating visualizations (e.g., scatter plots) after processing batches of data can help you understand cluster formation and adjustments over time.

Some additional points to consider: continuously monitor performance metrics such as silhouette scores or the Davies-Bouldin index after each update to assess the quality of your clusters. Also, if you suspect that certain centroids represent the same cluster, implement a merging strategy based on distance thresholds between centroids: if two centroids are closer than a specified threshold, merge their corresponding clusters.
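As one concrete illustration of the hierarchical clustering and Z-score normalization suggestions, the sketch below standardizes four synthetic features and cuts the cluster tree at a distance threshold, so the number of clusters comes from the threshold and not from a preset K. The data, the 'average' linkage choice, and the cutoff of 1.0 are all assumptions made for the example.

```matlab
% Sketch: hierarchical clustering on standardized features with a distance cutoff.
rng(6);                                            % reproducible example
X = [randn(30, 4); randn(30, 4) + 20; randn(30, 4) - 20];   % three synthetic groups

Xz = zscore(X);                                    % Z-score normalization per column
Z = linkage(Xz, 'average');                        % agglomerative tree, average linkage

cutoff = 1.0;                                      % assumed merge-distance threshold
labels = cluster(Z, 'Cutoff', cutoff, 'Criterion', 'distance');

fprintf('Found %d clusters.\n', numel(unique(labels)));
% dendrogram(Z) can be used to inspect where merges happen before fixing the cutoff
```

Raising the cutoff merges more clusters and lowering it splits them, which maps fairly directly onto the merge/split behavior asked about in the original question.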

Hopefully, if you follow the steps above, you will be able to improve your clustering process and adapt more effectively to real-time changes in your dataset. If issues persist or if specific implementations require clarification, feel free to reach out for further assistance!

Med Future 2024-8-10

@Umar Thanks for your detailed answer. But could you please make demo code for that? I have tried that method too, and it does not work. If you think another clustering method should be used to get the optimal K, or any other clustering applied, you can use that as well, but the problem is still not solved.

Umar 2024-8-10

Hi @Med Future,

Can you share the code in which you applied the method based on my instructions? I would like to take a closer look, if you don't mind, because what I described in my comments above is an advanced and tedious process and is not easy to implement without help. Afterwards, I will try to make demo code as you suggested.
