SUDE: Sampling-enabled scalable manifold learning unveils the discriminative cluster structure of high-dimensional data
We propose SUDE, a scalable manifold learning method that copes with large-scale, high-dimensional data efficiently. It first seeks a set of landmarks to construct a low-dimensional skeleton of the entire data set, and then incorporates the non-landmark points into this skeleton via constrained locally linear embedding. This toolkit includes the main code of SUDE, along with two applications for preprocessing scRNA-seq and ECG data. The method has been published in Nature Machine Intelligence; more details can be found at https://www.nature.com/articles/s42256-025-01112-9.
HOW TO RUN
The sude.m function exposes the following hyperparameters for user configuration:
function [Y, id_samp, para] = sude(X, varargin)
% This function returns representation of the N by D matrix X in the lower-dimensional space and
% the ID of landmarks sampled by PPS. Each row in X represents an observation.
%
% Parameters are:
%
% 'NumDimensions'- A positive integer specifying the number of dimensions of the representation Y.
% Default: 2
% 'NumNeighbors' - A non-negative integer specifying the number of nearest neighbors for PPS to
% sample landmarks. It must be smaller than N.
% Default: adaptive
% 'Normalize' - Logical scalar. If true, normalize X using min-max normalization. If features in
% X are on different scales, 'Normalize' should be set to true because the learning
% process is based on nearest neighbors and features with large scales can override
% the contribution of features with small scales.
% Default: True
% 'LargeData' - Logical scalar. If true, the data can be split into multiple blocks to avoid the problem
% of memory overflow, and the gradient can be computed block by block using 'learning_l' function.
% Default: False
% 'InitMethod' - A string specifying the method for initializing Y before manifold learning.
% 'le' - Laplacian eigenmaps.
% 'pca' - Principal component analysis.
% 'mds' - Multidimensional scaling.
% Default: 'le'
% 'AggCoef' - A positive scalar specifying the aggregation coefficient.
% Default: 1.2
% 'MaxEpoch' - Maximum number of epochs to take.
% Default: 50
The main.m file provides a complete example:
% Input data
clear;
data = csvread('mfeat.csv');
% Obtain data size and true annotations
[~, m] = size(data);
ref = data(:, m);
X = data(:, 1:m-1);
clear data
% Perform SUDE embedding
t1 = clock;
[Y, idx, para] = sude(X,'NumNeighbors',10);
t2 = clock;
disp(['Elapsed time:', num2str(etime(t2,t1)),'s']);
[knnACC, svmACC, clusACC] = ml_eval(X, Y, ref);
disp(['knnACC:', num2str(knnACC),' svmACC:', num2str(svmACC),' clusACC:', num2str(clusACC)]);
plotcluster2(Y, ref);
Citation
Peng, Dehua, et al. “Sampling-Enabled Scalable Manifold Learning Unveils the Discriminative Cluster Structure of High-Dimensional Data.” Nature Machine Intelligence, vol. 7, no. 10, Sept. 2025, pp. 1669–84, https://doi.org/10.1038/s42256-025-01112-9.
| Version | Released | Release notes | Action |
|---|---|---|---|
| 1.0.0 | | | |
