Clustering Algorithms on the Iris Dataset in MATLAB

Version 1.0.0 (136.4 KB) by Omprakash
Comparing K-Means, Hierarchical, and DBSCAN clustering on the Iris dataset, evaluating performance with metrics and visualizing results.
43.0 downloads
Updated 13 Jan 2025
Overview:
This project focuses on applying and comparing different clustering algorithms on the famous Iris dataset. The dataset contains 150 samples from three different species of Iris flowers, with four features: sepal length, sepal width, petal length, and petal width.
The main goal of this project is to apply three clustering algorithms to the Iris dataset, evaluate their performance using various metrics, and visualize the results.
The three clustering algorithms used in this project are:
1. K-Means Clustering
2. Hierarchical Clustering
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Each of these algorithms works differently, and the project evaluates how well each algorithm performs in grouping the data.
Project Objective:
The main objectives of the project are:
1. Clustering: Apply three clustering algorithms (K-Means, Hierarchical, DBSCAN) to the Iris dataset to group the data points based on similarities.
2. Evaluation: Evaluate the quality of the clusters produced by the algorithms using various metrics:
Silhouette Score: Measures how well-separated and well-defined the clusters are.
Davies-Bouldin Index: Measures the average similarity ratio between each cluster and the most similar cluster.
Adjusted Rand Index (ARI): Measures the similarity between the true species labels and the predicted cluster labels.
3. Visualization: Visualize the results of clustering to make it easier to compare how each algorithm has grouped the data points.
Steps and Process:
1. Loading the Iris Dataset
The project begins by loading the Iris dataset (fisheriris in MATLAB). This dataset contains 150 data points with the following features:
Sepal length
Sepal width
Petal length
Petal width
The target variable (species of the Iris flower) includes three categories:
Setosa
Versicolor
Virginica
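In MATLAB this step is a single `load` call; a minimal sketch (the variable names `X` and `labels` are my own, not necessarily those used in the project):

```matlab
% Load the built-in Fisher Iris dataset (Statistics and Machine Learning Toolbox)
load fisheriris              % provides meas (150x4 double) and species (150x1 cellstr)
X = meas;                    % columns: sepal length, sepal width, petal length, petal width
labels = grp2idx(species);   % numeric labels: 1 = setosa, 2 = versicolor, 3 = virginica
```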
2. Data Standardization
To ensure that all features contribute equally to the clustering algorithms, the data is standardized using the z-score. This ensures that the mean of each feature becomes 0 and the standard deviation becomes 1, which is important because clustering algorithms like K-Means are sensitive to the scale of the features.
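Standardization amounts to one call to `zscore`; a sketch, assuming the raw feature matrix is in `meas`:

```matlab
Xz = zscore(meas);           % per column: subtract the mean, divide by the standard deviation
% Sanity check: column means are ~0 and column standard deviations are 1
disp(mean(Xz))
disp(std(Xz))
```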
3. Clustering Algorithms
K-Means Clustering:
Objective: Partition the dataset into k = 3 clusters, matching the three Iris species.
Process: The K-Means algorithm assigns data points to the nearest cluster centroid and updates the centroids iteratively to minimize the sum of squared distances between points and their corresponding centroids.
Visualization: The algorithm's results are visualized with a scatter plot showing how the data points are clustered, and the cluster centroids are marked.
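A minimal K-Means sketch along these lines, assuming standardized data in `Xz` (plotting the two petal features is purely illustrative):

```matlab
rng(1);                                        % fix the seed for reproducible centroids
[idxKM, C] = kmeans(Xz, 3, 'Replicates', 10);  % k = 3, keep the best of 10 random starts
gscatter(Xz(:,3), Xz(:,4), idxKM); hold on
plot(C(:,3), C(:,4), 'kx', 'MarkerSize', 12, 'LineWidth', 2)  % mark the centroids
xlabel('Petal length (z-score)'); ylabel('Petal width (z-score)')
title('K-Means Clustering'); hold off
```

Using `'Replicates'` guards against poor local minima from a single random initialization.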
Hierarchical Clustering:
Objective: Build a hierarchy of clusters using a bottom-up approach.
Process: The Ward linkage method is used, which minimizes the variance within clusters at each step.
Visualization: A dendrogram plot is generated, which shows how clusters are merged at each step, and the hierarchy is represented by a tree-like structure.
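The hierarchical step can be sketched as follows, again assuming standardized data in `Xz`:

```matlab
Z = linkage(Xz, 'ward');            % Ward linkage: merge the pair that minimizes within-cluster variance
dendrogram(Z)                       % tree of merges (shows up to 30 leaf nodes by default)
idxHC = cluster(Z, 'MaxClust', 3);  % cut the tree into 3 flat clusters
```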
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
Objective: Find regions of high density and group the points within these regions.
Process: DBSCAN defines clusters as groups of points that are closely packed together, while points that lie in low-density regions are marked as noise. Its two parameters are epsilon (the maximum distance between two points for them to be considered neighbors) and minPts = 5 (the minimum number of points required to form a dense region).
Visualization: The resulting clusters are visualized on a scatter plot, and noise points (outliers) are identified.
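A DBSCAN sketch, assuming standardized data in `Xz`; the epsilon value below is illustrative, not necessarily the one used in the project:

```matlab
epsilon = 0.6;                        % neighborhood radius -- tune, e.g. via a k-distance plot
minPts  = 5;                          % minimum neighbors to form a dense region
idxDB = dbscan(Xz, epsilon, minPts);  % label -1 marks noise points
gscatter(Xz(:,3), Xz(:,4), idxDB)
xlabel('Petal length (z-score)'); ylabel('Petal width (z-score)')
title('DBSCAN Clustering (label -1 = noise)')
```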
4. Cluster Evaluation
To assess the quality of the clustering results, the following metrics are calculated:
Silhouette Score: Measures the similarity of a point to its own cluster compared to other clusters. A higher score means the clusters are well-separated and defined.
Davies-Bouldin Index: This index measures how well the clusters are separated by comparing the average similarity between each cluster and the cluster most similar to it. A lower value indicates better clustering quality.
Adjusted Rand Index (ARI): Measures the similarity between the predicted cluster labels and the true species labels. The ARI adjusts for chance, and a higher value indicates a better match between the clustering results and the true labels.
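The first two metrics are available from the toolbox directly (the project also ships its own Davies-Bouldin function, described below); a sketch assuming `Xz` and a label vector `idxKM` from the earlier steps:

```matlab
silKM = mean(silhouette(Xz, idxKM));               % mean silhouette score: higher is better
ev    = evalclusters(Xz, idxKM, 'DaviesBouldin');  % built-in Davies-Bouldin evaluator
dbKM  = ev.CriterionValues;                        % lower is better
fprintf('Silhouette = %.3f, Davies-Bouldin = %.3f\n', silKM, dbKM)
```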
5. Comparison of Clustering Results
The results of the three clustering algorithms are compared visually:
K-Means Clustering Plot: Shows the clustered data points with centroids marked.
Hierarchical Clustering Dendrogram: Displays the hierarchical merging process.
DBSCAN Clustering Plot: Displays the clustering results and noise points.
Clustering Comparison Plot: A subplot compares the clustering results of K-Means, Hierarchical, and DBSCAN, making it easier to visually inspect how each algorithm groups the data.
6. Saving Results
The results of K-Means and DBSCAN are saved as CSV files for further analysis:
KMeans_Clustering_Results.csv
DBSCAN_Clustering_Results.csv
The generated figures (plots) are saved as PNG files for later use or inclusion in reports:
KMeans_Clustering_Figure.png
Hierarchical_Clustering_Figure.png
DBSCAN_Clustering_Figure.png
Clustering_Comparison_Figure.png
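Saving can be sketched with `writematrix` (R2019a+) and `exportgraphics` (R2020a+), assuming `meas` and the label vectors `idxKM`/`idxDB` from the earlier steps:

```matlab
writematrix([meas, idxKM], 'KMeans_Clustering_Results.csv')  % features + cluster label per row
writematrix([meas, idxDB], 'DBSCAN_Clustering_Results.csv')
exportgraphics(gcf, 'Clustering_Comparison_Figure.png')      % save the current figure as PNG
```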
Outputs and Results:
1. Visual Plots:
K-Means Clustering Plot: Shows how K-Means has grouped the data with centroids marked.
Hierarchical Clustering Dendrogram: Displays the hierarchical clustering process.
DBSCAN Clustering Plot: Visualizes how DBSCAN has grouped the data points and identified noise points.
Clustering Comparison Plot: Compares the clustering results of K-Means, Hierarchical, and DBSCAN in one figure.
2. Metrics:
Silhouette Score for K-Means and DBSCAN.
Davies-Bouldin Index for K-Means and DBSCAN.
Adjusted Rand Index for K-Means and DBSCAN.
3. Saved Results:
CSV Files: Clustering results for K-Means and DBSCAN.
PNG Files: Figures showing the clustering results for all three algorithms.
Custom Functions:
1. Rand Index:
This function calculates the Adjusted Rand Index (ARI), which compares the true labels (species) with the predicted cluster labels. The ARI is 1 for a perfect match, is 0 in expectation for random labeling, and can be negative when agreement is worse than chance.
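A self-contained pair-counting implementation of the ARI might look like this (function names are my own, not necessarily those in the project):

```matlab
function ari = adjustedRandIndex(trueLab, predLab)
% Adjusted Rand Index from the contingency table of the two labelings.
    ct     = crosstab(trueLab, predLab);             % contingency table
    sumIJ  = sum(arrayfun(@pairs, ct(:)));           % agreeing pairs within each cell
    sumI   = sum(arrayfun(@pairs, sum(ct, 2)));      % pairs within each true class
    sumJ   = sum(arrayfun(@pairs, sum(ct, 1)));      % pairs within each predicted cluster
    expIdx = sumI * sumJ / pairs(numel(trueLab));    % expected index under chance
    maxIdx = (sumI + sumJ) / 2;
    ari    = (sumIJ - expIdx) / (maxIdx - expIdx);
end

function c = pairs(x)
    c = x .* (x - 1) / 2;                            % "x choose 2"
end
```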
2. Davies-Bouldin Index:
This function computes the Davies-Bouldin Index, which measures the average similarity between each cluster and its most similar cluster. A lower value indicates better clustering quality.
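The computation can be sketched directly from the definition, comparing within-cluster scatter to centroid separation (names below are hypothetical):

```matlab
function dbi = daviesBouldinIndex(X, idx)
% Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j) ratio.
    ks = unique(idx(idx > 0));                       % skip DBSCAN noise labels (-1), if any
    k  = numel(ks);
    c  = zeros(k, size(X, 2));                       % cluster centroids
    s  = zeros(k, 1);                                % within-cluster scatter
    for i = 1:k
        pts = X(idx == ks(i), :);
        c(i, :) = mean(pts, 1);
        s(i) = mean(vecnorm(pts - c(i, :), 2, 2));   % mean distance to own centroid
    end
    worst = zeros(k, 1);
    for i = 1:k
        for j = [1:i-1, i+1:k]
            r = (s(i) + s(j)) / norm(c(i, :) - c(j, :));
            worst(i) = max(worst(i), r);
        end
    end
    dbi = mean(worst);
end
```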
Requirements:
MATLAB: R2019a or later with the Statistics and Machine Learning Toolbox, which provides kmeans, dbscan, linkage, dendrogram, etc.
fisheriris dataset: The standard Iris dataset available in MATLAB or a similar custom dataset.
Conclusion:
This project demonstrates the application of three different clustering techniques on the Iris dataset: K-Means, Hierarchical, and DBSCAN. By evaluating these algorithms using various metrics such as Silhouette Score, Davies-Bouldin Index, and Adjusted Rand Index, we can determine which algorithm provides the most accurate clustering. The visualizations and saved results offer a clear comparison of the clustering algorithms, allowing insights into their relative performance. This project is useful for anyone learning about clustering algorithms and their evaluation in machine learning and data analysis.

How to Cite

Omprakash (2025). Clustering Algorithms on the Iris Dataset in MATLAB (https://ww2.mathworks.cn/matlabcentral/fileexchange/179049-clustering-algorithms-on-the-iris-dataset-in-matlab), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2024b
Compatible with any release
Platform Compatibility
Windows macOS Linux
