DRSA: a non-hierarchical clustering algorithm using k-NN graph and its application in vegetation classification
I. V. Goncharenko
DOI: https://doi.org/10.31111/vegrus/2015.27.125
Annotation
In this article we proposed a new method of non-hierarchical cluster analysis using k-nearest-neighbor graph and discussed it with respect to vegetation classification.
The method of k-nearest neighbor (k-NN) classification was originally developed in 1951 (Fix, Hodges, 1951). Later a term “k-NN graph” and a few algorithms of k-NN clustering appeared (Cover, Hart, 1967; Brito et al., 1997). In biology k-NN is used in analysis of protein structures and genome sequences. Most of k-NN clustering algorithms build «excessive» graph firstly, so called hypergraph, and then truncate it to subgraphs, just partitioning and coarsening hypergraph. We developed other strategy, the “upward” clustering in forming (assembling consequentially) one cluster after the other. Until today graph-based cluster analysis has not been considered concerning classification of vegetation datasets.
We called our clustering strategy “sorting by ranking” or «Distance-Ranked Sorting Assembling», DRSA in abbreviated form (Goncharenko, 2015). DRSA is extremely robust due to ranks in finding k-nearest objects (phytocoenoses). Unlike density-based clustering, DRSA is effective when density of clusters (phytocoenons) differs much. DRSA clustering algorithm consists of k-NN asymmetric graph construction and then assembling objects into clusters. Process of assembling of each cluster consists of the following steps: initializing, expanding and stopping (cutting off). We invented heuristic measure (Q-index) based on connectivity of k-NN components for cluster’s stopping rule and thus in outliers detection. We proposed two indexes of «voting of objects» — freeness and connectedness for selection objects in cluster’s expanding stage. Technique of determining optimal k (k-nearest neighbors) parameter was elaborated by comparing symmetric (mutual) and asymmetric k-NN graph. We developed two agglomeration modes of DRSA which is one of the greedy algorithms.
As for vegetation classification we tested DRSA on four sample datasets from the Czech Republic (Chytrý, Vicherek, 1995; Chytrý, Vicherek, 1996; Chytrý, Horák, 1997) and Ukraine (Goncharenko, 2003). To evaluate quality of phytocoenons we used internal clustering validation measures (based on a distance matrix) and floristic (based on number of faithful species) criteria. We also measured nominal correlation between automatic (using DRSA method) and expert (according to Braun-Blanquet approach) classifications.
After testing DRSA method on dataset of 780 relevés from Ukraine we received 25 clusters (phytocoenons). We calculated within-cluster and between-clusters average similarities, then built pair-wise matrix for clusters and discovered diagonalization (bigger similarities concentrated along matrix diagonal). Average within-cluster similarity between phytocoenoses was also high, 46.7 % by Otiai index, as well as silhouette statistics. Therefore, we concluded DRSA clusters are valid by inner criteria of cluster validation.
Interpretability of clusters was assessed using Optimclass approach (Tichý et al., 2010). The basic idea was that if the amount of faithful species is high, the clusters are “good” in the sense of floristic diagnosability. When the threshold value of affinity index (using geometric mean of species-to-cluster constancy and specifity) was 50 %, there were from 5 to 12.8 faithful species per cluster (phytocoenon). Therefore, the DRSA gives interpretable clusters from the floristic point of view. Due to outlier removal the amount of faithful species was even more than the same indicator in case of expert (original) classifications.
DRSA method is perspective for vegetation classification thanks to several features. There is no need to specify number of clusters or depth of division before starting cluster analysis. You have the ability to vary the scale of clustering using only a few clear tuning parameters of DRSA — similarity coefficient between phytocoenoses, k-nearest neighbors taken into account and the mode of DRSA which differs by cluster’s stopping rule. For the rest DRSA is full-automatic that allows avoiding of manual sorting of relevés. DRSA is non-parametric clustering, thus it is robust and remains effective even in the case ofhigh heterogeneity of the data and varying alpha- and beta-diversity with big scope. Results of DRSA clustering are low sensitive to what similarity coefficient or distance metric was applied. DRSA is noise-detective clustering, as well removal of ecotonic phytocoenoses allows obtaining better results by all measures (distance-based and floristic-based) of cluster validation.
Key words: DRSA, sorting assembling, cluster analysis, vegetation classification
Section: Research methods
How to cite
Goncharenko I. V. 2015. DRSA: a non-hierarchical clustering algorithm using k-NN graph and its application in vegetation classification // Vegetation of Russia. N 27. P. 125–138. https://doi.org/10.31111/vegrus/2015.27.125
Received April 6 2015
References
Brito M., Chavez E., Quiroz A., Yukich J. 1997. Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection // Statistics & Probability Letters. Vol. 35. N 1. P. 33–42.
Chytrý M., Horák J. 1997. Plant communities of the thermophilous oak forests in Moravia // Preslia. Vol. 68 P. 193–240.
Chytrý M., Vicherek J. 1995. Lesní vegetace Národního parku Podyjí / Thayatal. Die Waldvegetation des Nationalparks Podyjí / Thayatal. Praha. 1995. 166 p.
Chytrý M., Vicherek J. 1996. Přirozená a polopřirozená vegetace údolí řek Oslavy, Jihlavy a Rokytné // Přírod. Sborn. Západomorav. Muz. Třebíč. Vol. 22. P. 1–125.
Cover T. M., Hart P. E. 1967. Nearest neighbor pattern classification // Information Theory. Vol. 13. P. 21–27. https://doi.org/10.1109/TIT.1967.1053964
Fix E., Hodges Jr. J. L. 1951. Discriminatory analysis-nonparametric discrimination: consistency properties. DTIC Document. Available online. URL: http://www.dtic.mil/dtic/tr/fulltext/u2/a800276.pdf (Accessed October, 7, 2015).
Goncharenko I. V. 2003. Analiz roslynnogo pokryvu pivnichno-shidnogo Lisostepu Ukrai’ny. Monografija [Analysis of vegetation of the northeast Forest-Steppe of Ukraine] // Ukrainian Phytosociological Collection. Ser. А. Vol. 19. N 1. 203 p. (In Ukrainian).
Goncharenko I. V. 2015. Metod «sortujuchoi’» klasteryzacii’ (DRSA) dlja klasyfikacii’ roslynnosti [A method of “sorting” clustering (DRSA) for the classification of plant communities] // Reports of the National Academy of Sciences of Ukraine. N 9. P. 129–136. (In Ukrainian).
Tichý L., Chytrý M., Hájek M., Talbot S. S., Botta-Dukát Z. 2010. OptimClass: Using species-to-cluster fidelity to determine the optimal partition in classification of ecological communities // J. Veg. Sci. Vol. 21. P. 287–299. https://doi.org/10.1111/j.1654-1103.2009.01143.x