Gene expression profiling has become a useful biological resource in recent years and plays an important role across a broad range of biology. However, the large number of genes and the complexity of biological networks greatly increase the difficulty of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. In the computational analysis of gene expression data, a central task is finding co-expressed genes, and the outcome depends on the proximity (similarity or dissimilarity) measure used by the clustering method. Several proximity measures have been applied to gene expression data, but most of these studies emphasize the biological results and offer no critical assessment of how suitable the proximity measures are for analyzing gene expression data. For this reason, this paper investigates which proximity measure is appropriate for gene expression data. As a case study, we considered six real datasets. On these datasets, we provide a comparative study of five proximity measures: Euclidean distance, Manhattan distance, Pearson correlation, Spearman correlation, and Cosine distance. We use the Adjusted Rand Index and the Silhouette Index to assess the quality and reliability of the clustering results. Our results reveal that the Cosine distance with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Adjusted Rand Index, while the Spearman correlation measure with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Silhouette Index.
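The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes SciPy and scikit-learn, uses synthetic toy profiles instead of the six real datasets, turns the two correlations into dissimilarities as 1 - r, and computes the Silhouette Index from the same dissimilarity matrix that was used for clustering. The names `proximity`, `evaluate`, and `make_toy_data` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata
from sklearn.metrics import adjusted_rand_score, silhouette_score

MEASURES = ("euclidean", "manhattan", "pearson", "spearman", "cosine")


def proximity(X, measure):
    """Condensed pairwise dissimilarities between the rows of X."""
    if measure == "euclidean":
        d = pdist(X, metric="euclidean")
    elif measure == "manhattan":
        d = pdist(X, metric="cityblock")
    elif measure == "pearson":
        d = pdist(X, metric="correlation")  # 1 - Pearson r
    elif measure == "spearman":
        # Spearman is Pearson correlation computed on within-row ranks.
        ranks = np.apply_along_axis(rankdata, 1, X)
        d = pdist(ranks, metric="correlation")  # 1 - Spearman rho
    elif measure == "cosine":
        d = pdist(X, metric="cosine")  # 1 - cosine similarity
    else:
        raise ValueError(f"unknown measure: {measure}")
    # Guard against tiny negative values caused by floating-point rounding.
    return np.clip(d, 0.0, None)


def evaluate(X, labels_true, measure, n_clusters):
    """Complete-linkage clustering scored by ARI and Silhouette Index."""
    d = proximity(X, measure)
    tree = linkage(d, method="complete")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    ari = adjusted_rand_score(labels_true, labels)
    sil = silhouette_score(squareform(d), labels, metric="precomputed")
    return ari, sil


def make_toy_data(seed=0, per_group=10, noise=0.3):
    """Synthetic stand-in data: three groups of noisy 12-point profiles."""
    rng = np.random.default_rng(seed)
    rising = np.arange(12, dtype=float)
    falling = rising[::-1]
    peaked = np.array([0, 2, 4, 6, 8, 10, 10, 8, 6, 4, 2, 0], dtype=float)
    X = np.vstack([
        p + rng.normal(0.0, noise, size=(per_group, 12))
        for p in (rising, falling, peaked)
    ])
    labels = np.repeat([0, 1, 2], per_group)
    return X, labels


if __name__ == "__main__":
    X, y = make_toy_data()
    for m in MEASURES:
        ari, sil = evaluate(X, y, m, n_clusters=3)
        print(f"{m:10s} ARI={ari:.3f} silhouette={sil:.3f}")
```

The Adjusted Rand Index needs reference labels, so it measures agreement with a known grouping, whereas the Silhouette Index uses only the dissimilarities and the cluster assignment. This is one reason the two indices can favor different proximity measures, as reported above.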
Published in | International Journal of Biomedical Materials Research (Volume 5, Issue 5) |
DOI | 10.11648/j.ijbmr.20170505.11 |
Page(s) | 59-63 |
Creative Commons |
This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright |
Copyright © The Author(s), 2017. Published by Science Publishing Group |
Proximity Measures, Agglomerative Hierarchical Clustering, Adjusted Rand Index, Silhouette Index, Gene Expression Data
APA Style
Md. Bipul Hossen, Arefin Mowla, Md. Harun or Rashid, Md. Binyamin. (2017). On the Selection of Appropriate Proximity Measurement for Gene Expression Data. International Journal of Biomedical Materials Research, 5(5), 59-63. https://doi.org/10.11648/j.ijbmr.20170505.11
@article{10.11648/j.ijbmr.20170505.11,
  author = {Md. Bipul Hossen and Arefin Mowla and Md. Harun or Rashid and Md. Binyamin},
  title = {On the Selection of Appropriate Proximity Measurement for Gene Expression Data},
  journal = {International Journal of Biomedical Materials Research},
  volume = {5},
  number = {5},
  pages = {59-63},
  doi = {10.11648/j.ijbmr.20170505.11},
  url = {https://doi.org/10.11648/j.ijbmr.20170505.11},
  eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijbmr.20170505.11},
  year = {2017}
}
TY  - JOUR
T1  - On the Selection of Appropriate Proximity Measurement for Gene Expression Data
AU  - Md. Bipul Hossen
AU  - Arefin Mowla
AU  - Md. Harun or Rashid
AU  - Md. Binyamin
Y1  - 2017/06/30
PY  - 2017
N1  - https://doi.org/10.11648/j.ijbmr.20170505.11
DO  - 10.11648/j.ijbmr.20170505.11
T2  - International Journal of Biomedical Materials Research
JF  - International Journal of Biomedical Materials Research
JO  - International Journal of Biomedical Materials Research
SP  - 59
EP  - 63
PB  - Science Publishing Group
SN  - 2330-7579
UR  - https://doi.org/10.11648/j.ijbmr.20170505.11
VL  - 5
IS  - 5
ER  -