Sains Malaysiana 50(9)(2021): 2579-2589
http://doi.org/10.17576/jsm-2021-5009-07
Enhanced Dimensionality Reduction Methods for Classifying Malaria Vector Dataset using Decision Tree
(Peningkatan Kaedah Pengurangan Kedimensian untuk Mengelaskan Set Data Vektor Malaria menggunakan Pokok Keputusan)
MICHEAL OLAOLU AROWOLO*, MARION OLUBUNMI ADEBIYI & AYODELE ARIYO ADEBIYI
Department of Computer Science, Landmark University, Omu-Aran, Nigeria
Received: 6 October 2020/Accepted: 21 January 2021
ABSTRACT
RNA-Seq data are used in biological applications and in decision making for the
classification of genes. Much recent work has focused on reducing the dimension
of RNA-Seq data, and dimensionality reduction approaches have been proposed for
extracting the relevant information from a given dataset. In this study, a
novel optimized dimensionality reduction algorithm is proposed by combining an
optimized genetic algorithm with Principal Component Analysis and Independent
Component Analysis (GA-O-PCA and GAO-ICA), which are used to identify an optimum
subset and latent correlated features, respectively. A decision tree classifier
is applied to the reduced Anopheles gambiae mosquito dataset to enhance accuracy
and scalability in gene expression analysis. The proposed algorithm extracts
relevant features from the high-dimensional input feature space, using feature
ranking and prior experience. The performance of the model is evaluated and
validated using classification accuracy and compared with existing approaches
in the literature. The experimental results are promising for feature selection
and classification in gene expression data analysis and indicate that the
approach is a capable addition to existing data mining techniques.
Keywords: Decision tree; independent component analysis; malaria vector; optimized genetic algorithm; principal component analysis
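The genetic-algorithm feature-selection idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' GA-O-PCA or GAO-ICA implementation: the PCA and ICA extraction steps are omitted, the data are synthetic rather than the Anopheles gambiae RNA-Seq dataset, the GA is a plain one (truncation selection, one-point crossover, bit-flip mutation), and every name below (`gini`, `build_tree`, `predict`, `fitness`, `ga_select`, `make_data`) is hypothetical. Each candidate feature subset is scored by the validation accuracy of a small decision tree trained on that subset only.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def build_tree(rows, labels, features, depth=0, max_depth=3):
    """Grow a small binary decision tree restricted to `features`."""
    majority = int(2 * sum(labels) >= len(labels))
    if depth == max_depth or gini(labels) == 0.0 or not features:
        return majority  # leaf: predict the majority class
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            # Weighted impurity of the two children; lower is better.
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return majority
    _, f, t = best
    lo = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    hi = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    grow = lambda part: build_tree([r for r, _ in part], [y for _, y in part],
                                   features, depth + 1, max_depth)
    return (f, t, grow(lo), grow(hi))

def predict(tree, row):
    """Walk from the root to a leaf and return its class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fitness(mask, train, valid):
    """Validation accuracy of a tree trained on the features selected by `mask`."""
    features = [i for i, bit in enumerate(mask) if bit]
    if not features:
        return 0.0
    tree = build_tree(train[0], train[1], features)
    hits = sum(predict(tree, r) == y for r, y in zip(*valid))
    return hits / len(valid[1])

def ga_select(train, valid, n_features, pop_size=12, generations=8, seed=0):
    """Plain GA over bit masks: keep the best half, refill by crossover and mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, train, valid), reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)       # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < 0.05) for bit in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    best = max(pop, key=lambda m: fitness(m, train, valid))
    return best, fitness(best, train, valid)

def make_data(n, n_features, rng):
    """Synthetic stand-in for expression data: only feature 0 carries class signal."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 3 * y
        rows.append(row)
        labels.append(y)
    return rows, labels

data_rng = random.Random(42)
train = make_data(40, 10, data_rng)
valid = make_data(40, 10, data_rng)
best_mask, best_acc = ga_select(train, valid, n_features=10)
print("selected features:", [i for i, bit in enumerate(best_mask) if bit])
print("validation accuracy:", best_acc)
```

In a fuller pipeline under the same assumptions, the selected columns would then be passed through a PCA or ICA transform before the final decision tree is trained, and accuracy would be reported on a held-out test split rather than on the validation split used to drive the search.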
ABSTRAK
Data RNA-Seq digunakan untuk aplikasi biologi dan membuat keputusan untuk pengelasan gen. Banyak kajian kebelakangan ini memfokus untuk mengurangkan dimensi data RNA-Seq. Pendekatan pengurangan dimensi telah diusulkan dalam pengambilan maklumat yang relevan dalam data yang diberikan. Dalam kajian ini, algoritma pengurangan dimensi optimum baharu dicadangkan dengan menggabungkan algoritma genetik yang dioptimumkan dengan Analisis Komponen Utama dan Analisis Komponen Bebas (GA-O-PCA dan GAO-ICA),
yang digunakan untuk mengenal pasti ciri subset optimum dan korelasi laten. Pengelas menggunakan Pokok keputusan pada kumpulan data terturun nyamuk anopheles gambiae untuk meningkatkan ketepatan dan kebolehan pengukuran dalam analisis ekspresi gen. Algoritma yang dicadangkan digunakan untuk mengambil ciri yang relevan berdasarkan ruang ciri input dimensi tinggi. Ciri pemeringkatan dan pengalaman sebelumnya digunakan. Prestasi model dinilai dan disahkan menggunakan ketepatan pengelasan untuk membandingkan pendekatan sedia ada dalam kepustakaan. Hasil uji kaji yang dicapai terbukti menjanjikan ciri pemilihan dan pengelasan dalam analisis data ekspresi gen dan menentukan bahawa pendekatan tersebut merupakan pengumpulan yang mampu dilakukan terhadap teknik perlombongan data yang berlaku.
Kata kunci: Algoritma genetik yang dioptimumkan; analisis komponen bebas; analisis komponen utama; Pokok keputusan; vektor malaria
*Corresponding author; email: arowolo.olaolu@lmu.edu.ng