Bulletin de la Société Royale des Sciences de Liège - Volume 85 - Year 2016 - Conference Proceedings - Special Edition

A New Synthetic Oversampling Method Using Ontology and Feature Selection in Order to Improve Imbalanced Textual Data Classification in Persian Texts

Jafar Pouramini
Department of Information Technology, Faculty of Engineering, University of Qom, j_pouramini@pnu.ac.ir
Behrouz Minaei-Bidgoli
Faculty of Computer Engineering, Iran University of Science and Technology, b_minaei@iust.ac.ir

Abstract

The rapid growth of textual data has increased the need to process it automatically. Class imbalance is one of the factors that reduce the performance of text classification. Various methods have been proposed to address the imbalance problem, including data-based, cost-based, algorithm-based and feature selection methods. Recent research has also considered ensemble methods. In this research, a new oversampling method is proposed. In the new method, the number of minority class samples is first increased using an ontology, and random oversampling is then performed for the minority class. Finally, appropriate features are selected using feature selection methods. The new ensemble method was tested on the Hamshahri collection. The results show that, despite the reduced number of features, the method improves classification results for multinomial Naïve Bayes and decision tree classifiers.
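The three steps named in the abstract (ontology-based expansion of the minority class, random oversampling, feature selection) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the ontology, the expansion rule or the feature selection criterion, so the toy ontology `TOY_ONTOLOGY`, the term-replacement rule in `ontology_variants`, the document-frequency score in `select_features` and the English example documents are all assumptions chosen only to make the idea concrete.

```python
import random
from collections import Counter

# Hypothetical toy ontology: maps a term to related terms (assumption).
TOY_ONTOLOGY = {"car": ["automobile"], "doctor": ["physician"]}

def ontology_variants(tokens, ontology):
    """Build synthetic documents by swapping one token for a related term."""
    variants = []
    for i, tok in enumerate(tokens):
        for related in ontology.get(tok, []):
            variants.append(tokens[:i] + [related] + tokens[i + 1:])
    return variants

def oversample_minority(docs, labels, minority, ontology, seed=0):
    """Grow the minority class up to the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    needed = max(0, max(counts.values()) - counts[minority])
    minority_docs = [d for d, y in zip(docs, labels) if y == minority]
    # Step 1: synthetic samples derived from the ontology.
    synthetic = [v for d in minority_docs for v in ontology_variants(d, ontology)]
    new_docs = synthetic[:needed]
    # Step 2: random oversampling (with replacement) fills the remaining gap.
    pool = minority_docs + synthetic
    while len(new_docs) < needed:
        new_docs.append(rng.choice(pool))
    return docs + new_docs, labels + [minority] * len(new_docs)

def select_features(docs, labels, k):
    """Step 3: keep the k terms whose document frequency differs most
    between classes (a simple stand-in for a real selection criterion)."""
    classes = sorted(set(labels))
    size = Counter(labels)
    df = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        df[y].update(set(d))
    vocab = sorted({t for d in docs for t in d})

    def score(term):
        rates = [df[c][term] / size[c] for c in classes]
        return max(rates) - min(rates)

    return sorted(vocab, key=lambda t: (-score(t), t))[:k]

# Toy data: three majority documents and one minority document.
docs = [["car", "fast"], ["car", "road"], ["road", "fast"], ["doctor", "clinic"]]
labels = ["auto", "auto", "auto", "health"]

balanced_docs, balanced_labels = oversample_minority(
    docs, labels, "health", TOY_ONTOLOGY
)
features = select_features(balanced_docs, balanced_labels, k=2)
```

After balancing, both classes have the same number of documents, and the selected features can be used to build the reduced term vectors passed to a classifier.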

Keywords: feature selection, imbalanced, ontology, oversampling

To cite this article

Jafar Pouramini & Behrouz Minaei-Bidgoli, "A New Synthetic Oversampling Method Using Ontology and Feature Selection in Order to Improve Imbalanced Textual Data Classification in Persian Texts", Bulletin de la Société Royale des Sciences de Liège [Online], Volume 85 - Year 2016, Conference Proceedings, Special Edition, pp. 358-375. URL: http://popups.ulg.be/0037-9565/index.php?id=5414