An Improved Approach of Unstructured Text Document Classification Using Predetermined Text Model and Probability Technique

S Sreedhar; Syed Ahmed; P Flora; LS Hemanth; J Aishwarya; Rahul Naik

Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India

Research Article

@INPROCEEDINGS{10.4108/eai.16-5-2020.2304041,
    author={S Kumar Sreedhar and Syed Ahmed and P Mercy Flora and LS Hemanth and J Aishwarya and Rahul Gopal Naik},
    title={An Improved Approach of Unstructured Text Document Classification Using Predetermined Text Model and Probability Technique},
    proceedings={Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India},
    publisher={EAI},
    proceedings_a={ICASISET},
    year={2021},
    month={1},
    keywords={classification, classifier, keyword based document classification (kbdc), predetermined irrelevant text model, probability technique, pre-determined keyword text pattern model (pktpm)},
    doi={10.4108/eai.16-5-2020.2304041}
}

S Kumar Sreedhar, Syed Ahmed, P Mercy Flora, LS Hemanth, J Aishwarya, Rahul Gopal Naik. 2021. An Improved Approach of Unstructured Text Document Classification Using Predetermined Text Model and Probability Technique. ICASISET. EAI. DOI: 10.4108/eai.16-5-2020.2304041

S Kumar Sreedhar¹, Syed Ahmed¹,*, P Mercy Flora¹, LS Hemanth¹, J Aishwarya¹, Rahul Gopal Naik¹

1: Department of Computer Science Engineering, Dr. T. Thimmaiah Institute of Technology, Kolar Gold Fields – 563122, Karnataka, India

*Contact email: syed@drttit.edu.in

Abstract

Document classification is the task of splitting a document set into distinct, highly related classes or groups based on the nature of the document contents. Here, an improved approach to document classification called keyword-based document classification (KBDC) is introduced. It focuses on splitting an unstructured text document set into K dissimilar classes based on K predetermined keyword text models, using an improved probability technique. The new system comprises three stages: pre-processing, classification, and classifier. First, the proposed system (KBDC) recognizes all irrelevant content in the input text document using a constructed Predetermined Irrelevant Text Pattern Model (PITPM). Next, it divides the pre-processed document set into K different groups or classes using K Pre-determined Keyword Text Pattern Models (PKTPMs) through the probability technique, where K denotes the number of groups, classes, or models. Finally, the KBDC system classifies an unlabeled test document into one of the existing groups based on the K class models (PKTPMs). Experimental results show that KBDC is suitable for splitting an unstructured text document set into K distinct, highly related classes and for identifying the class of a new document.
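
The three stages described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the contents of the PITPM or PKTPMs, nor the exact probability technique, so everything concrete here is an assumption made for illustration: the term lists `IRRELEVANT_TERMS` and `KEYWORD_MODELS` are hypothetical, and the score is simply each class's share of keyword hits. It should not be read as the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical irrelevant-term list standing in for the PITPM (not from the paper).
IRRELEVANT_TERMS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

# Hypothetical keyword sets standing in for K = 2 PKTPMs (not from the paper).
KEYWORD_MODELS = {
    "sports": {"match", "team", "score", "player", "goal"},
    "finance": {"bank", "stock", "market", "loan", "interest"},
}


def preprocess(text):
    """Stage 1: lowercase, tokenize, and drop tokens found in the irrelevant-term list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in IRRELEVANT_TERMS]


def class_probabilities(tokens):
    """Stage 2 (assumed scoring): each class's share of all keyword hits in the document."""
    counts = Counter(tokens)
    hits = {
        label: sum(counts[k] for k in keywords)
        for label, keywords in KEYWORD_MODELS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No keyword from any model occurs in the document.
        return {label: 0.0 for label in KEYWORD_MODELS}
    return {label: h / total for label, h in hits.items()}


def classify(text):
    """Stage 3: assign an unlabeled document to the highest-scoring class, or None."""
    probs = class_probabilities(preprocess(text))
    best = max(probs, key=probs.get)
    return best if probs[best] > 0 else None
```

With these assumed models, `classify("The team scored a goal in the match")` returns `"sports"`, and a document containing no model keywords returns `None`. A real system would need keyword models built from the target corpus and a scoring rule matching the paper's probability technique.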

Keywords: classification, classifier, keyword based document classification (kbdc), predetermined irrelevant text model, probability technique, pre-determined keyword text pattern model (pktpm)

Published: 2021-01-27
Publisher: EAI

http://dx.doi.org/10.4108/eai.16-5-2020.2304041