Proceedings of the First International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India

Research Article

An Improved Approach of Unstructured Text Document Classification Using Predetermined Text Model and Probability Technique

  • @INPROCEEDINGS{10.4108/eai.16-5-2020.2304041,
        author={S Kumar Sreedhar and Syed  Ahmed and P Mercy Flora and LS  Hemanth and J  Aishwarya and Rahul Gopal Naik},
        title={An Improved Approach of Unstructured Text Document Classification Using Predetermined Text Model and Probability Technique},
        proceedings={Proceedings of the First  International Conference on Advanced Scientific Innovation in Science, Engineering and Technology, ICASISET 2020, 16-17 May 2020, Chennai, India},
        publisher={EAI},
        proceedings_a={ICASISET},
        year={2021},
        month={1},
        keywords={classification classifier keyword based document classification (kbdc) predetermined irrelevant text model probability technique pre-determined keyword text pattern model (pktpm)},
        doi={10.4108/eai.16-5-2020.2304041}
    }
    
S Kumar Sreedhar1, Syed Ahmed1,*, P Mercy Flora1, LS Hemanth1, J Aishwarya1, Rahul Gopal Naik1
  • 1: Department of Computer Science Engineering, Dr. T. Thimmaiah Institute of Technology, Kolar Gold Fields – 563122, Karnataka, India
*Contact email: syed@drttit.edu.in

Abstract

Document classification is the task of splitting a document set into distinct, highly related classes or groups based on the nature of the document contents. Here, an improved approach to document classification, called keyword-based document classification (KBDC), is introduced. It splits an unstructured text document set into K dissimilar classes based on K predetermined keyword text models, using an improved probability technique. The system comprises three stages: pre-processing, classification, and classifier. First, KBDC recognizes all irrelevant content in the input text documents using a constructed Predetermined Irrelevant Text Pattern Model (PITPM). Next, it divides the pre-processed document set into K groups or classes using K Pre-determined Keyword Text Pattern Models (PKTPMs) and the probability technique, where K denotes the number of groups, classes, or models. Finally, the KBDC system assigns an unlabeled test document to one of the existing groups based on the K class models (PKTPMs). Experimental results show that KBDC is suitable for splitting an unstructured text document set into K distinct, highly related classes and for identifying the class of a new document.
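The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the probability technique, so the sketch assumes the score for a class is the share of keyword hits that fall in that class's keyword set. The names `IRRELEVANT`, `KEYWORD_MODELS`, `preprocess`, `class_probabilities`, and `classify`, and the example keyword lists, are all hypothetical stand-ins for the PITPM and the K PKTPMs.

```python
import re

# Hypothetical stand-in for the PITPM: tokens treated as irrelevant.
IRRELEVANT = {"the", "a", "an", "of", "and", "is", "in", "to", "on", "as"}

# Hypothetical stand-ins for K = 2 PKTPMs: one keyword set per class.
KEYWORD_MODELS = {
    "sports": {"match", "team", "goal", "player"},
    "finance": {"bank", "stock", "market", "loan"},
}


def preprocess(text):
    """Pre-processing stage: tokenize and drop tokens the irrelevant-text model lists."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in IRRELEVANT]


def class_probabilities(tokens):
    """Assumed probability technique: share of all keyword hits per class."""
    hits = {
        label: sum(1 for t in tokens if t in keywords)
        for label, keywords in KEYWORD_MODELS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No keyword from any model occurs in the document.
        return {label: 0.0 for label in hits}
    return {label: count / total for label, count in hits.items()}


def classify(text):
    """Classifier stage: return the best-scoring class, or None if nothing matches."""
    probs = class_probabilities(preprocess(text))
    best = max(probs, key=probs.get)
    return best if probs[best] > 0 else None


if __name__ == "__main__":
    print(classify("The team scored a goal in the match"))
    print(classify("The bank raised the loan rate as the stock market fell"))
```

Returning `None` when no keyword matches is a design choice of this sketch; the abstract states only that a test document is assigned to one of the existing groups.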