HYBRID METHOD FOR CLASSIFYING TEXT DATA WITH HIGHLY SPECIALIZED TERMINOLOGY

Vlada Sergeevna Serova; Alexander Vladimirovich Hollay; Elena Vyacheslavovna Bunova

Authors

Vlada S. Serova
Alexander V. Hollay
Elena V. Bunova

Abstract

With text volumes growing exponentially, especially in domain-specific areas (technical, medical, legal), automatic classification of texts saturated with highly specialized terminology has become critically important. Existing approaches, including transformer models such as BERT, often lose accuracy on rare or domain-specific vocabulary because they are trained on general-purpose corpora. The aim of the study is to develop a hybrid method, Combined Neural BERT (CNB), that targets maximum classification accuracy (100 %) for texts with specialized terminology by combining the strengths of contextual language models, lexical-statistical methods, and visualization tools.

Materials and methods. The proposed CNB method integrates three key components: 1) BERT (or its derivatives), which generates deep contextual embeddings that capture semantics and word order; 2) fully connected neural networks (FCNN), which act as a classifier over BERT features and/or lexical-statistical features; 3) the Word Cloud method and TF-IDF, which extract and visualize key domain terms, form a feature dictionary, and improve interpretability. The method proceeds in the following stages: text preprocessing (normalization, cleaning), parallel feature extraction (BERT contextual embeddings and TF-IDF vectors), merging of the feature spaces, classification with the FCNN, and interactive tuning based on Word Cloud analysis.

Results. The hybrid CNB approach was tested on a real corpus of 10,000 requests from residents of the Chelyabinsk region (7 thematic categories) using 70 key terms and 150 stop words. The method reached 100 % classification accuracy after three training iterations (total time: 90 minutes). Key benefits: higher accuracy, because lexical-statistical features compensate for BERT's weaknesses in specialized domains; improved interpretability, thanks to Word Cloud visualization of key terms; and efficient processing of large volumes of specialized texts.

Conclusion. The developed hybrid CNB method proved exceptionally effective for classifying texts with highly specialized terminology. It is a powerful tool for analyzing domain-specific text collections (legal documents, technical documentation, medical reports, etc.) as data volumes continue to grow. Future work includes adapting the method to other domains and optimizing its computational efficiency.
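The parallel feature extraction and feature-space merging stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The stop-word list, the sample requests, and every function name are hypothetical, and `placeholder_embedding` is a deterministic stand-in for a BERT sentence embedding, used only so the example runs without external libraries. The FCNN classifier that would consume the fused vector is omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the study reports using 150 stop words.
STOP_WORDS = {"the", "a", "of", "in", "on", "is", "to"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def fit_idf(docs_tokens):
    """Smoothed inverse document frequency (one common variant)."""
    n = len(docs_tokens)
    df = Counter(t for toks in docs_tokens for t in set(toks))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf, vocab):
    """Relative term frequency times IDF, one entry per vocabulary term."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[t] / total * idf[t] for t in vocab]


def placeholder_embedding(tokens, dim=8):
    """Stand-in for a contextual embedding. NOT a real language model."""
    vec = [0.0] * dim
    for t in tokens:
        vec[sum(map(ord, t)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fused_features(text, idf, vocab, dim=8):
    """Concatenate the dense (embedding) and sparse (TF-IDF) feature spaces."""
    tokens = tokenize(text)
    return placeholder_embedding(tokens, dim) + tfidf_vector(tokens, idf, vocab)


# Illustrative requests, invented for this example.
docs = [
    "Pothole repair on the road",
    "Heating outage in apartment building",
    "Road lighting is broken",
]
doc_tokens = [tokenize(d) for d in docs]
idf = fit_idf(doc_tokens)
vocab = sorted(idf)
features = fused_features(docs[0], idf, vocab)
print(len(features), "features per request")
```

In a fuller pipeline, the placeholder would presumably be replaced by pooled BERT outputs, and the fused vector would be passed to a fully connected classifier. Simple concatenation is only one way to merge the two feature spaces; the abstract does not specify which one the authors used.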

Author Biographies

Vlada S. Serova

Postgraduate student of the Department of Informational and Analytical Support of Control in Social and Economic Systems, South Ural State University, Chelyabinsk, Russia

Alexander V. Hollay

Dr. Sci. (Eng.), Ass. Prof., Prof. of the Department of Information and Analytical Support of Management in Social and Economic Systems, Director of the Higher School of Electronics and Computer Science, South Ural State University, Chelyabinsk, Russia

Elena V. Bunova

Cand. Sci. (Eng.), Ass. Prof., Ass. Prof. of the Department of Applied Mathematics and Programming, South Ural State University, Chelyabinsk, Russia

HYBRID METHOD OF CLASSIFICATION OF TEXT DATA WITH SPECIALIZED TERMINOLOGY
