Enhancing Language Model Performance with a Novel Text Preprocessing Method
A. Jalilia, H. Tabrizchib, c, A. Mosavic, d, e, A.R. Varkonyi-Koczye
aDepartment of Computer Science, School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, 16 Azar Street, 1417935840 Tehran, Iran
bDepartment of Computer Science, Faculty of Mathematics, Statistics, and Computer Science, University of Tabriz, 29 Bahman Blvd, 5166616471 Tabriz, Iran
cJohn von Neumann Faculty of Informatics, Obuda University, 1034 Budapest, Hungary
dLudovika University of Public Service, 1083 Budapest, Hungary
eJ. Selye University, 94501 Komárno, Slovakia
Advances in natural language processing highlight the importance of preparing text data for machine learning. Traditional preprocessing methods often fail to cope with the complexity of natural language, which degrades model performance. This paper therefore proposes an approach that combines tokenization, noise reduction, and normalization to improve text quality.
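The three stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the specific cleaning rules (tag stripping, URL removal, lowercasing, whitespace tokenization) are illustrative assumptions about what such a pipeline might contain.

```python
import re

def preprocess(text: str) -> list[str]:
    """Illustrative pipeline: noise reduction, normalization,
    then tokenization. The exact rules are assumptions, not
    the method proposed in the paper."""
    # Noise reduction: strip HTML tags, URLs, and punctuation/symbols
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Normalization: lowercase and collapse repeated whitespace
    text = re.sub(r"\s+", " ", text.lower()).strip()
    # Tokenization: simple whitespace split
    return text.split()

print(preprocess("Visit <b>our</b> site at https://example.com NOW!!"))
```

Real systems would typically swap the whitespace split for a trained tokenizer (e.g. subword tokenization), but the staged structure, clean first, normalize second, tokenize last, remains the same.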

DOI:10.12693/APhysPolA.146.542
topics: natural language processing, text preprocessing, artificial intelligence, deep learning