Abstract
We studied a range of automatic term extraction methods, from well-established techniques such as TF-IDF, RAKE, and TextRank to recent transformer-based approaches, including KeyBERT and large language models (LLMs). Our findings show that hybrid approaches, which combine statistical and neural methods, achieve superior performance by ensuring both formal relevance and semantic depth in term selection. We propose a multi-stage pipeline: initial text preprocessing and annotation, followed by the parallel application of several extraction algorithms to minimize stochastic variation, and final refinement through neural network-based ranking. This integrated pipeline significantly improves the quality of term extraction and enhances both the accuracy and practical utility of the retrieved information. The proposed method enables scalable analysis of large unstructured text corpora without manual annotation or domain-specific training, making it suitable for a wide range of research and applied digitalization tasks, including applications in medicine, education, and document management.
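The three stages named in the abstract (preprocessing, several extractors applied side by side, and a final ranking step) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every function name in it is hypothetical. It assumes a toy stopword list, a smoothed TF-IDF scorer, and a simple co-occurrence scorer standing in for graph-based methods such as TextRank. The extractors run sequentially here, and their outputs are merged by reciprocal rank fusion, which is a choice made for this sketch and is not stated in the source. The neural ranking stage is represented only by an optional `rerank` callable; a real system might plug in an embedding-based scorer such as KeyBERT at that point.

```python
import math
import re
from collections import Counter

# Toy stopword list (an assumption; a real pipeline would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to",
             "is", "are", "with", "by", "from"}


def tokenize(text):
    """Stage 1: lowercase the text and keep alphabetic tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_scores(doc_tokens, corpus_tokens):
    """Statistical extractor: term frequency times a smoothed IDF."""
    n_docs = len(corpus_tokens)
    scores = {}
    for term, count in Counter(doc_tokens).items():
        df = sum(1 for doc in corpus_tokens if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


def cooccurrence_scores(doc_tokens, window=2):
    """Crude stand-in for graph ranking: count distinct neighbours in a window."""
    neighbours = {t: set() for t in doc_tokens}
    for i, term in enumerate(doc_tokens):
        for other in doc_tokens[max(0, i - window): i + window + 1]:
            if other != term:
                neighbours[term].add(other)
    return {t: float(len(n)) for t, n in neighbours.items()}


def fuse_rankings(score_maps, k=60):
    """Stage 2 merge: reciprocal rank fusion over the extractors' rankings."""
    fused = Counter()
    for scores in score_maps:
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        for rank, term in enumerate(ranked, start=1):
            fused[term] += 1.0 / (k + rank)
    return dict(fused)


def extract_terms(doc, corpus, rerank=None, top_n=5):
    """Run all stages; `rerank(term, doc)` is a hypothetical neural scorer hook."""
    corpus_tokens = [tokenize(d) for d in corpus]
    doc_tokens = tokenize(doc)
    fused = fuse_rankings([
        tfidf_scores(doc_tokens, corpus_tokens),
        cooccurrence_scores(doc_tokens),
    ])
    shortlist = sorted(fused, key=lambda t: (-fused[t], t))[: 2 * top_n]
    if rerank is not None:
        # Stage 3: reorder the shortlist with the supplied scorer (stable sort).
        shortlist = sorted(shortlist, key=lambda t: -rerank(t, doc))
    return shortlist[:top_n]


if __name__ == "__main__":
    corpus = [
        "Term extraction identifies domain terms in text.",
        "Graph ranking orders words in a document.",
        "Neural ranking refines term extraction results.",
    ]
    print(extract_terms(corpus[0], corpus, top_n=3))
```

Fusing ranks rather than raw scores avoids having to normalize scores across extractors with different scales, which is one plausible way to damp the variation of any single method.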
References
Sparck Jones K. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation. 1972;28:11–21.
Rose S., Engel D., Cramer N., Cowley W. Automatic Keyword Extraction from Individual Documents. Text Mining: Applications and Theory. 2010. DOI: 10.1002/9780470689646.ch1.
Mihalcea R., Tarau P. TextRank: Bringing Order into Text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain. Association for Computational Linguistics; 2004:404–411.
Manning C. D., Raghavan P., Schütze H. Introduction to Information Retrieval [Russian translation]. Williams; 2020. 528 p. ISBN 978-5-907203-20-4.
Grootendorst M. KeyBERT: Minimal Keyword Extraction with BERT. Available at: https://www.maartengrootendorst.com/blog/keybert/.