Automatic Extraction of Keywords and Summaries for Knowledge Base Population

E. G. Tunyan; R. S. Sazikov; S. A. Kharlamov

Vol 6 No 2 (2025), Papers

Automatic Extraction of Keywords and Summaries for Knowledge Base Population

Published June 30, 2025

E. G. Tunyan

Surgut State University, Surgut, Russian Federation; LLC Edro, Surgut, Russian Federation; Surgut Branch of Scientific Research Institute for System Analysis of the National Research Centre “Kurchatov Institute”, Surgut, Russian Federation

https://orcid.org/0009-0003-3260-1310

R. S. Sazikov

Surgut State University, Surgut, Russian Federation; LLC Edro, Surgut, Russian Federation; Surgut Branch of Scientific Research Institute for System Analysis of the National Research Centre “Kurchatov Institute”, Surgut, Russian Federation

https://orcid.org/0009-0005-0078-0013

S. A. Kharlamov

Surgut State University, Surgut, Russian Federation; LLC Edro, Surgut, Russian Federation

https://orcid.org/0009-0000-5605-0531

Keywords

knowledge enrichment
term extraction
keywords
abstract
TF-IDF
RAKE
TextRank
KeyBERT
LLM
intelligent search
automated text processing
semantic analysis

How to Cite

Tunyan E.G., Sazikov R.S., Kharlamov S.A. Automatic Extraction of Keywords and Summaries for Knowledge Base Population // Russian Journal of Cybernetics. 2025. Vol. 6, No. 2. P. 108–113.

Abstract

We studied a range of automatic term extraction methods, from well-established techniques such as TF-IDF, RAKE, and TextRank to recent transformer-based approaches, including KeyBERT and large language models (LLMs). Our findings show that hybrid approaches, which combine statistical and neural methods, achieve superior performance by ensuring both formal relevance and semantic depth in term selection. We propose a multi-stage pipeline: initial text preprocessing and annotation, followed by the parallel application of several extraction algorithms to minimize stochastic variation, and final refinement through neural network-based ranking. This integrated algorithm significantly improves the quality of term extraction and enhances both the accuracy and practical utility of the retrieved information. The proposed method enables scalable analysis of large unstructured text corpora without manual annotation or domain-specific training, making it suitable for a wide range of research and applied digitalization tasks, including applications in medicine, education, and document management.
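The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of running several extractors in parallel and merging their rankings. It assumes two simple statistical scorers (a smoothed TF-IDF and RAKE-style word scores) and combines them with reciprocal rank fusion, which is a choice made here for illustration and not necessarily the authors' method. The neural ranking stage is represented only by an optional, hypothetical `rerank` callback; all function names, the stopword list, and the constant `k=60` are assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real pipelines use larger lists).
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "is", "to", "with", "on", "from"}


def tokenize(text):
    """Lowercase the text and keep ASCII alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_scores(doc, corpus):
    """Score each term of `doc` by relative term frequency times a smoothed IDF."""
    tokens = tokenize(doc)
    tf = Counter(tokens)
    doc_sets = [set(tokenize(d)) for d in corpus]
    n = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        idf = math.log((1 + n) / (1 + df)) + 1.0
        scores[term] = (count / len(tokens)) * idf
    return scores


def rake_word_scores(doc):
    """RAKE-style word scores: degree / frequency over stopword-delimited phrases."""
    pieces = re.findall(r"[a-z]+|[^a-z\s]", doc.lower())
    phrases, current = [], []
    for piece in pieces:
        is_word = "a" <= piece[0] <= "z"
        if not is_word or piece in STOPWORDS:
            # Stopwords, punctuation, and digits all end the current phrase.
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(piece)
    if current:
        phrases.append(current)
    freq, degree = Counter(), Counter()
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    return {word: degree[word] / freq[word] for word in freq}


def reciprocal_rank_fusion(score_maps, k=60):
    """Merge several score maps by summing 1 / (k + rank) for each term."""
    fused = Counter()
    for scores in score_maps:
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        for rank, term in enumerate(ranked, start=1):
            fused[term] += 1.0 / (k + rank)
    return dict(fused)


def extract_keywords(doc, corpus, top_n=5, rerank=None):
    """Run both extractors, fuse their rankings, then optionally re-rank.

    `rerank` is a hypothetical hook standing in for a neural ranking stage:
    it takes (doc, candidates) and returns the candidates in a new order.
    """
    fused = reciprocal_rank_fusion([tfidf_scores(doc, corpus), rake_word_scores(doc)])
    candidates = sorted(fused, key=lambda t: (-fused[t], t))
    if rerank is not None:
        candidates = rerank(doc, candidates)
    return candidates[:top_n]
```

Calling `extract_keywords(doc, corpus, top_n=5)` returns up to five terms from `doc`. Fusing ranks rather than raw scores avoids having to put the extractors' outputs on a common scale, which is one reason rank fusion is a convenient way to combine heterogeneous methods.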

References

Sparck Jones K. A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation. 1972;28:11–21.

Rose S., Engel D., Cramer N., Cowley W. Automatic Keyword Extraction from Individual Documents. Text Mining: Applications and Theory. 2010. DOI: 10.1002/9780470689646.ch1.

Mihalcea R., Tarau P. TextRank: Bringing Order into Text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain. Association for Computational Linguistics; 2004:404–411.

Manning C. D., Raghavan P., Schütze H. Introduction to Information Retrieval (Russian edition). Williams; 2020. 528 p. ISBN 978-5-907203-20-4.

Grootendorst M. KeyBERT: Minimal Keyword Extraction with BERT. Available at: https://www.maartengrootendorst.com/blog/keybert/.
