Abstract
The paper presents linguistic software for intelligent search systems. Large language models (LLMs) have developed rapidly over the last decade. However, LLMs suitable for information search often require extensive resources and offer redundant functionality when embedded into targeted information systems, so lightweight approaches to natural language processing are needed. We consider an extractive approach to question-answer search that finds the sentences answering a question in a specified document. To this end, we propose methods for analyzing the morphology, syntax, and semantics of natural language. A corpus of Russian-language texts containing 8,800 sentences was collected to implement graph-based syntactic analysis, in which a complete directed graph is weighted by a feedforward artificial neural network. The corpus was also used to produce a set of syntax-oriented vector representations of words, which are applied in semantic analysis through a model based on the continuous bag-of-words architecture. Sentences are ranked by relevance to the question by representing the semantics of the natural language text as a strongly connected directed graph, which reveals implicit meaningful patterns within language structures.
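To make the extractive question-answer setting concrete, the following minimal sketch ranks the sentences of a document by their relevance to a question. It is not the method of the paper: the graph-based semantic representation is replaced here by a much simpler baseline, the cosine similarity between averaged word vectors. The vector values, the English toy sentences, and all names (`TOY_VECTORS`, `sentence_vector`, `rank_sentences`) are hypothetical and chosen only for illustration.

```python
import math
import re

# Toy 3-dimensional word vectors. These values are invented for illustration;
# a real system would load trained embeddings instead.
TOY_VECTORS = {
    "cat": (0.9, 0.1, 0.0),
    "kitten": (0.8, 0.2, 0.0),
    "sleeps": (0.1, 0.9, 0.0),
    "sleep": (0.1, 0.8, 0.1),
    "train": (0.0, 0.1, 0.9),
    "arrives": (0.1, 0.2, 0.8),
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def sentence_vector(text):
    """Average the vectors of all in-vocabulary tokens of the text."""
    vectors = [TOY_VECTORS[t] for t in tokenize(text) if t in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_sentences(question, sentences):
    """Return (score, sentence) pairs sorted from most to least relevant."""
    q = sentence_vector(question)
    scored = [(cosine(q, sentence_vector(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


document = ["The kitten sleeps.", "The train arrives."]
ranking = rank_sentences("Where does the cat sleep?", document)
for score, sentence in ranking:
    print(f"{score:.3f}  {sentence}")
```

In this toy setting the sentence about the kitten is ranked above the sentence about the train, because its averaged vector lies closer to that of the question. A baseline of this kind ignores word order and syntax, which is the kind of information the syntax-oriented representations and graph-based semantics described above are intended to capture.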