The Potential of Current Multimodal Transformers for Image Analysis

P. A. Alexandrov; A. A. Prusakov; G. N. Antonova; M. N. Shakhov; S. E. Stelmak; A. V. Beklemisheva; V. G. Sarkisov

Vol 7 No 1 (2026), Papers

Published March 31, 2026

P. A. Alexandrov

National Research Centre “Kurchatov Institute”, Moscow, Russian Federation

A. A. Prusakov

National Research Centre “Kurchatov Institute”, Moscow, Russian Federation

G. N. Antonova

National Research Centre “Kurchatov Institute”, Moscow, Russian Federation

M. N. Shakhov

National Research Centre “Kurchatov Institute”, Moscow, Russian Federation

S. E. Stelmak

National Research Centre “Kurchatov Institute”, Moscow, Russian Federation

A. V. Beklemisheva

National Research Centre “Kurchatov Institute”, Moscow, Russian Federation

V. G. Sarkisov

National Research Centre “Kurchatov Institute”, Moscow, Russian Federation

Keywords

computer vision
machine learning
artificial intelligence systems
artificial neural networks
image analysis
ChatGPT
DeepSeek
transformers
large language models
few-shot learning

How to Cite

Alexandrov P.A., Prusakov A.A., Antonova G.N., Shakhov M.N., Stelmak S.E., Beklemisheva A.V., Sarkisov V.G. The Potential of Current Multimodal Transformers for Image Analysis // Russian Journal of Cybernetics. 2026. Vol. 7, № 1. P. 93-103.

Abstract

We studied the image analysis capabilities of two widely used neural network services: ChatGPT-5 mini and DeepSeek-3.1 Thinking. We measured the quality of feature generation and analogy matching using a new methodology and a unique experimental framework that employed all four training examples for each of two classes. In experiments with 93 proposed and automatically generated Modified Bongard Tests, ChatGPT-5 mini completed 15 tests (16.1%) and DeepSeek-3.1 Thinking completed 17 (18.3%). These results demonstrate that, despite clear progress in few-shot learning, current multimodal transformers still face fundamental limitations in in-context learning.
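
The percentages quoted in the abstract follow directly from the reported counts. A minimal sketch of that arithmetic is given here; the counts (15 and 17 out of 93) come from the abstract, while the function and variable names are illustrative assumptions and not part of the paper.

```python
# Illustrative check of the success rates quoted in the abstract.
# The counts (15 and 17 completed out of 93 tests) come from the abstract;
# the names below are hypothetical and chosen only for this sketch.

TOTAL_TESTS = 93


def success_rate(solved: int, total: int = TOTAL_TESTS) -> float:
    """Return the share of completed tests as a percentage, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * solved / total, 1)


results = {"ChatGPT-5 mini": 15, "DeepSeek-3.1 Thinking": 17}
rates = {model: success_rate(solved) for model, solved in results.items()}

for model, rate in rates.items():
    print(f"{model}: {results[model]}/{TOTAL_TESTS} = {rate}%")
```

Both rates fall below one in five, and the two services differ by only two tests out of 93.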
