The Potential of Current Multimodal Transformers for Image Analysis
PDF (Russian)

Keywords

computer vision
machine learning
artificial intelligence systems
artificial neural networks
image analysis
ChatGPT
DeepSeek
transformers
large language models
few-shot learning

How to Cite

Alexandrov P.A., Prusakov A.A., Antonova G.N., Shakhov M.N., Stelmak S.E., Beklemisheva A.V., Sarkisov V.G. The Potential of Current Multimodal Transformers for Image Analysis // Russian Journal of Cybernetics. 2026. Vol. 7, No. 1. P. 93-103.

Abstract

We studied the image analysis capabilities of two widely used neural network services: ChatGPT-5 mini and DeepSeek-3.1 Thinking. We measured the quality of feature generation and analogy matching using a new methodology and a unique experimental framework that used all four training examples for each of the two classes. In experiments on 93 proposed and automatically generated Modified Bongard Tests, ChatGPT-5 mini solved 15 tests (16.1%) and DeepSeek-3.1 Thinking solved 17 (18.3%). These results demonstrate that, despite clear progress in few-shot learning, current multimodal transformers still face fundamental limitations in in-context learning.
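The reported percentages follow directly from the counts given in the abstract. The short Python sketch below only re-derives them; the class and variable names (`ModelResult`, `TOTAL_TESTS`) are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass

# Number of Modified Bongard Tests, as stated in the abstract.
TOTAL_TESTS = 93


@dataclass(frozen=True)
class ModelResult:
    """Hypothetical container for one model's reported score."""
    name: str
    solved: int  # count reported in the abstract

    def solve_rate(self, total: int = TOTAL_TESTS) -> float:
        """Return the share of tests the model got right, in percent."""
        return 100.0 * self.solved / total


results = [
    ModelResult("ChatGPT-5 mini", 15),
    ModelResult("DeepSeek-3.1 Thinking", 17),
]

for r in results:
    print(f"{r.name}: {r.solved}/{TOTAL_TESTS} = {r.solve_rate():.1f}%")
# ChatGPT-5 mini: 15/93 = 16.1%
# DeepSeek-3.1 Thinking: 17/93 = 18.3%
```

The two models differ by only two tests out of 93, so the gap between them is small relative to the number of tests that neither model got right.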


