Qwen3-Omni: a new era for multimodal AI models? – AI News – #4 September 2025
AI has long functioned like a collection of isolated tools: one for text, one for sound, one for images. Combining them into a single cohesive system required extra work and specialized knowledge from the user. Alibaba challenges this approach with Qwen3-Omni – an open model designed from the ground up to handle text, images, audio, and video simultaneously. It is a step toward a future where interacting with machines resembles natural conversation rather than typing commands.
Qwen3-Omni is a natively multimodal, multilingual “omni” model. It can smoothly process different types of input (reading, listening, watching) and then respond with both text and natural-sounding speech in real time. Most importantly, it does so without losing performance in any of the supported modalities, which was often a problem with earlier hybrid models.
The model developed by Alibaba stands out from the competition with several features that define its potential.
The core strength of Qwen3-Omni lies in its versatility.
Example: you can send a short video with the question “What is happening here?”, and the model will respond with both a spoken explanation and a text summary.
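As a rough illustration of what such a request could look like in code, here is a minimal sketch that calls the model through an OpenAI-compatible chat endpoint. The base URL, the model id, and the video_url content part are assumptions made for illustration only, not taken from the official documentation, so check Alibaba's docs before relying on them.

```python
# Illustrative sketch only: the endpoint URL, model id and "video_url" content part
# are assumptions, not confirmed against the official Qwen3-Omni documentation.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

response = client.chat.completions.create(
    model="qwen3-omni-flash",  # assumed model id
    messages=[
        {
            "role": "user",
            "content": [
                # The short clip from the example above, plus the question about it.
                {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
                {"type": "text", "text": "What is happening here?"},
            ],
        }
    ],
)

print(response.choices[0].message.content)  # the text part of the answer
```

Requesting the spoken answer alongside the text one is typically handled through the endpoint's additional audio or streaming options, which are omitted here for brevity.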
Qwen3-Omni was created with global use in mind.
Thanks to this, it becomes a tool accessible to users worldwide – from developers in India to teachers in Brazil.
Many multimodal models lose quality on text tasks when trained on audio or video data. Qwen3-Omni avoids this pitfall.
The speed and naturalness of its responses come from the model's design.
This architecture, supported by a Mixture of Experts (MoE) mechanism, significantly reduces latency, allowing real-time interaction with delays as low as 211 ms for audio-only input and 507 ms for audio-video input.
This kind of technology opens the door to completely new applications.
Alibaba provides Qwen3-Omni under the Apache 2.0 license, which allows free use, including for commercial purposes.
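For those who want to experiment with the open weights, the sketch below shows roughly what loading the model with Hugging Face Transformers could look like. The checkpoint id, the Qwen3OmniMoeForConditionalGeneration class name, and the exact return value of generate() are assumptions based on how earlier Qwen omni releases were packaged; verify them against the official model card before use.

```python
# Hedged sketch: checkpoint id and class name are assumptions, not verified
# against the official Qwen3-Omni model card.
import torch
from transformers import AutoProcessor, Qwen3OmniMoeForConditionalGeneration  # assumed class name

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed checkpoint id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half precision to keep the MoE weights manageable
    device_map="auto",           # spread layers across the available GPUs
)

# A simple text-only turn; image, audio and video inputs follow the same chat format.
messages = [{"role": "user", "content": "In one sentence, what is a multimodal model?"}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Some omni releases return a (text_ids, audio) pair when speech output is enabled.
text_ids = out[0] if isinstance(out, tuple) else out
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
```

For serving under real-time latency constraints, the vLLM and Docker options mentioned below are probably the more practical route.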
The model can be used via Hugging Face (Transformers), vLLM (for better performance), and the DashScope API. A ready-to-use Docker image is also available.
Qwen3-Omni is most likely a preview of how we will communicate with technology in the coming years. The era of text-only chatbots is slowly coming to an end. The future belongs to models that can simultaneously see, hear, and speak – all in a smooth and natural manner. If you are building next-generation applications or are simply interested in the direction artificial intelligence is heading, Qwen3-Omni is a project worth paying attention to. We will certainly be following similar technological advances, so stay up to date with us and subscribe to the Delante newsletter!
Source of information about Qwen3-Omni: https://qwen.ai/blog?id=65f766fc2dcba7905c1cb69cc4cab90e94126bf4&from=research.latest-advancements-list