Qwen3-Omni: a new era for multimodal AI models? – AI News – #4 September 2025

29 September 2025

Alibaba has unveiled Qwen3-Omni, a revolutionary open-source AI model that simultaneously processes text, images, audio, and video, responding in real time with both text and natural speech. Unlike many hybrid solutions, Qwen3-Omni achieves state-of-the-art performance on audio-visual tasks without compromising its text analysis capabilities, directly challenging closed-source models from Google and OpenAI.

AI has long functioned like a collection of isolated tools: one for text, one for sound, one for images. Combining them into a single cohesive system required extra work and specialized knowledge from the user. Alibaba throws down the gauntlet to this approach with Qwen3-Omni – an open model designed from the ground up to handle text, images, audio, and video simultaneously. This is a step toward a future where interacting with machines resembles natural conversation rather than typing commands.

What exactly is Qwen3-Omni?

Qwen3-Omni is a natively multimodal, multilingual “omni” model. It can smoothly process different types of input (reading, listening, watching) and then respond in real time with both text and natural-sounding speech. Most importantly, it does so without losing performance in any of the supported modalities – a common weakness of earlier hybrid models.

Key features and capabilities of Qwen3-Omni

The model developed by Alibaba stands out from the competition with several features that define its potential.

One model, many formats

The core strength of Qwen3-Omni lies in its versatility.

    • Input: the model accepts text, images, sound, and even video clips.

    • Output: responses are generated not only in text form but also as fluent, natural speech.

Example: you can send a short video with the question “What is happening here?”, and the model will respond both with a spoken explanation and a text summary.
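For developers, this flow maps onto just a few lines of Python. Below is a minimal sketch, assuming the Hugging Face Transformers integration follows the pattern of earlier Qwen omni releases; the class names, model ID, and the qwen_omni_utils helper are assumptions to verify against the official model card:

```python
# Minimal "send a video, ask a question" sketch. The class names
# (Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor), the model ID,
# and the qwen_omni_utils helper follow the pattern of earlier Qwen omni
# releases -- verify them against the official Hugging Face model card.
import soundfile as sf
from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info  # helper shipped with Qwen omni models

MODEL_ID = "Qwen/Qwen3-Omni-30B-A3B-Instruct"  # assumed model ID
model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_ID)

# One user turn containing a video clip plus a text question.
conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "What is happening here?"},
    ],
}]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device)

# In the earlier Qwen omni API, generate() returns both the text token IDs
# and a speech waveform -- assumed to carry over here.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("answer.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```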

True multilingualism

Qwen3-Omni was created with global use in mind.

    • Supports text in 119 languages.

    • Understands speech in 19 languages.

    • Generates speech in 10 languages.

Thanks to this, it becomes a tool accessible to users worldwide – from developers in India to teachers in Brazil.

Performance without compromise

Many multimodal models lose quality on text tasks when trained on audio or video data. Qwen3-Omni avoids this pitfall.

    • Maintains high performance on text and vision benchmarks.

    • Achieves SOTA (State-of-the-Art) status in 32 of 36 audio and audiovisual benchmarks, outperforming closed models like Gemini-2.5-Pro and GPT-4o-Transcribe.

Innovative architecture: “the thinker and the talker”

The speed and naturalness of its responses come from a distinctive two-part design.

    • Thinker: this part of the model is responsible for reasoning, analysis, and generating textual content.

    • Talker: receives processed data from the “Thinker” and instantly converts it into streamed speech tokens.

This architecture, supported by the MoE (Mixture of Experts) mechanism, significantly reduces latency, allowing real-time interaction with delays as low as 211 ms (audio only) and 507 ms (audio-video).
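To see why this split matters for latency, here is a toy Python sketch – emphatically not Alibaba's actual code – of the thinker/talker handoff. Because the two stages are chained as streams, speech output can begin as soon as the first text tokens appear, rather than after the full answer is composed:

```python
from typing import Iterator

def thinker(prompt: str) -> Iterator[str]:
    # Stand-in for the reasoning stage: yields text tokens one at a time.
    for token in f"Answer to: {prompt}".split():
        yield token

def talker(text_tokens: Iterator[str]) -> Iterator[bytes]:
    # Stand-in for the speech stage: turns each text token into "speech
    # codec tokens" as soon as it arrives, without waiting for the thinker
    # to finish the whole answer.
    for token in text_tokens:
        yield f"<speech:{token}>".encode()

# Chaining the two stages as generators means the first speech packet is
# emitted right after the first text token -- this pipelining is what keeps
# first-packet latency low (211 ms audio-only, per Alibaba's figures).
for packet in talker(thinker("What is happening in this video?")):
    print(packet)
```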

Practical applications that make sense

This kind of technology opens doors to completely new applications:

    • Education: a teacher can record a lecture, and the model will generate summaries and key points in several languages.

    • Accessibility: people with hearing impairments can get accurate live transcriptions from video or audio materials.

    • Business: a meeting recording can be instantly turned into a task list and a summary, and you can ask the model for details of the discussion.

    • Daily interactions: by showing the model a cooking video, instead of just receiving the answer “This is pasta,” you can get a step-by-step instruction on how to prepare the dish.

Information for developers

Alibaba provides Qwen3-Omni under the Apache 2.0 license, allowing free use, including for commercial purposes.

    • Requirements: The model is resource-intensive. Running it locally requires powerful graphics cards (up to 144 GB of VRAM).

    • Availability: The model is accessible via Hugging Face (with Transformers), vLLM (for better performance), and the DashScope API. A ready-to-use Docker image is also available.
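As an illustration of the hosted route, here is a minimal sketch of calling the model through DashScope's OpenAI-compatible endpoint. The endpoint and parameter style follow Alibaba's published Qwen omni examples; the exact model name for Qwen3-Omni used below is an assumption to check against the DashScope model list:

```python
# Hypothetical call via DashScope's OpenAI-compatible endpoint. The base URL
# and parameters follow Alibaba's published Qwen omni examples; the model
# name "qwen3-omni-flash" is an assumption -- check the DashScope model list.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3-omni-flash",                    # assumed model ID
    messages=[{"role": "user",
               "content": "Summarize this meeting in three bullet points."}],
    modalities=["text", "audio"],                # request both text and speech
    audio={"voice": "Cherry", "format": "wav"},  # voice name per Qwen omni docs
    stream=True,                                 # omni models stream their output
)

# Text arrives as incremental deltas; audio arrives base64-encoded in the
# same stream (decode and concatenate it per the DashScope documentation).
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```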

Does Qwen3-Omni signal a new era of interaction?

Qwen3-Omni is most likely a preview of how we will communicate with technology in the coming years. The era of text-only chatbots is slowly ending. The future belongs to models that can simultaneously see, hear, and speak – all in a smooth and natural manner. If you are creating next-generation applications or simply interested in the direction artificial intelligence is heading, Qwen3-Omni is a project definitely worth paying attention to. We too will certainly be following similar technological advances, so stay up-to-date with us and subscribe to the Delante newsletter!

Source of information about Qwen3-Omni: https://qwen.ai/blog?id=65f766fc2dcba7905c1cb69cc4cab90e94126bf4&from=research.latest-advancements-list

Author
Maciej Jakubiec

SEO Specialist

A marketing graduate specializing in e-commerce from the University of Economics in Kraków – part of Delante’s SEO team since 2022. A firm believer in the importance of well-crafted content, and apart from being an SEO, a passionate music producer crafting sounds since his early teens.