01What multimodal means

Multimodal refers to working with multiple types of data in a single AI interaction. A multimodal AI can look at an image and answer questions about it, transcribe audio and summarise the content, analyse a chart and provide written interpretation, or combine text instructions with a photograph to perform a task.

GPT-4V (GPT-4 with Vision), Claude 3's vision capability, and Google Gemini are all multimodal: they can process both text and images in the same interaction. Some systems additionally handle audio input and output.

02Business use cases

The business use cases for multimodal AI are significant and expanding.

Document processing: many business documents contain images, charts, tables, and diagrams alongside text. Multimodal AI can process these documents holistically rather than only extracting the text. A financial report with embedded charts can be analysed with the charts included, rather than the AI working from text alone.

Visual quality control: manufacturing and logistics operations can use AI to inspect images of products or processes, identifying defects or anomalies that would previously have required human visual inspection at scale.

Presentation and content review: Copilot's ability to work with PowerPoint presentations includes understanding the visual content of slides, not just the text.

Customer service: multimodal customer service AI can analyse product photos that customers submit alongside their service queries, enabling more effective problem diagnosis.

Accessibility: AI that can describe images for visually impaired users, or transcribe audio for hearing-impaired users, provides accessibility capabilities at scale.

03What multimodal does not mean

Multimodal AI is not magic. It can identify what is visually present in an image with significant accuracy. It cannot reliably count or precisely measure things in images, it can be fooled by unusual perspectives or lighting, and it makes errors in complex visual scenes.

For business deployments, the same principles apply as for text AI: multimodal outputs in high-stakes contexts require expert verification, and the specific use case should be validated rather than assumed to work based on general capability demonstrations.

Key Takeaways

1.Multimodal AI processes multiple input types (text, images, audio) in a single interaction, enabling qualitatively new business use cases.
2.Business applications include holistic document processing (charts + text), visual quality control, presentation analysis, and multimodal customer service.
3.GPT-4V, Claude 3 vision, and Google Gemini are all multimodal; Microsoft Copilot leverages multimodal capability in its Office integration.
4.Multimodal AI has significant visual capability but is not precise for counting, measuring, or complex visual scenes; validation of specific use cases remains necessary.
5.Multimodal capability is developing rapidly; business use cases that are not yet feasible are becoming feasible with each generation of model improvement.

References & Further Reading

[1]
GPT-4 Technical Report: Multimodal CapabilitiesOpenAI
[2]
Google Gemini: Multimodal Model OverviewGoogle DeepMind

Want to discuss this with an expert?

Book a strategy call to explore how these insights apply to your organisation.

Book a Strategy Call

What Is Multimodal AI? When Your AI Can Read, See, and Hear

01What multimodal means

02Business use cases

03What multimodal does not mean

Key Takeaways

References & Further Reading