Language is not just a means of communication but also a lens we understand the world through. In multimodal learning, we build a model where language describes a visual scene, a phenomenon, or even a concept. This is beyond the simple conversion of text and involves a high level of abstraction and context handling. Such confluent nature of multimodal learning makes it the cornerstone of general-purpose, strong AI.
Multilingual Vision-Language Models
There have been recent successes of large-scale models in image captioning or generation, but the training data is English-centric. Can we expect the same performance in other low-resource languages? Translating text in the data is a basic solution, but what if the machine translation quality is poor? How about concepts and rhetoric which are specific to a certain language?
Sound Generation from Text
The water is flowing and the birds are singing. Modeling the relation between environmental sound and its description is a step toward another direction of world understanding. We need the right model, right data, and a controllable decoding method to generate variants and mixes of sounds.
For a more complex task like video generation, a multimodal decoder predicts different modalities at the same time, e.g. image-sound, speech-caption, or event-metadata. If we generate each output individually, they might not be aligned with each other. Research questions are how to synchronize multiple modalities during decoding and how to make the model extendable with more modalities.