Translation of speech is a long-standing dream of AI. As the technology evolves, it will fundamentally change the way we meet people all around the world. Beyond its interesting applications in the metaverse or tourism, it poses an exciting scientific challenge: how should we solve a chain of related tasks, e.g. automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS)?
ASR + MT = multilingual captioning. Captions are the most popular method for localizing video content. For real-time usage in live streams or on wearable devices, we need blazingly fast inference, preferably with a direct speech-to-text model instead of cascading ASR and MT in two steps. The model should also handle partial or restarted utterances gracefully.
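The two-step cascade above can be sketched as follows. This is a minimal illustration of the data flow only: `transcribe` and `translate` are hypothetical placeholders standing in for real ASR and MT models, not an actual API.

```python
# Cascaded captioning pipeline sketch: ASR -> MT.
# Both functions are hypothetical stand-ins for real models.

def transcribe(audio_chunk: str) -> str:
    """Placeholder ASR: pretend the chunk is already a transcript."""
    return audio_chunk.strip()

def translate(text: str, target_lang: str) -> str:
    """Placeholder MT: tag the text with the target language."""
    return f"[{target_lang}] {text}"

def caption(audio_chunk: str, target_lang: str) -> str:
    # Two-step cascade: ASR errors propagate into MT, and total
    # latency is the sum of both steps -- the motivation for a
    # direct speech-to-text translation model.
    transcript = transcribe(audio_chunk)
    return translate(transcript, target_lang)

print(caption("hello everyone", "de"))  # [de] hello everyone
```

A direct model would replace the two calls inside `caption` with a single speech-to-text translation step, removing one source of latency and error.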
MT + TTS = multilingual narration. Support more languages in audiobooks, audio guides, and games with an AI-driven process. Here, model quality matters more than inference speed: domain adaptation and human-in-the-loop workflows can play a significant role. Another question is how to make use of speech in the original language, whether generated by a TTS model or recorded by a human.
ASR + MT + TTS = multilingual dubbing. Dubbed speech keeps users from being distracted by subtitles and also improves accessibility for blind or illiterate people. For automatic dubbing, solving the three tasks in sequence is vulnerable to error propagation between tasks. But a direct speech-to-speech model remains a major challenge; a comprehensive effort is required in data collection and training.
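The error propagation problem in the three-stage cascade can be made concrete with a small sketch. All functions here are hypothetical placeholders; the simulated ASR mistake ("ship" misheard as "sheep") shows how an upstream error is carried unchanged through MT and TTS into the final dubbed speech.

```python
# Three-stage dubbing cascade sketch: ASR -> MT -> TTS.
# All functions are hypothetical placeholders for real models.

def transcribe(audio: str) -> str:
    # Simulate an ASR error: "ship" misheard as "sheep".
    return audio.replace("ship", "sheep")

def translate(text: str, target_lang: str) -> str:
    """Placeholder MT: tag the text with the target language."""
    return f"[{target_lang}] {text}"

def synthesize(text: str) -> str:
    """Placeholder TTS: return a fake audio token."""
    return f"<audio:{text}>"

def dub(audio: str, target_lang: str) -> str:
    # Each stage consumes the previous stage's output, so any
    # upstream error is baked into the final dubbed speech.
    return synthesize(translate(transcribe(audio), target_lang))

print(dub("the ship arrives", "es"))  # <audio:[es] the sheep arrives>
```

A direct speech-to-speech model would collapse the three calls inside `dub` into one, avoiding the intermediate text representations where such errors accumulate.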