Google wants to make it possible to speak in a language you hardly know. The company has presented a first-of-its-kind translation model to make speaking another language simpler and easier. The search giant announced its first direct speech-to-speech translation system, named “Translatotron”, which can convert speech from one language directly into another while maintaining the voice and cadence of the speaker.
Translatotron is based on a sequence-to-sequence network. It takes source spectrograms (visual representations of the spectrum of frequencies in the audio) as input and generates spectrograms of the translated content in the target language. Most translation systems split the job into three parts: the first turns the speech into text, the second translates that text into another language, and the third turns the translated text back into speech.
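The three-stage cascade described above can be sketched with stub functions. This is a minimal illustration of the pipeline's structure only; the function names and string placeholders are hypothetical and do not correspond to any real Google API.

```python
# Illustrative sketch of a cascaded speech-to-speech translation pipeline.
# Every function here is a hypothetical stand-in, not a real library call.

def speech_to_text(spectrogram: str) -> str:
    """Stage 1: automatic speech recognition (stub)."""
    return f"text({spectrogram})"

def translate_text(text: str, target_lang: str) -> str:
    """Stage 2: machine translation of the recognized text (stub)."""
    return f"{target_lang}:{text}"

def text_to_speech(text: str) -> str:
    """Stage 3: speech synthesis of the translated text (stub)."""
    return f"audio({text})"

def cascaded_translate(spectrogram: str, target_lang: str) -> str:
    # Each stage consumes the previous stage's output, so recognition
    # errors propagate into translation and synthesis ("compounding
    # errors"), and the synthesized voice bears no relation to the
    # original speaker.
    text = speech_to_text(spectrogram)
    translated = translate_text(text, target_lang)
    return text_to_speech(translated)

print(cascaded_translate("es_speech", "en"))
# → audio(en:text(es_speech))
```

The nesting of the output string mirrors how each stage's errors are baked into the next stage's input, which is the weakness the direct approach targets.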
The synthesized speech often comes out in a different voice from the original speaker. Google's new system, by contrast, avoids dividing the task into separate stages. Software engineers Ye Jia and Ron Weiss explained in a Google blog post that this gives a few advantages over cascaded systems, including faster inference speed, naturally avoiding compounding errors between recognition and translation, making it straightforward to retain the voice of the original speaker after translation, and better handling of words that do not need to be translated.
The new tool makes use of two other, separately trained components. The first is a neural vocoder that converts output spectrograms to time-domain waveforms. The second is an optional speaker encoder, which works to maintain the speaker's voice in the synthesized translated speech. Google's AI researchers believe that an end-to-end system can make the task easier by removing the middle man, and that further research in this area will feed into Google's future AI-powered translation systems.
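The direct approach, with its vocoder and optional speaker encoder, can be sketched the same way. Again, all names and placeholder strings below are hypothetical illustrations of the architecture described in the article, not Translatotron's actual code.

```python
# Illustrative sketch of a direct (Translatotron-style) pipeline.
# All functions are hypothetical stubs for the components the article names.

def speaker_encoder(reference_audio: str) -> str:
    """Optional component: embed the source speaker's voice (stub)."""
    return f"voice({reference_audio})"

def seq2seq_translate(spectrogram: str, speaker_embedding=None) -> str:
    """Sequence-to-sequence model: source spectrogram -> target-language
    spectrogram, optionally conditioned on a speaker embedding (stub)."""
    tag = f"+{speaker_embedding}" if speaker_embedding else ""
    return f"target_spec({spectrogram}{tag})"

def vocoder(spectrogram: str) -> str:
    """Neural vocoder: output spectrogram -> time-domain waveform (stub)."""
    return f"waveform({spectrogram})"

def direct_translate(spectrogram: str, keep_voice: bool = False) -> str:
    # No intermediate text: one model maps spectrogram to spectrogram,
    # so there are no recognition/translation stages to compound errors.
    embedding = speaker_encoder(spectrogram) if keep_voice else None
    target_spec = seq2seq_translate(spectrogram, embedding)
    return vocoder(target_spec)

print(direct_translate("es_speech", keep_voice=True))
# → waveform(target_spec(es_speech+voice(es_speech)))
```

Note that, unlike the cascade, there is no text representation anywhere in the flow, and the speaker embedding is simply an extra conditioning input rather than a separate pipeline stage.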