Researchers have developed a technique to make AI-generated voices, such as digital personal assistants, more expressive, with a minimal amount of training. The method, which translates text to speech, can also be applied to voices that were never part of the system's training set.
The team of computer scientists and electrical engineers from the University of California San Diego presented their work at the ACML 2021 conference, which took place online recently.
In addition to personal assistants for smartphones, homes and cars, the technique could help improve voice-overs in animated movies, automated translation of speech in multiple languages, and more. It could also help create personalized speech interfaces that empower individuals who have lost the ability to speak, similar to the computerized voice that Stephen Hawking used to communicate, but far more expressive.
“We have been working in this area for a fairly long period of time,” said Shehzeen Hussain, a Ph.D. student at the UC San Diego Jacobs School of Engineering and one of the paper’s lead authors. “We wanted to look at the challenge of not just synthesizing speech but of adding expressive meaning to that speech.”
Existing methods fall short in two ways. Some systems can synthesize expressive speech for a specific speaker, but only by using several hours of training data from that speaker. Others can synthesize speech from just a few minutes of data from a speaker never encountered before, but are unable to generate expressive speech and only translate text to speech. By contrast, the technique developed by the UC San Diego team is the only one that can, with minimal training, generate expressive speech for a subject that was not part of its training set.
The researchers flagged the pitch and rhythm of the speech in training samples as a proxy for emotion. This allowed their cloning system to generate expressive speech with minimal training, even for voices it had never encountered before.
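The pitch labeling described above can be illustrated with a toy example. The sketch below is not the authors' code; it is a minimal, hypothetical pitch tracker that estimates a per-frame pitch contour by autocorrelation, one of the simplest proxies for intonation that a cloning system could condition on (real systems use far more robust estimators).

```python
import math

def frame_pitch(signal, sr, frame_len=1024, hop=256, fmin=80.0, fmax=400.0):
    """Estimate a per-frame fundamental frequency (Hz) by autocorrelation.

    A crude stand-in for the pitch contour that an expressive
    voice-cloning model might condition on.
    """
    min_lag = int(sr / fmax)  # shortest period to consider
    max_lag = int(sr / fmin)  # longest period to consider
    pitches = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        mean = sum(frame) / frame_len
        frame = [s - mean for s in frame]  # remove DC offset
        best_lag, best_ac = min_lag, float("-inf")
        for lag in range(min_lag, max_lag):
            # Autocorrelation peaks at lags matching the signal's period.
            ac = sum(frame[i] * frame[i + lag] for i in range(frame_len - lag))
            if ac > best_ac:
                best_lag, best_ac = lag, ac
        pitches.append(sr / best_lag)
    return pitches

# A steady 220 Hz tone stands in for a voiced speech segment.
sr = 16000
tone = [math.sin(2 * math.pi * 220.0 * n / sr) for n in range(sr)]
f0 = frame_pitch(tone, sr)
print(f"median f0 ~ {sorted(f0)[len(f0) // 2]:.1f} Hz")
```

On the synthetic tone the estimated contour stays near 220 Hz; on real speech the contour rises and falls with intonation, which is the expressive signal the model learns to reproduce.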
“We demonstrate that our proposed model can make a new voice express, emote, sing or copy the style of a given reference speech,” the researchers write.
Their method can learn speech directly from text; reconstruct a speech sample from a target speaker; and transfer the pitch and rhythm of speech from a different, expressive speaker into cloned speech for the target speaker.
The team is aware that their work could be used to make deepfake videos and audio clips more accurate and persuasive. As a result, they plan to release their code with a watermark that will identify speech created by their method as cloned.
“Expressive voice cloning would become a threat if you could make natural intonations,” said Paarth Neekhara, the paper’s other lead author and a Ph.D. student in computer science at the Jacobs School. “The more important challenge to address is detection of these media and we will be focusing on that next.”
The technique itself still needs to be improved. It is biased toward English speakers and struggles with speakers who have a strong accent.
Paarth Neekhara et al., "Expressive Neural Voice Cloning," arXiv:2102.00151v1 [cs.SD], arxiv.org/abs/2102.00151
Audio examples: expressivecloning.github.io/
University of California – San Diego
New method to make AI-generated voices more expressive (2022, January 5)
retrieved 5 January 2022
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.