While video deepfakes are fascinating and effective, voice deepfakes should not be underestimated. At first glance, audio always seems to be just another part of video that comes along with it. It then often turns out that audio and sound are the more complex aspects of audiovisual media. Deepfakes are no exception (in 2021).
High-quality deepfake examples (in 2021) involve impersonators who imitate the original voice. Those examples are intended to raise awareness. We have not yet seen voice deepfakes at this quality level (in 2021).
Humans are sensitive to the natural sound of voices. Amazon Alexa, Apple Siri, the voices in navigation systems, announcements in train stations: they all sound acceptable but still not as natural as a real human.
Apple has been using text-to-speech technologies since the 1980s. Plenty of voices are available for various languages in addition to the well-known Siri voices. Apple doesn't offer programming interfaces for developers to generate new voices, and most APIs are only available within Apple's own ecosystem. Google's technologies, on the other hand, are available to developers on all platforms, including cloud services. Google's premium voices are generated by a WaveNet model that is also used for Google Assistant and Google Translate. Google claims that its WaveNet technology generates speech that sounds more natural than other text-to-speech systems.
Voice deepfakes don't necessarily need to be completely new creations to serve a specific purpose. It can be enough to exchange just a few words to alter a message in a way that gives it a different and devastating meaning. Such deepfakes work well on voice because they only synthesize parts of sentences rather than creating an entire longer speech, which would be harder to fake without glitches. A corresponding video deepfake doesn't need to recreate the entire person but only the lower face region to adjust the movement of the mouth.
«VoCo» was software developed by Adobe that allowed voice editing, such as changing words and even adding new phrases. A first prototype was presented as early as 2016, but Adobe later decided not to ship the software due to legal concerns. Meanwhile, alternatives are available.
In 2019, Google released the AI-based speech-to-speech translation technology «Translatotron». The system creates synthesized voice translations from the original voice. The first version of Translatotron was also able to generate speech in a different voice, which made it easy to create deepfakes. Translatotron 2, released in July 2021, removed the ability to create voice deepfakes, but just as with Adobe VoCo, the potential for abuse of such technology cannot be undone.
«Veritone» is an ad agency that focuses on synthetic voice reproductions for selected celebrities and influencers as well as voice-over artists. This opens up new business possibilities to monetize official voices at lower cost by avoiding expensive studio recordings. While Veritone focuses on validated and official voices, other services such as «Descript Overdub» are available to everyone.
Recent developments are getting closer to making synthetic voices sound truly natural. Professional recording studios can still produce better quality and can validate that a real person gave consent to record their real voice for a specific purpose defined in a contract. This doesn't prevent misuse, and high quality is not always required to make a deepfake voice sound real.
Low Quality Advantage
Deepfake software still needs some improvements to deliver convincing results in real time, e.g. in interactive conference calls. Hardware requirements for video deepfakes are high (in 2021), while voice deepfakes are already possible in real time.
There are scenarios where high-quality deepfakes are not even required. Low-quality deepfakes can be used in combination with simulated bad internet connectivity to make participants on the other end believe the voice and image distortions are caused by low bandwidth. Even simpler, a pure voice call with the same excuse about bad connectivity, or with superimposed background noise, can do the trick for voice deepfakes. Cellular and landline phones have bad sound quality by design. In such scenarios, the electronic sound characteristics of a deepfake may not be noticeable to most people.
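To illustrate why a phone line can hide such artifacts, here is a minimal sketch of a narrowband voice channel in plain Python. It rests on assumptions that are not part of the text above: classic telephony is taken to carry speech only up to roughly 3400 Hz at an 8 kHz sampling rate, two sine tones stand in for a voice recording, and all function names (`simulate_phone_channel`, `tone_power` and so on) are hypothetical helpers written for this example. The sketch low-pass filters a 16 kHz signal, halves the sampling rate, and adds a little line noise.

```python
import math
import random


def tone(freq_hz, sample_rate, n_samples, amplitude=1.0):
    """Generate a sine tone as a list of floats."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def windowed_sinc_lowpass(cutoff_hz, sample_rate, num_taps=101):
    """FIR low-pass (Hamming-windowed sinc), normalised to unity gain at DC."""
    fc = cutoff_hz / sample_rate          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]


def fir_filter(signal, taps):
    """Direct-form convolution; samples before the start are treated as zero."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i - k
            if j >= 0:
                acc += t * signal[j]
        out.append(acc)
    return out


def simulate_phone_channel(signal, sample_rate=16000, cutoff_hz=3400,
                           noise_level=0.01, seed=0):
    """Assumed narrowband channel: low-pass, halve the sample rate, add noise."""
    rng = random.Random(seed)
    taps = windowed_sinc_lowpass(cutoff_hz, sample_rate)
    filtered = fir_filter(signal, taps)
    narrowband = filtered[::2]            # keep every second sample
    noisy = [s + rng.gauss(0.0, noise_level) for s in narrowband]
    return noisy, sample_rate // 2


def tone_power(signal, freq_hz, sample_rate):
    """Power of one frequency component (single-bin DFT correlation)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, s in enumerate(signal))
    return (re * re + im * im) / (n * n)


if __name__ == "__main__":
    sr, n = 16000, 1600
    # 1 kHz stands in for the voice body, 6 kHz for fine high-frequency detail.
    mix = [a + b for a, b in zip(tone(1000, sr, n), tone(6000, sr, n))]
    out, out_sr = simulate_phone_channel(mix, sample_rate=sr)
    print("6 kHz power before channel:", tone_power(mix, 6000, sr))
    print("1 kHz power after channel: ", tone_power(out, 1000, out_sr))
    # After halving the rate, any 6 kHz remainder would show up at 2 kHz.
    print("2 kHz power after channel: ", tone_power(out, 2000, out_sr))
```

In this toy setup the high-frequency tone is largely removed while the low-frequency tone passes through, which is the effect the paragraph above relies on: detail that might reveal a synthetic voice never reaches the listener. Real phone networks additionally apply speech codecs and packet loss concealment, which this sketch does not model.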