While video deepfakes are fascinating and effective, voice deepfakes should not be underestimated. At first glance, audio seems to be just another component that comes along with video. Yet audio and sound often turn out to be the more complex aspects of audiovisual media, and deepfakes are no exception.
Current high-quality deepfake examples involve impersonators who imitate the original voice; these examples are intended to raise awareness. We have not yet seen malicious deepfakes of people of interest at this quality level, but that is likely just a matter of time.
We immediately notice when a voice sounds unnatural or electronic. Amazon Alexa, Apple Siri, the voices in navigation systems, announcements in train stations - they sound fairly good, but still not truly natural like a real human. Apple probably has the longest history of leveraging text-to-speech technologies, going back to the 1980s. Plenty of voices are available for various languages in addition to the well-known Siri voices. However, Apple doesn't offer programming interfaces for developers to generate new voices, and most APIs are only available within Apple's own ecosystems. Google's technologies, on the other hand, are available to developers on all platforms, including cloud services. Google's premium voices are generated by a WaveNet model that is also used for Google Assistant and Google Translate. Google claims its WaveNet technology generates speech that sounds more natural than other text-to-speech systems.
Voice deepfakes don't necessarily need to be completely new creations to serve a specific purpose. It can be enough to exchange a few words to alter a message in a way that gives it a different, even devastating, meaning. Such deepfakes work well for voice because they only synthesize parts of sentences rather than an entire longer speech, which would be harder to fake without glitches. A corresponding video deepfake doesn't need to recreate the entire person either, only the lower face region, to adjust the movement of the mouth.
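The mechanics of such a word swap can be illustrated with plain audio splicing: given the timestamps of the word to replace, cut that span out of the original waveform and insert a synthesized clip. Below is a minimal sketch using Python's standard `wave` module; the function name, file paths, and timestamps are hypothetical placeholders, and a real attack would crossfade at the cut points and match loudness rather than hard-cutting like this:

```python
import wave

def splice_word(original_path, replacement_path, out_path, start_s, end_s):
    """Replace the span [start_s, end_s] (in seconds) of the original
    recording with the replacement clip. All files are assumed to share
    the same sample rate, sample width, and channel count."""
    with wave.open(original_path, "rb") as orig:
        params = orig.getparams()
        rate = orig.getframerate()
        frames = orig.readframes(orig.getnframes())
    with wave.open(replacement_path, "rb") as repl:
        assert repl.getframerate() == rate, "sample rates must match"
        repl_frames = repl.readframes(repl.getnframes())

    frame_bytes = params.sampwidth * params.nchannels  # bytes per frame
    start = int(start_s * rate) * frame_bytes
    end = int(end_s * rate) * frame_bytes

    # Hard cut: audio before the word + replacement + audio after the word.
    spliced = frames[:start] + repl_frames + frames[end:]
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        out.writeframes(spliced)
```

The hard cut above is clearly audible, which is exactly why practical systems combine forced alignment (to find word boundaries) with neural synthesis and smoothing at the seams.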
VoCo was a software tool developed by Adobe that allowed exactly such modifications: changing words and even adding new phrases. A first prototype was presented in 2016, but Adobe later decided not to ship the software due to legal concerns. Meanwhile, alternatives are available.
In 2019, Google released the AI-based speech-to-speech translation technology Translatotron. The system creates synthesized voice translations from the original voice. The first version of Translatotron was also able to generate speech in a different voice, making it easy to create deepfakes. Translatotron 2, released in July 2021, removed the ability to create voice deepfakes, but just as with Adobe VoCo, the potential for abuse of such technology cannot be undone.
Veritone is an ad agency that focuses on synthetic voice reproductions for selected celebrities and influencers as well as voice-over artists. This opens up new business possibilities to monetize official voices at lower cost by avoiding expensive studio recordings. While Veritone focuses on validated and official voices, other services such as Descript Overdub are available to everyone. Creating synthetic voices is possible with or without consent and cannot be restricted.
Recent developments are getting closer to making synthetic voices sound truly natural. Professional recording studios can still produce better quality and can validate that a real person gave consent to record their voice for a specific purpose defined in a contract. But this doesn't prevent misuse, and high quality is not always required to make a deepfake voice sound real.
Low Quality Advantage
Deepfake software will still need some improvements to deliver convincing results in real time for interactive conference calls. Hardware requirements for video deepfakes are extremely high, while voice deepfakes are already possible in real time.
There are scenarios where high-quality deepfakes are not even required. Low-quality deepfakes can be used in combination with simulated bad internet connectivity to make attendees on the other side believe the voice and image distortions are caused by low bandwidth.
Even simpler, a pure voice call with the same excuse about bad connectivity, or with superimposed background noise, can do the trick for voice deepfakes. GSM and landline phones have poor sound quality by design. In such scenarios, the electronic sound characteristics of a deepfake may not be noticeable to most people.
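The masking effect can be demonstrated by deliberately degrading audio toward phone-call characteristics: narrow the bandwidth and coarsen the quantization, so synthesis artifacts blend into what listeners expect from a bad line. The sketch below is a deliberate oversimplification in pure Python (real phone codecs such as G.711 µ-law or GSM-FR do proper filtering and companding); it assumes 16-bit signed mono samples as a plain list of integers:

```python
def degrade_phone_quality(samples, in_rate=44100, out_rate=8000, bits=8):
    """Crudely simulate narrowband phone audio from 16-bit mono samples:
    decimate to a low sample rate, then re-quantize to fewer bits."""
    step = in_rate // out_rate        # naive decimation (no anti-alias filter)
    decimated = samples[::step]
    shift = 16 - bits                 # drop low-order bits to coarsen amplitude
    return [(s >> shift) << shift for s in decimated]
```

Running studio-quality synthetic speech through a chain like this (or simply through an actual GSM call) removes much of the high-frequency detail where telltale synthesis glitches tend to live.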
Imagine lawyers, accountants, business partners or even family members of CEOs receiving deepfake voice calls designed to extract confidential information, or worse. It's time to get prepared...