After the unprofessional deepfake attempt on Ukraine's president Volodymyr Zelensky, a fake version of Kyiv's mayor Vitali Klitschko duped European politicians and officials. But was it also a deepfake?
About 3 months ago, a pre-produced deepfake of president Zelensky was uploaded to a compromised Ukrainian news website. The infiltration of the website looked like the work of professional hackers, but the quality of the uploaded deepfake was amateurish. In the week of June 20 this year, a real-looking Vitali Klitschko joined three conference calls with the mayors of Berlin, Vienna and Madrid. It turned out that the person shown in the video calls was not the real Vitali Klitschko. The incident was quickly labeled a deepfake. Let's take a closer look.
So far, the standard procedure for deepfake creations looks like this:
- Collecting sufficient audiovisual reference content of the deepfake target
- Training of AI software with collected reference content
- Selecting a target video to superimpose the deepfake head and face
- Tweaking parameters based on test runs until satisfied with final result
- Additional video compositing in commercial software might be required
The quality of the deepfake output depends on the quality of the collected content and the experience of the person who operates the AI software. This common production process is part of Vgency's deepfake workshops and awareness building. The process is file-based, with reference files as the source and a deepfake file as the result.
Creating a deepfake as a live video stream in good quality has not worked well so far. However, the technology evolves fast. Over the last months, we have noticed increased activity in the deepfake community to improve the creation of live deepfakes. The training part is still required for live deepfakes. The main difference is that a live stream replaces the target video file onto which the deepfake is superimposed.
On June 28th, the FBI released a public announcement warning about an increase of live deepfakes in job interviews.
Deepfake is one of many special effects (SFX) methods to manipulate audiovisual content. What explains the hype around deepfakes is the use of AI to create real-looking synthetic faces and real-sounding voices. But there are limitations as of today:
- Complex hair structures like long curly hair can be very challenging for AI algorithms resulting in image distortions
- Glasses can cause similar problems depending on the design, reflections, viewing angle and other factors
- Voice only works in English or English with foreign accents
There are ways to overcome or work around the limitations. For example, an actor with similar hair can be recorded in a studio. The deepfake is then applied only to the face, while the hair is that of the real actor. Hair and make-up designers are needed to replicate characteristic hairstyles. The original idea of a quick AI deepfake creation on a PC can quickly turn into a complex movie-like production where human professionals work around the limitations of AI.
As we described in another article, voice deepfakes can be more challenging to create than video. So far, voice deepfakes that sound like the original voice of the target person only work in English. The latest technology supports voice deepfakes in English with foreign accents. Vgency demonstrated this ability for a French enterprise customer by producing the voice of a CxO in English with a French accent. The deepfake voice sounded as real as the real target person. However, even for public figures with lots of reference content available, the training material may still not be sufficient to create a realistic-sounding deepfake version of the desired text. Some phonetic expressions may need specific AI training. There are special text scripts to train the AI to a level where it can create almost any spoken text. But you need the target person to contribute by reading the script and granting permission to record and use it. Therefore, many deepfake productions rely on voice artists to mimic the original voice.
So, what about the fake-Klitschko in the video conference, was it a deepfake?
German investigative journalist Daniel Laufer raises valid questions about whether the fake-Klitschko was indeed a deepfake.
He found the source video on YouTube and was able to identify still images of the video conference that match frames in the YouTube version. He argues that a deepfake would have altered the video frames so that they would no longer match the YouTube version. He concludes that several pre-produced video sequences could have been mixed as a live stream into the video conference feeds. In that scenario, the video may not have been deepfaked at all, and a voice impersonator may have imitated the original voice.
Daniel Laufer provided an excellent analysis without being able to review a recording of the actual video conference. We tend to agree that a live deepfake would not have been optimal. But it would also not have been impossible, given the following conditions:
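The frame-matching argument can be illustrated with a small sketch. It is not Laufer's actual method, which the article does not describe in detail; it only shows the general idea of comparing a still image against candidate reference frames. The function names, the synthetic 16x16 grayscale frames and the threshold of 4.0 are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # grayscale pixels (0-255), row-major


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average absolute pixel difference between two equally sized frames."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("frames must have identical dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def best_match(still: Frame, reference_frames: List[Frame],
               threshold: float = 4.0) -> Optional[Tuple[int, float]]:
    """Return (index, score) of the closest reference frame,
    or None if even the closest one differs by more than the threshold."""
    scores = [mean_abs_diff(still, ref) for ref in reference_frames]
    idx = min(range(len(scores)), key=scores.__getitem__)
    if scores[idx] <= threshold:
        return idx, scores[idx]
    return None


def synthetic_frame(k: int, size: int = 16) -> Frame:
    """Hypothetical stand-in for frame k of the source video."""
    return [[(3 * x + 5 * y + 40 * k) % 256 for x in range(size)]
            for y in range(size)]


reference = [synthetic_frame(k) for k in range(5)]

# A still that is frame 2 plus slight noise, as re-encoding might introduce.
recompressed = [[min(255, p + (1 if (x + y) % 2 == 0 else 0))
                 for x, p in enumerate(row)]
                for y, row in enumerate(reference[2])]

# A still that is frame 2 with a manipulated "mouth" region.
altered = [row[:] for row in reference[2]]
for y in range(10, 14):
    for x in range(4, 12):
        altered[y][x] = 0

print(best_match(recompressed, reference))  # closest frame is index 2
print(best_match(altered, reference))       # no frame within the threshold
```

A whole-frame average like this is deliberately crude. A manipulation limited to a very small region could stay below the threshold, so a real analysis would likely compare regions such as the mouth area separately.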
- There is plenty of public image content available to train the AI software
- There is plenty of public audio content available in English
- Vitali Klitschko has short hair with little complexity
- Vitali Klitschko doesn't wear glasses
- Only one camera is used
- Camera and image are very static
- It's possible to superimpose just deepfake-based lips and mouth
- Deepfake errors are less noticeable in low video conferencing quality
Vgency has already created CxO deepfakes where studio and webcam recordings of the target person were used to superimpose either the deepfake head or only the lips, mouth and the area around the mouth. In our projects, English spoken text was generated by special deepfake voice software without the need for a voice impersonator.
An older alternative to deepfake is «Face Reenactment» where expressions of lips, mouth, jaw, eyebrows and forehead follow the webcam input of a source actor. This older technique is not AI-based and works in real time on normal commodity PCs. Deepfake provides more possibilities but requires the significantly greater computing power of high-end PC workstations.
Technically, it is not difficult to create a deepfake based on the specific Klitschko interview on YouTube by just replacing lips and mouth and keeping the rest of the video unaltered, including all head and body movements. However, facial reenactment in combination with a voice impersonator can do the same in real time with far fewer technical requirements. It's plausible that this was a so-called «cheap fake».
Equally important, such cyberattacks on video conferences are possible with or without a deepfake.
Missing Security Measures
Why was it so easy to arrange video conference calls under a prominent fake identity with the mayors of three European capital cities?
From a technical point of view, local officials don't seem to enjoy high enough standards of IT security. Details about the preparation of the conference calls have not been published, but it's reasonable to assume that a typical phishing scheme was used. Video conferences are arranged via email invites, so the emails and email domains must have appeared real in the eyes of all three mayors' offices. Official pictures of the video conferences in Berlin and Vienna show that the video calls took place via Cisco Webex. The Vienna office even used a Cisco Webex Room 70 device.
With video communication and home office taking off during the COVID pandemic, security breaches started to skyrocket. Home office and video communication can help people to be more productive and to better manage work-life balance. What is lacking is awareness of online and cyber security in every aspect of the digital lifestyle. Naivety and carelessness have been a problem for years, with fake news twisting people's minds on social media. The fake-Klitschko appearing in an official video conference is a new level, fooling public officials who normally warn their voters about fake news.
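Since the details of the invitations have not been published, the following is only a generic sketch of one check an office could run on an invite's sender domain: flag domains that are almost, but not exactly, a trusted one. The domains are placeholders under example.org, and the edit-distance limit of 2 is an assumed value, not an established standard.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify_sender(address: str, trusted_domains: Iterable[str],
                    max_distance: int = 2) -> str:
    """Return 'trusted', 'lookalike' or 'unknown' for a sender address."""
    domain = address.rsplit("@", 1)[-1].lower()
    trusted = [d.lower() for d in trusted_domains]
    if domain in trusted:
        return "trusted"
    if any(edit_distance(domain, d) <= max_distance for d in trusted):
        return "lookalike"
    return "unknown"


# Placeholder domain; a real list would hold verified partner domains.
TRUSTED = ["city.example.org"]

print(classify_sender("mayor@city.example.org", TRUSTED))         # trusted
print(classify_sender("mayor@c1ty.example.org", TRUSTED))         # lookalike
print(classify_sender("someone@unrelated.example.net", TRUSTED))  # unknown
```

A check like this only looks at the visible domain. It does not replace sender authentication at the mail-server level or a confirmation through a second, independent channel.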
Officials need to learn from this incident and better understand the risks of manipulation. Official communication needs to be verifiable.
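As a minimal sketch of what "verifiable" could mean for a meeting invite, the example below attaches a keyed hash (HMAC-SHA256) to the invite so the recipient can detect forged or altered invites. It assumes both offices already share a secret exchanged through a separate trusted channel. The field names and the secret are hypothetical, and a real deployment would more likely rely on public-key signatures.

```python
import hashlib
import hmac
import json


def sign_invite(invite: dict, secret: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON form of the invite."""
    payload = json.dumps(invite, sort_keys=True,
                         separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_invite(invite: dict, tag: str, secret: bytes) -> bool:
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign_invite(invite, secret), tag)


# Hypothetical shared secret and invite fields, for illustration only.
SECRET = b"exchanged-out-of-band"
invite = {
    "organizer": "office@city.example.org",
    "start": "2022-06-24T10:00:00Z",
    "topic": "Video call",
}
tag = sign_invite(invite, SECRET)

tampered = dict(invite, organizer="office@c1ty.example.org")

print(verify_invite(invite, tag, SECRET))    # True
print(verify_invite(tampered, tag, SECRET))  # False
```

The tag only proves that the invite came from someone holding the secret. It says nothing about who later appears on camera, which is a separate verification problem.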
Finally, video conferencing providers have not prioritized deepfake protection in their solutions. There is not a single mechanism available in the market-leading solutions from Microsoft, Zoom or Cisco that helps prevent deepfakes in official video or audio conferencing. The opposite is true: Virtual backgrounds and low video (and audio) quality make it easier for deepfake attackers. Even if the fake-Klitschko was not a deepfake, the video clearly didn't come from a real camera.
Deepfake protection needs to be a requirement for digital communication. Protection features need to be part of every virtual communication solution and service. The hype around the Metaverse needs to take a break. Tech companies need to acknowledge the risks the Metaverse introduces. We cannot be overloaded by virtual avatars, virtual meeting rooms, virtual backgrounds, face and voice enhancements, etc. without keeping control in the real world. Reality needs to be verifiable in the Metaverse.