Lip-syncing AI lets you put words in anyone's mouth

Researchers have developed new algorithms that can turn audio clips into realistic, lip-synced videos of the person seeming to say those words. 

The team generated highly realistic videos of former president Barack Obama talking about terrorism, fatherhood, job creation and other topics, using audio clips, weekly video addresses and even clips from when he attended Harvard Law School that were originally on a different topic.

The system works by converting audio files of an individual's speech into realistic mouth shapes - which are then grafted onto the head of that person from another existing video. 

The study, conducted by researchers at the University of Washington, explained how the team created realistic videos of Obama talking about different topics. 

'These type of results have never been shown before,' said Dr Ira Kemelmacher-Shlizerman, an assistant professor at the UW’s Paul G. Allen School of Computer Science & Engineering. 

'Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. 

'This is the kind of breakthrough that will help enable those next steps.'

APPLICATIONS OF LIP-SYNCING

According to researchers at the University of Washington, realistic audio-to-video conversion has practical applications, such as:

  • Improving video conferencing for meetings
  • Holding a conversation with a historical figure in virtual reality by creating visuals just from audio
  • Allowing chat tools such as Skype or Messenger to collect videos that could be used to train computer models
  • Helping end video chats that time out abruptly due to poor connectivity
  • Reversing the process - feeding video into the system instead of audio - to develop algorithms that could detect whether a video is real or manufactured

The research team chose to work with videos of Obama because the machine learning system needs video of the person to learn from, and there is an abundance of videos of the former president.

'In the future, video chat tools like Skype or Messenger will enable anyone to collect videos that could be used to train computer models,' Dr Kemelmacher-Shlizerman said.

Because streaming audio over the internet takes up less bandwidth than video, the system could help end video chats that time out abruptly due to poor connectivity. 

'When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good,' said co-author and Allen School professor Dr Steve Seitz. 

'So if you could use the audio to produce much higher-quality video, that would be terrific.'

The researchers say they could also reverse the process - feed video into the system instead of audio - to develop algorithms that could detect whether a video is real or manufactured.

According to the researchers, the new machine learning system makes significant progress in overcoming the 'uncanny valley' problem, which has been a hurdle in creating realistic video from audio.

When manufactured human likenesses appear to be almost real but still miss the mark, people find them creepy or off-putting.

'People are particularly sensitive to any areas of your mouth that don’t look realistic,' said lead author Dr Supasorn Suwajanakorn, a recent PhD graduate of the Allen School.

HOW IT WORKS

The lip-syncing system, developed by researchers at the University of Washington (UW), works by converting audio files of an individual's speech into realistic mouth shapes.

These are then grafted onto the head of that person from another existing video.

The first step involved training a neural network to watch videos of a person, and translate different sounds into basic mouth shapes.

Then, by building on research from UW's Graphics and Image Laboratory team with a new mouth synthesis technique, the researchers were able to superimpose and blend realistic mouth shapes and textures on an existing video of that person.

Another important aspect of the technique was to allow a small time shift so that the neural network can anticipate what the speaker is going to say next.
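
The article does not say how the time shift is implemented or how long it is. One common way to give a sequence model this kind of look-ahead is to pair the mouth shape at one moment with audio from slightly later, so the model has already 'heard' a little of what comes next. The sketch below illustrates only that pairing; the function name, the `delay` parameter and the toy frame labels are assumptions made for illustration, not details of the researchers' system.

```python
def make_delayed_pairs(audio_frames, mouth_shapes, delay):
    """Pair the audio frame at index t + delay with the mouth shape at index t.

    Hypothetical helper: `delay` is the look-ahead in frames. Its real
    value is not given in the article.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(audio_frames) != len(mouth_shapes):
        raise ValueError("sequences must have the same length")
    usable = max(len(audio_frames) - delay, 0)
    return [(audio_frames[t + delay], mouth_shapes[t]) for t in range(usable)]


# Toy example: five frames of audio and five mouth shapes, look-ahead of two frames.
audio = ["a0", "a1", "a2", "a3", "a4"]
mouths = ["m0", "m1", "m2", "m3", "m4"]
pairs = make_delayed_pairs(audio, mouths, delay=2)
# pairs == [("a2", "m0"), ("a3", "m1"), ("a4", "m2")]
```

The last `delay` mouth shapes have no later audio to pair with, so they are dropped in this sketch.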

'If you don’t render teeth right or the chin moves at the wrong time, people can spot it right away and it’s going to look fake. 

'So you have to render the mouth region perfectly to get beyond the uncanny valley.'

In previous research into audio-to-video conversion, processes have involved filming multiple people in a studio saying the same sentences over and over to attempt to capture how a particular sound correlates to different mouth shapes.

However, this process is expensive, tedious and time-consuming. 

By contrast, Dr Suwajanakorn developed algorithms that can learn from videos that exist 'in the wild' on the internet or elsewhere.

'There are millions of hours of video that already exist from interviews, video chats, movies, television programs and other sources.

'And these deep learning algorithms are very data hungry, so it’s a good match to do it this way,' Suwajanakorn said.

Instead of creating a video directly from audio, the research team carried out two steps. 

A neural network first converts the sounds from an audio file into basic mouth shapes. Then the system grafts and blends those mouth shapes onto an existing target video and adjusts the timing to create a new realistic, lip-synced video.
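
As a rough illustration of that two-step structure, the toy sketch below maps each audio frame to a 'mouth openness' value and then alpha-blends a mouth patch onto the matching frame of a target video. Everything in it is an assumption made for illustration: the loudness-to-openness rule stands in for the trained neural network, the flat grey patch stands in for synthesized mouth texture, and none of the function names come from the researchers' work.

```python
def audio_to_mouth_open(loudness):
    """Step 1 stand-in: map loudness in [0, 1] to mouth openness in [0, 1].

    The real system uses a trained neural network here; this clamp is a placeholder.
    """
    return max(0.0, min(1.0, loudness))


def render_mouth_patch(openness, height=2, width=2):
    """Toy mouth texture: a flat grey patch, brighter for a wider-open mouth."""
    value = round(255 * openness)
    return [[value] * width for _ in range(height)]


def blend_patch(frame, patch, top, left, alpha):
    """Step 2: alpha-blend `patch` onto a copy of `frame` at (top, left)."""
    out = [row[:] for row in frame]  # leave the original frame untouched
    for r, patch_row in enumerate(patch):
        for c, pixel in enumerate(patch_row):
            old = out[top + r][left + c]
            out[top + r][left + c] = round(alpha * pixel + (1 - alpha) * old)
    return out


def lip_sync(loudness_track, target_frames, top, left, alpha=0.5):
    """Run both steps for each (audio frame, video frame) pair."""
    return [
        blend_patch(frame, render_mouth_patch(audio_to_mouth_open(loud)), top, left, alpha)
        for loud, frame in zip(loudness_track, target_frames)
    ]


# Two 4x4 greyscale frames of an existing 'target video', all pixels set to 7.
frames = [[[7] * 4 for _ in range(4)] for _ in range(2)]
result = lip_sync([1.0, 0.0], frames, top=1, left=1, alpha=1.0)
```

With `alpha=1.0` the patch fully replaces the mouth region; values between 0 and 1 mix the synthesized patch with the original pixels, which is the simplest form of the blending the article describes.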

At the moment, the system is designed to learn on one individual at a time, meaning that Obama’s voice - words he actually uttered - is the only information used to 'drive' the manufactured video.

The research team consciously decided against going down the path of putting other people’s words into someone’s mouth. 'We’re simply taking real words that someone spoke and turning them into realistic video of that individual,' said Dr Steve Seitz, a co-author of the research.

However, in the future, the researchers want to improve the algorithms so they can generalize across situations and recognize a person’s voice and speech patterns with less data - for example, just an hour of video to learn from.

'You can’t just take anyone’s voice and turn it into an Obama video,' Dr Seitz said.

'We very consciously decided against going down the path of putting other people’s words into someone’s mouth. 

'We’re simply taking real words that someone spoke and turning them into realistic video of that individual.' 
