AI can simulate anyone’s voice with 3 seconds of audio

Get help or discuss anything relating to audio/video software & hardware
Tex
Posts: 1157
Joined: Fri Feb 19, 2021 1:12 am
Location: Texas
Has thanked: 6 times
Been thanked: 586 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by Tex »

The real potential is restoration of low quality sources. Imagine hearing voices from the 1920s to 1940s in modern quality as if recorded yesterday.
Lord Reith
Posts: 4602
Joined: Thu Feb 18, 2021 8:22 am
Location: BBC House
Has thanked: 139 times
Been thanked: 3965 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by Lord Reith »

Golem wrote: Sun Jan 29, 2023 10:19 pm
Yeah that's a very reasonable point, like I don't think AI sentience is really a massive risk here. But we can only hope the power for AI to fight evil advances as fast as its ability to create it. Counter AIs do exist, to be fair: when you have an AI that creates something fake, you need another counter AI that judges whether it's actually any good or not, and so I suppose you could use a counter AI to figure out what's actually real. Plus, as good as the AIs are, they still have tells, and so who knows if "perfect" fake audio/video is even achievable.
Yes, but I think the AI arms race is what will drive it forward, the same way the Apollo missions only happened because of Cold War tensions. As white hats and black hats battle to outwit each other, the tech will get smarter and smarter. I believe that is what will drive it, not Apple trying to find new gimmicks to put in its phones.
Engonoceras wrote: Mon Jan 30, 2023 12:14 am The real potential is restoration of low quality sources. Imagine hearing voices from the 1920s to 1940s in modern quality as if recorded yesterday.
Consider that the difference between what most people would call a "lofi" recording and a "hifi" one is but one octave of extra frequency response - between 5 and 10 kHz. And all of that just consists of overtones, not actual musical notes. The other six octaves of musical information below that are already present in just about any lofi recording. While there are some clever tools for synthesising this missing octave, what we need is software that can intelligently recreate it by referring to a similar-sounding performance. Something similar to the demixing algorithms, but with a different end goal. I would imagine this is not too far-fetched, but there probably isn't the demand for such a thing these days except among sound restorationists. Maybe that is why it hasn't happened yet, or is happening very slowly.
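To put rough numbers on the octave argument, here is a small Python sketch. The 5 kHz and 10 kHz figures come from the post; the function name is just for illustration, and the lower band edge is not stated in the post but simply derived by counting six octaves down from 5 kHz.

```python
import math


def octaves_between(f_low_hz, f_high_hz):
    """Number of octaves spanned going from f_low_hz up to f_high_hz."""
    return math.log2(f_high_hz / f_low_hz)


# The "missing" band from the post: 5 kHz to 10 kHz is one doubling.
top_band_octaves = octaves_between(5_000, 10_000)

# Assumption: six octaves below 5 kHz means halving the frequency six times.
implied_low_edge_hz = 5_000 / 2 ** 6

print(top_band_octaves)      # 1.0
print(implied_low_edge_hz)   # 78.125
```

So "six octaves below 5 kHz" would reach down to roughly 78 Hz, if the count is taken literally.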
Women there don't treat you mean, in Abilene
Golem
Posts: 353
Joined: Fri Feb 26, 2021 5:25 pm
Has thanked: 22 times
Been thanked: 64 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by Golem »

I've always wondered whether an AI could just recreate performances from scratch. We can recreate other instruments digitally, so why couldn't we mimic vocal cords? Not even a deepfake, but a model of the physical muscles that make the sound. I don't know if that's the easiest way, but I've wondered about it since playing this one game a few years ago.
theboxinargentina
Posts: 301
Joined: Fri Oct 08, 2021 4:12 pm
Has thanked: 184 times
Been thanked: 97 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by theboxinargentina »

Lord Reith wrote: Mon Jan 30, 2023 8:18 am Consider that the difference between what most people would call a "lofi" recording and a "hifi" one is but one octave of extra frequency response - between 5 and 10 kHz. And all of that just consists of overtones, not actual musical notes. The other six octaves of musical information below that are already present in just about any lofi recording. While there are some clever tools for synthesising this missing octave, what we need is software that can intelligently recreate it by referring to a similar-sounding performance. Something similar to the demixing algorithms, but with a different end goal. I would imagine this is not too far-fetched, but there probably isn't the demand for such a thing these days except among sound restorationists. Maybe that is why it hasn't happened yet, or is happening very slowly.
Yes, this is something I've imagined, a sort of "up-scaling" of less-than-perfect recordings. It will happen!
tdgrnwld
Posts: 118
Joined: Thu Mar 18, 2021 3:17 pm
Has thanked: 1 time
Been thanked: 1 time

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by tdgrnwld »

theboxinargentina wrote: Mon Jan 30, 2023 3:43 pm
Yes, this is something I've imagined, a sort of "up-scaling" of less-than-perfect recordings. It will happen!
It's a reasonable application of neural networks, but to do it right, it requires a large dataset of paired hi-fi and lo-fi recordings, with the lo-fi half recorded on the same rig you aim to upsample from. I suspect that the lack of such a dataset - due to the cost of producing one - is a major reason we haven't seen these systems already.
Golem
Posts: 353
Joined: Fri Feb 26, 2021 5:25 pm
Has thanked: 22 times
Been thanked: 64 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by Golem »

tdgrnwld wrote: Mon Jan 30, 2023 4:07 pm
It's a reasonable application of neural networks, but to do it right, it requires a large dataset of paired hi-fi and lo-fi recordings, with the lo-fi half recorded on the same rig you aim to upsample from. I suspect that the lack of such a dataset - due to the cost of producing one - is a major reason we haven't seen these systems already.
I mean, you could just get a bunch of high-quality recordings and then convert them to a lower quality.
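As a rough sketch of that idea in Python: take a clean signal, degrade it artificially, and keep the two as a training pair. Everything here is an assumption for illustration. The function names, the 5 kHz cutoff and the noise level are made up, and a one-pole low-pass filter plus hiss is only a crude stand-in for a real vintage recording chain, which is exactly the gap the "same rig" caveat in the quoted post points at.

```python
import math
import random


def sine(freq_hz, sample_rate_hz, n_samples):
    """Generate a unit-amplitude sine tone as a list of floats."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]


def rms(samples):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def degrade(samples, sample_rate_hz, cutoff_hz=5000.0, noise_level=0.01, seed=0):
    """Crude lo-fi simulation: one-pole low-pass filter plus Gaussian noise.

    cutoff_hz and noise_level are illustrative guesses, not measurements
    of any real recording equipment.
    """
    rng = random.Random(seed)
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)  # smoothing coefficient of the RC low-pass
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)                         # low-pass step
        out.append(y + rng.gauss(0.0, noise_level))  # add hiss
    return out


def make_pair(clean, sample_rate_hz):
    """Return (lo_fi_input, hi_fi_target) for supervised training."""
    return degrade(clean, sample_rate_hz), clean
```

For example, `make_pair(sine(440, 44100, 44100), 44100)` gives one second of a degraded tone alongside its clean original. A model trained only on pairs like this learns to undo this particular filter, which may or may not carry over to real 1920s material.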
Lord Reith
Posts: 4602
Joined: Thu Feb 18, 2021 8:22 am
Location: BBC House
Has thanked: 139 times
Been thanked: 3965 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by Lord Reith »

tdgrnwld wrote: Mon Jan 30, 2023 4:07 pm It's a reasonable application of neural networks, but to do it right, it requires a large dataset of paired hi-fi and lo-fi recordings, with the lo-fi half recorded on the same rig you aim to upsample from. I suspect that the lack of such a dataset - due to the cost of producing one - is a major reason we haven't seen these systems already.
I'm sure it could be doable but it doesn't attract the sort of people who could do it. There's no mass market application for it like there is with demixing. Back in the 80s and even 90s there were people working with old audio from the 20s and 30s, but now there is very little interest in that.
Women there don't treat you mean, in Abilene
Ziggy C
Posts: 551
Joined: Thu Oct 14, 2021 12:10 am
Location: Woodland Hills, CA
Has thanked: 96 times
Been thanked: 125 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by Ziggy C »

It would be nice if there were a colorization algorithm that could convert the varying degrees of gray in B/W film into the actual colors. This would certainly save the time of assigning colors based on known information and still photos.

And for that matter, there's the idea I posited back in the 80s: a video recorder that could plug straight into the wall and record the whole cable signal for, say, two hours. That recording could then be converted into AV for every channel broadcast during that span, so we could play it back and select whichever channel we wanted at that time. This business of DVRs only allowing six simultaneous recordings, as we have now, is just a stroke job. It follows my prediction from the 80s, but it's only 6 feeds. Clearly the technology exists.

And what does this have to do with AI? Wait and see.
Tex
Posts: 1157
Joined: Fri Feb 19, 2021 1:12 am
Location: Texas
Has thanked: 6 times
Been thanked: 586 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by Tex »

Even with low-quality recordings you can easily recognize different voices, so what we recognize as a specific voice is not even in the higher frequencies; that's just where the clarity is. The voice pattern is largely found in the middle and lower frequencies.

The test would be how much you can degrade a good voice recording and still reconstruct it digitally with AI to approximate the original.
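One way to score such a test, sketched in Python: compare the reconstruction against the original sample by sample and express the error as a signal-to-noise ratio in dB. This is only one possible yardstick and an assumption on my part. The function name is made up, and a plain SNR says nothing about how natural the result sounds to a listener.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB of `estimate` against `reference`.

    Higher is better; math.inf means the two signals are identical.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_power = sum(r * r for r in reference)
    if signal_power == 0:
        raise ValueError("reference signal is silent")
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / error_power)


original = [1.0, -1.0, 1.0, -1.0]
reconstructed = [0.9, -0.9, 0.9, -0.9]   # every sample is 10% too quiet
print(round(snr_db(original, reconstructed), 3))   # 20.0
```

Running the same measurement at increasing levels of degradation would show where the reconstruction starts to fall apart.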
zaval80
Posts: 390
Joined: Sun Mar 07, 2021 9:19 pm
Has thanked: 23 times
Been thanked: 18 times

Re: AI can simulate anyone’s voice with 3 seconds of audio

Post by zaval80 »

tdgrnwld wrote: Mon Jan 30, 2023 4:07 pm It's a reasonable application of neural networks, but to do it right, it requires a large dataset of paired hi-fi and lo-fi recordings, with the lo-fi half recorded on the same rig you aim to upsample from. I suspect that the lack of such a dataset - due to the cost of producing one - is a major reason we haven't seen these systems already.
It would be kept under lock and key by the governments.