Extracting audio from a static image may sound like the plot of a science fiction novel, but researchers have made it a reality with the help of artificial intelligence (AI).
A team led by Kevin Fu, a professor of electrical and computer engineering and computer science at Northeastern University, has developed a machine learning tool named Side Eye that recovers information about nearby sound from images.
What it Does
Running Side Eye on a still image lets the researchers determine the gender of a person speaking in the room where the photograph was taken, transcribe the spoken words, and even identify the location. According to TechXplore, the tool also works on muted videos.
Here is what Fu said about Side Eye:
Imagine someone is doing a TikTok video and they mute it and dub music. Have you ever been curious about what they’re really saying? Was it ‘Watermelon watermelon’ or ‘Here’s my password?’ Was somebody speaking behind them? You can actually pick up what is being spoken off camera.
How it Works
Side Eye takes advantage of the image stabilization technology found in nearly all smartphone cameras.
Smartphone cameras incorporate a lens suspension system with springs submerged in liquid, ensuring that photos remain clear and focused even when the photographer has an unsteady hand. These cameras utilize sensors and an electromagnet to counteract any movement by adjusting the lens in the opposite direction, thereby stabilizing the image.
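The counter-movement described above can be pictured with a minimal sketch. It is not how any particular phone implements stabilization; the function name `stabilize`, the `gain` parameter, and the sample values are all assumptions made for illustration.

```python
def stabilize(shake_offsets, gain=1.0):
    """Return the image offset left over after the lens counter-moves.

    shake_offsets: hypothetical camera displacements reported by the motion sensor.
    gain: assumed fraction of each displacement the actuator cancels (1.0 = ideal).
    """
    residuals = []
    for shake in shake_offsets:
        lens_shift = -gain * shake            # drive the lens opposite to the motion
        residuals.append(shake + lens_shift)  # offset that still reaches the image
    return residuals
```

With an ideal gain of 1.0 every offset is cancelled. A lower gain leaves a small residual, which is a rough stand-in for the fact that a spring-mounted lens never tracks motion perfectly.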
When someone speaks close to the camera while a photo is being taken, the sound produces minute vibrations in the springs, which subtly alter the path of the light. Extracting audio frequencies from such small vibrations may seem nearly impossible, but it becomes feasible because most cameras use a rolling shutter, which reads the image out in sequence rather than all at once.
Fu explained the effect this way:
The way cameras work today to reduce cost basically is they don’t scan all pixels of an image simultaneously – they do it one row at a time. [That happens] hundreds of thousands of times in a single photo. What this basically means is you’re able to amplify by over a thousand times how much frequency information you can get, basically the granularity of the audio.
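The arithmetic behind the row-by-row gain can be sketched in a few lines. This is an illustration under stated assumptions, not Side Eye's actual method, which relies on machine learning. The sensor height of 1080 rows, the 30 frames per second, the pure test tone, and every function name are hypothetical, and gaps between frames are ignored.

```python
import math

ROWS_PER_FRAME = 1080    # assumed sensor height in rows
FRAMES_PER_SECOND = 30   # assumed video frame rate


def row_sample_rate(rows=ROWS_PER_FRAME, fps=FRAMES_PER_SECOND):
    """Rows read per second when each row counts as one time sample."""
    return rows * fps


def row_shifts(tone_hz, n_frames=1, rows=ROWS_PER_FRAME, fps=FRAMES_PER_SECOND):
    """Simulate the per-row displacement a pure tone would cause (unit amplitude)."""
    rate = row_sample_rate(rows, fps)
    return [math.sin(2 * math.pi * tone_hz * i / rate) for i in range(rows * n_frames)]


def dominant_frequency(samples, rate):
    """Estimate the tone frequency by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / rate)


rate = row_sample_rate()                  # 32400 row samples per second
gain = rate // FRAMES_PER_SECOND          # 1080 times more samples than one per frame
estimate = dominant_frequency(row_shifts(440, n_frames=30), rate)
```

Sampling once per frame at 30 Hz could only represent frequencies below 15 Hz, far under the range of speech. Treating each row as its own sample raises the rate by a factor equal to the number of rows, which under these assumptions is consistent with the "over a thousand times" figure in the quote, and the simulated 440 Hz tone is recovered to within a few hertz.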
Good or Bad?
Side Eye is still at a rudimentary stage and needs a substantial amount of additional training data to improve, but in the wrong hands a more advanced version of the system could become a significant cybersecurity threat.
However, the technology also has promising uses: an advanced version of Side Eye could serve law enforcement agencies as a tool for gathering digital evidence in criminal investigations.