AI model from OpenAI automatically recognizes speech and translates it into English

A pink waveform on a blue background that poetically suggests sound.

Benj Edwards / Ars Technica

On Wednesday, OpenAI released a new open-source AI model called Whisper that recognizes and transcribes speech with robustness approaching human-level recognition. It can transcribe interviews, podcasts, conversations, and more.

OpenAI trained Whisper on 680,000 hours of audio data and matching transcripts in 98 languages collected from the web. According to OpenAI, this open collection approach has led to “improved robustness to accents, background noise, and technical language.” Whisper can also detect the spoken language and translate it into English.

OpenAI describes Whisper as an encoder-decoder transformer, a type of neural network that can use context gleaned from input data to learn associations that can then be translated into the model’s output. OpenAI presents this overview of Whisper’s operation:

Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
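The 30-second windowing step described above can be sketched in a few lines. This is an illustrative sketch, not the library’s own code: it assumes Whisper’s 16 kHz sample rate, and the function name here is invented for clarity (the package provides a similar helper called pad_or_trim).

```python
# Illustrative sketch of splitting raw audio into the fixed 30-second
# windows Whisper's encoder expects, zero-padding the final short window.
# SAMPLE_RATE matches Whisper's 16 kHz resampling; names are ours, not the API's.

SAMPLE_RATE = 16_000              # Whisper resamples all input audio to 16 kHz
CHUNK_SAMPLES = 30 * SAMPLE_RATE  # 30 seconds of samples per encoder input

def split_into_chunks(samples: list[float]) -> list[list[float]]:
    """Split raw samples into 30-second windows, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))  # pad a short tail chunk
        chunks.append(chunk)
    return chunks
```

Each padded chunk is then converted to a log-Mel spectrogram before being fed to the encoder.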

By open-sourcing Whisper, OpenAI hopes to introduce a new foundational model that others can build on in the future to improve speech processing and accessibility tools. OpenAI has a significant track record on this front. In January 2021, OpenAI released CLIP, an open source computer vision model that arguably ignited the latest era of rapidly advancing image synthesis technology such as DALL-E 2 and Stable Diffusion.

At Ars Technica, we tested Whisper from code available on GitHub and fed it several samples, including a podcast episode and a particularly hard-to-understand section of audio taken from a phone interview. Although it took some time running on a standard Intel desktop CPU (the technology doesn’t work in real time yet), Whisper did a good job of transcribing the audio to text through the Python demo program – far better than any AI-powered audio transcription service we had tried before.
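For readers who want to try a similar test, here is a minimal sketch using the openai-whisper Python package (installable with pip install openai-whisper, plus ffmpeg). The load_model and transcribe calls follow the package’s README; the file path and the format_timestamp helper are our own illustrative additions, and model weights are downloaded on first use.

```python
# Sketch of transcribing one audio file with OpenAI's open-source
# "whisper" package. "interview.mp3" is a placeholder path.

def format_timestamp(seconds: float) -> str:
    """Render a segment start/end time as HH:MM:SS for console output."""
    s = int(seconds)
    return f"{s // 3600:02d}:{(s % 3600) // 60:02d}:{s % 60:02d}"

def transcribe_file(path: str, model_name: str = "base") -> list[str]:
    """Transcribe a file and return '[start -> end] text' lines.

    Requires the whisper package; larger models (small/medium/large)
    are slower on CPU but more accurate.
    """
    import whisper  # deferred import so the helper above works without it

    model = whisper.load_model(model_name)
    # Pass task="translate" to transcribe() to translate non-English
    # speech into English instead of transcribing it verbatim.
    result = model.transcribe(path)
    return [
        f"[{format_timestamp(seg['start'])} -> "
        f"{format_timestamp(seg['end'])}] {seg['text'].strip()}"
        for seg in result["segments"]
    ]
```

Calling transcribe_file("interview.mp3") would print nothing by itself; iterating over its return value yields timestamped lines similar to the demo program’s console output shown below.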

Sample console output from OpenAI's Whisper demo program when transcribing a podcast.

Benj Edwards / Ars Technica

With the correct setup, Whisper could easily be used to transcribe interviews and podcasts, and potentially to translate podcasts produced in non-English languages into English, on your own machine – for free. It’s a potent combination that could ultimately disrupt the transcription industry.

As with almost every major new AI model these days, Whisper brings both benefits and the potential for abuse. On Whisper’s model card (under the “Broader Implications” section), OpenAI warns that Whisper could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes.”
