jfoster 8d
It seems like OpenAI are finally living up to their name for once with this release? Anything I'm missing?

From what I can gather:

1. Includes model weights. I can't find the URL, but they reference them enough and have a CLI tool, so I presume I just haven't found them yet.

2. Includes code:

3. Released under MIT License:

wongarsu 8d
> About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.

That's intriguing. You can just set the model to transcribe everything into English, no matter which language the speaker is using, and it just works. Given that many people are much better at understanding English than at speaking it, this might make voice interfaces much more accessible without much work.
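
For anyone wanting to try that, here's a minimal sketch of the two decoding modes, assuming the openai-whisper pip package; the model size and audio filename are illustrative, not from the announcement:

```python
def decoding_options(to_english: bool) -> dict:
    # Whisper's transcribe() accepts task="transcribe" (text in the spoken
    # language, the default) or task="translate" (English text, whatever
    # language the speaker used).
    return {"task": "translate" if to_english else "transcribe"}

# With the package installed it would be used roughly like this
# (commented out because it downloads a model checkpoint):
#   import whisper
#   model = whisper.load_model("small")   # the *.en checkpoints cannot translate
#   result = model.transcribe("interview_de.mp3", **decoding_options(True))
#   print(result["text"])
```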

pen2l 8d
Neat. They have open-sourced it, including the model weights, so they are living up to their name in this instance.

The 4 examples are stunningly good (the examples have speakers with heavy accents, foreign-language speech, dynamic background noise, etc.); this is far and away better than anything else I've seen. I'll be super curious to see other folks try it out and whether it's as robust as it seems.

I think it's fair to say that AI transcription accuracy is now decidedly superior to the average human's. What the implications of this are, I'm not sure.

gok 8d
Comparing this model's word error rates to the state of the art [1] on a few common test sets:

                           Whisper    SoTA
  LibriSpeech test-clean      2.7%     1.8%
  LibriSpeech test-other      5.6%     2.9%
  Switchboard                13.1%     4.9%
  CallHome                   15.8%     9.5%
The authors do explicitly state that they're trying to do a lot of fancy new things here, like being multilingual, rather than pursuing accuracy alone.


StevenWaterman 8d
That example at the top of the page (speed talking) blew me away. He started talking, I was stunned for a minute, then realised yes, it really was English, and I just burst out laughing.

That's so, so far beyond the previous state-of-the-art, it's absurd.

dindindin 8d
I'm not in the Speech Recognition circles and am looking for open source speech recognition I can play around with - would this be the new state of the art?
IceWreck 8d
Is there a list of system requirements somewhere? Can it run on cheaper, low-memory GPUs? Maybe CPUs?
localy 8d
Are there any published benchmarks outlining how this compares to other open-source ASR software, such as
Simorgh 8d
I’ve been experimenting with voice-interfaces where typing is replaced by talking, but I find it hard to transition users to voice - we ‘seem’ to prefer typing to talking.

I wonder if this will change.

jdmoreira 8d
Looking forward to seeing if this works well with foreign accents
TaylorAlexander 8d
Hey this looks great! I like to record audio notes while driving in my car after work, to kind of decompress my thoughts from the day. But I never go back and listen as they can be long and meandering. Sometimes in the audio log I will sum up my thoughts, but this might be 20 minutes in and hard to find. I really wish I had transcriptions so I could easily scan the full contents. I have tried Mozilla Deepspeech (I don't want a cloud solution) and I was surprised to find that I could not get Deepspeech to reliably transcribe them. There is a bit of road noise, though I think for a human listener they are easy to understand. It looks like this one might actually do the trick!

EDIT: Tried it and it worked great! It is very easy to use: I just ran the pip install line in the readme, then ran the program in the format "whisper my_audio.wav", and it went. Really nice job OpenAI!
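
For long, meandering notes like these, Whisper's Python API also returns per-segment timestamps, which makes a recording scannable. A small sketch; the segment dicts are mocked here to match the shape of what model.transcribe() returns:

```python
def to_timestamped_lines(segments):
    # Each Whisper segment dict carries "start" (seconds) and "text";
    # render them as "[MM:SS] text" lines for eyeball scanning.
    lines = []
    for seg in segments:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['text'].strip()}")
    return "\n".join(lines)
```

With the real package, segments would come from model.transcribe("my_audio.wav")["segments"].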

andy_xor_andrew 8d
Hold on, it does not only speech recognition, but also language translation, in the same model?

What an interesting approach. What benefits does this have over having two dedicated models, one for speech-to-text, and another for translation?

It just seems so odd, given that the problems of speech-to-text and Spanish-to-English translation seem so different from one another (in terms of problem domain). Seems so unusual to have both handled by one model!

Does knowledge of speech-to-text carry over into knowledge of translation? Does knowledge of translation carry over into knowledge of speech-to-text? So weird.

liminalsunset 8d
I really wish I had this about half a year ago when I was building a tool to automatically turn online school lectures into searchable, clickable transcripts (kind of like YouTube or EdX transcripts).

I was originally using Adobe Premiere Pro's speech-to-text to do it, and wrote Python to convert its output to the Hyperaudio format on GitHub. With this I can skip that step entirely, and it's fully open source, too.

App idea:

Build an app that takes a video and uses Hyperaudio or a similar project to add a clickable and searchable transcript (clicking in transcript seeks video)
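
A sketch of the core conversion such an app would need: turning Whisper's segment list into WebVTT, which an HTML5 <track> element can render as clickable, seekable captions. The segment dicts here are mocked; real ones come from model.transcribe():

```python
def to_vtt(segments):
    # Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
    def ts(t):
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    cues = ["WEBVTT", ""]
    for seg in segments:
        cues.append(f"{ts(seg['start'])} --> {ts(seg['end'])}")
        cues.append(seg["text"].strip())
        cues.append("")  # blank line terminates each cue
    return "\n".join(cues)
```

Clicking a cue in a transcript UI then just means seeking the video element to that cue's start time.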

zeagle 8d
It would be exceptional to get a healthy competitor to Microsoft/Nuance's Dragon monopoly on voice recognition in healthcare. At a couple thousand bucks a license, plus the more recent SaaS subscription trend, there is a lot of money to be made in that space.
darkpicnic 8d
I just wrote a script with Hazel to automatically transcribe my voice notes to txt. It handles punctuation extremely well. What a wonderful contribution!
nicholasjarnold 8d
This is so cool! I was just speaking to a non-technical family member about privacy concerns around using "OK Google" and the like. They responded inquiring about "private" alternatives, to which my answer was "I'm not aware of good ones that give you that level of accuracy and convenience."

Perhaps this development along with continued optimization and device compute power increases will lead us into a near-future where things like Mycroft devices and cellphones could have local-only speech-to-text and translation capabilities which are accurate even with environmental background noise variations encountered IRL.

Great work OpenAI team!

The5thElephant 8d
How is it that Apple, Google, or Microsoft are not further ahead of the game on speech recognition like this? They have the resources to hire the best ML researchers and throw tons of computing hours at the problem, yet Siri, Google Assistant, and Cortana continue to struggle to get anywhere near this level of comprehension.
mmh0000 8d
Okay this is super impressive. I just downloaded Whisper and fed it a random flac file I had handy and it did a really good job. Also impressive that it works on my weak CPU:

A 3m07s flac took 5m to transcribe:

  $ whisper --device cpu 'BLACKPINK - BORN PINK/01 Pink Venom.flac'
  Detecting language using up to the first 30 seconds. Use `--language` to specify the language
  Detected language: korean
  [00:00.000 --> 00:10.000]  Blackpink
  [00:11.000 --> 00:14.000]  Kick in the door, wave in the coco
  [00:14.000 --> 00:16.000]  팝콘이는 친게 껴들 생각 말고
  [00:16.000 --> 00:19.000]  I talk to talk, run ways I walk walk
  [00:19.000 --> 00:21.000]  힘 감고 팝 팝 안 봐도 척
  [00:21.000 --> 00:24.000]  By one and two by two
  [00:24.000 --> 00:26.000]  내 손끝 두 하나에 타면 아지은 중
  [00:26.000 --> 00:30.000]  갓 자쇼 지금 화려해 T makes no sense
  [00:30.000 --> 00:32.000]  You couldn't get a dollar out of me
  [00:33.000 --> 00:38.000]  자 오늘 밤이야 눈톱을 품고
  [00:38.000 --> 00:41.000]  미혼을 뺏음 down
  [00:41.000 --> 00:43.000]  Look what you made us do
  [00:43.000 --> 00:47.000]  천천히 널 잠재울 파이어
  [00:48.000 --> 00:52.000]  잠이 날 만큼 아름다워
  [00:52.000 --> 00:53.000]  I bring the pain like
  [00:53.000 --> 00:57.000]  디스탑, 팽팽, 디스탑, 팽팽, 디스탑, 팽팽, 팽팽
  [00:57.000 --> 00:58.000]  Get em, get em, get em
  [00:58.000 --> 01:00.000]  Straight till you don't like
  [01:00.000 --> 01:01.000]  Whoa, whoa, whoa
  [01:01.000 --> 01:03.000]  Straight till you don't like
  [01:03.000 --> 01:04.000]  Ah, ah, ah
  [01:04.000 --> 01:05.000]  Taste that, pink venom
  [01:05.000 --> 01:06.000]  Taste that, pink venom
  [01:06.000 --> 01:08.000]  Taste that, pink venom
  [01:08.000 --> 01:09.000]  Get em, get em, get em
  [01:09.000 --> 01:11.000]  Straight till you don't like
  [01:11.000 --> 01:12.000]  Whoa, whoa, whoa
  [01:12.000 --> 01:13.000]  Straight till you don't like
  [01:13.000 --> 01:14.000]  Ah, ah, ah
  [01:14.000 --> 01:15.000]  Blackpink and Amo
  [01:15.000 --> 01:17.000]  Got it by the smack ram
  [01:17.000 --> 01:18.000]  But rest in peace
  [01:18.000 --> 01:19.000]  Please light up a candle
  [01:19.000 --> 01:20.000]  This the knife of a vando
  [01:20.000 --> 01:22.000]  Messed up and I'm still in saline
aidenn0 8d
For those on NixOS, here's a quick and dirty flake.nix that will let you make a venv in which to "pip install".

Just put it in a flake.nix, run "nix develop", and then "virtualenv ./venv; . ./venv/bin/activate; pip install git+"

  {
    description = "Python 3.9 development environment";
    outputs = { self, nixpkgs }:
      let
        system = "x86_64-linux";
        pkgs = import nixpkgs { inherit system; };
      in {
        devShells.${system}.default = pkgs.mkShell {
          # buildInputs list was cut off in the original post; these are the likely contents
          buildInputs = [ pkgs.python39 pkgs.python39Packages.virtualenv pkgs.ffmpeg ];
        };
      };
  }
isoprophlex 8d
Really incredible to see that their multilingual audio-to-English approach is viable. I'm super excited about this, and it's great to see OpenAI actually opening something up, for once.

Skimming the codebase I can't immediately see code to do additional training.

Being able to fine-tune the model to a specific language or use case (e.g. teach it a technical topic that isn't well represented in the current training set) would be majorly disruptive to the current SOTA in "callcenter analytics" tech. Especially when combining Whisper with GPT-3.

O__________O 8d
Anyone know if it is possible to output IPA (the International Phonetic Alphabet) using this?


throwamon 8d
Is it feasible to use this for Talon-like voice-driven computer usage?
eatsyourtacos 8d
Can this be used as a real-time transcription or is it too slow for that?

Curious what anyone is using these days for a real-time transcription. It doesn't have to be perfect, but just good enough.

My kids watch some YouTube videos where people make a mod that converts their speech to text, then looks for keywords and spawns a boss in Terraria if you say the wrong keyword, etc.

I made a clone of that with the .NET System.Speech.Recognition library. It... works... but my biggest problems are that #1 it waits until you are done speaking before firing the callback with text, so there was too much delay for it to be fun (the point is that it should be checking a stream of chatter), and #2 the recognition is pretty crap; it's nearly good enough for my silly purpose but still pretty bad.
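
The keyword-spotting half of a mod like that is independent of whichever recognizer produces the text; a sketch (whole-word matching, all names made up) that could consume chunks of Whisper or System.Speech output as they arrive:

```python
def spot_keywords(text, keywords):
    # Whole-word match against one chunk of transcript; returns the
    # keywords that fired, so the game loop can trigger events.
    words = set(text.lower().split())
    return [kw for kw in keywords if kw.lower() in words]
```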

petercooper 8d
Just tested this on some developer podcasts which usually fail hard given they're full of technical jargon, brand names, etc. Whisper is a revolution! It's picking up terms like Heroku, DigitalOcean, GitHub, ECS, AWS, etc. and capitalizing properly - something nothing else did unless you provided a whole pile of guiding vocabulary.
dom96 8d
This really makes me want to build an Amazon Echo/Google Nest/etc. replacement that's open hardware, open source, and most importantly recognises voice completely offline. I find that I don't use these smart devices for much more than setting timers anyway, so this seems like an easy project.

I just wonder what system requirements Whisper has and whether there are open source voice recognition models that are specifically built for embedded devices.

O__________O 8d
Wish OpenAI would:

- Open source the risk analyses they have done on all their models, and provide clarity on the specific safety or business reasoning for why they open source some projects and not others.

- Where safety gaps exist, develop benchmarks that, if addressed by others, would result in the code being reviewed for release.

- Provide a way to estimate the real cost to reproduce a model from raw data & code, using an existing automated build, hardware assumptions, etc.

- If possible, roadmaps of planned research; realize this is a HUGE ask, possibly resulting in unnecessarily being beaten to a goal, but ultimately if the goal is what matters, to me this makes sense.


Basically, be as open as possible. If I missed something, or my reasoning is flawed, let me know; no big deal. Just know that real progress will make a real difference.

adeptima 8d
Japanese results look pretty impressive!

Took マッコウクジラ14頭が海岸に打ち上げられる オーストラリア(2022年9月21日) ("14 sperm whales wash ashore in Australia", Sept 21, 2022)

Extracted audio with youtube-dl -f bestaudio\?v\=bZkNIzeRBk4

Converted into:

  [00:00.000 --> 00:13.000] オーストラリア南部の島で、真っ向くじら14棟が海岸に打ち上げられて死んでいるのが見つかり、専門家が調査のため原地入りしました。
  [00:13.000 --> 00:25.000] 原地メディアによりますと、オーストラリア南部のキング棟で、19日、少なくとも14棟の真っ向くじらが海岸に打ち上げられて死んでいるのが見つかりました。
  [00:25.000 --> 00:31.000] ほとんどが若いオーストを見られ、専門家が現場に重むき調査に当たっています。
  [00:31.000 --> 00:41.000] くじらの死害は大きく運んだり埋めたりすることが難しいため、自然に分解されるのを待つ方針が検討されています。
  [00:41.000 --> 00:52.000] また、死害を狙い、サメが海に集まる可能性があるとして、原地東局はサーファーなどに周囲に近づかないように呼びかけています。
  [00:52.000 --> 01:02.000] 一方、21日にはタスマニア棟でおよそ230棟のくじらが浜辺に打ち上げられた状態で見つかりました。
  [01:02.000 --> 01:07.000] およそ半数がまだ生きている模様で急助活動が進められています。
  [01:07.000 --> 01:23.000] 見つかったのは、ゴンドーくじらの仲間と見られています。

(Whisper's own mis-transcriptions, e.g. 真っ向くじら for マッコウクジラ, are preserved as output.)

Jnr 8d

I am one of the top contributors to the tiny Mozilla Common Voice dataset for my language. The dataset is very small compared to those for other popular languages, and none of the other datasets mentioned contribute anything in that language to Whisper's training.

And even with so little data to train on, it still works surprisingly well.

no1youknowz 8d
This is awesome. But I really want the other way.

To be able to give it text and hear speech: TTS (text-to-speech).

As a language learner, the ability to create my own sentences (based on existing ones I have, changing a word here or there) would be amazing.

How long till we have this, I wonder. I know I could use a cloud service currently, but I'd prefer something running locally.

Hopefully someone in the OpenAI team reads this. :)

shpx 8d
We shouldn't call this open source. The model definition + the data is the source code. The model weights are a compilation artifact.

> The source code must be the preferred form in which a programmer would modify the program. [...] Intermediate forms such as the output of a preprocessor or translator are not allowed.


If I asked a programmer from OpenAI to modify the model to better support Japanese speakers from Hokkaido, their "preferred form" of the model would include the 680,000 hours of audio used to train the model.

Yes that means that there are almost no open source models and yes it's awesome that they released this and made the weights available. Just don't call it open source.

knaik94 8d
I got some super weird results with the 'medium' model and Japanese as the language (with --task translate). The song is False Sympathy by Mondo Grosso.

"[01:17.000 --> 01:32.000] Translated by Releska" appears when translating to English, even though that entire part of the song is instrumental. The line does not appear at all in the original-language transcription (presumably a credit line the model picked up from fan-made lyric translations in its training data).

The model hung for a solid extra minute at the end when translating to English: the last ~90 seconds of the song took 60 wall-clock seconds, while the entire rest took about 90. The same behavior was not observed with transcribe.

Some of the English words are incorrect, but that was expected. The first Japanese "mistake" I found was "全ては二人の" instead of "すべては ふたりの", the former being what Whisper wrote. A single random word "HEY" was transcribed even though it's the singer elongating the 園 in 楽園: "落ちてゆく 二人で繋がれた二人のラグ HEY" instead of "落ちていく 鎖でつながれた 二人の楽園".

I am using the official subtitles released on the youtube video.

It's a complex Japanese song with both Japanese and English. The original transcription took about 20 real-time seconds to produce the first line, and 130 seconds for the whole song. It seems to emit results in 20-second window increments, though this seems to depend on what it considers audio and what it throws away.

On my computer I wasn't able to use the large model because I ran out of VRAM (I have 8 GB; not sure how much more it'd require), so I ran it with medium.

The MV is suggestive, in case that matters. I grabbed a fresh audio rip from YouTube because I didn't want to take the CD out of its case.

It is translating this version differently from the director's cut version. I ripped both as opus.

thuttinger 8d
I tried running it in realtime with live audio input (kind of).

If you want to give it a shot, you can find the python script in this repo:

A bit more context on how it works: the system's default audio input is captured with Python, split into small chunks, and fed to OpenAI's original transcription function. It tries (currently rather poorly) to detect word breaks and avoids splitting the audio buffer in those cases. Given how the model is designed, this isn't the most natural fit, but I figured it was worth trying. It works acceptably well.
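
The chunking step described above can be sketched independently of the audio capture (sizes here are illustrative; Whisper itself pads or trims each input to a 30-second window internally):

```python
def chunk_samples(samples, sample_rate=16000, seconds=5):
    # Split a raw sample buffer into fixed-length windows; each window
    # would then be handed to the transcription function in turn.
    n = sample_rate * seconds
    return [samples[i:i + n] for i in range(0, len(samples), n)]
```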

harry8 8d
Can you plug this into a computer on your premises to get speech recognition without amazon, apple or google's cloud (or any other cloud) involvement?

Right now I decline all speech recognition because I don't want Orwellian listening devices in my house or pocket, and I haven't seen an answer. (I also haven't been bothered enough about speech-command interfaces to do a load of research - lazy me.)

eugenhotaj 8d
Now someone just needs to pipe the output into stable diffusion.
RockRobotRock 8d
Dude, this is insane. This is so much better than other speech to text libraries I've tried.
graderjs 8d
The big question is why is Google's speech recognition in Gboard voice typing still so shit?

The MIT-licensed model seems way better

mwlp 8d
Super impressive. I tested it on a Japanese streamer whose enunciation isn't exactly perfect and it did a decent job:

  [00:00.000 --> 00:06.500]  Since the last one started, the number of times I've eaten has decreased.
  [00:06.500 --> 00:11.000]  If I get too carried away with the last one, I'll get hungry and do it.
  [00:11.000 --> 00:14.500]  I don't have time to eat.
  [00:15.500 --> 00:18.000]  I'm going to eat now.
  [00:20.000 --> 00:23.000]  It's going to take about 10 minutes from here.
  [00:23.000 --> 00:31.000]  It's been a while since I've had my last meal.
  [00:31.000 --> 00:36.000]  I feel like I'm losing my女子力.
  [00:36.000 --> 00:39.000]  I have to go back to my original self.
  [00:39.000 --> 00:44.000]  I have to get ready and go to bed.
  [00:44.000 --> 00:46.000]  It's not good.
  [00:46.000 --> 00:51.000]  I've been drinking a lot lately, so I'm going home.
  [00:51.000 --> 00:53.000]  I have to get my nails done this fall.
  [00:53.000 --> 00:54.000]  Halloween nails.
  [00:54.000 --> 00:57.000]  Halloween, Halloween, Halloween.
  [00:57.000 --> 00:59.000]  I'm going to the beauty salon today.
  [00:59.000 --> 01:02.000]  I'm going to get my nails done the day after tomorrow.
  [01:02.000 --> 01:10.000]  I used to look at a lot of clothes, but I stopped looking at them.
  [01:10.000 --> 01:12.000]  I'm going crazy.
  [01:12.000 --> 01:22.000]  My stomach's stopped in the middle of summer.
Tistron 7d
It understands my Swedish attempts at English really well with the medium.en model. (Although, it gives me a funny warning: `UserWarning: medium.en is an English-only model but received 'English'; using English instead.`. I guess it doesn't want to be told to use English when that's all it can do.)

However, it runs very slowly. It uses the CPU on my MacBook, presumably because it hasn't got an Nvidia card.

Googling about that, I found PlaidML, a project promising to run ML on many different GPU architectures. Does anyone know whether it's possible to plug them together somehow? I am not an ML researcher and don't quite understand the technical details of the domain, but I can understand and write Python code in domains that I do understand, so I could do some glue work if required.

danso 7d
This is an astonishing package. Every AI voice-to-text model I've tried on The Wire's famous "fuck" scene [0] has failed, because the YouTube clip's audio quality is bad and it's a scene with virtually no dialogue except breathing and "Fuck". But Whisper returned impressive results [1]



    $ yt-dlp --extract-audio --audio-format mp3 -o wire-fuck.mp3

    $ whisper --language en wire-fuck.mp3
    [00:00.000 --> 00:02.000]  Oh
    [00:13.260 --> 00:15.260]  Fuck
    [00:15.260 --> 00:31.260]  Motherfucker
    [00:50.700 --> 00:52.700]  Fuck
    [00:52.700 --> 00:58.700]  Oh
    [00:58.700 --> 01:10.700]  Fuck
    [01:28.700 --> 01:55.900]  Fuck
    [02:02.340 --> 02:03.700]  Motherfuck.
    [02:10.220 --> 02:11.220]  Oh, fuck.
    [02:11.780 --> 02:12.780]  Oh, fuck.
    [02:25.900 --> 02:27.900]  Fuck, fuck, fuck, fuck, fuck, fuck.
    [02:27.900 --> 02:28.900]  Motherfucker.
    [02:32.900 --> 02:33.900]  Oh, fuck.
    [02:34.900 --> 02:35.900]  Fuck.
    [02:35.900 --> 02:36.900]  Oh, fuck.
    [02:36.900 --> 02:37.900]  Oh, fuck.
    [02:37.900 --> 02:38.900]  Oh, fuck.
    [02:48.900 --> 02:49.900]  Motherfucker.
    [02:53.900 --> 02:54.900]  Fucking A.
    [02:54.900 --> 02:56.900]  Mm hmm.
    [02:56.900 --> 03:12.900]  Fuck.
    [03:26.900 --> 03:28.900]  Motherfucker.
    [03:28.900 --> 03:32.900]  Fuck me.
    [03:58.900 --> 04:01.900]  Oh.
    [04:28.900 --> 04:34.900]  Fuck.
rlt 7d
As a casual observer I get the sense that OpenAI and others are very rapidly creating building blocks of something much bigger…
txtai 7d
Check out this notebook for an example on how to run Whisper as a txtai pipeline in Python or as an API service: