
OpenAI's Whisper: Paving the Way for the Voice Interface Revolution

by Vlad Gheorghe, August 2nd, 2023

Too Long; Didn't Read

Whisper is significantly better than other speech recognition tools. Whisper can replace your keyboard, allowing you to write with your voice. You can use Whisper to improve usability and write three times faster even while you are walking. You can also use Whisper to transcribe voice messages in chats. I present some new apps that show what is possible with this technology. Ultimately, Whisper will transform the way that we relate to computers and AIs.

Are keyboards and touchscreens on the way out?

In the past seven months, the world’s imagination has been captured by language generation technologies like ChatGPT and Claude. In the meantime, another AI-enabled revolution is quietly unfolding.


It is the advent of unprecedented speech recognition capabilities, thanks to models such as OpenAI’s Whisper. These models allow us to easily communicate with our devices and AI helpers via voice.


The importance of OpenAI’s Whisper was first brought to my attention by Ethan Mollick’s tweets and this video by AI Explained, which I strongly recommend.


In this article:


  • I explore what Whisper can do and why it is significantly better than other speech recognition tools.

  • I show you how you can start using Whisper in your daily life to transform the way you work and interact with your devices.

  • I present three prototypes that I’ve personally built to fully exploit Whisper and integrate it into my work.

  • Finally, I explore the broader implications for the future of communication technology.


By the way, I didn't type this article. I dictated it directly to Whisper through my laptop's microphone! About 95% of it is unaltered Whisper output.


What Whisper Can Do


Whisper can transcribe and translate audio in multiple languages. Source: https://openai.com/research/whisper

For the past seventy years, writing on a keyboard (or more recently, tapping on a touchscreen) has been the main interface for communicating with computers.


Clearly, this is not how humans naturally communicate. We are biologically and culturally primed to communicate with our voices. And in every respectable sci-fi production, people talk to their computers.


In fact, we’ve had ‘decent’ speech-to-text solutions for quite a while now, like Apple’s Siri and Amazon’s Alexa. But they were never quite good enough to ignite an interface revolution. With interface technology, a good enough solution is usually not sufficient. To get people to change their habits, you need a solution that is almost perfect.


Is OpenAI’s Whisper ‘almost perfect’? The best way to find out is to experiment for yourself. For my part, I find Whisper to be extremely accurate. For instance, 95% of the text you are reading was directly dictated to Whisper, with very little manual intervention.


Whisper has several features which make it a superior speech recognition tool:


  • It can transcribe at least 57 languages besides English. I have tried transcribing Italian, Spanish, German, and Romanian; the transcriptions seemed quite accurate.


  • When you talk to Whisper, the text comes out fully structured, with correct punctuation and syntax. It is not a shapeless blob of words like with most transcription tools. That means less time invested in revising and correcting the text.


  • Whisper is able to accurately transcribe technical names like ChatGPT, PyPy package management, and SSRIs. To do this, it uses contextual knowledge about the world that it has acquired during training.


  • You can direct and personalize Whisper’s output by providing a written prompt. For example, you can choose between a well-structured text and a more literal transcription that involves speech artifacts like “mmm”, “ahh”, and so on.


  • Whisper can deal with noisy environments and overlapping voices. From what I’ve experienced, Whisper can identify the main voice in the audio and transcribe it without getting distracted by background noise.


  • Like ChatGPT, Whisper is available via API for developers to build on. Whisper is also open-source, so you can host the model locally without depending on OpenAI (see the sketch after this list).


  • Whisper is relatively cheap. Currently, it costs $0.36 to transcribe one hour of audio via OpenAI’s Whisper endpoint.
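To make the API bullet concrete, here is a minimal sketch of both routes. It assumes the `openai` Python package (the pre-1.0 interface that was current when this article was written), an `OPENAI_API_KEY` environment variable, and a file name that is purely illustrative:

```python
# Hosted route: OpenAI's Whisper endpoint (pre-1.0 `openai` package interface).
import openai

with open("voice_note.mp3", "rb") as audio_file:
    transcript = openai.Audio.transcribe(
        model="whisper-1",
        file=audio_file,
        # Optional prompt: steers spelling and style, e.g. of technical names.
        prompt="The speaker mentions ChatGPT, PyPy, and SSRIs.",
    )

print(transcript["text"])
```

And the self-hosted route, using the open-source `whisper` package (assumes `pip install openai-whisper` plus ffmpeg on the system):

```python
# Local route: run the open-source model yourself, with no API calls.
import whisper

model = whisper.load_model("base")           # larger models are more accurate
result = model.transcribe("voice_note.mp3")
print(result["text"])
```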



How To Use Whisper

ChatGPT Phone Applications



The ChatGPT Android/iOS app uses Whisper for voice input.


Whisper handles voice input in the ChatGPT app for Android and iOS. This is the best way to try Whisper for free. These apps were released very recently, and not many users know that they contain a state-of-the-art speech recognition model. I’ve already made extensive use of this feature: it allows me to quickly explain my problem to ChatGPT and get a response in under one minute. This makes ChatGPT feel less like a lifeless chat and more like a trusted assistant or support agent.


Given Whisper’s frontier capabilities, I was excited to integrate it more tightly with my workflows. Since I couldn’t find any apps that do this, I built a few prototypes that showcase Whisper’s power.


Prototype 1: Whisper As Complementary Keyboard

I wanted Whisper to be a second keyboard for my laptop. I built a plugin to make it happen. It is best to see it in action in this one-minute video demo.



At any time I can press a button on my laptop, talk, and the transcription will be sent directly where my cursor is. This allows me to use Whisper to write documents, chat, or run Google searches: in short, for everything.


The code is available for free in my GitHub repository.
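The repository has the full implementation; for a rough sense of how such a tool can be put together, here is a minimal, independent sketch. The package choices (`sounddevice`, `soundfile`, `pyautogui`) and the fixed-length recording are my own simplifying assumptions, not necessarily how the plugin works:

```python
# Minimal push-to-talk dictation sketch (not the repository's code).
# Assumes the sounddevice, soundfile, openai, and pyautogui packages.
import sounddevice as sd
import soundfile as sf
import openai
import pyautogui

SAMPLE_RATE = 16_000  # Whisper works well with 16 kHz mono audio
SECONDS = 10          # fixed-length recording keeps the sketch simple

def dictate():
    # 1. Record from the default microphone.
    audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
    sd.wait()
    sf.write("clip.wav", audio, SAMPLE_RATE)

    # 2. Transcribe with the Whisper endpoint.
    with open("clip.wav", "rb") as f:
        text = openai.Audio.transcribe(model="whisper-1", file=f)["text"]

    # 3. Type the result wherever the cursor currently is.
    pyautogui.typewrite(text, interval=0.01)

if __name__ == "__main__":
    dictate()
```

A real plugin would bind `dictate()` to a global hotkey and record until the key is released rather than for a fixed duration.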


Prototype 2: Whisper As Chat Companion

While having Whisper in the ChatGPT app is great, I wanted to integrate it with all my chats. I built a Telegram bot for this purpose: I call it WhisperBot. I send voice messages to WhisperBot and it responds with a transcription.


WhisperBot transcribing my voice message on Telegram.


Whisper can directly transcribe or translate at least 57 languages, so you can probably talk to it in your language.
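For the curious, a bot like this can be quite small. The sketch below is a hypothetical reconstruction, not WhisperBot’s actual code; it assumes `python-telegram-bot` (v20+), `pydub` with ffmpeg installed (the Whisper endpoint does not accept Telegram’s Ogg/Opus voice format directly), and the pre-1.0 `openai` package:

```python
# Hypothetical WhisperBot-style sketch: reply to voice messages with text.
import openai
from pydub import AudioSegment
from telegram import Update
from telegram.ext import ApplicationBuilder, ContextTypes, MessageHandler, filters

async def transcribe_voice(update: Update, context: ContextTypes.DEFAULT_TYPE):
    # Download the Telegram voice note (Ogg/Opus).
    tg_file = await update.message.voice.get_file()
    await tg_file.download_to_drive("voice.ogg")

    # Convert to mp3, since the Whisper endpoint does not accept Ogg/Opus.
    AudioSegment.from_file("voice.ogg").export("voice.mp3", format="mp3")

    with open("voice.mp3", "rb") as f:
        text = openai.Audio.transcribe(model="whisper-1", file=f)["text"]

    await update.message.reply_text(text)

app = ApplicationBuilder().token("YOUR_TELEGRAM_BOT_TOKEN").build()
app.add_handler(MessageHandler(filters.VOICE, transcribe_voice))
app.run_polling()
```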


Here’s how I’ve been using WhisperBot:


  • One semi-hidden source of social conflict in chats is that people prefer sending voice messages, but they prefer receiving text messages. Talking is more convenient than typing, and reading is more convenient than listening.


    With WhisperBot I can:

    • Turn my voice messages into text. I personally like to make long voice messages with my thoughts, but people are usually not excited to get a 3-minute voice message from me. Now, I send my voice message to WhisperBot and then forward the transcription. It usually doesn’t require any manual revision.

    • Transcribe other people’s voice messages. When I get a voice message from someone, I usually forward it to WhisperBot and just read the transcription.


  • Note down my thoughts. I already had a private Telegram chat where I wrote down thoughts and ideas. It is more fun to do it with my voice and have it transcribed immediately. Once your thoughts are immediately turned into text, you can send them to your knowledge management system or to ChatGPT for further refinement.


  • Write while I’m on the go. I can now write an entire blog post or email during a walk. I explain my thoughts in a voice message to WhisperBot, get the transcript, put it into ChatGPT, and have it reformatted as a post. I can also include direct indications, such as “ChatGPT, add two more examples of situations in which communication is crucial”. This system greatly reduces the time to produce a document, and it is very fun to write while on a walk.


Prototype 3: Whisper As Translator and Interpreter

Since Whisper can easily understand 57+ languages and translate them to English, it is clear that it can be used to build an amazing translator and interpreter application. This is important for me because my parents and partner don’t speak the same language, and I want them to be able to communicate independently.


Therefore, I’ve built a prototype for an app that allows people to speak on Telegram without knowing each other’s language, crossing cultural and language barriers. That is a topic for future posts, but I wanted to mention it here since it’s a great use case for Whisper.
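Note that Whisper itself only translates *into* English, via a separate endpoint; translating English replies back out would need another model (GPT, for instance). A minimal sketch of the Whisper half, with the file name purely illustrative:

```python
# Whisper's translation endpoint: speech in ~57 languages -> English text.
import openai

with open("message_in_romanian.mp3", "rb") as f:
    english_text = openai.Audio.translate(model="whisper-1", file=f)["text"]

print(english_text)
```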


GitHub Copilot Voice: A New Programming Paradigm

Copilot Voice is an upcoming GitHub Copilot feature. I warmly recommend watching the demo.

Copilot Voice will allow programmers to vocally describe what they want, while the AI takes charge of the low-level task of writing the code. It combines Whisper with Copilot, which uses GPT to translate the programmer’s intent into code.


This is a revolutionary way to rethink coding. As a professional developer, I believe that it will radically transform how we work, and this transformation will occur within the next year.

I have applied for the Copilot Voice waitlist, but I’m still waiting to try it. Hear that, GitHub? 😊



The Future of Computing Interfaces

We have explored how powerful Whisper is and how it can be used for daily life. Now let us zoom out a little and reflect on what this means for the future of computing interfaces.

Writing on a keyboard or tapping a touchscreen is not an efficient way for a human to communicate. The average typing speed is about 40 words per minute, while the average speaking speed can reach 150 words per minute. Talking can thus be at least three times faster than typing.

Talking feels more fun and natural. People who are unfamiliar with technology must learn how to type on a keyboard. They don’t need to learn how to talk because they spent the first few years of their lives learning that. It is likely that we will spend more time producing text if we can do it with our voice.


An excellent speech recognition system allows you to communicate without your hands. You could be on a walk, driving, or you could be a doctor engaged in a delicate surgical operation. Again, this makes communication easier and more natural.


It is important to note that your voice contains far more information than a sequence of keystrokes.


Here’s ChatGPT’s take:


Voice messages possess a rich depth of communication attributes such as tone, pitch, volume, cadence, rhythm, and emphasis, which can significantly influence the meaning of a message. Moreover, voice messages also carry non-verbal sounds, such as laughter, sighs, pauses, or breaths, which can further augment the communicated sentiments. Other nuanced auditory elements like accent, pronunciation, and voice quality can reveal cultural background, education, or even health status. Furthermore, temporal aspects, such as the pace of speech, the length of pauses, and the use of filler words can offer insights into the speaker's thought process, comfort level, or their familiarity with the topic.


While Whisper doesn’t analyze this information now, you can bet that future speech recognition tools will differentiate themselves by their ability to use this information to better understand and assist the user.


Voice understanding capabilities will also lead to more immersion. It will be possible to speak with AI entities via live voice communication, phone, and video calls, or within a video game.

In fact, it is perhaps a historical accident that the main interface for communicating with AI models is currently a written chat. It is likely that a year from now, most people will interact with AI via voice.


Moreover, the lines between media types will blur because any content that exists in the form of audio will instantly be available in the form of text at almost zero cost. Language barriers will also blur because it will be possible to instantly and accurately transcribe and translate content from any language, whether written or audio. These changes will transform the way we produce and consume content.


Conclusion

In the near future, computer devices will become more usable, accessible, fun to use, and immersive. Of course, this also applies to AI entities like ChatGPT. We will still produce a lot of text in our digital lives, but most of this text will be made via voice and not via mechanical typing. Content barriers and language barriers will fall apart. Programmers will work with their voices more than with their hands, and their offices will become as noisy as salespeople’s.


These changes will require adaptation, especially for ‘dinosaurs’ like me who have been using the mechanical typing interface for more than 20 years. After I integrated Whisper with my keyboard, many activities became faster and more efficient. The biggest limit on exploiting this new capability was… my constant forgetting that it was there and that I should use it! The force of habit is extremely strong, and I am used to doing everything with the keyboard. If I want to leverage this power, I have to retrain myself and completely alter the way that I relate to my computer.


Finally, we will have to update our norms and expectations to handle the reality of people constantly speaking to their devices in the office, at home, and on the street. If this seems weird to you, remember that wearing headphones or talking to a phone in public was once considered weird too.

