OpenAI’s New ChatGPT Voice and Image Options Generate Excitement

Apple’s Siri and Amazon’s Alexa now have formidable competition in OpenAI’s latest version of ChatGPT. The chatbot now has new capabilities, allowing users to speak to it and receive an audio response.

According to OpenAI’s release notes on its website:

“We are beginning to roll out new voice and image capabilities in ChatGPT. They offer a new, more intuitive type of interface by allowing you to have a voice conversation or show ChatGPT what you’re talking about.

“Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it. When you’re home, snap pictures of your fridge and pantry to figure out what’s for dinner (and ask follow up questions for a step by step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you.”

This is a fascinating leap for the chatbot and will likely open a world of information for millions of users. The effort CEO Sam Altman puts into his work is astounding. According to New York Magazine, he worked so hard on building Loopt, his first project, that he suffered from malnutrition and got scurvy, a vitamin C deficiency that stems from not eating enough fruits and vegetables.

Now, over a decade after selling that startup (for over $40 million), Altman’s company allows Plus users on iOS and Android to use their voice to engage in a two-sided conversation with ChatGPT.

“Speak with it on the go, request a bedtime story, or settle a dinner table debate,” the company said in its announcement.

To use the app with voice, head to Settings → New Features on the mobile app and opt into voice conversations. Then, tap the headphone button in the top-right corner of the home screen and choose your preferred voice from the five available.

OpenAI has also announced that Plus users can now show ChatGPT one or more images.

“Troubleshoot why your grill won’t start, explore the contents of your fridge to plan a meal, or analyze a complex graph for work-related data. To focus on a specific part of the image, you can use the drawing tool in our mobile app.”

To use this feature, tap the photo button to capture or choose an image. You can also discuss multiple images or use the app’s drawing tool to guide the assistant. If you’re on iOS or Android, tap the plus button first.

OpenAI explains that image understanding is powered by multimodal GPT-3.5 and GPT-4.

“These models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images,” the company said.

The competition is not standing still. Amazon announced on Monday that it will invest up to $4 billion in Anthropic, the AI startup founded by siblings Dario and Daniela Amodei, both former OpenAI employees.

That deal is small compared with the reported $13 billion that Microsoft has so far invested in OpenAI, according to The Verge.

OpenAI recognizes the risks inherent in its technology and addresses these concerns. The company says its goal is “to build AGI that is safe and beneficial.”

The company also says it believes in “making our tools available gradually, which allows us to make improvements and refine risk mitigations over time while also preparing everyone for more powerful systems in the future.”

Naturally, this strategy becomes even more important with advanced AI models involving voice and vision.

OpenAI's new technology, capable of crafting realistic synthetic voices from a few seconds of real speech, opens doors to many creative and accessibility-focused applications. However, the company acknowledges that these capabilities also present new risks, “such as the potential for malicious actors to impersonate public figures or commit fraud.”

For this reason, OpenAI explained that the voices were created with voice actors it worked with directly, and that it is collaborating in a similar way with others, including Spotify, which is using this technology for the pilot of its Voice Translation feature.

OpenAI acknowledges that vision-based models also “present new challenges, ranging from hallucinations about people to relying on the model’s interpretation of images in high-stakes domains.”

Before deploying the technology more broadly, the company says it tested the model “with red teamers for risk in domains such as extremism and scientific proficiency, and a diverse set of alpha testers.”

This research enabled the company to calibrate the technology in important and sensitive areas to ensure it can be used responsibly.

OpenAI says it has taken “technical measures to significantly limit ChatGPT’s ability to analyze and make direct statements about people since ChatGPT is not always accurate and these systems should respect individuals’ privacy.”

Since users might depend on ChatGPT for specialized topics, such as research, the company says it is “transparent” about the AI model’s limitations and discourages its use in sensitive, higher-risk areas without proper verification.

This terrific advancement on the technology side clearly comes with significant risk, but the excitement is palpable and hopefully the benefits will outweigh the threat.

We can’t wait to see what comes next!