Users Will Fall in Love With OpenAI’s New GPT-4o Model. Literally.

Rifx.Online
Generative AI , Chatbots , Natural Language Processing
01 Nov, 2024

The company’s new GPT-4o can understand and mimic human speech and emotion

In the iconic 2013 film Her, the protagonist develops an intense relationship — which morphs into a love affair — with a voice-enabled AI system.

The AI in Her is everything that today’s voice-enabled systems are not: emotive, funny, and able to intuit the subtleties of human conversation.

In a major announcement this morning, OpenAI announced the release of a new version of its ChatGPT system that natively integrates speech, transcription, and intelligence into a single model.

It’s powerful, intuitive, and disturbingly human-like. Essentially, OpenAI has built a real-life version of Her.

A Bad Conversationalist

ChatGPT has had voice capabilities for months now. Even today, you can open the ChatGPT app on your phone, press the headphones icon, and converse with the system using your voice.

The problem, though, was that ChatGPT was a terrible conversationalist.

Essentially, ChatGPT’s voice capabilities were a hack created by splicing together three different models.

When you would speak to the system, it would first use a transcription model to turn your voice into text. It would then feed that text into its intelligence model — basically, the same system that underpins GPT-4.

The intelligence system would generate text, which ChatGPT would feed back into a text-to-speech system to create a computerized voice that would respond to you.

This made the system nominally conversational, but actually speaking with it was clunky and awkward.

All the extra steps of sending content between different models meant that the system was laggy. In my own testing, I found it often took 3 to 5 seconds between speaking to the system and getting a response back.

Human conversation relies on subtleties that unfold over milliseconds. A system that takes up to five seconds to respond to speech feels clunky and robotic.

The previous system also lacked many fundamental aspects of human speech.

For example, you couldn’t interrupt it; you had to wait for it to finish speaking before you could respond.

Speaking with it often felt like talking to one of those un-interruptable people who blabbers on about a random topic with no awareness of the other people in the room. You often felt like bring up the Oscars’ orchestra in a desperate attempt to get the system to stop talking.

It was also constrained by its inability to interpret emotion in voices or to accurately mimic human emotion in its own responses.

Humans are excellent at reading between the lines, partially because we can pick up on subtle emotive cues in the speaker’s voice.

If I ask my friend, “How was your day?” and they respond, “It was fine,” but they insert a subtle pause between “was” and “fine” (or there’s a hint of exasperation in the final word), I’d know that they actually had a challenging day, and I should ask some follow-up questions.

ChatGPT couldn’t do these things, which made speaking to it feel like communicating with some kind of alien intelligence, not a human.

In short, the previous system fell squarely into the uncanny valley. It was good enough at conversing and had a convincing enough voice that parts of the conversation could feel human-like.

But the weird pauses, lack of emotive understanding, and lag ultimately shattered the illusion, making it come off as more unsettling than useful.

I tried using the previous system with my six-year-old son. He was so creeped out by it that he wouldn’t let me switch the audio back on again.

OpenAI’s Revoluntary New Model

Today, OpenAI is changing all of that. In their announcement this morning, the company revealed that they are releasing a new model, GPT-4o.

GPT-4o natively integrates speech recognition, speech generation, and intelligence into a single system.

That means that the spaghetti code system integrating three different models to simulate conversation is gone. Instead, the new version of ChatGPT will be able to take in speech, process it instantly, and respond with realistically generated speech of its own.

For users, this will enable several new capabilities that OpenAI CEO Sam Altman described as “like magic.”

For one, you’ll be able to converse with ChatGPT much more naturally. Instead of having to type your questions and follow-ups into an interface, you’ll be able to speak with the app as if you’re talking to a friend.

In several live demos, OpenAI’s engineers showed how the system can listen to a user and respond with an intelligent result within milliseconds.

Again, those speeds are possible because the new model doesn’t need to waste time switching modalities — it can process voice and respond with its own voice in a single step, instead of resorting to multiple lower-level models.

GPT-4o can also interpret and create emotion.

In one demo, an OpenAI staff member asked the system to lead him through a breathing exercise.

He then pretended to hyperventilate, and ChatGPT — sensing the speed with which he was breathing and the apparent panic in his voice — urged him to slow down and take deeper breaths.

The system also appears capable of modulating the emotion in its own responses. In another demo, the staff member asked GPT-4o to read a bedtime story in an increasingly dramatic voice.

It obliged, ultimately sounding like a middle school theater kid horrifically overacting a scene!

Because the new system is also integrated with GPT-4’s vision capabilities, it can perform functions like interpreting the emotions on a person’s face.

This increased level of emotional intelligence will likely make the system a much better conversationalist.

Other new capabilities will help, too. Users can interrupt GPT-4o mid-sentence.

During their demos, OpenAI staff members frequently interrupted the model when it started to go on tangents, as one might interrupt a friend to start responding to a real-life question.

Huge Potential

The demos this morning were lighthearted and funny. But one can quickly see how a model that can easily interpret, quickly process, and realistically create emotive human speech could be incredibly powerful.

Several times during the demo, ChatGPT responded in ways that reminded me of the fictional AI from Her.

ChatGPT appeared to laugh at itself, become embarrassed when OpenAI staff members complimented it, and perhaps even throw in a flirty line here and there.

Several (purportedly) unscripted interactions also revealed some of the deeper capabilities that better conversation could unlock.

Based on an audience question, OpenAI’s staff members demonstrated how the system could listen to speech in Italian and quickly and accurately translate it into English speech, and vice versa.

One can easily imagine how such a capability could make multi-lingual interactions incredibly simple, essentially eliminating language barriers (and perhaps, human translators).

A doctor, for example, could pull up ChatGPT and use it to quickly speak with a patient in any language. While traveling, you could pull up the app on your phone and use it as a free and instantaneous translator to ask someone for directions or to make a purchase in a store.

Adding the vision capabilities, one could even show ChatGPT a foreign restaurant menu, ask for a translation of certain items, tell it when you like to eat at home, and ask it to recommend some dishes you might want to order (or avoid.)

I can also see how quickly the new system could venture into Her territory. OpenAI still doesn’t allow the kinds of NSFW interactions that happened in the movie.

But GPT-4o’s ability to understand and mimic emotion — coupled with its powerful, often uncanny abilities to produce its own convincing human emotional speed — is striking.

Listening to the demos, I’m certain that people will fall in love with this system, just as the protagonist did in Her. It’s that good.

Will it get used?

All of this is amazing on paper. It’s unclear, however, how many users actually want a fully emotive AI voice companion.

Most people I work with use ChatGPT not as a conversational companion, but for utilitarian purposes.

I’ve seen colleagues leverage the system for boring and mundane tasks like writing the landing page copy for a webinar, turning out a quick response to an email from their landlord, or writing the first draft of a blog post.

None of these utilitarian functions really require conversation. It’s unclear whether being able to speak these kinds of requests to an AI would be useful.

The real test, then, is not necessarily how capable OpenAI’s new system is, but how well they integrate it into places where people are already interacting with computers via their voices.

Realistically, I can’t see many users sitting down at work and conversing with AI.

But if OpenAI integrates GPT-4o into voice interfaces on cell phones, in cars, or on smart devices like the Amazon Echo, I could easily see the system’s emotive capabilities becoming much more useful.

Even if people don’t want to speak with ChatGPT very much, the new capabilities of a natively multimodal audio and vision model will be incredibly powerful for developers who build applications on top of OpenAI’s existing API.

In their announcement, OpenAI said that GPT-4o will be available through their existing developer interfaces. The system will also be 50% cheaper than previous models of GPT-4.

Those changes alone are massive. Whether or not the speech element really takes off, the intelligence that powers it will also make hundreds of existing GPT-4-powered applications smarter, faster, better, and cheaper to operate.

The conversational elements of the new system, in other words, might turn out to be a cool gimmick. But the underlying impact will be subtler and broader.

I’m excited to see how real-life users interact with GPT-4o. Will they be creeped out? Amazed? Wooed?

But I’m even more excited to fire up my Python IDE and add GPT-4o into the applications I’ve already built using OpenAI’s tools.

Speaking to a machine is cool. But a natively multimodal AI model that understands human emotions, and that I can summon with a few lines of Python code, for cheap? That could truly change the world.

I’ve tested thousands of ChatGPT prompts over the last year. As a full-time creator, there are a handful I come back to every day that fit with the ethical uses I mention in this article. I compiled them into a free guide, 7 Enormously Useful ChatGPT Prompts For Creators. Grab a copy today!