What Makes a Great AI Conversation on the Phone?

By Gadi Shamia
November 16, 2019

A few weeks ago, I wrote a blog post about what makes a great conversation. Whether it is human to human or human to machine (AKA “Thinking Machine”), designing any conversation is not easy. You want to ensure that it is dense with accurate information, you want to speak in a language that your customers understand, and you want to respect the customer’s time. In short, a great conversation helps customers achieve their desired goals quickly and pleasantly.

Humans are naturally great conversationalists. After all, we have been conversing for 60,000+ years, so it is second nature to us. We have spent the last 140 years mastering phone conversations, and we now understand how unnatural pauses, unplanned background noise, and human emotion affect a call. Machines, on the other hand, got on the phone only a few years ago, so little has been worked out about how to make them great conversationalists.

At Replicant, we are teaching machines to solve tier-one customer service issues on the phone, and we made it a priority to figure out what makes machine-to-human conversations enjoyable. Just like Asimov and his three laws of robotics, we created our three laws for human-to-machine communications, and we cannot wait to share them with you.

Conversational Speed is Everything

When was the last time you failed to get a response for 7 to 10 seconds after asking a question because the other caller was on mute? Probably yesterday. There is a reason you remember cases like this; they feel strange. We expect people to respond instantly or we get frustrated.

Now imagine a conversation where every time you finish talking, it takes 7 to 10 seconds to get a response. This is actually quite common in machine-to-human conversations. Because solving latency for machines is incredibly difficult, many systems play audio files of fictitious typing or paper shuffling to make the pauses feel more natural.

At Replicant, we understand that customers expect fluid conversations so we designed our technology to respond in under a second, just like a human would. This required building a state-of-the-art telephony system, an AI brain that can think in milliseconds, and many more subsystems that work in concert to ensure conversational speed.

Conversational Accuracy is Equally Important

Even well-trained agents do not always understand everything customers say. There may be background noise, language barriers, or everyday distractions during customer calls. Nevertheless, we expect the person on the other side to understand us and we become frustrated if they do not.

Low accuracy with an IVR or AI system also leads to frustration: customers try to “game” the system by guessing keywords like AGENT or RETURNS to reach an agent faster. The moment this happens, the conversation ends and the shouting match begins.

Many AI systems are unable to hold fluid conversations because their underlying models lack processing speed, accuracy, and, most importantly, contextual awareness. To infer meaning in conversation, one must understand the full range of possible responses. This comes easily to humans, who gain contextual awareness through everyday conversational experience.

Consider a “Thinking Machine” that asks, “Have you seen a doctor for this condition in the last twelve months?” If the response is, “I went to the clinic last week”, a typical machine may have trouble recognizing this as a nuanced yes. The meaning is obvious to our human ears, but only a sophisticated machine would understand it too.

Another important element is the ability to continuously retrain models so that their contextual awareness keeps improving. Imagine that a caller is asked, “Are you ready?”, and they respond with, “Bring it on”. It is unlikely that a conversational model will have been trained on this phrase, so it must quickly ask a follow-up question to move the conversation forward. Once the “Thinking Machine” gets a clear yes, this confirmation can be used to automatically retrain the model to recognize associated phrases, so that it becomes smarter over time. Without a robust, continuous learning system, even a well-intentioned machine will quickly lose context.
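
To make the loop above concrete, here is a minimal sketch in Python. It is an illustration only, not Replicant's implementation: the class YesNoIntentModel, the function ask_yes_no, and the use of a simple phrase dictionary in place of a trained model are all assumptions made for this example. It shows the three steps described above: fail to recognize a phrase, ask a follow-up, and use the clear answer to label the original phrase.

```python
from typing import Optional


class YesNoIntentModel:
    """Toy stand-in for a trained yes/no classifier (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # Assumption: a phrase table replaces a real trained model.
        self.known = {"yes": True, "yeah": True, "sure": True,
                      "no": False, "nope": False}

    @staticmethod
    def _normalize(utterance: str) -> str:
        # Lowercase, trim edge punctuation, and collapse repeated whitespace.
        return " ".join(utterance.lower().strip(" .!?,").split())

    def classify(self, utterance: str) -> Optional[bool]:
        """Return True for yes, False for no, None if the phrase is unrecognized."""
        return self.known.get(self._normalize(utterance))

    def learn(self, utterance: str, label: bool) -> None:
        """Record a confirmed label so the phrase is recognized next time."""
        self.known[self._normalize(utterance)] = label


def ask_yes_no(model: YesNoIntentModel, first_reply: str,
               follow_up_reply: str) -> Optional[bool]:
    """Resolve a yes/no question, using one follow-up if the first reply is unclear."""
    answer = model.classify(first_reply)
    if answer is not None:
        return answer
    # Unclear reply: the system would ask something like "Was that a yes or a no?"
    confirmed = model.classify(follow_up_reply)
    if confirmed is not None:
        # The clear answer becomes the label for the original, unrecognized phrase.
        model.learn(first_reply, confirmed)
    return confirmed


model = YesNoIntentModel()
first = ask_yes_no(model, "Bring it on", "Yes")  # needs the follow-up, then learns
second = model.classify("Bring it on!")          # recognized without a follow-up
print(first, second)  # True True
```

A production system would presumably feed confirmed phrases into a real retraining pipeline rather than a lookup table, but the control flow is the same: an unrecognized reply triggers a follow-up, and the follow-up supplies the label.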

Engineering an Expressive Voice is a Necessity

There is one remaining quality that we expect in conversations: speaking with someone who has an emotionally in-tune, expressive voice. It does not have to be a perfect voice with Hollywood-like quality, but it should be expressive.

It is hard to stay engaged during long conversations when emphasis on key words is weak, questions are not always clear, and the voice is monotone. These flaws can be overlooked in a short voicemail recording, but not in a full-length conversation. While what is said and how fast it is said are far more important, an expressive voice is the icing on the cake, especially if you wish to run a meaningful conversation between a machine and a human.

Replicant’s speedy, smart, and engaging “Thinking Machine” can help you solve tier-one customer service issues on the phone in no time. We relentlessly focused on all three of these guiding principles when we built Replicant to create shorter, more effective calls to delight your customers. We hope that you too will be pleasantly surprised to see how much you enjoy speaking with Replicant. Visit our website to listen for yourself.