The Future Is Now: How Voice Robots Work And What They Can Do

By Alexander Kuznetsov

More than 15 years of experience in telco. Co-founder and COO at Neuro.net

Robotization of routine operations, in which robots rather than people handle simple but labor-intensive tasks, is gaining momentum. Many processes are being automated, including telephone conversations with customers. Neuro.net develops technologies that extend what such robots can do.

In this post, the developers describe the technologies behind recognizing an interlocutor’s gender by voice and handling the key elements of a conversation.

First, we’ll look at a business case; then we’ll discuss the technologies in detail.

One of the most interesting business cases involved replacing the call center employees of a partner company with a voice robot. The robot was used not for regular tasks, such as clarifying a delivery address, but to find out why some customers had become less likely to visit the company’s website.

The technology was based on a fully functional neural network rather than individual scripts. The neural network helped solve the problems that usually confuse robots, above all answers such as “well, I don’t know yet, maybe yes or rather no” or even “well, probably, not.” Phrases that are ordinary for humans become an insurmountable obstacle for a robot.

After training, the robot understood the meaning of different phrases and the possible answers to them. It had several voices, both male and female. The main task was to make the robot sound more like a person, so that the human interlocutor would not test the capabilities of the machine but would follow the dialog within the target scenario.

Below is an example of the result.


The robot listens to the interlocutor and gives meaningful answers. The total number of different branches of the conversation script is more than a thousand.

The main goal of this robot was to understand the reason for the decrease in customer activity on the company’s website and to make an interesting offer to the customer. This was one of the company’s first attempts to automate call centers.

New robots are significantly improved. Here are some more examples of how robots communicate with humans: first, second, third.

Now let’s talk about the underlying technologies.

There are three key technological features that ensure the performance of the robot:

● Recognition of the interlocutor’s gender using their voice

● Age recognition

● Managing dialog with a human interlocutor

Recognition of the interlocutor’s gender using their voice

What is it for? Initially, this function was developed for conducting surveys with robots. Previously, surveys were conducted by people who filled out a questionnaire, which might require, for example, indicating the gender of the interlocutor. A human interviewer does not need to ask whether they are talking to a man or a woman; in 99% of cases, it is obvious. With robots, the situation is different: before they could recognize voices with reasonable accuracy, developers had to solve many problems. This work was not in vain. The technology is now used to personalize offers and voice prompts depending on gender.

An important point: a female voice is universal and suits the widest range of products, and it is especially important for products aimed at women. According to various studies, a female voice is perceived positively by any audience, so conversion is better in this case. The exception is campaigns promoting “male” products, where a male voice is preferable.

How does it work? First, primary processing is performed on the voice recordings, working with fragments lasting 20 ms. All collected fragments are preprocessed by the VAD (Voice Activity Detection) component. This is necessary to separate the wheat from the chaff, that is, speech from noise. Everything unnecessary is removed, which increases the accuracy of the models.
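The post does not say how the VAD component decides what counts as speech. As a rough illustration only, the sketch below uses the simplest possible rule: split the signal into 20 ms frames and keep those whose energy exceeds a fixed threshold. The function names and the threshold value are assumptions made for this example; production VADs are considerably more elaborate.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def split_frames(samples, sample_rate, frame_ms=20):
    """Cut a signal into non-overlapping frames of frame_ms milliseconds."""
    size = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def simple_vad(samples, sample_rate, threshold=0.01, frame_ms=20):
    """Keep only frames whose energy exceeds a fixed threshold (assumed value)."""
    return [f for f in split_frames(samples, sample_rate, frame_ms)
            if frame_energy(f) > threshold]

# Usage: 20 ms of near-silence followed by 20 ms of a 200 Hz tone at 8 kHz.
rate = 8000
silence = [0.001] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(160)]
speech_frames = simple_vad(silence + tone, rate)
print(len(speech_frames))  # only the tone frame survives
```

A fixed threshold breaks down on noisy lines, which is one reason real systems adapt the threshold or use trained models instead.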

Recognition uses the space of cepstral coefficients together with their first- and second-order differences. The approach is based on GMMs (Gaussian Mixture Models).

To obtain the coefficients, we take an interval of 10–20 ms and calculate its power spectrum, then apply the inverse Fourier transform to the logarithm of the spectrum and keep the necessary coefficients.
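The steps just described (power spectrum, logarithm, inverse Fourier transform) can be sketched with nothing but the standard library. This is an illustration under assumptions: the number of coefficients kept (13), the naive transform, and the function names are choices made for the example, and any mel-scale filtering a real system might apply is left out because the post does not mention it.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive discrete Fourier transform, O(n^2); adequate for one short frame."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t, v in enumerate(values))
           for k in range(n)]
    return [x / n for x in out] if inverse else out

def cepstrum(frame, num_coeffs=13, eps=1e-10):
    """Power spectrum -> logarithm -> inverse transform -> first coefficients."""
    power = [abs(x) ** 2 for x in dft(frame)]
    log_power = [math.log(p + eps) for p in power]  # eps avoids log(0)
    # For a real frame the log power spectrum is symmetric, so the inverse
    # transform is real up to rounding error and the imaginary part is dropped.
    return [c.real for c in dft(log_power, inverse=True)][:num_coeffs]

def differences(vectors):
    """First-order differences between consecutive coefficient vectors."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(vectors, vectors[1:])]

# Usage: two consecutive 20 ms frames of a synthetic 8 kHz signal.
rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / rate)
          + 0.3 * math.sin(2 * math.pi * 900 * n / rate)
          for n in range(320)]
coeffs = [cepstrum(signal[i:i + 160]) for i in (0, 160)]
first_order = differences(coeffs)        # one vector of 13 differences
second_order = differences(first_order)  # empty here: needs three or more frames
```

Applying `differences` twice gives the second-order differences, so a real feature vector would concatenate all three for each frame.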

Separate GMM models are trained on male and female voices, and we also use models to distinguish adult and children’s voices. Of course, the system cannot be trained from nothing: you need labeled voice recordings.
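To show how a pair of class-specific GMMs can be used at decision time, here is a minimal sketch that scores a sequence of feature vectors under each model and picks the higher log-likelihood. The two-dimensional features and all weights, means, and variances below are hypothetical values invented for the example; real models would be fitted to labeled recordings and would work on the cepstral features described above.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(frames, gmm):
    """Sum over frames of log(sum_k w_k * N(x | mean_k, var_k))."""
    total = 0.0
    for x in frames:
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        top = max(terms)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total

# Hypothetical, hand-picked parameters: (weight, mean, variance) per component.
MALE_GMM = [(0.6, [-1.0, 0.5], [0.5, 0.5]), (0.4, [-2.0, 0.0], [1.0, 1.0])]
FEMALE_GMM = [(0.5, [1.0, -0.5], [0.5, 0.5]), (0.5, [2.0, 0.0], [1.0, 1.0])]

def classify(frames):
    """Return the label with the higher log-likelihood, plus both scores."""
    scores = {"male": gmm_log_likelihood(frames, MALE_GMM),
              "female": gmm_log_likelihood(frames, FEMALE_GMM)}
    return max(scores, key=scores.get), scores

label, scores = classify([[1.1, -0.4], [0.9, -0.6], [1.8, 0.1]])
print(label)  # female: the frames lie close to the hypothetical female means
```

In the system described here, scores like these are not the final answer; they are passed on to a further classifier together with the timbral features.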

To improve the system’s accuracy, we also use coefficients from timbral voice models:

● Timbral sharpness

● Timbral warmth

● Timbral brightness

● Timbral depth

● Timbral robustness

● Timbral growth

● Timbral unevenness

● Timbral reverb

Timbre models are needed to correctly identify children’s voices, since other models classify a child’s voice as female. In addition, the system needs to distinguish gruff female voices (for example, the voice of an elderly woman who smokes), high male voices, and so on. Incidentally, if a person says “hello” and then coughs, models without timbre filters classify the voice as male.

The main component of the system is the classification module based on a multilayer perceptron (MLP). It receives data from the male and female voice models and from the timbral models. At the input, the system receives an array of preclassified values; at the output, it gives the gender decision.
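The structure of such a classifier can be illustrated with a tiny forward pass. Everything concrete below is an assumption for the sake of the example: the four inputs (two GMM scores and two timbral coefficients), the layer sizes, and the hand-set weights, which stand in for values a real network would learn from data.

```python
import math

def dense(x, weights, biases):
    """One fully connected layer: weights has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_predict(x, layers, labels):
    """Forward pass: tanh on hidden layers, softmax on the output layer."""
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = [math.tanh(v) for v in dense(x, w, b)]
    return dict(zip(labels, softmax(dense(x, w_out, b_out))))

# Hypothetical hand-set weights; a trained network would learn these.
# Inputs: [male GMM score, female GMM score, timbral brightness, timbral depth]
LAYERS = [
    ([[-1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]], [0.0, 0.0]),  # hidden layer
    ([[-1.0, -0.5], [1.0, 0.5]], [0.0, 0.0]),                      # output layer
]

probs = mlp_predict([-12.0, -9.0, 0.7, 0.3], LAYERS, ["male", "female"])
print(max(probs, key=probs.get))  # female
```

In practice the inputs would be normalized before reaching the network, since raw log-likelihoods and timbral coefficients live on very different scales.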

The described technology works in both online (based on the customer’s first phrase) and offline (after the conversation) classification modes. Gender recognition accuracy is around 95%. Importantly, the delay in online mode does not exceed 120–150 ms, which matters for making the robot more like a person. Usually, pauses in communication between a robot and a human last seconds rather than milliseconds, which seems strange to the human interlocutor and immediately reveals that a digital system is involved.

Developers are also going to add functionality for working with text. If the interlocutor refers to themselves in the feminine gender, then the interlocutor is definitely a woman. In the near future, this technology will be improved and integrated into the recognition system.

Determining the interlocutor’s age

What is it for? The main goal is to avoid offering various products and services to minors. In addition, knowing the age helps personalize offers by age category.

How does it work? Exactly the same technologies are used as in the previous case. The accuracy of the system is about 90%.

Constructing dialogs

Now for the most interesting part: the principles of constructing dialogs.

What is it for? To replace a person effectively, a robot must be able to work in both linear and nonlinear dialog scenarios. The first can be a questionnaire; the second can be interaction with callers to a call center, a technical support service, and so on.

How does it work? We use an NLU engine based on semantic analysis of the texts received from ASR (automatic speech recognition) systems. It identifies recognition objects, namely entities and intents, which are then used in the logic for constructing conversational flows.

Here is an example of using this technology.

A text received from a speech recognition system (ASR):

“In general, I am interested in your proposal, but I would like something cheaper. And I’m a little busy right now. Could you call me back at six o’clock tomorrow?”

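Objects populated using the NLU Engine:

Intents:

confirmation = true

objection = expensive

question = null

wrong_time = true

call_back = true

Entities:

date = 02.01.2019 (suppose the call date is January 1, 2019)

time = 6:00 p.m.

amount = 6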

The approach to filling objects in this example:


• The “I am interested in your proposal” text has been translated into the “confirmation” intent with a value of “true.”

• The “but I would like something cheaper” text has been translated into the “objection” intent with the value of “expensive.”

• The “And I’m a little busy right now” text has been translated into the “wrong_time” intent with the value of “true.”

• The “Could you call me back at six o’clock tomorrow?” text has been translated into the “call_back” intent with the value of “true.”

• The subscriber did not ask any questions, so the “question” intent is null.


• The “tomorrow” text has been automatically translated into the “date” entity with the value of “January 2, 2019” using the following formula: current_date + 1 (suppose that the date of the call was January 1, 2019).

• The “at six o’clock” text has been automatically translated into the “time” entity with the value of “6:00 p.m.”

• The “six” text was translated into the “amount” entity with a value of “6,” which in this logic can be ignored since there are entities with a higher priority.

All intents and entities are assigned specific values, which are then used to construct a conversational flow.
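The entity rules above (“tomorrow” as current_date + 1, “six o’clock” as a time, and the lower-priority “amount”) can be sketched in a few lines. This is a toy illustration, not the NLU Engine: the word lists, the function names, and the rule that a bare hour from 1 to 7 means the afternoon are all assumptions made so that the example reproduces the values shown in the article.

```python
import re
from datetime import date, time, timedelta

NUMBER_WORDS = {w: i for i, w in enumerate(
    "one two three four five six seven eight nine ten eleven twelve".split(), start=1)}
RELATIVE_DAYS = {"today": 0, "tomorrow": 1}

def extract_entities(text, call_date):
    """Find date, time, and amount entities with simple word rules."""
    words = re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))
    found = {}
    for i, word in enumerate(words):
        if word in RELATIVE_DAYS:
            # "tomorrow" -> current_date + 1
            found["date"] = call_date + timedelta(days=RELATIVE_DAYS[word])
        elif word in NUMBER_WORDS:
            number = NUMBER_WORDS[word]
            found["amount"] = number
            if words[i + 1:i + 2] == ["o'clock"]:
                # Assumption: a bare hour from 1 to 7 means the afternoon.
                found["time"] = time(number + 12 if number <= 7 else number)
    return found

def apply_priority(found):
    """Drop 'amount' when a higher-priority 'time' entity is present."""
    if "time" in found:
        return {k: v for k, v in found.items() if k != "amount"}
    return dict(found)

found = extract_entities("Could you call me back at six o'clock tomorrow?",
                         call_date=date(2019, 1, 1))
entities = apply_priority(found)
print(entities["date"], entities["time"])  # 2019-01-02 18:00:00
```

Keeping extraction and prioritization as separate steps mirrors the description above, where “six” is first recognized as an amount and only then ignored in favor of the time entity.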

Now, let’s talk about the algorithms that are supported by the NLU Engine. The system includes two levels.

The first level works with a relatively small sample of data containing about 600–1,000 records. Machine learning (ML) algorithms are used here. Recognition accuracy is 90%–95%.

The transition to the second level takes place after the project has launched and a large sample of more than 1 million records has accumulated. Deep learning (DL) algorithms are used here. Recognition accuracy is 95%–98%.

This solution works with two subsystems:

- A subsystem for categorization and classification of text data

- A dialog design subsystem

Both subsystems work in parallel. The categorization and classification subsystem receives the text recognized from the subscriber’s speech and outputs the filled-in Entity and Value parameters used to construct an answer.

The dialog construction subsystem for nonlinear scenarios is built on a neural network. It receives the same recognized text and decides which recording should be played next.

A nonlinear scenario is suitable for first-line support, where the robot does not know who is calling, which product is of interest, or what questions may be asked. Here, the further course of the dialog depends on the customer’s response.

For outgoing calls, however, a linear scenario is the best solution. The corresponding example was given at the very beginning of the article. Another example of a linear scenario is a survey, where it does not matter what the customer answers since the answers will be analyzed later by specialists. Nevertheless, it is important to guide the customer through all the questions on the list.
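The difference between the two scenario types can be contrasted in a short sketch. Note the substitution: the article says the nonlinear subsystem is a neural network, while the example below uses a hand-written rule table as a stand-in, purely to show that the next recording depends on the recognized intents. The rule table, recording names, and function names are hypothetical.

```python
def run_linear(questions, answer_fn):
    """Linear scenario: ask every question in order, whatever the answers are."""
    return [(q, answer_fn(q)) for q in questions]

# Hypothetical rules standing in for the neural network described above:
# the first matching (intent, value) pair selects the next recording.
RULES = [
    ("wrong_time", True, "offer_callback"),
    ("objection", "expensive", "cheaper_offer"),
    ("confirmation", True, "proceed"),
]

def next_recording(intents, default="ask_to_repeat"):
    """Nonlinear scenario: the next recording depends on the recognized intents."""
    for intent, value, recording in RULES:
        if intents.get(intent) == value:
            return recording
    return default

# Usage with the intents from the example earlier in the article.
intents = {"confirmation": True, "objection": "expensive", "question": None,
           "wrong_time": True, "call_back": True}
print(next_recording(intents))  # offer_callback

survey = run_linear(["Question 1", "Question 2"], lambda q: "recorded answer")
print(len(survey))  # 2: every question is asked regardless of the answers
```

The linear function never looks at the answers when choosing what to ask next, which is exactly why it suits surveys and scripted outgoing calls.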

In conclusion, we want to emphasize that voice robots will not replace people. Today, they do an excellent job on routine tasks: calling people to ask questions and listening to, recording, and/or analyzing the answers. As a result, call center and technical support operators are spared from repeating the same routine procedures and can focus on solving really interesting problems and completing important tasks.

