“The most important single ingredient in the formula of success is knowing how to get along with people,” Theodore Roosevelt used to say. This formula is crucial when it comes to robots.
After all, without the ability to communicate the robot simply turns into an ordinary, albeit high-tech, device.
In this article, engineers of the “Promobot” company shed some light on what the notion of “robot communication” actually means. As it turns out, building a speech recognition system out of nothing more than a microphone and a speaker paired in one device is an outdated approach, and finding a suitable “head” for such a device is nearly impossible.
One of the most trivial things for humans is hearing and understanding the speech of an interlocutor: a person hears a message and replies to it. When the interlocutor is a robot that needs to hear and understand our speech, it is another story altogether. Human-robot interaction can take place in challenging circumstances: various sources of environmental noise, people talking to the robot simultaneously from different sides, and microphones that fail to discriminate and react even to the robot’s own speech. All of these things pose a variety of issues to resolve.
In the course of their everyday lives, humans never think of communication as a complex technical process, but to teach “Promobot” robots to communicate, we had to see it as exactly that and develop a proper set of “ears” and a “mouth” for our robots to work with.
To have perfect hearing, the robot needs two things: good “ears” to perceive information and a smart, reliable “head” that understands what it perceives. The only adequate solution is a hardware-and-software system based on a microphone array.
Hardware is the easy part, because the set of microphones can be installed wherever required on the robot's body. Software, on the other hand, presents a more difficult task, because the “head”, now fitted with “ears”, must be able to:
cut the robot's speech out of the information heard;
separate the sound from the noise;
recognize the raw human speech in the separated sound;
identify the source of speech;
form a beam in order to amplify the original sound signal from the source (see the sketch after this list);
recognize the target speech in the audio track after all the above-mentioned manipulations.*
*As a rule, the last step falls outside the microphone array's responsibility.
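Of the steps above, beam forming is perhaps the easiest to illustrate. Below is a minimal delay-and-sum beamformer in Python; it shows the general idea only, not the algorithm HARK actually uses, and the per-microphone delays are assumed to come from source localization.

```python
import numpy as np

def delay_and_sum(frames: np.ndarray, delays_samples: np.ndarray) -> np.ndarray:
    """Minimal delay-and-sum beamformer (illustration only).

    frames:         (n_mics, n_samples) signals captured by the array
    delays_samples: per-microphone arrival delays, in samples, for the
                    direction we want to listen to (e.g. the localized speaker)
    Shifting each channel back by its delay and averaging makes sound from
    that direction add up coherently, while sound from other directions
    partially cancels out.
    """
    n_mics, n_samples = frames.shape
    out = np.zeros(n_samples)
    for m in range(n_mics):
        # np.roll wraps samples around the edges; good enough for a sketch.
        out += np.roll(frames[m], -int(round(delays_samples[m])))
    return out / n_mics
```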
Even today there are not many ready-made solutions for this on the global market, and the ones suitable for a robot of Promobot's level are fewer still. The most promising one seemed to be HARK (Honda Research Institute Japan Audition for Robots with Kyoto University), a joint development of Honda and Kyoto University.
HARK was originally designed for robots working in a human environment, where it is necessary to hear and understand commands from a person. This open-source software can be integrated with the Robot Operating System (ROS), and the audio processing from the microphones can be easily tuned. An additional advantage is the ability to detect multiple acoustic sources simultaneously. It seemed to be exactly what we needed: the Japanese development was a good enough fit that we quickly stopped looking for alternatives.
For the first tests in 2014, we used a RASP LC microphone array with 8 microphones. Four microphones were located on the central part of the robot's chest, around the screen; three on the upper part of the chest, closer to the neck; and one at the center of the back, near the base of the neck. After that first test, we identified two crucial problems: vibration and the complexity of the calculations.
The thing is that the robot itself is a mechanism with a large number of moving parts, and each movement creates so-called microphone pickup, i.e. background noise that requires constant monitoring and analysis.
The second problem was the processing of the received audio data. The HARK developers offer two fundamentally different approaches to audio stream analysis. The first one is called geometric and boils down to an exact (down to tenths of a millimeter) description of the microphones' locations in space, along with their orientations. The second one, which has no special name, is based on a calibration model: it is created by repeatedly recording the same audio through the array from different points in the space around it.
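To give a sense of what a geometric model encodes, here is a rough Python sketch that converts microphone coordinates and an assumed arrival direction into per-microphone delays (a far-field, 2-D case; the coordinates and sample rate are made up for illustration):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, room temperature

def far_field_delays(mic_positions_m: np.ndarray, azimuth_deg: float,
                     sample_rate: int = 16000) -> np.ndarray:
    """Per-microphone arrival delays predicted by a purely geometric model.

    mic_positions_m: (n_mics, 2) microphone coordinates in metres
                     (a flat 2-D layout, e.g. around the chest screen)
    azimuth_deg:     assumed direction of the speaker
    Returns delays in samples relative to the microphone that hears the
    wavefront first. Any error in the coordinates shifts these predictions,
    which is exactly why manual assembly tolerances hurt the geometric model.
    """
    theta = np.deg2rad(azimuth_deg)
    toward_source = np.array([np.cos(theta), np.sin(theta)])
    # Projection onto the direction of the source: a larger projection means
    # the microphone is closer to the source and hears the wavefront earlier.
    proj = mic_positions_m @ toward_source
    delays_s = (proj.max() - proj) / SPEED_OF_SOUND
    return delays_s * sample_rate

# Hypothetical layout: four microphones around the screen, 10 cm apart.
mics = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
print(far_field_delays(mics, azimuth_deg=30.0))
```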
According to the creators, the geometric model was supposed to perform well in most cases. That did not happen with the first prototype of the communicative promobot. The fundamental problem was the accuracy of the microphone installation. Many assembly phases are still done manually today, let alone seven years ago, and it is not always possible to replicate the product perfectly. A millimeter of discrepancy is not critical to the human eye, but it turned out to be fatal for the precise software algorithms.
Eventually, when the program tried to calculate the correlation between eight 10 ms audio frames received from different microphones, using the inaccurate location model, the result was quite discouraging.
The second prototype got its own board with the microphone array. It was equipped with a powerful general-purpose microcontroller responsible for synchronizing the data collected from the microphones. The microphones themselves were placed around the screen on the robot's chest. One important problem was apparent in this model already: the robot heard its own speech. We decided to perform acoustic echo cancellation (AEC) of the robot's speech in the microphone data on the microphone array side.
The first trial was fairly simple. The audio signal with the robot's speech, taken just before the speakers, was also fed into the microphone array, and a Fourier-based algorithm was used to subtract the robot's speech from the data received from the microphones. The plan that seemed good on paper ran into the problem of de-synchronization in reality. Speech playback and its subsequent recognition are instantaneous by human standards, but not by the standards of the robot's speech-recognition algorithms. The delay was considerable enough to significantly degrade the suppression quality, leaving a distorted but recognizable robot's voice that poisoned the microphone data with its parasitic signal.
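The idea of that first trial can be sketched as a naive frequency-domain subtraction (a simplified illustration, not our production code). The snippet assumes the reference signal sent to the speakers and the microphone capture are already sample-aligned, which is precisely the assumption that failed in practice:

```python
import numpy as np

def spectral_subtract(mic: np.ndarray, reference: np.ndarray,
                      frame: int = 512) -> np.ndarray:
    """Naive frequency-domain subtraction of a known reference signal.

    mic, reference: float arrays of the same length, assumed to be
                    sample-accurately aligned.
    Per frame, the reference magnitude spectrum is subtracted from the
    microphone magnitude spectrum and the microphone phase is kept. If the
    alignment assumption breaks (the desynchronization described above),
    a distorted but audible residue of the reference survives.
    """
    out = np.zeros_like(mic)
    for start in range(0, len(mic) - frame + 1, frame):
        m = np.fft.rfft(mic[start:start + frame])
        r = np.fft.rfft(reference[start:start + frame])
        mag = np.maximum(np.abs(m) - np.abs(r), 0.0)  # clamp negative magnitudes
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(m)), n=frame)
    return out
```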
We did not give up and continued to look for alternatives. The next solution appeared easy and elegant: periodically play a special audio signal, catch it with the microphones, calculate the delay, and use this value to subtract the robot's speech from the microphones' signal. At this stage, however, we found that the microcontroller did not have enough resources for full-fledged subtraction, and installing a more expensive controller capable of quality subtraction was no longer justifiable from a financial perspective. Furthermore, a periodically beeping robot could reasonably raise some eyebrows.
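The delay measurement itself is simple. A minimal sketch (illustrative, not the firmware code) could look like this:

```python
import numpy as np

def estimate_delay_samples(probe: np.ndarray, captured: np.ndarray) -> int:
    """Estimate the playback-to-capture delay of a known probe signal.

    probe:    the audio signal that was sent to the speakers
    captured: what a microphone recorded while the probe was playing
    The lag with the highest cross-correlation is taken as the delay, in
    samples, to compensate before subtracting the robot's own speech.
    """
    corr = np.correlate(captured, probe, mode="full")
    lag = int(np.argmax(corr)) - (len(probe) - 1)
    return max(lag, 0)
```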
Organizing the suppression on the side of the robot's operating system seemed quite expensive in terms of development effort and computing resources. Moreover, in that case the delay between the data received from different sources becomes even less predictable.
Having seemingly reached a dead end, we nevertheless found a compromise: the microphones on the microphone array were switched off at the moment the audio signal went to the speakers. The clear advantage was that the robot would definitely not hear itself; it was very simple and worked flawlessly. The downside was the robot's “narcissism”: while it was talking, it stopped hearing anyone else. That was the price we had to pay for the compromise.
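In code, the whole compromise boils down to a trivial gate, roughly like this (an illustrative sketch, not the actual implementation):

```python
import numpy as np

class HalfDuplexGate:
    """The compromise in code: silence the microphones while the robot speaks."""

    def __init__(self):
        self.robot_is_speaking = False

    def on_playback(self, active: bool) -> None:
        # Called by the speech-output side when playback starts or stops.
        self.robot_is_speaking = active

    def filter_frame(self, frame: np.ndarray) -> np.ndarray:
        # While the robot talks, the array delivers silence instead of audio,
        # so the robot can never hear (or try to recognize) its own voice.
        return np.zeros_like(frame) if self.robot_is_speaking else frame
```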
An egoistic robot is not something we would like to supply to countries all over the world. Therefore, the hardware development department of “Promobot” continued looking for an optimal arrangement of the microphones and carried out HARK tests with the geometric model. Even with a separate rigid, vibration-isolated board, the geometric model failed to work the way we wanted. Hence, we started researching the process of creating a calibration model for our microphone array.
Calibration is carried out as follows: a sound source is placed at some distance from the microphone array and plays a special sound file, a sweep of frequencies ranging from 8 kHz down to 0 Hz over the course of a single second. At each of the predetermined coordinates this track is played 3 times and recorded by the microphone array into a calibration file. As a result, for every coordinate from which HARK is expected to catch speech, we end up with a set of calibration audio files.
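For illustration, a similar calibration sweep can be generated with a few lines of Python; the sample rate and output format here are assumptions, not the parameters of our actual track:

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import chirp

SAMPLE_RATE = 44100  # Hz; an assumption, the actual rate is not stated above

# One-second sweep from 8 kHz down to 0 Hz, similar in spirit to the
# calibration track described above (exact parameters are illustrative).
t = np.linspace(0.0, 1.0, SAMPLE_RATE, endpoint=False)
sweep = chirp(t, f0=8000.0, t1=1.0, f1=0.0, method="linear")

wavfile.write("calibration_sweep.wav", SAMPLE_RATE, (sweep * 32767).astype(np.int16))
```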
Having obtained a set of such recordings, we feed these files to one of the HARK applications, indicating which file corresponds to which point in space relative to the microphone array. At the output we get what is called a model: a list of positions in space from which sound can reach HARK, and a set of coefficients for each of them. These coefficients are later used in sophisticated analytical procedures to localize and isolate the sound within the overall data stream.
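Roughly speaking, the coefficients describe how each calibrated position “sounds” on each microphone relative to a reference microphone. A toy estimate of such coefficients from one recorded sweep might look like this (HARK's real transfer-function format and estimation procedure are more involved; this is only a conceptual sketch):

```python
import numpy as np

def relative_transfer_coefficients(recordings: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Toy estimate of per-position calibration coefficients.

    recordings: (n_mics, n_samples) capture of the sweep played from ONE
                known position around the array.
    Returns an (n_fft // 2 + 1, n_mics) array of complex coefficients: for
    each frequency bin, how that position sounds on every microphone
    relative to microphone 0. A full model keeps one such set per
    calibrated position.
    """
    spec = np.fft.rfft(recordings[:, :n_fft], axis=1)  # one short frame is enough for a sketch
    ref = spec[0] + 1e-12                              # avoid division by zero
    return (spec / ref).T
```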
The procedure is not the fastest, and a mistake made at any stage can easily ruin the final model.
When we started creating our own calibration model, the first thing that became obvious was just how much noise there is around. Quite a lot of it, and of a very diverse nature. What we don't notice in everyday life turns out to be critical when creating something as fragile as a clean sound recording. Someone stomped clumsily during the recording? Record it again. We started working night shifts, but at night the echo became clearly audible. Constantly falling into our own traps, we ended up creating our first calibration room: we took all the furniture out of one of the meeting rooms, draped the walls with thick, dense fabric, and ran the calibration process at night. Yet even at night the street is not quiet enough. Tired of re-recording because of every passing car, we went to a recording studio for the first time.
However, what works in the laboratory does not always perform well “in the field”. At the next exhibition, our new promobot showed quality far below what we had expected: the environment was too noisy, with too many different sound sources.
The "Promobot" office has its own recording studio now. We started muffling external sounds, taking into account the natural noises of the switched-on robot, and processing the words in the variety of their pronunciations. Our calibration laboratory has also detected a huge number of problems that prevent the robot from communicating in the effective fully-functional way.
Despite all its advantages, HARK remained a black box for us in many ways. It worked well in specific scenarios, but once you started manipulating the parameters, its behavior could become difficult to predict.
For example, we discovered that the robot does not hear quiet voices; it is, apparently, not fond of shy people. In order not to overload the system, we tried to exclude background noises and other parasitic signals from processing in the speech recognition module. The only method available to us in HARK was setting a sensitivity threshold in decibels. By setting a certain threshold, we protect ourselves from processing “noise”, and soft-spoken people with quieter voices “pay” for it. If the threshold is lowered too much, the robot starts hearing people 20 meters away and spends resources on processing every tiny sound.
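In code, the whole trade-off boils down to a single level check (a minimal sketch; the numeric threshold is arbitrary):

```python
import numpy as np

def passes_threshold(frame: np.ndarray, threshold_db: float = -35.0) -> bool:
    """Forward a frame to recognition only if it is loud enough.

    frame: mono float samples in [-1, 1]; the threshold value is arbitrary.
    Raise the threshold and quiet speakers are dropped together with the
    noise; lower it and every distant sound gets processed.
    """
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    return 20.0 * np.log10(rms) > threshold_db
```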
On the other hand, the noise level at exhibitions can sometimes be so high that the HARK settings have to be adjusted manually on the spot, raising the sensitivity threshold so that the robot hears only those who speak loudly right next to it. And while the robot is speaking, it does not hear anyone at all. There are, of course, small pauses between words in which it can catch a human voice and interrupt itself upon realizing that someone is talking to it, but that is extremely rare.
We did a lot of research, and by that time we had already abandoned hardware echo cancellation three times. At the same time, we still have not found any system that is guaranteed to outperform HARK. Having chosen our path, we have to follow it to the end.
Today, the engineers of “Promobot” have created their own algorithm for detecting the presence of speech in a sound signal (VAD, voice activity detection). It works in conjunction with HARK and prepares the signal for processing. Promobots have become more tactful: they have learned to hear better and to react with an appropriate response. Anyone can try it out at the next exhibition or forum featuring Promobot. However, as engineers, we know that human communication and human-robot interaction are two completely different things.
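Returning to the VAD for a moment: as a rough idea of what such a detector does, here is a minimal energy-based sketch with a short “hangover” that bridges pauses between words. It is only an illustration, not the actual Promobot algorithm:

```python
import numpy as np

class SimpleVAD:
    """Minimal energy-based voice activity detector (illustration only).

    A frame counts as speech when its level rises a margin above a tracked
    noise floor; a short "hangover" keeps the decision on through brief
    pauses between words. All tuning values are arbitrary.
    """

    def __init__(self, margin_db: float = 6.0, hangover_frames: int = 8):
        self.noise_floor_db = -50.0
        self.margin_db = margin_db
        self.hangover_frames = hangover_frames
        self.hangover = 0

    def is_speech(self, frame: np.ndarray) -> bool:
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
        level_db = 20.0 * np.log10(rms)
        # Track the noise floor: follow drops immediately, drift up slowly
        # so the estimate can recover when the environment gets louder.
        if level_db < self.noise_floor_db:
            self.noise_floor_db = level_db
        else:
            self.noise_floor_db += 0.05
        if level_db > self.noise_floor_db + self.margin_db:
            self.hangover = self.hangover_frames
            return True
        if self.hangover > 0:
            self.hangover -= 1
            return True
        return False
```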
To bring the robot closer to the level of human communication, we still have a huge amount of work to do. What is encouraging is that once we create a new software solution, even the models from previous years can simply be updated. Humans, in that respect, are a very different story.