The self-driving industry has grown significantly in recent years, with development focused primarily on sensors such as cameras, radars, and lidars. Microphones and audio, by contrast, have received little attention in autonomous vehicles (AVs), with the exception of emergency vehicle recognition. In this article, we will explore the possible benefits and applications of incorporating microphones and audio processing into autonomous driving systems.
Tesla famously relies primarily on cameras rather than other sensor types [6, 7], based on the premise that visual data provides enough information for machine learning models to reach the necessary level of accuracy. Humans do rely heavily on vision, but we also hear and interpret the sounds of nearby objects. When walking down a busy street, for example, we can estimate the location of a passing car and make a good guess at what kind of vehicle it is from its sound alone. We can even infer something about its physical properties from the sound of its interaction with the environment.
Sound is a pressure wave that propagates through a medium such as a gas or liquid and is detected by the ear. Typically, humans can perceive sound waves with frequencies from roughly 16-20 Hz up to 15-20 kHz, corresponding to wavelengths of approximately 17 m down to 1.7 cm.
Using two ears, known as binaural hearing, lets humans locate the direction of a sound source. When we face the source, the sound waves reach both ears in phase. When the source moves or the head turns, the waves arrive at each ear with a slightly different phase and arrival time, allowing the brain to infer the direction of the source.
This localization ability works best for low-frequency waves whose wavelengths are longer than the distance between the ears. Additionally, changes in the intensity of the sound wave can signal whether the object producing the sound is approaching or moving away, as the intensity decreases with the square of the distance from the source.
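For a point source radiating acoustic power $P$ uniformly in all directions, this is just the inverse-square law:

$$I(r) = \frac{P}{4 \pi r^2},$$

so a steadily rising intensity is one cue that the source is getting closer.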
Throughout history, humans have needed to locate objects, and prior to the advent of modern signal processing algorithms and hardware, they often relied on sound to do so. In fact, sound localization proved quite successful for many tasks, including the detection of enemy aircraft [31] before the invention of radar, as well as for identifying ships in foggy conditions [25].
To achieve this, humans exploited the binaural effect, which allows us to determine the direction of a sound source by comparing the differences in the intensity and arrival time of sound waves at each ear. This ability was further enhanced for weak sounds by using horns or other amplification devices (Figure 1).
Special installations resembling large concrete mirrors, used to detect enemy aircraft raids, have been preserved on the Kent coast of England [1]. Known as "sound mirrors", these structures used a large concrete parabolic dish to collect sound waves coming from over the English Channel and focus them on a microphone, amplifying the sound. This allowed operators to detect incoming aircraft raids 20-30 minutes before they arrived. The principle of operation is that sound waves can be reflected and concentrated by a curved surface, much as a parabolic mirror focuses light. By collecting and amplifying sound in this way, the sound mirrors could pick up the distinctive sound of approaching aircraft engines from long distances, giving the military advance warning. Today, these sound mirrors stand as a testament to the ingenuity of wartime engineers and scientists.
Since a shock sound wave is generated at the moment of firing, sound localization is actively used to locate artillery installations or snipers. Synchronized microphones are used to measure the time taken for the sound wave to reach each microphone, enabling the calculation of the source's location based on the known speed of sound propagation (Figure 4). Gunfire locators and other small systems (Figure 3) still utilize this technology today [12].
It's fascinating how this technology can benefit humanity in various ways, such as in smart speakers and autonomous vehicles. Let's explore its implementation in greater detail and understand how it can enhance the capabilities of these devices.
Let's explore the cases where acoustic sensors can be used and the potential benefits of acoustic data.
Objects that emit sound signals, such as ambulances and other vehicles with sirens, can be located directly through sound localization, yielding a more precise estimate of their position. Audio also helps distinguish between object types, such as trucks and passenger cars; [4] provides models for classifying the sound of passenger car engines.
Incorporating sound into sensor fusion can significantly enhance the accuracy of object classification, particularly in scenarios with poor visibility, such as fog, or when other sensors are dirty. Audio is an independent, largely uncorrelated data source, which is exactly what sensor fusion algorithms assume about the measurement errors of the sensors they combine. Furthermore, sound sensors can capture phenomena that other sensors cannot, such as a metallic clang indicating breakage, which is useful for fault detection [25, 38].
Additionally, microphones are a more cost-effective and energy-efficient option compared to cameras or lidars, making them a popular choice for certain applications. Furthermore, they are often considered to be more privacy-preserving since they don't capture visual images, and they require less network bandwidth to transmit the data they collect [32].
Sound waves can bend around obstacles, enabling us to obtain measurements from objects that are not directly visible. This is possible because acoustic wavelengths in air are comparable to the sizes of typical obstructions in road scenarios, so sound diffracts around them. In some driving situations, environmental sound may be the most effective modality for situational awareness. Imagine an unmanned vehicle next to an obstacle that shadows its sensors (cameras, lidars, and radars), or stuck behind a car and unable to see what is in front of it. Smaller robots, such as delivery robots, are even more limited by non-line-of-sight (NLOS) situations, which render cameras, lidars, and possibly radars useless. However, if the moving object produces a characteristic audio pattern, such as the sound of a motorcycle, we can track it by localizing acoustic sources hidden from direct view (Figure 5), as analyzed in [22].
Microphones can be mounted on the roof of an autonomous vehicle to form a microphone array (Figure 6), which can consist of more than two microphones [11]. Similar arrays are used in smart speakers and even in gaming consoles such as the Kinect, which can also be used in audiovisual applications [41]. In mass production, a microphone array should not be very expensive, and it can be designed so that it does not require cleaning, unlike cameras and lidars.
In addition to standalone microphone arrays, there are acoustic cameras [18], which combine a camera with a microphone array. They output a heat map of sound intensity overlaid on the image, showing where the sound source is located (Figure 7). This is useful when we need to pinpoint a noise source, such as a clanging defect in an industrial machine. A DIY acoustic camera (Figure 7) can be built following the guidelines in [17], using the acoular library [13] for beamforming.
However, wind noise and the vehicle's own noise can degrade the accuracy of sound localization, so the microphones need to be shielded from the wind. Special foam or fur attachments (Figure 9) can be used for this [5]; they are effective against wind speeds of up to about 1 m/s (roughly 2 mph). Natural wind is not the main problem, though; the airflow at driving speeds of up to 250 km/h is what imposes real limitations. An aerodynamic outer shell for the sensor is necessary to mitigate this, and the sensor may also need to be hidden behind an aerodynamic element [28].
Another type of microphone used in automotive aerodynamic research is the surface microphone [8, 9]. It is designed to be installed directly on the vehicle's surface (Figure 10) during wind tunnel testing or for measurements in confined spaces, such as on a firewall or the underside of a vehicle, and is used to analyze the sounds generated by the vehicle body. Cheaper analogues based on such microphones could be built that look like parking sensors rather than furry attachments.
It may be necessary to use an anechoic chamber (Figure 11) for calibrating such sensors. This type of chamber eliminates reflections from walls by using special sound-absorbing materials on the walls. Anechoic chambers are much simpler to construct for sound than for, say, radio.
To determine the optimal number and placement of microphones on a vehicle, an optimization problem can be formulated. In [33], a compromise is found between low-frequency angular resolution (which requires larger distances between microphones) and high-frequency angular resolution (which requires smaller distances). By applying the Radon transform and specifying the frequency range we want to distinguish, as well as the allowed direction-of-arrival (DOA) resolution, the number and positions of microphones can be optimized. The result is a set of microphone assemblies (each consisting of closely spaced microphones) placed along the contour of the vehicle; in this case, four assemblies are recommended.
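To get a feel for this trade-off, here is a back-of-the-envelope sketch with assumed numbers (not the Radon-transform formulation of [33]): the spacing inside an assembly is bounded by the spatial-aliasing condition at the highest frequency, while the overall aperture governs angular resolution at the lowest frequency.

```python
# Rough spacing bounds for a vehicle microphone array (illustrative numbers only).
c = 343.0                      # speed of sound in air, m/s
f_min, f_max = 100.0, 8000.0   # assumed frequency band of interest, Hz

# To avoid spatial aliasing, adjacent microphones in an assembly should sit
# closer than half the shortest wavelength of interest.
d_max = c / (2 * f_max)        # ~2.1 cm

# Angular resolution at the lowest frequency improves with total aperture;
# one wavelength at f_min is a common rule-of-thumb minimum.
aperture_min = c / f_min       # ~3.4 m

print(f"spacing within an assembly <= {d_max * 100:.1f} cm")
print(f"overall aperture across the vehicle >= {aperture_min:.1f} m")
```

This is exactly why closely spaced assemblies distributed along the vehicle's contour work well: the small spacing inside each assembly handles high frequencies, while the large baseline between assemblies handles low ones.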
Once an audio signal is digitized, it becomes a series of numbers sampled at regular intervals at a rate known as the sampling frequency. According to the Nyquist theorem, to represent the signal accurately, the sampling frequency must be at least twice the highest frequency component it contains. For instance, a sampling frequency of 44.1 kHz is commonly used when digitizing music.
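In symbols, a signal whose highest frequency component is $f_{\max}$ must be sampled at a rate

$$f_s \geq 2 f_{\max},$$

so 44.1 kHz comfortably covers the roughly 20 kHz upper limit of human hearing.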
To prepare data for modeling, it is often necessary to extract features. One common approach for audio is to use the Fourier transform to convert the time-domain signal into a frequency-domain representation, or spectrum. This can be visualized as a spectrogram (Figure 12), which shows how the frequency content is distributed over time. Spectrograms reveal both the temporal and the spectral characteristics of a sound and are widely used in audio analysis and machine learning.
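As a minimal sketch of how such a representation is computed (the file name and STFT parameters below are placeholders):

```python
# Spectrogram sketch; "recording.wav" is a placeholder file name.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, x = wavfile.read("recording.wav")              # sampling rate and samples
if x.ndim > 1:
    x = x.mean(axis=1)                             # mix down to mono

f, t, Sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=512)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))   # power in dB over time
plt.xlabel("time, s")
plt.ylabel("frequency, Hz")
plt.show()
```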
Human perception of loudness and pitch is roughly logarithmic, so the linear spectrum is usually mapped onto the mel scale and log-compressed. This is where Mel Frequency Cepstral Coefficients (MFCCs) come in. MFCCs are a set of features originally developed for speech processing and now widely used in other applications, including Music Information Retrieval (MIR) [23]. They convert the audio signal into a compact, informative representation suitable for modeling.
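With librosa, for example, extracting MFCCs takes a few lines; the file name and the choice of 13 coefficients are just common conventions, not prescribed by the cited papers:

```python
# MFCC extraction sketch; "engine.wav" is a placeholder recording.
import numpy as np
import librosa

y, sr = librosa.load("engine.wav", sr=None)           # keep the native sampling rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape: (13, n_frames)

# One simple fixed-length feature vector for a classifier: per-coefficient statistics.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```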
Sound separation is a common issue that arises when a microphone captures sound from multiple sources simultaneously. This is known as the Cocktail Party Problem, where we aim to separate the sound from each source. A similar problem arises when separating instruments and vocals in a recorded music track. To solve this problem, a modification of the classic Non-Negative Matrix Factorization (NMF) algorithm is proposed in [19]. The output features are then fed into a machine-learning model. Alternatively, a deep learning model can be trained for this purpose [21]. Other algorithms, including NMF, are presented in [43].
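To illustrate the general NMF idea (a toy sketch, not the specific modification proposed in [19]): the magnitude spectrogram is factored into spectral templates and their activations, and a source is reconstructed from a subset of components via a soft mask.

```python
# Toy NMF source-separation sketch. Assumes a mixture signal `x` at sampling
# rate `fs` is already loaded as a 1-D numpy array.
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import NMF

f, t, X = stft(x, fs=fs, nperseg=1024)
V = np.abs(X)                                   # non-negative magnitude spectrogram

model = NMF(n_components=8, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)                      # spectral templates, (freq, k)
H = model.components_                           # activations, (k, frames)

# Rebuild one "source" from a hand-picked subset of components (here 0..3),
# using a soft mask and the phase of the original mixture.
V_hat = W @ H + 1e-12
mask = (W[:, :4] @ H[:4, :]) / V_hat
_, source = istft(mask * X, fs=fs, nperseg=1024)
```

In practice, deciding which components belong to which source is the hard part, which is where the learned approaches of [19, 21, 43] come in.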
Assistance in self-localization. Acoustic data could also be used to help localize the ego vehicle itself.
Now let's turn to object detection. Both model-based approaches and newer ML approaches can be used here: the former require no training data, while the latter give better results in some cases.
Model-based object detection and tracking. There are several methods for this task: time difference of arrival, beamforming, and holography-based approaches.
Time difference of arrival (TDOA). The simplest approach is the classic time difference of arrival.
We have an array of microphones spaced a distance s apart, each capturing sound from the source. Because the wavefront arrives at each microphone with a different delay, we can compute the cross-correlation of the signals for each microphone pair to determine the delay $\tau_{i,j}$ (where i and j denote the pair of microphones). This delay is directly proportional to the path difference d travelled by the wavefront between the two microphones. From d and s, we can calculate the direction of arrival $\theta$. Each microphone pair thus yields its own estimate of the direction of arrival; we can cluster these estimates and output the cluster centers as the final answer.
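A minimal two-microphone sketch of this idea, assuming a far-field source, a known spacing s, and already-synchronized signals sig1 and sig2 sampled at fs (all assumed inputs):

```python
# Direction of arrival from one microphone pair via cross-correlation (far-field).
import numpy as np

c = 343.0     # speed of sound, m/s
s = 0.2       # microphone spacing, m (assumed)

corr = np.correlate(sig1, sig2, mode="full")
lag = int(np.argmax(corr)) - (len(sig2) - 1)    # delay of sig1 relative to sig2, samples
tau = lag / fs                                  # delay in seconds

# Path difference d = c * tau; for a far-field source, sin(theta) = d / s.
sin_theta = np.clip(c * tau / s, -1.0, 1.0)
theta = np.degrees(np.arcsin(sin_theta))
print(f"estimated direction of arrival: {theta:.1f} degrees off broadside")
```

With more than two microphones, the same computation runs over every pair, and the per-pair estimates are clustered as described above.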
Unfortunately, we lack any knowledge regarding the moment of sound radiation or its initial intensity, which prevents us from directly measuring the distance. However, for objects where we have prior knowledge of the signal's nature, such measurement is possible. We may also employ triangulation using multiple directions of arrival estimates from independent microphone arrays, but this method has been shown to have significant errors [45]. Alternatively, we can use a ray tracer and construct hypotheses regarding the object's location, accounting for the received wave, the direct line of sight, and the reflected wave. In [22], research is conducted to determine the position of a sound source, even if it is not visible, from several reflected signals.
After estimating the direction of arrival (DOA), various filtering techniques can be applied to enhance the accuracy of the estimation. For instance, a particle filter [29] can be used. Alternatively, one can work in the frequency domain instead of the time domain. In [37], time-difference-of-arrival (TDOA) is used, and it is shown that by incorporating a high-definition (HD) map into the model and building appropriate hypotheses, detection and tracking can be improved, particularly for overtaking and close passes near the ego vehicle.
Beamforming [46] is a powerful technique for focusing on a specific sound source in space. When the focus is steered toward the actual sound source, an energy function produces a peak; the approach relies on the assumption that more energy arrives from the direction of the source than from other directions. With circular microphone arrays, the search must be performed over the entire field of view (FoV), which makes the process more complex and time-consuming. The technique is similar to passive radar, where the receiver only listens without emitting, and multiple receivers form a beam on the phased-array principle. The acoular library [13] can be used for beamforming, and spatial audio features can also be computed in PyTorch [14] to further improve performance.
Figure 15 demonstrates the simplest form of delay-and-sum beamforming, which involves a set of microphones and the signals they capture. Because the microphones are at slightly different distances from the sound source, we can estimate the position of the source by determining the delays between the signals from each microphone. The delays are proportional to the distances traveled by the sound wave to each microphone. The goal is to find the delays for each microphone such that the sum of the delayed signals has the highest energy peak, indicating the direction of the sound source (case a). In contrast, if we use delays from (case a) to estimate the direction of a different sound source (case b), the peak will not be at the same position, and we will need to estimate the delays again for the new source.
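A compact sketch of delay-and-sum with integer-sample delays for a linear array; `signals` (one row per microphone), the microphone coordinates `mic_x`, and `fs` are assumed inputs:

```python
# Delay-and-sum beamforming sketch for a uniform linear array (far-field model).
import numpy as np

c = 343.0
angles = np.deg2rad(np.arange(-90, 91, 2))            # candidate directions

def steered_power(signals, mic_x, fs, theta):
    delays = mic_x * np.sin(theta) / c                # relative delays, seconds
    shifts = np.round(delays * fs).astype(int)
    shifts -= shifts.min()                            # keep all shifts non-negative
    n = signals.shape[1] - shifts.max()
    aligned = np.stack([sig[sh:sh + n] for sig, sh in zip(signals, shifts)])
    beam = aligned.sum(axis=0)                        # signals add up coherently
    return float(np.mean(beam ** 2))                  # output energy for this steering

powers = [steered_power(signals, mic_x, fs, th) for th in angles]
doa = np.rad2deg(angles[int(np.argmax(powers))])
print(f"delay-and-sum estimate: {doa:.0f} degrees")
```

Libraries such as acoular [13] implement far more refined versions of this idea.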
Based on the previous description, it is evident that we can estimate the direction of arrival accurately. However, accurately determining the distance to the object is not straightforward, as it requires performing reverse ray tracing. Nonetheless, we can switch from a model-based approach to a data-driven approach.
First, we can simplify the task by shifting the focus from object detection to scene classification. Scene classification matters not only for autonomous vehicles but also in robotics, where robots operate in a variety of environments. In a recent study [36], researchers combined CNN image features with MLP features computed from MFCC sound representations to classify scenes. The results showed a significant improvement in classification accuracy, to 79.93%, compared to 65.92% when using image information alone, demonstrating the potential of audio in scene classification tasks.
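A schematic sketch of such late fusion in PyTorch; the layer sizes, input shapes, and class count are placeholders and do not reproduce the actual architecture of [36]:

```python
# Late-fusion scene classifier sketch: a CNN branch for images and an MLP
# branch for MFCC statistics, concatenated before the classification head.
import torch
import torch.nn as nn

class AudioVisualSceneNet(nn.Module):
    def __init__(self, n_mfcc_features=26, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(                        # image branch
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # -> (batch, 32)
        )
        self.mlp = nn.Sequential(                        # audio (MFCC) branch
            nn.Linear(n_mfcc_features, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
        )
        self.head = nn.Linear(32 + 32, n_classes)        # fused classifier

    def forward(self, image, mfcc_stats):
        fused = torch.cat([self.cnn(image), self.mlp(mfcc_stats)], dim=1)
        return self.head(fused)

model = AudioVisualSceneNet()
logits = model(torch.randn(4, 3, 128, 128), torch.randn(4, 26))   # dummy batch
```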
Further development and application of various ML algorithms is presented in [24]. It is possible to work with audio data alone, but the question then arises of what to do when there are no labels. In [26], a simpler task is likewise chosen: classification into no object, object on the right, or object on the left.
In [32], a self-supervised approach for object detection and tracking is proposed. Self-supervised learning is a type of machine learning in which the model learns from the data itself, without human-provided labels: it is trained to predict certain properties or transformations of the data, using the data's inherent structure as its own supervisory signal. Their model consists of a visual "teacher" network and a stereo-sound "student" network (Figure 16). During training, knowledge from a well-established visual vehicle detection model is transferred to the audio domain using unlabeled videos as a bridge. At test time, the stereo-sound student network performs object localization on its own, using only stereo audio and camera metadata, without any visual input. Their auditory object tracking proves robust in poor lighting conditions, where traditional vision-based tracking often fails. However, the authors also observed failure cases with fast-moving vehicles and noisy sounds such as construction, wind, and precipitation. The study presents interesting cases where their StereoSoundNet successfully tracks moving cars despite occlusion, backlighting, reflection, and poor lighting, while visual object localization fails. In [35], this self-supervised approach is developed further using more than two microphones; the main advances are the use of contrastive learning for the task and the public release of their dataset.
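The core of such teacher-student training can be summarized as a distillation loss: the audio student is pushed to reproduce the output (for example, a localization heat map) that a frozen visual detector produces on unlabeled clips. A hedged sketch, with all module names hypothetical:

```python
# Cross-modal distillation sketch (module names and shapes are illustrative only).
# `visual_teacher` is a frozen, pretrained visual vehicle detector; `audio_student`
# must learn to produce the same localization map from stereo spectrograms.
import torch
import torch.nn.functional as F

def distillation_step(visual_teacher, audio_student, frames, stereo_spec, optimizer):
    with torch.no_grad():
        target_map = visual_teacher(frames)      # pseudo-labels, no human annotation
    pred_map = audio_student(stereo_spec)        # same output resolution assumed
    loss = F.mse_loss(pred_map, target_map)      # pull the student toward the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```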
It is interesting to note that in [39], good results were achieved for non-line-of-sight tracking using radar. The approach involves detecting flat surfaces from which signals can be reflected and making detections based on a three-bounce reflection hypothesis (car → wall → object → wall → car). In the acoustic domain, [40] conducted an analysis in an anechoic chamber.
Recently, the use of multimodal datasets, which combine different types of data such as text, image, or video, has gained great popularity. This trend has also extended to the field of autonomous driving, where there is a wealth of datasets available that include data from cameras, lidars, and radars. However, it is unfortunate that audio data is often overlooked and is not included in many of these datasets. Despite the potential benefits of audio data, such as its ability to provide valuable information about the environment and the behavior of other road users, it is still an underutilized modality in the context of autonomous driving.
Several multimodal datasets for autonomous driving have emerged in recent years, with data from cameras, lidars, and radars, but audio data is still a rarity. Notably, the OLIMP dataset [15, 16] features a variety of modalities, including a camera, ultra-wideband radar, narrow-band radar, and acoustic sensors, and was collected from a stationary rover. It contains 407 scenes and 47,354 synchronized frames across four categories: pedestrian, cyclist, car, and tram. Another dataset [26] was recorded at five T-junctions with blind corners around a city, while a third [34] includes over 70 minutes of time-synchronized audio and video recordings of vehicles on roads, with more than 300 bounding box annotations. Additionally, the tool in [42] provides simulation of sound sources and receivers.
Most likely, we don't see such sensors on cars today for the following reasons. Autonomous driving companies analyze trip logs to see where disengagements or other incidents occurred, and the number of cases where an audio sensor would have helped is probably small compared to other problems. Another obstacle is that integrating audio into the pipeline requires complex signal processing. Still, as this article has shown, the sensors themselves are inexpensive, can improve sensor fusion, and can solve problems that other sensors cannot, such as detecting non-line-of-sight objects. So we will most likely see them on robots in the near future.
https://www.atlasobscura.com/places/greatstone-sound-mirrors
Wieczorkowska, Alicja, et al. "Spectral features for audio based vehicle and engine classification." Journal of Intelligent Information Systems 50.2 (2018): 265-290. https://link.springer.com/article/10.1007/s10844-017-0459-2
https://www.movophoto.com/blogs/movo-photo-blog/what-is-microphone-wind-muff-for
https://www.bksv.com/-/media/literature/Product-Data/bp2055.ashx (Automotive Surface Microphones — Types 4949 and 4949 B)
https://www.grasacoustics.com/products/special-microphone/surface-microphones
Wikipedia. 2023. "Sound localization." Wikimedia Foundation. Last modified January 3, 2023. https://en.wikipedia.org/wiki/Sound_localization.
https://www.bksv.com/en/transducers/acoustic/microphone-array
Mimouna, A., Alouani, I., Ben Khalifa, A., El Hillali, Y., Taleb-Ahmed, A., Menhaj, A., Ouahabi, A., & Ben Amara, N. E. (2020). OLIMP: A Heterogeneous Multimodal Dataset for Advanced Environment Perception. Electronics, 9(4), 560. https://doi.org/10.3390/electronics9040560
https://sites.google.com/view/ihsen-alouani/datasets?authuser=0#h.p_Up2TVj2xKDQ8
https://navat.substack.com/p/diy-acoustic-camera-using-uma-16
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
King, E. A., A. Tatoglu, D. Iglesias, and A. Matriss. 2021. “Audio-Visual Based Non-Line-of-Sight Sound Source Localization: A Feasibility Study.” Applied Acoustics 171 (January): 107674.
Rawat, Raghav, Shreyash Gupta, Shreyas Mohapatra, Sujata Priyambada Mishra, and Sreesankar Rajagopal. 2021. “Intelligent Acoustic Module for Autonomous Vehicles Using Fast Gated Recurrent Approach.” arXiv [cs.LG] . arXiv. http://arxiv.org/abs/2112.03174.
Bianco, Michael J., Peter Gerstoft, James Traer, Emma Ozanich, Marie A. Roch, Sharon Gannot, and Charles-Alban Deledalle. 2019. “Machine Learning in Acoustics: Theory and Applications.” The Journal of the Acoustical Society of America 146 (5): 3590.
Liu, Yangfan, J. Stuart Bolton, and Patricia Davies. n.d. “Acoustic Source Localization Techniques and Their Applications.”
Schulz, Yannick, Avinash Kini Mattar, Thomas M. Hehn, and Julian F. P. Kooij. 2020. “Hearing What You Cannot See: Acoustic Vehicle Detection Around Corners.” arXiv [cs.RO] . arXiv. https://doi.org/10.1109/LRA.2021.3062254. https://arxiv.org/pdf/2007.15739.pdf
https://github.com/tudelft-iv/occluded_vehicle_acoustic_detection
https://www.idmt.fraunhofer.de/en/institute/projects-products/projects/the_hearing_car.html
Mizumachi, Mitsunori, Atsunobu Kaminuma, Nobutaka Ono, and Shigeru Ando. 2014. “Robust Sensing of Approaching Vehicles Relying on Acoustic Cues.” Sensors 14 (6): 9546–61.
Van der Voort, A. W. M., and Ronald M. Aarts. "Development of Dutch sound locators to detect airplanes (1927–1940)." Proceedings NAG/DAGA, Rotterdam, The Netherlands, March 23–26, 2009.
Gan, Chuang, Hang Zhao, Peihao Chen, David Cox, and Antonio Torralba. 2019. “Self-Supervised Moving Vehicle Tracking with Stereo Sound.” arXiv [cs.CV] . arXiv. http://arxiv.org/abs/1910.11760.
Barak, Ohad, Nizar Sallem, and Marc Fischer. n.d. “Microphone Array Optimization for Autonomous-Vehicle Audio Localization Based on the Radon Transform.” Accessed March 8, 2023. https://dcase.community/documents/workshop2020/proceedings/DCASE2020Workshop_Barak_72.pdf.
Zürn, Jannik, and Wolfram Burgard. 2022. “Self-Supervised Moving Vehicle Detection from Audio-Visual Cues.” arXiv [cs.CV] . arXiv. http://arxiv.org/abs/2201.12771.
Bird, Jordan J., Diego R. Faria, Cristiano Premebida, Anikó Ekárt, and George Vogiatzis. 2020. “Look and Listen: A Multi-Modality Late Fusion Approach to Scene Classification for Autonomous Machines.” arXiv [cs.CV] . arXiv. http://arxiv.org/abs/2007.10175.
Jiang, Kun, Diange Yang, Benny Wijaya, Bowei Zhang, Mengmeng Yang, Kai Zhang, and Xuewei Tang. 2021. “Adding Ears to Intelligent Connected Vehicles by Combining Microphone Arrays and High Definition Map.” IET Intelligent Transport Systems 15 (10): 1228–40. https://doi.org/10.1049/itr2.12091.
Gong, Cihun-Siyong Alex, Chih-Hui Simon Su, Yuan-En Liu, De-Yu Guu, and Yu-Hua Chen. 2022. “Deep Learning with LPC and Wavelet Algorithms for Driving Fault Diagnosis.” Sensors 22 (18). https://doi.org/10.3390/s22187072.
Scheiner, Nicolas, Florian Kraus, Fangyin Wei, Buu Phan, Fahim Mannan, Nils Appenrodt, Werner Ritter, et al. 2019. “Seeing Around Street Corners: Non-Line-of-Sight Detection and Tracking In-the-Wild Using Doppler Radar.” arXiv [cs.CV]
Lindell, David B., Gordon Wetzstein, and Vladlen Koltun. 2019. “Acoustic Non-Line-Of-Sight Imaging.” In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , 6773–82. https://doi.org/10.1109/CVPR.2019.00694.
Tadjine, Hamma, and Daniel Goehring. n.d. “Acoustic/lidar Sensor Fusion for Car Tracking in City Traffic Scenarios.” Accessed March 9, 2023. https://www.mi.fu-berlin.de/inf/groups/ag-ki/publications/Acoustic_Lidar-sensor/tadjine15fastzero.pdf.
Damiano, Stefano, and Toon van Waterschoot. n.d. "Pyroadacoustics: A Road Acoustics Simulator Based on Variable Length Delay Lines."
Fachbereich, Vom, and Yury Furletov. n.d. “Sound Processing for Autonomous Driving.” Accessed January 25, 2023. https://tuprints.ulb.tu-darmstadt.de/22090/1/Furletov_Thesis_RMR.pdf.
Santana, Leandro de. 2017. “Fundamentals of Acoustic Beamforming.” NATO Educational Notes EN-AVT-287 4.