Patient-Specific CNN-RNN for Lung-Sound Detection With 4× Smaller Memory

Written by modeltuning | Published 2025/09/09
Tech Story Tags: convolutional-neural-networks | deep-neural-networks-(dnn) | respiratoryai | wearable-healthcare-devices | patient-specific-model-tuning | recurrent-neural-networks | mel-spectrograms | weight-quantization

TL;DR: Log-quant cuts model memory 4× with minimal accuracy loss; compute fits mobile SoCs, enabling real-time, wearable lung-sound screening.

Table of Links

Abstract and I Introduction

II. Materials and Methods

III. Results and Discussions

IV. Conclusion and References

III. RESULTS AND DISCUSSIONS

A. Generalized Model

Firstly, we evaluated our model on the four-class breathing cycle classification task using the micro and macro metrics described in Section II-B. To compare our results with traditionally used CNN architectures, we used VGGNet and MobileNet. All models were trained and tested on a workstation with an Intel Xeon E5-2630 CPU, 128 GB RAM and an NVIDIA TITAN Xp GPU. The results, averaged over five randomized train-test sets, are tabulated in Table II. As can be seen from Table II, the proposed hybrid CNN-RNN model trained with data augmentation produces state-of-the-art results. Both VGG-16 and MobileNet produce slightly lower scores in terms of both macro and micro metrics. The score obtained by the proposed model also outperforms the results reported by Kochetov et al. [41] using noise labels on a similar 80-20 split (Table I). We also performed 10-fold cross-validation on the dataset for our proposed model, and the average score obtained is 66.43%. Due to the unavailability of similar audio datasets in the biomedical field, we also tested the proposed hybrid model on the TensorFlow speech recognition challenge [47] to benchmark its performance. For an eleven-class classification with a 90%-10% train-test split, it produced a respectable accuracy of 96%. For the sake of completeness, we also tested the dataset using the same train-test split strategy with a variety of commonly used temporal and spectral features (RMSE, ZCR, spectral centroid, roll-off frequency, entropy, spectral contrast, etc. [48]) combined with non-DL methods such as SVM, shallow neural networks, random forest and gradient boosting. The resulting scores were significantly lower (44.5-51.2%).
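To make the non-DL baseline concrete, the sketch below (not the authors' exact pipeline) extracts a handful of the listed spectral features with librosa and trains an RBF-SVM. The file list `cycle_files`, labels `cycle_labels`, the 4 kHz resampling rate and the SVM hyperparameters are illustrative assumptions.

```python
# Hedged sketch of the feature-based (non-DL) baseline: hand-crafted spectral
# features + SVM. Paths, labels, sampling rate and hyperparameters are placeholders.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

def extract_features(path, sr=4000):
    """Summarize one breathing-cycle clip with a few spectral statistics."""
    y, sr = librosa.load(path, sr=sr)
    feats = [
        librosa.feature.rms(y=y),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
    ]
    # Mean and std of each feature track -> one fixed-length vector per clip.
    return np.hstack([np.hstack([f.mean(axis=1), f.std(axis=1)]) for f in feats])

def train_feature_baseline(cycle_files, cycle_labels):
    """cycle_files: breathing-cycle wav paths; cycle_labels: four-class labels."""
    X = np.stack([extract_features(p) for p in cycle_files])
    y = np.array(cycle_labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))
    return clf
```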

B. Patient Specific Model Tuning Strategy

It has been shown by Chambres et al. [43] that, while it is difficult to achieve high scores for breathing cycle level classification, it is much easier to achieve high accuracy in patient level binary classification (healthy/sick). Hence, we propose a screening and model tuning strategy. First, patients are screened using the pre-trained model; if a patient is found to be unhealthy, the pre-trained model is retrained on that patient's data to build a patient-specific model that can monitor the patient's condition in future with higher reliability. The proposed model is shown in Fig. 3. To evaluate the performance of the proposed methodology, we used leave-one-out validation. Since there is a variable number of recordings from each patient in the dataset, of the n samples from a patient, n − 1 samples are used to retrain the model and it is tested on the remaining sample. This is repeated so that each sample appears in the test set once. We trained the proposed model on the patients in the train set and evaluated it on the patients in the test set. Since leave-one-out validation is not possible for patients with only one sample, we only considered patients with more than one sample. The dataset contains a different number of recordings from each patient, and the length of the recordings and the number of breathing cycles in each recording vary widely; on average, ≈ 47 breathing cycles per patient are used for the fine-tuning of the patient-specific models.
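The screen-and-tune loop can be sketched as below. The `build_pretrained_model()` factory, the input shapes, the optimizer and the number of fine-tuning epochs are placeholders rather than the authors' exact settings, but the leave-one-out structure follows the description above.

```python
# Hedged sketch of patient-specific tuning with leave-one-out validation over
# one (screened-unhealthy) patient's recordings. Model factory, shapes and
# optimizer settings are assumptions, not the authors' exact configuration.
import numpy as np
import tensorflow as tf

def tune_and_evaluate(patient_X, patient_y, build_pretrained_model, epochs=20):
    """patient_X: (n, time, mel_bins, 1) spectrograms from one patient (numpy);
    patient_y: (n,) breathing-cycle labels. Returns leave-one-out accuracy."""
    n = len(patient_X)
    correct = 0
    for test_idx in range(n):                       # leave one sample out
        train_idx = [i for i in range(n) if i != test_idx]
        model = build_pretrained_model()            # reload generalized weights
        model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(patient_X[train_idx], patient_y[train_idx],
                  epochs=epochs, batch_size=8, verbose=0)   # patient-specific tuning
        pred = np.argmax(model.predict(patient_X[test_idx:test_idx + 1],
                                       verbose=0), axis=-1)
        correct += int(pred[0] == patient_y[test_idx])
    return correct / n
```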

Secondly, since we are using patient-specific data to train the models, we have to verify whether our proposed model tuning strategy provides any advantage over a simple classifier trained only on patient-specific data. To verify this, we used an ImageNet [49] trained VGG-16 [40] as a feature extractor along with an SVM classifier to build patient-specific models. Variants of VGG trained on the ImageNet dataset have been shown to be very efficient feature extractors, not only for image classification but also for audio classification [50]. Here, we use the pre-trained CNN to extract features from patient recordings and train an SVM on those features using only the patient-specific data.
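A hedged sketch of this baseline is shown below: an ImageNet-trained VGG-16 without its classification head turns spectrogram "images" into fixed-length feature vectors, and an SVM is fit on one patient's data only. The 224×224×3 input sizing, the 0-255 image-like scaling and the SVM settings are assumptions for illustration.

```python
# Hedged sketch: ImageNet VGG-16 as a fixed feature extractor over spectrogram
# "images", plus an SVM trained only on one patient's data. Input sizing and
# SVM hyperparameters are illustrative.
import tensorflow as tf
from sklearn.svm import SVC

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   pooling="avg", input_shape=(224, 224, 3))

def vgg_features(spectrograms):
    """spectrograms: (n, 224, 224, 3) arrays, e.g. mel-spectrograms resized,
    tiled to three channels and scaled to 0-255; returns (n, 512) features."""
    x = tf.keras.applications.vgg16.preprocess_input(spectrograms.astype("float32"))
    return base.predict(x, verbose=0)

def patient_specific_svm(spec_train, y_train, spec_test):
    """Fit an SVM on one patient's extracted features and predict held-out clips."""
    svm = SVC(kernel="rbf", C=1.0, gamma="scale")
    svm.fit(vgg_features(spec_train), y_train)
    return svm.predict(vgg_features(spec_test))
```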

Thirdly, we propose that by pre-training the hybrid CNN-RNN model on respiratory data, the model learns domain-specific feature representations that are transferred to the patient-specific model. To justify this claim, we trained the same model on the TensorFlow speech recognition challenge dataset [47] as well as the UrbanSound8K dataset [51], and then used the same model tuning strategy to re-train the model on patient-specific data. If the proposed model learns only generic, audio-level abstract representations from the data, then a model pre-trained on any sufficiently large audio database should perform well. But if the model learns respiratory-sound-specific features from the data, the model pre-trained on respiratory sounds should outperform models pre-trained on other types of audio databases. Finally, we compare the results of our model with the pure CNN models VGG-16 and MobileNet using the same experimental methodology.

The results are tabulated in Table III. Firstly, our proposed strategy outperforms all other models and strategies, obtaining a score of 71.81%. Secondly, VGG-16 and MobileNet achieve scores of 68.54% and 67.60% respectively, which indicates that pure CNNs can be employed for respiratory audio classification, albeit not as effectively as a CNN-RNN hybrid model. Thirdly, the results for both audio-pre-trained networks show that audio-domain pre-training is not very effective for respiratory-domain feature extraction. We explain this observation in further detail in Section IV. Finally, the ImageNet-trained VGG-16 shows promise as a feature extractor for respiratory data, although it does not reach the same level of performance as the ICBHI-trained models.

C. Memory and Computational Complexity

Even though the proposed models show excellent performance on the classification task, the memory required to store the huge number of weights in these models makes them impractical for mobile and wearable platforms. Hence, we apply the local log quantization scheme proposed in Section II-D4. Figure 4 shows the score achieved by the models as a function of the bit precision of the weights. As expected, VGG-16 outperforms the other two models due to its over-parameterized design [38]. MobileNet shows particularly poor performance under weight quantization and only reaches its optimum accuracy at 10-bit precision. This poor quantization performance can be attributed to the large number of batch-normalization layers and the ReLU6 activation in the MobileNet architecture [38]. While several approaches have been proposed to circumvent these issues [52], these methods are not compatible with the ImageNet pre-trained MobileNet model since they focus on modifications to the architecture rather than quantization of pre-trained weights. The hybrid CNN-RNN model performs slightly worse than VGG-16 since its LSTM layer requires higher bit precision than its CNN counterpart [53].
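The idea behind local (per-layer) log quantization can be sketched as follows, assuming each layer's weights are snapped to the nearest signed power of two within a layer-specific exponent range; the exact scheme of Section II-D4 may differ in details such as rounding, offsets and bit allocation.

```python
# Hedged sketch of local (per-layer) logarithmic weight quantization: weights
# are mapped to signed powers of two, with the exponent range derived from each
# layer's maximum magnitude. Details may differ from the paper's Section II-D4.
import numpy as np

def log_quantize_layer(weights, n_bits):
    """Quantize one layer's weight tensor to signed power-of-two levels."""
    w = np.asarray(weights, dtype=np.float64)
    sign = np.sign(w)
    mag = np.abs(w)
    max_exp = np.ceil(np.log2(mag.max() + 1e-12))      # layer-specific ("local") range
    n_levels = 2 ** (n_bits - 1) - 1                   # one bit reserved for the sign
    min_exp = max_exp - n_levels + 1
    exp = np.clip(np.round(np.log2(mag + 1e-12)), min_exp, max_exp)
    q = sign * np.exp2(exp)
    q[mag < np.exp2(min_exp - 1)] = 0.0                # very small weights snap to zero
    return q.astype(weights.dtype)

def log_quantize_model(model, n_bits=6):
    """Apply per-layer log quantization to a Keras model's weight matrices in place."""
    for layer in model.layers:
        new_w = [log_quantize_layer(w, n_bits) if w.ndim > 1 else w  # skip bias/BN vectors
                 for w in layer.get_weights()]
        if new_w:
            layer.set_weights(new_w)
    return model
```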

Finally, our proposed system requires data pre-processing, feature extraction and classification only once per breathing cycle. Therefore, if we consider a ping-pong buffer architecture [54] for audio acquisition and processing, the system only needs to perform end-to-end classification of a breathing cycle with a latency smaller than the minimum breathing cycle duration for real-time operation. The primary computational bottleneck of the proposed system is the DL architecture, as mentioned earlier. The number of computations of the proposed architecture is of the same order as MobileNet, as shown in Fig. 5. Since the minimum breathing cycle duration is > 1 second [55] and the per-sample latency of MobileNet on modern mobile SoCs is only ∼ 100 ms [56], the proposed system should easily be able to perform real-time classification of respiratory anomalies.
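The timing argument can be illustrated with a toy ping-pong (double) buffer, in which one buffer is filled with incoming audio while the previously filled buffer is classified; `read_audio` and `classify` below are stand-ins for the microphone driver and the CNN-RNN pipeline, and real-time operation holds as long as classification finishes before the other buffer is refilled.

```python
# Hedged toy example of ping-pong buffering [54]: acquisition fills one buffer
# while the other is being classified. Buffer length, sample rate and the
# read_audio/classify stand-ins are illustrative assumptions.
import queue
import threading
import time
import numpy as np

CYCLE_SAMPLES = 4000  # illustrative: ~1 s of audio at a 4 kHz sampling rate

def read_audio(buf):
    """Stand-in for the microphone driver: fills one buffer per breathing cycle."""
    buf[:] = np.random.randn(len(buf)).astype(np.float32)
    time.sleep(1.0)                                   # pretend one cycle elapses

def classify(segment):
    """Stand-in for pre-processing + the CNN-RNN classifier."""
    return "normal"

def acquisition_loop(filled_q, n_cycles=3):
    buffers = [np.zeros(CYCLE_SAMPLES, dtype=np.float32) for _ in range(2)]
    active = 0
    for _ in range(n_cycles):
        read_audio(buffers[active])    # fill one buffer...
        filled_q.put(buffers[active])  # ...while the other is being classified
        active ^= 1                    # swap buffers ("ping-pong")
    filled_q.put(None)                 # signal end of stream

def processing_loop(filled_q):
    # Real-time requirement: classify() must return before the paired buffer
    # is refilled, i.e. latency < minimum breathing cycle duration.
    while True:
        segment = filled_q.get()
        if segment is None:
            break
        print("breathing-cycle label:", classify(segment))

q = queue.Queue(maxsize=1)
threading.Thread(target=acquisition_loop, args=(q,)).start()
processing_loop(q)
```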

IV. CONCLUSION

In this paper, we have developed a hybrid CNN-RNN model that produces state-of-the-art results on the ICBHI'17 respiratory audio dataset, achieving a score of 66.31% on the 80-20 split for four-class respiratory cycle classification. We also propose a patient screening and model tuning strategy to identify unhealthy patients and then build patient-specific models through patient-specific re-training. This proposed model provides significantly more reliable results for the original train-test split, achieving a score of 71.81% under leave-one-out cross-validation. It is observed that, surprisingly, pre-trained models from the image recognition field transfer knowledge better than those pre-trained on speech. A possible explanation is that image-based models are trained on the much larger ImageNet dataset and therefore have better generalization performance than models trained on relatively smaller audio datasets. The lack of pre-trained models in the audio domain and the prohibitively long training time required for audio datasets of sizes comparable to ImageNet prevent us from verifying this hypothesis in this work; in future, we plan to explore the transfer learning performance of audio and image datasets in further detail. We also develop a local log quantization strategy for reducing the memory cost of the models, which achieves ≈ 4× reduction in the minimum memory required without loss of performance. The primary significance of this result is that the weight quantization strategy achieves considerable weight compression without any architectural modification to the model or quantization-aware training. Finally, while the proposed model has higher computational complexity than MobileNet, it has the smallest memory footprint among the models under consideration. Since the amount of data from a single patient is still very small in this dataset, in future we plan to employ this strategy with a larger amount of patient-specific data. We also plan to create an embedded implementation of this algorithm for a wearable device to be used in patient monitoring at home. Further reductions in computational complexity will be explored using a neuromorphic spike-based approach [57], [58].

REFERENCES

[1] N. Gavriely, Y. Palti, G. Alroy, and J. B. Grotberg, “Measurement and theory of wheezing breath sounds,” Journal of Applied Physiology, vol. 57, no. 2, pp. 481–492, 1984.

[2] P. Piirila and A. Sovijarvi, “Crackles: recording, analysis and clinical significance,” European Respiratory Journal, vol. 8, no. 12, pp. 2139–2148, 1995.

[3] M. Bahoura and C. Pelletier, “Respiratory sounds classification using gaussian mixture models,” in Canadian Conference on Electrical and Computer Engineering 2004 (IEEE Cat. No. 04CH37513), vol. 3. IEEE, 2004, pp. 1309–1312.

[4] J. Acharya, A. Basu, and W. Ser, “Feature extraction techniques for low-power ambulatory wheeze detection wearables,” in 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2017, pp. 4574–4577.

[5] B.-S. Lin and B.-S. Lin, “Automatic wheezing detection using speech recognition technique,” Journal of Medical and Biological Engineering, vol. 36, no. 4, pp. 545–554, 2016.

[6] M. Bahoura, “Pattern recognition methods applied to respiratory sounds classification into normal and wheeze classes,” Computers in biology and medicine, vol. 39, no. 9, pp. 824–843, 2009.

[7] J. Zhang, W. Ser, J. Yu, and T. Zhang, “A novel wheeze detection method for wearable monitoring systems,” in 2009 International Symposium on Intelligent Ubiquitous Computing and Education. IEEE, 2009, pp. 331–334.

[8] P. Bokov, B. Mahut, P. Flaud, and C. Delclaux, “Wheezing recognition algorithm using recordings of respiratory sounds at the mouth in a pediatric population,” Computers in biology and medicine, vol. 70, pp. 40–50, 2016.

[9] I. Sen, M. Saraclar, and Y. P. Kahya, “A comparison of svm and gmm-based classifier configurations for diagnostic classification of pulmonary sounds,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 7, pp. 1768–1776, 2015.

[10] N. Jakovljević and T. Lončar-Turukalo, “Hidden Markov model based respiratory sound classification,” in Precision Medicine Powered by pHealth and Connected Health. Springer, 2018, pp. 39–43.

[11] R. X. A. Pramono, S. Bowyer, and E. Rodriguez-Villegas, “Automatic adventitious respiratory sound analysis: A systematic review,” PloS one, vol. 12, no. 5, p. e0177926, 2017.

[12] H. Chen, X. Yuan, Z. Pei, M. Li, and J. Li, “Triple-classification of respiratory sounds using optimized s-transform and deep residual networks,” IEEE Access, vol. 7, pp. 32845–32852, 2019.

[13] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.

[14] E. Hosseini-Asl, G. Gimel’farb, and A. El-Baz, “Alzheimer’s disease diagnostics by a deeply supervised adaptable 3d convolutional network,” arXiv preprint arXiv:1607.00556, 2016.

[15] M. J. van Grinsven, B. van Ginneken, C. B. Hoyng, T. Theelen, and C. I. Sánchez, “Fast convolutional neural network training using selective data sampling: Application to hemorrhage detection in color fundus images,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1273–1284, 2016.

[16] Y. Song, L. Zhang, S. Chen, D. Ni, B. Lei, and T. Wang, “Accurate segmentation of cervical cytoplasm and nuclei based on multiscale convolutional network and graph partitioning,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 10, pp. 2421–2433, 2015.

[17] O. Oktay, W. Bai, M. Lee, R. Guerrero, K. Kamnitsas, J. Caballero, A. de Marvao, S. Cook, D. O’Regan, and D. Rueckert, “Multi-input cardiac image super-resolution using convolutional neural networks,” in International conference on medical image computing and computer-assisted intervention. Springer, 2016, pp. 246–254.

[18] P. Kisilev, E. Sason, E. Barkan, and S. Hashoul, “Medical image description using multi-task-loss cnn,” in Deep Learning and Data Labeling for Medical Applications. Springer, 2016, pp. 121–129.

[19] P. V. Tran, “A fully convolutional neural network for cardiac segmentation in short-axis mri,” arXiv preprint arXiv:1604.00494, 2016.

[20] H. K. van der Burgh, R. Schmidt, H.-J. Westeneng, M. A. de Reus, L. H. van den Berg, and M. P. van den Heuvel, “Deep learning predictions of survival based on mri in amyotrophic lateral sclerosis,” NeuroImage: Clinical, vol. 13, pp. 361–369, 2017.

[21] T. Kooi, B. van Ginneken, N. Karssemeijer, and A. den Heeten, “Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network,” Medical physics, vol. 44, no. 3, pp. 1017–1027, 2017.

[22] X. Chen, Y. Xu, D. W. K. Wong, T. Y. Wong, and J. Liu, “Glaucoma detection based on deep convolutional neural network,” in 2015 37th annual international conference of the IEEE engineering in medicine and biology society (EMBC). IEEE, 2015, pp. 715–718.

[23] Y. Bengio, P. Simard, P. Frasconi et al., “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994.

[24] H. Salehinejad, S. Sankar, J. Barfett, E. Colak, and S. Valaee, “Recent advances in recurrent neural networks,” arXiv preprint arXiv:1801.01078, 2017.

[25] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, “Action recognition in video sequences using deep bi-directional lstm with cnn features,” IEEE Access, vol. 6, pp. 1155–1166, 2018.

[26] Y. Zhao, X. Jin, and X. Hu, “Recurrent convolutional neural network for speech processing,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5300–5304.

[27] J. Amoh and K. Odame, “Deep neural networks for identifying cough sounds,” IEEE transactions on biomedical circuits and systems, vol. 10, no. 5, pp. 1003–1011, 2016.

[28] H. Nakano, T. Furukawa, and T. Tanigawa, “Tracheal sound analysis using a deep neural network to detect sleep apnea,” Journal of Clinical Sleep Medicine, vol. 15, no. 08, pp. 1125–1133, 2019.

[29] H. Ryu, J. Park, and H. Shin, “Classification of heart sound recordings using convolution neural network,” in 2016 Computing in Cardiology Conference (CinC). IEEE, 2016, pp. 1153–1156.

[30] H. Chang, J. Han, C. Zhong, A. M. Snijders, and J.-H. Mao, “Unsupervised transfer learning via multi-scale convolutional sparse coding for biomedical applications,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 5, pp. 1182–1194, 2018.

[31] A. Payan and G. Montana, “Predicting alzheimer’s disease: a neuroimaging study with 3d convolutional neural networks,” arXiv preprint arXiv:1502.02506, 2015.

[32] L. S. Hu, H. Yoon, J. M. Eschbacher, L. C. Baxter, A. C. Dueck, A. Nespodzany, K. A. Smith, P. Nakaji, Y. Xu, L. Wang et al., “Accurate patient-specific machine learning models of glioblastoma invasion using transfer learning,” American Journal of Neuroradiology, vol. 40, no. 3, pp. 418–425, 2019.

[33] A. Bellot and M. Schaar, “Boosting transfer learning with survival data from heterogeneous domains,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 57–65.

[34] S. Kiranyaz, T. Ince, R. Hamila, and M. Gabbouj, “Convolutional neural networks for patient-specific ecg classification,” in 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2015, pp. 2608–2611.

[35] P. G. Gibson, “Monitoring the patient with asthma: an evidence-based approach,” Journal of Allergy and Clinical Immunology, vol. 106, no. 1, pp. 17–26, 2000.

[36] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in neural information processing systems, 2016, pp. 4107–4115.

[37] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[38] T. Sheng, C. Feng, S. Zhuo, X. Zhang, L. Shen, and M. Aleksic, “A quantization-friendly separable convolution for mobilenets,” in 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2). IEEE, 2018, pp. 14–18.

[39] B. Rocha, D. Filos, L. Mendes, I. Vogiatzis, E. Perantoni, E. Kaimakamis, P. Natsiavas, A. Oliveira, C. Jácome, A. Marques et al., “A respiratory sound database for the development of automated classification,” in Precision Medicine Powered by pHealth and Connected Health. Springer, 2018, pp. 33–37.

[40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[41] K. Kochetov, E. Putin, M. Balashov, A. Filchenkov, and A. Shalyto, “Noise masking recurrent neural network for respiratory sound classification,” in International Conference on Artificial Neural Networks. Springer, 2018, pp. 208–217.

[42] D. Perna, “Convolutional neural networks learning from respiratory data,” in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2018, pp. 2109–2113.

[43] G. Chambres, P. Hanna, and M. Desainte-Catherine, “Automatic detection of patient with respiratory diseases using lung sound analysis,” in 2018 International Conference on Content-Based Multimedia Indexing (CBMI). IEEE, 2018, pp. 1–6.

[44] E. Çakır, G. Parascandolo, T. Heittola, H. Huttunen, and T. Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 6, pp. 1291–1303, 2017.

[45] J. Sang, S. Park, and J. Lee, “Convolutional recurrent neural networks for urban sound classification using raw waveforms,” in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2444–2448.

[46] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[47] P. Warden, “Speech commands: A public dataset for single-word speech recognition,” Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01, 2017.

[48] R. X. A. Pramono, S. A. Imtiaz, and E. Rodriguez-Villegas, “Evaluation of features for classification of wheezes and normal respiratory sounds,” PloS one, vol. 14, no. 3, p. e0213659, 2019.

[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.

[50] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.

[51] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM International Conference on Multimedia (ACM-MM’14), Orlando, FL, USA, Nov. 2014, pp. 1041–1044.

[52] S. Alyamkin, M. Ardi, A. Brighton, A. C. Berg, B. Chen, Y. Chen, H.-P. Cheng, Z. Fan, C. Feng, B. Fu et al., “Low-power computer vision: Status, challenges, opportunities,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.

[53] T. Gokmen, M. Rasch, and W. Haensch, “Training lstm networks with resistive cross-point devices,” Frontiers in neuroscience, vol. 12, p. 745, 2018.

[54] D. Katz, “Fundamentals of embedded audio, part 3,” Sep 2007. [Online]. Available: https://www.eetimes.com/fundamentals-of-embedded-audio-part-3/

[55] W. Q. Lindh, M. Pooler, C. D. Tamparo, B. M. Dahl, and J. Morris, Delmar’s comprehensive medical assisting: administrative and clinical competencies. Cengage Learning, 2013.

[56] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, “Ai benchmark: Running deep neural networks on android smartphones,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.

[57] A. Basu, J. Acharya, T. Karnik et al., “Low-power, adaptive neuromorphic systems: Recent progress and future directions,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, pp. 6–27, 2018.

[58] J. Acharya, A. Patil, X. Li, Y. Chen, S. C. Liu, and A. Basu, “A comparison of low-complexity real-time feature extraction for neuromorphic speech recognition,” Frontiers in neuroscience, vol. 12, p. 160, 2018.

Authors:

(1) Jyotibdha Acharya (Student Member, IEEE), HealthTech NTU, Interdisciplinary Graduate Program, Nanyang Technological University, Singapore;

(2) Arindam Basu (Senior Member, IEEE), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.


This paper is available on arXiv under the ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.


Written by modeltuning | Transferring the essence of optimal performance, and saving the model from the abyss of underfitting.
Published by HackerNoon on 2025/09/09