
Machine Learning for the Birds: Building Our Own Bird Vocalization Classifier

by Picture in the Noise · 20m read · 2025/06/09

Too Long; Didn't Read

The BirdCLEF+ 2025 competition challenges participants to build models that can reliably identify bird species from audio recordings. This article shows how to build your own bird vocalization classifier that can be used alongside the GBV classifier to identify a broader range of bird species. The design uses techniques described in the Google Research article Separating Birdsong in the Wild for Classification.

Introduction

Ecologists use automated methods to study large ecosystems. In forest and jungle settings, autonomous recording units (ARUs) are used to capture audio that can help identify different species of animals and insects. This information can be used to develop a more precise understanding of species diversity within a given environment. In the case of birds, Google Research notes in its article Separating Birdsong in the Wild for Classification that "ecologists use birds to understand food systems and forest health - for example, if there are more woodpeckers in a forest, that means there's a lot of dead wood." They also note the value of audio-based identification: "[Since] birds communicate and mark territory with songs and calls, it is most efficient to identify them by ear."


Enter the BirdCLEF+ 2025 competition, hosted on Kaggle under the umbrella of the ImageCLEF lab. ImageCLEF promotes research in the cross-language annotation and retrieval of images across various domains. The aim of the competition is straightforward: design a classification model that can reliably identify bird species from audio recordings.


At first glance, the task may appear to be readily solved using the Google Bird Vocalization (GBV) Classifier, also known as Perch. The GBV classifier is trained on nearly 11,000 bird species and is therefore an obvious choice as a classification model.


However, the competition includes bird species that are not in the GBV classifier's training set. As a result, the GBV classifier achieves only ~60% accuracy on the BirdCLEF+ 2025 competition test dataset. Consequently, a custom model needs to be developed.

This article shows how to build your own bird vocalization classifier that can be used alongside the GBV classifier to classify a broader selection of bird species. The design draws on techniques from the Google Research article referenced above and uses the BirdCLEF+ 2025 competition dataset for training.

Training Data

The BirdCLEF+ 2025 training dataset, including supporting files, is approximately 12 GB. The main directories and files that make up the dataset are:

birdclef_2025
|__ train_audio
|__ train_soundscapes
|__ test_soundscapes
recording_location.txt
taxonomy.csv
train.csv

train_audio

The train_audio directory is the largest component of the dataset, containing 28,564 training audio recordings in .ogg audio format. The audio recordings are grouped into sub-directories, each of which represents a specific bird species, e.g.:

train_audio
|__amakin1
   |__ [AUDIO FILES]
|__amekes
   |__ [AUDIO FILES]
...

The taxonomy.csv file can be used to look up the scientific and common names of the bird species corresponding to the sub-directory names, e.g.:

SUB-DIRECTORY NAME          SCIENTIFIC NAME             COMMON NAME
amakin1                     Chloroceryle amazona        Amazon Kingfisher
amekes                      Falco sparverius            American Kestrel
...


Amazon Kingfisher


American Kestrel

The competition dataset comprises 206 bird species, i.e. 206 classes. As noted in the Introduction, 63 of those species are not covered by the GBV classifier. These non-GBV classes are primarily labeled using numerical class identifiers:


1139490, 1192948, 1194042, 126247, 1346504, 134933, 135045, 1462711, 1462737, 1564122, 21038, 21116, 21211, 22333, 22973, 22976, 24272, 24292, 24322, 41663, 41778, 41970, 42007, 42087, 42113, 46010, 47067, 476537, 476538, 48124, 50186, 517119, 523060, 528041, 52884, 548639, 555086, 555142, 566513, 64862, 65336, 65344, 65349, 65373, 65419, 65448, 65547, 65962, 66016, 66531, 66578, 66893, 67082, 67252, 714022, 715170, 787625, 81930, 868458, 963335, grasal4, verfly, y00678

The majority of these non-GBV classes are characterized by:

  1. Limited training data.
    • Class 1139490, for example, only contains 2 audio recordings. By contrast, class amakin1, which is a “known” GBV class, contains 89 recordings.
  2. Poor recording quality.
    • Highlighting class 1139490 again, both training recordings are of poor quality with one being particularly difficult to discern.

These two conditions reflect the difficulty of obtaining a sufficient quantity of high-quality recorded audio for many bird species.

Many of the training audio recordings in both the GBV and non-GBV classes include human speech, with a narrator announcing details of the recording, such as the species of bird that was recorded and the location of the recording. In most - but not all - cases, the announcement follows the recorded birdsong.

The strategies used to address the class imbalance and the presence of human speech annotations are discussed in the Building the Classifier section.

train_soundscapes

The train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. As discussed in the Building the Classifier section, these audio recordings can be incorporated into the training data via pseudo-labeling.

test_soundscapes

The test_soundscapes directory is empty except for a readme.txt file. It is populated with a hidden set of test audio when prediction results are submitted to the BirdCLEF+ 2025 competition.

Building the Classifier

Basic Approach and Background

The basic approach used by Google Research to train its bird vocalization classifier involved:

  1. Splitting the input audio into 5 second segments.
  2. Converting the audio segments into mel spectrograms.
  3. Training an image classifier on the mel spectrograms.
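To make step 1 concrete, here is a minimal numpy sketch of splitting an audio signal into 5 second segments; the function name, the zero-padding of a final partial segment, and the random test signal are illustrative assumptions rather than the notebook's actual code:

```python
import numpy as np

SR = 32000          # BirdCLEF+ 2025 sampling rate
SEGMENT_SECS = 5

def split_into_segments(audio, sr=SR, secs=SEGMENT_SECS):
    """Split a 1-D audio signal into non-overlapping `secs`-second segments.

    The final partial segment, if any, is zero-padded to full length.
    """
    seg_len = sr * secs
    n_segs = int(np.ceil(len(audio) / seg_len))
    padded = np.zeros(n_segs * seg_len, dtype=audio.dtype)
    padded[:len(audio)] = audio
    return padded.reshape(n_segs, seg_len)

# 12 seconds of audio -> three 5-second segments (the last one is padded)
audio = np.random.default_rng(0).normal(size=12 * SR).astype(np.float32)
segments = split_into_segments(audio)
print(segments.shape)  # (3, 160000)
```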

The same approach will be followed in this article. The image classifier that will be trained is Google's EfficientNet B0 model. If you are familiar with the EfficientNet family of models, you know that they were designed for efficient image processing.


However, before the audio samples can be split and converted into mel spectrograms, we need to address the class imbalance and human annotation problems mentioned in the Training Data section. Broadly speaking, those problems will be addressed respectively via data augmentation and slicing of the audio samples.

Before diving in, the following sub-sections provide some useful background information.

EfficientNet Models

Google Research introduced its family of EfficientNet models in 2019 as a set of convolutional neural network models that exceeded state-of-the-art models, at that time, with respect to both size and performance.


EfficientNet model family performance

EfficientNetV2 models, released in 2021, offer even better performance and parameter efficiency.

Despite being trained on ImageNet data, EfficientNet models have demonstrated their utility when transferred to other datasets, making them an attractive choice as the classification technology for this project.

I-Mel Spectrograms

A mel spectrogram is a visual representation of an audio signal. It can be thought of as a heatmap of sound.


Sample mel spectrogram

The x-axis of a mel spectrogram represents the time dimension of the audio signal, and the y-axis represents the frequencies of the sounds within the signal. However, rather than displaying all frequencies on a continuous scale, the frequencies are grouped into mel bands. These bands, in turn, are spaced using the mel scale. The mel scale is a logarithmic scale that approximates the human auditory system and how humans perceive sound. The colors of the mel spectrogram represent the amplitudes of the sounds within the bands. Brighter colors represent higher amplitudes while darker colors represent lower amplitudes.
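The logarithmic character of the mel scale is easy to see numerically. The sketch below uses the common HTK-style conversion formulas (an assumption; other mel-scale variants exist) to place 8 band edges between 50 Hz and 14 kHz, the frequency limits used later in the notebook; the edges come out evenly spaced in mel units but logarithmically spaced in Hz:

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK-style Hz -> mel conversion (one common variant of the mel scale)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Edges for 8 mel bands between 50 Hz and 14 kHz: low frequencies get
# narrow bands, high frequencies get wide ones
edges_hz = mel_to_hz(np.linspace(hz_to_mel(50), hz_to_mel(14000), 9))
print(np.round(edges_hz).astype(int))
```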

Design

My approach to reviewing the design provides a high-level overview of the process without going into excessive detail. The training (i.e. fine-tuning) logic is captured in this Kaggle notebook ("training notebook"), which consists of 4 main sections:

  • Section 1: Audio data loading.
  • Section 2: Audio data processing.
  • Section 3: Mel spectrogram generation and input preparation.
  • Section 4: Model training.

You will find that the first 2 cells of each main section are (1) the imports used by that section and (2) a Config cell defining constants used in that section and subsequent sections.

The training notebook actually begins with Section 0, where the basic Python packages used throughout the notebook are imported. This section also includes the logic for logging in to Weights & Biases ("WandB"), which is used to track training runs. You will need to attach your own WandB API key to the notebook as a Kaggle Secret using the name WANDB_API_KEY.

As discussed in the Training Data section, the unlabeled training soundscapes can be incorporated into the training data via pseudo-labeling. The use of pseudo-labeled data is discussed in Section 3.5 - Loading Pseudo-Labeled Data. Keep in mind that non-GPU Kaggle environments are limited to 30 GiB of memory.

The model trained following the experimental setup described in the following sub-sections has been posted to Kaggle here. If you want, you can use this model without training your own and skip directly to the Running Inference section to run inference on birdsong audio.

Section 1 - Audio Data Loading

The Audio Data Loading section of the notebook:

  1. Extracts the classes in the BirdCLEF+ 2025 competition dataset that are not covered by the GBV classifier.
  2. Loads the audio data for those classes via the load_training_audio method.
  3. Creates a processed_audio directory and saves a copy of the loaded audio data as .wav files in that directory.

The Config cell of this section includes a MAX_FILES constant. This constant specifies the maximum number of audio files to load from a given class. It is intentionally set to the arbitrarily high value of 1000 to ensure that all audio files are loaded for the non-GBV classes. You may need to adjust this constant for your own experimental setup. For example, if you are loading audio data for all classes, you may need to set this constant to a lower value to avoid exhausting available memory.

The load_training_audio method can be called with a classes parameter, which is a list of the classes whose audio will be loaded. For this project, the non-GBV classes are stored as a list and assigned to the variable missing_classes, which is subsequently passed to the load_training_audio method via the classes parameter:

# `missing_classes` list
['1139490', '1192948', '1194042', '126247', '1346504', '134933', '135045', '1462711', '1462737', '1564122', '21038', '21116', '21211', '22333', '22973', '22976', '24272', '24292', '24322', '41663', '41778', '41970', '42007', '42087', '42113', '46010', '47067', '476537', '476538', '48124', '50186', '517119', '523060', '528041', '52884', '548639', '555086', '555142', '566513', '64862', '65336', '65344', '65349', '65373', '65419', '65448', '65547', '65962', '66016', '66531', '66578', '66893', '67082', '67252', '714022', '715170', '787625', '81930', '868458', '963335', 'grasal4', 'verfly', 'y00678']

You can load all 206 BirdCLEF+ 2025 classes by passing an empty list as the classes parameter.


The load_training_audio method also accepts an optional boolean use_slice parameter. This parameter works with the LOAD_SLICE constant defined in the Config cell. The use_slice parameter and LOAD_SLICE constant are not used with this implementation. However, they can be used to load a specific amount of audio from each file. For example, to load only 5 seconds of audio from each audio file, set LOAD_SLICE to 160000, which is calculated as 5 times the sampling rate of 32000, and pass True for the use_slice parameter.

The load_training_audio method also accepts a boolean make_copy parameter. When this parameter is True, the logic creates a processed_audio directory and saves a copy of each audio sample as a .wav file to that directory. Audio copies are saved to sub-directories reflecting the class they belong to. The processed_audio directory is used in the next section to save modified audio samples to disk without affecting the BirdCLEF+ 2025 dataset directories.


The load_training_audio method returns a dictionary of the loaded audio data using the class names as keys. Each value in the dictionary is a list of tuples of the form (AUDIO_FILENAME, AUDIO_DATA):

{'1139490': [('CSA36389.ogg', tensor([[-7.3379e-06,  1.0008e-05, -8.9483e-06,  ...,  2.9978e-06,
3.4201e-06,  3.8700e-06]])), ('CSA36385.ogg', tensor([[-2.9545e-06,  2.9259e-05,  2.8138e-05,  ..., -5.8680e-09, -2.3467e-09, -2.6546e-10]]))], '1192948': [('CSA36388.ogg', tensor([[ 3.7417e-06, -5.4138e-06, -3.3517e-07,  ..., -2.4159e-05, -1.6547e-05, -1.8537e-05]])), ('CSA36366.ogg', tensor([[ 2.6916e-06, -1.5655e-06, -2.1533e-05,  ..., -2.0132e-05, -1.9063e-05, -2.4438e-05]])), ('CSA36373.ogg', tensor([[ 3.4144e-05, -8.0636e-06,  1.4903e-06,  ..., -3.8835e-05, -4.1840e-05, -4.0731e-05]])), ('CSA36358.ogg', tensor([[-1.6201e-06,  2.8240e-05,  2.9543e-05,  ..., -2.9203e-04, -3.1059e-04, -2.8100e-04]]))], '1194042': [('CSA18794.ogg', tensor([[ 3.0655e-05,  4.8817e-05,  6.2794e-05,  ..., -5.1450e-05,
-4.8535e-05, -4.2476e-05]])), ('CSA18802.ogg', tensor([[ 6.6640e-05,  8.8530e-05,  6.4143e-05,  ...,  5.3802e-07, -1.7509e-05, -4.8914e-06]])), ('CSA18783.ogg', tensor([[-8.6866e-06, -6.3421e-06, -3.1125e-05,  ..., -1.7946e-04, -1.6407e-04, -1.5334e-04]]))] ...}

The method also outputs basic statistics describing the audio loaded for each class as a comma-separated-value string. You can optionally export these statistics to inspect the data.

class,sampling_rate,num_files,num_secs_loaded,num_files_loaded
1139490,32000,2,194,2
1192948,32000,4,420,4
1194042,32000,3,91,3
...

Section 2 - Audio Data Processing

The Audio Data Processing section of the notebook:

  1. Optionally strips silent segments from the audio and slices the audio to eliminate most human annotations.
  2. Optionally augments the audio by (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio.
Section 2.1 - Detecting Silent Segments

The detect_silence method is used to "slide" over each raw audio sample and identify silent segments by comparing the root-mean square (RMS) value of a given segment to a specified threshold. If the RMS is below the threshold, the segment is identified as a silent segment. The following constants specified in the Config cell of this section control the behavior of the detect_silence method:

SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)
SIL_HOP = int(1.0 * SIL_FRAME)
SIL_THRESHOLD = 5e-5
SIL_REPLACE_VAL = -1000 # Value used to replace audio signal values within silent segments

The SIL_FRAME and SIL_HOP constants can be modified to adjust how the method "slides" over the raw audio. Similarly, the SIL_THRESHOLD value can be modified to make the method more aggressive or more conservative in identifying silent segments.
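As a rough illustration of how such an RMS comparison can work, the sketch below slides non-overlapping 0.25 second frames over a signal and flags frames whose RMS falls below SIL_THRESHOLD; the function name and list-of-markers output are simplified stand-ins for the notebook's actual implementation:

```python
import numpy as np

SR = 32000
SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)   # 8000 samples = 0.25 s
SIL_HOP = int(1.0 * SIL_FRAME)              # non-overlapping frames
SIL_THRESHOLD = 5e-5

def detect_silence_markers(audio):
    """Return the start sample of every frame whose RMS is below threshold."""
    markers = []
    for start in range(0, len(audio) - SIL_FRAME + 1, SIL_HOP):
        frame = audio[start:start + SIL_FRAME]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms < SIL_THRESHOLD:
            markers.append(start)
    return markers

# 1 s of near-silence followed by 1 s of a louder 440 Hz tone
quiet = np.full(SR, 1e-6, dtype=np.float32)
loud = 0.1 * np.sin(2 * np.pi * 440 * np.arange(SR) / SR).astype(np.float32)
print(detect_silence_markers(np.concatenate([quiet, loud])))
# -> [0, 8000, 16000, 24000]
```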

The method outputs a dictionary of silent segment markers for each file in each class. Audio files with no detected silent segments are identified by empty lists.

{'1139490': {'CSA36389.ogg': [0, 8000, 16000, 272000, 280000, 288000, 296000, 304000], 'CSA36385.ogg': [0, 8000, 16000, 24000, 240000, 248000, 256000]}, '1192948': {'CSA36388.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36366.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 280000, 288000], 'CSA36373.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36358.ogg': [8000]}, '1194042': {'CSA18794.ogg': [], 'CSA18802.ogg': [], 'CSA18783.ogg': [0, 8000, 16000, 24000, 600000, 608000, 616000]}, '126247': {'XC941297.ogg': [], 'iNat1109254.ogg': [], 'iNat888527.ogg': [], 'iNat320679.ogg': [0], 'iNat888729.ogg': [], 'iNat146584.ogg': []}, '1346504': {'CSA18803.ogg': [0, 8000, 16000, 24000, 3000000, 3008000, 3016000], 'CSA18791.ogg': [], 'CSA18792.ogg': [], 'CSA18784.ogg': [0, 8000, 16000, 1232000, 1240000, 1248000], 'CSA18793.ogg': [0, 8000, 16000, 24000, 888000]} ...}
Section 2.2 - Removing Silent Segments and Eliminating Human Annotations

The USE_REMOVE_SILENCE_AND_HUMAN_ANNOT constant defined in the Config cell of this section specifies if audio should be stripped of silent segments and sliced to remove most human annotations.

USE_REMOVE_SILENCE_AND_HUMAN_ANNOT = True

The remove_silence_and_human_annot method strips silent segments from audio samples using the output of the detect_silence method. Further, it implements logic to handle human annotations based on a simple observation: many audio samples, namely those with human annotations, tend to have the following structure:

|  < 10s   |   ~1s   |                  |
| BIRDSONG | SILENCE | HUMAN ANNOTATION |

The birdsong and human annotation sections may themselves contain silent segments. However, as seen in the diagram above, the bird vocalizations tend to be captured within the first few seconds of audio. Therefore, a simple, if imperfect, approach to handle human annotations is to slice audio samples at the first silent segment marker that occurs outside a specified window, under the assumption that a human annotation follows that silent segment. The remove_silence_and_human_annot logic uses the ANNOT_BREAKPOINT constant in the Config cell to check if a silent segment marker lies outside the window specified by ANNOT_BREAKPOINT, expressed in number of seconds. If it does, the logic slices the raw audio at that marker and retains only the data that occurs before it. A manual inspection of processed audio during experimentation found this approach to be satisfactory. However, as mentioned in the Training Data section, there are some audio recordings where the human annotation precedes the birdsong recording. The logic described here does not address those cases. Some audio samples feature long sequences of recorded birdsong and contain no silent segments. Those samples are unaffected by the logic described above and are retained in their entirety.

A second constant, SLICE_FRAME, can be optionally used in a final processing step to return an even more refined slice of the processed audio. Set SLICE_FRAME to the number of seconds of processed audio that you want to retain.

The remove_silence_and_human_annot method saves processed audio to disk under the directory processed_audio via the save_audio parameter, which is passed as True. The method returns a dictionary of the total seconds of processed audio computed for each class.

{'1139490': 14, '1192948': 29, '1194042': 24, '126247': 48, '1346504': 40, '134933': 32, '135045': 77, ...}

The get_audio_stats method is used following remove_silence_and_human_annot to get the average number of seconds of audio across all classes.

Section 2.3 - Calculating Augmentation Turns for Minority Classes

As mentioned in the Training Data section, the classes are not balanced. Augmentation is used in this notebook section to help address the imbalance, leveraging the average number of seconds of audio across all classes as provided by the get_audio_stats method. Classes with total seconds of processed audio below the average are augmented. The get_augmentation_turns_per_class method determines the number of augmentation turns for each minority class using the average number of seconds per processed audio sample:

TURNS = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS)/AVG_SECS_PER_AUDIO_SAMPLE

Minority classes further below the average will have more augmentation turns, whereas minority classes nearer to the average will have fewer augmentation turns.
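The formula above can be sketched as follows; the per-class totals, the AVG_SECS_PER_AUDIO_SAMPLE value, and the rounding behavior are hypothetical, chosen only to show how classes further below the average receive more turns:

```python
# Hypothetical per-class totals of processed audio, in seconds
totals = {"1139490": 14, "1192948": 29, "135045": 77, "amakin1": 310}
AVG_SECS_PER_AUDIO_SAMPLE = 30.0  # assumed average length of one sample

avg_secs = sum(totals.values()) / len(totals)  # 107.5

def augmentation_turns(total_secs):
    """TURNS = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS)
    / AVG_SECS_PER_AUDIO_SAMPLE, rounded; majority classes get no turns."""
    if total_secs >= avg_secs:
        return 0
    return round((avg_secs - total_secs) / AVG_SECS_PER_AUDIO_SAMPLE)

for cls, secs in totals.items():
    print(cls, augmentation_turns(secs))
# 1139490 and 1192948 each get 3 turns, 135045 gets 1, amakin1 gets 0
```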

The get_augmentation_turns_per_class method includes an AVG_SECS_FACTOR constant which can be used to adjust the value for the average number of seconds of audio across all classes. The constant can be used to make the logic more conservative or aggressive when calculating the number of augmentation turns.

Section 2.4 - Running Augmentations

The USE_AUGMENTATIONS constant defined in the Config cell of this section specifies if the audio should be augmented.

USE_AUGMENTATIONS = True

As mentioned earlier, audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio. The add_noise and change_tempo methods encapsulate the logic for adding a noise signal and changing the tempo respectively. The noise signal range and tempo change range can be adjusted via the following constants in the Config cell:

NOISE_RNG_LOW = 0.0001
NOISE_RNG_HIGH = 0.0009
TEMPO_RNG_LOW = 0.5
TEMPO_RNG_HIGH = 1.5

The run_augmentations method runs the augmentations using the output from the get_augmentations_turns_per_class method. For those classes that will be augmented, the logic:

  1. Randomly selects a processed audio sample (i.e. one with silent segments already removed) to augment.
  2. Randomly selects the augmentation to perform: (1) adding noise, (2) changing the tempo, or (3) adding noise and changing the tempo.
  3. Saves the augmented audio to disk under the appropriate class within the processed_audio directory.
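A minimal numpy sketch of the two augmentation primitives follows. Note that the tempo change here is a crude linear-resampling stand-in (a real implementation would use a pitch-preserving time stretch such as librosa.effects.time_stretch), and the random-choice wiring and function bodies are illustrative assumptions, not the notebook's code:

```python
import numpy as np

rng = np.random.default_rng(42)

NOISE_RNG_LOW, NOISE_RNG_HIGH = 0.0001, 0.0009
TEMPO_RNG_LOW, TEMPO_RNG_HIGH = 0.5, 1.5

def add_noise(audio):
    """Add white noise with a randomly chosen amplitude."""
    amp = float(rng.uniform(NOISE_RNG_LOW, NOISE_RNG_HIGH))
    return (audio + amp * rng.standard_normal(len(audio))).astype(audio.dtype)

def change_tempo(audio):
    """Crude tempo change via linear resampling; shortens or lengthens the
    signal by a random rate within the configured range."""
    rate = float(rng.uniform(TEMPO_RNG_LOW, TEMPO_RNG_HIGH))
    new_len = max(1, int(len(audio) / rate))
    positions = np.linspace(0, len(audio) - 1, num=new_len)
    return np.interp(positions, np.arange(len(audio)), audio).astype(audio.dtype)

# One second of a 440 Hz tone, then a randomly selected augmentation
audio = np.sin(2 * np.pi * 440 * np.arange(32000) / 32000).astype(np.float32)
augmentations = [add_noise, change_tempo, lambda a: change_tempo(add_noise(a))]
augmented = augmentations[rng.integers(0, 3)](audio)
print(augmented.dtype)  # float32
```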

While the notebook logic augments minority classes whose total seconds of audio fall below the average, it does not downsample majority classes whose total seconds of audio exceed the average. This design was chosen to manage available memory and in recognition that the class imbalance is further addressed via the choice of loss function.

Section 3 - Mel Spectrogram Generation and Input Preparation

The Mel Spectrogram Generation and Input Preparation section of the notebook:

  1. Splits the processed audio data into training and validation lists.
  2. Splits audio into 5 second frames.
  3. Generates mel spectrograms for each 5 second audio frame.
  4. Resizes mel spectrograms to a target size of (224, 224).
  5. Optionally loads pseudo-labeled data samples to augment training data.
  6. One-hot encodes training data and validation data labels.
  7. Constructs TensorFlow Dataset objects from the training and validation data lists.
  8. Optionally uses MixUp logic to augment training data.
Section 3.1 - Splitting Processed Audio Data

Processed audio data is loaded from the processed_audio folder. The data is split into 4 lists:

training_audio
training_labels
validation_audio
validation_labels

The labels are, of course, the class names associated with the audio samples. The SPLIT constant defined in the Config cell controls the split ratio between the training and validation data lists. Processed audio data is shuffled before splitting.

Section 3.2 - Splitting Audio into Frames

Audio is split into 5 second segments using the frame_audio method, which uses the TensorFlow signal.frame method to split each audio sample. The following constants in the Config cell control the split operation:

FRAME_LENGTH = 5
FRAME_STEP = 5
Section 3.3 - Generating Mel Spectrograms

Mel spectrograms are generated for each 5 second audio frame generated in Section 3.2 via the audio2melspec method. The following constants in the Config cell specify the parameters used when generating the mel spectrograms, such as the number of mel bands, minimum frequency, and maximum frequency:

# Mel spectrogram parameters
N_FFT = 1024  # FFT size
HOP_SIZE = 256
N_MELS = 256
FMIN = 50  # minimum frequency
FMAX = 14000 # maximum frequency

The frequency band range was chosen to reflect the general range of bird vocalizations. However, some bird species can vocalize outside of this range.

Section 3.4 - Resizing Mel Spectrograms

The to_melspectrogram_image method is used to convert each mel spectrogram to a pillow Image object. Each Image object is subsequently resized to (224, 224), which is the input dimension expected by the EfficientNet B0 model.
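The conversion-and-resize step can be sketched as follows, assuming an 8-bit grayscale image; the fake spectrogram array stands in for a real one (5 seconds of 32 kHz audio yields roughly 160000 / 256 = 625 hop-256 frames):

```python
import numpy as np
from PIL import Image

# Fake mel spectrogram: 256 mel bands x 625 time frames, values in [0, 1]
mel = np.random.default_rng(0).random((256, 625)).astype(np.float32)

# Scale to 8-bit, wrap in a pillow Image, and resize to the (224, 224)
# input dimension expected by EfficientNet B0
img = Image.fromarray((mel * 255).astype(np.uint8)).resize((224, 224))
print(img.size)  # (224, 224)
```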

Section 3.5 - Loading Pseudo-Labeled Data

As mentioned in the Training Data section, the train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. These audio recordings can be incorporated into the training data via pseudo-labeling. A simple process to create pseudo-labeled data is as follows:

  • Train a classifier without pseudo-labeled data.
  • Load training soundscape audio files.
  • Segment the soundscape audio into 5 second frames.
  • Generate mel spectrograms for each 5 second frame and resize them to (224, 224).
  • Run predictions on each resized mel spectrogram using the classifier trained in the first step.
  • Keep predictions above a desired confidence level and save the mel spectrograms associated with those predictions under the predicted class label.
  • Train your classifier again using the pseudo-labeled data.

If you want to generate your own pseudo-labeled data, first train a classifier without pseudo-labeled data. Then, use your trained classifier to generate pseudo-labeled data via the process outlined above. Finally, fine-tune your classifier using the pseudo-labeled data.

This implementation does not use pseudo-labeled data. However, you can modify the inference notebook referenced in the Running Inference section to generate pseudo-labeled data.

Set the USE_PSEUDO_LABELS constant in the Config cell to False to skip the use of pseudo-labeled data.

Section 3.6 - Encoding Labels

The process_labels method is used to one-hot encode labels. The one-hot encoded labels are returned as NumPy arrays and added to the training label and validation label lists.
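One-hot encoding itself is straightforward; this numpy sketch (the helper name and 3-class example are hypothetical) shows the shape of the output:

```python
import numpy as np

def one_hot_encode(labels, class_names):
    """One-hot encode string labels against an ordered list of class names."""
    index = {name: i for i, name in enumerate(class_names)}
    encoded = np.zeros((len(labels), len(class_names)), dtype=np.float32)
    for row, label in enumerate(labels):
        encoded[row, index[label]] = 1.0
    return encoded

classes = ["1139490", "1192948", "grasal4"]
print(one_hot_encode(["grasal4", "1139490"], classes))
# [[0. 0. 1.]
#  [1. 0. 0.]]
```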

Section 3.7 - Converting the Training and Validation Data Lists to TensorFlow Dataset Objects

The TensorFlow data.Dataset.from_tensor_slices method is used to create TensorFlow Dataset objects from the training and validation data lists. The shuffle method is called on the training Dataset object to shuffle the training data before batching. The batch method is called on both Dataset objects to batch the training and validation datasets. The BATCH_SIZE constant in the Config cell controls the batch size.
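A minimal sketch of that pipeline, with zero-filled arrays standing in for the real spectrogram images and one-hot labels (63 classes, per the non-GBV class count), and a BATCH_SIZE of 4 chosen arbitrarily:

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 4  # illustrative; the notebook's value lives in its Config cell

# Zero-filled stand-ins for (224, 224, 3) spectrogram images and labels
training_images = np.zeros((10, 224, 224, 3), dtype=np.float32)
training_labels = np.zeros((10, 63), dtype=np.float32)

train_ds = (
    tf.data.Dataset.from_tensor_slices((training_images, training_labels))
    .shuffle(buffer_size=10)  # shuffle training data before batching
    .batch(BATCH_SIZE)
)

images, labels = next(iter(train_ds))
print(images.shape, labels.shape)  # (4, 224, 224, 3) (4, 63)
```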

Section 3.8 - Using MixUp to Augment Training Data

As you may already know, MixUp is a data augmentation technique that effectively blends 2 images together to create a new data sample. The class for the blended image is a blend of the classes associated with the original 2 images. The mix_up method, along with the sample_beta_distribution method, encapsulates the optional MixUp logic.

This implementation optionally uses MixUp to augment the training data. To use MixUp, set the USE_MIXUP constant in the Config cell to True.
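A numpy sketch of the MixUp idea (the notebook's TensorFlow implementation differs in detail): each sample is blended with a randomly chosen partner using a Beta(alpha, alpha)-distributed coefficient, which is the standard MixUp formulation; the array shapes and alpha value here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

def mix_up(images, labels, alpha=0.2):
    """Blend each image/label pair with a shuffled partner using a
    Beta(alpha, alpha)-distributed mixing coefficient."""
    lam = rng.beta(alpha, alpha, size=len(images)).astype(np.float32)
    perm = rng.permutation(len(images))
    lam_img = lam.reshape(-1, 1, 1, 1)
    lam_lbl = lam.reshape(-1, 1)
    mixed_images = lam_img * images + (1.0 - lam_img) * images[perm]
    mixed_labels = lam_lbl * labels + (1.0 - lam_lbl) * labels[perm]
    return mixed_images, mixed_labels

images = rng.random((8, 224, 224, 1)).astype(np.float32)
labels = np.eye(8, dtype=np.float32)  # one-hot labels, one class per sample
mixed_images, mixed_labels = mix_up(images, labels)
print(mixed_images.shape)  # (8, 224, 224, 1)
```

Note that each mixed label row still sums to 1, since the two blended one-hot labels are weighted by lam and (1 - lam).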

Section 4 - Model Training

The Model Training section of the notebook:

  1. Initializes and configures a WandB project to capture training run data.
  2. Builds and compiles the EfficientNet B0 model.
  3. Trains the model.
  4. Saves the trained model to disk.
Section 4.1 - Initializing and Configuring WandB Project

Ensure that you have attached your own WandB API key to the notebook as a Kaggle Secret and that the WandB login method in Section 0 of the notebook has returned True.

The Config cell in this section includes the logic to initialize and configure a new WandB project (if the project does not already exist) that will capture training run data:

wandb.init(project="my-bird-vocalization-classifier")
config = wandb.config
config.batch_size = BATCH_SIZE
config.epochs = 30
config.image_size = IMG_SIZE
config.num_classes = len(LABELS)

Naturally, you can change the project name my-bird-vocalization-classifier to your desired WandB project name.

Section 4.2 - Building and Compiling the EfficientNet B0 Model

The build_model method is used to load the pre-trained EfficientNet B0 model with ImageNet weights and without the top layer:

model = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")

The model is frozen to leverage the pre-trained ImageNet weights with the objective to only unfreeze (i.e. train) layers in the final stage of the model:

# Unfreeze last `unfreeze_layers` layers and add regularization
for layer in model.layers[-unfreeze_layers:]:
   if not isinstance(layer, layers.BatchNormalization):
      layer.trainable = True
      layer.kernel_regularizer = tf.keras.regularizers.l2(L2_RATE)

The UNFREEZE_LAYERS constant in the Config cell specifies the number of layers to unfreeze.

The top of the model is rebuilt with a final Dense layer reflecting the number of bird species classes. Categorical focal cross-entropy is chosen as the loss function to help address the class imbalance. The LOSS_ALPHA and LOSS_GAMMA constants in the Config cell are used with the loss function.
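To see why focal loss helps with imbalance, here is the underlying math in plain numpy (Keras provides an equivalent as tf.keras.losses.CategoricalFocalCrossentropy); the alpha/gamma values and prediction vectors below are made up for illustration:

```python
import numpy as np

LOSS_ALPHA, LOSS_GAMMA = 0.25, 2.0  # illustrative values for the Config constants

def categorical_focal_cross_entropy(y_true, y_pred, alpha=LOSS_ALPHA, gamma=LOSS_GAMMA):
    """Cross-entropy scaled by alpha * (1 - p)^gamma, which down-weights
    confidently classified examples so harder ones dominate the loss."""
    y_pred = np.clip(y_pred, 1e-7, 1.0 - 1e-7)
    modulating_factor = alpha * (1.0 - y_pred) ** gamma
    return -np.sum(y_true * modulating_factor * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 1.0, 0.0]])
confident = np.array([[0.05, 0.90, 0.05]])
uncertain = np.array([[0.30, 0.40, 0.30]])

# The well-classified example contributes far less to the loss
print(categorical_focal_cross_entropy(y_true, confident))
print(categorical_focal_cross_entropy(y_true, uncertain))
```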

Section 4.3 - Model Training

The fit method is called on the compiled model from Section 4.2 to run training. Note that a learning rate scheduler callback, lr_scheduler, is used to adjust the learning rate. An initial learning rate of 4.0e-4 is hardcoded in the callback, and the learning rate is reduced in 2 stages based on the epoch count. The number of training epochs is controlled by the EPOCHS constant in the Config cell.
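Such a scheduler can be sketched as a simple step function; only the 4.0e-4 starting rate and the two-stage reduction come from the notebook, while the epoch breakpoints and reduced rates below are illustrative assumptions:

```python
def lr_schedule(epoch, lr):
    """Two-stage step decay starting from the hardcoded 4.0e-4 rate.
    The breakpoints (10 and 20) and reduced rates are assumptions."""
    if epoch < 10:
        return 4.0e-4
    if epoch < 20:
        return 4.0e-5
    return 4.0e-6

# With Keras this would be attached as:
# tf.keras.callbacks.LearningRateScheduler(lr_schedule)
print([lr_schedule(e, 0.0) for e in (0, 10, 20)])  # [0.0004, 4e-05, 4e-06]
```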

Section 4.4 - Model Saving

The save method is called on the model after training to persist the model to disk.

model.save("bird-vocalization-classifier.keras")

Training Results

Running the notebook should produce the following training results, assuming you used the experimental setup described in the Building the Classifier section:

Training results

As seen, accuracy approaches 90% and validation accuracy approaches 70% over the 30 epochs of training. However, as also seen, validation accuracy fluctuates substantially. This fluctuation is partly attributable to the class imbalance, and to the memory constraints that prevented the use of additional augmentations to fully address that imbalance. The results suggest the model is overfitting the training data and not generalizing as well as hoped. Nonetheless, the model can still be used for inference alongside the GBV classifier.

Running Inference

This Kaggle notebook (the "inference notebook") can be used for running inference. The inference notebook logic uses both the GBV classifier model and the model that you trained in the preceding section. It runs inference on the unlabeled soundscape files in the train_soundscapes directory. Each soundscape audio file is split into 5-second slices. The MAX_FILES constant defined in the Config cell of Section 0 of the notebook controls the number of soundscape audio files that are loaded for inference.
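The slicing step can be sketched as follows; slice_soundscape is an illustrative name for this sketch, assuming the 32 kHz sample rate of the competition audio:

```python
import numpy as np

SAMPLE_RATE = 32000   # assumed BirdCLEF+ soundscape sample rate
SLICE_SECONDS = 5

def slice_soundscape(audio):
    """Split a 1-D soundscape waveform into consecutive 5-second slices.
    Any trailing partial slice is dropped in this sketch."""
    step = SAMPLE_RATE * SLICE_SECONDS
    n = len(audio) // step
    return [audio[i * step:(i + 1) * step] for i in range(n)]

# A 60-second soundscape yields twelve 5-second slices.
slices = slice_soundscape(np.zeros(60 * SAMPLE_RATE))
```

Each slice is then converted to a mel spectrogram and scored independently, which is what allows predictions to be reported per 5-second window.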

The inference notebook first generates predictions using the GBV classifier. The predictions for the 143 BirdCLEF+ 2025 classes known to the GBV classifier are extracted. If the maximum probability among the 143 known classes is above or equal to GBV_CLASSIFIER_THRESHOLD, then the GBV-predicted class is selected as the true class. If the maximum probability among the 143 "known" classes is below GBV_CLASSIFIER_THRESHOLD, it is assumed that the true class is among the 63 classes "unknown" to the GBV classifier - i.e. the classes used to train the model in the preceding section. The logic then runs predictions using the finetuned model. The predicted class from that prediction set is subsequently selected as the true class.
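This two-stage selection can be sketched as a small function; the function and argument names are illustrative, and the threshold value of 0.5 is an assumption rather than the notebook's actual setting:

```python
import numpy as np

GBV_CLASSIFIER_THRESHOLD = 0.5  # illustrative value, not the notebook's setting

def select_class(gbv_probs, finetuned_probs, gbv_labels, finetuned_labels):
    """Two-stage selection: trust the GBV classifier when its top
    probability over the 143 known classes clears the threshold;
    otherwise fall back to the finetuned model's top prediction
    over the 63 classes unknown to the GBV classifier."""
    if np.max(gbv_probs) >= GBV_CLASSIFIER_THRESHOLD:
        return gbv_labels[int(np.argmax(gbv_probs))]
    return finetuned_labels[int(np.argmax(finetuned_probs))]
```

For example, `select_class([0.1, 0.8], [0.9, 0.1], ["known_a", "known_b"], ["unk_a", "unk_b"])` selects the GBV prediction, while a low-confidence GBV output such as `[0.3, 0.2]` defers to the finetuned model.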

The GBV_CLASSIFIER_THRESHOLD constant is defined in the Config cell of Section 5 of the inference notebook. Predictions are output to 2 files:

  1. A preds.csv file that captures the prediction and prediction probability for each 5-second soundscape slice.
  2. A submission.csv file that captures all class probabilities in the format expected for the BirdCLEF+ 2025 competition.

Set the path to your finetuned model in the first cell of Section 4 of the inference notebook.

Future Work

The training notebook can be used to train a model on all 206 BirdCLEF+ 2025 classes, eliminating the need for the GBV classifier, at least with respect to the competition dataset. As mentioned earlier, passing an empty list [] to the load_training_audio method will load audio data from all classes. The MAX_FILES and LOAD_SLICE constants can be used to limit the amount of loaded audio in order to work within the confines of a Kaggle notebook environment.

Of course, the finetuned model itself could also be improved. In many cases, a larger number of augmentations would be needed to adequately address the class imbalance. Additionally, other augmentation techniques, such as CutMix, could be used to further augment the training data. However, these strategies demand a more robust development environment than a Kaggle notebook provides.
