Introduction
Ecologists use automated systems to study a variety of ecosystems. In forested and jungle areas, autonomous recording units (ARUs) are deployed to capture audio that can help identify different species of animals and insects. This information can be used to develop a more complete picture of the species present within an environment. In the case of birds, Google Research notes in their article Separating Birdsong in the Wild for Classification that "ecologists use birds to understand food systems and forest health - for example, if there are more woodpeckers in a forest, that means there's a lot of dead wood." They also note the value of audio-based identification: "[Since] birds communicate and mark territory with songs and calls, it is most efficient to identify them by ear."
Recently, the BirdCLEF+ 2025 competition launched on Kaggle under the umbrella of the ImageCLEF research program. ImageCLEF supports research in the cross-language annotation and retrieval of images. The goal of the competition is succinct: build classification models that can reliably identify bird species from audio recordings.
At first glance, the task appears to be readily addressed by the Google Bird Vocalization (GBV) Classifier, also known as Perch. The GBV classifier is trained on nearly 11,000 bird species and is therefore an obvious choice as a classification model.
However, the competition includes bird species that are not covered by the GBV classifier's training set. As a result, the GBV classifier only achieves ~60% accuracy on the BirdCLEF+ 2025 competition test dataset. Consequently, a custom model needs to be developed.
This article demonstrates an approach for building your own bird vocalization classifier, which can be used in conjunction with the GBV classifier to classify a broader selection of bird species. The design leverages techniques pioneered by Google Research, and it uses the BirdCLEF+ 2025 competition dataset for training.
Training Data
The BirdCLEF+ 2025 training dataset, including supporting files, is approximately 12 GB. The main directories and files comprising the dataset are:
birdclef_2025
|__ train_audio
|__ train_soundscapes
|__ test_soundscapes
|__ recording_location.txt
|__ taxonomy.csv
|__ train.csv
train_audio
The train_audio directory is the largest component of the dataset, comprising 28,564 training audio recordings in .ogg audio format. The audio recordings are organized in sub-directories, each corresponding to a specific bird species, e.g.:
train_audio
|__amakin1
|__ [AUDIO FILES]
|__amekes
|__ [AUDIO FILES]
...
taxonomy.csv
The taxonomy.csv file can be used to look up the scientific and common names of the bird species corresponding to the sub-directory names, e.g.:
SUB-DIRECTORY NAME SCIENTIFIC NAME COMMON NAME
amakin1 Chloroceryle amazona Amazon Kingfisher
amekes Falco sparverius American Kestrel
...
The competition dataset comprises 206 bird species classes, i.e. 206 sub-directories. As noted in the Introduction, 63 of these classes are not covered by the GBV classifier. These non-GBV classes are principally labeled using numerical class identifiers:
1139490, 1192948, 1194042, 126247, 1346504, 134933, 135045, 1462711, 1462737, 1564122, 21038, 21116, 21211, 22333, 22973, 22976, 24272, 24292, 24322, 41663, 41778, 41970, 42007, 42087, 42113, 46010, 47067, 476537, 476538, 48124, 50186, 517119, 523060, 528041, 52884, 548639, 555086, 555142, 566513, 64862, 65336, 65344, 65349, 65373, 65419, 65448, 65547, 65962, 66016, 66531, 66578, 66893, 67082, 67252, 714022, 715170, 787625, 81930, 868458, 963335, grasal4, verfly, y00678
Many of the non-GBV classes are characterized by:

- Limited training data. Class 1139490, for example, only contains 2 audio recordings. By contrast, class amakin1, which is a "known" GBV class, contains 89 recordings.
- Poor recording quality. Highlighting class 1139490 again, both of its training recordings are of poor quality, with one being particularly difficult to discern.
These two conditions are emblematic of a broader imbalance across classes with respect to both the quantity and the quality of training data.
Many of the audio recordings in both the GBV and non-GBV classes also include human speech, with the speaker annotating the recording with details such as the species of bird that was recorded and the location of the recording. In most - but not all - cases, the annotations follow the recorded birdsong.
The approaches used to address the class imbalance and the presence of human speech annotations are discussed in the Building the Classifier section.
train_soundscapes
The train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. As discussed in the Building the Classifier section, these recordings can be incorporated into the training data via pseudo-labeling.
test_soundscapes
The test_soundscapes directory is empty except for a readme.txt file. It is populated with a hidden set of test audio when submitting prediction results to the BirdCLEF+ 2025 competition.
Building the Classifier
Basic Approach and Background
The basic approach used by Google Research to train their bird vocalization classifier involved:

- Splitting the input audio into 5 second segments.
- Converting the audio segments into mel spectrograms.
- Training an image classifier on the mel spectrograms.
The same approach is followed in this article. The image classifier trained here is Google's EfficientNet B0 model. If you are familiar with the EfficientNet family of models, you know that they were designed for efficient image processing.
However, before the audio samples can be split and converted into mel spectrograms, we need to address the class imbalance and human annotation problems discussed in the Training Data section. Broadly, these problems will be addressed via data augmentation and the slicing of audio samples, respectively.

Before diving into the actual design, the following sub-sections provide some useful background information.
EfficientNet Models
Google Research introduced its family of EfficientNet models in 2019 as a set of convolutional neural network models that matched or exceeded state-of-the-art accuracy while being markedly smaller and faster than comparable models.

EfficientNetV2 models, released in 2021, offer even better performance and parameter efficiency.

Although trained on ImageNet data, EfficientNet models have demonstrated their utility when transferred to other datasets, making them an attractive choice as the classification technology for this project.
Mel Spectrograms
A mel spectrogram is a visual representation of an audio signal. It can be thought of as a heatmap of sound.

The x-axis of a mel spectrogram represents the time dimension of the audio signal, and the y-axis represents the frequencies of sounds within the signal. However, rather than displaying all frequencies on a continuous scale, frequencies are grouped into mel bands. These bands, in turn, are spaced using the mel scale. The mel scale is a logarithmic scale that reflects the human auditory system and how humans perceive sound. The colors of a mel spectrogram represent the amplitudes of the sounds within the bands: brighter colors represent higher amplitudes and darker colors represent lower amplitudes.
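As an aside, the frequency-to-mel mapping is commonly expressed with the HTK-style formula shown in the sketch below. This snippet is illustrative (it is not taken from the notebook) and simply demonstrates the logarithmic compression described above:

```python
import numpy as np

def hz_to_mel(f_hz):
    # HTK-style mel scale: approximately linear below ~1 kHz,
    # logarithmically compressed above it.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The same 1 kHz step covers fewer mels at high frequencies than at low ones,
# mirroring how human pitch perception compresses high frequencies.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
```

This compression is why mel spectrograms devote more resolution to the lower frequencies where much of the perceptually relevant detail lives.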
Design
My discussion of the design provides a high-level overview of the approach without going into excessive detail. The training (i.e. fine-tuning) logic is captured in this Kaggle notebook ("training notebook"), which comprises 4 main sections:

- Section 1: Audio data loading.
- Section 2: Audio data processing.
- Section 3: Mel spectrogram generation and input preparation.
- Section 4: Model training.
You will find that each of the main notebook sections begins with (1) the imports used by that section and (2) a Config cell defining the constants used within that section and subsequent sections.
The training notebook actually begins with a Section 0 where the base Python packages used throughout the notebook are imported. This section also includes the logic to log in to Weights & Biases ("WandB") for tracking training runs. You will need to attach your own WandB API key to the notebook as a Kaggle Secret using the name WANDB_API_KEY.

As discussed in the Training Data section, the unlabeled training soundscapes can be incorporated into the training data via pseudo-labeling. The use of pseudo-labeled data is discussed in Section 3.5 - Loading Pseudo-Labeled Data. Keep in mind that Kaggle non-GPU environments are limited to 30 GiB of memory.
The model trained with the experimental setup discussed in the following sub-sections has been posted to Kaggle. If you would like, you can use that model without training your own and jump directly to the Running Inference section to run inference on birdsong audio.
Section 1 - Audio Data Loading
The Audio Data Loading section of the notebook:

- Identifies the classes in the BirdCLEF+ 2025 competition dataset that are not covered by the GBV classifier.
- Loads raw audio data via the load_training_audio method.
- Creates a processed_audio directory and saves a copy of the loaded audio data as .wav files to that directory.
The Config cell of this section includes a MAX_FILES constant. This constant specifies the maximum number of audio files to load from a given class. It is intentionally set to the high value of 1000 to ensure that all audio files are loaded for the non-GBV classes. You may need to adjust this constant for your own experimental setup. For example, if you are loading audio data for all classes, you may need to set this constant to a lower value to avoid exhausting available memory.
The load_training_audio method can be called with a classes parameter, which is a list of the classes whose audio will be loaded. For this project, the non-GBV classes are captured in a list that is assigned to the variable missing_classes, which is subsequently passed to the load_training_audio method via the classes parameter:
# `missing_classes` list
['1139490', '1192948', '1194042', '126247', '1346504', '134933', '135045', '1462711', '1462737', '1564122', '21038', '21116', '21211', '22333', '22973', '22976', '24272', '24292', '24322', '41663', '41778', '41970', '42007', '42087', '42113', '46010', '47067', '476537', '476538', '48124', '50186', '517119', '523060', '528041', '52884', '548639', '555086', '555142', '566513', '64862', '65336', '65344', '65349', '65373', '65419', '65448', '65547', '65962', '66016', '66531', '66578', '66893', '67082', '67252', '714022', '715170', '787625', '81930', '868458', '963335', 'grasal4', 'verfly', 'y00678']
You can load all 206 BirdCLEF+ 2025 classes by passing an empty list as the classes parameter.
The load_training_audio method also accepts an optional boolean use_slice parameter. This parameter works with the LOAD_SLICE constant defined in the Config cell. The use_slice parameter and LOAD_SLICE constant are not used with this implementation. However, they can be used to load a specific amount of audio from each file. For example, to load only 5 seconds of audio from each audio file, set LOAD_SLICE to 160000, which is calculated as 5 times the sampling rate of 32000, and pass True to the use_slice parameter.
The load_training_audio method also accepts a boolean make_copy parameter. When this parameter is True, the logic creates a processed_audio directory and saves a copy of each audio sample as a .wav file to that directory. The audio copies are saved to sub-directories reflecting the class that they belong to. The processed_audio directory is used in the next section to save modified audio samples to disk without affecting the BirdCLEF+ 2025 dataset directories.
The load_training_audio method returns a dictionary of loaded audio data keyed on class name. Each value in the dictionary is a list of tuples of the form (AUDIO_FILENAME, AUDIO_DATA):
{'1139490': [('CSA36389.ogg', tensor([[-7.3379e-06, 1.0008e-05, -8.9483e-06, ..., 2.9978e-06,
3.4201e-06, 3.8700e-06]])), ('CSA36385.ogg', tensor([[-2.9545e-06, 2.9259e-05, 2.8138e-05, ..., -5.8680e-09, -2.3467e-09, -2.6546e-10]]))], '1192948': [('CSA36388.ogg', tensor([[ 3.7417e-06, -5.4138e-06, -3.3517e-07, ..., -2.4159e-05, -1.6547e-05, -1.8537e-05]])), ('CSA36366.ogg', tensor([[ 2.6916e-06, -1.5655e-06, -2.1533e-05, ..., -2.0132e-05, -1.9063e-05, -2.4438e-05]])), ('CSA36373.ogg', tensor([[ 3.4144e-05, -8.0636e-06, 1.4903e-06, ..., -3.8835e-05, -4.1840e-05, -4.0731e-05]])), ('CSA36358.ogg', tensor([[-1.6201e-06, 2.8240e-05, 2.9543e-05, ..., -2.9203e-04, -3.1059e-04, -2.8100e-04]]))], '1194042': [('CSA18794.ogg', tensor([[ 3.0655e-05, 4.8817e-05, 6.2794e-05, ..., -5.1450e-05,
-4.8535e-05, -4.2476e-05]])), ('CSA18802.ogg', tensor([[ 6.6640e-05, 8.8530e-05, 6.4143e-05, ..., 5.3802e-07, -1.7509e-05, -4.8914e-06]])), ('CSA18783.ogg', tensor([[-8.6866e-06, -6.3421e-06, -3.1125e-05, ..., -1.7946e-04, -1.6407e-04, -1.5334e-04]]))] ...}
The method also outputs basic statistics describing the data loaded for each class as a comma-separated-value string. You can optionally export these statistics to inspect the data:
class,sampling_rate,num_files,num_secs_loaded,num_files_loaded
1139490,32000,2,194,2
1192948,32000,4,420,4
1194042,32000,3,91,3
...
Section 2 - Audio Data Processing
The Audio Data Processing section of the notebook:

- Strips silent segments and slices audio to eliminate human annotations from as many audio samples as possible. Stripping silence reduces the uninformative portions of the recordings.
- Augments audio for minority classes. Audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) both adding a randomly generated noise signal and changing the tempo of the raw audio.
Section 2.1 - Detecting Silent Segments
The detect_silence method is used to "slide" over each raw audio sample and identify silent segments by comparing the root-mean square (RMS) value of a given segment to a specified threshold. If the RMS is below the threshold, the segment is identified as a silent segment. The following constants specified in the Config cell of this section control the behavior of the detect_silence method:
SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)
SIL_HOP = int(1.0 * SIL_FRAME)
SIL_THRESHOLD = 5e-5
SIL_REPLACE_VAL = -1000 # Value used to replace audio signal values within silent segments
The SIL_FRAME and SIL_HOP constants can be modified to adjust how the method "slides" over the raw audio. Similarly, the SIL_THRESHOLD value can be modified to make the method more aggressive or conservative with respect to the identification of silent segments.
The method outputs a dictionary of silent segment markers for each file in each class. Audio files with no detected silent segments are identified by empty lists.
{'1139490': {'CSA36389.ogg': [0, 8000, 16000, 272000, 280000, 288000, 296000, 304000], 'CSA36385.ogg': [0, 8000, 16000, 24000, 240000, 248000, 256000]}, '1192948': {'CSA36388.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36366.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 280000, 288000], 'CSA36373.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000], 'CSA36358.ogg': [8000]}, '1194042': {'CSA18794.ogg': [], 'CSA18802.ogg': [], 'CSA18783.ogg': [0, 8000, 16000, 24000, 600000, 608000, 616000]}, '126247': {'XC941297.ogg': [], 'iNat1109254.ogg': [], 'iNat888527.ogg': [], 'iNat320679.ogg': [0], 'iNat888729.ogg': [], 'iNat146584.ogg': []}, '1346504': {'CSA18803.ogg': [0, 8000, 16000, 24000, 3000000, 3008000, 3016000], 'CSA18791.ogg': [], 'CSA18792.ogg': [], 'CSA18784.ogg': [0, 8000, 16000, 1232000, 1240000, 1248000], 'CSA18793.ogg': [0, 8000, 16000, 24000, 888000]} ...}
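The core of this RMS-based detection can be sketched as follows. This is an illustrative numpy reimplementation using the constants above, not the notebook's exact code:

```python
import numpy as np

SR = 32000                  # sampling rate used by the dataset
SIL_FRAME = int(SR * 0.25)  # 8000-sample analysis frame
SIL_HOP = SIL_FRAME         # non-overlapping frames
SIL_THRESHOLD = 5e-5

def detect_silence(audio):
    # Slide over the signal and record the start index of every frame
    # whose root-mean-square value falls below the threshold.
    markers = []
    for start in range(0, len(audio) - SIL_FRAME + 1, SIL_HOP):
        frame = audio[start:start + SIL_FRAME]
        if np.sqrt(np.mean(frame ** 2)) < SIL_THRESHOLD:
            markers.append(start)
    return markers

# Half a second of silence followed by half a second of a quiet tone.
audio = np.concatenate([np.zeros(16000),
                        0.1 * np.sin(2 * np.pi * 440 * np.arange(16000) / SR)])
```

Running detect_silence on the synthetic sample above flags only the two frames covering the leading silence, mirroring the marker lists shown in the dictionary output.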
Section 2.2 - Removing Silent Segments and Eliminating Human Annotations
The USE_REMOVE_SILENCE_AND_HUMAN_ANNOT constant defined in the Config cell of this section specifies if audio should be stripped of silent segments and sliced to remove most human annotations.
USE_REMOVE_SILENCE_AND_HUMAN_ANNOT = True
The remove_silence_and_human_annot method strips silent segments from audio samples using the output of the detect_silence method. Further, it implements logic to handle human annotations based on a simple observation: many audio samples, namely those with human annotations, tend to have the following structure:
|   < 10s   |   ~1s   |                  |
| BIRDSONG  | SILENCE | HUMAN ANNOTATION |
The birdsong and human annotation sections themselves may contain silent segments. However, as seen in the diagram above, the bird vocalizations tend to occur within the first few seconds of audio. Therefore, a simple, if imperfect, approach to handling human annotations is to slice audio samples at the first silent segment marker that falls outside a specified window, under the assumption that a human annotation follows that silent segment. The remove_silence_and_human_annot logic uses the ANNOT_BREAKPOINT constant in the Config cell to check if a silent segment marker lies outside the window specified by ANNOT_BREAKPOINT, expressed in a number of seconds. If it does, the logic slices the raw audio at that marker and retains only the data that occurs before it. A manual inspection of processed audio during experimentation confirmed this to be a satisfactory approach. However, as mentioned in the Training Data section, there are some audio recordings where the human annotation precedes the birdsong. The logic described here does not address those cases. Further, some audio samples feature long sequences of recorded birdsong with no silent segments; such samples are unaffected by the logic described above and are retained in their entirety.
A second constant, SLICE_FRAME, can be optionally used in a final processing step to return an even more refined slice of the processed audio. Set SLICE_FRAME to the number of seconds of processed audio that you want to retain.
The remove_silence_and_human_annot method saves processed audio to disk under the directory processed_audio via the save_audio parameter, which is passed as True. The method returns a dictionary of the total seconds of processed audio for each class:
{'1139490': 14, '1192948': 29, '1194042': 24, '126247': 48, '1346504': 40, '134933': 32, '135045': 77, ...}
The get_audio_stats method is used after remove_silence_and_human_annot to get the average number of seconds of audio across all classes.
Section 2.3 - Calculating Augmentation Turns for Minority Classes
As mentioned in the Training Data section, the classes are not balanced. Augmentation is used in this notebook section to help address the imbalance, leveraging the average number of seconds of audio across all classes, as provided by the get_audio_stats method. Classes with total seconds of processed audio below the average are augmented. The get_augmentation_turns_per_class method determines the number of augmentation turns for each minority class using the average number of seconds per processed audio sample:
TURNS = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS)/AVG_SECS_PER_AUDIO_SAMPLE
Minority classes further below the average will have more augmentation turns than minority classes nearer the average, which will have fewer augmentation turns.
The get_augmentation_turns_per_class method includes an AVG_SECS_FACTOR constant, which can be used to adjust the value for the average number of seconds of audio across all classes. The constant can be used to make the logic more conservative or aggressive when calculating the number of augmentation turns.
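The turn calculation can be sketched as follows. This is an illustrative standalone function, with avg_secs_factor standing in for the AVG_SECS_FACTOR constant; it is not the notebook's exact implementation:

```python
def augmentation_turns(total_secs_for_class, avg_secs_across_classes,
                       avg_secs_per_sample, avg_secs_factor=1.0):
    # TURNS = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS)
    #         / AVG_SECS_PER_AUDIO_SAMPLE
    target = avg_secs_across_classes * avg_secs_factor
    deficit = target - total_secs_for_class
    if deficit <= 0:
        return 0  # class is at or above the average: no augmentation needed
    return round(deficit / avg_secs_per_sample)
```

For example, a class holding 14 seconds of processed audio against a 50 second average, with 6 second samples, would receive (50 - 14) / 6 = 6 augmentation turns, while a class above the average receives none.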
Section 2.4 - Running Augmentations
The USE_AUGMENTATIONS constant defined in the Config cell of this section specifies if audio should be augmented:
USE_AUGMENTATIONS = True
As mentioned earlier, audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) both adding a randomly generated noise signal and changing the tempo of the raw audio. The add_noise and change_tempo methods encapsulate the logic for adding a noise signal and changing the tempo respectively. The noise signal range and tempo change range can be adjusted via the following constants in the Config cell:
NOISE_RNG_LOW = 0.0001
NOISE_RNG_HIGH = 0.0009
TEMPO_RNG_LOW = 0.5
TEMPO_RNG_HIGH = 1.5
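The two augmentations can be sketched in numpy as shown below. This is an illustrative reimplementation, not the notebook's code; in particular, the linear-resampling tempo change shown here also shifts pitch, which a more careful implementation may want to avoid:

```python
import numpy as np

rng = np.random.default_rng(0)

NOISE_RNG_LOW, NOISE_RNG_HIGH = 0.0001, 0.0009
TEMPO_RNG_LOW, TEMPO_RNG_HIGH = 0.5, 1.5

def add_noise(audio):
    # Add uniform white noise with an amplitude drawn from the configured range.
    amp = rng.uniform(NOISE_RNG_LOW, NOISE_RNG_HIGH)
    return audio + rng.uniform(-amp, amp, size=audio.shape)

def change_tempo(audio):
    # Naive tempo change by linear resampling: a factor > 1 produces a
    # shorter (faster) signal, a factor < 1 a longer (slower) one.
    factor = rng.uniform(TEMPO_RNG_LOW, TEMPO_RNG_HIGH)
    new_len = int(len(audio) / factor)
    positions = np.linspace(0, len(audio) - 1, num=new_len)
    return np.interp(positions, np.arange(len(audio)), audio)
```

Both functions leave the original array untouched and return a new one, so an augmentation turn can safely apply either or both to the same source sample.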
The run_augmentations method runs the augmentations using the output from the get_augmentations_turns_per_class method. For those classes that will be augmented, the logic:

- Randomly selects a processed audio sample (i.e. one with silent segments already removed) for augmentation.
- Randomly selects the augmentation to perform: (1) adding noise, (2) changing the tempo, or (3) adding noise and changing the tempo.
- Saves the augmented audio to disk under the applicable class within the processed_audio directory.
While the notebook logic augments minority classes whose total seconds of audio fall below the average, it does not augment them to the point of perfectly balancing the dataset. This approach was taken to manage the memory available in the notebook environment, and in recognition that the class imbalance is further addressed through the choice of the loss function.
Section 3 - Mel Spectrogram Generation and Input Preparation
The Mel Spectrogram Generation and Input Preparation section of the notebook:
- Splits the processed audio data into training and validation lists.
- Splits the audio into 5 second frames.
- Generates mel spectrograms for each 5 second audio frame.
- Resizes mel spectrograms to a target size of (224, 224).
- Optionally loads pseudo-labeled data samples to augment the training data.
- One-hot encodes training data and validation data labels.
- Constructs TensorFlow Dataset objects from the training and validation data lists.
- Optionally uses MixUp logic to augment the training data.
Section 3.1 - Splitting Processed Audio Data
Processed audio data is loaded from the processed_audio
folder. The data is split into 4 lists:
training_audio
training_labels
validation_audio
validation_labels
The labels are, of course, the class names associated with the audio examples. The SPLIT constant defined in the Config cell controls the split ratio between the training and validation data lists. Processed audio data is shuffled before splitting.
Section 3.2 - Splitting Audio into Frames

Audio is split into 5 second segments using the frame_audio method, which uses the TensorFlow signal.frame method to split each audio example. The following constants in the Config cell control the split operation:
FRAME_LENGTH = 5
FRAME_STEP = 5
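For reference, the framing step performed by tf.signal.frame is equivalent to the numpy sketch below: non-overlapping 5 second windows, with any trailing remainder dropped, matching tf.signal.frame's default pad_end=False behavior.

```python
import numpy as np

SR = 32000          # dataset sampling rate
FRAME_LENGTH = 5    # seconds per frame
FRAME_STEP = 5      # seconds between frame starts

def frame_audio(audio):
    frame_len = FRAME_LENGTH * SR
    step = FRAME_STEP * SR
    n_frames = max(0, 1 + (len(audio) - frame_len) // step)
    if n_frames == 0:
        return np.empty((0, frame_len))
    # Stack complete frames; a partial trailing frame is discarded.
    return np.stack([audio[i * step:i * step + frame_len]
                     for i in range(n_frames)])
```

For example, 12 seconds of audio yields two complete 5 second frames, with the final 2 seconds discarded.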
Section 3.3 - Generating Mel Spectrograms
Mel spectrograms are generated for each 5 second audio frame generated in Section 3.2 using the audio2melspec method. The following constants in the Config cell specify the parameters used when generating the mel spectrograms, such as the number of mel bands, the minimum frequency, and the maximum frequency:
# Mel spectrogram parameters
N_FFT = 1024 # FFT size
HOP_SIZE = 256
N_MELS = 256
FMIN = 50 # minimum frequency
FMAX = 14000 # maximum frequency
The frequency band was chosen to reflect the likely range of most bird vocalizations. However, some bird species can vocalize outside this range.
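To make the transformation concrete, here is a compact numpy sketch of a mel spectrogram computation using the constants above. The notebook's audio2melspec method likely relies on a library implementation; this simplified version (windowed power STFT, triangular mel filters, dB scaling) is meant only to illustrate the steps:

```python
import numpy as np

SR, N_FFT, HOP_SIZE, N_MELS, FMIN, FMAX = 32000, 1024, 256, 256, 50, 14000

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank():
    # Triangular filters with peaks evenly spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(FMIN), hz_to_mel(FMAX), N_MELS + 2))
    bins = np.floor((N_FFT + 1) * hz_pts / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(1, N_MELS + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def melspectrogram(audio):
    # Windowed power STFT, mel-filtered, converted to decibels.
    n_frames = 1 + (len(audio) - N_FFT) // HOP_SIZE
    window = np.hanning(N_FFT)
    frames = np.stack([audio[i * HOP_SIZE:i * HOP_SIZE + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank().T
    return 10.0 * np.log10(np.maximum(mel, 1e-10)).T  # (N_MELS, n_frames)
```

A 5 second frame at a 32,000 Hz sampling rate produces a (256, 622) spectrogram with these settings, which is subsequently resized for the image classifier.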
Section 3.4 - Resizing Mel Spectrograms

The to_melspectrogram_image method is used to convert each mel spectrogram to a pillow Image object. Each Image object is subsequently resized to (224, 224), which is the input dimension expected by the EfficientNet B0 model.
Section 3.5 - Loading Pseudo-Labeled Data
As mentioned in the Training Data section, the train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. These audio recordings can be incorporated into the training data via pseudo-labeling. A simple process to create pseudo-labeled data is as follows:
- Train a classifier without pseudo-labeled data.
- Load the training soundscape audio files.
- Segment the soundscape audio into 5 second frames.
- Generate mel spectrograms for each 5 second frame and resize them to (224, 224).
- Run predictions on each resized mel spectrogram using the classifier trained in the first step.
- Keep the predictions above a chosen confidence threshold and save the mel spectrograms for those predictions to disk under the predicted class label.
- Train your classifier again using the pseudo-labeled data.
If you want to generate your own pseudo-labeled data, first follow the preceding steps to train a classifier without pseudo-labeled data. Then, use your trained classifier to generate pseudo-labeled data via the process outlined above. Finally, retrain your classifier using your pseudo-labeled data.

This implementation did not use pseudo-labeled data. However, you can modify the inference notebook referenced in the Running Inference section to generate pseudo-labeled data.
Set the USE_PSEUDO_LABELS constant in the Config cell to False to skip the use of pseudo-labeled data.
Section 3.6 - Encoding Labels
The process_labels method is used to one-hot encode labels. One-hot encoded labels are returned as NumPy arrays and added to the training label and validation label lists.
Section 3.7 - Converting Training and Validation Data Lists to TensorFlow Dataset Objects

The TensorFlow data.Dataset.from_tensor_slices method is used to create TensorFlow Dataset objects from the training and validation data lists. The shuffle method is called on the training Dataset object to shuffle the training data before batching. The batch method is called on both Dataset objects to batch the training and validation datasets. The BATCH_SIZE constant in the Config cell controls the batch size.
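A minimal sketch of this dataset construction is shown below. It is illustrative only: the stand-in data, variable names, and BATCH_SIZE value are assumptions, not the notebook's exact code.

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 32  # assumed value; the notebook defines this in its Config cell

# Stand-in data: 8 spectrogram "images" and one-hot labels for 3 classes.
training_images = np.random.rand(8, 224, 224, 3).astype("float32")
training_labels = np.eye(3, dtype="float32")[np.random.randint(0, 3, size=8)]

train_ds = (tf.data.Dataset.from_tensor_slices((training_images, training_labels))
            .shuffle(buffer_size=len(training_images))  # shuffle before batching
            .batch(BATCH_SIZE))

images, labels = next(iter(train_ds))
```

Shuffling before batching matters: shuffling afterwards would only reorder whole batches rather than mixing samples across them.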
Section 3.8 - Using MixUp to Augment Training Data

As you may know, MixUp is a data augmentation technique that effectively blends two images to produce a new data sample. The class for the blended image is a blend of the classes associated with the original 2 images. The mix_up method, along with the sample_beta_distribution method, encapsulates the optional MixUp logic.

This implementation optionally uses MixUp to augment the training data. To use MixUp, set the USE_MIXUP constant in the Config cell to True.
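The essence of MixUp can be sketched in a few lines of numpy. This is illustrative only; the notebook's mix_up and sample_beta_distribution methods operate on TensorFlow tensors, and the alpha value here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_beta(alpha=0.2):
    # Beta(alpha, alpha) with small alpha favors values near 0 or 1,
    # so most mixed samples stay close to one of the originals.
    return rng.beta(alpha, alpha)

def mix_up(x1, y1, x2, y2, lam=None):
    # Blend two samples and their one-hot labels with the same weight.
    if lam is None:
        lam = sample_beta()
    return lam * x1 + (1.0 - lam) * x2, lam * y1 + (1.0 - lam) * y2
```

For example, blending an image of class 0 with an image of class 1 at lam = 0.25 yields a soft label of [0.25, 0.75].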
Section 4 - Model Training

The Model Training section of the notebook:

- Initializes and configures a WandB project to capture training run data.
- Builds and compiles an EfficientNet B0 model.
- Trains the model.
- Saves the trained model to disk.
Section 4.1 - Initializing and Configuring WandB Project
Ensure that you have attached your own WandB API key to the notebook as a Kaggle Secret and that the WandB login method in Section 0 of the notebook has returned True.
The Config cell of this section includes the logic to initialize and configure a new WandB project (if the project does not already exist) that will capture training run data:
wandb.init(project="my-bird-vocalization-classifier")
config = wandb.config
config.batch_size = BATCH_SIZE
config.epochs = 30
config.image_size = IMG_SIZE
config.num_classes = len(LABELS)
Note that you can change the project name my-bird-vocalization-classifier to your preferred WandB project name.
Section 4.2 - Building and Compiling the EfficientNet B0 Model

The build_model method is used to load the pre-trained EfficientNet B0 model with ImageNet weights and without the top layer:
model = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")
The model is frozen to leverage the pre-trained ImageNet weights with the objective to only unfreeze (i.e. train) layers in the final stage of the model:
# Unfreeze last `unfreeze_layers` layers and add regularization
for layer in model.layers[-unfreeze_layers:]:
if not isinstance(layer, layers.BatchNormalization):
layer.trainable = True
layer.kernel_regularizer = tf.keras.regularizers.l2(L2_RATE)
The UNFREEZE_LAYERS constant in the Config cell specifies the number of layers to unfreeze.
The top of the model is rebuilt with a final Dense layer reflecting the number of bird species classes. Categorical focal cross-entropy is chosen as the loss function to help address the class imbalance. The LOSS_ALPHA and LOSS_GAMMA constants in the Config cell are used with the loss function.
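For intuition, categorical focal cross-entropy scales the standard cross-entropy by a factor of (1 - p)^gamma, down-weighting examples the model already classifies confidently so that hard, often minority-class, examples dominate the gradient. A numpy sketch of the per-sample loss (the alpha and gamma defaults here are common choices, not necessarily the notebook's LOSS_ALPHA and LOSS_GAMMA values):

```python
import numpy as np

def categorical_focal_ce(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    # Cross-entropy term, scaled by alpha and the focusing term (1 - p)^gamma.
    p = np.clip(y_pred, eps, 1.0 - eps)
    return np.sum(alpha * (1.0 - p) ** gamma * (-y_true * np.log(p)), axis=-1)

y = np.array([1.0, 0.0, 0.0])
confident = categorical_focal_ce(y, np.array([0.9, 0.05, 0.05]))
uncertain = categorical_focal_ce(y, np.array([0.4, 0.3, 0.3]))
```

A confident correct prediction incurs a tiny loss, while an uncertain one is penalized far more heavily, and the gap is much wider than it would be with plain cross-entropy.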
Section 4.3 - Model Training

The fit method is called on the compiled model from Section 4.2 to run training. Note that a learning rate scheduler callback, lr_scheduler, is used to adjust the learning rate. An initial learning rate of 4.0e-4 is hardcoded in the callback, and the learning rate is reduced in two stages based on the epoch count. The number of training epochs is controlled by the EPOCHS constant in the Config cell.
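A learning rate scheduler of this shape can be sketched as follows. The epochs at which the rate drops, and the drop factor, are assumptions for illustration; only the 4.0e-4 starting rate comes from the notebook:

```python
INITIAL_LR = 4.0e-4  # hardcoded starting rate, per the notebook

def lr_scheduler(epoch, lr):
    # Hypothetical two-stage decay: halve the rate at epochs 10 and 20.
    if epoch in (10, 20):
        return lr * 0.5
    return lr

# Simulate 30 epochs the way Keras' LearningRateScheduler callback would.
lr = INITIAL_LR
history = []
for epoch in range(30):
    lr = lr_scheduler(epoch, lr)
    history.append(lr)
```

A function with this (epoch, lr) signature can be passed directly to a Keras LearningRateScheduler callback.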
Section 4.4 - Model Saving

The save method is called on the compiled model after training to save the model to disk:
model.save("bird-vocalization-classifier.keras")
Training Results

Running the notebook should produce the following training results, assuming you use the experimental setup described in the Building the Classifier section:

As seen, accuracy is just above 90% and validation accuracy is just above 70% after training for 30 epochs. However, as also seen, the validation loss fluctuates considerably. This variance is partially attributed to the class imbalance, with the memory constraints of the environment preventing the use of additional augmentations to fully address it. The results suggest that the model overfits the training data and does not generalize as well as one might hope. Nonetheless, the model can be used for inference in conjunction with the GBV classifier with reasonable results.
Running Inference

This Kaggle notebook ("inference notebook") can be used for running inference. The inference notebook logic uses both the GBV classifier model and the model that you trained in the preceding section. It runs inference on the unlabeled soundscape files in the train_soundscapes directory. Each soundscape audio file is split into 5 second frames. The MAX_FILES constant defined in the Config cell of Section 0 of the notebook controls the number of soundscape audio files that are loaded for inference.
The inference notebook first generates predictions using the GBV classifier. The predictions for the 143 BirdCLEF+ 2025 competition species known to the GBV classifier are then isolated. If the maximum probability among the 143 "known" species is above or equal to GBV_CLASSIFIER_THRESHOLD, then the GBV predicted class is selected as the true class. If the maximum probability among the 143 "known" species is below GBV_CLASSIFIER_THRESHOLD, it is assumed that the true class is among the 63 classes "unknown" to the GBV classifier - i.e. the classes used to train the model in the preceding section. The logic then runs predictions using the finetuned model. The predicted class from that prediction set is subsequently selected as the true class.
The GBV_CLASSIFIER_THRESHOLD constant is defined in the Config cell of Section 5 of the inference notebook. Predictions are output to 2 files:

- A preds.csv file that captures the prediction and prediction probability for each 5-second soundscape slice.
- A submission.csv file that captures all of the class probabilities in the format expected for BirdCLEF+ 2025 competition submissions.
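The two-stage selection logic described above can be sketched as follows. This is illustrative only: the threshold value shown is an assumption, and the real notebook operates over batches of 5-second slices rather than a single probability vector:

```python
import numpy as np

GBV_CLASSIFIER_THRESHOLD = 0.5  # assumed value; see the notebook's Config cell

def select_class(gbv_probs, gbv_classes, custom_probs, custom_classes,
                 threshold=GBV_CLASSIFIER_THRESHOLD):
    # Trust the GBV classifier when it is confident about one of the 143
    # species it knows; otherwise fall back to the finetuned model's
    # prediction over the 63 remaining classes.
    best = int(np.argmax(gbv_probs))
    if gbv_probs[best] >= threshold:
        return gbv_classes[best]
    return custom_classes[int(np.argmax(custom_probs))]
```

Raising the threshold routes more slices to the finetuned model; lowering it defers more often to the GBV classifier.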
Set the path to your finetuned model in the first cell of Section 4 of the inference notebook.
Future Work

The training notebook can be used to train a model on all 206 BirdCLEF+ 2025 classes, eliminating the need for the GBV classifier, at least with respect to the competition dataset. As mentioned earlier, passing an empty list, [], to the load_training_audio method will load audio data from all classes. The MAX_FILES and LOAD_SLICE constants can be used to limit the amount of loaded audio in order to work within the confines of a Kaggle notebook environment.
Of course, a more capable development environment would open up further improvements. Ideally, a larger number of augmentations would be used to fully address the class imbalance. Additionally, other augmentation techniques, such as CutMix, could be employed to augment the training data further. Both strategies, however, demand more compute and memory than a Kaggle notebook environment provides.