Machine Learning for the Birds: Building Your Own Bird Vocalization Classifier

Introduction

Scientists use automated systems to study large ecosystems. In forest and jungle areas, autonomous recording units (ARUs) are used to record audio that can help identify different species of animals and insects. This information can be used to develop a better understanding of the distribution of species within a given environment. In the case of birds, Google Research notes in its article Separating Birdsong in the Wild for Classification that "Ecologists use birds to understand food systems and forest health - for example, if there are more woodpeckers in a forest, that means there's a lot of dead wood." Further, they note the value of audio-based identification: "[Since] birds communicate and mark territory with songs and calls, it's most efficient to identify them by ear. In fact, experts may identify up to 10x as many birds by ear as by sight."

Recently, the BirdCLEF+ 2025 competition launched on Kaggle under the umbrella of the ImageCLEF organization. ImageCLEF supports investigation into cross-language annotation and retrieval of images across a variety of domains. The goal of the competition is straightforward: design a classification model that can accurately predict the species of bird from an audio recording.

At first, the task seems trivial given the availability of the Google Bird Vocalization (GBV) Classifier, also known as Perch. The GBV classifier is trained on nearly 11,000 bird species and is thus an obvious choice as the classification model.

However, the competition includes bird species that lie outside the GBV classifier training set. As a result, the GBV classifier only achieves ~60% accuracy on the BirdCLEF+ 2025 competition test dataset. Consequently, a custom model must be developed.

This guide details an approach to building your own bird vocalization classifier that can be used in conjunction with the GBV classifier to classify a wider selection of bird species.
The approach uses the same basic techniques described in the Google Research article. The design makes use of the BirdCLEF+ 2025 competition dataset for training.

Training Data

The BirdCLEF+ 2025 training dataset, inclusive of supporting files, is approximately 12 GB. The main directories and files that comprise the dataset structure are:

    birdclef_2025
    |__ train_audio
    |__ train_soundscapes
    |__ test_soundscapes
    recording_location.txt
    taxonomy.csv
    train.csv

train_audio

The train_audio directory is the largest component of the dataset, containing 28,564 training audio recordings in the .ogg audio format. Audio recordings are grouped in sub-directories which each represent a particular bird species, e.g.:

    train_audio
    |__amakin1
       |__ [AUDIO FILES]
    |__amekes
       |__ [AUDIO FILES]
    ...

The taxonomy.csv file can be used to look up the actual scientific and common names of the bird species represented by the sub-directory names, e.g.:

    SUB-DIRECTORY NAME   SCIENTIFIC NAME        COMMON NAME
    amakin1              Chloroceryle amazona   Amazon Kingfisher
    amekes               Falco sparverius       American Kestrel
    ...

The competition dataset encompasses 206 unique bird species, i.e. 206 classes. As mentioned in the Introduction, 63 of these classes are not covered by the GBV Classifier. These Non-GBV classes are generally labeled using a numeric class identifier:

    1139490, 1192948, 1194042, 126247, 1346504, 134933, 135045, 1462711, 1462737,
    1564122, 21038, 21116, 21211, 22333, 22973, 22976, 24272, 24292, 24322, 41663,
    41778, 41970, 42007, 42087, 42113, 46010, 47067, 476537, 476538, 48124, 50186,
    517119, 523060, 528041, 52884, 548639, 555086, 555142, 566513, 64862, 65336,
    65344, 65349, 65373, 65419, 65448, 65547, 65962, 66016, 66531, 66578, 66893,
    67082, 67252, 714022, 715170, 787625, 81930, 868458, 963335, grasal4, verfly,
    y00678

Some of the Non-GBV classes are characterized by:

- Limited training data. Class 1139490, for example, only contains 2 audio recordings. By contrast, class amakin1, which is a "known" GBV class, contains 89 recordings.
- Poor recording quality.
  Highlighting class 1139490 again, both training recordings are of poor quality with one being particularly difficult to discern.

These 2 conditions lead to a significant imbalance between classes in terms of the quantity and quality of available audio.

Many of the training audio recordings across both GBV and Non-GBV classes also include human speech, with the speaker annotating the recording with details such as the species of bird that was recorded and the location of the recording. In most, but not all, cases, the annotations follow the recorded bird vocalizations.

Tactics used to address the class imbalance and the presence of human speech annotations are discussed in the Building the Classifier section.

train_soundscapes

The train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. As will be discussed in the Building the Classifier section, these audio recordings can be incorporated into the training data via pseudo-labeling.

test_soundscapes

The test_soundscapes directory is empty except for a readme.txt file. This directory is populated with a hidden set of test audio when submitting prediction results to the BirdCLEF+ 2025 competition.

Building the Classifier

Basic Approach and Background

The basic approach used by Google Research to train their bird vocalization classifier is as follows:

- Split recorded audio into 5 second segments.
- Convert the audio segments into mel spectrograms.
- Train an image classifier on the mel spectrograms.

The same approach will be followed in this guide. The image classifier that will be trained is Google's EfficientNet B0 model. If you have familiarity with the EfficientNet family of models, you know that they were designed for efficient image processing.

However, before audio samples can be split and converted to mel spectrograms, we must deal with the class imbalance and human annotation problems mentioned in the Training Data section. Broadly, these problems will be addressed respectively through data augmentation and slicing the audio samples.

Before diving into the actual design, the following sub-sections provide some brief background information.
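As a rough illustration of the first two steps of the approach listed above, the sketch below frames a signal into 5 second segments and computes a log-scaled mel spectrogram using NumPy only. It is not the notebook's code: the function names are hypothetical, the HTK-style mel formula and Hann window are assumptions, and the parameter values are simply the ones listed later in Section 3.3.

```python
import numpy as np

SR = 32000                                   # sampling rate of the competition audio
FRAME_SECS = 5                               # segment length in seconds
N_FFT, HOP_SIZE, N_MELS = 1024, 256, 256     # values listed in Section 3.3
FMIN, FMAX = 50, 14000


def split_into_frames(audio, sr=SR, frame_secs=FRAME_SECS):
    """Split a 1-D signal into non-overlapping frames, dropping any remainder."""
    frame_len = sr * frame_secs
    n_frames = len(audio) // frame_len
    return audio[: n_frames * frame_len].reshape(n_frames, frame_len)


def hz_to_mel(freq_hz):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels, fmin, fmax):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, bin_hz.size))
    for i in range(n_mels):
        low, center, high = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - low) / (center - low)
        falling = (high - bin_hz) / (high - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mel_spectrogram(frame, sr=SR):
    """Return a (N_MELS, time) log-power mel spectrogram in decibels."""
    window = np.hanning(N_FFT)
    n_cols = 1 + (len(frame) - N_FFT) // HOP_SIZE
    cols = np.stack([frame[i * HOP_SIZE: i * HOP_SIZE + N_FFT] * window
                     for i in range(n_cols)])
    power = np.abs(np.fft.rfft(cols, axis=1)) ** 2
    mel_power = power @ mel_filterbank(sr, N_FFT, N_MELS, FMIN, FMAX).T
    return 10.0 * np.log10(mel_power.T + 1e-10)


# 12 seconds of a synthetic 3 kHz tone stands in for a real recording.
t = np.arange(12 * SR) / SR
audio = 0.1 * np.sin(2 * np.pi * 3000.0 * t)
frames = split_into_frames(audio)
spec = mel_spectrogram(frames[0])
print(frames.shape, spec.shape)
```

The 12 second clip yields two full 5 second frames, with the trailing 2 seconds dropped. Whether a real pipeline should pad or discard such a remainder is a design choice this sketch does not settle.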
EfficientNet Models

Google Research introduced its family of EfficientNet models in 2019 as a set of convolutional neural network models that surpassed state-of-the-art models, at that time, with respect to both size and performance.

EfficientNetV2 models, released in 2021, offer even better performance and parameter efficiency.

Though trained on ImageNet data, EfficientNet models have demonstrated their utility when transferred to other datasets, making them an attractive choice as the classification technology for this project.

Mel Spectrograms

A mel spectrogram is a visual representation of an audio signal. It might best be analogized to a heatmap for sound.

The x-axis of a mel spectrogram represents the time dimension of the audio signal, and the y-axis represents the frequencies of the sounds within the signal. However, instead of displaying all frequencies along a continuous scale, frequencies are grouped into mel bands. These bands are, in turn, spaced out using the mel scale. The mel scale is a logarithmic scale that approximates the human auditory system and how humans perceive sound. The colors of the mel spectrogram represent the amplitude of the sounds within the bands. Brighter colors represent higher amplitudes while darker colors represent lower amplitudes.

Design

My objective in discussing the design is to provide a high-level review of the approach without going into too much detail. The main training (fine-tuning) logic is captured in this notebook ("training notebook") which consists of 4 main sections:

- Section 1: Audio data loading.
- Section 2: Audio data processing.
- Section 3: Mel spectrogram generation and input preparation.
- Section 4: Model training.

You will note that the first 2 cells of each main section are (1) imports used by that section and (2) a Config cell defining constants used in that section and subsequent sections.

The training notebook actually starts with Section 0 where base Python packages used throughout the notebook are imported. This section also includes the logic to log in to Weights & Biases ("WandB") for tracking training runs.
You will need to attach your own WandB API key to the notebook as a Kaggle Secret using the name WANDB_API_KEY.

As discussed in the Training Data section, the unlabeled training soundscapes can be incorporated into the training data via pseudo-labeling. Use of pseudo-labeled data is discussed in Section 3.5 - Pseudo-Labeling. Keep in mind that Kaggle non-GPU environments are limited to 30 GiB of memory.

A trained model following the experimental setup described in the following sub-sections has been posted to Kaggle. If desired, you can use this model without training your own and jump directly to the Running Inference section to run inference on birdsong audio.

Section 1 - Audio Data Loading

The Audio Data Loading section of the notebook:

- Extracts those classes in the BirdCLEF+ 2025 competition dataset that are not covered by the GBV classifier.
- Loads raw audio data via the load_training_audio method.
- Creates a processed_audio directory and saves a copy of the loaded audio data as .wav files in that directory.

The Config cell of this section includes a MAX_FILES constant. This constant specifies the maximum number of audio files to load from a given class. It is arbitrarily set to the large value of 1000 to ensure that all audio files are loaded for the non-GBV classes. You may need to adjust this constant for your own experimental setup. For example, if you are loading audio data from all classes, you may need to set this constant to a lower value to avoid exhausting available memory.

The load_training_audio method can be called with a classes parameter, which is a list of classes whose audio will be loaded. For this project, the non-GBV classes are stored as a list and assigned to the variable missing_classes, which is subsequently passed to the load_training_audio method via the classes parameter.
    # `missing_classes` list
    missing_classes = [
        '1139490', '1192948', '1194042', '126247', '1346504', '134933', '135045',
        '1462711', '1462737', '1564122', '21038', '21116', '21211', '22333',
        '22973', '22976', '24272', '24292', '24322', '41663', '41778', '41970',
        '42007', '42087', '42113', '46010', '47067', '476537', '476538', '48124',
        '50186', '517119', '523060', '528041', '52884', '548639', '555086',
        '555142', '566513', '64862', '65336', '65344', '65349', '65373', '65419',
        '65448', '65547', '65962', '66016', '66531', '66578', '66893', '67082',
        '67252', '714022', '715170', '787625', '81930', '868458', '963335',
        'grasal4', 'verfly', 'y00678',
    ]

You can load all 206 BirdCLEF+ 2025 classes by passing an empty list as the classes parameter.

The load_training_audio method also accepts an optional boolean use_slice parameter. This parameter works with the LOAD_SLICE constant defined in the Config cell. The use_slice parameter and LOAD_SLICE constant are not used with this implementation. However, they can be used to load a specific amount of audio from each file. For example, to load only 5 seconds of audio from each audio file, set LOAD_SLICE to 160000, which is calculated as 5 times the sampling rate of 32000, and pass True to the use_slice parameter.

The load_training_audio method also accepts a boolean make_copy parameter.
When this parameter is True, the logic creates a processed_audio directory and saves a copy of each audio sample as a .wav file to the directory. Audio copies are saved into sub-directories reflecting the class to which they belong. The processed_audio directory is used in the next section to save modified audio samples to disk without affecting the BirdCLEF+ 2025 dataset directories.

The load_training_audio method returns a dictionary of loaded audio data using the class names as keys. Each value in the dictionary is a list of tuples of the form (AUDIO_FILENAME, AUDIO_DATA):

    {'1139490': [('CSA36389.ogg', tensor([[-7.3379e-06, 1.0008e-05, -8.9483e-06, ..., 2.9978e-06, 3.4201e-06, 3.8700e-06]])),
                 ('CSA36385.ogg', tensor([[-2.9545e-06, 2.9259e-05, 2.8138e-05, ..., -5.8680e-09, -2.3467e-09, -2.6546e-10]]))],
     '1192948': [('CSA36388.ogg', tensor([[ 3.7417e-06, -5.4138e-06, -3.3517e-07, ..., -2.4159e-05, -1.6547e-05, -1.8537e-05]])),
                 ('CSA36366.ogg', tensor([[ 2.6916e-06, -1.5655e-06, -2.1533e-05, ..., -2.0132e-05, -1.9063e-05, -2.4438e-05]])),
                 ('CSA36373.ogg', tensor([[ 3.4144e-05, -8.0636e-06, 1.4903e-06, ..., -3.8835e-05, -4.1840e-05, -4.0731e-05]])),
                 ('CSA36358.ogg', tensor([[-1.6201e-06, 2.8240e-05, 2.9543e-05, ..., -2.9203e-04, -3.1059e-04, -2.8100e-04]]))],
     '1194042': [('CSA18794.ogg', tensor([[ 3.0655e-05, 4.8817e-05, 6.2794e-05, ..., -5.1450e-05, -4.8535e-05, -4.2476e-05]])),
                 ('CSA18802.ogg', tensor([[ 6.6640e-05, 8.8530e-05, 6.4143e-05, ..., 5.3802e-07, -1.7509e-05, -4.8914e-06]])),
                 ('CSA18783.ogg', tensor([[-8.6866e-06, -6.3421e-06, -3.1125e-05, ..., -1.7946e-04, -1.6407e-04, -1.5334e-04]]))]
     ...}

The method also returns basic statistics describing the data loaded for each class as a comma-separated-value string. You can optionally export these statistics to inspect the data.

    class,sampling_rate,num_files,num_secs_loaded,num_files_loaded
    1139490,32000,2,194,2
    1192948,32000,4,420,4
    1194042,32000,3,91,3
    ...
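One way to inspect the statistics string is to parse it with Python's standard csv module. The sketch below is illustrative only: parse_stats and classes_below are hypothetical helper names, the 120 second cutoff is an arbitrary value chosen for the example, and the input is limited to the three rows shown above. It also restates the LOAD_SLICE arithmetic from this section.

```python
import csv
import io

SR = 32000
LOAD_SLICE = 5 * SR   # 160000 samples, i.e. 5 seconds at a 32000 Hz sampling rate

# First rows of the statistics string shown above.
stats_csv = """class,sampling_rate,num_files,num_secs_loaded,num_files_loaded
1139490,32000,2,194,2
1192948,32000,4,420,4
1194042,32000,3,91,3
"""


def parse_stats(text):
    """Turn the CSV statistics string into a list of dictionaries."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "class": row["class"],
            "num_files_loaded": int(row["num_files_loaded"]),
            "num_secs_loaded": int(row["num_secs_loaded"]),
        })
    return rows


def classes_below(rows, min_secs):
    """Return the classes with fewer than `min_secs` seconds of loaded audio."""
    return [r["class"] for r in rows if r["num_secs_loaded"] < min_secs]


rows = parse_stats(stats_csv)
sparse = classes_below(rows, min_secs=120)   # 120 is an arbitrary illustrative cutoff
print(sparse)
```

A quick pass like this makes it easier to see which classes are short on audio before the augmentation steps in Section 2.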
Section 2 - Audio Data Processing

The Audio Data Processing section of the notebook:

- Optionally strips silent segments and slices audio to eliminate most human annotations from raw audio. Stripping silent segments eliminates irrelevant parts of the audio signal.
- Optionally augments audio for minority classes to help address the class imbalance. Audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio.

Section 2.1 - Detecting Silent Segments

The detect_silence method is used to "slide" over each raw audio sample and identify silent segments by comparing the root-mean square (RMS) value of a given segment to a specified threshold. If the RMS is below the threshold, the segment is identified as a silent segment. The following constants specified in the Config cell of this section control the behavior of the detect_silence method:

    SIL_FRAME_PCT_OF_SR = 0.25
    SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)
    SIL_HOP = int(1.0 * SIL_FRAME)
    SIL_THRESHOLD = 5e-5
    SIL_REPLACE_VAL = -1000 # Value used to replace audio signal values within silent segments

The SIL_FRAME and SIL_HOP constants can be modified to adjust how the method "slides" over the raw audio. Similarly, the SIL_THRESHOLD value can be modified to make the method more aggressive or conservative with respect to identification of silent segments.

The method outputs a dictionary of silent segment markers for each file in each class. Audio files with no detected silent segments are identified by empty lists.
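The sliding RMS check described in this section can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's detect_silence implementation: find_silent_markers is a hypothetical name, the constants reuse the Config values above with SR taken as 32000, and skipping a trailing partial window is an assumption. SIL_REPLACE_VAL is not used here.

```python
import numpy as np

SR = 32000
SIL_FRAME_PCT_OF_SR = 0.25
SIL_FRAME = int(SR * SIL_FRAME_PCT_OF_SR)   # 8000 samples per window
SIL_HOP = int(1.0 * SIL_FRAME)              # hop equals window size, so no overlap
SIL_THRESHOLD = 5e-5


def find_silent_markers(audio, frame=SIL_FRAME, hop=SIL_HOP, threshold=SIL_THRESHOLD):
    """Return the start index of every window whose RMS is below the threshold."""
    markers = []
    # Assumption: a trailing partial window is skipped rather than padded.
    for start in range(0, len(audio) - frame + 1, hop):
        window = audio[start:start + frame]
        rms = np.sqrt(np.mean(window ** 2))
        if rms < threshold:
            markers.append(start)
    return markers


# Synthetic clip: 0.5 s of silence, 1 s of a 2 kHz tone, 0.25 s of silence.
tone = 0.05 * np.sin(2 * np.pi * 2000.0 * np.arange(SR) / SR)
clip = np.concatenate([np.zeros(SR // 2), tone, np.zeros(SR // 4)])
markers = find_silent_markers(clip)
print(markers)
```

With a hop equal to the window size, markers fall on multiples of 8000 samples, which matches the spacing of the markers in the notebook output for this section.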
Sample output from detect_silence:

    {'1139490': {'CSA36389.ogg': [0, 8000, 16000, 272000, 280000, 288000, 296000, 304000],
                 'CSA36385.ogg': [0, 8000, 16000, 24000, 240000, 248000, 256000]},
     '1192948': {'CSA36388.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000],
                 'CSA36366.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 280000, 288000],
                 'CSA36373.ogg': [0, 8000, 16000, 24000, 256000, 264000, 272000, 288000],
                 'CSA36358.ogg': [8000]},
     '1194042': {'CSA18794.ogg': [],
                 'CSA18802.ogg': [],
                 'CSA18783.ogg': [0, 8000, 16000, 24000, 600000, 608000, 616000]},
     '126247': {'XC941297.ogg': [],
                'iNat1109254.ogg': [],
                'iNat888527.ogg': [],
                'iNat320679.ogg': [0],
                'iNat888729.ogg': [],
                'iNat146584.ogg': []},
     '1346504': {'CSA18803.ogg': [0, 8000, 16000, 24000, 3000000, 3008000, 3016000],
                 'CSA18791.ogg': [],
                 'CSA18792.ogg': [],
                 'CSA18784.ogg': [0, 8000, 16000, 1232000, 1240000, 1248000],
                 'CSA18793.ogg': [0, 8000, 16000, 24000, 888000]}
     ...}

Section 2.2 - Removing Silent Segments and Eliminating Human Annotations

The USE_REMOVE_SILENCE_AND_HUMAN_ANNOT constant defined in the Config cell of this section specifies if audio should be stripped of silent segments and sliced to remove most human annotations.

    USE_REMOVE_SILENCE_AND_HUMAN_ANNOT = True

The remove_silence_and_human_annot method strips silent segments from audio samples using the output from the detect_silence method. Further, it implements logic to handle human annotations based on a simple observation: many audio samples, namely those with human annotations, tend to have the following structure:

    |  < 10s   |   ~1s   |                  |
    | BIRDSONG | SILENCE | HUMAN ANNOTATION |

The birdsong and human annotation sections themselves may contain silent segments. However, as seen in the diagram above, the bird vocalization recordings often occur within the first few seconds of audio.
Therefore, a simple, if imperfect, approach to deal with human annotations is to slice audio samples at the first silent segment marker that occurs outside of a specified window, under the assumption that a human annotation follows that silent segment. The remove_silence_and_human_annot logic uses the ANNOT_BREAKPOINT constant in the Config cell to check if a silent segment marker lies outside the window specified by ANNOT_BREAKPOINT, expressed in number of seconds. If it does, the logic slices the raw audio at that marker and only retains the data that occurs before it. A manual inspection of processed audio during experimentation revealed this approach to be satisfactory. However, as mentioned in the Training Data section, there are some audio recordings where the human annotation precedes the bird recording. The logic described here does not address those cases. Some audio samples feature long sequences of recorded birdsong and these samples often do not have silent segments. Such samples are unaffected by the previously described logic and kept in their entirety.

A second constant, SLICE_FRAME, can be optionally used in a final processing step to return an even more refined slice of the processed audio. Set SLICE_FRAME to the number of seconds of processed audio that you want to retain.

The remove_silence_and_human_annot method saves processed audio to disk under the processed_audio directory via the save_audio parameter, which is passed as True. The method returns a dictionary of the total seconds of processed audio for each class.

    {'1139490': 14, '1192948': 29, '1194042': 24, '126247': 48, '1346504': 40, '134933': 32, '135045': 77, ...}

The get_audio_stats method is used following remove_silence_and_human_annot to get the average number of seconds of audio across all classes.

Section 2.3 - Calculating Augmentation Turns for Minority Classes

As mentioned in the Training Data section, the classes are not balanced.
Augmentation is used in this notebook section to help address the imbalance, leveraging the average number of seconds of audio across all classes, as provided by the get_audio_stats method. Classes with total seconds of processed audio below the average are augmented. The get_augmentation_turns_per_class method determines the number of augmentation turns for each minority class using the average number of seconds per processed audio sample.

    TURNS = (AVG_SECS_AUDIO_ACROSS_CLASSES - TOTAL_SECS_AUDIO_FOR_CLASS) / AVG_SECS_PER_AUDIO_SAMPLE

Minority classes further below the average will have more augmentation turns, whereas minority classes nearer the average will have fewer.

The get_augmentation_turns_per_class method includes an AVG_SECS_FACTOR constant which can be used to adjust the value for the average number of seconds of audio across all classes. The constant can be used to make the logic more conservative or aggressive when calculating the number of augmentation turns.

Section 2.4 - Running Augmentations

The USE_AUGMENTATIONS constant defined in the Config cell of this section specifies if audio should be augmented.

    USE_AUGMENTATIONS = True

As mentioned earlier, audio augmentation consists of (1) adding a randomly generated noise signal, (2) changing the tempo of the raw audio, or (3) adding a randomly generated noise signal and changing the tempo of the raw audio. The add_noise and change_tempo methods encapsulate the logic for adding a noise signal and changing the tempo respectively.
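To make the turn calculation from Section 2.3 and the augmentations themselves more concrete, here is a hedged NumPy sketch. It is not the notebook's code. Several things are assumptions: rounding turns up, applying AVG_SECS_FACTOR as a multiplier on the average, treating the noise range as bounds on a Gaussian standard deviation, and the 10 second average sample length in the example. The range values are the Config constants for this section. The speed change is a naive resampling stand-in that, unlike a true tempo change, also shifts pitch.

```python
import math
import numpy as np

NOISE_RNG_LOW, NOISE_RNG_HIGH = 0.0001, 0.0009
TEMPO_RNG_LOW, TEMPO_RNG_HIGH = 0.5, 1.5


def augmentation_turns(total_secs, avg_secs_per_sample, avg_secs_factor=1.0):
    """TURNS = (average - class total) / average sample length, for minority classes."""
    avg_secs = avg_secs_factor * sum(total_secs.values()) / len(total_secs)
    return {
        cls: math.ceil((avg_secs - secs) / avg_secs_per_sample)  # rounding up is an assumption
        for cls, secs in total_secs.items()
        if secs < avg_secs
    }


def add_noise(audio, rng):
    """Add Gaussian noise whose standard deviation is drawn from the noise range."""
    scale = rng.uniform(NOISE_RNG_LOW, NOISE_RNG_HIGH)
    return audio + rng.normal(0.0, scale, size=audio.shape)


def change_speed(audio, rng):
    """Naive resampling: changes duration AND pitch, unlike a true tempo change."""
    rate = rng.uniform(TEMPO_RNG_LOW, TEMPO_RNG_HIGH)
    n_out = max(1, int(round(len(audio) / rate)))
    positions = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(positions, np.arange(len(audio)), audio)


def augment(audio, rng):
    """Randomly apply noise, a speed change, or both."""
    choice = rng.integers(3)
    if choice == 0:
        return add_noise(audio, rng)
    if choice == 1:
        return change_speed(audio, rng)
    return change_speed(add_noise(audio, rng), rng)


# Totals taken from the sample output in Section 2.2 (only the seven classes shown there).
totals = {'1139490': 14, '1192948': 29, '1194042': 24, '126247': 48,
          '1346504': 40, '134933': 32, '135045': 77}
turns = augmentation_turns(totals, avg_secs_per_sample=10.0)  # 10.0 is a placeholder value
rng = np.random.default_rng(0)
sample = 0.05 * np.sin(2 * np.pi * 1000.0 * np.arange(32000) / 32000)
augmented = augment(sample, rng)
print(turns, augmented.shape)
```

Classes at or above the average receive no entry in the result, mirroring the statement that only classes below the average are augmented.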
The noise signal range and tempo change range can be adjusted via the following constants in the Config cell:

    NOISE_RNG_LOW = 0.0001
    NOISE_RNG_HIGH = 0.0009
    TEMPO_RNG_LOW = 0.5
    TEMPO_RNG_HIGH = 1.5

The run_augmentations method runs the augmentations using the output from the get_augmentations_turns_per_class method. For those classes that will be augmented, the logic:

- Randomly selects a processed audio sample (i.e. silent segments already removed) for augmentation.
- Randomly selects the augmentation to perform: (1) adding noise, (2) changing the tempo, or (3) adding noise and changing the tempo.
- Saves the augmented audio to disk under the appropriate class within the processed_audio directory.

While the notebook logic augments minority classes with total seconds of audio below the average, it ignores those classes with total seconds of audio above the average. This approach was taken to manage available memory and with the understanding that the class imbalance is further addressed through choice of the loss function.

Section 3 - Mel Spectrogram Generation and Input Preparation

The Mel Spectrogram Generation and Input Preparation section of the notebook:

- Splits processed audio data into training and validation lists.
- Splits audio into 5 second frames.
- Generates mel spectrograms for each 5 second audio frame.
- Resizes mel spectrograms to a target size of (224, 224).
- Optionally loads pseudo-labeled data samples to augment training data.
- One-hot encodes training data and validation data labels.
- Constructs TensorFlow Dataset objects from training and validation data lists.
- Optionally uses MixUp logic to augment training data.

Section 3.1 - Splitting Processed Audio Data

Processed audio data is loaded from the processed_audio folder. The data is split into 4 lists:

    training_audio
    training_labels
    validation_audio
    validation_labels

Labels are, of course, the class names associated with the audio examples.
The SPLIT constant defined in the Config cell controls the split ratio between the training and validation data lists. Processed audio data is shuffled before splitting.

Section 3.2 - Splitting Audio into Frames

Audio is split into 5 second segments using the frame_audio method, which itself uses the TensorFlow signal.frame method to split each audio example. The following constants in the Config cell control the split operation:

    FRAME_LENGTH = 5
    FRAME_STEP = 5

Section 3.3 - Generating Mel Spectrograms

Mel spectrograms are generated for each 5 second audio frame generated in Section 3.2 via the audio2melspec method. The following constants in the Config cell specify the parameters used when generating the mel spectrograms, such as the number of mel bands, minimum frequency, and maximum frequency:

    # Mel spectrogram parameters
    N_FFT = 1024   # FFT size
    HOP_SIZE = 256
    N_MELS = 256
    FMIN = 50      # minimum frequency
    FMAX = 14000   # maximum frequency

The frequency band was chosen to reflect the potential range of most bird vocalizations. However, some bird species can vocalize outside this range.

Section 3.4 - Resizing Mel Spectrograms

The to_melspectrogram_image method is used to convert each mel spectrogram to a pillow Image object. Each Image object is subsequently resized to (224, 224), which is the input dimension expected by the EfficientNet B0 model.

Section 3.5 - Loading Pseudo-Labeled Data

As mentioned in the Training Data section, the train_soundscapes directory contains nearly 10,000 unlabeled audio recordings of birdsong. These audio recordings can be incorporated into the training data via pseudo-labeling. A simple process to create pseudo-labeled data is as follows:

- Train a classifier without pseudo-labeled data.
- Load training soundscape audio files.
- Segment each audio soundscape into 5 second frames.
- Generate mel spectrograms for each 5 second frame and resize to (224, 224).
- Run predictions on each resized mel spectrogram using the classifier that you trained in the first step.
- Keep the predictions above a desired confidence level and save the mel spectrograms for those predictions to disk under the predicted class label.
- Train your classifier again using the pseudo-labeled data.

Pseudo-labeled data can improve the performance of your classifier. If you want to generate your own pseudo-labeled data, you should continue with the remaining sections to train a classifier without pseudo-labeled data. Then, use your classifier to create your own set of pseudo-labeled data using the process outlined above. Finally, re-train your classifier using your pseudo-labeled data.

This implementation does not use pseudo-labeled data. However, you can modify the inference notebook referenced in the Running Inference section to generate pseudo-labeled data.

Set the USE_PSEUDO_LABELS constant in the Config cell to False to skip the use of pseudo-labeled data.

Section 3.6 - Encoding Labels

The process_labels method is used to one-hot encode labels. One-hot encoded labels are returned as NumPy arrays and added to the training label and validation label lists.

Section 3.7 - Converting Training and Validation Data Lists to TensorFlow Dataset Objects

The TensorFlow data.Dataset.from_tensor_slices method is used to create TensorFlow Dataset objects from the training and validation data lists. The shuffle method is called on the training Dataset object to shuffle training data before batching. The batch method is called on both Dataset objects to batch the training and validation datasets. The BATCH_SIZE constant in the Config cell controls the batch size.

Section 3.8 - Using MixUp to Augment Training Data

As you may already know, MixUp is a data augmentation technique that effectively mixes two images together to create a new data sample. The class for the blended image is a blend of the classes associated with the original 2 images. The mix_up method, along with the sample_beta_distribution method, encapsulates the optional MixUp logic.
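For readers new to MixUp, the blending described above can be sketched with NumPy as follows. This is an illustration under assumptions, not the notebook's mix_up or sample_beta_distribution code: mix_up_batch is a hypothetical name, the mixing weight is drawn from a Beta(alpha, alpha) distribution with an assumed alpha of 0.2, and random arrays stand in for resized mel spectrogram images and one-hot labels.

```python
import numpy as np


def mix_up_batch(images, labels, rng, alpha=0.2):
    """Blend each sample with a randomly chosen partner from the same batch."""
    n = images.shape[0]
    lam = rng.beta(alpha, alpha, size=n)      # one mixing weight per sample
    partner = rng.permutation(n)              # index of the sample to blend with
    lam_img = lam.reshape(n, 1, 1, 1)         # broadcast over height, width, channels
    lam_lab = lam.reshape(n, 1)               # broadcast over classes
    mixed_images = lam_img * images + (1.0 - lam_img) * images[partner]
    mixed_labels = lam_lab * labels + (1.0 - lam_lab) * labels[partner]
    return mixed_images, mixed_labels


rng = np.random.default_rng(0)
images = rng.random((4, 224, 224, 3))                  # stand-ins for spectrogram images
labels = np.eye(63)[rng.integers(0, 63, size=4)]       # one-hot labels for 63 classes
mixed_images, mixed_labels = mix_up_batch(images, labels, rng)
print(mixed_images.shape, mixed_labels.sum(axis=1))
```

Because the same weight is applied to an image and its label, each blended label still sums to 1, which keeps it usable with a categorical loss.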
This implementation uses MixUp to augment the training data. To use MixUp, set the USE_MIXUP constant in the Config cell to True.

Section 4 - Model Training

The Model Training section of the notebook:

- Initializes and configures a WandB project to capture training run data.
- Builds and compiles the EfficientNet B0 model.
- Trains the model.
- Saves the trained model to disk.

Section 4.1 - Initializing and Configuring WandB Project

Ensure that you have attached your own WandB API key as a Kaggle Secret to the notebook and that the WandB login method in Section 0 of the notebook has returned True.

The Config cell in this section includes logic to initialize and configure a new WandB project (if the project doesn't already exist) that will capture training run data:

    wandb.init(project="my-bird-vocalization-classifier")
    config = wandb.config
    config.batch_size = BATCH_SIZE
    config.epochs = 30
    config.image_size = IMG_SIZE
    config.num_classes = len(LABELS)

You can change the project name my-bird-vocalization-classifier to your desired WandB project name.

Section 4.2 - Building and Compiling the EfficientNet B0 Model

The build_model method is used to load the pre-trained EfficientNet B0 model with ImageNet weights and without the top layer:

    model = EfficientNetB0(include_top=False, input_tensor=inputs, weights="imagenet")

The model is frozen to leverage the pre-trained ImageNet weights, with the objective of unfreezing (i.e. training) only the layers in the final stage of the model:

    # Unfreeze last `unfreeze_layers` layers and add regularization
    for layer in model.layers[-unfreeze_layers:]:
        if not isinstance(layer, layers.BatchNormalization):
            layer.trainable = True
            layer.kernel_regularizer = tf.keras.regularizers.l2(L2_RATE)

The UNFREEZE_LAYERS constant in the Config cell specifies the number of layers to unfreeze.
The top of the model is rebuilt with a final Dense layer reflecting the number of bird species classes. Categorical focal cross-entropy is chosen as the loss function to help address the class imbalance. The LOSS_ALPHA and LOSS_GAMMA constants in the Config cell are used with the loss function.

Section 4.3 - Model Training

The fit method is called on the compiled model from Section 4.2 to run training. Note that a learning rate scheduler callback, lr_scheduler, is used in lieu of a constant learning rate. An initial learning rate of 4.0e-4 is hardcoded into the callback. The learning rate is decreased in 2 stages based on the epoch count. The number of training epochs is controlled by the EPOCHS constant in the Config cell.

Section 4.4 - Model Saving

The save method is called on the compiled model following training to save the model to disk.

    model.save("bird-vocalization-classifier.keras")

Training Results

Running the notebook with the experimental setup described in the Building the Classifier section should produce results along the following lines. Accuracy is just above 90% and validation accuracy is approximately 70% after training for 30 epochs. However, the validation accuracy fluctuates significantly. This variation is partially attributed to the class imbalance, with available memory limiting the use of additional augmentations to fully address the imbalance. The results suggest that the model suffers from overfitting on the training data and does not generalize as well as would be hoped. Nonetheless, the model can be used for predictions alongside the GBV classifier in accordance with the original objective.

Running Inference

This notebook ("inference notebook") can be used for running inference. The inference notebook logic uses both the GBV classifier model and the model that you trained in the preceding section. It runs inference on the unlabeled soundscapes files in the train_soundscapes directory. Each soundscapes audio file is split into 5 second frames.
The MAX_FILES constant defined in the Config cell of Section 0 of the notebook controls the number of soundscapes audio files that are loaded for inference.

The inference notebook first generates predictions using the GBV classifier. The predictions for the 143 BirdCLEF+ 2025 competition dataset classes known to the GBV classifier are isolated. If the maximum probability among the 143 "known" classes is greater than or equal to GBV_CLASSIFIER_THRESHOLD, then the GBV predicted class is selected as the true class. If the maximum probability among the 143 "known" classes is below GBV_CLASSIFIER_THRESHOLD, it is assumed that the true class is among the 63 classes "unknown" to the GBV classifier, i.e. the classes used to train the model in the preceding section. The logic then runs predictions using the finetuned model. The predicted class from that prediction set is subsequently selected as the true class.

The GBV_CLASSIFIER_THRESHOLD constant is defined in the Config cell of Section 5 of the inference notebook. Predictions are output to 2 files:

- A preds.csv file that captures the prediction and prediction probability for each 5-second soundscape slice.
- A submission.csv file that captures all class probabilities in the format for the BirdCLEF+ 2025 competition.

Set the path to your finetuned model in the first cell of Section 4 of the inference notebook.

Future Work

The training notebook can be used to train a model on all 206 BirdCLEF+ 2025 classes, eliminating the need for the GBV classifier, at least with respect to the competition dataset. As mentioned earlier, passing an empty list, [], to the load_training_audio method will load audio data from all classes. The MAX_FILES and LOAD_SLICE constants can be used to limit the amount of loaded audio in order to work within the confines of a Kaggle notebook environment.
Of course, a more accurate model can be trained using a larger amount of training data. Ideally, a greater number of augmentations would be used to address the class imbalance. Additionally, other augmentation techniques, such as CutMix, could be implemented to further augment the training data. However, these strategies require a more robust development environment.