Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI (equal contribution);
(2) Rusiru Thushara, Mohamed bin Zayed University of AI (equal contribution);
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 1 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Supplementary Material
Extending image-based Large Multimodal Models (LMMs) to videos is challenging because of the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not use audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich its understanding of video content. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos by following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks designed specifically to measure prompt-based object grounding performance in videos. Furthermore, we propose using Vicuna instead of GPT-3.5, which Video-ChatGPT relies on, for benchmarking video-based conversation; this ensures reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks.
Recent efforts on Large Multimodal Models (LMMs), spearheaded by GPT-4V [25], allow detailed conversations about images but generally do not scale well to videos. The magnitude of video data far exceeds that of other modalities because of its massive volume on social media and the internet. Furthermore, extending LMMs to videos is challenging because of their complex dynamics and long temporal context, which must be understood accurately. Although recent approaches to video-LMMs such as VideoChat [15], Video-LLaMA [45], and Video-ChatGPT [22] have demonstrated capabilities in video comprehension and dialogue, they lack the crucial feature of visual grounding. Visual grounding in videos aims to associate the LMM's responses with specific objects in the video input. Addressing this gap, we introduce PG-Video-LLaVA, the first video-LMM capable of localizing the objects that appear in LMM responses. This task leads to improved robustness and demonstrates a deeper understanding of video content.
In PG-Video-LLaVA, we address the unique challenges posed by video data. The model is designed to track objects within shorter video clips that maintain a consistent camera view, enabling accurate visual grounding across scenes and motions. This tracking links spatio-temporal segments directly to conversational elements, enhancing the model's contextual understanding. A salient feature of PG-Video-LLaVA is its modular design, which allows easy integration with existing grounding modules and the flexibility to adapt to future advances in visual grounding technology. Moreover, PG-Video-LLaVA enriches its capabilities by incorporating audio context. It achieves this by providing the video's audio to the LLM in a form it can understand, which is particularly useful in situations where auditory information is essential to the conversation. This inclusion broadens the model's understanding and makes it more versatile in interpreting video content.
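The idea of linking tracked spatio-temporal segments to conversational elements can be illustrated with a small sketch. It is not the authors' implementation: it assumes that a tracker has already produced per-clip tracks and that a grounding step has produced one box per phrase in some frame, and it links the two by box overlap (IoU), which is a choice made here for illustration. The names `Track`, `iou`, `link_phrases_to_tracks`, and the `min_iou` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Track:
    """One object followed through a single clip: frame index -> box."""
    track_id: int
    boxes: Dict[int, Box]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_phrases_to_tracks(
    phrase_boxes: Dict[str, Tuple[int, Box]],
    tracks: List[Track],
    min_iou: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, Optional[int]]:
    """Map each phrase to the track that best overlaps its grounded box.

    phrase_boxes: phrase -> (frame index, box) from a hypothetical grounding step.
    Returns phrase -> track_id, or None when no track overlaps enough.
    """
    links: Dict[str, Optional[int]] = {}
    for phrase, (frame_idx, box) in phrase_boxes.items():
        best_id, best_score = None, min_iou
        for track in tracks:
            track_box = track.boxes.get(frame_idx)
            if track_box is None:
                continue  # this track is not visible in that frame
            score = iou(box, track_box)
            if score >= best_score:
                best_id, best_score = track.track_id, score
        links[phrase] = best_id
    return links


# Toy clip with two tracked objects and two phrases from a model response.
tracks = [
    Track(0, {0: (10, 10, 50, 50), 1: (12, 10, 52, 50)}),
    Track(1, {0: (100, 100, 160, 180), 1: (102, 100, 162, 180)}),
]
phrase_boxes = {
    "the dog": (0, (11, 11, 49, 51)),
    "a red ball": (1, (300, 300, 320, 320)),
}
links = link_phrases_to_tracks(phrase_boxes, tracks)
```

In this toy example "the dog" is linked to track 0, while "a red ball" overlaps no track and stays unlinked. Keeping the tracker and the grounding step behind plain data structures like these is one way a modular design could let either component be swapped independently.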
Furthermore, this work introduces an improved framework for benchmarking video-based conversational models, pivoting from previous approaches [22] that relied predominantly on the proprietary GPT-3.5-Turbo model for evaluation. Because GPT-3.5-Turbo can change at any time and lacks transparency owing to its closed-source nature, it raises concerns about reliability and reproducibility. To address this, we propose using Vicuna, an open-source LLM, for benchmarking. This shift not only enhances reproducibility but also improves transparency in the evaluation process. We evaluate PG-Video-LLaVA using our improved benchmarks and show notable improvements over existing video conversational models such as Video-ChatGPT [22] and Video-LLaMA [45] in ungrounded dialogues, achieving state-of-the-art (SoTA) performance.
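To make the reproducibility argument concrete, the following is a minimal sketch of an evaluation loop in which the judge is a pluggable function, so a locally hosted open-source model can stand in for a proprietary API. Everything here is an assumption for illustration: the prompt wording, the 0 to 5 score range, and the names `PROMPT_TEMPLATE`, `parse_score`, `evaluate`, and `stub_judge` are hypothetical, and no real model is called.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical judge prompt; the wording and the 0-5 scale are assumptions.
PROMPT_TEMPLATE = (
    "Rate how well the prediction matches the reference answer.\n"
    "Question: {question}\n"
    "Reference: {reference}\n"
    "Prediction: {prediction}\n"
    "Reply with 'Score: <integer from 0 to 5>'."
)


def parse_score(reply: str, lo: int = 0, hi: int = 5) -> Optional[int]:
    """Extract an integer score from the judge reply; None if absent or out of range."""
    match = re.search(r"score\s*:\s*(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None


def evaluate(
    samples: List[Dict[str, str]], judge: Callable[[str], str]
) -> Optional[float]:
    """Average the judge's scores over all samples, skipping unparseable replies."""
    scores = []
    for sample in samples:
        reply = judge(PROMPT_TEMPLATE.format(**sample))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None


def stub_judge(prompt: str) -> str:
    """Deterministic stand-in for an open-source LLM judge (not a real model)."""
    reference = re.search(r"Reference: (.*)", prompt).group(1)
    prediction = re.search(r"Prediction: (.*)", prompt).group(1)
    return "Score: 5" if reference.lower() in prediction.lower() else "Score: 1"


samples = [
    {"question": "What is the man holding?", "reference": "a guitar",
     "prediction": "He is holding a guitar."},
    {"question": "Where is the scene set?", "reference": "a kitchen",
     "prediction": "It looks like a beach."},
]
mean_score = evaluate(samples, stub_judge)
```

Because `judge` is just a function from prompt to reply, the same loop can be rerun against a fixed, versioned open-source checkpoint, which is the property that a frequently updated closed-source judge cannot guarantee.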
The key contributions of this work are:
• We propose PG-Video-LLaVA, the first video-based LMM with pixel-level grounding capabilities, featuring a modular design for enhanced flexibility.
• By incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and well suited to scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).
• We introduce improved quantitative benchmarks for video-based conversational models. Our benchmarks use the open-source Vicuna LLM to ensure better reproducibility and transparency. We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models.
This paper is available on arxiv under the CC BY 4.0 DEED license.