Authors:
(1) Shehan Munasinghe, Mohamed bin Zayed University of AI and Equal Contribution;
(2) Rusiru Thushara, Mohamed bin Zayed University of AI and Equal Contribution;
(3) Muhammad Maaz, Mohamed bin Zayed University of AI;
(4) Hanoona Abdul Rasheed, Mohamed bin Zayed University of AI;
(5) Salman Khan, Mohamed bin Zayed University of AI and Australian National University;
(6) Mubarak Shah, University of Central Florida;
(7) Fahad Khan, Mohamed bin Zayed University of AI and Linköping University.
Editor's Note: This is Part 1 of 10 of a study detailing the development of a smarter AI model for videos. Read the rest below.
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not use audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as used in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks.
Recent efforts on Large Multimodal Models (LMMs), spearheaded by GPT-4V [25], allow detailed conversations about images but generally do not scale well to videos. The magnitude of video data scales far beyond other modalities due to its massive volume on social and internet media. Furthermore, extending LMMs to videos is challenging because of their complex dynamics, with long temporal context that needs to be understood accurately. Although recent video-LMM approaches such as VideoChat [15], Video-LLaMA [45], and Video-ChatGPT [22] have demonstrated capabilities in video comprehension and dialogue, they lack the crucial feature of visual grounding. Visual grounding in videos aims to associate the LMM responses with specific objects within the video input. Addressing this gap, we introduce PG-Video-LLaVA, the first video-LMM capable of localizing objects that appear in LMM responses. This task leads to enhanced interactivity and demonstrates a deep understanding of video content.
In PG-Video-LLaVA, we address the unique challenges posed by video data. The model is designed to track objects within shorter video clips that maintain consistent camera views, enabling accurate visual grounding across scenes and motions. This tracking links spatio-temporal segments directly to conversational elements, enhancing the model's contextual understanding. A salient feature of PG-Video-LLaVA is its modular design, which allows easy integration with existing grounding modules and the flexibility to adapt to future enhancements in visual grounding technology. Moreover, PG-Video-LLaVA enriches its capabilities by incorporating audio context. It achieves this by leveraging the video's audio in a form understandable to the LLM, which is particularly useful in situations where auditory information is essential to the conversation. This inclusion broadens the model's understanding, making it more versatile in interpreting video content.
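The audio pathway described above, in which a video's audio is turned into text the LLM can read, can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the `TranscriptSegment` type, the `build_prompt` function, and the prompt template are hypothetical, and the transcript is assumed to come from some external speech-to-text step that is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptSegment:
    """One piece of transcribed speech (hypothetical structure)."""
    start_s: float  # start time in seconds
    end_s: float    # end time in seconds
    text: str


def build_prompt(question: str, segments: List[TranscriptSegment]) -> str:
    """Fold an audio transcript into the text prompt for the LLM.

    The template used here is an assumption for illustration only.
    """
    # Drop empty segments and put the rest in chronological order.
    spoken = sorted(
        (s for s in segments if s.text.strip()),
        key=lambda s: s.start_s,
    )
    if not spoken:
        # No usable audio: fall back to the plain question.
        return f"Question: {question}"
    transcript = " ".join(s.text.strip() for s in spoken)
    return f"Audio transcript: {transcript}\nQuestion: {question}"
```

Keeping the transcript as plain text means the language model needs no audio-specific input branch in this sketch; whether a real system filters, timestamps, or truncates the transcript is a design choice left open here.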
Furthermore, this work introduces an improved framework for benchmarking video-based conversational models, pivoting from previous approaches [22] that predominantly used the proprietary GPT-3.5-Turbo model for evaluation. Because GPT-3.5-Turbo is subject to change at any time and lacks transparency due to its closed-source nature, it presents challenges in terms of reliability and reproducibility. To address this, we propose the use of Vicuna, an open-source LLM, for benchmarking. This shift not only enhances reproducibility but also improves transparency in the evaluation process. We evaluate PG-Video-LLaVA using our improved benchmarks and show notable improvements over existing video conversational models such as Video-ChatGPT [22] and Video-LLaMA [45] in ungrounded dialogues, achieving state-of-the-art (SoTA) performance.
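To make the idea of LLM-based benchmarking concrete, here is a small sketch of an evaluation loop in which the judge model is a pluggable function, so that an open-source model such as Vicuna could stand in for a proprietary one. This is not the paper's evaluation code: the prompt wording, the 0 to 5 scale, and the names `judge_prompt`, `parse_score`, and `evaluate` are all assumptions made for illustration, and the call to the actual judge model is left to the caller.

```python
import re
from statistics import mean
from typing import Callable, Iterable, Optional, Tuple

# A judge is any function mapping a prompt string to the judge LLM's reply.
Judge = Callable[[str], str]


def judge_prompt(question: str, reference: str, prediction: str) -> str:
    """Hypothetical scoring prompt; the wording is an assumption."""
    return (
        "Rate how well the predicted answer matches the reference answer "
        "on a scale from 0 to 5. Reply in the form 'score: <integer>'.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Predicted answer: {prediction}"
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract an integer score in [0, 5] from the reply, or None."""
    match = re.search(r"score\s*:\s*(\d+)", reply, flags=re.IGNORECASE)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 5 else None


def evaluate(samples: Iterable[Tuple[str, str, str]], judge: Judge) -> float:
    """Average the judge's scores over (question, reference, prediction) triples.

    Replies that cannot be parsed are skipped rather than guessed.
    """
    scores = []
    for question, reference, prediction in samples:
        reply = judge(judge_prompt(question, reference, prediction))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    if not scores:
        raise ValueError("no parsable scores returned by the judge")
    return mean(scores)
```

Because the judge is just a callable, the same loop can be rerun against a fixed, locally hosted model checkpoint, which is the reproducibility property the text attributes to using an open-source judge.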
The key contributions of this work are:
• We propose PG-Video-LLaVA, the first video-based LMM with pixel-level grounding capabilities, featuring a modular design for enhanced flexibility.
• By incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and well suited to scenarios where the audio signal is crucial for video understanding (e.g., dialogues and conversations, news videos, etc.).
• We introduce improved quantitative benchmarks for video-based conversational models. Our benchmarks use the open-source Vicuna LLM to ensure better reproducibility and transparency. We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models.
This paper is available on arXiv under the CC BY 4.0 DEED license.