Authors:
(1) Tony Lee, Stanford, with equal contribution;
(2) Michihiro Yasunaga, Stanford, with equal contribution;
(3) Chenlin Meng, Stanford, with equal contribution;
(4) Yifan Mai, Stanford;
(5) Joon Sung Park, Stanford;
(6) Agrim Gupta, Stanford;
(7) Yunzhi Zhang, Stanford;
(8) Deepak Narayanan, Microsoft;
(9) Hannah Benita Teufel, Aleph Alpha;
(10) Marco Bellagente, Aleph Alpha;
(11) Minguk Kang, POSTECH;
(12) Taesung Park, Adobe;
(13) Jure Leskovec, Stanford;
(14) Jun-Yan Zhu, CMU;
(15) Li Fei-Fei, Stanford;
(16) Jiajun Wu, Stanford;
(17) Stefano Ermon, Stanford;
(18) Percy Liang, Stanford.
Abstract and 1 Introduction
Author contributions, Acknowledgments and References
We evaluated 26 text-to-image models (§6) across the 12 aspects (§3), using 62 scenarios (§4) and 25 metrics (§5). All results are available at https://crfm.stanford.edu/heim/v1.1.0. We also provide a summary of the results in Table 5. Below, we describe the key findings. The win rate of a model is the probability that the model outperforms another model, selected uniformly at random, on a given metric in a head-to-head comparison.
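The win-rate definition above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own implementation: the function name `win_rates`, the tie-handling rule (a tie counts as half a win), and the example scores are all assumptions introduced here.

```python
from typing import Dict


def win_rates(scores: Dict[str, float], higher_is_better: bool = True) -> Dict[str, float]:
    """For each model, return the fraction of the other models it beats on one metric.

    This equals the probability of beating an opponent drawn uniformly at random
    from the remaining models. Ties count as half a win (an assumption).
    """
    names = list(scores)
    if len(names) < 2:
        raise ValueError("need at least two models to compare")
    rates = {}
    for a in names:
        wins = 0.0
        for b in names:
            if a == b:
                continue
            if scores[a] == scores[b]:
                wins += 0.5  # assumption: a tie is half a win
            elif (scores[a] > scores[b]) == higher_is_better:
                wins += 1.0
        rates[a] = wins / (len(names) - 1)
    return rates


# Hypothetical scores for illustration only; not taken from the reported results.
example_scores = {"model_a": 4.1, "model_b": 3.5, "model_c": 3.5, "model_d": 2.9}
example_rates = win_rates(example_scores)
print(example_rates)
```

For metrics where a lower value is better, passing `higher_is_better=False` reverses the comparison.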
Text-image alignment. DALL-E 2 achieves the highest human-rated alignment score among all the models.[1] It is closely followed by models fine-tuned on high-quality, realistic images, such as Dreamlike Photoreal 2.0 and Vintedois Diffusion. On the other hand, models fine-tuned on art images (Openjourney v4, Redshift Diffusion) and models incorporating safety guidance (SafeStableDiffusion) show slightly lower performance in text-image alignment.
Photorealism. In general, none of the models' samples were deemed photorealistic: human annotators rated real images from MS-COCO with an average score of 4.48 out of 5 for photorealism, while no model achieved a score higher than 3.[2] DALL-E 2 and models fine-tuned on photographs, such as Dreamlike Photoreal 2.0, obtained the highest human-rated photorealism scores among the available models. Models fine-tuned on art images, such as Openjourney, tended to score lower.
Aesthetics. According to the automated metrics (LAION-Aesthetics and fractal coefficient), fine-tuning models on high-quality images and art results in more visually appealing generations, with Dreamlike Photoreal 2.0, Dreamlike Diffusion 1.0, and Openjourney achieving the highest win rates.[3] Promptist, which applies prompt engineering to the text input so that the generated images are aesthetically pleasing according to human preferences, achieves the highest win rate in the human evaluation, followed by Dreamlike Photoreal 2.0 and DALL-E 2.
Originality. The unintentional generation of watermarked images is a concern because of the risk of trademark and copyright infringement. We rely on the LAION watermark detector to check generated images for watermarks. Trained on a set of images from which watermarked images were removed, GigaGAN has the highest win rate, virtually never generating watermarks in its images.[4] On the other hand, CogView2 exhibits the highest frequency of watermark generation. Openjourney (86%) and Dreamlike Diffusion 1.0 (82%) achieve the highest win rates for human-rated originality.[5]
Reasoning. Reasoning refers to whether the models understand objects, counts, and spatial relations. All models perform poorly at reasoning: the best model, DALL-E 2, achieves an overall object detection accuracy of only 47.2% on the PaintSkills scenario.[6] The models often make mistakes in counting objects (e.g., generating 2 instead of 3) and in spatial relations (e.g., placing an object above instead of below). On the human-rated alignment metric, DALL-E 2 outperforms the other models but still receives an average score below 4 for Relational Understanding and the reasoning sub-scenarios of DrawBench. The next best model, DeepFloyd-IF XL, does not score higher than 4 on any of the reasoning scenarios, indicating room for improvement for text-to-image generation models on reasoning tasks.
Knowledge. Dreamlike Photoreal 2.0 and DALL-E 2 exhibit the highest win rates in the knowledge-intensive scenarios, suggesting that they possess more knowledge about the world than other models.[7] Their superiority may be attributable to fine-tuning on photographs of real-world entities.
Bias. In terms of gender bias, minDALL-E, DALL-E mini, and SafeStableDiffusion exhibit the least bias, while Dreamlike Diffusion, DALL-E 2, and Redshift Diffusion show higher levels of bias.[8] The mitigation of gender bias in SafeStableDiffusion is intriguing and may be due to its safety guidance mechanism suppressing sexual content. Regarding skin tone bias, Openjourney v2, CogView2, and GigaGAN show the least bias, whereas Dreamlike Diffusion and Redshift Diffusion exhibit more. Overall, minDALL-E consistently shows the least bias, while models fine-tuned on art images, such as Dreamlike and Redshift, tend to exhibit more bias.
Toxicity. While most models generate inappropriate images at a low frequency, certain models do so more often on the I2P scenario.[9] For example, OpenJourney, the weaker variants of SafeStableDiffusion, Stable Diffusion, Promptist, and Vintedois Diffusion generate inappropriate images for non-toxic text prompts in 10% of cases. The stronger variants of SafeStableDiffusion, which enforce safety guidance more strongly, generate fewer inappropriate images than Stable Diffusion but still produce some. In contrast, models such as minDALL-E, DALL-E mini, and GigaGAN exhibit the lowest frequency, less than 1%.
Fairness. Around half of the models exhibit performance drops in the human-rated alignment metrics when subjected to gender and dialect perturbations.[10] Certain models incur larger drops, such as a 0.25 drop (on a scale of 5) in human-rated alignment for Openjourney under dialect perturbation. In contrast, DALL-E mini showed the smallest performance gap in both scenarios. Overall, models fine-tuned on custom data displayed greater sensitivity to demographic perturbations.
Robustness. As with fairness, about half of the models showed performance drops in the human-rated alignment metrics when typos were introduced.[11] These drops were generally minor, with the alignment score decreasing by no more than 0.2 (on a scale of 5), indicating that these models are robust against prompt perturbations.
Multilinguality. Translating the MS-COCO prompts into Hindi, Chinese, and Spanish decreased text-image alignment for the vast majority of models.[12] A notable exception is CogView 2 for Chinese, which is known to perform better with Chinese prompts than with English prompts. DALL-E 2, the top model for human-rated text-image alignment (4.438 out of 5), maintains reasonable alignment with only a slight drop in performance for Chinese (-0.536) and Spanish (-0.162) prompts but struggles with Hindi prompts (-2.640). In general, the list of supported languages is not well documented for existing models, which motivates future practices to address this.
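To make the multilingual figures concrete, the short sketch below derives the implied per-language alignment scores for DALL-E 2. It assumes that each parenthesized value is the change relative to the English score of 4.438; the text does not spell this out, and the variable names are illustrative only.

```python
# Values quoted in the text for DALL-E 2 (human-rated alignment, scale of 5).
english_score = 4.438
reported_changes = {"Chinese": -0.536, "Spanish": -0.162, "Hindi": -2.640}

# Assumption: each reported value is the difference from the English score.
translated_scores = {
    language: round(english_score + change, 3)
    for language, change in reported_changes.items()
}

for language, score in translated_scores.items():
    print(f"{language}: {score:.3f} out of 5")
```

Under that assumption, the Chinese and Spanish scores stay close to the English one, while the Hindi score falls below 2 on the 5-point scale.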
Efficiency. Among the diffusion models, vanilla Stable Diffusion has a denoised runtime of 2 seconds.[13] Methods with additional operations, such as prompt engineering in Promptist and safety guidance in SafeStableDiffusion, as well as models that generate higher resolutions, such as Dreamlike Photoreal 2.0, are slightly slower. Autoregressive models, such as minDALL-E, are approximately 2 seconds slower than diffusion models with a similar parameter count. GigaGAN takes only 0.14 seconds, because GAN-based models perform single-step inference.
Overall trends in aspects. Among the current models, certain aspects are positively correlated, such as general alignment and reasoning, as well as aesthetics and originality. On the other hand, some aspects show trade-offs: models that excel in aesthetics (e.g., Openjourney) tend to score lower in photorealism, and models that exhibit less bias and toxicity (e.g., minDALL-E) may not perform best in text-image alignment and photorealism. Overall, several aspects deserve attention. First, almost all models exhibit subpar performance in reasoning, photorealism, and multilinguality, highlighting the need for future improvements in these areas. In addition, aspects such as originality (watermarks), toxicity, and bias carry significant ethical and legal implications, yet current models are still imperfect, and further research is needed to address these concerns.
Prompt engineering. Models that use prompt engineering techniques produce more visually appealing images. Promptist + Stable Diffusion v1-4 outperforms Stable Diffusion on the human-rated aesthetics score while achieving a comparable text-image alignment score.[14]
Art styles. According to human raters, Openjourney (fine-tuned on artistic images generated by Midjourney) creates the most aesthetically pleasing images across the various art styles.[15] It is followed by Dreamlike Photoreal 2.0 and DALL-E 2. DALL-E 2 achieves the highest human-rated alignment score. Dreamlike Photoreal 2.0 (Stable Diffusion fine-tuned on high-resolution photographs) demonstrates superior human-rated subject clarity.
Correlation between human and automated metrics. The correlation coefficients between the human-rated and automated metrics are 0.42 for alignment (CLIPScore vs. human-rated alignment), 0.59 for image quality (FID vs. human-rated photorealism), and 0.39 for aesthetics (LAION aesthetics vs. human-rated aesthetics).[16] The overall correlation is weak, particularly for aesthetics. These findings emphasize the importance of using human ratings to evaluate image generation models in future research.
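As a rough illustration of how such a coefficient could be obtained, the sketch below computes a Pearson correlation between per-model automated scores and per-model human ratings. The text does not say which correlation coefficient was used or how a lower-is-better metric such as FID is signed, so the choice of Pearson, the function name `pearson`, and the sample numbers are all assumptions.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical per-model scores for illustration only; not the reported data.
automated_scores = [0.20, 0.25, 0.30, 0.35]  # e.g. an automated alignment metric
human_ratings = [2.0, 3.0, 2.5, 4.0]         # e.g. mean human rating on a 1-5 scale
coefficient = pearson(automated_scores, human_ratings)
print(f"correlation: {coefficient:.2f}")
```

A value near 1 would mean the automated metric ranks models much as human raters do; the values reported above, between 0.39 and 0.59, indicate only partial agreement.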
Diffusion vs. autoregressive models. Among the open autoregressive and diffusion models, autoregressive models require a larger model size to achieve performance comparable to diffusion models across most metrics. Nevertheless, autoregressive models show promising performance in some aspects, such as reasoning. Diffusion models are more efficient than autoregressive models when controlling for parameter count.
Model scales. Multiple models with varying parameter counts are available within the autoregressive DALL-E model family (0.4B, 1.3B, 2.6B) and the diffusion DeepFloyd-IF family (0.4B, 0.9B, 4.3B). Larger models tend to outperform smaller ones on all human metrics, including alignment, photorealism, subject clarity, and aesthetics.[17]
What are the best models? Overall, DALL-E 2 appears to be a versatile performer across the human metrics. However, no single model emerges as the top performer in all aspects. Different models show different strengths. For example, Dreamlike Photoreal excels in photorealism, while Openjourney excels in aesthetics. For the societal aspects, models such as minDALL-E, CogView2, and SafeStableDiffusion perform well in toxicity and bias mitigation. For multilinguality, GigaGAN and the DeepFloyd-IF models seem to handle Hindi prompts, which DALL-E 2 struggles with. These observations open new research avenues to study whether and how to develop models that excel across multiple aspects.
This paper is available on arXiv under the CC BY 4.0 DEED license.
[1] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_alignment_scenarios
[2] https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_base
[3] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_aesthetics_scenarios
[4] https://crfm.stanford.edu/heim/v1.1.0/?group=core_scenarios
[5] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_originality_scenarios
[6] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_reasoning_scenarios
[7] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_knowledge_scenarios
[8] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_bias_scenarios
[9] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_toxicity_scenarios
[10] https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_gender, https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_dialect
[11] https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_robustness
[12] https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_chinese, https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_hindi, https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_spanish
[13] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_efficiency_scenarios
[14] https://crfm.stanford.edu/heim/v1.1.0/?group=heim_quality_scenarios
[15] https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_art_styles
[16] https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_fid, https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_base
[17] https://crfm.stanford.edu/heim/v1.1.0/?group=mscoco_base