Authors:
(1) Martyna Wiącek, Institute of Computer Science, Polish Academy of Sciences;
(2) Piotr Rybak, Institute of Computer Science, Polish Academy of Sciences;
(3) Łukasz Pszenny, Institute of Computer Science, Polish Academy of Sciences;
(4) Alina Wróblewska, Institute of Computer Science, Polish Academy of Sciences.
Editor's note: This is Part 1 of 10 of a study on improving the evaluation and comparison of tools used in natural language preprocessing. Read the rest below.
Abstract and 1. Introduction and related works
2.2. Online benchmarking system
With the advancement of transformer-based architectures, we observe the rise of natural language preprocessing (NLPre) tools capable of solving preliminary NLP tasks (e.g. part-of-speech tagging, dependency parsing, or morphological analysis) without any linguistic guidance. It is difficult to compare novel solutions with well-entrenched preprocessing toolkits that rely on rule-based morphological analysers or dictionaries. Aware of the shortcomings of existing NLPre evaluation approaches, we investigate a novel method of reliable and fair evaluation and performance reporting. Inspired by the GLUE benchmark, the proposed language-centric benchmarking system enables comprehensive ongoing evaluation of multiple NLPre tools, while credibly tracking their performance. The prototype application is configured for Polish and integrated with the thoroughly assembled NLPre-PL benchmark. Based on this benchmark, we conduct an extensive evaluation of a variety of Polish NLPre systems. To facilitate the construction of benchmarking environments for other languages, e.g. NLPre-GA for Irish or NLPre-ZH for Chinese, we ensure full customization of the publicly released source code of the benchmarking system. Links to all the resources (deployed platforms, source code, trained models, datasets, etc.) can be found on the project website: https://sites.google.com/view/nlpre-benchmark.
Keywords: benchmarking, leaderboard, segmentation, POS tagging, dependency parsing, Polish
Morphosyntactic features predicted by part-of-speech (POS) taggers and dependency parsers underlie various downstream tasks, including but not limited to sentiment analysis (Sun et al., 2019), relation extraction (Zhang et al., 2018; Vashishth et al., 2018; Guo et al., 2019), semantic role labelling (Wang et al., 2019; Kasai et al., 2019), question answering (Khashabi et al., 2018), or machine translation (Chen et al., 2017; Zhang et al., 2019). These underlying tasks may therefore be referred to as natural language preprocessing (NLPre) tasks, as they precede the advanced NLP tasks. Since the quality of morphosyntactic predictions has a crucial impact on the performance of downstream tasks (Sachan et al., 2021), it is prudent to employ the best existing NLPre tools to predict the proper linguistic features. We have various NLPre methods at our disposal, ranging from rule-based tools with hand-crafted grammars (e.g. Crouch et al., 2011), through statistical systems (e.g. Nivre, 2009; McDonald et al., 2005; Straka et al., 2016) and neural systems supported by pre-trained language models (e.g. Qi et al., 2020; Nguyen et al., 2021a), to large language models (LLMs; Ouyang et al., 2022).
In the context of intrinsically evaluating NLPre tools and reporting their performance, a variety of approaches have been proposed, e.g. the shared task, the performance table, and the progress repository. The main goal of a shared task is to comprehensively evaluate participating systems on the released datasets using a carefully defined evaluation methodology. Numerous NLPre shared tasks have been organised so far (e.g. Buchholz and Marsi, 2006; Seddah et al., 2013; Zeman et al., 2017, 2018), and they have undoubtedly boosted the development of NLPre. While widely favoured, shared tasks are questionable as a complete and up-to-date source of knowledge about NLPre progress. First, they scrutinise only the solutions submitted in the current contest and do not include systems that participated in previous editions or may participate in future ones. Second, as shared tasks are organised sporadically, their results are not revised and may quickly become outdated. Certainly, the datasets released for shared tasks can be reused in experiments involving novel tools, and the results of such experiments can be reported in independent scientific publications. Nonetheless, these publications are widely scattered and lack a centralised platform for systematically tracking the ongoing NLPre progress with respect to a particular language.
The results of a new or upgraded NLPre tool are typically reported in performance tables (e.g. Stanza[1] or Trankit[2]). Such tables provide information about the quality of the tool in preprocessing a set of languages. The performance tables, however, rarely include comparisons with other systems trained for these languages. Additionally, as NLPre systems may be trained on different dataset releases (e.g. of Universal Dependencies), comparing their performance tables is not conclusive.
Information about trends and progress in NLP research is usually collected in public repositories such as Papers with Code[3] or NLP-progress[4]. These repositories contain a repertoire of datasets for common NLP tasks, e.g. dependency parsing and POS tagging, and rankings of models trained and tested on these datasets. They are open to contributions of new datasets and results, which, to ensure their credibility, must originate from published and linked scientific papers. As a consequence, cutting-edge yet unpublished results of a new or upgraded NLPre system are not eligible for reporting. The datasets accompanying NLPre tasks are mostly in English, which raises the problem of poor language representation in the repositories. Finally, the Papers with Code repository is prone to abuse. After logging in, one can add new results and link them with irrelevant papers, as well as edit existing results. Falsified results are published immediately.
Despite providing valuable information about the progress in NLPre, the evaluation approaches mentioned above also reveal shortcomings, e.g. outdated and incomplete results, a lack of cross-system comparison, disregard for some systems, the risk of result manipulation, and the absence of a language-centric perspective.
Following standard procedures in NLP research, we propose to robustly and fairly evaluate NLPre tools using the benchmarking method, which allows the performance and progress of NLP models to be assessed. NLP benchmarks are coupled with leaderboards that report and update model performance on the benchmark tasks, e.g. GLUE (Wang et al., 2018), XTREME (Hu et al., 2020), GEM (Gehrmann et al., 2021). The conventional benchmarking approach may be dynamically enhanced, as exemplified by the Dynabench platform (Kiela et al., 2021), which enables users to augment the benchmark data by inputting custom examples. This human-and-model-in-the-loop benchmarking scenario appears promising for NLU tasks. Nonetheless, it may not be effective in the case of NLPre, since annotating credible examples of syntactic trees or morphological features requires expert knowledge. Finding multiple experts among casual users can be a serious obstacle, so we implement our system in line with the standard benchmarking method.
To the best of our knowledge, benchmarking has not been used to rank NLPre systems, even though it is valuable and desired by the community creating treebanks or designing advanced NLP pipelines. Our NLPre benchmarking approach fills this gap. The proposed online benchmarking system automatically assesses submitted predictions of NLPre systems and publishes their performance ranking on a public scoreboard (see Section 2.2). The system is language-centric and tagset-agnostic, enables comprehensive and credible evaluation, and constitutes an up-to-date source of information on NLPre progress for a particular language. Unlike similar platforms, e.g. Codalab (Pavao et al., 2022), the NLPre benchmarking system is fully configurable and easy to set up, allowing users to establish an evaluation environment for any language. Additionally, it can be self-hosted, which makes it convenient for developers and researchers working with a particular language to run it on a local server.
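The idea of scoring submitted predictions against a gold standard and publishing a ranking can be illustrated with a minimal sketch. It is not the benchmarking system's actual code: the names `Submission`, `tagging_accuracy` and `build_leaderboard`, as well as the toy tag sequences, are hypothetical, and plain token-level accuracy is used only as a simplified stand-in for whatever metrics the real system reports.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical container for one system's predicted POS tags."""
    system: str
    predicted_tags: list[str]


def tagging_accuracy(gold: list[str], predicted: list[str]) -> float:
    """Share of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def build_leaderboard(
    gold: list[str], submissions: list[Submission]
) -> list[tuple[str, float]]:
    """Score every submission and sort from best to worst."""
    scored = [(s.system, tagging_accuracy(gold, s.predicted_tags)) for s in submissions]
    return sorted(scored, key=lambda row: row[1], reverse=True)


# Toy data, invented purely for illustration.
gold_tags = ["NOUN", "VERB", "DET", "NOUN"]
submissions = [
    Submission("system_a", ["NOUN", "VERB", "DET", "ADJ"]),
    Submission("system_b", ["NOUN", "VERB", "DET", "NOUN"]),
]

leaderboard = build_leaderboard(gold_tags, submissions)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {score:.2f}")
```

Because the scoring function only compares tag strings for equality, this sketch makes no assumption about which tagset is in use, which loosely mirrors the tagset-agnostic property described above.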
To justify the use of the benchmarking technique for NLPre tasks, we conduct empirical research in a challenging scenario with Polish as an example language. In the case of Polish, one major obstacle arises: the discrepancies between the different tagsets, annotation schemes and datasets used for training disparate systems preclude their direct comparison. We thus standardise the training and evaluation of NLPre systems on a new performance benchmark for Polish, hereafter NLPre-PL (see Section 3). It consists of a predefined set of NLPre tasks and reformulated versions of existing Polish datasets. Section 4 outlines our robust and reliable evaluation of the selected NLPre systems on the NLPre-PL benchmark. To our knowledge, no evaluation experiments have been conducted for Polish to compare the performance of off-the-shelf LLMs, neural NLPre systems and established tagging disambiguators, owing to the lack of a coherent evaluation environment.
This work makes a threefold contribution encompassing novelty, research, and development, underpinned by an open-source ethos. (1) We propose a novel language-oriented benchmarking approach to evaluate and rank NLPre systems. (2) We conduct a scientific evaluation of the proposed approach in the non-trivial Polish language scenario on the assembled NLPre-PL benchmark. (3) We publish online benchmarking platforms for three distinct languages: Polish[5], Chinese[6], and Irish[7], and release the source code of the benchmarking system as open source.
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.
[1] https://stanfordnlp.github.io/stanza/performance.html (UD v2.8)
[2] https://trankit.readthedocs.io/en/latest/performance.html#universal-dependencies-v2-5 (UD v2.5)
[3] https://paperwithcode.com
[4] http://nlpprogress.com
[5] https://nlpre-pl.clarin-pl.eu
[6] https://nlpre-zh.clarin-pl.eu
[7] https://nlpre-ga.clarin-pl.eu