Meet Yambda: One of the world’s largest open datasets for RecSys

Recommender algorithms help people discover the right products, movies, music, and more. They power many services, from online stores to streaming platforms. Progress in these algorithms depends directly on research, which in turn depends on high-quality, large-scale datasets.

However, most open-source datasets are either small or outdated, because companies that accumulate terabytes of data rarely publish them due to privacy concerns. Today, we are releasing Yambda, one of the world's largest recommender datasets.

The dataset contains 4.79 billion user interactions collected over 10 months of user activity.

We chose the Music service because it is the largest subscription-based streaming service in Russia, with an average monthly audience of 28 million users.

A large part of the dataset consists of aggregated listens, likes, and dislikes, along with track attributes drawn from the personalized recommendation system My Vibe. All user and track data is anonymized: the dataset contains only numeric identifiers, which protects user privacy.

Releasing large open datasets like Yambda helps address several problems. Access to high-quality, large-scale data opens new directions for scientific research and attracts young researchers who want to apply machine learning to real-world problems.

I'm Alexander Ploshkin, and I lead personalization quality development at Yandex.

In this article, I'll explain what the dataset contains, how we collected it, and how you can use it to evaluate new recommender algorithms. Let's get started!

Why are large-scale open datasets important?

Recommender systems have seen a real renaissance in recent years.
Tech companies are increasingly adopting transformer-based models, inspired by the success of large language models (LLMs) in other domains.

What computer vision and natural language processing have taught us is that data volume is crucial to how well these methods work: transformers are not very effective on small datasets, but they become essential once they are scaled to billions of tokens.

Truly large open datasets are rare in the recommender systems domain.

Well-known datasets such as LFM-1B, LFM-2B, and the Music Listening Histories Dataset (27B) have become unavailable over time.

Currently, the record for the number of user interactions is held by Criteo's advertising dataset, with about 4 billion events. This creates a problem for researchers: most do not have access to web-scale services, which means they cannot test algorithms under conditions that resemble real-world deployments.

Popular datasets such as MovieLens, Steam, or the Netflix Prize contain, at best, tens of millions of interactions and mostly focus on explicit feedback, such as ratings and reviews.

Meanwhile, production recommender systems work with far more diverse and nuanced signals: clicks, likes, listens, views, purchases, and so on.

There is another important problem: the lack of temporal dynamics. Many datasets do not allow a fair chronological split between training and test sets, which is crucial for evaluating algorithms that aim to predict the future rather than just explain the past.

To address these challenges and support the development of new algorithms for recommender systems, we are releasing Yambda.

The dataset is one of the largest open resources of user interactions in the recommendation domain.

What's inside Yambda?
The dataset includes interactions of 1 million users with more than 9 million music tracks from the Music service, for a total of 4.79 billion events.

First, to be clear: all events are anonymized. The dataset uses only numeric identifiers for users, tracks, albums, and artists. This preserves privacy and protects user data.

The dataset includes key implicit and explicit user actions:

Listen: The user listened to a music track.
Like: The user liked a track ("thumbs up").
Unlike: The user removed a like.
Dislike: The user disliked a track ("thumbs down").
Undislike: The user removed a dislike.

To make the data more accessible, we also provide smaller samples containing 480 million and 48 million events.

Summary statistics for these subsets are provided in the table below:

The data is stored in Apache Parquet format, which is natively supported by Python data analysis libraries such as Pandas and Polars.

For ease of use, the dataset is available in two formats:

Flat: Each row represents a single interaction between a user and a track.
Sequential: Each row contains the complete interaction history of a single user.

The dataset structure is as follows:

A key feature of Yambda is the is_organic flag, which accompanies every event. This flag helps distinguish user actions that happened organically from those prompted by recommendations.

If is_organic = 0, the event was triggered by a recommendation, for example in a personalized music stream or a recommended playlist. All other events are considered organic, meaning the user found the content on their own.

The table below provides statistics on recommendation-driven events:

User interaction history is key to building personalized recommendations.
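The flat and sequential layouts and the is_organic flag described above can be illustrated with a minimal sketch. It uses plain Python and toy rows to stay dependency-free; with the real Parquet files, the rows would presumably be loaded through a library such as Pandas or Polars. Every field name other than is_organic (user_id, item_id, timestamp) and both helper functions are assumptions made for illustration, not the dataset's confirmed schema or API.

```python
from collections import defaultdict

# Toy rows in the flat format: one row per user-track interaction.
# Field names other than is_organic are hypothetical.
flat_rows = [
    {"user_id": 1, "item_id": 10, "timestamp": 5, "is_organic": 1},
    {"user_id": 1, "item_id": 11, "timestamp": 9, "is_organic": 0},
    {"user_id": 2, "item_id": 10, "timestamp": 2, "is_organic": 1},
    {"user_id": 2, "item_id": 13, "timestamp": 7, "is_organic": 0},
    {"user_id": 2, "item_id": 12, "timestamp": 4, "is_organic": 1},
    {"user_id": 3, "item_id": 11, "timestamp": 1, "is_organic": 0},
]


def share_recommended(rows):
    """Fraction of events triggered by a recommendation (is_organic == 0)."""
    return sum(1 for r in rows if r["is_organic"] == 0) / len(rows)


def to_sequential(rows):
    """Group flat rows into one time-ordered track history per user."""
    histories = defaultdict(list)
    for r in sorted(rows, key=lambda r: (r["user_id"], r["timestamp"])):
        histories[r["user_id"]].append(r["item_id"])
    return dict(histories)


recommended_share = share_recommended(flat_rows)
sequential = to_sequential(flat_rows)
```

Here the flat rows are the unit of storage, and the sequential view is derived by sorting each user's events by time. Keeping is_organic on every event makes it possible to study organic and recommendation-driven behavior separately.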
User interaction history captures both long-term preferences and momentary interests that can change with context.

To help you understand the structure of the data, here are some quick statistics on our dataset:

The charts show that the length of user histories follows a heavy-tailed distribution. In other words, while most users have relatively few interactions, a small but significant group has very long interaction histories.

This is especially important to account for when building recommendation models, to avoid overfitting to highly active users while maintaining quality for the many less active users.

By contrast, the distribution across tracks tells a very different story.

The chart clearly shows the imbalance between a few highly popular tracks and a large volume of niche content: more than 90% of tracks received fewer than 100 plays over the entire data collection period.

Nevertheless, recommender systems must work with the entire catalog in order to surface even low-popularity tracks that match an individual user's preferences.

Using Yambda to evaluate algorithmic performance

Recommender algorithm quality is commonly evaluated with the leave-one-out (LOO) scheme, in which a single user action is held out for testing and the rest are used for training.

However, this method has two serious drawbacks:

Temporal inconsistency: Test events may include actions that happened before those in the training set.
Equal weighting of users: Inactive users influence the evaluation metrics just as much as active ones.

To bring evaluation conditions closer to real-world recommender system scenarios, we propose an alternative:
a global temporal split.

This simple method picks a point in time T and excludes all later events from the training set. As a result, the model is trained on historical data and evaluated on future data, as it would be in production.

For our evaluation, we reserved one day of data as the holdout set, for two main reasons:

Even a single day of data provides enough volume to assess algorithm performance reliably.
Production models have different characteristics: some require frequent statistics updates (for example, popularity-based recommendations), others are periodically retrained or fine-tuned (boosting, matrix factorization, two-tower models), and others depend on continuously updated user interaction histories (recurrent and transformer-based models).

In our view, a one-day window is the optimal evaluation period: models stay static over it while still capturing short-term trends.

The drawback of this approach is that it does not capture longer-term patterns, such as weekly shifts in music listening behavior. We suggest leaving these aspects to future research.

Baselines

We evaluated several popular recommender algorithms on Yambda to establish baselines for future research and comparison.

The algorithms we tested include MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec.

For evaluation, we used the following metrics:

NDCG@k (Normalized Discounted Cumulative Gain), which measures the ranking quality of the recommendations.
Recall@k, which measures the algorithm's ability to retrieve relevant recommendations from the total pool.
Coverage@k, which indicates how broadly the catalog is represented in the recommendations.

The results are provided in tables, and the code is available on Hugging Face.
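To make the evaluation protocol above concrete, here is a minimal sketch of a global temporal split together with common textbook forms of the three metrics. It is not the benchmark's actual evaluation code: the function names, the (user_id, item_id, timestamp) tuple layout, the choice to place events at exactly T in the holdout, and the binary-relevance definitions are all assumptions, and the published implementation may define details differently.

```python
import math


def global_temporal_split(events, t_split):
    """Train on events before t_split; hold out events at or after it.

    events: iterable of (user_id, item_id, timestamp) tuples (assumed layout).
    """
    train = [e for e in events if e[2] < t_split]
    test = [e for e in events if e[2] >= t_split]
    return train, test


def recall_at_k(recommended, relevant, k):
    """Share of a user's relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


def coverage_at_k(recommendations, catalog_size, k):
    """Fraction of the catalog that appears in at least one user's top-k list."""
    shown = {item for recs in recommendations.values() for item in recs[:k]}
    return len(shown) / catalog_size


# Toy usage: events before T=5 form the training set, the rest the holdout.
events = [(1, 10, 1), (1, 11, 2), (2, 10, 3), (1, 12, 8), (2, 13, 9)]
train, test = global_temporal_split(events, t_split=5)
```

Unlike leave-one-out, the split point is shared by all users, so no training event can come from after the holdout period, and users with more holdout activity naturally contribute more test events.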
Conclusion

Yambda can be valuable for research on recommender algorithms with large-scale data, where both performance and the ability to model behavioral dynamics matter.

The dataset is available in three versions: the full set with 5 billion events, and smaller subsets with 500 million and 50 million events. Developers and researchers can choose the version that best fits their projects and computing resources.

Both the dataset and the evaluation code are available on Hugging Face.

We hope this dataset proves useful in your experiments and research!

Thanks for reading!