Meet Yambda: One of the world’s largest open datasets for RecSys

Most open-source recommendation datasets are either small or outdated, as companies that accumulate terabytes of data rarely make them publicly available due to privacy concerns.

Today, we’re releasing Yambda, one of the world’s largest recommendation datasets. This dataset features 4.79 billion anonymized user interactions, compiled from 10 months of user activity. The data comes from the Music service, which is among the largest such services in Russia, with an audience of 28 million users.

A significant portion of the dataset includes aggregated listens, likes, and dislikes, as well as track attributes sourced from the personalized recommendation system My Vibe. All user and track data is anonymized: the dataset contains only numeric identifiers, ensuring user privacy.

Yambda can be used to work with data at a range of scales, up to the full dataset. Access to high-quality, large-scale data opens new avenues for scientific research and engages young researchers keen to apply machine learning to real-world challenges.

I’m Alexander Ploshkin, and I lead personalization quality development at Yandex. In this article, I’ll explain what the dataset consists of, how we collected it, and how you can use it to evaluate new recommender algorithms. Let’s begin!

Why do large-scale open datasets matter?

Recommender systems have been experiencing a true renaissance in recent years. Tech companies are increasingly adopting transformer-based models, inspired by the success of large language models (LLMs) in other domains.
What computer vision and natural language processing have taught us is that data volume is crucial for these methods: transformers are not very effective on small datasets, but they come into their own once the data scales to billions of tokens.

Truly large-scale open datasets are a rarity in the recommender systems domain. Well-known datasets such as LFM-1B, LFM-2B, and the Music Listening Histories Dataset (27B) have become unavailable over time. Currently, the record for the number of user interactions is held by Criteo’s advertising dataset, with approximately 4 billion events. This creates a challenge for researchers: most don’t have access to web-scale services, meaning they can’t test algorithms under conditions that resemble real-world deployments.

Popular datasets such as MovieLens, Steam, or the Netflix Prize are, at best, far smaller, and they typically focus on explicit feedback such as ratings and reviews. Meanwhile, production recommender systems work with a much wider range of signals: clicks, likes, full listens, views, purchases, and so on.

There’s another critical issue: the lack of temporal dynamics. Many datasets don’t allow for an honest chronological split between training and test sets, which is crucial for evaluating algorithms that aim to predict the future, not just explain the past.

Yambda is our attempt to address these gaps.
What’s inside Yambda?

The dataset covers interactions between 1 million users and more than 9 million music tracks from the Music service, for a total of 4.79 billion events.

First, to be clear: all events are anonymized. The dataset uses only numeric identifiers for users, tracks, albums, and artists. This is to ensure privacy and protect user data.

The dataset includes key implicit and explicit user actions:

Listen: The user listened to a music track.
Like: The user liked a track (“thumbs up”).
Unlike: The user removed a like.
Dislike: The user disliked a track (“thumbs down”).
Undislike: The user removed a dislike.

To make the dataset more accessible, we’ve also released smaller samples containing 480 million and 48 million events. Summary statistics for these subsets are provided in the table below:

The data is stored in Apache Parquet format, which is supported by Python data analysis libraries such as Pandas and Polars. It is provided in two layouts:

Flat: Each row is a single interaction between a user and a track.
Sequential: Each row contains the complete interaction history of a single user.

A key feature of Yambda is the is_organic flag, which separates actions that users took on their own from actions prompted by recommendations.
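To make the two layouts concrete, here is a minimal sketch of how a flat table could be collapsed into a sequential one with Pandas. It runs on a small in-memory table instead of the real Parquet files. The column names user_id, track_id, and timestamp are hypothetical and chosen for illustration; only is_organic is named in this article, and the flag values in the toy rows are arbitrary.

```python
import pandas as pd

# Toy "flat" table: one row per user-track interaction.
# Column names other than is_organic are hypothetical.
flat = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 1, 2],
        "track_id": [10, 11, 10, 12, 13],
        "timestamp": [5, 1, 3, 9, 7],
        "is_organic": [1, 0, 1, 1, 0],
    }
)


def to_sequential(events: pd.DataFrame) -> pd.DataFrame:
    """Collapse a flat event table into one row per user, ordered by time."""
    ordered = events.sort_values(["user_id", "timestamp"])
    return (
        ordered.groupby("user_id")
        .agg({"track_id": list, "timestamp": list, "is_organic": list})
        .reset_index()
    )


sequential = to_sequential(flat)
print(sequential)
```

Each row of the result holds one user's full history as parallel lists, so a sequence model can consume it directly, while the flat layout is more convenient for filtering and aggregation.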
When is_organic = 0, the event was triggered by a recommendation shown to the user. All other events are considered organic, meaning the user discovered the content on their own. The table below provides statistics on recommendation-driven events:

A user’s interaction history is key to building personalized recommendations.

To help you better understand the data structure, here are some quick statistics on our dataset:

The charts above show that user history length follows a heavy-tailed distribution. While most users have relatively few interactions, a small but significant group has very long interaction histories. This is especially important to account for when building recommendation models, both to avoid overfitting to highly active users and to maintain quality for the large group of less engaged users.

In contrast, the distribution across tracks tells a very different story.

This chart clearly shows the imbalance between highly popular tracks and a large volume of niche content: over 90% of tracks received fewer than 100 plays during the entire data collection period. Despite this, recommender systems must engage with the entire catalog to surface even low-popularity tracks that align well with individual user preferences.

Using Yambda to evaluate algorithmic performance

Studies of recommender algorithm quality often use the Leave-One-Out (LOO) scheme, in which one of each user’s actions is held out for testing and the rest are used for training.
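The Leave-One-Out scheme just described can be sketched in a few lines of Pandas. This is an illustration under stated assumptions, not the evaluation code used for Yambda: it holds out each user's most recent event, which is one common variant of LOO, and the column names user_id, track_id, and timestamp are hypothetical.

```python
import pandas as pd


def leave_one_out_split(events: pd.DataFrame):
    """Hold out each user's most recent event; train on everything else."""
    ordered = events.sort_values(["user_id", "timestamp"])
    # duplicated(keep="last") is False only for the last row of each user,
    # so negating it marks exactly one held-out event per user.
    is_last = ~ordered.duplicated(subset="user_id", keep="last")
    train = ordered[~is_last].reset_index(drop=True)
    test = ordered[is_last].reset_index(drop=True)
    return train, test


# Toy data with hypothetical column names.
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 2],
        "track_id": [10, 11, 12, 10, 13, 14],
        "timestamp": [1, 2, 3, 8, 9, 10],
    }
)

train, test = leave_one_out_split(events)
print(test)
```

In this toy example every user contributes exactly one test event, however long or short their history is.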
This scheme has two notable drawbacks:

Temporal inconsistency: Test events may have occurred earlier than events in the training set.
Equal weighting of users: Inactive users influence the metrics as much as active ones, which can skew the results.

To bring evaluation conditions closer to real-world recommender system scenarios, we propose an alternative: the global temporal split.

This simple method selects a point in time (T) and excludes all subsequent events from the training set. This way, the model is trained on historical data and tested on future data.

For our evaluation, we reserved one day of data as the holdout set for two main reasons:

A single day of data is already enough to evaluate an algorithm.
Models in real-world production have different characteristics: some require frequent statistics updates (for example, popularity-based recommendations), others are fine-tuned or retrained periodically (boosting, matrix factorization, two-tower models), and some depend on continuously updated user interaction histories (recurrent and transformer-based models).

From our viewpoint, a one-day window is the optimal evaluation period to keep models static while still capturing short-term trends. The drawback of this approach is that it doesn’t account for longer-term patterns, such as weekly shifts in music listening behavior. We suggest leaving those aspects for future research.

Baselines

We evaluated several popular recommender algorithms on Yambda to establish baselines for future research and comparison.
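For readers who want to reproduce this kind of setup, a global temporal split with a one-day holdout might look like the sketch below. Several details are assumptions made for illustration: the column names, timestamps measured in seconds, placing T exactly one day before the last event, and assigning an event that falls exactly at T to the test set.

```python
import pandas as pd

SECONDS_PER_DAY = 24 * 60 * 60


def global_temporal_split(events: pd.DataFrame, holdout_seconds: int = SECONDS_PER_DAY):
    """Split events at a single global time T shared by all users."""
    # Assumption: T sits one holdout window before the latest event.
    split_time = events["timestamp"].max() - holdout_seconds
    train = events[events["timestamp"] < split_time]
    test = events[events["timestamp"] >= split_time]
    return train, test, split_time


# Toy data spanning three days; column names are hypothetical.
events = pd.DataFrame(
    {
        "user_id": [1, 2, 1, 3, 2],
        "track_id": [10, 10, 11, 12, 13],
        "timestamp": [0, 100_000, 200_000, 250_000, 259_200],
    }
)

train, test, split_time = global_temporal_split(events)
print(split_time, len(train), len(test))
```

Every training event now precedes every test event. Users who only appear after T, such as user 3 in the toy data, have no training history, and how to treat them is a separate design decision that this sketch leaves open.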
The algorithms we tested include: MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec.

For evaluation, we used the following metrics:

NDCG@k (Normalized Discounted Cumulative Gain), which measures the ranking quality of the recommendations.
Recall@k, which assesses the algorithm’s ability to retrieve relevant recommendations from the total pool.
Coverage@k, which indicates how much of the catalog the recommendations cover.

The results are presented in tables, and the code is available on Hugging Face.

Conclusion

Yambda can be valuable for research into recommendation algorithms on large-scale data, where both performance and the ability to model behavioral dynamics are crucial.

The dataset is available in three versions: a full set with 4.79 billion events, and smaller subsets with 480 million and 48 million events.

Both the dataset and the evaluation code are available on Hugging Face.

We hope this dataset proves useful in your experiments and research!

Thank you!