
Preparing Complex Datasets for Amazon's Recommender System Study

by EScholar: Electronic Academic Papers for Scholars
May 4th, 2024

Too Long; Didn't Read

Efficiently prepare the dataset for Amazon's Barrier-to-Exit study using Python and scientific packages, focusing on data optimization, engineering, and transformation steps. Learn about strategies for handling complex datasets and achieving efficient computation for large-scale analysis.

Authors:

(1) Jonathan H. Rystrøm.

Abstract and Introduction

Previous Literature

Methods and Data

Results

Discussions

Conclusions and References

A. Validation of Assumptions

B. Other Models

C. Pre-processing steps

C Pre-processing steps

Dealing with a dataset with millions of rows and complex types like "categories" and "dates" requires special engineering considerations. This section outlines the pre-processing steps required to get the data from Ni et al. (2019) into an analysis-ready shape.


All pre-processing of the data was done using Python (Van Rossum, 2007), primarily because of its rich ecosystem of scientific packages. For this project we use numpy (Harris et al., 2020), pandas (McKinney, 2011), and numba (Lam et al., 2015) for efficient large-scale data processing. We also use scikit-learn (Pedregosa et al., 2011) to efficiently parse categories (see repository for implementation).


Most computations were performed on the Oxford Internet Institute’s HPC cluster. This allowed us to benefit from multi-core processing (Gorelick & Ozsvald, 2020) and increased RAM.


The first step is creating a dataset of category relevance for the books (see repository for details). Here, we simply take the original gzipped file and extract the list of categories and the item ID (asin) for each book. This drastically reduces the file size, so we can do the computations in memory.
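The extraction step can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the gzipped file holds one JSON object per line and that the categories are stored under a field named `category`. Both are assumptions about the file layout, and the function names are hypothetical.

```python
import gzip
import json
from typing import Iterator

import pandas as pd


def iter_item_categories(path: str) -> Iterator[dict]:
    """Yield one {asin, category} record per (item, category) pair.

    Assumes one JSON object per line, with the item ID under "asin" and
    a list of category names under "category" (assumed field name).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            # Keep only the two fields needed downstream; items without
            # categories produce no rows.
            for category in record.get("category") or []:
                yield {"asin": record["asin"], "category": category}


def load_item_categories(path: str) -> pd.DataFrame:
    """Stream the gzipped file and collect the slimmed-down records."""
    records = list(iter_item_categories(path))
    return pd.DataFrame(records, columns=["asin", "category"])
```

Reading the file line by line means the full metadata never has to sit in memory; only the two retained columns do.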


The next step is preparing the rating data. We start by filtering the dataset to keep only users with more than 20 ratings. This reduces the dataset considerably, as we saw in Fig. 2. We then left-join the data with the category relevance data described above. Each row now consists of a user ID, category ID, timestamp, and preference score (i.e. the rating multiplied by category relevance; see eq. 1) for each rating that the user has made for any given category. Note that each individual rating can be represented in multiple rows if a book has multiple categories (which most have).
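A minimal pandas sketch of the filter and left join, under stated assumptions: the column names (`user_id`, `asin`, `rating`, `timestamp`, `category`, `relevance`) and the function name are hypothetical, and the relevance table is assumed to already hold one relevance value per item and category.

```python
import pandas as pd


def build_preference_rows(
    ratings: pd.DataFrame,
    relevance: pd.DataFrame,
    min_ratings: int = 20,
) -> pd.DataFrame:
    """Return one row per (rating, category) pair with a preference score.

    ratings:   columns user_id, asin, rating, timestamp (assumed names)
    relevance: columns asin, category, relevance (assumed names)
    """
    # Number of ratings per user, aligned to each rating row.
    per_user = ratings["user_id"].map(ratings["user_id"].value_counts())
    # Keep users with strictly more than min_ratings ratings.
    active = ratings[per_user > min_ratings]
    # Left join: a rating for a book with k categories becomes k rows;
    # a book with no category entry keeps one row with missing values.
    merged = active.merge(relevance, on="asin", how="left")
    # Preference score: rating multiplied by category relevance.
    merged["preference"] = merged["rating"] * merged["relevance"]
    return merged[["user_id", "category", "timestamp", "preference"]]
```

Because the join is a left join, ratings for books without category information are kept with a missing preference score rather than silently dropped; whether to discard them afterwards is a separate decision.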


Finally, we summarise the data to get the sum of preference scores and the number of ratings per user, category, and quarter. This gives us a further reduced dataset that is more manageable to work with. Revealed preferences are defined as the weighted sum of ratings and category relevance (eq. 1), so this decision is mainly one of granularity.
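The quarterly summary can be sketched as a pandas group-by. The column names and the function name are hypothetical, and timestamps are assumed to be Unix seconds; the count here covers non-missing preference scores only.

```python
import pandas as pd


def summarise_by_quarter(rows: pd.DataFrame) -> pd.DataFrame:
    """Aggregate preference rows per user, category, and calendar quarter.

    rows: columns user_id, category, timestamp, preference (assumed names),
    with timestamp given in Unix seconds (assumption).
    """
    out = rows.copy()
    # Map each timestamp to its calendar quarter, e.g. 2015Q1.
    out["quarter"] = pd.to_datetime(out["timestamp"], unit="s").dt.to_period("Q")
    return (
        out.groupby(["user_id", "category", "quarter"], as_index=False)
        .agg(
            preference_sum=("preference", "sum"),
            # Counts non-missing preference scores in each group.
            n_ratings=("preference", "count"),
        )
    )
```

Since the preference scores are simply summed, aggregating by quarter only fixes the time resolution; a coarser or finer period could be substituted by changing the `"Q"` frequency.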


This paper is available on arXiv under a CC 4.0 license.

