As a junior data scientist, the training data is usually ready for me to train a model, either in a database or an external CSV file provided to me. However, a dataset is not always available, so scraping skills can make our lives much easier.
My friend and I decided to start this mini-project covering the whole process, from scraping to exploratory analysis and finally some simple predictions with various machine learning models, all within one day as a small challenge for ourselves.
We can't wait to share our findings and hope you enjoy reading them. Also, feel free to reach us in the comments below if anything is unclear or you spot something we have missed! :)
We will be scraping the TOP 100 COOLEST TECH GADGETS OF 2018 from bestproducts.com and saving them into a CSV file. These product names will then be searched on iprice.sg, with the results sorted by relevance. Finally, the product name, original price, discount percentage, platform, and current price of each result on the first page returned by iprice will be scraped for later analysis and modelling.
Clear? Let’s start to scrape!
We will be using Scrapy to scrape the data. I am pretty new to Scrapy, as I previously used Selenium, an easier but heavier framework (in terms of memory consumption while a process is running; for more details, refer to this post on Stack Overflow). If you would like a fuller comparison of the two frameworks, feel free to check my friend's post here.
I will focus only on how to clean scraped data using item loaders and the processors defined in items.py. So, if you are a Scrapy beginner, do check out this post first. :)
For this webpage, after clicking the Inspect button, all the products are flattened out and the product names can be accessed. After inspecting Network > XHR, we managed to find the URL that the server answers with the data in JSON format!
We use a simple Python request script to get the JSON data, process it, and store it in a CSV file as shown below:
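A minimal sketch of such a script, written here with the standard library's urllib; the XHR endpoint and the JSON key ("title") are placeholders, not the real bestproducts.com API, so inspect Network > XHR for the actual ones:

```python
# Sketch: fetch the XHR endpoint, parse the JSON payload, and write
# the product names into a CSV file. XHR_URL and the "title" key are
# hypothetical placeholders -- substitute the real endpoint and field.
import csv
import json
from urllib.request import urlopen

XHR_URL = "https://example.com/xhr-endpoint"  # placeholder


def fetch_product_names(url):
    """Download the JSON payload and pull out the product names."""
    with urlopen(url, timeout=10) as resp:
        data = json.load(resp)
    return [item["title"] for item in data]


def save_to_csv(names, path="products.csv"):
    """Store one product name per row for the iprice spider to read."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["product_name"])
        for name in names:
            writer.writerow([name])
```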
For iprice.sg, we will be trying out the Scrapy framework. We first notice that each product container sits at the XPath:
//div[@class=' pu product relative no-underline w-50 w-33-m w-25-ol w-20-xl white'], so we will loop through these containers and extract the following fields:
for row in response.xpath("//div[@class=' pu product relative no-underline w-50 w-33-m w-25-ol w-20-xl white']"):
    loader = ItemLoader(item=IpriceScraperItem(), selector=row, response=response)
    loader.add_xpath('ori_price', ".//a/figure/figcaption/div/span[@class='f11 lh-11 original strike db mb3']/text()")
    loader.add_xpath('brand', ".//a/figure/figcaption/div/div/div/div[@class='s-n gray-dark overflow-hidden f13']/strong/text()")
    yield loader.load_item()
Note that we are using an ItemLoader to store each scraped item in a specified format, which differs from the usual way of yielding a plain dict in the parse function.
To use the ItemLoader, we need to specify our desired format in the
[Spidername]ScraperItem class, which can be found in items.py as follows:
(*items.py sits at the root of the Scrapy project, i.e.
projectname_scraper > items.py, alongside the spiders folder.)
Take name as an example:
name = scrapy.Field(
    input_processor=MapCompose(remove_tags, str.strip, clean_name),
    output_processor=TakeFirst(),
)
MapCompose: maps the scraped value through each of the given functions in turn. Note that the arguments to MapCompose must be functions.
remove_tags: removes all HTML tags from the scraped value.
str.strip: removes whitespace before and after the scraped value.
clean_name: a custom function applied to the scraped value.
output_processor=TakeFirst(): tells Scrapy to keep only the first non-empty value collected for the field.
Yay, we can now run the spider with the following command and write the output to a CSV file:
scrapy crawl iprice -o data.csv
Now we have collected all the data. After cleaning, we compared the scraped data from iprice.sg and bestproducts.com and noticed the following insights:
1. Products that cannot be found on iprice
Out of 100 products from bestproducts.com, only 63 can be found on iprice. The missing products can be summarised with a word cloud: we notice that a lot of them are audio devices such as earbuds, headphones, and speakers.
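A word cloud is essentially a picture of token frequencies, so the same summary can be reproduced with a quick count; the product names below are illustrative stand-ins, not the actual missing list:

```python
# Count word frequencies among products missing from iprice --
# the same signal a word cloud visualises. The names here are
# made-up examples, not the real missing products.
from collections import Counter

missing = [
    "Bose SoundSport Free Wireless Earbuds",
    "Jabra Elite 65t Earbuds",
    "Sonos One Smart Speaker",
]

words = Counter(
    word.lower()
    for name in missing
    for word in name.split()
)

# The most common tokens dominate the word cloud.
print(words.most_common(3))
```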
2. The largest contributor across different e-commerce platforms
Lazada is the most common platform in iprice search results for electronics products (82%), followed by Qoo10 (10%) and Shopee (6.2%).
3. Most frequent products that can be found on iprice
Referring to the following word cloud, we can see that a lot of Apple products can be found on iprice. It might be due to the fact that Apple products are very common in Singapore.
Performance of the iprice search engine
Now, let us filter out the false results returned by iprice: for example, searching for "Apple iPad Pro" can return cases for the Apple iPad Pro rather than the tablet itself. Let us get rid of those results and look at a few more insights:
1. Accuracy of the search results
We found that the accuracy of the first-page search results returned by iprice is only 36.21%: out of 1,193 search results, only 432 are exactly the items we were interested in! (What?) However, since we manually labelled the data to determine whether each result is correct, our labels might contain human errors.
2. The platform that provides most of the CHEAPEST electronic gadgets
Last but not least, which platform most often sells the cheapest electronic gadgets? This is in fact the main target of our project, haha!
The answer is => Lazada!
After filtering out the false search results, we grouped all the data by refer_name (the original product name from bestproducts.com) and counted, for each product, which platform provides the lowest price. As shown below:
31 out of 50 products (62%) can be found at their lowest price on Lazada!
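The grouping step can be sketched in pandas, assuming a cleaned DataFrame with refer_name, platform and current_price columns; the toy rows below stand in for the real scraped data:

```python
# For each refer_name, keep the row with the lowest current price,
# then count how often each platform "wins". Toy data, not the
# actual scraped results.
import pandas as pd

df = pd.DataFrame({
    "refer_name": ["iPad Pro", "iPad Pro", "Pixel 3", "Pixel 3"],
    "platform": ["Lazada", "Qoo10", "Shopee", "Lazada"],
    "current_price": [1199.0, 1250.0, 999.0, 949.0],
})

# Index of the cheapest row within each product group...
cheapest = df.loc[df.groupby("refer_name")["current_price"].idxmin()]

# ...then tally which platform offers that cheapest price.
win_counts = cheapest["platform"].value_counts()
print(win_counts)
```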
In conclusion, it seems like Lazada is the best choice if you want to buy an electronic gadget, in the sense that you have a higher chance of getting the cheapest price compared to other platforms in Singapore.
However, we realised that among the top 100 products we scraped from bestproducts.com, most are Apple products, and most Apple products are searchable on iprice, so our results may be skewed towards Apple-heavy listings.
Besides, there are 37 refer_names which we couldn't find on iprice. This might be due to two reasons.
Thanks for reading to the end! The GitHub repository for this project can be found here. And… I hope you enjoyed my first post. 😊
Remember that I mentioned this is a joint project with my friend?
Lastly, feel free to comment below to share your thoughts on this project, and get to know more about me here! :)