https://www.kdnuggets.com/2018/02/top-news-week-0129-0204.html

I was searching for flight tickets and noticed that ticket prices fluctuate during the day. I tried to find out when the best time to buy tickets is, but there was nothing on the Web that helped. So I built a small program, a so-called scraper, to automatically collect the data from the web. It extracted information for my specific flight destination on predetermined dates and notified me when the price got lower.

Web scraping is a technique used to extract data from websites through an automated process.

I learned a lot from this experience with web scraping, and I want to share it.

This post is intended for people who want to know about the common design patterns, pitfalls and rules related to web scraping. The article presents several use cases and a collection of typical problems, such as how not to be detected, dos and don'ts, and how to speed up (parallelization) your scraper.

Everything will be accompanied by python snippets, so that you can start straight away. This document will also go through several useful python packages.

Use Cases

There are many reasons and use cases why you would want to scrape data. Let me list some of them:

- scrape pages of an e-retailer to spot if some of the clothes you want to buy got discounted
- compare prices of several clothes brands by scraping their pages
- the price of flight tickets can vary during the day. One could crawl the travel website and get alerted once the price was lowered
- analyze auction websites to answer the question whether the starting bid should be low or high to attract more bidders, or whether a longer auction correlates with a higher end bid

Tutorial

Structure of the tutorial:

1. Available packages
2. Basic code
3. Pitfalls
4. Dos and don'ts
5. Speed up: parallelization

Before we start: Be NICE to the servers; you DON'T want to crash a website.

1.
Available packages and tools

There is no universal solution for web scraping because the way data is stored on each website is usually specific to that site. In fact, if you want to scrape the data, you need to understand the website's structure and either build your own solution or use a highly customizable one.

However, you don't need to reinvent the wheel: there are many packages that do most of the work for you. Depending on your programming skills and your intended use case, you might find different packages more or less useful.

1.1 Inspect option

Most of the time you will find yourself inspecting the HTML of the website. You can easily do it with the "inspect" option of your browser.

The section of the website that holds my name, my avatar and my description is called hero hero--profile u-flexTOP (how interesting that Medium calls its writers 'heroes' :)). The <h1> class that holds my name is called ui-h2 hero-title and the description is contained within the <p> class ui-body hero-description.

You can read more about HTML tags, and the differences between classes and ids, here.

1.2 Scrapy

There is a stand-alone, ready-to-use data extracting framework called Scrapy. Apart from extracting HTML, the package offers lots of functionality such as exporting data in various formats, logging etc. It is also highly customisable: you can run different spiders on different processes, disable cookies¹ and set download delays². It can also be used to extract data using an API. However, the learning curve is not smooth for new programmers: you need to read tutorials and examples to get started.

¹ Some sites use cookies to identify bots.
² The website can get overloaded due to a huge amount of crawling requests.

For my use case it was too much 'out of the box': I just wanted to extract the links from all pages, access each link and extract information out of it.

1.3 BeautifulSoup with Requests

BeautifulSoup is a library that allows you to parse the HTML source code in a beautiful way.
Along with it you need the Requests library, which will fetch the content of the URL. However, you have to take care of everything else yourself: error handling, how to export data, how to parallelize the web scraper, etc.

I chose BeautifulSoup as it would force me to figure out a lot of stuff that Scrapy handles on its own, and hopefully help me learn faster from my mistakes.

2. Basic code

It's very straightforward to start scraping a website. Most of the time you will find yourself inspecting the HTML of the website to access the classes and IDs you need. Let's say we have the following HTML structure and we want to extract the main_price elements. Note: the discounted_price element is optional.

<body>
  <div id="listings_prices">
    <div class="item">
      <li class="item_name">Watch</li>
      <div class="main_price">Price: $66.68</div>
      <div class="discounted_price">Discounted price: $46.68</div>
    </div>
    <div class="item">
      <li class="item_name">Watch2</li>
      <div class="main_price">Price: $56.68</div>
    </div>
  </div>
</body>

The basic code would be to import the libraries, do the request, parse the HTML and then find the class main_price.

https://gist.github.com/jkokatjuhha/02af3a28cf512ee8a3096273850fe029

It can happen that the class main_price is present in another section of the website. To avoid extracting unnecessary class main_price elements from other parts of the webpage, we could first address the id listings_prices and only then find all elements with class main_price inside it.

3. Pitfalls

3.1 Check robots.txt

The scraping rules of a website can be found in its robots.txt file. You can find it by writing robots.txt after the main domain, e.g. www.website_to_scrape.com/robots.txt. These rules identify which parts of the website are not allowed to be automatically extracted or how frequently a bot is allowed to request a page. Most people don't care about it, but try to be respectful and at least look at the rules even if you don't plan to follow them.
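As a rough illustration of reading these rules from code, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the user agent name my-scraper and the paths are made up for this example and do not come from any real site; the rules are parsed from a string so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


USER_AGENT = "my-scraper"  # hypothetical bot name
BASE = "http://www.website_to_scrape.com"

rules = load_rules(SAMPLE_ROBOTS_TXT)

# May this user agent fetch these pages according to the rules?
listing_allowed = rules.can_fetch(USER_AGENT, BASE + "/listings")
private_allowed = rules.can_fetch(USER_AGENT, BASE + "/private/page")

# Requested pause between two requests, in seconds (None if not specified).
delay = rules.crawl_delay(USER_AGENT)
```

Against a live site one would typically call set_url() with the robots.txt address followed by read() instead of parsing a string, and then sleep for the reported crawl delay between requests.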
3.2 HTML can be evil

HTML tags can contain id, class or both. An HTML id specifies a unique id, while an HTML class is non-unique. Changes in the class name or element could either break your code or deliver wrong results.

There are two ways to avoid it or at least to be alerted about it:

- Use a specific id rather than a class, since it is less likely to be changed
- Check if the element returns None

https://gist.github.com/jkokatjuhha/392744085cec10b6000a1f82adb462fe

However, because some fields can be optional (like discounted_price in our HTML example), the corresponding elements will not appear on each listing. In this case you can compute the percentage of listings for which this specific element returned None. If it is 100%, you might want to check if the element name was changed.

3.3 User agent spoofing

Every time you visit a website, it gets your browser information via the user agent. Some websites won't show you any content unless you provide a user agent. Also, some sites offer different content to different browsers. Websites do not want to block genuine users, but you would look suspicious if you sent 200 requests/second with the same user agent. A way out might be either to generate an (almost) random user agent or to set one yourself.

https://gist.github.com/jkokatjuhha/083c1b5e14e64b3b1ff734bb45b859be

3.4 Timeout

By default, Requests will keep waiting for a response indefinitely. Therefore, it is advised to set the timeout parameter.

https://gist.github.com/jkokatjuhha/64cecefa0bf31c2b21111373c11fcc66

3.5 Did I get blocked?

Frequent appearance of status codes like 404 (Not Found), 403 (Forbidden) or 408 (Request Timeout) might indicate that you got blocked. You may want to check for those error codes and proceed accordingly. Also, be ready to handle exceptions from the request.
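One possible way to turn "frequent appearance" into something a scraper can act on is to track the share of suspicious status codes among the most recent responses. The sketch below is only an illustration: the BlockDetector class, the window size and the 50% threshold are assumptions made for this example, not part of Requests or of the original code.

```python
from collections import deque

# Status codes mentioned above as possible signs of being blocked.
SUSPICIOUS_CODES = {403, 404, 408}


class BlockDetector:
    """Hypothetical helper: flags a likely block from recent status codes."""

    def __init__(self, window=20, max_share=0.5):
        # Only the last `window` status codes are kept.
        self.recent = deque(maxlen=window)
        self.max_share = max_share

    def record(self, status_code):
        """Store the status code of the latest response."""
        self.recent.append(status_code)

    def suspicious_share(self):
        """Fraction of recent responses with a suspicious status code."""
        if not self.recent:
            return 0.0
        hits = sum(code in SUSPICIOUS_CODES for code in self.recent)
        return hits / len(self.recent)

    def probably_blocked(self):
        """True once the window is full and the share exceeds the threshold."""
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and self.suspicious_share() > self.max_share
```

In a scraping loop one would call record() with the status_code attribute of each Requests response and pause or stop when probably_blocked() returns True. Requests made with a timeout should also be wrapped in a try/except for requests.exceptions.RequestException, so that a timeout or connection error does not crash the whole run.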
https://gist.github.com/jkokatjuhha/a33467fae4c9f7fac64f067501b484ac

3.6 IP Rotation

Even if you randomize your user agent, all your requests will come from the same IP address. That in itself doesn't look abnormal, because libraries, universities, and also companies have only a few IP addresses. However, if uncommonly many requests come from a single IP address, a server can detect it.

Using shared proxies, VPNs or TOR can help you become a ghost ;).

https://gist.github.com/jkokatjuhha/a4df4078aa1f86846511332c472fadbf

By using a shared proxy, the website will see the IP address of the proxy server and not yours. A VPN connects you to another network, and the IP address of the VPN provider will be sent to the website.

3.7 Honeypots

Honeypots are means to detect crawlers or scrapers.

These can be 'hidden' links that are not visible to the users but can be extracted by scrapers/spiders. Such links will have a CSS style set to display:none, they can be blended in by having the color of the background, or they can even be moved off the visible area of the page. Once your crawler visits such a link, your IP address can be flagged for further investigation, or even be instantly blocked.

Another way to spot crawlers is to add links with infinitely deep directory trees. To guard against these, you need to limit the number of retrieved pages or limit the traversal depth.

4. Dos and Don'ts

- Before scraping, check if there is a public API available. Public APIs provide easier and faster (and legal) data retrieval than web scraping. Check out the Twitter API, which provides APIs for different purposes.
- In case you scrape lots of data, you might want to consider using a database to be able to analyze or retrieve it fast. Follow this tutorial on how to create a local database with python.
- Be polite. As this answer suggests, it is recommended to let people know that you are scraping their website so they can better respond to the problems your bot might cause.
- Again, do not overload the website by sending hundreds of requests per second.

5. Speed up: parallelization

If you decide to parallelize your program, be careful with your implementation so you don't slam the server. And be sure you read the Dos and Don'ts section. Check out the definitions of parallelization vs concurrency, processors and threads here and here.

If you extract a huge amount of information from the page and do some preprocessing of the data while scraping, the number of requests per second you send to the page can be relatively low.

For my other project, where I scraped apartment rental prices, I did heavy preprocessing of the data while scraping, which resulted in 1 request/second. In order to scrape 4K ads, my program would run for about one hour.

In order to send requests in parallel you might want to use the multiprocessing package.

Let's say we have 100 pages and we want to assign every processor an equal number of pages to work with. If n is the number of CPUs, you can evenly chunk all pages into n bins and assign each bin to a processor. Each process will have its own name, target function and arguments to work with. The name of the process can be used afterwards to write data to a specific file.

I assigned 1K pages to each of my 4 CPUs, which yielded 4 requests/second and reduced the scraping time to around 17 mins.

https://gist.github.com/jkokatjuhha/7927b27cf7a831c48e223b7c06fbd401

Happy scraping!
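To make the chunking step from section 5 more concrete, here is a minimal sketch of splitting pages into n bins and giving each bin to a named process. The function names chunk_pages and start_workers and the process name pattern scraper_0, scraper_1, ... are invented for this illustration; the linked gist may be organized differently.

```python
import multiprocessing


def chunk_pages(pages, n):
    """Split pages into n contiguous bins whose sizes differ by at most one."""
    size, extra = divmod(len(pages), n)
    bins = []
    start = 0
    for i in range(n):
        # The first `extra` bins take one additional page each.
        end = start + size + (1 if i < extra else 0)
        bins.append(pages[start:end])
        start = end
    return bins


def start_workers(pages, target, n=None):
    """Run target(page_bin) in one process per bin and wait for all of them."""
    n = n or multiprocessing.cpu_count()
    processes = []
    for i, page_bin in enumerate(chunk_pages(pages, n)):
        process = multiprocessing.Process(
            name=f"scraper_{i}",  # hypothetical naming scheme
            target=target,
            args=(page_bin,),
        )
        process.start()
        processes.append(process)
    for process in processes:
        process.join()
```

The target should be a function defined at module level, and start_workers should be called under an if __name__ == "__main__": guard. Inside the target, multiprocessing.current_process().name gives the process name, which can be used to pick a separate output file per process.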


Web Scraping Tutorial with Python: Tips and Tricks
