In 2020, web data extraction, or web scraping, is the only way to get the desired data if the owners of a website don't grant their users access through an API.

Many websites like Twitter, YouTube, or Facebook provide an easy way to access their data through a public API. The information you obtain through an API is both well structured and normalized. For example, it can come as JSON, CSV, or XML.

3 ways to extract data from any website.

#1 Official API.

First of all, always check whether there is an official API that you can use to get the desired data. Sometimes the official API is not kept up to date, or some of the data is missing from it.

#2 "Hidden API".

The backend might generate data in JSON or XML format that the frontend consumes. Investigating XMLHttpRequest (XHR) calls with a web browser inspector gives us another way to access the data. It provides the data in the same way as an official API would.

How to get this data? Let's hunt for the API endpoint! For example, let's look at a resource showing local COVID-19 cases to website visitors: https://covid-19.dataflowkit.com/

Open Chrome DevTools by pressing Ctrl+Shift+I. Once the console appears, go to the "Network" tab. Select the XHR filter to catch an API endpoint as an "XHR" request, if one is available. Make sure the "recording" button is enabled. Refresh the webpage. Click Stop "recording" when the data-related content has appeared on the webpage.

Now you can see a list of requests on the left. Investigate them. The "Preview" tab shows an array of values for the item named "v1." Press the "Headers" tab to see the details of the request. The most important thing for us is the URL. The Request URL for "v1" is https://covid-19.dataflowkit.com/v1. Now, let's open that URL in another browser tab to see what happens.

Cool! That's what we're looking for.
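Once such an endpoint is found, it can be called from a script just like an official API. The following is a minimal Python sketch that uses only the standard library. The helper names `decode_json` and `fetch_json`, the `Accept` header, and the timeout are illustrative assumptions; the shape of the response is not documented here, so inspect whatever comes back.

```python
import json
from urllib.request import Request, urlopen

# Endpoint found in the DevTools "Network" tab in the walkthrough above.
API_URL = "https://covid-19.dataflowkit.com/v1"


def decode_json(raw: bytes, encoding: str = "utf-8"):
    """Turn a raw HTTP response body into Python objects."""
    return json.loads(raw.decode(encoding))


def fetch_json(url: str, timeout: float = 10.0):
    """Request the endpoint the way the page's frontend does and decode the JSON.

    The Accept header and the timeout are assumed values, not requirements
    stated by the site.
    """
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return decode_json(response.read())
```

Calling `fetch_json(API_URL)` should return the same array that the "Preview" tab shows, provided the endpoint is still online and accepts requests from scripts.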
Taking data either directly from an API or with the technique described above is the easiest way to download datasets from websites. Of course, these approaches are not going to work for every website, and that is why web scraping libraries are still necessary.

Web scraping is the data extraction technique that substitutes manual, repetitive typing or copy-pasting.

#3 Website scraping.

Know the rules! What should you check before scraping a website?

☑️ Robots.txt is the first thing to check when you plan to scrape website data. The robots.txt file lists the rules on how you or a bot should interact with the site. You should always respect and follow all the rules listed in robots.txt.

☑️ Make sure you also look at a site's Terms of use. If the terms of use do not limit access for bots and spiders and do not prohibit rapid requests to the server, crawling is fine.

☑️ To be compliant with the EU General Data Protection Regulation, or GDPR, you should first evaluate your web scraping project. If you don't scrape personal data, then GDPR does not apply. In this case, you can skip this section and move to the next step.

☑️ Be careful about how you use the extracted data, as you may sometimes violate copyrights. If the terms of use do not limit a particular use of the data, anything goes so long as the crawler does not violate copyright.

Find more information: Is web scraping legal or not?

Sitemaps

Typical websites have sitemap files containing a list of links that belong to the website. They make it easier for search engines to crawl websites and index their pages. Getting URLs to crawl from sitemaps is usually much faster than gathering them sequentially with a web scraper.
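Both the robots.txt check and the sitemap lookup can be automated. Here is a minimal Python sketch using the standard library's `urllib.robotparser` and `xml.etree.ElementTree`. The function names, the sample rules, and the `my-scraper` user agent in the usage lines are hypothetical; the sketch assumes the sitemap follows the sitemaps.org XML schema and that the robots.txt and sitemap contents have already been downloaded as text.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def sitemap_urls(sitemap_xml: str) -> list:
    """Collect every <loc> entry from a sitemap (or sitemap index) document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


# Hypothetical sample data for illustration only.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/a</loc></url>"
    "<url><loc>https://example.com/b</loc></url>"
    "</urlset>"
)

crawlable = [
    url
    for url in sitemap_urls(SAMPLE_SITEMAP)
    if allowed_by_robots(SAMPLE_ROBOTS, "my-scraper", url)
]
```

Filtering the sitemap URLs through the robots.txt rules before crawling keeps the scraper inside what the site owner allows.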
Render JavaScript-driven web sites

JavaScript frameworks like Angular, React, and Vue.js are widely used for building modern web applications. In short, a typical web application frontend consists of HTML + JS code + CSS styles. Usually, the source HTML does not initially contain all the actual content. While a web page downloads, HTML DOM elements are loaded dynamically as the JavaScript code runs. As a result, we get rendered static HTML.

☑️ You can use Selenium for website scraping, but it is not a good idea. Many tutorials teach how to use Selenium for scraping data from websites. Yet the Selenium home page clearly states that it is "for automating web applications for testing purposes."

☑️ PhantomJS used to be suitable for such tasks, but its development has been suspended since 2018.

☑️ Alternatively, Scrapinghub's Splash was an option for Python programmers before Headless Chrome.

Your browser is a website scraper by its nature. The best way nowadays is to use Headless Chrome, as it renders web pages "natively."

The Puppeteer Node library is the best choice for JavaScript developers to control Chrome over the DevTools Protocol.

Go developers can choose either chromedp or cdp to access Chrome via the DevTools Protocol.

Be smart. Don't let them block you.

Some websites use anti-scraping techniques to prevent web scraper tools from harvesting online data. Web scraping is always a "cat and mouse" game. So when building a web scraper, consider the following ways to avoid getting blocked. Otherwise, you risk not receiving the desired results.

Tip #1: Make random delays between requests.

When a human visits a website, they move between pages many times more slowly than a web crawler does. A web scraper, by contrast, can extract several pages simultaneously in no time. Huge traffic coming to the site in a short period of time looks suspicious.
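A minimal Python sketch of Tip #1 could look like the following. The function names and the default bounds of 2 to 6 seconds are assumptions chosen for illustration, and `fetch` stands for whatever download function your scraper already uses.

```python
import random
import time


def random_delay(min_seconds: float = 2.0, max_seconds: float = 6.0) -> float:
    """Pick a pause length uniformly between the two bounds (assumed values)."""
    return random.uniform(min_seconds, max_seconds)


def polite_crawl(urls, fetch, min_seconds: float = 2.0, max_seconds: float = 6.0):
    """Call fetch(url) for each URL, sleeping a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No need to wait before the very first request.
            time.sleep(random_delay(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

The right bounds differ from site to site, so treat the defaults as a starting point rather than a recommendation.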
You should find the ideal crawling speed for each website individually. To mimic human user behavior, you can add random delays between requests. Don't create an excessive load on the site. Be polite to the site you extract data from so that you can keep scraping it without getting blocked.

Tip #2: Change User-agents.

When a browser connects to a website, it passes the User-Agent (UA) string in the HTTP header. This field identifies the browser, its version number, and the host operating system. A typical user agent string looks like this: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36".

If multiple requests to the same domain carry the same user-agent, the website can detect and block you very soon. Some websites block requests whose User-Agent differs from that of a common browser. If the "user-agent" value is missing, many websites won't allow access to their content.

What is the solution? Build a list of user-agents and rotate them randomly.

Tip #3: Rotate IP addresses. Use Proxy servers.

If you send multiple requests from the same IP address during scraping, the website considers this suspicious behavior and blocks you. For the most straightforward cases, the cheapest Datacenter proxies are enough. But some websites have advanced bot detection algorithms, so you have to use either residential or mobile proxies to scrape them.

For example, someone in Europe wants to extract data from a website that limits access to US users only. The obvious solution is to make requests through a proxy server located in the USA, so that the traffic appears to come from a local US IP address. To obtain country-specific versions of target websites, just specify the desired country in the request parameters of the Dataflow Kit fetch service.

Tip #4: Avoid scraping patterns. Imitate human behavior.

Humans are not consistent while navigating a website.
They perform random actions like clicks on the page and mouse movements. Web scraping bots, by contrast, follow fixed patterns when crawling a website. Teach your scraper to imitate human behavior. This way, website bot detection algorithms have no reason to block you from automating your scraping tasks.

Tip #5: Keep an eye on anti-scraping tools.

One of the most frequently used tools for detecting hacking or web scraping attempts is the "honey pot." Honey pots are not visible to the human eye but can be seen by bots and web scrapers. Right after your scraper follows such a hidden link, the site can block you quite easily. Find out whether a link has the "display: none" or "visibility: hidden" CSS properties set, and if it does, just don't follow that link. Otherwise, the site immediately identifies you as a bot or scraper, fingerprints the properties of your requests, and bans you.

Tip #6: Solve online CAPTCHAs.

While scraping a website on a large scale, there is a chance of being blocked by the website. Then you start seeing CAPTCHA pages instead of web pages. CAPTCHA is a test used by websites to fight back against bots and crawlers, asking website visitors to prove they're human before proceeding.

Many websites use reCAPTCHA from Google. reCAPTCHA v2 asks visitors to tick an "I'm not a robot" box, while the latest version, v3, analyzes visitor behavior in the background without showing a checkbox.

CAPTCHA solving services use two methods for solving CAPTCHAs:

☑️ Human-based CAPTCHA Solving Services
When you send your CAPTCHA to such a service, human workers solve it and send the answer back.

☑️ OCR (Optical Character Recognition) Solutions
In this case, OCR technology is used to solve CAPTCHAs automatically.

Point-and-click visual selector.

Of course, we don't intend only to download and render JavaScript-driven web pages but to extract structured data from them. Before starting data extraction, let's specify the patterns of data.
Consider a sample page from a web store selling smartphones. We want to scrape the Image, the Title of an item, and its Price.

The Google Chrome Inspect tool does a great job of investigating the DOM structure of HTML web pages. Click the Inspect icon in the top-left corner of DevTools. With the Chrome Inspect tool, you can easily find and copy either the CSS Selector or the XPath of a specified DOM element on the web page.

Usually, when scraping a web page, you have more than one similar block of data to extract. Often you crawl several pages during one scraping session.

Surely, you can use Chrome Inspector to build a payload for scraping. In some complex cases, it is the only way to investigate particular element properties on a web page. Still, in most cases modern online web scrapers offer a more comfortable way to specify patterns (CSS Selectors or XPath) for data scraping, and to set up pagination rules and rules for processing detail pages along the way.

Look at this video to find out how it works. https://youtu.be/SKBkclf1FxA

Manage your Data Storage strategy.

The most well-known simple formats for storing structured data nowadays include CSV, Excel, and JSON (Lines). Extracted data may be encoded into the destination format right after parsing a web page. These formats are suitable for storing low volumes of data.

Crawling a few pages may be easy, but millions of pages require different approaches. How do you crawl several million pages and extract tens of millions of records? What do you do if the size of the output data ranges from moderate to huge?

Choose the right format for output data.

Format #1. Comma Separated Values (CSV) format

CSV is the simplest human-readable data exchange format. Each line of the file is a data record. Each record consists of an identical list of fields separated by commas. A list of families, for instance, can be represented as CSV data with one family per line. CSV is limited to storing two-dimensional, untyped data.
There is no way to specify nested structures, like the names of children, or the types of values in plain CSV.

Format #2. JSON

Representing nested structures in JSON files is easy, however. Nowadays JavaScript Object Notation (JSON) has become the de facto standard data exchange format, replacing XML in most cases.

One of our projects consists of 3 million parsed pages. As a result, the size of the final JSON is more than 700 MB. The problem arises when you have to deal with JSON files of that size. To insert or read a record from a JSON array, you need to parse the whole file every time, which is far from ideal.

Format #3. JSON Lines

Let's look into what the JSON Lines format is and how it compares to traditional JSON. JSON Lines is already common in the industry: Logstash and Docker store logs as JSON Lines.

The same list of families expressed in the JSON Lines format looks like this:

{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

JSON Lines consists of several lines in which each line is a valid JSON object, separated by the newline character \n. Since every entry in JSON Lines is valid JSON, you can parse every line as a standalone JSON document. For example, you can seek within the file or split a 10 GB file into smaller files without parsing the entire thing. To get a given number of records, you only need to read the same number of lines.

Summary

A good scraping platform should:

☑️ Fetch and extract data from web pages concurrently.
We use the concurrency features of Golang and have found them fantastic.

☑️ Persist extracted blocks of scraped data in the central database regularly.
This way, you don't have to keep much data in RAM while scraping many pages. Besides, it is easy to export the data to different formats several times later. We use MongoDB as our central storage.

☑️ Be web-based.
An online website scraper is accessible from anywhere, on any device that can connect to the internet. Different operating systems aren't an issue anymore. It's all about the browser.

☑️ Be cloud-friendly.
It should provide a way to quickly scale cloud capacity up or down according to the current requirements of a web data extraction project.

Conclusion

In this post, I tried to explain how to scrape web pages in 2020. But before considering scraping, find out whether an official API exists, or hunt for some "hidden" API endpoints.

I would appreciate it if you could take a minute to tell me which of the web scraping methods you use the most in 2020. Just leave me a comment below.

Happy scraping!

The A-Z of Web Scraping in 2020 [A How-To Guide]
