Collecting data from the web can be the core of many data science/machine learning projects and a fundamental skill for almost every data practitioner. I also think it’s as fun as it is useful; I’ve always enjoyed watching all that data being collected automatically in such a short time.
However, for someone new at this, it’ll probably be a little tricky at first. The process of familiarizing yourself with all the tools available and knowing which one to use for each situation will take some time.
Also, not having a clear path on what to learn makes everything harder. Many will stumble from tutorial to tutorial not being sure about what they’re learning and end up giving up.
The most common way to scrape is by using a programming language, such as Python. This path can be a little tricky, especially for those who are not used to coding, as you’ll have to deal with different scenarios that demand different approaches and techniques, each taking even more time to get familiar with.
In this article, we’ll take a look at this path and also propose an alternative that, depending on the purpose of the scraping, can make much more sense than starting a journey with a programming language.
If you’re building your first scraper, please do not try to collect the most complex data from the most complex webpage.
There are pages made for people to learn web scraping. That’s their only purpose, so take advantage of it. quotes.toscrape.com is the best example of a good starting point.
But don’t use only these simple websites. Once you understand the basics of the process, go after data that matches your interests: sports, job offers, online stores, etc. This will help your learning process by keeping you motivated and by presenting real-life obstacles that make you search for new solutions and approaches on your own, and there’s no better way to learn.
This first step is useful whether or not you’re using a programming language. However, you should pay more attention to it if you’re using code, and it’s even more important to go slow until you get used to the basics of libraries such as requests and BeautifulSoup.
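To make those basics concrete, here is a minimal sketch of the requests + BeautifulSoup workflow on a page like quotes.toscrape.com. The `fetch_page` call needs network access, so the parsing is demonstrated on a small inline HTML sample that mirrors the site’s structure:

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url="http://quotes.toscrape.com/"):
    """Download the raw HTML of a page; requires network access."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

def get_quotes(html):
    """Extract the quote texts from a quotes.toscrape.com-style page."""
    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text() for span in soup.find_all("span", class_="text")]

# Tiny offline sample mirroring the structure of quotes.toscrape.com:
sample_html = '<div class="quote"><span class="text">Quality over quantity.</span></div>'
print(get_quotes(sample_html))
```

In a real run you would pass `fetch_page()`’s result into `get_quotes` and loop over the site’s pages.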
All right, so now you know the basics. If you chose Python, the next step is to master BeautifulSoup.
BeautifulSoup is a very powerful parsing tool with lots of different possibilities. The process of mastering it will have you understanding all the arguments in the find and find_all methods, the concept of parents, children, and siblings, and the attributes in the HTML tags.
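The concepts above can be sketched in a few lines. The HTML snippet below imitates a quote block from quotes.toscrape.com; everything runs offline:

```python
from bs4 import BeautifulSoup

html = """
<div class="quote">
  <span class="text">Well done is better than well said.</span>
  <small class="author">Benjamin Franklin</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match; find_all returns every match
author = soup.find("small", class_="author")
print(author.get_text())                      # text inside the tag

# Navigating relationships between tags
print(author.parent["class"])                 # attribute of the parent <div>
quote = author.find_previous_sibling("span")  # sibling lookup
print(quote["class"])
```

Playing with these navigation methods on real pages is what builds the intuition for how a document tree fits together.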
As a bonus, this will also give you a bit of HTML knowledge, which I did not list as a prerequisite for scraping as I didn’t have any when I started.
Eventually, though, you’ll hit pages that load their content with JavaScript, which requests and BeautifulSoup alone can’t see. That’s when Selenium shows up. Selenium works by actually opening an automated browser, which makes it capable not only of accessing the entire content of the page, since everything actually loads, but also of interacting with the page just like a real person.
By interacting I mean: clicking on a button, filling out a form, checking a box, scrolling down or up, pressing any key on the keyboard, etc. Basically, anything can be accomplished.
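As a hedged sketch of what that interaction looks like in code, the function below opens a browser, scrolls, and clicks a pagination link. It assumes Chrome and a matching ChromeDriver are installed, and that the page has a “Next” link (as quotes.toscrape.com does); the import sits inside the function so the sketch can be read without Selenium installed:

```python
def scrape_with_browser(url):
    # Imported inside the function so the rest of your script can run
    # even if Selenium isn't installed yet.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # assumes Chrome + ChromeDriver are installed
    try:
        driver.get(url)
        # Interact like a real person: scroll down, then click the "Next" link
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.find_element(By.PARTIAL_LINK_TEXT, "Next").click()
        return driver.page_source  # the fully rendered HTML
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned `page_source` can then be handed to BeautifulSoup, so the two tools complement each other rather than compete.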
It’s fair to say that you know your way around scraping websites by now. But there’s one more step to improve your skills: interacting with APIs.
API is short for Application Programming Interface and it’s how most websites connect their frontend to their backend.
It’s not exactly easy to implement at first, but then again, nothing you’ve learned so far was. With some practice, you can take great advantage of this approach.
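In practice, working with an API often means calling a JSON endpoint you spotted in the browser’s Network tab instead of parsing HTML at all. A minimal sketch (the endpoint URL below is hypothetical, so the live call is shown commented out and the idea is demonstrated on a canned response):

```python
import json
import requests

def get_json(url, params=None):
    """Call a JSON API endpoint and return the decoded payload."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

# A real call would look like this (hypothetical URL spotted in DevTools):
# catalog = get_json("https://example.com/api/products", params={"page": 1})

# Offline demonstration of the same idea, using a canned response body:
payload = json.loads('{"products": [{"name": "widget", "price": 9.99}]}')
print(payload["products"][0]["name"])
```

The payoff is that the data arrives already structured, with no HTML parsing needed.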
As you begin to scrape real, more complex websites, you’ll run into new challenges. Fortunately, you’ll also run into new tools to deal with these challenges.
You’ll have to make your code error-proof so it doesn’t just crash in the middle of the thousands of pages it’s scraping. So it’s good to understand how try and except clauses can be useful.
You’ll also need to consider that it’s not good to overload the server with numerous requests per minute, so it’s nice to use the sleep function to pause your code from time to time.
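Both habits fit together in a simple loop. In this sketch a stand-in `fetch` function fails on one page, the way a real site occasionally will, so the try/except and the pause can be shown without any network access:

```python
import time

def fetch(url):
    """Stand-in for requests.get that fails on one page, like a real site might."""
    if "page/3" in url:
        raise ConnectionError("server rejected the request")
    return f"<html>content of {url}</html>"

pages = {}
for n in range(1, 5):
    url = f"http://quotes.toscrape.com/page/{n}/"
    try:
        pages[url] = fetch(url)
    except ConnectionError as error:
        print(f"Skipping {url}: {error}")  # log the failure and keep going
    time.sleep(0.1)  # pause between requests so we don't overload the server

print(f"Collected {len(pages)} of 4 pages")
```

The scraper finishes with three of the four pages instead of crashing on the bad one, which is exactly the behavior you want across thousands of pages.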
Finally, you may want to consider learning some SQL (if you haven’t already) so you can create your own database to store the scraped data in a more professional fashion.
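Python’s built-in sqlite3 module is enough to sketch the idea: create a table, insert the scraped rows, and query them back. An in-memory database is used here so the example is self-contained; in practice you’d pass a file path instead:

```python
import sqlite3

# In-memory database for the sketch; use a file path like "quotes.db" in practice
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quotes (text TEXT, author TEXT)")

# Rows as your scraper might have collected them
scraped = [
    ("Well done is better than well said.", "Benjamin Franklin"),
    ("Quality over quantity.", "Unknown"),
]
conn.executemany("INSERT INTO quotes VALUES (?, ?)", scraped)
conn.commit()

rows = conn.execute("SELECT author FROM quotes").fetchall()
print(rows)
```

From here, swapping SQLite for PostgreSQL or MySQL is mostly a matter of changing the connection, while the SQL itself stays familiar.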
Each of the steps above would definitely take some time for anyone to master, especially for people who are not used to programming, which makes it clear that this is the harder path to take into web scraping.
Fortunately, writing code is not the only way to collect data from the internet, which is good, because maybe you’re not interested in it or don’t have the time to take on this long path we just saw.
Doing all this without having to learn how to code is already a big advantage. However, this tool allows you to go even further. For example, if the idea is to collect data from famous sites such as Amazon or eBay, you don’t really have to build anything, as Octoparse has built-in templates to scrape these pages. Your only job is to point out which information you want.
And even if there’s no template for the website you want to scrape, you can tell the software to auto-detect patterns on the page. This means the tool will identify the parts of the page most likely to be of interest to the scraper, so you don’t have much left to do.
Not to mention the option of running tasks in their cloud, where a complete infrastructure of IP addresses, data backups, and easy connections to SQL databases, Dropbox, Google Sheets, MongoDB, and other data storage options are already provided.
As we’ve seen so far, web scraping is a very broad field, and there are different ways and tools to approach it.
Therefore, in order to make the best use of each tool, it’s important to have a good understanding of your context, the data you need, the problem that it will solve as well as the budget and the expertise of the professionals involved in the project.