Imagine a young developer, experienced in web development and its many niches, sitting at his desk pondering startup ideas, as so many young developers do. "What if I make 'x'? What if I do 'x'? How can I 'x'?"
All these questions need answers, but none are coming.
As developers, we have all been there. There is that feeling of inadequacy that comes with creative stagnation. And even if you do reach a conclusion and think, "YES, THIS IS WHAT I CAN DO!" how will you actually do it?
You probably realized the technology for your idea does not exist…or does it? What if you did come up with a startup idea, and it is valid and can work? Even better: it works, it is valid, and the technology to actualize your dream is not just present, it is growing in efficiency, usability, and entrepreneurial practicality day by day. This is our situation, keen reader.
In short, I had a startup idea dependent on a constant stream of data from commercial websites. It is practical, usable, and definitely in demand today. Almost no downsides, except that it touches a part of web development I had not touched before: web scraping. I knew then and there that I had to give web scraping a try.
Well, let's strap in and pontificate on my new experiences with web scraping together! ;)
When people hear the term "web scraping," the meaning clicks fairly quickly. Still, let us look at an official definition of the term. parsehub.com says:
Web scraping refers to the extraction of data from a website. This information is collected and then exported into a format that is more useful for the user. Be it a spreadsheet or an API.
In even simpler, more personal terms, web scraping is when you use a module/library to collect data freely available on a website and store it for later use, or use it immediately for something else. It does not necessarily have to be stored, e.g. using a web scraper to collect data from a weather site and then display the current day's weather. The data is analyzed, used, then thrown away once displayed on the user's screen.
Web scrapers are becoming more relevant to entrepreneurs. Why is this? I will not go into too much detail here, but do not be dismayed; subscribe to me, as I may touch on this in the near future, and I wouldn't want you to miss out! Until then, let me satiate your curiosity with a small piece of what's to come. According to Cem Dilmegani of AIMultiple.com, web scraping is being used more and more by businesses, both large enterprises and small startups. It spans a wide array of fields but is most often used for predicting market trends for products, competitor monitoring for dynamic pricing, and even financial guidance.
In my case, web scraping will get me the relevant information from the commercial sites I want to use. To incorporate web scraping into your startup is like making a house after seeing exactly what’s wrong with the houses of others and having reference material to use. It makes things easier, faster, and overall better. Web scrapers will be a staple in all future businesses, so you might as well build a startup off it!
Let's keep some crucial details in mind. At the beginning of this startup journey, I used multiple web scrapers, testing each one's validity and use cases to determine which would be better for what I wanted to achieve. As a result, I tried coding the project in JavaScript before switching to Python: two different libraries/modules in JavaScript, then one in Python.
Now it is about time for us to get nitty-gritty! Let us dive right into the different web scrapers I experimented with: what each one is, how it impacted my development, and other thoughts pertaining to each.
Our first library stop is Puppeteer! It was the first library I experimented with, to test its usability for scraping commercial websites and web apps. How does the Puppeteer team define it?
Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the DevTools Protocol. Puppeteer runs in headless mode by default, but can be configured to run in full ("headful") Chrome/Chromium.
This, at first, is a mouthful. What do all these words mean? It is important to keep in mind the original definition of a web scraper: a tool that lets you extract selected data from a website for use, or for storage and later use. Puppeteer is exactly that. The complex wording is secondary as long as we know what it does. Let's break down how Puppeteer defines itself; once we do, we can see where it differentiates itself from other libraries of its type.
"Puppeteer is a Node.js library which provides a high-level API to control Chrome/Chromium over the DevTools Protocol." What does this mean? It means Puppeteer is a JavaScript library that runs a Chromium browser you can't see and pulls data out through it, which makes extracting data from web apps much easier. It does not replace the DevTools Protocol; rather, it wraps it in a high-level API, so you get the protocol's deep control over Chromium without having to drive it directly.
"Puppeteer runs in headless mode by default but can be configured to run in full ("headful") Chrome/Chromium." This means Puppeteer normally hides the Chromium/Chrome browser window, but it can show the full browser instance, which is handy for visually debugging your data extraction if you need it.
What does all this mean for the startup? It means Puppeteer gives full flexibility when capturing data. It can also automate tasks, capture timeline traces to find performance issues during development, test Chrome extensions, etc. It is a very loaded library. It can definitely do the job I want and more! However, because of its grandeur and complexity, could it impede rather than help? Let's see.
Let us go through the code excerpt above. I import Puppeteer and create an "asynchronous" or "async" function. The async part lets each step (launching the browser, opening the page, grabbing the content) be awaited before the next one runs. The function's purpose is to visit my target site and extract the front-end HTML into my "CMD" or "Terminal" tab by logging the HTML variable. By doing this, we will know whether the web scraper is working or whether there are problems to be solved.
Originally this did not work. The code was correct and Puppeteer was working just fine, but I was not getting data back from the website, even for a simple operation such as this. Why is that?
Look at the code above. Is it different from the original? Yes, it is. I replaced Puppeteer with Puppeteer-Stealth. What is Puppeteer-Stealth? It is like Puppeteer, except it masks many of the telltale properties of automation, like Puppeteer's headless state, fooling website servers into thinking it's a normal Chromium browser when it's not. I found out that many of the websites I was trying to scrape had upgraded their security features, so many of the ways vanilla Puppeteer tried to access them were blocked. Puppeteer-Stealth passed these checks and was then able to web scrape.
Now that we know Puppeteer could capture the simple HTML of the page, what are the other ways it could capture data? At this point, I was pondering the different ways of capturing data and the ways Puppeteer allowed one to do so.
Look at the code above…. You can save the data as a jpeg or png file (or even webp)! Originally I considered using this function, as I had planned to build in automatic emails to end users once the web scraping was complete; they could get the data in picture form. This idea was tossed out as soon as I realized Gmail, Yahoo, etc. often minimize pictures in emails and don't present them relative to the mail display size. The user would have to download the image just to see the data. Inefficient…. Maybe I only entertained the idea because of personal bias. Either way, the idea was thrown out. What other ways can we scrape, aggregate and present data to the user?
From the code above, similar to the ones preceding it, we can tell that it's scraping the website. This is also the best solution for me. There's a major difference between the earlier examples and this one. What is it? I made the web scraper go inside the page and specifically target the text on it. How? Look at "document.body.innerText": run inside the page via page.evaluate, it returns the rendered text of the document's body, with the markup stripped away.
With this, I can target the specific pieces of text on the page I am looking for, extract them, and then send them to the user. This is better than the jpeg option. Why didn't I find this option from the start? Because the jpeg idea seemed new and interesting to do. Oh well, interesting does not mean practical! So this simple flow of extracting text, organizing it, then sending it is the best option.
All this talk about web scrapers…. It’s making me hungry…. hungry for some Cheerios! No, not those actual edible Cheerios, good as they may be, no. It got me hungry for the Cheerio web scraper library!
Cheerio is similar to Puppeteer in that they are both JavaScript libraries that are commonly, even expressly, used for web scraping. How does the team who made Cheerio describe it? If we go to their homepage, we can see how they define the product:
The fast, flexible & elegant library for parsing and manipulating HTML and XML.
A stark contrast from Puppeteer. They are not focusing on the technologies they replaced or using heavy technical jargon to define their product. There is a reason for this. Puppeteer does a lot, as was already insinuated when we went through the features it offers. In contrast to Puppeteer's hyper-technical nature, Cheerio is much "softer" and "easier" to get into. This is not just something you notice from the descriptions of the two technologies; it is something I noticed as I used Cheerio. Let's review how I used and tested Cheerio to see if it was an asset in my web scraping ventures.
Puppeteer and Cheerio are both JavaScript web scraping libraries, but the code they produce looks wildly different! Cheerio also doesn't have as many "extra" features. It is made for one thing: parsing and manipulating HTML. Unlike Puppeteer, it does not run a browser at all; you fetch the page's HTML yourself (with an HTTP client) and hand it to Cheerio, which gives you the "bare bones" of what you need. How your code interprets every single line of HTML and data is up to you. That is why, in the above code, I get into specifics: I select and present very specific HTML elements. With Puppeteer, you can do the same thing with much less of your own code, but keeping all of Puppeteer's functionality and technicalities in mind is a hassle.
Let's keep Cheerio in mind with reference to the app for my startup. The most important thing to build now is an MVP (Minimum Viable Product): the bare-bones functionality that makes my product work. Cheerio can definitely give that bare minimum. But do we need or want slightly more?
It's time to get playing with Playwright! Playwright is a browser automation library with first-class Python support (among other languages), and Python is how I used it. That's right; we're ditching JavaScript (for now) and switching over to Python. Why? Because as I experimented with other libraries and researched web scrapers, I found something: Python libraries are the most popular for web scraping, and Python is arguably the best language for web scraping at the moment. Danielius Radavicius of Oxylabs.com even stated:
If you asked developers focused on web scraping what their language of choice is, most would likely answer Python, and for a good reason.
Python supersedes JavaScript in terms of what its web scraping libraries offer and their simplicity of implementation. However, does this necessarily mean libraries like Playwright are more effective than their JavaScript counterparts? Maybe. Let's see how the Playwright team defines Playwright:
Playwright enables reliable end-to-end testing for modern web apps.
Playwright's description is as easy as Cheerio's. However, Playwright does not specify itself as a web scraper; it specifies itself as a library for "end-to-end testing for modern web apps." This is interesting. If that is the case, then even though it sounds simple, akin to Cheerio, it is more complicated, like Puppeteer. Let us check some of its functionality: web scraping, native mobile emulation of Google Chrome for Android, test generation, a trace viewer for debugging failures, cross-language APIs, auto-waiting for DOM elements to be actionable, etc. These are only a few parts of its functionality. Playwright could be called the Python version of Puppeteer. The question is, does it help me and my startup web scrape commercial sites more easily? Let's see.
Playwright is very similar to Puppeteer in terms of functionality. We can even see it in the layout of the web scraping request: it sits in an asynchronous function, you launch a default Chromium browser for the operation, and then you print out the page findings. Simple, but heavy and elegant.
Let’s not beat around the bush. Let’s get straight to it! I spent quite some time with each web scraper, prodding and testing them for my startup needs. Overall?
I will 100% be continuing with Python Playwright rather than Puppeteer or Cheerio. What is the reason?
Cheerio is too simple. It gives me the bare minimum for an MVP, but what about after that stage? My capacity to build would be extremely limited by Cheerio's bare-bones nature. I'll use Cheerio in the future for small projects and scripts, not applications, unless web scraping is only a small feature in a wider application.
Puppeteer is heavy, but in a bad way. As I said before, Playwright is very similar to Puppeteer, so why Playwright over Puppeteer? While using Puppeteer and Puppeteer-Stealth, I ran into far more problems than with Playwright. For the same operations, I was scrolling for hours looking for simple solutions, whereas on Playwright the problems were far fewer. I did eventually resolve them, but the process left a bitter taste in my coding palate for this library. I will only use Puppeteer if I am solely using JavaScript on a project that needs web scraping.
That's right, Playwright, you've won! Compared to the others it's not as simple, but it is still simple enough to warrant usage. It has much the same feature set as Puppeteer, and I can handle it better. At the end of the day, these are just my opinions based on very specific development for a very specific startup project of mine. All of what I said might be the complete opposite for you. That's not just fine; that's great!
Subscribe so you can be on the lookout for my other pieces here!