I am guilty of bookmarking a ton of tutorials but never opening them again. I am one of those who prefer a pdf version or a book over a mountain of web pages as bookmarks or tabs stored in OneTab.
I was refreshing my Operating Systems concepts recently from my favorite site https://www.geeksforgeeks.org/operating-systems/.
Though I loved learning OS, what I didn't like was having to open web links one after another, topic by topic. I wanted all the content of those pages in a single file for my perusal. What followed was my attempt at automating the process of 'extracting text from websites', aka web scraping. The use cases for web scraping are limitless: checking flight fares, monitoring stock prices, comparing product prices, even collecting data for your next analytics project… What better way to do it than with Python?
The web page we are extracting
As you can see from the picture above, the web page contains many topics and their associated embedded links. My requirement was to extract those links and then parse each of them to obtain their content.
1. I have Anaconda (a Python distribution) set up on my PC, which automatically installs Python. It also takes care of configuring the environment variables.
2. Install BeautifulSoup4, a Python library for pulling data out of HTML and XML files.
Download the library from https://anaconda.org/anaconda/beautifulsoup4 and run conda install beautifulsoup4-4.6.7-py37_0.tar.bz2, modifying the file name according to the package you downloaded.
3. Understand the basics of HTML and how tags work from here.
1. Import the required libraries.
from bs4 import BeautifulSoup
import requests
import os
2. Specify the web page to be extracted in the page_link variable. The requests module sends a GET request to the web server to fetch our web page.
page_link = 'https://www.geeksforgeeks.org/operating-systems/'
page_response = requests.get(page_link, timeout=7)
3. Now we finally use our BeautifulSoup library to parse the above response.
page_content = BeautifulSoup(page_response.content, "html.parser")
4. Inspect the web page and find the class name (id name or tag name could be used too) of the contents you need to pull out. To do so, right-click on your element and select Inspect.
Using Inspect
5. I found the class names I required. Now use the find method to extract these classes from the nested data structure created in the step above.
It is good practice to use try-except blocks, so that errors are caught and don't stop the subsequent execution of the program.
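Here is a minimal sketch of this step, assuming the page_content object from step 3; the 'entry-content' class name is a placeholder, not the page's actual markup, so substitute whatever you found while inspecting.
try:
    # find() returns the first tag matching the given class, or None if nothing matches
    topic_section = page_content.find('div', attrs={'class': 'entry-content'})
except AttributeError as e:
    # Catching errors keeps one bad lookup from stopping the rest of the script
    print('Could not locate the section:', e)
    topic_section = None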
6. Since I required the links embedded within these classes, I wrote a function to extract just the links and save them in output_links.txt. The 'a' tags contain the href attribute I seek, so the function uses find_all to search for 'a' tags and get to obtain each href.
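A rough sketch of what such a function could look like, assuming the section object from step 5 (the original gist is not reproduced here):
def get_links_topic(section):
    # find_all('a') returns every anchor tag nested inside the section
    with open('output_links.txt', 'a') as f:
        for anchor in section.find_all('a'):
            link = anchor.get('href')   # get() returns None if the attribute is missing
            if link:
                f.write(link + '\n')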
7. Call the function defined above for each of the classes we extracted.
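As a sketch, assuming the sections collected in step 5 are held in a list (the list here is illustrative):
for section in [topic_section]:
    if section is not None:
        get_links_topic(section)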
A look at the output_links.txt file which contains all the links we extracted
The extracted links
8. Now we need to get the content from each of the links and store it in a file. The code to extract the content is below.
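The original embedded code is not reproduced here; the following is a hedged sketch of such a function, assuming the tutorial text lives inside an article tag (as described next):
def get_content(link, output_file='OS_Tutorials.txt'):
    # Fetch one tutorial page and append the text of its <article> tag to the output file
    try:
        response = requests.get(link, timeout=7)
        soup = BeautifulSoup(response.content, 'html.parser')
        article = soup.find('article')
        if article is not None:
            with open(output_file, 'a', encoding='utf-8') as f:
                f.write(article.get_text() + '\n')
    except requests.exceptions.RequestException as e:
        # Skip links that fail to download instead of stopping the whole run
        print('Skipping', link, ':', e)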
Again, this depends on the structure of the HTML page you are parsing. On the page I am scraping, all the content sits within the article tag. I call this function while the links are being extracted, inside the get_links_topic() function itself.
To improve readability, I also removed some of the unnecessary content using the decompose() function, stripped the ad lines and unnecessary newlines from the output, and stored the final result in OS_Tutorials.txt.
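For illustration, the cleanup could look like this; the tags removed here (script and ins, which often hold ads) are assumptions about the page's markup, not the exact selectors used in the original:
for unwanted in article.find_all(['script', 'ins']):
    unwanted.decompose()   # decompose() removes the tag and its contents from the parse tree
text = article.get_text()
text = '\n'.join(line for line in text.splitlines() if line.strip())   # drop empty lines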
The full code is:
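The embedded gist does not survive here; below is a consolidated sketch of the whole pipeline under the same assumptions as above (placeholder class name, script/ins removal), not the exact original script.
from bs4 import BeautifulSoup
import requests

page_link = 'https://www.geeksforgeeks.org/operating-systems/'
output_links = 'output_links.txt'
output_file = 'OS_Tutorials.txt'

def get_content(link):
    # Fetch one tutorial page and append the cleaned text of its <article> tag to the output file
    try:
        response = requests.get(link, timeout=7)
        soup = BeautifulSoup(response.content, 'html.parser')
        article = soup.find('article')
        if article is None:
            return
        # Remove scripts and ad blocks before extracting text (selectors are assumptions)
        for unwanted in article.find_all(['script', 'ins']):
            unwanted.decompose()
        text = article.get_text()
        text = '\n'.join(line for line in text.splitlines() if line.strip())
        with open(output_file, 'a', encoding='utf-8') as f:
            f.write(text + '\n\n')
    except requests.exceptions.RequestException as e:
        print('Skipping', link, ':', e)

def get_links_topic(section):
    # Save every link inside the section and pull the content behind it
    with open(output_links, 'a') as f:
        for anchor in section.find_all('a'):
            link = anchor.get('href')
            if link and link.startswith('http'):
                f.write(link + '\n')
                get_content(link)

try:
    page_response = requests.get(page_link, timeout=7)
    page_content = BeautifulSoup(page_response.content, 'html.parser')
    # The class name below is a placeholder; use the one found via Inspect in step 4
    for section in page_content.find_all('div', attrs={'class': 'entry-content'}):
        get_links_topic(section)
except requests.exceptions.RequestException as e:
    print('Could not fetch the index page:', e)

Running this sketch writes output_links.txt and OS_Tutorials.txt to the working directory.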
The OS_Tutorials.txt (around 159KB in size now) looks like this:
A few points before you proceed to parse your favorite web content:
3. The layout of the site may change over time, so you may need to modify your code accordingly.
References:
https://beautiful-soup-4.readthedocs.io/en/latest/
https://hackernoon.com/web-scraping-bf2d814cc572
Note:
Please comment if there are any alternatives to this method or any possible code optimizations. Also comment if you have scraped a site with this method :)