I am guilty of bookmarking a ton of tutorials but never opening them again. I am one of those who prefer a PDF version or a book over a mountain of web pages saved as bookmarks or tabs stored in OneTab.
I was refreshing my Operating Systems concepts recently from my favorite site https://www.geeksforgeeks.org/operating-systems/.
Though I loved learning OS, what I didn't like was having to open web links one after another, topic by topic. I wanted all the content of those pages in a single file for my perusal. That led to my attempt at automating the process of ‘extracting text from websites’, aka web scraping. The use cases for web scraping are limitless: checking flight fares, monitoring stock prices, comparing product prices, and even collecting data for your next analytics project. What better way than using Python to do it?
As you can see from the picture above, the web page contains many topics and the links embedded under each. My requirement was to extract those links and then parse them to obtain the data from each page.
1. Have Python installed on your system.
I have Anaconda (a Python distribution) setup in my PC which automatically installs Python. It also takes care of the environment variables configuration.
2. Install BeautifulSoup4, a Python library for pulling data out of HTML and XML files.
Download the library from https://anaconda.org/anaconda/beautifulsoup4 and run
conda install beautifulsoup4-4.6.7-py37_0.tar.bz2
Modify the command according to the name of the package you downloaded.
3. Understand the basics of HTML and how tags work from here.
1. We first need to download the web page in order to scrape it. We use the requests module for that purpose. Import all the modules required in the script.
import requests
from bs4 import BeautifulSoup
2. Specify the web page to be extracted in the page_link variable. The requests module sends a GET request to the web server to fetch the page.
page_link = "https://www.geeksforgeeks.org/operating-systems/"
page_response = requests.get(page_link, timeout=7)
3. Now we finally use our BeautifulSoup library to parse the above response.
page_content = BeautifulSoup(page_response.content, "html.parser")
4. Inspect the web page and find the class name (an id or tag name would work too) of the content you need to pull out. To do so, right-click the element and select Inspect.
5. I noted down the class names I required. Now use the find method to extract those classes from the nested data structure created in the previous step.
It is good practice to wrap these calls in try-except blocks so errors are caught and don't stop the subsequent execution of the program.
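As a minimal sketch of steps 4 and 5: the class name "article-list" below is a placeholder for whatever class you found via Inspect, and the HTML is an inline stand-in for the downloaded page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Inline stand-in for page_response.content so this runs offline;
# the class name "article-list" is a placeholder -- use the class
# you found by inspecting the real page.
sample_html = '<div class="article-list"><a href="https://example.com/os-intro">Intro to OS</a></div>'
page_content = BeautifulSoup(sample_html, "html.parser")

try:
    # find() returns the first matching element, or None if nothing matches
    topic_section = page_content.find(class_="article-list")
    first_link = topic_section.a["href"]  # raises if find() came back empty
except (AttributeError, TypeError):
    first_link = None  # keep going instead of crashing the whole script

print(first_link)  # → https://example.com/os-intro
```

The try-except lets one missing class on the page degrade gracefully instead of killing the run.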
6. Since I needed the links embedded within these classes, I wrote a function to collect the links alone and save them in output_links.txt. The ‘a’ tags contain the href attribute I seek, so the function uses find_all to search for ‘a’ tags and get to obtain each href.
7. Call the function defined above for each of the classes we extracted.
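A sketch of what such a function could look like — the name get_links_topic comes from this post, but the exact file handling and the inline demo HTML are my assumptions:

```python
from bs4 import BeautifulSoup

def get_links_topic(section, outfile="output_links.txt"):
    """Collect the href of every <a> tag inside `section` and
    append them to `outfile`, one link per line."""
    links = [a.get("href") for a in section.find_all("a") if a.get("href")]
    with open(outfile, "a") as f:
        f.writelines(link + "\n" for link in links)
    return links

# Offline demo on an inline snippet standing in for one extracted class:
demo = BeautifulSoup(
    '<div><a href="https://x.test/1">One</a><a href="https://x.test/2">Two</a></div>',
    "html.parser",
)
print(get_links_topic(demo))  # → ['https://x.test/1', 'https://x.test/2']
```

Opening the file in append mode lets each extracted class add its links to the same output_links.txt.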
A look at the output_links.txt file, which contains all the links we extracted:
8. Now we need to fetch the content from each of the links and store it in a file. The code to extract the content is:
Again, this depends on the structure of the HTML page you are parsing. On the pages I am scraping, all the content sits within the article tag. I call this function while the links are being extracted, inside the get_links_topic() function itself.
To improve readability, I also removed some unnecessary content using the decompose function, and stripped the ad lines and extra newlines from the output. The final output is stored in OS_Tutorials.txt.
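A sketch of that extraction step, assuming the article-tag structure described above. The function name extract_article_text is my placeholder, the decompose() target here is a script tag purely for illustration, and the inline HTML is an offline stand-in for a fetched page:

```python
from bs4 import BeautifulSoup

def extract_article_text(page_content, outfile="OS_Tutorials.txt"):
    """Pull the text inside the <article> tag, dropping clutter
    with decompose(), and append it to `outfile`."""
    article = page_content.find("article")
    if article is None:
        return ""
    # decompose() removes a tag and its children from the tree entirely;
    # target whatever ad/script elements clutter your page.
    for junk in article.find_all("script"):
        junk.decompose()
    text = article.get_text("\n", strip=True)
    with open(outfile, "a") as f:
        f.write(text + "\n\n")
    return text

demo = BeautifulSoup(
    "<article><h1>Processes</h1><script>showAd()</script>"
    "<p>A process is a program in execution.</p></article>",
    "html.parser",
)
print(extract_article_text(demo))
```

Note that get_text("\n", strip=True) handles the "unnecessary new lines" cleanup: it joins text nodes with a single newline and trims whitespace around each.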
The full code is:
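The original gist is not embedded here, so what follows is a hedged reconstruction of the whole flow under the assumptions already noted (placeholder class name "article-list", illustrative decompose() targets), restructured slightly so that link collection and article extraction are separate functions:

```python
import requests
from bs4 import BeautifulSoup

PAGE_LINK = "https://www.geeksforgeeks.org/operating-systems/"

def get_links_topic(section, outfile="output_links.txt"):
    """Append every href found inside `section` to `outfile`."""
    links = [a.get("href") for a in section.find_all("a") if a.get("href")]
    with open(outfile, "a") as f:
        f.writelines(link + "\n" for link in links)
    return links

def extract_article_text(soup, outfile="OS_Tutorials.txt"):
    """Append the text of the <article> tag (minus clutter) to `outfile`."""
    article = soup.find("article")
    if article is None:
        return ""
    for junk in article.find_all("script"):  # illustrative clutter removal
        junk.decompose()
    text = article.get_text("\n", strip=True)
    with open(outfile, "a") as f:
        f.write(text + "\n\n")
    return text

def main():
    page_response = requests.get(PAGE_LINK, timeout=7)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    # "article-list" is a placeholder -- substitute the class name you
    # found by inspecting the topics page.
    for section in page_content.find_all(class_="article-list"):
        for link in get_links_topic(section):
            try:
                response = requests.get(link, timeout=7)
                extract_article_text(BeautifulSoup(response.content, "html.parser"))
            except requests.RequestException as err:
                print("Skipping", link, "-", err)  # one bad link shouldn't stop the run

# main()  # uncomment to run the full scrape (hits the live site)
```

Treat this as a starting sketch: the two functions can be exercised on any HTML fragment, and main() wires them together against the live page.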
The OS_Tutorials.txt file (around 159 KB now) looks like this:
A few points before you proceed to parse your favorite web content:
1. Avoid using the extracted info for commercial purposes. I’m not sharing the extracted file for this very reason.
2. It is not cool to break the site you love by scraping it aggressively.
3. The layout of the site may change over time, so you may need to modify your code accordingly.
Please comment if there are any alternatives to this method or any possible code optimizations. Also comment if you have scraped a site of your own with this method :)