Use Jupyter to Restart the Script from the Point Where the Scraper Terminated

Written by databug | Published 2020/03/02
Tech Story Tags: jupyter-notebook | python | web-scraping | selenium | tips | jupyter-scraping-time-machine | programming | software-development | web-monetization

TLDR: Using Jupyter, you can restart a script from the point where the scraper terminated. The solution depends on the "running code" feature, which lets you run blocks of code independently of each other. Python, Selenium, Pandas, Beautiful Soup, and your good old friend time are the libraries you need for this project. But do beware of repeated errors if the website's server is unstable, like my target website's.

Have you ever had a situation where your scraper ran into an error (be it a server error or a scraper block) and you had to start over from the beginning?
You’re in luck! You can use Jupyter to restart the script from the point where the scraper terminated. I don’t know exactly how this works under the hood, but let me give you a brief explanation of how to use the workaround.
This solution largely depends on Jupyter’s “running code” feature: cells run independently of each other, but they all share the same kernel, so every variable and object survives between runs.
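A minimal illustration of why that matters (the variable name here is just for demonstration):

```python
# --- Cell 1: run once ---
results = []          # any object created here lives in the kernel

# --- Cell 2: run later, even after Cell 1 crashed partway through ---
# `results` still exists because both cells share the same kernel,
# which is also why a Selenium browser session survives between cells.
results.append("still here")
print(results)
```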
Start with the usual scraping stack: Python, Selenium, Pandas, Beautiful Soup, and your good old friend time. Those are all the libraries you need for this project.
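In code, the import block looks something like this (a sketch; exact versions don't matter much):

```python
import time                      # polite delays while pages render

import pandas as pd              # building the table and exporting to CSV
from bs4 import BeautifulSoup    # parsing the rendered HTML
from selenium import webdriver   # driving a real Chrome browser
```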
In this explanation I won’t dive deeply into my source code, but I’ll show what my first batch of scraping code looks like.
The brief explanation of my code is the following:
1) Go to the target site.
2) Perform some actions to navigate to the specific page I want, using XPath.
3) Create an array called compiled_list.
4) Scrape the ‘hell’ out of the JavaScript table (I’m dealing with a website whose pagination doesn’t show in the URL).
5) Save the data in the array.
6) ‘Transform’ the array into a pandas DataFrame.
7) Export the DataFrame to CSV, naming the file after the page set.
Some parts are blocked out due to the sensitivity of the data.
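Since the real code is partially redacted, here is a minimal sketch of what such a first batch might look like. The URL, XPaths, page count, and column names are placeholders, not my actual (sensitive) values:

```python
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/data")               # placeholder target site

# Navigate to the specific page with XPath (placeholder path).
driver.find_element(By.XPATH, "//a[@id='reports-tab']").click()
time.sleep(2)

compiled_list = []                                   # the array that collects every row

for page in range(1, 51):                            # pagination doesn't show in the URL,
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for row in soup.select("table tbody tr"):        # the JavaScript-rendered table
        compiled_list.append([td.get_text(strip=True) for td in row.select("td")])
    # Advance by clicking "next" instead of changing the URL.
    driver.find_element(By.XPATH, "//button[@class='next']").click()
    time.sleep(2)                                    # give the table time to re-render

# 'Transform' the array into a pandas DataFrame, then export, naming by set.
df = pd.DataFrame(compiled_list, columns=["col_a", "col_b", "col_c"])
df.to_csv("scrape_set_1.csv", index=False)
```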
Here we… what? The servers are down….
People who scrape a lot hate this error message
But as I said in the first part of this article, you can create another batch of code that continues from where the first one left off.
Before you continue:
1) Do not close the Chrome browser connected to your script. If you do, all your session progress will be lost.
2) Check the last page number you scraped, and adjust your next batch of code accordingly.
So run the program again, with one small change: bump the set number in the export part.
I edited the set number before I ran this block of code.
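As a sketch (again with placeholder values), the continuation cell might look like this. Note that it reuses the driver object still alive in the kernel, and only the resume page and the set number change:

```python
# Second batch: the same loop, resumed. `driver` is still alive in the kernel,
# so the browser session (and the site's pagination state) survives.
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

compiled_list = []                                   # fresh array for this set

for page in range(23, 51):                           # 23 = where the last run died (placeholder)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for row in soup.select("table tbody tr"):
        compiled_list.append([td.get_text(strip=True) for td in row.select("td")])
    driver.find_element(By.XPATH, "//button[@class='next']").click()
    time.sleep(2)

df = pd.DataFrame(compiled_list, columns=["col_a", "col_b", "col_c"])
df.to_csv("scrape_set_2.csv", index=False)           # set number bumped from 1 to 2
```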
And tada! We move forward to scraping the rest of the set.
The rest of the code is running :-)
But do beware of repeated errors if the website’s server is unstable, like my target website’s. Repeat the process above whenever your program’s connection to the website gets cut off.
I wasn’t able to figure out how to automate this portion, but if any of you awesome readers know the answer, please comment below.
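For what it’s worth, one direction a reader might try (untested, and the helper functions here are hypothetical stand-ins for the scraping steps above) is wrapping the per-page work in a try/except and tracking the last completed page:

```python
from selenium.common.exceptions import WebDriverException

last_done = 0                         # last page fully scraped and saved

while last_done < 50:                 # 50 = total pages (placeholder)
    try:
        for page in range(last_done + 1, 51):
            scrape_current_page(driver, compiled_list)   # hypothetical helper
            click_next(driver)                           # hypothetical helper
            last_done = page
    except WebDriverException:
        time.sleep(30)                # wait for the flaky server to recover
        driver.refresh()              # then the while-loop resumes at last_done + 1
```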
Anyway, that’s it folks~ the Jupyter scraping time machine.



Written by databug | Data Scientist who loves anything Data. From South Korea
Published by HackerNoon on 2020/03/02