Hi Devs! Welcome to the second post of the Upwork Series! First of all, I really appreciate your support and kindness on my posts and YouTube channel.

This time our task is to navigate to the target page, crawl the data, and export it to an Excel sheet. Our automation must include a human-like delay to avoid detection and any server errors.

Step 1: Understanding the task

The client provided a PDF attachment under the job posting which describes the steps clearly. Here is the description of the task:

I need help setting up a web scraper that gets data from a website. What I need it to do is:

- Enter the website and navigate to "Sök kungörelse"
- Navigate to "Avancerad sökning" (Advanced search)
- Apply the following filters:
  - Tidsperiod > Annan period > Input date-interval for the past day
  - Ämnesområde > Bolagsverkets registreringar
  - Kungörelserubrik > Aktiebolagsregistret
  - Underrubrik > Nyregistreringar
- Submit by pressing "Sök"
- Enter all the listings that show up
- Get text for: "Postadress", "Bildat", "Företagsnamn" and "E-post"
- Export data to Google Sheets or an Excel Sheet daily.

I also need this crawler to have some human delay so it doesn't crash the website or give burden to the server.

Our final goal is to crawl the data for "Postadress", "Bildat", "Företagsnamn" and "E-post".

Step 2: Creating our environment and installing dependencies

Now that we know what the client wants from us, let's create our virtual environment and then inspect the elements that we are going to crawl.

To create a virtualenv, run the following commands in your terminal:

virtualenv env
. env/bin/activate

and install these libraries:

pip install selenium xlwt

As you know, Selenium is a web automation tool, and we are going to use it to navigate the target pages and get data from them. xlwt is a library for generating spreadsheet files compatible with Microsoft Excel; the package itself is pure Python with no dependencies on modules or packages outside the standard Python distribution. I know some of you will tell me to use pandas, but let's just keep xlwt for this project.

Note that the find_element_by_* helpers used in this post belong to the Selenium 3 API. They were removed in Selenium 4.3, where find_element(By.ID, ...) and friends replace them.

Step 3: Navigating and Crawling Data

It is important to configure the web driver correctly to be able to run the automation. If you want to use Chrome as the web driver, you should install chromedriver. However, if you choose Firefox, you should install geckodriver.

Let's start by creating a class to easily handle URLs and call functions, so you don't have to create a web driver in each function every time.

from selenium import webdriver

class Bolagsverket:
    def __init__(self):
        # set your driver path here
        self.bot = webdriver.Firefox(executable_path='/path/to/geckodriver')

The project URL is the one passed to bot.get() in the code that follows. Now we create a new function named navigate_and_crawl:

import time

    def navigate_and_crawl(self):
        bot = self.bot
        bot.get('https://poit.bolagsverket.se/poit/PublikPoitIn.do')
        time.sleep(5)

As you can see, I put the sleep() function right after navigating to the URL to act as a human delay.

Enter the website and navigate to "Sök kungörelse"

Inspecting this element shows that its id is nav1-2, so we can click it by id:

    def navigate_and_crawl(self):
        bot = self.bot
        bot.get('https://poit.bolagsverket.se/poit/PublikPoitIn.do')
        time.sleep(5)
        bot.find_element_by_id('nav1-2').click()
        time.sleep(5)

Navigate to "Avancerad sökning" (Advanced search)

Now we need to click the "Advanced search" link. There is only one anchor tag in the form, so we don't need to find the element by a specific id or class; tag names are enough to click the link.

    def navigate_and_crawl(self):
        bot = self.bot
        bot.get('https://poit.bolagsverket.se/poit/PublikPoitIn.do')
        time.sleep(5)
        bot.find_element_by_id('nav1-2').click()
        time.sleep(5)
        bot.find_element_by_tag_name('form').find_element_by_tag_name('a').click()
        time.sleep(5)

I can already hear some smart guys telling me to use XPath. That is totally fine; I am just trying to keep things simple to show the details.

Applying filters and Searching

- Tidsperiod > Annan period > Input date-interval for the past day
- Ämnesområde > Bolagsverkets registreringar
- Kungörelserubrik > Aktiebolagsregistret
- Underrubrik > Nyregistreringar

We should set the date interval to the past day, but currently there is no data for the past day, so I will use fixed dates instead (2019-09-23 to 2019-09-24). I will also show you how to set the interval for the past day; maybe there will be data when you check.

Before the code, let's take a look at the elements.

- Tidsperiod > Annan period > Input date-interval for the past day

Alright Devs! Now I am going to show you one of the best solutions for clicking a drop-down option.

search_form = bot.find_element_by_tag_name('form')
search_form.find_element_by_xpath("//select[@id='tidsperiod']/option[text()='Annan period']").click()

After selecting "Annan period", the date fields appear, which means we have to use explicit waits to make WebDriver wait until these date fields show up.

import datetime
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

    def navigate_and_crawl(self):
        bot = self.bot
        bot.get('https://poit.bolagsverket.se/poit/PublikPoitIn.do')
        time.sleep(5)
        bot.find_element_by_id('nav1-2').click()
        time.sleep(5)
        bot.find_element_by_tag_name('form').find_element_by_tag_name('a').click()
        time.sleep(5)

        search_form = bot.find_element_by_tag_name('form')
        search_form.find_element_by_xpath("//select[@id='tidsperiod']/option[text()='Annan period']").click()
        wait = WebDriverWait(bot, 10)
        input_from = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='from']")))
        # input_from.send_keys(str(datetime.date.today() - datetime.timedelta(1)))
        input_from.send_keys('2019-09-23')
        input_to = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='tom']")))
        # input_to.send_keys(str(datetime.date.today()))
        input_to.send_keys('2019-09-24')
        time.sleep(3)

Actually, the form refreshes every time you select something from the drop-downs. That means we have to use waits when clicking elements.

- Ämnesområde > Bolagsverkets registreringar
- Kungörelserubrik > Aktiebolagsregistret
- Underrubrik > Nyregistreringar

We apply the same method that we used to select "Annan period", but this time we add waits as well. Once all values are selected, click the search button under the form.

        amnesomrade = wait.until(EC.element_to_be_clickable((By.XPATH, "//select[@id='amnesomrade']")))
        amnesomrade.find_element_by_xpath("//select[@id='amnesomrade']/option[text()='Bolagsverkets registreringar']").click()
        time.sleep(2)
        kungorelserubrik = wait.until(EC.element_to_be_clickable((By.XPATH, "//select[@id='kungorelserubrik']")))
        kungorelserubrik.find_element_by_xpath("//select[@id='kungorelserubrik']/option[text()='Aktiebolagsregistret']").click()
        time.sleep(2)
        underrubrik = wait.until(EC.element_to_be_clickable((By.XPATH, "//select[@id='underrubrik']")))
        underrubrik.find_element_by_xpath("//select[@id='underrubrik']/option[text()='Nyregistreringar']").click()
        time.sleep(2)

        # Search Button
        button_sok = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='SokKungorelse']")))
        button_sok.click()
        time.sleep(5)

Iterate through list and crawl data

Once you have searched, you will see the list of results.

The automation must click each result in the list in turn, crawl the data on the opened page, go back to the results, and then click the next result until all pages are finished. As the client mentioned, we have to save the data in Excel sheets for each page.

Let's start by finding the number of all pages and the number of results for each page. The page indicator shows the number of the last page after the word "av", which means we have 18 pages in total.

# find number of pages and extract the string after "av"
number_of_pages = bot.find_element_by_xpath('//div[@class="gotopagediv"]/em[@class="gotopagebuttons"]').text.split("av", 1)[1]
# remove any empty spaces (the result must be assigned back, since strings are immutable)
number_of_pages = number_of_pages.strip().replace(" ", "")
# all results or links for each page
number_of_results = bot.find_elements_by_xpath('//table/tbody/tr')

Now we are going to iterate through the pages and results to click each link in the list. Additionally, we must create a new Excel sheet for each page. Remember, we have to crawl "Postadress", "Bildat", "Företagsnamn" and "E-post".

wb = Workbook()
for page in range(int(number_of_pages)):
    # Create new sheet for each page
    sheet = wb.add_sheet('Sheet ' + str(page))
    style = xlwt.easyxf('font: bold 1')
    sheet.write(0, 0, 'Post Address', style)
    sheet.write(0, 1, 'Bildat', style)
    sheet.write(0, 2, 'Företagsnamn', style)
    sheet.write(0, 3, 'Email', style)
    # Click each link in results
    for i in range(len(number_of_results)):
        result = bot.find_elements_by_xpath('//table/tbody/tr')[i]
        link = result.find_element_by_tag_name('a')
        bot.execute_script("arguments[0].click();", link)
        time.sleep(2)

As you can see, we convert number_of_pages to an integer because we extracted it as a string. The reason I am using JavaScript here is to make sure that the links get clicked, because Selenium's click() function sometimes fails in iteration.

Now it is time to crawl the data inside these links. When we inspect the elements, there is no special class or id for extracting these particular fields. In cases like this, I use RegEx to extract the data from the strings. I am showing you the full code block for this part so it will make sense.

wb = Workbook()
for page in range(int(number_of_pages)):
    sheet = wb.add_sheet('Sheet ' + str(page), cell_overwrite_ok=True)
    style = xlwt.easyxf('font: bold 1')
    sheet.write(0, 0, 'Post Address', style)
    sheet.write(0, 1, 'Bildat', style)
    sheet.write(0, 2, 'Företagsnamn', style)
    sheet.write(0, 3, 'Email', style)
    for i in range(len(number_of_results)):
        result = bot.find_elements_by_xpath('//table/tbody/tr')[i]
        link = result.find_element_by_tag_name('a')
        bot.execute_script("arguments[0].click();", link)
        time.sleep(2)
        information = [bot.find_element_by_class_name('kungtext').text]
        try:
            postaddress = re.search('Postadress:(.*),', information[0])
            sheet.write(i + 1, 0, str(postaddress.group(1)))
            bildat = re.search('Bildat:(.*)\n', information[0])
            sheet.write(i + 1, 1, str(bildat.group(1)))
            foretagsnamn = re.search('Företagsnamn:(.*)\n', information[0])
            sheet.write(i + 1, 2, str(foretagsnamn.group(1)))
            email = re.search('E-post:(.*)\n', information[0])
            sheet.write(i + 1, 3, str(email.group(1)))
            print(postaddress.group(1), bildat.group(1), foretagsnamn.group(1), email.group(1))
        except AttributeError as e:
            # re.search returned None, so .group() raised AttributeError
            print('error => Email is null')
            sheet.write(i + 1, 3, 'null')
        bot.back()
        time.sleep(5)
        wb.save('emails.xls')
    print('Going to next page')
    button_next = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='movenextTop']")))
    button_next.click()
    time.sleep(5)

The regex extracts the value between "Field Name" and "\n" in the paragraph. In some results the email field is missing, so I added try/except to detect it and automatically set the field to "null". Using i + 1 as the row index prevents overwriting the column names in the Excel cells. I highly recommend checking my YouTube channel for a more detailed explanation.

When the data has been successfully crawled for a single page, the program saves the data into the sheet and moves to the next page.

Full Code

import time
import datetime
import re
import xlwt
from xlwt import Workbook
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

class Bolagsverket:
    def __init__(self):
        self.bot = webdriver.Firefox(executable_path='/home/coderasha/Desktop/geckodriver')

    def navigate_and_crawl(self):
        bot = self.bot
        bot.get('https://poit.bolagsverket.se/poit/PublikPoitIn.do')
        time.sleep(5)
        bot.find_element_by_id('nav1-2').click()
        time.sleep(5)
        bot.find_element_by_tag_name('form').find_element_by_tag_name('a').click()
        time.sleep(5)

        search_form = bot.find_element_by_tag_name('form')
        search_form.find_element_by_xpath("//select[@id='tidsperiod']/option[text()='Annan period']").click()
        wait = WebDriverWait(bot, 10)
        input_from = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='from']")))
        input_from.send_keys('2019-09-23')
        # input_from.send_keys(str(datetime.date.today()-datetime.timedelta(1)))
        input_to = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='tom']")))
        input_to.send_keys('2019-09-24')
        # input_to.send_keys(str(datetime.date.today()))
        time.sleep(5)

        amnesomrade = wait.until(EC.element_to_be_clickable((By.XPATH, "//select[@id='amnesomrade']")))
        amnesomrade.find_element_by_xpath("//select[@id='amnesomrade']/option[text()='Bolagsverkets registreringar']").click()
        time.sleep(5)
        kungorelserubrik = wait.until(EC.element_to_be_clickable((By.XPATH, "//select[@id='kungorelserubrik']")))
        kungorelserubrik.find_element_by_xpath("//select[@id='kungorelserubrik']/option[text()='Aktiebolagsregistret']").click()
        time.sleep(5)
        underrubrik = wait.until(EC.element_to_be_clickable((By.XPATH, "//select[@id='underrubrik']")))
        underrubrik.find_element_by_xpath("//select[@id='underrubrik']/option[text()='Nyregistreringar']").click()

        # Search Button
        button_sok = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='SokKungorelse']")))
        button_sok.click()
        time.sleep(5)

        # number of pages is the text after "av"; assign the cleaned string back
        number_of_pages = bot.find_element_by_xpath("//div[@class='gotopagediv']/em[@class='gotopagebuttons']").text.split("av", 1)[1]
        number_of_pages = number_of_pages.strip().replace(" ", "")

        number_of_results = bot.find_elements_by_xpath('//table/tbody/tr')

        wb = Workbook()
        for page in range(int(number_of_pages)):
            sheet = wb.add_sheet('Sheet' + str(page))
            style = xlwt.easyxf('font: bold 1')
            sheet.write(0, 0, 'Post Address', style)
            sheet.write(0, 1, 'Bildat', style)
            sheet.write(0, 2, 'Foretagsnamn', style)
            sheet.write(0, 3, 'Email', style)

            for i in range(len(number_of_results)):
                result = bot.find_elements_by_xpath("//table/tbody/tr")[i]
                link = result.find_element_by_tag_name('a')
                bot.execute_script('arguments[0].click();', link)
                time.sleep(5)

                information = [bot.find_element_by_class_name('kungtext').text]
                try:
                    postaddress = re.search('Postadress:(.*),', information[0])
                    sheet.write(i + 1, 0, str(postaddress.group(1)))
                    bildat = re.search('Bildat:(.*)\n', information[0])
                    sheet.write(i + 1, 1, str(bildat.group(1)))
                    foretagsnamn = re.search('Företagsnamn:(.*)\n', information[0])
                    sheet.write(i + 1, 2, str(foretagsnamn.group(1)))
                    email = re.search('E-post:(.*)\n', information[0])
                    sheet.write(i + 1, 3, str(email.group(1)))
                    print(postaddress.group(1), bildat.group(1), foretagsnamn.group(1), email.group(1))
                except AttributeError as e:
                    # re.search returned None, so .group() raised AttributeError
                    print('Email is null')
                    sheet.write(i + 1, 3, 'null')
                bot.back()
                time.sleep(5)
                wb.save('emails.xls')
            print('Going to next page ...')
            # the original XPath "//input/[@id='movenextTop']" had a stray slash and is not valid XPath
            button_next = wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@id='movenextTop']")))
            button_next.click()
            time.sleep(5)


bot = Bolagsverket()
bot.navigate_and_crawl()

Mission Accomplished!

You can watch the video tutorial of this project on my YouTube channel. I hope you enjoyed this post and learned something from it. The job is still open, so you can send a proposal to the client on Upwork. Please check Reverse Python for more cool content like this.

Stay Connected!
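
One more note on the RegEx step. In the crawl loop, a single try block covers all four fields, so a missing "Postadress" would also be reported as a missing email. Here is a minimal sketch of looking up each field on its own, using the same patterns as in the post. The helper names extract_fields and parse_page_count, the sample announcement text, and the pager string "Sida 1 av 18" are my assumptions for illustration, not something taken from the real site.

```python
import re

# Patterns from the post, one per field we need to crawl.
FIELDS = {
    "postadress": r"Postadress:(.*),",
    "bildat": r"Bildat:(.*)\n",
    "foretagsnamn": r"Företagsnamn:(.*)\n",
    "email": r"E-post:(.*)\n",
}


def extract_fields(text):
    """Return one value per field; a field that is not found becomes 'null'."""
    row = {}
    for name, pattern in FIELDS.items():
        match = re.search(pattern, text)
        row[name] = match.group(1).strip() if match else "null"
    return row


def parse_page_count(pager_text):
    """Return the number that follows "av" in the pager text as an int."""
    return int(pager_text.split("av", 1)[1].strip())


# Hypothetical sample text, made up for illustration only (no E-post line).
sample = (
    "Företagsnamn: Exempel AB\n"
    "Bildat: 2019-09-20\n"
    "Postadress: Exempelgatan 1, 111 11 Stockholm\n"
)
row = extract_fields(sample)
print(row)
print(parse_page_count("Sida 1 av 18"))  # assumed pager format
```

With a dict like this, writing a row to the sheet becomes a loop over the four values, and only the field that is really missing ends up as "null".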

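The client asked for the past day's interval and for a human-like delay, while the code in this post uses fixed dates and fixed time.sleep() calls. A small sketch of how both could be computed is shown here. The function names past_day_interval and human_delay and the default bounds of 2 to 5 seconds are my own choices for illustration; treat them as assumptions and tune them for your case.

```python
import datetime
import random
import time


def past_day_interval(today=None):
    """Return (from, to) as 'YYYY-MM-DD' strings: yesterday and today."""
    today = today or datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    return str(yesterday), str(today)


def human_delay(low=2.0, high=5.0):
    """Sleep for a random time between low and high seconds and return it."""
    seconds = random.uniform(low, high)
    time.sleep(seconds)
    return seconds


# A fixed date is passed here so the result is reproducible.
date_from, date_to = past_day_interval(datetime.date(2019, 9, 24))
print(date_from, date_to)

# Tiny bounds so the demo finishes quickly.
waited = human_delay(0.01, 0.02)
```

In the crawler, the two strings would go into input_from.send_keys() and input_to.send_keys(), and human_delay() would stand in for the fixed time.sleep(5) calls.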
Scraping Data With Selenium: Upwork Series #2
