Web Scraping And Data Extraction with Python: Upwork Series #1

Welcome to the first post of Upwork Series. In this series, we are going to work on gig requirements posted on UpWork. Web Scraping & Data Extraction Our task is to create web crawler which will scrape data daily from the report pages of a transportation company. Click here to see description of project Step 1: Understanding the task First, it is important to understand the task clearly. They want from us to scrape data then save it in CSV file where each attribute listed above is its own separate column. We are going to crawl the following information: Date ("The information below reflects the content of the FMCSA management information systems as of {Date}")Operating StatusLegal NameDBA NamePhysical AddressMailing AddressUSDOT NumberPower UnitsDrivers We can provide an initial set of 100 DOT numbers to make sure the information above can be scraped easily, from there we can provide all DOT numbers we would like to scrape on a daily basis. Report pages can be accessed by dots or with other name IDs. Each page has its own id (dot). So, these dots locate in Excel file. We have to read this file and extract dots from there then pass it into URL to access the report page. Step 2: Creating our environment and installing dependencies Now, we know what client wants from us, so let's create our virtual environment then inspect elements that we are going to crawl. To create virtualenv run the following command in your terminal: virtualenv env then install BeautifulSoup which is a Python package for parsing HTML and XML documents and xlrd which is a library for reading data and formatting information from Excel files: pip install beautifulsoup4 xlrd Step 3: Crawling Data Alright Devs! Let's start with opening the example url from project description so can see the fields. Click to see Example URL The page should look like this: Our first target is to find - Date ("The information below reflects the content of the FMCSA management information systems as of {Date}") The information below reflects the content of the FMCSA management information systems as of 01/01/2020. The truth is we can't crawl this element by specific class name or id. Unfortunately, this report pages messed up. But we can see that this element is in bold format. There are also many elements locate with the bold format. However, we can crawl all of them as text and use RegEx to extract the data we need. A regular expression (RegEx) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The date locates between The information below reflects the content of the FMCSA management information systems as of and .(dot). So , to find the date regex will look between these strings. import re # regex import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup def crawl_data(url): req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser') # Find all bold texts bold_texts = bs.find_all('b') for b in bold_texts: try: # look between these strings date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator=' ')).group(1).strip() # If regex finds multiple dots, extract string before first dot if len(date) > 11: date = date.split(".",1)[0] print(date) except AttributeError: pass Well if you run the program you will see it's printing the date. Let me quickly show you how regex works, because I feel some of you trying to understand. Consider the following code: import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search('Hello my name is (.*).', data) print(name) # Output: As you see match is found but its printed like object. group(1) capture the text matched by the regex inside them into a numbered group that can be reused with a numbered backreference import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search('Hello my name is (.*).', data).group(1) print(name) # Output: coderasha So, I am applying the same logic to find the date inside strings that crawled. The next step is, to find a table and continue to crawl other fields. Luckily, the table locates between center tags. However, we have to find data again using RegEx because table elements have no any special attribute. # Get all texts inside table information = bs.find('center').get_text(strip=True, separator=' ') # Find fields using RegEx operating = re.search('Operating Status:(.*)Out', information).group(1).strip() legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip() physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip() mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip() usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip() power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip() drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip() Step 4: Write data in CSV Once data crawled, it is time to create new csv file and write data into it. I prefer to create another function which will handle this action. import csv def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers): with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file: fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers'] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers }) CSV name must be unique, so I named it with usdot_address or with other name ID of report page from crawled data Step 5: Read excel file to crawl data for each dot The final step is to read excel file and pass these dots end of the URL to access the pages. We can use xlrd to read excel file import xlrd dots = [] def read_excel_file(): loc = ("dots.xls") wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index(0) sheet.cell_value(0, 0) # First five dot in excel for i in range(1,5): # Convert floats to string and clean from .0 dot = str(sheet.cell_value(i, 0)).replace('.0', '') dots.append(dot) xlrd reads numbers as float so the best solution is to convert them to strings and use replace() method to remove .0 end of the string and pass these dots into url: for dot in dots: crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) # Sleep 5 seconds to avoid any errors time.sleep(5) Here is the Full Code: import re import csv import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup import xlrd import time dots = [] def read_excel_file(): loc = ("dots.xls") wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index(0) sheet.cell_value(0, 0) for i in range(1,5): dot = str(sheet.cell_value(i, 0)).replace('.0', '') dots.append(dot) def crawl_data(url): req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser') bold_texts = bs.find_all('b') for b in bold_texts: try: date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator=' ')).group(1).strip() if len(date) > 11: date = date.split(".",1)[0] print(date) except AttributeError: pass information = bs.find('center').get_text(strip=True, separator=' ') operating = re.search('Operating Status:(.*)Out', information).group(1).strip() legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip() physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip() mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip() usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip() power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip() drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip() write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers) def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers): with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file: fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers'] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers }) read_excel_file() print(dots) for dot in dots: crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) time.sleep(5) Mission Accomplished! I hope you liked the first post of new Upwork series. The job offer still open so if want you can send proposal to client using this code. I post similar stories at Reverse Python. Check it out. Video Tutorial available in YouTube Channel - Reverse Python. Stay Connected! Welcome to the first post of Upwork Series. In this series, we are going to work on gig requirements posted on UpWork. Web Scraping & Data Extraction Our task is to create web crawler which will scrape data daily from the report pages of a transportation company. Click here to see description of project Click here to see description of project Step 1: Understanding the task Step 1: Understanding the task First, it is important to understand the task clearly. They want from us to scrape data then save it in CSV file where each attribute listed above is its own separate column. scrape data then save it in CSV file We are going to crawl the following information: Date ("The information below reflects the content of the FMCSA management information systems as of {Date}") Operating Status Legal Name DBA Name Physical Address Mailing Address USDOT Number Power Units Drivers Date ("The information below reflects the content of the FMCSA management information systems as of {Date}") Operating Status Legal Name DBA Name Physical Address Mailing Address USDOT Number Power Units Drivers We can provide an initial set of 100 DOT numbers to make sure the information above can be scraped easily, from there we can provide all DOT numbers we would like to scrape on a daily basis. Report pages can be accessed by dots or with other name IDs. Each page has its own id (dot). So, these dots locate in Excel file. We have to read this file and extract dots from there then pass it into URL to access the report page. Step 2: Creating our environment and installing dependencies Step 2: Creating our environment and installing dependencies Now, we know what client wants from us, so let's create our virtual environment then inspect elements that we are going to crawl. To create virtualenv run the following command in your terminal: virtualenv env virtualenv env virtualenv env virtualenv env then install BeautifulSoup which is a Python package for parsing HTML and XML documents and xlrd which is a library for reading data and formatting information from Excel files: pip install beautifulsoup4 xlrd pip install beautifulsoup4 xlrd pip install beautifulsoup4 xlrd pip install beautifulsoup4 xlrd Step 3: Crawling Data Step 3: Crawling Data Alright Devs! Let's start with opening the example url from project description so can see the fields. Click to see Example URL Click to see Example URL The page should look like this: Our first target is to find - Date ("The information below reflects the content of the FMCSA management information systems as of {Date}") The information below reflects the content of the FMCSA management information systems as of 01/01/2020. The truth is we can't crawl this element by specific class name or id. Unfortunately, this report pages messed up. But we can see that this element is in bold format. There are also many elements locate with the bold format. However, we can crawl all of them as text and use RegEx to extract the data we need. RegEx A regular expression (RegEx) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The date locates between The information below reflects the content of the FMCSA management information systems as of and .(dot) . So , to find the date regex will look between these strings. The information below reflects the content of the FMCSA management information systems as of .(dot) import re # regex import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup def crawl_data(url): req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser') # Find all bold texts bold_texts = bs.find_all('b') for b in bold_texts: try: # look between these strings date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator=' ')).group(1).strip() # If regex finds multiple dots, extract string before first dot if len(date) > 11: date = date.split(".",1)[0] print(date) except AttributeError: pass import re # regex import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup def crawl_data(url): req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser') # Find all bold texts bold_texts = bs.find_all('b') for b in bold_texts: try: # look between these strings date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator=' ')).group(1).strip() # If regex finds multiple dots, extract string before first dot if len(date) > 11: date = date.split(".",1)[0] print(date) except AttributeError: pass import re # regex import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup def crawl_data ( url ): req = Request(url, headers={ 'User-Agent' : 'Mozilla/5.0' }) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser' ) # Find all bold texts bold_texts = bs.find_all( 'b' ) for b in bold_texts: try : # look between these strings date = re.search( 'The information below reflects the content of the FMCSA management information systems as of(.*).' , b.get_text(strip= True , separator= ' ' )).group( 1 ).strip() # If regex finds multiple dots, extract string before first dot if len (date) > 11 : date = date.split( "." , 1 )[ 0 ] print (date) except AttributeError: pass import re # regex import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup def crawl_data ( url ): req = Request(url, headers={ 'User-Agent' : 'Mozilla/5.0' }) bs = BeautifulSoup(html, 'html.parser' ) # Find all bold texts bold_texts = bs.find_all( 'b' ) for b in bold_texts: try : # look between these strings date = re.search( 'The information below reflects the content of the FMCSA management information systems as of(.*).' , b.get_text(strip= True , separator= ' ' )).group( 1 ).strip() # If regex finds multiple dots, extract string before first dot if len (date) > 11 : date = date.split( "." , 1 )[ 0 ] print (date) except AttributeError: pass Well if you run the program you will see it's printing the date. Let me quickly show you how regex works, because I feel some of you trying to understand. Consider the following code: import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search('Hello my name is (.*).', data) print(name) # Output: import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search('Hello my name is (.*).', data) print(name) # Output: import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search( 'Hello my name is (.*).' , data) print(name) # Output: import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search( 'Hello my name is (.*).' , data) # Output: As you see match is found but its printed like object. group(1) capture the text matched by the regex inside them into a numbered group that can be reused with a numbered backreference import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search('Hello my name is (.*).', data).group(1) print(name) # Output: coderasha import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search('Hello my name is (.*).', data).group(1) print(name) # Output: coderasha import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search( 'Hello my name is (.*).' , data).group( 1 ) print(name) # Output: coderasha import re # We need to extarct "coderasha" from the string data = "Hello my name is coderasha." name = re.search( 'Hello my name is (.*).' , data).group( 1 ) So, I am applying the same logic to find the date inside strings that crawled. The next step is, to find a table and continue to crawl other fields. Luckily, the table locates between center tags. However, we have to find data again using RegEx because table elements have no any special attribute. # Get all texts inside table information = bs.find('center').get_text(strip=True, separator=' ') # Find fields using RegEx operating = re.search('Operating Status:(.*)Out', information).group(1).strip() legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip() physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip() mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip() usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip() power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip() drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip() # Get all texts inside table information = bs.find('center').get_text(strip=True, separator=' ') # Find fields using RegEx operating = re.search('Operating Status:(.*)Out', information).group(1).strip() legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip() physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip() mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip() usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip() power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip() drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip() # Get all texts inside table information = bs.find( 'center' ).get_text(strip= True , separator= ' ' ) # Find fields using RegEx operating = re.search( 'Operating Status:(.*)Out' , information).group( 1 ).strip() legal_name = re.search( 'Legal Name:(.*)DBA' , information).group( 1 ).strip() physical_address = re.search( 'Physical Address:(.*)Phone' , information).group( 1 ).strip() mailing_address = re.search( 'Mailing Address:(.*)USDOT' , information).group( 1 ).strip() usdot_address = re.search( 'USDOT Number:(.*)State Carrier ID Number' , information).group( 1 ).strip() power_units = re.search( 'Power Units:(.*)Drivers' , information).group( 1 ).strip() drivers = re.search( 'Drivers:(.*)MCS-150 Form Date' , information).group( 1 ).strip() # Get all texts inside table information = bs.find( 'center' ).get_text(strip= True , separator= ' ' ) # Find fields using RegEx operating = re.search( 'Operating Status:(.*)Out' , information).group( 1 ).strip() legal_name = re.search( 'Legal Name:(.*)DBA' , information).group( 1 ).strip() physical_address = re.search( 'Physical Address:(.*)Phone' , information).group( 1 ).strip() mailing_address = re.search( 'Mailing Address:(.*)USDOT' , information).group( 1 ).strip() usdot_address = re.search( 'USDOT Number:(.*)State Carrier ID Number' , information).group( 1 ).strip() power_units = re.search( 'Power Units:(.*)Drivers' , information).group( 1 ).strip() drivers = re.search( 'Drivers:(.*)MCS-150 Form Date' , information).group( 1 ).strip() Step 4: Write data in CSV Step 4: Write data in CSV Once data crawled, it is time to create new csv file and write data into it. I prefer to create another function which will handle this action. import csv def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers): with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file: fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers'] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers }) import csv def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers): with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file: fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers'] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers }) import csv def write_csv ( date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers ): with open (usdot_address + '.csv' , mode= 'w' , newline= '' , encoding= "utf-8" ) as csv_file: fieldnames = [ 'Date' , 'Operating Status' , 'Legal_Name' , 'Physical Address' , 'Mailing Address' , 'Power Units' , 'Drivers' ] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date' :date, 'Operating Status' : operating, 'Legal_Name' : legal_name, 'Physical Address' :physical_address, 'Mailing Address' : mailing_address, 'Power Units' :power_units, 'Drivers' :drivers }) import csv def write_csv ( date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers ): with open (usdot_address + '.csv' , mode= 'w' , newline= '' , encoding= "utf-8" ) as csv_file: fieldnames = [ 'Date' , 'Operating Status' , 'Legal_Name' , 'Physical Address' , 'Mailing Address' , 'Power Units' , 'Drivers' ] writer.writerow({ 'Date' :date, 'Operating Status' : operating, 'Legal_Name' : legal_name, 'Physical Address' :physical_address, 'Mailing Address' : mailing_address, 'Power Units' :power_units, 'Drivers' :drivers }) CSV name must be unique, so I named it with usdot_address or with other name ID of report page from crawled data usdot_address usdot_address Step 5: Read excel file to crawl data for each dot Step 5: Read excel file to crawl data for each dot The final step is to read excel file and pass these dots end of the URL to access the pages. We can use xlrd to read excel file xlrd xlrd import xlrd dots = [] def read_excel_file(): loc = ("dots.xls") wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index(0) sheet.cell_value(0, 0) # First five dot in excel for i in range(1,5): # Convert floats to string and clean from .0 dot = str(sheet.cell_value(i, 0)).replace('.0', '') dots.append(dot) import xlrd dots = [] def read_excel_file(): loc = ("dots.xls") wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index(0) sheet.cell_value(0, 0) # First five dot in excel for i in range(1,5): # Convert floats to string and clean from .0 dot = str(sheet.cell_value(i, 0)).replace('.0', '') dots.append(dot) import xlrd dots = [] def read_excel_file (): loc = ( "dots.xls" ) wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index( 0 ) sheet.cell_value( 0 , 0 ) # First five dot in excel for i in range ( 1 , 5 ): # Convert floats to string and clean from .0 dot = str (sheet.cell_value(i, 0 )).replace( '.0' , '' ) dots.append(dot) import xlrd def read_excel_file (): loc = ( "dots.xls" ) sheet = wb.sheet_by_index( 0 ) sheet.cell_value( 0 , 0 ) # First five dot in excel for i in range ( 1 , 5 ): # Convert floats to string and clean from .0 dot = str (sheet.cell_value(i, 0 )).replace( '.0' , '' ) xlrd reads numbers as float so the best solution is to convert them to strings and use replace() method to remove .0 end of the string and pass these dots into url: replace() replace() replace() for dot in dots: crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) # Sleep 5 seconds to avoid any errors time.sleep(5) for dot in dots: crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) # Sleep 5 seconds to avoid any errors time.sleep(5) for dot in dots: crawl_data( 'https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) # Sleep 5 seconds to avoid any errors time.sleep( 5 ) for dot in dots: crawl_data( 'https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) # Sleep 5 seconds to avoid any errors time.sleep( 5 ) Here is the Full Code: import re import csv import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup import xlrd import time dots = [] def read_excel_file(): loc = ("dots.xls") wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index(0) sheet.cell_value(0, 0) for i in range(1,5): dot = str(sheet.cell_value(i, 0)).replace('.0', '') dots.append(dot) def crawl_data(url): req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser') bold_texts = bs.find_all('b') for b in bold_texts: try: date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator=' ')).group(1).strip() if len(date) > 11: date = date.split(".",1)[0] print(date) except AttributeError: pass information = bs.find('center').get_text(strip=True, separator=' ') operating = re.search('Operating Status:(.*)Out', information).group(1).strip() legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip() physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip() mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip() usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip() power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip() drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip() write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers) def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers): with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file: fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers'] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers }) read_excel_file() print(dots) for dot in dots: crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) time.sleep(5) import re import csv import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup import xlrd import time dots = [] def read_excel_file(): loc = ("dots.xls") wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index(0) sheet.cell_value(0, 0) for i in range(1,5): dot = str(sheet.cell_value(i, 0)).replace('.0', '') dots.append(dot) def crawl_data(url): req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser') bold_texts = bs.find_all('b') for b in bold_texts: try: date = re.search('The information below reflects the content of the FMCSA management information systems as of(.*).', b.get_text(strip=True, separator=' ')).group(1).strip() if len(date) > 11: date = date.split(".",1)[0] print(date) except AttributeError: pass information = bs.find('center').get_text(strip=True, separator=' ') operating = re.search('Operating Status:(.*)Out', information).group(1).strip() legal_name = re.search('Legal Name:(.*)DBA', information).group(1).strip() physical_address = re.search('Physical Address:(.*)Phone', information).group(1).strip() mailing_address = re.search('Mailing Address:(.*)USDOT', information).group(1).strip() usdot_address = re.search('USDOT Number:(.*)State Carrier ID Number', information).group(1).strip() power_units = re.search('Power Units:(.*)Drivers', information).group(1).strip() drivers = re.search('Drivers:(.*)MCS-150 Form Date', information).group(1).strip() write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers) def write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers): with open(usdot_address + '.csv', mode='w', newline='', encoding="utf-8") as csv_file: fieldnames = ['Date', 'Operating Status', 'Legal_Name', 'Physical Address', 'Mailing Address', 'Power Units', 'Drivers'] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date':date, 'Operating Status': operating, 'Legal_Name': legal_name, 'Physical Address':physical_address, 'Mailing Address': mailing_address, 'Power Units':power_units, 'Drivers':drivers }) read_excel_file() print(dots) for dot in dots: crawl_data('https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) time.sleep(5) import re import csv import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup import xlrd import time dots = [] def read_excel_file (): loc = ( "dots.xls" ) wb = xlrd.open_workbook(loc) sheet = wb.sheet_by_index( 0 ) sheet.cell_value( 0 , 0 ) for i in range ( 1 , 5 ): dot = str (sheet.cell_value(i, 0 )).replace( '.0' , '' ) dots.append(dot) def crawl_data ( url ): req = Request(url, headers={ 'User-Agent' : 'Mozilla/5.0' }) html = urlopen(req).read() bs = BeautifulSoup(html, 'html.parser' ) bold_texts = bs.find_all( 'b' ) for b in bold_texts: try : date = re.search( 'The information below reflects the content of the FMCSA management information systems as of(.*).' , b.get_text(strip= True , separator= ' ' )).group( 1 ).strip() if len (date) > 11 : date = date.split( "." , 1 )[ 0 ] print (date) except AttributeError: pass information = bs.find( 'center' ).get_text(strip= True , separator= ' ' ) operating = re.search( 'Operating Status:(.*)Out' , information).group( 1 ).strip() legal_name = re.search( 'Legal Name:(.*)DBA' , information).group( 1 ).strip() physical_address = re.search( 'Physical Address:(.*)Phone' , information).group( 1 ).strip() mailing_address = re.search( 'Mailing Address:(.*)USDOT' , information).group( 1 ).strip() usdot_address = re.search( 'USDOT Number:(.*)State Carrier ID Number' , information).group( 1 ).strip() power_units = re.search( 'Power Units:(.*)Drivers' , information).group( 1 ).strip() drivers = re.search( 'Drivers:(.*)MCS-150 Form Date' , information).group( 1 ).strip() write_csv(date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers) def write_csv ( date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers ): with open (usdot_address + '.csv' , mode= 'w' , newline= '' , encoding= "utf-8" ) as csv_file: fieldnames = [ 'Date' , 'Operating Status' , 'Legal_Name' , 'Physical Address' , 'Mailing Address' , 'Power Units' , 'Drivers' ] writer = csv.DictWriter(csv_file, fieldnames=fieldnames) writer.writeheader() writer.writerow({ 'Date' :date, 'Operating Status' : operating, 'Legal_Name' : legal_name, 'Physical Address' :physical_address, 'Mailing Address' : mailing_address, 'Power Units' :power_units, 'Drivers' :drivers }) read_excel_file() print (dots) for dot in dots: crawl_data( 'https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) time.sleep( 5 ) import re import csv import urllib.request from urllib.request import urlopen, Request from bs4 import BeautifulSoup import xlrd import time def read_excel_file (): loc = ( "dots.xls" ) sheet = wb.sheet_by_index( 0 ) sheet.cell_value( 0 , 0 ) for i in range ( 1 , 5 ): dot = str (sheet.cell_value(i, 0 )).replace( '.0' , '' ) def crawl_data ( url ): req = Request(url, headers={ 'User-Agent' : 'Mozilla/5.0' }) bs = BeautifulSoup(html, 'html.parser' ) bold_texts = bs.find_all( 'b' ) for b in bold_texts: try : date = re.search( 'The information below reflects the content of the FMCSA management information systems as of(.*).' , b.get_text(strip= True , separator= ' ' )).group( 1 ).strip() if len (date) > 11 : date = date.split( "." , 1 )[ 0 ] print (date) except AttributeError: pass information = bs.find( 'center' ).get_text(strip= True , separator= ' ' ) operating = re.search( 'Operating Status:(.*)Out' , information).group( 1 ).strip() legal_name = re.search( 'Legal Name:(.*)DBA' , information).group( 1 ).strip() physical_address = re.search( 'Physical Address:(.*)Phone' , information).group( 1 ).strip() mailing_address = re.search( 'Mailing Address:(.*)USDOT' , information).group( 1 ).strip() usdot_address = re.search( 'USDOT Number:(.*)State Carrier ID Number' , information).group( 1 ).strip() power_units = re.search( 'Power Units:(.*)Drivers' , information).group( 1 ).strip() drivers = re.search( 'Drivers:(.*)MCS-150 Form Date' , information).group( 1 ).strip() def write_csv ( date, operating, legal_name, physical_address, mailing_address, usdot_address, power_units, drivers ): with open (usdot_address + '.csv' , mode= 'w' , newline= '' , encoding= "utf-8" ) as csv_file: fieldnames = [ 'Date' , 'Operating Status' , 'Legal_Name' , 'Physical Address' , 'Mailing Address' , 'Power Units' , 'Drivers' ] writer.writerow({ 'Date' :date, 'Operating Status' : operating, 'Legal_Name' : legal_name, 'Physical Address' :physical_address, 'Mailing Address' : mailing_address, 'Power Units' :power_units, 'Drivers' :drivers }) print (dots) for dot in dots: crawl_data( 'https://safer.fmcsa.dot.gov/query.asp?searchtype=ANY&query_type=queryCarrierSnapshot&query_param=USDOT&query_string=' + dot) time.sleep( 5 ) Mission Accomplished! I hope you liked the first post of new Upwork series. The job offer still open so if want you can send proposal to client using this code. I post similar stories at Reverse Python . Check it out. Reverse Python Reverse Python Video Tutorial available in YouTube Channel - Reverse Python . YouTube Channel - Reverse Python YouTube Channel - Reverse Python Stay Connected!