Web scraping and utilizing various APIs are great ways to collect data from websites and applications that can later be used in data analytics. There is a company called HiQ that is well known for web scraping. HiQ crawls various “Public” websites to collect data and provide analytics for companies on their employees. They help companies find top talent using sites data like Linkedin, and other public sources to gain the information needed in their algorithms.
However, they ran into legal issues when Linkedin asked them to cease and desist as well as put in certain technical methods to slow down HiQ’s web crawlers. HiQ subsequently sued Linkedin and won! The judge said as long as the data was public, it was ok to scrape!
Image from commit strip (Here)
Web scraping typically requires a complex understanding of HTTP requests, faking headers, complex Regex statements, HTML parsers, and database management skills.
There are programming languages that make this much easier such as Python. This is because Python offers libraries like Scrapy and BeautifulSoup that make scraping and parsing HTML easier than old school web scrapers.
However, it still requires proper design and a decent understanding of programming and website architecture.
Let’s say your team does not have programming skills. That is ok! One of our team members recently gave a webinar at Loyola University to demonstrate how to scrape web pages without programming. Instead, Google sheets offer several useful functions that can help scrape web data. If you would like to see the video of our webinar it is below. If not, you can continue to read and figure out how to use Google Sheets to scrape websites.
Google Sheet Functions For Web Scraping
The functions you can use for web scraping with google sheets are:
All of these functions will scrape websites based off of different parameters provided to the function.
Web Scraping With ImportFeed
The ImportFeed Google Sheet function is one of the easier functions to use. It only requires access to Google Sheets and a URL for a rss feed. This is a feed that is typically associated with a blog.
For instance, you could use our RSS feed “http://www.acheronanalytics.com/2/feed".
How do you use this function? An example is given below.
That is all that is needed! There are some other tips and tricks that can help clean up the data feed as you will get more than just one column of information. For now, this is a great start at web scraping.
Do The Google Sheet Import Functions Update?
All of these import function automatically update data every 2 hours. A trigger function can be set to increase the cadence of updates. However this requires more programming.
This is it in this case! From here, it is all about how your team uses it! Make sure you engineer a solid data scraping system.
The picture above is an example of of using the ImportFeed function.
Web Scraping With ImportXML
The ImportXML function in Google Sheets is used to pull out specific data points using HTML ids, and classes. This requires some understanding of HTML and parsing XML. This can be a little frustrating. So we created a step by step for web scraping for HTML.
Here are some examples from an EventBrite page.
- Go to https://www.eventbrite.com/d/wa--everett/events/
- Right Click Inspect Element
- Find the HTML tag you are interested in
- We are looking for <div class = list-card__body> Some Text Here</div>
- So this is the tricky part. The first part you need to pull out from this HTML tag is the type. This would be like <div>, <a>, <img>, <span>, etc. This first one can be called out using “//” then the tag name. Such as “//div”, “//a” or “//span”.
- Now, if you actually want to get the “Some Text Here” you will need to call out the class.
- That is done in the method shown in step 5. You will notice it combines using “//div” with the “[@class=”class name here”].
- The xml string is “//div[@class=’list-card__body’]”
- There is another data value you might want to get.
- We want to get all the URLs
- This case would involve wanting to pull out the specific value inside of the first HTML tag itself. For instance, <a href=”https//www.google.com">Click here</a>.
- Then it would be like step 7.
- The xml string is “//a/@href”
- ImportXML(URL, XML String)
The truth about using this function is that it requires a lot of time. Thus, it requires planning and designing a good google sheet to ensure you get the maximum benefit from utilizing. Otherwise, your team will end up spending time maintaining it, rather than working on new things. Like in the picture below
Web scraping With ImportHTML
Finally we will discuss ImportHTML. This will import a table or list from a web page. For instance, what if you want to scrape data from a site that contains stock prices.
We will use the http://www.nasdaq.com/symbol/snap/real-time. There is a table on this page that has the stock prices from the past few days.
Similar to the past functions you need to use the URL. On top of the URL, you will have to mention which table on the webpage you want to grab. You can do this by utilizing the which number it might be.
An example would be ImportHTML(“http://www.nasdaq.com/symbol/snap/real-time",6). This will scrape the stock prices from the link above.
In our video above, we also show how we combine scraping the stock data above and melded it with news about the Stock ticker on that day. This could be utilized in a much more complex manner. A team could create an algorithm that utilizes the stock price of the past, as well as new articles and twitter information to choose whether to buy or sell stocks.
Do you have any good ideas of what you could do with web scraping? Do you need help with your web scraping project? Let us know!
Other great read about data science: