In English, the word Scraping has more than one dictionary definition, but they all revolve around the same meaning:
to remove (an outer layer, adhering matter, etc.) in this way: to scrape the paint and varnish from a table.
the act of removing the surface from something using a sharp edge or something rough.
However, what we are really interested in is what Web Scraping means in software.
In software, Web Scraping is the process of extracting information from a web resource through its user interface rather than its official APIs. In other words, instead of calling a website's REST API to get some data, you retrieve the website's page the way a browser does, parse the HTML, and then extract the data rendered into that HTML.
Simply, because we need the data presented on a website, and the website does not provide an official API for us to retrieve that data.
It depends on the web resource itself. Some websites state somewhere whether scraping them is legal or not, and sometimes it is not written anywhere.
There is also another factor: what you are going to do with the data you scrape. Therefore, always be cautious and keep yourself safe. Do your research first before jumping into implementation.
There are different ways of doing it, but in most cases the same concept applies: you write some code to get the HTML using the website's URL, you parse the HTML, and finally you extract the data you want.
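To make those three steps concrete, here is a minimal sketch in plain C#. The URL is just a placeholder, and the naive ExtractTitle helper is only for illustration; a real project would use a proper HTML parser, as the examples later in this article do.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ScraperSketch
{
    static async Task Main()
    {
        // Step 1: get the HTML of the page, the same way a browser would request it.
        using var client = new HttpClient();
        var html = await client.GetStringAsync("https://example.com"); // placeholder URL

        // Steps 2 and 3: parse the HTML and extract the data you want.
        Console.WriteLine(ExtractTitle(html));
    }

    // Naive extraction of the <title> element, just to illustrate the idea.
    // A real project should use a proper HTML parser instead of string search.
    public static string ExtractTitle(string html)
    {
        var start = html.IndexOf("<title>", StringComparison.OrdinalIgnoreCase);
        if (start < 0) return string.Empty;
        start += "<title>".Length;
        var end = html.IndexOf("</title>", start, StringComparison.OrdinalIgnoreCase);
        return end < 0 ? string.Empty : html.Substring(start, end - start);
    }
}
```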
However, if we only stick to this definition, we would be missing a lot of details.
In some cases, things are more complicated than that. It depends on the way the website is built.
For static websites, where the data is already rendered into the HTML from the first instance, you can follow the same steps we described.
However, for dynamic websites, where the data is not rendered into the HTML from the first instance but is loaded dynamically through JavaScript libraries and frameworks (like Angular, React, Vue,…), you need to follow another approach.
Basically, what you do in this case is mimic what a web browser (like Chrome, Firefox, IE, Edge,…) does, and then get the final HTML from the virtual browser you used. Once you have the full HTML with the data rendered into it, the rest is the same.
No, there are already some libraries we can use to achieve the expected results.
For example, here is a list of some of these libraries.
These are not the only libraries available to help with your Web Scraping project. If you search the internet, you will find a lot more.
First, let's start by trying to scrape some data from a static website. In this example, we are going to scrape my own GitHub profile.
We will try to get the list of pinned repositories on my profile. Each entry will consist of the repository's name and its description.
Therefore, let’s start.
At the moment of writing this article, this is how my profile page looked:
When I checked the HTML, I found the following:
Here are the steps I followed:
1. Added using HtmlAgilityPack; (from the HtmlAgilityPack NuGet package).
2. Defined private static Task<string> GetHtml() to get the HTML.
3. Defined private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static async Task Main(string[] args)
        {
            var html = await GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }

        private static async Task<string> GetHtml()
        {
            // Dispose the HttpClient once the request completes.
            using var client = new HttpClient();
            return await client.GetStringAsync("https://github.com/AhmedTarekHasan");
        }

        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            // Select the node wrapping each pinned repository.
            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            List<(string RepositoryName, string Description)> data = new();

            foreach (var repo in repositories)
            {
                var name = repo.SelectSingleNode("div/div/span/a").InnerText;
                var description = repo.SelectSingleNode("p").InnerText;
                data.Add((name, description));
            }

            return data;
        }
    }
}
Running this code, you will get the following:
For sure, you could apply some cleaning to the strings here, but that is not a big deal.
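As a hint, that cleaning could be as simple as trimming the values and collapsing runs of whitespace. This helper is my own sketch, not part of HtmlAgilityPack:

```csharp
using System.Text.RegularExpressions;

static class TextCleaner
{
    // Collapses runs of whitespace (including newlines) into single spaces
    // and trims the result, which is usually enough for InnerText values.
    public static string Clean(string text) =>
        Regex.Replace(text ?? string.Empty, @"\s+", " ").Trim();
}
```

For example, `TextCleaner.Clean("\n  DevLeader  \n")` returns `"DevLeader"`.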
As you can see, it is easy to use HttpClient and HtmlAgilityPack. All you need is to get used to their APIs, and then it becomes an easy job.
What you also need to keep in mind is that some websites require more work on your side. Sometimes a website needs login details, authentication tokens, some specific headers,…
You can still handle all of this with HttpClient, or with other libraries you can use to perform the call.
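For instance, attaching an authentication token and some custom headers with HttpClient could look like the sketch below. The token, User-Agent string, and custom header are hypothetical placeholders; a real website would define which of them it actually expects.

```csharp
using System.Net.Http;
using System.Net.Http.Headers;

static class AuthenticatedScraper
{
    // Builds an HttpClient preconfigured with the headers a site might require.
    public static HttpClient CreateClient()
    {
        var client = new HttpClient();

        // Hypothetical bearer token; how you obtain it depends on the site.
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", "YOUR-TOKEN-HERE");

        // Some sites also check the User-Agent or custom headers.
        client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; MyScraper/1.0)");
        client.DefaultRequestHeaders.Add("X-Custom-Header", "some-value"); // placeholder

        return client;
    }
}
```

You would then call `CreateClient()` once and reuse the returned client for your `GetStringAsync` calls.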
Now, we should try to scrape some data from a dynamic website. However, since I should be cautious before scraping a website, I will apply the same example as before, but now assuming that the website is dynamic.
Therefore, again, in this example, we are going to scrape my own GitHub profile.
This would be the same as before.
Here are the steps I followed:
1. Added using HtmlAgilityPack; and using OpenQA.Selenium.Chrome; (from the HtmlAgilityPack, Selenium.WebDriver, and Selenium.WebDriver.ChromeDriver NuGet packages).
2. Defined private static string GetHtml() to get the HTML.
3. Defined private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            var html = GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }

        private static string GetHtml()
        {
            var options = new ChromeOptions
            {
                BinaryLocation = @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            };

            options.AddArguments("headless");

            // Dispose the driver (and close the headless browser) when done.
            using var chrome = new ChromeDriver(options);
            chrome.Navigate().GoToUrl("https://github.com/AhmedTarekHasan");
            return chrome.PageSource;
        }

        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            // Select the node wrapping each pinned repository.
            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            List<(string RepositoryName, string Description)> data = new();

            foreach (var repo in repositories)
            {
                var name = repo.SelectSingleNode("div/div/span/a").InnerText;
                var description = repo.SelectSingleNode("p").InnerText;
                data.Add((name, description));
            }

            return data;
        }
    }
}
Running this code, you will get the following:
For sure, you could apply some cleaning to the strings here, but that is not a big deal.
Again, as you can see, using Selenium.WebDriver and Selenium.WebDriver.ChromeDriver is easy.
As you can see, Web Scraping is not that hard, but it really depends on the website you are trying to scrape. Sometimes you might come across a website that needs some tricks to get it to work.
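One common trick with dynamic pages, for example, is to explicitly wait until the content you need has actually been rendered before reading PageSource. With Selenium, that could look like the sketch below; the CSS selector is a placeholder you would adapt after inspecting the real page, and WebDriverWait comes from the Selenium Support classes.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

static class WaitingScraper
{
    public static string GetRenderedHtml(string url)
    {
        var options = new ChromeOptions();
        options.AddArguments("headless");

        using var chrome = new ChromeDriver(options);
        chrome.Navigate().GoToUrl(url);

        // Wait up to 10 seconds until at least one pinned-item element exists.
        // The selector here is a placeholder; inspect the real page to find yours.
        var wait = new WebDriverWait(chrome, TimeSpan.FromSeconds(10));
        wait.Until(d => d.FindElements(By.CssSelector(".js-pinned-items-reorder-container li")).Count > 0);

        // Only now is the dynamically loaded content guaranteed to be in the HTML.
        return chrome.PageSource;
    }
}
```

Without such a wait, PageSource may be captured before the JavaScript on the page has finished rendering the data you are after.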
That’s it, hope you found reading this article as interesting as I found writing it.