
How to Use .NET C# for Web Scraping

by Ahmed Tarek Hasan, February 7th, 2023

Too Long; Didn't Read

Web Scraping is the process of extracting information from a web resource through its user interface rather than its legitimate APIs. It is not like calling a REST API of a website to get some data; it is like retrieving the website page as a browser does, parsing the HTML, and then extracting the data rendered into it.

A guide on how to do Web Scraping in .NET C#, with code samples.


What is Web Scraping

In English, the word scraping has different definitions, but they all revolve around the same meaning.

In Dictionary.com


to remove (an outer layer, adhering matter, etc.) in this way: to scrape the paint and varnish from a table.


In Dictionary.Cambridge.org


the act of removing the surface from something using a sharp edge or something rough.


However, what we are really interested in is what Web Scraping means in software.

In software, Web Scraping is the process of extracting information from a web resource through its user interface rather than its legitimate APIs. Therefore, it is not like calling a REST API of a website to get some data; it is like retrieving the website page as a browser does, parsing the HTML, and then extracting the data rendered into it.


Why Would We Need to Scrape a Website

Simply because we need the data presented on the website, and the website does not provide a legitimate API for us to retrieve it.


Is Web Scraping Legal

It depends on the web resource itself. Some websites state somewhere whether scraping is allowed or not, and sometimes it is not written anywhere.


Also, there is another factor: what you are going to do with the data you scrape. Therefore, always be cautious and keep yourself safe. Do your research first before jumping into the implementation.


How to do Web Scraping

There are different ways of doing it, but in most cases the same concept applies: you write some code to get the HTML using the website's URL, you parse the HTML, and finally you extract the data you want.
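As a quick sketch, and assuming a static page plus the HtmlAgilityPack NuGet package (both are covered in detail below), the three steps could look like this:

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Sketch
{
    static async Task Main()
    {
        // 1. Get the HTML of the page.
        using var client = new HttpClient();
        var html = await client.GetStringAsync("https://example.com");

        // 2. Parse the HTML into a queryable document.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // 3. Extract the data you want, e.g. the page title.
        var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText;
    }
}
```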


However, if we only stick to this definition, we would be missing a lot of details.


In some cases, things are more complicated than that. It depends on the way the website is built.


For Static websites, where the data are already rendered into the HTML from the first instance, you can follow the same steps we described.


However, for Dynamic websites, where the data are not rendered into the HTML from the first instance, and they are loaded dynamically through JavaScript libraries and frameworks (like Angular, React, Vue,…), you need to follow another approach.


Basically, what you do in this case is mimic what a web browser (like Chrome, Firefox, or Edge) does and then get the final HTML from the virtual browser you used. Once you have the full HTML with the data rendered into it, the rest is the same.


Should We Do This Ourselves from Scratch

No, there are already libraries we can use to achieve the expected results.

For example, here is a list of some libraries which we can use.


Performing Calls:

  1. .NET HttpClient
  2. RestSharp


Parsing HTML:

  1. Html Agility Pack (HAP)
  2. CSQuery
  3. AngleSharp


Virtual Browser:

  1. Headless Chrome
  2. Selenium WebDriver
  3. Puppeteer Sharp


These are not the only libraries available to help with your Web Scraping project. If you search the internet, you will find a lot more.



Scraping a Static Website

First, let’s start by trying to scrape some data from a static website. In this example, we are going to scrape my own GitHub profile: https://github.com/AhmedTarekHasan


We will try to get the list of pinned repositories on my profile. Each entry will consist of the repository's name and its description.


Therefore, let’s start.


Observing the Data Structure in the HTML

At the moment of writing this article, this is how my GitHub profile looked:


Image by Ahmed Tarek


When I checked the HTML, I found the following:

  1. All my pinned repositories are found inside the main container with this path: div[@class=js-pinned-items-reorder-container] > ol > li
  2. Each pinned repository is contained inside a container with this relative path to the parent path: div > div
  3. Each pinned repository would have its name inside div > div > span > a and its description inside p

Writing Code

Here are the steps I followed:

  1. Created a Console Application
    Solution: WebScraping
    Project: WebScraper
  2. Installed the NuGet package HtmlAgilityPack.
  3. Added the using directive using HtmlAgilityPack;
  4. Defined the method private static Task<string> GetHtml() to get the HTML.
  5. Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
  6. Finally, the code should be as follows:


using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static async Task Main(string[] args)
        {
            var html = await GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }

        private static Task<string> GetHtml()
        {
            // HttpClient is enough here because the page is static:
            // the data is already rendered into the returned HTML.
            // (In a real application, reuse a single HttpClient instance.)
            var client = new HttpClient();
            return client.GetStringAsync("https://github.com/AhmedTarekHasan");
        }

        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            List<(string RepositoryName, string Description)> data = new();

            foreach (var repo in repositories)
            {
                var name = repo.SelectSingleNode("div/div/span/a").InnerText;
                var description = repo.SelectSingleNode("p").InnerText;
                data.Add((name, description));
            }

            return data;
        }
    }
}


Running this code, you will get the following:


Image by Ahmed Tarek


Of course, the strings could use some cleaning here, but that is not a big deal.
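For example, a minimal cleanup could trim the strings and collapse the whitespace that the markup leaves behind. The Clean helper below is a hypothetical name, not part of the article's code:

```csharp
using System;
using System.Text.RegularExpressions;

class Cleanup
{
    // Collapses runs of whitespace (newlines, indentation) into single
    // spaces and trims both ends. A hypothetical helper, for illustration.
    private static string Clean(string text) =>
        Regex.Replace(text, @"\s+", " ").Trim();

    static void Main()
    {
        Console.WriteLine(Clean("  DevBox\n      "));          // prints "DevBox"
        Console.WriteLine(Clean("A  repository \n of tools")); // prints "A repository of tools"
    }
}
```

You could apply such a helper to both the name and the description before adding them to the result list.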


As you can see, it is easy to use HttpClient and HtmlAgilityPack. All you need is to get used to their APIs, and then it is an easy job.


What you also need to keep in mind is that some websites require more work on your side. Sometimes a website needs login details, authentication tokens, or some specific headers.


All of this can still be handled with HttpClient or the other libraries you use to perform the call.
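As a sketch of what that could look like with HttpClient — the header names and token below are placeholders for whatever the target website actually requires, not something GitHub needs:

```csharp
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class AuthenticatedCall
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Hypothetical examples: a bearer token, a user agent,
        // and a custom header required by the target website.
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", "<your-token>");
        client.DefaultRequestHeaders.UserAgent.ParseAdd("MyScraper/1.0");
        client.DefaultRequestHeaders.Add("X-Custom-Header", "some-value");

        // Every request sent through this client now carries these headers.
        var html = await client.GetStringAsync("https://example.com");
    }
}
```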


Scraping a Dynamic Website

Now, we should try to scrape some data from a dynamic website. However, since I should be cautious before scraping a website, I will apply the same example as before, but this time assuming that the website is dynamic.


Therefore, again, in this example, we are going to scrape my own GitHub profile: https://github.com/AhmedTarekHasan


Observing the Data Structure in the HTML

This would be the same as before.


Writing Code

Here are the steps I followed:

  1. Created a Console Application
    Solution: WebScraping
    Project: WebScraper
  2. Installed the NuGet package HtmlAgilityPack.
  3. Installed the NuGet package Selenium.WebDriver.
  4. Installed the NuGet package Selenium.WebDriver.ChromeDriver.
  5. Added the using directive using HtmlAgilityPack;
  6. Added the using directive using OpenQA.Selenium.Chrome;
  7. Defined the method private static string GetHtml() to get the HTML.
  8. Defined the method private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html) to parse the HTML.
  9. Finally, the code should be as follows:


using System.Collections.Generic;
using HtmlAgilityPack;
using OpenQA.Selenium.Chrome;


namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            var html = GetHtml();
            var data = ParseHtmlUsingHtmlAgilityPack(html);
        }

        private static string GetHtml()
        {
            var options = new ChromeOptions
            {
                // Adjust this path to wherever Chrome is installed on your machine.
                BinaryLocation = @"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
            };

            // Run Chrome without opening a visible window.
            options.AddArguments("headless");

            // Dispose the driver so the chromedriver process does not keep running.
            using var chrome = new ChromeDriver(options);
            chrome.Navigate().GoToUrl("https://github.com/AhmedTarekHasan");

            return chrome.PageSource;
        }

        private static List<(string RepositoryName, string Description)> ParseHtmlUsingHtmlAgilityPack(string html)
        {
            var htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);

            var repositories =
                htmlDoc
                    .DocumentNode
                    .SelectNodes("//div[@class='js-pinned-items-reorder-container']/ol/li/div/div");

            List<(string RepositoryName, string Description)> data = new();

            foreach (var repo in repositories)
            {
                var name = repo.SelectSingleNode("div/div/span/a").InnerText;
                var description = repo.SelectSingleNode("p").InnerText;
                data.Add((name, description));
            }

            return data;
        }
    }
}


Running this code, you will get the following:


Image by Ahmed Tarek


Again, the strings could use some cleaning here, but that is not a big deal.


Again, as you can see, using Selenium.WebDriver and Selenium.WebDriver.ChromeDriver is easy.
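One trick worth knowing: a dynamic page may not have finished rendering by the time you read PageSource. With the extra Selenium.Support NuGet package (an assumption on my side — it is not used in the code above), you could wait until the target elements actually exist:

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

class WaitForContent
{
    static void Main()
    {
        var options = new ChromeOptions();
        options.AddArguments("headless");

        using var chrome = new ChromeDriver(options);
        chrome.Navigate().GoToUrl("https://github.com/AhmedTarekHasan");

        // Poll for up to 10 seconds until at least one pinned
        // repository container has been rendered by JavaScript.
        var wait = new WebDriverWait(chrome, TimeSpan.FromSeconds(10));
        wait.Until(driver => driver.FindElements(
            By.XPath("//div[@class='js-pinned-items-reorder-container']/ol/li")).Count > 0);

        // Now the HTML is safe to hand over to the parser.
        var html = chrome.PageSource;
    }
}
```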



Final Words

As you can see, Web Scraping is not that hard, but it actually depends on the website you are trying to scrape. Sometimes you might come across a website that needs some tricks to make it work.


That’s it. I hope you found reading this article as interesting as I found writing it.

