Parsing Data from Any Site Part 1 - Sniffer http/https Packets by@yarche

Yaroslav Menshikov

I've been a programmer for 7+ years. C#, Java, SQL.

In this article, I will talk about the main tool of a web scraping specialist: the http/https packet sniffer. The article will help you prepare for data parsing by showing you how to see the data exchanged between your browser and a web portal. In addition, you can control this data, since the sniffer in question is also a proxy server. It lets you:

  • Block access to a specific site
  • Change the data packets sent to the web server
  • Change the responses from the web server

This article is suitable for specialists of any level and for anyone who knows any programming language. Using a sniffer is not difficult at all, and this article will teach you to look at web traffic the way a data scraper does.

What Is It For? Why Do We Need Data Parsing?

First of all, data parsing is the automation of data acquisition.

Example:

You want to build a portal that aggregates the currency exchange rates offered by each bank in your city. Entering the data manually means wasting a lot of time every day. It is much easier to write a script that once a day (or even more often) visits each bank's website, collects the information, and writes it to your database.
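A minimal sketch of such a script is shown here. The markup pattern, the bank name, the URL, and the table layout are all assumptions made for illustration, and the download step is replaced by a stand-in function so the example runs offline.

```python
import re
import sqlite3
from datetime import date

# Hypothetical markup: real bank pages differ, so this pattern must be
# adapted to whatever the sniffer shows for each site.
RATE_PATTERN = re.compile(r'<span class="usd-rate">([\d.]+)</span>')


def parse_rate(html):
    """Return the exchange rate found in the page, or None if it is absent."""
    match = RATE_PATTERN.search(html)
    return float(match.group(1)) if match else None


def collect_rates(bank_pages, fetch, db):
    """Fetch each bank page, parse the rate, and store one row per bank."""
    db.execute("CREATE TABLE IF NOT EXISTS rates (day TEXT, bank TEXT, rate REAL)")
    for bank, url in bank_pages.items():
        rate = parse_rate(fetch(url))
        if rate is not None:
            db.execute(
                "INSERT INTO rates VALUES (?, ?, ?)",
                (date.today().isoformat(), bank, rate),
            )
    db.commit()


def fake_fetch(url):
    """Stand-in for a real HTTP download (assumption, keeps the sketch offline)."""
    return '<div>USD <span class="usd-rate">74.35</span></div>'


db = sqlite3.connect(":memory:")
collect_rates({"Example Bank": "https://bank.example/rates"}, fake_fetch, db)
stored = db.execute("SELECT bank, rate FROM rates").fetchall()
```

In a real script, `fake_fetch` would be replaced by a function that performs the request you discovered with the sniffer, and the script would be started by a scheduler once a day.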

To find out which requests your program needs to make, intercept them with a sniffer while you perform the same actions in a web browser.

This approach also helps to find poorly implemented protection in web services. If you are a web services developer, it will help you understand your vulnerabilities and fix them.

So, let's start. There are a lot of packet sniffers, and you can choose any one you like.
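Once a request has been captured, reproducing it from code mostly means copying its URL and headers. The sketch below uses Python's standard library; the URL and header values are hypothetical placeholders, not values taken from a real capture.

```python
import urllib.request


def build_replayed_request(url, captured_headers):
    """Build a GET request that carries the headers seen in the sniffer."""
    request = urllib.request.Request(url, method="GET")
    for name, value in captured_headers.items():
        request.add_header(name, value)
    return request


# Hypothetical values: copy the real ones from the sniffer's request view.
captured_headers = {
    "User-Agent": "Mozilla/5.0 (illustrative value)",
    "Accept": "application/json",
}
request = build_replayed_request("https://bank.example/api/rates", captured_headers)

# Sending it would look like this (not executed here):
# body = urllib.request.urlopen(request).read()
```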

My choice is Fiddler Classic. It is convenient, free, and has all the functionality I need. Note that Fiddler Classic runs on Windows only; for Linux and macOS, the same vendor offers Fiddler Everywhere. Download it here:

https://www.telerik.com/fiddler

The left part of the program window is a list of intercepted requests. The right part, in the Inspectors tab, is split into an upper and a lower pane. The upper pane shows the request sent to the web server. The lower pane shows the response to that request.

To get started with Fiddler, enable HTTPS interception in the settings: Tools → Options → HTTPS tab → check Decrypt HTTPS traffic. Fiddler will ask you to trust its root certificate, which it needs in order to decrypt the traffic.

Once File → Capture Traffic is enabled, Fiddler starts intercepting all http/https requests from the browser.
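Because the sniffer is an ordinary proxy server, a script's traffic can be routed through it as well, which is handy for comparing your own requests with the browser's. The sketch below assumes Fiddler listens on 127.0.0.1:8888 (its usual default; check Tools → Options → Connections), and HTTPS requests will only succeed if the script trusts Fiddler's root certificate.

```python
import urllib.request

# Assumed proxy address: Fiddler's usual default, adjust if yours differs.
FIDDLER_PROXY = "http://127.0.0.1:8888"


def build_proxied_opener(proxy_url=FIDDLER_PROXY):
    """Return an opener that sends http and https traffic through the proxy."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)


opener = build_proxied_opener()

# With Fiddler running, this request would show up in its session list
# (not executed here):
# opener.open("http://example.com/")
```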

To show what Fiddler is capable of, let's go to the page of a school test, https://wordwall.net/ru/resource/354682/spotlight-4-unit-7a-funny-animals, and take it. You may notice that it is impossible to finish this test in less than 40 seconds. However, if we study the requests intercepted in Fiddler, we find the request that the browser sends at the end of the test. Its query string contains a time variable with the value 48424, and the test was completed in 48 seconds. We conclude that the time taken to complete the test is transmitted here in milliseconds.
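The same reasoning can be checked in code by pulling the parameter out of the captured URL. In this sketch, the time value 48424 comes from the capture described above, while the other query parameters are copied from the rewritten URL used later in the article and may not match the original capture exactly.

```python
from urllib.parse import parse_qs, urlsplit


def elapsed_seconds(url):
    """Read the 'time' query parameter (milliseconds) and return seconds."""
    query = parse_qs(urlsplit(url).query)
    return int(query["time"][0]) / 1000


# Illustrative URL: only time=48424 is taken from the captured request.
sample_url = (
    "https://wordwall.net/leaderboardajax/getentryrank"
    "?score=8&time=48424&activityId=354682&templateId=46"
)
seconds = elapsed_seconds(sample_url)
```

Here `seconds` is roughly 48.4, which matches the 48 seconds shown on the page.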

To change this time, let's write a script in Fiddler. Launch Fiddler ScriptEditor (the built-in script editor) from Fiddler → Rules → Customize Rules, then go to the OnBeforeRequest method (from the menu Go → to OnBeforeRequest). OnBeforeRequest is a method that is called before any data packet is sent through the sniffer.

// Replace the rank request with one that reports a time of 1000 ms.
if (oSession.uriContains("https://wordwall.net/leaderboardajax/getentryrank?score="))
{
  oSession.fullUrl = "https://wordwall.net/leaderboardajax/getentryrank?score=8&time=1000&activityId=354682&templateId=46";
}
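The rule above replaces the whole URL with a hard-coded one. Outside Fiddler, the same idea can be expressed as changing a single query parameter and leaving the rest untouched. The helper name below is hypothetical, and the captured URL is the illustrative one with time=48424.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_value(url, name, value):
    """Return the URL with one query parameter replaced, others unchanged."""
    parts = urlsplit(url)
    pairs = [
        (key, str(value) if key == name else old)
        for key, old in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


captured_url = (
    "https://wordwall.net/leaderboardajax/getentryrank"
    "?score=8&time=48424&activityId=354682&templateId=46"
)
rewritten_url = with_query_value(captured_url, "time", 1000)
```

With these inputs, `rewritten_url` is the same URL that the Fiddler rule assigns to `oSession.fullUrl`.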

After taking the test again, we see that the script worked: we were asked to enter our name for the leaderboard. However, our result does not appear in the leaderboard. To find the reason, we need to look into Fiddler again.

There is a new request that apparently adds a new entry to the leaderboard, but the time it sends is still the real one, which does not suit us. We conclude that we need to add one more script for this request.

// Replace the leaderboard entry request so it also reports 1000 ms.
if (oSession.uriContains("https://wordwall.net/leaderboardajax/addentry?score="))
{
  oSession.fullUrl = "https://wordwall.net/leaderboardajax/addentry?score=8&time=1000&name=%D0%BB%D0%B8%D0%B4%D0%B5%D1%802%D1%81%D0%B5%D0%BA&mode=1&activityId=354682&templateId=46";
}
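The name parameter in this URL looks cryptic only because it is percent-encoded UTF-8, which is how browsers send non-ASCII text in a query string. A short sketch with Python's standard library decodes it and encodes it back:

```python
from urllib.parse import quote, unquote

# Value of the 'name' parameter from the request above.
encoded_name = "%D0%BB%D0%B8%D0%B4%D0%B5%D1%802%D1%81%D0%B5%D0%BA"

decoded_name = unquote(encoded_name)  # percent-encoded UTF-8 -> text
reencoded_name = quote(decoded_name)  # text -> percent-encoded UTF-8
```

The decoded value is the Russian nickname "лидер2сек". To submit a different name, encode it with `quote` before placing it in the URL.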

After we take the test once more, the result is successfully recorded in the leaderboard.

Why Does this Approach Work?

The thing is that all the logic runs in the browser, and the web server dutifully accepts the test results, whatever they may be. The reliable fix is to move the time measurement to the web server. Alternatively, you can make the requests harder to tamper with: instead of sending the time in the clear in a GET request, encrypt the transmitted string and send it in a POST request. Keep in mind that this only raises the bar, since the encryption code still runs in the browser.

In this article, we have looked at the main tool of a data parsing specialist: the http/https packet sniffer. We have learned to see the data packets exchanged between the browser and the web portal. Now you know how and why to change these data packets.
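As a rough illustration of counting time on the server, the sketch below records the start of each attempt itself and ignores whatever duration the client reports. The class and method names are hypothetical, and a real service would keep this state in a database rather than in memory.

```python
import time
import uuid


class AttemptTimer:
    """Measures test duration on the server instead of trusting the client."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._started = {}

    def start(self):
        """Register a new attempt and return the id handed to the browser."""
        attempt_id = uuid.uuid4().hex
        self._started[attempt_id] = self._clock()
        return attempt_id

    def finish(self, attempt_id, client_reported_ms=None):
        """Return elapsed milliseconds; the client's own figure is ignored."""
        started = self._started.pop(attempt_id)
        return int((self._clock() - started) * 1000)


# A fake clock makes the example deterministic: start at 100.0 s, end at 148.5 s.
ticks = iter([100.0, 148.5])
timer = AttemptTimer(clock=lambda: next(ticks))
attempt = timer.start()
elapsed_ms = timer.finish(attempt, client_reported_ms=1000)
```

Even though the client claims 1000 ms, the server records 48500 ms, so rewriting the request in a sniffer no longer changes the result.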
