When you talk about web scraping, PHP is the last thing most people think about. But this tool based on Symphony framework is about to change your mind. According to it’s documentation on : https://goutte.readthedocs.io/en/latest/ “Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.” This also means you can login to a website, submit a form, upload a file and whatnot all using this library siting at your server (or even on your local system) and still do all the things you would normally do. To scrap any website you basically need to first visit the website and do recon. And how do you do that? Simple — Inspector tool. Most modern day web browsers have “inspect” tool when you right click a web page (For Safari, you need to enable the option “Show Develop menu in menu bar” under the Advanced tab of Preference). Once you select inspect tool, the browser’s developer window (or console window) will open up docked on one side of window. The first tab is the “Elements” tab. Inside this tab you get to see the elements and their styling. This is very important to see the “name” attribute of the elements, to see if there are any hidden elements, find the form or its action attribute etc., you can use this “inspect” tool. This is recon. Ok, now let’s see this in action. Suppose your objective is to find the list of repositories that a user has created on their GitHub account (you could do more, but let’s just play nice ;-) ). Let’s create a folder in your localhost directory by the name “crawler” (name is irrelevant). Then head over to the library’s repo page on GitHub to download from or https://github.com/FriendsOfPHP/Goutte.git git clone https://github.com/FriendsOfPHP/Goutte.git After unzipping (if you downloaded) or git cloning, you’ll see a directory named “Goutte”. You’ll be needing composer installed as well to make this library work. Download it from https://getcomposer.org/download/ Now execute this command to add dependency libraries — composer require fabpot/goutte Give this command a minute or so finish. After successfully executing this command vendor directory (containing dependencies) and composer file is added. Now we are ready. Let’s create an interface for the user. Nothing fancy, simple HTML login. Create an index.php file with the following code. <form method= "POST" > <div class =“ form - group "> < label for =“ git_email "> Email address </ label > < input type =" email " class =" form - control " id =“ git_email " name =“ git_email " placeholder =" Enter email "> </ div > < div class =" form - group "> < label for =“ git_pwd "> Password </ label > < input type =" password " class =" form - control " id =" git_pwd " name =" git_pwd " placeholder =" Password "> </ div >  < button type =" submit " class =" btn btn - primary "> Submit </ button > </ form > Then when the user enters the credential (I’m not doing any client validation — I leave that to you ), the page gets submitted to itself. So, here our logic begins. Check if the page is getting submitted with the data. } if ( isset ( $_POST [ "“git_email" ]) && isset ( $_POST [ "git_pwd" ]) && ! empty ( $_POST [ "“git_email" ]) && ! empty ( $_POST [ "git_pwd" ])){ //check if the email and password params are there Then include the library and initialise it. require_once ( "vendor/autoload.php" ); require_once (“Goutte/Goutte/Client.php "); use Goutte\Client; $client = new Client(); If we inspect the login form at we see the following: https://gitlab.com/users/sign_in It shows the name attribute of the email and password fields //initialise the crawler with request type get and the login url $crawler = $client ->request( 'GET' , 'https://github.com/login' ); Find the form, set the post params and submit form //find the form $form = $crawler ->selectButton( 'Sign in' )->form(); //set the post params $form ->setValues([ 'login' => $_POST [ "git_email" ], 'password' => $_POST [“git_pwd "]]); // submit the form $crawler = $client ->submit( $form ); Now let’s check if login was successful or not. The way to check it is if the response/redirected page has meta tag having name “octolytics-actor-login”. This meta tag contains the username. }
}); //variable to store the logged in username $username = “ "; //Find all the meta tag $crawler ->filter('meta')->each(function ( $node ) { global $username ;//This is important, since you need to set the variable //Stop at the meta tag that has “octolytics-actor-login" as it’s name if (trim( $node ->attr( "name" )) == "octolytics-actor-login" ){ //set the variable value and return $username = ( $node ->attr( "content" )); return ; If the variable value is blank then we can inform the user that the credentials they entered is invalid. } if ( $username == “”){ //give error message echo “<center>The credentials entered are invalid</center>”; Finally, let’s navigate to the url that contains the list of repositories associated with the username $crawler = $client ->request( 'GET' , 'https://github.com/' . $username . '?tab=repositories' ); Once we are on the repo page we need to loop through all the repository names (when you inspect any one repo link in inspector tool you’ll find that they are inside list tag with class “source” ) displayed as an anchor link: }
}); //loop through all “a” tag contained inside list element with “source” class $crawler ->filter( 'li.source a' )->each( function ( $node ) { //This additional check is to determine we only get repo name if (is_numeric( $node ->text()) === false ){ echo $node ->text(); echo "<br/>" ; And that’s it, we are printing all the repository belonging to the user who tried to login. This is the first article I’ve written ever. If you like it, I’d appreciate your claps! Happy Hacking :) Previously published at https://medium.com/@mubin.khalife/php-web-scraping-youve-goutte-a-friend-5dd9d2874b80

PHP Web Scraping Using Goutte

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

10 Cool CI/CD Tools For Your Project

27 Stories To Learn About Ssl

20 Essential Backend Tools For Developers

3 Most Common Ways to Connect your Node and React Applications

369 Stories To Learn About Database

4 JavaScript Portfolio Projects to Help You Land a Web Developer Position

10 Cool CI/CD Tools For Your Project

27 Stories To Learn About Ssl

20 Essential Backend Tools For Developers

3 Most Common Ways to Connect your Node and React Applications

369 Stories To Learn About Database

4 JavaScript Portfolio Projects to Help You Land a Web Developer Position

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps