Web scraping or crawling is the act of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.

Since not every website offers a clean API, or an API at all, web scraping can be the only solution when it comes to extracting website information. Lots of companies use it to obtain knowledge about competitor prices, for news aggregation, mass email collection… Almost everything can be extracted from HTML; the only information that is "difficult" to extract is inside images or other media.

In this post we are going to see basic techniques to fetch and parse data in Java.

This article is an excerpt from my new book Java Web Scraping Handbook. The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites and much more.

Prerequisites

- Basic Java understanding
- Basic XPath

You will need Java 8 with HtmlUnit:

```xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.19</version>
</dependency>
```

If you are using Eclipse, I suggest you configure the maximum length in the detail pane (when you click on the Variables tab) so that you can see the entire HTML of your current page.

Let's scrape CraigList

For our first example, we are going to fetch items from CraigList, since they don't seem to offer an API: we will collect names, prices and images, and export everything to JSON.

First, let's take a look at what happens when you search for an item on CraigList. Open the Chrome Dev tools and click on the Network tab. The search URL is:

https://newyork.craigslist.org/search/moa?is_paid=all&search_distance_type=mi&query=iphone+6s

You can also use:

https://newyork.craigslist.org/search/sss?sort=rel&query=iphone+6s

Now you can open your favorite IDE, it is time to code. HtmlUnit needs a WebClient to make a request. There are many options (proxy settings, browser version, redirects enabled…). We are going to disable CSS and JavaScript, since they are not required for our example, and disabling JavaScript makes the page load faster:

```java
String searchQuery = "Iphone 6s";

WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);

try {
    String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query="
            + URLEncoder.encode(searchQuery, "UTF-8");
    HtmlPage page = client.getPage(searchUrl);
} catch (Exception e) {
    e.printStackTrace();
}
```

The HtmlPage object will contain the HTML code; you can access it with the asXml() method.
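Before parsing anything, it is often useful to check what HtmlUnit actually downloaded. Here is a minimal, self-contained sketch that prints the page title and the raw HTML; it assumes the same HtmlUnit dependency as above, and a recent enough version (2.14+) where WebClient is AutoCloseable:

```java
import java.net.URLEncoder;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchDemo {
    public static void main(String[] args) throws Exception {
        String searchUrl = "https://newyork.craigslist.org/search/sss?sort=rel&query="
                + URLEncoder.encode("Iphone 6s", "UTF-8");

        // try-with-resources closes the client's underlying connections for us
        try (WebClient client = new WebClient()) {
            client.getOptions().setCssEnabled(false);
            client.getOptions().setJavaScriptEnabled(false);

            HtmlPage page = client.getPage(searchUrl);
            System.out.println(page.getTitleText()); // content of the <title> tag
            System.out.println(page.asXml());        // full HTML, pretty-printed as XML
        }
    }
}
```

If the printed HTML looks very different from what you see in Chrome, remember that JavaScript is disabled here, so anything rendered client-side will be missing.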
Now we are going to fetch titles, images and prices. We need to inspect the DOM structure for an item. With HtmlUnit you have several options to select an HTML tag:

- getHtmlElementById(String id)
- getFirstByXPath(String xpath)
- getByXPath(String xpath), which returns a List
- many others, RTFM!

Since there isn't any ID we could use, we have to craft an XPath expression to select the tags we want. XPath is a query language to select XML nodes (HTML in our case).

First we are going to select all the `<li>` tags that have the class `result-row`. Then we will iterate through this list and, for each item, select the name, price and URL from the `result-info` block, and print it.

```java
List<HtmlElement> items = (List<HtmlElement>) page.getByXPath("//li[@class='result-row']");
if (items.isEmpty()) {
    System.out.println("No items found !");
} else {
    for (HtmlElement htmlItem : items) {
        HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));
        HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']"));

        String itemName = itemAnchor.asText();
        String itemUrl = itemAnchor.getHrefAttribute();
        // It is possible that an item doesn't have any price; we set the price to 0.0 in this case
        String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();

        System.out.println(String.format("Name : %s Url : %s Price : %s", itemName, itemUrl, itemPrice));
    }
}
```

Then, instead of just printing the results, we are going to store them as JSON, using the Jackson library to map items to the JSON format.

First we need a POJO (plain old Java object) to represent an item:

Item.java

```java
import java.math.BigDecimal;

public class Item {
    private String title;
    private BigDecimal price;
    private String url;

    // getters and setters
}
```

Then add this to your pom.xml:

```xml
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.7.0</version>
</dependency>
```

Now all we have to do is create an Item, set its attributes, and convert it to a JSON string (or write it to a file…), adapting the previous code a little bit:

```java
for (HtmlElement htmlItem : items) {
    HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));
    HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']"));

    // It is possible that an item doesn't have any price; we set the price to 0.0 in this case
    String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();

    Item item = new Item();
    item.setTitle(itemAnchor.asText());
    // baseUrl is the scheme and host of the search URL, e.g. https://newyork.craigslist.org
    item.setUrl(baseUrl + itemAnchor.getHrefAttribute());
    item.setPrice(new BigDecimal(itemPrice.replace("$", "")));

    ObjectMapper mapper = new ObjectMapper();
    String jsonString = mapper.writeValueAsString(item);
    System.out.println(jsonString);
}
```

Go further

This example is not perfect; there are many things that can be improved:

- Multi-city search (see the sketch at the end of this post)
- Handling pagination
- Multi-criteria search

You can find the code in this Github repo.

I hope you enjoyed this post, feel free to give me feedback in the comments.

This article was an excerpt from my new book: Java Web Scraping Handbook. The book will teach you the noble art of web scraping: from parsing HTML to breaking captchas, handling JavaScript-heavy websites and much more.
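Bonus: as a starting point for the multi-city item in the "Go further" list above, here is a hedged sketch that runs the same search against several city subdomains. The city list is an assumption (verify in your browser that each subdomain exists on craigslist.org); everything else reuses the calls shown earlier in this post:

```java
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class MultiCitySearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical city subdomains: check that each one actually exists
        List<String> cities = Arrays.asList("newyork", "boston", "chicago");
        String query = URLEncoder.encode("Iphone 6s", "UTF-8");

        try (WebClient client = new WebClient()) {
            client.getOptions().setCssEnabled(false);
            client.getOptions().setJavaScriptEnabled(false);

            for (String city : cities) {
                String searchUrl = "https://" + city + ".craigslist.org/search/sss?sort=rel&query=" + query;
                HtmlPage page = client.getPage(searchUrl);

                // Same XPath as in the single-city example
                List<HtmlElement> items = (List<HtmlElement>) page.getByXPath("//li[@class='result-row']");
                System.out.println(city + " : " + items.size() + " items");
            }
        }
    }
}
```

From there, each item can be parsed and exported to JSON exactly as in the single-city loop above.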