How to Web Scrap with Python lxml [Beginner's Guide]

by Sandra Moraes, August 28th, 2019

Web scraping with Python is a popular subject among data science enthusiasts. This article is aimed at beginners who want to learn web scraping with the Python lxml library.

What is lxml? 

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python programming language. It is a Pythonic binding for two C libraries, libxml2 and libxslt. lxml is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API.

With the continued growth of both Python and XML, there is a plethora of packages that help you read, generate, and modify XML files from Python scripts. Compared to most of them, the lxml package has two big advantages:

Performance: Reading and writing even fairly large XML files takes an almost imperceptible amount of time, which makes data processing easier and much faster.

Ease of programming: lxml has a simpler syntax and is more adaptable than other packages.

lxml is similar in many ways to two earlier packages, which can be considered its parent packages:

ElementTree: This is used to create and parse tree structures of XML nodes.

xml.etree.ElementTree: This is now an official part of the Python standard library. There is a C-language version called cElementTree, which may be even faster than lxml for some applications.

However, most Python developers prefer lxml because it provides a number of additional features that make life easier. In particular, it supports XPath, which makes it considerably easier to manage more complex XML structures.

The lxml library can be used either to create XML/HTML structures from elements or to parse XML/HTML structures and retrieve information from them. It can be used to get information from different web services and web resources, as these are implemented in XML/HTML format. The objective of this tutorial is to shed light on how lxml helps us get and process information from different web resources.

How to install lxml?

lxml can be installed as a Python package using pip, the package manager for Python. Run the following command to install it on your system:

pip install lxml

pip automatically installs all the dependencies that lxml needs as well.

lxml can also be installed as a system package using binary installers, depending on the operating system. I prefer the former method, because many systems do not offer a clean way to install the package with the latter.
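Once the install finishes, a quick check from Python can confirm that the package imports and show which versions are in use. This is only a minimal sketch, written for Python 3; the version numbers printed will depend on your system.

```python
from lxml import etree

# LXML_VERSION is the version of lxml itself; LIBXML_VERSION is the
# version of the underlying libxml2 C library that is in use.
lxml_version = ".".join(str(part) for part in etree.LXML_VERSION)
libxml_version = ".".join(str(part) for part in etree.LIBXML_VERSION)

print("lxml:", lxml_version)
print("libxml2:", libxml_version)
```

If the import fails, the package is not installed for the interpreter you are running.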

How to use lxml? 

Python is a very easy language to learn, but libraries written in Python are not always as easy. It can be hard to get a clear picture of what a library does from a description alone; practical implementation brings us closer to understanding what the library is actually doing. Let us pick a few examples and use lxml in practical scenarios. Successful web scraping with Python takes time and practice.

As discussed earlier, we can use lxml to create as well as parse XML/HTML structures.

In a first and very basic example, let’s create an HTML web page structure using lxml and define some elements and their attributes. So, let us begin!

lxml has many modules, and one of them is etree, which is responsible for creating elements and building structures from them.

First, let’s import the required module. I generally prefer to use the IPython command shell to run Python programs because it provides an extensive and clear command prompt for exploring Python features.

Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec 6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.
 
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
 
In [1]: from lxml import etree

After importing the etree module, we can use the Element API to create elements. In general, elements can also be called nodes.

In [2]: root = etree.Element('html')
In [3]: root
Out[3]: <Element html at 0x7f43a5c51ab8>
 
In [4]: print(root.tag)
html

XML/HTML pages are designed on a parent-child paradigm, where elements can act as parents and children of other element nodes. To create a parent-child relationship using lxml, we can use the SubElement function of the etree module.

In [5]: etree.SubElement(root, 'head')
Out[5]: <Element head at 0x7f43a5c51e60>
 
In [6]: etree.SubElement(root, 'body')
Out[6]: <Element body at 0x7f43a5c51f38>
 
In [7]: print(etree.tostring(root))
<html><head/><body/></html>

Element nodes have multiple properties. For example, the text property sets a text value for a node, which can carry information for the end user. We can also set attributes on any node in the tree structure. An HTML tree structure built this way with lxml and its etree module can be saved as an HTML web page as well.
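To make the text and attribute properties concrete, here is a minimal sketch that extends the html/head/body skeleton from the session. It assumes Python 3, and the title text, heading text, id, and class values are hypothetical samples chosen only for illustration.

```python
from lxml import etree

# Build the same html/head/body skeleton as in the session above.
root = etree.Element("html")
head = etree.SubElement(root, "head")
body = etree.SubElement(root, "body")

# .text sets the text content of a node.
title = etree.SubElement(head, "title")
title.text = "My first lxml page"  # hypothetical sample text

# Attributes can be passed as keyword arguments or set afterwards.
heading = etree.SubElement(body, "h1", id="main-heading")
heading.text = "Hello, lxml"
heading.set("class", "banner")

# method="html" serializes with HTML rules; encoding="unicode" returns str.
page = etree.tostring(root, method="html", pretty_print=True, encoding="unicode")
print(page)

# Attributes can be read back with get() or through the attrib mapping.
print(heading.get("id"), heading.attrib["class"])
```

Writing the page string to a file with a .html extension would give a page that a browser can open.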

Now, let’s take another example in which we parse an HTML tree structure. This process is part of scraping content from the web, so you can follow it if you want to scrape data from the web and process it further.

In this example, we use the requests module, which sends HTTP requests to web URLs. The requests module offers improved speed and readability compared to the built-in urllib2 module, so it is the better choice. Along with requests, we use the html module from lxml to parse the response.

First, let’s import the required modules:

In [19]: import requests
In [20]: from lxml import html

Using the requests module, let’s send a GET request to the cnn.com website to retrieve the top news stories. The web server's reply is returned as a <Response [200]> object. We store it in a page variable and then use the html module to parse it and save the resulting tree in html_content.

The response object has multiple properties, such as response headers, content, and cookies. We can use Python's dir() function to see all of them. Here, I am using page.content instead of page.text because html.fromstring implicitly expects bytes as input, whereas page.text provides the content as decoded text (for example ASCII or UTF-8, depending on the web server configuration).

In [21]: page = requests.get('http://www.cnn.com')
In [22]: html_content = html.fromstring(page.content)
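The same parsing step can be tried without a network connection. The sketch below assumes Python 3 and uses a small hand-written bytes literal as a stand-in for page.content; the markup is hypothetical and only meant to show what html.fromstring returns.

```python
from lxml import html

# Hypothetical sample markup standing in for page.content (raw bytes).
sample_bytes = b"""<html>
  <head><title>Sample page</title></head>
  <body><p>Hello</p></body>
</html>"""

document = html.fromstring(sample_bytes)

# For a complete document, fromstring returns the root <html> element.
print(document.tag)

# findtext() returns the text of the first element matching the path.
page_title = document.findtext(".//title")
print(page_title)
```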

The html module also provides multiple functions to access the parsed object. For example, to iterate over the children of the html object, we can use iterchildren().

The tree now contains the whole HTML file in a nice tree structure, which we can traverse in two different ways: XPath and CSSSelect. In this example, we focus on the former.

XPath is a way of locating information in structured documents such as HTML or XML documents. XPath uses path expressions to select nodes or node-sets in an XML document. A node is selected by following a path or steps.

The most useful path expressions are listed below:

Expression    Description
nodename      Selects all nodes with the name “nodename”
/             Selects from the root node
//            Selects nodes in the document from the current node that match the selection no matter where they are
.             Selects the current node
..            Selects the parent of the current node
@             Selects attributes
The following are some path expressions and their results:

Path Expression    Result
bookstore          Selects all nodes with the name “bookstore”
/bookstore         Selects the root element bookstore. Note: If the path starts with a slash ( / ) it always represents an absolute path to an element!
bookstore/book     Selects all book elements that are children of bookstore
//book             Selects all book elements no matter where they are in the document
bookstore//book    Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element
//@lang            Selects all attributes that are named lang
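The expressions in the table can be tried out with lxml's xpath() method. The following is a minimal sketch, assuming Python 3 and a small hypothetical bookstore document made up for this illustration. Because xpath() evaluates relative paths from the node it is called on, the paths here are written in absolute form starting from the root.

```python
from lxml import etree

# Hypothetical bookstore document used only to illustrate the expressions.
xml_source = b"""<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="fr">Le Petit Prince</title>
    <price>12.50</price>
  </book>
  <shelf>
    <book>
      <title lang="en">Learning XML</title>
      <price>39.95</price>
    </book>
  </shelf>
</bookstore>"""

root = etree.fromstring(xml_source)  # root is the <bookstore> element

# Absolute path: the root element itself.
print(len(root.xpath("/bookstore")))         # 1

# Only book elements that are direct children of bookstore.
print(len(root.xpath("/bookstore/book")))    # 2

# // matches at any depth, so the book nested in <shelf> is included.
print(len(root.xpath("//book")))             # 3

# @ selects attributes; the result is a list of attribute values.
langs = root.xpath("//@lang")
print(langs)                                 # ['en', 'fr', 'en']

# .. selects the parent of the current node.
first_title = root.xpath("//title")[0]
parent_tag = first_title.xpath("..")[0].tag
print(parent_tag)                            # book
```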

Let’s get back to our scraping example. So far, we have downloaded an HTML web page and built a tree structure from it. We use XPath to select nodes from this tree. Since we want the top stories, we have to analyse the web page to find the tags that store this information. Upon analysis, we can see that h3 tags with a data-analytics attribute contain it. Selecting these nodes lets us fetch the text of the news stories and the web links to the complete articles.

In [23]: for i in html_content.iterchildren():
   ....:     print(i)
   ....:
<Element head at 0x7f43a5737db8>
<Element body at 0x7f43a5737e10>
 
In [24]: news_stories = html_content.xpath('//h3[@data-analytics]/a/span/text()')
 
In [25]: news_links = html_content.xpath('//h3[@data-analytics]/a/@href')
 
In [26]: news_links
Out[26]:
['/2016/07/25/politics/democratic-convention-dnc-emails-russia/index.html',
 '/2016/07/25/us/fort-myers-nightclub-shooting/index.html',
 '/2016/07/24/world/ansbach-germany-blast/index.html',
 '/2016/07/25/europe/germany-attacks-asylum-seekers-refugees/index.html',
 '/2016/07/25/world/protests-boy-killed-bangladesh/index.html',
 '/2016/06/15/politics/muslim-ban-maps-donald-trump/index.html',
 '/2016/07/24/world/qandeel-baloch-death-father-azeem/index.html',
 '/2016/07/24/aviation/tripadvisor-world-favorite-airlines/index.html',
 '/2016/07/25/africa/koffi-olomide-dancer-kenya/index.html']
 
In [27]: news_stories
Out[27]:
['FBI launches investigation into suspected Russian email hack',
 "Two dead, 14 injured at Florida 'Swimsuit Glow Party'",
 'Suicide bomber was slated to be deported',
 'German public questions refugee policy',
 'Brutal killing of boy, 10, sparks protests',
 "Mapped: Trump's Muslim travel ban ",
 "Father of slain social star: 'I want revenge'",
 "World's most-loved airline is...",
 'Pop star apologizes for kicking dancer ']
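The session above depends on how the live page was marked up when it was fetched, and the site may no longer match it. To experiment with the same two XPath queries offline, here is a minimal sketch that assumes Python 3 and uses a hypothetical fragment imitating the h3/a/span structure described above; the headlines, links, and attribute values are invented.

```python
from lxml import html

# Hypothetical fragment imitating the markup described in the article.
sample_page = b"""<html><body>
  <h3 data-analytics="top-stories_1"><a href="/2016/07/25/us/story-one/index.html"><span>First headline</span></a></h3>
  <h3 data-analytics="top-stories_2"><a href="/2016/07/24/world/story-two/index.html"><span>Second headline</span></a></h3>
  <h3><a href="/ignored.html"><span>No data-analytics attribute</span></a></h3>
</body></html>"""

tree = html.fromstring(sample_page)

# [@data-analytics] keeps only h3 elements that carry that attribute.
stories = tree.xpath('//h3[@data-analytics]/a/span/text()')
links = tree.xpath('//h3[@data-analytics]/a/@href')

print(stories)
print(links)
```

The third h3 has no data-analytics attribute, so neither query returns it.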
 

To present the scraped data better, I zip the news stories and links together and store them in a list, which can later be printed or stored in a database for further processing.


In [28]: top_stories = []
 
In [29]: for i in zip(news_stories, news_links):
   ....:     top_stories.append(i)
   ....:
 
 
 
In [30]: top_stories
Out[30]:
[('FBI launches investigation into suspected Russian email hack',
  '/2016/07/25/politics/democratic-convention-dnc-emails-russia/index.html'),
 ("Two dead, 14 injured at Florida 'Swimsuit Glow Party'",
  '/2016/07/25/us/fort-myers-nightclub-shooting/index.html'),
 ('Suicide bomber was slated to be deported',
  '/2016/07/24/world/ansbach-germany-blast/index.html'),
 ('German public questions refugee policy',
  '/2016/07/25/europe/germany-attacks-asylum-seekers-refugees/index.html'),
 ('Brutal killing of boy, 10, sparks protests',
  '/2016/07/25/world/protests-boy-killed-bangladesh/index.html'),
 ("Mapped: Trump's Muslim travel ban ",
  '/2016/06/15/politics/muslim-ban-maps-donald-trump/index.html'),
 ("Father of slain social star: 'I want revenge'",
  '/2016/07/24/world/qandeel-baloch-death-father-azeem/index.html'),
 ("World's most-loved airline is...",
  '/2016/07/24/aviation/tripadvisor-world-favorite-airlines/index.html'),
 ('Pop star apologizes for kicking dancer ',
  '/2016/07/25/africa/koffi-olomide-dancer-kenya/index.html')]
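The links in this list are relative to the site root, so one possible follow-up step is to turn them into absolute URLs and write the pairs out as CSV. This is only a sketch under a few assumptions: Python 3, two entries copied from the output above, the base URL taken from the earlier request, and an in-memory buffer instead of a real file.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "http://www.cnn.com"  # the site requested earlier in the session

# Two entries copied from the output above, kept short for illustration.
news_stories = [
    "German public questions refugee policy",
    "World's most-loved airline is...",
]
news_links = [
    "/2016/07/25/europe/germany-attacks-asylum-seekers-refugees/index.html",
    "/2016/07/24/aviation/tripadvisor-world-favorite-airlines/index.html",
]

# Pair each headline with an absolute URL; strip() removes stray spaces.
top_stories = [
    (title.strip(), urljoin(BASE_URL, link))
    for title, link in zip(news_stories, news_links)
]

# Write to an in-memory buffer; open("stories.csv", "w", newline="") would
# write the same rows to a file instead (the file name is hypothetical).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "url"])
writer.writerows(top_stories)

print(buffer.getvalue())
```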

Ta da! We have successfully scraped a web page using lxml and requests, and the results are stored in memory as a list. Now we can do all sorts of cool stuff with it: analyze it using Python, or save it to a file and share it with the world.

We have covered most of the basics of web scraping with the lxml module and seen how to combine it with other Python modules to do some impressive work. Below are a few references that can help you learn more about it.

Do share this if you enjoyed reading this blog post on web scraping with Python. Write a web scraper of your own and share your experience with us.

References