Web Scraping with Python is a popular subject around data science enthusiasts. Here is a piece of content who want to lxml library. aimed at beginners learn Web Scraping with Python What is lxml? is the most feature-rich and easy-to-use library for processing XML and HTML in Python programming language. lxml is a reference to the XML toolkit in a pythonic way which is internally being bound with two specific C language, , and . lxml is unique in a way that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API. lxml libraries of libxml2 libxslt When compared to the rest, the python lxml package gives an advantage in terms of performance. Reading and writing even fairly large XML files takes an imperceptible amount of time. Use of lxml makes easier & much faster. data processing With the continued growth of both Python and XML, there are a plethora of packages out there that help you read, generate, and modify XML files from Python scripts. Compared to most of them, the python lxml package has two big advantages: : Reading and writing even fairly large XML files take an imperceptible amount of time. Performance python lxml library has easy syntax and more adaptive nature than other packages. Ease of programming: is similar in many ways to two other earlier packages which are called as parent packages for lxml. This is used to create and parse tree structure of XML nodes. : This is now an official part of the Python library. There is a C-language version called which maybe even faster than for some applications.However, is preferred by most of the python developers because it provides a number of additional features that make life easier. lxml ElementTree : xml.etree.ElementTree cElementTree lxml lxml In particular, it supports which makes it considerably easy to manage more complex XML structures. XPath , python lxml library can be used to either create XML/HTML structure using elements, or parse XML/HTML structure to retrieve information from them. This library can be used to get information from different web services and web resources, as these are implemented in XML/HTML format. The objective of this tutorial is throw light on how helps us to get and process information from different web resources. lxml How to install lxml? can be installed as a using which is a tool for python. Below is the command which is needs to be fired to install it on your system. lxml python package pip package manager automatically installs all the dependencies for installing python as well. can be installed as a using binary installers depending upon system OS. I would prefer to install it using the former method, as many systems do not have a better and clean way to install this package if the latter is used. pip install lxmlpip lxml lxml system package How to use lxml? Python is a very easy language to learn but libraries which are written using python are as easy. Getting a clear picture of the function of the library is ambiguous. Practical implementation will take us closer to creating an idea of what is the library actually doing. Let us pick a few examples and use in practical scenarios. Successful implementation of Web Scraping with Python takes time and practice.As discussed earlier, we can use python to create as well as parse XML/HTML structures. lxml lxml In a first and very basic example, let’s create an html web page structure using python and define some elements and its attributes. So, let us begin! lxml has many modules and one of the module is an which is responsible for creating elements and structure using these elements.First, let’s import the “require” module in python. I generally prefer to use command shell to execute python programs because it gives an extensive and clear command prompt to use python features in a very broad way. lxml etree Ipython Python |Anaconda ( -bit)| ( , Dec , : : ) Type , or more information. IPython -- An enhanced Interactive Python. ? -> Introduction and overview IPython s own help system. object? -> Details about , use extra details. In [ ]: lxml etree 2.7 .11 4.0 .0 64 default 6 2015 18 08 32 "copyright" "credits" "license" for 4.1 .2 of 's features. %quickref -> Quick reference. help -> Python' 'object' 'object??' for 1 from import After importing module, we can use to create multiple elements. In general, elements can be called as nodes as well. etree Element class API In [ ]: root = etree.Element( ) In [ ]: root Out[ ]: 2 'html' 3 3 In [4]: print root.tag html < > Element html at 0x7f43a5c51ab8 XML/HTML pages designed on parent-child paradigm where elements can play the role of parents and children for other element nodes. To create a parent-child relationship using python lxml, we can use method of module. SubElement etree In [ ]: etree.SubElement(root, ) Out[ ]: <Element body at 0x7f43a5c51f38> In [7]: print etree.tostring(root) <html><head/><body/></html> 5 'head' 5 In [6]: etree.SubElement(root, 'body') Out[6]: < > Element head at 0x7f43a5c51e60 Element nodes have multiple properties. For example a property can be used to set a text value for a node which we can be inferred as an information for the end user. We can also set attributes for any node in the tree structure. As you can see below, I have created a html tree structure using and its which can be saved as a as well. text lxml etree html web page We can set attributes for elements. Now, let’s take another example in which we shall see how to parse html tree structure. This process is a part of content from the web so you can follow this process if you want to scrap data from the web and process the data further.In this example, let us use python module, which is used to send HTTP requests to web URLs. module has improved speed and readability when compared to the built-in module. So, using module is a better choice. Along with , module is made use of from to parse the response of the request.First, let’s import require modules, scraping requests requests urllib2 requests requests html lxml, In [ ]: requests In [ ]: lxml html 19 import 20 from import Using module, let’s send a get request to website to retrieve top news stories. HTTP web server sends the response as a Response<200> object. We store this in a page variable and then use html module to parse it and save the results in a tree. requests cnn.com The response object has multiple properties like response headers, contents, cookies etc. We can use the python method to see all these object properties. Here, I am using instead of page.text because html.fromstring implicitly expects bytes as input where the page.text provides content in simple text format (ASCII or utf-8, depending upon web server configuration). dir() page.content In [ ]: page = requests.get( ) In [ ]: html_content = html.fromstring(page.content) 21 'http://www.cnn.com' 22 also provides multiple functions to access the parsed object. For example, to iterate children of html object, we can useiterchildren(). The html module tree now contains the whole HTML file in a nice tree structure which we can go over two different ways: and . In this example, we will focus on the former. is a way of locating information in structured documents such as HTML or XML documents. uses path expressions to select nodes or node-sets in an XML document. The node is selected by following a path or steps.The most useful path expressions are listed below: XPath CSSSelect XPath XPath Description Selects all nodes with the name “ ” Selects from the root node Selects nodes in the document from the current node that match the selection no matter where they are Selects the current node Selects the parent of the current node Selects attributes nodename / . .. @ Expression nodename // Following are some path expressions and their results Result Selects all nodes with the name “bookstore” Selects the root element bookstore If the path starts with a slash ( / ) it always represents an absolute path to an element! Selects all book elements that are children of bookstore Selects all book elements no matter where they are in the document Selects all book elements that are descendant of the bookstore element, no matter where they are under the bookstore element Note: Selects all attributes that are named lang /bookstore bookstore/book bookstore Path Expression bookstore //book //book //@lang Let’s get back to our scraping example. so far we have downloaded and made a tree structure from html web page. We are using to select nodes from this tree structure. As, we want to get top stories, we have to analyse the web page to find the tags that are storing this information. Upon analysis we can see that with attribute contains this information. Selecting this node allows us to fetch the text of news stories and appropriate web links to read for complete news. XP ath h3 tag data-analytic In [ ]: i html_content.iterchildren(): ....: print i ....: <Element body at 0x7f43a5737e10> In [24]: news_stories = html_content.xpath('//h3[@data-analytics]/a/span/text()') In [25]: news_links = html_content.xpath('//h3[@data-analytics]/a/@href') In [26]: news_links Out[26]: ['/2016/07/25/politics/democratic-convention-dnc-emails-russia/index.html', '/2016/07/25/us/fort-myers-nightclub-shooting/index.html', '/2016/07/24/world/ansbach-germany-blast/index.html', '/2016/07/25/europe/germany-attacks-asylum-seekers-refugees/index.html', '/2016/07/25/world/protests-boy-killed-bangladesh/index.html', '/2016/06/15/politics/muslim-ban-maps-donald-trump/index.html', '/2016/07/24/world/qandeel-baloch-death-father-azeem/index.html', '/2016/07/24/aviation/tripadvisor-world-favorite-airlines/index.html', '/2016/07/25/africa/koffi-olomide-dancer-kenya/index.html'] In [27]: news_stories Out[27]: ['FBI launches investigation into suspected Russian email hack', "Two dead, 14 injured at Florida 'Swimsuit Glow Party'", 'Suicide bomber was slated to be deported', 'German public questions refugee policy', 'Brutal killing of boy, 10, sparks protests', "Mapped: Trump's Muslim travel ban ", "Father of slain social star: 'I want revenge'", "World's most-loved airline is...", 'Pop star apologizes for kicking dancer '] 23 for in < > Element head at 0x7f43a5737db8 To give a better representation to this scraped data, I am zipping news stories and links together and storing them in a list, which later can be processed in form of printing or storing in a database for further process. In [ ]: top_stories = [] In [ ]: i zip(news_stories, news_links): ....: top_stories.append(i) ....: In [ ]: top_stories Out[ ]: [( , ), ( , ), ( , ), ( , ), ( , ), ( , ), ( , ), ( , ), ( , )] 28 29 for in 30 30 'FBI launches investigation into suspected Russian email hack' '/2016/07/25/politics/democratic-convention-dnc-emails-russia/index.html' "Two dead, 14 injured at Florida 'Swimsuit Glow Party'" '/2016/07/25/us/fort-myers-nightclub-shooting/index.html' 'Suicide bomber was slated to be deported' '/2016/07/24/world/ansbach-germany-blast/index.html' 'German public questions refugee policy' '/2016/07/25/europe/germany-attacks-asylum-seekers-refugees/index.html' 'Brutal killing of boy, 10, sparks protests' '/2016/07/25/world/protests-boy-killed-bangladesh/index.html' "Mapped: Trump's Muslim travel ban " '/2016/06/15/politics/muslim-ban-maps-donald-trump/index.html' "Father of slain social star: 'I want revenge'" '/2016/07/24/world/qandeel-baloch-death-father-azeem/index.html' "World's most-loved airline is..." '/2016/07/24/aviation/tripadvisor-world-favorite-airlines/index.html' 'Pop star apologizes for kicking dancer ' '/2016/07/25/africa/koffi-olomide-dancer-kenya/index.html' Ta da! We have successfully covered scraping using python lxml and . We have it stored in memory as a list. Now we can do all sorts of cool stuff with it: analyze it using Python or save it in a file and share it with the world.We have covered most of the stuff related to web Scraping with python module and also understood how can we combine it with other python modules to do some impressive work. Below are a few references which can be helpful in knowing more about it.Do share this if you enjoyed reading this blog post on Web Scraping with Python. Write a web scraper on your own and share your experience with us. requests lxml References lxml – XML and HTML with Python lxml.etree Tutorial Parsing XML and HTML using lxml