Scrapy is an open source framework for creating web crawlers (AKA spiders). A common roadblock when developing Scrapy spiders, and in web scraping in general, is dealing with sites that use a heavy amount of JavaScript. Since many modern websites are built on JavaScript, they require scripts to be run in order for the page to render properly. In many cases, pages also present modals and other dialogues that need to be interacted with before the full page is shown. So we developed Splash, an open source tool to help you get structured data from the web. In this post we’re going to show you how you can use Splash to handle JavaScript in your Scrapy projects.

## What is Splash?

Splash is Scrapinghub’s in-house solution for JavaScript rendering, implemented in Python using Twisted and QT. Splash is a lightweight web browser capable of processing multiple pages in parallel, executing custom JavaScript in the page context, and much more.

## Setting Up Splash

The easiest way to set up Splash is through Docker:

```
$ docker pull scrapinghub/splash
$ docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
```

Splash will now be running on localhost:8050. If you’re using Docker Machine on OS X or Windows, it will be running on the IP address of Docker’s virtual machine.

If you would like to install Splash without using Docker, please refer to the documentation.

## Using Splash with Scrapy

Now that Splash is running, you can test it in your browser: http://localhost:8050/

On the right, enter a URL (e.g. http://amazon.com) and click ‘Render me!’. Splash will display a screenshot of the page as well as charts and a list of requests with their timings. At the bottom you should see a text box containing the rendered HTML.

### Manually

You can use scrapy.Request to send links to Splash:

```python
import json

from scrapy.http import Headers

req_url = "http://localhost:8050/render.json"
body = json.dumps({
    "url": url,
    "har": 1,
    "html": 0,
})
headers = Headers({'Content-Type': 'application/json'})
yield scrapy.Request(req_url, self.parse_link, method='POST',
                     body=body, headers=headers)
```

If you’re using CrawlSpider, the easiest way is to override the process_links function in your spider to replace links with their Splash equivalents:

```python
from urllib.parse import urlencode

def process_links(self, links):
    for link in links:
        link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
    return links
```

### Scrapy-Splash (recommended)

The preferred way to integrate Splash with Scrapy is using scrapy-splash. See here for why it’s recommended you use the middleware instead of using Splash manually. You can install scrapy-splash using pip:

```
pip install scrapy-splash
```

To use scrapy-splash in your project, you first need to enable the middleware:

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
```

The middleware needs to take precedence over HttpProxyMiddleware, which by default is at position 750, so we set the middleware positions to numbers below 750.

You then need to set the SPLASH_URL setting in your project’s settings.py:

```python
SPLASH_URL = 'http://localhost:8050/'
```

Don’t forget, if you’re using Docker Machine on OS X or Windows, you will need to set this to the IP address of Docker’s virtual machine, e.g.:

```python
SPLASH_URL = 'http://192.168.59.103:8050/'
```

Enable SplashDeduplicateArgsMiddleware to support the cache_args feature: it allows you to save disk space by not storing duplicate Splash arguments multiple times in the disk request queue.
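To give a sense of how cache_args is used, here is a hedged sketch built on the SplashRequest class introduced below; the script variable is a placeholder for a Lua script of the kind covered later in this post, and parse_result is a hypothetical callback:

```python
# Sketch: ask scrapy-splash to deduplicate the 'lua_source' argument so
# the (potentially large) script is stored in the request queue only once.
yield SplashRequest(
    url,
    self.parse_result,  # hypothetical callback
    endpoint='execute',
    args={'lua_source': script, 'wait': 0.5},
    cache_args=['lua_source'],
)
```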
If Splash 2.1+ is used, the middleware also allows you to save network traffic by not sending these duplicate arguments to the Splash server multiple times. To enable the middleware:

```python
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
```

Scrapy currently doesn’t provide a way to override request fingerprint calculation globally, so you will also have to set a custom DUPEFILTER_CLASS and a custom cache storage backend:

```python
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

If you already use another cache storage backend, you will need to subclass it and replace all calls to scrapy.utils.request.request_fingerprint with scrapy_splash.splash_request_fingerprint.

Now that the Splash middleware is enabled, you can use SplashRequest in place of scrapy.Request to render pages with Splash.

For example, if we wanted to retrieve the rendered HTML for a page, we could do something like this:

```python
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        # response.body is a result of the render.html call; it
        # contains HTML processed by a browser.
        # ...
```

The ‘args’ dict contains arguments to send to Splash. You can find a full list of available arguments in the HTTP API documentation. By default the endpoint is set to ‘render.json’, but here we have overridden it and set it to ‘render.html’ to get an HTML response.

## Running Custom JavaScript

Sometimes you may need to press a button or close a modal to view the page properly. Splash lets you run your own JavaScript code within the context of the web page you’re requesting. There are several ways you can accomplish this:

### Using the js_source Parameter

You can use the js_source parameter to send the JavaScript you want to execute. The code is executed after the page has finished loading but before the page is rendered, which allows you to use it to modify the page being rendered. For example, with Scrapy-Splash:

```python
# Render the page and modify its title dynamically
yield SplashRequest(
    'http://example.com',
    endpoint='render.html',
    args={'js_source': 'document.title="My Title";'},
)
```

### Splash Scripts

Splash supports Lua scripts through its execute endpoint. This is the preferred way to execute JavaScript, as you can preload libraries, choose when to execute the JavaScript, and retrieve the output.

Here’s an example script:

```lua
function main(splash)
  assert(splash:go(splash.args.url))
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end
```

You need to send that script to the execute endpoint, in the lua_source argument. This will return a JSON object containing the title:

```json
{"title": "Some title"}
```

Every script requires a main function to act as the entry point. You can return a Lua table, which will be rendered as JSON; that is what we have done here. We use the splash:go function to tell Splash to visit the URL. The splash:evaljs function lets you execute JavaScript within the page context; however, if you don’t need the result, you should use splash:runjs instead.
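From a Scrapy spider, a script like this can be sent through scrapy-splash. The sketch below makes a couple of assumptions worth flagging: the spider name and callback are hypothetical, and it relies on scrapy-splash decoding the JSON result of an execute call into response.data:

```python
import scrapy
from scrapy_splash import SplashRequest

# The title-returning Lua script from above, embedded as a string.
TITLE_SCRIPT = """
function main(splash)
  assert(splash:go(splash.args.url))
  splash:wait(0.5)
  return {title=splash:evaljs("document.title")}
end
"""

class TitleSpider(scrapy.Spider):
    name = 'title'  # hypothetical spider name

    def start_requests(self):
        yield SplashRequest(
            'http://example.com',
            self.parse_title,
            endpoint='execute',
            args={'lua_source': TITLE_SCRIPT},
        )

    def parse_title(self, response):
        # The JSON object returned by the script is decoded into
        # response.data, e.g. {"title": "Some title"}.
        self.logger.info('Page title: %s', response.data['title'])
```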
You can test your Splash scripts in your browser by visiting your Splash instance’s index page (e.g. http://localhost:8050/). It’s also possible to use Splash with IPython notebook as an interactive web-based development environment; see here for more details.

It’s often the case that you need to click a button before the page is displayed. We can do that using the splash:mouse_click function:

```lua
function main(splash)
  assert(splash:go(splash.args.url))
  local get_dimensions = splash:jsfunc([[
    function () {
      var rect = document.getElementById('button').getClientRects()[0];
      return {"x": rect.left, "y": rect.top};
    }
  ]])
  splash:set_viewport_full()
  splash:wait(0.1)
  local dimensions = get_dimensions()
  splash:mouse_click(dimensions.x, dimensions.y)
  -- Wait a split second to allow the click event to propagate.
  splash:wait(0.1)
  return splash:html()
end
```

Here we use splash:jsfunc to define a function that returns the element’s coordinates, make sure the element is visible with splash:set_viewport_full, and then click on it. Splash then returns the rendered HTML.

## Wrap up

You can find more info on running JavaScript with Splash in the docs, and for a more in-depth tutorial, check out the Splash Scripts Tutorial.

We hope this tutorial gave you a nice introduction to Splash. Please let us know if you have any questions or comments!

Originally published on the Scrapinghub blog. This post was written by Richard Dowinton, a former Software Developer at Scrapinghub.