_My experience running R for data processing in production_\n\n![](https://hackernoon.com/hn-images/1*pib-uo0Jwx20B9TYMY9PYA.png)\n\n### Motivation for using R\n\nAt the beginning of my current project (while I was also practicing R for [Data Science](https://hackernoon.com/tagged/data-science) with a group of “[Kagglers](http://www.kaggle.com)”), I decided to give the R language a try for some data processing tasks.\n\nThe initial challenge was to create some ETL jobs: pulling data from a Microsoft SQL Server and from CSV and JSON files, running some cleanup and transformations, and pushing the result to a [Google](https://hackernoon.com/tagged/google) Cloud SQL database instance. We also had other possibilities in sight, such as performing exploratory analysis on the data and building some simple reports on top of it.\n\nWith that in mind, R and its ecosystem of [tools](https://hackernoon.com/tagged/tools), such as [R Markdown](http://rmarkdown.rstudio.com/) and [R Shiny](https://shiny.rstudio.com/), seemed very valuable for the challenge.\n\n### Get off on the right foot\n\nSo, “can I use R in [production](https://hackernoon.com/tagged/production)?” This is the kind of question usually asked about an interpreted language (Python is a natural alternative to R and is targeted with the same type of criticism).\n\nThe answer heavily depends on what you do with the interpreted language you are developing in.\n\nIf your experience with such a language is limited to writing **_scripts_**, with no logs, no error handling, no unit tests, no version control and none of the basic DevOps practices… then the answer is no. 
You should not put that kind of “code” in production.\n\nBut R (just like Python) can be used both for **_script_** development (the simple code you write to validate an idea or run a “non-live” data analysis) and for building more robust **_applications_**.\n\nThis article is not intended to be a tutorial, but I have put together a short list of R packages (and guidelines) that I have been using to write R applications:\n\n* [logging](http://logging.r-forge.r-project.org/): a logging framework for R.\n* [optparse](https://cran.r-project.org/web/packages/optparse/index.html): handles command line arguments and is very easy to use.\n* Stick to a code style guide. I am following Google’s: [https://google.github.io/styleguide/Rguide.xml](https://google.github.io/styleguide/Rguide.xml) (side note #1: in the same git repo you can find Google’s style guides for pretty much every language out there).\n* Stick to a consistent “working directory” (e.g. the root of your project).\n* Project organization: [http://nicercode.github.io/blog/2013-04-05-projects/](http://nicercode.github.io/blog/2013-04-05-projects/)\n* Unit testing: [https://cran.r-project.org/web/packages/RUnit/vignettes/RUnit.pdf](https://cran.r-project.org/web/packages/RUnit/vignettes/RUnit.pdf) (there are other unit testing frameworks for R, but RUnit worked out very nicely for me; I would definitely take a look at [https://cran.r-project.org/web/packages/testthat/index.html](https://cran.r-project.org/web/packages/testthat/index.html) as well).\n\nYou really should put these (or similar) packages and standards in place in your R application before releasing it to handle real data on a daily basis.\n\n### Scheduling and Tuning along the way\n\nAfter writing some flows of your data processing application, you need to make it run somehow.\n\nWe had full control of the production (Google Cloud) environment. 
We decided to schedule our ETLs with **Jenkins**, the well-known DevOps tool (not the same instance used by developers to build other parts of the system, but a dedicated production box).\n\nA great feature of Jenkins in our cloud scenario (other than the easy-to-use scheduling and history console) is that we can decide where a specific job runs. The default is to run on the Jenkins (master) machine itself, but you can easily set it up to run on a more robust slave machine (a node that you attach to the main server), which we can start and stop on demand using simple [**gcloud**](https://cloud.google.com/compute/docs/gcloud-compute/) commands. This gives us great flexibility and allows us to save some infrastructure costs (side note #2: at the time of writing, the minimum charge for an instance on Google Compute Engine is 10 minutes, not a full hour like on AWS).\n\nIf you have instrumented your R code with logging (as mentioned above), you should have the same level of execution and history information available as you would have with, say, a Java job.\n\nAfter some weeks, as the number of flows and tables and the volume of text files being processed started to increase (reaching something like 30 GB per day), we had to start doing some tuning to keep our jobs running fast.\n\nThese were the key changes that really boosted the application performance:\n\n1. Use data.table instead of data.frame (it is much faster), and use the data.table-specific functions for reading and writing files, fread and fwrite; just check the [docs](https://cran.r-project.org/web/packages/data.table/data.table.pdf).\n2. Change your SQL connection layer to bulk insert data into the database, sending a list of VALUES inside each execute statement (check this [stackoverflow post](http://stackoverflow.com/questions/6889065/inserting-multiple-rows-in-mysql) for examples).\n3. 
After some tests involving different JSON parsing libraries, [jsonlite](https://cran.r-project.org/web/packages/jsonlite/index.html) worked great for our needs (it was faster than rjson).\n\n### Conclusion\n\nR works just fine in production, just as you would expect Python to. It is really a matter of putting in the engineering work to make sure you are pushing applications, not naive scripts, to handle real user data.\n\nI am not going to get into the “R vs Python” discussion in this article. Of course, the choice of language should depend heavily on both the project’s challenges and the skills your team already has, but I hope I have at least given you another option to choose from :D.
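
As a footnote to tuning item 2 above, here is a minimal sketch of what "sending a list of VALUES inside each execute statement" can look like in base R. The helper names (`sql_literal`, `build_bulk_inserts`), the `events` table and its columns are made up for illustration and are not the code from the project. The hand-rolled quoting is only there to keep the sketch self-contained; real code should rely on the database driver's own quoting or parameter binding.

```r
# Hypothetical helper: turn one column into SQL literals.
# Numbers are passed through as text, NA becomes NULL, and everything else
# is single-quoted with embedded single quotes doubled.
sql_literal <- function(x) {
  if (is.numeric(x)) {
    out <- as.character(x)
  } else {
    out <- paste0("'", gsub("'", "''", as.character(x), fixed = TRUE), "'")
  }
  out[is.na(x)] <- "NULL"
  out
}

# Hypothetical helper: build one multi-row INSERT statement per chunk of rows.
build_bulk_inserts <- function(df, table, chunk_size = 1000L) {
  if (nrow(df) == 0L) return(character(0))
  cols <- paste(names(df), collapse = ", ")
  # One "(v1, v2, ...)" string per row of the data frame.
  literals <- unname(lapply(df, sql_literal))
  rows <- paste0("(", do.call(paste, c(literals, sep = ", ")), ")")
  # Group the row strings into chunks and collapse each chunk into one statement.
  chunk_id <- ceiling(seq_along(rows) / chunk_size)
  vapply(
    split(rows, chunk_id),
    function(chunk) {
      sprintf("INSERT INTO %s (%s) VALUES %s",
              table, cols, paste(chunk, collapse = ", "))
    },
    character(1),
    USE.NAMES = FALSE
  )
}

# Example data (invented for this sketch).
events <- data.frame(
  id = 1:3,
  name = c("a", "O'Brien", NA),
  stringsAsFactors = FALSE
)

statements <- build_bulk_inserts(events, "events", chunk_size = 2L)
print(statements)
```

Each resulting string can then be sent with whatever execute call your connection layer provides. With the default chunk size of 1000, a 30,000-row load becomes 30 statements instead of 30,000.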