My experience running R for data processing in production
At the beginning of my current project (around the same time I was working through R for Data Science and practicing with a group of “Kagglers”) I decided to give the R language a try for some data processing tasks.
While the initial challenge was to create some ETL jobs (pulling data from a Microsoft SQL Server instance, CSV and JSON files, running some cleanup and transformations, and pushing the results to a Google Cloud SQL database instance), we also had other possibilities in sight, such as performing exploratory analysis on the data and building some simple reports on top of it.
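To make that concrete, here is a minimal sketch of one such ETL step. Table, column, and connection names are all made up for illustration; the commented-out database calls assume the DBI, odbc and RMySQL packages, and the key idea is keeping the transformation a pure function you can test without any database around.

```r
# Pure transformation step: easy to unit test without a live connection.
clean_orders <- function(df) {
  df <- df[!is.na(df$order_id), ]   # drop rows missing the key
  df$total <- round(df$total, 2)    # normalize currency precision
  df
}

# Hypothetical wiring (requires DBI + drivers and real credentials):
# library(DBI)
# src <- dbConnect(odbc::odbc(), dsn = "mssql-prod")        # SQL Server
# dst <- dbConnect(RMySQL::MySQL(), host = "cloudsql-host") # Cloud SQL
# orders <- dbGetQuery(src, "SELECT order_id, total FROM orders")
# dbWriteTable(dst, "orders_clean", clean_orders(orders), overwrite = TRUE)
```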
So, “can I use R in production?” This is the kind of question usually asked about interpreted languages (Python, the natural alternative to R, is a frequent target of the same criticism).
The answer depends heavily on what you build with the interpreted language you are developing in.
If your experience with such a language is limited to writing scripts with no logs, no error handling, no unit tests, no version control, and none of the basic DevOps practices… then the answer is no. You should not put that kind of “code” in production.
But R (just like Python) can be used both for script development (the simple code you write to validate an idea or run a “non-live” data analysis) and to build more robust applications.
This article is not intended to be a tutorial, but I have put together a short list of R packages (and guidelines) that I have been using to write R applications:
You really should put these (or similar) packages and standards in place in your R application before releasing it to handle real data on a daily basis.
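As one illustration of the error-handling point, a minimal pattern is to wrap each flow in `tryCatch()` so one bad input fails loudly in the logs instead of silently killing the whole run. The function and return convention below are hypothetical, not from the original project:

```r
# Run one flow over a file; on error, report it and return NA so the
# caller can decide whether to continue with the remaining flows.
run_flow <- function(file) {
  tryCatch({
    df <- read.csv(file)
    nrow(df)                        # stand-in for the real transformation
  }, error = function(e) {
    message(sprintf("flow failed for %s: %s", file, conditionMessage(e)))
    NA_integer_                     # signal failure to the caller
  })
}
```

Keeping the actual transformations in small pure functions like this also makes them straightforward to cover with testthat.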
After writing the first flows of your data processing application, you need to make it run somehow.
Since we had full control of the production (Google Cloud) environment, we decided to schedule our ETL jobs using the well-known Jenkins automation server (not the same instance used by developers to build other parts of the system, but a dedicated production box).
A great feature of Jenkins in our cloud scenario (besides the easy-to-use scheduling and history console) is the ability to choose where a specific job runs. The default is the Jenkins (master) machine itself, but you can easily set jobs up to run on more robust slave machines (nodes attached to the main server), which we start and stop on demand using simple gcloud commands. This gives us great flexibility and lets us save some infrastructure costs (side note #2: on Google Compute Engine the minimum charge for an instance is 10 minutes, not a full hour as on AWS).
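For the start/stop-on-demand part, the shape of the gcloud call can be sketched from R itself. The instance and zone names here are invented, and the actual `system2()` calls need the Cloud SDK installed on the Jenkins box, so they are commented out:

```r
# Build the argument vector for a gcloud start/stop call on a worker node.
gcloud_args <- function(action, instance, zone) {
  stopifnot(action %in% c("start", "stop"))
  c("compute", "instances", action, instance, paste0("--zone=", zone))
}

# system2("gcloud", gcloud_args("start", "etl-worker-1", "us-east1-b"))
# ... Jenkins runs the R job on the node ...
# system2("gcloud", gcloud_args("stop",  "etl-worker-1", "us-east1-b"))
```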
If you have instrumented your R code with logging (as mentioned above), you will have the same level of execution/history information available as you would with, say, a Java job.
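Even without a dedicated package (futile.logger and logger give you this and much more), a few lines of base R are enough to leave a timestamped, leveled trail in the Jenkins console. This helper is my own sketch, not code from the original project:

```r
# Minimal structured logger: timestamp, severity level, formatted message.
log_line <- function(level, fmt, ...) {
  msg <- sprintf("%s [%s] %s",
                 format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                 level,
                 sprintf(fmt, ...))
  cat(msg, "\n")      # shows up in the Jenkins console output
  invisible(msg)
}

log_line("INFO", "processing %d files", 3)
```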
After some weeks, as the number of flows and tables and the volume of text files being processed started to increase (reaching something like 30 GB per day), we had to start doing some tuning in order to keep our jobs running fast.
These were the key changes that really boosted the application performance:
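Whatever the specific changes, one class of tuning that typically yields large wins in R data code (offered here purely as an illustration, not necessarily one of the changes above) is replacing incremental `rbind()` in a loop, which re-copies the accumulated result on every iteration, with a single `do.call(rbind, ...)`:

```r
# Quadratic: 'out' is copied in full on every loop iteration.
slow_bind <- function(chunks) {
  out <- chunks[[1]][0, ]                     # empty frame, right columns
  for (ch in chunks) out <- rbind(out, ch)
  out
}

# Linear-ish: one rbind call over the whole list of chunks.
fast_bind <- function(chunks) do.call(rbind, chunks)
```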
R works just fine in production, as well as you could expect Python to. It is really a matter of putting in the engineering work to make sure you are shipping applications, not naive scripts, to handle real user data.
I am not going to get into the “R vs Python” discussion in this article. Of course, the choice of language should depend heavily on the project's challenges combined with the skills of your team, but I hope I have at least given you another option to choose from :D