My experience running R for data processing in production
At the beginning of my current project (around the same time I was working through R for Data Science and practicing with a group of “Kagglers”) I decided to give the R language a try for some data processing tasks.
While the initial challenge was to create some ETL jobs (pulling data from a Microsoft SQL Server instance, CSV and JSON files, running some cleanup and transformations, and pushing the results to a Google Cloud SQL database instance), we also had other possibilities in sight, such as performing exploratory analysis on the data and building some simple reports on top of it.
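To make that concrete, here is a minimal sketch of one such ETL step. Table, column, and connection names are all made up for illustration; the commented-out database calls assume the DBI, odbc and RMySQL packages, and the key idea is keeping the transformation a pure function you can test without any database around.

```r
# Pure transformation step: easy to unit test without a live connection.
clean_orders <- function(df) {
  df <- df[!is.na(df$order_id), ]   # drop rows missing the key
  df$total <- round(df$total, 2)    # normalize currency precision
  df
}

# Hypothetical wiring (requires DBI + drivers and real credentials):
# library(DBI)
# src <- dbConnect(odbc::odbc(), dsn = "mssql-prod")        # SQL Server
# dst <- dbConnect(RMySQL::MySQL(), host = "cloudsql-host") # Cloud SQL
# orders <- dbGetQuery(src, "SELECT order_id, total FROM orders")
# dbWriteTable(dst, "orders_clean", clean_orders(orders), overwrite = TRUE)
```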
So, “can I use R in production?” This is the kind of question usually asked about interpreted languages (Python, the natural alternative to R, is a frequent target of the same criticism).
The answer depends heavily on what you build with the interpreted language you are developing in.
If your experience with such a language is limited to writing scripts with no logs, no error handling, no unit tests, no version control, and none of the basic DevOps practices… then the answer is no. You should not put that kind of “code” in production.
But R (just like Python) can be used both for script development (the simple code you write to validate an idea or run a “non-live” data analysis) and to build more robust applications.
This article is not intended to be a tutorial, but I have put together a short list of R packages (and guidelines) that I have been using to write R applications:
You really should put these (or similar) packages and standards in place in your R application before releasing it to handle real data on a daily basis.
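As one illustration of the error-handling point, a minimal pattern is to wrap each flow in `tryCatch()` so one bad input fails loudly in the logs instead of silently killing the whole run. The function and return convention below are hypothetical, not from the original project:

```r
# Run one flow over a file; on error, report it and return NA so the
# caller can decide whether to continue with the remaining flows.
run_flow <- function(file) {
  tryCatch({
    df <- read.csv(file)
    nrow(df)                        # stand-in for the real transformation
  }, error = function(e) {
    message(sprintf("flow failed for %s: %s", file, conditionMessage(e)))
    NA_integer_                     # signal failure to the caller
  })
}
```

Keeping the actual transformations in small pure functions like this also makes them straightforward to cover with testthat.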
After writing the first flows of your data processing application, you need to make it run somehow.
Since we had full control of the production (Google Cloud) environment, we decided to schedule our ETL jobs using the well-known Jenkins automation server (not the same instance used by developers to build other parts of the system, but a dedicated production box).
A great feature of Jenkins in our cloud scenario (besides the easy-to-use scheduling and history console) is the ability to choose where a specific job runs. The default is the Jenkins (master) machine itself, but you can easily set jobs up to run on more robust slave machines (nodes attached to the main server), which we start and stop on demand using simple gcloud commands. This gives us great flexibility and lets us save some infrastructure costs (side note #2: on Google Compute Engine the minimum charge for an instance is 10 minutes, not a full hour as on AWS).
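For the start/stop-on-demand part, the shape of the gcloud call can be sketched from R itself. The instance and zone names here are invented, and the actual `system2()` calls need the Cloud SDK installed on the Jenkins box, so they are commented out:

```r
# Build the argument vector for a gcloud start/stop call on a worker node.
gcloud_args <- function(action, instance, zone) {
  stopifnot(action %in% c("start", "stop"))
  c("compute", "instances", action, instance, paste0("--zone=", zone))
}

# system2("gcloud", gcloud_args("start", "etl-worker-1", "us-east1-b"))
# ... Jenkins runs the R job on the node ...
# system2("gcloud", gcloud_args("stop",  "etl-worker-1", "us-east1-b"))
```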
If you have instrumented your R code with logging (as mentioned above), you will have the same level of execution/history information available as you would with, say, a Java job.
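Even without a dedicated package (futile.logger and logger give you this and much more), a few lines of base R are enough to leave a timestamped, leveled trail in the Jenkins console. This helper is my own sketch, not code from the original project:

```r
# Minimal structured logger: timestamp, severity level, formatted message.
log_line <- function(level, fmt, ...) {
  msg <- sprintf("%s [%s] %s",
                 format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                 level,
                 sprintf(fmt, ...))
  cat(msg, "\n")      # shows up in the Jenkins console output
  invisible(msg)
}

log_line("INFO", "processing %d files", 3)
```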
After some weeks, as the number of flows and tables and the volume of text files being processed started to increase (reaching something like 30 GB per day), we had to start doing some tuning in order to keep our jobs running fast.
These were the key changes that really boosted the application performance:
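Whatever the specific changes, one class of tuning that typically yields large wins in R data code (offered here purely as an illustration, not necessarily one of the changes above) is replacing incremental `rbind()` in a loop, which re-copies the accumulated result on every iteration, with a single `do.call(rbind, ...)`:

```r
# Quadratic: 'out' is copied in full on every loop iteration.
slow_bind <- function(chunks) {
  out <- chunks[[1]][0, ]                     # empty frame, right columns
  for (ch in chunks) out <- rbind(out, ch)
  out
}

# Linear-ish: one rbind call over the whole list of chunks.
fast_bind <- function(chunks) do.call(rbind, chunks)
```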
R works just fine in production, as well as you could expect Python to. It is really a matter of putting in the engineering work to make sure you are shipping applications, not naive scripts, to handle real user data.
I am not going to get into the “R vs Python” discussion in this article. Of course, the choice of language should depend heavily on the project's challenges combined with the skills of your team, but I hope I have at least given you another option to choose from :D