R Analytics and Altair Grid Engine
Among statisticians and data scientists, R remains one of the most popular and versatile programming languages. According to the KDnuggets annual software survey, R has consistently ranked among the top data science tools for the past three years. In the most recent survey (2018), 50% of data scientists reported using R. In this article I’ll discuss R and explain how analysts and data scientists can use Altair Grid Engine with R and other analytic applications.
About the R language
R is open-source and runs across multiple platforms. Although it is interpreted, it can be fast, since many of its core routines are implemented in compiled code. It is also object-oriented, supports integrated data visualization, and is easily extensible. There are approximately 5,000 R add-on packages available from the Comprehensive R Archive Network (CRAN) and other sources. Data scientists and analysts use R in almost every field, in applications ranging from genetics to machine learning to natural language processing to economics. Inspired by the earlier S programming language, R was first released in 1995, with the stable version 1.0 following in 2000.
R provides multiple interfaces
Like most interpreted languages, R is typically run from the command line. Programmers who need to run R non-interactively can invoke the interpreter with R CMD BATCH and pass it an input file containing R commands. R also provides Rscript, a binary front-end that simplifies scripting. Rscript will feel familiar to Linux users: placing the “shebang” construct (#!/usr/bin/Rscript) at the top of a file, and marking the file executable, lets an R script be run directly from the command line.
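The three invocation styles described above can be sketched as follows. This is a minimal illustration, not a recommended layout: the file name hello.R and its contents are placeholders, and the script is only executed if Rscript is installed on the machine.

```shell
#!/bin/sh
# Write a tiny R script. The name hello.R and its contents are placeholders.
cat > hello.R <<'EOF'
#!/usr/bin/Rscript
# Print summary statistics for a small numeric vector.
x <- c(2, 4, 6, 8)
print(summary(x))
EOF

# Mark the script executable so the shebang line can take effect.
chmod +x hello.R

# Run the script only if R is installed here.
if command -v Rscript >/dev/null 2>&1; then
    # Option 1: the Rscript front-end, output goes to the terminal.
    Rscript hello.R
    # Option 2: batch mode, output is written to hello.Rout.
    R CMD BATCH hello.R
    # Option 3: the shebang, which needs Rscript at the exact path it names.
    if [ -x /usr/bin/Rscript ]; then
        ./hello.R
    fi
else
    echo "Rscript not found; hello.R was written but not run" >&2
fi
```

If Rscript lives somewhere other than /usr/bin on a given system, the shebang line would need to point at that location instead.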
In 2011, R became much easier to use with the initial release of open-source RStudio. RStudio is an integrated development environment (IDE) for R that includes a console, a syntax highlighting editor, and a plots window for interactive visualization of datasets. For users with small models, RStudio can be run on a desktop or laptop, and in larger environments, RStudio Server provides multiple users with access to RStudio through a web interface. While some users still use the command line, most R users will prefer to work in RStudio. The R command line is exposed as one of the panes within RStudio, so there is little reason for analysts using R to leave the IDE.
R in grid computing environments
Because R scripts run from the command line, integrating R with Altair Grid Engine is straightforward. Grid Engine users can submit R jobs using Rscript or R CMD BATCH, and RStudio integrates easily with grid environments, too. The only requirement for running R on a distributed cluster is that the R interpreter is present on each Grid Engine node.
Parallelizing R calculations across clusters is a common use case. The CRAN website devotes an entire “task view” to the topic of High-Performance and Parallel Computing with R. Although users will typically work in the RStudio interface, parallel R applications can transparently submit Grid Engine jobs to a cluster behind the scenes and aggregate the results before returning them to the analyst. Details of the underlying cluster are hidden from the R developer, who only needs to know how to code to the appropriate parallel R framework.
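As a rough sketch of what submitting an R job might look like, the snippet below writes a small Grid Engine job script and hands it to qsub. The job name, log file name, and the one-line R command are illustrative assumptions. The `#$` lines are standard Grid Engine embedded options (-N names the job, -cwd runs it in the submission directory, -j y merges stderr into stdout, -o sets the output file, -S sets the shell).

```shell
#!/bin/sh
# Generate a Grid Engine job script. r_job.sh, r_hello and r_hello.log
# are placeholder names chosen for this example.
cat > r_job.sh <<'EOF'
#!/bin/sh
#$ -N r_hello
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -o r_hello.log
# The R interpreter must be installed on whichever node runs this job.
Rscript -e 'print(summary(c(2, 4, 6, 8)))'
EOF

# Submit only when a qsub command is available on this host.
if command -v qsub >/dev/null 2>&1; then
    qsub r_job.sh
else
    echo "qsub not found; job script written to r_job.sh" >&2
fi
```

Once the job has been scheduled and has finished, the R output would appear in r_hello.log in the directory the job was submitted from.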
Among the many parallel computing packages for R, a few key modules will be of interest to Altair Grid Engine users.
- BatchJobs – BatchJobs is a general-purpose interface that makes it easy to parallelize R applications from inside RStudio or Rscript. Users can load the BatchJobs library in their script and use its functions to set up, submit, and monitor jobs and to collect results.
- Rmpi – Rmpi is an R package that provides an R-language wrapper for open-source implementations of MPI (OpenMPI, MPICH, MPICH2, and LAM-MPI). With Rmpi, developers can parallelize models in R and execute them as parallel MPI jobs on Altair Grid Engine from within RStudio.
- qsub – qsub is an R package developed specifically for Grid Engine users. It provides a convenient way to parallelize R applications by offering a Grid Engine-aware mclapply() function that transparently parallelizes operations on R lists or vectors (both common in R) using an Altair Grid Engine cluster.
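The packages above wrap job submission in R functions, and their exact calls are not covered here. To illustrate the underlying idea only, the sketch below uses a plain Grid Engine array job instead of any of those package APIs: the -t option starts ten tasks, and each task reads its index from the SGE_TASK_ID environment variable. The file names and the simulated workload are hypothetical.

```shell
#!/bin/sh
# R code run by every array task. sim_task.R is a placeholder name.
cat > sim_task.R <<'EOF'
# Each array task reads its own index from SGE_TASK_ID (falls back to 1).
task_id <- as.integer(Sys.getenv("SGE_TASK_ID", "1"))
set.seed(task_id)
# Hypothetical workload: the mean of 1000 random normal draws.
result <- mean(rnorm(1000))
# Save one result file per task so the results can be combined later.
saveRDS(result, sprintf("result_%d.rds", task_id))
EOF

# Job script requesting ten array tasks, numbered 1 to 10.
cat > sim_array.sh <<'EOF'
#!/bin/sh
#$ -N r_sim
#$ -cwd
#$ -j y
#$ -t 1-10
Rscript sim_task.R
EOF

# Submit only when a qsub command is available on this host.
if command -v qsub >/dev/null 2>&1; then
    qsub sim_array.sh
else
    echo "qsub not found; wrote sim_task.R and sim_array.sh" >&2
fi
```

After all tasks finish, the per-task files could be read back into a single R session with readRDS() and combined there, which is roughly the submit-and-aggregate pattern the packages automate.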
Notebooks are worth a special mention as well, because data scientists and statisticians working in R frequently need to collaborate and share models with others. Apache Zeppelin and Jupyter notebooks are both used as front-ends for R-based applications. RStudio also supports a notebook facility called R Markdown Notebooks. Shiny is a separate RStudio add-on that exposes models developed in R through an interactive web interface for visualizing data.
Running R in a shared cluster environment provides multiple advantages for analysts and data scientists, as well as for the IT staff who support the analytic environment:
- Using RStudio Server and popular notebooks, any user with a browser can collaborate on data and analytic models, improving analyst productivity.
- Analysts can build parallel models and use an Altair Grid Engine cluster to run deeper analysis on larger datasets in less time.
- Rather than working on local copies of data, multiple users can collaborate on shared datasets accessible from clustered compute nodes, avoiding version control and data movement issues.
- Analysts can share the same compute infrastructure among multiple analytics and machine learning frameworks, including R, Python, TensorFlow, MATLAB, and Spark.
- Finally, when extra capacity or specialized resources are required, users can optionally tap cloud resources, using Navops Launch to extend analytic workloads to the cloud seamlessly.
According to KDnuggets research, the average data scientist runs seven different analytic frameworks. R, Python, Anaconda, scikit-learn, TensorFlow, Keras, and Apache Spark are all popular choices. Most of these frameworks can benefit from distributed grid computing environments. For IT organizations, providing a shared grid environment that supports multiple analytic tools and frameworks makes sense.
Are you using Altair Grid Engine to run R, Python or other data science workloads? We’d love to hear about your experiences.