Statistics in the Cloud With R on OpenShift - Archived

Statistics

R is open-source statistical software. It derives from S, a closed-source statistical system. R provides an environment to run and evaluate statistical computations, and it is also used for data mining. I am personally just starting with R and exploring the possibilities. But before we dive in, let’s see what the official page says:

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much of the code written for S runs unaltered under R.

Sure, it’s simple to run R on your own system, as it is packaged for most Linux distributions and official binaries are also provided for other platforms. For example, on Fedora the installation looks like

yum install R

enter … install … done. Let’s rock the statistics. So then my next thought was, you know, OpenShift is awesome, R is supposed to be awesome … is R on OpenShift awesome? I do not have hard statistical facts yet, but I do have R running on OpenShift in its simplest form, and today I will walk you through the process.

Usually I first give a short introduction to the project and then explore how to get it onto OpenShift. Today, however, we shall first deploy R and then play with it.

Installing R on OpenShift

To make it smooth-ish I have created a quickstart for OpenShift. It can be found on GitHub and it’s very simple to use.

First you need your own OpenShift account. I hope you already have one (’cause it’s free), but if you do not, please follow this guide. To continue, you will need the account, the command-line tools and SSH installed. It’s all described in the guide; once you are ready, please return here.

So, you should be all set now and we can start by creating a new gear where we shall install R. This is done by running the rhc command-line tool. While we are at it, we shall also install the quickstart for R. In my case I am using r as the application name:

rhc app create r diy --from-code=https://github.com/openshift-quickstart/r-quickstart.git

Every application on OpenShift is hosted in an environment we call a gear, and every gear comes with its own Git repository. Using the --from-code parameter we pre-populate the repository with the code of my quickstart.

So, we have the R quickstart installed; now we need to actually install R. I tried to find an automated way, but I have not had much luck so far. A pre-compiled binary with its own cartridge is something I plan for follow-up work and another blog post. So, for the time being, we need to connect to the environment and install R by hand. To do this we will connect over SSH and, once there, run an installation script that was prepared as part of the quickstart.

rhc ssh r
cd app-root/repo
./deploy.sh

Once the process finishes, we have R installed and available. You can actually navigate to the application’s web front-end. In my case, the URL would be

http://r-mjelen.rhcloud.com

r is my application name, mjelen is my namespace and the rest is constant. The front-end is a very simple Ruby application that invokes R in the background and prints the output. For those who are curious, the code looks like

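#!/usr/bin/env ruby
# Arguments: ARGV[0] = bind address, ARGV[1] = document root

require 'webrick'
include WEBrick

# Work from the document root so that tmp.R is created there.
Dir.chdir(ARGV[1])

config = {}
config.update(:Port => 8080)
config.update(:BindAddress => ARGV[0])
config.update(:DocumentRoot => ARGV[1])

server = HTTPServer.new(config)

# /exec: save the submitted R code to tmp.R, run it through R and
# return everything R prints as the response body.
server.mount_proc '/exec' do |req, res|
  File.open('tmp.R', 'w') do |f|
    f.puts(req.query["data"])
  end
  res.body = `cd #{ARGV[1]};/sandbox/r/bin/R --vanilla < tmp.R`
end

# Shut the server down cleanly on interrupt or termination.
['INT', 'TERM'].each {|signal|
  trap(signal) {server.shutdown}
}

server.start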

We use WEBrick, the built-in Ruby web server, to serve files and handle HTTP requests, and we tell it to serve data from a directory specified as the 2nd parameter passed on the command line. We also set up a special handler for the request path /exec that handles POST requests from the web interface. It is very simple: it saves the code sent to it into a file and then invokes R to evaluate the file. Whatever is printed to the standard output is sent back as the response.
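The core of the /exec handler (write the submitted code to a file, feed it to R on standard input, return whatever R prints) can be tried locally without the web server. The following is only a sketch in R itself, not part of the quickstart: it assumes an R binary is available under R.home("bin"), and the sample snippet and the temporary file are made up for illustration.

```r
# Hypothetical snippet standing in for the code a user would submit in the text box.
submitted <- c("a <- 5", "b <- 2", "a / b")

# Save the code to a file, as the handler does with tmp.R.
script <- tempfile(fileext = ".R")
writeLines(submitted, script)

# Run a fresh R process with the file on standard input and capture its output.
# Equivalent in spirit to: R --vanilla < tmp.R
r_binary <- file.path(R.home("bin"), "R")
output <- system2(r_binary, "--vanilla", stdin = script, stdout = TRUE)

# The captured lines contain the startup banner, the echoed commands and the results.
cat(output, sep = "\n")
```

Because every request starts a new R process, nothing is shared between two evaluations: variables defined in one submission are gone in the next.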

The screen should actually look like:

(Screenshot: R on OpenShift)

So when you open the web interface, you will see a text box on the left where you enter the R code to be evaluated. You click the button and the code is sent to the backend, where it is evaluated by the R runtime, and the result is sent back and printed on the right side of the screen.

Now we have R running on OpenShift and we can run R code using the web interface. Let’s take a look at some samples of what we can do with R.

Simple Math examples

Let’s start with the simplest things. Take this as a quick peek into the R world, as I am a newbie here myself and I am exploring this incredible world of statistics just as you are.

Expressions are written line by line and evaluated in order. You can use variables to store values in the program body and do all the basic stuff we would expect from a programming language. Just to get started, let’s try a very, very simple program

a <- 5
b <- 2
c <- a / b
c * b

In this simple program we introduce two variables, then a third one as the result of dividing the first by the second, and at the end we multiply the result by the divisor to get the dividend back. As we do not assign the value in the fourth statement, it is printed to the output. The result looks like

> a <- 5
> b <- 2
> c <- a / b
> c * b
[1] 5
> 
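To make the printing rule concrete: a bare expression at the top level is printed, while an assignment is not. As a small sketch (the variable name ratio is made up here), wrapping an assignment in parentheses or calling print explicitly shows the intermediate value as well:

```r
a <- 5
b <- 2

ratio <- a / b      # assignment only, nothing is printed
(ratio <- a / b)    # parentheses turn the assignment into a printed expression
print(ratio * b)    # explicit print, same effect as a bare `ratio * b`
```

The second statement prints [1] 2.5 and the third prints [1] 5.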

This is pretty obvious and should be simple for us, right? To be more precise, this is boring ;) Let’s go further. R is good at working with vectors (sequences of values) and allows us to do operations on them the same way we would with simple variables.

a <- 5
b <- c(1,2,3,4,5)
a * b 

First we introduce a variable with a single value, then we introduce a second variable holding a vector of numbers, and finally we multiply the value by the vector. As we can see, R multiplied each value in the vector by the number

> a <- 5
> b <- c(1,2,3,4,5)
> a * b 
[1]  5 10 15 20 25
> 

As a matter of fact, R thinks of the world as vectors and matrices (since it is for statistics). While you can do loops on arrays, it is much slower than doing vector and matrix operations.
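To illustrate the difference, here is a minimal sketch that computes the same product twice, once with an explicit loop and once with a single vectorized expression. The names looped and vectorized are made up for this example.

```r
a <- 5
b <- c(1, 2, 3, 4, 5)

# Explicit loop: multiply one element at a time.
looped <- numeric(length(b))
for (i in seq_along(b)) {
  looped[i] <- a * b[i]
}

# Vectorized form: one expression over the whole vector.
vectorized <- a * b

# Both approaches give the same numbers.
identical(looped, vectorized)
```

With five elements the speed difference is not noticeable. On large vectors you can wrap each variant in system.time() to compare them yourself.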

R also provides a nice set of functions that we know from the world of mathematics. There is an incredible number of them, but a very simple one is the square root.

sqrt(4)

If we evaluate this single line, we get the correct result

> sqrt(4)
[1] 2
> 
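Since R thinks in vectors, the same functions also accept a whole vector at once. A short sketch, with a made-up vector named values:

```r
values <- c(1, 4, 9, 16)

sqrt(values)        # square root of every element
sum(values)         # functions can also reduce a vector to a single number
sqrt(sum(values))   # and calls can be nested
```

The first call returns the vector 1 2 3 4 and the second returns 30.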

So far we have been in the general world of mathematics. Let’s move to the specific world of statistics.

Simple statistics examples

Let’s create a simple data sample. If I remember correctly, whenever statisticians work with data, they like it to follow a normal distribution, as that simplifies many other aspects. So, for our own example, let’s work with a small symmetric, bell-shaped sample that roughly resembles a normal distribution

a <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

So, what can we do with it? First you may want to get some basic information about the data. To do that, you can use the summary function

a <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
summary(a)

As a result we get some basic information about our sample

> summary(a)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       2       3       3       4       5 

The nice benefit of having such a simple data sample is that we can reason about whether the results provided by R are correct. In this example, it’s obvious that they are.
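For those who prefer to check rather than trust, the sketch below recomputes each piece of the summary separately using base R functions, and counts how often each value occurs.

```r
a <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

# How often each value occurs: the counts rise to a peak at 3 and fall again.
table(a)

# The individual pieces of summary(a), computed one by one.
min(a)
quantile(a, 0.25)   # 1st quartile
median(a)
sum(a) / length(a)  # the mean, spelled out
quantile(a, 0.75)   # 3rd quartile
max(a)
```

The values come out as 1, 2, 3, 3, 4 and 5, matching the summary output above, and the counts 1 2 3 2 1 show the symmetric shape of the sample.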

Of course, R allows us to do much more with data samples, like t-tests, F-tests, working with distributions, and so on. For most of it I need to dig up my university statistics book (4th semester, if I remember correctly), but that is not the point today.
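As a small taste of such tests, here is a sketch of a one-sample t-test using the built-in t.test function. The hypothesised mean of 3 is picked purely for illustration. Since it equals the sample mean, the test statistic is 0 and the p-value is 1, meaning the data give no reason to doubt that hypothesis.

```r
a <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)

# One-sample t-test: is the mean of the sample compatible with mu = 3?
result <- t.test(a, mu = 3)

result$statistic   # the t statistic
result$p.value     # probability of a deviation at least this large under mu = 3
result$conf.int    # 95% confidence interval for the mean
```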

However, there is one thing that is simple, obvious and actually quite nice. Regression allows us to fit a curve to our data sample, and the better the curve fits, the more precisely we can predict future values. R is again able to help us here.

Let’s define our sample data first

x <- c(1,2,3,4,5)
y <- c(2,4,6,8,10)

We see that as we move along x, the value of y is always 2 * x. Let’s test whether there is a correlation between the two samples

> cor(x,y)
[1] 1

Wow, it’s 1, so the correlation is perfect. Let’s try to use linear regression to predict the next value. Linear regression will try to fit the data to a function

y = m * x + b

where m is the slope of the line and b is the y-intercept. We can tell R which linear function to use by passing the formula y ~ x to lm

> lm(y ~ x)

Call:
lm(formula = y ~ x)
 
Coefficients:
(Intercept)            x  
  2.383e-15    2.000e+00  

So, R found a linear function that fits the data, with a slope of 2 and an intercept of 0 (well, as close to 0 as floating-point rounding allows). This is exactly how we constructed our data.

y = 2 * x + 0

And that’s exactly what we were looking for. lm is pretty powerful and complex and allows the user to parametrize many aspects of its inner workings. For our simple example, however, this should be enough.
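To actually predict the next value, the fitted model can be handed to predict. This is a minimal sketch: the variable name fit and the choice of x = 6 as the next point are assumptions made for this example.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

# Fit y = m * x + b by least squares.
fit <- lm(y ~ x)

# Extract the fitted intercept (b) and slope (m).
coef(fit)

# Predict y for the next x value, here assumed to be 6.
predict(fit, newdata = data.frame(x = 6))
```

The prediction comes out as 12 (up to floating-point rounding), which is what y = 2 * x gives for x = 6.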

Conclusion

I hope I made you a bit interested in R even though I have used extremely simple examples. This was done for the sake of showing how R works rather than presenting what incredibly complex things it’s capable of; that will come later. I do intend to follow up with another blog post and create a more complex environment that would also allow working with graphics, which is a natural part of statistics and thus also of R.
