How To Set Up Your Own Screen Scraper as a Service - Archived

In this short tutorial I will guide you through setting up your own screen-scraping service using Ruby, Sinatra and PhantomJS. At the end you’ll be able to convert any website into pictures using this service.

Create an application

Step 1: Create an OpenShift account

If you don’t already have an OpenShift account, head on over to the website and sign up. It is completely free and Red Hat gives every user three free Gears on which to run applications. At the time of this writing, the combined resources allocated to each user are 1.5 GB of memory and 3 GB of disk space.

Step 2: Install the RHC client tools

Note: If you would rather watch a screencast of this step, check out the following videos where I demo how to install the client tools.

The OpenShift client tools are written in Ruby, a very popular programming language. OS X 10.6 or later and most Linux distributions ship with Ruby installed by default, so installing the client tools is a snap. Simply issue the following command in your terminal application:

sudo gem install rhc

Step 3: Create an OpenShift application

Now that we have an OpenShift account and the client tools installed, let’s get started with our application. The first thing we need to do is create a gear that will house our application code. I will be using the command line tools that we installed in step 2, but you can perform the same action via the web console or using our IDE integration.

rhc app create screener ruby-1.9

This will create an application container for us, called a gear, and set up all of the required SELinux policies and cgroup configuration. OpenShift will also set up a private Git repository for you and propagate your DNS worldwide. This whole process should take about 30 seconds.

PhantomJS

PhantomJS is a headless WebKit browser. In essence, it gives you a browser without the need for a desktop system. It supports many use cases, such as web application testing, data management, and data generation.

SSH into your application and run these commands to install PhantomJS:

cd app-root/data/
wget http://phantomjs.googlecode.com/files/phantomjs-1.8.0-linux-x86_64.tar.bz2
tar xf phantomjs-1.8.0-linux-x86_64.tar.bz2
rm phantomjs-1.8.0-linux-x86_64.tar.bz2
mv phantomjs-1.8.0-linux-x86_64/ phantomjs

After these few steps PhantomJS is ready, so let’s make sure it really works:

cd phantomjs
./bin/phantomjs -v
# prints 1.8.0

Screen Scraping With PhantomJS

One of the examples that comes with PhantomJS is rasterize.js. This script loads a URL and exports the rendered viewport to a picture.

Let’s test whether it works by capturing Google’s homepage as a PNG picture:

./bin/phantomjs examples/rasterize.js http://www.google.com google.png

Now download the picture to your local machine:

scp <ssh url>:app-root/data/phantomjs/google.png google.png

Yes, it’s that simple. Now that we have the tools, let’s build the service.
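The same capture can also be started from Ruby, which is what the service will do later. The following is only a minimal sketch: rasterize_command and rasterize are hypothetical helper names, and the default root assumes PhantomJS was unpacked into app-root/data/phantomjs as shown above.

```ruby
# Hypothetical helper (not part of PhantomJS): builds the argument list for
# running the bundled rasterize.js example against a URL.
# The default root is an assumption based on the install steps above.
def rasterize_command(url, output,
                      root = File.expand_path("~/app-root/data/phantomjs"))
  [
    File.join(root, "bin", "phantomjs"),        # the PhantomJS binary
    File.join(root, "examples", "rasterize.js"), # the bundled example script
    url,                                         # page to load
    output                                       # picture to write
  ]
end

# Runs the capture and returns true when PhantomJS exits successfully.
# Passing the arguments separately keeps the shell from interpreting the URL.
def rasterize(url, output, root = File.expand_path("~/app-root/data/phantomjs"))
  system(*rasterize_command(url, output, root))
end
```

On the gear, calling rasterize("http://www.google.com", "google.png") should be equivalent to the command line used above.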

Enable the Sinatra Service

Clone your application’s Git repository and create a file called Gemfile in its root:

source 'https://rubygems.org'

gem 'sinatra'

Then run Bundler to install and lock the gems:

bundle install

Now open config.ru, remove all of its content, and add the code for the service:

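require 'sinatra/base'
require 'digest/md5'

class App < Sinatra::Base

  # Location of the PhantomJS install from the previous section
  PHANTOM_ROOT = File.expand_path("~/app-root/data/phantomjs")

  get '/' do
    # The message is served as HTML, so the angle brackets are escaped
    return "to specify the rendered URL use \"?url=&lt;some url&gt;\"" unless params[:url]

    # The MD5 hash of the URL doubles as the file name of the screenshot
    digest = Digest::MD5.hexdigest(params[:url])

    # Separate arguments mean the URL is never interpreted by a shell
    system(File.join(PHANTOM_ROOT, "bin/phantomjs"),
           File.join(PHANTOM_ROOT, "examples/rasterize.js"),
           params[:url],
           "public/#{digest}.png")

    digest
  end

end

run App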
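The route hands whatever arrives in the url parameter to PhantomJS. If you would rather accept only ordinary web addresses, a guard along the following lines could be called at the top of the route. This is a sketch rather than part of the original service, and renderable_url? is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical guard: accepts only absolute http or https URLs with a host.
# URI::HTTPS is a subclass of URI::HTTP, so one check covers both schemes.
def renderable_url?(value)
  uri = URI.parse(value.to_s)
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  # Strings that cannot be parsed as a URI are rejected as well
  false
end
```

Inside the route this could be used as "halt 400 unless renderable_url?(params[:url])" before the digest is computed.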

Then add the files to Git and push to OpenShift:

git add .
git commit -m "My application"
git push

And that’s all, we are ready to serve.
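Because the file name is derived from the URL, a repeated request for the same page renders it again and overwrites the same file. If that is not what you want, the service could skip the render when the picture already exists. The sketch below assumes a hypothetical helper called cached_screenshot that takes the actual capture step as a block.

```ruby
require 'digest/md5'

# Hypothetical helper: renders a URL only when its picture is not on disk yet.
# The block receives the URL and the target path and performs the capture.
# Returns the MD5 digest either way, like the route in config.ru does.
def cached_screenshot(url, dir)
  digest = Digest::MD5.hexdigest(url)
  path = File.join(dir, "#{digest}.png")
  yield(url, path) unless File.exist?(path)
  digest
end

# Possible use inside the route (phantomjs and script stand for the two paths
# passed to system in config.ru):
#
#   cached_screenshot(params[:url], "public") do |url, path|
#     system(phantomjs, script, url, path)
#   end
```

The trade-off is that a cached picture never refreshes, so a page that changes often would need the file removed or an expiry check added.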

Screen Scraper as a Service

When you navigate to the root of your application with a url parameter, the service generates a screenshot of the web page, saves it under the MD5 hash of the URL, and returns the hash. For example, in my case:

http://screener-mjelen.rhcloud.com/?url=http://www.google.com

This generates an image of Google’s homepage and prints ed646a3334ca891fd3467db131372140, which is the MD5 hash of the URL. To access the image, just navigate to the file named after the hash, with a .png extension, in the root of the application. Once again, in my case:

http://screener-mjelen.rhcloud.com/ed646a3334ca891fd3467db131372140.png
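The two requests above can also be made from a small Ruby client. This is a minimal sketch: capture_uri, image_uri and capture are hypothetical names, and the service root is whatever address your own application got. Unlike the example typed into the browser, the target URL is percent-encoded here, which matters once it carries a query string of its own.

```ruby
require 'uri'
require 'net/http'

# Builds the request URI, escaping the target URL as the "url" parameter.
def capture_uri(service_root, target_url)
  uri = URI.parse(service_root)
  uri.path = "/" if uri.path.empty?
  uri.query = URI.encode_www_form(url: target_url)
  uri
end

# Builds the address of the picture from the digest the service returned.
def image_uri(service_root, digest)
  URI.join(service_root, "/#{digest}.png")
end

# Asks the service for a screenshot and returns the picture's URI.
# This performs a network request, so it only works against a running service.
def capture(service_root, target_url)
  digest = Net::HTTP.get(capture_uri(service_root, target_url)).strip
  image_uri(service_root, digest)
end
```

Sinatra decodes the parameter before the route sees it, so the returned hash should match the one obtained from the browser.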

Conclusion

Building services on OpenShift is extremely easy and versatile. Because you can SSH into your application, you can run a huge variety of software on the PaaS and build extraordinary services.
