Day 18: Boilerpipe–Article Extraction for Java Developers

Today for my 30 day challenge, I wanted to learn how to do text and image extraction from web links using the Java programming language. This is a common requirement in most of the content discovery websites like Prismatic. In this blog, I will show you how to use a Java library called boilerpipe to accomplish this task.

Prerequisite

  1. Basic Java knowledge is required. Install the latest Java Development Kit (JDK) on your operating system. You can either install OpenJDK 7 or Oracle JDK 7. OpenShift supports both OpenJDK 6 and 7.

  2. Sign up for an OpenShift Account.Today for my 30 day challenge, I decided to learn how to do text and image extraction from web links using the Java programming language. This is a very common requirement in most of the content discovery websites like Prismatic. In this blog, we will learn how we can use a Java library called boilerpipe to accomplish this task.

Prerequisite

  1. Basic Java knowledge is required. Install the latest Java Development Kit (JDK) on your operating system. You can either install OpenJDK 7 or Oracle JDK 7. OpenShift supports both OpenJDK 6 and 7.

  2. Sign up for an OpenShift Account. It is completely free and Red Hat gives every user three free Gears on which to run your applications. At the time of this writing, the combined resources allocated for each user is 1.5 GB of memory and 3 GB of disk space.

  3. Install the rhc client tool on your machine. RHC is a ruby gem so you need to have ruby 1.8.7 or above on your machine. To install rhc, just typesudo gem install rhc
    If you already have one, make sure it is the latest one. To update your rhc, execute the command
    sudo gem update rhc
    For additional assistance setting up the rhc command-line tool, see the following page: https://www.openshift.com/developers/rhc-client-tools-install

  4. Setup your OpenShift account using the rhc setup command. This command will help you create a namespace and upload your ssh keys to OpenShift server.

Step1 : Create a JBoss EAP application

Let’s start creating the demo application. The name of the application is newsapp.

$ rhc create-app newsapp jbosseap

If you have access to medium gears then you can use following command.

$ rhc create-app newsapp jbosseap -g medium

This will create an application container for us, called a gear, and setup all of the required SELinux policies and cgroup configuration. OpenShift will also setup a private git repository for us and clone the repository to the local system. Finally, OpenShift will propagate the DNS to the outside world. The application will be accessible at http://newsapp-{domain-name}.rhcloud.com/. Replace {domain-name} with your own unique OpenShift domain name (also sometimes called a namespace).

Step 2 : Add Maven dependencies

In the pom.xml file add the following dependency:

<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.2.0</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>2.9.1</version>
</dependency>
 
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>1.9.13</version>
</dependency>

You will also need to add a new repository

<repository>
    <id>boilerpipe-m2-repo</id>
    <url>http://boilerpipe.googlecode.com/svn/repo/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>

Also update the maven project to Java 7 by updating a couple of properties in the pom.xml file:

<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>

Now update the Maven project Right click > Maven > Update Project.

Step 3 : Enable CDI

We are using CDI for dependency injection. CDI or Context and Dependency injection is a Java EE 6 specification which enables dependency injection in a Java EE 6 project. CDI defines type-safe dependency injection mechanism for Java EE. Almost any POJO can be injected as a CDI bean.

Create a new xml file named beans.xml in the src/main/webapp/WEB-INF folder. Replace the content of beans.xml with the following:

<beans xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/beans_1_0.xsd">
 
</beans>

Step 4 : Create BoilerpipeContentExtractionService

Now we can create an BoilerpipeContentExtractionService service class which will take a url and find the title and article text from it.

import java.net.URL;
import java.util.Collections;
import java.util.List;
 
import com.newsapp.boilerpipe.image.Image;
import com.newsapp.boilerpipe.image.ImageExtractor;
 
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLDocument;
import de.l3s.boilerpipe.sax.HTMLFetcher;
 
public class BoilerpipeContentExtractionService {
 
    public Content content(String url) {
        try {
            final HTMLDocument htmlDoc = HTMLFetcher.fetch(new URL(url));
            final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
            String title = doc.getTitle();
 
            String content = ArticleExtractor.INSTANCE.getText(doc);
 
            final BoilerpipeExtractor extractor = CommonExtractors.KEEP_EVERYTHING_EXTRACTOR;
            final ImageExtractor ie = ImageExtractor.INSTANCE;
 
            List<Image> images = ie.process(new URL(url), extractor);
 
            Collections.sort(images);
            String image = null;
            if (!images.isEmpty()) {
                image = images.get(0).getSrc();
            }
 
            return new Content(title, content.substring(0, 200), image);
        } catch (Exception e) {
            return null;
        }
 
    }
}

The code above:

  1. First fetches the document at the given url.
  2. Parses the HTML document and return TextDocument.
  3. Gets the title from the text document.
  4. Extracts content from the text and returns a new instance of the application value object.

Step 5 : Enable JAX-RS

To enable JAX-RS, create a class which extends javax.ws.rs.core.Application and specify the application path using the javax.ws.rs.ApplicationPath annotation as shown below.

import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;
 
@ApplicationPath("/api/v1")
public class JaxrsInitializer extends Application{
 
 
}

Step 6 : Create ContentExtractionResource

Now we will create our ContentExtractionResource class which will return a content object as JSON. Create a new class named ContentExtractionResource and replace the code with the contents shown below:

import javax.inject.Inject;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
 
import com.newsapp.service.BoilerpipeContentExtractionService;
import com.newsapp.service.Content;
 
@Path("/content")
public class ContentExtractionResource {
 
    @Inject
    private BoilerpipeContentExtractionService boilerpipeContentExtractionService;
 
    @GET
    @Produces(value = MediaType.APPLICATION_JSON)
    public Content extractContent(@QueryParam("url") String url) {
        return boilerpipeContentExtractionService.content(url);
    }
}

Deploy to OpenShift

Finally, deploy the changes to OpenShift

$ git add .
$ git commit -am "NewApp"
$ git push

After the code is pushed and the war is successfully deployed, we can view the application running at http://newsapp-{domain-name}.rhcloud.com. My sample application is running at http://newsapp-t20.rhcloud.com.

Now you can test by submitting a link in the application ui.

That’s it for today. Keep giving feedback.

Next Steps

Categories
Java, OpenShift Online
Tags
Comments are closed.