Jupyter on OpenShift Part 3: Creating a S2I Builder Image

In the prior post in this series I described the steps required to run the Jupyter Notebook images supplied by the Jupyter Project developers. When run, these notebook images provide an empty workspace with no initial notebooks to work with. Depending on the image used, they would include a range of pre-installed Python packages, but they may not have all packages installed that a user needs.

To make use of these Jupyter Notebook images, a user would need to manually upload any notebooks and data files. They would also need to manually install any additional Python packages they required.

Although it was possible to attach a persistent volume so that the notebooks and data files, along with any of the changes made, were retained across a restart of the container, any additional Python packages would need to be reinstalled each time. This was necessary as the Python packages are installed into directories outside of the persistent volume.

To combat this, a user could create a custom Docker-formatted image themselves, which builds on the base image, but this means that they have to know how to create a Docker-formatted image and have the tools available to do it.

Source-to-Image Builders

When using OpenShift, an alternative that can make life much easier for users is to Source-to-Image (S2I) enable the base images that users commonly extend. This is something OpenShift already does for common programming languages such as Java, NodeJS, PHP, Python and Ruby.

S2I builder images are run against a copy of a designated Git repository containing a user's files. An assemble script within the image does whatever is required to process those files and create a runnable image for an application. When that image is then started, a run script in the image starts up the application.

Using S2I, a user doesn’t need to know what steps are necessary to convert their files into a runnable application. Instead all the smarts are embodied in the assemble and run scripts of the S2I builder image.

Although the S2I builder images are typically used to take application source code and create a runnable image for a web application, they can also be used to take any sort of input files and combine them with an existing application.

In our case, we can use an S2I enabled image to perform two tasks for us. The first is to install any Python packages required for a set of notebook files, and the second is to copy the notebook files and data files into the image. This makes it very easy for a user to create a custom image containing everything they need. When the Jupyter Notebook instance is started up, everything will be pre-populated and they can start working straight away.

This sort of system is especially useful in a teaching environment as a way of distributing course material. This is because you know that students are going to have the correct versions of software and Python packages that are required by the notebooks.

Customising the Image

To create a S2I enabled version of the Jupyter Notebook images for users, we are going to create a custom Docker-formatted image. To do this we start out with a Dockerfile. This will include a number of customisations, so we will go through them one at a time to understand what they do.

First up we need to indicate what base image we are building on top of. We are going to use the jupyter/minimal-notebook image. We start out with this image rather than the scipy-notebook image, as we will rely on the S2I build process to add any additional packages that are needed, rather than bundling them all into the base image to begin with. This keeps the final image as small as possible, so it isn’t bloated by Python packages which are installed but never used.

FROM jupyter/minimal-notebook:latest

Next we need to switch to running commands listed in the Dockerfile as the root user. This is so that additional operating system packages can be installed.

USER root

We now install those operating system packages. In this case we install the rsync package so that the OpenShift oc rsync command can be run to copy files from a local system into a running container if necessary. The libav-tools package is also installed. This is a system package which gets installed as part of the scipy-notebook image. Because we want all the same Python packages that the scipy-notebook image has pre-installed to be installable using the S2I enabled version of the minimal-notebook, we install it here.

RUN apt-get update && \
    apt-get install -y --no-install-recommends libav-tools rsync && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

The operating system part of the image is now set up, but we still have to add the S2I parts. The first part of this is to set some labels on the image.

LABEL io.k8s.description="S2I builder for Jupyter (minimal-notebook)." \
      io.k8s.display-name="Jupyter (minimal-notebook)" \
      io.openshift.expose-services="8888:http" \
      io.openshift.tags="builder,python,jupyter" \
      io.openshift.s2i.scripts-url="image:///opt/app-root/s2i/bin"

These labels serve a number of purposes. This includes providing a name and description which can be displayed by OpenShift in the web console and on the command line when using the S2I enabled image. We also let OpenShift know which ports the running image will expose and most important of all, where the assemble and run scripts are installed in the image.

Having defined where the assemble and run scripts will reside, we now need to copy them into the image. We will get to the content of these scripts later.

COPY s2i /opt/app-root/s2i

We no longer need to be the root user, so we switch back to the user that the base image was originally set up to run as.

USER 1000

An important difference, though, is that we have not specified the user by the name jovyan, as the base image did. Instead we have specified the integer user ID for that user.

This is necessary because when a user name is used, the S2I build process cannot know for sure that the user name doesn’t actually map to the user ID 0, corresponding to the root user. As a security measure to ensure that the subsequent assemble script can’t run as the root user, the USER statement must therefore use an integer user ID and not a user name.
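
As a rough illustration of that rule, the following sketch checks whether the last USER instruction in a Dockerfile is numeric. The last_user_is_numeric function is a hypothetical helper written for this post, not something S2I or OpenShift provides, and it only handles simple Dockerfiles with no line continuations inside the USER instruction.

```shell
# Hypothetical helper (not part of S2I): succeed only when the last USER
# instruction in the given Dockerfile is a purely numeric user ID.
last_user_is_numeric() (
    # Remember the argument of every USER instruction; the last one wins.
    user=$(awk 'toupper($1) == "USER" { u = $2 } END { print u }' "$1")
    user=${user%%:*}    # drop an optional ":group" suffix
    case "$user" in
        ''|*[!0-9]*) exit 1 ;;   # empty, or contains a non-digit
        *) exit 0 ;;
    esac
)

# Example run against a throwaway Dockerfile.
demo=$(mktemp -d)
printf 'FROM jupyter/minimal-notebook:latest\nUSER root\nUSER 1000\n' \
    > "$demo/Dockerfile"
if last_user_is_numeric "$demo/Dockerfile"; then
    echo "final USER is numeric"
else
    echo "final USER is a name"
fi
```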

Finally, we override the existing command run when the image is started, to be the run script we copied into the image.

CMD [ "/opt/app-root/s2i/bin/run" ]

This isn’t strictly needed because when the S2I build process is run to generate an image, it will again explicitly set the command to be the run script. We add it here so that if the image is itself run as an application, rather than as a builder image, it will still start up using our run script, rather than the original command set up in the minimal-notebook image. This enables us to still perform additional steps in this case before the Jupyter Notebook is started.

The Build Script

The assemble script triggered during the build phase of the S2I process resides at the path s2i/bin/assemble and is copied into the location /opt/app-root/s2i/bin/assemble of the image.

The assemble script starts out with:

#!/bin/bash

set -eo pipefail

The set -eo pipefail ensures that if any step within the script fails, the script will exit immediately. This means it isn’t necessary to check the status of each command and explicitly exit from the script. If the script exits with any failure, the build process will have been deemed to have failed.
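
To see what the two options do, here is a small experiment that is separate from the assemble script itself. It runs a pipeline whose first command fails, with and without pipefail, and then shows that a script using -e stops at the first failing command.

```shell
# Without pipefail, a pipeline reports only the status of its last command,
# so the failure of "false" goes unnoticed.
status_plain=0
bash -c 'false | true' || status_plain=$?

# With pipefail, a failure anywhere in the pipeline becomes its status.
status_pipefail=0
bash -c 'set -o pipefail; false | true' || status_pipefail=$?

# With -e, the script stops at the first failing command, so echo never runs.
output_errexit=$(bash -c 'set -e; false; echo "still running"' || true)

echo "without pipefail: $status_plain"
echo "with pipefail:    $status_pipefail"
echo "after set -e:     '${output_errexit}'"
```

The first status is 0 and the second is 1, while the last line shows an empty string because the echo was never reached.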

When the assemble script is run, the original contents of the Git repository that the S2I builder image is run against will have been copied into the /tmp/src directory. For our S2I builder this would comprise the notebooks and data files, along with a Python requirements.txt file listing the Python packages that need to be installed.

The next step is to copy all these files from /tmp/src to their proper location in the image.

cp -Rf /tmp/src/. /home/$NB_USER/work

rm -rf /tmp/src

As the Jupyter Notebook when run by the base image uses the /home/$NB_USER/work directory, we will use that location.

We copy the files rather than move them into place so they will be merged with anything already present in the directory. Once they are copied, we remove the original contents of /tmp/src so we do not have duplicates of the files wasting space in the final image.
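
The merge behaviour can be tried out safely with throwaway directories. In this sketch the directory and file names are made up for the demonstration. The trailing /. on the source path is what makes cp copy the contents of the directory, hidden files included, rather than the directory itself.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/src/data" "$demo/work"

# A file that is already present in the target directory.
echo "existing" > "$demo/work/README.txt"

# Stand-ins for the uploaded source, including a hidden file.
echo "{}" > "$demo/src/example.ipynb"
echo "1,2,3" > "$demo/src/data/values.csv"
echo "hidden" > "$demo/src/.settings"

# Same commands as the assemble script, pointed at the throwaway directories.
cp -Rf "$demo/src/." "$demo/work"
rm -rf "$demo/src"

# List everything that ended up in the target, hidden files included.
ls -A "$demo/work"
```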

If there is a requirements.txt file, we now install any Python packages listed in it. As the Jupyter Project images use the Anaconda Python distribution rather than that from the Python Software Foundation, we use the conda package manager rather than pip. This will result in any packages being installed from the conda package index rather than PyPI.

if [ -f /home/$NB_USER/work/requirements.txt ]; then
    (cd /home/$NB_USER/work && conda install -y --file requirements.txt)
    rm /home/$NB_USER/work/requirements.txt
fi

At the end we remove the requirements.txt file. This is so that it will not interfere with anything if the resultant image is in turn used as a S2I builder. In other words, with S2I builders one can create layered builds just like with normal Docker builds. The requirements.txt file is removed so that such a subsequent build doesn’t try to install all the packages again if no requirements.txt file was provided in the Git repository for the subsequent build.

That is all there is to the assemble script. Just make sure it is also made executable.

The Runtime Script

The run script is run when the final image is started. It resides at the path s2i/bin/run and is copied into the location /opt/app-root/s2i/bin/run of the image.

The original base image used a script located at /usr/local/bin/start-notebook.sh. The run script wraps the execution of this script to make it easier to set a password for the Jupyter Notebook via an environment variable passed in by OpenShift.

#!/bin/bash

set -eo pipefail

NOTEBOOK_ARGS=

if [ x"${JUPYTER_NOTEBOOK_PASSWORD}" != x"" ]; then
    NOTEBOOK_ARGS=--NotebookApp.password=`python -c "import notebook.auth; \
        print(notebook.auth.passwd(\"$JUPYTER_NOTEBOOK_PASSWORD\"))"`
    unset JUPYTER_NOTEBOOK_PASSWORD
fi

exec /usr/local/bin/start-notebook.sh $NOTEBOOK_ARGS

A key detail of this script is that because it is wrapping the original script, it is important to use exec when running the original. This ensures that the original script replaces the current process, rather than running as a sub process of it, and that the Jupyter Notebook application inherits process ID 1. If this is not done, then signals will not be delivered to the Jupyter Notebook application, preventing the container from shutting down correctly.
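
The effect of exec can be observed outside of a container. The sketch below uses three throwaway scripts, whose names are invented for the demonstration: a stand-in application that prints its own process ID, and two wrappers that print theirs before starting it, one with exec and one without.

```shell
demo=$(mktemp -d)

# Stand-in for the real application: it just reports its own process ID.
cat > "$demo/app.sh" <<'EOF'
echo "$$"
EOF

# Wrapper that hands over to the application with exec.
cat > "$demo/with-exec.sh" <<'EOF'
echo "$$"
exec sh "$(dirname "$0")/app.sh"
EOF

# Wrapper that runs the application as a child process instead.
cat > "$demo/without-exec.sh" <<'EOF'
echo "$$"
sh "$(dirname "$0")/app.sh"
exit $?
EOF

# Count the distinct process IDs printed by each wrapper and its application.
with_exec=$(sh "$demo/with-exec.sh" | sort -u | wc -l | tr -d ' ')
without_exec=$(sh "$demo/without-exec.sh" | sort -u | wc -l | tr -d ' ')
echo "distinct PIDs with exec: $with_exec"
echo "distinct PIDs without exec: $without_exec"
```

With exec there is only one process ID, as the application takes over the process of the wrapper. Without it there are two, and in a container it would be the wrapper, not the application, holding process ID 1.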

As with the assemble script, the run script should be made executable.
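
A quick way to confirm this before building is a check along the following lines. The check_s2i_scripts function is a hypothetical helper written for this post, and the example runs it against a throwaway directory laid out like the repository, in which only the assemble script has been made executable.

```shell
# Hypothetical helper: list any S2I script under the given directory that is
# missing or not executable, and fail if there is at least one.
check_s2i_scripts() (
    status=0
    for name in assemble run; do
        if [ ! -x "$1/s2i/bin/$name" ]; then
            echo "not executable: $1/s2i/bin/$name"
            status=1
        fi
    done
    exit "$status"
)

# Example run against a throwaway directory laid out like the repository.
demo=$(mktemp -d)
mkdir -p "$demo/s2i/bin"
printf '#!/bin/bash\n' > "$demo/s2i/bin/assemble"
printf '#!/bin/bash\n' > "$demo/s2i/bin/run"
chmod +x "$demo/s2i/bin/assemble"
report=$(check_s2i_scripts "$demo" || true)
echo "$report"
```

In the real repository, running chmod +x s2i/bin/assemble s2i/bin/run before committing is enough, as Git records the executable bit.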

Building the Builder Image

The files for the S2I enabled version of the Jupyter Project minimal notebook described above can be found on the s2i-anyuid branch of the Git repository at:

https://github.com/getwarped/s2i-minimal-notebook

To build the image using OpenShift you can use the command:

oc new-build https://github.com/getwarped/s2i-minimal-notebook#s2i-anyuid \
    --name s2i-minimal-notebook

This will only build the image. Once the build has completed, to test that it can still be deployed as an application, run:

oc new-app s2i-minimal-notebook --env JUPYTER_NOTEBOOK_PASSWORD=grumpy

oc expose svc/s2i-minimal-notebook

In this case a password is being set using an environment variable. Because of what was added to the run script to detect the password and configure Jupyter Notebook, you do not need to determine the login token from the application logs as was the case in the previous post. Once the Jupyter Notebook is deployed, visit its URL and enter the password you supplied to log in.

Do note that, as described in the previous post, the Jupyter Project images will not work when run with an assigned user ID. The changes we have made do not change that. As such, you still need to allow the service account that the images run under to run images as any UID. This is done by having a system administrator run:

oc adm policy add-scc-to-user anyuid -z default -n myproject

In this case the command would be applied in the project called myproject.

Using the Builder Image

Now that we have our S2I enabled version of the Jupyter Project image, let's use it to deploy a Jupyter Notebook instance which is pre-populated with a set of notebook files. For this run:

oc new-app \
    s2i-minimal-notebook~https://github.com/jupyter/notebook \
    --context-dir docs/source/examples/Notebook \
    --env JUPYTER_NOTEBOOK_PASSWORD=grumpy \
    --name notebook-samples

oc expose svc/notebook-samples

This particular set of notebooks will help you with learning more about working with Jupyter Notebooks. It doesn’t require any additional Python packages.

Where a set of Jupyter Notebooks does have a requirement that additional Python packages be installed, all that needs to be done is to include a requirements.txt file in the Git repository, which lists what those packages are, and if necessary the specific versions of those packages.
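
As a sketch of what such a file might contain, the following writes an example requirements.txt into a throwaway directory. The choice of packages and the version number are assumptions made purely for illustration. Because the assemble script passes the file to conda install, the version pin uses the conda style with a single equals sign.

```shell
# Write an illustrative requirements.txt; the package choices and the
# version number are assumptions for the example only.
demo=$(mktemp -d)
cat > "$demo/requirements.txt" <<'EOF'
numpy=1.12
pandas
matplotlib
EOF

# Show the file as it would be committed to the top of the Git repository.
cat "$demo/requirements.txt"
```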

An example of a Git repository providing both a set of Jupyter Notebooks and the requirements.txt file listing what packages it needs, is the set of simple exercises for using the numpy package that has been put together by Nicolas Rougier based on posts from the numpy package mailing list.

To build up an image incorporating these notebooks and deploy it, run:

oc new-app s2i-minimal-notebook~https://github.com/rougier/numpy-100 \
    --env JUPYTER_NOTEBOOK_PASSWORD=grumpy \
    --name numpy-examples

oc expose svc/numpy-examples

This time, because of the requirements.txt file being present, the numpy package will be automatically installed and incorporated into the image.

Adding the Persistent Volume

In the previous post it was demonstrated how a persistent volume could be used in conjunction with a Jupyter Notebook image. To do this a persistent volume claim was made and the volume mounted at /home/jovyan/work inside the container. If we do that with the result of running the S2I enabled image, the volume will be mounted on top of the directory containing the notebooks which were copied into the image.

To still be able to use a persistent volume, so that we can pre-populate an image with notebooks and data files and then work on them without losing any work, a different strategy is needed. In the next post I will explain how we can enhance the S2I builder to be aware of a persistent volume, copy any notebooks and data files into the persistent volume the first time the image is run, and then use that volume from then on.
