Jupyter on OpenShift Part 4: Adding a Persistent Workspace

This is the fourth post in this series about running Jupyter Notebooks on OpenShift. In the second post I covered how to run the Jupyter Notebook images from the Jupyter Project on OpenShift. In the third post I described how to create a custom version of the notebook image which was S2I enabled. The S2I enabled image made it easy to deploy a Jupyter Notebook instance which was pre-populated with a set of notebooks and data files, along with any additional Python packages the notebooks required.

When the original Jupyter Notebook image was deployed it provided an empty workspace. You could upload your own notebooks and data files, but they, along with any changes you made, would be lost if the container was restarted. The solution to this was to claim a persistent volume and mount it into the container at the location which the Jupyter Notebook application used as the notebook directory.

When using the S2I enabled image, it became possible to pre-populate the image with notebooks and data files, as well as have any Python packages required by the notebooks pre-installed. In this case though, because the files are part of the image, changes you make will again be lost when the container restarts. We can’t just mount a persistent volume on top of the notebook directory, as that will hide the files which were pre-populated.

To provide persistence for any work done, it becomes necessary to copy any notebooks and data files from the image into the persistent volume the first time the image is started with that persistent volume. In this blog post I will describe how the S2I enabled image can be extended to do this automatically, and also look at some other issues related to saving your work.

Overriding the Notebook Directory

Up to this point, the contents of the run script in the S2I enabled image are:

#!/bin/bash

set -eo pipefail

NOTEBOOK_ARGS=

if [ x"${JUPYTER_NOTEBOOK_PASSWORD}" != x"" ]; then
    NOTEBOOK_ARGS=--NotebookApp.password=`python -c "import notebook.auth; \
        print(notebook.auth.passwd(\"$JUPYTER_NOTEBOOK_PASSWORD\"))"`
    unset JUPYTER_NOTEBOOK_PASSWORD
fi

exec /usr/local/bin/start-notebook.sh $NOTEBOOK_ARGS

This script file wrapped the execution of the original start-notebook.sh script in order to make it easier to define a password for the Jupyter Notebook instance via an environment variable when deploying the image.

As this script is run to start the Jupyter Notebook, we can extend it to add in additional steps.

What we will do in this case, just prior to running start-notebook.sh, is add the following:

JUPYTER_NOTEBOOK_DIR=${JUPYTER_NOTEBOOK_DIR:-/home/$NB_USER/work}

if [ x"${PERSISTENT_VOLUME_ROOTDIR}" != x"" ]; then
    PERSISTENT_VOLUME_WORKSPACE=${PERSISTENT_VOLUME_WORKSPACE:-work}

    WORKDIR=${PERSISTENT_VOLUME_ROOTDIR}/${PERSISTENT_VOLUME_WORKSPACE}

    if [ ! -d ${WORKDIR} ]; then
        mkdir -p ${WORKDIR}
        cp -rp ${JUPYTER_NOTEBOOK_DIR}/. ${WORKDIR}
    fi

    JUPYTER_NOTEBOOK_DIR=${PERSISTENT_VOLUME_ROOTDIR}
fi

NOTEBOOK_ARGS="$NOTEBOOK_ARGS --notebook-dir=${JUPYTER_NOTEBOOK_DIR}"

cd ${JUPYTER_NOTEBOOK_DIR}

The first thing this change adds is:

JUPYTER_NOTEBOOK_DIR=${JUPYTER_NOTEBOOK_DIR:-/home/$NB_USER/work}

The default directory used by the Jupyter Notebook image as the notebook directory is /home/$NB_USER/work. Independent of the other changes related to use of a persistent volume, allowing this to be overridden can be useful in its own right. For example, the Git repository from which an image is built may have multiple directories containing different sets of notebooks, but you may want to focus on just one of them.

You could also use the --context-dir option when using oc new-app to do the same thing, but then any requirements.txt file listing Python packages to be installed would also need to be in that sub-directory, whereas it may only exist at the root of the Git repository. Overriding the notebook directory therefore gives us a little more flexibility.

Whatever the notebook directory ends up being, we also add at the end:

NOTEBOOK_ARGS="$NOTEBOOK_ARGS --notebook-dir=${JUPYTER_NOTEBOOK_DIR}"

cd ${JUPYTER_NOTEBOOK_DIR}

This ensures that Jupyter Notebook uses the specified directory as the notebook directory. We also change the working directory to be the same directory. This is so that when a terminal is created using the Jupyter Notebook web interface, we end up in the same directory.
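
The default-with-override behaviour relies on the shell's ${VAR:-default} expansion. The following is a minimal sketch that can be run outside the image to see the effect. The resolve_notebook_dir function and the /home/jovyan/work/tutorials path are hypothetical and used only for illustration; they are not part of the run script.

```shell
#!/bin/bash

# Hypothetical helper mirroring the expansion used in the run script, so
# the effect of setting or not setting JUPYTER_NOTEBOOK_DIR can be compared.
resolve_notebook_dir() {
    echo "${JUPYTER_NOTEBOOK_DIR:-/home/$NB_USER/work}"
}

# Assumed user name for this demonstration; the image sets NB_USER itself.
NB_USER=jovyan

# Variable unset: falls back to the default notebook directory.
unset JUPYTER_NOTEBOOK_DIR
default_dir=$(resolve_notebook_dir)

# Variable set (for example with --env on oc new-app): the value wins.
JUPYTER_NOTEBOOK_DIR=/home/jovyan/work/tutorials
override_dir=$(resolve_notebook_dir)

echo "default:  $default_dir"
echo "override: $override_dir"
```

Note that the :- form also applies the default when the variable is set but empty, which matches how the run script treats an empty value elsewhere.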

Using a Persistent Volume

In the middle of the changes above was the part dealing with a persistent volume:

if [ x"${PERSISTENT_VOLUME_ROOTDIR}" != x"" ]; then
    PERSISTENT_VOLUME_WORKSPACE=${PERSISTENT_VOLUME_WORKSPACE:-work}

    WORKDIR=${PERSISTENT_VOLUME_ROOTDIR}/${PERSISTENT_VOLUME_WORKSPACE}

    if [ ! -d ${WORKDIR} ]; then
        mkdir -p ${WORKDIR}
        cp -rp ${JUPYTER_NOTEBOOK_DIR}/. ${WORKDIR}
    fi

    JUPYTER_NOTEBOOK_DIR=${PERSISTENT_VOLUME_ROOTDIR}
fi

What we do here is look for the presence of the environment variable PERSISTENT_VOLUME_ROOTDIR, the idea being that if you add a persistent volume to the container, you specify its location using that environment variable.

We then calculate a sub-directory within the persistent volume into which the notebooks and data files will be copied. The name of this subdirectory will default to being called work but can be overridden using the PERSISTENT_VOLUME_WORKSPACE environment variable.

Having worked out the sub-directory in the persistent volume to use, if the directory doesn’t already exist, we copy across the notebooks and data files from the original notebook directory. In other words, a copy is only made the first time the image is started against that persistent volume.

Finally, the notebook directory is updated to be the root directory of the persistent volume, so that Jupyter Notebook will use it. Any changes made are then written to the persistent volume and are thus available after a restart.
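
To see that the copy really does happen only once, the same logic can be exercised outside the image. This is a sketch under stated assumptions: temporary directories stand in for the image's notebook directory and the mounted volume, and the populate_workspace function, the IMAGE_DIR variable and the example.ipynb file name are hypothetical names introduced for the demonstration.

```shell
#!/bin/bash

# Stand-ins for the image's notebook directory and the mounted volume.
# In the image these would be /home/$NB_USER/work and the mount path.
IMAGE_DIR=$(mktemp -d)
PERSISTENT_VOLUME_ROOTDIR=$(mktemp -d)
unset PERSISTENT_VOLUME_WORKSPACE
echo "original" > "$IMAGE_DIR/example.ipynb"   # hypothetical file name

# Same logic as the run script, wrapped in a function so it can be
# called once per simulated container start.
populate_workspace() {
    local workspace="${PERSISTENT_VOLUME_WORKSPACE:-work}"
    local workdir="$PERSISTENT_VOLUME_ROOTDIR/$workspace"

    if [ ! -d "$workdir" ]; then
        mkdir -p "$workdir"
        cp -rp "$IMAGE_DIR/." "$workdir"
    fi
}

# First start: the workspace does not exist, so the files are copied.
populate_workspace

# The user edits their copy on the persistent volume.
echo "edited" > "$PERSISTENT_VOLUME_ROOTDIR/work/example.ipynb"

# Second start: the workspace exists, so the edited file is left alone.
populate_workspace
cat "$PERSISTENT_VOLUME_ROOTDIR/work/example.ipynb"
```

The existence of the sub-directory is the only marker used, so files added to the image by a later build will not appear in a workspace that has already been populated.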

Reverting to the Original Files

These changes mean the S2I enabled image can be used to create a new image which is pre-populated with everything that is required. At the same time, the notebooks and data files are automatically copied into a persistent volume so anyone working with them doesn’t lose their changes.

One example of where this way of distributing notebooks and data files can be used is in a teaching environment. The use of an S2I builder ensures that students have the correct notebooks and data files, as well as a runtime environment with the correct version of Python, and any Python packages required by the notebooks.

What though about the case where a student accidentally deletes their copy of a file, or mucks up the code within a Jupyter Notebook and wants to revert back to the original file?

The first option, if they want to restore a single file, is to open up a terminal within Jupyter Notebook. This will provide them with an interactive command shell. They can then copy the original file themselves from the /home/$NB_USER/work directory into their directory on the persistent volume.
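
Restoring a single file in this way might look like the following sketch. The restore_file function and the example.ipynb file name are hypothetical, and temporary directories stand in for the image's notebook directory and the workspace on the persistent volume so that the example can be run anywhere.

```shell
#!/bin/bash

# Hypothetical helper: copy one file from the directory holding the
# original files back into the workspace on the persistent volume.
#   $1  directory holding the original files (in the image)
#   $2  workspace directory on the persistent volume
#   $3  path of the file relative to both directories
restore_file() {
    local original_dir="$1"
    local workspace_dir="$2"
    local relative_path="$3"

    # Recreate the parent directory in case it was deleted as well.
    mkdir -p "$(dirname "$workspace_dir/$relative_path")"
    cp -p "$original_dir/$relative_path" "$workspace_dir/$relative_path"
}

# Demonstration using temporary directories as stand-ins.
ORIGINAL_DIR=$(mktemp -d)
WORKSPACE_DIR=$(mktemp -d)
echo "original" > "$ORIGINAL_DIR/example.ipynb"
echo "broken" > "$WORKSPACE_DIR/example.ipynb"

restore_file "$ORIGINAL_DIR" "$WORKSPACE_DIR" example.ipynb
cat "$WORKSPACE_DIR/example.ipynb"
```

In a running notebook the first argument would be /home/$NB_USER/work and the second the workspace sub-directory under the volume mount path.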

The second option is to delete the sub-directory from the persistent volume and trigger a new deployment of the Jupyter Notebook so that it is restarted. If they don’t have access to OpenShift to do this, because the Jupyter Notebook instance was provisioned for them, they could instead run kill -TERM 1 from a terminal created in Jupyter Notebook. More likely though they would be using the Jupyter Notebook instance via JupyterHub, in which case the control panel provided by JupyterHub would allow them to stop and start the Jupyter Notebook instance. Either way, the sub-directory will be recreated with a fresh copy of the files the next time Jupyter Notebook is started.

Deleting the whole directory does mean they will lose any other changes made in the directory. A final option, therefore, is to rename the sub-directory in the persistent volume rather than delete it, then restart the Jupyter Notebook instance. That way they will get a fresh copy of the files but still have their previous copy, and can selectively copy other files across as need be, from either Jupyter Notebook or the terminal.
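
The rename step could be sketched as below. The set_aside_workspace function and the timestamp suffix are assumptions made for illustration; any new name that doesn't clash with the workspace name would do. A temporary directory stands in for the persistent volume.

```shell
#!/bin/bash

# Hypothetical helper: move the workspace aside under a timestamped name
# so the next start of the notebook recreates it from the image.
#   $1  root directory of the persistent volume
#   $2  name of the workspace sub-directory
# Prints the name the workspace was renamed to.
set_aside_workspace() {
    local rootdir="$1"
    local workspace="$2"
    local backup="$workspace.$(date +%Y%m%d%H%M%S)"

    mv "$rootdir/$workspace" "$rootdir/$backup"
    echo "$backup"
}

# Demonstration using a temporary directory as the volume.
VOLUME_DIR=$(mktemp -d)
mkdir "$VOLUME_DIR/samples"
echo "edited" > "$VOLUME_DIR/samples/example.ipynb"

BACKUP_NAME=$(set_aside_workspace "$VOLUME_DIR" samples)
ls "$VOLUME_DIR"
```

After the rename, the Jupyter Notebook instance still needs to be restarted using one of the methods described above before the fresh copy appears.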

Steps to Deploy Everything

The files for this version of the Jupyter Project minimal notebook can be found on the s2i-anyuid-pvc branch of the Git repository referenced in the build command below.

To build the image using OpenShift you can use the command:

oc new-build https://github.com/getwarped/s2i-minimal-notebook#s2i-anyuid-pvc \
    --name s2i-minimal-notebook

Once the image is built, to deploy a Jupyter Notebook instance and declare that we intend to use a persistent volume, we use:

oc new-app \
    s2i-minimal-notebook~https://github.com/jupyter/notebook \
    --context-dir docs/source/examples/Notebook \
    --env JUPYTER_NOTEBOOK_PASSWORD=grumpy \
    --env PERSISTENT_VOLUME_ROOTDIR=/home/jovyan/volume \
    --env PERSISTENT_VOLUME_WORKSPACE=samples \
    --name notebook-samples

oc set volume dc/notebook-samples --add --mount-path /home/jovyan/volume --claim-size=1G

oc expose svc/notebook-samples

When Jupyter Notebook is started and you log in, rather than being in the directory containing the notebooks and data files, you are in the top level directory of the persistent volume, and the files from the image are in the samples sub-directory.

Traversing into the samples sub-directory you will then see the copy of the notebooks and data files.

Adding Extra Python Packages

The intent with using an S2I builder is that, in addition to pre-populating the image with the notebooks and data files, any additional Python packages required will also be installed. These are installed because they are listed in the requirements.txt file found in the top level directory of the Git repository the S2I builder was run against.

Once you have the Jupyter Notebook instance running and you are working in it, you may find though that you need to install further Python packages. This may be because a notebook requires it but it was missing from the requirements.txt file, or your own changes mean the package is now required.

Additional Python packages can be installed by creating a terminal from the Jupyter Notebook web interface and then using the conda package manager to install them. The problem is that packages are installed under the /opt/conda directory. This directory is part of the container file system and not part of the persistent volume. This means that if the container is restarted, you need to install the extra packages again.

In the next post I will look into how you can solve this problem, with further changes to the S2I builder image to accommodate moving the Python virtual environment used into the persistent volume, and how to manage that.
