Jupyter on OpenShift Part 5: Ad-hoc Package Installation

Persistent volumes are mainly used to store application data, so that if a container running an application is restarted, that data is preserved and available to the new instance of the application.

When using an interactive coding environment such as Jupyter Notebooks, what you may want to persist can extend beyond just the notebooks and data files you are working with. Because it is an interactive environment using the dynamic scripting language Python, a user may want to install additional Python packages at the point they are creating a notebook.

These Python packages can be installed via the Jupyter Notebook interface, or from an interactive terminal running within the container. Because Python packages are not installed into the persistent volume, but into the file system of the container, a restart of the container will result in any additions being lost. This means that the user would need to reinstall the packages after any restart.

Creating a Source-to-Image (S2I) enabled version of the Jupyter Notebook image, as described in one of the previous blog posts in this series, addresses the issue of being able to install additional Python packages, but only for the case where you know in advance what you need. The S2I builder isn’t going to help when you are first developing the notebook and you haven’t yet worked out what packages you need.

What we therefore need is a way of also saving away the environment under which the Jupyter Notebook application is running. More specifically, we need to have the set of Python packages required saved away in the persistent volume, along with the notebooks and data files being worked on.

Python Virtual Environments

When you install Python packages, by default they are installed into the Python installation itself. To keep the set of required Python packages separate from the Python installation, you can use what is called a Python virtual environment.

When using the Anaconda Python distribution, you can create a new environment by running the command conda create from an interactive terminal within the container.

$ conda create --name venv
Fetching package metadata .........
Solving package specifications: .
Package plan for installation in environment /opt/conda/envs/venv:

The following empty environments will be CREATED:

/opt/conda/envs/venv

Proceed ([y]/n)? y

#
# To activate this environment, use:
# > source activate venv
#
# To deactivate this environment, use:
# > source deactivate venv
#

Although this has created a new environment, there are a number of problems.

The first is that the conda command has created the environment under the directory /opt/conda. This directory is still in the container file system and not on the persistent volume we would want to be used.

The second is that this creates an empty environment. If the S2I builder had been used and a set of Python packages had already been installed, but you wanted to add more, then that initial set of packages would be missing if this environment were used. Also missing would be all the packages required for running the Jupyter Notebook application itself.

The third and final problem is that although this creates a new environment, the Jupyter Notebook application is already running. If it were shut down in an attempt to run it with the new environment, the container would be restarted and we would be kicked out of the interactive terminal session.
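
The first of these problems can be checked mechanically. As a minimal sketch, the hypothetical helper is_under below (it is not part of conda or of the Jupyter images) reports whether a path lies inside a given directory, which is enough to tell whether an environment would survive a restart. It compares path strings only and does not resolve symbolic links. The volume location /home/jovyan/volume is the mount path used later in this post.

```shell
# Hypothetical helper: succeed when path $1 is directory $2 or lies beneath it.
# Pure string comparison; symbolic links and ".." are not resolved.
is_under() {
    case "${1%/}/" in
        "${2%/}/"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The environment created above lives under /opt/conda, not on the volume.
if is_under /opt/conda/envs/venv /home/jovyan/volume; then
    echo "persistent"
else
    echo "lost on restart"
fi
```

Appending a slash to both sides before comparing keeps a sibling directory such as /home/jovyan/volume2 from being mistaken for something inside the volume.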

Cloning the Root Environment

To solve the second problem, rather than create a new empty environment, we want to clone the existing root environment, which is part of the Python installation. To solve the first problem at the same time, we want to override where the conda create command puts the clone.

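We therefore want to use the --prefix option to specify a location for the environment, and the --clone option to indicate that we want to clone the root environment. When performing the clone, we want to ensure that files are actually copied rather than linked, so we use the --copy option as well. Because a relative path is given to --prefix, the command is run from the persistent volume directory, which is /home/jovyan/volume in this case.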

$ conda create --prefix venv --clone root --copy
Source:      /opt/conda
Destination: /home/jovyan/volume/venv
The following packages cannot be cloned out of the root environment:
 - conda-4.2.12-py35_0
Packages: 59
Files: 64
Fetching package metadata .........
Fetching packages ...
conda-env-2.6. 100% |####################################| Time: 0:00:00 608.19 kB/s
libgcc-5.2.0-0 100% |####################################| Time: 0:00:01 975.81 kB/s
pandoc-1.19.2- 100% |####################################| Time: 0:00:21 855.67 kB/s
Extracting packages ...
[      COMPLETE      ]|#####################################################| 100%
Linking packages ...
[      COMPLETE      ]|#####################################################| 100%
#
# To activate this environment, use:
# > source activate /home/jovyan/volume/venv
#
# To deactivate this environment, use:
# > source deactivate /home/jovyan/volume/venv
#

To test that this has worked, we can activate the environment and attempt to install a Python package we might need.

$ source activate /home/jovyan/volume/venv

(/home/jovyan/volume/venv) $ conda install numpy
Fetching package metadata .........
Solving package specifications: ..........


InstallError: Install error: Error: one or more of the packages already installed depend on 'conda'
and should only be installed in the root environment: conda-env
These packages need to be removed before conda can proceed.

Unfortunately this fails, because of an oddity in how Anaconda handles environments. Cloning the root environment copies across the conda-env package, which depends on conda and may only be installed in the root environment, and its presence makes the clone unusable. What we therefore need to do is remove that package.

(/home/jovyan/volume/venv) $ conda remove conda-env
Fetching package metadata .........
Solving package specifications: ..........

Package plan for package removal in environment /home/jovyan/volume/venv:

The following packages will be REMOVED:

    conda-env: 2.6.0-0

Proceed ([y]/n)? y

Unlinking packages ...
[      COMPLETE      ]|###################################################| 100%

Having removed the conda-env package, we can now install the additional packages we want.

(/home/jovyan/volume/venv) $ conda install numpy
Fetching package metadata .........
Solving package specifications: ..........

Package plan for installation in environment /home/jovyan/volume/venv:

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    libgfortran-3.0.0          |                1         281 KB
    openblas-0.2.19            |                1        14.1 MB  conda-forge
    blas-1.1                   |         openblas           1 KB  conda-forge
    numpy-1.12.1               |py35_blas_openblas_200         8.5 MB  conda-forge
    ------------------------------------------------------------
                                           Total:        22.9 MB

The following NEW packages will be INSTALLED:

    blas:        1.1-openblas                  conda-forge (soft-link)
    libgfortran: 3.0.0-1                                   (soft-link)
    numpy:       1.12.1-py35_blas_openblas_200 conda-forge [blas_openblas] (soft-link)
    openblas:    0.2.19-1                      conda-forge (soft-link)

Proceed ([y]/n)? y

Fetching packages ...
libgfortran-3. 100% |####################################| Time: 0:00:00 857.66 kB/s
openblas-0.2.1 100% |####################################| Time: 0:00:18 809.28 kB/s
blas-1.1-openb 100% |####################################| Time: 0:00:00  34.49 kB/s
numpy-1.12.1-p 100% |####################################| Time: 0:00:11 750.83 kB/s
Extracting packages ...
[      COMPLETE      ]|###################################################| 100%
Linking packages ...
[      COMPLETE      ]|###################################################| 100%
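
To confirm that commands now resolve to the environment on the persistent volume rather than to /opt/conda, one option is to look at which python the shell finds first. This is only a sketch: the helper name uses_env_python is hypothetical, and it assumes that activating an environment places that environment's bin directory at the front of PATH.

```shell
# Hypothetical helper: succeed when the first "python" found on PATH is the
# one inside the environment directory given as $1.
uses_env_python() {
    [ "$(command -v python)" = "${1%/}/bin/python" ]
}

if uses_env_python /home/jovyan/volume/venv; then
    echo "using the persistent environment"
else
    echo "using some other Python: $(command -v python)"
fi
```

Run from the terminal in which the environment was activated, this gives a quick indication of whether later installs will land on the volume.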

Activating the Environment

We now have a new environment stored in the persistent volume. If the container is restarted, that environment will still be there. It is still necessary to do something about our third problem though, which is that the Jupyter Notebook instance isn’t using our persistent environment, and is instead still using the default root environment.

To address this we are going to make further changes to the run script we had previously created for our S2I enabled Jupyter Notebook image.

The contents of that run script at this point are:

#!/bin/bash

set -eo pipefail

NOTEBOOK_ARGS=

# Calculate login token from the supplied password.

if [ x"${JUPYTER_NOTEBOOK_PASSWORD}" != x"" ]; then
    NOTEBOOK_ARGS=--NotebookApp.password=`python -c "import notebook.auth; \
        print(notebook.auth.passwd(\"$JUPYTER_NOTEBOOK_PASSWORD\"))"`
    unset JUPYTER_NOTEBOOK_PASSWORD
fi

# Copy files into volume if specified and change notebook directory.

JUPYTER_NOTEBOOK_DIR=${JUPYTER_NOTEBOOK_DIR:-/home/$NB_USER/work}

if [ x"${PERSISTENT_VOLUME_ROOTDIR}" != x"" ]; then
    PERSISTENT_VOLUME_WORKSPACE=${PERSISTENT_VOLUME_WORKSPACE:-work}

    WORKDIR=${PERSISTENT_VOLUME_ROOTDIR}/${PERSISTENT_VOLUME_WORKSPACE}

    if [ ! -d ${WORKDIR} ]; then
        mkdir -p ${WORKDIR}
        cp -rp ${JUPYTER_NOTEBOOK_DIR}/. ${WORKDIR}
    fi

    JUPYTER_NOTEBOOK_DIR=${PERSISTENT_VOLUME_ROOTDIR}
fi

NOTEBOOK_ARGS="$NOTEBOOK_ARGS --notebook-dir=${JUPYTER_NOTEBOOK_DIR}"

cd ${JUPYTER_NOTEBOOK_DIR}

# Start the Jupyter notebook instance.

exec /usr/local/bin/start-notebook.sh $NOTEBOOK_ARGS

This includes what had been added initially to allow a password to be provided via an environment variable, as well as the changes made in the last post to copy any output from the S2I build process into the persistent volume. Finally the Jupyter Notebook instance is run.

The change we are going to make this time is to activate the environment just before starting up the Jupyter Notebook instance.

if [ x"${PERSISTENT_VOLUME_ROOTDIR}" != x"" ]; then
    PERSISTENT_VOLUME_VIRTUALENV=${PERSISTENT_VOLUME_VIRTUALENV:-venv}

    VENVDIR=${PERSISTENT_VOLUME_ROOTDIR}/${PERSISTENT_VOLUME_VIRTUALENV}

    if [ -f "${VENVDIR}/bin/jupyter" ]; then
        source activate ${VENVDIR}
    fi
fi

This checks whether a persistent volume has been specified and, if an environment containing the Jupyter Notebook application is found on it, activates that environment. Now when the Jupyter Notebook instance is started, it will use the persistent environment.
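
The same check can be written so that paths containing spaces are handled and so that the lookup can be tried on its own, outside the run script. This is only a sketch: the function name persistent_env_dir is hypothetical, while the two environment variables mentioned in the comments are the ones used above.

```shell
# Hypothetical helper: print the environment directory when it contains the
# jupyter command, otherwise print nothing and return a non-zero status.
#   $1  root directory of the persistent volume (may be empty)
#   $2  name of the environment directory (defaults to venv)
persistent_env_dir() {
    [ -n "$1" ] || return 1
    [ -f "$1/${2:-venv}/bin/jupyter" ] || return 1
    printf '%s\n' "$1/${2:-venv}"
}

# Possible use in the run script, just before starting the notebook:
#
#   if VENVDIR=$(persistent_env_dir "${PERSISTENT_VOLUME_ROOTDIR}" \
#           "${PERSISTENT_VOLUME_VIRTUALENV}"); then
#       source activate "${VENVDIR}"
#   fi
```

Keeping the lookup separate from the activation means the run script still falls back to the root environment whenever the volume is empty or the environment has not been created yet.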

With this change made, a user can now work on their notebook and even if the container is restarted, the additional Python packages they installed will still be there after the restart.

When the user has finished developing their notebook, they can download their files through the Jupyter Notebook interface or using the oc rsync command, add a requirements.txt file listing the additional packages, and push it all up to a Git repository. The S2I builder image for the Jupyter Notebook can now be run to create an image which replicates everything they have done. The Git repository, or the image created by the S2I build process, could be shared with others interested in their work, or could be deployed in a classroom environment for students.
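
One way to work out what belongs in requirements.txt is to compare the package list of the persistent environment with that of the root environment. The sketch below is an assumption rather than part of the workflow described in this post: added_requirements is a hypothetical helper, and it presumes both lists were produced with pip freeze, one requirement per line. A package whose version changed will show up as added, so the result is worth reviewing by hand.

```shell
# Hypothetical helper: print the lines of file $2 that do not appear in file $1.
# grep exits with status 1 when nothing is printed; treat that as success.
added_requirements() {
    grep -F -x -v -f "$1" "$2" || [ "$?" -eq 1 ]
}

# Possible use from a terminal in the container (the pip paths are
# assumptions based on the environment locations shown earlier):
#
#   /opt/conda/bin/pip freeze > /tmp/root.txt
#   /home/jovyan/volume/venv/bin/pip freeze > /tmp/venv.txt
#   added_requirements /tmp/root.txt /tmp/venv.txt > requirements.txt
```

The -F and -x options make grep compare whole lines as fixed strings, so the dots in version numbers are not treated as wildcards.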

Using the Builder Image

The files for this version of the Jupyter Project minimal notebook are on the s2i-anyuid-pvc-venv branch of the Git repository at:

https://github.com/getwarped/s2i-minimal-notebook

To build the image using OpenShift you can use the command:

oc new-build https://github.com/getwarped/s2i-minimal-notebook#s2i-anyuid-pvc-venv \
    --name s2i-minimal-notebook

To deploy the image to create an empty environment in which to start working on a notebook, along with an attached persistent volume, you can run:

oc new-app s2i-minimal-notebook \
    --env JUPYTER_NOTEBOOK_PASSWORD=grumpy \
    --env PERSISTENT_VOLUME_ROOTDIR=/home/jovyan/volume \
    --name notebook

oc set volume dc/notebook --add --mount-path /home/jovyan/volume --claim-size=1G

oc expose svc/notebook

Once the Jupyter Notebook application is running, open an interactive terminal from the Jupyter Notebook interface and create the environment by running:

conda create --prefix venv --clone root --copy

source activate /home/jovyan/volume/venv

conda remove conda-env

kill -TERM 1

The kill -TERM 1 command will cause the Jupyter Notebook instance to be shut down and the container will be restarted. After the restart, the Jupyter Notebook instance will be using the new environment and additional Python packages can be installed.
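
These steps can also be wrapped so that they are safe to repeat. The sketch below is a hedged variation, not the method used above: ensure_persistent_env is a hypothetical function, it passes --yes to skip the confirmation prompts, and it calls conda remove with --prefix instead of activating the environment first. Whether those options behave the same way on the conda version in your image is something to verify before relying on it.

```shell
# Hypothetical helper: clone the root environment into directory $1 unless an
# environment containing jupyter is already there.
ensure_persistent_env() {
    if [ -f "$1/bin/jupyter" ]; then
        echo "environment already exists: $1"
        return 0
    fi
    conda create --yes --prefix "$1" --clone root --copy &&
        conda remove --yes --prefix "$1" conda-env
}

# Possible use from a terminal in the container:
#
#   ensure_persistent_env /home/jovyan/volume/venv
```

A restart with kill -TERM 1 is still needed afterwards for the Jupyter Notebook instance to pick up a newly created environment.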

Removing the ‘anyuid’ Requirement

Even with these modifications, one issue remains. Because of the way the original Jupyter Project minimal-notebook image was set up, the service account the application runs as must be allowed to run images as any user ID, which requires an administrator to run the command:

oc adm policy add-scc-to-user anyuid -z default -n myproject

Having got this far with improving on the original base image so that it is better integrated with OpenShift, let’s return to that original problem and see whether anything can be done to remedy it, so that there is no need for the anyuid security context constraint. This will be the topic of the next post.
