Overview/Background:
By default, packages installed with Conda go to the /opt folder, which sits on ephemeral storage. When an idle notebook is restarted, those packages are lost and have to be reinstalled, which takes some time.
Workarounds/Approaches:
We have several recommendations to address this concern. Users can choose whichever of the approaches below they prefer:
1. Create a new Conda environment under the user's workspace (/home/kubeflow/ by default). This folder is backed by a separate PVC, so the data is retained after the Notebook (effectively a StatefulSet) is scaled down.
Generally, the best practice is to create your own conda/pip environment for the packages you need, for better isolation, security, and consistency.
- Create a new conda env under the specified location:
conda create --prefix /home/kubeflow/conda
- Initialize the shell (terminal restart is required)
conda init bash
- Activate the new conda env:
conda activate /home/kubeflow/conda
- Install required packages:
conda install numpy
- Verify the env:
conda list
# packages in environment at /home/kubeflow/conda:
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
ca-certificates 2021.10.8 ha878542_0 conda-forge
ld_impl_linux-64 2.36.1 hea4e1c9_2 conda-forge
libblas 3.9.0 13_linux64_openblas conda-forge
libcblas 3.9.0 13_linux64_openblas conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc-ng 11.2.0 h1d223b6_11 conda-forge
libgfortran-ng 11.2.0 h69a702a_11 conda-forge
libgfortran5 11.2.0 h5c6108e_11 conda-forge
libgomp 11.2.0 h1d223b6_11 conda-forge
liblapack 3.9.0 13_linux64_openblas conda-forge
libnsl 2.0.0 h7f98852_0 conda-forge
libopenblas 0.3.18 pthreads_h8fe5266_0 conda-forge
libstdcxx-ng 11.2.0 he4da1e4_11 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libzlib 1.2.11 h36c2ea0_1013 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
numpy 1.22.1 py310h454958d_0 conda-forge
openssl 3.0.0 h7f98852_2 conda-forge
pip 21.3.1 pyhd8ed1ab_0 conda-forge
python 3.10.2 h543edf9_0_cpython conda-forge
python_abi 3.10 2_cp310 conda-forge
readline 8.1 h46c0cb4_0 conda-forge
setuptools 60.5.0 py310hff52083_0 conda-forge
sqlite 3.37.0 h9cd32fc_0 conda-forge
tk 8.6.11 h27826a3_1 conda-forge
tzdata 2021e he74cb21_0 conda-forge
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.11 h36c2ea0_1013 conda-forge
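The point of the steps above is that the environment lives under the PVC-backed workspace rather than under /opt. A minimal sketch of a check for this is shown below. It assumes /home/kubeflow is the persistent workspace, as described above; the function names is_persistent_path and check_active_env and the PERSISTENT_HOME variable are hypothetical helpers introduced for illustration. CONDA_PREFIX is the variable that conda activate sets to the active environment's location.

```shell
#!/usr/bin/env bash
# Assumption: /home/kubeflow is the PVC-backed workspace described above.
# PERSISTENT_HOME is a hypothetical override for setups that use another path.
PERSISTENT_HOME="${PERSISTENT_HOME:-/home/kubeflow}"

# Returns 0 if the given path is inside the persistent workspace, 1 otherwise.
is_persistent_path() {
  case "$1" in
    "$PERSISTENT_HOME"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reports whether the currently active conda env survives a notebook restart.
# CONDA_PREFIX is set by `conda activate`; it is empty when no env is active.
check_active_env() {
  local prefix="${CONDA_PREFIX:-}"
  if [ -z "$prefix" ]; then
    echo "no conda environment is active"
    return 2
  fi
  if is_persistent_path "$prefix"; then
    echo "persistent: $prefix"
  else
    echo "ephemeral: $prefix"
    return 1
  fi
}
```

Running check_active_env after conda activate /home/kubeflow/conda should report the environment as persistent, while the default environment under /opt/conda would be reported as ephemeral.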
2. Install additional packages with the 'pip install --user' flag. The --user flag tells pip to install packages into a user-specific directory inside your home directory (/home/kubeflow/.local in the sample output below). Because the home directory is on the PVC, the packages remain available after the notebook pod is restarted. Run pip install --help and look up the --user flag to learn more.
Sample output:
$ pip install --user jsonpickle
Collecting jsonpickle
Downloading jsonpickle-2.1.0-py2.py3-none-any.whl (38 kB)
Requirement already satisfied: importlib-metadata in /opt/conda/lib/python3.7/site-packages (from jsonpickle) (4.10.0)
Requirement already satisfied: typing-extensions>=3.6.4 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->jsonpickle) (3.7.4.3)
Requirement already satisfied: zipp>=0.5 in /opt/conda/lib/python3.7/site-packages (from importlib-metadata->jsonpickle) (3.4.1)
Installing collected packages: jsonpickle
Successfully installed jsonpickle-2.1.0
$ pip list packages | grep jsonpickle
jsonpickle 2.1.0
$ pip show jsonpickle
Name: jsonpickle
Version: 2.1.0
Summary: Python library for serializing any arbitrary object graph into JSON
Home-page: https://github.com/<sample-username>/<sample-username>
Author: <sample_author_name>
Author-email: <sample-emailadd@gmail.com>
License: UNKNOWN
Location: /home/kubeflow/.local/lib/python3.7/site-packages
Requires: importlib-metadata
Required-by:
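To confirm where --user installs end up on a given notebook, the standard library's site module can print the user site-packages directory. The sketch below does that and also checks whether the user-level script directory is on PATH. It rests on two assumptions: that python3 is the interpreter pip installs for, and that user-level console scripts land in ~/.local/bin, which is the usual location on Linux. The path_has function is a hypothetical helper introduced for illustration.

```shell
#!/usr/bin/env bash
# Returns 0 if directory $1 is an entry in the colon-separated list $2.
path_has() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print where `pip install --user` places packages for this interpreter.
# `python3 -m site --user-site` is provided by the standard library's site module.
if command -v python3 >/dev/null 2>&1; then
  echo "user site-packages: $(python3 -m site --user-site)"
fi

# Assumption: user-level console scripts are installed to ~/.local/bin on Linux.
USER_BIN="$HOME/.local/bin"
if ! path_has "$USER_BIN" "$PATH"; then
  echo "note: $USER_BIN is not on PATH; scripts installed with --user may not be found"
fi
```

If the printed directory is under /home/kubeflow, as with the Location field in the sample output above, the packages are on the PVC and survive a restart.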
3. Create a new Docker image based on one of the default images that we provide. In the Dockerfile example below, the base image is the default TensorFlow image. With this approach, the additional packages are baked into the image, so they are present every time the notebook starts.
FROM mesosphere/kubeflow:1.3.0-jupyter-spark-3.0.0-horovod-0.22.0-tensorflow-2.5.0
RUN <install additional packages here>
ENTRYPOINT ["tini", "--"]
CMD ["sh","-c", "/entrypoint.sh"]
Ultimately, the goal of all the approaches above is to retain user-installed packages after the notebook container is restarted, so that users no longer have to reinstall packages when an idle notebook is restarted.