Guidelines for using/running jobs on larger HPC machines
Hi all,

Generally speaking, what should the PYOP2_CACHE_DIR and FIREDRAKE_TSFC_KERNEL_CACHE_DIR environment variables be set to? Is it sufficient to have something like:

export PYOP2_CACHE_DIR=$HOME/fd-cache
export FIREDRAKE_TSFC_KERNEL_CACHE_DIR=$HOME/fd-cache

This works on my smaller university cluster, but I wonder whether there is a better directory for this on a system like Edison at NERSC (http://www.nersc.gov/users/computational-systems/edison/).

Also, from what I read, Edison's SLURM scheduler loads the executable onto the allocated compute nodes from the current working directory, which can apparently be really slow. They recommend using something like:

srun --bcast=/tmp/$SLURM_JOB_ID --compress=lz4 ...

if 2000 or more nodes are needed. But even on jobs that require no more than a single compute node (24 cores), the firedrake/python modules seem to load very slowly.

By the way, this is the description of the file storage system on Edison (http://www.nersc.gov/users/computational-systems/edison/file-storage-and-i-o...).

Any help or thoughts appreciated.

Justin
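For reference, a minimal sketch of how these pieces might fit together in a SLURM batch script; the $SCRATCH cache paths, the virtualenv location, and the script name are assumptions for illustration, not verified NERSC settings:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00

# Point the compilation caches at a filesystem visible to the compute
# nodes. Using $SCRATCH here is an assumption; $HOME also works but is
# often slower on large systems.
export PYOP2_CACHE_DIR=$SCRATCH/fd-cache/pyop2
export FIREDRAKE_TSFC_KERNEL_CACHE_DIR=$SCRATCH/fd-cache/tsfc

# Activate the Firedrake virtualenv (path is an assumption).
source $HOME/firedrake/bin/activate

# --bcast copies the launched executable to node-local /tmp before it
# starts; --compress=lz4 compresses it in flight (NERSC's recommendation
# for large node counts).
srun --bcast=/tmp/$SLURM_JOB_ID --compress=lz4 -n 24 python myscript.py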
Hi Justin,
We have work in progress, via Nick Johnson (who has a poster next week at the Firedrake meeting), to make this better. The reason it's so slow to load the Python modules is that although you're sending out the executable and the Python script, the entire Firedrake virtualenv is still being loaded from the shared filesystem. Our plan is to make the virtualenv relocatable in its entirety: when you launch your job, you ship it out to the compute nodes, activate it there, and then run everything locally. I think it's not quite ready for production yet, but we should discuss next week to see if you want to try it out.

Lawrence
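A minimal sketch of the kind of workflow described here, assuming a relocatable virtualenv packed as a tarball; the tarball name and all paths are hypothetical:

# On the login node: pack the (relocatable) virtualenv once.
tar czf $SCRATCH/firedrake-venv.tar.gz -C $HOME firedrake

# In the job script: broadcast the tarball to node-local /tmp,
# unpack it once per node, then activate and run everything locally.
sbcast --compress $SCRATCH/firedrake-venv.tar.gz /tmp/firedrake-venv.tar.gz
srun --ntasks-per-node=1 tar xzf /tmp/firedrake-venv.tar.gz -C /tmp
source /tmp/firedrake/bin/activate
srun -n 24 python myscript.py

The point of this design is that after the one-time broadcast, every import resolves against node-local storage rather than the shared filesystem, which is where the slow module loading comes from.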
participants (2)
- Justin Chang
- Lawrence Mitchell