Hi Lawrence,
Below are the first weak scaling results from runs at lowest order on up to 96 cores on ARCHER. On 384 cores the code crashes with a PETSc error (segfault). This crash is already in the matrix-free solver (which, of course, uses a PETSc KSP).
Could this be an issue with the Python module that launches the compilation and loads the kernels in PyOP2? On Friday I ran with PYOP2_NO_FORK_AVAILABLE=1, which I thought would fix this. If I run with PYOP2_NO_FORK_AVAILABLE=0, it crashes with a different error because it can't compile a kernel.

This morning I just repeated exactly the same 384-core run (PYOP2_NO_FORK_AVAILABLE=1 as before), and now it goes through without problems (i.e. it does both the matrix-free and the PETSc solve).

I observe something similar with the 1536-core run: the first run crashed; in the subsequent runs the matrix-free solver completes, but the run crashes later, when it gets to the PETSc solve.

I then set PYOP2_DEBUG=1 in the 1536-core run, and again it fails because it can't compile code. The resulting .err file is empty. I then ran the compilation command from the .log file manually. It goes through, but with a warning, which I attach together with the output of the run.

Cheers,
Eike
Hi Eike,

On 2 Feb 2015, at 09:09, Eike Mueller <E.Mueller@bath.ac.uk> wrote:
> Hi Lawrence,
>
> Below are the first weak scaling results from runs at lowest order on up to 96 cores on ARCHER. On 384 cores the code crashes with a PETSc error (segfault). This crash is already in the matrix-free solver (which, of course, uses a PETSc KSP).
>
> Could this be an issue with the python module for launching the compilation/loading the kernels in PyOP2? However, on Friday I ran with PYOP2_NO_FORK_AVAILABLE=1, which I thought would fix this? If I run with PYOP2_NO_FORK_AVAILABLE=0, then it crashes with a different error because it can't compile a kernel.
>
> This morning I just repeated exactly the same 384 core run (PYOP2_NO_FORK_AVAILABLE=1 as before) and now it goes through without problems (i.e. it does both the matrix-free and the PETSc solve).
>
> I observe something similar with the 1536 run: The first run crashed, in the subsequent runs the matrix-free solver completes but it crashes later in the run where it gets to the PETSc solve.
>
> I then set PYOP2_DEBUG=1 in the 1536 core run, and again it fails because it can't compile code. The resulting .err file is empty. I then ran the compilation command in the .log file manually. It goes through, but with a warning, which I attach together with the output of the run.
My experience of runs on ARCHER is that our JIT-compilation can occasionally fail, especially on "lots" of cores. PYOP2_NO_FORK_AVAILABLE=1 helps a bit, but isn't perfect. TBH, I don't really have any ideas as to why this might be the case: as you observe, sometimes things work a little better.

If you can set up the problem such that a "small" run populates the code caches fully (so that when running large jobs you don't need to compile any modules) that seems to work best. This mostly involves replacing literal constants in forms/expressions with Constant(value). That way the same code is generated irrespective of the value (and you therefore don't need to recompile).

Lawrence
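A toy model of why this advice helps, as a sketch only: it assumes compiled code is looked up by a hash of the generated source, which is an illustration and not PyOP2's actual cache implementation. The names `kernel_source_literal`, `kernel_source_constant` and `ToyCodeCache` are hypothetical.

```python
import hashlib


def kernel_source_literal(dt):
    # The value is baked into the generated source, so each value gives new code.
    return f"void k(double *u) {{ u[0] += {dt!r} * u[0]; }}"


def kernel_source_constant():
    # The value is passed in at run time, so the source does not depend on it.
    return "void k(double *u, const double *dt) { u[0] += dt[0] * u[0]; }"


class ToyCodeCache:
    """Hypothetical stand-in for a code cache keyed on a hash of the source."""

    def __init__(self):
        self.compiled = set()
        self.compilations = 0

    def get(self, source):
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in self.compiled:
            # Cache miss: this is where a real system would invoke the compiler.
            self.compilations += 1
            self.compiled.add(key)
        return key


def count_compilations(sources):
    cache = ToyCodeCache()
    for source in sources:
        cache.get(source)
    return cache.compilations


timesteps = [0.1, 0.05, 0.025]
literal_count = count_compilations(kernel_source_literal(dt) for dt in timesteps)
constant_count = count_compilations(kernel_source_constant() for _ in timesteps)
print(literal_count, constant_count)
```

In this model every new literal value forces a compilation, while the parameterised kernel is compiled once and then reused, which is the behaviour a small cache-populating run relies on.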
Hi Lawrence,

thanks, I will go through my code and replace all constants by PyOP2 Constants. How can I check that this worked?

Can I set PYOP2_DUMP_GENCODE=1, PYOP2_DUMP_GENCODE_PATH=./build, run with two different resolutions and then check if the second run generated any new files?

Thanks,
Eike

On 02/02/15 11:57, Lawrence Mitchell wrote:
> Hi Eike,
>
> My experience of runs on ARCHER is that our JIT-compilation can occasionally fail, especially on "lots" of cores. PYOP2_NO_FORK_AVAILABLE=1 helps a bit, but isn't perfect. TBH, I don't really have any ideas as to why this might be the case: as you observe, sometimes things work a little better.
>
> If you can set up the problem such that a "small" run populates the code caches fully (so that when running large jobs you don't need to compile any modules) that seems to work best. This mostly involves replacing literal constants in forms/expressions with Constant(value). That way the same code is generated irrespective of the value (and you therefore don't need to recompile).
>
> Lawrence
On 03/02/15 09:53, Eike Mueller wrote:
> Hi Lawrence,
>
> thanks, I will go through my code and replace all constants by PyOP2 Constants. How can I check that this worked?
>
> Can I set PYOP2_DUMP_GENCODE=1, PYOP2_DUMP_GENCODE_PATH=./build, run with two different resolutions and then check if the second run generated any new files?
That would work, but an easier way is probably to run with INFO level logging and check for any "compiling" output.

Lawrence
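Both checks discussed above can be scripted. The following is a minimal sketch using only the Python standard library: `snapshot` and `new_files` compare the contents of the dump directory before and after the second run, and `compile_lines` scans captured log text for "compiling". The function names, the file names and the sample log lines are hypothetical; the real log format may differ.

```python
import os
import tempfile


def snapshot(path):
    """Return the set of file paths (relative to path) found under path."""
    found = set()
    for root, _dirs, files in os.walk(path):
        for name in files:
            found.add(os.path.relpath(os.path.join(root, name), path))
    return found


def new_files(before, after):
    """Files present in the second snapshot but not in the first."""
    return sorted(after - before)


def compile_lines(log_text):
    """Lines of captured log output that mention compiling (case-insensitive)."""
    return [line for line in log_text.splitlines() if "compiling" in line.lower()]


# Demonstration on a temporary directory standing in for ./build.
with tempfile.TemporaryDirectory() as build:
    open(os.path.join(build, "kernel_a.c"), "w").close()
    before = snapshot(build)   # state after the first (small) run
    open(os.path.join(build, "kernel_b.c"), "w").close()
    after = snapshot(build)    # state after the second run
    added = new_files(before, after)

# Hypothetical log text; substitute the captured output of the real run.
hits = compile_lines("INFO: Compiling wrapper\nINFO: solve done\n")
print(added, hits)
```

If `added` and `hits` are both empty for the second run, nothing new was generated or compiled, which suggests the first run populated the cache fully.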