Fwd: [petsc-maint] Weak scaling test: Fieldsplit questions
What is going on here? It looks like you are using subprocess. Why would you do that on a cluster rather than MPI?

   Matt

---------- Forwarded message ----------
From: Buesing, Henrik <HBuesing@eonerc.rwth-aachen.de>
Date: Thu, Mar 30, 2017 at 8:25 AM
Subject: AW: [petsc-maint] Weak scaling test: Fieldsplit questions
To: Matthew Knepley <knepley@gmail.com>
Cc: Barry Smith <bsmith@mcs.anl.gov>, Hong <hzhang@mcs.anl.gov>, "petsc-maint@mcs.anl.gov" <petsc-maint@mcs.anl.gov>

> In my opinion, there is some kind of race condition in Firedrake when running on more than one node. Thus, until this is fixed it is very unlikely for me to get the 64 cores case running.

Hmm, we are running Firedrake in parallel with no problems here. What is the error?

[Buesing, Henrik] See [1] for the error message and the attached three logs (for the 32 core case this was 2/5 running and 3/5 crashing). This is just for running the compiled code. During the compile stage I had problems, too. What I did is the following:

1) Run Firedrake on 1 node (this works). Now all the *.so files are in place.
2) Run Firedrake on more than one node. This crashes more often the more processes I use.

I'm guessing at a race condition, because on 17 cores (1 node + 1 core) my problem runs fine. On 32 cores it sometimes runs. And on 64 cores it has, up to now, never run. But if you are not having these problems, and if the provided code reproduces the MatCreateSubMats problem, then you can do tests on your own. Well, a lot of ifs, but better than nothing.

Thank you!
Henrik

[1]
Traceback (most recent call last):
  File "/work/hb111949/Firedrake/twophase/2pDrake/2pinjection.py", line 228, in <module>
    solver = NonlinearVariationalSolver(problem, options_prefix="")
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/firedrake/firedrake/variational_solver.py", line 156, in __init__
    pre_function_callback=pre_f_callback)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/firedrake/firedrake/solving_utils.py", line 260, in __init__
    form_compiler_parameters=fcp)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/firedrake/firedrake/assemble.py", line 143, in create_assembly_callable
    collect_loops=True)
  File "<decorator-gen-279>", line 2, in _assemble
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/firedrake/firedrake/utils.py", line 62, in wrapper
    return f(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/firedrake/firedrake/assemble.py", line 192, in _assemble
    kernels = tsfc_interface.compile_form(f, "form", parameters=form_compiler_parameters, inverse=inverse)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/firedrake/firedrake/tsfc_interface.py", line 193, in compile_form
    number_map).kernels
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 200, in __new__
    obj = make_obj()
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 190, in make_obj
    obj.__init__(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/firedrake/firedrake/tsfc_interface.py", line 121, in __init__
    kernels.append(KernelInfo(kernel=Kernel(ast, ast.name, opts=opts),
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 200, in __new__
    obj = make_obj()
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 190, in make_obj
    obj.__init__(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/PyOP2/pyop2/base.py", line 3843, in __init__
    self._code = self._ast_to_c(self._ast, opts)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/PyOP2/pyop2/sequential.py", line 73, in _ast_to_c
    ast_handler.plan_cpu(self._opts)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/COFFEE/coffee/plan.py", line 121, in plan_cpu
    loop_opt.rewrite(rewrite)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/COFFEE/coffee/optimizer.py", line 117, in rewrite
    ew.sharing_graph_rewrite()
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/src/COFFEE/coffee/rewriter.py", line 619, in sharing_graph_rewrite
    prob.solve(ilp.GLPK(msg=0))
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/lib/python2.7/site-packages/pulp/pulp.py", line 1651, in solve
    status = solver.actualSolve(self, **kwargs)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/lib/python2.7/site-packages/pulp/solvers.py", line 383, in actualSolve
    rc = subprocess.call(proc, stdout = pipe, stderr = pipe)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 578, in call
    p = Popen(*popenargs, **kwargs)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 825, in __init__
    restore_signals, start_new_session)
  File "/rwthfs/rz/cluster/work/hb111949/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 1574, in _execute_child
    raise child_exception_type(errno_num, err_msg)
OSError: [Errno 14] Bad address

Thanks,

   Matt

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
On 30/03/17 14:28, Matthew Knepley wrote:
What is going on here? It looks like you are using subprocess. Why would you do that on a cluster rather than MPI?
It's all pretty icky. One of the JIT steps involves forking something. Plausibly, Henrik, your MPI doesn't like fork. I went round at one point to try and set up a single subprocess that was able to fork before MPI_Init.

For example, when using craypich, I need to say:

export MPICH_GNI_FORK_MODE=FULLCOPY

Old OpenMPI versions, especially those using some infiniband stuff, don't like fork either, for example https://www.open-mpi.org/faq/?category=openfabrics#ofa-fork

Matt, here is David's response (you're not subscribed I think, so didn't get it):
Hi Matt,
Are you sure? To me it looks like pulp is using subprocess to fork the linear program solver. Nothing to do with parallel.
David
Lawrence
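For readers following along: the traceback above bottoms out in PuLP shelling out to the GLPK solver binary via subprocess. The sketch below is a hypothetical minimal reproduction of just that call pattern, not the real solver invocation ("true" is a stand-in for the glpsol binary).

```python
# Hypothetical minimal sketch of the failing pattern: pulp invokes an
# external LP solver via subprocess, i.e. fork() + exec(), from inside
# each MPI process. On fork-unsafe MPI stacks (e.g. some OpenMPI builds
# over InfiniBand) the fork inside Popen can fail with
# OSError: [Errno 14] Bad address.
import subprocess


def call_external_solver(cmd):
    """Fork and exec an external command, pulp.solvers-style."""
    with open("/dev/null", "w") as pipe:
        return subprocess.call(cmd, stdout=pipe, stderr=pipe)


# "true" is a stand-in for the real solver binary.
rc = call_external_solver(["true"])
print(rc)
```

On a healthy single-node setup this returns 0; under a fork-hostile MPI transport the same Popen path is where the OSError surfaces.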
On Thu, Mar 30, 2017 at 8:39 AM, Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk> wrote:
On 30/03/17 14:28, Matthew Knepley wrote:
What is going on here? It looks like you are using subprocess. Why would you do that on a cluster rather than MPI?
It's all pretty icky. One of the JIT steps involves forking something. Plausibly, Henrik, your MPI doesn't like fork. I went round at one point to try and set up a single subprocess that was able to fork before MPI_Init.
For example, when using craypich, I need to say:
export MPICH_GNI_FORK_MODE=FULLCOPY
Old OpenMPI versions, especially those using some infiniband stuff, don't like fork either, for example https://www.open-mpi.org/faq/?category=openfabrics#ofa-fork
Matt, here is David's response (you're not subscribed I think, so didn't get it):
Is there a simple thing we can do to shut this off for testing? He is on a cluster I believe.

Henrik, do you know what MPI you are using?

Thanks,

   Matt
Hi Matt,
Are you sure? To me it looks like pulp is using subprocess to fork the linear program solver. Nothing to do with parallel.
David
Lawrence
According to his log output files, it seems like he is using openmpi 1.10.4 compiled with the system gcc.

On Thu, Mar 30, 2017 at 8:47 AM Matthew Knepley <knepley@gmail.com> wrote:
Is there a simple thing we can do to shut this off for testing? He is on a cluster I believe.
Henrik, do you know what MPI you are using?
Thanks,
Matt
_______________________________________________
firedrake mailing list
firedrake@imperial.ac.uk
https://mailman.ic.ac.uk/mailman/listinfo/firedrake
> According to his log output files, it seems like he is using openmpi 1.10.4 compiled with the system gcc

[Buesing, Henrik] Yes! The gcc is gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11).
> Old OpenMPI versions, especially those using some infiniband stuff, don't like fork either, for example https://www.open-mpi.org/faq/?category=openfabrics#ofa-fork

I replaced my OpenMPI 1.10.4 with MPICH 3.2 and the problem (parallel execution on more than one node) is gone. Probably my OpenMPI really did not like the fork.

Thank you!

Henrik
On 30/03/17 14:45, Matthew Knepley wrote:
Is there a simple thing we can do to shut this off for testing? He is on a cluster I believe.
I think, do:

from firedrake import *
parameters["coffee"]["optlevel"] = "O0"
parameters["pyop2_options"]["opt_level"] = "O0"

YOUR CODE HERE

(Yes, this is terrible, WIP to create only one place you have to set this.)

This will turn off the COFFEE transformations. In the general case you don't want to do this, but if you're attempting to weak scale a fieldsplit solve, it probably won't have a big effect.

Lawrence
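For scripting convenience, a workaround like the one above can be gated behind an environment variable so the same script stays optimized on machines whose MPI tolerates fork. This is a hedged sketch: the plain dict stands in for Firedrake's `parameters` object, and the variable name `COFFEE_OPT_OFF` is an invention for illustration.

```python
# Sketch: apply the "O0" workaround only when an env var is set.
# `parameters` here is a plain-dict stand-in for firedrake.parameters.
import os

parameters = {"coffee": {"optlevel": "O2"},
              "pyop2_options": {"opt_level": "O2"}}  # illustrative defaults

# In practice this would be exported in the cluster job script; set here
# so the sketch is self-contained.
os.environ["COFFEE_OPT_OFF"] = "1"

if os.environ.get("COFFEE_OPT_OFF"):
    # Disable the COFFEE/PyOP2 optimizations (and hence the ILP solve
    # that forks a subprocess).
    parameters["coffee"]["optlevel"] = "O0"
    parameters["pyop2_options"]["opt_level"] = "O0"

print(parameters["coffee"]["optlevel"])  # O0 when the workaround is active
```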
Maybe I'd rather want a Firedrake reinstall with the PETSc configure option --download-mpich? Or will this not fully replace the OpenMPI with MPICH?

@Matt: I assume you are using MPICH?
On Thu, Mar 30, 2017 at 9:20 AM, Buesing, Henrik <HBuesing@eonerc.rwth-aachen.de> wrote:
Maybe I rather want a firedrake reinstall with PETSc configure option --download-mpich? Or will this not fully replace the OpenMPI with MPICH?
@Matt: I assume, you are using MPICH?
Yes.

Thanks,

   Matt
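The --download-mpich route Henrik asks about is a standard PETSc configure option. A hedged sketch of the reconfigure step (any other configure options would depend on the original install, and the Firedrake stack would still need rebuilding against the new PETSc afterwards):

```shell
# Sketch: have PETSc download and build its own MPICH instead of linking
# the system OpenMPI. Run from the PETSc source directory.
cd "$PETSC_DIR"
./configure --download-mpich
make all
```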
I think, do:
from firedrake import * parameters["coffee"]["optlevel"] = "O0" parameters["pyop2_options"]["opt_level"] = "O0"
YOUR CODE HERE
This works! I'm still having the compilation problems, but if I prerun with 1 node, make sure all the *.so files are in place, and then do the full run with > 1 node, it works. I will also try to replace OpenMPI with MPICH and see how that goes.

Thank you!

Henrik
participants (4)

- Buesing, Henrik
- Justin Chang
- Lawrence Mitchell
- Matthew Knepley