Dear all,

I am having problems running Firedrake in parallel on more than one node. If I run on 16 cores (2x8 cores, 1 node) everything is fine. If I go to more than one node (2, 4, ...) I get an error (see also attached log):

OSError: /w0/tmp/lsf_user.35589703.0/pyop2-cache-uid17470/f51f6074c31bf0cd78d85bcb40381923.so: cannot open shared object file: No such file or directory

Do you have an idea what could be wrong? Thank you!

Henrik

--
Dipl.-Math. Henrik Büsing
Institute for Applied Geophysics and Geothermal Energy
E.ON Energy Research Center
RWTH Aachen University
------------------------------------------------------
Mathieustr. 10          | Tel +49 (0)241 80 49907
52074 Aachen, Germany   | Fax +49 (0)241 80 49889
------------------------------------------------------
http://www.eonerc.rwth-aachen.de/GGE
hbuesing@eonerc.rwth-aachen.de
------------------------------------------------------
Dear Henrik,
On 27 Mar 2017, at 10:13, Buesing, Henrik <HBuesing@eonerc.rwth-aachen.de> wrote:
Dear all,
I am having problems running Firedrake in parallel on more than one node. If I run on 16 cores (2x8 cores, 1 node) everything is fine. If I go to more than one node (2, 4, ...) I get an error (see also attached log):
OSError: /w0/tmp/lsf_user.35589703.0/pyop2-cache-uid17470/f51f6074c31bf0cd78d85bcb40381923.so: cannot open shared object file: No such file or directory
Do you have an idea what could be wrong? Thank you!
Right now, Firedrake compiles code only on rank zero, and therefore requires that all processes can see the directory it writes to.

Assuming you have access to a shared filesystem, set the environment variable:

PYOP2_CACHE_DIR

to point to a temporary directory that all ranks can see.

We are working on node-local compilation, which will be more scalable and make this problem go away.

Thanks,

Lawrence
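For illustration, a minimal sketch of doing this from the driver script (the cache path below is made up, and it assumes PyOP2 reads the variable when Firedrake is first imported, so it must be set before that import; exporting it in the job script works equally well):

# Sketch: point the PyOP2 disk cache at a directory that every rank on every
# node can see. "/work/user/pyop2-cache" is illustrative; use a path on your
# shared filesystem. Must run before firedrake/PyOP2 is imported.
import os
os.environ.setdefault("PYOP2_CACHE_DIR", "/work/user/pyop2-cache")

from firedrake import *  # import only after the cache directory is set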
Assuming you have access to a shared filesystem, set the environment variable:
PYOP2_CACHE_DIR
to point to a temporary directory that all ranks can see.
Dear Lawrence,

I have set PYOP2_CACHE_DIR. I could verify that it uses the shared directory in the one-node case. Unfortunately, in the multi-node case I still get the same error message (see log1):

OSError: /work/user/Firedrake/twophase/2pDrake/fine8/tmp/f51f6074c31bf0cd78d85bcb40381923.so: cannot open shared object file: No such file or directory

I can verify that the file in question actually exists after running the job. From time to time it seems to reuse exactly the same hash for the *.so file, so I tried to just resubmit the job. Then I get the following error message (see log2):

  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/pulp/pulp.py", line 1651, in solve
    status = solver.actualSolve(self, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/pulp/solvers.py", line 385, in actualSolve
    raise PulpSolverError("PuLP: Error while trying to execute "+self.path)
pulp.solvers.PulpSolverError: PuLP: Error while trying to execute glpsol

Any ideas what still could be wrong? Thank you!

Henrik
Apparently glpsol (from GLPK) is not installed.
Apparently glpsol (from GLPK) is not installed.

[Buesing, Henrik] And what could be the reason that this is needed for parallel calculations on more than one node, but not for calculations on just one node? In addition, how would I install it? Is it not installed during the normal Firedrake installation?

Apart from this: Is it possible to divide compilation and execution into two steps? Could I compile, make sure that everything is in place, and then just execute in another job?

Thank you!
Henrik
On 28/03/17 19:37, Buesing, Henrik wrote:
Apparently glpsol (from GLPK) is not installed.
[Buesing, Henrik] And what could be the reason that this is needed for parallel calculations on more than one node, but not for calculations on just one node? In addition, how would I install it? Is it not installed during the normal Firedrake installation?
This is part of the normal Firedrake installation. I assume something is messed up with the parallel launch configuration, so that glpsol is somehow not on the PATH on all nodes, as it should be.
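A quick way to test that hypothesis, as a sketch (it assumes mpi4py is importable in the Firedrake virtualenv, and the script is launched with exactly the same job script / mpiexec line as the failing run), is to ask every rank where, if anywhere, it finds glpsol:

# check_glpsol.py -- report, from every MPI rank, the hostname and where (if
# anywhere) glpsol is found on the PATH. Launch it the same way as the real job.
import socket
from distutils.spawn import find_executable  # Python 2.7; shutil.which on Python 3

from mpi4py import MPI

comm = MPI.COMM_WORLD
reports = comm.gather((comm.rank, socket.gethostname(), find_executable("glpsol")), root=0)
if comm.rank == 0:
    for rank, host, exe in reports:
        print("rank %d on %s: glpsol = %s" % (rank, host, exe or "NOT FOUND"))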
*Apart from this: Is it possible to divide compilation and execution in two steps? Could I compile, make sure that everything is in place and then just execute with another job?*
The generated, optimised and compiled kernels are cached. So what you can do is run your problem once, perhaps at a small size and for just a couple of time steps. Then make sure that the cache is still there when you do your real run; at that point you hopefully won't need to compile again, because everything is already in the cache.
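As a sketch of that workflow (the driver name, paths, mesh sizes and the toy Poisson-type form are illustrative stand-ins for the real 2pinjection.py setup), one driver can be submitted twice, first as a small warm-up job and then as the production job, both pointing at the same shared cache:

# warmup_or_run.py -- sketch: a "--warmup" run on a tiny mesh compiles and
# caches the kernels; the production run reuses them, since the generated code
# does not depend on the mesh resolution. Paths and the toy form are illustrative.
import argparse
import os

os.environ.setdefault("PYOP2_CACHE_DIR", "/work/user/pyop2-cache")  # shared, illustrative

from firedrake import *

parser = argparse.ArgumentParser()
parser.add_argument("--warmup", action="store_true",
                    help="tiny mesh, just populate the kernel cache")
args = parser.parse_args()

n = 4 if args.warmup else 256  # coarse mesh for the warm-up job
mesh = UnitSquareMesh(n, n)
V = FunctionSpace(mesh, "CG", 1)
u = Function(V)
v = TestFunction(V)
F = inner(grad(u), grad(v))*dx - Constant(1.0)*v*dx
bc = DirichletBC(V, 0.0, "on_boundary")
solve(F == 0, u, bcs=bc)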
[Buesing, Henrik] Bottom line first: in my opinion there is some kind of race condition when using more than one node. Any ideas where this could be?

Line of thought: I started Firedrake on one node (16 cores). This generated 35 *.so files. Any ideas why 35? I also tried 8 cores, which generated 35 *.so files, too. Therefore I was confident that all the *.so files would also be in place for the 64-core calculation.

I ran the 64-core case (4 nodes) and also two and three nodes (32 and 48 cores). The error message is sometimes [1] (Bad address) and sometimes the previous PuLP error. Is the "Bad address" message more informative than the previous one?

Then I reran the 2-node case a few times to see if the error message changes. Sometimes I get the PuLP error, sometimes the bad-address error. However, one time the simulation just ran. So I tried this also with 64 cores, but never got it running. Thus I figured there might be some race condition when using more than one node.

To minimize this problem, I ran with 17 cores (1 node + 1 core). In my test cases this always runs. Of course, the bigger the number of cores, the bigger the chance for the race condition to appear. Thus I never got the 64-core case to work in my limited number of test cases.

Any ideas what could be wrong and how to fix this? Thank you!
Henrik

[1]
Traceback (most recent call last):
  File "/work/user/Firedrake/twophase/2pDrake/2pinjection.py", line 228, in <module>
    solver = NonlinearVariationalSolver(problem,options_prefix="")
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/variational_solver.py", line 156, in __init__
    pre_function_callback=pre_f_callback)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/solving_utils.py", line 226, in __init__
    appctx=appctx)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/assemble.py", line 120, in allocate_matrix
    allocate_only=True)
  File "<decorator-gen-279>", line 2, in _assemble
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/utils.py", line 62, in wrapper
    return f(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/assemble.py", line 192, in _assemble
    kernels = tsfc_interface.compile_form(f, "form", parameters=form_compiler_parameters, inverse=inverse)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/tsfc_interface.py", line 193, in compile_form
    number_map).kernels
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 200, in __new__
    obj = make_obj()
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 190, in make_obj
    obj.__init__(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/tsfc_interface.py", line 121, in __init__
    kernels.append(KernelInfo(kernel=Kernel(ast, ast.name, opts=opts),
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 200, in __new__
    obj = make_obj()
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 190, in make_obj
    obj.__init__(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/base.py", line 3843, in __init__
    self._code = self._ast_to_c(self._ast, opts)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/sequential.py", line 73, in _ast_to_c
    ast_handler.plan_cpu(self._opts)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/COFFEE/coffee/plan.py", line 121, in plan_cpu
    loop_opt.rewrite(rewrite)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/COFFEE/coffee/optimizer.py", line 117, in rewrite
    ew.sharing_graph_rewrite()
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/COFFEE/coffee/rewriter.py", line 609, in sharing_graph_rewrite
    prob.solve(ilp.GLPK(msg=0))
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/pulp/pulp.py", line 1651, in solve
    status = solver.actualSolve(self, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/pulp/solvers.py", line 383, in actualSolve
    rc = subprocess.call(proc, stdout = pipe, stderr = pipe)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 578, in call
    p = Popen(*popenargs, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 825, in __init__
    restore_signals, start_new_session)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 1574, in _execute_child
    raise child_exception_type(errno_num, err_msg)
OSError: [Errno 14] Bad address
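If the suspicion is a shared-filesystem visibility race (rank 0 writes the .so, the other nodes do not see it yet), a minimal sketch along the following lines could help confirm it. It assumes mpi4py is importable in the Firedrake virtualenv and should be launched with the same job script as the failing run; the probe path defaults to an illustrative location if PYOP2_CACHE_DIR is not set:

# visibility_probe.py -- sketch: rank 0 writes a file into the (shared) cache
# directory; every other rank polls until it can see the file and reports how
# long that took. A long delay or a timeout would point at a filesystem
# visibility race rather than a compilation failure.
import os
import time

from mpi4py import MPI

comm = MPI.COMM_WORLD
cache_dir = os.environ.get("PYOP2_CACHE_DIR", "/work/user/pyop2-cache")  # illustrative default
probe = os.path.join(cache_dir, "visibility-probe")

if comm.rank == 0:
    with open(probe, "w") as f:
        f.write("hello")
comm.Barrier()  # rank 0 has written the file; now everybody goes looking for it

waited = 0.0
while not os.path.exists(probe) and waited < 30.0:
    time.sleep(0.5)
    waited += 0.5

if os.path.exists(probe):
    print("rank %d: probe visible after %.1f s" % (comm.rank, waited))
else:
    print("rank %d: probe still missing after %.1f s" % (comm.rank, waited))

comm.Barrier()
if comm.rank == 0:
    os.remove(probe)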