Apart from this: is it possible to split compilation and execution into two steps? Could I compile, make sure that everything is in place, and then just execute with another job?

The generated, optimised and compiled kernels are cached. So what you can do is run your problem once, perhaps with a small size and just a couple of time steps. Then make sure the cache is still there when you do your real run; at that point you hopefully won't need to compile again, because everything is already in the cache.
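
A minimal sketch of how one might check that the warm-up run left compiled kernels behind before submitting the real job. The cache directory is passed in explicitly because its location depends on the installation, and the function name `list_compiled_kernels` is hypothetical; it is not part of Firedrake or PyOP2.

```python
from pathlib import Path


def list_compiled_kernels(cache_dir):
    """Return the sorted file names of all *.so files found under cache_dir.

    cache_dir is whatever directory the installation uses for compiled
    kernels (an assumption here; check your own configuration).
    """
    root = Path(cache_dir)
    if not root.is_dir():
        # Nothing cached yet, or the directory is not visible from this node.
        return []
    return sorted(p.name for p in root.rglob("*.so"))
```

Comparing the list before and after a run shows whether that run had to compile anything new. On a multi-node job this only helps if the directory sits on a file system that every node can see.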


[Buesing, Henrik] Bottom line first: in my opinion, there is some kind of race condition when using more than one node. Any ideas where this could be?

 

Line of thought:

I started Firedrake on one node (16 cores). This generated 35 *.so files. Any ideas why 35? I also tried 8 cores, which generated 35 *.so files as well. I was therefore confident that all the *.so files would also be in place for the 64-core calculation.

 

I ran the 64-core case (4 nodes) and also ran on two and three nodes (32 and 48 cores).
The error message is sometimes [1] (Bad address) and sometimes the previous pulp error. Is the “Bad address” message more informative than the previous one?

 

Then I reran the 2-node case a few times to see whether the error message changes. Sometimes I get the pulp error, sometimes the Bad address error. However, one time the simulation just ran. So I tried this with 64 cores as well, but never got it to run.

Thus, I figured there might be some race condition when using more than one node. To minimize the problem, I ran with 17 cores (1 node + 1 core). In my test cases this always runs. Of course, the larger the number of cores, the greater the chance that the race condition appears, which would explain why I never got the 64-core case to work in my limited number of test runs.
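
One way to probe whether the failure is transient, as a race condition would suggest, is to retry the subprocess launch when it fails with the same errno. The sketch below is only a diagnostic aid, not a fix, and is not part of pulp or COFFEE; the name `call_with_retry` and its parameters are hypothetical.

```python
import errno
import subprocess
import time


def call_with_retry(cmd, attempts=3, delay=0.5):
    """Run cmd and return its exit code, retrying only on EFAULT at launch."""
    for attempt in range(1, attempts + 1):
        try:
            return subprocess.call(cmd)
        except OSError as exc:
            # Errno 14 (EFAULT) is the "Bad address" seen in traceback [1].
            # Any other error, or the final failed attempt, is re-raised.
            if exc.errno != errno.EFAULT or attempt == attempts:
                raise
            time.sleep(delay)
```

If a second attempt regularly succeeds where the first failed, that points towards a timing-dependent problem rather than a missing file or a broken installation.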

Any ideas what could be wrong and how to fix this?

 

 

Thank you!
Henrik

 

[1]

 

Traceback (most recent call last):
  File "/work/user/Firedrake/twophase/2pDrake/2pinjection.py", line 228, in <module>
    solver = NonlinearVariationalSolver(problem,options_prefix="")
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/variational_solver.py", line 156, in __init__
    pre_function_callback=pre_f_callback)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/solving_utils.py", line 226, in __init__
    appctx=appctx)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/assemble.py", line 120, in allocate_matrix
    allocate_only=True)
  File "<decorator-gen-279>", line 2, in _assemble
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/utils.py", line 62, in wrapper
    return f(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/assemble.py", line 192, in _assemble
    kernels = tsfc_interface.compile_form(f, "form", parameters=form_compiler_parameters, inverse=inverse)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/tsfc_interface.py", line 193, in compile_form
    number_map).kernels
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 200, in __new__
    obj = make_obj()
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 190, in make_obj
    obj.__init__(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/firedrake/firedrake/tsfc_interface.py", line 121, in __init__
    kernels.append(KernelInfo(kernel=Kernel(ast, ast.name, opts=opts),
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 200, in __new__
    obj = make_obj()
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/caching.py", line 190, in make_obj
    obj.__init__(*args, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/base.py", line 3843, in __init__
    self._code = self._ast_to_c(self._ast, opts)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/PyOP2/pyop2/sequential.py", line 73, in _ast_to_c
    ast_handler.plan_cpu(self._opts)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/COFFEE/coffee/plan.py", line 121, in plan_cpu
    loop_opt.rewrite(rewrite)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/COFFEE/coffee/optimizer.py", line 117, in rewrite
    ew.sharing_graph_rewrite()
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/src/COFFEE/coffee/rewriter.py", line 609, in sharing_graph_rewrite
    prob.solve(ilp.GLPK(msg=0))
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/pulp/pulp.py", line 1651, in solve
    status = solver.actualSolve(self, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/pulp/solvers.py", line 383, in actualSolve
    rc = subprocess.call(proc, stdout = pipe, stderr = pipe)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 578, in call
    p = Popen(*popenargs, **kwargs)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 825, in __init__
    restore_signals, start_new_session)
  File "/rwthfs/rz/cluster/work/user/Firedrake/firedrake/lib/python2.7/site-packages/subprocess32.py", line 1574, in _execute_child
    raise child_exception_type(errno_num, err_msg)
OSError: [Errno 14] Bad address