CPU time for hessian computation
Dear all,

In my code, I need to compute the Hessian of my solution (for now a Lagrange P1 field, on an unstructured simplicial mesh). To that end, I have been using the classic FE method, which is given in the 2D Monge-Ampère demonstration (http://www.firedrakeproject.org/demos/ma-demo.py.html).

I am now trying to do the same thing in 3D, but it is slow. For now, I am running in serial, so I expected it to be slow, but not that slow. The code used is the following:

    from firedrake import *

    n = 20
    mesh = UnitCubeMesh(n, n, n)
    W = FunctionSpace(mesh, 'CG', 1)
    M = TensorFunctionSpace(mesh, 'CG', 1)
    icExpr = Expression("((x[0]-0.35)*(x[0]-0.35) + (x[1]-0.35)*(x[1]-0.35)"
                        " + (x[2]-0.35)*(x[2]-0.35) < 0.15*0.15)")
    w = Function(W).interpolate(icExpr)
    sigma = TestFunction(M)
    H = Function(M)
    n = FacetNormal(mesh)
    Lh = inner(sigma, H)*dx + inner(div(sigma), grad(w))*dx
    Lh -= (sigma[0,0]*w.dx(0)*n[0] + sigma[1,0]*w.dx(1)*n[0] + sigma[2,0]*w.dx(2)*n[0]
         + sigma[0,1]*w.dx(0)*n[1] + sigma[1,1]*w.dx(1)*n[1] + sigma[2,1]*w.dx(2)*n[1]
         + sigma[0,2]*w.dx(0)*n[2] + sigma[1,2]*w.dx(1)*n[2] + sigma[2,2]*w.dx(2)*n[2])*ds
    H_prob = NonlinearVariationalProblem(Lh, H)
    H_solv = NonlinearVariationalSolver(H_prob)
    H_solv.solve()

(Warning if you try to run this code: UnitCubeMesh can take a long time for not-so-large n.)

If we consider a unit cube with 50 vertices on each edge (n=50, that makes 132,651 vertices and 750,000 tetrahedra, which is a small 3D case), the wall clock time for the last instruction, H_solv.solve(), is 15 min. For n=65 (287,496 vertices and 1,647,750 tets), the wall clock time goes up to one hour. For n=80 (531,441 vertices, 3,072,000 tets), it takes 80 min. The processor is an Intel Xeon E5-2640 v3 @ 2.60GHz, and these timings are obtained with the default set of parameters for the solver (notably snes_rtol = 1e-8 and ksp_rtol = 1e-5). For these meshes, the solver ends after 2 SNES iterations, each iteration requiring 5 KSP iterations, which seem to be relatively small numbers.

If I understand correctly, these times include both the assembly and the actual solution of the problem. I do not know how to separate the assembly time from the solver time, but, if I can trust Firedrake's debug prints (with snes_monitor: True), the assembly is an order of magnitude slower than the solve.

Do these CPU times look normal to you?

I have changed snes_rtol to 1e-2, as I don't need great precision for my Hessian (I'm leaving ksp_rtol at 1e-5 to preserve convergence; am I right?). I also changed the preconditioner to SOR, which helped (a lot for tiny meshes, somewhat less for bigger ones). With these options, my CPU time becomes 5 min for n=50 and 9 min for n=80.

That is better. However, I am a little concerned that the CPU time increases quickly with n, and that this method seems significantly slower than a gradient/Hessian recovery method. In the context of mesh adaptation, I have to compute the Hessian frequently, and it is not supposed to be a costly stage in the process.

So what do you think? Are there other things I can do to speed this up?

Many thanks,

--
Nicolas

--
Nicolas Barral
Dept. of Earth Science and Engineering
Imperial College London
Royal School of Mines - Office 4.88
London SW7 2AZ
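The tolerance and preconditioner changes Nicolas describes would be passed through the solver's parameter dictionary; a minimal sketch (not code from the thread, just the standard PETSc option names with the values quoted above):

    # Relaxed tolerances and SOR preconditioning, as described in the message
    H_solv = NonlinearVariationalSolver(H_prob, solver_parameters={
        'snes_rtol': 1e-2,   # loose nonlinear tolerance: high accuracy not needed
        'ksp_rtol': 1e-5,    # default linear tolerance, kept to preserve convergence
        'pc_type': 'sor'})   # SOR preconditioner instead of the default
    H_solv.solve()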
On 18/07/16 14:17, Nicolas Barral wrote: ...
Please wrap the different bits of your code, mesh/function-space creation and then solver creation/solve, in timed_stage context managers:

    from pyop2.profiling import timed_stage

    with timed_stage("Make spaces"):
        mesh = ...
        V = FunctionSpace(...)
    ...
    with timed_stage("Solve"):
        solve(Lh == 0, H, ...)

And send the output of

    python script.py -log_view

for the different mesh sizes.

Lawrence
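Filled out against the script from the first message, that instrumentation might look like the sketch below (the stage names are arbitrary choices, and outer(grad(w), nu) is used as a compact equivalent of the component-wise boundary term):

    from firedrake import *
    from pyop2.profiling import timed_stage

    n = 50
    with timed_stage("Make mesh"):
        mesh = UnitCubeMesh(n, n, n)

    with timed_stage("Make spaces"):
        W = FunctionSpace(mesh, 'CG', 1)
        M = TensorFunctionSpace(mesh, 'CG', 1)

    with timed_stage("Setup"):
        w = Function(W).interpolate(Expression(
            "((x[0]-0.35)*(x[0]-0.35) + (x[1]-0.35)*(x[1]-0.35)"
            " + (x[2]-0.35)*(x[2]-0.35) < 0.15*0.15)"))
        sigma = TestFunction(M)
        H = Function(M)
        nu = FacetNormal(mesh)
        # outer(grad(w), nu)[i,j] = w.dx(i)*nu[j], so this ds term matches
        # the component-wise sum in the original script
        Lh = (inner(sigma, H)*dx + inner(div(sigma), grad(w))*dx
              - inner(sigma, outer(grad(w), nu))*ds)

    with timed_stage("Solve"):
        solve(Lh == 0, H)

Running this as python script.py -log_view then prints a per-stage summary like the one in the next message.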
On 18/07/16 14:53, Lawrence Mitchell wrote: ...
I did this for n=50 on an identical Xeon chip. I quibble with your statement that the solve takes loads of time. Here's what I get:

    Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                            Avg     %Total     Avg     %Total   counts   %Total     Avg        %Total   counts   %Total
     0:      Main Stage: 5.5219e-01   0.1%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
     1:       Make mesh: 7.2269e+02  96.2%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
     2:   Solve problem: 2.8219e+01   3.8%  5.8936e+10 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%

So you can see it took a little over 12 minutes to build the mesh, and 30s to solve the problem. AFAICT almost all of that time is in ctetgen meshing the cube (it does a fully unstructured triangulation, so this may explain things). We would welcome patches that made the creation of these structured cube meshes faster by not calling general meshing software.

Digging down into the solve, it takes 5 seconds to evaluate the Jacobian twice, 1 second for three residual evaluations, and about 20 seconds for the two solves.

Lawrence
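Since the residual form is linear in H (the Jacobian is just the mass matrix of the tensor space), one further avenue, not tested in the thread and offered here only as a sketch reusing mesh, M, sigma, and w from the script above, is to pose it as a linear problem and skip the SNES loop entirely:

    # Untested sketch: linear restatement of the weak Hessian problem.
    # outer(grad(w), nu) reproduces the component-wise boundary term.
    Htrial = TrialFunction(M)
    nu = FacetNormal(mesh)
    a = inner(sigma, Htrial)*dx
    L = -inner(div(sigma), grad(w))*dx + inner(sigma, outer(grad(w), nu))*ds
    H = Function(M)
    solve(a == L, H, solver_parameters={'ksp_type': 'cg',  # mass matrix is SPD
                                        'pc_type': 'sor'})

This solves the same linear system each Newton step solves, so it does not change the per-solve KSP cost, but it avoids the repeated residual and Jacobian assembly of the nonlinear formulation.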