Lawrence,

I have attached the code I am working with. It is basically the one you sent me a few weeks ago, but I am only working with selfp. Also attached are the log files for 1, 2, and 4 processors on our local HPC machine (Intel Xeon E5-2680v2, 2.8 GHz).

1) I wrapped the PyPAPI calls around solver.solve(). I think this is doing what I want. Right now I estimate the arithmetic intensity from the measured flops, loads, and stores. When I compare the measured flops with PETSc's manually logged flop count, PAPI seems to overcount by a factor of 2 (which I suppose is expected on a newer Intel machine). In terms of computing the flops and AI this is what I want; I just want to make sure these counts do not include the DMPlex initialization and so on, because:

2) According to the attached log summaries, DMPlexDistribute and Mesh Migration still consume a significant portion of the time. By significant I mean that the %T does not drop as I increase the number of processors. I remember Michael Lange's presentations (from PETSc-20 and the webinar) mentioning something about this?

3) Bonus question: how do I also use PAPI_flops(&real_time, &proc_time, &flpins, &mflops)? I see there is a flops() function, but in my limited PAPI experience I seem to run into problems whenever I put both that and PAPI_start_counters into the same program. I could be wrong, though.

Thanks,
Justin

On Thu, Jul 16, 2015 at 3:46 AM, Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk> wrote:
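For concreteness, here is a minimal sketch of the counter wrapping described in (1) above, with a toy Poisson problem standing in for the attached selfp code. The pypapi names (papi_high.start_counters / stop_counters and the events module) follow one particular Python binding of PAPI's high-level API and are an assumption; adjust them to whatever binding is actually installed.

    from firedrake import *
    from pypapi import papi_high
    from pypapi import events as papi_events

    # Toy problem standing in for the attached selfp code.
    mesh = UnitSquareMesh(64, 64)
    V = FunctionSpace(mesh, "CG", 1)
    u = Function(V)
    v = TestFunction(V)
    F = inner(grad(u), grad(v))*dx - Constant(1.0)*v*dx
    problem = NonlinearVariationalProblem(F, u,
                                          bcs=DirichletBC(V, 0, "on_boundary"))
    solver = NonlinearVariationalSolver(problem)

    # Start counting only here, so the mesh build and DMPlex setup above
    # are excluded from the measurement.
    papi_high.start_counters([papi_events.PAPI_FP_OPS,   # floating-point ops
                              papi_events.PAPI_LD_INS,   # load instructions
                              papi_events.PAPI_SR_INS])  # store instructions
    solver.solve()
    flops, loads, stores = papi_high.stop_counters()

    # Crude arithmetic-intensity estimate: assume 8 bytes per load/store
    # instruction (doubles); vector loads/stores mean this underestimates
    # the bytes actually moved.
    ai = flops / (8.0 * (loads + stores))
    print("flops = %d, estimated AI = %.3f flop/byte" % (flops, ai))

On (3): the behaviour described is consistent with the classic PAPI high-level API driving a single internal event set, so PAPI_flops() generally cannot run while counters started with PAPI_start_counters() are active; the usual workaround is to stick with start/stop_counters and compute MFLOP/s from PAPI_FP_OPS and the elapsed wall time.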
On 15/07/15 21:14, Justin Chang wrote:
First option works wonderfully for me, but now I am wondering how I would employ the second option.
Specifically, I want to profile SNESSolve()
OK, so calls out to PETSc are done from Python (via petsc4py). It's just calls to integral assembly (i.e. evaluation of jacobians and residuals) that go through a generated code path.
To be more concrete, let's say you have the following code:
F = some_residual
problem = NonlinearVariationalProblem(F, u, ...)
solver = NonlinearVariationalSolver(problem)
solver.solve()
Then the call chain inside solver.solve is effectively:
solver.solve ->
    SNESSolve ->                                   # via petsc4py
        SNESComputeJacobian -> assemble(Jacobian)  # Callback to Firedrake
        SNESComputeFunction -> assemble(residual)  # Callback to Firedrake
        KSPSolve
So if you wrapped flop counting around the outermost solver.solve() call, you're pretty close to wrapping SNESSolve.
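If it is useful, here is a minimal sketch of reading PETSc's own (manually logged) flop count over exactly that region, for comparison with the PAPI numbers. It assumes petsc4py exposes PETSc.Log.getFlops() (the wrapper around PetscGetFlops) and continues from the snippet above:

    from petsc4py import PETSc

    # F, problem and solver as defined above.
    before = PETSc.Log.getFlops()
    solver.solve()
    petsc_flops = PETSc.Log.getFlops() - before  # flops PETSc logged on this rank

    print("PETSc-logged flops on this rank: %g" % petsc_flops)

Bear in mind that PetscGetFlops only reports flops explicitly logged via PetscLogFlops, so work done inside generated assembly kernels may not be included; that, together with the overcounting mentioned above, means some discrepancy between the PAPI and PETSc numbers is to be expected.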
Or do you mean something else when profiling SNESSolve?
I would prefer to circumvent profiling of the DMPlex distribution, because it seems to be a major bottleneck for multiple processes at the moment.
Can you provide an example mesh/process count that demonstrates this issue, or at least characterize it a little better? Michael Lange and Matt Knepley have done a lot of work over the last nine months or so on making DMPlexDistribute much faster than it was. So if it still turns out to be slow, we'd really like to know about it and try to fix it.
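In the meantime, one way to keep the distribution cost visible but separate from the solve in the log output is to put the two phases in different PETSc log stages. A minimal sketch, assuming petsc4py's PETSc.Log.Stage interface and a toy problem in place of the real one:

    from firedrake import *
    from petsc4py import PETSc

    setup_stage = PETSc.Log.Stage("Mesh setup")
    solve_stage = PETSc.Log.Stage("Solve")

    setup_stage.push()
    mesh = UnitSquareMesh(256, 256)
    V = FunctionSpace(mesh, "CG", 1)   # mesh is built (and, on >1 rank,
                                       # distributed) by the time this exists
    u = Function(V)
    v = TestFunction(V)
    F = inner(grad(u), grad(v))*dx - Constant(1.0)*v*dx
    problem = NonlinearVariationalProblem(F, u,
                                          bcs=DirichletBC(V, 0, "on_boundary"))
    solver = NonlinearVariationalSolver(problem)
    setup_stage.pop()

    solve_stage.push()
    solver.solve()
    solve_stage.pop()

Running with -log_summary then reports DMPlexDistribute and Mesh Migration under the mesh-setup stage, separately from the events in the solve stage, which should make it easier to see whether the distribution cost really fails to shrink with the process count.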
Cheers,
Lawrence
_______________________________________________
firedrake mailing list
firedrake@imperial.ac.uk
https://mailman.ic.ac.uk/mailman/listinfo/firedrake