Lawrence,

I have attached the code I am working with. It is basically the one you sent me a few weeks ago, but I am only working with selfp. Also attached are the log files for 1, 2, and 4 processors on our local HPC machine (Intel Xeon E5-2680v2, 2.8 GHz).

1) I wrapped the PyPAPI calls around solver.solve(). I think this is doing what I want. Right now I estimate the arithmetic intensity from the measured flops, loads, and stores. When I compare the measured flops with PETSc's manually logged flop count, PAPI seems to overcount by a factor of 2 (which I suppose is expected on a newer Intel machine). In terms of computing the flops and AI this is what I want; I just want to make sure these counts do not include the DMPlex initialization and so on, because:

2) According to the attached log summaries, DMPlexDistribute and Mesh Migration still consume a significant portion of the time. By significant I mean that the %T does not drop as I increase the number of processors. I remember Michael Lange's presentations (from PETSc-20 and the webinar) mentioning something about this?

3) Bonus question: how do I also use PAPI_flops(&real_time, &proc_time, &flpins, &mflops)? I see there is a flops() function, but in my limited PAPI experience I seem to run into problems whenever I put both that and PAPI_start_counters into the same program. I could be wrong, though.

Thanks,
Justin

On Thu, Jul 16, 2015 at 3:46 AM, Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk> wrote:
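For concreteness, here is a minimal sketch of the counter wrapping described in (1) above, with a toy Poisson problem standing in for the attached selfp code. The pypapi names (papi_high.start_counters / stop_counters and the events module) follow one particular Python binding of PAPI's high-level API and are an assumption; adjust them to whatever binding is actually installed.

    from firedrake import *
    from pypapi import papi_high
    from pypapi import events as papi_events

    # Toy problem standing in for the attached selfp code.
    mesh = UnitSquareMesh(64, 64)
    V = FunctionSpace(mesh, "CG", 1)
    u = Function(V)
    v = TestFunction(V)
    F = inner(grad(u), grad(v))*dx - Constant(1.0)*v*dx
    problem = NonlinearVariationalProblem(F, u,
                                          bcs=DirichletBC(V, 0, "on_boundary"))
    solver = NonlinearVariationalSolver(problem)

    # Start counting only here, so the mesh build and DMPlex setup above
    # are excluded from the measurement.
    papi_high.start_counters([papi_events.PAPI_FP_OPS,   # floating-point ops
                              papi_events.PAPI_LD_INS,   # load instructions
                              papi_events.PAPI_SR_INS])  # store instructions
    solver.solve()
    flops, loads, stores = papi_high.stop_counters()

    # Crude arithmetic-intensity estimate: assume 8 bytes per load/store
    # instruction (doubles); vector loads/stores mean this underestimates
    # the bytes actually moved.
    ai = flops / (8.0 * (loads + stores))
    print("flops = %d, estimated AI = %.3f flop/byte" % (flops, ai))

On (3): the behaviour described is consistent with the classic PAPI high-level API driving a single internal event set, so PAPI_flops() generally cannot run while counters started with PAPI_start_counters() are active; the usual workaround is to stick with start/stop_counters and compute MFLOP/s from PAPI_FP_OPS and the elapsed wall time.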
On 15/07/15 21:14, Justin Chang wrote:
First option works wonderfully for me, but now I am wondering how I would employ the second option.
Specifically, I want to profile SNESSolve()
OK, so calls out to PETSc are done from Python (via petsc4py). It's just calls to integral assembly (i.e. evaluation of jacobians and residuals) that go through a generated code path.
To be more concrete, let's say you have the following code:
F = some_residual
problem = NonlinearVariationalProblem(F, u, ...)
solver = NonlinearVariationalSolver(problem)
solver.solve()
Then the call chain inside solver.solve is effectively:
solver.solve ->
    SNESSolve ->                                   # via petsc4py
        SNESComputeJacobian -> assemble(Jacobian)  # Callback to Firedrake
        SNESComputeFunction -> assemble(residual)  # Callback to Firedrake
        KSPSolve
So if you wrapped flop counting around the outermost solver.solve() call, you're pretty close to wrapping SNESSolve.
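If it is useful, here is a minimal sketch of reading PETSc's own (manually logged) flop count over exactly that region, for comparison with the PAPI numbers. It assumes petsc4py exposes PETSc.Log.getFlops() (the wrapper around PetscGetFlops) and continues from the snippet above:

    from petsc4py import PETSc

    # F, problem and solver as defined above.
    before = PETSc.Log.getFlops()
    solver.solve()
    petsc_flops = PETSc.Log.getFlops() - before  # flops PETSc logged on this rank

    print("PETSc-logged flops on this rank: %g" % petsc_flops)

Bear in mind that PetscGetFlops only reports flops explicitly logged via PetscLogFlops, so work done inside generated assembly kernels may not be included; that, together with the overcounting mentioned above, means some discrepancy between the PAPI and PETSc numbers is to be expected.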
Or do you mean something else when profiling SNESSolve?
I would prefer to circumvent profiling of the DMPlex distribution, because it seems to be a major bottleneck for multiple processes at the moment.
Can you provide an example mesh/process count that demonstrates this issue, or at least characterize it a little better? Michael Lange and Matt Knepley have done a lot of work over the last nine months or so on making DMPlexDistribute much faster than it was. So if it still turns out to be slow, we'd really like to know about it and try to fix it.
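In the meantime, one way to keep the distribution cost visible but separate from the solve in the log output is to put the two phases in different PETSc log stages. A minimal sketch, assuming petsc4py's PETSc.Log.Stage interface and a toy problem in place of the real one:

    from firedrake import *
    from petsc4py import PETSc

    setup_stage = PETSc.Log.Stage("Mesh setup")
    solve_stage = PETSc.Log.Stage("Solve")

    setup_stage.push()
    mesh = UnitSquareMesh(256, 256)
    V = FunctionSpace(mesh, "CG", 1)   # mesh is built (and, on >1 rank,
                                       # distributed) by the time this exists
    u = Function(V)
    v = TestFunction(V)
    F = inner(grad(u), grad(v))*dx - Constant(1.0)*v*dx
    problem = NonlinearVariationalProblem(F, u,
                                          bcs=DirichletBC(V, 0, "on_boundary"))
    solver = NonlinearVariationalSolver(problem)
    setup_stage.pop()

    solve_stage.push()
    solver.solve()
    solve_stage.pop()

Running with -log_summary then reports DMPlexDistribute and Mesh Migration under the mesh-setup stage, separately from the events in the solve stage, which should make it easier to see whether the distribution cost really fails to shrink with the process count.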
Cheers,
Lawrence
_______________________________________________
firedrake mailing list
firedrake@imperial.ac.uk
https://mailman.ic.ac.uk/mailman/listinfo/firedrake