Lawrence,

I have attached the code I am working with. It's basically the one you sent me a few weeks ago, except I am only working with selfp. Also attached are the log files for 1, 2, and 4 processors on our local HPC machine (Intel Xeon E5-2680v2, 2.8 GHz).

1) I wrapped the PyPAPI calls around solver.solve(), and I believe this is doing what I want. Right now I am estimating the arithmetic intensity by recording the flops, loads, and stores (a rough sketch of the wrapping is below, after these questions). When I compare the measured flop count with the PETSc manual flop count, PAPI seems to over-count by a factor of two (which I suppose is to be expected on a newer Intel machine). Anyway, in terms of computing the flops and AI this is what I want; I just wanted to make sure these counts don't include the DMPlex initialization and the like, because:

2) According to the attached log summaries, DMPlexDistribute and Mesh Migration still consume a significant portion of the time. By significant I mean that their %T does not decrease as I increase the number of processors. I remember Michael Lange's presentations (from PETSc-20 and the webinar) mentioning something about this?

3) Bonus question: how do I also use PAPI_flops(&real_time, &proc_time, &flpins, &mflops)? I see there's the flops() function, but in my limited PAPI experience I seem to run into issues whenever I put both that and PAPI_start_counters into the same program, though I could be wrong.
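
In case it helps, here is roughly what the wrapping around the solve looks like. This is a simplified sketch rather than the exact attached code, and it assumes PyPAPI's high-level papi_high interface (module and event names may differ between versions of the bindings):

from pypapi import papi_high
from pypapi import events as papi_events

# ... build the mesh, F, problem, and solver as usual ...

papi_high.start_counters([
    papi_events.PAPI_FP_OPS,  # floating-point operations
    papi_events.PAPI_LD_INS,  # load instructions
    papi_events.PAPI_SR_INS,  # store instructions
])

solver.solve()

flops, loads, stores = papi_high.stop_counters()

# Crude AI estimate: assume every load/store moves one 8-byte double
# (this ignores vector width and cache effects).
ai = flops / (8.0 * (loads + stores))
print("flops = %d, AI = %.3f flops/byte" % (flops, ai))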

Thanks,
Justin

On Thu, Jul 16, 2015 at 3:46 AM, Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk> wrote:

On 15/07/15 21:14, Justin Chang wrote:
> First option works wonderfully for me, but now I am wondering how
> I would employ the second option.
>
> Specifically, I want to profile SNESSolve()

OK, so calls out to PETSc are done from Python (via petsc4py).  It's
just the calls to integral assembly (i.e. evaluation of Jacobians and
residuals) that go through a generated code path.

To be more concrete, let's say you have the following code:

F = some_residual

problem = NonlinearVariationalProblem(F, u, ...)

solver = NonlinearVariationalSolver(problem)

solver.solve()

Then the call chain inside solver.solve is effectively:

solver.solve ->
  SNESSolve -> # via petsc4py
    SNESComputeJacobian ->
      assemble(Jacobian) # Callback to Firedrake
    SNESComputeFunction ->
      assemble(residual) # Callback to Firedrake
    KSPSolve

So if you wrapped flop counting around the outermost solver.solve()
call, you're pretty close to wrapping SNESSolve.

Or do you mean something else when profiling SNESSolve?
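
As a cross-check, you can also compare PAPI's hardware counts against
PETSc's own flop counts around the same call.  PETSc's counts are
hand-instrumented estimates of "useful" flops, so the two will not
agree exactly.  A rough sketch using petsc4py's Log interface (I'm
assuming the firedrake.petsc import path here; petsc4py.PETSc works
too):

from firedrake.petsc import PETSc

PETSc.Log.begin()              # enable PETSc logging, as -log_summary does
before = PETSc.Log.getFlops()  # flops logged so far on this process

solver.solve()

petsc_flops = PETSc.Log.getFlops() - before
print("PETSc-logged flops on this rank:", petsc_flops)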

> I would prefer to circumvent profiling of the DMPlex distribution
> because it seems that is a major bottleneck for multiple processes
> at the moment.

Can you provide an example mesh/process count that demonstrates this
issue, or at least characterize it a little better?  Michael Lange and
Matt Knepley have done a lot of work over the last nine months or so
on making DMPlexDistribute much faster than it was, so if it turns
out that it is still slow, we'd really like to know about it and try
to fix it.
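
In the meantime, one way to keep the distribution cost separate in
your profiles is to put the mesh setup in its own PETSc logging stage,
so the log summary reports it apart from the solve.  A rough, untested
sketch (the stage name "MeshSetup" is just an example):

from firedrake import *
from firedrake.petsc import PETSc

PETSc.Log.begin()

stage = PETSc.Log.Stage("MeshSetup")
stage.push()
# Distribution happens when the mesh is built, so it is logged in
# this stage.
mesh = UnitSquareMesh(100, 100)
stage.pop()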

Cheers,

Lawrence
