Lawrence,
I have attached the code I am working with. It's basically the one you sent me a few weeks ago, but I am only working with selfp. Also attached are the log files from runs with 1, 2, and 4 processors on our local HPC machine (Intel Xeon E5-2680 v2, 2.8 GHz).
1) I wrapped the PyPAPI calls around solver.solve(). I think this is doing what I want: right now I am estimating the arithmetic intensity from the counted flops, loads, and stores. When I compare the measured flop count against the PETSc manual flop count, PAPI seems to over-count by about a factor of 2 (which I suppose is expected on a newer Intel machine). So in terms of computing the flops and AI this is what I want (a stripped-down sketch of what I am doing is pasted below); I just wanted to make sure these counts don't include the DMPlex initialization and such, because:
2) According to the attached log summaries, DMPlexDistribute and MeshMigration still consume a significant portion of the time. By significant I mean that their %T does not go down as I increase the number of processors. I remember Michael Lange's presentations (from PETSc-20 and the webinar) mentioning something about this? To keep the mesh setup out of the solve measurements, I was also thinking of pushing separate log stages (second sketch below).
3) Bonus question: how do I also use PAPI_flops(&real_time, &proc_time, &flpins, &mflops)? I see there is a flops() function, but in my limited PAPI experience I run into problems whenever I put both it and PAPI_start_counters into the same program; I could be wrong, though (third sketch below).
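Here is roughly what I mean in (1). This is only a minimal sketch: it assumes the python_papi ("pypapi") bindings with papi_high.start_counters()/stop_counters(), that PAPI_FP_OPS, PAPI_LD_INS, and PAPI_SR_INS can be counted together on this Xeon, and measure_ai is just a throwaway helper name:

from pypapi import papi_high
from pypapi import events as papi_events

def measure_ai(run):
    # Count flops, loads, and stores around a callable and return an
    # arithmetic-intensity estimate.  Treats every load/store as one
    # 8-byte double, so vector (AVX) accesses are not accounted for.
    papi_high.start_counters([papi_events.PAPI_FP_OPS,
                              papi_events.PAPI_LD_INS,
                              papi_events.PAPI_SR_INS])
    run()
    flops, loads, stores = papi_high.stop_counters()
    return flops, loads, stores, flops / (8.0 * (loads + stores))

# wrap only the solve, not the mesh/DMPlex setup:
# flops, loads, stores, ai = measure_ai(solver.solve)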
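For (2), and to double-check the concern in (1), I was thinking of pushing separate log stages so that -log_view reports the solve separately from DMPlexDistribute/MeshMigration. A rough sketch, assuming petsc4py's PETSc.Log.Stage works as a context manager and that my setup and solve live where indicated:

from petsc4py import PETSc

setup_stage = PETSc.Log.Stage("MeshSetup")
solve_stage = PETSc.Log.Stage("Solve")

with setup_stage:
    pass  # mesh creation / distribution (DMPlexDistribute, MeshMigration) go here

with solve_stage:
    pass  # solver.solve() goes here, so its %T in -log_view excludes the setup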
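And for (3): if I am reading the pypapi docs right, papi_high.flops() wraps PAPI_flops() and returns an object with rtime/ptime/flpops/mflops fields; the first call starts the implicit counters and later calls report rates since then. My understanding (possibly wrong) is that it cannot be mixed with start_counters() in the same region, since both use PAPI's single high-level event set, so I would keep them in separate runs. Something like:

from pypapi import papi_high

papi_high.flops()            # first call sets up and starts the implicit counters
# solver.solve()             # region of interest goes here
result = papi_high.flops()   # later calls report rates since the first call
print(result.rtime, result.ptime, result.flpops, result.mflops)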
Thanks,
Justin