Re: [firedrake] Valgrind and Firedrake
Please send emails about firedrake issues to the firedrake list, or submit GitHub issues. The combined expertise of the list is larger than that of any one individual; moreover, you're much more likely to get an answer quickly.
On 9 Jun 2016, at 14:51, Nicolas Barral <n.barral@imperial.ac.uk> wrote:
Hi Lawrence,
I am still trying to find the bug I mentioned on IRC two weeks ago (point evaluation raising a not in domain error for points in the domain), and I have been totally unsuccessful so I need your help.
You suggested using valgrind to find a memory issue, which I tried with the regular build of Firedrake (not in debug mode). Even with the suppression file suggested on Python's website, I get a log many thousands of lines long, with most errors unusable because the debug symbols were not exported.
Do any of the lines look like memory errors in likely candidates, that is, either code compiled by us (the library has some kind of md5 hash in its name) or else inside libspatialindex?
So I thought I should try to compile Python with the appropriate debug options, and use this Python with Firedrake. But how do I do that? I tried to run firedrake-install with a PATH pointing to my custom build of Python (Miklós's suggestion), but it fails when checking/installing virtualenv (it keeps stopping at "Requirement already satisfied Virtual env installed. Please run firedrake-install again."). It sees the virtualenv from the standard Python, which might be a problem? So I'm stuck at that point.
I do not know how to install multiple versions of Python and get them to pick up the correct virtualenv and similar. I would have thought that your hand-installed Python would not find the globally installed virtualenv stuff, but maybe not. Can you just make a virtualenv "by hand"? We effectively just run "python -c 'import virtualenv'" and use that. So if that doesn't work, that's somewhere to start. Normally if you're on Linux you can just install the package that adds debug symbols for Python (on Ubuntu this is called python-dbg, I think).
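The "import virtualenv" check mentioned above can be extended to report which interpreter is running and where virtualenv would be imported from, which may help tell a custom Python build apart from the system one. This is only a minimal sketch: it assumes a Python 3 interpreter (for importlib.util), and describe_module is a hypothetical helper name, not part of firedrake-install.

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file a module would be imported from, or None if it is not importable."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Which interpreter is actually running (custom build or system Python)?
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    # Where would virtualenv come from, if it is importable at all?
    print("virtualenv: ", describe_module("virtualenv"))
```

If the reported virtualenv path lies outside the prefix of the custom build, the interpreter is picking up a virtualenv installed for another Python.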
My plan was then to compile the modules/dependencies in debug mode (notably libspatialindex). Would that really be helpful? And how do I do that, since I can't just run configure/make in the right directory?
firedrake-install runs configure/make for you. You can determine how to install the shared libraries in debug mode and add that as an option to firedrake-install; we will gladly accept patches for this.
Another idea came to me: I searched for the cells of the mesh surrounding the points "not in domain" and computed the barycentric coordinates of each point in the corresponding cell. It turns out that the point is always on a vertex or an edge (one or two barycentric coordinates ~ 1e-16). Could there be a bug in libspatialindex or in the at function in these cases?
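The barycentric check described above can be sketched in a few lines of plain Python. The function names and the default tolerance are assumptions made for illustration; this is not the code Nicolas ran, nor anything from Firedrake.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates (l0, l1, l2) of point p in triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)  # twice the signed area
    if det == 0.0:
        raise ValueError("degenerate triangle")
    l1 = ((px - ax) * (cy - ay) - (cx - ax) * (py - ay)) / det
    l2 = ((bx - ax) * (py - ay) - (px - ax) * (by - ay)) / det
    return 1.0 - l1 - l2, l1, l2


def classify(coords, tol=1e-12):
    """Say whether a point is inside, on the boundary of, or outside a cell."""
    if min(coords) < -tol:
        return "outside"
    if min(coords) <= tol:
        return "boundary"  # on an edge (one coordinate ~0) or a vertex (two ~0)
    return "inside"
```

A point such as (0.5, -1e-16) in the triangle (0, 0), (1, 0), (0, 1) has one coordinate of about -1e-16: a test with zero tolerance calls it "outside", while any small positive tolerance calls it "boundary".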
So what you're saying is that the point is right on the boundary between two cells? I guess there are two ways the point location can fail. Either the point is not found in any bounding box, or the point is found by libspatialindex in a bounding box, but that bucket does not actually contain the cell in question, so the linear search for point location may then fail due to floating-point rounding. Which of these two is occurring? All of this code is compiled by us and is pretty straight-line C, so you can just compile with debugging and step through it in the failing case to figure out what's going on.
I wanted to talk to you about all that in person, but I'm not sure we'll see each other for a while, since I'll be in Bath next week until Wednesday... The faulty piece of code is likely to be removed once we have a good Firedrake implementation of the interpolation and the metric computation, but maybe there's something deeper I'm not seeing that is worth understanding, so any help would be appreciated :)
I am at Imperial on Thursday next week.
Lawrence
It would be much easier to help if you could provide a minimal failing example that exposes the bug you are experiencing. Preferably as a GitHub issue.
So what you're saying is that the point is right on the boundary between two cells? I guess there are two ways the point location can fail. Either the point is not found in any bounding box, or the point is found by libspatialindex in a bounding box, but that bucket does not actually contain the cell in question, so the linear search for point location may then fail due to floating-point rounding. Which of these two is occurring?
Yes! This is an important distinction. If the point is in the bounding box, but found to be outside the cell during the physical coordinates -> reference coordinates transformation, then you could inspect the Newton iteration and the tolerances.
Also, which cell types do you have this problem with? Please confirm you are not trying to do point evaluations on manifolds.
________________________________
From: firedrake-bounces@imperial.ac.uk <firedrake-bounces@imperial.ac.uk> on behalf of Lawrence Mitchell <lawrence.mitchell@imperial.ac.uk>
Sent: 09 June 2016 14:15:22
To: Barral, Nicolas X
Cc: firedrake
Subject: Re: [firedrake] Valgrind and Firedrake
Please send emails about firedrake issues to the firedrake list, or submit GitHub issues. The combined expertise of the list is larger than that of any one individual; moreover, you're much more likely to get an answer quickly.
Miklós, thanks for your answer.
On 09/06/16 at 14:28, Homolya, Miklós wrote:
It would be much easier to help if you could provide a minimal failing example that exposes the bug you are experiencing. Preferably as a GitHub issue.
It would be much easier to debug as well if I could, but when I try to isolate the bug it disappears...
What I'm doing: considering a time-dependent equation, my code solves the equation for one time step using Firedrake, calls third-party software to generate a metric based on the solution, generates a mesh adapted to this metric using PETSc, interpolates the solution onto the new mesh using ugly loops and point evaluation, then solves the equation for the next time step, and so on for a certain number of iterations. Sometimes during the interpolation step one vertex of the new mesh is found "not in domain" while it obviously is in the domain (one vertex every few iterations).
To isolate the bug, I have to write the meshes concerned to a file and read them in the minimal code. But then the bug disappears. This could be due to a memory issue (hence valgrind), or to rounding errors when writing to the file. I'm going to retry using double precision when writing the files, though, to see if it helps. (I checked: mesh and DMPlex objects unfortunately cannot be serialized.)
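The suspicion about rounding when writing to file is easy to illustrate: a coordinate written with too few digits does not read back as the same double, so a vertex sitting within 1e-16 of an edge can move to either side of it. The sketch below assumes a text format driven by a printf-style format string; the helper names are hypothetical and not tied to any particular mesh writer.

```python
def write_coords(coords, fmt):
    """Format each coordinate with fmt, as a text mesh writer might."""
    return [fmt % x for x in coords]


def read_coords(fields):
    """Parse the text fields back into floats."""
    return [float(s) for s in fields]


def round_trips(coords, fmt):
    """True if writing with fmt and reading back reproduces every value exactly."""
    return read_coords(write_coords(coords, fmt)) == list(coords)


if __name__ == "__main__":
    coords = [0.1, 1.0 / 3.0, 0.5 + 1e-14]
    print(round_trips(coords, "%.6g"))   # too few digits: values are perturbed
    print(round_trips(coords, "%.17g"))  # 17 significant digits: exact round trip
```

Seventeen significant digits are enough to reproduce any IEEE double exactly, so a file written that way should preserve the failing configuration if the cause is purely geometric.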
So what you're saying is that the point is right on the boundary between two cells? I guess there are two ways the point location can fail. Either the point is not found in any bounding box, or the point is found by libspatialindex in a bounding box, but that bucket does not actually contain the cell in question, so the linear search for point location may then fail due to floating-point rounding. Which of these two is occurring?
Yes! This is an important distinction.
Let's find out then! But how do I do that? (cf. my other email)
If the point is in the bounding box, but found to be outside the cell during the physical coordinates -> reference coordinates transformation, then you could inspect the Newton iteration and the tolerances. Also, which cell types do you have this problem with?
Simple triangles.
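For straight-sided triangles the map from reference to physical coordinates is affine, so a Newton iteration for the inverse map should converge in a single step, and the outcome then depends only on the tolerance of the final containment test. The sketch below illustrates that idea in plain Python. It is not Firedrake's generated code; the function names, the starting guess and the tolerances are all assumptions.

```python
def to_reference(x, tri, max_it=10, atol=1e-14):
    """Newton iteration for the reference coordinates X with F(X) = x.

    F maps the reference triangle (0,0), (1,0), (0,1) onto tri.
    Returns (X, residual_history).
    """
    (ax, ay), (bx, by), (cx, cy) = tri
    # Jacobian of the affine map; constant for straight-sided triangles.
    j00, j01 = bx - ax, cx - ax
    j10, j11 = by - ay, cy - ay
    det = j00 * j11 - j01 * j10
    X0, X1 = 1.0 / 3.0, 1.0 / 3.0  # start from the cell centroid
    history = []
    for _ in range(max_it):
        # Residual r = F(X) - x
        r0 = ax + j00 * X0 + j01 * X1 - x[0]
        r1 = ay + j10 * X0 + j11 * X1 - x[1]
        res = max(abs(r0), abs(r1))
        history.append(res)
        if res <= atol:
            break
        # Newton update X <- X - J^{-1} r (2x2 inverse written out)
        X0 -= (j11 * r0 - j01 * r1) / det
        X1 -= (-j10 * r0 + j00 * r1) / det
    return (X0, X1), history


def in_reference_cell(X, tol):
    """Containment test on the reference triangle with an absolute tolerance."""
    return X[0] >= -tol and X[1] >= -tol and X[0] + X[1] <= 1.0 + tol
```

Printing the residual history for a failing point shows whether the iteration converged; if it did, the remaining suspect is the tolerance used in the containment test.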
Please confirm you are not trying to do point evaluations on manifolds.
My domain is a simple unit square, so a manifold I guess, but not the bad ones.
--
Nicolas
--
Nicolas Barral
Dept. of Earth Science and Engineering
Imperial College London
Royal School of Mines - Office 4.88
London SW7 2AZ
Thanks for your answer. Given what you say, maybe we should focus on what happens in the point evaluation function before going back to valgrind.
On 09/06/16 at 14:15, Lawrence Mitchell wrote:
On 9 Jun 2016, at 14:51, Nicolas Barral <n.barral@imperial.ac.uk> wrote:
I am still trying to find the bug I mentioned on IRC two weeks ago (point evaluation raising a not in domain error for points in the domain), and I have been totally unsuccessful so I need your help.
You suggested using valgrind to find a memory issue, which I tried with the regular build of Firedrake (not in debug mode). Even with the suppression file suggested on Python's website, I get a log many thousands of lines long, with most errors unusable because the debug symbols were not exported.
Do any of the lines look like memory errors in likely candidates, that is, either code compiled by us (the library has some kind of md5 hash in its name) or else inside libspatialindex?
I'm absolutely not an expert in valgrind, so I don't know; without the debug symbols it's hard to understand where the errors come from. On Linux they all seem to be Python errors. On my Mac I had some more errors, but I wouldn't trust them.
So I thought I should try to compile Python with the appropriate debug options, and use this Python with Firedrake. But how do I do that? I tried to run firedrake-install with a PATH pointing to my custom build of Python (Miklós's suggestion), but it fails when checking/installing virtualenv (it keeps stopping at "Requirement already satisfied Virtual env installed. Please run firedrake-install again."). It sees the virtualenv from the standard Python, which might be a problem? So I'm stuck at that point.
I do not know how to install multiple versions of Python and get them to pick up the correct virtualenv and similar. I would have thought that your hand-installed Python would not find the globally installed virtualenv stuff, but maybe not. Can you just make a virtualenv "by hand"? We effectively just run "python -c 'import virtualenv'" and use that. So if that doesn't work, that's somewhere to start.
Okay I can try that.
Normally if you're on Linux you can just install the package that adds debug symbols for Python (on Ubuntu this is called python-dbg, I think).
Miklós already suggested that; unfortunately I can't install packages on the Linux machine I'm working on, and it's not an option on Mac.
My plan was then to compile the modules/dependencies in debug mode (notably libspatialindex). Would that really be helpful? And how do I do that, since I can't just run configure/make in the right directory?
firedrake-install runs configure/make for you. You can determine how to install the shared libraries in debug mode and add that as an option to firedrake-install; we will gladly accept patches for this.
Okay, I will at least try and come back with more questions.
Another idea came to me, and I searched for the cells of the mesh surrounding the points "not in domain" and computed their barycentric coordinates in the corresponding cell. It turns out that the point is always on a vertex or an edge (one or two barycentric coordinates ~ 1e-16). Could there be a bug in libspatialindex or the at function in these cases?
So what you're saying is that the point is right on the boundary between two cells?
Apparently, yes.
I guess there are two ways the point location can fail. Either the point is not found in any bounding box, or the point is found by libspatialindex in a bounding box, but that bucket does not actually contain the cell in question, so the linear search for point location may then fail due to floating-point rounding. Which of these two is occurring?
It seems very likely we're in the second case (and floating-point rounding could also explain why the behaviour changes between Mac and Linux?).
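One way to tell the two cases apart is to run the two stages separately and report which one rejects the point. The sketch below does this in plain Python on a list of triangles; the brute-force bounding-box scan merely stands in for the libspatialindex query, and every name in it is hypothetical rather than taken from Firedrake.

```python
def bounding_box(tri):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a triangle."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(p, box):
    xmin, ymin, xmax, ymax = box
    return xmin <= p[0] <= xmax and ymin <= p[1] <= ymax


def in_triangle(p, tri, tol):
    """Exact-geometry test via barycentric coordinates, with tolerance tol."""
    (ax, ay), (bx, by), (cx, cy) = tri
    det = (bx - ax) * (cy - ay) - (cx - ax) * (by - ay)
    l1 = ((p[0] - ax) * (cy - ay) - (cx - ax) * (p[1] - ay)) / det
    l2 = ((bx - ax) * (p[1] - ay) - (p[0] - ax) * (by - ay)) / det
    return min(1.0 - l1 - l2, l1, l2) >= -tol


def locate(p, cells, tol=0.0):
    """Return (status, cell_index); status names the stage that failed."""
    # Stage 1: candidate cells from bounding boxes (stand-in for the spatial index).
    candidates = [i for i, tri in enumerate(cells) if in_box(p, bounding_box(tri))]
    if not candidates:
        return "no bounding box contains the point", None
    # Stage 2: linear search over the candidates with the exact test.
    for i in candidates:
        if in_triangle(p, cells[i], tol):
            return "found", i
    return "in a bounding box, but rejected by every candidate cell", None
```

If a failing point is found again once tol is raised slightly above zero, that points to rounding in the second stage rather than to the spatial index.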
All of this code is compiled by us and is pretty straight line C, so you can just compile with debugging and step through it in the failing case and figure out what's going on.
I'm going to need more help for that... Where is this C code? (I tried to read the code of the at function, but I'm not sure I understand what is called in make_c_evaluate...)
--
Nicolas
--
Nicolas Barral
Dept. of Earth Science and Engineering
Imperial College London
Royal School of Mines - Office 4.88
London SW7 2AZ
participants (3)
- Homolya, Miklós
- Lawrence Mitchell
- Nicolas Barral