Nektar hangs on NFS
Hello,

I deployed a small cluster in the cloud (not on InfiniBand). I set up the VMs, keys, etc., and I use NFS for shared storage. This is a small cluster intended for one or two users, so presumably NFS should be fine; at least that's what I thought. I am currently testing the tutorial case "basics-advection-diffusion". It runs when executed in serial, or in parallel on one node (4 cores). However, when I use two nodes it hangs during:
Initial Conditions:
  - Field u: sin(k*x)*cos(k*y)
Writing: "ADR_mesh_aligned_0.chk" (0.0199919s, XML)
I see that "ADR_mesh_aligned_0" and its contents are written successfully; however, the next checkpoint directory cannot be created and the solver remains idle. The command I use is:
mpirun -np 8 -mca btl_tcp_if_include eth0 -hostfile hosts \
    $NEKTAR_BIN/ADRSolver ADR_mesh_aligned.xml ADR_conditions.xml
I tested a simple hostname command and an MPI parallel file write; both worked fine on two nodes with mpirun. Any suggestions are highly appreciated. Thank you.

// Fatih
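For reference, the two-node hostname sanity check mentioned above amounts to something along these lines (reusing the same hostfile and interface flag; a sketch rather than the exact command that was run):

mpirun -np 8 -mca btl_tcp_if_include eth0 -hostfile hosts hostname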
Maybe I can ask another question, about the parallel I/O implementation. Is it possible to choose the I/O ranks when running on multiple nodes? For instance, can I restrict file reads and writes to a single processor when running in an HPC environment?

// Fatih

On Fri, Jan 25, 2019 at 2:14 PM Fatih Ertinaz <fertinaz@gmail.com> wrote:
Hi Fatih,

Unfortunately, this is not currently possible with Nektar++. At each checkpoint, every rank either writes its own file containing its portion of the domain, or all ranks write concurrently to a single HDF5 file.

Cheers,
Chris

On Mon, 28 Jan 2019 12:00:30 -0500, Fatih Ertinaz <fertinaz@gmail.com> wrote:
--
Chris Cantwell
Imperial College London
South Kensington Campus
London SW7 2AZ
Email: c.cantwell@imperial.ac.uk
www.imperial.ac.uk/people/c.cantwell
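As a sketch of the second option: assuming the Nektar++ installation was built with HDF5 support, the shared-file output can typically be requested with the solver's --io-format option (check this against the user guide for the installed version), e.g.

mpirun -np 8 -mca btl_tcp_if_include eth0 -hostfile hosts \
    $NEKTAR_BIN/ADRSolver --io-format Hdf5 ADR_mesh_aligned.xml ADR_conditions.xml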
Hello Chris,

Thank you for your reply. I managed to resolve the problem by adding the "--mca btl tcp,self" flag to the MPI command. Additionally, NFS now sits on top of a GPFS instance, which definitely helped achieve faster I/O. Thanks for clarifying the parallel I/O approach as well; I guess using HDF5 would definitely help in this case.

// Fatih

On Fri, Feb 1, 2019 at 2:16 AM Chris Cantwell <c.cantwell@imperial.ac.uk> wrote:
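Combining this fix with the launch line from the original report, the working command would look roughly as follows (a sketch; the thread does not show the exact final command):

mpirun -np 8 --mca btl tcp,self --mca btl_tcp_if_include eth0 -hostfile hosts \
    $NEKTAR_BIN/ADRSolver ADR_mesh_aligned.xml ADR_conditions.xml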
Dear Fatih,

Would you be able to compile in debug mode, attach a debugger to one of the hanging instances, and send a backtrace? This would be helpful in diagnosing the problem. Please also confirm the exact version of Nektar++ you are using.

Cheers,
Chris

On Fri, 25 Jan 2019 14:14:44 -0500, Fatih Ertinaz <fertinaz@gmail.com> wrote:
--
Chris Cantwell
Imperial College London
South Kensington Campus
London SW7 2AZ
Email: c.cantwell@imperial.ac.uk
www.imperial.ac.uk/people/c.cantwell
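One way to produce such a backtrace (a sketch, assuming gdb is available on the compute nodes and Nektar++ was configured with -DCMAKE_BUILD_TYPE=Debug) is to log in to a node running a hung rank, attach to the process, and dump the stacks of all threads:

pgrep -af ADRSolver        # find the PID of a hanging ADRSolver rank on this node
gdb -p <PID>               # attach gdb to that process
(gdb) thread apply all bt  # print a backtrace for every thread
(gdb) detach
(gdb) quit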
participants (2)
- Chris Cantwell
- Fatih Ertinaz