=======================================================================
EquationType: UnsteadyNavierStokes
Session Name: clusteredToGmsh
Spatial Dim.: 3
Max SEM Exp. Order: 4
Num. Processes: 64
Expansion Dim.: 3
Projection Type: Continuous Galerkin
Advection: explicit
Diffusion: implicit
Time Step: 0.0001
No. of Steps: 500000
Checkpoints (steps): 10
Integration Type: IMEXOrder2
Splitting Scheme: Velocity correction (strong press. form)
=======================================================================
Initial Conditions:
- Field u: from file clusteredToGmsh.fld
- Field v: from file clusteredToGmsh.fld
- Field w: from file clusteredToGmsh.fld
- Field p: from file clusteredToGmsh.fld
Writing: "clusteredToGmsh_0.chk" (1.18025s, XML)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 55 with PID 32156 on node cra710 exited on signal 9 (Killed).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
Hi Syavash,
The compiler itself shouldn’t make a massive difference. By the looks of it you were compiling with SSE with the chosen Intel compiler, and I’m guessing you were then running on an Intel-based cluster? SSE and AVX will speed up your simulations non-trivially in that case. My suspicion would be that you aren’t compiling with these instruction-set flags under gcc.
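For example, something along these lines when configuring the gcc build (only a sketch; -march=native should enable whatever SSE/AVX the compute nodes support, but check what flags your Intel build was using for a fair comparison):
    cmake -DCMAKE_BUILD_TYPE=Release \
          -DCMAKE_CXX_FLAGS="-O3 -march=native" \
          ..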
The mesh is again a trickier one to troubleshoot. 750k elements at P3, yes, I’d suggest 3 GB won’t be enough. When you had the previous issue, do you know where in the simulation it failed? I.e. in setup, at the first time step, or a bit further into the simulation?
Kind regards,
James.
From: nektar-users-bounces@imperial.ac.uk <nektar-users-bounces@imperial.ac.uk> On Behalf Of Ehsan Asgari
Sent: Monday, March 13, 2023 7:57 PM
To: nektar-users <nektar-users@imperial.ac.uk>
Subject: Re: [Nektar-users] Installing Nektar++ 5.2.0 on a cluster
Thank you James
You are quite right. I managed to install it by completely unloading the Intel-related modules and loading gcc instead. However, I found that gcc resulted in significantly slower simulations.
Regarding the mesh problem, I realized that Nektar is quite demanding in terms of memory, so I had to allocate a large amount of RAM for the larger mesh (the 750K-cell one I sent you). In other words, the minimum 3 GB of RAM was not enough to decompose the mesh and start the solution. This can be restrictive, as resources are not always available as whole nodes on clusters.
For now I am running some tests to see if 128 cores with 3 GB of RAM per core can do the trick.
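To be concrete, the kind of request I mean looks like this (I am assuming a SLURM scheduler here; the directive names differ on other schedulers, and the session file name is just a placeholder):
    #SBATCH --ntasks=128          # 128 MPI ranks
    #SBATCH --mem-per-cpu=3G      # 3 GB of RAM per rank
    mpirun -np 128 IncNavierStokesSolver session.xml   # session.xml is a placeholder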
Please let me know if things can be improved.
Kind regards
syavash
On Mon, Mar 13, 2023, 22:02 Slaughter, James W <j.slaughter19@imperial.ac.uk> wrote:
Hi Syavash,
Can you run ldd on the IncNavierStokesSolver binary, add module list as an extra line in your submission script before the solver execution, then run again and send through the output?
It looks like you’ve compiled at least SpatialDomains with an Intel-specific instruction set, and those symbols can’t be found at runtime.
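Concretely, something like the following just before the solver line (the install path is only a placeholder for wherever your build put the binary, and session.xml stands in for your session file):
    module list 2>&1                            # record exactly which modules are loaded at run time
    ldd /path/to/bin/IncNavierStokesSolver      # placeholder path; shows which shared libraries resolve
    mpirun IncNavierStokesSolver session.xml    # your usual solver line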
Kind regards,
James.
From: nektar-users-bounces@imperial.ac.uk <nektar-users-bounces@imperial.ac.uk> On Behalf Of Ehsan Asgari
Sent: Saturday, March 11, 2023 8:32 AM
To: nektar-users <nektar-users@imperial.ac.uk>
Subject: [Nektar-users] Installing Nektar++ 5.2.0 on a cluster
Hi Parv
Thank you for your kind response.
I managed to install version 5.3 using the Intel compiler and OpenMPI 4.0.3 at last. However, I am still getting MPI-related issues when running a medium-sized mesh with 760K cells (a smaller mesh runs successfully in parallel):
=======================================================================
EquationType: UnsteadyNavierStokes
Session Name: clusteredToGmsh
Spatial Dim.: 3
Max SEM Exp. Order: 4
Num. Processes: 32
Expansion Dim.: 3
Projection Type: Continuous Galerkin
Advect. advancement: explicit
Diffuse. advancement: implicit
Time Step: 0.0001
No. of Steps: 500000
Checkpoints (steps): 10000
Integration Type: IMEX
Splitting Scheme: Velocity correction (strong press. form)
=======================================================================
Initial Conditions:
- Field u: 1.0
- Field v: 0.0
- Field w: 0.0
- Field p: 0.0
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cra1080 exited on signal 9 (Killed).
I suspected that it might be a problem with the Intel compiler, so I switched to GCC 9.3 for a fresh installation. But it seems that GCC is causing problems and I get the following error at some point:
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '__intel_sse2_strcpy'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memmove'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '__intel_sse2_strlen'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memcpy'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memset'
collect2: error: ld returned 1 exit status
make[2]: *** [library/Demos/SpatialDomains/CMakeFiles/PartitionAnalyse.dir/build.make:127: library/Demos/SpatialDomains/PartitionAnalyse] Error 1
make[1]: *** [CMakeFiles/Makefile2:3681: library/Demos/SpatialDomains/CMakeFiles/PartitionAnalyse.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
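Since the cmake commands were asked for: my gcc configure step was roughly the following (module names and paths are approximate, from memory):
    module purge
    module load gcc/9.3.0 openmpi/4.0.3                        # approximate module names
    mkdir build && cd build
    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ ..
    make -j 8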
I discussed the mesh problem with James Slaughter prior to this, and he suggested that it might be due to a bug in the older version (5.0.3) I was working with. That is why I decided to move to the most recent version.
In any case, I am still struggling with my parallel simulations!
Kind regards
syavash
On Thu, Mar 9, 2023 at 4:57 PM Khurana, Parv <p.khurana22@imperial.ac.uk> wrote:
Hi Syavash,
A few questions come to mind on seeing this:
- What compilers are you using (GCC or Intel?)
- Are you loading the modules which are compatible with the compiler you are using?
- Do you have a version of OpenBLAS or MKL already loaded as one of the modules on your cluster?
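On the second and third questions, a quick way to check (assuming an Environment Modules or Lmod setup) would be:
    module list                                  # everything currently loaded
    module avail 2>&1 | grep -i -E 'blas|mkl'    # any OpenBLAS/MKL modules available on the system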
As is often the case, the problem might be more involved, and it would be great to see the modules and cmake commands you are using for your installation in order to debug this properly. Happy to hop on a call if needed!
Best
Parv
From: nektar-users-bounces@imperial.ac.uk <nektar-users-bounces@imperial.ac.uk> On Behalf Of Ehsan Asgari
Sent: 09 March 2023 09:54
To: nektar-users <nektar-users@imperial.ac.uk>
Subject: [Nektar-users] Installing Nektar++ 5.2.0 on a cluster
Hi Everyone,
I am trying to install the latest version of Nektar on a cluster. However, I get the following error at some point:
../../library/SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '__intel_sse2_strlen'
../../library/SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memcpy'
../../library/SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memset'
collect2: error: ld returned 1 exit status
make[2]: *** [utilities/NekMesh/CMakeFiles/NekMesh.dir/build.make:106: utilities/NekMesh/NekMesh] Error 1
make[1]: *** [CMakeFiles/Makefile2:1647: utilities/NekMesh/CMakeFiles/NekMesh.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
I had "NEKTAR_USE_SYSTEM_BLAS_LAPACK:BOOL=ON " and "THIRDPARTY_BUILD_BLAS_LAPACK:BOOL=ON" in the ccmake as per suggested in the user archives.
I appreciate your kind help.
Kind regards
syavash