Fwd: Installing Nektar++ 5.2.0 on a cluster
---------- Forwarded message ---------
From: Ehsan Asgari <eh.asgari@gmail.com>
Date: Tue, Mar 14, 2023 at 11:46 AM
Subject: Re: [Nektar-users] Installing Nektar++ 5.2.0 on a cluster
To: Slaughter, James W <j.slaughter19@imperial.ac.uk>

Hi James

Yes, I compiled without those flags for gcc. Are you saying this is the cause of the runtime difference I see? I can see there are avx2 and avx512 directories on the cluster. Do you expect that switching to those flags will bring a noticeable speed-up?

Here is the error I received when I tried to use 128 cores, each with 3 GB of RAM:

=======================================================================
	EquationType: UnsteadyNavierStokes
	Session Name: clusteredToGmsh
	Spatial Dim.: 3
	Max SEM Exp. Order: 4
	Num. Processes: 64
	Expansion Dim.: 3
	Projection Type: Continuous Galerkin
	Advection: explicit
	Diffusion: implicit
	Time Step: 0.0001
	No. of Steps: 500000
	Checkpoints (steps): 10
	Integration Type: IMEXOrder2
	Splitting Scheme: Velocity correction (strong press. form)
=======================================================================
Initial Conditions:
  - Field u: from file clusteredToGmsh.fld
  - Field v: from file clusteredToGmsh.fld
  - Field w: from file clusteredToGmsh.fld
  - Field p: from file clusteredToGmsh.fld
Writing: "clusteredToGmsh_0.chk" (1.18025s, XML)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 55 with PID 32156 on node cra710 exited on signal 9 (Killed).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)
Sometimes the error appears right at the start, before the setup banner is even printed.

The mesh issue has become a bottleneck: the cluster has put me in a queue for 2 hours of 128 cores with a dedicated 125 GB of RAM, and I have been waiting for a day so far. To give more context, I could be granted 24 hours of 128 cores with 3 GB of RAM per core almost instantly.

Kind regards
syavash

On Tue, Mar 14, 2023 at 1:47 AM Slaughter, James W <j.slaughter19@imperial.ac.uk> wrote:
Hi Syavash,
The compiler itself shouldn't make a massive difference. By the looks of it you were compiling with SSE under the chosen Intel compiler; I'm guessing you were then running on an Intel-based cluster? SSE and AVX will speed up your simulations non-trivially in that case. My suspicion is that you aren't passing these compiler flags when building with gcc.
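Concretely, something along these lines at the CMake configure step should pick the vector instructions up with gcc (a rough sketch rather than an official recipe; the build directory and the exact -m/-march flag are illustrative and should match the nodes you run on):

    cd nektar++/build
    cmake -DCMAKE_BUILD_TYPE=Release \
          -DCMAKE_CXX_FLAGS="-O3 -mavx2" \
          ..
    make -j 8

Swap -mavx2 for -mavx512f, or use -march=native when compiling on the compute nodes themselves, depending on which of those directories applies to the nodes you get.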
The mesh is a trickier one to troubleshoot. At 750k elements and P3, I'd suggest 3 GB per core won't be enough. When you hit the previous issue, do you know where in the simulation it failed, i.e. during setup, at the first time step, or a bit further into the run?
Kind regards,
James.
From: nektar-users-bounces@imperial.ac.uk <nektar-users-bounces@imperial.ac.uk> On Behalf Of Ehsan Asgari
Sent: Monday, March 13, 2023 7:57 PM
To: nektar-users <nektar-users@imperial.ac.uk>
Subject: Re: [Nektar-users] Installing Nektar++ 5.2.0 on a cluster
Thank you James
You are quite right. I managed to install it by totally unloading the intel-related modules and loading gcc instead. However, I found that gcc resulted in significantly slower simulations.
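Roughly, the steps I followed were along these lines (a sketch of my setup; the module names and build options are illustrative and will differ per cluster):

    module purge                                 # drop the Intel-related modules
    module load gcc/9.3.0 openmpi/4.0.3 cmake    # module names as on my cluster
    rm -rf build && mkdir build && cd build      # clean build dir, so no Intel-built objects are reused
    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ ..
    make -j 8 && make install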
Regarding the mesh problem, I realised that Nektar is quite demanding in terms of memory, so I had to allocate a large amount of RAM for the bigger mesh (the 750K-cell one I sent you). In other words, the minimum 3 GB of RAM per core was not enough to partition the mesh and start the solution. This can be restrictive, as resources may not always be available as whole nodes on clusters.
For now I am doing some tests to see if 128 cores with 3GB RAM per core can do the trick.
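For reference, the two kinds of request typically look something like this in a submission script (a sketch written for a SLURM-style scheduler, which may not match the batch system here; core counts, memory sizes and time limits are illustrative):

    # Request A: memory per core, granted almost instantly (24 h limit)
    #SBATCH --ntasks=128
    #SBATCH --mem-per-cpu=3G
    #SBATCH --time=24:00:00

    # Request B: dedicated whole-node memory, long queueing time (2 h limit)
    # #SBATCH --ntasks=128
    # #SBATCH --mem=125G
    # #SBATCH --time=02:00:00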
Please let me know if things can be improved.
Kind regards
syavash
On Mon, Mar 13, 2023, 22:02 Slaughter, James W <j.slaughter19@imperial.ac.uk> wrote:
Hi Syavash,
Can you run ldd on the IncNavierStokesSolver executable, put module list as an extra line in your submission script before the solver execution, and then run again and send through the output?
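i.e. something along these lines just before the solver line (a sketch; the session file name and core count are placeholders for whatever you normally run):

    module list
    ldd $(which IncNavierStokesSolver)
    mpirun -np 32 IncNavierStokesSolver clusteredToGmsh.xml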
It looks like you've compiled at least SpatialDomains with an Intel-specific instruction set, and those Intel runtime symbols can't be resolved when the rest of the build links against it without the Intel runtime.
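If you want to confirm that directly, running something like the following against the library named in the error message (from wherever it sits in your build tree) will list any Intel runtime symbols it still references:

    nm -D libSpatialDomains.so.5.3.0 | grep -i intel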
Kind regards,
James.
From: nektar-users-bounces@imperial.ac.uk <nektar-users-bounces@imperial.ac.uk> On Behalf Of Ehsan Asgari
Sent: Saturday, March 11, 2023 8:32 AM
To: nektar-users <nektar-users@imperial.ac.uk>
Subject: [Nektar-users] Installing Nektar++ 5.2.0 on a cluster
Hi Parv
Thank you for your kind response.
I finally managed to install version 5.3 using the Intel compiler and OpenMPI 4.0.3. However, I am still getting MPI-related failures when running a medium-sized mesh with 760K cells (a smaller mesh runs fine in parallel):
=======================================================================
	EquationType: UnsteadyNavierStokes
	Session Name: clusteredToGmsh
	Spatial Dim.: 3
	Max SEM Exp. Order: 4
	Num. Processes: 32
	Expansion Dim.: 3
	Projection Type: Continuous Galerkin
	Advect. advancement: explicit
	Diffuse. advancement: implicit
	Time Step: 0.0001
	No. of Steps: 500000
	Checkpoints (steps): 10000
	Integration Type: IMEX
	Splitting Scheme: Velocity correction (strong press. form)
=======================================================================
Initial Conditions:
  - Field u: 1.0
  - Field v: 0.0
  - Field w: 0.0
  - Field p: 0.0
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cra1080 exited on signal 9 (Killed).
I suspected it might be a problem with the Intel compiler, so I switched to GCC 9.3 for a fresh installation. But it seems GCC is causing problems of its own, and I get the following error at some point during the build:
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '__intel_sse2_strcpy'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memmove'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '__intel_sse2_strlen'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memcpy'
../../SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memset'
collect2: error: ld returned 1 exit status
make[2]: *** [library/Demos/SpatialDomains/CMakeFiles/PartitionAnalyse.dir/build.make:127: library/Demos/SpatialDomains/PartitionAnalyse] Error 1
make[1]: *** [CMakeFiles/Makefile2:3681: library/Demos/SpatialDomains/CMakeFiles/PartitionAnalyse.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
I had discussed the mesh problem with James Slaughter prior to this, and he suggested it might be due to a bug in the older version (5.0.3) I was working with. That is why I decided to go with the most recent version.
In any case, I am still struggling with my parallel simulations!
Kind regards
syavash
On Thu, Mar 9, 2023 at 4:57 PM Khurana, Parv <p.khurana22@imperial.ac.uk> wrote:
Hi Syavash,
A few questions come to mind on seeing this:
1. What compilers are you using (GCC or Intel)?
2. Are you loading the modules which are compatible with the compiler you are using?
3. Do you have a version of OpenBLAS or MKL already loaded as one of the modules on your cluster?
As is often the case, the problem might be more involved, and it would be great to see the modules and cmake commands you are using for your installation in order to debug this properly. Happy to hop on a call if needed!
Best
Parv
From: nektar-users-bounces@imperial.ac.uk <nektar-users-bounces@imperial.ac.uk> On Behalf Of Ehsan Asgari
Sent: 09 March 2023 09:54
To: nektar-users <nektar-users@imperial.ac.uk>
Subject: [Nektar-users] Installing Nektar++ 5.2.0 on a cluster
Hi Everyone,
I am trying to install the latest version of Nektar on a cluster. However, I get the following error at some point:
../../library/SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '__intel_sse2_strlen'
../../library/SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memcpy'
../../library/SpatialDomains/libSpatialDomains.so.5.3.0: error: undefined reference to '_intel_fast_memset'
collect2: error: ld returned 1 exit status
make[2]: *** [utilities/NekMesh/CMakeFiles/NekMesh.dir/build.make:106: utilities/NekMesh/NekMesh] Error 1
make[1]: *** [CMakeFiles/Makefile2:1647: utilities/NekMesh/CMakeFiles/NekMesh.dir/all] Error 2
make: *** [Makefile:141: all] Error 2
I had "NEKTAR_USE_SYSTEM_BLAS_LAPACK:BOOL=ON " and "THIRDPARTY_BUILD_BLAS_LAPACK:BOOL=ON" in the ccmake as per suggested in the user archives.
I appreciate your kind help.
Kind regards
syavash