Dear Dr. Cantwell,
I installed Nektar++ on the cluster with the NEKTAR_USE_ACML option switched ON. There are 92 nodes, with 24 processors on each node. When I use 1 or 2 processors on one node, the analysis runs. However, when I increase the number of processors to 4 on one node, I get the error.
Regards, Kamil
On 02.12.2014 23:43, Chris Cantwell wrote:
Dear Kamil,
Your problem sounds like it is specific to the cluster you are using, or to the use of ACML.
Do you use ACML on your workstation where KovaFlow_m8.xml ran successfully using mpirun? How many cores was this on, and how many were you using on the cluster?
We will need to see a backtrace at the point when the segmentation fault occurs to be able to diagnose what is going wrong and help further. How you do this will depend on what debugging software is available on your cluster. Your system administrator should be able to help you with this.
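For example, if gdb is available on the compute nodes, something along the following lines should produce a backtrace (a minimal sketch only; the core file name is a placeholder and any module-loading steps will depend on your system):

    ulimit -c unlimited                         # allow a core file to be written when the crash occurs
    mpirun -np 4 ./IncNavierStokesSolver KovaFlow_m8.xml
    # after the segmentation fault, print the backtrace recorded in the core file
    gdb --batch -ex bt ./IncNavierStokesSolver core.<pid>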
Cheers, Chris
On 02/12/14 21:27, Kamil ÖZDEN wrote:
Dear Dr. Cantwell,
The latest situation with the Nektar++ installation on the cluster with ACML is that the KovaFlow_m8.xml analysis runs with 1 and 2 processors, but when I try to run it with 4 processors I get the segmentation fault error.
Regards, Kamil
On 02.12.2014 14:58, Kamil Ozden wrote:
Dear Dr. Cantwell,
As additional information, I want to state that the KovaFlow_m8.xml analysis runs from the command line using the mpirun command, but it does not run when submitted to the cluster with a script, giving the error below:

    mpirun noticed that process rank 2 with PID 32190 on node mercan155.yonetim exited on signal 11 (Segmentation fault).

Is there any option that needs to be changed in the Nektar++ configuration to run the analysis on the cluster?
NOTE: I have used both the mpirun and mpiexec commands in the script, but I get the same error. If you want, I can also send the script to you.
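In outline, the script is of the following form and is submitted with sbatch (a minimal sketch; the job name, partition name and resource values below are placeholders rather than the exact script):

    #!/bin/bash
    #SBATCH --job-name=KovaFlow_m8       # placeholder job name
    #SBATCH --nodes=1                    # single node
    #SBATCH --ntasks=4                   # number of MPI ranks
    #SBATCH --partition=partition_name   # placeholder partition name

    mpirun -np 4 ./IncNavierStokesSolver KovaFlow_m8.xml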
Regards, Kamil
On 01-12-2014 23:26, Kamil ÖZDEN wrote:
Dear Dr. Cantwell,
I tried to run the Nektar++ test file KovaFlow_m8.xml via a script file and got the same segmentation fault error.
Then I copied the same file to the directory nektar++-4.0.0/build/solvers/IncNavierStokesSolver/ and tried to run it from the command line by typing the command

    ./IncNavierStokesSolver KovaFlow_m8.xml

but I got the following error:

    ./IncNavierStokesSolver: error while loading shared libraries: libacml_mv.so: cannot open shared object file: No such file or directory
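From what I understand, this loader error means the directory containing the ACML shared libraries is not on LD_LIBRARY_PATH at run time. The workaround I intend to try is along these lines (a sketch only, assuming libacml_mv.so sits in the same lib directory as the libacml.so used in my CMake configuration):

    # assumption: libacml_mv.so lives alongside libacml.so in the ACML lib directory
    export LD_LIBRARY_PATH=/truba/sw/centos6.4/lib/acml/4.4.0/gfortran64/lib:$LD_LIBRARY_PATH
    ./IncNavierStokesSolver KovaFlow_m8.xml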
Regards, Kamil
On 01.12.2014 22:42, Chris Cantwell wrote:
Dear Kamil,
The first error is simply that more memory was needed than the amount you allocated to the job (as you probably realised). The second error is a segmentation fault.
Can you reproduce the problem using a (much) smaller job?
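If you do need a job of that size, the memory request in your SLURM script will also have to be raised; typically this is done with something like the following (the value is purely illustrative):

    #SBATCH --mem=32000     # per-node memory request in MB; illustrative value only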
Cheers, Chris
On 30/11/14 21:41, Kamil ÖZDEN wrote:
Dear Dr. Cantwell,
Thanks for your help. I'll try this and inform you about the result.
Meanwhile, I made another installation with ACML on the same cluster, with the following ACML and MPI configuration:
    ACML                       /truba/sw/centos6.4/lib/acml/4.4.0/gfortran64/lib/libacml.so
    ACML_INCLUDE_PATH          /truba/sw/centos6.4/lib/acml/4.4.0/gfortran64/include
    ACML_SEARCH_PATHS          /truba/sw/centos6.4/lib/acml/4.4.0/gfortran64/include
    ACML_USE_OPENMP_LIBRARIES  OFF
    ACML_USE_SHARED_LIBRARIES  ON

    MPIEXEC                    /usr/mpi/gcc/openmpi-1.6.5/bin/mpiexec
    MPIEXEC_MAX_NUMPROCS       2
    MPIEXEC_NUMPROC_FLAG       -np
    MPIEXEC_POSTFLAGS
    MPIEXEC_PREFLAGS
    MPI_CXX_COMPILER           /usr/mpi/gcc/openmpi-1.6.5/bin/mpicxx
    MPI_CXX_COMPILE_FLAGS
    MPI_CXX_INCLUDE_PATH       /usr/mpi/gcc/openmpi-1.6.5/include
    MPI_CXX_LIBRARIES          /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi_cxx.so;/usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so;/usr/lib64/libdl.so;/usr/lib64/libm.so;/usr/lib64/librt.so;/usr/lib64/libnsl.so;/usr/lib64/libutil.so;/usr/lib64/libm.so;/usr/lib64/libdl.so
    MPI_CXX_LINK_FLAGS         -Wl,--export-dynamic
    MPI_C_COMPILER             /usr/mpi/gcc/openmpi-1.6.5/bin/mpicc
    MPI_C_COMPILE_FLAGS
    MPI_C_INCLUDE_PATH         /usr/mpi/gcc/openmpi-1.6.5/include
    MPI_C_LIBRARIES            /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so;/usr/lib64/libdl.so;/usr/lib64/libm.so;/usr/lib64/librt.so;/usr/lib64/libnsl.so;/usr/lib64/libutil.so;/usr/lib64/libm.so;/usr/lib64/libdl.so
    MPI_C_LINK_FLAGS           -Wl,--export-dynamic
    MPI_EXTRA_LIBRARY          /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi.so;/usr/lib64/libdl.so;/usr/lib64/libm.so;/usr/lib64/librt.so;/usr/lib64/libnsl.so;/usr/lib64/libutil.so;/usr/lib64/libm.so;/usr/lib64/libdl.so
    MPI_LIBRARY                /usr/mpi/gcc/openmpi-1.6.5/lib64/libmpi_cxx.so
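For reference, these cache values correspond to a configure step roughly like the one below (the build directory and the exact set of flags are illustrative rather than the precise command I used):

    # illustrative configure command; paths and options other than NEKTAR_USE_ACML are assumptions
    cd nektar++-4.0.0/build
    cmake -DNEKTAR_USE_ACML=ON \
          -DNEKTAR_USE_MPI=ON \
          -DACML_SEARCH_PATHS=/truba/sw/centos6.4/lib/acml/4.4.0/gfortran64/include \
          ..
    make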
Nektar++ seems to be installed successfully. However, when I try to submit a job with a script, using the mpirun command, to the AMD processors of the cluster (the cluster uses the SLURM resource manager), I face the following issue.
When I tried to run with 4 processors, the initial conditions are read and the first .chk directory starts to be written, as seen below:
    =======================================================================
    EquationType: UnsteadyNavierStokes
    Session Name: Re_1_v2_N6
    Spatial Dim.: 3
    Max SEM Exp. Order: 7
    Expansion Dim.: 3
    Projection Type: Continuous Galerkin
    Advection: explicit
    Diffusion: explicit
    Time Step: 0.01
    No. of Steps: 300
    Checkpoints (steps): 30
    Integration Type: IMEXOrder1
    =======================================================================
    Initial Conditions:
    - Field u: 0
    - Field v: 0
    - Field w: 0.15625
    - Field p: 0
    Writing: Re_1_v2_N6_0.chk
But after that, the analysis ends with the error below:
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    slurmd[mercan115]: Job 405433 exceeded memory limit (22245156 > 20480000), being killed
    slurmd[mercan115]: Exceeded job memory limit
    slurmd[mercan115]: *** JOB 405433 CANCELLED AT 2014-11-30T23:15:28 ***
However, when I try to run the analysis with 8 processors, it ends immediately with the error below:
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    Warning: Conflicting CPU frequencies detected, using: 2300.000000.
    --------------------------------------------------------------------------
    mpirun noticed that process rank 2 with PID 24004 on node mercan146.yonetim exited on signal 11 (Segmentation fault).
What may be the reason for this problem?
Regards, Kamil
On 30.11.2014 13:08, Chris Cantwell wrote:
> Dear Kamil,
>
> This still seems to suggest that the version in your home directory is
> not compiled with -fPIC.
>
> Try deleting all library files (*.a) and all compiled object code (*.o)
> from within the LAPACK source tree and try compiling from fresh again.
> Also note that you need to add the -fPIC flag to both the OPTS and
> NOOPT variables in your LAPACK make.inc file (which presumably is what
> your system administrator altered).
>
> Cheers,
> Chris
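In concrete terms, the quoted advice amounts to roughly the following steps (the LAPACK source path and version are placeholders, and the OPTS/NOOPT lines are examples of the make.inc edit rather than the exact contents of that file):

    cd $HOME/lapack-3.x.y          # placeholder path to the LAPACK source tree
    find . -name '*.o' -delete     # remove previously compiled object code
    find . -name '*.a' -delete     # remove previously built static libraries
    # in make.inc, append -fPIC to both the OPTS and NOOPT variables, e.g.:
    #   OPTS  = -O2 -fPIC
    #   NOOPT = -O0 -fPIC
    make                           # rebuild LAPACK from scratch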