Unable to Restart Simulation on Different Cluster
Hello Nektar,

I am unable to restart a simulation after moving the case files to a new cluster. I began the simulation on one cluster and ran it to the end of the transient flow stage. Once finished, I took the final .fld file, put it on another cluster, and tried to run the simulation from that time step. My expansion section is

    <EXPANSIONS>
        <F VAR="u,v,w,p" FILE="session.fld" />
    </EXPANSIONS>

and my initial conditions are

    <FUNCTION NAME="InitialConditions">
        <F VAR="u,v,w,p" FILE="session.fld" />
    </FUNCTION>

When I try to start the simulation on the new cluster, I get the same CG iterations problem every time (see the solver output at the bottom of this email). The new cluster uses different CPUs, so I thought this could have something to do with it, but I still get the same problem when I try different CPUs. The only thing I tried which had any effect was changing the time step size. This should not have been necessary, because the step I was using already kept the CFL low (~0.5) during the transient flow. Decreasing the CFL only lowered the error reported on the "CG iterations made..." line displayed before the Level 0 assertion violation, and not by much.

I would be grateful for your assistance.

Regards,
Isaac

=======================================================================
        EquationType: UnsteadyNavierStokes
        Session Name: session
        Spatial Dim.: 3
        Max SEM Exp. Order: 5
        Num. Processes: 208
        Expansion Dim.: 3
        Projection Type: Continuous Galerkin
        Advect. advancement: explicit
        Diffuse. advancement: implicit
        Time Step: 0.004
        No. of Steps: 187500
        Checkpoints (steps): 63
        Integration Type: IMEX
        Splitting Scheme: Velocity correction (strong press. form)
        Dealiasing: spectral/hp
        Smoothing-SpecHP: SVV (spectral/hp DG Kernel (diff coeff = 1*Uh/p))
=======================================================================
Initial Conditions:
  - Field u: from file session.fld
  - Field v: from file session.fld
  - Field w: from file session.fld
  - Field p: from file session.fld
CG iterations made = 5001 using tolerance of 1e-09 (error = 9.57894e-07, rhs_mag = 15.6227)
Fatal   : Level 0 assertion violation
Exceeded maximum number of iterations
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[2602,1],4]
  Exit code: 1
--------------------------------------------------------------------------
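A quick sanity check in a situation like this, assuming the FieldConvert utility from the Nektar++ build on the new cluster is available, is to confirm that the transferred field file can be read and converted at all before attempting a restart; this is only a sketch of the idea, not a fix:

    # Read the restart field with the new cluster's build and convert it
    # for visualisation; if this fails, or the fields look corrupted when
    # inspected, the transfer or the build is suspect rather than the solver.
    FieldConvert session.xml session.fld session_check.vtu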
Hi Isaac,

Did you compile the Nektar++ installations on the two clusters yourself? I would assume it's important to ensure that both are built with the same set of options and dependencies. If not, this could result in some differences - e.g. are both installations compiled either with, or without, FFTW or other optional dependencies?

I'm not an expert on the mathematical aspects of the problem, but I suspect that this is likely to be related to some difference in the builds of Nektar++. This is just an initial thought; maybe a member of the community with more knowledge of the specifics of the problem you're solving can offer other insights.

Kind regards,
Jeremy
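One way to make the two builds comparable, as a minimal sketch assuming a standard out-of-source CMake build (the source path here is a placeholder), is to pass the same feature flags on both clusters:

    # Hypothetical configure line; the point is that both clusters
    # should enable/disable the same optional dependencies.
    cmake -DNEKTAR_USE_MPI=ON \
          -DNEKTAR_USE_FFTW=ON \
          -DNEKTAR_USE_HDF5=ON \
          -DNEKTAR_BUILD_SOLVERS=ON \
          /path/to/nektar++-source

Comparing the CMakeCache.txt (or the ccmake view) from the first cluster against the second is a quick way to spot any option that differs.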
Hi Jeremy,

I compiled the installation myself on the first cluster but not on the second. Following your advice, I tried to compile it on the second cluster myself, but I was unable to get exactly the same configuration, since the second cluster has different modules installed. Using the new installation, I still get the same issue.

I did find a workaround which feels wrong but is producing results that seem OK: on the new cluster I turned the time step down to a very small value, ran the simulation for a few steps, saved a field, and used this field as the new restart file. With this I was able to turn the step size back up to the previous value, and the simulation has been running as if nothing was wrong. It's not a perfect solution, but it works.

Thank you for your response, Jeremy.

All the best,
Isaac
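Expressed as the session-file change it implies, the workaround might look like the sketch below; the reduced value is illustrative only (the original run used TimeStep = 0.004):

    <PARAMETERS>
        <!-- Temporarily reduced for the first few restart steps;
             restore the original value afterwards. -->
        <P> TimeStep = 0.0001 </P>
        <P> NumSteps = 100    </P>
    </PARAMETERS>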
Dear Isaac,

Another workaround might be the following. Make a new case directory; let's call it transfer/. Copy mesh.xml, session.xml, and restart.fld into transfer/. Assuming you are running in parallel, the checkpoints are saved as directories rather than single files; running

    FieldConvert mesh.xml final.fld/ restart.fld

in your original case directory gives you a single file instead of a directory. Then compress your transfer/ directory and move it to the second cluster, or just copy it across. Once you have done this, change only the initial conditions in your session.xml so that they read from restart.fld; all the other settings stay the same as in your old simulation.

This should work on any machine, unless you are using a specific setting that the version on that machine was not compiled with. Hope this helps.

Kind regards,
Ilteber

--
İlteber R. Özdemir
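Spelled out as commands, the suggested workflow might look like this; the host name and target path are placeholders:

    # In the original case directory: consolidate the parallel
    # checkpoint directory into a single restart file.
    FieldConvert mesh.xml final.fld/ restart.fld

    # Stage a clean case directory and ship it to the new cluster.
    mkdir transfer
    cp mesh.xml session.xml restart.fld transfer/
    tar czf transfer.tar.gz transfer/
    scp transfer.tar.gz user@new-cluster:/path/to/case/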
Hi Ilteber,

Thanks for your input on this. I'd be interested to hear whether your workaround resolves the issue Isaac is having. If not, I'm guessing it must be some difference in the dependencies compiled into the code on the different machines? Presumably, if a specific module named in the session file were missing, the computation wouldn't even get to the point of starting, because the relevant module couldn't be instantiated when reading the session file. It looks like the session file is being processed and the computation is being set up successfully; I think it's just failing when it runs (i.e. it's exceeding the maximum number of iterations and not converging).

Thanks,
Jeremy
Hi Isaac,

A quick and "dirty" trick that I use when running simulations with the incompressible solver and restarting from another field is to set the pressure field initialisation to 0 and use the initialisation file only for the velocity components. It might work for your case as well.

Kind regards,
Alexandra
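In session-file terms, this trick might look like the sketch below, reusing the file name from Isaac's setup: the velocity components are read from the restart file while the pressure is initialised as an expression:

    <FUNCTION NAME="InitialConditions">
        <!-- Load only the velocity components from the restart file. -->
        <F VAR="u,v,w" FILE="session.fld" />
        <!-- Initialise the pressure to zero instead of reading it. -->
        <E VAR="p" VALUE="0" />
    </FUNCTION>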
Hi all,

Thank you for your input.

İlteber, it seems that having uncompressed (per-process directory) output from the original simulation is important for the method you described; please correct me if I am wrong. I am using the HDF5 output option, so I have only been getting one file per checkpoint, and I don't know whether this means I can or cannot use your method. Something I tried which may be similar was to copy the restart file onto the new cluster and run

    FieldConvert session.xml restart.fld newrestart.fld

with the Nektar++ installation on that cluster, but alas, no success.

Alexandra, that trick is intriguing, and I will try it. It's no dirtier than the workaround I used.

All the best,
Isaac
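If the HDF5 checkpoint itself turns out to be the obstacle, FieldConvert may also be able to rewrite the field file in the directory-based XML format before the transfer; treat the output-option syntax below as an assumption to be checked against the user guide rather than a verified command:

    # Assumed syntax: request XML-format output when rewriting the
    # HDF5 restart file on the original cluster.
    FieldConvert session.xml restart.fld restart_xml.fld:fld:format=Xml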
participants (4)

- Alexandra Liosi
- Isaac Rosin
- İlteber Özdemir
- Jeremy Cohen