Network disruption between MPI ranks
- Problem: during long simulation runs, some of the Artis MPI processes often crash with an error message from PMix such as:
slurmstepd: error: lxbk1040 [6] pmixp_client_v2.c:210 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 55246495.2 ON lxbk1034 CANCELLED AT 2022-06-13T09:38:17 ***
We need to investigate whether this is due to:
- a temporary network disruption
- a problem in the Artis program itself
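One possible first step is to tally the PMix error messages per node across the Slurm output files of several failed jobs. The sketch below is only an illustration: the regular expression is modelled on the single error line quoted above, and the names `PMIX_ERROR` and `count_pmix_errors` are hypothetical helpers, not part of Slurm, PMix, or Artis.

```python
import re
from collections import Counter

# Pattern modelled on the one PMix error line quoted above (assumption:
# other Slurm versions may format this message differently).
PMIX_ERROR = re.compile(
    r"slurmstepd: error: (?P<node>\S+) \[(?P<index>\d+)\] \S+:\d+ "
    r"\[_errhandler\] mpi/pmix: ERROR: Error handler invoked: "
    r"status = (?P<status>-?\d+): (?P<message>.*)"
)


def count_pmix_errors(lines):
    """Count PMix error-handler messages per (node, status) pair."""
    counts = Counter()
    for line in lines:
        match = PMIX_ERROR.search(line)
        if match:
            counts[(match["node"], int(match["status"]))] += 1
    return counts


# The three log lines from the report above, used as sample input.
sample = [
    "slurmstepd: error: lxbk1040 [6] pmixp_client_v2.c:210 [_errhandler] "
    "mpi/pmix: ERROR: Error handler invoked: status = -25: "
    "Interrupted system call (4)",
    "srun: Job step aborted: Waiting up to 32 seconds for job step to finish.",
    "slurmstepd: error: *** STEP 55246495.2 ON lxbk1034 CANCELLED AT "
    "2022-06-13T09:38:17 ***",
]

for (node, status), n in count_pmix_errors(sample).items():
    print(node, status, n)  # prints "lxbk1040 -25 1"
```

If the counts from many jobs cluster on a few nodes, that may point towards a network or hardware issue. An even spread over all nodes would be more consistent with a problem in the program. Neither pattern is conclusive on its own.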