EPOCH crashed in the Virgo CentOS container due to an InfiniBand connection problem
This is not an issue with the EPOCH installation but with the Virgo cluster itself. You may run into it as well, which is why I am posting it here first.
It seems that the InfiniBand connection gets disrupted on some Virgo nodes.
This happened after a long-running TNSA simulation.
This causes the EPOCH MPI processes to crash with ERROR_FATAL.
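If you want to check whether a node is affected before submitting a long run, one option is to look at the port state reported by `ibstat` (from the infiniband-diags package). The snippet below is only a rough sketch: it assumes `ibstat` is available on the node or inside the container, which I have not verified on Virgo, and the adapter names and the `SAMPLE` text are made up for illustration. The helper `inactive_ports` is hypothetical and simply parses the text output.

```python
import re
from typing import List, Tuple

# Made-up ibstat-style output for illustration only (not taken from Virgo).
SAMPLE = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
    Port 1:
        State: Down
        Physical state: Polling
"""


def inactive_ports(ibstat_output: str) -> List[Tuple[str, str, str]]:
    """Return (adapter, port, state) for every port whose State is not Active."""
    problems = []
    adapter = None
    port = None
    for raw in ibstat_output.splitlines():
        line = raw.strip()
        m = re.match(r"CA '(.+)'", line)
        if m:
            adapter = m.group(1)
            continue
        m = re.match(r"Port (\d+):", line)
        if m:
            port = m.group(1)
            continue
        # "Physical state:" lines do not match because match() anchors at the start.
        m = re.match(r"State:\s*(\S+)", line)
        if m and m.group(1) != "Active":
            problems.append((adapter, port, m.group(1)))
    return problems


if __name__ == "__main__":
    for adapter, port, state in inactive_ports(SAMPLE):
        print(f"{adapter} port {port} is {state}")
```

On a real node you could pass in the output of `subprocess.run(["ibstat"], capture_output=True, text=True).stdout` instead of `SAMPLE`. An empty result would only mean the ports look active at that moment; it would not rule out a link that drops later during the run.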
I sent a ticket to the Virgo admins about this issue:
issue_with_IB