Current status: the upgrade does not work automatically (e.g. via cinc-solo) on a previously installed execution node (Virgo 2.1), because the new slurmd RPM package tries to write configuration files under /etc/slurm, which is exported to all nodes via NFS and is therefore not writable on the client side.
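For reference, the offending files can be listed straight from the RPM metadata; a quick check along these lines shows what the package wants to place under /etc/slurm (the package name slurm-slurmd is an assumption, adjust to the actual RPM name):

```
# everything the package would put under /etc/slurm
rpm -ql slurm-slurmd | grep '^/etc/slurm'
# only the files marked %config in the spec
rpm -qc slurm-slurmd
```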
Our proposal: would it be possible to have a slurmd RPM that stores its configuration files under /usr/share/slurmd/... as examples, instead of /etc/slurm? @d.klein do you think it is possible to rebuild the slurmd package and, if yes, how long do you estimate it would take to make such a change?
Our proposal: would it be possible to have a slurmd RPM that stores its configuration files under /usr/share/slurmd/... as examples, instead of /etc/slurm?
Yes, sounds good. I also looked for the possibility of an install-time parameter, but this does not exist, so we need to move them unconditionally to a different path (or skip them), as you suggested.
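To illustrate, a minimal sketch of how the spec file change could look, using the standard %{_datadir}/%{_sysconfdir} macros; the actual sections and file names in the real slurm.spec will differ:

```
# hypothetical excerpt from a rebuilt slurm.spec
%install
# ...existing install steps...
# ship the default configs as examples instead of owning /etc/slurm
mkdir -p %{buildroot}%{_datadir}/slurm/examples
mv %{buildroot}%{_sysconfdir}/slurm/*.conf %{buildroot}%{_datadir}/slurm/examples/

%files
%{_datadir}/slurm/examples/
# note: no %dir %{_sysconfdir}/slurm and no %config entries any more
```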
@d.klein do you think it is possible to rebuild the slurmd package and, if yes, how long do you estimate it would take to make such a change?
I am on it, give me an hour or so. This change will also affect the configuration of slurm-singularity-exec, since it depends on the /etc/slurm/plugstack.conf.d directory being created by the slurm package. This will no longer happen, so slurm-singularity-exec can no longer install its config there either.
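For context, the coupling works via Slurm's SPANK plugstack mechanism: a site plugstack.conf can include drop-in files from a directory, and slurm-singularity-exec ships its config as such a drop-in. A sketch (the exact include line and drop-in file name are assumptions):

```
# /etc/slurm/plugstack.conf (site-maintained)
# pulls in drop-ins from the directory the slurm package used to create:
include /etc/slurm/plugstack.conf.d/*.conf
# slurm-singularity-exec installs its drop-in (e.g. singularity-exec.conf)
# into that directory, so without the packaged directory the file has
# nowhere to be installed
```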
Obviously the slurmctld and slurmdbd packages should not install configuration files or modify anything in /etc/slurm, as that would overwrite the configuration maintained in virgo-3/slurm-config.
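A simple way to verify this for a rebuilt package before rolling it out (a sketch; rpm -qlp inspects the not-yet-installed .rpm file):

```
# should print "clean" if the package keeps its hands off /etc/slurm
rpm -qlp slurm-slurmctld-23.11.1-*.el8.x86_64.rpm | grep '^/etc/slurm' || echo "clean"
```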
```
# ...after an automatic package upgrade this morning
[root@lxrm10 slurm]# grep -i upgraded /var/log/dnf.log
2024-01-31T06:24:19+0100 DEBUG Upgraded: munge-0.5.15-6.el8.x86_64
2024-01-31T06:24:19+0100 DEBUG Upgraded: munge-libs-0.5.15-6.el8.x86_64
2024-01-31T06:24:19+0100 DEBUG Upgraded: slurm-23.11.1-3.el8.x86_64
2024-01-31T06:24:19+0100 DEBUG Upgraded: slurm-libs-23.11.1-3.el8.x86_64
2024-01-31T06:24:19+0100 DEBUG Upgraded: slurm-slurmctld-23.11.1-3.el8.x86_64
2024-01-31T06:24:19+0100 DEBUG Upgraded: slurm-slurmdbd-23.11.1-3.el8.x86_64

# ...restart was triggered by the package
[root@lxrm10 slurm]# systemctl status slurmdbd
● slurmdbd.service - Slurm DBD accounting daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmdbd.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2024-01-31 06:24:19 CET; 59min ago
Condition: start condition failed at Wed 2024-01-31 07:18:11 CET; 5min ago
           └─ ConditionPathExists=/etc/slurm/slurmdbd.conf was not met
#...

# ...slurmdbd is dead ...since the configuration is missing
[root@lxrm10 slurm]# ls /etc/slurm/slurmdbd*
/etc/slurm/slurmdbd.conf.rpmsave
```
The existing configuration files are moved to *.rpmsave, which makes a restart of the service impossible.
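On the affected host this should be recoverable by hand, since the content was preserved; a one-off fix along these lines (assuming the saved file is still current):

```
# restore the preserved config so ConditionPathExists is satisfied again
mv /etc/slurm/slurmdbd.conf.rpmsave /etc/slurm/slurmdbd.conf
systemctl start slurmdbd
```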
I'm still unsure about automatic restarts, which you mentioned in another comment below. Depending on the situation, restarting slurmctld could render the cluster unresponsive for all users. Personally I would propose to NOT automatically restart slurmctld and slurmdbd, but let's see what the others think about that...
No config (including for the slurmctld/dbd packages) will be written to /etc/slurm any more by the updated packages. The *.rpmsave files must be a one-time effect of uninstalling the old package, which used to own those files via %config(noreplace).
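For the record, this is standard rpm behaviour; a sketch of how the old spec presumably declared the file:

```
# old slurm-slurmdbd spec (presumed): the file was owned like this
%config(noreplace) %{_sysconfdir}/slurm/slurmdbd.conf
# -> when the owning package is removed (or replaced by one that no longer
#    ships the file), a locally modified copy is renamed to *.rpmsave
#    instead of being deleted -- hence the one-time effect above
```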
Regarding the triggered restart, I can easily remove this behaviour. Linux keeps the old binaries alive as long as they are in use by the still-running services (even though they are no longer reachable via the filesystem once the new package is installed).
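A sketch of the scriptlet side of that change, assuming the package uses the standard systemd rpm macros (the actual slurm.spec may wire this differently):

```
%postun
# current behaviour: restarts the unit on upgrade
#%systemd_postun_with_restart slurmctld.service
# proposed: no automatic restart, only a daemon-reload
%systemd_postun slurmctld.service
```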
...after an automatic package upgrade this morning
This is of course now in our hands to coordinate better as well. E.g. I could publish updates only to the repo-history git repo, and you decide when to copy them to cluster-mirror, thereby controlling when a restart would be triggered. In addition, we could remove the triggered restart altogether, so you can trigger it via config management or manually.
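With the automatic restart removed, the restart becomes an explicit, operator-controlled step, e.g. (a sketch; try-restart only acts on units that are already running):

```
# after confirming the config in /etc/slurm is in place:
systemctl try-restart slurmdbd
systemctl try-restart slurmctld
```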
I just pushed new el8 packages to http://cluster-mirror.hpc.gsi.de/packages/virgo-3/el8/. The full diff to the previous repo state is repo-history@33080cf1.
I modified the slurm and slurm-singularity-exec packages so that they no longer put their configuration into /etc/slurm (the directory is not even created), but into /usr/share/slurm. However, the configuration and service files still assume the Slurm config to be present in /etc/slurm as before, which means you have to deploy it at that location before any services can be started.
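In practice that means something like the following on a controller host (a sketch; the source path into virgo-3/slurm-config and the ownership/mode are assumptions):

```
# see which example configs the rebuilt package ships
rpm -ql slurm | grep '^/usr/share/slurm'
# deploy the maintained config before starting any service
install -D -m 0600 -o slurm -g slurm \
    /path/to/virgo-3/slurm-config/slurmdbd.conf /etc/slurm/slurmdbd.conf
systemctl start slurmdbd
```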
In case you are wondering why the munge package is updated as well: it has not changed, except that the build version number is now rendered correctly. I only noticed this bug after the first repo build was published; see builder-virgo#1 (closed) for details. This is also the reason why it took a bit longer, sorry: I first tried to get rpmautospec working on epel8, but it needs too many changes in the code, so I got stuck.
If you update the slurmctld/dbd packages, note that they will currently restart the services. I would be interested in your feedback on this.
Now also updating the el9 packages to match the new behaviour regarding /etc/slurm. This should finish some time later today.
I just pushed new el8 packages to http://cluster-mirror.hpc.gsi.de/packages/virgo-3/el8/. The full diff to the previous repo state is repo-history@33080cf1.
I can confirm that the slurmd package can be installed without modifying /etc/slurm. Thanks!