<!--#include virtual="header.txt"-->
<h1>Resource Limits</h1>
<p>SLURM scheduling policy support was significantly changed
in version 2.0 in order to take advantage of the database
integration used for storing accounting information.
This document describes the capabilities available in
SLURM version 2.0.
New features are under active development.
Familiarity with SLURM's <a href="accounting.html">Accounting</a> web page
is strongly recommended before using this document.</p>
<p>Note for users of Maui or Moab schedulers: <br>
Maui and Moab are not integrated with SLURM's resource limits;
their own resource limit mechanisms should be used instead.</p>
<h2>Configuration</h2>
<p>Scheduling policy information must be stored in a database
as specified by the <b>AccountingStorageType</b> configuration parameter
in the <b>slurm.conf</b> configuration file.
Information can be recorded in either <a href="http://www.mysql.com/">MySQL</a>
or <a href="http://www.postgresql.org/">PostgreSQL</a>.
For security and performance reasons, the use of
SlurmDBD (SLURM Database Daemon) as a front-end to the
database is strongly recommended.
SlurmDBD uses a SLURM authentication plugin (e.g. Munge).
SlurmDBD also uses an existing SLURM accounting storage plugin
to maximize code reuse.
SlurmDBD uses data caching and prioritization of pending requests
in order to optimize performance.
While SlurmDBD relies upon existing SLURM plugins for authentication
and database use, the other SLURM commands and daemons are not required
on the host where SlurmDBD is installed.
Only the <i>slurmdbd</i> and <i>slurm-plugins</i> RPMs are required
for SlurmDBD execution.</p>
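<p>For example, a minimal <b>slurm.conf</b> fragment selecting SlurmDBD
as the front-end might resemble the following sketch (the host name
"dbhost" is an illustrative assumption; 6819 is SlurmDBD's default
port):</p>
<pre>
# slurm.conf (illustrative values; adjust for your site)
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbhost
AccountingStoragePort=6819
</pre>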
<p>Both accounting and scheduling policy are configured based upon
an <i>association</i>. An <i>association</i> is a 4-tuple consisting
of the cluster name, bank account, user and (optionally) the SLURM
partition.
To enforce scheduling policy, set the value of
<b>AccountingStorageEnforce</b>.
This option contains a comma-separated list of options you may want to
enforce. The valid options are:
<ul>
<li>associations - This will prevent users from running jobs if
their <i>association</i> is not in the database. This option will
prevent users from accessing invalid accounts.
</li>
<li>limits - This will enforce limits set on associations. By setting
this option, the 'associations' option is also set.
</li>
<li>qos - This will require all jobs to specify (either overtly or by
default) a valid qos (Quality of Service). QOS values are defined for
each association in the database. By setting this option, the
'associations' option is also set.
</li>
<li>wckeys - This will prevent users from running jobs under a wckey
that they don't have access to. By using this option, the
'associations' option is also set. The 'TrackWCKey' option is also
set to true.
</li>
</ul>
(NOTE: The association is a combination of cluster, account,
user names and optional partition name.)
<br>
Without AccountingStorageEnforce being set (the default behavior)
jobs will be executed based upon policies configured in SLURM on each
cluster.
<br>
It is advisable to run without the 'limits' option set when running a
scheduler on top of SLURM, such as Moab, that does not update its
per-association limits in real time. A configuration sketch follows.
</p>
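<p>As a sketch, enforcing limits and QOS use (each of which implies
the 'associations' option) might look like:</p>
<pre>
# slurm.conf
AccountingStorageEnforce=limits,qos
</pre>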
<h2>Tools</h2>
<p>The tool used to manage accounting policy is <i>sacctmgr</i>.
It can be used to create and delete cluster, user, bank account,
and partition records plus their combined <i>association</i> record.
See <i>man sacctmgr</i> for details on this tool and examples of
its use.</p>
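<p>For instance, the records making up an association might be created
as follows (the cluster, account, and user names are hypothetical):</p>
<pre>
# Create a cluster, a bank account on it, and a user in that account
sacctmgr add cluster tux
sacctmgr add account physics cluster=tux \
    description="physics department" organization="science"
sacctmgr add user brian account=physics cluster=tux
</pre>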
<p>A web interface with graphical output is currently under development.</p>
<p>Changes made to the scheduling policy are uploaded to
the SLURM control daemons on the various clusters and take effect
immediately. When an association is deleted, all running or
pending jobs which belong to that association are immediately canceled.
When limits are lowered, running jobs will not be canceled to
satisfy the new limits, but the new lower limits will be enforced.</p>
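<p>As an illustration (names hypothetical), deleting a user's
association immediately cancels that user's jobs belonging to it:</p>
<pre>
# Remove brian's association with the physics account; his running
# and pending jobs under that association are canceled immediately
sacctmgr delete user name=brian account=physics
</pre>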
<h2>Policies supported</h2>
<p> A limited subset of scheduling policy options are currently
supported.
The available options are expected to increase as development
continues.
Most of these scheduling policy options are available not only
for a user association, but also for each cluster and account.
If a new association is created for a user and a scheduling
policy option is not specified, the default will be the value set
for the cluster plus account pair; if that is not specified, the
value set for the cluster; and if neither is specified, no limit
will apply.</p>
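<p>For example (names and values hypothetical), a limit set at the
account level serves as the default for user associations that do not
set their own value:</p>
<pre>
# Users in the physics account without their own MaxJobs value
# default to the account's limit of 20
sacctmgr modify account where name=physics set MaxJobs=20
</pre>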
<p>Currently available scheduling policy options:</p>
<ul>
<li><b>Fairshare=</b> Used for determining priority. Essentially,
this is the amount of claim this association and its children have
to the above system.</li>
<!-- For future use
<li><b>GrpCPUMins=</b> A hard limit of cpu minutes to be used by jobs
running from this association and its children. If this limit is
reached all jobs running in this group will be killed, and no new
jobs will be allowed to run.
</li>
-->
<!-- For future use
<li><b>MaxCPUMinsPerJob=</b> A limit of cpu minutes to be used by jobs
running from this association. If this limit is reached, the job
will be killed.
</li>
-->
<!-- For future use
<li><b>GrpCPUs=</b> The total count of cpus able to be used at any given
time from jobs running from this association and its children. If
this limit is reached new jobs will be queued but only allowed to
run after resources have been relinquished from this group.
</li>
-->
<!-- For future use
<li><b>MaxCPUsPerJob=</b> The maximum size in cpus any given job can
have from this association. If this limit is reached the job will
be denied at submission.
</li>
-->
<li><b>GrpJobs=</b> The total number of jobs able to run at any given
time from this association and its children. If
this limit is reached new jobs will be queued but only allowed to
run after previous jobs complete from this group.
</li>
<li><b>MaxJobs=</b> The total number of jobs able to run at any given
time from this association. If this limit is reached new jobs will
be queued but only allowed to run after previous jobs complete from
this association.
</li>
<li><b>GrpNodes=</b> The total count of nodes able to be used at any given
time from jobs running from this association and its children. If
this limit is reached new jobs will be queued but only allowed to
run after resources have been relinquished from this group.
</li>
<li><b>MaxNodesPerJob=</b> The maximum size in nodes any given job can
have from this association. If this limit is reached the job will
be denied at submission.
</li>
<li><b>GrpSubmitJobs=</b> The total number of jobs able to be submitted
to the system at any given time from this association and its children. If
this limit is reached new submission requests will be denied until
previous jobs complete from this group.
</li>
<li><b>MaxSubmitJobs=</b> The maximum number of jobs able to be submitted
to the system at any given time from this association. If
this limit is reached new submission requests will be denied until
previous jobs complete from this association.
</li>
<li><b>GrpWall=</b> The maximum wall clock time any job submitted to
this group can run for. Jobs submitted with a wall clock time limit
that exceeds this limit will be denied.</li>
<li><b>MaxWallDurationPerJob=</b> The maximum wall clock time any job
submitted to this association can run for. Jobs submitted with a wall
clock time limit that exceeds this limit will be denied.
</li>
<li><b>QOS=</b> A comma-separated list of QOSs (Qualities of Service)
under which this association is able to run jobs.
</li>
</ul>
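<p>To make this concrete, here is a hedged sketch of applying a few of
these options to a single user association (all names and values are
illustrative):</p>
<pre>
# Cap brian at 10 running and 20 submitted jobs on cluster tux
sacctmgr modify user where name=brian cluster=tux account=physics \
    set MaxJobs=10 MaxSubmitJobs=20
</pre>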
<p>The <b>MaxNodes</b> and <b>MaxWall</b> options already exist in
SLURM's configuration on a per-partition basis, but the above options
provide the ability to impose limits on a per-user basis. The
<b>MaxJobs</b> option provides an entirely new mechanism for SLURM to
control the workload any individual may place on a cluster in order to
achieve some balance between users.</p>
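<p>For comparison, the per-partition form of these limits in
<b>slurm.conf</b> (the partition parameter corresponding to
<b>MaxWall</b> is <b>MaxTime</b>; the node list and values are
illustrative):</p>
<pre>
# slurm.conf: partition-wide caps applied to every job in the partition
PartitionName=batch Nodes=tux[0-127] MaxNodes=32 MaxTime=24:00:00
</pre>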
<p>Fair-share scheduling is based upon the hierarchical bank account
data maintained in the SLURM database. More information can be found
in the <a href="priority_multifactor.html">priority/multifactor</a>
plugin description.</p>
<p style="text-align: center;">Last modified 9 October 2009</p>
</body></html>