CRCV cluster scheduling

This is a living document describing how slurm scheduling on the CRCV cluster is configured and the rationale for the configuration.

Aspects of this could be applied to any cluster on request.

Numbers marked (?) are suggestions and guesses; feel free to suggest other values. Some of these numbers would be raised as the cluster grows or lowered closer to deadlines. Partitioning may reduce the number of available GPUs, requiring these numbers to be adjusted.

This configuration is partially implemented; the remainder is intended as an implementation guide, giving an overall picture of both the planned configuration and the status quo.

General limits

(all implemented)

  • Hardware limits prevent more than 4 gpus per job per node (with exceptions on one node)
  • Multi-node jobs require special coding and nobody in CRCV has figured it out yet, so jobs are limited to one node per job (except fully preemptable jobs)
  • Limiting gpus per job in general makes little sense; limit total gpus per user and let the user figure out how to arrange them. All counted user limits are on total gpus, not jobs.
  • Fairshare determines which waiting job runs next, so that eventually all users get the same number of GPUs when the partition is full. Shorter jobs balance faster. Fairshare only works when jobs are waiting.
  • Users get a fixed portion of GPUs for non-preemptable jobs, with increasing numbers for shorter jobs. Numbers should be selected so that users can use more than their fair share of GPUs when the queue is empty, but fairshare balances out in a reasonable amount of time when jobs are waiting.
  • QOSes with shorter time limits get a very slight priority boost so they run sooner than other jobs submitted at the same time by the same user, but fairshare dominates, so they do not run before jobs from other users with a fairshare advantage. (A slurm.conf sketch of this weighting follows this list.)
  • There are limits to the maximum number of jobs in the queue (running or waiting). If someone needs more than that, the workflow needs to be restructured.
    • Large numbers (>10) of near-identical jobs should be merged into a single array job that automatically spawns numbered sub-jobs (examples in the wiki, and a minimal sketch after this list). Note that this does not get around hard job-count limits.
    • Multiple sequential jobs should be merged so the minimum job runtime is at least 5-15 minutes. Otherwise slurm's job-start overhead will slow things down horribly.
    • If merging jobs into appropriately sized pieces does not reduce the count enough, the workload can be restructured as a master job that submits sub-jobs and then adds them to itself as a dependency, so it is put on hold until something finishes. The same approach works for jobs that can be split into smaller dependent pieces.
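
A minimal sketch of how the fairshare-dominated priority weighting described above might look in slurm.conf. The weights and decay half-life below are illustrative assumptions, not the cluster's actual values:

 # slurm.conf (excerpt) -- weights are illustrative, not the real configuration
 PriorityType=priority/multifactor
 PriorityDecayHalfLife=7-0        # past usage decays with a one-week half-life (assumed value)
 PriorityWeightFairshare=100000   # fairshare dominates job ordering between users
 PriorityWeightQOS=1000           # small boost so a user's shorter-QOS jobs run first
 PriorityWeightAge=100            # slight tie-breaker for time spent waiting
 PriorityWeightJobSize=0
 PriorityWeightPartition=0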
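
The queue-restructuring advice above can be illustrated with plain sbatch features; the script names and array size below are hypothetical placeholders, not actual CRCV scripts:

 # Replace ~100 near-identical submissions with one array job; train.sbatch
 # reads $SLURM_ARRAY_TASK_ID to pick its input (hypothetical script name).
 sbatch --array=0-99 train.sbatch

 # Split a long workload into dependent pieces: stage2 stays pending until stage1 succeeds.
 jobid=$(sbatch --parsable stage1.sbatch)    # --parsable prints only the job id
 sbatch --dependency=afterok:$jobid stage2.sbatch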

Fully preemptable jobs

(fully implemented, easily added to any cluster if it isn't already there)

  • Fully preemptable jobs have no time limits or resource limits. If a user requests too many resources, the job is more likely to be preempted.
  • Preemptable jobs get their own partition so that jobs are not counted in any other partition’s limits.
  • All other jobs, including time-tiered preemptable jobs, can preempt fully preemptable jobs.
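
One way the fully preemptable partition might be expressed is partition-priority preemption, sketched below; the cluster may equally use QOS-based preemption, and the partition and node names here are placeholders:

 # slurm.conf (excerpt) -- partition and node names are placeholders
 PreemptType=preempt/partition_prio
 PreemptMode=REQUEUE
 # Low-tier partition: no limits, but its jobs can be preempted by any higher tier
 PartitionName=preempt Nodes=ALL PriorityTier=1 MaxTime=INFINITE State=UP
 # Regular partitions sit at a higher tier and can take resources back
 PartitionName=gpu Nodes=gpu[01-10] PriorityTier=10 State=UP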

Short partition (HTC)

This is subject to change based on group discussions; the parts not yet done may be implemented in the future.

  • (DONE) The short partition includes old (slow) and low-memory (<=12 GB) GPUs that are fine for short jobs where higher speed gives little advantage, for jobs that take the same amount of time on a higher-end GPU but with lower utilization, or for mostly interactive jobs where the GPU is mostly idle.
  • (DONE) Long-running jobs are not allowed in the short partition, as they would be better off on a faster GPU.
  • (DONE mostly) The default job is limited to 3 hours (?) and up to ½ (?) of the GPUs in the partition; jobs become preemptable after 3h with no end limit, with a maximum of 2 GPUs.
  • (DONE) Each user can have 1 (?) long non-preemptable job with 1 GPU and a maximum time of 1 day (?).
  • Additional jobs (?) are allowed per user for a shorter time, 30 min (?) (possibly with more GPUs per job?).
  • One GPU per job: if you need more, you should be using faster GPUs, unless you are debugging to get utilization up on a multi-GPU job. (default QOS vs. special QOS?)
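
A hedged sketch of how the short-partition tiers above might be expressed as QOS limits with sacctmgr. The QOS names, times, and GPU counts are placeholders tracking the (?) values above, not settled decisions:

 # Default short QOS: 3-hour limit, one GPU per job (placeholder numbers)
 sacctmgr add qos htc_short
 sacctmgr modify qos htc_short set MaxWall=03:00:00 MaxTRESPerJob=gres/gpu=1

 # One long non-preemptable job per user: one GPU, one day
 sacctmgr add qos htc_day
 sacctmgr modify qos htc_day set MaxWall=1-00:00:00 MaxTRESPerJob=gres/gpu=1 MaxJobsPerUser=1

 # Extra 30-minute jobs with a small priority boost
 sacctmgr add qos htc_30min
 sacctmgr modify qos htc_30min set MaxWall=00:30:00 MaxTRESPerJob=gres/gpu=1 Priority=20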

Regular partition (HPC)

(mostly implemented where supported by slurm; numbers subject to tweaking)

  • A few nodes / GPUs will be reserved for short jobs (3-5h?) that need more GPU memory or faster GPUs, or for debugging on different architectures not in the short queue. (The reservation allows slurm to backfill short jobs without requiring a special QOS for them or requesting the reservation; see the reservation sketch after this list.)
  • When supported, reservations will be dynamic, reserving more GPUs as more short jobs are submitted.
  • Close to deadlines, more GPUs can be scheduled to be reserved in the HPC partition for short jobs or for specific papers as needed. (A formal process for allocating paper-specific GPUs is needed.)
  • Close to deadlines, maximum counts (below) can be adjusted to try to leave a few gpus free without adjusting reservations.
  • Short jobs that need reservations must have a hard end time (<5h?) (this may change if the next version of slurm supports it).
  • Users get a limited number (15?) of GPUs with no runtime limit.
  • Users can submit day jobs (15?) with a 24-hour limit, after which they become preemptable.
  • Users can get a maximum of 20 (?) GPUs between day jobs and regular non-preemptable jobs. Users can decide for themselves what portion within that limit are preemptable and non-preemptable.
  • Users can have up to 25 (?) GPUs using the short QOS (3-5 hr), but GPUs in the short QOS count against the 20-GPU limit for regular and day jobs (so really only 5 additional GPUs for short jobs, and new longer jobs won't start until the short ones finish or are preempted).
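
A hedged sketch of the two mechanisms described above: a standing reservation that holds a couple of nodes for short jobs, and per-user GPU caps expressed as QOS limits. The reservation name, node names, account, QOS names, and counts are all placeholders, and the dynamic resizing, deadline adjustments, and combined day/regular cap are not shown:

 # Standing reservation so short jobs can backfill onto a few GPU nodes
 scontrol create reservation ReservationName=short_jobs Nodes=gpu[01-02] \
     StartTime=now Duration=UNLIMITED Accounts=crcv Flags=IGNORE_JOBS

 # Per-user GPU caps (placeholder numbers matching the (?) values above)
 sacctmgr add qos hpc_long
 sacctmgr modify qos hpc_long set MaxTRESPerUser=gres/gpu=15
 sacctmgr add qos hpc_day
 sacctmgr modify qos hpc_day set MaxWall=1-00:00:00 MaxTRESPerUser=gres/gpu=15
 sacctmgr add qos hpc_short
 sacctmgr modify qos hpc_short set MaxWall=05:00:00 MaxTRESPerUser=gres/gpu=25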