Cluster job optimization

This is a brainstorming page for a possible future seminar and/or help page on cluster job optimization. Feel free to add ideas here, or ask me about topics like these!

Existing tools

  • grafana graphs
    • how to read and navigate
    • which ones do you find useful?
    • heatmap linkage
  • slurm stats
    • view tools (squeue, scontrol, sacct) -- example invocations after this list
    • wrapper scripts
      • sacct-diag
      • sacct-mem
      • swhy
  • globus (ask to get an account)
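
Example invocations of the standard view tools (the sacct field list below is just one reasonable selection; the site wrapper scripts are not shown):

    # jobs you currently have queued or running
    squeue -u $USER

    # full scheduler view of one job: limits, nodes, reason it is pending
    scontrol show job <jobid>

    # accounting data for a finished job: runtime, cpus, peak memory, exit state
    sacct -j <jobid> --format=JobID,JobName,Elapsed,AllocCPUS,MaxRSS,State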

Problems

  • insufficient number of threads to keep the gpu fed
  • not asking for enough cpu cores (crcv: 12 per gpu allowed)
  • too many threads causing cpu pressure (and I/O pressure) -- match the thread count to the allocated cores (see the sbatch sketch after this list)
    • pytorch and other frameworks may start threads independently of the number of workers; be careful not to oversubscribe
  • data set geometry issues
    • try to limit to <1000 files/directories per directory
    • don't make hundreds of directories with one or two files each either
    • if possible, don't unpack data archives, or have them converted to squash files (see the squashfs sketch after this list)
      • Python libraries can read directly from tar and zip files
      • squash files can be very quickly generated directly from zip and tar files
    • if putting data in /share/datasets (see the layout sketch after this list)
      • make sure everything is publicly readable
      • put a README in the top level listing data source and any license restrictions
      • put actual data in subdirectories
    • Data antipatterns and optimizations
  • try to move as much work onto the gpu as possible
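
A minimal sbatch sketch for matching thread counts to allocated cores; the script name train.py, its --num-workers option, and the 12-cores-per-gpu request are illustrative assumptions, while the #SBATCH options and environment variables are standard slurm/OpenMP ones:

    #!/bin/bash
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=12   # enough cores to keep the gpu fed
    #SBATCH --ntasks=1

    # cap the thread pools that pytorch/numpy/OpenMP start on their own,
    # so the total thread count roughly matches the cores slurm granted
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # hypothetical training script; pass the core budget so the data loader
    # can size its worker pool to match instead of oversubscribing
    python train.py --num-workers $SLURM_CPUS_PER_TASK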
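
A sketch of packing a dataset into a squash file rather than unpacking it into many small files; mksquashfs is the standard tool, and sqfstar (squashfs-tools 4.5+) can read a tar archive directly, so its availability depends on the installed version:

    # build a squash image straight from an existing directory tree
    mksquashfs my_dataset/ my_dataset.sqfs

    # or convert a tar archive without unpacking it first (needs sqfstar)
    sqfstar my_dataset.sqfs < my_dataset.tar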
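
A sketch of the suggested /share/datasets layout and permissions; my_dataset and the file names are placeholders:

    # put the actual data in subdirectories, with a README at the top level
    mkdir -p /share/datasets/my_dataset/data
    cp README /share/datasets/my_dataset/           # data source + license restrictions
    cp train.sqfs val.sqfs /share/datasets/my_dataset/data/

    # make sure everything is publicly readable (and directories traversable)
    chmod -R a+rX /share/datasets/my_dataset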

Possible new tools

  • more indicators in heatmap
    • mark jobs that are persistently underutilizing gpus and could be killed for abuse
    • mark jobs that are using the wrong memory size gpu (pie chart?)
    • mark nodes with high pressures (need appropriate icons or something)
  • grafana improvements
    • which graph panels do you use? I can collect them on a new dashboard or rearrange existing dashboards
    • collect more gpu statistics?
      • if you find stats that are interesting (from nvml??) let me know and I'll research adding them
  • slurm supports sub-jobs (job steps) -- see the job-step sketch after this list
    • use --cpus-per-task and --ntasks to control resource distribution
    • use srun to start each subtask within a job
    • srun is required for multi-node jobs, along with a communication library such as MPI or NCCL
    • each task gets a share of the cpus allocated to the job and sees all of the gpus (unless those are subdivided too)
  • complex slurm job dependency system
    • break long-running jobs into smaller pieces for easier scheduling
    • split the job into cpu-only and gpu jobs and run them separately
    • use a master job to spawn the sub-job pieces that need to be sequenced
    • use --dependency= to sequence jobs (see the dependency sketch after this list)
  • gpu health testing
    • currently I use gpuburn, which is good for testing for bad memory, overheating issues, and max steady-state power use
    • gpuburn does not test dynamic power use or power excursion issues
    • gpuburn does not test all gpu features or detect known gpu errors; a better tool would be useful
    • may install a better tool in the future
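
A minimal sketch of splitting one allocation into parallel job steps with srun; worker.py, its --shard option, and the 4x3-core split are illustrative assumptions, and --exact may not be needed on older slurm versions:

    #!/bin/bash
    #SBATCH --ntasks=4
    #SBATCH --cpus-per-task=3
    #SBATCH --gres=gpu:1

    # each srun launches one job step with 3 of the 12 allocated cpus;
    # every step sees the gpu unless it is subdivided as well
    for i in 1 2 3 4; do
        srun --ntasks=1 --cpus-per-task=3 --exact python worker.py --shard $i &
    done
    wait   # keep the batch script alive until all steps finish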
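
A sketch of sequencing a cpu-only job and a gpu job with --dependency; the two batch script names are placeholders:

    # submit the cpu-only preprocessing piece first; --parsable prints just the job id
    prep_id=$(sbatch --parsable preprocess_cpu.sbatch)

    # the gpu piece starts only after the cpu job finishes successfully
    sbatch --dependency=afterok:$prep_id train_gpu.sbatch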
