Cluster job optimization
This is a brainstorming page for a possible future seminar and/or help page on cluster job optimization. Feel free to add ideas here, or to ask me about any of these topics!
Existing tools
- grafana graphs
  - how to read and navigate them
  - which ones do you find useful?
  - heatmap linkage
- slurm stats
  - view tools (squeue, scontrol, sacct)
  - wrapper scripts
    - sacct-diag
    - sacct-mem
    - swhy
- globus (ask to get an account)
Problems
- insufficient number of threads to feed the gpu
  - not asking for enough cpu cores (crcv: 12 per gpu allowed)
- too many threads causing cpu pressure (and I/O pressure); match threads to cores
  - pytorch and other frameworks may start threads independently of the number of workers; be careful not to oversubscribe (see the sketch below)
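A minimal sketch of matching threads and workers to the Slurm allocation, assuming pytorch (the dataset, batch size, and worker split are placeholders):

    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Slurm exports SLURM_CPUS_PER_TASK inside a job; default to 1 elsewhere.
    cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

    # Give a few cores to DataLoader workers and the rest to pytorch's
    # intra-op thread pool, so the two together don't oversubscribe.
    workers = min(4, max(0, cpus - 1))
    torch.set_num_threads(max(1, cpus - workers))

    dataset = TensorDataset(torch.randn(1024, 3, 32, 32))  # placeholder data
    loader = DataLoader(dataset, batch_size=64, num_workers=workers, pin_memory=True)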
- data set geometry issues
  - try to limit to <1000 files/directories per directory
  - don't make hundreds of directories with only one or two files each, either
  - if possible, don't unpack data, or get it squashed
    - Python libraries can read directly from tar and zip files (see the sketch below)
    - squash files can be generated very quickly directly from zip and tar files
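For reading archives in place, a minimal sketch using only the Python standard library (the archive names are placeholders):

    import tarfile
    import zipfile

    # Stream files out of a tar archive without unpacking it to disk.
    with tarfile.open("dataset.tar") as tf:
        for member in tf:
            if member.isfile():
                data = tf.extractfile(member).read()
                # ... decode and process `data` here

    # The same idea for a zip archive.
    with zipfile.ZipFile("dataset.zip") as zf:
        for name in zf.namelist():
            data = zf.read(name)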
- if putting data in /share/datasets
  - make sure everything is publicly readable
  - put a README at the top level listing the data source and any license restrictions
  - put the actual data in subdirectories
- Data antipatterns and optimizations
  - Try to move as much work onto the gpu as possible (see the sketch below)
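As one illustration of moving work onto the gpu: ship the compact uint8 batch to the device and do the float conversion and normalization there, rather than in the cpu-side data loading (shapes and scaling are placeholders):

    import torch

    # Placeholder batch as it might come out of a DataLoader: raw uint8 images.
    batch = torch.randint(0, 256, (64, 3, 224, 224), dtype=torch.uint8)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Transfer the small uint8 tensor first, then convert and normalize on
    # the gpu, where this work is cheap compared to doing it per-worker on cpu.
    x = batch.to(device, non_blocking=True).float().div_(255.0)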
Possible new tools
- more indicators in the heatmap
  - mark jobs that are persistently underutilizing gpus and could be killed for abuse
  - mark jobs that are running on the wrong memory-size gpu (pie chart?)
  - mark nodes with high pressure (needs appropriate icons or something)
- grafana improvements
  - which graph panels do you use? I can collect them on a new dashboard or rearrange the existing dashboards
  - collect more gpu statistics?
    - if you find stats that look interesting (from nvml?), let me know and I'll research adding them
- slurm supports sub-jobs
  - use --cpus-per-task and --ntasks to control resource distribution
  - use srun to start each subtask within a job
    - srun is required for multi-node jobs, along with communication libraries like MPI or NCCL
  - each task gets a portion of the cpus allocated to the job and sees all gpus (unless those are subdivided too); see the sketch below
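As a sketch of how each subtask finds its share of the work, assuming the batch script runs something like "srun python worker.py" (worker.py is a hypothetical name):

    import os

    # Slurm gives every task started by srun a distinct SLURM_PROCID
    # (0 .. SLURM_NTASKS-1); use it to pick a disjoint slice of the work.
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    ntasks = int(os.environ.get("SLURM_NTASKS", "1"))

    items = list(range(100))    # placeholder work list
    mine = items[rank::ntasks]  # round-robin sharding across tasks
    print(f"task {rank}/{ntasks} handles {len(mine)} items")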
- complex slurm job dependency system
  - break long-running jobs into smaller pieces for easier scheduling
  - split a job into cpu-only and gpu pieces and run them separately
  - use a master job to spawn sub-jobs for pieces that need to be sequenced
  - use --dependency= to sequence jobs (see the sketch below)
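A minimal sketch of sequencing two jobs with --dependency=, assuming hypothetical scripts preprocess.sh and train.sh (--parsable makes sbatch print just the job id):

    import subprocess

    def sbatch(*args: str) -> str:
        """Submit a batch script and return its job id."""
        out = subprocess.run(["sbatch", "--parsable", *args],
                             check=True, capture_output=True, text=True)
        return out.stdout.strip().split(";")[0]  # drop cluster suffix, if any

    # cpu-only preprocessing first, then a gpu training job that starts
    # only if the preprocessing job exits successfully (afterok).
    prep_id = sbatch("preprocess.sh")
    train_id = sbatch(f"--dependency=afterok:{prep_id}", "train.sh")
    print(f"submitted {prep_id} -> {train_id}")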
- gpu health testing
  - currently I use gpuburn, which is good for testing for bad memory, overheating issues, and maximum steady-state power use
  - gpuburn does not test dynamic power use or power-excursion issues
  - gpuburn does not test all gpu features or detect known gpu errors; a better tool would be useful
    - I may install one in the future