Cluster job optimization

This is a brainstorming page for a possible future seminar and/or help page on cluster job optimization. Feel free to add ideas here, or ask me about topics like these!

Existing tools

  • grafana graphs
    • how to read and navigate
    • which ones do you find useful?
    • heatmap linkage
  • slurm stats
    • view tools (squeue, scontrol, sacct) -- example invocations after this list
    • wrapper scripts
      • sacct-diag
      • sacct-mem
      • swhy
  • globus (ask to get an account)
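
Example invocations of the standard view tools (the sacct field list below is just one reasonable selection; the site wrapper scripts are not shown):

    # jobs you currently have queued or running
    squeue -u $USER

    # full scheduler view of one job: limits, nodes, reason it is pending
    scontrol show job <jobid>

    # accounting data for a finished job: runtime, cpus, peak memory, exit state
    sacct -j <jobid> --format=JobID,JobName,Elapsed,AllocCPUS,MaxRSS,State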

Problems

  • insufficient number of threads to keep the gpu fed
  • not asking for enough cpu cores (crcv: 12 per gpu allowed)
  • too many threads causing cpu pressure (and I/O pressure) -- match the thread count to the allocated cores (see the sbatch sketch after this list)
    • pytorch and other frameworks may start threads independently of the number of workers; be careful not to oversubscribe
  • data set geometry issues
    • try to limit to <1000 files/directories per directory
    • don't make hundreds of directories with one or two files each either
    • if possible, don't unpack data archives, or have them converted to squash files (see the squashfs sketch after this list)
      • Python libraries can read directly from tar and zip files
      • squash files can be very quickly generated directly from zip and tar files
    • if putting data in /share/datasets (see the layout sketch after this list)
      • make sure everything is publicly readable
      • put a README in the top level listing data source and any license restrictions
      • put actual data in subdirectories
    • Data antipatterns and optimizations
  • try to move as much work onto the gpu as possible
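
A minimal sbatch sketch for matching thread counts to allocated cores; the script name train.py, its --num-workers option, and the 12-cores-per-gpu request are illustrative assumptions, while the #SBATCH options and environment variables are standard slurm/OpenMP ones:

    #!/bin/bash
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=12   # enough cores to keep the gpu fed
    #SBATCH --ntasks=1

    # cap the thread pools that pytorch/numpy/OpenMP start on their own,
    # so the total thread count roughly matches the cores slurm granted
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

    # hypothetical training script; pass the core budget so the data loader
    # can size its worker pool to match instead of oversubscribing
    python train.py --num-workers $SLURM_CPUS_PER_TASK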
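
A sketch of packing a dataset into a squash file rather than unpacking it into many small files; mksquashfs is the standard tool, and sqfstar (squashfs-tools 4.5+) can read a tar archive directly, so its availability depends on the installed version:

    # build a squash image straight from an existing directory tree
    mksquashfs my_dataset/ my_dataset.sqfs

    # or convert a tar archive without unpacking it first (needs sqfstar)
    sqfstar my_dataset.sqfs < my_dataset.tar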
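
A sketch of the suggested /share/datasets layout and permissions; my_dataset and the file names are placeholders:

    # put the actual data in subdirectories, with a README at the top level
    mkdir -p /share/datasets/my_dataset/data
    cp README /share/datasets/my_dataset/           # data source + license restrictions
    cp train.sqfs val.sqfs /share/datasets/my_dataset/data/

    # make sure everything is publicly readable (and directories traversable)
    chmod -R a+rX /share/datasets/my_dataset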

Possible new tools

  • more indicators in heatmap
    • mark jobs that are persistently underutilizing gpus and could be killed for abuse
    • mark jobs that are using the wrong memory size gpu (pie chart?)
    • mark nodes with high pressures (need appropriate icons or something)
  • grafana improvements
    • which graph panels do you use? I can collect them on a new dashboard or rearrange existing dashboards
    • collect more gpu statistics?
      • if you find stats that are interesting (from nvml??) let me know and I'll research adding them
  • slurm supports sub-jobs (job steps) -- see the job-step sketch after this list
    • use --cpus-per-task and --ntasks to control resource distribution
    • use srun to start each subtask within a job
    • srun is required for multi-node jobs, along with a communication library such as MPI or NCCL
    • each task gets a share of the cpus allocated to the job and sees all of the gpus (unless those are subdivided too)
  • complex slurm job dependency system
    • break long-running jobs into smaller pieces for easier scheduling
    • split the job into cpu-only and gpu jobs and run them separately
    • use a master job to spawn the sub-job pieces that need to be sequenced
    • use --dependency= to sequence jobs (see the dependency sketch after this list)
  • gpu health testing
    • currently I use gpuburn, which is good for testing for bad memory, overheating issues, and max steady-state power use
    • gpuburn does not test dynamic power use or power excursion issues
    • gpuburn does not test all gpu features or detect known gpu errors; a better tool would be useful
    • may install a better tool in the future
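
A minimal sketch of splitting one allocation into parallel job steps with srun; worker.py, its --shard option, and the 4x3-core split are illustrative assumptions, and --exact may not be needed on older slurm versions:

    #!/bin/bash
    #SBATCH --ntasks=4
    #SBATCH --cpus-per-task=3
    #SBATCH --gres=gpu:1

    # each srun launches one job step with 3 of the 12 allocated cpus;
    # every step sees the gpu unless it is subdivided as well
    for i in 1 2 3 4; do
        srun --ntasks=1 --cpus-per-task=3 --exact python worker.py --shard $i &
    done
    wait   # keep the batch script alive until all steps finish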
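
A sketch of sequencing a cpu-only job and a gpu job with --dependency; the two batch script names are placeholders:

    # submit the cpu-only preprocessing piece first; --parsable prints just the job id
    prep_id=$(sbatch --parsable preprocess_cpu.sbatch)

    # the gpu piece starts only after the cpu job finishes successfully
    sbatch --dependency=afterok:$prep_id train_gpu.sbatch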
