Help:Heat-svg
heat-svg is a locally written web application that shows the status and utilization of gpus in a slurm-managed cluster.
This tool is available from the local nvidia-heat repo.
Parameters
Add parameters to the end of the url after a ?, separated by &.
For example:
heat-svg?size=fill&showhostname=1
- mode
- html or svg (use svg to embed as an image)
- refresh
- interval in seconds (default=300, range=30-86000)
- usericon
- number forcing the user icon on or off (default: off for svg, on for html). If you want your own icon, email Steve or send a message in Slack. Icons may be randomly assigned if you don't supply one.
- showclass
- show cluster defined job class
- showhostname
- add hostname/gpu labels to the graph in addition to the popup
- showidle
- show utilization of gpus marked as idle in slurm instead of ignoring them
- showuser
- show information for jobs for listed users (comma separated)
- showjob
- show job information
- showall
- show all items if possible
- cols
- number of gpus per row (or all for one row)
- rows
- number of rows when cols=all
- size
- size (width) of the graph in pixels (default=50%), or one of:
- full
- allow browser to maximize to fit
- fill
- maximize width and height to fill browser window
- w
- maximize width
- h
- maximize height
- key
- 1 to show key (default for mode=html), 0 to hide it
- rangeinterval
- (prometheus only) show utilization data for at most this interval (default=1d) (less if a job has been running less time)
- fast
- (0 or 1, default=1) fast shows only data from rangeinterval; otherwise data from the entire job history is used (with a 60% weighted average)
- avgweight
- running average weight as a percentage (>50% favors older samples)
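As a minimal sketch of how the parameters above combine into a url, the following Python builds the query string with the standard library. The base address is a hypothetical placeholder, and heat_svg_url is an illustrative helper, not part of heat-svg.

```python
from urllib.parse import urlencode

# Hypothetical base address; substitute wherever heat-svg is served locally.
BASE_URL = "https://cluster.example.org/heat-svg"

def heat_svg_url(base=BASE_URL, **params):
    """Return base followed by ?name=value pairs joined with &."""
    if not params:
        return base
    # safe="," keeps comma separated lists (such as showuser) readable.
    return base + "?" + urlencode(params, safe=",")

# Same parameters as the example above.
print(heat_svg_url(size="fill", showhostname=1))
# An svg suitable for embedding, refreshed every minute, limited to two users.
print(heat_svg_url(mode="svg", refresh=60, showuser="alice,bob"))
```

The user names here are made up; any parameter from the list can be passed as a keyword argument in the same way.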
Visual description
Gpus are represented by squares. Squares are marked to indicate gpu and slurm status.
The color represents the gpu utilization. For jobs in slurm, the color is the average over the runtime of the job. A running average over the last 15 minutes is also calculated; it is used for gpus not in slurm (showidle=1) and as the center color for running jobs. A center dot represents the most recent sample.
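The exact averaging formula is not documented here. As an illustration only, a weighted running average in which the avgweight percentage is the share kept from the older average (so values above 50% favor older samples) could look like the Python sketch below. The function name and the update rule are assumptions, not the tool's actual code.

```python
def running_average(samples, weight_pct=60):
    """Fold samples (oldest first) into a single weighted running average.

    weight_pct is the percentage of each update taken from the previous
    average; the remainder comes from the new sample (assumed update rule).
    """
    w = weight_pct / 100.0
    avg = None
    for sample in samples:
        # The first sample seeds the average; later samples are blended in.
        avg = sample if avg is None else w * avg + (1.0 - w) * sample
    return avg

# With a 60% weight an idle gpu that jumps to 100% utilization reads about 40;
# with a 40% weight the same two samples read about 60.
print(running_average([0, 100], weight_pct=60))
print(running_average([0, 100], weight_pct=40))
```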
Colors are selected with an emphasis on nearly idle and nearly saturated gpus. Tooltips on the key (and on individual gpus) give values and thresholds.
Diagonal lines from top right to bottom left indicate slurm gpu job state:
- khaki
- gpu is not managed by slurm
- white
- slurm thinks this gpu is idle
- green
- configuring
- orange
- completing
- red
- signaling
- yellow
- stage out
- pink
- stopped or suspended
Diagonal lines from top left to bottom right indicate unusual slurm node states:
- dark blue
- node is off due to slurm power management
- light blue
- slurm thinks this node is idle
- goldenrod
- node is marked as idle but not responding
- red
- node is marked as down or draining
- greenyellow
- completing (a job has completed and is being cleaned up)
- pink
- some error state
- indianred
- node is not responding to slurm or in an unknown state
- orange
- node is rebooting or powering up
- fuchsia
- unexpected state (see tooltip for details)
Data source and custom configuration
Data is collected from the following sources:
- slurm job data
- slurm node data
- slurm accounting data
- collectd graphs using the collectd_nvidianvml python extension
- prometheus (alternate to collectd)
Additional system specific perl code can be used to customize this for a specific system. The config file is gpus.pl in the same directory as the collectd rrds.
The following variables are of interest:
- $sortbyweight
- @nodes
- nodes to show, in order (overrides defaults)
- @nodesz
- parallel to @nodes, number of gpus in each node
- $scanmore
- if @nodes is not empty, must be 1 to scan for more gpu nodes
- $cols
- default number of columns
- $rows
- default number of rows (if not calculated)
- %skipnode
- skip these nodes even if they have rrd data
- %partxlate
- translate these job partitions to symbols for showclass=1
- %qosxlate
- translate these job QOS to symbols for showclass=1
- %accountxlate
- translate these job accounts to symbols for showclass=1
Symbol selection priority order is partxlate, qosxlate, accountxlate.
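The priority order can be pictured with a short Python sketch. The three dictionaries stand in for %partxlate, %qosxlate and %accountxlate; their keys and symbols are hypothetical, and the real configuration is Perl in gpus.pl, so this only illustrates the lookup order.

```python
# Hypothetical translation tables standing in for the Perl hashes.
PARTXLATE = {"debug": "D"}
QOSXLATE = {"high": "H"}
ACCOUNTXLATE = {"physics": "P"}

def job_symbol(partition, qos, account):
    """Return the first matching symbol in priority order, or None."""
    lookups = (
        (PARTXLATE, partition),   # checked first
        (QOSXLATE, qos),          # checked only if the partition has no symbol
        (ACCOUNTXLATE, account),  # checked last
    )
    for table, key in lookups:
        if key in table:
            return table[key]
    return None

# The partition wins even though the qos and account also have symbols.
print(job_symbol("debug", "high", "physics"))
# No partition symbol, so the qos symbol is used.
print(job_symbol("batch", "high", "physics"))
```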