Help:Heat-svg

heat-svg is a locally written web application that shows the status and utilization of gpus in a slurm-managed cluster.

This tool is available from the local nvidia-heat repo.

Parameters

Add parameters to the end of the URL after a ?, separated by &.

For example:

 heat-svg?size=fill&showhostname=1
mode
html or svg (use svg to embed as an image)
refresh
interval in seconds (default=300, range=30-86000)
usericon
number forcing the user icon on or off (default: off for svg, on for html). If you want your own icon, email Steve or send a message in slack. Icons may be randomly assigned if you don't supply one.
showclass
show the cluster-defined job class
showhostname
add hostname/gpu labels to the graph in addition to the popup
showidle
show utilization of gpus marked as idle in slurm instead of ignoring them
showuser
show job information for the listed users (comma separated)
showjob
show job information
showall
show all items if possible
cols
number of gpus per row (or all for one row)
rows
number of rows when cols=all
size
size (width) of the graph in pixels (default=50%), or one of the following keywords:
full
allow browser to maximize to fit
fill
maximize width and height to fill browser window
w
maximize width
h
maximize height
key
1 to show key (default for mode=html), 0 to hide it
rangeinterval
(prometheus only) show utilization data for at most this interval (default=1d); less if the job has been running for a shorter time
fast
(0 or 1, default=1) when 1, only data from rangeinterval is shown; when 0, data from the entire job history is used (with a 60% weighted average)
avgweight
running average weight as a percentage (>50% favors older samples)
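As an illustration of how the parameters above combine into a URL, here is a minimal Python sketch. The helper name heat_svg_url and the user names are made up for this example; only the parameter names come from the list above, and the base path must be replaced with the address of your own installation.

```python
from urllib.parse import urlencode


def heat_svg_url(base, **params):
    """Return base with params appended as a ?name=value&name=value query."""
    if not params:
        return base
    # safe="," keeps comma-separated lists (e.g. showuser) readable.
    return base + "?" + urlencode(params, safe=",")


# "heat-svg" is a relative path; prepend your own host as needed.
print(heat_svg_url("heat-svg", size="fill", showhostname=1))
# alice and bob are placeholder user names.
print(heat_svg_url("heat-svg", mode="svg", showuser="alice,bob", refresh=60))
```

The first call reproduces the example URL given at the top of this section.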

Visual description

Gpus are represented by squares. Squares are marked to indicate gpu and slurm status.

The color represents the gpu utilization. For jobs in slurm, the color is the average over the runtime of the job. A running average over the last 15 minutes is also calculated; it is used for gpus not in slurm (showidle=1) and as the center color for running jobs. A center dot represents the most recent sample.
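The exact averaging formula is not documented here. The sketch below shows one common way a weighted running average with a percentage weight (like the avgweight parameter) can work, assuming the weight is applied to the previous average and the remainder to the new sample. This is an assumption for illustration, not necessarily what heat-svg does; the function name is hypothetical.

```python
def running_average(samples, avgweight=60):
    """Weighted running average of samples.

    avgweight is the percentage of weight kept by the previous average,
    so values above 50 favor older samples (assumed interpretation).
    """
    w = avgweight / 100.0
    avg = None
    for sample in samples:
        # The first sample seeds the average; later samples are blended in.
        avg = sample if avg is None else w * avg + (1.0 - w) * sample
    return avg


# Utilization jumps from 0% to 100%: with avgweight=60 the average
# moves only part of the way toward the new sample.
print(running_average([0, 100], avgweight=60))
```

Under this interpretation, a higher avgweight makes the color react more slowly to short bursts of activity.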

Colors are selected with an emphasis on nearly idle and nearly saturated gpus. Tooltips on the key (and on individual gpus) give values and thresholds.

Diagonal lines from top right to bottom left indicate slurm gpu job state:

khaki
gpu is not managed by slurm
white
slurm thinks this gpu is idle
green
configuring
orange
completing
red
signaling
yellow
stage out
pink
stopped or suspended

Diagonal lines from top left to bottom right indicate unusual slurm node states:

dark blue
node is off due to slurm power management
light blue
slurm thinks this node is idle
goldenrod
node is marked as idle but not responding
red
node is marked as down or draining
greenyellow
completing (a job has completed and is being cleaned up)
pink
some error state
indianred
node is not responding to slurm or in an unknown state
orange
node is rebooting or powering up
fuchsia
unexpected state (see tooltip for details)

Data source and custom configuration

Data is collected from the following sources:

  • slurm job data
  • slurm node data
  • slurm accounting data
  • collectd graphs using the collectd_nvidianvml python extension
  • prometheus (alternate to collectd)

Additional system-specific perl code can be used to customize this for a specific system. The config file is gpus.pl in the same directory as the collectd rrds.

The following variables are of interest:

$sortbyweight
@nodes
nodes to show, in order (overrides defaults)
@nodesz
parallel to @nodes, number of gpus in each node
$scanmore
if @nodes is not empty, must be 1 to scan for more gpu nodes
$cols
default number of columns
$rows
default number of rows (if not calculated)
%skipnode
skip these nodes even if they have rrd data
%partxlate
translate these job partitions to symbols for showclass=1
%qosxlate
translate these job QOS to symbols for showclass=1
%accountxlate
translate these job accounts to symbols for showclass=1

Symbol selection priority order is partxlate, qosxlate, accountxlate.
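The priority order can be read as "first table with a match wins". The Python sketch below illustrates that reading only; the real configuration is perl, and the function name, job fields, and table contents here are all hypothetical.

```python
def class_symbol(job, partxlate, qosxlate, accountxlate):
    """Return the showclass symbol for a job, or None if nothing matches.

    Tables are checked in priority order: partition, then QOS, then account.
    """
    lookups = (
        (partxlate, job.get("partition")),
        (qosxlate, job.get("qos")),
        (accountxlate, job.get("account")),
    )
    for table, value in lookups:
        if value in table:
            return table[value]
    return None


# Placeholder tables and job; these names are not from a real cluster.
partxlate = {"debug": "D"}
qosxlate = {"high": "H"}
accountxlate = {"research": "R"}

job = {"partition": "batch", "qos": "high", "account": "research"}
print(class_symbol(job, partxlate, qosxlate, accountxlate))
```

In this example the partition has no entry, so the QOS symbol is chosen before the account symbol is considered.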