Help:Hopper


The hopper cluster is used for both research and classes. This cluster is currently using Ubuntu Linux. Please report any problems to help@cs.ucf.edu.

Full hostname: hopper.cs.ucf.edu

Home page
http://hopper.cs.ucf.edu/
Ganglia page
http://hopper.cs.ucf.edu/ganglia

Getting Access

Hopper access is available to:

  • students in the parallel processing courses if the instructor requests it
  • engineering students doing research with moderate computational requirements that exceed the resources available on desktop machines, or that are inconvenient to run there
  • students in Electrical Engineering or Computer Science who need remote access to Matlab for research and do not have a department-owned computer

If you need high performance computing (more than 200 CPU cores, or a tightly coupled job that needs low-latency communication), or if you need GPU access, you may want to apply for an account at https://ARCC.ist.ucf.edu instead of, or in addition to, hopper.

If you need access to Matlab for coursework, you should already have access on https://apps.ucf.edu. Matlab on hopper may not be used for coursework.

Classes will be given access only at the request of the class instructor.

To get access for research use, email help at eecs.ucf.edu and cc your research advisor. Include your NID in the message and let us know what resources you expect to need (Matlab, CPU cores, memory, disk space).

Rules for cluster use

Please be considerate of your classmates and use hopper's resources fairly.

  • Do not run CPU-intensive jobs on the head node, as this degrades performance for all users.
  • Please use the slurm batch queue system to schedule your jobs (a minimal example script is shown after this list).
  • Class jobs should complete in under an hour. Longer jobs will be automatically killed. (Let us know if you have a special need for more time.) Research users will be allowed unlimited time for jobs.
  • Jobs that request more cpus than available in the cluster will never run and will be deleted without warning.
  • Access to compute nodes outside of the batch queue is not allowed. If you start a job and ssh into the node it is running on, your session will be automatically killed when the job completes.
  • The batch queue includes fair share policies to try to balance use between users.
  • It may be possible to add additional nodes if cluster usage becomes high.

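A minimal sbatch script, as a sketch to adapt: the job name, task count, time limit, and ./my_program below are placeholders for your own values (the time limit shown stays under the 1-hour class limit).

 #!/bin/bash
 #SBATCH --job-name=example        # name shown in squeue output
 #SBATCH --ntasks=4                # number of tasks; adjust to your program
 #SBATCH --time=00:30:00           # wall clock limit; class jobs must finish within 1 hour
 #SBATCH --output=example-%j.out   # %j expands to the job ID
 srun ./my_program                 # placeholder for your own executable

Submit it with sbatch myscript.sh and check its status with squeue -u $USER.
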
Please note that this cluster uses automatic power management: nodes are powered off and on automatically to meet queue demand. This may delay the start of jobs after a period of low demand.
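
If a job seems to be waiting even though the cluster looks idle, a node may simply be powering back up. Slurm marks powered-down nodes with a "~" suffix on the state (for example idle~), so one way to check is:

 sinfo -o "%P %t %D %N"    # partition, compact node state, node count, node list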

Special slurm options

See help:slurm for a more complete description of slurm and its options. The options listed here are specific to this cluster.

If you get a message like this:

Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

then you have tried to submit a job that exceeds either your account limits or fair use limits on the cluster.

To distribute resources fairly among users and make sure every user has some resources available, the maximum size and run time of a job are limited.

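You can also ask slurm directly what limits each QOS carries; the exact field names accepted by format= may vary with the installed slurm version:

 sacctmgr show qos                                 # full list of QOS and their limits
 sacctmgr show qos format=Name,MaxWall,MaxTRESPU   # shorter summary; field names may vary by version
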
The following QOS are available:

  • All users:
    preempt
    Jobs have no time or size limit, but can be preempted if resources are needed by other jobs
  • Class users:
    class
    Jobs are limited to 1 hour of runtime and at most 10 nodes
  • Research users:
    normal
    Jobs are limited to 5 nodes with no time limit
    short
    Jobs are limited to 2 hours with no size limit

These limits are subject to change as usage changes and the cluster grows in size.


Use these QOS by adding one of the following sets of options to your job, either in the submission script or on the command line.

--qos short
Maximum runtime of 2 hours with no size limit
--qos preempt
No limits, but your job will be killed to make room for other users when necessary
--qos preempt --requeue --open-mode=append
Same as above, but the job is requeued and restarted when resources become available again. Note that this only works with sbatch and assumes your job can checkpoint itself. A 5-minute grace period is provided for checkpointing; for example, you can use --signal=2 to trigger checkpointing if necessary. If the job needs to be started differently to resume from a checkpoint, check $SLURM_RESTART_COUNT in your sbatch script (a sketch is shown below).
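
For example, a short research job can simply be submitted with sbatch --qos short myscript.sh. For a preemptable job that restarts itself, the following is a rough sketch only: my_solver, its flags, and checkpoint.dat are placeholders, and it assumes your program writes a checkpoint when it receives SIGINT (signal 2).

 #!/bin/bash
 #SBATCH --qos=preempt
 #SBATCH --requeue
 #SBATCH --open-mode=append    # append to the output file instead of truncating it after a requeue
 #SBATCH --signal=2            # ask slurm to deliver SIGINT so the program can checkpoint
 #SBATCH --ntasks=1
 # SLURM_RESTART_COUNT is unset on the first run and increases each time the job is requeued
 if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ] && [ -f checkpoint.dat ]; then
     srun ./my_solver --resume checkpoint.dat    # placeholder: resume from the last checkpoint
 else
     srun ./my_solver --input input.dat          # placeholder: fresh start
 fi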
