Nvidia

From CECS wiki

For user software use, see Help:Cuda linux.

This page covers installing nVidia drivers, including the CUDA toolkit, for nVidia GPUs and GPGPUs.

A more detailed interactive version of this documentation is also available.

Note for all proprietary video driver installs:

  • Make sure that the dkms package is installed. Some driver install scripts will autodetect this if it is pre-installed and use it.
  • If DKMS is used, the driver will update itself with kernel updates. Otherwise, a driver reinstall will be required each time the kernel is updated.

NOTE: nVidia drivers are frequently only supported in Ubuntu LTS releases! Please check driver availability before upgrading!

Note: The legacy nvidia-304 and nvidia-340 drivers are buggy on Ubuntu 16.04 and break with kernel versions newer than 4.10. A beta version of the driver works, but for legacy cards the nouveau driver may also work. Try uninstalling the nvidia drivers to re-enable nouveau.

Utilities

  • nvidia-smi
  • nvtop (Ubuntu 20 repos)
  • nvidia-ps (local)
  • heat-svg (local)
  • gpust (local)

Versions

See CUDA

Software:

cuda  released               OS             issues
9     Sept 2017 - May 2018   Ubuntu 16, 18  won't run on Turing or Ampere
10    Sept 2018 - Nov 2019   Ubuntu 18      won't run on Ampere GPUs
11    March 2020 - Oct 2022  Ubuntu 20, 22  still supported
12    Dec 2022 - current     Ubuntu 20, 22  current

Hardware:

compute        generation
6.x            Pascal
7.0, 7.2       Volta
7.5            Turing
8.0, 8.6, 8.7  Ampere
8.9            Ada Lovelace
9.0            Hopper
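The table above can be turned into a small lookup for scripts. This is only a sketch: the function name cc_generation is made up for illustration, and it knows nothing beyond the rows listed above.

```shell
#!/bin/sh
# Hypothetical helper: map a CUDA compute capability (for example "7.5")
# to the generation name from the table above.
cc_generation() {
    case "$1" in
        6.*)          echo "Pascal" ;;
        7.0|7.2)      echo "Volta" ;;
        7.5)          echo "Turing" ;;
        8.0|8.6|8.7)  echo "Ampere" ;;
        8.9)          echo "Ada Lovelace" ;;
        9.0)          echo "Hopper" ;;
        *)            echo "unknown"; return 1 ;;
    esac
}

# Example:
#   cc_generation 8.6
# Assumption: recent drivers can report the value directly with
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# Older drivers may not support that query field.
```

Anything not in the table prints "unknown" and returns a non-zero status, so callers can tell a pre-Pascal or newer card apart from a listed one.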

Recent updates

The following items are not (yet) in the interactive version linked above:

  • Before trying to debug your nvidia driver, make sure you actually have an nvidia video card:
      update-pciids
      lspci | grep -i vga
  • Ubuntu 18 repos include cuda and the drivers seem well integrated; PPAs are probably not needed anymore
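The card check above can be wrapped in a helper. A sketch only: the name list_nvidia_devices is made up, and matching "3D controller" and "display controller" lines is an addition here, since compute-only cards are often listed that way instead of as VGA.

```shell
#!/bin/sh
# Hypothetical helper: print the lspci lines that describe NVIDIA display
# or compute devices. Reads lspci output on stdin so it can be tried
# against saved output from another machine.
list_nvidia_devices() {
    grep -Ei '(vga compatible controller|3d controller|display controller).*nvidia'
}

# Typical use on a live system (needs pciutils):
#   lspci | list_nvidia_devices || echo "no NVIDIA device found"
```

The function returns non-zero when nothing matches, which is why the `||` form works.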

Hardware with known issues

Cards known to be problematic

  • GeForce 7000 LT may require an older driver and is not supported by CUDA
  • GeForce 8xxx requires the nvidia-340 driver (Ubuntu)
  • Titan X (Pascal) DOES NOT WORK with CUDA 7.5; it requires 8.0. The final release of 8.0 DOES NOT NEED a newer driver.

For these cards, you may need to install a specific driver instead of the default driver or the driver that comes with cuda. See the alternate installation methods below.

Debugging and recovery

If the video works at POST and in grub but locks up or fails during Linux boot, add nomodeset to the kernel command line. See grub magic for all the rest of the options.
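To make nomodeset permanent on Ubuntu, the usual place is the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub, followed by update-grub. The helper below is a sketch of that edit, not something from the original page: the function name add_kernel_param is made up, and it assumes GNU sed and a single double-quoted GRUB_CMDLINE_LINUX_DEFAULT line.

```shell
#!/bin/sh
# Hypothetical helper: append a kernel parameter to the
# GRUB_CMDLINE_LINUX_DEFAULT="..." line of a grub defaults file,
# unless the parameter is already there.
# Usage: add_kernel_param FILE PARAM
add_kernel_param() {
    file=$1
    param=$2
    # Already present as a whole word inside the quotes? Then do nothing.
    if grep -Eq "^GRUB_CMDLINE_LINUX_DEFAULT=\"([^\"]* )?${param}( [^\"]*)?\"" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote.
    sed -i -E "s/^(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*)\"/\1 ${param}\"/" "$file"
}

# Typical use (as root), then regenerate the grub config and reboot:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub nomodeset
#   update-grub
```

For a one-off test boot it is usually quicker to press e at the grub menu and add nomodeset to the linux line by hand.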

Current nVidia drivers explain what to do if they fail, and occasionally specify exactly what driver is needed instead of the latest one.

CentOS and RedHat
    check /var/log/messages
Ubuntu
    check /var/log/syslog
All operating systems
    immediately after boot or a failed driver load (for example from modprobe), dmesg may include the error as well.

Look for nVidia in the log or dmesg output.
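A small filter saves retyping the grep. This is a sketch: the name nvidia_log_lines is made up, and also matching NVRM is an addition here, because the proprietary kernel module prefixes many of its messages with "NVRM:".

```shell
#!/bin/sh
# Hypothetical helper: show only driver-related lines.
# With no arguments it filters stdin; with arguments it reads those files.
nvidia_log_lines() {
    grep -Ei 'nvidia|NVRM' "$@"
}

# Typical use:
#   dmesg | nvidia_log_lines
#   nvidia_log_lines /var/log/syslog      # Ubuntu
#   nvidia_log_lines /var/log/messages    # CentOS and RedHat
```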

Diagnostics and stress tests

 ./cuda_memtest --num_iterations 10 --exit_on_error --stress --monitor_temp 5 

Run without --stress first. Note that --monitor_temp doesn't work on all cards.

Installing nvidia proprietary driver without cuda

If you are going to install cuda, skip this section and use the driver that comes with cuda instead unless your card has a known problem with the cuda driver.

You may need to try multiple versions of the driver until one works. You can install drivers with

   apt-get install package...

You must reboot to actually load the driver, as the previous driver version can't be unloaded with an active screen.

After the driver tries to load, read through the kernel messages to see if it succeeded. Some driver versions (especially newer ones) recognize that they don't support your video card and may suggest the correct version of the driver to load.

 dmesg | less

Try these drivers first:

  • nvidia-current
  • nvidia-367 (latest supplied with ubuntu)

If those don't work and don't suggest an older driver, you can try a newer driver:

apt-add-repository ppa:graphics-drivers/ppa
apt-get update
apt-cache search 'nvidia-[0-9]'

Note that the drivers in this ppa may be incompatible with the nVidia CUDA toolkit.

You can check what driver is loaded with

 lsmod | grep nvidia
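When a machine may be falling back to nouveau, it helps to list both drivers at once. A sketch, with a made-up function name (loaded_gpu_modules); it only parses lsmod output and does not touch the driver.

```shell
#!/bin/sh
# Hypothetical helper: print the names of loaded nvidia or nouveau kernel
# modules. Reads lsmod output on stdin and skips the header line.
loaded_gpu_modules() {
    awk 'NR > 1 && $1 ~ /^(nvidia|nouveau)/ { print $1 }'
}

# Typical use:
#   lsmod | loaded_gpu_modules
# Version of the nvidia module installed for the running kernel:
#   modinfo -F version nvidia
```

If nouveau shows up in the list, the proprietary driver is not the one driving the card.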

If you believe you have installed the correct driver but it is still loading the wrong one, reconfigure it. For example:

 dpkg-reconfigure nvidia-367

and make sure that DKMS correctly builds the driver without errors.


Ubuntu 18

ubuntu-drivers autoinstall

CUDA 7.5 / 10 conflict

If you try to install CUDA 10 on a system that already has CUDA 7.5, it will partially install and then fail with a circular dependency.

These procedures are not well tested and may be missing steps.

The best option is to remove CUDA 7.5 before attempting to install 10.

Alternatively, the following may complete the CUDA 10 install:

 apt-get remove --autoremove libcudart7.5 libcupti7.5 
 apt-get -o Dpkg::Options::="--force-overwrite" install cuda
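Before installing CUDA 10 it may be worth listing what is left of 7.5. The filter below is a sketch under assumptions: the name list_cuda75_packages is made up, and the pattern only catches package names that start with cu or libcu and end in 7.5 or 7-5 (such as libcudart7.5 above), so other 7.5-era packages can slip through.

```shell
#!/bin/sh
# Hypothetical helper: from a list of package names on stdin, print the
# ones that look like CUDA 7.5 packages (name ends in 7.5 or 7-5).
list_cuda75_packages() {
    grep -E '^(lib)?cu[a-z0-9-]*7[.-]5$'
}

# Typical use:
#   dpkg-query -W -f='${Package}\n' | list_cuda75_packages
```

An empty result (non-zero exit status) suggests the 7.5 libraries named above are already gone.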

Ubuntu 18 CUDA

Ubuntu 18 includes the latest version of cuda in its own repos:

  • apt-get install nvidia-cuda-toolkit

OR install from nvidia's repos:

nVidia has now released its own repo for drivers, so the instructions for 16 should also now work for 18.

However, nvidia only includes cuda 10.0 in this repo, so if you need older versions of cuda, you still need to install ubuntu 16.

  • install cuda repo from nvidia: https://developer.nvidia.com/cuda-downloads Do not skip this step!
    • recommend installer type deb network option
    • If you can't download on the local computer, download elsewhere and use dpkg -i cuda*.deb to install from text mode
  • sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
  • apt-get update
  • Install everything at once: apt-get install build-essential cuda environment-modules libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev libboost-all-dev libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler libopenblas-dev python-pip python-protobuf git sshfs; apt-get remove unattended-upgrades

Ubuntu 16 CUDA

NOTE: see Hardware with known issues if you have a conflicting device

  • install cuda repo from nvidia: https://developer.nvidia.com/cuda-downloads Do not skip this step!
    • recommend installer type deb network option
    • If you can't download on the local computer, download elsewhere and use dpkg -i cuda*.deb to install from text mode
  • sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
  • apt-get update
  • apt-get install build-essential cuda environment-modules
  • Caffe dependencies:
    • apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev libboost-all-dev libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler libopenblas-dev python-pip python-protobuf
  • Or install everything at once: apt-get install build-essential cuda environment-modules libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev libboost-all-dev libgflags-dev libgoogle-glog-dev liblmdb-dev protobuf-compiler libopenblas-dev python-pip python-protobuf git sshfs cuda-9-0 cuda-9-2 ; apt-get remove unattended-upgrades
  • If your software needs specific versions: apt-get install cuda cuda-9-0 cuda-9-2
  • Recommended: apt-get remove unattended-upgrades or you will need to reboot every time a driver update is automatically installed
  • If you are doing deep learning, you need to also download and install the appropriate version of the cudnn libraries from the nvidia website.
  • You may also want to install modules to help manage multiple versions of cudnn and cuda.

Make sure the video card works on reboot.

Make sure the nvidia-smi command line lists the card.

If it fails, boot in recovery mode or use a text console and find out why the driver is not working.
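For a scripted post-reboot check, the GPU list from nvidia-smi -L can be counted. A sketch: the function name count_listed_gpus is made up, and it assumes each card appears on a line starting with "GPU <index>:".

```shell
#!/bin/sh
# Hypothetical helper: count the GPUs in `nvidia-smi -L` output (stdin).
# Prints 0 when no GPU line is found, for example when nvidia-smi only
# printed an error message.
count_listed_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Typical use:
#   n=$(nvidia-smi -L 2>&1 | count_listed_gpus)
#   [ "$n" -gt 0 ] || echo "no GPU listed; check dmesg and the driver install"
```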

Upgrade conflict between CUDA 7.5 and CUDA 10

Before installing CUDA 10, you must remove CUDA 7.5. If you fail to do this, you'll get into an unresolvable dependency loop.

If the install has already failed partway, you can force apt-get to ignore the problem with:

apt-get -o Dpkg::Options::="--force-overwrite" install cuda


The nvidia-smi command may help debug. Possible problems include:

  • secure boot is preventing the driver from loading (change secure boot mode and reboot)
  • driver does not support hardware: check system messages for details and install the correct driver
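The secure boot case can be checked from a shell. This is a sketch that assumes the mokutil package is installed and that `mokutil --sb-state` prints a line starting with "SecureBoot enabled" or "SecureBoot disabled"; neither mokutil nor the function name secure_boot_enabled comes from the original page.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `mokutil --sb-state` output (stdin)
# reports that secure boot is enabled.
secure_boot_enabled() {
    grep -qi '^SecureBoot enabled'
}

# Typical use:
#   if mokutil --sb-state | secure_boot_enabled; then
#       echo "secure boot is on: unsigned nvidia modules may be refused"
#   fi
```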

If these directions don't work, additional things to try can be found at https://help.ubuntu.com/community/BinaryDriverHowto/Nvidia

EXTRA: Add cuda to the default path by saving the following to /etc/profile.d/cuda-path.sh

PATH=$PATH:/usr/local/cuda/bin
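A slightly more defensive variant of the same file avoids adding the directory twice when profile scripts are sourced more than once. A sketch only; the function name path_append_once is made up.

```shell
#!/bin/sh
# Sketch of /etc/profile.d/cuda-path.sh
# Hypothetical helper: append a directory to PATH unless it is already there.
path_append_once() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

path_append_once /usr/local/cuda/bin
export PATH
```

Wrapping PATH in colons before matching keeps a directory such as /usr/local/cuda/bin2 from being mistaken for /usr/local/cuda/bin.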


Alternate install

If the cuda driver does not work or cuda is not needed, a specific driver can be installed.

On systems having problems, it may help to start with a clean slate. This is not necessary on most machines.

If you do not need cuda, you can try

 apt-get install nvidia-current

If this fails on reboot, check /var/log/syslog for nvidia driver messages, which sometimes specify the exact driver version needed. If a newer driver is needed, try the Ubuntu proprietary driver PPA:

apt-add-repository ppa:graphics-drivers/ppa
apt-get update

If you have already installed cuda and the included driver does not recognize your card:

   apt-get remove cuda-runtime-8-0

And then search for an appropriate driver:

apt-cache search 'nvidia-[0-9]'

and then install the highest relevant driver version, for instance

apt-get install nvidia-370
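Picking the highest numbered package out of the search results can be scripted. A sketch: the function name highest_nvidia_driver is made up, it assumes the nvidia-NNN naming matched by the search above (newer Ubuntu releases name these packages differently), and the highest version is not always the right one for legacy cards.

```shell
#!/bin/sh
# Hypothetical helper: from `apt-cache search` output on stdin, print the
# nvidia-NNN package name with the highest number.
highest_nvidia_driver() {
    grep -oE '^nvidia-[0-9]+' | sort -u -t- -k2,2n | tail -n 1
}

# Typical use:
#   apt-cache search 'nvidia-[0-9]' | highest_nvidia_driver
#   apt-get install "$(apt-cache search 'nvidia-[0-9]' | highest_nvidia_driver)"
```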

To prevent cuda from being autoremoved:

 apt-get install nvidia-cuda-dev cuda-toolkit-8-0

The following packages, normally included with cuda-toolkit, may also be useful:

 apt-get install nvidia-cuda-{doc,gdb} nvidia-{visual-,}profiler 

Note: package list above generated from apt-cache rdepends cuda and following the dependency tree.

Upgrade conflict between CUDA 11.0 and 11.2

cuda-11-2 does not install cleanly over cuda-11-0. The underlying conflict is cuda-drivers-450 vs. cuda-drivers-460.

The following seems to fix it cleanly (a trailing - on a package name tells apt to remove that package):

apt update
apt install cuda cuda-drivers-450- libnvidia-extra-450-
apt full-upgrade

Configuration

systemd persistenced

This keeps the nvidia driver permanently loaded, which shortens startup time for cuda apps.

  • cp /lib/systemd/system/nvidia-persistenced.service /etc/systemd/system/
  • edit the copy in /etc/systemd/system/ and change --no-persistence-mode to --persistence-mode
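The edit can be scripted. This sketch assumes that "change --no-persistence-mode" means switching the daemon to --persistence-mode, and that the unit file passes the option on its ExecStart line; the function name enable_persistence_mode is made up, and GNU sed is assumed.

```shell
#!/bin/sh
# Hypothetical helper: switch a copied nvidia-persistenced unit file from
# --no-persistence-mode to --persistence-mode.
# Usage: enable_persistence_mode FILE
enable_persistence_mode() {
    sed -i 's/--no-persistence-mode/--persistence-mode/' "$1"
}

# Typical use (as root):
#   cp /lib/systemd/system/nvidia-persistenced.service /etc/systemd/system/
#   enable_persistence_mode /etc/systemd/system/nvidia-persistenced.service
#   systemctl daemon-reload
#   systemctl restart nvidia-persistenced
```

Editing the copy under /etc/systemd/system/ keeps the change from being overwritten when the driver package is upgraded.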

Disable nvidia for video

If you want to use the internal VGA instead of the nvidia card:

  1. (supermicro bios) advanced -> PCIe/PCI/PNP configuration -> VGA Priority (onboard / offboard)
  2. update-alternatives --config x86_64-linux-gnu_egl_conf
  3. update-alternatives --config x86_64-linux-gnu_gl_conf
  4. ldconfig (and then restart the X server)
  5. Add to cuda module:
prepend-path    PATH [join [glob /usr/lib/nvidia*/bin] ":"]
prepend-path    LD_LIBRARY_PATH [join [glob /usr/lib/nvidia*] ":"]

Note: this breaks cuda in some cases.

rebuild xorg.conf

[1]

nvidia-xconfig --query-gpu-info
nvidia-xconfig

performance data collection

Long-term performance data collection can be done with collectd. The following pieces must be installed:

collectd data collection back end
    apt-get install collectd
collectd nvidia plugin
    pip install collectd-nvidianvml
    then configure the collectd plugin as per its README
data viewer (choose one)
    local desktop: kcollectd
    web: cgp
    web: php-collection
    web: (custom code)

Centos 6

Caffe dependencies:

base
    openblas-devel lapack-devel atlas-devel boost-devel protobuf-devel boost-python snappy-devel
epel
    hdf5-devel leveldb-devel lmdb-devel
not in repos
    glog (newer gflags) (newer boost)

Makefile.config  : USE_CUDNN := 1

Note: you may need to blacklist the nouveau driver if you use the nvidia proprietary driver from their website instead of the cuda drivers.

  • add rdblacklist=nouveau to the kernel command line in /etc/default/grub
  • update grub (verify the correct filenames first): grub2-mkconfig --output=/boot/grub2/grub.cfg
  • xrandr will reset the resolution to the correct one if possible, but use system->settings->display to set it permanently


nvidia driver module rebuild:

  • dkms status
  • dkms build nvidia/version
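To decide whether a rebuild is needed, the dkms status output can be checked for the running kernel. A sketch: the function name nvidia_dkms_installed_for is made up, and it assumes status lines start with "nvidia" and end in a state word such as added, built or installed.

```shell
#!/bin/sh
# Hypothetical helper: succeed if `dkms status` output (stdin) shows an
# nvidia module in state "installed" for the given kernel version.
# Usage: nvidia_dkms_installed_for KERNEL_VERSION
nvidia_dkms_installed_for() {
    grep -E '^nvidia' | grep -F -- "$1" | grep -q 'installed'
}

# Typical use:
#   if ! dkms status | nvidia_dkms_installed_for "$(uname -r)"; then
#       echo "nvidia module not installed for $(uname -r); rebuild with dkms"
#   fi
```

A module that is only "built" still has to be installed for the kernel before it can be loaded.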