Nvidia CUDA setup on Ubuntu 22.04 for GPU training

AMD vs Nvidia GPU

PyTorch and TensorFlow both support AMD and Nvidia GPUs, but Nvidia's CUDA is just so much better to program against.

Train on Windows or Linux

In my experience, setting up a Windows machine for training neural networks is way easier than on Ubuntu, or Linux in general.

First, on Windows, you basically just install the CUDA Toolkit (v535), and then PyTorch. That's it. Even with multiple RTX 3090s in your machine, it just works. You can move your monitor to another GPU card or to the integrated CPU graphics, and it keeps working with minimal debugging.

Linux, oh that’s a beast.

The installation on Ubuntu itself is fairly straightforward, like on Windows (check the steps below).

But a lot of the time, CUDA just fails to initialize in PyTorch. I always end up doing a fresh install of Ubuntu and hoping for the best.

Each package has to be exactly the version you installed and tested before. Upgrading one package might result in CUDA not working. Heck, maybe your whole monitor stops working because you messed up the graphics driver.

Working versions: CUDA Toolkit v535, PyTorch 2.0.1, Ubuntu 22.04 (don't update — the latest updates as of Jan 2024 break CUDA).

Always install in this order: Ubuntu -> apt upgrade -> CUDA Toolkit v535 -> PyTorch 2.0.1
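In shell form, that order is roughly the following — note the cu118 wheel index is my assumption, so match it to the CUDA version you actually end up with:

```shell
# 1. Fresh Ubuntu 22.04 install, then bring packages up to date once
sudo apt update && sudo apt upgrade -y

# 2. Install CUDA Toolkit v535 (see the .run file steps further down)

# 3. Pin PyTorch to the exact known-good version.
#    The cu118 index URL is an assumption -- pick the one matching your CUDA.
pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118
```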

Good things about using Linux

You can always customize everything.

For example, you can have your main GPU (in the first PCIe slot) drive the monitor, while your second GPU (a beefy RTX 3090) trains the model.

You can set every program except PyTorch to run on the integrated GPU or the CPU.
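One common way to do the PyTorch side of this is the CUDA_VISIBLE_DEVICES environment variable — a minimal sketch, assuming the RTX 3090 is the second card (index 1):

```python
import os

# Hide every GPU except the second card (index 1) from this process,
# so PyTorch never touches the card driving the monitor.
# This must be set before CUDA is initialized (i.e. before the first
# `import torch` call touches the GPU).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Inside PyTorch, the RTX 3090 now shows up as "cuda:0", because device
# indices are renumbered relative to the visible set.
```

You can also set it per run from the shell: `CUDA_VISIBLE_DEVICES=1 python train.py`.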

The result is noticeably fewer crashes and Out of Memory errors (OOM errors) during training.

GPU memory is not allocated sequentially; allocations land at essentially random locations. With lots of system processes and programs holding blocks at random positions in GPU memory, the free space gets fragmented, and you run into a lot more OOM errors when training.
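You can see exactly which processes are holding GPU memory with nvidia-smi's query mode (this of course only works on a box with the Nvidia driver installed):

```shell
# List every compute process currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```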

Checking Compatibility

First off, we need to ensure our GPU and the CUDA version are ready to tango. Head over to Nvidia’s CUDA Toolkit website and look up the compatibility of your GPU with the toolkit version you intend to install. It’s the tech version of checking if your new fish is going to get along with the ones already in the tank.

 lspci | grep -i nvidia 

The above command will confirm that your Linux instance recognizes the GPU. If it does, it’ll list it out for you. If it doesn’t, you’ll be sitting there, staring at the command line equivalent of tumbleweed.

Purge Previously Installed CUDA (If Any)

Got a case of the old CUDA blues? Sometimes your system has remnants of an older installation, and trust me, it's best to start with a clean slate. Run the following command to purge previous CUDA installations:

 sudo apt-get --purge remove "*cublas*" "cuda*" 

Now your system is as fresh as a daisy, ready for the new install. It’s like spring cleaning but for geeks.
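To double-check the purge actually worked, you can list whatever is left — a quick sanity check:

```shell
# Prints any leftover CUDA/cuBLAS packages; falls through to a message if clean
dpkg -l | grep -Ei "cuda|cublas" || echo "no CUDA packages left"
```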

The Actual CUDA Installation

Alright, down to the nitty-gritty. The CUDA website will give you some options for installation. You can swing with the deb (local) route or get jiggy with the runfile (local). I always use the .run file, and I always download the whole thing first.

https://developer.nvidia.com/cuda-downloads
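A typical runfile session looks something like this — the filename below is a placeholder, so use whatever version you actually downloaded:

```shell
# Download the .run installer from the page above first, then:
chmod +x cuda_<version>_linux.run   # placeholder filename
sudo sh cuda_<version>_linux.run    # launches the interactive installer
```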

Run that and let the terminal do its thing. Might be a good time to grab a snack, or if you're feeling brave, start reading up on cuDNN (more on that in another post).

Testing the Installation

Last but not least, let’s put this baby to the test with the good ol’ nvidia-smi and some CUDA samples:

 nvidia-smi 

This command is like asking your GPU, “You good, buddy?” It shows you the current state of your GPU, memory, temperature, and all that good stuff. If you see your GPU details, then you’re all set!

Now, install PyTorch and test if CUDA is working:

import torch
torch.cuda.is_available()

If it returns True, then you are in business.