# Course on CUDA Programming

**Source**: https://people.maths.ox.ac.uk/~gilesm/cuda/
**Parent**: https://oerc.ox.ac.uk/education/cuda-programming-on-nvidia-gpus

## Course on CUDA Programming on NVIDIA GPUs, July 20-24, 2026

The course will be taught by
[Prof. Mike Giles](https://people.maths.ox.ac.uk/gilesm/)
and
[Prof. Wes Armour](https://eng.ox.ac.uk/people/wes-armour/).
They have both used CUDA in their research for many years, and
set up
[JADE](https://www.jade.ac.uk/),
the first national GPU HPC facility for Machine Learning.

**Online registration should be set up by the end of March 2026
with a link from this webpage.**

This is a one-week hands-on course for students, postdocs, academics
and others who want to learn how to develop applications to run on
NVIDIA GPUs using the CUDA programming environment. All that will
be assumed is some proficiency with C and basic C++ programming.
No prior experience with parallel computing will be assumed.

The course consists of approximately 3 hours of lectures and 4 hours
of practicals each day. The aim is that by the end of the course you
will be able to write relatively simple programs and will be confident
and able to continue learning through studying the
[CUDA code samples](https://github.com/nvidia/cuda-samples)
provided by NVIDIA on GitHub.

All attendees should bring a laptop to access the GPU servers which
will be used for the practicals.

The costs for the course are:

- free for everyone in Oxford (due to central funding)
- £250 for those from other UK universities
- £500 for those from UK government labs,
  UK not-for-profit organisations,
  and foreign universities
- £2500 for those from industry and foreign government labs

**Anyone with a status which does not fit into one of the categories above,
including those outside the UK who are not from a university, company or
government lab, should contact me
([mike.giles@maths.ox.ac.uk](mailto:mike.giles@maths.ox.ac.uk))
to discuss the appropriate fee category.**

The intention is that these costs should not deter anyone from attending
the course. The higher costs for certain participants reflect the
fact that they will be paying more for their travel and accommodation,
and/or their organisations will be paying more for their time spent
attending the course. They also reflect the UK funding for the facilities
being used.


---

## Venue

The lectures and practicals will all take place in Lecture Theatre L1
downstairs in the
[Mathematical Institute](https://www.maths.ox.ac.uk/about-us/contact-us).
Attendees should bring laptops for accessing the remote Linux servers to
carry out the practicals. Please bring your laptop fully charged;
we will try to provide as many charging points as possible.

---

## Travel to Oxford

For those coming to Oxford, especially from abroad, there is travel advice
[here](../travel.html).

---

## Accommodation and food

Those attending the course must arrange their own accommodation.
The following options are within a few minutes' walk (or bus ride), and are
listed roughly in order of increasing cost:

- [University Rooms](https://www.universityrooms.com/en-GB/city/oxford/home/) (St. Anne's, Somerville and Keble colleges are the closest)
- [Premier Inn -- Westgate](https://www.premierinn.com/gb/en/hotels/england/oxfordshire/oxford/oxford-city-centre-westgate.html) (15-20 minute walk)
- [Travelodge -- Peartree](https://www.travelodge.co.uk/hotels/60/Oxford-Peartree-hotel) (15 minutes by bus)
- [easyHotel -- Oxford](https://www.easyhotel.com/hotels/united-kingdom/oxford/oxford) (10 minutes by bus)
- [Cotswold Lodge Hotel](https://www.cotswoldlodgehotel.co.uk/) (10 minute walk)
- [Old Parsonage Hotel](https://www.oldparsonagehotel.co.uk/) (5 minute walk)

Alternatively, you might consider using
[Airbnb](https://www.airbnb.co.uk/oxford-united-kingdom/stays).

For coffee, breakfast and lunch, there is a good cafe in the
basement of the Mathematical Institute. Little Clarendon Street,
which is nearby, has several restaurants for dinner
(and an excellent ice cream shop), and there are two sandwich
shops for lunch on either side of its junction with Woodstock
Road (A4144 on Google Maps).

---

## Timetable

For the first three days we will follow this timetable:

- 09:00 - 10:30 lecture
- 10:30 - 11:00 break
- 11:00 - 12:30 practical
- 12:30 - 13:30 lunch break
- 13:30 - 15:00 lecture
- 15:00 - 15:30 break
- 15:30 - 17:00 practical

On the last two days we will switch to having both lectures in the morning,
and then have practicals all afternoon. This provides more time for longer
practicals, and will also allow those coming to Oxford from far away to
leave when they wish on Friday afternoon.

---

## Preliminary Reading

Please read chapters 1 and 2 of the NVIDIA CUDA C Programming Guide
which is available both as
[PDF](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf)
and
[online HTML](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html).

**CUDA is an extension of C/C++, so if you are a little rusty with C/C++
you should refresh your memory of it. Here are links to
[a couple of introductory lectures on C](lecs/pre_course_lectures.pdf),
[a larger online resource](https://www.freecodecamp.org/news/the-c-beginners-handbook/) and
[an even larger online resource](https://www.learncpp.com/). This
[Reddit critique](https://www.reddit.com/user/IyeOnline/comments/157f10z/comment/juvgjkc/) particularly recommends that last one, and mentions several others.**

---

## Additional References

- [online CUDA documentation](https://docs.nvidia.com/cuda/index.html)

- [CUDA homepage](https://developer.nvidia.com/cuda-zone)
- [CUDA Runtime API](https://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf)
- [CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf)

- [CUDA Compiler Driver NVCC](https://docs.nvidia.com/cuda/pdf/CUDA_Compiler_Driver_NVCC.pdf)
- [CUDA-gdb debugger](https://docs.nvidia.com/cuda/pdf/cuda-gdb.pdf)

- [CUDA maths library](https://docs.nvidia.com/cuda/pdf/CUDA_Math_API.pdf)
- [cuBLAS library](https://docs.nvidia.com/cuda/pdf/CUBLAS_Library.pdf)
- [cuFFT library](https://docs.nvidia.com/cuda/pdf/CUFFT_Library.pdf)
- [cuRAND library](https://docs.nvidia.com/cuda/pdf/CURAND_Library.pdf)
- [cuSOLVER library](https://docs.nvidia.com/cuda/pdf/CUSOLVER_Library.pdf)
- [cuSPARSE library](https://docs.nvidia.com/cuda/pdf/CUSPARSE_Library.pdf)
- [cuDSS library](https://docs.nvidia.com/cuda/cudss/index.html)
- [NCCL multi-GPU communications library](https://developer.nvidia.com/nccl)
- [CUDA Core Compute Libraries (CCCL)](https://github.com/NVIDIA/cccl)

- [CUDA Fortran](https://developer.nvidia.com/cuda-fortran)
- [CUDA Fortran Programming Guide](https://docs.nvidia.com/hpc-sdk/pdf/hpc235cudaforug.pdf)

- [PTX ISA](https://docs.nvidia.com/cuda/pdf/ptx_isa_8.5.pdf) (low-level instructions)

- [Nsight Visual Studio](https://developer.nvidia.com/nsight-visual-studio-edition)
- [Nsight Visual Studio Code](https://developer.nvidia.com/nsight-visual-studio-code-edition)
- [Nsight Eclipse](https://developer.nvidia.com/nsight-eclipse-edition)
- [Nsight Kernel Profiling Guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html)
- [Nsight Compute Command Line Interface](https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html)
- [Nsight Compute User Interface](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html)
- [Compute Sanitizer](https://docs.nvidia.com/compute-sanitizer/index.html) (including memchk and racecheck tools)
- [other Nsight tools](https://developer.nvidia.com/tools-overview)

- [CUDA code samples](https://github.com/nvidia/cuda-samples) on GitHub

- [helper\_math.h](https://github.com/NVIDIA/cuda-samples/blob/master/Common/helper_math.h) header file defining operator-overloading operations for CUDA intrinsic vector datatypes such as float4
- [dbldbl.h](https://gist.github.com/seibert/5914108)
  header file defining double-double arithmetic for quad-precision
  (originally developed by NVIDIA, but not supported)

- [NVIDIA webpage](https://developer.nvidia.com/cuda-gpus)
  listing the Compute Capability of all GPUs
- Wikipedia pages on NVIDIA
  [HPC cards](https://en.wikipedia.org/wiki/Nvidia_Tesla), and
  [GeForce 50](https://en.wikipedia.org/wiki/GeForce_50_series) graphics cards

- [Ampere Tuning Guide](https://docs.nvidia.com/cuda/ampere-tuning-guide/)
- [Ampere A100 White Paper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

- [Hopper Tuning Guide](https://docs.nvidia.com/cuda/hopper-tuning-guide/)
- [Hopper H100 White Paper](https://resources.nvidia.com/en-us-tensor-core)

- [Blackwell Tuning Guide](https://docs.nvidia.com/cuda/blackwell-tuning-guide/)
- [Blackwell White Paper](https://resources.nvidia.com/en-us-blackwell-architecture/blackwell-architecture-technical-brief)

- [Jetson Thor](https://www.nvidia.com/en-gb/autonomous-machines/embedded-systems/jetson-thor/) for embedded systems
- [Jetson Thor FAQs](https://developer.nvidia.com/embedded/faq)
- [Red Hawk real-time OS](https://concurrent-rt.com/partners/nvidia/) for Jetson systems

- [NVIDIA slides](lecs/NVIDIA_performance_debugging.pdf) on Performance and Debugging Tools (2025)

- [GTC slides](https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s33322/) on "Dissecting the Ampere GPU Architecture through Microbenchmarking"
- [arXiv paper](https://arxiv.org/abs/2501.12084) on "Dissecting
  the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level
  Analysis"

- [NVIDIA T4 datasheet](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf) for those doing practicals on Google Colab

---

## Lectures

- lecture 1: [An introduction to CUDA](lecs/lec1.pdf)
- lecture 2: [Different memory and variable types](lecs/lec2.pdf)
- lecture 3: [Control flow and synchronisation](lecs/lec3.pdf)
- lecture 4: [Warp shuffles, and reduction / scan operations](lecs/lec4.pdf)
- lecture 5: [Libraries and tools](lecs/lec5_wes.pdf)
- lecture 6: [Multiple GPUs, and odds and ends](lecs/lec6_wes.pdf)
- lecture 7: [Tackling a new CUDA application](lecs/lec7.pdf)
- lecture 8: [Future Directions](lecs/lec9_wes.pdf)
- lecture 9: choice of different research talks in L1 and L2:\
  a) [AstroAccelerate](lecs/lec8_wes.pdf)\
  b) [FlashAttention -- an interesting CUDA application](../cuda/lecs/FlashAttention.pdf) \
  [Use of GPUs for Explicit and Implicit Finite Difference Methods](https://people.maths.ox.ac.uk/gilesm/talks/QuanTech_16.pdf)
- lecture 10: [NVIDIA guest lecture by Ira Shokar](lecs/CUDA_workshop_Shokar.pdf) (40MB)

- extra research talks (not presented):\
  [Automated CUDA code generation](../cuda/lecs/codegen.pdf) \
  [Sparse matrix-vector multiplication](https://people.maths.ox.ac.uk/gilesm/talks/ed_3.pdf) \
  [OP2 "Library" for Unstructured Grids](lecs/OP2.pdf)

---

## Practicals

Attendees will be provided with accounts on the
[ARC/HTC](https://arc-user-guide.readthedocs.io/en/latest/)
system which has a number of NVIDIA GPU nodes.
Before starting the practicals, please read these
[ARC notes](arc_notes.pdf).
**The notes include some links to information for Windows users who may not
be very familiar with the Linux systems we will be working on.**
Some details on the Slurm batch queueing system are available
[here](https://www.arc.ox.ac.uk/new-arcus-c-environment#Scheduler).

The practicals all use these header files
([helper\_cuda.h](headers/helper_cuda.h),
[helper\_string.h](headers/helper_string.h))
which came originally from the CUDA SDK. They provide routines for
error-checking and initialisation.

### Tar file for all practicals

[practicals.tar.gz](practicals.tar.gz)
contains all of the files needed for the practicals.
Follow the instructions in the [ARC notes](arc_notes.pdf)
to copy it to your ARC account and untar it.

### Practical 1

Application: a trivial "hello world" example

CUDA aspects: launching a kernel, copying data to/from the graphics card,
error checking and printing from kernel code

- [instructions](prac1/prac1.pdf) (PDF)
- [prac1a.cu](prac1/prac1a.cu)
- [prac1b.cu](prac1/prac1b.cu)
- [prac1c.cu](prac1/prac1c.cu)
- [Makefile](prac1/Makefile)
- [notes on Makefiles](prac1/makefile.pdf) (PDF)

Note: the instructions above explain how a tar file of all files can
be copied from this webpage, so there's no need to download individual
files from here.

- [instructions](prac1/prac1_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/111C1HMtGRFE6P7r489QD9sjM22nX3atR?usp=sharing)

### Practical 2

Application: Monte Carlo simulation using NVIDIA's cuRAND library
for random number generation

CUDA aspects: constant memory, random number generation, kernel timing,
optimising device memory bandwidth

- [instructions](prac2/prac2.pdf) (PDF)
- [some mathematical notes](prac2/MC_notes.pdf) (PDF)
- [prac2.cu](prac2/prac2.cu)
- [prac2\_device.cu](prac2/prac2_device.cu)
- [Makefile](prac2/Makefile)

- [instructions](prac2/prac2_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/1QSwWX32yjrzDJ98R_yMqu5eLfM2MHXhT?usp=sharing)
- [Google Colab notebook with model solution](https://colab.research.google.com/drive/15DVMZWcZHIAfMapSMio46ep9vIZqm-F-)
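
As a rough CPU-side illustration of the Monte Carlo idea behind this practical (it is not the practical's own code), the sketch below estimates an expectation E[f(Z)] for a standard Normal variable Z by averaging over random samples, and also returns the estimated standard error. It uses the C++ standard library generator in place of cuRAND. The names `McResult`, `mc_estimate` and `square`, and the default seed, are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Result of a Monte Carlo run: sample mean and its estimated standard error.
struct McResult {
    double mean;
    double std_error;
};

// Estimate E[f(Z)] for Z ~ N(0,1) using n independent samples.
// Hypothetical helper for illustration only; a GPU version would instead
// generate the samples with cuRAND and accumulate the sums in parallel.
template <typename F>
McResult mc_estimate(F f, std::size_t n, unsigned seed = 1234u) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> normal(0.0, 1.0);
    double sum = 0.0;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double y = f(normal(gen));
        sum += y;
        sum_sq += y * y;
    }
    const double count = static_cast<double>(n);
    const double mean = sum / count;
    const double var = sum_sq / count - mean * mean;  // biased sample variance
    return {mean, std::sqrt(var / count)};
}

// Example integrand: E[Z^2] = 1 exactly, which makes a convenient check.
inline double square(double z) { return z * z; }
```

Calling `mc_estimate(square, 1000000)` should give a mean close to 1, with a standard error that shrinks like one over the square root of the number of samples.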

### Practical 3

Application: 3D Laplace finite difference solver

CUDA aspects: thread block size optimisation, multi-dimensional memory layout,
performance profiling

- [instructions](prac3/prac3.pdf) (PDF)
- [some mathematical notes](prac3/FD_notes.pdf) (PDF)
- [notes on Nsight Systems profiling](lecs/Nsight.pdf) (PDF)
- [laplace3d.cu](prac3/laplace3d.cu)
- [laplace3d\_new.cu](prac3/laplace3d_new.cu)
- [laplace3d\_gold.cpp](prac3/laplace3d_gold.cpp)
- [Makefile](prac3/Makefile)

- [instructions](prac3/prac3_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/19bnXX0j2hZu7IwO7tnugAlt-jaZ7kNZE?usp=sharing)
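
To give a feel for the kind of computation involved, here is a minimal CPU sketch of one Jacobi sweep for a 3D Laplace finite difference problem, of the sort one might compare GPU output against. The storage layout (index `i + nx*(j + ny*k)`, with `i` varying fastest), the treatment of boundary points and the function name `jacobi_sweep` are assumptions for illustration and may differ from the files listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Jacobi sweep on an nx x ny x nz grid stored in a flat array.
// Grid point (i,j,k) lives at index i + nx*(j + ny*k), so i varies fastest.
// Boundary values are copied unchanged; each interior value is replaced by
// the average of its six nearest neighbours. (Assumed layout and naming.)
std::vector<float> jacobi_sweep(const std::vector<float>& u,
                                std::size_t nx, std::size_t ny, std::size_t nz) {
    assert(u.size() == nx * ny * nz);
    std::vector<float> out(u.size());
    for (std::size_t k = 0; k < nz; ++k) {
        for (std::size_t j = 0; j < ny; ++j) {
            for (std::size_t i = 0; i < nx; ++i) {
                const std::size_t ind = i + nx * (j + ny * k);
                const bool boundary = i == 0 || i == nx - 1 ||
                                      j == 0 || j == ny - 1 ||
                                      k == 0 || k == nz - 1;
                if (boundary) {
                    out[ind] = u[ind];
                } else {
                    out[ind] = (u[ind - 1] + u[ind + 1] +
                                u[ind - nx] + u[ind + nx] +
                                u[ind - nx * ny] + u[ind + nx * ny]) / 6.0f;
                }
            }
        }
    }
    return out;
}
```

Because neighbours in `i` are adjacent in memory while neighbours in `k` are `nx*ny` elements apart, the choice of which index each thread handles matters a great deal for memory access patterns on a GPU.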

### Practical 4

Application: reduction

CUDA aspects: dynamic shared memory, thread synchronisation, shuffles, atomics

- [instructions](prac4/prac4.pdf) (PDF)
- [reduction.cu](prac4/reduction.cu)
- [Makefile](prac4/Makefile)
- [round\_up\_test.c](prac4/round_up_test.c): code to round an integer up to the nearest power of 2

- [instructions](prac4/prac4_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/1JzWCAWf3InWUjblBt3BYCBcSibMWAA0P?usp=sharing)
- [Google Colab notebook with model solution](https://colab.research.google.com/drive/1lKXIUwG_h7cKImEa1bLZmLN-TKv6ljSp)
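
The following CPU sketch shows the two ideas mentioned above: rounding a size up to a power of 2, and summing values with a binary-tree reduction in which the number of active elements halves at each step. It uses a common bit-manipulation approach that is not necessarily the one in `round_up_test.c`, and the function names `round_up_pow2` and `tree_reduce` are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Round n up to the nearest power of 2 (n itself if it already is one).
// Intended for 1 <= n <= 2^31: subtract 1, copy the top set bit into every
// lower bit position, then add 1.
std::uint32_t round_up_pow2(std::uint32_t n) {
    --n;
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    return n + 1;
}

// Sum the input with a binary-tree reduction. The data is padded with zeros
// to a power-of-2 length; in each pass, element t adds element t + d, and d
// halves until one value remains. On a GPU the inner loop would be executed
// by d threads in parallel, with a synchronisation between passes.
float tree_reduce(std::vector<float> data) {
    if (data.empty()) return 0.0f;
    const std::size_t m = round_up_pow2(static_cast<std::uint32_t>(data.size()));
    data.resize(m, 0.0f);
    for (std::size_t d = m / 2; d > 0; d /= 2) {
        for (std::size_t t = 0; t < d; ++t) {
            data[t] += data[t + d];
        }
    }
    return data[0];
}
```

Padding to a power of 2 is what allows the simple halving loop; without it, the last element would be missed whenever the active length is odd.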

### Practical 5

Application: using Tensor Cores and cuBLAS and other libraries

- [instructions](prac5/prac5.pdf) (PDF)
- [tensorCUBLAS.cu](prac5/tensorCUBLAS.cu)
- [simpleTensorCoreGEMM.cu](prac5/simpleTensorCoreGEMM.cu)
- [Makefile](prac5/Makefile)

- [instructions](prac5/prac5_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/18MctYEARjYRThQS0nWPjel5Nhs3aBF1V?usp=sharing)
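
When experimenting with matrix-multiply libraries it can help to have a simple CPU reference to check results against. The sketch below computes C = alpha\*A\*B + beta\*C with matrices stored in column-major order, which is the storage convention cuBLAS uses. The function name `gemm_reference` and its argument order are assumptions for illustration; it is not part of cuBLAS or of the practical's files.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference for C = alpha*A*B + beta*C.
// A is m x k, B is k x n, C is m x n, all stored column-major with no
// padding, so element (i,j) of an m-row matrix is at index i + j*m.
// Hypothetical helper for checking results; not tuned for speed.
void gemm_reference(std::size_t m, std::size_t n, std::size_t k,
                    float alpha, const std::vector<float>& A,
                    const std::vector<float>& B,
                    float beta, std::vector<float>& C) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i + p * m] * B[p + j * k];
            }
            C[i + j * m] = alpha * acc + beta * C[i + j * m];
        }
    }
}
```

Reduced-precision arithmetic on the GPU will generally not match a single-precision CPU result bit for bit, so any comparison should allow a tolerance.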

### Practical 6

Application: revisiting the simple "hello world" example

CUDA aspects: using g++ for the main code, building libraries,
using templates

- [instructions](prac6/prac6.pdf) (PDF)
- [main.cpp](prac6/main.cpp)
- [prac6.cu](prac6/prac6.cu)
- [prac6b.cu](prac6/prac6b.cu)
- [prac6c.cu](prac6/prac6c.cu)
- [Makefile](prac6/Makefile)

- [instructions](prac6/prac6_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/1qOW8u3YSHfrElb_Zk95AukUpv5uPVwlh?usp=sharing)

### Practical 7

Application: tri-diagonal equations -- see Lecture 7, and also
[this research talk](https://people.maths.ox.ac.uk/gilesm/talks/QuanTech_16.pdf)

- [instructions](prac7/prac7.pdf) (PDF)
- [trid.cu](prac7/trid.cu)
- [trid\_gold.cpp](prac7/trid_gold.cpp)
- [Makefile](prac7/Makefile)

- [instructions](prac7/prac7_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/1W9I7RDRzn2LCKBMDHCHG3F_UJUyqxYsw?usp=sharing)
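
For readers who have not met tri-diagonal systems before, the sketch below solves one with the serial Thomas algorithm, a standard CPU method that makes a useful baseline. The GPU code in the practical may well use a different, parallel algorithm. The function name `solve_tridiagonal` and the array conventions are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solve a tri-diagonal system with the serial Thomas algorithm.
// Row i reads: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
// where a[0] and c[n-1] are ignored. No pivoting is performed, so this
// assumes a well-behaved (for example diagonally dominant) matrix.
std::vector<double> solve_tridiagonal(const std::vector<double>& a,
                                      const std::vector<double>& b,
                                      const std::vector<double>& c,
                                      const std::vector<double>& d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);
    std::vector<double> cp(n), dp(n), x(n);

    // Forward sweep: eliminate the sub-diagonal.
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    for (std::size_t i = 1; i < n; ++i) {
        const double denom = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / denom;
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom;
    }

    // Back substitution, from the last unknown to the first.
    x[n - 1] = dp[n - 1];
    for (std::size_t i = n - 1; i-- > 0;) {
        x[i] = dp[i] - cp[i] * x[i + 1];
    }
    return x;
}
```

Each step of both loops depends on the previous one, which is why a single system solved this way offers no parallelism and GPU approaches need either many independent systems or a different algorithm.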

### Practical 8

Application: scan operation and recurrence equations

- [instructions](prac8/prac8.pdf) (PDF)
- [scan.cu](prac8/scan.cu)
- [Makefile](prac8/Makefile)

- [instructions](prac8/prac8_Colab.pdf) (PDF)
  for those doing the practicals within Google Colab
- [Google Colab notebook](https://colab.research.google.com/drive/1LGwPeSXznGlQ7-etqLBCByrFNxUI7KdH?usp=sharing)
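
The link between scans and recurrences can be sketched on the CPU as follows. The code performs an inclusive scan in passes with doubling offsets (the Hillis-Steele pattern), then reuses it to solve the linear recurrence v[n] = a[n]\*v[n-1] + b[n], starting from v[-1] = 0, by scanning pairs (a, b) under an associative composition rule. All function names here are assumptions for illustration, and the practical's own code may be organised quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Inclusive scan in passes: in the pass with offset d, element i is combined
// with element i-d. Each pass reads only the previous pass's values, so on a
// GPU every element of a pass could be updated in parallel.
template <typename T, typename Op>
std::vector<T> inclusive_scan_passes(std::vector<T> v, Op op) {
    for (std::size_t d = 1; d < v.size(); d *= 2) {
        const std::vector<T> prev = v;
        for (std::size_t i = d; i < v.size(); ++i) {
            v[i] = op(prev[i - d], prev[i]);
        }
    }
    return v;
}

// Ordinary prefix sums as the simplest use of the scan.
std::vector<int> prefix_sums(const std::vector<int>& x) {
    return inclusive_scan_passes(x, [](int l, int r) { return l + r; });
}

// Solve v[n] = a[n]*v[n-1] + b[n] with v[-1] = 0.
// The pair (a, b) represents the map v -> a*v + b. Applying (a1, b1) and
// then (a2, b2) gives (a1*a2, a2*b1 + b2), which is associative, so the
// running compositions can be computed with a scan.
std::vector<double> solve_recurrence(const std::vector<double>& a,
                                     const std::vector<double>& b) {
    assert(a.size() == b.size());
    using Pair = std::pair<double, double>;
    std::vector<Pair> p(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) p[i] = Pair(a[i], b[i]);
    auto compose = [](const Pair& l, const Pair& r) {
        return Pair(l.first * r.first, r.first * l.second + r.second);
    };
    p = inclusive_scan_passes(p, compose);
    std::vector<double> v(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) v[i] = p[i].second;
    return v;
}
```

The composition is associative but not commutative, so the earlier element must always be the left argument, as it is in `inclusive_scan_passes`.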

### Practical 9

Application: pattern matching

- [instructions](prac9/prac9.pdf) (PDF)
- [match.cu](prac9/match.cu)
- [match\_gold.cpp](prac9/match_gold.cpp)
- [Makefile](prac9/Makefile)

### Practical 10

Application: auto-tuning

- [instructions](prac10/prac10.pdf) (PDF)
- [README](prac10/README) file
- [Flamingo auto-tuning software](http://mistymountain.co.uk/flamingo/)

### Practical 11

Application: streams and OpenMP multithreading

- [instructions](prac11/prac11.pdf) (PDF)
- [stream\_test.cu](prac11/stream_test.cu)
- [multithread\_test.cu](prac11/multithread_test.cu)
- [Makefile](prac11/Makefile)

### Practical 12

Application: more on streams and overlapping computation and communication

- [instructions](prac12/prac12.pdf) (PDF)
- [work\_streaming.cu](prac12/work_streaming.cu)
- [work\_streaming2.cu](prac12/work_streaming2.cu)
- [kernel\_overlap.cu](prac12/kernel_overlap.cu)
- [Makefile](prac12/Makefile)

---

## Acknowledgements

Many thanks to:

- Yassamine Mather, Yishun Lu and Jay Zhang for their help with the practicals
- the Mathematical Institute for hosting the lectures and practicals
- Oxford's Advanced Research Computing for the GPU servers used
  in the practicals
- Google for the Google Colab system
