# lectures

**Repository Path**: DreamPro/lectures

## Basic Information

- **Project Name**: lectures
- **License**: Apache-2.0
- **Default Branch**: main

## Statistics

- **Created**: 2026-05-07
- **Last Updated**: 2026-05-08

## README

# Supplementary Material for Lectures

[![](https://dcbadge.vercel.app/api/server/gpumode?style=flat)](https://discord.gg/gpumode)

[YouTube Channel](https://www.youtube.com/@GPUMODE)

The PMPP Book: [Programming Massively Parallel Processors: A Hands-on Approach](https://a.co/d/2S2fVzt) (Amazon link)

## Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

- Speaker: [Mark Saroufim](https://twitter.com/marksaroufim)
- Notebook and slides in [lecture_001](./lecture_001/) folder

## Lecture 2: Recap Ch. 1-3 from the PMPP book

- Speaker: [Andreas Koepf](https://twitter.com/neurosp1ke)
- Slides: The PowerPoint file [lecture_002/cuda_mode_lecture2.pptx](./lecture_002/cuda_mode_lecture2.pptx) is in the `lecture_002` folder of this repository. It is also available [here](https://docs.google.com/presentation/d/1deqvEHdqEC4LHUpStO6z3TT77Dt84fNAvTIAxBJgDck/edit#slide=id.g2b1444253e5_1_75) as a Google Slides presentation.

## Lecture 3: Getting Started With CUDA

- Speaker: [Jeremy Howard](https://twitter.com/jeremyphoward)
- Notebook: See the [lecture_003](./lecture_003/) folder, or run the [Colab version](https://colab.research.google.com/drive/180uk6frvMBeT4tywhhYXmz3PJaCIA_uk?usp=sharing)

## Lecture 4: Intro to Compute and Memory Architecture

- Speaker: [Thomas Viehmann](https://lernapparat.de/)
- Notebook and slides in the [lecture_004](./lecture_004/) folder.

## Lecture 5: Going Further with CUDA for Python Programmers

- Speaker: [Jeremy Howard](https://twitter.com/jeremyphoward)
- Notebook in the [lecture_005](./lecture_005/) folder.

## Lecture 6: Optimizing PyTorch Optimizers

- Speaker: [Jane Xu](https://github.com/janeyx99)
- [Slides](https://docs.google.com/presentation/d/13WLCuxXzwu5JRZo0tAfW0hbKHQMvFw4O/edit#slide=id.p1)

## Lecture 7: Advanced Quantization

- Speaker: [Charles Hernandez](https://github.com/HDCharles)
- [Slides](https://www.dropbox.com/scl/fi/hzfx1l267m8gwyhcjvfk4/Quantization-Cuda-vs-Triton.pdf?rlkey=s4j64ivi2kpp2l0uq8xjdwbab&dl=0)

## Lecture 8: CUDA Performance Checklist

- Speaker: [Mark Saroufim](https://github.com/msaroufim)
- Code in the [lecture_008](./lecture_008/) folder
- [Slides](https://docs.google.com/presentation/d/1cvVpf3ChFFiY4Kf25S4e4sPY6Y5uRUO-X-A4nJ7IhFE/edit?usp=sharing)

## Lecture 9: Reductions

- Speaker: [Mark Saroufim](https://github.com/msaroufim)
- Code in the [lecture_009](./lecture_009/) folder
- [Slides](https://docs.google.com/presentation/d/1s8lRU8xuDn-R05p1aSP6P7T5kk9VYnDOCyN5bWKeg3U/edit?usp=drive_link)

## Lecture 10: Build a Prod Ready CUDA Library

* Speaker: [Oscar Amoros Huguet](https://github.com/morousg)
* [Slides](https://drive.google.com/drive/folders/158V8BzGj-IkdXXDAdHPNwUzDLNmr971_?usp=drive_link)

## Lecture 11: Sparsity

* Speaker: [Jesse Cai](https://github.com/jcaip)
* [Slides](./lecture_011/sparsity.pptx)

## Lecture 12: Flash Attention

- Speaker: [Thomas Viehmann](https://lernapparat.de/)
- Code in the [lecture_012](./lecture_012/) folder

## Lecture 13: Ring Attention

- Speaker: [Andreas Koepf](https://twitter.com/neurosp1ke)
- [Slides](./lecture_013/ring_attention.pptx)

## Lecture 14: Practitioner's Guide to Triton

- Date: 2024-04-13, Speaker: [Umer Adil](https://twitter.com/UmerHAdil)
- [Notebook](./lecture_014/A_Practitioners_Guide_to_Triton.ipynb)

## Lecture 15: CUTLASS

- Speaker: [Eric Auld](https://github.com/ericauld)

## Lecture 16: Hands-On Profiling

- Speaker: [Taylor Robie](https://www.linkedin.com/in/taylor-robie/)

## Bonus Lecture: CUDA C++ llm.cpp

- Speakers: Jake Hemstad & Georgii Evtushenko
- [Slides](https://drive.google.com/drive/folders/1T-t0d_u0Xu8w_-1E5kAwmXNfF72x-HTA)

## Lecture 17: GPU Collective Communication (NCCL)

- Speaker: [Dan Johnson](https://physbam.stanford.edu/~dansj/)
- Code in the [lecture_017](./lecture_017/) folder

## Lecture 18: Fused Kernels

- Speaker: [Kapil Sharma](https://www.kapilsharma.dev/)
- Code in the [lecture_018](./lecture_018/) folder

## Lecture 19: Data Processing on GPUs

- Speaker: [Devavret Makkar](https://github.com/devavret)

## Lecture 20: Scan Algorithm

- Speaker: [Izzat El Hajj](https://ielhajj.github.io/)
- [Slides](https://docs.google.com/presentation/d/1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)

## Lecture 21: Scan Algorithm Part 2

- Speaker: [Izzat El Hajj](https://ielhajj.github.io/)
- [Slides](https://docs.google.com/presentation/d/1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)

## Lecture 22: Hacker's Guide to Speculative Decoding in VLLM

- Speaker: [Cade Daniel](https://x.com/cdnamz)
- [Slides](https://docs.google.com/presentation/d/1p1xE-EbSAnXpTSiSI0gmy_wdwxN5XaULO3AnCWWoRe4/edit#slide=id.p)

## Lecture 23: Tensor Cores

- Speakers: Vijay Thakkar & Pradeep Ramani
- [Slides](https://drive.google.com/file/d/18sthk6IUOKbdtFphpm_jZNXoJenbWR8m/view)

## Lecture 24: Scan at the Speed of Light

- Speakers: Jake Hemstad & Georgii Evtushenko

## Lecture 25: Speaking Composable Kernel

- Speaker: Haocong Wang
- [Slides](./lecture_025/AMD_ROCm_Speaking_Composable_Kernel_July_20_2024.pdf)

## Lecture 26: SYCL MODE (Intel GPU)

- Speaker: Patric Zhao
- [Slides](https://docs.google.com/presentation/d/1SW4XKomAJhhJSH5-jpZI9Qlwp7TEunbV/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)

## Lecture 27: gpu.cpp

- Speaker: [Austin Huang](https://x.com/austinvhuang)
- [Slides](https://gpucpp-presentation.answer.ai/)

## Lecture 28: Liger Kernel

- Speaker: [Byron Hsu](https://x.com/hsu_byron)
- [Slides](https://docs.google.com/presentation/d/1CGTV-uKw9crrBo13q1jAzAFCFzlpZFjeL4bnK67pTd8/edit?usp=sharing)
- Hands-on Notebooks
    1. [RMSNorm: Verifying Correctness and Performance](https://colab.research.google.com/drive/1CQYhul7MVG5F0gmqTBbx1O1HgolPgF0M?usp=sharing)
    2. [FusedLinearCrossEntropy: Verifying Memory Reduction](https://colab.research.google.com/drive/1Z2QtvaIiLm5MWOs7X6ZPS1MN3hcIJFbj?usp=sharing)
    3. [Convergence Comparison: Triton Kernel Patched vs. Original Model Layer-by-Layer](https://colab.research.google.com/drive/1e52FH0BcE739GZaVp-3_Dv7mc4jF1aif?usp=sharing)
    4. [Contiguity is the hidden killer](https://colab.research.google.com/drive/1llnAdo0hc9FpxYRRnjih0l066NCp7Ylu?usp=sharing)
    5. [Address int32 overflow](https://colab.research.google.com/drive/1WgaU_cmaxVzx8PcdKB5P9yHB6_WyGd4T?usp=sharing)

## Lecture 29: Triton Internals

- Speaker: [Kapil Sharma](https://www.kapilsharma.dev/)
- Code/presentation in the [lecture_029](./lecture_029/) folder

## Lecture 30: Quantized training

- Speaker: [Thien Tran](https://github.com/gau-nernst)
- Code/presentation in the [lecture_030](./lecture_030/) folder

## Lecture 31: Beginner's Guide to Metal Kernels

- Speaker: [Nikita Shulga](https://github.com/gau-nernst)
- Code/presentation in the [lecture_031](./lecture_031/) folder

## Lecture 32: Unsloth - LLM Systems Engineering

- Speaker: [Daniel Han](https://x.com/danielhanchen)
- [Slides](https://docs.google.com/presentation/d/1BvgbDwvOY6Uy6jMuNXrmrz_6Km_CBW0f2espqeQaWfc/edit?usp=sharing)

## Lecture 33: BitBLAS

- Speaker: [Wang Lei](https://github.com/LeiWang1999)
- Code/presentation in the [lecture_033](./lecture_033/) folder

## Lecture 34: Low Bit Triton Kernels

- Speaker: [Hicham Badri](https://github.com/mobicham)
- [Slides](https://docs.google.com/presentation/d/1R9B6RLOlAblyVVFPk9FtAq6MXR1ufj1NaT0bjjib7Vc/edit)

## Lecture 35: SGLang Performance Optimization

- Speaker: [Yineng Zhang](https://linkedin.com/in/zhyncs)
- [Slides](https://github.com/zhyncs/lectures/blob/main/lecture_035/SGLang-Performance-Optimization-YinengZhang.pdf)

## Lecture 36: CUTLASS and Flash Attention 3

- Speaker: [Jay Shah](https://research.colfax-intl.com/blog/)
- [Slides](lecture_036/)

## Lecture 37: Introduction to SASS & GPU Microarchitecture

- Speaker: [Arun Demeure](https://github.com/ademeure)
- [Slides](lecture_037/)

## Lecture 38: Lowbit kernels for ARM CPU

- Speaker: [Scott Roy](https://github.com/metascroy)
- [Slides](lecture_038/)

## Lecture 39: TorchTitan

- Speakers: Mark Saroufim and Tianyu Liu

## Lecture 40: Flash Infer

- Speaker: [Zihao Ye](https://homes.cs.washington.edu/~zhye/)

## Lecture 41: CUDA Docs for Humans

- Speaker: [Charles Frye](https://x.com/charles_irl/status/1867306225706447023)
- [Slides](https://docs.google.com/presentation/d/15lTG6aqf72Hyk5_lqH7iSrc8aP1ElEYxCxch-tD37PE/edit#slide=id.g326210b960f_0_42)

## Lecture 42: Mosaic GPU

- Speaker: [Adam Paszke](https://x.com/apaszke)

## Lecture 43:

- Speaker: Erik Schultheis
- [Slides](lecture_042)

## Lecture 57: CuTe

- Speaker: Cris Cecka
- [Slides](lecture_057)

## Lecture 67: NCCL & NVSHMEM

- Speaker: Jeff Hammond
- [Slides](https://drive.google.com/file/d/1T8uHhFIeVa_g1oYb_O4d2Ltb8YQly1zK/view?usp=sharing)
- [Code](https://github.com/ParRes/Kernels/tree/main/Cxx11)

## Lecture 69: Quartet 4 bit training

- Speakers: Roberto Castro and Andrei Panferov
- Code: https://github.com/IST-DASLab/Quartet and https://github.com/isT-DASLab/qutlass
- [Paper](https://arxiv.org/abs/2505.14669)

## Lecture 70: Fault tolerant communication collectives

- Speaker: mike64_t
- [Slides](https://docs.google.com/presentation/d/1MKB51lhNOsV-Y_hscSaJk7wZskzxft2pFJQZKyvcMyo/edit?usp=sharing)

## Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use

- Speaker: [Sewon Min](https://www.sewonmin.com)
- [Slides](lecture_071)

## Lecture 72: [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models

- Speaker: [Guangxuan Xiao](https://guangxuanx.com)
- [Slides](lecture_072)

## Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention

- Speaker: [Songlin Yang](https://sustcsonglin.github.io)
- [Slides](lecture_074)

## Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens

- Speaker 1: William Brandon
- [Slides 1](https://docs.google.com/presentation/d/1ypi4IjEF36PUZGOJSaFxjNzk7BpO61TicdTBBf77oqc/)
- Speaker 2: [Simran Arora](https://arorasimran.com)
- [Slides 2](lecture_075)

## Lecture 78: Iris: Multi-GPU Programming in Triton

- Speakers: Muhammad Awad, Muhammad Osama & Brandon Potter
- [Slides](lecture_078)

## Lecture 79: Mirage (MPK): Compiling LLMs into Mega Kernels

- Speakers: Mengdi Wu, Xinhao Cheng
- [Slides](lecture_079)

## Lecture 84: Numerics and AI

- Speaker: Paulius Micikevicius
- [Slides](lecture_084)

## Lecture 86: Introduction to CuTeDSL (for NVIDIA competition)

- Speaker: Vicki Wang
- [Slides](lecture_086)

## Lecture 103: Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation

- Speakers: Jack Carlisle and Jay Shah
- [Slides](lecture_103)

## Lecture 104: Gluon: Tile-Based GPU Programming with Low-Level Control

- Speakers: Peter Bell, Mario Lezcano, Keren Zhou
- [Slides and notes](lecture_104)