Book contents
- Frontmatter
- Dedication
- Contents
- Figures
- Tables
- Examples
- Preface
- 1 Introduction to GPU Kernels and Hardware
- 2 Thinking and Coding in Parallel
- 3 Warps and Cooperative Groups
- 4 Parallel Stencils
- 5 Textures
- 6 Monte Carlo Applications
- 7 Concurrency Using CUDA Streams and Events
- 8 Application to PET Scanners
- 9 Scaling Up
- 10 Tools for Profiling and Debugging
- 11 Tensor Cores
- Appendix A A Brief History of CUDA
- Appendix B Atomic Operations
- Appendix C The NVCC Compiler
- Appendix D AVX and the Intel Compiler
- Appendix E Number Formats
- Appendix F CUDA Documentation and Libraries
- Appendix G The CX Header Files
- Appendix H AI and Python
- Appendix I Topics in C++
- Index
3 - Warps and Cooperative Groups
Published online by Cambridge University Press: 04 May 2022
Summary
Chapter 3 contains a comprehensive description of CUDA cooperative groups, including the powerful features for explicit warp-level programming. Warp-level programming is becoming more prominent with recent CUDA developments such as the warp matrix functions introduced to support tensor core hardware. The various types of thread groupings are discussed and illustrated in examples. We show a revised reduce kernel which uses warp-level intrinsic functions instead of shared memory. This kernel is further improved by using 128-bit vector loading of data from global GPU memory into local register-based variables in the kernel. A variation of this example using coalesced thread groups is also shown. The conditions for avoiding deadlock when working with warp-level thread divergence are explored, with the differences between older and newer GPU generations explained. An example using the new cg::reduce function is shown; this function has hardware support on CC ≥ 8 devices.
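The sketch below is not the book's example code; it is a minimal illustration, under assumed kernel names and launch conventions, of the techniques the summary mentions: a warp-level sum using the __shfl_down_sync intrinsic instead of shared memory, a 128-bit (float4) vectorised-load variant, and a cooperative-groups version using cg::reduce.

```cuda
// Minimal sketch (not the book's code): block/warp sum reductions.
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

// Sum the 32 lanes of a warp with shuffle intrinsics; lane 0 holds the total.
__device__ float warp_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Grid-stride reduce: registers + warp shuffles, one atomicAdd per warp.
__global__ void reduce_warp(const float* __restrict__ x, float* sum, int n)
{
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) v += x[i];
    v = warp_sum(v);                          // warp-level intrinsic reduce
    if ((threadIdx.x & 31) == 0) atomicAdd(sum, v);
}

// 128-bit vectorised loads: each thread reads a float4 (16 bytes) per trip.
// Assumes the data pointer is 16-byte aligned and n4 = n/4.
__global__ void reduce_warp_vec4(const float4* __restrict__ x4, float* sum, int n4)
{
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x) {
        float4 a = x4[i];
        v += a.x + a.y + a.z + a.w;
    }
    v = warp_sum(v);
    if ((threadIdx.x & 31) == 0) atomicAdd(sum, v);
}

// Cooperative-groups version: cg::reduce has hardware support on CC >= 8
// devices and falls back to a software implementation on older GPUs.
__global__ void reduce_cg(const float* __restrict__ x, float* sum, int n)
{
    auto warp = cg::tiled_partition<32>(cg::this_thread_block());
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) v += x[i];
    v = cg::reduce(warp, v, cg::plus<float>());
    if (warp.thread_rank() == 0) atomicAdd(sum, v);
}
```

All three kernels accumulate per-thread partial sums in registers over a grid-stride loop before the warp-level combine, so shared memory is never needed for the reduction itself.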
- Type: Chapter
- Information: Programming in Parallel with CUDA: A Practical Guide, pp. 72-105
- Publisher: Cambridge University Press
- Print publication year: 2022