March 6, 2024 — Stijn Heldens has successfully defended his PhD thesis, titled “Parallel Programming Systems for Scalable Scientific Computing”, at the Agnietenkapel of the University of Amsterdam. Stijn did his PhD while working at the Netherlands eScience Center, where he continues to work as a Research Software Engineer (RSE). Stijn’s supervision team consisted of Prof. dr. Rob van Nieuwpoort as promotor, with Dr. Jason Maassen and Dr. Ben van Werkhoven as co-promotors. Dr. Pieter Hijma was also part of the supervision team, through a collaboration with VU Amsterdam.
Stijn’s PhD thesis is available from the UvA website. The abstract is included below.
High-performance computing (HPC) systems are more powerful than ever before. However, this rise in performance brings with it greater complexity, presenting significant challenges for researchers who wish to use these systems for their scientific work. This dissertation explores the development of scalable programming solutions for scientific computing. These solutions aim to be effective across a diverse range of computing platforms, from personal desktops to advanced supercomputers.
To better understand HPC systems, this dissertation begins with a literature review on exascale supercomputers, massive systems capable of performing 10¹⁸ floating-point operations per second. This review combines manual and data-driven analyses, revealing that while the traditional challenges of exascale computing have largely been addressed, issues such as software complexity and data volume remain. Additionally, the dissertation introduces LitStudy, the open-source software tool developed for this research.
Next, this dissertation introduces two novel programming systems. The first system (called Rocket) is designed to scale all-versus-all algorithms to massive datasets. It features a multi-level software-based cache, a divide-and-conquer approach, hierarchical work-stealing, and asynchronous processing to maximize data reuse, exploit data locality, dynamically balance workloads, and optimize resource utilization. The second system (called Lightning) aims to scale existing single-GPU kernel functions across multiple GPUs, even on different nodes, with minimal code adjustments. Results across eight benchmarks on up to 32 GPUs show excellent scalability.
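To give a flavour of what “scaling a single-GPU kernel across multiple GPUs” involves, the sketch below shows the bookkeeping a programmer would otherwise write by hand. It is not Lightning’s API and not code from the thesis; it is a minimal CUDA example, with an arbitrary kernel named scale and arbitrary sizes, that splits an array into one chunk per GPU and launches the same unmodified kernel on each chunk.

```cuda
// Hand-written sketch (not Lightning's API): run the same single-GPU kernel
// on every available GPU by giving each GPU one contiguous chunk of the data.
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// An ordinary single-GPU kernel: scale a vector in place.
__global__ void scale(float *x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    int num_gpus = 0;
    if (cudaGetDeviceCount(&num_gpus) != cudaSuccess || num_gpus == 0) {
        std::printf("no CUDA devices found\n");
        return 1;
    }

    // Split the array into one contiguous chunk per GPU.
    const int chunk = (n + num_gpus - 1) / num_gpus;
    std::vector<float *> dev(num_gpus, nullptr);

    for (int g = 0; g < num_gpus; ++g) {
        const int offset = g * chunk;
        const int len = std::min(chunk, n - offset);
        if (len <= 0) break;

        cudaSetDevice(g);  // all following calls target GPU g
        cudaMalloc((void **)&dev[g], len * sizeof(float));
        cudaMemcpy(dev[g], host.data() + offset, len * sizeof(float),
                   cudaMemcpyHostToDevice);

        const int threads = 256;
        const int blocks = (len + threads - 1) / threads;
        scale<<<blocks, threads>>>(dev[g], len, 2.0f);  // unchanged kernel

        // The copy back waits for the kernel on this device's default stream.
        cudaMemcpy(host.data() + offset, dev[g], len * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(dev[g]);
    }

    std::printf("host[0] = %f (expected 2.0)\n", host[0]);
    return 0;
}
```

In this hand-written version the programmer is responsible for partitioning, device selection, and data transfers. Lightning’s aim, as described in the abstract, is to take over this kind of bookkeeping, including across nodes, so that existing single-GPU kernels can be reused with minimal code adjustments.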
The dissertation concludes by proposing a set of design principles for developing parallel programming systems for scalable scientific computing. These principles, distilled from the lessons of this PhD research, are intended to help researchers use HPC systems efficiently.