NET4EXA contribution at ISC 2026: Best Paper Award for “PICO: Performance Insights for Collective Operations”
We are pleased to announce that the paper “PICO: Performance Insights for Collective Operations”, led by Daniele De Sensi (University of Rome) and co-authors, has received the Best Paper Award at ISC 2026.
We warmly congratulate all the authors for this outstanding achievement, which highlights the importance of systematic and reproducible performance analysis for collective communication in modern HPC and AI systems.
About PICO
PICO is an open-source framework designed to enable systematic, reproducible benchmarking and fine-grained analysis of collective communication operations across heterogeneous HPC and AI software stacks. It decouples experimental setup from platform execution, supports adaptive configuration across MPI and NCCL backends, and enables controlled, reproducible performance studies.
Contribution within NET4EXA (WP4)
This work has been carried out in the context of the NET4EXA project and is closely aligned with the objectives of Work Package 4. In particular:
- it supports Programming Models (Task 4.1) by enabling backend-adaptive benchmarking across MPI and NCCL implementations, and
- it contributes to Fabric-aware Communication Optimization (Task 4.5) by providing detailed diagnostic insights to evaluate and validate optimized collective communication strategies.
Key results
The study demonstrates that default collective algorithms and transport settings can be up to 5× slower than the best available choice. Simulations that replay open-source LLM training traces with the optimized collective profiles identified by PICO show reductions in training time of up to 44%.
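To make the two headline figures easier to relate, the following is a minimal sketch of the arithmetic behind a "slowdown of the default" and a resulting "training time reduction". It is not PICO code and does not reproduce the paper's methodology: the timing values, the algorithm names, the communication fraction, and both helper functions are hypothetical, and the simple Amdahl-style estimate stands in for the trace-driven simulation used in the study.

```python
def slowdown(default_time: float, best_time: float) -> float:
    """Return how many times slower the default is than the best configuration."""
    if default_time <= 0 or best_time <= 0:
        raise ValueError("timings must be positive")
    return default_time / best_time


def training_time_reduction(comm_fraction: float, comm_speedup: float) -> float:
    """Amdahl-style estimate of the fraction of total time saved when only
    the communication share of a run is accelerated by comm_speedup."""
    if not 0.0 <= comm_fraction <= 1.0:
        raise ValueError("comm_fraction must be in [0, 1]")
    if comm_speedup <= 0:
        raise ValueError("comm_speedup must be positive")
    return comm_fraction * (1.0 - 1.0 / comm_speedup)


# Hypothetical per-configuration timings for one collective, in microseconds.
# These are illustrative values, not measurements from the paper.
timings_us = {"default": 500.0, "ring": 250.0, "recursive_doubling": 100.0}

# Pick the fastest configuration and compare the default against it.
best_name = min(timings_us, key=timings_us.get)
ratio = slowdown(timings_us["default"], timings_us[best_name])

# Assumed share of training time spent in collectives (hypothetical).
saved = training_time_reduction(comm_fraction=0.5, comm_speedup=ratio)

print(f"best configuration: {best_name}")
print(f"default is {ratio:.1f}x slower than the best configuration")
print(f"estimated training time reduction: {saved:.0%}")
```

The estimate only shows why the end-to-end gain is smaller than the per-collective gain: the saving is bounded by the share of time the application spends communicating. The figures reported in the paper come from measurements and simulation, not from this formula.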
Abstract
Authors: Saverio Pasqualoni, Tommaso Bonato, Lorenzo Piarulli, Torsten Hoefler, Marco Canini, Daniele De Sensi
Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their hardware and software stacks. Existing suites primarily report end-to-end timings and offer limited support for controlled algorithm and configuration selection, fine-grained profiling, and capturing the runtime environment. We present PICO (Performance Insights for Collective Operations), an open-source framework that decouples portable experiment setup from platform execution, provides a backend-adaptive parameter selection interface across MPI and NCCL, supplies plain-MPI reference collective implementations, optionally instrumentable, and records the system configuration for reproducible comparisons. Evaluated on three major supercomputers, PICO shows that default collective algorithms and transport settings can be up to 5× slower than the best available choice. It provides diagnostic evidence by isolating topology-sensitive algorithmic choices and, through instrumentation, reveals detailed algorithmic breakdowns. To assess end-to-end effects of benchmark-informed tuning and evaluate application-level impacts, we replay open-source LLM training traces in the ATLAHS simulator with optimized collective profiles identified by PICO, achieving reductions in training times of up to 44%.
Find the paper here:
https://arxiv.org/abs/2508.16809

