
    Talk Abstracts

    Wednesday, May 5th

    Applications and Performance (ARM)


    An Evaluation of the A64FX Architecture for HPC Applications  

    LIVE SESSION @ 8:00 AM PST


    Contributors: 


    Andrei Poenaru, Tom Deakin, Simon McIntosh-Smith, Si Hammond, Andrew Younge


    Description:


    In this paper, we present some of the first in-depth, rigorous, independent benchmark results for the A64FX, the processor at the heart of Fugaku, the current #1 supercomputer in the world, and now available in Apollo 80 guise. The Isambard and Astra research teams have combined to perform this study, using a combination of mini-apps and application benchmarks to evaluate A64FX's performance for both compute- and bandwidth-bound scenarios. The study uniquely had access to all four major compilers for A64FX: Cray, Arm, GNU and Fujitsu. The results showed that the A64FX is extremely competitive, matching or exceeding contemporary dual-socket x86 servers. We also report tuning and optimisation techniques which proved essential for achieving good performance on this new architecture.
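
    The bandwidth-bound side of such an evaluation is typically probed with a STREAM-style kernel. Below is a minimal, hypothetical sketch of an OpenMP triad loop of the kind used to measure the A64FX's high-bandwidth memory; it is an illustration only, not the mini-apps or application benchmarks used in the paper:

```cpp
// Illustrative STREAM-style triad (not the paper's benchmark suite).
// Compile e.g. with: g++ -O3 -fopenmp -mcpu=a64fx triad.cpp
#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main() {
    const size_t n = 1 << 26;                    // ~64M doubles per array
    double *a = (double *)malloc(n * sizeof(double));
    double *b = (double *)malloc(n * sizeof(double));
    double *c = (double *)malloc(n * sizeof(double));
    const double scalar = 3.0;

    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i) {             // first-touch initialisation
        b[i] = 1.0;
        c[i] = 2.0;
    }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + scalar * c[i];             // triad: read b, read c, write a
    double t1 = omp_get_wtime();

    // 3 arrays of 8-byte elements are streamed per iteration
    printf("triad bandwidth: %.1f GB/s\n", 3.0 * 8.0 * n / (t1 - t0) / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```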


    Vectorising and distributing NTTs to count Goldbach partitions on Arm-based supercomputers 

    LIVE SESSION @ 8:15 AM PST


    Contributors: 


    Ricardo Jesus, Tomás Oliveira e Silva, Michèle Weiland


    Description:


    In this paper we explore the use of SVE to vectorise number-theoretic transforms (NTTs). In particular, we show that 64-bit modular arithmetic operations, including modular multiplication, can now be efficiently implemented with SVE instructions. The vectorisation of NTT loops and similar code structures involving 64-bit modular operations was not possible in previous Arm-based SIMD architectures, since these architectures lacked crucial instructions to efficiently implement modular multiplication. We test and evaluate our SVE implementation on an A64FX processor in an HPE Apollo 80 system. Furthermore, we implement a distributed NTT for the computation of large-scale exact integer convolutions. We evaluate this transform on HPE Apollo 70 and Cray XC50 systems, where we demonstrate good scalability to thousands of cores. Finally, we describe how these methods can be utilised to count the number of Goldbach partitions of the even numbers up to large limits. We present some preliminary results concerning this last problem, in particular the curve of the even numbers up to 2^40 whose number of partitions is larger than that of all preceding integers.
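
    The crucial instruction referred to here is SVE's 64-bit unsigned multiply-high (UMULH, exposed through the svmulh intrinsic), which earlier Arm SIMD extensions lacked for 64-bit elements. The sketch below is a generic illustration of vectorised 64-bit Montgomery multiplication built on it, not the paper's implementation; it assumes an odd modulus N < 2^63, a precomputed Nprime = -N^{-1} mod 2^64, and inputs already in Montgomery form:

```cpp
#include <arm_sve.h>
#include <stdint.h>

// Generic sketch of 64-bit Montgomery multiplication with SVE intrinsics
// (illustrative only; not the paper's code). Requires odd N < 2^63 and
// Nprime = -N^{-1} mod 2^64; a and b are in Montgomery form.
static inline svuint64_t montmul_u64(svbool_t pg, svuint64_t a, svuint64_t b,
                                     uint64_t N, uint64_t Nprime) {
    svuint64_t t_lo  = svmul_x(pg, a, b);                    // low 64 bits of a*b
    svuint64_t t_hi  = svmulh_x(pg, a, b);                   // high 64 bits (UMULH)
    svuint64_t m     = svmul_x(pg, t_lo, svdup_u64(Nprime)); // m = t_lo*N' mod 2^64
    svuint64_t mN_hi = svmulh_x(pg, m, svdup_u64(N));        // high 64 bits of m*N
    // t + m*N is divisible by 2^64; the carry out of the low half is 1
    // exactly when t_lo != 0, so the quotient is t_hi + mN_hi + carry.
    svbool_t carry = svcmpne(pg, t_lo, svdup_u64(0));
    svuint64_t u   = svadd_x(pg, t_hi, mN_hi);
    u = svadd_m(carry, u, svdup_u64(1));
    // Conditional final subtraction brings the result below N.
    svbool_t ge = svcmpge(pg, u, svdup_u64(N));
    return svsub_m(ge, u, svdup_u64(N));
}

// Example use: pointwise modular product of two arrays, the core operation
// of an NTT-based convolution.
void modmul_array(uint64_t *out, const uint64_t *x, const uint64_t *y,
                  int64_t n, uint64_t N, uint64_t Nprime) {
    for (int64_t i = 0; i < n; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, n);
        svuint64_t r = montmul_u64(pg, svld1(pg, x + i), svld1(pg, y + i),
                                   N, Nprime);
        svst1(pg, out + i, r);
    }
}
```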


    Optimizing a 3D multi-physics continuum mechanics code for the HPE Apollo 80 System

    LIVE SESSION @ 8:30 AM PST


    Contributors: 


    Vince Graziano, David Nystrom, Howard Pritchard, Brandon Smith, Brian Gravelle


    Description:


    We present results of a performance evaluation of a LANL 3D multi-physics continuum mechanics code, Pagosa, on an HPE Apollo 80 system. The Apollo 80 features the Fujitsu A64FX Arm processor with Scalable Vector Extension (SVE) support and high-bandwidth memory. This combination of SIMD vector units and high memory bandwidth offers the promise of realizing a significant fraction of the theoretical peak performance for applications like Pagosa. In this paper we present performance results of the code using the GNU, Arm, and CCE compilers, analyze these compilers’ ability to vectorize performance-critical loops when targeting the SVE instruction set, and describe code modifications to improve the performance of the application on the A64FX processor.
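
    As a generic illustration of what vectorising a performance-critical loop entails (this is not a loop from Pagosa), compilers usually need aliasing guarantees and a simple loop structure before they emit SVE code; restrict qualifiers and an omp simd hint are often the decisive modifications:

```cpp
#include <cstddef>

// Illustrative only: a simple array update of the kind compilers must
// vectorise for SVE; not code from Pagosa. The restrict qualifiers assert
// that the arrays do not alias, which is often the difference between
// scalar and vector code; 'omp simd' makes the request explicit.
void update(double *__restrict__ u, const double *__restrict__ v,
            const double *__restrict__ w, double dt, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        u[i] += dt * (v[i] - w[i]);
}
```

    Whether SVE instructions were actually emitted can be checked with each compiler's vectorisation report (e.g. -fopt-info-vec for GNU or -Rpass=loop-vectorize for LLVM-based compilers).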

    Applications and Performance


    Optimizing the Cray Graph Engine for Performant Analytics on Cluster, SuperDome Flex, Shasta Systems and Cloud Deployment

    LIVE SESSION @ 9:00 AM PST


    Contributors: 


    Christopher Rickett, Kristyn Maschhoff, Sreenivas Sukumar


    Description:


    We present updates to the Cray Graph Engine (CGE), a high-performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as the SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependencies on XPMEM and Cray PGAS and replacing them with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to achieve container performance matching native execution. We present early benchmarking results for running CGE on the SDF, InfiniBand clusters, and Slingshot interconnect-based Shasta systems.
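
    A minimal sketch of the substitution described, assuming nothing about CGE's actual library: each rank contributes a slab of a "global" array through an MPI window, and remote reads use MPI_Get under a passive-target lock instead of XPMEM or Cray PGAS:

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstdio>

// Illustrative PGAS-style remote read over one-sided MPI (not CGE's library).
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Aint slab = 1024;              // elements owned per rank
    uint64_t *local;
    MPI_Win win;
    // Allocate the local slab and expose it in a window in one call.
    MPI_Win_allocate(slab * sizeof(uint64_t), sizeof(uint64_t),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &local, &win);
    for (MPI_Aint i = 0; i < slab; ++i) local[i] = rank * slab + i;
    MPI_Barrier(MPI_COMM_WORLD);             // ensure all slabs are initialised

    // Passive-target epoch: read element 7 of the next rank's slab.
    uint64_t value;
    int target = (rank + 1) % size;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(&value, 1, MPI_UINT64_T, target, 7, 1, MPI_UINT64_T, win);
    MPI_Win_unlock(target, win);
    printf("rank %d read %llu from rank %d\n",
           rank, (unsigned long long)value, target);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```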


    Real-Time XFEL Data Analysis at SLAC and NERSC: a Trial Run of Nascent Exascale Experimental Data Analysis

    LIVE SESSION @ 9:15 AM PST


    Contributors: 


    Johannes P. Blaschke, Aaron S. Brewster, Daniel Paley, Derek Mendez, Nicholas K. Sauter, Deborah Bard


    Description:


    X-ray scattering experiments using free-electron lasers (XFELs) are a powerful tool for determining the molecular structure and function of unknown samples (such as COVID-19 viral proteins). XFEL experiments challenge computing in two ways: i) due to the high cost of running XFELs, a fast turnaround time from data acquisition to data analysis is essential for making informed decisions on experimental protocols; ii) data collection rates are growing exponentially, requiring new scalable algorithms. Here we report our experiences from two experiments at LCLS during September 2020. Raw data was analyzed on NERSC’s Cori system using the super-facility paradigm: our workflow automatically moves raw data between LCLS and NERSC, where it is analyzed using CCTBX. We achieved real-time data analysis with a 20-minute turnaround time from data acquisition to full molecular reconstruction, sufficient for the experiment’s operators to make informed decisions between shots.


    Early Experiences Evaluating the HPE/Cray Ecosystem for AMD GPUs

    LIVE SESSION @ 9:30 AM PST


    Contributors: 


    Veronica G. Vergara Larrea, Reuben Budiardja, Wayne Joubert


    Description:


    Since deploying the Titan supercomputer in 2012, the Oak Ridge Leadership Computing Facility (OLCF) has continued to support and promote GPU-accelerated computing among its user community. Summit, the flagship system at the OLCF (currently number 2 on the most recent TOP500 list), has a theoretical peak performance of approximately 200 petaflops. Because the majority of Summit’s computational power comes from its 27,648 GPUs, users must port their applications to one of the supported programming models in order to make efficient use of the system. Looking ahead to Frontier, the OLCF’s exascale supercomputer, users will need to adapt to an entirely new ecosystem which will include new hardware and software technologies. First, users will need to familiarize themselves with the AMD Radeon GPU architecture. Furthermore, users who have previously relied on CUDA will need to transition to the Heterogeneous-Computing Interface for Portability (HIP) or one of the other supported programming models (e.g., OpenMP, OpenACC). In this work, we describe our initial experiences in porting three applications or proxy apps currently running on Summit to the HPE/Cray ecosystem to leverage the compute power from AMD GPUs: minisweep, GenASiS, and Sparkler. Each one is representative of current production workloads utilized at the OLCF, different programming languages, and different programming models. We also share lessons learned from challenges encountered during the porting process and provide preliminary results from our evaluation of the HPE/Cray Programming Environment and the AMD software stack using these key OLCF applications.
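
    At the API level the CUDA-to-HIP transition is largely mechanical. Below is a minimal, hypothetical sketch (not code from minisweep, GenASiS, or Sparkler) of a hipified vector add, where each hip* call is a one-for-one replacement of its cuda* counterpart:

```cpp
#include <hip/hip_runtime.h>

// Illustrative HIP port of a trivial CUDA kernel (not from the paper's apps).
// The kernel body is unchanged from CUDA; only the runtime calls are renamed.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    hipMalloc((void **)&a, n * sizeof(float));   // was cudaMalloc
    hipMalloc((void **)&b, n * sizeof(float));
    hipMalloc((void **)&c, n * sizeof(float));
    // ... initialise a and b on the device (omitted) ...
    vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // launch syntax as in CUDA
    hipDeviceSynchronize();                      // was cudaDeviceSynchronize
    hipFree(a); hipFree(b); hipFree(c);          // was cudaFree
    return 0;
}
```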


    PEAD: Update of Cray Programming Environment

    ON-DEMAND SESSION


    Contributors: 


    John Levesque


    Description:


    Over the past year, the Cray Programming Environment (CPE) engineers have been hard at work on numerous projects to make the compiler and tools easier to use and to interact well with the new GPU systems. This talk will cover those facets of development and give a perspective on where CPE is going. We recognize that CPE is the only programming environment that gives application developers a portable development interface across all the popular node and GPU options. The one major complaint is that CPE's compiler strictly enforces language standards, making it incompatible with the Intel and GNU compilers, which allow non-standard extensions; this complaint is being addressed. We are also modifying the software to work with newer software components such as containers and Spack. Additionally, CPE will be supported on HPE systems beyond the traditional Cray systems. Finally, there are numerous new products being developed for CORAL-2 systems, which will be beneficial to the entire HPE community.


    Convergence of AI and HPC at HLRS. Our Roadmap.

    ON-DEMAND SESSION


    Contributors: 


    Dennis Hoppe


    Description:


    The growth of artificial intelligence (AI) is accelerating. AI has left research and innovation labs and nowadays plays a significant role in everyday life. The impact on society is graspable: autonomous cars produced by Tesla, voice assistants such as Siri, and AI systems that beat renowned champions in board games like Go. All these advancements are facilitated by powerful computing infrastructures based on HPC and advanced AI-specific hardware, as well as highly optimized AI codes. For several years, HLRS has been engaged in big data and AI-specific activities around HPC. The road towards AI at HLRS began with the installation of a Cray Urika-GX for processing large volumes of data, but because the platform was isolated and its usage concept unfamiliar to HPC users, uptake of the system was lower than expected. This changed drastically with the recent installation of a CS-Storm equipped with powerful GPUs. Since then, we have also been extending our HPC system with GPUs due to high customer demand. We foresee that the duality of running AI and HPC on separate systems will soon be overcome, making hybrid AI/HPC workflows possible. In this talk, I will give a brief overview of our research project CATALYST, through which we engage with researchers and SMEs, and present exciting case studies from some of our customers who leverage AI. This will be placed in the context of the overall AI strategy of HLRS, including lessons learned over the years on different Cray/HPE systems such as the Urika-GX.


    Porting Codes to LUMI

    ON-DEMAND SESSION


    Contributors: 


    Georgios Markomanolis


    Description:


    LUMI is an upcoming EuroHPC pre-exascale supercomputer built by HPE Cray, with a peak performance of just over 550 petaflop/s. Users from the many countries of the LUMI consortium, among others, will have access to the system. It will be based on the next generation of AMD Instinct GPUs, a new environment for all of us. In this presentation, we discuss the AMD ecosystem and present, with examples, the procedure for converting CUDA codes to HIP, as well as how to port Fortran codes with hipfort. We discuss the use of other HIP libraries and demonstrate a performance comparison between CUDA and HIP. We explore the challenges that scientists will face when porting their applications and provide step-by-step guidance. Finally, we discuss the potential of other programming models and the workflow we follow to port codes depending on their readiness for GPUs and the programming language used.
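
    Library calls follow the same one-for-one pattern as the runtime API, and hipify-perl rewrites most of them automatically. Below is a generic illustration (not from the talk) of a cuBLAS DGEMM call mapped to hipBLAS; the wrapper function dgemm_hip and its arguments are hypothetical:

```cpp
#include <hipblas.h>

// Illustrative cuBLAS-to-hipBLAS mapping (generic example, not the talk's
// code). cublasCreate/cublasDgemm/cublasDestroy become hipblasCreate/
// hipblasDgemm/hipblasDestroy with the same argument order; enums are
// renamed (CUBLAS_OP_N -> HIPBLAS_OP_N). A, B, C are device pointers.
void dgemm_hip(int n, const double *A, const double *B, double *C) {
    hipblasHandle_t handle;
    hipblasCreate(&handle);                    // was cublasCreate
    const double alpha = 1.0, beta = 0.0;
    // C = alpha * A * B + beta * C, all n x n, column-major
    hipblasDgemm(handle, HIPBLAS_OP_N, HIPBLAS_OP_N,
                 n, n, n, &alpha, A, n, B, n, &beta, C, n);
    hipblasDestroy(handle);                    // was cublasDestroy
}
```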
