Talk Abstracts
Tuesday, May 4th
System Analytics and Monitoring
Integrating System State and Application Performance Monitoring: Network Contention Impact
Contributors:
Jim Brandt, Tom Tucker, Simon Hammond, Ben Schwaller, Ann Gentile, Kevin Stroup, Jeanine Cook
Description:
Discovering and attributing application performance variation in production HPC systems requires continuous, concurrent information on the state of the system and on applications’ progress. Even with such information, how time-varying system conditions translate into a quantifiable impact on application performance remains poorly understood. We have developed a unified framework to obtain and integrate, at run time, both system and application information, enabling insight into application performance in the context of system conditions. The Lightweight Distributed Metric Service (LDMS) is used on several significant large-scale Cray platforms to collect system data and is planned for inclusion on several upcoming HPE systems. We have developed a new capability to inject application progress information into the LDMS data stream. Handling system and application data consistently eases the development of storage, performance analytics, and dashboards. We illustrate the utility of our framework by providing runtime insight into application performance in conjunction with network congestion assessments on a Cray XC40 system with a beta Programming Environment being used to prepare for the upcoming ACES Crossroads system, and we describe possibilities for application to the Slingshot network. The complete system is generic and can be applied to any *nix system; the system data can be obtained by both generic and system-specific data collection plugins (e.g., Aries vs. Slingshot counters); and no application changes are required when the injection is performed by a portability abstraction layer, such as that employed by Kokkos.
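As a rough illustration of the injection idea, the hedged Python sketch below builds timestamped progress records that a portability layer could emit alongside system telemetry. The record fields and the publish mechanism are our assumptions for illustration, not the actual LDMS streams interface.

```python
import json
import time

def make_progress_event(job_id: str, rank: int, step: int, phase: str) -> str:
    """Build a timestamped progress record, mirroring the idea of
    injecting application progress markers into a system-telemetry
    stream so they can later be correlated with network counters.
    (Hypothetical schema, not the real LDMS format.)"""
    return json.dumps({
        "timestamp": time.time(),   # same clock domain as system samplers
        "job_id": job_id,
        "rank": rank,
        "step": step,
        "phase": phase,             # e.g. "compute", "halo_exchange"
    })

# A portability layer (e.g. a Kokkos profiling hook) could call this at
# region boundaries, so the application itself needs no changes.
if __name__ == "__main__":
    for step in range(3):
        print(make_progress_event("job-42", rank=0, step=step, phase="compute"))
        time.sleep(0.1)
```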
trellis — An Analytics Framework for Understanding Slingshot Performance
Contributors:
Madhu Srinivasan, Dipanwita Mallick, Kristyn Maschhoff
Description:
The next generation HPE Cray EX and HPE Apollo supercomputers with the Slingshot interconnect are breaking new ground in the collection and analysis of system performance data. The monitoring frameworks on these systems provide visibility into Slingshot's operational characteristics through advanced instrumentation and transparency into real-time network performance. There still exists, however, a wide gap between the volume of telemetry generated by Slingshot and a user's ability to assimilate and explore this data to derive critical, timely, and actionable insights about fabric health, application performance, and potential congestion scenarios. In this work, we present trellis, an analytical framework built on top of the Slingshot monitoring APIs. The goal of trellis is to give system administrators and researchers insight into network performance and its impact on complex workflows that include both AI and traditional simulation workloads. We also present a visualization interface, built on trellis, that allows users to interactively explore various levels of the network topology over specified time windows and gain key insights into job performance and communication patterns. We demonstrate these capabilities on an internal Shasta development system and visualize Slingshot's innovative congestion control and adaptive routing in action.
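To make the time-windowed, topology-level roll-up concrete, here is a hedged sketch of aggregating congestion indicators per switch over a selected window. The per-port telemetry schema, switch names, and counter values are made up for the example; trellis's real data model may differ.

```python
import pandas as pd

# Hypothetical per-port telemetry: one row per (time, switch) sample.
samples = pd.DataFrame({
    "time":   pd.to_datetime(["2021-05-04 10:00:00", "2021-05-04 10:00:30",
                              "2021-05-04 10:00:00", "2021-05-04 10:00:30"]),
    "switch": ["x1000c0r1", "x1000c0r1", "x1000c0r3", "x1000c0r3"],
    "stalled_ns": [120, 950, 40, 60],   # time the port spent back-pressured
    "tx_bytes":   [1.2e9, 1.1e9, 0.9e9, 1.0e9],
})

# Aggregate congestion per switch over a user-selected time window --
# the kind of roll-up a topology-level view would drill into.
window = samples[(samples["time"] >= "2021-05-04 10:00:00") &
                 (samples["time"] <  "2021-05-04 10:01:00")]
summary = window.groupby("switch").agg(
    total_stall_ns=("stalled_ns", "sum"),
    mean_tx_bytes=("tx_bytes", "mean"),
)
print(summary.sort_values("total_stall_ns", ascending=False))
```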
AIOps: Leveraging AI/ML for Anomaly Detection in System Management
Contributors:
Sergey Serebryakov, Jeff Hanson, Tahir Cader, Deepak Nanjundaiah, Joshi Subrahmanya
Description:
HPC datacenters rely on set-points and dashboards for system management, which leads to thousands of false alarms. Exascale systems will deploy thousands of servers and sensors, produce millions of data points per second, and be more prone to management errors and equipment failures. HPE and the National Renewable Energy Lab (NREL) are using AI/ML to improve data center resiliency and energy efficiency. HPE has developed and deployed, in NREL’s production environment (since June 2020), an end-to-end anomaly detection pipeline that operates in real time, automatically, and at massive scale. In the paper, we will provide detailed results from several end-to-end anomaly detection workflows, either already deployed at NREL or to be deployed soon. We will describe the upcoming AIOps release as a technology preview with HPCM 1.5, plans for future deployment with Cray System Manager, and potential use as an Edge processor (inferencing engine) for HPE’s InfoSight analytics platform.
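One stage of such a pipeline can be pictured as a rolling statistical test over each sensor stream. The sketch below uses a simple z-score detector with illustrative window and threshold values; it is a minimal stand-in, not the production models deployed at NREL.

```python
from collections import deque
from statistics import mean, stdev

def zscore_anomalies(stream, window=60, threshold=4.0):
    """Flag points far outside the recent distribution -- a stand-in
    for one stage of an end-to-end anomaly-detection pipeline.
    Window and threshold values here are illustrative only."""
    history = deque(maxlen=window)
    for t, value in stream:
        if len(history) >= 10:            # need a minimal baseline first
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield t, value            # candidate anomaly, not yet an alarm
        history.append(value)

# Example: a temperature-like sensor with one injected spike.
readings = [(t, 21.0 + 0.1 * (t % 5)) for t in range(120)]
readings[90] = (90, 35.0)
for t, v in zscore_anomalies(readings):
    print(f"t={t}: suspicious reading {v}")
```

A rule-based set-point would either miss the spike or fire constantly on normal variation; testing against the recent distribution is what cuts the false-alarm rate.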
Real-time Slingshot Monitoring in HPCM
Contributors:
Priya K, Prasanth Kurian, Jyothsna Deshpande
Description:
HPE Performance Cluster Manager (HPCM) software is used to provision, monitor, and manage HPC cluster hardware and software components. HPCM has a centralized monitoring infrastructure for persistent storage of telemetry and for alerting on these metrics based on thresholds. Slingshot fabric management and monitoring is a new feature in the HPCM monitoring infrastructure. The Slingshot Telemetry (SST) monitoring framework in HPCM collects and stores Slingshot fabric health and performance telemetry. Real-time telemetry gathered by SST is used for fabric health monitoring, real-time analytics, visualization, and alerting. The solution scales both vertically and horizontally to handle huge volumes of telemetry data. The flexible and extensible model of the SST collection agent makes it easy to collect metrics at different granularities and intervals. Visualization dashboards are designed to suit different use cases, giving a complete view of fabric health.
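The idea of one agent sampling different metrics at different cadences can be sketched as a small scheduling loop. The collector names, intervals, and payloads below are hypothetical, not the actual SST agent configuration.

```python
import time

# Hypothetical collector registry: each metric class has its own
# cadence, illustrating sampling at different granularities/intervals.
COLLECTORS = {
    "port_counters": {"interval": 1.0,  "fn": lambda: {"tx": 1_000_000}},
    "fabric_health": {"interval": 10.0, "fn": lambda: {"links_up": 64}},
}

def run(duration=5.0):
    next_due = {name: 0.0 for name in COLLECTORS}
    start = time.monotonic()
    while (now := time.monotonic()) - start < duration:
        for name, spec in COLLECTORS.items():
            if now >= next_due[name]:
                sample = spec["fn"]()     # a real agent would query hardware
                print(f"{now - start:5.1f}s {name}: {sample}")
                next_due[name] = now + spec["interval"]
        time.sleep(0.1)

if __name__ == "__main__":
    run()
```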
Analytic Models to Improve Quality of Service of HPC Jobs
Contributors:
Saba Naureen, Prasanth Kurian, Amarnath Chilumukuru
Description:
A typical High Performance Computing (HPC) cluster comprises components such as CPUs, memory, GPUs, Ethernet, fabric, storage, racks, cooling devices, and switches. A cluster usually consists of thousands of compute nodes interconnected by an Ethernet network for management tasks and a fabric network for data traffic. Job schedulers need to be aware of the health and availability of cluster components in order to deliver high-performance results. Since the failure of any component will adversely impact the overall performance of a job, identifying issues or outages is critical to ensuring the desired Quality of Service (QoS). We showcase an analytics-based model, implemented as part of HPE Performance Cluster Manager (HPCM), that gathers and analyzes telemetry data from the various cluster components: racks, enclosures, cluster nodes, storage devices, fabric switches, Cooling Distribution Units (CDUs), Adaptive Rack Cooling (ARC), Chassis Management Controllers (CMCs), fabric, power supplies, and system logs. This real-time status information is used by job schedulers to make smart decisions and schedule jobs only on healthy nodes, preventing job failures and wasted computational resources. Our solution enables HPC job schedulers to be health-aware, improving cluster reliability and the overall customer experience.
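At its simplest, the scheduling decision reduces to filtering candidate nodes by component health. A minimal sketch, with an assumed health-snapshot layout rather than HPCM's real schema:

```python
# Hypothetical node-health snapshot as a scheduler might consume it:
# a node is unhealthy if any monitored component reports a fault.
node_health = {
    "n0001": {"psu": "ok",    "fabric": "ok",   "cooling": "ok"},
    "n0002": {"psu": "ok",    "fabric": "down", "cooling": "ok"},
    "n0003": {"psu": "fault", "fabric": "ok",   "cooling": "ok"},
}

def schedulable(nodes: dict) -> list:
    """Return only nodes whose every component is healthy."""
    return [n for n, parts in nodes.items()
            if all(state == "ok" for state in parts.values())]

print(schedulable(node_health))   # ['n0001'] -- jobs avoid faulty nodes
```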
Systems Support
Blue Waters System and Component Reliability
Contributors:
Brett Bode, David King, Celso Mendes, William Kramer, Saurabh Jha, Roger Ford, Justin Davis, Mark Dalton, Steven Dramstad
Description:
The Blue Waters system, installed in 2012 at NCSA, has the largest component count of any system Cray has built. Blue Waters includes a mix of dual-socket CPU (XE) and single-socket CPU, single-GPU (XK) nodes. The primary storage is provided by Cray’s Sonexion/ClusterStor Lustre storage system, delivering 35 PB (raw) of storage at 1 TB/s. The statistical failure rates over time for each component, including CPUs, DIMMs, GPUs, disk drives, power supplies, blowers, etc., and their impact on higher-level failure rates for individual nodes and the system as a whole are presented in detail, with particular emphasis on identifying any increases in rate that might indicate the right side of the expected bathtub curve has been reached. Strategies employed by NCSA and Cray for minimizing the impact of component failure, such as the preemptive removal of suspect disk drives, are also presented.
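A wear-out check of this kind amounts to watching normalized failure rates for a sustained upward trend. The sketch below uses made-up counts and an assumed population size, not Blue Waters data.

```python
# Illustrative only: monthly failure counts for one component type,
# normalized by population, to look for the rising right side of the
# bathtub curve. All numbers are invented for the example.
population = 27_000                     # assumed installed-component count
monthly_failures = [30, 28, 25, 26, 27, 25, 26, 29, 33, 38, 44, 52]

rates = [f / population for f in monthly_failures]

def trending_up(series, window=3):
    """Crude wear-out signal: successive rolling-window means increase."""
    means = [sum(series[i:i + window]) / window
             for i in range(len(series) - window + 1)]
    return all(a < b for a, b in zip(means[-4:], means[-3:]))

print([f"{r:.2e}" for r in rates[-3:]])
print("possible wear-out onset:", trending_up(rates))
```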
Contributors:
James Botts, Zachary Crisler, Aditi Gaur, Douglas Jacobsen, Harold Longley, Alex Lovell-Troy, Dave Poulsen, Eric Roman, Chris Samuel
Description:
The Perlmutter supercomputer and related test systems provide an early look at Shasta system management and our ideas on best practices for managing Shasta systems. The cloud-native software and Ethernet-based networking on the system enable tremendous flexibility in management policies and methods. Based on work performed using the Shasta 1.3 and preview 1.4 releases, NERSC has developed, in close collaboration with HPE through the Perlmutter System Software COE, methodologies for efficiently managing multiple Shasta systems. We describe how we template and synchronize configurations and software between systems and orchestrate manipulations of the configuration of the managed system. Key to this is a secured external management system that provides both a configuration origin for the system and an interactive management space. Leveraging this external management system, we simultaneously create a systems-development environment and secure key aspects of the Shasta system, enabling NERSC to rapidly deploy the Perlmutter system.
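Templating configuration from a single external origin across several systems can be illustrated minimally as below. The template fields, values, and hostnames are invented for the example and are not NERSC's actual configuration.

```python
from string import Template

# Hedged sketch: render one configuration template per managed system,
# keeping multiple Shasta systems synchronized from a single origin.
NODE_CONFIG = Template(
    "system: $name\n"
    "ntp_servers: $ntp\n"
    "syslog_target: $syslog\n"
)

SYSTEMS = {  # hypothetical site inventory
    "perlmutter": {"ntp": "ntp1.site.example", "syslog": "logs.site.example"},
    "testsystem": {"ntp": "ntp1.site.example", "syslog": "logs-dev.site.example"},
}

for name, values in SYSTEMS.items():
    print(NODE_CONFIG.substitute(name=name, **values))
```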
Slurm on Shasta at NERSC: adapting to a new way of life
Contributors:
Christopher Samuel, Douglas M Jacobsen, Aditi Gaur
Description:
Shasta, with its heady mix of Kubernetes, containers, software-defined networking, and 1970s batch computing, provides a vast array of new concepts, strategies, and acronyms for traditional HPC administrators to adapt to. NERSC has been working through this maze to take advantage of the new capabilities Shasta brings, providing a stable and familiar interface for traditional HPC workloads on Perlmutter while also using Shasta and new abilities in Slurm to provide more modern interfaces and capabilities for production use. This paper discusses the decisions made in deploying Slurm on Perlmutter at NERSC, how we are faring, what is still in development, and how this is all tied up with Shasta’s own development over the months.
Declarative automation of compute node lifecycle through Shasta API integration
Contributors:
J. Lowell Wofford
Description:
Using the Cray Shasta system available at Los Alamos National Laboratory, we have experimented with integrating various components of the HPE Cray Shasta software stack through the provided APIs. We have integrated Kraken, a LANL open-source software project that provides distributed state-based automation, to bring new automation and management features to the Shasta system. We have focused on managing the Shasta compute node lifecycle with Kraken, automating node operations such as image, kernel, and configuration management. We examine the strengths and challenges of integrating with the Shasta APIs and discuss possibilities for further API integrations.
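At its core, state-based automation is a reconcile loop from declared to observed state. Kraken itself is written in Go, so the Python sketch below (with field names of our choosing) only illustrates the pattern, not Kraken's actual data model.

```python
# Minimal desired-state reconciliation in the spirit of distributed
# state-based automation. Node names and fields are hypothetical.
desired = {"n0001": {"image": "compute-2.1", "state": "booted"}}
actual  = {"n0001": {"image": "compute-2.0", "state": "booted"}}

def reconcile(desired, actual):
    """Yield the mutations needed to drive each node toward its
    declared state; a real engine would map these to API calls."""
    for node, want in desired.items():
        have = actual.get(node, {})
        for key, value in want.items():
            if have.get(key) != value:
                yield node, key, have.get(key), value

for node, key, old, new in reconcile(desired, actual):
    print(f"{node}: {key}: {old!r} -> {new!r}")   # e.g. trigger a reimage
```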
Cray EX Shasta v1.4 System Management Overview
Contributors:
Harold Longley
Description:
How do you manage a Cray EX (Shasta) system? This overview describes the Cray System Management software in the Shasta v1.4 release, which introduces new features such as booting management nodes from images, product streams, and configuration layers. The foundation of containerized microservices orchestrated by Kubernetes on the management nodes provides a highly available and resilient set of services to manage the compute and application nodes. Lower-level hardware control is based on the DMTF Redfish standard, enabling higher-level hardware management services to control and monitor components and manage firmware updates. The network management services enable control of the high-speed network fabric. The booting process relies upon preparation of images and configuration, as well as run-time interaction between nodes and services while nodes boot and configure. All microservices have published RESTful APIs for those who want to integrate management functions into their existing DevOps environment. The v1.4 software includes the cray CLI and the SAT (System Administration Toolkit) CLI, clients that use these services. Identity and access management protect critical resources, such as the API gateway. Non-administrative users access the system either through a multi-user Linux node (User Access Node) or a single-user container (User Access Instance) managed by Kubernetes. Logging and telemetry data can be sent from the system to other site infrastructure. The tools for collection, monitoring, and analysis of telemetry and log data have been improved with new alerts and notifications.
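Integrating with the published RESTful APIs follows the usual token-plus-gateway pattern. The sketch below uses a placeholder gateway host, endpoint path, and response shape rather than documented routes; consult the published API documentation for the real ones.

```python
import requests

# Hedged sketch of driving management microservices through the API
# gateway with a bearer token. Host, path, and JSON fields below are
# placeholders, not actual CSM routes.
GATEWAY = "https://api-gw.example.com"
TOKEN = "..."  # obtained from the site's identity provider

resp = requests.get(
    f"{GATEWAY}/apis/example/v1/components",   # hypothetical endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for component in resp.json().get("Components", []):
    print(component.get("ID"), component.get("State"))
```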
Managing User Access with UAN and UAI
Contributors:
Harold Longley, Alex Lovell-Troy, Gregory Baker
Description:
User Access Nodes (UANs) and User Access Instances (UAIs) are the primary entry points for users of a Cray EX system to develop, build, and execute their applications on the Cray EX compute nodes. The UAN is a traditional multi-user Linux node. The UAI is a dynamically provisioned, single-user container that can be customized to the user’s needs. This presentation will describe the state of the Shasta v1.4 software for user access with UAN and UAI: provisioning software products for users, providing access to shared filesystems, granting and revoking authentication and authorization, logging access, and monitoring resource utilization.
User and Administrative Access Options for CSM-Based Shasta Systems
Contributors:
Alex Lovell-Troy, Sean Lynn, Harold Longley
Description:
Cray System Management (CSM) from HPE is a cloud-like control system for High Performance Computing. CSM is designed to integrate the supercomputer with multiple datacenter networks and provide secure administrative access via authenticated REST APIs. Access to the compute nodes and to the REST APIs may need to follow different network paths, which has routing implications. This paper outlines the flexible network configurations and guides administrators planning their Shasta/CSM systems. Site administrators have configuration options for allowing users and administrators to access the REST APIs from outside the system, and for allowing applications running on the compute nodes to access these same APIs. The paper is structured around three themes. The first defines a layer 2/layer 3 perimeter around the system and addresses upstream connections to the site network. The second deals primarily with layer 3 subnet routing from the network perimeter inward. The third deals with administrative access control at various levels of the network, as well as user-based access controls to the APIs themselves. Finally, the paper combines the themes to describe specific use cases and how to support them with the available administrative controls.
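The perimeter question can be pictured as a membership test against the management subnets. The subnets below are assumptions for illustration, not a recommended CSM layout.

```python
import ipaddress

# Toy illustration of the perimeter/routing themes: decide whether a
# client address falls inside the system's management perimeter or
# must be routed in from the site network. Subnets are made up.
PERIMETER = [ipaddress.ip_network("10.252.0.0/17"),   # node management (assumed)
             ipaddress.ip_network("10.92.100.0/24")]  # user access (assumed)

def inside_perimeter(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in PERIMETER)

for client in ("10.252.1.17", "198.51.100.8"):
    where = "inside" if inside_perimeter(client) else "outside (site network)"
    print(f"{client}: {where} the management perimeter")
```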
HPE Ezmeral Container Platform: Current And Future
Contributors:
Thomas Phelan
Description:
The HPE Ezmeral Container Platform is the industry's first enterprise-grade container platform for both cloud-native and non-cloud-native distributed applications using the open-source Kubernetes container orchestrator. Ezmeral enables true hybrid cloud operations across any location: on-premises, public cloud, and edge. Today, the HPE Ezmeral Container Platform is largely used for enterprise AI/ML/DL applications. However, the industry is starting to see a convergence of AI/ML/DL and High Performance Computing (HPC) workloads. This session will present an overview of the HPE Ezmeral Container Platform: its architecture, features, and use cases. It will also provide a look into the future product roadmap, where the platform will support HPC workloads as well.