24-29 March 2024
BHSS, Academia Sinica
Asia/Taipei timezone

Characteristic Analysis and Running Time Prediction of Slurm Jobs on CSNS Scientific Computing Platform (Remote Presentation)

26 Mar 2024, 11:30
30m
Conf. Room 2 (BHSS, Academia Sinica)


Oral Presentation Track 9: Converging High Performance Computing Infrastructures: Supercomputers, clouds, accelerators

Speaker

Jianshu Hong (Institute of High Energy Physics, Chinese Academy of Sciences)

Description

The China Spallation Neutron Source (CSNS), the fourth pulsed spallation neutron source in the world, was built during China's 11th Five-Year Plan period (2006-2010) and began operation after passing the national acceptance test in August 2018. By November 2023, 11 spectrometers had been built and put into operation, generating 3 TB of raw data every day. In the second phase of CSNS (CSNS II), the accelerator power will be upgraded from 100 kW to 500 kW and 20 spectrometers will eventually be built before 2027, at which point 12 TB of raw data will be generated every day.
To meet the urgent needs for computing, simulation and data processing during the construction and experimental operation of CSNS, a high-performance computing cluster was planned and built in stages in 2018 and 2021. It provides a total of 40,000 CPU cores, 80 NVIDIA V100 GPU cards and 4 PB of storage capacity. The CSNS partition of the cluster, running since 2018, mainly serves the computation and simulation of the CSNS construction project, including radiation shielding, accelerator design, beam analysis and spectrometer design. The Dongguan Big Science partition, running since 2021, mainly provides in-house and external scientists with services for theoretical computing, experimental analysis and machine learning.
A web-based platform integrating HPC and AI services, namely the CSNS Scientific Computing Platform of the Institute of High Energy Physics of CAS and the GBA Sub-center of the National HEP Science Data Center (hereinafter referred to as the CSNS SC Platform), was designed and developed in 2022 and officially launched in January 2023 to provide one-stop services for computing, simulation, data analysis and AI training. The CSNS SC Platform integrates and unifies hardware resources, software deployment, user management, job scheduling and APIs, providing a unified web-based development environment, job submission, data management, a visual environment, AI development frameworks, dataset tools and AI development processes.
With the development of computing technologies and the expansion of cluster scale, the components and systems of HPC clusters have become increasingly complex, which poses a great challenge to cluster management and the allocation of computing resources. Related research has shown that computation, communication and I/O operations are unevenly distributed among HPC nodes, and that the actual application performance of HPC clusters is often less than 10% of peak performance, which results in a huge waste of computing resources. Optimized scheduling strategies are expected to increase the resource utilization of HPC clusters and improve the efficiency of scientific computation and data analysis. This paper presents our recent work on the collection and organization of Slurm job data on the CSNS SC Platform, the analysis of job characteristics, and the optimization of job run time prediction with machine learning.

Primary authors

Fengyao HOU (Institute of High Energy Physics, CAS)
Jianshu Hong (Institute of High Energy Physics, Chinese Academy of Sciences)
Yonghao Cao (School of Computer Science and Technology, Dongguan University of Technology)

Co-authors

Li Wang (IHEP)
Qianghua Yuan (School of Computer Science and Technology, Dongguan University of Technology)
