24-29 March 2024
BHSS, Academia Sinica
Asia/Taipei timezone

The Application of Cluster-Based Distributed Computing in High Energy Physics Experiments (Remote Presentation)

Mar 29, 2024, 11:10 AM
20m
Conf. Room 1 (BHSS, Academia Sinica)

Oral Presentation | Track 1: Physics and Engineering Applications | Physics & Engineering Application

Speaker

Mr Chaoqi Guo (Institute of High Energy Physics, Chinese Academy of Sciences)

Description

The computing cluster of the Institute of High Energy Physics has long provided computing services for high energy physics experiments and serves a large number of experimental users. As the experiments grow in scale and the number of users increases, job queuing on the existing cluster is becoming increasingly severe.

To alleviate the shortage of local cluster resources and the long job queuing times, the Dongguan Big Science Data Center has provided 18,000 CPU cores to expand the Institute of High Energy Physics cluster. Because the institute's experimental users have long been accustomed to submitting computational jobs through the cluster, it would be difficult to promote the use of these remote resources across the experiments through grid computing. We have therefore designed and implemented cluster-based distributed computing.

Cluster-based distributed computing uses an implementation of Glidein Factory to monitor the cluster queue's demand for distributed resources, and on that basis carries out dynamic expansion of distributed resources and distributed job scheduling. In addition, an XRootD proxy and the CVMFS file system provide data access between user jobs at distributed sites and the local cluster.
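The demand-driven expansion described above can be illustrated with a minimal sketch. It is not the implementation presented in this talk: the `QueueSnapshot` type, the `pilots_to_request` function, and the parameters `cores_per_pilot` and `max_pilots` are hypothetical names, and the sketch assumes that each idle job needs one CPU core and that each pilot (glidein) provides a fixed number of cores.

```python
import math
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical view of the cluster queue at one polling interval."""
    idle_jobs: int       # queued jobs that are allowed to run at the remote site
    running_pilots: int  # pilots already running at the remote site
    pending_pilots: int  # pilots requested but not yet started


def pilots_to_request(snapshot: QueueSnapshot,
                      cores_per_pilot: int = 1,
                      max_pilots: int = 100) -> int:
    """Return how many new pilots to submit for the current queue demand.

    Assumes one core per idle job. Pilots that are already pending are
    counted against the demand so that the same jobs do not trigger
    repeated requests, and the total is capped at max_pilots.
    """
    if cores_per_pilot < 1:
        raise ValueError("cores_per_pilot must be at least 1")
    # Pilots needed to cover every idle job.
    needed = math.ceil(snapshot.idle_jobs / cores_per_pilot)
    # Demand not yet covered by pilots that are on their way.
    shortfall = needed - snapshot.pending_pilots
    # Room left under the site-wide pilot limit.
    headroom = max_pilots - snapshot.running_pilots - snapshot.pending_pilots
    return max(0, min(shortfall, headroom))
```

For example, with 50 idle jobs, 8 cores per pilot, 5 pending pilots, and 10 running pilots under a limit of 20, the sketch requests 2 additional pilots.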

We have implemented cross-platform identity authentication based on Kerberos tokens to ensure that user jobs at distributed sites have access rights to local cluster services at the Institute of High Energy Physics. A detailed token update and maintenance mechanism has been designed to ensure that the token remains valid while user jobs are queued and running.
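One common way to keep a credential valid over a long-lived job is to renew it before its remaining lifetime falls below a safety margin. The sketch below shows only that scheduling decision and is an assumption for illustration, not the mechanism designed by the authors. The function names, the one-hour margin, and the one-minute minimum delay are hypothetical, and the step that actually refreshes the token and delivers it to the job is not shown.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(expires_at: datetime,
                  now: datetime,
                  margin: timedelta = timedelta(hours=1)) -> bool:
    """Return True once the remaining token lifetime is within the margin."""
    return expires_at - now <= margin


def next_check_delay(expires_at: datetime,
                     now: datetime,
                     margin: timedelta = timedelta(hours=1),
                     min_delay: timedelta = timedelta(minutes=1)) -> timedelta:
    """Return how long to wait before checking the token again.

    Waits until the token enters the renewal margin, but never less than
    min_delay, so an expired or nearly expired token is rechecked promptly
    without busy-looping.
    """
    return max(expires_at - now - margin, min_delay)


# Usage with a token that expires ten hours after the check time.
now = datetime(2024, 3, 29, 11, 10, tzinfo=timezone.utc)
expires_at = now + timedelta(hours=10)
renew = needs_renewal(expires_at, now)     # False: ten hours remain
delay = next_check_delay(expires_at, now)  # nine hours until the margin
```

Running the same check both while the job is queued and while it runs covers the two phases named above.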

Finally, because users have long been accustomed to submitting jobs through the cluster, we have developed job submission tools that make distributed scheduling transparent to users and add no extra learning cost.

Currently, cluster-based distributed computing is in preliminary use in the BES, LHAASO, and HERD experiments and has contributed over 30,000,000 CPU core hours in total.

Primary authors

Mr Chaoqi Guo (Institute of High Energy Physics, Chinese Academy of Sciences)
Jingyan Shi (IHEP)
Xiaowei Jiang (Institute of High Energy Physics, Chinese Academy of Sciences)
Ran Du (Institute of High Energy Physics, Chinese Academy of Sciences)
Yujiang BI (IHEP)
Mr Jianshu Hong (Institute of High Energy Physics, Chinese Academy of Sciences)
Fazhi QI (Institute of High Energy Physics, CAS)
