Description
In high energy physics, as several large-scale experiments advance, the rate of data production keeps increasing and is expected to reach the exabyte (EB) level within the next decade. This places greater demands on data storage and computing. To reduce dependence on a single chip architecture and provide a more cost-effective storage and computing solution, we built a supercomputer cluster based on the ARM architecture, with 9600 computing cores, in Dongguan, China. Our work on this cluster covers storage, HPC, HTC, and cluster management.
For storage, we ported EOS, a large distributed storage system, to the ARM cluster and incorporated the 4.7.7-aarch64 version into the official EOS code base. The biggest challenge in the migration was that many of the software's dependencies have no officially released aarch64 version. In addition, because of differences in the underlying architecture, some of the original code had to be adjusted. Despite these challenges, we successfully ported EOS to the aarch64 architecture and carried out performance tests.
For HPC, we carried out lattice quantum chromodynamics (LQCD) work on the ARM cluster, including porting the algorithm libraries and running numerical simulations. QCD is the fundamental theory describing the strong interaction and hadron structure. Because it involves strong coupling, solving it exactly is a very challenging theoretical problem. LQCD is a numerical lattice gauge theory method that studies the properties of quarks and gluons, and of elementary particles more generally, from the first principles of QCD. Because of its large data volume and computational scale, it has become one of the main scientific applications of supercomputers worldwide.
For HTC, we cooperated with the LHAASO experiment and ported several high-energy physics data analysis software packages, including GEANT4, Corsika, km2a, and g4wcda. We tested them on both x86 and ARM architectures and obtained similar results.
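A cross-architecture check of this kind can be sketched as a bin-by-bin comparison of histograms produced by the same job on the two platforms. This is a minimal illustration, not the procedure actually used: the function names, the input format (plain lists of bin contents), and the tolerance `rel_tol` are all assumptions, since the criterion for "similar" is not stated here.

```python
def relative_difference(a, b):
    """Symmetric relative difference; 0.0 when both values are zero."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0 else abs(a - b) / scale


def compare_histograms(x86_bins, arm_bins, rel_tol=1e-6):
    """Return the indices of bins whose contents differ by more than rel_tol.

    x86_bins and arm_bins are hypothetical inputs: bin contents of the same
    histogram filled by the same job on each architecture.
    """
    if len(x86_bins) != len(arm_bins):
        raise ValueError("histograms must have the same binning")
    return [
        i
        for i, (a, b) in enumerate(zip(x86_bins, arm_bins))
        if relative_difference(a, b) > rel_tol
    ]


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    print(compare_histograms([1.0, 2.0, 3.0], [1.0, 2.0000001, 3.5]))
```

A relative tolerance is used rather than exact equality because floating-point results can differ slightly between architectures and compilers even for identical inputs; the appropriate threshold depends on the application.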
To unify management and make better use of both this ARM cluster and our existing x86 cluster, we standardized the system directory structure and formulated a set of scheduling strategies for heterogeneous computing clusters at remote sites. We also deployed a monitoring system to observe the operation of the ARM cluster and detect anomalies promptly.
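One simple form such a scheduling strategy could take is architecture-aware placement: a job lists the architectures it has been built for, and the scheduler picks a compatible cluster with enough free cores. The sketch below only illustrates that idea; the `Cluster` and `Job` types, the site names, and the "most free cores" rule are hypothetical and do not describe the strategies actually deployed.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Cluster:
    name: str
    arch: str          # e.g. "aarch64" or "x86_64"
    free_cores: int


@dataclass
class Job:
    name: str
    cores: int
    archs: Tuple[str, ...]   # architectures for which the job has a build


def pick_cluster(job: Job, clusters: Sequence[Cluster]) -> Optional[Cluster]:
    """Return the compatible cluster with the most free cores, or None."""
    candidates = [
        c for c in clusters
        if c.arch in job.archs and c.free_cores >= job.cores
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.free_cores)


if __name__ == "__main__":
    # Hypothetical sites; only the 9600-core ARM figure comes from the text.
    sites = [Cluster("arm-site", "aarch64", 9600),
             Cluster("x86-site", "x86_64", 2000)]
    job = Job("sim", cores=100, archs=("x86_64", "aarch64"))
    chosen = pick_cluster(job, sites)
    print(chosen.name if chosen else "no compatible cluster")
```

Keeping the supported architectures on the job rather than on the queue lets software that has been ported to both platforms run wherever capacity is available, while unported software stays on x86.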