5-10 March 2017
BHSS, Academia Sinica
Asia/Taipei timezone

The High Throughput Strategy of IHEP

Mar 10, 2017, 9:30 AM
30m
Conf. Room 2 (BHSS, Academia Sinica)

No. 128, Sec. 2, Academia Rd., Taipei, Taiwan
Physics (including HEP) and Engineering Applications
Physics & Engineering I

Speaker

Dr Jiaheng Zou (IHEP, Chinese Academy of Sciences)

Description

The IHEP computing center serves many high energy physics experiments. It has more than 13,000 CPU cores and hundreds of active users, and it runs tens of thousands of jobs per day. Traditionally, users are divided into groups according to the experiment they belong to, and each computing node is owned exclusively by one group. Because the peak demands of different groups generally do not coincide, a great deal of capacity is wasted when resources are not shared between groups. Sharing therefore makes good sense as a way to improve both resource utilization and job throughput. We have deployed a high throughput system that takes into account both resource sharing and fairness between groups. The system is based on HTCondor, but it requires a customized strategy to manage user groups in a single cluster pool. For fairness, each group keeps a number of its own nodes, which ensures that some resources are always available to that group. Meanwhile, a fraction of the resources has to be shared among all groups, so that busy groups can benefit from idle groups. This is important for increasing overall job throughput. We provide real-time statistics and use them to ask idle groups to share more resources. The sharing ratio of each group can be tuned automatically with the owners' approval. We are developing an accounting system that will provide detailed statistics on idle groups' contributions and busy groups' usage. An error recovery mechanism is provided and integrated with the cluster monitoring system: nodes with fatal problems are removed from the pool automatically. We have also developed a toolkit for users, which adds to their jobs a series of attributes required by our approach. The entire strategy needs an enhanced central control system for HTCondor. We have implemented the essential components for central control and deployed the customized strategy. The results show that the strategy is highly effective for our high throughput computing management.
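
The reserve-and-share policy described above can be illustrated with a minimal Python sketch. It is not the IHEP implementation and it does not use any HTCondor API. The `Group` class, the `share_percent` field, the `allocate` function, and the round-robin lending rule are all assumptions made for illustration, as is the rule that an owner is served from its own nodes before anyone borrows them.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical model of one experiment group in the pool."""
    name: str
    nodes: int          # nodes owned by the group
    share_percent: int  # percentage of owned nodes offered for sharing
    demand: int         # nodes the group's queued jobs currently need

    @property
    def shared(self) -> int:
        # Integer arithmetic avoids floating-point rounding surprises.
        return self.nodes * self.share_percent // 100


def allocate(groups):
    """Return {group name: nodes granted} under a reserve-plus-share policy."""
    # Step 1 (assumption): each group is served from its own nodes first,
    # so an owner never loses capacity to a borrower.
    granted = {g.name: min(g.demand, g.nodes) for g in groups}

    # Step 2: shared nodes that their owner leaves idle form a common pool.
    # Idle nodes beyond the shared quota stay reserved for the owner.
    pool = sum(min(g.shared, g.nodes - granted[g.name]) for g in groups)

    # Step 3 (assumption): groups with unmet demand borrow from the pool
    # one node at a time, in round-robin order.
    unmet = {g.name: g.demand - granted[g.name] for g in groups}
    while pool > 0 and any(unmet.values()):
        for name in unmet:
            if pool > 0 and unmet[name] > 0:
                granted[name] += 1
                unmet[name] -= 1
                pool -= 1
    return granted


if __name__ == "__main__":
    groups = [
        Group("exp_a", nodes=100, share_percent=20, demand=150),  # busy
        Group("exp_b", nodes=80, share_percent=25, demand=30),    # idle
        Group("exp_c", nodes=50, share_percent=40, demand=10),    # idle
    ]
    print(allocate(groups))
```

In this example the busy group `exp_a` receives its own 100 nodes plus the 40 idle shared nodes lent by `exp_b` and `exp_c`, while the idle nodes outside the shared quota remain reserved for their owners. Raising `share_percent` for an idle group enlarges the pool available to busy groups, which is the knob the abstract describes tuning with the owners' approval.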

Primary author

Dr Jiaheng Zou (IHEP, Chinese Academy of Sciences)

Co-authors

Dr Jingyan Shi (IHEP)
Dr Ran DU (IHEP)
Mr Xiaowei JIANG (IHEP)
Mr Zhenyu SUN (IHEP)
