International Symposium on Grids & Clouds 2017 (ISGC 2017)

Name: International Symposium on Grids & Clouds 2017 (ISGC 2017)
Start: 2017-03-05T08:00:00+08:00
End: 2017-03-10T18:00:00+08:00
Location: BHSS, Academia Sinica

Asia/Taipei timezone

Support

stella.shen@twgrid.org

The High Throughput Strategy of IHEP

10 Mar 2017, 09:30

30m

Conf. Room 2 (BHSS, Academia Sinica)

No. 128, Sec. 2, Academia Rd., Taipei, Taiwan

Physics (including HEP) and Engineering Applications

Physics & Engineering I

Speaker: Dr Jiaheng Zou (IHEP, Chinese Academy of Sciences)

The IHEP computing center serves many high energy physics experiments. It has more than 13,000 CPU cores and hundreds of active users, who submit tens of thousands of jobs per day. Users are traditionally divided into groups according to the experiment they belong to, and each computing node is owned exclusively by one group. Because the peak demands of different groups generally do not coincide, a great deal of capacity is wasted when resources are not shared between groups. Sharing therefore makes good sense as a way to improve resource utilization and job throughput.

We have deployed a high throughput system, based on HTCondor, that takes into account both resource sharing and fairness between groups. This required a customized strategy for managing user groups in a single cluster pool. For fairness, each group keeps a number of its own nodes, which ensures that some resources are always available to it. Meanwhile, a fraction of the resources has to be shared among all groups, so that busy groups can benefit from idle ones. This is important for increasing overall job throughput. We provide real-time statistics and ask idle groups to share more resources. The sharing ratio of each group can be tuned automatically with the owners' approval. We are developing an accounting system that will provide detailed statistics on idle groups' contributions and busy groups' usage.

An error recovery mechanism is provided and integrated with the cluster monitoring system; nodes with fatal problems are removed from the pool automatically. We have also developed a toolkit for users, which adds to their jobs a series of attributes that our approach requires. The whole strategy needs an enhanced central control system for HTCondor. We have implemented the essential components for central control and deployed the customized strategy. The results show a marked improvement in our high throughput computing management.
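
The sharing rule described in the abstract can be illustrated with a small, self-contained sketch. This is not the IHEP implementation and does not use HTCondor; it only models the idea that each group reserves part of its nodes and lends idle shared nodes to busy groups. The names `Group`, `share_ratio`, `demand` and `allocate`, and the round-robin hand-out of lent nodes, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    nodes: int          # nodes owned by this group
    share_ratio: float  # hypothetical: fraction of owned nodes offered to others
    demand: int         # nodes the group's queued jobs could use right now


def allocate(groups):
    """Return {group name: nodes granted} for one scheduling round."""
    granted = {}
    pool = 0  # idle nodes that owners have agreed to lend
    for g in groups:
        shared = int(g.nodes * g.share_ratio)
        # A group may always use all of its own nodes, reserved or shared.
        used_own = min(g.demand, g.nodes)
        granted[g.name] = used_own
        # Only idle nodes within the shared fraction are lent out;
        # idle reserved nodes stay with their owner.
        pool += min(g.nodes - used_own, shared)

    # Assumed policy: hand out lent nodes one at a time, round-robin,
    # to groups whose demand is not yet met.
    waiting = [g for g in groups if g.demand > granted[g.name]]
    while pool > 0 and waiting:
        for g in list(waiting):
            if pool == 0:
                break
            granted[g.name] += 1
            pool -= 1
            if granted[g.name] >= g.demand:
                waiting.remove(g)
    return granted


groups = [
    Group("A", nodes=10, share_ratio=0.5, demand=16),
    Group("B", nodes=10, share_ratio=0.4, demand=2),
    Group("C", nodes=10, share_ratio=0.5, demand=12),
]
result = allocate(groups)
print(result)
```

In this example group B is mostly idle and lends its 4 shared nodes, so A and C each receive 2 extra nodes, while B keeps 4 reserved nodes free for its own future jobs.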

Primary author: Dr Jiaheng Zou (IHEP, Chinese Academy of Sciences)

Co-authors: Dr Jingyan Shi (IHEP), Dr Ran DU (IHEP), Mr Xiaowei JIANG (IHEP), Mr Zhenyu SUN (IHEP)

Slides

HighThroughput_ISGC2017.pdf
