Speaker
Dr
Jiaheng Zou
(IHEP, Chinese Academy of Sciences)
Description
The IHEP computing center serves many high energy physics experiments. We have more than 13,000 CPU cores and hundreds of active users, and run tens of thousands of jobs per day. Users are traditionally divided into groups according to the experiment they belong to, and each computing node is owned exclusively by one group. The peak demands of different groups generally do not coincide, so without resource sharing between groups a large part of the resources is wasted.
It therefore makes good sense to improve resource utilization and job throughput. We have deployed a high throughput system, based on HTCondor, that takes into account both resource sharing and fairness between groups. This requires a customized strategy for managing user groups in a single cluster pool. For fairness, each group keeps a number of its own nodes, which ensures that some resources are always available to every group. Meanwhile, a fraction of the resources has to be shared among all groups, so that busy groups can benefit from the idle resources of free groups. This is important for increasing overall job throughput. We provide real-time statistics and ask free groups to share more of their resources. The sharing ratio of each group can be tuned automatically with the owners' approval. We are developing an accounting system that will provide detailed statistics on free groups' contributions and busy groups' usage. An error recovery mechanism is provided and integrated with the cluster monitoring system: nodes with fatal problems are removed from the pool automatically. We have also developed a toolkit for users, which adds to their jobs a series of attributes that our approach requires.
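The reserve-and-share idea described above can be illustrated with a small Python sketch. This is not the deployed HTCondor configuration. It is a simplified model under stated assumptions: the names `Group`, `split_nodes` and `allocate`, the integer `share_percent` field, and the round-robin hand-out of shared nodes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    owned_nodes: int    # nodes owned by this group
    share_percent: int  # percentage of owned nodes offered to the shared pool (assumed knob)
    demand: int         # nodes the group's queued jobs could use right now


def split_nodes(group):
    """Return (reserved, shared) node counts for one group."""
    shared = group.owned_nodes * group.share_percent // 100
    return group.owned_nodes - shared, shared


def allocate(groups):
    """Serve each group from its reserved nodes first, then from the shared pool."""
    allocation = {}
    unmet = {}
    pool = 0
    for g in groups:
        reserved, shared = split_nodes(g)
        pool += shared
        used = min(g.demand, reserved)
        allocation[g.name] = used
        unmet[g.name] = g.demand - used

    # Hand out shared nodes one at a time, round-robin over groups
    # that still have unmet demand (an assumed fairness rule).
    while pool > 0 and any(unmet.values()):
        for name in unmet:
            if pool == 0:
                break
            if unmet[name] > 0:
                allocation[name] += 1
                unmet[name] -= 1
                pool -= 1
    return allocation


groups = [
    Group("busy", owned_nodes=100, share_percent=20, demand=150),
    Group("free", owned_nodes=50, share_percent=40, demand=10),
]
result = allocate(groups)
print(result)
```

In this toy case the busy group ends up with more nodes than it owns, because it draws on nodes the free group has shared, while the free group's reserved nodes remain untouched and available to it.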
The entire strategy requires an enhanced central control system for HTCondor. We have implemented the essential components for central control and deployed the customized strategy. The results show a clear benefit for our high throughput computing management.
Primary author
Dr
Jiaheng Zou
(IHEP, Chinese Academy of Sciences)