Speaker
Mr Qingbao Hu (IHEP)
Description
The IHEP Computing Center runs thousands of worker nodes, with about 12,000 cores in total, managed by the HTCondor scheduler; these nodes provide computing services to multiple experiment groups. During job scheduling, service exceptions on some worker nodes cause the jobs that land on them to fail. Under the traditional scheduling method, these abnormal nodes keep devouring jobs like "black holes", producing a large number of job errors and degrading the service quality of the computing cluster. In addition, to quickly locate unknown anomalies in a job or its operating environment, we often isolate a subset of worker nodes for a specific experiment group. How to isolate nodes quickly and record the history of these changes is also an urgent problem. OMAT, short for Open Operation Analysis Toolkits, was applied to the operational monitoring of the computing center's clusters in 2017; it provides rapid collection of anomaly data, correlation analysis, policy-based alarms, visualization, and more. In this report, we introduce how OMAT is used to monitor the IHEP computing cluster and to feed node service status back to the HTCondor scheduler so that it can manage computing resources quickly and conveniently, thereby minimizing the impact of abnormal services on users' jobs and improving the service quality of the computing cluster.
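The "black hole" behaviour described above can be illustrated with a minimal sketch. This is not OMAT's actual logic: the function name, the record layout, and all thresholds are assumptions chosen for illustration. It flags nodes on which most recent jobs failed almost immediately, which is the typical signature of a node that keeps accepting and killing jobs.

```python
from collections import defaultdict


def find_black_hole_nodes(jobs, min_jobs=20, failure_rate_threshold=0.8,
                          short_runtime_s=60):
    """Return a sorted list of nodes that look like "black holes".

    jobs: iterable of (node, succeeded, runtime_seconds) tuples.
    All names and thresholds here are hypothetical, not taken from OMAT.
    """
    total = defaultdict(int)          # jobs seen per node
    fast_failures = defaultdict(int)  # jobs that failed quickly per node

    for node, succeeded, runtime_s in jobs:
        total[node] += 1
        if not succeeded and runtime_s < short_runtime_s:
            fast_failures[node] += 1

    # A node is flagged only if it ran enough jobs to judge
    # and most of them failed quickly.
    return sorted(
        node for node, n in total.items()
        if n >= min_jobs and fast_failures[node] / n >= failure_rate_threshold
    )
```

In a setup like the one described, the flagged node list would be the status fed back to the scheduler so that it stops matching jobs to those nodes until the underlying service is repaired.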
Primary author
Mr Qingbao Hu (IHEP)
Co-authors
Dr Jingyan Shi (IHEP)
Mr Wei Zheng (Institute of High Energy Physics, CAS)
Mr Xiaowei Jiang (Institute of High Energy Physics, Chinese Academy of Sciences)