Speaker
Mr
Xiaowei Jiang
(Institute of High Energy Physics, Chinese Academy of Sciences)
Description
In a batch system, a job queue manages a set of jobs, and the health of the queue is determined by the health of those jobs. A job can be in one of several states, such as queuing, running, completed, error or held, and normally moves from one state to another. If a job stays in one state for too long, however, there may be an underlying problem such as a worker node failure or a network blockage. In a large-scale computing cluster such problems cannot be avoided entirely, so some jobs become stuck in one state and cannot complete in time, which delays the progress of the computing task.
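The idea of flagging a job that stays in one state for too long can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the `Job` record, the `find_stuck_jobs` function and the per-state limits in `MAX_SECONDS_IN_STATE` are hypothetical names and values chosen for illustration, since the abstract does not give the real thresholds.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical per-state time limits in seconds. The abstract does not
# state real values; these are placeholders for illustration only.
MAX_SECONDS_IN_STATE = {
    "queuing": 24 * 3600,
    "running": 72 * 3600,
    "held": 3600,
}


@dataclass
class Job:
    job_id: str
    state: str               # e.g. "queuing", "running", "held"
    entered_state_at: float  # Unix timestamp of the last state change


def find_stuck_jobs(jobs: Iterable[Job], now: float) -> List[Job]:
    """Return the jobs that have exceeded the limit for their current state."""
    stuck = []
    for job in jobs:
        limit = MAX_SECONDS_IN_STATE.get(job.state)
        # States without a limit (e.g. "completed") are never flagged.
        if limit is not None and now - job.entered_state_at > limit:
            stuck.append(job)
    return stuck
```

A time limit per state is only one possible criterion; a real deployment would likely combine it with job attributes, queue information and logs.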
To address this situation, this paper studies the causes of abnormal job states, the handling of the resulting problems, and job queue stability. We aim to improve job queue health in order to raise the job success rate and speed up users’ task progress. The causes of abnormal states can be found from job attributes, queue information and logs, which are analyzed in detail to arrive at better solutions. These solutions fall into two categories. The first is automatic job recovery, carried out in association with the monitoring system; once a job is recovered, it can be rescheduled promptly. The second is automatically notifying users so that they can recover jobs themselves: depending on the analysis results, feasible recommendations are pushed to users for quick recovery.
Following the approach described above, a queue health system has been designed and implemented at IHEP. We define a set of criteria for identifying abnormal jobs. Various kinds of information are collected and analyzed together. According to the analysis results, automatic recovery measures are applied to abnormal jobs. If automatic recovery is ineffective, recommendations are sent to users by email, WeChat, etc. The running status of the system shows that it is able to improve job queue health in most cases.
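The flow of trying automatic recovery first and falling back to a user recommendation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the IHEP system's code: `handle_abnormal_job`, the `recover` and `notify` callbacks, the reason strings and the advice texts are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical fallback message used when no specific advice is known.
DEFAULT_ADVICE = "Please check the job and resubmit it if needed."


def handle_abnormal_job(
    job_id: str,
    reason: str,
    recover: Callable[[str, str], bool],
    notify: Callable[[str, str], None],
    advice: Dict[str, str],
) -> str:
    """Try automatic recovery first; otherwise send the user a recommendation.

    `recover` returns True when recovery succeeded. `notify` stands in for
    any delivery channel (email, WeChat, etc.).
    """
    if recover(job_id, reason):
        return "recovered"
    # Recovery did not succeed: push a reason-specific recommendation.
    notify(job_id, advice.get(reason, DEFAULT_ADVICE))
    return "user_notified"
```

Passing the recovery and notification steps in as callbacks keeps the decision logic independent of any particular batch system or messaging channel, neither of which is specified in the abstract.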
Primary author
Mr
Xiaowei Jiang
(Institute of High Energy Physics, Chinese Academy of Sciences)
Co-authors
Ms
Hongnan Tan
(IHEP)
Mr
Jiaheng Zou
(IHEP, Chinese Academy of Sciences)
Dr
Jingyan Shi
(IHEP)
Mr
Qingbao Hu
(IHEP)
Ms
Ran Du
(Institute of High Energy Physics, Chinese Academy of Sciences)
Mr
Zhenyu Sun
(IHEP)