Studies on Job Queue Health and Problem Recovery

22 Mar 2018, 14:00
20m
Conf. Room 2 (Academia Sinica)

Oral Presentation
Physics (including HEP) and Engineering Applications
Physics & Engineering Session

Speaker

Mr Xiaowei Jiang (Institute of High Energy Physics, Chinese Academy of Sciences)

Description

In a batch system, a job queue manages a set of jobs, and the health of the queue is determined by the health of those jobs. A job can be in one of several states, such as queuing, running, completed, error or held, and jobs normally move from one state to another. A job that stays in one state for too long may indicate a problem, such as a worker node failure or a network blockage. In a large-scale computing cluster such problems cannot be avoided entirely, so some jobs become stuck in one state and cannot complete in time, which delays the progress of the computing task.

To address this situation, this paper studies the causes of abnormal job states, problem handling and job queue stability. We aim to improve job queue health in order to raise the job success rate and speed up users’ task progress. The causes of abnormal states can be found in job attributes, queue information and logs, which are analyzed in detail to derive better solutions. These solutions fall into two categories. The first is automatic job recovery, carried out in association with the monitoring system; once a job is recovered, it can be rescheduled promptly. The second is automatically notifying users so that they can recover jobs themselves: depending on the analysis results, feasible recommendations are pushed to users for quick recovery.

Based on this approach, a queue health system has been designed and implemented at IHEP. We define a set of criteria for identifying abnormal jobs, and various kinds of information are collected and analyzed together. According to the analysis results, automatic recovery measures are applied to abnormal jobs. If automatic recovery fails, recommendations are sent to users by email, WeChat, etc. The current status of the system shows that it is able to improve job queue health in most conditions.
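The detection and handling logic described above can be illustrated with a minimal sketch. This is not the IHEP implementation: the per-state time limits, the `Job` fields, the retry count and the function names are all hypothetical, chosen only to show the idea of flagging a job that stays in one state too long and then choosing between automatic recovery and user notification.

```python
from dataclasses import dataclass

# Hypothetical per-state limits in seconds. The abstract does not give the
# actual criteria used at IHEP, so these values are placeholders.
STATE_LIMITS = {
    "queuing": 6 * 3600,
    "running": 48 * 3600,
    "held": 1 * 3600,
}


@dataclass
class Job:
    job_id: str
    state: str
    state_entered: float  # Unix timestamp when the job entered `state`
    recover_attempts: int = 0  # automatic recoveries already tried


def is_abnormal(job, now, limits=STATE_LIMITS):
    """Flag a job that has stayed in one state longer than its limit."""
    limit = limits.get(job.state)
    if limit is None:  # e.g. "completed": no time limit applies
        return False
    return now - job.state_entered > limit


def choose_action(job, max_attempts=2):
    """Try automatic recovery first; fall back to notifying the user."""
    if job.recover_attempts < max_attempts:
        return "auto_recover"
    return "notify_user"


def triage(jobs, now):
    """Map each abnormal job id to the action chosen for it."""
    return {j.job_id: choose_action(j) for j in jobs if is_abnormal(j, now)}


now = 100 * 3600.0
jobs = [
    Job("a", "queuing", now - 7 * 3600),    # queued longer than the limit
    Job("b", "running", now - 1 * 3600),    # within the limit
    Job("c", "held", now - 2 * 3600, recover_attempts=2),  # retries used up
    Job("d", "completed", now - 90 * 3600),  # finished, never flagged
]
print(triage(jobs, now))  # {'a': 'auto_recover', 'c': 'notify_user'}
```

In a real deployment the timestamps and states would come from the batch system's job attributes and logs, and the choice of action would depend on the diagnosed cause rather than on a simple retry counter.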

Primary author

Mr Xiaowei Jiang (Institute of High Energy Physics, Chinese Academy of Sciences)

Co-authors

Ms Hongnan Tan (IHEP)
Mr Jiaheng Zou (IHEP, Chinese Academy of Sciences)
Dr Jingyan Shi (IHEP)
Mr Qingbao Hu (IHEP)
Ms Ran Du (Institute of High Energy Physics, Chinese Academy of Sciences)
Mr Zhenyu Sun (IHEP)
