As equipment scales up and computing environments grow more complex, operating and maintaining large-scale computing clusters becomes increasingly difficult. Operation and maintenance methods based on automation technology cannot resolve the various service failures in computing clusters quickly and effectively. There is an urgent need to adopt emerging technologies that gather comprehensive cluster operation and maintenance information, integrate monitoring data from multiple dimensions, and analyze abnormal monitoring data as a whole. The results of this analysis can then be used to locate the root cause of service failures and help computing clusters restore services quickly.
To provide a more stable cluster operating environment, IHEPCC combined big data technology with data analysis and indexing tools to design and implement an open operation and maintenance analysis toolkit (OMAT), whose functions include data collection, correlation analysis, monitoring data strategies, and anomaly alarms, among others. This report introduces the architecture, processing capabilities, and some key functions of OMAT. Following the processing flow of monitoring data, it also describes the specific implementation of the system in data collection, data processing, data storage, and data visualization.
The OMAT platform is currently deployed on multiple cross-regional computing clusters, including IHEP, covering about 5,000 nodes. The collected information includes node status, storage performance, network traffic, user operations, account security, power environment, and other operation and maintenance indicators, to ensure the stable operation of the computing clusters.