Speaker
Dr
Luca Giommi
(INFN and University of Bologna)
Description
The INFN-CNAF computing center, one of the Worldwide LHC Computing Grid Tier-1 sites, is serving a large set of scientific communities, in High Energy Physics and beyond. In order to increase efficiency and to remain competitive in the long run, CNAF is launching various activities aiming at implementing a global predictive maintenance solution for the site.
This requires a site-wide effort in collecting, cleaning and structuring all potentially useful data coming from the log files of the various Tier-1 services and systems, as a necessary step prior to designing machine-learning-based approaches for predictive maintenance.
Among the Tier-1 services, efficient storage systems are one of the key ingredients of Tier-1 operations. CNAF uses StoRM as a Grid Storage Resource Manager solution: its operations are logged in a very complex manner, as the log content is deeply unstructured and hard to exploit for analytics purposes. Despite this difficulty, the StoRM logs are a precious source of information for operators (real-time monitoring, anomaly detection), for developers (debugging, service stability, code improvements) and for site managers (service optimization, storage usage efficiency, time- and money-saving ways to spot and prevent unwanted behaviours).
This work describes how the StoRM logs can be handled and parsed to extract the relevant information, how such log handling can be designed to work automatically, how to define and implement metrics to tag critical states of the service, how to correlate StoRM events with external services’ events, and ultimately how to contribute to the future CNAF-wide predictive maintenance system.
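As a rough illustration of the parsing and tagging steps described above, the following Python sketch extracts fields from log lines with a regular expression and flags minutes with a high failure rate. The line layout, the `status=` field, the sample lines, the 0.5 threshold and all function names are assumptions made for this example; they do not reproduce the actual StoRM log format or the metrics adopted in this work.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical line layout: "<date> <time> <LEVEL> <operation> status=<STATUS>".
# This is NOT the real StoRM log format; it only illustrates the approach.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<op>\w+) "
    r"status=(?P<status>\w+)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S")
    return record


def failure_rate_per_minute(records):
    """Map each minute to the fraction of records whose status is not SUCCESS."""
    total = Counter()
    failed = Counter()
    for record in records:
        minute = record["ts"].replace(second=0)
        total[minute] += 1
        if record["status"] != "SUCCESS":
            failed[minute] += 1
    return {minute: failed[minute] / total[minute] for minute in total}


def critical_minutes(records, threshold=0.5):
    """Return the sorted minutes whose failure rate reaches the (assumed) threshold."""
    rates = failure_rate_per_minute(records)
    return sorted(minute for minute, rate in rates.items() if rate >= threshold)


# Invented sample lines, including one that does not fit the assumed layout.
SAMPLE = [
    "2019-01-01 10:00:05 INFO srmPrepareToGet status=SUCCESS",
    "2019-01-01 10:00:40 ERROR srmPrepareToPut status=FAILURE",
    "2019-01-01 10:01:10 INFO srmLs status=SUCCESS",
    "unparseable line with free text",
    "2019-01-01 10:01:50 INFO srmLs status=SUCCESS",
]

# Keep only the lines that could be parsed; unparseable ones are skipped here.
records = [r for r in (parse_line(line) for line in SAMPLE) if r is not None]
flagged = critical_minutes(records)
```

In this sketch, lines that do not match are simply dropped; in practice they could be counted and reported, since a growing share of unparseable lines may itself signal a change in the service's behaviour.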
First steps in this activity are presented and discussed, and complementary work in progress by other teams at the CNAF centre is also mentioned.
Primary author
Dr
Luca Giommi
(INFN and University of Bologna)
Co-author
Prof.
Daniele Bonacorsi
(University of Bologna)