31 March 2019 to 5 April 2019
Academia Sinica
Asia/Taipei timezone

Towards Predictive Maintenance with Machine Learning at the INFN-CNAF computing centre

4 Apr 2019, 16:30
30m
Conference Room 1 (Academia Sinica)

Oral Presentation, Data Management & Big Data

Speaker

Dr Luca Giommi (INFN and University of Bologna)

Description

The INFN-CNAF computing centre, one of the Worldwide LHC Computing Grid Tier-1 sites, serves a large set of scientific communities, in High Energy Physics and beyond. To increase efficiency and remain competitive in the long run, CNAF is launching several activities aimed at implementing a global predictive maintenance solution for the site. This requires a site-wide effort to collect, clean and structure all potentially useful data from the log files of the various Tier-1 services and systems, a necessary step before designing machine-learning-based approaches to predictive maintenance. Efficient storage systems are one of the key ingredients of Tier-1 operations. CNAF uses StoRM as its Grid Storage Resource Manager solution; its operations are logged in a very complex manner, and the log content is deeply unstructured and hard to exploit for analytics purposes. Despite this difficulty, the StoRM logs are a precious source of information for operators (real-time monitoring, anomaly detection), for developers (debugging, service stability, code improvements) and for site managers (service optimization, storage usage efficiency, time- and money-saving ways to spot and prevent unwanted behaviours). This work describes how the StoRM logs can be handled and parsed to extract the relevant information, how such log handling can be designed to work automatically, how to define and implement metrics to tag critical states of the service, how to correlate StoRM events with external services’ events, and ultimately how to contribute to the future CNAF-wide predictive maintenance system. First steps in this activity are presented and discussed, and complementary work in progress by other teams at the CNAF centre is also mentioned.
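
As a rough illustration of the two steps named in the abstract (parsing log lines into structured records, then defining a metric that tags critical states), a minimal Python sketch might look like the following. Everything in it is an assumption made for illustration: the line layout is invented and is not the real StoRM log format, and the names `LINE_RE`, `parse_line`, `critical_minutes` and the 50% failure threshold are hypothetical, not taken from the presented work.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical line layout, NOT the real StoRM format:
#   2019-04-04 16:30:12 INFO  op=PUT status=OK duration_ms=120
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"op=(?P<op>\w+)\s+status=(?P<status>\w+)\s+duration_ms=(?P<ms>\d+)$"
)


def parse_line(line):
    """Return a structured record for a matching line, or None otherwise."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "ts": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"),
        "level": m["level"],
        "op": m["op"],
        "status": m["status"],
        "duration_ms": int(m["ms"]),
    }


def critical_minutes(records, max_failure_rate=0.5):
    """Return the one-minute windows whose failure fraction exceeds the threshold."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in records:
        minute = r["ts"].replace(second=0)  # bucket records by minute
        totals[minute] += 1
        if r["status"] != "OK":
            failures[minute] += 1
    return sorted(m for m in totals if failures[m] / totals[m] > max_failure_rate)


# Invented sample lines, including one that does not match the layout.
sample = [
    "2019-04-04 16:30:12 INFO  op=PUT status=OK duration_ms=120",
    "2019-04-04 16:30:40 ERROR op=GET status=FAIL duration_ms=5000",
    "2019-04-04 16:31:05 ERROR op=GET status=FAIL duration_ms=5100",
    "2019-04-04 16:31:30 ERROR op=PUT status=FAIL duration_ms=4900",
    "free-form line that does not match",
]

records = [r for r in map(parse_line, sample) if r is not None]
print(critical_minutes(records))
```

On this invented sample only the 16:31 window is tagged, since the 16:30 window sits exactly at the threshold rather than above it. Real logs would likely need several patterns and a more careful choice of metric; unparseable lines are dropped here only to keep the sketch short.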

Primary author

Dr Luca Giommi (INFN and University of Bologna)

Co-author

Prof. Daniele Bonacorsi (University of Bologna)
