31 March 2019 to 5 April 2019
Academia Sinica
Asia/Taipei timezone

Collection and harmonization of system logs and prototypal Analytics services with the Elastic (ELK) suite at the INFN-CNAF computing centre

3 Apr 2019, 14:00
20m
Conference Room 2 (Academia Sinica)

Oral Presentation Network, Security, Infrastructure & Operations Networking, Security, Infrastructure & Operations

Speaker

Mr Tommaso Diotalevi (INFN and University of Bologna)

Description

The distributed Grid infrastructure for High-Energy Physics experiments at the Large Hadron Collider (LHC) in Geneva comprises a set of computing centres, as part of the Worldwide LHC Computing Grid (WLCG). In Italy, the Tier-1 functionalities are provided by the INFN-CNAF data centre, which also serves more than twenty non-LHC experiments. A key challenge is modernising the centre so that it can cope with the increasing flux of data expected in the near future. High standards of operation require continuous work towards a full understanding of service behaviour and a constant search for higher levels of automation and optimisation. Data centres worldwide are adopting Artificial Intelligence (AI) to move into a new phase, in which tasks traditionally managed by operators could be handled more efficiently by human-supervised algorithms and techniques. In addition, CNAF collects a large volume of logs every day from various sources. These logs are highly heterogeneous and difficult to harmonise; they are archived but almost never used, except for specific internal debugging and hardware monitoring operations. This contribution presents a working implementation of a system that collects, parses and displays the log information from CNAF data sources and offers analytics functionalities. The open source ELK software suite, which includes Elasticsearch, is used for log ingestion and transformation, and for creating a centralised, structured database that organises the CNAF logs in a clean and ordered manner. New visualisations and dashboards built on this database offer online monitoring functionalities. This infrastructure is therefore vital to CNAF's long-term goal of modernising the centre through machine-learning-based predictive maintenance, moving away from preventive replacement of equipment, which is highly expensive and far from optimally efficient.
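As an illustration of the harmonisation step described above, the following minimal Python sketch maps two differently formatted log lines onto one common schema and serialises them in the newline-delimited JSON layout accepted by the Elasticsearch bulk API. It is not the CNAF implementation: the two log formats, the field names, the index name `cnaf-logs`, the assumed year and the assumption that all timestamps are in UTC are hypothetical choices made only for the example, and only the Python standard library is used.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical source formats (assumptions, not actual CNAF formats):
# a classic syslog line and an ISO-timestamped service log line.
SYSLOG_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<service>[\w\-/.]+)(?:\[\d+\])?: (?P<message>.*)$"
)
ISO_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z "
    r"\[(?P<level>[A-Z]+)\] (?P<service>\S+) (?P<message>.*)$"
)


def harmonise(line, default_host="unknown", year=2019):
    """Map one raw log line onto a common schema (assumed field names)."""
    m = SYSLOG_RE.match(line)
    if m:
        # Syslog timestamps carry no year; one is assumed here.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        return {
            "@timestamp": ts.replace(tzinfo=timezone.utc).isoformat(),
            "host": m["host"],
            "service": m["service"],
            "level": None,
            "message": m["message"],
            "parsed": True,
        }
    m = ISO_RE.match(line)
    if m:
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S")
        return {
            "@timestamp": ts.replace(tzinfo=timezone.utc).isoformat(),
            "host": default_host,
            "service": m["service"],
            "level": m["level"],
            "message": m["message"],
            "parsed": True,
        }
    # Keep unparsed lines so that nothing is silently dropped.
    return {"@timestamp": None, "host": default_host, "service": None,
            "level": None, "message": line, "parsed": False}


def to_bulk(docs, index="cnaf-logs"):
    """Serialise documents as bulk-API NDJSON: an action line, then the document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    raw = [
        "Apr  3 14:00:01 wn-01 sshd[1234]: Accepted publickey for user",
        "2019-04-03T14:00:02Z [ERROR] storm-backend Disk quota exceeded",
    ]
    print(to_bulk(harmonise(line) for line in raw), end="")
```

In a deployment based on the ELK suite, this kind of parsing would more typically be expressed as Logstash filters or Elasticsearch ingest pipelines; the sketch only shows the idea of reducing heterogeneous sources to a single structured record.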

Primary author

Mr Tommaso Diotalevi (INFN and University of Bologna)

Co-authors

Antonio Falabella (INFN-CNAF)
Prof. Daniele Bonacorsi (University of Bologna)
Diego Michelotto (INFN-CNAF)
