Description
The increasing complexity and scale of modern data centers create operational environments in which detecting anomalies, anticipating failures, and optimizing resource usage are critically important. Recent advances in machine learning and artificial intelligence offer powerful techniques for extracting actionable insights from heterogeneous monitoring data, ranging from logs and metrics to event streams and security signals. In this contribution, we explore a set of AI-driven approaches that can be applied to a large-scale computing facility hosted at INFN-CNAF in Bologna. We discuss methods for anomaly detection, predictive alerting, and fault classification. Particular attention is given to techniques capable of leveraging high-volume, real-time data pipelines, including deep learning models for temporal analysis, clustering algorithms for behavior profiling, and hybrid systems that combine statistical baselines with learned representations. We also outline how such tools could support operational workflows, improve reliability, and reduce downtime, ultimately enhancing the overall efficiency of data-center operations. The study further lays the methodological foundations for future integration with the INFN-CNAF Big Data Platform, which is expected to provide the unified data backbone for advanced operational analytics. The goal of this contribution is to highlight promising research directions and practical use cases where AI can provide measurable value to large distributed computing infrastructures.