What goes up must go down: A case study from RAL on the process of shrinking an existing storage service

20 Mar 2018, 12:00
30m
Conference Room 1, BHSS (Academia Sinica)

Oral Presentation
Big Data & Data Management
Data Management & Big Data Session

Speaker

Mr Rob Appleyard (STFC)

Description

Much attention is paid to how new storage services are deployed into production and to the challenges therein. Far less is paid to what happens when a storage service approaches the end of its useful life. The challenges in rationalising and de-scoping a service that, while relatively old, is still critical to production work for both the UK WLCG Tier 1 and local facilities are not to be underestimated. RAL has been running a disk and tape storage service based on CASTOR (CERN Advanced STORage) for over 10 years. CASTOR must cope with both the throughput requirements of supplying data to a large batch farm and the data integrity requirements of a long-term tape archive. A new storage service, called ‘Echo’, is now being deployed to replace the disk-only element of CASTOR, but we intend to continue supporting the CASTOR system for tape into the medium term. This, in turn, implies a downsizing and redesign of the CASTOR service in order to improve manageability and cost effectiveness. We will give an outline of both Echo and CASTOR as background. This paper will discuss the project to downsize CASTOR and improve its manageability when running both at a considerably smaller scale (we intend to go from around 140 storage nodes to around 20) and with considerably less available staff effort. This transformation must be achieved while running the service in 24/7 production and supporting the transition to the newer storage element. To achieve this goal, we intend to move the remaining management nodes to a virtualised infrastructure and to improve resilience by allowing management functions to be performed by many different nodes concurrently (‘cattle’ as opposed to ‘pets’). We also intend to streamline the system by condensing the existing 4 CASTOR ‘stagers’ (databases that record the state of the disk pools) into a single one that supports all users.

Primary author

Mr Rob Appleyard (STFC)
