Dr Andrew Lahiff (STFC), Dr Ian Collier
Container orchestration is rapidly emerging as a way to gain many potential benefits over a traditional static infrastructure, such as increased resource utilisation through multi-tenancy, the ability to handle changing loads through elasticity, and improved availability as a result of self-healing. Whilst many large organisations use this technology, in some cases having done so for many years, it is not yet common in the scientific community. At the RAL Tier-1 we have been investigating the migration of services to an Apache Mesos cluster running on bare metal. In this architecture the concept of individual machines is abstracted away and services run on the cluster in ephemeral Docker containers. Instead of the standard approach of manually placing long-running services on specific hosts, services are managed by a scheduler. This means that host or application failures, as well as procedures such as rolling restarts or upgrades, can be handled automatically and no longer require human intervention. Similarly, the number of instances of an application can be scaled automatically in response to changes in load. Despite these clear benefits, a number of new challenges arise, such as how to handle monitoring, logging and, in particular, service discovery in a dynamic environment where services are no longer tied to specific hosts. In addition, an important question is whether it is even possible to run traditional grid middleware in this type of environment. This talk will describe the Mesos infrastructure deployed at RAL, the testing we have done and our progress towards migrating production services, and will discuss our future plans.
Dr Andrew Lahiff (STFC)