5-10 March 2017
BHSS, Academia Sinica
Asia/Taipei timezone

VCondor - an implementation of dynamic virtual computing cluster

10 Mar 2017, 10:50
20m
Conf. Room 1 (BHSS, Academia Sinica)

No. 128, Sec. 2, Academia Rd., Taipei, Taiwan
Infrastructure Clouds and Virtualisation
Infrastructure Clouds and Virtualisation II

Speaker

Mr Yaodong CHENG (IHEP, CAS)

Description

As a new approach to resource management, virtualization technology is increasingly applied in high energy physics. We have built a virtual computing cluster at IHEP based on OpenStack, with HTCondor as the job management system. In a traditional computing cluster, a fixed number of slots is pre-allocated to the job queue of each experiment. However, this policy has gradually become unable to satisfy the peak requirements of different experiments, and it also leads to low CPU utilization. To solve the problem, we designed and implemented VCondor, a dynamic virtual computing cluster system based on HTCondor and OpenStack. The system performs unified management of virtual machines (VMs) according to the queue status in HTCondor. One or more VMs are created automatically when jobs are waiting to run. A VM is destroyed when its job has finished and no more jobs remain in the HTCondor queue. The job queue status is checked periodically, for example every 10 minutes, so a VM continues to run if new jobs arrive within that period. VCondor also supports resource provisioning and reservation for different experiments. Before it creates VMs, VCondor must request the number of available VMs from a VM resource scheduling system called VMQuota. VMQuota tells VCondor how many VMs it can create and how long these VMs will be reserved before they are created. This talk will present several use cases from the LHAASO and JUNO experiments. The results show that the virtual computing cluster can be dynamically expanded or shrunk as computing requirements change. Additionally, the CPU utilization of the overall computing resources is significantly improved compared with the traditional resource management system. The system also performs well when there are multiple HTCondor schedulers and multiple job queues, and it is stable and easy to maintain.
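
The scaling rule described in the abstract can be illustrated with a minimal sketch. This is not the VCondor implementation: the names QueueStatus, plan_cycle, simulate and max_vms are hypothetical, the queue snapshots stand in for what a real system would read from HTCondor, and the "available" value stands in for the answer a quota service such as VMQuota would return. It also assumes one job slot per VM.

```python
from dataclasses import dataclass


@dataclass
class QueueStatus:
    """Snapshot of one job queue at a polling cycle (hypothetical structure)."""
    idle_jobs: int     # jobs waiting to run
    running_jobs: int  # jobs currently running


def plan_cycle(status: QueueStatus, active_vms: int, available: int) -> int:
    """Return the change in VM count for one polling cycle.

    Positive: number of VMs to create. Negative: number of VMs to destroy.
    `available` is the number of extra VMs the quota service grants
    (an assumption standing in for a VMQuota reply).
    """
    if status.idle_jobs > 0:
        # Jobs are waiting: create VMs, but never more than the quota allows.
        return min(status.idle_jobs, max(available, 0))
    # No jobs waiting: destroy VMs that no running job occupies.
    surplus = active_vms - status.running_jobs
    return -max(surplus, 0)


def simulate(cycles, max_vms):
    """Apply plan_cycle to successive snapshots; return the VM count after each."""
    active = 0
    history = []
    for status in cycles:
        available = max_vms - active  # hypothetical fixed cap per experiment
        active += plan_cycle(status, active, available)
        history.append(active)
    return history


if __name__ == "__main__":
    snapshots = [
        QueueStatus(idle_jobs=5, running_jobs=0),
        QueueStatus(idle_jobs=2, running_jobs=3),
        QueueStatus(idle_jobs=0, running_jobs=3),
        QueueStatus(idle_jobs=0, running_jobs=0),
    ]
    print(simulate(snapshots, max_vms=3))
```

In this sketch the cluster grows only up to the granted quota while jobs are waiting, holds steady while VMs are busy, and shrinks once the queue is empty. A real deployment would additionally need to handle reservation times and multiple schedulers, which are omitted here.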

Primary author

Mr Yaodong CHENG (IHEP, CAS)
