Speakers
Dr Jue Wang (Computer Network Information Center, Chinese Academy of Sciences)
Prof. Yangang Wang (Computer Network Information Center, Chinese Academy of Sciences)
Prof. Zhonghua Lu (Computer Network Information Center, Chinese Academy of Sciences)
Description
Because traditional HPC applications, big data applications, and artificial intelligence applications rely on different software stacks, the infrastructure for each has usually been built and managed with completely different methods and systems.
To meet the rapidly growing demand for computing and storage resources from big data and artificial intelligence applications, and to take full advantage of the infrastructure's computing and storage capabilities, we built our infrastructure on traditional HPC technology and improved different layers of it to meet the requirements of big data and artificial intelligence applications, realizing a high-performance computing infrastructure that is both efficient and user-friendly.
At the storage layer, besides the traditional high-performance parallel file system, we equipped each computing node with high-volume SSD storage, making the infrastructure compatible with distributed file systems such as HDFS and satisfying data-locality requirements. On this basis, the infrastructure can handle big data applications built on top of Hadoop, Spark, and other related frameworks. To manage data with different I/O patterns efficiently, we built multiple data management interfaces for uploading and writing data, supporting different application modes.
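As a minimal sketch of how such a workload runs against the node-local-SSD-backed HDFS, the following PySpark word count is illustrative only; the HDFS paths and the YARN master setting are assumptions for the example, not the infrastructure's actual configuration.

# Minimal PySpark sketch (illustrative only): a word-count job reading from HDFS.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-wordcount-sketch")
         .master("yarn")                      # assumed: jobs submitted through YARN
         .getOrCreate())

# hypothetical input path on the node-local-SSD-backed HDFS
lines = spark.read.text("hdfs:///user/demo/input.txt")

counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///user/demo/wordcount_out")  # hypothetical output path
spark.stop()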
To adapt to the computing patterns of artificial intelligence applications, such as deep learning, each computing node is equipped with 8 NVIDIA Tesla P100 GPUs, and the nodes are interconnected by a 56 Gbps InfiniBand network. We use efficient and powerful scheduler systems based on LSF and Apache YARN, respectively. For both systems, leveraging containers and other resource-isolation technologies, the infrastructure implements CPU affinity, GPU affinity, and other features to improve computing and communication efficiency. Multiple scheduling and queue management policies are supported, so applications with different resource-requirement characteristics can be managed reasonably and fully utilize the computing resources. Based on container technology, users can build their own software stack images and manage their software and frameworks in a more convenient way than the traditional HPC user environment.
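The CPU/GPU affinity idea can be illustrated with a small Python sketch. The LOCAL_RANK variable, the core layout, and the binding logic below are assumptions chosen for illustration, not the schedulers' actual mechanism.

# Illustrative sketch of per-worker CPU/GPU binding on an 8-GPU node.
import os

local_rank = int(os.environ.get("LOCAL_RANK", "0"))   # hypothetical per-node worker rank
gpus_per_node = 8                                      # 8x Tesla P100 per node

# Bind this worker to one GPU ...
os.environ["CUDA_VISIBLE_DEVICES"] = str(local_rank % gpus_per_node)

# ... and to a contiguous slice of the node's CPU cores (Linux only).
total_cores = os.cpu_count() or 1
cores_per_worker = max(1, total_cores // gpus_per_node)
first_core = local_rank * cores_per_worker
os.sched_setaffinity(0, range(first_core, first_core + cores_per_worker))

print(f"worker {local_rank}: GPU {os.environ['CUDA_VISIBLE_DEVICES']}, "
      f"cores {first_core}-{first_core + cores_per_worker - 1}")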
Specifically, we built interfaces between common deep learning frameworks, such as TensorFlow, Caffe, PyTorch, and MXNet, and the scheduler systems. For distributed deep learning applications, users need not explicitly allocate resources in detail in their code; the scheduler handles most of this work. Furthermore, we apply MPI and NCCL in the common deep learning frameworks on the infrastructure, aiming to make full use of the InfiniBand network and improve communication performance.
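The sketch below shows the kind of NCCL-backed data-parallel job such an interface launches, assuming the launcher supplies the usual torch.distributed environment variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, LOCAL_RANK); the toy model and dummy data are placeholders, not part of the infrastructure itself.

# Sketch of NCCL-backed data-parallel training in PyTorch.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")              # NCCL collectives over InfiniBand
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 10).cuda(local_rank)   # toy model for illustration
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):                               # dummy training loop
    x = torch.randn(32, 1024, device=local_rank)
    y = torch.randint(0, 10, (32,), device=local_rank)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                                  # gradients all-reduced via NCCL
    optimizer.step()

dist.destroy_process_group()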
To make the infrastructure easier to use for artificial intelligence developers, different kinds of assistance services are being built, including visualization and training-monitoring tools for the common deep learning frameworks, so that users can view their model structure and training progress on the web.
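The abstract does not name the specific tools, so the sketch below assumes TensorBoard-style logging through PyTorch's SummaryWriter purely to illustrate the workflow: the model graph and scalar training curves are written to a log directory and rendered in a web UI.

# Illustrative sketch of web-based training monitoring (assumed TensorBoard-style logging).
import torch
from torch.utils.tensorboard import SummaryWriter

model = torch.nn.Sequential(torch.nn.Linear(1024, 64),
                            torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
writer = SummaryWriter(log_dir="runs/demo")          # hypothetical log directory

# Log the model structure once, then scalar metrics every step; a web UI
# (e.g. `tensorboard --logdir runs`) renders the graph and training curves.
writer.add_graph(model, torch.randn(1, 1024))
for step in range(100):
    fake_loss = 1.0 / (step + 1)                     # placeholder metric
    writer.add_scalar("train/loss", fake_loss, step)
writer.close()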
The infrastructure has been adopted by users from different fields. For an atmospheric science application that uses deep learning networks to predict the weather, test results show that the infrastructure delivers a significant performance improvement over the users' own deep learning clusters. Other applications include research on applying reinforcement learning to poker games, research on building an algorithmic trader based on reinforcement learning that leverages large-scale financial trading data, and a face recognition application.
In the future, the infrastructure will be connected with other commercial computing service platforms, such as Kingsoft Cloud, through grid technology. Furthermore, the infrastructure will be connected to the China National Grid to offer services to more academic and enterprise users.
Primary author
Dr Jue Wang (Computer Network Information Center, Chinese Academy of Sciences)
Co-authors
Ms Chen Li (Computer Network Information Center, Chinese Academy of Sciences)
Dr Fang Liu (Computer Network Information Center, Chinese Academy of Sciences)
Mr Kun Sun (Computer Network Information Center, Chinese Academy of Sciences)
Mr Tengteng Hu (Computer Network Information Center, Chinese Academy of Sciences)
Prof. Yangang Wang (Computer Network Information Center, Chinese Academy of Sciences)
Mr Yongze Sun (Computer Network Information Center, Chinese Academy of Sciences)
Prof. Zhonghua Lu (Computer Network Information Center, Chinese Academy of Sciences)