Designing Convergent HPC, Big Data Analytics and Deep Learning Software Stacks for Exascale Systems
Abstract: This talk will focus on the challenges of designing convergent HPC, Deep Learning, and Big Data Analytics software stacks for Exascale systems with millions of processors and accelerators. For the HPC domain, we will discuss the challenges of designing runtime environments for MPI+X programming models, taking into account support for multi-core systems (Xeon, OpenPower, and ARM), high-performance networks, and GPGPUs (including GPUDirect RDMA). Features and sample performance numbers from the MVAPICH2 libraries (http://mvapich.cse.ohio-state.edu) will be presented. For the Big Data domain, we will give an overview of RDMA-based designs for Hadoop (HDFS, MapReduce, RPC, and HBase), Spark, Memcached, Swift, and Kafka that use native RDMA support for InfiniBand and RoCE. The benefits of these designs on various cluster configurations will be shown using the publicly available RDMA-enabled packages from the OSU HiBD project (http://hibd.cse.ohio-state.edu). For the Deep Learning domain, we will focus on scalable DNN training with Caffe and TensorFlow using the MVAPICH2-GDR MPI library and RDMA-enabled Big Data stacks (http://hidl.cse.ohio-state.edu).