Researchers from the Synergy Lab at Georgia Tech and Intel Labs won the Best Paper Award at the 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA), held February 22-26 in San Diego, California.
Five co-authors are from the Synergy Lab, which has researchers from the Georgia Tech School of Electrical and Computer Engineering (ECE) and School of Computer Science (CS). They include Tushar Krishna, leader of the Synergy Lab and the ON Semiconductor Junior Professor in the School of ECE; his Ph.D. students Eric Qin (ECE), Ananda Samajdar (ECE), and Hyoukjun Kwon (CS); and his B.S./M.S. ECE student Vineet Nadella. Three co-authors are from Intel Labs: Sudarshan Srinivasan, Dipankar Das, and Bharat Kaul.
The title of the team’s award-winning paper is “SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training.” The advent of Deep Learning (DL) has radically transformed the computing industry across the entire spectrum from algorithms to circuits. As myriad application domains embrace DL, it has become synonymous with a genre of workloads across vision, speech, language, recommendations, robotics, and games.
The key compute kernel within most DL workloads is the general matrix-matrix multiplication (GEMM), which appears frequently during both the forward pass (inference and training) and the backward pass (training). GEMMs are a natural choice for hardware acceleration to speed up training, and have led to 2D systolic architectures like NVIDIA tensor cores and the Google Tensor Processing Unit (TPU).
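To make the role of GEMMs concrete, the following is a minimal sketch of the three matrix multiplications that arise when training a single fully connected layer: one in the forward pass and two in the backward pass. The layer sizes, the values, and the helper names (`matmul`, `transpose`, `shape`) are illustrative assumptions and do not come from the paper.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both given as lists of rows."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def shape(m):
    """Return (rows, cols) of a matrix given as a list of rows."""
    return (len(m), len(m[0]))


# Hypothetical layer: a batch of 2 inputs, 3 input features, 4 output features.
x = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]]            # activations, 2 x 3
w = [[0.5, 0.0, 0.0, 1.0],
     [0.0, 0.0, 2.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]]       # weights, 3 x 4
dy = [[1.0, 0.0, 0.0, 1.0],
      [0.0, 1.0, 0.0, 0.0]]      # gradient with respect to the output, 2 x 4

y = matmul(x, w)                 # forward pass
dx = matmul(dy, transpose(w))    # backward pass: gradient for the inputs
dw = matmul(transpose(x), dy)    # backward pass: gradient for the weights

print(shape(y), shape(dx), shape(dw))
```

Note that none of the three products is square and that the operands already contain many zeros, which is the kind of workload the next paragraph describes.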
Unfortunately, emerging GEMMs in DL are highly irregular (i.e., non-square) and sparse (i.e., they contain many zeros, with sparsity ranging from 10% to 90% during training). This leads to low utilization on systolic architectures, as these are optimized for dense square GEMMs. This paper proposes SIGMA, a flexible and scalable architecture that offers high utilization of all its processing elements (PEs), regardless of kernel shape and sparsity.
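The utilization problem can be illustrated with a rough back-of-the-envelope estimate. The sketch below assumes a deliberately simplified model in which a matrix is cut into tiles the size of a rigid PE array and a PE counts as useful only if it holds a nonzero element. The function name, the 128 x 128 array size, and the example matrices are assumptions chosen for illustration; this is not the evaluation methodology used in the paper.

```python
import math


def stationary_utilization(rows, cols, density, pe_rows, pe_cols):
    """Estimate the fraction of PE slots that hold a nonzero operand.

    Assumed model: a rows x cols matrix is tiled onto a rigid
    pe_rows x pe_cols array, every tile occupies the whole array,
    and zeros or padding leave the corresponding PEs idle.
    `density` is the fraction of nonzero elements (1.0 means dense).
    """
    tiles = math.ceil(rows / pe_rows) * math.ceil(cols / pe_cols)
    slots = tiles * pe_rows * pe_cols
    useful = rows * cols * density
    return useful / slots


# Hypothetical 128 x 128 array of PEs.
dense_square = stationary_utilization(128, 128, 1.0, 128, 128)
dense_irregular = stationary_utilization(16, 1024, 1.0, 128, 128)
sparse_irregular = stationary_utilization(16, 1024, 0.2, 128, 128)

print(dense_square, dense_irregular, sparse_irregular)
```

Under these assumptions, a dense square matrix fills the array completely, a dense 16 x 1024 matrix uses one eighth of the PE slots, and the same matrix at 80% sparsity uses only a few percent of them.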
SIGMA’s key novelty is a flexible communication fabric for input/weight distribution and output reduction; this enables SIGMA to morph its datapath and tailor it to efficiently map GEMMs of arbitrary shape and sparsity level. SIGMA performs 5.7x better than systolic array architectures for irregular sparse matrices, and roughly 3x better than state-of-the-art sparse accelerators. Krishna, his students, and his Intel Labs colleagues demonstrate an instance of SIGMA operating at 10.8 TFLOPS efficiency with a 65.10 mm² and 22.33 W footprint on a 28 nm process.
Photo cutline: Pictured from left to right are Hyoukjun Kwon, Ananda Samajdar (on iPad), Tushar Krishna, Eric Qin, and Vineet Nadella.
School of Electrical and Computer Engineering
Last revised July 15, 2020