Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically contain only a handful of large, coarse-grained chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel Emer, C. Thomas Gray, Brucek Khailany, Stephen W. Keckler
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52),
2019
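As a rough illustration of the layer-mapping idea described above, the sketch below partitions a GEMM-style layer's output across a 6x6 chiplet grid so that each chiplet keeps its weight tile resident and only activations cross chiplet boundaries. This is a minimal, hypothetical example of spatial output tiling; it is not Simba's actual dataflow or any of the paper's three tiling optimizations, and all dimensions and names are invented for illustration.

```cpp
// Minimal sketch of tiling a GEMM-style DNN layer across a chiplet grid.
// Illustrates the general idea only (weights resident per chiplet, outputs
// partitioned spatially); NOT Simba's actual mapping. All names hypothetical.
#include <cstdio>
#include <vector>

constexpr int GRID = 6;                // 6x6 chiplets, as in the 36-chiplet MCM
constexpr int M = 96, N = 96, K = 64;  // toy layer: out[M][N] = in[M][K] * w[K][N]

int main() {
    std::vector<float> in(M * K, 1.0f), w(K * N, 0.5f), out(M * N, 0.0f);
    const int tm = M / GRID, tn = N / GRID;  // per-chiplet output tile size

    for (int cy = 0; cy < GRID; ++cy) {      // each (cy, cx) models one chiplet
        for (int cx = 0; cx < GRID; ++cx) {
            // This chiplet owns output tile (cy, cx); its slice of the weights
            // stays local, so only input activations cross chiplet boundaries.
            for (int i = cy * tm; i < (cy + 1) * tm; ++i) {
                for (int j = cx * tn; j < (cx + 1) * tn; ++j) {
                    float acc = 0.0f;
                    for (int k = 0; k < K; ++k)
                        acc += in[i * K + k] * w[k * N + j];
                    out[i * N + j] = acc;
                }
            }
        }
    }
    std::printf("out[0] = %f\n", out[0]);  // 1.0 * 0.5 summed over K = 32
    return 0;
}
```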
This work presents a scalable deep neural network (DNN) accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic die are limited to specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. The 16nm prototype achieves 1.29 TOPS/mm^2, 0.11 pJ/op energy efficiency, and 4.01 TOPS peak performance for a 1-chip system, and 127.8 TOPS peak performance and 2615 images/s ResNet-50 inference for a 36-chip system.
Brian Zimmer, Rangharajan Venkatesan, Yakun Sophia Shao, Jason Clemons, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel S. Emer, C. Thomas Gray, Stephen W. Keckler, Brucek Khailany
2019 Symposium on VLSI Circuits (VLSI),
2019
Modern SoCs suffer from power supply noise that can require significant additional timing margin, reducing performance and energy efficiency. Globally asynchronous, locally synchronous (GALS) systems can mitigate the impact of power supply noise, as well as simplify system design by removing the need for global timing closure. This work presents a 4mm^2 distributed accelerator engine with 19 independent clock domains implemented in a 16nm process. Local adaptive clock generators dynamically tolerate and mitigate power supply noise, resulting in a 10% improvement in performance at the same voltage compared to a globally-clocked baseline. Pausible bisynchronous FIFOs enable low-latency global communication across an on-chip network via error-free clock domain crossings. The SoC functions robustly across a wide range of voltages, frequencies, and workloads, demonstrating the practical applicability of fine-grained GALS techniques for modern SoC design.
Matthew Fojtik, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Stephen G. Tell, Brian Zimmer, Tezaswi Raja, Kevin Zhou, William J. Dally, Brucek Khailany
2019 25th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC),
2019
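To make the adaptive-clocking argument above concrete, the first-order model below compares a fixed clock that must be margined for worst-case supply droop against a generator that tracks the instantaneous voltage. The delay model, voltages, and droop duty cycle are assumptions chosen purely for illustration, not measurements from the paper, so the printed gain will not match the reported 10% figure.

```cpp
// First-order sketch of why an adaptive clock recovers supply-noise margin.
// Gate delay is modeled as inversely proportional to (V - Vt); a fixed clock
// must run at the worst-case (droop) period, while an adaptive generator
// tracks the instantaneous voltage. Constants are illustrative assumptions.
#include <cstdio>

double period_ns(double v) { return 1.0 / (v - 0.35); }  // toy delay model, Vt = 0.35 V

int main() {
    const double v_nom = 0.80, v_droop = 0.70;  // nominal and worst-case supply
    const double window_ns = 1000.0;            // observation window

    // Fixed-frequency clock: every cycle pays the worst-case droop period.
    double fixed_cycles = window_ns / period_ns(v_droop);

    // Adaptive clock: supply droops for 10% of the window; the generator
    // slows only during the droop and runs at the nominal period otherwise.
    double adaptive_cycles = 0.9 * window_ns / period_ns(v_nom)
                           + 0.1 * window_ns / period_ns(v_droop);

    std::printf("fixed-margin cycles: %.0f\n", fixed_cycles);
    std::printf("adaptive cycles:     %.0f\n", adaptive_cycles);
    std::printf("throughput gain:     %.1f%%\n",
                100.0 * (adaptive_cycles / fixed_cycles - 1.0));
    return 0;
}
```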
A high-productivity digital VLSI flow for designing complex SoCs is presented. The flow includes high-level synthesis tools, an object-oriented library of synthesizable SystemC and C++ components, and a modular VLSI physical design approach based on fine-grained globally asynchronous locally synchronous (GALS) clocking. The flow was demonstrated on a 16nm FinFET testchip targeting machine learning and computer vision.
Brucek Khailany, Evgeni Krimer, Rangharajan Venkatesan, Jason Clemons, Joel S. Emer, Matthew Fojtik, Alicia Klinefelter, Michael Pellauer, Nathaniel Pinckney, Yakun Sophia Shao, Shreesha Srinath, Christopher Torng, Sam (Likun) Xi, Yanqing Zhang, Brian Zimmer
Proceedings of the 55th Annual Design Automation Conference (DAC),
2018
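For a flavor of what a synthesizable component in such an object-oriented SystemC/C++ library might look like, here is a minimal registered multiply-accumulate stage written against the standard SystemC API. It is a generic sketch under that assumption, not code from the paper's library, and the module and signal names are hypothetical.

```cpp
// Minimal SystemC sketch of a registered multiply-accumulate stage, the kind
// of small synthesizable building block an HLS component library provides.
// Generic illustration only; not taken from the flow described in the paper.
#include <systemc.h>

SC_MODULE(MacStage) {
    sc_in<bool> clk;        // clock input
    sc_in<int>  a, b;       // operand inputs
    sc_out<int> acc;        // running accumulation output

    int state = 0;          // accumulator register

    void step() {
        state += a.read() * b.read();  // accumulate a*b once per clock edge
        acc.write(state);
    }

    SC_CTOR(MacStage) {
        SC_METHOD(step);
        sensitive << clk.pos();
    }
};

int sc_main(int, char**) {
    sc_clock clk("clk", 10, SC_NS);
    sc_signal<int> a, b, acc;
    MacStage mac("mac");
    mac.clk(clk); mac.a(a); mac.b(b); mac.acc(acc);

    a.write(3); b.write(4);
    sc_start(50, SC_NS);    // run ~5 clock cycles
    std::cout << "acc = " << acc.read() << std::endl;  // accumulated a*b
    return 0;
}
```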
Near-threshold operation provides a powerful knob for improving energy efficiency and alleviating on-chip power densities. This article explores the impact of modern CMOS technologies, from 40 nm planar to 7 nm FinFET, on near-threshold computing in terms of performance and energy efficiency.
N. Pinckney, S. Jeloka, R. Dreslinski, T. Mudge, D. Sylvester, D. Blaauw, L. Shifren, B. Cline, S. Sinha
IEEE Design & Test,
2017
In recent years, operating at near-threshold supply voltages has been proposed to improve energy efficiency in circuits, yet decreased efficacy of dynamic voltage scaling has been observed in recent planar technologies. Meanwhile, foundries have shifted from planar to FinFET fabrication processes. In this paper, we study the ability of 7nm FinFET to voltage scale and compare it to planar technologies across three dynamic voltage scaling scenarios. The switch to FinFET allows a return to strong voltage scalability: we find up to 8.6× higher energy efficiency at near-threshold compared to nominal supply voltage (vs. a 4.8× gain in 20nm planar).
N. Pinckney, L. Shifren, B. Cline, S. Sinha, S. Jeloka, R. G. Dreslinski, T. Mudge, D. Sylvester, D. Blaauw
2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC),
2016
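The scaling argument in the two near-threshold entries above can be sketched with a first-order model: dynamic energy per operation falls as V^2, while frequency falls roughly per the alpha-power law, so a fixed leakage power claims a growing share of each (longer) cycle. The constants below, threshold voltage, alpha exponent, and leakage fraction, are illustrative assumptions rather than the papers' extracted device models, so the printed gain will not reproduce the reported 8.6× figure.

```cpp
// First-order model of near-threshold energy scaling: E_dyn ~ C*V^2 and
// f ~ (V - Vt)^alpha / V (alpha-power law), with leakage energy per op
// growing as frequency drops. Constants are illustrative assumptions.
#include <cstdio>
#include <cmath>

int main() {
    const double vt = 0.30, alpha = 1.3;   // assumed threshold and alpha exponent
    const double v_nom = 0.90, v_nt = 0.45;

    auto freq  = [&](double v) { return std::pow(v - vt, alpha) / v; };  // relative f
    auto e_dyn = [&](double v) { return v * v; };                        // relative C*V^2

    // Leakage energy per op = P_leak / f; P_leak is assumed to be a fixed
    // fraction of dynamic power at the nominal operating point.
    const double leak_frac_nom = 0.10;
    auto e_op = [&](double v) {
        return e_dyn(v) + leak_frac_nom * e_dyn(v_nom) * freq(v_nom) / freq(v);
    };

    std::printf("f(NT)/f(nom)    = %.2f\n", freq(v_nt) / freq(v_nom));
    std::printf("E/op gain at NT = %.2fx\n", e_op(v_nom) / e_op(v_nt));
    return 0;
}
```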