Recent Publications

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G. Tell, Yanqing Zhang, William J. Dally, Joel Emer, C. Thomas Gray, Brucek Khailany, Stephen W. Keckler

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology

Rangharajan Venkatesan, Yakun Sophia Shao, Brian Zimmer, Jason Clemons, Matt Fojtik, Ted Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina (Stanford), Stephen Tell, Yanqing Zhang, William Dally, Joel Emer, Tom Gray, Steve Keckler, Brucek Khailany

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm

Brian Zimmer, Rangharajan Venkatesan, Yakun Sophia Shao, Jason Clemons, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, Stephen G Tell, Yanqing Zhang, William J Dally, Joel S Emer, C Thomas Gray, Stephen W Keckler, Brucek Khailany

Selected Publications

Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application area with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with batch size of one, delivering inference latency of 0.50 ms.

Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52), 2019
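The latency figure in the abstract follows from the reported throughput: at a batch size of one, each image is a batch by itself, so latency is the reciprocal of throughput. A minimal sketch of that arithmetic, assuming images are processed strictly one after another (the function name `latency_ms` is illustrative, not from the paper):

```python
def latency_ms(images_per_second: float, batch_size: int = 1) -> float:
    """Per-batch latency in milliseconds.

    Assumes batches run back to back with no overlap between them,
    so one batch takes batch_size / throughput seconds.
    """
    return batch_size / images_per_second * 1e3


# Figures quoted in the abstract: 1988 images/s on ResNet-50, batch size 1.
simba_latency = latency_ms(1988)
print(f"{simba_latency:.2f} ms")  # prints "0.50 ms"
```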

A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm

This work presents a scalable deep neural network (DNN) accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic die are limited to specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. The 16nm prototype achieves 1.29 TOPS/mm^2, 0.11 pJ/op energy efficiency, 4.01 TOPS peak performance for a 1-chip system, and 127.8 peak TOPS and 2615 images/s ResNet50 inference for a 36-chip system.

2019 Symposium on VLSI Circuits (VLSI), 2019
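The headline numbers can be related with simple unit arithmetic. Energy per operation in pJ/op is the reciprocal of efficiency in TOPS/W, and dividing the 36-chip peak by the chip count gives the average per-chip throughput at that operating point. The sketch below is only a unit conversion; it assumes nothing about the voltage or frequency at which each figure was measured, and the helper name `tops_per_watt` is illustrative:

```python
def tops_per_watt(pj_per_op: float) -> float:
    """Convert energy per operation (pJ/op) to efficiency (TOPS/W).

    1 TOPS/W is 1e12 ops per joule and 1 pJ is 1e-12 J,
    so the two quantities are plain reciprocals.
    """
    return 1.0 / pj_per_op


# Figures quoted in the abstract.
efficiency = tops_per_watt(0.11)   # best-case energy efficiency
per_chip_tops = 127.8 / 36         # average per chip at 36-chip peak

print(f"{efficiency:.2f} TOPS/W")           # prints "9.09 TOPS/W"
print(f"{per_chip_tops:.2f} TOPS per chip")  # prints "3.55 TOPS per chip"
```

The per-chip average at the 36-chip peak (3.55 TOPS) is below the 4.01 TOPS single-chip peak, which suggests the two peaks were not taken at the same operating point; the abstract does not say why.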

A Fine-Grained GALS SoC with Pausible Adaptive Clocking in 16 nm FinFET

Modern SoCs suffer from power supply noise that can require significant additional timing margin, reducing performance and energy efficiency. Globally asynchronous, locally synchronous (GALS) systems can mitigate the impact of power supply noise, as well as simplify system design by removing the need for global timing closure. This work presents a 4mm^2 distributed accelerator engine with 19 independent clock domains implemented in a 16nm process. Local adaptive clock generators dynamically tolerate and mitigate power supply noise, resulting in a 10% improvement in performance at the same voltage compared to a globally-clocked baseline. Pausible bisynchronous FIFOs enable low-latency global communication across an on-chip network via error-free clock domain crossings. The SoC functions robustly across a wide range of voltages, frequencies, and workloads, demonstrating the practical applicability of fine-grained GALS techniques for modern SoC design.

Matthew Fojtik, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Stephen G. Tell, Brian Zimmer, Tezaswi Raja, Kevin Zhou, William J. Dally, Brucek Khailany

2019 25th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), 2019

A Modular Digital VLSI Flow for High-Productivity SoC Design

A high-productivity digital VLSI flow for designing complex SoCs is presented. The flow includes high-level synthesis tools, an object-oriented library of synthesizable SystemC and C++ components, and a modular VLSI physical design approach based on fine-grained globally asynchronous locally synchronous (GALS) clocking. The flow was demonstrated on a 16nm FinFET testchip targeting machine learning and computer vision.

Brucek Khailany, Evgeni Krimer, Rangharajan Venkatesan, Jason Clemons, Joel S. Emer, Matthew Fojtik, Alicia Klinefelter, Michael Pellauer, Nathaniel Pinckney, Yakun Sophia Shao, Shreesha Srinath, Christopher Torng, Sam (Likun) Xi, Yanqing Zhang, Brian Zimmer

Proceedings of the 55th Annual Design Automation Conference (DAC), 2018

Impact of FinFET on Near-Threshold Voltage Scalability

Near-threshold operations provide a powerful knob for improving energy efficiency and alleviating on-chip power densities. This article explores the impact of the newest FinFET CMOS technologies (from 40 to 7 nm) on near-threshold computing in terms of performance and energy efficiency.

N. Pinckney, S. Jeloka, R. Dreslinski, T. Mudge, D. Sylvester, D. Blaauw, L. Shifren, B. Cline, S. Sinha

IEEE Design & Test, 2017

Near-threshold computing in FinFET technologies: Opportunities for improved voltage scalability

In recent years, operating at near-threshold supply voltages has been proposed to improve energy efficiency in circuits, yet decreased efficacy of dynamic voltage scaling has been observed in recent planar technologies. However, foundries have introduced a shift from planar to FinFET fabrication processes. In this paper, we study 7nm FinFET’s ability to voltage scale and compare it to planar technologies across three dynamic voltage scaling scenarios. The switch to FinFET allows for a return to strong voltage scalability. We find up to 8.6× higher energy efficiency at NT compared to nominal supply voltage (vs. 4.8× gain in 20nm planar).

N. Pinckney, L. Shifren, B. Cline, S. Sinha, S. Jeloka, R. G. Dreslinski, T. Mudge, D. Sylvester, D. Blaauw

2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), 2016

Nathaniel Pinckney

Senior Research Scientist

NVIDIA
