The 27th Asia and South Pacific Design Automation Conference
Technical Program

  • Full presentation videos including Regular Sessions, Special Sessions, Designers' Forum, and University Design Contest will be available on demand from January 10 to January 28, 2022.
  • The official conference will be held online according to the timetable below. For Regular Sessions, Special Sessions, and the University Design Contest, pitch talk videos will be played, followed by live Q&A via WebEx.
  • Please participate in the live Opening Session, Keynotes, and Tutorials.
  • The time zone is TST (UTC+8:00).
  • The presenter of each paper is marked with "*".

Technical Program:   SIMPLE version   DETAILED version with abstract
Author Index:   HERE

Session Schedule

Monday, January 17, 2022

Room 1 | Room 2 | Room 3
T1  Tutorial-1
9:00 - 12:00
T2  Tutorial-2
9:00 - 12:00
T3  Tutorial-3
9:00 - 12:00
T4  Tutorial-4
13:30 - 16:30
T5  Tutorial-5
13:30 - 16:30
T6  Tutorial-6
13:30 - 16:30

Tuesday, January 18, 2022

Room A | Room B | Room C | Room D | Room E
1K  (Room S)
Opening and Keynote Session I

8:20 - 10:00
1A  University Design Contest-1
10:00 - 10:35
1B  (SS-1) New Advances towards Building Secure Computer Architectures
10:00 - 10:35
1C  Research Paradigm in Approximate and Neuromorphic Computing
10:00 - 10:35
1D  New Design Techniques for Emerging Challenges in Microfluidic Biochips
10:00 - 10:35
1E  Advances in Machine Learning Assisted Analog Circuit Sizing
10:00 - 10:35
2A  University Design Contest-2
10:35 - 11:10
2B  (SS-2) Analog Circuit and Layout Synthesis: Advancement and Prospect
10:35 - 11:10
2C  Low-cost and Memory-Efficient Deep Learning
10:35 - 11:10
2D  High-level Verification and Application
10:35 - 11:10
2E  Design for Manufacturing and Signal Integrity
10:35 - 11:10
3A  (DF-1) Key Drivers of Global Hardware Security
11:10 - 11:55
3B  Analysis and optimization for timing, power, and reliability
11:10 - 11:55
3C  Advanced Machine Learning with Emerging Technologies
11:10 - 11:55
3D  Software Solutions for Heterogeneous Embedded Architectures
11:10 - 11:55

Wednesday, January 19, 2022

Room A | Room B | Room C | Room D
2K  (Room S)
Keynote Session II

9:00 - 10:00
4A  (SS-3) Technology Advancements inside the Edge Computing Paradigm and using the Machine Learning Techniques
10:00 - 10:35
4B  Recent Advances in Placement Techniques
10:00 - 10:35
4C  Emerging Trends in Stochastic Computing
10:00 - 10:35
4D  Efficient Techniques for Emerging Applications
10:00 - 10:35
5A  (DF-2) Compiler and Toolchain for Efficient AI Computation
10:35 - 11:10
5B  Moving frontiers of test and simulation
10:35 - 11:10
5C  Optimizations in Modern Memory Architecture
10:35 - 11:10
5D  Novel Boolean Optimization and Mapping
10:35 - 11:10
6A  (DF-3) AI for Chip Design and Testing
11:10 - 11:45
6B  Towards Reliable and Secure Circuits: Cross Perspectives
11:10 - 11:45
6C  Accelerator Architectures for Machine Learning
11:10 - 11:45
6D  Quantum and Reconfigurable Computing
11:10 - 11:45
1W  (Room W)
Cadence Training Workshop

13:30 - 16:30

Thursday, January 20, 2022

Room A | Room B | Room C | Room D
3K  (Room S)
Keynote Session III

9:00 - 10:00
7A  (SS-4) Reshaping the Future of Physical and Circuit Design, Power and Memory with Machine Learning
10:00 - 10:35
7B  Advances in Analog Design Methodologies
10:00 - 10:35
7C  Low-Energy Edge AI Computing
10:00 - 10:35
7D  Emerging Technologies in Embedded Systems and Cyber-Physical Systems
10:00 - 10:35
8A  (DF-4) Empowering AI through Innovative Computing
10:35 - 11:10
8B  Advances in VLSI Routing
10:35 - 11:10
8C  Machine Learning with Crossbar Memories
10:35 - 11:10
8D  High Level Synthesis, CGRA mapping and P&R for hotspot mitigation
10:35 - 11:10
9A  (SS-5) Artificial Intelligence on Back-End EDA: Panacea or One-Trick Pony?
11:10 - 11:45
9B  Side Channel Leakage: Characterization and Protection
11:10 - 11:45
9C  Emerging Non-volatile Memory-based In-Memory Computing
11:10 - 11:45
9D  System Level Design of Learning Systems
11:10 - 11:45

DF: Designers' Forum, SS: Special Session

List of papers

Monday, January 17, 2022

Session T1  Tutorial-1
Time: 9:00 - 12:00, Monday, January 17, 2022
Location: Room 1

Title: (Tutorial) IEEE CEDA DATC RDF and METRICS2.1: Toward a Standard Platform for ML-Enabled EDA and IC Design
Author: Jinwook Jung (IBM Research, USA), Andrew B. Kahng, Seungwon Kim, Ravi Varadarajan (UCSD, USA)
Abstract: Machine learning (ML) for IC design often faces the challenge of "small data": it takes a huge amount of time and effort to run multiple P&R flows with various tool settings, constraints, and parameters to obtain useful training data for ML-enabled EDA. Systematic and scalable execution of hardware design experiments, together with standards for sharing data and models, is therefore an essential element of ML-based EDA and chip design. In this tutorial, we describe the effort taken in the IEEE CEDA Design Automation Technical Committee (DATC) toward a standard platform for ML-enabled EDA and IC design. We first overview the challenges in ML-enabled EDA and review related previous efforts. We then present DATC RDF and METRICS2.1, followed by hands-on working examples of (1) large-scale design experiments via cloud deployment, (2) extracting, collecting, and analyzing METRICS data from large-scale experiment results, and (3) a flow auto-tuning framework for PPA optimization via METRICS2.1. We will provide the working examples used throughout the tutorial via a public code repository, which includes a full RTL-to-GDS flow, codebases, Jupyter notebooks, and cloud deployment sample code.

Session T2  Tutorial-2
Time: 9:00 - 12:00, Monday, January 17, 2022
Location: Room 2

Title: (Tutorial) Low-bit Neural Network Computing: Algorithms and Hardware
Author: Zidong Du (Institute of Computing Technology, Chinese Academy of Sciences, China), Haojin Yang (Hasso-Plattner-Institute, Germany), Kai Han (Huawei Technology, China)
Keyword: Neural Network Computing
Abstract: In recent years, deep learning technologies have achieved excellent performance and many breakthroughs in both academia and industry. However, state-of-the-art deep models are computationally expensive and consume large storage space. Deep learning is also strongly demanded by numerous applications in areas such as mobile platforms, wearable devices, autonomous robots, and IoT devices. How to efficiently apply deep models on such low-power devices is a challenging research problem. Low-bit Neural Network (NN) computing has therefore received much attention, due to its potential to reduce the storage and computation complexity of NN inference and training. This tutorial introduces existing efforts on low-bit NN computing in three parts: 1) algorithms towards more accurate low-bit NNs; 2) binary neural network design and inference; 3) low-bit training of NNs.
First, quantized neural networks with low-bit weights and activations are attractive for developing AI accelerators. However, the performance of low-bit neural networks is usually much worse than that of their full-precision counterparts, for two reasons: the quantization functions are non-differentiable, which increases the optimization difficulty of quantized networks; and the representation capacity of low-bit values is limited, which bounds the performance of quantized networks. This talk will introduce several methods to obtain high-performance low-bit neural networks, including better optimization methods, better network architectures, and a new quantized adder operator.
Second, this talk will present the recent progress of binary neural networks (BNNs), including the development process and the latest model designs. It will also present preliminary verification results of BNNs using hardware accelerator simulation in terms of accuracy and energy consumption. The presenter will further provide an outlook on the prospects and challenges of AI accelerators based on BNNs.
Finally, training CNNs is time-consuming and energy-hungry. Many studies have shown low-bit formats to be promising for speeding up and improving the energy efficiency of CNN inference, but it is harder for the training phase of CNNs to benefit from such techniques. This talk will introduce the challenges and recent progress of low-bit NN training, and will elaborate on hardware architecture design principles for efficient quantized training.
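As a flavor of what parts 1 and 3 cover: the core of low-bit computing is mapping full-precision values onto a small integer grid. The sketch below is a generic uniform symmetric quantizer, not the presenters' code; the tensor `w` and the 4-bit width are arbitrary illustrative choices.

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform symmetric quantization: map floats onto a signed low-bit grid."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 levels on each side for 4-bit
    scale = np.max(np.abs(x)) / qmax     # one scale factor per tensor
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

w = np.array([0.31, -0.72, 0.05, 0.44], dtype=np.float32)
q, s = quantize(w, bits=4)
w_hat = dequantize(q, s)                 # reconstruction error is at most scale/2
```

The non-differentiable `round` here is exactly what makes optimizing quantized networks hard, which is why the algorithm part of the tutorial discusses better optimization methods.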

Session T3  Tutorial-3
Time: 9:00 - 12:00, Monday, January 17, 2022
Location: Room 3

Title: (Tutorial) Side Channel Analysis: from Concepts to Simulation and Silicon Validation
Author: Makoto Nagata (Kobe University, Japan), Lang Lin (ANSYS Inc, USA), Yier Jin (University of Florida, USA)
Keyword: Side Channel Analysis
Abstract: Since the reports of simple and differential power analysis in the late 1990s, side-channel analysis (SCA) has been one of the most important and well-studied topics in hardware security. In this tutorial, we will share our insights and experience on SCA through a combination of presentations, embedded demos, and an interactive panel discussion. The three speakers, from academia and industry, have rich experience and a solid track record in hardware security research and practice.
We will start the tutorial with a comprehensive introduction to SCA, including the popular side channels that have been exploited by attackers, common countermeasures, and simulation-based SCA with commercial EDA tools. Then we will present industry-proven flows for fast and effective pre-silicon side-channel leakage analysis (SCLA), with a focus on physical-level power and electromagnetic (EM) side channels. Next, we elaborate on how to perform on-chip and in-system side-channel leakage measurements and assessments with system-level assembly options on crypto silicon chips, with the help of embedded on-chip noise monitor circuits. We will conclude the presentations with some forward-looking discussion on emerging topics such as SCA for security, SCA in AI and machine learning (ML), and pre-silicon SCLA assisted by AI/ML. Multiple short video clips will be embedded to showcase SCA by simulation and silicon measurement.
No prior knowledge is required to attend this tutorial. The audience is expected to learn the foundations and the state of the art in SCA, along with some hands-on skills.
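The differential power analysis mentioned above can be demonstrated in a few lines. This toy simulates Hamming-weight power leakage of `SECRET XOR plaintext` and recovers the key byte by correlating each guess's predicted leakage against the traces. All names, the noise level, and the trace count are illustrative assumptions, not material from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(1)
SECRET = 0x3A                       # hypothetical key byte to recover

def hw(x):
    """Hamming weight: the classic power model for CMOS switching."""
    return bin(x).count("1")

# Simulated power traces: leakage = HW(secret ^ plaintext) + Gaussian noise.
plaintexts = rng.integers(0, 256, size=2000)
traces = np.array([hw(SECRET ^ int(p)) for p in plaintexts]) + rng.normal(0, 1.0, 2000)

def recover_key(plaintexts, traces):
    """Correlation power analysis: the guess whose predicted leakage
    best correlates with the measured traces is reported as the key."""
    def corr(guess):
        pred = np.array([hw(guess ^ int(p)) for p in plaintexts])
        return np.corrcoef(pred, traces)[0, 1]
    return max(range(256), key=corr)
```

Pre-silicon SCLA flows apply the same statistical test, but to traces produced by power simulation of the netlist instead of silicon measurement.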

Session T4  Tutorial-4
Time: 13:30 - 16:30, Monday, January 17, 2022
Location: Room 1

Title: (Tutorial) New Techniques in Variational Quantum Algorithms and Their Applications
Author: Tamiya Onodera, Atsushi Matsuo, Rudy Raymond (IBM Research, Japan)
Keyword: Quantum Computing
Abstract: Variational Quantum Algorithms (VQAs) are important and promising quantum algorithms applicable to near-term quantum devices. Their applications include optimization, machine learning, and quantum chemistry. This tutorial introduces new techniques in VQAs, emphasizing the design of their quantum circuits. It starts with a general introduction to IBM Quantum devices and their programming environment. Next, the design of parameterized quantum circuits in VQAs for optimization and machine learning is discussed. Finally, we will present some open problems that may be of interest to the ASP-DAC community.
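As a minimal taste of how a VQA trains a parameterized circuit, the sketch below optimizes a single RY rotation angle to minimize the energy ⟨Z⟩, using the parameter-shift rule to obtain exact gradients. It is a from-scratch toy with the statevector worked out by hand, not IBM's tooling; the learning rate and iteration count are arbitrary.

```python
import math

def expectation_z(theta):
    """<Z> for RY(theta)|0> = [cos(t/2), sin(t/2)]^T, i.e. P(0) - P(1)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return c * c - s * s

def grad(theta):
    """Parameter-shift rule: an exact gradient from two shifted circuit runs."""
    return 0.5 * (expectation_z(theta + math.pi / 2) - expectation_z(theta - math.pi / 2))

theta = 0.3                              # arbitrary starting parameter
for _ in range(200):                     # plain gradient descent
    theta -= 0.4 * grad(theta)
# theta drifts toward pi, where <Z> reaches its minimum of -1
```

On real hardware each `expectation_z` call would be a sampled circuit execution, which is why VQAs are a hybrid quantum-classical loop.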

Session T5  Tutorial-5
Time: 13:30 - 16:30, Monday, January 17, 2022
Location: Room 2

Title: (Tutorial) Towards Efficient Computation for Sparsity in Future Artificial Intelligence
Author: Fei Sun (Alibaba Group, China), Dacheng Liang (Biren Technology, China), Yu Wang, Guohao Dai (Tsinghua University, China)
Keyword: Artificial Intelligence
Abstract: With the fast development of Artificial Intelligence (AI), introducing sparsity has become a key enabler for both practical deployment and efficient training in various domains. On the other hand, sparsity does not necessarily translate to efficiency: sparse computation often loses to its dense counterpart in terms of throughput. To enable efficient AI computation based on sparse workloads in the future, this tutorial focuses on the following issues. (1) From the algorithm perspective: compressing deep learning models by introducing zeros into the model weights has proven to be an effective way to reduce model complexity, and it introduces sparsity to the model. The introduction of fine-grained 2:4 sparsity by NVIDIA A100 GPUs has renewed interest in efficient sparse formats on commercial hardware. This tutorial compares different pruning methodologies and introduces several sparse representations that are efficient on CPUs and GPUs. (2) From the kernel perspective: many AI applications can be decomposed into several key sparse kernels, so the problem reduces to accelerating those kernels. Unlike dense kernels such as general matrix multiplication (GEMM), which can utilize the peak performance of GPU hardware, sparse kernels such as sparse matrix-matrix multiplication (SpMM) achieve low FLOP utilization, and performance depends closely on the implementation. In this tutorial, we will introduce how to optimize sparse kernels on GPUs with several simple but effective methods. (3) From the hardware perspective: in existing architectures, the efficiency of dense GEMM operations keeps improving and their performance is constantly being squeezed, so designers are placing sparse tensor processing in an increasingly important position to improve computing efficiency and reduce energy consumption.
The main challenges of sparse tensor processing are limited bandwidth, irregular memory access, and fine-grained data processing. In this tutorial, we will present some Domain-Specific Architectures (DSAs) for sparse tensors in GPGPUs and provide an overview of development trends.
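Issue (2) above is easiest to see concretely: unlike GEMM, an SpMM kernel must chase explicit index structures. The sketch below is a generic CSR-format SpMM in plain Python/NumPy, purely illustrative (real kernels are tiled and vectorized on the GPU); the matrices are arbitrary examples.

```python
import numpy as np

def dense_to_csr(a):
    """Compress a dense matrix into CSR: values, column indices, row pointers."""
    vals, cols, ptrs = [], [], [0]
    for row in a:
        nz = np.nonzero(row)[0]
        vals.extend(row[nz])
        cols.extend(nz)
        ptrs.append(len(vals))
    return np.array(vals), np.array(cols), np.array(ptrs)

def spmm(vals, cols, ptrs, b):
    """Sparse x dense product: only the nonzeros of each row are touched."""
    out = np.zeros((len(ptrs) - 1, b.shape[1]))
    for i in range(len(ptrs) - 1):
        for k in range(ptrs[i], ptrs[i + 1]):
            out[i] += vals[k] * b[cols[k]]
    return out

a = np.array([[0., 2., 0.], [1., 0., 3.]])
b = np.arange(6.).reshape(3, 2)
c = spmm(*dense_to_csr(a), b)            # matches a @ b
```

The indirect loads through `cols` are the irregular memory accesses that the hardware part of the tutorial targets.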

Session T6  Tutorial-6
Time: 13:30 - 16:30, Monday, January 17, 2022
Location: Room 3

Title: (Tutorial) Scan-based DfT: Mitigating its Security Vulnerabilities and Building Security Primitives
Author: Aijiao Cui (Harbin Institute of Technology, China), Gang Qu (University of Maryland, USA)
Keyword: Hardware Security
Abstract: The scan chain is one of the most powerful and popular design-for-test (DfT) technologies, as it gives test engineers unrestricted access to the internal states of the core under test. This same convenience has also made the scan chain an exploitable side channel for attackers to steal the cipher key of a cryptographic core or defeat combinational logic designed for obfuscation. In this tutorial, we will first present the preliminaries of scan-based DfT technology. Then we will illustrate the vulnerabilities to scan side-channel attacks and SAT attacks, and review existing countermeasures on how to design secure scan-based DfT to resist these attacks. Next, we will discuss how to use scan-based DfT as a security primitive to provide solutions for several hardware security problems, including hardware intellectual property protection, physical unclonable functions, and device authentication.
This tutorial targets two groups of audience: (1) graduate students interested in IC testing (in particular, scan chains) and security, and (2) researchers and engineers from industry and academia working on IC testing and hardware security. No prior knowledge of scan-based DfT or security is required to attend this tutorial. The audience is expected to learn the foundations and the state of the art in secure scan design.

Tuesday, January 18, 2022

Session 1K  Opening and Keynote Session I
Time: 8:20 - 10:00, Tuesday, January 18, 2022
Location: Room S
Chair: Ting-Chi Wang (National Tsing Hua University, Taiwan)

1K-1 (Time: 8:20 - 9:00)
Title: ASP-DAC 2022 Opening:
1. Welcome by GC (Prof. Ting-Chi Wang)
2. Welcome by SC Chair (Prof. Shinji Kimura)
3. Program Report by TPC Chair (Prof. Masanori Hashimoto)
   3.1 Best Paper Award Presentation (Dr. Gi-Joon Nam)
   3.2 10-Year Retrospective Most Influential Paper Award Presentation (Dr. Gi-Joon Nam)
4. Designers' Forum Report by DF Co-Chairs (Prof. Hung-Pin Wen and Prof. Kai-Chiang Wu)
5. Design Contest Report by UDC Co-Chair (Prof. Ing-Chao Lin)
   5.1 UDC Award Presentation (Prof. Ing-Chao Lin)
6. Student Research Forum Report by SRF Chair (Prof. Lei Jiang)
7. IEEE CEDA Awards by CEDA President (Dr. Gi-Joon Nam)
   7.1 CEDA Outstanding Service Recognition (Dr. Gi-Joon Nam)
8. Welcome message for ASP-DAC 2023 by 2023 GC (Prof. Atsushi Takahashi)

Title: (Keynote Address) Boosting Productivity and Robustness in the SysMoore with a Triple-play of Hyperconvergency, Analytics, and AI Innovations
Author: *Shankar Krishnamoorthy (Synopsys, USA)
Abstract: The SysMoore era can be characterized as the widening gap between classic Moore's Law scaling and increasing system complexity. System-on-chip complexity has now given way to systems of chips, with the need for smaller process nodes and multi-die integration. With engineers now handling not just larger chip designs but systems comprised of multiple chips, user productivity and design robustness become major factors in getting designs to market in the fastest time and with the best possible PPA. Combining a hyperconvergent design flow with smart data analytics and AI-based solution-space exploration provides a huge benefit to the engineers tasked with completing these systems. This presentation outlines the challenges and the road to a triple-play solution that gets design engineers out of their late design-cycle jams.

Session 1A  University Design Contest-1
Time: 10:00 - 10:35, Tuesday, January 18, 2022
Location: Room A
Chairs: Ing-Chao Lin (National Cheng Kung University, Taiwan), Tsung-Te Liu (National Taiwan University, Taiwan)

Best Design Award
Title: A 0.5 mm2 Ambient Light-Driven Solar Cell-Powered Biofuel Cell-Input Biosensing System with LED Driving for Stand-Alone RF-Less Continuous Glucose Monitoring Contact Lens
Author: *Guowei Chen, Xinyang Yu, Yue Wang, Tran Minh Quan, Naofumi Matsuyama, Takuya Tsujimura, Kiichi Niitsu (Nagoya University, Japan)
Page: pp. 1 - 2
Keyword: Biofuel cell input, glucose monitoring, LED driving, smart contact lens, solar cell
Abstract: This work presents the first solar cell (SC)-powered, biofuel cell (BFC)-input biosensing system, implemented in 65 nm CMOS with pulse interval modulation (PIM) and pulse density modulation (PDM) LED driving capability, for stand-alone RF-less continuous glucose monitoring (CGM) contact lenses, which notify diabetes patients of their CGM level without any external devices. The LED implementation eliminates the need for wireless communication, and the power supply from on-lens SCs eliminates the need for wireless power delivery, enabling fully stand-alone operation under office-room ambient light.

Title: A 76-81 GHz FMCW 2TX/3RX Radar Transceiver with Integrated Mixed-Mode PLL and Series-Fed Patch Antenna Array
Author: *Taikun Ma, Wei Deng, Haikun Jia (School of Integrated Circuits, Tsinghua University, China), Yejun He (College of Electronics and Information Engineering, Shenzhen University, China), Baoyong Chi (School of Integrated Circuits, Tsinghua University, China)
Page: pp. 3 - 4
Keyword: frequency-modulated continuous wave (FMCW), millimeter-wave (mm-wave) radar, mixed-mode PLL, series-fed patch antenna
Abstract: This paper presents a 76-81 GHz FMCW MIMO radar transceiver with a mixed-mode PLL. Utilizing a series-fed patch antenna array, a prototype system is developed based on the proposed transceiver. On-chip measurements show that reconfigurable sawtooth chirps can be generated with a bandwidth of up to 4 GHz and a period as short as 30 µs. Real-time experiments demonstrate that the prototype MIMO radar is capable of target detection and achieves an angular resolution of 9 degrees.

Title: A 5.2GHz RFID Chip Contactlessly Mountable on FPC at Any 90-Degree Rotation and Face Orientation
Author: *Reiji Miura, Saito Shibata, Masahiro Usui, Atsutake Kosuge, Mototsugu Hamada, Tadahiro Kuroda (The University of Tokyo, Japan)
Page: pp. 5 - 6
Keyword: RFID, 5.2GHz, Contactless, Low-cost
Abstract: This paper presents an RFID chip contactlessly mountable on an FPC that carries an antenna pattern. Inductive coupling between the FPC and the chip enables a low-cost, bondingless implementation. The chip can also be placed on the FPC at any angle of 0/90/180/270 degrees, face-up or face-down. Simulation shows that the antenna gain is almost the same irrespective of the chip placement angle and face orientation. Experimental results confirm that the proposed RFID chip works at up to 20 cm from a reader whose output power is 15 dBm, achieving the same figure of merit as a conventionally bonded module.

Title: A 40nm CMOS SoC for Real-Time Dysarthric Voice Conversion of Stroke Patients
Author: *Tay-Jyi Lin, Chen-Zong Liao, You-Jia Hu, Wei-Cheng Hsu, Zheng-Xian Wu, Shao-Yu Wang (National Chung Cheng University, Taiwan), Chun-Ming Huang (Taiwan Semiconductor Research Institute, Taiwan), Ying-Hui Lai (National Yang Ming Chiao Tung University, Taiwan), Chingwei Yeh, Jinn-Shyan Wang (National Chung Cheng University, Taiwan)
Page: pp. 7 - 8
Keyword: low-power, AI, DNN, voice conversion, RISC-V
Abstract: This paper presents the first dysarthric voice conversion SoC, which translates stroke patients' voice into more intelligible, clearer speech in real time. The SoC is composed of a RISC-V MPU and a compact DNN engine with a single 16-bit multiply-accumulator, which improves performance by 12× and energy efficiency by more than 100×, and has been implemented in 40nm CMOS. The silicon area is 0.68×0.79 mm2, and the measured power is 18.4 mW for converting 3 s of dysarthric voice within 0.5 s (at 200 MHz and 0.8 V) and 4.8 mW for conversion in under 1 s (at 100 MHz and 0.6 V).

Title: A Side-Channel Hardware Trojan in 65nm CMOS with 2µW precision and Multi-bit Leakage Capability
Author: *Tiago Perez, Samuel Pagliarini (Tallinn University of Technology (TalTech), Estonia)
Page: pp. 9 - 10
Keyword: Hardware security, Hardware Trojan, Side-channel Attack
Abstract: In this work, we propose a novel architecture for a side-channel trojan (SCT) capable of leaking multiple bits per power-signature reading. The trojan is inserted using a novel framework featuring an Engineering Change Order (ECO) flow. To assess our methodology, a test chip comprising two versions of the AES and two of the Present (PST) crypto cores was manufactured in a 65nm commercial technology. Our hardware validation results demonstrate that keys are successfully leaked by creating microwatt-sized shifts in the power consumption.

Session 1B  (SS-1) New Advances towards Building Secure Computer Architectures
Time: 10:00 - 10:35, Tuesday, January 18, 2022
Location: Room B
Chair: Aijiao Cui (Harbin Institute of Technology, China)

Title: (Invited Paper) SC-K9: A Self-synchronizing Framework to Counter Micro-architectural Side Channels
Author: *Hongyu Fang, Milos Doroslovacki, Guru Venkataramani (George Washington University, USA)
Page: pp. 11 - 18
Keyword: Side Channels, Security Monitoring, Cyber-Deception, Secure Computing Systems
Abstract: Side channels within the processor microarchitecture are notorious for their ability to leak information without leaving any physical traces for forensic examination. Most prior detection frameworks continuously sample a select subset of hardware events without attempting to understand the mechanics behind the side-channel activity. In this work, we propose SC-K9, a novel framework that synchronizes its sampling frequency with that of the adversary, thereby improving detection accuracy even when the frequency of attack operations varies with the specific implementation. We then deploy a hardware-based deception strategy to trick the adversary and annul its observations from the side-channel activities. We illustrate our design and demonstrate its effectiveness in identifying some of the potent side channels exposed by recent speculative execution attacks. Our experimental results show that SC-K9 can effectively spot adversaries in different operational modes and incurs a very low rate of false alarms on benign workloads.

Title: (Invited Paper) CacheGuard: A Behavior Model Checker for Cache Timing Side-Channel Security
Author: *Zihan Xu, Lingfeng Yin, Yongqiang Lyu, Haixia Wang (Tsinghua University, China), Gang Qu (University of Maryland, USA), Dongsheng Wang (Tsinghua University, China)
Page: pp. 19 - 24
Keyword: Secure Caches, Side Channels, Model Checking
Abstract: Defending against cache timing side channels has become a major concern in modern secure processor design. However, a formal method that can completely check whether a given cache design defends against timing side-channel attacks is still absent. This study presents CacheGuard, a behavior model checker for cache timing side-channel security. Compared to current state-of-the-art prose rule-based security analysis methods, CacheGuard covers the whole state space of a given cache design to discover unknown side-channel attacks. Checking a standard cache and state-of-the-art secure cache designs uncovers 5 new attack strategies, and potentially makes it possible to develop a timing side channel-safe cache with the aid of CacheGuard.

Title: (Invited Paper) Lightweight and Secure Branch Predictors against Spectre Attacks
Author: *Congcong Chen, Chaoqun Shen, Jiliang Zhang (College of Computer Science and Electronic Engineering, Hunan University, China)
Page: pp. 25 - 30
Keyword: Spectre defense, Branch predictors, Microarchitecture security
Abstract: Spectre attacks endanger most CPUs, operating systems, and cloud services due to the sharing of branch predictors in modern processors, while existing defenses fail to balance security and overhead. This paper designs a lightweight and secure branch predictor (LS-BP), which provides lightweight hardware isolation between same-address-space and cross-address-space branch entries, making it difficult for an attacker to establish branch conflicts. Experimental results show that the average performance overhead is less than 3% while strong protection is provided.

Title: (Invited Paper) Computation-in-Memory Accelerators for Secure Graph Database: Opportunities and Challenges
Author: *Md Tanvir Arafin (Morgan State University, USA)
Page: pp. 31 - 36
Keyword: computation-in-memory, graph computation, graph database, homomorphic encryption, privacy-preserving queries
Abstract: This work presents the challenges and opportunities in developing computing-in-memory (CIM) accelerators to support secure graph databases (GDBs). First, we examine the database backends of common GDBs to understand the feasibility of CIM-based hardware architectures for speeding up database queries. Then, we explore standard accelerator designs for graph computation. Next, we present the security issues of graph databases and survey how advanced cryptographic techniques such as homomorphic encryption and zero-knowledge protocols can execute privacy-preserving queries in a secure graph database. After that, we illustrate possible CIM architectures for integrating secure computation with GDB acceleration. Finally, we discuss the design overheads, usability, and potential challenges of building CIM-based accelerators for data-centric calculations. Overall, we find that computing-in-memory primitives have the potential to play a crucial role in realizing the next generation of fast and secure graph databases.

Session 1C  Research Paradigm in Approximate and Neuromorphic Computing
Time: 10:00 - 10:35, Tuesday, January 18, 2022
Location: Room C
Chairs: Xunzhao Yin (Zhejiang University, China), Lang Feng (Nanjing University, China)

Title: HEALM: Hardware-Efficient Approximate Logarithmic Multiplier with Reduced Error
Author: *Shuyuan Yu, Maliha Tasnim, Sheldon Tan (University of California, Riverside, USA)
Page: pp. 37 - 42
Keyword: Approximate Computing, Integer Multiplier, Low Power, Area Efficient
Abstract: In this work, we propose a new approximate logarithmic multiplier (ALM) based on a novel error compensation scheme. The proposed hardware-efficient ALM, named HEALM, first determines the truncation width for the mantissa summation. Error compensation or reduction is then performed via a lookup table that stores reduction factors for different regions of the input operands. This is in contrast to an existing approach in which error reduction is performed independently of the width truncation of the mantissa summation. As a result, the new design leads to more accurate results with both reduced area and power. Furthermore, unlike existing approaches, which either introduce resource overhead when improving error or lose accuracy when saving area and power, HEALM improves accuracy and resource consumption at the same time. Our study shows that 8-bit HEALM achieves up to 2.92%, 9.30%, 16.08%, and 17.61% improvements in mean error, peak error, area, and power consumption over REALM, the state-of-the-art work with the same number of truncated bits.
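For readers unfamiliar with the ALM family this paper builds on, the classic starting point is Mitchell's approximation, which replaces multiplication with addition in the log domain. The sketch below is that textbook baseline (floats are used for clarity; hardware versions keep everything fixed-point), not HEALM itself.

```python
def mitchell_multiply(a, b):
    """Mitchell's approximate logarithmic multiplication of positive integers.
    Uses log2(2**k * (1 + x)) ~= k + x, so a product becomes a sum of logs."""
    def log_approx(n):
        k = n.bit_length() - 1          # characteristic: position of the leading one
        x = n / (1 << k) - 1.0          # mantissa fraction in [0, 1)
        return k, x
    k1, x1 = log_approx(a)
    k2, x2 = log_approx(b)
    if x1 + x2 < 1.0:                   # antilogarithm: the classic two-case form
        return (1.0 + x1 + x2) * (1 << (k1 + k2))
    return (x1 + x2) * (1 << (k1 + k2 + 1))

# The result always underestimates the exact product, with relative error
# of at most ~11.1%, e.g. mitchell_multiply(3, 3) yields 8.0 against 9.
```

HEALM's contribution sits on top of this scheme: it truncates the mantissa summation and then compensates the resulting error with a small lookup table of per-region reduction factors.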

Best Paper Candidate
Title: DistriHD: A Memory Efficient Distributed Binary Hyperdimensional Computing Architecture for Image Classification
Author: *Dehua Liang, Jun Shiomi, Noriyuki Miura (Osaka University, Japan), Hiromitsu Awano (Kyoto University, Japan)
Page: pp. 43 - 49
Keyword: Brain-inspired Computing, Hyperdimensional Computing, Memory Efficiency, Distributed System
Abstract: Hyper-dimensional (HD) computing is a brain-inspired approach for efficient and fast learning on today's embedded devices. HD computing first encodes all data points into high-dimensional vectors called hypervectors and then efficiently performs the classification task using a well-defined set of operations. Although HD computing has achieved reasonable performance on several practical tasks, it comes with huge memory requirements, since each data point must be stored as a very long vector of thousands of bits. To alleviate this problem, we propose a novel HD computing architecture, called DistriHD, which can be trained and tested using binary hypervectors and achieves high accuracy in single-pass training mode with significantly lower hardware resources. DistriHD encodes data points into distributed binary hypervectors and eliminates the expensive item memory in the encoder, which significantly reduces the hardware cost of inference. Our evaluation also shows that our model achieves a 27.6× reduction in memory cost without hurting classification accuracy. The hardware implementation further demonstrates that DistriHD achieves over 9.9× and 28.8× reductions in area and power, respectively.
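The hypervector operations the abstract refers to are simple enough to show in full. The toy below encodes symbol sequences by bundling random bipolar item vectors and classifies by similarity to a per-class prototype — a generic single-pass HD pipeline. The alphabet, dimensionality, and classes are invented for illustration; this is not the DistriHD encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048                                  # hypervector dimensionality

# One random bipolar "item" hypervector per symbol; random vectors in high
# dimensions are nearly orthogonal, which is what makes HD computing work.
items = {s: rng.choice([-1, 1], size=D) for s in "abcdef"}

def encode(seq):
    """Bundle: elementwise-add the item vectors, then binarize back to +/-1."""
    return np.sign(sum(items[s] for s in seq) + 0.5).astype(np.int8)

def classify(query, prototypes):
    """Nearest class prototype by dot-product similarity (cast up from int8
    before the dot product to avoid integer overflow)."""
    return max(prototypes,
               key=lambda c: int(query.astype(np.int32) @ prototypes[c].astype(np.int32)))

# Single-pass "training": one bundled prototype per class.
protos = {"x": encode("aab"), "y": encode("dee")}
```

Because every stored vector is binary, the memory cost per prototype is D bits — the quantity DistriHD's distributed encoding is designed to shrink further.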

Title: Thermal-aware Layout Optimization and Mapping Methods for Resistive Neuromorphic Engines
Author: *Chengrui Zhang (ShanghaiTech University/Shanghai Engineering Research Center of Energy Efficient and Custom AI IC, China), Yu Ma (ShanghaiTech University/Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/Shanghai Engineering Research Center of Energy Efficient and Custom AI IC, China), Pingqiang Zhou (ShanghaiTech University/Shanghai Engineering Research Center of Energy Efficient and Custom AI IC, China)
Page: pp. 50 - 55
Keyword: neuromorphic computing, memristor, thermal issue, spiking neural network
Abstract: Resistive neuromorphic engines can accelerate spiking neural network tasks with memristor crossbars. However, the stored weights are influenced by temperature, which leads to accuracy and endurance degradation: the higher the temperature, the larger the influence. In this work, we propose a cross-array mapping method and a layout optimization method that reduce the thermal effect by taking into account the input distribution, weight values, and layout of the memristor crossbars. Experimental results show that our method reduces the peak temperature by up to 10.4 K and improves endurance by up to 1.72×.


Session 1D  New Design Techniques for Emerging Challenges in Microfluidic Biochips
Time: 10:00 - 10:35, Tuesday, January 18, 2022
Location: Room D
Chairs: Hailong Yao (Tsinghua University, China), Xing Huang (Technical University of Munich, Germany)

Title: NR-Router: Non-Regular Electrode Routing with Optimal Pin Selection for Electrowetting-on-Dielectric Chips
Authors: *Hsin-Chuan Huang, Chi-Chun Liang (National Tsing Hua University, Taiwan), Qining Wang (University of California, Los Angeles, USA), Xing Huang, Tsung-Yi Ho (National Tsing Hua University, Taiwan), Chang-Jin Kim (University of California, Los Angeles, USA)
Pages: 56-61
Keywords: Microfluidic biochips, Electronic design automation, EDA, Routing optimization, Biomedical sciences, Electrowetting-on-dielectric (EWOD) chip
Abstract: With the advances in microfluidics, electrowetting-on-dielectric (EWOD) chips have been widely applied to various laboratory procedures. Glass-based EWOD chips with non-regular electrodes have been proposed, which allow more reliable droplet operations and facilitate the integration of optical sensors for many biochemical applications. Moreover, non-regular electrode designs (e.g., interdigitated electrodes) are utilized in EWOD chips to precisely control droplet volume, and electrodes with specific shapes become necessary for certain applications. However, due to the technical barriers of fabricating multi-layer interconnections on a glass substrate (e.g., an unreliable process and high cost), both control electrodes and wires are fabricated in a single-layer configuration, which poses significant challenges to pin selection for non-regular electrodes under limited routing resources. In this paper, we propose a minimum-cost-flow-based routing algorithm called NR-Router that features efficient and robust routing for single-layer EWOD chips with non-regular electrodes, overcoming the challenges mentioned above. To the best of our knowledge, NR-Router is the first algorithm that can accurately route single-layer EWOD chips with non-regular electrodes. We construct a minimum-cost-flow algorithm to generate optimal routing paths, followed by a lightweight model to handle flow capacity. NR-Router achieves 100% routability while minimizing wirelength in a short runtime, and generates manufacturable mask files via adjustments of design parameters. Experimental results demonstrate the robustness and efficiency of our proposed algorithm.

Title: Design-for-Reliability and Probability-Based Fault Tolerance for Paper-Based Digital Microfluidic Biochips with Multiple Faults
Authors: *Jian-De Li, Sying-Jyan Wang (Department of Computer Science and Engineering, National Chung Hsing University, Taiwan), Katherine Shu-Min Li (Department of Computer Science and Engineering, National Sun Yat-sen University, Taiwan), Tsung-Yi Ho (Department of Computer Science, National Tsing Hua University, Taiwan)
Pages: 62-67
Keywords: paper-based digital microfluidic biochip, design-for-reliability, on-the-fly, diagnosis, fault tolerance
Abstract: Paper-based digital microfluidic biochips (PB-DMFBs) have emerged as the most promising solution for biochemical applications in resource-limited regions. However, like silicon chips, the reliability of PB-DMFBs is affected by physical defects. Even worse, since electrodes, conductive wires, and droplet routes are entangled on the same layer, multiple faults may occur simultaneously. Such faults not only waste samples and human effort but also affect the correctness of the diagnostics. In this paper, we propose a reliability scheme with emphasis on design-for-reliability (DfR) and probability-based fault tolerance to ensure the correct functionality of PB-DMFBs with multiple faults.

Title: Improving the Robustness of Microfluidic Networks
Authors: *Gerold Fink, Philipp Ebner, Sudip Poddar, Robert Wille (Johannes Kepler University Linz - Institute for Integrated Circuits, Austria)
Pages: 68-73
Keywords: microfluidics, robustness, simulation
Abstract: Microfluidic devices, often in the form of Labs-on-a-Chip (LoCs), are successfully utilized in many domains such as medicine, chemistry, and biology. However, neither the fabrication process nor the materials used are perfect, and thus defects are frequently induced into the actual physical realization of the device. This is especially critical for sensitive devices such as droplet-based microfluidic networks, which route droplets inside channels along different paths by exploiting only passive hydrodynamic effects. These passive hydrodynamic effects are very sensitive, and even slight parameter changes (e.g., in the channel width) can alter the behavior, possibly to the point where the intended functionality of the network breaks. Hence, it is important that microfluidic networks become robust against such defects in order to prevent erroneous behavior. But considering such defects during the design process is a non-trivial task, and designers have therefore mostly neglected such considerations thus far. To overcome this problem, we propose a robustness improvement process that makes it possible to optimize an initial design so that it becomes more robust against defects (while still retaining the original behavior of the initial design). To this end, we first utilize a metric to compare the robustness of different designs and afterwards discuss methods that aim to improve the robustness. The metric and methods are demonstrated on an example and also tested on several networks to show the validity of the robustness improvement process.


Session 1E  Advances in Machine Learning Assisted Analog Circuit Sizing
Time: 10:00 - 10:35, Tuesday, January 18, 2022
Location: Room E
Chairs: Chien-Nan Jimmy Liu (National Yang Ming Chiao Tung University, Taiwan), Fan Yang (Fudan University, China)

Best Paper Candidate
Title: An Efficient Kriging-based Constrained Multi-objective Evolutionary Algorithm for Analog Circuit Synthesis via Self-adaptive Incremental Learning
Authors: *Sen Yin, Wenfei Hu, Wenyuan Zhang, Ruitao Wang, Jian Zhang, Yan Wang (School of Integrated Circuits, Tsinghua University, China)
Pages: 74-79
Keywords: incremental learning, analog circuit synthesis, multi-objective optimization, differential evolution
Abstract: In this paper, we propose an efficient Kriging-based constrained multi-objective evolutionary algorithm for analog circuit synthesis via self-adaptive incremental learning. The incremental learning technique is introduced to reduce the time complexity of training the Kriging model from O(n³) to O(n²), where n is the number of training points. The proposed approach reduces the total optimization time in three aspects. First, by reusing previously trained models, a self-adaptive incremental learning strategy is applied to reduce the training time of the Kriging model. Second, we use non-dominated sorting and a modified crowding distance to prescreen the most promising candidate to be simulated, which largely reduces the number of simulations. Third, as there is no internal optimization, the prediction time of the Kriging model is saved. Experimental results on two real-world circuits demonstrate that, compared with state-of-the-art multi-objective Bayesian optimization, our method can reduce the training time of the Kriging model by 95% and the prediction time by 99.7% without surrendering optimization results. Compared with NSGA-II and MOEA/D, the proposed method achieves up to a 10× speedup in total optimization time while achieving better results.
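The O(n³)-to-O(n²) saving the abstract cites corresponds to a standard incremental update for Kriging/Gaussian-process models: when one training point is added, the Cholesky factor of the kernel matrix can be extended by a single new row instead of being refactorized from scratch. A minimal sketch of that generic idea (not the authors' self-adaptive strategy), with an RBF kernel chosen purely for illustration:

```python
import math

def cholesky(A):
    """Plain O(n^3) Cholesky factorization A = L L^T (lower triangular)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L x = b for lower-triangular L in O(n^2)."""
    n = len(L)
    x = [0.0] * n
    for i in range(n):
        x[i] = (b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i]
    return x

def extend_cholesky(L, k, kss):
    """Append one training point in O(n^2):
    k   = kernel vector between the new point and the old points,
    kss = kernel value of the new point with itself."""
    row = forward_solve(L, k)
    diag = math.sqrt(kss - sum(v * v for v in row))
    new_L = [Li + [0.0] for Li in L]
    new_L.append(row + [diag])
    return new_L

# Illustrative RBF kernel with a small jitter on the diagonal.
def kern(x, y):
    return math.exp(-(x - y) ** 2) + (1e-6 if x == y else 0.0)

pts = [0.0, 0.4, 1.1, 2.0]
L3 = cholesky([[kern(a, b) for b in pts[:3]] for a in pts[:3]])
L4 = extend_cholesky(L3, [kern(p, pts[3]) for p in pts[:3]],
                     kern(pts[3], pts[3]))
```

The extended factor agrees with a from-scratch factorization of the full 4x4 kernel matrix, but only the new row costs work.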

Title: Fast Variation-aware Circuit Sizing Approach for Analog Design with ML-Assisted Evolutionary Algorithm
Authors: *Ling-Yen Song, Tung-Chieh Kuo, Ming-Hung Wang, Chien-Nan Jimmy Liu, Juinn-Dar Huang (Institute of Electronics, National Yang Ming Chiao Tung University, Taiwan)
Pages: 80-85
Keywords: Process variation, analog circuit sizing, evolutionary algorithm, machine learning
Abstract: Evolutionary algorithms (EAs) based on circuit simulation are a popular approach for analog circuit sizing because of their high accuracy and adaptability to different cases. However, if process variation is also considered, the huge number of required simulations becomes almost infeasible for large circuits. Although some recent works adopt machine learning (ML) techniques to speed up the optimization process, process variation effects are still hard to consider in those approaches. In this paper, we propose a fast variation-aware evolutionary algorithm for analog circuit sizing with an ML-assisted prediction model. By predicting the likelihood that a design has worse performance, our EA process is able to skip many unnecessary simulations and reduce the convergence time. Moreover, a novel force-directed model is proposed to guide the optimization toward better yield. Based on the performance of prior circuit samples in the EA optimization, the proposed force model is able to predict the likelihood that a design has better yield without time-consuming Monte Carlo simulations. Compared with prior works, the proposed approach significantly reduces the number of simulations in yield-aware EA optimization, which helps to generate more practical designs with high reliability and low cost.

Title: A Novel and Efficient Bayesian Optimization Approach for Analog Designs with Multi-Testbench
Authors: *Jingyao Zhao, Changhao Yan, Zhaori Bi, Fan Yang, Xuan Zeng (Fudan University, China), Dian Zhou (University of Texas at Dallas, USA)
Pages: 86-91
Keywords: Predictive entropy search with constraints, Bayesian optimization, Gaussian process, Analog circuit synthesis
Abstract: Analog circuits are characterized by various circuit performances obtained from multiple testbenches, which need to be simulated independently. In this paper, we propose an efficient Bayesian optimization approach for multi-testbench analog circuit design. Predictive Entropy Search with Constraints (PESC) is applied to select the most suitable testbench to simulate, and time-weighted PESC (wPESC) is also proposed to account for differing analysis times. Furthermore, a Feasibility Expected Improvement (FEI) acquisition function for constraints, together with a method for solving the multi-modal optimization problem of FEI, is proposed to improve the efficiency of exploring feasible regions. The proposed approach achieves a 2.7-3.8× speedup compared with the state-of-the-art method and achieves better optimization results.


Session 2A  University Design Contest-2
Time: 10:35 - 11:10, Tuesday, January 18, 2022
Location: Room A
Chairs: Ing-Chao Lin (National Cheng Kung University, Taiwan), Tsung-Te Liu (National Taiwan University, Taiwan)

Special Feature Award
Title: A 2.17μW@120fps Ultra-Low-Power Dual-Mode CMOS Image Sensor with Senputing Architecture
Authors: *Ziwei Li (Beijing Jiaotong University, China), Han Xu, Zheyu Liu (Tsinghua University, China), Li Luo (Beijing Jiaotong University, China), Qi Wei, Fei Qiao (Tsinghua University, China)
Pages: 92-93
Keywords: Senputing, AIoT, CMOS Image Sensor, Always-on, BNN
Abstract: This paper proposes an ultra-low-power CMOS image sensor (CIS) chip based on a sensing-with-computing (Senputing) architecture to relieve the power bottleneck of vision systems. The Senputing chip performs the first-layer BNN convolution in the analog domain with ultra-low power consumption. It has two working modes: Normal-Sensor (NS) mode and Direct-Photocurrent-Computation (DPC) mode. Measurements of the prototype, fabricated in a 65-nm CMOS process, on the MNIST classification task show that feature-map computation consumes 2.17 µW at a 120-fps frame rate with 98.1% accuracy. The computation efficiency reaches 11.49 TOPS/W, which is 14.8× higher than state-of-the-art works.

Title: A Reconfigurable Inference Processor for Recurrent Neural Networks Based on Programmable Data Format in a Resource-Limited FPGA
Authors: *Jiho Kim, Kwoanyoung Park, Tae-Hwan Kim (Korea Aerospace University, Republic of Korea)
Pages: 94-95
Keywords: recurrent neural networks, inference, FPGA, reconfigurable, processor
Abstract: An efficient inference processor for recurrent neural networks is designed and implemented in an FPGA. The proposed processor is reconfigurable for various models and performs every vector operation consistently on a single array of multiply-accumulate units, with the aim of achieving high resource efficiency. The data format is programmable per operand. The resource and energy efficiency are 1.89 MOP/LUT and 263.95 GOP/J, respectively, on an Intel Cyclone-V FPGA. The functionality has been verified successfully in a fully integrated inference system.

Title: Supply-Variation-Tolerant Transimpedance Amplifier Using Non-Inverting Amplifier in 180-nm CMOS
Authors: *Tomofumi Tsuchida, Akira Tsuchiya, Toshiyuki Inoue, Keiji Kishine (University of Shiga Prefecture, Japan)
Pages: 96-97
Keywords: transimpedance amplifier, supply voltage variation, non-inverting amplifier
Abstract: This paper presents a supply-variation-tolerant transimpedance amplifier (TIA). For the parallel integration of optical transceivers, supply voltage variation is a serious problem. We propose a TIA that uses a non-inverting stage to cancel the supply variation. As a proof of concept, we fabricated the proposed TIA in a 180-nm CMOS process and measured eye diagrams at various supply voltages. Measurement results show that the voltage swing and the eye-opening voltage are improved by 105% and 180%, respectively.

Title: Deformable Chiplet-Based Computer Using Inductively Coupled Wireless Communication
Authors: *Junichiro Kadomoto, Hidetsugu Irie, Shuichi Sakai (University of Tokyo, Japan)
Pages: 98-99
Keywords: processor, wireless communication
Abstract: Research on microrobot swarms and deformable user interfaces has been conducted extensively, and inductively coupled wireless bus technology has been proposed for such applications. This technology uses inductive coupling among on-chip coils to connect multiple chiplets wirelessly. By wirelessly connecting small chiplets, it is possible to construct deformable systems with various chip configurations. The prototype chip, which has a 32-bit RISC-V processor core and a wireless communication interface, is fabricated in a 0.18-µm CMOS technology. The prototype validates that inductively coupled wireless data communication can be achieved between two processor chiplets.


Session 2B  (SS-2) Analog Circuit and Layout Synthesis: Advancement and Prospect
Time: 10:35 - 11:10, Tuesday, January 18, 2022
Location: Room B
Chair: Mark Po-Hung Lin (National Yang Ming Chiao Tung University, Taiwan)

Title: (Invited Paper) AMS Circuit Synthesis Enabled by the Advancements of Circuit Architectures and ML Algorithms
Authors: Shiyu Su, Qiaochu Zhang, Mohsen Hassanpourghadi, Juzheng Liu, Rezwan Rasul, *Mike Shuo-Wei Chen (University of Southern California, USA)
Pages: 100-107
Keywords: AMS, synthesis, mostly digital, neural network, open source
Abstract: Analog mixed-signal (AMS) circuit architectures have evolved toward more digital-friendly implementations due to technology scaling and the demand for higher flexibility and reconfigurability. Meanwhile, the design complexity and cost of AMS circuits have substantially increased due to the necessity of optimizing the sizing, layout, and verification of complex AMS circuits. On the other hand, machine learning (ML) algorithms have grown exponentially over the past decade and are actively exploited by the electronic design automation (EDA) community. This paper identifies the opportunities and challenges brought about by this trend and overviews several emerging AMS design methodologies that are enabled by the recent evolution of AMS circuit architectures and machine learning algorithms. Specifically, we focus on using neural-network-based surrogate models to expedite the circuit design parameter search and layout iterations. Lastly, we demonstrate the rapid synthesis of several AMS circuit examples from specification to silicon prototype, with significantly reduced human intervention.

Title: (Invited Paper) Automating Analog Constraint Extraction: From Heuristics to Learning
Authors: Keren Zhu, Hao Chen, Mingjie Liu, *David Z. Pan (The University of Texas at Austin, USA)
Pages: 108-113
Keywords: Analog, layout, Machine learning, constraint extraction, EDA
Abstract: Analog layout synthesis has recently received much attention as a way to mitigate the increasing cost of manual layout efforts. To achieve the desired performance and design specifications, generating layout constraints is critical in a fully automated netlist-to-GDSII analog layout flow. However, there is a big gap between automatic constraint extraction and constraint management in analog layout synthesis. This paper introduces the existing constraint types for analog layout synthesis and points out recent research trends in automating analog constraint extraction. Specifically, the paper reviews conventional graph heuristics such as graph similarity and the recent machine learning approach leveraging graph neural networks. It also discusses challenges and research opportunities.

Title: (Invited Paper) Common-Centroid Layout for Active and Passive Devices: A Review and the Road Ahead
Authors: Nibedita Karmokar, Meghna Madhusudan, Arvind K. Sharma, Ramesh Harjani (University of Minnesota, USA), Mark Po-Hung Lin (National Yang Ming Chiao Tung University, Taiwan), *Sachin S. Sapatnekar (University of Minnesota, USA)
Pages: 114-121
Keywords: Analog design, Mismatch, Common-centroid, Performance, Variations
Abstract: This paper presents an overview of common-centroid (CC) layout styles, used in analog designs to overcome the impact of systematic variations. CC layouts must be carefully engineered to minimize the impact of mismatch. Algorithms for CC layout must be aware of routing parasitics, layout-dependent effects (for active devices), and the performance impact of layout choices. The optimal CC layout further depends on factors such as the choice of the unit device and the relative impact of uncorrelated and systematic variations. The paper also examines scenarios where non-CC layouts may be preferable to CC layouts.


Session 2C  Low-cost and Memory-Efficient Deep Learning
Time: 10:35 - 11:10, Tuesday, January 18, 2022
Location: Room C
Chairs: Tao Liu (Lawrence Technological University, USA), Jianlei Yang (Beihang University, China)

Best Paper Candidate
Title: PUMP: Profiling-free Unified Memory Prefetcher for Large DNN Model Support
Authors: Chung-Hsiang Lin (Taiwan AI Lab, Taiwan), *Shao-Fu Lin (National Taiwan University, Taiwan), Yi-Jung Chen (National Chi Nan University, Taiwan), En-Yu Jenp, Chia-Lin Yang (National Taiwan University, Taiwan)
Pages: 122-127
Keywords: Machine Learning, Model Training, Large Model Support
Abstract: Modern DNNs are going deeper and wider to achieve higher accuracy. However, existing deep learning frameworks require the whole DNN model to fit into GPU memory when training with GPUs, which puts an unwanted limitation on training large models. Utilizing NVIDIA Unified Memory (UM) could inherently support training DNN models beyond the GPU memory capacity. However, naively adopting UM suffers a significant performance penalty due to data-transfer delays. In this paper, we propose PUMP, a Profiling-free Unified Memory Prefetcher. PUMP exploits GPU asynchronous execution for prefetching: there is a delay between the time the CPU launches a kernel and the time the kernel actually executes on the GPU. PUMP extracts the memory blocks accessed by a kernel at launch time and swaps these blocks into GPU memory ahead of execution. Experimental results show that PUMP achieves about a 2× speedup on average compared to a baseline that naively enables UM.

Title: RADARS: Memory Efficient Reinforcement Learning Aided Differentiable Neural Architecture Search
Authors: *Zheyu Yan, Weiwen Jiang, Xiaobo Sharon Hu, Yiyu Shi (University of Notre Dame, USA)
Pages: 128-133
Keywords: NAS, Memory
Abstract: Differentiable neural architecture search (DNAS) is known for its capacity to automatically generate superior neural networks. However, DNAS-based methods suffer from memory-usage explosion when the search space expands, which may prevent them from running successfully even on advanced GPU platforms. On the other hand, reinforcement learning (RL) based methods, while being memory efficient, are extremely time-consuming. Combining the advantages of both types of methods, this paper presents RADARS, a scalable RL-aided DNAS framework that can explore large search spaces in a fast and memory-efficient manner. RADARS iteratively applies RL to prune undesired architecture candidates and identifies a promising subspace in which to carry out DNAS. Experiments using a workstation with 12 GB of GPU memory show that, on the CIFAR-10 and ImageNet datasets, RADARS can achieve up to 3.41% higher accuracy with a 2.5× reduction in search time compared with a state-of-the-art RL-based method, while the two DNAS baselines do not complete due to excessive memory usage or search time. To the best of the authors' knowledge, this is the first DNAS framework that can handle large search spaces with bounded memory usage.

Title: A Heuristic Exploration to Retraining-free Weight Sharing for CNN Compression
Authors: *Etienne Dupuis (Univ Lyon, Ecole Centrale de Lyon, CNRS, INSA Lyon, Université Claude Bernard Lyon 1, CPE Lyon, CNRS, INL, UMR5270, France), David Novo (LIRMM, Université de Montpellier, CNRS, France), Ian O'Connor, Alberto Bosio (Univ Lyon, Ecole Centrale de Lyon, CNRS, INSA Lyon, Université Claude Bernard Lyon 1, CPE Lyon, CNRS, INL, UMR5270, France)
Pages: 134-139
Keywords: Deep Learning, Approximate Computing, Weight Sharing, Hardware Accelerator, Convolutional Neural Networks
Abstract: The computational workload involved in Convolutional Neural Networks (CNNs) is typically out of reach for low-power embedded devices. The scientific literature provides a large number of approximation techniques to address this problem. Among them, the Weight-Sharing (WS) technique gives promising results, but it requires carefully determining the shared values for each layer of a given CNN. As the number of possible solutions grows exponentially with the number of layers, the WS Design Space Exploration (DSE) time can easily explode for state-of-the-art CNNs. In this paper, we propose a new heuristic approach to drastically reduce the exploration time without sacrificing the quality of the output. Experiments on recent CNNs (GoogleNet, ResNet50V2, MobileNetV2, InceptionV3, and EfficientNet), trained with the ImageNet dataset, show over 5× memory compression at an acceptable accuracy loss (complying with the MLPerf quality target), without any retraining step and in less than 10 hours. Our code is publicly available on GitHub.
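The weight-sharing idea itself is simple: cluster a layer's weights into k shared values and store only a small codebook plus per-weight indices. A minimal sketch of that generic idea using plain 1-D k-means (the paper's contribution is the heuristic search over per-layer configurations, which is not shown here; the example weights are made up):

```python
def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means; centroids start evenly spread over the range."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def share_weights(weights, k):
    """Replace each weight by the index of its nearest shared value."""
    centroids = kmeans_1d(weights, k)
    assign = [min(range(k), key=lambda i: abs(w - centroids[i]))
              for w in weights]
    return centroids, assign

# Illustrative layer weights compressed to a 3-entry codebook.
weights = [0.1, 0.11, 0.09, 0.5, 0.52, -0.3, -0.29]
codebook, assign = share_weights(weights, k=3)
```

Storage then drops from one float per weight to log2(k) bits per weight plus the codebook, which is where the >5× compression headroom in the abstract comes from.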

Title: HiKonv: High Throughput Quantized Convolution With Novel Bit-wise Management and Computation
Authors: *Yao Chen (Advanced Digital Sciences Center, Singapore), Xinheng Liu (University of Illinois at Urbana-Champaign, USA), Prakhar Ganesh (Advanced Digital Sciences Center, Singapore), Junhao Pan (University of Illinois at Urbana-Champaign, USA), Jinjun Xiong (IBM Thomas J. Watson Research Center, USA), Deming Chen (University of Illinois at Urbana-Champaign, USA)
Pages: 140-146
Keywords: Quantization, Convolution, bit-wise management, high throughput, low-bitwidth
Abstract: Quantization for CNNs has shown significant progress in reducing the cost of computation and storage with low-bitwidth data. There are, however, no systematic studies on how an existing full-bitwidth processing unit, such as a CPU or DSP, can be better utilized to deliver significantly higher convolution throughput under various quantized bitwidths. In this study, we propose HiKonv, a unified solution that maximizes the compute throughput of a given processing unit for low-bitwidth quantized data through novel bit-wise parallel computation. We establish theoretical performance bounds for using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution, and demonstrate new breakthroughs for high-performance computing in this critical domain. For example, a single 32-bit processing unit can deliver 128 binarized convolution operations in one CPU instruction, and a 27x18 DSP can deliver 8 convolution operations with 4-bit inputs in one cycle. We demonstrate the effectiveness of HiKonv on CPU and FPGA for both individual convolutional layers and complete DNN models. For a convolutional layer quantized to 4 bits, HiKonv achieves a 3.17× latency improvement over the baseline C++ implementation on a CPU. Compared to the DAC-SDC 2020 champion model for FPGA, HiKonv achieves 2.37× throughput and 2.61× DSP-efficiency improvements.
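The core trick behind packing low-bitwidth convolutions into a full-width multiplier can be shown in miniature: with enough guard bits per lane, multiplying two packed words computes the polynomial product, i.e. the 1-D convolution, of the packed sequences in a single multiply. A generic SWAR-style sketch under an assumed lane width, not HiKonv's exact bit-management scheme:

```python
LANE = 12  # lane width: holds an 8-bit product plus accumulation headroom

def pack(vals, lane=LANE):
    """Pack small unsigned values into one wide word, one per lane."""
    word = 0
    for i, v in enumerate(vals):
        word |= v << (i * lane)
    return word

def unpack(word, n, lane=LANE):
    mask = (1 << lane) - 1
    return [(word >> (i * lane)) & mask for i in range(n)]

# One full-width multiply of two packed words yields all partial products:
# the lanes of the result are the 1-D convolution of the input sequences.
a = [3, 7]        # 4-bit inputs
b = [5, 2, 9]
lanes = unpack(pack(a) * pack(b), len(a) + len(b) - 1)
print(lanes)      # [15, 41, 41, 63] == conv([3, 7], [5, 2, 9])
```

On real hardware the word width limits how many lanes fit (Python integers are unbounded, so the example sidesteps that); choosing lane widths so that no partial sum overflows into the next lane is exactly the bit-wise management problem the paper addresses.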


Session 2D  High-level Verification and Application
Time: 10:35 - 11:10, Tuesday, January 18, 2022
Location: Room D
Chairs: Xinfei Guo (Shanghai Jiao Tong University, China), Hiroyuki Tomiyama (Ritsumeikan University, Japan)

Title: Mapping Large Scale Finite Element Computing onto Wafer-Scale Engines
Authors: *Yishuang Lin, Rongjian Liang, Yaguang Li, Hailiang Hu, Jiang Hu (Texas A&M University, USA)
Pages: 147-153
Keywords: wafer-scale chip, computing task mapping, finite element
Abstract: The finite element method has wide applications and often presents a computing challenge due to huge problem sizes and slow convergence rates. A leading-edge acceleration approach is to leverage a wafer-scale engine, which contains more than 800K processing elements. The effectiveness of this approach heavily depends on how a finite element computing task is mapped onto such an enormous hardware space. A mapping method is introduced that partitions an object space into computing kernels, which are then placed onto processing elements. This method achieved the best overall result in terms of computing accuracy and communication cost among all ISPD 2021 contest participants.

Title: Generalizing Tandem Simulation: Connecting High-level and RTL Simulation Models
Authors: *Yue Xing, Aarti Gupta, Sharad Malik (Princeton University, USA)
Pages: 154-159
Keywords: hardware simulation, EDA, hardware modeling, testing
Abstract: Simulation-based testing has been the workhorse of hardware implementation validation. For processors, tandem simulation improves test and debug efficiency through cross-level simulation of the Instruction Set Architecture (ISA) and RTL models, comparing architectural-state variables at the end of each instruction rather than at the end of the whole trace. Further, the simulation may start with the ISA model and switch to the RTL model at some point by transferring the values of the architectural variables, thus speeding up the "warm-up" phase. However, thus far tandem simulation has been limited to processor designs, as other SoC components lack high-level ISA models and thus the notion of instructions. Even for processors, significant manual effort is required to connect the two models. This paper leverages the recently proposed Instruction-Level Abstractions (ILAs) to generalize tandem simulation to accelerators. Further, we use the refinement map that is part of the ILA verification methodology to automate the connection between the ILA and RTL simulation models for both processors and accelerators. We provide seven case studies to demonstrate the practical applicability of our methodology.

Title: Automated Detection of Spatial Memory Safety Violations for Constrained Devices
Authors: *Sören Tempel (University of Bremen, Germany), Vladimir Herdt, Rolf Drechsler (University of Bremen / DFKI GmbH, Germany)
Pages: 160-165
Keywords: Symbolic Execution, Memory Safety, Embedded Software, Virtual Prototype, RISC-V
Abstract: Software for constrained devices, commonly used in the Internet of Things (IoT), is primarily written in C and thus subject to vulnerabilities caused by the lack of memory safety (e.g., buffer overflows). To prevent these vulnerabilities, we present a systematic approach for finding spatial memory safety violations in low-level code for constrained embedded devices. We propose implementing this approach using SystemC-based Virtual Prototypes (VPs) and illustrate an architecture for a non-intrusive integration into an existing VP. To the best of our knowledge, this is the first approach for finding spatial memory safety violations that addresses the challenges specific to constrained devices, namely limited computing resources and the use of custom hardware peripherals. We evaluate our approach by applying it to the IoT operating system RIOT, where we discovered seven previously unknown spatial memory safety violations in the network stack of the operating system.


Session 2E  Design for Manufacturing and Signal Integrity
Time: 10:35 - 11:10, Tuesday, January 18, 2022
Location: Room E
Chairs: Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan), Bei Yu (Chinese University of Hong Kong, Hong Kong)

Title: Lithography Hotspot Detection via Heterogeneous Federated Learning with Local Adaptation
Authors: *Xuezhong Lin (Zhejiang University, China), Jingyu Pan (Duke University, USA), Jinming Xu (Zhejiang University, China), Yiran Chen (Duke University, USA), Cheng Zhuo (Zhejiang University, China)
Pages: 166-171
Keywords: federated learning, lithography hotspot detection, design for manufacturability
Abstract: As technology scaling approaches its physical limit, lithography hotspot detection has become an essential task in design for manufacturability. Although deploying machine learning in hotspot detection saves significant simulation time, such methods typically demand non-trivial quality data to build the model. Most design houses are actually short of quality data, yet they are also unwilling to directly share such layout-related data to build a unified model, due to concerns about IP protection and model effectiveness. On the other hand, with data homogeneity and insufficiency within each design house, locally trained models can easily over-fit, losing generalization ability and robustness when applied to new designs. In this paper, we propose a heterogeneous federated learning framework for lithography hotspot detection that addresses the aforementioned issues. The framework builds a more robust centralized global sub-model through heterogeneous knowledge sharing while keeping local data private. The global sub-model is then combined with a local sub-model to better adapt to local data heterogeneity. Experimental results show that the proposed framework can overcome the challenges of non-independent and identically distributed (non-IID) data and heterogeneous communication, achieving very high performance in comparison to other state-of-the-art methods while guaranteeing good convergence in various scenarios.
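The baseline aggregation step underlying any such framework is classic federated averaging: clients send model parameters, never raw layouts, and the server combines them weighted by local data size. A minimal homogeneous sketch (the paper's heterogeneous knowledge sharing and local adaptation go well beyond this; the client counts below are made up):

```python
def fed_avg(client_params, client_sizes):
    """Server-side FedAvg: average client model parameters weighted by
    local dataset size. Raw training data never leaves the clients."""
    total = sum(client_sizes)
    n = len(client_params[0])
    return [sum(p[i] * s for p, s in zip(client_params, client_sizes)) / total
            for i in range(n)]

# Two design houses with 100 and 300 local layout samples:
global_params = fed_avg([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(global_params)  # [2.5, 3.5]
```

Each communication round repeats this: clients train locally from the current global parameters, upload updates, and the server re-averages, so only model parameters cross organizational boundaries.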

TitleVoronoi Diagram Based Heterogeneous Circuit Layout Centerline Extraction for Mask Verification
Author*Xiqiong Bai (Fuzhou University, China), Ziran Zhu (Southeast University, China), Peng Zou, Jianli Chen, Jun Yu (Fudan University, China), Yao-Wen Chang (National Taiwan University, Taiwan)
Pagepp. 172 - 177
KeywordCenterline extraction, mask verification
AbstractModern circuit layout centerline extraction is an essential step in estimating parasitic inductance and verifying layout performance in mask verification. With continued feature-size shrinking and the growing complexity of modern circuit designs, heterogeneous layout centerline extraction has become even more challenging. In this paper, we first formulate a Voronoi diagram-based problem transformation to collect all centerline points. Then, a graph-based initial centerline construction algorithm is presented to handle all invalid centerline points effectively. Finally, a heterogeneity-aware centerline optimization method is proposed to generate optimized, design-violation-free centerline results for irregular structures. Compared with the state-of-the-art commercial 3D-RC parasitic parameter extraction tool RCExplorer and the first-place solution of the 2019 EDA Elite Challenge Contest, experimental results show that our algorithm achieves the best average precision ratio of 99.7% on centerline extraction while satisfying all design constraints.

TitleSignal-Integrity-Aware Interposer Bus Routing in 2.5D Heterogeneous Integration
Author*Sung-Yun Lee, Daeyeon Kim, Kyungjun Min, Seokhyeong Kang (Pohang University of Science and Technology (POSTECH), Republic of Korea)
Pagepp. 178 - 183
Keywordsilicon interposer, design rules, high bandwidth, eye-diagram
AbstractWe propose a fast interposer bus router that observes the complex design rules of silicon interposer layers and optimizes the signal integrity. By escaping highly integrated physical layers (PHYs) of chiplets and sharing the same bus topology, our router compactly interconnects thousands of bump I/Os within a short timeframe. In addition, we secure the maximum wire pitch and guard the signal wires to optimize the signal integrity in high bandwidth. Compared with the results of a commercial EDA tool, our router is about five times faster and the results are verified to transmit signal in a target data rate with 30% improved eye width and 35% improved eye height for industrial designs. Our router can provide practical routing results for the upcoming 2.5D ICs that have more chiplets and require higher bandwidth than the existing chips.

[To Session Table]

Session 3A  (DF-1) Key Drivers of Global Hardware Security
Time: 11:10 - 11:55, Tuesday, January 18, 2022
Location: Room A
Chair: Prof. Sying-Jyan Wang (National Chung Hsing University, Taiwan)

Title(Designers' Forum) Solving Chip Security’s Weakest Link: Complete Secure Boundary with PUF-based Hardware Root of Trust
Author*John Chou (PUFSecurity, Taiwan)
KeywordHardware Security
AbstractThe most crucial element in chip security is the Root Key, or Hardware Unique Key (HUK). The key is the starting point not only for protecting each chip but also for the chain of trust that encompasses the entire system and associated services. Therefore, key generation, along with its storage and usage, must be well considered from the beginning of the design. With the invention of Physical Unclonable Functions (PUFs), we can now create a unique, inborn, unclonable key at the hardware level. The natural follow-up question is, “but how do we protect this key?” Storing it insecurely is like keeping the key to your secrets in a drawer: a surefire way to break the secure boundary and create vulnerabilities. Security is only as strong as the weakest link, and in most cases the weakest link is insecure key storage in eFuse. Insecure storage immediately compromises the whole system's security, regardless of the sophistication of the key itself. Since hardware updates are difficult and costly after production, it is crucial to deploy appropriate hardware security from the beginning. We present a use case of applying an integrated IP, consisting of a PUF and anti-fuse-based secure One-Time Programmable (OTP) memory together with a crypto coprocessor, to provide proper hardware security at the manufacturing stage. It delivers an unclonable key and secure OTP storage with complete anti-tamper designs, hence making the secure boundary complete.

Title(Designers' Forum) SoC and Data Security in the Era Of Cloud Supercomputing
AuthorDana Neustadter (Synopsys, USA), *Matthew Ma (Synopsys, China)
KeywordHardware Security

Title(Designers' Forum) Semiconductor Supply Chain Security: Introduction to Chip Security Test Specifications
Author*Mars Kao (Institute for Information Industry, Taiwan)
KeywordHardware Security

[To Session Table]

Session 3B  Analysis and Optimization for Timing, Power, and Reliability
Time: 11:10 - 11:55, Tuesday, January 18, 2022
Location: Room B
Chairs: Wenjian Yu (Tsinghua University, China), Umamaheswara Rao Tida (North Dakota State University, USA)

Best Paper Candidate
TitlePre-Routing Path Delay Estimation Based on Transformer and Residual Framework
Author*Tai Yang, Guoqing He, Peng Cao (Southeast University, China)
Pagepp. 184 - 189
Keywordtransformer network, pre-routing delay estimation, residual model
AbstractTiming estimation prior to routing is of vital importance for optimization at the placement stage and for timing closure. Existing wire- or net-oriented learning-based methods limit the accuracy and efficiency of prediction because they neglect the delay correlation along a path and incur high computational complexity for delay accumulation. In this paper, an efficient and accurate pre-routing path delay prediction framework is proposed by employing a transformer network and a residual model, where the timing and physical information at the placement stage is extracted as sequence features while the residual of path delay is modeled to calibrate the mismatch between the pre- and post-routing path delay. Experimental results demonstrate that with the proposed framework, the prediction error of post-routing path delay is less than 1.68% and 3.12% for seen and unseen circuits in terms of rRMSE, which is 2.3~5.0 times lower than existing learning-based methods for pre-routing prediction. Moreover, this framework delivers at least three orders of magnitude speedup compared with the traditional design flow, making it promising for guiding circuit optimization with satisfactory prediction accuracy prior to time-consuming routing and timing analysis.

TitleEfficient Critical Paths Search Algorithm using Mergeable Heap
AuthorKexing Zhou, *Zizheng Guo (Peking University, China), Tsung-Wei Huang (University of Utah, USA), Yibo Lin (Peking University, China)
Pagepp. 190 - 195
KeywordSTA, static timing analysis, path search
AbstractPath searching is a central step in static timing analysis (STA). State-of-the-art algorithms need to generate path deviations for hundreds of thousands of paths, which becomes the runtime bottleneck of STA. Accelerating path searching is a challenging task due to the complex and iterative path-generating process. In this work, we propose a novel path searching algorithm with asymptotically lower runtime complexity than the state-of-the-art. We precompute the path deviations using a mergeable heap and apply a group of deviations to a path in near-constant time. We prove that our algorithm has a runtime complexity of O(n log n + k log k), asymptotically smaller than the state-of-the-art O(nk). Experimental results show that our algorithm is up to 60× faster than OpenTimer and 1.8× faster than the leading path search algorithm based on a suffix forest.
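The mergeable heap mentioned above is a standard data structure. A minimal leftist-heap sketch (a generic illustration of the primitive, not the authors' implementation) shows how two heaps of deviations can be combined in logarithmic time:

```python
# Leftist min-heap: a mergeable heap where merge runs in O(log n),
# which is what makes precomputed deviation sets cheap to combine.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.rank = key, None, None, 1

def merge(a, b):
    """Merge two leftist min-heaps; returns the new root."""
    if a is None: return b
    if b is None: return a
    if b.key < a.key:
        a, b = b, a                      # keep the smaller key on top
    a.right = merge(a.right, b)          # always merge along the right spine
    rl = a.left.rank if a.left else 0
    rr = a.right.rank if a.right else 0
    if rr > rl:                          # leftist property: short right spine
        a.left, a.right = a.right, a.left
    a.rank = min(rl, rr) + 1
    return a

def push(root, key): return merge(root, Node(key))
def pop_min(root):   return root.key, merge(root.left, root.right)

h = None
for k in [5, 1, 4, 2, 3]:
    h = push(h, k)
out = []
while h:
    k, h = pop_min(h)
    out.append(k)
print(out)  # [1, 2, 3, 4, 5]
```

Because merge is logarithmic, both `push` and `pop_min` are implemented as one-line merges, unlike a binary heap where merging two heaps costs linear time.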

TitleA Graph Neural Network Method for Fast ECO Leakage Power Optimization
Author*Kai Wang, Peng Cao (Southeast University, China)
Pagepp. 196 - 201
KeywordECO, GNN, directed graph, Vth assignments
AbstractIn modern designs, engineering change order (ECO) is often utilized to perform power optimization, including gate sizing and Vth assignment, which is effective but highly time-consuming. Many graph neural network (GNN) based methods have recently been proposed for fast and accurate ECO power optimization by considering neighbors' information. Nonetheless, these works fail to learn high-quality node representations on directed graphs, since they treat all neighbors uniformly when gathering their information and lack local topology information from neighbors one or two hops away. In this paper, we introduce a directed-GNN-based method that learns information from different types of neighbors separately and captures rich local topology information, validated on the Opencores and IWLS 2005 benchmarks with TSMC 28nm technology. Experimental results show that our approach outperforms prior GNN-based methods with at least 7.8% and 7.6% prediction accuracy improvement for seen and unseen designs respectively, as well as 8.3% to 29.0% leakage optimization improvement. Compared with the commercial EDA tool PrimeTime, the proposed framework achieves similar power optimization results with up to 12× runtime improvement.

TitleVector-based Dynamic IR-drop Prediction Using Machine Learning
AuthorJia-Xian Chen, Shi-Tang Liu, *Yu-Tsung Wu, Mu-Ting Wu, Chien-Mo Li (National Taiwan University, Taiwan), Norman Chang, Ying-Shiun Li (Ansys, USA), Wen-Tze Chuang (Ansys, Taiwan)
Pagepp. 202 - 207
KeywordDynamic IR-drop, machine learning
AbstractVector-based dynamic IR-drop analysis of an entire vector set is infeasible due to long runtime. In this paper, we use machine learning to perform vector-based IR-drop prediction for all logic cells in the circuit. We extract important features, such as toggle counts and arrival time, directly from the logic simulation waveform so that we can perform vector-based IR-drop prediction quickly. We also propose a feature engineering method, the density map, which increases correlation by 0.1. Our method is scalable because the feature dimension is fixed (72), independent of design size and cell library. Our experiments show that the mean absolute error of the predictor is less than 3% of the nominal supply voltage. We achieve more than 495× speedup compared to a popular commercial tool. Our machine learning prediction can be used to identify IR-drop-risky vectors from the entire test vector set, which is infeasible using traditional IR-drop analysis.
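As a rough illustration of the feature-extraction step, the sketch below counts toggles per net within a time window from a simple event-list waveform. The waveform representation (net to list of (time, value) events) and the window are illustrative assumptions, not the paper's format:

```python
# Toggle-count extraction from a logic-simulation waveform -- one of the
# per-cell features the IR-drop predictor above is described as using.

def toggle_counts(waveform, window):
    """Count value transitions of each net within [start, end)."""
    start, end = window
    counts = {}
    for net, events in waveform.items():
        prev, n = None, 0
        for t, v in events:
            if prev is not None and v != prev and start <= t < end:
                n += 1
            prev = v
        counts[net] = n
    return counts

wave = {"clk": [(0, 0), (5, 1), (10, 0), (15, 1)],
        "en":  [(0, 0), (12, 1)]}
print(toggle_counts(wave, (0, 20)))  # {'clk': 3, 'en': 1}
```

Features like these come straight from logic simulation, which is why the prediction avoids the expensive power-grid analysis entirely.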

TitleFast Electromigration Stress Analysis Considering Spatial Joule Heating Effects
Author*Mohammadamir Kavousi, Liang Chen, Sheldon Tan (University of California, Riverside, USA)
Pagepp. 208 - 213
KeywordElectromigration (EM), Thermomigration (TM), immortality check
AbstractTemperature gradients due to Joule heating have a huge impact on electromigration (EM) induced failure effects. However, Joule heating and the related thermomigration (TM) effects have been less investigated in physics-based EM analysis for VLSI chip design. In this work, we propose a new spatial-temperature-aware transient EM-induced stress analysis method. The new method makes two contributions. First, we propose for the first time a TM-aware void saturation volume estimation method for fast immortality checking in the post-voiding phase; we derive an analytic formula to estimate void saturation in the presence of spatial temperature gradients due to Joule heating. Second, we develop a fast numerical solution for EM-induced stress analysis of multi-segment interconnect trees considering the TM effect. The method first transforms the coupled EM-TM partial differential equations into linear time-invariant ordinary differential equations (ODEs). An extended Krylov subspace-based reduction technique is then employed to reduce the size of the original system matrices so that they can be efficiently simulated in the time domain. The proposed method handles both the void nucleation and void growth phases under time-varying input currents and position-dependent temperatures. The numerical results show that, compared to the recently proposed semi-analytic EM-TM method, the proposed method leads to about 28× speedup on average for interconnects with up to 1000 branches in both phases, with negligible errors.

[To Session Table]

Session 3C  Advanced Machine Learning with Emerging Technologies
Time: 11:10 - 11:55, Tuesday, January 18, 2022
Location: Room C
Chairs: Fan Chen (Indiana University Bloomington, USA), Zhuwei Qin (San Francisco State University, USA)

TitleSONIC: A Sparse Neural Network Inference Accelerator with Silicon Photonics for Energy-Efficient Deep Learning
Author*Febin Payickadu Sunny, Mahdi Nikdast, Sudeep Pasricha (USA)
Pagepp. 214 - 219
Keywordsilicon photonics, DNN acceleration, inference acceleration, sparse neural networks
AbstractSparse neural networks can greatly facilitate the deployment of neural networks on resource-constrained platforms as they offer compact model sizes while retaining inference accuracy. Because of the sparsity in parameter matrices, sparse neural networks can, in principle, be exploited in accelerator architectures for improved energy-efficiency and latency. However, to realize these improvements in practice, there is a need to explore sparsity-aware hardware-software co-design. In this paper, we propose a novel silicon photonics-based sparse neural network inference accelerator called SONIC. SONIC takes advantage of the high energy-efficiency and low latency of photonic devices along with software co-optimization to accelerate sparse neural networks. Our experimental analysis shows that SONIC can achieve up to 5.8× better performance-per-watt and 8.4× lower energy-per-bit than state-of-the-art sparse electronic neural network accelerators; and up to 13.8× better performance per-watt and 27.6× lower energy-per-bit than the best known photonic neural network accelerators.
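The benefit of sparsity that SONIC exploits can be seen in software too: a compressed-sparse-row (CSR) matrix-vector product touches only the nonzero weights. A minimal sketch of that general principle (illustrative, unrelated to the photonic hardware itself):

```python
# CSR sparse matrix-vector product: the compute skipped per zero weight is
# the same work a sparsity-aware accelerator avoids spending energy on.

def csr_matvec(data, indices, indptr, x):
    """y = A @ x where A is stored in CSR form (data/indices/indptr)."""
    y = []
    for row in range(len(indptr) - 1):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):  # nonzeros of this row
            s += data[k] * x[indices[k]]
        y.append(s)
    return y

# A 2x3 weight matrix [[0, 2, 0], [1, 0, 3]] in CSR form:
data, indices, indptr = [2.0, 1.0, 3.0], [1, 0, 2], [0, 1, 3]
print(csr_matvec(data, indices, indptr, [1.0, 1.0, 1.0]))  # [2.0, 4.0]
```

With a highly sparse weight matrix, the inner loop runs over a small fraction of the dense size, which is where the energy and latency savings come from in principle.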

TitleXCelHD: An Efficient GPU-Powered Hyperdimensional Computing with Parallelized Training
Author*Jaeyoung Kang, Behnam Khaleghi (University of California, San Diego, USA), Yeseong Kim (DGIST, Republic of Korea), Tajana Rosing (University of California, San Diego, USA)
Pagepp. 220 - 225
KeywordBrain-inspired Hyperdimensional computing, Edge computing, GPGPU-based acceleration, Machine learning
AbstractHyperdimensional Computing (HDC) is an emerging lightweight machine learning method and an alternative to deep learning. One of its key strengths is that it can be accelerated in hardware, as it offers massive parallelism. Prior work primarily focused on FPGAs and ASICs, which do not provide the seamless flexibility required for HDC applications. The few studies that attempted GPU designs are inefficient, partly due to the complexity of accelerating HDC's bit-level operations on GPUs. Moreover, HDC training has exhibited low hardware utilization due to sequential operations. In this paper, we present XCelHD, a high-performance GPU-powered framework for HDC. XCelHD uses a novel training method to maximize the training speed of the HDC model while fully utilizing the hardware. We propose memory optimization strategies specialized for GPU-based HDC, minimizing the access time to different memory subsystems and redundant operations. We show that the proposed training method reduces the required number of training epochs four-fold while achieving comparable accuracy. Our evaluation results on the NVIDIA Jetson TX2 show that XCelHD is up to 35× faster than the state-of-the-art TensorFlow-based HDC implementation.
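For readers unfamiliar with HDC, its core bit-level operations are simple: XOR binding, majority-vote bundling, and Hamming-distance comparison. A minimal sketch of those primitives (the dimension and encoding here are generic illustrations, not XCelHD's design):

```python
import random

# Hyperdimensional computing primitives on binary hypervectors:
# bind = XOR, bundle = coordinate-wise majority, compare = Hamming distance.

D = 1024
random.seed(0)

def rand_hv():     return [random.randint(0, 1) for _ in range(D)]
def bind(a, b):    return [x ^ y for x, y in zip(a, b)]            # XOR
def bundle(hvs):   return [int(sum(c) * 2 > len(hvs)) for c in zip(*hvs)]
def hamming(a, b): return sum(x != y for x, y in zip(a, b))

a, b, c = rand_hv(), rand_hv(), rand_hv()
proto = bundle([a, b, c])
# A bundled class prototype stays much closer to its members than to an
# unrelated random vector, which is what makes Hamming-based classification work:
print(hamming(proto, a) < hamming(proto, rand_hv()))  # True
```

These are exactly the kind of bit-level operations the abstract says are awkward to map onto GPUs, since each logical operation manipulates individual bits rather than wide floating-point words.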

TitleHAWIS: Hardware-Aware Automated WIdth Search for Accurate, Energy-Efficient and Robust Binary Neural Network on ReRAM Dot-Product Engine
Author*Qidong Tang, Zhezhi He, Fangxin Liu, Zongwu Wang, Yiyuan Zhou, Yinghuan Zhang, Li Jiang (Shanghai Jiao Tong University, China)
Pagepp. 226 - 231
Keywordbinary neural network, neural architecture, reinforcement learning
AbstractBinary Neural Networks (BNNs) have attracted tremendous attention in ReRAM-based Processing-In-Memory (PIM) systems, since they significantly simplify the hardware-expensive peripheral circuits and reduce the memory footprint. Meanwhile, BNNs are proven to have superior bit-error tolerance, which inspires us to exploit this capability in PIM systems whose memory bit-cells suffer from severe device defects. Nevertheless, prior BNN works do not simultaneously meet the criteria of 1) achieving accuracy similar to the full-precision counterpart; 2) being fully binarized, without full-precision operations; and 3) allowing rapid BNN construction, which hampers real-world deployment. This work proposes the first framework, called HAWIS, whose generated BNNs satisfy all the above criteria. The proposed framework utilizes super-net pre-training and reinforcement-learning-based width search for BNN generation. Our experimental results show that the BNN generated by HAWIS achieves 69.3% top-1 accuracy on ImageNet with ResNet-18. In terms of robustness, our method increases inference accuracy by up to 66.9% and 20% compared to 8-bit and baseline 1-bit counterparts under ReRAM non-ideal effects.

TitleSynthNet: A High-throughput yet Energy-efficient Combinational Logic Neural Network
Author*Tianen Chen, Taylor Kemp, Younghyun Kim (University of Wisconsin, USA)
Pagepp. 232 - 237
Keywordmachine learning, neural networks, approximate computing, logic synthesis
AbstractCombinational logic neural networks (CLNNs), where neurons are realized as combinational logic circuits or look-up tables (LUTs), make extremely low-latency inference possible by performing the computation in pure hardware without loading weights from memory. The high throughput, however, is powered by massively parallel logic circuits or LUTs and hence comes with high area occupancy and high energy consumption. We present SYNTHNET, a novel CLNN design method that effectively identifies and keeps only the sublogics that play a critical role in accuracy and removes those that do not contribute to it. It exploits the abundant redundancy in NNs that can be captured only in CLNNs, and thereby dramatically reduces the energy consumption of CLNNs without accuracy degradation. We demonstrate the efficacy of SYNTHNET on the CIFAR-10 dataset, maintaining competitive accuracy while successfully replacing layers of a VGG-style network, which traditionally uses memory-based floating-point operations, with combinational logic. Experimental results suggest our design can reduce the energy consumption of CLNNs by more than 90% compared to the state-of-the-art design.

[To Session Table]

Session 3D  Software Solutions for Heterogeneous Embedded Architectures
Time: 11:10 - 11:55, Tuesday, January 18, 2022
Location: Room D
Chairs: Sara Vinco (Politecnico di Torino, Italy), Jason Xue (City University of Hong Kong, Hong Kong)

Best Paper Award
TitleOptimal Data Allocation for Graph Processing in Processing-in-Memory Systems
Author*Zerun Li, Xiaoming Chen, Yinhe Han (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 238 - 243
Keyworddata allocation, graph algorithm, processing-in-memory
AbstractGraph processing involves many irregular memory accesses and demands high memory bandwidth, making it difficult to execute efficiently on compute-centric architectures. Dedicated graph processing accelerators based on the processing-in-memory (PIM) technique have recently been proposed. Although they achieve higher performance and energy efficiency than conventional architectures, the data allocation problem for communication minimization in PIM systems (e.g., hybrid memory cubes (HMCs)) has still not been well solved. In this paper, we demonstrate that the conventional “graph data distribution = graph partitioning” assumption does not hold, and that the memory access patterns of graph algorithms should also be taken into account when partitioning graph data for communication minimization. For this purpose, we classify graph algorithms into two representative classes from a memory-access-pattern point of view and propose different graph data partitioning strategies for each. We then propose two algorithms to optimize the partition-to-HMC mapping to minimize inter-HMC communication. Evaluations demonstrate the superiority of our data allocation framework: data-movement energy efficiency is improved by 4.2-5× on average over the state-of-the-art GraphP approach.
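The partition-to-cube mapping problem can be pictured with a simple greedy heuristic that co-locates heavily communicating partitions. This is an illustrative sketch of the problem only, not the authors' optimization algorithms; the traffic model and capacities are assumptions:

```python
# Greedy partition-to-memory-cube mapping: place the endpoints of the
# heaviest-traffic partition pairs in the same cube when capacity allows,
# so that less traffic crosses cube boundaries.

def greedy_map(n_parts, n_cubes, traffic):
    """traffic: (u, v) -> weight. Returns (assignment, cross-cube traffic)."""
    cap = n_parts // n_cubes                 # equal-capacity cubes (assumed)
    assign, load = {}, [0] * n_cubes
    for (u, v), w in sorted(traffic.items(), key=lambda e: -e[1]):
        for p in (u, v):
            if p not in assign:
                other = v if p == u else u
                target = assign.get(other)   # co-locate with partner if possible
                if target is not None and load[target] < cap:
                    assign[p] = target
                else:
                    assign[p] = min(range(n_cubes), key=lambda c: load[c])
                load[assign[p]] += 1
    cross = sum(w for (u, v), w in traffic.items() if assign[u] != assign[v])
    return assign, cross

traffic = {(0, 1): 9, (2, 3): 8, (0, 2): 1}
assign, cross = greedy_map(4, 2, traffic)
print(cross)  # only the light (0, 2) edge crosses cubes: 1
```

The abstract's point is that the edge weights themselves should depend on the algorithm's memory access pattern, not just the graph's structure, which a plain graph-partitioning view misses.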

TitleBoosting the Search Performance of B+-tree with Sentinels for Non-volatile Memory
Author*Chongnan Ye, Chundong Wang (ShanghaiTech University, China)
Pagepp. 244 - 249
KeywordB+-tree, Cache-friendly, Non-volatile memory, In-memory indexing
AbstractThe B+-tree has been an important index structure since the era of hard disks. Next-generation non-volatile memory (NVM) is striding into computer systems as a new tier, as it combines DRAM's byte-addressability and disk's persistency. Researchers and practitioners have considered building persistent memory by placing NVM on the memory bus for the CPU to directly load and store data. As a result, cache-friendly data structures, such as the B+-tree, have been developed for NVM. State-of-the-art in-NVM B+-trees mainly focus on optimizing write operations (insertion and deletion). However, search is of paramount importance for the B+-tree: not only do search-intensive workloads benefit from an optimized search, but insertion and deletion also rely on a preceding search operation to proceed. In this paper, we attentively study a sorted B+-tree node that spans contiguous cache lines. Such cache lines exhibit a monotonically increasing trend, and searching for a target key across them can be accelerated by estimating the range the key falls into. To do so, we construct a probing Sentinel Array in which a sentinel stands for each cache line of a B+-tree node. Checking the Sentinel Array avoids scanning unnecessary cache lines and hence significantly reduces cache misses for a search. A quantitative evaluation shows that using Sentinel Arrays boosts the search performance of state-of-the-art in-NVM B+-trees by up to 48.4%, while the cost of maintaining the Sentinel Array is low.
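The sentinel idea is easy to picture in software: store the last key of each cache line in a small array, and scan only the one line the target can fall into. A minimal sketch (the line size and node layout are illustrative, not the paper's exact design):

```python
# Probing a Sentinel Array before scanning a sorted node: each "cache line"
# holds a few sorted keys; the sentinel array keeps the last key of each
# line, so a lookup touches at most one line of keys.

LINE = 4  # keys per cache line (illustrative)

def build_sentinels(keys):
    """Last key of every LINE-sized chunk of the sorted key array."""
    return [keys[min(i + LINE - 1, len(keys) - 1)]
            for i in range(0, len(keys), LINE)]

def search(keys, sentinels, target):
    """Return the index of target, scanning only the one candidate line."""
    for line, sentinel in enumerate(sentinels):
        if target <= sentinel:               # first line whose range covers it
            base = line * LINE
            for i in range(base, min(base + LINE, len(keys))):
                if keys[i] == target:
                    return i
            return -1                        # in range but not present
    return -1                                # beyond the largest key

keys = list(range(0, 32, 2))                 # sorted node keys: 0, 2, ..., 30
s = build_sentinels(keys)
print(search(keys, s, 14), search(keys, s, 15))  # 7 -1
```

The sentinel array itself is tiny and contiguous, so probing it is cheap; the savings come from not touching the cache lines that cannot contain the key.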

TitleAlgorithm and Hardware Co-design for Reconfigurable CNN Accelerator
Author*Hongxiang Fan (Imperial College London, UK), Martin Ferianc (University College London, UK), Zhiqiang Que (Imperial College London, UK), He Li (Cambridge University, UK), Shuanglong Liu (Hunan Normal University, China), Xinyu Niu (Corerain Technologies Ltd, China), Wayne Luk (Imperial College London, UK)
Pagepp. 250 - 255
KeywordFPGA, Deep Learning, Neural Network, Real Time
AbstractRecent advances in algorithm-hardware co-design for deep neural networks (DNNs) have demonstrated the potential of automatically designing neural architectures and hardware designs. Nevertheless, it remains a challenging optimization problem due to the expensive training cost and the time-consuming hardware implementation, which make exploration of the vast design space of neural architectures and hardware designs intractable. Different from previous co-design methods for DNNs, this paper proposes an algorithm-hardware co-design framework that decouples DNN training from the design space exploration of hardware architecture and neural architecture. By incorporating the characteristics of the underlying reconfigurable accelerator to construct basic search cells before neural architecture search, a hardware-friendly neural architecture space is proposed. Gaussian process-based accuracy, latency, and power predictors are proposed to speed up the optimization and avoid the time-consuming synthesis and place-and-route processes. Together with a genetic algorithm for optimization, we demonstrate that our framework can effectively find the Pareto frontier in the vast co-design space. In comparison with the manually designed ResNet101 and MobileNetV2, we achieve up to 5% higher accuracy with up to 3 times speedup on the ImageNet dataset. Compared with other state-of-the-art co-design methods, the network and hardware configuration found by our framework achieve 2% ~ 6% higher accuracy, 2x ~ 26x smaller latency, and 8.5x higher energy efficiency.

TitleExploring ILP for VLIW architecture by Quantified Modeling and Dynamic Programming-based Instruction Scheduling
Author*Can Deng, Zhaoyun Chen, Yang Shi, Xichang Kong, Mei Wen (National University of Defense Technology, China)
Pagepp. 256 - 261
KeywordVLIW architecture, list scheduling, dynamic programming, quantifiable
AbstractExploiting the instruction-level parallelism (ILP) of Very Long Instruction Word (VLIW) architectures relies on instruction scheduling. List scheduling (LS) algorithms, the most widely adopted in modern compilers, have limitations in searching for optimal solutions. This paper proposes a quantifiable model for instruction scheduling and a dynamic programming-based strategy (DPS). We evaluate DPS on the FT-Matrix platform and achieve high efficiency: the results show that DPS delivers an efficiency improvement of up to 44.72% within acceptable time cost.
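For context, the list-scheduling baseline that DPS is compared against can be sketched in a few lines: each cycle, pick ready instructions by critical-path priority and fill the issue slots. This is a generic illustration (the DAG, latencies, and slot count are hypothetical), not FT-Matrix's compiler:

```python
# Critical-path list scheduling for a VLIW with a fixed number of issue
# slots: the greedy baseline whose suboptimality motivates DP approaches.

def list_schedule(deps, latency, slots):
    """deps: instr -> set of predecessors. Returns instr -> start cycle."""
    succs = {i: set() for i in deps}
    for i, ps in deps.items():
        for p in ps:
            succs[p].add(i)
    prio = {}
    def cp(i):                       # critical-path length from i to a leaf
        if i not in prio:
            prio[i] = latency[i] + max((cp(s) for s in succs[i]), default=0)
        return prio[i]
    start, cycle = {}, 0
    while len(start) < len(deps):
        ready = [i for i in deps if i not in start and
                 all(p in start and start[p] + latency[p] <= cycle
                     for p in deps[i])]
        for i in sorted(ready, key=cp, reverse=True)[:slots]:
            start[i] = cycle         # issue up to `slots` instructions/cycle
        cycle += 1
    return start

# Diamond DAG a -> {b, c} -> d on a 2-slot VLIW with unit latencies:
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
lat = {i: 1 for i in deps}
sched = list_schedule(deps, lat, slots=2)
print(sched["a"], sched["b"], sched["c"], sched["d"])  # 0 1 1 2
```

Because this greedy pass commits to one instruction order per cycle, it can miss globally better schedules, which is the gap a dynamic-programming search over quantified schedule states aims to close.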

TitleTime-Triggered Scheduling for Time-Sensitive Networking with Preemption
Author*Yuanbin Zhou (Linkoping University, Sweden), Soheil Samii (Linkoping University/General Motors, Sweden), Petru Eles, Zebo Peng (Linkoping University, Sweden)
Pagepp. 262 - 267
KeywordTSN, real-time systems, preemption, time-triggered scheduling
AbstractTime-Sensitive Networking (TSN) is a set of IEEE 802.1 technologies that support real-time and reliable Ethernet communication, commonly used in automotive and industrial automation systems. Time-aware scheduling is adopted in TSN to achieve high temporal predictability. In this paper, we demonstrate that such a scheduling solution alone does not always meet all timing requirements and must be combined with network preemption support. We propose an SMT-based synthesis method for preemptive time-triggered scheduling and routing in TSN. Our experiments demonstrate that schedulability is improved significantly when using frame preemption compared to a standard time-triggered message scheduling approach.

Wednesday, January 19, 2022

[To Session Table]

Session 2K  Keynote Session II
Time: 9:00 - 10:00, Wednesday, January 19, 2022
Location: Room S
Chair: Masanori Hashimoto (Kyoto University, Japan)

Title(Keynote Address) Powering a Quantum Future through Quantum Circuits
Author*Jerry M. Chow (IBM, USA)
AbstractAs the field of quantum computing continues to mature, it becomes critical to drive progress in technologies for quantum computing systems through key metrics - scale, quality, and speed. Pushing on these dimensions enables us to achieve a path to quantum advantage in a practical frictionless fashion. I will overview the recent development of superconducting quantum computing systems and the technological advances by IBM that enabled us to scale superconducting qubits to our latest 127-qubit Eagle processor while also describing our efforts to continue improving the underlying quality of the devices, setting the foundational elements for our roadmap. I will also describe the efforts to expand the ecosystem through open source software with Qiskit, and appealing to a broad set of developers from hardware design all the way to applications research as we continue to drive a future for the consumption of quantum circuits.

[To Session Table]

Session 4A  (SS-3) Technology Advancements inside the Edge Computing Paradigm and using the Machine Learning Techniques
Time: 10:00 - 10:35, Wednesday, January 19, 2022
Location: Room A
Chair: Sabya Das (Synopsys, USA)

Title(Invited Paper) A Task Parallelism Runtime Solution for Deep Learning Applications using MPSoC on Edge Devices
AuthorHua Jiang (Xilinx Inc, USA), Raghav Chakravarthy (Centennial High School, USA), *Ravikumar V Chakaravarthy (Xilinx Inc, USA)
Pagepp. 268 - 274
KeywordAI/ML, MPSoC, parallelism, pipelining, DAG
AbstractAI on edge devices [1] [2] [3] [4] has become increasingly popular over the last few years. Many research projects, such as TVM [5] and TensorFlow Lite [6], have focused on the deployment and acceleration of AI/ML models on edge devices. These solutions have predominantly used data parallelism to accelerate AI/ML models, relying on operator fusion, nested parallelism, memory latency hiding [5], etc., to achieve the best performance on the supported hardware backends. However, when the hardware offers multiple heterogeneous backends, it becomes important to support task parallelism in addition to data parallelism to achieve optimal performance. Task-level parallelism [7] [8] helps break down an AI/ML model into multiple tasks that can be scheduled across the various heterogeneous backends available in a multi-processor system on chip (MPSoC). In our proposed solution, we take an AI/ML compute graph and break it into a directed acyclic graph (DAG) such that each node of the DAG represents a sub-graph of the original compute graph. The nodes of the DAG are generated using an auto-tuner to achieve optimal performance on the corresponding hardware backend and are compiled into binary executables for the targeted backends; we are extending our machine learning framework, XTA [9], to generate the DAG. The XTA runtime analyzes the DAG, generates a scheduling configuration, and, based on the dependencies between nodes, parallelizes or pipelines their execution accordingly. We see a 30% improvement over current solutions by parallelizing the execution of nodes in the DAG. Performance can be further optimized by using more hardware backend cores of the MPSoC to execute the nodes of the DAG in parallel, which is missing in existing solutions.
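The runtime idea, scheduling DAG nodes onto whichever heterogeneous backend finishes them earliest, can be sketched generically. The backend names and per-node costs below are illustrative assumptions, not part of XTA:

```python
# Greedy earliest-finish scheduling of compute-graph sub-graphs (DAG nodes)
# onto heterogeneous backends: independent nodes run in parallel on
# different backends, dependent nodes wait for their predecessors.

def schedule_dag(deps, cost, backends):
    """deps: node -> set of predecessors. Returns node -> finish time."""
    succs = {n: [] for n in deps}
    indeg = {n: len(ps) for n, ps in deps.items()}
    for n, ps in deps.items():
        for p in ps:
            succs[p].append(n)
    free = {b: 0 for b in backends}          # time each backend becomes idle
    finish = {}
    ready = [n for n in deps if indeg[n] == 0]
    while ready:
        n = ready.pop(0)
        est = max((finish[p] for p in deps[n]), default=0)
        b = min(backends, key=lambda b: max(free[b], est))
        s = max(free[b], est)                # start when backend and deps allow
        free[b] = finish[n] = s + cost[n]
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return finish

deps = {"pre": set(), "conv": {"pre"}, "fc": {"pre"}, "post": {"conv", "fc"}}
cost = {"pre": 1, "conv": 4, "fc": 2, "post": 1}
t = schedule_dag(deps, cost, ["cpu", "dpu"])
print(t["post"])  # conv and fc overlap on different backends: finishes at 6
```

With a single backend the same DAG would serialize to 8 time units; the overlap of the independent `conv` and `fc` nodes is where the task-parallel gain comes from.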

Title(Invited Paper) Circuit and System Technologies for Energy-Efficient Edge Robotics
Author*Zishen Wan, Ashwin Sanjay Lele, Arijit Raychowdhury (Georgia Institute of Technology, USA)
Pagepp. 275 - 280
KeywordEdge Intelligence, Energy Efficiency, Robotic Computing, Autonomous System, Hardware Accelerator
AbstractAs we march towards the age of ubiquitous intelligence, we note that AI and intelligence are progressively moving from the cloud to the edge. The success of Edge-AI is pivoted on innovative circuits and hardware that can enable inference and limited learning in resource-constrained edge autonomous systems. This paper introduces a series of ultra-low-power accelerator and system designs on enabling the intelligence in edge robotic platforms, including reinforcement learning neuromorphic control, swarm intelligence, and simultaneous mapping and localization. We put an emphasis on the impact of the mixed-signal circuit, neuro-inspired computing system, benchmarking and software infrastructure, as well as algorithm-hardware co-design to realize the most energy-efficient Edge-AI ASICs for the next-generation intelligent and autonomous systems.

Title(Invited Paper) RTL Regression Test Selection using Machine Learning
Author*Ganapathy Parthasarathy (Synopsys Inc, USA), Aabid Rushdi (Synopsys Inc, Sri Lanka), Parivesh Choudhary, Saurav Nanda (Synopsys Inc, USA), Malan Evans, Hansika Gunasekara (Synopsys Inc, Sri Lanka), Sridhar Rajakumar (Synopsys Inc, USA)
Pagepp. 281 - 287
KeywordFunctional Verification, Machine Learning, Regression Analysis, Tests
AbstractRegression testing is a technique to ensure that micro-electronic circuit design functionality remains correct under iterative changes during the design process. This incurs significant costs in the hardware design and verification cycle in terms of productivity, machine and simulation software costs, and time – sometimes as much as 70% of the hardware design costs. We propose a machine learning approach to select a subset of tests from the set of all RTL regression tests for the design. Ideally, the selected subset should detect all failures that the full set of tests would have detected. Our approach learns characteristics of both the RTL code and the tests during the verification process to estimate the likelihood that a test will expose a bug introduced by an incremental design modification. This paper describes our approach to the problem and its implementation. We also present experiments on several real-world designs of various types, with different types of test suites, that demonstrate significant time and resource savings while maintaining validation quality.

Session 4B  Recent Advances in Placement Techniques
Time: 10:00 - 10:35, Wednesday, January 19, 2022
Location: Room B
Chairs: Jinwook Jung (IBM Research, USA), Evangeline F.Y. Young (Chinese University of Hong Kong, Hong Kong)

Best Paper Award
Title: Net Separation-Oriented Printed Circuit Board Placement via Margin Maximization
Author: Chung-Kuan Cheng, Chia-Tung Ho, *Chester Holtz (University of California, San Diego, USA)
Page: pp. 288 - 293
Keyword: Placement, Congestion
Abstract: Packaging is becoming a crucial process due to the paradigm shift of the More than Moore roadmap. Addressing manufacturing and yield issues poses a significant challenge for modern layout algorithms. In this work, we propose to use printed circuit board (PCB) placement as a benchmark for the packaging problem. A maximum-margin formulation is devised to model the separation between nets. Our framework includes seed layout proposals, a coordinate descent-based procedure to optimize routability, and a mixed-integer linear programming method to locally legalize the layout. We perform an extensive study using 14 PCB designs and an open-source auto-router. We demonstrate that the placements produced by NS-place improve routed wirelength by up to 25%, reduce the number of vias by up to 50%, and reduce DRVs by 79% on average compared to the manual placements.

Title: HybridGP: Global Placement for Hybrid-Row-Height Designs
Author: Kuan-Yu Chen, *Hsiu-Chu Hsu, Wai-Kei Mak, Ting-Chi Wang (National Tsing Hua University, Taiwan)
Page: pp. 294 - 299
Keyword: global placement
Abstract: Conventional global placement algorithms typically assume that all cell rows in a design have the same height. Nevertheless, a design making use of standard cells with short-row height, tall-row height, and double-row (short plus tall) height can provide a better sweet spot for performance and area co-optimization in advanced nodes. In this paper, we assume that, for a hybrid-row-height design, the placement region is composed of both tall rows and short rows, and that a cell library containing multiple versions of each cell in the design is provided. We present a new analytical global placer, HybridGP, for such hybrid-row-height designs. Furthermore, we assume that a subset of cells with sufficient timing slack is given, so that we may change their versions without overall timing degradation if desired. Our approach considers the usage of short-row and tall-row resources and exploits the flexibility of cell version change to facilitate the subsequent legalization stage. With the same legalizer applied for final placement legalization, we compared HybridGP with a conventional global placer. The experimental results show that HybridGP obtains legalized placement solutions of much better quality in less run time.

Title: DREAMPlaceFPGA: An Open-Source Analytical Placer for Large Scale Heterogeneous FPGAs using Deep-Learning Toolkit
Author: *Rachel Selina Rajarathnam, Mohamed Baker Alawieh, Zixuan Jiang (The University of Texas at Austin, USA), Mahesh Iyer (Intel Corporation, USA), David Z. Pan (The University of Texas at Austin, USA)
Page: pp. 300 - 306
Keyword: Field Programmable Gate Arrays, GPU Acceleration, Deep Learning Toolkit
Abstract: Modern Field Programmable Gate Arrays (FPGAs) are large-scale heterogeneous programmable devices that enable high performance and energy efficiency. Placement is a crucial and computationally intensive step in the FPGA design flow that determines the physical locations of the various heterogeneous instances in a design. Several works have employed GPUs and FPGAs to accelerate FPGA placement and have obtained significant runtime improvements. However, with these approaches, it is a non-trivial effort to develop optimized, algorithm-specific kernels for GPUs and FPGAs to realize the best acceleration performance. In this work, we present DREAMPlaceFPGA, an open-source deep-learning-toolkit-based accelerated placement framework for large-scale heterogeneous FPGAs. Notably, we develop new operators in our framework to handle heterogeneous resources and FPGA architecture-specific legality constraints. The proposed framework requires low development cost and is extensible to different placement optimizations. Our experimental results on the ISPD’2016 benchmarks show very promising results compared to prior approaches.

Session 4C  Emerging Trends in Stochastic Computing
Time: 10:00 - 10:35, Wednesday, January 19, 2022
Location: Room C
Chairs: Xunzhao Yin (Zhejiang University, China), Lang Feng (Nanjing University, China)

Title: Linear Feedback Shift Register Reseeding for Stochastic Circuit Repairing and Minimization
Author: *Chen Wang, Weikang Qian (Shanghai Jiao Tong University, China)
Page: pp. 307 - 313
Keyword: stochastic computing, SC circuit repair, SC circuit minimization, LFSR reseeding
Abstract: Stochastic computing (SC) is a re-emerging paradigm to realize complicated computation by simple circuitry. Although SC has strong tolerance to bit flip errors, manufacturing defects may still cause unacceptably large computation errors. SC circuits commonly adopt linear feedback shift registers (LFSRs) for stochastic bit stream generation. In this study, we observe that the computation error of a faulty LFSR-based SC circuit can be reduced by LFSR reseeding. We propose novel methods to use LFSR reseeding to 1) repair a faulty SC circuit and 2) minimize an SC circuit by constant replacement. Our experiments show the effectiveness of our proposed methods. Notably, the proposed SC circuit minimization method achieves an average 36% area-delay product reduction over the state-of-the-art fully-shared LFSR design with no reduction of the computation accuracy.
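
As background for why reseeding is a useful degree of freedom, a minimal sketch of LFSR-based stochastic bitstream generation (the paper's repair and minimization algorithms are not reproduced here): a maximal-period LFSR visits every nonzero state, so comparing its state against a constant yields a stream whose fraction of ones, and thus the encoded value, is independent of the seed, while the positions of individual bits shift with the seed.

```python
# A 4-bit Fibonacci LFSR with primitive polynomial x^4 + x^3 + 1 cycles
# through all 15 nonzero states. Comparing each state against a constant C
# (state <= C) gives a stream with exactly C ones, encoding the value C/15.
# Reseeding only rotates the state sequence: same encoded value, different
# bit order -- the degree of freedom reseeding-based repair exploits.

def lfsr_states(seed, n=4):
    state, states = seed, []
    for _ in range((1 << n) - 1):
        states.append(state)
        fb = ((state >> 3) ^ (state >> 2)) & 1   # taps for x^4 + x^3 + 1
        state = ((state << 1) | fb) & ((1 << n) - 1)
    return states

def bitstream(seed, c):
    return [1 if s <= c else 0 for s in lfsr_states(seed)]

a, b = bitstream(seed=0b0001, c=6), bitstream(seed=0b1011, c=6)
print(sum(a), sum(b))   # both streams encode 6/15
print(a == b)           # False: same value, different bit order
```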

Title: BSC: Block-based Stochastic Computing to Enable Accurate and Efficient TinyML
Author: *Yuhong Song, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Rui Xu, Yongzhuo Zhang (East China Normal University, China), Bingzhe Li (Oklahoma State University, USA), Lei Yang (University of New Mexico, USA)
Page: pp. 314 - 319
Keyword: Stochastic Computing, Approximate Computing, Circuit Optimization, Tiny Machine Learning, Power Consumption
Abstract: Along with the progress of AI democratization, machine learning (ML) has been successfully applied to edge applications, such as smartphones and automated driving. Nowadays, more applications require ML on tiny devices with extremely limited resources, such as implantable cardioverter defibrillators (ICDs), which is known as TinyML. Unlike ML on the edge, TinyML with a limited energy supply has higher demands on low-power execution. Stochastic computing (SC), which uses bitstreams for data representation, is promising for TinyML since it can perform the fundamental ML operations using simple logic gates instead of complicated binary adders and multipliers. However, SC commonly suffers from low accuracy on ML tasks due to low data precision and inaccuracy of the arithmetic units. Increasing the length of the bitstream, as in existing works, can mitigate the precision issue but incurs higher latency. In this work, we propose a novel SC architecture, namely Block-based Stochastic Computing (BSC). BSC divides inputs into blocks, such that the latency can be reduced by exploiting high data parallelism. Moreover, optimized arithmetic units and an output revision (OUR) scheme are proposed to improve accuracy. On top of that, a global optimization approach is devised to determine the number of blocks, which enables a better latency-power trade-off. Experimental results show that BSC can outperform existing designs, achieving over 10% higher accuracy on ML tasks and over 6× power reduction.
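
The "simple logic gates" claim can be seen in the canonical unipolar SC multiplier, which is just a single AND gate applied to two bitstreams; the sketch below is background only and does not show BSC's block partitioning or OUR scheme.

```python
# Unipolar stochastic multiplication: a value p in [0,1] is encoded as a
# random bitstream whose fraction of ones is p, and multiplying two
# independent streams reduces to a bitwise AND. Illustrative sketch only.
import random

random.seed(0)                    # deterministic streams for the demo
n = 4096                          # bitstream length
a_val, b_val = 0.5, 0.25
a = [1 if random.random() < a_val else 0 for _ in range(n)]
b = [1 if random.random() < b_val else 0 for _ in range(n)]
prod = [x & y for x, y in zip(a, b)]    # the AND gate is the multiplier
print(sum(prod) / n)                    # close to a_val * b_val = 0.125
```

The approximation error shrinks as the stream length grows, which is exactly the precision/latency trade-off the abstract describes.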

Title: Streaming Accuracy: Characterizing Early Termination in Stochastic Computing
Author: *Hsuan Hsiao (University of Toronto, Canada), Joshua San Miguel (University of Wisconsin-Madison, USA), Jason Anderson (University of Toronto, Canada)
Page: pp. 320 - 325
Keyword: stochastic computing
Abstract: Stochastic computing has garnered interest in the research community due to its ability to implement complicated computations with very small area footprints, at the cost of some accuracy and higher latency. Given its unique tradeoffs between area, accuracy, and latency, one commonly used technique to minimize area and latency is to early-terminate computation. It is therefore useful to be able to measure and characterize how amenable a bitstream is to early termination. We present streaming accuracy, a metric that measures how far a bitstream is from its most early-terminable form. We show that it overcomes limitations of prior studies, and we characterize the design space for building stochastic circuits with early termination. We then propose a new hardware bitstream generator that produces bitstreams with optimal streaming accuracy.
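
Why early termination depends on bit order, the property streaming accuracy quantifies, can be seen in a tiny sketch (the metric itself is not reproduced here): two streams encoding the same value can behave very differently when truncated.

```python
# Both streams below encode 0.5 over 16 bits, but only the well-interleaved
# one can be truncated early with no error; the clustered one badly
# overestimates the value at every short prefix. Illustrative sketch only.

def estimate(bits, n):
    """Value encoded by the first n bits (the early-terminated estimate)."""
    return sum(bits[:n]) / n

interleaved = [1, 0] * 8           # ones evenly spread out
clustered   = [1] * 8 + [0] * 8    # all ones up front
print(estimate(interleaved, 4), estimate(clustered, 4))   # 0.5 vs 1.0
```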

Session 4D  Efficient Techniques for Emerging Applications
Time: 10:00 - 10:35, Wednesday, January 19, 2022
Location: Room D
Chairs: Xueqing Li (Tsinghua University, China), Sangyoung Park (Technische Universität Berlin, Germany)

Best Paper Candidate
Title: TENET: Temporal CNN with Attention for Anomaly Detection in Automotive Cyber-Physical Systems
Author: *Sooryaa Vignesh Thiruloga, Vipin Kumar Kukkala, Sudeep Pasricha (Colorado State University, USA)
Page: pp. 326 - 331
Keyword: Cyber Physical Systems, In-vehicle Networks, Anomaly Detection, Convolutional Neural Networks
Abstract: Modern vehicles have multiple electronic control units (ECUs) that are connected together as part of a complex distributed cyber-physical system (CPS). The ever-increasing communication between ECUs and external electronic systems has made these vehicles particularly susceptible to a variety of cyber-attacks. In this work, we present a novel anomaly detection framework called TENET to detect anomalies induced by cyber-attacks on vehicles. TENET uses temporal convolutional neural networks with an integrated attention mechanism to learn the dependency between messages traversing the in-vehicle network. After deployment in a vehicle, TENET employs a robust quantitative metric and classifier, together with the learned dependencies, to detect anomalous patterns. TENET achieves an improvement of 32.70% in False Negative Rate, 19.14% in the Matthews Correlation Coefficient, and 17.25% in the ROC-AUC metric, with 94.62% fewer model parameters, an 86.95% decrease in memory footprint, and 48.14% lower inference time compared to the best performing prior work on automotive anomaly detection.

Title: ELight: Enabling Efficient Photonic In-Memory Neurocomputing with Life Enhancement
Author: *Hanqing Zhu, Jiaqi Gu, Chenghao Feng, Mingjie Liu, Zixuan Jiang, Ray Chen, David Pan (University of Texas at Austin, USA)
Page: pp. 332 - 338
Keyword: Optical Neural Networks, Photonic in-memory neurocomputing, Reliability, phase change material
Abstract: With the recent advances in optical phase change materials (PCMs), photonic in-memory neurocomputing has demonstrated its superiority in optical neural network (ONN) designs, with near-zero static power consumption, time-of-light latency, and compact footprint. However, photonic tensor cores require massive hardware reuse to implement large matrix multiplications due to the limited single-core scale. The resulting large number of PCM writes leads to serious dynamic power consumption and overwhelms the fragile PCM cells, which have limited write endurance. In this work, we propose a synergistic optimization framework, ELight, to minimize overall write effort for efficient and reliable optical in-memory neurocomputing. We first propose write-aware training to encourage similarity among weight blocks, and combine it with a post-training optimization method that reduces programming effort by eliminating redundant writes. Experiments show that ELight achieves over 20× reduction in the total number of writes and in dynamic power with comparable accuracy. With ELight, photonic in-memory neurocomputing moves a step closer to viable machine learning applications, with preserved accuracy, an order-of-magnitude longer lifetime, and lower programming energy.

Title: Solving Least-Squares Fitting in O(1) Using RRAM-based Computing-in-Memory Technique
Author: *Xiaoming Chen, Yinhe Han (Institute of Computing Technology, Chinese Academy of Sciences, China)
Page: pp. 339 - 344
Keyword: least-squares fitting, RRAM
Abstract: Least-squares fitting (LSF) is a fundamental statistical method that is widely used in linear regression problems such as modeling, data fitting, and predictive analysis. For large-scale data sets, LSF is computationally complex and scales poorly due to its O(N²)-O(N³) computational complexity. The computing-in-memory technique has the potential to improve the performance and scalability of LSF. In this paper, we propose a computing-in-memory accelerator based on resistive random-access memory (RRAM) devices. We not only utilize the conventional idea of accelerating matrix-vector multiplications with RRAM-based crossbar arrays, but also elaborate the hardware and the mapping strategy. A unique feature of our approach is that it can finish a complete LSF problem in O(1) time complexity. We also propose a scalable and configurable architecture such that the solvable problem scale is not restricted by the crossbar array size. Experimental results demonstrate the superior performance and energy efficiency of our accelerator.
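
For background, a minimal sketch of the computation being accelerated: classical least-squares fitting of a line y ≈ ax + b via the closed-form normal equations, whose cost grows with the number of samples in a conventional implementation (the paper's RRAM crossbar mapping is not reproduced here).

```python
# Closed-form least-squares line fit: minimize sum (a*x_i + b - y_i)^2.
# The slope and intercept follow from the normal equations for one feature.

def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    b = (sy - a * sx) / n                           # intercept
    return a, b

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # exact fit: (2.0, 1.0)
```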

Title: SonicFFT: A system architecture for ultrasonic-based FFT acceleration
Author: *Darayus Adil Patel (Nanyang Technological University, Singapore), Viet Phuong Bui (Institute of High Performance Computing, A*STAR (Agency for Science, Technology and Research), Singapore), Kevin Tshun Chuan Chai (Institute of Microelectronics, A*STAR (Agency for Science, Technology and Research), Singapore), Amit Lal (Cornell University, USA), Mohamed M. Sabry Aly (Nanyang Technological University, Singapore)
Page: pp. 345 - 351
Keyword: FFT accelerator, ultrasonic computation, compact-modelling, Cooley-Tukey algorithm
Abstract: The Fast Fourier Transform (FFT) is an essential algorithm for numerous scientific and engineering applications, so implementing FFT in a high-performance and energy-efficient manner is key. In this paper, we leverage the properties of ultrasonic wave propagation in silicon for FFT computation. We introduce SonicFFT, a system architecture for ultrasonic-based FFT acceleration. To evaluate its benefits, we develop a compact-model-based simulation framework that quantifies the performance and energy of an integrated system comprising digital computing components interfaced with an ultrasonic FFT accelerator. We also present mapping strategies for computing 2D FFTs with the accelerator. Simulation results show that SonicFFT achieves a 2317× system-level energy-delay product benefit, a simultaneous 117.69× speedup and 19.69× energy reduction, versus a state-of-the-art all-digital baseline configuration.
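
For reference, the radix-2 Cooley-Tukey recursion named in the keywords can be sketched as a plain software model (the paper's ultrasonic implementation and compact-model framework are not reproduced here).

```python
# Radix-2 Cooley-Tukey FFT: split into even/odd halves, recurse, then
# combine with twiddle factors. Input length must be a power of two.
import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

spectrum = fft([1, 1, 1, 1, 0, 0, 0, 0])   # 8-point rectangular pulse
print(round(abs(spectrum[0]), 3))          # DC bin = sum of inputs = 4.0
```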

Session 5A  (DF-2) Compiler and Toolchain for Efficient AI Computation
Time: 10:35 - 11:10, Wednesday, January 19, 2022
Location: Room A
Chair: Chia-Heng Tu (National Cheng Kung University, Taiwan)

Title: (Designers' Forum) Making Deep Learning More Portable with Deep Learning Compiler
Author: *Cody Yu (Amazon Web Services, Inc., USA)
Abstract: Deep learning (DL) has become increasingly popular in recent years. To apply deep learning everywhere with decent performance, vendors and service providers are eager to deploy DL models to various hardware accelerators on both cloud and edge platforms. However, current DL frameworks rely on vendor-specific kernel libraries and face the challenges of limited operator/device coverage and graph/tensor-level optimizations. This talk uses Apache TVM, an open-source DL compiler, along with two of its popular features, auto-scheduling and heterogeneous execution with pluggable backends, to illustrate how a DL compiler can address these challenges by generating high-performance kernels with domain-specific compilation.

Title: (Designers' Forum) Tiny ONNC: MLIR-based AI Compiler for ARM IoT Devices
Author: *Luba Tang (Skymizer Taiwan Inc., Taiwan)
Abstract: Tiny ONNC is an MLIR-based compiler that exports deep neural networks (DNNs) into function calls to the ARM CMSIS-NN library. MLIR is a high-quality compiler framework that addresses software fragmentation issues. By supporting multiple Intermediate Representations in a single infrastructure, compilers can transform a variety of input languages into a common output form. Tiny ONNC leverages this unique power of MLIR to support a rich set of neural network frameworks, including PyTorch, the Open Neural Network Exchange Format (ONNX), TensorFlow, TensorFlow Lite, and even TVM Relay. Tiny ONNC transforms all of these input DNN formats into a function composed of a series of calls to the ARM CMSIS-NN library; MLIR makes this “one fits all” approach possible. Tiny ONNC provides numerous optimizations, such as automatic operator splitting and tensor splitting, to address the memory constraints of microcontrollers. When an operator or a tensor is too big to fit in the cache, Tiny ONNC splits the big object into small pieces and reorganizes the network to reuse memory. Tiny ONNC also supports operators that are not directly supported by CMSIS-NN through mathematically equivalent or approximate transformations. These optimizations deliver strong empirical results while maintaining high memory utilization and high performance. On the MLPerf Tiny benchmark, Tiny ONNC performs within 2% of TensorFlow Lite for Microcontrollers (TFLM) in terms of performance and precision. In the best case, the memory footprint of the generated program is only 3/5 that of TFLM, and the code size is only 1/10. In this talk, we will first introduce MLIR and show how it works in Tiny ONNC. We will then dive into memory optimization strategies and approaches. Finally, we will walk through the experimental results to see how Tiny ONNC outperforms its rivals.

Title: (Designers' Forum) Architecture Design for the DNN Accelerator
Author: *Yao-Hua Chen (ITRI, Taiwan)
Abstract: Deep learning is attracting unprecedented interest in research and industry due to recent successes in many artificial intelligence areas, such as image classification and object recognition. The huge growth in deep learning applications has created an exponential demand for computing power, which has led to the rise of AI-specific hardware. To design a DNN accelerator, designers need to explore the design-space tradeoffs at both the algorithm and architecture levels. In this talk, we introduce NNArch, an analytical model for Deep Neural Network (DNN) accelerators that helps designers explore the performance of a DNN accelerator for given DNN models.

Session 5B  Moving Frontiers of Test and Simulation
Time: 10:35 - 11:10, Wednesday, January 19, 2022
Location: Room B
Chairs: Ying Zhang (Tongji University, China), Michihiro Shintani (Nara Institute of Science and Technology, Japan)

Title: FIRVER: Concolic Testing for Systematic Validation of Firmware Binaries
Author: *Tashfia Alam (University of Florida, USA), Zhenkun Yang, Bo Chen, Nicholas Armour (Intel Corporation, USA), Sandip Ray (University of Florida, USA)
Page: pp. 352 - 357
Keyword: Firmware Validation, Concolic testing, symbolic execution
Abstract: Firmware is low-level software that interacts closely with the underlying hardware. It is crucial to a number of critical functionalities of modern computing systems, including boot, power management, and security. Unfortunately, firmware validation introduces some unique challenges that make it difficult to adopt software validation tools directly. We develop an infrastructure, FIRVER, for systematic validation of firmware binaries. Furthermore, we make unique use of an existing virtual prototyping environment to help comprehend hardware-firmware interaction. We report the use of the framework in generating tests for TianoCore, a comprehensive open-source boot firmware developed by Intel Corporation. Our experiments demonstrate that tests from FIRVER can achieve more than 90% code coverage. FIRVER also enabled the exploration of corner cases that exposed segmentation faults in many constituent functions.

Title: WAL: A Novel Waveform Analysis Language for Advanced Design Understanding and Debugging
Author: *Lucas Klemmer, Daniel Große (Institute for Complex Systems, Johannes Kepler University Linz, Austria)
Page: pp. 358 - 364
Keyword: simulation analysis, debugging, performance optimization, design understanding
Abstract: Generated waveforms are the starting point for design understanding and debugging. However, waveform viewing is still a highly manual and tedious process, and unfortunately, there has been no progress in automating the analysis of waveforms. Therefore, we introduce the Waveform Analysis Language (WAL) in this paper. We realized WAL as a Domain Specific Language (DSL). This design choice has many advantages, ranging from a natural expressiveness for waveform analysis problems to providing an Intermediate Representation (IR) well-suited as a compilation target for other languages. We evaluate WAL in two major case studies: (i) a WAL-based communication analyzer reporting, for example, the throughput or latency of AXI communication, and (ii) the generation of the instruction flow in the pipeline and the extraction of the software basic blocks of a RISC-V processor via WAWK, which builds on the WAL IR to make complex waveform analysis as easy as searching in text files.
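
The flavor of the communication-analyzer case study can be conveyed with a plain-Python sketch (this is not WAL syntax; the signal names and trace are invented): given per-cycle samples of a valid/ready handshake, count completed transfers and derive throughput.

```python
# A transfer on a valid/ready handshake completes in each cycle where both
# signals are high; throughput is transfers per cycle. Illustrative trace.

def handshake_stats(valid, ready):
    transfers = sum(v & r for v, r in zip(valid, ready))
    return transfers, transfers / len(valid)

valid = [1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical per-cycle samples
ready = [1, 0, 1, 1, 1, 0, 1, 1]
n, thr = handshake_stats(valid, ready)
print(n, thr)   # 4 transfers, throughput 0.5 per cycle
```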

Title: Accelerate SAT-based ATPG via Preprocessing and New Conflict Management Heuristics
Author: Junhua Huang (Xiamen University, China), *Hui-Ling Zhen (Noah's Ark Lab, Huawei, China), Naixing Wang (Hisilicon, Huawei, China), Mingxuan Yuan, Hui Mao (Noah's Ark Lab, Huawei, China), Yu Huang (Hisilicon, Huawei, China), Jiping Tao (Xiamen University, China)
Page: pp. 365 - 370
Keyword: ATPG, SAT, Preprocessing, Conflict Management
Abstract: Due to the continuous advancement of semiconductor technologies, there are more defects than ever, widely distributed in manufactured chips. To meet high product quality and low defective-parts-per-million (DPPM) goals, the Boolean satisfiability (SAT) technique has been shown to be a robust alternative to conventional ATPG techniques, especially for hard-to-detect faults. However, SAT-based ATPG still confronts two challenges. The first is to reduce the extra computational overhead of SAT modeling, i.e., transforming a circuit testing problem into a Conjunctive Normal Form (CNF), the input format of modern SAT solvers. The second lies in the SAT solver's efficiency, which suffers from the loss of structural information during CNF transformation. In this work, we propose a new SAT-based ATPG approach to address these two challenges: (1) to reduce CNF transformation overhead, we utilize simulation-driven preprocessing to narrow down the fault propagation and activation logic cones, improving CNF transformation and reducing runtime; (2) to further improve solving efficiency, we propose new ranking-based heuristics to build a more effective conflict database, enabling direct solving for small-scale instances and a look-ahead method for large-scale ones. Extensive experimental results on industrial circuits demonstrate that, on average, the proposed approach covers 89.67% of the faults failed by a commercial ATPG tool with comparable runtime.
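
The CNF transformation at the heart of the first challenge can be illustrated on a single gate with the textbook Tseitin encoding (this is standard background, not the paper's preprocessing; variable numbers are arbitrary).

```python
# Tseitin encoding of an AND gate z = a & b into three CNF clauses:
#   (~a | ~b | z), (a | ~z), (b | ~z)
# Positive integers are positive literals, negatives are negated literals.

def and_gate_cnf(a, b, z):
    return [[-a, -b, z], [a, -z], [b, -z]]

def satisfies(clauses, assign):
    """assign: var -> bool; a clause holds if any of its literals is true."""
    return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)

clauses = and_gate_cnf(1, 2, 3)
# The CNF is satisfied exactly when z equals a AND b, for all 8 assignments:
ok = all(satisfies(clauses, {1: a, 2: b, 3: z}) == (z == (a and b))
         for a in (False, True) for b in (False, True) for z in (False, True))
print(ok)  # True
```

A real ATPG instance conjoins such per-gate clauses for the fault-free and faulty cones plus fault activation/propagation constraints, which is why shrinking the cones shrinks the CNF.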

Title: A Fast and Accurate Middle End of Line Parasitic Capacitance Extraction for MOSFET and FinFET Technologies Using Machine Learning
Author: *Mohamed Saleh Abouelyazid (Siemens EDA, American University in Cairo, Egypt), Sherif Hammouda (Siemens EDA, Egypt), Yehea Ismail (American University in Cairo, Egypt)
Page: pp. 371 - 376
Keyword: middle-end-of-line, neural-networks, parasitic capacitance, parasitic extraction, machine learning
Abstract: A novel machine learning modeling methodology for parasitic capacitance extraction of middle-end-of-line (MEOL) metal layers around FinFETs and MOSFETs is developed. Due to the increasing complexity and parasitic extraction accuracy requirements of MEOL patterns in advanced process nodes, most current parasitic extraction tools rely on field solvers to extract MEOL parasitic capacitances, consuming considerable time, memory, and computational resources. The proposed modeling methodology overcomes these problems by providing compact models that predict MEOL parasitic capacitances efficiently. The compact models are pre-characterized and technology-dependent, and they can handle the increasing accuracy requirements of advanced process nodes. The proposed methodology scans layouts for devices, extracts geometrical features of each device using a novel geometry-based pattern representation, and uses the extracted features as inputs to the required machine learning models. Two machine learning methods are used: support vector regression and neural networks. The testing covered more than 40M devices in several different real designs belonging to 28nm and 7nm process technology nodes. The proposed methodology provides outstanding results compared to field solvers, with an average error < 0.2%, a standard deviation < 3%, and a speedup of 100×.

Session 5C  Optimizations in Modern Memory Architecture
Time: 10:35 - 11:10, Wednesday, January 19, 2022
Location: Room C
Chairs: Chenchen Fu (Southeast University, China), Shouzhen Gu (East China Normal University, China)

Best Paper Candidate
Title: Lamina: Low Overhead Wear Leveling for NVM with Bounded Tail
Author: *Jiacheng Huang, Min Peng, Libing Wu (Wuhan University, China), Chun Jason Xue (City University of Hong Kong, Hong Kong), Qingan Li (Wuhan University, China)
Page: pp. 377 - 382
Keyword: non-volatile memory, wear leveling, memory management
Abstract: Emerging non-volatile memory (NVM) has been considered a promising candidate for the next-generation memory architecture because of its excellent characteristics. However, the endurance of NVM is much lower than that of DRAM. Without additional wear management technology, its lifetime can be very short, which severely limits the use of NVM. This paper observes that tail wear, where a very small percentage of pages deviate extremely in wear, significantly hurts the lifetime of NVM, an issue that existing methods do not solve effectively. We present Lamina to address the tail wear issue and thereby improve the lifetime of NVM. Lamina consists of two parts: bounded tail wear leveling (BTWL) and lightweight wear enhancement (LWE). BTWL keeps the wear degree of all pages close to the average value and controls the upper limit of tail wear. LWE improves the accuracy of BTWL by exploiting locality to interpolate low-frequency sampling schemes in the virtual memory space. Our experiments show that, compared with state-of-the-art methods, Lamina can significantly improve the lifetime of NVM with low overhead.

Title: Heterogeneous Memory Architecture Accommodating Processing-In-Memory on SoC For AIoT Applications
Author: *Kangyi Qiu (Peking University, China), Yaojun Zhang (Pimchip Technology Co., Ltd., China), Bonan Yan, Ru Huang (Peking University, China)
Page: pp. 383 - 388
Keyword: Processing-In-Memory, Compute-In-Memory, SoC, Programmable Architecture, Hardware/Software Interface
Abstract: Processing-In-Memory (PIM) technology is one of the most promising candidates for AIoT applications due to its attractive characteristics, such as low computation latency, large throughput, and high power efficiency. However, how to efficiently utilize PIM within a System-on-Chip (SoC) architecture has been scarcely discussed. In this paper, we demonstrate a series of solutions, from hardware architecture to algorithm, that maximize the benefits of PIM design. First, we propose a Heterogeneous Memory Architecture (HMA) that augments an existing SoC with PIM via high-throughput on-chip buses. Then, based on the HMA structure, we propose an HMA tensor mapping approach to partition tensors and deploy general matrix multiplication operations on PIM structures. Both the HMA hardware and the HMA tensor mapping approach harness the programmability of the mature embedded CPU solution stack and maximize the high efficiency of PIM technology. The whole HMA system saves 416× power as well as 44.6% design area compared with the latest accelerator solutions. The evaluation also shows that our design reduces operation latency for TinyML applications by 430× and 11× compared with a state-of-the-art baseline and with PIM without optimization, respectively.

Title: Optimal Loop Tiling for Minimizing Write Operations on NVMs with Complete Memory Latency Hiding
Author: *Rui Xu, Edwin Hsing.-Mean Sha, Qingfeng Zhuge, Yuhong Song, Jingzhi Lin (East China Normal University, China)
Page: pp. 389 - 394
Keyword: loop tiling, Write operations, memory latency hiding, Non-volatile Memory, nested loop
Abstract: Non-volatile memory (NVM) is expected to serve as the second-level memory (the remote memory) in future two-level memory hierarchies. However, NVM has limited write endurance, so it is vital to reduce the number of write operations on NVM. Meanwhile, in a two-level memory hierarchy, prefetching is widely used to fetch data before it is actually required, shortening the completion time and hiding the remote memory access latency. In general, large-scale nested loops are the performance bottleneck of a program due to the write operations on NVM caused by first-level (local) memory misses and data reuse. In this paper, we propose a new loop tiling approach for minimizing the write operations on NVMs and completely hiding the NVM latency. Specifically, we introduce a series of theorems to guide loop tiling. Then, an optimal tile size selection strategy is proposed according to data dependency and local memory capacity. Furthermore, we propose a pipeline scheduling policy to completely hide the remote memory latency. Extensive experiments show that the proposed techniques reduce write operations on NVMs by 95.1% on average, while the NVM latency can be completely hidden.
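
The general shape of loop tiling can be sketched as follows (illustrative only; the paper's tile-size theorems and pipeline schedule are not reproduced): each tile of the output is finished while it resides in fast local memory, so it is written back to the remote NVM once per tile rather than once per element touch.

```python
# Tiled matrix transpose: the two outer loops walk over T-by-T tiles, the
# two inner loops finish one tile entirely, and the tile is then written
# back as a single block. Counting block write-backs stands in for the
# NVM write traffic a tiled schedule generates.

def transpose_tiled(a, n, t):
    out = [[0] * n for _ in range(n)]
    writes_back = 0
    for ii in range(0, n, t):                       # iterate over tiles
        for jj in range(0, n, t):
            for i in range(ii, min(ii + t, n)):     # work inside one tile
                for j in range(jj, min(jj + t, n)):
                    out[j][i] = a[i][j]
            writes_back += 1                        # one write-back per tile
    return out, writes_back

a = [[i * 4 + j for j in range(4)] for i in range(4)]
out, wb = transpose_tiled(a, n=4, t=2)
print(wb)          # 4 tile write-backs instead of 16 per-element writes
print(out[1][2])   # a[2][1] == 9
```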

Session 5D  Novel Boolean Optimization and Mapping
Time: 10:35 - 11:10, Wednesday, January 19, 2022
Location: Room D
Chairs: Shinji Kimura (Waseda University, Japan), Kenshu Seto (Tokyo City University, Japan)

Title: Boolean Rewriting Strikes Back: Reconvergence-Driven Windowing Meets Resynthesis
Author: Heinz Riener, *Siang-Yun Lee (EPFL, Switzerland), Alan Mishchenko (UC Berkeley, USA), Giovanni de Micheli (EPFL, Switzerland)
Page: pp. 395 - 402
Keyword: logic synthesis, circuit rewriting, Boolean resynthesis, reconvergence
Abstract: The paper presents a novel DAG-aware Boolean rewriting algorithm for restructuring combinational logic before technology mapping. The algorithm, called window rewriting, repeatedly selects small parts of the logic and replaces them with more compact implementations. Window rewriting combines small-scale windowing with a fast heuristic Boolean resynthesis. The former uses sophisticated structural analysis to capture reconvergent paths in a multi-output window. The latter re-expresses the multi-output Boolean function of the window using fewer gates if possible. Experiments on the EPFL benchmarks show that a single iteration of window rewriting outperforms state-of-the-art AIG rewriting repeated until convergence in both quality and runtime.

Title: Delay Optimization of Combinational Logic by And-Or Path Restructuring
Author: *Ulrich Brenner (University of Bonn, Germany), Anna Silvanus (Synopsys GmbH, Germany)
Page: pp. 403 - 409
Keyword: Logic optimization, And-Or Paths, Timing Optimization
Abstract: We propose a timing optimization framework that replaces critical paths by logically equivalent realizations with less delay. Our tool allows revising early decisions on the logical structure of the netlist in late physical design. The core routine of our framework is a new algorithm that constructs delay-optimized circuits for alternating and-or paths with prescribed input arrival times. It is a sophisticated dynamic programming algorithm that is a common generalization of the previously best approaches. In contrast to all earlier methods, we avoid fixing the structure of sub-solutions before deciding how to combine them, significantly expanding the search space of the algorithm. Our algorithm provably fulfills the best known approximation guarantees, almost always computes delay-optimum solutions, and empirically outperforms all previous approaches. The reduction to and-or path optimization allows us to optimize paths of arbitrary length in our logic restructuring framework. The framework is applied successfully as a late step in an industrial physical design flow. Experiments demonstrate the effectiveness of our tool on industrial 7nm instances.

Title: A Versatile Mapping Approach for Technology Mapping and Graph Optimization
Author: *Alessandro Tempia Calvino, Heinz Riener (EPFL, Switzerland), Shubham Rai, Akash Kumar (TU Dresden, Germany), Giovanni De Micheli (EPFL, Switzerland)
Page: pp. 410 - 416
Keyword: Technology mapping, Logic synthesis, Technology-independent optimization, Logic rewriting
Abstract: This paper proposes a versatile mapping approach that has three objectives: i) it can map from one technology-independent graph representation to another; ii) it can map to a cell library; iii) it supports logic rewriting. The method is cut-based, mitigates logic-sharing issues of previous graph mapping approaches, and exploits structural hashing. The mapper is the first one of its kind to support remapping among various graph representations, thus enabling specialized mapping to emerging technologies (such as AQFP) and for security applications (such as XAG-based design). We show that mapping to MIGs improves area by 10% as compared to the state of the art, and that technology mapping is 18% faster than ABC with slightly better results.
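As a rough illustration of the cut-based machinery such a mapper builds on, here is a minimal k-feasible cut enumeration over an AND-graph (the data layout and names are invented for this sketch; the paper's mapper adds structural hashing, sharing-aware costing, and multi-representation remapping on top of this primitive):

```python
# Sketch of k-feasible cut enumeration: the cuts of a node are the trivial
# cut {node} plus all merges of one cut per fanin that stay within k leaves.
def enumerate_cuts(fanins, k):
    """fanins: dict node -> (a, b) for 2-input AND nodes, () for primary
    inputs; keys are assumed to be in topological order.
    Returns dict node -> list of cuts (frozensets of leaves), each <= k."""
    cuts = {}
    for node, ins in fanins.items():
        if not ins:                          # primary input: trivial cut only
            cuts[node] = [frozenset([node])]
            continue
        a, b = ins
        result = {frozenset([node])}         # trivial cut of the node itself
        for ca in cuts[a]:
            for cb in cuts[b]:
                merged = ca | cb
                if len(merged) <= k:
                    result.add(merged)
        cuts[node] = sorted(result, key=len)
    return cuts

# Tiny example graph: f = (x AND y) AND (y AND z)
g = {"x": (), "y": (), "z": (), "n1": ("x", "y"), "n2": ("y", "z"), "f": ("n1", "n2")}
cuts = enumerate_cuts(g, 3)
assert frozenset(["x", "y", "z"]) in cuts["f"]  # f is expressible over {x, y, z}
```

Each cut of `f` corresponds to one candidate match during mapping; the mapper then chooses among cuts by area, delay, or the target representation's cost model.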


Session 6A  (DF-3) AI for Chip Design and Testing
Time: 11:10 - 11:45, Wednesday, January 19, 2022
Location: Room A
Chair: Ing-Chao Lin (National Cheng Kung University, Taiwan)

Title: (Designers' Forum) Reinforcement Learning-Driven Optimization for Superior Performance, Power and Productivity in Chip Design
Author: *Thomas Andersen (Synopsys, USA)
Abstract: Artificial Intelligence is an avenue to innovation that is touching every industry worldwide. We present a novel technology applying reinforcement learning (RL) for optimizing the digital implementation process for performance, productivity and turn-around time. Our revolutionary DSO.ai technology is used to massively scale exploration of options in design workflows and automates multi-dimensional optimization objectives. AI-grade DSO.ai automation opens up a new growth trajectory for the semiconductor industry, enabling companies to build new products and deliver the innovations of tomorrow.

Title: (Designers' Forum) Machine Learning for Electronic Design Automation
Author: *Erick Chao (Cadence, Taiwan)
Abstract: New applications and technology are driving demand for even more compute power and functionality in the devices we use every day. Consequently, the semiconductor industry is experiencing strong growth based on technologies such as 5G, autonomous driving, hyperscale compute, industrial IoT, and many others. System-on-chip (SoC) designs are quickly migrating to new process nodes and rapidly growing in size and complexity, further increasing the need for engineering teams to increase productivity.
Machine learning combined with distributed computing offers new capabilities to automate and scale RTL-to-GDS chip implementation flows, enabling design teams to support more, and increasingly complex, SoC projects. In this talk, we will address key technologies behind the new Cadence® Cerebrus™ Intelligent Chip Explorer and the RTL-to-signoff implementation flow to show how EDA software can help designers achieve up to 10X productivity and 20% PPA improvements for implementation.

Title: (Designers' Forum) Fast Reward Calculation for Reinforcement Learning Macro Placement
Author: *Tung-Chieh Chen (Maxeda Technology, Taiwan)
Abstract: Reinforcement learning has been effectively applied to several EDA problems. One practical application is hard macro placement. Since over 10,000 solutions are evaluated during the reinforcement learning process, an efficient reward calculation method with high fidelity is desired. We studied several simplified cost models and their correlation coefficients for three important cost metrics: wirelength, congestion, and timing. A framework is proposed to find a fast reward calculation technique that can be used in reinforcement learning-based macro placement.


Session 6B  Towards Reliable and Secure Circuits: Cross Perspectives
Time: 11:10 - 11:45, Wednesday, January 19, 2022
Location: Room B
Chairs: Kaveh Shamsi (University of Texas at Dallas, USA), Xiaolong Guo (Kansas State University, USA)

Title: Avatar: Reinforcing Fault Attack Countermeasures in EDA with Fault Transformations
Author: *Prithwish Basu Roy (IIT Madras, India), Patanjali SLPSK (University of Florida, USA), Chester Rebeiro (IIT Madras, India)
Page: pp. 417 - 422
Keyword: Fault Injection Attack, Gate Reconfiguration, EDA Security
Abstract: Cryptographic hardware is highly vulnerable to a class of side-channel attacks known as Differential Fault Analysis (DFA). These attacks exploit fault-induced errors to compromise secret keys from ciphers within a few seconds. A bias in the error probabilities strengthens the attack considerably: it helps bypass countermeasures and is also the basis of powerful attack variants like Differential Fault Intensity Analysis (DFIA) and Statistical Ineffective Fault Analysis (SIFA). In this paper, we make two significant contributions. First, we identify the correlation between fault-induced errors and gate-level parameters like the threshold voltage, gate size, and VDD, and we show how these parameters can influence the bias in the error probabilities. Then, we propose an algorithm, called Avatar, that carefully tunes gate-level parameters to strengthen fault attack countermeasures against DFA, DFIA, and SIFA attacks with no additional logic needed. In AES, for instance, fault attack resistance improves by 40% for DFA and DFIA, and 99% in the case of SIFA. Avatar incurs negligible area overheads and can be quickly adopted in any cipher design. It can be incorporated in commercial EDA flows and provides users with tunable knobs to trade off performance and power consumption for fault attack security.

Title: Anti-Piracy of Analog and Mixed-Signal Circuits in FD-SOI
Author: *Mariam Tlili, Alhassan Sayed, Doaa Mahmoud, Marie-Minerve Louerat, Hassan Aboushady, Haralampos-G. Stratigopoulos (Sorbonne University, CNRS, LIP6, France)
Page: pp. 423 - 428
Keyword: IP/IC piracy, Locking, FD-SOI, Analog and Mixed-Signal ICs
Abstract: We propose an anti-piracy security technique based on locking for analog and mixed-signal circuits designed in FD-SOI. We show that obfuscating the body-bias voltages of tunable transistors is an effective way for inducing high functionality corruption. The obfuscation is achieved by constituting a secret key from the concatenation of the input digital codes of the body-bias generators that produce the correct body-bias voltages. We also propose a slight modification of the body-bias generator that increases prohibitively the time complexity of counter-attacks aiming at finding an approximate key. The proposed locking scheme is demonstrated on a Sigma-Delta modulator used in highly-digitized RF receiver architectures.

Title: Toward Optical Probing Resistant Circuits: A Comparison of Logic Styles and Circuit Design Techniques
Author: *Sajjad Parvin (University of Bremen, Germany), Thilo Krachenfels (Technische Universität Berlin, Germany), Shahin Tajik (Worcester Polytechnic Institute, USA), Jean-Pierre Seifert (Technische Universität Berlin, Germany), Frank Sill Torres (German Aerospace Center (DLR), Germany), Rolf Drechsler (University of Bremen, Germany)
Page: pp. 429 - 435
Keyword: dual-rail logic, optical probing, hardware security, differential logic gates
Abstract: Laser-assisted side-channel analysis techniques, such as optical probing (OP), have been shown to pose a severe threat to secure hardware. While several countermeasures have been proposed in the literature, they can either be bypassed by an attacker or require a modification in the transistor's fabrication process, which is costly and complex. In this work, firstly, we propose a formulation for the caliber of reflected light from OP. Secondly, we propose circuit design techniques and logic styles to alleviate OP attacks based on our formulation. Finally, we compare several logic families and circuit design techniques in terms of performance and OP security merits. In this regard, we perform simulations to compare the optical beam interaction between the different logic gates. By utilizing our proposed circuit design techniques and dual-rail logic (DRL), the signal-to-noise ratio (SNR) of the reflected light from OP is reduced significantly.


Session 6C  Accelerator Architectures for Machine Learning
Time: 11:10 - 11:45, Wednesday, January 19, 2022
Location: Room C
Chairs: Linghao Song (University of California, Los Angeles, USA), Bonan Yan (Peking University, China)

Best Paper Candidate
Title: Dynamic CNN Accelerator Supporting Efficient Filter Generator with Kernel Enhancement and Online Channel Pruning
Author: *Chen Tang, Wenyu Sun, Wenxun Wang, Yongpan Liu (Tsinghua University, China)
Page: pp. 436 - 441
Keyword: Dynamic CNN, CNN accelerator
Abstract: Deep neural networks achieve exciting performance on several tasks, at heavy storage and computing costs. Previous works adopt pruning-based methods to slim deep networks. In traditional pruning, either the convolution kernel or the network inference is static, which cannot fully compress the model parameters and restrains performance. In this paper, we propose an online pruning algorithm that supports dynamic kernel generation and dynamic network inference at the same time. Two novel techniques are proposed: a filter generator and importance-level-based channel pruning. Moreover, we validate the proposed method with an implementation on an Ultra96-v2 FPGA. Compared with state-of-the-art static or dynamic pruning methods, our method reduces the top-5 accuracy drop by nearly 50% for a ResNet model on ImageNet at a similar compression level. It also achieves better accuracy while storing up to 50% fewer weights on chip.
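For context, a minimal static importance-based channel-pruning baseline looks as follows (this sketches the conventional approach the paper improves upon, not the proposed online filter-generator scheme; the ranking rule and shapes are illustrative):

```python
import numpy as np

# Static channel pruning baseline: rank output channels of a conv weight
# tensor by L1 norm and keep only the most important fraction.
def prune_channels(weights, keep_ratio):
    """weights: (out_ch, in_ch, k, k). Returns pruned tensor and kept indices."""
    importance = np.abs(weights).sum(axis=(1, 2, 3))      # L1 norm per channel
    k = max(1, int(round(keep_ratio * weights.shape[0])))
    keep = np.sort(np.argsort(importance)[-k:])           # top-k, original order
    return weights[keep], keep

# Four channels with clearly different magnitudes; keep the strongest half.
w = np.zeros((4, 2, 3, 3))
for i, scale in enumerate([0.1, 2.0, 0.5, 3.0]):
    w[i] = scale
pruned, kept = prune_channels(w, 0.5)
assert list(kept) == [1, 3] and pruned.shape == (2, 2, 3, 3)
```

The paper's point is that such a decision is frozen at compile time, whereas its filter generator and online pruning adapt both the kernels and the pruning decision per input.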

Title: Toward Low-Bit Neural Network Training Accelerator by Dynamic Group Accumulation
Author: *Yixiong Yang, Ruoyang Liu, Wenyu Sun, Jinshan Yue, Huazhong Yang, Yongpan Liu (Tsinghua University, China)
Page: pp. 442 - 447
Keyword: CNN, Acceleration, Low-Bit Training
Abstract: Low-bit quantization is a big challenge for neural network training. Conventional training hardware adopts FP32 to accumulate the partial-sum result, which seriously degrades energy efficiency. In this paper, a technology called dynamic group accumulation (DGA) is proposed to reduce the accumulation error. First, we model the proposed group accumulation method and give the optimal DGA algorithm. Second, we design a training architecture and implement a hardware-efficient DGA unit. Third, we make a comprehensive analysis of the DGA algorithm and training architecture. The proposed method is evaluated on CIFAR and ImageNet datasets, and results show that DGA can reduce accumulation bit-width by 6 bits while achieving the same precision as the static group method. With the FP12 DGA, the CNN algorithm only loses 0.11% accuracy in ImageNet training, and our architecture saves 32% of power consumption compared to the FP32 baseline.
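The benefit of grouped accumulation can be seen in a small numerical model of swamping error (an illustrative sketch only; the rounding model, group size, and data below are invented and are not the paper's DGA algorithm or bit-level hardware):

```python
import math

def round_fp(x, mant_bits):
    """Crudely round x to a float with mant_bits mantissa bits."""
    if x == 0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (e - mant_bits)
    return round(x / scale) * scale

def accumulate(values, group_size, mant_bits):
    """Accumulate in groups with a narrow adder, then combine group sums at
    higher precision; group_size == len(values) models flat accumulation."""
    total = 0.0
    for g in range(0, len(values), group_size):
        partial = 0.0
        for v in values[g:g + group_size]:
            partial = round_fp(partial + v, mant_bits)  # narrow accumulator
        total += partial                                # wide combine stage
    return total

# One large value followed by many small ones: a flat low-bit accumulator
# swamps the small addends, while grouping preserves most of them.
vals = [256.0] + [0.5] * 64
exact = sum(vals)                                       # 288.0
err_flat = abs(accumulate(vals, len(vals), 8) - exact)
err_grouped = abs(accumulate(vals, 8, 8) - exact)
assert err_grouped < err_flat
```

Keeping the running partial sum small relative to the addends is exactly what lets a grouped scheme match wider static accumulators at fewer bits.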

Title: An Energy-Efficient Bit-Split-and-Combination Systolic Accelerator for NAS-Based Multi-Precision Convolution Neural Networks
Author: *Liuyao Dai, Quan Cheng, Yuhang Wang, Gengbin Huang, Junzhuo Zhou, Kai Li, Wei Mao, Hao Yu (Southern University of Science and Technology, China)
Page: pp. 448 - 453
Keyword: multi-precision, systolic array, accelerators, CNN, NAS
Abstract: Optimized convolutional neural network (CNN) models and energy-efficient hardware design are of great importance in edge-computing applications. Neural architecture search (NAS) methods are employed for CNN model optimization with multi-precision networks. To satisfy the computation requirements, multi-precision convolution accelerators are highly desired. Existing high-precision-split (HPS) designs reduce the additional logic for reconfiguration but suffer low throughput at low precisions, while low-precision-combination (LPC) designs improve low-precision throughput at a large hardware cost. In this work, a bit-split-and-combination (BSC) systolic accelerator is proposed to overcome these bottlenecks. Firstly, a BSC-based multiply-accumulate (MAC) unit is designed to support multi-precision computation operations. Secondly, a multi-precision systolic dataflow is developed with improved data-reuse and transmission efficiency. The proposed work is designed in Chisel and synthesized in a 28-nm process. The BSC MAC unit achieves up to 2.40× and 1.64× higher energy efficiency than the HPS and LPC units, respectively. Compared with the published accelerator designs Gemmini, Bit-fusion and Bit-serial, the proposed accelerator achieves up to 2.94× higher area efficiency and 6.38× energy savings on the multi-precision VGG-16, ResNet-18 and LeNet-5 benchmarks.

Title: Multi-Precision Deep Neural Network Acceleration on FPGAs
Author: *Negar Neda (University of Tehran, Iran), Salim Ullah (TU Dresden, Germany), Azam Ghanbari, Hoda Mahdiani, Mehdi Modarressi (UT, Iran), Akash Kumar (TU Dresden, Germany)
Page: pp. 454 - 459
Keyword: neural networks, hardware acceleration, FPGA
Abstract: Quantization is a promising approach to reduce the computational load of neural networks. The minimum bit-width that preserves the original accuracy varies significantly across different neural networks and even across different layers of a single neural network. Most existing designs over-provision accelerators with sufficient bit-width to preserve the required accuracy across a wide range of neural networks. In this paper, we present mpDNN, a multi-precision multiplier with dynamically adjustable bit-width for deep neural network acceleration. The design supports splitting an arithmetic operator at run time into multiple independent operators with smaller bit-widths, effectively increasing throughput when lower precision is required. The proposed architecture is designed for FPGAs, in that the multipliers and the bit-width adjustment mechanism are optimized for the LUT-based structure of FPGAs. Experimental results show that by enabling run-time precision adjustment, mpDNN can offer a 3-15x improvement in throughput.
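The arithmetic identity that lets one wide multiplier array also serve narrower operands is the classic bit-split decomposition (a sketch of the underlying identity only; mpDNN's LUT-level realization and its run-time reconfiguration are more involved):

```python
# An 8x8 unsigned multiply decomposed into four 4x4 partial products.
# When only low precision is needed, the same narrow units can instead
# compute several independent small multiplies, which is the source of
# the throughput gain in run-time precision splitting.
def split_mul(a, b, half_bits=4):
    """Compute a*b for unsigned ints via four half-width multiplies."""
    mask = (1 << half_bits) - 1
    a_lo, a_hi = a & mask, a >> half_bits
    b_lo, b_hi = b & mask, b >> half_bits
    return ((a_hi * b_hi) << (2 * half_bits)) \
         + (((a_hi * b_lo) + (a_lo * b_hi)) << half_bits) \
         + (a_lo * b_lo)

# Exhaustive check over all 8-bit operand pairs.
assert all(split_mul(a, b) == a * b for a in range(256) for b in range(256))
```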


Session 6D  Quantum and Reconfigurable Computing
Time: 11:10 - 11:45, Wednesday, January 19, 2022
Location: Room D
Chairs: Michael Miller (University of Victoria, Canada), Shigeru Yamashita (Ritsumeikan University, Japan)

Title: Efficient Preparation of Cyclic Quantum States
Author: *Fereshte Mozafari (EPFL, Switzerland), Yuxiang Yang (ETH, Switzerland), Giovanni De Micheli (EPFL, Switzerland)
Page: pp. 460 - 465
Keyword: Quantum Computing, Quantum Compilation, Quantum State Preparation, Cyclic States
Abstract: Universal quantum algorithms that prepare arbitrary n-qubit quantum states require O(2^n) gate complexity. The complexity can be reduced by considering specific families of quantum states depending on the task at hand. In particular, multipartite quantum states that are invariant under permutations, e.g. Dicke states, have intriguing properties. In this paper, we consider states invariant under cyclic permutations, which we call cyclic states. We present a quantum algorithm that deterministically prepares cyclic states with gate complexity O(n) without requiring any ancillary qubit. Through both analytical and numerical analyses, we show that our algorithm is more efficient than existing ones.

Title: Limiting the Search Space in Optimal Quantum Circuit Mapping
Author: *Lukas Burgholzer, Sarah Schneider, Robert Wille (Johannes Kepler University Linz, Austria)
Page: pp. 466 - 471
Keyword: quantum computing, optimal mapping, search space reduction
Abstract: Executing quantum circuits on currently available quantum computers requires compiling them to a representation that conforms to all restrictions imposed by the targeted architecture. Due to the limited connectivity of the devices' physical qubits, an important step in the compilation process is to map the circuit in such a way that all its gates are executable on the hardware. Existing solutions delivering optimal results for this task are severely challenged by the exponential complexity of the problem. In this paper, we show that the search space of the mapping problem can be limited drastically while still preserving optimality. The proposed strategies are generic, architecture-independent, and can be adapted to various mapping methodologies. The findings are backed by both theoretical considerations and experimental evaluations. Results confirm that, by limiting the search space, optimal solutions can be determined for instances that previously timed out, or speed-ups of up to three orders of magnitude can be achieved.

Title: Efficient Routing in Coarse-Grained Reconfigurable Arrays using Multi-Pole NEM Relays
Author: *Akash Levy, Michael Oduoza, Akhilesh Balasingam, Roger T. Howe, Priyanka Raina (Stanford University, USA)
Page: pp. 472 - 478
Keyword: MEMS, CGRA, reconfigurable architecture, 3D place-and-route, NEMS
Abstract: In this paper, we propose the use of multi-pole nanoelectromechanical (NEM) relays for routing multi-bit signals within a coarse-grained reconfigurable array (CGRA). We describe a CMOS-compatible multi-pole relay design that can be integrated in 3-D and improves area utilization by 40% over a prior design. Additionally, we demonstrate a method for placing multiple contacts on a relay that can reduce contact resistance variation by 40x over a circular placement strategy. We then show a methodology for integrating these relays into an industry-standard digital design flow. Using our multi-pole relay design, we perform post-layout simulation of a processing element (PE) tile within a hybrid CMOS-NEMS CGRA in 40 nm technology. We achieve up to 19% lower area and 10% lower power at iso-delay, compared to a CMOS-only PE tile. The results show a way to bridge the performance gap between programmable logic devices (such as CGRAs) and application-specific integrated circuits using NEMS technology.

Title: Fault Testing and Diagnosis Techniques for Carbon Nanotube-Based FPGAs
Author: *Kangwei Xu, Yuanqing Cheng (Beihang University, China)
Page: pp. 479 - 484
Keyword: CNFET, FPGA, fault testing, process variation, MWCNT
Abstract: As process technology shrinks to the nanometer scale, CMOS-based Field-Programmable Gate Arrays (FPGAs) face big challenges in the scalability of performance and power consumption. The Multi-walled Carbon Nanotube (MWCNT) serves as a promising candidate for replacing Cu interconnects due to its superior conductivity. Moreover, the Carbon Nanotube Field-Effect Transistor (CNFET) emerges as a prospective alternative to the conventional CMOS device because of its higher power efficiency and larger noise margin. However, MWCNT interconnects exhibit significant variations due to an immature fabrication process, leading to delay faults. Furthermore, the non-ideal CNFET fabrication process may generate a few metallic CNTs (m-CNTs), rendering correlated faulty blocks. In this paper, we propose a ring oscillator (RO) based testing technique to detect delay faults caused by the process variations of MWCNT interconnects. In addition, a novel circuit design based on the lookup table (LUT) is applied to speed up the fault testing of CNT-based FPGAs. Finally, we propose a testing algorithm to detect m-CNTs in configurable logic blocks (CLBs). Experimental results show that the test application time for a 6-input LUT can be reduced by 35.49% compared to the conventional testing method, and the proposed algorithm can also achieve high fault coverage with lower testing overheads.


Session 1W  Cadence Training Workshop
Time: 13:30 - 16:30, Wednesday, January 19, 2022
Location: Room W
Chair: Milton Lien (Cadence, Taiwan)

1W-1 (Time: 13:30 - 14:10)
Title: (Training Workshop) (AWR) PA: Loadpull & Matching Synthesis, GaN PA Design with Thermal Analysis
Author: *Milton Lien (Cadence, Taiwan)
Abstract: This session introduces features, capabilities, examples, and other utilities that PA designers are likely to find useful. Simulate or import measured load-pull data and use the built-in measurements to understand the performance impacts of load or source impedances on your device. The Network Synthesis Wizard allows the user to specify goals and components to generate matching network topologies in a matter of minutes. A PA matching network is designed to meet both PAE and output power at a fixed compression-point goal. Understand the electrical characteristics of the metallization of your power amplifier. Analyze 3D planar metal structures with AWR's AXIEM, or use AWR's Analyst to simulate fully arbitrary 3D structures. In addition to the well-known linear stability analysis methods (K, Mu, B1), check out these other methods and tools: Normalized Determinant Function (NDF), stability envelope, Gamma Probe, and the STAN wizard for nonlinear stability analysis using the pole-zero method. The Celsius Thermal Solver uses design data such as layout geometries, material properties, and dissipated-power simulation results from Microwave Office to provide PA designers with thermal heat-map visualization and operating temperature information. RF PA designers have access to critical data impacting performance and reliability concerns and can investigate heat-sinking and packaging strategies to best manage thermal dissipation.

1W-2 (Time: 14:10 - 14:50)
Title: (Training Workshop) (EMX/Virtuoso RF): EM Extraction/Modeling of Passive Components in RFIC / Integrated Flow with RFIC Simulation Platform
Author: *Milton Lien (Cadence, Taiwan)
Abstract: This session examines how EMX, a large-scale full-wave planar 3D electromagnetic simulator for RFIC design, provides a foundry-proven signoff workflow through two kinds of integration within the Virtuoso and Virtuoso RF platforms. Several practical examples will demonstrate the environment setup; EM analysis of an inductor and extraction of its lumped-element model; viewing the mesh and current in ParaView; the EMX black-box flow working together with other active and passive component models in RFIC design; and using EMAG (Electromagnetic Assistant GUI) in Virtuoso RF to extract RF passive component (R, L, C) models by selecting nets, with automatic stackup and boundary setup and port generation with the EMX solver, creating an Extracted View for post-layout simulation.

1W-3 (Time: 14:50 - 15:30)
Title: (Training Workshop) (Sigrity Electro-Thermal PI): Die Model Aware Target Impedance Exploration Using SystemPI
Author: *Eric Chen (Cadence, Taiwan)
Abstract: Power integrity on the PCB and package is closely related to chip-level design, and die current is critical to planning PCB and package components and construction. Therefore, target impedance is quickly becoming the preferred method for IC vendors to specify package- and board-level AC PDN requirements. This session presents a practical method for how target impedance may be determined by IC vendors, including IC PDN transient current waveforms and impedance profiles.

1W-4 (Time: 15:30 - 16:10)
Title: (Training Workshop) (Sigrity SI): How to Design GDDR6 to Improve Performance and Efficiency
Author: *Homer Chang (Cadence, Taiwan)
Abstract: Current-generation, high-bandwidth applications in automotive, datacenter, and 5G require advanced memory interfaces for machine learning, AI, graphics, automated driving, Advanced Driver Assistance Systems (ADAS), and High-Performance Computing (HPC). However, many memory technologies cannot provide sufficient support. The GDDR6 interface can therefore provide significantly higher memory bandwidth with top performance.

Thursday, January 20, 2022


Session 3K  Keynote Session III
Time: 9:00 - 10:00, Thursday, January 20, 2022
Location: Room S
Chair: Wai-Kei Mak (National Tsing Hua University, Taiwan)

Title: (Keynote Address) EDA Opportunities for Future HPC and 3D IC Integration
Author: *Ken Wang (TSMC, Taiwan)
Abstract: The AI era of computing is opening many opportunities for the semiconductor and EDA industries. More chips are built with customized AI accelerators and parallel-processing GPUs. These demands require High-Performance-Computing (HPC) technology to support and enable the implementations under a restricted power budget. Technologies like MIM (Metal-Insulator-Metal), standard-cell design architecture, and metal routing techniques are introduced to meet HPC requirements. Also, more computing logic is being integrated into smaller yet more powerful systems. This integration trend has already arrived with 2.5D chip and package design; now, 3D IC design and fabrication has just begun and will be the future trend. 3D IC design flow examples and EDA challenges are summarized in this presentation.


Session 7A  (SS-4) Reshaping the Future of Physical and Circuit Design, Power and Memory with Machine Learning
Time: 10:00 - 10:35, Thursday, January 20, 2022
Location: Room A
Chair: Taeyoung Kim (Intel, USA)

Title: (Invited Paper) Fast Thermal Analysis for Chiplet Design based on Graph Convolution Networks
Author: *Liang Chen, Wentian Jin, Sheldon Tan (University of California, Riverside, USA)
Page: pp. 485 - 492
Keyword: Chiplet-based systems, thermal analysis, graph neural networks, compact thermal model, principal neighborhood aggregation
Abstract: 2.5D chiplet-based technology promises an efficient integration technique for advanced designs with more functionality and higher performance. Temperature, and the related tasks of thermal optimization and heat removal, are of critical importance for temperature-aware physical synthesis of chiplets. This paper presents a novel graph convolutional network (GCN) architecture to estimate the thermal map of 2.5D chiplet-based systems with the thermal resistance networks built by the compact thermal model (CTM). First, we take the total power of all chiplets as an input feature, which is a global feature. This additional global information can overcome the limitation that the GCN can only extract local information via neighborhood aggregation. Second, inspired by convolutional neural networks (CNNs), we add skip connections into the GCN to pass the global feature directly across the hidden layers with the concatenation operation. Third, to consider the edge embedding feature, we propose an edge-based attention mechanism based on graph attention networks (GAT). Last, with the multiple aggregators and scalers of principal neighborhood aggregation (PNA) networks, we can further improve the modeling capacity of the novel GCN. The experimental results show that the proposed GCN model can achieve an average RMSE of 0.31 K and deliver a 2.6× speedup over the fast steady-state solver of the open-source HotSpot based on SuperLU. More importantly, the GCN model demonstrates useful generalization and transfer capability. Our results show that the trained GCN can be directly applied to predict thermal maps of six unseen datasets with acceptable mean RMSEs of less than 0.67 K without retraining, via inductive learning.
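The global-feature skip connection described above can be sketched as a toy layer (illustrative only, with invented shapes; the actual model additionally uses edge-based attention and PNA-style aggregators):

```python
import numpy as np

# Toy graph-convolution layer with mean aggregation over neighbors, plus a
# skip-connected graph-level feature (e.g. total chiplet power) concatenated
# onto every node embedding so global information bypasses local aggregation.
def gcn_layer_with_global(X, A, W, global_feat):
    """X: (n, d) node features; A: (n, n) adjacency; W: (d, h) weights;
    global_feat: (g,) graph-level feature broadcast to all nodes."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    H = np.maximum(0.0, (A_hat / deg) @ X @ W)     # mean-aggregate + ReLU
    G = np.tile(global_feat, (X.shape[0], 1))      # broadcast global feature
    return np.concatenate([H, G], axis=1)          # skip-connect by concat

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))                    # 4 nodes (thermal grid cells)
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
W = rng.standard_normal((3, 5))
out = gcn_layer_with_global(X, A, W, np.array([2.5]))  # 2.5 = total power, say
assert out.shape == (4, 6)                         # 5 hidden dims + 1 global
```

Because the global column survives every layer unchanged, even deep stacks retain information that pure neighborhood aggregation would dilute.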

Title: (Invited Paper) Design Close to the Edge for Advanced Technology using Machine Learning and Brain-inspired Algorithms
Author: Hussam Amrouch, Florian Klemme, *Paul R. Genssler (University of Stuttgart, Germany)
Page: pp. 493 - 499
Keyword: Machine Learning, Brain-Inspired Computing, Reliability, FinFET, SRAM
Abstract: In advanced technologies, transistor performance is increasingly impacted by different types of degradation. First, variation is inherent to the manufacturing process and is constant over the lifetime. Second, aging effects degrade the transistor over its whole life and can cause failures later on. Both effects impact the underlying electrical properties of which the threshold voltage is the most important. To estimate the degradation-induced changes in the transistor performance for a whole circuit, extensive SPICE simulations have to be performed. However, for large circuits, the computational effort of such simulations can become infeasible very quickly. Furthermore, the SPICE simulations cannot be delegated to circuit designers, since the required underlying transistor models cannot be shared due to their high confidentiality for the foundry. In this paper, we tackle these challenges at multiple levels, ranging from transistor to memory to circuit level. We employ machine learning and brain-inspired algorithms to overcome computational infeasibility and confidentiality problems, paving the way towards design close to the edge.

Title: (Invited Paper) Reinforcement Learning for Electronic Design Automation: Case Studies and Perspectives
Author: Ahmet F. Budak, Zixuan Jiang, Keren Zhu (The University of Texas at Austin, USA), Azalia Mirhoseini, Anna Goldie (Google, USA), *David Z. Pan (The University of Texas at Austin, USA)
Page: pp. 500 - 505
Keyword: EDA, VLSI, CAD, Reinforcement learning, Machine learning
Abstract: Reinforcement learning (RL) algorithms have recently seen rapid advancement and adoption in the field of electronic design automation (EDA) in both academia and industry. In this paper, we first give an overview of RL and its applications in EDA. In particular, we discuss three case studies: chip macro placement, analog transistor sizing, and logic synthesis. In collaboration with Google Brain, we develop a hybrid RL and analytical mixed-size placer and achieve better results with less training time on public and proprietary benchmarks. Working with Intel, we develop an RL-inspired optimizer for analog circuit sizing, combining the strengths of deep neural networks and reinforcement learning to achieve state-of-the-art black-box optimization results. We also apply RL to the popular logic synthesis framework ABC and obtain promising results. Through these case studies, we discuss the advantages, disadvantages, opportunities, and challenges of RL in EDA.

Title: (Invited Paper) Differentially Evolving Memory Ensembles: Pareto Optimization based on Computational Intelligence for Embedded Memories on a System Level
Author: Felix Last (Technical University of Munich, Germany), Ceren Yeni (Intel Germany, Germany), *Ulf Schlichtmann (Technical University of Munich, Germany)
Page: pp. 506 - 512
Keyword: electronic design automation, system-level design, computational intelligence, evolutionary algorithms, PPA
Abstract: As the relative power, performance, and area (PPA) impact of embedded memories continues to grow, proper parameterization of each of the thousands of memories on a chip is essential. When the parameters of all memories of a product are optimized together as part of a single system, better trade-offs may be achieved than if the same memories were optimized in isolation. However, challenges such as a sparse solution space, conflicting objectives, and computationally expensive PPA estimation impede the application of common optimization heuristics. We show how the memory system optimization problem can be solved through computational intelligence. We apply a Pareto-based Differential Evolution to ensure unbiased optimization of multiple PPA objectives. To ensure efficient exploration of a sparse solution space, we repair individuals to yield feasible parameterizations. PPA is estimated efficiently in large batches by pre-trained regression neural networks. Our framework enables the system optimization of thousands of memories while keeping a small resource footprint. Evaluating our method on a tractable system, we find that our method finds diverse solutions which exhibit less than 0.5% distance from known global optima.
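A minimal Pareto-based differential evolution step looks roughly as follows (a generic sketch with invented objectives standing in for PPA terms; the paper's framework adds repair of infeasible parameterizations and neural-network PPA estimators):

```python
import random

def dominates(f, g):
    """Pareto dominance (minimization): f no worse everywhere, better somewhere."""
    return all(a <= b for a, b in zip(f, g)) and any(a < b for a, b in zip(f, g))

def de_step(pop, objectives, F=0.5, CR=0.9):
    """One generation: trial = a + F*(b - c) with binomial crossover, and a
    dominance-based replacement rule instead of single-objective selection."""
    new_pop = []
    for i, x in enumerate(pop):
        a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
        trial = [ai + F * (bi - ci) for ai, bi, ci in zip(a, b, c)]
        trial = [t if random.random() < CR else xi for t, xi in zip(trial, x)]
        new_pop.append(trial if dominates(objectives(trial), objectives(x)) else x)
    return new_pop

# Two conflicting quadratic objectives stand in for competing PPA metrics.
obj = lambda x: (sum(v * v for v in x), sum((v - 1.0) ** 2 for v in x))
random.seed(1)
pop = [[random.uniform(-5.0, 5.0) for _ in range(2)] for _ in range(12)]
best_init = min(obj(x)[0] for x in pop)
for _ in range(50):
    pop = de_step(pop, obj)
best_final = min(obj(x)[0] for x in pop)
assert best_final <= best_init  # dominance-only replacement never regresses
```

Replacing an individual only when the trial Pareto-dominates it is what keeps the optimization unbiased across objectives, rather than collapsing onto one weighted sum.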

[To Session Table]

Session 7B  Advances in Analog Design Methodologies
Time: 10:00 - 10:35, Thursday, January 20, 2022
Location: Room B
Chairs: Markus Olbrich (University of Hannover, Germany), Ian O'Connor (Ecole Centrale de Lyon, France)

TitleTransient Adjoint DAE Sensitivities: a Complete, Rigorous, and Numerically Accurate Formulation
Author*Naomi Sagan, Jaijeet Roychowdhury (University of California, Berkeley, USA)
Pagepp. 513 - 518
KeywordParameter Sensitivities, Circuit Simulation, Adjoint
AbstractAlmost all practical systems rely heavily on physical parameters. As a result, parameter sensitivity, or the extent to which perturbations in parameter values affect the state of a system, is intrinsically connected to system design and optimization. We present TADsens, a method for computing the parameter sensitivities of an output of a differential algebraic equation (DAE) system. Specifically, we provide rigorous, insightful theory for adjoint sensitivity computation of DAEs, along with an efficient and numerically well-posed algorithm implemented in Berkeley MAPP. Our theoretical and implementation advances resolve long-standing issues that have impeded the adoption of adjoint transient sensitivities in circuit simulators for over five decades. We present results and comparisons on two nonlinear analog circuits. TADsens is numerically well-posed and accurate, and faster by a factor of 300 than direct sensitivity computation on a circuit with over 150 unknowns and 600 parameters.

TitleGenerative-Adversarial-Network-Guided Well-Aware Placement for Analog Circuits
Author*Keren Zhu, Hao Chen, Mingjie Liu, Xiyuan Tang, Wei Shi, Nan Sun, David Z. Pan (The University of Texas at Austin, USA)
Pagepp. 519 - 525
KeywordAnalog layout, Machine learning for CAD, placement and routing, well generation
AbstractGenerating wells for transistors is an essential challenge in analog circuit layout synthesis. Although well generation is closely related to analog placement, very little research has explicitly considered it within the placement process. In this work, we propose a new analytical well-aware analog placer. It uses a generative adversarial network (GAN) to generate wells that guide the placement process. A global placement algorithm spreads the modules under the GAN guidance and optimizes for area and wirelength. Well-aware legalization techniques then legalize the global placement results and produce the final placement solutions. By allowing well sharing between transistors and explicitly considering wells in placement, the proposed framework achieves more than 74% improvement in area and more than 26% reduction in half-perimeter wirelength over existing placement methodologies in experimental results.

TitleTAFA: Design Automation of Analog Mixed-Signal FIR Filters Using Time Approximation Architecture
Author*Shiyu Su, Qiaochu Zhang, Juzheng Liu, Mohsen Hassanpourghadi, Rezwan Rasul, Mike Shuo-Wei Chen (University of Southern California, USA)
Pagepp. 526 - 531
KeywordAnalog mixed-signal FIR filter, design automation, time approximation, artificial neural network, regression
AbstractA digital finite impulse response (FIR) filter design is fully synthesizable, thanks to the mature CAD support of digital circuitry. In contrast, analog mixed-signal (AMS) filter design is mostly a manual process, including architecture selection, schematic design, and layout. This work presents a systematic design methodology to automate AMS FIR filter design using a time approximation architecture without any tunable passive component, such as a switched capacitor or resistor. It not only enhances the flexibility of the filter but also facilitates design automation with reduced analog complexity. The proposed design flow features a hybrid approximation scheme that automatically optimizes the filter's impulse response in light of time quantization effects, which yields significant performance improvement with minimal designer effort in the loop. Additionally, a layout-aware regression model based on an artificial neural network (ANN), in combination with a gradient-based search algorithm, is used to automate and expedite the filter design. With the proposed framework, we demonstrate rapid synthesis of AMS FIR filters in a 65nm process from specification to layout.

[To Session Table]

Session 7C  Low-Energy Edge AI Computing
Time: 10:00 - 10:35, Thursday, January 20, 2022
Location: Room C
Chairs: Bing Li (Capital Normal University, China), Yaojun Zhang (Pimchip Technology Co., China)

TitleEfficient Computer Vision on Edge Devices with Pipeline-Parallel Hierarchical Neural Networks
Author*Abhinav Goel, Caleb Tung, Xiao Hu (Purdue University, USA), George K. Thiruvathukal (Loyola University Chicago, USA), James C. Davis, Yung-Hsiang Lu (Purdue University, USA)
Pagepp. 532 - 537
KeywordNeural network, low-power, computer vision, parallel computing, edge devices
AbstractComputer vision on low-power edge devices enables applications including search-and-rescue and security. State-of-the-art computer vision algorithms, such as Deep Neural Networks (DNNs), are too large for inference on low-power edge devices. To improve efficiency, some existing approaches parallelize DNN inference across multiple edge devices. However, these techniques introduce significant communication and synchronization overheads or are unable to balance workloads across devices. This paper demonstrates that the hierarchical DNN architecture is well suited for parallel processing on multiple edge devices. We design a novel method that creates a parallel inference pipeline for computer vision problems that use hierarchical DNNs. The method balances loads across the collaborating devices and reduces communication costs to facilitate the processing of multiple video frames simultaneously with higher throughput. Our experiments consider a representative computer vision problem where image recognition is performed on each video frame, running on multiple Raspberry Pi 4Bs. With four collaborating low-power edge devices, our approach achieves 3.21× higher throughput, 68% less energy consumption per device per frame, and 58% decrease in memory when compared with existing single-device hierarchical DNNs.

TitleEfficient On-Device Incremental Learning by Weight Freezing
Author*Ze-Han Wang, Zhenli He, Hui Fang, Yi-Xiong Huang, Ying Sun, Yu Yang, Zhi-Yuan Zhang, Di Liu (Yunnan University, China)
Pagepp. 538 - 543
Keyworddeep learning, incremental learning
AbstractOn-device learning has become a new trend for edge intelligence systems. In this paper, we investigate the on-device incremental learning problem, which aims to learn new classes on top of a well-trained model on the device. Incremental learning is known to suffer from catastrophic forgetting, i.e., a model learns new classes at the cost of forgetting the old ones. Inspired by model pruning techniques, we propose a new on-device incremental learning method based on weight freezing. Weight freezing in our framework plays two roles: 1) preserving the knowledge of the old classes; 2) accelerating the training procedure. Building on weight freezing, we construct an efficient incremental learning framework that uses knowledge distillation to fine-tune the new model. We conduct extensive experiments on CIFAR100 and compare our method with two existing methods. The experimental results show that our method achieves higher accuracy after incrementally learning new classes.
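The core mechanism, freezing a subset of weights so that updates for new classes cannot overwrite old-class knowledge, can be sketched as a masked gradient step. This is a generic illustration under assumed names, not the paper's training loop.

```python
def sgd_step_with_freezing(weights, grads, frozen, lr=0.1):
    """Apply one SGD update, skipping weights marked as frozen.

    frozen[i] == True preserves weights[i] (protecting old-class knowledge);
    unfrozen weights remain free to learn the new classes. Freezing also
    shrinks the effective set of trainable parameters, speeding up training.
    """
    return [w if f else w - lr * g for w, g, f in zip(weights, grads, frozen)]

w = sgd_step_with_freezing([1.0, 2.0, 3.0], [0.5, 0.5, 0.5],
                           frozen=[True, False, True])
assert w[0] == 1.0 and w[2] == 3.0        # frozen weights untouched
assert abs(w[1] - 1.95) < 1e-9            # only the unfrozen weight moves
```

In a full framework, the frozen set would typically be chosen by an importance criterion (as in pruning), with a distillation loss applied to the unfrozen portion.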

TitleEdgenAI: Distributed Inference with Local Edge Devices and Minimum Latency
AuthorMaedeh Hemmat, *Azadeh Davoodi, Yu Hen Hu (University of Wisconsin-Madison, USA)
Pagepp. 544 - 549
KeywordDistributed inference, Deep neural networks
AbstractWe propose EdgenAI, a framework to decompose a complex deep neural network (DNN) over n available local edge devices with minimal communication overhead and overall latency. Our framework creates small DNNs (SNNs) from an original DNN by partitioning its classes across the edge devices while taking into account their available resources. Class-aware pruning is then applied to aggressively reduce the size of the SNN mapped to each edge device. The SNNs perform inference in parallel and are additionally configured to generate a ‘Don’t Know’ response when identifying an unassigned class. Our experiments show up to a 17X speedup compared to a recent work, on devices with at most 100MB of memory, when distributing a variant of VGG-16 over 20 parallel edge devices, without much loss in accuracy.

TitleLarge Forests and Where to “Partially” Fit Them
Author*Andrea Damiani, Emanuele Del Sozzo, Marco D. Santambrogio (Politecnico di Milano, Italy)
Pagepp. 550 - 555
KeywordDecision Trees, Random Forests, Field-Programmable Gate Arrays, Partial Dynamic Reconfiguration
AbstractThe Artificial Intelligence of Things (AIoT) calls for on-site Machine Learning inference to overcome the instability in latency and availability of networks. Thus, hardware acceleration is paramount for reaching the Cloud’s modeling performance within an embedded device’s resources. In this paper, we propose Entree, the first automatic design flow for deploying the inference of Decision Tree (DT) ensembles on Field-Programmable Gate Arrays (FPGAs) at the network’s edge. It exploits dynamic partial reconfiguration on modern FPGA-enabled Systems-on-a-Chip (SoCs) to accelerate arbitrarily large DT ensembles with data latency a hundred times more stable than software alternatives. Moreover, given Entree’s suitability for both hardware designers and non-hardware-savvy developers, we believe it has the potential to help data scientists develop a non-Cloud-centric AIoT.

[To Session Table]

Session 7D  Emerging Technologies in Embedded Systems and Cyber-Physical Systems
Time: 10:00 - 10:35, Thursday, January 20, 2022
Location: Room D
Chairs: Bei Yu (Chinese University of Hong Kong, Hong Kong), Shiyan Hu (University of Southampton, UK)

TitleAdaSens: Adaptive Environment Monitoring by Coordinating Intermittently-Powered Sensors
AuthorShuyue Lan, *Zhilu Wang, John Mamish, Josiah Hester, Qi Zhu (Northwestern University, USA)
Pagepp. 556 - 561
Keywordcyber-physical systems, intermittent computing, multi-agent adaptation
AbstractPerceiving the environment for better and more efficient situational awareness is essential in applications such as wildlife surveillance, wildfire detection, crop irrigation, and building management. Energy-harvesting, intermittently-powered sensors have emerged as a zero-maintenance solution for long-term environmental perception. However, these devices suffer from an intermittent and varying energy supply, which presents three major challenges for executing perceptual tasks: (1) intelligently scaling computation in light of constrained resources and dynamic energy availability, (2) planning communication and sensing tasks, and (3) coordinating sensor nodes to increase the total perceptual range of the network. We propose an adaptive framework, AdaSens, which adapts the operations of intermittently-powered sensor nodes in a coordinated manner to cover as much of the targeted scene as possible, both spatially and temporally, under interruptions and constrained resources. We evaluate AdaSens on a real-world surveillance video dataset, VideoWeb, and show at least 16% improvement in the coverage of the important frames compared with other methods.

TitleEnergy Harvesting Aware Multi-hop Routing Policy in Distributed IoT System Based on Multi-agent Reinforcement Learning
Author*Wen Zhang (Texas A&M University- Corpus Christi, USA), Tao Liu (Lawrence Technological University, USA), Mimi Xie (University of Texas at San Antonio, USA), Longzhuang Li, Dulal Kar, Chen Pan (Texas A&M University- Corpus Christi, USA)
Pagepp. 562 - 567
KeywordEnergy harvesting, Embedded devices, Multi-hop Routing
AbstractEnergy harvesting technologies offer a promising solution to sustainably power an ever-growing number of Internet of Things (IoT) devices. However, due to the weak and transient nature of energy harvesting, IoT devices have to work intermittently, rendering conventional routing policies and energy allocation strategies impractical. To this end, this paper develops, for the first time, a distributed multi-agent reinforcement learning algorithm known as global actor-critic policy (GAP) to address routing policy and energy allocation together for energy-harvesting-powered IoT systems. At the training stage, each IoT device is treated as an agent and one universal model is trained for all agents to save computing resources. At the inference stage, the packet delivery rate can be maximized. The experimental results show that the proposed GAP algorithm achieves 1.28x and 1.24x the data transmission rate of the Q-table and ESDSRAA algorithms, respectively.

TitleAn Accuracy Reconfigurable Vector Accelerator based on Approximate Logarithmic Multipliers
Author*Lingxiao Hou, Yutaka Masuda, Tohru Ishihara (Nagoya University, Japan)
Pagepp. 568 - 573
KeywordApproximate Computing, Energy Efficient Computing, Low Power Design, Vector Acceleration, Single Instruction Multiple Data
AbstractThe logarithmic approximate multiplier proposed by Mitchell provides an efficient alternative to accurate multipliers in terms of area and power consumption. However, its maximum error of 11.1% makes it difficult to deploy in applications requiring high accuracy. To substantially reduce the error of the Mitchell multiplier, this paper proposes a novel operand decomposition method that decomposes one operand into multiple operands and computes the partial products using multiple Mitchell multipliers. Based on this operand decomposition, this paper also proposes an accuracy-reconfigurable vector accelerator that provides the required computational accuracy with high parallelism. The proposed vector accelerator reduces area by more than half compared with an accurate multiplier array while satisfying the required accuracy for various applications. The experimental results show that our proposed vector accelerator performs well in image processing and robot localization.
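Mitchell's approximation, which underlies this accelerator, is simple enough to state in a few lines: writing each operand as 2^k(1+x) with x in [0,1), it approximates log2(1+x) by x, so (1+x1)(1+x2) becomes 1+x1+x2. The sketch below is a behavioral model of the classic Mitchell multiplier for positive integers (not the paper's decomposed hardware design); the error is zero whenever either operand is a power of two, which is what makes operand decomposition into near-power-of-two pieces effective.

```python
def mitchell_mul(a, b):
    """Mitchell's logarithmic approximate multiply for positive integers.

    With a = 2^k1 * (1+x1) and b = 2^k2 * (1+x2),
    a*b = 2^(k1+k2) * (1+x1)(1+x2) ~ 2^(k1+k2) * (1 + x1 + x2),
    renormalized when x1 + x2 >= 1. Maximum relative error is ~11.1%.
    """
    k1, k2 = a.bit_length() - 1, b.bit_length() - 1
    x1 = a / (1 << k1) - 1.0  # fractional part of the mantissa, in [0, 1)
    x2 = b / (1 << k2) - 1.0
    s = x1 + x2
    if s < 1.0:
        return (1 << (k1 + k2)) * (1.0 + s)
    return (1 << (k1 + k2 + 1)) * s  # carry into the exponent

assert mitchell_mul(64, 57) == 64 * 57          # exact: 64 is a power of two
err = abs(mitchell_mul(93, 57) - 93 * 57) / (93 * 57)
assert err <= 0.111                             # within Mitchell's error bound
```

Decomposing one operand (e.g. 93 = 64 + 29) and summing two Mitchell products trades one multiplier for two but cancels much of the mantissa error, which is the intuition behind the proposed operand decomposition.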

TitleNeural Network Pruning and Fast Training for DRL-based UAV Trajectory Planning
AuthorYilan Li, Haowen Fang, *Mingyang Li, Yue Ma, Qinru Qiu (Syracuse University, USA)
Pagepp. 574 - 579
KeywordStructured neural network compression, GPU acceleration, UAV trajectory planning
AbstractDeep reinforcement learning (DRL) has been applied to the optimal control of autonomous UAV trajectory generation. The energy and payload capacity of small UAVs impose constraints on the complexity and size of the neural network. While model compression has the potential to optimize the trained neural network model for efficient deployment on embedded platforms, pruning a neural network for DRL is more difficult due to the slow convergence of training before and after pruning. In this work, we focus on improving the speed of DRL training and pruning. A new reward function and action exploration are first introduced, resulting in a convergence speedup of 34.14%. A framework that integrates pruning and DRL training is then presented, with an emphasis on how to reduce the training cost. The pruning not only improves the computational performance of inference but also reduces the training effort without compromising the quality of the trajectory. Finally, experimental results are presented. We show that the integrated training and pruning framework removes 67.16% of the weights and improves the trajectory success rate by 1.7%. It achieves a 4.43x reduction in floating-point operations for inference, resulting in a measured 41.85% run time reduction.

[To Session Table]

Session 8A  (DF-4) Empowering AI through Innovative Computing
Time: 10:35 - 11:10, Thursday, January 20, 2022
Location: Room A
Chair: Prof. Chao-Tsung Huang (National Tsing Hua University, Taiwan)

Title(Designers' Forum) Mediatek Dual-Core Deep-Learning Accelerator for Versatile AI Applications
Author*Chih-Chung Cheng (Mediatek Inc., Taiwan)
AbstractIn this discussion, Mediatek will share the design experiences of the deep-learning accelerator (DLA) in a 5G smartphone SoC. The material is mainly based on the presentation at ISSCC 2020. By introducing optimizations that reduce memory bandwidth and computing overhead, the 7nm DLA achieves 3.6 TOPS with energy efficiency ranging from 3.4 to 13.3 TOPS/W.

Title(Designers' Forum) Kneron KL-530 introduction - How we define the next generation of Edge AI Chip
Author*David Yang (Kneron, USA)
KeywordAI Hardware
AbstractAs a startup focused on edge AI chip technology, we constantly think about the best features for the next generation of products. In this session, I will share Kneron’s latest edge AI chip and how we defined it.

Title(Designers' Forum) In-Memory Computing for Future AI Acceleration
Author*Tuo-Hung Hou (National Yang Ming Chiao Tung University, Taiwan)
KeywordAI Hardware
AbstractIn-memory computing (IMC) exploits the highly parallel vector-matrix multiplication in high-density memory arrays with minimal data transfer, and it has been proposed to improve the form factor, cost, and power consumption of future deep learning hardware. However, the requirements for memory devices and circuits differ from those of conventional storage applications. The limited precision and inherent variability pose severe design constraints at the architecture level. Furthermore, improving the energy- and area-expensive peripheral circuits requires innovation. In this talk, I will introduce the fundamentals and current status of in-memory computing. The challenges and potential solutions in computing precision, device/circuit variation, and area/energy efficiency will be addressed. Our approach is showcased using a recent SRAM-based in-memory computing macro that achieves an unprecedentedly high energy efficiency of 20943 TOPS/W, designed for ternary CNN acceleration and keyword spotting applications.

[To Session Table]

Session 8B  Advances in VLSI Routing
Time: 10:35 - 11:10, Thursday, January 20, 2022
Location: Room B
Chairs: Wuxi Li (Xilinx, USA), Yi-Lang Li (National Yang Ming Chiao Tung University, Taiwan)

TitleHigh-Correlation 3D Routability Estimation for Congestion-guided Global Routing
Author*Miaodi Su, Hongzhi Ding, Shaohong Weng, Changzhong Zou (Fuzhou University, China), Zhonghua Zhou (The University of British Columbia, Canada), Yilu Chen (Fuzhou University, China), Jianli Chen (Fudan University, China), Yao-Wen Chang (National Taiwan University, Taiwan)
Pagepp. 580 - 585
KeywordRoutability Estimation, Deep Learning, Guided Routing
AbstractRoutability estimation identifies potentially congested areas in advance to achieve high-quality routing solutions. To improve the routing quality, this paper presents a deep learning-based congestion estimation algorithm that applies the estimation to a global router. Unlike existing methods based on traditional compressed 2D features for model training and prediction, our algorithm extracts appropriate 3D features from the placed netlists. Furthermore, an improved RUDY (Rectangular Uniform wire DensitY) method is developed to estimate 3D routing demands. In addition, we develop a congestion estimator employing a U-net model to generate a congestion heatmap, which is predicted before global routing and serves to guide the initial pattern routing of a global router to reduce unexpected overflows. Experimental results show that the Pearson Correlation Coefficient (PCC) between actual and predicted congestion is high, about 0.848 on average, 21.14% higher than the prior counterpart. The results also show that our guided routing can reduce routing overflows, wirelength, and via count by an average of 6.05%, 0.02%, and 1.18%, respectively, with only 24% runtime overhead, compared with the state-of-the-art CUGR global router, which balances routing quality and efficiency very well.
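The RUDY estimate that this work extends to 3D has a simple form: each net's wirelength estimate (half-perimeter wirelength, HPWL) is spread uniformly over the net's bounding box and accumulated into a demand grid. The sketch below is a minimal 2D version for illustration only; the paper's contribution is an improved, layer-aware (3D) variant.

```python
def rudy_map(nets, grid_w, grid_h):
    """Rectangular Uniform wire DensitY: spread each net's wirelength
    estimate uniformly over its bounding box.

    nets: list of pin lists [(x, y), ...] in integer grid coordinates.
    Returns a grid_h x grid_w routing-demand map.
    """
    demand = [[0.0] * grid_w for _ in range(grid_h)]
    for pins in nets:
        xs, ys = [p[0] for p in pins], [p[1] for p in pins]
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        w, h = x1 - x0 + 1, y1 - y0 + 1
        density = (w + h) / (w * h)  # HPWL spread over the box area
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                demand[y][x] += density
    return demand

# One 2-pin net spanning a 4x2 box: HPWL = 6 distributed over 8 cells.
d = rudy_map([[(0, 0), (3, 1)]], grid_w=4, grid_h=2)
assert abs(sum(map(sum, d)) - 6.0) < 1e-9
```

Because the total demand per net equals its HPWL, summing such maps over all nets gives a fast pre-routing congestion picture that a learned model can then refine.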

TitleSPRoute 2.0: A detailed-routability-driven deterministic parallel global router with soft capacity
Author*Jiayuan He (University of Texas at Austin, USA), Udit Agarwal (Katana Graph, USA), Yihang Yang, Rajit Manohar (Yale University, USA), Keshav Pingali (University of Texas at Austin, USA)
Pagepp. 586 - 591
Keywordglobal routing, detailed routability, parallelization, determinism
AbstractGlobal routing has become more challenging due to advancing technology nodes and the ever-increasing size of chips. Global routing needs to generate routing guides such that (1) the routability of detailed routing is considered and (2) the routing is deterministic and fast. In this paper, we first introduce soft capacity, which reserves routing space for detailed routing based on pin density and Rectangular Uniform wire Density (RUDY). Second, we propose a deterministic parallelization approach that partitions the netlist into batches and then bulk-synchronously maze-routes a single batch of nets. The advantage of this approach is that it guarantees determinism without requiring the nets running in parallel to be disjoint, thus guaranteeing scalability. We then design a scheduler that mitigates the load imbalance and livelock issues in this bulk-synchronous execution model. We implement SPRoute 2.0 with the proposed methodology. The experimental results show that SPRoute 2.0 generates high-quality results with 43% fewer shorts, 14% fewer DRCs, and a 7.4X speedup over a state-of-the-art global router on the ICCAD2019 contest benchmarks.

TitleFPGA-Accelerated Maze Routing Kernel for VLSI Designs
Author*Xun Jiang (Nanjing University, China), Jiarui Wang, Yibo Lin (Peking University, China), Zhongfeng Wang (Nanjing University, China)
Pagepp. 592 - 597
Keywordmaze routing, detailed routing, FPGA acceleration, VLSI designs
AbstractDetailed routing for large-scale integrated circuits (ICs) is time-consuming. It needs to finish the wiring for millions of nets and handle complicated design rules. Due to the heterogeneity of net sizes, the greedy nature of the backbone maze routing, and interdependent workloads, accelerating detailed routing with parallelization is rather challenging. In this paper, we propose an FPGA-based implementation to accelerate the maze routing kernels in a state-of-the-art detailed router. Experimental results demonstrate that batched maze routing achieves a 3.1x speedup on FPGA. Moreover, our design produces deterministic results and incurs less than 1% quality degradation on the ISPD 2018 contest benchmarks.
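The maze routing kernel being accelerated is, at its core, a breadth-first wavefront expansion (the Lee algorithm) over a routing grid. The sketch below is a textbook software version for orientation, not the FPGA design: it returns the shortest path length between two grid cells while avoiding blocked cells.

```python
from collections import deque

def maze_route(grid, src, dst):
    """BFS maze routing (Lee algorithm) on a routing grid.

    grid[y][x] == 1 marks a blocked cell; src and dst are (x, y) tuples.
    Returns the shortest path length in grid steps, or -1 if unreachable.
    """
    h, w = len(grid), len(grid[0])
    dist = {src: 0}
    q = deque([src])
    while q:
        x, y = q.popleft()
        if (x, y) == dst:
            return dist[(x, y)]
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < w and 0 <= ny < h and not grid[ny][nx] \
                    and (nx, ny) not in dist:
                dist[(nx, ny)] = dist[(x, y)] + 1
                q.append((nx, ny))
    return -1

grid = [[0, 0, 0],
        [1, 1, 0],   # a wall forces a detour around the right side
        [0, 0, 0]]
assert maze_route(grid, (0, 0), (0, 2)) == 6
```

The wavefront expansion is memory-bandwidth bound and highly regular, which is precisely what makes it amenable to hardware acceleration and batching across nets.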

[To Session Table]

Session 8C  Machine Learning with Crossbar Memories
Time: 10:35 - 11:10, Thursday, January 20, 2022
Location: Room C
Chairs: Ing-chao Lin (National Cheng Kung University, Taiwan), Yi-Jung Chen (National Chi Nan University, Taiwan)

TitleReliable Memristive Neural Network Accelerators Based on Early Denoising and Sparsity Induction
Author*Anlan Yu, Ning Lyu, Wujie Wen, Zhiyuan Yan (Lehigh University, USA)
Pagepp. 598 - 603
KeywordNeural Network, ReRAM crossbar, DNN accelerator, Robust, Reliable memristive crossbar
AbstractImplementing deep neural networks (DNNs) in hardware is challenging due to the huge memory and computation requirements of DNNs' primary operation---matrix-vector multiplications (MVMs). The memristive crossbar shows great potential to accelerate MVMs by leveraging its capability for in-memory computation. However, one critical obstacle to this technique is the potentially significant inference accuracy degradation caused by two primary sources of errors---variations during computation and stuck-at-faults (SAFs). To overcome this obstacle, we propose a set of dedicated schemes to significantly enhance tolerance to these errors. First, a minimum mean square error (MMSE) based denoising scheme is proposed to diminish the impact of variations during computation in the intermediate layers. To the best of our knowledge, this is the first work to consider denoising in the intermediate layers without extra crossbar resources. Furthermore, MMSE early denoising not only stabilizes the crossbar computation results but also mitigates errors caused by low-resolution analog-to-digital converters. Second, we propose a weights-to-crossbar mapping scheme that inverts bits to mitigate the impact of SAFs. The effectiveness of the proposed bit inversion scheme is analyzed theoretically and demonstrated experimentally. Finally, we propose to use L1 regularization to increase network sparsity, as greater sparsity not only further enhances the effectiveness of the proposed bit inversion scheme but also facilitates other early denoising mechanisms. Experimental results show that our schemes can achieve 40%-78% accuracy improvement for the MNIST and CIFAR10 classification tasks under different networks.
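The bit-inversion idea for SAF tolerance can be illustrated with a small sketch. This is a generic per-word variant under assumed names, not necessarily the paper's exact scheme: given the stuck cells in a row, store either the weight bits or their complement (plus a one-bit flag), whichever disagrees with fewer stuck cells.

```python
def map_with_inversion(word, stuck):
    """Per-word bit-inversion mapping for stuck-at-fault (SAF) tolerance.

    word: list of target bits to store; stuck: dict {position: stuck value}.
    Stores the word either as-is or inverted (recording a one-bit flag),
    whichever mismatches fewer stuck cells.
    """
    def mismatches(bits):
        return sum(1 for i, v in stuck.items() if bits[i] != v)

    inverted = [1 - b for b in word]
    if mismatches(inverted) < mismatches(word):
        return inverted, True   # flag tells the read path to re-invert
    return word, False

# Cells 0 and 1 are stuck at 0; storing [1,1,1,0] directly mismatches both,
# but the inverted word matches every stuck cell exactly.
stored, flag = map_with_inversion([1, 1, 1, 0], stuck={0: 0, 1: 0})
assert flag and stored == [0, 0, 0, 1]
```

This also hints at why sparsity helps: with many zero weights, most stuck-at-0 cells already agree with the stored value, so inversion is needed less often and fixes more when it is.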

TitleBoosting ReRAM-based DNN by Row Activation Oversubscription
AuthorMengyu Guo, Zihan Zhang, Jianfei Jiang, Qin Wang, *Naifeng Jing (Shanghai Jiao Tong University, China)
Pagepp. 604 - 609
KeywordReRAM, Accelerator, Sparse, Prediction, Oversubscription
AbstractIdeally, the ReRAM crossbar is good at matrix-vector multiplication (MVM) for Deep Neural Network (DNN) acceleration, but in practice it suffers from low computing parallelism due to the high analog-to-digital converter (ADC) cost when interpreting analog MVM results. In this study, we propose RAOS (row activation oversubscription), a new crossbar architecture that can dynamically leverage both the sparsity and the small values common in various DNNs to oversubscribe the row activation, so as to increase computing parallelism without stressing the ADC. To learn the dynamics, we propose a prediction unit that finds the upper bound of the results without hurting MVM accuracy, and two prediction schemes that maximize the oversubscription rate for MVM calculation. The proposed RAOS architecture introduces little hardware cost but greatly improves performance while preserving or even reducing the ADC resolution requirement. Evaluation results show that RAOS improves performance by 3.8X and 1.2X compared with state-of-the-art ReRAM accelerator designs that use fixed row activation (ISAAC) and sparsity (SRE), respectively. The total energy can be reduced by 4.9X and 1.7X, respectively.

TitleXBM: A Crossbar Column-wise Binary Mask Learning Method for Efficient Multiple Task Adaption
Author*Fan Zhang, Li Yang, Jian Meng, Yu (Kevin) Cao, Jae-sun Seo, Deliang Fan (Arizona State University, USA)
Pagepp. 610 - 615
KeywordNeural Network, In-Memory Computing, ReRAM, Continual Learning
AbstractRecently, utilizing ReRAM crossbar arrays to accelerate DNN inference on a single task has been widely studied. However, using the crossbar array for multiple-task adaption has not been well explored. In this paper, for the first time, we propose XBM, a novel crossbar column-wise binary mask learning method for multiple-task adaption in ReRAM crossbar DNN accelerators. XBM leverages the mask-based learning algorithm’s ability to avoid catastrophic forgetting by learning a task-specific mask for each new task. With our hardware-aware design innovation, the masking operation required to adapt to a new task can be easily implemented in an existing crossbar-based convolution engine with minimal hardware/memory overhead and, more importantly, no need for power-hungry cell re-programming, unlike prior works. Extensive experimental results show that, compared with state-of-the-art multiple-task adaption methods, XBM keeps similar accuracy on new tasks while requiring only 1.4% of the mask memory size of the popular Piggyback method. Moreover, the elimination of cell re-programming or tuning saves up to 40% energy during new task adaption.

[To Session Table]

Session 8D  High Level Synthesis, CGRA mapping and P&R for hotspot mitigation
Time: 10:35 - 11:10, Thursday, January 20, 2022
Location: Room D
Chairs: Shinya Takamaeda-Yamazaki (The University of Tokyo, Japan), Kazutoshi Wakabayashi (The University of Tokyo, Japan)

TitleCGRA Mapping Using Zero-Suppressed Binary Decision Diagrams
Author*Rami Beidas, Jason H. Anderson (University of Toronto, Canada)
Pagepp. 616 - 622
KeywordCGRA Mapping, Placement and Routing, ZDD
AbstractThe restricted routing networks of coarse-grained reconfigurable arrays (CGRAs) have motivated CAD developers to utilize exact solutions, such as integer linear programming (ILP), in formulating and solving the mapping problem. Such solutions that rely on general purpose optimizers have not been shown to scale. In this work, we formulate CGRA mapping as a solution enumeration and selection problem, relying on the efficiency of zero-suppressed binary decision diagrams (ZDDs) to capture the solution space. For small-to-moderate size problems, it is possible to capture every possible mapping in a few megabytes. For larger problems, thousands if not millions of solutions can be enumerated. The final mapping is a simple linear-time DAG traversal of the enumeration ZDD. The proposed solution was implemented in the CGRA-ME framework. A speedup of two orders of magnitude was obtained when compared with past solutions targeting smaller CGRA devices. Larger devices beyond the capacity of those solutions are now accessible.

TitleImproving the Quality of Hardware Accelerators through automatic Behavioral Input Language Conversion in HLS
Author*Md Imtiaz Rashid, Benjamin Carrion Schaefer (The University of Texas at Dallas, USA)
Pagepp. 623 - 628
KeywordHigh-Level Synthesis, Graph Convolutional Networks, Parsers, QoR
AbstractHigh-Level Synthesis (HLS) is now part of most standard VLSI design flows, and numerous commercial HLS tools are available. One main problem of HLS is that the quality of results still heavily depends on seemingly minor factors such as how the code is written. An additional observation we make in this work is that the input language used with the same HLS tool affects the result of the synthesized circuit. HLS tools (commercial and academic) are built in a modular way, typically including a separate front-end (parser) for each supported input language. These front-ends parse the untimed behavioral descriptions, perform numerous technology-independent optimizations, and output a common intermediate representation (IR) for all supported input languages. These optimizations also heavily depend on the synthesis directives set by the designer. These directives, in the form of pragmas, allow the designer to control how arrays (register or RAM), loops (unrolled, pipelined, or neither), and functions (inlined or not) are synthesized. We have observed that two functionally equivalent behavioral descriptions with the same set of synthesis directives often lead to different circuits from the same HLS tool. Thus, automated tools are needed to help designers generate the best possible circuit independently of the input language used. To address this, we propose using Graph Convolutional Networks (GCNs) to determine the best language for a given new behavioral description and present an automated language converter for HLS.

TitleHotspot Mitigation through Multi-Row Thermal-aware Re-Placement of Logic Cells based on High-Level Synthesis Scheduling
Author*Benjamin Carrion Schaefer (The University of Texas at Dallas, USA)
Pagepp. 629 - 634
KeywordHotspots, High-Level Synthesis, Scheduling
AbstractPlace and route tools consider only area and timing when placing a synthesized netlist. This can lead to a placement with high-density power regions, which in turn lead to hotspots. This work presents a method to re-place logic cells locally, within the hotspot, to reduce the peak temperatures, leveraging the fact that placed rows contain fillers between cells. One key contribution of this work is that it optimizes all of the rows simultaneously instead of performing a traditional row-by-row optimization. This is accomplished by formulating the cell placement problem as a system of difference constraints (SDC). SDC formulations were originally proposed to solve operation scheduling in High-Level Synthesis. In this work we propose two methods that apply this global optimization formulation to the thermal-aware re-placement of logic cells. Experimental results show that our proposed methods yield better results than optimizing every row independently, with average peak temperature reductions of 5.2℃ and 9.5℃ and minimal delay overhead.
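A system of difference constraints (SDC) has the well-known property that constraints of the form x_j - x_i <= c map to a constraint graph solvable by single-source shortest paths. The sketch below shows this classic reduction (Bellman-Ford from a virtual source), purely to illustrate the formulation the abstract relies on; it is not the paper's placer.

```python
def solve_sdc(n, constraints):
    """Solve a system of difference constraints x_j - x_i <= c
    via Bellman-Ford shortest paths from a virtual source node n.

    constraints: list of (i, j, c) triples meaning x_j - x_i <= c,
    over variables x_0 .. x_{n-1}. Returns a feasible assignment,
    or None if the system is infeasible (negative cycle).
    """
    # Constraint x_j - x_i <= c becomes edge i -> j with weight c;
    # the virtual source n reaches every variable with weight 0.
    edges = [(i, j, c) for i, j, c in constraints] + \
            [(n, v, 0) for v in range(n)]
    dist = [0] * (n + 1)
    for _ in range(n):
        for i, j, c in edges:
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
    if any(dist[i] + c < dist[j] for i, j, c in edges):
        return None  # negative cycle: constraints are contradictory
    return dist[:n]

# x1 - x0 <= 3 together with x0 - x1 <= -1 (i.e. x1 >= x0 + 1): feasible.
sol = solve_sdc(2, [(0, 1, 3), (1, 0, -1)])
assert sol is not None and sol[1] - sol[0] <= 3 and sol[0] - sol[1] <= -1
```

In a re-placement setting, the variables would be cell x-coordinates and the constraints would encode minimum spacings within and across rows, which is what lets all rows be optimized in one global system.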

[To Session Table]

Session 9A  (SS-5) Artificial Intelligence on Back-End EDA: Panacea or One-Trick Pony?
Time: 11:10 - 11:45, Thursday, January 20, 2022
Location: Room A
Chair: Yibo Lin (Peking University, China)

Title(Invited Paper) Techniques for CAD Tool Parameter Auto-tuning in Physical Synthesis: A Survey
Author*Hao Geng, Tinghuan Chen, Qi Sun, Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 635 - 640
KeywordPhysical Synthesis, CAD Tool, Parameter Auto-tuning, Survey
AbstractAs the technology node of integrated circuits rapidly moves beyond 5nm, the synthesis-centric modern very-large-scale integration (VLSI) design flow is facing ever-increasing design complexity and mounting time-to-market pressure. Over the past decades, synthesis tools have become progressively sophisticated and now offer countless tunable parameters that can significantly influence design quality. Nevertheless, owing to time-consuming tool evaluation and the limitation of one parameter combination per synthesis run, manually searching for optimal configurations of numerous parameters proves elusive. What’s worse, tiny perturbations to these parameters can result in very large variations in the Quality-of-Results (QoR). Therefore, automatic tool parameter tuning that reduces both human cost and tool evaluation cost is in demand. Machine-learning techniques offer a way to automate the tuning of tool parameters. In this paper, we survey recent progress on advanced parameter auto-tuning flows for physical synthesis tools. We sincerely hope this survey can enlighten the future development of parameter auto-tuning methodologies.

Title(Invited Paper) Application of Deep Learning in Back-End Simulation: Challenges and Opportunities
AuthorYufei Chen (Zhejiang University, China), Haojie Pei (China University of Petroleum, China), Xiao Dong (Zhejiang University, China), Zhou Jin (China University of Petroleum, China), *Cheng Zhuo (Zhejiang University, China)
Pagepp. 641 - 646
KeywordDeep Learning, Back-end Simulation
AbstractRelentless semiconductor scaling and ever-increasing device integration have resulted in exponentially growing back-end designs, which makes back-end simulation very time- and resource-consuming. Given its success in the computer vision community, deep learning appears to be a promising alternative to assist back-end simulation. However, unlike computer vision tasks, most back-end simulation problems are mathematically and physically well defined, e.g., power delivery network sign-off and post-layout circuit simulation. This raises broad interest in the community as to where and how to deploy deep learning in back-end simulation flows. This paper discusses a few challenges that the deployment of deep learning models in back-end simulation has to confront, and the corresponding opportunities for future research.

Title(Invited Paper) EasyMAC: Design Exploration-Enabled Multiplier-Accumulator Generator using a Canonical Architectural Representation
AuthorJiaxi Zhang, Qiuyang Gao, Yijiang Guo, Bizhao Shi, *Guojie Luo (Peking University, China)
Pagepp. 647 - 653
KeywordMultiplier-Accumulator, Design Space Exploration, Canonical Representation
AbstractThe multiplier-accumulator (MAC) is a crucial arithmetic element widely used in digital integrated circuits. Customized MACs are necessary for different scenarios but require great effort owing to the huge architectural design space. In this paper, we develop EasyMAC, a flexible Chisel-based MAC generator with a canonical architectural representation. We design a compact and canonical sequence representation to express the architecture of MACs; the MAC generator takes this compact representation as input and produces the Verilog code. We also present a case study on developing a heuristic design space exploration (DSE) method based on this representation. The experimental results show the effectiveness of the representation in DSE: using the percent relative range of the power-delay-area product as a metric for the optimization opportunities this representation exposes, the relative range is 17.4% and 23.1% for 16x16 and 25x18 MACs, respectively. Finally, we discuss some promising directions for EasyMAC.

[To Session Table]

Session 9B  Side Channel Leakage: Characterization and Protection
Time: 11:10 - 11:45, Thursday, January 20, 2022
Location: Room B
Chairs: Xueyan Wang (Beihang University, China), Bi Wu (Nanjing University of Aeronautics and Astronautics, China)

Best Paper Candidate
TitleDVFSspy: Using Dynamic Voltage and Frequency Scaling As A Covert Channel for Multiple Procedures
Author*Pengfei Qiu, Dongsheng Wang (Department of Computer Science and Technology, Tsinghua University, China), Yongqiang Lyu (Beijing National Research Center for Information Science and Technology, Tsinghua University, China), Gang Qu (Department of Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, USA)
Pagepp. 654 - 659
Keywordcovert channel, dynamic voltage and frequency scaling, frequency variation, message transfer, hardware vulnerability
AbstractDynamic Voltage and Frequency Scaling (DVFS) is a widely deployed low-power technology in modern systems. In this paper, we discover a vulnerability in the implementation of DVFS that allows us to measure the processor's frequency from userspace. By exploiting this vulnerability, we implement a covert channel on a commercial Intel platform and demonstrate that it can reach a throughput of 28.41 bps with an error rate of 0.53%. This work indicates that processor hardware information unintentionally leaked to userspace by privileged kernel modules may pose security risks.
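The general idea behind such a channel can be illustrated with a generic timing probe (our own sketch, not the paper's attack code): the receiver times a fixed workload to obtain a proxy for the current clock frequency, and thresholds successive readings into bits that a sender modulates via DVFS state changes.

```python
# Illustrative sketch of userspace frequency sensing: time a fixed amount of
# work and infer relative clock speed from elapsed wall time. This is a
# generic timing probe, not the implementation from the paper.
import time

def probe(iterations=1_000_000):
    """Return iterations per second of a fixed busy loop
    (a rough proxy for the effective core frequency)."""
    x = 0
    start = time.perf_counter()
    for i in range(iterations):
        x += i & 1
    elapsed = time.perf_counter() - start
    return iterations / elapsed

def decode_bits(samples, threshold):
    """Threshold a sequence of probe readings into a bit stream."""
    return [1 if s >= threshold else 0 for s in samples]
```

A covert sender would alternate between DVFS states (e.g., by varying its load) on a shared schedule, while the receiver samples `probe()` at the same cadence and applies `decode_bits` with a threshold calibrated between the two frequency levels.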

TitleFortify: Analytical Pre-Silicon Side-Channel Characterization of Digital Designs
Author*Lakshmy A V, Chester Rebeiro (Indian Institute of Technology Madras, India), Swarup Bhunia (University of Florida, USA)
Pagepp. 660 - 665
KeywordPower side-channel attacks, Pre-Silicon leakage evaluation, Analytical model
AbstractPower side-channel attacks are potent security threats that exploit the power consumption patterns of an electronic device to glean sensitive information ranging from secret keys and passwords to web-browsing activity. While pre-Silicon tools promise early detection of side-channel leakage at the design stage, they require several hours of simulation time. In this paper, we present an analytical framework called Fortify that estimates the power side-channel vulnerability of digital circuit designs at signal-level granularity, given the RTL or gate-level netlist of the design, at least 100 times faster than contemporary works. We demonstrate the correctness of Fortify by comparing it with a recent simulation-based side-channel leakage analysis framework. We also test its scalability by evaluating Fortify on an open-source System-on-Chip.

TitleData Leakage through Self-Terminated Write Schemes in Memristive Caches
Author*Jonas Krautter, Mahta Mayahinia, Dennis R. E. Gnad, Mehdi B. Tahoori (Karlsruhe Institute of Technology (KIT), Germany)
Pagepp. 666 - 671
Keywordside-channel attack, emerging non-volatile memory, hardware security, resistive memory, timing side-channel
AbstractMemory cells in emerging non-volatile resistive memories often have asymmetric switching properties, where reliable write operations are achieved by setting the write period to a fixed value. To improve their performance and energy efficiency, self-terminating write schemes have been proposed, in which the write signal is stopped after the required state change has been observed. In this work, we show how this data-dependent write latency can be exploited as a side-channel in multiple ways to unveil restricted memory content. Moreover, we discuss and evaluate potential approaches to address the issue.

TitleA Voltage Template Attack on the Modular Polynomial Subtraction in Kyber
Author*Jianan Mu, Yixuan Zhao (Institute of Computing Technology, Chinese Academy of Sciences, China), Zongyue Wang (Open Security Research, China), Jing Ye (Institute of Computing Technology, Chinese Academy of Sciences, China), Junfeng Fan (Open Security Research, China), Shuai Chen (Rock-Solid Security Lab, Fiberhome, China), Huawei Li, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China), Yuan Cao (Hohai University, China)
Pagepp. 672 - 677
Keywordtemplate attack, Kyber, PQC, MLWE
AbstractKyber is one of the four Key Encapsulation Mechanism (KEM) finalists of the NIST Post-Quantum Cryptography (PQC) standardization competition. This paper reveals a vulnerability of Kyber to a voltage template side-channel attack targeting the modular polynomial subtraction operation in Kyber.CCAKEM.Dec. By splicing data under different selected ciphertexts, only a small number of traces are required to recover the secret key. Experiments show that the secret-key recovery accuracy reaches 100% with 330 traces, and remains 98% with only 44 traces.

[To Session Table]

Session 9C  Emerging Non-volatile Memory-based In-Memory Computing
Time: 11:10 - 11:45, Thursday, January 20, 2022
Location: Room C
Chairs: Ken Takeuchi (University of Tokyo, Japan), Tae Hyoung Kim (Nanyang Technological University, Singapore)

TitleFeMIC: Multi-Operands In-Memory Computing Based on FeFETs
Author*Rui Liu (Xiangtan University, China), Xiaoyu Zhang, Xiaoming Chen, Yinhe Han (Institute of Computing Technology, Chinese Academy of Sciences, China), Minghua Tang (Xiangtan University, China)
Pagepp. 678 - 683
Keywordfefet, in-memory computing, multi-operands
AbstractThe “memory wall” bottleneck caused by the performance gap between processors and memories is getting worse. Computing-in-memory (CiM), a promising technology to alleviate this bottleneck, has recently attracted much attention. Conventional CiM architectures based on emerging nonvolatile devices have a major drawback: because they are natively designed for processing two operands, they need N-1 clock cycles to complete a CiM operation with N operands. In this work, we propose a new CiM architecture based on ferroelectric field-effect transistors (FeFETs), named FeMIC, which natively supports the computation of multiple operands. For a CiM operation with N operands, FeMIC needs only N/2 clock cycles. Simulation results based on a calibrated FeFET model reveal that FeMIC significantly reduces energy consumption when processing multi-operand CiM operations, compared with state-of-the-art designs that use conventional CiM mechanisms.

TitleSparsity-Aware Non-Volatile Computing-In-Memory Macro with Analog Switch Array and Low-Resolution Current-Mode ADC
Author*Yuxuan Huang, Yifan He (Tsinghua University, China), Jinshan Yue (Institute of Microelectronics of the Chinese Academy of Sciences, China), Wenyu Sun, Huazhong Yang, Yongpan Liu (Tsinghua University, China)
Pagepp. 684 - 689
Keywordcomputing-in-memory, RRAM, sparsity, analog switch array, current-mode ADC
AbstractNon-volatile computing-in-memory (nvCIM) is a novel architecture for deep neural networks (DNNs) because it reduces data movement between computing units and memory units. While sparsity has made great progress in DNNs, existing nvCIM architectures are optimized mainly for structured sparsity and barely for unstructured sparsity. To solve this problem, a sparsity-aware nvCIM macro is proposed to improve computing performance and network classification accuracy, and to support both structured and unstructured sparsity. First, an analog switch array is used to exploit structured sparsity and improve computing parallelism. Second, a low-resolution current-mode analog-to-digital converter (CM-ADC) is designed to exploit unstructured sparsity. Experimental results show that the peak equivalent energy efficiency of the proposed nvCIM macro is 9.1 TOPS/W (A8W8: 8-bit activations and 8-bit weights) with only 0.51% accuracy loss, and 584.9 TOPS/W (A1W1), which is 4.8-7.5x higher than state-of-the-art nvCIM macros.

TitleSTREAM: Towards READ-based In-Memory Computing for Streaming based Data Processing
Author*Muhammad Rashedul Haq Rashed, Sven Thijssen (University of Central Florida, USA), Sumit Kumar Jha (University of Texas at San Antonio, USA), Fan Yao, Rickard Ewetz (University of Central Florida, USA)
Pagepp. 690 - 695
KeywordIn-memory computing, READ-based computing, Technology mapping
AbstractProcessing in-memory breaks von Neumann design principles to accelerate data-intensive applications. While analog in-memory computing is extremely energy-efficient, its low precision narrows the spectrum of viable applications. In contrast, digital in-memory computing has deterministic precision and can therefore accelerate a broad range of high-assurance applications. Unfortunately, state-of-the-art digital in-memory computing paradigms rely on repeatedly switching the non-volatile memory devices using expensive WRITE operations. In this paper, we propose a framework called STREAM that performs READ-based in-memory computing for streaming-based data processing. The framework consists of a synthesis tool that decomposes high-level programs into in-memory compute kernels executed on non-volatile memory. The paper presents hardware/software co-design techniques to minimize data movement between the different nanoscale crossbars within the platform. The framework is evaluated using circuits from the ISCAS85 benchmark suite and SuiteSparse scientific-computing applications. Compared with WRITE-based digital in-memory computing, READ-based in-memory computing improves average latency and power consumption by up to 139X and 14X, respectively.

[To Session Table]

Session 9D  System Level Design of Learning Systems
Time: 11:10 - 11:45, Thursday, January 20, 2022
Location: Room D
Chairs: Chun-Yi Lee (National Tsing Hua University, Taiwan), Vasily Moshnyaga (Fukuoka University, Japan)

TitleOn the Viability of Decision Trees for Learning Models of Systems
Author*Swantje Plambeck, Lutz Schammer, Görschwin Fey (Hamburg University of Technology, Germany)
Pagepp. 696 - 701
KeywordDecision Trees, Abstract Models, Cyber-Physical Systems, Mealy Machine, Bounded History
AbstractAbstract models of embedded systems are useful for various tasks, ranging from diagnosis through testing to run-time monitoring. However, deriving a model for an unknown system is difficult. Generic learners like decision trees can identify specific properties of systems and have been applied successfully, e.g., for anomaly detection and test case identification. We consider Decision Tree Learning (DTL) to derive a new type of model from given observations with bounded history for systems that have a Mealy machine representation. We prove theoretical limitations and evaluate the practical characteristics in an experimental validation.
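The bounded-history setup described above can be sketched concretely (our own toy illustration, not the paper's learner): each training sample is the last k inputs of an execution, the label is the machine's next output, and a standard decision tree is fit over those windows. A minimal ID3-style tree suffices for the sketch.

```python
# Sketch of learning a bounded-history model of a Mealy machine with a
# decision tree: samples are windows of the last k inputs, labels are the
# next output. Tiny ID3-style learner; illustrative only.
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    def gain(f):  # information gain of splitting on feature index f
        parts = {}
        for r, y in zip(rows, labels):
            parts.setdefault(r[f], []).append(y)
        return entropy(labels) - sum(
            len(ys) / len(labels) * entropy(ys) for ys in parts.values())
    best = max(features, key=gain)
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[best], ([], []))
        groups[r[best]][0].append(r)
        groups[r[best]][1].append(y)
    rest = [f for f in features if f != best]
    return {"feature": best,
            "children": {v: build_tree(rs, ys, rest)
                         for v, (rs, ys) in groups.items()}}

def predict(tree, row, default=None):
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], default)
    return tree

# History length k=2; the (hypothetical) machine outputs True exactly when
# the last two inputs are equal.
windows = [("a", "a"), ("a", "b"), ("b", "a"), ("b", "b")]
outputs = [x == y for x, y in windows]
tree = build_tree(windows, outputs, [0, 1])
```

The paper's theoretical limitations concern exactly this construction: some Mealy machines need an unbounded history, so no fixed k lets a tree of this form recover the output function.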

Best Paper Candidate
TitleThis is SPATEM! A Spatial-Temporal Optimization Framework for Efficient Inference on ReRAM-based CNN Accelerator
Author*Yen-Ting Tsou (National Taiwan University, Taiwan), Kuan-Hsun Chen (University of Twente, Netherlands), Chia-Lin Yang (National Taiwan University, Taiwan), Hsiang-Yun Cheng (Academia Sinica, Taiwan), Jian-Jia Chen (Technische Universität Dortmund, Germany), Der-Yu Tsai (National Taiwan University, Taiwan)
Pagepp. 702 - 707
Keywordcomputing-in-memory, inference latency optimization, convolutional neural network
AbstractResistive-memory-based computing-in-memory (CIM) has been considered a promising solution to accelerate convolutional neural network (CNN) inference, storing the weights in crossbar memory arrays and performing in-situ matrix-vector multiplications (MVMs) in an analog manner. Several techniques assume that a whole crossbar can operate concurrently and discuss how to efficiently map the weights onto crossbar arrays. In practice, however, the accumulated effect of per-cell current deviation and analog-to-digital converter overhead may greatly degrade inference accuracy, which motivates the concept of the Operation Unit (OU), whereby an operation in a crossbar involves only a limited number of wordlines and bitlines per cycle to preserve satisfactory inference accuracy. With OU-based operations, the weight mapping and the scheduling strategy for parallelizing CNN convolution operations should take communication overhead and resource utilization into consideration to optimize inference acceleration. In this work, we propose the first such optimization framework, named SPATEM, which efficiently executes MVMs with OU-based operations on ReRAM-based CIM accelerators. It decouples the design space into tractable steps, models the expected inference latency, and derives an optimized spatial-temporal-aware scheduling strategy. Compared with the state of the art, experimental results show that the scheduling strategy derived by SPATEM achieves a 29.24% inference latency reduction on average by utilizing 3.19x more crossbar cells with 31.28% less communication overhead.
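The cost of the OU constraint is easy to see with a back-of-envelope cycle count (our illustration; SPATEM itself models this jointly with communication overhead): a crossbar MVM must be tiled into OU-sized sub-operations, one per cycle.

```python
# Back-of-envelope cycle count for OU-based crossbar MVM. The crossbar and
# OU dimensions below are illustrative, not taken from the paper.
import math

def ou_cycles(rows, cols, ou_rows, ou_cols):
    """Cycles to cover a rows x cols crossbar when one Operation Unit of
    ou_rows wordlines x ou_cols bitlines fires per cycle."""
    return math.ceil(rows / ou_rows) * math.ceil(cols / ou_cols)

# Whole-array mode would need 1 cycle; a 9x8 OU on a 128x128 crossbar
# needs ceil(128/9) * ceil(128/8) = 15 * 16 = 240 cycles.
full = ou_cycles(128, 128, 128, 128)
ou = ou_cycles(128, 128, 9, 8)
```

This two-orders-of-magnitude gap is why the scheduling strategy for distributing OU operations across crossbars, which SPATEM optimizes, dominates inference latency.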

TitleHACScale: Hardware-Aware Compound Scaling for Resource-Efficient DNNs
Author*Hao Kong (HP-NTU Digital Manufacturing Corporate Lab, Nanyang Technological University/School of Computer Science and Engineering, Nanyang Technological University, Singapore), Di Liu (HP-NTU Digital Manufacturing Corporate Lab, Nanyang Technological University, Singapore), Xiangzhong Luo, Weichen Liu (School of Computer Science and Engineering, Nanyang Technological University, Singapore), Ravi Subramaniam (Innovations and Experiences – Business Personal Systems, HP Inc., USA)
Pagepp. 708 - 713
KeywordModel scaling, Deep learning, Hardware-aware, Hardware performance
AbstractModel scaling is an effective way to improve the accuracy of deep neural networks (DNNs) by increasing model capacity. However, existing approaches seldom consider the underlying hardware, causing inefficient utilization of hardware resources and consequently high inference latency. In this paper, we propose HACScale, a hardware-aware model scaling strategy that fully exploits hardware resources for higher accuracy. In HACScale, different dimensions of a DNN are jointly scaled in consideration of their contributions to hardware utilization and accuracy. To improve the efficiency of width scaling, we introduce importance-aware width scaling, which computes the importance of each layer to the accuracy and scales each layer accordingly to optimize the trade-off between accuracy and model parameters. Experiments show that HACScale improves hardware utilization by 1.92× on ImageNet; as a result, it achieves a 2.41% accuracy improvement with a negligible latency increase of 0.6%. On CIFAR-10, HACScale improves accuracy by 2.23% with only 6.5% latency growth.

TitlePearl: Towards Optimization of DNN-accelerators Via Closed-Form Analytical Representation
Author*Arko Dutt, Suprojit Nandy, Mohamed M Sabry (NTU Singapore, Singapore)
Pagepp. 714 - 719
KeywordDeep Neural Networks, Domain-specific accelerator, Design-space exploration, Analytical formulation, Validation
AbstractHardware accelerators for deep learning are proliferating, owing to their high-speed and energy-efficient execution of deep neural network (DNN) workloads. Ensuring an efficient DNN accelerator design requires a vast design-space exploration over a large number of parameters. However, current exploration frameworks are limited by slow architectural simulations, which restrict the number of design points that can be examined. To address this challenge, in this paper we introduce Pearl, an analytical representation of DNN inference mapped to an accelerator. Pearl provides immediate estimates of the performance and energy of DNN accelerators, incorporating new parameters to capture dataflow mapping schemes beneficial for DNN systems. We model equations that represent the utilization rates of the compute fabric for different dataflow mappings. We validate the accuracy of our equations against a state-of-the-art cycle-accurate DNN hardware simulator. Results show that Pearl achieves <1.0% and <1.3% average error in performance and energy estimates, respectively, while achieving a simulation speedup of more than 1.2×10^7. Pearl shows higher average accuracy than existing analytical models that support the simulator. We also leverage Pearl to explore and optimize area-constrained DNN accelerators targeting inference at full-HD resolution.
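A closed-form model in the spirit Pearl describes can be sketched in a few lines (our own illustration with a simplified output-stationary mapping, not Pearl's actual equations): tile the output matrix over the processing-element array and count cycles, so latency and utilization fall out of arithmetic instead of simulation.

```python
# Closed-form performance estimate for an MxK times KxN matmul mapped to a
# PxP output-stationary PE array. Illustrative analytical model only; Pearl's
# equations additionally cover energy and multiple dataflow mappings.
import math

def matmul_cycles(M, N, K, P):
    """Each PxP output tile takes K cycles; tiles run sequentially."""
    return math.ceil(M / P) * math.ceil(N / P) * K

def utilization(M, N, K, P):
    """Fraction of PE-cycles doing useful MACs (edge tiles leave PEs idle)."""
    ideal = M * N * K / (P * P)  # cycles if the array were always full
    return ideal / matmul_cycles(M, N, K, P)

# A 100x100x100 matmul on a 16x16 array: 7 * 7 * 100 = 4900 cycles,
# with utilization well below 1 because 100 is not a multiple of 16.
cycles = matmul_cycles(100, 100, 100, 16)
util = utilization(100, 100, 100, 16)
```

Evaluating such formulas takes microseconds per design point, which is exactly what makes the >10^7 speedup over cycle-accurate simulation, and hence exhaustive design-space sweeps, possible.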