The 28th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".   Time zone is JST (=UTC+9:00)

Session Schedule

Monday, January 16, 2023

Room Saturn | Room Uranus | Room Venus | Room Mars/Mercury
T1  Tutorial-1: Optimization Problems for Design Automation of Microfluidic Biochips: Scope of Machine Learning
9:30 - 12:30
T2  Tutorial-2: Cryogenic Memory Technologies: A Device-to-System Perspective
9:30 - 12:30
T3  Tutorial-3: Quantum Annealing for EDA and Its Hands-on Training
9:30 - 12:30
T4  Tutorial-4: The Evolution of Functional Verification: SystemVerilog, UVM, and Portable Stimulus
9:30 - 12:30
T5  Tutorial-5: Design Methods and Computing Paradigms based on Flexible Inorganic Printed Electronics
14:00 - 17:00
T6  Tutorial-6: HW/SW Codesign for Reliable In-Memory Computing on Unreliable Technologies: Journey from Beyond-CMOS to Beyond-von Neumann
14:00 - 17:00

T7  Tutorial-7: Agile Hardware and Software Co-Design
14:00 - 17:00

Tuesday, January 17, 2023

Room Saturn | Room Uranus | Room Venus | Room Mars/Mercury | Miraikan Hall
1K  Opening and Keynote Session I (Miraikan Hall)
8:30 - 10:00
Coffee Break
10:00 - 10:20
1A  Reliability Considerations for Emerging Computing and Memory Architectures
10:20 - 11:35
1B  Accelerators and Equivalence Checking
10:20 - 11:35
1C  New Frontiers in Cyber-Physical and Autonomous Systems
10:20 - 11:35
1D  Machine Learning Assisted Optimization Techniques for Analog Circuits
10:20 - 11:35

Lunch Break
11:35 - 13:00
2A  (SS-1) Machine Learning for Reliable, Secure, and Cool Chips: A Journey from Transistors to Systems
13:00 - 14:40
2B  High Performance Memory for Storage and Computing
13:00 - 14:40
2C  Cool and Efficient Approximation
13:00 - 14:40
2D  Logic Synthesis for AQFP, Quantum Logic, AI driven and efficient Data Layout for HBM
13:00 - 14:40
2E  University Design Contest
13:00 - 14:40
Coffee Break
14:40 - 15:00
3A  Synthesis of Quantum Circuits and Systems
15:00 - 17:05
3B  In-Memory/Near-Memory Computing for Neural Networks
15:00 - 17:05
3C  IEEE CEDA Sponsored Technical Session: EDA for New VLSI Revolutions
15:00 - 17:30
3D  Machine Learning-Based Design Automation
15:00 - 17:05

Wednesday, January 18, 2023

Room Saturn | Room Uranus | Room Venus | Room Mars/Mercury | Miraikan Hall
2K  Keynote II (Miraikan Hall)
9:00 - 10:00
Coffee Break
10:00 - 10:20
4A  Advanced Techniques for Yields, Low Power and Reliability
10:20 - 11:35
4B  Microarchitectural Design and Neural Networks
10:20 - 11:35
4C  Novel Techniques for Scheduling and Memory Optimizations in Embedded Software
10:20 - 11:35
4D  Efficient Circuit Simulation and Synthesis for Analog Designs
10:20 - 11:35

Lunch Break
11:35 - 13:00
5A  (SS-2) Security of Heterogeneous Systems Containing FPGAs
13:00 - 14:40
5B  Novel Application & Architecture-Specific Quantization Techniques
13:00 - 14:40
5C  Approximate Brain-Inspired Architectures for Efficient Learning
13:00 - 14:40
5D  Retrospect and Prospect of Verification and Test Technologies
13:00 - 14:40
5E  DF Keynote / (DF-1) Next-Generation Computing
13:00 - 14:40
Coffee Break
14:40 - 15:00
6A  (SS-3) Computing, Erasing, and Protecting: the Security Challenges for the Next Generation of Memories
15:00 - 16:15
6B  System-Level Codesign in DNN Accelerators
15:00 - 16:40
6C  New Advances in Hardware Trojan Detection
15:00 - 16:40
6D  Advances in Physical Design and Timing Analysis
15:00 - 17:05
6E  (DF-2) Advanced Sensor Technologies and Application
15:00 - 16:15

Thursday, January 19, 2023

Room Saturn | Room Uranus | Room Venus | Room Mars/Mercury | Miraikan Hall
3K  Keynote III (Miraikan Hall)
9:00 - 10:00
Coffee Break
10:00 - 10:20
7A  (SS-4) Brain-inspired Hyperdimensional Computing to the Rescue for beyond von Neumann Era
10:20 - 11:35
7B  System Level Design Space Exploration
10:20 - 11:35
7C  Security Assurance and Acceleration
10:20 - 11:35
7D  (SS-5) Hardware and Software Co-design of Emerging Machine Learning Algorithms
10:20 - 11:35

Lunch Break
11:35 - 13:00
8A  (SS-6) Full-Stack Co-design for On-Chip Learning in AI Systems
13:00 - 14:15
8B  Energy-Efficient Computing for Emerging Applications
13:00 - 14:40
8C  Side-Channel Attacks and RISC-V Security
13:00 - 14:40
8D  Simulation and Verification of Quantum Circuits
13:00 - 14:40
8E  (DF-3) Edge AI Design
13:00 - 14:15
Coffee Break
14:40 - 15:00
9A  (SS-7) Learning x Security in DFM
15:00 - 16:40
9B  Lightweight Models for Edge AI
15:00 - 16:40

9D  Design Automation for Emerging Devices
15:00 - 16:40
9E  (DF-4) Panel Discussion: Aiming Direction of DX System Design from Hardware to Application
15:00 - 16:15

DF: Designers' Forum, SS: Special Session

List of papers

Monday, January 16, 2023

Session T1  Tutorial-1: Optimization Problems for Design Automation of Microfluidic Biochips: Scope of Machine Learning
Time: 9:30 - 12:30, Monday, January 16, 2023
Location: Room Saturn

Title(Tutorial) Optimization Problems for Design Automation of Microfluidic Biochips: Scope of Machine Learning
AuthorSudip Roy (Indian Institute of Technology Roorkee, India), Shigeru Yamashita (Ritsumeikan University, Japan), Debraj Kundu (Indian Institute of Technology Roorkee, India)
AbstractA microfluidic biochip serves as an integrated mini laboratory, where multiple fluidic operations are performed simultaneously for detection and testing purposes of sample fluids. In contrast to traditional macro systems, biochips provide exceptional resource, cost, and time savings. The device can be used for rapidly screening several biological analytes for a wide range of applications, including disease diagnosis and detection of hazardous biological agents in a system. The global market for biochips, which was previously anticipated to be worth around $13.1 billion in 2020, is now expected to expand to US$25.4 billion by 2026 at a CAGR of 11.7%. Moreover, after a detailed analysis of the financial effects of the COVID-19 pandemic and the economic crisis it caused, the growth of the Lab-on-a-Chip segments is expected to be a revised 12.3% CAGR for the following seven years. To cope with such high demands for biochips, we need to increase the rate of development of microfluidic biochips in parallel. However, the rate of development of such biochips is mainly dependent on the advancements in the corresponding fabrication technologies and design-automation research. This tutorial will cover the analysis of various optimization problems along with the state-of-the-art algorithms, which are developed to solve the challenges for automation of fluidic operations on various kinds of microfluidic biochips. For different biochip platforms like digital and flow-based microfluidic biochips, unique design automation challenges exist. In this tutorial, we treat design automation techniques for various types of digital and flow-based microfluidic biochips including DMFBs (Digital Microfluidic Biochips), FMBs (Flow-based Microfluidic Biochips), MEDAs (Micro-Electrode-Dot-Arrays), PMDs (Programmable Microfluidic Devices) and RMFs (Random Microfluidic Biochips). 
In the first part of this tutorial, we present the algorithmic motivations and approaches for design automation and sample preparation for all kinds of microfluidic biochips namely DMFBs, FMBs, MEDA-DMFBs, PMDs and RMFs. Sample preparation is an inherent step of many bio-protocols, so an efficient algorithmic approach to solve such a problem is highly desirable. In the second part of the tutorial, the low-level synthesis and optimization problems for PMD based biochips will be discussed. The low-level synthesis for PMD includes loading, washing, and fluid assignment problems, which are unique for PMD based biochips. We will discuss some optimization techniques for loading, washing, and fluid-to-cell assignment in PMDs. In the last part of the tutorial, we will focus on the design automation problems of low-level synthesis of MEDA biochips that includes the methodologies for placement of on-chip fluidic modules and routing of fluids. Due to the intractable nature of these problems, both heuristics and optimization-based methodologies will be discussed. In this tutorial, we will discuss the existing techniques based on machine learning (ML) and reinforcement learning (RL) for various design automation problems of different types of microfluidic biochips.

Session T2  Tutorial-2: Cryogenic Memory Technologies: A Device-to-System Perspective
Time: 9:30 - 12:30, Monday, January 16, 2023
Location: Room Uranus

Title(Tutorial) Cryogenic Memory Technologies: A Device-to-System Perspective
AuthorAhmedullah Aziz (University of Tennessee Knoxville, USA)
AbstractCryogenic (Cryo) memory technologies have been rapidly garnering interest in recent years due to their immense promise as potential enablers for multiple exciting technology platforms, including quantum computing, high performance computing (HPC), and space electronics. The use of ultra-cold (~millikelvin) superconducting (SC) qubits is customary in most of the cutting-edge quantum computing systems in existence. The quantum core is accompanied by two other crucial components: a classical control processor and a memory block. Currently, these classical components are kept at room temperature and are interfaced with the quantum substrate through low-density dissipative interconnects. The large thermal gradient resulting therefrom adds extra noise to this sensitive system, which already strives to suppress such interference. To realize a practical quantum computing system (comprising thousands of qubits), it is necessary to keep all relevant components (qubits, control processor, interconnects, and the memory block) in a cryogenic environment. Even with the advent of the quantum computing era, ultra-fast and energy-efficient classical computing systems are still in high demand. One of the classical platforms that can achieve this dream combination is superconducting single flux quantum (SFQ) electronics. Interestingly, the capabilities of SFQ processors may not be fully leveraged without pairing them with a suitable cryo memory block. Cryogenic memory is also critically important for space-based applications. A multitude of technologies have already been explored to find suitable candidates for cryogenic data storage. This tutorial provides a comprehensive overview of the existing and emerging variants of cryogenic memory technologies. To ensure an organized discussion, the family of cryogenic memory platforms is categorized into three types: superconducting, non-superconducting, and hybrid.
This tutorial covers the device-level key concepts all the way to a system-level overview. The discussion also includes the challenges associated with these technologies and their unique prospects.

Session T3  Tutorial-3: Quantum Annealing for EDA and Its Hands-on Training
Time: 9:30 - 12:30, Monday, January 16, 2023
Location: Room Venus

Title(Tutorial) Quantum Annealing for EDA and Its Hands-on Training
AuthorTakuji Hiraoka (Fixstars Amplify, Japan), Koji Mizumatsu (Fixstars, Japan), Takahisa Todoroki (Fixstars Amplify, Japan), Yukihide Kohira (University of Aizu, Japan)
AbstractAt present, many quantum annealing machines and Ising machines have been proposed, and several machines have also been put into practical use. However, the software development environments and operation methods of these machines differ widely, which makes it difficult for researchers and engineers to use them. Fixstars Amplify has been developed as a cloud platform to facilitate the development and execution of algorithms for solving combinatorial optimization problems on commercially available quantum annealing machines, Ising machines, mathematical optimization solvers, and gate-based quantum computers. This session will introduce an overview of its features, several use cases in companies, and examples of academic research conducted using Fixstars Amplify. In the second half of the session, a hands-on session will show how Fixstars Amplify can be used, taking a combinatorial optimization problem from the manufacturing industry as an example.

Session T4  Tutorial-4: The Evolution of Functional Verification: SystemVerilog, UVM, and Portable Stimulus
Time: 9:30 - 12:30, Monday, January 16, 2023
Location: Room Mars/Mercury

Title(Tutorial) The Evolution of Functional Verification: SystemVerilog, UVM, and Portable Stimulus
AuthorTom Fitzpatrick (Siemens Digital Industries Software)
AbstractWe all know that the size and complexity of electronic chips and systems have increased dramatically over the past decade or so. Unfortunately, as designs get more complex the difficulty of verifying their functionality increases at an even more dramatic rate. To have any hope of ensuring that designs function as they're intended, the functional verification landscape has undergone a corresponding evolution to include more powerful algorithms, techniques, and tools. Test writers have moved from signal-level directed tests to fully-automated self-checking test environments that can execute orders of magnitude more tests to exercise and verify the vastly growing functionality of today's (and tomorrow's) designs. This technical tutorial will take the audience on a guided tour through the most important advances in functional verification in the last twenty years. Beginning with the standardization of constrained-random stimulus, functional coverage, and other key features in SystemVerilog, we will investigate the advantages of including automation as part of a verification environment and the accompanying techniques required to maximize productivity by abstracting verification to the level of transactions. This will lead to a discussion of the Universal Verification Methodology, the first standard to allow and encourage the development of modular, reusable, transaction-level verification components. We will explore all aspects of the UVM and see the advantages it provides in separating the creation of transaction-level tests written as UVM sequences from the specification of a flexible, reusable component-based test environment. 
We will then examine the limitations of UVM for developing a software-driven SoC verification environment and introduce the new Portable Test and Stimulus Standard (PSS) that allows the creation of automated constrained-random scenario-level tests that can be used to generate test implementations targeted to both UVM environments and to C-based tests to run on SoC platforms in simulation, emulation, or silicon.

Session T5  Tutorial-5: Design Methods and Computing Paradigms based on Flexible Inorganic Printed Electronics
Time: 14:00 - 17:00, Monday, January 16, 2023
Location: Room Saturn

Title(Tutorial) Design Methods and Computing Paradigms based on Flexible Inorganic Printed Electronics
AuthorMehdi B. Tahoori (Karlsruhe Institute of Technology, Germany)
AbstractFlexible electronics is an emerging and fast-growing field which can be used in many demanding and emerging application domains such as wearables, smart sensors, and Internet of Things (IoT). Unlike the traditional computing and electronics domain, which is mostly driven by performance characteristics, flexible electronics based on additive manufacturing processes are mainly associated with low fabrication costs (as they are used even in the consumer market) and low energy consumption (as they could be used in energy-harvested systems). Printed electronics offer certain technological advantages over their silicon-based counterparts, such as mechanical flexibility, low process temperatures, and maskless and additive manufacturing possibilities. However, it is essential that the printed devices operate at low supply voltages. Electrolyte gated transistors (EGTs) using solution-processed inorganic materials, which are fully printed using inkjet printers at low temperatures, are very promising to provide such solutions. In this tutorial, I discuss the technology, process, modeling, fabrication, design (automation), computing paradigms and security aspects of circuits based on additive printed technologies.

Session T6  Tutorial-6: HW/SW Codesign for Reliable In-Memory Computing on Unreliable Technologies: Journey from Beyond-CMOS to Beyond-von Neumann
Time: 14:00 - 17:00, Monday, January 16, 2023
Location: Room Uranus

Title(Tutorial) HW/SW Codesign for Reliable In-Memory Computing on Unreliable Technologies: Journey from Beyond-CMOS to Beyond-von Neumann
AuthorHussam Amrouch (University of Stuttgart, Germany)
AbstractBreakthroughs in deep learning continuously fuel innovations that profoundly improve our daily life. However, DNNs largely overwhelm conventional computing systems because they are severely bottlenecked by the data movement between processing units and memory. As a result, novel and intelligent computing systems are becoming increasingly necessary to enhance or even replace the current von Neumann architecture, which has remained unchanged for decades. This tutorial provides a comprehensive overview of the major shortcomings of modern architectures and the ever-increasing necessity for novel designs that fundamentally reduce memory latency and energy by enabling data processing inside the memory itself. The tutorial will also discuss in detail the great promise of recent emerging beyond-CMOS devices like the Ferroelectric Field-Effect Transistor (FeFET). It will bridge the gap between the latest innovations in the underlying technology and the recent breakthroughs in computer architectures. It will demonstrate how HW/SW codesign is key to realizing efficient and reliable in-memory computing.

Session T7  Tutorial-7: Agile Hardware and Software Co-Design
Time: 14:00 - 17:00, Monday, January 16, 2023
Location: Room Mars/Mercury

Title(Tutorial) Agile Hardware and Software Co-Design
AuthorYun Eric Liang (Peking University, China), Cheng Zhuo (Zhejiang University, China), Wei Zhang (Hong Kong University of Science and Technology, Hong Kong)
AbstractAs Moore's law approaches its end, designing specialized hardware along with the software that maps applications onto it is a promising solution. The hardware design determines the peak performance, while the software is equally important as it determines the actual performance. Hardware/software (HW/SW) co-design can optimize the hardware and software in concert and improve overall performance. However, the current flow designs hardware and software in isolation. More importantly, both hardware and software are difficult to design and optimize due to low-level programming and a huge design space.

Tuesday, January 17, 2023

Session 1K  Opening and Keynote Session I
Time: 8:30 - 10:00, Tuesday, January 17, 2023
Location: Miraikan Hall
Chair: Shinji Kimura (Waseda University, Japan)

TitleASP-DAC 2023 Opening

Title(Keynote Address) More Moore, More than Moore, More People
AuthorTadahiro Kuroda (University of Tokyo, Japan)
AbstractThe sustainable development of a data-driven society requires solving the energy crisis created by the proliferation of semiconductors (by developing green technology). Therefore, the semiconductor industry is entering a new era where it shifts focus from general-purpose chips with high production efficiency to specialized chips with high energy efficiency. While the general-purpose chip is a competition of device manufacturing (capital), the specialized chip is a competition of design (knowledge). As a result, innovation in the new era requires the democratization of semiconductors, in other words, the ability of more people to develop specialized chips. For that reason, an agile development platform is required that can develop a specialized chip using a combination of automated design and semi-custom manufacturing at 1/10 the time and cost. By following the 80-Point Score Principle (targeting good enough but not perfect score) and iterating development and improvement in a high-speed cycle, it is possible for many more people to rapidly develop a system that highly integrates hardware and software. By adding More People to More Moore and More than Moore of nanotechnology, democratization will drive the creation of an intellectual society.

Session 1A  Reliability Considerations for Emerging Computing and Memory Architectures
Time: 10:20 - 11:35, Tuesday, January 17, 2023
Location: Room Saturn
Chairs: Anupam Chattopadhyay (Nanyang Technological University, Singapore), Wei Zhang (The Hong Kong University of Science and Technology)

1A-1 (Time: 10:20 - 10:45) (In-person)
TitleA Fast Semi-Analytical Approach for Transient Electromigration Analysis of Interconnect Trees using Matrix Exponential
Author*Pavlos Stoikos, George Floros, Dimitrios Garyfallou, Nestor Evmorfopoulos, George Stamoulis (University of Thessaly, Greece)
Pagepp. 1 - 6
KeywordElectromigration, Korhonen's PDE, Matrix exponential method, Krylov subspace, Transient analysis
AbstractAs integrated circuit technologies are moving to smaller technology nodes, Electromigration (EM) has become one of the most challenging problems facing the EDA industry. While numerical approaches have been widely deployed since they can handle complicated interconnect structures, they tend to be much slower than analytical approaches. In this paper, we present a fast semi-analytical approach, based on the matrix exponential, for the solution of Korhonen’s stress equation at discrete spatial points of interconnect trees, which enables the analytical calculation of EM stress at any time and point independently. The proposed approach is combined with the extended Krylov subspace method to accurately simulate large EM models and accelerate the calculation of the final solution. Experimental evaluation on OpenROAD benchmarks demonstrates that our method achieves 0.5% average relative error over the COMSOL industrial tool while being up to three orders of magnitude faster.

1A-2 (Time: 10:45 - 11:10) (In-person)
TitleChiplet Placement for 2.5D IC with Sequence Pair Based Tree and Thermal Consideration
Author*Hong-Wen Chiou, Jia-Hao Jiang, Yu-Teng Chang, Yu-Min Lee, Chi-Wen Pan (National Yang Ming Chiao Tung University, Taiwan)
Pagepp. 7 - 12
Keyword2.5D IC, chiplet placement, sequence pair, thermal
AbstractWe develop an efficient thermal-aware chiplet placer for 2.5D ICs. Combining a sequence-pair based tree, a branch-and-bound method, and advanced placement techniques, the placer finds solutions quickly with better half-perimeter wirelength (HPWL). Additionally, with a post-placement procedure, the placer reduces maximum temperatures while HPWL increases only slightly. Experimental results show that the placer not only finds better HPWL (-1.03%) but also runs up to two orders of magnitude faster than the prior art. With thermal consideration, the placer reduces the maximum temperature by up to 8.214 °C with an average HPWL increase of 5.376%.
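
For readers unfamiliar with the HPWL metric quoted in this abstract, the following is a minimal sketch of how half-perimeter wirelength is commonly computed: for each net, take the bounding box of its pins and add the box width and height. The chiplet names, coordinates, and nets are hypothetical illustration data, and the sketch is not taken from the paper's placer.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hpwl(nets: List[List[str]], positions: Dict[str, Point]) -> float:
    """Sum, over all nets, of the half perimeter of each net's bounding box."""
    total = 0.0
    for net in nets:
        xs = [positions[name][0] for name in net]
        ys = [positions[name][1] for name in net]
        # Half perimeter = bounding-box width + bounding-box height.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Hypothetical chiplet centre coordinates and nets, for illustration only.
positions = {"cpu": (0.0, 0.0), "gpu": (4.0, 3.0), "hbm": (1.0, 5.0)}
nets = [["cpu", "gpu"], ["cpu", "gpu", "hbm"]]

print(hpwl(nets, positions))  # 7.0 for the first net + 9.0 for the second = 16.0
```

A placer that reports "-1.03% HPWL" is reporting a relative change in this total between two placements of the same netlist.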

1A-3 (Time: 11:10 - 11:35) (In-person)
TitleAn On-line Aging Detection and Tolerance Framework for Improving Reliability of STT-MRAMs
Author*Yu-Guang Chen, Po-Yeh Huang, Jin-Fu Li (National Central University, Taiwan)
Pagepp. 13 - 18
KeywordSTT-MRAM, Aging Detection and Tolerance, TDDB
AbstractSpin-transfer-torque magnetic random-access memory (STT-MRAM) is one of the most promising emerging memories for on-chip memory. However, the magnetic tunnel junction (MTJ) in the STT-MRAM suffers from several reliability threats which degrade the endurance, create defects, and cause memory failure. One of the primary reliability issues comes from time-dependent dielectric breakdown (TDDB) of the MTJ, which causes the MTJ resistance to drift over time and may lead to read errors. To overcome this challenge, we present an on-line aging detection and tolerance framework that dynamically monitors electrical parameter deviations and provides appropriate compensation to avoid read errors. The on-line aging detection mechanism identifies aged words by monitoring the read current, and the aging tolerance mechanism then adjusts the reference resistance of the sensing amplifier to compensate for the aging-induced resistance drop of the MTJ. In comparison with existing testing-based aging detection techniques, our mechanism operates on-line with read operations, performing aging detection and tolerance simultaneously with negligible performance overhead. Simulation and analysis results show that the proposed techniques can successfully detect 99% of aged words under process variation and improve the reliability of STT-MRAMs by up to 25%.

Session 1B  Accelerators and Equivalence Checking
Time: 10:20 - 11:35, Tuesday, January 17, 2023
Location: Room Uranus
Chair: Sri Parameswaran (UNSW)

1B-1 (Time: 10:20 - 10:45) (In-person)
TitleAutomated Equivalence Checking Method for Majority based In-Memory Computing on ReRAM Crossbars
Author*Arighna Deb (School of Electronics Engineering, KIIT DU, India), Kamalika Datta, Muhammad Hassan, Saeideh Shirinzadeh (German Research Centre for Artificial Intelligence (DFKI), Germany), Rolf Drechsler (Group of Computer Architecture, University of Bremen, Germany)
Pagepp. 19 - 25
KeywordReRAM, Verification, Boolean Satisfiability (SAT), ReRAM crossbar, In-Memory Computing
AbstractRecent progress in the fabrication of Resistive Random Access Memory (ReRAM) devices has paved the way for large-scale crossbar structures. In particular, in-memory computing on ReRAM crossbars helps in bridging the processor-memory speed gap for current CMOS technology. To this end, synthesis and mapping of Boolean functions to such crossbars have been investigated by researchers. However, the verification of even simple designs on crossbars is still done through manual inspection, sometimes complemented by simulation-based techniques. This is an important problem, as real-world designs are complex and have a larger number of inputs, so manual inspection and simulation-based methods are not practical for them. In this paper, to the best of our knowledge for the first time, we propose an automated equivalence checking methodology for majority-based in-memory designs on ReRAM crossbars. Our contributions are twofold: first, we introduce an intermediate data structure called the ReRAM Sequence Graph (ReSG) to represent the logic-in-memory design. Second, this graph is translated into Boolean Satisfiability (SAT) formulas, which are verified against the golden functional specification using the Z3 Satisfiability Modulo Theories (SMT) solver. We validate the proposed method by running widely available benchmarks.
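
As a rough illustration of what checking a majority-based design against a golden specification means, the sketch below compares a three-input majority gate with the carry-out of a full adder. It deliberately swaps the SAT/SMT query for a brute-force search over all input assignments, which only works for tiny functions; the function names are hypothetical, and neither the ReSG data structure nor the Z3 encoding from the paper is reproduced here.

```python
from itertools import product
from typing import Callable, Optional, Tuple


def maj(a: bool, b: bool, c: bool) -> bool:
    """Three-input majority, the primitive assumed for the in-memory design."""
    return (a and b) or (b and c) or (a and c)


def golden_carry(a: bool, b: bool, c: bool) -> bool:
    """Golden specification: carry-out of a one-bit full adder."""
    return int(a) + int(b) + int(c) >= 2


def find_counterexample(
    impl: Callable[..., bool], spec: Callable[..., bool], num_inputs: int
) -> Optional[Tuple[bool, ...]]:
    """Return an input on which impl and spec differ, or None if equivalent.

    A SAT-based checker asks the same question symbolically: is there any
    assignment that makes the XOR of the two outputs true?
    """
    for bits in product([False, True], repeat=num_inputs):
        if bool(impl(*bits)) != bool(spec(*bits)):
            return bits
    return None


print(find_counterexample(maj, golden_carry, 3))                   # None: equivalent
print(find_counterexample(lambda a, b, c: a and b, golden_carry, 3))  # a differing input
```

The exhaustive loop grows as 2^n in the number of inputs, which is the scaling problem that motivates handing the same question to a solver.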

1B-2 (Time: 10:45 - 11:10) (Online)
TitleAn Equivalence Checking Framework for Agile Hardware Design
Author*Yanzhao Wang, Fei Xie (Portland State University, USA), Zhenkun Yang, Pasquale Cocchini, Jin Yang (Intel Labs, USA)
Pagepp. 26 - 32
KeywordAgile Hardware Design, Equivalence Checking, Formal Verification, Symbolic Execution, Halide
AbstractAgile hardware design enables designers to produce new design iterations efficiently. Equivalence checking is critical in ensuring that a new design iteration conforms to its specification. In this paper, we introduce an equivalence checking framework for hardware designs represented in HalideIR. HalideIR is a popular intermediate representation in software domains such as deep learning and image processing, and it is increasingly utilized in agile hardware design. We have developed a fully automatic equivalence checking workflow seamlessly integrated with HalideIR and several optimizations that leverage the incremental nature of agile hardware design to scale equivalence checking. Evaluations of two deep learning accelerator designs show our automatic equivalence checking framework scales to hardware designs of practical sizes and detects inconsistencies that manually crafted tests have missed.

1B-3 (Time: 11:10 - 11:35) (Online)
TitleTowards High-Bandwidth-Utilization SpMV on FPGAs via Partial Vector Duplication
Author*Bowen Liu, Dajiang Liu (Chongqing University, China)
Pagepp. 33 - 38
KeywordSpMV, FPGA, Vector Duplication, Bandwidth Utilization
AbstractSparse matrix-vector multiplication (SpMV) is widely used in many fields and usually dominates the execution time of a task. With large off-chip memory bandwidth, customizable on-chip resources and high-performance floating-point operations, the FPGA is a potential platform to accelerate SpMV tasks. However, because SpMV is memory-intensive and its compressed data formats usually introduce irregular memory access, implementing an SpMV accelerator on FPGA that achieves high bandwidth utilization (BU) is challenging. Existing works either eliminate irregular memory access at the cost of increased data redundancy or try to locally reduce the port conflicts introduced by irregular memory access, leading to a limited BU improvement. To this end, this paper proposes a high-bandwidth-utilization SpMV accelerator on FPGAs using partial vector duplication, in which a read-conflict-free vector buffer, a write-conflict-free adder tree, and ping-pong-like accumulator registers are carefully designed. The FPGA implementation results show that the proposed design achieves an average 1.10x performance speedup compared to the state-of-the-art work.
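
To make the "irregular memory access" mentioned in this abstract concrete, here is a minimal software sketch of SpMV over the compressed sparse row (CSR) format. CSR is an assumption chosen for illustration, since the abstract only refers to compressed formats in general, and the sketch says nothing about the paper's FPGA architecture.

```python
from typing import List


def spmv_csr(
    values: List[float], col_idx: List[int], row_ptr: List[int], x: List[float]
) -> List[float]:
    """Compute y = A @ x where A is stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # The index into x depends on the data (col_idx[k]), so reads of x
            # jump around unpredictably: this is the irregular access.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Illustrative 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]

print(spmv_csr(values, col_idx, row_ptr, x))  # [5.0, 6.0, 19.0]
```

When several such multiply-accumulate lanes run in parallel, two lanes may request different entries of x from the same memory port in the same cycle, which is the kind of port conflict the abstract describes.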

Session 1C  New Frontiers in Cyber-Physical and Autonomous Systems
Time: 10:20 - 11:35, Tuesday, January 17, 2023
Location: Room Venus
Chairs: Tsung-Yi Ho (National Tsing Hua University, Taiwan), Ningshi Yao (George Mason University, USA)

Best Paper Candidate
1C-1 (Time: 10:20 - 10:45) (Online)
TitleSafety-driven Interactive Planning for Neural Network-based Lane Changing
Author*Xiangguo Liu, Ruochen Jiao (Northwestern University, USA), Bowen Zheng, Dave Liang (Pony.ai Inc., USA), Qi Zhu (Northwestern University, USA)
Pagepp. 39 - 45
Keywordautonomous driving, neural networks, human-robot interaction
AbstractNeural network-based planners have shown great promise in improving task performance of autonomous driving. However, it is critical and yet very challenging to ensure the safety of systems with neural network-based components, especially in dense and highly interactive traffic environments. In this work, we propose a safety-driven interactive planning framework for neural network-based lane changing. To prevent over-conservative planning, we identify the driving behavior of surrounding vehicles and assess their aggressiveness, and then adapt the planned trajectory for the ego vehicle accordingly in an interactive manner. The ego vehicle can proceed to change lanes if a safe evasion trajectory exists even in the predicted worst case; otherwise, it can stay around the current lateral position or return to the original lane. We quantitatively demonstrate the effectiveness of our planner design and its advantage over baseline methods through extensive simulations with diverse and comprehensive experimental settings, as well as in real-world scenarios collected by Pony.ai.

1C-2 (Time: 10:45 - 11:10) (Online)
TitleSafety-Aware Flexible Schedule Synthesis for Cyber-Physical Systems using Weakly-Hard Constraints
Author*Shengjie Xu, Bineet Ghosh, Clara Hobbs (University of North Carolina at Chapel Hill, USA), P. S. Thiagarajan (Chennai Mathematical Institute, India/University of North Carolina at Chapel Hill, USA), Samarjit Chakraborty (University of North Carolina at Chapel Hill, USA)
Pagepp. 46 - 51
Keywordcontrol, scheduling, real-time systems, safety, weakly-hard systems
AbstractWith the emergence of complex autonomous systems, multiple control tasks are increasingly being implemented on shared computational platforms. Due to the resource-constrained nature of such platforms in domains such as automotive, scheduling all the control tasks in a timely manner is often difficult. The usual requirement --- that all task invocations must meet their deadlines --- stems from the isolated design of a control strategy and its implementation (including scheduling) in software. This separation of concerns, where the control designer sets the deadlines, and the embedded software engineer aims to meet them, eases the design and verification process. However, it is not flexible and is overly conservative. In this paper, we show how to capture the deadline miss patterns under which the safety properties of the controllers will still be satisfied. The allowed patterns of such deadline misses may be captured using what are referred to as "weakly-hard constraints." But scheduling tasks under these weakly-hard constraints is non-trivial since common scheduling policies like fixed-priority or earliest deadline first do not satisfy them in general. The main contribution of this paper is to automatically synthesize schedules from the safety properties of controllers. Using real examples, we demonstrate the effectiveness of this strategy and illustrate that traditional notions of schedulability, e.g., utility ratios, are not applicable when scheduling controllers to satisfy safety properties.
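
One common way to state a weakly-hard constraint is "at most m deadline misses in any window of K consecutive invocations". The sketch below checks a recorded hit/miss trace against that form. This particular form and the function name are assumptions for illustration; the abstract does not say which family of constraints the paper uses, and the sketch does not perform schedule synthesis.

```python
from typing import Sequence


def satisfies_weakly_hard(hits: Sequence[bool], m: int, k: int) -> bool:
    """True if every window of k consecutive invocations has at most m misses.

    hits[i] is True when invocation i met its deadline, False when it missed.
    Traces shorter than k are treated as a single window.
    """
    misses = [0 if met else 1 for met in hits]
    window = sum(misses[:k])
    if window > m:
        return False
    # Slide the window one invocation at a time, updating the miss count.
    for i in range(k, len(misses)):
        window += misses[i] - misses[i - k]
        if window > m:
            return False
    return True


# Isolated misses are tolerated under "at most 1 miss in any 3 invocations".
print(satisfies_weakly_hard([True, False, True, True, False, True], m=1, k=3))  # True
# Two consecutive misses violate the same constraint.
print(satisfies_weakly_hard([True, False, False, True], m=1, k=3))  # False
```

A trace can pass this check even though not every deadline is met, which is why a schedule that fails a classical schedulability test may still be acceptable for the controller.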

1C-3 (Time: 11:10 - 11:35) (In-person)
TitleMixed-Traffic Intersection Management Utilizing Connected and Autonomous Vehicles as Traffic Regulators
Author*Pin-Chun Chen (National Taiwan University, Taiwan), Xiangguo Liu (Northwestern University, USA), Chung-Wei Lin (National Taiwan University, Taiwan), Chao Huang (University of Liverpool, UK), Qi Zhu (Northwestern University, USA)
Pagepp. 52 - 57
KeywordConnected and Autonomous Vehicles, Intersection Management, Mixed Traffic
AbstractConnected and autonomous vehicles (CAVs) can realize many revolutionary applications, but mixed traffic including both CAVs and human-driven vehicles (HVs) is expected to persist for decades. In this paper, we target the problem of mixed-traffic intersection management and schedule CAVs to control the subsequent HVs. We develop a dynamic programming approach and a mixed integer linear programming (MILP) formulation to optimally solve the problems with the corresponding intersection models. We then propose an MILP-based approach which is more efficient and real-time-applicable than solving the optimal MILP formulation, while keeping good solution quality as well as outperforming the first-come-first-served (FCFS) approach. Experimental results and SUMO simulation indicate that controlling CAVs by our approaches is effective in regulating mixed traffic even if the CAV penetration rate is low, which provides an incentive for early adoption of CAVs.

[To Session Table]

Session 1D  Machine Learning Assisted Optimization Techniques for Analog Circuits
Time: 10:20 - 11:35, Tuesday, January 17, 2023
Location: Room Mars/Mercury
Chairs: Ricardo Martins (Instituto de Telecomunicações, IST-University of Lisbon, Portugal), Hung-Ming Chen (National Yang Ming Chiao Tung University, Taiwan)

Best Paper Candidate
1D-1 (Time: 10:20 - 10:45) (In-person)
TitleFully Automated Machine Learning Model Development for Analog Placement Quality Prediction
Author*Chen-Chia Chang, Jingyu Pan (Duke University, USA), Zhiyao Xie (Hong Kong University of Science and Technology, Hong Kong), Yaguang Li, Yishuang Lin, Jiang Hu (Texas A&M University, USA), Yiran Chen (Duke University, USA)
Pagepp. 58 - 63
KeywordAnalog, Machine Learning
AbstractAnalog integrated circuit (IC) placement is a heavily manual and time-consuming task that has a significant impact on chip quality. Several recent studies apply machine learning (ML) techniques to directly predict the impact of placement on circuit performance or even guide the placement process. However, the significant diversity in analog design topologies can lead to different impacts on performance metrics (e.g., common-mode rejection ratio (CMRR) or offset voltage). Thus, it is unlikely that the same ML model structure will achieve the best performance for all designs and metrics. In addition, customizing ML models for different designs requires even greater engineering effort and longer development cycles. In this work, we leverage Neural Architecture Search (NAS) to automatically develop customized neural architectures for different analog circuit designs and metrics. Our proposed NAS methodology supports an unconstrained DAG-based search space containing a wide range of ML operations and topological connections. Our search strategy can efficiently explore this flexible search space and provide every design with the best-customized model to boost the model performance. We make unprejudiced comparisons with the claimed performance of the previous representative work on exactly the same dataset. After fully automated development within only 0.5 days, the generated models achieve 3.61% higher accuracy than the prior art.

1D-2 (Time: 10:45 - 11:10) (In-person)
TitleEfficient Hierarchical mm-Wave System Synthesis with Embedded Accurate Transformer and Balun Machine Learning Models
Author*Fabio Passos (Instituto de Telecomunicacoes, Portugal), Nuno Lourenco (Instituto de Telecomunicacoes and Universidad de Evora, Portugal), Luis Mendes (Instituto de Telecomunicacoes and Politecnico de Leiria, Portugal), Ricardo Martins (Instituto de Telecomunicacoes, Portugal), Joao Vaz, Nuno Horta (Instituto de Telecomunicacoes and Instituto Superior Tecnico, Universidade de Lisboa, Portugal)
Pagepp. 64 - 69
Keywordautomated design, bottom-up methodologies, machine learning, transformers, mm-Wave circuits
AbstractIntegrated circuit design in millimeter-wave (mm-Wave) bands is exceptionally complex and dependent on costly electromagnetic (EM) simulations. Therefore, in the past few years, a growing interest has emerged in developing novel optimization-based methodologies for the automatic design of mm-Wave circuits. However, current approaches lack scalability when the circuit/system complexity increases. Besides, many also depend on EM simulators, which degrade their efficiency. This work resorts to hierarchical system partitioning and bottom-up design approaches, where a precise machine learning model – composed of hundreds of seamlessly integrated sub-models that guarantee high accuracy (validated against EM simulations and measurements) up to 200GHz – is embedded to design passive components, e.g., transformers and baluns. The model generates optimal design surfaces to be fed to the hierarchical levels above or acts as a performance estimator. With the proposed scheme, it is possible to remove the dependency on EM simulations during optimization. The proposed mixed-optimal-surface, performance estimator, and simulation-based bottom-up multiobjective optimization (MOO) are used to fully design a Ka-band mm-Wave transmitter from the device up to the system level in 65-nm CMOS for state-of-the-art specifications.

1D-3 (Time: 11:10 - 11:35) (In-person)
TitleAPOSTLE: Asynchronously Parallel Optimization for Sizing Analog Transistors using DNN Learning
Author*Ahmet Faruk Budak (The University of Texas at Austin, USA), David Smart, Brian Swahn (Analog Devices Inc., USA), David Pan (The University of Texas at Austin, USA)
Pagepp. 70 - 75
Keywordanalog, sizing, parallel, automation, deep learning
AbstractAnalog circuit sizing is a high-cost process in terms of the manual effort invested and the computation time spent. With rapidly developing technology and high market demand, bringing automated solutions for sizing has attracted great attention. This paper presents APOSTLE, an asynchronously parallel optimization method for sizing analog transistors using DNN learning. This work introduces several methods to minimize the wall-clock time of optimization when the sizing task consists of several different simulations with varying time costs. The key contributions of this paper are: (1) a batch optimization framework, (2) a novel deep neural network architecture for exploring design points when the existing solutions are not always fully evaluated, (3) a ranking approximation method based on cheap evaluations and (4) a theoretical approach to balance between the cheap and the expensive simulations to maximize the optimization efficiency. Our method shows high wall-clock efficiency compared to other black-box optimization methods both on small building blocks and on large industrial circuits while reaching similar or better performance metrics.

[To Session Table]

Session 2A  (SS-1) Machine Learning for Reliable, Secure, and Cool Chips: A Journey from Transistors to Systems
Time: 13:00 - 14:40, Tuesday, January 17, 2023
Location: Room Saturn
Chair: Hussam Amrouch (University of Stuttgart, Germany)

2A-1 (Time: 13:00 - 13:25) (In-person)
Title(Invited Paper) ML to the Rescue: Reliability Estimation from Self-Heating and Aging in Transistors all the Way up Processors
Author*Hussam Amrouch, Florian Klemme (University of Stuttgart, Germany)
Pagepp. 76 - 82
Keywordcircuit reliability, transistor self-heating, transistor aging, machine learning, library characterization
AbstractWith increasingly confined 3D structures and newly-adopted materials of higher thermal resistance, transistor self-heating has risen to a critical reliability threat in state-of-the-art and emerging process nodes. One of the challenges of transistor self-heating is accelerated transistor aging, which leads to earlier failure of the chip if not considered appropriately. Nevertheless, adequate consideration of accelerated aging effects, induced by self-heating, throughout a large circuit design is profoundly challenging due to the large gap between where self-heating does originate (i.e., at the transistor level) and where its ultimate effect occurs (i.e., at the circuit and system levels). In this work, we demonstrate an end-to-end workflow starting from self-heating and aging effects in individual transistors all the way up to large circuits and processor designs. We demonstrate that with our accurately estimated degradations, the required timing guardband to ensure reliable operation of circuits is considerably reduced by up to 96% compared to otherwise worst-case estimations that are conventionally employed.

2A-2 (Time: 13:25 - 13:50) (In-person)
Title(Invited Paper) Graph Neural Networks: A Powerful and Versatile Tool for Advancing Design, Reliability, and Security of ICs
Author*Lilas Alrahis, Johann Knechtel, Ozgur Sinanoglu (New York University Abu Dhabi, United Arab Emirates)
Pagepp. 83 - 90
KeywordGNN, EDA, hardware security, hardware reliability, Survey
AbstractGraph neural networks (GNNs) have pushed the state-of-the-art (SOTA) for performance in learning and predicting on large-scale data present in social networks, biology, etc. Since integrated circuits (ICs) can naturally be represented as graphs, there has been a tremendous surge in employing GNNs for machine learning (ML)-based methods for various aspects of IC design. Given this trajectory, there is a timely need to review and discuss some powerful and versatile GNN approaches for advancing IC design. In this paper, we propose a generic pipeline for tailoring GNN models toward solving challenging problems for IC design. We outline promising options for each pipeline element, and we discuss selected and promising works, like leveraging GNNs to break SOTA logic obfuscation. Our comprehensive overview of GNN frameworks covers (i) electronic design automation (EDA) and IC design in general, (ii) design of reliable ICs, and (iii) design as well as analysis of secure ICs. We provide our overview and related resources also in the GNN4IC hub at https://github.com/DfX-NYUAD/GNN4IC. Finally, we discuss interesting open problems for future research.

2A-3 (Time: 13:50 - 14:15) (Online)
Title(Invited Paper) Detection and Classification of Malicious Bitstreams for FPGAs in Cloud Computing
AuthorJayeeta Chaudhuri, *Krishnendu Chakrabarty (Duke University, USA)
Pagepp. 91 - 97
KeywordRing oscillators, Convolutional neural networks, FPGA security, Multi-tenant FPGA, Denial of service
AbstractAs FPGAs are increasingly shared and remotely accessed by multiple users and third parties, they introduce significant security concerns. Modules running on an FPGA may include circuits that induce voltage-based fault attacks and denial-of-service (DoS). An attacker might configure some regions of the FPGA with bitstreams that implement malicious circuits. Attackers can also perform side-channel analysis and fault attacks to extract secret information (e.g., secret key of an AES encryption). In this paper, we present a convolutional neural network (CNN)-based defense to detect bitstreams of RO-based malicious circuits by analyzing the static features extracted from FPGA bitstreams. We further explore the criticality of RO-based circuits in order to detect malicious Trojans that are configured on the FPGA. Evaluation on Xilinx FPGAs demonstrates the effectiveness of the security solutions.

2A-4 (Time: 14:15 - 14:40) (Online)
Title(Invited Paper) Learning Based Spatial Power Characterization and Full-Chip Power Estimation for Commercial TPUs
Author*Jincong Lu, Jinwei Zhang, Wentian Jin, Sachin Sachdeva, Sheldon X.-D. Tan (University of California, Riverside, USA)
Pagepp. 98 - 103
KeywordPower Estimation, Processor Power Maps, Machine Learning
AbstractIn this paper, we propose a novel approach for the real-time estimation of chip-level spatial power maps for commercial TPU chips based on a machine-learning technique for the first time. The new method can enable the development of more robust runtime power and thermal control schemes to take advantage of spatial power information such as hot spots that are otherwise not available. Different from the existing commercial multi-core processors in which real-time performance-related utilization information is available, the TPU from Google does not have such information. To mitigate this, we propose using features related to the workloads of running different deep neural networks (DNN) such as the hyperparameters of DNN and TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured from an external infrared thermal imaging camera under the nominal working conditions of a chip. To build the dynamic power density map model, we apply generative adversarial networks (GAN) based on workload-related features. Our study shows that the estimated total powers match the manufacturer's total power measurements extremely well. Experimental results further show that the predictions of power maps are quite accurate.

[To Session Table]

Session 2B  High Performance Memory for Storage and Computing
Time: 13:00 - 14:40, Tuesday, January 17, 2023
Location: Room Uranus
Chairs: Lei Yang (George Mason University, USA), Qiao Li (Xiamen University, China)

Best Paper Candidate
2B-1 (Time: 13:00 - 13:25) (Online)
TitleDECC: Differential ECC for Read Performance Optimization on High-Density NAND Flash Memory
Author*Yunpeng Song, Yina Lv, Liang Shi (East China Normal University, China)
Pagepp. 104 - 109
KeywordECC, LDPC, 3D NAND flash, read performance
Abstract3D NAND flash memory with advanced multi-level-cell technology has been widely adopted due to its high density, but with significantly degraded reliability. To solve the reliability issue, flash memory often adopts the low-density parity-check code (LDPC) as error correction code (ECC) to encode data and provide fault tolerance. For LDPC, its error correction capability highly depends on the code rates. For LDPC with a low code rate, it can provide a strong correction capability, but with a high energy cost. To avoid the cost, LDPC with a higher code rate is always adopted. When the accessed data is not successfully decoded, LDPC will rely on read retry operations to improve the error correction capability. However, the read retry operation will induce degraded read performance. In this work, a differential ECC (DECC) method is proposed to improve the read performance. The basic idea of DECC is to adopt LDPC with different code rates for data with different access characteristics. Specifically, when data is hot read and retried due to reliability, LDPC with a low code rate will be adopted to optimize performance. With this approach, the cost from LDPC with a low code rate is minimized and the performance is optimized. Through careful design and real-world workloads evaluation on a 3D triple-level-cell (TLC) NAND flash memory, DECC achieves encouraging read performance optimization.

2B-2 (Time: 13:25 - 13:50) (Online)
TitleOptimizing Data Layout for Racetrack Memory in Embedded Systems
Author*Peng Hui, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Rui Xu, Han Wang (East China Normal University, China)
Pagepp. 110 - 115
Keywordracetrack memory, shift operation, hybrid SPM, data layout
AbstractRacetrack memory (RTM), which consists of multiple domain block clusters (DBC) and access ports, is a novel non-volatile memory and has potential as scratchpad memory (SPM) in embedded devices due to its high density and low access latency. However, too many shift operations decrease the performance of RTM and cause unpredictable performance. In this paper, we propose three schemes to optimize the performance of RTM from different aspects, including intra-DBC, inter-DBC, and hybrid SPM with SRAM and RTM. First, a balanced group-based data placement method for the data layout inside one DBC is proposed to reduce shifts. Second, a grouping method for the data allocation among DBCs is proposed. It helps with the shift reduction while using fewer DBCs by using one DBC as multiple DBCs. Finally, we use SRAM to further help the cost reduction, and a cost evaluation metric is proposed to assist the shrinking method which determines the data allocation for hybrid SPM with SRAM and RTM. Experiments show that the proposed schemes can significantly improve the performance of pure RTM and hybrid SPM while using fewer DBCs.

2B-3 (Time: 13:50 - 14:15) (Online)
TitleExploring Architectural Implications to Boost Performance for in-NVM B+-tree
Author*Yanpeng Hu, Qisheng Jiang, Chundong Wang (ShanghaiTech University, China)
Pagepp. 116 - 121
KeywordNVM, B+-Tree, Segmenting, VIPT cache, Allocator
AbstractComputer architecture keeps evolving to support the byte-addressable non-volatile memory (NVM). Researchers have tailored the prevalent B+-tree for NVM, crafting a history of utilizing architectural supports to gain both high performance and crash consistency. The latest architecture-level changes for NVM, e.g., the eADR, motivate us to further explore architectural implications in the design and implementation of in-NVM B+-tree. Our quantitative study finds that eADR makes cache misses increasingly impact an in-NVM B+-tree’s performance. We hence propose Conan for the conflict-aware node allocation based on theoretical justifications. Conan decomposes the virtual addresses of B+-tree nodes regarding a VIPT cache and intentionally places them into different cache sets. Experiments show that Conan evidently reduces cache conflicts and boosts the performance of state-of-the-art in-NVM B+-tree.

2B-4 (Time: 14:15 - 14:40) (Online)
TitleAn Efficient Near-Bank Processing Architecture for Personalized Recommendation System
Author*Yuqing Yang, Weidong Yang, Qin Wang, Naifeng Jing, Jianfei Jiang, Zhigang Mao, Weiguang Sheng (Shanghai Jiao Tong University, China)
Pagepp. 122 - 127
Keywordrecommendation system, near-memory processing, mapping scheme, HMC
AbstractPersonalized recommendation systems consume the major resources in modern AI data centers. The memory-bound embedding layers with irregular memory access patterns have been identified as the bottleneck of recommendation systems. To overcome the memory challenges, near-memory processing (NMP) would be an effective solution which provides high bandwidth. Recent work proposes an NMP approach to accelerate the recommendation models by utilizing the through-silicon via (TSV) bandwidth in 3D-stacked DRAMs. However, the total bandwidth provided by TSVs is insufficient for a batch of embedding layers processed in parallel. In this paper, we propose a near-bank processing architecture to accelerate recommendation models. By integrating the compute-logic near memory banks on DRAM dies of the 3D-stacked DRAM, our architecture can exploit the enormous bank-level bandwidth which is much higher than TSV bandwidth. We also present a hardware/software interface for embedding layers offloading. Moreover, we propose an efficient mapping scheme to enhance the utilization of bank-level bandwidth. As a result, our architecture achieves up to 2.10x speedup and 31% energy saving for data movement over the state-of-the-art NMP solution for recommendation acceleration based on 3D-stacked memory.

[To Session Table]

Session 2C  Cool and Efficient Approximation
Time: 13:00 - 14:40, Tuesday, January 17, 2023
Location: Room Venus
Chairs: Mohsen Imani (University of California, Irvine, USA), Hussam Amrouch (University of Stuttgart, Germany)

2C-1 (Time: 13:00 - 13:25) (Online)
TitlePAALM: Power Density Aware Approximate Logarithmic Multiplier Design
Author*Shuyuan Yu, Sheldon Tan (University of California, Riverside, USA)
Pagepp. 128 - 133
KeywordApproximate Computing, Logarithmic Multiplier, Power Density
AbstractApproximate hardware designs can lead to significant power or energy reduction. However, a recent study showed that approximated designs might lead to unwanted higher temperature and related reliability issues due to increased power density. In this work, we try to mitigate this important problem by proposing, for the first time, a novel power density aware approximate logarithmic multiplier (PAALM) design. The new multiplier design is based on the approximate logarithmic multiplier (ALM) framework due to its rigorous mathematical foundation. The idea is to redesign the parts of existing ALM designs that have high switching activity, based on equivalent mathematical formulas, so that the power density can be reduced with no accuracy loss, at the cost of some area overhead. Our results show that the proposed PAALM design can improve power density by 11.5%/5.7% and area by 31.6%/70.8% with 8/16-bit precision, respectively, when compared with the fixed-point multiplier baseline. It also achieves an extremely low error bias: -0.17/0.08 for 8/16-bit precision, respectively.
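
For readers unfamiliar with logarithmic multipliers, the following minimal Python sketch shows the classic Mitchell approximation, in which log2 of each operand is approximated piecewise-linearly, the logarithms are added, and an approximate antilogarithm is taken. This is offered only as background on the general ALM idea; the abstract does not say which ALM variant PAALM builds on, and the function name `mitchell_multiply` is hypothetical. It is not the PAALM design.

```python
def mitchell_multiply(a, b):
    """Approximate a * b for non-negative integers with Mitchell's method.

    Each operand is written as 2**k * (1 + x) with 0 <= x < 1, and
    log2(1 + x) is approximated by x. All arithmetic stays in integers.
    """
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1      # position of the leading one of a
    k2 = b.bit_length() - 1      # position of the leading one of b
    f1 = a - (1 << k1)           # fractional part of a, scaled by 2**k1
    f2 = b - (1 << k2)           # fractional part of b, scaled by 2**k2
    one = 1 << (k1 + k2)         # the value 1.0, scaled by 2**(k1 + k2)
    s = (f1 << k2) + (f2 << k1)  # x1 + x2, scaled by 2**(k1 + k2)
    if s < one:
        return one + s           # 2**(k1 + k2) * (1 + x1 + x2)
    return 2 * s                 # 2**(k1 + k2 + 1) * (x1 + x2)


print(mitchell_multiply(5, 6), 5 * 6)   # approximate vs. exact product
```

The approximation never overestimates the product, and its relative error is bounded by 1/9 (about 11.1%), which is why practical ALM designs typically add error-compensation terms on top of it.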

Best Paper Award
2C-2 (Time: 13:25 - 13:50) (Online)
TitleApproximate Floating-Point FFT Design with Wide Precision-Range and High Energy Efficiency
Author*Chenyi Wen, Ying Wu, Xunzhao Yin, Cheng Zhuo (Zhejiang University, China)
Pagepp. 134 - 139
KeywordFast Fourier Transform, Pipelined FFT, Error analysis, Optimization, Approximate computation
AbstractFast Fourier Transform (FFT) is a key digital signal processing algorithm that is widely deployed in mobile and portable devices. Recently, with the popularity of human perception related tasks, it is noted that full precision and exactness are not always required for FFT hardware implementation. Unlike many prior works that deploy approximate arithmetic circuits from the bottom up to build the complex FFT, we propose a top-down approximate Floating-Point FFT design methodology to fully exploit the error-tolerant nature of the FFT algorithm. An efficient error modeling of the configurable approximate multiplier is proposed to link the multiplier approximation to the FFT algorithm precision. Then an approximation optimization flow is formulated to maximize the energy efficiency while satisfying the design specification. Experimental results on both simulation and RTL implementation show that the proposed approximate FFT can achieve up to 52% Area-Delay-Product improvement and 23% energy saving when compared to a conventional exact FFT design. When compared to other approximate FFT designs, the proposed design is found to cover almost 2× wider precision range with higher energy efficiency.

2C-3 (Time: 13:50 - 14:15) (Online)
TitleRUCA: RUntime Configurable Approximate Circuits with Self-Correcting Capability
Author*Jingxiao Ma, Sherief Reda (Brown University, USA)
Pagepp. 140 - 145
KeywordApproximate computing, Approximate design automation, Low Power, Dynamically configurable accuracy
AbstractApproximate computing is an emerging computing paradigm that offers improved power consumption by relaxing the requirement for full accuracy. Since the requirements for accuracy may vary according to specific real-world applications, one trend of approximate computing is to design quality-configurable circuits, which are able to switch at runtime among different accuracy modes with different power and delay. In this paper, we present a novel framework RUCA which aims to synthesize runtime configurable approximate circuits based on arbitrary input circuits. By decomposing the truth table, our approach aims to approximate and separate the input circuit into multiple configuration blocks which support different accuracy levels, including a corrector circuit to restore full accuracy. Power gating is used to activate different blocks, such that the approximate circuit is able to operate at different accuracy-power configurations. To improve the scalability of our algorithm, we also provide a design space exploration scheme with circuit partitioning. We evaluate our methodology on a comprehensive set of benchmarks. For 3-level designs, RUCA saves power consumption by 43.71% within 2% error and by 30.15% within 1% error on average.

2C-4 (Time: 14:15 - 14:40) (In-person)
TitleApproximate Logic Synthesis by Genetic Algorithm with an Error Rate Guarantee
AuthorChun-Ting Lee, *Yi-Ting Li (National Tsing Hua University, Taiwan), Yung-Chih Chen (National Taiwan University of Science and Technology, Taiwan), Chun-Yao Wang (National Tsing Hua University, Taiwan)
Pagepp. 146 - 151
KeywordApproximate Computing, Circuit Optimization, Genetic Algorithm
AbstractApproximate computing is an emerging design technique for error-tolerant applications, which may improve circuit area, delay, or power consumption by trading off a circuit’s correctness. In this paper, we propose a novel approximate logic synthesis approach based on a genetic algorithm targeting depth minimization with an error rate guarantee. We conduct experiments on a set of IWLS 2005 and MCNC benchmarks. The experimental results demonstrate that the depth can be reduced by up to 50%, and 22% on average under a 5% error rate constraint. As compared with the state-of-the-art method, our approach can achieve an average of 159% more depth reduction under the same 5% error rate constraint.
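
To make the notion of an error rate constraint concrete, the following minimal Python sketch computes the error rate of an approximate circuit by exhaustive simulation, taking the error rate to be the fraction of input patterns on which the approximate output differs from the exact one. That definition is the usual one in approximate logic synthesis but is an assumption here, since the abstract does not define it. The majority-gate example and all function names are hypothetical, and the sketch does not reproduce the paper's genetic algorithm.

```python
from itertools import product


def error_rate(exact, approx, num_inputs):
    """Fraction of input patterns where approx disagrees with exact.

    Exhaustive enumeration; only practical for small input counts.
    """
    patterns = list(product((0, 1), repeat=num_inputs))
    wrong = sum(1 for bits in patterns if exact(*bits) != approx(*bits))
    return wrong / len(patterns)


def exact_majority(a, b, c):
    """Exact 3-input majority function."""
    return int(a + b + c >= 2)


def approx_majority(a, b, c):
    """Hypothetical approximation: forward input a, ignoring b and c."""
    return a


rate = error_rate(exact_majority, approx_majority, 3)
print(rate, rate <= 0.05)  # error rate, and whether it meets a 5% bound
```

Here the approximation removes all logic depth but is wrong on 2 of the 8 input patterns, so it would be rejected under a 5% bound. For large circuits, exhaustive enumeration is infeasible and the error rate is typically estimated by sampling or computed with formal methods.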

[To Session Table]

Session 2D  Logic Synthesis for AQFP, Quantum Logic, AI driven and efficient Data Layout for HBM
Time: 13:00 - 14:40, Tuesday, January 17, 2023
Location: Room Mars/Mercury
Chairs: Yu-Guang Chen (National Central University, Taiwan), Kazutoshi Wakabayashi (University of Tokyo, Japan)

2D-1 (Time: 13:00 - 13:25) (In-person)
TitleDepth-optimal Buffer and Splitter Insertion and Optimization in AQFP Circuits
Author*Alessandro Tempia Calvino, Giovanni De Micheli (EPFL, Switzerland)
Pagepp. 152 - 158
KeywordAQFP, logic synthesis, Superconducting electronics, technology mapping
AbstractThe Adiabatic Quantum-Flux Parametron (AQFP) is an energy-efficient superconducting logic family. AQFP technology requires buffer and splitting elements (B/S) to be inserted to satisfy path-balancing and fanout-branching constraints. B/S insertion policies and optimization strategies have been recently proposed to minimize the number of buffers and splitters needed in an AQFP circuit. In this work, we study the B/S insertion and optimization methods. In particular, the paper proposes: i) an algorithm for B/S insertion that guarantees global depth optimality; ii) a new approach for B/S optimization based on minimum register retiming; iii) a B/S optimization flow based on (i), (ii), and existing work. We show that our approach reduces the number of B/S by up to 20% while guaranteeing optimal depth and providing a 55x speed-up in run time compared to the state-of-the-art.

2D-2 (Time: 13:25 - 13:50) (In-person)
TitleArea-driven FPGA Logic Synthesis Using Reinforcement Learning
Author*Guanglei Zhou, Jason H. Anderson (University of Toronto, Canada)
Pagepp. 159 - 165
KeywordLogic Synthesis, Reinforcement learning, Circuit Optimization
AbstractLogic synthesis involves a rich set of optimization algorithms applied in a specific sequence to a circuit netlist prior to technology mapping. A conventional approach is to apply a fixed "recipe" of such algorithms deemed to work well for a wide range of different circuits. In this work, we apply reinforcement learning (RL) to determine a unique recipe of algorithms for each circuit. Feature-importance analysis is conducted using a random-forest classifier to prune the number of features visible to the RL agent. We demonstrate conclusive learning by the RL agent and show significant FPGA area reductions vs. the conventional approach (resyn2). In addition to circuit-by-circuit training and inference, we also train an RL agent on multiple circuits, and then apply the agent to optimize: 1) the same set of circuits on which it was trained, and 2) an alternative set of "unseen" circuits. In both scenarios, we observe that the RL agent produces higher-quality implementations than the conventional approach. This shows that the RL agent is able to generalize, and perform beneficial logic synthesis optimizations across a variety of circuits.

2D-3 (Time: 13:50 - 14:15) (In-person)
TitleOptimization of Reversible Logic Networks with Gate Sharing
Author*Yung-Chih Chen, Feng-Jie Chao (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 166 - 171
KeywordReversible logic network, Logic optimization
AbstractLogic synthesis for quantum computing aims to transform a Boolean logic network into a quantum circuit. A conventional two-stage flow first synthesizes the given Boolean logic network into a reversible logic network composed of reversible logic gates. Then, it maps each reversible logic gate into quantum gates to generate a quantum circuit. The state-of-the-art method for the first stage takes advantage of the lookup-table (LUT) mapping technology for FPGAs to decompose the given Boolean logic network into sub-networks, and then maps the sub-networks into reversible logic networks. Although every sub-network is well synthesized, we observe that the reversible logic networks could be further optimized by sharing the reversible logic gates belonging to different sub-networks. Thus, in this paper, we propose a new optimization method for the reversible logic networks by sharing gates. We translate the problem of extracting shareable gates to the exclusive-sum-of-products term optimization problem. The experimental results show that the proposed method successfully optimizes the reversible logic networks generated by the LUT-based method. It is able to reduce the quantum gate cost by approximately 4% on average without increasing the number of ancilla lines for a set of IWLS 2005 benchmarks.

2D-4 (Time: 14:15 - 14:40) (In-person)
TitleIris: Automatic Generation of Efficient Data Layouts for High Bandwidth Utilization
Author*Stephanie Soldavini, Donatella Sciuto, Christian Pilato (Politecnico di Milano, Italy)
Pagepp. 172 - 177
Keywordbandwidth, high-level synthesis, HBM
AbstractOptimizing data movements is becoming one of the biggest challenges in heterogeneous computing to cope with data deluge and, consequently, big data applications. When creating specialized accelerators, modern high-level synthesis (HLS) tools are increasingly efficient in optimizing the computational aspects, but data transfers have not been adequately improved. To combat this, novel architectures such as High-Bandwidth Memory with wider data busses have been developed so that more data can be transferred in parallel. Designers must tailor their hardware/software interfaces to fully exploit the available bandwidth. HLS tools can automate this process, but the designer must follow strict coding-style rules. If the bus width is not evenly divisible by the data width (e.g., when using custom-precision data types) or if the arrays are not power-of-two length, the HLS-generated accelerator will likely not fully utilize the available bandwidth, demanding even more manual effort from the designer. We propose a methodology to automatically find and implement a data layout that, when streamed between memory and an accelerator, uses a higher percentage of the available bandwidth than a naive or HLS-optimized design. We borrow concepts from multiprocessor scheduling to achieve such high efficiency.
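
The bandwidth loss described in this abstract can be quantified with a small calculation. The following minimal Python sketch compares a naive layout, in which each bus word carries only whole elements, with a dense layout in which elements are packed back to back and may straddle bus words. The 256-bit bus and 24-bit element widths are hypothetical values chosen for illustration, and both function names are assumptions; the sketch only measures the problem and is not the Iris layout algorithm.

```python
def naive_utilization(bus_bits, elem_bits):
    """Bandwidth utilization when each bus word holds only whole elements.

    Assumes 0 < elem_bits <= bus_bits.
    """
    per_word = bus_bits // elem_bits       # whole elements per bus word
    return per_word * elem_bits / bus_bits


def dense_utilization(bus_bits, elem_bits, n_elems):
    """Utilization when n_elems elements are packed back to back."""
    payload = n_elems * elem_bits
    words = -(-payload // bus_bits)        # ceiling division
    return payload / (words * bus_bits)


# Hypothetical example: 24-bit custom-precision data on a 256-bit bus.
print(naive_utilization(256, 24))          # 10 elements per word: 240/256
print(dense_utilization(256, 24, 1000))    # only the last word is padded
```

With these example widths the naive layout uses 93.75% of each bus word, while dense packing of 1000 elements uses about 99.7% of the transferred bits. Dense packing costs extra unpacking logic on the accelerator side, which is the kind of trade-off an automated layout tool has to weigh.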

[To Session Table]

Session 2E  University Design Contest
Time: 13:00 - 14:40, Tuesday, January 17, 2023
Location: Miraikan Hall
Chairs: Akira Tsuchiya (The University of Shiga Prefecture, Japan), Mahfuzul Islam (Kyoto University, Japan)

Special Feature Award
2E-1 (Online)
TitleViraEye: An Energy-Efficient Stereo Vision Accelerator with Binary Neural Network in 55 nm CMOS
Author*Yu Zhang, Gang Chen, Tao He, Qian Huang, Kai Huang (Sun Yat-sen University, China)
Pagepp. 178 - 179
KeywordStereo vision, BNN, Energy-efficient, Real-time, ASIC
AbstractThis paper presents the ViraEye chip, an energy-efficient stereo vision accelerator based on the binary neural network (BNN) to achieve high-quality and real-time stereo estimation. This stereo vision accelerator is designed as an end-to-end full pipeline architecture where all processing procedures, including stereo rectification, BNNs, cost aggregation and post-processing, are implemented on the ViraEye chip. ViraEye allows for top-level pipelining between accelerator and image sensors, and no external CPUs or GPUs are required. The accelerator is implemented using SMIC 55nm CMOS technology and achieves top-performing processing speed in terms of the million disparity estimations per second (MDE/s) metric among existing ASICs in the open literature.

2E-2 (In-person)
TitleA 1.2nJ/Classification Fully Synthesized All-Digital Asynchronous Wired-Logic Processor Using Quantized Non-linear Function Blocks in 0.18µm CMOS
Author*Rei Sumikawa, Kota Shiba, Atsutake Kosuge, Mototsugu Hamada, Tadahiro Kuroda (The University of Tokyo, Japan)
Pagepp. 180 - 181
KeywordDNN, AI accelerator, ASIC, Edge-Computing, IoT
AbstractA 5.3 times smaller and 2.6 times more energy-efficient all-digital wired-logic processor which infers MNIST with 90.6% accuracy and 1.2nJ of energy consumption has been developed. To improve area efficiency of wired-logic architecture, non-linear neural network (NNN), which is a neuron and synapse efficient network, and logical compression technology to implement it with area-saving and low-power digital circuits by logic synthesis are proposed, and asynchronous digital combinational circuit DNN hardware has been developed.

2E-3 (In-person)
TitleA Fully Synthesized 13.7μJ/prediction 88% Accuracy CIFAR-10 Single-Chip Data-Reusing Wired-Logic Processor Using Non-Linear Neural Network
Author*Yao-Chung Hsu, Atsutake Kosuge, Rei Sumikawa, Kota Shiba, Mototsugu Hamada, Tadahiro Kuroda (The University of Tokyo, Japan)
Pagepp. 182 - 183
KeywordFPGA, CNN, wired-logic, neural network
AbstractAn FPGA-based wired-logic CNN processor is presented that can process CIFAR-10 at 13.7μJ/prediction with an 88% accuracy, which is 2,036 times more energy-efficient than the prior state-of-the-art FPGA-based processor. Energy efficiency is greatly improved by implementing all processing elements and wirings in parallel on a single FPGA chip to eliminate memory access. By utilizing both (1) a non-linear neural network that saves neurons and synapses and (2) a shift register-based wired-logic architecture, hardware resource usage is reduced by three orders of magnitude.

2E-4 (In-person)
TitleA Multimode Hybrid Memristor-CMOS Prototyping Platform Supporting Digital and Analog Projects
Author*Kamel-Eddine Harabi, Clement Turck, Marie Drouhin, Adrien Renaudineau, Thomas Bersani--Veroni, Damien Querlioz (Univ. Paris-Saclay, CNRS, France), Tifenn Hirtzlin, Elisa Vianello (CEA-LETI, Univ. Grenoble-Alpes, France), Marc Bocquet, Jean-Michel Portal (Aix-Marseille Univ., CNRS, France)
Pagepp. 184 - 185
Keywordmemristor, RRAM, prototyping platform, neural network
AbstractWe present an integrated circuit fabricated in a process co-integrating CMOS and hafnium-oxide memristor technology, which provides a prototyping platform for projects involving memristors. Our circuit includes the periphery circuitry for using memristors within digital circuits, as well as an analog mode with direct access to memristors. The platform allows optimizing the conditions for reading and writing memristors, as well as developing and testing innovative memristor-based neuromorphic concepts.

Best Design Award
2E-5 (In-person)
TitleA fully synchronous digital LDO with built-in adaptive frequency modulation and implicit dead-zone control
Author*Shun Yamaguchi, Mahfuzul Islam, Takashi Hisakado, Osami Wada (Kyoto University, Japan)
Pagepp. 186 - 187
KeywordDigital LDO, Adaptive clocking, Dead-zone control
AbstractThis paper proposes a synchronous digital LDO with adaptive clocking and dead-zone control without additional reference voltages. A test chip fabricated in a commercial 65 nm CMOS general-purpose (GP) process achieves 580x frequency modulation with 99.9% maximum efficiency at 0.6V supply.

2E-7 (In-person)
TitleDemonstration of Order Statistics Based Flash ADC in a 65nm Process
Author*Mahfuzul Islam, Takehiro Kitamura, Takashi Hisakado, Osami Wada (Kyoto University, Japan)
Pagepp. 188 - 189
KeywordFlash ADC, Comparator, Offset voltage, Order statistics, Linearity
AbstractThis paper presents measurement results of a flash ADC that utilizes offset voltages as references. To operate the minimum number of comparators, we select the target comparators based on the rankings of the offset voltage. We present performance improvement by tuning offset voltage distribution using multiple comparator groups under the same power. A test chip in a commercial 65 nm GP process demonstrates the ADCs at 1 GS/s operation.
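
The rank-based selection mentioned in this abstract can be pictured with a small sketch. This is a simplified illustration, not the authors' circuit or calibration procedure: it assumes the offsets of a comparator pool are already known, picks comparators at evenly spaced ranks, and uses their offsets as the reference levels of a thermometer-coded converter. The names `select_by_rank` and `convert` are hypothetical.

```python
def select_by_rank(offsets, n_levels):
    """Return indices of n_levels comparators chosen at evenly spaced
    ranks of the sorted offset voltages (assumed selection rule)."""
    m = len(offsets)
    if n_levels < 1 or m <= n_levels:
        raise ValueError("need more comparators than reference levels")
    # Indices of the comparators ordered by increasing offset voltage.
    ranked = sorted(range(m), key=lambda i: offsets[i])
    # Evenly spaced rank positions; strictly increasing because m > n_levels.
    return [ranked[(k + 1) * m // (n_levels + 1)] for k in range(n_levels)]


def convert(vin, references):
    """Thermometer-code output: number of references below the input."""
    return sum(1 for r in references if vin > r)


if __name__ == "__main__":
    # Hypothetical measured offsets (in volts) of a 7-comparator pool.
    offsets = [0.03, -0.01, 0.00, 0.02, -0.02, 0.01, -0.03]
    chosen = select_by_rank(offsets, 3)
    refs = [offsets[i] for i in chosen]
    print(chosen, refs, convert(0.01, refs))
```

Only the selected comparators would need to be powered. How the offset distribution is shaped to improve linearity, which the abstract attributes to multiple comparator groups, is outside the scope of this sketch.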

Session 3A  Synthesis of Quantum Circuits and Systems
Time: 15:00 - 17:05, Tuesday, January 17, 2023
Location: Room Saturn
Chairs: Weiwen Jiang (George Mason University), Michael Miller (University of Victoria, Canada)

Best Paper Candidate
3A-1 (Time: 15:00 - 15:25) (In-person)
TitleA SAT Encoding for Optimal Clifford Circuit Synthesis
AuthorSarah Schneider, *Lukas Burgholzer (Institute for Integrated Circuits, Johannes Kepler University Linz, Austria), Robert Wille (Chair for Design Automation, Technical University of Munich, Germany)
Pagepp. 190 - 195
Keywordquantum circuit compilation, Clifford circuits, optimal synthesis
AbstractExecuting quantum algorithms on a quantum computer requires compilation to representations that conform to all restrictions imposed by the device. Due to the device’s limited coherence times and gate fidelities, the compilation process has to be optimized as much as possible. To this end, an algorithm’s description first has to be synthesized using the device’s gate library. In this paper, we consider the optimal synthesis of Clifford circuits, an important subclass of quantum circuits with various applications. Such techniques are essential for establishing lower bounds for (heuristic) synthesis methods and gauging their performance. Due to the huge search space, existing optimal techniques are limited to a maximum of six qubits. The contribution of this work is twofold: First, we propose an optimal synthesis method for Clifford circuits based on encoding the task as a satisfiability (SAT) problem and solving it using a SAT solver in conjunction with a binary search scheme. The resulting tool is demonstrated to synthesize optimal circuits for up to 26 qubits, an improvement by >4× compared to the state of the art. Second, we experimentally show that the overhead introduced by state-of-the-art heuristics exceeds the lower bound by 33 % on average.
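
The combination of a decision procedure with binary search mentioned in this abstract follows a common pattern, sketched below under stated assumptions. The oracle `is_feasible` is a hypothetical stand-in for a SAT query of the form "does a circuit within bound d exist?"; the actual encoding and solver interface used by the authors are not shown here. The sketch assumes feasibility is monotone in the bound and that a feasible upper bound is known, for example from a heuristic solution.

```python
from typing import Callable


def minimal_bound(is_feasible: Callable[[int], bool], upper: int) -> int:
    """Return the smallest d in [0, upper] for which is_feasible(d) holds.

    Assumes is_feasible is monotone (once True, True for all larger d)
    and that is_feasible(upper) is True.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if is_feasible(mid):
            hi = mid        # a solution within mid exists; try smaller bounds
        else:
            lo = mid + 1    # no solution within mid; the optimum is larger
    return lo


if __name__ == "__main__":
    calls = []

    def toy_oracle(d: int) -> bool:
        # Toy stand-in for a solver call: pretend the optimum is 7.
        calls.append(d)
        return d >= 7

    print(minimal_bound(toy_oracle, 20), len(calls))
```

Each oracle call corresponds to one solver invocation, so the number of invocations grows logarithmically with the initial upper bound rather than linearly.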

3A-2 (Time: 15:25 - 15:50) (In-person)
TitleAn SMT-Solver-based Synthesis of NNA-Compliant Quantum Circuits Consisting of CNOT, H and T Gates
Author*Kyohei Seino, Shigeru Yamashita (Ritsumeikan University, Japan)
Pagepp. 196 - 201
KeywordNearest Neighbor Architecture (NNA) restriction, SMT-Solver, T gate, Don't Care Condition
AbstractIt is natural to assume that we can perform quantum operations between only two adjacent physical qubits (quantum bits) to realize a quantum computer for both the current and possible future technologies. This restriction is called the Nearest Neighbor Architecture (NNA) restriction. This paper proposes an SMT-solver-based synthesis of quantum circuits consisting of CNOT, H, and T gates to satisfy the NNA restriction. Although the existing SMT-solver-based synthesis cannot treat H and T gates directly, our method treats the functionality of quantum-specific T and H gates carefully so that we can utilize an SMT-solver to minimize the number of CNOT gates; unlike the existing SMT-solver-based methods, our method considers "Don't Care" conditions in intermediate points of a quantum circuit by exploiting the property of T gates to reduce CNOT gates. Experimental results show that our approach can reduce the number of CNOT gates by 58.11% on average compared to the naive application of the existing method which does not consider the "Don't Care" condition.

3A-3 (Time: 15:50 - 16:15) (In-person)
TitleCompilation of Entangling Gates for High-Dimensional Quantum Systems
Author*Kevin Mato (Technical University of Munich, Germany), Martin Ringbauer (University of Innsbruck, Austria), Stefan Hillmich (Johannes Kepler University Linz, Austria), Robert Wille (Technical University of Munich/Competence Center Hagenberg (SCCH) GmbH, Germany)
Pagepp. 202 - 208
KeywordQuantum Computation, Electronic design automation, Emerging languages and compilers
AbstractMost quantum computing architectures to date natively support multi-valued logic, albeit being typically operated in a binary fashion. Yet, multi-valued, or qudit, quantum processors have access to much richer forms of quantum entanglement, which promise to significantly boost the performance and usefulness of quantum devices. However, much of the theory as well as corresponding design methods required for exploiting such hardware remain insufficient and generalizations from qubits are not straightforward. A particular challenge is the compilation of quantum circuits into sets of native qudit gates supported by state-of-the-art quantum hardware. In this work, we address this challenge by introducing a complete workflow for compiling any two-qudit unitary into an arbitrary native gate set. A case study demonstrates the feasibility of both the proposed approach and the corresponding implementation (which is freely available at github.com/cda-tum/qudit-entanglement-compilation).

3A-4 (Time: 16:15 - 16:40) (In-person)
TitleWIT-Greedy: Hardware System Design of Weighted ITerative Greedy Decoder for Surface Code
Author*Wang Liao (The University of Tokyo, Japan), Yasunari Suzuki (NTT Computer and Data Science Laboratories, Japan), Teruo Tanimoto (Kyushu University, Japan), Yosuke Ueno (The University of Tokyo, Japan), Yuuki Tokunaga (NTT Computer and Data Science Laboratories, Japan)
Pagepp. 209 - 215
Keywordquantum error correction, surface code, decoder, FPGA, non-identical weights
AbstractTo demonstrate fault-tolerant quantum computing, performing quantum error correction by encoding qubits with surface codes is considered the most promising approach. To implement the error correction, we need to estimate the errors repetitively during the computation, which is called the decoding process, and the process has strict timing restrictions due to the lifetime of qubits. While several fast hardware implementations have been proposed, they still face challenges in error correction. This is because they assume uniform error properties of qubits, whereas in practice these properties vary widely. We show that neglecting the non-uniform error properties imposes significant degradation in the error-estimation performance. Therefore, decoder designs that can treat non-uniform physical error rates are strongly demanded. In this paper, we propose a hardware design of a decoder that can treat non-uniform error properties with small latency. The key of our design is 1) constructing a look-up table for speeding up the error estimation and 2) enabling parallel processing during decoding. Using an FPGA implementation, we show that our design scales up to code distance 11 within a microsecond-level delay, which is comparable to the existing state-of-the-art designs, while our design can treat non-identical errors.

3A-5 (Time: 16:40 - 17:05) (In-person)
TitleQuantum Data Compression for Efficient Generation of Control Pulses
AuthorDaniel Volya, *Prabhat Mishra (University of Florida, USA)
Pagepp. 216 - 221
KeywordQuantum Computing, Quantum Data Compression, Optimal Quantum Control, Quantum Pulse Generation
AbstractIn order to physically realize a robust quantum gate, a specifically tailored laser pulse needs to be derived via strategies such as quantum optimal control. Unfortunately, such strategies face exponential complexity with quantum system size and become infeasible even for moderate-sized quantum circuits. In this paper, we propose an automated framework for effective utilization of these quantum resources. Specifically, this paper makes three important contributions. First, we utilize an effective combination of register compression and dimensionality reduction to reduce the area of a quantum circuit. Next, due to the properties of an autoencoder, the compressed gates produced are robust even in the presence of noise. Finally, our proposed compression reduces the computation time of quantum control. Experimental evaluation using popular quantum algorithms demonstrates that our proposed approach can enable efficient generation of noise-resilient control pulses while state-of-the-art fails to handle large-scale quantum systems.

Session 3B  In-Memory/Near-Memory Computing for Neural Networks
Time: 15:00 - 17:05, Tuesday, January 17, 2023
Location: Room Uranus
Chairs: Chao Wu (Northeastern University), Huizhang Luo (Hunan University)

3B-1 (Time: 15:00 - 15:25) (Online)
TitleToward Energy-Efficient Sparse Matrix-Vector Multiplication with Near STT-MRAM Computing Architecture
Author*Yueting Li, He Zhang, Xueyan Wang (Beihang University, China), Hao Cai (Southeast University, China), Yundong Zhang (Vimicro Corporation, China), Shuqin Lv, Renguang Liu (TMC Corporation, China), Weisheng Zhao (Beihang University, China)
Pagepp. 222 - 227
KeywordNear memory computing, SpMV, STT-MRAM, Energy efficient
AbstractSparse Matrix-Vector Multiplication (SpMV) is one of the key computational primitives used in modern workloads. SpMV performs memory access that leads to unnecessary data transmission, massive data access, and redundant multiplicative accumulators. Therefore, we propose the near spin-transfer torque magnetic random access memory (STT-MRAM) processing architecture with three optimizations from circuit to architecture perspectives. These optimizations include (1) the near MRAM processing (NMP) architecture obtains data from the fast pipelined STT-MRAM, (2) the NMP controller receives the instruction through the AXI4 bus to implement the SpMV operation in the following steps: it identifies effective data and encodes the index depending on the kernel size, (3) the NMP controller uses high-level synthesis dataflow in shared memory to achieve better throughput without consuming bus bandwidth, and (4) the configurable MACs are implemented in the NMP core so that the matching step is entirely avoided during the multiplication. Using these optimizations, the simulation results in the 40nm process show that the pipelined STT-MRAM improves the read bandwidth to 26.7GB/s and that the load energy is only 0.242 pJ/bit. The extensive experimental results on the NMP architecture show that it achieves up to 66x and 28x speedup compared with state-of-the-art ones, and a 69x speedup over the design without sparse optimization.
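
For readers unfamiliar with the SpMV primitive named in this abstract, the following is a minimal software reference in pure Python. It assumes the compressed sparse row (CSR) storage format, which the abstract does not specify, and it says nothing about the proposed hardware. The function name `spmv_csr` is illustrative.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in values
    row_ptr : row_ptr[r]..row_ptr[r+1] delimits the entries of row r
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Only the stored nonzeros of row r are multiplied and accumulated.
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 0],
    #      [0, 3, 4]]
    values = [1.0, 2.0, 3.0, 4.0]
    col_idx = [0, 2, 1, 2]
    row_ptr = [0, 2, 2, 4]
    print(spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))
```

The indirect access `x[col_idx[k]]` is the irregular memory pattern that makes SpMV memory-bound on conventional architectures.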

3B-2 (Time: 15:25 - 15:50) (Online)
TitleRIMAC: An Array-level ADC/DAC-free ReRAM-based In-Memory DNN Processor with Analog Cache and Computation
Author*Peiyu Chen, Meng Wu, Yufei Ma, Le Ye, Ru Huang (Peking University, China)
Pagepp. 228 - 233
KeywordIn-Memory Computing, AI Circuit Design
AbstractBy directly computing in the analog domain, processing-in-memory (PIM) is emerging as a promising alternative to overcome the memory bottleneck of the traditional von Neumann architecture, especially for deep neural networks (DNNs). However, the data outside PIM macros in most existing PIM accelerators are stored and operated as digital signals that require massive expensive digital-to-analog (D/A) and analog-to-digital (A/D) converters. In this work, an array-level ADC/DAC-free ReRAM-based in-memory DNN processor named RIMAC is proposed, which accelerates various DNNs in the pure analog domain with analog cache and analog computation modules to eliminate the expensive D/A and A/D conversions. Our experimental results show that the peak energy efficiency is improved by about 34.8×, 97.6×, 10.7×, and 14.0× compared to PRIME, ISAAC, Lattice, and 21’DAC for various DNNs on ImageNet, respectively.

3B-3 (Time: 15:50 - 16:15) (In-person)
TitleCrossbar-Aligned & Integer-Only Neural Network Compression for Efficient In-Memory Acceleration
Author*Shuo Huai, Di Liu, Xiangzhong Luo, Hui Chen, Weichen Liu (Nanyang Technological University, Singapore), Ravi Subramaniam (HP Inc., USA)
Pagepp. 234 - 239
KeywordIn-memory computing, pruning, quantization, neural networks
AbstractCrossbar-based In-Memory Computing (IMC) accelerators preload the entire Deep Neural Network (DNN) into crossbars before inference. However, devices with limited crossbars cannot infer increasingly complex models. IMC-pruning can reduce the usage of crossbars, but current methods need expensive extra hardware for data alignment. Meanwhile, quantization can represent weights of DNNs by integers, but current schemes employ non-integer scaling factors to ensure accuracy, requiring costly multipliers. In this paper, we first propose crossbar-aligned pruning to reduce the usage of crossbars without hardware overhead. Then, we introduce a quantization scheme to avoid multipliers in IMC devices. Finally, we design a learning method to complete the above two schemes and cultivate an optimal compact DNN with high accuracy and large sparsity during training. Experiments demonstrate that our framework, compared to state-of-the-art methods, achieves larger sparsity and lower power consumption with higher accuracy. We even improve the accuracy by 0.43% for VGG-16 with an 88.25% sparsity rate on the CIFAR-10 dataset. Compared to the original model, we reduce computing power and area by 19.8x and 18.8x, respectively.

3B-4 (Time: 16:15 - 16:40) (In-person)
TitleDiscovering the In-Memory Kernels of 3D Dot-Product Engines
Author*Muhammad Rashedul Haq Rashed (University of Central Florida, USA), Sumit Kumar Jha (University of Texas at San Antonio, USA), Rickard Ewetz (University of Central Florida, USA)
Pagepp. 240 - 245
KeywordIn-memory computing, 3D computational kernel
AbstractThe capability of resistive random access memory (ReRAM) to implement multiply-and-accumulate operations promises unprecedented efficiency in the design of scientific computing applications. While the use of two-dimensional (2D) ReRAM crossbars has been well investigated in the last few years, the design of in-memory dot-product engines using three-dimensional (3D) ReRAM crossbars remains a topic of active investigation. In this paper, we holistically explore how to leverage 3D ReRAM crossbars with several (2 to 7) stacked crossbar layers. In contrast, previous studies have focused on 3D ReRAM with at most 2 stacked crossbar layers. We first discover the in-memory compute kernels that can be realized using 3D ReRAM with multiple stacked crossbar layers. We discover that matrices with different sparsity patterns can be realized by appropriately assigning the inputs and outputs to the perpendicular metal wires within the 3D stack. We present a design automation tool to map sparse matrices within scientific computing applications to the discovered 3D kernels. The proposed framework is evaluated using 20 applications from the SuiteSparse Matrix Collection. Compared with 2D crossbars, the proposed approach using 3D crossbars improves area, energy, and latency by 2.02X, 2.37X, and 2.45X, respectively.

3B-5 (Time: 16:40 - 17:05) (In-person)
TitleRVComp: Analog Variation Compensation for RRAM-based In-Memory Computing
Author*Jingyu He, Yucong Huang (Hong Kong University of Science and Technology, Hong Kong), Miguel Lastras (Universidad Autónoma de San Luis Potosí, Mexico), Terry Tao Ye (Southern University of Science and Technology, China), Chi Ying Tsui, Kwang-Ting Cheng (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 246 - 251
KeywordRRAM, Reliability, Analog compensation
AbstractResistive Random Access Memory (RRAM) has shown great potential in accelerating the memory-intensive computation in neural network applications. However, RRAM-based computing suffers from significant accuracy degradation due to the inevitable device variations. In this paper, we propose RVComp, a fine-grained analog Compensation approach to mitigate the accuracy loss of in-memory computing incurred by the Variations of the RRAM devices. Specifically, weights in the RRAM crossbar are accompanied by dedicated compensation RRAM cells to offset their programming errors with a scaling factor. A programming target shifting mechanism is further designed with the objectives of reducing the hardware overhead and minimizing the compensation errors under large device variations. Based on these two key concepts, we propose double and dynamic compensation schemes and the corresponding support architecture. Since the RRAM cells only account for a small fraction of the overall area of the computing macro due to the dominance of the peripheral circuitry, the overall area overhead of RVComp is low and manageable. Simulation results show our compensation schemes achieve a negligible 1.80% inference accuracy drop for ResNet18 on the CIFAR-10 dataset under 30% device variation with only 7.12% area and 5.02% power overhead and no extra latency.

Session 3C  IEEE CEDA Sponsored Technical Session: EDA for New VLSI Revolutions
Time: 15:00 - 17:30, Tuesday, January 17, 2023
Location: Room Venus
Chair: Gi-Joon Nam (IBM Research, USA)

3C-1 (Time: 15:00 - 15:30)
Title(Special Talk) VLSI Mask Optimization: How Learning Can Help
AuthorBei Yu (Chinese University of Hong Kong, Hong Kong)

3C-2 (Time: 15:30 - 16:00)
Title(Special Talk) Modeling and Simulation of CMOS Image Sensors in SystemVerilog
AuthorJaeha Kim (Seoul National University, Republic of Korea)

3C-3 (Time: 16:00 - 16:30)
Title(Special Talk) Transistor Count Optimization
AuthorRicardo Reis (UFRGS, Brazil)

3C-4 (Time: 16:30 - 17:00)
Title(Special Talk) EDA for additive printed electronics
AuthorMehdi Tahoori (Karlsruhe Institute of Technology, Germany)

3C-5 (Time: 17:00 - 17:30)
Title(Special Talk) Memory Safety Environment for RISC-V processors
AuthorSri Parameswaran (University of New South Wales, Australia)
AbstractIn this talk, a novel hardware/software co-design methodology is presented in which a RISC-V based processor is extended with new instructions and microarchitecture enhancements, enabling complete memory safety in the C programming language and faster memory safety checks. Furthermore, a compiler is instrumented to provide security operations considering the changes to the processor. Moreover, a design exploration framework is proposed to provide an in-depth search for optimal hardware/software configuration for application-specific workloads regarding performance overhead, security coverage, area cost, and critical path latency.

Session 3D  Machine Learning-Based Design Automation
Time: 15:00 - 17:05, Tuesday, January 17, 2023
Location: Room Mars/Mercury
Chairs: Seokhyeong Kang (POSTECH University), Daijoon Hyun (Cheongju University, Republic of Korea)

Best Paper Award
3D-1 (Time: 15:00 - 15:25) (In-person)
TitleRethink before Releasing your Model: ML Model Extraction Attack in EDA
Author*Chen-Chia Chang, Jingyu Pan (Duke University, USA), Zhiyao Xie (Hong Kong University of Science and Technology, Hong Kong), Jiang Hu (Texas A&M University, USA), Yiran Chen (Duke University, USA)
Pagepp. 252 - 257
KeywordMachine learning, Security
AbstractMachine learning (ML)-based techniques for electronic design automation (EDA) have boosted the performance of modern integrated circuits (ICs). Such achievement makes ML models important to the EDA industry. In addition, ML models for EDA are widely considered to have high development costs because of the time-consuming and complicated training data generation process. Thus, confidentiality protection for EDA models is a critical issue. However, an adversary could apply model extraction attacks to steal the model in the sense of achieving performance comparable to the victim's model. As model extraction attacks have posed a great threat to other application domains, e.g., computer vision and natural language processing, in this paper, we study model extraction attacks for EDA models under two real-world scenarios. It is the first work that (1) introduces model extraction attacks on EDA models and (2) proposes two attack methods against the unlimited and limited query budget scenarios. Our results show that our approach can achieve competitive performance with the well-trained victim model without any performance degradation. Based on the results, we demonstrate that model extraction attacks truly threaten the EDA model privacy and hope to raise concerns about ML security issues in EDA.

3D-2 (Time: 15:25 - 15:50) (Online)
TitleMacroRank: Ranking Macro Placement Solutions Leveraging Translation Equivariancy
Author*Yifan Chen, Jing Mai, Xiaohan Gao, Muhan Zhang, Yibo Lin (Peking University, China)
Pagepp. 258 - 263
Keywordplacement, routing, machine learning
AbstractModern large-scale designs make extensive use of heterogeneous macros, which can significantly affect routability. Predicting the final routing quality in the early macro placement stage can filter out poor solutions and speed up design closure. By observing that routing is correlated with the relative positions between instances, we propose MacroRank, a macro placement ranking framework leveraging translation equivariance and a Learning to Rank technique. The framework is able to learn the relative order of macro placement solutions and rank them based on routing quality metrics like wirelength, number of vias, and number of shorts. The experimental results show that compared with the most recent baseline, our framework can improve the Kendall rank correlation coefficient by 49.5 and the average performance of top-30 prediction by 8.1, 2.3, and 10.6 on wirelength, vias, and shorts, respectively.
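
The Kendall rank correlation coefficient used as a metric in this abstract measures how well two orderings agree. The sketch below computes the basic tau-a variant, which ignores tie corrections; which variant the authors report is not stated, so this is only an illustration of the metric, and the function name `kendall_tau_a` is an assumption.

```python
from itertools import combinations


def kendall_tau_a(a, b):
    """Kendall tau-a between two equally long score lists:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1      # the pair is ordered the same way in both lists
        elif s < 0:
            discordant += 1      # the pair is ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)


if __name__ == "__main__":
    predicted = [1, 3, 2, 4]   # hypothetical predicted ranking scores
    actual = [1, 2, 3, 4]      # hypothetical post-routing quality ranking
    print(kendall_tau_a(predicted, actual))
```

A value of 1 means the predicted order of placement solutions matches the true order exactly, and -1 means it is fully reversed.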

3D-3 (Time: 15:50 - 16:15) (Online)
TitleBufFormer: A Generative ML Framework for Scalable Buffering
Author*Rongjian Liang, Siddhartha Nath, Anand Rajaram (NVIDIA, USA), Jiang Hu (Texas A&M University, USA), Haoxing Ren (NVIDIA, USA)
Pagepp. 264 - 270
Keywordbuffer insertion, timing closure, timing optimization, buffer tree, interconnect optimization
AbstractBuffering is a prevalent interconnect optimization technique to help timing closure and is often performed after placement. A common buffering approach is to construct a Steiner tree and then insert buffers on the tree using a Ginneken-Lillis style algorithm. Such an approach is difficult to scale to large nets. Our work attempts to solve this problem with a generative machine-learning (ML) approach without Steiner tree construction. Our approach can extract and reuse knowledge from high-quality samples and therefore has significantly improved scalability. A generative ML framework, BufFormer, is proposed to construct abstract tree topology while simultaneously determining buffer sizes and locations. A baseline method, FLUTE-based Steiner tree construction followed by Ginneken-Lillis style buffer insertion, is implemented to generate training samples. After training, BufFormer can produce solutions for unseen nets highly comparable to baseline results, with a correlation coefficient of 0.977 in terms of buffer area and 0.934 for driver-sink delays. Up to 160X speedup can be achieved for large nets when running on a GPU over the baseline on a single CPU thread.

3D-4 (Time: 16:15 - 16:40) (In-person)
TitleDecoupling Capacitor Insertion Minimizing IR-Drop Violations and Routing DRVs
AuthorDaijoon Hyun (Cheongju University, Republic of Korea), *Younggwang Jung, Insu Cho, Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea)
Pagepp. 271 - 276
KeywordDecoupling capacitor, Dynamic IR-drop, Design rule violation, Machine learning
AbstractDecoupling capacitor (decap) cells are inserted near function cells of high switching activities so that their IR-drop can be suppressed. Their design becomes more complex and uses higher metal layers, thereby starting to manifest themselves as routing blockage. Post-placement decap insertion, with a goal of minimizing both IR-drop violations and routing design rule violations (DRVs), is addressed for the first time. U-Net with graph convolutional network is introduced to predict routing DRV probability. The decap insertion problem is formulated and a heuristic algorithm is presented. Experiments with a few test circuits demonstrate that DRVs are reduced by 16% on average with no IR-drop violations, compared to a conventional method which does not explicitly consider DRVs. This results in 48% reduction in routing runtime and 23% improvement in total negative slack.

3D-5 (Time: 16:40 - 17:05) (In-person)
TitleDPRoute: Deep Learning Framework for Package Routing
AuthorYeu-Haw Yeh (Institute of Electronics, National Yang Ming Chiao Tung University, Taiwan), Simon Yi-Hung Chen (Mediatek Inc., Taiwan), *Hung-Ming Chen (Institute of Electronics, National Yang Ming Chiao Tung University, Taiwan), Deng-Yao Tu, Guan-Qi Fang, Yun-Chih Kuo, Po-Yang Chen (Mediatek Inc., Taiwan)
Pagepp. 277 - 282
Keywordsubstrate routing, deep learning, multi-agent reinforcement learning
AbstractFor routing closures in package designs, net order is critical due to complex design rules and severe wire congestion. However, existing solutions are deliberately designed using heuristics and are difficult to adapt to different design requirements without updating the algorithm. This work presents a novel deep learning-based routing framework that can keep improving by accumulating data to accommodate increasingly complex design requirements. Based on the initial routing results, we apply deep learning to concurrent detailed routing to deal with the problem of net ordering decisions. We use multi-agent deep reinforcement learning to learn routing schedules between nets. We regard each net as an agent, which needs to consider the actions of other agents while making pathing decisions to avoid routing conflict. Experimental results on industrial package design show that the proposed framework can reduce the number of design rule violations by 99.5% and the wirelength by 2.9% relative to the initial routing.

Wednesday, January 18, 2023

Session 2K  Keynote II
Time: 9:00 - 10:00, Wednesday, January 18, 2023
Location: Miraikan Hall
Chair: Toshihiro Hattori (Renesas Electronics, Japan)

Title(Keynote Address) Analog Synthesis 3.0: AI/ML to Boost Automated Design and Test of Analog/Mixed-Signal ICs
AuthorGeorges G.E. Gielen (KU Leuven, Belgium)
AbstractAnalog/mixed-signal integrated circuits remain indispensable in electronics applications that interface the physical with the cyber world. But whereas digital circuits are largely synthesized through EDA software, the design of analog circuits in industry is, surprisingly, still mainly handcrafted, resulting in low design productivity with long and error-prone design cycles and high development costs. The showstopper for current analog synthesis tools is their need for proper and often circuit-specific design heuristics and constraints to be entered explicitly by designers in order to handle the humongous solution space and to steer the circuit and layout optimizations towards acceptable solutions. This keynote will present an analog synthesis 3.0 methodology based on advanced machine learning (ML) techniques to self-learn and then exploit the design expertise and constraints from available completed designs. Such innovations may finally enable fully automated analog circuit design without a human designer in the loop. The second AI for CAD application that will be presented is in analog test program development where ML methods will be shown to boost analog fault coverage, also for latent defects.

Session 4A  Advanced Techniques for Yields, Low Power and Reliability
Time: 10:20 - 11:35, Wednesday, January 18, 2023
Location: Room Saturn
Chairs: Yibo Lin (Peking University, China), Yukihide Kohira (The University of Aizu, Japan)

4A-1 (Time: 10:20 - 10:45) (Online)
TitleHigh Dimensional Yield Estimation using Shrinkage Deep Features and Maximization of Integral Entropy Reduction
Author*Shuo Yin (Beihang University, China), Guohao Dai (Shenzhen University, China), Wei W. Xing (Beihang University, China)
Pagepp. 283 - 289
Keywordyield estimate, Bayesian optimization, failure probability, deep kernel learning, gaussian process
AbstractDespite the advances in high-sigma yield analysis with the help of machine learning techniques in the past decade, one of the main challenges, the "curse of dimensionality", which is inevitable when dealing with modern large-scale circuits, remains unsolved. To resolve this challenge, we propose an absolute shrinkage deep kernel learning, ASDK, which automatically identifies the dominant process variation parameters in a nonlinear-correlated deep kernel, as a surrogate model to emulate the expensive SPICE simulation. To improve the yield estimation efficiency, we propose a novel maximization of approximated entropy reduction for an efficient model update, which is further enhanced with parallel batch sampling for parallel computing. Experiments on SRAM column circuits demonstrate the superiority of ASDK over the state-of-the-art approaches in terms of accuracy and efficiency, with up to 9x speedup over the SOTA methods.

4A-2 (Time: 10:45 - 11:10) (In-person)
TitleMIA-aware Detailed Placement and VT Reassignment for Leakage Power Optimization
Author*Hung-Chun Lin, Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 290 - 295
Keywordminimum-implant-area rule, detailed placement
AbstractAs the feature size decreases, leakage power consumption becomes an important target in the design. Using multiple threshold voltages (VTs) in cell-based designs is a popular technique to simultaneously optimize circuit timing and minimize leakage power. However, an arbitrary cell placement result of a multi-VT design may suffer from many design rule violations induced by the Minimum-Implant-Area (MIA) rule, and thus it is necessary to take the MIA rules into consideration during the detailed placement stage. The state-of-the-art works on detailed placement that comprehensively tackle MIA rules either disallow VT change or only allow reducing cell VTs to avoid timing degradation. However, these limitations may either result in larger cell displacement or cause overhead in leakage power. In this paper, we propose an optimization framework of VT reassignment and detailed placement to simultaneously consider MIA rules and leakage power minimization under timing constraints. Experimental results show that, compared with the state-of-the-art works, the proposed framework can efficiently achieve a better trade-off between leakage power and cell displacement.

4A-3 (Time: 11:10 - 11:35) (Online)
TitleSLOGAN: SDC Probability Estimation Using Structured Graph Attention Network
Author*Junchi Ma, Sulei Huang, Zongtao Duan, Lei Tang, Luyang Wang (Chang'an University, China)
Pagepp. 296 - 301
KeywordSoft Error, Silent Data Corruption, Structured Graph Attention Network, Fault propagation
AbstractThe trend of progressive technology scaling makes computing systems more susceptible to soft errors. The most critical issue that soft errors incur is silent data corruption (SDC), since SDC occurs silently without any warning to users. Estimating the SDC probability of a program is the first and essential step towards designing protection mechanisms. Prior work suffers from prediction inaccuracy since the proposed heuristic-based models fail to describe the semantics of fault propagation. We propose a novel approach, SLOGAN, which transforms the prediction of SDC probability into a graph regression task. A program is represented in the form of a dynamic dependence graph. To capture the rich semantics of fault propagation, we apply a structured graph attention network, which includes node-level, graph-level, and layer-level self-attention. With the learned attention coefficients from node-level, graph-level, and layer-level self-attention, the importance of edges, nodes, and layers to the fault propagation can be fully considered. We generate the graph embedding by weighted aggregation of the embeddings of nodes and compute the SDC probability by the regression model. The experiment shows that SLOGAN achieves higher SDC accuracy than state-of-the-art methods with a low time cost.

Session 4B  Microarchitectural Design and Neural Networks
Time: 10:20 - 11:35, Wednesday, January 18, 2023
Location: Room Uranus
Chairs: Aviral Srivastava (Arizona State University, USA), Hiroyuki Tomiyama (Ritsumeikan University, Japan)

Best Paper Candidate
4B-1 (Time: 10:20 - 10:45) (In-person)
TitleMicroarchitecture Power Modeling via Artificial Neural Network and Transfer Learning
AuthorJianwang Zhai, Yici Cai (Tsinghua University, China), *Bei Yu (Chinese University of Hong Kong, China)
Pagepp. 302 - 307
KeywordMicroarchitecture Power Modeling, RISC-V BOOM, Artificial Neural Network, Transfer Learning
AbstractAccurate and robust power models are in high demand for exploring better CPU designs. However, previous learning-based power models ignore the discrepancies in data distribution among different CPU designs, making it difficult to use data from historical configurations to aid modeling for a new target configuration. In this paper, we investigate the transferability of power models and propose a microarchitecture power modeling method based on transfer learning (TL). A novel TL method for artificial neural network (ANN)-based power models is proposed, where cross-domain mixup generates more auxiliary samples close to the target configuration to fill in the distribution discrepancy, and domain-adversarial training extracts domain-invariant features to complete the target model construction. Experiments show that our method greatly improves the model transferability and can effectively utilize the knowledge of the existing CPU configuration to facilitate target power model construction.

4B-2 (Time: 10:45 - 11:10) (Online)
TitleMUGNoC: A Software-configured Multicast-Unicast-Gather NoC for Accelerating CNN Dataflows
Author*Hui Chen, Di Liu, Shiqing Li, Shuo Huai, Xiangzhong Luo, Weichen Liu (Nanyang Technological University, Singapore)
Pagepp. 308 - 313
KeywordNetwork-on-chips, CNN Dataflow, Parallel multipath transmission
AbstractCurrent communication infrastructures for convolutional neural networks (CNNs) focus only on specific transmission patterns. To reduce data movement, various CNN dataflows have been presented. For these dataflows, parameters and results are delivered using different traffic patterns, i.e., multicast, unicast, and gather, preventing dataflow-specific communication backbones from benefiting the entire system if the dataflow changes or different dataflows run in the same system. Thus, in this paper, we propose MUG-NoC to support typical traffic patterns and accelerate them, thereby boosting multiple dataflows. Specifically, (1) we for the first time support multicast in a 2D-mesh software-configurable NoC by revising the router configuration and proposing an efficient multicast routing; (2) we decrease unicast latency by transmitting data through different routes in parallel; (3) we reduce output gather overheads by pipelining basic dataflow units. Experiments show that our proposed design can reduce the total data transmission time by at least 39.2% compared with the state-of-the-art CNN communication backbone.

4B-3 (Time: 11:10 - 11:35) (In-person)
TitleCOLAB: Collaborative and Efficient Processing of Replicated Cache Requests in GPU
Author*Bo-Wun Cheng, En-Ming Huang, Chen-Hao Chao, Wei-Fang Sun (National Tsing Hua University, Taiwan), Tsung-Tai Yeh (National Yang Ming Chiao Tung University, Taiwan), Chun-Yi Lee (National Tsing Hua University, Taiwan)
Pagepp. 314 - 319
KeywordGPU, Cache
AbstractIn this work, we aim to capture replicated cache requests between Streaming Multiprocessors (SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of modern GPUs. To achieve this objective, we incorporate a per-cluster Cache line Ownership Lookup tABle (COLAB) that keeps track of which SM within a cluster holds a copy of a specific cache line. With the assistance of COLAB, SMs can collaboratively and efficiently process replicated cache requests within SM clusters by redirecting them according to the ownership information stored in COLAB. By servicing replicated cache requests within SM clusters that would otherwise consume precious NoC bandwidth, the heavy pressure on the NoC interconnection can be eased. Our experimental results demonstrate that the adoption of COLAB can indeed alleviate the excessive NoC pressure caused by replicated cache requests and improve the overall system throughput of the baseline GPU while incurring minimal overhead. On average, COLAB can reduce NoC traffic by 38% and improve instructions per cycle (IPC) by 43%.
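
The redirection idea described in this abstract can be pictured with a toy software model. This is only an illustrative sketch, not the hardware design from the paper: the class `OwnershipTable`, its method names, and the one-owner-per-line policy are all hypothetical, and capacity limits, replacement, and coherence are ignored.

```python
class OwnershipTable:
    """Toy per-cluster table mapping a cache-line address to one SM holding a copy.

    Hypothetical model for illustration only; not the COLAB hardware structure.
    """

    def __init__(self):
        self._owner = {}  # line address -> SM id

    def record_fill(self, line_addr, sm_id):
        # Remember that sm_id now holds a copy of this cache line.
        self._owner[line_addr] = sm_id

    def evict(self, line_addr, sm_id):
        # Forget the entry only if the evicting SM is the recorded owner.
        if self._owner.get(line_addr) == sm_id:
            del self._owner[line_addr]

    def route(self, line_addr, requester):
        # Redirect to a peer SM in the cluster when one holds the line;
        # otherwise the request leaves the cluster over the NoC.
        owner = self._owner.get(line_addr)
        if owner is not None and owner != requester:
            return ("sm", owner)
        return ("noc", None)


table = OwnershipTable()
first = table.route(0x40, requester=0)    # no owner yet, goes to the NoC
table.record_fill(0x40, sm_id=0)          # SM 0 receives the line
second = table.route(0x40, requester=1)   # replicated request is redirected
```

In this sketch, only the first request for a line consumes NoC bandwidth; later requests from other SMs in the same cluster are served by the recorded owner.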

Session 4C  Novel Techniques for Scheduling and Memory Optimizations in Embedded Software
Time: 10:20 - 11:35, Wednesday, January 18, 2023
Location: Room Venus
Chairs: Christian Pilato (Politecnico di Milano, Italy), Hiroshi Sasaki (Tokyo Institute of Technology, Japan)

Best Paper Candidate
4C-1 (Time: 10:20 - 10:45) (In-person)
TitleMixed-Criticality with Integer Multiple WCETs and Dropping Relations: New Scheduling Challenges
Author*Federico Reghenzani, William Fornaciari (Politecnico di Milano, Italy)
Pagepp. 320 - 325
Keywordmixed-criticality, scheduling, real-time, fault-tolerance
AbstractScheduling Mixed-Criticality (MC) workloads is a challenging problem in real-time computing. Earliest Deadline First Virtual Deadline (EDF-VD) is one of the most famous scheduling algorithms with optimal speedup bound properties. However, when EDF-VD is used to schedule task sets using a model with additional or relaxed constraints, its scheduling properties change. Inspired by an application of MC to the scheduling of fault-tolerant tasks, in this article, we propose two models for multiple criticality levels: the first is a specialization of the MC model, and the second is a generalization of it. We then show, via formal proofs and numerical simulations, that the former considerably improves the speedup bound of EDF-VD. Finally, we provide the proofs related to the optimality of the two models, identifying the need for new scheduling algorithms.

4C-2 (Time: 10:45 - 11:10) (In-person)
TitleAn Exact Schedulability Analysis for Global Fixed-Priority Scheduling of the AER Task Model
Author*Thilanka Thilakasiri, Matthias Becker (KTH Royal Institute of Technology, Sweden)
Pagepp. 326 - 332
KeywordReal-Time Systems, AER-model, 3-phase task model, multi-core, global scheduling
AbstractCommercial off-the-shelf (COTS) multi-core platforms offer high performance and large availability of processing resources. Increased contention when accessing shared resources is a result of the high parallelism and one of the main challenges when real-time applications are deployed to these platforms. As a result, several execution models have been proposed to avoid contention by separating access to shared resources from execution. In this work, we consider the Acquisition-Execution-Restitution (AER) model where contention to shared resources is avoided by design. We propose an exact schedulability test for the AER model under global fixed-priority scheduling using timed automata where we describe the schedulability problem as a reachability problem. To the best of our knowledge, this is the first exact schedulability test for the AER model under global fixed-priority scheduling on multiprocessor platforms. The performance of the proposed approach is evaluated using synthetic experiments and provides up to 65% more schedulable task sets than the state-of-the-art.

4C-3 (Time: 11:10 - 11:35) (In-person)
TitleSkyrmion Vault: Maximizing Skyrmion Lifespan for Enabling Low-Power Skyrmion Racetrack Memory
AuthorSyue-Wei Lu (National Tsing Hua University, Taiwan), *Shuo-Han Chen (National Taipei University of Technology, Taiwan), Yu-Pei Liang (National Chung Cheng University, Taiwan), Yuan-Hao Chang (Academia Sinica, Taiwan), Kang Wang (Beihang University, China), Tseng-Yi Chen (National Central University, Taiwan), Wei-Kuan Shih (National Tsing Hua University, Taiwan)
Pagepp. 333 - 338
Keywordskyrmion racetrack memory, SK-RM, lifespan
AbstractSkyrmion racetrack memory (SK-RM) has demonstrated great potential as a high-density and low-cost nonvolatile memory. Nevertheless, even though random data accesses are supported on SK-RM, data accesses cannot be carried out on individual data bits directly. Instead, special skyrmion manipulations, such as injecting and shifting, are required to support random information update and deletion. With such special manipulations, the latency and energy consumption of skyrmion manipulations could quickly accumulate and induce additional overhead on the data read/write path of SK-RM. Meanwhile, the injection operation consumes more energy and has higher latency than any other manipulation. Although prior work has tried to alleviate the overhead of skyrmion manipulations, the possibility of minimizing injections through buffering skyrmions for future reuse and energy conservation has received much less attention. This observation motivates us to propose the concept of a skyrmion vault to effectively utilize the skyrmion buffer track structure for energy conservation through maximizing the lifespan of injected skyrmions and minimizing the number of skyrmion injections. Experimental results have shown promising improvements in both energy consumption and skyrmions' lifespan.

Session 4D  Efficient Circuit Simulation and Synthesis for Analog Designs
Time: 10:20 - 11:35, Wednesday, January 18, 2023
Location: Room Mars/Mercury
Chairs: Markus Olbrich (Leibniz University Hannover, Germany), Chien-Nan Jimmy Liu (National Yang Ming Chiao Tung University, Taiwan)

4D-1 (Time: 10:20 - 10:45) (Online)
TitleParallel Incomplete LU Factorization Based Iterative Solver for Fixed-Structure Linear Equations in Circuit Simulation
Author*Lingjie Li, Zhiqiang Liu, Kan Liu, Shan Shen, Wenjian Yu (Tsinghua University, China)
Pagepp. 339 - 345
Keywordcircuit simulation, incomplete LU factorization, iterative equation solver, parallel computing
AbstractA series of fixed-structure sparse linear equations are solved in a circuit simulation process. We propose a parallel incomplete LU (ILU) preconditioned GMRES solver for those equations. A new subtree-based scheduling algorithm for ILU factorization and forward/backward substitution is adopted to overcome the load-balancing and data locality problems of the conventional levelization-based scheduling. Experimental results show that the proposed scheduling algorithm can achieve up to 2.6X speedup for ILU factorization and 3.1X speedup for forward/backward substitution compared to the levelization-based scheduling. The proposed ILU-GMRES solver achieves around 4X parallel speedup with 8 threads, which is up to 2.1X faster than that based on the levelization-based scheme. The proposed parallel solver also shows a remarkable advantage over existing methods (including HSPICE) on transient simulation of linear and nonlinear circuits.

4D-2 (Time: 10:45 - 11:10) (Online)
TitleAccelerated Capacitance Simulation of 3-D Structures With Considerable Amounts of General Floating Metals
Author*Jiechen Huang, Wenjian Yu, Mingye Song, Ming Yang (Tsinghua University, China)
Pagepp. 346 - 351
KeywordCapacitance Simulation, Floating Metal, Floating Random Walk (FRW) Method, Network Reduction
AbstractFloating metals are special conductors introduced into conductor structures by design for manufacturing (DFM). They make accurate capacitance simulation difficult. In this work, we aim to accelerate the floating random walk (FRW) based capacitance simulation for structures with considerable amounts of general floating metals. We first discuss how the existing modified FRW is affected by the integral surfaces of floating metals and propose an improved placement of the integral surface. Then, we propose a hybrid approach called incomplete network reduction to avoid random transitions trapped by floating metals. Experiments on structures from IC and FPD design, which involve multiple floating metals and single or multiple master conductors, have shown the effectiveness of the proposed techniques. The proposed techniques reduce the computational time of capacitance calculation while preserving the accuracy.

4D-3 (Time: 11:10 - 11:35) (In-person)
TitleOn Automating Finger-Cap Array Synthesis with Optimal Parasitic Matching for Custom SAR ADC
Author*Cheng-Yu Chiang, Chia-Lin Hu, Mark Po-Hung Lin, Yu-Szu Chung, Shyh-Jye Jou, Jieh-Tsorng Wu (Inst. of Electronics, National Yang Ming Chiao Tung University, Taiwan), Shiuh-hua Wood Chiang (Department of Electrical and Computer Engineering, Brigham Young University, USA), Chien-Nan Jimmy Liu, Hung-Ming Chen (Inst. of Electronics, National Yang Ming Chiao Tung University, Taiwan)
Pagepp. 352 - 357
Keywordparasitic effect, capacitance matching, linear programming, analog-to-digital converter, analog layout synthesis
AbstractIn-memory computing (IMC) techniques have been increasingly used in Artificial Intelligence (AI) edge Deep Learning Neural Network (DNN) hardware design to increase the flexibility of each layer and performance. Due to its excellent power efficiency, the successive-approximation register (SAR) analog-to-digital converter (ADC) is an attractive design choice for low-power ADC implementations, and the ADC is one of the most critical elements in IMC design. In the physical layout of IMC architecture designs, the parasitics induced by interconnecting wires and elements enormously affect the accuracy and performance of the device. Moreover, the Finger-Cap-Array, a new binary-weighted capacitor structure consisting of sequential metal-metal capacitor units, has been widely adopted in SAR ADCs to meet low-power and high-speed requirements. With such small capacitors, the accuracy and performance are further significantly affected by the parasitic capacitance from interconnecting wires and by the process gradient. This paper first presents a framework to synthesize good-quality binary-weighted capacitors for custom SAR ADCs. Then we propose a novel routing method based on a parasitic-aware ILP-based weight-dynamic network, which is the first work that generates an optimal layout considering parasitic capacitance and capacitance ratio mismatch simultaneously for the Finger-Cap-Array. Experimental results demonstrate the effectiveness and robustness of our proposed algorithm.

Session 5A  (SS-2) Security of Heterogeneous Systems Containing FPGAs
Time: 13:00 - 14:40, Wednesday, January 18, 2023
Location: Room Saturn
Chairs: Jonas Krautter (Karlsruhe Institute of Technology, Germany), Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)

5A-1 (Time: 13:00 - 13:25)
Title(Invited Paper) Remote Power Attacks on ML Accelerators in Multi-Tenant FPGAs
AuthorJakub Szefer (Yale University, USA)
AbstractThe recent introduction of FPGAs into public cloud datacenters gives users the ability to request reconfigurable hardware resources quickly, flexibly, and on demand. However, as public cloud providers make FPGAs available to many, potentially mutually untrusting users, the security of these FPGA deployments needs to be analyzed, and defenses developed. In particular, many of the cloud-based FPGAs may be used for accelerating Machine Learning tasks that process sensitive inputs, and where the architectures of the Machine Learning algorithms themselves may be secret or sensitive. These inputs and architecture details may be vulnerable to information leaks. In this talk we will show how, in a multi-tenant FPGA setting, it is possible for adversaries to steal machine learning inputs or architecture details through voltage-based channels. This talk will cover in particular our recent work on analyzing the security of an off-the-shelf, general-purpose neural network accelerator realized on commercial cloud-based FPGAs. The objective of the talk is to motivate more research into security, and especially defenses, for FPGA-accelerated cloud computing given the threats we have uncovered.

5A-2 (Time: 13:25 - 13:50) (In-person)
Title(Invited Paper) FPGANeedle: Precise Remote Fault Attacks from FPGA to CPU
Author*Mathieu Gross (Technical University of Munich, Germany), Jonas Krautter, Dennis Gnad (Karlsruhe Institute of Technology, Germany), Michael Gruber, Georg Sigl (Technical University of Munich, Germany), Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 358 - 364
KeywordFPGA-SoC, on-chip fault attack, voltage drop, differential fault attack
AbstractFPGAs as general-purpose accelerators can greatly improve system efficiency and performance in cloud and edge devices alike. However, they have recently become the focus of remote attacks, such as fault and side-channel attacks from one user of a part of the FPGA fabric to another. In this work, we consider system-on-chip platforms, where an FPGA and an embedded processor core are located on the same die. We show that the embedded processor core is vulnerable to voltage drops generated by the FPGA logic. Our experiments demonstrate the possibility of compromising the data transfer from external DDR memory to the processor cache hierarchy. Furthermore, we were also able to fault and skip instructions executed on an ARM Cortex-A9 core. The FPGA-based fault injection is shown to be precise enough to recover the secret key of an AES T-tables implementation found in the mbedTLS library.

5A-3 (Time: 13:50 - 14:15) (In-person)
Title(Invited Paper) FPGA Based Countermeasures against Side channel Attacks on Block Ciphers
AuthorDarshana Jayasinghe, Brian Udugama, *Sri Parameswaran (University of New South Wales, Australia)
Pagepp. 365 - 371
KeywordSide-channel analysis, power analysis attacks, countermeasures, fault injection attacks, remote power analysis attacks
AbstractField Programmable Gate Arrays (FPGAs) are increasingly ubiquitous. FPGAs enable hardware acceleration and reconfigurability. Any security breach or attack on critical computations occurring on an FPGA can lead to devastating consequences. Side-channel attacks have the ability to reveal secret information, such as secret keys from cryptographic circuits running on FPGAs. Power dissipation (PA), Electromagnetic (EM) radiation, fault injection (FI) and remote power dissipation (RPA) attacks are the most compelling and non-invasive side-channel attacks demonstrated on FPGAs. This paper discusses two PA attack countermeasures (QuadSeal and RFTC) and one RPA attack countermeasure (UCloD) in detail to protect FPGAs.

Session 5B  Novel Application & Architecture-Specific Quantization Techniques
Time: 13:00 - 14:40, Wednesday, January 18, 2023
Location: Room Uranus
Chair: Can Li (Hong Kong University, Hong Kong)

5B-1 (Time: 13:00 - 13:25) (Online)
TitleBlock-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics
Author*Ruoyang Liu, Chenhan Wei, Yixiong Yang, Wenxun Wang, Huazhong Yang, Yongpan Liu (Tsinghua University, China)
Pagepp. 372 - 377
Keywordfully-quantized network training, mixed-precision quantization, neural network training acceleration
AbstractData quantization is an effective method to accelerate neural network training and reduce power consumption. However, it is challenging to perform low-bit quantized training: the conventional equal-precision quantization leads to either high accuracy loss or limited bit-width reduction, while existing mixed-precision methods offer high compression potential but fail to perform accurate and efficient bit-width assignment. In this work, we propose DYNASTY, a block-wise dynamic-precision neural network training framework. DYNASTY provides accurate data sensitivity information through fast online analytics and maintains stable training convergence with an adaptive bit-width map generator. Network training experiments on the CIFAR-100 and ImageNet datasets are carried out, and compared to the 8-bit quantization baseline, DYNASTY brings up to 5.1x speedup and 4.7x energy consumption reduction with no accuracy drop and negligible hardware overhead.

5B-2 (Time: 13:25 - 13:50) (Online)
TitleQuantization Through Search: A Novel Scheme to Quantize Convolutional Neural Networks in Finite Weight Space
Author*Qing Lu (University of Notre Dame, USA), Weiwen Jiang (George Mason University, USA), Xiaowei Xu (Guangdong Provincial People's Hospital, China), Jingtong Hu (University of Pittsburgh, USA), Yiyu Shi (University of Notre Dame, USA)
Pagepp. 378 - 383
KeywordConvolutional Neural Network, Quantization, Weight search
AbstractQuantization has become an essential technique in compressing deep neural networks for deployment onto resource-constrained hardware. We observe that the hardware efficiency of implementing quantized networks is highly coupled with the actual values to be quantized into, and therefore, with given bit widths, we can smartly choose a value space to further boost the hardware efficiency. For example, using weights of only integer powers of two, multiplication can be fulfilled by bit operations. Under such circumstances, however, existing quantization-aware training methods are either not suitable to apply or unable to unleash the expressiveness of very low bit-widths. For the best hardware efficiency, we revisit the quantization of convolutional neural networks and propose to address the training process from a weight-searching angle, as opposed to optimizing the quantizer functions as in existing works. Extensive experiments on CIFAR10 and ImageNet classification tasks are conducted with implementations on well-established CNN architectures, such as ResNet, VGG, and MobileNet. It is shown that the proposed method can achieve a lower accuracy loss than the state of the art and/or improve implementation efficiency by using hardware-friendly weight values at the same time.
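
The power-of-two example in this abstract can be illustrated with a minimal sketch. It assumes integer activations and weights of the form +/- 2^k with a non-negative exponent k; the function names `mul_pow2` and `dot_pow2` and the (sign, exponent) weight encoding are hypothetical and are not taken from the paper.

```python
def mul_pow2(x, sign, exponent):
    """Multiply integer x by the weight sign * 2**exponent using a shift.

    Hypothetical encoding: sign is +1 or -1, exponent is a non-negative int.
    """
    if sign not in (1, -1):
        raise ValueError("sign must be +1 or -1")
    if exponent < 0:
        raise ValueError("exponent must be non-negative in this sketch")
    shifted = x << exponent  # same as x * 2**exponent for Python integers
    return shifted if sign == 1 else -shifted


def dot_pow2(xs, weights):
    """Dot product in which every weight is a (sign, exponent) pair."""
    return sum(mul_pow2(x, s, k) for x, (s, k) in zip(xs, weights))


# Weights +1, -2, +4 encoded as (sign, exponent) pairs.
result = dot_pow2([1, 2, 3], [(1, 0), (-1, 1), (1, 2)])
```

Restricting weights to this value space replaces each multiplier with a shifter and an optional negation, which is the kind of hardware-friendly choice the abstract refers to.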

5B-3 (Time: 13:50 - 14:15) (Online)
TitleMulti-Wavelength Parallel Training and Quantization-Aware Tuning for WDM-Based Optical Convolutional Neural Networks Considering Wavelength-Relative Deviations
Author*Ying Zhu (State Key Laboratory of Optical Communication Technologies and Networks, China Information Communication Technologies Group Corporation (CICT), China), Min Liu, Lu Xu, Lei Wang (National Information Optoelectronics Innovation Center and China Information Communication Technologies Group Corporation (CICT), China), Xi Xiao, Shaohua Yu (State Key Laboratory of Optical Communication Technologies and Networks, China Information Communication Technologies Group Corporation (CICT), China)
Pagepp. 384 - 389
Keywordconvolutional neural network accelerator, WDM, device variations, quantization errors
AbstractWavelength Division Multiplexing (WDM)-based Mach-Zehnder Interferometer Optical Convolutional Neural Networks (MZI-OCNNs) have emerged as a promising platform to accelerate convolutions, which consume most of the computing resources in neural networks. However, the wavelength-relative imperfect split ratios and actual phase shifts in MZIs and quantization errors from the electronic configuration module will degrade the inference accuracy of WDM-based MZI-OCNNs and thus render them unusable in practice. In this paper, we propose a framework that models the split ratios and phase shifts under different wavelengths, incorporates them into OCNN training, and introduces quantization-aware tuning to maintain inference accuracy and reduce electronic module complexity. Consequently, the framework can improve the inference accuracy by 49%, 76%, and 76%, respectively, for LeNet5, VGG7, and VGG8 implemented with multi-wavelength parallel computing. Moreover, instead of using Float 32/64 quantization resolutions, only 5, 6, and 4 bits are needed, and fewer quantization levels are utilized for configuration signals.

5B-4 (Time: 14:15 - 14:40) (Online)
TitleSemantic Guided Fine-grained Point Cloud Quantization Framework for 3D Object Detection
Author*Xiaoyu Feng, Chen Tang, Zongkai Zhang, Wenyu Sun, Yongpan Liu (Tsinghua University, China)
Pagepp. 390 - 395
Keywordadaptive quantization, semantic-guided, block-wise, point cloud, 3D object detection
AbstractUnlike the grid-paced RGB images, network compression, i.e., pruning and quantization, for the irregular and sparse 3D point cloud faces more challenges. Traditional quantization ignores the unbalanced semantic distribution in the 3D point cloud. In this work, we propose a semantic-guided adaptive quantization framework for 3D point clouds. Different from traditional quantization methods that adopt a static and uniform quantization scheme, our proposed framework can adaptively locate the semantic-rich foreground points in the feature maps to allocate a higher bitwidth for these "important" points. Since the foreground points are in a low proportion in the sparse 3D point cloud, such adaptive quantization can achieve higher accuracy than uniform compression under a similar compression rate. Furthermore, we adopt a block-wise fine-grained compression scheme in the proposed framework to fit the larger dynamic range in the point cloud. Moreover, a 3D point cloud based software and hardware co-evaluation process is proposed to evaluate the effectiveness of the proposed adaptive quantization in actual hardware devices. Based on the nuScenes dataset, we achieve 12.52% precision improvement under average 2-bit quantization. Compared with 8-bit quantization, we can achieve 3.11x energy efficiency based on co-evaluation results.

Session 5C  Approximate Brain-Inspired Architectures for Efficient Learning
Time: 13:00 - 14:40, Wednesday, January 18, 2023
Location: Room Venus
Chairs: Xun Jiao (Villanova University, USA), Hussam Amrouch (University of Stuttgart, Germany)

5C-1 (Time: 13:00 - 13:25) (In-person)
TitleReMeCo: Reliable Memristor-Based In-Memory Neuromorphic Computation
Author*Ali BanaGozar (Eindhoven University of Technology, Netherlands), Seyed Hossein Hashemi Shadmehri (University of Tehran, Iran), Sander Stuijk (Eindhoven University of Technology, Netherlands), Mehdi Kamal (University of Southern California, USA), Ali Afzali-Kusha (University of Tehran, Iran), Henk Corporaal (Eindhoven University of Technology, Netherlands)
Pagepp. 396 - 401
KeywordReliability, Neural Networks, Computation In-Memory, Neuromorphic, Redundancy
AbstractMemristor-based in-memory neuromorphic computing systems promise a highly efficient implementation of vector-matrix multiplications, commonly used in artificial neural networks (ANNs). However, the immature fabrication process of memristors and circuit-level limitations, i.e., stuck-at-fault (SAF), IR-drop, and device-to-device (D2D) variation, degrade the reliability of these platforms and thus impede their wide deployment. In this paper, we present ReMeCo, a redundancy-based reliability improvement framework. It addresses the non-idealities while constraining the induced overhead. It achieves this by performing a sensitivity analysis on the ANN. With the acquired insight, ReMeCo avoids the redundant calculation of the least sensitive neurons and layers. ReMeCo uses a heuristic approach to find the balance between recovered accuracy and imposed overhead. ReMeCo further decreases hardware redundancy by exploiting the bit-slicing technique. In addition, the framework employs the ensemble averaging method at the output of every ANN layer to incorporate the redundant neurons. The efficacy of ReMeCo is assessed using two well-known ANN models, i.e., LeNet and AlexNet, running the MNIST and CIFAR10 datasets. Our results show 98.5% accuracy recovery with roughly 4% redundancy, which is more than 20X lower than the state-of-the-art.

5C-2 (Time: 13:25 - 13:50) (In-person)
TitleSyFAxO-GeN: Synthesizing FPGA-based Approximate Operators with Generative Networks
AuthorRohit Ranjan, *Salim Ullah, Siva Satyendra Sahoo, Akash Kumar (TU Dresden, Germany)
Pagepp. 402 - 409
KeywordApproximate Computing, Computer Arithmetic, AI/ML for EDA
AbstractWith rising trends of moving AI inference to the edge, due to communication and privacy challenges, there has been a growing focus on designing low-cost Edge-AI. Given the diversity of application areas at the edge, FPGA-based systems are increasingly used to provide high-performance inference. Similarly, approximate computing has emerged as a viable approach to achieving disproportionate resource gains by leveraging the applications' inherent robustness. However, most related research has focused on selecting the appropriate approximate operators for an application from a set of ASIC-based designs. This approach fails to leverage the FPGA's architecture benefits and limits the scope of approximation to already existing generic designs. To this end, we propose an AI-based approach to synthesizing novel approximate operators optimized for FPGA's Look-up-table-based architecture. Specifically, we use state-of-the-art generative networks to search for constraint-aware arithmetic operators designs optimized for FPGA-based implementation.

5C-3 (Time: 13:50 - 14:15) (In-person)
TitleApproximating HW Accelerators through Partial Extractions onto Shared Artificial Neural Networks
AuthorPrattay Chowdhury (The University of Texas at Dallas, USA), Jorge Castro Godínez (Costa Rica Institute of Technology, Costa Rica), *Benjamin Carrion Schaefer (The University of Texas at Dallas, USA)
Pagepp. 410 - 415
KeywordApproximate Computing, ANN, High-Level Synthesis
AbstractIn this work we propose a fully automatic method that substitutes portions of a hardware accelerator specified in C/C++/SystemC for High-Level Synthesis (HLS) with an Artificial Neural Network (ANN). ANNs have many advantages that make them well suited for this. First, they are very scalable, which allows multiple separate portions of the behavioral description to be approximated simultaneously on them. Second, multiple ANNs can be fused together and reoptimized to further reduce the power consumption. We use this to share the ANN to approximate multiple different HW accelerators in the same SoC. Experimental results with different error thresholds show that our proposed approach leads to better results than the state of the art.

5C-4 (Time: 14:15 - 14:40) (In-person)
TitleDependableHD: A Hyperdimensional Learning Framework for Edge-oriented Voltage-scaled Circuits
Author*Dehua Liang (Osaka University, Japan), Hiromitsu Awano (Kyoto University, Japan), Noriyuki Miura, Jun Shiomi (Osaka University, Japan)
Pagepp. 416 - 422
KeywordHyperdimensional Computing, Voltage Scaling, Memory Failure
AbstractVoltage scaling is one of the most promising approaches for energy efficiency improvement but also brings challenges to fully guaranteeing the stable operation in modern VLSI. To tackle such issues, we propose DependableHD, a learning framework based on HyperDimensional Computing (HDC), which supports the systems to tolerate bit-level memory failure in the low voltage region with high robustness. For the first time, DependableHD introduces the concept of margin enhancement for model retraining and utilizes noise injection to improve the robustness, which is capable of application in most state-of-the-art HDC algorithms. Our experiment shows that under 10% memory error, DependableHD exhibits a 1.22% accuracy loss on average, which achieves an 11.2× improvement compared to the baseline HDC solution. The hardware evaluation shows that DependableHD supports the systems to reduce the supply voltage from 400mV to 300mV, which provides a 50.41% energy consumption reduction while maintaining competitive accuracy performance.
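
Illustrative sketch (not taken from the paper): the abstract evaluates classification under bit-level memory errors. The toy example below assumes binary hypervectors, Hamming-distance classification, and a hypothetical dimensionality of 2048. It only demonstrates the baseline error tolerance that HDC offers when stored class hypervectors are corrupted; it does not implement DependableHD's margin enhancement or retraining.

```python
import random

DIM = 2048  # hypervector dimensionality (assumed for this sketch)

def random_hv(rng):
    """Draw a random binary hypervector."""
    return [rng.choice((0, 1)) for _ in range(DIM)]

def flip_bits(hv, error_rate, rng):
    """Model bit-level memory failure: flip each bit independently with probability error_rate."""
    return [b ^ 1 if rng.random() < error_rate else b for b in hv]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def classify(query, class_hvs):
    """Return the index of the class hypervector closest to the query."""
    return min(range(len(class_hvs)), key=lambda c: hamming(query, class_hvs[c]))

rng = random.Random(0)
class_hvs = [random_hv(rng) for _ in range(4)]
# Stored model read back from an unreliable memory: 10% of the bits are flipped.
corrupted = [flip_bits(hv, 0.10, rng) for hv in class_hvs]

trials = 100
correct = 0
for t in range(trials):
    label = t % 4
    # Each query is a noisy version of its class hypervector (20% of bits differ).
    query = flip_bits(class_hvs[label], 0.20, rng)
    correct += classify(query, corrupted) == label
accuracy = correct / trials
print(accuracy)
```

Real datasets have far less separation between classes than random hypervectors, which is where retraining with injected noise would matter.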

Session 5D  Retrospect and Prospect of Verification and Test Technologies
Time: 13:00 - 14:40, Wednesday, January 18, 2023
Location: Room Mars/Mercury
Chairs: Michihiro Shintani (Kyoto Institute of Technology), Renyuan Zhang (Nara Institute of Science and Technology)

5D-1 (Time: 13:00 - 13:25) (In-person)
TitleEDDY: A Multi-Core BDD Package With Dynamic Memory Management and Reduced Fragmentation
Author*Rune Krauss, Mehran Goli, Rolf Drechsler (University of Bremen, Germany)
Pagepp. 423 - 428
KeywordBoolean functions, binary decision diagrams, model checking, memory management, parallel computing
AbstractIn recent years, hardware systems have significantly grown in complexity. Due to the increasing complexity, there is a need to continuously improve the quality of the hardware design process. This leads designers to strive for more efficient data structures and algorithms operating on them to guarantee the correct behavior of such systems through verification techniques like model checking and meet time-to-market constraints. A Binary Decision Diagram (BDD) is a suitable data structure as it provides a canonical compact representation of Boolean functions, given a variable ordering, and efficient algorithms for manipulating them. However, reduced ordered BDDs also have challenges: There is a large memory consumption for the BDD construction of some complex practical functions and the use of realizations in the form of BDD packages strongly depends on the application. To address these issues, this paper presents a novel multi-core package called Engineer Decision Diagrams Yourself (EDDY) with dynamic memory management and reduced fragmentation. Experiments on BDD benchmarks of both combinational circuits and model checking show that using EDDY leads to a significant performance boost compared to state-of-the-art packages.

5D-2 (Time: 13:25 - 13:50) (In-person)
TitleExploiting Reversible Computing for Verification: Potential, Possible Paths, and Consequences
AuthorLukas Burgholzer (Johannes Kepler University Linz, Austria), *Robert Wille (Technical University of Munich, Germany)
Pagepp. 429 - 435
Keywordreversible computing, verification
AbstractToday, the verification of classical circuits poses a severe challenge for the design of circuits and systems. While the underlying (exponential) complexity is tackled in various fashions (simulation-based approaches, emulation, formal equivalence checking, fuzzing, model checking, etc.), no “silver bullet” has been found yet which allows to escape the growing verification gap. In this work, we entertain and investigate the idea of a complementary approach which aims at exploiting reversible computing. More precisely, we show the potential of the reversible computing paradigm for verification, debunk misleading paths that do not allow to exploit this potential, and discuss the resulting consequences for the development of future, complementary design and verification flows. An extensive empirical study (involving more than 30 million simulations) confirms these findings. Although this work cannot provide a fully-fledged realization yet, it certainly provides the basis for an alternative path towards overcoming the verification gap.

Best Paper Candidate
5D-3 (Time: 13:50 - 14:15) (Online)
TitleAutomatic Test Pattern Generation and Compaction for Deep Neural Networks
Author*Dina A. Moussa, Michael Hefenbrock, Christopher Münch, Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 436 - 441
KeywordDeep Neural Networks, Fault injection, Functional Faults, Test pattern generation, Test compaction
AbstractDeep Neural Networks (DNNs) have gained considerable attention lately due to their excellent performance on a wide range of recognition and classification tasks. Accordingly, fault detection in DNNs and their implementations plays a crucial role in the quality of DNN implementations to ensure that their post-mapping and in-field accuracy match the model accuracy. This paper proposes a functional-level automatic test pattern generation approach for DNNs. This is done by generating inputs which cause misclassification of the output class label in the presence of single or multiple faults. Furthermore, to obtain a smaller set of test patterns with full coverage, a heuristic algorithm as well as a test pattern clustering method using K-means were implemented. The experimental results showed that the proposed test patterns achieved the highest label misclassification and a high output deviation compared to state-of-the-art approaches.

5D-4 (Time: 14:15 - 14:40) (In-person)
TitleWafer-Level Characteristic Variation Modeling Considering Systematic Discontinuous Effects
Author*Takuma Nagao (Nara Institute of Science and Technology, Japan), Tomoki Nakamura, Masuo Kajiyama, Makoto Eiki (Sony Semiconductor Manufacturing Corporation, Japan), Michiko Inoue (Nara Institute of Science and Technology, Japan), Michihiro Shintani (Kyoto Institute of Technology, Japan)
Pagepp. 442 - 448
KeywordWafer-level spatial characteristic modeling, Process variation, Gaussian process regression
AbstractStatistical wafer-level variation modeling is an attractive method for reducing the measurement cost in large-scale integrated circuit (LSI) testing while maintaining the test quality. In this method, the performance of unmeasured LSI circuits manufactured on a wafer is statistically predicted from a few measured LSI circuits. Conventional statistical methods model spatially smooth variations in wafer. However, actual wafers may have discontinuous variations that are systematically caused by the manufacturing environments, such as shot dependence. In this study, we propose a modeling method that considers discontinuous variations in wafer characteristics by applying the knowledge of manufacturing engineers to a model estimated using Gaussian process regression. In the proposed method, the process variation is decomposed into the systematic discontinuous and global components to improve the estimation accuracy. An evaluation performed using an industrial production test dataset shows that the proposed method reduces the estimation error for an entire wafer by over 33% compared to conventional methods.

Session 5E  DF Keynote / (DF-1) Next-Generation Computing
Time: 13:00 - 14:40, Wednesday, January 18, 2023
Location: Miraikan Hall
Organizer: Chihiro Yoshimura (Hitachi, Ltd., Japan), Chair: Takatsugu Ono (Kyushu University, Japan)

5E-1 (Time: 13:00 - 13:25)
Title(Designers' Forum) DF Keynote: The Impact of AI in Intelligent System Design
AuthorSimon Chang (Cadence)
AbstractElectronics design is undergoing a revolution as semiconductors are used in more and more market applications. Each has its unique data and workload and requires customized compute and analytics architectures. Advanced semiconductors are implemented in the latest process nodes, in the most complex 3D-ICs, to achieve top performance with more operational flexibility. When the scope is expanded to the full system, complexity further exceeds the traditional siloed engineering teams and methodology. AI is showing promise for addressing the growing complexity, finding optimal design outcomes, and substantially improving overall team productivity. But not all problems are equal. Which are the intelligent system design challenges that AI is best suited for? What impact should be expected from applying AI to these challenges? And what is the frontier of AI solutions for intelligent system design?

5E-2 (Time: 13:25 - 13:45)
Title(Designers' Forum) General-Purpose Scalar/Vector Processor for Accelerating Wide Range of Tasks Including Automotive and Industrial Applications
AuthorMasayuki Ito (NSITEXE Inc., Japan)
AbstractEfficient execution of AI and other computing on edge devices has become an important issue in embedded systems due to strict heat and cost constraints. In this talk, a solution and techniques being explored to efficiently accelerate the computation of a wide range of contributions to embedded systems will be covered. More specifically, scalable solutions for each application by combining versatile MIMD-based processors with vector units will be introduced. Both task-level parallelism and data-level parallelism are fulfilled by the processors and the vector units, and their tight coordination is key to efficient execution. Software solutions including model deployment and toolchains will also be discussed.

5E-3 (Time: 13:45 - 14:05)
Title(Designers' Forum) 16x16 Photonic Analog Vector Matrix Multipliers Based on Silicon Photonics
AuthorShota Kita (NTT Basic Research Laboratories, Japan)
AbstractOwing to recent advances in machine learning and silicon photonics, on-chip photonic processing has been recognized as promising for energy-saving, low latency linear analog matrix operations. In this talk, we show our recent progress on a 16x16 on-chip vector matrix processor with complex-valued inputs based on silicon photonics as a central component of the processor architecture. Furthermore, we developed two calibration schemes based on machine learning techniques, one is specialized for a single task, and the other is for general matrix implementation with a fidelity of ~0.859. We show some initial results of MNIST database classification task as the benchmark.

5E-4 (Time: 14:05 - 14:25)
Title(Designers' Forum) Research Activities toward Larger-Scale Cryogenic Quantum Computer Systems
AuthorTeruo Tanimoto (Kyushu University, Japan)
AbstractQuantum computing (QC) attracts interest in both academia and industry. One of the unique characteristics of R&D of QC is that hardware, software, and system integration are concurrently surveyed. Intensive collaborations across research fields are crucial to realize QC. Since some attractive quantum devices require a cryogenic environment, power-efficient cryogenic classical information processing is necessary to construct larger-scale QC systems. Kyushu University started the Quantum Computing System Center in February 2022 and we are exploring emerging device technologies such as Single-Flux-Quantum logic circuits and Nanobridge FPGAs. In this talk, I would like to introduce our recent activities.

5E-5 (Time: 14:25 - 14:45)
Title(Designers' Forum) Cryogenic Bias Voltage Control Circuits for Large Scale Qubit Arrays
AuthorTakuji Miki (Kobe University, Japan)
AbstractControlling quantum bits (qubits) from cryogenic temperature inside a dilution refrigerator is a key challenge towards large scale quantum computers. Cryogenic CMOS analog circuits, embedded in such controllers for qubit manipulation and readout, require an extremely small area and low power consumption to accommodate parallel control of qubit arrays with limited power budget for heat suppression. This presentation introduces a design strategy of a cryogenic digital to analog converter (DAC) for biasing silicon spin qubits. The bias DAC achieves both compact layout and low power consumption while keeping high linearity by effectively utilizing inherent circuit characteristics at cryogenic temperature.

Session 6A  (SS-3) Computing, Erasing, and Protecting: the Security Challenges for the Next Generation of Memories
Time: 15:00 - 16:15, Wednesday, January 18, 2023
Location: Room Saturn
Chairs: Francesco Regazzoni (University of Amsterdam and Università della Svizzera italiana, Netherlands), Robert Wille (Technical University of Munich, Germany)

6A-1 (Time: 15:00 - 15:25) (In-person)
Title(Invited Paper) Hardware Security Primitives using Passive RRAM Crossbar Array: Novel TRNG and PUF Designs
AuthorSimranjeet Singh (Indian Institute of Technology, Bombay, India), *Furqan Zahoor, Gokulnath Rajendran (Nanyang Technological University, Singapore), Sachin Patkar (Indian Institute of Technology, Bombay, India), Anupam Chattopadhyay (Nanyang Technological University, Singapore), Farhad Merchant (RWTH Aachen University, Germany)
Pagepp. 449 - 454
KeywordPUF, TRNG, RRAM, Memristors, Hardware Security
AbstractWith rapid advancements in electronic gadgets, the security and privacy aspects of these devices are significant. For the design of secure systems, physical unclonable function (PUF) and true random number generator (TRNG) are critical hardware security primitives for security applications. This paper proposes novel implementations of PUF and TRNGs on the RRAM crossbar structure. Firstly, two techniques to implement the TRNG in the RRAM crossbar are presented based on write-back and 50% switching probability pulse. The randomness of the proposed TRNGs is evaluated using the NIST test suite. Next, an architecture to implement the PUF in the RRAM crossbar is presented. The initial entropy source for the PUF is used from TRNGs, and challenge-response pairs (CRPs) are collected. The proposed PUF exploits the device variations and sneak-path current to produce unique CRPs. We demonstrate, through extensive experiments, reliability of 100%, uniqueness of 47.78%, uniformity of 49.79%, and bit-aliasing of 48.57% without any post-processing techniques. Finally, the design is compared with the literature to evaluate its implementation efficiency, which is clearly found to be superior to the state-of-the-art.
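
Illustrative sketch (not taken from the paper): the abstract reports reliability, uniqueness, uniformity, and bit-aliasing figures. The functions below use the definitions commonly found in the PUF literature, based on fractional Hamming distances; whether the paper uses exactly these definitions is an assumption. The response bits are made up for the example.

```python
def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def uniformity(response):
    """Percentage of 1s in one device's response; the ideal value is 50%."""
    return 100.0 * sum(response) / len(response)

def uniqueness(responses):
    """Average pairwise inter-device Hamming distance in percent; the ideal value is 50%."""
    k = len(responses)
    n = len(responses[0])
    total = 0.0
    for i in range(k - 1):
        for j in range(i + 1, k):
            total += hamming_distance(responses[i], responses[j]) / n
    return 100.0 * total * 2 / (k * (k - 1))

def bit_aliasing(responses, bit):
    """Percentage of devices that output 1 at a given bit position; the ideal value is 50%."""
    return 100.0 * sum(r[bit] for r in responses) / len(responses)

def reliability(reference, remeasured):
    """100% minus the average intra-device Hamming distance across repeated reads."""
    n = len(reference)
    avg = sum(hamming_distance(reference, r) / n for r in remeasured) / len(remeasured)
    return 100.0 * (1.0 - avg)

# Made-up 4-bit responses of three devices to the same challenge.
responses = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
print(uniformity(responses[0]))   # 50.0
print(uniqueness(responses))      # 50.0
print(bit_aliasing(responses, 0))
print(reliability([1, 0, 1, 0], [[1, 0, 1, 0], [1, 0, 1, 1]]))  # 87.5
```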

6A-2 (Time: 15:25 - 15:50) (In-person)
Title(Invited Paper) Data Sanitization on eMMCs
Author*Aya Fukami (University of Amsterdam and Netherlands Forensic Institute, Netherlands), Francesco Regazzoni (University of Amsterdam and Università della Svizzera italiana, Netherlands), Zeno Geradts (University of Amsterdam and Netherlands Forensic Institute, Netherlands)
Pagepp. 455 - 460
Keywordsecurity, digital forensics, data sanitization, data recovery
AbstractData sanitization of modern digital devices is an important issue given that electronic wastes are being recycled and repurposed. The embedded Multi Media Card (eMMC), one of the NAND flash memory-based commodity devices, is one of the popularly recycled products in the current recycling ecosystem. We analyze a repurposed device and evaluate its sanitization practice. Data from the formerly used device can still be recovered, which may lead to an unintentional leakage of sensitive data such as personally identifiable information (PII). Since the internal storage of an eMMC is the NAND flash memory, sanitization practice of the NAND flash memory-based systems should apply to the eMMC. However, a proper sanitize operation is obviously not always performed in the current recycling ecosystem. We discuss how data stored in eMMC and other flash memory-based devices need to be deleted in order to avoid the potential data leakage. We also review the NAND flash memory data sanitization schemes and discuss how they should be applied in eMMCs.

6A-3 (Time: 15:50 - 16:15) (In-person)
Title(Invited Paper) Fundamentally Understanding and Solving RowHammer
Author*Onur Mutlu, Ataberk Olgun, Abdullah Giray Yağlıkcı (ETH Zürich, Switzerland)
Pagepp. 461 - 468
KeywordDRAM, Security, Reliability, Safety, Memory
AbstractWe provide an overview of recent developments and future directions in the RowHammer vulnerability that plagues modern DRAM (Dynamic Random Access Memory) chips, which are used in almost all computing systems as main memory. RowHammer is the phenomenon in which repeatedly accessing a row in a real DRAM chip causes bitflips (i.e., data corruption) in physically nearby rows. This phenomenon leads to a serious and widespread system security vulnerability, as many works since the original RowHammer paper in 2014 have shown. Recent analysis of the RowHammer phenomenon reveals that the problem is getting much worse as DRAM technology scaling continues: newer DRAM chips are fundamentally more vulnerable to RowHammer at the device and circuit levels. It has proven difficult to devise fully-secure and very efficient (i.e., low-overhead in performance, energy, area) protection mechanisms against RowHammer. After reviewing recent developments in exploiting, understanding, and mitigating RowHammer, we discuss future directions that we believe are critical for solving the RowHammer problem. We argue for two major directions to amplify research and development efforts in: 1) building a much deeper understanding of the problem and its many dimensions, in both cutting-edge DRAM chips and computing systems deployed in the field, and 2) the design and development of extremely efficient and fully-secure solutions via system-memory cooperation.

Session 6B  System-Level Codesign in DNN Accelerators
Time: 15:00 - 16:40, Wednesday, January 18, 2023
Location: Room Uranus
Chair: Bei Yu (The Chinese University of Hong Kong)

Best Paper Candidate
6B-1 (Time: 15:00 - 15:25) (In-person)
TitleHardware-Software Codesign of DNN Accelerators using Approximate Posit Multipliers
AuthorTom Glint, *Kailash Prasad, Jinay Dagli (IIT Gandhinagar, India), Krishil Gandhi (SVNIT, India), Aryan Gupta, Vrajesh Patel, Neel Shah, Joycee Mekie (IIT Gandhinagar, India)
Pagepp. 469 - 474
Keywordco-design, DNN accelerators, neural networks
AbstractModern AI/ML workloads are hit by memory and power walls when run on general-purpose compute cores. Among the several solutions proposed to tackle this concern, DNN accelerator architectures have found a prominent place. In this work, we propose a hardware-software co-design approach to design a highly optimized DNN accelerator architecture that is based on a SOTA architecture and uses a data-aware POSIT number system representation to achieve very high quantization. This automatically reduces the buffer/storage requirements within the architecture and reduces the data transfer cost between the main memory and the DNN accelerator. We have investigated the impact of integer, IEEE floating point, and posit multipliers for LeNet, ResNet and VGG NNs trained and tested on MNIST, CIFAR10 and ImageNet datasets, respectively. Based on the analysis conducted, we propose an approximate-fixed-posit multiplier that, when implemented on Simba, achieves ~2.2x speed up, consumes ~3.1x less energy and requires 3.2x less area, respectively, on average without loss of accuracy (+/-1%) against the baseline SOTA architecture.

6B-2 (Time: 15:25 - 15:50) (In-person)
TitleReusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-based DNN Accelerators
Author*Susmita Dey Manasi (University of Minnesota Twin Cities, USA), Suvadeep Banerjee, Abhijit Davare, Anton A. Sorokin, Steven M. Burns, Desmond A. Kirkpatrick (Intel Corporation, USA), Sachin S. Sapatnekar (University of Minnesota Twin Cities, USA)
Pagepp. 475 - 482
KeywordDepthwise convolution, Lightweight CNN, Deep learning accelerator
AbstractDeep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck in executing lightweight CNNs on such platforms. This work reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: 7x speedup and 1.8x lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, and 74x speedup over a CPU, and even 1.4x speedup over a power-hungry GPU.
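
Illustrative sketch (not taken from the paper): the abstract describes mapping depthwise convolution to channel-wise matrix-vector multiplications. The code below only shows the arithmetic equivalence, assuming stride 1 and no padding: each channel's input patches are unrolled into a matrix (an im2col-style step) and multiplied by that channel's flattened kernel. How the paper schedules this onto a GEMM core is not described in the abstract, and all function names here are hypothetical.

```python
def depthwise_conv_direct(x, k):
    """Reference depthwise convolution. x: [C][H][W], k: [C][R][S]; stride 1, no padding."""
    out = []
    for c in range(len(x)):
        H, W = len(x[c]), len(x[c][0])
        R, S = len(k[c]), len(k[c][0])
        out.append([[sum(x[c][i + r][j + s] * k[c][r][s]
                         for r in range(R) for s in range(S))
                     for j in range(W - S + 1)]
                    for i in range(H - R + 1)])
    return out

def im2col(plane, R, S):
    """Unroll every RxS patch of one channel into one row of a matrix."""
    H, W = len(plane), len(plane[0])
    return [[plane[i + r][j + s] for r in range(R) for s in range(S)]
            for i in range(H - R + 1) for j in range(W - S + 1)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def depthwise_conv_matvec(x, k):
    """Same result, computed as one matrix-vector product per channel."""
    out = []
    for c in range(len(x)):
        R, S = len(k[c]), len(k[c][0])
        out_w = len(x[c][0]) - S + 1
        kernel_vec = [k[c][r][s] for r in range(R) for s in range(S)]
        flat = matvec(im2col(x[c], R, S), kernel_vec)
        # Fold the flat result back into rows of the output plane.
        out.append([flat[i:i + out_w] for i in range(0, len(flat), out_w)])
    return out

x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[9, 8, 7], [6, 5, 4], [3, 2, 1]]]
k = [[[1, 0], [0, 1]],
     [[1, 1], [1, 1]]]
print(depthwise_conv_matvec(x, k))
```

Because each channel has its own kernel, the per-channel products are independent of one another, which is what makes channel-wise parallel execution possible.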

6B-3 (Time: 15:50 - 16:15) (In-person)
TitleBARVINN: Arbitrary Precision DNN Accelerator Controlled by a RISC-V CPU
Author*MohammadHossein AskariHemmat (Ecole Polytechnique Montreal, Canada), Sean Wagner (IBM, Canada), Olexa Bilaniuk (MILA, Canada), Yassine Hariri (CMC, Canada), Yvon Savaria, Jean-Pierre David (Ecole Polytechnique Montreal, Canada)
Pagepp. 483 - 489
Keywordneural networks, hardware acceleration, FPGA, low-precision, RISC-V
AbstractWe present a DNN accelerator that allows inference at arbitrary precision with dedicated processing elements that are configurable at the bit level. Our DNN accelerator has 8 Processing Elements controlled by a RISC-V controller with a combined 8.2 TMACs of computational power when implemented with the recent Alveo U250 FPGA platform. We develop a code generator tool that ingests CNN models in ONNX format and generates an executable command stream for the RISC-V controller. We demonstrate the scalable throughput of our accelerator by running different DNN kernels and models when different quantization levels are selected. Compared to other low precision accelerators, our accelerator provides run time programmability without hardware reconfiguration and can accelerate DNNs with multiple quantization levels, regardless of the target FPGA size. BARVINN is an open source project and it is available at https://github.com/hossein1387/BARVINN.

6B-4 (Time: 16:15 - 16:40) (Online)
TitleAgile Hardware and Software Co-design for RISC-V-based Multi-precision Deep Learning Microprocessor
AuthorZicheng He (UCLA/Southern University of Science and Technology, USA), Ao Shen, *Qiufeng Li (Southern University of Science and Technology, China), Quan Cheng (Department of Communications and Computer Engineering, Graduate School of Informatics, Kyoto University, Japan), Hao Yu (Southern University of Science and Technology, China)
Pagepp. 490 - 495
Keywordagile development, hardware/software co-design, multi-precision, deep learning compiler
AbstractRecent network architecture search (NAS) has been widely applied to simplify deep learning neural networks, which typically results in a multi-precision network. Many multi-precision accelerators have been developed as well to support computing multi-precision networks manually. A software-hardware interface is thereby needed to automatically map multi-precision networks onto multi-precision accelerators. In this paper, we have developed an agile hardware and software co-design for a RISC-V-based multi-precision deep learning microprocessor. We have designed custom RISC-V instructions with a framework to automatically compile multi-precision CNN networks onto multi-precision CNN accelerators, demonstrated on FPGA. Experiments show that with NAS optimized multi-precision CNN models (LeNet, VGG16, ResNet, MobileNet), the RISC-V core with multi-precision accelerators can reach the highest throughput in 2,4,8-bit precisions respectively on a Xilinx ZCU102 FPGA.

Session 6C  New Advances in Hardware Trojan Detection
Time: 15:00 - 16:40, Wednesday, January 18, 2023
Location: Room Venus
Chairs: Yongqiang Lyu (Tsinghua University, China), Muhammad Hassan (University of Bremen, Germany)

Best Paper Candidate
6C-1 (Time: 15:00 - 15:25) (In-person)
TitleHardware Trojan Detection Using Shapley Ensemble Boosting
AuthorZhixin Pan, *Prabhat Mishra (University of Florida, USA)
Pagepp. 496 - 503
KeywordHardware Trojan, Machine Learning, Hardware Security, Boosting, Explainable Machine Learning
AbstractDue to the globalized semiconductor supply chain, there is an increasing risk of exposing system-on-chip designs to hardware Trojans (HT). While there are promising machine learning based HT detection techniques, they have three major limitations: ad-hoc feature selection, lack of explainability, and vulnerability towards adversarial attacks. In this paper, we propose a novel HT detection approach using an effective combination of Shapley value analysis and boosting framework. Specifically, this paper makes two important contributions. We use Shapley value (SHAP) to analyze the importance ranking of input features. It not only provides explainable interpretation for HT detection, but also serves as a guideline for feature selection. We utilize boosting (ensemble learning) to generate a sequence of lightweight models that significantly reduces the training time while providing robustness against adversarial attacks. Experimental results demonstrate that our approach can drastically improve both detection accuracy (up to 24.6%) and time efficiency (up to 5.1x) compared to state-of-the-art HT detection techniques.
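
Illustrative sketch (not taken from the paper): the abstract uses Shapley values to rank input features. The function below computes exact Shapley values by enumerating all feature subsets, which is only feasible for a handful of features; practical SHAP tooling relies on approximations. The feature names and the toy value function are hypothetical and stand in for "detection score obtained with this feature subset".

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values; `value` maps a frozenset of features to a score."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (value(s | {f}) - value(s))
        result[f] = phi
    return result

def toy_score(subset):
    """Hypothetical score: two useful features with an interaction, one useless feature."""
    score = 0.0
    if "fan_in" in subset:
        score += 0.4
    if "toggle_rate" in subset:
        score += 0.2
    if "fan_in" in subset and "toggle_rate" in subset:
        score += 0.2  # extra gain only when both features are present
    return score

features = ["fan_in", "toggle_rate", "net_name_length"]
phi = shapley_values(features, toy_score)
print(phi)
```

The interaction gain is split evenly between the two features that produce it, and the values sum to the score of the full feature set, which is what makes the ranking usable as a feature selection guideline.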

6C-2 (Time: 15:25 - 15:50) (Online)
TitleASSURER: A PPA-friendly Security Closure Framework for Physical Design
Author*Guangxin Guo, Hailong You, ZhengGuang Tang, Benzheng Li, Cong Li (Xidian University, China), Xiaojue Zhang (GIGA Design Automation, China)
Pagepp. 504 - 509
KeywordHardware security closure, PPA-friendly, Physical design, Hardware Trojans, Probing attacks
AbstractHardware security is emerging in the very large scale integration (VLSI). The seminal threats, like hardware Trojan insertion, probing attacks, and fault injection, are hard to detect and almost impossible to fix at post-design stage. The optimal solution is to prevent them at the physical design stage. Usually, defending against them may cause a lot of power, performance, and area (PPA) loss. In this paper, we propose a PPA-friendly physical layout security closure framework ASSURER. Reward-directed placement refinement and multi-threshold partition algorithm are proposed to assure Trojan threats are empty. Cleaning up probing attacks is established on a patch-based ECO routing flow. Evaluated on the ISPD’22 benchmarks, ASSURER can clean out the Trojan threat with no leakage power increase when shrinking the physical layout area. When not shrinking, ASSURER only increases 14% total power. Compared with the work of first place in the ISPD2022 Contest, ASSURER reduced 53% additional total power consumption, and probing vulnerability can be reduced by 97.6% under the premise of timing closure. We believe this work shall open up a new perspective for preventing Trojan insertion and probing attacks.

6C-3 (Time: 15:50 - 16:15) (Online)
TitleStatic Probability Analysis Guided RTL Hardware Trojan Test Generation
Author*Haoyi Wang, Qiang Zhou, Yici Cai (Tsinghua University, China)
Pagepp. 510 - 515
KeywordHardware Trojan, Directed Test Generation, Static Probability Analysis, Security Assertion, Register Probability Equation System
AbstractDirected test generation is an effective method to detect potential hardware Trojan (HT) in RTL. While the existing works are able to activate hard-to-cover Trojans by covering security targets, the effectiveness and efficiency of identifying the targets to cover are ignored. We propose a static probability analysis method for identifying the hard-to-activate data channel targets and generating the corresponding assertions for the HT test generation. Our method could generate test vectors to trigger Trojans from Trust-hub, DeTrust, and OpenCores in 1 minute and get 104.33X time improvement on average compared with the existing method.

6C-4 (Time: 16:15 - 16:40) (In-person)
TitleHardware Trojan Detection and High-Precision Localization in NoC-based MPSoC using Machine Learning
Author*Haoyu Wang, Basel Halak (University of Southampton, UK)
Pagepp. 516 - 521
KeywordNoC, MPSoC, Hardware Security, Hardware Trojan, Machine Learning
AbstractNetworks-on-Chips (NoC) based Multi-Processor System-on-Chip (MPSoC) are increasingly employed in industrial and consumer electronics. Outsourcing third-party IPs (3PIPs) and tools in NoC-based MPSoC is a prevalent development way in most fabless companies. However, Hardware Trojan (HT) injected during its design stage can maliciously tamper with the functionality of this communication scheme, which undermines the security of the system and may cause a failure. Detecting and localizing HT with high precision is a big challenge for current techniques. This work proposes for the first time a novel approach that allows detection and high-precision localization of HT, which is based on the use of packet information and machine learning algorithms. It is equipped with a novel Dynamic Confidence Interval (DCI) algorithm to detect malicious packets, and a novel Dynamic Security Credit Table (DSCT) algorithm to localize HT. We evaluated the proposed framework on the mesh NoC running real workloads. The average detection precision of 96.3% and the average localization precision of 100% were obtained from the experiment results, and the minimum HT localization time is around 5.8~12.9us at 2GHz depending on the different HT-infected nodes and workloads.

Session 6D  Advances in Physical Design and Timing Analysis
Time: 15:00 - 17:05, Wednesday, January 18, 2023
Location: Room Mars/Mercury
Chairs: Takashi Sato (Kyoto University), Andy Yu-Guang Chen (National Central University, Taiwan)

6D-1 (Time: 15:00 - 15:25) (Online)
TitleAn Integrated Circuit Partitioning and TDM Assignment Optimization Framework for Multi-FPGA Systems
Author*Dan Zheng, Evangeline F. Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 522 - 528
Keywordpartitioning, TDM optimization
AbstractIn multi-FPGA systems, Time-Division Multiplexing (TDM) is a widely used method for transferring multiple signals over a common wire. The circuit performance will be significantly influenced by this inter-FPGA delay. Some inter-FPGA nets are driven by different clocks, in which case they cannot share the same wire. In this paper, to minimize the maximum delay of inter-FPGA nets, we propose a two-step framework. First, a TDM-aware partitioning algorithm is adopted to minimize the maximum cut size between an FPGA-pair. A TDM ratio assignment method is then applied to assign TDM ratio for each inter-FPGA net optimally. Experimental results show that our algorithm can reduce the maximum TDM ratio significantly within reasonable runtime.
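As a rough illustration of the TDM ratio idea, the sketch below assumes a simplified model that is not taken from the paper: a net with integer TDM ratio r uses 1/r of a wire's time slots, so the reciprocals of the ratios on one wire must sum to at most 1, and a larger ratio means a longer inter-FPGA delay. The function names are hypothetical.

```python
from fractions import Fraction

def wire_is_feasible(ratios):
    """True if nets with these TDM ratios fit on one wire (assumed model:
    a net with ratio r consumes 1/r of the wire's time slots)."""
    return sum(Fraction(1, r) for r in ratios) <= 1

def uniform_ratio(num_nets):
    """Smallest equal ratio that lets num_nets nets share one wire."""
    return num_nets  # each net receives 1/num_nets of the slots

ratios = [uniform_ratio(4)] * 4
print(wire_is_feasible(ratios))      # four nets at ratio 4 exactly fill the wire
print(wire_is_feasible([2, 2, 4]))   # 1/2 + 1/2 + 1/4 exceeds the wire capacity
```

Under this model, minimizing the maximum ratio on a wire amounts to keeping the number of nets that cross each FPGA pair small, which is why partitioning and ratio assignment interact.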

6D-2 (Time: 15:25 - 15:50) (Online)
TitleA Robust FPGA Router with Concurrent Intra-CLB Rerouting
Author*Jiarui Wang, Jing Mai (Peking University, China), Zhixiong Di (Southwest Jiaotong University, China), Yibo Lin (Peking University/Beijing Advanced Innovation Center for Integrated Circuits, China)
Pagepp. 529 - 534
KeywordFPGA CAD, FPGA Routing
AbstractRouting is a key step in the FPGA design flow. It remains the most time-consuming step in the design flow with increasingly complicated FPGA architectures and design scales. The growing complexity of connections between logic pins inside CLBs of FPGAs challenges the efficiency and quality of FPGA routers, which requires FPGA routers to generate paths between CLBs and generate paths inside CLBs. Existing negotiation-based rip-up and reroute schemes will result in a large number of iterations when generating paths inside CLBs. In this work, we propose a robust routing framework for FPGAs with complex connections between logic elements and switch boxes. We propose a concurrent intra-CLB rerouting algorithm that can effectively resolve routing congestions inside a CLB tile. Experimental results on modified ISPD 2016 benchmarks demonstrate that our framework can achieve 100% routability in less wirelength and runtime, while the state-of-the-art VTR 8.0 routing algorithm failed at 4 of 12 benchmarks.

6D-3 (Time: 15:50 - 16:15) (Online)
TitleEfficient Global Optimization for Large Scaled Ordered Escape Routing
AuthorChuandong Chen, *Dishi Lin, Rongshan Wei, Qinghai Liu (Fuzhou University, China), Ziran Zhu (Southeast University, China), Jianli Chen (Fudan University, China)
Pagepp. 535 - 540
KeywordEscape Routing, integer linear programming
AbstractThe Ordered Escape Routing (OER) problem is NP-hard. Typical approaches for solving small-scale OER often involve integer linear programming (ILP) or heuristic algorithms. In this paper, we propose a method to plan the routing resources from a global perspective, which combines the advantages of ILP and heuristic algorithms to satisfy a series of routing constraints and minimize the wiring length, and we successfully apply it to OER problems. We first separate the non-crossing requirement from typical ILP modeling, which considerably expands the scale that may be solved. A routing failure policy is also included to help speed up the routing process. Then, considering the congestion of wiring resources, an ILP method is proposed to detect congestion, and semi-automatic capacity reduction is adopted to address it. Finally, we discuss how this method can be extended to heuristic algorithms as a preprocessing stage to avoid ripping-up and rerouting as much as possible. Compared to previous heuristic techniques like A*, the routing resources are logically designed and the wire length is minimized better. Compared to standard ILP, our algorithm not only solves large-scale problems but also reduces routing time by 76%.

6D-4 (Time: 16:15 - 16:40) (Online)
TitleAn Adaptive Partition Strategy of Galerkin Boundary Element Method for Capacitance Extraction
Author*Shengkun Wu (Peng Cheng Laboratory, China), Biwei Xie (Institute of Computing Technology, Chinese Academy of Sciences, China), Xingquan Li (School of Mathematics and Statistics, Minnan Normal University, China)
Pagepp. 541 - 546
KeywordGalerkin method, boundary element method, boundary partition, capacitance extraction
AbstractIn advanced processes, electromagnetic coupling among interconnect wires plays an increasingly important role in signoff analysis. For VLSI chip design, the requirement of fast and accurate capacitance extraction is becoming more and more urgent. The critical step of extracting capacitance among interconnect wires is solving the electric field. However, due to the high computational complexity, solving the electric field is extremely time-consuming. The Galerkin boundary element method (GBEM) has been used for capacitance extraction. In this paper, we use mathematical theorems to analyze its error. Furthermore, with the error estimation of the Galerkin method, we design a boundary partition strategy to fit the electric field attenuation. It is worth mentioning that this boundary partition strategy can greatly reduce the number of boundary elements on the premise of ensuring that the error is small enough. As a consequence, the matrix order of the discretization equation will also decrease. We also provide our suggestion for the calculation of the matrix elements. Experimental analysis demonstrates that our partition strategy obtains a good enough result with a small number of boundary elements.

6D-5 (Time: 16:40 - 17:05) (Online)
TitleGraph-Learning-Driven Path-Based Timing Analysis Results Predictor from Graph-Based Timing Analysis
Author*Yuyang Ye (Southeast University, China), Tinghuan Chen (The Chinese University of Hong Kong, Hong Kong), Yifei Gao, Hao Yan (Southeast University, China), Bei Yu (The Chinese University of Hong Kong, Hong Kong), Longxing Shi (Southeast University, China)
Pagepp. 547 - 552
KeywordStatic Timing analysis, Graph learning, Deep EdgeGAT
AbstractWith diminishing margins in advanced technology nodes, the performance of static timing analysis (STA) is a serious concern, including accuracy and runtime. STA can generally be divided into graph-based analysis (GBA) and path-based analysis (PBA). For GBA, the timing results are always pessimistic, leading to overdesign during design optimization. For PBA, the timing pessimism is reduced by propagating real path-specific slews, at the cost of severe runtime overheads relative to GBA. In this work, we present a fast and accurate predictor of post-layout PBA timing results from inexpensive GBA based on a deep edge-featured graph attention network, namely deep EdgeGAT. Compared with conventional machine and graph learning methods, deep EdgeGAT can learn global timing path information. Experimental results demonstrate that our predictor has the potential to predict PBA timing results accurately and reduce the timing pessimism of GBA with maximum error reaching 6.81 ps, and our work achieves an average 24.80x speedup over PBA using the commercial STA tool.

Session 6E  (DF-2) Advanced Sensor Technologies and Application
Time: 15:00 - 16:15, Wednesday, January 18, 2023
Location: Miraikan Hall
Organizer/Chair: Yasuhisa Tochigi (SONY Semiconductor Solutions Corp., Japan)

6E-1 (Time: 15:00 - 15:25)
Title(Designers' Forum) Advanced Technologies of an Organic-Photoconductive-Film CMOS Image Sensor
AuthorNaoki Shimasaki (Panasonic Holdings Corporation, Japan)
AbstractIn this presentation, we introduce our organic-photoconductive-film CMOS (OPF) image sensor technologies. The OPF image sensor has three major features due to its unique structure. (i) Wide dynamic range technology: the photoelectric conversion and charge storage parts are completely independent, so the OPF image sensor realizes a significant improvement in dynamic range performance. (ii) Global shutter technology: the photoelectric conversion efficiency is electrically controllable, so the OPF image sensor achieves high-speed exposure control like a physical mechanical shutter. (iii) RGB-NIR sensing technology: the photoelectric conversion wavelength is selectable by changing the organic film material, hence the OPF image sensor can achieve high sensitivity to the user's desired wavelength. Consequently, the OPF image sensor will go beyond human sensing ability and be suitable for industrial, surveillance, automotive, and other fields.

6E-2 (Time: 15:25 - 15:50)
Title(Designers' Forum) A 0.37W 143dB-Dynamic-Range 1Mpixel Backside-Illuminated Charge-Focusing SPAD Image Sensor with Pixel-Wise Exposure Control and Adaptive Clocked Recharging
AuthorYasuharu Ota, K. Morimoto (Canon Inc., Japan)
AbstractWe present a 3D-BSI 1Mpixel charge focusing SPAD image sensor based on pixel-wise exposure control, adaptive clocked recharging, and dead-time-free global shutter at 90fps. The proposed architecture enables scalable implementation of photon counting pixels with <0.4W sensor power consumption and 143dB dynamic range with PDE of 70% and DCR of 2.5cps. Single-photon-sensitive HDR imaging result verifies the feasibility of the SPAD sensor for security, automotive and medical imaging applications.

6E-3 (Time: 15:50 - 16:15)
Title(Designers' Forum) A 1200x84-pixels 64cc Solid-State LiDAR RX with an HV/LV transistors Hybrid Active-Quenching-SPAD Array and Background Digital PT Compensation
AuthorTuan Thanh Ta (Toshiba Corp. R&D Center, Japan)
AbstractThis paper presents two essential techniques, Active-Quenching (AQ) SPAD consisting of hybrid HV/LV transistors and Digital SPAD Characteristic Compensation (DSCC) circuit to realize high-performance and palm-sized LiDAR. The hybrid AQ circuit shrinks the pixel area while ensuring a high sensitivity. The high pixel density 2D-SPAD array realizes a high image-resolution LiDAR while reducing the light-receiver (RX) unit size and its light-receiving lens. The DSCC provides an on-chip Process/Temperature (PT) calibration without external components, which contributes to weather ability assurance and LiDAR miniaturization. These technologies downsize LiDAR RX to the world's smallest size, 64cc, and realize a total 350cc size LiDAR with its competitive performance as a mechanical one.

Thursday, January 19, 2023

Session 3K  Keynote III
Time: 9:00 - 10:00, Thursday, January 19, 2023
Location: Miraikan Hall
Chair: Atsushi Takahashi (Tokyo Institute of Technology, Japan)

Title(Keynote Address) Innovation by Design and Technology Co-Optimization
AuthorTakuya Yasui (TSMC Japan Design Center, Japan)
AbstractThe semiconductor industry has been pursuing transistor energy efficiency, performance, and area scaling with each new technology generation. For further growth as we continue to move forward on both "Moore's Law" and "More than Moore", Design and Technology Co-Optimization (DTCO) is becoming more important. Many technologies, such as standard cell architecture and 3D IC design flows, are being introduced for the next technology generation. DTCO examples and challenges will be shared in this presentation.

Session 7A  (SS-4) Brain-inspired Hyperdimensional Computing to the Rescue for beyond von Neumann Era
Time: 10:20 - 11:35, Thursday, January 19, 2023
Location: Room Saturn
Chair: Lilas Alrahis (New York University Abu Dhabi, United Arab Emirates)

7A-1 (Time: 10:20 - 11:35) (In-person)
Title(Invited Paper) Beyond von Neumann Era: Brain-inspired Hyperdimensional Computing to the Rescue
Author*Hussam Amrouch, Paul R. Genssler (University of Stuttgart, Germany), Mohsen Imani, Mariam Issa (UC Irvine, USA), Xun Jiao (Villanova University, USA), Wegdan Mohammed, Glorian Sepanta (University of Stuttgart, Germany), Ruixuan Wang (Villanova University, USA)
Pagepp. 553 - 560
KeywordBrain-Inspired Computing, Computer Architecture
AbstractBreakthroughs in deep learning (DL) continuously fuel innovations that profoundly improve our daily life. However, DNNs overwhelm conventional computing architectures with their massive data movements between processing and memory units. As a result, novel computer architectures are indispensable to improve or even replace the decades-old von Neumann architecture. Nevertheless, going far beyond the existing von Neumann principles comes with profound reliability challenges for the performed computations. This is due to analog computing together with emerging beyond-CMOS technologies being inherently noisy and inevitably leading to unreliable computing. Hence, novel robust algorithms become key to going beyond the boundaries of the von Neumann era. Hyperdimensional Computing (HDC) is rapidly emerging as an attractive alternative to traditional DL and ML algorithms. Unlike conventional DL and ML algorithms, HDC is inherently robust against errors, along with a much more efficient hardware implementation. In addition to these advantages at the hardware level, HDC's promise to learn from little data and the underlying algebra enable new possibilities at the application level. In this work, the robustness of HDC algorithms against errors and beyond von Neumann architectures are discussed. Further, the benefits of HDC as a machine learning algorithm are demonstrated with the example of outlier detection and reinforcement learning.
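The robustness claim can be made concrete with a minimal sketch, assuming bipolar hypervectors, majority-vote bundling, and a normalized dot-product similarity. These are common HDC choices rather than details stated in the abstract, and the dimensionality and helper names are hypothetical.

```python
import random

DIM = 10_000  # hypervector dimensionality (assumed value)

def random_hv(rng):
    """Random bipolar hypervector with components in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def bundle(hvs):
    """Element-wise majority vote over an odd number of hypervectors."""
    return [1 if sum(col) > 0 else -1 for col in zip(*hvs)]

def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / DIM

def flip(hv, fraction, rng):
    """Copy of hv with the given fraction of components sign-flipped."""
    noisy = list(hv)
    for i in rng.sample(range(DIM), int(fraction * DIM)):
        noisy[i] = -noisy[i]
    return noisy

rng = random.Random(0)
class_a = bundle([random_hv(rng) for _ in range(3)])
class_b = bundle([random_hv(rng) for _ in range(3)])
query = flip(class_a, 0.30, rng)  # corrupt 30% of the components

print(similarity(query, class_a))  # stays far above the unrelated class
print(similarity(query, class_b))  # near zero for independent hypervectors
```

Because information is spread evenly over thousands of components, flipping a fraction p of them lowers the similarity to the original class only to 1 - 2p, while unrelated classes remain close to zero.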

Session 7B  System Level Design Space Exploration
Time: 10:20 - 11:35, Thursday, January 19, 2023
Location: Room Uranus
Chair: Yun Liang (Beijing University)

7B-1 (Time: 10:20 - 10:45) (In-person)
TitleSystem-Level Exploration of In-Package Wireless Communication for Multi-Chiplet Platforms
Author*Rafael Medina, Joshua Klein, Giovanni Ansaloni (École Polytechnique Fédérale de Lausanne, Switzerland), Marina Zapater (Haute École Spécialisée de Suisse Occidentale, Switzerland), Sergi Abadal, Eduard Alarcón (Universitat Politècnica de Catalunya, Spain), David Atienza (École Polytechnique Fédérale de Lausanne, Switzerland)
Pagepp. 561 - 566
KeywordMulti-chiplet systems, On-package wireless communication, Full system-level simulation, DNNs
AbstractMulti-Chiplet architectures are being increasingly adopted to support the design of very large systems in a single package, facilitating the integration of heterogeneous components and improving manufacturing yield. However, chiplet-based solutions have to cope with limited inter-chiplet routing resources, which complicate the design of the data interconnect and the power delivery network. Emerging in-package wireless technology is a promising strategy to address these challenges, as it allows flexible chiplet interconnects to be implemented while freeing package resources for power supply connections. To assess the capabilities of such an approach and its impact from a full-system perspective, herein we present an exploration of the performance of in-package wireless communication, based on dedicated extensions to the gem5-X simulator. We consider different Medium Access Control (MAC) protocols, as well as applications with different runtime profiles, showcasing that current in-package wireless solutions are competitive with wired chiplet interconnects. Our results show how in-package wireless solutions can outperform wired alternatives when running artificial intelligence workloads, achieving up to a 2.64x speed-up when running deep neural networks (DNNs) on a chiplet-based system with 16 cores distributed in four clusters.

7B-2 (Time: 10:45 - 11:10) (In-person)
TitleEfficient System-Level Design Space Exploration for High-Level Synthesis using Pareto-Optimal Subspace Pruning
AuthorYuchao Liao, *Tosiron Adegbija, Roman Lysecky (University of Arizona, USA)
Pagepp. 567 - 572
KeywordSystem-level optimization, Design space exploration, High-level synthesis, Subspace pruning, Embedded system
AbstractHigh-level synthesis (HLS) is a rapidly evolving and popular approach to designing, synthesizing, and optimizing embedded systems. Many HLS methodologies utilize design space exploration (DSE) at the post-synthesis stage to find Pareto-optimal hardware implementations for individual components. However, the design space for the system-level Pareto-optimal configurations is orders of magnitude larger than component-level design space, making existing approaches insufficient for system-level DSE. This paper presents Pruned Genetic Design Space Exploration (PG-DSE), an approach to post-synthesis DSE that involves a pruning method to effectively reduce the system-level design space and an elitist genetic algorithm to accurately find the system-level Pareto-optimal configurations. We evaluate PG-DSE using an autonomous driving application subsystem (ADAS) and three synthetic systems with extremely large design spaces. Experimental results show that PG-DSE can reduce the design space by several orders of magnitude compared to prior work while achieving higher quality results (an average improvement of 58.1x).
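For readers unfamiliar with the term, the sketch below shows only what "Pareto-optimal" means for a set of candidate designs when every objective is minimized. It is not PG-DSE's pruning method or genetic algorithm, and the (area, latency) values are made up for illustration.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (area, latency) pairs for five candidate implementations.
designs = [(10, 5.0), (12, 4.0), (11, 6.0), (15, 3.5), (12, 4.5)]
print(pareto_front(designs))  # [(10, 5.0), (12, 4.0), (15, 3.5)]
```

This brute-force filter compares every pair of points, which is one reason exhaustive evaluation stops scaling once component choices are combined at the system level.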

7B-3 (Time: 11:10 - 11:35) (Online)
TitleAutomatic Generation of Complete Polynomial Interpolation Design Space for Hardware Architectures
AuthorBryce Orloski (Intel Corporation, USA), *Samuel Coward (Intel Corporation, UK), Theo Drane (Intel Corporation, USA)
Pagepp. 573 - 578
Keyworddatapath design, elementary function, polynomial interpolation
AbstractHardware implementations of elementary functions regularly deploy piecewise polynomial approximations. This work determines the complete design space of piecewise polynomial approximations meeting a given accuracy specification. Knowledge of this design space determines the minimum number of regions required to approximate the function accurately enough and facilitates the generation of optimized hardware which is competitive against the state of the art. Designers can explore the space of feasible architectures without needing to validate their choices. A heuristic based decision procedure is proposed to generate optimal ASIC hardware designs. Targeting alternative hardware technologies simply requires a modified decision procedure to explore the space. We highlight the difficulty in choosing an optimal number of regions to approximate the function with, as this is input width dependent.
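To illustrate how an accuracy specification fixes a minimum number of regions, the sketch below assumes a much narrower setting than the paper's complete design space: uniform regions, degree-1 pieces interpolating the region endpoints, and an error estimate taken on a sample grid rather than a proven bound. The function names and the 2^x example are hypothetical.

```python
def max_error(f, lo, hi, regions, samples_per_region=64):
    """Largest sampled error of a piecewise-linear fit that interpolates
    f at the endpoints of each of `regions` uniform regions."""
    width = (hi - lo) / regions
    worst = 0.0
    for r in range(regions):
        x0 = lo + r * width
        y0, y1 = f(x0), f(x0 + width)
        for s in range(samples_per_region + 1):
            t = s / samples_per_region
            approx = y0 + t * (y1 - y0)
            worst = max(worst, abs(f(x0 + t * width) - approx))
    return worst

def min_regions(f, lo, hi, tolerance, limit=4096):
    """Smallest uniform region count whose sampled error meets tolerance."""
    for regions in range(1, limit + 1):
        if max_error(f, lo, hi, regions) <= tolerance:
            return regions
    raise ValueError("no region count up to limit meets the tolerance")

def exp2(x):
    return 2.0 ** x

n = min_regions(exp2, 0.0, 1.0, 1e-3)
print(n, max_error(exp2, 0.0, 1.0, n))
```

Tightening the tolerance raises the region count, which hints at why the best number of regions depends on the input and output precision being targeted.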

Session 7C  Security Assurance and Acceleration
Time: 10:20 - 11:35, Thursday, January 19, 2023
Location: Room Venus
Chairs: Prabhat Mishra (University of Florida, USA), Pengfei Qiu (Beijing University of Posts and Telecommunications, China)

7C-1 (Time: 10:20 - 10:45) (In-person)
TitleSHarPen: SoC Security Verification by Hardware Penetration Test
AuthorHasan Al-Shaikh, Arash Vafaei, Mridha Md Mashahedur Rahman, Kimia Zamiri Azar, Fahim Rahman, Farimah Farahmandi, *Mark Tehranipoor (University of Florida, USA)
Pagepp. 579 - 584
KeywordSoC Security Verification, Penetration Testing, Binary Particle Swarm, Cost Function, SoC Prototyping
AbstractAs modern SoC architectures incorporate many complex/heterogeneous intellectual properties (IPs), the protection of security assets has become imperative, and the number of vulnerabilities revealed is rising due to the increased number of attacks. Over the last few years, penetration testing (PT) has become an increasingly effective means of detecting software (SW) vulnerabilities. As of yet, no such technique has been applied to the detection of hardware vulnerabilities. This paper proposes a PT framework, SHarPen, for detecting hardware vulnerabilities, which facilitates the development of a SoC-level security verification framework. SHarPen proposes a formalism for performing gray-box hardware (HW) penetration testing instead of relying on coverage-based testing and provides an automation for mapping hardware vulnerabilities to logical/mathematical cost functions. SHarPen supports both simulation and FPGA-based prototyping, allowing us to automate security testing at different stages of the design process with high capabilities for identifying vulnerabilities in the targeted SoC.

7C-2 (Time: 10:45 - 11:10) (In-person)
TitleSecHLS: Enabling Security Awareness in High-Level Synthesis
AuthorShang Shi, Nitin Pundir, Hadi Mardani Kamali, Mark Tehranipoor, *Farimah Farahmandi (University of Florida, USA)
Pagepp. 585 - 590
KeywordHigh-Level Synthesis, Security, Scheduling, Binding
AbstractIn the quest for further optimization, high-level synthesis (HLS) tools utilize advanced automatic optimization algorithms to achieve lower implementation time/effort for even more complex designs. These optimization algorithms are for the HLS tools’ backend stages, e.g., allocation, scheduling, and binding, and they are highly optimized for resources/latency constraints. However, current HLS tools’ backends are unaware of designs’ security assets, and their algorithms are incapable of handling security constraints. In this paper, we propose Secure-HLS (SecHLS), which aims to define underlying security constraints for HLS tools’ backend stages and intermediate representations. In SecHLS, we improve a set of widely-used scheduling and binding algorithms by integrating the proposed security-related constraints into them. We evaluate the effectiveness of SecHLS in terms of power, performance, area (PPA), security, and complexity (execution time) on small and real-size benchmarks, showing how the proposed security constraints can be integrated into HLS while maintaining low PPA/complexity burdens.

7C-3 (Time: 11:10 - 11:35) (In-person)
TitleA Flexible ASIC-oriented Design for a Full NTRU Accelerator
Author*Francesco Antognazza, Alessandro Barenghi, Gerardo Pelosi (Politecnico di Milano, Italy), Ruggero Susella (STMicroelectronics, Italy)
Pagepp. 591 - 597
KeywordPost Quantum Cryptography, Hardware Accelerator
AbstractPost-quantum cryptosystems are the subject of a significant research effort, witnessed by various international standardization competitions. Among them, the NTRU Key Encapsulation Mechanism has been recognized as a secure, patent-free, and efficient public key encryption scheme. In this work, we perform a design space exploration on an FPGA target, with the final goal of an efficient ASIC realization. Specifically, we focus on the possible choices for the design of polynomial multipliers with different memory bus widths to trade-off lower clock cycle counts with larger interconnections. Our design outperforms the best FPGA synthesis results at the state of the art, and we report the results of ASIC syntheses minimizing latency and area with a 40nm industrial grade technology library. Our speed-oriented design computes an encapsulation in 4.1 to 10.2 µs and a decapsulation in 7.1 to 11.7 µs, depending on the NTRU security level, while our most compact design only takes 20% more area than the underlying SHA-3 hash module.

Session 7D  (SS-5) Hardware and Software Co-design of Emerging Machine Learning Algorithms
Time: 10:20 - 11:35, Thursday, January 19, 2023
Location: Room Mars/Mercury
Chairs: Xiaobo Sharon Hu (University of Notre Dame, USA), Dayane Reis (University of South Florida, USA)

7D-1 (Time: 10:20 - 10:45) (Online)
Title(Invited Paper) Robust Hyperdimensional Computing Against Cyber Attacks and Hardware Errors: A Survey
AuthorDongning Ma, Sizhe Zhang, *Xun Jiao (Villanova University, USA)
Pagepp. 598 - 605
Keywordhyperdimensional computing, robustness, cyber attack, hardware error
AbstractHyperdimensional Computing (HDC), also known as Vector Symbolic Architecture (VSA), is an emerging AI algorithm inspired by the way the human brain functions. Compared with deep neural networks (DNNs), HDC possesses several advantages such as smaller model size, less computation cost, and one/few-shot learning, making it a promising alternative computing paradigm. With the increasing deployment of AI in safety-critical systems such as healthcare and robotics, it is not only important to strive for high accuracy, but also to ensure its robustness under even highly uncertain and adversarial environments. However, recent studies show that HDC, just like DNNs, is vulnerable to both cyber attacks (e.g., adversarial attacks) and hardware errors (e.g., memory failures). While a growing body of research has been studying the robustness of HDC, there is a lack of systematic review of research efforts on this increasingly important topic. To the best of our knowledge, this paper presents the first survey dedicated to reviewing the research efforts made on the robustness of HDC against cyber attacks and hardware errors. While the performance and accuracy of HDC as an AI method still await future theoretical advancement, this survey paper aims to shed light on and call for community efforts in robustness research of HDC.

7D-2 (Time: 10:45 - 11:10) (In-person)
Title(Invited Paper) In-Memory Computing Accelerators for Emerging Learning Paradigms
Author*Dayane Reis (University of South Florida, USA), Ann Franchesca Laguna (De La Salle University, Philippines), Michael Niemier, Xiaobo S. Hu (University of Notre Dame, USA)
Pagepp. 606 - 611
KeywordComputing-in-memory, Emerging technologies, Machine learning, FeFET, RRAM
AbstractOver the past decades, emerging, data-driven machine learning (ML) paradigms have increased in popularity, and revolutionized many application domains. To date, a substantial effort has been devoted to devising mechanisms for facilitating the deployment and near ubiquitous use of these memory intensive ML models. This review paper presents the use of in-memory computing (IMC) accelerators for emerging ML paradigms from a bottom-up perspective through the choice of devices, the design of circuits/architectures, to the application-level results.

7D-3 (Time: 11:10 - 11:35) (Online)
Title(Invited Paper) Toward Fair and Efficient Hyperdimensional Computing
Author*Yi Sheng, Junhuan Yang, Weiwen Jiang, Lei Yang (George Mason University, USA)
Pagepp. 612 - 617
KeywordFair, Efficient, HDC
AbstractWe are witnessing Machine Learning (ML) being applied to varied applications, such as intelligent security systems, medical diagnoses, etc. With this trend, there is high demand to run ML on end devices with limited resources. What’s more, fairness in these ML algorithms is increasingly important, since these applications are not designed for specific users (e.g., people with fair skin in skin disease diagnosis) but need to be applied to all possible users (i.e., people with different skin tones). Brain-inspired hyperdimensional computing (HDC) has demonstrated its ability to run ML tasks on edge devices with a small memory footprint; yet, it is unknown whether HDC can satisfy the fairness requirements of applications (e.g., medical diagnosis for people with different skin tones). In this paper, for the first time, we reveal that the vanilla HDC has severe bias due to its sensitivity to colour information. Toward a fair and efficient HDC, we propose a holistic framework, namely FE-HDC, which integrates the image processing and input compression techniques in HDC’s encoder. Compared with the vanilla HDC, results show that the proposed FE-HDC can reduce the unfairness score by 90%, achieving fairer architectures with competitively high accuracy.

Session 8A  (SS-6) Full-Stack Co-design for On-Chip Learning in AI Systems
Time: 13:00 - 14:15, Thursday, January 19, 2023
Location: Room Saturn
Chairs: Anup Das (Drexel University, USA), Antonino Tumeo (Pacific Northwest National Laboratory, USA)

8A-1 (Time: 13:00 - 13:25) (In-person)
Title(Invited Paper) Improving the Robustness and Efficiency of PIM-based Architecture by SW/HW Co-design
AuthorXiaoxuan Yang, Shiyu Li, Qilin Zheng, *Yiran Chen (Duke University, USA)
Pagepp. 618 - 623
KeywordProcessing-in-Memory, Hardware-Software Co-Design, Resistive Random Access Memory, Machine Learning, Transformer
AbstractProcessing-in-memory (PIM) based architecture shows great potential to process several emerging artificial intelligence workloads, including vision and language models. Cross-layer optimizations could bridge the gap between computing density and the available resources by reducing the computation and memory cost of the model and improving the model’s robustness against non-ideal hardware effects. We first introduce several hardware-aware training methods to improve the model robustness to the PIM device’s nonideal effects, including stuck-at-fault, process variation, and thermal noise. Then, we further demonstrate a software/hardware (SW/HW) co-design methodology to efficiently process the state-of-the-art attention-based model on PIM-based architecture by performing sparsity exploration for the attention-based model and circuit-architecture co-design to support the sparse processing.

8A-2 (Time: 13:25 - 13:50) (Online)
Title(Invited Paper) Hardware-Software Co-Design for On-Chip Learning in AI Systems
AuthorL. M. Varshika, Abhishek Kumar Mishra, Nagarajan Kandasamy, *Anup Das (Drexel University, USA)
Pagepp. 624 - 631
KeywordSpiking Neural Network (SNN), Neuromorphic Computing, Convolutional Neuromorphic Computing (CNN), Spike Timing Dependent Plasticity (STDP), FPGA
AbstractSpike-based CNNs are empowered with on-chip learning in their convolution layers, enabling the layer to learn to detect features by combining those extracted in the previous layer. We propose ECHELON, a generalized design template for a tile-based neuromorphic hardware with on-chip learning capabilities. Each tile in ECHELON consists of a neural processing unit (NPU) to implement convolution and dense layers of a CNN model, an on-chip learning unit (OLU) to facilitate spike-timing dependent plasticity (STDP) in the convolution layer, and a special function unit (SFU) to implement other CNN functions such as pooling, concatenation, and residual computation. These tile resources are interconnected using a shared bus, which is segmented and configured via the software to facilitate parallel communication inside the tile. Tiles are themselves interconnected using a classical NoC interconnect. We propose a system software to map CNN models to ECHELON, maximizing the performance. We integrate the hardware design and software optimization within a co-design loop to obtain the hardware and software architectures for a target CNN, satisfying both performance and resource constraints. We show the implementation of a tile on an FPGA. Using 8 STDP-enabled CNN models, we show the potential of our co-design methodology to optimize hardware resources.
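As background on STDP, the sketch below implements the textbook pair-based exponential rule: a synapse is strengthened when the presynaptic spike precedes the postsynaptic one and weakened otherwise. It is not a description of ECHELON's on-chip learning unit, and the amplitudes and time constant are assumed values.

```python
import math

def stdp_delta(dt, a_plus=0.01, a_minus=0.012, tau=20.0):
    """Weight change for dt = t_post - t_pre in milliseconds.
    a_plus, a_minus, and tau are illustrative constants."""
    if dt > 0:   # pre fires before post: potentiation
        return a_plus * math.exp(-dt / tau)
    if dt < 0:   # post fires before pre: depression
        return -a_minus * math.exp(dt / tau)
    return 0.0

def update_weight(w, dt, w_min=0.0, w_max=1.0):
    """Apply the STDP change and clip the weight to its allowed range."""
    return min(w_max, max(w_min, w + stdp_delta(dt)))

print(update_weight(0.5, 5.0))    # slightly above 0.5
print(update_weight(0.5, -5.0))   # slightly below 0.5
```

The update depends only on the timing of the two spikes at a synapse, which makes rules of this kind a natural fit for local hardware learning units.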

8A-3 (Time: 13:50 - 14:15) (In-person)
Title(Invited Paper) Towards On-Chip Learning for Low Latency Reasoning with End-to-End Synthesis
Author*Vito Giovanni Castellana (Pacific Northwest National Laboratory, USA), Nicolas Bohm Agostini (Northeastern University and Pacific Northwest National Laboratory, USA), Ankur Limaye (Pacific Northwest National Laboratory, USA), Serena Curzel, Michele Fiorito (Politecnico di Milano, Italy), Vinay Amatya, Marco Minutoli, Joseph Manzano (Pacific Northwest National Laboratory, USA), Fabrizio Ferrandi (Politecnico di Milano, Italy), Antonino Tumeo (Pacific Northwest National Laboratory, USA)
Pagepp. 632 - 638
KeywordHLS, Machine Learning, Low Latency
AbstractThe Software Defined Architectures (SODA) Synthesizer is an open-source compiler-based tool able to automatically generate domain-specialized systems targeting Application-Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) starting from high-level programming. SODA is composed of a frontend, SODA-OPT, which leverages the multilevel intermediate representation (MLIR) framework to interface with productive programming tools (e.g., machine learning frameworks), identify kernels suitable for acceleration, and perform high-level optimizations, and of a state-of-the-art high-level synthesis backend, Bambu from the PandA framework, to generate custom accelerators. One specific application of the SODA Synthesizer is the generation of accelerators to enable ultra-low latency inference and control on autonomous systems for scientific discovery (e.g., electron microscopes, sensors in particle accelerators, etc.). This paper provides an overview of the flow in the context of the generation of accelerators for edge processing to be integrated in transmission electron microscopy (TEM) devices, focusing on use cases from precision material synthesis. We show the tool in action with an example of design space exploration for inference on reconfigurable devices with a conventional deep neural network model (LeNet). Finally, we discuss the research directions and opportunities enabled by SODA in the area of autonomous control for scientific experimental workflows.

Session 8B  Energy-Efficient Computing for Emerging Applications
Time: 13:00 - 14:40, Thursday, January 19, 2023
Location: Room Uranus
Chair: Olivia Chen (Yokohama National University, Japan)

8B-1 (Time: 13:00 - 13:25) (Online)
TitleKnowledge Distillation in Quantum Neural Network using Approximate Synthesis
AuthorMahabubul Alam, Satwik Kundu, *Swaroop Ghosh (Pennsylvania State University, USA)
Pagepp. 639 - 644
KeywordQML, QNN, Quantum Computing, Quantum Machine Learning, Quantum Neural Network
AbstractRecent assertions of a potential advantage of Quantum Neural Networks (QNN) for specific Machine Learning (ML) tasks have sparked the curiosity of a sizable number of application researchers. The parameterized quantum circuit (PQC), a major building block of a QNN, consists of several layers of single-qubit rotations and multi-qubit entanglement operations. The optimum number of PQC layers for a particular ML task is generally unknown. A larger network often provides better performance in noiseless simulations. However, it may perform poorly on hardware compared to a shallower network. Because the amount of noise varies amongst quantum devices, the optimal depth of PQC can vary significantly. Additionally, the gates chosen for the PQC may be suitable for one type of hardware but not for another due to compilation overhead. This makes it difficult to generalize a QNN design to a wide range of hardware and noise levels. An alternative approach is to build and train multiple QNN models, one targeted at each hardware platform, which can be expensive. To circumvent these issues, we introduce the concept of knowledge distillation in QNN using approximate synthesis. The proposed approach will create a new QNN network with (i) a reduced number of layers or (ii) a different gate set without having to train it from scratch. Training the new network for a few epochs can compensate for the loss caused by approximation error. Through empirical analysis, we demonstrate a 71.4% reduction in circuit layers and still achieve 16.2% better accuracy under noise.

8B-2 (Time: 13:25 - 13:50) (Online)
TitleNTGAT: A Graph Attention Network Accelerator with Runtime Node Tailoring
AuthorWentao Hou, *Kai Zhong, Shulin Zeng, Guohao Dai, HuaZhong Yang, Yu Wang (Tsinghua University, China)
Pagepp. 645 - 650
Keywordsoftware-hardware co-design, graph attention network, accelerator, GNN
AbstractGraph Attention Network (GAT) has demonstrated better performance in many graph tasks than previous Graph Neural Network (GNN) models like Graph Convolution Network (GCN). However, it involves graph attention operations that introduce extra computing complexity. While a large amount of existing literature has researched GNN acceleration, few have focused on the attention mechanism. The graph attention mechanism makes the computation flow of GAT different from that of GCN and introduces the calculation of attention coefficients. Therefore, previous GNN accelerators cannot support GAT. Besides, the attention coefficients distinguish the importance of neighbors and make it possible to reduce the workload through runtime tailoring. In this paper, we present NTGAT, a software-hardware co-design approach to accelerate graph attention network with runtime node tailoring. Our work comprises both a runtime node tailoring algorithm and an accelerator dedicated to it. We propose a pipeline sorting method and a specialized hardware unit to dynamically support node tailoring during inference. The experiments show that our algorithm can reduce up to 86% of aggregation workload on large datasets while incurring slight accuracy loss (<0.4%). The FPGA-based accelerator can achieve up to 3.8x speedup and 4.98x energy efficiency compared to the GPU baseline.

8B-3 (Time: 13:50 - 14:15) (In-person)
TitleA Low-Bitwidth Integer-STBP Algorithm for Efficient Training and Inference of Spiking Neural Networks
Author*Pai-Yu Tan, Cheng-Wen Wu (National Tsing Hua University, Taiwan)
Pagepp. 651 - 656
Keywordartificial intelligence (AI), back-propagation, image classification, neuromorphic computing, spiking neural network (SNN)
AbstractSpiking neural networks (SNNs) that enable energy-efficient neuromorphic hardware are receiving growing attention. Training SNNs directly with back-propagation has demonstrated accuracy comparable to deep neural networks (DNNs). However, previous direct-training algorithms require high-precision floating-point operations, which are not suitable for low-power end-point devices. The high-precision operations also require the learning algorithm to run on high-performance accelerator hardware. In this paper, we propose an improved approach that converts the high-precision floating-point operations to low-bitwidth integer operations for an existing direct-training algorithm, i.e., the Spatio-Temporal Back-Propagation (STBP) algorithm. The proposed low-bitwidth integer-STBP algorithm requires only integer arithmetic for SNN training and inference, which greatly reduces the computational complexity. Experimental results show that the proposed integer-STBP algorithm achieves accuracy comparable to, and higher energy efficiency than, the original floating-point STBP algorithm. Moreover, it can be implemented on low-power end-point devices, which are mostly supported by fixed-point hardware, to provide learning capability during inference.

8B-4 (Time: 14:15 - 14:40) (In-person)
TitleTiC-SAT: Tightly-coupled Systolic Accelerator for Transformers
Author*Alireza Amirshahi, Joshua Alexander Harrison Klein, Giovanni Ansaloni, David Atienza (EPFL, Switzerland)
Pagepp. 657 - 663
KeywordSystolic Array, Tightly-coupled Accelerators, Transformers
AbstractTransformer models have achieved impressive results in various AI scenarios, ranging from vision to natural language processing. However, their computational complexity and their vast number of parameters hinder their implementations on resource-constrained platforms. Furthermore, while loosely-coupled hardware accelerators have been proposed in the literature, data transfer costs limit their speed-up potential. We address this challenge along two axes. First, we introduce tightly-coupled, small-scale systolic arrays (TiC-SATs), governed by dedicated ISA extensions, as dedicated functional units to speed up execution. Then, thanks to the tightly-coupled architecture, we employ software optimizations to maximize data reuse, thus lowering miss rates across cache hierarchies. Full system simulations across various BERT and VisionTransformer models are employed to validate our strategy, resulting in substantial application-wide speed-ups (e.g., up to 89.5X for BERT-large). TiC-SAT is available as an open-source framework.

Session 8C  Side-Channel Attacks and RISC-V Security
Time: 13:00 - 14:40, Thursday, January 19, 2023
Location: Room Venus
Chairs: Farimah Farahmandi (University of Florida, USA), Md Tanvir Arafin (George Mason University, USA)

8C-1 (Time: 13:00 - 13:25) (In-person)
TitlePMU-Leaker: Performance Monitor Unit-based Realization of Cache Side-Channel Attacks
AuthorPengfei Qiu, Qiang Gao (Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education/Tsinghua University, China), Dongsheng Wang, Yongqiang Lyu (Tsinghua University, China), Chunlu Wang (Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, China), Chang Liu (Tsinghua University, China), Rihui Sun (Harbin Institute of Technology, China), *Gang Qu (University of Maryland, College Park, USA)
Pagepp. 664 - 669
Keywordperformance monitor unit, side channel attack, transient execution attack, hardware security, information leakage
AbstractThe Performance Monitor Unit (PMU) is a special hardware module in processors that contains a set of counters to record various architectural and micro-architectural events. In this paper, we propose PMU-Leaker, a novel realization of all existing cache side-channel attacks where accurate execution time measurements are replaced by information leaked through PMU. The efficacy of PMU-Leaker is demonstrated by (1) leaking the secret data stored in Intel Software Guard Extensions (SGX) with the transient execution vulnerabilities including Spectre and ZombieLoad and (2) extracting the encryption key of a victim AES performed in SGX. We perform thorough experiments on a DELL Inspiron 15-7560 laptop that has an Intel® Core™ i5-7200U processor with the Kaby Lake architecture and the results show that, among the 176 PMU counters, 24 of them are vulnerable and can be used to launch the PMU-Leaker attack.

8C-2 (Time: 13:25 - 13:50) (Online)
TitleEO-Shield: A Multi-function Protection Scheme against Side Channel and Focused Ion Beam Attacks
Author*Ya Gao, Qizhi Zhang, Haocheng Ma, Jiaji He, Yiqiang Zhao (Tianjin University, China)
Pagepp. 670 - 675
KeywordActive shield, electromagnetic side-channel, side-channel security
AbstractSmart devices, especially Internet-connected devices, typically incorporate security protocols and cryptographic algorithms to ensure the control flow integrity and information security. However, there are various invasive and non-invasive attacks trying to tamper with these devices. Chip-level active shield has been proved to be an effective countermeasure against invasive attacks, but existing active shields cannot be utilized to counter side-channel attacks (SCAs). In this paper, we propose a multi-function protection scheme and an active shield prototype to defend against invasive and non-invasive attacks simultaneously. The protection scheme has a complex active shield implemented using the top metal layer of the chip and an information leakage obfuscation module underneath. The leakage obfuscation module generates its protection patterns based on the operating conditions of the circuit that needs to be protected, thus reducing the correlation between electromagnetic (EM) emanations and cryptographic data. We implement the protection scheme on one Advanced Encryption Standard (AES) circuit to demonstrate the effectiveness of the method. Experiment results demonstrate that the information leakage obfuscation module decreases SNR below 0.6 and reduces the success rate of SCAs. Compared to existing single-function protection methods against physical attacks, the proposed scheme provides good performance against both invasive and non-invasive attacks.

8C-3 (Time: 13:50 - 14:15) (In-person)
TitleCompaSeC: A Compiler-assisted Security Countermeasure to Address Instruction Skip Fault Attacks on RISC-V
Author*Johannes Geier (Technical University of Munich, Germany), Lukas Auer (Fraunhofer Institute for Applied and Integrated Security (AISEC), Germany), Daniel Müller-Gritschneder, Uzair Sharif, Ulf Schlichtmann (Technical University of Munich, Germany)
Pagepp. 676 - 682
KeywordRedundancy, Fault injection attack, Compiler, RISC-V
AbstractFault-injection attacks are a risk for any computing system executing security-relevant tasks, such as a secure boot process. While hardware-based countermeasures to these invasive attacks have been found to be a suitable option, they have to be implemented via hardware extensions and, thus, are not available in most commonly used off-the-shelf (COTS) components. Software-implemented hardware fault tolerance (SIHFT) is therefore the only valid option to enhance a COTS system’s resilience against fault attacks. Established SIHFT techniques usually aim to detect random hardware errors for functional safety and not targeted attacks. Using the example of a secure boot system running on a RISC-V processor, in this work we first show that when the software is hardened by these existing techniques from the safety domain, the vulnerabilities in the boot process to single, double, triple, and quadruple instruction skips cannot be fully closed. We extend these techniques to the security domain and propose Compiler-assisted Security Countermeasure (CompaSeC). We demonstrate that CompaSeC can close all vulnerabilities for the studied secure boot system. To further reduce performance and memory overhead we additionally propose a method for CompaSeC to selectively harden individual vulnerable functions without compromising security against the considered instruction skip faults.

8C-4 (Time: 14:15 - 14:40) (In-person)
TitleTrojan-D2: Post-Layout Design and Detection of Stealthy Hardware Trojans - a RISC-V Case Study
Author*Sajjad Parvin (University of Bremen, Germany), Mehran Goli (University of Bremen/German Research Centre for Artificial Intelligence, Germany), Frank Sill Torres (German Aerospace Center (DLR), Germany), Rolf Drechsler (University of Bremen/German Research Centre for Artificial Intelligence, Germany)
Pagepp. 683 - 689
KeywordOptical Probing, Hardware Trojan, LLSI, Hardware Security
AbstractWith the exponential increase in the popularity of the RISC-V ecosystem, the security of this platform must be re-evaluated especially for mission-critical and IoT devices. Besides, the insertion of a Hardware Trojan (HT) into a chip after the in-house mask design is outsourced to a chip manufacturer abroad for fabrication is a significant source of concern. Though abundant HT detection methods have been investigated based on side-channel analysis, physical measurements, and functional testing to overcome this problem, there exist stealthy HTs that can hide from detection. This is due to the small overhead of such HTs compared to the whole circuit. In this work, we propose several novel HTs that can be placed into a RISC-V core’s post-layout in an untrusted manufacturing environment. Next, we propose a non-invasive analytical method based on contactless optical probing to detect any stealthy HTs. Finally, we propose an open-source library of HTs that can be placed into a processor unit in the post-layout phase. All the designs in this work are done using TSMC 28nm technology.

Session 8D  Simulation and Verification of Quantum Circuits
Time: 13:00 - 14:40, Thursday, January 19, 2023
Location: Room Mars/Mercury
Chair: Shigeru Yamashita (Ritsumeikan University, Japan)

8D-1 (Time: 13:00 - 13:25) (In-person)
TitleGraph Partitioning Approach for Fast Quantum Circuit Simulation
Author*Jaekyung Im, Seokhyeong Kang (POSTECH, Republic of Korea)
Pagepp. 690 - 695
KeywordQuantum Computing, Quantum Circuit, Quantum Circuit Simulation, Quantum Circuit Verification, Graph Partitioning
AbstractOwing to the exponential increase in computational complexity, the fast simulation of large quantum circuits has become very difficult. This is an important challenge for the utilization of quantum computers because it is closely related to the verification of quantum computation by classical machines. The Hybrid Schrodinger-Feynman simulation seems to be a promising solution, but its application is very limited. To address this drawback, we propose an improved simulation method based on graph partitioning. Experimental results show that our approach significantly reduces the simulation time of the Hybrid Schrodinger-Feynman simulation.

8D-2 (Time: 13:25 - 13:50) (In-person)
TitleA Robust Approach to Detecting Non-equivalent Quantum Circuits Using Specially Designed Stimuli
AuthorHsiao-Lun Liu, *Yi-Ting Li (National Tsing Hua University, Taiwan), Yung-Chih Chen (National Taiwan University of Science and Technology, Taiwan), Chun-Yao Wang (National Tsing Hua University, Taiwan)
Pagepp. 696 - 701
KeywordQuantum Circuit, Equivalence Checking
AbstractAs several compilation and optimization techniques have been proposed, equivalence checking for quantum circuits has become essential in design flows. The state-of-the-art approach to this problem observed that even small errors substantially affect the entire quantum system. As a result, it exploited random simulations to prove the non-equivalence of two quantum circuits. However, when errors occurred close to outputs, it was hard for that approach to prove the non-equivalence of some non-equivalent quantum circuits under a limited number of simulations. In this work, we propose a novel simulation-based approach using a set of specially designed stimuli. The number of simulation runs of the proposed approach is linear rather than exponential in the number of quantum bits of a circuit. According to the experimental results, the success rate of our approach is 100% (100%) under a simulation run (execution time) constraint for a set of benchmarks, while that of the state-of-the-art is only 69% (74%) on average. Our approach also achieves a speedup of 26 on average.

8D-3 (Time: 13:50 - 14:15) (In-person)
TitleEquivalence Checking of Parameterized Quantum Circuits: Verifying the Compilation of Variational Quantum Algorithms
Author*Tom Peham (Technical University of Munich, Germany), Lukas Burgholzer (Johannes Kepler University Linz, Austria), Robert Wille (Technical University of Munich, Germany)
Pagepp. 702 - 708
KeywordQuantum Computing, Verification, ZX-calculus, Variational Algorithms, NISQ
AbstractVariational quantum algorithms have been introduced as a promising class of quantum-classical hybrid algorithms that can already be used with the noisy quantum computing hardware available today by employing parameterized quantum circuits. Considering the non-trivial nature of quantum circuit compilation and the subtleties of quantum computing, it is essential to verify that these parameterized circuits have been compiled correctly. Established equivalence checking procedures that handle parameter-free circuits already exist. However, no methodology capable of handling circuits with parameters has been proposed yet. This work fills this gap by showing that verifying the equivalence of parameterized circuits can be achieved in a purely symbolic fashion using an equivalence checking approach based on the ZX-calculus. At the same time, proofs of inequality can be efficiently obtained with conventional methods by taking advantage of the degrees of freedom inherent to parameterized circuits. We implemented the corresponding methods and proved that the resulting methodology is complete. Experimental evaluations (using the entire parametric ansatz circuit library provided by Qiskit as benchmarks) demonstrate the efficacy of the proposed approach.

8D-4 (Time: 14:15 - 14:40) (In-person)
TitleSoftware Tools for Decoding Quantum Low-Density Parity Check Codes
Author*Lucas Berent (Technical University of Munich, Germany), Lukas Burgholzer (Johannes Kepler University Linz, Austria), Robert Wille (Technical University of Munich, Germany)
Pagepp. 709 - 714
Keywordquantum error correction, software tools, QLDPC codes
AbstractQuantum Error Correction (QEC) is an essential field of research towards the realization of large-scale quantum computers. On the theoretical side, a lot of effort is put into designing error-correcting codes that protect quantum data from errors, which inevitably happen due to the noisy nature of quantum hardware and quantum bits (qubits). Protecting data with an error-correcting code necessitates means to recover the original data, given a potentially corrupted data set - a task referred to as decoding. It is vital that decoding algorithms can recover error-free states in an efficient manner. While theoretical properties of certain QEC methods have been extensively studied, good techniques to analyze their performance in practically more relevant settings remain a widely unexplored area. In this work, we propose a set of software tools that facilitate numerical experiments with so-called Quantum Low-Density Parity-Check codes (QLDPC codes) - a broad class of codes, some of which have recently been shown to be asymptotically good. Based on that, we provide an implementation of a general decoder for QLDPC codes. On top of that, we propose a highly efficient heuristic decoder that eliminates the runtime bottlenecks of the general QLDPC decoder while still maintaining comparable decoding performance. These tools eventually make it possible to confirm theoretical results around QLDPC codes in a more practical setting and showcase the value of software tools (in addition to theoretical considerations) for investigating codes for practical applications. The resulting tool, which is publicly available at https://github.com/cda-tum/qecc as part of the Munich Quantum Toolkit (MQT), is meant to provide a playground for the search for "practically good" quantum codes.

Session 8E  (DF-3) Edge AI Design
Time: 13:00 - 14:15, Thursday, January 19, 2023
Location: Miraikan Hall
Organizer/Chair: Yohei Nakata (Panasonic Holdings Corporation, Japan)

8E-1 (Time: 13:00 - 13:20)
Title(Designers' Forum) Neuromorphic Computing Expanding AI Coverage at the Edge with Ultra-Low Energy Consumption
AuthorKazuhisa Fujimoto (Hitachi, Ltd., Japan)
AbstractNeuromorphic computing, which is based on a spiking neural network (SNN) model that mimics brain processing to reduce energy consumption to 1% or less of that of a GPU, has been an active research topic in recent years with the emergence of neuromorphic devices. The challenge is to develop optimal SNN models, algorithms, and engineering technologies for real use cases. Various applications have been investigated to address these challenges. Neuromorphic computing can be a fundamental technology to support future AI development, especially in edge computing because of its need for ultra-low energy consumption. Application studies at the edge are presented in this talk.

8E-2 (Time: 13:20 - 13:40)
Title(Designers' Forum) HERO: Hessian-Enhanced Robust Optimization for Unifying and Improving Generalization and Quantization Performance
AuthorHuanrui Yang (University of California, Berkeley, USA)
AbstractWith the recent demand for deploying neural network models on mobile and edge devices, it is desirable to improve the model's generalizability on unseen testing data, as well as enhance the model's robustness under post-training fixed-point quantization with dynamic precision for efficient deployment. Minimizing the training loss, however, provides few guarantees on the generalization and quantization performance. In this work, we improve generalization and quantization performance simultaneously by theoretically unifying them under the framework of robustness against bounded weight perturbation. We therefore propose HERO, a Hessian-enhanced robust optimization method, to minimize the Hessian eigenvalues through a gradient-based training process, simultaneously improving the generalization and quantization performance.

8E-3 (Time: 13:40 - 14:00)
Title(Designers' Forum) Object-Based Fusion System in ADAS
AuthorYan Zheng (Huayu Automotive Systems Co., Ltd., China)
AbstractAs different types of sensors are applied in ADAS, most ADAS systems are becoming multi-sensor systems rather than single-sensor systems, and fusion techniques are widely applied in these systems. Generally speaking, a fusion system has three levels. The first level is the sensor level, which may include sensors such as cameras, millimeter wave radar, ultrasonic radar, and Lidar. The second level is the fusion level, which integrates the object information detected by sensors and outputs the fused object with speed, distance, and other properties. The third level is the application level, which sends the brake or steering request to the vehicle actuators to realize automatic longitudinal and lateral control. In this talk, the interrelationship between levels and the system will be explained.

8E-4 (Time: 14:00 - 14:20)
Title(Designers' Forum) Millimeter-Wave Radar: A New Approach for Privacy Protection Human Sensing
AuthorJun Tian (Fujitsu R&D Center Co., Ltd., China)
AbstractAs society ages, sensing approaches that protect elderly people from harm such as falls become critical to caring for them better. Video can realize this function effectively. However, because of privacy concerns, it is not suitable for bedrooms, toilets, and other places where falls happen frequently. mmWave radar, which has progressed rapidly in recent years, is a good candidate for realizing the sensing function in such scenarios. In this report, research on human fall detection technologies using mmWave radar, based on two sensing methodologies, is introduced.

Session 9A  (SS-7) Learning x Security in DFM
Time: 15:00 - 16:40, Thursday, January 19, 2023
Location: Room Saturn
Chair: Bei Yu (Chinese University of Hong Kong, Hong Kong)

9A-1 (Time: 15:00 - 15:25) (Online)
Title(Invited Paper) Enabling Scalable AI Computational Lithography with Physics-Inspired Models
Author*Haoyu Yang, Haoxing Ren (NVIDIA, USA)
Pagepp. 715 - 720
KeywordPhysics-Inspired Machine Learning, Computational Lithography, Lithography Modeling, Mask Optimization
AbstractComputational lithography is a critical research area for the continued scaling of semiconductor manufacturing process technology by enhancing silicon printability via numerical computing methods. Today’s solutions for these problems are primarily CPU-based and require many thousands of CPUs running for days to tape out a modern chip. We seek AI/GPU-assisted solutions for the two problems, aiming at improving both runtime and quality. Prior academic research has proposed using machine learning for lithography modeling and mask optimization, typically represented as image-to-image mapping problems, where convolution layer backboned UNets and ResNets are applied. However, due to the lack of domain knowledge integrated into the framework designs, these solutions have been limited by their application scenarios or performance. Our method aims to tackle the limitations of such previous CNN-based solutions by introducing lithography bias into the neural network design, yielding a much more efficient model design and significant performance improvements.

9A-2 (Time: 15:25 - 15:50)
Title(Invited Paper) Canceled

9A-3 (Time: 15:50 - 16:15) (In-person)
Title(Invited Paper) Data-Driven Approaches for Process Simulation and Optical Proximity Correction
AuthorHao-Chiang Shao (National Chung Hsing University, Taiwan), Chia-Wen Lin (National Tsing Hua University, Taiwan), *Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 721 - 726
KeywordLithography simulation, Optical proximity correction, Machine learning
AbstractWith continuous shrinking of process nodes, semiconductor manufacturing encounters more and more serious inconsistency between designed layout patterns and resulting wafer images. Conventionally, examining how a layout pattern can deviate from its original after complicated process steps, such as optical lithography and subsequent etching, relies on computationally expensive process simulation, which suffers from incredibly long runtime for large-scale circuit layouts, especially in advanced nodes. In addition, being one of the most important and commonly adopted resolution enhancement techniques, optical proximity correction (OPC) corrects image errors due to process effects by moving segment edges or adding extra polygons to mask patterns, while it is generally driven by simulation or time-consuming inverse lithography techniques (ILTs) to achieve acceptable accuracy. As a result, more and more state-of-the-art works on process simulation and/or OPC resort to the fast inference characteristic of machine/deep learning. This paper reviews these data-driven approaches to highlight the challenges in various aspects, explore preliminary solutions, and reveal possible future directions to push forward the frontiers of the research in design for manufacturability.

9A-4 (Time: 16:15 - 16:40) (Online)
Title(Invited Paper) Mixed-Type Wafer Failure Pattern Recognition
Author*Hao Geng (ShanghaiTech University, China), Qi Sun, Tinghuan Chen (Chinese University of Hong Kong, Hong Kong), Qi Xu (USTC, China), Tsung-Yi Ho, Bei Yu (Chinese University of Hong Kong, Hong Kong)
Pagepp. 727 - 732
KeywordMixed-Type Wafer Failure, Pattern Recognition, Survey
AbstractThe ongoing evolution in process fabrication enables us to step below the 5nm technology node. Although foundries can pattern and etch smaller but more complex circuits on silicon wafers, a multitude of challenges persist. For example, defects on the surface of wafers are inevitable during manufacturing. To increase the yield rate and reduce time-to-market, it is vital to recognize these failures and identify the failure mechanisms of these defects. Recently, applying machine learning-powered methods to combat single defect pattern classification has made significant progress. However, as the processes become increasingly complicated, various single-type defect patterns may emerge and be coupled on a wafer and thus shape a mixed-type pattern. In this paper, we will survey the recent pace of progress on advanced methodologies for wafer failure pattern recognition, especially for mixed-type ones. We sincerely hope this literature review can highlight the future directions and promote the advancement of wafer failure pattern recognition.

Session 9B  Lightweight Models for Edge AI
Time: 15:00 - 16:40, Thursday, January 19, 2023
Location: Room Uranus
Chair: Yiran Chen (Duke University, USA)

9B-1 (Time: 15:00 - 15:25) (Online)
TitleAccelerating Convolutional Neural Networks in Frequency Domain via Kernel-sharing Approach
Author*Bosheng Liu, Hongyi Liang, Jigang Wu (Guangdong University of Technology, China), Xiaoming Chen (Institute of Computing Technology, Chinese Academy of Sciences, China), Peng Liu (Guangdong University of Technology, China), Yinhe Han (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 733 - 738
Keywordacceleration, frequency-domain DNN architecture
AbstractConvolutional neural networks (CNNs) are typically computationally heavy. Fast algorithms such as fast Fourier transforms (FFTs) are promising in significantly reducing computation complexity by replacing convolutions with frequency-domain element-wise multiplication. However, the increased memory access overhead of complex weights counteracts the computing benefit, because frequency-domain convolutions not only pad weights to the same size as input maps, but also have no sharable complex kernel weights. In this work, we propose an FFT-based kernel-sharing technique called FS-Conv to reduce memory access. Based on FS-Conv, we derive the sharable complex weights in frequency-domain convolutions, which has never been solved. FS-Conv includes a hybrid padding approach, which utilizes the inherent periodic characteristic of FFT transformation to provide sharable complex weights for different blocks of complex input maps. In addition, we build a frequency-domain inference accelerator that can utilize the sharable complex weights for CNN accelerations. Evaluation results demonstrate the significant performance and energy efficiency benefits compared with the state-of-the-art baseline.

9B-2 (Time: 15:25 - 15:50) (Online)
TitleMortar: Morphing the Bit Level Sparsity for General Purpose Deep Learning Acceleration
AuthorYunhung Gao (Peking University, China), Hongyan Li (State Key Lab of Processors, Institute of Computing Technology, CAS. University of Chinese Academy of Sciences, China), Kevin Zhang (Peking University, China), Xueru Yu (Shanghai Integrated Circuits R&D Center Co. Ltd., China), *Hang Lu (State Key Lab of Processors, Institute of Computing Technology, CAS. University of Chinese Academy of Sciences, China)
Pagepp. 739 - 744
KeywordDeep Neural Network, Bit-level sparsity, Deep Learning Accelerator
AbstractVanilla Deep Neural Networks (DNN) after training are represented with native floating-point 32 (fp32) weights. We observe that the bit-level sparsity of these weights is very abundant in the mantissa and can be directly exploited to speed up model inference. In this paper, we propose 'Mortar', an off-line/on-line collaborated approach for fp32 DNN acceleration, which includes two parts: first, an off-line bit sparsification algorithm to construct the target formulation by “mantissa morphing”, which maintains higher model accuracy while increasing bit-level sparsity; second, the associated hardware accelerator architecture to speed up the on-line fp32 inference through manipulating the enlarged bit sparsity. We highlight the following results by evaluating various deep learning tasks, including image classification, object detection, video understanding, video & image super-resolution, etc.: (1) we increase bit-level sparsity up to 1.28~2.51x with only a negligible -0.09~0.23% accuracy loss, (2) maintain on average 3.55% higher model accuracy while increasing more bit-level sparsity than the baseline, (3) our hardware accelerator outperforms the baseline by up to 4.8x, with an area of 0.031 mm² and power of 68.58 mW.

9B-3 (Time: 15:50 - 16:15) (Online)
TitleData-Model-Circuit Tri-design for Ultra-light Video Intelligence on Edge Devices
AuthorYimeng Zhang (Michigan State University, USA), *Akshay Karkal Kamath (Georgia Institute of Technology, USA), Qiucheng Wu (University of California, Santa Barbara, USA), Zhiwen Fan, Wuyang Chen, Zhangyang Wang (University of Texas at Austin, USA), Shiyu Chang (University of California, Santa Barbara, USA), Sijia Liu (Michigan State University, USA), Cong Hao (Georgia Institute of Technology, USA)
Pagepp. 745 - 750
Keywordsoftware-hardware co-design, Multi-object tracking, data efficiency, model compression
AbstractIn this paper, we propose a data-model-hardware tri-design framework for high-throughput, low-cost, and high-accuracy multi-object tracking (MOT) on High-Definition (HD) video stream. First, to enable ultra-light video intelligence, we propose temporal frame-filtering and spatial saliency-focusing approaches to reduce the complexity of massive video data. Second, we exploit structure-aware weight sparsity to design a hardware-friendly model compression method. Third, assisted with data and model complexity reduction, we propose a sparsity-aware, scalable, and low-power accelerator design, aiming to deliver real-time performance with high energy efficiency. Different from existing works, we make a solid step towards the synergized software/hardware co-optimization for realistic MOT model implementation. Experiments at both software and hardware levels are conducted to demonstrate the effectiveness. Compared to the state-of-the-art MOT baseline, our tri-design approach can achieve 12.5× latency reduction, 20.9× effective frame rate improvement, 5.83× lower power, and 9.78× better energy efficiency, without much accuracy drop.

9B-4 (Time: 16:15 - 16:40) (In-person)
TitleLatent Weight-based Pruning for Small Binary Neural Networks
Author*Tianen Chen (University of Wisconsin-Madison, USA), Noah Anderson (Stanford University, USA), Younghyun Kim (University of Wisconsin-Madison, USA)
Pagepp. 751 - 756
Keywordmachine learning, pruning, binary neural networks, latent weight
AbstractBinary neural networks (BNNs) substitute complex arithmetic operations with simple bit-wise operations. The binarized weights and activations in BNNs can drastically reduce memory requirements and energy consumption, making them attractive for edge ML applications with limited resources. However, the severe memory capacity and energy constraints of low-power edge devices call for further reduction of BNN models beyond binarization. Weight pruning is a proven solution for reducing the size of many neural network (NN) models, but the binary nature of BNN weights makes it difficult to identify insignificant weights to remove. In this paper, we present a pruning method based on latent weights with layer-level pruning sensitivity analysis, which reduces the over-parameterization of BNNs, allowing for accuracy gains while drastically reducing the model size. Our method advocates for a heuristic that distinguishes weights by their latent weights, a real-valued vector used to compute the pseudogradient during backpropagation. It is tested using three different convolutional NNs on the MNIST, CIFAR-10, and Imagenette datasets, with results indicating a 34%-46% reduction in operation count with no accuracy loss, improving upon previous works in accuracy, model size, and total operation count.

Session 9D  Design Automation for Emerging Devices
Time: 15:00 - 16:40, Thursday, January 19, 2023
Location: Room Mars/Mercury
Chair: Frank Sill Torres (DLR, Germany)

9D-1 (Time: 15:00 - 15:25) (Online)
TitleAutoFlex: Unified Evaluation and Design Framework for Flexible Hybrid Electronics
Author*Tianliang Ma, Zhihui Deng, Leilai Shao (Shanghai Jiaotong University, China)
Pagepp. 757 - 762
Keywordflexible hybrid electronics, design automation, heterogeneous system design, flexible electronics, system-level simulation
AbstractFlexible hybrid electronics (FHE), integrating high performance silicon chips with multi-functional sensors and actuators on flexible substrates, can be intimately attached onto irregular surfaces without compromising their functionalities, thus enabling more innovations in healthcare, internet of things (IoTs) and various human-machine interfaces (HMIs). Recent developments on compact models and process design kits (PDKs) of flexible electronics have made designs of small to medium flexible circuits feasible. However, the absence of a unified model and comprehensive evaluation benchmarks for flexible electronics makes it infeasible for a designer to fairly compare different flexible technologies and to explore potential design options for a heterogeneous FHE design. In this paper, we present AutoFlex, a unified evaluation and design framework for flexible hybrid electronics, where device parameters can be extracted automatically and performance can be evaluated comprehensively from device levels, digital blocks to large-scale digital circuits. Moreover, a ubiquitous FHE sensor acquisition system, including a flexible multi-functional sensor array, scan drivers, amplifiers and a silicon based analog-to-digital converter (ADC), is developed to reveal the design challenges of a representative FHE system.

9D-2 (Time: 15:25 - 15:50) (In-person)
TitleCNFET7: An Open Source Cell Library for 7-nm CNFET Technology
Author*Chenlin Shi, Shinobu Miwa (The University of Electro-Communications, Japan), Tongxin Yang, Ryota Shioya (The University of Tokyo, Japan), Hayato Yamaki, Hiroki Honda (The University of Electro-Communications, Japan)
Pagepp. 763 - 768
KeywordCNFET, cell library, logic synthesis
AbstractIn this paper, we propose CNFET7, the first open-source cell library for 7-nm carbon nanotube field-effect transistor (CNFET) technology. CNFET7 is based on an open-source CNFET SPICE model called VS-CNFET, and various model parameters such as the channel width and carbon nanotube diameter are carefully tuned to mimic the predictive 7-nm CNFET technology presented in a published paper. Some undisclosed parameters, such as the cell size and pin layout, are derived from those of the NanGate 15-nm open-source cell library in the same way as for an open-source framework for CNFET circuit design. CNFET7 includes two types of delay model (i.e., the composite current source and nonlinear delay model), each having 56 cells, such as INV_X1 and BUF_X1. CNFET7 supports both logic synthesis and timing-driven place and route in the Cadence design flow. Our experimental results for several synthesized circuits show that CNFET7 has reductions of up to 96%, 62% and 82% in dynamic and static power consumption and critical-path delay, respectively, when compared with ASAP7.

9D-3 (Time: 15:50 - 16:15) (In-person)
TitleA Global Optimization Algorithm for Buffer and Splitter Insertion in Adiabatic Quantum-Flux-Parametron Circuits
Author*Rongliang Fu (The Chinese University of Hong Kong, Hong Kong), Mengmeng Wang (Yokohama National University, Japan), Yirong Kan (Nara Institute of Science and Technology, Japan), Nobuyuki Yoshikawa (Yokohama National University, Japan), Tsung-Yi Ho (The Chinese University of Hong Kong, Hong Kong), Olivia Chen (Tokyo City University, Japan)
Pagepp. 769 - 774
Keywordsuperconducting electronics, AQFP, buffer and splitter insertion
AbstractAs a highly energy-efficient application of low-temperature superconductivity, the adiabatic quantum-flux-parametron (AQFP) logic circuit has extremely low power consumption, making it an attractive candidate for extremely energy-efficient computing systems. Since logic gates are driven by the alternating current (AC) serving as the clock signal in AQFP circuits, plenty of AQFP buffers are required to ensure that the dataflow is synchronized at all logic levels of the circuit. Meanwhile, since the currently developed AQFP logic gates can only drive a single output, splitters are required by logic gates to drive multiple fan-outs. These buffers and splitters take up a significant amount of the circuit's area and delay. This paper proposes a global optimization algorithm for buffer and splitter (B/S) insertion to address the issues above. The B/S insertion is first identified as a combinatorial optimization problem, and a dynamic programming formulation is presented to find the global optimal solution. Because the search space of this formulation is impractically large, an integer linear programming formulation is proposed to explore the global optimization of B/S insertion approximately. Experimental results on the ISCAS'85 and simple arithmetic benchmark circuits show the effectiveness of the proposed method, with an average reduction of 8.22% and 7.37% in the number of buffers and splitters inserted compared to the state-of-the-art methods from ICCAD'21 and DAC'22, respectively.
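
As an editorial illustration of why clocked AQFP logic needs balancing buffers (not the paper's dynamic programming or ILP formulation), the sketch below assigns each gate its earliest possible level and counts the buffers needed so that every connection spans exactly one level. It deliberately ignores splitters and primary-output balancing; the netlist encoding and function names are hypothetical.

```python
def asap_levels(fanins):
    """Assign each node its earliest level; primary inputs sit at level 0.

    fanins maps a node name to the list of nodes driving it.
    """
    levels = {}

    def level(node):
        if node not in levels:
            preds = fanins[node]
            levels[node] = 0 if not preds else 1 + max(level(p) for p in preds)
        return levels[node]

    for node in fanins:
        level(node)
    return levels


def count_balancing_buffers(fanins):
    """Count buffers needed so every edge connects adjacent levels.

    An edge from level a to level b (b > a) needs b - a - 1 buffers.
    Simplification: splitters and output balancing are not modeled.
    """
    levels = asap_levels(fanins)
    return sum(
        levels[node] - levels[pred] - 1
        for node, preds in fanins.items()
        for pred in preds
    )


if __name__ == "__main__":
    # g2 reads input "a" directly and also through g1, so the direct
    # a -> g2 connection skips one level and needs one buffer.
    netlist = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "a"]}
    print(count_balancing_buffers(netlist))
```

A fixed earliest-level assignment like this is only a baseline; choosing levels to minimize the total count is where the optimization problem described in the abstract arises.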

9D-4 (Time: 16:15 - 16:40) (In-person)
TitleFLOW-3D: Flow-Based Computing on 3D Nanoscale Crossbars with Minimal Semiperimeter
Author*Sven Thijssen (University of Central Florida, USA), Sumit Kumar Jha (University of Texas at San Antonio, USA), Rickard Ewetz (University of Central Florida, USA)
Pagepp. 775 - 780
Keywordin-memory, emerging technology, memristor
AbstractThe emergence of data-intensive applications has spurred interest in in-memory computing using nanoscale crossbars. Flow-based in-memory computing is a promising approach for evaluating Boolean logic using the natural flow of electrical currents. While automated synthesis approaches have been developed for 2D crossbars, 3D crossbars have advantageous properties in terms of density, area, and performance. In this paper, we propose the first framework for performing flow-based computing using 3D crossbars. The framework, called FLOW-3D, takes a Boolean function specified in a hardware description language and automatically synthesizes it into a crossbar design. The FLOW-3D framework is based on an analogy between BDDs and nanoscale crossbars, which allows the synthesis of 3D crossbar designs with minimal semiperimeter. In particular, nodes and edges in a BDD correspond to metal wires and memristors, respectively. This allows a BDD with n nodes to be mapped to a 3D nanoscale crossbar with (n+k) metal wires. The k extra metal wires are needed to handle hardware-imposed constraints. We evaluate FLOW-3D using 15 circuits from the RevLib benchmark suite. Compared with the state-of-the-art synthesis tool for 2D crossbars, FLOW-3D improves semiperimeter, area, energy consumption, and latency by up to 61%, 84%, 37%, and 41%, respectively.

Session 9E  (DF-4) Panel Discussion: Aiming Direction of DX System Design from Hardware to Application
Time: 15:00 - 16:15, Thursday, January 19, 2023
Location: Miraikan Hall
Organizer/Chair: Koichiro Yamashita (Fujitsu R&D Center Co., Ltd., China)

9E-1 (Time: 15:00 - 16:15)
Title(Panel Discussion) Aiming Direction of DX System Design from Hardware to Application
AuthorPanelists: Masayuki Ito (NSITEXE Inc., Japan), Teruo Tanimoto (Kyushu University, Japan), Takuji Miki (Kobe University, Japan), Naoki Shimasaki (Panasonic Holdings Corporation, Japan), Yasuharu Ota (Canon Inc., Japan), Tuan Thanh Ta (Toshiba Corp. R&D Center, Japan), Kazuhisa Fujimoto (Hitachi, Ltd., Japan), Huanrui Yang (University of California, Berkeley, USA), Yan Zheng (Huayu Automotive Systems Co.,Ltd., China), Jun Tian (Fujitsu R&D Center Co.,Ltd., China)
AbstractThis panel seeks a common topic or keyword among Designer's Forum speakers from different layers and fields. The oral sessions cover both hot and traditional fields such as "next-generation computers," "sensors," and "edge," and "Automotive" emerged as a common keyword. The theme of the panel discussion is "Aiming Direction of DX (Digital Transformation) System Design from Hardware to Application," and opinions will be exchanged with automotive as a specific application topic. Speakers of the oral sessions will join again as panelists, as far as time allows, to conduct meaningful discussions. The speakers, who come from various layers and different fields, hope that it will be an opportunity to discover new perspectives and collaboration in the future.