The 29th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule

Monday, January 22, 2024

T1  (Room 204)
Tutorial to NeuroSim: A Versatile Benchmark Framework for AI Hardware
9:30 - 12:30
T2  (Room 205)
Toward Robust Neural Network Computation on Emerging Crossbar-based Hardware and Digital Systems
9:30 - 12:30
T3  (Room 206)
Morpher: A Compiler and Simulator Framework for CGRA
9:30 - 12:30
T4  (Room 207)
Machine Learning for Computational Lithography
9:30 - 12:30
Break
12:30 - 14:00
T5  (Room 204)
Low Power Design: Current Practice and Opportunities
14:00 - 17:00
T6  (Room 205)
Leading the industry, Samsung CXL Technology
14:00 - 17:00
T7  (Room 206)
Sparse Acceleration for Artificial Intelligence: Progress and Trends
14:00 - 17:00
T8  (Room 207)
CircuitOps and OpenROAD: Unleashing ML EDA for Research and Education
14:00 - 17:00

Tuesday, January 23, 2024

1K  (Room Premier A/B)
Opening and Keynote Session I
9:00 - 10:30
Coffee Break
10:30 - 10:45
1A  (Room 204)
Emerging NoC Designs
10:45 - 12:00
1B  (Room 205)
When AI Meets Edge Devices
10:45 - 12:00
1C  (Room 206)
Innovations in New Computing Paradigms: Stochastic, Hyper-Dimensional and High-Performance Computing
10:45 - 12:00
1D  (Room 207)
3D IC
10:45 - 12:00
1E  (Room 107/108)
(SS-1) Open-Source EDA Algorithms and Software
10:45 - 12:25
1F  (Room 110/111)
(DF-4) Advanced EDA using AI/ML at Synopsys
10:45 - 11:45
2A  (Room 204)
Frontiers in Embedded and Edge AI: from Adversarial Attacks to Intelligent Homes
13:30 - 14:45
2B  (Room 205)
Innovations in Quantum EDA: from Design to Deployment
13:30 - 15:10
2C  (Room 206)
Architecting for Dependability: System Design with Compute-in-Memory
13:30 - 15:10
2D  (Room 207)
ML for Physical Design and Timing
13:30 - 15:10
2E  (Room 107/108)
University Design Contest
13:30 - 15:20
2F  (Room 110/111)
(DF-1) Next-Generation AI Semiconductor Design
13:30 - 15:10
3A  (Room 204)
GPU and Custom Accelerators
15:30 - 17:35
3B  (Room 205)
Hardware Acceleration for Graph Neural Networks and New Models
15:30 - 17:35
3C  (Room 206)
New Frontiers in Verification and Simulation
15:30 - 17:35
3D  (Room 207)
Partition and Placement
15:30 - 17:35
3E  (Room 107/108)
(SS-2) Hardware Security -- A True Multidisciplinary Research Area
15:30 - 17:35
3F  (Room 110/111)
(DF-2) Heterogeneous Integration and Chiplet Design
15:30 - 17:10
1S  (Room 204/205)
SIGDA Student Research Forum
18:00 - 21:00

Wednesday, January 24, 2024

4A  (Room 204)
Detection Techniques for SoC Vulnerability and Malware
9:00 - 10:15
4B  (Room 205)
Design for Manufacturability: from Rule Checking to Yield Optimization
9:00 - 10:15
4C  (Room 206)
Advances in Logic Synthesis and Optimization
9:00 - 10:15
4D  (Room 207)
Learning-Based Optimization for RF/Analog Circuit
9:00 - 10:15
4E  (Room 107/108)
(SS-3) LLM Acceleration and Specialization for Circuit Design and Edge Applications
8:35 - 10:15
2K  (Room Premier A/B)
Keynote Session II
10:30 - 11:30
5A  (Room 204)
Bridging Memory, Storage, and Data Processing Techniques
13:00 - 14:40
5B  (Room 205)
Exploring EDA’s Next Frontier: AI-Driven Innovative Design Methods
13:00 - 14:40
5C  (Room 206)
New Frontiers in Testing
13:00 - 14:40
5D  (Room 207)
FPGA-Based Neural Network Accelerator Designs and Applications
13:00 - 14:40
5E  (Room 107/108)
(DF-3) AI/ML for Chip Design and EDA - Current Status and Future Perspectives from Diverse Views
13:00 - 14:40
6A  (Room 204)
Enabling Techniques to Make CIM Small and Flexible
15:00 - 16:40
6B  (Room 205)
Emerging Trends in Hardware Design: from Biochips to Quantum Systems
15:00 - 17:05
6C  (Room 206)
New Advances in Logic Locking and Side-Channel Analysis
15:00 - 17:05
6D  (Room 207)
Advanced Simulation and Modeling
15:00 - 17:05
6E  (Room 107/108)
(SS-4) Cutting-Edge Techniques for EDA in Analog/Mixed-Signal ICs
15:00 - 16:40
Banquet (Grand Ballroom A/B)
18:00 - 19:30
CEDA activity (Grand Ballroom A/B)
18:00 - 18:15
3K  (Grand Ballroom A/B)
Keynote Session III
18:15 - 19:15

Thursday, January 25, 2024

7A  (Room 204)
System Performance and Debugging
9:00 - 10:15
7B  (Room 205)
Innovations in Autonomous Systems: Hyperdimensional Computing and Emerging Application Frontiers
9:00 - 10:15
7C  (Room 206)
Productivity Management for High Level Design
9:00 - 10:15
7D  (Room 207)
Emerging Memory Yield Optimization and Modeling
9:00 - 10:15
7E  (Room 107/108)
(SS-5) Enabling Chiplet-based Custom Designs
8:35 - 10:15
4K  (Room Premier A/B)
Keynote Session IV
10:30 - 11:30
8A  (Room 204)
Advances in Efficient Embedded Computing: from Hardware Accelerator to Task Management
13:00 - 14:40
8B  (Room 205)
In-Memory Computing Architecture Design and Logic Synthesis
13:00 - 14:40
8C  (Room 206)
Firing Less for Evolution: Quantization & Learning Spikes
13:00 - 14:40
8D  (Room 207)
New Techniques for Photonics and Analog Circuit Design
13:00 - 14:40
8E  (Room 107/108)
(JW-1) TILOS & AI-EDA Joint Workshop - I
13:00 - 14:40
9A  (Room 204)
Advancing AI Algorithms: Faster, Smarter, and More Efficient
15:00 - 16:40
9B  (Room 205)
Design Explorations for Neural Network Accelerators
15:00 - 17:05
9C  (Room 206)
High-Level Security Verification and Efficient Implementation
15:00 - 17:05
9D  (Room 207)
Routing
15:00 - 16:15
9E  (Room 107/108)
(JW-2) TILOS & AI-EDA Joint Workshop - II
15:00 - 16:40

DF: Designers' Forum, SS: Special Session, JW: Joint Workshop

List of papers

Monday, January 22, 2024

Session T1  Tutorial to NeuroSim: A Versatile Benchmark Framework for AI Hardware
Time: 9:30 - 12:30, Monday, January 22, 2024
Location: Room 204
Chair: Shimeng Yu (Georgia Institute of Technology, USA)

Title: (Tutorial) Tutorial to NeuroSim: A Versatile Benchmark Framework for AI Hardware
Author: Shimeng Yu (Georgia Institute of Technology, USA)
Keyword: AI hardware
Abstract: NeuroSim is a widely used open-source simulator for benchmarking AI hardware. It is primarily developed for compute-in-memory (CIM) accelerators for deep neural network (DNN) inference and training, with hierarchical design options from the device level to the circuit level and up to the algorithm level. It is timely to hold a tutorial introducing the research community to the new features and recent updates of NeuroSim, including technology support down to the 1 nm node, new modes of operation such as digital CIM (DCIM) and compute-in-3D NAND, and the heterogeneous 3D integration of chiplets to support ultra-large AI models. During this tutorial, a real-time demo will also be shown to provide hands-on experience for attendees to use and modify the NeuroSim framework to suit their own research purposes.

Session T2  Toward Robust Neural Network Computation on Emerging Crossbar-based Hardware and Digital Systems
Time: 9:30 - 12:30, Monday, January 22, 2024
Location: Room 205
Chair: Masanori Hashimoto (Kyoto University, Japan)

Title: (Tutorial) Toward Robust Neural Network Computation on Emerging Crossbar-based Hardware and Digital Systems
Author: Yiyu Shi (University of Notre Dame, USA), Masanori Hashimoto (Kyoto University, Japan)
Keyword: Robust neural network
Abstract: As a promising alternative to traditional neural network computation platforms, Compute-in-Memory (CiM) neural accelerators based on emerging devices have been intensively studied. These accelerators present an opportunity to overcome memory access bottlenecks. However, they face significant design challenges. The non-ideal conditions resulting from the manufacturing process of these devices induce uncertainties. Consequently, the actual weight values in deployed accelerators may deviate from those trained offline in data centers, leading to performance degradation. The first part of this tutorial will cover: 1. Efficient worst-case analysis for neural network inference using emerging device-based CiM, 2. Enhancement of worst-case performance through noise-injection training, 3. Co-design of software and neural architecture specifically for emerging device-based CiMs. Deep Neural Networks (DNNs) are currently operated on GPUs in both cloud servers and edge-computing devices, with recent applications extending to safety-critical areas like autonomous driving. Accordingly, the reliability of DNNs and their hardware platforms is garnering increased attention. This talk will focus on soft errors, predominantly caused by cosmic rays, a major error source during a device's lifetime. While DNNs are inherently robust against bit flips, these errors can still lead to severe miscalculations due to weight and activation perturbations, bit flips in AI accelerators, errors in their interfaces with microcontrollers, etc. The latter part of this tutorial will discuss: 1. Identification of vulnerabilities in neural networks, 2. Reliability analysis and enhancement of AI accelerators for edge computing, 3. Reliability assessment of GPUs against soft errors.

Session T3  Morpher: A Compiler and Simulator Framework for CGRA
Time: 9:30 - 12:30, Monday, January 22, 2024
Location: Room 206
Chair: Tulika Mitra (National University of Singapore, Singapore)

Title: (Tutorial) Morpher: A Compiler and Simulator Framework for CGRA
Author: Tulika Mitra, Zhaoying Li, Thilini Kaushalya Bandara (National University of Singapore, Singapore)
Abstract: Coarse-Grained Reconfigurable Architecture (CGRA) provides a promising pathway to scale the performance and energy efficiency of computing systems by accelerating compute-intensive loop kernels. However, there exists no end-to-end open-source toolchain for CGRAs that supports architectural design space exploration, compilation, simulation, and FPGA emulation for real-world applications. This hands-on tutorial presents Morpher, an open-source end-to-end compilation and simulation framework for CGRA, featuring state-of-the-art mappers, assisting in design space exploration, and enabling application-level testing of CGRAs. Morpher can take a real-world application with a compute-intensive kernel and a user-provided CGRA architecture as input, compile the kernel using different mapping methods, and automatically validate the compiled binaries through cycle-accurate simulation using test data extracted from the application. Thanks to its feature-rich compiler, Morpher can handle real-world application kernels without being limited to simple toy kernels. Morpher's architecture description language allows users to easily specify a variety of architectural features such as complex interconnects, multi-hop routing, and memory organizations. Morpher is available online at https://github.com/ecolab-nus/morpher.

Session T4  Machine Learning for Computational Lithography
Time: 9:30 - 12:30, Monday, January 22, 2024
Location: Room 207
Chair: Yonghwi Kwon (Synopsys, USA)

Title: (Tutorial) Machine Learning for Computational Lithography
Author: Yonghwi Kwon (Synopsys, USA), Haoyu Yang (NVIDIA Research, USA)
Keyword: Machine learning, Lithography
Abstract: As technology nodes shrink, the number of mask layers and the pattern density have been increasing exponentially. This has led to a growing need for faster and more accurate mask optimization techniques to achieve high manufacturing yield and faster turn-around time (TAT). Machine learning has emerged as a promising solution to this challenge, as it can be used to automate and accelerate the mask optimization process. This tutorial will introduce recent studies on using machine learning for computational lithography. We will start with a comprehensive introduction to computational lithography, including its challenges and how machine learning can be applied to address them. We will then present recent research in four key areas: (1) Mask optimization: this is the most time-consuming step in the resolution enhancement technique (RET) flow. This tutorial compares different approaches to machine learning-based mask optimization, based on features and machine learning model architecture. (2) Lithography modeling: an accurate and fast lithography model is essential for every step of the RET flow. Machine learning can be used to develop more accurate and efficient lithography models by incorporating physical properties into the model and learning from real-world data. (3) Sampling and synthesis of test patterns: a comprehensive set of test patterns is needed for efficient modeling and machine learning training. Machine learning can be used to identify effective sampling methods and generate new patterns for better coverage. (4) Hotspot prediction and correction: lithography hotspots can lead to circuit failure. Machine learning can be used to predict hotspots and develop correction methods that improve the yield of manufactured chips. We will also discuss how machine learning is being used in industry for computational lithography, and what the future directions of this research are.
This tutorial is intended for researchers and engineers who are interested in learning about the latest advances in machine learning for computational lithography. It will provide a comprehensive overview of the field, and will introduce the most promising research directions.

Session T5  Low Power Design: Current Practice and Opportunities
Time: 14:00 - 17:00, Monday, January 22, 2024
Location: Room 204
Chair: Gang Qu (University of Maryland, USA)

Title: (Tutorial) Low Power Design: Current Practice and Opportunities
Author: Gang Qu (University of Maryland, USA)
Keyword: low power
Abstract: Power and energy efficiency have been among the most critical design criteria for the past several decades. This tutorial consists of three parts that will help both academic researchers and industrial practitioners understand the current state of the art and the new advances, challenges, and opportunities related to low power design. After a brief motivation, we will review some of the most popular low power design technologies, including dynamic voltage and frequency scaling (DVFS), clock gating, and power gating. Then we will cover some of the recent advances, such as approximate computing and in-memory computing. Finally, we will share with the audience some of the security pitfalls in implementing these low power methods. This tutorial is designed for graduate students and professionals from industry and government working in the general fields of EDA, embedded systems, and the Internet of Things. Previous knowledge of low power design and security is not required.

Session T6  Leading the industry, Samsung CXL Technology
Time: 14:00 - 17:00, Monday, January 22, 2024
Location: Room 205
Chair: Jeonghyeon Cho (Samsung Electronics, Republic of Korea)

Title: (Tutorial) Leading the industry, Samsung CXL Technology
Author: Jeonghyeon Cho, Jinin So, Kyungsan Kim (Samsung Electronics, Republic of Korea)
Abstract: The rapid development of data-intensive technology has driven increasing demand for new architectural solutions with scalable, composable, and coherent computing environments. Recent efforts on Compute Express Link (CXL) are a key enabler in accelerating the shift toward memory-centric architecture. CXL is an industry-standard interconnect protocol that allows various processors to efficiently expand memory capacity and bandwidth with a memory-semantic protocol. In addition, memory connected over CXL allows handshaking communication to include a processing-near-memory (PNM) engine in the memory. As the foremost leader in memory solutions, Samsung Electronics has developed CXL-enabled memory solutions: CXL-MXP and CXL-PNM. CXL-MXP allows flexible memory expansion compared to current DIMM-based memory solutions, and CXL-PNM is the world's first CXL-based PNM solution for GPT inference acceleration. By adopting the CXL protocol, our memory solutions will expand the CXL memory ecosystem while strengthening our presence in the next-generation memory solutions market.

Session T7  Sparse Acceleration for Artificial Intelligence: Progress and Trends
Time: 14:00 - 17:00, Monday, January 22, 2024
Location: Room 206
Chair: Guohao Dai (Shanghai Jiao Tong University, China)

Title: (Tutorial) Sparse Acceleration for Artificial Intelligence: Progress and Trends
Author: Guohao Dai (Shanghai Jiao Tong University, China), Xiaoming Chen (Chinese Academy of Sciences, China), Mingyu Gao, Zhenhua Zhu (Tsinghua University, China)
Keyword: sparse computing
Abstract: After decades of advancements, artificial intelligence algorithms have become increasingly sophisticated, with sparse computing playing a pivotal role in their evolution. On the one hand, sparsity is an important method to compress neural network models and reduce computational workload. Furthermore, generative algorithms like Large Language Models (LLMs) have brought AI into the 2.0 era, and the large computational complexity of LLMs makes using sparsity to reduce workload even more crucial. On the other hand, for real-world sparse applications like point cloud and graph processing, emerging AI algorithms have been developed to process sparse data. In this tutorial, we will review and summarize the characteristics of sparse computing in AI 1.0 and 2.0. The development trend of sparse computing will then be discussed from the circuit level to the system level. This tutorial includes three parts: (1) From the circuit perspective, emerging Processing-In-Memory (PIM) circuits have demonstrated attractive performance potential compared to von Neumann architectures. We will first explain the opportunities and challenges of deploying irregular sparse computing onto dense PIM circuits. Then, we will introduce multiple PIM circuit and sparse algorithm co-optimization strategies to improve PIM computing energy efficiency and reduce communication latency overhead. (2) From the architecture perspective, this tutorial will first present multiple domain-specific architectures (DSAs) for efficient sparse processing, including application-dedicated accelerators and Near-Data Processing (NDP) architectures. These DSA architectures achieve performance improvements of one to two orders of magnitude over CPUs in graph mining, recommendation systems, etc. After that, we will discuss the design ideas and envision the future of general sparse processing for various sparsity patterns and sparse operators. (3) From the system perspective, we will introduce sparse kernel optimization strategies on GPU systems. Based on these studies, an open-source sparse kernel library, dgSPARSE, will be presented. The dgSPARSE library outperforms commercial libraries on various graph neural network models and sparse operators. Furthermore, we will discuss the application of the above system-level design methodologies to the optimization of LLMs.

Session T8  CircuitOps and OpenROAD: Unleashing ML EDA for Research and Education
Time: 14:00 - 17:00, Monday, January 22, 2024
Location: Room 207
Chair: Andrew B. Kahng (University of California San Diego, USA)

Title: (Tutorial) CircuitOps and OpenROAD: Unleashing ML EDA for Research and Education
Author: Andrew B. Kahng (University of California San Diego, USA), Vidya A. Chhabria (Arizona State University, USA)
Abstract: This tutorial will first present NVIDIA’s CircuitOps approach to modeling chip data, and the generation of chip data using the open-source infrastructure of OpenROAD (in particular, OpenDB and OpenSTA, along with Python APIs). The tutorial will highlight how the integration of CircuitOps and OpenROAD has created an ML EDA infrastructure that serves as a playground for users to directly experiment with generative and reinforcement learning-based ML techniques within an open-source EDA tool. Recently developed Python APIs around OpenROAD have allowed the integration of CircuitOps with OpenROAD, both to query data from OpenDB and to modify the design via ML-algorithm-to-OpenDB callbacks. As part of the tutorial, participants will work with OpenROAD’s Python interpreter and leverage CircuitOps to (i) represent and query chip data in ML-friendly data formats such as graphs, numpy arrays, pandas dataframes, and images, and (ii) modify circuit netlist information from a simple implementation of a reinforcement learning framework for logic gate sizing. Several detailed examples will show how ML EDA applications can be built on the OpenROAD and CircuitOps ML EDA infrastructure. The tutorial will also survey the rapidly evolving landscape of ML EDA, spanning generative methods, reinforcement learning, and other methods, that builds on open design data, data formats, and tool APIs. Attendees will receive pointers to optional pre-reading and exercises in case they would like to familiarize themselves with the subject matter before attending the tutorial. The tutorial will be formulated to be as broadly interesting and useful as possible to students, researchers, and faculty, and to practicing engineers in both EDA and design.

Tuesday, January 23, 2024

Session 1K  Opening and Keynote Session I
Time: 9:00 - 10:30, Tuesday, January 23, 2024
Location: Room Premier A/B
Chairs: Kyu-Myung Choi (Seoul National University, Republic of Korea), Taewhan Kim (Seoul National University, Republic of Korea)

Title: ASP-DAC 2024 Opening

Title: (Keynote Address) Advanced Technology and Design Enablement
Author: Sei Seung Yoon (Samsung Electronics, Republic of Korea)
Abstract: Next-generation SoC designs are witnessing remarkable advancements in MBCFET, library design, migration to advanced nodes, automotive design, and Multi-Die Integration (MDI). Improving fundamental technologies helps achieve high-performance computing under continuous pitch scaling requirements. Focusing on this aspect, Samsung Foundry pioneered the application of MBCFET in its designs. The MBCFET is an innovative CMOS structure that provides a power-performance advantage over FinFET through key device characteristics. A newly suggested low-area standard cell architecture, an EG-less design solution, and DTCO also make this possible. EASY migration and library design methodologies enable mature IP designs on advanced nodes, thus reducing development time. To meet the ever-growing needs of automotive design, Samsung Foundry is working towards auto-grade IP design and design solutions. To offer superior packaging solutions, Samsung Foundry is focusing on MDI technologies beyond existing 2D technologies. As part of our efforts to complete and mature advanced technologies in the semiconductor industry, we are collaborating with numerous EDA companies.

Session 1A  Emerging NoC Designs
Time: 10:45 - 12:00, Tuesday, January 23, 2024
Location: Room 204
Chair: Qinru Qiu (Syracuse University, USA)

1A-1 (Time: 10:45 - 11:10)
Title: CANSim: When to Utilize Synchronous and Asynchronous Routers in Large and Complex NoCs
Author: *Tom Glint (IIT Gandhinagar, India), Manu Awasthi (Ashoka University, India), Joycee Mekie (IIT Gandhinagar, India)
Pages: pp. 1 - 6
Keyword: asynchronous router, GALS simulator
Abstract: Asynchronous routers offer benefits in latency, energy, and area over synchronous routers, but their adoption in large system designs is limited due to inadequate tool support for performance quantification. We introduce CANSim, a fast and accurate simulator for complex asynchronous and synchronous NoCs. Verified against synthesis models, CANSim models any graph-representable topology, asymmetric links, data-dependent delays, and meta-stability, and supports synthetic and real-world traffic. It also accommodates power gating, hierarchical networks, and GALS-like networks. Our comparisons reveal that asynchronous NoCs provide lower packet latency at no-load conditions, but synchronous NoCs may outperform them near saturation. On average, asynchronous NoCs offer up to 36% latency and 52% power benefits, with power gating potentially reducing static power by 90%.

1A-2 (Time: 11:10 - 11:35)
Title: PAIR: Periodically Alternate the Identity of Routers to Ensure Deadlock Freedom in NoC
Author: *Zifeng Zhao, Xinghao Zhu, Jiyuan Bai, Gengsheng Chen (Fudan University, China)
Pages: pp. 7 - 12
Keyword: Network on Chip, Many-Core Systems, Deadlock Freedom
Abstract: Network-on-Chip (NoC) has become widely adopted in multi/many-core systems for on-chip communication. Avoiding deadlock is a critical issue in NoC design. Recently proposed techniques partition the network resources into one or more express paths. Each path allows a blocked packet to move forward to its destination, thereby breaking deadlocks. However, as NoCs scale up, existing methods experience a decline in efficiency due to their coarse-grained partitioning strategy. In this work, we present PAIR, a novel scheme that guarantees deadlock freedom. PAIR adopts a fine-grained resource partitioning approach, significantly increasing the number of available express paths. The express paths in PAIR allow not only blocked packets but also non-blocked packets to be forwarded to the next hop, minimizing the performance degradation of normal flow control. We implemented and evaluated PAIR on the mesh network using classic synthetic traffic patterns. Our experiments show that PAIR significantly improves throughput, achieving an average increase of 80.36% compared with state-of-the-art deadlock-free solutions. Furthermore, PAIR outperforms the latest deadlock-free scheme by 37.78% in terms of throughput.

1A-3 (Time: 11:35 - 12:00)
Title: SCNoCs: An Adaptive Heterogeneous Multi-NoC with Selective Compression and Power Gating
Author: *Fan Jiang, Chengeng Li, Lin Chen, Jiaqi Liu, Wei Zhang (The Hong Kong University of Science and Technology, Hong Kong), Jiang Xu (The Hong Kong University of Science and Technology (Guangzhou), China)
Pages: pp. 13 - 18
Keyword: in-network compression, Multi-NoC, power gating, energy efficiency
Abstract: In-network compression has recently been proposed to support efficient communication. However, we find that employing compression blindly does not always pay off, since de/compression adds extra packet transmission delay. We therefore propose selective compression, which compresses data adaptively based on the network state and the predicted compression ratio. Moreover, we observe that simply applying selective compression in a conventional single network is not energy efficient. Therefore, we propose SCNoCs, a heterogeneous Multi-NoC (Main-Net and Helper-Net) architecture with support for selective compression and power gating. SCNoCs can dynamically adjust the selective compression policy and the utilization degree of the Helper-Net according to the network state at runtime. Experimental results show that our selective compression outperforms conventional compression by 1.5x. Besides, our proposed SCNoCs achieves comparable performance while reducing energy consumption by 43.4% compared with the baseline.

Session 1B  When AI Meets Edge Devices
Time: 10:45 - 12:00, Tuesday, January 23, 2024
Location: Room 205
Chairs: Guangyu Sun (Peking University, China), Jinjun Xiong (University at Buffalo, USA)

Best Paper Candidate
1B-1 (Time: 10:45 - 11:10)
Title: QuadraNet: Improving High-Order Neural Interaction Efficiency with Hardware-Aware Quadratic Neural Networks
Author: *Chenhui Xu, Fuxun Yu, Zirui Xu (George Mason University, USA), Chenchen Liu (University of Maryland, Baltimore County, USA), Jinjun Xiong (University at Buffalo, USA), Xiang Chen (George Mason University, USA)
Pages: pp. 19 - 25
Keyword: Quadratic Neural Networks, Efficient Model, High-order Interaction
Abstract: Recent progress in computer vision-oriented neural network designs is mostly driven by capturing high-order neural interactions among inputs and features. A variety of approaches have emerged to accomplish this, such as Transformers and their variants. However, these interactions generate a large amount of intermediate state and/or strong data dependency, leading to considerable memory consumption and computing cost, and therefore compromising overall runtime performance. To address this challenge, we rethink high-order interactive neural network design with a quadratic computing approach. Specifically, we propose QuadraNet, a comprehensive model design methodology from neuron reconstruction to structural block and eventually to the overall neural network implementation. Leveraging quadratic neurons' intrinsic high-order advantages and dedicated computation optimization schemes, QuadraNet effectively achieves optimal cognition and computation performance. Incorporating state-of-the-art hardware-aware neural architecture search and system integration techniques, QuadraNet also generalizes well across different hardware constraint settings and deployment scenarios. Experiments show that QuadraNet achieves up to 1.5× throughput, a 30% smaller memory footprint, and similar cognition performance compared with the state of the art.

1B-2 (Time: 11:10 - 11:35)
Title: RobustDiCE: Robust and Distributed CNN Inference at the Edge
Author: *Xiaotian Guo (University of Amsterdam/Leiden University, Netherlands), Quan Jiang (Nanjing Agricultural University, China), Andy Pimentel (University of Amsterdam, Netherlands), Todor Stefanov (Leiden University, Netherlands)
Pages: pp. 26 - 31
Keyword: Robustness, Distributed Inference, edge computing, fault-tolerant design, deep learning
Abstract: Prevalent large CNN models pose a significant challenge in terms of computing resources for resource-constrained devices at the Edge. Distributing the computations and coefficients over multiple edge devices collaboratively has been well studied, but these works generally do not consider the presence of device failures (e.g., due to temporary connectivity issues, overload, or a discharged battery). Such unpredictable failures can compromise the reliability of edge devices, inhibiting the proper execution of distributed CNN inference. In this paper, we present a novel partitioning method, called RobustDiCE, for robust distribution and inference of CNN models over multiple edge devices. Our method can tolerate intermittent and permanent device failures in a distributed system at the Edge, offering a tunable trade-off between robustness (i.e., retaining model accuracy after failures) and resource utilization. We evaluate RobustDiCE using the ImageNet-1K dataset on several representative CNN models under various device failure scenarios and compare it with several state-of-the-art partitioning methods as well as an optimal robustness approach (i.e., full neuron replication). In addition, we demonstrate RobustDiCE's advantages in terms of memory usage, energy consumption per device, and system throughput for various system set-ups with different device counts.

1B-3 (Time: 11:35 - 12:00)
Title: YoseUe: "trimming" Random Forest's training towards resource-constrained inference
Author: *Alessandro Verosimile, Alessandro Tierno, Andrea Damiani, Marco Domenico Santambrogio (Politecnico di Milano, Italy)
Pages: pp. 32 - 37
Keyword: Random Forest, Embedded devices, Hardware accelerators, Machine Learning, FPGA
Abstract: Endowing artificial objects with intelligence is a longstanding computer science and engineering vision that has recently converged under the umbrella of the Artificial Intelligence of Things (AIoT). Nevertheless, AIoT's mission cannot be fulfilled if objects rely on the cloud for their "brain," at least concerning inference. Thanks to heterogeneous hardware, it is possible to bring Machine Learning (ML) inference onto resource-constrained embedded devices, but this requires careful co-optimization between model training and its hardware acceleration. This work proposes YoseUe, a memory-centric hardware co-processor for Random Forest (RF) inference, which significantly reduces the waste of memory resources by exploiting a novel training-acceleration co-optimization. YoseUe proposes a novel ML model, the Multi-Depth Random Forest Classifier (MDRFC), in which a set of RFs is trained at decreasing depths and then weighted, exploiting a Neural Network (NN) tailored to counteract potential accuracy losses w.r.t. classical RFs. With this modeling technique, first proposed in this paper, it becomes possible to accelerate the inference of RFs that contain up to 2 orders of magnitude more Decision Trees (DTs) than current state-of-the-art architectures can fit on embedded devices. Furthermore, this is achieved without losing accuracy with respect to classical, full-depth RFs in their most relevant configurations.

Session 1C  Innovations in New Computing Paradigms: Stochastic, Hyper-Dimensional and High-Performance Computing
Time: 10:45 - 12:00, Tuesday, January 23, 2024
Location: Room 206
Chair: Yue Zhang (Beihang University, China)

1C-1 (Time: 10:45 - 11:10)
TitleP2LSG: Powers-of-2 Low-Discrepancy Sequence Generator for Stochastic Computing
AuthorMehran Shoushtari Moghadam, Sercan Aygun, Mohsen Riahi Alam, *M Hassan Najafi (University of Louisiana at Lafayette, USA)
Pagepp. 38 - 45
Keywordstochastic computing, deterministic sequences, video processing, image processing
AbstractStochastic Computing (SC) is an unconventional computing paradigm processing data in the form of random bit-streams. The accuracy and energy efficiency of SC systems highly depend on the stochastic number generator (SNG) unit that converts the data from conventional binary to stochastic bit-streams. Recent work has shown significant improvement in the efficiency of SC systems by employing low-discrepancy (LD) sequences such as Sobol and Halton sequences in the SNG unit. Still, the usage of many well-known random sequences for SC remains unexplored. This work studies some new random sequences for potential application in SC. Our design space exploration proposes a promising random number generator for accurate and energy-efficient SC. We propose P2LSG, a low-cost and energy-efficient Low-discrepancy Sequence Generator derived from Powers-of-2 Van der Corput (VDC) sequences. We evaluate the performance of our novel bit-stream generator for two SC image and video processing case studies: image scaling and scene merging. For the scene merging task, we propose a novel SC design for the first time. Our experimental results show higher accuracy and lower hardware cost and energy consumption compared to the state-of-the-art.
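As background, the base-2 member of the Van der Corput family can be generated by simple bit reversal, and a comparator-style SNG then turns it into a bit-stream whose mean approximates the encoded value. The sketch below illustrates that baseline only; it is not the paper's P2LSG hardware design, and the function names are hypothetical.

```python
def vdc_base2(n: int, bits: int = 8) -> float:
    """Base-2 Van der Corput value for index n: reverse the low `bits` bits."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (n & 1)
        n >>= 1
    return rev / (1 << bits)

def sc_bitstream(x: float, length: int = 256, bits: int = 8):
    """Comparator-style SNG: bit i is 1 when the LD sequence value is below x."""
    return [1 if vdc_base2(i, bits) < x else 0 for i in range(length)]
```

For example, a stream of length 8 for x = 0.5 contains exactly four ones, so its mean recovers the encoded value without the random fluctuation of an LFSR-based SNG.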

1C-2 (Time: 11:10 - 11:35)
TitlePAAP-HD: PIM-Assisted Approximation for Efficient Hyper-Dimensional Computing
Author*Fangxin Liu, Haomin Li, Ning Yang (Shanghai Jiao Tong University, China), Yichi Chen (Tianjin University, China), Zongwu Wang (Shanghai Jiao Tong University, China), Tao Yang (Huawei Technologies Co., Ltd., China), Li Jiang (Shanghai Jiao Tong University, China)
Pagepp. 46 - 51
Keywordhyper-dimensional computing, approximate computing, energy-efficiency
AbstractHyper-Dimensional Computing (HDC) is a brain-inspired learning framework that's particularly suited to resource-limited edge devices. HDC operates in a highly parallel manner, encoding raw data into a hyper-dimensional space, thus enabling efficient training and inference. However, the high dimensionality of data representation in HDC demands a substantial multiplication cost for calculating cosine similarity in high-precision HDC processes. While binarization of HDC can circumvent these multiplications, it often results in unsatisfactory accuracy. In this paper, we propose PAAP-HD, a novel approximation framework that is both accurate and hardware-friendly, designed to enhance the efficiency of HDC inference. Our framework employs a simple neural network as a universal approximator, which can be mapped to parallel Multiply–Accumulate (MAC) operations of the ReRAM-based PIM crossbar. Additionally, we introduce an algorithm to guide model switching, which aids in managing the approximation quality. This algorithm can be instantiated as a just-in-time predictor, seamlessly integrated into HDC to prescribe the appropriate model for each sample. Our evaluation is conducted on datasets from four different fields, and the results show that PAAP-HD can bring an execution time speedup of $93.1\times$ and improve energy efficiency by $41.5\%$ with just $<1\%$ accuracy loss.
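For context on the cost PAAP-HD targets: a plain HDC classifier compares a query hypervector against class prototypes by cosine similarity, and the dot products in that comparison dominate the multiplication cost in high-precision HDC. A minimal pure-Python sketch of that baseline (hypothetical names, not the paper's implementation):

```python
import random

def rand_hv(dim, rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]

def bundle(hvs):
    """Element-wise sum of hypervectors -> integer class prototype."""
    return [sum(col) for col in zip(*hvs)]

def cosine(a, b):
    """High-precision similarity; these multiplications are what PAAP-HD approximates."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)
```

With a large dimension (e.g., 1000), random hypervectors are nearly orthogonal, so a prototype bundled from a class's samples stays measurably closer to that class than to others.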

1C-3 (Time: 11:35 - 12:00)
TitleFPGA-Based HPC for Associative Memory System
Author*Deyu Wang (Fudan University, China), Yuning Wang (University of Turku, Finland), Yu Yang, Dimitrios Stathis, Ahmed Hemani, Anders Lansner (KTH Royal Institute of Technology, Sweden), Jiawei Xu (Guangdong Institute of Intelligence Science and Technology, China), Li-Rong Zheng, Zhuo Zou (Fudan University, China)
Pagepp. 52 - 57
KeywordAssociative memory, Spiking Neural Network (SNN), Bayesian Confidence Propagation Neural Network (BCPNN), FPGA, High Performance Computing (HPC)
AbstractAssociative memory plays a crucial role in the cognitive capabilities of the human brain. The Bayesian Confidence Propagation Neural Network (BCPNN) is a cortex model capable of emulating brain-like cognitive capabilities, particularly associative memory. However, the existing GPU-based approach for BCPNN simulations faces challenges in terms of time overhead and power efficiency. In this paper, we propose a novel FPGA-based high performance computing (HPC) design for the BCPNN-based associative memory system. Our design endeavors to maximize the spatial and timing utilization of FPGA while adhering to the constraints of the available hardware resources. By incorporating optimization techniques including shared parallel computing units, hybrid-precision computing for a hybrid update mechanism, and the globally asynchronous and locally synchronous (GALS) strategy, we achieve a maximum network size of 150x10 and a peak working frequency of 100 MHz for the BCPNN-based associative memory system on the Xilinx Alveo U200 Card. The tradeoff between performance and hardware overhead of the design is explored and evaluated. Compared with the GPU counterpart, the FPGA-based implementation demonstrates significant improvements in both performance and energy efficiency, achieving a maximum latency reduction of 33.25X, and a power reduction of over 6.9X, all while maintaining the same network configuration.

Session 1D  3D IC
Time: 10:45 - 12:00, Tuesday, January 23, 2024
Location: Room 207
Chairs: Sunmean Kim (Kyungpook National University, Republic of Korea), Daijoon Hyun (Cheong-Ju University, Republic of Korea)

1D-1 (Time: 10:45 - 11:10)
TitleChipletizer: Repartitioning SoCs for Cost-Effective Chiplet Integration
Author*Fuping Li, Ying Wang, Yujie Wang, Mengdi Wang, Yinhe Han, Huawei Li, Xiaowei Li (Chinese Academy of Sciences, China)
Pagepp. 58 - 64
Keywordchiplet, design partitioning, MCM, InFO, 2.5D
AbstractThe stagnation of Moore's law stimulates the concept of breaking monolithic chips into smaller chiplets. However, tactical design partitioning remains an unaddressed issue despite its crucial role in chip product cost reduction. In this paper, we propose Chipletizer, a framework to guide design partitioning for those who would benefit from chiplet reuse across a line of SoC products. The proposed generic framework supports the repartitioning of multiple SoCs into reusable chiplets economically and efficiently with user-specified parameters. Experimental results show that, compared with existing partitioning strategies, our proposed framework achieves notable cost improvements on realistic products with acceptable power and latency overheads.

Best Paper Candidate
1D-2 (Time: 11:10 - 11:35)
TitleCoPlace: Coherent Placement Engine with Layout-aware Partitioning for 3D ICs
Author*Bangqi Fu, Lixin Liu, Yang Sun, Wing-Ho Lau, Martin D.F. Wong, Evangeline F.Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 65 - 70
KeywordPhysical design, 3D IC, Global placement, Partitioning
AbstractThe emerging technologies of 3D integrated circuits (3DICs) unveil a new avenue for expanding the design space into the 3D domain and present the opportunity to overcome the bottleneck of Moore’s Law for traditional 2DICs. Among various technologies, the face-to-face bonding structure provides high integration density and reliable performance. Most commercial EDA tools, however, do not support 3DICs and cannot give a convincing solution. To exploit the benefits of stacking multiple tiers vertically, placement algorithms for 3DICs are urgently needed. In this paper, we propose a design flow that optimizes partitioning and placement quality for 3DICs in a unified way. Experimental results on the ICCAD 2022 contest benchmark show that our work outperforms the first-place team by 3.35% in quality with less runtime and fewer terminals used.

1D-3 (Time: 11:35 - 12:00)
TitleO.O: Optimized One-die Placement for Face-to-face Bonded 3D ICs
Author*Xingyu Tong, Zhijie Cai, Peng Zou, Min Wei, Yuan Wen (Fudan University, China), Zhifeng Lin (Fuzhou University, China), Jianli Chen (Fudan University, China)
Pagepp. 71 - 76
KeywordF2F 3D IC, Physical design methodology, partition, placement
AbstractThe expansion of the IC dimension is ushering in a more-than-Moore era, necessitating corresponding EDA tools. Existing TSV-based 3D placers focus on minimizing cuts, while burgeoning F2F-bonded ICs feature dense interconnections between two planar dies. For this novel structure, we propose an integrated adaptation methodology built upon mature one-die placement strategies. First, we utilize a one-die placer to provide a statistical look-ahead net diagnosis. The netlist is then coarsened topologically and geometrically within a multi-level framework. Level by level, the partition is refined according to a multi-objective gain formulation, including cut expectation, heterogeneous row height, and balanced cell distribution. Given the partition, we synchronize the behavior of analytical planar placers by balancing the density and wirelength objective functions across asymmetric layers. Finally, the result is further improved by heuristic detailed placement of bonding terminals and a post-place partition adjustment. Compared to the top three winners of the 2022 CAD Contest at ICCAD, experimental results show that our fine-grained fusion of partitioning and placement achieves the best normalized average wirelength with a fairly reasonable runtime under all 3D architectural constraints.

Session 1E  (SS-1) Open-Source EDA Algorithms and Software
Time: 10:45 - 12:25, Tuesday, January 23, 2024
Location: Room 107/108
Chairs: Qi Sun (Zhejiang University), Ting-Chi Wang (National Tsing-Hua University)

1E-1 (Time: 10:45 - 11:10)
Title(Invited Paper) iEDA: An Open-source infrastructure of EDA
Author*Xingquan Li, Zengrong Huang, Simin Tao, Zhipeng Huang, Chunan Zhuang (Peng Cheng Laboratory, China), Hao Wang (Institute of Computing Technology, Chinese Academy of Sciences, China), Yifan Li (Peng Cheng Laboratory, China), Yihang Qiu (University of Chinese Academy of Sciences, China), Guojie Luo (Peking University, China), Huawei Li (Institute of Computing Technology, Chinese Academy of Sciences, China), Haihua Shen (University of Chinese Academy of Sciences, China), Mingyu Chen, Dongbo Bu (Institute of Computing Technology, Chinese Academy of Sciences, China), Wenxing Zhu (Fuzhou University, China), Ye Cai (Shenzhen University, China), Xiaoming Xiong (Guangdong University of Technology, China), Ying Jiang, Yi Heng (Sun Yat-sen University, China), Peng Zhang (Peng Cheng Laboratory, China), Bei Yu (The Chinese University of Hong Kong, China), Biwei Xie, Yungang Bao (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 77 - 82
KeywordEDA infrastructure, database, evaluation, EDA tool software development, chip design
AbstractBy leveraging the power of open-source software, the EDA tool offers a cost-effective and flexible solution for designers, researchers, and hobbyists alike. Open-source EDA promotes collaboration, innovation, and knowledge sharing within the EDA community. It emphasizes the role of the toolchain in accelerating the development of electronic systems, reducing design costs, and improving design quality. This paper presents an open-source EDA project, iEDA, aiming to build a basic infrastructure for EDA technology evolution and to close the industrial-academic gap in the EDA area. As the foundation for developing EDA tools and researching EDA algorithms and technologies, iEDA is mainly composed of a file system, database, manager, operator, and interface. To demonstrate the effectiveness of iEDA, we implement and tape out four chips of different scales (from 700k to 500M gates) on different process nodes (110nm and 28nm) with iEDA. iEDA is publicly available on the project home page at https://github.com/OSCC-Project/iEDA.

1E-2 (Time: 11:10 - 11:35)
Title(Invited Paper) iPD: An Open-source intelligent Physical Design Toolchain
AuthorXingquan Li, Simin Tao, Shijian Chen, Zhisheng Zeng, Zhipeng Huang (Peng Cheng Laboratory, China), Hongxi Wu (Fuzhou University, China), Weiguo Li (Minnan Normal University, China), Zengrong Huang, Liwei Ni (Peng Cheng Laboratory, China), Xueyan Zhao (Institute of Computing Technology, Chinese Academy of Sciences, China), He Liu (Peking University, China), Shuaiying Long (Peng Cheng Laboratory, China), Ruizhi Liu (Institute of Computing Technology, Chinese Academy of Sciences, China), Xiaoze Lin, Bo Yang (Peng Cheng Laboratory, China), Fuxing Huang (Fuzhou University, China), Zonglin Yang (Shenzhen University, China), Yihang Qiu (University of Chinese Academy of Sciences, China), Zheqing Shao (University of Science and Technology of China, China), Jikang Liu, Yuyao Liang (Shenzhen University, China), Biwei Xie, Yungang Bao (Institute of Computing Technology, Chinese Academy of Sciences, China), *Bei Yu (The Chinese University of Hong Kong, China)
Pagepp. 83 - 88
KeywordNetlist-to-GDS-II, physical design, open-source EDA tool, chip design flow, placement and routing
AbstractOpen-source electronic design automation (EDA) shows promising potential in unleashing EDA innovation and lowering the cost of chip design. The open-source EDA toolchain is a comprehensive set of software tools designed to facilitate the design, analysis, and verification of electronic circuits and systems. We developed a physical design EDA toolchain (named iPD) from netlist to GDS-II, covering design, analysis, and verification. iPD now covers the whole flow of physical design (including floorplan, placement, clock tree synthesis, routing, timing optimization, etc.), part of the analysis tools (timing analysis and power analysis), and part of the verification tools (design rule check). To better support EDA research and development as well as chip design, we designed iPD to be reliable, extensible, easy to use, and feature-rich. This paper introduces the software structure, functions, and metrics of the iPD toolchain.

1E-3 (Time: 11:35 - 12:00)
Title(Invited Paper) A Resource-efficient Task Scheduling System Using Reinforcement Learning
AuthorChedi Morchdi (University of Utah, USA), Cheng-Hsiang Chiu (University of Wisconsin at Madison, USA), Yi Zhou (University of Utah, USA), *Tsung-Wei Huang (University of Wisconsin at Madison, USA)
Pagepp. 89 - 95
Keywordreinforcement learning, task scheduling
AbstractComputer-aided design (CAD) tools typically incorporate thousands or millions of functional tasks and dependencies to implement various synthesis and analysis algorithms. Efficiently scheduling these tasks in a computing environment that comprises manycore CPUs and GPUs is critically important because it governs the macro-scale performance. However, existing scheduling methods are typically hardcoded within an application and are not adaptive to changes in the computing environment. To overcome this challenge, this paper introduces a novel reinforcement learning-based scheduling algorithm that can learn to adapt the performance optimization to a given runtime (task execution environment) situation. We present a case study on VLSI timing analysis to demonstrate the effectiveness of our learning-based scheduling algorithm. For instance, our algorithm can achieve the same performance as the baseline while using only 20% of the CPU resources.

1E-4 (Time: 12:00 - 12:25)
Title(Invited Paper) Machine learning and GPU accelerated sparse linear solvers for transistor-level circuit simulation: a perspective survey
Author*Zhou Jin, Wenhao Li, Yinuo Bai, Tengcheng Wang, Yicheng Lu, Weifeng Liu (China University of Petroleum-Beijing, China)
Pagepp. 96 - 101
KeywordCircuit simulation, Sparse linear solver, Linear algebra, AI, GPU
AbstractSparse linear solvers play a crucial role in transistor-level circuit simulation, especially for large-scale post-layout circuit simulation when considering complex parasitic effects. As semiconductor technology advances rapidly, the increasing sizes of circuits result in sparse linear solvers that require extended execution times and additional memory resources. Consequently, high-performance sparse linear solvers emerge as pivotal tools to facilitate rapid circuit simulation and verification. However, circuit matrices frequently exhibit high sparsity and non-uniform distributions of nonzero elements, compounding the challenge of achieving efficient acceleration. Recently, the flourishing developments in machine learning technology and the continuous enhancement of hardware capabilities have presented new opportunities for accelerating sparse linear solvers. This paper provides a perspective review of these technological advancements, while also highlighting the challenges and future opportunities in this evolving landscape.

Session 1F  (DF-4) Advanced EDA using AI/ML at Synopsys
Time: 10:45 - 11:45, Tuesday, January 23, 2024
Chair: Heechun Park (Kookmin University, Republic of Korea)

1F-1 (Time: 10:45 - 11:15)
Title(Designers' Forum) AI-Driven Solution for DFT Optimization
AuthorSoochang Park (EDA Group, Synopsys, USA)
AbstractIn semiconductor design, configuring optimal specifications for a design is becoming a challenge due to numerous inter-dependent design parameters. Specifically, in design for testability (DFT), predicting test pattern quality in advance is especially demanding, since patterns are generated in an automatic test pattern generation (ATPG) step only after the DFT IP is implemented. Such a flow inherently requires long iteration run-times and substantial computing resources for designers. In the front-end design implementation step, ATPG quality of results (QoR) can be improved by adjusting DFT specifications and applying automatic test-point insertion (TPI). However, DFT specifications must respect design limitations and hierarchical design guidelines, and TPI solutions may increase area overhead. To find a DFT recipe for optimal ATPG results while considering the design circumstances, the ML-based solution from Synopsys is suggested as a promising way to automate the process and eventually reduce manual effort. In the conducted experiments, the two steps from DFT insertion to ATPG are integrated, so that the ML solution can automate the flow and learn the relation between the DFT recipe and ATPG QoR. The experimental results not only show outstanding ATPG QoR, but also demonstrate that synthesis constraints can be reflected in accordance with the user’s intention.

1F-2 (Time: 11:15 - 11:45)
Title(Designers' Forum) Optimization of PDN and DTCO using Synopsys Machine Learning Framework
AuthorKyoung-In Cho (EDA Group, Synopsys, USA)
AbstractIn the realm of sub-nanometer technology nodes, the semiconductor industry faces the challenge of fulfilling increasingly complex application demands while adhering to stringent power, performance, and area (PPA) requirements. A major hurdle is the mismatch between metal pitch and cell height reduction, leading to heightened routing congestion and hindering effective chip size reduction. Although adding more metal layers can mitigate this issue, it substantially raises production costs and expands the design space, especially when considering IR-drop during physical implementations. Design-Technology Co-Optimization (DTCO) emerges as a vital strategy to bypass physical scaling limits and enhance transistor density, performance, and power efficiency. However, it significantly broadens the design scope in physical implementations, as chip designers must integrate technology-related variables with existing design parameters in the early technology stage. To address these challenges, we advocate for the integration of Machine Learning (ML) to identify and optimize technical parameters. ML demonstrates exceptional potential in fine-tuning complex parameters. Specifically, Synopsys DSO.ai (Design Space Optimization AI), a pioneer in applying ML within the Electronic Design Automation (EDA) sector, shows promising results. Our experiments using Synopsys DSO.ai successfully identify an optimal metal pitch that minimizes IR-drop impact and efficiently determine suitable parameters for DTCO.

Session 2A  Frontiers in Embedded and Edge AI: from Adversarial Attacks to Intelligent Homes
Time: 13:30 - 14:45, Tuesday, January 23, 2024
Location: Room 204
Chair: BaekGyu Kim (Daegu Gyeongbuk Institute of Science and Technology, Republic of Korea)

2A-1 (Time: 13:30 - 13:55)
TitleHomeSGN: A Smarter Home with Novel Rule Mining Enabled by a Scorer-Generator GAN
Author*Zehua Yuan, Junhao Pan, Xiaofan Zhang, Deming Chen (University of Illinois at Urbana Champaign, USA)
Pagepp. 102 - 108
KeywordSmart Home, Internet of Things, Artificial Intelligence
AbstractMost contemporary research in advanced smart homes has been primarily focused on understanding the environment and identifying activities. However, such research rarely translates these insights into actionable rules that could improve residents' quality of life, much less optimize the entire home environment. Addressing this gap, our paper introduces HomeSGN, an end-to-end trainable Scorer-Generator system founded on the Generative Adversarial Network (GAN) architecture. Specifically tailored for smart home applications, HomeSGN extracts, assesses, and proffers beneficial rules from residents' everyday activities, thereby improving living conditions and optimizing the home environment with adaptable targets. Complemented by pioneering data augmentation and rectification strategies, the system assures model stability, avoids mode collapse, and maintains data integrity throughout GAN training. Integrating HomeSGN into an existing smart home infrastructure establishes a seamless sensor-to-rule pipeline. The effectiveness of HomeSGN is underscored by significant benefits, notably an enhancement of life quality by over 50% in single-user homes and 30% in multi-user scenarios, thus truly embodying the promise of "smart" in smart homes.

2A-2 (Time: 13:55 - 14:20)
TitleAdaptive Workload Distribution for Accuracy-aware DNN Inference on Collaborative Edge Platforms
Author*Zain Taufique (University of Turku, Finland), Antonio Miele (Politecnico di Milano, Italy), Pasi Liljeberg, Anil Kanduri (University of Turku, Finland)
Pagepp. 109 - 114
KeywordEdge Computing, DNN inference, Distributed Computing, Approximate Computing
AbstractDNN inference can be accelerated by distributing the workload among a cluster of collaborative edge nodes. Heterogeneity among edge devices and accuracy-performance trade-offs of DNN models present a complex exploration space while catering to the inference performance requirements. In this work, we propose adaptive workload distribution for DNN inference, jointly considering node-level heterogeneity of edge devices, and application-specific accuracy and performance requirements. Our proposed approach combinatorically optimizes heterogeneity-aware workload partitioning and dynamic accuracy configuration of DNN models to ensure performance and accuracy guarantees. We tested our approach on an edge cluster of Odroid Xu4, Raspberry Pi, and Jetson Nano boards and achieved an average gain of 41.52% in performance and 5.2% in output accuracy as compared to state-of-the-art workload distribution strategies.
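As a simplified illustration of heterogeneity-aware workload partitioning (not the paper's combinatorial optimizer, which additionally tunes per-model accuracy configurations), work can be split across edge nodes in proportion to each device's measured throughput; names and the rounding policy below are illustrative assumptions.

```python
def partition(total, throughputs):
    """Split `total` work items across devices in proportion to throughput."""
    s = sum(throughputs)
    shares = [int(total * t / s) for t in throughputs]  # floor shares
    # hand the rounding remainder to the fastest devices first
    remainder = total - sum(shares)
    fastest_first = sorted(range(len(throughputs)), key=lambda i: -throughputs[i])
    for i in fastest_first[:remainder]:
        shares[i] += 1
    return shares
```

For instance, 100 inference requests over devices with relative throughputs 5:3:2 yields shares of 50, 30, and 20, so all devices finish at roughly the same time.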

2A-3 (Time: 14:20 - 14:45)
TitleExtending Neural Processing Unit and Compiler for Advanced Binarized Neural Networks
AuthorMinjoon Song, Faaiz Asim, *Jongeun Lee (Ulsan National Institute of Science and Technology (UNIST), Republic of Korea)
Pagepp. 115 - 120
Keywordcompiler, BiRealNet-18, binarized neural network, VTA, Glow
AbstractBinarized neural networks (BNNs) are one of the most promising approaches to deploying deep neural network models on resource-constrained devices. However, there is very little compiler and programmable-accelerator support for BNNs, especially for modern BNNs that use scale factors and skip connections to maximize network performance. In this paper we present a set of methods to extend a neural processing unit (NPU) and a compiler to support modern BNNs. Our novel ideas include (i) batch-norm folding for binarized layers with scale factors and skip connections, (ii) efficient handling of convolutions with few input channels, and (iii) bit-packing pipelining. Our evaluation using BiRealNet-18 on an FPGA board demonstrates that our compiler-architecture hybrid approach can yield significant speed-ups for binary convolution layers over the baseline NPU. Our approach also gives 3.6x-5.5x better end-to-end performance on BiRealNet-18 compared with previous BNN compiler approaches.
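To illustrate idea (i) in its simplest form: when a BatchNorm is immediately followed by sign(), the normalization can be folded into a single threshold comparison per channel. The sketch below assumes plain per-channel scalars and omits the paper's handling of scale factors and skip connections; all names are hypothetical.

```python
import math

def fold_bn_to_threshold(gamma, beta, mu, var, eps=1e-5):
    """Fold BatchNorm followed by sign() into one threshold compare.

    Returns (tau, ge): the binarized output is +1 iff x >= tau when ge is
    True (gamma > 0), or iff x <= tau when ge is False (gamma < 0).
    """
    sigma = math.sqrt(var + eps)
    tau = mu - beta * sigma / gamma
    return tau, gamma > 0

def binarize(x, tau, ge):
    """Thresholded activation equivalent to sign(BN(x))."""
    ok = (x >= tau) if ge else (x <= tau)
    return 1 if ok else -1
```

The comparison direction flips with the sign of gamma because a negative scale makes BN(x) decreasing in x; folding this way removes the multiply, subtract, and divide from the inference path.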

Session 2B  Innovations in Quantum EDA: from Design to Deployment
Time: 13:30 - 15:10, Tuesday, January 23, 2024
Location: Room 205
Chair: Tsung-Yi Ho (Chinese University of Hong Kong, Hong Kong)

2B-1 (Time: 13:30 - 13:55)
TitleJustQ: Automated Deployment of Fair and Accurate Quantum Neural Networks
AuthorRuhan Wang, Fahiz Baba-Yara, *Fan Chen (Indiana University Bloomington, USA)
Pagepp. 121 - 126
KeywordQuantum neural networks, fairness, accuracy, noisy intermediate-scale quantum, reinforcement learning
AbstractDespite the success of Quantum Neural Networks (QNNs) in decision-making systems, their fairness remains unexplored, as the focus primarily lies on accuracy. This work conducts a design space exploration, unveiling QNN unfairness, and highlighting the significant influence of QNN deployment and quantum noise on accuracy and fairness. To effectively navigate the vast QNN deployment design space, we propose JustQ, a framework for deploying fair and accurate QNNs on NISQ computers. It includes a complete NISQ error model, reinforcement learning-based deployment, and a flexible optimization objective incorporating both fairness and accuracy. Experimental results show JustQ outperforms previous methods, achieving superior accuracy and fairness. This work pioneers fair QNN design on NISQ computers, paving the way for future investigations.

2B-2 (Time: 13:55 - 14:20)
TitleUsing Boolean Satisfiability for Exact Shuttling in Trapped-Ion Quantum Computers
Author*Daniel Schönberger (Technical University of Munich, Germany), Stefan Hillmich (Software Competence Center Hagenberg (SCCH) GmbH, Austria), Matthias Brandl (Infineon Technologies AG, Germany), Robert Wille (Technical University of Munich, Germany)
Pagepp. 127 - 133
Keywordquantum computing, trapped ions, shuttling
AbstractTrapped ions are a promising technology for building scalable quantum computers. Not only can they provide a high qubit quality, but they also enable modular architectures, referred to as Quantum Charge Coupled Device (QCCD) architecture. Within these devices, ions can be shuttled (moved) throughout the trap and through different dedicated zones, e.g., a memory zone for storage and a processing zone for the actual computation. However, this movement incurs a cost in terms of required time steps, which increases the probability of decoherence, and, thus, should be minimized. In this paper, we propose a formalization of the possible movements in ion traps via Boolean satisfiability. This formalization allows for determining the minimal number of time steps needed for a given quantum algorithm and device architecture, hence reducing the decoherence probability. An empirical evaluation confirms that—using the proposed approach—minimal results (i.e., the lower bound) can be determined for the first time. An open-source implementation of the proposed approach is publicly available at https://github.com/cda-tum/mqt-ion-shuttler.

2B-3 (Time: 14:20 - 14:45)
TitleOptimizing Decision Diagrams for Measurements of Quantum Circuits
Author*Ryosuke Matsuo (Osaka University, Japan), Rudy Raymond (IBM, Japan), Shigeru Yamashita (Ritsumeikan University, Japan), Shin-ichi Minato (Kyoto University, Japan)
Pagepp. 134 - 139
KeywordVariational quantum algorithm, Decision Diagram
AbstractVariational quantum algorithm (VQA) is a promising near-term quantum algorithm to efficiently generate quantum states useful for various applications from shallow parametrized quantum circuits (PQCs). To fully utilize VQA, it is essential to have measurement methods that efficiently extract desired information from the quantum states. Classical shadow is one such method, measuring each qubit in one of three Pauli bases chosen uniformly at random. It has attracted active research for characterizing the quantum states of PQCs because it requires only a polynomial number of measurements in the number of qubits, in contrast to quantum state tomography, which requires exponentially many. There are several variants of classical shadow that improve measurement accuracy. A highly accurate classical shadow whose choices of Pauli bases are based on a decision diagram (DD) has recently been proposed for designing PQCs. Here, we further extend the DD-based classical shadow by novel modification and application of conventional techniques to optimize DDs. We develop a method to optimize the size of the DD that can lead to even fewer measurements for some instances in quantum chemistry, as confirmed by numerical experiments. Our results show another facet of the usefulness of DDs in the design of PQCs.

2B-4 (Time: 14:45 - 15:10)
TitleCTQr: Control and Timing-Aware Qubit Routing
Author*Ching-Yao Huang, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Pagepp. 140 - 145
KeywordQuantum computing, Qubit routing, Qubit mapping, Quantum gates scheduling, Scheduling
AbstractMost often the quantum logical circuit cannot be executed directly on the quantum processor due to the limited connectivity between the physical qubits of the processor. So, a quantum compiler needs to perform qubit routing by inserting auxiliary gates to execute operations like SWAP, MOVE, and BRIDGE in order to satisfy the connectivity constraint. Qubit routing yields a physical circuit that can be executed on the target processor. Finally, the physical circuit still has to be scheduled considering the gate delays and the control constraints imposed by the shared classical control electronics of the quantum processor. For noisy intermediate-scale quantum processors, the most important objective of quantum compilation is to minimize the latency of the final scheduled physical circuit. However, solving qubit routing without considering gate delays and control constraints will inevitably lead to suboptimal final results. Here we propose a control and timing-aware qubit routing algorithm, CTQr, considering gate delays and control constraints. Moreover, CTQr performs gate merging on the fly in order to minimize the final circuit latency. The experimental results show that CTQr outperforms the state-of-the-art approach with 11.2%, 8.8%, and 54.6% average reductions in circuit latency, number of additional gates, and execution time, respectively.

Session 2C  Architecting for Dependability: System Design with Compute-in-Memory
Time: 13:30 - 15:10, Tuesday, January 23, 2024
Location: Room 206
Chair: Po-Chun Huang (National Taipei University of Technology, Taiwan)

2C-1 (Time: 13:30 - 13:55)
Title: BNN-Flip: Enhancing the Fault Tolerance and Security of Compute-in-Memory Enabled Binary Neural Network Accelerators
Author: *Akul Malhotra, Chunguang Wang, Sumeet Kumar Gupta (Purdue University, USA)
Page: pp. 146 - 152
Keyword: Binary Neural Networks, Compute-in-Memory, DNN Security, Fault tolerance, Stuck-at faults
Abstract: Compute-in-memory based binary neural networks, or CiM-BNNs, offer high energy/area efficiency for the design of edge deep neural network (DNN) accelerators, with only a mild accuracy reduction. However, for successful deployment, the design of CiM-BNNs must consider challenges such as memory faults and data security that plague existing DNN accelerators. In this work, we aim to mitigate both of these problems simultaneously by proposing BNN-Flip, a training-free weight transformation algorithm that not only enhances the fault tolerance of CiM-BNNs but also protects them from weight theft attacks. BNN-Flip inverts the rows and columns of the BNN weight matrix in a way that reduces the impact of memory faults on the CiM-BNN’s inference accuracy, while preserving the correctness of the CiM operation. Concurrently, our technique encrypts the CiM-BNN weights, securing them from weight theft. Our experiments on various CiM-BNNs show that BNN-Flip achieves an inference accuracy increase of up to 10.55% over the baseline (i.e., CiM-BNNs not employing BNN-Flip) in the presence of memory faults. Additionally, we show that the encrypted weights generated by BNN-Flip furnish extremely low (near ’random guess’) inference accuracy for an adversary attempting weight theft. The benefits of BNN-Flip come with an energy overhead of < 3%.
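A minimal sketch of the row-inversion idea, assuming known stuck-at cell locations (an illustration of the principle, not the paper's exact transformation): a row of a {-1,+1} weight matrix is sign-inverted when that reduces disagreement with stuck cells, and the flip is undone at readout by negating the row's partial sum, so the dot product is preserved.

```python
import numpy as np

def flip_rows_for_faults(W, stuck_mask, stuck_val):
    """Sign-invert any row of a {-1,+1} weight matrix where doing so
    reduces disagreement with known stuck-at cells (stuck_mask marks
    faulty cells; stuck_val gives their stuck values). Flipped rows are
    compensated at readout by negating their partial sums."""
    W = W.copy()
    flips = np.zeros(W.shape[0], dtype=bool)
    for i in range(W.shape[0]):
        m = stuck_mask[i]
        errs = int(np.sum(W[i, m] != stuck_val[i, m]))
        errs_flipped = int(np.sum(-W[i, m] != stuck_val[i, m]))
        if errs_flipped < errs:
            W[i] = -W[i]
            flips[i] = True
    return W, flips
```

At inference, `y = W_flipped @ x` followed by `y[flips] *= -1` recovers the original dot product wherever no fault actually corrupts a cell, while the flipped storage pattern now matches more of the stuck cells.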

2C-2 (Time: 13:55 - 14:20)
Title: ZEBRA: A Zero-Bit Robust-Accumulation Compute-In-Memory Approach for Neural Network Acceleration Utilizing Different Bitwise Patterns
Author: *Yiming Chen, Guodong Yin, Hongtao Zhong, Mingyen Lee, Huazhong Yang (Tsinghua University, China), Sumitha George (North Dakota State University, USA), Vijaykrishnan Narayanan (Pennsylvania State University, USA), Xueqing Li (Tsinghua University, China)
Page: pp. 153 - 158
Keyword: Compute-In-Memory, Robustness
Abstract: Deploying a lightweight quantized model in compute-in-memory (CIM) might result in significant accuracy degradation due to a reduced signal-to-noise ratio (SNR). To address this issue, this paper presents ZEBRA, a zero-bit robust-accumulation CIM approach that utilizes bitwise zero patterns to compress computation with ultra-high resilience against noise from circuit non-idealities. First, ZEBRA provides a cross-level design that successfully exploits value-adaptive zero-bit patterns to dramatically improve performance in robust 8-bit quantization. Second, ZEBRA presents a multi-level local computing unit circuit design to implement the bitwise sparsity pattern, which boosts area/energy efficiency by 2x-4x compared with existing CIM works. Experiments demonstrate that ZEBRA can achieve <1.0% accuracy loss on CIFAR10/100 with typical noise, while conventional CIM works suffer from >10% accuracy loss. Such robustness leads to much more stable accuracy for high-parallelism inference on large models in practice.
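The zero-bit-pattern idea can be illustrated with a bit-serial dot product that skips any cycle whose weight bit-plane is all zero: a software analogy of the CIM dataflow, with illustrative names, not ZEBRA's circuit-level scheme.

```python
def bit_serial_dot(weights, x, bits=8):
    """Bit-serial accumulation over weight bit-planes, skipping all-zero
    planes -- a software analogy of zero-bit pattern compression.
    weights: non-negative ints representable in `bits` bits."""
    total = 0
    for b in range(bits):
        plane = [(w >> b) & 1 for w in weights]
        if not any(plane):          # zero bit-plane: the whole cycle is skipped
            continue
        partial = sum(p * xi for p, xi in zip(plane, x))
        total += partial << b       # weight the partial sum by the bit position
    return total
```

Skipped planes cost neither an accumulation cycle nor an analog readout, which is where the energy saving and the noise resilience (fewer noisy accumulations) come from.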

2C-3 (Time: 14:20 - 14:45)
Title: A Cross-layer Framework for Design Space and Variation Analysis of Non-Volatile Ferroelectric Capacitor-Based Compute-in-Memory Accelerators
Author: *Yuan-Chun Luo, James Read, Anni Lu, Shimeng Yu (Georgia Institute of Technology, USA)
Page: pp. 159 - 164
Keyword: Capacitive Synapse, Ferroelectrics, Compute-in-memory, Benchmarking Framework, Hardware Accelerator
Abstract: Using non-volatile “capacitive” crossbar arrays for compute-in-memory (CIM) offers higher energy and area efficiency compared to “resistive” crossbar arrays. However, the impact of device-to-device (D2D) variation and temporal noise on the system-level performance has not been explored yet. In this work, we provide an end-to-end methodology that incorporates experimentally measured D2D variation into the design space exploration from capacitive weight cell design, CIM array with peripheral circuits, to the inference accuracy of SwinV2-T vision transformer and ResNet-50 on the ImageNet dataset. Our framework further assesses the system's power, performance, and area (PPA) by considering cell design, circuit structure, and model selection. We explore the design space using an early stopping algorithm to produce optimal designs while meeting strict inference accuracy requirements. Overall findings suggest that the capacitive CIM system is robust against D2D variation and noise, outperforming its resistive counterpart by 6.95× and 14.1× for the optimal design in the figure of merit (TOPS/W×TOPS/mm2) for ResNet-50 and SwinV2-T respectively.

2C-4 (Time: 14:45 - 15:10)
Title: Design of Aging-Robust Clonable PUF Using an Insulator-Based ReRAM for Organic Circuits
Author: *Kunihiro Oshima (Kyoto University, Japan), Kazunori Kuribara (National Institute of Advanced Industrial Science and Technology, Japan), Takashi Sato (Kyoto University, Japan)
Page: pp. 165 - 170
Keyword: resistive random access memory, metal oxide, physically unclonable function, organic thin-film transistor, thin-film device
Abstract: We enhance the robustness of organic thin-film transistor (OTFT)-based physically unclonable functions (PUFs) against unstable output responses that originate from OTFT characteristic degradation. The proposed PUF incorporates a new resistive random access memory (ReRAM) constructed from metal-oxide thin films, enabling both ReRAM and OTFT components to be simultaneously fabricated using common materials and processes. The device combinations facilitate the efficient design of clonable PUF (CPUF) circuits with equivalent response outputs. Evaluations using measurements and simulations validate the successful operation of the CPUF.

[To Session Table]

Session 2D  ML for Physical Design and Timing
Time: 13:30 - 15:10, Tuesday, January 23, 2024
Location: Room 207
Chair: Pei-Yu Lee (Synopsys, Taiwan)

2D-1 (Time: 13:30 - 13:55)
Title: Heterogeneous Graph Attention Network Based Statistical Timing Library Characterization with Parasitic RC Reduction
Author: *Xu Cheng, Yuyang Ye, Guoqing He, Qianqian Song, Peng Cao (Southeast University, China)
Page: pp. 171 - 176
Keyword: Statistical Timing Library Characterization, PVT corner, Heterogeneous Graph Attention Network (HGAT), Parasitic RC Reduction
Abstract: Statistical timing characterization for standard cell libraries poses significant challenges in accuracy and runtime cost. Prior analytical and machine learning-based methods neglect the profound influence of the layout-dependent parasitic resistor and capacitor (RC) network in the cell netlist, as well as the timing correlation between the topological structures of cells and process, voltage, and temperature (PVT) corners, resulting in tremendous simulation effort and/or poor accuracy. In this work, an accurate and efficient statistical cell timing characterization framework is proposed based on a heterogeneous graph attention network (HGAT) assisted by a parasitic RC reduction approach, where the transistors and parasitic RC in a cell are represented as heterogeneous nodes for graph learning and redundant RC nodes are removed to alleviate the node imbalance issue and improve prediction accuracy. The proposed framework was validated on TSMC 22nm standard cells under multiple PVT corners to predict the standard deviation of cell delay, with an average error of 2.67% across all validated cells in terms of relative Root Mean Squared Error (RMSE) and a 3× characterization runtime speedup, achieving a 2.7×–6.9× accuracy improvement compared with prior works.

2D-2 (Time: 13:55 - 14:20)
Title: An Optimization-aware Pre-Routing Timing Prediction Framework Based on Heterogeneous Graph Learning
Author: *Guoqing He, Wenjie Ding, Yuyang Ye, Xu Cheng, Qianqian Song, Peng Cao (Southeast University, China)
Page: pp. 177 - 182
Keyword: Static timing analysis, Graph learning, Placement and routing
Abstract: Accurate and efficient pre-routing timing estimation is particularly crucial in timing-driven placement, as design iterations caused by timing divergence are time-consuming. However, existing machine learning prediction models overlook the impact of timing optimization techniques applied during the routing stage, such as adjusting gate sizes or swapping threshold voltage types to fix routing-induced timing violations. In this work, an optimization-aware pre-routing timing prediction framework based on heterogeneous graph learning is proposed to calibrate the timing changes introduced by wire parasitics and optimization techniques. The path embedding generated by the proposed framework fuses local information learned by a graph neural network with global information from a transformer network to perform accurate endpoint arrival time prediction. Experimental results demonstrate that the proposed framework achieves an average accuracy improvement of 0.10 in terms of R2 score on testing designs and brings an average runtime acceleration of three orders of magnitude compared with the design flow.

2D-3 (Time: 14:20 - 14:45)
Title: BoCNT: A Bayesian Optimization Framework for Global CNT Interconnect Optimization
Author: *Hang Wu, Ning Xu (Wuhan University of Technology, China), Wei Xing, Yuanqing Cheng (Beihang University, China)
Page: pp. 183 - 188
Keyword: Carbon nanotube interconnect, Timing optimization, Bayesian optimization, Power-delay-product
Abstract: As the prevailing copper interconnect technology advances to its fundamental physical limit, interconnect delay due to ever-increasing wire resistivity has a significant impact on circuit performance. Bundled single-wall carbon nanotube (SWCNT) interconnects have emerged as a promising candidate to replace copper interconnects thanks to their superior conductivity and immunity to electromigration. To deliver satisfactory performance at low power consumption, CNT interconnect timing is optimized by adjusting either the interconnect geometry, e.g., CNT diameter and nanotube pitch, or the buffer insertion. These two aspects are normally optimized separately, which leads to an inferior design that is not globally optimal. To resolve this problem, we first propose a model that parameterizes SWCNT interconnects. We then leverage Bayesian optimization to optimize SWCNT global interconnects and buffer insertion simultaneously to improve interconnect performance. The proposed method is assessed on a set of interconnect benchmarks at the 22nm technology node. Compared to state-of-the-art methods, the Bayesian co-optimization technique can reduce PDP by more than 17%. Additionally, we further reduce the interconnect delay ratio relative to copper by 35% on average compared to fixed-parameter CNT at the same wire dimensions, which highlights the effectiveness and efficiency of the proposed technique.
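The co-optimization loop can be sketched generically: fit a Gaussian-process surrogate to evaluated design points and let a lower-confidence-bound acquisition pick the next candidate. The kernel, the toy stand-in objective, and all names below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def rbf(X1, X2, ls=0.3):
    # Squared-exponential kernel with unit variance.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(Xtr, ytr, Xte, noise=1e-4):
    # Standard GP regression posterior mean/variance at the test points.
    K_inv = np.linalg.inv(rbf(Xtr, Xtr) + noise * np.eye(len(Xtr)))
    Ks = rbf(Xtr, Xte)
    mu = Ks.T @ K_inv @ ytr
    var = 1.0 - np.einsum('ij,ji->i', Ks.T @ K_inv, Ks)
    return mu, np.maximum(var, 1e-12)

def bayes_opt(objective, bounds, n_init=5, n_iter=15, seed=0):
    """Minimize `objective` over box `bounds` with a GP surrogate and a
    lower-confidence-bound acquisition evaluated on a random candidate pool."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds], float)
    hi = np.array([b[1] for b in bounds], float)
    X = rng.uniform(lo, hi, size=(n_init, len(bounds)))
    y = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        cand = rng.uniform(lo, hi, size=(256, len(bounds)))
        Xn, Cn = (X - lo) / (hi - lo), (cand - lo) / (hi - lo)
        yn = (y - y.mean()) / (y.std() + 1e-9)   # standardize targets
        mu, var = gp_posterior(Xn, yn, Cn)
        x_next = cand[np.argmin(mu - 2.0 * np.sqrt(var))]
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]
```

In the paper's setting the design vector would bundle geometry (diameter, pitch) and buffer-insertion parameters, and the objective would be the simulated PDP; here a smooth toy function stands in.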

2D-4 (Time: 14:45 - 15:10)
Title: Timing Analysis beyond Complementary CMOS Logic Styles
Author: *Jan Lappas, Mohamed Amine Riahi, Christian Weis, Norbert Wehn (University of Kaiserslautern-Landau, Germany), Sani Nassif (Radyalis LLC, USA)
Page: pp. 189 - 194
Keyword: timing simulation, pass transistor, BTWC, DTA, STA
Abstract: With scaling unabated, device density continues to increase, but power and thermal budgets prevent the full use of all available devices. This leads to the exploration of alternative circuit styles beyond traditional CMOS, especially dynamic data-dependent styles, but the excessive pessimism inherent in conventional static timing analysis tools presents a barrier to adoption. One such circuit family is Pass-Transistor Logic (PTL), which holds significant promise but behaves differently from CMOS in that traditional CMOS-oriented EDA tools cannot produce sufficiently accurate performance estimates. In this work, we revisit timing analysis and its premises and show a significantly improved methodology of a more generalized dynamic timing engine that accurately predicts timing performance for traditional CMOS as well as PTL, with an accuracy of 4.0% compared to SPICE and with a run-time comparable to traditional gate-level simulation. The run-time improvement compared with SPICE is four orders of magnitude.

[To Session Table]

Session 2E  University Design Contest
Time: 13:30 - 15:20, Tuesday, January 23, 2024
Location: Room 107/108
Chairs: Hyun Kim (Seoul National University of Science and Technology, Republic of Korea), Ki-seok Chung (Hanyang University, Republic of Korea), Min-Seong Choo (Hanyang University, Republic of Korea), Jungwook Choi (Hanyang University, Republic of Korea)

2E-1 (Time: 13:30 - 13:40)
Title: An In-Memory Computing SRAM Macro for Memory-Augmented Neural Network in 40nm CMOS
Author: *Sunghoon Kim, Wonjae Lee, Sundo Kim, Sungjin Park, Dongsuk Jeon (Seoul National University, Republic of Korea)
Keyword: In-Memory Computing, Memory Augmented Neural Network, L1 distance calculation, Winner-Take-All, Few Shot Learning
Abstract: This work presents an In-Memory Computing (IMC) SRAM macro for Memory-Augmented Neural Networks (MANNs). The design efficiently accelerates key operations of MANNs using charge-domain computation along with a time-based winner-take-all (WTA) circuit. Fabricated in 40nm LP CMOS, the design demonstrates 27.7 TOPS/W maximum energy efficiency and achieves 98.28% classification accuracy for 5-way 5-shot learning on the Omniglot dataset.
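The MANN lookup the macro accelerates, L1-distance matching against stored keys followed by winner-take-all, reduces in software to a few lines (names are illustrative):

```python
import numpy as np

def wta_classify(query, keys, labels):
    """L1 (Manhattan) distance between a query vector and each stored key,
    then winner-take-all selects the closest entry's label -- the two
    operations the IMC macro computes in the charge and time domains."""
    dists = np.abs(keys - query).sum(axis=1)
    return labels[int(np.argmin(dists))]
```

In an N-way K-shot setting, `keys` holds the N×K support embeddings and `labels` their classes; the macro evaluates all distances in parallel inside the SRAM array.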

2E-2 (Time: 13:40 - 13:50)
Title: A Mobile 3D-CNN Processor with Hierarchical Sparsity-Aware Computation and Temporal Redundancy-aware Network
Author: *Seungbin Kim, Kyuho Lee (Ulsan National Institute of Science and Technology, Republic of Korea)
Keyword: 3D-CNN, hand gesture recognition processor, hardware-software codesign, low latency, VLSI
Abstract: A hierarchical sparsity-aware 3D-convolutional neural network (3D-CNN) accelerator is proposed for a low-power real-time mobile hand gesture recognition (HGR) system. The complex computation of 3D-CNNs with video input makes real-time operation difficult on mobile platforms where hardware resources are limited. To facilitate implementation of real-time HGR, this paper proposes hardware-algorithm co-optimization that greatly reduces the external memory accesses and MAC operations arising from the temporal redundancy of video data through hierarchical sparsity-aware computation. The ROI-only inter-frame differential-aware network with a dedicated 3D-CNN processor achieves 24ms latency for real-time HGR at 262mW.

2E-3 (Time: 13:50 - 14:00)
Title: A 740μW Real-Time Speech Enhancement Processor Using Band Optimization and Multiplier-Less PE Arrays for Hearing Assistive Devices
Author: *Sungjin Park, Sunwoo Lee (Seoul National University, Republic of Korea), Jeongwoo Park (Sungkyunkwan University, Republic of Korea), Hyeong-Seok Choi (Supertone, Republic of Korea), Dongsuk Jeon (Seoul National University, Republic of Korea)
Keyword: Speech enhancement, Algorithm optimization, Reconfigurable PE, ASIC
Abstract: This work presents a real-time speech enhancement (SE) processor for hearing aids in 28 nm CMOS. Through importance-aware neural network optimization, we reduce the operation count by 29.7% without compromising enhancement quality. The processor achieves real-time SE with a power consumption of 740µW using reconfigurable processing elements (PEs) capable of handling both the coordinate rotation digital computer (CORDIC) algorithm and neural network layers. Consequently, our design outperforms previous SE processors, providing superior speech enhancement quality.

2E-4 (Time: 14:00 - 14:10)
Title: StrongARM Latch-based Clocked Comparator Design for Improving Low Speed DRAM Testing Reliability
Author: *Jongchan Lee, Chanheum Han, Ki-Soo Lee, Joo-Hyung Chae (Kwangwoon University, Republic of Korea)
Keyword: Clocked Comparator, Leakage Current, DRAM, Low-Speed Test
Abstract: This paper introduces a StrongARM latch-based clocked comparator with leakage current compensation to address sampling errors in low-speed wafer tests for dynamic random access memory production. A low-speed test data pattern is transferred through the clocked comparator to identify cell and circuit defects. However, leakage current can cause bit errors and reduce testing reliability. The proposed StrongARM latch-based comparator effectively prevents bit errors by providing a path to ground for the leakage current. A prototype chip fabricated in a 28-nm CMOS process demonstrated successful bit error compensation at 1 MHz with a 1.0 V supply and at 3 MHz with a 1.3 V supply.

2E-5 (Time: 14:10 - 14:20)
Title: Nano-Watt High-Resolution Continuous-Time Delta-Sigma Modulator With On-Chip PMIC for Sensor Applications
Author: *Jaedo Kim, Jiho Moon, Tian Guo, Chaeyoung Kang, Jeongjin Roh (Hanyang University, Republic of Korea)
Keyword: Low power, power management IC, delta-sigma modulator, sensor system
Abstract: This paper proposes an integrated chip combining power management and a nano-watt sensor circuit for ultra-low-power skin sensors. To reduce the ripple voltage of the buck converter, an LDO regulator supplies the voltage to the proposed continuous-time delta-sigma modulator (CTDSM), enabling the CTDSM to achieve high resolution. The proposed integrated chip consumes a very low current of approximately 1.1 µA. It was validated by operating it from solar panels for energy harvesting and applying it to a flexible sensor system for the Internet of Things, enabling real-time extraction of ECG waveforms.

2E-6 (Time: 14:20 - 14:30)
Title: A 0.15-to-1.15V Output Range 270mA Self-Calibrating-Clocked Capacitor-Free LDO Using Rail-to-Rail Voltage-Difference-to-Time Converter with 0.183fs FoM
Author: *Youngmin Park, Dongsuk Jeon (Seoul National University, Republic of Korea)
Keyword: Low-dropout regulator, voltage difference to time converter, charge pump, voltage regulator, output-capacitor-free LDO
Abstract: This paper proposes a fully integrated low-dropout regulator (LDO) for mobile applications with a load current capacity of 270mA. It utilizes a rail-to-rail voltage-difference-to-time-converter (VDTC) and a charge-pump (CP) to achieve a wide output range of 0.15-1.15V with 99.99% peak current efficiency and 12.7µA quiescent current. Fast transient responses are achieved using a self-calibrating clock generator (SCCG) and a tunable undershoot compensator (TUC), resulting in 150mV undershoot and 100ns settling time at a 200mA/3ns slew rate, with a figure-of-merit (FoM) of 0.183fs.

2E-7 (Time: 14:30 - 14:40)
Title: A 0.37 V 126 nW 0.29 mm2 65-nm CMOS Biofuel-Cell-Modulated Biosensing System Featuring an FSK-PIM-Combined 2.4 GHz Transmitter for Continuous Glucose Monitoring Contact Lenses
Author: Guowei Chen, Akiyoshi Tanaka (Nagoya University, Japan), *Kiichi Niitsu (Kyoto University, Japan)
Keyword: biomedical system, glucose monitor, smart contact lens, sensing interface
Abstract: This work presents a battery-less biosensing system for continuous glucose monitoring contact lenses featuring a 2.4 GHz transmitter. The modulation method combines frequency shift keying and pulse interval modulation to send radio frequency (RF) signals carrying two-dimensional information, including the supply voltage level, which aims to overcome the impact of an unstable solar power supply. The charge-pump-based energy harvester converts the supply voltage to above 1.8 V to achieve a transmission distance longer than 40 cm, which allows communication between contact lenses and handsets. The chip, fabricated in a 65 nm CMOS process, consumes 126 nW at a 0.37-V supply, which is manageable by on-lens solar cells in an indoor ambient-light environment.

2E-8 (Time: 14:40 - 14:50)
Title: Power-Efficient FPGA Implementation of CNN-Based Object Detector
Author: *Haein Lee, Inseong Hwang, Hyun Kim (Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Republic of Korea)
Keyword: Convolutional Neural Networks, tiny-YOLOv3, Accelerators, Field Programmable Gate Array, External Memory (DRAM) Access
Abstract: As convolutional neural network (CNN)-based object detection technology with high accuracy continues to develop, interest in implementing artificial intelligence (AI) accelerators for deploying CNN-based object detectors in edge devices has increased. When developing such an accelerator, the primary concern is to meet the low power demands, considering the specific characteristics of edge devices. For this purpose, we implement a low-power CNN-based object detector on the FPGA platform. In detail, we implement and verify the tiny-YOLOv3 accelerator on the Xilinx Zynq Ultrascale+ MPSoC ZCU102 Evaluation Kit. The proposed accelerator achieves a throughput of 92.10GOP/s and a power consumption of 5.442W. The proposed IP is expected to be utilized in various edge device-based applications that require object detection.

2E-9 (Time: 14:50 - 15:00)
Title: TransCoder: Efficient Hardware Implementation of Transformer Encoder
Author: *Sangki Park, Chan-Hoon Kim, Soo-Min Rho, Jeong-Hyun Kim, Seo-Ho Chung, Ki-Seok Chung (Hanyang University, Republic of Korea)
Keyword: FPGA, Transformer, softmax, GEMM, Fixed-point
Abstract: This paper presents TransCoder, an FPGA-based Transformer encoder accelerator that supports fixed-point GEMM and softmax computation using ExCORDIC, an extended version of the CORDIC algorithm. TransCoder is implemented on a Xilinx ZCU111 FPGA operating at 215MHz, and the accuracy difference is less than 1% compared with a floating-point software implementation with custom APIs on GLUE tasks with the BERT-Large model.

2E-10 (Time: 15:00 - 15:10)
Title: Implementation of a High-throughput and Accurate Gaussian-TinyYOLOv3 Hardware Accelerator
Author: *Juntae Park, Subin Ki, Hyun Kim (Seoul National University of Science and Technology, Republic of Korea)
Keyword: Accelerator, FPGA, YOLO
Abstract: This paper presents a dedicated FPGA implementation of the Gaussian TinyYOLOv3 accelerator using a streamline architecture for object detection in mobile and edge devices. The proposed accelerator employs a hardware-friendly shift-based floating-fixed MAC operator and shift-based quantization method that significantly reduces hardware resources and minimizes accuracy degradation. The proposed IP implemented on Xilinx XCVU9P achieves a processing speed of 62.9 FPS and an accuracy of 34.01% on the COCO2014 dataset.

2E-11 (Time: 15:10 - 15:20)
Title: A 17.01 MOP/s/LUT Binary Neural Network Inference Processor Showing 87.81% CIFAR10 Accuracy with 2.6M-bit On-Chip Parameters in a 28nm FPGA
Author: *Gil-Ho Kwak, Tae-Hwan Kim (School of Electronics and Information Engineering, Korea Aerospace University, Republic of Korea)
Keyword: Neural networks, Inference, FPGA, Binarization, Resource efficiency
Abstract: This paper presents an efficient binary neural network (BNN) inference processor. The proposed processor is designed to process various BNN blocks with low resource usage, including general and group convolutions and global average pooling, consistently based on a unified mechanism. Implemented in a 28nm FPGA, the proposed processor exhibits a resource efficiency of 17.01 MOP/s/LUT and achieves 87.81% CIFAR10 accuracy with only 2.6M-bit parameters stored in on-chip memory.

[To Session Table]

Session 2F  (DF-1) Next-Generation AI Semiconductor Design
Time: 13:30 - 15:10, Tuesday, January 23, 2024
Location: Room 110/111
Chair: Changho Han (Kumoh National Institute of Technology, Republic of Korea)

2F-1 (Time: 13:30 - 13:55)
Title: (Designers' Forum) Building the programmable, high performance and energy-efficient AI chip for ChatGPT
Author: Joon Ho Baek (FuriosaAI, Republic of Korea)
Abstract: With the advent of ChatGPT and generative AI models, the demand for deep learning inference in data centers is exploding. While energy efficiency is important to reduce TCO (total cost of ownership), high performance is also essential to serve large models in production. Hyperscalers, meanwhile, have emphasized the importance of programmability and flexibility for inference accelerators to keep pace with DNN progress. This talk introduces high-performance AI chips developed by FuriosaAI, designed to tackle all of these challenges.

2F-2 (Time: 13:55 - 14:20)
Title: (Designers' Forum) Enabling AI Innovation through Zero-touch SAPEON AI Inference System
Author: Soojung Ryu (SAPEON, Republic of Korea)
Abstract: SAPEON, a leading player in the AI semiconductor industry, has achieved remarkable success in bringing server-grade semiconductors to market through its X220 platform. These semiconductors have gained wide recognition for their exceptional performance in MLPerf benchmarks. SAPEON's Zero-Touch AI Inference System, powered by state-of-the-art semiconductor technology, provides an AI model inference SDK and a cloud-based Inference Serving Platform. This comprehensive solution enables Customer Engineers to perform AI model inference on NPUs with minimal involvement. SAPEON is actively engaged in collaborations with key stakeholders in the AI industry, playing a pivotal role in driving AI innovation forward as we prepare for the widespread adoption of AI inference using our cutting-edge X330 platform.

2F-3 (Time: 14:20 - 14:45)
Title: (Designers' Forum) Processing-in-Memory in Generative AI Era
Author: Kyomin Sohn (Samsung Electronics, Republic of Korea)
Abstract: With the advancement of neural networks, particularly the emergence of large language models (LLMs), a solution that addresses memory bottlenecks and improves system energy efficiency is strongly required. Currently, HBM DRAM is the only memory solution that meets high bandwidth requirements. In this talk, we will look at the HBM DRAM currently in active use and discuss the next generation of HBM DRAM and the technologies it will need. However, the memory bottleneck caused by the Von Neumann architecture applies to HBM as well, so we examine the PIM technology being actively discussed to overcome this limitation. The concept and implementation cases of the recently developed HBM-PIM will be examined, and the next generation of DRAM-PIM will be discussed.

2F-4 (Time: 14:45 - 15:10)
Title: (Designers' Forum) AiMX: Cost-effective LLM accelerator using AiM (SK hynix’s PIM)
Author: Euicheol Lim (SK Hynix, Republic of Korea)
Abstract: AI chatbot services have been opening up the mainstream market for AI services, but they suffer from considerably higher operating costs and substantially longer service latency. As LLM sizes continue to increase, memory-intensive functions take up most of the service operation, which is why even the latest GPU systems do not provide sufficient performance and energy efficiency. To resolve this, we introduce a lower-latency, cost-effective LLM accelerator using AiM (SK hynix’s PIM). We’d like to show how AiM reduces service latency and energy consumption, and explain the architecture of AiMX, an accelerator built on AiM. Please come and see for yourself that AiM is no longer a future technology but can be deployed in existing systems right now.

[To Session Table]

Session 3A  GPU and Custom Accelerators
Time: 15:30 - 17:35, Tuesday, January 23, 2024
Location: Room 204
Chair: Tuo Li (Chinese Academy of Sciences, China)

3A-1 (Time: 15:30 - 15:55)
Title: Collaborative Coalescing of Redundant Memory Access for GPU System
Author: Fan Jiang, *Chengeng Li, Wei Zhang (The Hong Kong University of Science and Technology, Hong Kong), Jiang Xu (The Hong Kong University of Science and Technology (Guangzhou), China)
Page: pp. 195 - 200
Keyword: GPU, memory coalescing
Abstract: GPU-based computing serves as the primary solution driving the performance of HPC systems. However, modern GPU systems encounter performance bottlenecks resulting from heavy memory access traffic and insufficient NoC bandwidth. In this work, we propose a collaborative coalescing mechanism aimed at eliminating redundant memory accesses and boosting GPU system performance. To achieve this, we design a coalescing unit for each memory partition, effectively merging requests from both inter-cluster and intra-cluster SMs. Additionally, we introduce a hierarchical multicast module to replicate and distribute the coalesced reply messages to multiple destination SMs. Experimental results show that our method achieves a 20.6% improvement in performance and a 27.1% reduction in NoC traffic over the baseline.
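The coalescing idea can be sketched in software as merging requests that fall on the same cache line and recording their requesters, so one reply can be multicast back. The 128-byte line size and the names are assumptions; this is an analogy of the mechanism, not the proposed hardware.

```python
def coalesce(requests, line_bytes=128):
    """Merge memory requests that hit the same cache line and remember
    which SMs asked, so a single reply can be multicast to all of them.
    requests: iterable of (sm_id, byte_address) pairs.
    Returns {line_index: set of requesting SM ids}."""
    merged = {}
    for sm, addr in requests:
        merged.setdefault(addr // line_bytes, set()).add(sm)
    return merged
```

Each merged entry represents one request crossing the NoC instead of several, which is where the traffic reduction comes from; the paper's hierarchical multicast module handles the fan-out of the reply.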

3A-2 (Time: 15:55 - 16:20)
Title: WER: Maximizing Parallelism of Irregular Graph Applications Through GPU Warp EqualizeR
Author: *En-Ming Huang, Bo-Wun Cheng (National Tsing Hua University, Taiwan), Meng-Hsien Lin (National Yang Ming Chiao Tung University, Taiwan), Chun-Yi Lee (National Tsing Hua University, Taiwan), Tsung Tai Yeh (National Yang Ming Chiao Tung University, Taiwan)
Page: pp. 201 - 206
Keyword: GPU Architecture, Irregular Graph Processing, Hardware-Software Co-Design, System Optimization, SIMD Lane Utilization
Abstract: Irregular graphs are becoming increasingly prevalent across a broad spectrum of data analysis applications. Despite their versatility, the inherent complexity and irregularity of these graphs often result in the underutilization of Single Instruction, Multiple Data (SIMD) resources when processed on Graphics Processing Units (GPUs). This underutilization originates from two primary issues: the occurrence of inactive threads and intra-warp load imbalances. These issues can produce idle threads, lead to inefficient usage of SIMD resources, consequently hamper throughput, and increase program execution time. To address these challenges, we introduce Warp EqualizeR (WER), a framework designed to optimize the utilization of SIMD resources on a GPU for processing irregular graphs. WER employs both a software API and a specifically-tailored hardware microarchitecture. Such a synergistic approach enables workload redistribution in irregular graphs, which allows WER to enhance SIMD lane utilization and further harness the SIMD resources within a GPU. Our experimental results over seven different graph applications indicate that WER yields a geometric mean speedup of 2.52x and 1.47x over the baseline GPU and existing state-of-the-art methodologies, respectively.

3A-3 (Time: 16:20 - 16:45)
Title: SoC-Tuner: An Importance-guided Exploration Framework for DNN-targeting SoC Design
Author: *Shixin Chen, Su Zheng, Chen Bai, Wenqian Zhao, Shuo Yin, Yang Bai, Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Page: pp. 207 - 212
Keyword: System-on-Chip, DNN, Design Space Exploration, Bayesian Optimization
Abstract: Designing a system-on-chip (SoC) for deep neural network (DNN) acceleration requires balancing multiple metrics such as latency, power, and area. However, most existing methods ignore the interactions among different SoC components and rely on inaccurate and error-prone evaluation tools, leading to inferior SoC designs. In this paper, we present SoC-Tuner, a DNN-targeting exploration framework to find the Pareto-optimal set of SoC configurations efficiently. Our framework constructs a thorough SoC design space of all components and divides the exploration into three phases. We propose an importance-based analysis to prune the design space, a sampling algorithm to select the most representative initialization points, and an information-guided multi-objective optimization method to balance multiple design metrics of SoC design. We validate our framework with an actual very-large-scale-integration (VLSI) flow on various DNN benchmarks and show that it outperforms previous methods. To the best of our knowledge, this is the first work to construct an exploration framework of SoCs for DNN acceleration.
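The Pareto-optimal set the framework searches for is simply the set of non-dominated configurations. A minimal filter over (latency, power, area)-style tuples with every objective minimized (an illustration of the definition, not SoC-Tuner's optimizer):

```python
def dominates(a, b):
    """a dominates b if a is no worse on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated subset of objective tuples,
    e.g. (latency, power, area) per SoC configuration."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

A naive filter like this is O(n^2) in the number of evaluated configurations; frameworks such as the one described keep evaluations scarce precisely because each point costs a full VLSI-flow run.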

3A-4 (Time: 16:45 - 17:10)
Title: ARS-Flow: A Design Space Exploration Flow for Accelerator-rich System based on Active Learning
Author: *Shuaibo Huang, Yuyang Ye, Hao Yan, Longxing Shi (Southeast University, China)
Page: pp. 213 - 218
Keyword: accelerator-rich system, design space exploration, sampling method
Abstract: Surrogate model-based design space exploration (DSE) is the mainstream method for searching for optimal microarchitecture designs. However, it is hard to build accurate models for accelerator-rich systems from limited samples due to their high-dimensional characteristics. Moreover, such exploration easily falls into local optima or fails to converge. To solve these two problems, we propose a DSE flow based on active learning, namely ARS-Flow. It features a Pareto-region-oriented stochastic resampling method (PRSRS) and a multi-objective genetic algorithm with self-adaptive hyperparameter control (SAMOGA). Taking the gem5-SALAM system for illustration, the proposed method can build more accurate models and find better microarchitecture designs at acceptable runtime costs.

3A-5 (Time: 17:10 - 17:35)
Title: Secco: Codesign for Resource Sharing in Regular-Expression Accelerators
Author: *Jackson Woodruff, Sam Ainsworth, Michael F.P. O'Boyle (University of Edinburgh, UK)
Page: pp. 219 - 224
Keyword: Regular Expression, FPGA, Overlay
Abstract: Regular expressions are used in domains ranging from intrusion detection to bioinformatics. This has led to the widespread development of FPGA-based hardware accelerators. However, reprogramming these accelerators for different regular expressions is slow and difficult due to FPGA toolchain overheads. Translation overlays that enable fast repurposing of existing accelerator layouts have been proposed. However, these assume the underlying accelerator design exists and is well-designed. For many domains, this is impossible to do by hand, requiring simple patterns and significant programmer effort. We present Secco, a compiler targeting symbol-only reconfigurable architectures, which codesigns the (fast-reconfigured) translation overlay and the (slow-reconfigured) underlying accelerator layout by reusing the same hardware when simple overlay translations are available and generating new hardware otherwise. This allows significant efficiency improvements: Secco enables 5.9x more expressions using the same resources across all ANMLZoo benchmarks compared to regular expression toolchains like REAPR. This enables large numbers of diverse regular expressions to be accelerated with context-switching overheads in the milliseconds.

Session 3B  Hardware Acceleration for Graph Neural Networks and New Models
Time: 15:30 - 17:35, Tuesday, January 23, 2024
Location: Room 205
Chair: Mohamed M. Sabry Aly (Nanyang Technological University, Singapore)

Best Paper Candidate
3B-1 (Time: 15:30 - 15:55)
Title: SparGNN: Efficient Joint Feature-Model Sparsity Exploitation in Graph Neural Network Acceleration
Author: *Chen Yin, Jianfei Jiang, Qin Wang, Zhigang Mao, Naifeng Jing (Shanghai Jiao Tong University, China)
Page: pp. 225 - 230
Keyword: GNN acceleration, Algorithm and hardware co-design, Data sparsity, Compressed product dataflow
Abstract: With the rapid explosion in both graph scale and model size, accelerating graph neural networks (GNNs) at scale puts significant pressure on computation and memory footprint. Exploiting data sparsity with pruning has shown remarkable effect in deep neural networks (DNNs) but still lags behind in GNN acceleration, because the costly pruning overhead on large graphs and inefficient hardware support eclipse the benefit of GNN sparsification. To this end, this paper proposes SparGNN, an algorithm and accelerator co-design that can efficiently exploit data sparsity in both features and models to speed up GNN acceleration while preserving accuracy. In the algorithm, to reduce the overhead of iterative pruning, we distill a sparsified subgraph to substitute the original input graph for pruning, which excavates the potential data sparsity in both features and models at low cost and without compromising accuracy. In the hardware, to improve data locality of the sparsified feature-weight multiplication, we design a compressed row-/column-wise product dataflow for efficient feature updating. We then propose lightweight hardware changes to make our design applicable to conventional GNN accelerators. The experimental results show that compared to state-of-the-art GNN accelerators, SparGNN reduces computation by 1.5∼4.3× and gains an average of 1.8∼6.8× speedup with 1.4∼9.2× energy efficiency improvement.
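A row-wise product dataflow of the kind mentioned above streams each nonzero of the sparse feature matrix once and scales a row of the dense weight matrix with it. A minimal software analogue over the CSR format (a generic sketch, not the paper's hardware dataflow):

```python
def csr_spmm(indptr, indices, data, B):
    """Row-wise sparse-dense product C = A @ B, with A given in CSR
    form (indptr, indices, data). For each nonzero A[i, k] = v,
    accumulate v * B[k] into output row C[i]."""
    n = len(indptr) - 1          # number of rows of A
    m = len(B[0])                # number of columns of B
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for p in range(indptr[i], indptr[i + 1]):
            k, v = indices[p], data[p]
            for j in range(m):
                C[i][j] += v * B[k][j]
    return C
```

The inner loop touches B one full row at a time, which is the locality property such dataflows exploit.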

3B-2 (Time: 15:55 - 16:20)
Title: APoX: Accelerate Graph-Based Deep Point Cloud Analysis via Adaptive Graph Construction
Author: *Lei Dai (SKLCA, Institute of Computing Technology, Chinese Academy of Sciences, China), Shengwen Liang, Ying Wang (SKLCA, Institute of Computing Technology, University of Chinese Academy of Sciences; Zhongguancun National Laboratory, China), Huawei Li (SKLCA, Institute of Computing Technology, University of Chinese Academy of Sciences; Peng Cheng Laboratory, China), Xiaowei Li (SKLCA, Institute of Computing Technology, University of Chinese Academy of Sciences; Zhongguancun National Laboratory, China)
Page: pp. 231 - 237
Keyword: Point cloud accelerator, graph construction, ANN
Abstract: Graph-based deep learning point cloud processing has gained increasing popularity, but its performance is dragged down by the dominant graph construction (GC) phase, with its irregular computation and memory access. Existing works that accelerate GC by tailoring the architecture to a single GC algorithm fail to maintain efficiency because they neglect that the best GC algorithm varies with point-cloud density across changing scenarios. Therefore, we propose APoX, a unified architecture with an adaptive GC scheme that can identify the optimum GC approach according to the point cloud variation. Experiments indicate that APoX achieves higher performance and energy efficiency than existing accelerators.

3B-3 (Time: 16:20 - 16:45)
Title: FuseFPS: Accelerating Farthest Point Sampling with Fusing KD-tree Construction for Point Clouds
Author: *Meng Han, Liang Wang, Limin Xiao, Hao Zhang, Chenhao Zhang, Xilong Xie, Shuai Zheng (Beihang University, China), Jin Dong (Beijing Academy of Blockchain and Edge Computing, China)
Page: pp. 238 - 243
Keyword: Farthest point sampling, KD-tree construction, Point cloud analytics, accelerator
Abstract: Point cloud analytics has become a critical workload for embedded and mobile platforms across various applications. Farthest point sampling (FPS) is a fundamental and widely used kernel in point cloud processing. However, the heavy external memory access makes FPS a performance bottleneck for real-time point cloud processing. Although bucket-based farthest point sampling can significantly reduce unnecessary memory accesses during the point sampling stage, the KD-tree construction stage becomes the predominant contributor to execution time. In this paper, we present FuseFPS, an architecture and algorithm co-design for bucket-based farthest point sampling. We first propose a hardware-friendly sampling-driven KD-tree construction algorithm. The algorithm fuses the KD-tree construction stage into the point sampling stage, further reducing memory accesses. Then, we design an efficient accelerator for bucket-based point sampling. The accelerator can offload the entire bucket-based FPS kernel at a low hardware cost. Finally, we evaluate our approach on various point cloud datasets. The detailed experiments show that compared to the state-of-the-art accelerator QuickFPS, FuseFPS achieves about 4.3x and about 6.1x improvements on speed and power efficiency, respectively.
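For reference, the plain (bucket-free) FPS kernel that such accelerators speed up greedily picks, at each step, the point farthest from the set chosen so far. A simple software baseline (a generic sketch, not the paper's fused KD-tree algorithm):

```python
def farthest_point_sampling(points, k):
    """Greedy FPS over a list of coordinate tuples: repeatedly select
    the point maximizing the minimum squared Euclidean distance to
    the already-chosen set. Returns k indices into `points`."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    chosen = [0]                                  # arbitrary seed point
    dist = [d2(p, points[0]) for p in points]     # min dist to chosen set
    for _ in range(k - 1):
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        for i, p in enumerate(points):            # refresh min distances
            dist[i] = min(dist[i], d2(p, points[nxt]))
    return chosen
```

The repeated full-array distance refresh is exactly the memory traffic that bucket-based variants prune.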

3B-4 (Time: 16:45 - 17:10)
Title: A Fixed-Point Pre-Processing Hardware Architecture Design for Complex Independent Component Analysis
Author: *Yashwant Moses, Madhav Rao (International Institute of Information Technology Bangalore, India)
Page: pp. 244 - 249
Keyword: CORDIC, Complex Independent Component Analysis (c-ICA), Pipelined Preprocessing Accelerator, Blind Source Separation
Abstract: Complex Independent Component Analysis (c-ICA) is widely employed in applications involving MIMO communication systems, radar signal processing, medical imaging (MRI), and other fields where data is represented in complex number format. This paper proposes a configurable fixed-point pre-processing hardware accelerator for the c-ICA algorithm that offers a balanced combination of high throughput with low area and power costs. The proposed accelerator performs the c-ICA pre-processing in multiple stages, including a step for centering and covariance matrix computation, followed by eigenvalue decomposition (EVD) and whitening matrix computation units. The datapath is pipelined such that each stage operates in parallel, and the individual stage designs are internally pipelined, resulting in high throughput. The paper characterizes the proposed architecture design using a 45 nm process flow and compares its performance with current state-of-the-art (SOTA) designs. Experimental results showcase substantial savings in processing time and computational resources, making it highly suitable for real-time and resource-constrained applications. A throughput gain of 49.96% and complexity reductions of 22.23% and 19.63% for the covariance-cum-centering and EVD units, respectively, were achieved by the proposed design over the best of the SOTA designs. The hardware design files are made freely available for further use by the designer and researcher community [1].

3B-5 (Time: 17:10 - 17:35)
Title: Pearls Hide Behind Linearity: Simplifying Deep Convolutional Networks for Embedded Hardware Systems via Linearity Grafting
Author: *Xiangzhong Luo (Nanyang Technological University, Singapore), Di Liu (Norwegian University of Science and Technology, Norway), Hao Kong, Shuo Huai, Hui Chen, Shiqing Li, Guochu Xiong, Weichen Liu (Nanyang Technological University, Singapore)
Page: pp. 250 - 255
Keyword: Hardware-Aware Network Compression, Efficient Latency Modeling, Efficient Accuracy Modeling
Abstract: The increasing complexity of convolutional neural networks (CNNs) has fueled a huge demand for compression. Nonetheless, network pruning, as the most effective knob, fails to deliver Pareto-optimal networks. To tackle this issue, we introduce a novel pruning-free compression framework dubbed Domino, pioneering a revisit of the trade-off dilemma between accuracy and efficiency from the fresh perspective of linearity and non-linearity. Specifically, Domino leverages two predictors, one vanilla latency predictor and one meta-accuracy predictor, to identify the less important non-linear building blocks, which are then grafted with linear counterparts. Next, the grafted network is trained on the target task to obtain decent accuracy, after which each grafted linear building block that contains multiple consecutive linear layers is reparameterized into one single linear layer to boost efficiency on the target hardware without degrading accuracy on the target task. Extensive experiments on two popular Nvidia Jetson embedded platforms (i.e., Xavier and Nano) and two representative networks (i.e., MobileNetV2 and ResNet50) clearly demonstrate the superiority of Domino. For example, Domino-Aggressive achieves +10.6%/+8.8% higher top-1/top-5 accuracy on ImageNet than MobileNetV2×0.2, while bringing 1.9×/1.3× speedup on Xavier/Nano.

Session 3C  New Frontiers in Verification and Simulation
Time: 15:30 - 17:35, Tuesday, January 23, 2024
Location: Room 206
Chair: Jaeyong Chung (Incheon National University, Republic of Korea)

Best Paper Candidate
3C-1 (Time: 15:30 - 15:55)
Title: On Decomposing Complex Test Cases for Efficient Post-silicon Validation
Author: Harshitha C, Sundarapalli Harikrishna, Peddakotla Rohith, *Sandeep Chandran (Indian Institute of Technology Palakkad, India), Rajshekar Kalayappan (Indian Institute of Technology Dharwad, India)
Page: pp. 256 - 261
Keyword: Functional Verification, Test generation, Post-silicon Validation
Abstract: In post-silicon validation, the first step when an erroneous behavior is uncovered by a long-running test case is to reproduce the erroneous behavior in a shorter execution, because a shorter execution is amenable to debugging using a variety of tools and techniques. In this work, we propose an automated tool called Gru that takes as input a long execution trace and generates an executable corresponding to each section of the trace. Each generated executable is guaranteed to faithfully replicate the behavior observed during the corresponding section of the original long-running test case. Since each executable is independent of the others, they can all be executed in parallel to further hasten localizing the bug to a particular section. Further, the generation of the smaller executables does not require the source code (or executable) of the application that triggered the erroneous behavior. We demonstrate the effectiveness of this tool on a collection of 10 EEMBC benchmarks executed on a bare-metal LEON3 SoC.

3C-2 (Time: 15:55 - 16:20)
Title: DeepIC3: Guiding IC3 Algorithms by Graph Neural Network Clause Prediction
Author: *Guangyu Hu, Jianheng Tang (The Hong Kong University of Science and Technology, Hong Kong), Changyuan Yu (The Hong Kong University of Science and Technology (Guangzhou), China), Wei Zhang (The Hong Kong University of Science and Technology, Hong Kong), Hongce Zhang (The Hong Kong University of Science and Technology (Guangzhou), China)
Page: pp. 262 - 268
Keyword: model checking, property directed reachability, inductive generalization, graph learning
Abstract: In recent years, machine learning has demonstrated its potential in many challenging problems. In this paper, we extend its use to hardware formal property verification and propose DeepIC3, a method that takes advantage of graph learning in the classic IC3/PDR algorithm. In DeepIC3, graph neural networks are integrated to improve the result of local inductive generalization. This helps provide a global view of the state transition system and can potentially lead the algorithm out of local optima in the search for inductive invariants. Our experiments demonstrate that DeepIC3 accelerates the vanilla algorithm on nontrivial test cases from the hardware model checking competition benchmarks (HWMCC2020) with up to 10.8x speed-up. The proposed machine-learning integration preserves soundness and is universally applicable to various IC3/PDR implementations.

3C-3 (Time: 16:20 - 16:45)
Title: TIUP: Effective Processor Verification with Tautology-Induced Universal Properties
Author: *Yufeng Li, Yiwei Ci, Qiusong Yang (Institute of Software, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China)
Page: pp. 269 - 274
Keyword: Formal verification, Universal property, Tautology, Processor
Abstract: Design verification is a complex and costly task, especially for large and intricate processor projects. Formal verification techniques provide advantages by thoroughly examining design behaviors, but they require extensive labor and expertise in property formulation. Recent research focuses on verifying designs using the self-consistency universal property, reducing verification difficulty as it is design-independent. However, the single self-consistency property faces false positives and scalability issues due to exponential state space growth. To tackle these challenges, this paper introduces TIUP, a technique using tautologies as universal properties. We show how TIUP effectively uses tautologies as abstract specifications, covering processor data and control paths. TIUP simplifies and streamlines verification for engineers, enabling efficient formal processor verification.

3C-4 (Time: 16:45 - 17:10)
Title: Verifying Embedded Graphics Libraries leveraging Virtual Prototypes and Metamorphic Testing
Author: *Christoph Hazott, Florian Stögmüller, Daniel Große (Johannes Kepler Universität Linz, Austria)
Page: pp. 275 - 281
Keyword: Verification, Embedded Graphics Libraries, Virtual Prototypes, Metamorphic Testing
Abstract: Embedded graphics libraries are part of the firmware of embedded systems and provide complex functionalities optimized for specific hardware. After unit testing of embedded graphics libraries, integration testing is a significant challenge, in particular since the hardware is needed to obtain the output image and since defining the reference result is inherently difficult. In this paper, we present a novel approach focusing on integration testing of embedded graphics libraries. We leverage Virtual Prototypes (VPs) and integrate them with Metamorphic Testing (MT). MT is a software testing technique that uncovers faults or issues in a system by exploring how its outputs change under predefined input transformations, without relying on explicit oracles or predetermined results. In combination with virtualizing the displays in VPs, we even eliminate the need for physical hardware. This allows us to develop an MT framework automating the verification process. In our evaluation, we demonstrate the effectiveness of our MT framework. On an extended RISC-V VP for the GD32V platform, we found 15 distinct bugs in the widely used TFT_eSPI embedded graphics library, confirming the strength of our approach.
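As a toy illustration of a metamorphic relation for a graphics library (using a hypothetical `fill_rect` stand-in, not the paper's framework or the TFT_eSPI API): translating a shape that stays fully on-screen must not change the number of lit pixels, and this can be checked without any reference image.

```python
# Toy framebuffer and rectangle fill standing in for a graphics library.
def fill_rect(fb, x, y, w, h):
    for r in range(y, y + h):
        for c in range(x, x + w):
            fb[r][c] = 1

def lit_pixels(fb):
    return sum(sum(row) for row in fb)

def check_translation_invariance(x, y, w, h, dx, dy, size=32):
    """Metamorphic relation: rendering a rectangle and rendering its
    translated copy (both fully on-screen) must light the same number
    of pixels. No explicit oracle image is needed."""
    a = [[0] * size for _ in range(size)]
    b = [[0] * size for _ in range(size)]
    fill_rect(a, x, y, w, h)
    fill_rect(b, x + dx, y + dy, w, h)
    return lit_pixels(a) == lit_pixels(b)
```

A buggy clipping or address-computation path would typically break this invariant for some translations, which is how MT surfaces faults.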

3C-5 (Time: 17:10 - 17:35)
Title: MemSPICE: Automated Simulation and Energy Estimation Framework for MAGIC-Based Logic-in-Memory
Author: Simranjeet Singh (Indian Institute of Technology Bombay, India), Chandan Kumar Jha (University of Bremen, Germany), Ankit Bende, Vikas Rana (Forschungszentrum Jülich GmbH, Germany), Sachin Parkar (Indian Institute of Technology Bombay, India), Rolf Drechsler (University of Bremen, Germany), *Farhad Merchant (Newcastle University, Newcastle upon Tyne, UK)
Page: pp. 282 - 287
Keyword: Digital Logic-in-Memory, MAGIC, Memristors, SPICE, Simulation
Abstract: Existing logic-in-memory (LiM) research is limited to generating mappings and micro-operations. In this paper, we present MemSPICE, a novel framework that addresses this gap by automatically generating both the netlist and testbench needed to evaluate the LiM on a memristive crossbar. MemSPICE goes beyond conventional approaches by providing energy estimation scripts to calculate the precise energy consumption of the testbench at the SPICE level. We propose an automated framework that utilizes the mapping obtained from the SIMPLER tool to perform accurate energy estimation through SPICE simulations. To the best of our knowledge, no existing framework is capable of generating a SPICE netlist from a hardware description language. By offering a comprehensive solution for SPICE-based netlist generation, testbench creation, and accurate energy estimation, MemSPICE empowers researchers and engineers working on memristor-based LiM to enhance their understanding and optimization of energy usage in these systems. Finally, we tested the circuits from the ISCAS'85 benchmark on MemSPICE and conducted a detailed energy analysis.

Session 3D  Partition and Placement
Time: 15:30 - 17:35, Tuesday, January 23, 2024
Location: Room 207
Chair: Evangeline F. Y. Young (The Chinese University of Hong Kong, Hong Kong)

3D-1 (Time: 15:30 - 15:55)
Title: An Effective Netlist Planning Approach for Double-sided Signal Routing
Author: *Tzu-Chuan Lin, Fang-Yu Hsu, Wai-Kei Mak, Ting-Chi Wang (National Tsing Hua University, Taiwan)
Page: pp. 288 - 293
Keyword: back-side power delivery network, double-sided signal routing
Abstract: Separating the power delivery network (PDN) from the front-side metal stack and using the back-side metal stack primarily for the PDN has been proposed to improve PDN performance. To make good use of the surplus routing resources left on the back side after the PDN is built, we study in this paper how to route signal nets on both the front and back sides (i.e., double-sided signal routing). To this end, we present a netlist planning approach that distributes a set of signal nets to the two sides and properly inserts a set of bridging cells such that, after performing placement legalization and signal routing on each side separately, a high-quality double-sided routing solution can be produced. Our netlist planning approach has been combined with a commercial place-and-route tool, and our experimental results show that, compared to traditional single-sided routing without a back-side PDN, double-sided routing achieves a 9.1% reduction in wirelength and a 1.8% decrease in via count. Additionally, for critical nets, the percentage of the total length of their wire segments routed on preferred metal layers was improved by 13.6%.

3D-2 (Time: 15:55 - 16:20)
Title: An Analytical Placement Algorithm with Routing Topology Optimization
Author: *Min Wei, Xingyu Tong, Zhijie Cai, Peng Zou (Fudan University, China), Zhifeng Lin (Fuzhou University, China), Jianli Chen (Fudan University, China)
Page: pp. 294 - 299
Keyword: Refinement placement, Wirelength model, Analytical placement, Physical design
Abstract: Placement is a critical step in the modern VLSI design flow, as it largely determines the performance of circuit designs. Most placement algorithms estimate design performance with the half-perimeter wirelength (HPWL) and target it as their optimization objective. The wirelength model used by these algorithms limits their ability to optimize the internal routing topology, which can lead to discrepancies between estimates and the actual routed wirelength. This paper proposes an analytical placement algorithm that optimizes the internal routing topology. We first introduce a differentiable wirelength model in the global placement stage based on an ideal routing topology, the rectilinear Steiner minimal tree (RSMT). Through screening and tracing various segments, this model can generate meaningful gradients for interior points during gradient computation. Then, after global placement, we propose a cell refinement algorithm and further optimize the routing wirelength with swift density control. Experiments on ICCAD2015 benchmarks show that our algorithm can achieve a 3% improvement in routing wirelength, 0.8% in HPWL, and 23.8% in TNS compared with the state-of-the-art analytical placer.
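The HPWL objective referenced above is simply the sum, over all nets, of the half-perimeter of the bounding box of the net's pin positions. A minimal reference implementation (a generic sketch, not the paper's differentiable model):

```python
def hpwl(nets, pos):
    """Half-perimeter wirelength: for each net (a list of pin names),
    add bounding-box width + height over the pins' (x, y) positions."""
    total = 0.0
    for pins in nets:
        xs = [pos[p][0] for p in pins]
        ys = [pos[p][1] for p in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total
```

Because `max`/`min` are not differentiable, analytical placers replace them with smooth surrogates (e.g., log-sum-exp) during gradient-based optimization.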

3D-3 (Time: 16:20 - 16:45)
Title: Effective Analytical Placement for Advanced Hybrid-Row-Height Circuit Designs
Author: *Yuan Wen, Benchao Zhu (Fudan University, China), Zhifeng Lin (Fuzhou University, China), Jianli Chen (Fudan University, China)
Page: pp. 300 - 305
Keyword: Physical design, Placement, Hybrid-row-height structure, Nonlinear placement
Abstract: Recently, hybrid-row-height designs have been introduced to achieve performance and area co-optimization in advanced nodes. Hybrid-row-height designs pose challenging layout issues due to their heterogeneous cell and row structures. In this paper, we present an effective algorithm to address the hybrid-row-height placement problem in two major stages: (1) global placement and (2) legalization. Inspired by the multi-channel processing method in convolutional neural networks (CNNs), we use a feature extraction technique to equivalently transform the hybrid-row-height global placement problem into two sub-problems that can be solved effectively. We propose a multi-layer nonlinear framework with alignment guidance and a self-adaptive parameter adjustment scheme, which can obtain a high-quality solution to the hybrid-row-height global placement problem. In the legalization stage, we formulate the hybrid-row-height legalization problem as a convex quadratic programming (QP) problem, then apply the robust modulus-based matrix splitting iteration method (RMMSIM) to solve the QP efficiently. After RMMSIM-based global legalization, Tetris-like allocation is used to resolve the remaining physical violations. Compared with the state-of-the-art work, experiments on the 2015 ISPD Contest benchmarks show that our algorithm can achieve 7% shorter final total wirelength and a 2.23× speedup.

3D-4 (Time: 16:45 - 17:10)
Title: Row Planning and Placement for Hybrid-Row-Height Designs
Author: *Ching-Yao Huang, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Page: pp. 306 - 311
Keyword: Hybrid-Row-Height Design, Row planning, Row configuration, Placement, Hybrid-Row-Height
Abstract: Traditionally, a standard cell library is composed of pre-designed cells, all of which have identical height so that the cells can be placed in rows of uniform height on a chip. The desire to integrate more logic gates onto a single chip has led to a continuous reduction of row height, with a reduced number of routing tracks, over the years. It has reached the point where not all cells can be designed with the minimum row height due to internal routability issues. Hybrid-row-height IC design with placement rows of different heights has emerged, which offers a better sweet spot for performance and area optimization. [7] proposed the first row planning algorithm for hybrid-row-height design, based on k-means clustering, to determine the row configuration so that the cells in an initial placement can be moved to rows with matching height with as little cell displacement as possible. The biggest limitation of the k-means clustering method is that it only works for designs without any macros. Here we propose an effective and highly flexible dynamic programming approach to determine an optimized row configuration for designs with or without macros. The experimental results show that for designs without any macros, our approach resulted in a 28% reduction in total cell displacement and a 7.1% reduction in the final routed wirelength on average compared to the k-means clustering approach while satisfying the timing constraints. Additional experimental results show that our approach can comfortably handle designs with macros while satisfying the timing constraints.
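The k-means baseline referenced here clusters cell positions so that each cluster can be served by rows of a matching height. A minimal 1-D k-means sketch of that idea (illustrative only, not [7]'s algorithm or the proposed dynamic program):

```python
def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means (Lloyd's algorithm): e.g., cluster cell
    y-coordinates so each cluster center can seed a row region.
    Returns the sorted cluster centers."""
    # Spread initial centers across the sorted value range.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else c   # keep empty centers put
                   for g, c in zip(groups, centers)]
    return sorted(centers)
```

Its limitation in this context is visible from the sketch: it only looks at coordinates, so fixed obstacles such as macros are not modeled, which motivates the dynamic programming formulation the paper proposes.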

3D-5 (Time: 17:10 - 17:35)
Title: TransPlace: A Scalable Transistor-Level Placer for VLSI Beyond Standard-Cell-Based Design
Author: *Chen-Hao Hsu (The University of Texas at Austin, USA), Xiaoqing Xu, Hao Chen, Dino Ruic (X, USA), David Z. Pan (The University of Texas at Austin, USA)
Page: pp. 312 - 318
Keyword: Physical Design, Placement, Transistor
Abstract: The standard-cell methodology is widely adopted in the VLSI design flow due to its scalability, reusability, and compatibility with electronic design automation (EDA) tools. However, the fixed relative positions and confinement of PMOS and NMOS transistors within a standard cell impose limitations on overall wirelength and area optimization. Directly placing individual transistors in a design can provide greater flexibility to explore more diffusion-sharing opportunities, which can potentially result in less wirelength and smaller design areas than standard-cell-based designs. Unfortunately, existing transistor placement approaches are limited to a very small scale, e.g., a single standard cell. This paper presents TransPlace, the first transistor placement framework that is capable of placing a large number of transistors while considering diffusion sharing in FinFET technologies. Experimental results demonstrate its effectiveness in minimizing wirelength and reducing design area beyond the limits of standard-cell-based designs.

Session 3E  (SS-2) Hardware Security -- A True Multidisciplinary Research Area
Time: 15:30 - 17:35, Tuesday, January 23, 2024
Location: Room 107/108
Chairs: Gang Qu (University of Maryland College Park), Takashi Sato (Kyoto University)

3E-1 (Time: 15:30 - 15:55)
Title: (Invited Paper) Towards Finding the Sources of Polymorphism in Polymorphic Gates
Author: Timothy Dunlap (University of Maryland, USA), Zelin Lu, *Gang Qu (University of Maryland, USA)
Page: pp. 319 - 324
Keyword: polymorphism, polymorphic gate, evolutionary algorithm, circuit analysis, hardware security
Abstract: A polymorphic gate can change its functionality based on external conditions such as temperature, voltage, and external signals. Although this concept was proposed more than two decades ago and has found success in designing circuits for area minimization and security applications, how to construct polymorphic gates remains a challenge because they do not have the structure of conventional CMOS gates. Without knowing the causes of polymorphism, the majority of reported polymorphic gates are generated in an ad hoc fashion, using evolutionary algorithms and time-consuming SPICE simulation. In this paper, we study thousands of candidate circuits that we have created to find the sources of polymorphism. We observe several features that are not present in traditional CMOS gates. Circuit analysis suggests that these features are potential sources of polymorphism, which is confirmed by their presence in the polymorphic gates reported in the literature. Furthermore, we demonstrate with examples that polymorphic gates can be effectively constructed using these features as guidance.

3E-2 (Time: 15:55 - 16:20)
Title: (Invited Paper) HOGE: Homomorphic Gate on An FPGA
Author: *Kotaro Matsuoka (Kyoto University, Japan), Song Bian (Beihang University, China), Takashi Sato (Kyoto University, Japan)
Page: pp. 325 - 332
Keyword: FPGA, Homomorphic Encryption, TFHE
Abstract: This paper proposes HOGE, a new accelerator architecture on FPGA for Fully Homomorphic Encryption over the Torus (TFHE). TFHE is an FHE scheme that allows arbitrary logical circuits to be evaluated over encrypted ciphertexts. To the best of our knowledge, HOGE is the first hardware architecture that evaluates a complete homomorphic gate with bootstrapping on a single commercial hardware device and has a proven capability to work with the host machine to evaluate homomorphic logic circuits. HOGE is equipped with carefully designed, resource-efficient four-step radix-32 NTT/INTT architectures that achieve higher parallelism, resulting in lower overall latency. HOGE is implemented on the Xilinx Alveo U280 platform and demonstrates performance 5-6× faster than the state-of-the-art CPU implementation of TFHE, carrying out a homomorphic gate in about 1.6 ms.
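The NTT/INTT pair at the heart of such TFHE accelerators is a discrete Fourier transform over a finite field. For intuition only (a textbook O(n²) sketch; HOGE's four-step radix-32 pipeline is far more elaborate), given a prime modulus and a primitive n-th root of unity:

```python
def ntt(a, root, mod):
    """Textbook O(n^2) number-theoretic transform: a DFT over Z_mod,
    where `root` is a primitive n-th root of unity mod `mod`."""
    n = len(a)
    return [sum(a[j] * pow(root, i * j, mod) for j in range(n)) % mod
            for i in range(n)]

def intt(a, root, mod):
    """Inverse NTT: transform with root^-1, then scale by n^-1 mod mod
    (inverses computed via Fermat's little theorem, mod prime)."""
    n = len(a)
    inv_root = pow(root, mod - 2, mod)
    inv_n = pow(n, mod - 2, mod)
    return [x * inv_n % mod for x in ntt(a, inv_root, mod)]
```

For example, with mod = 17 and root = 4 (a primitive 4th root of unity mod 17), `intt(ntt(a, 4, 17), 4, 17)` recovers `a` for any length-4 input.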

3E-3 (Time: 16:20 - 16:45)
Title: (Invited Paper) Sensors for Remote Power Attacks: New Developments and Challenges
Author: *Brian Udugama (University of New South Wales, Australia), Darshana Jayasinghe, Sri Parameswaran (University of Sydney, Australia)
Page: pp. 333 - 340
Keyword: Remote Power Analysis, On-chip Sensors, FPGA
Abstract: Power consumption as a side channel has garnered significant attention in security research. Traditional power attacks, also referred to as power analysis attacks, necessitated physical access to target devices to measure power consumption fluctuations for disclosing sensitive information. Recent developments, however, have revealed that field programmable gate arrays (FPGAs) in remote settings and cloud services are vulnerable to remote power analysis (RPA) attacks, avoiding the need for physical access. Understanding evolving threats and sensor methodologies is crucial for the development of robust defense strategies. Thus, this paper discusses two stealthy on-chip sensors, the Voltage-Induced Time Interval Sensor (VITI) and the Power to Pulse Width Modulation Sensor (PPWM), offering effective means for conducting RPA attacks.

3E-4 (Time: 16:45 - 17:10)
Title: (Invited Paper) PRESS: Persistence Relaxation for Efficient and Secure Data Sanitization on Zoned Namespace Storage
Author: Yun-Shan Hsieh (Academia Sinica, Taiwan), Bo-Jun Chen (National Tsing Hua University, Taiwan), *Po-Chun Huang (National Taipei University of Technology, Taiwan), Yuan-Hao Chang (Academia Sinica, Taiwan)
Page: pp. 341 - 348
Keyword: data sanitization, secure data deletion, zoned namespace storage
Abstract: Recently, secure data deletion, or data sanitization, has been identified as a key technology for storage devices to securely delete obsolete sensitive data that are no longer used. However, secure data deletion requires extra management effort on flash memory storage devices due to the deferred reclamation of flash blocks in many flash translation layer schemes. The emerging zoned namespace storage further exacerbates the design complexity of secure data deletion, because a zone is much larger than a flash block. Given the very long latency to reset an entire zone, once some data have been written into a zone, it is very difficult to securely delete them from the zone. To achieve efficient and secure data deletion on zoned namespace storage, we propose persistence relaxation for efficient and secure sanitization (PRESS), which considers the working principle of zones in zoned namespace storage and allows fine-grained control of deferred data persistence. As a result, applications can efficiently delete their recently written data or make the data persistent for long-term storage. Our proposal, PRESS, is evaluated through a series of experimental studies, and the results are quite encouraging.

3E-5 (Time: 17:10 - 17:35)
Title(Invited Paper) Hardware Phi-1.5B: A Large Language Model Encodes Hardware Domain Specific Knowledge
AuthorWeimin Fu (Kansas State University, USA), Shijie Li, Yifang Zhao (University of Science and Technology of China, China), Haocheng Ma (Tianjin University, China), Raj Dutta (Silicon Assurance, USA), Xuan Zhang (Washington University in St. Louis, USA), Kaichen Yang (Michigan Technological University, USA), *Yier Jin (University of Science and Technology of China, China), Xiaolong Guo (Kansas State University, USA)
Pagepp. 349 - 354
KeywordLarge Language Model, Hardware Design, Hardware Verification, Generative AI
AbstractIn the rapidly evolving semiconductor industry, where research, design, verification, and manufacturing are intricately linked, the potential of Large Language Models to revolutionize hardware design and security verification is immense. The primary challenge, however, lies in the complexity of hardware-specific issues that are not adequately addressed by the natural language or software code knowledge typically acquired during the pretraining stage. Additionally, the scarcity of datasets specific to the hardware domain poses a significant hurdle in developing a foundational model. Addressing these challenges, this paper introduces Hardware Phi-1.5B, an innovative large language model specifically tailored for the hardware domain of the semiconductor industry. We have developed a specialized, tiered dataset—comprising small, medium, and large subsets—and focused our efforts on pre-training using the medium dataset. This approach harnesses the compact yet efficient architecture of the Phi-1.5B model. The creation of this first pre-trained, hardware domain-specific large language model marks a significant advancement, offering improved performance in hardware design and verification tasks and illustrating a promising path forward for AI applications in the semiconductor sector.


Session 3F  (DF-2) Heterogeneous Integration and Chiplet Design
Time: 15:30 - 17:10, Tuesday, January 23, 2024
Organizer/Chair: Rino Choi (Inha University, Republic of Korea), Co-Chair: Jaeduk Han (Hanyang University, Republic of Korea)

3F-1 (Time: 15:30 - 15:55)
Title(Designers' Forum) Co-Design Considerations of Heterogeneous Integrated Packaging
AuthorGu-Sung Kim (Kangnam University, Republic of Korea)
AbstractModern semiconductor technology has become a symbol of national competitiveness. Because a few countries dominate semiconductor front-end technology, many of the remaining countries and regions, such as China, Japan, Southern Asia, and Europe, are strengthening support for semiconductor back-end technology. Unlike the eight major semiconductor front-end processes, the back-end process is difficult to grasp as a whole because of its diversity and variability. Heterogeneous integration is a new-era technology that integrates separately manufactured components into a higher-level assembly providing functional improvements. The presenter explains the semiconductor back-end process and technology, from the assembly technology described in the last version of the ITRS to the heterogeneous integration technology in the IEEE EPS Heterogeneous Integration Roadmap (HIR), in connection with Moore's law. In addition, the talk presents design considerations in this area, including electrical, mechanical, and thermal simulations.

3F-2 (Time: 15:55 - 16:20)
Title(Designers' Forum) Don’t Close Your Eyes on Temperature: System Level Thermal Perspectives of 3D Stacked Chips
AuthorSung Woo Chung (Korea University, Republic of Korea)
AbstractWith limited process technology scaling, heterogeneous integration becomes a viable solution for elevating system performance. For heterogeneous integration, 3D stacking is one of the attractive options since it leads to a small footprint. However, 3D stacking inevitably causes higher on-chip temperature due to high power density, which negatively affects processing units and DRAM as follows: 1) When the on-chip temperature of processing units such as CPUs, GPUs, and NPUs reaches a threshold temperature, DTM (Dynamic Thermal Management) is invoked to reduce power consumption (which eventually keeps the on-chip temperature in check). For DTM, frequency and/or voltage are decreased, leading to system performance degradation. 2) When the on-chip temperature of DRAM (HBM) goes over a threshold temperature, the DRAM must be refreshed more frequently to safely store data, resulting in increased refresh energy and degraded system performance. In this talk, experimental results on Intel Lakefield (the first TSV-based CPU) and AMD 5800X3D (the first C2C-based CPU; actually, its last-level cache is stacked) are presented. In addition, the refresh overhead in HBM due to high on-chip temperature (mainly dissipated from a processing unit through silicon vias) is presented.
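The throttling behavior described above can be illustrated with a minimal control-step sketch. This is a hypothetical toy policy; the threshold, step size, and frequency bounds below are illustrative values, not those of any specific processor or of the systems measured in this talk.

```python
def dtm_step(temp_c, freq_ghz, threshold_c=95.0, step_ghz=0.2,
             f_min=0.8, f_max=3.0):
    """One control step of a simple frequency-throttling DTM policy.

    Above the thermal threshold, frequency is stepped down to shed
    power (and thus heat), at the cost of performance; below it,
    frequency is allowed to recover toward its maximum.
    """
    if temp_c >= threshold_c:
        return max(f_min, freq_ghz - step_ghz)   # throttle: performance loss
    return min(f_max, freq_ghz + step_ghz)       # recover headroom

hot = dtm_step(100.0, 2.0)    # above threshold -> throttled
cool = dtm_step(80.0, 2.0)    # below threshold -> raised
```

Real DTM controllers add hysteresis and per-domain voltage scaling, but the feedback structure (sensor reading in, operating point out) is the same.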

3F-3 (Time: 16:20 - 16:45)
Title(Designers' Forum) Introducing UCIe: The Global Chiplet Interconnect Standard
AuthorYoungbin Kwon (Samsung Electronics, Republic of Korea)
AbstractAs the chiplet business gains momentum in the HPC (High-Performance Computing) market, several chiplet interconnect standards have emerged and vied for global standardization. Throughout this process, two standards, UCIe (Universal Chiplet Interconnect Express) and BoW (Bunch of Wire), have persevered. UCIe, in particular, has garnered significant attention as a potential global standard, primarily due to its robust consortium backing. Additionally, unlike other standards, UCIe offers support for widely used protocols in the HPC field, such as PCIe and CXL. This talk aims to introduce the key characteristics of UCIe, delve into an overview of the UCIe Transceiver Structure, and highlight its major electrical parameters. Furthermore, this talk will briefly touch upon the current development trends of IP vendors within the UCIe ecosystem.

3F-4 (Time: 16:45 - 17:10)
Title(Designers' Forum) Addressing Modeling and Simulation Challenges in Chiplet Interfaces
AuthorJaeha Kim (Seoul National University, Republic of Korea)
AbstractHigh-speed die-to-die interfaces are essential in enabling chiplets, the emerging building blocks for heterogeneous integration. Interestingly, many chiplet interface standards, including Universal Chiplet Interconnect Express (UCIe), are evolving such that the analog circuits become standardized blocks while the digital finite-state machines (FSMs) provide the complex functionality. While such an architecture can improve design efficiency and portability, it presents new challenges for verifying overall system functionality. This talk will use the example of modeling both the analog circuits and the digital FSMs of a UCIe physical layer in SystemVerilog and discuss how one can combine analog and digital approaches to functional verification.

Wednesday, January 24, 2024


Session 4A  Detection Techniques for SoC Vulnerability and Malware
Time: 9:00 - 10:15, Wednesday, January 24, 2024
Location: Room 204
Chair: Makoto Ikeda (University of Tokyo, Japan)

4A-1 (Time: 9:00 - 9:25)
TitleFormalFuzzer: Formal Verification Assisted Fuzz Testing for SoC Vulnerability Detection
AuthorNusrat Farzana Dipu, Muhammad Monir Hossain, Kimia Zamiri Azar, Farimah Farahmandi, *Mark Tehranipoor (University of Florida, USA)
Pagepp. 355 - 361
KeywordFuzzing, Formal Method, Cost function, SoC
AbstractModern Systems-on-Chips (SoCs) integrate numerous insecure intellectual properties to meet design-cost and time-to-market constraints. Incorporating these SoCs into security-critical systems severely threatens users' privacy. Traditional formal/simulation-based verification techniques detect vulnerabilities to some extent. However, these approaches face challenges in detecting unknown vulnerabilities and suffer from significant manual effort, false alarms, low coverage, and poor scalability. Several fuzzing techniques have been developed to mitigate these pre-silicon hardware verification limitations. Nevertheless, these techniques suffer from major challenges such as slow simulation platforms, extensive design-knowledge requirements, and a lack of consideration of untrusted inter-module communication. To overcome these shortcomings, we developed FormalFuzzer, an emulation-based hybrid framework that combines formal verification and fuzz testing, leveraging the benefits of each. FormalFuzzer incorporates formal-verification-based pre-processing using template-based assertion generation to narrow down the search space for fuzz testing, and selects appropriate mutation strategies via dynamic feedback derived from a security-oriented cost function. The cost function is developed using vulnerability databases and specifications and indicates the likelihood of triggering a vulnerability; a vulnerability is detected when the cost function reaches a global or local minimum. Our experiments on the RISC-V-based Ariane SoC demonstrate the efficiency of the proposed formal-verification-based pre-processing strategies and cost-function-driven fuzzing feedback in detecting both known and unknown vulnerabilities expeditiously.
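The cost-function-guided feedback loop at the heart of such fuzzers can be sketched generically. This is a hypothetical toy, not FormalFuzzer's implementation: mutate the input, keep mutants that lower the cost, and flag a finding when the cost reaches its minimum.

```python
import random

def fuzz(cost_fn, seed, iterations=2000, rng=None):
    """Mutation-based fuzzing guided by a cost function.

    Mutates the current best input one byte at a time, keeps a mutant
    whenever it lowers the cost, and stops when cost 0 (the minimum
    that models triggering a vulnerability) is reached.
    """
    rng = rng or random.Random(0)
    best = bytearray(seed)
    best_cost = cost_fn(bytes(best))
    for _ in range(iterations):
        mutant = bytearray(best)
        pos = rng.randrange(len(mutant))
        mutant[pos] = rng.randrange(256)      # random byte mutation
        c = cost_fn(bytes(mutant))
        if c < best_cost:                     # dynamic feedback: keep improvements
            best, best_cost = mutant, c
        if best_cost == 0:                    # minimum reached -> "vulnerability"
            break
    return bytes(best), best_cost

# Toy cost: distance from a hypothetical trigger input the checker hunts for.
TRIGGER = b"BUG!"
cost = lambda x: sum(abs(a - b) for a, b in zip(x, TRIGGER))

found, c = fuzz(cost, b"\x00" * 4)            # c strictly below the seed's cost
```

In a real hardware fuzzer the cost function would be derived from coverage and security properties rather than byte distance, and mutations would target transaction-level stimuli.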

4A-2 (Time: 9:25 - 9:50)
TitleDeepIncept: Diversify Performance Counters with Deep Learning to Detect Malware
AuthorZhuoran Li (Old Dominion University, USA), *Dan Zhao (University of Arizona, USA)
Pagepp. 362 - 367
KeywordUnknown Malware Detection, Hardware Performance Counters, Lightweight Deep Learning, Multicollinearity
AbstractUnknown exploits or zero-day attacks targeting Internet of Things (IoT) devices, accompanied by new malware variants, have significantly increased cyberattack activities in early 2023. To tackle the challenge of detecting unknown malware, we propose a lightweight and non-intrusive detection engine that leverages a fusion of deep learning techniques with hardware performance counter (HPC) analysis. Unlike previous approaches that merely utilize HPC events without comprehensive analysis, we introduce a novel idea of diversifying hardware performance counters with in-depth correlation analysis to enhance deep learning detection performance. Specifically, our method employs in-depth correlation analysis to identify HPC events that possess two crucial characteristics: high representativeness and diverse attributes, significantly improving the detection performance for both existing and unknown malware. To achieve on-device detection, we introduce DeepIncept, a compact network architecture that takes advantage of depth-aware deconstruction and streamlined contextual filtering. This architecture incorporates efficient depthwise separable convolutions and 1-Dimensional Convolutional Neural Network (CNN) kernels to create an inception-like structure, enabling accurate extraction of event-specific and multi-event-combined features from the reformed HPC image data. The experimental results unequivocally showcase the effectiveness of DeepIncept, achieving an accuracy of 98.58% and 98.31% in detecting existing and unknown malware respectively. This outperforms the model that lacks in-depth analysis and selection of HPC events by over 5%. Furthermore, DeepIncept demonstrates a 3-4% improvement across various performance metrics compared to the traditional CNN model while achieving a detection speed of approximately 2ms, three times faster than the classical approach.
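The efficiency of the depthwise separable convolutions mentioned above rests on a simple parameter-count argument that is easy to verify; the kernel and channel sizes below are arbitrary examples, not DeepIncept's actual configuration.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k 2-D convolution (bias terms ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise (one k x k filter per input channel) plus pointwise
    (1 x 1, mixing channels) weights."""
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 128)                 # 3*3*64*128 = 73728
sep = depthwise_separable_params(3, 64, 128)  # 3*3*64 + 64*128 = 8768
ratio = std / sep                             # roughly 8.4x fewer parameters
```

This roughly k²-fold reduction is what makes such layers attractive for on-device detection engines with tight compute budgets.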

4A-3 (Time: 9:50 - 10:15)
TitleResource- and Workload-aware Malware Detection through Distributed Computing in IoT Networks
Author*Sreenitha Kasarapu, Sanket Shukla, Sai Manoj PD (George Mason University, USA)
Pagepp. 368 - 373
KeywordHardware Security, Malware Detection, Distributed learning, Deep Learning
AbstractNetworked IoT systems have emerged in recent years to facilitate seamless connectivity, portability, and smarter functionality. Despite lending a plethora of benefits, IoT devices are exploited by adversaries for various illicit purposes. IoT systems are a popular target due to the lack of security traits in their design and the minimal computational and storage resources available on the devices. Among multiple threats, malicious applications, a.k.a. malware, are seen as a pivotal security threat to IoT devices and networks. Many malware detection techniques have been proposed recently. However, the existing techniques either focus on general-purpose systems or assume the availability of abundant resources for malware detection. For IoT devices, ongoing workloads such as sensing and on-device computation further reduce the resources available for malware detection. We propose a novel resource- and workload-aware malware detection technique integrated with distributed computing for IoT networked systems to address these challenges. The device analyzes the resources available for malware detection using a lightweight regression model. Depending on the available resources, ongoing workload executions, and communication cost, the malware detection task is either performed on-device or offloaded to neighboring IoT nodes with sufficient resources. To ensure data integrity and user privacy, instead of offloading the whole malware detection task, the classifier is partitioned and distributed over multiple nodes and then integrated at the parent node for malware detection. Experimental analysis shows that the proposed technique can achieve a speed-up of 9.8x compared to on-device inference while maintaining a malware detection accuracy of 96.7%.


Session 4B  Design for Manufacturability: from Rule Checking to Yield Optimization
Time: 9:00 - 10:15, Wednesday, January 24, 2024
Location: Room 205
Chairs: Bei Yu (The Chinese University of Hong Kong, Hong Kong), Yuzhe Ma (The Hong Kong University of Science and Technology (Guangzhou), China)

4B-1 (Time: 9:00 - 9:25)
TitleAPPLE: An Explainer of ML Predictions on Circuit Layout at the Circuit-Element Level
Author*Tao Zhang (Hong Kong University of Science and Technology, Hong Kong), Haoyu Yang (Nvidia, USA), Kang Liu (Huazhong University of Science and Technology, China), Zhiyao Xie (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 374 - 379
KeywordML Interpretability, Lithography hotspot
AbstractIn recent years, we have witnessed many excellent machine learning (ML) solutions targeting circuit layouts. These ML models provide fast predictions on various design objectives. However, almost all existing ML solutions have neglected the basic interpretability requirement from potential users. As a result, it is very difficult for users to figure out any potential accuracy degradation or abnormal behaviors of given ML models. In this work, we propose a new technique named APPLE to explain each ML prediction at the resolution level of circuit elements. To the best of our knowledge, this is the first effort to explain ML predictions on circuit layouts. It provides a significantly more reasonable, useful, and efficient explanation for lithography hotspot prediction, compared with the highest-cited prior solution for natural images.

4B-2 (Time: 9:25 - 9:50)
TitleE2E-Check: End to End GPU-Accelerated Design Rule Checking with Novel Mask Boolean Algorithms
Author*Yifei Zhou, Zijian Wang, Chao Wang (Southeast University, China)
Pagepp. 380 - 385
KeywordDesign Rule Checking, mask boolean operation
AbstractDesign Rule Checking (DRC) is an essential part of physical verification to guarantee the successful fabrication of chips. In this paper, we present E2E-Check, a comprehensive DRC flow that leverages the potential of heterogeneous CPU-GPU parallelism, resulting in GPU-accelerated end-to-end processing. We first propose a novel Fast Candidate Edges Construction (FCEC) algorithm that reduces the complexity of the merge operation and directly constructs the ordered and directed candidate edges in parallel to overcome potential computational bottlenecks. Second, we propose a heuristic pruning algorithm to significantly reduce redundant checks and data transfer. We then explore an efficient GPU-based sweepline algorithm suitable for the sliced candidate-edge sets. Experimental results demonstrate that E2E-Check achieves significant speedup and outperforms a state-of-the-art multi-threaded design rule checker. FCEC achieves speedups of 5.8x to 429.9x for the merge operation, and E2E-Check achieves an average speedup of 443x for spacing rule checks and 3063x for enclosure rule checks.
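A spacing rule check of the kind accelerated here can be illustrated in one dimension with a tiny sweep-style sketch. This is a hypothetical simplification (real DRC operates on 2-D polygon edges with far richer rules): sort the shapes along the sweep axis and compare each with its neighbor.

```python
def spacing_violations(intervals, min_space):
    """Report pairs of 1-D shapes closer than min_space (toy DRC spacing check).

    Sorts the shapes and sweeps left to right, comparing each shape only
    with its predecessor, which suffices for non-overlapping intervals.
    """
    ivs = sorted(intervals)
    out = []
    for (l0, r0), (l1, r1) in zip(ivs, ivs[1:]):
        if 0 <= l1 - r0 < min_space:          # non-negative gap below the rule
            out.append(((l0, r0), (l1, r1)))
    return out

# Shapes at (0,2) and (3,5) leave a gap of 1 < 2: a spacing violation.
viol = spacing_violations([(0, 2), (3, 5), (10, 12)], min_space=2)
```

Sorting plus neighbor comparison is the serial skeleton that GPU sweepline checkers parallelize across sliced edge sets.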

4B-3 (Time: 9:50 - 10:15)
TitleCIS: Conditional Importance Sampling for Yield Optimization of Analog and SRAM Circuits
Author*Yanfang Liu (Beihang University, China), Wei Xing (University of Sheffield, UK)
Pagepp. 386 - 391
KeywordYield Estimation, Yield Optimization, Importance Sampling, Conditional Normalizing Flow
AbstractYield optimization is one of the central challenges in submicrometer integrated circuit manufacturing. Classic yield optimization methods rely on importance sampling (IS) to provide efficient and robust yield estimation for each individual design. Despite its success, such an approach is still computationally expensive due to the large number of calculations required for many different designs. To resolve this challenge, we propose conditional importance sampling (CIS), which can approximate the optimal proposal distribution for any given design by leveraging the power of a modern deep-learning-based sampling method, the conditional normalizing flow. More importantly, CIS generalizes well to unseen designs and thus can deliver effective yield optimization with a small number of expensive simulations. To conduct yield optimization efficiently while accounting for credible uncertainty, we propose a novel Important Sampling Bayesian Optimization (ISBO) using a deep-warped gradient-boosting regression (GBR). The proposed method is extensively evaluated against five state-of-the-art baselines; the results show that it delivers superior performance: a speedup of 1.10x-10.46x (4.45x on average) with even higher-yield designs, an improvement of 1.1x-10x (4.44x on average) in terms of the Optimality-Cost Ratio, and, most importantly, excellent robustness and consistency in all our extensive experiments on analog and SRAM circuits.
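The importance-sampling idea underlying yield estimation (sample from a proposal concentrated on the failure region, then reweight by the likelihood ratio) can be sketched on a one-dimensional toy problem; the Gaussian model and threshold below are illustrative, not the paper's circuit setup or its normalizing-flow proposal.

```python
import math
import random

def failure_prob_is(threshold, n=100_000, seed=0):
    """Importance-sampling estimate of P(X > threshold) for X ~ N(0, 1).

    Samples from a proposal q = N(threshold, 1) centered on the failure
    region, and reweights each failing sample by the likelihood ratio
    p(x)/q(x) of the two unit-variance normals.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)            # draw from the proposal q
        if x > threshold:                        # indicator of "failure"
            w = math.exp(-0.5 * x * x + 0.5 * (x - threshold) ** 2)
            total += w                           # accumulate p(x)/q(x)
    return total / n

est = failure_prob_is(4.0)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))    # true tail probability
```

Plain Monte Carlo would need on the order of millions of samples to see even a handful of such rare failures, which is exactly why yield estimators shift the sampling distribution toward the failure region.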


Session 4C  Advances in Logic Synthesis and Optimization
Time: 9:00 - 10:15, Wednesday, January 24, 2024
Location: Room 206
Chair: Yung-Chih Chen (National Taiwan University of Science and Technology, Taiwan)

Best Paper Award
4C-1 (Time: 9:00 - 9:25)
TitleFineMap: A Fine-grained GPU-parallel LUT Mapping Engine
Author*Tianji Liu (The Chinese University of Hong Kong, Hong Kong), Lei Chen, Xing Li, Mingxuan Yuan (Huawei Noah's Ark Lab, Hong Kong SAR, Hong Kong), Evangeline F.Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 392 - 397
KeywordLUT Mapping, GPU-parallel, Technology Mapping
AbstractLookup-table (LUT) mapping is an indispensable step in FPGA design flows, and also serves as a building block in many technology-independent optimization algorithms. Therefore, it is crucial to accelerate LUT mapping in order to satisfy the demand for synthesizing high-quality, large-scale VLSI designs. Previous work on GPU LUT mapping suffers from low speedup due to limited degree of parallelism. In this paper, we propose an ultra-fast GPU-parallel LUT mapping engine named FineMap, which is composed of a novel fine-grained mapping phase with a high degree of parallelism, a parallel cut expansion phase and a parallel timing analysis pass. The mapping phase is enhanced by specifically tailored cut evaluation and memory management algorithms for GPUs that enable fast mapping of large circuits with limited GPU memory. Experiments show that compared with the high-performance mapper implemented in ABC, FineMap achieves 128.7x speedup with better quality in terms of area on large benchmarks.

4C-2 (Time: 9:25 - 9:50)
TitleTransduction Method for AIG Minimization
Author*Yukio Miyasaka (UC Berkeley, USA)
Pagepp. 398 - 403
KeywordArea Minimization, High-Effort Logic Synthesis, AIG Minimization, Truth Table, Combinational Logic Synthesis
AbstractDue to the recent hike in the cost of silicon wafers, area minimization is becoming increasingly important, which makes high-effort circuit optimization more attractive despite the additional runtime. In this paper, we revisit the transduction method originally proposed in the 1980s. The method computes don't-cares for nodes in the circuit and iteratively performs wire reduction and other transformations. Several novel variations of the transduction method are proposed, aiming at high-effort area minimization for and-inverter graphs (AIGs) with up to one thousand nodes. These variations are applied iteratively by a script, which also performs stochastic optimization with randomized parameters. The script has been used to minimize AIGs derived from the truth tables provided at the IWLS 2022 Programming Contest. In all cases, the resulting AIG sizes are the same or smaller compared to the best results produced by the contest participants.

4C-3 (Time: 9:50 - 10:15)
TitleIn Medio Stat Virtus: Combining Boolean and Pattern Matching
AuthorGianluca Radi, *Alessandro Tempia Calvino, Giovanni De Micheli (EPFL, Switzerland)
Pagepp. 404 - 410
KeywordTechnology mapping, Boolean matching, Pattern matching, Logic synthesis
AbstractTechnology mapping transforms a technology-independent representation into a technology-dependent one given a library of cells. This process is performed by means of local replacements that are extracted by matching sections of the subject graph to library cells. Matching techniques are classified mainly into pattern matching and Boolean matching; the two differ in the quality and number of generated matches, scalability, and run time. This paper proposes hybrid matching, a new methodology that integrates both techniques in a technology mapping algorithm. In particular, pattern matching is used to speed up the matching phase and support large cells, while Boolean matching is used to increase the number and quality of matches. Compared to Boolean matching, we show that hybrid matching yields average reductions in area and run time of 6% and 25%, respectively, with similar delay.
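The essence of Boolean matching, as opposed to structural pattern matching, can be made concrete with a tiny truth-table sketch. The cell names are hypothetical, and real matchers also canonicalize under input permutation and negation (NPN classes); here, two structurally different networks with the same truth table match the same cell.

```python
from itertools import product

def truth_table(fn, n):
    """Truth table of an n-input Boolean function as an integer bitmask."""
    tt = 0
    for i, inputs in enumerate(product((0, 1), repeat=n)):
        if fn(*inputs):
            tt |= 1 << i
    return tt

# Hypothetical two-input cell library keyed by truth table.
LIBRARY = {
    truth_table(lambda a, b: a & b, 2): "AND2",
    truth_table(lambda a, b: a | b, 2): "OR2",
    truth_table(lambda a, b: not (a & b), 2): "NAND2",
}

def boolean_match(fn, n):
    """Match a subject-graph function to a library cell by truth table."""
    return LIBRARY.get(truth_table(fn, n))

# De Morgan: NOT(OR(NOT a, NOT b)) is structurally unlike AND2 but
# functionally identical, so Boolean matching still finds the cell.
cell = boolean_match(lambda a, b: not ((not a) | (not b)), 2)
```

A purely structural pattern matcher would miss this match, which is exactly the quality gap that hybrid approaches aim to close at lower cost.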


Session 4D  Learning-Based Optimization for RF/Analog Circuit
Time: 9:00 - 10:15, Wednesday, January 24, 2024
Location: Room 207
Chair: Hung-Ming Chen (National Yang Ming Chiao Tung University, Taiwan)

Best Paper Candidate
4D-1 (Time: 9:00 - 9:25)
TitleA Transferable GNN-based Multi-Corner Performance Variability Modeling for Analog ICs
Author*Hongjian Zhou (ShanghaiTech University, China), Yaguang Li (Texas A&M University, USA), Xin Xiong, Pingqiang Zhou (ShanghaiTech University, China)
Pagepp. 411 - 416
KeywordAnalog circuit, Parametric yield, Performance modeling, Graph Neural Network, Transfer learning
AbstractPerformance variability is strongly nonlinear in analog ICs due to large process variations in advanced technologies. To capture such variability, a vast amount of data is required for accurate learning-based models. Moreover, yield estimation across multiple PVT corners further increases the data dimensionality. In this paper, we propose a graph neural network (GNN)-based performance variability modeling method. The key idea is to leverage GNN techniques to extract variation-related local mismatch in analog circuits, while data efficiency benefits from the ability to transfer knowledge among different PVT corners. Demonstrated on three circuits in a commercial 65nm CMOS process and compared with state-of-the-art modeling techniques, our method achieves higher modeling accuracy while using significantly less training data.

4D-2 (Time: 9:25 - 9:50)
TitleAn Efficient Transfer Learning Assisted Global Optimization Scheme for Analog/RF Circuits
Author*Zhikai Wang (Tsinghua University, China), Jingbo Zhou (Baidu Research, China), Xiaosen Liu, Yan Wang (Tsinghua University, China)
Pagepp. 417 - 422
Keywordanalog/RF circuits, sizing, transfer learning
AbstractOnline surrogate model-assisted evolutionary algorithms (SAEAs) are very efficient for analog/RF circuit optimization. To improve modeling accuracy and sizing results, we propose an efficient transfer-learning-assisted global optimization (TLAGO) scheme that transfers useful knowledge between neural networks to improve modeling accuracy in SAEAs. The novelty mainly lies in a novel transfer learning scheme, comprising a modeling strategy and a novel adaptive transfer learning network for high-accuracy modeling, together with a greedy strategy for balancing exploration and exploitation. With lower optimization time, TLAGO converges faster and achieves more than 8% better performance than GASPAD.

4D-3 (Time: 9:50 - 10:15)
TitleMACRO: Multi-agent Reinforcement Learning-based Cross-layer Optimization of Operational Amplifier
Author*Zihao Chen, Songlei Meng, Fan Yang, Li Shang, Xuan Zeng (Fudan University, China)
Pagepp. 423 - 428
Keywordoperational amplifier, topology design, parameter tuning, multi-agent reinforcement learning
AbstractThe optimization of operational amplifiers, including topology design and parameter tuning, is significantly challenging, given the high-dimensional and heterogeneous characteristics of the design space. This paper presents MACRO, a novel approach to operational amplifier design that employs multi-agent reinforcement learning for cross-layer optimization. We model the sequentially executed topology design and parameter tuning tasks as a Markov decision process, where the high-dimensional design space is effectively transformed into a series of manageable action spaces at each step. Two agents are meticulously tailored to specialize in these two distinct tasks respectively. The co-evolution of the agents is ensured by sharing design information and customizing the policy-gradient training method. Experimental results show that, compared with state-of-the-art methods, MACRO can produce superior-performing circuits while maintaining competitive design efficiency.


Session 4E  (SS-3) LLM Acceleration and Specialization for Circuit Design and Edge Applications
Time: 8:35 - 10:15, Wednesday, January 24, 2024
Location: Room 107/108
Chairs: Deming Chen (UIUC), Yao Chen (AUS)

4E-1 (Time: 8:35 - 9:00)
Title(Invited Paper) Applications of LLM for Chip Design
AuthorHaoxing (Mark) Ren (NVIDIA, USA)

4E-2 (Time: 9:00 - 9:25)
Title(Invited Paper) TinyChat for On-device LLM
AuthorSong Han (MIT, USA)

4E-3 (Time: 9:25 - 9:50)
Title(Invited Paper) FL-NAS: Towards Fairness of NAS for Resource Constrained Devices via Large Language Models
AuthorRuiyang Qin (University of Notre Dame, USA), Yuting Hu (University at Buffalo, USA), Zheyu Yan (University of Notre Dame, USA), *Jinjun Xiong (University at Buffalo, USA), Ahmed Abbasi, Yiyu Shi (University of Notre Dame, USA)
Pagepp. 429 - 434
Keywordneural architecture search, hardware efficiency, large language models, fairness
AbstractNeural Architecture Search (NAS) has become the de facto tool in industry for automating the design of deep neural networks for various applications, especially those driven by mobile and edge devices with limited computing resources. The emerging large language models (LLMs), owing to their prowess, have also recently been incorporated into NAS and show promising results. This paper explores this direction further by considering three important design metrics simultaneously: model accuracy, fairness, and hardware deployment efficiency. We propose a novel LLM-based NAS framework, FL-NAS, and show experimentally that FL-NAS can indeed find high-performing DNNs, beating state-of-the-art DNN models by orders of magnitude across almost all design considerations.

4E-4 (Time: 9:50 - 10:15)
Title(Invited Paper) Software/Hardware Co-design for LLM and Its Application for Design Verification
AuthorLily Jiaxin Wan, Yingbing Huang, Yuhong Li, Hanchen Ye, Jinghua Wang (UIUC, USA), Xiaofan Zhang (Google, USA), *Deming Chen (UIUC, USA)
Pagepp. 435 - 441
KeywordLarge Language Models, software/hardware codesign, functional verification
AbstractThe widespread adoption of Large Language Models (LLMs) is impeded by their demanding compute and memory resources. The first task of this paper is to explore optimization strategies to expedite LLMs, including quantization, pruning, and operation-level optimizations. One unique direction is to optimize LLM inference through novel software/hardware co-design methods. Given the accelerated LLMs, the second task of this paper is to study the performance of LLMs in circuit design and verification, with a particular emphasis on functional verification. Through automated prompt engineering, we harness the capabilities of an established LLM, GPT-4, to generate High-Level Synthesis (HLS) designs with predefined errors based on 11 open-source synthesizable HLS benchmark suites. The resulting dataset is a comprehensive collection of over 1,000 function-level designs, each afflicted with up to 45 distinct combinations of defects injected into the source code. This dataset, named Chrysalis, expands upon what is available in current HLS error models, offering a rich resource for training LLMs to debug code. The dataset can be accessed at: https://github.com/UIUC-ChenLab/Chrysalis-HLS.


Session 2K  Keynote Session II
Time: 10:30 - 11:30, Wednesday, January 24, 2024
Location: Room Premier A/B
Chairs: Kyu-Myung Choi (Seoul National University, Republic of Korea), Taewhan Kim (Seoul National University, Republic of Korea)

Title(Keynote Address) Present and Future Challenges of High-Bandwidth Memory
AuthorMyeong-Jae Park (SK Hynix, Republic of Korea)
AbstractHBM (High-Bandwidth Memory) is the best-performing memory product, used in high-end computing systems such as supercomputers and AI accelerators. The recent boom in machine learning is due in part to the development of computing technologies in which HBM, the fastest DRAM, played a key role in overcoming performance limitations. Achieving the high bandwidth of HBM requires various technologies and sophisticated design techniques. In particular, the complex structure of stacking a single logic die with 4 to 16 DRAM dies makes development more challenging. Since SK hynix developed its first HBM product in 2015, five generations of HBM products have been developed over the past nine years, and discussions are underway for the development of HBM4, the sixth generation. During this period, bandwidth has increased by 12 times and power efficiency has improved threefold. These performance improvements have been made possible by various design and process innovations, and given the rapid progress in AI technology, the continuous improvement of high-end memory products like HBM is essential. However, HBM now faces various technical challenges. Increasing bandwidth, power, and capacity on a small interposer is reaching its technical limits, and issues such as heat and reliability are becoming more serious. This keynote will present a broad overview of the latest developments in HBM, including current technical challenges from design to devices and packaging technologies, and will present future directions for HBM's development. Memory solutions beyond HBM, such as PiM and 3D solutions, will also be covered.


Session 5A  Bridging Memory, Storage, and Data Processing Techniques
Time: 13:00 - 14:40, Wednesday, January 24, 2024
Location: Room 204
Chair: Yuan-Hao Chang (Academia Sinica, Taiwan)

Best Paper Candidate
5A-1 (Time: 13:00 - 13:25)
Title: wearMeter: An Accurate Wear Metric for NAND Flash Memory
Author: *Min Ye (City University of Hong Kong, Hong Kong), Qiao Li (Xiamen University, China), Daniel Wen (YEESTOR Microelectronics Co., Ltd, China), Tei-Wei Kuo (National Taiwan University, Mohamed bin Zayed University of Artificial Intelligence, Taiwan), Chun Jason Xue (City University of Hong Kong, Hong Kong)
Page: pp. 442 - 447
Keyword: 3D NAND flash, accurate wear metric, wearMeter
Abstract: The program/erase (P/E) cycle is frequently used as a yardstick for the wear degree of flash memory. However, this metric has a significant accuracy limitation: after enduring the same number of P/E cycles, the wear degree of flash memory can vary due to factors such as ambient temperature and the dwell time between two P/E cycles. To reflect the true wear degree of flash memory, this paper proposes wearMeter, an accurate and consistent metric. The proposed metric, independent of wear sources, guarantees accurate offline wear measurement. Moreover, by comparison against the error-correcting capability of the ECC engine, it offers a straightforward assessment of a block's remaining reliability margin. Leveraging wearMeter, this paper further proposes a novel low-wear Single-Level Cell (SLC) mode, lSLC, which significantly reduces wear compared to the default SLC mode. Experiments show that lSLC endures over 3X the P/E cycles of the default SLC mode under the same conditions, without performance loss.

5A-2 (Time: 13:25 - 13:50)
Title: Overlapping Aware Zone Allocation for LSM Tree-Based Store on ZNS SSDs
Author: *Jingcheng Shen, Lang Yang, Linbo Long, Renping Liu, Zhenhua Tan (Chongqing University of Posts and Telecommunications, China), Congming Gao (Xiamen University, China), Yi Jiang (Chongqing University of Posts and Telecommunications, China)
Page: pp. 448 - 453
Keyword: ZNS SSD, LSM tree, key-value store, data placement
Abstract: NVMe Zoned Namespace (ZNS) devices partition the storage space into sequential-write zones, notably reducing the costs of address mapping, garbage collection (GC), and over-provisioning. Log-Structured Merge (LSM) tree-based databases convert random writes into sequential writes and can thus be handled efficiently by ZNS devices. Efficient zone-allocation methods play a pivotal role in maximizing the performance of LSM tree-based stores running on ZNS devices. However, existing zone-allocation methods incur high write-amplification factors due to inaccurate lifetime estimation based solely on the LSM-tree levels. To address this, this paper proposes an overlapping-aware zone-allocation method, termed OAZA, which efficiently selects suitable zones to place data. First, OAZA estimates the data lifetime by considering both the LSM-tree level of the data and the relative data hotness within the same tree level. Second, OAZA intelligently selects an appropriate zone to store the data based on the estimated lifetime. Experimental results demonstrate that OAZA outperforms two zone-allocation methods that correlate data lifetime merely with the tree level. Specifically, OAZA reduces the amount of GC-induced data copy by average factors of 2.7× and 1.7× in comparison to the two methods, respectively. Additionally, OAZA achieves an impressively low write-amplification factor of 1.1×, outperforming the factors of 1.2× and 1.3× achieved by the two compared methods.
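The allocation idea in the abstract (score a lifetime from the LSM-tree level plus within-level hotness, then place data in the zone whose residents have the closest lifetime) can be sketched as follows. This is an illustrative simplification, not OAZA's actual algorithm; the scoring function and all numbers are invented:

```python
def estimate_lifetime(level, hotness_rank, level_size):
    """Score an SSTable's expected lifetime (higher = lives longer).
    `hotness_rank` 0 = hottest within its level (invented scoring)."""
    return level * 100 + 100 * hotness_rank / max(1, level_size)

def pick_zone(zones, lifetime):
    """zones: zone id -> mean lifetime of data already placed in it.
    Allocate to the zone whose residents have the closest lifetime, so a
    whole zone tends to die together and GC copies less live data."""
    return min(zones, key=lambda z: abs(zones[z] - lifetime))

zones = {0: 50.0, 1: 150.0, 2: 320.0}
lt = estimate_lifetime(level=1, hotness_rank=2, level_size=4)
assert lt == 150.0
assert pick_zone(zones, lt) == 1   # closest-lifetime zone wins
```

Grouping data with similar lifetimes into the same zone is what reduces the GC-induced copying the abstract measures.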

5A-3 (Time: 13:50 - 14:15)
Title: Hardware-Software Co-Design of a Collaborative DNN Accelerator for 3D Stacked Memories with Multi-Channel Data
Author: *Tom Glint (IIT Gandhinagar, India), Manu Awasthi (Ashoka University, India), Joycee Mekie (IIT Gandhinagar, India)
Page: pp. 454 - 459
Keyword: co-design, DNN accelerators, neural networks
Abstract: Hardware accelerators are preferred over general-purpose processors for processing Deep Neural Networks (DNNs), as the latter suffer from power and memory walls. However, hardware accelerators designed as a logic chip separate from the memory still suffer from the memory wall. Processing-in-memory accelerators, which try to overcome this memory wall by building compute elements into the memory structures, are highly constrained by the memory manufacturing process. Near-data-processing (NDP) based hardware accelerator design is an alternative paradigm that can combine the high bandwidth and low access energy of processing-in-memory with the design flexibility of a separate logic chip. However, NDP has area, data-flow, and thermal constraints that hinder high-throughput designs. In this work, we propose an HBM3-based NDP accelerator that tackles these constraints with a hardware-software co-design approach. The proposed design takes only 50% of the area, delivers a 3x speed-up, and is about 6x more energy efficient than the state-of-the-art NDP hardware accelerator for inference workloads such as AlexNet, MobileNet, ResNet, and VGG, without loss of accuracy.

5A-4 (Time: 14:15 - 14:40)
Title: Bridge-NDP: Achieving Efficient Communication-Computation Overlap in Near Data Processing with Bridge Architecture
Author: *Liyan Chen, Jianfei Jiang, Qin Wang, Zhigang Mao, Naifeng Jing (Shanghai Jiao Tong University, China)
Page: pp. 460 - 465
Keyword: NDP, bridge buses, overlap
Abstract: Near data accelerators (NDAs) enable near data processing (NDP) within main memory, which benefits performance by providing more aggregate bandwidth and reducing long-distance data transfer. Most prior works focus on reaping higher internal bandwidth to improve the performance of the NDA itself. However, the overhead of interactive communication between host and NDAs is overlooked, and it has become the bottleneck of NDP systems. In this paper, we propose bridge-NDP, a novel NDP architecture that exploits existing memory buses as bridge buses to fully utilize bandwidth. With bridge access enabled by optimized bridge commands, bridge-NDP efficiently overlaps communication and computation. It can be applied to existing NDP systems regardless of the memory level the NDAs are attached to. For a variety of key computing kernels from machine learning, data analytics, etc., our evaluation shows that bridge-NDP speeds up not only the NDA performance itself (1.13×-3.62×) but also the host-NDA collaboration performance (2.43×-4.21×), achieving higher bandwidth utilization (1.12×-3.67× and 1.48×-4.13×) than the state-of-the-art NDP solution.


Session 5B  Exploring EDA’s Next Frontier: AI-Driven Innovative Design Methods
Time: 13:00 - 14:40, Wednesday, January 24, 2024
Location: Room 205
Chair: M. Hassan Najafi (University of Louisiana at Lafayette, USA)

5B-1 (Time: 13:00 - 13:25)
Title: Variational Label-Correlation Enhancement for Congestion Prediction
Author: *Biao Liu, Congyu Qiao, Ning Xu, Xin Geng, Ziran Zhu, Jun Yang (Southeast University, China)
Page: pp. 466 - 471
Keyword: Congestion Prediction
Abstract: As the complexity of Integrated Circuits (ICs) rises, accurate routing and congestion prediction, crucial for identifying early design flaws, become essential to expedite circuit design and conserve resources in the lengthy physical design process. Despite advancements in current congestion prediction methodologies, an essential aspect that has been largely overlooked is the spatial label-correlation between different grids in congestion prediction. The spatial label-correlation is a fundamental characteristic of circuit design, where the congestion status of a grid is not isolated but inherently influenced by the conditions of its neighboring grids. To fully exploit the inherent spatial label-correlation between neighboring grids, we propose VAriational Label-Correlation Enhancement for Congestion Prediction, a novel approach that considers the local label-correlation in the congestion map, associating the estimated congestion value of each grid with a local label-correlation weight influenced by its surrounding grids. Our approach leverages variational inference techniques to estimate this weight, thereby enhancing the regression model's performance by incorporating spatial dependencies. Experimental results validate the superior effectiveness of the approach on the publicly available ISPD2011 and DAC2012 benchmarks using the superblue circuits.

5B-2 (Time: 13:25 - 13:50)
Title: Fast Cell Library Characterization for Design Technology Co-Optimization Based on Graph Neural Networks
Author: Tianliang Ma, Zhihui Deng, Xuguang Sun, *Leilai Shao (Shanghai Jiaotong University, China)
Page: pp. 472 - 477
Keyword: Cell Library Characterization, Design Technology Co-Optimization, Graph Neural Networks, Drive Strength Interpolation
Abstract: Design technology co-optimization (DTCO) plays a critical role in achieving optimal power, performance, and area (PPA) for advanced semiconductor process development. Cell library characterization is essential in the DTCO flow, but traditional methods are time-consuming and costly. To overcome these challenges, we propose a graph neural network (GNN)-based machine learning model for rapid and accurate cell library characterization. Our model incorporates cell structures and demonstrates high prediction accuracy across various process-voltage-temperature (PVT) corners and technology parameters. Validation with 512 unseen technology corners and over one million test data points shows accurate predictions of delay, power, and input pin capacitance for 33 types of cells, with a mean absolute percentage error (MAPE) ≤ 0.95% and a speed-up of 100X compared with SPICE simulations. Additionally, we investigate system-level metrics such as worst negative slack (WNS), leakage power, and dynamic power using predictions obtained from the GNN-based model on unseen corners. Our model achieves precise predictions, with absolute error ≤3.0 ps for WNS, percentage errors ≤0.60% for leakage power, and ≤0.99% for dynamic power, when compared to the golden reference. With the developed model, we further propose a fine-grained drive strength interpolation methodology to enhance PPA for small-to-medium-scale designs, resulting in an approximately 1-3% improvement.

5B-3 (Time: 13:50 - 14:15)
Title: Automated Synthesis of Mixed-Signal ML Inference Hardware under Accuracy Constraints
Author: Kishor Kunal, Jitesh Poojary, S Ramprasath, Ramesh Harjani, *Sachin S. Sapatnekar (University of Minnesota Twin Cities, USA)
Page: pp. 478 - 483
Keyword: Mixed signal ML inference, ML inference hardware, analog computing, Low power computing
Abstract: Due to the inherent error tolerance of machine learning (ML) algorithms, many parts of the inference computation can be performed with adequate accuracy and low power under relatively low precision. Early approaches have used digital approximate computing methods to explore this space. Recent approaches using analog-based operations achieve power-efficient computation at moderate precision. This work proposes a mixed-signal optimization (MiSO) approach that optimally blends analog and digital computation for ML inference. Based on accuracy and power models, an integer linear programming formulation is used to optimize design metrics of analog/digital implementations. The efficacy of the method is demonstrated on multiple ML architectures.
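The analog/digital blending described above can be illustrated with a toy exhaustive search standing in for the paper's integer linear program: choose, per layer, the analog or digital implementation that minimizes total power while an overall accuracy-loss budget holds. All layer costs below are invented for illustration:

```python
from itertools import product

# Per-layer costs: (analog power, analog accuracy loss,
#                   digital power, digital accuracy loss) -- invented numbers.
layers = [
    (1.0, 0.8, 4.0, 0.1),
    (2.0, 1.5, 5.0, 0.2),
    (1.5, 0.6, 3.0, 0.1),
]
MAX_ACC_LOSS = 1.8   # overall accuracy constraint

best = None
for choice in product(("analog", "digital"), repeat=len(layers)):
    power = sum(l[0] if c == "analog" else l[2] for c, l in zip(choice, layers))
    loss = sum(l[1] if c == "analog" else l[3] for c, l in zip(choice, layers))
    if loss <= MAX_ACC_LOSS and (best is None or power < best[0]):
        best = (power, choice)

# Cheapest feasible blend: layers 1 and 3 analog, layer 2 digital.
assert best == (7.5, ("analog", "digital", "analog"))
```

An ILP solver replaces this enumeration when the number of layers makes 2^n choices impractical; the objective and constraint stay the same.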

5B-4 (Time: 14:15 - 14:40)
Title: LayNet: Layout Size Prediction for Memory Design Using Graph Neural Networks in Early Design Stage
Author: *Hye Rim Ji, Jong Seong Kim, Jung Yun Choi (Samsung Electronics, Republic of Korea), Jee Hyong Lee (Sungkyunkwan University, Republic of Korea)
Page: pp. 484 - 490
Keyword: Memory Design, Layout Size, Graph Neural Network
Abstract: In memory designs that adopt a full-custom design, accurately predicting the layout size of a circuit block is crucial for reducing design iterations. However, predicting the layout size is challenging due to the complex space sizes caused by wiring and layout-dependent effects between circuit elements. To address the challenge, we propose LayNet, a novel graph neural network model that predicts the layout size by constructing a weighted graph. We convert a circuit into a weighted graph to model the relationships between elements. By applying graph neural networks to the weighted graph, we can accurately predict the layout size. We also propose edge selection and hierarchical graph learning approaches to reduce memory usage and inference time for large circuit blocks. LayNet achieves state-of-the-art performance on 6300 pairs of circuits and layouts in industrial memory products. Specifically, it significantly reduces the mean absolute percentage error rate by 20.82%~88.17% for manually generated layouts and by 7.97%~73.39% for semi-auto-generated layouts, outperforming conventional approaches. Also, the edge selection and hierarchical graph learning approaches reduce memory usage by 140.85x and 238.10x for these two types of layouts, respectively, and inference time by 14.14x and 37.84x, respectively, while maintaining performance.


Session 5C  New Frontiers in Testing
Time: 13:00 - 14:40, Wednesday, January 24, 2024
Location: Room 206
Chair: Yung-Chih Chen (National Taiwan University of Science and Technology, Taiwan)

5C-1 (Time: 13:00 - 13:25)
Title: QcAssert: Quantum Device Testing with Concurrent Assertions
Author: *Hasini Dilanka Witharana, Daniel Volya, Prabhat Mishra (University of Florida, USA)
Page: pp. 491 - 496
Keyword: Quantum, Assertion-based Validation, Quantum Verification, Quantum Assertions
Abstract: Quantum devices are extremely noisy due to their inherent architecture. This can introduce errors or completely erase the information stored in qubits. High noise levels in a quantum device can lead to errors even when the quantum circuit is not buggy. Therefore, it is essential to verify that the noise level of the device is tolerable while running the quantum circuit. In this paper, we propose a quantum device testing framework using concurrent assertions. Specifically, we introduce a new type of assertion, QcAssert, which can run concurrently with the quantum circuit to ensure that the quantum device is working as expected. We demonstrate the effectiveness of QcAssert in dynamic device testing using a suite of popular quantum benchmarks, including Shor's factoring algorithm and Grover's search algorithm.

5C-2 (Time: 13:25 - 13:50)
Title: HybMT: Hybrid Meta-Predictor based ML Algorithm for Fast Test Vector Generation
Author: *Shruti Pandey, Jayadeva, Smruti R. Sarangi (Indian Institute of Technology, Delhi, India)
Page: pp. 497 - 502
Keyword: ATPG, ML, PODEM, stuck-at-fault, Neural Network
Abstract: ML models are increasingly being used to increase test coverage and decrease overall testing time. This field is still in its nascent stage, and until now no algorithms could match or outperform commercial tools in terms of speed and accuracy for large circuits. In this paper, we propose HybMT, an ATPG algorithm that finally breaks this barrier. Like related methods, we augment the classical PODEM algorithm, which uses recursive backtracking. We design a custom 2-level predictor that predicts the input net of a logic gate whose value needs to be set to ensure that the output is a given value (0 or 1). Our predictor chooses the output from among two first-level predictors, the most effective of which is a bespoke neural network that we designed. Compared to a popular, state-of-the-art commercial ATPG tool, HybMT shows an overall reduction of 56.6% in CPU time without compromising fault coverage on the EPFL benchmark circuits. HybMT also shows a speedup of 126.4% over the best ML-based algorithm while obtaining equal or better fault coverage on the EPFL benchmark circuits.

5C-3 (Time: 13:50 - 14:15)
Title: A Fast Test Compaction Method for Commercial DFT Flow Using Dedicated Pure-MaxSAT Solver
Author: *Zhiteng Chao (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/CASTEST Co., Ltd., China), Xindi Zhang (School of Computer Science and Technology, University of Chinese Academy of Sciences/State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China), Junying Huang (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Jing Ye (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/CASTEST Co., Ltd., China), Shaowei Cai (School of Computer Science and Technology, University of Chinese Academy of Sciences/State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China), Huawei Li, Xiaowei Li (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/CASTEST Co., Ltd., China)
Page: pp. 503 - 508
Keyword: static test compaction, Pure-MaxSAT, DFT, ATPG
Abstract: Minimizing the testing cost is crucial in the context of the design for test (DFT) flow. In our observation, the test patterns generated by commercial ATPG tools in test compression mode still contain redundancy. To tackle this obstacle, we propose a post-flow static test compaction method that utilizes a partial fault dictionary instead of a full fault dictionary, and leverages a dedicated Pure-MaxSAT solver to re-compact the test patterns generated by commercial ATPG tools. We also observe that commercial ATPG tools offer a more comprehensive selection of candidate patterns for compaction in the “n-detect” mode, leading to superior compaction efficacy. In experiments on ISCAS89, ITC99, and open-source RISC-V CPU benchmarks, our method achieves an average reduction of 21.58% and a maximum of 29.93% in test cycles evaluated by commercial tools while maintaining fault coverage. Furthermore, our approach demonstrates improved performance compared with existing methods.
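Static test compaction with a fault dictionary is, at its core, a covering problem: keep the fewest patterns that still detect every fault the original set detected. A greedy baseline (not the paper's Pure-MaxSAT formulation; pattern and fault names are invented) can be sketched as:

```python
def compact(patterns):
    """Greedy static compaction: `patterns` maps a test-pattern id to the
    set of faults it detects (a partial fault dictionary). Keep patterns
    until every fault detected by the original set stays covered."""
    all_faults = set().union(*patterns.values())
    kept, covered, remaining = [], set(), dict(patterns)
    while covered != all_faults:
        # pick the pattern that detects the most not-yet-covered faults
        best = max(remaining, key=lambda p: len(remaining[p] - covered))
        kept.append(best)
        covered |= remaining.pop(best)
    return kept

patterns = {
    "t0": {"f1", "f2"},
    "t1": {"f2", "f3"},
    "t2": {"f1", "f2", "f3"},   # makes t0 and t1 redundant
}
assert compact(patterns) == ["t2"]
```

A MaxSAT formulation solves the same covering problem exactly rather than greedily, which is where the paper's re-compaction gains come from.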

5C-4 (Time: 14:15 - 14:40)
Title: A Dynamic Testing Scheme for Resistive-Based Computation-In-Memory Architectures
Author: *Sina Bakhtavari Mamaghani, Priyanjana Pal, Mehdi Baradaran Tahoori (Karlsruhe Institute of Technology, Germany)
Page: pp. 509 - 514
Keyword: Memory testing, Magnetic random access memory (MRAM), Resistive random access memory (ReRAM), Computation-In-Memory (CIM)
Abstract: Computation-in-memory (CIM) is a promising solution to tackle the memory wall problem in big data and artificial intelligence applications. One possible approach to implement such a scheme is to use nonvolatile resistive memory technologies like spin transfer torque magnetic RAM (STT-MRAM) or resistive RAM (ReRAM). However, despite all the attractive features these technologies offer, they introduce new types of defects different from those of conventional SRAM technologies. Therefore, there is a need for dedicated testing algorithms that can detect such defects. In this paper, we propose a testing scheme for CIM-capable memories that utilizes trim circuitry to dynamically switch between standard memory testing and CIM testing modes based on speed and accuracy requirements, eliminating unnecessary testing overheads. This feature provides significant test-time reduction while preserving the quality of the test. The proposed method is compatible with existing memory built-in self-test (MBIST) architecture and can be used for different types of emerging resistive memory technologies.


Session 5D  FPGA-Based Neural Network Accelerator Designs and Applications
Time: 13:00 - 14:40, Wednesday, January 24, 2024
Location: Room 207
Chair: Fan Chen (Indiana University Bloomington, USA)

5D-1 (Time: 13:00 - 13:25)
Title: SWAT: An Efficient Swin Transformer Accelerator Based on FPGA
Author: *Qiwei Dong, Xiaoru Xie, Zhongfeng Wang (Nanjing University, China)
Page: pp. 515 - 520
Keyword: Swin Transformer, sparsity, dataflow, FPGA
Abstract: Swin Transformer achieves greater efficiency than Vision Transformer by utilizing local self-attention and shifted windows. However, existing hardware accelerators designed for Transformers have not been optimized for the unique computation flow and data-reuse property of Swin Transformer, resulting in lower hardware utilization and extra memory accesses. To address this issue, we develop SWAT, an efficient Swin Transformer accelerator based on FPGA. Firstly, to eliminate the redundant computations in shifted windows, a novel tiling strategy is employed, which helps the developed multiplier array fully utilize the sparsity. Additionally, we deploy a dynamic pipeline interleaving dataflow, which not only reduces the processing latency but also maximizes data reuse, thereby decreasing memory accesses. Furthermore, customized quantization strategies and approximate calculation of non-linear functions are adopted to simplify the hardware complexity with negligible network accuracy loss. We implement SWAT on the Xilinx Alveo U50 platform and evaluate it with Swin-T on the ImageNet dataset. The proposed architecture achieves improvements of 2.02x-3.11x in power efficiency compared to existing Transformer accelerators on FPGAs.

5D-2 (Time: 13:25 - 13:50)
Title: TransFRU: Efficient Deployment of Transformers on FPGA with Full Resource Utilization
Author: *Hongji Wang, Yueyin Bai, Jun Yu, Kun Wang (Fudan University, China)
Page: pp. 521 - 526
Keyword: Transformer, Self-attention, FPGA, DSP
Abstract: Transformer-based models have achieved huge success in various artificial intelligence (AI) tasks, e.g., natural language processing (NLP) and computer vision (CV). However, transformer-based models always suffer from high computation density, making them hard to deploy on resource-constrained devices like field-programmable gate arrays (FPGAs). Among the overall process of transformers, self-attention contributes most of the computation load and becomes the bottleneck of transformer-based models. In this paper, we propose TransFRU, a novel FPGA-based accelerator for the self-attention mechanism with full utilization of hardware resources. Specifically, we first leverage 4-bit and 8-bit processing elements (PEs) to package multiple signed multiplications into one DSP block. Second, we skip the zero and near-zero values in the intermediate result of self-attention by a sorting engine. The sorting engine is also responsible for operand sharing to boost the computation efficiency of one DSP block. Experimental results show that TransFRU achieves 7.86-49.16× speedup and 151.1× better energy efficiency compared with a CPU, and 1.41× speedup and 5.9× better energy efficiency compared with a GPU. Furthermore, we observe 1.91-13.56× better throughput per DSP block and 3.53-9.62× better energy efficiency compared with previous FPGA accelerators.
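The DSP-packing trick of computing several low-bit multiplications with one wide multiplier can be sketched for unsigned operands as follows. Real DSP packing, including TransFRU's signed 4/8-bit variant, additionally needs correction logic for signed values, which is omitted here:

```python
def packed_mul(a, w1, w2, shift=16):
    """Compute a*w1 and a*w2 (unsigned 4-bit operands) with a single wide
    multiplication, mimicking how one DSP block is shared. `shift` leaves
    guard bits so the two products (each at most 8 bits) cannot overlap."""
    p = a * ((w1 << shift) | w2)                 # one multiply on the packed operand
    return p >> shift, p & ((1 << shift) - 1)    # (a*w1, a*w2) from the bit fields

# Exhaustive check over all 4-bit operands.
for a in range(16):
    for w1 in range(16):
        for w2 in range(16):
            assert packed_mul(a, w1, w2) == (a * w1, a * w2)
```

Sharing the multiplicand `a` across both products is exactly the kind of operand sharing the abstract attributes to the sorting engine.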

5D-3 (Time: 13:50 - 14:15)
Title: Booth-NeRF: An FPGA Accelerator for Instant-NGP Inference with Novel Booth-Multiplier
Author: *Zihang Ma, Zeyu Li, Yuanfang Wang, Yu Li, Jun Yu, Kun Wang (Fudan University, China)
Page: pp. 527 - 532
Keyword: FPGA, Hardware Accelerator, NeRF
Abstract: Instant-NGP is the state-of-the-art (SOTA) algorithm of Neural Radiance Field (NeRF) and has the potential to be used in AR/VR. However, the high cost of memory and computation limits Instant-NGP’s implementation on edge devices. In light of this, we propose a novel FPGA-based accelerator design to reduce power consumption, called Booth-NeRF. Booth-NeRF adopts a fully-pipelined technique and is built upon the Booth algorithm. Booth-NeRF introduces a new instruction set to accommodate Multi-Layer Perceptrons (MLPs) of different sizes, ensuring flexibility and efficiency. Moreover, we introduce a novel FPGA-friendly multiplier architecture for matrix multiplication, capable of performing exact or approximate multiplication using the Booth algorithm and select-shift-add technique. Evaluations on a Xilinx Kintex XC7K325T board show that Booth-NeRF achieves a 2.20x speed improvement and 1.31x energy efficiency compared with the NVIDIA Jetson Xavier NX-16G GPU.
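The select-shift-add Booth technique the abstract mentions can be sketched as a radix-4 (modified) Booth multiplier: the multiplier operand is scanned in overlapping 3-bit groups, and each group selects a partial product from {0, ±x, ±2x} that is shifted and accumulated. This is a generic software model of the arithmetic, not Booth-NeRF's hardware architecture:

```python
import random

def booth_radix4_multiply(x, y, bits=16):
    """Signed multiply via radix-4 (modified) Booth recoding of y."""
    y_u = y & ((1 << bits) - 1)   # two's-complement bit pattern of y
    y_ext = y_u << 1              # implicit 0 appended below the LSB
    # 3-bit group -> Booth digit in {0, +-1, +-2} (the "select" step)
    digit_of = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
    acc = 0
    for i in range(0, bits, 2):
        group = (y_ext >> i) & 0b111
        acc += (digit_of[group] * x) << i   # shift-add the partial product
    return acc

assert booth_radix4_multiply(123, -456) == 123 * -456
rng = random.Random(1)
for _ in range(500):
    a, b = rng.randint(-2**15, 2**15 - 1), rng.randint(-2**15, 2**15 - 1)
    assert booth_radix4_multiply(a, b) == a * b
```

Radix-4 recoding halves the number of partial products versus a plain shift-add multiplier, which is what makes it attractive for compact FPGA multipliers.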

5D-4 (Time: 14:15 - 14:40)
Title: ACane: An Efficient FPGA-based Embedded Vision Platform with Accumulation-as-Convolution Packing for Autonomous Mobile Robots
Author: *Jinho Yang, Sungwoong Yune, Sukbin Lim, Donghyuk Kim, Joo-Young Kim (Korea Advanced Institute of Science and Technology, Republic of Korea)
Page: pp. 533 - 538
Keyword: Field Programmable Gate Array (FPGA), Convolutional Neural Network (CNN), Embedded Platform, Autonomous Mobile Robot, Digital Signal Processor (DSP)
Abstract: Convolutional Neural Networks (CNNs) have been extensively deployed on autonomous mobile robots in recent years, and embedded platforms based on field-programmable gate arrays (FPGAs) that involve digital signal processors (DSPs) effectively utilize low-precision quantization with DSP-packing methods to implement large CNN models. However, DSP-packing has a limitation in improving computation performance due to the zero bits that prevent bit contamination of output operands. In this paper, we propose ACane, a compact FPGA-based vision platform for autonomous mobile robots, based on a novel DSP-packing technique called accumulation-as-convolution packing, which effectively packs low-bit values into a single DSP, boosting convolution operations. It also applies optimized data mapping and dataflow to improve the computation parallelism of DSP-packing. ACane achieves the highest DSP efficiency (1.465 GOPS/DSP) and energy efficiency (361.8 GOPS/W), which are 1.98-8.32× and 4.03-25.5× higher, respectively, than state-of-the-art FPGA-based vision works.


Session 5E  (DF-3) AI/ML for Chip Design and EDA - Current Status and Future Perspectives from Diverse Views
Time: 13:00 - 14:40, Wednesday, January 24, 2024
Location: Room 107/108
Organizer: Changho Han (Kumoh National Institute of Technology, Republic of Korea), Chair: Kyumyung Choi (Seoul National University, Republic of Korea)

5E-1 (Time: 13:00 - 13:25)
Title: (Designers' Forum) AI for Chip Design & EDA: Everything, Everywhere, All at Once (?)
Author: David Pan (University of Texas at Austin, USA)
Abstract: AI for chip design and EDA has received tremendous interest from both academia and industry in recent years. It touches everything that chip designers care about, from power/performance/area (PPA) to cost/yield, turn-around time, and security, among others. It is everywhere, in all levels of design abstraction, testing, verification, DFM, and mask synthesis, for digital designs as well as some aspects of analog/mixed-signal/RF design. It has also been used to tweak the overall design flow, for hyper-parameter tuning, etc., but not yet all at once, e.g., generative AI from design specification all the way to layout in a correct-by-construction manner. In this talk, I will cover some recent advancements and breakthroughs in AI for chip design/EDA and share my perspectives.

5E-2 (Time: 13:25 - 13:50)
Title: (Designers' Forum) How Engineers can Leverage AI Solutions in Chip Design
Author: Erick Chao (Cadence Design Systems, Taiwan)
Abstract: Integrating AI solutions into chip design can offer significant benefits in optimizing performance, power, area, and productivity. This integration can be approached from multiple angles, including EDA (Electronic Design Automation) research and development and the end user's perspective. This talk presents an overview of how engineers can leverage AI solutions in chip design.

5E-3 (Time: 13:50 - 14:15)
Title: (Designers' Forum) AI/ML Empowered Semiconductor Memory Design: An Industry Vision
Author: Hyojin Choi (Samsung Electronics, Republic of Korea)
Abstract: In this talk, we delve into the transformative realm of AI/ML-empowered semiconductor memory design and manufacturing, with a keen focus on memory products within the semiconductor industry. We navigate the synergistic integration of artificial intelligence and machine learning in streamlining the design and manufacturing processes of semiconductor memory. Our vision embraces a future where AI/ML technologies seamlessly harmonize efficiency and product quality. Join us as we present a visionary perspective on how AI/ML technologies are reshaping semiconductor memory design and manufacturing, propelling the industry towards an era of unparalleled advancement and sustainable growth.

5E-4 (Time: 14:15 - 14:40)
Title: (Designers' Forum) ML for Computational Lithography: What Will Work and What Will Not?
Author: Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea)
Abstract: ML has been studied extensively for computational lithography processes including OPC, assist features, lithography modeling, hotspot detection, and test patterns. This talk will review some of these while focusing on best practices for industrial applications, e.g., hybrid ML and standard algorithmic approaches, synthesis of test data for higher coverage, etc.


Session 6A  Enabling Techniques to Make CIM Small and Flexible
Time: 15:00 - 16:40, Wednesday, January 24, 2024
Location: Room 204
Chair: Qiao Li (Xiamen University, China)

6A-1 (Time: 15:00 - 15:25)
Title: OSA-HCIM: On-The-Fly Saliency-Aware Hybrid SRAM CIM with Dynamic Precision Configuration
Author: *Yung-Chin Chen (National Taiwan University/Keio University, Taiwan), Shimpei Ando, Daichi Fujiki (Keio University, Japan), Shinya Takamaeda-Yamazaki (The University of Tokyo, Japan), Kentaro Yoshioka (Keio University, Japan)
Page: pp. 539 - 544
Keyword: Computing-in-Memory (CIM), saliency, hybrid CIM (HCIM)
Abstract: Computing-in-Memory (CIM) has shown great potential for enhancing the efficiency and performance of deep neural networks (DNNs). However, the lack of flexibility in CIM leads to an unnecessary expenditure of computational resources on less critical operations, and a diminished Signal-to-Noise Ratio (SNR) when handling more complex tasks, significantly hindering the overall performance. Hence, we focus on the integration of CIM with Saliency-Aware Computing, a paradigm that dynamically tailors computing precision based on the importance of each input. We propose On-the-fly Saliency-Aware Hybrid CIM (OSA-HCIM), offering three primary contributions: (1) an On-the-fly Saliency-Aware (OSA) precision configuration scheme, which dynamically sets the precision of each MAC operation based on its saliency; (2) a Hybrid CIM Array (HCIMA), which enables simultaneous operation of digital-domain CIM (DCIM) and analog-domain CIM (ACIM) via split-port 6T SRAM; and (3) an integrated framework combining OSA and HCIMA to fulfill diverse accuracy and power demands. Implemented in a 65nm CMOS process, OSA-HCIM demonstrates an exceptional balance between accuracy and resource utilization. Notably, it is the first CIM design to incorporate a dynamic digital-to-analog boundary, providing unprecedented flexibility for saliency-aware computing. OSA-HCIM achieves a 1.95x enhancement in energy efficiency while maintaining minimal accuracy loss compared to DCIM when tested on the CIFAR100 dataset.

6A-2 (Time: 15:25 - 15:50)
Title: BFP-CIM: Data-Free Quantization with Dynamic Block-Floating-Point Arithmetic for Energy-Efficient Computing-In-Memory-based Accelerator
Author: *Cheng-Yang Chang, Chi-Tse Huang, Yu-Chuan Chuang, Kuang-Chao Chou, An-Yeu (Andy) Wu (Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan)
Page: pp. 545 - 550
Keyword: Computing-in-memory, deep learning, data-free quantization, block-floating-point arithmetic
Abstract: Convolutional neural networks (CNNs) are known for their exceptional performance in various applications; however, their energy consumption during inference can be substantial. Analog Computing-In-Memory (CIM) has shown promise in enhancing the energy efficiency of CNNs, but the use of analog-to-digital converters (ADCs) remains a challenge. ADCs convert analog partial sums from CIM crossbar arrays to digital values, with high-precision ADCs accounting for over 60% of the system’s energy. Researchers have explored quantizing CNNs to use low-precision ADCs to tackle this issue, trading off accuracy for efficiency. However, these methods necessitate data-dependent adjustments to minimize accuracy loss. Instead, we observe that the first most significant toggled bit indicates the optimal quantization range for each input value. Accordingly, we propose a range-aware rounding (RAR) for runtime bit-width adjustment, eliminating the need for pre-deployment efforts. RAR can be easily integrated into a CIM accelerator using dynamic block-floating-point arithmetic. Experimental results show that our methods maintain accuracy while achieving up to 2.51× and 2.27× energy efficiency improvements on CIFAR-10 and ImageNet datasets, respectively, compared with state-of-the-art techniques.
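The observation that the first most significant toggled bit indicates the quantization range can be sketched as a runtime rounding rule. The function below is a simplified stand-in for the paper's range-aware rounding (RAR), with invented bit widths, not the exact published scheme:

```python
def rar_quantize(x, total_bits=8, adc_bits=4):
    """Keep only `adc_bits` bits starting at the first most significant
    toggled bit of x, rounding to nearest on the discarded low bits."""
    if x == 0:
        return 0
    msb = x.bit_length() - 1            # position of the leading 1-bit
    drop = max(0, msb + 1 - adc_bits)   # low-order bits outside the window
    q = ((x + (1 << drop >> 1)) >> drop) << drop   # round, then truncate
    return min(q, (1 << total_bits) - 1)           # clamp rounding overflow

assert rar_quantize(0b00001011) == 0b00001011   # already fits in 4 bits: exact
assert rar_quantize(0b10110110) == 0b10110000   # only the top 4 bits survive
```

Small inputs pass through losslessly while large inputs are coarsened, which is how a low-precision ADC can cover the full dynamic range without data-dependent calibration.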

6A-3 (Time: 15:50 - 16:15)
Title: A Heuristic and Greedy Weight Remapping Scheme with Hardware Optimization for Irregular Sparse Neural Networks Implemented on CIM Accelerator in Edge AI Applications
Author: *Lizhou Wu, Chenyang Zhao (State Key Laboratory of ASIC and System, Fudan University, China), Jingbo Wang (School of Mechanical Engineering, Tongji University, China), Xueru Yu, Shoumian Chen, Chen Li (Shanghai Integrated Circuits R&D Center Co., Ltd., China), Jun Han, Xiaoyong Xue, Xiaoyang Zeng (State Key Laboratory of ASIC and System, Fudan University, China)
Page: pp. 551 - 556
Keyword: computing-in-memory, genetic algorithm, weight remapping, irregular sparsity, zero-skipping
Abstract: Computing-in-memory (CIM) is a promising technique for hardware acceleration of neural networks (NNs) with high performance and efficiency. However, the conventional dense mapping scheme cannot well support the compression and optimization of irregular sparse NNs. In this paper, we propose a heuristic and greedy weight remapping scheme for irregular sparse neural networks implemented on a CIM accelerator in edge AI applications. A genetic algorithm (GA) is utilized, for the first time, in the column shuffle for sparse weight remapping. Combined with granularity exploration of the CIM, the proportion of compressible all-zero rows increases remarkably. A greedy algorithm is then employed to planarize the unevenly compressed units, thus improving the storage utilization of the crossbar. For hardware optimization, the pipeline is customized with a zero-skipping circuit to leverage bit-level activation sparsity at runtime. Our results show that the proposed remapping scheme achieves a 70%-94% utilization rate of the sparsity, and an average 1.3× improvement over naive compression. The co-optimized CIM achieves 3-7.6× speedup and 2.1-4.8× energy efficiency compared with the baseline for dense NNs.
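The effect of column shuffling on compressible all-zero rows can be shown on a toy matrix. The sketch below exhausts all permutations instead of running the paper's genetic algorithm, and the operation-unit width `group` is an assumed parameter; it only illustrates the fitness being optimized.

```python
from itertools import permutations

def compressible_segments(matrix, col_order, group):
    """Count all-zero row segments when the reordered columns are split
    into operation units of `group` columns each."""
    count = 0
    for start in range(0, len(col_order), group):
        cols = col_order[start:start + group]
        count += sum(all(row[c] == 0 for c in cols) for row in matrix)
    return count

def best_shuffle(matrix, group):
    """Exhaustive stand-in for a GA-based column shuffle (tiny inputs only)."""
    n = len(matrix[0])
    return max((list(p) for p in permutations(range(n))),
               key=lambda order: compressible_segments(matrix, order, group))
```

For `[[1,0,1,0],[0,1,0,1]]` with unit width 2, the identity order yields no compressible segments, while grouping columns (0,2) and (1,3) yields two.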

6A-4 (Time: 16:15 - 16:40)
Title: PRIMATE: Processing in Memory Acceleration for Dynamic Token-pruning Transformers
Author: *Yue Pan, Minxuan Zhou (University of California, San Diego, USA), Chonghan Lee, Zheyu Li, Rishika Kushwah, Vijaykrishnan Narayanan (Pennsylvania State University, USA), Tajana Rosing (University of California, San Diego, USA)
Page: pp. 557 - 563
Keyword: Processing in memory, Transformer, Token pruning
Abstract: Attention-based models such as Transformers represent the state of the art for various machine learning (ML) tasks. Their superior performance is often overshadowed by substantial memory requirements and low data reuse opportunities. Processing in Memory (PIM) is a promising solution to accelerate Transformer models due to its massive parallelism, low data movement costs, and high memory bandwidth utilization. Existing PIM accelerators lack support for algorithmic optimizations like dynamic token pruning that can significantly improve the efficiency of Transformers. We identify two challenges to enabling dynamic token pruning on PIM-based architectures: the lack of an in-memory top-k token selection mechanism and the memory under-utilization problem from pruning. To address these challenges, we propose PRIMATE, a software-hardware co-design PIM framework based on High Bandwidth Memory (HBM). We introduce minor hardware modifications to conventional HBM to enable Transformer model computation and top-k selection. For software, we introduce a pipelined mapping scheme and an optimization framework for maximum throughput and efficiency. PRIMATE achieves 30.6x improvement in throughput, 29.5x improvement in space efficiency, and 4.3x better energy efficiency compared to the current state-of-the-art PIM accelerator for Transformers.
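Dynamic token pruning keeps only the top-k scoring tokens at each step. A software stand-in for the in-memory top-k selection described above (the function names are illustrative, not PRIMATE's interface):

```python
import heapq

def topk_indices(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    return sorted(heapq.nlargest(k, range(len(scores)), key=scores.__getitem__))

def prune_tokens(tokens, scores, k):
    """The pruning step: drop every token outside the top-k."""
    keep = topk_indices(scores, k)
    return [tokens[i] for i in keep]
```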


Session 6B  Emerging Trends in Hardware Design: from Biochips to Quantum Systems
Time: 15:00 - 17:05, Wednesday, January 24, 2024
Location: Room 205
Chair: Hiromitsu Awano (Kyoto University, Japan)

6B-1 (Time: 15:00 - 15:25)
Title: Adaptive Control-Logic Routing for Fully Programmable Valve Array Biochips Using Deep Reinforcement Learning
Author: *Huayang Cai, Genggeng Liu, Wenzhong Guo (Fuzhou University, China), Zipeng Li (Silicon Engineering Group, Apple Inc., Cupertino, CA, USA), Tsung-Yi Ho (The Chinese University of Hong Kong, Hong Kong), Xing Huang (Northwestern Polytechnical University, China)
Page: pp. 564 - 569
Keyword: control logic, fully programmable valve array, deep reinforcement learning, timing synchronization
Abstract: With the increasing integration level of flow-based microfluidics, fully programmable valve arrays (FPVAs) have emerged as the next generation of microfluidic devices. Microvalves in an FPVA are typically managed by a control logic, where valves are connected to a core input via control channels to receive control signals that guide state switching. Critical valves in a specific bioassay, however, need to be switched simultaneously, since asynchronous actuation leads to chip malfunctions. As a result, the channel lengths from the core input to these valves are required to be equal or similar, which poses a challenge to the channel routing of the control logic. To solve this problem, we propose a deep reinforcement learning-based adaptive routing flow for the control logic of FPVAs. With the proposed routing flow, an efficient control channel network can be automatically constructed to realize accurate control-signal propagation. Meanwhile, the timing skews among synchronized valves and the total length of control channels can be minimized, thus generating an optimized control logic with excellent timing behavior. Simulation results on multiple benchmarks demonstrate the effectiveness of the proposed routing flow.

6B-2 (Time: 15:25 - 15:50)
Title: Towards Automated Testing of Multiplexers in Fully Programmable Valve Array Biochips
Author: Genggeng Liu, *Yuqin Zeng, Yuhan Zhu, Huayang Cai, Wenzhong Guo (Fuzhou University, China), Zipeng Li (Silicon Engineering Group, Apple Inc., Cupertino, CA, USA), Tsung-Yi Ho (The Chinese University of Hong Kong, Hong Kong), Xing Huang (Northwestern Polytechnical University, China)
Page: pp. 570 - 575
Keyword: fully programmable valve array, multiplexer, automated test pattern generation
Abstract: Fully Programmable Valve Array (FPVA) biochips have attracted much attention as a new generation of continuous-flow microfluidic platforms for biochemical experiment automation. With the increasing density of microvalves in FPVA biochips, the control system for managing the opening and closing of these valves has become more and more complex. To improve the scalability of biochips and reduce the number of control pins, a highly efficient control system using multiplexers and Boolean logic has been introduced in FPVA biochips. During the manufacturing and use of such systems, however, various faults such as channel blockage, channel leakage, and reliability issues caused by frequent valve switching can occur in the multiplexers. Accordingly, in this paper, we propose the first automated fault testing method for the multiplexers of FPVA control systems. The proposed method includes the following key techniques: 1) an automated test pattern generation algorithm based on integer linear programming and 2) an automated fault testing strategy based on image recognition technology. Experimental results on multiple benchmarks have shown that the proposed method can generate fewer test patterns, while achieving 100% fault coverage.

6B-3 (Time: 15:50 - 16:15)
Title: The Need for Speed: Efficient Exact Simulation of Silicon Dangling Bond Logic
Author: *Jan Drewniok, Marcel Walter (Technical University of Munich, Germany), Robert Wille (Technical University of Munich/Software Competence Center Hagenberg GmbH, Germany)
Page: pp. 576 - 581
Keyword: Emerging Technology, Simulation, Efficient
Abstract: The Silicon Dangling Bond (SiDB) logic platform, an emerging beyond-CMOS computational nanotechnology, is a promising competitor due to its ability to achieve integration density and clock speed values that are several orders of magnitude higher than current CMOS fabrication nodes. However, the exact physical simulation of SiDB layouts, which is an essential component of any design validation workflow, is computationally expensive. In this paper, we propose a novel algorithm called QuickExact, which aims to be both efficient and exact. To this end, we introduce three techniques, namely 1) Physically-informed Search Space Pruning, 2) Partial Solution Caching, and 3) Effective State Enumeration. Extensive experimental evaluations confirm that, compared to the state-of-the-art algorithm, the resulting approach leads to a substantial runtime advantage of more than a factor of 5000 on randomly generated layouts and more than a factor of 2000 on an established gate library.

Best Paper Candidate
6B-4 (Time: 16:15 - 16:40)
Title: Towards Multiphase Clocking in Single-Flux Quantum Systems
Author: *Rassul Bairamkulov, Giovanni De Micheli (EPFL, Switzerland)
Page: pp. 582 - 587
Keyword: Multiphase clocking, Single-Flux Quantum, Technology Mapping, Path Balancing, Constraint Programming
Abstract: Rapid single-flux quantum (RSFQ) is one of the most advanced superconductive electronics technologies. SFQ systems operate at tens of gigahertz with up to three orders of magnitude lower power than CMOS. In conventional SFQ systems, most gates require a clock signal. Each gate's fanins must have equal logic depth, necessitating the insertion of path-balancing (PB) DFFs and incurring a prohibitive area penalty. Multiphase clocking is an effective method for reducing the path-balancing overhead at the cost of reduced throughput. However, existing tools are not directly applicable to technology mapping of multiphase systems. To overcome this limitation, in this work, we propose a technology mapping tool for multiphase systems. Our contribution is threefold. First, we formulate multiphase path balancing as a Constraint Programming with Satisfiability (CP-SAT) problem to minimize the number of path-balancing DFFs within an asynchronous path. Second, we propose a method to identify independent datapaths within the network for more efficient processing. Finally, we integrate these methods into a technology mapping flow to convert a logic network into a multiphase SFQ circuit. In our case studies, by using four phases, the number of path-balancing DFFs is reduced by 30% compared to state-of-the-art single-phase technology mapping.
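A back-of-the-envelope model of why extra phases shrink the path-balancing overhead: if each of p clock phases can absorb one level of depth skew, the DFFs needed on an edge drop roughly as skew/p. This simplified model is an assumption made for illustration, not the paper's CP-SAT formulation.

```python
import math

def pb_dffs(depth_src, depth_dst, phases=1):
    """Path-balancing DFFs on one edge under the simplified assumption
    that each of `phases` clock phases absorbs one level of depth skew."""
    skew = depth_dst - depth_src
    return max(math.ceil(skew / phases) - 1, 0)
```

An edge skipping five logic levels needs four DFFs with a single phase but only one with four phases, which mirrors the direction of the reported 30% reduction.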

6B-5 (Time: 16:40 - 17:05)
Title: Algebraic and Boolean Methods for SFQ Superconducting Circuits
Author: *Alessandro Tempia Calvino, Giovanni De Micheli (EPFL, Switzerland)
Page: pp. 588 - 593
Keyword: superconducting electronics, RSFQ, logic synthesis, emerging technologies
Abstract: Rapid single-flux quantum (RSFQ) is one of the most advanced and promising superconducting logic families, offering exceptional energy efficiency and speed. RSFQ technology requires delay registers (DFFs) and splitter cells to satisfy the path-balancing and driving-capacity constraints. In this paper, we present a comprehensive exploration of methods for synthesizing and optimizing SFQ circuits. Our approach includes algebraic and Boolean optimization techniques that work on the xor-and graph (XAG) representation of combinational logic. Additionally, we introduce a technology mapping method to satisfy the path-balancing and fanout constraints while minimizing the area. Finally, we propose a synthesis flow for SFQ circuits. In the experimental results, we show an average reduction in the area and delay of 43% and 34%, respectively, compared to the state-of-the-art.


Session 6C  New Advances in Logic Locking and Side-Channel Analysis
Time: 15:00 - 17:05, Wednesday, January 24, 2024
Location: Room 206
Chair: Andy Yu-Guang Chen (National Central University, Taiwan)

6C-1 (Time: 15:00 - 15:25)
Title: LOOPLock 3.0: A Robust Cyclic Logic Locking Approach
Author: Pei-Pei Chen, Xiang-Min Yang, *Yu-Cheng He (National Tsing Hua University, Taiwan), Yung-Chih Chen (National Taiwan University of Science and Technology, Taiwan), Yi-Ting Li, Chun-Yao Wang (National Tsing Hua University, Taiwan)
Page: pp. 594 - 599
Keyword: Hardware security, Cyclic logic locking, Logic unlocking, SAT attack, LOOPLock 2.0
Abstract: Cyclic logic locking is a cutting-edge hardware security method developed to defend against the SAT attack. It introduces cycles into the original circuit, which can cause the circuit to either get trapped in an endless loop or generate incorrect outputs if an incorrect key is used. Recently, a new cyclic logic locking method called LOOPLock 2.0 was proposed. Its primary feature is that the circuit retains its cyclic structure regardless of whether the correct key vector is applied or not. However, LOOPLock 2.0 can still be successfully attacked using state-of-the-art locking-structure analysis. As a result, this paper presents a more robust cyclic logic locking approach, LOOPLock 3.0, to counteract state-of-the-art attacks. The experimental results validate the effectiveness of the proposed approach.

6C-2 (Time: 15:25 - 15:50)
Title: Logic Locking over TFHE for Securing User Data and Algorithms
Author: *Kohei Suemitsu, Kotaro Matsuoka, Takashi Sato, Masanori Hashimoto (Kyoto University, Japan)
Page: pp. 600 - 605
Keyword: secure computation, logic locking, LUT-based obfuscation, TFHE
Abstract: This paper proposes the application of logic locking over TFHE to protect both user data and algorithms, such as input user data and models in machine learning inference applications. With the proposed secure computation protocol, algorithm evaluation can be performed distributively on honest-but-curious user computers while keeping the algorithm secure. To achieve this, we combine conventional logic locking for untrusted foundries with TFHE to enable secure computation. By encrypting the logic locking key using TFHE, the key is protected with the same security level as TFHE. We implemented the proposed secure protocols for combinational logic neural networks and decision trees using LUT-based obfuscation. Regarding the security analysis, we subjected them to the SAT attack and evaluated their resistance based on the execution time. We successfully configured the proposed secure protocol to be resistant to the SAT attack in all machine learning benchmarks. Also, the experimental results show that the proposed secure computation incurs almost no TFHE runtime overhead in a test case with thousands of gates.

6C-3 (Time: 15:50 - 16:15)
Title: LIPSTICK: Corruptibility-Aware and Explainable Graph Neural Network-based Oracle-Less Attack on Logic Locking
Author: Yeganeh Aghamohammadi (University of California, Santa Barbara, USA), *Amin Rezaei (California State University, Long Beach, USA)
Page: pp. 606 - 611
Keyword: Logic Locking, Machine Learning, Graph Neural Networks, Explainability, Corruptibility
Abstract: In a zero-trust fabless paradigm, designers are increasingly concerned about hardware-based attacks on the semiconductor supply chain. Logic locking is a design-for-trust method that adds extra key-controlled gates in the circuits to prevent hardware intellectual property theft and overproduction. While attackers have traditionally relied on an oracle to attack logic-locked circuits, machine learning attacks have shown the ability to retrieve the secret key even without access to an oracle. In this paper, we first examine the limitations of state-of-the-art machine learning attacks and argue that the use of key Hamming distance as the sole model-guiding structural metric is not always useful. Then, we develop, train, and test a corruptibility-aware graph neural network-based oracle-less attack on logic locking that takes into consideration both the structure and the behavior of the circuits. Our model is explainable in the sense that we analyze what the machine learning model has interpreted in the training process and how it can perform a successful attack. Chip designers may find this information beneficial in securing their designs while avoiding incremental fixes.

6C-4 (Time: 16:15 - 16:40)
Title: Power Side-Channel Analysis and Mitigation for Neural Network Accelerators based on Memristive Crossbars
Author: *Brojogopal Sapui, Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany)
Page: pp. 612 - 617
Keyword: side-channel attack, In-memory computing, neural network, correlation power analysis, TVLA
Abstract: The adoption of Artificial Intelligence (AI) across industries such as big data, edge computing, automotive, and medical applications has increased tremendously. As functionalities grow, energy-efficient hardware for AI devices becomes crucial. To address this, Computation-in-Memory (CiM) using Non-Volatile Memories (NVMs) offers a promising solution. However, security is also an important concern in this computation paradigm. In this work, we analyze the vulnerability to power side-channel attacks of Multiply-Accumulate (MAC) operations implemented in a CiM architecture based on emerging NVMs. Our results show that peripheral devices such as Analog-to-Digital Converters (ADCs) leak much more sensitive information than the crossbar itself because of their significant power consumption. Therefore, we propose a circuit-level countermeasure based on hiding for the ADCs of the memristive CiM architecture to mitigate power attacks. The efficiency of our proposed countermeasure is shown by both attacks and leakage assessment methodologies using a maximum of one million measurement traces.
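Correlation power analysis, the attack class analyzed here, correlates a hypothetical Hamming-weight leakage with measured power traces for every key guess. A minimal sketch on a toy 4-bit key; the XOR leakage target and noiseless traces are assumptions made for illustration, not the paper's setup.

```python
def hamming_weight(x):
    return bin(x).count("1")

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def cpa_best_key(inputs, traces, key_space=range(16)):
    """Key guess whose hypothetical leakage correlates best with power;
    assumes power grows with the Hamming weight of the sensitive value."""
    def score(k):
        hypo = [hamming_weight(x ^ k) for x in inputs]
        return pearson(hypo, traces)
    return max(key_space, key=score)
```

With ideal traces the correct key correlates perfectly, which is why countermeasures based on hiding aim to flatten exactly this correlation.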

6C-5 (Time: 16:40 - 17:05)
Title: Modeling of Tamper Resistance to Electromagnetic Side-channel Attacks on Voltage-scaled Circuits
Author: *Kazuki Minamiguchi, Yoshihiro Midoh, Noriyuki Miura, Jun Shiomi (Osaka University, Japan)
Page: pp. 618 - 624
Keyword: Electromagnetic leakage, side-channel attack, voltage-scaled circuit design
Abstract: The threat of information leakage by Side-Channel Attacks (SCAs) using ElectroMagnetic (EM) leakage is becoming more and more prominent for crypto circuits. This paper models tamper resistance to EM SCAs on voltage-scaled crypto circuits. It is well known that if the supply voltage is downscaled, attackers need to acquire more EM traces to extract secret key information from crypto circuits, so crypto circuits can process more data safely. However, the supply voltage dependence is not fully studied. This paper thus first models the voltage dependence of the strength of the EM emission from crypto circuits. Then, this paper proposes a tamper-resistance model which analytically expresses the relationship between the supply voltage and the minimum traces to disclosure based on test vector leakage assessment. This helps optimize the trade-off between encryption performance and tamper resistance to information leakage. The proposed models are validated by measurement results using an Advanced Encryption Standard (AES) circuit with a 180-nm process technology.


Session 6D  Advanced Simulation and Modeling
Time: 15:00 - 17:05, Wednesday, January 24, 2024
Location: Room 207
Chairs: Zhou Jin (China University of Petroleum-Beijing, China), Xueqing Li (Tsinghua University, China)

Best Paper Award
6D-1 (Time: 15:00 - 15:25)
Title: SPIRAL: Signal-Power Integrity Co-Analysis for High-Speed Inter-Chiplet Serial Links Validation
Author: *Xiao Dong, Songyu Sun, Yangfan Jiang (Zhejiang University, China), Jingtong Hu (University of Pittsburgh, USA), Dawei Gao (Zhejiang University/Zhejiang ICsprout Semiconductor Co., Ltd., China), Cheng Zhuo (Zhejiang University/Key Laboratory of Collaborative Sensing and Autonomous Unmanned Systems of Zhejiang Province, China)
Page: pp. 625 - 630
Keyword: signal integrity, signal-power integrity co-analysis, inter-chiplet serial links, system validation
Abstract: Chiplets have recently emerged as a promising solution to achieving further performance improvements by breaking down complex processors into modular components that communicate through high-speed inter-chiplet serial links. However, the ever-growing on-package routing density and data rates of such serial links inevitably lead to more complex and worse signal and power integrity issues than in a large monolithic chip. This calls for efficient analysis and validation tools to support robust design. In this paper, a signal-power integrity co-analysis framework for high-speed inter-chiplet serial link validation named SPIRAL is proposed. The framework first builds equivalent models for the links with a machine learning-based transmitter model and an impulse response based model for the channel and receiver. Then, the signal-power integrity is co-analyzed with a pulse response based method using the equivalent models. Experimental results show that SPIRAL yields eye diagrams with 0.82-1.85% mean relative error, while achieving an 18-44x speedup compared to a commercial SPICE simulator.

6D-2 (Time: 15:25 - 15:50)
Title: Nested Dissection Based Parallel Transient Power Grid Analysis on Public Cloud Virtual Machines
Author: *Jiawen Cheng, Zhiqiang Liu, Lingjie Li, Wenjian Yu (Tsinghua University, China)
Page: pp. 631 - 637
Keyword: transient power grid analysis, distributed parallel computing, domain decomposition method, nested dissection, direct sparse solver
Abstract: Accurate and efficient transient analysis of power grids (PGs) poses a major computational challenge for modern integrated circuit design. In this work, we propose to leverage public cloud computing for PG transient analysis while preserving security. A multi-level distributed parallel LU factorization and forward/backward substitution approach based on nested dissection is then proposed to guarantee accuracy and robustness. Experimental results show that the proposed algorithm can achieve an average 2.06X speedup over NICSLU and 2.85X over the conventional domain decomposition method based parallel approach. It also exhibits good scalability, with up to 6.0X parallel speedup on large-scale PGs with 4 cloud computer nodes.

6D-3 (Time: 15:50 - 16:15)
Title: Efficient Sublogic-Cone-Based Switching Activity Estimation using Correlation Factor
Author: *Kexin Zhu (Tongji University, China), Runjie Zhang (Phlexing, China), Qing He (Tongji University, China)
Page: pp. 638 - 643
Keyword: power estimation, switching activity, probabilistic, reconvergence
Abstract: Switching activity is one of the key factors that determine digital circuits' power consumption. While gate-level simulations are too slow to support the average power analysis of modern design blocks (e.g., millions or even billions of gates) over a long period of time (e.g., millions of cycles), probabilistic methods provide a solution by using RTL simulation results and propagating the switching activity through the combinational logic. This work presents a sublogic-cone-based, probabilistic method for switching activity propagation in combinational logic circuits. We divide the switching activity estimation problem into two parts: incremental propagation (across the entire circuit) and accurate calculation (within the sublogic cones). To construct the sublogic cones, we first introduce a new metric called the correlation factor to quantify the impact induced by the correlations between signal nets; then we develop an efficient algorithm that uses the calculated correlation factor to guide the construction of sublogic cones. The experimental results show that our method produces 73.2% more accurate switching activity estimation results compared with the state-of-the-art method, while achieving a 19X speedup.
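Under the classic independence assumption, signal probabilities propagate through gates in closed form, and each net's switching activity follows from its one-probability; the correlation factor above targets exactly the cases (reconvergent fanout) where this assumption breaks. A textbook-style sketch of the baseline propagation, not the paper's method:

```python
def and_p(p1, p2):
    """P(out=1) of an AND gate, assuming independent inputs."""
    return p1 * p2

def or_p(p1, p2):
    """P(out=1) of an OR gate, assuming independent inputs."""
    return p1 + p2 - p1 * p2

def not_p(p):
    return 1.0 - p

def switching_activity(p_one):
    """Expected toggles per cycle under temporal independence: 2p(1-p)."""
    return 2.0 * p_one * (1.0 - p_one)
```

For example, an AND of two p=0.5 inputs has p=0.25 at its output and a toggle rate of 0.375 per cycle.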

6D-4 (Time: 16:15 - 16:40)
Title: ISOP-Yield: Yield-Aware Stack-Up Optimization for Advanced Package using Machine Learning
Author: *Hyunsu Chae, Keren Zhu (The University of Texas at Austin, USA), Bhyrav Mutnury (Dell Infrastructure Solutions Group, USA), Zixuan Jiang (The University of Texas at Austin, USA), Daniel de Araujo (Siemens EDA, USA), Douglas Wallace, Douglas Winterberg (Dell Infrastructure Solutions Group, USA), Adam Klivans, David Z. Pan (The University of Texas at Austin, USA)
Page: pp. 644 - 650
Keyword: yield aware design, interconnect design, Hyper-parameter optimization, ML surrogate model, manufacturing variation
Abstract: High-speed cross-chip interconnects and packaging are critical for the overall performance of modern heterogeneous integrated computing systems. Recent studies have developed automatic stack-up design optimization methods for high-density interconnect (HDI) printed circuit boards (PCBs). However, few have considered the impact of manufacturing variation and the resulting yield issues in high-volume manufacturing (HVM). In this paper, we propose a novel framework for automatic stack-up design, optimizing the interconnect performance with a given yield requirement. The proposed framework utilizes a smooth machine learning surrogate model with available gradients, employing a first-order Taylor expansion to approximate the output performance distribution. Experimental results demonstrate that our method effectively boosts the yield rate compared to the existing stack-up optimization framework. In addition, the proposed yield-aware algorithm shows an average of 49.96% efficiency improvement in yield-aware figures of merit compared to the state-of-the-art input noise-aware Bayesian optimization algorithm for high yield targets.
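The first-order Taylor trick maps Gaussian input variation to an approximate Gaussian on the output, from which yield against a lower spec follows in closed form. A generic sketch with a finite-difference gradient; the function, spec, and sigmas below are placeholders, not the paper's surrogate model:

```python
import math

def taylor_yield(f, x0, sigmas, spec, eps=1e-6):
    """P(f(x) >= spec) from a first-order Taylor expansion around x0,
    assuming independent Gaussian variation on each input."""
    mu = f(x0)
    var = 0.0
    for i, s in enumerate(sigmas):
        xp = list(x0)
        xp[i] += eps
        grad_i = (f(xp) - mu) / eps   # forward-difference gradient
        var += (grad_i * s) ** 2      # first-order variance propagation
    sigma = math.sqrt(var)
    z = (mu - spec) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Gaussian CDF
```

A single smooth surrogate evaluation per input dimension thus replaces a full Monte Carlo sweep when estimating yield inside the optimization loop.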

6D-5 (Time: 16:40 - 17:05)
Title: Physics-Informed Learning for EPG-Based TDDB Assessment
Author: Dinghao Chen, *Wenjie Zhu, Xiaoman Yang, Pengpeng Ren, Zhigang Ji, Hai-Bao Chen (Shanghai Jiao Tong University, China)
Page: pp. 651 - 656
Keyword: Time-dependent dielectric breakdown, Electric path generation, Partial differential equation, Physics-informed neural network
Abstract: Time-dependent dielectric breakdown (TDDB) is one of the important failure mechanisms for copper (Cu) interconnects. Many TDDB models based on different physics kinetics have been proposed in the past. Recently, a physics-based TDDB model built on the breakdown concept of electric path generation (EPG) has been proposed and has shown advantages over the widely accepted electrostatic field based TDDB assessment. However, determining the time-to-failure from this EPG based TDDB model requires solving a partial differential equation (PDE) with the time-consuming finite element method (FEM). In recent years, deep neural networks have been proposed to predict numerical solutions of PDEs. In this paper, we use a physics-informed neural network (PINN) to solve the diffusion equation of ions in an electric field extracted from the EPG based TDDB model. Continuous definite conditions and hard-constraint optimization methods are used to improve the performance of the PINN in terms of accuracy and speed. Compared with the FEM method, the proposed PINN method achieves a 100x speedup with less than 0.1% mean squared error.
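What a PINN is trained to do is drive the PDE residual to zero at sampled collocation points. For the 1D diffusion equation u_t = D·u_xx the residual can be checked numerically; the analytic solution below is a standard separable test case, an assumption for illustration rather than the paper's ion-diffusion setup.

```python
import math

def residual(u, x, t, D, h=1e-4):
    """|u_t - D*u_xx| at (x, t) via central differences; a PINN's physics
    loss is this residual, squared and averaged over collocation points."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / (h * h)
    return abs(u_t - D * u_xx)

# Exact solution of u_t = D*u_xx: u(x, t) = exp(-D*k^2*t) * sin(k*x)
D, k = 0.5, 2.0
u_exact = lambda x, t: math.exp(-D * k * k * t) * math.sin(k * x)
```

The exact solution leaves a residual near machine/discretization noise, while a function that does not satisfy the PDE (e.g. u = x²) leaves a residual of order one.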


Session 6E  (SS-4) Cutting-Edge Techniques for EDA in Analog/Mixed-Signal ICs
Time: 15:00 - 16:40, Wednesday, January 24, 2024
Location: Room 107/108
Chairs: Keren Zhu (The Chinese University of Hong Kong, Hong Kong), Zhaori Bi (Fudan University, China)

6E-1 (Time: 15:00 - 15:25)
Title: (Invited Paper) Toward End-to-End Analog Design Automation with ML and Data-Driven Approaches
Author: Supriyo Maji (The University of Texas at Austin, USA), Ahmet F. Budak (The University of Texas at Austin/Analog Devices, USA), Souradip Poddar, *David Z. Pan (The University of Texas at Austin, USA)
Page: pp. 657 - 664
Keyword: Analog Design Automation, Circuit Sizing, Layout Automation, Topology Selection, ML Algorithms
Abstract: Designing analog circuits poses significant challenges due to their knowledge-intensive nature and the diverse range of requirements. There has been limited success in achieving a fully automated framework for designing analog circuits. However, the advent of advanced machine learning algorithms is invigorating design automation efforts by enabling tools to replicate the techniques employed by experienced designers. In this paper, we aim to provide an overview of the recent progress in ML-driven analog circuit sizing and layout automation tool developments. In advanced technology nodes, layout effects must be considered during circuit sizing to avoid costly reruns of the flow. We will discuss the latest research in layout-aware sizing. In the end-to-end analog design automation flow, topology selection plays an important role, as the final performance depends on the choice of topology. We will discuss recent developments in ML-driven topology selection before delving into our vision of an end-to-end data-driven framework that leverages ML techniques to facilitate the selection of optimal topology from a library of topologies.

6E-2 (Time: 15:25 - 15:50)
Title: (Invited Paper) Reinforcing the Connection between Analog Design and EDA
Author: Kishor Kunal, Meghna Madhusudan, Jitesh Poojary, Ramprasath S, Arvind K. Sharma, Ramesh Harjani, *Sachin S. Sapatnekar (University of Minnesota, USA)
Page: pp. 665 - 670
Keyword: Analog design automation, Mismatch, MIMO receiver, ALIGN
Abstract: Building upon recent advances in analog electronic design automation (EDA), this paper discusses directions for reinforcing the connection between design and EDA, in order to develop solutions that are meaningful to designers. Two aspects, both related to bridging the gap between EDA and designers, are highlighted. The first discusses the use of test structures to generate meaningful characterized data to aid design automation, specifically understanding the impact of random, correlated, and systematic variations on the design of matched structures. Results on a recent test chip that analyzes these variations and their impact on EDA design choices will be presented. The second illustrates a design testcase that applies analog EDA techniques, using the ALIGN layout engine, to design an RF MIMO receiver, and describes how this experience has helped both in advancing the state of analog EDA and in building circuits with enhanced designer productivity.

6E-3 (Time: 15:50 - 16:15)
Title: (Invited Paper) A Study on Exploring and Exploiting the High-dimensional Design Space for Analog Circuit Design Automation
Author: *Ruiyu Lyu, Yuan Meng, Aidong Zhao, Zhaori Bi (Fudan University, China), Keren Zhu (The Chinese University of Hong Kong, China), Fan Yang, Changhao Yan (Fudan University, China), Dian Zhou (University of Texas at Dallas, USA), Xuan Zeng (Fudan University, China)
Page: pp. 671 - 678
Keyword: Analog Circuit Optimization, High-dimensional
Abstract: The escalating intricacy of analog circuits, compounded by the high-dimensional nature of the design space, introduces complexities in optimizing circuit performance. Since the evaluation cost, often through circuit simulation, is resource-intensive and time-consuming, it is crucial to obtain a feasible design with a decent Figure of Merit (FOM) value within a limited simulation budget. In this study, we conduct an in-depth review and analysis of cutting-edge exploration and exploitation techniques developed to address the intricacies encountered in analog circuit design automation. Moreover, to enable algorithmic comparisons and advance the state of the field, we provide benchmarks encompassing analog circuit netlists with high-dimensional design variables, which empower researchers to rigorously assess and refine their optimization algorithms, leading to enhanced efficacy and novel developments.

6E-4 (Time: 16:15 - 16:40)
Title: (Invited Paper) Performance-Driven Analog Layout Automation: Current Status and Future Directions
Author: Peng Xu, Jintao Li, Tsung-Yi Ho, Bei Yu, *Keren Zhu (The Chinese University of Hong Kong, Hong Kong)
Page: pp. 679 - 685
Keyword: Physical Design, Analog CAD, VLSI, EDA
Abstract: Optimizing circuit performance presents a pivotal challenge in the realm of automatic analog physical design. The intricacy of analog performance arises from its sensitivity to layout implementation, frequently lacking a viable approach for direct optimization. This talk begins with a comprehensive overview of the present challenges and the techniques currently in use. The emphasis will be on the recent advancements in employing black-box optimization for enhancing analog performance. Subsequently, we will delve into a detailed case study and analysis of post-layout performance distribution for a typical analog circuit. This study will showcase various layout implementations generated by the open-source analog layout generator, MAGICAL. Future directions will be discussed based on the case study.

Session 3K  Keynote Session III
Time: 18:15 - 19:15, Wednesday, January 24, 2024
Location: Grand Ballroom A/B
Chairs: Kyu-Myung Choi (Seoul National University, Republic of Korea), Taewhan Kim (Seoul National University, Republic of Korea)

3K-1 (Time: 18:15 - 19:15)
Title(Keynote Address) AI/ML and EDA: Current Status and Perspectives on the Future
AuthorAndrew B. Kahng (University of California San Diego, USA)
AbstractIn the six years since an ASP-DAC 2018 invited talk on “New Directions for Learning-Based IC Design Tools and Methodologies”, much has changed. Today, more than half of the research papers at leading EDA conferences involve applications of machine learning. Deep learning approaches have been claimed to achieve superhuman design outcomes. Infusion of AI/ML is seen in all facets of EDA, including the product offerings of the major EDA vendors. The latest wave: large language models and generative AI, everywhere. What will the next six years bring before we meet again at ASP-DAC 2030? This talk will reflect on the current status of AI/ML and EDA: (1) surprises and disappointments, (2) clearer understanding of where AI/ML can and cannot (yet) move the needle, and (3) fundamental challenges and needs. Some perspectives on the future will also be given, including: (1) newly low-hanging fruits, (2) oncoming singularities, and (3) how research, teaching and R&D in our field might evolve.

Thursday, January 25, 2024

Session 7A  System Performance and Debugging
Time: 9:00 - 10:15, Thursday, January 25, 2024
Location: Room 204
Chairs: Jing-Jia Liou (National Tsing Hua University, Taiwan), Shigeru Yamashita (Ritsumeikan University, Japan)

Best Paper Candidate
7A-1 (Time: 9:00 - 9:25)
TitleThe Optimal Quantum of Temporal Decoupling
Author*Niko Zurstraßen, Ruben Brandhofer, José Cubero-Cascante, Nils Bosbach (RWTH Aachen University, Germany), Lukas Jünger (MachineWare GmbH, Germany), Rainer Leupers (RWTH Aachen University, Germany)
Pagepp. 686 - 691
KeywordSystemC, gem5, Temporal Decoupling, Full-System Simulation, Virtual Platforms
AbstractVirtual Platforms (VPs) and Full System Simulators (FSSs) are among the fundamental tools of modern Multiprocessor System-on-Chip (MPSoC) development. In the last two decades, the execution speed of these simulations has not grown at the same rate as the complexity of the systems to be simulated, creating a need for faster simulation techniques. A popular approach is temporal decoupling (TD), in which parts of the simulation are not synchronized with the rest of the system for a period called the quantum. A large quantum is beneficial for simulation performance due to fewer synchronizations/context switches. Yet, it also increases the probability of causality errors, leading to inaccuracies. Thus, most users of TD simulations face the question: which quantum offers the optimal compromise between accuracy and performance? In practice and in the literature, the quantum is usually chosen based on empirical knowledge. This approach can achieve adequate performance/accuracy, but it lacks proper reasoning. In this work, we address this shortcoming by providing analytical estimations and deeper insights into the effects of TD. Additionally, we verify the proposed models using TD simulations in SystemC and gem5.
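The accuracy-versus-performance trade-off the abstract describes can be illustrated with a toy model. The sketch below is an illustrative assumption of mine, not the paper's analytical model: one simulated component runs ahead of the rest of the system for `quantum` time steps between synchronizations, and any incoming event that targets a time the component has already simulated can only be delivered late, which stands in here for a causality error.

```python
import random

def simulate(quantum, total_time=10_000, event_rate=0.01, seed=1):
    """Toy temporally-decoupled simulation of one component."""
    rng = random.Random(seed)
    syncs = 0
    late_events = 0
    t = 0
    while t < total_time:
        # The component runs ahead for one quantum without synchronizing.
        for step in range(quantum):
            if rng.random() < event_rate:
                # Event aimed at time t+step, but the component has already
                # run to t+quantum; only a boundary event arrives on time.
                if step < quantum - 1:
                    late_events += 1
        t += quantum
        syncs += 1  # one synchronization/context switch per quantum
    return syncs, late_events

# Small quantum: many synchronizations, no late events.
# Large quantum: few synchronizations, most events delivered late.
small_q = simulate(quantum=1)
large_q = simulate(quantum=1000)
```

Sweeping `quantum` over a range reproduces the qualitative curve behind the paper's question: synchronization overhead falls with the quantum while the number of potential causality errors rises.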

7A-2 (Time: 9:25 - 9:50)
TitleTowards a Highly Interactive Design-Debug-Verification Cycle
Author*Lucas Klemmer, Daniel Große (Johannes Kepler University Linz, Austria)
Pagepp. 692 - 697
Keyworddebug, analysis, RTL, waveform
AbstractTaking a hardware design from concept to silicon is a long and complicated process, partly due to very long-running simulations. After modifying a Register Transfer Level (RTL) design, it is typically handed off to the simulator, which then simulates the full design for a given amount of time. If a bug is discovered, there is no way to adjust the design while still in the context of the simulation. Instead, all simulation results are thrown away, and the entire cycle must be restarted from the beginning. In this paper, we argue that it is worth breaking up this strict separation between design languages, analysis languages, verification languages, and simulators. We present virtual signals, a methodology to inject new logic into existing waveforms. Virtual signals are based on WAL, an open-source waveform analysis language, and can therefore use the capabilities of WAL for debugging, fixing, analyzing, and verifying a design. All this enables an interactive and fast response design-debug-verification cycle. To demonstrate the benefits of our methodology, we present a case-study in which we show how the technique improves debugging and design analysis.

7A-3 (Time: 9:50 - 10:15)
TitleBeyond Time-Quantum: A Basic-Block FDA Approach for Accurate System Computing Performance Estimation
AuthorHsuan-Yi Lin, *Ren-Song Tsay (National Tsing Hua University, Taiwan)
Pagepp. 698 - 703
Keywordprogram execution phase, frequency domain analysis, basic block
AbstractPerformance estimation is an essential technique in computer system design and optimization. Traditional time-quantum-based methods have limitations in accurately capturing program behavior due to the granularity issue. In this paper, we propose a novel approach, called Basic-Block-based Frequency Domain Analysis (BBFDA), that uses basic block analysis to extract program execution phase information and applies recursive frequency domain analysis to estimate performance waveforms. We evaluate our proposed approach on SPEC CPU2017 benchmark suites and compare it with time-quantum-based methods. Results show that our approach achieves higher accuracy and robustness than traditional methods, especially for cases with many phases. Our proposed approach can dynamically track performance behavior without the need to choose an appropriate granularity size.

Session 7B  Innovations in Autonomous Systems: Hyperdimensional Computing and Emerging Application Frontiers
Time: 9:00 - 10:15, Thursday, January 25, 2024
Location: Room 205
Chair: Takashi Sato (Kyoto University, Japan)

7B-1 (Time: 9:00 - 9:25)
TitleBoostIID: Fault-agnostic Online Detection of WCET Changes in Autonomous Driving
Author*Saehanseul Yi, Nikil Dutt (University of California, Irvine, USA)
Pagepp. 704 - 709
Keywordautonomous driving, real-time system, fault detection, boosting
AbstractThe lifespan of autonomous vehicles is increasing, exposing them to aging and permanent faults that can impact timing safety based on design-time analyses such as worst-case execution time (WCET). Conventional fault-aware WCET methods incorporate potential faults into the analysis, which can result in severely pessimistic WCET estimates. The resulting underutilized system can exhibit significant energy inefficiency during normal operation. We propose BoostIID, a dynamic fault detection mechanism that achieves improved utilization and energy efficiency by monitoring changes in the statistical distribution of WCET at runtime. Unlike prior monitoring methods, BoostIID proactively detects faults before actual timing violations occur, allowing time for recovery measures. As a result, BoostIID eliminates the overhead of fault recovery from classical design-time WCET estimates, resulting in improved energy efficiency. We further improve the detection accuracy using a collection of independent and identically distributed (i.i.d.) tests with a boosting technique. Our experimental results with an autonomous driving benchmark suite show 62.6% energy reduction over pessimistic WCET methods, demonstrating BoostIID’s utility for energy-efficient safety-critical system design.

7B-2 (Time: 9:25 - 9:50)
TitleKalmanHD: Robust On-Device Time Series Forecasting with Hyperdimensional Computing
Author*Ivannia Gomez Moreno (CETYS University Campus Tijuana, Mexico), Xiaofan Yu, Tajana Rosing (University of California, San Diego, USA)
Pagepp. 710 - 715
KeywordTime-Series Forecasting, Hyperdimensional Computing, Kalman Filter
AbstractTime series forecasting is shifting towards Edge AI, where models are trained and executed on edge devices. However, training forecasting models at the edge faces two challenges: (1) dealing with streaming data containing noise, which can lead to degradation in predictions, and (2) coping with limited on-device resources. Traditional approaches focus on simple statistical methods like ARIMA or complex neural networks, which are either robust to sensor noise or efficient for edge deployment, not both. In this paper, we propose a novel, robust, and lightweight method named KalmanHD for on-device time series forecasting using Hyperdimensional Computing (HDC). KalmanHD integrates the Kalman Filter (KF) with HDC, yielding a new regression method that combines the robustness of KF and the efficiency of HDC. KalmanHD encodes past values into a high-dimensional vector representation, then iteratively updates the model based on the incoming samples. It also considers the variability of each sample to enhance robustness. We further accelerate KalmanHD by substituting the expensive matrix multiplication with efficient binary operations. Results show that our method achieves MAE comparable to the state-of-the-art noise-optimized NN-based methods while running 3.6-8.6x faster on typical edge platforms. The source code is available at https://github.com/DarthIV02/KalmanHD
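The combination the abstract describes, hyperdimensional encoding of a window of past values plus a Kalman-style adaptive gain, can be sketched in a few lines. The dimensions, the bipolar encoding, and the scalar-uncertainty update below are simplifying assumptions of mine, not the published KalmanHD algorithm (see the linked repository for that):

```python
import numpy as np

rng = np.random.default_rng(0)
D, W = 1024, 8                                  # hypervector size, window length
proj = rng.choice([-1.0, 1.0], size=(D, W))     # random bipolar projection

def encode(window):
    # Bipolar hypervector; keeping entries in {-1, +1} means the dot products
    # below could be replaced by cheap binary operations on edge hardware.
    return np.sign(proj @ window)

model = np.zeros(D)                             # regression hypervector
P, R, Q = 1.0, 0.1, 0.01                        # uncertainty, obs./process noise

def predict(h):
    return (model @ h) / D

def update(h, y):
    # Kalman-flavoured step: the gain K shrinks as uncertainty P shrinks,
    # so the model reacts less to noisy samples once it has stabilized.
    global model, P
    K = P / (P + R)
    model = model + K * (y - predict(h)) * h
    P = (1.0 - K) * P + Q

# Fit a small synthetic set of (past window -> next value) pairs online.
windows = rng.standard_normal((20, W))
targets = np.sin(np.arange(20))
for _ in range(50):
    for w, y in zip(windows, targets):
        update(encode(w), y)

errs = np.array([abs(predict(encode(w)) - y) for w, y in zip(windows, targets)])
```

Because the gain adapts to the running uncertainty rather than being a fixed learning rate, noisy samples move the model progressively less as training proceeds.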

7B-3 (Time: 9:50 - 10:15)
TitleHyperFeel: An Efficient Federated Learning Framework Using Hyperdimensional Computing
Author*Haomin Li, Fangxin Liu (Shanghai Jiao Tong University, China), Yichi Chen (Tianjin University, China), Li Jiang (Shanghai Jiao Tong University, China)
Pagepp. 716 - 721
KeywordHyperdimensional Computing, Federated Learning, Brain-inspired
AbstractFederated Learning (FL) aims to establish a shared model across decentralized clients under the privacy-preserving constraint. Each client learns an independent model with local data, and only the model’s updates are communicated. However, since the FL model typically employs computation-intensive neural networks, the major challenges in Federated Learning are (i) significant computation overhead for local training; (ii) massive communication overhead arising from the model updates; and (iii) notable performance degradation caused by non-IID scenarios. In this work, we propose HyperFeel, an efficient framework for federated learning based on Hyper-Dimensional Computing (HDC), which can significantly improve communication/storage efficiency over existing works with nearly no performance degradation. Unlike current solutions that employ neural networks as the learning models, HyperFeel introduces a simple yet effective computing paradigm using hyperdimensional vectors to encode and represent data. It performs concise and highly parallel operations for encryption, computation, and communication, taking advantage of the lightweight feature representation of hyperdimensional vectors. To further enhance HyperFeel's performance, we propose a two-fold optimization scheme combining the characteristics of encoding and updating in hyper-dimensional computing. On one hand, we design a personalized update strategy for client models based on HDC, which achieves better accuracy on non-IID data. On the other hand, we extend the framework from horizontal FL to vertical FL based on a shared encoding mechanism. Comprehensive experimental results demonstrate that our method consistently outperforms the state-of-the-art FL models. HyperFeel achieves 26× storage reduction and up to 81× communication reduction over FedAvg, with minimal accuracy drops on FEMNIST and Synthetic.

Session 7C  Productivity Management for High Level Design
Time: 9:00 - 10:15, Thursday, January 25, 2024
Location: Room 206
Chair: Kenshu Seto (Kumamoto University, Japan)

7C-1 (Time: 9:00 - 9:25)
TitleRTLLM: An Open-Source Benchmark for Design RTL Generation with Large Language Model
Author*Yao Lu, Shang Liu, Qijun Zhang, Zhiyao Xie (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 722 - 727
KeywordLarge Language Model, ChatGPT, Automatic Design Generation, ML for EDA
AbstractInspired by the recent success of large language models (LLMs) like ChatGPT, researchers have started to explore the adoption of LLMs for agile hardware design, such as generating design RTL based on natural-language instructions. However, in existing works, the target designs are all relatively simple, small in scale, and proposed by the authors themselves, making a fair comparison among different LLM solutions challenging. In addition, many prior works focus only on design correctness, without evaluating the quality of the generated design RTL. In this work, we propose an open-source benchmark named RTLLM for generating design RTL with natural language instructions. To systematically evaluate the auto-generated design RTL, we summarize three progressive goals, named the syntax goal, functionality goal, and design quality goal. This benchmark can automatically provide a quantitative evaluation of any given LLM-based solution. Furthermore, we propose an easy-to-use yet surprisingly effective prompt engineering technique named self-planning, which proves to significantly boost the performance of GPT-3.5 on our proposed benchmark.

7C-2 (Time: 9:25 - 9:50)
TitleLSTP: A Logic Synthesis Timing Predictor
Author*Haisheng Zheng (Shanghai AI Laboratory, China), Zhuolun He, Fangzhou Liu, Zehua Pei (Shanghai AI Laboratory/The Chinese University of Hong Kong, Hong Kong), Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 728 - 733
KeywordLogic Synthesis, AI For EDA
AbstractThe ever-growing complexity of modern VLSI circuits brings about a substantial increase in the design cycle. For logic synthesis, how to efficiently obtain the physical characteristics of a design for subsequent design space exploration emerges as a critical issue. In this paper, we propose LSTP, an ML-based logic synthesis timing predictor, which can rapidly predict the post-synthesis timing of a broad range of circuit designs. Specifically, we explicitly take optimization sequences into consideration so that we can comprehend the synergy between optimization passes and their effects on netlists. Experimental results demonstrate that we outperform the state of the art remarkably.

7C-3 (Time: 9:50 - 10:15)
TitleBridging the Design Methodologies of Burst-Mode Specifications and Signal Transition Graphs
Author*Alex Chan (Newcastle University, UK), Danil Sokolov, Victor Khomenko (Dialog Semiconductor (Renesas), UK), Alex Yakovlev (Newcastle University, UK)
Pagepp. 734 - 739
Keywordburst-mode, quasi-delay-insensitive, signal transition graphs, burst-automaton, workcraft
AbstractAsynchronous circuits are a promising type of digital circuit that still see moderate usage in today's commercial products, which has often been linked to the adaptation challenges posed within industry, e.g. the time required to develop new tools and train designers versus using existing synchronous tools to quickly meet market demands. Several formal models were introduced to aid the design of asynchronous circuits, including Burst-Mode (BM) Specifications and Signal Transition Graphs (STGs). BM specifications resemble synchronous Finite State Machines (FSMs), allowing circuit designers to easily adapt and use them; however, their circuit implementations may be limited due to declining tool support. STGs have access to well-established tools that produce optimal hazard-free circuit implementations, but they are seen as too different by the industry. In this paper, we present a new ‘co-design’ methodology that bridges the gap between BM specifications and STGs by using a formal model called Burst Automaton (BA). BA is a generic FSM-like model that acts as a framework for enabling interoperability between many different formal models, and offers several benefits that BM specifications and STGs can leverage. Our ‘co-design’ methodology is implemented in Workcraft, and is evaluated on several benchmarks showing an improved synthesis flow.

Session 7D  Emerging Memory Yield Optimization and Modeling
Time: 9:00 - 10:15, Thursday, January 25, 2024
Location: Room 207
Chair: Jun Shiomi (Osaka University, Japan)

7D-1 (Time: 9:00 - 9:25)
TitleSignature Driven Post-Manufacture Testing and Tuning of RRAM Spiking Neural Networks for Yield Recovery
AuthorAnurup Saha, Chandramouli Amarnath, *Kwondo Ma, Abhijit Chatterjee (Georgia Institute of Technology, USA)
Pagepp. 740 - 745
KeywordYield Recovery, Alternate Test, Spiking Neural Networks
AbstractResistive random access memory (RRAM)-based spiking neural networks (SNNs) are becoming increasingly attractive for pervasive energy-efficient classification tasks. However, such networks suffer from degradation of performance (as determined by classification accuracy) due to the effects of process variations on fabricated RRAM devices, resulting in loss of manufacturing yield. To address such yield loss, a two-step approach is developed. First, an alternative test framework is used to predict the performance of fabricated RRAM-based SNNs using the SNN response to a small subset of images from the test image dataset, called the SNN response signature (to minimize test cost). This diagnoses those SNNs that need to be performance-tuned for yield recovery. Next, SNN tuning is performed by modulating the spiking thresholds of the SNN neurons on a layer-by-layer basis using a trained regressor that maps the SNN response signature to the optimal spiking threshold values during tuning. The optimal spiking threshold values are determined by an offline optimization algorithm. Experiments show that the proposed framework can reduce the number of out-of-spec SNN devices by up to 54% and improve yield by as much as 8.6%.

7D-2 (Time: 9:25 - 9:50)
TitlePhysics-Informed Learning for Versatile RRAM Reset and Retention Simulation
Author*Tianshu Hou (Department of Micro/Nano Electronics, Shanghai Jiao Tong University, China), Yuan Ren, Wenyong Zhou, Can Li, Zhongrui Wang (Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong), Hai-Bao Chen (Department of Micro/Nano Electronics, Shanghai Jiao Tong University, China), Ngai Wong (Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong)
Pagepp. 746 - 751
KeywordRRAM, retention, reset, multiphysics fields, physics-informed learning
AbstractResistive random-access memory (RRAM) constitutes an emerging and promising platform for compute-in-memory (CIM) edge AI. However, the switching mechanism and controllability of RRAM are still under debate owing to the influence of multiphysics. Although physics-informed neural networks (PINNs) are successful in achieving mesh-free multiphysics solutions in many applications, the resultant accuracy is not satisfactory in RRAM analyses. This work investigates two characteristics of RRAM devices, retention and the reset transition, which are described in terms of the dissolution of a conductive filament (CF) in a 3-D axisymmetric geometry. Specifically, we provide a novel neural network characterization of ion migration, Joule heating, and carrier transport, governed by the solutions of partial differential equations (PDEs). Motivated by physics-informed learning, the separation of variables (SOV) method and the neural tangent kernel (NTK) theory, we propose a customized 3-channel fully-connected network and a modified random Fourier feature (mRFF) embedding strategy to capture multiscale properties and appropriate frequency features of the self-consistent multiphysics solutions. The proposed model eliminates the need for grid meshing and temporal iterations widely used in RRAM analysis. Experiments then confirm its superior accuracy over competing physics-informed methods.

7D-3 (Time: 9:50 - 10:15)
TitleHard Error Correction in STT-MRAM
Author*Surendra Hemaram, Mehdi B. Tahoori (Karlsruhe Institute of Technology (KIT), Germany), Francky Catthoor, Siddharth Rao, Sebastien Couet, Gouri Sankar Kar (IMEC, Belgium)
Pagepp. 752 - 757
KeywordSpin-transfer torque magnetic random access memory (STT-MRAM), Error correction pointer (ECP), Error correction string (ECS), Block error correction pointer (BECP)
AbstractSpin-transfer torque magnetic random access memory (STT-MRAM) is a promising alternative to existing on-chip CMOS memory technologies due to its non-volatility, low power consumption, and scalability potential. However, it is sensitive to various unique failure mechanisms, such as manufacturing defects in both CMOS and magnetic layers, temperature variation, repetitive writes, and oxide breakdown, which can cause early cell failure leading to hard errors. This can severely impair the manufacturing yield and its large-scale industrial adoption. We propose a new block error correction pointer (BECP) for hard error correction in STT-MRAM to ensure high manufacturing yield and in-field reliability. The proposed method divides large word lengths into smaller sub-blocks and assigns a specific base per sub-block to determine the offset location of the hard error for each sub-block. This allows storing only the offset instead of the absolute address of the hard error location for each sub-block. The results show that the proposed method is storage-efficient and has very low decoding complexity compared to the existing state-of-the-art methods. We incorporate experimental measurement data from manufactured STT-MRAM chips at different die locations to obtain the hard error distribution. The proposed method aligns well with our specific STT-MRAM error distribution measurements.
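The sub-block idea summarized above can be sketched concretely. The field widths and helper names below are my own illustrative assumptions, not the paper's exact BECP encoding: a 64-bit word is split into 8-bit sub-blocks, and a hard error is recorded as a short in-block offset plus the known-good bit value, instead of a full absolute bit address.

```python
import math

WORD_BITS = 64
SUB_BLOCK_BITS = 8

def encode_pointer(error_bit):
    """Split an absolute bit address into (sub-block index, in-block offset)."""
    return divmod(error_bit, SUB_BLOCK_BITS)

def correct(word, block, offset, value):
    """Overwrite the recorded stuck bit with its known-good value."""
    pos = block * SUB_BLOCK_BITS + offset
    return (word & ~(1 << pos)) | (value << pos)

# A bit stuck at 0 corrupts the stored word; the pointer restores it.
original = 0xDEADBEEFCAFEF00D
error_bit = 42
block, offset = encode_pointer(error_bit)
good_value = (original >> error_bit) & 1        # saved when the error is mapped
stored = original & ~(1 << error_bit)           # stuck-at-0 fault
recovered = correct(stored, block, offset, good_value)

# Storage saving per faulty sub-block: a 3-bit offset (log2 8) instead of
# the 6-bit absolute address (log2 64) a plain error-correction pointer needs.
offset_bits = int(math.log2(SUB_BLOCK_BITS))
absolute_bits = int(math.log2(WORD_BITS))
```

The saving compounds with word length: the wider the protected word, the more bits a per-error absolute address would cost, while the per-sub-block offset stays fixed.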

Session 7E  (SS-5) Enabling Chiplet-based Custom Designs
Time: 8:35 - 10:15, Thursday, January 25, 2024
Location: Room 107/108
Chairs: Antonino Tumeo (Pacific Northwest National Laboratory, USA), Yi Zhou (University of Utah, USA), Yu Cao (University of Minnesota, USA)

7E-1 (Time: 8:35 - 9:00)
Title(Invited Paper) Exploiting 2.5D/3D Heterogeneous Integration for AI Computing
AuthorZhenyu Wang, Jingbo Sun (Arizona State University, USA), Alper Goksoy (University of Wisconsin-Madison, USA), Sumit Kumar Mandal (Indian Institute of Science, India), Yaotian Liu (Arizona State University, USA), Jae-sun Seo (Cornell Tech, USA), Chaitali Chakrabarti (Arizona State University, USA), Umit Y. Ogras (University of Wisconsin-Madison, USA), Vidya Chhabria, Jeff Zhang (Arizona State University, USA), *Yu Cao (University of Minnesota, USA)
Pagepp. 758 - 764
KeywordHeterogeneous Integration, chiplet, ML accelerators, Performance Analysis
AbstractThe evolution of AI algorithms has not only revolutionized many application domains, but also posed tremendous challenges for the hardware platform. Advanced packaging technology today, such as 2.5D and 3D interconnection, provides a promising solution to meet the ever-increasing demands of bandwidth, data movement, and system scale in AI computing. This work presents HISIM, a modeling and benchmarking tool for chiplet-based heterogeneous integration. HISIM emphasizes the hierarchical interconnection that connects various chiplets through network-on-package. It further integrates technology roadmap, power/latency prediction, and thermal analysis together to support electro-thermal co-design. Leveraging HISIM with in-memory computing chiplets, we explore the advantages and limitations of 2.5D and 3D heterogeneous integration on representative AI algorithms, such as DNNs, transformers, and graph neural networks.

7E-2 (Time: 9:00 - 9:25)
Title(Invited Paper) Challenges and Opportunities to Enable Large-scale Computing via Heterogeneous Chiplets
AuthorZhuoping Yang, Shixin Ji, Xingzhen Chen, Jinming Zhuang (University of Pittsburgh, USA), *Weifeng Zhang (Lightelligence Inc, USA), Dharmesh Jani (Meta, USA), Peipei Zhou (University of Pittsburgh, USA)
Pagepp. 765 - 770
KeywordChiplet and interconnect, programming abstraction, advanced packaging and security, heterogeneous computing, large language model and generative AI
AbstractFast-evolving artificial intelligence (AI) algorithms such as large language models have been driving the ever-increasing computing demands in today's data centers. Heterogeneous computing with domain-specific architectures (DSAs) brings many opportunities when scaling up and scaling out the computing system. In particular, heterogeneous chiplet architecture is favored to keep scaling up and scaling out the system as well as to reduce the design complexity and the cost stemming from the traditional monolithic chip design. However, how to interconnect computing resources and orchestrate heterogeneous chiplets is the key to success. In this paper, we first discuss the diversity and evolving demands of different AI workloads. We discuss how chiplets bring better cost efficiency and shorter time to market. Then we discuss the challenges in establishing chiplet interface standards, packaging, and security issues. We further discuss the software programming challenges in chiplet systems.

7E-3 (Time: 9:25 - 9:50)
Title(Invited Paper) Heterogeneous Microelectronics Codesign for Edge Sensing at Deep Cryogenic Temperatures
AuthorJeff Fredenburg (Fermilab, USA)

7E-4 (Time: 9:50 - 10:15)
Title(Invited Paper) Towards Automated Generation of Chiplet-Based System
Author*Ankur Limaye, Claudio Barone, Nicolas Bohm Agostini, Marco Minutoli, Joseph Manzano, Vito Giovanni Castellana (Pacific Northwest National Laboratory, USA), Giovanni Gozzi, Michele Fiorito, Serena Curzel, Fabrizio Ferrandi (Politecnico di Milano, Italy), Antonino Tumeo (Pacific Northwest National Laboratory, USA)
Pagepp. 771 - 776
KeywordSynthesis, Chiplets, HLS
AbstractThe Software Defined Architectures (SODA) Synthesizer is an open-source compiler-based tool able to automatically generate domain-specialized systems targeting Application-Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs) starting from high-level programming. SODA is composed of a high-level frontend, SODA-OPT, which leverages the multilevel intermediate representation (MLIR) framework to interface with productive programming tools (e.g., machine learning frameworks), identify kernels suitable for acceleration, and perform high-level optimizations, and of a state-of-the-art high-level synthesis backend, Bambu from the PandA framework, to generate custom accelerators. One specific application of the SODA Synthesizer is the generation of accelerators to enable ultra-low latency inference and control on autonomous systems for scientific discovery (e.g., electron microscopes, sensors in particle accelerators, etc.). This talk will discuss ongoing work on the SODA synthesizer to enable no-human-in-the-loop generation and design space exploration of chiplets for highly specialized artificial intelligence accelerators. Connecting these highly specialized chiplets to general-purpose cores or programmable accelerators will enable quick deployment of autonomous systems for scientific discovery.

Session 4K  Keynote Session IV
Time: 10:30 - 11:30, Thursday, January 25, 2024
Location: Room Premier A/B
Chairs: Kyu-Myung Choi (Seoul National University, Republic of Korea), Taewhan Kim (Seoul National University, Republic of Korea)

4K-1 (Time: 10:30 - 11:30)
Title(Keynote Address) Unleashing the Future of IC Design with AI Innovation
AuthorErick Chao (Cadence Design Systems, Taiwan)
AbstractThe integration of AI into electronic design holds the potential to revolutionize the industry by enabling innovative and efficient chip design processes. Cadence, a leading provider of electronic design automation (EDA) tools and services, has deployed AI into its chip and system design tools to enable customers to harness its transformative capabilities. In this talk we will share the ways in which Cadence is leveraging AI in chip design.

Session 8A  Advances in Efficient Embedded Computing: from Hardware Accelerator to Task Management
Time: 13:00 - 14:40, Thursday, January 25, 2024
Location: Room 204
Chair: Sharad Malik (Princeton University, USA)

8A-1 (Time: 13:00 - 13:25)
TitleFlexible Spatio-Temporal Energy-Efficient Runtime Management
Author*Robert Khasanov, Marc Dietrich, Jeronimo Castrillon (TU Dresden, Germany)
Pagepp. 777 - 784
Keywordresource management, energy-efficiency, spatio-temporal mapping
AbstractHeterogeneous multi-core architectures, such as Arm's big.LITTLE and DynamIQ, feature multiple core types with the same ISA but varied performance-energy characteristics. These are increasingly adopted in embedded systems as they enable dynamic application mapping, balancing performance with energy efficiency. While Hybrid Application Mapping (HAM) approaches have gained popularity in systems running dynamic workloads, most solutions yield spatial mappings and neglect application migrations in output schedules, substantially limiting the solution space. This work introduces STEM and FFEMS, two algorithms utilizing the temporal aspect with job reconfigurations to generate “flexible” spatio-temporal mappings. STEM leverages Memetic Algorithms (MAs), while FFEMS uses fast greedy heuristics. Our evaluation on two heterogeneous multi-core platform models demonstrates that the flexible structure of the spatio-temporal mappings significantly improves schedulability. On workloads from the automotive and multimedia domains, STEM finds the most energy-efficient solutions, but its large overhead makes it unsuitable for use in runtime systems. In contrast, FFEMS exhibits an outstanding balance between performance and runtime overhead: given runtime overhead similar to that of MMKP-MDF, the state-of-the-art approach, FFEMS schedules up to 16% more test cases. Its “tail-switch” optimization further improves energy efficiency, though with increased overhead, which is still acceptable within runtime systems.

8A-2 (Time: 13:25 - 13:50)
TitleSparse-Sparse Matrix Multiplication Accelerator on FPGA featuring Distribute-Merge Product Dataflow
Author*Yuta Nagahara, Jiale Yan, Kazushi Kawamura, Masato Motomura, Thiem Van Chu (Tokyo Institute of Technology, Japan)
Pagepp. 785 - 791
KeywordSpM-SpM, FPGA, Dataflow
AbstractSparse-Sparse matrix multiplication (SpMSpM) is a critical computation in various fields. It poses computational challenges for general-purpose CPUs and GPUs due to its requirements for random memory access and its inherently low spatial/temporal locality. Though numerous SpMSpM accelerators have been recently proposed, they suffer from various issues such as low input utilization, heavy computational load, and excessive memory traffic during the merging process of intermediate results. This paper introduces a novel Distribute-Merge Product (DMP) SpMSpM dataflow and a DMP-based SpMSpM Architecture (DMSA). DMP distributes the workload into balanced streams, generates partial matrices based on these streams, and merges the partial results in a parallel and pipelined fashion. We have designed DMSA as a highly scalable architecture, implemented it on a Xilinx ZCU106 Evaluation Kit, and evaluated it on a set of benchmarks from the SuiteSparse matrix collection. Compared to a recent SpMSpM accelerator with approximately the same amount of hardware resources on the same FPGA platform, DMSA achieves a 2.72x speedup by facilitating the parallelism of partial matrix generation and merging. The speedup on the same platform reaches 4.80x when the parallelism explored in the merging process is doubled, evidencing DMSA's superb scalability.

8A-3 (Time: 13:50 - 14:15)
TitleMeeting Job-Level Dependencies by Task Merging
Author*Matthias Becker (KTH Royal Institute of Technology, Sweden)
Pagepp. 792 - 798
Keywordreal-time, end-to-end latency, task chain, job-level dependency
AbstractIndustrial applications are often time critical and subject to end-to-end latency constraints. Job-level dependencies can be leveraged to specify a partial ordering on tasks' jobs already at early design phases, agnostic of the hardware platform or scheduling algorithm, and guarantee that end-to-end latency constraints of task chains are met as long as the job-level dependencies are respected. However, their realization at runtime can introduce overheads and complicates the scheduling and timing analysis. This work presents an approach that merges multi-periodic tasks connected by job-level dependencies into a single task. A Constraint Programming formulation is presented that optimally merges such task clusters while all job-level dependencies are respected. Such an approach removes the need to consider job-level dependencies at runtime without being bound to a specific scheduling algorithm. Evaluations highlight the applicability of the approach through system-level experiments and showcase its scalability using synthetic task clusters.

8A-4 (Time: 14:15 - 14:40)
Title: A CGRA Front-end Compiler Enabling Extraction of General Control and Dedicated Operators
Author: *Xuchen Gao, Yunhui Qiu, Yuan Dai, Wenbo Yin, Lingli Wang (Fudan University, China)
Page: pp. 799 - 804
Keywords: CGRA Front-end Compiler, Multi-dimension Memory Access, Dedicated Operators, Variable Bound Loops
Abstract: Coarse-grained reconfigurable architecture (CGRA) has gradually become an extraordinarily promising accelerator due to its flexibility and power efficiency. However, most CGRA front-end compilers focus on the innermost body of regular loops with a pure data flow. Therefore, we propose CO-Compiler, an LLVM-based CGRA front-end compiler that generates an optimized control-data flow graph (CDFG) and can handle versatile loops in C/C++, including general control flow, arbitrary nesting levels, and imperfect statements. We then extract multi-dimensional memory access patterns and various dedicated operators adapted to concrete hardware functions. In addition, we analyze variable loop bounds that are settled at runtime and realize SoC runtime configuration of the CGRA. The feasibility of our methodology is verified by a RISC-V based SoC simulation. The experimental results demonstrate that our dedicated operator extraction reduces PE resources by 43% and the initiation interval (II) by 84% on a TRAM architecture. Furthermore, compared with state-of-the-art (SOTA) CGRA front-end compilers, CO-Compiler achieves the highest success rate (88.1%) in CDFG generation across a wide range of benchmarks. Moreover, using the same back-end mappers, our work achieves a 78% reduction in II and 2.06x PE spatio-temporal utilization compared with their own front-end compilers.


Session 8B  In-Memory Computing Architecture Design and Logic Synthesis
Time: 13:00 - 14:40, Thursday, January 25, 2024
Location: Room 205
Chair: Shuo-Han Chen (National Yang Ming Chiao Tung University, Taiwan)

8B-1 (Time: 13:00 - 13:25)
Title: LOSSS: Logic Synthesis based on Several Stateful logic gates for high time-efficient computing
Author: Yihong Hu (College of Computer, National University of Defense Technology, China), *Nuo Xu, Chaochao Feng (College of Computer, National University of Defense Technology/Key Laboratory of Advanced Microprocessor Chips and Systems, China), Wei Tong (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, China), Kang Liu, Liang Fang (College of Computer, National University of Defense Technology, China)
Page: pp. 805 - 811
Keywords: In-memory computing, Memristor-based stateful logic, Logic synthesis and mapping, Memristor
Abstract: Memristor stateful logic is an effective way to achieve true in-memory computing in a memristor-based crossbar array (MCBA). However, cascading stateful logic gates in an MCBA is a time-consuming sequential process compared to space-wise CMOS combinational logic circuits. It is therefore essential to develop automatic synthesis tools that realize complex combinational logic functions with fewer in-memory stateful logic gates. In this paper, a logic synthesis process based on several stateful logic gates (LOSSS) is developed to enhance the in-memory computing efficiency of single row/column-oriented stateful logic computing. First, multiple compatible two/one-input PMR-type stateful logic gates with the functions of NOR, OR, and NOT are employed in the initial function synthesis to obtain a good starting-point netlist. Then, a post-processing stage is added to the flow to reduce the number of gates in the netlist via an automated optimization algorithm that replaces specific gate groups with the composite gates IMP and ONOR while accounting for input overwriting. Finally, an improved mapping process cascades these stateful logic gates in a single row of the crossbar array with less device occupation. Compared to the standard SIMPLER-MAGIC, LOSSS achieves arithmetic-mean improvements of over 23% in performance and over 34% in effective lifetime on the EPFL benchmark suite, also surpassing the results reported by the state-of-the-art MAGIC synthesis process (X-MAGIC).

8B-2 (Time: 13:25 - 13:50)
Title: Towards Area-Efficient Path-Based In-Memory Computing using Graph Isomorphisms
Author: Sven Thijssen, *Muhammad Rashedul Haq Rashed, Hao Zheng (University of Central Florida, USA), Sumit Kumar Jha (Florida International University, USA), Rickard Ewetz (University of Central Florida, USA)
Page: pp. 812 - 817
Keywords: in-memory computing, crossbar
Abstract: In-memory computing has attracted significant attention due to its potential to alleviate the issues caused by the von Neumann bottleneck. Path-based computing is a recently proposed in-memory computing paradigm for evaluating Boolean functions using nanoscale crossbars. Unlike state-of-the-art paradigms that use expensive WRITE operations to execute functions, path-based computing relies only on READ operations, which translates into low power consumption and low computational delay. Unfortunately, path-based computing comes with the penalty of substantial area overhead. In this paper, we introduce the ISO framework, a hardware/software solution for minimizing the area overhead of path-based computing systems. The framework maps computation to in-memory kernels using an intermediate k-LUT representation. The k-LUTs facilitate reusing hardware resources that realize the same computational structures. The reuse is performed by detecting identical subfunctions using isomorphic graphs. We also present program instruction and scheduling algorithms to facilitate the hardware reuse. We have evaluated our proposed ISO framework on the 10 ISCAS85 benchmarks. Our experimental evaluation indicates that our proposed architecture improves energy consumption, latency, and area by 1.30X, 76.59X, and 2.79X on average compared with previous state-of-the-art methods for path-based computing.

8B-3 (Time: 13:50 - 14:15)
Title: READ-based In-Memory Computing using Sentential Decision Diagrams
Author: Sven Thijssen, *Muhammad Rashedul Haq Rashed (University of Central Florida, USA), Sumit Kumar Jha (Florida International University, USA), Rickard Ewetz (University of Central Florida, USA)
Page: pp. 818 - 823
Keywords: in-memory computing, sentential decision diagrams
Abstract: Processing-in-memory (PIM) has the potential to unleash unprecedented computing capabilities. While most in-memory computing paradigms rely on repeatedly programming the non-volatile memory devices, recent computing paradigms are capable of evaluating Boolean functions by simply observing the flow of electrical currents within a crossbar of non-volatile memory. Synthesizing Boolean functions into such crossbar designs is a fundamental problem for next-generation in-memory computing systems. The selection of the data structure used to guide the synthesis process has a first-order impact on the overall system performance. State-of-the-art in-memory computing paradigms leverage representations such as majority inverter graphs (MIGs) and binary decision diagrams (BDDs). In this paper, we propose the Cascading Crossbar Synthesis using SDDs (C2S2) framework for automatically synthesizing Boolean logic into crossbar designs. The cornerstone of the C2S2 framework is a newly invented data structure called sentential decision diagrams (SDDs). It has been proved that SDDs are more succinct than binary decision diagrams (BDDs). To minimize expensive data transfer on the system bus, C2S2 maps computation to multiple crossbars that are connected together in series. The C2S2 framework is evaluated using 13 benchmark circuits. Compared with state-of-the-art paradigms such as CONTRA, FLOW, and PATH, C2S2 improves energy efficiency by 6.8x while maintaining similar latency.

8B-4 (Time: 14:15 - 14:40)
Title: ConvFIFO: A Crossbar Memory PIM Architecture for ConvNets Featuring First-In-First-Out Dataflow
Author: *Liang Zhao, Yu Qian, Fanzi Meng, Xiapeng Xu, Xunzhao Yin, Cheng Zhuo (Zhejiang University, China)
Page: pp. 824 - 829
Keywords: process in memory, convolutional neural network, FIFO, RRAM, dataflow
Abstract: Process-in-memory (PIM) architectures based on emerging non-volatile memories (NVMs) have been widely studied for more efficient computation of convolutional neural networks (ConvNets). However, conventional NVM-based PIM suffers from various non-idealities, including IR drop, sneak-path currents, analog-to-digital converter (ADC) overhead, and device variation and mismatch. In this work, we propose ConvFIFO, a crossbar-memory PIM architecture for ConvNets featuring a novel first-in-first-out (FIFO) dataflow. Through the design of FIFO-type input/output buffers, ConvFIFO maximizes the reuse rates of inputs and partial sums to achieve a more balanced trade-off among throughput, accuracy, and area/energy consumption. By using SRAM-based FIFOs, ConvFIFO further achieves a systolic architecture without the need to move weight data, bypassing the limitation of NVM endurance. Compared to classical NVM-based PIM architectures such as ISAAC, ConvFIFO exhibits significant improvements in energy consumption (1.66-3.56x), latency (1.69-1.74x), Ops/W (4.23-10.17x), and Ops/s×mm2 (1.59-1.74x), benchmarked against a number of common ConvNet models.


Session 8C  Firing Less for Evolution: Quantization & Learning Spikes
Time: 13:00 - 14:40, Thursday, January 25, 2024
Location: Room 206
Chair: Ting-Chi Wang (National Tsing Hua University, Taiwan)

Best Paper Candidate
8C-1 (Time: 13:00 - 13:25)
Title: MINT: Multiplier-less Integer Quantization for Energy Efficient Spiking Neural Networks
Author: *Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda (Yale University, USA)
Page: pp. 830 - 835
Keywords: Spiking Neural Network, Quantization, Neuromorphic Computing
Abstract: We propose Multiplier-less INTeger (MINT) quantization, a uniform quantization scheme that efficiently compresses weights and membrane potentials in spiking neural networks (SNNs). Unlike previous SNN quantization methods, MINT quantizes memory-intensive membrane potentials to an extremely low precision (2-bit), significantly reducing the memory footprint. MINT also shares the quantization scaling factor between weights and membrane potentials, eliminating the need for multipliers required in conventional uniform quantization. Experimental results show that our method matches the accuracy of full-precision models and other state-of-the-art SNN quantization techniques while surpassing them in memory footprint reduction and hardware cost efficiency at deployment. For example, 2-bit MINT VGG-16 achieves 90.6% accuracy on CIFAR-10, with roughly 93.8% reduction in memory footprint from the full-precision model and 90% reduction in computation energy compared to vanilla uniform quantization at deployment.
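A minimal sketch of the shared-scale idea from the abstract above: when weights and membrane potentials use one scaling factor, the LIF update and threshold comparison can stay entirely in the integer domain, with no rescaling multiplier. The 4-bit width, the hard reset, and all names are illustrative assumptions, not MINT's exact scheme:

```python
def quantize(x, scale, bits):
    """Uniform symmetric quantization of x to a signed integer."""
    qmax = 2 ** (bits - 1) - 1
    q = round(x / scale)
    return max(-qmax - 1, min(qmax, q))

def lif_step_int(u_q, w_q, spikes, theta_q):
    """One integer-domain LIF step: because weights w_q, membrane u_q and
    threshold theta_q share the same scale, no multiplier is needed to
    align them (spikes are 0/1, so the MAC degenerates to adds)."""
    u_q = u_q + sum(w * s for w, s in zip(w_q, spikes))
    fired = u_q >= theta_q
    return (0 if fired else u_q), fired  # illustrative hard reset

# Quantize weights and threshold once with the shared scale.
scale = 0.1
w_q = [quantize(0.2, scale, 4), quantize(0.3, scale, 4)]  # -> [2, 3]
print(lif_step_int(0, w_q, [1, 1], quantize(0.4, scale, 4)))  # -> (0, True)
```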

8C-2 (Time: 13:25 - 13:50)
Title: TQ-TTFS: High-Accuracy and Energy-Efficient Spiking Neural Networks Using Temporal Quantization Time-to-First-Spike Neuron
Author: *Yuxuan Yang, Zihao Xuan, Yi Kang (University of Science and Technology of China, China)
Page: pp. 836 - 841
Keywords: spiking neural network, temporal quantization, image classification, energy efficiency
Abstract: In recent years, spiking neural networks (SNNs) have gained attention for their biologically realistic and event-driven characteristics, which align well with neuromorphic hardware. Time-to-First-Spike (TTFS) coding is a coding scheme for SNNs in which each neuron fires only once throughout the inference process, reducing the number of spikes and improving energy efficiency. However, SNNs with TTFS coding face the issue of low classification accuracy. This paper introduces TQ-TTFS, a temporally quantized TTFS neuron model, to address this issue. In addition, the temporally quantized neurons can run at a lower clock frequency without increasing inference latency, which leads to higher energy efficiency. The experimental results show the effectiveness of the proposed temporal quantization neuron model in improving both classification accuracy and energy efficiency. In our simulations, TQ-TTFS achieves classification accuracy of 98.6% on the MNIST dataset and 90.2% on the FashionMNIST dataset, which is among the state of the art for temporal-coding SNNs. An analysis also shows that TQ-TTFS on an example SNN can achieve a 2.94x energy-efficiency improvement compared with traditional TTFS coding.

8C-3 (Time: 13:50 - 14:15)
Title: TEAS: Exploiting Spiking Activity for Temporal-wise Adaptive Spiking Neural Networks
Author: *Fangxin Liu, Haomin Li, Ning Yang, Zongwu Wang (Shanghai Jiao Tong University, China), Tao Yang (Huawei Technologies Co., Ltd., China), Li Jiang (Shanghai Jiao Tong University, China)
Page: pp. 842 - 847
Keywords: SNNs, energy efficiency, sparsity
Abstract: Spiking neural networks (SNNs) are energy-efficient alternatives to commonly used deep artificial neural networks (ANNs). However, their sequential computation pattern over multiple time steps makes processing latency a significant hindrance to deployment. In existing SNNs deployed on time-driven hardware, all layers generate and receive spikes in a synchronized manner, forcing them to share the same time steps. This often leads to considerable time redundancy in the spike sequences and considerable repetitive processing. Motivated by the effectiveness of dynamic neural networks for boosting efficiency, we propose a temporal-wise adaptive SNN, namely TEAS, in which each layer is configured with an independent number of time steps to fully exploit the potential of SNNs. Specifically, given an SNN, the number of time steps of each layer is configured according to its contribution to the final performance of the whole network. We then exploit a temporal transforming module to produce a dynamic policy that adapts the temporal information during inference. The adaptive configuration-generating process also enables trading off model complexity against accuracy. Extensive experiments on a variety of challenging datasets demonstrate that our method provides significant savings in energy efficiency and processing latency under similar accuracy, outperforming the existing state-of-the-art methods.

8C-4 (Time: 14:15 - 14:40)
Title: SOLSA: Neuromorphic Spatiotemporal Online Learning for Synaptic Adaptation
Author: Zhenhang Zhang, Jingang Jin, Haowen Fang, *Qinru Qiu (Syracuse University, USA)
Page: pp. 848 - 853
Keywords: Spiking Neural Network, Spatiotemporal pattern learning, online learning
Abstract: Spiking neural networks (SNNs) are bio-plausible computing models with high energy efficiency. The temporal dynamics of neurons and synapses enable them to detect temporal patterns and generate sequences. While Backpropagation Through Time (BPTT) is traditionally used to train SNNs, it is not suitable for online learning in embedded applications due to its high computation and memory cost as well as extended latency. In this work, we present Spatiotemporal Online Learning for Synaptic Adaptation (SOLSA), which is specifically designed for online learning of SNNs composed of Leaky Integrate and Fire (LIF) neurons with exponentially decayed synapses and soft reset. The algorithm not only learns the synaptic weights but also adapts the temporal filters associated with the synapses. Compared to the BPTT algorithm, SOLSA has a much lower memory requirement and achieves a more balanced temporal workload distribution. Moreover, SOLSA incorporates enhancement techniques such as scheduled weight updates, early stopping, and adaptive synapse filtering, which speed up convergence and enhance the learning performance. Compared to other non-BPTT-based SNN learning methods, SOLSA demonstrates an average learning-accuracy improvement of 14.2%. Furthermore, compared to BPTT, SOLSA achieves a 5% higher average learning accuracy with a 72% reduction in memory cost.
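The neuron model named in the abstract above, a LIF neuron with an exponentially decayed synapse and soft reset, can be simulated in a few lines. Time constants and threshold are illustrative values; the SOLSA learning rule itself is not shown:

```python
import math

def lif_exp_synapse(spike_train, tau_syn=2.0, tau_mem=5.0, v_th=1.0):
    """Discrete-time LIF neuron driven through an exponentially decaying
    synaptic current, with soft reset (v -= v_th on firing)."""
    a_syn = math.exp(-1.0 / tau_syn)   # per-step synapse decay factor
    a_mem = math.exp(-1.0 / tau_mem)   # per-step membrane decay factor
    i_syn, v, out = 0.0, 0.0, []
    for s in spike_train:
        i_syn = a_syn * i_syn + s      # synapse filters the input spikes
        v = a_mem * v + i_syn          # leaky membrane integration
        if v >= v_th:
            v -= v_th                  # soft reset keeps the residual
            out.append(1)
        else:
            out.append(0)
    return out

print(lif_exp_synapse([1, 0, 0, 0]))  # -> [1, 0, 0, 0]
```

The exponentially decayed synapse is exactly the "temporal filter" the abstract says SOLSA adapts: learning tau_syn per synapse changes which input timings the neuron responds to.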


Session 8D  New Techniques for Photonics and Analog Circuit Design
Time: 13:00 - 14:40, Thursday, January 25, 2024
Location: Room 207
Chairs: Yuanqing Chen (Beihang University, China), Yasuhiro Takashima (University of Kitakyushu, Japan)

8D-1 (Time: 13:00 - 13:25)
Title: Signed Convolution in Photonics with Phase-Change Materials using Mixed-Polarity Bitstreams
Author: *Raphael Cardoso, Clément Zrounba (Institut des Nanotechnologies de Lyon, France), Mohab Abdalla (Institut des Nanotechnologies de Lyon/RMIT University, France), Paul Jimenez, Mauricio Gomes de Queiroz (Institut des Nanotechnologies de Lyon, France), Benoît Charbonnier (Univ. Grenoble Alpes/CEA LETI, France), Fabio Pavanello (Institut des Nanotechnologies de Lyon/Univ. Grenoble Alpes/Univ. Savoie Mont Blanc, France), Ian O'Connor (Institut des Nanotechnologies de Lyon, France), Sébastien Le Beux (University of Concordia, Canada)
Page: pp. 854 - 859
Keywords: photonic computing, phase-change materials, stochastic computing, in-memory computing
Abstract: As AI continues to grow in importance, numerous alternatives are under investigation to improve its hardware building blocks and thereby reduce its carbon footprint and use of computing resources. In particular, in convolutional neural networks (CNNs), the convolution function represents the most important operation and one of the best targets for optimization. A new approach to convolution has recently emerged using optics, phase-change materials (PCMs), and stochastic computing, but it is thus far limited to unsigned operands. In this paper, we propose an extension in which the convolutional kernels are signed, using mixed-polarity bitstreams. We present a proof of validity for our method, while also showing that, in simulation and under similar operating conditions, our approach is less affected by noise than the common approach in the literature.
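For context, the classic signed ("bipolar") stochastic-computing encoding can be sketched as below, where a single XNOR per bit multiplies two bitstreams. The paper's mixed-polarity bitstream encoding is a different signed scheme, so treat this only as a generic illustration of signed bitstream arithmetic:

```python
import random

def to_bipolar_stream(x, n_bits, rng):
    """Encode x in [-1, 1] as a bitstream with P(bit = 1) = (x + 1) / 2."""
    p = (x + 1.0) / 2.0
    return [1 if rng.random() < p else 0 for _ in range(n_bits)]

def from_bipolar_stream(bits):
    """Decode: estimate x from the fraction of ones."""
    return 2.0 * sum(bits) / len(bits) - 1.0

def sc_multiply(sa, sb):
    """Bipolar stochastic multiply: bitwise XNOR of the two streams."""
    return [1 - (a ^ b) for a, b in zip(sa, sb)]

rng = random.Random(0)
n = 1 << 14
sa = to_bipolar_stream(0.5, n, rng)
sb = to_bipolar_stream(-0.5, n, rng)
approx = from_bipolar_stream(sc_multiply(sa, sb))  # ~ 0.5 * -0.5 = -0.25
```

The appeal for photonic hardware is that the per-bit operation is a trivial gate; accuracy trades off against bitstream length.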

8D-2 (Time: 13:25 - 13:50)
Title: An Efficient Branch-and-Bound Routing Algorithm for Optical NoCs
Author: *Yihao Liu, Yaoyao Ye (Shanghai Jiao Tong University, China)
Page: pp. 860 - 865
Keywords: optical NoC, branch-and-bound, contention, thermal sensitivity, routing
Abstract: Silicon-photonics-based optical networks-on-chip (ONoCs) are emerging as a power-efficient on-chip communication architecture for the next generation of chip multiprocessors. However, the thermal sensitivity of photonic devices presents power consumption challenges. Existing routing schemes optimized for optical power loss tend to avoid passing through high-temperature nodes, which in turn leads to contention at low-temperature nodes. It remains a crucial challenge to develop an adaptive routing algorithm that strikes a balance between power consumption and performance optimization. In this paper, we first propose an efficient branch-and-bound routing (BBR) algorithm for ONoCs. Second, we derive 3BOR, a variant of the BBR algorithm with a bi-objective bounding function that optimizes both optical power loss and network performance. To the best of our knowledge, this is the first time the branch-and-bound method has been adopted to solve the routing optimization problem in ONoCs. Experimental results demonstrate that 3BOR reduces thermal-induced optical power loss by 14.8% while enhancing the saturation injection rate by 6.5% compared to the state-of-the-art heuristic contention-aware thermal-reliable routing algorithm. In terms of running time, the branch-and-bound method consistently outperforms the state-of-the-art heuristic algorithm across varying network sizes.

8D-3 (Time: 13:50 - 14:15)
Title: Boosting Graph Spectral Sparsification via Parallel Sparse Approximate Inverse of Cholesky Factor
Author: *Baiyu Chen, Zhiqiang Liu, Yibin Zhang, Wenjian Yu (Tsinghua University, China)
Page: pp. 866 - 871
Keywords: very-large-scale-integrated, preconditioned conjugate gradient, graph spectral sparsification, domain decomposition
Abstract: With the advance of very-large-scale integrated (VLSI) systems, fast and efficient algorithms for solving equations involving Laplacian matrices are increasingly significant. Graph spectral sparsification, which aims to produce an ultra-sparse subgraph that preserves properties of the original graph, has attracted extensive attention thanks to its distinguished performance. For preconditioning, the effectiveness of the sparsifiers produced by graph spectral sparsification algorithms directly influences the speed of preconditioned conjugate gradient (PCG) iterations, while a recently proposed algorithm that pursues effective sparsifiers may incur a huge time expenditure for sparsifier construction, as calculating the sparse approximate inverse of the Cholesky factor can be rather time-consuming. In this paper, based on domain decomposition, a parallel algorithm for calculating the sparse approximate inverse of the Cholesky factor is proposed, applying a technique for calculating the Schur complement matrix based on partial Cholesky factorization. Building on this parallel algorithm, a fast and effective parallel graph spectral sparsification algorithm is proposed. Extensive experiments reveal that the proposed parallel graph spectral sparsification algorithm shows an eminent speedup compared with the serial approach. Moreover, for transient analysis of power grids, the proposed algorithm shows a significant speedup compared with the state-of-the-art parallel iterative solver based on graph sparsification.

8D-4 (Time: 14:15 - 14:40)
Title: Asynchronous Batch Constrained Multi-Objective Bayesian Optimization for Analog Circuit Sizing
Author: *Xuyang Zhao, Zhaori Bi, Changhao Yan, Fan Yang, Ye Lu (Fudan University, China), Dian Zhou (The University of Texas at Dallas, USA), Xuan Zeng (Fudan University, China)
Page: pp. 872 - 877
Keywords: Analog Circuit Sizing, Asynchronous Batch Bayesian Optimization, Multi-Objective Optimization, Constrained Expected Hypervolume Improvement
Abstract: For analog circuit sizing, constrained multi-objective optimization is an important and practical problem. With the popularity of multi-core machines and cloud computing, parallel/batch computing can significantly improve the efficiency of optimization algorithms. In this paper, we propose an Asynchronous Batch Constrained Multi-Objective Bayesian Optimization algorithm (ABCMOBO). Since performance values below the specifications are worthless, we adopt dynamic reference-point selection in the expected hypervolume improvement acquisition function for constraint handling. To avoid waiting for all the simulations in the same batch to complete, ABCMOBO asynchronously evaluates the next candidate point whenever a worker is idle. The experimental results quantitatively demonstrate that our proposed algorithm achieves a 3.49-8.18x speed-up with comparable optimization results compared to state-of-the-art asynchronous/synchronous batch multi-objective optimization methods.


Session 8E  (JW-1) TILOS & AI-EDA Joint Workshop - I
Time: 13:00 - 14:40, Thursday, January 25, 2024
Location: Room 107/108
Chair: Kyumyung Choi (Seoul National University, Republic of Korea)

8E-1 (Time: 13:00 - 13:25)
Title: (Joint Workshop) Fast and Expandable ANN-Based Compact Model and Parameter Extraction for Emerging Transistors
Author: Jeong-taek Kong, SoYoung Kim (Sungkyunkwan University, Republic of Korea)

8E-2 (Time: 13:25 - 13:50)
Title: (Joint Workshop) Fast Timing/Power Library Generation Using Machine Learning
Author: Daijoon Hyun (Sejong University, Republic of Korea)

8E-3 (Time: 13:50 - 14:15)
Title: (Joint Workshop) Clustering-Based Methodology for Fast and Improved Placement of Large-Scale Designs
Author: Andrew Kahng (UCSD, USA)

8E-4 (Time: 14:15 - 14:40)
Title: (Joint Workshop) Routability Prediction and Optimization Using Machine Learning Techniques
Author: Seokhyeong Kang (POSTECH, Republic of Korea)


Session 9A  Advancing AI Algorithms: Faster, Smarter, and More Efficient
Time: 15:00 - 16:40, Thursday, January 25, 2024
Location: Room 204
Chairs: Yu Wang (Tsinghua University, China), Li Jiang (Shanghai Jiao Tong University, China)

9A-1 (Time: 15:00 - 15:25)
Title: Quantization-aware Optimization Approach for CNNs Inference on CPUs
Author: *Jiasong Chen, Zeming Xie, Weipeng Liang, Bosheng Liu, Xin Zheng, Jigang Wu, Xiaoming Xiong (Guangdong University of Technology, China)
Page: pp. 878 - 883
Keywords: data movements, CPUs, acceleration
Abstract: Data movement through the memory hierarchy is a fundamental bottleneck in the majority of convolutional neural network (CNN) deployments on CPUs. Loop-level optimization and hybrid-bitwidth quantization are two representative optimization approaches for reducing memory accesses: loop optimization can apply loop tiling and/or permutation to adapt the nested loops of convolution layers to CPUs for efficient memory access and computation, while hybrid-bitwidth quantization can reduce the size of CNN models for efficient deployment. However, they have been carried out independently because combining them significantly increases the complexity of design space exploration. We present QAOpt, a quantization-aware optimization approach that reduces this complexity when combining both for CNN deployments on CPUs. We develop a bitwidth-sensitive quantization strategy that trades off model accuracy against data movement when applying both loop-level optimization and mixed-precision quantization. We also provide a quantization-aware pruning process that reduces the design space for high efficiency. Evaluation results demonstrate that our work achieves better energy efficiency with acceptable accuracy loss.
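Loop tiling, one of the loop-level transformations the abstract above refers to, restructures a nested loop so each block of data is reused while it is cache-resident. A minimal sketch on a matrix multiply (the tile size and names are illustrative; a real tuner would pick the tile to fit the CPU cache):

```python
def matmul_tiled(A, B, n, tile=2):
    """n x n matrix multiply with loop tiling (blocking): the i/k/j loops
    are split into tile-sized blocks so a block of B stays hot in cache
    across the inner iterations. Produces the same result as the plain
    triple loop."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
C = matmul_tiled(A, B, 3)
print(C)  # -> [[10, 2, 5], [22, 5, 14], [34, 8, 23]]
```

QAOpt's point is that the best tile/permutation choice shifts once layers are quantized to mixed bitwidths, so the two decisions should be explored jointly.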

9A-2 (Time: 15:25 - 15:50)
Title: TSTC: Enabling Efficient Training via Structured Sparse Tensor Compilation
Author: *Shiyuan Huang, Fangxin Liu (Shanghai Jiao Tong University, China), Tian Li (Huawei Technologies Co., Ltd., China), Zongwu Wang, Haomin Li, Li Jiang (Shanghai Jiao Tong University, China)
Page: pp. 884 - 889
Keywords: Compilation, Sparse, Training, energy efficiency
Abstract: Network sparsification is an effective technique for Deep Neural Network (DNN) inference acceleration. However, existing sparsification solutions often rely on structured sparsity, which has limited benefits: many sparse storage formats introduce substantial memory and computation overhead for address generation and gradient updates, or they are only applicable during inference, neglecting the training phase. In this paper, we propose a novel compilation optimization design called TSTC that enables efficient training via structured sparse tensor compilation. TSTC introduces a novel sparse format, Tensorization-aware Index Entity (TIE), which efficiently represents structured sparse tensors by eliminating repeated indices and reducing storage overhead. The TIE format is applied in the Address-carry flow (AC flow) pass, optimizing the data layout at the computational-graph layer. Additionally, a shape-inference pass utilizes the address-carry flow to derive optimized tensor shapes, and an operator-level AC flow optimization pass generates efficient addresses for structured sparse tensors. TSTC is a versatile design that can be efficiently integrated into existing frameworks or compilers. As a result, TSTC achieves 3.64x, 5.43x, 4.89x, and 3.91x speedups compared to state-of-the-art sparse formats on VGG16, ResNet-18, MobileNetV1, and MobileNetV2, respectively.

9A-3 (Time: 15:50 - 16:15)
Title: An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions
Author: Caleb Tung, Nicholas Eliopoulos, Purvish Jajal, Gowri Ramshankar, Chen-Yun Yang (Purdue University, USA), Nicholas Synovic (Loyola University Chicago, USA), Xuecen Zhang, Vipin Chaudhary (Case Western Reserve University, USA), *George K. Thiruvathukal (Loyola University Chicago, USA), Yung-Hsiang Lu (Purdue University, USA)
Page: pp. 890 - 895
Keywords: energy-efficient computer vision, training-free
Abstract: Computer vision often uses highly accurate Convolutional Neural Networks (CNNs), but these deep learning models are associated with ever-increasing energy and computation requirements. Producing more energy-efficient CNNs often requires model training, which can be cost-prohibitive. We propose a novel, automated method to make a pretrained CNN more energy-efficient without re-training. Given a pretrained CNN, we insert a threshold layer that filters activations from the preceding layers to identify regions of the image that are irrelevant, i.e., can be ignored by the following layers while maintaining accuracy. Our modified focused convolution operation saves inference latency (by up to 25%) and energy costs (by up to 22%) on various popular pretrained CNNs, with little to no loss in accuracy.
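The threshold-layer idea above can be sketched in two steps: a mask marks low-activation pixels as irrelevant, and a subsequent convolution (a 1x1 case here, for brevity) skips the masked positions. The threshold value and function names are illustrative assumptions, not the paper's implementation:

```python
def threshold_mask(activations, tau):
    """The inserted threshold layer: mark pixels whose activation
    magnitude falls below tau as irrelevant (0 = skip, 1 = keep)."""
    return [[1 if abs(a) >= tau else 0 for a in row] for row in activations]

def masked_pointwise_conv(x, w, mask):
    """A 1x1 'focused' convolution that skips masked-out positions,
    saving their multiply-accumulates entirely."""
    return [[x[i][j] * w if mask[i][j] else 0 for j in range(len(x[i]))]
            for i in range(len(x))]

mask = threshold_mask([[0.1, 0.9], [0.5, 0.05]], tau=0.2)
print(mask)                                    # -> [[0, 1], [1, 0]]
print(masked_pointwise_conv([[1, 2], [3, 4]], 2, mask))  # -> [[0, 4], [6, 0]]
```

Because the mask is derived from the network's own activations, no retraining is needed; the savings come from the MACs that are never issued for masked pixels.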

9A-4 (Time: 16:15 - 16:40)
Title: PIONEER: Highly Efficient and Accurate Hyperdimensional Computing using Learned Projection
Author: *Fatemeh Asgarinejad (University of California San Diego/San Diego State University, USA), Justin Morris (California State University San Marcos, USA), Tajana Rosing (University of California San Diego, USA), Baris Aksanli (San Diego State University, USA)
Page: pp. 896 - 901
Keywords: Hyperdimensional Computing, Adaptive training, Efficient Edge AI, Sparsity
Abstract: Hyperdimensional Computing (HDC) has emerged as a lightweight learning paradigm garnering considerable attention in the IoT domain. Despite its appeal, HDC has lagged behind more intricate Machine Learning (ML) algorithms in accuracy, prompting prior research to propose sophisticated encoding and training techniques at the expense of efficiency. In this study, we present a novel approach for selecting the projection vectors used to encode input data into high-dimensional spaces, enabling HDC to attain high accuracy with significantly reduced vector sizes. We adopt a neural-network-based mechanism to learn the projection vectors and demonstrate their efficacy when integrated into a conventional HDC system. Furthermore, we introduce a novel sparsity technique to enhance hardware efficiency by compressing the projection vectors and reducing computational operations with minimal impact on accuracy. Our experimental results reveal that at larger vector dimensions (e.g., 10k), our method, termed PIONEER, leveraging INT4 or binary vectors, outperforms state-of-the-art high-precision nonlinear encoding in terms of accuracy, while preserving noteworthy accuracy even at much lower dimensions of 50-100. Additionally, by applying our proposed sparsification technique, PIONEER achieves significant performance and energy-efficiency gains compared to previous works.
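The projection-based encoding that PIONEER improves on can be sketched as a signed projection of a feature vector into a high-dimensional bipolar hypervector. Here the projection matrix P is random for illustration; PIONEER's contribution is to *learn* P with a neural network, which is not shown:

```python
import random

def encode_hd(x, proj):
    """Encode a feature vector into a bipolar hypervector h = sign(P @ x)."""
    return [1 if sum(p_i * x_i for p_i, x_i in zip(row, x)) >= 0 else -1
            for row in proj]

def hd_similarity(h1, h2):
    """Fraction of matching components between two bipolar hypervectors;
    classification compares a query against per-class prototypes."""
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)

rng = random.Random(42)
D, F = 256, 8                      # hypervector dimension, feature dimension
P = [[rng.choice((-1, 1)) for _ in range(F)] for _ in range(D)]
x = [0.5, -1.0, 0.2, 0.0, 1.0, -0.3, 0.7, 0.1]
h = encode_hd(x, P)
print(hd_similarity(h, encode_hd(x, P)))  # -> 1.0 (encoding is deterministic)
```

Learning P instead of sampling it randomly is what lets the abstract's method keep accuracy at dimensions as low as 50-100, where random projections degrade.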


Session 9B  Design Explorations for Neural Network Accelerators
Time: 15:00 - 17:05, Thursday, January 25, 2024
Location: Room 205
Chair: Chun-Yi Lee (National Tsing Hua University, Taiwan)

9B-1 (Time: 15:00 - 15:25)
Title: Logic Design of Neural Networks for High-Throughput and Low-Power Applications
Author: *Kangwei Xu (Technical University of Munich, Germany), Grace Li Zhang (Technical University of Darmstadt, Germany), Ulf Schlichtmann, Bing Li (Technical University of Munich, Germany)
Page: pp. 902 - 907
Keywords: Logic design of neural networks, High-throughput and low power, Hardware-aware training
Abstract: Neural networks (NNs) have been successfully deployed in various fields. In NNs, a large number of multiply-accumulate (MAC) operations need to be performed. Most existing digital hardware platforms rely on parallel MAC units to accelerate these MAC operations. However, under a given area constraint, the number of MAC units in such platforms is limited, so MAC units have to be reused to perform the MAC operations in a neural network. Accordingly, the throughput in generating classification results is not high, which prevents the application of traditional hardware platforms in extreme-throughput scenarios. Besides, the power consumption of such platforms is also high, mainly due to data movement. To overcome this challenge, we propose to flatten and implement all the operations at neurons (e.g., MAC and ReLU) in a neural network with their corresponding logic circuits. To improve the throughput and reduce the power consumption of such logic designs, the weight values are embedded into the MAC units to simplify the logic, which reduces the delay of the MAC units and the power consumption incurred by weight movement. The retiming technique is further used to improve the throughput of the logic circuits. In addition, we propose a hardware-aware training method to reduce the area of the logic designs. Experimental results demonstrate that the proposed logic designs achieve high throughput and low power consumption for several high-throughput applications.
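Embedding a fixed weight into a MAC unit lets the general multiplier collapse into a few shifts and adds determined by the weight's set bits, which is the simplification the abstract above exploits. A software sketch of that unrolling, for a non-negative integer weight (names are illustrative):

```python
def const_mult_shift_add(x, w):
    """Multiply x by a fixed non-negative integer weight w using only
    shifts and adds, one add per set bit of w. For w = 6 (binary 110)
    this computes (x << 1) + (x << 2), i.e. no multiplier is needed
    once w is hard-wired into the logic."""
    acc = 0
    shift = 0
    while w:
        if w & 1:
            acc += x << shift
        w >>= 1
        shift += 1
    return acc

print(const_mult_shift_add(5, 6))  # -> 30
```

In hardware the analogous simplification also prunes entire adders for zero bits, shortening the critical path; sparser (fewer set bits) weights give smaller, faster MAC logic, which is one motivation for the paper's hardware-aware training.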

9B-2 (Time: 15:25 - 15:50)
Title: Exact Scheduling to Minimize Off-Chip Data Movement for Deep Learning Accelerators
Author: Yi Li, Aarti Gupta, *Sharad Malik (Princeton University, USA)
Page: pp. 908 - 914
Keyword: scheduling, accelerator, compiler optimization
Abstract: Specialized hardware accelerators are increasingly utilized to provide performance/power efficiency for Deep Neural Network (DNN) applications. However, their benefits are limited by expensive off-chip data movement between host memory and the accelerator's on-chip scratchpad, which can consume significantly more energy than accelerator computation [13]. While application-level DNN operators can have arbitrary sizes, accelerators typically support fixed-sized operations due to constrained on-chip memory and micro-architectures. Consequently, mapping an application-level operator to an accelerator involves decomposing it into loops of smaller tiles. Different choices of tile sizes, loop orders, and memory partitioning across tensors result in a vast design space with huge differences in off-chip data movement volume. To address this challenge, we introduce Shoehorn, a schedule optimization framework that jointly optimizes loop tiling, loop ordering, and memory partitioning for mapping application-level DNN operators to hardware accelerators. Shoehorn can generate optimal schedules in sub-second time and outperforms state-of-the-art approaches, reducing total off-chip memory traffic by up to 51% relative to competing schedulers for several widely used DNN applications on three distinct hardware accelerator targets.

9B-3 (Time: 15:50 - 16:15)
Title: Run-time Non-uniform Quantization for Dynamic Neural Networks in Wireless Communication
Author: *Priscilla Sharon Allwin, Manil Dev Gomony, Marc Geilen (Eindhoven University of Technology, Netherlands)
Page: pp. 915 - 920
Keyword: Dynamic Neural Networks, Non-uniform Quantization, Dynamic Data-gating, Low Power, Wireless Receivers
Abstract: Dynamic Neural Networks (DyNN) offer the ability to adapt their structure, parameters, or precision dynamically, making them suitable for systems with rapidly changing environmental conditions, such as wireless communication. Traditional uniform quantization, if applied in DyNNs, results in unnecessary switching power, as the precision requirements differ across environmental conditions. To address this issue, we present two main contributions: 1) an offline non-uniform quantization algorithm enabling run-time quantization adaptation while preserving system performance; 2) a low-overhead dynamic data-gating architecture facilitating run-time non-uniform quantization. The proposed algorithm facilitated dynamic data-gating of up to 8 bits for QPSK demodulation parameters with no performance loss in a Digital Video Broadcast (DVB-S.2) receiver simulation. The DyNN architecture with data-gating, synthesized using GF 22-nm FDSOI CMOS technology, achieves a 43% total power reduction with a minimal 3% area overhead compared to the architecture without data-gating.

9B-4 (Time: 16:15 - 16:40)
Title: PipeFuser: Building Flexible Pipeline Architecture for DNN Accelerators via Layer Fusion
Author: Xilang Zhou, *Shuyang Li (Fudan University, China), Haodong Lu (Nanjing University of Posts and Telecommunications, China), Kun Wang (Fudan University, China)
Page: pp. 921 - 926
Keyword: Layer Fusion, DNN Accelerator, Pipeline, Co-design
Abstract: In this paper, we propose a fused-pipeline architecture that leverages the layer fusion technique to harness the strengths of both non-pipeline and full-pipeline architectures while mitigating their disadvantages. In particular, we observe that the performance of fused-pipeline accelerators is significantly influenced by the layer fusion strategies and intra-layer mapping schemes. To optimize and rapidly deploy the fused-pipeline architecture, we present an end-to-end automation framework, named PipeFuser. At the core of PipeFuser is a genetic algorithm (GA)-based co-design engine, which is used to acquire near-optimal hardware configurations in the vast design space. Experimental results demonstrate that our fused-pipeline architecture achieves 2.3× to 3.3× higher performance over the non-pipeline design and 1.9× to 2.5× speedup compared to the full-pipeline architecture, with greater deployment flexibility.

9B-5 (Time: 16:40 - 17:05)
Title: A Precision-Scalable RISC-V DNN Processor with On-Device Learning Capability at the Extreme Edge
Author: *Longwei Huang, Chao Fang, Qiong Li, Jun Lin, Zhongfeng Wang (Nanjing University, China)
Page: pp. 927 - 932
Keyword: RISC-V processor, precision-scalable, on-device learning, reconfigurable accelerator, FPGA accelerator
Abstract: Extreme edge platforms, such as in-vehicle smart devices, require efficient deployment of quantized deep neural networks (DNNs) to enable intelligent applications while conserving energy, memory, and computing resources. However, many edge devices struggle to boost inference throughput of various quantized DNNs due to the varying quantization levels, and these devices lack floating-point (FP) support for on-device learning, which prevents them from improving model accuracy while ensuring data privacy. To tackle the challenges above, we propose a precision-scalable RISC-V DNN processor with on-device learning capability. It facilitates diverse precision levels of fixed-point DNN inference, spanning from 2-bit to 16-bit, and enhances on-device learning through improved support for FP16 operations. Moreover, we employ multiple methods such as FP16 multiplier reuse and multi-precision integer multiplier reuse, along with balanced mapping of FPGA resources, to significantly improve hardware resource utilization. Experimental results on the Xilinx ZCU102 FPGA show that our processor significantly improves inference throughput by 1.6∼14.6× and energy efficiency by 1.1∼14.6× across various DNNs, compared to the prior state-of-the-art, XpulpNN. Additionally, our processor achieves a 16.5× higher FP throughput for on-device learning.

Session 9C  High-Level Security Verification and Efficient Implementation
Time: 15:00 - 17:05, Thursday, January 25, 2024
Location: Room 206
Chairs: Amin Rezaei (California State University at Long Beach, USA), Danella Zhao (University of Arizona, USA)

9C-1 (Time: 15:00 - 15:25)
Title: Microscope: Causality Inference Crossing the Hardware and Software Boundary from Hardware Perspective
Author: Zhaoxiang Liu, Kejun Chen (Kansas State University, USA), Dean Sullivan (University of New Hampshire, USA), Orlando Arias (University of Massachusetts Lowell, USA), Raj Dutta (Silicon Assurance, USA), *Yier Jin (University of Science and Technology of China, China), Xiaolong Guo (Kansas State University, USA)
Page: pp. 933 - 938
Keyword: Causal Inference, Hardware Security, Hardware and Software Co-Verification
Abstract: The increasing complexity of System-on-Chip (SoC) designs and the rise of third-party vendors in the semiconductor industry have led to unprecedented security concerns. Traditional formal methods struggle to address software-exploited hardware bugs, and existing solutions for hardware-software co-verification often fall short. This paper presents Microscope, a novel framework for inferring software instruction patterns that can trigger hardware vulnerabilities in SoC designs. Microscope enhances the Structural Causal Model (SCM) with hardware features, creating a scalable Hardware Structural Causal Model (HW-SCM). A domain-specific language (DSL) in SMT-LIB represents the HW-SCM and predefined security properties, with incremental SMT solving deducing possible instructions. Microscope identifies causality to determine whether a hardware threat could result from any software events, providing a valuable resource for patching hardware bugs and generating test input. Extensive experimentation demonstrates Microscope's capability to infer the causality of a wide range of vulnerabilities and bugs located in SoC-level benchmarks.

9C-2 (Time: 15:25 - 15:50)
Title: d-GUARD: Thwarting Denial-of-Service Attacks via Hardware Monitoring of Information Flow using Language Semantics in Embedded Systems
Author: Garett Cunningham, Harsha Chenji, *David Juedes, Avinash Karanth (Ohio University, USA)
Page: pp. 939 - 944
Keyword: Language Semantics, Embedded Security, Networks-on-Chip
Abstract: As low-level embedded systems are vulnerable to attacks that exploit flaws in either hardware or software, it is essential to enforce security policies to protect the system from malicious instructions that significantly alter program behavior. To improve efficiency of implementation, high-level security policy languages have been defined such that the policies can be directly synthesized into hardware monitors. However, the language semantics define policies that are static throughout program execution, which limits flexibility. Moreover, security policies target processor pipelines and not the network-on-chip (NoC) connecting several processors, where denial-of-service attacks could originate. In this paper, we enable dynamically reconfigurable security policies through a high-level language called D-GUARD that targets both the processor pipeline and the NoC architecture in multicore embedded systems. Alongside static policies, D-GUARD's semantics support policies that dynamically change behavior in response to program conditions at runtime. In addition, we propose policies to thwart denial-of-service attacks by rate-limiting the packet flow into the network using the same dynamic policies expressed by D-GUARD. We describe a Verilog compiler to support realizing policies as hardware monitors for both processor pipelines and network interfaces. D-GUARD is developed using the Coq proof assistant, enabling formal verification of policy correctness and other properties. This approach takes advantage of the abstractions and expressiveness of a higher-level language while minimizing the overhead that comes with other general-purpose approaches implemented purely in hardware, as well as laying the groundwork for a formally verified toolchain.

9C-3 (Time: 15:50 - 16:15)
Title: Security Coverage Metrics for Information Flow at the System Level
Author: *Ece Nur Demirhan Coşkun (Cyber-Physical Systems, DFKI GmbH, Germany), Sallar Ahmadi-Pour (Institute of Computer Science, University of Bremen, Germany), Muhammad Hassan, Rolf Drechsler (Institute of Computer Science, University of Bremen/Cyber-Physical Systems, DFKI GmbH, Germany)
Page: pp. 945 - 950
Keyword: System validation, Availability, Threat modeling, Hardware security, System-level design
Abstract: In this paper, we introduce a novel set of security coverage metrics for information flow at the system level. The proposed security coverage metrics play a crucial role in assessing the qualification and quantification of various security properties, in addressing specific threat models, such as availability, and in identifying potential security vulnerabilities associated with information flow. To implement these metrics, we present SiMiT, a tool that leverages Virtual Prototypes (VP), and Static and Dynamic Information Flow Tracking (IFT) methodologies. We demonstrate the applicability of the proposed security coverage metrics through SiMiT on an open-source RISC-V VP architecture with its peripherals. By assessing the security properties using these metrics, we pave the way for a security-aware Completeness Driven Development (CDD) concept and the development of secure System-on-Chip (SoC) designs.

9C-4 (Time: 16:15 - 16:40)
Title: Theoretical Patchability Quantification for IP-Level Hardware Patching Designs
Author: *Wei-Kai Liu (Duke University, USA), Benjamin Tan (University of Calgary, Canada), Jason M. Fung (Intel Corporation, USA), Krishnendu Chakrabarty (Arizona State University, USA)
Page: pp. 951 - 956
Keyword: Patching, Patchability, IP, SoC, Hardware bugs
Abstract: As the complexity of System-on-Chip (SoC) designs continues to increase, ensuring thorough verification becomes a significant challenge for system integrators. The complexity of verification can result in undetected bugs. Unlike software or firmware bugs, hardware bugs are hard to fix after deployment, and they require additional patching logic integrated with the design in advance. However, the absence of a standardized metric for defining "patchability" leaves system integrators relying on their understanding of each IP and security requirements to engineer ad hoc patching designs. In this paper, we propose a theoretical patchability quantification method to analyze designs at the Register Transfer Level (RTL) with provided patching options. Our quantification defines patchability as a combination of observability and controllability so that we can analyze and compare the patchability of IP variations. This quantification is a systematic approach to estimate each patching architecture's ability to patch at run-time and complements existing patching works. In experiments, we compare several design options of the same patching architecture and discuss their differences in terms of theoretical patchability and how many potential weaknesses can be mitigated.

9C-5 (Time: 16:40 - 17:05)
Title: Multiplierless Design of High-Speed Very Large Constant Multiplications
Author: *Levent Aksoy (Tallinn University of Technology, Estonia), Debapriya Basu Roy (Indian Institute of Technology Kanpur, India), Malik Imran, Samuel Pagliarini (Tallinn University of Technology, Estonia)
Page: pp. 957 - 962
Keyword: very large constant multiplication, high-speed design architectures, area optimization, Montgomery multiplication, cryptography
Abstract: In cryptographic algorithms, the constants to be multiplied by a variable can be very large due to security requirements. Thus, the hardware complexity of such algorithms heavily depends on the design architecture handling large constants. In this paper, we introduce an electronic design automation tool, called LEIGER, which can automatically generate the realizations of very large constant multiplications for low-complexity and high-speed applications, targeting the ASIC design platform. LEIGER can utilize the shift-adds architecture and use 3-input operations, i.e., carry-save adders (CSAs), where the number of CSAs is reduced using a prominent optimization algorithm. It can also generate constant multiplications under a hybrid design architecture, where 2- and 3-input operations are used at different stages. Moreover, it can describe constant multiplications under a design architecture using compressor trees. As a case study, high-speed Montgomery multiplication, which is a fundamental operation in cryptographic algorithms, is designed with its constant multiplication block realized under the proposed architectures. Experimental results indicate that LEIGER enables a designer to explore the trade-off between area and delay of the very large constant and Montgomery multiplications, and leads to designs with area-delay product, latency, and energy consumption values significantly better than those obtained by a recently proposed algorithm.

Session 9D  Routing
Time: 15:00 - 16:15, Thursday, January 25, 2024
Location: Room 207
Chair: Masato Inagi (Hiroshima City University, Japan)

9D-1 (Time: 15:00 - 15:25)
Title: V-GR: 3D Global Routing with Via Minimization and Multi-Strategy Rip-up and Rerouting
Author: *Ping Zhang, Pengju Yao (Fuzhou University, China), Xingquan Li (Pengcheng Laboratory, China), Bei Yu (The Chinese University of Hong Kong, Hong Kong), Wenxing Zhu (Fuzhou University, China)
Page: pp. 963 - 968
Keyword: global routing, via, rip-up & rerouting, maze routing
Abstract: In VLSI, a large number of vias may reduce manufacturability, degrade circuit performance, and increase the layout area required for interconnection. In this paper, we propose a 3D global router, V-GR, which considers minimizing the number of vias. V-GR uses a modified via-aware routing cost that considers the impact of wire density on vias, making the cost function more sensitive to the number of vias. Meanwhile, a novel multi-strategy rip-up & rerouting framework is developed for V-GR to resolve overflowed nets, effectively optimizing wire length and overflow while minimizing the number of vias. The proposed framework first leverages two proprietary routing techniques, namely 3D monotonic routing and 3D 3-via-stack routing, to control the number of vias and reduce overflow. Additionally, the framework incorporates an RSMT-aware expanded-source 3D maze routing algorithm to build routing paths with shorter wire length. Experimental results on the ICCAD'19 contest benchmarks show that V-GR achieves high-quality results, reducing vias by 8% and overflow by 7.5% in the global routing phase. Moreover, to achieve a fair comparison, TritonRoute is used to conduct detailed routing, and Innovus is used to evaluate the final solution. The comparison shows that V-GR achieves a 4.7% reduction in vias and an 8.7% reduction in DRV, while maintaining almost the same wire length.

9D-2 (Time: 15:25 - 15:50)
Title: A Fast and Robust Global Router with Capacity Reduction Techniques
Author: *Yun-Kai Fang, Ye-Chih Lin, Ting-Chi Wang (National Tsing Hua University, Taiwan)
Page: pp. 969 - 974
Keyword: Global Routing, Physical Design
Abstract: As design rules become more complex in new technology nodes, signal routing presents increasing challenges. In order to reduce the complexity, routing is typically divided into global routing (GR) and detailed routing (DR). However, a feasible GR solution may not always translate to a feasible DR solution due to resource mismatches between the two stages. In this work, we propose capacity reduction techniques that are applied in GR while aiming to enhance detailed-routability and bridge the mismatches between GR and DR. Encouraging experimental results demonstrate that after including the proposed capacity reduction techniques, the outdated open-source global router NTHU-Route 2.0 is revitalized and outperforms the state-of-the-art global router TritonRoute-WXL. We achieve all DRC-clean solutions with 7.8% less via usage, 29% less non-preferred usage, and a 2% DR quality score improvement, while only increasing wirelength by 0.4%. Additionally, our GR runtime and DR runtime are respectively 44% and 31% shorter.

9D-3 (Time: 15:50 - 16:15)
Title: A High Performance Detailed Router Based on Integer Programming with Adaptive Route Guides
Author: Zhongdong Qi, *Shizhe Hu, Qi Peng, Hailong You, Chao Han, Zhangming Zhu (Xidian University, China)
Page: pp. 975 - 980
Keyword: Detailed routing, integer programming, rip-up and reroute, design rule checking
Abstract: Detailed routing is a crucial and time-consuming stage for ASIC design. As the number and complexity of design rules increase, it is challenging to achieve high solution quality and fast speed at the same time in detailed routing. In this work, a high performance detailed routing algorithm named IPAG, based on integer programming (IP), is proposed. The IP formulation uses the selection of candidate routes as decision variables. High quality candidate routes are generated by queue-based rip-up and reroute with adaptive global route guidance. A design rule checking engine that can simultaneously process nets with multiple routes is designed to efficiently construct penalty parameters in the IP formulation. Experimental results on the ISPD 2018 detailed routing benchmark show that IPAG achieves better solution quality in shorter or comparable runtime, as compared to the state-of-the-art academic detailed router.

Session 9E  (JW-2) TILOS & AI-EDA Joint Workshop - II
Time: 15:00 - 16:40, Thursday, January 25, 2024
Location: Room 107/108
Chair: Youngsoo Shin (KAIST, Republic of Korea)

9E-1 (Time: 15:00 - 15:25)
Title: (Joint Workshop) ML Assisted DTCO Framework & Physical Design Optimization using DSO
Author: Kyumyung Choi, Taewhan Kim (Seoul National University, Republic of Korea)

9E-2 (Time: 15:25 - 15:50)
Title: (Joint Workshop) Differential Design Search: A Learning-Based Optimization Framework for EDA
Author: Jaeyong Jung (Incheon National University, Republic of Korea)

9E-3 (Time: 15:50 - 16:15)
Title: (Joint Workshop) AI-Based Design Optimization of SRAM-MRAM Hybrid Cache and On-Chip Interconnection Network
Author: Eui-young Chung (Yonsei University, Republic of Korea)

9E-4 (Time: 16:15 - 16:40)
Title: (Joint Workshop) ML/AI and Cross Layer Optimizations for Electronic and Photonic Design Automation
Author: David Pan (University of Texas, Austin, USA)