The 30th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Technical Program: SIMPLE version DETAILED version with abstract

Author Index: HERE

Session Schedule

Monday, January 20, 2025

Room Saturn	Innovation Hall
T1 Tutorial-1: Automation of Standard Cell Layout Generation and Design-Technology Co-optimization 9:30 - 12:30	T2 Tutorial-2: AHS: An EDA toolbox for Agile Chip Front-end Design 9:30 - 12:30
T3 Tutorial-3: Memory Built-In Self-Test (MBIST): Advanced Techniques for SoC Design and Verification 14:30 - 16:30	T4 Tutorial-4: Efficient Deployment of Large Language Models on Resource Constrained Edge Computing Platforms 14:30 - 17:30

Tuesday, January 21, 2025

Room Saturn	Room Uranus	Room Venus	Room Mars/Mercury	Innovation Hall	Miraikan Hall
1K (Miraikan Hall) Opening and Keynote Session I 8:30 - 9:45
Coffee Break 9:45 - 10:05
1A (T1-1) System-Level Modeling and Design Methodologies 10:05 - 11:45	1B (T4.2-1) Tools and Techniques for On-Device AI Deployment 10:05 - 11:45	1C (T8-1) AI and Logic Synthesis – A perfect match? 10:05 - 11:20	1D (T3-1) Application-Specific Computing-In-Memory 10:05 - 11:45	1E (SS-1) Machine Learning Based Physical Simulation and Physics-Aware Optimization 10:05 - 11:45	1F (SS-CEDA) CEDA/CASS/SSCS Joint session on Silicon Photonics 10:05 - 11:45
Lunch Break 11:45 - 13:15
2A (T5-1) Accelerating Vision and Transformer Models 13:15 - 15:20	2B (T6-1) Shaping the Future of Analog EDA 13:15 - 14:55	2C (T7-1) Approximate and Stochastic Computing 13:15 - 15:20	2D (T2-1) Next-Generation Embedded Architectures and Tools 13:15 - 15:20	2E (SS-2) Advances in 3D-IC and Ultra-Large-Scale Integration 13:15 - 15:20	2F University Design Contest 13:15 - 15:20
Coffee Break 15:20 - 15:40
3A (SS-3) LLM Acceleration and Specialization for Circuit Design and Edge Applications 15:40 - 17:20	3B (T9-1) Timing Analysis and Optimization 15:40 - 17:45	3C (T12-1) Side Channel Attacks and Trusted Execution Environment 15:40 - 17:45	3D (T3-2) Frameworks and Modeling for Computing-In-Memory 15:40 - 17:20	3E (T4.1-1) AI/ML for Circuit Design and Prediction 15:40 - 17:45
1S (Room Jupiter) ACM SIGDA Student Research Forum / WIP Poster Session 18:00 - 20:00

Wednesday, January 22, 2025

Room Saturn	Room Uranus	Room Venus	Room Mars/Mercury	Innovation Hall	Miraikan Hall
2K (Miraikan Hall) 30th Anniversary and Keynote Session II 8:30 - 9:45
Coffee Break 9:45 - 10:05
4A (T4.2-2) Advanced Methods in AI Hardware Co-Design 10:05 - 11:45	4B (T1-2) Communication Networks 10:05 - 11:45	4C (T2-2) Design and Optimization of Emerging Embedded Applications 10:05 - 11:45	4D (T11-1) Verification and Testing in Machine Lerning Era 10:05 - 11:45	4E (T3-3) Hybrid/Co-Designed Near/In Memory Computing 10:05 - 11:45	4F (SS-4) ML for IC Design and Manufacturing: When Is It Real? 10:05 - 11:45
Lunch Break / CEDA 20th Anniversary Panel (Innovation Hall and more) (See also information at CEDA site) 11:45 - 13:15
5A (T5-2) Innovations in Deep Learning and Neural Network Acceleration 13:15 - 15:20	5B (T7-2) Neuromorphic and Emerging Computing Techniques 13:15 - 15:20	5C (T9-2) Package and PCB 13:15 - 14:55	5D (T12-2) Logic Locking and Hardware Watermarking 13:15 - 14:55	5E (T4.1-2) AI-Driven Innovative Design Methods 13:15 - 15:20	5F (DF-1) Quantum Computing 13:15 - 14:30
Coffee Break 15:20 - 15:40
6A (SS-5) Beyond Digital: Advancing Design Automation for Physical Computing Systems 15:40 - 17:20	6B (T9-3) Floorplan and Placement 15:40 - 17:45	6C (T13-1) Let’s Quantumize: Welcome to the World of Quantum 15:40 - 16:55	6D (T10-1) Innovative Techniques for Energy-Efficient and Reliable Hardware Systems 15:40 - 17:45	6E (T4.1-3) Leveraging Large Language Models in Hardware Design 15:40 - 17:45	6F (DF-2) Advanced Sensor Technologies and Sensor Fusion 15:40 - 16:55
Banquet 18:30 - 20:30

Thursday, January 23, 2025

Room Saturn	Room Uranus	Room Venus	Room Mars/Mercury	Innovation Hall	Miraikan Hall
3K (Miraikan Hall) Keynote Session III 9:00 - 9:45
Coffee Break 9:45 - 10:05
7A (T1-3) Accelerator Design Methodologies 10:05 - 11:45	7B (T5-3) Advanced Architectures for Scientific and Edge Computing 10:05 - 11:45	7C (T6-2) The Science of Light: the New Advancement of Photonic Computing 10:05 - 11:45	7D (T8-2) From Math to Circuits 10:05 - 11:20	7E (T4.2-3) Innovative Techniques in AI Model Optimization and Training 10:05 - 11:20	7F (SS-6) Rapidus' Initiatives to Half Semiconductor Development Time 10:05 - 11:20
Lunch Break 11:45 - 13:15
8A (T1-4) System Level Modelling & Optimization 13:15 - 14:55	8B (T5-4) Emerging Trends in Reconfigurable and Compute-in-Memory 13:15 - 14:55	8C (T3-4) Adaptive and Flexible Memory Architecture 13:15 - 14:55	8D (T9-4) Reliability in Physical Design 13:15 - 14:55	8E (SS-7) Hardware Authenticity towards a Trustworthy Society 13:15 - 14:55	8F (DF-3) Extending the Limits of Classical Computers using Emerging Device and Circuit Technology 13:15 - 14:30
Coffee Break 14:45 - 15:20
9A (T12-3) Homomorphic Encryption and Cloud Security 15:20 - 17:30	9B (T11-2) Advanced Modeling, Simulation, and Verification 15:20 - 17:30	9C (T13-2) Carbon, Light, Fluids: Emerging Technologies 15:20 - 17:30	9D (T10-2) Advanced Techniques for Power Optimization and IR Prediction 15:20 - 17:30	9E (SS-8) Innovations and Challenges on Cryo-CMOS Devices, Circuits and Design Platforms 15:20 - 17:30	9F (DF-4) Integrated Circuit Design Methodologies using Open Source and Artificial Intelligence 15:20 - 16:35

DF: Designers' Forum, SS: Special Session

List of papers

Remark: The presenter of each paper is marked with "*".

Monday, January 20, 2025

[To Session Table]

Session T1 Tutorial-1: Automation of Standard Cell Layout Generation and Design-Technology Co-optimization
Time: 9:30 - 12:30, Monday, January 20, 2025
Location: Room Saturn

T1-1 (Time: 9:30 - 12:30)

Title	(Tutorial) Automation of Standard Cell Layout Generation and Design-Technology Co-optimization
Author	*Taewhan Kim (Seoul National University, Republic of Korea)
Abstract	As the technology advances, the design rules to be considered in standard cell layout generation become much complex and the count increases very rapidly. Moreover, at the advanced nodes, it becomes much slow or very difficult to achieve the objective i.e., to deliver or find the best-yield target node design rules while ensuring PPA-excellent target chip implementation. This is the reason why, from the EDA perspective, the need for automatic cell layout generation is inevitable at the advanced nodes, and design and technology co-optimization is a key enabling technique to achieve the chip PPA objective. In this respect, this tutorial covers EDA research area regarding two topics: (1) automation of standard cell layout generation and (2) methodologies of design and technology co-optimization utilizing diverse cell layout structures. For standard cells, starting from the detailed layer-by-layer structure for cell layout synthesis, the complex design rules, the algorithms of cell synthesis including transistor placement and in-cell routing, and existing cell generation tools developed so far in academia with comparison of their distinct features and pros/cons will be covered. In addition, new EDA challenge of cell generation for next generation of BEOL and FEOL technologies will be discussed. For DTCO, starting from the DTCO concept utilizing standard cells, noticeable DTCO works focusing on, for target chip PPA improvement, multi-bit flip-flop cells, cells with pin inaccessibility, and cells with metal and gate poly misalignment will be covered.

[To Session Table]

Session T2 Tutorial-2: AHS: An EDA toolbox for Agile Chip Front-end Design
Time: 9:30 - 12:30, Monday, January 20, 2025
Location: Innovation Hall

T2-1 (Time: 9:30 - 12:30)

Title	(Tutorial) AHS: An EDA toolbox for Agile Chip Front-end Design
Author	Yun (Eric) Liang, Youwei Xiao, Ruifan Xu (Peking University, China)
Abstract	Compared to software design, hardware design is more expensive and time-consuming. This is partly because software community has developed a rich set of modern tools to help software programmers to get projects started and iterated easily and quickly. However, the tools are seriously antiquated and lacking for hardware design. Modern digital chips are still designed manually using hardware description language such as Verilog or VHDL, which requires low-level and tedious programming, debugging, and tuning. In this tutorial, we will introduce AHS: An EDA toolbox for Agile Chip Front-end Design, which includes various EDA tools for both chip design and verification.

[To Session Table]

Session T3 Tutorial-3: Memory Built-In Self-Test (MBIST): Advanced Techniques for SoC Design and Verification
Time: 14:30 - 16:30, Monday, January 20, 2025
Location: Room Saturn

T3-1 (Time: 14:30 - 16:30)

Title	(Tutorial) Memory Built-In Self-Test (MBIST): Advanced Techniques for SoC Design and Verification
Author	*Prashant Seetharaman (Siemens Digital Industries Software, USA)
Abstract	As System-on-Chip (SoC) designs continue to grow in complexity and memory density, ensuring the reliability and testability of embedded memories becomes increasingly critical. This tutorial presents an in-depth exploration of Memory Built-In Self-Test (MBIST), a powerful methodology for testing and diagnosing memory faults in modern SoC designs. We will discuss the fundamentals of MBIST, advanced implementation techniques, and emerging trends in the field. The tutorial also covers key aspects of MBIST, including architecture, test algorithm selection, integration in the SoC design flow, and strategies for optimizing test coverage and reducing test time. Through theoretical analysis and practical examples, we will demonstrate how MBIST addresses the challenges of testing high-density memories in complex SoCs, balancing test quality, silicon area overhead, and test time to achieve optimal results in real-world scenarios. Additionally, we explore the application of MBIST to emerging memory technologies encompassing non-volatile memories and resistive RAMs. Finally, we will discuss future directions in memory testing, including the use of machine learning and artificial intelligence techniques.

[To Session Table]

Session T4 Tutorial-4: Efficient Deployment of Large Language Models on Resource Constrained Edge Computing Platforms
Time: 14:30 - 17:30, Monday, January 20, 2025
Location: Innovation Hall

T4-1 (Time: 14:30 - 17:30)

Title	(Tutorial) Efficient Deployment of Large Language Models on Resource-Constrained Edge Computing Platforms
Author	*Yiyu Shi (University of Notre Dame, USA)
Abstract	Scaling laws guide the design of large language models (LLMs), assuming unlimited computing resources, but this is impractical for deploying personalized LLMs on resource-constrained edge devices. Critical questions arise regarding trade-offs between design factors?such as learning methods, data volume, model size, compression techniques, and training time?and their impact on efficiency and accuracy. This tutorial provides comprehensive guidelines for deploying LLMs on constrained devices and explores how advanced hardware architectures like Compute-in-Memory (CiM) can aid this process. Attendees will learn to make informed decisions on key topics like choosing between compressed vs. uncompressed LLMs, selecting appropriate personalization techniques, and optimizing training time under resource limitations.

Tuesday, January 21, 2025

[To Session Table]

Session 1K Opening and Keynote Session I
Time: 8:30 - 9:45, Tuesday, January 21, 2025
Location: Miraikan Hall
Chair: Yuichi Nakamura (NEC, Japan)

1K-1

Title

ASP-DAC 2025 Opening

1K-2

Title	(Keynote Address) Design Innovation and Collaboration with Foundries: Towards a Sustainable Semiconductor Industry
Author	Kazunari Ishimaru (Senior Managing Executive Officer, Head of Silicon Technology Division, Rapidus Corporation, Japan)
Abstract	As the semiconductor industry faces mounting pressure to meet both rising demand and sustainability challenges, the need for innovative design and effective collaboration with foundries has never been more critical. The lead time from design to productization is becoming longer with each transition to the next generation, which is particularly critical in the rapidly growing AI market, where time to market is crucial. Additionally, traditional DFM (Design for Manufacturability) to address the increasingly complex device structures and manufacturing processes is also becoming more complicated, leading to longer design periods. As a foundry, it is essential to provide feedback to the design space from a manufacturing perspective, and the realization of MFD (Manufacturing for Design) will become essential in the future. The DMCO (Design-Manufacturing Co-Optimization) proposed by Rapidus is a method to solve this issue, and collaboration with customers and ecosystem partners is crucial. Furthermore, collaborative efforts with foundries enable adopting sustainable production practices, such as reducing energy consumption, minimizing waste, and ensuring supply chain transparency. The importance of integrating design innovation and foundry collaboration to build a semiconductor industry that is resilient, responsible, and aligned with global sustainability objectives.

[To Session Table]

Session 1A (T1-1) System-Level Modeling and Design Methodologies
Time: 10:05 - 11:45, Tuesday, January 21, 2025
Location: Room Saturn
Chair: Jing-Jia Liou (National Tsing Hua University, Taiwan)

1A-1 (Time: 10:05 - 10:30)

Title	MACO: A HW-Mapping Co-optimization Framework for DNN Accelerators
Author	*Wujie Zhong, Zijun Jiang, Yangdi Lyu (The Hong Kong University of Science and Technology (Guangzhou), China)
Page	pp. 1 - 7
Keyword	Deep Neural Network, Hardware Accelerator, Multi-Objective Bayesian Optimization
Abstract	Deep neural network (DNN) accelerators have been developed to enhance the effectiveness of DNN models, particularly in resource- constrained devices. Achieving high-throughput and energy-efficient inference within area constraints requires careful consideration of design choices in both hardware (HW) and mapping spaces. To get better performance and energy efficiency, it is important to optimize hardware and mapping together. However, this co-optimization process presents a considerable challenge due to the expansive combined HW-Mapping design space. To find the optimal configu- ration in this large design space, we formulate the exploration of the hardware configuration as an optimization problem, and embed the exploration of the mapping design space into the evaluation stage of optimization. We implement a HW-Mapping co-optimization framework called MACO to find optimal configurations for both hardware and mapping, and provide a generic interface to inte- grate different optimization algorithms, including multi-objective Bayesian optimization (MOBO), non-dominated sorting genetic algorithms (NSGA), and random search. We evaluate our frame- work with four popular DNN models of different properties. Our evaluation shows that the MOBO-based approach can achieve a 30% energy reduction and a 37% latency reduction with the same area as the state-of-the-art HW-Mapping optimization framework.

1A-2 (Time: 10:30 - 10:55)

Title	KAPLA: Scalable NN Accelerator Dataflow Design Space Structuring and Fast Exploring
Author	Zhiyao Li, *Mingyu Gao (Tsinghua University, China)
Page	pp. 8 - 15
Keyword	hardware acceleration, deep learning, dataflow
Abstract	Dataflow scheduling is of vital importance to neural network (NN) accelerators. Recent scalable NN accelerators support a rich set of advanced dataflow techniques. The problems of comprehensively representing and quickly finding optimized dataflow schemes thus become significantly more complicated and challenging. In this work, we first propose comprehensive and pragmatic dataflow representations for temporal and spatial scheduling on scalable multi-node NN architectures. An informal hierarchical taxonomy highlights the tight coupling across different levels of the dataflow space as the major difficulty for fast design exploration. A set of formal tensor-centric directives accurately express various inter-layer and intra-layer schemes, and allow for quickly determining their validity and efficiency. We then build a generic, optimized, and fast dataflow solver, KAPLA. It makes use of the pragmatic directives to explore the design space with effective validity check and efficiency estimation. KAPLA decouples the upper inter-layer level for fast pruning, and solves the lower intra-layer schemes with a novel bottom-up cost descending method. KAPLA achieves within only 2.2% and 7.7% energy overheads on the result dataflow for training and inference, respectively, compared to the exhaustively searched optimal schemes. It also outperforms random and machine-learning-based approaches, with more optimized results and orders of magnitude faster search speed.

1A-3 (Time: 10:55 - 11:20)

Title	Dynamic Co-Optimization Compiler: Leveraging Multi-Agent Reinforcement Learning for Enhanced DNN Accelerator Performance
Author	Arya Fayyazi, Mehdi Kamal, *Massoud Pedram (University of Southern California, USA)
Page	pp. 16 - 22
Keyword	Hardware/Software Co-design, Deep neural network, Multi-Agent Reinforcement Learning
Abstract	This paper introduces a novel Dynamic Co-Optimization Compiler (DCOC), which employs an adaptive Multi-Agent Reinforcement Learning (MARL) framework to enhance the efficiency of mapping machine learning (ML) models, particularly Deep Neural Networks (DNNs), onto diverse hardware platforms. DCOC incorporates three specialized actor-critic agents within MARL, each dedicated to different optimization facets: one for hardware and two for software. This cooperative strategy results in an integrated hardware/software co-optimization approach, improving the precision and speed of DNN deployments. By focusing on high-confidence configurations, DCOC effectively reduces the search space, achieving remarkable performance over existing methods. Our results demonstrate that DCOC enhances throughput by up to 37.95% while reducing optimization time by up to 42.2% across various DNN models, outperforming current state-of-the-art frameworks.

1A-4 (Time: 11:20 - 11:45)

Title	A Computation and Energy Efficient Hardware Architecture for SSL Acceleration
Author	*Huidong Ji (Fudan University, China), Sheng Li (University of Pittsburgh, USA), Yue Cao (Fudan University, China), Chen Ding (Guangdong Institute of Intelligence Science and Technology, China), Jiawei Xu (KTH Royal Institute of Technology, Sweden), Qitao Tan (University of Georgia, USA), Jun Liu (Northeastern University, USA), Ao Li (University of Arizona, USA), Xulong Tang (University of Pittsburgh, USA), Lirong Zheng (Fudan University, China), Geng Yuan (University of Georgia, USA), Zhuo Zou (Fudan University, China)
Page	pp. 23 - 29
Keyword	Self-Supervised Learning, DNN
Abstract	In the domain of Computer Vision (CV), the deployment of advanced Convolutional Neural Networks (CNNs) is often hindered by their substantial computational requirements and massive amounts of labeled data for training. Self-supervised learning (SSL) serves as an effective approach to reducing the reliance on labeled data with the option for augmentation methods to infer and train CNNs. There are efforts to enhance SSL by reducing computational load through block pruning based on similarity, which is particularly critical for CV models. The irreverent feature block’s exclusion from computation speeds up the learning and helps the model find the best optimization point faster. To further exploit the computational parallel processing capabilities, we propose a Field-Programmable Gate Array (FPGA)-based hardware accelerator architecture tailored for SSL framework, leveraging its parallelism and reconfigurability to expedite block matching, optimize sparse convolutions, and manage data reuse, significantly improving resource and energy efficiency. The implementation and evaluation of our work on Xilinx ZCU102 FPGA working at 200 MHz confirm that the similarity finding part’s FPGA accelerations with a low hardware overhead generates a latency of 0.0106 seconds, a 97× speed up compared to CPU, and in the sparse CNN’s FPGA acceleration part, with the processing of VGG16 and ResNet50, our work achieves the highest throughput of 922.17 GOPS, 898.38 GOPS and energy efficiency of 30.71 GOPS/W, 29.47 GOPS/W separately. Compared with the related FPGA-based works, our design claims a maximum of 3.08× throughput improvement and a maximum of 2.47× in energy efficienc

[To Session Table]

Session 1B (T4.2-1) Tools and Techniques for On-Device AI Deployment
Time: 10:05 - 11:45, Tuesday, January 21, 2025
Location: Room Uranus
Chairs: Jongeun Lee (Ulsan National Institute of Science and Technology (UNIST)), Youngsoo Shin (KAIST, Republic of Korea)

1B-1 (Time: 10:05 - 10:30)

Title	Sequential Printed Multilayer Perceptron Circuits for Super-TinyML Multi-Sensory Applications
Author	Gurol Saglam (Karlsruhe Institute of Technology, Germany), Florentia Afentaki, *Georgios Zervakis (University of Patras, Greece), Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Page	pp. 30 - 35
Keyword	Super-TinyML, Approximate Computing, Electrolyte-gated FET, Printed Electronics, Multilayer Perceptrons
Abstract	Super-TinyML aims to optimize machine learning models for deployment on ultra-low-power application domains such as wearable technologies and implants. Such domains also require conformality, flexibility, and non-toxicity which traditional silicon-based systems cannot fulfill. Printed Electronics (PE) offers not only these characteristics, but also cost-effective and on-demand fabrication. However, Neural Networks (NN) with hundreds of features —often necessary for target applications— have not been feasible in PE because of its restrictions such as limited device count due to its large feature sizes. In contrast to the state of the art using fully parallel architectures and limited to smaller classifiers, in this work we implement a super-TinyML architecture for bespoke (application-specific) NNs that surpasses the previous limits of state of the art and enables NNs with large number of parameters. With the introduction of super-TinyML into PE technology, we address the area and power limitations through resource sharing with multi-cycle operation and neuron approximation. This enables, for the first time, the implementation of NNs with up to 35.9x more features and 65.4x more coefficients than the state of the art solutions.

1B-2 (Time: 10:30 - 10:55)

Title	Learning to Prune and Low-Rank Adaptation for Compact Language Model Deployment
Author	*Asmer Hamid Ali (Arizona State University, USA), Fan Zhang (Johns Hopkins University, USA), Li Yang (University of North Carolina at Charlotte, USA), Deliang Fan (Arizona State University, USA)
Page	pp. 36 - 42
Keyword	Efficient-Learning and Inference, Parameter-efficient fine-tuning, Model Pruning, Compact Model
Abstract	Nowadays, parameter-efficient fine-tuning (PEFT) large pre-trained models (LPMs) for downstream task have gained significant popularity, since it could significantly minimize the training computational overhead. The representative work, LoRA, learns a low-rank adaptor for new downstream task, rather than fine-tuning the whole backbone model. However, for inference, the final learned model size increases, leading to inefficient inference computation. To mitigate this, in this work, we are the first to propose a learning-to-prune methodology specially designed for fine-tuning downstream tasks based on LPMs with low-rank adaptation. Unlike prior low-rank adaptation approaches that only learn the low-rank adaptors for downstream tasks, our method further leverages the Gumbel-Sigmoid tricks to learn a set of trainable binary channel-wise masks that automatically prune the backbone LPMs. Therefore, our method could leverage the benefits of low-rank adaptation to reduce the training parameters size and smaller pruned backbone LPM size for efficient inference computation. Extensive experiments show that the Pruned-RoBbase model with our method achieves an average channel-wise structured pruning ratio of 24.5% across the popular GLUE Benchmark, coupled with an average of 18% inference time speed-up in real NVIDIA A5000 GPU. The Pruned-DistilBERT shows an average of 13% inference time improvement with 17% sparsity. The Pruned-LLaMA-7B model achieves up to 18.2% inference time improvement with 24.5% sparsity, demonstrating the effectiveness of our learnable pruning approach across different models and tasks.

1B-3 (Time: 10:55 - 11:20)

Title	LightCL: Compact Continual Learning with Low Memory Footprint For Edge Device
Author	*Zeqing Wang, Fei Cheng, Kangye Ji, Bohu Huang (Xidian University, China)
Page	pp. 43 - 50
Keyword	Efficient Training, Continual Learning, Catastrophic Forgetting, Edge Computing
Abstract	Continual learning (CL) is a technique that enables neural networks to constantly adapt to their dynamic surroundings. Despite being overlooked for a long time, this technology can considerably address the customized needs of users in edge devices. Actually, most CL methods require huge resource consumption by the training behavior to acquire generalizability among all tasks for delaying forgetting regardless of edge scenarios. Therefore, this paper proposes a compact algorithm called LightCL, which evaluates and compresses the redundancy of already generalized components in structures of the neural network. Specifically, we consider two factors of generalizability, learning plasticity and memory stability, and design metrics of both to quantitatively assess generalizability of neural networks during CL. This evaluation shows that generalizability of different layers in a neural network exhibits a significant variation. Thus, we Maintain Generalizability by freezing generalized parts without the resource-intensive training process and Memorize Feature Patterns by stabilizing feature extracting of previous tasks to enhance generalizability for less-generalized parts with a little extra memory, which is far less than the reduction by freezing. Experiments illustrate that LightCL outperforms other state-of-the-art methods and reduces at most 6.16× memory footprint. We also verify the effectiveness of LightCL on the edge device.

1B-4 (Time: 11:20 - 11:45)

Title	Skip2-LoRA: A Lightweight On-device DNN Fine-tuning Method for Low-cost Edge Devices
Author	*Hiroki Matsutani, Masaaki Kondo, Kazuki Sunaga (Keio University, Japan), Radu Marculescu (The University of Texas at Austin, USA)
Page	pp. 51 - 57
Keyword	On-device learning, Fine-tuning, DNN, Edge AI, LoRA
Abstract	This paper proposes Skip2-LoRA as a lightweight fine-tuning method for deep neural networks to address the gap between pre-trained and deployed models. In our approach, trainable LoRA (low-rank adaptation) adapters are inserted between the last layer and every other layer to enhance the network expressive power while keeping the backward computation cost low. This architecture is well-suited to cache intermediate computation results of the forward pass and then can skip the forward computation of seen samples as training epochs progress. We implemented the combination of the proposed structure and cache, denoted as Skip2-LoRA, and tested it on a $15 single board computer. Our results show that Skip2-LoRA reduces the fine-tuning time by 90.0% on average compared to the counterpart that has the same number of trainable parameters while preserving the accuracy, while taking only a few seconds on the microcontroller board.

[To Session Table]

Session 1C (T8-1) AI and Logic Synthesis – A perfect match?
Time: 10:05 - 11:20, Tuesday, January 21, 2025
Location: Room Venus
Chairs: Christophe Dubach (McGill University, Canada), Andrea Costamagna (EPFL, Switzerland)

1C-1 (Time: 10:05 - 10:30)

Title	High-Effort Logic Synthesis Using Randomized Transduction
Author	*Yukio Miyasaka (UC Berkeley/X, the moonshot factory, USA), Alan Mishchenko, John Wawrzynek (UC Berkeley, USA), Dino Ruić, Xiaoqing Xu (X, the moonshot factory, USA)
Page	pp. 58 - 64
Keyword	Combinational logic synthesis, AIG, High-effort, Transduction method, Stochastic flow
Abstract	High-effort logic synthesis has become an important research direction due to the increase in silicon cost and the growth of design complexity. The emphasis on security leads to complex cryptographic circuits, while the acceleration of AI/ML results in custom arithmetic blocks—all of which need to be highly optimized by EDA tools. In such applications, high-effort logic synthesis allows for an efficient exploration of larger solution spaces, leading to area and power savings beyond the capacity of traditional methods. This paper presents a novel variation of high-effort logic synthesis called transduction, which performs transformation and reduction using don't-cares to restructure the circuit. Integrating the proposed method into a stochastic optimization flow with dynamic scheduling saved 6.8% AIG nodes on average, compared to the original flow using the same runtime. An additional experiment further demonstrated the strength of the proposed method, which derived smaller AIGs than the previously synthesized minimum AIGs for 46 out of 100 benchmarks.

1C-2 (Time: 10:30 - 10:55)

Title	PIRLLS: Pretraining with Imitation and RL Finetuning for Logic Synthesis
Author	*Guande Dong, Jianwang Zhai, Hongtao Cheng, Xiao Yang, Chuan Shi, Kang Zhao (Beijing University of Posts and Telecommunications, China)
Page	pp. 65 - 71
Keyword	Logic Synthesis, Imitation Learning, Reinforcement Learning, Design Space Exploration
Abstract	As a key step in digital integrated circuit (IC) design, logic synthesis involves various logic optimization algorithms, where the quality of results (QoR) depends heavily on the optimization sequence used.Exploring the optimization space is challenging as the number of potential optimal permutations grows exponentially. Traditional methods rely on manual adjustments by experts, but are difficult to deal with complex and different circuits, leading to significant optimality gaps.Many automatic methods have been introduced, but still face problems of low generalization and low efficiency. In this work, we propose PIRLLS, a two-stage learning framework for imitation learning on expert trajectories followed by reinforcement learning (RL) finetuning, to efficiently explore the optimal synthesis flows.Firstly, PIRLLS uses imitation learning to pretrain fast and high-performance intelligent policy, to fully leverage the offline knowledge of a large corpus of high-quality expert trajectories. Then, the pretrained policy is finetuned for target circuits using the proximal policy optimization (PPO) algorithm and policy distillation to obtain better results.Compared with the state-of-the-art (SOTA) method, our framework can effectively improve the quality of logic optimization and significantly speed up the exploration time.

1C-3 (Time: 10:55 - 11:20)

Title	MTLSO: A Multi-Task Learning Approach for Logic Synthesis Optimization
Author	Faezeh Faez, Raika Karimi, *Yingxue Zhang (Huawei Noah’s Ark Lab, Canada), Xing Li, Lei Chen, Mingxuan Yuan (Huawei Noah’s Ark Lab, China), Mahdi Biparva (Huawei Noah’s Ark Lab, Canada)
Page	pp. 72 - 78
Keyword	logic synthesis optimization, QoR prediction, multi-task learning, hierarchical graph representation learning
Abstract	Electronic Design Automation (EDA) is essential for IC design and has recently benefited from AI-based techniques to improve efficiency. Logic synthesis, a key EDA stage, transforms high-level hardware descriptions into optimized netlists. Recent research has employed machine learning to predict Quality of Results (QoR) for pairs of And-Inverter Graphs (AIGs) and synthesis recipes. However, the severe scarcity of data due to a very limited number of available AIGs results in overfitting, significantly hindering performance. Additionally, the complexity and large number of nodes in AIGs make plain GNNs less effective for learning expressive graph-level representations. To tackle these challenges, we propose MTLSO - a Multi-Task Learning approach for Logic Synthesis Optimization. On one hand, it maximizes the use of limited data by training the model across different tasks. This includes introducing an auxiliary task of binary multi-label graph classification alongside the primary regression task, allowing the model to benefit from diverse supervision sources. On the other hand, we employ a hierarchical graph representation learning strategy to improve the model's capacity for learning expressive graph-level representations of large AIGs, surpassing traditional plain GNNs. Extensive experiments across multiple datasets and against state-of-the-art baselines demonstrate the superiority of our method, achieving an average performance gain of 8.22% for delay and 5.95% for area.

[To Session Table]

Session 1D (T3-1) Application-Specific Computing-In-Memory
Time: 10:05 - 11:45, Tuesday, January 21, 2025
Location: Room Mars/Mercury
Chairs: Hiromitsu Awano (Kyoto University, Japan), Yiming Chen (Tsinghua University, China)

1D-1 (Time: 10:05 - 10:30)

Title	ReTAP: Processing-in-ReRAM Bitap Approximate String Matching Accelerator for Genomic Analysis
Author	Tsung-Yu Liu (Academia Sinica, Taiwan), Yen An Lu (Cornell University, USA), James Yu (Georgia Institute of Technology, USA), *Chin-Fu Nien (National Yang Ming Chiao Tung University, Taiwan), Hsiang-Yun Cheng (Academia Sinica, Taiwan)
Page	pp. 79 - 85
Keyword	Genome sequencing, Read mapping, Approximate string matching (ASM), Processing-in-memory (PIM), Resistive random access memory (ReRAM)
Abstract	Read mapping, which involves computationally intensive approximate string matching (ASM) on large datasets, is the primary performance bottleneck in genome sequence analysis. To accelerate read mapping, a processing-in-memory (PIM) architecture that conducts highly parallel computations within the memory to reduce energy-inefficient data movements can be a promising solution. In this paper, we present ReTAP, a processing-in-ReRAM Bitap accelerator for genomic analysis. Instead of using the intricate dynamic programming algorithm, our design incorporates the Bitap algorithm, which uses only simple bitwise operations to perform ASM. Additionally, we explore the opportunity to reduce redundant computations by dynamically adjusting the error tolerance of Bitap and co-design the hardware to enhance computation parallelism. Our evaluation demonstrates that ReTAP outperforms GenASM, the state-of-the-art Bitap accelerator, with a 5.74x throughput and 1.79x energy efficiency.

Best Paper Candidate
1D-2 (Time: 10:30 - 10:55)

Title	High-Parallel In-Memory NTT Engine with Hierarchical Structure and Even-Odd Data Mapping
Author	*Bing Li, Huaijun Liu (Capital Normal University, China), Yibo Du (Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Ying Wang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Page	pp. 86 - 92
Keyword	Digital in-SRAM computing, Number theoretic transform, Data Engine
Abstract	The number theoretic transform (NTT) significantly impacts the execution time of FHE in practical applications. Accelerating NTT becomes a major research focus. However, NTT encounters memory bottlenecks due to enormous data volumes. Computing-in-memory (CIM) is a promising solution to address this issue. Implementing an efficient CIM-based NTT engine is challenging due to NTT's unique operations and large bit-width data. While the butterfly-based NTT is popular for its optimized computational complexity, CIM's computing parallelism is hindered by data dependencies across stages in butterfly computation. To fully leverage CIM’s capabilities, we propose HP-CIM, a high-parallelism digital SRAM-based CIM NTT engine designed for large-scale NTT operations. HP-CIM integrates with MVM-based NTT, featuring a hierarchical SRAM architecture that enhances data-level and bit-level parallelism and a novel even-odd data mapping technique that facilitates large-scale NTT with minimal computational overhead. Experimental results demonstrate that HP-CIM achieves almost 3.08x reduction in execution time and 4.96x energy reduction compared to prior CIM-based NTT designs, demonstrating its superior performance.

1D-3 (Time: 10:55 - 11:20)

Title	Efficient and Reliable Vector Similarity Search Using Asymmetric Encoding with NAND-Flash for Many-Class Few-Shot Learning
Author	*Hao-Wei Chiang, Chi-Tse Huang (National Taiwan University, Taiwan), Hsiang-Yun Cheng (Academia Sinica, Taiwan), Po-Hao Tseng, Ming-Hsiu Lee (Macronix International Co. Ltd, Taiwan), An-Yeu (Andy) Wu (National Taiwan University, Taiwan)
Page	pp. 93 - 99
Keyword	NAND-Flash, Thermometer Code, Vector Similarity Search, In-memory Search
Abstract	While memory-augmented neural networks (MANNs) offer an effective solution for few-shot learning (FSL) by integrating deep neural networks with external memory, the capacity requirements and energy overhead of data movement become enormous due to the large number of support vectors in many-class FSL scenarios. Various in-memory search solutions have emerged to improve the energy efficiency of MANNs. NAND-based multi-bit content addressable memory (MCAM) is a promising option due to its high density and large capacity. Despite its potential, MCAM faces limitations such as a restricted number of word lines, limited quantization levels, and non-ideal effects like varying string currents and bottleneck effects, which lead to significant accuracy drops. To address these issues, we propose several innovative methods. First, the Multi-bit Thermometer Code (MTMC) leverages the extensive capacity of MCAM to enhance vector precision using cumulative encoding rules, thereby mitigating the bottleneck effect. Second, the Asymmetric Vector Similarity Search (AVSS) reduces the precision of the query vector while maintaining that of the support vectors, thereby minimizing the search iterations and improving efficiency in many-class scenarios. Finally, the Hardware-Aware Training (HAT) method optimizes controller training by modeling the hardware characteristics of MCAM, thus enhancing the reliability of the system. Our integrated framework reduces search iterations by up to 32 times, and increases overall accuracy by 1.58% to 6.94%.

1D-4 (Time: 11:20 - 11:45)

Title	DCiROM: A Fully Digital Compute-in-ROM Design Approach to High Energy Efficiency of DNN Inference at Task Level
Author	*Tianyi Yu, Tianyu Liao, Mufeng Zhou, Xiaotian Chu, Guodong Yin, Mingyen Lee, Yongpan Liu, Huazhong Yang, Xueqing Li (Tsinghua University, China)
Page	pp. 100 - 105
Keyword	compute-in-memory (CiM), read-only memory (ROM), multiply-and-accumulate (MAC), deep neural network (DNN)
Abstract	Owing to mature fabrication support and high flexibility, static random-access memory (SRAM) has become a very promising candidate for compute-in-memory (CiM) in accelerating deep neural networks (DNNs). However, SRAM-based CiM has low memory density and thus very limited total on-chip capacity, resulting in frequent weights reloading and additional power consumption during end-to-end inference tasks. Analog ROM CiM increases memory density but suffers from low computing density caused by A/D converter (ADC) limitation. To address these challenges, for the first time, a fully digital compute-in-read-only-memory (DCiROM) design approach is proposed in this paper. DCiROM introduces a novel ROM-logic fusion CiM that successfully reduces CiM area by 51% while maintaining high memory density and computing performance. By reusing multiply-and-accumulation (MAC) resources, DCiROM further achieves flexibility with a minimal area cost. We have implemented a DCiROM chip loaded 3024Kb ResNet-56 parameters using 65nm CMOS technology. This macro achieves 10.2x-55.7x higher normalized FoM (memory density x computing density) than the state-of-the-art CiM works. It also reduces 2.9x-9.9x energy consumption per image inference than SRAM CiM works when considering off-chip access.

[To Session Table]

Session 1E (SS-1) Machine Learning Based Physical Simulation and Physics-Aware Optimization
Time: 10:05 - 11:45, Tuesday, January 21, 2025
Location: Innovation Hall
Chairs: Wenjian Yu (Tsinghua University, China), Yuanqing Cheng (Beihang University, China)

1E-1 (Time: 10:05 - 10:30)

Title	(Invited Paper) Deep Learning Inspired Capacitance Extraction Techniques
Author	*Wenjian Yu, Shan Shen, Dingcheng Yang, Haoyuan Li, Jiechen Huang, Chunyan Pei (Tsinghua University, China)
Page	pp. 106 - 112
Keyword	Capacitance extraction, convolutional neural network, deep learning, graph neural network, layout parasitic extraction
Abstract	With the advancement of integrated circuit (IC), the process technology becomes more complicated and the design margin shrinks. Thus, the parasitic extraction is more demanded during IC design. In this invited paper, we survey the research progress on IC capacitance extraction, especially the usage of deep-learning technologies in relevant problems. Firstly, a method based on graph neural network (GNN) for predicting the parasitic capacitances in the pre-layout design stage is presented. It exhibits potential benefit for the optimization of SRAM design. Then, the deep-learning-inspired methods for post-layout capacitance extraction are presented, including CNN-Cap, NAS-Cap and GNN-Cap, etc. They can revamp the accuracy drawback of layout parasitic extraction (LPE) method and the efficiency drawback of 3-D capacitance field solver. Lastly, we briefly review the deep-learning technique for improving the accuracy of the random walk based 3-D capacitance solver for the structures under the advanced process technology.

1E-2 (Time: 10:30 - 10:55)

Title	(Invited Paper) Enhanced Operator Learning for Scalable and Ultra-fast Thermal Simulation in 3D-IC Design
Author	Xinling Yu, Ziyue Liu (UCSB, USA), Hai Li, Ian Young (Intel Corporation, USA), *Zheng Zhang (UCSB, USA)
Page	p. 113
Keyword	thermal analysis, 3D IC, Operator learning, sepearable training
Abstract	Design of 3D-IC and chiplet systems require high-fidelity PDE simulation of thermal bevaviors. Due to the varying input signals and PDE configurations (e.g., material paramters, boundary conditions, geometric parameters), a PDE needs to be simulated many times to evaluate different design cases. This talk present DeepOHeat and its enhancement technique. DeepOHeat can use one run of physics-constrained training to generate a surrogate model for various PDE configurations, allowing real-time thermal prediction under unforeseen power maps. By utilizing seperable architecture, we can reduce the training time by about 60 times.

1E-3 (Time: 10:55 - 11:20)

Title	(Invited Paper) Boosting the Performance of Transistor-Level Circuit Simulation with GNN
Author	Jiqing Jiang, Yongqiang Duan, *Zhou Jin (China University of Petroleum-Beijing, China)
Page	pp. 114 - 120
Keyword	DC analysis, Pseudo transient analysis, Graph Neural Network, Spice Simulation
Abstract	Efficiently solving DC operating points for large-scale nonlinear circuits in SPICE simulation is both critical and challenging. Pseudo transient analysis (PTA) is a widely used and promising approach for DC analysis, with the pseudo element embedding strategy playing a key role in ensuring convergence and simulation efficiency. In this paper, we present GPTA, a graph neural network (GNN) enhanced PTA method that adaptively positions embeddings by considering circuit topology. GPTA transforms the nonlinear DC circuits into linearized graph representations and then integrates multi-head messaging, adaptive message filtering, and multi-scale information fusion in the GNN model to improve feature extraction. Additionally, a layer-by-layer pooling and prediction strategy effectively retains intermediate layer information, enhancing model expressiveness. Numerical results show that GPTA significantly improves the efficiency of DC analysis in terms of both convergence and simulation speed.

1E-4 (Time: 11:20 - 11:45)

Title	(Invited Paper) Emag-Aware ML-Based Layout Optimization for High-Speed IC Design
Author	*Garth Sundberg (ANSYS Inc., USA), Rodger Luo (ANSYS Inc., China)
Page	pp. 121 - 127
Keyword	AMOP, automation, high-speed interface, impedance matching, machine learning
Abstract	Machine learning (ML) based optimization techniques are well suited for creating efficient floorplans for designs that require multiple spiral inductors. The non-linear nature of inductive coupling in general as well as the complex coupling mechanisms due to dense routing, semiconducting substrate and DRC based routing restrictions make the manual optimization of floorplans in such designs extremely error-prone and time consuming. This paper proposes a unique approach to automate this effort. The use of an adaptive metamodel of optimal prognosis (AMOP - an iterative meta-modelling approach based on the MOP using an adaptive refinement of the data points) based optimizer that relies on a high-capacity electromagnetic modeling engine and circuit simulator in series to automatically size and place spiral devices for optimal circuit performance. This significantly reduces design cycle time and manual effort and provides high confidence that an optimal solution has been found. We use a specific high-speed driver design with read/write channels requiring the use of T-coils to demonstrate our thesis.

[To Session Table]

Session 1F (SS-CEDA) CEDA/CASS/SSCS Joint session on Silicon Photonics
Time: 10:05 - 11:45, Tuesday, January 21, 2025
Chair: Tsung-Yi Ho (The Chinese University of Hong Kong, Hong Kong)

1F-1 (Time: 10:05 - 10:30)

Title	(Invited Paper) Bridging EDA and Silicon Photonics Design: Enabling Robust-by-Design Photonic Integrated Circuits
Author	Zahra Ghanaatian, Asif Mirza, Amin Shafiee, Sudeep Pasricha, *Mahdi Nikdast (Colorado State University, USA)
Page	pp. 128 - 134
Keyword	Silicon photonics, design automation and optimization
Abstract	Silicon photonic devices are essential components of integrated optical communication systems and emerging photonic processors. However, their performance is notably impacted by fabrication-process variations (FPVs), which primarily stem from optical lithography imperfections. The impact of FPVs can accumulate and deteriorate the system-level performance through, for example, increasing system power consumption, accumulated crosstalk noise, and degrading signal integrity in photonic systems. In this paper, we discuss the promise of variation-aware design-space exploration and optimization to enhance photonic device robustness under different FPVs while considering two silicon photonic devices used widely in different applications, namely Microring Resonators (MRRs) and Mach–Zehnder Interferometers (MZIs). In addition, we consider a system-level case study of an MZI-based coherent neural network, where we show how our proposed variation-aware design optimization at the device level helps improve the network accuracy by up to 88% under FPVs.

1F-2 (Time: 10:30 - 10:55)

Title	(Invited Paper) SPICE-Compatible Modeling and Design for Electronic-Photonic Integrated Circuits
Author	*Yuxiang Fu, Yinyi Liu (Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong), Ngai Wong (Department of Electrical and Electronic Engineering, University of Hong Kong, Hong Kong), Jiang Xu (Microelectronics Thrust, Hong Kong University of Science and Technology (Guangzhou), China)
Page	pp. 135 - 140
Keyword	compact model, silicon photonic device, device modeling, physics-informed neural network
Abstract	Electronic-photonic integrated circuit (EPIC) technologies are revolutionizing computing systems by improving their performance and energy efficiency. However, simulating EPIC is challenging and time consuming. In this paper, we propose the physics-informed neural network (PINN) based SPICE-compatible modeling method for EPICs. Experimental results show our method can speed up EPIC simulation by more than 100 times on average compared to FDTD method.

1F-3 (Time: 10:55 - 11:20)

Title	(Invited Paper) Modeling and Simulation of Silicon Photonics Systems in SystemVerilog/XMODEL
Author	*Jaeha Kim (Seoul National University/Scientific Analog, Inc., Republic of Korea)
Page	pp. 141 - 146
Keyword	silicon photonics, modeling, simulation, SystemVerilog, XMODEL
Abstract	Silicon photonics integrates both photonic and electronic components on the same silicon chip and promises ultra-dense, high-bandwidth interconnects via wavelength division multiplexing (WDM). However, when verifying such silicon photonic systems, the existing IC simulators face challenges due to the WDM signals containing multiple frequency tones at ~200-THz with ~50-GHz spacing. This paper presents a systematic approach to modeling the silicon photonic elements and devices as equivalent multi-port transmission lines using XMODEL primitives and simulating the WDM link models in an efficient, event-driven fashion in SystemVerilog. 5Gb/s, 3-channel WDM link models with micro-ring, Mach-Zehnder, and electro-absorption modulators demonstrate the simulation speeds of 4.2, 8.3, and 8.3 symbols/second, respectively.

1F-4 (Time: 11:20 - 11:45)

Title	(Invited Paper) Si Photonic Ring-Resonator-Based WDM Transceivers
Author	*Woo-Young Choi, Dae-Won Rho (Yonsei University, Republic of Korea), Jae-Koo Park (Yonsei University/Samsung Electronics, Republic of Korea), Seung-Jae Yang, Jae-Ho Lee, Yongjin Ji (Yonsei University, Republic of Korea)
Page	p. 147
Keyword	Si Photonics
Abstract	There is increasing interest in Si photonic interconnect solutions, especially for AI/ML applications. Silicon ring-resonator-based modulators and wavelength filters, in particular, attract significant attention due to their potential for delivering energy-efficient, high-bandwidth interconnects within a compact footprint. However, silicon ring resonators are highly sensitive to temperature variations, necessitating smart circuit techniques to maintain stable operation. This presentation covers research results at Yonsei University on Si photonic ring-resonator-based WDM transceivers with integrated temperature controllers. It also demonstrates how equivalent circuit models for Si ring modulators are implemented, enabling Tx and Rx simulations within standard circuit design environments.

[To Session Table]

Session 2A (T5-1) Accelerating Vision and Transformer Models
Time: 13:15 - 15:20, Tuesday, January 21, 2025
Location: Room Saturn
Chairs: Quan Chen (Southern University of Science and Technology, China), Sheldon Tan (UC Riverside, USA)

Best Paper Award
2A-1 (Time: 13:15 - 13:40)

Title	ViDA: Video Diffusion Transformer Acceleration with Differential Approximation and Adaptive Dataflow
Author	*Li Ding, Jun Liu, Shan Huang, Guohao Dai (Shanghai Jiao Tong University, China)
Page	pp. 148 - 154
Keyword	Diffusion Transformer, Video Generation, Neural Network Accelerator
Abstract	Recent advancements in Video Diffusion Transformer (VDiT) mod- els have greatly promoted the development of video generation, as exemplified by Sora of OpenAI. However, there are still two challenges for VDiT: 1) There is still existing large inter-frame redundant computation. Previous works on reducing compu- tation based on inter-frame similarity simply consider the Act-W operators. The remaining Act-Act operators still dominate the ex- ecution of VDiT (about 57%). 2) Operational intensity varies greatly, leading to under-utilization. There is a massive gap between the operational intensity of Act-W and Act-Act operators in VDiT with multiple frames. Previous works with the static hard- ware architecture and dataflow lead to under-utilization (<36.42%). In this paper, we propose ViDA, a Vi deo Diffusion Transformer Accelerator with D ifferential A pproximation and A daptive D ataflow. 1) At the algorithm level, we propose the differential approxima- tion method that exploits similarity for both Act-Act and Act-W operators to reduce redundant computation. 2) At the hardware level, we propose the column-concentrated PE by exploiting the column sparsity pattern in differential computing. 3) At the dataflow level, we propose an intensity adaptive dataflow archi- tecture to balance the execution of operators with significant opera- tional intensity differences. Experiments show that ViDA achieves average 16.44×/2.18× speedup and 18.39×/2.35× area efficiency compared with NVIDIA A100 GPU and SOTA vision accelerator.

2A-2 (Time: 13:40 - 14:05)

Title	APTO: Accelerating Serialization-Based Point Cloud Transformers with Position-Aware Pruning
Author	*Qichu Sun, Rui Meng, Haishuang Fan (State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Fangqiang Ding (University of Edinburgh, UK), Linxi Lu (State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Jingya Wu, Xiaowei Li (State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, China), Guihai Yan (State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences/YUSUR Technology Co., Ltd., China)
Page	pp. 155 - 162
Keyword	Point cloud, Transformer, Serialization, Attention, Accelerator
Abstract	Point cloud processing has broad applications in autonomous driving and robotics. Serialization-based point cloud transformers map unordered point clouds onto directed curves, use sparse convolution for down-sampling, and apply attention in local windows to capture spatial relationships. Despite achieving great accuracy, these models face inference latency challenges: neighbor search in sparse convolution exhibits low parallelism; attention computation remains complex, especially with larger window sizes; softmax introduces data dependencies. This paper proposes APTO, an accelerator for serialization-based models. It uses voxels' z-curve indices to perform neighbor searches in parallel, employs a position-aware pruning strategy using neighboring voxel counts to eliminate useless attention computations, and adopts a fine-grained attention dataflow for parallel processes with minimal data dependencies. Besides, its hardware has dedicated computation cores for efficient processing. Evaluations show that APTO achieves average 10.22 times, 3.53 times and 2.70 times speedups over RTX 4090 GPU, PointAcc, and SpOctA, with 153.59 times, 8.57 times and 7.25 times energy savings.

2A-3 (Time: 14:05 - 14:30)

Title	UEDA: A Universal And Efficient Deformable Attention Accelerator For Various Vision Tasks
Author	Kairui Sun, *Meiqi Wang, Junhai Zhou, Zhongfeng Wang (Sun Yat-sen University, China)
Page	pp. 163 - 169
Keyword	Deformable attention, systolic array, hardware accelerator, FPGA
Abstract	Deformable attention (DA) provides an efficient and adaptive solution for capturing diverse object shapes, reducing computational complexity across multiple vision tasks. However, the various DA types lead to flexible matrix computation dimensions and complex attention pipelines. Moreover, the sampling operator's dynamic and irregular memory access significantly reduces data reuse and processing elements (PE) utilization, hindering DA from fully leveraging its low computational complexity. In this paper, we propose UEDA, a universal and efficient accelerator for DA. Specifically, a flexible 3D Folded Dimension Systolic Array (FDSA) is designed for efficient matrix multiplication computations with various dimensions, while achieving consistent high efficiency in supporting multiple networks. Secondly, a Reorganized Feature Map (RFM) sampling strategy is proposed to address parallel memory access conflicts, boosting the sampling module’s processing rate by up to 4 times. Finally, an Inter-tile Cross Parallel (ITCP) dataflow is proposed to further hide the sampling module’s latency, enhancing circuit throughput. The proposed UEDA is implemented on a Xilinx UltraScale+ FPGA. Experimental results show that UEDA achieves 11.7-14.8× speedup and 17.8-29.1x energy efficiency compared with GPU. Furthermore, we observe up to 2.19× better speedup and 27.8× higher energy efficiency compared to prior FPGA accelerators.

2A-4 (Time: 14:30 - 14:55)

Title	Deploying Diffusion Models with Scheduling Space Search and Memory Overflow Prevention Based on Graph Optimization
Author	*Hao Zhou, Yang Liu, Hongji Wang (Fudan University, China), EnHao Tang (Nanjing University, China), Shun Li (Southeast University, China), Yifan Zhang (Fudan University, China), Guohao Dai (Shanghai Jiao Tong University, China), Yongpan Liu (Tsinghua University, China)
Page	pp. 170 - 176
Keyword	FPGA, Diffusion Model, Inference, Mapping Flow
Abstract	In recent years, Neural Networks developed rapidly to deal with tasks in the field of Computer Vision and Natural Language Process, etc. With the development of AI Generated Content, U-Net based Diffusion Models (DM) take image synthesis to new heights. U-Net performs the noise prediction of DM, the latency of which accounts for the majority of the end-to-end latency. Although FPGA has been proven to be a high performance platform to deploy NN, a series of facts still pose challenges for efficient U-Net based DMs deployment based on FPGA. The input vector length and type of the special function vary between different layers. The absence of model periodicity increases the granularity and complexity of operator scheduling. Skip-connection and residual connection inside model cause meta-data retaining in the memory, which is not conductive to avoiding memory overflow and decreasing total off-chip memory access. In this paper, we propose DMAcc, a FPGA-based accelerator for U-Net based DMs. To increase the overlap of operator execution time, we introduce a unified Special Function Unit (SFU), which integrates all kinds of special functions in U-Net. SFU and Matrix Multiplication Unit (MMU) execute in parallel. Moreover, a series of front-end algorithm is proposed to improve hardware efficiency in the way of represent DMs with computational graph. We propose latency-oriented look-ahead scheduling to improve system performance. To prevent memory overflow, we optimize the dataflow of operators and propose a data filtering metric. We evaluate the performance of DMAcc on Stable Diffusion v1.5 and SDXL. Compared with CPU, DMAcc achieve 4.41× ~ 5.31× speed-up and 13.31× ~ 23.46× energy efficiency improvement. DMAcc is 2.27× ~ 2.52× more energy efficient than GPU. Evaluation of the front-end algorithm indicates that operator scheduling contributes to a 6.3% speedup, and memory optimization reduces DDR accesses by 16%.

2A-5 (Time: 14:55 - 15:20)

Title	TWDP: A Vision Transformer Accelerator with Token-Weight Dual-Pruning Strategy for Edge Device Deployment
Author	*Guang Yang, Xinming Yan, Hui Kou, Zihan Zou, Qingwen Wei, Hao Cai, Bo Liu (Southeast University, China)
Page	pp. 177 - 182
Keyword	Vision Transformer, Token Pruning, Weight Sparsity, Average Hessian Trace, Hardware Accelerator
Abstract	Vision Transformers (ViTs) have attracted significant attention due to their superior accuracy compared to convolutional neural networks (CNNs) in various computer vision tasks. However, their substantial computational load and significant memory footprint lead to excessive delay and considerable data storage overhead, posing challenges for resource-limited edge device deployment. To address these issues, we present TWDP, a vision transformer accelerator employing a Token-Weight Dual-Pruning strategy to enhance the efficiency of the inference process. Firstly, we propose a parameter-free self-adaptive token pruning method to skip redundant computations in an image-dependent manner. Secondly, we apply a Hessian-aware layer-wise N:M weight pruning approach to minimize storage overhead, memory access, and computational power consumption. Additionally, to manage the complex computing patterns in ViTs, an overlapping dataflow is utilized to further reduce temporal storage and inference latency. Implemented and evaluated under an industrial 28nm technology, the proposed TWDP framework reduces 66.1% weight storage requirements and achieves an energy efficiency of 2070.9 FPS/W. Compared to state-of-the-art architectures, TWDP obtains a 1.6× energy efficiency improvement with negligible accuracy loss, demonstrating the superiority of TWDP in edge device deployment scenarios.

[To Session Table]

Session 2B (T6-1) Shaping the Future of Analog EDA
Time: 13:15 - 14:55, Tuesday, January 21, 2025
Location: Room Uranus
Chairs: Zhou Jin (China University of Petroleum, China), Nobukazu Takai (Kyoto Institute of Technology)

Best Paper Award
2B-1 (Time: 13:15 - 13:40)

Title	A Practical Randomized GMRES Algorithm for Solving Linear Equation System in Circuit Simulation
Author	*Baiyu Chen, Jiawen Cheng, Wenjian Yu (Tsinghua University, China)
Page	pp. 183 - 189
Keyword	generalized minimal residual (GMRES) method, large-scale circuit simulation, sparse matrix, random sketching
Abstract	Efficient solver for general linear equations is of significance for EDA problems. The generalized minimal residual (GMRES) method, which can solve general linear equations efficiently, is one of the most widely-used fundamental algorithms. Randomized Arnoldi process, which leverages sketched least-squares solver to orthogonalize Krylov subspace basis, has shown potential to promote the effectiveness of Arnoldi process, the core step in GMRES. However, how to make it more efficient, and utilize it to develop a practical GMRES solver is still an open problem. In this work, we aim at obtaining a practically-useful randomized GMRES algorithm (named PRGMRES) for solving general sparse linear equations. Firstly, an efficient estimator of residual error based on a modified randomized Gram-Schmidt process and a double-tolerance scheme are proposed to enable a practical restarted GMRES algorithm which terminates at a solution satisfying the specified accuracy tolerance. Then, a linear-time-complexity sketching algorithm based on Rademacher matrices is proposed to facilitate fast and robust random sketching. After that, incremental solution of the sketched least-squares problems, and the theoretical analysis supporting smaller sketching size are presented. Based on the above proposed techniques and theoretical results, the PRGMRES algorithm, which has stronger theoretically-supported stability and efficiency, is proposed. Numerical experiments on various circuit simulation problems validate the efficiency and effectiveness of the proposed algorithm.

2B-2 (Time: 13:40 - 14:05)

Title	Balancing Objective Optimization and Constraint Satisfaction for Robust Analog IC Design Automation
Author	*Jintao Li (University of Electronic Science and Technology of China, China), Haochang Zhi (Southeast University, China), Jiang Xiao (University of Electronic Science and Technology of China, China), Yanhan Zeng (Guangzhou University, China), Weiwei Shan (Southeast University, China), Yun Li (University of Electronic Science and Technology of China, China)
Page	pp. 190 - 196
Keyword	analog circuit optimization, electronic design automation, multi-factorial evolution, knowledge transfer
Abstract	Automated design of analog integrated circuits (ICs) involves balancing multiple objectives under process, voltage, and temperature (PVT) variations. An excess of constraints can ensnare algorithms in local optima, while the variations elevate the costs of simulation. To address this challenge, we propose a two-search mode multi-task evolutionary framework to balance objective optimization and constraint satisfaction under variations. Specifically, considering the inherent relationships between objective optimizations and constraint violations, our method adaptively switches between unconstrained surrogate-assisted and constrained simulation-driven search modes. Furthermore, our framework treats PVT variations as a multi-task challenge, facilitating inter-corner knowledge transfer via multi-task evolution, which substantially lowers simulation costs. Our framework has been evaluated using two different sensing elements and an amplifier within a 22 nm process. Based on Monte-Carlo simulations, compared to multi-task reinforcement learning, this method not only attains a 60% to 80% reduction in the relative inaccuracy of sensing elements but also accomplishes a 60% decrease in total runtime.

2B-3 (Time: 14:05 - 14:30)

Title	Analog Circuit Transfer Method Across Technology Nodes via Transistor Behavior
Author	*Haochang Zhi (Southeast University, China), Jintao Li, Yun Li (Shenzhen Institute for Advanced Study, UESTC, China), Weiwei Shan (Southeast University, China)
Page	pp. 197 - 203
Keyword	Technology Nodes Transfer, Analog Circuit Design Automation, MOSFET Modeling, Transistor Behavioral Model, Circuit Representation
Abstract	In the post-Moore era, chips integrate multiple technology node chiplets, necessitating repeated implementations of the same circuit topology across nodes, highlighting the need for technology-independent circuit representation. We use a four-parameter vector—𝑔𝑚 , 𝑓𝑡 , 𝑉𝐷𝑆 , and Δ𝑉𝐺𝑆 —to represent the behavior of each transistor, called the transistor behavioral vector (TBV). The TBVs are vertically concatenated to form the transistor behavioral circuit representation (TBCR) matrix, which precisely reflects the circuit’s performance and provides a technology-independent representation. Furthermore, we propose a transistor behavioral model (TBM) to convert the TBV into the corresponding sizing. Finally, we propose a method to transfer analog circuits between different technology nodes using TBM (TNT), translating the modifications in the process parameters into the corresponding adjustments in Δ𝑉𝐺𝑆 . The experimental results show that for a single transistor, the mapping accuracy from TBV to simulation result was reached 99%. Multiple amplifiers were transferred from 180𝑛𝑚 to 22𝑛𝑚 technology, compared to the conventional transfer method based on 𝑔𝑚 /𝑖𝑑 , our transfer method based on TBCR achieved a success rate of up to 5× higher, along with additional performance improvements from the scaling down.

2B-4 (Time: 14:30 - 14:55)

Title	AIPlace: Analog IC Placement with Multi-Task Learning Framework
Author	*Lijie Wang, Jing Wang, Song Chen, Qi Xu (University of Science and Technology of China, China)
Page	pp. 204 - 210
Keyword	Analog IC Placement, Multi-Task Learning, Graph Neural Network, Unsupervised Training
Abstract	Layout design of analog integrated circuits is a time-consuming manual process with limited automation methods. Recently, advances in machine learning have opened up possibilities for automated design, making it a viable option to improve efficiency. In this paper, we present an innovative and highly effective approach to achieve automated analog circuit placement. We transform the analog placement constraints into multiple task objectives, and apply multi-task neural network learning to perform accurate placement solutions efficiently. Besides, the global position information is utilized to achieve more orderly placement. Due to the computational properties of the network, the method exhibits versatility in accommodating diverse scales of circuit netlists. Moreover, the model is trained through unsupervised learning. Compared to the supervised counterpart using many generated synthetic layout datasets, the proposed approach dramatically reduces the cost of placement data. Experimental results demonstrate that compared to SOTA works, the proposed placement learning method can achieve significant performance gains.

[To Session Table]

Session 2C (T7-1) Approximate and Stochastic Computing
Time: 13:15 - 15:20, Tuesday, January 21, 2025
Location: Room Venus
Chair: Yue Zhang (Beihang University, China)

2C-1 (Time: 13:15 - 13:40)

Title	Stochastic Multivariate Universal-Radix Finite-State Machine: a Theoretically and Practically Elegant Nonlinear Function Approximator
Author	*Xincheng Feng (The University of Hong Kong, Hong Kong), Guodong Shen, Jianhao Hu (The University of Electronic Science and Technology of China, China), Meng Li (Peking University, China), Ngai Wong (The University of Hong Kong, Hong Kong)
Page	pp. 211 - 217
Keyword	AI, Nonlinear functions, Stochastic computing, multivariate universal-radix FSM,, low hardware consumption
Abstract	Nonlinearities are crucial for capturing complex input-output relationships especially in deep neural networks. However, nonlinear functions often incur various hardware and compute overheads. Meanwhile, stochastic computing (SC) has emerged as a promising approach to tackle this challenge by trading output precision for hardware simplicity. To this end, this paper proposes a first-of-its-kind stochastic multivariate universal-radix finite-state machine (SMURF) that harnesses SC for hardware-simplistic multivariate nonlinear function generation at high accuracy. We present the finite-state machine (FSM) architecture for SMURF, as well as analytical derivations of sampling gate coefficients for accurately approximating generic nonlinear functions. Experiments demonstrate the superiority of SMURF, requiring only 16.07% area and 14.45% power consumption of Taylor-series approximation, and merely 2.22% area of look-up table (LUT) schemes.

2C-2 (Time: 13:40 - 14:05)

Title	ACLAM: Accuracy-Configurable Logarithmic Approximate Floating-point Multiplier
Author	*Zhongyu Guan, Qiang Liu (Tianjin University, China), Guangdong Lin (Anhui Siliepoch Technology Company, China)
Page	pp. 218 - 223
Keyword	Approximate computation, logarithmic multiplier, configurable accuracy
Abstract	Approximate computation is a method that effectively reduces design complexity and resource consumption of computing systems. Floating-point (FP) multiplication is a computationally complex and resource-intensive operation and is widely used. Therefore, the approximate FP multiplier has been a promising alternative for improving computation efficiency. In this paper, we propose an accuracy-configurable logarithmic FP multiplication algorithm (ACLAM) and design an efficient circuit structure for it. ACLAM simultaneously computes the approximate product and the error of the logarithmic multiplication. Then a part of the error is added to the product to achieve different levels of accuracy. The complexity of the circuit is reduced by approximating the error expression and calculating the error only based on the Most Significant Bits (MSBs) of the mantissas. Experimental results show that, compared to the exact FP multiplier, ACLAM can reduce 88.1% of area, 89.8% of energy, and 28.1% of delay while maintaining an average accuracy of 99.07%. The proposed ACLAM is also applied to two real applications to demonstrate its efficiency and effectiveness.

2C-3 (Time: 14:05 - 14:30)

Title	AmPEC: Approximate MRAM with Partial Error Correction for Fine-grained Energy-quality Trade-off
Author	*Lan-yang Sun (Southeast University, China), Yaoru Hou (The Hong Kong University of Science and Technology, Hong Kong), Hao Cai (Southeast University, China)
Page	pp. 224 - 229
Keyword	STT-MRAM, approximate memory, energy-quality trade-off, error correction
Abstract	Spin Transfer Torque-Magnetic Random Access Memory(STT-MRAM)is a promising nonvolatile memory technology for future on-chip storage. However, its energy consumption during read and write operations poses a challenge to its overall energy efficiency. To address this, approximate storage techniques are explored to enhance energy saving in STT-MRAM while maintaining a low error probability. This paper proposes a fine-grained, quality-tunable approximate STT-MRAM that allows for an energy-quality trade-off. Compared with uniform approach that approximate all the bits to the same quality, our approach leverages bit-level approximate methods and data mapping to minimize quality loss. The reuse of circuits for different modes eliminates area overhead, and the modified read and write schemes are achieved through control signal manipulation. Partial Error Correction Code(ECC) is employed to check the most significant bits(MSBs) and ensure the minimum precision when quality of MSBs cannot be guaranteed. The evaluation demonstrates a 49.5% reduction in energy consumption with negligible area overhead and image quality loss. Partial error correction further enhances image quality without requiring additional column area.

2C-4 (Time: 14:30 - 14:55)

Title	HyPPO: Hybrid Piece-wise Polynomial Approximation and Optimization for Hardware Efficient Designs
Author	*Lakshmi Sai Niharika Vulchi, Valipireddy Pranathi, Mahati Basavaraju, Madhav Rao (IIIT Bangalore, India)
Page	pp. 230 - 236
Keyword	Polynomial Approximation, Piece-wise-Linear, Piece-wise-Quadratic, Partical Swarm Optimization
Abstract	Piece-wise polynomial approximation - linear (PWL) and quadratic (PWQ) have proven to efficiently implement non-linear functions on hardware. This paper introduces Hybrid Piece-wise Polynomial Approximation (PW-Hybrid) where pieces of approximation for a function are obtained as the best combination of linear and quadratic polynomials, such that the error converges to the desired minimum. The hardware for PWL, PWQ and PW-Hybrid designs are further refined using Particle Swarm Optimization (PSO) algorithm to fine-tune the quantized bit-widths for realizing coefficients of the polynomial employed. This PSO optimised hardware design is evolved for a range of non-linear functions including i) Piece-wise polynomial linear optimized (PWLO) - Logarithmic, Hyperbolic-Tangent, Sigmoid and Softsign, and ii) Piece-wise polynomial quadratic optimized (PWQO) Exponential, and iii) Hybrid Piece-wise Polynomial optimized (HyPPO) - Sine and Sinc. The proposed design shows considerable decrease in hardware resource consumption and critical path delay, when synthesized using Cadence 45nm gpdk library. Highest improvements in HyPPO, PWLO and PWQO designs, are observed for Sine, Logarithmic and Exponential functions, with 65.06%, 24.47% and 9.67% gain in power-area-delay product (PADP) respectively, when compared with SOTA - PWL and PWQ designs. The proposed methods also exhibited minimal inference accuracy loss when tested on popular CNN architectures.

2C-5 (Time: 14:55 - 15:20)

Title	Hybrid Temporal Computing for Lower Power Hardware Accelerators
Author	Maliha Tasnim, Sachin Sachdeva, Yibo Liu, *Sheldon X.D. Tan (University of California, Riverside, USA)
Page	pp. 237 - 244
Keyword	Temporal Computing, Deterministic Computing, DSP, Low Power, Accelerator
Abstract	In this paper, we propose a new hybrid temporal computing (HTC) framework that leverages both pulse rate and temporal data encoding to design ultra-low energy hardware accelerators. Our approach is inspired by the recently proposed temporal computing, or race logic, which encodes data values as single delays, leading to significantly lower energy consumption due to minimized signal switching. The new HTC framework overcomes the inherent limitations of race logic by encoding signals in both temporal and pulse rate formats for multiplication and in temporal format for propagation. We demonstrate how HTC multiplication is performed for both unipolar and bipolar data encoding while consuming reduced switching energy. Additionally, we implement two widely used hardware accelerators: a Finite Impulse Response (FIR) filter and a Discrete Cosine Transform (DCT)/iDCT. Experimental results show that compared to the CBSC MAC, the HTC MAC reduces power consumption by 45.2% and area footprint by 50.13%. Compared to the CBSC design, the HTC-based FIR filter reduces power consumption by 36.61% and area cost by 45.85%. The HTC-based DCT filter retains the quality of the original image with a decent PSNR, while consuming 23.34% less power and occupying 18.20% less area than the CBSC MAC-based DCT filter.

[To Session Table]

Session 2D (T2-1) Next-Generation Embedded Architectures and Tools
Time: 13:15 - 15:20, Tuesday, January 21, 2025
Location: Room Mars/Mercury
Chair: Qi Zhu (Northwestern University, USA)

Best Paper Candidate
2D-1 (Time: 13:15 - 13:40)

Title	PULSE: Progressive Utilization of Log-Structured Techniques to Ease SSD Write Amplification in B-epsilon-tree
Author	Huai-De Peng (Department of Computer Science and Information Engineering, National Central University, Taiwan), *Yi-Shen Chen (Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taiwan), Tseng-Yi Chen (Department of Computer Science and Information Engineering, National Central University, Taiwan), Yuan-Hao Chang (Institute of Information Science, Academia Sinica, Taiwan)
Page	pp. 245 - 250
Keyword	solid-state drive, flash memory, B-epsilon-tree, write amplification
Abstract	During the unprecedented expansion of global data, efficient storage solutions are essential for processing massive datasets stored on modern storage devices. B-epsilon-tree is one of the most well-known techniques that provides a write-optimized structure for database file systems. With the excellent access performance and high energy efficiency of solid-state drives (SSDs), they are expected to yield promising outcomes for large-scale data computation. However, their integration into storage systems has the challenges of write amplification, which impacts SSD endurance and reliability. This work identifies significant write amplification issues with B-epsilon-tree implementations on SSDs due to the complicated management of key-value pairs. To mitigate the impact of write amplification, we propose PULSE, a novel scheme that rethinks B-epsilon-tree designs by leveraging log-structured techniques optimized for SSDs. Moreover, PULSE integrates auxiliary indexing and a dual flush selector to minimize write amplification. Experimental results demonstrate that PULSE significantly mitigates write amplification by more than 62.6% on SSDs for the representative benchmarks compared to the B-epsilon-tree.

2D-2 (Time: 13:40 - 14:05)

Title	Rethinking B^ε tree Indexing Structure over NVM with the Support of Multi-write Modes
Author	Hui-Tang Luo, *Tseng-Yi Chen (National Central University, Taiwan)
Page	pp. 251 - 257
Keyword	key-value store, non-volatile memory, multi-write modes
Abstract	The B𝜖 tree is an essential indexing structure in modern file and database systems, renowned for its high read and write performance. However, constructing a Bε tree involves substantial write overhead due to repetitive key writing during flushing. This study presents the mw-B𝜖 tree, a novel approach leveraging multi-write modes in persistent memory to reduce B𝜖 tree construction costs. By dynamically adapting write modes based on node update frequencies, the mw-B𝜖 tree outperforms traditional fixed-mode indexing schemes. Our research demonstrates that integrating multi-write mode support in non-volatile memory (NVM) can significantly enhance the efficiency of B𝜖 tree indexing. Experimental results using real-world workloads show that the mw-Bε tree reduces power consumption by up to 43.5% and energy usage by 27.6%, while also improving write latency. This work is the first to investigate and address power consumption challenges in Bε tree construction, providing a compelling case for the adoption of multi-write modes in NVM technologies for indexing structures.

2D-3 (Time: 14:05 - 14:30)

Title	End-to-end Compilation is All FPGAs Need: A Unified Overlay-based FPGA Compiler for Deep Learning
Author	*Kai Qian, Haodong Lu (Fudan University, China), Yinqiu Liu (Nanyang Technological University, Singapore), Zexu Zhang, Kun Wang (Fudan University, China)
Page	pp. 258 - 264
Keyword	Compiler, FPGA
Abstract	Field-Programmable Gate Array (FPGA) has shown great application potential in deploying Neural Networks (NNs) relying on the characteristics of programmability, low power consumption, \textit{etc}. However, deploying models on FPGA is non-trivial because (1) Mainstream NNs pose significant FPGA architecture design challenges due to their large number of parameters, complex operations, and the need for data optimization, and (2) Supporting the deployment of different machine learning frameworks to FPGA requires significant manual effort, consuming a large amount of time. In this paper, we propose \textit{AutoCompiler} — a unified compiler for mapping NNs to different FPGAs, along with overlay techniques to enable fast and efficient implementation. To the best of our knowledge, we are the first work to support both Deep Neural Networks (DNNs) and Transformer-based networks for overlay-based FPGA deployment. \textit{AutoCompiler} comprises three integrated enablers: (1) \textit{Model Translator}, built on top of a topology-based NNs representation, which can optimize the topology and data representation of the models from an algorithmic level based on different hardware configurations, \textit{e.g.,} DSP utilization, (2) \textit{Instruction Generator}, which generates pipeline data streams according to various FPGA resource configurations by manipulating the instruction set at the upper level rapidly, and (3) \textit{end-to-end optimization}, which moves as much of the computational processes as possible onto the FPGA chip and minimizes the interaction between CPU and FPGA. Extensive experiments on various Xilinx FPGAs show that \textit{AutoCompiler} outperforms state-of-the-art overlay-based compiler by 1.2$\times$ - 1.35$\times$ and same-level GPUs by 1.15$\times$ - 1.59$\times$ for classic DNN models, and ViT inference, respectively.

2D-4 (Time: 14:30 - 14:55)

Title	HDCC: A Hierarchical Dataflow-Oriented CGRA Compiler for Complex Applications
Author	*Shangli Li (Institute of Software, Chinese Academy of Sciences/Hangzhou Institute for Advanced Study, UCAS/University of Chinese Academy of Sciences, China), Mingjie Xing (Institute of Software, Chinese Academy of Sciences/Nanjing Institute of Software Technology/ University of Chinese Academy of Sciences, China), Yanjun Wu (Institute of Software, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China)
Page	pp. 265 - 271
Keyword	CGRA, MLIR, Dataflow, Compiler, Heterogeneous Computing
Abstract	CGRA(Coarse-Grained Reconfigurable Architecture) is characterized by high energy efficiency and reconfigurability, plays an important role in various complex applications. However, current CGRA compilers only handle simple inner loops and struggle with complex nested loops, making it hard to deploy large-scale complex applications on CGRA for acceleration. Therefore, we proposed a hierarchical dataflow compiler HDCC based on MLIR to deal with complex loops in layers and developed a tool for generating dataflow graphs from MLIR intermediate code to capture the data flow of complex loops. We also propose a method to deploy large-scale applications to CGRA with HDCC.The experimental results show that HDCC is capable of deploying large-scale complex applications such as neural network tasks and cryptographic tasks onto CGRA, achieving up to 11.4x performance improvement on these tasks.

2D-5 (Time: 14:55 - 15:20)

Title	HAMMER: Hardware-aware Runtime Program Execution Acceleration through runtime reconfigurable CGRAs
Author	Qilin Si, *Benjamin Carrion Schafer (The University of Texas at Dallas, USA)
Page	pp. 272 - 278
Keyword	CGRA, Runtime acceleration, Accelerator, High-Level Synthesis
Abstract	This work introduces a novel computer architecture consisting of an embedded processor with a tightly coupled Coarse-grain Reconfigurable Array (CGRA) that is able to accelerate the execution of sequential programs at runtime. To accomplish this, our work pre-characterizes a large variety of different portions of code from multiple application domains that can be accelerated offline. These kernels are subsequently synthesized onto the CGRA such that the proposed architecture detects at runtime if portions of code from a new unseen application can be accelerated or not at runtime. If they can be accelerated, then the system autonomously configures the CGRA with the specific accelerator, while if not present, then the code is executed sequentially on the CPU only. This approach implies sequential code compiled for a specific CPU only does not need to be recompiled for alternative architectures like CPU+CGRA. We do specifically use CGRAs in this system because they can be reconfigured in a single clock cycle (1ns) and hence, do not lead to any reconfiguration overhead. This also allows them to hold as many accelerators as the CGRA configuration memory can hold. Experimental results show the effectiveness of our flow and architecture leading to different levels of speedups and energy efficiency vs. the purely sequential execution of the code.

[To Session Table]

Session 2E (SS-2) Advances in 3D-IC and Ultra-Large-Scale Integration
Time: 13:15 - 15:20, Tuesday, January 21, 2025
Location: Innovation Hall
Chair: Yibo Lin (Peking University, China)

2E-1 (Time: 13:15 - 13:40)

Title	(Invited Paper) Fast Routing Algorithm for Mask Stitching Region of Ultra Large Wafer Scale Integration
Author	Zhen Zhuang (The Chinese University of Hong Kong, Hong Kong), Quan Chen, Hao Yu (Southern University of Science and Technology, China), *Tsung-Yi Ho (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 279 - 284
Keyword	Routing, Ultra large 2.5D wafer scale packaging, Ultra large silicon interposer, Mask stitching
Abstract	Interposer-based packaging has gained tremendous popularity in integrating advanced logic and memory chiplets for artificial intelligence and high-performance computing systems. The size of the silicon interposer is the critical bottleneck in improving the performance of integrated systems by mounting more and more advanced chiplets, such as high bandwidth memory (HBM). Nowadays, ultra large wafer scale integration is a popular alternative to integrate large amounts of advanced chiplets on a big wafer scale silicon interposer. However, wafer scale silicon interposers cannot be manufactured by one mask due to the reticle limitation. Therefore, the mask stitching technique is used to manufacture ultra large systems by applying multiple masks for different sub-regions of an ultra large silicon interposer. To achieve the alignment of two adjacent sub-regions manufactured by different masks, the two sub-regions have an overlapped stitching region. Previous algorithms cannot handle the special design rules of mask stitching regions and are not efficient enough to generate high-quality routing solutions. In this work, we propose a fast routing algorithm for mask stitching regions to efficiently solve the special design rules. The time complexity of the proposed algorithm is O(nlogn), where n is the number of nets. Compared with state-of-the-art work, our algorithm can achieve 100% routability with an effective reduction of wirelength. Furthermore, the proposed algorithm can achieve a speedup of thousands of times.

2E-2 (Time: 13:40 - 14:05)

Title	(Invited Paper) The Survey of 2.5D Integrated Architecture: An EDA perspective
Author	Shixin Chen (The Chinese University of Hong Kong, Hong Kong), Hengyuan Zhang, Zichao Ling, Jianwang Zhai (Beijing University of Post and Telecommunication, China), *Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 285 - 293
Keyword	2.5D IC, Chiplet, EDA, Survey, Hardware Architecture
Abstract	Enhancing performance while reducing costs is the fundamental design philosophy of integrated circuits (ICs). With advancements in packaging technology, interposer-based chiplet architecture has emerged as a viable solution. Chiplet-based integration, often referred to as 2.5D IC, provides notable benefits, including cost-effectiveness, reusability, and improved performance. However, fully leveraging the advantages of chiplet-based architecture presents challenges, particularly in thermal management, communication latency, and the optimal placement of chiplets within advanced packages. Consequently, several issues persist within the electronic design automation (EDA) workflow, for example, simulation of integration, partition, interconnection, physical design, evaluation modeling, etc. This talk will focus on reviewing the research literature regarding chiplet-based architecture, highlighting current challenges and exploring opportunities for future development. We sincerely expect this survey can provide insights for the future development of EDA tools for 2.5D integrated architectures.

2E-3 (Time: 14:05 - 14:30)

Title	(Invited Paper) Toward Advancing 3D-ICs Physical Design: Challenges and Opportunities
Author	Xueyan Zhao (State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Weiguo Li, Zhisheng Zeng (Pengcheng Laboratory, China), Zhipeng Huang (Beijing Institute of Open Source Chip, China), Biwei Xie (State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences/Pengcheng Laboratory/University of Chinese Academy of Sciences, China), *Xingquan Li (Pengcheng Laboratory/Beijing Institute of Open Source Chip, China), Yungang Bao (State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences/Beijing Institute of Open Source Chip/University of Chinese Academy of Sciences, China)
Page	pp. 294 - 301
Keyword	3D-IC, physical design, hybrid bounding
Abstract	As the demand for higher integration density and performance efficiency continues to grow, 3D stacking has emerged as a promising solution. In 3D ICs, the complexity of physical design and the optimization space is significantly increasing. Therefore, researching high-quality 3D native instead of pesudo 3D physical design has become even more important. This paper reviews recent advancements and persistent challenges in 3D physical design, focusing on F2F bonding technologies. Then, this paper discusses several issues that still require further research and some overlooked problems, with the hope of helping researchers develop higher-quality 3D native physical design tools in the future.

2E-4 (Time: 14:30 - 14:55)

Title	(Invited Paper) Processing-Near-Memory with Chip Level 3D-IC
Author	*Miao Liu, Qingqing Sun, David Wei Zhang (Fudan University, China)
Page	pp. 302 - 307
Keyword	3DIC, Memory-on-Logic, More than Moore, Implementation
Abstract	Moore Law has met the limitation when since 2014, after moving to 16nm, so there are two research trends: More Moore and More than Moore. From an easy-to-implement point of view, 3D-IC is one good technology for More than Moore. Now there are already many package-level 3DIC in industry, such as Integrated Fan-Out (INFO), Fan-out Wafer Level Package (FOWLP), and other applications. This paper is talking about a new and more aggressive 3D-IC technology: SRAM-on-Logic. In fact, DRAM-on-Logic has become a mature technology for AI and deep learning chips in past year, such as Bitcoin mining machine. SRAM-on-Logic is still a very new technology researched by Foundry, EDA and leading design houses. SRAM-on-Logic can bring general performance, power and area benefit comparing with traditional 2D floorplan.

2E-5 (Time: 14:55 - 15:20)

Title	(Invited Paper) Clustering-Driven Bonding Terminal Legalization with Reinforcement Learning for F2F 3D ICs
Author	Gyumin Kim, *Heechun Park (Ulsan National Institute of Science and Technology (UNIST), Republic of Korea)
Page	pp. 308 - 314
Keyword	3D IC, routing, reinforcement learning
Abstract	3D ICs have garnered significant attention as a means to overcome the scaling limits of Moore’s Law. Specifically, Face-to-Face (F2F) 3D ICs with hybrid bonding have become a promising solution as they enable heterogeneous stacking with different technology nodes. However, the state-of-the-art F2F 3D IC design flow faces a challenge in locating bonding terminals, since their large bonding pitches compared to the adjacent metal layers cause the existing routing engine to generate a huge amount of design rule violations (DRVs) from the overlapped pitches. In this paper, we introduce our clustering-driven bonding terminal legalization method that efficiently honors bonding pitches. Starting from the optimal 3D via locations with lots of bonding pitch overlaps, our method legalizes all bonding terminals by leveraging both aspects of minimizing displacements (i.e., maintaining routing quality) and grid-based assignments (i.e., overlap elimination) through applying grid-based legalization on clustered bonding terminals. Moreover, we apply reinforcement learning for clustering to obtain the optimal bonding terminal clusters that utilize both advantages. Experiments demonstrate that we successfully eliminate all bonding terminal pitch overlaps with minimal displacement of 3D vias from their optimal locations, thereby reducing the overhead of timing and power degradation than the previous approach.

[To Session Table]

Session 2F University Design Contest
Time: 13:15 - 15:20, Tuesday, January 21, 2025
Chairs: Mahfuzul Islam (Tokyo Institute of Technology, Japan), Shinya Takamaeda Yamazaki (The University of Tokyo, Japan)

2F-1 (Time: 13:15 - 13:20)

Title	A 10.60 μW 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection
Author	Yifan Qin, Zhenge Jia, Zheyu Yan (University of Notre Dame, USA), Jay Mok, Manto Yung, Yu Liu, Xuejiao Liu (Hong Kong University of Science and Technology/AI Chip Center for Emerging Smart System, Hong Kong), Wujie Wen (North Carolina State University, USA), Luhong Liang, Kwang-Ting Tim Cheng (Hong Kong University of Science and Technology/AI Chip Center for Emerging Smart System, Hong Kong), Xiaobo Sharon Hu, *Yiyu Shi (University of Notre Dame, USA)
Page	pp. 315 - 316
Keyword	ventricular arrhythmia, sparsity, CNN accelerator, chip
Abstract	This paper proposes an ultra-low power, mixed-bit-width sparse convolutional neural network (CNN) accelerator to accelerate ventricular arrhythmia (VA) detection. The chip achieves 50% sparsity in a quantized 1D CNN using a sparse processing element (SPE) architecture. Measurement on the prototype chip TSMC 40nm CMOS low-power (LP) process for the VA classification task demonstrates a power consumption of 10.60 uW with a performance of 150 GOPS and a diagnostic accuracy of 99.95%. The computation power density reaches 0.57 uW/mm^2, which is 14.23X smaller than state-of-the-art works, making it highly suitable for implantable and wearable medical devices.

2F-2 (Time: 13:20 - 13:25)

Title	Headset-Integrated Brain-Machine Interface for Mind Imagery and Control in VR/MR Applications
Author	Zhiwei Zhong (Northwestern University, USA), Yijie Wei (Kilby Labs, Texas Instruments, USA), Lance Go, Yiqi Li, *Jie Gu (Northwestern University, USA)
Page	pp. 317 - 320
Keyword	Brain machine interface, virtual reality, mind imagery control, system-on-chip
Abstract	Virtual Reality and Mixed Reality systems have revolutionized consumer electronics, driving innovations in the metaverse. However, traditional VR headsets lack brain activity integration for feedback and control. This work introduces a 65nm SoC for in situ mind imagery-based brain machine interface, seamlessly integrated into VR/MR headsets, with reduced energy consumption and AI support. The digital core of the SoC achieves the state-of-the-art energy consumption <1μJ/class for compute-intensive CNN operations thanks to the low-power features and system-level optimizations of the designs. The demo video is provided in https://youtu.be/WuGlcMSSQzY.

2F-3 (Time: 13:25 - 13:30)

Title	Humanoid Robot Control: A Mixed-Signal Footstep Planning SoC with ZMP Gait Scheduler and Neural Inverse Kinematics
Author	Qiankai Cao, Yiqi Li, Juin Chuen Oh, *Jie Gu (Northwestern University, USA)
Page	pp. 321 - 324
Keyword	Humanoid robot, 3D footstep planning, mixed-signal, ZMP, system-on-chip
Abstract	With the rapid expansion of autonomous robotic systems in recent years, humanoid robots are also gaining considerable attention. However, their motion control presents more complex challenges than wheeled mobile robots. For the first time, this work presents a complete footstep planning SoC chip for humanoid robots. It includes a novel time-domain graph search engine for 3D footstep planning, along with a mixed-signal zero-moment point (ZMP) gait scheduler enhanced by neural inverse kinematics for efficient motion control. This work is demonstrated in-situ on a fully assembled robot using the 65nm system-on-chip (SoC), achieving the best-in-class performance and power including an 18.4x enhancement on energy efficiency for motion control and a 2.7x energy saving in graph search compared to previous works. The demo video is provided in https://youtu.be/kBe-aRnzmG4.

Special Feature Award
2F-4 (Time: 13:30 - 13:35)

Title	A Coarse- and Fine-Grained LUT Segmentation Method Enabling Single FPGA Implementation of Wired-Logic DNN Processor
Author	*Yuxuan Pan, Dongzhu Li, Mototsugu Hamada, Atsutake Kosuge (The University of Tokyo, Japan)
Page	pp. 325 - 327
Keyword	Look-up table, area efficiency, wired-logic, FPGA, deep neural network
Abstract	A coarse- and fine-grained LUT segmentation technique is developed for wired-logic AI processors to improve FPGA resource utilization efficiency. By applying the proposed technique to FPGA-based wired-logic processors used for CIFAR-10 classification and keyword spotting, the hardware resource requirements for non-linear functions were reduced by 92% and 92.8%, respectively, with negligible accuracy degradation.

2F-5 (Time: 13:35 - 13:40)

Title	Learned Image Codec on FPGA: Algorithm, Architecture and System Design
Author	*Heming Sun, Jing Wang (Yokohama National University, Japan), Silu Liu, Shinji Kimura (Waseda University, Japan), Masahiro Fujita (The University of Tokyo, Japan)
Page	pp. 328 - 331
Keyword	Image Compression, Neural Network, FPGA
Abstract	This paper describes our design for learned image codec (LIC) on FPGA, from the aspects of algorithm, architecture and system. For the algorithm, we build the neural network on the hyperprior structure. Besides, we present a quantization aware training scheme specifically adapted to LIC. For the architecture, we propose a fine-grained pipeline architecture. Channel parallelism constraint and neural network are proposed to improve the DSP utilization and efficiency, respectively. For the system, we make a CPU-FPGA heterogeneous coding system in which a system-level pipeline is proposed to maximize the throughput. A 720P@30FPS demo and a cross-platform demo are provided in the websites.

2F-6 (Time: 13:40 - 13:45)

Title	Transformer Hetero-CiM: Heterogeneous Integration of ReRAM CiM and SRAM CiM for Vision Transformer at Edge Devices
Author	*Naoko Misawa, Tao Wang, Chihiro Matsui, Ken Takeuchi (The University of Tokyo, Japan)
Page	pp. 332 - 334
Keyword	Vision Transformer, Computation-in-Memory, SRAM, ReRAM
Abstract	This paper proposes design of Transformer Hetero-Computation-in-Memory (CiM) for Vision Transformer (ViT) at edge devices. ViT achieves high inference accuracy using parallel processing. However, ViT is computationally intensive due to a huge number of multiply-accumulate (MAC) operations. Moreover, ViT makes the diverse requirements of Read-MAC operation in linear & FC layers and Read/Write MAC operation in self-attention. Thus, proposed Transformer Hetero-CiM overcomes the issues by heterogeneous integration system of MLC ReRAM CiM, SRAM CiM and digital processors. As a result, proposed Transformer Hetero-CiM achieves 97.9% inference accuracy with reducing 89.1% of array area.

Special Feature Award
2F-7 (Time: 13:45 - 13:50)

Title	A High-Density Hybrid Buck Converter with a Charge Converging Phase Reducing Inductor Current for 12V Power Supply Systems
Author	*Yichao Ji, Ji Jin, Lin Cheng (University of Science and Technology of China, China)
Page	pp. 335 - 337
Keyword	Buck converter, Dual path, High current density, Hybrid converter
Abstract	A 12V-input and 1-1.8V-output hybrid buck converter with a charge converging phase is presented in this summary. The charge converging phase reduces the inductor current to always less than half of load current, which reduces DCR conduction loss and mitigates the reliance on a bulky inductor. Fabricated in a 0.18μm BCD process with a die area of 5mm2, this design utilizes only 5V and 1.8V NMOS as power switches and achieves a peak efficiency of 94.7% and a current density of 685A/cm3.

2F-8 (Time: 13:50 - 13:55)

Title	A 500-MS/s 8-bit SAR ADC Generated from an Automated Layout Generation Framework in 14-nm FinFET Technology
Author	*Yunseong Jo, Taeseung Kang (Hanyang University, Republic of Korea), Jeonghyu Yang (Ramschip, Republic of Korea), Jaeduk Han (Hanyang University, Republic of Korea)
Page	pp. 338 - 341
Keyword	FinFET, LAYGO2, layout generation, SAR ADC
Abstract	This paper presents the process of generating the layout of the 8-bit Successive Approximation Register Analog to Digital Converter (SAR ADC). By utilizing LAYGO2 [1], a Python framework that allows for detailed and flexible specification of custom layout generation processes, code-based generators for the component blocks of a SAR ADC were developed according to their specific operational characteristics and requirements. As a result, 85.5% of the SAR ADC was automatically generated. The SAR ADC test chip was fabricated in a 14-nm CMOS FinFET process and achieved an SNDR of 41.67 dB at 500 MS/s, consuming 2.07 mW from 0.9 V supply, and occupying an area of 4,131 um2.

2F-9 (Time: 13:55 - 14:00)

Title	A 4-Stream 8-Element Time-Division MIMO Phased-Array Receiver for 5G NR and Beyond Achieving 9.6Gbps Data Rate
Author	*Yi Zhang, Minzhe Tang, Zheng Li, Dongfan Xu, Kazuaki Kunihiro, Hiroyuki Sakai, Atsushi Shirane, Kenichi Okada (Institute of Science Tokyo, Japan)
Page	pp. 342 - 343
Keyword	Millimeter-Wave, MIMO, Fast Beam Switching, Receiver
Abstract	An 8-Element 28-GHz time-division MIMO beamformer supporting 4x4 MIMO is presented. A fast- switching phase shifter is utilized to realize Nyquist-rate beam switching to lossless receive multiple beams by reusing single RF signal path. A prototype fabricated in 65nm CMOS achieves 64- QAM 4x4 MIMO reception in OTA measurement using four 5G NR spatial data streams with 400MHz channel bandwidth.

2F-10 (Time: 14:00 - 14:05)

Title	Design of a 1-5GHz Inverter-Based Phase Interpolator for Spin-Wave Detection
Author	*Yuyang Zhu, Zunsong Yang, Zhenyu Cheng, Md Shamim Sarker, Hiroyasu Yamahara, Munetoshi Seki, Hitoshi Tabata, Tetsuya Iizuka (The University of Tokyo, Japan)
Page	pp. 344 - 347
Keyword	Phase interpolator, Spin-wave detection, phase tuning
Abstract	This paper details the design and implementation of a wide-band, digital-controlledinverter-based phase interpolator (PI) for a spin-wave detection circuit. The PI contains input phase generation part for coarse phase tunning, and PI core part for fine phase tunning. Fabricated using 65nm CMOS technology, the circuit operates across a frequency range of 1GHz to 5GHz with a 1V supply. The chip is implemented in 65nm CMOS technology and operates from 1GHz to 5GHz, with 1V supply. The power consumption of the proposed PI core is 1.89mW at 5GHz. The measured differential non-linearity and integral non-linearity are 0.6 LSB and −5.17 LSB respectively at the worst case.

2F-11 (Time: 14:05 - 14:10)

Title	Self-recovery hysteresis control based on-chip SC DC-DC converter robust to load fluctuation
Author	*Koji Kikuta, Takashi Hisakado (Kyoto University, Japan), Mahfuzul Islam (Tokyo Institute of Technology, Japan)
Page	pp. 348 - 351
Keyword	Analog integrated circuit, Power supply, SC converter, Frequency scaling, Hystresis control
Abstract	This paper proposes a switched-capacitor DC-DC converter with a self-recovery hysteresis control mechanism to deal with both step-like and spike-like load fluctuations. Our designed test chip is fabricated in a commercial 65 nm CMOS low-power (LP) process and deals with over 0.11∼ 8.0 mA with a total capacitance of 1 nF and recover quickly from a current spike of as high as 21 mA within only 92 mV of voltage drop.

2F-12 (Time: 14:10 - 14:15)

Title	A Tri-Mode Harmonic-Selection Mixer with Multiphase LO Supporting 24.25–71GHz for Multi-Band 5G NR
Author	*Dongfan Xu, Minzhe Tang, Yi Zhang, Zheng Li, Jian Pang, Atsushi Shirane, Kenichi Okada (Institute of Science Tokyo, Japan)
Page	pp. 352 - 353
Keyword	Mixer, 5G NR, harmonic selection
Abstract	A tri-mode mixer supporting the 5G NR standards over 24.25–71GHz is presented. The harmonic-selection technique reduces the local oscillator (LO) frequency range and power consumption. Besides, the proposed multi-phase LO rejects unwanted harmonics. Fabricated in a 65nm CMOS process, this work consumes 12-21mW power under a 1V supply. It presents more than 20dB rejections to the undesired harmonic components at all operation bands. The required LO frequency range is only 10GHz.

2F-13 (Time: 14:15 - 14:20)

Title	A D-Band CMOS Transceiver Chipset Supporting 640Gb/s Date Rate with 4×4 Line-of-Sight MIMO
Author	*Chenxin Liu, Zheng Li, Yudai Yamazaki, Hans Herdian, Chun Wang, Anyi Tian, Jun Sakamaki, Han Nie, Xi Fu, Sena Kato, Wenqian Wang, Hongye Huang (Tokyo Institute of Technology, Japan), Shinsuke Hara, Akifumi Kasamatsu (National Institute of Information and Communications Technology, Japan), Hiroyuki Sakai, Kazuaki Kunihiro, Atsushi Shirane, Kenichi Okada (Tokyo Institute of Technology, Japan)
Page	pp. 354 - 355
Keyword	CMOS, Wireless, TRX, 640Gb/s, MIMO
Abstract	A D-band (114-170GHz) CMOS transceiver (TRX) chipset covering a 56GHz signal-chain bandwidth with a 640-Gb/s data rate is proposed in this work. The design includes an 8-way low-Q power-combined power amplifier (PA), a 2-way low-Q power-combined low noise amplifier (LNA), wideband-impedance-transformation mixers, and common-source-based cascaded distributed amplifiers (DA) to improve bandwidth and linearity. The proposed TRX chipset achieves a 200Gb/s SISO data rate and a 640Gb/s MIMO data rate.

2F-14 (Time: 14:20 - 14:25)

Title	Low quiescent current LDO with FBPEC to improve PSRR specific frequency band for wearable EEG recording devices
Author	*Kenji Mii, Daisuke Kanemoto, Tetsuya Hirose (Osaka University, Japan)
Page	pp. 356 - 359
Keyword	FVF, low-dropout, low quiescent current, PSRR
Abstract	This design contest document proposes a low quiescent current low-dropout regulator (LDO) with an auxiliary amplifier, flipped voltage follower (FVF)-based power supply rejection ratio enhanced circuit (FBPEC) for electroencephalogram (EEG) recording devices. A FVF filter, current mirror, and common-source amplifier are employed to configure the FBPEC. The FBPEC employs the characteristics of the FVF filter to reduce the current consumption and increase the gain at specific frequencies. A 0.18 μm CMOS process is used to design and fabricate the proposed LDO. Compared to the general configuration LDO by measurement results, the proposed LDO exhibits an enhanced power supply rejection ratio (PSRR) up to 18 dB at frequencies exceeding 8 kHz. Moreover, the quiescent current of the proposed LDO at no-load is 648 nA. The proposed LDO exhibits a good figure-of-merit score compared to those of previous works, suggesting that the proposed circuit is an effective solution for use in wearable low-power EEG recording devices.

2F-15 (Time: 14:25 - 14:30)

Title	Design of a 7.2-GHz CMOS Receiver Front-end for One-chip Transponders in Deep Space Probes
Author	*Sota Kano (The University of Tokyo, Japan), Naoto Usami, Atsushi Tomiki (Japan Aerospace Exploration Agency, Japan), Tetsuya Iizuka (The University of Tokyo, Japan)
Page	pp. 360 - 363
Keyword	CMOS, Analog, RF, Wireless, Deep space probes
Abstract	Demand for miniaturization in deep space proves is increasing due to the need to reduce costs. Along with this trend, transponders, which are essential building blocks for communication, are also required to be downsized. The feasibility of the one-chip transponder with general-purpose low-cost silicon CMOS technology needs to be studied to achieve ultimate downsizing while maintaining electrical reliability. This paper presents a 7.2 GHz CMOS receiver front-end chip with a 65 nm general purpose CMOS process. The direct probing measurements demonstrated that our prototype receiver front-end achieves the input return loss of -20 dB, the conversion gain of 25 dB, the NF of 2.3 dB, and the IIP3 of -26 dBm, while consumes 20.1 mW for the core block including the LNA and the mixers. The measurement results indicate that the receivers of previous deep space transponders can be replaced by a CMOS single chip without sacrificing the functionality and performance.

Best Design Award
2F-16 (Time: 14:30 - 14:35)

Title	Design of 0.9-2.6pW 0.1-0.25V 22nm 2-bit Supply-to-Digital Converter Using Always-Activated Supply-Controlled Oscillator and Supply-Dependent-Activation Buffers for Bio-Fuel-Cell-Powered-and-Sensed Time-Stamped Bio-Recording
Author	*Hiroaki Kitaike, Hironori Tagawa, Shufan Xu, Ruilin Zhang, Kunyang Liu, Kiichi Niitsu (Kyoto University, Japan)
Page	pp. 364 - 367
Keyword	Supply-to-Digital Converter, CMOS, IoT, supply-dependent-activation buffer, CGM
Abstract	This paper presents a low-power supply-to-digital converter (SDC) for bio-recording system where bio-fuel-cell provides both power and sensing data. It is achieved by a supply-controlled oscillator, supply-dependent activation buffers (SDAB) with different V-thresholds, and an encoder. A 22-nm prototype chip exhibits its feasibility with a power of 0.9-2.6 pW under 0.1-0.25 V. Design methodology for minimizing power consumption and techniques of controlling buffer V-threshold are also introduced.

2F-17 (Time: 14:35 - 14:40)

Title	0.36μW/channel Capacitively-coupled Chopper Instrumentation Amplifier for EEG Recording Wearable Devices with Compressed Sensing Framework
Author	Kenji Mii, *Daisuke Kanemoto, Tetsuya Hirose (Osaka University, Japan)
Page	pp. 368 - 371
Keyword	CCIA, compressed sensing, EEG, low power consumption, random undersampling
Abstract	In this design contest document, measurement results demonstrate the effectiveness of a designed low current consumption amplifier for a compressed-sensing (CS) framework in wearable electroencephalography (EEG) recording devices. When reconstructing with a frequency bases, the reduction of biased 1/f noise is more important than frequency-unbiased white noise. Therefore, we designed an amplifier that reduces 1/f noise rather than white noise, while reducing power consumption, and employed it in the system. The designed amplifier is based a capacitively coupled chopper instrumentation amplifier (CCIA) architecture which used for low-noise amplifier (LNA). According to measurements of the designed CCIA, the power consumption is 0.36 μW/channel, it has the lower power consumption compared to amplifiers designed for similar applications in the past. The input referred noise (IRN) excluding the hum of the power supply was 3.3 μVrms. The measured IRN and simulations were used to confirm the effect of noise from CCIA on the CS-based EEG measurement framework. The difference in the normalized mean squared error at CR = 4 to the uncompressed conditions is 0.008. This result shows that even with the LNA specialized for low power consumption, a slight signal degradation is observed when the compression ratio is increased up to 4 in the CS framework by making use of the sparsity of EEG in the frequency domain.

[To Session Table]

Session 3A (SS-3) LLM Acceleration and Specialization for Circuit Design and Edge Applications
Time: 15:40 - 17:20, Tuesday, January 21, 2025
Location: Room Saturn
Chair: Cheng Zhuo (Zhejiang University, China)

3A-1 (Time: 15:40 - 16:05)

Title	(Invited Paper) Standard Cell Layout Generation: Review, Challenges, and Future Works
Author	*Chung-Kuan Cheng, Byeonggon Kang, Bill Lin, Yucheng Wang (University of California, San Diego, USA)
Page	pp. 372 - 378
Keyword	Standard Cell Design, Cell Layout, Layout Automation, DTCO
Abstract	With the growing demand for VLSI scaling, standard cell library generation becomes crucial process to enhance performance via design technology co-optimization (DTCO) and system technology co-optimization (STCO) exploration. In this work, we review existing methodologies and algorithms used for standard cell layout automation for sub-10nm nodes, categorized by their algorithmic approaches for transistor placement and internal cell routing. We address the challenges of managing the complexity inherent in cell layout automation with complexity analysis on transistor partitioning, and gear ratio-enabled routing. We also demonstrate the challenges of optimality. Particularly in advanced technology nodes, we present examples of trade-offs between cell metrics due to the scarcity of routing resources. We identify future directions for DTCO/STCO and layout automation. These include advancements in logic optimization, circuit topology enumeration, standard cell fusion, and integrating novel devices such as thin-film and Magnetoelectric Spin-Orbit (MESO) libraries into the standard cell design process. This study aims to offer a clear roadmap for future research and development in standard cell layout automation and explore solutions to meet the challenges posed by shrinking technology nodes and increasingly complex design requirements.

3A-2 (Time: 16:05 - 16:30)

Title	(Invited Paper) ML-assisted SRAM Soft Error Rate Characterization: Opportunities and Challenges
Author	*Masanori Hashimoto, Ryuichi Yasuda, Kazusa Takami, Yuibi Gomi (Kyoto University, Japan), Kozo Takeuchi (Japan Aerospace Exploration Agency, Japan)
Page	pp. 379 - 384
Keyword	Single-Event Upset, Soft Error, TCAD, Active Learning, Machine Learning
Abstract	Soft errors from cosmic rays are a significant concern for reliability-critical applications such as autonomous driving and supercomputers. In this paper, we review soft error rate (SER) estimation for SRAM, the most sensitive component in digital logic chips, and explore how machine learning can assist in SRAM SER characterization. We propose an efficient discriminator construction method for single-event upset (SEU) using active learning and adaptive hyperparameter tuning in the learning algorithm. This method iteratively labels samples through technology computer-aided design (TCAD) simulations, determining whether an upset occurs for an unlabeled sample with the lowest confidence in prediction. Our approach eliminates the need for empirical modeling based on tacit knowledge, systematically building a model while reducing the training data needed to achieve sufficient event-wise accuracy. Experiments with a 12-nm SRAM show that the training data required to achieve the same accuracy was reduced by 41\% for 80\% accuracy and by 31\% for 85\% accuracy. Finally, we discuss future directions and challenges in advanced nano-sheet and CFET transistors.

3A-3 (Time: 16:30 - 16:55)

Title	(Invited Paper) Boosting Standard Cell Library Characterization with Machine Learning
Author	Zhengrui Chen, Chengjun Guo, Zixuan Song (Zhejiang University, China), Guozhu Feng (Zhejiang University/Empyrean Technology, China), Shizhang Wang (Zhejiang University, China), Li Zhang (Hubei University of Technology, China), Xunzhao Yin, Zhenhua Wu, Zheyu Yan, *Cheng Zhuo (Zhejiang University, China)
Page	pp. 385 - 391
Keyword	Standard cell, Library characterization, Machine learning
Abstract	As VLSI designs grow more complex and transition to smaller process nodes, accurate and efficient library characterization has become increasingly crucial within DTCO and STCO flows. Current open-source tools, however, are constrained to basic library characterization functions and fail to adequately meet modern design demands. In this paper, we review the existing open-source standard cell characterization tools, summarize their limitations, and introduce ZlibBoost—a new open-source framework designed to offer both flexibility and efficiency. We leverage ZlibBoost for LUT index optimization, dynamic power supply noise modeling, and machine learning-based prediction to enhance efficiency and accuracy in library characterization. Experimental results show that such a tool is helpful for both academia and industry to effectively navigate DTCO and STCO challenges.

3A-4 (Time: 16:55 - 17:20)

Title	(Invited Paper) Exploring Better Intra-Cell Routability for Layout Synthesis of Multi-Row Standard Cells
Author	Kairong Guo, Xiaohan Gao, Haoyi Zhang, Runsheng Wang, Ru Huang, *Yibo Lin (Peking University, China)
Page	p. 392
Keyword	Standard Cell, Placement, Routing, DTCO
Abstract	Standard cells are the primary building blocks for modern digital integrated circuits. Traditionally, standard cells are designed with identical heights to fit into placement rows, which are also known as single-row height cells. With the aggresive scaling of technology nodes, single-row cells are no longer suitable for complex cells like large combinational gates, multi-bit flip-flops, and so on. Multi-row height standard cells have been adopted due to their potential advantages in performance, power, and area (PPA). By extending cell height from one row to multiple rows, multi-row designs allow for greater functional density within a single cell, potentially mitigating circuit-level routability issues, optimizing signal delay, and enhancing power distribution. However, multi-row cells also pose unique challenges in intra-cell routability, as the expanded cell height introduces additional vertical interconnects and broader search space for transistor placement. This talk presents an approach to explore better intra-cell routability for layout synthesis of multi-row standard cells. Our method integrates early-stage routing consideration into transistor placement to enhance the routability and efficiency of the follow-up transistor routing. We propose an efficient routability-aware exploration technique to search transistor placement concurrently. This technique enables parallel exploration of candidate placement solutions and prunes inferior solutions at early stages. Given the transistor placement, we further develop a SAT-based router to validate the routing quality and efficiency. Experimental results demonstrate the potential of our method to improve intra-cell routability in multi-row cell design, enabling the possibility of designing large standard cells and marking a step forward in Design-Technology Co-Optimization (DTCO) for advanced technology nodes.

[To Session Table]

Session 3B (T9-1) Timing Analysis and Optimization
Time: 15:40 - 17:45, Tuesday, January 21, 2025
Location: Room Uranus
Chair: Heechun Park (Ulsan National Institute of Science and Technology, Republic of Korea)

3B-1 (Time: 15:40 - 16:05)

Title	Graph-Based Timing Prediction at Early-Stage RTL Using Large Language Model
Author	*Fahad Rahman Amik, Yousef Safari, Zhanguang Zhang (McGill University, Canada), Boris Vaisband (University of California, Irvine, USA)
Page	pp. 393 - 400
Keyword	Verilog RTL, timing analysis, graph attention network, large language model, worst negative slack
Abstract	Early-stage timing analyses are essential for exploring design alternatives before physical synthesis in integrated circuit design, which needs to assess signal propagation delay multiple times with varying accuracy. Machine learning (ML) offers promising solutions for early-stage timing prediction, improving result quality while reducing runtime, time-to-market, and non-recurring engineering costs. However, existing ML-based approaches for predicting timing at the register-transfer level (RTL) are not sufficiently reliable to replace traditional electronic design automation tools as they face two key challenges: 1) feature generation based on high-level RTL is unreliable due to unpredictable synthesizer outputs, and 2) they omit essential features like technology library information and design constraints, which are crucial for accurate timing analysis. This paper presents a novel ML-based framework addressing these challenges for accurate and fast timing prediction at the RTL-level. The framework includes three components: an RTL dataset generator, a large language model-based RTL-to-graph parser, and a graph-based ML engine. The performance of the proposed framework was evaluated using real world RTL benchmarks and compared to commercial tools. A correlation of 0.98 in predicting timing slack is achieved by the proposed method surpassing prior state-of-the-art ML-based techniques. Furthermore, this method achieves a 749x speedup as compared to Cadence Tempus.

3B-2 (Time: 16:05 - 16:30)

Title	SI-Aware Wire Timing Prediction at Pre-Routing Stage with Multi-Corner Consideration
Author	Yushan Wang, *Xu He, Renjun Zhao (Hunan University, China), Yao Wang (Independent Researcher, China), Chang Liu, Yang Guo (National University of Defense Technology, China)
Page	pp. 401 - 406
Keyword	Machine Learning, Static Timing Analysis, Timing Prediction, Multi-Corner
Abstract	Timing closure is a critical but effort-taking task in VLSI designs. Early design stages have relatively ample room for changes that can fix timing problems in a proactive manner. However, accurate timing prediction is very challenging at early stages due to the absence of information determined by later stages in the design flow. At the pre-routing stage, it is generally believed that the prediction of wire delay is more complicated than that of gate delay, since the former is highly dependent on the routing information and PVT conditions. Addressing that, in this work, the prediction model is studied and the importance of multiple features, including Signal-Integrity (SI) related ones, is explored, with the purpose of boosting the turnaround time of physical design and reducing the performance penalty caused by worst-case scenario assumptions. Experimental results show that the proposed timing predictor has achieved a correlation of over 0.98 with the sign-off timing results in SI-mode under multi-corner scenarios.

3B-3 (Time: 16:30 - 16:55)

Title	iTAP: An Incremental Task Graph Partitioner for Task-parallel Static Timing Analysis
Author	Boyang Zhang, Che Chang, Cheng-Hsiang Chiu, Dian-Lun Lin (University of Wisconsin, Madison, USA), Yang Sui (Rice University, USA), Chih-Chun Chang, Yi-Hua Chung, Wan Luan Lee (University of Wisconsin, Madison, USA), Zizheng Guo, Yibo Lin (Peking University, USA), *Tsung-Wei Huang (University of Wisconsin, Madison, USA)
Page	pp. 407 - 415
Keyword	Static Timing Analysis
Abstract	Recent static timing analysis (STA) tools have utilized task dependency graph (TDG) parallelism to enhance the STA runtime performance. Although TDG parallelism shows promising speedup, the overhead of scheduling a TDG can become dominant as the TDG becomes larger. To minimize the scheduling overhead, several TDG partitioning algorithms have been proposed to reduce the TDG size without affecting its task parallelism. Despite improved performance, existing TDG partitioners all fall short of incremental partitioning, limiting their practical use in STA tools that support timing-driven operations. To overcome this limitation, we propose iTAP, an incremental TDG partitioner to fully leverage the power of TDG partitioning in task-parallel STA applications. Compared to a state-of-the-art full TDG partitioner, iTAP enhances the overall STA performance by up to 2.97×.

Best Paper Candidate
3B-4 (Time: 16:55 - 17:20)

Title	PathGen: An Efficient Parallel Critical Path Generation Algorithm
Author	Che Chang, Boyang Zhang, Cheng-Hsiang Chiu, Dian-Lun Lin, Yi-Hua Chung, Wan-Luan Lee (University of Wisconsin at Madison, USA), Zizheng Guo, Yibo Lin (Peking University, China), *Tsung-Wei Huang (University of Wisconsin at Madison, USA)
Page	pp. 416 - 424
Keyword	Static Timing Analysis, Critical Path Generation
Abstract	Critical Path Generation (CPG) is fundamental for many static timing analysis (STA) applications. As the circuit complexity continues to increase, CPG runtime has quickly become the bottleneck due to its time-consuming and iterative nature. Despite many CPG algorithms introduced by existing timers, nearly all of them are limited to a single CPU thread, leading to long runtime for large CPG queries. To mitigate this runtime challenge, we need a parallel CPG algorithm. However, designing a parallel CPG algorithm is very challenging because we need to strategically partition the path search space into multiple groups that can run in parallel while accommodating different slack priorities. To overcome this challenge, we propose PathGen, an efficient CPU-parallel CPG algorithm. PathGen introduces a multi-level queue scheduling framework that can efficiently parallelize the search process of critical paths. Compared to a state-of-the-art single-threaded timer, PathGen is up to 7.4× faster with 16 threads and achieves nearly 100% accuracy when generating one million critical paths on large designs.

3B-5 (Time: 17:20 - 17:45)

Title	Yield-driven Clock Skew Scheduling Based on Generalized Extreme Value Distribution
Author	*Kaixiang Zhu, Wai-shing Luk, Lingli Wang (Fudan University, China)
Page	pp. 425 - 432
Keyword	Clock skew scheduling, Design-for-Manufacturability, General extreme value
Abstract	Clock skew scheduling is a cost-effective technique for enhancing the synchronous digital VLSI systems. The technique solely requires adjusting the clock skew to meet the timing constraints of the signal paths in order to increase the clock frequency or yield. In the past, Gaussian distributions were commonly assumed in the probability density function (PDF) modeling of maximum and minimum path delays under process variations. However, this assumption may not be appropriate due to the notable asymmetry of the actual path delay distributions. In this paper, we suggest the generalized extreme value (GEV) distribution as a potential alternative. Furthermore, we evaluate maximum likelihood estimation, linear-moments, and the method of moments (MoM) for parameter estimation. Experimental results show that the GEV distribution can more accurately approximate the cumulative distribution function (CDF) of the benchmark circuit path delays, resulting in an average improvement of 40% in the Kolmogorov-Smirnov (KS) statistic. Furthermore, yield-driven clock skew scheduling based on the GEV distribution produces superior timing yield outcomes compared to that based on the Gaussian distribution, with an improvement in timing yield up to 33% and average 8%.

[To Session Table]

Session 3C (T12-1) Side Channel Attacks and Trusted Execution Environment
Time: 15:40 - 17:45, Tuesday, January 21, 2025
Location: Room Venus
Chairs: Qiang Liu (Tianjin University, China), Song Chen (University of Science and Technology of China, China)

3C-1 (Time: 15:40 - 16:05)

Title	Making Legacy Hardware Robust against Side Channel Attacks via High-Level Synthesis
Author	Md Imtiaz Rashid, *Benjamin Carrion Schafer (The University of Texas at Dallas, USA)
Page	pp. 433 - 439
Keyword	Side-channel attack, Legacy code, compiler, High-Level Synthesis
Abstract	This work introduces a complete flow to make legacy, side-channel attack (SCA) unaware, hardware given as an Register Transfer Level (RTL) description (Verilog) secure through an RTL to C compiler that generates optimized C code for High-Level Synthesis (HLS). This compiler analyzes the legacy RTL description against SCA and generates C code that can then be in turn re-synthesized into new RTL code that is security-aware. Experimental results show that our proposed flow is able to make security unaware RTL code secure introducing minimal overheads.

3C-2 (Time: 16:05 - 16:30)

Title	Machine Learning-Based Real-Time Detection of Power Analysis Attacks Using Supply Voltage Comparisons
Author	Nan Wang, *Ruichao Liu, Yufeng Shan, Yu Zhu (East China University of Science and Technology, China), Song Chen (University of Science and Technology of China, China)
Page	pp. 440 - 446
Keyword	Power analysis attack, real-time detection, power grid, voltage comparator, support vector machine
Abstract	Modern power analysis attacks (PAAs) pose significant threats to hardware security, and reliably securing integrated systems against advanced PAAs has become a significant design target in integrated circuits. However, the detection accuracy of most countermeasures to PAAs significantly decreases when power side-channel information is mixed with voltage noise. In this paper, a real-time PAA detection technique is proposed to achieve high detection accuracy even with large voltage noise. The voltage drops of certain power grid (PG) nodes caused by PAA are evaluated by a number of voltage comparisons between PG nodes, which compensate for the effects of voltage noise. These voltage comparison results are analyzed by machine learning algorithms, and a linear support vector machine (SVM) model is selected as the PAA detection model. The PAAs on an IBM benchmarked microprocessor are applied to evaluate the detection accuracy, and our proposed PAA detection method achieves 93.11% accuracy in detecting a resistance

3C-3 (Time: 16:30 - 16:55)

Title	Side-channel Collision Attacks on Hyper-Dimensional Computing based on Emerging Resistive Memories
Author	*Brojogopal Sapui, Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany)
Page	pp. 447 - 453
Keyword	side channel, CiM, hyperdimensional computing, collision analysis, countermeasure
Abstract	Brain-inspired architectures are increasingly favored for edge devices due to their efficient execution of cognitive tasks with limited energy and computational resources. A promising approach in this field is Hyper-Dimensional Computing (HDC), known for its robustness against noise and simple computational operations, despite being constrained by memory bandwidth. HDC is well-suited for computation in memory (CiM) using emerging resistive memory technologies. However, security concerns arise from potential attack vectors in HDC, spanning from computational algorithms to the underlying technology. Since HDC relies on unique data patterns, or class hypervectors, stored in memory, there is a risk of undetected data manipulation or poisoning. We demonstrate that power information from insensitive (public) outputs can expose secret data stored in memories. This study investigates side-channel vulnerabilities in Content Addressable Memory (CAM)-HDC implemented with resistive memory-based CiM. We develop a collision attack using side-channel information to recover predicted classes from all possible outputs accurately. Our findings highlight a security threat in HDC even with parallel computation between query and class hypervectors. To address this vulnerability, we propose an effective countermeasure based on a hiding technique for CiM implementation, mitigating the identified security risks.

3C-4 (Time: 16:55 - 17:20)

Title	Dep-TEE: Decoupled Memory Protection for Secure and Scalable Inter-enclave Communication on RISC-V
Author	*Shangjie Pan (SKLP, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/Zhongguancun Laboratory, China), Xuanyao Peng (SKLP, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Zeyuan Man (ShanghaiTech University/Beijing Institute of Open Source Chip, China), Xiquan Zhao, Dongrong Zhang (Zhongguancun Laboratory, China), Bicheng Yang, Dong Du (Shanghai Jiao Tong University, China), Hang Lu (SKLP, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/Zhongguancun Laboratory, China), Yubin Xia (Shanghai Jiao Tong University, China), Xiaowei Li (SKLP, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/Zhongguancun Laboratory, China)
Page	pp. 454 - 460
Keyword	RISC-V, Trusted execution environment, Memory isolation, Communication
Abstract	Trusted Execution Environment (TEE) has been widely implemented by modern hardware vendors to protect security and privacy-sensitive applications and data, such as Intel SGX/TDX, ARM TrustZone, AMD SEV, and RISC-V Penglai. However, existing TEE systems face challenges in balancing memory isolation among security, performance, and scalability requirements. This paper introduces a novel TEE system, Dep-TEE, which decouples memory protection (to segments) from address translation (to page tables). This design improves communication performance by dynamically adjusting memory protection capabilities, without sacrificing application compatibility, and enhances security by safeguarding against attacks on page tables. We have built the prototype of Dep-TEE based on FPGA with the hardware extensions and software supports. The evaluation demonstrates that Dep-TEE significantly surpasses existing TEE solutions, achieving three orders of magnitude lower communication latency and 10x greater scalability while maintaining robust security guarantees.

3C-5 (Time: 17:20 - 17:45)

Title	Through Fabric: A Cross-world Thermal Covert Channel on TEE-enhanced FPGA-MPSoC Systems
Author	*Hassan Nassar, Jeferson Gonzalez-Gomez, Varun Manjunath (KIT, Germany), Lars Bauer (-, Germany), Jörg Henkel (KIT, Germany)
Page	pp. 461 - 467
Keyword	FPGA, OPTEE, thermal covert channel, accelerator
Abstract	The ever-evolving computing landscape gets more complex in every moment and the need for heterogeneous compute systems becomes more relevant. As the usability of such systems grew, finding methods for securing them became more relevant. Commercial vendors already introduced Trusted Execution Environments (TEEs) for those systems. TEEs serve the need for isolation, where sensitive data are processed in a secure world, and non-trusted applications are executed in the normal world. In this paper, we introduce Through Fabric: a novel attack against TEE-enhanced FPGA-MPSoCs. We show that existing benign hardware accelerators can be manipulated from the secure world to implement a temperature-based covert channel. We successfully run this attack on a commercial FPGA-MPSoC within the OPTEE environment without additional access rights. We use an open-source implementation of AES for the accelerator and we reach a transmission speed of 2 bits per second with bit error rate of 1.9% and packet error rate of 4.3%. We are the first to show that a TEE can be bypassed on FPGA-MPSoCs via temperature-based covert channel communication.

[To Session Table]

Session 3D (T3-2) Frameworks and Modeling for Computing-In-Memory
Time: 15:40 - 17:20, Tuesday, January 21, 2025
Location: Room Mars/Mercury
Chairs: Zhenhua Zhu (Tsinghua University, China), Kentaro Yoshioka (Keio University, Japan)

3D-1 (Time: 15:40 - 16:05)

Title	Theoretical Optimal Specifications of Memcapacitors for Charge-Based In-Memory Computing
Author	*Zichen Qian, Rentao Wan (Columbia University, USA), Chin-Hsiang Liao, Steven Koester (University of Minnesota, USA), Mingoo Seok (Columbia University, USA)
Page	pp. 468 - 475
Keyword	memcapacitor, in-memory computing, optimal specifications, energy efficiency, weight density
Abstract	This paper presents the optimal theoretical specifications of memcapacitors that allow them to surpass the existing state-of-the-art in-memory computing (IMC) prototypes in terms of energy efficiency and weight density. To do so, we build and simulate the SPICE model of an existing memcapacitor device in an IMC macro featuring charge-based computing. We develop the energy efficiency model and weight density model of the memcapacitor-based IMC macro, which are verified against SPICE simulation. Finally, we present the optimal theoretical specifications for memcapacitor devices to obtain a 10x improvement in energy efficiency and/or weight density over the existing IMC prototypes.

3D-2 (Time: 16:05 - 16:30)

Title	An Island Style Multi-Objective Evolutionary Framework for Synthesis of Memristor-Aided Logic
Author	Umar Afzaal, *Seunggyu Lee, Youngsoo Shin (Korea Advanced Institute of Science and Technology (KAIST), Republic of Korea)
Page	pp. 476 - 482
Keyword	Memristor-aided logic, Logic synthesis, Evolutionary algorithm
Abstract	The optimal in-memory mapping onto memristor crossbars involves competing design goals: minimizing crossbar utilization, reducing delay, and achieving an even layout. Existing heuristic algorithms struggle to address these objectives simultaneously, often yielding suboptimal solutions. This paper introduces an automatic design solution to optimize multiple objectives concurrently. Specifically, it proposes an island-style evolutionary algorithm for multi-objective optimization of in-memory mapping. This algorithm produces a set of solutions, corresponding to Pareto points. Each point can be stored in a library of mapping solutions, which can be chosen when corresponding design is re-used as a macro. Experimental evaluation on IWLS benchmarks demonstrates the effectiveness of this approach in addressing multiple design objectives efficiently.

3D-3 (Time: 16:30 - 16:55)

Title	PIMutation: Exploring the Potential of Real PIM Architecture for Quantum Circuit Simulation
Author	Dongin Lee (National University of Singapore, Singapore), *Enhyeok Jang, Seungwoo Choi, Junwoong An, Cheolhwan Kim, Won Woo Ro (Yonsei University, Republic of Korea)
Page	pp. 483 - 490
Keyword	Near Data Processing, Processing-in-Memory, Quantum Circuit Simulation
Abstract	Quantum circuit simulations are essential for the verification of quantum algorithms. However, as the number of qubits increases, the memory requirements for performing full state vector simulations grow exponentially. Furthermore, a substantial number of computations in quantum circuit simulations cause low locality data accesses. The two characteristics lead to significant latency and energy overhead for CPU and main memory data transfers. Processing-in-Memory (PIM), which leverages computational logic near DRAM banks, can be a promising solution to solve problems. In this paper, we propose PIMutation (PIM framework for qUanTum circuit simulATION). PIMutation is the first attempt to leverage UPMEM, a publicly available PIM-integrated DIMM, to implement quantum circuit simulations. PIMutation features three optimization strategies to overcome the overhead of quantum circuit simulation using the real PIM system: (i) gate merging, (ii) row swapping, and (iii) vector partitioning. Our experimental results show that PIMutation achieves an average speedup of 2.3x and 16.5x with a reduction of energy of 37.2% and 72.5% over the QuEST simulator on CPU in 16- and 32-qubit benchmarks.

3D-4 (Time: 16:55 - 17:20)

Title	A Fail-Slow Detection Framework for HBM devices
Author	*Zikang Xu, Yiming Zhang, Zhirong Shen (XiaMen University, China)
Page	pp. 491 - 497
Keyword	HBM Reliability, Fault detection, Transient Fault, Performance Degradation
Abstract	Fail-slow is a failure mode that exists in various components such as disks, SSDs, CPUs, memory, and networks, manifested as decreased performance but not causing crashes. We investigated the manifestation of Fail-slow in HBM systems and proposed a fault detection framework based on a light regression model. It can detect potential Fail-slow faults while HBM devices are undergoing factory testing. In our small-scale simulation tests, we found a Fail-slow phenomenon in one single HBM device. In addition, We assemble a large-scale HBM log dataset from our production traces, based on which we provide root cause analysis on Fail-slow HBM devices covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for further study.

[To Session Table]

Session 3E (T4.1-1) AI/ML for Circuit Design and Prediction
Time: 15:40 - 17:45, Tuesday, January 21, 2025
Location: Innovation Hall
Chair: Jaeyong Chung (Yonsei University, Republic of Korea)

Best Paper Candidate
3E-1 (Time: 15:40 - 16:05)

Title	DeepSeq2: Enhanced Sequential Circuit Learning with Disentangled Representations
Author	*Sadaf Khan, Zhengyuan Shi, Ziyang Zheng (The Chinese University of Hong Kong, Hong Kong), Min Li (Huawei Noah's Ark Lab, China), Qiang Xu (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 498 - 504
Keyword	Representation Learning, Graph Neural Network, Sequential Circuit
Abstract	Circuit representation learning is increasingly pivotal in Electronic Design Automation (EDA), serving various downstream tasks with enhanced model efficiency and accuracy. One notable work, DeepSeq, has pioneered sequential circuit learning by encoding temporal correlations. However, it suffers from significant limitations including prolonged execution times and architectural inefficiencies. To address these issues, we introduce DeepSeq2, a novel framework that enhances the learning of sequential circuits, by innovatively mapping it into three distinct embedding spaces—structure, function, and sequential behavior—allowing for a more nuanced representation that captures the inherent complexities of circuit dynamics. By employing an efficient Directed Acyclic Graph Neural Network (DAG-GNN) that circumvents the recursive propagation used in DeepSeq, DeepSeq2 significantly reduces execution times and improves model scalability. Moreover, DeepSeq2 incorporates a unique supervision mechanism that captures transitioning behaviors within circuits more effectively. DeepSeq2 sets a new benchmark in sequential circuit representation learning, outperforming prior works in power estimation and reliability analysis.

3E-2 (Time: 16:05 - 16:30)

Title	A Self-Supervised, Pre-Trained, and Cross-Stage-Aligned Circuit Encoder Provides a Foundation for Various Design Tasks
Author	*Wenji Fang, Shang Liu (The Hong Kong University of Science and Technology, Hong Kong), Hongce Zhang (The Hong Kong University of Science and Technology (Guangzhou), China), Zhiyao Xie (The Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 505 - 512
Keyword	circuit representation learning, contrastive learning, RTL-stage PPA prediction, netlist reverse engineering
Abstract	Machine learning (ML) techniques have shown remarkable effectiveness in electronic design automation (EDA). Traditionally, most ML for EDA approaches are task-specific, requiring a tedious development process of a tailored ML model for each individual design task. Recently, circuit representation learning has emerged as a promising trend. This approach converts circuits into embeddings, which can then be adapted to distinct downstream tasks. However, existing methods still fall short of providing a truly general circuit representation that supports highly diverse tasks. In this work, we introduce CircuitEncoder, a self-supervised, pre-trained, and cross-stage-aligned general circuit encoder. It provides a general foundation for diverse ML-based EDA tasks, including both design quality and functional reasoning. CircuitEncoder is pre-trained through multi-stage contrastive learning utilizing unlabeled circuits. It encodes circuits from different design stages into embedding vectors within shared latent space, facilitating fine-tuning for various downstream tasks. CircuitEncoder outperforms the state-of-the-art task-specific supervised solutions for multiple EDA tasks, including design quality tasks for register-transfer level (RTL)-stage timing and area prediction, as well as functional tasks for netlist-stage state register identification.

3E-3 (Time: 16:30 - 16:55)

Title	ParaFormer: A Hybrid Graph Neural Network and Transformer Approach for Pre-Routing Parasitic RC Prediction
Author	*Jongho Yoon, Jakang Lee, Donggyu Kim, Junseok Hur, Seokhyeong Kang (POSTECH, Republic of Korea)
Page	pp. 513 - 519
Keyword	Parasitic Prediction, Heterogeneous Graph, Graph Neural Network, Transformer
Abstract	Predicting the quality of post-route design at an early stage can reduce overall design time. To achieve this, we propose ParaFormer, a pre-routing parasitic RC prediction framework. This framework integrates a heterogeneous graph neural network (HGNN) and a graph transformer to capture the topological and geometric information of circuit data. The HGNN model represents circuit data as heterogeneous graphs to learn complex topological relationships, while the graph transformer calculates attention between each net to learn geometric relationships. Our framework predicts parasitic RC, enabling RC tree modeling and SPEF file generation. This allows the predicted results to be utilized in timing and power analysis using commercial tools. Additionally, we incorporate gradient normalization to reduce the imbalance between different objectives in multi-task learning, improving overall model performance. Experimental results show that ParaFormer achieves R2 scores of 0.9901 and 0.9630 for resistance and capacitance, respectively. In timing analysis, it achieves R2 scores of 0.9749 for wire delay and 0.9876 for cell delay, with a MAPE of 1.45% in power analysis. These results indicate that our method is highly effective for timing and power prediction in the early design stage.

3E-4 (Time: 16:55 - 17:20)

Title	Static IR Drop Prediction with Limited Data from Real Designs
Author	Lizi Zhang, *Azadeh Davoodi (University of Wisconsin - Madison, USA)
Page	pp. 520 - 526
Keyword	Power Delivery Network, static IR drop prediction, deep learning, transfer learning, training with limited data
Abstract	There has been significant recent progress to reduce the computational effort of static IR drop analysis using neural networks, and modeling as an image-to-image translation task. A crucial issue is lack of sufficient data from real industry designs to train these networks. In this work, we first propose a number of improvements to the state-of-the-art U-Net neural network model to achieve better IR drop prediction. First, we propose U-Net with attention gates which allows selective emphasis on relevant parts of the input data without supervision. This is desired because of the often sparse nature of the IR drop map. We also embed the U-Net model with a preprocessing convolutional block which introduces an initial per- image filter to better handle the multi-image to single-image nature of the problem. Next, to address lack of sufficient data we propose a two-phase training process which utilizes a mix of artificially-generated data and a limited number of points from real designs with custom learning and dropout rates at each phase, and a custom loss function. We also propose a data augmentation step based on image transformations to augment the training data. Based on the ICCAD 2023 contest setup, our results are on-average, 38% (64%) better in MAE and 26% (142%) in F1 score compared to the winner of the ICCAD 2023 contest (and U-Net only) when only tested on the set of real designs in the testing set.

3E-5 (Time: 17:20 - 17:45)

Title	Towards Big Data in AI for EDA Research: Generation of New Pseudo-Circuits at RTL Stage
Author	*Shang Liu, Wenji Fang, Yao Lu, Qijun Zhang, Zhiyao Xie (HKUST, Hong Kong)
Page	pp. 527 - 533
Keyword	pseudo-circuit, graph generation
Abstract	Machine learning (ML) techniques have demonstrated remarkable effectiveness in electronic design automation (EDA). ML models need to be trained on diverse circuit datasets for better accuracy and generalization capabilities. However, the availability of circuit data remains a long-standing severe issue. The strong data privacy concern in the semiconductor industry makes direct sharing of circuit IPs almost impossible. To address the data availability problem, open-source datasets like CircuitNet have been proposed, but they mostly focus on collecting labels of several existing open-source designs, instead of generating any new designs. In this work, we make an innovative exploration to directly generate new pseudo-circuits without human effort. We believe that generating pseudo-circuits is the most promising, if not the only, approach to achieving “big data” in the semiconductor industry in the foreseeable future. We demonstrate that pseudo-circuits can significantly boost the performance of ML models in early design quality predictions, as early as the pre-synthesis RTL stage.

Wednesday, January 22, 2025

[To Session Table]

Session 2K 30th Anniversary and Keynote Session II
Time: 8:30 - 9:45, Wednesday, January 22, 2025
Location: Miraikan Hall
Chair: Shinji Kimura (Waseda University, Japan)

2K-1

Title

30th Anniversary

2K-2

Title	(Keynote Address) Compilation and Architecture Optimization for Quantum Computing
Author	Jason Cong (Computer Science Department, UCLA/Center for Domain-Specific Computing (CDSC), UCLA, USA)
Abstract	The rapid progress in quantum computing (QC) technologies in the past decade led to QC processors with hundreds to thousands of qubits. As a result, efficient compilation flow become both important and challenging. In this talk, I focus on the critical step of the compilation called quantum layout synthesis (QLS), which determines the space and time of computation of a quantum circuit. I first show that the existing QLS solutions, surprisingly, are far away from optimal, despite the effort of the research community for more than a decade. Then, I present recent progress on scalable and highly optimized QLS solutions for both QC processors with fixed connectivity, such as those based on the superconducting technology, and those with programmable connectivity, such as those using neutral atom arrays. Finally, I shall discuss how such optimized QLS tools can be used to guide QC processor architecture optimization, e.g. in determining the number of movable lasers and the configuration of storage vs computing zones for neutral atom arrays.

[To Session Table]

Session 4A (T4.2-2) Advanced Methods in AI Hardware Co-Design
Time: 10:05 - 11:45, Wednesday, January 22, 2025
Location: Room Saturn
Chairs: Wanyeong Jung (Korea Advanced Institute of Science and Technology (KAIST)), Yeseong Kim (DGIST, Republic of Korea)

4A-1 (Time: 10:05 - 10:30)

Title	Accelerator for LLM-Enhanced GNN with Product Quantization and Unified Indexing
Author	*Jiaming Xu, Jinhao Li, Jun Liu, Hao Zhou, Guohao Dai (Shanghai Jiao Tong University, China)
Page	pp. 534 - 540
Keyword	Graph Foundation Model, LLM, GNN, Product Quantization
Abstract	To alleviate the vulnerability of graph neural networks (GNNs) on unseen graphs, many works propose to integrate large language models (LLMs) into GNNs, called graph foundation models (GFMs). The LLM-enhanced GNN, a typical integration method of GFMs, has achieved state-of-the-art performance in most graph-related tasks. However, intensive general matrix multiplications (GEMMs) overhead of LLMs poses a significant challenge to end-to-end inference latency. The introduction of LLMs brings 100× more workload than original GNNs, with GEMMs accounting for more than 99%, becoming the bottleneck of end-to-end inference. To tackle the above challenge, we present GFMEngine, an algorithm and hardware co-design accelerator supporting LLM-enhanced GNNs at multiple levels. (1) At the algorithm level, we point out that the computational precision of LLMs has little impact on the end-to-end accuracy, and propose a product-quantization-based (PQ-based) matrix multiplication for LLMs to alleviate the intensive GEMMs in LLMs, reducing more than 70% computation with negligible accuracy loss. (2) At the hardware level, we point out that the implementation of PQ-based matrix multiplication effectively alleviates the intensive GEMMs but results in a substantial increase in dynamic memory access. Coupled with the dynamic memory access inherent in GNNs, we design a unified indexing unit as the hardware support, reducing ∼ 30% memory access in end-to-end inference. (3) At the compilation level, we further design an extensible instruction set architecture as the software support, GFM-ISA, for various real-world GFM tasks. We implement GFMEngine with TSMC 28nm process, and extensive experiments show that GFMEngine achieves up to 3.93×, 38.66×, 22.32×, 2.96× speedup and up to 102.52×, 37.82×, 28.37×, 2.56× energy efficiency improvement compared with NVIDIA Tesla A100 and the domain-specific accelerators, SGCN, MEGA, FACT, respectively.

4A-2 (Time: 10:30 - 10:55)

Title	MICSim: A Modular Simulator for Mixed-signal Compute-in-Memory based AI Accelerator
Author	*Cong Wang, Zeming Chen, Shanshi Huang (The Hong Kong University of Science and Technology(Guangzhou), China)
Page	pp. 541 - 547
Keyword	Compute-in-memory, deep learning accelerator, pre-circuit simulator, open-source tool
Abstract	This work introduces MICSim, an open-source, pre-circuit simulator designed for early-stage evaluation of chip-level software performance and hardware overhead of mixed-signal compute-in-memory (CIM) accelerators. MICSim features a modular design, allowing easy multi-level co-design and design space exploration. Modularized from the state-of-the-art CIM simulator NeuroSim, MICSim provides a highly configurable simulation framework supporting multiple quantization algorithms, diverse circuit/architecture designs, and different memory devices. This modular approach also allows MICSim to be effectively extended to accommodate new designs. MICSim natively supports evaluating accelerators’ software and hardware performance for CNNs and Transformers in Python, leveraging the popular PyTorch and HuggingFace Transformers frameworks. These capabilities make MICSim highly adaptive when simulating different networks and user-friendly. This work demonstrates that MICSim can easily be combined with optimization strategies to perform design space exploration and used for chip-level Transformers CIM accelerators evaluation. Also, MICSim can achieve a 9x ~ 32x speedup of NeuroSim through a statistic-based average mode proposed by this work.

4A-3 (Time: 10:55 - 11:20)

Title	DIAG: A Refined Four-layer Agile Hardware Developing Flow for Generating Flexible Reconfigurable Architectures
Author	*Haojia Hui (Li Auto/School of Integrated Circuits, Tsinghua University, China), Jiangyuan Gu (School of Integrated Circuits, Tsinghua University/International lnnovation Center of Tsinghua University, China), Xunbo Hu, Shaojun Wei, Shouyi Yin (School of Integrated Circuits, Tsinghua University, China)
Page	pp. 548 - 553
Keyword	DIAG, Agile Hardware Developint, Generative HDL, CGRA Generator
Abstract	Rapid evolution in application algorithms, exemplified by advancements in artificial intelligence, wireless communication, and sciencific computing, necessitates a focus on developing energy-efficient, highly-flexible parallel computing architectures. This urgency is further amplified by the need for agile hardware development techniques to mitigate design complexity and reduce costs. Among emerging agile hardware development techniques, generative-HDL stands out due to its straightforward grammatical structure and compatibility with hardware design thinking, yet it remains underutilized. In response, this paper introduces a novel four-layer agile developing flow, termed DIAG, innovatively leveraging unique Plugin-Service technology. The DIAG framework is applied to an extensible reconfigurable architecture generator, enabling the generation of diverse CGRA designs suitable for accelerating task computations across multiple application domains. Our comprehensive experiments on the CGRA generator design validate the efficiency of the DIAG flow and underscore generative-HDL's significant potential for complex, large-scale hardware development.

4A-4 (Time: 11:20 - 11:45)

Title	MPICC: Multiple-Precision Inter-Combined MAC Unit with Stochastic Rounding for Ultra-Low-Precision Training
Author	*Leran Huang (Key Laboratory of Advanced Sensor and Integrated System, Tsinghua Shenzhen International Graduate School, China), Yongpan Liu, Xinyuan Lin, Chenhan Wei, Wenyu Sun, Zengwei Wang, Boran Cao, Chi Zhang, Xiaoxia Fu, Wentao Zhao (Department of Electronic Engineering, Tsinghua University, China)
Page	pp. 554 - 559
Keyword	Ultra-Low-Precision Training, Multiply-Accumulate (MAC) Unit, Multiple-Precision Computing, Stochastic Rounding (SR)
Abstract	Recent studies have proved the feasibility of ultra-low-precision (≤ 8-bit) training. However, most of the existing operational circuits support only a few higher precisions (FP16, FP32, etc.) for multiplication, and the bit width for accumulation cannot be reduced, which results in low area and power efficiency. In this paper, we propose MPICC, a multiple-precision Multiply-Accumulate (MAC) unit designed for ultra-low-precision training. It supports inter-combined computations across 18 different precision combinations, including LOG4 (radix-4 FP4), FP4, FP6, FP8, and INT4. It also reduces the accumulation precision from FP16 to FP12 through an optimized Stochastic Rounding (SR) strategy to further save logic resources. Moreover, a low-cost emulating controller, which time-division multiplexed the low-precision MAC unit, is also designed to accomplish high-precision computations for critical DNN layers. Compared with the existing multiple-precision computing units, the area/energy efficiencies of this design are improved by 1.17×/1.19× at FP8, and 4.69×/3.64× at FP4, respectively. The SR strategy further reduces the area/power consumption by 15.6%/14.9% of the floating-point accumulator.

[To Session Table]

Session 4B (T1-2) Communication Networks
Time: 10:05 - 11:45, Wednesday, January 22, 2025
Location: Room Uranus
Chair: Jiang Xu (Hong Kong University of Science and Technology (Guangzhou), Hong Kong)

4B-1 (Time: 10:05 - 10:30)

Title	Physically Aware Wavelength-Routed Optical NoC Design for Customized Topologies with Parallel Switching Elements and Sequence-Based Models
Author	*Wei-Yao Kao, Tai-Jung Lin, Yao-Wen Chang (National Taiwan University, Taiwan)
Page	pp. 560 - 566
Keyword	Optical Network-on-Chip, Network Topology, Wavelength Routing, Parallel Switching Element
Abstract	The wavelength-routed optical network-on-chip (WRONoC) is a promising solution for system-on-chip designs. Recent work in the WRONoC topology designs mainly utilizes crossing switching elements (CSEs) as switching mechanisms on predefined templates. However, using CSEs incurs more microring resonator (MRR) usage and waveguide crossings than parallel switching elements (PSEs), and their predefined templates constrain the solution spaces. To remedy these disadvantages, we propose a fully automated topology design flow that utilizes PSE structures to reduce MRR usage and waveguide crossings. Our add-drop filter sequence model expands the solution space and leverages the advantage of the crossing-free PSE structure. Our fixed-node crossing-aware edge routing effectively minimizes the waveguide crossings, and our A*-search preserves the admissibility property and guarantees an optimal routing solution. Besides, our design flow thoroughly considers the physical layout information. Experimental results show that our design substantially outperforms state-of-the-art works on customized designs.

4B-2 (Time: 10:30 - 10:55)

Title	Zipper: Latency-Tolerant Optimizations for High-Performance Buses
Author	*Shibo Chen (University of Michigan, USA), Hailun Zhang (University of Wisconsin, USA), Todd Austin (University of Michigan, USA)
Page	pp. 567 - 574
Keyword	Accelerator, HW/SW codesign
Abstract	As heterogeneous designs take over the world of hardware designs, the data bus plays a vital role in interconnecting hosts and accelerators. While past works have emphasized increasing communication bandwidth for data-hungry workloads, this work focuses on optimizing latency for latency-sensitive acceleration applications. We first study the pattern of various accelerator workloads and demonstrate that various optimization opportunities exist to reduce the communication latency overhead. To help developers exploit these opportunities, we introduce Zipper--a protocol optimization layer that reduces communication costs by enabling device and request level parallelism and exploiting data locality for existing bus standards. We applied Zipper to two domains and implemented the end-to-end system on a heterogeneous hardware platform with integrated FPGA. Our physical experiments show that Zipper provides 8x speedup for one accelerator with 4.3% logic overhead and 1.5x speedup for another with 0.9% logic overhead.

4B-3 (Time: 10:55 - 11:20)

Title	A Buffer Reservation Scheduling Strategy for Enhancing Performance of NoC Router Bypassing
Author	*Zixuan Liu, Yaoyao Ye (Shanghai Jiao Tong University, China)
Page	pp. 575 - 580
Keyword	Computer architecture, Network-on-chip, Flow Control
Abstract	Network-on-Chip (NoC) is one of the key technologies for augmenting performance and energy efficiency of many-core processors, which addresses the limitations of conventional bus architecture. Flow control in NoC manages the allocation of buffers and links, as well as determines resource assignment among packets. Bypass flow control further optimizes this process by permitting certain packets to bypass specific router pipelines to diminish router latency. Nonetheless, bypass flow control necessitates the assurance of conflict-free bypass paths and the availability of buffer space at the destinations. In this work, we propose a Buffer Reservation Scheduling strategy (BuffeRS) aimed at enhancing NoC performance by increasing secure bypass packet transmissions within a time period. BuffeRS enables packets to reach their destinations via regular transmission or dynamically generated private bypass paths. By designating specific roles to different router groups within corresponding time slots, BuffeRS ensures that destination buffers are reserved before packet arrival. We further delineate the router microarchitecture design to implement BuffeRS in NoC. Simulation results demonstrate an average performance enhancement by 17.9%~43% under synthetic traffic and by 6.23% under PARSEC benchmarks in full-system simulations, as compared to contemporary state-of-the-art works.

4B-4 (Time: 11:20 - 11:45)

Title	RUNoC: Re-inject into the Underground Network to Alleviate Congestion in Large-Scale NoC
Author	*Xinghao Zhu, Jiyuan Bai, Zifeng Zhao, Qirong Yu (Fudan University, China), Gengsheng Chen (Fudan University/Jiashan Fudan Institute, China), Xiaofang Zhou (Fudan University, China)
Page	pp. 581 - 586
Keyword	Large-Scale NoC, Network Congestion, Two-Level Network
Abstract	In modern high-performance systems, the demand for large-scale Networks-on-Chip (NoC) has grown rapidly. As NoCs are scaled up, the issue of network congestion becomes increasingly critical and complex to address. Existing researches utilize partition-based NoC and Two-Level Network (TLN) techniques to alleviate congestion in large-scale NoCs. However, most of these solutions have limitations concerning universality, hardware overhead and workload balancing due to their inherent complexity and design constraints. In this article, we propose RUNoC, a new partition-based TLN architecture consisting of a Main Network for normal transmission and a sparse Underground Network enabling fast transmission. A special hardware unit is designed and integrated into each Main Network Router to decide when a packet needs to be routed to Underground Network based on network congestion information and the distance to the packet's destination, thereby alleviating congestion in Main Network and ensuring a subtle load balance between the two networks. Additionally, RUNoC is further enhanced by Shared Row Buffers (SRBs) and specialized network interface to guarantee deadlock and livelock freedom. Evaluation results indicate that RUNoC improves up to 60% in performance compared to XY routing scheme and up to 34% in performance-area ratio compared to existing researches.

[To Session Table]

Session 4C (T2-2) Design and Optimization of Emerging Embedded Applications
Time: 10:05 - 11:45, Wednesday, January 22, 2025
Location: Room Venus
Chair: Fang-Jing Wu (National Taiwan University, Taiwan)

4C-1 (Time: 10:05 - 10:30)

Title	A Hierarchical Dataflow-Driven Heterogeneous Architecture for Wireless Baseband Processing
Author	*Limin Jiang, Yi Shi, Yintao Liu, Qingyu Deng, Siyi Xu, Yihao Shen, Fangfang Ye, Shan Cao, Zhiyuan Jiang (School of Communication and Information Engineering, Shanghai University, China)
Page	pp. 587 - 593
Keyword	Wireless baseband processing, NUMA, dataflow-driven
Abstract	Wireless baseband processing (WBP) is a key element of wireless communications, with a series of signal processing modules to improve data throughput and counter channel fading. Conventional hardware solutions, such as digital signal processors (DSPs) and more recently, graphic processing units (GPUs), provide various degrees of parallelism, yet they both fail to take into account the cyclical and consecutive character of WBP. Furthermore, the large amount of data in WBPs cannot be processed quickly in symmetric multiprocessors (SMPs) due to the unpredictability of memory latency. To address this issue, we propose a hierarchical dataflow-driven architecture to accelerate WBP. A configurable digital front end (DFE) performs pipelined pre-processing of incoming baseband signals, followed by a pack-and-ship approach under a non-uniform memory access (NUMA) architecture to allow the subordinate tiles to operate in a bundled access and execute manner. We also propose a multi-level dataflow model and the hybrid scheduling scheme to manage and allocate the heterogeneous hardware resources. Experiment results demonstrate that our prototype achieves 2× and 2.3× speedup in terms of normalized throughput and single-tile clock cycles compared with GPU and DSP counterparts in several critical WBP benchmarks. Additionally, a link-level throughput of 288 Mbps can be achieved with a 45-core configuration.

4C-2 (Time: 10:30 - 10:55)

Title	Exploiting Differential-Based Data Encoding for Enhanced Query Efficiency
Author	*Fangxin Liu, Zongwu Wang, Peng Xu, Shiyuan Huang, Li Jiang (Shanghai Jiao Tong University, China)
Page	pp. 594 - 600
Keyword	high-dimensional data, Query Efficiency, Compression
Abstract	Storing large-scale high-dimensional data, which is rapidly generated by both industry and academia, poses substantial challenges, primarily in terms of storage and maintenance costs. While data compression techniques offer a potential solution to these challenges, they must overcome two critical hurdles: 1) preserving data integrity within lossless bounds and 2) maintaining query performance on compressed data. To address the above challenges, we propose a novel approach based on differential techniques that combine high-dimensional data compression with an efficient query engine. Specifically, Our methodology begins by employing vector quantization to transform high-dimensional vectors into compact integer sequences, known as quantization codes—a classical and versatile compression method. Subsequently, we introduce DCQ, which builds upon the differential concept to perform lossless compression on quantization codes. DCQ organizes quantization codes into a tree structure and constructs a hierarchy of trees by analyzing the dissimilarity between two quantization codes. To minimize the impact on query performance, we have developed an adaptive reconstruction algorithm that partitions and reconstructs the original tree structure. This algorithm effectively balances the workload for searching each sub-tree, optimizing the compression-performance trade-off. Finally, we adopt a storage format based on multiple subtrees for rapid searching, thereby enabling differential storage to facilitate efficient queries. Extensive experiments on various large-scale real-world datasets show that DCQ achieves a compression ratio of up to 2.5 on the quantization codes and achieves > 50× performance improvement compared to the state-of-the-art general-purpose lossless compression techniques.

4C-3 (Time: 10:55 - 11:20)

Title	Automated Power-saving User-interfaces for Application Designers
Author	*Huan-Chun Yeh, Yu-Zheng Su, Chun-Han Lin (National Taiwan Normal University, Taiwan)
Page	pp. 601 - 606
Keyword	application designers, mobile applications, user interfaces, OLED displays, power-saving methods
Abstract	As mobile applications become more ingrained in our daily routines, there is a noticeable gap in incorporating power-saving strategies into the toolkit of \textit{user-interface} (UI) designers. This paper explores the fusion of power reduction techniques and UI guidance principles to craft an innovative power-saving design. The method initiates by extracting visible element layouts and assessing UI guidance with human visual systems. Then, a power-saving design is created, designed to uphold global and local UI guidance. Evaluation results conducted using four distinct UI previews on a commercial smartphone are very promising.

4C-4 (Time: 11:20 - 11:45)

Title	An Edge AI and Adaptive Embedded System Design for Agricultural Robotics Applications
Author	*Chun-Hsian Huang (National Changhua University of Education, Taiwan), Zhi-Rui Chen, Huai-Shu Hsu (National Taitung University, Taiwan)
Page	pp. 607 - 613
Keyword	Edge AI, FPGA, Cyber-Physical System, Multimodal learning, Partial Reconfiguration
Abstract	This work presents the AgrBot, an agricultural robot designed to intelligently estimate and predict crop pest and disease severity (PDS). The AgrBot incorporates two binarized neural network (BNN) hardware modules for recognizing target crops and estimating their PDS. In a resource-constrained FPGA-based design, these BNN hardware modules can be configured on-demand, showcasing system adaptivity. Furthermore, a multimodal model that integrates crop images, sensor data, and time features is presented for predicting PDS. Employing edge artificial intelligence (AI) through the BNN hardware modules and the multimodal model enables the AgrBot to determine if biological agents are applied to protect crops from pests and diseases, creating a comprehensive agricultural cyber-physical system (CPS). Experimental results demonstrate accuracies of 76.3% for recognizing target crops, 65.3% for estimating PDS, and 67% for predicting PDS. In comparison to existing microprocessor-based design methods, the AgrBot's BNN hardware modules improve frames per second (FPS) by a factor of 790, while the multimodal model reduces processing time by up to 50.9%.

[To Session Table]

Session 4D (T11-1) Verification and Testing in Machine Lerning Era
Time: 10:05 - 11:45, Wednesday, January 22, 2025
Location: Room Mars/Mercury
Chair: Michihiro Shintani (Kyoto Institute of Technology, Japan)

4D-1 (Time: 10:05 - 10:30)

Title	AssertLLM: Generating Hardware Verification Assertions from Design Specifications via Multi-LLMs
Author	*Zhiyuan Yan (The Hong Kong University of Science and Technology (Guangzhou), China), Wenji Fang, Mengming Li (The Hong Kong University of Science and Technology, Hong Kong), Min Li (Huawei Technologies Co., Ltd., China), Shang Liu, Zhiyao Xie (The Hong Kong University of Science and Technology, Hong Kong), Hongce Zhang (The Hong Kong University of Science and Technology (Guangzhou), China)
Page	pp. 614 - 621
Keyword	Assertion Generation, Large Language Model, Formal Verification, Assertion-based Verification
Abstract	Assertion-based verification (ABV) is a critical method to ensure logic designs comply with their architectural specifications. ABV requires assertions, which are generally converted from specifications through human interpretation by verification engineers. Existing methods for generating assertions from specification documents are limited to sentences extracted by engineers, discouraging their practical applications. In this work, we present AssertLLM, an automatic assertion generation framework that processes complete specification documents. AssertLLM can generate assertions from both natural language and waveform diagrams in specification files. It first converts unstructured specification sentences and waveforms into structured descriptions using natural language templates. Then, a customized Large Language Model (LLM) generates the final assertions based on these descriptions. Our evaluation demonstrates that AssertLLM can generate more accurate and higher-quality assertions compared to GPT-4o and GPT-3.5.

4D-2 (Time: 10:30 - 10:55)

Title	Learning Gate-level Netlist Testability in the Presence of Unknowns through Graph Neural Networks
Author	*Thai-Hoang Nguyen, Youngjin Ju, Dongsub Yoon, Hyojin Choi (Samsung Electronics, Republic of Korea)
Page	pp. 622 - 627
Keyword	Unknown values, Graph neural network, Testability analysis
Abstract	VLSI testing plays a critical role in designing reliable digital integrated circuits (ICs). However, as modern ICs become increasingly complex, numerous testing issues have emerged, hindering their reliability. One such challenge is the presence of unknown input values, known as the X-source inputs problem, where the inputs of a gate-level netlist are unknown. X-source inputs can render other nodes in the circuit untestable, thereby lowering test coverage. To effectively address the X-source problem, it is crucial to understand its impact on the design's testability. In this paper, we propose a Graph Neural Network (GNN)-based method to learn the impact of X-source inputs during testability analysis. Specifically, we first introduce a novel way to represent gate-level netlists as graphs, focusing on testability analysis. We then propose a GNN architecture that can learn both the structural and functional information of a given netlist, significantly aiding in predicting the impact of X-source inputs. Experime

4D-3 (Time: 10:55 - 11:20)

Title	Efficient ML-Based Transient Thermal Prediction for 3D-ICs
Author	Yun-Feng Yang, *Wei-Shen Wang, Yung-Jen Lee, James Chien-Mo Li (National Taiwan University, Taiwan), Norman Chang, Akhilesh Kumar, Ying-Shiun Li, Jessica Yen, Lang Lin (Ansys Inc., USA)
Page	pp. 628 - 634
Keyword	3D-IC, machine learning, thermal prediction
Abstract	Thermal issues of 3D-ICs have become increasingly severe in recent years. Thus, thermal simulation is needed to ensure thermal safety during the design stage. However, performing thermal simulation iteratively requires a significant amount of time. As a result, a fast and accurate method for thermal prediction is a promising alternative to improve the turnaround time. In this paper, we propose a fast thermal prediction method using machine learning models. In the training phase, we employ two models: one for the initial three time steps and another for the subsequent time steps. To enhance prediction accuracy, we introduce two types of features: spaced-windowed features and time-decayed features. These features help us to capture spatial and temporal information effectively. In our experiment, the mean absolute error for the predicted temperature is 1.12°C, and the maximum error is 7.27 °C. In the prediction phase, we achieve a 116X speed-up compared to a commercial tool. With our proposed method, users can predict transient thermal profiles quickly and accurately to ensure thermal safety.

4D-4 (Time: 11:20 - 11:45)

Title	Device-Aware Test for Anomalous Charge Trapping in FeFETs
Author	*Sicong Yuan (Technische Universiteit Delft, Netherlands), Changhao Wang (Politecnico di Torino, Italy), Moritz Fieback, Hanzhi Xun, Mottaqiallah Taouil (Technische Universiteit Delft, Netherlands), Xiuyan Li, Danyang Chen, Lin Wang (Shanghai Jiao Tong University, China), Nicolò Bellarmino, Riccardo Cantoro (Politecnico di Torino, Italy), Said Hamdioui (Technische Universiteit Delft, Netherlands)
Page	pp. 635 - 641
Keyword	FeFET, memory test, defect modeling, charge trapping
Abstract	The development of Ferroelectric Field-Effect Transistor (FeFET) manufacturing requires high-quality test solutions, yet research on FeFET testing is still in a nascent stage. To generate a dedicated test method for FeFETs, it is critical to have a deep understanding of manufacturing defects and accurately model them. In this work, we introduce the unique defect, Anomalous Charge Trapping (ACT), in FeFETs. The ACT- defective FeFET is characterized, and the physical mechanism of the defect is explained. Then, we apply the Device-aware Test (DAT) method to design a specific ACT-defective FeFET model, which includes the physical impact of the defect on the electrical parameters of defect-free models, and calibrate the model with measurement data. Fault modeling is performed based on circuit-level simulations, and dedicated test solutions are proposed.

[To Session Table]

Session 4E (T3-3) Hybrid/Co-Designed Near/In Memory Computing
Time: 10:05 - 11:45, Wednesday, January 22, 2025
Location: Innovation Hall
Chairs: Bokyung Kim (Rutgers University, USA), Zhong Sun (Peking University, China)

4E-1 (Time: 10:05 - 10:30)

Title	3D-METRO: Deploy Large-Scale Transformer Model on a Chip Using Transistor-Less 3D-Metal-ROM-Based Compute-in-Memory Macro
Author	Yiming Chen, *Xirui Du, Guodong Yin, Wenjun Tang, Yongpan Liu, Huazhong Yang, Xueqing Li (Tsinghua University, China)
Page	pp. 642 - 647
Keyword	Compute-in-Memory, ROM-CiM, YOLoC, 3D Stacking
Abstract	While large Transformer models have exhibited outstanding performance on multimodal tasks, the underlying massive parameters land up with memory-wall issues. To address this bottleneck, SRAM-based compute-in-memory (CiM) is a promising technique. However, frequent off-chip weight loading due to limited on-chip capacity could severely limit the system- level energy efficiency. Recently, a high-density CiM structure at 16.4Mb/mm2, YOLoC, has shown the potential of complete on- chip deployment of a large detection model using transistor-based read-only-memory (ROM). However, it is still challenging to deploy even larger Transformer models. With opportunities provided by LoRA for finetuning large pretrained models on ROM- CiM with very light SRAM-CiMs, this work achieves ultra-high density up to 165.6Mb/mm2 by eliminating the use of transistors for ROM-CiM with a proposed 3D-METRO and a 3D stacking array on the mature CMOS process. Unlike the usual belief that parasitics have negative impacts, this work observes

4E-2 (Time: 10:30 - 10:55)

Title	HCiM: ADC-Less Hybrid Analog-Digital Compute in Memory Accelerator for Deep Learning Workloads
Author	*Shubham Negi, Utkarsh Saxena, Deepika Sharma, Kaushik Roy (Purdue University, USA)
Page	pp. 648 - 655
Keyword	Compute-in-Memory, Quantization, ADC-Less, Hardware Algorithm Co-design
Abstract	Analog Compute-in-Memory (CiM) accelerators are increasingly recognized for their efficiency in accelerating Deep Neural Networks (DNN). However, their dependence on Analog-to-Digital Converters (ADCs) for accumulating partial sums from crossbars leads to substantial power and area overhead. Moreover, the high area overhead of ADCs constrains throughput due to the limited number of ADCs that can be integrated per crossbar. To mitigate this issue, extreme low-precision quantization (binary or ternary) for partial sums can be adopted, eliminating the need for ADCs. While this strategy effectively reduces ADC costs, it introduces the challenge of managing numerous floating-point scale factors, which are trainable parameters like DNN weights. These scale factors must be multiplied with the binary or ternary outputs at the crossbar columns to maintain system accuracy, offsetting the benefits of CiM and partial sum quantization. To that effect, we propose an algorithm-hardware co-design approach. Initially, DNNs are trained with quantization-aware training. Subsequently, we introduce HCiM, an ADC-Less Hybrid Analog-Digital CiM accelerator. HCiM uses analog CiM crossbars for performing Matrix-Vector Multiplication operations, coupled with a digital CiM array for processing scale factors. Compared to an analog CiM baseline architecture using 7 and 2-bit ADCs, HCiM achieves energy reductions up to 28× and 11×, respectively.

4E-3 (Time: 10:55 - 11:20)

Title	MDNMP: Metapath-Driven Software-Hardware Co-Design for HGNN Acceleration with Near-Memory Processing
Author	*Liyan Chen, Jianfei Jiang, Qin Wang, Zhigang Mao, Naifeng Jing (Shanghai Jiao Tong University, China)
Page	pp. 656 - 662
Keyword	NMP, HGNN, Metapath
Abstract	Heterogeneous graph neural networks (HGNNs), which capture rich structural and semantic information by learning low-dimensional vertex representations based on metapath, have drawn considerable attention in recent years. Due to substantial memory consumption and unique irregular access patterns, its performance is hindered by memory-bound metapath instance matching and aggregation. To address this challenge, the recent proposal employs near-memory processing (NMP) and achieves impressive performance speedups. However, due to oversight of the intrinsic characteristics of metapath, it fails to fully exploit the potential of NMP. We propose MDNMP, a software-hardware co-design NMP architecture for HGNN acceleration. Specifically, we leverage push-pull workflow tailored for high-degree and low-degree metapaths, guided by the analysis of their computation and communication patterns. Additionally, a low-cost locality-aware heterogeneous graph partition method is proposed to further reduce inter-DIMM communication. Finally, we integrate hardware units in DIMMs to accelerate HGNNs and optimize the broadcast mechanism for efficient communication. Our evaluation demonstrates that MDNMP achieves 47%-88% inter-DIMM communication reduction, 3.8×-42.4× performance improvement and 43%-52% energy consumption reduction compared to the state-of-the-art NMP solution.

4E-4 (Time: 11:20 - 11:45)

Title	A 24.65 TOPS/W@INT8 Hybrid Analog-Digital Multi-core SRAM CIM Macro with Optimal Weight Dividing and Resource Allocation Strategies
Author	*Yitong Zhou, Wente Yi, Sifan Sun, Wenjia Wang, Jinyu Bai, He Zhang (Beihang University, China)
Page	pp. 663 - 668
Keyword	Hybrid analog-digital CIM, Deep neural networks, Heterogeneous Multi-Core
Abstract	Compute-in-memory (CIM) technology integrates memory and computation to reduce memory bottlenecks in modern systems. However, current CIM architectures face challenges in balancing accuracy and energy efficiency. Analog-CIM (ACIM) is energy-efficient but less accurate, while Digital-CIM (DCIM) is accurate but consumes more energy. In this paper, we propose a novel multi-core hybrid analog-digital CIM macro that effectively addresses this trade-off. Our approach intelligently allocates computation tasks to ACIM and DCIM cores based on their accuracy requirements, achieving a balance of accuracy and efficiency. Additionally, we developed an optimization framework to determine the optimal weight divide ratio and computing resource allocation for the hybrid CIM. Experimental results demonstrate the efficacy of our approach. The proposed hybrid CIM achieves an outstanding energy efficiency of 24.65 TOPS/W at 8-bit precision, surpassing DCIM by a factor of 1.33 while maintaining a low error rate of only 0.4\%, which is 30 times better than ACIM at the same precision.

[To Session Table]

Session 4F (SS-4) ML for IC Design and Manufacturing: When Is It Real?
Time: 10:05 - 11:45, Wednesday, January 22, 2025
Chairs: Yibo Lin (Peking University, China), Youngsoo Shin (KAIST, Republic of Korea)

4F-1 (Time: 10:05 - 10:30)

Title	(Invited Paper) Use Cases and Deployment of ML in IC Physical Design
Author	Amur Ghose, *Andrew B. Kahng, Sayak Kundu, Yiting Liu, Bodhisatta Pramanik, Zhiang Wang, Dooseok Yoon (University of California, San Diego, USA)
Page	pp. 669 - 675
Keyword	Integrated-circuit Physical Design, Machine Learning, MLOps
Abstract	ML for IC physical design must be deployed in order to have business impacts. However, deployment in production must navigate many practical considerations, including choice of targets, skillsets and infrastructure, expectations and resources, data, and "MLOps". Furthermore, usage of ML is not the same as IC design practice and capability. In this invited paper, we give perspectives on basic strategies for selecting applications and pursuing deployment for ML in IC physical design. Example aspects include checklists for data and ML models, evaluation of model performance and progress on the path to deployment, the shifting landscape of MLOps, and challenges of "LLM-ability".

4F-2 (Time: 10:30 - 10:55)

Title	(Invited Paper) Leveraging Machine Learning Techniques to Enhance Traditional EDA Workflows
Author	Jinoh Cho, Jaekyung Im, Jaeseung Lee, Kyungjun Min, Seonghyeon Park, Jaemin Seo, Jongho Yoon, *Seokhyeong Kang (Pohang University of Science and Technology, Republic of Korea)
Page	pp. 676 - 682
Keyword	EDA, Machine learning, Physical Design
Abstract	As technology nodes advance and feature sizes shrink, the increasing complexity of design rules and routing congestion has resulted in greater design challenges and rising costs. Machine learning (ML) models offer significant potential to enhance design quality by enabling early prediction and optimization during the design flow. However, only a few works have validated the effectiveness of ML model when integrated to the traditional design flow. This paper will cover the effectiveness of ML-enhanced EDA workflow with some practical applications. Additionally, we will address which problems should be solved to achieve successful ML integration.

4F-3 (Time: 10:55 - 11:20)

Title	(Invited Paper) ML-Assisted RF IC Design Enablement: the New Frontier of AI for EDA
Author	Hyunsu Chae, Song Hang Chai (The University of Texas at Austin, USA), Taiyun Chi (Rice University, USA), Sensen Li, *David Z. Pan (The University of Texas at Austin, USA)
Page	pp. 683 - 689
Keyword	RFIC, Machine learning, EDA
Abstract	While AI for EDA has seen great success in digital IC design and some success in analog design, its potential for enabling RFIC design is yet to be fully explored. Due to its high-frequency nature, RFIC involves challenges such as parasitic effects, electromagnetic interference (EMI), signal integrity (SI), and other non-idealities. The modeling of passive networks and the associated computationally expensive EM simulations remain the major bottleneck in manual RFIC designs. This paper discusses the challenges and opportunities in ML-assisted RFIC design, covering topics from physics-augmented surrogate modeling to the inverse design of passive structures.

4F-4 (Time: 11:20 - 11:45)

Title	(Invited Paper) ML for Computational Lithography: Practical Recipes
Author	*Youngsoo Shin (KAIST, Republic of Korea)
Page	pp. 690 - 692
Keyword	Lithography, Machine learning, Mask synthesis, Lithography modeling
Abstract	Machine learning (ML) has been applied to majority of these components since around 2010 with key motivations of: (1) ML provides higher modeling capability than traditional analytical model, and (2) many lithography applications can be considered as image recognition or image conversion, which can be handled effectively with recent ML models. Given that some ML solutions are already provided through vendor solutions, it is about right time to address which ML-assisted lithography applications have more use cases while which are still in the stage of research.

[To Session Table]

Session 1W CEDA 20th Anniversary Panel
Time: 12:00 - 13:00, Wednesday, January 22, 2025
Location: Innovation Hall and more
Organizer: Tsung-Yi Ho (Chinese University of Hong Kong, Hong Kong), Panel Moderator: Yu Wang (Tsinghua University, China)

1W-1 (Time: 12:00 - 13:00)

Title	(Panel Discussion) CEDA 20th Anniversary Panel
Author	Panelists: Yao-Wen Chang (National Taiwan University, Taiwan), Kwang-Ting Cheng (Hong Kong University of Science and Technology, Hong Kong), Shinji Kimura (Waseda University, Japan), Jeong-Taek Kong (Sungkyunkwan University, Republic of Korea)
Abstract	Detailed Information: https://ieee-ceda.org/post/announcement/asp-dac-2025-ceda-20th-anniversary-panel

[To Session Table]

Session 5A (T5-2) Innovations in Deep Learning and Neural Network Acceleration
Time: 13:15 - 15:20, Wednesday, January 22, 2025
Location: Room Saturn
Chair: Shinya Takamaeda Yamazaki (The University of Tokyo, Japan)

5A-1 (Time: 13:15 - 13:40)

Title	Hardware Acceleration of Kolmogorov–Arnold Network (KAN) for Lightweight Edge Inference
Author	Wei-Hsing Huang, Jianwei Jia, Yuyao Kong, Faaiq Waqar (Georgia Institute of Technology, USA), Tai-Hao Wen (National Tsing Hua University, Taiwan), Meng-Fan Chang (National Tsing Hua University/TSMC Corporate Research, Taiwan), *Shimeng Yu (Georgia Institute of Technology, USA)
Page	pp. 693 - 699
Keyword	Kolmogorov-Arnold Networks, KAN, Quantization, Compute-in-memory, Hardware Software Co-design
Abstract	Recently, a novel model named Kolmogorov-Arnold Networks (KAN) has been proposed with the potential to achieve the functionality of traditional deep neural networks (DNNs) using orders of magnitude fewer parameters by parameterized B-spline functions with trainable coefficients. However, the B-spline functions in KAN present new challenges for hardware acceleration. Evaluating the B-spline functions can be performed using their mathematical definition, which involves a recursive method. However, this approach requires significantly more computational resources as the order k of the B-spline increases. An alternative approach, more suitable for deployment on edge devices, is to use look-up tables (LUTs) to directly map the B-spline functions, thereby simplifying the hardware implementation and reducing computational resource requirements. However, this method still requires substantial circuit resources (e.g., LUTs, MUXs, decoders). This paper employs a software-hardware co-design methodology to accelerate KAN using the RRAM-Aanalog-CIM (ACIM) circuit. The proposed algorithm-level techniques include Alignment-Symmetry and PowerGap KAN hardware aware quantization, KAN sparsity aware mapping strategy, and circuit-level technique includes N:1 Time Modulation Dynamic Voltage input generator. The impact of non-ideal effects, such as partial sum errors caused by the process variations, has been evaluated with the statistics measured from the TSMC 22nm prototype chips.

5A-2 (Time: 13:40 - 14:05)

Title	SUArch: Accelerating Layer-wise N:M Sparse Pattern with a Unified Architecture for Deep-learning Edge Device
Author	*Xilong Kang, Qingwen Wei (Southeast University, China), Ningyuan Li (Harbin Engineering University, China), Xingyu Xu, Hao Cai, Bo Liu (Southeast University, China)
Page	pp. 700 - 705
Keyword	Accelerator, Layer-wise N:M Sparsity, Transformer, Convolution Neural Networks, Edge Computing
Abstract	Deep neural networks are of the essence for user applications on edge devices. However, the computation and memory-intensive nature of deep neural networks conflicts with the resource-constrained devices. Moreover, the heterogeneity across different models imposes new challenges on deployment on edge devices. To boost the capabilities of edge devices, we propose SUArch which innovates on three fronts: 1) a layer-wise N:M sparsity aware training approach to strike a balance between accuracy and training cost; 2) a sparsity alignment unit based on the butterfly network to maximize hardware utilization and eliminate extra overhead; 3) a mode-heterogenous processing element array to effectively accomplish the unified support for Convolution Neural Network and Transformer. The experimental results demonstrate that when running convolution-based and attention-based models under an industrial 28-nm process, the proposed SUArch realizes an energy efficiency of 52.1 TOPS/W. Compared to state-of-the-art architecture, SUArch achieves an energy efficiency improvement of 2.07× while accuracy loss is within 0.7%.

5A-3 (Time: 14:05 - 14:30)

Title	FactorFlow: Mapping GEMMs on Spatial Architectures through Adaptive Programming and Greedy Optimization
Author	*Marco Ronzani, Cristina Silvano (Politecnico di Milano, Italy)
Page	pp. 706 - 712
Keyword	spatial architectures, DNN accelerators, GEMM mapping, adaptive programming, design space exploration
Abstract	General Matrix Multiplications (GEMMs) are fundamental kernels in tensor-based scientific applications and deep learning. Modern AI accelerators using spatial architectures can run these kernels efficiently by leveraging parallelism and data reuse, but they require specific mappings to plan data movements and computations. The choice of mapping significantly impacts energy consumption and latency. Consequently, the vast space of possible mappings, unique for each GEMM-architecture pair, must be searched thoroughly to find optimal solutions. This is a complex optimization problem that imposes effective map-space exploration strategies. Current state-of-the-art mapping tools primarily address convolution kernels, a superset of GEMMs, but often fail to leverage GEMMs' specific characteristics. As a result, they struggle to consistently generate optimal mappings in a reasonable time for all GEMM-architecture pairs. This paper introduces FactorFlow, an automatic framework designed to map GEMM kernels to spatial architectures using adaptive programming and greedy optimization to minimize the energy-delay product. Our evaluation, conducted against four other state-of-the-art mapping tools on a selected set of GEMMs and architectures, demonstrates that FactorFlow consistently discovers mappings that outperform existing tools in terms of EDP while significantly reducing the exploration execution time.

5A-4 (Time: 14:30 - 14:55)

Title	LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference
Author	Yanyue Xie (Northeastern University, USA), Zhengang Li (Adobe, USA), Dana Diaconu, Suranga Handagala, Miriam Leeser, *Xue Lin (Northeastern University, USA)
Page	pp. 713 - 719
Keyword	FPGAs, Quantization, Look-up tables, Roofline model, Neural network
Abstract	For FPGA-based neural network accelerators, digital signal processing (DSP) blocks have traditionally been the cornerstone for handling multiplications. This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications. The availability of LUTs typically outnumbers that of DSPs by a factor of 100, offering a significant computational advantage. By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators with a reconfigurable dataflow architecture. Our approach challenges the conventional peak performance on DSP-based accelerators and sets a new benchmark for efficient neural network inference on FPGAs. Experimental results demonstrate that our design achieves the best inference speed among all FPGA-based accelerators, achieving a throughput of 1627 images per second and maintaining a top-1 accuracy of 70.95% on the ImageNet dataset.

5A-5 (Time: 14:55 - 15:20)

Title	A Layer-wised Mixed-Precision CIM Accelerator with Bit-level Sparsity-aware ADCs for NAS-Optimized CNNs
Author	Haoxiang Zhou, Zikun Wei, Dingbang Liu, Liuyang Zhang (Southern University of Science and Technology, China), Chenchen Ding (The University of Hong Kong, China), Jiaqi Yang (Southern University of Science and Technology, China), Wei Mao (Xidian University, China), *Hao Yu (Southern University of Science and Technology, China)
Page	pp. 720 - 726
Keyword	NAS, bit-level sparsity, CIM, sparsity-aware ADC
Abstract	Exploring multiple precisions as well as sparsities for a computing-in-memory (CIM) based convolutional accelerators is challenging. To further improve energy efficiency with minimal accuracy loss, this paper develops a neural architecture search (NAS) method to identify precision for each layer of the CNN and further leverages bit-level sparsity. The results indicate that following this approach, ResNet-18 and VGG-16 not only maintain their accuracy but also implement layer-wised mixed-precision effectively. Furthermore, there is a substantial enhancement in the bit-level sparsity of weights within each layer, with an average bit-level sparsity exceeding 90\% per bit, thus providing broader possibilities for hardware-level sparsity optimization. In terms of hardware design, a mixed-precision (2/4/8-bit) readout circuit as well as a bit-level sparsity-aware Analog-to-Digital Converter (ADC) are both proposed to reduce system power consumption. Based on bit-level sparsity mixed-precision CNNs benchmarks, post-layout simulation results in 28nm reveal that the proposed accelerator achieves up to 245.72 TOPS/W energy efficiency, which shows about 2.52 - 6.57$\times$ improvement compared to the state-of-the-art SRAM-based CIM accelerators.

[To Session Table]

Session 5B (T7-2) Neuromorphic and Emerging Computing Techniques
Time: 13:15 - 15:20, Wednesday, January 22, 2025
Location: Room Uranus
Chair: Georgios Zervakis (University of Patras, Greece)

Best Paper Candidate
5B-1 (Time: 13:15 - 13:40)

Title	HyCOMP: A Compiler for ANN-SNN Hybrid Accelerators
Author	Yitian Zhou, Yue Li, *Yang Hong (Shanghai Jiao Tong University, China)
Page	pp. 727 - 733
Keyword	compilation framework, ANN-SNN hybrid network, neural network, machine learning, PIM accelerator
Abstract	The rising energy consumption of Artificial Neural Networks (ANNs) in large-scale computation scenarios has become a critical issue. Spiking Neural Networks (SNNs) offer a promising alternative with their spike-based computation and communication methods. Process-in-memory (PIM) is widely used in SNN acceleration, and is also used to accelerate ANNs to break the memory wall. However, there are few studies on ANN-SNN hybrid network acceleration, which results in several issues. Existing compilers are primarily designed for ANNs or SNNs, with no unified description for hybrid networks. Compilation processes for SNNs vary in focus and lack general phases. Some ANNs hold the potential for conversion to SNNs for energy efficiency, yet their characteristics remain inadequately defined. Thus, we propose HyCOMP - a universal compiler tailored for ANN-SNN hybrid acceleration of PIM architecture. We also design a heuristic algorithm to identify ANNs with energy-efficiency potential. HyCOMP aims to facilitate the mapping of ANNs and SNNs onto distinct acceleration units to enhance energy efficiency. Experiment results on ResNet-18, GoogLeNet, VGG16, and Spikformer show that HyCOMP achieves 2.48x and 2.74x improvements in latency and energy consumption respectively compared to the state-of-the-art techniques.

5B-2 (Time: 13:40 - 14:05)

Title	NeuronQuant: Accurate and Efficient Post-Training Quantization for Spiking Neural Networks
Author	*Haomin Li, Fangxin Liu (Shanghai Jiao Tong University, China), Zewen Sun (Tianjin University, China), Zongwu Wang, Shiyuan Huang, Ning Yang, Li Jiang (Shanghai Jiao Tong University, China)
Page	pp. 734 - 740
Keyword	Spiking Neural Network, Quantization, energy-efficiency
Abstract	Spiking neural networks (SNNs) are an alternative computational paradigm to artificial neural networks (ANNs) that have attracted attention due to their event-driven execution mechanisms, enabling extremely low energy consumption. However, a significant challenge and opportunity in SNNs is to optimize memory and compute costs while maintaining accuracy, thereby further reducing energy consumption. Model quantization has been proposed as a promising technique to improve the running efficiency via the number of data bits reduction. Whereas, this technique has yet to be well studied in the neuromorphic computing domain. The underlying reason is that the behaviors of SNNs are quite different from those of ANNs, making 1) the accuracy of SNNs usually sensitive to data precision, and 2) the introduction of a temporal dimension to characterize neuronal dynamics. In this paper, we present NeuronQuant, an accurate and energy-efficient quantization framework to reduce the precision of neurons while maintaining accuracy. The key insight is to design a post-training quantization method guided by the activity of neurons, efficiently reducing the bit-width of parameters based on local relationships within neurons. Additionally, a budget-aware mixed bit-width allocation strategy for the total model size enables the adaptive growth and narrowing of precision in each layer, leading to a mixed-precision quantization scheme of the desired size. Extensive evaluations demonstrate that NeuronQuant can achieve a compressed SNN with 5.2 bits on average and 1.51× power consumption reduction with a superior model accuracy which is quite impressive for SNN.

5B-3 (Time: 14:05 - 14:30)

Title	SCSC: Leveraging Sparsity and Fault-Tolerance for Energy-Efficient Spiking Neural Networks
Author	Bo Li, *Yue Liu, Wei Liu, Jinghai Wang, Xiao Huang, Zhiyi Yu, Shanlin Xiao (Sun Yat-sen University, China)
Page	pp. 741 - 747
Keyword	Spiking Neural Networks, Sparse Coding, Energy Efficiency, Fault Tolerance, Approximate DRAM
Abstract	Spiking neural networks (SNNs) are more energy-efficient for processing sparse spike signals and demonstrate better fault tolerance compared to artificial neural networks (ANNs). In neuromorphic chips, synaptic weight access and neuron computation operations constitute 75%-95% of the chip’s energy consumption. Therefore, our primary strategy to achieve highly energy-efficient SNNs is to enhance network sparsity while leveraging SNNs’ high fault tolerance to reduce both weight access and neuron computation energy. The coding module is an essential component of SNNs, responsible for encoding non-spiking inputs into spike trains. However, previous coding schemes often exhibit poor sparsity or fault tolerance performance. Thus, we propose a novel coding scheme for SNNs: spiking convolutional sparse coding (SCSC). SCSC utilizes convolutional kernels as dictionaries and achieves sparsity through neural layers. Additionally, dynamic firing thresholds in neural layers balance sparsity with network performance and fault tolerance. The experimental results indicate that SCSC can increase network sparsity by 10%-20% and achieves higher accuracy than baseline networks when dealing with disturbances. Furthermore, we utilize approximate DRAM to store synaptic weights and selectively deactivate specific neuronal computing modules. With only a 1% decrease in accuracy, SCSC can reduce synaptic weight access energy by 29% and neuronal computing energy by 49%.

5B-4 (Time: 14:30 - 14:55)

Title	OpticalHDC: Ultra-fast Photonic Hyperdimensional Computing Accelerator
Author	*Jiaqi Liu (Hong Kong University of Science and Technology, Hong Kong), Yiwen Ma (Institute of Physics, Chinese Academy of Sciences, China)
Page	pp. 748 - 753
Keyword	Hyperdimensional Computing, Photonic Computing
Abstract	The demand for extensive computing resources and energy to support the increasing size of machine learning models has created a disparity between AI applications and the underlying hardware, hindering the advancement of advanced general intelligence. To address this challenge, researchers have been exploring new neuromorphic computing paradigms, among which Hyperdimensional Computing (HDC) stands out for its robustness to system noise, simplicity of operations, and high parallelism. However, accelerating the time-consuming encoding and search algorithms in HDC remains a formidable task. In this paper, we propose the first Photonic Hyperdimensional Computing (HDC) accelerator to overcome the challenges of high-dimensional data processing. Our accelerator incorporates multiple Microring-based homogeneous configurable dot product cores that can be dynamically reconfigured to either the encoding stage or the classification stage. We introduce a one-to-many broadcast-based photonic computing core that leverages the wavelength-division multiplexing (WDM) and free spectral range (FSR) properties to parallel computation. Additionally, we propose a signed number operation algorithm to support the signed dot product with base hypervectors in the encoding stage. We also conduct a thorough analysis of the impact of different dimensions and encoding bit-width to ensure the reliability and robustness of our accelerator. Our experimental results demonstrate a significant speedup of up to 47.87x and 177.37x compared to state-of-the-art ASIC chips and GPUs respectively, highlighting the effectiveness of our proposed approach.

5B-5 (Time: 14:55 - 15:20)

Title	Design and In-training Optimization of Binary Search ADC for Flexible Classifiers
Author	*Paula Carolina Lozano Duarte (Karlsruhe Institute of Technology, Germany), Florentia Afentaki, Georgios Zervakis (University of Patras, Greece), Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Page	pp. 754 - 760
Keyword	Flexible Electronics, Binary Search ADC, Flash ADC, In-training Optimization
Abstract	Flexible Electronics (FE) offer distinct advantages, including mechanical flexibility and low process temperatures, enabling extremely low-cost production. To address the demands of applications such as smart sensors and wearables, flexible devices must be small and operate at low supply voltages. Additionally, target applications often require classifiers to operate directly on analog sensory input, necessitating the use of Analog to Digital Converters (ADCs) to process the sensory data. However, ADCs present serious challenges, particularly in terms of high area and power consumption, especially when considering stringent area and energy budget. In this work, we target common classifiers in this domain such as MLPs and SVMs and present a holistic approach to mitigate the elevated overhead of analog to digital interfacing in FE. First, we propose a novel design for Binary Search ADC that reduces area overhead 2× compared with the state-of-the-art Binary design and up to 5.5× compared with Flash ADC. Next, we present an in-training ADC optimization in which we keep the bare-minimum representations required and simplifying ADCs by removing unnecessary components. Our in-training optimization further reduces on average the area in terms of transistor count of the required ADCs by 5× for less than 1% accuracy loss.

[To Session Table]

Session 5C (T9-2) Package and PCB
Time: 13:15 - 14:55, Wednesday, January 22, 2025
Location: Room Venus
Chair: Pei-Yu Lee (Mediatek, Taiwan)

5C-1 (Time: 13:15 - 13:40)

Title	Paired-Spacing-Constrained Package Routing with Net Ordering Optimization
Author	Yi-Sian Ciou, *Ying-Jie Jiang, Yi-Yu Liu, Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan), Wen-Hao Liu (NVIDIA, Taiwan)
Page	pp. 761 - 767
Keyword	Package routing, Paired-spacing constraint
Abstract	Package design has become increasingly complex with the evolution of technology nodes and heterogeneous integration. To optimize timing performance and signal integrity, it is essential to separate different pairs of geometrically adjacent nets with distinct spacing values, which is referred to as the paired-spacing constraint. This paper presents the first free-assignment package routing algorithm flow considering the paired-spacing constraint. To minimize the routing resource demand and overall wirelength, we propose a dynamic programming-based net ordering method to maximize the number of nets with the same/similar spacing rules positioned next to each other. In addition, the free-assignment routing problem is elegantly solved with a minimum-cost maximum-flow problem on a delicately designed graph model. Experimental results show that the proposed flow can achieve 100% routability for the adopted industrial-modified benchmarks. In contrast, even with modifications to superficially consider paired spacings, a classic model experiences significant routability degradation.

5C-2 (Time: 13:40 - 14:05)

Title	Hybrid Detour Refinement Strategy for Package Substrate Routing
Author	*Ding-Hsun Lin, Tsubasa Koyama, Yu-Jen Chen (National Tsing Hua University, Taiwan), Keng-Tuan Chang, Chih-Yi Huang, Chen-Chao Wang (Advanced Semiconductor Engineering, Inc., Taiwan), Tsung-Yi Ho (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 768 - 773
Keyword	Advanced Packaging, Package Substrate Routing, Routing Refinement, Deep Learning
Abstract	Advanced packaging technologies have become increasingly important due to rapid technological advancements. In these designs, substrate routing is crucial for ensuring functionality and performance, but existing automatic routing tools often yield suboptimal results or design rule violations (DRVs) when handling complex industrial constraints. As a consequence, designers must spend weeks refining these results. In this work, a hybrid detour refinement strategy that combines rule-based and deep learning (DL)-based approaches is proposed to address these challenges. The strategy reduces detours, improves area distribution in industrial Flip-Chip Ball Grid Array (FCBGA) substrate designs, and significantly decreases modification time. Experimental results show an average improvement of 43% in detour reduction and 33% in area distribution, with modification time reduced from weeks to minutes.

5C-3 (Time: 14:05 - 14:30)

Title	On Awareness of Offset-Via and Teardrop in Advanced Packaging Interconnect Synthesis
Author	*Hao-Ju Chang (Institute of Electronics, National Yang Ming Chiao Tung University, Taiwan), Yu-Hung Chen, Hao-Wei Huang (Institue of Pioneer Semiconductor Innovation, National Yang Ming Chiao Tung University, Taiwan), Yihua Yeh (National Yang Ming Chiao Tung University, Taiwan), Hung-Ming Chen, Chien-Nan Jimmy Liu (Institute of Electronics and SoC Research Center, National Yang Ming Chiao Tung University, Taiwan)
Page	pp. 774 - 780
Keyword	D2D, chiplet, signal integrity, offset-via, routing
Abstract	To take full advantage of chiplet-based system synthesis methodology for HPC and AI applications in high-bandwidth memory, die-to-die designs interconnect need an overhaul breakthrough. The major reason lies in the strengthening technologies: offset-via and teardrop. They need special care to enhance reliability and manufacturability. Moreover, conventional signal integrity problems are required to pay attention as well. In this work, we first ensure the optimized routing resources are on all the redistribution layers by empirical offset-via and layer assignment. Then, we devise an S-route detailed routing to prevent detours and reduce rip-up and re-route iterations. Results show that we achieved a total wire length reduction of 7% on average and a total usage of RDLs by nearly 50%, compared with combined SOTA approaches.

5C-4 (Time: 14:30 - 14:55)

Title	PCBAgent: An Agent-based Framework for High-Density Printed Circuit Board Placement
Author	*Lin Chen (The Hong Kong University of Science and Technology/Huawei Noah’s Ark Lab, Hong Kong), Ran Chen, Shoubo Hu (Huawei Noah’s Ark Lab, Hong Kong), Xufeng Yao (The Chinese University of Hong Kong/Huawei Noah’s Ark Lab, Hong Kong), Zhentao Tang, Shixiong Kai, Siyuan Xu (Huawei Noah’s Ark Lab, China), Mingxuan Yuan (Huawei Noah’s Ark Lab, Hong Kong), Jianye Hao (Huawei Noah’s Ark Lab, China), Bei Yu (The Chinese University of Hong Kong, Hong Kong), Jiang Xu (The Hong Kong University of Science and Technology(Guangzhou), China)
Page	pp. 781 - 787
Keyword	PCB placement, reinforcement learning, large language model
Abstract	In recent years, printed circuit board (PCB) placement has emerged as a significant challenge since the scale of PCB designs has rapidly enlarged. Furthermore, the presence of various types of constraints with differing tolerances and violation priorities hampers the automation of PCB layout design, necessitating substantial manual effort. To address this problem, we introduce a novel agent-based framework that automatically generates PCB layouts meeting industrial constraints through user interactions. This framework includes two main agents: a reinforcement learning (RL)-based agent for layout inference and fine-tuning, and a large language model (LLM)-based agent for interactive optimization. Experimental results on 17 industrial PCB designs show that our framework outperforms other state-of-the-art methods.

[To Session Table]

Session 5D (T12-2) Logic Locking and Hardware Watermarking
Time: 13:15 - 14:55, Wednesday, January 22, 2025
Location: Room Mars/Mercury
Chairs: Daisuke Fujimoto (Nara Institute of Science and Technology, Japan), Amin Rezaei (California State University at Long Beach, USA)

Best Paper Candidate
5D-1 (Time: 13:15 - 13:40)

Title	NoXLock: SiP Activation and Licensing through Obfuscated on-Chip Network and Fuzzy Traffic
Author	Md Saad Ul Haque, Azim Uddin, Jingbo Zhou (University of Florida, USA), Hadi Mardani Kamali (University of Central Florida, USA), Farimah Farahmandi, *Mark Tehranipoor (University of Florida, USA)
Page	pp. 788 - 793
Keyword	SiP, Heterogenous Integration, Obfuscation, NoC Security
Abstract	Existing countermeasures designed to protect system-on-chip (SoC) from supply chain threats, i.e., Intellectual Property (IP) piracy and overproduction, are inadequate for heterogeneously integrated systems-in-packages (SiP) due to shifts in the manufacturing flow. Additionally, traditional obfuscation methods have been compromised by emerging deobfuscation techniques. To address these challenges, this paper introduces "NoXLock", a novel network-on-chip (NoC) obfuscation technique to effectively safeguard the IP of SiP designs. Importantly, NoXLock’s implementation is keyless and implicit, without requiring controller state machines (FSMs) for obfuscation - a distinct departure from traditional sequential pattern-based logic-locking techniques. By obfuscating the routing algorithm, the performance of unauthorized SiPs is effectively constrained without proper activation. To securely activate the system, the paper proposes a novel method utilizing dynamic traffic patterns. Extensive security analyses and experimental results, tested on a number of open source benchmark designs, demonstrate that NoXLock resists state-of-the-art attacks, including oracle-guided SAT, oracle-less removal, and probing-based methods, without significant penalties on power, performance, and area (PPA). The proposed approach fills a critical gap in the IP protection landscape for heterogeneously integrated systems, providing SiP designers with a comprehensive solution to resist reverse engineering and counterfeiting.

5D-2 (Time: 13:40 - 14:05)

Title	K-Gate Lock: Multi-Key Logic Locking Using Input Encoding Against Oracle-Guided Attacks
Author	Kevin Lopez, *Amin Rezaei (California State University Long Beach, USA)
Page	pp. 794 - 800
Keyword	Logic Locking, Logic Encryption, Multi-Key Locking, SAT Attack, Input Encoding
Abstract	Logic locking has emerged to prevent piracy and overproduction of integrated circuits ever since the split of the design house and manufacturing foundry was established. While there has been a lot of research using a single global key to lock the circuit, even the most sophisticated single-key locking methods have been shown to be vulnerable to powerful SAT-based oracle-guided attacks that can extract the correct key with the help of an activated chip bought off the market and the locked netlist leaked from the untrusted foundry. To address this challenge, we propose, implement, and evaluate a novel logic locking method called K-Gate Lock that encodes input patterns using multiple keys that are applied to one set of key inputs at different operational times. Our comprehensive experimental results confirm that using multiple keys will make the circuit secure against oracle-guided attacks and increase attacker efforts to an exponentially time-consuming brute force search. K-Gate Lock has reasonable power and performance overheads, making it a practical solution for real-world hardware intellectual property protection.

5D-3 (Time: 14:05 - 14:30)

Title	A Hybrid Machine Learning and Numeric Optimization Approach to Analog Circuit Deobfuscation
Author	*Dipali Jain, Guangwei Zhao, Rajesh Datta, Kaveh Shamsi (University of Texas at Dallas, USA)
Page	pp. 801 - 807
Keyword	deobfuscation, circuit learning, analog security, machine learning, optimization
Abstract	Generic deobfuscation of analog circuits has received less attention than the digital counterpart with existing methods relying on manual expert work to extract closedform equations from the circuit. In this work, we move towards a significantly more automated process by using a combination of machine learning and Newton-method-based analog circuit optimization. We showcase how this hybrid scheme is superior to either standalone approach in terms of runtime and accuracy on a set of analog circuits that include amplifiers, filters, and oscillators. We achieve >98% average accuracy without any manual expert equation extraction in addition to demonstrating a superior resilience to process variation.

5D-4 (Time: 14:30 - 14:55)

Title	RTLMarker: Protecting LLM-Generated RTL Copyright via a Hardware Watermarking Framework
Author	*Kun Wang, Kaiyan Chang, Mengdi Wang, Xingqi Zou, Haobo Xu, Yinhe Han, Ying Wang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Page	pp. 808 - 813
Keyword	Large Language Mode, Hardware Copyright
Abstract	Recent advances of large language models in the field of Verilog generation have raised several ethical and security concerns, such as code copyright protection and dissemination of malicious code. Researchers have employed watermarking techniques to identify codes generated by large language models. However, the existing watermarking works fail to protect RTL code copyright due to the significant syntactic and semantic differences between RTL code and software code in languages like Python. This paper proposes a hardware watermarking framework RTLMarker that embeds watermarks into RTL code and deeper into the synthesized netlist.we propose a rule-based Verilog code transformations, which ensuring the syntactic and semantic correctness of the watermarked RTL code.In addition, We consider an inherent tradeoff between watermark transparency and watermark effectiveness and jointly optimize them. The results demonstrate RTLMarker's superiority over the baseline in RTL code watermarking.

[To Session Table]

Session 5E (T4.1-2) AI-Driven Innovative Design Methods
Time: 13:15 - 15:20, Wednesday, January 22, 2025
Location: Innovation Hall
Chair: Jaeduk Han (Hanyang University, Republic of Korea)

5E-1 (Time: 13:15 - 13:40)

Title	LIBMixer: An all-MLP Architecture for Cell Library Characterization towards Design Space Optimization
Author	*Jaeseung Lee, Sunggyu Jang, Jakang Lee, Seokhyeong Kang (Postech, Republic of Korea)
Page	pp. 814 - 820
Keyword	Library characterization, machine learning, design space optimization, MLP-Mixer architecture, ML-based EDA
Abstract	Cell library characterization is a fundamental stage of electronic design automation (EDA), as it provides essential electrical models for circuit simulation and design quality assessment. However, the development of advanced nodes demands increasing computational resources and engineering efforts for characterization. We introduce LIBMixer, a machine learning-based framework for fast and accurate library characterization designed to enhance design space optimization. Leveraging multi-layer perceptron architectures, LIBMixer efficiently manages complex relationships between technology and electrical characteristics. It achieves a 31.5× faster runtime than conventional EDA tools while enhancing alignment. Compared to state-of-the-art methods, LIBMixer targets 6.4× more standard cells for both power and timing information. This scalability improvement enables the practical synthesis of IP cores, demonstrating high correlation across of power, performance, and area results. Additionally, Pareto fronts of synthesis design results with LIBMixer-inferred libraries closely match those from foundry files. Experimental results highlight the effectiveness of LIBMixer as a fast and reliable alternative for PVT analysis.

5E-2 (Time: 13:40 - 14:05)

Title	DefectTrackNet: Efficient Root Cause Analysis of Wafer Defects in Semiconductor Manufacturing Using a Lightweight CNN-Transformer Architecture
Author	*Lichao Zeng (University of Science and Technology of China, China), Zhouzhouzhou Mei (Zhejiang University, China), Zhongyu Shi (University of Science and Technology of China, China), Yining Chen (Zhejiang University, China)
Page	pp. 821 - 827
Keyword	semiconductor manufacturing, wafer defect, root cause analysis, hybrid architecture
Abstract	Identifying the root cause of defects in semiconductor manufacturing is crucial for enhancing product yield and reliability. Traditional methods often emphasize defect classification and detection, yet they lack the depth required for comprehensive root cause analysis. This study introduces DefectTrackNet, a pioneering framework designed for automatic root cause analysis through historical similarity image retrieval. Our system integrates a novel hybrid CNN-Transformer architecture, which excels in the precise extraction and matching of image features pertinent to defect origins. Validated on a dataset of 2,588 real SEM defect images annotated with root causes, DefectTrackNet demonstrates superior performance in both accuracy and retrieval speed compared to existing methodologies. This innovative approach not only offers significant improvements over conventional techniques but also establishes a new benchmark for efficient and accurate defect root cause analysis, thereby advancing the field of semiconductor

5E-3 (Time: 14:05 - 14:30)

Title	Hybrid Compact Modeling Strategy: A Fully-Automated and Accurate Compact Model with Physical Consistency
Author	*JinYoung Choi, Hyunjoon Jeong, Jeong-Taek Kong, SoYoung Kim (Sungkyunkwan University, Republic of Korea)
Page	pp. 828 - 834
Keyword	Sub-3 nm device, compact model, TCAD, deep learning, SPICE
Abstract	To overcome the difficulty in model parameter extraction and the narrow model coverage of physics-based compact models, artificial neural network-based compact models (ANN CM) have been developed. However, the ANN CM lacks physical consistency and sub-models for circuit simulations. In this work, we propose an innovative hybrid compact model that uses a physics-based compact model as a basis with accurate fitting capability over a wide range using two neural networks(NN). First, a bidirectional physics based compact model generator neural network (PMG-NN) can automatically extract model parameter sets for both the drain current and gate capacitance within a few minutes. Second, a deep global correction neural network (DGC-NN) corrects the electrical inconsistencies caused by process variations to solve the narrow model coverage issues. The hybrid compact model (Hybrid CM) using DGC-NN can predict 𝐼𝑑𝑠, 𝐶𝑔𝑔, 𝐶𝑔𝑑, and 𝐶𝑔𝑠 of TCAD-simulated 3 nm nanosheet FETs (NSFETs) with greater than 98.5% accuracy. The hybrid compact model was implemented using Verilog-A and validated through the Gummel symmetry test and SPICE simulations. This new type of compact model has the potential to be the best solution for efficient process optimization and accurate SPICE simulation in real-world applications.

5E-4 (Time: 14:30 - 14:55)

Title	CAR-Net: Solving Electrical Crosstalk Problem in Capacitive Sensing Array
Author	*Qinghang Zhao, Tao Li (Xidian University, China)
Page	pp. 835 - 841
Keyword	Capacitor Array, Crosstalk, Pressure Sensing, Sensor Calibration
Abstract	Flexible capacitive pressure sensors are promising in applications of robotics, healthcare, and wearables, and are typically organized in array to realize large-area multi-point sensing. Specifically, the crossbar structure is the dominant form owing to its simple fabrication process. However, the crosstalk problem is induced, which is almost impossible to resolve using conventional methods. In this work, we analyze the structural characteristics of capacitive crossbar array. Based on that, we design a module that utilizes Multi-scale Asymmetric Convolution Block (MACB) for feature extraction and subsequently construct a lightweight neural network model named CAR-Net to remove capacitive crosstalk from measured value. To address the inherent error within the model, we develop an outlier removal method to enhance the accuracy of sensor calibration. Both simulation and measurement prove the effectiveness of the proposed model. Specifically, the capacitance recovery accuracy for 8×8, 16×16, and 32×32 array achieves 98.17%, 97.7%, and 95.74%, respectively. Besides, we also validate with classification experiments that the crosstalk removal is beneficial to subsequent tasks in sensing system.

5E-5 (Time: 14:55 - 15:20)

Title	PC-Opt: Partition and Conquest-based Optimizer using Multi-Agents for Complex Analog Circuits
Author	*Youngchang Choi, Sejin Park, Ho-jin Lee, Kyongsu Lee, Jae-Yoon Sim, Seokhyeong Kang (POSTECH, Republic of Korea)
Page	pp. 842 - 848
Keyword	EDA, multi-agent actor-critic, complex analog circuit optimization
Abstract	Recent research in electronic design automation (EDA) tools has focused on utilizing artificial intelligence (AI) for sizing analog circuit designs. Still, there has been a lack of focus on optimizing complex analog circuits. To optimize complex analog circuits within a few circuit simulations, we propose a partition-and-conquest-based optimizer (PC-Opt). PC-Opt assigns distinct actor-critic roles within a multi-agent system, facilitating the partitioning of complex analog circuits and conquering their optimization challenges. Partial differential training is developed for the proper prediction of each actor, which merges each other and then predicts the optimized entire circuit. To generate a compact and non-biased dataset for network training, a concentrated sampling method is devised. Experimental results on three circuits demonstrate the effectiveness of PC-Opt.

[To Session Table]

Session 5F (DF-1) Quantum Computing
Time: 13:15 - 14:30, Wednesday, January 22, 2025
Organizer: Takatsugu Ono (Kyushu University, Japan), Chair: Chihiro Yoshimura (Hitachi, Ltd., Japan)

5F-1 (Time: 13:15 - 13:40)

Title	(Designers' Forum) Design of Superconducting Quantum Computers: Similarity and Dissimilarity
Author	*Yutaka Tabuchi (RIKEN Center for Quantum Computing, Japan)
Abstract	The research and development of superconducting quantum computers is being accelerated. In 1999, researchers at NEC observed the elementary register, i.e., a quantum bit, in superconducting circuits. Since then, their lifetime has extended significantly by a million times, and the circuit has been integrated into a 100-qubit-scale computing device. However, there is much less overlap in communities between physicists and conventional circuit designers, given that we have a high wall to separate the fields; the device is only an analog passive and weakly non-linear circuit without any ability for sequential logic, and quantum gates are defined by software. In this talk, I will discuss the implementation gap between conventional and quantum logic circuits.

5F-2 (Time: 13:40 - 14:05)

Title	(Designers' Forum) Challenges in Developing Practical Qubit Control Systems
Author	*Takefumi Miyoshi (QuEL, Inc./QIQB Osaka University/e-trees.Japan, Inc., Japan)
Abstract	This presentation will explore how the speaker and their collaborators have developed a highly scalable controller capable of managing approximately 100 superconducting qubits. By integrating advanced hardware and software optimizations, this system has shown significant potential to efficiently handle complex quantum operations. In addition to discussing the current implementation, the speaker will outline the next steps in their research, focusing on enhancing the controller's capabilities to support a substantially larger number of qubits in the future. These improvements will also enable compatibility with various types of qubits, expanding the system's versatility and application potential in emerging quantum technologies. This work marks a key step toward creating scalable quantum computing systems for real-world applications.

5F-3 (Time: 14:05 - 14:30)

Title	(Designers' Forum) A Layered Approach to Quantum Computing Software Platforms for the FTQC Era
Author	*Toru Kawakubo (QunaSys Inc., Japan)
Abstract	At QunaSys, we focus on advancing quantum computing through the development of algorithms and software for fields like chemistry and Computer-Aided Engineering (CAE). Our research spans from application-specific quantum algorithms to the foundational software that supports them. Practical quantum algorithm execution on real hardware requires a multi-layered approach, integrating high-level abstractions with hardware interfaces. This talk will explore the QURI SDK, designed with Fault-Tolerant Quantum Computing (FTQC) in mind, and discuss how its architecture facilitates research and development in quantum algorithm application.

[To Session Table]

Session 6A (SS-5) Beyond Digital: Advancing Design Automation for Physical Computing Systems
Time: 15:40 - 17:20, Wednesday, January 22, 2025
Location: Room Saturn
Chair: Antonino Tumeo (PNNL, USA)

6A-1 (Time: 15:40 - 16:05)

Title	(Invited Paper) AI-Guided Codesign for Novel Computing Paradigms
Author	*Suma George Cardwell, J. Darby Smith (Sandia National Laboratories, USA), Karan Patel (University of Tennessee, Knoxville, USA), Andrew Maicke, Jared Arzate, Samuel Liu, Jaesuk Kwon (University of Texas at Austin, USA), Christopher R. Allemang, Douglas C. Crowder, Shashank Misra (Sandia National Laboratories, USA), Frances S. Chance (Sandia National Laboratories, USA, USA), Catherine D. Schuman (University of Tennessee, Knoxville, USA), Jean Anne Incorvia (University of Texas at Austin, USA), James B. Aimone (Sandia National Laboratories, USA)
Page	pp. 849 - 856
Keyword	AI-guided Codesign, Probabilistic Computing, Neuromorphic Computing, Codesign
Abstract	Microelectronics design is often a labor-intensive process involving extensive simulations, fabrication, and testing, particularly in analog design, which demands a skilled workforce with specialized knowledge. Emerging computing paradigms, such as neuromorphic and probabilistic computing, aim to harness the analog characteristics of devices for significant performance improvements over traditional methods. This presents a unique codesign challenge across the design stack, encompassing analog, mixed-signal, and beyond-CMOS devices. In this work, we introduce AI-guided codesign automation techniques, for the design of novel devices and circuits tailored for these cutting-edge computing paradigms, facilitating innovative solutions and hardware-aware algorithms for next-generation heterogeneous architectures.

6A-2 (Time: 16:05 - 16:30)

Title	(Invited Paper) Towards Design Optimization of Analog Compute Systems
Author	*Sara Achour (Stanford University, USA)
Page	pp. 857 - 864
Keyword	analog computing, design optimization
Abstract	There has been an explosion of analog hardware technologies that offer unique capabilities, enabling analog computation in more execution contexts than ever. These analog systems encode information in the physical properties of signals and leverage the physics of materials, devices, and circuits to perform computation. Because these systems leverage physical behavior for computation, they are sensitive to hardware nonidealities, such as noise and fabrication variations. Today, designers must navigate an unforgiving fidelity, programmability, and efficiency tradeoff space to identify a promising analog system design to implement. In this work, we demonstrate how gradient-based methods typically used for training neural ordinary differential equations can be used to perform design optimization for this class of analog computational systems. These methods can optimize complex analog system dynamics, including time-evolving (dynamic) non-linear behaviors, stochastic behaviors, and analog-digital interactions. To enable this form of optimization, we present an analog compute-paradigm-forward, unified modeling methodology that enables co-optimization of the analog system with the target application while also considering hardware non-idealities.

6A-3 (Time: 16:30 - 16:55)

Title	(Invited Paper) Nature-GL: A Revolutionary Learning Paradigm Unleashing Nature's Power in Real-World Spatial-Temporal Graph Learning
Author	Chuan Liu, Chunshu Wu, Ruibing Song (University of Rochester, USA), Yousu Chen, Ang Li (Pacific Northwest National Laboratory, USA), Michael Huang, *Tony (Tong) Geng (University of Rochester, USA)
Page	pp. 865 - 871
Keyword	Nature-powered computing, Dynamical systems, Spatial-temporal graph learning
Abstract	Spatial-Temporal Graph Learning (ST-GL) is a prominent research area due to its unique capability to effectively learn real-world graphs. Applications of ST-GL pose stringent and various demands on not only real-time inference with low energy cost and high accuracy but also fast training. Unfortunately, as Moore’s Law approaches its limits and ST-GL model complexity drastically grows, the gap between digital hardware’s computational power and ST-GL application demands is widening. In response, this paper introduces Nature-GL, a nature-powered graph learning paradigm that exploits the principle of entropy increase to advance graph learning. In particular, Nature-GL transforms both the training and inference of real-valued ST-GL into electron-speed natural annealing processes of a parameterized dynamical system that represents the target graphs. Experimental results across four real-world applications with six datasets demonstrate that Nature-GL achieves orders-of-magnitude speedups in both training and inference, delivering higher accuracy compared to Graph Neural Networks.

6A-4 (Time: 16:55 - 17:20)

Title	(Invited Paper) ChemComp: A Compilation Framework for Computing with Chemical Reaction Networks
Author	Nicolas Bohm Agostini, Connah Johnson, William Cannon, *Antonino Tumeo (Pacific Northwest National Laboratory, USA)
Page	pp. 872 - 878
Keyword	Chemical Reactions, CRNs, Compilers, MLIR
Abstract	The acceleration of scientific computation, data analytics, and artificial intelligence is driving a surge in computational requirements. Yet, state-of-the-art high-performance computing systems are approaching physical limitations that impede further significant improvements in energy efficiency. As we move towards post-exascale computing systems, innovative approaches are necessary to overcome this barrier in power consumption. Novel analog and hybrid digital-analog architectures hold promise for enhancing energy efficiency by several orders of magnitude. Biochemical computation stands out among the various solutions being explored due to its potential to enable new classes of devices with immense computational capabilities. These devices can capitalize on the inherent efficacy of biological cells in solving optimization problems and are scalable through increasing reaction system size or vessel capacity, potentially satisfying scientific computing's high-performance requirements. Nonetheless, several theoretical and practical limitations persist, including problem formulation and mapping to chemical reaction networks (CRNs) and implementation of actual CRN devices. In this paper, we propose a framework for biochemical computation using systems chemistry. We present the initial components of our approach: an abstract chemical reaction dialect implemented as a multi-level intermediate representation (MLIR) compiler extension and a pathway to represent mathematical problems with CRNs. To showcase the potential of this approach, we emulate a simplified chemical reservoir device. This work lays the groundwork for leveraging chemistry's computing potential in creating energy-efficient, high-performance computing systems tailored to contemporary computational needs.

[To Session Table]

Session 6B (T9-3) Floorplan and Placement
Time: 15:40 - 17:45, Wednesday, January 22, 2025
Location: Room Uranus
Chair: Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)

6B-1 (Time: 15:40 - 16:05)

Title	PPA-Aware Tier Partitioning for 3D IC Placement with ILP Formulation
Author	Eunsol Jeong, Taewhan Kim (Seoul National University, Republic of Korea), *Heechun Park (Ulsan National Institute of Science and Technology, Republic of Korea)
Page	pp. 879 - 885
Keyword	3D IC, tier partitioning, placement, PPA, ILP
Abstract	3D ICs are renowned for their potential to enable high-performance and low-power designs by utilizing denser and shorter inter-tier connections. In the physical design flow of 3D ICs, the placement stage includes a differentiated design step to assign instances to different tiers, i.e., top or bottom, called tier partitioning. Despite its importance to overall circuit performance, previous tier partitioning approaches have not taken power-performance-area (PPA) optimization into account, leading to degradation in timing and increased power consumption. In this paper, we propose a novel tier partitioning method in 3D IC placement that concurrently optimizes all PPA-relevant aspects, i.e., inter-tier cuts, overlapping areas, tier transitions along timing-critical paths, and local/global area balance. We first reduce the problem complexity with netlist clustering based on logical and physical relations, and then formulate an integer-linear programming (ILP) model for each cluster to find an optimal solution. Experiments on various benchmarks demonstrate that our method achieves significant improvements over previous tier partitioning results in terms of all PPA metrics, including 1.23% reduction in power consumption, 24.08% reduction in total negative slack (TNS), and 3.44% reduction in wirelength on average.

6B-2 (Time: 16:05 - 16:30)

Title	FTAFP: A Feedthrough-Aware Floorplanner for Hierarchical Design of Large-Scale SoCs
Author	Zirui Li, *Kanglin Tian, Jianwang Zhai, Zixuan Li (Beijing University of Posts and Telecommunications, China), Shixiong Kai, Siyuan Xu (Huawei Noah's Ark Lab, China), Bei Yu (The Chinese University of Hong Kong, Hong Kong), Kang Zhao (Beijing University of Posts and Telecommunications, China)
Page	pp. 886 - 892
Keyword	Feedthrough, Floorplan, Physical design, System on chips, Heuristic methods
Abstract	Floorplanning is a critical step in the physical design of digital integrated circuits (ICs). With the increasing complexity of circuits, the hierarchical design paradigm of large-scale systems on chips (SoCs) is gradually emerging, and floorplanning faces more complex optimization challenges, especially feedthrough. Feedthrough is a through-module connection, yet it would require additional buffers and ports inside the module for data transmission. Excessive feedthroughs will inevitably compromise the routability within the reusable module and even potentially lead to congestion and timing problems. However, the modeling and optimization of feedthroughs are difficult and there are few works to handle it. Floorplanning is a critical step in the physical design of digital integrated circuits (ICs). As circuit complexity grows, the hierarchical design paradigm of large-scale systems on chips (SoCs) is gradually emerging, introducing new optimization challenges, particularly with feedthrough. Feedthrough is a through-module connection, yet it would require additional buffers and ports inside the module for data transmission. Excessive feedthroughs will inevitably hinder the routability within reusable modules, causing congestion and timing problems. However, few works have addressed the challenges of feedthrough modeling and optimization. In this work, we propose FATFP, a feedthrough-aware SoC floorplanner, to address the aforementioned issues. First, an estimation model is proposed to assess feedthroughs required in the floorplan. Then, we introduce a novel topological representation, SCB-Tree, which incorporates slack computation into the CB-Tree. We also develop a two-phase simulated annealing (SA) framework and an automatic optimization cost scheme to enhance performance. Experimental results demonstrate that our floorplanner achieves notable optimization in terms of common edge, feedthroughed modules, and feedthrough wirelength over previous work, with only minor trade-offs in total wirelength and runtime.

6B-3 (Time: 16:30 - 16:55)

Title	Mixed-Size Placement Prototyping Based on Reinforcement Learning with Semi-Concurrent Optimization
Author	*Cheng-Yu Chiang, Yi-Hsien Chiang, Chao-Chi Lan, Yang Hsu, Che-Ming Chang, Shao-Chi Huang, Sheng-Hua Wang, Yao-Wen Chang (National Taiwan University, Taiwan), Hung-Ming Chen (National Yang Ming Chiao Tung University, Taiwan)
Page	pp. 893 - 899
Keyword	Placement, Mixed-Size Placement, Reinforcement Learning, Deep Q-learning
Abstract	Placement plays a crucial role in modern chip design, aiming to determine the positions of circuit blocks (macros and standard cells). Traditional data structure-centric heuristics often yield suboptimal placement prototypes, ineffectively guiding downstream mixed-size analytical placement to find the desired results for modern large-scale designs. Recent works have showcased the potential of reinforcement learning (RL) to enhance chip placement by training a policy to place macros as a board game. However, placing macros and fixing them in the earlier stages without sufficient information often incurs undesired solutions. This paper proposes a novel RL-based mixed-size placer with iteratively moving the blocks to characterize dense rewards and comprehensive layout information in each step. We further introduce a semi-concurrent moving mechanism to learn the collaborative dynamics among actions on a subset of blocks at each step. We integrate continuous action spaces to develop a deep Q-learning-based model for learning the semi-concurrent moving policy to derive the proposed moving strategy. Compared with the state-of-the-art methods, experimental results show that our RL-based placer achieves the best placement quality based on commonly used mixed-size placement benchmarks.

6B-4 (Time: 16:55 - 17:20)

Title	ThePlace: Thermal-Aware Placement With Operator Learning-Based Ultra-Fast Simulator
Author	*Xinfei Liu (University of Science and Technology of China, China), Siting Liu, Bei Yu (The Chinese University of Hong Kong, Hong Kong), Song Chen, Qi Xu (University of Science and Technology of China, China)
Page	pp. 900 - 906
Keyword	Placement, Thermal Simulation, Operator Learning
Abstract	Thermal issues are major concerns in integrated circuits (ICs) design. Typically, high temperature induces stress and carrier mobility changes between different materials, causing timing and reliability challenges in chip. In this paper, we propose a thermal-aware placement engine named ThePlace. It consists of an ultra-fast thermal simulation model using Fourier neural operator (FNO) to solve the steady-state heat conduction equation, followed by a force-directed global placement algorithm to co-optimize the peak temperature and wirelength in placements. The experimental results indicate that compared with the wirelength-driven placement approach DREAMPlace, ThePlace method enables significant temperature reduction with subtle variation in wirelength.

6B-5 (Time: 17:20 - 17:45)

Title	An MIP-based Force-directed Large Scale Placement Refinement Algorithm
Author	*Zewen Li, Ke Tang (Nanjing University, China), Lang Feng (Sun Yat-sen University, China), Zhongfeng Wang (Nanjing University/Sun Yat-sen University, China)
Page	pp. 907 - 913
Keyword	Placement, Physical Design, Mixed Integer Programming
Abstract	Placement is an important part in the flow of physical design, which can affect the performance of a circuit significantly. Many algorithms have been proposed to refine the placement in the past years and Mixed Integer Programming (MIP) is one of the directions that can further improve the placement quality, since MIP is able to perform a finer-grained placement with precise MIP formulations. Many previous MIP works try to prune the search space for efficiency, but the strategies for selecting valuable search space do not contain enough analysis of the initial placement before refinement. In this work, we propose an MIP-based algorithm that can refine large scale placement by considering more global factors from initial placement, while achieving the trade-off between efficiency and quality. A force-directed displacement technique is proposed, which quantifies multiple metrics in each orientation systematically to assign a potential region for each cell's movement. Meanwhile, we also propose an accurate wirelength prediction method for high-degree nets by introducing the concept of centroid for net breaking. Experiments on benchmarks of ISPD18 and ISPD19 show that our algorithm is able to reduce the wirelength and vias by 1.02% and 0.58% on average and our work outperforms the state of-the-art related work in wirelength optimization under both its comprehensive mode and wirelength-only mode.

[To Session Table]

Session 6C (T13-1) Let’s Quantumize: Welcome to the World of Quantum
Time: 15:40 - 16:55, Wednesday, January 22, 2025
Location: Room Venus
Chairs: Chun-Yi Lee (National Taiwan University, Taiwan), Ulf Schlichtmann (Technical University of Munich, Germany)

6C-1 (Time: 15:40 - 16:05)

Title	IC-D²S: A Hybrid Ising-Classical-Machines Data-Driven QUBO Solver Method
Author	Armin Abdollahi, Mehdi Kamal, *Massoud Pedram (University of Southern California, USA)
Page	pp. 914 - 920
Keyword	QUBO, Ising machine, Algorithm, (near-)Optimal solution, Runtime
Abstract	We present a heuristic algorithm designed to solve Quadratic Un- constrained Binary Optimization (QUBO) problems efficiently. The algorithm, referred to as ICD²S, leverages a hybrid approach using Ising and classical machines to address very large problem sizes. Considering the practical limitation on the size of the Ising machine (IM), our algorithm partitions the QUBO problem into a collection of QUBO subproblems (called subQUBOs) and utilizes the IM to solve each subQUBO. Our proposed heuristic algorithm uses a set of control parameters to generate the subQUBOs and explore the search space. Also, it utilizes an annealer based on cosine wave- form and applies a mutation operator at each step of the search to diversify the solution space and facilitate the process of finding the global minimum of the problem. We have evaluated the effec- tiveness of our ICD²S algorithm on three large-sized problem sets and compared its efficiency in finding the (near-)optimal solution with three QUBO solvers. One of the solvers is a software-based algorithm (D²TS), while the other one (D-Wave) employs a similar approach to ours, utilizing both classical and Ising machines. The results demonstrate that for large-sized problems (≥ 5000) the proposed algorithm identifies superior solutions. Additionally, for smaller-sized problems (= 2500), ICD²S efficiently finds the optimal solution in a significantly faster manner.

6C-2 (Time: 16:05 - 16:30)

Title	Compilation for Dynamically Field-Programmable Qubit Arrays with Efficient and Provably Near-Optimal Scheduling
Author	Daniel Bochen Tan (University of California, Los Angeles/Harvard University, USA), Wan-Hsuan Lin, *Jason Cong (University of California, Los Angeles, USA)
Page	pp. 921 - 929
Keyword	quantum
Abstract	Dynamically field-programmable qubit arrays based on neutral atoms feature high fidelity and highly parallel gates for quantum computing. However, it is challenging for compilers to fully leverage the novel flexibility offered by such hardware while respecting its various constraints. In this study, we break down the compilation for this architecture into three tasks: scheduling, placement, and routing. We formulate these three problems and present efficient solutions to them. Notably, our scheduling based on graph edge-coloring is provably near-optimal in terms of the number of two-qubit gate stages (at most one more than the optimum). As a result, our compiler, Enola, reduces this number of stages by 3.7x and improves the fidelity by 5.9x compared to OLSQ-DPQA, the current state of the art. Additionally, Enola is highly scalable, e.g., within 30 minutes, it can compile circuits with 10,000 qubits, a scale sufficient for the current era of quantum computing. Enola is open source at https://github.com/UCLA-VAST/Enola

Best Paper Candidate
6C-3 (Time: 16:30 - 16:55)

Title	Back-end-aware Fault-tolerant Quantum Oracle Synthesis
Author	*Mingfei Yu, Alessandro Tempia Calvino (EPFL, Switzerland), Mathias Soeken (Microsoft Quantum, Switzerland), Giovanni De Micheli (EPFL, Switzerland)
Page	pp. 930 - 937
Keyword	quantum compilation, logic synthesis, resource estimation
Abstract	Quantum oracle synthesis involves compiling arbitrary Boolean functions into quantum circuits using specific quantum gates supported by the target quantum computer. The Clifford+T gate library is common in fault-tolerant quantum computing systems. Utilizing XOR-AND-inverter graphs(XAGs) as the logic representation for the target Boolean functions has received extensive attention due to the observed direct correlation between the number of AND nodes in an XAG and the T count and the qubit count of the quantum oracle optimally compiled from it. However, to be deployed onto fault-tolerant quantum hardware, quantum gates must be further re-expressed by logical quantum error correction(QEC) code operations, a process known as back-end compilation. This paper enhances the current XAG-based oracle synthesis techniques by establishing a link between the properties of XAGs and quality measures of back-end-compiled quantum oracles. This link unlocks more optimization opportunities --- experimental results have evidenced average reductions of 4.49% in T count, 7.00% in logical time steps, and 14.89% in helper qubit count, respectively, on benchmarks optimized by the proposed back-end-aware XAG optimization approaches.

[To Session Table]

Session 6D (T10-1) Innovative Techniques for Energy-Efficient and Reliable Hardware Systems
Time: 15:40 - 17:45, Wednesday, January 22, 2025
Location: Room Mars/Mercury
Chairs: Zhiyao Xie (Hong Kong University of Science and Technology, Hong Kong), Qi Sun (Zhejiang University, China), Shanshi Huang (Hong Kong University of Science and Technology (GZ), China)

6D-1 (Time: 15:40 - 16:05)

Title	FEI: Fusion Processing of Sensing Energy and Information for Self-Sustainable Infrared Smart Vision System
Author	*Haijin Su (Beijing Jiaotong University, China), Xin Hong (Beijing University of Technology, China), Maimaiti Nazhamaiti (Tsinghua University, China), Ce Zhang (The Hong Kong University of Science and Technology, Hong Kong), Li Luo (Beijing Jiaotong University, China), Qi Wei (Tsinghua University, China), Zheyu Liu (MakeSens AI, China), Wenjie Deng, Yongzhe Zhang (Beijing University of Technology, China), Fei Qiao (Tsinghua University, China)
Page	pp. 938 - 944
Keyword	Fusion Processing of Energy and Information, In-Pixel-Computing, Power Scheduling, Software-Hardware Co-Design, Smart Sensing
Abstract	In the natural world, energy and information are deeply entwined, mutually constraining and complementing each other. To exploit this natural merit, this paper proposes a FEI strategy: Fusion processing of sensing Energy and Information for infrared smart vision system. The proposed Information-Power-Coupler (IPC_p) takes the ability of simultaneous energy harvesting and low power in-pixel computing, which utilizes in-situ coupled energy to process the containing information on the same focal plane. Furthermore, a self-adaptive Intelligent-Power-Controller (IPC_trl) capable of scheduling the harvested energy to complete low power neural network inference is introduced. The implementation of IPC² system utilizes a software-hardware co-design strategy to exploit the layer-wise characteristic of the computation process and circuit topology, achieving energy-efficient self-sustainable fusion processing of sensing energy and information. Simulation results show that the IPC_trl could supply 594.68nW with the power conversion efficiency of 93.38%, when the harvested energy from the IPC_p is 636.84nW. The performance validates the self-sustainability of the system with the self-powered image recognition of a complete network running at 4fps with an accuracy of 99.4%.

6D-2 (Time: 16:05 - 16:30)

Title	WITCH: WeIghTed Coding Scheme for Crosstalk Reduction in High Bandwidth Memory
Author	Seoyoon Jang, *Sangouk Jeon, Kwanghyun Shin, Dongkwon Lee (Seoul National University, Republic of Korea), Hankyu Chi, Wookjin Shin, Changhyun Pyo (SK hynix, Republic of Korea), Jaeha Kim, Dongsuk Jeon (Seoul National University, Republic of Korea)
Page	pp. 945 - 951
Keyword	Crosstalk Avoidance Code (CAC), High Bandwidth Memory (HBM), Silicon Interposer
Abstract	High bandwidth memory (HBM) has enabled a breakthrough in bandwidth-bound applications, including large-scale artificial intelligence models. HBM is typically connected to other SoCs through a silicon interposer, which provides high bandwidth through a large number of parallel interconnects. However, the increasing density of those parallel wires is incurring significant crosstalk, hindering bandwidth improvement in the next-generation HBMs. While Crosstalk Avoidance Code (CAC) has emerged as an alternative solution to mitigate crosstalk, existing schemes suffer from low bit efficiency and significant hardware overhead. This paper proposes an efficient CAC scheme, WITCH. It employs a new coding system, weighted coding, which gives a different emphasis to each channel according to its relative position in the channel array. This enables crosstalk reduction with higher bit efficiency than prior CACs treating all channels in the array equally. The extended version of WITCH, WITCH-AS, is also proposed with additional shielding for further crosstalk reduction. Our coding system shows high bit efficiency of 91.2-91.7% and 84.3-84.6% for WITCH and WITCH-AS, which is up to 20.8\% higher than the state-of-the-art schemes while preserving the same crosstalk level reduction. We have shown through simulations using a real channel model from an industry-leading HBM manufacturer that WITCH and WITCH-AS improve the eye heights by 10.1-49.4% and 17.1-51.1% respectively. In addition, this paper presents an efficient hardware implementation of our coding schemes which shows 28.2% lower critical path delay and 31.0% smaller area than a naive implementation, proving itself a practical solution for HBMs.

6D-3 (Time: 16:30 - 16:55)

Title	Compact Interleaved Thermal Control for Improving Throughput and Reliability of Networks-on-Chip
Author	*Tong Cheng, Zirui Xu, Xinyi Li, Li Li, Yuxiang Fu (Nanjing University, China)
Page	pp. 952 - 958
Keyword	Dynamic Thermal Management, Reliability Enhancement, Networks-on-Chip
Abstract	Due to the scaling of sub-micron technology and the growing complexity of applications, escalating power density and traffic workloads heavily burden the network-on-chip (NoC) in multi-core systems and exacerbate thermal reliability issues. While recent thermal management techniques offer innovative solutions, they often employ the same management strategy for all tiles in NoC and activate it synchronously, which inevitably causes system oscillation and temperature cycling. In this paper, we propose a novel compact interleaved thermal control method that staggers the control phases of neighboring nodes to create negative feedback for each tile. We further explore the optimal control phase assignment by formulating it as a graph coloring problem to achieve the best performance. Experimental results demonstrate that the proposed method lowers the maximal spatial and temporal temperature variations up to 83.1% and 71.2% and improves the system throughput by 35.85% averagely compared with the state-of-the-art work. Besides, the proposed method can achieve a significant average enhancement of 317.50% and 234.83% in minimal and average thermal-related mean time to failure (MTTF). Moreover, the method is scalable without extra power or area costs and is compatible with existing thermal management techniques.

6D-4 (Time: 16:55 - 17:20)

Title	E-QUARTIC: Energy Efficient Edge Ensemble of Convolutional Neural Networks for Resource-Optimized Learning
Author	Le Zhang, Onat Gungor, *Flavio Ponzina, Tajana Rosing (University of California, San Diego, USA)
Page	pp. 959 - 965
Keyword	Energy harvesting, Energy-efficient ML, Ensemble learning
Abstract	Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models like Convolutional Neural Networks (CNNs) result in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries that provide power supply and might include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy Efficient Edge Ensembling framework to build ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. Then, we leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy in energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art by reducing system failure rate by up to 40% while ensuring higher average output qualities. Ultimately, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.

6D-5 (Time: 17:20 - 17:45)

Title	Hardware Error Detection with In-Situ Monitoring of Control Flow-Related Specifications
Author	*Tomonari Tanaka (Kyoto University, Japan), Takumi Uezono (Hitachi, Ltd., Japan), Kohei Suenaga, Masanori Hashimoto (Kyoto University, Japan)
Page	pp. 966 - 973
Keyword	soft error, control flow, error detection
Abstract	In hardware accelerators used in data centers and safety-critical applications, soft errors and resultant silent data corruption significantly compromise reliability, particularly when upsets occur in control-flow operations, leading to severe failures. To address this, we introduce a method for monitoring control flow-related specifications using Petri nets. We validated our method across three designs: convolutional layers in LeNet-5, Gaussian blur in Canny edge detection, and AES encryption. Our fault injection campaign targeting the control registers and primary control inputs demonstrated high error detection rates in both datapath and control logic. Synthesis results show that a maximum detection rate is achieved with a few to around 10 % area overhead in most cases. The proposed detectors quickly detect 88.0% to 99.9% of failures resulting from upsets in internal control registers and perturbation in primary control inputs.

[To Session Table]

Session 6E (T4.1-3) Leveraging Large Language Models in Hardware Design
Time: 15:40 - 17:45, Wednesday, January 22, 2025
Location: Innovation Hall
Chair: Cong Hao (Georgia Institute of Technology, USA)

6E-1 (Time: 15:40 - 16:05)

Title	LLSM: LLM-enhanced Logic Synthesis Model with EDA-guided CoT Prompting, Hybrid Embedding and AIG-tailored Acceleration
Author	*Shan Huang, Jinhao Li, Zhen Yu, Jiancai Ye, Jiaming Xu, Ningyi Xu, Guohao Dai (Shanghai Jiao Tong University, China)
Page	pp. 974 - 980
Keyword	large language model, logic synthesis, ppa prediction
Abstract	Machine learning-based methods have shown promising results in the field of Electronic Design Automation (EDA), such as predicting logic synthesis results, enabling a shift-left in the overall EDA flow. Designers should fully optimize their Register Transfer Level (RTL) designs at this early stage because remedying low-quality RTL in downstream synthesis stages is extremely challenging. However, previous works mainly start modeling from the netlist level or layout level and apply Graph Neural Networks (GNNs) to make predictions. RTL captures the logic and scale information of circuits with uniform representation, making it suitable for a unified embedding approach. Since Large Language Models (LLMs) possess the ability to understand text modality, they are a potential method for understanding RTL and performing various EDA tasks. Therefore, we propose \we, the first LLM-enhanced logic synthesis model to extract information directly from RTL code. We also propose three novel approaches for \we in this paper. (1) EDA-guided Chain-of-Thought (CoT) prompting. We apply LLMs guided by circuit knowledge to summarize and analyze RTL code and generate text with circuit information. (2) Text-circuit hybrid embedding. We train a small Language Model (LM) to encode the generated circuit information from the LLM and fuse the embeddings of text and circuit modalities with weighted summation. (3) AIG-tailored acceleration library. We utilize an ELL2 format with zero padding tailored for And-Inverter-Graph (AIG) circuit representation and fuse the computation and format conversion. We also design a cacheable state strategy to avoid redundant computation for the LM. We are the first to work with both LLM and GNN for the prediction of logic synthesis results and conduct extensive experiments on the OpenABC-D dataset. \we achieve up to 21.53% and 19.27% loss reduction in delay and area prediction, respectively. It also achieves 1.34× and 6296.77× speedup on average compared to PyG implementation and Synopsys Design Compiler, respectively.

6E-2 (Time: 16:05 - 16:30)

Title	OPL4GPT: An Application Space Exploration of Optimal Programming Language for Hardware Design by LLM
Author	Kimia Tasnia, *Sazadur Rahman (University of Central Florida, USA)
Page	pp. 981 - 987
Keyword	Large Language Model, High Level Synthesis, Hardware Description Language, Prompt Engineering
Abstract	Despite the emergence of Large Language Models (LLMs) as potential tools for automating hardware design, the optimal programming language to describe hardware functions remains unknown. Prior works extensively explored optimizing Verilog-based HDL design, which often overlooked the potential capabilities of alternative programming language for hardware designs. This paper investigates the efficacy of C++ and Verilog as input languages in extensive application space exploration, tasking an LLM to generate implementations for various types of System-on-chip functional blocks. We proposed an automated Optimal Programming Language (OPL) framework that leverages OpenAI's GPT-4o LLM to translate natural language specifications into hardware descriptions using both high-level and low-level programming paradigms. The OPL4GPT demonstration initially employs a novel prompt engineering approach that decomposes design specifications into manageable submodules, presented to the LLM to generate code in both C++ and Verilog. A closed-loop feedback mechanism automatically incorporates error logs from the LLM's outputs, encompassing both syntax and functionality. Finally, functionally correct outputs are synthesized using either RTL (Register-Transfer Level) for Verilog or High-Level Synthesis for C++ to assess area, power, and performance. Our findings illuminate the strengths and weaknesses of each language across various application domains, empowering hardware designers to select the most effective approach for LLM-driven design.

6E-3 (Time: 16:30 - 16:55)

Title	Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis
Author	Jiahao Gai (University of Cambridge/Imperial College London, UK), *Hao Chen (Imperial College London, UK), Zhican Wang (Shanghai Jiaotong University, China), Hongyu Zhou (University of Sydney, Australia), Wanru Zhao, Nicholas Lane (University of Cambridge, UK), Hongxiang Fan (Imperial College London/University of Cambridge, UK)
Page	pp. 988 - 994
Keyword	Code Language Model, High Level Synthesis
Abstract	Recent advances in code generation have illuminated the potential of employing large language models (LLMs) for general-purpose programming languages such as \textit{Python} and \textit{C++}, opening new opportunities for automating software development and enhancing programmer productivity. The potential of LLMs in software programming has sparked significant interest in exploring automated hardware generation and automation. Although preliminary endeavors have been made to adopt LLMs in generating hardware description languages (HDLs) such as \textit{Verilog} and \textit{SystemVerilog}, several challenges persist in this direction. First, the volume of available HDL training data is substantially smaller compared to that for software programming languages. Second, the pre-trained LLMs, mainly tailored for software code, tend to produce HDL designs that are more error-prone. Third, the generation of HDL requires a significantly higher number of tokens compared to software programming, leading to inefficiencies in cost and energy consumption. To tackle these challenges, this paper explores leveraging LLMs to generate High-Level Synthesis (\textit{HLS})-based hardware design. We aim to investigate the suitability of \textit{HLS} over low-level HDLs for hardware design generation. To facilitate this, we first introduce an open-source dataset for the text-to-code \textit{HLS} design generation, encompassing text prompts and corresponding reference designs. An LLM-assisted framework is then proposed to automate end-to-end hardware code generation, which also investigates the impact of feedback loops and chain-of-thought promoting techniques on \textit{HLS}-based design generation. Our experimental results demonstrate that the framework, enhanced with several optimizations, can generate \textit{HLS} designs exhibiting high levels of syntax and functional correctness. To facilitate the future development of this direction, both datasets and code infrastructures will be open-sourced upon paper acceptance, and we plan to submit our work for artifact evaluation.

6E-4 (Time: 16:55 - 17:20)

Title	MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs
Author	*Manar Abdelatty, Jingxiao Ma, Sherief Reda (Brown University, USA)
Page	pp. 995 - 1001
Keyword	LLM, Reasoning, SFT, Post-synthesis, Verilog
Abstract	Large Language Models (LLMs) have been applied to various hardware design tasks, including Verilog code generation, EDA tool scripting, and RTL bug fixing. Despite this extensive exploration, LLMs are yet to be used for the task of post-synthesis metric reasoning and estimation of HDL designs. In this paper, we assess the ability of LLMs to reason about post-synthesis metrics of Verilog designs. We introduce MetRex, a large-scale dataset comprising 25,868 Verilog HDL designs and their corresponding post-synthesis metrics, namely area, delay, and static power. MetRex incorporates a Chain of Thought (CoT) template to enhance LLMs' reasoning about these metrics. Extensive experiments show that Supervised Fine-Tuning (SFT) boosts the LLM's reasoning capabilities on average by 37.0\%, 25.3\%, and 25.7\% on the area, delay, and static power, respectively. While SFT improves performance on our benchmark, it remains far from achieving optimal results, especially on complex problems. Comparing to state-of-the-art regression models, our approach delivers accurate post-synthesis predictions for 17.4\% more designs (within a 5\% error margin), in addition to offering a 1.7x speedup by eliminating the need for pre-processing. This work lays the groundwork for advancing LLM-based Verilog code metric reasoning.

6E-5 (Time: 17:20 - 17:45)

Title	SimEval: Investigating the Similarity Obstacle in LLM-based Hardware Code Generation
Author	Mohammad Akyash, *Hadi Mardani Kamali (University of Central Florida, USA)
Page	pp. 1002 - 1007
Keyword	Large Language Model (LLM), Hardware Design, Similarity, Fine-tuning, Dataset
Abstract	The increasing use and efficiency of large language models (LLMs) in digital hardware circuit design has started to revolutionize the early stages of integrated circuit (IC) supply chain design and implementation, pushing towards enhanced automation. Despite these advances, hardware circuits' inherent complexity and limited data present significant challenges. Recent studies have begun to explore various attributes of LLM-generated hardware code, including semantics, syntax, fluency, and flexibility. Given that many code generation methodologies rely on fine-tuned LLMs and face constraints due to the limited availability of datasets for hardware designs, this paper investigates the "diversity" of codes generated by LLMs. We introduce SimEval, a comprehensive, multifaceted metric vector designed to assess the similarity of LLM-generated hardware codes at the syntactic, structural, and behavioral levels, from high-level register transfer (RT-level) to synthesized (gate-level) netlists. SimEval uniquely combines sub-tree matching from abstract syntax trees (AST) with structural similarity based on kernel graphs for control flow graphs (CFG). Our experiments focusing on samples from GPT-3.5 datasets and evaluating their similarity using SimEval, highlight the critical role of SimEval in evaluating LLM-based hardware code generators w.r.t. diversity.

[To Session Table]

Session 6F (DF-2) Advanced Sensor Technologies and Sensor Fusion
Time: 15:40 - 16:55, Wednesday, January 22, 2025
Organizer: Takashi Moue (Sony Semiconductor Solutions Corp., Japan), Chair: Koichiro Yamashita (Fujitsu Ltd., Japan)

6F-1 (Time: 15:40 - 16:05)

Title	(Designers' Forum) Human Sensing Using Millimeter Wave Radar
Author	*Hongchun Li, Jun Tian, Qian Zhao, Lili Xie, Yingju Xia (Fujitsu Research & Development Center CO.,LTD., China)
Abstract	In human-centric applications, a common requirement is to extract information about the people present in an environment. Human sensing with radar devices is an emerging technology that uses radio frequency signals to detect human information. Radar system is a contactless sensing solution, unaffected by lighting, and has advantages in privacy-preservation. These qualities make radar technology ideal for application in healthcare and smart home settings. This report introduces our research progress on radar-based human sensing technology, including posture detection in active human sensing, static human localization and vital measurement in static human sensing.

6F-2 (Time: 16:05 - 16:30)

Title	(Designers' Forum) 3D-Stacked 1Megapixel Time-Gated SPAD Image Sensor with 2D Interactive Gating Network for Image Alignment-Free Sensor Fusion
Author	Kazuhiro Morimoto, Naoki Isoda, Hiroshi Sekine, Tomoya Sasago, Yu Maehashi, Satoru Mikajiri, Kenzo Tojima, Mahito Shinohara, Ayman Abdelghafar, Hiroyuki Tsuchiya, Kazuma Inoue, Satoshi Omodani, *Kazuma Chida, Alice Ehara, Junji Iwata, Tetsuya Itano, Yasushi Matsuno, Katsuhito Sakurai, Takeshi Ichikawa (Canon Inc., Japan)
Abstract	We present a 5µm-pitch, 3D-BSI 1Mpixel time-gated SPAD image sensor with 2D interactive gating network, enabling image alignment-free sensor fusion. The SPAD image sensor operates at 1,310fps for global shutter 2D imaging, and event vision sensing with 0.76ms temporal resolution under 0.02lux. Range-gated imaging result demonstrates a feasibility of robust imaging under harsh environments. The proposed gating network architecture enables background suppression in 3D depth measurement under 50klux ambient light.

6F-3 (Time: 16:30 - 16:55)

Title	(Designers' Forum) 1.22µm-pixel Back-illuminated Stacked RGB Hybrid Event-based Vision Sensor
Author	*Kazutoshi Kodama, Yusuke Sato, Yuhi Yorikado, Kyoji Mizoguchi, Takahiro Miyazaki, Masahiro Tsukamoto, Yoshihisa Matoba, Hirotaka Shinozaki, Atsumi Niwa, Tetsuji Yamaguchi (Sony Semiconductor Solutions, Japan), Christian Braendli (Sony Advanced Visual Sensing, Switzerland), Hayato Wakabayashi, Yusuke Oike (Sony Semiconductor Solutions, Japan)
Abstract	Recently, image sensors for mobile devices have shrunk in pixel size and increased in resolution. However, when capturing fast-moving objects with longer exposure times because of reduced sensitivity, blurry images are unavoidable. To address these issues, we implemented RGB pixels and event pixels on the same sensor. The sensor successfully demonstrates RGB output with existing RGB sensor equivalent readout noise of 1.57e- and event output for image quality enhancement with high speed event output of up to 10Kfps using variable frame rate. Experimental results show the potential of neural network based deblur application realization.

Thursday, January 23, 2025

[To Session Table]

Session 3K Keynote Session III
Time: 9:00 - 9:45, Thursday, January 23, 2025
Location: Miraikan Hall
Chair: Atsushi Takahashi (Institute of Science Tokyo, Japan)

3K-1

Title	(Keynote Address) In-Memory Computing-based Deep Learning Accelerators: An Overview and Future Prospects
Author	Abu Sebastian (IBM Research Europe - Zurich, Switzerland)
Abstract	Analog in-memory computing (AIMC), where synaptic weights are stored in nanoscale non-volatile memory elements and computations are carried out in the analogue or mixed-signal domain, represents a promising approach for developing the next generation of deep learning accelerators. In the first part of the presentation, I will explore the current advancements in this area, focusing on a 64-core AIMC chip built using 14nm CMOS technology with integrated phase-change memory. This chip achieves classification accuracy comparable to floating-point operations and demonstrates seamless integration of analogue and digital processing units. This work lays the foundation for a heterogeneous mixed-signal architecture. In the second part, I will cover ongoing efforts to design the next generation of AIMC chips for deep learning inference, targeting both edge and cloud applications.

[To Session Table]

Session 7A (T1-3) Accelerator Design Methodologies
Time: 10:05 - 11:45, Thursday, January 23, 2025
Location: Room Saturn
Chair: Yaoyao Ye (Shanghai Jiao Tong University, China)

Best Paper Candidate
7A-1 (Time: 10:05 - 10:30)

Title	In-Storage Read-Centric Seed Location Filtering Using 3D-NAND Flash for Genome Sequence Analysis
Author	You-Kai Zheng (National Taiwan University, Taiwan), *Ming-Liang Wei (National Taiwan University/Macronix, Taiwan), Hsiang-Yun Cheng (Academia Sinica, Taiwan), Chia-Lin Yang, Ming-Hsiang Tsai, Chia-Chun Chien, Yuan-Hao Zhong (National Taiwan University, Taiwan), Po-Hao Tseng, Hsiang-Pang Li (Macronix, Taiwan)
Page	pp. 1008 - 1015
Keyword	Genomic-Alignment, Computing-in-Memory, 3D NAND Flash Memory, Computatinoal Storage
Abstract	Read mapping is a critical bottleneck in genome sequence analysis, requiring costly approximate string matching to identify potential matches between reads and a reference genome. Pre-alignment filtering methods aim to mitigate this issue by filtering out unnecessary mapping locations, and implementing them with processing-in-memory (PIM) approaches offers potential benefits by offloading filtering from the computing unit. However, the sparse number of potential mapping locations for each read limits the utilization of PIM's parallel computing capabilities, thereby hindering the overlapping of filtering and sequence alignment to hide filtering latency overheads. In this paper, we propose a 3D NAND-based in-storage pre-alignment filtering approach. Leveraging the read depth property, we introduce a read-centric pre-alignment filtering method that enables parallel comparison of multiple reads. We co-design software and hardware for in-situ processing of read-centric pre-alignment filtering within the storage, capitalizing on 3D NAND Flash's approximate parallel search capability. When integrating with a representative read mapping accelerator, our design achieves an average 1.36x performance improvement with comparable energy consumption. Compared to the state-of-the-art (SOTA) PIM solution, our design results in 123.8x and 53.3x performance gain and energy efficiency improvement.

7A-2 (Time: 10:30 - 10:55)

Title	A Synthesis Methodology for Intelligent Memory Interfaces in Accelerator Systems
Author	*Ankur Limaye, Nicolas Bohm Agostini, Claudio Barone, Vito Giovanni Castellana (Pacific Northwest National Laboratory, USA), Michele Fiorito, Fabrizio Ferrandi (Politecnico di Milano, Italy), Andres Marquez, Antonino Tumeo (Pacific Northwest National Laboratory, USA)
Page	pp. 1016 - 1022
Keyword	Buffering system, HLS compiler, Memory Interface
Abstract	Domain-specific systems improve the performance of a specific set of applications compared to general-purpose processing systems by deploying custom hardware accelerators. These hardware accelerators are generated using high-level synthesis (HLS) tools. The HLS tools enable a comprehensive design space exploration to optimize the compute performance of the generated accelerators. However, they often ignore the challenges of implementing the accelerators in a system-on-chip, particularly how the accelerators access memory. Our work introduces a buffering system design that improves accelerators' memory accesses by intelligently employing burst transactions to prefetch useful data from external memory to on-chip local buffers. Our design is dynamic, parametric, and transparent to the accelerators generated by HLS tools. We derive the buffering system parameters using appropriate compiler-based analysis passes and memory channel latency constraints. The proposed buffering system design results in, on average, 8.8x performance improvements while lowering memory channel utilization on average by 53.2% for a set of PolyBench kernels.

7A-3 (Time: 10:55 - 11:20)

Title	Towards Efficient Data Parallelism on Spatial CGRA via Constraint Satisfaction and Graph Coloring
Author	Yuan Dai, *Xuchen Gao, Chen Shen, Bingbing Peng, Wenbo Yin, Wai-Shing Luk, Lingli Wang (Fudan University, China)
Page	pp. 1023 - 1030
Keyword	CGRA, Memory Partition, Constraint Satisfaction, Graph Coloring
Abstract	Coarse-grained reconfigurable Architecture (CGRA) is a competitive accelerator architecture for computation-intensive loop kernels. Spatial CGRA is a typical CGRA that performs all the operations spatially, demanding high data parallelism. Given the performance limitations of single-bank memory, partitioning original data into multi-bank memory within the spatial CGRA is favored. However, we observe that the mapping result can cause the inter-iteration conflict, thereby invalidating the memory partition scheme. In this paper, we develop a constraint satisfaction problem-based conflict detection approach capable of detecting the conflict in intra- and inter-iterations within a partition scheme. Besides, we formulate access scheduling as a graph coloring problem, which can minimize conflicts and improve performance. Overall, we develop a comprehensive end-to-end framework with architectural and compiler support for efficient data parallelism on the spatial CGRA. Experimental results show that our architecture

7A-4 (Time: 11:20 - 11:45)

Title	HyperG: Multilevel GPU-Accelerated k-way Hypergraph Partitioner
Author	Wan Luan Lee, Dian-Lun Lin, Cheng-Hsiang Chiu (University of Wisconsin at Madison, USA), Ulf Schlichtmann (TUM, Germany), *Tsung-Wei Huang (University of Wisconsin at Madison, USA)
Page	pp. 1031 - 1040
Keyword	Graph partitioning, GPU acceleration, Hypergraph partitioning
Abstract	Hypergraph partitioning plays a critical role in computer-aided design (CAD) because it allows us to break down a large circuit into several manageable pieces that facilitate efficient CAD algorithm designs. However, as circuit designs continue to grow in size, hypergraph partitioning becomes increasingly time-consuming. Recent research has introduced parallel hypergraph partitioners using multi-core CPUs to reduce the long runtime. However, the speedup of existing CPU parallel hypergraph partitioners is typically limited to a few cores. To overcome these challenges, we propose HyperG, a GPU-accelerated multilevel k-way hypergraph partitioning algorithm. HyperG introduces an innovative balanced group coarsening and a sequence-based refinement algorithm to accelerate both the coarsening and uncoarsening stages. Experimental results show that HyperG outperforms both the state-of-the-art sequential and CPUbased parallel partitioners with an average speedup of 133× and 4.1× while achieving comparable partitioning quality

[To Session Table]

Session 7B (T5-3) Advanced Architectures for Scientific and Edge Computing
Time: 10:05 - 11:45, Thursday, January 23, 2025
Location: Room Uranus
Chair: Hao Yu (SUSTech, China)

7B-1 (Time: 10:05 - 10:30)

Title	Exploring and Exploiting Runtime Reconfigurable Floating Point Precision in Scientific Computing: a Case Study for Solving PDEs
Author	*Cong Hao (Georgia Institute of Technology, USA)
Page	pp. 1041 - 1047
Keyword	scientific computing, reconfigurable floating point multiplication, hardware architecture
Abstract	Scientific computing applications, such as computational fluid dynamics and climate modeling, typically rely on 64-bit double-precision floating-point operations, which are extremely costly in terms of computation, memory, and energy. While the machine learning community has successfully utilized low-precision computations, scientific computing remains cautious due to concerns about numerical stability. To tackle this long-standing challenge, we propose a novel approach to dynamically adjust the floating-point data precision at runtime, maintaining computational fidelity using lower bit widths. We first conduct a thorough analysis of data range distributions during scientific simulations to identify opportunities and challenges for dynamic precision adjustment. We then propose a runtime reconfigurable, flexible floating-point multiplier (R2F2), which automatically and dynamically adjusts multiplication precision based on the current operands, ensuring accurate results with lower bit widths. Our evaluation shows that 16-bit R2F2 significantly reduces error rates by 70.2% compared to standard half precision, with resource overhead ranging from a 5% reduction to a 7% increase and no latency overhead. In two representative scientific computing applications, R2F2, using 16 or fewer bits, can achieve the same simulation results as 32-bit precision, while standard half precision will fail. This study pioneers runtime reconfigurable arithmetic, demonstrating great potential to enhance scientific computing efficiency. Code available at https://github.com/sharc-lab/R2F2.

7B-2 (Time: 10:30 - 10:55)

Title	A Holistic FPGA Architecture Exploration Framework for Deep Learning Acceleration
Author	*Jiadong Zhu, Dongsheng Zuo, Yuzhe Ma (Hong Kong University of Science and Technology (Guangzhou), China)
Page	pp. 1048 - 1054
Keyword	FPGA architecture, Accelerator, Design space exploration
Abstract	FPGAs have become a promising solution for accelerating deep learning (DL) workloads because of their inherent reconfigurability and heterogeneous architecture, which effectively handles specific computing tasks. Previous works have proposed various modifications to FPGA architectures for DL acceleration. However, they mainly focus on manual architecture designs, making it difficult to handle multiple scenarios and potentially limiting exploration of the search space. We propose a holistic automatic framework to explore FPGA architectures tailored for DL acceleration. By modifying and integrating CAD tools, we enable automated architecture generation and evaluation. This is combined with a multi-objective Tree-structured Parzen Estimator (TPE) algorithm to iterate the exploration process for finding optimal solutions. Experimental results show that the optimized architectures outperform all the baseline architectures in both delay and the area-delay product (ADP). Furthermore, our results achieve a 29.4% increase in hypervolume and an 89.5% reduction in average distance to reference set (ADRS).

7B-3 (Time: 10:55 - 11:20)

Title	OpenGeMM: A High-Utilization GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling
Author	*Xiaoling Yi, Ryan Antonio, Joren Dumoulin, Jiacong Sun, Josse Van Delm (KU Leuven, Belgium), Guilherme Paim (KU Leuven/INESC-ID, Belgium), Marian Verhelst (KU Leuven, Belgium)
Page	pp. 1055 - 1061
Keyword	Matrix Multiplication, GeMM accelerator, Hardware generators, RISC-V, Tight memory coupling
Abstract	Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system level efficiency and low utilization. To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58× to 16.40× speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.

7B-4 (Time: 11:20 - 11:45)

Title	Pointer: An Energy-Efficient ReRAM-based Point Cloud Recognition Accelerator with Inter-layer and Intra-layer Optimizations
Author	*Qijun Zhang, Zhiyao Xie (The Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 1062 - 1069
Keyword	Point Cloud, AI Accelerator
Abstract	Point cloud is an important data structure for a wide range of applications, including robotics, AR/VR, and autonomous driving. To process the point cloud, many deep-learning-based point cloud recognition (PCR) algorithms have been proposed. However, to meet the requirement of applications like autonomous driving, the PCR must be fast enough, rendering accelerators necessary at the inference stage. But existing PCR accelerators are still inefficient due to two challenges. First, the multi-layer perception (MLP) during feature computation is the performance bottleneck. Second, the feature vector fetching operation incurs heavy DRAM access. In this paper, we propose Pointer, an energy-efficient Resistive Random Access Memory (ReRAM)-based point cloud recognition accelerator with inter- and intra-layer optimizations. It proposes three customized techniques for PCR acceleration. First, Pointer adopts ReRAM-based architecture to significantly accelerate the MLP in feature computation. Second, to reduce DRAM acce

[To Session Table]

Session 7C (T6-2) The Science of Light: the New Advancement of Photonic Computing
Time: 10:05 - 11:45, Thursday, January 23, 2025
Location: Room Venus
Chairs: Ryosuke Matsuo (University of Tokyo, Japan), Yuanqing Cheng (Beihang University, China)

7C-1 (Time: 10:05 - 10:30)

Title	An Efficient General-Purpose Optical Accelerator for Neural Networks
Author	*Sijie Fei, Amro Eldebiky (Techinical University of Munich, Germany), Grace Li Zhang (Technical University of Darmstadt, Germany), Bing Li (University of Siegen, Germany), Ulf Schlichtmann (Techinical University of Munich, Germany)
Page	pp. 1070 - 1076
Keyword	Optical Neural Network, Optical Computing
Abstract	General-purpose optical accelerators (GOAs) have emerged as a promising platform to accelerate deep neural networks (DNNs) due to their low latency and energy consumption. Such an accelerator is usually composed of a given number of interleaving Mach-Zehnder-Interferometers (MZIs). This interleaving architecture, however, has a low efficiency when accelerating neural networks of various sizes due to the mismatch between weight matrices and the GOA architecture. In this work, a hybrid GOA architecture is proposed to enhance the mapping efficiency of neural networks onto the GOA. In this architecture, independent MZI modules are connected with microring resonators (MRRs), so that they can be combined to process large neural networks efficiently. Each of these modules implements a unitary matrix with inputs adjusted by tunable coefficients. The parameters of the proposed architecture are searched using generic algorithm. To enhance the accuracy of neural networks, selected weight matrices are expanded to multiple unitary matrices applying singular value decomposition (SVD). The kernels in neural networks are also adjusted to use up the on-chip computational resources. Experimental results show that with a given number of MZIs, the mapping efficiency of neural networks on the proposed architecture can be enhanced by 21.87%, 21.20%, 24.69%, and 25.52% for VGG16 and Resnet18 on datasets Cifar10 and Cifar100, respectively. The energy consumption and computation latency can also be reduced by over 67% and 21%, respectively.

7C-2 (Time: 10:30 - 10:55)

Title	Zero-Shot Automated Circuit Topology Search for Pareto-Optimal Photonic Tensor Cores
Author	Ziyang Jiang, Pingchuan Ma (Arizona State University, USA), Meng Zhang, Rena Huang (Rensselaer Polytechnic Institute, USA), *Jiaqi Gu (Arizona State University, USA)
Page	pp. 1077 - 1083
Keyword	ONN, Auto-Design, Topology, Multi-objective
Abstract	Photonic tensor cores (PTCs) are essential building blocks for optical artificial intelligence (AI) accelerators based on programmable photonic integrated circuits. Most PTC designs today are manually constructed, with low design efficiency and unsatisfying solution quality. This makes it challenging to meet various hardware specifications and keep up with rapidly evolving AI applications. Prior work has explored gradient-based methods to learn a good PTC structure differentiably. However, it suffers from slow training speed and optimization difficulty when handling multiple non-differentiable objectives and constraints. Therefore, in this work, we propose a more flexible and efficient zero-shot multi-objective evolutionary topology search framework that explores Pareto-optimal PTC designs with advanced devices in a larger search space. Multiple objectives can be co-optimized while honoring complicated hardware constraints. With only 3 hours of search, we can obtain tens of diverse Pareto-optimal solutions, outperforming prior manual designs in the accuracy-density-efficiency design space, with potentially >100x faster runtime than the single-objective gradient-based method.

7C-3 (Time: 10:55 - 11:20)

Title	Reuse and Blend: A Weight-Sharing Energy-Efficient Optical Neural Network
Author	*Bo Xu, Yuetong Fang (The Hong Kong University of Science and Technology (Guangzhou), China), Shaoliang Yu (Zhejiang Lab, China), Renjing Xu (The Hong Kong University of Science and Technology (Guangzhou), China)
Page	pp. 1084 - 1090
Keyword	weight sharing, microring resonator, optical computing
Abstract	Optical neural networks (ONN) based on micro-ring resonators (MRR) have emerged as a promising alternative to significantly accelerating the massive matrix-vector multiplication (MVM) operations in artificial intelligence (AI) applications. However, the limited scale of MRR arrays presents a challenge for AI acceleration. The disparity between the small MRR arrays and the large weight matrices in AI necessitates extensive MRR writings, including reprogramming and calibration, resulting in considerable latency and energy overheads. To address this problem, we propose a novel design methodology to lessen the need for frequent weight reloading. Specifically, we propose a reuse and blend (R&B) architecture to support efficient layer-wise and block-wise weight sharing, which allows weights to be reused several times between layers/blocks.Experiment results demonstrate the R&B system can maintain comparable accuracy with 69% energy savings and 57% latency improvement. These results highlight the promise of the R&

7C-4 (Time: 11:20 - 11:45)

Title	PhotonGraph: High-performance Photonic Graph Processing Accelerator
Author	*Jiaqi Liu, Xianbin Li (Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 1091 - 1096
Keyword	Graph Processing, Photonic Computing, Microring Resonators
Abstract	Graph processing algorithms are attracting more attention with the increasing need of processing irregular large-scale graphs in nature and social sciences. The memory access becomes the performance bottleneck of graph applications, sourced from high memory bandwidth requirements, poor locality, and sparsity. Conventional graph accelerators with electronic interconnect utilize customized execution pipelines and high-bandwidth memory. However, they still face the problem of the memory wall and limited working frequency. Consequently, there is an increasing interest in investigating optical interconnect and photonic computing as an alternative technology because of higher bandwidth and ultra-fast processing speed. In this paper, we propose the first photonic graph processing accelerator, PhotonGraph. We propose a detailed workflow for photonic graph processing including scatter phase and gather phase. We design a dedicated photonic sparsity compression computing core for the gather phase based on the wavelength routing technique. We also introduce the data fetching and dispatching in the scatter phase with optical network-on-chip and high-bandwidth memory(HBM). PhotonGraph outperforms the GPU platform with 151x better performance and 1111x greater energy efficiency. Additionally, we surpass state-of-the-art processing-in-memory(PIM) accelerators by 207x in terms of performance and 114x in terms of energy efficiency.

[To Session Table]

Session 7D (T8-2) From Math to Circuits
Time: 10:05 - 11:20, Thursday, January 23, 2025
Location: Room Mars/Mercury
Chair: Yukio Miyasaka (UC Berkeley)

7D-1 (Time: 10:05 - 10:30)

Title	An Algebraic Approach to Partial Synthesis of Arithmetic Circuits
Author	Bhavani Sampathkumar, Ritaja Das, Bailey Martin (The University of Utah, USA), Florian Enescu (Georgia State University, USA), *Priyank Kalla (The University of Utah, USA)
Page	pp. 1097 - 1103
Keyword	Partial Logic Synthesis, Polynomial Ideals, Groebner Bases, Algebraic Varieties
Abstract	We present an approach to partial logic synthesis of arithmetic circuits. Its targeted applications are rectification of buggy circuits, and computing care and don't care sets at internal nets of the circuit. The approach models the circuit by way of polynomial ideals in rings with coefficients in the field of rationals (Q). Techniques from commutative algebra are applied to compute internal patch functions as polynomials over Q. We describe how the care set and the don't care conditions manifest in the algebraic setting, and show how to generate corresponding Boolean functions from polynomials over Q. Experiments are conducted over various integer multiplier architectures which demonstrate the efficacy of our approach, where SAT/interpolation based techniques are infeasible.

7D-2 (Time: 10:30 - 10:55)

Title	Hardware Synthesizable Exceptions using Continuations
Author	*Paul Teng (McGill University, Canada), Christophe Dubach (McGill University/MILA, Canada)
Page	pp. 1104 - 1111
Keyword	Hardware Synthesis, Runtime Exceptions, Continuation Passing Style, Programming Languages
Abstract	High-level synthesis (HLS ) turns “high-level” programs (e.g., C code) into hardware automatically. Despite simplifying hardware design, HLS still lacks support for fundamental features such as runtime exceptions. Implementing exceptions is challenging since the synthesized hardware typically does not contain any call stack. Two alternatives to supporting exceptions in hardware consists of either using error codes, or option types if available in the language. However, these approaches put a heavy burden on the programmer, and are low-level compared to directly supporting exceptions. This paper proposes to implement exceptions in hardware using Continuation passing style (CPS). As we will see, using continuations removes the needs for a call stack in hardware. This paper demonstrates the strength of this approach by synthesizing a subset of the MachSuite benchmark suite on the Intel Max 10 FPGA. The rewritten benchmarks, with exception handling, produce faster hardware than the two alternative approaches, with increase productivity for the programmer.

7D-3 (Time: 10:55 - 11:20)

Title	Area-Oriented Optimization After Standard-Cell Mapping
Author	*Andrea Costamagna, Alessandro Tempia Calvino (EPFL, Switzerland), Alan Mishchenko (UC Berkeley, USA), Giovanni De Micheli (EPFL, Switzerland)
Page	pp. 1112 - 1119
Keyword	resynthesis, mapped circuits, standard cells
Abstract	We address the problem of minimizing the area of circuits mapped to a technology library, with or without delay constraints. While traditional methods optimize first a technology-independent representation and then perform technology mapping to a library, this paper explores the potential for further optimizations through technology-dependent algorithms. We propose an optimization engine for mapped circuits that relies on a database of mapped sub-networks for efficient resynthesis. Experimental results on the EPFL benchmarks after area-oriented optimization and mapping show that the proposed method leads to average area improvements of 5.47% without degrading the delay.

[To Session Table]

Session 7E (T4.2-3) Innovative Techniques in AI Model Optimization and Training
Time: 10:05 - 11:20, Thursday, January 23, 2025
Location: Innovation Hall
Chairs: Dongsuk Jeon (Seoul National University, Republic of Korea), Ik-Joon Chang (Kyung Hee University)

Best Paper Candidate
7E-1 (Time: 10:05 - 10:30)

Title	ROBIN: A Novel Framework for Accelerating Robust Multi-Variant Training
Author	*Yan Wang, Xingbin Wang, Yulan Su, Sisi Zhang, Zechao Lin, Dan Meng, Rui Hou (Institude of Information Engineering, Chinese Academy of Sciences, China)
Page	pp. 1120 - 1125
Keyword	efficient ml training
Abstract	Robust variants are a promising approach to improving model robustness against adversarial attacks by exploring different network architectures. However, due to the high cost of adversarial training for multiple variants, existing adversarial defense techniques are often confined to limited model architectures, failing to fully exploit the robustness benefits offered by architectural variations. In this paper, we first reveal that function-preserving knowledge transfer can be used to accelerate adversarial training of architecture variants. Then, We propose ROBIN, a framework for accelerating robust multi-variant training. ROBIN can capitalize on architectural similarities between various variants and achieve weight transformation between models through two atomic operations at the tensor level, thereby expediting the convergence process of multiple models. Experimental results show that ROBIN can accelerate the adversarial training process of multiple architecture variants by 2.56 × to 4.27 ×, enabling efficient exploration of robust network architectures.

7E-2 (Time: 10:30 - 10:55)

Title	Dual-branch cross-modal fusion with local-to-global learning for UAV object detection
Author	Binyi Fang, *Yixin Yang, Jingjing Chang, Ziyang Gao, Hai-Bao Chen (Shanghai Jiao Tong University, China)
Page	pp. 1126 - 1132
Keyword	multi-modal fusion, UAV object detection
Abstract	Due to the significant differences between unmanned aerial vehicle (UAV) images and natural scene images in terms of lighting, scale, and viewing angle, existing multispectral detection techniques often fail to fully utilize the remote dependencies between global and local information, resulting in poor performance in complex UAV scenarios. In this paper, we propose a novel two-branch cross-modal fusion network that integrates a dual cross-attention transformer fusion block (CTF) for global feature dependency and an adaptive mask convolution fusion block (MCF) for underlying feature extraction. This achieves a unified representation with both global and local receptive fields. Our local-global training strategy utilizes a shallow global fusion network and a deep local fusion network, which operate on the entire image while also focusing on detailed local features. Additionally, we integrate an asymptotic feature pyramid network that employs adaptive spatial fusion to refine features, enhancing the accuracy of small object detection in UAV scenes. Evaluating our work with the DroneVehicle dataset for vehicle detection using infrared and visible light, our network outperformed existing methods, improving mAP@0.5 by 7.01% compared to CALNet. APs for YOLOv8m-based single-modal infrared detection and visible light detection increased by 3.1% and 11.6%, respectively.

7E-3 (Time: 10:55 - 11:20)

Title	H4H: Hybrid Convolution-Transformer Architecture Search for NPU-CIM Heterogeneous Systems for AR/VR Applications
Author	*Yiwei Zhao (Carnegie Mellon University, USA), Jinhui Chen (Reality Labs Research, Meta, USA), Sai Qian Zhang (New York University, USA), Syed Shakib Sarwar, Kleber Hugo Stangherlin, Jorge Tomas Gomez, Jae-Sun Seo, Barbara De Salvo, Chiao Liu (Reality Labs Research, Meta, USA), Phillip B. Gibbons (Carnegie Mellon University, USA), Ziyun Li (Reality Labs Research, Meta, USA)
Page	pp. 1133 - 1141
Keyword	neural architecture search, edge AI inference, neural processing unit, compute-in-memory, algorithm-hardware co-design
Abstract	Low-latency and low-power edge AI is crucial for Virtual Reality and Augmented Reality applications. Recent advances demonstrate that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve a superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can present system challenges for latency and energy efficiency due to their diverse nature in dataflow and memory access patterns. In this work, we leverage architecture heterogeneity from Neural Processing Units (NPU) and Compute-In-Memory (CIM) and explore diverse execution schemas to efficiently execute these hybrid models. We introduce H4H-NAS, a two-stage Neural Architecture Search (NAS) framework to automate the design of efficient hybrid CNN/ViT models for heterogeneous edge systems featuring both NPU and CIM. We propose a two-phase incremental supernet training in our NAS framework to resolve gradient conflicts between sampled subnets caused by different types of blocks in a hybrid model search space. Our H4H-NAS approach is also powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN-ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet. Moreover, results from our algorithm/hardware co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing heterogeneous computing over baseline solutions. Overall, our framework guides the design of hybrid network architectures and system architectures for NPU+CIM heterogeneous systems.

[To Session Table]

Session 7F (SS-6) Rapidus' Initiatives to Half Semiconductor Development Time
Time: 10:05 - 11:20, Thursday, January 23, 2025
Chair: Koki Tsurusaki (Rapidus Corporation, Japan)

7F-1 (Time: 10:05 - 10:30)

Title	(Invited Paper) Raads: Rapidus's AI/ML based assisted design flow to reduce design period halved
Author	*Koki Tsurusaki (Rapidus Corp., Japan)
Page	p. 1142
Keyword	Short TAT designing, AI/ML assisted, EDA tool
Abstract	In the advanced foundry services of Rapidus Corporation, we will discuss the outline of Raads (Rapidus AI-Assisted Design Solutions), the digital design environment provided by Rapidus, aimed at shortening the TAT (Turnaround Time) for design and manufacturing in 2nm technology. Raads is being developed with the goal of halving the TAT for design and reduce total design cost drastically by offering features and corresponding ML models that utilize ML/AI, in addition to the basic reference design flow traditionally provided by foundries. Raads will provide effective design flow for large scale AI and HPC product design to improve PPAC (Performance Power Area and Cost) + t (Time). Furthermore, Raads will be one of the components of RUMS (Rapidus & Unified Manufacturing Service), which is a collaborative model for 2nm design and manufacturing at Rapidus.

7F-2 (Time: 10:30 - 10:55)

Title	(Invited Paper) DMCO: A Strategy for Design-Manufacturing Co-optimization
Author	*Masaharu Kobayashi (Rapidus, Japan)
Page	p. 1143
Keyword	Short TAT manufacturing, Design cycle-time reduction, Yield enhancement
Abstract	As domain-specific AI-chip demands grow, time-to-market of the product is desired for winning the competition. On the other hand, as the process technology node proceeds, chip-design is going to face more challenges and become more expensive for time and cost than before, particularly at 2nm Gate-All-Around Nano Sheet (GAA NS) technology. Rapidus offers short turnaround time (TAT) of manufacturing based on our unique capability as a foundry. Our short TAT manufacturing by all single-wafer processes generates tremendous amount of data, which can be leveraged for cycle-time reduction of design and packaging by design-manufacturing co-optimization, DMCO. This leads to the customers’ time-to-market reduction through DMCO-enhanced design flow and collaterals with yield-aware. In this presentation, we will explain the concept and strategy of DTCO and illustrates some use cases.

7F-3 (Time: 10:55 - 11:20)

Title	(Invited Paper) Advanced Packaging Technology and Design Methodology for Next Generation Chiplets
Author	*Hideki Sasaki (Rapidus Corporation, Japan)
Page	p. 1144
Keyword	Chiplet
Abstract	AI and HPC are driving not only advanced CMOS process nodes, but also advanced packaging technology. The strong demand for higher performance of semiconductor products leads to larger chip size. When the chip size is estimated to be larger than the recent reticle size, the chip is divided into multiple chips based on optimal function and/or suitable CMOS process node, and then integrated into a package. High-density interconnects through advanced packaging technology contribute significantly to the performance of semiconductor products. This presentation will talk about the requirements of the latest advanced packages for next generation chiplets and also mention the design methodology to achieve high performance with short TAT.

[To Session Table]

Session 8A (T1-4) System Level Modelling & Optimization
Time: 13:15 - 14:55, Thursday, January 23, 2025
Location: Room Saturn
Chair: Zhe Lin (Sun Yat-sen University, China)

8A-1 (Time: 13:15 - 13:40)

Title	FirePower: Towards a Foundation with Generalizable Knowledge for Architecture-Level Power Modeling
Author	*Qijun Zhang, Mengming Li, Yao Lu, Zhiyao Xie (The Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 1145 - 1152
Keyword	Power Model, Machine Learning
Abstract	Power efficiency is a critical design objective in modern processor design. A high-fidelity architecture-level power modeling method is greatly needed by CPU architects for guiding early optimizations. However, traditional architecture-level power models can not meet the accuracy requirement, largely due to the discrepancy between the power model and actual design implementation. While some machine learning (ML)-based architecture-level power modeling methods have been proposed in recent years, the data-hungry ML model training process requires sufficient similar known designs, which are unrealistic in many development scenarios. This work proposes a new power modeling solution FirePower that targets few-shot learning scenario for new target architectures. FirePower proposes multiple new policies to utilize cross-architecture knowledge. First, it develops power models at component level, and components are defined in a power-friendly manner. Second, it supports different generalization strategies for models of different components. Third, it formulates generalizable and architecture-specific design knowledge into two separate models. FirePower also supports the evaluation of the generalization quality. In our experiments, FirePower can achieve a low error percentage of 5.8% and a high correlation R of 0.98 on average only using two configurations of target architecture. This is 8.8% lower in error percentage and 0.03 higher in R compared with directly training McPAT-Calib baseline on configurations of target architecture.

8A-2 (Time: 13:40 - 14:05)

Title	DISS: A Novel Data Invalidation Scheme for Swap-Data on Flash Storage Systems
Author	*Dingcui Yu, Longfei Luo, Han Wang (College of Computer Science and Technology, East China Normal University, China), Yina Lv (Department of Computer Science, City University of Hong Kong, Hong Kong), Liang Shi (College of Computer Science and Technology, East China Normal University, China)
Page	pp. 1153 - 1159
Keyword	Swap, Trim, Lifetime, Flash Storage
Abstract	Storage swapping has been a critical technique used to relieve memory pressure and improve user experience. However, it generates lots of data writes in flash storage, deteriorating the lifetime and performance. In this paper, inspired by empirical studies on swap data access characteristics, we propose a novel data invalidation scheme, namely DISS, which includes two methods. First, a cross-layer swap-data invalidation method is proposed to invalidate swapped-in data at a low cost. Second, a swap data separation method is proposed to schedule swap data and file-backed data into different places. Experimental results show that DISS achieves encouraging flash lifetime and performance optimization.

8A-3 (Time: 14:05 - 14:30)

Title	Response Range Optimization for Run-Time Requirement Enforcement on MPSoCs
Author	*Khalil Esper, Stefan Wildermann, Jürgen Teich (Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany)
Page	pp. 1160 - 1166
Keyword	Optimization, Verification, Finite State Machine, Runtime Requirement Enforcement, Probabilistic Model Checking
Abstract	Embedded system applications often come with a set of non-functional requirements on execution properties (e.g., latency), expressed by a corridor of permissible values. These requirements should be guaranteed during each program execution on a given MPSoC platform. This can be achieved using a reactive control loop based on a requirement response, with an enforcer finite state machine (FSM) controlling the properties to be enforced, e.g., by adapting the number of cores allocated to a program or by scaling the voltage/frequency mode of active processors. A finer-grained control can be achieved using response ranges, which allow an enforcer to react based on the amount of violation of a requirement. But as the search space of enforcer FSMs to be explored by design space exploration (DSE) can be quite huge when jointly exploring transition relations together with response ranges of the transitions, we propose two heuristics for generating suitable response ranges prior to performing a DSE of proper enforcement FSMs. Our evaluation shows that the two proposed heuristics can generate efficient enforcement FSMs within a substantially smaller number of iterations (respectively time) compared to the case of using DSE to explore the joint space of transition relations and response ranges.

8A-4 (Time: 14:30 - 14:55)

Title	TL-CSE: Microarchitecture-Compiler Co-design Space Exploration via Transfer Learning
Author	*Zheng Wu, Jinyi Shen, Xuyang Zhao, Changxu Liu, Li Shang, Fan Yang (Fudan University, China)
Page	pp. 1167 - 1173
Keyword	RISC-V, Microarchitecture Design, GCC Compiler Opimize, Transfer Learning, Bi-level Bayesian Opimization
Abstract	The system design of domain-specific processors is a challenging task due to the vast architecture-compiler search space and time-consuming simulation processes. Nowadays, architecture space exploration and compiler optimization are conducted in silo. However, due to the interdependence between microarchitecture and compiler, this segmented approach shrinks the hardware-software design space, leading to sub-optimal global outcomes. To solve problems in co-optimization, this paper introduces TL-CSE, a microarchitecture-compiler co-design space exploration framework. It utilizes a bi-level optimization framework with transfer learning techniques to efficiently explore the hardware-software system design space. Results demonstrate that TL-CSE improves the quality of the Pareto optimal system by 31.5% and speeds up the exploration time by 6.1 times compared to previous co-design frameworks.

[To Session Table]

Session 8B (T5-4) Emerging Trends in Reconfigurable and Compute-in-Memory
Time: 13:15 - 14:55, Thursday, January 23, 2025
Location: Room Uranus
Chair: Guohao Dai (Shanghai Jiao Tong University, China)

8B-1 (Time: 13:15 - 13:40)

Title	RISC-V Driven Orchestration of Vector Processing Units and eFlash Compute-in-Memory Arrays for Fast and Accurate Keyword Spotting
Author	*Gunil Kang (Korea University & Samsung Electronics, Republic of Korea), Dahoon Park, Hojin Lee (Korea University, Republic of Korea), Sangwoo Jung (DGIST, Republic of Korea), Jiyong Park (Korea University, Republic of Korea), Jung Gyu Min, Youngjoo Lee (Pohang University of Science and Technology, Republic of Korea), Jaeha Kung (Korea University, Republic of Korea)
Page	pp. 1174 - 1180
Keyword	keyword spotting, risc-v, tinyml, compute in memory, HW/SW co-design
Abstract	In this paper, we propose a computationally efficient keyword spotting (KWS) model, named hybrid reparameterized FSMN (HRepFSMN), by carefully examining the impact of binarization on the accuracy. In particular, we found that binarizing depthwise convolution (DW-Conv) within the previous binarized KWS model, i.e., BiFSMNv2, does not lead to a significant reduction in FLOPs. Therefore, we allow floating-point (FP) operations on less computation-intensive DW-Conv layers while the remaining layers are computed in a binary fashion (hybrid data type). In addition, we remove skip connections, which require data fetching in full precision, by applying a reparameterization technique. More importantly, to efficiently compute the proposed HRepFSMN, we present a RISC-V controlled hardware accelerator that consists of reconfigurable vector processing units for FP operations and eFlash compute-in-memory arrays for binary operations. We extend RISC-V instructions so that the core can efficiently manage both computing fabrics. As a result, our HRepFSMN improves accuracy by 2.57%/4.98% with 24.02×/3.66× speed-up compared to BiFSMNv2/BiFSMNv2_small. By shrinking down our HRepFSMN, we achieve 0.95% higher accuracy with 20.87× speed-up compared to BiFSMNv2_small.

8B-2 (Time: 13:40 - 14:05)

Title	Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Author	*Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang (Nanjing University, China)
Page	pp. 1181 - 1187
Keyword	LLM, Inference Acceleration, GPU, Ultra-Low Bit Quantization, Tensor Core
Abstract	Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs. At its core, we introduce a novel bipolar-INT data format that facilitates parallel computing and supports symmetric quantization, effectively reducing data redundancy. Building on this, we implement an arbitrary precision matrix multiplication scheme that decomposes and recovers matrices at the bit level, enabling flexible precision while maximizing GPU Tensor Core utilization. Furthermore, we develop an efficient matrix preprocessing method that optimizes data layout for subsequent computations. Finally, we design a data recovery-oriented memory management system that strategically utilizes fast shared memory, significantly enhancing kernel execution speed and minimizing memory access latency. Experimental results demonstrate our approach’s effectiveness, with up to 2.4× speedup in matrix multiplication compared to NVIDIA’s CUTLASS. When integrated into LLMs, we achieve up to 6.7× inference acceleration. These improvements significantly enhance LLM inference efficiency, enabling broader and more responsive applications of LLMs.

8B-3 (Time: 14:05 - 14:30)

Title	Large-Scale AGV Routing Based on Multi-FPGA SQA Acceleration
Author	*Thinh Nguyen Quang, Kosuke Matsuyama, Keisuke Shimizu, Hiroki Sugano, Eiji Kurimoto (Sharp Corporation, Japan), Hasitha Muthumala Waidyasooriya, Masanori Hariyama, Masayuki Ohzeki (Tohoku University, Japan)
Page	pp. 1188 - 1194
Keyword	FPGA, AGV, Routing problem, Simulated Quantum Annealing, SQA Accelerator
Abstract	Enhancing the efficiency, safety, and speed of large-scale Automated Guided Vehicle (AGV) systems is critical to increasing the productivity of logistics warehouses. Studies using the latest quantum annealers such as ``D-Wave Advantage'' with over 5000 qubits have shown the potential of quantum annealing (QA) to rapidly optimize AGV routing. However, applying QA to complex and large-scale AGV routing problems is a challenging task due to insufficient consideration of intricate operational conditions, and also due to the insufficient number of qubits in quantum annealers. This paper proposes a refined combinatorial optimization problem that minimizes the total travel time of thousands of AGVs while enhancing safety and efficiency by avoiding collisions. To solve such large-scale optimization problems with thousands of variables, we also propose a novel system architecture containing a Simulated Quantum Annealing (SQA) accelerator using multiple FPGAs. The proposed SQA accelerator is capable of processing problems with over 50,000 variables, which could be a few tens to several hundred times larger than the problems processed on the ``D-Wave Advantage``. It addresses multiple combinatorial optimization problems across multiple FPGAs concurrently while processing each problem in a high degree of parallelism. We demonstrate the accurate operation of the proposed SQA accelerator using a real-world large-scale AGV system with over 1000 AGVs. According to the experimental results, we observed faster processing speed and better quality results over existing SQA solvers.

8B-4 (Time: 14:30 - 14:55)

Title	A Data-Driven Approach to Dataflow-Aware Online Scheduling for Graph Neural Network Inference
Author	Pol Puigdemont (Universitat Politècnica de Catalunya (UPC), Spain), *Enrico Russo (University of Catania, Italy), Axel Wassington, Abhijit Das, Sergi Abadal (Universitat Politècnica de Catalunya (UPC), Spain), Maurizio Palesi (University of Catania, Italy)
Page	pp. 1195 - 1201
Keyword	graph neural networks (GNNs), dataflow-aware latency prediction, online scheduling, GNN accelerators
Abstract	Graph Neural Networks (GNNs) have shown significant promise in various domains, such as recommendation systems, bioinformatics, and network analysis. However, the irregularity of graph data poses unique challenges for efficient computation, leading to the development of specialized GNN accelerator architectures that surpass traditional CPU and GPU performance. Despite this, the structural diversity of input graphs results in varying performance across different GNN accelerators, depending on their dataflows. This variability in performance due to differing dataflows and graph properties remains largely unexplored, limiting the adaptability of GNN accelerators. To address this, we propose a data-driven framework for dataflow-aware latency prediction in GNN inference. Our approach involves training regressors to predict the latency of executing specific graphs on particular dataflows, using simulations on synthetic graphs. Experimental results indicate that our regressors can predict the optimal dataflow for a given graph with up to 91.28% accuracy and a Mean Absolute Percentage Error (MAPE) of 3.78%. Additionally, we introduce an online scheduling algorithm that uses these regressors to enhance scheduling decisions. Our experiments demonstrate that this algorithm achieves up to 3.17x speedup in mean completion time and 6.26x speedup in mean execution time compared to the best feasible baseline across all datasets.

[To Session Table]

Session 8C (T3-4) Adaptive and Flexible Memory Architecture
Time: 13:15 - 14:55, Thursday, January 23, 2025
Location: Room Venus
Chairs: Chao Wu (Nanjing University of Science and Technology, China), Yuzhe Ma (Hong Kong University of Science and Technology (Guangzhou), China)

8C-1 (Time: 13:15 - 13:40)

Title	FPBA: Flexible Percentile-Based Allocation for Multiple-Bits-Per-Cell RRAM
Author	*Junfei Liu (University of Rochester/University of California, San Diego, USA), Anson Kahng (University of Rochester, USA)
Page	pp. 1202 - 1208
Keyword	Resistive RAM (RRAM), Level allocation, Multiple-bits-per-cell, Design automation, Nonvolatile memory
Abstract	Advances in resistive random access memory (RRAM) technologies allow for multiple-bits-per-cell (MBPC) data storage. A central tool in MBPC data storage is a level allocation algorithm that maps resistance ranges to bit combinations. The best-performing algorithm in the literature is percentile-based allocation (PBA), which drastically improves on earlier parameterized approaches like sigma-based allocation (SBA). We demonstrate that PBA's level allocation subroutine can produce arbitrarily poor approximations of the number of levels possible at a given error threshold and propose flexible percentile-based allocation (FPBA), which is provably optimal. Additionally, we propose two heuristic interventions---finding all possible level allocations at a given error threshold and exhaustively searching over level refinements---to further reduce the bit error rate (BER) produced at the end of PBA. Our interventions result in 2.8%-32.4% lower BER and 3.1%-15.6% lower error-correcting code (ECC) storage overhead than PBA on 3- and 4-bits-per-cell (bpc) data storage schemes.

8C-2 (Time: 13:40 - 14:05)

Title	Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs
Author	*Mengyue Xi, Tianyu Guo, Xuanteng Huang, Zejia Lin, Xianwei Zhang (Sun Yat-sen University, China)
Page	pp. 1209 - 1215
Keyword	GPUs, Cache bypassing, Multi-level caches, Compiler, Load interaction
Abstract	Graphics Processing Units (GPUs) are essential for general-purpose applications and are commonly leveraging multi-level caches to alleviate memory access pressure. However, the default cache management may lose opportunities for optimal performance in different applications. Although existing cache bypassing techniques tend to address this challenge, these methods predominantly concentrate on single-level cache, thus restricting their potential for further enhancements. To mitigate this issue, we propose Mpache, a novel software-based mechanism designed to bypass multi-level caches based on the characterization of load instructions. Mpache constructs an interaction graph and analyzes the cooperation and contention among instructions. Then, the profiling data of bypassing effectiveness guides Mpache to select the appropriate cache levels to bypass for each instruction. Finally, the design is integrated into the compiler to enable automatic bypassing for existing workloads. Evaluations on off-the-shelf GPUs show that Mpache achieves an average 1.15× speedup over the default cache policy, and effectively outperforms prior arts.

8C-3 (Time: 14:05 - 14:30)

Title	A Novel Mixed-Signal Flash-based Finite Impulse Response (FFIR) Filter for IoT Applications
Author	Cheng-Yen Lee, *Sunil P. Khatri (Texas A&M University, USA), Ali Ghrayeb (Texas A&M University at Qatar, Qatar)
Page	pp. 1216 - 1222
Keyword	Processing In-Memory, Flash Transistors, Analog Computation, Flash-based FIR Filter, Internet of Things
Abstract	In this paper, we present a novel mixed-signal flash-based Finite Impulse Response (FFIR) filter architecture for IoT applications. The FFIR filter is scalable in that it can implement any filter with up to a provisioned maximum number of taps. Our FFIR filter utilizes flash transistors, a type of non-volatile memory (NVM) device, to perform analog computations in the current domain, achieving low power, energy, and area requirements. This design is well-suited for the Internet of Things (IoT) applications and other scenarios where resources are highly constrained. Our FFIR filter consists of several Flash-based Coefficient Multipliers (FCMs). The FIR coefficients of each FCM are stored in its constituent flash transistors, with the threshold voltage (Vt) of the flash transistors serving as a proxy for the filter coefficients. Furthermore, the impact of process or voltage variations is mitigated by precisely tuning the Vt of the flash transistors. The tuning of the Vt of the flash transistors can be performed either in the factory by the manufacturer (to negate process variations), or by the user in the field (to negate voltage variations or aging effects). We evaluate the tolerance of FFIR filters to manufacturing variations through Monte Carlo analysis, demonstrating robustness to process and VDD variations. Our FFIR design achieves a significant improvement over previous approaches. Compared to Digital FIR (DFIR) filters operating at the fastest frequency, we reduce the average of power, energy, and area by 4.05x, 1.95x, and 6.06x, while achieving an average peak signal-to-noise ratio (PSNR) of 38.04 dB and an average effective number of bits (ENOB) of 8.87 bits. In addition, we compare our FFIR filter with state-of-the-art Analog FIR (AFIR) filters as well. Our designs demonstrate significantly improved performance of at least 1.3x, 5.3x, and 18.5x in terms of energy per tap, area, and latency, respectively, when compared with the best among 4 recently published AFIR works.

8C-4 (Time: 14:30 - 14:55)

Title	TRIFP-DCIM: A Toggle-Rate-Immune Floating-point Digital Compute-in-Memory Design with Adaptive-Asymmetric Compute-Tree
Author	Xing Wang, Tianhui Jiao, Shaochen Li, Yuchen Ma, *Zhican Zhang, Zhichao Liu, Xi Chen, Xin Si (Southeast University, China)
Page	pp. 1223 - 1229
Keyword	floating-point computing-in-memory, digital compute tree, toggle rate, model sparsification
Abstract	Floating-point compute-in-memory (FP-CIM) is regarded as an attractive approach to enhancing the energy efficiency of complex neural networks. Digital domain compute mechanism has been widely utilized in CIM designs owing to its high robustness to PVT variations. However, the energy consumption of digital CIM is significantly influenced by the toggle rate of the compute-tree. This work proposes a toggle rate-immune floating-point digital CIM (TRIFP-DCIM) design with 34.03% compute energy reduction on average. Combined with the TRIFP-DCIM design, a toggle-rate gathering method is employed in the neural network training/inference process with almost no accuracy loss. Experiment results show that the TRIFP-DCIM can achieve 14.51-36.83 TFLOPS/W @BF16 in 28nm technology process.

[To Session Table]

Session 8D (T9-4) Reliability in Physical Design
Time: 13:15 - 14:55, Thursday, January 23, 2025
Location: Room Mars/Mercury
Chair: Wenjian Yu (Tsinghua University, China)

8D-1 (Time: 13:15 - 13:40)

Title	Revisit MBFF: Efficient Early-Stage Multi-bit Flip-Flops Clustering with Physical and Timing Awareness
Author	*Yichen Cai, Linyu Zhu (Shanghai Jiao Tong University, China), Xinfei Guo (Shanghai Jiao Tong University/State Key Laboratory of Integrated Chips and Systems (SKLICS), China)
Page	pp. 1230 - 1236
Keyword	Low power, Multi-bit flip-flops, Clock Skew, Quality of results
Abstract	Despite the maturity of Multi-bit Flip-Flops (MBFF) clustering in modern Electronic Design Automation (EDA) tools for saving power, there remains a trade-off between the flexibility to cluster flip-flops and the overall quality of results (QoR). This paper proposes a novel approach to MBFF clustering, integrating early-stage physical and timing awareness to optimize the design quality. Our pre-placement MBFF clustering algorithm addresses this trade-off by incorporating early distance estimation and predicted skews, improving timing conditions without compromising power savings. We evaluate our approach using widely-used benchmark circuits, demonstrating significant improvements in power savings and timing compared to state-of-the-art techniques. Notably, our method achieves an average improvement of 22.5% in Worst Negative Slack (WNS) and 33.91% in Total Negative Slack (TNS), while reducing power by 3.01% compared to commercial tools with MBFF clustering enabled at placement. Against the state-of-the-art pre-placement MBFF clustering algorithm, our methodology shows 25.59% and 37.97% improvements in WNS and TNS, respectively, while further reducing power by 3.08%. In addition, the proposed approach proves robust against variations in early-stage path delay estimation, maintaining superior performance even with deviations of over 20%.

8D-2 (Time: 13:40 - 14:05)

Title	Pin Access-aware Multiple Via Pillar Co-Design for Routability Optimization
Author	*Man-Ling Hong, Ying-Jie Jiang, Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)
Page	pp. 1237 - 1242
Keyword	Via pillar, Routability optimization
Abstract	As technology nodes advance and feature sizes shrink, wire resistance grows significantly, resulting in a substantial rise in circuit delays and reliability risks. Via pillars made up of multiple layers of parallel metals and vias can mitigate delay and electromigration by offering lower resistance and current density. However, it has become more and more challenging due to the increasing demand for inserting via pillars and the increasing density of power and ground (P/G) stripes in the lower layers. Moreover, an improperly designed via pillar structure can also block access to adjacent pins and worsen routability. In this paper, we propose comprehensive via pillar design strategies simultaneously considering flexible pin base selection, obstacle avoidance, and pin accessibility optimization. In addition, due to the high via pillar insertion rates advanced nodes target, we propose a congestion-aware via pillar design flow for high-density and closely positioned via pillars. Experimental results show that compared with a state-of-the-art work, our flow eliminates the design rule violations (DRVs) by 99% after via pillar insertion. In addition, by considering the pin accessibility of neighboring pins, our work also reduces over 98% DRVs after detailed routing.

8D-3 (Time: 14:05 - 14:30)

Title	ResCap: Fast-yet-Accurate Capacitance Extraction for Standard Cell Design by Physics-Guided Machine Learning
Author	Jiun-Cheng Tsai, *Hsuan-Ming Huang, Wei-Min Hsu, Pei-Ting Lee, Jen-Hang Yang, Heng-Liang Huang (MediaTek, Taiwan), Yen-Ju Su, Charles H. -P. Wen (NYCU, Taiwan)
Page	pp. 1243 - 1250
Keyword	Parasitic capacitance extraction, physics-guided machine learning, deep neural-networks, standard cell PPA
Abstract	In the field of VLSI design, accurate capacitance extraction is essential for ensuring optimal performance of integrated circuits, especially in standard cell designs. Conventional techniques, such as the 2.5D model and 3D field solver, either suffer from inaccuracies or are computationally intensive. To address these challenges, we present ResCap, an innovative approach that synergizes physics-guided linear models with advanced machine learning techniques. Rather than directly predicting the target capacitance, our method starts by employing physical principles to estimate the initial capacitance value, ensuring that predictions are grounded in well-established physical laws. Subsequently, machine learning is applied to predict residual values, thereby refining the initial estimates. This approach not only enhances accuracy and generalization but also reduces dependency on extensive training datasets. Experimental results demonstrate that ResCap significantly outperforms conventional methods on industrial standard cell designs under 4nm process technology, achieving high accuracy with an average error of 0.06% in delay and 0.16% in power. Notably, ResCap exhibits no outliers (error > 1%), whereas the conventional 2.5D extraction tool shows significant outliers of 10.95% in delay and 50.4% in power. Furthermore, our framework demonstrates remarkable efficiency, reducing extraction time by 215x compared to field solvers.

8D-4 (Time: 14:30 - 14:55)

Title	CPONoC: Critical Path-aware Physical Implementation for Optical Networks-on-Chip
Author	Yan-Ting Chen (National Taiwan University of Science and Technology, Taiwan), Zhidan Zheng (Technical University of Munich, Germany), *Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan), Tsun-Ming Tseng, Ulf Schlichtmann (Technical University of Munich, Germany)
Page	pp. 1251 - 1256
Keyword	Optical Networks-on-Chips, Physical Implementation
Abstract	Optical networks-on-chips (ONoCs), which adopt optical waveguides, microring resonators (MRRs), and the wavelength division multiplexing (WDM) scheme to transmit optical signals, serve as promising solutions for integrating multi- and many-core systems to provide high-bandwidth, low-latency, and low-power on-chip communication. To minimize the insertion loss of a wavelength-routed ONoC (WRONoC) during physical implementation, existing studies either adopt conventional standard cell placement techniques or maximally avoid waveguide crossings; however, all of them ignore the fact that the critical path suffering from the maximum insertion loss dominates the overall power efficiency and system performance. In this work, we propose CPONoC, a critical path-aware physical implementation tool for WRONoCs. Different from existing studies, CPONoC focuses on directly minimizing the insertion loss of the critical path with an iterative and crossing-aware force-directed approach, and it is comparably general and robust such that it can apply to any logic scheme and input configuration. Compared to state-of-the-art design automation tools, CPONoC achieves an average reduction of 9.6% in maximum insertion loss.

[To Session Table]

Session 8E (SS-7) Hardware Authenticity towards a Trustworthy Society
Time: 13:15 - 14:55, Thursday, January 23, 2025
Location: Innovation Hall
Chairs: Jun Shiomi (Osaka University, Japan), Michihiro Shintani (Kyoto Institute of Technology, Japan)

8E-1 (Time: 13:15 - 13:40)

Title	(Invited Paper) Hardware Trojan Detection by Fine-grained Power Domain Partitioning
Author	*Takahiro Ishikawa, Kose Yokooji, Yoshihiro Midoh, Noriyuki Miura (Osaka University, Japan), Michihiro Shintani (Kyoto Institute of Technology, Japan), Jun Shiomi (Osaka University, Japan)
Page	pp. 1257 - 1263
Keyword	Hardware Trojan, Hardware Security
Abstract	Hardware Trojans (HTs) are regarded as a security threat in the information society. HTs are unintentionally injected into LSI circuits by untrusted entities before the chip fabrication. HTs trigger malicious operations such as information leakage without designers noticing their operations. This paper proposes fine-grained power domain partitioning, which is a circuit design technique for detecting HT activities. This paper assumes a scenario where a fab injects HTs into the taped-out layout data and that circuit designers test the presence of HTs by using side-channel information (power consumption) of fabricated chips. Fine-grained power domain partitioning decomposes the power domain of target circuit into multiple power domains, enabling to effectively measure the small power consumption introduced by activities of tiny HTs. The measurement result using an HT-injected Advanced Encryption Standard (AES) circuit with fine-grained power domain partitioning shows that HTs can be detected.

8E-2 (Time: 13:40 - 14:05)

Title	(Invited Paper) Cryo-HT: Hardware Trojan Activated at Cryogenic Temperatures
Author	*Ayano Takaya, Ryuichi Nakajima (Kyoto Institute of Technology, Japan), Jun Shiomi (Osaka University, Japan), Michihiro Shintani (Kyoto Institute of Technology, Japan)
Page	pp. 1264 - 1269
Keyword	Hardware security, Hardware Trojan, Hardware Trojan attack, Cryo-CMOS, Side-channel analysis
Abstract	It is well-known that operating transistors in low-temperature environments improves their characteristics. This has led to the expansion of integrated circuit (IC) applications at low temperatures, including high-performance computers and controllers for quantum computers. On the other hand, hardware Trojans (HTs) pose a severe concern, threatening the authenticity of ICs. This study proposes a new HT circuit, referred to as Cryo-HT, that can be activated by the increased discharge time of a metal-oxide-semiconductor (MOS) capacitor at low temperatures. We fabricated an advanced encryption standard (AES) circuit with Cryo-HT using 180 nm process technology. We demonstrate that Cryo-HT is not activated at room temperature but activates at 213 K to reveal the AES encryption key.

8E-3 (Time: 14:05 - 14:30)

Title	(Invited Paper) Current Consumption Model for More Efficient Side-channel Tolerant Design at FPGA Design Stage
Author	*Daisuke Fujimoto, Yuichi Hayashi (Nara Institute of Science and Technology, Japan)
Page	pp. 1270 - 1274
Keyword	Side-channel attack, Simulation
Abstract	Side-channel attacks, which estimate the internal secret key by analyzing the power supply voltage fluctuation generated by the current consumption of the encryption circuit, represent a realistic threat. Countermeasures against side-channel attacks, which can be implemented with inexpensive setups, are essential to ensure security. FPGAs are reconfigurable. Therefore, side-channel attacks can be performed and evaluated after implementation. However, it is difficult to identify vulnerable areas from the entire circuit, so it is important to identify vulnerable areas using simulation as in the case of ASICs. On the other hand, power consumption models provided by vendors are based on switching rates and are not accurate enough for side-channel evaluation. In this paper, we propose a measurement-based method to model look-up table (LUT) element’s power consumption. Specifically, the number of circuit elements such as control circuits is fixed, and the power consumption per LUT is estimated from the difference in power consumption of circuits with different numbers of LUT operations. The cost of creating a power consumption model for the entire circuit will also be discussed.

8E-4 (Time: 14:30 - 14:55)

Title	(Invited Paper) White-box logic obfuscation: A Transparent Solution to Hardware Piracy and Reverse Engineering
Author	Leon Li, *Alex Orailoglu (University of California, San Diego, USA)
Page	pp. 1275 - 1281
Keyword	logic obfuscation, reverse engineering
Abstract	We propose an obfuscation technique designed to thwart behavioral reverse engineering of finite state machines, even when attackers possess the details of a functional netlist. The solution employs self-generated and mutating keys in a locking framework to hinder an attacker's ability to learn functionality beyond what sequential queries provide.

[To Session Table]

Session 8F (DF-3) Extending the Limits of Classical Computers using Emerging Device and Circuit Technology
Time: 13:15 - 14:30, Thursday, January 23, 2025
Organizer: Chihiro Yoshimura (Hitachi, Ltd., Japan), Chair: Takatsugu Ono (Kyushu University, Japan)

8F-1 (Time: 13:15 - 13:40)

Title	(Designers' Forum) Amorphica: Fully Connected Metamorphic Annealing Processor with Programmable Optimization Strategy
Author	*Kazushi Kawamura (Institute of Science Tokyo, Japan), Jaehoon Yu (Samsung Advanced Institute of Technology, Republic of Korea), Daiki Okonogi, Satoru Jimbo, Genta Inoue, Akira Hyodo (Institute of Science Tokyo, Japan), Ángel López García-Arias (NTT Corporation, Japan), Kota Ando, Bruno Hideki Fukushima-Kimura (Hokkaido University, Japan), Ryota Yasudo (Kyoto University, Japan), Thiem Van Chu, Masato Motomura (Institute of Science Tokyo, Japan)
Abstract	Annealing processors (or Ising processors) are domain-specific computers utilized as solvers for combinatorial optimization problems. In ISSCC 2023, we have presented a novel annealing processor named Amorphica. Amorphica features adaptability to a wide range of problems by integrating multiple annealing strategies into a chip. The data-path circuit includes an innovative selector that can freely switch from one annealing policy to another. The chip has been fabricated with TSMC 40nm technology on a 3mm x 3mm die. The experimental results have demonstrated that the system consisting of multiple Amorphica chips achieves up to 30,000 times energy efficiency compared to a GPU.

8F-2 (Time: 13:40 - 14:05)

Title	(Designers' Forum) Nanophotonic Devices toward Opto-Electronic Accelerator
Author	*Akihiko Shinya (NTT Basic Research Laboratories, Japan)
Abstract	Opto-electronic integration technology is expected to bring about a paradigm shift in computation by taking advantage of high-speed, low-loss propagation of light. In my talk, I will introduce our photonic crystal based devices that significantly reduce the energy cost of optical-to-electrical and (O-E) electrical-to-optical (E-O) signal conversion and have the potential to realize a seamless interface between CMOS and photonic layers. In addition, I will show our recent progress of 16×16 on-chip analog vector-matrix multiplier (VMM) based on silicon photonics, and discuss O-E-O conversion devices with functions such as optical signal control by optical signals, signal amplification, and nonlinear processing between VMM, which are essential for advanced electronic photonics logic circuits.

8F-3 (Time: 14:05 - 14:30)

Title	(Designers' Forum) Highly Energy-Efficient Processing by Controlling Flexibility of Information Carrier using Superconductor Half-Flux Quantum Logic
Author	*Masamitsu Tanaka (Nagoya University, Japan)
Abstract	In this talk, superconductor half-flux-quantum (HFQ) circuits with high power design flexibility will be reported. In HFQ circuits, a superconducting quantum interference device (SQUID) containing a π-shift Josephson junction is used for the switching elements. Because the π-shift assists switching by inducing a spontaneous current, the SQUID is easily switched by a weak driving force depending on the design of its loop inductance. This feature can significantly lower the power consumption even near the thermal limit, compromising the increase in error rates. This research aims to develop a computing technology based on highly energy-efficient, approximate operation by cross-layer co-design.

[To Session Table]

Session 9A (T12-3) Homomorphic Encryption and Cloud Security
Time: 15:20 - 17:30, Thursday, January 23, 2025
Location: Room Saturn
Chairs: Junghee Lee (Korea University, Republic of Korea), Zhiyao Xie (Hong Kong University of Science and Technology, Hong Kong)

9A-1 (Time: 15:20 - 15:45)

Title	Efficient and Secure Cloud-based Split Logic Synthesis
Author	Chaitali Sathe, Yiorgos Makris, *Benjamin Carrion Schafer (The University of Texas at Dallas, USA)
Page	pp. 1282 - 1287
Keyword	Split Synthesis, Clould, security
Abstract	This work introduces a secure split logic synthesis (cloud+local) approach to enable Third Party Intellectual Property (3PIP) vendors that do not have access to expensive state-of-the-art logic synthesis tools to efficiently and securely synthesize their IPs with minimal area and delay overheads. For this, we propose to split the Register Transfer Level (RTL) IP given in Verilog or VHDL such that one part is synthesized on the cloud using a state-of-the-art commercial logic synthesis tool (e.g., Synopsys Design Compiler) while synthesizing locally, on the IP vendor’s side, the missing portion of the design using free logic synthesis tools (e.g., Yosys). This approach allows 3PIPs to leverage the power of commercial logic synthesis tools while protecting their IP from anyone having access to the cloud where the logic synthesis tools is hosted without fearing that the IP will be stolen. Experimental results show that our proposed flow is secure, while leading to negligible area and delay overheads. In particular, the proposed flow has an average area overhead of 0.94% to 1.81% for different types of design implementations and in all cases the original timing constraint is met.

9A-2 (Time: 15:45 - 16:10)

Title	Efficient Key Switching Accelerator for Fully Homomorphic Encryption
Author	Seoyoon Jang, Sungjin Park, *Dongsuk Jeon (Seoul National University, Republic of Korea)
Page	pp. 1288 - 1294
Keyword	Cryptographic Accelerators, Fully Homomorphic Encryption, ASIC, Privacy-preserving, attice-based Cryptography
Abstract	Fully Homomorphic Encryption (FHE) is a cryptosystem of the newest generation that allows limitless computation on encrypted data, enabling privacy-preserving computing in cloud services. While the Key Switching (KS) operation is the key bottleneck in FHE and has its unique, complex data access patterns, little attention has been given to maximizing efficiency for this particular key operation through dedicated hardware. This paper presents an efficient KS accelerator for FHE employing various design techniques on multiple levels. The proposed techniques lead to (i) an efficient modular multiplier showing 47.8% and 46.0% reduction each in area and power compared with naive Barrett modular multiplier, lightweight NTT units with (ii) an efficient twiddle factor generator (TFG) reducing on-chip memory space for twiddle factors in NTT units from O(N) to O(logN) with minimal overhead and (iii) conflict-free addressing scheme to replace dual-port memories into single-port ones without performance degradation, and (iv) bandwidth-efficient core design leading to 38.7% lower external memory access compared with baseline. Implemented in a 28nm CMOS process, the design occupies 5.13x3.71mm² and consumes 5.2 mJ per KS operation, achieving 11.0x acceleration of KS operation compared with the state-of-the-art CPU implementation and 12.7x higher energy efficiency in KS operation compared with state-of-the-art FHE processors or KS accelerators on various platforms.

9A-3 (Time: 16:10 - 16:35)

Title	The Unlikely Hero: Nonidealities in Analog Photonic Neural Networks as Built-in Adversarial Defenders
Author	Haotian Lu, Ziang Yin, Partho Bhoumik, Sanmitra Banerjee, Krishnendu Chakrabarty, *Jiaqi Gu (Arizona State University, USA)
Page	pp. 1295 - 1301
Keyword	Reliability, Photonics, Attack, Unary
Abstract	Electronic-photonic computing systems have emerged as a promising platform for accelerating deep neural network (DNN) workloads. Major efforts have been focused on countering hardware non-idealities and boosting efficiency with various hardware/algorithm co-design methods. However, the adversarial robustness of such photonic analog mixed-signal AI hardware remains unexplored. Though the hardware variations can be mitigated with robustness-driven optimization methods, malicious attacks on the hardware show distinct behaviors from noises, which requires a customized protection method tailored to optical analog hardware. In this work, we rethink the role of conventionally undesired non-idealities in photonic analog accelerators and claim their surprising effects on defending against adversarial weight attacks. Inspired by the protection effects from DNN quantization and pruning, we propose a synergistic defense framework tailored for optical analog hardware that proactively protects sensitive weights via pre-attack unary weight encoding and post-attack vulnerability-aware weight locking. Efficiency-reliability trade-offs are formulated as constrained optimization problems and efficiently solved offline without model re-training costs. Extensive evaluation of various DNN benchmarks with a multi-core photonic accelerator shows that our framework maintains near-ideal on-chip inference accuracy under adversarial bit-flip attacks with merely <3% memory overhead.

9A-4 (Time: 16:35 - 17:00)

Title	Low Multiplicative Depth Polynomial Evaluation Architectures for Homomorphic Encrypted Data
Author	*Jianfei Wang, Jia Hou, Fahong Zhang, Yishuo Meng (Xi’an Jiaotong University, China), Yang Su (Engineering University of People’s Armed Police, China), Chen Yang (Xi’an Jiaotong University, China)
Page	pp. 1302 - 1307
Keyword	Polynomial evaluation, multiplicative depth, homomorphic encryption, hardware architecture, activation function
Abstract	In order to reduce the multiplicative depth required by high-order polynomial evaluation for homomorphic encrypted data, we propose two novel and low multiplicative depth polynomial evaluation algorithms: an x⁴-step nesting algorithm based on parity extraction (x⁴-NAPE) and a parallel algorithm with high data utilization (PAHU). Compared with the conventional Schoolbook method and Horner’s rule, x⁴-NAPE can reduce about 50% of the ciphertext-ciphertext multiplication (CCM) and 75% of the multiplicative depth in the ideal limiting case. While PAHU does not reduce CCM, it can achieve a logarithmic growth of multiplicative depth, with the increase of polynomial length, the growth rate is much slower than x⁴-NAPE, Schoolbook method, and Horner’s rule. Moreover, the hardware architectures of the above polynomial evaluation algorithms for homomorphic encrypted data are investigated and proposed. The proposed hardware architectures are assessed on the FPGA-based reconfigurable hardware platform for FHE named ReMCA. The assessment results demonstrate that under a fixed upper limit of multiplicative depth, our proposed architectures of x⁴-NAPE and PAHU support 2.67× and 14.3× the range of polynomial lengths of Schoolbook and Horner, respectively. For polynomial evaluation of the same length, compared with architectures of Schoolbook and Horner, our proposed architectures of x⁴-NAPE and PAHU can achieve up to 1.13× improvement in execution time, up to 50% reduction in multiplicative depth, and up to 2.39× improvement in the depth-time product.

9A-5 (Time: 17:00 - 17:25)

Title	PRICING: Privacy-Preserving Circuit Data Sharing Framework for Lithographic Hotspot Detection
Author	Chen-Chia Chang (Duke University, USA), Wan-Hsuan Lin (UCLA, USA), Jingyu Pan, Guanglei Zhou (Duke University, USA), Zhiyao Xie (HKUST, Hong Kong), Jiang Hu (TAMU, USA), *Yiran Chen (Duke University, USA)
Page	pp. 1308 - 1313
Keyword	EDA, AI/ML, Security, Physical Design
Abstract	Machine learning (ML) techniques have demonstrated potential in electronic design automation (EDA). Models need to be trained on diverse datasets to exhibit abundant reliability and generalization capabilities, especially when applied to modern circuits. However, data availability remains a severe issue as circuit data is typically kept confidential within each data provider due to the difficulty of secure data sharing. This problem has impeded the development of ML for EDA in both industry and academia and has never been well addressed. To facilitate model development, enabling secure data sharing among various data providers is essential. To this end, we propose PRICING, a privacy-preserving circuit data sharing framework. This is the first work to (1) investigate the secure data sharing problem in EDA and (2) generate protected circuit features that hide important circuit information while preserving sufficient information for EDA applications. Our results demonstrate that our approach successfully protects raw circuit features, providing 55\% superior protection over existing state-of-the-art techniques in computer vision. Moreover, models trained with our protected data achieve up to 48\% higher accuracy than models trained with limited raw data. This shows the effectiveness of PRICING in enhancing model development for EDA.

[To Session Table]

Session 9B (T11-2) Advanced Modeling, Simulation, and Verification
Time: 15:20 - 17:30, Thursday, January 23, 2025
Location: Room Uranus
Chair: Yutaka Masuda (Nagoya University, Japan)

9B-1 (Time: 15:20 - 15:45)

Title	Efficient Hypergraph Modeling of VLSI Circuits for the MFS-Based Emulation and Simulation Acceleration
Author	*Jiahao Xu, Chunyan Pei, Shengbo Tong, Wenjian Yu (Tsinghua University, China)
Page	pp. 1314 - 1320
Keyword	Hypergraph Partitioning, Multi-FPGA System, Hypergraph Modeling, Logic emulation
Abstract	As the scale of integrated circuit (IC) design continues to expand, the multi-FPGA system (MFS) is widely employed for logic emulation and simulation acceleration which ensures the functional correctness of logic circuits. During this process, circuit partitioning becomes a dispensable step. In this work, we address the hypergraph modeling techniques for the MFS-orientated circuit partitioning. Firstly, an efficient adaptive flattening algorithm considering multi-dimensional resource constraints and based on dynamic programming (DP) is proposed. Then, a parallel algorithm for clock modeling is proposed. With them, an efficient and practical tool of hypergraph modeling is developed. Experiments on industrial benchmarks with up to sixty million cells have validated the efficiency and correctness of the proposed techniques. The results also demonstrate the great benefit of the adaptive flattening to the subsequent hypergraph partitioning, and the significant acceleration effects of the proposed DP-based adaptive flattening and the parallel clock modeling algorithms.

Best Paper Candidate
9B-2 (Time: 15:45 - 16:10)

Title	ETPG: Efficient Transition Fault Simulation via Dual-Strategy Pattern Parallelism and Gate Restructuring
Author	*Mingjun Wang (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/CASTEST Co., Ltd., China), Hui Wang (CASTEST Co., Ltd., China), Zizhen Liu (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/CASTEST Co., Ltd., China), Feng Gu (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/CASTEST Co., Ltd., China), Jianan Mu (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/CASTEST Co., Ltd., China), Jiaping Tang (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/CASTEST Co., Ltd., China), Jun Gao (CASTEST Co., Ltd., China), Huawei Li, Jing Ye, Xiaowei Li (State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences/CASTEST Co., Ltd., China)
Page	pp. 1321 - 1327
Keyword	Transition Fault Simulation, Multi-Dimensional Optimization, Adaptive Parallelism, Gate Sorting Optimization
Abstract	With the advancement of integrated circuit technology, the sensitivity to delay faults has significantly increased, rendering Transition Fault (TF) testing crucial for ensuring chip quality. However, existing TF testing methodologies face several challenges: inadequate detection of multi-cycle faults, inflexibility in parallelization algorithms, and inefficient memory access for large-scale circuits. This paper introduces ETPG (Efficient Transition fault simulation via dual-strategy Pattern parallelism and Gate restructuring), a novel TF simulation algorithm based on multi-dimensional optimization. The key innovations include: an adaptive dual-strategy pattern parallel strategy that dynamically optimizes parallelization based on test pattern characteristics, enhancing efficiency and multi-cycle fault detection capability; a dual-dimension gate restructuring method that optimizes memory storage order, significantly reducing memory access time, particularly beneficial for large-scale circuits; and a collaborative mechanism between pattern processing and circuit storage optimization, achieving comprehensive performance improvements at both algorithmic and memory access levels. Experimental results demonstrate ETPG’s significant performance improvements across various circuit scales. For circuits below and above 100k gates, ETPG achieves average speedups of 2.846× and 4.428× respectively compared to TMAX. The algorithm’s dual-strategy parallelization technique yields machine word reductions of up to 57.82% for multi-cycle patterns, enhancing both simulation efficiency and fault detection capability.

9B-3 (Time: 16:10 - 16:35)

Title	DEMOTIC: A Differentiable Sampler for Multi-Level Digital Circuits
Author	*Arash Ardakani, Minwoo Kang, Kevin He (University of California, Berkeley, United States Minor Outlying Islands), Qijing Huang (NVIDIA Research, United States Minor Outlying Islands), Vighnesh Iyer, Suhong Moon, John Wawrzynek (University of California, Berkeley, United States Minor Outlying Islands)
Page	pp. 1328 - 1335
Keyword	Circuit Satisfiability, Gradient Descent, Multi-level Circuits, Verification, and Testing
Abstract	Efficient sampling of satisfying formulas for circuit satisfiability (CircuitSAT), a well-known NP-complete problem, is essential in modern front-end applications for thorough testing and verification of digital circuits. Generating such samples is a hard computational problem due to the inherent complexity of digital circuits, size of the search space, and resource constraints involved in the process. Addressing these challenges has prompted the development of specialized algorithms that heavily rely on heuristics. However, these heuristic-based approaches frequently encounter scalability issues when tasked with sampling from a larger number of solutions, primarily due to their sequential nature. Different from such heuristic algorithms, we propose a novel differentiable sampler for multi-level digital circuits, called DEMOTIC, that utilizes gradient descent (GD) to solve the CircuitSAT problem and obtain a wide range of valid and distinct solutions. DEMOTIC leverages the circuit structure of the problem instance to learn valid solutions using GD by re-framing the CircuitSAT problem as a supervised multi-output regression task. This differentiable approach allows bit-wise operations to be performed independently on each element of a tensor, enabling parallel execution of learning operations, and accordingly, GPU-accelerated sampling with significant runtime improvements compared to state-of-the-art heuristic samplers. We demonstrate the superior runtime performance of DEMOTIC in the sampling task across various CircuitSAT instances from the ISCAS-85 benchmark suite. Specifically, DEMOTIC outperforms the state-of-the-art sampler by more than two orders of magnitude in most cases.

9B-4 (Time: 16:35 - 17:00)

Title	Corvus: Efficient HW/SW Co-Verification Framework for RISC-V Instruction Extensions with FPGA Acceleration
Author	*Zijian Jiang (Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Keran Zheng (Imperial College London, UK), David Boland (The University of Sydney, Australia), Yungang Bao, Kan Shi (Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China)
Page	pp. 1336 - 1342
Keyword	Verification, RISC-V, FPGA
Abstract	The RISC-V instruction set architecture (ISA) offers flexibility for domain-specific custom instruction extensions. While the basic RISC-V ISA contains common instructions, the extended accelerators provide additional computing power to meet diverse needs. High-level synthesis (HLS) is often used to agilely create custom extension accelerators, allowing engineers to design complex digital circuits using high-level languages such as C/C++, further improving development efficiency. However, verifying a design that includes RISC-V cores and custom extensions is rarely studied and can be challenging. Traditional approaches for verifying HLS-generated designs use C-RTL co-simulation, primarily focusing on the unit level. This method can be extremely time-consuming and often makes impractical assumptions about interactions between HLS-generated circuits and the processor. Therefore, system-level verification is essential to exercise the RISC-V cores, the custom extensions, and their interconnections extensively. To efficiently verify a RISC-V processor design with custom instruction extensions, we propose Corvus: a novel co-verification framework that combines the benefits of the high-level abstraction of C/C++ simulation and cycle-accurate modeling of C-RTL co-simulations. \textsc{Corvus} provides hardware wrapper templates that efficiently connect HLS-generated accelerators and the RISC-V core, and a tool flow that automatically translates HLS unit-level tests into system-level stimulus. Corvus maps the core and the accelerators, along with their corresponding C/C++ software models, onto the same FPGA with hardened processors, allowing them to run simultaneously whilst checking both results on-the-fly with dedicated hardware monitors and checkers. If a mismatch is detected, we capture a snapshot of the accelerator hardware and reconstruct the simulation in external software simulators for detailed debugging. When potential issues are fixed, the accelerator region can be partially reconfigured without recompiling the entire design, further improving the design-verification efficiency. Results show a significant performance improvement over conventional approaches, from 4423x to 16626x.

9B-5 (Time: 17:00 - 17:25)

Title	SISCO: Selective Invariant Sharing, Clustering and Ordering for Effective Multi-Property Formal Verification
Author	*Sourav Das, Aritra Hazra (Indian Institute of Technology Kharagpur, India), Pallab Dasgupta, Himanshu Jain, Sudipta Kundu (Synopsys Inc., USA)
Page	pp. 1343 - 1349
Keyword	Property Directed Reachability, Clustering, Ordering, Multi-property Verification, Selective Invariant Sharing
Abstract	Multi-property formal verification remains a significant challenge in the chip design industry. With hundreds of property goals to verify in a design, several questions arise regarding goal ordering and grouping of properties, sharing of information across proven properties, and other heuristics to improve the verification productivity. This paper introduces SISCO, a novel method for addressing multi-property verification in complex designs. SISCO unfolds properties to a certain depth, clusters them, reorders properties within each cluster, and then uses a modified version of IC3/Property Directed Reachability (PDR) algorithm for efficient verification. The method stores the invariants of proven properties and selectively shares them while solving undecided goals. Additionally, SISCO keeps track of counter-example traces of falsified goals to assist verification. Experimental results demonstrate that clustering and ordering help solve more goals while selective invariant sharing accelerates the process. SISCO achieves a significant average improvement of 4.73x in the runtime of individual goals compared to all invariant sharing.

[To Session Table]

Session 9C (T13-2) Carbon, Light, Fluids: Emerging Technologies
Time: 15:20 - 17:30, Thursday, January 23, 2025
Location: Room Venus
Chairs: Krishnendu Chakrabarty (Arizona State University, USA), Yangdi Lyu (HKUST (GZ), China)

9C-1 (Time: 15:20 - 15:45)

Title	CACTI-CNFET: an Analytical Tool for Timing, Power, and Area of SRAMs with Carbon Nanotube Field Effect Transistors
Author	*Shinobu Miwa, Eiichiro Sekikawa, Tongxin Yang (The University of Electro-Communications, Japan), Ryota Shioya (The University of Tokyo, Japan), Hayato Yamaki, Hiroki Honda (The University of Electro-Communications, Japan)
Page	pp. 1350 - 1356
Keyword	CNFET, SRAM, timing estimation, power estimation, area estimation
Abstract	In this paper, we propose CACTI-CNFET, the first architecture-level analytical tool for SRAMs designed with CNFETs. Similar to CACTI, a widely used tool for analyzing silicon-based SRAMs, CACTI-CNFET allows us to quickly estimate the timing, power, and area of CNFET SRAMs only with architecture-level descriptions such as memory capacity and port count, thus helping us to make a quick decision on the architecture of various SRAMs used in a targeted processor. Because CACTI-CNFET is implemented on top of CACTI and published as an open source project, all CACTI users can use and extend this new functionality freely. Our experimental results show that CACTI-CNFET can estimate the leakage power, dynamic energy consumption for write and read operations, and circuit delay of various CNFET SRAMs with errors of 7.40%, 7.08%, 2.36%, and 8.33%, respectively. Moreover, our experimental results show that CACTI-CNFET can estimate the area of a CNFET SRAM with an error of 10.25%.

9C-2 (Time: 15:45 - 16:10)

Title	3M-DeSyn: Design Synthesis for Multi-Layer 3D-Printed Microfluidics with Timing and Volumetric Control
Author	*Yushen Zhang, Dragan Rašeta, Tsun-Ming Tseng, Ulf Schlichtmann (Technical University of Munich, Germany)
Page	pp. 1357 - 1363
Keyword	Microfluidics, 3D Printing, Design Synthesis, Timing Control, Volume Control
Abstract	3D printing has revolutionized microfluidic device fabrication, enabling rapid prototyping and intricate geometries. However, designing lab-on-a-chip systems remains challenging. In microfluidic devices, precise control over fluid behavior is crucial, requiring careful attention to both timing and volume. Current state-of-the-art design automation tools for microfluidics have limitations, particularly in addressing the specific challenges of 3D-printed microfluidics and user-defined timing and volumetric constraints, and no design synthesis tool exists targeting these domains. We present 3M-DeSyn, a novel design synthesis method for 3D-printed microfluidics that incorporates timing and volumetric constraints and outputs print-ready 3D modeling files. It automates the design process, allowing users to specify schematics and desired flow control parameters. The underlying methodology is based on mathematical modeling of fluidic behavior and utilizes constraint optimization programming to find optimized solutions. Experimental results show significant improvements in design time while enabling rapid development of custom microfluidic systems.

9C-3 (Time: 16:10 - 16:35)

Title	Dynamic Topology-Aware Flow Path Construction and Scheduling Optimization for Multilayered Continuous-Flow Microfluidic Biochips
Author	*Meng Lian, Shucheng Yang, Mengchu Li, Tsun-Ming Tseng, Ulf Schlichtmann (Technical University of Munich, Germany)
Page	pp. 1364 - 1371
Keyword	Multilayered continuous-flow microfluidic biochip, High-level synthesis, Quadratic programming
Abstract	Multilayered continuous-flow microfluidic biochips are highly valued for their miniaturization and high bio-application throughput. However, challenges arise as the dynamic connections of channels, adjusted to satisfy varying demands of fluid transportation at different moments, complicate the execution of bioassays. The existing methods often focus on device binding and operation scheduling during high-level synthesis but overlook the topological connections within the microfluidic network. This oversight leads to mismanagement of conflicts between fluid transportations and erroneous assumptions about constant flow velocities, resulting in decreased accuracy and efficiency or even infeasibility of bioassay execution. To address this problem, we mathematically model the flow velocity that varies according to the dynamic changes of the topological connections between the on-chip components during the execution of the bioassay. Further integrating the flow velocity model into the high-level synthesis, we propose a quadratic programming (QP) method that constructs flow paths and optimizes scheduling schemes to minimize the bioassay completion time. Experimental results confirm that, compared with the state-of-the-art approach, our method shortened the bioassay completion time by an average of 40.9%.

9C-4 (Time: 16:35 - 17:00)

Title	A Backup Resource Customization and Allocation Method for Wavelength-Routed Optical Networks-on-Chip Topologies
Author	*Zhidan Zheng, You-Jen Chang, Liaoyuan Cheng, Tsun-Ming Tseng, Ulf Schlichtmann (Technical University of Munich, Germany)
Page	pp. 1372 - 1378
Keyword	Wavelength-routed optical networks-on-chip, fault-tolerant design, backup resource allocation, fault model
Abstract	Wavelength-routed networks-on-chip (WRONoCs) are known for providing high-speed and low-power communication. Despite those advantages, the key components, microring resonators (MRRs), are prone to process thermal variation, which causes signals to fail to reach their intended destinations. Thus, several WRONoC fault-tolerant methods propose to prepare a constant number of backups, which often leads to inefficient resource allocation, i.e. insufficient backups for the signals that are prone to errors, while more than enough backups for the signals that are barely affected, resulting in much power waste. In this work, we propose a dynamical backup resource allocation methodology for reliability maximization and power minimization in WRONoCs. Precisely, our methodology starts with accurately modeling the WRONoC faults, which considers the deviation of an MRR's default behavior as a Gaussian Distribution. Considering that signal paths are different the number of MRRs and the signals have different probabilities of deviating from their designated paths, our methodology customizes the number of backup paths for every signal and automatically allocates the minimum resources to optimize the reliability.

9C-5 (Time: 17:00 - 17:25)

Title	GSNorm: An Efficient 3D Gaussian Rendering Accelerator with Splat Normalization and LUT-assist Rasterization
Author	*Yiyang Sun, Peiran Yan, Yiqi Jing, Le Ye, Tianyu Jia (Peking University, China)
Page	pp. 1379 - 1385
Keyword	3D Gaussian Splatting, Real-time Rendering, Edge Computing
Abstract	3D Gaussian Splatting recently emerged as the new SOTA approach for many computer graphic tasks. While Gaussian Splatting has demonstrated impressive rendering quality and performance on GPUs, real-time GS rendering on edge devices is still challenging. We identified the unbalanced Rendering pipeline and the uneven Gaussian distribution as the main obstacles to efficient rendering. To address these problems, we present GSNorm, a rendering accelerator with an online quantization preprocessor for per-gaussian coordinate transformation to normalize Gaussian footprints for pixel-wise calculation reduction. A LUT-based quantized rendering design is also presented to break the pipeline data dependency. Furthermore, a depth-guided cluster-sorting unit is incorporated to improve Gaussian sorting efficiency. GSNorm accelerator is implemented and evaluated in TSMC 22 nm technology with several real-world scenes, providing significant rendering efficiency and performance improvements for real-time applications.

[To Session Table]

Session 9D (T10-2) Advanced Techniques for Power Optimization and IR Prediction
Time: 15:20 - 17:30, Thursday, January 23, 2025
Location: Room Mars/Mercury
Chairs: Bei Yu (The Chinese University of Hong Kong, Hong Kong), Yu-Guang Chen (National Central University, Taiwan), Hongce Zhang (Hong Kong University of Science and Technology (Guangzhou), China)

9D-1 (Time: 15:20 - 15:45)

Title	Via Fabrication with Multi-Row Guiding Templates Using Lamellar DSA
Author	*Yun-Na Tsai, Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)
Page	pp. 1386 - 1391
Keyword	Directed self-assembly, Guiding template design
Abstract	Directed self-assembly (DSA) using block copolymers (BCP) has become a very promising technique for the fabrication of via layers in integrated circuits with the dramatic shrink of feature sizes and great increase in circuit complexity. Since the cylindrical DSA suffers from the drawbacks of fixed via pitch in a single template and great displacement error due to process variation, lamellar DSA in combination with the self-aligned via (SAV) process becomes an alternative that may lead to better manufacturability. Many studies have investigated design methodologies for via/contact layer fabrication with cylindrical block copolymers, but there is only one existing work focusing on the guiding template design problem for lamellar DSA. However, this work only considers one-dimensional lamellar guiding templates, while adopting two-dimensional templates can resolve more template conflicts. This paper presents the first work of multi-row guiding template design for lamellar DSA with SAV technology and multiple patterning lithography (MPL). To tackle the problem, we enumerate all guiding template shapes for a given via layout by considering the design constraints in the target process and design flexibility with dummy vias. Two methods are proposed afterward. The first one is a method based on integer linear programming (ILP), and the second one is a heuristic method. The experimental results show that the optimal solutions can be obtained by solving the ILP formulation, and the heuristic method can obtain near-optimal solutions with much less runtime.

Best Paper Candidate
9D-2 (Time: 15:45 - 16:10)

Title	SMART-GPO: Gate-Level Sensitivity Measurement with Accurate Estimation for Glitch Power Optimization
Author	*Yikang Ouyang, Yuchao Wu, Dongsheng Zuo (The Hong Kong University of Science and Technology (Guangzhou), China), Subhendu Roy (Cadence Design Systems, USA), Tinghuan Chen (The Chinese University of Hong Kong, Shenzhen, China), Zhiyao Xie (The Hong Kong University of Science and Technology, Hong Kong), Yuzhe Ma (The Hong Kong University of Science and Technology (Guangzhou), China)
Page	pp. 1392 - 1398
Keyword	Glitch, Optimization, Estimation
Abstract	Dynamic power consumption is a significant concern in modern integrated circuits. This issue is primarily caused by signal toggling, including unwanted toggles known as glitches. With the number of operations increasing in circuits, glitches can lead to significant additional dynamic power. This paper presents SMART-GPO, a novel framework that efficiently and accurately estimates and optimizes glitch power. Our approach samples cycles for accurate glitch estimation, followed by gate-sizing to optimize glitch power based on sensitivity measurements. We validated SMART-GPO on the Berkeley Out-of-Order Machine (BOOM) and Rocket SoCs using TSMC N28 technology across six widely adopted benchmarks. It achieves a mean absolute percentage error (MAPE) of 2% on glitch power estimation when running power analysis on only 1% simulation cycles. The optimization results demonstrate that our framework reduces glitch power by more than 9%.

9D-3 (Time: 16:10 - 16:35)

Title	Robust Technology-Transferable Static IR Drop Prediction Based on Image-to-Image Machine Learning
Author	*Chao-Chi Lan, Chuan-Chi Su, Yuan-Hsiang Lu, Yao-Wen Chang (National Taiwan University, Taiwan)
Page	pp. 1399 - 1405
Keyword	IR drop, Power delivery network, Machine learning
Abstract	IR drop analysis in the power delivery network (PDN) is crucial for the signoff of integrated circuit (IC) design. Static IR drop significantly affects the IC reliability. Machine learning (ML) has recently been applied to static IR drop prediction for its high accuracy and efficiency. However, most previous works cannot predict with unseen designs, and none can handle different technologies. These problems lead to long training times and data-gathering difficulties, making ML-based methods impractical in the industry. Therefore, a more applicable methodology for static IR drop predictions is needed. This paper proposes a fast, robust, highly technology-transferable image-to-image ML-based methodology for static IR-drop prediction. To enhance transferability and accuracy, we introduce a new input feature, layerwise maps, which encapsulates the PDN network topology well. We further derive a novel generic ML model for various designs and technologies with different numbers of PDN layers. Experimental results de

9D-4 (Time: 16:35 - 17:00)

Title	T-Fusion: Thermal Prediction of 3D ICs with Multi-fidelity Fusion
Author	Bingrui Zhang (Beihang University, China), *Wei Xing (the University of Sheffield, UK), Xin Zhao, Yuquan Sun (Beihang University, China)
Page	pp. 1406 - 1412
Keyword	Thermal Modeling, Multi-fudelity, Reliability, Simulation, Acceleration
Abstract	In the post-Moore era, three-dimensional integrated circuit (3D-IC) technology is a key direction for continuing to enhance chip performance. However, in the thermal simulation field, existing works either only address two-dimensional temperature fields or require a large number of samples and a long training time to train the model. In order to meet the current demand in chip design for rapid and accurate thermal prediction with limited sample sizes, this paper introduces a multi-fidelity model, T-Fusion, which combines tensor arithmetic and Bayesian autoregression. Leveraging a sparse set of high-fidelity data alongside abundant low-fidelity samples, T-Fusion reliably estimates high-fidelity thermal distribution across the chip. We validate our model on single-core double-layer chips, quad-core triple-layer, and octa-core double-layer chips respectively. We compare the predicted heat distribution with commercial thermal simulation software such as COMSOL, MTA, and Hotspot, achieving accelerations of 10,000x to 1,000,000x. T-fusion can also be applied to transient temperature prediction of the chip, requiring only 20 sets of high-precision data and 64 sets of low-precision data to control ME under 1K.

9D-5 (Time: 17:00 - 17:25)

Title	Towards Functional Safety of Neural Network Hardware Accelerators: Concurrent Out-of-Distribution Detection in Hardware Using Power Side-Channel Analysis
Author	*Vincent Meyers (Karlsruhe Institute of Technology, Germany), Michael Hefenbrock (RevoAI GmbH, Germany), Mahboobe Sadeghipourrudsari, Dennis Gnad, Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Page	pp. 1413 - 1419
Keyword	out of distribution detection, neural network accelerators, power side channel
Abstract	For AI hardware, functional safety is crucial, especially for neural network (NN) accelerators used in safety-critical systems. A key requirement for maintaining this safety is the precise detection of out-of-distribution (OOD) instances, which are inputs significantly distinct from the training data. Neglecting to integrate robust OOD detection may result in possible safety hazards, diminished performance, and inaccurate decision-making within NN applications. Existing methods for OOD detection have been explored for full-precision models. However, the evaluation of methods on quantized neural network (QNN), which are often deployed on hardware accelerators such as FPGAs, and on-device hardware realization of concurrent OOD detection (COD) is missing in literature. In this paper, we provide a novel approach to OOD detection for NN FPGA accelerators using power measurements. Utilizing the power side-channel through digital voltage sensors allows on-device OOD detection in a non-intrusive and concurrent manner, without relying on explicit labels or modifications to the underlying NN. Furthermore, our method allows OOD detection before the inference finishes. Additionally to the evaluation, we provide an efficient hardware implementation of COD on an actual FPGA.

[To Session Table]

Session 9E (SS-8) Innovations and Challenges on Cryo-CMOS Devices, Circuits and Design Platforms
Time: 15:20 - 17:30, Thursday, January 23, 2025
Location: Innovation Hall
Chairs: Chika Tanaka (KIOXIA Corporation/Kyoto Institute of Technology, Japan), Nobuyuki Momo (KIOXIA Corporation, Japan)

9E-1 (Time: 15:20 - 15:45)

Title	(Invited Paper) Physics-based Modeling to Extend a MOSFET Compact Model for Cryogenic Operation
Author	*Dondee Navarro (KIOXIA Corporation/Kyoto Institute of Technology, Japan), Shin Taniguchi (Kyoto Institute of Technology, Japan), Chika Tanaka (KIOXIA Corporation/Kyoto Institute of Technology, Japan), Kazutoshi Kobayashi (Kyoto Institute of Technology, Japan), Takashi Sato (Kyoto University, Japan), Michihiro Shintani (Kyoto Institute of Technology, Japan)
Page	pp. 1420 - 1425
Keyword	Cryo-CMOS, Compact Model, Cryogenic, HiSIM
Abstract	This paper utilizes physics-based models of the cryogenic effects on semiconductors to extend the low-temperature description of an industry-standard compact metal-oxide-semiconductor field-effect transistor (MOSFET) model. Specifically, the incomplete dopant ionization effect is incorporated in the bulk Fermi potential calculation of the compact model as a threshold voltage shift in the formulation of the Poisson's equation. Temperature-related models for the bandgap energy, saturation velocity, and contact resistances at the source/drain regions are also enhanced. Through evaluation using transistors fabricated by 22 nm process technology, we demonstrate that the consistent formulation enables reproduction of the I-V and Vth-T characteristics from 300 K to 4 K.

9E-2 (Time: 15:45 - 16:10)

Title	(Invited Paper) Cryo-Compact Modeling Based on Sparse Gaussian Process
Author	*Tetsuro Iwasaki (Kyoto Institute of Technology, Japan), Takashi Sato (Kyoto University, Japan), Michihiro Shintani (Kyoto Institute of Technology, Japan)
Page	pp. 1426 - 1431
Keyword	Cryo-CMOS, Compact model, Sparse Gaussian process regression
Abstract	To facilitate the scalability of quantum computers, CMOS circuits operating at cryogenic temperatures (Cryo-CMOS) are being intensively studied for qubit control applications. A transistor model that works at cryogenic temperatures is necessary to design robust Cryo-CMOS integrated circuits. Recently, a sparse Gaussian process (SGP)-based modeling method was introduced to model the current characteristics consistently across room to cryogenic temperatures. However, this method does not address the current characteristics in the subthreshold region and is limited to modeling only the current characteristics. In this paper, we extend the SGP-based approach to accurately represent ultra-steep subthreshold slopes at cryogenic temperatures and integrate it into the BSIMBULK model for a comprehensive representation of transistor behavior. Evaluations using transistors fabricated in 22 nm technology show that the proposed model effectively simulates the transitions between off-current and on-current at 4 and 70 K and accurately predicts the transfer characteristics of an inverter.

9E-3 (Time: 16:10 - 16:35)

Title	(Invited Paper) Re-Consideration of Correlation Between Interface States and Bulk Traps Using Cryogenic Measurement
Author	*Yuichiro Mitani, Tatsuya Suzuki, Yohei Miyaki (Tokyo City University, Japan)
Page	pp. 1432 - 1437
Keyword	Cryogenic, MOSFET, Reliability, Interface states, Hydrogen
Abstract	The interface trap generation under applying stress voltage is caused by hydrogen diffusion subsequent to hydrogen release at MOS interfaces. These released hydrogen atoms also create bulk traps. In this study, the relationship between the interface trap generation and MOSFET degradation under electrical stressing at 77 K ~ 300 K. As a result, thedegradation of ON current is more serious at lower temperature irrespective of the same density of generated interface trap. It is inferred that the additional degradation mechanism is working at lower temperature region.

9E-4 (Time: 16:35 - 17:00)

Title	(Invited Paper) Random Telegraph Noise Observed on 65-nm Bulk pMOS Transistors at 3.8K
Author	*Takuma Kawakami, Takashi Sato, Hiromitsu Awano (Kyoto University, Japan)
Page	pp. 1438 - 1443
Keyword	Random Telegraph Noise, Cryogenic CMOS, Reliability
Abstract	This paper presents a comprehensive study on the behavior of Random Telegraph Noise (RTN) under cryogenic conditions. The study leverages a device array, BTIarray, to statistically measure RTN in a temperature range from room temperature down to 3.8\,K\@. The measurement results suggest that while RTN’s impact diminishes in the low-temperature region at about 100\,K, it becomes more pronounced in the extreme low-temperature region below that, especially for transistors with shorter channel lengths. This research contributes to the understanding of RTN behavior under cryogenic conditions, providing valuable insights for future IC design.

9E-5 (Time: 17:00 - 17:25)

Title	(Invited Paper) Cryo-CMOS Analog Circuits for Spin Qubit Control
Author	*Takuji Miki (Kobe University, Japan)
Page	pp. 1444 - 1449
Keyword	Cryo-CMOS, Spin qubit, Quantum computer, AD converter, DA converter
Abstract	Silicon spin qubits offer a key advantage in scalability due to their compatibility with CMOS technology, making them well-suited for large-scale quantum computers. However, as the number of silicon spin qubits increases, wiring complexity within a dilution refrigerator becomes a significant challenge, along with the increased thermal load from external control systems. To mitigate these issues, we are developing CMOS analog circuits operating at cryogenic temperature which generate qubit control signals directly inside the refrigerator. This paper introduces cryogenic DA/AD converters for biasing spin qubits and acquiring their environmental data, with a focus on low power consumption and small area to meet space and power budget within the refrigerator. The cryogenic DAC design effectively utilizes the specific transistor characteristics such as reduced leakage current and thermal noise at cryogenic temperature. This DAC achieves 11-bit resolution with small area, enabling integration of 50 channels on a single chip, thanks to an area-efficient capacitor mismatch calibration technique. To achieve further accurate control of spin qubits, a cryogenic ADC is designed for environmental signal acquisition near qubits. The ADC successfully digitizes the pulse signals with 11-bit resolution and 35 uW power consumption at a deep cryogenic temperature of 100 mK.

[To Session Table]

Session 9F (DF-4) Integrated Circuit Design Methodologies using Open Source and Artificial Intelligence
Time: 15:20 - 16:35, Thursday, January 23, 2025
Organizer: Hiroyuki Uzawa (NTT Corporation, Japan), Chair: Takeshi Kuboki (Kumamoto University, Japan)

9F-1 (Time: 15:20 - 15:45)

Title	(Designers' Forum) A Challenge to Tape-Out in Open-Source Era
Author	*Akira Tsuchiya (The University of Shiga Prefecture, Japan)
Abstract	This talk reports a challenge to "Chipathon" program from Japanese open-source community. Chipathon 2023 is a long-term program of international collaborative IC design by open-source PDK (Process Design Kit) and open-source EDA tools. Team-Japan is a group of individuals who have various backgrounds and skills. I will introduce how we design IC in on-line community. Based on the experience of Chipathon 2023, pros/cons and future issues of open-source silicon is discussed.

9F-2 (Time: 15:45 - 16:10)

Title	(Designers' Forum) Analog Design Democratization for Small Volume LSI Fabrication
Author	*Seijiro Moriyama (Anagix Corporation, Japan), Shingo Ura, Tadaaki Tsuchiya (Logic Research Co., Ltd., Japan)
Abstract	We are developing analog design environment for LSI novices who need solutions for small volume LSIs. For lower cost LSI fabrication, the shuttle service is useful. But, most legacy fabs have been behind in PDK development needed for the service. The open PDK development method has been developed by the speaker to develop PDK faster and at much lower cost. Recently several PDKs have been developed by the method, and Minimal Fab is the one made public without NDA requirement. PDKs developed by the open PDK methods are easier for layout design porting because similar PCells are used. However, to meet analog performance, deep knowledge is required for designers to tune circuit design. The speaker proposes 'good example based analog design' and has been developing a program to make it feasible.

9F-3 (Time: 16:10 - 16:35)

Title	(Designers' Forum) Automatic Design of an Analog Integrated Circuits using AI
Author	*Nobukazu Takai (Kyoto Institute of Technology, Japan)
Abstract	The demand for analog integrated circuits is increasing year on year, driven by the proliferation of the IoT and improved performance of integrated circuits. In order to meet the rapidly growing demand for analog integrated circuits, design automation technology is required. This presentation introduces methods for automatic design of element values for analog integrated circuits, in particular operational amplifiers, using RL. The GNN-opt and TuRBO algorithms applied in the automatic design are presented as examples.

The 30th Asia and South Pacific Design Automation Conference Technical Program

Session Schedule

List of papers

The 30th Asia and South Pacific Design Automation Conference
Technical Program