Tuesday, January 23, 2018 |
Time | Room 302 | Room 401 | Room 402A | Room 402B |
---|---|---|---|---|
9:00 - 10:30 | Opening & Keynote I |
10:30 - 11:00 | |
11:00 - 12:15 | | | | |
12:15 - 13:45 | |
13:45 - 15:25 | | | | |
15:25 - 15:55 | |
15:55 - 17:35 | | | | |
17:35 - 18:00 | |
18:00 - 20:00 | |
Wednesday, January 24, 2018 |
Time | Room 302 | Room 401 | Room 402A | Room 402B |
---|---|---|---|---|
9:00 - 10:00 | Keynote II |
10:00 - 10:30 | |
10:30 - 12:10 | | | | |
12:10 - 13:40 | |
13:40 - 15:45 | | | | |
15:45 - 16:15 | |
16:15 - 17:30 | | | | |
17:30 - 18:00 | |
18:00 - 20:00 | |
Thursday, January 25, 2018 |
Room 302 | Room 401 | Room 402A | Room 402B |
---|---|---|---|
Keynote III 8:30 - 10:00 |
|||
10:00 - 10:30 |
|||
10:30 - 11:45 |
10:30 - 11:45 |
10:30 - 11:45 |
10:30 - 11:45 |
11:45 - 13:15 |
|||
13:15 - 14:30 |
(EDA Winter Workshop) | (Workshop on Memory and Storage Devices and Systems) | |
14:30 - 15:00 | |||
15:00 - 16:15 |
|||
16:15 - 16:45 | |||
16:45 - 18:00 |
Tuesday, January 23, 2018 |
Title | (Keynote Address) Designing Heterogeneous Systems in the AI Era: Challenges and Opportunities |
Author | Jeff Burns (IBM Thomas J. Watson Research Center, U.S.A.) |
Abstract | Artificial Intelligence has recently become a major trend, driving the rapid development of new compute capabilities. As AI functionality has improved, the demand for even greater capabilities has grown, requiring significantly higher levels of performance and energy efficiency in the IT infrastructure. The maturation of classical semiconductor scaling has made delivering these higher levels much more challenging; reliance on scaling alone is insufficient. As a result, innovations in design, design automation, and architecture are crucial to improving power-performance. Accelerators in heterogeneous systems will increasingly be needed to improve overall system capabilities and power/performance. To enable practical accelerator integration, system architectures must be designed to easily incorporate heterogeneous components such as today’s GPUs, as well as future accelerators for cognitive applications. Design processes must increase in efficiency to enable the cost-effective design of application-specific accelerators. In this presentation, I will describe these trends, some exemplary innovations that address them, and areas of future research towards heterogeneous systems for the AI era. |
Title | (Invited Paper) Quantized Deep Neural Networks for Energy Efficient Hardware-based Inference |
Author | Ruizhou Ding, Zeye Liu, *R. D. (Shawn) Blanton, Diana Marculescu (Carnegie Mellon University, U.S.A.) |
Page | pp. 1 - 8 |
Keyword | Deep Learning, Binarized Neural Networks |
Abstract | Deep Neural Networks (DNNs) have been adopted in many systems because of their higher classification accuracy, with custom hardware implementations being great candidates for high-speed, accurate inference. While progress in achieving large scale, highly accurate DNNs has been made, significant energy and area are required due to massive memory accesses and computations. Such demands pose a challenge to any DNN implementation, yet it is more natural to handle in a custom hardware platform. To alleviate the increased demand in storage and energy, quantized DNNs constrain their weights (and activations) from floating-point numbers to only a few discrete levels. Therefore, storage is reduced, thereby leading to fewer memory accesses. In this paper, we provide an overview of different types of quantized DNNs, as well as the training approaches for them. Among the various quantized DNNs, our LightNN (Light Neural Network) approach can reduce both memory accesses and computation energy, by filling the gap between classic, full-precision and binarized DNNs. We provide a detailed comparison between LightNNs, conventional DNNs and Binarized Neural Networks (BNNs), with MNIST and CIFAR-10 datasets. In contrast to other quantized DNNs that trade off significant amounts of accuracy for lower memory requirements, LightNNs can significantly reduce storage, energy and area while still maintaining a test error similar to a large DNN configuration. Thus, LightNNs provide more options for hardware designers to trade off accuracy and energy. |
Title | (Invited Paper) Intelligent Corner Synthesis via Cycle-Consistent Generative Adversarial Networks for Efficient Validation of Autonomous Driving Systems |
Author | Handi Yu (Duke University, U.S.A.), *Xin Li (Duke University / Duke Kunshan University, U.S.A.) |
Page | pp. 9 - 15 |
Keyword | Corner synthesis, Statistical validation, Autonomous driving |
Abstract | Today’s automotive vehicles are often equipped with powerful data processing systems for driver assistance and/or autonomous driving. To meet the rigorous safety standard, one critical task is to ensure an extremely small failure rate over all possible operation conditions. Such a validation task requires a large amount of on-road testing data to cover all possible corners. In this paper, we describe a novel general-purpose methodology to synthetically and efficiently generate a broad spectrum of corner cases for validation purposes. Our proposed method is based upon cycle-consistent generative adversarial networks (CycleGANs) trained by a small set of image samples to mathematically map a nominal case to other corner cases. By taking STOP sign detection as an example, our numerical experiments demonstrate that the proposed approach is able to reduce the validation error by up to 100× given a limited data set for corner cases. |
Title | (Invited Paper) Deep Learning for Better Variant Calling for Cancer Diagnosis and Treatment |
Author | Anand Ramachandran, Huiren Li, Eric Klee, Steven S. Lumetta, *Deming Chen (UIUC, U.S.A.) |
Page | pp. 16 - 21 |
Keyword | Deep Learning, DNA sequencing, Variant Calling, Cancer Diagnosis |
Abstract | High-throughput techniques have revolutionized the study of genomics and molecular biology in recent years. These methods provide a large quantity of sequence data, and have applications in different areas of bioinformatics. One can sequence part or all of an organism’s DNA to determine genetic information about an individual or a population, measure expression levels of different genes under different conditions, and determine binding affinity of proteins to DNA segments revealing details regarding gene regulation, at a higher resolution than before. However, different high-throughput methods that target even a single application have different underlying error models. Robust analytic pipelines are necessary to extract the required information from the raw data. In this paper, we discuss future research directions for developing such analytics using techniques from Machine Learning and Deep Neural Networks. We focus on two applications that will affect the diagnosis and treatment of cancer. |
Title | Multi-Device Collaborative Management Through Knowledge Sharing |
Author | *Zhongyuan Tian (Hong Kong University of Science and Technology, Hong Kong), Zhe Wang (Huawei Technologies Co. Ltd., China), Haoran Li, Peng Yang, Rafael Kioji Vivas Maeda, Jiang Xu (Hong Kong University of Science and Technology, Hong Kong) |
Page | pp. 22 - 27 |
Keyword | collaborative learning, knowledge sharing, multicore, multi-device |
Abstract | Rapidly evolving embedded applications continuously demand better performance under tight energy budgets, and maintaining high energy efficiency has become a great design challenge for mobile devices. With the ever-increasing software and hardware complexity, it is challenging for the learning-based power manager to quickly find an efficient management policy. We propose a multi-device collaborative management approach to accelerate the learning process. Experimental results show that the proposed approach can achieve 8X speedup and 10% energy saving when compared with state-of-the-art approaches. |
Title | SQLiteKV: An Efficient LSM-tree-based SQLite-like Database Engine for Mobile Devices |
Author | *Yuanjing Shi, Zhaoyan Shen, Zili Shao (The Hong Kong Polytechnic University, Hong Kong) |
Page | pp. 28 - 33 |
Keyword | SQLite, Key value store, LSM tree, Smart Device, Memory Defragmentation |
Abstract | SQLite has been deployed in millions of mobile devices from web to smartphone applications on various mobile operating systems. However, SQLite is inefficient, delivering low transactions per second. In this paper, we propose, for the first time, a new SQLite-like database engine, called SQLiteKV, which adopts the LSM-tree-based data structure but retains the SQLite operation interfaces. With its SQLite interface, SQLiteKV can be utilized by existing applications without any modification, while providing high performance with its LSM-tree-based data structure. In SQLiteKV, we develop a light-weight SQLite to key-value compiler to solve the semantic mismatch, so SQL statements can be efficiently translated into KV operations. We also design a novel coordination caching mechanism so query results can be effectively cached inside SQLiteKV by alleviating the discrepancy of data management between front-end SQLite statements and back-end key-value data organization. We have implemented and deployed SQLiteKV on a Google Nexus 6P smartphone. Experimental results show that SQLiteKV outperforms SQLite by up to 6 times. |
Title | DI-SSD: Desymmetrized Interconnection Architecture and Dynamic Timing Calibration for Solid-State Drives |
Author | Ren-Shuo Liu, *Jian-Hao Huang (National Tsing Hua University, Taiwan) |
Page | pp. 34 - 39 |
Keyword | Solid-State Drive (SSD), SSD Architecture, Interconnection, Flash controller, Flash memory |
Abstract | NAND flash-based solid-state drives (SSDs) have long been architected such that the interconnections between a flash controller and the associated flash memory chips operate at a symmetric speed in both directions. However, this commonly accepted and widely used architecture is suboptimal for SSDs because reading flash cells is 10 to 20 times faster than writing them. In response, we propose desymmetrized interconnection SSD architecture (DI-SSD) and dynamic timing calibration (DTC), which selectively push the flash-to-controller speed to the limit. We conduct comprehensive experiments including characterizing real SSD products, using industrial IC test equipment to emulate a flash controller that adopts DTC, and simulating DI-SSD to demonstrate the benefits of our proposals. |
Title | Sound Valve-Control for Programmable Microfluidic Devices |
Author | *Andreas Grimmer, Berislav Klepic (Johannes Kepler University Linz, Austria), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Robert Wille (Johannes Kepler University Linz, Austria) |
Page | pp. 40 - 45 |
Keyword | Microfluidics, Programmable Microfluidic Device, Valve-control |
Abstract | In the domain of microfluidics, a paradigm shift from application-specific to fully-programmable solutions takes place. So-called Programmable Microfluidic Devices (PMDs) provide a promising platform in this regard. However, determining a sound control of their valves represents a non-trivial task and first solutions frequently yield impractical control sequences. We address this issue by providing a precise task definition and present complementary solutions (exact and heuristic). Evaluations demonstrate that the proposed solutions are capable of automatically generating a sound valve-control for PMDs. |
Title | Multi-Level Droplet Routing in Active-Matrix Based Digital-Microfluidic Biochips |
Author | *Guan-Ruei Lu (National Chiao Tung University, Taiwan), Bhargab B. Bhattacharya (Indian Statistical Institute, India), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Hung-Ming Chen (National Chiao Tung University, Taiwan) |
Page | pp. 46 - 51 |
Keyword | Digital Microfluidic Biochip, Active-Matrix, EWOD, Multi-level Routing |
Abstract | Active-Matrix (AM) technology is currently being used to implement a superior class of EWOD-based biochips, which consist of a dense 2D-array of microelectrodes. These chips offer many advantages over conventional biochips such as the capability of handling variable-size droplets, more flexibility in droplet movement, precise control over droplet navigation, and as a consequence, ease of implementing complex bioprotocols on-chip. However, the new technology poses a number of challenges concerning droplet routing. In order to enhance routability, we propose, in this paper, a multi-level hierarchical approach that takes appropriate decisions on droplet splitting and reshaping. Compared to the most recent routing methods used for EWOD, the proposed multi-level router reduces maximum latest-arrival-time by 18% on average and achieves 7% less average latest-arrival-time. |
Title | MESGA: An MPSoC Based Embedded System Solution for Short Read Genome Alignment |
Author | *Vikkitharan Gnanasambandapillai, Arash Bayat, Sri Parameswaran (University of New South Wales, Australia) |
Page | pp. 52 - 57 |
Keyword | genome alignment, short read, pipeline processor, MPSoC, load balancing |
Abstract | Computational needs for genome processing are satiated by servers or by cloud computers. In this paper, we present a method to entirely move the alignment process to embedded processors. Such a system is useful in a variety of situations where significant networked infrastructure is not available, and where privacy is a concern. Experimental results show that the proposed solution speeds up the performance by approximately seven times with 16 embedded processors when compared to a linear pipelined system. |
Title | Scheduling and Shaping of Complex Task Activations for Mixed-Criticality Systems |
Author | Biao Hu (Beijing University of Chemical Technology, China), *Kai Huang (Sun Yat-sen University, China) |
Page | pp. 58 - 63 |
Keyword | Schedulability test, Mixed-Criticality, Runtime Shaping |
Abstract | In this paper, we present a new schedulability test that can cope with complex activation patterns for mixed-criticality systems. Under this analysis we proceed to present a shaping approach that can adaptively make use of system slack to improve the quality of service to less critical tasks. Compared with the state-of-the-art scheduling analysis, our scheduling analysis is more effective in handling the case that activation events can be backlogged and task deadlines can be arbitrary; and the shaping approach furthermore reduces the dropped jobs of less critical tasks without jeopardizing the guarantee to critical tasks. Extensive simulations and real-life deployment on a Raspberry Pi 3 board confirm the effectiveness of our proposed schedulability test and shaping approach. |
Title | BUQS: Battery- and User-aware QoS Scaling for Interactive Mobile Devices |
Author | *Wooseok Lee, Reena Panda (The University of Texas at Austin, U.S.A.), Dam Sunwoo, Jose Joao (ARM R&D, U.S.A.), Andreas Gerstlauer, Lizy K. John (The University of Texas at Austin, U.S.A.) |
Page | pp. 64 - 69 |
Keyword | QoS, Energy Management, Battery Management, Interactive Mobile Device |
Abstract | Battery life has become one of the major concerns for mobile user experience. Existing approaches for balancing device quality-of-service (QoS) and energy often over- or under-provision available battery capacity, or do not properly account for the non-obvious impact of QoS and battery state on actual user experience. In this paper, we propose BUQS, Battery- and User-aware QoS Scaling to maximize user experience under desired battery lifetime goals by leveraging insights about mobile device users. BUQS continually evaluates optimal QoS based on battery status and varying user expectations, and it dynamically adjusts the device service level to maximize user experience and simultaneously meet the battery lifetime requirements. BUQS recognizes dependence of user experience on both device use and battery life using an extended battery-aware quality-of-experience (QoE) model. Furthermore, BUQS learns user's behavior to predict energy demands in time and proactively rebalance energy to improve user experience. Experimental results show that our approach delivers about 30% higher QoE than the state-of-the-art QoS and energy balancing approach. |
Title | Power Conversion Efficiency-Aware Mapping of Multithreaded Applications on Heterogeneous Architectures: A Comprehensive Parameter Tuning |
Author | *Hossein Sayadi (George Mason University, U.S.A.), Divya Pathak, Ioannis Savidis (Drexel University, U.S.A.), Houman Homayoun (George Mason University, U.S.A.) |
Page | pp. 70 - 75 |
Keyword | Energy Efficiency, Heterogeneous Architectures, Scheduling, Power Conversion Efficiency |
Abstract | Heterogeneous Multicore Processors (HMPs) are comprised of multiple core types (small vs. big core architectures) with various performance and power characteristics which offer the flexibility to assign each thread to a core that provides the maximum energy-efficiency. Although this architecture provides more flexibility for the running application to determine the optimal run-time settings that maximize energy-efficiency, due to the interdependence of various tuning parameters such as the type of core, run-time voltage and frequency, and the number of threads, the scheduling becomes more challenging. More importantly, the impact of Power Conversion Efficiency (PCE) of the On-Chip Voltage Regulators (OCVRs) is another important parameter that makes it more challenging to schedule multithreaded applications on HMPs. In this paper, the importance of concurrent optimization and fine-tuning of the circuit and architectural parameters for energy-efficient scheduling on HMPs is addressed to harness the power of heterogeneity. In addition, the scheduling challenges for multithreaded applications are investigated for HMP architectures that account for the impact of power conversion efficiency. A highly accurate learning-based model is developed for energy-efficiency prediction to guide the scheduling decision. Using the predictive model, a PCE-aware scheduling scheme is further developed for effective mapping of multithreaded applications onto an HMP. The results indicate that the proposed learning-based scheme outperforms the state-of-the-art solution by 10% when there is no PCE gap between big and little cores. The energy-efficiency improves up to 60% when the PCE gap between big and little cores increases. |
Title | (Invited Paper) Effect of Aging on Linear and Nonlinear MUX PUFs by Statistical Modeling |
Author | Anoop Koyily, S.V. Sandeep Avvaru, Chen Zhou, Chris H. Kim, *Keshab K. Parhi (University of Minnesota, U.S.A.) |
Page | pp. 76 - 83 |
Keyword | Hardware Security, Aging, PUF, Statistical analysis |
Abstract | This paper addresses the effect of aging on linear and non-linear MUX physical unclonable functions (PUFs). It is well known that a PUF response can be modeled in terms of the delay difference of MUX stages. In this paper, we show that the aging effects can be modeled in terms of variations in delay-difference and arbiter delay. Specifically, with aging, the percent delay-difference variation of each MUX stage can be modeled as a ratio of two correlated Gaussian random variables. This ratio distribution is shown to be approximately Gaussian with zero mean and variance increasing with time. In case of the arbiter, the ratio distribution is modeled as a Gaussian with positive mean. The paper makes three contributions: modeling the effect of aging in terms of percent variations in delay-difference of the MUX stages and arbiter delay, analysis of authentication accuracy with aging, and approaches to increase the PUF's lifetime by either recalibrating it to obtain new delay-difference parameters, or by tuning a threshold based on the total delay-difference. A general approach for selecting the threshold values is described in the paper. It is shown that the authentication accuracy of a PUF is significantly affected due to aging effects of the arbiter itself. Therefore, under the assumption that the variations in arbiter delay are considerably more than in delay-differences, the performance degradation in the case of aging alone is prominent compared to noise alone. We show that the authentication accuracy of a feed-forward PUF is more degraded compared to linear or modified feed-forward PUF. Metrics like Jensen-Shannon and Henze-Penrose divergence are also used to analyze the effect of aging. |
Title | (Invited Paper) ASAX: Automatic Security Assertion Extraction for Detecting Hardware Trojans |
Author | *Chenguang Wang, Yici Cai, Qiang Zhou, Haoyi Wang (Tsinghua University, China) |
Page | pp. 84 - 89 |
Keyword | hardware Trojans detection, RTL invariants, automatic security assertion extraction |
Abstract | Hardware Trojans (HT) have been one of the major concerns of IC designers, and formal methods have been applied to HT detection. In general, the assertions for detecting HT are manually defined, which is time-consuming and error-prone even for an expert engineer. However, there is a lack of studies on the automatic definition of security assertions. To fill in this gap, we propose an automatic security assertion extraction (ASAX) tool. ASAX labels the candidate signals and infers the proposed register transfer level (RTL) invariants from simulation traces. Next, the security assertions are mined from the inferred RTL invariants. By adopting a two-step invariant inference technique, ASAX can extract high-coverage assertions with a low runtime. We validate the effectiveness and efficiency of ASAX through experiments on the benchmarks from Trust-hub, DeTrust and OpenCores. The results show that the HT can be 100% detected by model checking with the extracted security assertions. |
Title | (Invited Paper) Polymorphic Gate based IC Watermarking Techniques |
Author | *Tian Wang, Xiaoxin Cui, Dunshan Yu (Peking University, China), Omid Aramoon, Timothy Dunlap, Gang Qu (University of Maryland, U.S.A.), Xiaole Cui (Peking University Shenzhen Graduate School, China) |
Page | pp. 90 - 96 |
Keyword | polymorphic gate, watermarking, genetic algorithm |
Abstract | Polymorphic gates are reconfigurable devices whose functionality may vary in response to the change of execution environment such as temperature, supply voltage or external control signals. This feature makes them a perfect candidate for circuit watermarking. However, polymorphic gates are hard to find because they do not exhibit the traditional structure. In this paper, we report four dual-function polymorphic gates that we have discovered using an evolutionary approach. With these gates, we propose a circuit watermarking scheme that selectively replaces certain standard logic gates with the polymorphic gates. Experimental results on ISCAS and MCNC benchmark circuits demonstrate that this scheme introduces low overhead. More specifically, the average overhead in area, speed and power are 4.10%, 2.08% and 1.17% respectively when we embed 30-bit watermark sequences. These overheads increase to 6.36%, 4.75% and 2.08% respectively when 10% of the gates in the original circuits are replaced to embed watermark up to more than 300 bits. |
Title | (Invited Paper) A Machine Learning Attack Resistant Multi-PUF Design on FPGA |
Author | Qingqing Ma (Nanjing University of Aeronautics and Astronautics, China), Chongyan Gu, Neil Hanley (Queen's University Belfast, U.K.), Chenghua Wang, *Weiqiang Liu (Nanjing University of Aeronautics and Astronautics, China), Maire O'Neill (Queen's University Belfast, U.K.) |
Page | pp. 97 - 104 |
Keyword | Multi-PUF, Modeling Attacks, Machine Learning |
Abstract | Current approaches for building physical unclonable function (PUF) designs resistant to machine learning attacks often suffer from large resource overhead and are typically difficult to implement on field programmable gate arrays (FPGAs). In this paper we propose a new arbiter-based multi-PUF (MPUF) design that utilises a Weak PUF to obfuscate the challenges to a Strong PUF and is harder to model than the conventional arbiter PUF using machine learning attacks. The proposed PUF design shows a greater resistance to attacks, which have been successfully applied to other Arbiter PUFs. A mathematical model is presented to analyse the complexity and obfuscation properties of the proposed PUF design. Moreover, we show that it is feasible to implement the proposed MPUF design on a Xilinx Artix-7 FPGA, and that it achieves a good uniqueness result of 40.60 % and uniformity of 37.03 %, which significantly improves over previous work into multi-PUF designs. |
Title | Supporting Compressed-Sparse Activations and Weights on SIMD-like Accelerator for Sparse Convolutional Neural Networks |
Author | *Chien-Yu Lin, Bo-Cheng Lai (National Chiao Tung University, Taiwan) |
Page | pp. 105 - 110 |
Keyword | Accelerator, CNN, Sparsity |
Abstract | Sparsity is widely observed in convolutional neural networks by zeroing a large portion of both activations and weights without impairing the result. By keeping the data in a compressed-sparse format, the energy consumption could be considerably cut down due to less memory traffic. However, the wide SIMD-like MAC engine adopted in many CNN accelerators cannot support the compressed input due to the data misalignment. In this work, a novel Dual Indexing Module (DIM) is proposed to efficiently handle the alignment issue where activations and weights are both kept in compressed-sparse format. The DIM is implemented in a representative SIMD-like CNN accelerator, and is able to exploit both compressed-sparse activations and weights. The synthesis results with 40nm technology have shown that DIM can improve energy consumption and Energy-Delay-Product (EDP) by up to 46% and 55.4%, respectively. |
Title | IMCE: Energy-Efficient Bit-Wise In-Memory Convolution Engine for Deep Neural Network |
Author | Shaahin Angizi, Zhezhi He, Farhana Parveen, *Deliang Fan (University of Central Florida, U.S.A.) |
Page | pp. 111 - 116 |
Keyword | In-memory computing, bit-wise convolution |
Abstract | In this paper, we pave a novel way towards the concept of bit-wise In-Memory Convolution Engine (IMCE) that could implement the dominant convolution computation of Deep Convolutional Neural Networks (CNNs) within memory. IMCE employs parallel computational memory sub-array as a fundamental unit based on our proposed Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) design. Then, we propose an accelerator system architecture based on IMCE to efficiently process low bit-width CNNs. This architecture can be leveraged to greatly reduce energy consumption when dealing with convolutional layers and also accelerate CNN inference. The device to architecture co-simulation results show that the proposed system architecture can process low bit-width AlexNet on the ImageNet dataset favorably with 785.25µJ/img, which consumes ~3× less energy than a recent RRAM-based counterpart. Besides, the chip area is ~4× smaller. |
Title | Training Low Bitwidth Convolutional Neural Network on RRAM |
Author | *Yi Cai, Tianqi Tang, Lixue Xia, Ming Cheng, Zhenhua Zhu, Yu Wang, Huazhong Yang (Tsinghua University, China) |
Page | pp. 117 - 122 |
Keyword | RRAM, on-chip training, low bitwidth, CNN |
Abstract | This paper proposes an RRAM-based on-chip low-bit CNN training system, introducing low-bit ADC/DACs and low-precision weights in CNN training, which helps to solve the problems that impede the realization of an RRAM-based training system. This paper also discusses the disturbance existing in RRAM resistance due to non-ideal factors and analyses its impact on training. |
Title | A High-Throughput and Energy-Efficient RRAM-based Convolutional Neural Network using Data Encoding and Dynamic Quantization |
Author | *Xizi Chen, Jingbo Jiang, Jingyang Zhu, Chi-Ying Tsui (Hong Kong University of Science and Technology, China) |
Page | pp. 123 - 128 |
Keyword | CNN, memristor, RRAM, neural, accelerator |
Abstract | To solve the scaling, memory wall and high power density issues, recently RRAM-based accelerators, which show a better energy and area efficiency compared with the CMOS-based counterparts, have been proposed for convolutional neural networks. However, the RRAM-based architectures still face several design challenges, including the high energy and timing overhead at the analog/digital (A/D) conversion and interfacing circuits. To address these issues, we propose several novel optimization schemes in this work. First an encoding scheme for the synaptic weights and the input feature maps is proposed to reduce the energy of the in-situ computation and the bit-resolution of the A/D conversion. Then the resolution of the A/D conversion is further optimized for a lower energy consumption. Moreover, a dynamic quantization scheme for the multiply-accumulate operations (MACs) is proposed to improve the throughput and the energy efficiency by reducing the number of partial products. Experimental results show that the throughput, the energy efficiency and the area efficiency are improved by 2 to 4 times when compared with the state-of-the-art RRAM-based accelerators. |
Title | DRL-Cloud: Deep Reinforcement Learning-Based Resource Provisioning and Task Scheduling for Cloud Service Providers |
Author | Mingxi Cheng (Duke University, U.S.A.), Ji Li, *Shahin Nazarian (University of Southern California, U.S.A.) |
Page | pp. 129 - 134 |
Keyword | Deep reinforcement learning, Deep Q-learning, Cloud computing, Resource provisioning, Task scheduling |
Abstract | Cloud computing has become an attractive computing paradigm in both academia and industry. Through virtualization technology, Cloud Service Providers (CSPs) that own data centers can structure physical servers into Virtual Machines (VMs) to provide services, resources, and infrastructures to users. Profit-driven CSPs charge users for service access and VM rental, and reduce power consumption and electric bills so as to increase profit margin. The key challenge faced by CSPs is data center energy cost minimization. Prior works proposed various algorithms to reduce energy cost through Resource Provisioning (RP) and/or Task Scheduling (TS). However, they have scalability issues or do not consider TS with task dependencies, which is a crucial factor that ensures correct parallel execution of tasks. This paper presents DRL-Cloud, a novel Deep Reinforcement Learning (DRL)-based RP and TS system, to minimize energy cost for large-scale CSPs with a very large number of servers that receive enormous numbers of user requests per day. A deep Q-learning-based two-stage RP-TS processor is designed to automatically generate the best long-term decisions by learning from the changing environment such as user request patterns and realistic electric price. With training techniques such as target network, experience replay, and exploration and exploitation, the proposed DRL-Cloud achieves remarkably high energy cost efficiency, low reject rate as well as low runtime with fast convergence. Compared with one of the state-of-the-art energy efficient algorithms, the proposed DRL-Cloud achieves up to 320% energy cost efficiency improvement while maintaining lower reject rate on average. For an example CSP setup with 5,000 servers and 200,000 tasks, compared to a fast round-robin baseline, the proposed DRL-Cloud achieves up to 144% runtime reduction. |
Title | Pairing of Microring-based Silicon Photonic Transceivers for Tuning Power Optimization |
Author | Rui Wu (University of California, Santa Barbara, U.S.A.), M. Ashkan Seyedi (Hewlett Packard Labs, U.S.A.), *Yuyang Wang (University of California, Santa Barbara, U.S.A.), Jared Hulme, Marco Fiorentino, Raymond G. Beausoleil (Hewlett Packard Labs, U.S.A.), Kwang-Ting Cheng (Hong Kong University of Science and Technology, Hong Kong) |
Page | pp. 135 - 140 |
Keyword | silicon photonics, microring, WDM transceiver, tuning power |
Abstract | Nanophotonic interconnects have started replacing traditional electrical interconnects in data centers for rack-level communications and demonstrated great potential at board and chip levels. However, microring-based silicon photonic transceivers, an important element for nanophotonic interconnects, are very sensitive to fabrication process variations, and require power hungry wavelength tuning. In this paper, we apply efficient optimization algorithms to mix-and-match a pool of fabricated transceiver devices with the objective of minimizing the overall tuning power. For two sets of fabricated devices, the pairs of transceivers assigned by the optimal pairing technique reduce the tuning power by 6% to 60% compared to the scheme without the application of the technique. We further evaluate the method on synthetic data sets that are generated from a well-established process variation model. Our experimental results show greater power saving with more fabricated devices available for pairing and acceptable algorithm execution time. This optimal pairing technique can be applied during the production stage to reduce tuning power consumption. |
Title | Neu-NoC: A High-efficient Interconnection Network for Accelerated Neuromorphic Systems |
Author | Xiaoxiao Liu, Wei Wen (University of Pittsburgh, U.S.A.), Xuehai Qian (University of Southern California, U.S.A.), Hai Li, *Yiran Chen (Duke University, U.S.A.) |
Page | pp. 141 - 146 |
Keyword | neuromorphic computing, NoC |
Abstract | To solve the high power consumption and average delay of the NoC in modern neuromorphic acceleration systems, we propose Neu-NoC - a high-efficient interconnection network to reduce the redundant data traffic in neuromorphic acceleration systems and explore the data transfer ability between adjacent layers. A sophisticated neural network aware mapping algorithm and a multicast transmission scheme are designed to alleviate data traffic congestions without increasing the average transmission distance. Finally, we explore the sparsity characteristics of fully-connected neural networks. |
Title | A Lifetime-aware Mapping Algorithm to Extend MTTF of Networks-on-Chip |
Author | *Letian Huang, Shuyu Chen, Qiong Wu (University of Electronic Science and Technology of China, China), Masoumeh Ebrahimi (KTH Royal Institute of Technology, Sweden), Junshi Wang, Shuyan Jiang, Qiang Li (University of Electronic Science and Technology of China, China) |
Page | pp. 147 - 152 |
Keyword | Network-on-Chip, mapping algorithm, lifetime reliability, aging |
Abstract | Fast aging of components has become one of the major concerns in System-on-Chip with further scaling of the submicron technology. This problem accelerates when combined with improper working conditions such as unbalanced component utilization. Considering the mapping algorithms in the Networks-on-Chip domain, some routers/links might be frequently selected for mapping while others are underutilized. Consequently, the highly utilized components may age faster than others, which results in the disconnection of the related cores from the network. To address this issue, we propose the lifetime-aware neighborhood allocation (LaNA) mapping algorithm, which takes the aging of components into account when mapping applications. The proposed method is able to balance the wear-out of the NoC's components, extending the NoC's service time. We model the lifetime as a resource consumed over time and accordingly define the lifetime budget metric. LaNA selects a suitable mapping which has the maximum lifetime budget. Experimental results show that the lifetime-aware mapping algorithm could improve the minimal MTTF of NoC around 72.2%, 58.3%, 46.6% and 48.2% as compared to NN, CoNA, WeNA
Title | Layout-Dependent Aging Mitigation for Critical Path Timing |
Author | Che-Lun Hsu (University of Texas at Austin, U.S.A.), Shaofeng Guo (Peking University, China), Yibo Lin, Xiaoqing Xu, *Meng Li (University of Texas at Austin, U.S.A.), Runsheng Wang, Ru Huang (Peking University, China), David Pan (University of Texas at Austin, U.S.A.) |
Page | pp. 153 - 158 |
Keyword | Aging, Timing, Standard cell, Placement |
Abstract | Layout-dependent effects (LDEs) are becoming increasingly important as the technology node continues to shrink into the regime of FinFET transistors. Prior LDE studies mainly focus on accurate transistor modeling and fast circuit performance evaluations at the early lifetime of a design. Few studies have been performed on the layout dependency of the circuit aging towards the end of life (EOL). This study demonstrates that, due to transistor-level layout-dependent aging (LDA) behaviors, circuit-level timing degradations are greatly impacted by layout configurations, including length of diffusion and oxide spacing. Therefore, we propose the first circuit-level aging mitigation framework to improve the critical-path timing towards the EOL. Our framework features comprehensive LDA evaluations for standard cell timing, which show that multiple-row height cells lead to worse EOL timing than single-row height cells due to length-of-diffusion effects. We further propose a min-cost-flow-based placement approach to concurrently allocate the oxide spacing among neighboring standard cells, which generates much better EOL timing than a conventional greedy approach. Experimental results demonstrate the effectiveness of the proposed aging mitigation framework. |
Title | MTTF-aware Design Methodology of Error Prediction Based Adaptively Voltage-scaled Circuits |
Author | *Yutaka Masuda, Masanori Hashimoto (Osaka University, Japan) |
Page | pp. 159 - 165 |
Keyword | adaptive voltage scaling, critical path isolation, mean time to failure, timing error predictive FF |
Abstract | Adaptive voltage scaling is a promising approach to overcome manufacturing variability, dynamic environmental fluctuation, and aging. This paper focuses on error prediction based adaptive voltage scaling (EP-AVS) and proposes a design methodology for EP-AVS circuits. Main contributions of this work include (1) optimization of both voltage-scaled circuit and voltage control logic, and (2) quantitative evaluation of voltage reduction for practically long MTTF. Evaluation results show that the proposed EP-AVS design methodology achieves 20.8% voltage reduction while satisfying target MTTF. |
Title | A Highly Compressed Timing Macro-modeling Algorithm for Hierarchical and Incremental Timing Analysis |
Author | *Tin-Yin Lai, Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) |
Page | pp. 166 - 171 |
Keyword | Timing macro-modeling, Timing analysis, Algorithms |
Abstract | Large-scale hierarchical and incremental timing analysis has driven the need for highly compressed timing macro-models. A small timing macro-model for accelerating hierarchical timing is desired because the size of incremental changes dramatically increases as the macro-models are widely used in the large design process. In fact, it takes days for an incremental timing analysis on millions of gates with thousands of incremental changes. To date, the timing macro-models generated by timing macro-modeling algorithms from all the previous works are not compact enough. In this work, we provide four essential techniques in our timing macro-modeling algorithm, which are able to generate highly compressed timing macro-models for hierarchical and incremental timing analysis. In addition, our timing macro-models maintain high accuracy and are generated efficiently. Our algorithm generates timing macro-models where the model sizes are 9% better in the number of nodes and 19% better in the number of edges than the original circuit. Our work significantly outperforms the state of the art in both model size and the runtime in macro-model usage. |
Title | FastPass: Fast Timing Path Search for Generalized Timing Exception Handling |
Author | *Pei-Yu Lee (National Chiao Tung University, Taiwan), Iris Hui-Ru Jiang (National Taiwan University, Taiwan), Tung-Chieh Chen (Maxeda Technology, Taiwan) |
Page | pp. 172 - 177 |
Keyword | Static timing analysis, timing exceptions, Critical path search |
Abstract | As design complexity rapidly grows, a modern design contains more complex constraints and has more clock domains. To meet these stringent timing requirements, a design is iteratively optimized. Along with intensive optimizations, fast timing analysis that guides designers to fix timing violations is desired. Thus far, previous works have focused on either timing exception handling or path search only. In contrast, in this paper we tackle these two issues together to meet the urgent need in modern design. We first generalize timing exceptions to model all common timing exceptions and other path-specific timing quantities. Then, we propose a novel timing analysis flow that performs fast path search for generalized timing exception handling. Furthermore, we develop three delicate techniques to achieve fast path search, including local slack bounds, dynamic slack recovering, and slack priority queue. Experimental results show that our model is general, and our flow is promising with high efficiency and scalability. |
Title | ReGAN: A Pipelined ReRAM-Based Accelerator for Generative Adversarial Networks |
Author | *Fan Chen, Linghao Song, Yiran Chen (Duke University, U.S.A.) |
Page | pp. 178 - 183 |
Keyword | Neural Network, GAN, ReRAM |
Abstract | Generative Adversarial Networks (GANs) have recently drawn tremendous attention in many artificial intelligence (AI) applications including computer vision, speech recognition, and natural language processing. While GANs deliver state-of-the-art performance on these AI tasks, this comes at the cost of high computational complexity. Although recent progress demonstrated the promise of using ReRAM-based Process-In-Memory for acceleration of convolutional neural networks (CNNs) with low energy cost, the unique training process required by GANs makes them difficult to run on existing neural network acceleration platforms: two competing networks are simultaneously co-trained in GANs, which significantly increases the need for memory and computation resources. In this work, we propose ReGAN -- a novel ReRAM-based Process-In-Memory accelerator that can efficiently reduce off-chip memory accesses. Moreover, ReGAN greatly increases system throughput by pipelining the layer-wise computation. Two techniques, namely Spatial Parallelism and Computation Sharing, are proposed to further enhance the training efficiency of GANs. Our experimental results show that ReGAN can achieve an average 240x performance speedup compared to a GPU platform, with an average energy saving of 94x. |
Title | Quad-Multiplier Packing Based on Customized Floating Point for Convolutional Neural Networks on FPGA |
Author | *Zhifeng Zhang, Dajiang Zhou, Shihao Wang, Shinji Kimura (Waseda University, Japan) |
Page | pp. 184 - 189 |
Keyword | Convolutional Neural Networks, Customized Floating Point, Multiplier Packing, Resource Efficiency, FPGA |
Abstract | Convolutional neural networks (CNNs) are widely used in many fields. This paper explores customized floating point for both the training and inference of CNNs. Optimization results show that 9-bit customized floating point is sufficient for the training of ResNet-20 on CIFAR-10 dataset with less than 1% accuracy loss. Based on the reduced bit-width, a computational unit based on Quad-Multiplier Packing is proposed, which can save 87.5% DSP slices and 62.5% LUTs on Xilinx Kintex-7 platform compared to 32-bit floating point. |
Title | Sparse Ternary Connect: Convolutional Neural Networks Using Ternarized Weights with Enhanced Sparsity |
Author | *Canran Jin, Heming Sun, Shinji Kimura (Waseda University, Japan) |
Page | pp. 190 - 195 |
Keyword | Convolutional Neural Network, Computation Reduction, Accelerator, FPGA |
Abstract | Convolutional Neural Networks (CNNs) are indispensable in a wide range of tasks to achieve state-of-the-art results. In this work, we exploit ternary weights in both inference and training of CNNs and further propose Sparse Ternary Connect where kernel weights in float value are converted to 1, -1 and 0 with the controlled ratio of 0. The experimental evaluation on 2 datasets (CIFAR-10 and SVHN) shows that the proposed method can reduce considerable resource utilization with less than 0.5% accuracy loss. |
Title | A Deep Reinforcement Learning Framework for Optimizing Fuel Economy of Hybrid Electric Vehicles |
Author | Pu Zhao (Northeastern University, U.S.A.), *Yanzhi Wang (Syracuse University, U.S.A.), Naehyuck Chang (Korea Advanced Institute of Science and Technology, Republic of Korea), Qi Zhu (Northwestern University, U.S.A.), Xue Lin (Northeastern University, U.S.A.) |
Page | pp. 196 - 202 |
Keyword | Deep Reinforcement Learning, Hybrid Electric Vehicles, power control |
Abstract | Hybrid electric vehicles (HEVs) employ a hybrid propulsion system to combine the energy efficiency of an electric motor (EM) and the long driving range of an internal combustion engine (ICE), thereby achieving a higher fuel economy as well as convenience compared with conventional ICE vehicles. However, the relatively complicated powertrain structures of HEVs necessitate an effective power management policy to determine the power split between ICE and EM. In this work, we propose a deep reinforcement learning (DRL) framework for HEV power management with the aim of improving fuel economy. The DRL technique comprises an offline deep neural network construction phase and an online deep Q-learning phase. Unlike traditional reinforcement learning, DRL is capable of handling the high-dimensional state and action space in the actual decision-making process, making it suitable for the HEV power management problem. Enabled by the DRL technique, the derived HEV power management policy is close to optimal, fully model-free, and independent of a priori knowledge of driving cycles. Simulation results based on actual vehicle setup over real-world and testing driving cycles demonstrate the effectiveness of the proposed framework on optimizing HEV fuel economy. |
Title | Process Variation and Temperature Aware Adaptive Scrubbing for Retention Failures in STT-MRAM |
Author | Nour Sayed, Sarath Mohanachandran Nair, Rajendra Bishnoi, *Mehdi Tahoori (KIT - Karlsruhe Institute of Technology, Germany) |
Page | pp. 203 - 208 |
Keyword | STT-MRAM, Retention failure, Reliability, Scrubbing |
Abstract | Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is an emerging memory technology, which is seen as a promising replacement for CMOS based on-chip memories. It has several distinctive advantages such as non-volatility, high density, CMOS compatibility and scalability among others. However, retention failure has emerged as a major reliability concern for this technology due to the large variations in retention time because of process variations and temperature effects. The conventional solution to mitigate retention failures is to use scrubbing at regular intervals to prevent accumulation of errors, based on the worst case retention time of the memory array. However this leads to large performance and energy overheads. In this work, we propose a process variation and temperature aware scrubbing technique, where we cluster the cache lines into different groups based on their retention times and use different scrubbing intervals for each of these groups. In addition, the scrubbing interval is adjusted at runtime based on the operating temperature, to guarantee target error rate requirements. Our results show that for 512KB cache, a group size of 4 can reduce the performance and dynamic energy overheads of scrubbing by 97%, under the same error rate constraint. |
Title | PIMCH: Cooperative Memory Prefetching in Processing-In-Memory Architecture |
Author | *Sheng Xu, Ying Wang, Yinhe Han, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China) |
Page | pp. 209 - 214 |
Keyword | Process-In-Memory, Over-lock, Prefetching |
Abstract | High performance processors employ hardware data prefetching mechanisms to reduce the negative performance impact of large main memory latencies. However, current prefetching methods cause significant slowdown in Process-In-Memory (PIM) architectures in the form of stalls. We identify that most of these stalls are caused by over-lock of prefetched data. In this paper, we propose a simple prefetching mechanism called PIMCH that enables cooperative memory prefetching in Processing-In-Memory architectures. |
Title | CAMO: A Novel Cache Management Organization for GPGPUs |
Author | *Debiprasanna Sahoo, Swaraj Sha, Manoranjan Satpathy (Indian Institute of Technology, Bhubaneswar, India), Madhu Mutyam (Indian Institute of Technology, Madras, India), Laxmi Narayan Bhuyan (University of California, Riverside, U.S.A.) |
Page | pp. 215 - 220 |
Keyword | GPGPU, DRAM, Cache |
Abstract | GPGPUs are now commonly used as co-processors of CPUs for the computation of data parallel and throughput-intensive algorithms. However, memory available in GPGPUs is limited for many applications of interest; there is a continuous demand for increased memory for such applications. Several techniques like multi-streaming or pinned memory are frequently employed to mitigate these issues to some extent. However, these techniques either suffer from latency overhead or increase programming complexity. GPUdmm uses GPU DRAM as a cache of CPU; key problems in this design are inefficient memory access data-path and tag access overhead. In this context, we present CAMO, a novel cache memory organization for GPGPUs which addresses the limitations of the pinned memory technique and GPUdmm. First, it uses GPU DRAM as a victim cache of LLC that improves the performance by delivering data faster to the SMs. Second, it uses ATCache, a CPU based DRAM cache tag management technique. ATCache reduces the number of DRAM cache accesses. We implement CAMO within the GPGPU-Sim framework and show that its performance -- when compared with pinned memory -- increases by a factor of 1.87x on average and by up to 4.67x at peak. In addition, CAMO outperforms GPUdmm by 15.9% on average, with a maximum speedup of 80%. |
Title | Process Variation Aware Data Management for Magnetic Skyrmions Racetrack Memory |
Author | *Fan Chen (Duke University, U.S.A.), Zheng Li (University of Pittsburgh, U.S.A.), Wang Kang, Weisheng Zhao (Beihang University, China), Hai Li, Yiran Chen (Duke University, U.S.A.) |
Page | pp. 221 - 226 |
Keyword | Skyrmions, Racetrack, LLC |
Abstract | Skyrmions racetrack memory (SKM) has been identified as a promising candidate for future on-chip cache. Similar to many other nanoscale technologies, process variations also adversely impact the reliability and performance of SKM cache. In this work, we propose the first holistic solution for employing SKM as last-level caches. We first present a novel SKM cache architecture and a physical-to-logic mapping scheme based on our comprehensive analysis on working mechanism of SKM. We then model the impact of process variations on SKM cache performance. By leveraging the developed model, we propose a process variation aware data management technique to minimize the performance degradation of SKM cache incurred by process variations. Experimental results show that the proposed SKM cache can achieve a geometric mean of 1.28× IPC improvement, 2× density increase, and 23% energy reduction compared to Domain Wall racetrack memory (DWM) under the same area constraint across 15 workloads. In addition, our dynamic data management technique can further improve the system IPC by 25% w.r.t. the worst-case design. |
Title | Optimizing Dynamic Mapping Techniques for On-Line NoC Test |
Author | Shuyan Jiang, *Qiong Wu, Shuyu Chen, Junshi Wang (University of Electronic Science and Technology of China, China), Masoumeh Ebrahimi (KTH Royal Institute of Technology, Sweden), Letian Huang, Qiang Li (University of Electronic Science and Technology of China, China) |
Page | pp. 227 - 232 |
Keyword | Network-on-Chip, mapping algorithm, intermittent fault, on-line testing |
Abstract | With the aggressive scaling of submicron technology, intermittent faults are becoming one of the limiting factors in achieving high reliability in Network-on-Chip (NoC). Increasing test frequency is necessary to detect intermittent faults, which in turn interrupts the execution of applications. On the other hand, the main goal of traditional mapping algorithms is to allocate applications to the NoC platform, ignoring the test time requirement. In this paper, we propose a novel testing-aware mapping algorithm (TAMA) for NoC, targeting intermittent faults on the paths between crossbars. In this approach, the idle paths are identified and tested when the application is mapped to the platform. The remaining paths are tested if there is enough time between when the application leaves the platform and when a new application enters it. The mapping algorithm is tuned to give a higher priority to the tested paths in the next application mapping. This leaves enough time to test the paths that have not been tested in the expected time. Experimental results show that the proposed testing-aware mapping algorithm leads to a significant improvement over FF, NN, CoNA, and WeNA. |
Title | On Enabling Diagnosis for 1-Pin Test Fails in an Industrial Flow |
Author | *Daniel Tille, Benedikt Gottinger, Ulrike Pfannkuchen (Infineon Technologies AG, Germany), Helmut Graeb, Ulf Schlichtmann (Technical University of Munich, Germany) |
Page | pp. 233 - 238 |
Keyword | DFT, Test, Diagnosis, 1-Pin Test, MISR |
Abstract | The 1-Pin Test concept has proven to be beneficial for test cost reduction. By compacting test responses into a signature and reading them out at test end, test parallelism can be increased significantly. This reduces the test time and thus test cost. Especially cost-sensitive devices, e.g. IoT end nodes, profit. A drawback of this method is the limited capability of diagnosis due to the lack of cycle-accurate PASS/FAIL information. In this paper, we present a new approach to tackle this challenge. It enables the use of an industrial diagnosis flow for fails that occurred during 1-Pin Test. For this purpose, we propose failing vector and failing cycle analysis techniques. Our approach is fault model independent and not limited to a single fault assumption. We mitigate the aliasing problem by masking. The effectiveness of our approach is shown on an investigation of real silicon fails in industrial designs. |
Title | Approximation-aware Testing for Approximate Circuits |
Author | Arun Chandrasekharan, Stephan Eggersglüß, *Daniel Große, Rolf Drechsler (University of Bremen, Germany) |
Page | pp. 239 - 244 |
Keyword | Approximate Computing, Yield, Test |
Abstract | A wide range of applications significantly benefit from the Approximate Computing (AC) paradigm in terms of speed or power reduction. AC achieves this by tolerating errors in the design. These errors are introduced into the design either manually by the designer or by approximate synthesis approaches. From here, the standard design flow is taken. Hence, the manufactured AC chip is eventually tested for production errors using well established fault models. To be precise, if the test for a test pattern fails, the AC chip is sorted out. However, from a general perspective this procedure results in throwing away chips which are perfectly fine taking into account that the considered fault (i.e. physical defect that leads to the error) can still be tolerated because of approximation. This can lead to a significant amount of yield loss. In this paper, we present an approximation-aware test methodology which can be easily integrated into the regular test flow. It is based on a pre-process to identify approximation-redundant faults. By this, we remove all potential faults that no longer need to be tested because they can be tolerated under the given error metric. Our experimental results and case studies on a wide variety of benchmark circuits show a significant potential for yield improvement. |
Title | A Channel-Sharable Built-In Self-Test Scheme for Multi-Channel DRAMs |
Author | *Kuan-Te Wu, Jin-Fu Li (National Central University, Taiwan), Chin-Yen Lo, Jenn-Shiang Lai, Ding-Ming Kwai, Yung-Fa Chou (ITRI, Taiwan) |
Page | pp. 245 - 250 |
Keyword | Multi-channel DRAM, BIST, Test, March test |
Abstract | Various multi-channel dynamic random access memories (MC-DRAMs) have been proposed to meet the demand for high bandwidth. In this paper, we propose a channel-sharable built-in self-test (BIST) scheme for MC-DRAMs. The BIST can apply test patterns and evaluate test responses for multiple channels simultaneously regardless of the difference in read/write latency among the channels. Therefore, the proposed BIST can reduce the test time. Our simulation cases show that the proposed BIST scheme can achieve about 14% test time reduction in comparison with an existing conventional shared BIST scheme for a two-channel 1G-bit DRAM, at about 0.003% additional area cost. |
Title | Concerted Wire Lifting: Enabling Secure and Cost-Effective Split Manufacturing |
Author | *Satwik Patnaik (New York University, U.S.A.), Johann Knechtel, Mohammed Ashraf, Ozgur Sinanoglu (New York University Abu Dhabi, United Arab Emirates) |
Page | pp. 251 - 258 |
Keyword | Split manufacturing, Hardware security, Wire lifting |
Abstract | Here we advance the protection of split manufacturing (SM)-based layouts through the judicious and well-controlled handling of interconnects. Initially, we explore the cost-security trade-offs of SM, which are limiting its adoption. Aiming to resolve this issue, we propose effective and efficient strategies to lift nets to the BEOL. Towards this end, we design custom "elevating cells" which we also provide to the community. Further, we define and promote a new metric, Percentage of Netlist Recovery (PNR), which can quantify the resilience against gate-level theft of intellectual property (IP) in a manner more meaningful than established metrics. Our extensive experiments show that we outperform the recent protection schemes regarding security. For example, we reduce the correct connection rate to 0% for commonly considered benchmarks, which is a first in the literature. Besides, we induce reasonably low and controllable overheads on power, performance, and area (PPA). At the same time, we also help to lower the commercial cost incurred by SM. |
Title | A Conflict-Free Approach for Parallelizing SAT-Based De-Camouflaging Attacks |
Author | *Xueyan Wang, Qiang Zhou, Yici Cai (Tsinghua University, China), Gang Qu (University of Maryland, College Park, U.S.A.) |
Page | pp. 259 - 264 |
Keyword | Circuit Camouflaging, SAT-based Attack, Parallelization, Reverse Engineering, Hardware Security |
Abstract | As one of the most effective proactive countermeasures against reverse engineering, circuit camouflaging has emerged as a hot research topic, and it is becoming a mature technology with the development of various de-camouflaging attacks. Among them, the SAT-based method is the most powerful one to defeat circuit camouflaging. However, SAT-based attacks have a scalability problem due to the complexity of the underlying SAT solvers, and a straightforward approach to parallelizing SAT-based attacks will fail. In this paper, we propose a two-level partition method (independent module partitioning and k-medoids clustering), together with a novel conflict avoidance strategy to solve the problem. Experimental results on the OpenSparc T1 microprocessor controller demonstrate that our approach can on average reduce the scales of the SAT formulas by more than 50% and achieve a 3.6x speedup on the best-known SAT-based de-camouflaging tool. |
Title | A Practical Split Manufacturing Framework for Trojan Prevention via Simultaneous Wire Lifting and Cell Insertion |
Author | *Meng Li (University of Texas at Austin, U.S.A.), Bei Yu (Chinese University of Hong Kong, Hong Kong), Yibo Lin, Xiaoqing Xu, Wuxi Li, David Z. Pan (University of Texas at Austin, U.S.A.) |
Page | pp. 265 - 270 |
Keyword | Split Manufacturing, Trojan Prevention, K-Isomorphism, Simultaneous Wire Lifting and Node/Wire Insertion |
Abstract | Trojans and backdoors inserted by untrusted foundries have become serious threats to hardware security. Split manufacturing is proposed to prevent Trojan insertion proactively. Existing methods depend on wire lifting to hide partial circuit interconnections, which usually suffer from large overhead and lack of security guarantee. In this paper, we propose a novel split manufacturing framework that not only guarantees to achieve the required security level but also allows for a drastic reduction of the introduced overhead. In our framework, insertion of dummy circuit cells and wires is considered simultaneously with wire lifting. To support cell and wire insertion, we propose a new security criterion, and further derive its sufficient condition to avoid computation intensive operations in traditional methods. Then, for the first time, a novel mixed integer linear programming formulation is proposed to simultaneously consider cell and wire insertion together with wire lifting, which significantly enlarges the design space to guarantee the realization of the sufficient condition under the security requirements and overhead constraints. With extensive experimental results, our framework demonstrates much better efficiency, overhead reduction, and security guarantee compared with existing methods. |
Title | A Comparative Investigation of Approximate Attacks on Logic Encryptions |
Author | Yuanqi Shen, Amin Rezaei, *Hai Zhou (Northwestern University, U.S.A.) |
Page | pp. 271 - 276 |
Keyword | Logic Encryption, Approximate Attack, Error Probability |
Abstract | Logic encryption is an important hardware protection technique that adds extra keys to lock a given circuit. With recent discovery of the effective SAT-based attack, new enhancement methods such as SARLock and Anti-SAT have been proposed to thwart the SAT-based and similar exact attacks. Since these new techniques all have very low error rate, approximate attacks such as Double DIP and AppSAT have been proposed to find an almost correct key with low error rate. However, measuring the performance of an approximate attack is extremely challenging, since exact computation of the error rate is very expensive, while estimation based on random sampling has low confidence. In this paper, we develop a suite of scientific encryption benchmarks where a wide range of error rates are possible and the error rate can be found out by simple eyeballing. Then, we conduct a thorough comparative study on different approximate attacks, including AppSAT and Double DIP. The results show that approximate attacks are far away from closing the gap and more investigations are needed in this area. |
Wednesday, January 24, 2018 |
Title | (Keynote Address) Quality, Schedule and Cost: Design Technology and the Last Semiconductor Scaling Levers |
Author | Andrew B. Kahng (UCSD, U.S.A.) |
Abstract | Lateral scaling in semiconductor manufacturing and device architecture will be extremely challenging after the foundry 5nm/3nm nodes. Thus, continuing the Moore's-Law trajectory of value scaling will require new (and perhaps final) levers. There are three basic types of such levers. The first is quality: improved design tools and methods must recover IC design quality that was left on the table as the industry rode scaling in the past decade's "race to the end of the roadmap". The second is schedule: the temporal axis of Moore's Law (i.e., "one week is one percent") must become a focus for delivery of new, "design-based equivalent scaling" value. The third is cost: new mechanisms must be found to reduce the cost of IC design infrastructure as well as the cost and difficulty of the IC design process itself. In this talk, I will describe these levers and how the EDA community can contribute to their realization for future semiconductor design. |
Title | An Ultra-Low-Noise Differential Relaxation Oscillator based on a Swing-Boosting Scheme |
Author | Junghyup Lee, *Arup George (Daegu Gyeongbuk Institute of Science and Technology, Republic of Korea), Minkyu Je (Korea Advanced Institute of Science and Technology, Republic of Korea) |
Page | pp. 277 - 278 |
Keyword | relaxation oscillator, low-noise, differential, swing-boosting |
Abstract | This paper presents an ultra-low-noise differential relaxation oscillator implemented in a 0.18-μm standard CMOS process. With an output frequency of 10.5 MHz, the proposed oscillator achieves 162.1 dBc/Hz FOM at 100 kHz offset and 157.7 dBc/Hz FOM at 1 kHz offset by employing a swing-boosting scheme. It also achieves 9.86 ps rms period jitter, corresponding to 0.01% relative jitter, which is five times lower than that of prior works. |
Title | A Nonvolatile Flip-Flop-Enabled Cryptographic Wireless Authentication Tag with Per-Query Key Update and Power-Glitch Attack Countermeasures |
Author | Chiraag S. Juvekar (MIT, U.S.A.), Joyce Kwong (Texas Instruments, U.S.A.), *Hyung-Min Lee (Korea University, Republic of Korea), Anantha P. Chandrakasan (MIT, U.S.A.) |
Page | pp. 279 - 280 |
Keyword | authentication, cryptographic, energy backup, wireless power and data transfer, regulating voltage multiplier |
Abstract | A cryptographic wireless authentication tag, which implements PRNG and authenticated encryption modes, is presented. Key-update is used to limit side-channel leakage, and FeRAM-based NVDFFs with a fully-integrated energy backup mitigate power-glitch attacks. The 130nm CMOS tag harvests inductive power from a 433MHz link via a regulating voltage multiplier, supplying 1.5V for AC input >0.55V with <1.1% line/load regulation. Pulse-based wireless telemetry is used to reduce dead-time. The full system operation, including the tag, reader, and server protocol, is demonstrated. |
Title | A 42nJ/conversion On-Demand State-of-Charge Indicator for Miniature IoT Li-ion Batteries |
Author | *Junwon Jeong (Korea University, Republic of Korea), Seokhyeon Jeong (University of Michigan, U.S.A.), Chulwoo Kim (Korea University, Republic of Korea), Dennis Sylvester, David Blaauw (University of Michigan, U.S.A.) |
Page | pp. 281 - 282 |
Keyword | SOC indication, Fuel gauge, Battery, Low power, IoT |
Abstract | An energy efficient State-of-Charge (SOC) indication algorithm and integrated system for small IoT batteries are introduced in this paper. The system is implemented in a 180-nm CMOS technology. Based on a key finding that small Li-ion batteries exhibit a linear dependence between battery voltage and load current, we propose an instantaneous linear extrapolation (ILE) algorithm and circuit allowing on-demand estimation of SOC. Power consumption is 42nW and maximum SOC indication error is 1.7%. |
Title | A Supply Noise Insensitive PLL with a Rail-to-Rail Swing Ring Oscillator and a Wideband Noise Suppression Loop |
Author | *Dongin Kim, SeongHwan Cho (KAIST, Republic of Korea) |
Page | pp. 283 - 284 |
Keyword | PLLs, power supply noise, ring oscillator, wideband noise suppression |
Abstract | In this paper, we present a supply noise insensitive digital phase-locked loop using a wideband noise suppression loop around the oscillator that does not require any calibration or additional settling time. Unlike previous techniques, the proposed approach suppresses supply noise without reducing the voltage headroom of the oscillator. The proposed dual loop PLL is implemented in 65nm CMOS, achieving an average spur suppression of 30dB near PLL loop bandwidth, while consuming 2.73mW at 3.2GHz output. |
Title | A Dual-Output SC Converter with Dynamic Power Allocation for Multi-Core Application Processors |
Author | Junmin Jiang (The Hong Kong University of Science and Technology, Hong Kong), *Yan Lu (University of Macau, Macau), Xun Liu, Wing-Hung Ki, Philip K. T. Mok (The Hong Kong University of Science and Technology, Hong Kong), Seng-Pan U, Rui P. Martins (University of Macau, Macau) |
Page | pp. 285 - 286 |
Keyword | SC Converter, DC-DC Converter, Dual output, Power Allocation |
Abstract | A fully integrated single-input dual-output switched-capacitor converter with dynamic power-cell allocation for application processors is presented in this summary. The power cells can be dynamically allocated according to the loads, and the efficiency is improved by 4.8%. A dual-path voltage-control oscillator (VCO) that works independently of the power-cell allocation is proposed to achieve a fast and stable regulation loop. The converter achieved peak efficiency of 83.3% and maximum combined load-currents of 100mA while maintaining minimized cross regulation. |
Title | 12Gb/s Over Four Balanced Lines Utilizing NRZ Braid Clock Signaling with 100% Data Payload and Spread Transition Scheme for 8K UHD Intra-Panel Interface |
Author | Yeonho Lee, *Yoonjae Choi, Chulwoo Kim (Korea University, Republic of Korea) |
Page | pp. 287 - 288 |
Keyword | Wireline, Transceiver, Efficient Signaling, Intra-panel interface, Display |
Abstract | This paper presents a Braid clock signaling scheme with 100% data payload and a spread transition scheme. Braid clock signaling has an NRZ signaling margin without any dummy clock bits. This paper also describes spread transition schemes for low EMI radiation. The effective data bandwidth is increased by 11.1% with the 500% highly embedded transitions. With the same RX voltage margin, the power required for termination is 5.4 times smaller than that of multi-level signaling. |
Title | A Digital SC Converter with High Efficiency and Low Voltage Ripple |
Author | Junmin Jiang, *Wing-Hung Ki (The Hong Kong University of Science and Technology, Hong Kong), Yan Lu (University of Macau, Macau) |
Page | pp. 289 - 290 |
Keyword | SC Converter, DC-DC Converter, Switched-Capacitor, High-Efficiency, Low voltage ripple |
Abstract | A switched-capacitor DC-DC converter with a low output voltage ripple and high efficiency is presented in this summary. To achieve a wide input and output voltage range, a 3-clock-phase operation is proposed to achieve 6 voltage conversion ratios (VCRs) with only two discrete flying capacitors. A digital ripple reduction scheme is utilized to achieve up to four times reduction in output voltage ripple. The digital design also reduces the design complexity. The converter can deliver a 250mW maximum power to a wide output range of 0.5 V to 3 V with an input range of 1.6 V to 3.3 V, and achieves a peak efficiency of 91%. |
Title | A Reconfigurable SIMO System with 10-Output Dual-Bus DC-DC Converter using the Load Balancing Function in Group Allocator for Diversified Load Condition |
Author | *Se-Un Shin, Sang-Hui Park, Gyu-Hyeong Cho (KAIST, Republic of Korea) |
Page | pp. 291 - 292 |
Keyword | SIMO, load balancing function, group allocator, diversified load condition |
Abstract | Portable electronic devices require multiple supplies with relatively large difference in load currents, which causes serious problems when adopting a single-inductor multiple-output (SIMO) DC-DC converter. To resolve these problems, this paper presents a SIMO system with 10-output dual-bus DC-DC converter having two buses for heavy and light load outputs, respectively. Due to such load balancing function, regulation issues which could occur under diversified load condition are resolved. |
Title | Real-time Depth Map Processor for Offset Aperture based Single Camera System |
Author | Hyeji Kim, Jinyeon Lim, *Yeongmin Lee, Woojin Yun, Young-Gyu Kim, Wonseok Choi, Asim Khan (Korea Advanced Institute of Science and Technology, Republic of Korea), Muhammad Umar Karim Khan, Said Homidov (Center for Integrated Smart Sensors, Republic of Korea), Hyun Sang Park (Division of Electrical Engineering, Kongju National University, Republic of Korea), Chong-Min Kyung (Korea Advanced Institute of Science and Technology, Republic of Korea) |
Page | pp. 293 - 294 |
Keyword | Depth extraction, Offset Aperture(OA), Low-power camera system, Real-time |
Abstract | This paper presents an Offset Aperture (OA) based single camera system and proposes an optimized vision processor, a new hardware architecture for fast, low-energy, and low-complexity depth extraction. The proposed design was fabricated in 110nm CMOS image sensor technology and supports 32-level depth resolution on 1920x1080 full HD images at 30fps, consuming 280.53mW from a 1.5V supply with only 2.8% bad classification. Low-complexity algorithms are employed to eliminate DRAM access, so the proposed OA architecture can be directly embedded with the CMOS image sensor and a commercial image processing chip. |
Title | Edge Pursuit Comparator with Application in a 74.1dB SNDR, 20KS/s 15b SAR ADC |
Author | *Minseob Shim (Korea University, Republic of Korea), Seokhyeon Jeong (University of Michigan, U.S.A.), Paul Myers (Massachusetts Institute of Technology, U.S.A.), Suyoung Bang (Intel Circuit Research Lab, U.S.A.), Junhua Shen (Analog Devices, U.S.A.), Chulwoo Kim (Korea University, Republic of Korea), Dennis Sylvester, David Blaauw, Wanyeong Jung (University of Michigan, U.S.A.) |
Page | pp. 295 - 296 |
Keyword | Edge-Pursuit Comparator, EPC, Oscillator Collapse, SAR ADC, Common-Mode CDAC |
Abstract | This paper presents a new energy-efficient ring oscillator collapse-based comparator, called the edge-pursuit comparator (EPC), and demonstrates it in a 15-bit SAR ADC. The comparator automatically adjusts its performance according to its input difference without any control, eliminating unnecessary energy spent on coarse comparisons. The employed SAR ADC supplements a 10-bit differential main CDAC with a 5-bit common-mode CDAC, which uses common to differential gain tuning to improve linearity by reducing the effect of switch parasitic capacitance. A test chip fabricated in 40nm CMOS shows 74.12 dB SNDR and 173.4 dB FOMs. The comparator consumes 104 nW, with the full ADC consuming 1.17 µW. |
Title | A 300-μW Audio ΔΣ Modulator with 100.5-dB DR Using Dynamic Bias Inverter |
Author | *Sangwoo Lee, Woojin Jo, Seungwoo Song, Youngcheol Chae (Yonsei University, Republic of Korea) |
Page | pp. 297 - 298 |
Keyword | Delta-sigma modulator, dynamic bias inverter, inverter-based integrator, micropower audio, PVT variations |
Abstract | A micropower audio delta-sigma modulator is presented for mobile applications. The modulator employs dynamic bias inverter based integrators, which maximizes both gm/ID ratio and slew rate while compensating PVT variations. A prototype modulator implemented in a 0.18μm CMOS process features a single-bit third-order topology. The modulator achieves 97.7dB SNDR, 98.6dB SNR, 100.5dB DR, and 105.8dB SFDR in a 20kHz audio band, while consuming only 300μW from a 1.8V supply. This corresponds to a state-of-the-art FoM of 178.7dB. |
Title | An External-Capacitor-Less High PSR Low-Dropout Regulator Using an Adaptive Supply-Ripple Cancellation Technique to the Body-Gate |
Author | *Younghyun Lim, Jeonghyun Lee, Suneui Park, Jaehyouk Choi (UNIST, Republic of Korea) |
Page | pp. 299 - 300 |
Keyword | LDO, High PSR, External-capacitor-less |
Abstract | This work presents an external-capacitor-less low-dropout regulator (LDO) that provides high power-supply rejection (PSR) at all low-to-high frequencies. Using the proposed adaptive supply-ripple cancellation (ASRC) technique, where the ripples copied from the supply are injected adaptively to the body-gate, the PSRR-hump of conventional LDOs can be suppressed significantly. The proposed LDO was fabricated in a 65-nm CMOS process, and the measured PSRRs were less than –36dB at all frequencies from 10kHz to 1GHz, despite changes in a load current and a dropout voltage. |
Title | A 230-260GHz Wideband Amplifier in 65nm CMOS Based on Dual-Peak Gmax-core |
Author | *Dae-Woong Park, Dzuhri Utomo (KAIST, Republic of Korea), Jong-Phil Hong (Chungbuk National University, Republic of Korea), Sang-Gug Lee (KAIST, Republic of Korea) |
Page | pp. 301 - 302 |
Keyword | CMOS, Amplifier, Wideband, Gmax |
Abstract | A dual-peak maximum achievable gain core design technique is proposed. It has been adopted into a 4-stage wideband amplifier. Implemented in 65nm CMOS, the amplifier achieves a 3dB bandwidth of 30GHz (230~260GHz), a gain of 12.4 ± 1.5dB, and a peak PAE of 1.6% while dissipating 23.8mW, which corresponds to the widest bandwidth and highest gain per stage among reported CMOS amplifiers operating above 200GHz. |
Title | Injection-Locked Frequency Multiplier with a Continuous Frequency-Tracking Loop for 5G Transceivers |
Author | *Seyeon Yoo, Seojin Choi, Juyeop Kim, Heein Yoon, Yongsun Lee, Jaehyouk Choi (Ulsan National Institute of Science and Technology, Republic of Korea) |
Page | pp. 303 - 304 |
Keyword | Injection-locked, frequency multiplier, frequency-tracking loop, millimeter-wave, 5G transceivers |
Abstract | This work presents a low-phase noise (PN) mm-wave injection-locked frequency multiplier (ILFM) using an ultra-low power frequency-tracking loop (FTL). Monitoring the averages of phase deviations rather than detecting the instantaneous values, the FTL consumed only 600μW to calibrate the mm-wave ILFM generating a frequency between 27 and 30GHz. While consuming low power, the proposed FTL effectively regulated the PN degradation, which was less than 2dB up to 100MHz offset across VT variations. |
Title | A 6.9mW 120fps 28×50 Capacitive Touch Sensor for 1mm-φ Stylus Using Current-Driven ΔΣ ADCs |
Author | *Hyunseok Hwang, Hyeyeon Lee, Youngcheol Chae (Yonsei University, Republic of Korea) |
Page | pp. 305 - 306 |
Keyword | Touch sensor, Capacitive sensor, Current conveyer, ΔΣ ADC |
Abstract | This paper presents a 6.9mW 120fps 28×50-channel capacitive touch sensor for a 1mm-φ stylus. A current conveyor analog front-end enables current signaling before voltage conversion. A current-driven ΔΣ ADC directly interfaces to the differential current for digital conversion. A test IC is fabricated in a 0.18μm CMOS process. This work achieves an SNR of 41.7dB with a 1mm-φ stylus while consuming only 6.9mW. This results in an energy efficiency of 0.41nJ/step, which is more than a 4× improvement over state-of-the-art works. |
Title | A Switched-Loop-Filter PLL with Fast Phase-Error Correction Technique |
Author | *Yongsun Lee, Taeho Seong, Seyeon Yoo, Jaehyouk Choi (UNIST, Republic of Korea) |
Page | pp. 307 - 308 |
Keyword | PLL, jitter, spur, ILCM, phase noise |
Abstract | This work presents an ultra-low jitter, low-reference spur switched-loop-filter (SLF) PLL that uses a fast phase-error correction (FPEC) technique that emulates the phase-realignment mechanism of an injection-locked clock multiplier (ILCM). Despite a high multiplication factor (i.e., 64), the proposed SLF-PLL concurrently achieved ultra-low jitter and low reference spur. From the prototype that was fabricated using a 65-nm CMOS process, the RMS-jitter, the FOM, and the reference spur were measured as 378 fs, −242 dB, and −71 dBc, respectively. |
Title | A 9.3 nW All-in-One Bandgap Voltage and Current Reference Circuit using Leakage-based PTAT Generation and DIBL Characteristic |
Author | *Youngwoo Ji, Cheonhoo Jeon, Hyunwoo Son, Byungsub Kim, Hong-June Park, Jae-Yoon Sim (POSTECH, Republic of Korea) |
Page | pp. 309 - 310 |
Keyword | Bandgap reference |
Abstract | This paper presents a sub-10 nW bandgap reference (BGR) circuit that implements both voltage and current references in one circuit. The BGR circuit was implemented with a 0.18-μm CMOS process and generates voltage and current references of 1.238 V and 6.64 nA while consuming 9.3 nW. The voltage and current references show standard deviations of 0.43 % and 1.19 % with temperature coefficients of 26 ppm/°C and 283 ppm/°C, respectively. |
Title | A 16.6-pJ/b 150-Mb/s Body-Channel Communication Transceiver with Decision Feedback Equalization Improving >200x Area Efficiency |
Author | *Ji-Hoon Lee (Korea Institute of Science and Technology, Republic of Korea), Kwangmin Kim, Minsoo Choi, Jae-Yoon Sim, Hong-June Park, Byungsub Kim (Pohang University of Science and Technology, Republic of Korea) |
Page | pp. 311 - 312 |
Keyword | Body-channel Communication, Human-body communication, Intra-body communication, body-coupled communication, Decision Feedback Equalizer |
Abstract | This paper presents a body-channel communication (BCC) transceiver adopted with decision feedback equalization (DFE). The proposed transceiver, fabricated in a 65-nm CMOS process, achieves reliable (BER < 10⁻⁶) data rates of 150 Mb/s (16.6 pJ/b) and 100 Mb/s (23.5 pJ/b) over 20-cm and 1.3-m channels on human limbs. The transceiver occupies a total core area of 5580 μm², which is less than 1% compared to any previously-presented work. |
Title | Low Power FSK Transceiver using ADPLL with Direct Modulation and Integrated SPDT for BLE Application |
Author | *Dongsoo Lee, Sungjin Kim, Seongjin Oh, Gyusub Won, Thi kim nga Truong, Imran Ali, Hamed Abbasizadeh, Behnam Samadpoor Rikan, Kang-Yoon Lee (Sungkyunkwan University, Republic of Korea) |
Page | pp. 313 - 314 |
Keyword | ADPLL, FSK Transceiver, BLE, low power, SPDT |
Abstract | This paper presents a low power FSK transceiver with ADPLL based on direct modulation and an integrated SPDT switch for Bluetooth low energy applications. To ensure that the proposed low power transceiver can operate at a 1 Mbps data rate, FSK modulation is implemented using an ADPLL with a direct modulation technique. The SPDT switch is integrated to share the antenna and matching network between the transmitter and receiver, thus minimizing the system cost by reducing external components. The transceiver is implemented using 1P6M 55-nm CMOS technology. The die area of the transceiver with DC-DC converter is 1.79 mm². The power consumption of the Tx and Rx is 6 and 5 mW, respectively. The noise figure of Rx is up to 6.8 dB with respect to channel frequencies. The phase noise of the ADPLL is -84.7 and -118.9 dBc/Hz at 100 kHz and 1 MHz offset from 2.44 GHz, respectively. |
Title | A 2.22 Gbps High-Throughput NB-LDPC Decoder in 65nm CMOS with Aggressive Overlap Scheduling |
Author | Injun Choi (Silicon Works, Republic of Korea), *Ji-Hoon Kim (Seoul National University of Science and Technology, Republic of Korea) |
Page | pp. 315 - 316 |
Keyword | NB-LDPC, Channel Coding, VLSI, Forward Error Correction |
Abstract | This paper presents a fully-overlapped non-binary low-density parity-check (NB-LDPC) decoder to improve the throughput performance. The early-bubble (e-bubble) check algorithm and the EVN overlap scheduling are proposed to reduce the iteration latency of the decoder. The proposed decoder for a (160, 80) NB-LDPC code over GF(64) achieved a throughput of 2.22 Gbps at a 625-MHz frequency. |
Title | Design of Resource Sharing Reconfigurable ΔΣ SAR-ADC |
Author | *Motomi Ishizuka, Kohei Yamada, Hiroki Ishikuro (Keio University, Japan) |
Page | pp. 317 - 318 |
Keyword | SAR, programmable, delta-sigma, resource sharing |
Abstract | This paper presents an ADC that is reconfigurable between a SAR-only mode and a delta-sigma-assisted mode. The delta-sigma-assisted mode enhances resolution. The proposed ADC shares a capacitor array among the SAR, the feedback DAC, and the integrator capacitor in the delta-sigma loop, which reduces the circuit size. The prototype ADC fabricated in 65-nm CMOS achieved an SNDR of 44.35 dB at 32 MS/s and a power consumption of 0.55 mW. The SNDR is improved to 62.9 dB in the delta-sigma-assisted mode. |
Title | A 2.4GHz, -102dBm-Sensitivity, 25kb/s, 0.466mW Interference Resistant BFSK Multi-Channel Sliding-IF ULP Receiver |
Author | *Hyun-Gi Seok, Oh-Yong Jung, Anjana Dissanayake, Sang-Gug Lee (KAIST, Republic of Korea) |
Page | pp. 319 - 320 |
Keyword | ULP, receiver, sensitivity, demodulation, IoT |
Abstract | This paper presents an ultra-low power, high-sensitivity, and interference-resistant receiver suitable for IoT applications. By the combination of sliding-IF based low-power down-conversion and relative-power-detection based FSK demodulation, the proposed receiver achieves multi-channel operation and minimizes power consumption. Cascaded N-path filter and 4th-order hybrid-PPF are adopted to improve the sensitivity and carrier-to-interference ratio. Implemented in a 65nm CMOS, the receiver achieves -102dBm sensitivity at 0.1% BER and 25kb/s data rate while consuming 466μW from a 0.6V supply. |
Title | Highly Sensitive Fingerprint Readout IC for Glass-Covered Mutual Capacitive Fingerprint Sensor |
Author | *Kyeongmin Park, Joohyeob Song, Franklin Bien (Ulsan National Institute of Science and Technology, Republic of Korea) |
Page | pp. 321 - 322 |
Keyword | fingerprint readout IC, glass-covered mutual capacitive fingerprint sensor, charge amplifier (CA) |
Abstract | This paper presents a highly sensitive fingerprint readout IC for a glass-covered mutual capacitive fingerprint sensor. To enhance the signal-to-noise ratio (SNR) in the presence of noise that is large relative to the signal, the proposed fingerprint readout IC uses a modulation and demodulation process, band-pass operation, and a differential sensing scheme. Furthermore, the proposed fingerprint readout IC interfaces with a 250 dpi glass-covered mutual capacitive fingerprint sensor, which is patterned with 42 transmitter (TX) electrodes and 32 receiver (RX) electrodes. The analog front end (AFE) achieves 42 dB SNR under 0.1 T (mm) cover glass and 38 dB SNR under 0.2 T (mm) cover glass. The test chip, fabricated in a 0.18 μm CMOS process, consumes 28 mW from a 3.3 V supply. |
Title | A 5.8 GHz DSRC Digitally Controlled CMOS RF-SoC Transceiver for China ETC |
Author | Huimin Liu, Xiongfei Qu, Lingling Cao (Tianjin University of Technology, China), Ruifeng Liu (RF Microelectronics Corp., China), Yuanzhi Zhang (Southern Illinois University Carbondale, U.S.A.), Meijuan Zhang, Xiaoqiang Li, Wenshen Wang (Tianjin University of Technology, China), *Chao Lu (Southern Illinois University Carbondale, U.S.A.) |
Page | pp. 323 - 324 |
Keyword | digital control, DSRC, Transceiver, RF-SoC, ETC |
Abstract | This paper presents a 5.8 GHz dedicated short range communication (DSRC) CMOS RF-SoC transceiver with a digitally controlled RF architecture for the China electronic toll collection (ETC) system. The operation of key RF blocks, such as the ASK modulator, power amplifier, LNA, and mixer, is directly controlled by the digital baseband. Compared with state-of-the-art designs in the literature, this work demonstrates remarkable advantages in design simplicity, Tx output peak power, adjacent channel power ratio (ACPR), dynamic range, occupied bandwidth (OBW), bit error rate (BER), and so on. |
Title | A Low-Power Wide Dynamic-Range Current Readout Circuit for Biosensors |
Author | *Hyunwoo Son, Hwasuk Cho, Jahyun Koo, Youngwoo Ji, Byungsub Kim, Hong-June Park, Jae-Yoon Sim (Pohang University of Science and Technology, Republic of Korea) |
Page | pp. 325 - 326 |
Keyword | biosensor applications, ultra-low-power, current-to-digital converter, current readout circuit, delta-sigma modulation |
Abstract | This paper presents an amplifier-less and digital-intensive current-to-digital converter for biosensors. The proposed circuit achieves first-order noise shaping of the quantization error without any continuous-time feedback circuit. It also minimizes static power consumption by employing a single-ended current-steering digital-to-analog converter (DAC) that draws only the same current as the input. The effect of dynamic switching noise becomes an input-independent constant by adopting a switching averaging algorithm. The circuit, implemented in 0.35μm CMOS, converts an input range of 2.8μA to a 15b digital output in about 4ms while consuming 16.8μW. |
Title | An Efficient Fixed-point Arithmetic Processor Using a Hybrid CORDIC Algorithm |
Author | *Hong-Thu Nguyen, Xuan-Thuan Nguyen, Cong-Kha Pham (The University of Electro-Communications, Japan) |
Page | pp. 327 - 328 |
Keyword | CORDIC, arithmetic, fixed-point |
Abstract | This article introduces a CORDIC-based arithmetic processor that utilizes both angle recoding (ARD) and scaling-free (SCFE) CORDIC algorithms. The proposed processor can compute the sine, cosine, hyperbolic sine, hyperbolic cosine, and multiplication functions. Its hardware architecture, implemented in 180 nm CMOS technology, operates at 100 MHz and consumes 12.96 mW. In comparison with some previous work, the design is a good choice for high-throughput low-energy applications. |
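For readers unfamiliar with CORDIC, the following is a minimal sketch of the conventional rotation-mode algorithm computing sine and cosine with shift-and-add steps on integers. It is not the hybrid ARD/SCFE scheme of the paper above; the fixed-point format (`FRAC_BITS`), the iteration count, and the function name `cordic_sin_cos` are assumptions chosen only for illustration.

```python
import math

FRAC_BITS = 24            # assumed fixed-point format: 24 fractional bits
ITERATIONS = 20           # assumed number of micro-rotations
ONE = 1 << FRAC_BITS

# Lookup table of arctan(2^-i) in fixed point.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# Conventional CORDIC scales the vector by a constant gain; start from 1/gain.
_gain = 1.0
for _i in range(ITERATIONS):
    _gain *= math.sqrt(1.0 + 2.0 ** (-2 * _i))
INV_GAIN = round(ONE / _gain)


def cordic_sin_cos(angle):
    """Return (sin, cos) of angle in radians for |angle| <= pi/2."""
    x, y, z = INV_GAIN, 0, round(angle * ONE)
    for i in range(ITERATIONS):
        if z >= 0:
            # Rotate counter-clockwise by arctan(2^-i) using shifts and adds.
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            # Rotate clockwise by arctan(2^-i).
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y / ONE, x / ONE


sin30, cos30 = cordic_sin_cos(math.pi / 6)
print(f"sin(30 deg) ~ {sin30:.5f}, cos(30 deg) ~ {cos30:.5f}")
```

The loop body uses only additions, subtractions, and right shifts, which is why CORDIC maps well to compact fixed-point hardware.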
Title | A 2.4pJ/bit, 6.37Gb/s SPC-enhanced BC-BCH Decoder in 65nm CMOS for NAND Flash Storage Systems |
Author | *Jaehwan Jung, In-Cheol Park (KAIST, Republic of Korea), Youngjoo Lee (POSTECH, Republic of Korea) |
Page | pp. 329 - 330 |
Keyword | NAND flash memory, Error correction code, Decoder architecture, VLSI design |
Abstract | This paper presents an energy-efficient block-concatenated BCH (BC-BCH) decoder which can achieve superior decoding performance for NAND flash storage systems. To enhance the error-correcting capability, an additional decoding step with a single parity-check (SPC) block is newly employed. A novel memory-based syndrome updating method effectively improves the energy efficiency as well as the decoding latency. Using the proposed methods, a prototype chip is implemented to decode a (36443, 32768) BC-BCH code in a 65nm CMOS process. The proposed decoder provides a decoding throughput of 6.37Gb/s and an efficiency of 2.4pJ/bit, making it superior to the state-of-the-art hard-decision decoders for storage. |
Title | Exploring Energy and Accuracy Tradeoff in Structure Simplification of Trained Deep Neural Networks |
Author | *Boyu Zhang, Azadeh Davoodi, Yu-Hen Hu (University of Wisconsin-Madison, U.S.A.) |
Page | pp. 331 - 336 |
Keyword | Deep Neural Networks, Energy, Structure Simplification, Pruning |
Abstract | This paper presents a structure simplification procedure that allows efficient energy and accuracy tradeoffs in implementation of trained deep neural networks (DNNs). This structure simplification procedure identifies and eliminates redundant neurons in any layer of the DNN based on the trained weights connected from these neurons. This procedure may be applied to all layers of a DNN. For each layer different configurations with Pareto-optimal accuracy and energy consumption are realized. Our work is the first to use energy-accuracy trade-offs to guide optimal structure realization of trained DNNs. After redundant neurons are discarded, the weights of remaining neurons will be updated using matrix multiplication without retraining. Yet, retraining may still be applied if desired to further fine tune the performance. In our experiments, we show energy-accuracy tradeoff provides clear guidance to achieve efficient realization of trained DNNs. We also observe significant implementation cost reductions with up to 33X in energy and 12X in memory while the performance (accuracy) loss is negligible. |
Title | Low Latency Parallel Implementation of Traditionally-Called Stochastic Circuits using Deterministic Shuffling Networks |
Author | *Zhiheng Wang, Soheil Mohajer, Kia Bazargan (University of Minnesota, Twin Cities, U.S.A.) |
Page | pp. 337 - 342 |
Keyword | Stochastic Computation, FPGA, Threshold Function |
Abstract | Stochastic Computing (SC) in recent years has been defined as a digital computation approach that operates on streams of random bits that represent probability values. In a bit-stream representing probability x, each bit has probability x of being 1. Using this simple assumption, SC can perform complex tasks with much smaller hardware footprints compared to conventional binary methods: e.g., a simple AND gate can perform multiplication between two uncorrelated bit-streams. Previous methods on SC circuits either relied on (1) randomness in the input bit streams, or (2) more recently, performing full convolution of deterministic streams to achieve exact computation results. The problem with the first method is that it introduces high random fluctuations and hence high variability in the results. The second method results in exponential increase in the length of the bit stream as circuit depth increases, making it not practical for complex applications. Both of these methods suffer from very long latencies and neither is readily suitable for parallel implementations. |
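The AND-gate multiplication mentioned in the abstract above can be illustrated with a small software model. This is only a sketch of the classical random-bit-stream form of stochastic computing, not the deterministic shuffling networks the paper proposes; the helper names (`to_bitstream`, `stream_value`, `sc_multiply`) and the seed are assumptions made for the example.

```python
import random


def to_bitstream(p, length, rng):
    """Encode probability p as a bit-stream: each bit is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stream_value(bits):
    """Decode a bit-stream back to the probability it represents."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two uncorrelated bit-streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


rng = random.Random(2018)   # fixed seed (hypothetical) for repeatability
x, y, n = 0.5, 0.25, 100_000
product = stream_value(sc_multiply(to_bitstream(x, n, rng),
                                   to_bitstream(y, n, rng)))
print(f"stochastic estimate of {x} * {y}: {product:.4f}")
```

The estimate fluctuates around the exact product from run to run, and tightening it requires longer streams; this is the variability and latency cost the abstract attributes to randomness-based methods.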
Title | Optimizing FPGA-based Convolutional Neural Networks Accelerator for Image Super-Resolution |
Author | *Jung-Woo Chang, Suk-Ju Kang (Sogang University, Republic of Korea) |
Page | pp. 343 - 348 |
Keyword | Convolutional neural networks accelerator, Deconvolutional neural networks accelerator, Image super-resolution, machine learning, computer architecture |
Abstract | Convolutional neural networks (CNNs) are widely used in various computer vision applications. Recently, there have been many studies on FPGA-based CNN accelerators to achieve high performance and power efficiency. Most of them have focused on CNN-based object detection algorithms, while research on image super-resolution has rarely been conducted. A fast super-resolution CNN (FSRCNN), a well-known CNN-based super-resolution algorithm, is a combination of multiple convolutional layers and a single deconvolutional layer. Since the deconvolutional layer generates high-resolution (HR) output feature maps from low-resolution (LR) input feature maps, it requires more execution cycles than the convolutional layer. In this paper, we propose a novel architecture for an FPGA-based CNN accelerator with efficient parallelization. We develop a method of transforming a deconvolutional layer into a convolutional layer (TDC), a new methodology for deconvolutional neural networks (DCNNs). There is a massive source of parallelism in the deconvolutional layer, where multiple outputs within the same output feature map are created from the same inputs. When this new parallelization technique is applied to the deconvolutional layer, it generates LR output feature maps in the same way as a convolutional layer. Thus, the performance of the accelerator increases without any additional hardware resources because the kernel size required to generate the LR output feature maps is smaller. In addition, if there is a DSP underutilization problem in the deconvolutional layer, in which some of the processors are idle, the proposed method solves it by allowing more output feature maps to be processed in parallel. Experimental results show that the proposed TDC method achieves up to 81 times higher throughput than the state-of-the-art DCNN accelerator with the same hardware resources. We also improve the speed by 7.8 times by processing all layers in the hourglass-type FSRCNN with inter-layer parallelism, without additional DSP usage. |
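The idea of rewriting a deconvolution as ordinary convolutions at low resolution can be shown in one dimension. The sketch below uses the standard decomposition of a strided transposed convolution into `stride` smaller convolutions whose outputs are interleaved; it is a simplified 1-D illustration under that assumption and is not taken from the paper's 2-D FPGA implementation. All function names are hypothetical.

```python
def deconv1d(x, k, stride):
    """Reference transposed convolution: scatter each input across the output."""
    out = [0] * ((len(x) - 1) * stride + len(k))
    for i, xi in enumerate(x):
        for j, kj in enumerate(k):
            out[i * stride + j] += xi * kj
    return out


def conv1d_full(x, k):
    """Ordinary full convolution computed at the input (low) resolution."""
    out = [0] * (len(x) + len(k) - 1)
    for i, xi in enumerate(x):
        for j, kj in enumerate(k):
            out[i + j] += xi * kj
    return out


def deconv1d_as_convs(x, k, stride):
    """Same result from `stride` small convolutions with interleaved outputs."""
    out = [0] * ((len(x) - 1) * stride + len(k))
    for phase in range(stride):
        sub_kernel = k[phase::stride]        # every stride-th kernel tap
        lr_map = conv1d_full(x, sub_kernel)  # one low-resolution output map
        for m, value in enumerate(lr_map):
            out[m * stride + phase] = value  # interleave into the HR output
    return out


x = [1, 2, 3]
k = [1, 10, 100, 1000]
print(deconv1d(x, k, 2) == deconv1d_as_convs(x, k, 2))
```

Each sub-kernel has roughly `len(k) / stride` taps and all phases read the same inputs, which matches the abstract's remarks that the required kernel size is smaller and that multiple outputs are created from the same inputs.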
Title | XORiM: A Case of In-Memory Bit-Comparator Implementation and Its Performance Implications |
Author | *Kaiwei Zou, Ying Wang, Huawei Li, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China) |
Page | pp. 349 - 354 |
Keyword | DRAM, PIM, XOR, bitwise, implementation |
Abstract | The resurrection of processing-in-memory (PIM) architectures is expected to address the ever-worsening memory wall issue in the big data era. In this work, we propose XORiM, an inexpensive PIM design to achieve fast bulk bitwise XOR operations in commodity DRAM devices for memory-intensive workloads. Instead of resorting to 3D integration or emerging memory technology, we reuse and adapt the peripheral circuits and row buffers in memory to enable within-DRAM data manipulation. The implemented mechanism can also be employed to conduct high-throughput bulk data operations including memory initialization, AND, OR, and INV. We present the detailed circuitry design and transistor-level simulation to evaluate the proposed method, and demonstrate the application of XORiM to realistic workloads by conducting full-system level simulation. The experimental results on data-intensive applications such as deduplication and data encryption show that about 1.5× and 5.1× overall performance benefits and 4.9× and 9.1× overall energy savings are achieved, respectively, by XORiM over conventional computing systems. |
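As a rough software illustration of why bulk XOR helps the two workloads named above: two blocks are identical exactly when their XOR is all zeros (deduplication), and XOR with a keystream is its own inverse (a building block of stream-style encryption). The sketch models only the logical operation, not the in-DRAM circuit; the function names and the toy repeating key are assumptions, and the toy key is not a secure cipher.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def is_duplicate(a: bytes, b: bytes) -> bool:
    """Two blocks are identical exactly when their XOR is all zeros."""
    return not any(xor_blocks(a, b))


def xor_cipher(data: bytes, keystream: bytes) -> bytes:
    """XOR with a keystream; applying it twice restores the data."""
    return xor_blocks(data, keystream)


page_a = bytes(range(16))
page_b = bytes(range(16))
page_c = bytes(reversed(range(16)))
key = bytes([0x5A] * 16)  # toy keystream, for illustration only

print(is_duplicate(page_a, page_b), is_duplicate(page_a, page_c))
print(xor_cipher(xor_cipher(page_a, key), key) == page_a)
```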
Title | Logic Synthesis for Energy-Efficient Photonic Integrated Circuits |
Author | Zheng Zhao, Zheng Wang, Zhoufeng Ying, Shounak Dhar, Ray T. Chen, *David Z. Pan (University of Texas at Austin, U.S.A.) |
Page | pp. 355 - 360 |
Keyword | Logic synthesis, Design automation, Optical device, Emerging technology, Signal integrity |
Abstract | The development of photonic integrated circuits (PICs) has made it possible to accomplish on-chip optical interconnects and computations. As a promising alternative to traditional CMOS circuits, optics has demonstrated the ability to realize ultra-high speed and low-power information processing and communications. In this work, we propose a logic synthesis methodology for PICs. For the first time, practical issues including the insertion losses from optical combiners and switches are considered. Two optimization techniques based on binary decision diagram, combiner elimination and coupler assignment, are proposed to improve the power efficiency for PICs. Experimental results of MCNC and IWLS combinational benchmarks showed our method could efficiently generate quality PICs with a 27.02X better optical power efficiency on average, and greatly reduce the optical power depletion and facilitate large-scale on-chip optical computation. |
Title | HieIM: Highly Flexible In-Memory Computing using STT MRAM |
Author | Farhana Parveen, Zhezhi He, Shaahin Angizi, *Deliang Fan (University of Central Florida, U.S.A.) |
Page | pp. 361 - 366 |
Keyword | In-memory computing, memory architecture, STT-MRAM, Magnetic Domain Wall Motion, Magnetic Tunnel Junction |
Abstract | In this paper we propose a Highly Flexible In-Memory (HieIM) computing platform using STT MRAM, which can be leveraged to implement Boolean logic functions without sacrificing memory functionality. It could pre-process data within memory to further reduce power hungry long distance communication between memory and processing units as in Von-Neumann computing system. HieIM can implement all the Boolean logic functions (AND/NAND, OR/NOR, XOR/XNOR) between any two cells in the same memory array, thus overcoming the 'operand locality' problem in contemporary in-memory computing platform designs. To investigate the performance of HieIM, we test in-memory bulk bit-wise Boolean logic operations using different vector datasets, which shows ~8X energy saving and ~5X speedup compared to recent DRAM based in-memory computing platform. We further implement an in-memory data encryption engine design based on HieIM as another case study. With AES algorithm, it shows 51.5% and 68.9% lower energy consumption compared to CMOS-ASIC and CMOL based implementations, respectively. |
Title | Performance Analysis on Structure of Racetrack Memory |
Author | *Hongbin Zhang (Tsinghua University, China), Chao Zhang (Peking University, China), Qingda Hu (Tsinghua University, China), Chengmo Yang (University of Delaware, U.S.A.), Jiwu Shu (Tsinghua University, China) |
Page | pp. 367 - 374 |
Keyword | Racetrack Memory, Main Memory |
Abstract | Racetrack Memory (RM) has recently attracted considerable attention from memory researchers. RM can achieve ultra-high storage density, fast access speed and non-volatility. Prior research has demonstrated that RM has the potential to serve as on-chip cache or main memory. However, RM offers more flexibility, and more difficulty, in the main memory design space because it has more device-level design parameters. The layout of a macro unit (MU) requires a trade-off among area, access performance and energy consumption, and its shift operation introduces an extra dimension to the design space. In this paper, we explore these design parameters and analyze their relationships in the memory design space at both the device level and the system level. Experimental results demonstrate regularity between design parameters and performance features. In addition, the results imply that different basic racetrack structures are suitable for different applications, and an optimized layout of the racetrack MU is suggested for application areas such as big data and IoT, which need cost-effective and energy-efficient memory, respectively. |
Title | Modeling of Biaxial Magnetic Tunneling Junction for Multi-level Cell STT-RAM Realization |
Author | Enes Eken, Ismail Bayram (University of Pittsburgh, U.S.A.), Hai (Helen) Li, *Yiran Chen (Duke University, U.S.A.) |
Page | pp. 375 - 380 |
Keyword | STT-RAM, MTJ, BIAXIAL MTJ, MODELING |
Abstract | In recent years, spin-transfer torque random access memory (STT-RAM) has been widely studied as a promising candidate to replace DRAM because of its fast access time, high endurance, and good CMOS compatibility. The improvement of the tunneling magnetoresistance ratio (TMR) of the magnetic tunneling junction (MTJ) also makes it possible to store more than one bit in an STT-RAM cell, namely, a multi-level cell (MLC) STT-RAM cell. One example is connecting two MTJs of different sizes in series to form four different resistance states that represent two bits in a memory cell. Very recently, the biaxial MTJ was proposed to realize four stable resistance states in a single MTJ. This technology greatly relaxes the driving capability requirement of the select transistor and hence increases the integration density of MLC STT-RAM cells by reducing the size of the select transistor. In this paper, we develop the first device model of the biaxial MTJ that can capture the switching transients between its four different states. We also validate our model against a manufactured biaxial MTJ device. |
Title | Automatic Insertion of Airgap with Design Rule Constraints |
Author | *Daijoon Hyun, Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea) |
Page | pp. 381 - 386 |
Keyword | airgap, design rule |
Abstract | Airgap refers to an intentional void used together with some material as the inter-metal dielectric (IMD); it reduces permittivity and the corresponding coupling capacitance, which may improve circuit performance. We address the problem of airgap insertion considering design rules in this context. The goal is to maximize the amount of airgap subject to multiple design rules. The problem is formulated as integer linear programming (ILP); a more practical heuristic algorithm is also proposed, whose performance is comparable to that of the ILP. Experiments in 7-nm technology demonstrate that airgap is inserted in 84% of the grids, which corresponds to a 5.3% increase in timing slack, and that our heuristic method achieves a 3.7 times speed-up. |
Title | On Coloring Rectangular and Diagonal Grid Graphs for Multiple Patterning Lithography |
Author | Daifeng Guo (University of Illinois at Urbana-Champaign, U.S.A.), Hongbo Zhang (Facebook, U.S.A.), *Martin D.F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) |
Page | pp. 387 - 392 |
Keyword | Coloring, Graph, Grid, Multiple patterning lithography |
Abstract | Rectangular and diagonal grid graphs are induced subgraphs of a rectangular or diagonal grid respectively. Their k-coloring problem has direct applications in printing contact/via layouts by multi-patterning lithography (MPL). However, the problem in general is computationally difficult for k>2, while it remains an open question on grid graphs due to their regularity and sparsity. In this paper, we conduct a complete analysis of the k-coloring problems on rectangular and diagonal grid graphs, and particularly the NP-completeness of 3-coloring on a diagonal grid graph is proved. In practice, we propose an exact 3-coloring algorithm. Experiments are conducted to verify its effectiveness and performance. Extensions and other results are also discussed. |
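As a small illustration of why the k = 2 case is easy on rectangular grid graphs, the sketch below colors an induced subgraph of the integer grid with a checkerboard pattern. It assumes that vertices are integer points and that two vertices are adjacent when they differ by one in exactly one coordinate; this adjacency model and the helper names are assumptions for illustration, and the code is not the exact 3-coloring algorithm proposed in the paper.

```python
def checkerboard_coloring(vertices):
    """2-color an induced subgraph of the rectangular grid by coordinate parity."""
    return {(x, y): (x + y) % 2 for (x, y) in vertices}


def is_proper(coloring):
    """Check that no two 4-neighbor-adjacent vertices share a color."""
    for (x, y), color in coloring.items():
        # Checking the right and upper neighbor covers every edge once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            if neighbor in coloring and coloring[neighbor] == color:
                return False
    return True
```

Adjacent grid points always differ in the parity of `x + y`, so the coloring is proper for any vertex subset under this adjacency model. Adding diagonal adjacencies breaks this argument, which is where the harder cases analyzed in the paper arise.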
Title | Lifetime-aware Design Methodology for Dynamic Partially Reconfigurable Systems |
Author | *Siva Satyendra Sahoo, Tuan D. A. Nguyen, Bharadwaj Veeravalli (National University of Singapore, Singapore), Akash Kumar (Technische Universitaet Dresden, Germany) |
Page | pp. 393 - 398 |
Keyword | Partial Reconfiguration, System Lifetime, System Partitioning |
Abstract | Dynamic Partial Reconfiguration (DPR) is a very useful feature of FPGA-based systems, as it allows aging-related permanent faults to be mitigated efficiently. The most common approach is to duplicate the Partially Reconfigurable Regions (PRRs) as redundancy resources. These PRRs are able to host any of the Partially Reconfigurable Modules (PRMs), or tasks, in one particular instance. This kind of system is called homogeneous. However, the FPGA resource constraints limit the amount of redundancy that can be used and hence affect the lifetime of the system. This issue can be addressed by a heterogeneous approach in which each PRR hosts only a subset of the tasks. Nevertheless, in this case, the deadlines of the applications must be taken into account in the design phase when deciding the mapping of tasks to PRRs. To the best of our knowledge, there is no prior work that proposes a solution for this approach. In this work, we propose a proactive application-specific system-level design methodology to determine the appropriate number of PRRs, the mapping of tasks to them, and the task schedule. Our method is capable of not only maximizing the system lifetime based on the available FPGA resources but also ensuring that the application deadline is met. The core idea of the solution is to formulate an Integer Linear Programming (ILP) model that captures all aspects of the system. Our experiments show that heterogeneous systems can offer up to 2x lifetime improvement over homogeneous ones. |
Title | Electromigration-Lifetime Constrained Power Grid Optimization Considering Multi-Segment Interconnect Wires |
Author | Han Zhou, Yijing Sun, Zeyu Sun, Hengyang Zhao, *Sheldon X.-D. Tan (University of California, Riverside, U.S.A.) |
Page | pp. 399 - 404 |
Keyword | power grid, optimization, electromigration, lifetime, reliability |
Abstract | As on-chip power/ground (P/G) networks are most vulnerable to Electromigration (EM) failures, proper sizing and even routing of power grid networks are critical for the full-chip EM sign-off. In this paper, we propose a new P/G network sizing technique based on recently proposed EM analysis. Numerical results demonstrate that the new method can effectively reduce the area of the network while ensuring immortality or enforcing targeted lifetime for all the wires. |
Title | (Invited Paper) New Directions for Learning-Based IC Design Tools and Methodologies |
Author | *Andrew B. Kahng (University of California, San Diego, U.S.A.) |
Page | pp. 405 - 410 |
Keyword | Machine Learning, deep learning, eda |
Abstract | Design-based equivalent scaling now bears much of the burden of continuing the semiconductor industry’s trajectory of Moore’s-Law value scaling. In the future, reductions of design effort and design schedule must comprise a substantial portion of this equivalent scaling. In this context, machine learning and deep learning in EDA tools and design flows offer enormous potential for value creation. Examples of opportunities include: improved design convergence through prediction of downstream flow outcomes; margin reduction through new analysis correlation mechanisms; and use of open platforms to develop learning-based applications. These will be the foundations of future design-based equivalent scaling in the IC industry. This paper describes several near-term challenges and opportunities, along with concrete existence proofs, for application of learning-based methods within the ecosystem of commercial EDA, IC design, and academic research. |
Title | (Invited Paper) Machine Learning and Systems for Building the Next Generation of EDA tools |
Author | *Manish Pandey (Synopsys, Inc., U.S.A.) |
Page | pp. 411 - 415 |
Keyword | Machine Learning, Systems, Infrastructure, Verification |
Abstract | This paper describes how machine learning techniques can enable the development of the next generation of EDA tools with substantial gains in performance and ease-of-use. After a brief overview of machine learning techniques, the paper discusses performance limitations of traditional compute and storage systems, and the systems and infrastructure considerations for performing machine learning at scale. The paper concludes with a few examples in which machine learning can be applied to solve common optimization and classification problems encountered in the traditional CAD flows, including functional verification and debug. |
Title | (Invited Paper) Machine Learning based Generic Violation Waiver System with Application on Electromigration Sign-off |
Author | *Norman Chang, Ajay Baranwal, Hao Zhuang, Ming-Chih Shih, Rahul Rajan, Yaowei Jia, Hui-Lun Liao, Ying-Shiun Li (ANSYS, U.S.A.), Ting Ku, Rex Lin (Nvidia Corporation, U.S.A.) |
Page | pp. 416 - 421 |
Keyword | Machine/Deep Learning, K-means, K-NN, Electromigration, waiver |
Abstract | Manually analyzing the results generated by EDA tools to waive or fix any violations is a tedious, error-prone and time-consuming process. By automating these time-consuming rigorous manual procedures by aggregating key insights across different designs using continuing and prior simulation data, a design team can speed up the tape-out process, optimize resources and significantly minimize the risk of overlooking must fix violations that are prone to cause field failures. In this paper, a machine learning based generic waiver system is presented. |
Title | (Invited Paper) Machine Learning for Engineering |
Author | *Jeff Dyck (Solido Design Automation, U.S.A.) |
Page | pp. 422 - 427 |
Keyword | Machine/Deep Learning, Variation-aware design, Solido, Monte Carlo, PVT |
Abstract | Applying machine learning techniques to solve production problems within electronic design automation is complex. This is because production engineering applications have accuracy, scalability, complexity, verifiability, and usability requirements that are not met by traditional machine learning approaches. These additional challenges frequently cause production machine learning approaches to fail. This paper examines these engineering-specific challenges and presents some effective solutions based on Solido's experience developing a suite of successful applied machine learning solutions to EDA over the past twelve years. |
Title | (Invited Paper) Large-scale Short-term Urban Taxi Demand Forecasting Using Deep Learning |
Author | Siyu Liao (City University of New York, U.S.A.), Liutong Zhou, Xuan Di (Columbia University, U.S.A.), Bo Yuan (City University of New York, U.S.A.), *Jinjun Xiong (IBM Thomas J. Watson Research Center, U.S.A.) |
Page | pp. 428 - 433 |
Keyword | Deep Learning, Traffic Demands, Prediction |
Abstract | In recent years, the world has seen great successes in applying deep learning (DL) to many application domains. Though powerful, DL is not easy to use well. In this invited paper, we study an urban taxi demand forecasting problem using DL, and we present a number of key insights into modeling a domain problem as a suitable DL task. We also conduct a systematic comparison of two recent deep neural networks (DNNs) for taxi demand prediction, i.e., the ST-ResNet and FLC-Net, on the New York City taxi record dataset. Our experimental results show that DNNs indeed outperform most traditional machine learning techniques, but such superior results can only be achieved with proper design of the right DNN architecture, where domain knowledge plays a key role. |
Title | Utilizing Quad-Trees for Efficient Design Space Exploration with Partial Assignment Evaluation |
Author | *Kai Neubauer (University of Rostock, Germany), Philipp Wanko, Torsten Schaub (University of Potsdam, Germany), Christian Haubelt (University of Rostock, Germany) |
Page | pp. 434 - 439 |
Keyword | Design Space Exploration, Optimization, CDCL, Quad-Tree, Pareto archive |
Abstract | Recently, it has been shown that constraint-based symbolic solving techniques offer an efficient way of deciding binding and routing options in order to obtain a feasible system-level implementation. In combination with various background theories, a feasibility analysis of the resulting system may already be performed on partial solutions. That is, infeasible subsets of mapping and routing options can be pruned early in the decision process, which speeds up solving accordingly. However, allowing a proper design space exploration including multi-objective optimization also requires an efficient structure for storing and managing non-dominated solutions. In this work, we propose and study the usage of the Quad-Tree data structure in the context of partial assignment evaluation during system synthesis. Our experiments show that unnecessary dominance checks can be avoided, which indicates a preference for Quad-Trees over a state-of-the-art list-based implementation for large combinatorial optimization problems. |
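To make the notion of dominance checks concrete, here is a minimal sketch of a list-based archive of non-dominated solutions, i.e., the kind of baseline the abstract compares against, not the Quad-Tree structure itself. It assumes that every objective is minimized and that solutions are tuples of objective values; the names `dominates` and `insert` are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def insert(archive, candidate):
    """Return the updated archive after offering `candidate` (minimization assumed).

    The candidate is rejected if an archived point dominates or equals it;
    otherwise it is added and every archived point it dominates is removed.
    """
    if any(dominates(p, candidate) or p == candidate for p in archive):
        return archive
    return [p for p in archive if not dominates(candidate, p)] + [candidate]
```

Each insertion compares the candidate against the whole list, so the number of dominance checks grows with the archive size. A tree-structured archive aims to skip comparisons against regions that cannot dominate or be dominated by the candidate.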
Title | SCBench: A Benchmark Design Suite for SystemC Verification and Validation |
Author | *Bin Lin, Fei Xie (Portland State University, U.S.A.) |
Page | pp. 440 - 445 |
Keyword | SystemC, Benchmark, Verification, Validation |
Abstract | SystemC has become a de-facto standard hardware modelling language in the semiconductor industry, enabling early exploration of design spaces and verification at a higher level of abstraction. It is both important and necessary for researchers to evaluate their new verification approaches and algorithms quantitatively. This paper presents SCBench, a comprehensive suite of benchmark designs for SystemC verification and validation. SCBench consists of 38 well-written representative behavior-level SystemC designs, which have been selected carefully from various application domains, such as CPU architecture, security, network, and artificial intelligence. The benchmark covers most core features of SystemC language. SCBench is freely available online to all researchers. |
Title | MemFlow: Memory-Driven Data Scheduling with Datapath Co-design in Accelerators for Large-Scale Inference Applications |
Author | *Qi Nie, Sharad Malik (Princeton University, U.S.A.) |
Page | pp. 446 - 451 |
Keyword | Data Scheduling, Memory-driven Optimization, Accelerator Design, Codesign, Large-scale Inference |
Abstract | SRAM scratch-pad memory in accelerators is limited in size and bandwidth. Besides computation, accelerator design is about how data flow is scheduled across the memory hierarchy, from DRAM to datapath registers. There is limited support for this in current tools. Thus, we propose MemFlow, memory-driven data scheduling with datapath co-design in accelerators, to improve computing performance and reduce higher-level memory accesses. We demonstrate its efficacy using several key kernels from large-scale inference applications. |
Title | A Mapping Approach between IR and Binary CFGs Dealing with Aggressive Compiler Optimizations for Performance Estimation |
Author | *Omayma Matoussi, Frédéric Pétrot (Tima Laboratory, Grenoble INP, France) |
Page | pp. 452 - 457 |
Keyword | native simulation, performance estimation, compiler optimizations, mapping, intermediate representation |
Abstract | In this work, we define a mapping approach between the compiler intermediate representation and the binary control flow graph for the purpose of performance estimation in native simulation. Our approach handles aggressive compiler optimizations such as loop unrolling without having to introduce any modification to the compiler. Our mapping approach experimentally leads to a good accuracy (0.59% error) while keeping a 25x speedup for native simulation compared to instruction set simulation. |
Title | System Level Performance Analysis and Optimization for the Adaptive Clocking based Multi-Core Processor |
Author | *Byung Su Kim (Samsung Electronics, Republic of Korea), Joon-Sung Yang (Sungkyunkwan University, Republic of Korea) |
Page | pp. 458 - 463 |
Keyword | Adaptive clocking, Reliable System, multi-core design, Aging, Queueing theory |
Abstract | Supply voltage droop, temperature variation and aging effects can cause timing failures during operation. Various adaptive clocking methods have been introduced to resolve these problems. They use a tunable clock to avoid timing failures rather than relying on wide design guard bands. However, system performance analysis becomes complicated in a multi-core system with adaptive clocking. In this paper, a queueing-theory-based system-level performance model is proposed to estimate the average response time and power with a closed-form equation. Furthermore, for multi-core systems with adaptive clocking, an optimal job scheduling method using the inequality of arithmetic and geometric means is proposed. The proposed optimal job scheduling method relieves the system performance degradation arising from adaptive clocking. The proposed performance model can analyze the system-level performance within ~3% error compared with the JMT system simulation tool. Experimental results also show that the proposed job scheduling method obtains a significant performance enhancement over the conventional round-robin method. |
Title | Detecting Non-Functional Circuit Activity in SoC Designs |
Author | *Dustin Peterson, Yannick Boekle, Oliver Bringmann (University of Tübingen, Germany) |
Page | pp. 464 - 469 |
Keyword | Circuit Activity, Switching Activity, Simulation, Power Optimization |
Abstract | In this paper, we present a methodology for the automatic detection of non-functional circuit activity in SoC designs. Our methodology formally analyzes an RTL design, generates an internal graph representation and traverses the graph using given simulation traces. We evaluate an open source processor with a given set of benchmark applications using our approach. With a commercial RTL simulator, we observe an average register toggle activity of 6.7% – 11.5%, but our experiments show that 86.1 – 92.7% of these toggles are non-functional, i.e. not necessary for producing the exact same circuit output. We further evaluate the efficiency of the clock gating architecture of a commercial ASIP. For the Dhrystone benchmark we show that, even though only 34.7% of the registers are clocked on average, still 64.3% of the non-clock-gated registers in this ASIP are not needed on average to produce exactly the same circuit output. |
Title | Multi-Level Timing Simulation on GPUs |
Author | *Eric Schneider, Michael A. Kochte, Hans-Joachim Wunderlich (Institute of Computer Architecture and Computer Engineering, University of Stuttgart, Germany) |
Page | pp. 470 - 475 |
Keyword | timing simulation, switch level, multi-level, parallel simulation, GPUs |
Abstract | This paper presents the first timing-accurate multi-level simulation approach for parallel execution on Graphics Processing Units. It combines logic-level and switch-level abstractions to allow a selective trade-off between speed and accuracy for user-defined regions of interest. The multi-level simulation approach is able to process designs with millions of cells and thousands of stimuli, and reduces the overall simulation runtime by up to 89 percent compared to full simulation at switch level. |
Title | An Optimal Gate Design for the Synthesis of Ternary Logic Circuits |
Author | *Sunmean Kim, Taeho Lim, Seokhyeong Kang (Ulsan National Institute of Science and Technology, Republic of Korea) |
Page | pp. 476 - 481 |
Keyword | Multi-valued logic, Ternary logic circuit, Synthesis methodology, CNTFET, T-CMOS |
Abstract | Over the last few decades, CMOS-based digital circuits have been steadily developed. However, because of the power density limits, device scaling may soon come to an end, and new approaches for circuit designs are required. Multi-valued logic (MVL) is one of the new approaches, which increases the radix for computation to lower the complexity of the circuit. For the MVL implementation, ternary logic circuit designs have been proposed previously, though they could not show advantages over binary logic, because of unoptimized synthesis techniques. In this paper, we propose a methodology to design ternary gates by modeling pull-up and pull-down operations of the gates. Our proposed methodology makes it possible to synthesize ternary gates with a minimum number of transistors. From HSPICE simulation results, our ternary designs show significant power-delay product reductions; 49 % in the ternary full adder and 62 % in the ternary multiplier compared to the existing methodology. We have also compared the number of transistors in CMOS-based binary logic circuits and ternary device-based logic circuits. |
Title | Performance-Preserved Analog Routing Methodology via Wire Load Reduction |
Author | *Hao-Yu Chi, Hwa-Yi Tseng, Chien-Nan Jimmy Liu (Department of Electrical Engineering, National Central University, Taiwan), Hung-Ming Chen (Department of Electronics Engineering, National Chiao Tung University, Taiwan) |
Page | pp. 482 - 487 |
Keyword | routing |
Abstract | Analog layout automation has been a popular research direction in recent years for raising design productivity. However, research on this topic is still not well accepted by analog designers because notable performance loss often exists in tool-generated layouts. Most previous works focus on the layout placement problem and route the nets with a typical digital routing methodology. This routing approach can solve the net crossing issue easily, but requires many extra vias to connect the horizontal and vertical lines, which significantly increases the wire loads and reduces circuit performance. In the proposed analog routing flow, we try to route each net with minimum layer changes while considering the wire length simultaneously. To reduce the wire load from vias, net resistance is used as the optimization goal instead of wire length alone, in order to preserve circuit performance after layout. As demonstrated on several cases, this approach significantly reduces the wire load and keeps circuit performance similar to that of manual layouts. |
Title | Static Timing Analysis for Ring Oscillators |
Author | *David M. Moore (University of Michigan, U.S.A.), Jeffrey A. Fredenburg, Muhammad Faisal (Movellus Inc., U.S.A.), David D. Wentzloff (University of Michigan, U.S.A.) |
Page | pp. 488 - 493 |
Keyword | Ring Oscillators, Analog Verification, Analog Timing Analysis |
Abstract | The creation of cell-based ring-oscillators using digital place-and-route tools has emerged as a practical solution to the difficulties of designing these circuits in advanced process nodes. However, design iteration and verification for these circuits today rely on complete SPICE simulations which can be slow and costly. In this paper, we present a static timing analysis approach to the design and verification of these oscillators. This approach provides up to a 20x reduction in verification time, and allows for automated refinement of oscillator designs within the digital design flows. Experimental results show agreement with SPICE simulations and illustrate simulation and design time savings. |
Title | OCV Guided Clock Tree Topology Reconstruction |
Author | Necati Uysal, *Rickard Ewetz (University of Central Florida, U.S.A.) |
Page | pp. 494 - 499 |
Keyword | clock tree, on-chip variations, skew |
Abstract | The timing performance of clock trees in scaled technology nodes may be severely degraded by delay variations introduced by on-chip variations (OCV). Clock tree optimization (CTO) is employed to eliminate timing violations by specifying a set of non-negative delay adjustments using a linear programming (LP) formulation. Next, the delay adjustments are realized in the clock tree by inserting delay buffers and detour wires. The drawback is that, given the topology of the initial clock tree, it may be impossible to remove all timing violations. In this paper, a framework that performs OCV guided clock tree topology reconstruction is proposed. The framework reconstructs the topology of a clock tree while improving the lower bounds on the worst negative slack (WNS) and the total negative slack (TNS). Next, traditional CTO is employed to reduce WNS and TNS to the improved lower bounds. The reconstruction of the clock tree topology is guided by a predicted leaf buffer slack graph (pLB-SG). The leaf buffers that must be placed closer in the tree topology are identified by detecting cycles (or strongly connected components) in the pLB-SG. Given the identified leaf buffers, the framework enumerates, evaluates, and performs various topology changes. The experimental results demonstrate that the proposed framework can, on average, reduce WNS and TNS by 84% and 80%, respectively. |
Title | Cohesive Techniques for Cell Layout Optimization Supporting 2D Metal-1 Routing Completion |
Author | *Kyeongrok Jo, Seyong Ahn, Taewhan Kim, Kyumyung Choi (Seoul National University, Republic of Korea) |
Page | pp. 500 - 506 |
Keyword | Layout |
Abstract | This work addresses the problem of automatically synthesizing compact standard cell layouts with 2D metal-1 routing under design rule constraints. Specifically, we propose a set of new high-impact techniques dedicated solely to the generation of cell layouts with 2D metal-1 routing completion. These are (1) netlist decomposition, (2) transistor chaining combined with transistor folding, (3) gate poly ordering combined with fast routing congestion estimation, and (4) 2D one-layer routing with minimal resources. Experiments show that our proposed layout generator is able to produce layouts of quality comparable to an expert's manual ones, while spending just one hour to generate 56 representative cells. |
Title | Clustering of Flip-Flops for Useful-Skew Clock Tree Synthesis |
Author | *Chuan Yean Tan (Purdue University, U.S.A.), Rickard Ewetz (University of Central Florida, U.S.A.), Cheng-Kok Koh (Purdue University, U.S.A.) |
Page | pp. 507 - 512 |
Keyword | Multi-bit Flip-Flop, Clustering, Power Optimization, Clock Tree Synthesis |
Abstract | The clock network of a circuit is a main contributor to the power consumption of any ASIC design. A key technique used to reduce power consumption is to cluster flip-flops or latches into groups and to place each group of flip-flops close together to reduce the clock wire length. In this paper, we introduce a clock tree synthesis methodology that incorporates clustering with a previously published useful-skew clock tree synthesis technique to minimize the clock wire length. The clustering process is guided by bounded arrival time constraints, which enable its efficiency. Experimental results show that the proposed methodology reduces the total power consumption by up to 34% while meeting all timing constraints. |
Title | Optimal Die Placement for Interposer-Based 3D ICs |
Author | *Sergii Osmolovskyi (Dresden University of Technology, Germany), Johann Knechtel (New York University Abu Dhabi, United Arab Emirates), Igor L. Markov (University of Michigan, U.S.A.), Jens Lienig (Dresden University of Technology, Germany) |
Page | pp. 513 - 520 |
Keyword | 3D integration, 2.5D ICs, floorplanning, interposer, optimal placement |
Abstract | Performance of modern multi-chip modules, increasingly implemented as interposer solutions, is limited by system-level interconnects. We propose an effective method for optimal wirelength-driven die placement of interposer-based 3D ICs. Our key ideas are to leverage the constraint-satisfaction problem (CSP) formalism in combination with a branch-and-bound (B&B) algorithm, and to develop several novel techniques for early identification and pruning of unpromising configurations. Such techniques are crucial for addressing the combinatorial explosion when solving the NP-hard placement problem. Experiments on ISPD08 (modified) and MCNC benchmarks demonstrate that our method outperforms prior art: we can optimally place up to eleven rotatable dies, whereas state-of-the-art tools are limited to six dies. |
Title | Flip-Chip Routing with IO Planning Considering Practical Pad Assignment Constraints |
Author | *Tao-Chun Yu, Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan) |
Page | pp. 521 - 526 |
Keyword | Flip-chip routing, IO planning |
Abstract | In order to support the pad-limited Application-Specific Integrated Circuit (ASIC) designs, the flip chip package is used and provides the highest chip density compared to other packaging technologies. In this paper, we propose the first work of free-assignment flip-chip routing considering practical bump/IO pad constraints and flexibilities. Unlike previous studies regarding all nets as the same, we differentiate signal and power/ground nets and set different bump pad assignment constraints for substrate layout optimization. In our flow, a global routing-based IO-bump assignment algorithm is proposed with a multi-commodity flow network model. After that, a detailed routing algorithm minimizing total wirelength is presented, which determines optimal relay points with a linear programming (LP) formulation. Finally, a dynamic programming (DP)-based IO pad planning technique is applied to further reduce the number of wire bends. Experimental results based on modified industrial cases show that our algorithm flow not only achieves 100% routability of all testcases but also minimizes total wirelength and bump utilization. |
Title | (Invited Paper) Accelerator-centric Deep Learning Systems for Enhanced Scalability, Energy-Efficiency, and Programmability |
Author | *Minsoo Rhu (POSTECH, Republic of Korea) |
Page | pp. 527 - 533 |
Keyword | Deep Neural Networks |
Abstract | Deep learning (DL) has been successfully deployed in various application domains ranging from computer vision and speech recognition to natural language processing. As the network models and the datasets used to train these models scale, system architects are faced with new challenges in designing a scalable and energy-efficient high-performance computing (HPC) system for training DL algorithms. One of the key obstacles that DL researchers are facing is the memory capacity bottleneck, where the limited physical memory size of the PCIe-attached DL accelerator (whether it be a discrete GPU or ASIC accelerator like Google's Tensor Processing Unit) constrains the algorithm that can be studied. In this paper, and the associated invited special session talk, we first discuss recent research literature geared towards designing scalable HPC systems for DL. In this context, we then discuss the memory capacity wall problem and introduce our prior work on virtualized deep neural networks, a memory virtualization solution that systematically reduces the memory consumption of DNN training. We conclude by providing projections on future challenges DNN memory virtualization will encounter and suggest accelerator-centric DL systems as a promising research direction for the development of a scalable and energy-efficient deep learning system architecture. |
Title | (Invited Paper) Running Sparse and Low-Precision Neural Network: When Algorithm Meets Hardware |
Author | Bing Li, Wei Wen, Jiachen Mao (Duke University, U.S.A.), Sicheng Li (Hewlett Packard Labs, U.S.A.), *Yiran Chen, Hai(Helen) Li (Duke University, U.S.A.) |
Page | pp. 534 - 539 |
Keyword | Deep Neural Networks, Sparse DNN, Low-rank approximation, gradient quantization |
Abstract | Deep Neural Networks (DNNs) are pervasively applied in many artificial intelligence (AI) applications. The high performance of DNNs comes at the cost of larger size and higher compute complexity. Recent studies show that DNNs have much redundancy, such as the zero-value parameters and excessive numerical precision. To reduce computing complexity, many redundancy reduction techniques have been proposed, including pruning and data quantization. In this paper, we demonstrate our co-optimization of the DNN algorithm and hardware which exploits the model redundancy to accelerate DNNs. |
Title | (Invited Paper) Architectures and Algorithms for User Customization of CNNs |
Author | Barend Harris, Mansureh S. Moghaddam, Duseok Kang, Inpyo Bae, Euiseok Kim, Hyemi Min (Seoul National University, Republic of Korea), Hansu Cho, Sukjin Kim (Samsung Electronics, Ltd., Republic of Korea), *Bernhard Egger, Soonhoi Ha, Kiyoung Choi (Seoul National University, Republic of Korea) |
Page | pp. 540 - 547 |
Keyword | Deep Neural Networks, Coarse Grained Reconfigurable Architectures, Personalization of DNNs |
Abstract | In this paper we present a convolutional neural network architecture that supports user customization through incremental transfer learning. The architecture consists of a large basic inference engine and a small augmenting engine. After training the basic inference engine and augmenting engine on a large general dataset, the basic inference engine is fixed. For user customization, only the augmenting engine is re-trained on-device using a small user specific dataset provided by the user. To accelerate the training of the augmenting engine we map this to a coarse-grained reconfigurable array processor. The complete network architecture is evaluated using the Caffe framework, and a C-code equivalent network is implemented and tested on a CGRA processor. Experiments with NIST '19 and our user-specific datasets show an increase in accuracy of the system from 76.3% to 93.2% after user customization. Mapping this code to a CGRA gives us a speed up of 45x and a 49- and 3-fold reduced energy consumption over an ARMv7 processor and a 3-way VLIW processor, respectively, showing the potential of CGRAs as DNN processors. |
Title | Rethinking Self-balancing Binary Search Tree over Phase Change Memory with Write Asymmetry |
Author | *Chieh-Fu Chang (Institute of Information Science, Academia Sinica, Taiwan), Che-Wei Chang (Chang Gung University, Taiwan), Yuan-Hao Chang (Institute of Information Science, Academia Sinica, Taiwan), Ming-Chang Yang (Chinese University of Hong Kong, Hong Kong) |
Page | pp. 548 - 553 |
Keyword | Phase Change Memory, Non-volatile Memory, Write Asymmetry, Bit Flips, Binary Search Tree |
Abstract | Phase change memory (PCM) has become a promising candidate to replace DRAM in some massive/big data applications because of its low leakage power, non-volatility, and high density. However, most of the existing memory read/write intensive algorithms/designs are not aware of the endurance and write asymmetry issues of PCM. In particular, self-balancing binary search trees, which are widely used to manage massive data in the big-data era, were designed without the consideration of PCM characteristics and could degrade the memory performance. In this work, we rethink the design of self-balancing binary search trees, and propose a write-asymmetry-aware self-balancing tree to reduce the tree management overhead by decreasing the total/average number of bit flips of tree rotations with the consideration of the endurance and write asymmetry issues of PCM. Experimental results show that our solution significantly outperforms the original implementation of a self-balancing binary search tree, in terms of minimizing the total number of bit flips when the amount of data is large. |
Title | Energy, Latency, and Lifetime Improvements in MLC NVM with Enhanced WOM Code |
Author | *Huizhang Luo, Liang Shi, Qiao Li (College of Computer Science, Chongqing University, China), Chun Jason Xue (Department of Computer Science, City University of Hong Kong, Hong Kong), Edwin H.-M. Sha (College of Computer Science, Chongqing University, China) |
Page | pp. 554 - 559 |
Keyword | NVM, MLC, WOM, Performance, Energy |
Abstract | Non-volatile memories (NVMs), such as phase change memory (PCM) and resistive random access memory (ReRAM), have emerged as promising memory technologies for replacements of DRAM due to their advantages, such as better scalability, zero cell leakage, and DRAM-comparable read latency. Furthermore, multiple level cell (MLC) NVMs offer high data density and memory capacity over single level cell (SLC) NVMs. However, the adoption of MLC NVMs is limited by their high programming energy and latency as well as the low endurance. In this paper, we propose an enhanced 〈23〉2/4 WOM code for MLC NVMs, which exploits the asymmetric characteristic in MLC NVM cell state transitions. Unlike the conventional WOM codes that focus on eliminating the worst-case latency writes, we propose to enlarge the best-case latency writes in MLC NVM cell state transitions. After data shaping with the enhanced WOM code, proportion of the best-case latency writes is maximized. In this way, the enhanced WOM code simultaneously reduces energy and latency, and improves lifetime with no memory and logic overheads. Evaluations show exciting improvement from the proposed approach. |
Title | Scheduling Multi-Rate Real-Time Applications on Clustered Many-Core Architectures with Memory Constraints |
Author | *Matthias Becker, Saad Mubeen (Mälardalen University, Sweden), Dakshina Dasari (Corporate Research, Robert Bosch GmbH, Germany), Moris Behnam, Thomas Nolte (Mälardalen University, Sweden) |
Page | pp. 560 - 567 |
Keyword | Many-Core, Real-Time, Scheduling, Framework, Constraint Programming |
Abstract | Access to shared memory is one of the main challenges for many-core processors. One group of scheduling strategies for such platforms focuses on separating tasks' access to shared memory from their code execution. This makes it possible to orchestrate access to shared local and off-chip memory such that access contention between different compute cores is avoided by design. In this work, an execution framework is introduced that leverages local memory by statically allocating a subset of tasks to cores. This reduces the access times to shared memory, as off-chip memory access is avoided, and in turn improves the schedulability of such systems. A Constraint Programming (CP) formulation is presented to select the statically allocated tasks and to generate the complete system schedule. Evaluations show that the proposed approach yields an up to 19% higher schedulability ratio than related work, and a case study demonstrates its applicability to industrial problems. |
Title | PT-Spike: A Precise-Time-Dependent Single Spike Neuromorphic Architecture with Efficient Supervised Learning |
Author | Tao Liu (Florida International University, U.S.A.), Lei Jiang (Indiana University, U.S.A.), Yier Jin (University of Florida, U.S.A.), Gang Quan, *Wujie Wen (Florida International University, U.S.A.) |
Page | pp. 568 - 573 |
Keyword | neuromorphic, emerging, SNN |
Abstract | One of the most exciting advancements in Artificial Intelligence (AI) over the last decade is the wide adoption of Artificial Neural Networks (ANNs), such as Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), in real-world applications. However, the underlying massive computation and storage requirements greatly challenge their applicability in resource-limited platforms such as drones, mobile phones, and IoT devices. The third generation of neural network model, the Spiking Neural Network (SNN), inspired by the working mechanism and efficiency of the human brain, has emerged as a promising solution for achieving more impressive computing and power efficiency within lightweight devices (e.g., a single chip). However, the relevant research activities have been narrowly carried out on conventional rate-based spiking system designs for fulfilling practical cognitive tasks, underestimating the SNN’s energy efficiency, throughput, and system flexibility. Although the time-based SNN can be more attractive conceptually, its potential has not been unleashed in realistic applications due to the lack of efficient coding and practical learning schemes. In this work, a Precise-Time-Dependent Single Spike Neuromorphic Architecture, namely “PT-Spike”, is developed to bridge this gap. Three constituent hardware-favorable techniques, namely precise single-spike temporal encoding, efficient supervised temporal learning, and fast asymmetric decoding, are proposed accordingly to boost the energy efficiency and data processing capability of the time-based SNN at a more compact neural network model size when executing real cognitive tasks. Simulation results show that “PT-Spike” demonstrates significant improvements in network size, processing efficiency, and power consumption with marginal classification accuracy degradation, when compared to the rate-based SNN and ANN under similar network configurations. |
Title | Fully Parallel RRAM Synaptic Array for Implementing Binary Neural Network with (+1, -1) Weights and (+1, 0) Neurons |
Author | Xiaoyu Sun, Xiaochen Peng, Pai-Yu Chen, Rui Liu, Jae-sun Seo, *Shimeng Yu (Arizona State University, U.S.A.) |
Page | pp. 574 - 579 |
Keyword | Binary Neural Network, RRAM, energy-efficiency, hardware accelerator, sense amplifier |
Abstract | Binary Neural Networks (BNNs) have been recently proposed to improve the area-/energy-efficiency of the machine/deep learning hardware accelerators, which opens an opportunity to use the technologically more mature binary RRAM devices to effectively implement the binary synaptic weights. In addition, the binary neuron activation enables using the sense amplifier instead of the analog-to-digital converter to allow bitwise communication between layers of the neural networks. However, the sense amplifier has intrinsic offset that affects the threshold of binary neuron, thus it may degrade the classification accuracy. In this work, we analyze a fully parallel RRAM synaptic array architecture that implements the fully connected layers in a convolutional neural network with (+1, -1) weights and (+1, 0) neurons. The simulation results with TSMC 65 nm PDK show that the offset of current mode sense amplifier introduces a slight accuracy loss from ~98.5% to ~97.6% for MNIST dataset. Nevertheless, the proposed fully parallel BNN architecture (P-BNN) can achieve 137.35 TOPS/W energy efficiency for the inference, improved by ~20X compared to the sequential BNN architecture (S-BNN) with row-by-row read-out scheme. Moreover, the proposed P-BNN architecture can save the chip area by ~16% as it eliminates the area overhead of MAC peripheral units in the S-BNN architecture. |
Title | Spintronics based Stochastic Computing for Efficient Bayesian Inference System |
Author | Xiaotao Jia, *Jianlei Yang, Zhaohao Wang (Beihang University, China), Yiran Chen, Hai (Helen) Li (Duke University, U.S.A.), Weisheng Zhao (Beihang University, China) |
Page | pp. 580 - 585 |
Keyword | Bayesian inference, Stochastic Computing, Stochastic Bitstream, MTJ |
Abstract | Bayesian inference is an effective approach for solving statistical learning problems, especially those with uncertainty and incompleteness. However, inference efficiency is physically limited by the bottlenecks of conventional computing platforms. In this paper, an emerging Bayesian inference system is proposed by exploiting spintronics-based stochastic computing. A stochastic bitstream generator is realized as the kernel component by leveraging the inherent randomness of spintronic devices due to their stochastic switching behavior. The proposed system is evaluated on typical applications of data fusion and Bayesian belief networks. Simulation results indicate that the proposed approach could achieve significant improvement in inference efficiency in terms of power consumption, circuit area, and speed. |
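As an illustration of the stochastic computing idea mentioned in this abstract, the following is a minimal software sketch, not the authors' design. It assumes the common unipolar encoding, in which a probability is represented by the fraction of 1s in a bitstream, so that a bitwise AND of two independent streams approximates the product of their probabilities. The function names are hypothetical, and Python's pseudo-random generator stands in for the spintronic bitstream generator.

```python
import random


def bitstream(p, n, rng):
    """Return n bits, each equal to 1 with probability p (unipolar encoding)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def estimate(bits):
    """Decode a unipolar bitstream: the fraction of 1s approximates the value."""
    return sum(bits) / len(bits)


def sc_multiply(p, q, n=100_000, seed=0):
    """Approximate p * q by AND-ing two independent stochastic bitstreams.

    The software RNG here is only a stand-in (an assumption of this sketch)
    for a hardware random source.
    """
    rng = random.Random(seed)
    a = bitstream(p, n, rng)
    b = bitstream(q, n, rng)
    return estimate([x & y for x, y in zip(a, b)])


if __name__ == "__main__":
    # The estimate approaches 0.3 * 0.5 = 0.15 as the stream length grows.
    print(sc_multiply(0.3, 0.5))
```

Longer streams give a more accurate estimate at the cost of latency, which is the basic accuracy/efficiency trade-off of this style of computing.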
Title | SAT-Based Area Recovery in Structural Technology Mapping |
Author | *Bruno Schmitt (EPFL, Switzerland), Alan Mishchenko, Robert Brayton (UC Berkeley, U.S.A.) |
Page | pp. 586 - 591 |
Keyword | structural technology mapping, Boolean satisfiability, optimization |
Abstract | This paper proposes a fast SAT-based algorithm for recovering area, applicable to an already technology-mapped circuit. The algorithm considers a sequence of relatively small overlapping regions, called windows, in a mapped network and tries to improve the current mapping of each window using a SAT solver. Delay constraints are considered by interfacing the SAT solver with a timer. Experimental results are given for benchmarks that have already been mapped into 6-LUTs by a high-effort area-only synthesis/mapping flow. Starting from these results, many of which represented the best known area results at the time, the new mapper achieved an additional average area reduction of 3-4%, while for some benchmarks the area reduction exceeded 10%. Runtime for any example was only a few seconds. |
Title | A Two-Step Search Engine for Large Scale Boolean Matching under NP3 Equivalence |
Author | *Chak-Wa Pui, Peishan Tu, Haocheng Li, Gengjie Chen, Evangeline F.Y. Young (The Chinese University of Hong Kong, Hong Kong) |
Page | pp. 592 - 598 |
Keyword | boolean matching, logic synthesis, np3, verification |
Abstract | Boolean matching is one of the most widely used engines in industrial applications. However, existing Boolean matching research mainly focuses on NPNP-equivalence. In this paper, we study a more practical problem of Boolean matching, which is Non-exact Projective NPNP (NP3). A two-step search engine is used to solve the problem, and several heuristics and constraints are proposed to accelerate the whole process. In particular, we explore a new kind of symmetry property in NP3 equivalence checking, which helps to prune the solution space efficiently. Experimental results show that our proposed approach can achieve the best results among the winning teams of the ICCAD 2016 contest in quality within a given time limit. |
Title | Low-Cost Hardware Architectures for Mersenne Modulo Functional Units |
Author | Keith Campbell, Chen-Hsuan Lin, *Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.) |
Page | pp. 599 - 604 |
Keyword | Low cost, Functional unit, Mersenne number, Modulo arithmetic, Error Resilience |
Abstract | With technology scaling leading to reliability problems and a proliferation of hardware accelerators, there is a need for cost-effective techniques to detect errors in complex datapaths. Modulo (residue) arithmetic is useful for creating a shadow datapath to check the computation of an arithmetic datapath and involves three key steps: reduction of the inputs to modular shadow inputs, computation with those shadow values, and checking the outputs for consistency with the shadow outputs. The focus of this paper is new gate-level architectures and algorithms to reduce the cost of modular shadow datapaths. We introduce low-cost architectures for all four key functional units in a shadow datapath: (1) a modulo reduction algorithm which generates architectures consisting entirely of full-adder standard cells; (2) minimum-area modular adder and subtractor architectures; (3) an array-based modular multiplier design; and (4) a modular equality comparator that handles non-normalized input produced by the above. |
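The three steps named in this abstract (reduce the inputs, compute on the shadow values, check the outputs) can be illustrated in software. The following is a minimal sketch under stated assumptions, not the paper's gate-level architecture: it uses the standard identity that 2^n is congruent to 1 modulo the Mersenne number 2^n - 1, so a value can be reduced by repeatedly adding its n-bit chunks. The function names are hypothetical.

```python
def mersenne_reduce(x, n):
    """Reduce a non-negative integer x modulo the Mersenne number 2**n - 1.

    Because 2**n is congruent to 1 (mod 2**n - 1), the high bits can be
    folded onto the low n bits by addition until the value fits.
    """
    m = (1 << n) - 1
    while x > m:
        x = (x & m) + (x >> n)
    # The all-ones pattern is a non-normalized representation of zero.
    return 0 if x == m else x


def shadow_check_multiply(a, b, product, n):
    """Check a claimed product against a modular shadow computation."""
    shadow = mersenne_reduce(mersenne_reduce(a, n) * mersenne_reduce(b, n), n)
    return mersenne_reduce(product, n) == shadow


if __name__ == "__main__":
    print(mersenne_reduce(1000, 5))                       # same as 1000 % 31
    print(shadow_check_multiply(123, 456, 123 * 456, 5))  # consistent result
    print(shadow_check_multiply(123, 456, 123 * 456 + 1, 5))  # corrupted result
```

A shadow check of this kind detects any error that changes the result modulo 2^n - 1; errors that are multiples of the modulus go undetected, which is why the choice of n trades cost against coverage.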
Thursday, January 25, 2018 |
Title | (Keynote Address) CAE challenge on High Capacity/High Bandwidth Memory Design |
Author | Woojong Han (SK Hynix, Republic of Korea) |
Abstract | Demand for high-capacity, high-bandwidth memory solutions is increasing continuously, driven by hyper-scale data centers and emerging applications such as machine learning. Traditionally, memory design relies heavily on custom circuit design, which makes general high-level synthesis and automatic P&R much less efficient than in SoC design. However, because design and validation times for complex, high-speed memory devices keep increasing, there is a strong need for more efficient CAE capability that can combine digital logic design and implementation methods with layout-based schematic design. At the same time, designers need a more accurate and efficient verification environment, and significant effort is going into reducing validation time without sacrificing coverage. To cope with critical timing margin issues in high-bandwidth memory, a statistical design environment is being developed that can provide better correlation between real implementation and simulation. An integrated design and validation environment is also to be used to obtain integrated SI with the on-/off-chip interface. We can also discuss collaboration with the customer who owns the platform. |
Title | (Keynote Address) TeraByte/s Bandwidth 2.5D HBM (High-bandwidth Memory Module) Designs for Deep Learning Artificial Intelligent Servers |
Author | Joungho Kim (KAIST, Republic of Korea) |
Abstract | We are facing a newly emerging technological and industrial transition, named the 4th Industrial Revolution, which is based on big data platforms, deep learning algorithms, and high-performance GPU computing machines. Accordingly, demand for terabyte/s-bandwidth GPU-DRAM computing performance is rapidly increasing. However, the continuously growing gap between GPU performance and DRAM data bandwidth is becoming a critical drawback. To meet the required terabyte/s bandwidth, we propose a novel high-bandwidth memory (HBM) solution using TSV and Si interposer technologies. In this presentation, we will introduce the basic approaches and designs of terabyte/s-bandwidth 2.5D HBM (High-bandwidth Memory Module), which will be particularly useful for deep learning artificial intelligent servers. In particular, we will talk about the signal and power integrity design, simulation methods, and analysis results of TSV and Si interposer channels, including GPU-DRAM channels and high-speed serial channels. In addition, we will discuss PDN impedance and decoupling capacitor schemes. Finally, we will propose next-generation HBM designs using active interposer and equalization schemes to further increase bandwidth at lower power consumption. |
Title | A Low-Power High-Speed Accuracy-Controllable Approximate Multiplier Design |
Author | *Tongxin Yang, Tomoaki Ukezono, Toshinori Sato (Fukuoka University, Japan) |
Page | pp. 605 - 610 |
Keyword | approximate computing, accuracy-controllable multiplier, carry-maskable adder, low-power multiplier, high-speed multiplier |
Abstract | Approximate multiplication is considered to be an efficient technique for trading off energy against performance and accuracy. This paper proposes an accuracy-controllable multiplier whose final product is generated by a carry-maskable adder. Compared with a conventional Wallace tree multiplier, the proposed multiplier reduced power consumption by between 47.3% and 56.2% and critical path delay by between 29.9% and 60.5%, depending on the required accuracy. Its silicon area was also 44.6% smaller. |
Title | Exploration of Approximate Multipliers Design Space using Carry Propagation Free Compressors |
Author | Sina Boroumand (University of Tehran, Iran), Hadi P. Afshar (Qualcomm Research, U.S.A.), *Philip Brisk (University of California, Riverside, U.S.A.), Siamak Mohammadi (University of Tehran, Iran) |
Page | pp. 611 - 616 |
Keyword | Approximate computation, Multiplier, Adder, Machine Learning |
Abstract | Many emerging application domains, such as machine learning, can tolerate limited amounts of arithmetic inaccuracy. When designing custom compute accelerators for these domains, hardware designers can explore tradeoffs that sacrifice accuracy in order to reduce area, delay, and/or power consumption. This paper explores the design space of approximate multipliers using a family of approximate compressors as building blocks for the partial product reduction tree. We present a tool that allows the user to specify an allowable level of error tolerance, and returns the minimum area, delay, or power approximate multiplier that provides that level of accuracy. Our experimental results indicate that our proposed compressors generate more accurate and more efficient approximate multipliers than existing state-of-the-art techniques. |
Title | Low-power Implementation of Mitchell’s Approximate Logarithmic Multiplication for Convolutional Neural Networks |
Author | *Min Soo Kim (University of California at Irvine, U.S.A.), Alberto Antonio Del Barrio, Román Hermida (Universidad Complutense de Madrid, Spain), Nader Bagherzadeh (University of California at Irvine, U.S.A.) |
Page | pp. 617 - 622 |
Keyword | Approximate multiplier, Convolutional Neural Networks, Logarithm, Power Reduction, CIFAR-10 |
Abstract | This paper proposes a low-power implementation of the approximate logarithmic multiplier to improve the power consumption of convolutional neural networks for image classification, taking advantage of their intrinsic tolerance to error. The approximate logarithmic multiplier converts multiplications to additions by taking approximate logarithms and achieves significant improvement in power and area while having low worst-case error, which makes it suitable for neural network computation. Our proposed design shows a significant improvement in terms of power and area over the previous work that applied logarithmic multiplication to neural networks, reducing power by up to 76.6% compared to exact fixed-point multiplication, while maintaining comparable prediction accuracy in convolutional neural networks for the MNIST and CIFAR-10 datasets. |
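For readers unfamiliar with the technique named in the title, the following is a minimal sketch of the textbook form of Mitchell's approximate logarithmic multiplication for positive integers. It is not the paper's low-power hardware design, and the function name is hypothetical. Each operand is written as 2^k (1 + x) with 0 <= x < 1, the logarithm is approximated as k + x, the approximate logarithms are added, and the result is converted back with the same linear approximation.

```python
def mitchell_multiply(a, b):
    """Approximate a * b for non-negative integers using Mitchell's method.

    With a = 2**k1 * (1 + x1) and b = 2**k2 * (1 + x2), the product is
    approximated as 2**(k1 + k2) * (1 + x1 + x2) when x1 + x2 < 1, and as
    2**(k1 + k2 + 1) * (x1 + x2) otherwise. Integer arithmetic is used
    throughout, so no rounding is introduced beyond the approximation itself.
    """
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1          # characteristic: position of leading 1
    k2 = b.bit_length() - 1
    f1 = a - (1 << k1)               # mantissa x1, scaled by 2**k1
    f2 = b - (1 << k2)               # mantissa x2, scaled by 2**k2
    scale = 1 << (k1 + k2)
    s = (f1 << k2) + (f2 << k1)      # (x1 + x2), scaled by 2**(k1 + k2)
    if s < scale:
        return scale + s             # 2**(k1+k2) * (1 + x1 + x2)
    return 2 * s                     # 2**(k1+k2+1) * (x1 + x2)


if __name__ == "__main__":
    for a, b in [(4, 4), (5, 6), (3, 3), (7, 7)]:
        print(a, b, a * b, mitchell_multiply(a, b))
```

The approximation never overestimates, and its relative error is at most 1/9 (about 11.1%), reached when both mantissas equal 0.5, for example 3 x 3, which yields 8 instead of 9.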
Title | (Invited Paper) Accelerating Electromigration Aging for Fast Failure Detection for Nanometer ICs |
Author | Zeyu Sun, Sheriff Sadiqbatcha, Hengyang Zhao, *Sheldon X.-D. Tan (University of California at Riverside, U.S.A.) |
Page | pp. 623 - 630 |
Keyword | Reliability, Electromigration |
Abstract | For practical testing and detection of electromigration (EM) induced failures in dual damascene copper interconnects in today’s and future sub-10nm ICs, one critical issue is how to create stressing conditions so that the chip will fail exclusively under EM in a very short period of time so that EM signoff and validation can be carried out efficiently. In this work, we propose novel EM wearout-acceleration techniques for practical VLSI chips. We will first review the recently proposed three-phase physics-based EM models and discuss the important factors contributing to the EM aging process. Then we propose a new formula for fast estimation of the void’s saturation volume for general multi-segment interconnect wires, which is important for EM mortality check. We then investigate two strategies to accelerate the EM failure process: reservoir-enhanced acceleration and temperature-based acceleration. |
Title | (Invited Paper) Efficient Worst-case Timing Analysis of Critical-path Delay under Workload-dependent Aging Degradation |
Author | *Shumpei Morita, Song Bian (Kyoto University, Japan), Michihiro Shintani (Nara Institute of Science and Technology, Japan), Masayuki Hiromoto, Takashi Sato (Kyoto University, Japan) |
Page | pp. 631 - 636 |
Keyword | reliability, NBTI, failure probability analysis, Monte Carlo simulation |
Abstract | We propose a fast and efficient method for predicting the failure probability under workload-dependent NBTI-induced aging. The proposed method utilizes a subset simulation (SS) framework to find the worst-case workload and its failure probability. It is shown that, using our method, the calculation of the failure probability achieves a 36x speedup compared to the Monte Carlo method. It is also demonstrated that the feasible workload that gives the worst aged delay is obtained based on the result of the SS. |
Title | (Invited Paper) Balancing Resiliency and Energy Efficiency of Functional Units in Ultra-low Power Systems |
Author | Mohammad Saber Golanbari, Anteneh Gebregiorgis, Elyas Moradi, Saman Kiamehr, *Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany) |
Page | pp. 637 - 644 |
Keyword | Reliability, Energy-efficiency, ALU, Low-power |
Abstract | For applications with stringent power budgets, such as ultra-low-power systems and the Internet of Things (IoT), power and energy are the most important constraints. Aggressive voltage scaling is a promising approach to reduce power and energy consumption. In particular, it is shown that when the supply voltage is close to the threshold voltage of the transistor, known as near-threshold computing (NTC), the energy consumption is at its minimum range. However, reducing the supply voltage not only decreases circuit performance significantly but also aggravates various reliability mechanisms. Moreover, the performance variation due to process and runtime variation increases exponentially, which makes traditional margining to address variability very inefficient. In this paper, we address energy-efficient countermeasures to combat reliability challenges at NTC in order to guarantee resilient and energy-efficient system operation. We propose to partition a functional unit such as an Arithmetic Logic Unit (ALU) into multiple smaller and faster functional units and power-gate them whenever they are not used for a long time. Simulation results on an ALU show that by applying the proposed method the energy efficiency of an ALU can be improved by up to 44.1%, or the reliability can be improved by some orders of magnitude. |
Title | (Invited Paper) Mechanical Strain and Temperature Aware Design Methodology for Thin-Film Transistor Based Pseudo-CMOS Logic Array |
Author | *Wenyu Sun, Yuxuan Huang, Qinghang Zhao, Fei Qiao (Tsinghua University, China), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Xiaojun Guo (Shanghai Jiao Tong University, China), Huazhong Yang, Yongpan Liu (Tsinghua University, China) |
Page | pp. 645 - 650 |
Keyword | TFT, Logic Array, Design methodology |
Abstract | Thin-film transistor (TFT) circuits face the challenges of unipolar devices, process variation, and yield problems, which can be addressed by a pseudo-CMOS logic array with multi-layer interconnect. However, existing design methodology does not take mechanical strain and temperature into consideration, which may seriously affect the carrier mobility of TFTs and thus the performance of whole logic array circuits. This paper presents a novel cell mapping algorithm, including an intra-row mapping step and an inter-row mapping step, for flexible logic arrays to mitigate the mobility influence. Experimental results indicate that the proposed algorithm improves critical path delay by more than 40% in the best case. |
Title | (Invited Paper) Process Design Kit for Flexible Hybrid Electronics |
Author | *Leilai Shao (UCSB, U.S.A.), Tsung-Ching Huang (Hewlett Packard Labs, U.S.A.), Ting Lei, Zhenan Bao (Stanford University, U.S.A.), Raymond Beausoleil (Hewlett Packard Labs, U.S.A.), Kwang-Ting Cheng (Hong-Kong University of Science and Technology, China) |
Page | pp. 651 - 657 |
Keyword | Design automation, Flexible Electronics, PDK, Compact Modeling, Physical Verification |
Abstract | Flexible Electronics (FE) is emerging for wearables and low-cost internet of things (IoT) nodes, benefiting from its low-cost fabrication and mechanical flexibility. Combining FE with thinned silicon chips, known as flexible hybrid electronics (FHE), can take advantage of both low-cost printed electronics and high-performance silicon chips. To design an FHE system, a process design kit (PDK) offering the capabilities for circuit design, simulation, and verification for both FE and silicon chips is needed. The key elements of FHE-PDK include technology files for design rule checking (DRC), layout versus schematic (LVS), and layout parasitics extraction (LPE), as well as SPICE-compatible models for flexible thin-film transistors (TFTs) and passive elements. Wafer-scale measurements are used to validate our SPICE models, and design rules are derived accordingly to assure a satisfactory yield. With FHE-PDK, circuit and system designers can therefore focus on design innovations and can rely on design tools to produce manufacturable designs. |
Title | (Invited Paper) From Silicon to Printed Electronics: A Coherent Modeling and Design Flow Approach Based on Printed Electrolyte Gated FETs |
Author | Gabriel Cadilha Marques, Farhan Rasheed, Jasmin Aghassi-Hagmann, *Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany) |
Page | pp. 658 - 663 |
Keyword | Design automation, Printed electronics, electrolyte-gating, oxide electronics |
Abstract | Printed electronics offers certain technological advantages over its silicon-based counterparts, such as mechanical flexibility, low process temperatures, and a maskless, additive manufacturing process, leading to extremely low-cost manufacturing. However, to be exploited in applications such as smart sensors, the Internet of Things, and wearables, it is essential that printed devices operate at low supply voltages. Electrolyte-gated field effect transistors (EGFETs) using solution-processed inorganic materials, fully printed with inkjet printers at low temperatures, are very promising candidates to provide such solutions. In this paper, we discuss the technology, process, modeling, fabrication, and design aspects of circuits based on EGFETs. We show how measurements performed in the lab can be accurately modeled and integrated into the design automation tool flow in the form of a Process Design Kit (PDK). We also review some of the remaining challenges in this technology and discuss our future directions to address them. |
Title | A Best-Fit Mapping Algorithm to Facilitate ESOP-Decomposition in Clifford+T Quantum Network Synthesis |
Author | *Giulia Meuli, Mathias Soeken (EPFL, Switzerland), Martin Roetteler, Nathan Wiebe (Microsoft Research, U.S.A.), Giovanni De Micheli (EPFL, Switzerland) |
Page | pp. 664 - 669 |
Keyword | quantum circuit, logic synthesis, ESOP-decomposition, post-synthesis optimization |
Abstract | Currently, there is great research interest in, and significant economic effort devoted to, building the first practical quantum computer. Such quantum computers promise to exceed the capabilities of conventional computers in fields such as computational chemistry, machine learning, and cryptanalysis. Automated methods to map logic designs to quantum networks are crucial to fully realizing this dream; however, existing methods can be expensive both in computation time and in the size of the resulting quantum networks. This work introduces an efficient method to map reversible single-target gates into a universal set of quantum gates (Clifford+T). This mapping method is called best-fit mapping and aims at reducing the cost of the resulting quantum network. It exploits k-LUT mapping and the existence of clean ancilla qubits to decompose a large single-target gate into a set of smaller single-target gates. In addition, this work proposes a post-synthesis optimization method to reduce the cost of the final quantum network, based on two cost-minimization properties. Results show a reduction of up to 53% in the number of T gates for the synthesized EPFL benchmark. |
Title | Exploiting Coding Techniques for Logic Synthesis of Reversible Circuits |
Author | *Alwin Zulehner, Robert Wille (Johannes Kepler University Linz, Austria) |
Page | pp. 670 - 675 |
Keyword | Synthesis, reversible circuit, coding techniques |
Abstract | Reversible circuits are composed of a set of circuit lines that are passed through a cascade of reversible gates. Since the number of circuit lines is crucial, functional logic synthesis approaches have been proposed which realize circuits with a minimal number of circuit lines. However, since the function to be realized is often non-reversible, additional variables have to be added to the function in order to establish reversibility, leading to a significant overhead that affects the scalability of the synthesis method and yields rather complex circuits. In this work, we propose to overcome these problems by exploiting coding techniques in the logic synthesis of reversible circuits. To this end, we propose an intermediate encoding of the output patterns that requires fewer additional inputs and outputs. This synthesis scheme allows the majority of the synthesis to be performed on significantly fewer variables and allows several don't care values in the code to be exploited. Experimental evaluations, in which we obtain better scalability and circuits with costs lower by orders of magnitude, confirm the benefits of the proposed synthesis approach. |
Title | Functional Decomposition Using Majority |
Author | *Zhufei Chu (Ningbo University, China), Mathias Soeken (EPFL, Switzerland), Yinshui Xia (Ningbo University, China), Giovanni De Micheli (EPFL, Switzerland) |
Page | pp. 676 - 681 |
Keyword | Functional decomposition, Majority, logic synthesis, XOR-majority graph |
Abstract | Typical operators for the decomposition of Boolean functions in state-of-the-art algorithms are AND, exclusive-OR (XOR), and the 2-to-1 multiplexer (MUX). We propose a logic decomposition algorithm that uses the majority-of-three (MAJ) operation. Such a decomposition can extend the capabilities of current logic decomposition, but has received only limited attention in previous work. In our algorithm, we derive a decomposition rule using MAJ. Combined with disjoint-support decomposition, the algorithm can factorize XOR-Majority Graphs (XMGs), a recently proposed data structure which has XOR, MAJ, and inverters as its only logic primitives. XMGs have been applied in various applications, including (i) exact-synthesis-aware rewriting, (ii) pre-optimization for 6-LUT mapping, and (iii) synthesis of quantum networks. An experimental evaluation shows that our algorithm leads to better XMGs compared to state-of-the-art algorithms, which positively affects all three applications. As one example, our experiments show that the proposed method reduces the look-up table (LUT) size/depth product on the EPFL arithmetic benchmarks after technology mapping by up to 37.1%, with an average of 9.6%. |
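To make the XMG primitives concrete, the following is a minimal Python sketch of the majority-of-three operation. It is not the paper's decomposition rule; it only checks two well-known identities (AND and OR are MAJ with one input tied to a constant) and builds a full adder from XOR and MAJ alone. The function names `maj` and `full_adder` are illustrative choices, not taken from the paper.

```python
from itertools import product


def maj(a: int, b: int, c: int) -> int:
    """Majority-of-three: returns 1 when at least two inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def full_adder(a: int, b: int, cin: int) -> tuple:
    """Full adder built only from XOR and MAJ (the XMG primitives)."""
    s = a ^ b ^ cin          # sum bit is a three-input XOR
    cout = maj(a, b, cin)    # carry-out is exactly the majority
    return s, cout


# AND and OR are special cases of MAJ with a constant third input.
for a, b in product((0, 1), repeat=2):
    assert maj(a, b, 0) == (a & b)
    assert maj(a, b, 1) == (a | b)

# MAJ is self-dual: complementing all inputs complements the output.
for a, b, c in product((0, 1), repeat=3):
    assert maj(1 - a, 1 - b, 1 - c) == 1 - maj(a, b, c)
    s, cout = full_adder(a, b, c)
    assert 2 * cout + s == a + b + c
```

The full adder illustrates why XOR and MAJ are a natural fit for arithmetic circuits: each output bit is a single primitive.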
Title | CANNA: Neural Network Acceleration using Configurable Approximation on GPGPU |
Author | Mohsen Imani, Max Masich, *Daniel Peroni, Pushen Wang, Tajana Rosing (University of California at San Diego, U.S.A.) |
Page | pp. 682 - 689 |
Keyword | Neural Network, Approximate Computing, Floating Point Units, Energy Efficiency |
Abstract | In this paper, we propose a gradual training approximation, called CANNA, which adaptively sets the level of hardware approximation depending on the neural network's internal error. To accelerate inference, CANNA's layer-based approximation approach selectively relaxes the computation in each layer of the neural network as a function of its sensitivity to approximation. For hardware support, we use a configurable floating-point unit that dynamically identifies inputs which produce the largest approximation error and processes them in precise mode instead. Our experimental evaluation shows that CANNA achieves up to 4.84x (7.13x) energy savings and 3.22x (4.64x) speedup when training four different neural network applications with 0% (2%) quality loss, compared to the implementation on a baseline GPU. |
Title | Task Assignment and Scheduling in MPSoC under Process Variation: A Stochastic Approach |
Author | *Behnam Khodabandeloo, Ahmad Khonsari (Department of Electrical and Computer Engineering, The University of Tehran/School of Computer Science, Institute for Research in Fundamental Sciences, Iran), Alireza Majidi (Department of Computer Science and Engineering, Texas A&M University, U.S.A.), Mohammad Hassan Hajiesmaili (Department of Electrical and Computer Engineering, The University of Tehran/School of Computer Science, Institute for Research in Fundamental Sciences, Iran) |
Page | pp. 690 - 695 |
Keyword | Task assignment, Process variation, Stochastic optimization, Performance yield, MILP optimization |
Abstract | Aggressive scaling in integrated circuits brings new challenges, such as increases in power density, temperature, and process variation, to the design of Multiprocessor Systems-on-Chip (MPSoCs) employed in embedded systems. While most previous works attempt to mitigate process variation effects at the system design level, the eventual design is still inefficient and suffers from the variability of the frequency and leakage power of the processors in an MPSoC. In this paper, we formulate a MILP problem for variation-aware task assignment and scheduling to optimize power consumption while meeting real-time constraints. To capture the stochastic behavior of process variation, we employ the chance-constrained programming technique to turn the problem into a corresponding stochastic optimization problem that can be solved by typical solvers. Extensive experiments using the E3S benchmarks have been carried out, and the results show that the proposed method improves on the baseline method in terms of performance yield and run-time. |
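As an illustration of how a chance constraint can be turned into a form that an ordinary solver accepts, the sketch below uses the textbook reformulation for a normally distributed quantity: P(T <= D) >= eta holds exactly when mu + z_eta * sigma <= D, where z_eta is the eta-quantile of the standard normal. The Gaussian assumption, the function names, and the numbers are all assumptions made for this example; the abstract does not state which distribution or reformulation the paper uses.

```python
from statistics import NormalDist


def deterministic_bound(mu: float, sigma: float, eta: float) -> float:
    """Smallest deadline D such that P(T <= D) >= eta for T ~ N(mu, sigma^2)."""
    z_eta = NormalDist().inv_cdf(eta)  # eta-quantile of the standard normal
    return mu + z_eta * sigma


def meets_deadline(mu: float, sigma: float, deadline: float, eta: float) -> bool:
    """Deterministic equivalent of the chance constraint P(T <= deadline) >= eta."""
    return deterministic_bound(mu, sigma, eta) <= deadline


# Hypothetical task: mean execution time 10 ms, standard deviation 1 ms.
print(meets_deadline(10.0, 1.0, 12.0, 0.95))  # deadline 12 ms at 95% yield
print(meets_deadline(10.0, 1.0, 11.0, 0.95))  # deadline 11 ms at 95% yield
```

Because the reformulated constraint is linear in `mu` and `sigma`, it can sit inside a MILP when those quantities are linear in the decision variables.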
Title | DarkMem: Fine-Grained Power Management of Local Memories for Accelerators in Embedded Systems |
Author | *Christian Pilato (Università della Svizzera italiana (USI), Switzerland), Luca P. Carloni (Columbia University, U.S.A.) |
Page | pp. 696 - 701 |
Keyword | Hardware accelerators, Power gating, SRAM |
Abstract | SRAM consumes a growing fraction of the static power in heterogeneous SoCs, as embedded memories take 70% to 90% of the area of specialized accelerators. We present DarkMem as a comprehensive solution for fine-grained power management of accelerator local memories. The DarkMem methodology optimizes at design time the bank configuration for each given accelerator to maximize power-gating opportunities. The DarkMem microarchitecture dynamically varies the operating mode of each memory bank according to the accelerator workload. In our experiments, DarkMem reduces the SRAM static power by more than 40% on average, which translates into a reduction of the total power by almost 18% on average with less than 1% overhead. |
Title | CryptoBlaze: A Partially Homomorphic Processor with Multiple Instructions and Non-Deterministic Encryption Support |
Author | *Florencia Irena, Daniel Murphy, Sri Parameswaran (The University of New South Wales, Australia) |
Page | pp. 702 - 708 |
Keyword | Homomorphic Computing, Security |
Abstract | Homomorphic computing has been suggested as a method to secure processing on insecure servers. One of the drawbacks of homomorphic processing is the enormous execution time taken to process even the simplest of operations. In this paper, we propose a processor with hardware support for homomorphic processing. The proposed processor, named CryptoBlaze, has eight additional specialized instructions and hardware to support computation on encrypted data. For the first time, we show that it is possible to build a hardware implementation of a processor with multiple instructions, support for non-deterministic Paillier encryption, and partially homomorphic processing. The system was implemented and tested on an FPGA with three benchmarks. The design space with differing security parameters was explored and results are presented. |
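For readers unfamiliar with the Paillier scheme named in the abstract, here is a toy Python sketch of its additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, and a fresh random factor makes encryption non-deterministic. This is a software illustration only and says nothing about CryptoBlaze's instructions or hardware. The tiny primes and the function names are assumptions chosen for readability and are not secure.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a toy Paillier key pair from two distinct primes (g = n + 1)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # mu = L(g^lam mod n^2)^-1 mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)  # needs Python 3.8+
    return (n, g), (lam, mu)


def encrypt(pub, m: int) -> int:
    """Encrypt m with a fresh random r, so equal plaintexts give varying ciphertexts."""
    n, g = pub
    while True:
        r = secrets.randbelow(n - 1) + 1  # r in 1..n-1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(pub, priv, c: int) -> int:
    n, _ = pub
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(pub, c1: int, c2: int) -> int:
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pub
    return (c1 * c2) % (n * n)


pub, priv = keygen(293, 433)  # toy primes, far too small for real use
c1, c2 = encrypt(pub, 1200), encrypt(pub, 34)
assert decrypt(pub, priv, add_encrypted(pub, c1, c2)) == 1234
```

The scheme is only partially homomorphic: addition of plaintexts is supported on ciphertexts, but multiplication of two encrypted values is not.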
Title | PMU-Trojan: On Exploiting Power Management Side Channel for Information Leakage |
Author | *Md Nazmul Islam, Sandip Kundu (University of Massachusetts Amherst, U.S.A.) |
Page | pp. 709 - 714 |
Keyword | Advanced Encryption Standard (AES) key, Hardware Trojan Horse (HTH), Power Management Unit (PMU), Dynamic Voltage and Frequency Scaling (DVFS), Data Center |
Abstract | Hardware Trojans are malicious, undesired, intentional modifications introduced in an Integrated Circuit (IC) which can be leveraged by a knowledgeable adversary to compromise the security of the IC. Trojans might be designed to modify the functionality of an IC, access sensitive information, or even disable or destroy a system. In this paper, we propose PMU-Trojan, a hardware Trojan for covertly leaking confidential information, such as a cryptographic secret key, to an adversary. For information leakage by the hardware Trojan, we exploit a backdoor created by the Power Management Unit (PMU) in a Multi-Processor System-on-Chip (MPSoC). The PMU is a system block responsible for initiating voltage and frequency changes to facilitate flexible power management and energy efficiency. It transmits voltage level change requests to the power supply. In this paper, we leverage this facility as an information side channel to leak information to power-supply co-tenants. While the proposed approach is general and can be applied to any kind of secret information leakage, for the purpose of illustration we focus in this study on leaking an Advanced Encryption Standard (AES) key. We demonstrate the working principle of this system in a Linux environment where a co-tenant thread monitors the voltage level and receives side-channel information from a thread affected by the Trojan. This scheme also defeats Differential Power Analysis (DPA) based Trojan detection, owing to a low information bit rate spread over a long duration by a Trojan unit dissipating power at a mere picowatt level. |
Title | A Low-overhead PUF based on Parallel Scan Design |
Author | *Wenxuan Wang, Aijiao Cui (Harbin Institute of Technology Shenzhen Graduate School, China), Gang Qu (University of Maryland, U.S.A.), Huawei Li (Institute of Computing Technology, Chinese Academy of Sciences, China) |
Page | pp. 715 - 720 |
Keyword | PUF, parallel scan, arbiter, overhead |
Abstract | The physical unclonable function (PUF) is a promising security primitive. Most existing delay-based PUF designs are independent of the original circuit. The extra PUF circuitry not only makes the PUF vulnerable to removal attacks, but also incurs high area overhead. In this paper, we propose to reuse the parallel scan design existing in the original circuit to implement the PUF. The basic idea is to pass the same input signal to two scannable flip-flops and to use the discrepancy in the two output signals' arrival times to generate a PUF bit. Symmetrical SR-latches are used as arbiters to reduce the PUF design cost. Compared to the previous scan-based PUF using a single scan chain, the proposed approach avoids the requirement for a rigorous high-frequency clock. It simultaneously reduces the area overhead and improves the robustness against removal attacks. The proposed PUF design is implemented on Xilinx Virtex-5 FPGA boards. Experimental results show that it has a high level of uniqueness of 49.86%, very good randomness, and acceptable reliability under temperature and voltage variations. |
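The uniqueness figure quoted above is easier to interpret with a small example. The sketch below assumes the commonly used definition of PUF uniqueness, the average pairwise Hamming distance between the responses of different chips to the same challenge, expressed as a percentage, for which 50% is the ideal value. The abstract does not spell out its metric, so this definition, the function names, and the sample responses are assumptions.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length responses differ."""
    return sum(x != y for x, y in zip(a, b))


def uniqueness(responses: list) -> float:
    """Average pairwise inter-chip Hamming distance, as a percentage."""
    n_bits = len(responses[0])
    pairs = list(combinations(responses, 2))
    total = sum(hamming_distance(a, b) for a, b in pairs)
    return 100.0 * total / (len(pairs) * n_bits)


# Hypothetical 8-bit responses from three chips to the same challenge.
chips = ["01010101", "00110011", "11110000"]
print(uniqueness(chips))  # 50.0
```

A value near 50% means that knowing one chip's response reveals essentially nothing about another chip's response.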
Title | Security Analysis and Enhancement of Model Compressed Deep Learning Systems under Adversarial Attacks |
Author | Qi Liu, Tao Liu, Zihao Liu (Florida International University, U.S.A.), Yanzhi Wang (Syracuse University, U.S.A.), Yier Jin (University of Florida, U.S.A.), *Wujie Wen (Florida International University, U.S.A.) |
Page | pp. 721 - 726 |
Keyword | deep learning system, security, adversarial attacks |
Abstract | State-of-the-art Deep Neural Networks (DNNs) achieve human-level performance on many complex intelligent tasks, thanks to machine learning model innovation and recent hardware advancement. However, this also introduces ever-increasing security concerns for those intelligent systems. For example, emerging adversarial attacks indicate that even small, imperceptible input perturbations may severely damage the cognitive functionality of deep learning systems (DLSs). Although much relevant research has been conducted, all existing DNN adversarial studies are based on a single uncertainty factor, i.e., input perturbations, and are performed solely on ideal software-level DNN models, without considering the DNN model reshaping introduced by various hardware-favorable techniques, such as network pruning and HashNet, during practical task execution on DNN hardware platforms. Whether and how those software-based attacks and defense solutions can be applied and/or exploited in practical DLSs remains unexplored. In this work, we investigate for the first time the multi-factor adversarial attack problem in more practical hardware-oriented deep learning systems by jointly considering DNN model reshaping (e.g., HashNet-based deep compression) and input perturbations. Comprehensive robustness and vulnerability analyses are conducted based on the proposed semi-analytical modeling and simulation. Inspired by the security analysis, a defense technique named "gradient inhibition" is further proposed to prevent the generation of input perturbations and thus to effectively mitigate adversarial attacks on either hardware-oriented deep learning systems or software-based DNNs. Simulation results validate our proposed solution. |
Title | HLIFT: A High-level Information Flow Tracking Method for Detecting Hardware Trojans |
Author | *Chenguang Wang, Yici Cai, Qiang Zhou (Tsinghua University, China) |
Page | pp. 727 - 732 |
Keyword | hardware Trojans detection, unspecified output pins, information flow tracking, feature matching, statement-level CDFG |
Abstract | In this paper, we note that the hardware Trojans that leak information through the unspecified output pins are difficult to detect by functional testing or side-channel signal analysis. To solve this problem, we propose a feature matching method based on information flow tracking at high abstraction level. Experimental results show that our method can successfully identify the above-mentioned Trojans from Trust-hub, DeTrust, and OpenCores in less than 20, showing significantly lower time complexity compared with the existing works. |
Title | System-on-Chip Security Architecture and CAD Framework for Hardware Patch |
Author | *Atul Prasad Deb Nath (University of Florida, U.S.A.), Sandip Ray (NXP Semiconductors, U.S.A.), Abhishek Basak (Intel Corporation, U.S.A.), Swarup Bhunia (University of Florida, U.S.A.) |
Page | pp. 733 - 738 |
Keyword | Security Architecture, System-on-Chip (SoC) Security, Security Policy, Trusted SoC |
Abstract | System-on-Chip (SoC) security architectures targeted towards diverse applications, including Internet of Things (IoT) and automotive systems, enforce two critical design requirements: in-field configurability and low overhead. To simultaneously address these constraints, in this paper we present a novel, flexible, and adaptable SoC security architecture that efficiently implements diverse security policies. The architecture and associated CAD flow enable "hardware patching", i.e., a hardware security policy engine that can be seamlessly and securely upgraded in the field to address unanticipated attacks or new security requirements. We implement (1) a centralized Reconfigurable Security Policy Engine (RSPE), (2) smart security wrappers, and (3) a Design-for-Debug (DfD) infrastructure interface as the building blocks of the architecture. The proposed framework provides a systematic approach to represent and synthesize diverse security policies. Through extensive analysis using representative SoC models, we show, for the first time to our knowledge, that the proposed framework provides a high level of patchability with minimal energy and performance overhead. |