The 24th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule

Tuesday, January 22, 2019

Room Saturn | Room Uranus | Room Venus | Room Mars+Room Mercury
1K  (Miraikan Hall)
Opening & Keynote I

9:00 - 10:30
Coffee Break
10:30 - 10:45
1A  University Design Contest
10:45 - 12:00
1B  Real-time Embedded Software
10:45 - 12:00
1C  Hardware and System Security
10:45 - 12:00
1D  Thermal- and Power-Aware Design and Optimization
10:45 - 12:00
Lunch Break / University LSI Design Contest Poster Presentation (Room Jupiter)
12:00 - 13:30
2A  (SS-1) Reverse Engineering: growing more mature – and facing powerful countermeasures
13:30 - 15:35
2B  All about PIM
13:30 - 15:35
2C  Design for Reliability
13:30 - 15:35
2D  New Advances in Emerging Computing Paradigms
13:30 - 15:35
Coffee Break
15:35 - 15:55
3A  (SS-2) Design, testing, and fault tolerance of Neuromorphic systems
15:55 - 17:10
3B  Memory-Centric Design and Synthesis
15:55 - 17:10
3C  Efficient Modeling of Analog, Mixed Signal and Arithmetic Circuits
15:55 - 17:10
3D  Logic and Precision Optimization for Neural Network Designs
15:55 - 17:10
ACM SIGDA Student Research Forum at ASP-DAC 2019 (Room Jupiter)
18:00 - 20:00

Wednesday, January 23, 2019

Room Saturn | Room Uranus | Room Venus | Room Mars+Room Mercury
2K  (Miraikan Hall)
Keynote II

9:00 - 10:00
Coffee Break
10:00 - 10:20
4A  (SS-3) Modern Mask Optimization: From Shallow To Deep Learning
10:20 - 12:00
4B  System Level Modelling Methods I
10:20 - 12:00
4C  Testing and Design for Security
10:20 - 12:00
4D  Network-Centric Design and System
10:20 - 12:00
Lunch Break / Supporters' Session (Miraikan Hall)
12:00 - 13:50
5A  (DF-1) Robotics: From System Design to Application
13:50 - 15:05
5B  Advanced Memory Systems
13:50 - 15:05
5C  Learning: Make Patterning Light and Right
13:50 - 15:05
5D  Design and CAD for Emerging Memories
13:50 - 15:05
Coffee Break
15:05 - 15:25
6A  (DF-2) Advanced Imaging Technologies and Applications
15:35 - 17:15
6B  Optimized Training for Neural Networks
15:35 - 16:50
6C  New Trends in Biochips
15:35 - 16:50
6D  Power-efficient Machine Learning Hardware Design
15:35 - 16:50
Banquet (Hilton Tokyo Odaiba, "Orion")
18:30 - 20:30

Thursday, January 24, 2019

Room Saturn | Room Uranus | Room Venus | Room Mars+Room Mercury
3K  (Miraikan Hall)
Keynote III

9:00 - 10:00
Coffee Break
10:00 - 10:20
7A  (SS-4) Security of Machine Learning and Machine Learning for Security: Progress and Challenges for Secure, Machine Intelligent Mobile Systems
10:20 - 12:00
7B  System Level Modelling Methods II
10:20 - 12:00
7C  Placement
10:20 - 12:00
7D  Algorithms and Architectures for Emerging Applications
10:20 - 12:00
Lunch Break
12:00 - 13:15
8A  (DF-3) Emerging Technologies for Tokyo Olympic 2020
13:15 - 14:30
8B  Embedded Software for Parallel Architecture
13:15 - 14:30
8C  Machine Learning and Hardware Security
13:15 - 14:30
8D  Memory Architecture for Efficient Neural Network Computing
13:15 - 14:30
Coffee Break
14:30 - 14:50
9A  (DF-4) Beyond the Virtual Reality World
14:50 - 16:05
9B  Logic-Level Security and Synthesis
14:50 - 16:05
9C  Analysis and Algorithms for Digital Design Verification
14:50 - 16:05
9D  FPGA and Optics-Based Neural Network Designs
14:50 - 16:05
Coffee Break
16:05 - 16:25
10A  (SS-5) The Resurgence of Reconfigurable Computing in the Post Moore Era
16:25 - 17:40
10B  Hardware Acceleration
16:25 - 17:40
10C  Routing
16:25 - 17:40

DF: Designers' Forum, SS: Special Session

List of papers

Tuesday, January 22, 2019

Session 1K  Opening & Keynote I
Time: 9:00 - 10:30 Tuesday, January 22, 2019
Location: Miraikan Hall
Chair: Toshiyuki Shibuya (Fujitsu Labs., Japan)

1K-1 (Time: 9:30 - 10:30)
Title: (Keynote Address) Development trend of artificial intelligence technology and its application in the field of robotics
Author: Tao Zhang (Tsinghua University, China)
Abstract: In recent years, artificial intelligence has advanced remarkably, and this has brought various robot systems to a more advanced level as well. Meanwhile, the development of robot technology can significantly promote innovations in artificial intelligence technologies. This talk will introduce the development trends of artificial intelligence technologies by summarizing the main achievements in each technological field. Furthermore, it will enumerate the applications of artificial intelligence technologies in various robot systems, such as unmanned vehicles, unmanned aerial vehicles, service robots, space robots, and marine robots. We hope our viewpoints and predictions will be realized in the near future, and we believe that the world will be changed for the better and human life will be improved by means of artificial intelligence technology and its application in the field of robotics.

Session 1A  University Design Contest
Time: 10:45 - 12:00 Tuesday, January 22, 2019
Location: Room Saturn
Chairs: Kousuke Miyaji (Shinshu University, Japan), Akira Tsuchiya (The University of Shiga Prefecture, Japan)

Best Design Award
1A-1 (Time: 10:45 - 10:48)
Title: A Wide Conversion Ratio, 92.8% Efficiency, 3-Level Buck Converter with Adaptive On/Off-Time Control and Shared Charge Pump Intermediate Voltage Regulator
Author: *Kousuke Miyaji, Yuki Karasawa, Takanobu Fukuoka (Shinshu University, Japan)
Page: pp. 1 - 2
Keyword: 3-level buck converter, wide conversion ratio, adaptive on-time control, charge pump
Abstract: An efficient cascode 3-level buck converter with adaptive on/off-time (AOOT) control and a shared charge pump (CP) intermediate voltage (VMID) regulator is proposed and demonstrated. The conversion ratio (CR) VOUT/VIN is enhanced by using the proposed AOOT control scheme, where the control switches between adaptive on-time and adaptive off-time mode according to the target CR. The proposed CP shares the flying capacitor CFLY and power switches in the 3-level buck converter to generate VMID=VIN/2, achieving both small size and low loss. The proposed 3-level buck converter is implemented in a standard 0.25 µm CMOS process. A maximum efficiency of 92.8% and a wide CR are obtained with the integrated VMID regulator.

Special Feature Award
1A-2 (Time: 10:48 - 10:51)
Title: A Three-Dimensional Millimeter-Wave Frequency-Shift Based CMOS Biosensor using Vertically Stacked Spiral Inductors in LC Oscillators
Author: *Maya Matsunaga, Taiki Nakanishi, Atsuki Kobayashi (Nagoya University, Japan), Kiichi Niitsu (Nagoya University/JST PRESTO, Japan)
Page: pp. 3 - 4
Keyword: CMOS, Three-dimensional, LC oscillator, Biosensor
Abstract: This paper presents a millimeter-wave frequency-shift-based CMOS biosensor that is capable of providing three-dimensional (3D) resolution. The vertical resolution from the sensor surface is obtained using dual-layer LC oscillators, which enable 3D target detection. The LC oscillators produce different frequency shifts from the desired resonant frequency due to the frequency-dependent complex relative permittivity of the biomolecular target. The measurement results from a 65-nm test chip demonstrated the feasibility of achieving 3D resolution.

Special Feature Award
1A-3 (Time: 10:51 - 10:54)
Title: Design of 385 x 385 µm² 0.165V 270pW Fully-Integrated Supply-Modulated OOK Transmitter in 65nm CMOS for Glasses-Free, Self-Powered, and Fuel-Cell-Embedded Continuous Glucose Monitoring Contact Lens
Author: *Kenya Hayashi, Shigeki Arata, Ge Xu, Shunya Murakami, Cong Dang Bui, Takuyoshi Doike, Maya Matsunaga, Atsuki Kobayashi, Kiichi Niitsu (Nagoya University, Japan)
Page: pp. 5 - 6
Keyword: continuous glucose monitoring, smart contact lens, wireless transmitter, glucose fuel cell, low-power
Abstract: This work presents the lowest-power sub-mm² supply-modulated OOK transmitter for enabling self-powered continuous glucose monitoring (CGM) contact lenses. By combining the transmitter with a glucose fuel cell, which functions as both the power source and the sensing transducer, a self-powered CGM contact lens can be realized. The 385 x 385 µm² test chip implemented in 65-nm standard CMOS technology consumes 270 pW at 0.165 V and successfully demonstrates self-powered operation using a 2 x 2 mm² solid-state glucose fuel cell.

1A-4 (Time: 10:54 - 10:57)
Title: 2D Optical Imaging Using Photosystem I Photosensor Platform with 32×32 CMOS Biosensor Array
Author: *Kiichi Niitsu, Taichi Sakabe (Nagoya University, Japan), Mariko Miyachi, Yoshinori Yamanoi, Hiroshi Nishihara (University of Tokyo, Japan), Tatsuya Tomo (Tokyo University of Science, Japan), Kazuo Nakazato (Nagoya University, Japan)
Page: pp. 7 - 8
Keyword: 2D, CMOS, PSI, biosensor
Abstract: This paper presents 2D imaging using a photosensor platform with a newly proposed large-scale CMOS biosensor array in a 0.6-µm standard CMOS process. The platform combines photosystem I (PSI) isolated from Thermosynechococcus elongatus and a large-scale CMOS biosensor array. PSI converts the absorbed photons into electrons, which are then sensed by the CMOS biosensor array. The prototyped photosensor enables CMOS-based 2D imaging using PSI for the first time.

1A-5 (Time: 10:57 - 11:00)
Title: Design of Gate-Leakage-Based Timer Using an Amplifier-Less Replica-Bias Switching Technique in 55-nm DDC CMOS
Author: *Atsuki Kobayashi, Yuya Nishio, Kenya Hayashi, Shigeki Arata (Nagoya University, Japan), Kiichi Niitsu (Nagoya University/JST PRESTO, Japan)
Page: pp. 9 - 10
Keyword: gate-leakage, logic circuits based, low-voltage operation, oscillator, subthreshold
Abstract: A design of a gate-leakage-based timer using an amplifier-less replica-bias switching technique that can realize stable and low-voltage operation is presented. To generate a stable oscillation frequency, a topology that discharges the pre-charged capacitor via a gate-leaking MOS capacitor with a low-leakage switch and logic circuits is employed. The test chip fabricated in 55-nm deeply depleted channel (DDC) CMOS technology achieves an Allan deviation floor of 200 ppm at a supply voltage of 350 mV in a 0.0022 mm² area.

1A-6 (Time: 11:00 - 11:03)
Title: A Low-Voltage CMOS Electrophoresis IC Using Electroless Gold Plating for Small-Form-Factor Biomolecule Manipulation
Author: *Kiichi Niitsu, Yuuki Yamaji, Atsuki Kobayashi, Kazuo Nakazato (Nagoya University, Japan)
Page: pp. 11 - 12
Keyword: CMOS, Electrophoresis, Biomolecule manipulation
Abstract: We present a sub-1-V CMOS-based electrophoresis method for small-form-factor biomolecule manipulation that is contained in a microchip. This is the first time this type of device has been presented in the literature. By combining CMOS technology with electroless gold plating, the electrode pitch can be reduced and the required input voltage can be decreased to less than 1 V. We fabricated the CMOS electrophoresis chip in a cost-competitive 0.6 µm standard CMOS process. A sample/hold circuit in each cell is used to generate a constant output from an analog input. After forming gold electrodes using an electroless gold plating technique, we were able to manipulate red food coloring with a 0-0.7 V input voltage range. The results show that the proposed CMOS chip is effective for electrophoresis-based manipulation.

1A-7 (Time: 11:03 - 11:06)
Title: A Low-Voltage Low-Power Multi-Channel Neural Interface IC Using Level-Shifted Feedback Technology
Author: *Liangjian Lyu, Yu Wang (Fudan University, China), Chixiao Chen, C. -J. Richard Shi (University of Washington, U.S.A.)
Page: pp. 13 - 14
Keyword: neural recorder, analog front end, low voltage, low power, low noise
Abstract: A low-voltage low-power 16-channel neural interface front-end IC for in-vivo neural recording applications is presented in this paper. A current reuse telescope amplifier is used to achieve better noise efficiency factor (NEF). Power efficiency factor (PEF) is further improved by reducing supply voltage with the proposed level-shifted feedback (LSFB) technique. The neural interface is fabricated in a 65 nm CMOS process. It operates under 0.6V supply voltage consuming 1.07 μW/channel. An input referred noise of 5.18 μV is measured, leading to a NEF of 2.94 and a PEF of 5.19 over 10 kHz bandwidth.

1A-8 (Time: 11:06 - 11:09)
Title: Development of a High Stability, Low Standby Power Six-Transistor CMOS SRAM Employing a Single Power Supply
Author: *Nobuaki Kobayashi (Nihon University, Japan), Tadayoshi Enomoto (Chuo University, Japan)
Page: pp. 15 - 16
Keyword: CMOS, SRAM, High stability, Low leakage, SVL
Abstract: We developed and applied a new circuit, called the “Self-controllable Voltage Level (SVL)” circuit, not only to expand both “write” and “read” stabilities, but also to achieve a low stand-by power and data holding capability in a single low power supply, 90-nm, 2-kbit, six-transistor CMOS SRAM. The SVL circuit can adaptively lower and raise the word-line voltages for a “read” and “write” operation, respectively. It can also adaptively lower and raise the memory cell supply voltages for the “write” and “hold” operations, and “read” operation, respectively. The Si area overhead of the SVL circuit is only 1.383% of the conventional SRAM.

1A-9 (Time: 11:09 - 11:12)
Title: Design of Heterogeneously-integrated Memory System with Storage Class Memories and NAND Flash Memories
Author: *Chihiro Matsui, Ken Takeuchi (Chuo University, Japan)
Page: pp. 17 - 18
Keyword: Design methodology, Non-volatile memory systems, Heterogeneously-integrated memory system, Storage class memory (SCM), NAND flash memory
Abstract: A heterogeneously-integrated memory system is configured with storage class memories (SCMs) and NAND flash memories. SCMs are faster than NAND flash, and they are divided into memory type and storage type according to their characteristics. NAND flash memories are also classified by the number of stored bits per memory cell. These non-volatile memories have different characteristics such as access speed, capacity and bit cost. This paper discusses a design methodology for choosing appropriate configurations of the heterogeneously-integrated memory system for various types of applications.

1A-10 (Time: 11:12 - 11:15)
Title: A 65-nm CMOS Fully-Integrated Circulating Tumor Cell and Exosome Analyzer Using an On-Chip Vector Network Analyzer and a Transmission-Line-Based Detection Window
Author: Taiki Nakanishi, Maya Matsunaga, Shunya Murakami, Atsuki Kobayashi, *Kiichi Niitsu (Nagoya University, Japan)
Page: pp. 19 - 20
Keyword: CMOS, CTC, exosome, liquid biopsy
Abstract: A fully-integrated CMOS circuit based on a vector network analyzer (VNA) and a transmission-line-based detection window for circulating tumor cell (CTC) and exosome analysis is presented. We have introduced a fully-integrated architecture, which eliminates the undesired parasitic components and enables high sensitivity, for analysis of extremely low-concentration CTC in blood. To validate the operation of the proposed system, a test chip was fabricated using 65-nm CMOS technology. Measurement results show the effectiveness of the approach.

1A-11 (Time: 11:15 - 11:18)
Title: Low Standby Power CMOS Delay Flip-Flop with Data Retention Capability
Author: *Nobuaki Kobayashi (Nihon University, Japan), Tadayoshi Enomoto (Chuo University, Japan)
Page: pp. 21 - 22
Keyword: CMOS, Delay Flip-Flop, low leakage, SVL
Abstract: We developed and applied a new circuit, called the self-controllable voltage level (SVL) circuit, not only to achieve low standby power dissipation (PST) while retaining data, but also to switch very quickly between an operational mode and a standby mode, in a single power source, 90-nm CMOS delay flip-flop (D-FF). The PST of the developed D-FF is only 5.585 nW/bit, 14.81% of the 37.71 nW/bit of the conventional D-FF at a supply voltage (VDD) of 1.0 V. The static-noise margin of the developed D-FF is 0.2576 V, and that of the conventional D-FF is 0.3576 V (at VDD of 1.0 V). The Si area overhead of the SVL circuit is 11.62% of the conventional D-FF.

1A-12 (Time: 11:18 - 11:21)
Title: Accelerate Pattern Recognition for Cyber Security Analysis
Author: *Mohammad Tahghighi, Wei Zhang (Hong Kong University of Science & Technology, Hong Kong)
Page: pp. 23 - 24
Keyword: Hardware/software co-design, Security, Acceleration
Abstract: Network security analysis is about processing the network equipment's log records to capture malicious and anomalous traffic. Scrutinizing a huge number of records to capture complex patterns is time consuming and difficult to parallelize. In this paper, we propose a hardware/software co-designed system to address this problem for specific IP chaining patterns.

1A-13 (Time: 11:21 - 11:24)
Title: FPGA Laboratory System supporting Power Measurement for Low-Power Digital Design
Author: Marco Winzker, *Andrea Schwandt (Bonn-Rhein-Sieg University, Germany)
Page: pp. 25 - 26
Keyword: FPGA, remote lab, low-power design, open educational resource
Abstract: Power measurement of a digital design implementation supports development of low-power systems and gives insight into the performance of a circuit. A laboratory system is presented that consists of an FPGA board for use in a hands-on and remote laboratory. Measurement results show how the system can be utilized for teaching and research.

Session 1B  Real-time Embedded Software
Time: 10:45 - 12:00 Tuesday, January 22, 2019
Location: Room Uranus
Chairs: Zhaoyan Shen (Shandong University, China), Zebo Peng (Linköping University, Sweden)

1B-1 (Time: 10:45 - 11:10)
Title: Towards Limiting the Impact of Timing Anomalies in Complex Real-Time Processors
Author: *Pedro Benedicte (Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, Spain), Jaume Abella, Carles Hernandez, Enrico Mezzetti (Barcelona Supercomputing Center, Spain), Francisco J. Cazorla (Barcelona Supercomputing Center and IIIA-CSIC, Spain)
Page: pp. 27 - 32
Keyword: Timing Anomalies, WCET, embedded critical systems
Abstract: Timing verification of embedded critical real-time systems is hindered by complex designs. Timing anomalies, deeply analyzed in static timing analysis, require specific solutions to bound their impact. For the first time, we study the concept and impact of timing anomalies in measurement-based timing analysis, the approach most used in industry, showing that they need to be considered and handled differently. In addition, we analyze anomalies in the context of Measurement-Based Probabilistic Timing Analysis, which simplifies quantifying their impact.

1B-2 (Time: 11:10 - 11:35)
Title: SeRoHAL: Generation of Selectively Robust Hardware Abstraction Layers for Efficient Protection of Mixed-criticality Systems
Author: *Petra R. Kleeberger, Juana Rivera, Daniel Mueller-Gritschneder, Ulf Schlichtmann (Technische Universität München, Germany)
Page: pp. 33 - 38
Keyword: Hardware errors, Software-based safety, code generation
Abstract: A major challenge in mixed-criticality system design is to ensure safe behavior under the influence of hardware errors while complying with cost and performance constraints. SeRoHAL generates hardware abstraction layers with software-based safety mechanisms to handle errors in peripheral interfaces. To reduce performance and memory overheads, SeRoHAL can select protection mechanisms depending on the criticality of the hardware accesses. We evaluated SeRoHAL on robot arm control software. During fault injection, it prevents up to 76% of the assertion failures. Selective protection customized to the criticality of the accesses reduces the induced overheads significantly compared to protection of all hardware accesses.

1B-3 (Time: 11:35 - 12:00)
Title: Partitioned and Overhead-Aware Scheduling of Mixed-Criticality Real-Time Systems
Author: *Yuanbin Zhou, Soheil Samii, Petru Eles, Zebo Peng (Linköping University, Sweden)
Page: pp. 39 - 44
Keyword: real-time systems, mixed-criticality systems, scheduling
Abstract: Modern real-time embedded and cyber-physical systems comprise a large number of applications, often of different criticalities, executing on the same computing platform. Partitioned scheduling is used to provide temporal isolation among tasks with different criticalities. Isolation is often a requirement, for example, in order to avoid the case when a low criticality task overruns or fails in such a way that causes a failure in a high criticality task. When the number of partitions increases in mixed criticality systems, the size of the schedule table can become extremely large, which becomes a critical bottleneck due to design time and memory constraints of embedded systems. In addition, switching between partitions at runtime causes CPU overhead due to preemption. In this paper, we propose a design framework comprising a hyper-period optimization algorithm, which reduces the size of the schedule table and preserves schedulability, and a re-scheduling algorithm to reduce the number of preemptions. Extensive experiments demonstrate the effectiveness of the proposed algorithms and design framework.
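
Note on the term used in the 1B-3 abstract: the hyper-period of a periodic task set is the least common multiple of the task periods, and a static schedule table has to cover one full hyper-period. The sketch below only illustrates that relationship with hypothetical periods (the values and the function name are assumptions); it is not the authors' optimization algorithm.

```python
from math import lcm  # available in Python 3.9+


def hyper_period(periods):
    """Least common multiple of all task periods (integer time units)."""
    return lcm(*periods)


# Hypothetical task periods in milliseconds, chosen only for illustration.
original = [10, 15, 21]
adjusted = [10, 15, 20]  # one period shortened from 21 to 20

print(hyper_period(original))  # 210
print(hyper_period(adjusted))  # 60
```

Shortening a period raises processor utilization, so schedulability has to be rechecked after any such adjustment.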

Session 1C  Hardware and System Security
Time: 10:45 - 12:00 Tuesday, January 22, 2019
Location: Room Venus
Chairs: Ray C.C. Cheung (City University of Hong Kong, Hong Kong), Hai Zhou (Northwestern University, U.S.A.)

1C-1 (Time: 10:45 - 11:10)
Title: Layout Recognition Attacks on Split Manufacturing
Author: *Wenbin Xu, Lang Feng, Jeyavijayan Rajendran, Jiang Hu (Texas A&M University, U.S.A.)
Page: pp. 45 - 50
Keyword: Hardware security, split manufacturing, layout recognition attack
Abstract: One technique to prevent attacks from an untrusted foundry is split manufacturing, where only a part of the layout is sent to the untrusted high-end foundry, and the rest is manufactured at a trusted low-end foundry. The untrusted foundry has the front-end-of-line (FEOL) layout and the original circuit netlist and attempts to identify critical components on the layout for Trojan insertion. Although defense methods for this scenario have been developed, the corresponding attack technique is not well explored. For instance, the Boolean satisfiability (SAT) based bijective mapping attack has been mentioned without detailed research. Hence, the defense methods are mostly evaluated with the k-security metric without actual attacks. We provide the first systematic study, to the best of our knowledge, on attack techniques in this scenario. Besides implementing the SAT-based bijective mapping attack, we develop a new attack technique based on structural pattern matching. Experimental comparison with the bijective mapping attack shows that the new attack technique achieves about the same success rate with much faster speed for cases without the k-security defense, and has a much better success rate at the same runtime for cases with the k-security defense. The results offer an alternative and practical interpretation for k-security in split manufacturing.

1C-2 (Time: 11:10 - 11:35)
Title: Execution of Provably Secure Assays on MEDA Biochips to Thwart Attacks
Author: *Tung-Che Liang (Duke University, U.S.A.), Mohammed Shayan (New York University, U.S.A.), Krishnendu Chakrabarty (Duke University, U.S.A.), Ramesh Karri (New York University, U.S.A.)
Page: pp. 51 - 57
Keyword: MEDA biochip, Biochip security, Droplet-location map
Abstract: Digital microfluidic biochips (DMFBs) have emerged as a promising platform for DNA sequencing, clinical chemistry, and point-of-care diagnostics. Recent research has shown that DMFBs are susceptible to various types of malicious attacks. Defenses proposed thus far only offer probabilistic guarantees of security due to the limitation of on-chip sensor resources. A micro-electrode-dot-array (MEDA) biochip is a next-generation DMFB that enables the sensing of on-chip droplet locations, which are captured in the form of a droplet-location map. We propose a security mechanism that validates assay execution by reconstructing the sequencing graph (i.e., the assay specification) from the droplet-location maps and comparing it against the golden sequencing graph. We prove that there is a unique (one-to-one) mapping from the set of droplet-location maps (over the duration of the assay) to the set of possible sequencing graphs. Any deviation in the droplet-location maps due to an attack is detected by this countermeasure because the resulting derived sequencing graph is not isomorphic to the original sequencing graph. We highlight the strength of the security mechanism by simulating attacks on real-life bioassays.

1C-3 (Time: 11:35 - 12:00)
Title: TAD: Time Side-Channel Attack Defense of Obfuscated Source Code
Author: *Alexander Fell, Hung Thinh Pham, Siew Kei Lam (Nanyang Technological University, Singapore)
Page: pp. 58 - 63
Keyword: Software obfuscation, hardware diversification, time side-channel attacks, reverse-engineering
Abstract: Program obfuscation is widely used to protect commercial software against reverse-engineering. However, an adversary can still download, disassemble and analyze binaries of the obfuscated code executed on an embedded System-on-Chip (SoC), and by correlating execution times to input values, extract secret information from the program. In this paper, we show (1) the impact of widely-used obfuscation methods on timing leakage, and (2) that well-known software countermeasures to reduce timing leakage of programs are not always effective for low-noise environments found in embedded systems. We propose two methods for mitigating timing leakage in obfuscated codes. The first is a compiler-driven method, called TAD, which removes conditional branches with distinguishable execution times for an input program. In the second method (TADCI), TAD is combined with dynamic hardware diversity by replacing primitive instructions with Custom Instructions (CIs) that exhibit non-deterministic execution times at runtime. Experimental results on the RISC-V platform show that the information leakage is reduced by 92% and 82% when TADCI is applied to the original and obfuscated source code, respectively.

Session 1D  Thermal- and Power-Aware Design and Optimization
Time: 10:45 - 12:00 Tuesday, January 22, 2019
Location: Room Mars+Room Mercury
Chairs: Hussam Amrouch (Karlsruhe Institute of Technology (KIT), Germany), Jiang Hu (Texas A&M University, U.S.A.)

Best Paper Candidate
1D-1 (Time: 10:45 - 11:10)
Title: Leakage-Aware Thermal Management for Multi-Core Systems Using Piecewise Linear Model Based Predictive Control
Author: Xingxing Guo, *Hai Wang, Chi Zhang, He Tang, Yuan Yuan (University of Electronic Science and Technology of China, China)
Page: pp. 64 - 69
Keyword: Thermal management, leakage power, multi-core, model predictive control
Abstract: Performing thermal management on new generation IC chips is challenging. This is because the leakage power, which is significant in these chips, is nonlinearly related to temperature, resulting in a complex nonlinear control problem in thermal management. In this paper, a new dynamic thermal management (DTM) method with piecewise linear (PWL) thermal model based predictive control is proposed to solve the nonlinear control problem. First, the original nonlinear leakage power is approximated by piecewise linear functions expanded at several Taylor expansion points. These Taylor expansion points are carefully selected by a new algorithm which exploits the thermal behavior property of the IC chips. Based on these piecewise linear functions, a series of local linear thermal models are built, which have the same structure as the normal linear thermal model. Then, using these local linear thermal models, a new PWL model based predictive control method is proposed to compute the future power recommendation. By approximating the nonlinearity accurately with the PWL thermal models and being equipped with a predictive control technique, the new method can achieve overall high-quality temperature management with smooth and accurate temperature tracking. Experimental results show that the new method outperforms the linear model predictive control based method in temperature management quality with negligible computing overhead.
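
As a rough illustration of the piecewise linear idea in the 1D-1 abstract, the sketch below linearizes an assumed exponential leakage model at a few expansion points using a first-order Taylor expansion. The model form, the constants P0, K and T0, the function names, and the evenly spaced expansion points are all assumptions made for illustration; the paper selects its expansion points with its own algorithm, which is not reproduced here.

```python
import math

# Hypothetical leakage model: P(T) = P0 * exp(K * (T - T0)).
P0, K, T0 = 1.0, 0.02, 300.0


def leakage(t):
    """Assumed nonlinear leakage power at temperature t (kelvin)."""
    return P0 * math.exp(K * (t - T0))


def linearize(t_exp):
    """First-order Taylor expansion around t_exp: P(t) ~ slope * t + intercept."""
    slope = K * leakage(t_exp)  # derivative of the exponential model
    intercept = leakage(t_exp) - slope * t_exp
    return slope, intercept


def pwl_leakage(t, points):
    """Evaluate the local linear model of the nearest expansion point."""
    t_exp = min(points, key=lambda p: abs(p - t))
    slope, intercept = linearize(t_exp)
    return slope * t + intercept


# Evenly spaced expansion points, chosen only for illustration.
points = [310.0, 330.0, 350.0, 370.0]
for t in (320.0, 345.0, 365.0):
    exact = leakage(t)
    approx = pwl_leakage(t, points)
    print(f"T={t:.0f} K  exact={exact:.4f}  pwl={approx:.4f}")
```

Because each local model is linear in temperature, it has the same structure as an ordinary linear thermal model, which is the property the abstract relies on.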

1D-2 (Time: 11:10 - 11:35)
Title: Multi-Angle Bended Heat Pipe Design Using X-Architecture Routing with Dynamic Thermal Weight on Mobile Devices
Author: Hsuan-Hsuan Hsiao, *Hong-Wen Chiou, Yu-Min Lee (National Chiao Tung University, Taiwan)
Page: pp. 70 - 75
Keyword: Thermal-aware design, Heat pipe routing, thermal modeling, mobile device
Abstract: A heat pipe is an effective passive cooling technique for mobile devices. This work builds a multi-angle bended heat pipe thermal model and presents an X-architecture routing engine guided by developed dynamic thermal weights to construct the heat pipe path for reducing the operating temperatures of a smartphone. Compared with a commercial tool, the error of the thermal model is only 4.79%. The routing engine can efficiently reduce the operating temperatures of application processors in smartphones by at least 13.20%.

1D-3 (Time: 11:35 - 12:00)
Title: Fully-automated Synthesis of Power Management Controllers from UPF
Author: *Dustin Peterson, Oliver Bringmann (University of Tuebingen, Germany)
Page: pp. 76 - 81
Keyword: power management, power design, unified power format
Abstract: We present a methodology for automatic synthesis of power management controllers for System-on-Chip designs by using an extended version of the Unified Power Format (UPF). Our methodology takes an SoC design and a UPF-based power design, and automatically generates a power management controller in Verilog/VHDL that implements the power state machine specified in UPF. It performs a priority-based scheduling for all power state machine actions, connects each power management signal to the corresponding logic wire in the UPF design and integrates the controller into the System-on-Chip using a configurable bus interface. We implemented the proposed approach as a plugin for Synopsys Design Compiler to close the gap in today’s power management flows and evaluated it on a RISC-V System-on-Chip.

Session 2A  (SS-1) Reverse Engineering: growing more mature – and facing powerful countermeasures
Time: 13:30 - 15:35 Tuesday, January 22, 2019
Location: Room Saturn
Chair: Naehyuck Chang (KAIST, Republic of Korea)

2A-1 (Time: 13:30 - 13:55)
Title: (Invited Paper) Integrated Flow for Reverse Engineering of Nanoscale Technologies
Author: *Bernhard Lippmann, Aayush Singla, Niklas Unverricht, Peter Egger, Anja Dübotzky, Michael Werner (Infineon Technologies AG, Germany), Horst Gieser (Fraunhofer EMFT, Germany), Martin Rasche, Oliver Kellermann (Raith GmbH, Germany), Helmut Gräb (TUM, Germany)
Page: pp. 82 - 89
Keyword: Reverse Engineering, image processing, net list extraction
Abstract: In view of potential risks of piracy and malicious manipulation of complex integrated circuits built in technologies of 45 nm and less, there is an increasing need for an effective and efficient process for reverse engineering. The paper provides an overview of the current process and details on a new tool for the acquisition and synthesis of large area images and the extraction of a layout. For the first time, the error between the generated layout and the known drawn GDS2 will be compared quantitatively as a figure of merit (FOM). From this layout, a circuit graph of an ECC encryption and the partitioning in circuit blocks will be extracted.

2A-2 (Time: 13:55 - 14:20)
Title: (Invited Paper) NETA: When IP Fails, Secrets Leak
Author: Travis Meade, Jason Portillo, Shaojie Zhang (University of Central Florida, U.S.A.), *Yier Jin (University of Florida, U.S.A.)
Page: pp. 90 - 95
Keyword: Hardware Security, Reverse Engineering, Netlist Analysis
Abstract: Assuring the quality and the trustworthiness of third party resources has been a hard problem to tackle. Researchers have shown that analyzing Integrated Circuits (IC), without the aid of golden models, is challenging. In this paper we discuss a toolset, NETA, designed to aid IP users in assuring the confidentiality, integrity, and accessibility of their IC or third party IP core. The discussed toolset gives access to a slew of gate-level analysis tools, many of which are heuristic-based, for the purposes of extracting high-level circuit design information. NETA mainly comprises the following tools: RELIC, REBUS, REPCA, REFSM, and REPATH.

2A-3 (Time: 14:20 - 14:45)
Title(Invited Paper) Machine Learning and Structural Characteristics for Reverse Engineering
Author*Johanna Baehr, Alessandro Bernardini, Georg Sigl, Ulf Schlichtmann (Technical University of Munich, Germany)
Pagepp. 96 - 103
KeywordHardware reverse engineering, Fuzzy Similarity Matching, Subcircuit Identification, Netlist Abstraction, Netlist Interpretation
AbstractIn the past years, much of the research into hardware reverse engineering has focused on the abstraction of gate level netlists to a human readable form. However, none of the proposed methods consider a realistic reverse engineering scenario, where the netlist is physically extracted from a chip. This paper analyzes how errors caused by this extraction and the later partitioning of the netlist affect the ability to identify the functionality. Current formal verification based methods, which compare against a golden model, are incapable of dealing with such erroneous netlists. Two new methods are proposed, which focus on the idea that structural similarity implies functional similarity. The first approach uses fuzzy structural similarity matching to compare the structural characteristics of an unknown design against designs in a golden model library using machine learning. The second approach proposes a method for inexact graph matching using fuzzy graph isomorphisms, based on the functionalities of gates used within the design. For realistic error percentages, both approaches are able to match more than 90% of designs correctly. This is an important first step for hardware reverse engineering methods beyond formal verification based equivalence matching.

2A-4 (Time: 14:45 - 15:10)
Title(Invited Paper) Towards Cognitive Obfuscation: Impeding Hardware Reverse Engineering Based on Psychological Insights
Author*Carina Wiesen, Steffen Becker, Nils Albartus, Max Hoffmann, Sebastian Wallat, Marc Fyrbiak, Nikol Rummel, Christof Paar (Ruhr-Universität Bochum, Germany)
Pagepp. 104 - 111
Keywordcognitive obfuscation, netlist-level reverse engineering, hardware obfuscation
AbstractIn contrast to software reverse engineering, there are hardly any tools available that support hardware reversing. Therefore, the reversing process is conducted by human analysts combining several complex semi-automated steps. However, countermeasures against reversing are evaluated solely against mathematical models. Our research goal is the establishment of cognitive obfuscation based on the exploration of underlying psychological processes. We aim to identify problems which are hard to solve for human analysts and derive novel quantification metrics, thus enabling stronger obfuscation techniques.

2A-5 (Time: 15:10 - 15:35)
Title(Invited Paper) Insights to the Mind of a Trojan Designer: The Challenge to Integrate a Trojan in the Bitstream
Author*Maik Ender, Paul Martin Knopp, Christof Paar, Pawel Swierczynski, Sebastian Wallat, Matthias Wilhelm (Ruhr-Universität Bochum, Germany)
Pagepp. 112 - 119
KeywordBitstream Manipulation, Reverse Engineering, Hardware Trojan, Netlist Analysis
AbstractThe threat of inserting hardware Trojans during design, production, or in-field operation poses a danger for integrated circuits in real-world applications. A particularly critical case of hardware Trojans is the malicious manipulation of third-party FPGA configurations. In addition to attack vectors during the design process, FPGAs can be infiltrated in a non-invasive manner after shipment through alterations of the bitstream. First, we present an improved methodology for bitstream file format reversing. Second, we introduce a novel idea for Trojan insertion.

Session 2B  All about PIM
Time: 13:30 - 15:35 Tuesday, January 22, 2019
Location: Room Uranus
Chairs: Guangyu Sun (Peking University, China), Wanli Chang (University of York, U.K.)

Best Paper Award
2B-1 (Time: 13:30 - 13:55)
TitleGraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-Scale Graph Processing on ReRAMs
Author*Guohao Dai (Tsinghua University, China), Tianhao Huang (Massachusetts Institute of Technology, U.S.A.), Yu Wang, Huazhong Yang (Tsinghua University, China), John Wawrzynek (University of California, Berkeley, U.S.A.)
Pagepp. 120 - 126
KeywordReRAM, Large-scale Graph Processing, Processing-in-Memory
AbstractLarge-scale graph processing has drawn great attention in recent years. The emerging metal-oxide resistive random access memory (ReRAM) and ReRAM crossbars have shown huge potential in accelerating graph processing. However, the sparse feature of natural graphs hinders the performance of graph processing on ReRAMs. Previous work of graph processing on ReRAMs stored and computed edges separately, leading to high energy consumption and long latency of transferring data. In this paper, we present GraphSAR, a sparsity-aware processing-in-memory large-scale graph processing accelerator on ReRAMs. Computations over edges are performed in the memory, eliminating overheads of transferring edges. Moreover, graphs are divided considering the sparsity. Subgraphs with low densities are further divided into smaller ones to minimize the waste of memory space. According to our extensive experimental results, GraphSAR achieves 4.43x energy reduction and 1.85x speedup (8.19x lower energy-delay product, EDP) against previous graph processing architecture on ReRAMs (GraphR).
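
As a quick consistency check on the figures quoted above, the EDP improvement can be recovered from the other two numbers. This is only a sketch: it assumes EDP is defined as energy times delay and that both ratios are measured against the same GraphR baseline, and the symbols E and T are introduced here for illustration.

```latex
\frac{\mathrm{EDP}_{\text{GraphR}}}{\mathrm{EDP}_{\text{GraphSAR}}}
  = \frac{E_{\text{GraphR}}}{E_{\text{GraphSAR}}}
    \cdot \frac{T_{\text{GraphR}}}{T_{\text{GraphSAR}}}
  \approx 4.43 \times 1.85 \approx 8.2
```

The product agrees with the reported 8.19x up to rounding of the two input ratios.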

2B-2 (Time: 13:55 - 14:20)
TitleParaPIM: A Parallel Processing-in-Memory Accelerator for Binary-Weight Deep Neural Networks
AuthorShaahin Angizi, Zhezhi He, *Deliang Fan (University of Central Florida, U.S.A.)
Pagepp. 127 - 132
KeywordAccelerator, SOT-MRAM
AbstractRecent algorithmic progression has brought competitive classification accuracy despite constraining neural networks to binary weights (+1/-1). These findings show remarkable optimization opportunities to eliminate the need for computationally-intensive multiplications, reducing memory access and storage. In this paper, we present ParaPIM architecture, which transforms current Spin Orbit Torque Magnetic Random Access Memory (SOT-MRAM) sub-arrays to massively parallel computational units capable of running inferences for Binary-Weight Deep Neural Networks (BWNNs). ParaPIM’s in-situ computing architecture can be leveraged to greatly reduce energy consumption dealing with convolutional layers, accelerate BWNNs inference, eliminate unnecessary off-chip accesses and provide ultra-high internal bandwidth. The device-to-architecture co-simulation results indicate ~4× higher energy-efficiency and 7.3× speedup over recent processing-in-DRAM acceleration, or roughly 5× higher energy-efficiency and 20.5× speedup over recent ASIC approaches, while maintaining inference accuracy comparable to baseline designs.

2B-3 (Time: 14:20 - 14:45)
TitleCompRRAE: RRAM-based Convolutional Neural Network Accelerator with Reduced Computations through a Runtime Activation Estimation
Author*Xizi Chen, Jingyang Zhu, Jingbo Jiang, Chi-Ying Tsui (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 133 - 139
KeywordConvolutional Neural Network, Hardware Accelerator, RRAM
AbstractRecently Resistive-RAM (RRAM) crossbar has been used in the design of the accelerator of convolutional neural networks (CNNs) to solve the memory wall issue. However, the intensive multiply-accumulate computations (MACs) executed at the crossbars during the inference phase are still the bottleneck for the further improvement of energy efficiency and throughput. In this work, we explore several methods to reduce the computations for the RRAM-based CNN accelerators. First, the output sparsity resulting from the widely employed Rectified Linear Unit is exploited, and a significant portion of computations are bypassed through an early detection of the negative output activations. Second, an adaptive approximation is proposed to terminate the MAC early when the sum of the partial results of the remaining computations is considered to be within a certain range of the intermediate accumulated result and thus has an insignificant contribution to the inference. In order to determine these redundant computations, a novel runtime estimation on the maximum and minimum values of each output activation is developed and used during the MAC operation. Experimental results show that around 70% of the computations can be reduced during the inference with a negligible accuracy loss smaller than 0.2%. As a result, the energy efficiency and the throughput are improved by over 2.9 and 2.8 times, respectively, compared with the state-of-the-art RRAM-based accelerators.
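
The first idea above, bypassing a MAC once its output is known to be negative before the ReLU, can be illustrated with a minimal sketch. This is not the authors' estimator: it assumes inputs are non-negative and bounded by a known `x_max`, and uses a simple upper bound built from the remaining positive weights. The function name and parameters are hypothetical.

```python
from typing import Sequence, Tuple


def relu_mac_with_early_exit(
    weights: Sequence[float],
    inputs: Sequence[float],
    x_max: float,
) -> Tuple[float, int]:
    """Return (ReLU(dot(weights, inputs)), number of MACs actually executed).

    Assumption (illustrative only): every input lies in [0, x_max], as is the
    case for activations produced by a preceding ReLU layer.
    """
    n = len(weights)

    # suffix_pos[i] = sum of the positive weights in weights[i:].
    # Multiplied by x_max, it bounds the largest possible remaining contribution.
    suffix_pos = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix_pos[i] = suffix_pos[i + 1] + max(weights[i], 0.0)

    acc = 0.0
    for i in range(n):
        # If even the most optimistic remaining sum cannot lift the result
        # above zero, the ReLU output is 0 and the rest can be skipped.
        if acc + x_max * suffix_pos[i] <= 0.0:
            return 0.0, i
        acc += weights[i] * inputs[i]

    return max(acc, 0.0), n


if __name__ == "__main__":
    # The large negative leading weight lets the loop stop after one MAC.
    print(relu_mac_with_early_exit([-2.0, -1.0, 0.5, 0.25], [1.0, 1.0, 1.0, 1.0], 1.0))
```

A tighter bound stops earlier but costs more to maintain, and processing large-magnitude weights first tends to trigger the exit sooner. The adaptive approximation described in the abstract would additionally need a lower bound, which this sketch omits.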

2B-4 (Time: 14:45 - 15:10)
TitleCuckooPIM: An Efficient and Less-blocking Coherence Mechanism for Processing-in-Memory Systems
Author*Sheng Xu, Xiaoming Chen, Ying Wang, Yinhe Han, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 140 - 145
KeywordProcessing-in-Memory, Coherence, Less-blocking
AbstractThe ever-growing processing ability of in-memory processing logic makes the data sharing and coherence between processors and in-memory logic play an increasingly important role in Processing-in-Memory (PIM) systems. Unfortunately, the existing state-of-the-art coarse-grained PIM coherence solutions suffer from unnecessary data movements and stalls caused by a data ping-pong issue. This work proposes CuckooPIM, a criticality-aware and less-blocking coherence mechanism, which can effectively avoid unnecessary data movements and stalls. Experiments reveal that CuckooPIM achieves 1.68x speedup on average compared with coarse-grained PIM coherence.

2B-5 (Time: 15:10 - 15:35)
TitleAERIS: Area/Energy-Efficient 1T2R ReRAM Based Processing-in-Memory Neural Network System-on-a-Chip
Author*Jinshan Yue, Yongpan Liu, Fang Su (Tsinghua University, China), Shuangchen Li (University of California, Santa Barbara, U.S.A.), Zhe Yuan, Zhibo Wang, Wenyu Sun, Xueqing Li, Huazhong Yang (Tsinghua University, China)
Pagepp. 146 - 151
KeywordProcessing-in-memory, 1T2R ReRAM, Layer-balance Scheduling, Reference Current Scheme, Neural Network
AbstractReRAM-based processing-in-memory (PIM) architecture is a promising solution for deep neural networks (NN), due to its high energy efficiency and small footprint. However, traditional PIM architecture has to use a separate crossbar array to store either positive or negative (P/N) weights, which limits both energy efficiency and area efficiency. Even worse, the imbalanced running time of different layers and idle ADCs/DACs further lower the efficiency of the whole system. This paper proposes AERIS, an Area/Energy-efficient 1T2R ReRAM based processing-In-memory NN System-on-a-chip to enhance both energy and area efficiency. We propose an area-efficient 1T2R ReRAM structure to represent both P/N weights in a single array, and a reference current cancelling scheme (RCS) is also presented for better accuracy. Moreover, a layer-balance scheduling strategy, as well as the power gating technique for interface circuits, such as ADCs/DACs, is adopted for higher energy efficiency. Experiment results show that compared with state-of-the-art ReRAM-based architectures, AERIS achieves 8.5x/1.3x peak energy/area efficiency improvements in total, due to layer-balance scheduling for different layers, power gating of interface circuits, and 1T2R ReRAM circuits. Furthermore, we demonstrate that the proposed RCS compensates for the non-ideal factors of ReRAM and improves NN accuracy by 5.2% in the XNOR net on the CIFAR-10 dataset.

Session 2C  Design for Reliability
Time: 13:30 - 15:35 Tuesday, January 22, 2019
Location: Room Venus
Chairs: Shigeki Nojima (Toshiba Memory, Japan), Bei Yu (Chinese University of Hong Kong, Hong Kong)

2C-1 (Time: 13:30 - 13:55)
TitleIR-ATA: IR Annotated Timing Analysis, A Flow for Closing the Loop Between PDN design, IR Analysis & Timing Closure
AuthorAshkan Vakil, Houman Homayoun, *Avesta Sasan (George Mason University, U.S.A.)
Pagepp. 152 - 159
KeywordGolden Timing Model, voltage Noise, IR Drop, IR Aware Timing Analysis
AbstractThis paper presents IR-ATA, a novel flow for modeling the timing impact of IR drop during the physical design and timing closure of an ASIC chip. We first illustrate how the current and conventional mechanism for budgeting the IR drop and voltage noise (by using hard margins) leads to sub-optimal design. Consequently, we propose a new approach for modeling and margining against voltage noise, such that each timing path is margined based on its own topology and its own view of voltage noise. By having such a path based margining mechanism, the margins for IR drop and voltage noise for most timing paths in the design are safely relaxed. The reduction in the margin increases the available timing slack that could be used for improving the power, performance, and area of a design. Finally, we illustrate how IR-ATA could be used to track the timing impact of physical or PDN changes, allowing the physical designers to explore tradeoffs that were previously, for lack of methodology, not possible.

2C-2 (Time: 13:55 - 14:20)
TitleLearning-Based Prediction of Package Power Delivery Network Quality
AuthorYi Cao (Qualcomm Technologies, Inc., U.S.A.), *Andrew B. Kahng (UC San Diego, U.S.A.), Joseph Li, Abinash Roy, Vaishnav Srinivas (Qualcomm Technologies, Inc., U.S.A.), Bangqi Xu (UC San Diego, U.S.A.)
Pagepp. 160 - 166
Keywordpower integrity (PI), PDN, machine learning, inductance prediction
AbstractPower Delivery Network (PDN) is a critical component in modern System-on-Chip (SoC) designs. With the rapid development in applications, the quality of PDN, especially Package (PKG) PDN, determines whether a sufficient amount of power can be delivered to critical computing blocks. In conventional PKG design, PDN design typically takes multiple weeks including many manual iterations for optimization. Also, there is a large discrepancy between (i) quick simulation tools used for quick PDN quality assessment during the design phase, and (ii) the golden extraction tool used for signoff. This discrepancy may introduce more iterations. In this work, we propose a learning-based methodology to perform PKG PDN quality assessment both before layout (when only bump/ball maps, but no package routing, are available) and after layout (when routing is completed but no signoff analysis has been launched). Our contributions include (i) identification of important parameters to estimate the achievable PKG PDN quality in terms of bump inductance; (ii) the avoidance of unnecessary manual trial and error overheads in PKG PDN design; and (iii) more accurate design-phase PKG PDN quality assessment. We validate accuracy of our predictive models on PKG designs from industry. Experimental results show that, across a testbed of 17 industry PKG designs, we can predict bump inductance with an average absolute percentage error of 21.2% or less, given only pinmap and technology information. We improve prediction accuracy to achieve an average absolute percentage error of 17.5% or less when layout information is considered.

2C-3 (Time: 14:20 - 14:45)
TitleTackling Signal Electromigration with Learning-Based Detection and Multistage Mitigation
Author*Wei Ye, Mohamed Baker Alawieh, Yibo Lin, David Z. Pan (The University of Texas at Austin, U.S.A.)
Pagepp. 167 - 172
KeywordElectromigration, Machine learning, Detection, Mitigation
AbstractWith the continuous scaling of integrated circuit (IC) technologies, electromigration (EM) prevails as one of the major reliability challenges facing the design of robust circuits. With such aggressive scaling in advanced technology nodes, signal nets experience high switching frequency, which further exacerbates the signal EM effect. Traditionally, signal EM fixing approaches analyze EM violations after the routing stage and repair is attempted via iterative incremental routing or cell resizing techniques. However, these "EM-analysis-then-fix" approaches are ill-equipped when faced with the ever-growing EM violations in advanced technology nodes. In this work, we propose a novel signal EM handling framework that (i) incorporates EM detection and fixing techniques into earlier stages of the physical design process, and (ii) integrates machine learning based detection alongside a multistage mitigation. Experimental results demonstrate that our framework can achieve 15x speedup when compared to the state-of-the-art EDA tool while achieving similar performance in terms of EM mitigation and overhead.

2C-4 (Time: 14:45 - 15:10)
TitleROBIN: Incremental Oblique Interleaved ECC for Reliability Improvement in STT-MRAM Caches
Author*Elham Cheshmikhani (Sharif University of Technology, Iran), Hamed Farbeh (Amirkabir University of Technology, Iran), Hossein Asadi (Sharif University of Technology, Iran)
Pagepp. 173 - 178
KeywordCache Memory, Reliability, Error-Correction Codes, Emerging Non-Volatile Memories
AbstractSpin-Transfer Torque Magnetic RAM (STT-MRAM) is a promising alternative for SRAMs in on-chip cache memories. Besides all its advantages, high error rate in STT-MRAM is a major limiting factor for on-chip cache memories. In this paper, we first present a comprehensive analysis that reveals that the conventional Error-Correcting Codes (ECCs) lose their efficiency due to data-dependent error patterns, and then propose an efficient ECC configuration, so-called ROBIN, to improve the correction capability. The evaluations show that the inefficiency of conventional ECC increases the cache error rate by an average of 151.7% while ROBIN reduces this value by more than 28.6x.

2C-5 (Time: 15:10 - 15:35)
TitleAging-aware Chip Health Prediction Adopting an Innovative Monitoring Strategy
AuthorYun-Ting Wang (National Tsing Hua University, Taiwan), Kai-Chiang Wu (National Chiao Tung University, Taiwan), *Chung-Han Chou (Feng Chia University, Taiwan), Shih-Chieh Chang (National Tsing Hua University, Taiwan)
Pagepp. 179 - 184
KeywordAging, bias-temperature instability, chip health prediction, process, voltage, and temperature (PVT) variation, support vector machine (SVM)
AbstractConcerns exist that the reliability of chips is worsening because of downscaling technology. Among various reliability challenges, device aging is a dominant concern because it degrades circuit performance over time. Traditionally, runtime monitoring approaches are proposed to estimate aging effects. However, such techniques tend to predict and monitor delay degradation status for circuit mitigation measures rather than the health condition of the chip. In this paper, we propose an aging-aware chip health prediction methodology that adapts to workload conditions and process, supply voltage, and temperature variations. Our prediction methodology adopts an innovative on-chip delay monitoring strategy by tracing representative aging-aware delay behavior. The delay behavior is then fed into a machine learning engine to predict the age of the tested chips. Experimental results indicate that our strategy can obtain 97.40% accuracy with 4.14% area overhead on average. To the authors’ knowledge, this is the first method that accurately predicts current chip age and provides information regarding future chip health.

Session 2D  New Advances in Emerging Computing Paradigms
Time: 13:30 - 15:35 Tuesday, January 22, 2019
Location: Room Mars+Room Mercury
Chairs: Shigeru Yamashita (Ritsumeikan University, Japan), Xiaoming Chen (Institute of Computing Technology, Chinese Academy of Sciences, China)

2D-1 (Time: 13:30 - 13:55)
TitleCompiling SU(4) Quantum Circuits to IBM QX Architectures
Author*Alwin Zulehner, Robert Wille (Johannes Kepler University Linz, Austria)
Pagepp. 185 - 190
KeywordIBM QX, Quantum Computation, Compilation, Mapping
AbstractThe Noisy Intermediate-Scale Quantum (NISQ) technology is currently investigated by major players in the field to build the first practically useful quantum computer. IBM QX architectures are the first ones which are already publicly available today. However, in order to use them, the respective quantum circuits have to be compiled for the respectively used target architecture. While first approaches have been proposed for this purpose, they are infeasible for a certain set of SU(4) quantum circuits which recently have been introduced to benchmark such compilers. In this work, we analyze the bottlenecks of existing compilers and provide a dedicated method for compiling this kind of circuit to IBM QX architectures. Our experimental evaluation (using tools provided by IBM) shows that the proposed approach significantly outperforms IBM’s own solution regarding fidelity of the compiled circuit as well as runtime. An implementation of the proposed methodology is publicly available at [omitted for review].

2D-2 (Time: 13:55 - 14:20)
TitleQuantum Circuit Compilers Using Gate Commutation Rules
Author*Toshinari Itoko, Rudy Raymond, Takashi Imamichi, Atsushi Matsuo (IBM Research, Japan), Andrew W. Cross (IBM Research, U.S.A.)
Pagepp. 191 - 196
Keywordquantum computer, circuit compiler, nearest neighbor architecture
AbstractThe use of noisy intermediate-scale quantum computers (NISQCs), which consist of dozens of noisy qubits with limited coupling constraints, has been increasing. A circuit compiler, which transforms an input circuit into an equivalent output circuit conforming to the coupling constraints with as few additional gates as possible, is essential for running applications on NISQCs. We propose a formulation and two algorithms exploiting gate commutation rules to obtain a better circuit compiler.

2D-3 (Time: 14:20 - 14:45)
TitleScalable Design for Field-coupled Nanocomputing Circuits
Author*Marcel Walter (University of Bremen, Germany), Robert Wille (Johannes Kepler University Linz, Austria), Frank Sill Torres (DFKI GmbH, Germany), Daniel Große, Rolf Drechsler (University of Bremen, Germany)
Pagepp. 197 - 202
KeywordField-coupled Nanocomputing, Quantum-dot Cellular Automata, Nanomagnet Logic, Placement & Routing
AbstractField-coupled Nanocomputing (FCN) technologies are considered as a solution to overcome physical boundaries of conventional CMOS approaches. But despite groundbreaking advances regarding their physical implementation as e.g. Quantum-dot Cellular Automata (QCA), Nanomagnet Logic (NML), and many more, there is an unsettling lack of methods for large-scale design automation of FCN circuits. In fact, design automation for this class of technologies still is in its infancy -- heavily relying either on manual labor or automatic methods which are applicable for rather small functionality only. This work presents a design method which -- for the first time -- allows for the scalable design of FCN circuits that satisfy dedicated constraints of these technologies. The proposed scheme is capable of handling around 40000 gates within seconds while the current state-of-the-art takes hours to handle around 20 gates. This could be confirmed by experimental results on the layout level for various established benchmark libraries.

2D-4 (Time: 14:45 - 15:10)
TitleBDD-based Synthesis of Optical Logic Circuits Exploiting Wavelength Division Multiplexing
Author*Ryosuke Matsuo, Jun Shiomi, Tohru Ishihara, Hidetoshi Onodera (Kyoto University, Japan), Akihiko Shinya, Masaya Notomi (NTT Nanophotonics Center / NTT Basic Research Laboratories, Japan)
Pagepp. 203 - 209
KeywordBinary decision diagram, logic optimization, optical circuit, wavelength division multiplexing
AbstractOptical circuits using nanophotonic devices attract significant interest due to their low power dissipation and ultra-high speed operation. As a consequence, the synthesis methods for the optical circuits also attract increasing attention. However, existing methods for synthesizing optical circuits mostly rely on straight-forward mappings from established data structures such as Binary Decision Diagram (BDD). The strategy of simply mapping a BDD to an optical circuit sometimes results in an explosion of size and involves significant power losses in waveguide branches and optical devices. To address these issues, this paper proposes a BDD size reduction method for optical logic circuits exploiting wavelength division multiplexing (WDM). The paper also proposes a method for reducing the number of branches in a BDD-based optical circuit, which contributes to the reduction of the power dissipation in laser sources. Experimental results obtained using a partial product accumulation circuit, which is typically used in parallel multipliers, demonstrate advantages of our method over existing approaches in terms of area and power consumption.

2D-5 (Time: 15:10 - 15:35)
TitleHybrid Binary-Unary Hardware Accelerator
AuthorS. Rasoul Faraji, *Kia Bazargan (University of Minnesota, U.S.A.)
Pagepp. 210 - 215
KeywordHybrid computing system, Unary computing system, Hardware accelerators, Scaling Network, Alternator Logic
AbstractStochastic computing has been used in recent years to create designs with significantly smaller area by harnessing unary encoding of data. However, the low area advantage comes at an exponential price in latency, making the area*delay cost unattractive. In this paper, we present a novel method which uses a hybrid binary / unary representation to perform computations. We first divide the input range into a few sub-regions, perform unary computations on each sub-region individually, and finally pack the outputs of all sub-regions back to compact binary. Moreover, we propose a synthesis methodology and a regression model to predict an optimal or sub-optimal design in the design space. The proposed method is especially well-suited to FPGAs due to the abundant availability of routing and flip-flop resources. To the best of our knowledge, we are the first to show a scalable method based on the principles of stochastic computing that can beat conventional binary in terms of a real cost, i.e., area * delay. Our method outperforms the binary and fully unary methods on a number of functions and on a common edge detection algorithm. In terms of area*delay cost, our cost is on average only 2.51% and 10.2% of the binary for 8- and 10-bit resolutions, respectively. These numbers are 2-3 orders of magnitude better than the results of traditional stochastic methods. Our method is not competitive with the binary method for high-resolution oscillating functions such as sin(15 x).

Session 3A  (SS-2) Design, testing, and fault tolerance of Neuromorphic systems
Time: 15:55 - 17:10 Tuesday, January 22, 2019
Location: Room Saturn
Chair: Chenchen Liu (Clarkson University, U.S.A.)

3A-1 (Time: 15:55 - 16:20)
Title(Invited Paper) Fault Tolerance in Neuromorphic Computing Systems
Author*Yu Wang (Tsinghua University, China), Mengyun Liu, Krishnendu Chakrabarty (Duke University, U.S.A.), Lixue Xia (Tsinghua University, China)
Pagepp. 216 - 223
KeywordRRAM, Fault-tolerance, Neuromorphic computing
AbstractResistive Random Access Memory (RRAM) and RRAM-based computing systems (RCS) provide energy-efficient technology options for neuromorphic computing. However, the applicability of RCS is limited by reliability problems that arise from the immature fabrication process. In order to take advantage of RCS in practical applications, fault-tolerant design is a key challenge. We present a survey of fault-tolerant designs for RRAM-based neuromorphic computing systems. We first describe RRAM-based crossbars and training architectures in RCS. Following this, we classify fault models into different categories, and review post-fabrication testing methods. Subsequently, online testing methods are presented. Finally, we present various fault-tolerant techniques that were designed to tolerate different types of RRAM faults. The methods reviewed in this survey represent recent trends in fault-tolerant designs of RCS, and are expected to motivate further research in this field.

3A-2 (Time: 16:20 - 16:45)
Title(Invited Paper) Build Reliable and Efficient Neuromorphic Design with Memristor Technology
AuthorBing Li, Bonan Yan (Duke University, U.S.A.), Chenchen Liu (Clarkson University, U.S.A.), *Hai Li (Duke University, U.S.A.)
Pagepp. 224 - 229
Keywordmemristor, neuromorphic computing system, reliability
AbstractNeuromorphic computing is a revolutionary approach of computation, which attempts to mimic the human brain's mechanism for extremely high implementation efficiency and intelligence. Latest research studies showed that the memristor technology has a great potential for realizing power- and area-efficient neuromorphic computing systems (NCS). On the other hand, the memristor device processing is still under development. Unreliable devices can severely degrade system performance, which arises as one of the major challenges in developing memristor-based NCS. In this paper, we first review the impacts of the limited reliability of memristor devices and summarize the recent research progress in building reliable and efficient memristor-based NCS. In the end, we discuss the main difficulties and the trend in memristor-based NCS development.

3A-3 (Time: 16:45 - 17:10)
Title(Invited Paper) Reliable In-Memory Neuromorphic Computing Using Spintronics
AuthorChristopher Muench, Rajendra Bishnoi, *Mehdi Tahoori (KIT, Germany)
Pagepp. 230 - 236
KeywordMTJ, defect modeling, neuromorphic computing
AbstractRecently Spin Transfer Torque Random Access Memory (STT-MRAM) technology has drawn a lot of attention for the direct implementation of neural networks, because it offers several advantages such as near-zero leakage, high endurance, good scalability, small footprint and CMOS compatibility. The storing device in this technology, the Magnetic Tunnel Junction (MTJ), is developed using magnetic layers that require new fabrication materials and processes. Due to complexities of fabrication steps and materials, MTJ cells are subject to various failure mechanisms. As a consequence, the functionality of the neuromorphic computing architecture based on this technology is severely affected. In this paper, we have developed a framework to analyze the functional capability of the neural network inference in the presence of several MTJ defects. Using this framework, we have demonstrated the required memory array size that is necessary to tolerate the given amount of defects and how to actively decrease this overhead by disabling parts of the network.

Session 3B  Memory-Centric Design and Synthesis
Time: 15:55 - 17:10 Tuesday, January 22, 2019
Location: Room Uranus
Chair: Tohru Ishihara (Nagoya University, Japan)

Best Paper Candidate
3B-1 (Time: 15:55 - 16:20)
TitleA Staircase Structure for Scalable and Efficient Synthesis of Memristor-Aided Logic
Author*Alwin Zulehner (Johannes Kepler University Linz, Austria), Kamalika Datta (National Institute of Technology Meghalaya, India), Indranil Sengupta (Indian Institute of Technology, Kharagpur, India), Robert Wille (Johannes Kepler University Linz, Austria)
Pagepp. 237 - 242
KeywordMemristor Crossbar, MAGIC, Mapping
AbstractThe identification of the memristor as fourth fundamental circuit element and, eventually, its fabrication in the HP labs provide new capabilities for in-memory computing. While there already exist sophisticated methods for realizing logic gates with memristors, mapping them to crossbar structures that can easily be fabricated still constitutes a challenging task. This is particularly the case since several (complementary) design objectives have additionally to be satisfied, e.g. the design method has to be scalable, should yield designs requiring a low number of timesteps as well as a low number of utilized memristors, and a layout should result that is hardly skewed. However, all solutions proposed thus far only focus on one of these objectives and hardly address the other ones. Consequently, rather imperfect solutions are generated by state-of-the-art design methods for memristor-aided logic thus far. In this work, we propose a corresponding automatic design solution which addresses all these design objectives at once. To this end, a staircase structure is utilized which employs an almost square-like layout and remains perfectly scalable while, at the same time, keeping the number of timesteps and utilized memristors close to the minimum. Experimental evaluations confirm that the proposed approach indeed satisfies all design objectives at once.

3B-2 (Time: 16:20 - 16:45)
Title: On-chip Memory Optimization for High-level Synthesis of Multi-dimensional Data on FPGA
Author: Daewoo Kim, Sugil Lee, *Jongeun Lee (Ulsan National Institute of Science and Technology, Republic of Korea)
Page: pp. 243 - 248
Keyword: SIMD, FPGA implementation
Abstract: It is very challenging to design an on-chip memory architecture for high-performance kernels with a large amount of computation and data. The on-chip memory architecture must support efficient data access from both the computation part and the external memory part, which often have very different expectations about how data should be accessed and stored. Previous work provides only a limited set of optimizations. In this paper we show how to fundamentally restructure on-chip buffers by decoupling the logical array view from the physical buffer view and providing general mapping schemes for the two. Our framework considers the entire data flow from the external memory to the computation part in order to minimize resource usage without creating a performance bottleneck. Our experimental results demonstrate that our proposed technique can generate solutions that reduce memory usage significantly (2X over the conventional method), and successfully generate optimized on-chip buffer architectures without costly design iterations for highly optimized computation kernels.

3B-3 (Time: 16:45 - 17:10)
Title: HUBPA: High Utilization Bidirectional Pipeline Architecture for Neuromorphic Computing
Author: Houxiang Ji, *Li Jiang, Tianjian Li, Naifeng Jing, Jing Ke, Xiaoyao Liang (Shanghai Jiao Tong University, China)
Page: pp. 249 - 254
Keyword: ReRAM, Convolution Neural Network, Pipeline architecture, Process in memory
Abstract: CNNs are memory- and computation-intensive. ReRAM-based accelerators are highly energy-efficient for such tasks, and previous works advocate a pipeline architecture due to the natural pipeline of convolutional and fully-connected layers in CNNs. We observe low utilization of ReRAM resources in pipeline architectures due to the imbalanced data throughput among different pipeline stages (i.e., convolutional layers). This paper strives to enhance the utilization of ReRAM-based computing resources in three steps. First, we propose a novel bi-directional pipeline architecture, in which both forward and backward propagation of the CNN training process can share the allocated pipeline resources. Second, this bi-directional pipeline can further trade a significant amount of memory access for computation, leading to a substantial reduction of the energy consumption. Third, given limited ReRAM resources and any CNN topology, an efficient weight-matrix mapping algorithm orchestrates the shape of the pipeline, i.e., the duplication in each pipeline stage, to further balance the data throughput along the pipeline. The experiments show an average 1.6× improvement in resource utilization, a 60% improvement in performance, and a 60% reduction in energy consumption compared to the state-of-the-art ReRAM pipeline architecture when performing CNN training.


Session 3C  Efficient Modeling of Analog, Mixed Signal and Arithmetic Circuits
Time: 15:55 - 17:10 Tuesday, January 22, 2019
Location: Room Venus
Chairs: Jun Tao (Fudan University, China), Shobha Vasudevan (University of Illinois at Urbana-Champaign, U.S.A.)

3C-1 (Time: 15:55 - 16:20)
Title: Efficient Sparsification of Dense Circuit Matrices in Model Order Reduction
Author: *Charalampos Antoniadis, Nestor Evmorfopoulos, Georgios Stamoulis (University of Thessaly, Greece)
Page: pp. 255 - 260
Keyword: MOR, SDD M-matrix, sparsification, graph
Abstract: The integration of more components into ICs due to ever-increasing technology scaling has led to very large parasitic networks consisting of millions of nodes, which have to be simulated at many times or frequencies to verify the proper operation of the chip. Model Order Reduction (MOR) techniques have been employed routinely to substitute the large-scale parasitic model with a model of lower order with a similar response at the input/output ports. However, all established MOR techniques result in dense system matrices that render their simulation impractical. To this end, in this paper we propose an algorithm for the sparsification of dense MOR circuit matrices, which employs a sequence of algorithms based on the computation of the nearest diagonally dominant matrix and the sparsification of the corresponding graph. Experimental results indicate that a high sparsity ratio of the reduced system matrices can be achieved with very small loss of accuracy.

3C-2 (Time: 16:20 - 16:45)
Title: Spectral Approach to Verifying Non-linear Arithmetic Circuits
Author: Cunxi Yu (EPFL, Switzerland), Tiankai Su, Atif Yasin, *Maciej Ciesielski (UMass Amherst, U.S.A.)
Page: pp. 261 - 267
Keyword: Formal verification, computer arithmetic, computer algebra
Abstract: This paper presents a fast and effective computer algebraic method for analyzing and verifying non-linear integer arithmetic circuits using a novel algebraic spectral model. We introduce the concept of an algebraic spectrum, a numerical form of polynomial expression that uses only coefficients of the monomials. In contrast to previous works, the proof of functional correctness is done by computing an algebraic spectrum without performing a complete rewriting of word-level polynomials. The speedup is achieved by propagating coefficients through the circuit using And-Inverter Graphs (AIG). The effectiveness of the method is demonstrated with experiments including standard and Booth multipliers, and other synthesized non-linear arithmetic circuits up to 1024 bits, containing over 12 million gates.

3C-3 (Time: 16:45 - 17:10)
Title: S2-PM: Semi-Supervised Learning for Efficient Performance Modeling of Analog and Mixed Signal Circuits
Author: Mohamed Baker Alawieh, Xiyuan Tang, *David Z. Pan (University of Texas at Austin, U.S.A.)
Page: pp. 268 - 273
Keyword: performance modeling, semi-supervised learning, AMS
Abstract: As integrated circuit technologies continue to scale, variability modeling is becoming more crucial, yet more challenging. In this paper, we propose a novel performance modeling method based on semi-supervised co-learning. We exploit the multiple representations of process variation in any analog and mixed signal circuit to establish a bootstrap co-learning framework where unlabeled samples are leveraged to improve the model accuracy without incurring any simulation cost. Practically, our proposed method relies on a small set of labeled data and the availability of no-cost unlabeled data to efficiently build an accurate performance model for any analog and mixed signal circuit design. Our numerical experiments demonstrate that the proposed approach achieves up to 30% reduction in simulation cost compared to the state-of-the-art modeling technique without surrendering any accuracy.


Session 3D  Logic and Precision Optimization for Neural Network Designs
Time: 15:55 - 17:10 Tuesday, January 22, 2019
Location: Room Mars+Room Mercury
Chairs: Masanori Muroyama (Tohoku University, Japan), Younghyun Kim (University of Wisconsin, U.S.A.)

Best Paper Award
3D-1 (Time: 15:55 - 16:20)
Title: Energy-Efficient, Low-Latency Realization of Neural Networks through Boolean Logic Minimization
Author: Mahdi Nazemi, Ghasem Pasandi, *Massoud Pedram (USC, U.S.A.)
Page: pp. 274 - 279
Keyword: Training, Inference, Reduced-Memory-Access, Synthesis, DNN
Abstract: Deep neural networks have been successfully deployed in a wide variety of applications including computer vision and speech recognition. However, the computational and storage complexity of these models has forced the majority of computations to be performed on high-end computing platforms or on the cloud. To cope with the computational and storage complexity of these models, this paper presents a training method that enables a radically different approach for realization of deep neural networks through Boolean logic minimization. The aforementioned realization completely removes the energy-hungry step of accessing memory for obtaining model parameters, consumes about two orders of magnitude fewer computing resources compared to realizations that use floating-point operations, and has a substantially lower latency.

3D-2 (Time: 16:20 - 16:45)
Title: Log-Quantized Stochastic Computing for Memory and Computation Efficient DNNs
Author: *Hyeonuk Sim, Jongeun Lee (Ulsan National Institute of Science and Technology, Republic of Korea)
Page: pp. 280 - 285
Keyword: Deep neural networks, Logarithmic quantization, Stochastic computing
Abstract: For energy efficiency, many low-bit quantization methods for deep neural networks (DNNs) have been proposed. Among them, logarithmic quantization has drawn attention for showing acceptable deep learning performance. It also simplifies high-cost multipliers and drastically reduces the memory footprint. Meanwhile, stochastic computing (SC) was proposed for low-cost DNN acceleration, and a recently proposed SC multiplier significantly improved accuracy and latency, which are the main drawbacks of SC. However, in its binary-interfaced system, which still costs much less than storing all stochastic streams, quantization is basically linear, the same as in conventional fixed-point binary. We applied logarithmically quantized DNNs to the state-of-the-art SC multiplier and studied how it can benefit. We found that SC multiplication on logarithmically quantized input is more accurate and that it can help the fine-tuning process. Furthermore, we designed a much lower-cost SC-DNN accelerator utilizing the reduced complexity of inputs. Finally, while logarithmic quantization benefits data flow, the proposed architecture achieves 40% and 24% less area and power consumption, respectively, than the previous SC-DNN accelerator. Its area × latency product is smaller than even that of the shifter-based accelerator.

3D-3 (Time: 16:45 - 17:10)
Title: Cell Division: Weight Bit-Width Reduction Technique for Convolutional Neural Network Hardware Accelerators
Author: *Hanmin Park, Kiyoung Choi (Seoul National University, Republic of Korea)
Page: pp. 286 - 291
Keyword: fixed-point, quantization, inference, activation duplication, channel fusing
Abstract: The datapath bit-width of hardware accelerators for convolutional neural network (CNN) inference is generally chosen to be wide enough so that they can be used to process upcoming unknown CNNs. Here we introduce the cell division technique, which is a variant of function-preserving transformations. With this technique, it is guaranteed that CNNs that have weights quantized to a fixed-point format of arbitrary bit-width can be transformed to CNNs with smaller weight bit-widths without any accuracy drop (or any accuracy change). As a result, CNN hardware accelerators are released from the weight bit-width constraint, which has been preventing them from having narrower datapaths. In addition, CNNs that have wider weight bit-widths than those assumed by a CNN hardware accelerator can be executed on the accelerator. Experimental results on LeNet-300-100, LeNet-5, AlexNet, and VGG-16 show that weights can be reduced to 2–5 bits with a 2.5×–5.2× decrease in weight storage requirement and, of course, without any accuracy drop.

Wednesday, January 23, 2019


Session 2K  Keynote II
Time: 9:00 - 10:00 Wednesday, January 23, 2019
Location: Miraikan Hall
Chair: Hidetoshi Onodera (Kyoto University, Japan)

2K-1 (Time: 9:00 - 10:00)
Title: (Keynote Address) Post-K: A Game-changing Supercomputer with Groundbreaking A64fx High Performance Arm Processor
Author: Satoshi Matsuoka (RIKEN, Japan)
Abstract: The first would-be exascale supercomputer in the world, the "Post-K", is being co-designed by Riken-CCS and Fujitsu in collaboration with the entire HPC community in Japan. The heart of Post-K is the groundbreaking Fujitsu A64fx processor, which will sit at the pinnacle of the billions of ARM processors manufactured every year, and will likely best competing CPUs by significant margins for HPC and other related workloads such as Big Data and AI, as well as CAE/EDA. Not only is it high in FLOPS, with the world’s first implementation of the wide-vector ARM SVE standard, but the chip will also likely be the first general-purpose CPU to accommodate high-bandwidth on-package HBM2 memory, coupled with streaming-friendly memory controllers and a cache hierarchy for nearly a terabyte per second of memory bandwidth. Such features will accelerate CAE workloads to unprecedented performance levels on platforms from supercomputers to workstations while being fully compliant with a broad and open software ecosystem.


Session 4A  (SS-3) Modern Mask Optimization: From Shallow To Deep Learning
Time: 10:20 - 12:00 Wednesday, January 23, 2019
Location: Room Saturn
Chairs: Evangeline F. Y. Young (The Chinese University of Hong Kong, Hong Kong), Atsushi Takahashi (Tokyo Institute of Technology, Japan)

4A-1 (Time: 10:20 - 10:45)
Title: (Invited Paper) LithoROC: Lithography Hotspot Detection with Explicit ROC Optimization
Author: *Wei Ye, Yibo Lin, Meng Li, Qiang Liu, David Z. Pan (University of Texas at Austin, U.S.A.)
Page: pp. 292 - 298
Keyword: Hotspot detection, imbalance, AUC, ROC
Abstract: As modern integrated circuits scale up with escalating complexity of layout design patterns, lithography hotspot detection, a key stage of physical verification to ensure layout finishing and design closure, faces a higher demand on its efficiency and accuracy. Among all the hotspot detection approaches, machine learning distinguishes itself by achieving high accuracy while maintaining low false alarms. However, due to the class imbalance problem, the conventional practice of using the accuracy and false alarm metrics to evaluate different machine learning models is becoming less effective. In this work, we propose the use of the area under the ROC curve (AUC), which provides a more holistic measure for imbalanced datasets compared with the previous methods. To systematically handle class imbalance, we further propose surrogate loss functions for direct AUC maximization as a substitute for the conventional cross-entropy loss. Experimental results demonstrate that the new surrogate loss functions are promising to outperform the cross-entropy loss when applied to the state-of-the-art neural network model for hotspot detection.

4A-2 (Time: 10:45 - 11:10)
Title: (Invited Paper) Detecting Multi-Layer Layout Hotspots with Adaptive Squish Patterns
Author: *Haoyu Yang (The Chinese University of Hong Kong, Hong Kong), Piyush Pathak, Frank Gennari, Ya-Chieh Lai (Cadence Design Systems Inc., U.S.A.), Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Page: pp. 299 - 304
Keyword: Hotspot Detection, Squish Pattern, Metal-to-via Failure, Deep Learning
Abstract: Layout hotspot detection is one of the critical steps in the modern integrated circuit design flow. It aims to find potential weak points in layouts before feeding them into the manufacturing stage. Rapid development of machine learning has made it a preferable alternative to traditional hotspot detection solutions. Recent research ranges from layout feature extraction to learning model design. However, only single-layer layout hotspots are considered in state-of-the-art hotspot detectors, and certain defects such as metal-to-via failures are not naturally supported. In this paper, we propose an adaptive squish representation for multilayer layouts, which is storage-efficient, lossless and compatible with deep neural networks. We conduct experiments on 14nm industrial designs with a metal layer and its two adjacent via layers that contain metal-to-via hotspots. Results show that the adaptive squish representation can achieve satisfactory hotspot detection accuracy by incorporating a medium-sized convolutional neural network.

4A-3 (Time: 11:10 - 11:35)
Title: (Invited Paper) A Local Optimal Method on DSA Guiding Template Assignment with Redundant/Dummy Via Insertion
Author: *Xingquan Li (Fuzhou University, China), Bei Yu (Chinese University of Hong Kong, Hong Kong), Jianli Chen, Wenxing Zhu (Fuzhou University, China)
Page: pp. 305 - 310
Keyword: Directed self-assembly, Redundant via insertion, Guiding template assignment, Dummy via insertion, Local optimal
Abstract: For better reliability and manufacturability, we concurrently consider DSA guiding template assignment with redundant via and dummy via insertion. By analyzing the structural properties of guiding templates, we propose a building-block based solution expression and construct a conflict graph, then formulate the problem as an integer linear programming (ILP) problem. Then, we relax the ILP to an unconstrained nonlinear programming (UNP) problem. Finally, a line search optimization algorithm is proposed to solve the UNP. Experimental results verify the effectiveness of our method.

4A-4 (Time: 11:35 - 12:00)
Title: (Invited Paper) Deep Learning-Based Framework for Comprehensive Mask Optimization
Author: Bo-Yi Yu, *Yong Zhong, Shao-Yun Fang, Hung-Fei Kuo (National Taiwan University of Science and Technology, Taiwan)
Page: pp. 311 - 316
Keyword: Deep learning, Lithography, Optical proximity correction (OPC), Sub-resolution assist feature (SRAF) insertion
Abstract: With the dramatic increase in design complexity and the advance of semiconductor technology nodes, huge difficulties appear during design for manufacturability with existing lithography solutions. Sub-resolution assist feature (SRAF) insertion and optical proximity correction (OPC) are both indispensable resolution enhancement techniques (RET) to maximize the process window and ensure feature printability. Conventional model-based SRAF insertion and OPC methods are widely applied in industry but suffer from extremely long runtimes due to the iterative optimization process. In this paper, we present the first work to develop a deep learning framework that simultaneously performs SRAF insertion and edge-based OPC. In addition, to make the optimized masks more reliable and convincing for industrial application, we employ a commercial lithography simulation tool to consider the quality of wafer images with various lithographic metrics. The effectiveness and efficiency of the proposed framework are demonstrated in experimental results, which also show the success of machine learning-based lithography optimization techniques for current complex and large-scale circuit layouts.


Session 4B  System Level Modelling Methods I
Time: 10:20 - 12:00 Wednesday, January 23, 2019
Location: Room Uranus
Chairs: Yoshinori Takeuchi (Kindai University), Sri Parameswaran (The University of New South Wales)

4B-1 (Time: 10:20 - 10:45)
Title: AxDNN: Towards the Cross-layer Design of Approximate DNNs
Author: *Yinghui Fan, Xiaoxi Wu, Jiying Dong, Zhi Qi (National ASIC System Engineering Research Center, Southeast University, China)
Page: pp. 317 - 322
Keyword: Approximate Computing, DNNs, Co-design, Accelerator
Abstract: Thanks to the inborn error resistance of neural networks, approximate computing has become a promising and hardware-friendly technique to improve the energy efficiency of DNNs. From algorithms and architectures to circuits, there are many possibilities to implement approximate DNNs. However, the complicated interaction between major design concerns, e.g., power and performance, and the lack of an efficient simulator across multiple design layers have led to suboptimal solutions for approximate DNNs through the conventional design method. In this paper, we present a systematic framework towards the cross-layer design of approximate DNNs. By introducing hardware imperfection to the training phase, the accuracy of DNN models can be recovered by up to 5.32% when the most aggressive approximate multiplier is used. Integrated with the techniques of activation pruning and voltage scaling, the energy efficiency of the approximate DNN accelerator can be improved by 52.5% on average. We also build a pre-RTL simulation environment where we can easily express accelerator architectures, try combinations of different approximation strategies, and evaluate the power consumption. Experiments demonstrate that the pre-RTL simulation achieves a 20X speedup compared with the traditional RTL method when evaluating the same target. The convenient pre-RTL simulation helps us to quickly figure out the trade-off between accuracy and energy at the design stage for an approximate DNN accelerator.

4B-2 (Time: 10:45 - 11:10)
Title: Simulate-the-hardware: Training Accurate Binarized Neural Networks for Low-Precision Neural Accelerators
Author: *Jiajun Li (University of Chinese Academy of Sciences/Institute of Computing Technology, Chinese Academy of Sciences, China), Ying Wang (Institute of Computing Technology, Chinese Academy of Sciences, China), Bosheng Liu (University of Chinese Academy of Sciences/Institute of Computing Technology, Chinese Academy of Sciences, China), Yinhe Han, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Page: pp. 323 - 328
Keyword: Binarized Neural Networks, Overflow, Containing, Simulating
Abstract: This work investigates how to effectively train binarized neural networks (BNNs) for specialized low-precision neural accelerators. When mapping BNNs onto specialized neural accelerators that adopt a fixed-point feature data representation and binary parameters, due to the operation overflow caused by short fixed-point coding, the BNN inference results from the deep learning frameworks on CPU/GPU will be inconsistent with those from the accelerators. This issue leads to a large deviation between the training environment and the inference implementation, and causes potential model accuracy losses when deployed on the accelerators. Therefore, we present a series of methods to contain the overflow phenomenon, and enable typical deep learning frameworks like TensorFlow to effectively train BNNs that can work with high accuracy and convergence speed on the specialized neural accelerators.

4B-3 (Time: 11:10 - 11:35)
Title: An N-Way Group Association Architecture and Sparse Data Group Association Load Balancing Algorithm for Sparse CNN Accelerators
Author: *Jingyu Wang, Zhe Yuan, Ruoyang Liu, Huazhong Yang, Yongpan Liu (Tsinghua University, China)
Page: pp. 329 - 334
Keyword: CNN accelerator, group association, load balancing, optimization algorithm, Scheduler
Abstract: In recent years, sparse CNN accelerators have attracted great attention among researchers. This paper presents an N-Way Group Association Architecture and a corresponding Load Balancing Algorithm. The system is analyzed, simulated and verified. Compared with the state-of-the-art accelerator, this work achieves either 1) 1.74x performance with 50% memory overhead reduction in the 4-way design or 2) 1.91x performance without memory overhead reduction in the 2-way design, which is close to the theoretical performance limit (without collision).

4B-4 (Time: 11:35 - 12:00)
Title: Maximizing Power State Cross Coverage in Firmware-based Power Management
Author: *Vladimir Herdt, Hoang M. Le (University of Bremen, Germany), Daniel Große, Rolf Drechsler (University of Bremen, DFKI GmbH, Germany)
Page: pp. 335 - 340
Keyword: coverage-driven validation, power management, cross-coverage, virtual prototype, power-aware simulation
Abstract: Virtual Prototypes (VPs) are becoming increasingly attractive for the early analysis of SoC power management, which is nowadays mostly implemented in firmware (FW). Power and timing constraints can be monitored and validated by executing a set of test-cases in a power-aware FW/VP co-simulation. In this context, cross coverage of power states is an effective but challenging quality metric. This paper proposes a novel coverage-driven approach to automatically generate test-cases maximizing this cross coverage. In particular, we integrate a coverage-loop that successively refines the generation process based on previous results. We demonstrate our approach on a LEON3-based VP.


Session 4C  Testing and Design for Security
Time: 10:20 - 12:00 Wednesday, January 23, 2019
Location: Room Venus
Chairs: Michihiro Shintani (Nara Institute of Science and Technology, Japan), Kohei Miyase (Kyushu Institute of Technology, Japan)

4C-1 (Time: 10:20 - 10:45)
Title: Improving Scan Chain Diagnostic Accuracy Using Multi-Stage Artificial Neural Networks
Author: Mason Chern, Shih-Wei Lee, *Shi-Yu Huang (National Tsing Hua University, Taiwan), Yu Huang, Gaurav Veda, Kun-Han Tsai, Wu-Tung Cheng (Mentor, A Siemens Business, U.S.A.)
Page: pp. 341 - 346
Keyword: chain diagnosis, intermittent fault, machine learning, Neural Network, inference
Abstract: Diagnosis of intermittent scan chain failures remains a hard problem. We demonstrate that Artificial Neural Networks (ANNs) can be used to achieve significantly higher accuracy. The key is to take on domain knowledge and use a multi-stage process incorporating ANNs with gradually refined focuses. Experimental results on benchmark circuits show that this method is, on average, 20% more accurate than a state-of-the-art commercial tool for intermittent stuck-at faults, and improves the hit rate from 25.3% to 73.9% for some test cases.

4C-2 (Time: 10:45 - 11:10)
Title: Testing Stuck-Open Faults of Priority Address Encoder in Content Addressable Memories
Author: *Tsai-Ling Tsai, Jin-Fu Li (National Central University, Taiwan), Chun-Lung Hsu, Chi-Tien Sun (Industrial Technology Research Institute, Taiwan)
Page: pp. 347 - 351
Keyword: Test, Content Addressable Memory, Priority Address Encoder, March-Like Test
Abstract: Content addressable memory (CAM) is widely used in systems with a parallel search function. The testing of CAM is more difficult than that of random access memory (RAM) due to the complicated function of CAM. Similar to the testing of RAM, the testing of CAM should cover the cell array and peripheral circuits. In this paper, we propose a March-like test, March-PCL, for detecting the stuck-open faults (SOFs) of the priority address encoder of CAMs. To the best of our knowledge, this is the first work to discuss the testing of SOFs of the priority address encoder of CAMs. March-PCL requires 4N Write and 4N Compare operations to cover 100% of SOFs.

4C-3 (Time: 11:10 - 11:35)
Title: ScanSAT: Unlocking Obfuscated Scan Chains
Author: *Lilas Alrahis (Khalifa University, United Arab Emirates), Muhammad Yasin (Tandon School of Engineering, New York University, U.S.A.), Hani Saleh, Baker Mohammad, Mahmoud Al-Qutayri (Khalifa University, United Arab Emirates), Ozgur Sinanoglu (New York University Abu Dhabi, United Arab Emirates)
Page: pp. 352 - 357
Keyword: Deobfuscating Scan Chains, Hardware Security, SAT Attack, No-Scan Access
Abstract: While financially advantageous, outsourcing key steps such as testing to potentially untrusted Outsourced Semiconductor Assembly and Test (OSAT) companies may pose a risk of compromising on-chip assets. Obfuscation of scan chains is a technique that hides the actual scan data from the untrusted testers; logic inserted between the scan cells, driven by a secret key, hides the transformation functions between the scan-in stimulus (scan-out response) and the delivered scan pattern (captured response). In this paper, we propose ScanSAT: an attack that transforms a scan-obfuscated circuit to its logic-locked version and applies a variant of the Boolean satisfiability (SAT) based attack, thereby extracting the secret key. Our empirical results demonstrate that ScanSAT can easily break naive scan obfuscation techniques using only three or fewer attack iterations, even for large key sizes and in the presence of scan compression.

4C-4 (Time: 11:35 - 12:00)
Title: CycSAT-Unresolvable Cyclic Logic Encryption Using Unreachable States
Author: Amin Rezaei, You Li, Yuanqi Shen, Shuyu Kong, *Hai Zhou (Northwestern University, U.S.A.)
Page: pp. 358 - 363
Keyword: Cyclic Logic Encryption, Unreachable States, CycSAT Attack
Abstract: Logic encryption has attracted much attention due to increasing IC design costs and the growing number of untrusted foundries. Unreachable states in a design provide a space of flexibility for logic encryption to explore. However, due to the available access to the scan chain, traditional combinational encryption cannot leverage the benefit of such flexibility. Cyclic logic encryption inserts key-controlled feedbacks into the original circuit to prevent piracy and overproduction. Based on our discovery, cyclic logic encryption can utilize unreachable states to improve security. Even though cyclic encryption is vulnerable to a powerful attack called CycSAT, we develop a new way of cyclic encryption by utilizing unreachable states to defeat CycSAT. The attack complexity of the proposed scheme is discussed and its robustness is demonstrated.


Session 4D  Network-Centric Design and System
Time: 10:20 - 12:00 Wednesday, January 23, 2019
Location: Room Mars+Room Mercury
Chairs: Keiji Kimura (Waseda University, Japan), Yaoyao Ye (Shanghai Jiao Tong University, China)

Best Paper Candidate
4D-1 (Time: 10:20 - 10:45)
Title: Routing in Optical Network-on-Chip: Minimizing Contention with Guaranteed Thermal Reliability
Author: *Mengquan Li (Chongqing University, China), Weichen Liu (Nanyang Technological University, Singapore), Lei Yang, Peng Chen, Duo Liu (Chongqing University, China), Nan Guan (Hong Kong Polytechnic University, Hong Kong)
Page: pp. 364 - 369
Keyword: Optical network-on-chip, Thermal reliability, Contention-aware adaptive routing
Abstract: Communication contention and thermal susceptibility are two potential issues in optical network-on-chip (ONoC) architectures, and both are critical for ONoC designs. However, minimizing conflict and guaranteeing thermal reliability are incompatible in most cases. In this paper, we present a routing criterion at the network level. Combined with device-level thermal tuning, it can implement thermally reliable ONoCs. We further propose two routing approaches (a mixed-integer linear programming (MILP) model and a heuristic algorithm (CAR)) to minimize communication conflict under guaranteed thermal reliability while mitigating the energy overheads of thermal regulation in the presence of chip thermal variations. By applying the criterion, our approaches achieve excellent performance with largely reduced complexity of design space exploration. Evaluation results on synthetic communication traces and realistic benchmarks show that the MILP-based approach achieves an average of 112.73% improvement in communication performance and 4.18% reduction in energy overhead compared to state-of-the-art techniques. Our heuristic algorithm introduces only a 4.40% performance difference compared to the optimal results and is more scalable to large-size ONoCs.

4D-2 (Time: 10:45 - 11:10)
Title: Bidirectional Tuning of Microring-Based Silicon Photonic Transceivers for Optimal Energy Efficiency
Author: *Yuyang Wang (University of California, Santa Barbara, U.S.A.), M. Ashkan Seyedi, Jared Hulme, Marco Fiorentino, Raymond G. Beausoleil (Hewlett Packard Labs, U.S.A.), Kwang-Ting Cheng (Hong Kong University of Science and Technology, Hong Kong)
Page: pp. 370 - 375
Keyword: silicon photonics, optical interconnects, microring tuning, energy efficiency
Abstract: Microring-based silicon photonic transceivers are promising to resolve the communication bottleneck of future high-performance computing systems. To rectify process variations in microring resonance wavelengths, thermal tuning is usually preferred over electrical tuning due to its preservation of extinction ratios and quality factors. However, the low energy efficiency of resistive thermal tuners results in nontrivial tuning cost and overall energy consumption of the transceiver. In this study, we propose a hybrid tuning strategy which involves both thermal and electrical tuning. Our strategy determines the tuning direction of each resonance wavelength with the goal of optimizing the transceiver energy efficiency without compromising signal integrity. Formulated as an integer programming problem and solved by a genetic algorithm, our tuning strategy yields 32%~53% savings of overall energy per bit for measured data of 5-channel transceivers at 5~10 Gb/s per channel, and up to 24% saving for synthetic data of 30-channel transceivers, generated based on the process variation models built upon measured data. We further investigated a polynomial-time approximation method which achieves over 100x speedup in tuning scheme computation, while still maintaining considerable energy-per-bit savings.

4D-3 (Time: 11:10 - 11:35)
TitleRedeeming Chip-level Power Efficiency by Collaborative Management of the Computation and Communication
Author*Ning Lin, Hang Lu, Xin Wei, Xiaowei Li (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China)
Pagepp. 376 - 381
KeywordNetworks-on-Chip, Many-core Processor, Power Management
AbstractPower consumption is the first-order design constraint in future many-core processors. Conventional power management approaches usually focus on certain functional components, i.e., either computation or communication hardware resources, trying to optimize their power consumption as much as possible while leaving the other part untouched. However, such a unilateral power control concept, though it has some potential to contribute to overall power reduction, cannot guarantee the optimal power efficiency of the chip. In this paper, we propose a novel Collaborative management approach, coordinating both Computation and Communication infrastructure in tandem, termed CoCom. Unlike prior work that deals with power control separately, it leverages the correlations between the two parts as the “key chain” to guide their respective power state coordination in the appropriate direction. Besides, it uses dedicated hybrid on-chip/off-chip mechanisms to minimize the control cost and simultaneously guarantee effectiveness. Experimental results show that, compared with conventional unilateral baselines, CoCom is able to achieve abundant performance improvement and power reduction at the same time.

4D-4 (Time: 11:35 - 12:00)
TitleA High-Level Modeling and Simulation Approach Using Test-Driven Cellular Automata for Fast Performance Analysis of RTL NoC Designs
Author*Moon Gi Seok (Arizona State University, U.S.A.), Daejin Park (Kyungpook National University, Republic of Korea), Hessam S. Sarjoughian (Arizona State University, U.S.A.)
Pagepp. 382 - 387
KeywordNetwork on Chip, RTL, Cellular Automaton
AbstractThe simulation speedup of a designed RTL NoC regarding packet transmission is essential because performance analysis or parameter optimization considering various combinations of intellectual-property (IP) blocks requires repeated computations for parameter space exploration. In this paper, we propose a high-level modeling and simulation (M&S) approach using a revised cellular automata (CA) concept to speed up communication simulation of dynamic flit movements and queue occupation within the target RTL NoC. The CA abstracts the detailed RTL operations by deciding an action state (which is related to moving packet flits and changing the connections between CA) using the neighbors' and its own high-level states, and then executing the operations relevant to that action state. While performing operations or finding neighbors, user-developed architecture-independent routing and arbitration functions may be used. The decision regarding the action states follows a rule set, which is generated by the proposed test environment. The proposed method was applied to an open-source Verilog NoC, which resulted in a simulation speedup of about 8 to 31 times for a parameter set.

Session 5A  (DF-1) Robotics: From System Design to Application
Time: 13:50 - 15:05 Wednesday, January 23, 2019
Location: Room Saturn
Organizer: Koji Inoue (Kyushu University, Japan), Organizer/Chair: Yuji Ishikawa (Toshiba Device & Storage, Japan)

5A-1 (Time: 13:50 - 14:15)
Title(Designers' Forum) Computer-Aided Support System for Minimally Invasive Surgery Using 3D Organ Shape Models
AuthorKen'ichi Morooka (Kyushu University, Japan)
AbstractOur research group has been doing research about computer-aided support systems for safe and accurate minimally invasive surgeries. Especially, our support system uses 3D shapes and deformations of organs by combining stereo endoscopic images and neural networks. We talk about the fundamental techniques of our support system.

5A-2 (Time: 14:15 - 14:40)
Title(Designers' Forum) ROS and mROS: How to accelerate the development of robot systems and integrate embedded devices
AuthorHideki Takase (Kyoto University/JST PRESTO, Japan)
AbstractRobot Operating System (ROS) is a state-of-the-art component-oriented development framework led by Open Robotics. This talk first describes the advantages of ROS in the robot development process. ROS can accelerate the development of robot systems by configuring and connecting abundant open-source ROS packages. Another aspect of ROS is a communication middleware based on the publish/subscribe model. ROS nodes communicate with each other via topics. An arbitrary node publishes data to a topic and other nodes can subscribe to data from the topic. roscore, the master of a ROS system, manages advertisement of node information. In addition, ROS provides powerful tools and a world-wide friendly community to help robot system designers. Although there are a lot of useful packages available from ROS1, which is a widely used version, you must employ Linux/Ubuntu to execute ROS1 nodes. This means that you have to select high-performance and power-hungry processors, such as AArch64 or x64 CPUs. We think the use of embedded processors would benefit power consumption and real-time capability for robot systems. The latter part of this talk will present our work on mROS, which enables embedded processors to be integrated into ROS1 systems. mROS is a lightweight runtime environment to run a ROS1 node on embedded systems. mROS is assumed to operate on edge devices in distributed network systems. We employ lwIP, which is included in the ARM mbed library, as the TCP/IP protocol stack, and the TOPPERS/ASP kernel as the real-time operating system to realize ROS communication. You can design mROS nodes with native ROS APIs. In addition, you can develop embedded device drivers with the mbed library. Moreover, the ITRON programming model could help you if you wish to realize multi-tasking. Our work would contribute to the portability of ROS1 packages to embedded systems, and to the enhancement of power saving and real-time performance for edge nodes in distributed robot systems.
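
The publish/subscribe model mentioned in this abstract can be illustrated with a small in-process sketch. It does not use any ROS or mROS API: the class `ToyMaster`, its methods, and the topic names are hypothetical and exist only for illustration. In a real ROS1 system the master only handles registration and nodes exchange messages directly; here the toy broker also forwards messages, which is a deliberate simplification.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class ToyMaster:
    """Hypothetical stand-in for a topic registry (not the roscore API)."""

    def __init__(self) -> None:
        # Maps a topic name to the callbacks of the nodes subscribed to it.
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        """Register a callback that is invoked for every message on `topic`."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver `message` to all subscribers of `topic`; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


master = ToyMaster()
received: List[int] = []

# Two independent "nodes" subscribe to the same (hypothetical) topic.
master.subscribe("/sensor/temperature", received.append)
master.subscribe("/sensor/temperature", lambda m: received.append(m * 2))

delivered = master.publish("/sensor/temperature", 21)
undelivered = master.publish("/unused", 0)  # no subscribers on this topic
```

The publisher never refers to its subscribers by name, only to the topic, which is the decoupling that lets packages be connected by configuration.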

5A-3 (Time: 14:40 - 15:05)
Title(Designers' Forum) Rapid Development of Robotics technology using Robot Contest and Open Collaboration
AuthorMasaki Yamamoto (AI Solution Center, Panasonic Corporation, Japan)
AbstractRobotic systems are widely used in manufacturing settings, as the environment can be tuned for the robots. Repetitive tasks can easily be performed by robots with high accuracy and speed surpassing human operators. On the other hand, in human environments, even though robots have long been said to be promising, their deployment is slow. Thanks to emerging AI technology, robots are gaining more reliable capabilities to sense and adapt in the human environment. But once a robot has to perform physical tasks, as robots have physical bodies to interact with the external world, the total system gets more complicated and expensive, which makes the robot system less viable in real applications. To overcome these difficulties, we have to carefully assess the potential robotic application from as many viewpoints as possible. In this situation, robot contests are becoming popular in the robotics research community. In some cases, potential users of the robotics system organize the contest to accelerate the technology development, such as the Amazon Robotics Challenge. The environment can be determined by the contest organizer, reflecting not only the technical but also the user point of view. Proposals from worldwide participants can be a good testbed to explore design possibilities. For universities, students can exert their creativity and test their technological edge. Private companies can use these opportunities to educate young engineers and explore potential technological directions in a short period of time. In this talk, we share our experiences in Amazon Robotics Challenge 2017 and World Robotics Summit 2018.

Session 5B  Advanced Memory Systems
Time: 13:50 - 15:05 Wednesday, January 23, 2019
Location: Room Uranus
Chairs: Jaehyun Park (University of Ulsan, Republic of Korea), Chenchen Liu (Clarkson University)

5B-1 (Time: 13:50 - 14:15)
TitleA Sharing-Aware L1.5D Cache for Data Reuse in GPGPUs
Author*Jianfei Wang, Li Jiang, Jing Ke, Xiaoyao Liang, Naifeng Jing (Shanghai Jiao Tong University, China)
Pagepp. 388 - 393
KeywordGPGPU, cache thrashing, shared cache, data reuse, cache management
AbstractWith GPUs heading towards general-purpose computing, hardware caching, e.g., the first-level data (L1D) cache, is introduced into the on-chip memory hierarchy for GPGPUs. However, facing the massive multi-threading of GPGPUs, the small L1D requires better management to reach a higher hit rate and benefit performance. In this paper, on observing L1D usage inefficiencies, such as data duplication among streaming multiprocessors (SMs) that wastes the precious L1D resources, we first propose a shared L1.5D cache that substitutes the private L1D caches in several SMs to reduce the duplicated data and in turn increase the effective cache size for each SM. We evaluate and adopt a suitable layout of L1.5D to meet the timing requirements in GPGPUs. Then, to protect the sharable data from early evictions, we propose a sharable-data-aware cache management scheme, which leverages a lightweight PC-based history table to protect sharable data on cache replacement. The experiments demonstrate that the proposed design can achieve an average 20.1% performance improvement with the on-chip hit rate increased by 16.9% for applications with sharable data.

5B-2 (Time: 14:15 - 14:40)
TitleNeuralHMC: An Efficient HMC-Based Accelerator for Deep Neural Networks
AuthorChuhan Min (University of Pittsburgh, U.S.A.), Jiachen Mao, Hai Li, *Yiran Chen (Duke University, U.S.A.)
Pagepp. 394 - 399
KeywordHybrid Memory Cube, processing-in memory architecture, simulation
AbstractDeep Neural Networks (DNNs) involve a significant amount of data movement. Processing-in-memory (PIM) architecture, such as the Hybrid Memory Cube (HMC), becomes an excellent candidate to improve data locality for efficient DNN execution. However, it is still hard to efficiently deploy the matrix computation of DNNs on HMC because of its packet protocol. In this work, we propose NeuralHMC, the first HMC-based accelerator tailored for efficient DNN execution. Compared to a state-of-the-art PIM-based accelerator, NeuralHMC improves the system performance by 4.1x and reduces energy by 1.5x.

5B-3 (Time: 14:40 - 15:05)
TitleBoosting Chipkill Capability under Retention-Error Induced Reliability Emergency
AuthorXianwei Zhang (AMD, U.S.A.), Rujia Wang, *Youtao Zhang, Jun Yang (University of Pittsburgh, U.S.A.)
Pagepp. 400 - 405
KeywordDRAM, Reliability, Chipkill, Refresh, Power
AbstractThe DRAM-based main memory of high embedded systems faces two design challenges: (i) degrading reliability; and (ii) increasing power and energy consumption. While chipkill ECC (error correction code) and multi-rate refresh may be adopted to address them, respectively, a simple integration of the two results in 3× or more SDC (silent data corruption) errors and fails to meet the system reliability guarantee. This is referred to as a reliability emergency. In this paper, we propose PlusN, a hardware-assisted memory error protection design that adaptively boosts the baseline chipkill capability to address the reliability emergency. Based on the error probability assessment at runtime, the system switches its memory protection between the baseline chipkill and PlusN --- the latter generates a stronger ECC with low storage and access overheads. Our experimental results show that PlusN can effectively enforce the system reliability guarantee under different reliability emergency scenarios. On average, it introduces 2.7% performance overhead and 6.25% space overhead when adopting a 256 ms refresh interval and a 48-hour profiling interval.

Session 5C  Learning: Make Patterning Light and Right
Time: 13:50 - 15:05 Wednesday, January 23, 2019
Location: Room Venus
Chairs: Tetsuaki Matsunawa (Toshiba Memory Corp.), Hidetoshi Matsuoka (Fujitsu Laboratories)

Best Paper Candidate
5C-1 (Time: 13:50 - 14:15)
TitleSRAF Insertion via Supervised Dictionary Learning
Author*Hao Geng, Haoyu Yang, Yuzhe Ma (The Chinese University of Hong Kong, Hong Kong), Joydeep Mitra (Cadence Inc., U.S.A.), Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 406 - 411
KeywordSRAF insertion, Supervised online dictionary learning, Integer linear programming, Mask optimization, Feature extraction
AbstractIn modern VLSI design flow, sub-resolution assist feature (SRAF) insertion is one of the resolution enhancement techniques (RETs) to improve chip manufacturing yield. With aggressive feature size continuously scaling down, layout feature learning becomes extremely critical. In this paper, for the first time, we enhance conventional manual feature construction, by proposing a supervised online dictionary learning algorithm for simultaneous feature extraction and dimensionality reduction. By taking advantage of label information, the proposed dictionary learning engine can discriminatively and accurately represent the input data. We further consider SRAF design rules in a global view, and design an integer linear programming model in the post-processing stage of SRAF insertion framework. Experimental results demonstrate that, compared with a state-of-the-art SRAF insertion tool, our framework not only boosts the mask optimization quality in terms of edge placement error (EPE) and process variation (PV) band area, but also achieves some speed-up.

5C-2 (Time: 14:15 - 14:40)
TitleA Fast Machine Learning-based Mask Printability Predictor for OPC Acceleration
Author*Bentian Jiang (The Chinese University of Hong Kong, Hong Kong), Hang Zhang (Cornell University, U.S.A.), Jinglei Yang (University of California, Santa Barbara, U.S.A.), Evangeline F. Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 412 - 419
KeywordDesign for Manufacturability, Optical Proximity Correction Acceleration, Machine Learning Based Lithography Simulation
AbstractContinuous shrinking of VLSI technology nodes brings us powerful chips with lower power consumption, but it also introduces many issues in manufacturability. Lithography simulation process for new feature size suffers from large computational overhead. As a result, conventional mask optimization process has been drastically resource consuming in terms of both time and cost. In this paper, we propose a high performance machine learning-based mask printability evaluation framework for lithography-related applications, and apply it in a conventional mask optimization tool to verify its effectiveness.

5C-3 (Time: 14:40 - 15:05)
TitleSemi-Supervised Hotspot Detection with Self-Paced Multi-Task Learning
AuthorYing Chen (Institute of Microelectronics of Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Yibo Lin (University of Texas at Austin, U.S.A.), Tianyang Gai (Institute of Microelectronics of Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Yajuan Su (Institute of Microelectronics of Chinese Academy of Sciences, China), Yayi Wei (Institute of Microelectronics of Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), *David Z. Pan (University of Texas at Austin, U.S.A.)
Pagepp. 420 - 425
Keywordhotspot detection, semi-supervised learning, multi-task network, self-paced learning
AbstractLithography simulation is computationally expensive for hotspot detection. Machine learning based hotspot detection is a promising technique to reduce the simulation overhead. However, most learning approaches rely on a large amount of training data to achieve good accuracy and generality. At the early stage of developing a new technology node, the amount of data with labeled hotspots or non-hotspots is very limited. In this paper, we propose a semi-supervised hotspot detection approach with a self-paced multi-task learning paradigm, leveraging data samples both with and without labels to improve model accuracy and generality. Experimental results demonstrate promising accuracy with a limited amount of labeled training data compared to the state-of-the-art work.

Session 5D  Design and CAD for Emerging Memories
Time: 13:50 - 15:05 Wednesday, January 23, 2019
Location: Room Mars+Room Mercury
Chairs: Li Jiang (Shanghai Jiao Tong University, China), Takao Marukame (Toshiba Corporate Research & Development Center, Japan)

5D-1 (Time: 13:50 - 14:15)
TitleExploring emerging CNFET for Efficient Last Level Cache Design
AuthorDawen Xu, *Li Li (Hefei University of Technology, China), Ying Wang, Cheng Liu, Huawei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 426 - 431
KeywordCNFET, Last Level Cache
AbstractCarbon nanotube field-effect transistors (CNFETs) emerge as a promising alternative to conventional CMOS because of their much higher speed and power efficiency. They are particularly suitable for building the power-hungry last level cache (LLC). However, the process variation (PV) in CNFETs substantially affects the operation stability and thus the worst-case timing, which limits the LLC operation frequency dramatically given a fully synchronous design. To address this problem, we developed a variation-aware cache such that each part of the cache can run at its optimal frequency and the overall cache performance can be improved significantly. The variation is asymmetrically correlated, a property unique to the CNFET fabrication process, which indicates that the cache latency distribution is closely related to the LLC layout. For the two typical LLC layouts, we proposed the variation-aware-set (VAS) cache and the variation-aware-way (VAW) cache, respectively, to make the best use of the CNFET cache architecture. For the VAS cache, we further proposed a static page mapping to ensure the most frequently used data are mapped to the fast cache region. Similarly, we apply a latency-aware LRU replacement strategy to assign the most recent data to the fast cache region. According to the experiments, the optimized CNFET-based LLC improves the performance by 39% and reduces the power consumption by 10% on average compared to the baseline CNFET LLC design.

5D-2 (Time: 14:15 - 14:40)
TitleMosaic: An Automated Synthesis Flow for Boolean Logic Based on Memristor Crossbar
Author*Lei Xie (Southeast University, China)
Pagepp. 432 - 437
KeywordMemristor, Logic Synthesis, EDA
AbstractA memristor crossbar stacked on top of CMOS circuitry is a promising candidate for future VLSI circuits, due to its great scalability, near-zero standby power consumption, etc. In order to design large-scale logic circuits, an automated synthesis flow is highly demanded to map Boolean functions onto memristor crossbars. This paper proposes such a synthesis flow, Mosaic, by reusing a part of the existing CMOS synthesis flow. In addition, two schemes are proposed to optimize designs in terms of delay and power consumption. To verify Mosaic and its optimization schemes, four types of adders are used as a case study; the incurred delay, area and power costs for both the crossbar and its CMOS controller are evaluated. The results show that the optimized adders reduce delay (>26%), power consumption (>21%) and area (>23%) as compared to the initial ones. To show the potential of Mosaic for design space exploration, we use nine other, more complex benchmarks. The results show that the design can be significantly optimized in terms of both area (4.5x to 82.9x) and delay (2.4x to 9.5x).

Best Paper Candidate
5D-3 (Time: 14:40 - 15:05)
TitleHandling Stuck-at-faults in Memristor Crossbar Arrays using Matrix Transformations
AuthorBaogang Zhang, Necati Uysal, Deliang Fan, *Rickard Ewetz (University of Central Florida, U.S.A.)
Pagepp. 438 - 443
Keywordmemristor, Neural Network, Deep Learning, CAD
AbstractMatrix-vector multiplication is the dominating computational workload in the inference phase of neural networks. Memristor crossbar arrays (MCAs) can inherently execute matrix-vector multiplication with low latency and small power consumption. A key challenge is that the classification accuracy may be severely degraded by stuck-at-fault defects. Earlier studies have shown that the accuracy loss can be recovered by retraining each neural network or by utilizing additional hardware. In this paper, we propose to handle stuck-at-faults using matrix transformations. A transformation T changes a weight matrix W into a weight matrix, We = T(W), which is more robust to stuck-at-faults. In particular, we propose a row flipping transformation, a permutation transformation, and a value range transformation. The row flipping transformation translates stuck-off (stuck-on) faults into stuck-on (stuck-off) faults. The permutation transformation maps small (large) weights to memristors that are stuck-off (stuck-on). The value range transformation is based on reducing the magnitude of the smallest and largest elements in the matrix, so that each stuck-at-fault introduces an error of smaller magnitude. The experimental results demonstrate that the proposed framework is capable of recovering 99% of the accuracy loss introduced by stuck-at-faults without requiring the neural network to be retrained.
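
The idea behind the permutation transformation (placing small weights on stuck-off cells and large weights on stuck-on cells) can be illustrated with a toy brute-force search. This is only a sketch under stated assumptions and not the method of the paper: the fault encoding (`None`, `"off"`, `"on"`), the error model (a stuck-off cell reads as `w_min`, a stuck-on cell as `w_max`), and all function names are hypothetical. The matching reordering of the crossbar inputs is assumed to be handled elsewhere.

```python
from itertools import permutations
from typing import List, Optional, Sequence, Tuple

Fault = Optional[str]  # None = healthy cell, "off" = stuck-off, "on" = stuck-on


def row_error(weights: Sequence[float], faults: Sequence[Fault],
              w_min: float, w_max: float) -> float:
    """Total absolute error when `weights` are stored on a physical row with `faults`."""
    error = 0.0
    for w, fault in zip(weights, faults):
        if fault == "off":      # assumed: cell is forced to the smallest value
            error += abs(w - w_min)
        elif fault == "on":     # assumed: cell is forced to the largest value
            error += abs(w - w_max)
    return error


def mapping_cost(W: List[List[float]], fault_map: List[List[Fault]],
                 perm: Sequence[int], w_min: float, w_max: float) -> float:
    """Cost of storing logical row i on physical row perm[i]."""
    return sum(row_error(W[i], fault_map[perm[i]], w_min, w_max)
               for i in range(len(W)))


def best_row_permutation(W: List[List[float]], fault_map: List[List[Fault]],
                         w_min: float, w_max: float) -> Tuple[Tuple[int, ...], float]:
    """Exhaustively search row permutations (only feasible for tiny matrices)."""
    candidates = permutations(range(len(W)))
    return min(((p, mapping_cost(W, fault_map, p, w_min, w_max)) for p in candidates),
               key=lambda pair: pair[1])


W = [[0.9, 1.0],
     [0.0, 0.5]]
fault_map = [["off", None],   # physical row 0: column 0 is stuck-off
             [None, "on"]]    # physical row 1: column 1 is stuck-on

identity_cost = mapping_cost(W, fault_map, (0, 1), 0.0, 1.0)
best_perm, best_cost = best_row_permutation(W, fault_map, 0.0, 1.0)
```

In this example, swapping the two rows places the weight 0.0 on the stuck-off cell and the weight 1.0 on the stuck-on cell, so neither fault introduces any error under the assumed model.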

Session 6A  (DF-2) Advanced Imaging Technologies and Applications
Time: 15:35 - 17:15 Wednesday, January 23, 2019
Location: Room Saturn
Organizer: Masaki Sakakibara (Sony Semiconductor Solutions Corporation, Japan), Organizer/Chair: Shinichi Shibahara (Renesas Electronics Corporation, Japan)

6A-1 (Time: 15:35 - 16:00)
Title(Designers' Forum) NIR Lock-in Pixel Image Sensors for Remote Heart Rate Detection
AuthorShoji Kawahito, Cao Chen, Leyi Tan, Keiichiro Kagawa, Keita Yasutomi (Shizuoka University, Japan), Norimichi Tsumura (Chiba University, Japan)
AbstractThis paper presents a lock-in pixel CMOS image sensor (CIS) with high near-infrared (NIR) sensitivity for remote physiological signal detection. The developed 1.3M-pixel CIS provides lock-in detection of a short-pulse-modulated signal light while suppressing the influence of background light variation. Using the implemented lock-in camera system consisting of the CIS chip and NIR (870nm) LEDs, remote non-contact heart-rate (HR) measurements with 98% accuracy relative to contact-type HR measurements are demonstrated. An HR variability spectrogram for monitoring mental stress is also successfully obtained with the implemented system. Target applications of this sensor include driver monitoring systems, nursing care and security systems.

6A-2 (Time: 16:00 - 16:25)
Title(Designers' Forum) A TDC/ADC Hybrid LiDAR SoC for 200m Range Detection with High Image Resolution under 100klux Sunlight
AuthorKentaro Yoshioka (Toshiba Corporation, Japan)
AbstractLong-range and high-pixel-resolution LiDAR systems, which use Time-of-Flight (ToF) information of the photons reflected from the target, are essential for launching safe and reliable self-driving programs of Level 4 and above. 200m long-range distance measurement (DM) is required to sense preceding vehicles and obstacles as fast as possible in highway situations. To realize safe and reliable self-driving in city areas, LiDAR systems uniting a wide angle-of-view and high pixel resolution are required to fully perceive the surrounding events. Moreover, this performance must be achieved under strong background light (e.g. sunlight), which is the most significant noise source for LiDAR systems. We propose a TDC/ADC hybrid LiDAR SoC with a smart accumulation technique (SAT) to achieve both 200m range and high-resolution range imaging for reliable self-driving systems. SAT, using ADC information, enhances the effective pixel resolution with an accumulation activated by recognizing only the target reflection. Moreover, the hybrid architecture enables wide-range measurement of 0-200m; 2x longer range and 2x higher effective pixel resolution are achieved compared with conventional designs.
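
As a point of reference for the Time-of-Flight principle this abstract relies on, the round-trip relation d = c·t/2 can be worked out for the 200m range. The snippet below is a generic illustration of that relation only; the function names are hypothetical and nothing here models the proposed SoC.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact value in vacuum


def round_trip_time_s(distance_m: float) -> float:
    """Time for light to travel to a target at `distance_m` and back."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


def distance_from_round_trip_m(time_s: float) -> float:
    """Target distance recovered from a measured round-trip time."""
    return SPEED_OF_LIGHT_M_PER_S * time_s / 2.0


# Round-trip time for a target at the 200 m range quoted in the abstract.
t_200m = round_trip_time_s(200.0)
recovered = distance_from_round_trip_m(t_200m)
```

A target at 200m therefore corresponds to a round-trip time of roughly 1.33 microseconds.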

6A-3 (Time: 16:25 - 16:50)
Title(Designers' Forum) A 1/4-inch 3.9Mpixel Low Power Event-driven Back-illuminated Stacked CMOS Image sensor
AuthorOichi Kumagai (Sony Semiconductor Solutions Corporation, Japan)
AbstractWireless products such as smart home-security cameras, intelligent agents, virtual personal assistants, and smartphones are evolving rapidly to satisfy our needs. Small size, extended battery life, transparent machine interfaces: all these are required of the camera system in these applications. These applications, in battery-limited environments, can profit from an event-driven approach for moving-object detection. We have developed a 1/4-inch 3.9Mpixel low-power event-driven back-illuminated stacked CMOS image sensor equipped with a pixel readout circuit that detects moving objects for each pixel under lighting conditions ranging from 1 to 64,000lux. Utilizing pixel summation in a shared floating diffusion for each pixel block, moving-object detection is realized at 10 frames per second while consuming only 1.1mW, a 99% reduction in power from the same CIS at a full-resolution 60fps power of 95mW. The low-power event-driven technology enhances device usability and creates a world of low-resolution always-on sensing and high-quality imaging.

6A-4 (Time: 16:50 - 17:15)
Title(Designers' Forum) Next-Generation Fundus Camera with Full-Color Image Acquisition in 0-lx Visible Light using BSI CMOS Image Sensor with Advanced NIR Multi-Spectral Imaging System -Application Field Development of Dynamic Intelligent Systems Using High-Speed Vision-
AuthorHirofumi Sumi (The University of Tokyo / Nara Institute of Science and Technology, Japan), Hironari Takehara (Nara Institute of Science and Technology, Japan), Norimasa Kishi (The University of Tokyo, Japan), Jun Ohta (Nara Institute of Science and Technology, Japan), Masatoshi Ishikawa (The University of Tokyo, Japan)
AbstractThis research describes the development of the next-generation of fundus cameras with a high frame-rate, based on intelligent imaging technologies. In one of several POCs (Proof of Concepts) for Dynamic Intelligent Systems using High-Speed Vision, we aimed to develop a solution system that can be used as a camera to facilitate tracking of fast movement of the eye. Moreover, these cameras can acquire images in multi-band spectral ranges for the signals NIR1, NIR2, and NIR3 which correspond to visible light R, B, and G respectively based on near-infrared spectral imaging technology. In this regard, advanced NIR multi-spectral technology has been developed. Using this technique, NIR1: 780–800nm, NIR2: 870nm, and NIR3: 940nm in the NIR wavelength range are acquired for a target image. By exploiting the application of interpolation and color correction processing, a color image can be reproduced using only multi-NIR signals in the absence of visible light (0-lx). Using this fundus camera, it is also possible for an individual to observe and acquire images of the bottom of the eye, without assistance. Additionally, the fundus of the eye is the only site in the human body where arteries and capillaries can be directly observed non-invasively. By examining the fundus oculi, it is possible to observe the state of blood vessels and the retina/optic papilla, and thus diagnose various diseases ranging from glaucoma and retinal detachment to diabetes and arteriosclerosis. Furthermore, another potential application of this compact camera is to capture diagnostic health information, which will allow for control and active health management by individuals.

Session 6B  Optimized Training for Neural Networks
Time: 15:35 - 16:50 Wednesday, January 23, 2019
Location: Room Uranus
Chairs: Deliang Fan (University of Central Florida, U.S.A.), Raymond (Ruirui) Huang (Alibaba Cloud, U.S.A.)

6B-1 (Time: 15:35 - 16:00)
TitleCAPTOR: A Class Adaptive Filter Pruning Framework for Convolutional Neural Networks in Mobile Applications
Author*Zhuwei Qin, Fuxun Yu (George Mason University, U.S.A.), ChenChen Liu (Clarkson University, U.S.A.), Xiang Chen (George Mason University, U.S.A.)
Pagepp. 444 - 449
KeywordMobile Application, Filter Pruning, Visualization
AbstractNowadays, the evolution of deep learning and cloud services significantly promotes neural network based mobile applications. Although intelligent and prolific, those applications still lack certain flexibility: For classification tasks, neural networks are generally trained online with vast classification targets to cover various utilization contexts. However, only some of the classes are tested in practice due to individual mobile user preference and application specificity. Thus the unneeded classes cause considerable computation and communication cost. In this work, we propose CAPTOR – a class-level reconfiguration framework for Convolutional Neural Networks (CNNs). By identifying the class activation preference of convolutional filters through feature interest visualization and gradient analysis, CAPTOR can effectively cluster and adaptively prune the filters associated with unneeded classes. Therefore, CAPTOR enables class-level CNN reconfiguration for network model compression and local deployment on mobile devices. Experiments show that CAPTOR can reduce the computation load for VGG-16 by up to 40.5% and energy consumption by up to 37.9% with negligible loss of accuracy. For AlexNet, CAPTOR also reduces the computation load by up to 42.8% and energy consumption by up to 37.6% with less than 3% loss in accuracy.

6B-2 (Time: 16:00 - 16:25)
TitleTNPU: An Efficient Accelerator Architecture for Training Convolutional Neural Networks
Author*Jiajun Li, Guihai Yan, Wenyan Lu, Shuhao Jiang, Shijun Gong, Jingya Wu, Junchao Yan, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 450 - 455
Keywordaccelerator, CNN training, architecture
AbstractTraining large-scale convolutional neural networks (CNNs) is an extremely computation- and memory-intensive task that requires massive computational resources and training time. Recently, many accelerator solutions have been proposed to improve the performance and efficiency of CNNs. Existing approaches mainly focus on the inference phase of CNNs, and can hardly address the new challenges posed in CNN training: the resource requirement diversity and the bidirectional data dependency between convolutional layers (CVLs) and fully-connected layers (FCLs). To overcome this problem, this paper presents a new accelerator architecture for CNN training, called TNPU, which leverages the complementary effect of the resource requirements between CVLs and FCLs. Unlike prior approaches that optimize CVLs and FCLs separately, we take an alternative approach by orchestrating the computation of CVLs and FCLs in a single computing unit to work concurrently, so that both computing and memory resources maintain high utilization, thereby boosting the performance. We also propose a simplified out-of-order scheduling mechanism to address the bidirectional data dependency issues in CNN training. The experiments show that TNPU achieves speedups of 1.5x and 1.3x, with average energy reductions of 35.7% and 24.1%, over comparably provisioned state-of-the-art accelerators (DNPU and DaDianNao), respectively.

6B-3 (Time: 16:25 - 16:50)
TitleREIN: A Robust Training Method for Enhancing Generalization Ability of Neural Networks in Autonomous Driving Systems
Author*Fuxun Yu (George Mason University, U.S.A.), Chenchen Liu (Clarkson University, U.S.A.), Xiang Chen (George Mason University, U.S.A.)
Pagepp. 456 - 461
KeywordAutonomous Driving Systems, Deep Neural Network, Generalization Ability
AbstractIn self-driving systems, neural network models need to recognize real-world images coming from various environment settings, e.g. different lighting, rain, or even fog. Therefore, generalization ability becomes the most important factor in self-driving systems. In this work, we propose a novel network training method, which significantly improves model generalization capability in various practical scenarios, including rainy, foggy, and different light conditions, as well as motion blurring.

[To Session Table]

Session 6C  New Trends in Biochips
Time: 15:35 - 16:50 Wednesday, January 23, 2019
Location: Room Venus
Chair: Tsun-Ming Tseng (Technical University of Munich, Germany)

6C-1 (Time: 15:35 - 16:00)
TitleFactorization Based Dilution of Biochemical Fluids with Micro-Electrode-Dot-Array Biochips
AuthorSohini Saha (IIEST, Shibpur, India), Debraj Kundu, *Sudip Roy (IIT Roorkee, India), Sukanta Bhattacharjee (New York University, United Arab Emirates), Krishnendu Chakrabarty (Duke University, U.S.A.), Partha P. Chakrabarti (IIT Kharagpur, India), Bhargab B. Bhattacharya (ISI Kolkata, India)
Pagepp. 462 - 467
KeywordBiochip, Microfluidics, Sample preparation, Dilution, MEDA
AbstractSample preparation, an essential preprocessing step for biochemical protocols, is concerned with the generation of fluids satisfying specific mixing ratios and error-tolerance. Recent micro-electrode-dot-array (MEDA)-based DMF biochips provide the advantage of supporting both discrete and dynamic mixing models, the power of which has not yet been fully harnessed for implementing on-chip dilution and mixing of fluids. In this paper, we propose a novel factorization-based algorithm called FacDA for efficient and accurate dilution of sample fluid.

6C-2 (Time: 16:00 - 16:25)
TitleSample Preparation for Multiple-Reactant Bioassays on Micro-Electrode-Dot-Array Biochips
Author*Tung-Che Liang (Duke University, U.S.A.), Yun-Sheng Chan (National Chiao Tung University, Taiwan), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Krishnendu Chakrabarty (Duke University, U.S.A.), Chen-Yi Lee (National Chiao Tung University, Taiwan)
Pagepp. 468 - 473
KeywordSample preparation, MEDA biochip
AbstractSample preparation, as a key procedure in many biochemical protocols, mixes various samples and/or reagents into solutions that contain the target concentrations. Digital microfluidic biochips (DMFBs) have been adopted as a platform for sample preparation because they provide automatic procedures that require less reactant consumption and reduce human-induced errors. However, traditional DMFBs only utilize the (1:1) mixing model, i.e., only two droplets of the same volume can be mixed at a time, which results in higher completion time and the wastage of valuable reactants. To overcome this limitation, a next-generation micro-electrode-dot-array (MEDA) architecture that provides flexibility of mixing multiple droplets of different volumes in a single operation was proposed. In this paper, we present a generic multiple-reactant sample preparation algorithm that exploits the novel fluidic operations on MEDA biochips. Simulated experiments show that the proposed method outperforms existing methods in terms of saving reactant cost, minimizing the number of operations, and reducing the amount of waste.
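The (1:1) mixing model named in the abstract above can be illustrated with a small sketch. It is not the algorithm of this paper. It assumes a textbook-style bit-scanning dilution in which each step mixes the current droplet with one equal-volume droplet of pure sample (concentration 1) or pure buffer (concentration 0). The function name `one_to_one_dilution` and its parameters are hypothetical.

```python
from fractions import Fraction


def one_to_one_dilution(k: int, n: int):
    """Reach concentration k / 2**n using only (1:1) mixing steps.

    Assumption for illustration: bit-scanning from the least significant
    bit of k; each step mixes the current droplet with an equal volume of
    sample (bit = 1) or buffer (bit = 0), which averages the two
    concentrations.
    """
    if n <= 0 or not 0 <= k < 2 ** n:
        raise ValueError("need n > 0 and 0 <= k < 2**n")
    concentration = Fraction(0)  # start from a buffer droplet
    steps = []
    for i in range(n):
        bit = (k >> i) & 1
        # A (1:1) mix of two equal-volume droplets averages them.
        concentration = (concentration + bit) / 2
        steps.append(("sample" if bit else "buffer", concentration))
    return concentration, steps


if __name__ == "__main__":
    target, steps = one_to_one_dilution(5, 3)
    for reagent, c in steps:
        print(f"mix with {reagent}: concentration {c}")
```

Every target of the form k/2^n costs n sequential mixes under this model, which is the kind of completion-time overhead that multi-droplet mixing on MEDA biochips aims to reduce.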

6C-3 (Time: 16:25 - 16:50)
TitleRobust Sample Preparation on Low-Cost Digital Microfluidic Biochips
Author*Zhanwei Zhong (Duke University, U.S.A.), Robert Wille (Johannes Kepler University Linz, Austria), Krishnendu Chakrabarty (Duke University, U.S.A.)
Pagepp. 474 - 480
Keywordsample preparation, fault-tolerance, robust, low cost, biochip
AbstractSample preparation is an important application for the digital microfluidic biochips (DMFBs) platform, and many methods have been developed to reduce the time and reagent usage associated with on-chip sample preparation. However, errors in fluidic operations can result in the concentration of the resulting droplet being outside the calibration range. Current error-recovery methods have the drawback that they need the use of on-chip sensors and further re-execution time. In this paper, we present two dilution-chain structures that can generate a droplet with a desired concentration even if volume variations occur during droplet splitting. Experimental results show the effectiveness of the proposed method compared to previous methods.

[To Session Table]

Session 6D  Power-efficient Machine Learning Hardware Design
Time: 15:35 - 16:50 Wednesday, January 23, 2019
Location: Room Mars+Room Mercury
Chairs: Hai Wang (UESTC, China), Sheldon Tan (University of California at Riverside, U.S.A.)

6D-1 (Time: 15:35 - 16:00)
TitleSAADI: A Scalable Accuracy Approximate Divider for Dynamic Energy-Quality Scaling
AuthorSetareh Behroozi, Jingjie Li, Jackson Melchert, *Younghyun Kim (University of Wisconsin-Madison, U.S.A.)
Pagepp. 481 - 486
Keywordapproximate computing, arithmetic logic unit, divider
AbstractWe present the design of a runtime accuracy-configurable approximate divider named SAADI. It approximates the reciprocal of the divisor in an incremental manner, so that division speed or energy can be dynamically traded off for accuracy by controlling the number of iterations. For the approximate 8-bit division of 32-bit input operands, the average accuracy of SAADI can be adjusted between 91.2% and 99.0%, with latency varying by up to 7x.
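The idea of trading iterations for accuracy through an incrementally refined reciprocal can be sketched in software. This is not the SAADI circuit. It assumes one generic way to refine a reciprocal, a truncated geometric series 1/d = 1 + x + x^2 + ... with x = 1 - d after scaling d into [0.5, 1). The function name `approx_divide` is hypothetical.

```python
def approx_divide(a: float, d: float, iterations: int) -> float:
    """Approximate a / d; more iterations give a more accurate reciprocal.

    Assumption for illustration: the divisor is positive and is scaled by
    a power of two into [0.5, 1), which would be a shift in hardware.
    """
    if d <= 0:
        raise ValueError("this sketch only handles positive divisors")
    shift = 0
    while d >= 1.0:
        d /= 2.0
        shift += 1
    while d < 0.5:
        d *= 2.0
        shift -= 1
    x = 1.0 - d        # 0 < x <= 0.5, so the series converges
    reciprocal = 1.0   # zeroth-order term of 1 / (1 - x)
    term = 1.0
    for _ in range(iterations):
        term *= x          # next power of x
        reciprocal += term
    # Undo the power-of-two scaling applied to the divisor.
    return a * reciprocal / (2.0 ** shift)


if __name__ == "__main__":
    for n in (0, 2, 4, 8):
        print(n, approx_divide(10.0, 3.0, n))
```

In this sketch the relative error after n iterations is x^(n+1), so each extra iteration buys accuracy at the cost of one more multiply-add.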

6D-2 (Time: 16:00 - 16:25)
TitleSeFAct: Selective Feature Activation and Early Classification for CNNs
Author*Farhana Sharmin Snigdha, Ibrahim Ahmed, Susmita Dey Manasi, Meghna G. Mankalale (University of Minnesota, U.S.A.), Jiang Hu (Texas A&M University, U.S.A.), Sachin S. Sapatnekar (University of Minnesota, U.S.A.)
Pagepp. 487 - 492
KeywordNeural network, Energy-efficient design, Data-dependent architecture, Deep learning, Reduced precision
AbstractThis work presents a dynamic energy reduction approach for hardware accelerators for convolutional neural networks (CNNs). Two methods are used: (1) an adaptive data-dependent scheme to selectively activate a subset of all neurons, by narrowing down the possible activated classes, and (2) static bitwidth reduction. The former is applied in late layers of the CNN, while the latter is more effective in early layers. Even accounting for the implementation overheads, the results show 20%-25% energy savings with 5-10% accuracy loss.

6D-3 (Time: 16:25 - 16:50)
TitleFACH: FPGA-based Acceleration of Hyperdimensional Computing by Reducing Computational Complexity
AuthorMohsen Imani, Sahand Salamat, Saransh Gupta, *Jiani Huang, Tajana Rosing (UC San Diego, U.S.A.)
Pagepp. 493 - 498
KeywordBrain-inspired Computing, Machine learning, Energy efficiency
AbstractBrain-inspired hyperdimensional (HD) computing explores computing with hypervectors for the emulation of cognition as an alternative to computing with numbers. In HD, input symbols are mapped to a hypervector and an associative search is performed for reasoning and classification. An associative memory, which finds the closest match between a set of learned hypervectors and a query hypervector, uses a simple Hamming distance metric for the similarity check. However, we observe that, in order to provide acceptable classification accuracy, HD needs to store a non-binarized model in associative memory and use costly similarity metrics such as cosine to perform a reasoning task. This makes HD computationally expensive when it is used for realistic classification problems. In this paper, we propose an FPGA-based acceleration of HD (FACH) which significantly improves computation efficiency by removing the majority of multiplications during the reasoning task. FACH identifies representative values in each class hypervector using a clustering algorithm. Then, it creates a new HD model with hardware-friendly operations, and accordingly proposes an FPGA-based implementation to accelerate such tasks. Our evaluations on several classification problems show that FACH can provide 5.9× energy efficiency improvement and 5.1× speedup as compared to a baseline FPGA-based implementation, while ensuring the same quality of classification.
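The associative search described in the abstract above, for the simple binarized case, can be sketched as a nearest-neighbour lookup under Hamming distance. This is only the baseline similarity check, not FACH itself. The tiny 8-bit hypervectors and the names `hamming` and `classify` are hypothetical; real HD models use thousands of dimensions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("hypervectors must have the same dimension")
    return sum(x != y for x, y in zip(a, b))


def classify(query, class_hypervectors):
    """Return the label whose learned hypervector is closest to the query."""
    return min(
        class_hypervectors,
        key=lambda label: hamming(query, class_hypervectors[label]),
    )


if __name__ == "__main__":
    # Toy associative memory with two learned class hypervectors.
    memory = {
        "A": [1, 1, 1, 1, 0, 0, 0, 0],
        "B": [0, 0, 0, 0, 1, 1, 1, 1],
    }
    print(classify([1, 1, 0, 1, 0, 0, 1, 0], memory))
```

A non-binarized model would replace `hamming` with a metric such as cosine similarity, which needs multiplications. That cost is what the abstract identifies as the bottleneck.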

Thursday, January 24, 2019

[To Session Table]

Session 3K  Keynote III
Time: 9:00 - 10:00 Thursday, January 24, 2019
Location: Miraikan Hall
Chair: Shinji Kimura (Waseda University, Japan)

3K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) Hardware and Software Security Technologies to Enable Future Connected Cars
AuthorYasuhisa Shimazaki (Renesas Electronics, Japan)
AbstractIn the coming autonomous-driving era, automobiles and many kinds of facilities will be connected to each other to provide a safe, comfortable and efficient driving environment for drivers. Communication between a vehicle and the cloud, for example, is used to obtain traffic information, to deliver driver-centric applications, and to maintain the vehicle's condition by monitoring and analyzing various sensing data obtained around the vehicle. This means, however, that we need to pay much attention to cyber security in automotive systems. Indeed, a remote attack on a running car through a cellular network was demonstrated in 2015, resulting in 1.4 million recalls. To address this issue, MCUs used in an automobile need security measures that protect themselves and their communication channels effectively and efficiently. In this presentation, basic security technology will first be introduced, and then actual implementations of hardware and software security techniques will be shown. The presentation will also cover standardization trends in automotive security.

[To Session Table]

Session 7A  (SS-4) Security of Machine Learning and Machine Learning for Security: Progress and Challenges for Secure, Machine Intelligent Mobile Systems
Time: 10:20 - 12:00 Thursday, January 24, 2019
Location: Room Saturn
Chairs: Xiang Chen (George Mason University, U.S.A.), Yanzhi Wang (Northeastern University, U.S.A.)

7A-1 (Time: 10:20 - 10:45)
Title(Invited Paper) ADMM Attack: An Enhanced Adversarial Attack for Deep Neural Networks with Undetectable Distortions
AuthorPu Zhao, *Kaidi Xu (Northeastern University, U.S.A.), Sijia Liu (IBM Research AI, U.S.A.), Yanzhi Wang, Xue Lin (Northeastern University, U.S.A.)
Pagepp. 499 - 505
Keywordadversarial attack, DNN security
AbstractMany recent studies demonstrate that state-of-the-art deep neural networks (DNNs) might be easily fooled by adversarial examples, generated by adding carefully crafted and visually imperceptible distortions onto original legal inputs through adversarial attacks. Adversarial examples can lead the DNN to misclassify them as any target labels. In the literature, various methods are proposed to minimize the different ℓp norms of the distortion. However, a versatile framework for all types of adversarial attacks is still lacking. To achieve a better understanding of the security properties of DNNs, we propose a general framework for constructing adversarial examples by leveraging the Alternating Direction Method of Multipliers (ADMM) to split the optimization approach for effective minimization of various ℓp norms of the distortion, including ℓ0, ℓ1, ℓ2, and ℓ∞ norms. Thus, the proposed general framework unifies the methods of crafting ℓ0, ℓ1, ℓ2, and ℓ∞ attacks. The experimental results demonstrate that the proposed ADMM attacks achieve both a high attack success rate and minimal distortion for the misclassification compared with state-of-the-art attack methods.
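The four distortion measures named in the abstract above follow their standard definitions and can be computed for a given perturbation as in the sketch below. It only evaluates the norms and does not implement the ADMM attack. The function name `distortion_norms` and the flat-list representation of the perturbation are assumptions for illustration.

```python
import math


def distortion_norms(delta):
    """Return the l0, l1, l2 and l-infinity norms of a perturbation.

    `delta` is the element-wise difference between the adversarial
    example and the original input, flattened to a sequence of numbers.
    """
    return {
        "l0": sum(1 for v in delta if v != 0),        # changed elements
        "l1": sum(abs(v) for v in delta),             # total change
        "l2": math.sqrt(sum(v * v for v in delta)),   # Euclidean length
        "linf": max((abs(v) for v in delta), default=0.0),  # largest change
    }


if __name__ == "__main__":
    print(distortion_norms([0.0, 3.0, -4.0, 0.0]))
```

An attack that minimizes ℓ0 changes few elements, while one that minimizes ℓ∞ keeps every individual change small. This is why the different norms lead to different attack formulations.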

7A-2 (Time: 10:45 - 11:10)
Title(Invited Paper) A System-level Perspective to Understand the Vulnerability of Deep Learning Systems
AuthorTao Liu, Nuo Xu, Qi Liu (Florida International University, U.S.A.), *Yanzhi Wang (Northeastern University, U.S.A.), Wujie Wen (Florida International University, U.S.A.)
Pagepp. 506 - 511
Keywordmachine learning, security, DNN, system-level, mitigation
AbstractDeep neural networks (DNNs) are nowadays achieving human-level performance on many machine learning applications like self-driving cars, gaming, and computer-aided diagnosis. However, recent studies show that such a promising technique has gradually become the major attack target, significantly threatening the safety of machine learning services. On one hand, the adversarial or poisoning attacks incurred by DNN algorithm vulnerabilities can mislead decisions with very high confidence. On the other hand, system-level DNN attacks built upon models, training/inference algorithms, and the hardware and software used in DNN execution have also emerged, causing more diversified damage such as denial of service and private data stealing. In this paper, we present an overview of such emerging system-level DNN attacks by systematically formulating their attack routines. Several representative cases are selected in our study to summarize the characteristics of system-level DNN attacks. Based on our formulation, we further discuss the challenges and several possible techniques to mitigate such emerging system-level DNN attacks.

7A-3 (Time: 11:10 - 11:35)
Title(Invited Paper) High-Performance Adaptive Mobile Security Enhancement against Malicious Speech and Image Recognition
Author*Zirui Xu, Fuxun Yu (George Mason University, U.S.A.), Chenchen Liu (Clarkson University, U.S.A.), Xiang Chen (George Mason University, U.S.A.)
Pagepp. 512 - 517
KeywordAdversarial Example, Automatic Speech Recognition, Image Recognition
AbstractAutomatic Speech Recognition (ASR) and Image Recognition (IR) have been massively used in unauthorized audio/image data analysis, causing serious privacy leakage. To address this issue, we propose HAMPER in this work, which is a data encryption framework that protects the audio/image data from unauthorized ASR/IR analysis. Leveraging machine learning models’ vulnerability to adversarial examples, HAMPER encrypts the audio/image data with adversarial noise to perturb the recognition results of ASR/IR systems. To deploy the proposed framework on a wide range of platforms (e.g. mobile devices), HAMPER generates adversarial examples from the low-level features. Taking advantage of the light computation load, fundamental impact, and direct configurability of the low-level features, the generated adversarial examples can efficiently and effectively affect the whole ASR/IR system.

7A-4 (Time: 11:35 - 12:00)
Title(Invited Paper) AdverQuil: an Efficient Adversarial Detection and Alleviation Technique for Black-Box Neuromorphic Computing Systems
AuthorHsin-Pai Cheng, Juncheng Shen, Huanrui Yang (Duke University, U.S.A.), Qing Wu (Air Force Research Laboratory, U.S.A.), Hai Li, *Yiran Chen (Duke University, U.S.A.)
Pagepp. 518 - 525
KeywordNeural networks, Neuromorphic computing, adversarial attack
AbstractIn recent years, neuromorphic computing systems (NCS) have gained popularity in accelerating neural network computation because of their high energy efficiency. The known vulnerability of neural networks to adversarial attack, however, raises a severe security concern for NCS. In addition, there are certain application scenarios in which users have limited access to the NCS. In such scenarios, defense technologies that require changing the training methods of the NCS, e.g., adversarial training, become impracticable. In this work, we propose AdverQuil, an efficient adversarial detection and alleviation technique for black-box NCS. AdverQuil can identify the adversarial strength of input examples and select the best strategy for the NCS to respond to the attack, without changing the structure/parameters of the original neural network or its training method. Experimental results show that on MNIST and CIFAR-10 datasets, AdverQuil achieves a high efficiency of 79.5 - 167K image/sec/watt. AdverQuil introduces less than 25% of hardware overhead, and can be combined with various adversarial alleviation techniques to provide a flexible trade-off between hardware cost, energy efficiency and classification accuracy.

[To Session Table]

Session 7B  System Level Modelling Methods II
Time: 10:20 - 12:00 Thursday, January 24, 2019
Location: Room Uranus
Chair: Naehyuck Chang (KAIST, Republic of Korea)

7B-1 (Time: 10:20 - 10:45)
TitleSIMULTime: Context-Sensitive Timing Simulation on Intermediate Code Representation for Rapid Platform Explorations
Author*Alessandro Cornaglia, Alexander Viehl (FZI Research Center for Information Technology, Germany), Oliver Bringmann, Wolfgang Rosenstiel (University of Tübingen, Germany)
Pagepp. 526 - 531
KeywordSoftware Timing Simulation, Embedded Systems, Compiler optimizations effects, Hardware-dependent Software
AbstractToday, product lines are common practice in the embedded systems domain as they allow for substantial reductions in development costs and time-to-market through the consistent application of design paradigms such as variability and structured reuse management. In that context, accurate and fast timing predictions are essential for an early evaluation of all relevant variants of a product line concerning target platform properties. Context-sensitive simulations provide attractive benefits for timing analysis. Nevertheless, these simulations depend strongly on a single configuration pair of compiler and hardware platform. To cope with this limitation, we present SIMULTime, a new technique for context-sensitive timing simulation based on software intermediate representation. Multiple simultaneous simulations, for different hardware platforms and compiler configurations, are obtained with a total simulation throughput that is much higher than that of running separate single simulations to obtain the same accurate predictions. Our novel approach was applied to several applications, showing that SIMULTime increases the average simulation throughput by 90% when at least four configurations are analyzed in parallel.

7B-2 (Time: 10:45 - 11:10)
TitleModeling Processor Idle Times in MPSoC Platforms to Enable Integrated DPM, DVFS, and Task Scheduling Subject to a Hard Deadline
AuthorAmirhossein Esmaili, Mahdi Nazemi, *Massoud Pedram (University of Southern California, U.S.A.)
Pagepp. 532 - 537
KeywordTask Scheduling, Energy Optimization, DVFS, DPM, Real-time MPSoCs
AbstractEnergy efficiency is one of the most critical design criteria for modern embedded systems such as multiprocessor system-on-chips (MPSoCs). Dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM) are two major techniques for reducing energy consumption in such embedded systems. Furthermore, MPSoCs are becoming more popular for many real-time applications. One of the challenges of integrating DPM with DVFS and task scheduling of real-time applications on MPSoCs is the modeling of idle intervals on these platforms. In this paper, we present a novel approach for modeling idle intervals in MPSoC platforms which leads to a mixed integer linear programming (MILP) formulation integrating DPM, DVFS, and task scheduling of periodic task graphs subject to a hard deadline. We also present a heuristic approach for solving the MILP and compare its results with those obtained from solving the MILP.

Best Paper Candidate
7B-3 (Time: 11:10 - 11:35)
TitlePhone-nomenon: A System-Level Thermal Simulator for Handheld Devices
Author*Hong-Wen Chiou, Yu-Min Lee, Shin-Yu Shiau (National Chiao Tung University, Taiwan), Chi-Wen Pan, Tai-Yu Chen (Mediatek Inc., Taiwan)
Pagepp. 538 - 543
KeywordThermal analysis, Handheld Devices, System level
AbstractThis work presents a system-level thermal simulator, Phone-nomenon, to predict the thermal behavior of smartphones. First, we fully investigate the nonlinearity of internal and external heat transfer mechanisms and propose a compact thermal model. After that, we develop an iterative framework to handle this nonlinearity. Compared with ANSYS Icepak, Phone-nomenon can achieve a speedup of two orders of magnitude with 3.58% maximum error. Meanwhile, a thermal test vehicle has been built for collecting measured data, and Phone-nomenon fits these data well.

7B-4 (Time: 11:35 - 12:00)
TitleVirtual Prototyping of Heterogeneous Automotive Applications: Matlab, SystemC, or both?
AuthorXiao Pan, *Carna Zivkovic, Christoph Grimm (University of Kaiserslautern, Germany)
Pagepp. 544 - 549
KeywordVirtual Prototyping, SystemC, Matlab/Simulink, Instruction Set Simulator
AbstractWe present a case study on virtual prototyping of automotive applications. We address the co-simulation of HW/SW systems involving firmware, communication protocols, and physical/mechanical systems in the context of model-based and agile development processes. The case study compares the Matlab/Simulink and SystemC based approaches by an e-gas benchmark. We compare the simulation performance, modeling capabilities and applicability in different stages of the development process.

[To Session Table]

Session 7C  Placement
Time: 10:20 - 12:00 Thursday, January 24, 2019
Location: Room Venus
Chairs: Ting-Chi Wang (National Tsing Hua University, Taiwan), Yasuhiro Takashima (University of Kitakyushu, Japan)

Best Paper Candidate
7C-1 (Time: 10:20 - 10:45)
TitleDiffusion Break-Aware Leakage Power Optimization and Detailed Placement in Sub-10nm VLSI
AuthorSun ik Heo (Samsung Electronics Co., Ltd., Republic of Korea), *Andrew B. Kahng, Minsoo Kim, Lutong Wang (UC San Diego, U.S.A.)
Pagepp. 550 - 556
Keywordleakage, placement, diffusion break, optimization, local layout effect
AbstractA diffusion break (DB) isolates two neighboring devices in a standard cell-based design and has a stress effect on delay and leakage power. In foundry sub-10nm design enablements, device performance is changed according to the type of DB – single diffusion break (SDB) or double diffusion break (DDB) – that is used in the library cell layout. Crucially, local layout effect (LLE) can substantially affect device performance and leakage. Our present work focuses on the 2nd DB effect, a type of LLE in which distance to the second-closest DB (i.e., a distance that depends on the placement of a given cell’s neighboring cell) also impacts performance of a given device. In this work, we implement a 2nd DB-aware timing and leakage analysis flow, and show how a lack of 2nd DB awareness can misguide current optimization in place-and-route stages. We then develop 2nd DB-aware leakage optimization and detailed placement heuristics. Experimental results in a scaled foundry 14nm technology indicate that our 2nd DB-aware analysis and optimization flow achieves, on average, 80% recovery of the leakage increment that is induced by the 2nd DB effect, without changing design performance.

7C-2 (Time: 10:45 - 11:10)
TitleMDP-trees: Multi-Domain Macro Placement for Ultra Large-Scale Mixed-Size Designs
Author*Yen-Chun Liu (National Taiwan University, Taiwan), Tung-Chieh Chen (Maxeda Technology Inc., Taiwan), Yao Wen Chang, Sy-Yen Kuo (National Taiwan University, Taiwan)
Pagepp. 557 - 562
KeywordPlacement, Macro Placement, Mixed-Size Placement
AbstractIn this paper, we present a new hybrid representation of slicing trees and multi-packing trees, called multi-domain-packing trees (MDP-trees), for macro placement to handle ultra large-scale multi-domain mixed-size designs. A multi-domain design typically consists of a set of mixed-size domains, each with hundreds/thousands of large macros and (tens of) millions of standard cells, which is often seen in modern high-end applications (e.g., 4G LTE products and upcoming 5G ones). To the best of our knowledge, there is still no published work specifically tackling the multi-domain macro placement. Based on binary trees, the MDP-tree is very efficient and effective for handling macro placement with multiple domains. Previous works on macro placement can handle only single-domain designs, which do not consider the global interactions among domains. In contrast, our MDP-trees optimize the interconnections among domains and macro/cell positions simultaneously. The area of each domain is well reserved, and the macro displacement is minimized from initial macro positions of the design prototype. Experimental results show that our approach can significantly reduce both the average half-perimeter wirelength and the average global routing wirelength.
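The half-perimeter wirelength (HPWL) metric reported in the abstract above has a standard definition: for each net, the half perimeter of the bounding box that encloses its pins. The sketch below computes it. The pin coordinates and the names `hpwl` and `total_hpwl` are hypothetical, and no part of the MDP-tree representation is modelled.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net given its (x, y) pin positions."""
    if not pins:
        return 0.0
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    # Width plus height of the bounding box around all pins.
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets):
    """Sum of HPWL over all nets of a placement."""
    return sum(hpwl(pins) for pins in nets)


if __name__ == "__main__":
    nets = [
        [(0, 0), (3, 4), (1, 1)],   # bounding box 3 x 4
        [(2, 2), (5, 2)],           # bounding box 3 x 0
    ]
    print(total_hpwl(nets))
```

HPWL is cheap to evaluate, which is why placers commonly use it as a proxy for the routed wirelength.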

7C-3 (Time: 11:10 - 11:35)
TitleA Shape-Driven Spreading Algorithm Using Linear Programming for Global Placement
Author*Shounak Dhar (University of Texas at Austin, U.S.A.), Love Singhal, Mahesh A. Iyer (Intel Corporation, U.S.A.), David Z. Pan (University of Texas at Austin, U.S.A.)
Pagepp. 563 - 568
KeywordPlacement, Legalization, Spreading, Flow
AbstractIn this paper, we consider the problem of finding the global shape for placement of cells in a chip that results in minimum wirelength. Under certain assumptions, we theoretically prove that some shapes are better than others for purposes of minimizing wirelength, while ensuring that overlap-removal is a key constraint of the placer. We derive some conditions for the optimal shape and obtain a shape which is numerically close to the optimum. We also propose a linear-programming-based spreading algorithm with parameters to tune the resultant shape and derive a cost function that is better than total or maximum displacement objectives, that are traditionally used in many numerical global placers. Our new cost function also does not require explicit wirelength computation, and our spreading algorithm preserves to a large extent, the relative order among the cells placed after a numerical placer iteration. Our experimental results demonstrate that our shape-driven spreading algorithm improves wirelength, routing congestion and runtime compared to a bi-partitioning based spreading algorithm used in a state-of-the-art academic global placer for FPGAs.

7C-4 (Time: 11:35 - 12:00)
TitleFinding Placement-Relevant Clusters With Fast Modularity-Based Clustering
Author*Mateus Fogaça (Universidade Federal do Rio Grande do Sul, Brazil), Andrew B. Kahng (University of California, San Diego, U.S.A.), Ricardo Augusto da Luz Reis (Universidade Federal do Rio Grande do Sul, Brazil), Lutong Wang (University of California, San Diego, U.S.A.)
Pagepp. 569 - 576
KeywordPhysical Design, Floorplanning, placement, Clustering, Modularity
AbstractIn advanced technology nodes, IC implementation faces increasing design complexity as well as ever-more demanding design schedule requirements. This raises the need for new decomposition approaches that can help reduce problem complexity, in conjunction with new predictive methodologies that can help avoid bottlenecks and loops in the physical implementation flow. Notably, with modern design methodologies it would be very valuable to better predict final placement of the gate-level netlist: this would enable more accurate early assessment of performance, congestion and floorplan viability in the SOC floorplanning/RTL planning stages of design. In this work, we study a new criterion for the classic challenge of VLSI netlist clustering: how well netlist clusters “stay together” through final implementation. We propose use of several evaluators of this criterion. We also explore the use of modularity-driven clustering to identify natural clusters in a given graph without the tuning of parameters and size balance constraints typically required by VLSI CAD partitioning methods. We find that the netlist hypergraph-to-graph mapping can significantly affect quality of results, and we experimentally identify an effective recipe for weighting that also comprehends topological proximity to I/Os. Further, we empirically demonstrate that modularity-based clustering achieves better correlation to actual netlist placements than traditional VLSI CAD methods (our method is also 4X faster than use of hMetis for our largest testcases). Finally, we show a potential flow with fast “blob placement” of clusters to evaluate netlist and floorplan viability in early design stages; this flow can predict gate-level placement of 370K cells in 200 seconds on a single core.
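The modularity score that modularity-driven clustering tries to maximize can be sketched for a plain graph. The sketch uses the standard definition Q = Σ_c [ L_c/m − (d_c/2m)² ], where m is the number of edges, L_c the number of edges inside cluster c, and d_c the total degree of its nodes. It assumes an unweighted, undirected graph, so the hypergraph-to-graph mapping and edge weighting studied in the paper are not modelled. The function name `modularity` is hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected graph.

    `edges` is a list of (u, v) pairs and `community` maps each node to
    its cluster label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph must have at least one edge")
    internal = {}  # L_c: edges with both endpoints in cluster c
    degree = {}    # d_c: sum of node degrees in cluster c
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree
    )


if __name__ == "__main__":
    # Two triangles joined by a single bridge edge.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
    print(modularity(edges, clusters))
```

Splitting the two triangles into separate clusters scores higher than putting every node in one cluster, which always gives Q = 0. That is the sense in which modularity finds natural clusters without a size-balance constraint.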

[To Session Table]

Session 7D  Algorithms and Architectures for Emerging Applications
Time: 10:20 - 12:00 Thursday, January 24, 2019
Location: Room Mars+Room Mercury
Chair: Taewhan Kim (Seoul National University, Republic of Korea)

7D-1 (Time: 10:20 - 10:45)
TitleAn Approximation Algorithm to the Optimal Switch Control of Reconfigurable Battery Packs
AuthorShih-Yu Chen, *Jie-Hong Jiang (National Taiwan University, Taiwan), Shou-Hung Ling, Shih-Hao Liang, Mao-Cheng Huang (ITRI, Taiwan)
Pagepp. 577 - 584
Keywordapproximation algorithm, battery pack, reconfigurability, switch control
AbstractThe broad applications of lithium-ion batteries in cyber-physical systems attract intensive research on building energy-efficient battery systems. Reconfigurable battery packs have been proposed to improve reliability and energy efficiency. Despite recent efforts, how to simultaneously maximize battery usage time and minimize switching count during reconfiguration is rarely addressed. In this work, we devise a control algorithm that, under a simplified battery model, achieves the longest usage time under a given constant power-load while the switching count is at most twice above the minimum. It is further generalized for arbitrary power-loads and adjusted for refined battery models. Simulation experiments show promising benefits of the proposed algorithm.

7D-2 (Time: 10:45 - 11:10)
TitleAutonomous Vehicle Routing In Multiple Intersections
Author*Sheng-Hao Lin, Tsung-Yi Ho (National Tsing Hua University, Taiwan)
Pagepp. 585 - 590
KeywordAutonomous Driving, Intersection Management
AbstractAdvancements in artificial intelligence and the Internet of Things indicate that the realization of commercial autonomous vehicles is almost ready. With autonomous vehicles come new approaches to solving some of the current traffic problems such as fuel consumption, congestion, and high incident rates. Autonomous Intersection Management (AIM) is an example that utilizes the unique attributes of autonomous vehicles to improve the efficiency of a single intersection. However, in a system of interconnected intersections, improving individual intersections alone does not guarantee a system optimum. Therefore, we extend from a single intersection to a grid of intersections and propose a novel path planning method for autonomous vehicles that can effectively reduce the travel time of each vehicle. With Dedicated Short Range Communications and the fine-grained control of autonomous vehicles, we are able to apply traditional wire routing algorithms with modified constraints. Our method intelligently avoids congestion by simulating future traffic, thereby achieving a system optimum.

7D-3 (Time: 11:10 - 11:35)
TitleGRAM: Graph Processing in a ReRAM-based Computational Memory
Author*Minxuan Zhou, Mohsen Imani, Saransh Gupta, Yeseong Kim, Tajana Rosing (University of California, San Diego, U.S.A.)
Pagepp. 591 - 596
Keywordgraph processing, processing in memory
AbstractGraph processing applications suffer from inefficient memory behaviour in traditional systems. In this paper, we exploit resistive memory (ReRAM) based processing-in-memory (PIM) technology to accelerate graph applications in a computational memory which not only stores application data but also supports various in-memory operations like arithmetic operations and associative search. The proposed solution, GRAM, implements the vertex program model, which is widely used to implement various parallel graph algorithms, in the computational memory to accelerate various parallel graph applications. Based on our experiments with three important graph kernels on seven real-world graphs, GRAM can provide 122.5× and 11.1× speedup compared with a large-scale graph system and optimized multi-threading reference algorithms running on a multi-core CPU. Compared to a GPU-based graph acceleration library and a recently proposed PIM accelerator, GRAM improves the performance by 7.1× and 3.8× respectively.

7D-4 (Time: 11:35 - 12:00)
TitleADEPOS: Anomaly Detection based Power Saving for Predictive Maintenance using Edge Computing
Author*Sumon Kumar Bose, Bapi Kar, Mohendra Roy, Pradeep Kumar Gopalakrishnan, Arindam Basu (Nanyang Technological University, Singapore)
Pagepp. 597 - 602
KeywordApproximate Computing, Predictive Maintenance, Anomaly Detection, ELM, Autoencoder
AbstractIn Industry 4.0, predictive maintenance (PM) is one of the most important applications pertaining to the Internet of Things (IoT). Machine learning is used to predict the possible failure of a machine before the actual event occurs. However, the main challenges in PM are: (a) lack of enough data from failing machines, and (b) paucity of power and bandwidth to transmit sensor data to the cloud throughout the lifetime of the machine. Alternatively, edge computing approaches reduce data transmission and consume little energy. In this paper, we propose the Anomaly Detection based Power Saving (ADEPOS) scheme, which uses approximate computing throughout the lifetime of the machine. At the beginning of the machine's life, low-accuracy computations are used while the machine is healthy. However, on detection of anomalies as time progresses, the system is switched to higher-accuracy modes. We show using the NASA bearing dataset that with ADEPOS we need 8.8X fewer neurons on average and, based on post-layout results, the resultant energy savings are 6.4-6.65X.

Session 8A  (DF-3) Emerging Technologies for Tokyo Olympic 2020
Time: 13:15 - 14:30 Thursday, January 24, 2019
Location: Room Saturn
Organizer/Chair: Koichiro Yamashita (Fujitsu, Japan), Organizer: Akihiko Inoue (Panasonic Corporation, Japan)

8A-1 (Time: 13:15 - 13:40)
Title(Designers' Forum) Walking assistive powered-wear 'HIMICO' with wire-driven assist
AuthorKenta Murakami (Panasonic Corporation, Japan)
AbstractWearable robotic devices that augment and assist human movement have been developed by many groups. Many of these devices, however, are exo-skeleton type. Exo-skeleton robots limit freedom of movement and lead to significant increases in leg inertia since actuators are placed near the joint. On the other hand, we developed the wire-driven assistive robots ‘HIMICO’, which assist human movement softly by using Bowden cables to transfer power from the motor system. The key feature of this approach is that the actuator can be located away from the joint, allowing lightweight leg structures while still generating significant forces. The maximum tension per wire is approximately 100 N, allowing a very lightweight design of 3.5 kg and reducing metabolic cost in climbing slopes and walking up stairs.

8A-2 (Time: 13:40 - 14:05)
Title(Designers' Forum) Deep Scene Recognition with Object Detection
AuthorZhiming Tan (Fujitsu R&D Center, Co. LTD, China)
AbstractNot only for large international events like the Olympics but also for smart cities and smart life, advanced image processing and deep learning technologies are becoming important because of the limits of visual inspection in recognizing scenes. Traditional methods for traffic scene recognition need to handle complex factors, such as multiple object types, object relationships, background, weather, and lighting, so it is hard for them to recognize scenes accurately in real time. By using a lightweight CNN model optimized for object detection, we present a system that recognizes traffic scenes with higher accuracy and in real time. The CNN model is optimized for small objects at far distance and with occlusion. Rules of object relationship are used for recognizing scenes in applications such as city surveillance: traffic jams, road construction, waiting for a bus, etc. Our activities are being extended to recognize human behavior and scenes, such as grasping the game situation from players' movements for sports applications and detecting suspicious behavior in a crowd for security applications. Various sample movies are introduced in this talk.

8A-3 (Time: 14:05 - 14:30)
Title(Designers' Forum) Spatial and battery sensing solutions for smart cities leading to 2020
AuthorHiroyuki Tsujikawa (Panasonic Corporation, Japan)
AbstractTowards the year of 2020, the deployment of infrastructure services based on IoT has been accelerating in the making of smart cities. Especially for mobility, robots and drones are being widely commercialized. These hi-tech products require highly accurate spatial sensing and efficient battery management to realize security and safety for autonomous driving and autonomous control. This time we are introducing examples of next generation sensing solutions that incorporate Panasonic's valuable sensor devices and algorithm technology. For spatial recognition solutions, we are proposing sensing technology that facilitates free space detection and obstacle detection. These functions have been developed by adding 3D depth measurement techniques to the high-quality imaging technology that we have cultivated with camera products over the years. And as for battery application solutions, we are introducing model based design for lithium-ion batteries’ deterioration diagnosis and lifetime prediction by using AI based battery state estimation technology.

Session 8B  Embedded Software for Parallel Architecture
Time: 13:15 - 14:30 Thursday, January 24, 2019
Location: Room Uranus
Chairs: Zhaoyan Shen (Shandong University), Weichen Liu (Nanyang Technological University)

8B-1 (Time: 13:15 - 13:40)
TitleEfficient Sporadic Task Handling in Parallel AUTOSAR Applications Using Runnable Migration
Author*Milan Copic, Rainer Leupers, Gerd Ascheid (RWTH Aachen University, Germany)
Pagepp. 603 - 608
KeywordAUTOSAR, Multi-core software optimization, Event-triggered tasks, Legacy code
AbstractAutomotive software has become immensely complex. To manage this complexity, a safety-critical application is commonly written respecting the AUTOSAR standard and deployed on a multi-core ECU. However, parallelization of an AUTOSAR task is hindered by data dependencies between runnables, the smallest code-fragments executed by the run-time system. Consequently, a substantial number of idle intervals is introduced. We propose to utilize such intervals in sporadic tasks by migrating runnables that were originally scheduled to execute in the scope of periodic tasks.

8B-2 (Time: 13:40 - 14:05)
TitleA Heuristic for Multi Objective Software Application Mappings on Heterogeneous MPSoCs
Author*Gereon Onnebrink, Ahmed Hallawa, Rainer Leupers, Gerd Ascheid (RWTH Aachen University, Germany), Awaid-Ud-Din Shaheen (Silexica GmbH, Germany)
Pagepp. 609 - 614
KeywordMPSoC, SW mapping, EMOA
AbstractEfficient development of parallel software is one of the biggest hurdles to exploit the advantages of heterogeneous multi-core architectures. Fast and accurate compiler technology is required for determining the trade-off between multiple objectives, such as power and performance. To tackle this problem, the paper at hand proposes the novel heuristic TONPET. Furthermore, it is integrated into the MAPS framework for a detailed evaluation and an applicability study. TONPET is tested against representative benchmarks on three different platforms and compared to a state-of-the-art Evolutionary Multi Objective Algorithm (EMOA). On average, TONPET produces 6% better Pareto fronts, while being 18x faster in the worst case.

8B-3 (Time: 14:05 - 14:30)
TitleReRAM-based Processing-in-Memory Architecture for Blockchain Platforms
Author*Fang Wang (The Hong Kong Polytechnic University, Hong Kong), Zhaoyan Shen (Shandong University, China), Lei Han (The Hong Kong Polytechnic University, Hong Kong), Zili Shao (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 615 - 620
KeywordProcessing-in-memory, ReRAM, Blockchain, Parallelism
AbstractBlockchain’s decentralized and consensus mechanism has attracted lots of applications, such as IoT devices. Blockchain maintains a linked list of blocks and grows by mining new blocks. However, the Blockchain mining consumes huge computation resource and energy, which is unacceptable for resource-limited embedded devices. This paper for the first time presents a ReRAM-based processing-in-memory architecture for Blockchain mining, called Re-Mining. Re-Mining includes a message schedule module and a SHA computation module. The modules are composed of several basic ReRAM-based logic operations units, such as ROR, RSF and XOR. Re-Mining further designs intra-transaction and inter-transaction parallel mechanisms to accelerate the Blockchain mining. Simulation results show that the proposed Re-Mining architecture outperforms CPU-based and GPU-based implementations significantly.

Session 8C  Machine Learning and Hardware Security
Time: 13:15 - 14:30 Thursday, January 24, 2019
Location: Room Venus
Chair: Hiromitsu Awano (Osaka University)

8C-1 (Time: 13:15 - 13:40)
TitleTowards Practical Homomorphic Email Filtering: A Hardware-Accelerated Secure Naive Bayesian Filter
Author*Song Bian, Masayuki Hiromoto, Takashi Sato (Kyoto University, Japan)
Pagepp. 621 - 626
Keywordhomomorphic encryption, secure naive bayesian filter, secure email classification, domain-specific accelerator
AbstractA secure version of the naive Bayesian filter (NBF) is proposed utilizing partially homomorphic encryption (PHE) scheme. SNBF can be implemented with only the additive homomorphism from the Paillier system, and we derive new techniques to reduce the computational cost of PHE-based SNBF. In the experiment, we implemented SNBF both in software and hardware. Compared to the best existing PHE scheme, we achieved 1,200x and 398,840x runtime reduction for CPU and ASIC implementations, respectively, with additional 1,919x power reduction on the designated hardware multiplier. Our hardware implementation is able to classify an average-length email in 0.5s, making it one of the most practical NBF schemes to date.

8C-2 (Time: 13:40 - 14:05)
TitleA 0.16pJ/bit Recurrent Neural Network Based PUF for Enhanced Machine Learning Attack Resistance
Author*Nimesh Kirit Shah (Nanyang Technological University, Singapore), Manaar Alam (Indian Institute of Technology Kharagpur, India), Durga Prasad Sahoo (Robert Bosch Engineering and Business Solutions Private Limited, India), Debdeep Mukhopadhyay (Indian Institute of Technology Kharagpur, India), Arindam Basu (Nanyang Technological University, Singapore)
Pagepp. 627 - 632
KeywordPUF, IoT, Machine Learning, Recurrent, Neural Networks
AbstractPhysically Unclonable Function (PUF) circuits are finding widespread use due to increasing adoption of IoT devices. However, the existing strong PUFs such as Arbiter PUFs (APUF) and its compositions are susceptible to machine learning (ML) attacks because the challenge-response pairs have a linear relationship. In this paper, we present a Recurrent-Neural-Network PUF (RNN-PUF) which uses a combination of feedback and XOR function to significantly improve resistance to ML attack, without significant reduction in the reliability. ML attack is also partly reduced by using a shared comparator with offset-cancellation to remove bias and save power. From simulation results, we obtain ML attack accuracy of 62% for different ML algorithms, while reliability stays above 93%. This represents a 33.5% improvement in our Figure-of-Merit. Power consumption is estimated to be 12.3uW with energy/bit of ~ 0.16pJ.

8C-3 (Time: 14:05 - 14:30)
TitleP3M: A PIM-based Neural Network Model Protection Scheme for Deep Learning Accelerator
Author*Wen Li, Ying Wang, Huawei Li, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 633 - 638
KeywordPUF, PIM, Deep learning, Edge Computing, Security and Privacy
AbstractThis work is oriented at the edge computing scenario that terminal deep learning accelerators use pre-trained neural network models distributed from third-party providers (e.g. from data center clouds) to process the private data instead of sending it to the cloud. In this scenario, the network model is exposed to the risk of being attacked in the unverified devices if the parameters and hyper-parameters are transmitted and processed in an unencrypted way. Our work tackles this security problem by using on-chip memory Physical Unclonable Functions (PUFs) and Processing-In-Memory (PIM). We allow the model execution only on authorized devices and protect the model from white-box attacks, black-box attacks and model tampering attacks. The proposed PUFs-and-PIM based Protection method for neural Models (P3M), can utilize unstable PUFs to protect the neural models in edge deep learning accelerators with negligible performance and energy overhead. The experimental results show considerable performance improvement over two state-of-the-art solutions we evaluated.

Session 8D  Memory Architecture for Efficient Neural Network Computing
Time: 13:15 - 14:30 Thursday, January 24, 2019
Location: Room Mars+Room Mercury
Chairs: Jongeun Lee (UNIST, Republic of Korea), Bei Yu (Chinese University of Hong Kong, Hong Kong)

8D-1 (Time: 13:15 - 13:40)
TitleLearning the Sparsity for ReRAM: Mapping and Pruning Sparse Neural Network for ReRAM based Accelerator
Author*Jilan Lin (UCSB/Tsinghua University, China), Zhenhua Zhu, Yu Wang (Tsinghua University, China), Yuan Xie (UCSB, U.S.A.)
Pagepp. 639 - 644
KeywordReRAM, Sparse Neural Network, Network Compression
AbstractWith the in-memory processing ability, ReRAM based computing gets more and more attractive for accelerating neural networks (NN). However, most ReRAM based accelerators cannot support efficient mapping for sparse NN, and we need to map the whole dense matrix onto the ReRAM crossbar array to achieve O(1) computation complexity. In this paper, we propose a sparse NN mapping scheme based on element clustering to achieve better ReRAM crossbar utilization. Further, we propose a crossbar-grained pruning algorithm to reduce the crossbars with low utilization. Finally, since most current ReRAM devices cannot achieve high precision, we analyze the effect of quantization precision for sparse NN, and propose to complete high-precision composing in the analog field and design related periphery circuits. In our experiments, we discuss how the system performs with different crossbar sizes to choose the optimized design. Our results show that compared with those accelerators for dense NN, our mapping scheme for sparse NN with the proposed pruning algorithm achieves 3-5x energy efficiency and more than 2.5x speedup. Also, the accuracy experiments show that our pruning appears to have almost no accuracy loss.

8D-2 (Time: 13:40 - 14:05)
TitleIn-Memory Batch-Normalization for Resistive Memory based Binary Neural Network Hardware
Author*Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim (Pohang University of Science and Technology, Republic of Korea)
Pagepp. 645 - 650
Keywordbinary neural network, resistive random access memory, vector-matrix multiplication, deep neural network accelerator
AbstractBinary Neural Network (BNN) has a great potential to be implemented on Resistive memory Crossbar Array (RCA)-based hardware accelerators because it requires only 1-bit precision for weights and activations. While general structures to implement convolution or fully-connected layers in RCA-based BNN hardware were actively studied in previous works, Batch-Normalization (BN) layer, which is another key layer of BNN, has not been discussed in depth yet. In this work, we propose in-memory batch-normalization schemes which integrate BN layers on RCA so that area/energy-efficiency of the BNN accelerators can be maximized. In addition, we also show that sense amplifier error due to device mismatch can be suppressed using the proposed in-memory BN design.

8D-3 (Time: 14:05 - 14:30)
TitleExclusive On-Chip Memory Architecture for Energy-Efficient Deep Learning Acceleration
Author*Hyeonuk Sim (UNIST, Republic of Korea), Jason Anderson (University of Toronto, Canada), Jongeun Lee (UNIST, Republic of Korea)
Pagepp. 651 - 656
KeywordDeep learning
AbstractState-of-the-art deep neural networks (DNNs) require hundreds of millions of multiply-accumulate (MAC) computations to perform inference, e.g. in image-recognition tasks. To improve the performance and energy efficiency, deep learning accelerators have been proposed, realized both on FPGAs and as custom ASICs. Generally, such accelerators comprise many parallel processing elements, capable of executing large numbers of concurrent MAC operations. From the energy perspective, however, most consumption arises due to memory accesses, both to off-chip external memory, and on-chip buffers. In this paper, we propose an on-chip DNN co-processor architecture where minimizing memory accesses is the primary design objective. To the maximum possible extent, off-chip memory accesses are eliminated, providing lowest-possible energy consumption for inference. Compared to a state-of-the-art ASIC, our architecture requires 36% fewer external memory accesses and 53% less energy consumption for low-latency image classification.

Session 9A  (DF-4) Beyond the Virtual Reality World
Time: 14:50 - 16:05 Thursday, January 24, 2019
Location: Room Saturn
Organizer: Hiroe Iwasaki (NTT, Japan), Organizer/Chair: Masaru Kokubo (Hitachi, Japan)

9A-1 (Time: 14:50 - 15:15)
Title(Designers' Forum) The World of VR2.0
AuthorMichitaka Hirose (The University of Tokyo, Japan)
AbstractRecently, VR technology has been attracting wide interest from society. In my talk, recent topics in VR research and development are introduced. The social impacts of this technology are also discussed from various points of view.

9A-2 (Time: 15:15 - 15:40)
Title(Designers' Forum) Optical fiber scanning system for ultra-lightweight wearable display
AuthorYoshio Seo (Hitachi, Japan)
AbstractAn optical fiber scanning system is a laser beam steering device that changes the laser's traveling direction by displacing the tip of an optical fiber. Because of its small size, such a device is suitable for embedding into small wearable displays such as smart glasses. The conventional fiber scanning system draws with a spiral trajectory of rotational displacement with the same vertical and horizontal vibration frequencies. This has several issues: the resolution depends only on the vibration frequency, a bright spot occurs at the center of the drawing area, and the drawing area becomes circular. In this study, we developed (1) Oval scanning, (2) Cross scanning, and (3) Cross-Limit scanning as novel scanning control systems using overlapping oval trajectories of different shapes. Through experiments with an actual machine, we confirmed that the bright spots are moved to the outside of the drawing area and that the shape of the drawing area is closer to a rectangle. Further, in these systems, it is possible to increase the resolution with a higher laser modulation frequency. A scanning fiber could thus improve image quality with these novel controls.

9A-3 (Time: 15:40 - 16:05)
Title(Designers' Forum) Superreal Video Representation for Enhanced Sports Experiences - R&D on VR x AI technologies –
AuthorHideaki Kimata (NTT, Japan)
AbstractThe impressions received from paintings and from pictures are different. Where does this difference come from? There is a view that both are produced results, composition included, and that only the manufacturing process is different. Each contains something that the producer wants to convey. How far can the producer's intention be put into the work? Should we make it with the composition that the viewer wants to see? There may not be a logical answer to these questions. Besides pursuing the answers to the questions mentioned above, examples of research results in superreal video presentation are shown. We believe that superreal video expression is effective for conveying the intention of the producer and is also useful for viewers in some scenes. In this article, we introduce examples of using VR (virtual reality) technology for training athletes and examples of video experiences in sports watching and events. Among them, we show scenes in which magnified expression, such as "emphasizing" a part of the scene, is effective, and discuss it with reference to experimental data. Considering that VR is a medium that allows the viewer a certain degree of freedom while conveying the intention of the producer, "emphasizing" a part is one method of conveying the producer's intention in VR.

Session 9B  Logic-Level Security and Synthesis
Time: 14:50 - 16:05 Thursday, January 24, 2019
Location: Room Uranus
Chair: Kenshu Seto (Tokyo City University)

9B-1 (Time: 14:50 - 15:15)
TitleBeSAT: Behavioral SAT-based Attack on Cyclic Logic Encryption
AuthorYuanqi Shen, You Li, Amin Rezaei, Shuyu Kong, David Dlott, *Hai Zhou (Northwestern University, U.S.A.)
Pagepp. 657 - 662
KeywordCyclic Logic Encryption, Structural CycSAT, Logic Locking, SAT-based Attack, Behavioral Analysis
AbstractCyclic logic encryption is newly proposed in the area of hardware security. It introduces feedback cycles into the circuit to defeat existing logic decryption techniques. To ensure that the circuit is acyclic under the correct key, CycSAT is developed to add the acyclic condition as a CNF formula to the SAT-based attack. However, we found that it is impossible to capture all cycles in any graph with any set of feedback signals as done in the CycSAT algorithm. In this paper, we propose a behavioral SAT-based attack called BeSAT. BeSAT observes the behavior of the encrypted circuit on top of the structural analysis, so the stateful and oscillatory keys missed by CycSAT can still be blocked. The experimental results show that BeSAT successfully overcomes the drawback of CycSAT.

9B-2 (Time: 15:15 - 15:40)
TitleStructural Rewriting in XOR-Majority Graphs
Author*Zhufei Chu (Ningbo University, China), Mathias Soeken (EPFL, Switzerland), Yinshui Xia, Lunyao Wang (Ningbo University, China), Giovanni De Micheli (EPFL, Switzerland)
Pagepp. 663 - 668
Keywordlogic synthesis, logic representations, xor-majority graphs, quantum computing, optimization
AbstractIn this paper, we present a structural rewriting method for a recently proposed XOR-Majority graph (XMG), which has exclusive-OR (XOR), majority-of-three (MAJ), and inverters as primitives. XMGs are an extension of Majority-Inverter Graphs (MIGs). Previous work presented an axiomatic system, Ω, and its derived transformation rules for manipulation of MIGs. By additionally introducing the XOR primitive, the identities of MAJ-XOR operations should be exploited to enable powerful logic rewriting in XMGs. We first propose two MAJ-XOR identities and exploit their potential optimization opportunities during structural rewriting. Then, we discuss the rewriting rules that can be used for different operations. Finally, we also address the structural XOR detection problem in MIGs. The experimental results on EPFL benchmark suites show that the proposed method can optimize the size of XMGs and their mapped look-up tables (LUTs), which in turn benefits quantum circuit synthesis that uses XMGs as the underlying logic representation.

9B-3 (Time: 15:40 - 16:05)
TitleDesign Automation for Adiabatic Circuits
Author*Alwin Zulehner (Johannes Kepler University Linz, Austria), Michael Frank (Sandia National Laboratories, U.S.A.), Robert Wille (Johannes Kepler University Linz, Austria)
Pagepp. 669 - 674
KeywordAdiabatic Circuits, 2LAL, Technology Mapping, AIG
AbstractAdiabatic circuits are heavily investigated since they allow for computations with an asymptotically close to zero energy dissipation per operation - serving as an alternative technology for many scenarios where energy efficiency is preferred over fast execution. Their concepts are motivated by the fact that the information lost from conventional circuits results in an entropy increase which causes energy dissipation. To overcome this issue, computations are performed in a (conditionally) reversible fashion which, additionally, have to satisfy switching rules that are different from conventional circuitry - crying out for dedicated design automation solutions. While previous approaches either focus on their electrical realization (resulting in small, hand-crafted circuits only) or on designing fully reversible building blocks (an unnecessary overhead), this work aims for providing an automatic and dedicated design scheme that explicitly takes the recent findings in this domain into account. To this end, we review the theoretical and technical background of adiabatic circuits and present automated methods that dedicatedly realize the desired function as an adiabatic circuit. The resulting methods are further optimized—leading to an automatic and efficient design automation for this promising technology. Evaluations confirm the benefits and applicability of the proposed solution.

Session 9C  Analysis and Algorithms for Digital Design Verification
Time: 14:50 - 16:05 Thursday, January 24, 2019
Location: Room Venus
Chairs: Mark Po-Hung Lin (National Chung Cheng University, Taiwan), Andreas G. Veneris (University of Toronto, Canada)

Best Paper Candidate
9C-1 (Time: 14:50 - 15:15)
TitleA Figure of Merit for Assertions in Verification
AuthorSamuel Hertz, *Debjit Pal, Spencer Offenberger, Shobha Vasudevan (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 675 - 680
KeywordAssertion Metric
AbstractAssertion quality is critical to the confidence and claims in a design's verification. In current practice, there is no metric to evaluate assertions. We introduce a methodology to rank register transfer level (RTL) assertions. We define assertion importance and assertion complexity and present efficient algorithms to compute them. Our method ranks each assertion according to its importance and complexity. We demonstrate the effectiveness of our ranking for pre-silicon verification on a detailed case study. For completeness, we study the relevance of our highly ranked assertions in a post-silicon validation context, using traced and restored signal values from the design's netlist.

9C-2 (Time: 15:15 - 15:40)
TitleSuspect2vec: A Suspect Prediction Model for Directed RTL Debugging
Author*Neil Veira, Zissis Poulos, Andreas Veneris (University of Toronto, Canada)
Pagepp. 681 - 686
KeywordDebugging, SAT, machine learning
AbstractAutomated debugging tools based on Boolean Satisfiability (SAT) have greatly alleviated the time and effort required to diagnose and rectify a failing design. Practical experience shows that long–running debugging instances can often be resolved faster using partial results that are available before the SAT solver completes its search. In such cases it is preferable for the tool to maximize the number of suspects it returns during the early stages of its deployment. To capitalize on this observation, this paper proposes a directed SAT–based debugging algorithm which prioritizes examining design locations that are more likely to be suspects. This prioritization is determined by suspect2vec — a model which learns from historical debug data to predict the suspect locations that will be found. Experiments show that this algorithm is expected to find 16% more suspects than the baseline algorithm if terminated prematurely, while still retaining the ability to find all suspects if executed to completion. Key to its performance and a contribution of this work is the accuracy of the suspect prediction model. This is because incorrect predictions introduce overhead in exploring parts of the search space where few or no solutions exist. Suspect2vec is experimentally demonstrated to outperform existing suspect prediction methods by an average accuracy of 5-20%.

9C-3 (Time: 15:40 - 16:05)
TitlePath Controllability Analysis for High Quality Designs
Author*Li-Jie Chen (Department of Electrical Engineering, National Taiwan University, Taiwan), Hong-Zu Chou, Kai-Hui Chang (Avery Design Systems, U.S.A.), Sy-Yen Kuo (Department of Electrical Engineering, National Taiwan University, Taiwan), Chilai Huang (Avery Design Systems, U.S.A.)
Pagepp. 687 - 692
KeywordPath Controllability, At-Speed Test, CDTG, X-Analysis
AbstractGiven a design variable and its fanin cone, determining whether one fanin variable has controlling power over other fanin variables can benefit many design steps such as verification, synthesis and test generation. In this work we formulate this path controllability problem and propose several algorithms that not only solve this problem but also return values that enable or block other fanin variables. Empirical results show that our algorithms can effectively perform path controllability analysis and help produce high-quality designs.

Session 9D  FPGA and Optics-Based Neural Network Designs
Time: 14:50 - 16:05 Thursday, January 24, 2019
Location: Room Mars+Room Mercury
Chair: Taewhan Kim (Seoul National University, Republic of Korea)

9D-1 (Time: 14:50 - 15:15)
TitleImplementing Neural Machine Translation with Bi-Directional GRU and Attention Mechanism on FPGAs Using HLS
AuthorQin Li, *Xiaofan Zhang (University of Illinois at Urbana-Champaign, U.S.A.), Jinjun Xiong (IBM Thomas J. Watson Research Center, U.S.A.), Wen-mei Hwu, Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 693 - 698
KeywordFPGA, Neural Machine Translator, GRU, HLS
AbstractNeural machine translation (NMT) is a popular topic in Natural Language Processing which uses deep neural networks (DNNs) for translation from source to targeted languages. With the emerging technologies, such as bidirectional Gated Recurrent Units (GRU), attention mechanisms, and beam-search algorithms, NMT can deliver improved translation quality compared to the conventional statistics-based methods, especially for translating long sentences. However, higher translation quality means more complicated models, higher computation/memory demands, and slower translation time, which raises difficulties for practical use. In this paper, we propose a design methodology for implementing the inference of a real-life NMT (with the problem size = 172 GFLOP) on FPGA for improved run time latency and energy efficiency. We use High-Level Synthesis (HLS) to build high-performance parameterized IPs for handling the most basic operations (multiply-accumulations) and construct these IPs to accelerate the matrix-vector multiplication (MVM) kernels, which are frequently used in NMT. Also, we launch a design space exploration by considering both computation resource and memory access bandwidth to utilize the hardware parallelism opportunities existing in the model and generate the best parameter configurations of the proposed IPs. Accordingly, we generate a hybrid parallel structure for accelerating the NMT with affordable resource overhead for the targeted FPGA. Our design is demonstrated on a Xilinx VCU118 with overall performance at 7.16 GFLOPS.

9D-2 (Time: 15:15 - 15:40)
TitleEfficient FPGA Implementation of Local Binary Convolutional Neural Network
AuthorAidyn Zhakatayev, *Jongeun Lee (Ulsan National Institute of Science and Technology, Republic of Korea)
Pagepp. 699 - 704
KeywordArtificial Neural Networks (ANNs), Deep Convolutional Neural Networks (DCNNs)
AbstractBinarized Neural Networks (BNN) have shown a capability of performing various classification tasks while taking advantage of computational simplicity and memory saving. The problem with BNN, however, is a low accuracy on large convolutional neural networks (CNN). Local Binary Convolutional Neural Network (LBCNN) compensates for the accuracy loss of BNN by using a standard convolutional layer together with a binary convolutional layer and can achieve as high accuracy as standard AlexNet CNN. For the first time we propose an FPGA hardware design architecture of LBCNN and address its unique challenges. We present a performance and resource usage predictor along with a design space exploration framework. Our architecture on LBCNN AlexNet shows 76.6% higher performance in terms of GOPS, 2.6X and 2.7X higher performance density in terms of GOPS/Slice, and GOPS/DSP compared to a previous FPGA implementation of standard AlexNet CNN.

9D-3 (Time: 15:40 - 16:05)
TitleHardware-software Co-design of Slimmed Optical Neural Networks
Author*Zheng Zhao, Derong Liu, Meng Li, Zhoufeng Ying, Biying Xu (University of Texas at Austin, U.S.A.), Lu Zhang, Bei Yu (The Chinese University of Hong Kong, Hong Kong), Ray T. Chen, David Z. Pan (University of Texas at Austin, U.S.A.)
Pagepp. 705 - 710
KeywordPhotonic integrated circuit, Neural Network, Artificial Intelligence
AbstractOptical neural network (ONN) is a neuromorphic computing hardware based on optical components. Since its first on-chip experimental demonstration, it has attracted more and more research interests due to the advantages of ultra-high speed inference with low power consumption. In this work, we design a novel slimmed architecture for realizing optical neural network considering both its software and hardware implementations. Different from the originally proposed ONN architecture based on singular value decomposition which results in two implementation-expensive unitary matrices, we show a more area-efficient architecture which uses a sparse tree network block, a single unitary block and a diagonal block for each neural network layer. In the experiments, we demonstrate that by leveraging the training engine, we are able to find a comparable accuracy to that of the previous architecture, which brings about the flexibility of using the slimmed implementation. The area cost in terms of the Mach-Zehnder interferometers, the core optical components of ONN, is 15%-38% less for various sizes of optical neural networks.

Session 10A  (SS-5) The Resurgence of Reconfigurable Computing in the Post Moore Era
Time: 16:25 - 17:40 Thursday, January 24, 2019
Location: Room Saturn
Chair: Antonino Tumeo (Pacific Northwest National Laboratory, U.S.A.)

10A-1 (Time: 16:25 - 16:50)
Title(Invited Paper) Software Defined Architectures for Data Analytics
Author*Vito Giovanni Castellana, Marco Minutoli, Antonino Tumeo (Pacific Northwest National Laboratory, U.S.A.), Pietro Fezzardi, Marco Lattuada, Fabrizio Ferrandi (Politecnico di Milano, Italy)
Pagepp. 711 - 718
KeywordSoftware Defined Architectures, Data Analytics, FPGA, CGRA
AbstractData analytics applications are increasingly complex workflows composed of phases with very different program behaviors (e.g., graph algorithms and machine learning, algorithms operating on sparse and dense data structures, etc.). To reach the levels of efficiency required to process these workflows in real time, upcoming architectures will need to leverage even more workload specialization. If, at one end, we may find even more heterogeneous processors composed of a myriad of specialized processing elements, at the other end we may see novel reconfigurable architectures, composed of sets of functional units and memories interconnected with (re)configurable on-chip networks, able to adapt dynamically to the workload characteristics. Field Programmable Gate Arrays are increasingly used for accelerating various workloads and, in particular, inferencing in machine learning, providing higher efficiency than other solutions. However, their fine-grained nature still leads to issues for the design software and still makes dynamic reconfiguration impractical. Future, more coarse-grained architectures could offer the features to execute diverse workloads at high efficiency while providing better reconfiguration mechanisms for dynamic adaptability. Nevertheless, we argue that the challenges for reconfigurable computing remain in the software. In this position paper, we describe a possible toolchain for reconfigurable architectures targeted at data analytics.

10A-2 (Time: 16:50 - 17:15)
Title(Invited Paper) Runtime Reconfigurable Memory Hierarchy in Embedded Scalable Platforms
AuthorDavide Giri, Paolo Mantovani, *Luca P. Carloni (Columbia University, U.S.A.)
Pagepp. 719 - 726
Keywordheterogeneous system-on-chip, hardware accelerators, cache coherence, FPGA prototyping
AbstractIn heterogeneous systems-on-chip, the optimal choice of the cache-coherence model for a loosely-coupled accelerator may vary at each invocation, depending on workload and system status. We propose a runtime adaptive algorithm to manage the coherence of accelerators. The algorithm's choices are based on the combination of static and dynamic features of the active accelerators and their workloads. We evaluate the algorithm by leveraging our FPGA-based platform for rapid SoC prototyping. Experimental results, obtained through the deployment of a multi-core and multi-accelerator system that runs Linux SMP, show the benefits of our approach in terms of execution time and memory accesses.

10A-3 (Time: 17:15 - 17:40)
Title(Invited Paper) XPPE: Cross-Platform Performance Estimation of Hardware Accelerators Using Machine Learning
Author*Hosein Mohammadi Makrani, Hossein Sayadi, Tinoosh Mohsenin, Avesta Sasan, Houman Homayoun, Setareh Rafatirad (George Mason University, U.S.A.)
Pagepp. 727 - 732
Keyworddesign space exploration, machine learning, performance estimation, accelerator
AbstractWe propose XPPE, a neural-network-based cross-platform performance estimation. XPPE utilizes the resource utilization of an application on a specific FPGA to estimate the performance on other FPGAs. Moreover, XPPE enables developers to explore the design space without having to fully implement and map the application. Our evaluation results show that the correlation between the speedup estimated using XPPE and the actual speedup of applications on a hybrid platform over an ARM processor is more than 0.98.

Session 10B  Hardware Acceleration
Time: 16:25 - 17:40 Thursday, January 24, 2019
Location: Room Uranus
Chairs: Yongpan Liu (Tsinghua University, China), Xiang Chen (George Mason University, U.S.A.)

10B-1 (Time: 16:25 - 16:50)
TitleAddressing the Issue of Processing Element Under-Utilization in General-Purpose Systolic Deep Learning Accelerators
AuthorBosheng Liu (University of Chinese Academy of Sciences/Institute of Computing Technology, Chinese Academy of Sciences, China), *Xiaoming Chen, Ying Wang, Yinhe Han (Institute of Computing Technology, Chinese Academy of Sciences, China), Jiajun Li, Haobo Xu (University of Chinese Academy of Sciences, China), Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 733 - 738
Keywordutilization of processing elements, systolic accelerator, deep learning, energy efficiency, embedded systems
AbstractThis study proposes a novel systolic deep neural network (DNN) accelerator with a flexible computation mapping and dataflow scheme to address the under-utilization problem of processing elements (PEs) in systolic accelerators. By providing three types of parallelism (channel-direction mapping, planar mapping, and hybrid) and dynamically switching among them, our accelerator offers the adaptability to match various DNN models to the fixed PE resources, and thus enables flexible exploitation of the PE provision for DNN models to achieve optimal performance and energy efficiency.

10B-2 (Time: 16:50 - 17:15)
TitleALook: Adaptive Lookup for GPGPU Acceleration
Author*Daniel Peroni, Mohsen Imani, Tajana Rosing (University of California, San Diego, U.S.A.)
Pagepp. 739 - 746
KeywordApproximate Computing, GPU
AbstractAssociative memory in the form of a look-up table can decrease the energy consumption of GPGPU applications by exploiting data locality and reducing the number of redundant computations. Proposed architectures place associative memories beside the GPU cores to reduce recomputed operations, but these approaches use static look-up tables. Static designs lack the ability to adapt to applications at runtime, limiting them to small segments of code with high redundancy. These approaches are tied to the GPU pipeline and do not speed up applications. In this paper, we propose an adaptive look-up based approach, called Alook, which uses a dynamic update policy to maintain a set of recently used operations in associative memory. Alook updates the associative memory values of each floating point unit at runtime to adapt to the workload. We test the efficiency of Alook on image processing, general purpose, and machine learning applications by integrating it beside FPUs in an AMD Southern Islands GPU. Our evaluation shows that Alook provides 3.6x EDP (Energy Delay Product) improvement and 32.8% performance speedup, compared to an unmodified GPU, for applications accepting less than 5% output error. The proposed Alook architecture improves the GPU performance by 2.0x as compared to state-of-the-art computational reuse methods.

10B-3 (Time: 17:15 - 17:40)
TitleCollaborative Accelerators for In-Memory MapReduce on Scale-up Machines
Author*Abraham Addisie, Valeria Bertacco (University of Michigan, U.S.A.)
Pagepp. 747 - 753
Keywordcollaborative accelerators, in-memory mapreduce, chip-multiprocessor
AbstractRelying on efficient data analytics platforms is becoming increasingly crucial for both small and large scale datasets. While MapReduce implementations, such as Hadoop and Spark, were originally proposed for petascale processing in scale-out clusters, it has been noted that most data center processes today operate on gigabyte-order or smaller datasets, which are best processed in single high-end scale-up machines. In this context, Phoenix++ is a highly optimized MapReduce framework available for chip-multiprocessor (CMP) scale-up machines. In this paper we observe that Phoenix++ suffers from an inefficient utilization of the memory subsystem, and a serialized execution of the MapReduce stages. To overcome these inefficiencies, we propose CASM, an architecture that equips each core in a CMP design with a dedicated instance of a specialized hardware unit (the CASM accelerators). These units collaborate to manage the key-value data structure and minimize both on- and off-chip communication costs. Our experimental evaluation on a 64-core design indicates that CASM provides more than a 4x speedup over the highly optimized Phoenix++ framework, while keeping area overhead at only 6%, and reducing energy demands by over 3.5x.

Session 10C  Routing
Time: 16:25 - 17:40 Thursday, January 24, 2019
Location: Room Venus
Chairs: Hung-Ming Chen (NCTU), Kohira Yukihide (University of Aizu)

10C-1 (Time: 16:25 - 16:50)
TitleDetailed Routing by Sparse Grid Graph and Minimum-Area-Captured Path Search
Author*Gengjie Chen, Chak-Wa Pui, Haocheng Li, Jingsong Chen, Bentian Jiang, Evangeline F. Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 754 - 760
KeywordDetailed routing, Grid graph, Maze routing, Parallelism
AbstractDifferent from global routing, detailed routing takes care of many detailed design rules and is performed on a significantly larger routing grid graph. In advanced technology nodes, it becomes the most complicated and time-consuming stage. We propose Dr. CU, an efficient and effective detailed router, to tackle these challenges. To handle a 3D detailed routing grid graph of enormous size, a set of two-level sparse data structures is designed for runtime and memory efficiency. For handling the minimum-area constraint, an optimal correct-by-construction path search algorithm is proposed. Besides, an efficient bulk synchronous parallel scheme is adopted to further reduce the runtime. Compared with the first-place router of the ISPD 2018 Contest, our router improves the routing quality by up to 65% and on average 39%, according to the contest metric. At the same time, it achieves 80-93% memory reduction and 2.5-15× speed-up.

10C-3 (Time: 16:50 - 17:15)
TitleLatency Constraint Guided Buffer Sizing and Layer Assignment for Clock Trees with Useful Skew
Author*Necati Uysal (University of Central Florida, U.S.A.), Wen-Hao Liu (Cadence Design Systems, U.S.A.), Rickard Ewetz (University of Central Florida, U.S.A.)
Pagepp. 761 - 766
KeywordUseful Skew, Clock Tree Optimization, Buffer sizing, Layer assignment, Power
AbstractClosing timing using clock tree optimization (CTO) is a tremendously challenging problem that may require designer intervention. CTO is performed by specifying and realizing delay adjustments in an initially constructed clock tree. Delay adjustments are typically realized by inserting delay buffers or detour wires. In this paper, we propose a latency constraint guided buffer sizing and layer assignment framework for clock trees with useful skew, called the BLU framework. The BLU framework realizes delay adjustments during CTO by performing buffer sizing and layer assignment. Given an initial clock tree, the BLU framework first predicts the final timing quality and specifies a set of delay adjustments, which are translated into latency constraints. Next, buffer sizing and layer assignment are performed with respect to the latency constraints using an extension of van Ginneken's algorithm. Moreover, the framework includes a feature for reducing the power consumption by relaxing the latency constraints and a method for improving the timing performance by tightening the latency constraints. The experimental results demonstrate that the proposed framework is capable of reducing the capacitive cost by 13% on average. The total negative slack (TNS) and worst negative slack (WNS) are reduced by up to 58% and 20%, respectively.