Monday, March 29, 2021 |
Title | (Invited Talk) The Status and Potential of Quantum Computers: From Quantum Computational Supremacy to Fault-Tolerant Quantum Computer |
Author | Keisuke Fujii (Osaka University, Japan) |
Page | p. 1 |
Abstract | Supported by extensive experimental efforts for the realization of quantum computing devices, a quantum computer of a hundred qubits is now within reach. This level of a quantum computer is not enough for fully-fledged fault-tolerant quantum computing, which is indispensable for large-scale quantum computing supporting theoretically proven exponential computational speedup. However, state-of-the-art quantum computers, called noisy intermediate-scale quantum (NISQ) devices, are still thought to have a computational advantage over classical computers for certain tasks. In this talk, I will provide an overview of the NISQ devices and their applications for quantum simulation and machine learning. In addition, I will show the prospects and challenges for the realization of a fault-tolerant quantum computer in the long term. |
Title | Minimum Energy Point Tracking over a Wide Operating Performance Region |
Author | *Shoya Sonoda, Jun Shiomi, Hidetoshi Onodera (Graduate School of Informatics, Kyoto University, Japan) |
Page | pp. 2 - 7 |
Keyword | minimum energy point, dynamic voltage and frequency scaling, adaptive body biasing |
Abstract | A method for runtime energy optimization based on the supply voltage (Vdd) and the threshold voltage (Vth) scaling is proposed. This paper refers to the optimal voltage pair as a Minimum Energy Point (MEP). The MEP dynamically fluctuates depending on the operating conditions determined by a target delay constraint, an activity factor and a chip temperature. In order to track the MEP, this paper proposes a closed-form continuous function that determines the MEP over a wide operating performance region ranging from the above-threshold region down to the sub-threshold region. Based on the MEP determination formula, an MEP tracking algorithm is also proposed. The MEP tracking algorithm estimates the MEP with low-cost monitoring circuits for the operating conditions. Measurement results based on a 32-bit RISC processor fabricated in a 65-nm process technology show that the proposed method estimates the MEP within a 5% energy loss in comparison with the actual MEP operation. |
Title | Hierarchical Overlapped Clustering for Ising-Model Based TSP Solver |
Author | *Riu Shimizu, Song Bian, Takashi Sato (Graduate School of Informatics, Kyoto University, Japan) |
Page | pp. 8 - 13 |
Keyword | Combinatorial optimization problem, Ising model, annealing, clustering |
Abstract | The approximate solution of combinatorial optimization problems using Ising-model based solvers has attracted a lot of attention. In general, TSP is one of the problems that Ising-model based solvers are not good at, due to the complexity of the constraints. There are studies that improve the solution of TSP by partitioning it by clustering. In this paper, we propose a method to overlap the clusters appropriately for TSPs before mapping to the Ising model. The average error of the solution was reduced by 33% with the proposed method. |
Title | A Highly Linear Sensor System Using SEIR for Distortion Evaluation of Sensor Front-end |
Author | *Chia-Wei Pai, Tatsuya Ishikawa, Hiroki Ishikuro (Keio University, Japan) |
Page | pp. 14 - 18 |
Keyword | sensor system, analog-to-digital converter (ADC), stimulus error identification and removal (SEIR), analog front-end (AFE), internet of things (IoT) |
Abstract | This paper presents a highly linear sensor system by applying the stimulus error identification and removal (SEIR) method for distortion evaluation and calibration. The proposed distortion evaluation method estimates the true linearity of the sensor front-end and the analog-to-digital converter (ADC), which are integrated together as an analog front-end, without separately measuring each block. The effectiveness of the proposed distortion evaluation and calibration technique was successfully demonstrated by a prototype illuminance sensor system. |
Title | MENTAI: A Fully Automated CGRA Application Development Environment that Supports Hardware/Software Co-design |
Author | *Ayaka Ohwada, Takuya Kojima, Hideharu Amano (Graduate School of Science and Technology, Keio University, Japan) |
Page | pp. 19 - 24 |
Keyword | CGRA, compiler |
Abstract | Energy-efficient coarse-grained reconfigurable architectures (CGRAs) have been attracting attention as accelerators for IoT devices. CGRAs have several processing units called PEs (Processing Elements) arranged in a two-dimensional array. CGRAs can achieve high energy efficiency by adaptively changing the operation executed for each PE and the connections between PEs. When running an application on CGRAs, it is necessary to convert the target application to a data flow graph (DFG) and map the DFG to the PE array. Currently, there is not enough research on the CGRA application development environment, and it is hard to say that the environment is in place. In this paper, we propose MENTAI, a fully automated application development environment for CGRAs based on LLVM. As a case study, we evaluated it using CCSOTB2, which is a kind of CGRA. Thanks to MENTAI, a compute-intensive part of an application is executed on the CCSOTB2, instead of a host processor. Besides, it considers the energy efficiency of the entire system, including the host processor. Thereby, MENTAI achieves around 60% energy reduction while keeping the execution time. |
Title | Efficient Sample Preparation for Programmable Microfluidic Devices with Defect Valves |
Author | *Shuaijie Ying, Shigeru Yamashita (Ritsumeikan University, Japan) |
Page | pp. 25 - 28 |
Keyword | microfluidic lab-on-a-chip, sample preparation, defect |
Abstract | Microfluidic lab-on-a-chips have emerged as a new technology for implementing biochemical protocols on small-sized portable devices targeting low-cost medical diagnostics. Among various efforts of fabrication of such chips, the programmable microfluidic device (PMD) is a relatively new technology for implementation of flow-based lab-on-a-chips. In PMDs, some valves may not work properly in some cases. In this paper, we propose a novel Constraint Satisfaction Problem (CSP) formulation to find how to mix fluids efficiently when there are some defect valves. |
Title | Power Domain Layer Assignment in Package Substrate Design |
Author | *Yu-Sheng Qin, Xiao-Yu Wang, Yi-Yu Liu (National Taiwan University of Science and Technology, Taiwan) |
Page | pp. 29 - 32 |
Keyword | Layer assignment |
Abstract | With the increasing functional integration in modern integrated circuits, both the chip size and design complexity are inevitably increasing. In addition, the number of power domains is drastically increased owing to the demands of power efficiency in various functional modules. Thus, reliable power delivery has become one critical issue in package substrate design. In this paper, we present the first work to automate power domain layer assignment in package substrate design. First, a seed selection step selects critical power domains with larger area conflict and assigns these critical domains to separate metal layers. After that, the remaining power domains are assigned to proper metal layers taking power pin distribution into account. Experimental results demonstrate the effectiveness of our layer assignment algorithm for the follow-up polygon layout partition. |
Title | A Novel FPGA-Based Convolution Accelerator for Addernet |
Author | *Ke Ma, Lei Mao, Shinji Kimura (Waseda University, Japan) |
Page | pp. 33 - 38 |
Keyword | CNN, FPGA, Accelerator, Addernet |
Abstract | In FPGA-based CNN accelerators, most of the multiplications in convolutions are realized by DSPs. Thus, the number of DSPs limits the parallelism of convolution computation. The recently proposed addernet replaces multiplications in convolution with additions. Based on addernet, we designed a novel PE in which the adders can be reused to perform replaced additions and accumulation. This PE can calculate a 3*3 convolution in 3 clocks using only 9 adders, and can be efficiently constructed on LUT with no DSP. On an Ultra-96 board which has 360 DSPs, we implement an accelerator with 60 proposed 3*3 PEs, and gain a throughput of 2.18 GOPs. |
Title | Evaluation of Multiply Accumulators with Stochastic Numbers for Image Processing |
Author | *Katsuhiro Ichikawa, Shigeru Yamashita (Ritsumeikan University, Japan) |
Page | pp. 39 - 44 |
Keyword | stochastic computing, multiply accumulator operation, image processing |
Abstract | In Stochastic Computing (SC), we can calculate multiplications very efficiently with only a single AND gate. However, the result of an addition operation in SC is so-called 1/2-scaled; addition operations in SC may lose the accuracy in some cases. Thus, a multiply accumulator operation for many numbers in SC may have large errors because it contains many addition operations. Considering this issue, this paper proposes a novel method to realize a multiply accumulator operation for Stochastic Numbers (SNs) without a scaling error. We also report some evaluations of our method when it is used for an image processing application to confirm our method can indeed reduce the errors compared to a conventional multiply accumulator operation in SC. |
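The two conventional operations mentioned in the abstract above can be illustrated with a small software model. This is only a sketch of standard unipolar stochastic computing (AND-gate multiplication and multiplexer-based 1/2-scaled addition), not the authors' proposed scaling-free method. The function names, stream length, and seed are illustrative assumptions.

```python
import random

def to_stream(p, n, rng):
    """Encode a probability p in [0, 1] as a unipolar bitstream of length n."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def value(stream):
    """Decode a bitstream: the fraction of ones approximates the encoded value."""
    return sum(stream) / len(stream)

def sc_multiply(a, b):
    """Bitwise AND of two independent streams approximates the product."""
    return [x & y for x, y in zip(a, b)]

def sc_scaled_add(a, b, select):
    """Multiplexer: each bit comes from a or b; a 0.5 select stream gives (a + b) / 2."""
    return [y if s else x for x, y, s in zip(a, b, select)]

rng = random.Random(0)                         # fixed seed, chosen arbitrarily
n = 100_000                                    # long streams reduce random error
a = to_stream(0.6, n, rng)
b = to_stream(0.5, n, rng)
sel = to_stream(0.5, n, rng)

product = value(sc_multiply(a, b))             # approximates 0.6 * 0.5
half_sum = value(sc_scaled_add(a, b, sel))     # approximates (0.6 + 0.5) / 2
```

The multiplexer output carries the sum divided by two, so chaining many such additions in a multiply-accumulate shrinks the result at every stage. This is the scaling issue the abstract refers to.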
Title | Linear Programming Based Reliable Software Performance Model Construction with Noisy CPU Performance Counter Values |
Author | *Teruaki Tanaka (Mitsubishi Electric Corporation, Japan), Masanori Hashimoto (Osaka University, Japan), Yoshinori Takeuchi (Kindai University, Japan) |
Page | pp. 45 - 50 |
Keyword | Performance Estimation, Software, Estimation Method, Linear Programming |
Abstract | In the development of industrial controller equipment, estimating the execution performance of control software running on an embedded CPU is essential, especially early in the system specification defining stage. On the other hand, the estimation is becoming difficult since the embedded CPU functions have become more complicated, and their function specifications are black-boxed. In this paper, we propose a method to determine the coefficient parameters of a software execution performance model, which is expressed as a linear function of the performance counter values acquired from the software execution result on the target CPU. To account for the uncertainty in the performance counter values and avoid the consequent overfitting, the proposed method formulates the problem of determining the coefficient parameters as a linear programming problem. The prior knowledge on the relation between the software performance and the performance counter values is given as a set of inequalities, which improves the reliability and fidelity of the constructed performance model. Experimental results show that the proposed estimation method using linear programming is also effective as a statistical estimation function, and can obtain reasonable factors of the performance that can lead to appropriate design decisions. The proposed method contributes to improving the validity of software execution performance estimation in the initial stage of system design. |
Title | A K-band High Gain Linearity Mixer with Current-Bleeding and Derivative Superposition Technique |
Author | *Kohki Saito, Ryo Kishida, Tatsuji Matsuura, Akira Hyogo (Tokyo University of Science, Japan) |
Page | pp. 51 - 55 |
Keyword | mixer, k-band, CMOS, linearizer, current-bleeding |
Abstract | This paper presents the design and analysis of high linearity RF mixers in a 65-nm CMOS process for K-band down-conversion receivers. Current-bleeding is useful for high gain mixers. However, the circuit linearity is degraded by the third-order transconductance of the current-bleeding cell. The proposed circuit applies the Derivative Superposition (DS) technique to both the gm-stage and the current-bleeding cell. Circuit simulation results show a conversion gain of 9.7 dB and a Third Order Input Intercept Point (IIP3) of 5.0 dBm. |
Title | Improved Modeling of Energy Consumption of Delivery Drones |
Author | *Satoshi Ito, Shunsuke Negoro, Xiangbo Kong (Ritsumeikan University, Japan), Ittetsu Taniguchi (Osaka University, Japan), Hiroyuki Tomiyama (Ritsumeikan University, Japan) |
Page | pp. 56 - 60 |
Keyword | drone, energy measurement, energy modeling, regression analysis |
Abstract | Drones have become popular in a wide range of fields. Among them, the delivery of packages by drones attracts a lot of attention as a way to reduce traffic congestion and carbon emission. However, drones are powered by storage batteries and are severely limited in energy consumption. For safety, crashes caused by dead batteries must be avoided. Therefore, highly efficient power management technology for drones is one of the most important requirements. In this study, we model the energy consumption of drones with regression analysis. The usefulness of the proposed model is demonstrated by comparison with the existing regression analysis methods. |
Title | A Feasibility Study on Realizing General-purpose Technology Mapper for DSPs of FPGAs Using Exhaustive Search |
Author | *Koyo Shibata, Takashi Imagawa, Hiroyuki Ochi (Ritsumeikan University, Japan) |
Page | pp. 61 - 66 |
Keyword | Datapath synthesis, Design space exploration, data flow graph, Optimal covering, valid structure enumeration |
Abstract | This paper proposes a technology mapping algorithm applicable for arbitrary single-output tree-structure DSP block whose operation nodes have up to two fan-ins of FPGAs, based on an exhaustive search to find an optimal implementation of the given application circuit description. DSP blocks have become essential in FPGAs to achieve high performance and area efficiency. For the effective use of DSP blocks, a technology mapping algorithm is indispensable to find the optimal implementation of a given circuit using DSP blocks. Ronak et al. have proposed a greedy algorithm to search for a mapping that maximizes throughput, targeting the Xilinx DSP48E1. Our proposed algorithm applies to a broader range of DSP blocks since it automatically generates a database of valid configurations from a structural description of the target DSP block. Replicating operators allows us to find solutions with a smaller number of DSP blocks than those by the conventional algorithm while reducing global nets. To reduce runtime, we introduce pruning techniques and graph partitioning that do not affect the optimality. From experiments using DFGs with 33, 58, and 100 nodes, the proposed method reduces the number of DSP blocks by 7.94--10.81% compared with the conventional algorithms. |
Title | (Invited Talk) Ingredients of Efficient Hardware Accelerators for Neural Networks |
Author | Juinn-Dar Huang (National Yang Ming Chiao Tung University, Taiwan) |
Page | p. 67 |
Abstract | Today, neural networks (NNs) are broadly used for numerous artificial intelligence (AI) applications including computer vision, image/video processing, speech recognition, and natural language processing (NLP). Though NN-based algorithms can provide better solutions in several AI application domains, those advantages come at the cost of extremely high computational complexity. Currently, GPU-based computing engines are the most commonly used platforms for NN computation. Nevertheless, they are pricey, power-hungry and therefore inappropriate for certain application areas, such as edge computing. Therefore, specialized hardware accelerators optimized for a specific class of NNs are getting more attention these days. This short talk aims to introduce some common ingredients of efficient NN hardware accelerators, including dataflow-based optimization, weight pruning and compression, quantization, and numeric data formats. Algorithm-level design considerations facilitating efficient hardware implementations are also discussed. Accelerators for convolutional neural networks (CNNs) and multilayer perceptrons (MLPs) are used as examples for demonstrations. |
Title | An Accuracy Reconfigurable Multiply-Accumulate Unit Based on Operand-Decomposed Mitchell's Multiplier |
Author | *Lingxiao Hou, Yutaka Masuda, Tohru Ishihara (Nagoya University, Japan) |
Page | pp. 68 - 73 |
Keyword | Multiply-Accumulate, Mitchell Algorithm, Operand Decomposition, Reconfigurable Multiplier, Logarithmic Multiplier |
Abstract | Multiply-Accumulate (MAC) units are fundamental components for modern microprocessors to speed up calculating large amounts of data and digital signals. However, the accurate multiplier occupies a large area in the circuit and consumes a large amount of power. The logarithmic approximate multiplication scheme provides effective alternatives in terms of area and delay. Mitchell proposed an approximate logarithm multiplication, but its maximum error is 11.1% and needs improvement. This paper presents a new logarithmic approximation algorithm that supports arbitrary accuracy and an accuracy reconfigurable MAC unit based on our novel algorithm. The algorithm is based on the Mitchell algorithm and combined with an improved operand decomposition method. We experimentally confirm that the proposed algorithm dramatically improves the computational accuracy. For our accuracy reconfigurable MAC unit, through the trade-off between the accuracy and parallelism, the identical MAC array can be used to achieve multiple accuracies of approximate multiplication. Compared with the traditional MAC unit composed of accurate multipliers, our proposal greatly reduces the area. |
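The 11.1% figure quoted in the abstract above can be reproduced with a short integer model of Mitchell's classic logarithmic multiplication. This is a sketch of the baseline algorithm only, not the authors' operand-decomposed variant. The function name and the 8-bit operand range are illustrative assumptions.

```python
def mitchell_multiply(a, b):
    """Approximate a * b for non-negative integers using Mitchell's method.

    Each operand is written as 2**k * (1 + x) with 0 <= x < 1, and
    log2(1 + x) is approximated by x, so only shifts and adds are needed.
    """
    if a == 0 or b == 0:
        return 0
    k1 = a.bit_length() - 1              # position of the leading one of a
    k2 = b.bit_length() - 1              # position of the leading one of b
    one = 1 << (k1 + k2)                 # fixed-point representation of 1.0
    f1 = (a - (1 << k1)) << k2           # x1 scaled by 2**(k1 + k2)
    f2 = (b - (1 << k2)) << k1           # x2 scaled by 2**(k1 + k2)
    s = f1 + f2                          # x1 + x2 in the same fixed-point scale
    if s < one:                          # no carry out of the fractional part
        return one + s                   # 2**(k1 + k2) * (1 + x1 + x2)
    return 2 * s                         # 2**(k1 + k2 + 1) * (x1 + x2)

# Worst relative error over all non-zero 8-bit operand pairs.
worst = max(
    (a * b - mitchell_multiply(a, b)) / (a * b)
    for a in range(1, 256)
    for b in range(1, 256)
)
```

The approximation never overestimates the product. The worst case occurs when both fractional parts equal 0.5, for example 3 * 3, where the result is 8 instead of 9, a relative error of 1/9.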
Title | MRR Usage Optimization for WRONoC Topology Generation and Communication Parallelism Depending on Bandwidth Requirements |
Author | *Kanta Arisawa, Shigeru Yamashita (Ritsumeikan University, Japan), Tsun-Ming Tseng (Technical University of Munich, Germany) |
Page | pp. 74 - 79 |
Keyword | WRONoC Topology Generation, Communication Parallelism, MRR |
Abstract | Most studies of current wavelength-routed optical networks-on-chip (WRONoC) topology generation methods are based on a single resonant wavelength of each silicon microring resonator (MRR). In this paper, we propose an MRR usage optimization method considering multiple resonant wavelengths for individual MRRs. Experimental results show that our approach can reduce MRR usage by 20% compared to the state-of-the-art. Moreover, while communication parallelism methods for WRONoCs have mainly focused on maximizing the total bit parallelism regardless of message requirements, this paper illustrates wavelength assignments depending on the degree of bit parallelism that each communicating pair requires. |
Title | Structural Doubling Operations for Efficient Design of Long Bit Length Parallel Prefix Adders |
Author | *Aye Myat Mon, Mineo Kaneko (Japan Advanced Institute of Science and Technology, Japan) |
Page | pp. 80 - 85 |
Keyword | Parallel prefix adder, Optimization, Simulated Annealing |
Abstract | The concept of the procedural construction for parallel prefix adders has been proposed for representing the structural variation of parallel prefix adders. This paper proposes an application of the procedural construction to the design of long bit length adders. In order to adapt to the design of long bit length adders while reducing the problem complexity, we introduce a set of macro construction rules, each of which doubles the bit length from one or two source structures. The application sequence of these macro construction rules is maintained with a binary tree data structure where each node (except leaves) corresponds to a macro construction rule to apply and its one or two children correspond to source structures for the construction rule. Because of the limitation on the construction rules to apply, the size of the solution space is kept suitably small even for long bit length adders, e.g., 256-bit or longer adders. In the design experiments, our solution space has been explored with a Simulated Annealing search algorithm to get better solutions compared with typical benchmarks under the trade-off between the hardware resource and delay characteristics. |
Title | Logic Functions Realized Using Clockless Gates for Rapid Single-Flux-Quantum Circuits |
Author | *Takahiro Kawaguchi (Kyoto University, Japan), Kazuyoshi Takagi (Mie University, Japan), Naofumi Takagi (Kyoto University, Japan) |
Page | pp. 86 - 91 |
Keyword | Rapid single-flux-quantum circuit, clockless gates, 0-preserving function |
Abstract | Superconducting rapid single-flux-quantum (RSFQ) circuit is a promising candidate for circuit technology in the post-Moore era. RSFQ digital circuits operate with pulse logic and are usually composed of clocked gates. We have proposed clockless gates, which can reduce the hardware amount of a circuit drastically. In this paper, we first discuss logic functions realizable using clockless gates. We show that all the 0-preserving functions can be realized using the existing clockless gates, i.e., AND, OR, and NIMPLY (not-imply) gates. We then propose a circuit architecture composed of both clocked gates and clockless gates. |
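The claim about 0-preserving functions in the abstract above can be checked by brute force for the two-input case. The sketch below is an illustration only, not the paper's proof or circuit construction: it models the gates as ordinary Boolean operators (ignoring pulse-logic timing) and enumerates every function obtainable by composing AND, OR, and NIMPLY on inputs x and y. All names are illustrative assumptions.

```python
from itertools import product

# Input combinations in a fixed order: (0,0), (0,1), (1,0), (1,1).
INPUTS = list(product((0, 1), repeat=2))

GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NIMPLY": lambda a, b: a & (1 - b),   # a AND (NOT b)
}

def closure():
    """Return all truth tables reachable from x and y using the three gates."""
    x = tuple(a for a, _ in INPUTS)
    y = tuple(b for _, b in INPUTS)
    found = {x, y}
    while True:
        new = {
            tuple(gate(p, q) for p, q in zip(f, h))
            for f in found
            for h in found
            for gate in GATES.values()
        }
        if new <= found:                  # fixed point: nothing new appeared
            return found
        found |= new

reachable = closure()

# A function is 0-preserving when it maps the all-zero input to 0.
zero_preserving = {t for t in product((0, 1), repeat=4) if t[0] == 0}
```

For two inputs, `reachable` and `zero_preserving` are the same set of eight truth tables. Functions such as NOT or NAND, which output 1 for all-zero inputs, are never produced.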
Title | A Fast LUT Based Point Intensity Computation for OPC Algorithm |
Author | *Tahsin Shameem, Shimpei Sato, Atsushi Takahashi, Hiroyoshi Tanabe (Tokyo Institute of Technology, Japan), Yukihide Kohira (The University of Aizu, Japan), Chikaaki Kodama (KIOXIA Corporation, Japan) |
Page | pp. 92 - 97 |
Keyword | lithography, OPC, tap point intensity |
Abstract | Nano lithography is becoming more challenging with the advancement of technology nodes. OPC (Optical Proximity Correction) algorithms are becoming more aggressive as mask complexity increases. In this research, a simulator is proposed that simulates tap point intensity to guide the OPC algorithm instead of generating an aerial image of the mask pattern, which makes it time and memory efficient. |
Title | Six-Valued Simulation Based on Adaptive Ordering of Input Patterns for Error Diagnosis |
Author | *Akio Masamori, Hiroshi Nakano, Nobutaka Kuroki (Kobe University, Japan), Tetsuya Hirose (Osaka University, Japan), Masahiro Numa (Kobe University, Japan) |
Page | pp. 98 - 103 |
Keyword | error diagnosis, ECO |
Abstract | This paper presents a six-valued simulation method based on adaptive ordering of input patterns to shorten the processing time for error diagnosis. To reduce the number of false error location sets at earlier stages, the input patterns are ordered based on the preceding six-valued simulation results for error location sets with lower multiplicity. The experimental results have shown that the proposed technique is effective to shorten the processing time by 64.2% on average. |
Title | Thermal Design Technology for Non-low Power Hearables |
Author | *Kodai Matsuhashi (Hirosaki University, Japan), Koutaro Hachiya (Teikyo Heisei University, Japan), Toshiki Kanamoto, Masasi Imai, Atsushi Kurokawa (Hirosaki University, Japan) |
Page | pp. 104 - 109 |
Keyword | hearable, wearable, thermal network, thermal design |
Abstract | Hearables are expected to grow the most among future wearable devices. This paper presents thermal design technology for ear-hook type hearables. We design a bone conduction hearable, develop a thermal network model, and perform thermal simulations. The temperatures obtained by the thermal model agree with those obtained by a commercial FEM-based thermal solver with high accuracy (within 3.6%). Moreover, results of thermal simulations with the model show that 1) the sizing of 10 mm can reduce each maximum temperature of “Device” (in the device body), “Front” (on the front surface of case), “Back” (on the back surface of case), and “Hook” (on the surface of hook) by 13.8%, 19.2%, 16.8%, and 13.6%, respectively, 2) the use of aluminum for the case can reduce temperatures of “Device”, “Front”, and “Back” by 37.1%, 39.9%, and 29.9%, respectively, and 3) the distribution of heat generating components can reduce temperatures of “Device” and “Front” by 27.4% and 16.5%, respectively. |
Title | Voltage Feedback Method by DC-DC Converter with High Power Efficiency for 2-D Resistive Sensor Array |
Author | *Shoo Saga, Yohsuke Shiiki, Hiroki Ishikuro (Keio University, Japan) |
Page | pp. 110 - 113 |
Keyword | resistive sensor array, interface circuit, wire resistance effect |
Abstract | This paper proposes a new readout method for 2-D cross-point resistive sensor arrays using operational amplifiers and a PI controller for the step-down DC-DC converter. Compared with the conventional method using an operational amplifier, it can solve the problem of the operational amplifier's driving force and reduce power consumption. Wiring resistance, which has a significant effect on reading, has not previously been evaluated. When the simulation was run with 16x16 parameters, a 10kΩ sensor resistor, and a 10Ω metal resistor, the readout error exceeded 25%. Therefore, we discuss the effect of the wiring resistance and of the on-resistance of the switches. In addition, various parameters were changed, and the simulation results obtained with MATLAB are shown. Design guidelines are provided from the results. |
Title | A Fast Mixing Method of Multiple Reagents on Programmable Microfluidic Devices |
Author | *Ryuya Sakamoto, Shigeru Yamashita (Ritsumeikan University, Japan) |
Page | pp. 114 - 115 |
Keyword | Programmable Microfluidic Devices, Mixing |
Abstract | Programmable Microfluidic Devices (PMDs) are relatively new devices for implementing flow-based lab-on-a-chips that allow mixing of reagents in various ratios other than 1:1. In this paper, we propose a method of mixing more than two reagents with only two steps. We show that our method can produce almost all ratios. |
Title | Flexible Buffer Configuration for Memory Traffic Reduction of CNNs with Rapid Test Algorithm |
Author | Tsai-Yu Tsai (Chung Yuan Christian University, Taiwan), Yu-Cheng Lin (Kainan University, Taiwan), *Wei-Kai Cheng (Chung Yuan Christian University, Taiwan) |
Page | pp. 116 - 119 |
Keyword | CNN, buffer configuration, memory traffic |
Abstract | In this paper, we propose a rapid test algorithm to minimize memory traffic of CNNs based on the buffer configuration layer by layer. First, for the architecture of the convolutional neural network, we appropriately select the processing layers that need to be deployed. Then we adjust the buffer configuration for the layers that need to be deployed. Our rapid test algorithm (RTA) uses machine learning techniques to learn the features of the previously optimized network, and applies the learned features to other network architectures for prediction of buffer configuration. We use SCALE-Sim to evaluate our methodology. Compared with the traditional fixed buffer configuration, RTA can effectively reduce the external memory traffic by about 50%, and reduce the number of calculations by about 85% compared to the brute force method. |
Title | Design Automation for Wire-bond Package Die Orientation and Placement |
Author | *Geng-Shen Lin, Yi-Yu Liu (National Taiwan University of Science and Technology, Taiwan) |
Page | pp. 120 - 123 |
Keyword | Die Placement, Packaging |
Abstract | In the IC design and manufacturing industry, most chips must go through the process of packaging and testing before mounting on the PCB. Owing to complex package design rules, the layouts of package substrates are manually designed by engineers. In order to reduce the layout design time of package substrates, we develop a floorplanning methodology to automate the process in the early design stage. Given the netlist, die pad and substrate bump ball locations, both die orientation and location can be simultaneously determined on the package substrate taking wire length into account. |
Title | (Keynote Speech) The Role of TCAD and TCAD Based DTCO in Advanced Technology Nodes |
Author | Asen Asenov (University of Glasgow, Semiwise Ltd., UK) |
Page | p. 124 |
Abstract | For more than 40 years TCAD has been playing a key role in improving the existing CMOS technologies and in the development of new technology generations. According to the 2015 edition of the ITRS, the extensive use of TCAD reduces the cost of new technology development by up to 40%. TCAD is also indispensable when evaluating new technology options and device architectures. It played an essential role in the introduction of FinFET and FDSOI technology. In recent years the role of TCAD has expanded from technology development to TCAD based Design-Technology Co-Optimisation aiming to tune the technology for particular circuit implementations and products. In this talk we will present the rationale and examples of using TCAD for technology optimisation and development, path finding and TCAD based DTCO. |
Tuesday, March 30, 2021 |
Title | (Keynote Speech) Design Methodologies for Scalable and Energy-Efficient Neuromorphic Computing |
Author | Anup Das (Drexel University, USA) |
Page | p. 125 |
Abstract | Hardware implementations of neuromorphic computing can significantly improve energy efficiency, thanks to the event-driven nature of activations, low-power designs of hardware components, and distributed implementation of in-place computation and synaptic storage using Non-Volatile Memory (NVM). Modern neuromorphic architectures present several challenges, both on the hardware and on the software front. From the hardware perspective, there is a clear limitation on the scalability of neuromorphic architectures. In fact, as the complexity of these architectures increases, data communication becomes the critical performance and energy bottleneck. From the software perspective, executing a machine learning program on a computer involves several steps: compilation, resource allocation, and run-time mapping. Although apparent for mainstream von-Neumann computers, these steps are not well defined for neuromorphic architectures. This talk will introduce the energy and scalability problems in neuromorphic computing with NVM and our hardware and software-based solutions to these problems. |
Title | Acceleration of Residual Binarized Neural Network |
Author | *Yan Chen, Kiyofumi Tanaka (Japan Advanced Institute of Science and Technology, Japan) |
Page | pp. 126 - 131 |
Keyword | binarized neural network, FPGA, CNN |
Abstract | In this paper, we discuss a state-of-the-art method of accelerating the Residual Binarized Neural Network (ReBNet) by replacing fixed-point number multiplication with logical shift operation. We designed an end-to-end framework for training binary neural networks, on which the conversion to logical-shift-based multiplication on software and hardware accelerators implemented on FPGA is performed. Compared to ReBNet, we got similar accuracy in several datasets, and our hardware resource usage in the same degree of parallelism as ReBNet was much lower. It is concluded that our design can be implemented on a smaller device or with larger degree of parallelism in the same device. |
Title | Mode Transition Improvement by Adding Load Current Sensing Circuit in a Buck-Boost Converter for Mobile Devices |
Author | *Yosuke Susa, Ryo Kishida, Tatsuji Matsuura, Akira Hyogo (Tokyo University of Science, Japan) |
Page | pp. 132 - 136 |
Keyword | buck-boost converter, current sensing, mode transition |
Abstract | A conventional buck-boost converter achieves high power efficiency by reducing the number of switching operations. However, the boundary voltage between buck and boost modes shifts as the load current changes. This shift causes mode transition errors. This paper proposes eliminating the error by sensing the current flowing through the inductor and making the boundary voltage variable. Circuit simulation results show that, with the proposed method, the ripple rate is less than ±1.0% and the voltage change rate during load fluctuations is less than ±5.0%. |
Title | Approximate Decomposition of Multi-output LUTs under Acceptable Error Tolerance |
Author | *Xuechen Zang, Shigetoshi Nakatake (The University of Kitakyushu, Japan), Hiroyuki Kozutsumi, Mitsunori Katsu (TRL Corp., Japan), Shoichi Sekiguchi (TAIYO YUDEN Co., LTD, Japan) |
Page | pp. 137 - 141 |
Keyword | Approximate Computing, Look-up Table, Programmable Device, Memory Reconfigurable Logic Device |
Abstract | Approximate computing has been widely utilized in the field of logic circuit optimization to help achieve effective area compression and complexity reduction. This paper proposes a novel methodology for approximate computing that focuses on the phase of decomposing large look-up tables (LUTs) into smaller ones. By asserting reserved bits and finding optimal solutions, the proposed methodology can generate approximate LUTs under an acceptable error tolerance rate. Experimental results on the decomposition of 4-bit/8-bit multiplier logic show a 2-4 bit reduction within error rates of 5.4%-23.4% / 0-19.4%. The simplicity of the methodology and its potential scalability point to several interesting directions for future research. |
Title | Adaptive Ordering of EPI-groups to Extract Error Location Sets Based on ZDD for Error Diagnosis |
Author | Hiroshi Nakano, *Suguru Hojo, Akio Masamori, Nobutaka Kuroki (Kobe University, Japan), Tetsuya Hirose (Osaka University, Japan), Masahiro Numa (Kobe University, Japan) |
Page | pp. 142 - 147 |
Keyword | error diagnosis, ZDD, ECO |
Abstract | This paper presents an adaptive ordering method of EPI-groups to shorten the processing time to extract error location sets based on ZDD for error diagnosis. Based on the results with lower multiplicity, we apply the EPI-groups that are effective in reducing the number of error location sets before the other EPI-groups. Experimental results have shown that the proposed technique reduces the processing time by up to 96.0%, and by 88.7% on average. |
Title | Test Plan For Detecting Mersenne Twister Faults In BIST |
Author | *Takuma Nagao, Ken'ichi Yamaguchi, Hiroshi Iwata (National Institute of Technology, Nara College, Japan) |
Page | pp. 148 - 149 |
Keyword | mersenne twister, built-in self test, test sequence, fault coverage |
Abstract | The Mersenne Twister can be utilized to generate test patterns for detecting LSI faults. Utilizing the Mersenne Twister as a test pattern generator can improve the fault coverage. However, the Mersenne Twister itself may contain faults. Therefore, this paper examines whether there is a test sequence for detecting faults in the Mersenne Twister. Experimental results show that the Mersenne Twister has a test sequence that reaches 100% fault coverage for stuck-at faults. |
Title | NR-Router: A Network-Flow-Based Routing Algorithm for Electrowetting-on-Dielectric Microfluidics with Non-Regular Shape Electrodes |
Author | *Hsin-Chuan Huang, Tsung-Yi Ho (National Tsing Hua University, Taiwan) |
Page | pp. 150 - 155 |
Keyword | Routing, Network-Flow, Microfluidic |
Abstract | Due to the advances in microfluidics, electrowetting-on-dielectric (EWOD) chips have been widely applied to various laboratory procedures. To ease the process of designing EWOD chips, the prototype of a cloud-based open-source EWOD cyber manufacturing ecosystem has been proposed. This ecosystem introduces non-regular shape electrodes and glass-based chips to provide multi-functional and high-reliability EWOD chips, which can achieve more complex operations (e.g., control of different-size droplets, precise droplet movement). No existing work, however, considers the routing for glass-based EWOD chips with non-regular shape electrodes, which hinders the implementation and development of this ecosystem since users are burdened with time-consuming wire connections. Unlike with regular shape electrodes, routing resource manipulation and pin access of non-regular shape electrodes further complicate the routing problem. In this paper, we propose a network-flow-based routing algorithm called NR-Router that, for the first time, can correctly route glass-based EWOD chips with non-regular shape electrodes. We construct a minimum cost flow problem to generate optimal routing paths, followed by a light-weight model to reduce the run time. NR-Router achieves 100% routability while minimizing wirelength at a low run time and can generate mask files that can be directly put into manufacturing. Experimental results show the robustness and efficiency of the proposed algorithm. |
Title | Hardware/Software Co-Design of a Monte-Carlo Tree Search based Reversi Player |
Author | *Nobutaka Kito, Moeka Tsuji, Kyouka Tomioka (Chukyo University, Japan) |
Page | pp. 156 - 161 |
Keyword | Programmable SoC, Reversi, Monte Carlo Tree Search |
Abstract | A Monte Carlo Tree Search (MCTS) based Reversi player is proposed for programmable SoCs containing both a processor core and an FPGA. It performs the tree search of MCTS with the processor core and simulates games with the FPGA by random selections of legal positions to evaluate a board in progress. The simulations are known as playouts of MCTS and are time-consuming. No domain-specific knowledge other than the rules of Reversi is used for the evaluation in the player. The circuit in the FPGA is designed with high-level synthesis. It consists of processing elements for playouts and performs playouts in parallel to evaluate a given board precisely. The player was implemented for the Xilinx Zynq-7000 programmable SoC on a Pynq-Z1 board. The evaluation results showed that the number of playouts per second with the circuit is 2.7 times higher than that of a multi-threaded software implementation on a Ryzen 7 3700X and that the designed Reversi player may outperform existing Reversi players. |
Title | Using Receiver Coils for Dissipating Heat of Watch-type Smart Devices |
Author | *Shinsuke Kashiwazaki, Kodai Matsuhashi, Motoki Ishizaki, Toshiki Kanamoto (Hirosaki University, Japan), Koutaro Hachiya (Teikyo Heisei University, Japan), Ryosuke Watanabe, Atsushi Kurokawa (Hirosaki University, Japan) |
Page | pp. 162 - 167 |
Keyword | smartwatch, smart device, thermal analysis, wireless power transfer, wireless charging |
Abstract | This paper proposes a method to promote heat dissipation of watch-type smart devices by using receiver coils for wireless charging. It is achieved by embedding a spiral coil in the belt. In addition, the effects of various coil arrangements on temperatures at key hot spots are clarified. For watch-type smart devices, the most temperature-critical part is the backside surface of the belt, since this part touches the skin for a long time. The analysis results show that the proposed method can reduce the temperature rise relative to ambient temperature at the belt’s backside surface by 45.6%. It is also shown that the power transfer efficiency of the proposed method can be more than 90%. |
Title | On the Number of Variables To Represent Classification Functions Using Linear Decompositions |
Author | *Tsutomu Sasao (Meiji University, Japan) |
Page | pp. 168 - 173 |
Keyword | support minimization, data mining, logic minimization |
Abstract | A classification function f is a mapping f: D → M, where D ⊂ B^n, B = {0,1}, and M = {1,2,..., m}. When |D| ≪ 2^n, f can be represented with fewer variables than n, with a linear decomposition. We show a method to estimate the number of variables to represent the function. Experimental results using randomly generated functions are shown. |
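As a small illustration of the idea in the abstract above, the following Python sketch checks whether a given set of XOR combinations of the input variables is enough to represent a classification function. It is not the estimation method of the paper; the names `parity`, `project`, and `is_valid_decomposition`, and the 4-variable example table, are assumptions made up for this illustration. Each new variable is the XOR (parity) of the input bits selected by a mask, and the decomposition is valid when no two inputs of different classes collapse onto the same new-variable vector.

```python
def parity(v: int) -> int:
    """Return the XOR of all bits of v (1 if the number of set bits is odd)."""
    return bin(v).count("1") & 1


def project(x: int, masks) -> tuple:
    """Map input x to the new variables y_i = XOR of the bits of x selected by masks[i]."""
    return tuple(parity(x & m) for m in masks)


def is_valid_decomposition(table: dict, masks) -> bool:
    """True if inputs with different classes never share the same projected vector."""
    seen = {}
    for x, cls in table.items():
        y = project(x, masks)
        # setdefault stores the first class seen for y and returns it afterwards
        if seen.setdefault(y, cls) != cls:
            return False
    return True


# Hypothetical example with n = 4 and |D| = 4: class 1 = {0011, 0000}, class 2 = {0001, 0010}
table = {0b0011: 1, 0b0000: 1, 0b0001: 2, 0b0010: 2}

print(is_valid_decomposition(table, [0b0011]))  # True: y = x0 XOR x1 separates the classes
print(is_valid_decomposition(table, [0b0001]))  # False: x0 alone does not
```

In this toy table no single original variable separates the two classes, while the single compound variable x0 XOR x1 does, which is the kind of reduction a linear decomposition can offer when |D| is much smaller than 2^n.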
Title | Internet of Things Device Security: Hardware Implementation of International Standard ISO/IEC 29192-3 Lightweight Stream Ciphers |
Author | *Tadashi Okabe (Tokyo Metropolitan Industrial Technology Research Institute, Japan) |
Page | pp. 174 - 178 |
Keyword | IoT, ISO/IEC 29192-3, Lightweight stream cipher, Hardware implementation, FPGA |
Abstract | In an Internet of Things (IoT) society, both general purpose computers and lightweight portable electronic devices require higher levels of security. For this purpose, cryptographic techniques such as standard block ciphers like triple-DES and AES, secure hash algorithms like SHA-1, SHA-2, and SHA-3, and public key cryptography like RSA are widely used. However, these techniques cannot be applied to small, lightweight devices that are resource-constrained or have integrated circuits with low processing power. Therefore, lightweight cryptography has been developed to enhance security levels in computing applications and IoT networks, including resource-constrained devices. In October 2012, the International Organization for Standardization and the International Electrotechnical Commission published the standard for lightweight stream ciphers, i.e., ISO/IEC 29192-3. This standard specifies two keystream generators for lightweight stream ciphers, namely, Enocoro and Trivium, which have key sizes of 80 or 128 bits and 80 bits, respectively. The present article describes the first hardware implementation of Enocoro on a general-purpose field programmable gate array using a hardware description language. We confirm that the Enocoro and Trivium algorithms provide excellent hardware performance for a lightweight stream cipher in terms of performance per unit area and power consumption. |
Title | Compact FPGA Implementation of Popcounter for BNN Using Linear Feedback Shift Register |
Author | Nagisa Ishiura, *Ryota Saimyoji (Kwansei Gakuin University, Japan) |
Page | pp. 179 - 180 |
Keyword | binarized neural network, FPGA implementation, popcounter, LFSR |
Abstract | This paper presents an idea for reducing the size of neurons for inference of binarized neural networks with a view to packing as many neurons as possible in FPGAs to increase parallelism. We focus on input-serial neurons and use LFSRs (linear feedback shift registers) in place of binary counters. We also present an idea to use LFSRs in parallel-serial hybrid neurons. Implemented on Xilinx Artix-7, the number of LUTs is reduced to 31.3% and 69.2% for a serial neuron and a hybrid neuron for threshold 2400, respectively. |
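To make the counter replacement in the abstract above concrete, here is a minimal Python sketch of an input-serial neuron that uses an LFSR state sequence instead of a binary count. It is a software model under stated assumptions, not the circuit of the paper: the 16-bit Fibonacci LFSR with taps 16, 14, 13, 11, the seed `0xACE1`, and the function names `lfsr_step`, `target_state`, and `serial_neuron` are all illustrative choices. The idea is that the LFSR advances once per input bit that is 1, and the neuron fires when the state reaches the state that lies `threshold` steps after the seed, which can be computed ahead of time.

```python
def lfsr_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def target_state(seed: int, threshold: int) -> int:
    """Precompute the LFSR state reached after `threshold` steps from `seed`."""
    state = seed
    for _ in range(threshold):
        state = lfsr_step(state)
    return state


def serial_neuron(bits, threshold: int, seed: int = 0xACE1) -> bool:
    """Return True if at least `threshold` of the serial input bits are 1."""
    target = target_state(seed, threshold)
    state = seed
    fired = threshold == 0  # zero threshold is met before any input arrives
    for b in bits:
        if b:
            state = lfsr_step(state)  # "count" by moving to the next LFSR state
            if state == target:
                fired = True  # latch the result once the threshold state is seen
    return fired


print(serial_neuron([1] * 2400, 2400))             # True
print(serial_neuron([1] * 2399 + [0] * 10, 2400))  # False
```

The model relies on the LFSR not revisiting a state within `threshold` steps, so the register must be wide enough that its period exceeds the threshold. In hardware, an LFSR needs only a shift register and a few XOR gates rather than a carry chain.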
Title | Energy Efficient RISC-V Processor for Portable Sensor Applications |
Author | *Kan Hatakeyama (Hirosaki University, Japan), Masami Fukushima, Koichi Kitagishi, Seijin Nakayama (UNO Laboratories, Ltd., Japan), Hideki Ishihara (AQUAXIS TECHNOLOGY, Japan), Masashi Imai, Atsushi Kurokawa, Toshiki Kanamoto (Hirosaki University, Japan) |
Page | pp. 181 - 184 |
Keyword | Processor, Low power, Sensor, Embedded system, IoT |
Abstract | This paper proposes a RISC-V based energy efficient processor intended for portable sensor nodes, including health monitoring devices. The main feature of the proposed processor is that it adopts an asynchronous data memory enabling one-stage operation, which surpasses the existing pipelines. We implement the feature on an FPGA and also design an ASIC in an industrial 180nm technology node. The proposed processor is equipped with the peripherals necessary to configure sensor nodes such as a portable pulse oximeter. In this paper, we show the effectiveness of the proposed processor in terms of not only micro architecture but also practicality. |
Title | (Invited Talk) ERI in Taiwan: How We Develop EDA Solutions for Reconfigurable Memory-Centric AI Edge Applications |
Author | Hung-Ming Chen (National Yang Ming Chiao Tung University, Taiwan) |
Page | p. 185 |
Abstract | US DARPA rolled out the Electronics Resurgence Initiative (ERI) in 2017, trying to open new innovation pathways in addressing impending engineering and economics changes that could challenge what has been a relentless half-century run of progress in microelectronics technology. To emulate such efforts, the Taiwan government initiated the Taiwan ERI program in 2019, with a total of five projects running in the EDA field. In this talk, those projects will first be briefly introduced, and among them we will focus on one specific project working to build an in-memory computing (IMC) design platform. Memory-centric designs deploy computation to storage and enable efficient in-memory computation while avoiding massive amounts of data movement. The IMC schemes have shown distinct advantages and concerns when applied to different types of memory technologies. To attain an efficient design within a short design cycle, it is essential to have an integrated design framework with automated tools to support hybrid memory systems and to perform effective optimization across design stages. The second part of the talk will then introduce a unified framework which integrates EDA solutions to address the design and optimization challenges in different aspects of next generation memory-centric designs, including fast reconfiguration of in-memory/near-memory computing designs to provide optimized solutions (behavioral models and APR cell layouts) for AI edge applications. |
Title | Wire-bond Package Finger Placement with Minimal Distance |
Author | Yu-En Lin, *Che-Hsu Lin, Yi-Yu Liu (National Taiwan University of Science and Technology, Taiwan) |
Page | pp. 186 - 191 |
Keyword | finger placement, wire-bond |
Abstract | Semiconductor packaging is the final stage of fabrication, which encapsulates one or more integrated circuit chips to avoid physical damage and to provide interconnections to the outside world. There are a variety of packaging design styles, such as wire-bond, flip-chip, 3-D TSV, and so on. In this work, we propose a framework to handle the finger placement problem in the wire-bond package design style. First, a minimum-cost-maximum-flow-based global finger placement algorithm is used to derive initial finger locations with overall minimum distance. After that, legalization steps are proposed that take the bond-wire crossing constraint and the pad row sequence constraint into consideration. Finally, an incremental finger placement refinement algorithm is adopted to generate the final layout with minimal displacement compared to the initial global placement. With the proposed framework, legalized finger locations are determined in polynomial time for package substrate layout engineers to start with. |
Title | A Transimpedance Amplifier Topology Considering the Impact of Variability on Inductive Peaking |
Author | *Tomofumi Tsuchida, Akira Tsuchiya, Toshiyuki Inoue, Keiji Kishine (University of Shiga Prefecture, Japan) |
Page | pp. 192 - 196 |
Keyword | optical communication, transimpedance amplifier, inductive peaking, variability, parasitic inductance |
Abstract | This paper proposes a transimpedance amplifier topology for reducing the impact of parasitic inductance and the variability of poly-resistors. When inductive peaking is used, parasitic inductance and variability cause over-peaking. We propose a circuit topology whose damping factor increases as the parasitic inductance increases. Also, the proposed topology is tolerant of the variability of the poly-resistors. Simulation results show that the proposed circuit realizes a 74% larger eye-opening compared to the conventional topology. |
Title | Combination of Barrel Shifters and Adder Trees for Low-Power Depthwise Separable Convolution |
Author | Shi-Rou Lin, Hsu-Yu Kao, *Shih-Hsu Huang (Chung Yuan Christian University, Taiwan) |
Page | pp. 197 - 200 |
Keyword | Convolution, Digital Design, Logic Circuits, Multipliers, Power Dissipation |
Abstract | Multiplication is a fundamental operation in a depthwise separable convolution hardware unit. However, conventional multipliers often consume considerable power in a hardware implementation. Thus, in this paper, we propose a new hardware architecture, which is a combination of barrel shifters and adder trees, to execute multiplications in depthwise separable convolution. As a consequence, power consumption can be greatly reduced. Benchmark data show that the proposed approach can reduce power consumption by 64.4% with only a 1.2% loss in top-1 accuracy. |
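The general principle behind replacing a multiplier with shifters and adders can be sketched in a few lines of Python. This is the textbook shift-and-add decomposition, shown only as an illustration; it is not the architecture proposed in the paper, and the function name `shift_add_multiply` is an assumption. Each set bit of the weight contributes one shifted copy of the input (what a barrel shifter would produce), and the shifted copies are summed (what an adder tree would do).

```python
def shift_add_multiply(x: int, w: int) -> int:
    """Multiply x by a non-negative integer weight w using only shifts and adds."""
    if w < 0:
        raise ValueError("this sketch handles non-negative weights only")
    acc = 0
    shift = 0
    while w:
        if w & 1:
            acc += x << shift  # one shifted partial product per set bit of w
        w >>= 1
        shift += 1
    return acc


print(shift_add_multiply(13, 11))  # 143, since 11 = 8 + 2 + 1 gives (13<<3) + (13<<1) + 13
```

The number of additions equals the number of set bits in the weight, so weights with few set bits need correspondingly few shifted terms.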
Title | ReNA: A Reconfigurable Neural-Network Accelerator for AI Edge Computing |
Author | *Yasuhiro Nakahara, Motoki Amagasaki (Kumamoto University, Japan), Qian Zhao (Kyushu Institute of Technology, Japan), Masato Kiyama, Masahiro Iida (Kumamoto University, Japan) |
Page | pp. 201 - 206 |
Keyword | AI edge computing, Deep neural network, AI chip |
Abstract | Low power consumption is important in edge artificial intelligence (AI) chips, where power supply is limited. Therefore, we propose a reconfigurable neural network accelerator (ReNA), an AI chip that can process convolutional and fully connected layers with the same structure by reconfiguring the circuit. We also develop tools for pre-evaluation of performance when a deep neural network (DNN) model is implemented on ReNA. From this, we establish an implementation flow for DNN models on ReNA. Evaluating the performance of VGG16, we achieve 1.30 TOPS/W. |
Title | Energy Efficient Approximate Storing to MRAM for Deep Neural Network Tasks in Edge Computing |
Author | *Yoshinori Ono, Kimiyoshi Usami (Shibaura Institute of Technology, Japan) |
Page | pp. 207 - 212 |
Keyword | Deep learning, Edge Device, Magnetic RAM, Approximate Computing, Energy Minimization |
Abstract | On-chip learning is gaining attention in edge devices. In addition, magnetic RAM (MRAM) is a promising memory technology for edge devices because of its low leakage energy. However, high write energy is a disadvantage of MRAM. To minimize the write energy, we propose an approximate storing approach to MRAM for learning tasks of deep neural networks (DNNs). The proposed approach writes the weight and bias data to NVM approximately on each epoch with finely adjusted write time. Simulation results with image recognition DNN applications have demonstrated that the write energy can be reduced by 9% to 37% with negligible (< 0.5%) accuracy loss. |
Title | A Hiding External Memory Access Latency by Prefetch Architecture for DNN Accelerator Available on High-Level Synthesis |
Author | *Ryota Yamamoto (Nagoya University, Japan), Shinya Honda (Nanzan University, Japan), Masato Edahiro (Nagoya University, Japan) |
Page | pp. 213 - 218 |
Keyword | DNN, High-Level Synthesis |
Abstract | In recent years, there has been a demand for deep neural network (DNN) inference applications in embedded systems. Our group is also developing a framework to design hardware on an FPGA for DNN inference. Although low-end FPGAs are needed to reduce the cost of development, these FPGAs have small internal memory. Therefore, we have designed a prefetch architecture in the system-level design environment that can be easily used in C code for high-level synthesis. We propose a method of storing data in the internal memory by transferring the data in a burst. We designed DNN inference hardware with external memory access using the prefetch architecture. As a result, the prefetch design is faster than both the internal-memory case and the external-memory case. In particular, it is found that the performance is up to 10 times faster than the case of external memory access without prefetch. |
Title | Thermally Optimization of the Trimming Shape of Thin Film NiCr Resistors to Improve Pulse Durability |
Author | *Ryosuke Watanabe (Hirosaki University, Japan), Keita Izawa (Nikkohm Co., Ltd., Japan), Shota Kajiya, Tomohiro Aoba, Ryo Arima, Atsushi Kurokawa, Toshiki Kanamoto (Hirosaki University, Japan) |
Page | pp. 219 - 224 |
Keyword | Resistors, Thin film, Thermal, Durability, Trimming |
Abstract | This paper proposes a trimming method for thin film NiCr resistors to improve pulsed voltage durability. Here, we focus on the thermal damage at the trimmed area, which dominates the durability. Thermal damage is the main mechanism of the pulse destruction of thin film resistors. To improve the pulse durability of thin film resistors, it is effective to reduce the local heat at the NiCr surface. In the production process of industrial resistors, trimming the surface of the NiCr films is needed to adjust the resistance value. However, strong current concentration induces thermal damage near the trimmed area. Thus, obtaining an appropriate trimming structure is beneficial for improving the pulse durability of thin film NiCr resistors. Here, we consider the L-shaped trimming structure that has been widely used in industrial thin film resistors. Thermal circuit models as well as technology CAD simulation models of the considered resistors indicate that the temperature increase is suppressed when the trimming pattern ends closer to the edges of the resistance body. We propose a method to obtain thermally desirable L-shaped trimming patterns for thin film resistors based on pre-process thermal simulation. |
Title | Error Detection Capacity of SAT-based Coverage-driven Design Verification |
Author | Hiroyuki Nakayama, *Kiyoharu Hamaguchi (Shimane University, Japan) |
Page | pp. 225 - 228 |
Keyword | verification, SAT, coverage, error detection |
Abstract | In our previous work, we proposed an approach to accelerate coverage-driven verification by combining random simulation and SAT-based input pattern generation. This approach was effective in terms of coverage improvement. However, its capacity for detecting errors in circuit designs was not tested. In this paper, we show through experiments that our approach can detect more errors than random simulation. |
Title | Study on an RTL Conversion Method from Pipelined Synchronous RTL Models into Asynchronous RTL Models |
Author | *Shogo Semba, Hiroshi Saito (The University of Aizu, Japan) |
Page | pp. 229 - 234 |
Keyword | Asynchronous circuits, RTL, Conversion |
Abstract | In this paper, we propose a conversion method from pipelined synchronous Register Transfer Level (RTL) models into asynchronous RTL models with bundled-data implementation. The proposed method generates a control data flow graph (CDFG) from a given synchronous RTL model. After generating the CDFG, the proposed method generates an asynchronous RTL model by analyzing each pipeline stage on the CDFG, assigning asynchronous control modules, and connecting the control modules. In the experiment, we converted four pipelined synchronous RTL models into asynchronous ones. In addition, we performed logic synthesis for the converted asynchronous RTL models. The synthesized asynchronous circuits could reduce the energy consumption by 3.6% on average. |
Title | Transformation of Mixing Graphs Considering Splitting Errors on Digital Microfluidic Biochip |
Author | *Ikuru Yoshida, Shigeru Yamashita (Ritsumeikan University, Japan) |
Page | pp. 235 - 236 |
Keyword | Digital Microfluidic Biochip, error-correction, sample preparation |
Abstract | Recently, Digital Microfluidic Biochips (DMFBs) have been studied intensively. DMFBs can execute biochemical experiments very efficiently. In biochemical experiments on a DMFB, "sample preparation" is an important task to generate a sample droplet with the desired concentration value. We merge/split droplets in a DMFB to perform sample preparation. When we split a droplet into two droplets, the split cannot be done evenly in some cases. Due to such unbalanced splits, the generated concentration value may have unacceptable errors. This paper shows that we can decrease the errors caused by unbalanced splits if we transform the corresponding mixing graph. We also propose an efficient method to transform a mixing graph in order to decrease the errors caused by unbalanced splits. |
Title | An Open Circuit Voltage Estimation for Lithium-ion Batteries Using Kalman Filter |
Author | *Syuusei Ota (Ritsumeikan University, Japan), Lei Lin, Masahito Arima (Daiwa Can Company, Japan), Masahiro Fukui (Ritsumeikan University, Japan) |
Page | pp. 237 - 240 |
Keyword | OCV, Kalman Filter, Lithium-ion Battery |
Abstract | With the advent of the renewable energy era, precise and safe management technology for lithium-ion batteries is in high demand. This paper discusses an open circuit voltage (OCV) estimation system for lithium-ion batteries based on the joint use of a Kalman filter and the RLS method. We use pre-measured dOCV/dSOC instead of the mean of dOCV/dSOC. As a result, the OCV estimation accuracy is improved. |