Title | Performance-centric Register File Design for GPUs using Racetrack Memory |
Author | *Shuo Wang, Yun Liang, Chao Zhang, Xiaolong Xie, Guangyu Sun (Peking University, China), Yongpan Liu, Yu Wang (Tsinghua University, China), Xiuhong Li (Peking University, China) |
Page | pp. 25 - 30 |
Keyword | GPU, Performance, Register File, Racetrack Memory, Compiler |
Abstract | In this paper, we explore racetrack memory for designing a high-performance register file for GPU architectures. The high storage density of racetrack memory helps improve thread-level parallelism, but its lengthy shift operations can substantially degrade performance. To mitigate the shift overhead, we develop a compile-time managed register mapping algorithm. Our algorithm optimizes the mapping of registers to physical addresses in the register file. Experimental results demonstrate that our technique achieves up to 24% (19% on average) improvement in performance for a variety of GPU applications. |
Title | Improving Read Performance of STT-MRAM based Main Memories through Smash Read and Flexible Read |
Author | Lei Jiang (Advanced Micro Devices, U.S.A.), Wujie Wen (Florida International University, U.S.A.), *Danghui Wang (Northwestern Polytechnical University, China), Lide Duan (University of Texas at San Antonio, U.S.A.) |
Page | pp. 31 - 36 |
Keyword | STT-MRAM, read disturbance, main memory, read scheme, LPDDR3 |
Abstract | Spin Transfer Torque Magnetoresistive RAM (STT-MRAM) has recently been deemed a promising main memory alternative for high-end mobile processors. With process technology scaling, the amplitude of the write current approaches that of the read current in deep sub-micrometer STT-MRAM arrays. As a result, read disturbance errors (RDEs) emerge. Both high current restore required (HCRR) reads and low current long latency (LCLL) reads can guarantee read reliability and completely eliminate RDEs. However, both degrade system performance, because of either extra restores or a longer read latency, and neither consistently achieves the better performance across a wide variety of applications. In this paper, we present two architectural techniques to boost read performance for STT-MRAM based main memories in the presence of RDEs. We first propose Smash Read (S-RD) to shorten the latency of HCRR reads by injecting a larger read current. We further introduce Flexible Read (F-RD) to dynamically adopt different types of read schemes, S-RD and LCLL, to maximize main memory system performance. On average, our techniques improve system performance by 9~13% and reduce total energy by 4~8% over all existing read schemes, including HCRR and LCLL. |
Title | STLAC: A Spatial and Temporal Locality-Aware Cache and Network-on-Chip Codesign for Tiled Many-core Systems |
Author | *Mingyu Wang (Institute of Microelectronics, Tsinghua University, China), Zhaolin Li (Research Institute of Information Technology, Tsinghua University, China) |
Page | pp. 37 - 42 |
Keyword | Many-core, Adaptive Cache, Network-on-chip |
Abstract | The spatial and temporal locality of workloads is the root reason that cache designs can overcome the memory wall problem. However, few existing state-of-the-art designs exploit both locality features to optimize the memory hierarchy in tiled many-core systems, which loses opportunities for further performance improvement. To address this problem, an adaptive spatial and temporal locality-aware cache and network-on-chip (NoC) codesign (STLAC) is proposed, which dynamically partitions the last level cache (LLC) as a data prefetch buffer or a victim cache for locality prediction and exploits a hybrid burst-support NoC for fast data prefetch. The data prefetch buffer speculates the data blocks in subsequent addresses to exploit spatial locality, while the victim cache collects the data blocks evicted from the upper memory hierarchy to exploit temporal locality. By combining the proposed adaptive cache partition with the hybrid burst-support NoC, off-chip misses and on-chip network usage are greatly reduced. Experimental results demonstrate that the proposed STLAC reduces off-chip misses by up to 43% and improves performance by 15% on average compared with the traditional shared LLC design. |
Title | A Lightweight OpenMP4 Run-time for Embedded Systems |
Author | Roberto E. Vargas, Sara Royuela, *Maria A. Serrano, Xavi Martorell, Eduardo Quiñones (Barcelona Supercomputing Center, Spain) |
Page | pp. 43 - 49 |
Keyword | OpenMP4, Parallel programming Models, Many-core embedded processors, Compiler Analysis, Task Dependency Graph |
Abstract | OpenMP is increasingly being adopted by current many-core embedded processors to exploit their parallel computation capabilities. Unfortunately, current run-time implementations of the latest specification (v4.0) are not suitable for processors relying on small and fast on-chip memories, due to their memory consumption. This paper proposes an OpenMP4 run-time that reduces memory consumption while providing the same performance. Our run-time relies on a new compiler pass capable of generating the task dependency graph of OpenMP programs, which is then efficiently stored in memory. |