IEEE Computer Architecture Letters - new TOC



TOC Alert for Publication# 10208

A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM

July-Dec. 1 2017

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables optimizations that improve the reliability, latency, and energy efficiency of the system. All these optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As the internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge. Our goal in this work is to decouple the detection and mitigation of data-dependent failures from physical DRAM organization, such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur with the current content in memory while the programs are running in the system. Using experimental data from real machines, we demonstrate that MEMCON is an effective and low-overhead system-level detection and mitigation technique for data-dependent failures in DRAM.
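
A minimal behavioral sketch of the content-based idea follows (Python; write_row and read_row_after_retention are hypothetical platform hooks, not interfaces from the letter): test only the content that is actually resident, record the rows that fail with that content, and remap them.

    def test_current_content(rows, write_row, read_row_after_retention):
        """Return the set of row indices whose *current* content fails.

        rows: {row_index: bytes} snapshot of what programs hold in memory.
        """
        failing = set()
        for idx, content in rows.items():
            write_row(idx, content)                   # re-instate current data
            readback = read_row_after_retention(idx)  # read after retention wait
            if readback != content:                   # data-dependent failure
                failing.add(idx)
        return failing

    def mitigate(failing_rows, remap_table, spare_rows):
        # Remap rows that fail with the current content to known-good spares.
        for idx in failing_rows:
            remap_table[idx] = spare_rows.pop()

Because only the resident content is tested, the procedure must be re-run when memory content changes substantially; that is the trade-off MEMCON makes against exhaustive, internals-aware testing.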



Addressing Read-Disturbance Issue in STT-RAM by Data Compression and Selective Duplication

July-Dec. 1 2017

In the deep sub-micron regime, spin transfer torque RAM (STT-RAM) exhibits read-disturbance errors (RDEs), which present a crucial reliability challenge. We present SHIELD, a technique to mitigate RDEs in STT-RAM last-level caches (LLCs). SHIELD uses data compression to reduce cache-write traffic and the restore requirement. SHIELD also keeps two copies of data blocks that compress to less than half the block size; since several LLC blocks are accessed only once, this avoids many restore operations. SHIELD consumes less energy than two previous RDE-mitigation techniques, namely high-current restore-required read (HCRR, also called restore-after-read) and low-current long-latency read (LCLL), and even less than an ideal RDE-free STT-RAM cache.
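
A rough functional model of the write/read policy is below, using zlib as a software stand-in for the hardware block compressor and a 64-byte block as an assumed size; both are illustrative choices, not details from the letter.

    import zlib  # software stand-in for a hardware block compressor

    BLOCK_SIZE = 64  # bytes per LLC block (assumed)

    def write_block(frame, data):
        comp = zlib.compress(data)
        if len(comp) <= BLOCK_SIZE // 2:
            frame['payload'], frame['dup'] = comp, True   # two copies fit in one frame
        else:
            frame['payload'], frame['dup'] = data, False  # stored as-is

    def read_block(frame, restore):
        if frame['dup']:
            # The duplicate copy survives a disturbed read, so the
            # high-current restore can be skipped or deferred.
            return zlib.decompress(frame['payload'])
        restore(frame)            # restore-after-read for non-duplicated blocks
        return frame['payload']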



An Efficient Temporal Data Prefetcher for L1 Caches

July-Dec. 1 2017

Server workloads frequently encounter L1-D cache misses and hence lose significant performance potential. One way to reduce the number of L1-D misses, or their effect, is data prefetching. As L1-D access sequences have high temporal correlation, temporal prefetching techniques are promising for L1 caches. State-of-the-art temporal prefetching techniques are effective at reducing the number of L1-D misses, but we observe that there is a significant gap between what they offer and the opportunity. This work aims to improve the effectiveness of temporal prefetching techniques. To overcome the deficiencies of existing temporal prefetchers, we introduce Domino prefetching. Domino prefetcher is a temporal prefetching technique that looks up the miss history to find the last occurrence of the last one or two L1-D miss addresses, and prefetches based on the addresses that followed. We show that Domino prefetcher captures more than 87 percent of the temporal opportunity at the L1-D. Through evaluation of a 16-core processor on a set of server workloads, we show that Domino prefetcher improves system performance by 26 percent (up to 56 percent).
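
A toy functional model of that history lookup is sketched below, under the assumption that the history is a flat log (the hardware keeps it in compact structures): it matches on the last two miss addresses first, then falls back to one.

    class DominoSketch:
        def __init__(self, degree=4):
            self.history = []     # global L1-D miss-address log
            self.last_pos = {}    # (addr,) or (prev, addr) -> last index seen
            self.degree = degree

        def on_miss(self, addr):
            pos = None
            if self.history:                                   # two-address match first
                pos = self.last_pos.get((self.history[-1], addr))
            if pos is None:                                    # fall back to one address
                pos = self.last_pos.get((addr,))
            hits = [] if pos is None else self.history[pos + 1 : pos + 1 + self.degree]
            self.history.append(addr)
            i = len(self.history) - 1
            self.last_pos[(addr,)] = i
            if i > 0:
                self.last_pos[(self.history[i - 1], addr)] = i
            return hits            # addresses to prefetch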



A Scheme to Improve the Intrinsic Error Detection of the Instruction Set Architecture

July-Dec. 1 2017

The Instruction Set Architecture (ISA) determines the effect that a soft error on an instruction can have on the processor. Previous works have shown that the ISA has some intrinsic capability of detecting errors: for example, errors that change a valid instruction into an invalid instruction encoding, or into an instruction that causes an exception, are detected. The percentage of detectable errors varies widely across the bits of the ISA. For example, errors in bits used for immediate or register values are unlikely to be detected, while errors in bits used for the opcode are more likely to lead to an exception. In this paper, this property is exploited by introducing a simple encoding of the instructions that does not require additional bits. The idea is that decoding propagates an error so that it affects the most sensitive bits of the ISA and is therefore more likely to be detected. As no additional bits are required, no changes or overheads are needed in the memory. The proposed scheme is useful when the memory is not protected with parity or Error Correction Codes. The only cost of implementing the technique is simple encoder and decoder circuits, similar to a parity computation. The technique is applicable to any ISA, regardless of the length of the opcodes or their location in the instruction encoding. The effectiveness of the proposed scheme has been evaluated on the ARM Cortex-M0 ISA, resulting in an increase in error detection capability of up to 1.64x.
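
One simple instance of such an encoding (my construction for illustration, not necessarily the paper's exact circuit) XORs the opcode field with the parity of the remaining bits: a flip in any immediate or register bit then changes the decoded opcode, steering the error toward the field most likely to raise an invalid-instruction exception. The transform is its own inverse, so encoder and decoder are the same parity-style logic. Field width and position below are assumed, MIPS-like values.

    OPCODE_BITS = 6     # assumed opcode field width
    OPCODE_SHIFT = 26   # assumed opcode position

    def parity(x):
        x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1
        return x & 1

    def encode(instr):
        rest = instr & ((1 << OPCODE_SHIFT) - 1)
        op = (instr >> OPCODE_SHIFT) & ((1 << OPCODE_BITS) - 1)
        mask = parity(rest) * ((1 << OPCODE_BITS) - 1)  # parity replicated
        return ((op ^ mask) << OPCODE_SHIFT) | rest

    def decode(stored):
        return encode(stored)   # the XOR transform is an involution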



Decongest: Accelerating Super-Dense PCM Under Write Disturbance by Hot Page Remapping

July-Dec. 1 2017

At small feature sizes, phase change memory (PCM) exhibits write disturbance (WD) errors (WDEs), an issue that can eclipse the density and energy-efficiency advantages of PCM. We propose ‘Decongest’, a technique to address WD errors in main memory designed with super-dense ($4F^2$ cell size) PCM. Decongest works by identifying write-intensive hot pages and remapping them to a WD-free spare area, which avoids both WD to nearby pages caused by writing these hot pages, and WD to these hot pages caused by writing nearby pages. Compared to a WD-affected super-dense PCM baseline, Decongest improves performance by 14.0 percent and saves 21.8 percent energy.
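
The remapping policy reduces to write counting plus a translation table, as in this sketch; the threshold and spare-area handling are illustrative placeholders rather than the paper's tuned policy.

    from collections import Counter

    class DecongestSketch:
        def __init__(self, spare_frames, hot_threshold=1000):
            self.writes = Counter()
            self.remap = {}                 # page -> WD-free spare frame
            self.spares = list(spare_frames)
            self.hot_threshold = hot_threshold

        def translate(self, page):
            return self.remap.get(page, page)

        def on_write(self, page):
            self.writes[page] += 1
            if (page not in self.remap and self.spares
                    and self.writes[page] >= self.hot_threshold):
                # Moving a write-intensive page into the WD-free area removes
                # disturbance both from it (to neighbors) and to it.
                self.remap[page] = self.spares.pop()
            return self.translate(page)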



Enhanced Dependence Graph Model for Critical Path Analysis on Modern Out-of-Order Processors

July-Dec. 1 2017

The dependence graph model of out-of-order (OoO) instruction execution is a powerful representation used for critical path analysis. However, most, if not all, previous models are either out-of-date and lack the detail needed to model modern OoO processors, or are too specific and complicated, which limits their generality and applicability. In this paper, we propose an enhanced dependence graph model that remains simple but greatly improves accuracy over prior models. Evaluation results using the gem5 simulator show that the proposed enhanced model achieves a CPI error of 2.1 percent, a 90.3 percent improvement over the state-of-the-art model.
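
For context, the step every such model shares is a longest-path computation over the dependence DAG; the modeling contribution lies in which edges exist and what latencies they carry, which the sketch below simply takes as input.

    def critical_path(num_nodes, edges):
        """edges: (src, dst, latency) tuples with src < dst (program order)."""
        dist = [0] * num_nodes
        pred = [None] * num_nodes
        for src, dst, lat in sorted(edges, key=lambda e: e[1]):  # topological
            if dist[src] + lat > dist[dst]:
                dist[dst] = dist[src] + lat
                pred[dst] = src
        end = max(range(num_nodes), key=dist.__getitem__)
        path, n = [], end
        while n is not None:
            path.append(n)
            n = pred[n]
        return dist[end], path[::-1]   # length of, and nodes on, the critical path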



FeSSD: A Fast Encrypted SSD Employing On-Chip Access-Control Memory

July-Dec. 1 2017

Cryptography is one of the most popular methods for protecting data stored in storage devices such as solid-state drives (SSDs). A common technique is to encrypt all incoming data before it is stored; however, the encryption overhead is non-negligible and can increase I/O service time. To mitigate the negative performance impact of data encryption, a write buffer can be used to hide the long encryption latency: incoming unencrypted data is acknowledged as soon as it is written to the buffer, and is encrypted and synchronized with flash memory later. However, if the write buffer itself is not encrypted, unencrypted secret data might leak through this insecure buffer. On the other hand, fully encrypting the entire write buffer incurs significant performance overhead. To address this problem, we propose an on-chip access-control memory (ACM) and present a fast encrypted SSD, called FeSSD, that implements a secure write-buffering mechanism using the ACM. The ACM does not require a memory-level full-encryption mechanism, thus not only solving the unencrypted-data leakage problem but also offering relatively fast I/O service. Our simulation results show that the I/O response time of FeSSD improves by up to 56 percent over a baseline in which encrypted data is stored in a normal write buffer.
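
The buffering idea can be reduced to a behavioral sketch in which the ACM becomes an owner check on still-unencrypted entries; encrypt and flash_write are assumed callbacks, and the single-threaded drain stands in for the device's concurrent datapath.

    class FeSSDBufferSketch:
        def __init__(self, encrypt, flash_write):
            self.encrypt, self.flash_write = encrypt, flash_write
            self.pending = {}                    # lba -> (plaintext, owner)

        def write(self, lba, data, owner):
            self.pending[lba] = (data, owner)    # ack now, encrypt later

        def read(self, lba, requester):
            if lba in self.pending:              # ACM-like check (simplified):
                data, owner = self.pending[lba]  # only the writing context may
                if requester != owner:           # touch still-unencrypted data
                    raise PermissionError("unencrypted buffer entry")
                return data
            return None                          # served from encrypted flash

        def drain_one(self):
            # Off the critical path: encrypt one entry and push it to flash.
            if self.pending:
                lba, (data, owner) = next(iter(self.pending.items()))
                self.flash_write(lba, self.encrypt(data))
                del self.pending[lba]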



Guiding Locality Optimizations for Graph Computations via Reuse Distance Analysis

July-Dec. 1 2017

This work addresses the problem of optimizing graph-based programs for multicore processors. We use three graph benchmarks and three input data sets to characterize the importance of properly partitioning graphs among cores at multiple levels of the cache hierarchy. We also exhaustively explore a large design space comprising different parallelization schemes and graph partitionings via detailed simulation, to show how much gain can be obtained over a baseline legacy scheme that partitions for the L1 cache only. Our results demonstrate that the legacy approach is not the best choice, and that our proposed parallelization/locality techniques can perform better (by up to 20 percent). We then use a performance prediction model based on multicore reuse distance (RD) profiles to rank-order the different parallelization/locality schemes in the design space. We compare the best configuration as predicted by our model against the actual best identified by our exhaustive simulations. For one benchmark and data input, our model achieves 79.5 percent of the performance gain achieved by the actual best. Across all benchmarks and data inputs, our model achieves 48 percent of the maximum performance gain. Our work demonstrates a new use case for multicore RD profiles, i.e., as a tool to help program developers and compilers optimize graph-based programs.
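
As background for the model: a reference's reuse (stack) distance is the number of distinct addresses touched since the previous access to the same address, and a histogram of these distances predicts hit ratios for arbitrary cache sizes; this is what lets candidate partitionings be ranked without simulating each one. A minimal, quadratic-time computation:

    def reuse_distances(trace):
        last_pos, dists = {}, []
        for i, addr in enumerate(trace):
            if addr in last_pos:
                # distinct addresses since the previous access to addr
                dists.append(len(set(trace[last_pos[addr] + 1 : i])))
            else:
                dists.append(float('inf'))   # cold (first-time) reference
            last_pos[addr] = i
        return dists

    # reuse_distances(['a', 'b', 'c', 'a']) -> [inf, inf, inf, 2]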



IMEC: A Fully Morphable In-Memory Computing Fabric Enabled by Resistive Crossbar

July-Dec. 1 2017

In this paper, we propose a fully morphable In-MEmory Computing (IMEC) fabric to better implement the concept of processing inside memory (PIM). Enabled by emerging nonvolatile memory, i.e., RRAM and its monolithic 3D integration, IMEC can be configured into one, or a combination, of four distinct functions: 1) logic, 2) ternary content addressable memory, 3) memory, and 4) interconnect. Thus, IMEC exploits a continuum of PIM capabilities across the whole spectrum, ranging from 0 percent (pure data storage) to 100 percent (pure compute engine), with intermediate states in between. IMEC can be modularly integrated into the DDRx memory subsystem, communicating with processors via ordinary DRAM commands. Additionally, to reduce the programming burden, we provide a complete framework to compile applications written in a high-level programming language (e.g., OpenCL) onto IMEC. This framework also enables code portability across different platforms for heterogeneous computing. Using this framework, several benchmarks are mapped onto IMEC to evaluate its performance, energy, and resource utilization. The simulation results show that IMEC reduces energy consumption by 99.6 percent and achieves a $644\times$ speedup compared to a baseline CPU system. We further compare IMEC with an FPGA architecture, and demonstrate that the performance improvement is not simply obtained by replacing SRAM cells with denser RRAM cells.



Improving GPGPU Performance via Cache Locality Aware Thread Block Scheduling

July-Dec. 1 2017

Modern GPGPUs support the concurrent execution of thousands of threads to provide an energy-efficient platform. However, the massive multi-threading of GPGPUs incurs serious cache contention, as the cache lines brought in by one thread can easily be evicted by other threads in the small shared cache. In this paper, we propose a software-hardware cooperative approach that exploits the spatial locality among different thread blocks to better utilize the precious cache capacity. Through dynamic locality estimation and thread block scheduling, we can capture more performance-improvement opportunities than prior work that only explores the spatial locality between consecutive thread blocks. Evaluations across diverse GPGPU applications show that, on average, our locality-aware scheduler provides 25 and 9 percent performance improvements over the commonly employed round-robin scheduler and the state-of-the-art scheduler, respectively.
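
The scheduling idea fits in a few lines: given an estimated cache-line footprint per thread block (assumed input here; the paper estimates locality dynamically in hardware), greedily place each block on the SM whose resident blocks share the most lines with it.

    def schedule(block_footprints, num_sms, blocks_per_sm):
        """Assumes num_sms * blocks_per_sm covers all blocks."""
        sm_lines = [set() for _ in range(num_sms)]
        sm_load = [0] * num_sms
        placement = {}
        for blk, lines in block_footprints.items():
            open_sms = [s for s in range(num_sms) if sm_load[s] < blocks_per_sm]
            best = max(open_sms, key=lambda s: len(sm_lines[s] & lines))
            placement[blk] = best            # co-locate sharers on one SM
            sm_lines[best] |= lines
            sm_load[best] += 1
        return placement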



Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks

July-Dec. 1 2017

Convolutional Neural Networks (CNNs) are one of the most successful deep machine learning technologies for processing image, voice and video data. CNNs require large amounts of processing capacity and memory, which can exceed the resources of low-power mobile and embedded systems. Several hardware accelerator designs have been proposed for CNNs, typically containing large numbers of Multiply Accumulate (MAC) units. One approach to reducing data sizes and memory traffic in CNN accelerators is “weight sharing”, where the full range of weight values in a trained CNN is put into bins and the bin index is stored instead of the original weight value. In this paper we propose a novel MAC circuit that exploits binning in weight-sharing CNNs. Rather than computing the MAC directly, we count the frequency of each weight and place it in a bin, then compute the accumulated value in a subsequent multiply phase. This allows the hardware multipliers in the MAC circuit to be replaced with adders and selection logic. Experiments show that, for the same clock speed, our approach results in fewer gates, smaller logic, and reduced power.
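
The arithmetic restructuring is easy to state: with weights binned to B shared values, sum_i w[i]*x[i] equals sum_b value[b] * (sum of the x[i] whose weight falls in bin b). The accumulate phase then needs only adders and bin-select logic, and one multiply per bin runs in a short final phase. A sketch with a worked check:

    def binned_mac(bin_indices, activations, bin_values):
        per_bin = [0] * len(bin_values)
        for b, x in zip(bin_indices, activations):
            per_bin[b] += x                  # accumulate phase: adds only
        return sum(v * s for v, s in zip(bin_values, per_bin))  # multiply phase

    # Check against the direct dot product:
    # bin_values=[0.5, -1.0], bin_indices=[0, 1, 0], activations=[2, 3, 4]
    # direct: 0.5*2 + (-1.0)*3 + 0.5*4 = 0.0; binned_mac gives 0.0 as well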



NearZero: An Integration of Phase Change Memory with Multi-Core Coprocessor

July-Dec. 1 2017

Multi-core based coprocessors have become powerful research vehicles to analyze a large amount of data. Even though they can accelerate data processing by using a hundred cores, the data unfortunately exist on an external storage device. The separation of computation and storage introduces redundant memory copies and unnecessary data transfers over different physical device boundaries, which limit the benefits of coprocessor-accelerated data processing. In addition, the coprocessors need assistance from host-side resources to access the external storage, which can require additional system context switches. To address these challenges, we propose NearZero, a novel DRAM-less coprocessor architecture that precisely integrates a state-of-the-art phase change memory into its multi-core accelerator. In this work, we implement an FPGA-based memory controller that extracts important device parameters from real phase change memory chips, and apply them to a commercially available hardware platform that employs multiple processing elements over a PCIe fabric. The evaluation results reveal that NearZero achieves on average 47 percent better performance than advanced coprocessor approaches that use direct I/Os (between storage and coprocessors), while consuming only 19 percent of the total energy of such advanced coprocessors.



Resistive Address Decoder

July-Dec. 1 2017

Hardwired dynamic NAND address decoders are widely used in random access memories to decode parts of the address. Replacing the wires with resistive elements allows storing and reprogramming the addresses and matching them against an input address. The resistive address decoder thus becomes a content addressable memory, while the read latency and dynamic energy remain almost identical to those of a hardwired address decoder. One application of the resistive address decoder is a fully associative TLB with read latency and energy consumption similar to those of a one-way associative TLB. Another is a many-way associative cache with read latency and energy consumption similar to those of a direct-mapped one. A third is the elimination of physical addressing: by introducing the resistive address decoder into the main memory, virtual addresses can be used throughout the entire memory hierarchy.
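
Functionally, the proposal turns the decoder into a small content-addressable memory: each row stores the (reprogrammable) address it responds to, and a hardwired decoder is just the special case where row i stores address i. A behavioral model:

    class ResistiveDecoder:
        def __init__(self, num_rows):
            self.stored = [None] * num_rows    # programmable per-row address

        def program(self, row, address):
            self.stored[row] = address         # rewrite the resistive elements

        def decode(self, address):
            # Activate the wordline(s) whose stored address matches; for a
            # fully associative TLB, 'address' is a virtual-page tag.
            return [r for r, a in enumerate(self.stored) if a == address]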



Runtime-Assisted Global Cache Management for Task-Based Parallel Programs

July-Dec. 1 2017

Dead blocks are handled inefficiently in multi-level cache hierarchies because the decision as to whether a block is dead has to be taken locally at each cache level. This paper introduces runtime-assisted global cache management to quickly deem blocks dead across cache levels in the context of task-based parallel programs. The scheme is based on a cooperative hardware/software approach that leverages static and dynamic information about future data region reuse(s) available to runtime systems for task-based parallel programming models. We show that our proposed runtime-assisted global cache management approach outperforms previously proposed local dead-block management schemes for task-based parallel programs.
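
A sketch of what the runtime hook might look like (the interface and names are mine; the letter does not spell them out): the runtime's task-dependence information tracks how many future tasks still read each data region, and when that count reaches zero the region's blocks can be deemed dead at every cache level at once.

    def on_task_complete(task, remaining_readers, demote_region):
        """task.inputs: region ids the finished task read.
        remaining_readers: region id -> count of not-yet-run reader tasks.
        demote_region: hint to the cache hierarchy (assumed hardware hook)."""
        for region in task.inputs:
            remaining_readers[region] -= 1
            if remaining_readers[region] == 0:   # no future reuse known
                demote_region(region)            # dead across all levels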



Storage-Free Memory Dependency Prediction

July-Dec. 1 2017

Memory Dependency Prediction (MDP) is paramount to good out-of-order performance, but decidedly not trivial, as all instances of a given static load may not necessarily depend on all instances of a given static store. As a result, for a given load, MDP should predict the exact store instruction the load depends on, and not only whether it depends on an in-flight store or not; i.e., ideally, prediction should not be binary. However, we first argue that, given the high degree of sophistication of modern branch predictors, the fact that a given dynamic load depends on an in-flight store can be captured using the binary prediction capabilities of the branch predictor, providing coarse MDP at zero storage overhead. Second, by leveraging hysteresis counters, we show that the precise producer store can in fact be identified. This embodiment of MDP yields performance levels that are on par with the state of the art, and requires less than 70 additional bits of storage over a baseline without MDP at all.
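
The two-part structure can be sketched with plain dictionaries standing in for the predictor's tables (indexed here by load PC for simplicity): a small saturating counter answers the binary "does this load depend on an in-flight store?" question, and a hysteresis-protected field learns which store, expressed as a distance in stores.

    class MDPSketch:
        def __init__(self):
            self.dep_ctr = {}    # load PC -> 2-bit saturating counter
            self.producer = {}   # load PC -> [store distance, hysteresis]

        def predict(self, pc):
            if self.dep_ctr.get(pc, 0) >= 2 and pc in self.producer:
                return self.producer[pc][0]   # wait on this in-flight store
            return None                       # predicted independent

        def train(self, pc, actual_distance):
            c = self.dep_ctr.get(pc, 0)
            if actual_distance is None:                   # load was independent
                self.dep_ctr[pc] = max(c - 1, 0)
                return
            self.dep_ctr[pc] = min(c + 1, 3)
            dist, hyst = self.producer.get(pc, [actual_distance, 0])
            if dist == actual_distance:
                self.producer[pc] = [dist, min(hyst + 1, 3)]
            elif hyst > 0:
                self.producer[pc] = [dist, hyst - 1]      # resist one-off noise
            else:
                self.producer[pc] = [actual_distance, 0]  # replace when drained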



Survive: Pointer-Based In-DRAM Incremental Checkpointing for Low-Cost Data Persistence and Rollback-Recovery

July-Dec. 1 2017

This paper introduces the Survive DRAM architecture for effective in-memory micro-checkpointing. Survive implements low-cost incremental checkpointing, enabling fast rollback that can be used in various architectural techniques such as speculation, approximation, or low voltage operation. Survive also provides crash consistency when used as the frontend of a hybrid DRAM-NVM memory system. This is accomplished by carefully copying the incremental checkpoints generated in the DRAM frontend to the NVM backend. Simulations show that Survive only imposes an average 3.5 percent execution time overhead over an unmodified DRAM main-memory system with no checkpointing, while reducing the number of NVM writes by 89 percent over an NVM-only main-memory system.
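
The incremental-checkpoint bookkeeping can be illustrated with an undo log keyed by address (in Survive this state lives inside the DRAM arrays; the dict here is purely illustrative): only the first write to a location after a checkpoint preserves the old value, so checkpoint cost scales with the write set rather than with memory size.

    class SurviveSketch:
        def __init__(self, memory):
            self.memory = memory      # addr -> value
            self.undo = {}            # addr -> value at the last checkpoint

        def write(self, addr, value):
            if addr not in self.undo:              # incremental: first write only
                self.undo[addr] = self.memory.get(addr)
            self.memory[addr] = value

        def rollback(self):
            for addr, old in self.undo.items():    # restore checkpointed values
                self.memory[addr] = old
            self.undo.clear()

        def checkpoint(self):
            self.undo.clear()                      # commit: discard old versions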



Towards a TrustZone-Assisted Hypervisor for Real-Time Embedded Systems

July-Dec. 1 2017

Virtualization technology is becoming more and more widespread in the embedded space. The penalties incurred by standard software-based virtualization are pushing research towards hardware-assisted solutions. Among the existing commercial off-the-shelf technologies for secure virtualization, ARM TrustZone is attracting particular attention. However, it is often viewed with some scepticism due to the dual-OS limitation of existing state-of-the-art solutions. This letter presents the implementation of a TrustZone-based hypervisor for real-time embedded systems, which allows multiple RTOS partitions on the same hardware platform. The results demonstrate that the virtualization overhead is less than 2 percent for a 10-millisecond guest-switching rate, and that the system remains deterministic. This work goes beyond related work by implementing a TrustZone-assisted solution that allows the execution of an arbitrary number of guest OSes, while providing the foundation to drive the next generation of secure virtualization solutions for resource-constrained embedded devices.



Transcending Hardware Limits with Software Out-of-Order Processing

July-Dec. 1 2017

Building high-performance, next-generation processors requires novel techniques to enable improved performance given today's power- and energy-efficiency requirements. Additionally, a widening gap between processor and memory performance makes it even more difficult to improve efficiency with conventional techniques. While out-of-order architectures attempt to hide memory latency by dynamically reordering instructions, they lack the energy efficiency of in-order processors. Our goal, therefore, is to reorder the instruction stream to avoid stalls and improve utilization, for both energy efficiency and performance. To accomplish this goal, we propose an enhanced stall-on-use in-order core that improves energy efficiency (and therefore performance in these power-limited designs) through out-of-program-order execution. During long-latency loads, the Software Out-of-Order Processing (SWOOP) core exposes additional memory- and instruction-level parallelism to perform useful, non-speculative work. The resulting instruction lookahead of the SWOOP core reaches beyond the conventional fixed-size processor structures with the help of transparent hardware register contexts. Our results show that SWOOP achieves a 34 percent performance improvement on average over an in-order, stall-on-use core, with an energy reduction of 23 percent.



Using Data Variety for Efficient Progressive Big Data Processing in Warehouse-Scale Computers

July-Dec. 1 2017

Warehouse Scale Computers (WSCs) are often used for big data jobs where the data under processing comes from a variety of sources. We show that different data portions, from the same or different sources, have different significance in determining the final outcome of the computation; hence, by prioritizing them and assigning more resources to the processing of more important data, the WSC can be used more efficiently in terms of both time and cost. We provide a simple, low-overhead mechanism to quickly assess the significance of each data portion, and show its effectiveness in finding the best ranking of data portions. We then demonstrate how this ranking can be used in resource allocation to improve time and cost by up to 24 and 9 percent, respectively, and also discuss other uses of the ranking information, e.g., in faster progressive approximation of the final outcome of a big data job without processing the entire data, and in more effective use of renewable energy in WSCs.
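
The allocation step reduces to scoring portions from small samples and handing out worker slots in proportion; in this sketch, significance_of_sample is an assumed callback, since the scoring mechanism itself is the paper's contribution.

    def allocate(portions, significance_of_sample, total_slots, sample_frac=0.01):
        # Score each portion from a small sample, then split the WSC's
        # worker slots proportionally (at least one slot per portion).
        scores = {p: significance_of_sample(p, sample_frac) for p in portions}
        total = sum(scores.values()) or 1
        return {p: max(1, round(total_slots * s / total))
                for p, s in scores.items()}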



Worklist-Directed Prefetching

July-Dec. 1 2017

Researchers have demonstrated the benefits of hardware worklist accelerators, which offload scheduling and load-balancing operations in parallel graph applications. However, many of these applications remain heavily memory-latency-bound due to the irregular access patterns of graph data structures. We exploit the fact that the accelerator has knowledge of upcoming work items to accurately issue prefetch requests, a technique we call worklist-directed prefetching. We also propose a credit-based system to improve prefetch timeliness and prevent cache thrashing. The proposed prefetching scheme is simulated on a 64-core CMP with a hardware worklist accelerator on several graph algorithms and inputs. Enabling worklist-directed prefetching into the L2 cache results in an average speedup of 1.99, and up to 2.35 on Breadth-First Search.
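
A minimal model of the credit scheme (all parameters assumed): the accelerator prefetches vertex data for upcoming work items only while it holds credits, and a credit returns each time the consumer catches up by one item, which bounds both cache pollution and prematurely issued prefetches.

    class WorklistPrefetcher:
        def __init__(self, prefetch_line, credits=8):
            self.prefetch_line = prefetch_line   # issues a prefetch into L2
            self.credits = credits
            self.lookahead = 0                   # items already prefetched ahead

        def on_dequeue(self, worklist, pos, vertex_addr_of):
            if self.lookahead:                   # consumer caught up by one item
                self.lookahead -= 1
                self.credits += 1                # ... so a credit comes back
            while self.credits and pos + self.lookahead + 1 < len(worklist):
                nxt = worklist[pos + self.lookahead + 1]
                self.prefetch_line(vertex_addr_of(nxt))
                self.credits -= 1
                self.lookahead += 1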