Preview: IEEE Computer Architecture Letters
Published: Mon, 3 Nov 2014 15:34:20 GMT
PrePrint: Adaptive Wear-leveling in Flash-based Memory
The paper presents an adaptive wear-leveling scheme based on several wear thresholds applied in different periods. The basic idea behind this scheme is that blocks can have different wear-out speeds, and the wear-leveling mechanism does not conduct data migration until the erasure counts of some hot blocks hit a threshold. Through a series of emulation experiments based on several realistic disk traces, we show that the proposed wear-leveling mechanism can reduce total erasure counts and yield uniform erasure counts among all blocks late in the lifetime of the storage devices. As a result, the performance of storage systems can be improved and the lifespan of the flash-based memory can be extended to a certain degree.
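The threshold-triggered idea in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: the class name `ThresholdWearLeveler`, the logical-to-physical swap with the least-erased block, and the rule that each migration advances to the next threshold are all assumptions made for illustration, and the cost of the migration itself is not modeled.

```python
class ThresholdWearLeveler:
    """Toy model: migrate hot data only when a block hits the current threshold."""

    def __init__(self, num_blocks, thresholds):
        self.erase_counts = [0] * num_blocks      # per physical block
        self.mapping = list(range(num_blocks))    # logical -> physical block
        self.thresholds = sorted(thresholds)      # one threshold per period (assumed)
        self.period = 0
        self.migrations = 0

    def current_threshold(self):
        # Stay on the last threshold once all periods are used up (assumption).
        return self.thresholds[min(self.period, len(self.thresholds) - 1)]

    def erase(self, logical_block):
        phys = self.mapping[logical_block]
        self.erase_counts[phys] += 1
        # No migration happens until a block's erase count hits the threshold.
        if self.erase_counts[phys] >= self.current_threshold():
            self._migrate(logical_block)

    def _migrate(self, hot_logical):
        hot_phys = self.mapping[hot_logical]
        # Pick the least-erased physical block as the migration target.
        cold_phys = min(range(len(self.erase_counts)),
                        key=self.erase_counts.__getitem__)
        if cold_phys == hot_phys:
            return
        cold_logical = self.mapping.index(cold_phys)
        # Swap the two mappings so hot data now lands on the cold block.
        self.mapping[hot_logical] = cold_phys
        self.mapping[cold_logical] = hot_phys
        self.migrations += 1
        self.period += 1  # move on to the next, higher threshold


wl = ThresholdWearLeveler(num_blocks=4, thresholds=[3, 6])
for _ in range(6):
    wl.erase(0)
print(wl.erase_counts, wl.mapping, wl.migrations)
```

In this run the hot logical block is remapped once, after its third erase, so later erases wear a different physical block.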
PrePrint: Subtleties of Run-Time Virtual Address Stacks
The run-time Virtual Address (VA) stack has some unique properties, which have garnered the attention of researchers. The stack one-dimensionally grows and shrinks at its top, and contains data that is seemingly local/private to one thread, or process. Most prior related research has focused on these properties. However, this article aims to demonstrate how conventional wisdom pertaining to the runtime VA stack fails to capture some critical subtleties and complexities. We first explore two widely established assumptions surrounding the VA stack area: (1) Data accesses can be classified as falling either under VA-stack-area accesses, or non-stack-area accesses, with no aliasing; (2) The VA stack data is completely private and invisible to other threads/processes. Subsequently, we summarize a representative selection of related work that pursued the micro-architectural concept of using run-time VA stacks to extend the general-purpose register file. We then demonstrate why these assumptions are invalid, by using examples from prior work to highlight the potential hazards regarding data consistency, shared memory consistency, and cache coherence. Finally, we suggest safeguards against these hazards. Overall, we explore the function-critical issues that future operating systems and compilers should address to effectively reap all the benefits of using run-time VA stacks.
PrePrint: DRAMA: An Architecture for Accelerated Processing near Memory
Improving energy efficiency is crucial for both mobile and high-performance computing systems while a large fraction of total energy is consumed to transfer data between storage and processing units. Thus, reducing data transfers across the memory hierarchy of a processor (i.e., off-chip memory, on-chip caches, and register file) can greatly improve the energy efficiency. To this end, we propose an architecture, DRAMA, that 3D-stacks coarse-grain reconfigurable accelerators (CGRAs) atop off-chip DRAM devices. DRAMA does not require changes to the DRAM device architecture, apart from through-silicon vias (TSVs) that connect the DRAM device’s internal I/O bus to the CGRA layer. We demonstrate that DRAMA can reduce the energy consumption to transfer data across the memory hierarchy by 66-95% while achieving speedups of up to 18× over a commodity processor.
PrePrint: Refactored Design of I/O Architecture for Flash Storage
Flash storage devices behave quite differently from hard disk drives (HDDs); a page on flash has to be erased before it can be rewritten, and the erasure has to be performed on a block which consists of a large number of contiguous pages. It is also important to distribute writes evenly among flash blocks to avoid premature wearing. To achieve interoperability with existing block I/O subsystems for HDDs, NAND flash devices employ an intermediate software layer, called the flash translation layer (FTL), which hides these differences. Unfortunately, FTL implementations require powerful processors with a large amount of DRAM in flash controllers and also incur many unnecessary I/O operations which degrade flash storage performance and lifetime. In this paper, we present a refactored design of I/O architecture for flash storage which dramatically increases storage performance and lifetime while decreasing the cost of the flash controller. In comparison with page-level FTL, our preliminary experiments show a 19% reduction in I/O operations, a 9% improvement in I/O performance, and a 36% improvement in storage lifetime. In addition, our scheme uses only 1/128 of the DRAM in the flash controller.
PrePrint: Architectural Support for Mitigating Row Hammering in DRAM Memories
DRAM scaling has been the prime driver of increasing capacity of main memory systems. Unfortunately, scaling to lower technology nodes worsens cell reliability because it increases the coupling between adjacent DRAM cells, thereby exacerbating different failure modes. This paper investigates the reliability problem due to Row Hammering, whereby frequent activations of a given row can cause data loss in its neighboring rows. As DRAM scales to lower technology nodes, the number of row activations needed to cause data loss in the neighboring rows decreases, making Row Hammering a challenging problem for future DRAM chips. To overcome Row Hammering, we propose two architectural solutions. The first, Counter-Based Row Activation (CRA), uses a counter with each row to count the number of row activations. If the count exceeds the row hammering threshold, a dummy activation is sent to neighboring rows proactively to refresh the data. The second, Probabilistic Row Activation (PRA), obviates the storage overhead of tracking and simply allows the memory controller to proactively issue dummy activations to neighboring rows with a small probability on every memory access. Our evaluations show that these solutions are effective at mitigating Row Hammering while causing negligible performance loss (< 1%).
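The two mitigation policies described above can be sketched in a few lines. This is a minimal illustration and not the paper's hardware design: the class names, the linear row layout in `neighbors`, and the counter reset after a dummy activation are assumptions, and periodic refresh (which would also reset counters) is not modeled.

```python
import random


def neighbors(row, num_rows):
    """Rows adjacent to `row`, assuming a simple linear physical layout."""
    return [r for r in (row - 1, row + 1) if 0 <= r < num_rows]


class CounterBasedRowActivation:
    """CRA sketch: one activation counter per row."""

    def __init__(self, num_rows, threshold):
        self.num_rows = num_rows
        self.threshold = threshold
        self.counts = [0] * num_rows

    def activate(self, row):
        """Record an activation; return rows that receive a dummy activation."""
        self.counts[row] += 1
        if self.counts[row] > self.threshold:
            self.counts[row] = 0  # assumption: restart counting after the refresh
            return neighbors(row, self.num_rows)
        return []


class ProbabilisticRowActivation:
    """PRA sketch: no counters, refresh neighbors with a small probability."""

    def __init__(self, num_rows, probability, rng=None):
        self.num_rows = num_rows
        self.probability = probability
        self.rng = rng or random.Random()

    def activate(self, row):
        if self.rng.random() < self.probability:
            return neighbors(row, self.num_rows)
        return []


cra = CounterBasedRowActivation(num_rows=8, threshold=3)
cra_results = [cra.activate(5) for _ in range(4)]
print(cra_results)
```

The trade-off the abstract points to is visible here: CRA needs `num_rows` counters but acts deterministically, while PRA keeps no per-row state and relies on a heavily hammered row eventually drawing a dummy activation.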
PrePrint: On-Demand Dynamic Branch Prediction
In out-of-order (OoO) processors, speculative execution with high branch prediction accuracy is employed to achieve good single-thread performance. In these processors, the branch prediction unit (BPU) tables are accessed in parallel with the instruction cache before it is known whether a fetch group contains branch instructions. For integer applications, we find that 85% of BPU lookups are done for non-branch operations and that, of the remaining lookups, 42% are done for highly biased branches that can be predicted statically with high accuracy. We evaluate on-demand branch prediction (ODBP), a novel technique that uses compiler-generated hints to identify those instructions that can be more accurately predicted statically to eliminate unnecessary BPU lookups. We evaluate an implementation of ODBP that combines static and dynamic branch prediction. For a 4-wide superscalar processor, ODBP delivers as much as 9% improvement in average energy-delay (ED) product, 7% core average energy saving, and 3% speedup. ODBP also enables the use of large BPUs for a given power budget.
PrePrint: Persistent Transactional Memory
This paper proposes persistent transactional memory (PTM), a new design that adds durability to transactional memory (TM) by incorporating emerging non-volatile memory (NVM). PTM dynamically tracks transactional updates to cache lines to ensure the ACI (atomicity, consistency and isolation) properties during cache flushes, and leverages an undo log in NVM to ensure that PTM can always consistently recover transactional data structures after a machine crash. This paper describes the PTM design based on Intel’s restricted transactional memory. A preliminary evaluation using a concurrent key/value store and a database with a cache-based simulator shows that the number of additional cache line flushes is small.
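The role of the undo log can be illustrated with a generic software sketch. It shows plain undo logging only, not PTM's cache-line tracking or its hardware integration: the class `UndoLogStore` and its methods are hypothetical, and ordinary Python containers stand in for data and log regions in NVM.

```python
class UndoLogStore:
    """Toy key/value store that can roll back an unfinished transaction."""

    def __init__(self):
        self.data = {}       # stands in for persistent data in NVM
        self.undo_log = []   # stands in for the undo log in NVM
        self.in_tx = False

    def begin(self):
        self.in_tx = True

    def write(self, key, value):
        if not self.in_tx:
            raise RuntimeError("write outside a transaction")
        # Log the old state BEFORE overwriting it, so it can be restored.
        self.undo_log.append((key, key in self.data, self.data.get(key)))
        self.data[key] = value

    def commit(self):
        # Once the transaction is durable, its undo records are obsolete.
        self.undo_log.clear()
        self.in_tx = False

    def recover(self):
        """Called after a crash: undo the unfinished transaction, newest first."""
        while self.undo_log:
            key, existed, old_value = self.undo_log.pop()
            if existed:
                self.data[key] = old_value
            else:
                self.data.pop(key, None)
        self.in_tx = False


store = UndoLogStore()
store.begin()
store.write("a", 1)
store.commit()

store.begin()
store.write("a", 2)
store.write("b", 3)
store.recover()   # simulate a crash before commit
print(store.data)
```

After recovery only the committed write to `a` survives, which is the consistency guarantee the abstract attributes to the undo log.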
PrePrint: A Hardware-Software Cooperative Approach for Application Energy Profiling
Energy consumption by software applications is a critical issue that determines the future of multicore software development. In this article, we propose a hardware-software cooperative approach that uses hardware support to efficiently gather energy-related hardware counters during program execution, and utilizes parameter estimation models in software to compute the energy consumed by instructions at a finer granularity (e.g., per basic block). We design mechanisms to minimize collinearity in profiler data, and present results to validate our energy estimation methodology.
PrePrint: CIDR: A Cache Inspired Area-Efficient DRAM Resilience Architecture against Permanent Faults
Faulty cells have become major problems in cost-sensitive main-memory DRAM devices. Conventional solutions to reduce device failure rates due to cells with permanent faults, such as populating spare rows and relying on error-correcting codes, have had limited success due to high area overheads. In this paper, we propose CIDR, a novel cache-inspired DRAM resilience architecture, which substantially reduces the area overhead of handling bit errors from these faulty cells. A DRAM device adopting CIDR has a small cache next to its I/O pads to replace accesses to the addresses that include the faulty cells with ones that correspond to the cache data array. We minimize the energy overhead of accessing the cache tags for every read or write by adding a Bloom filter in front of the cache. The augmented cache is programmed once during the testing phase and is out of the critical path on normal accesses because both cache and DRAM arrays are accessed in parallel, making CIDR transparent to existing processor-memory interfaces. Compared to the conventional architecture relying on spare rows, CIDR lowers the area overhead of achieving equal failure rates over a wide range of single-bit error rates, such as 23.6× lower area overhead for a bit-error rate of 10^-5 and a device failure rate of 10^-3.
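The idea of placing a Bloom filter in front of the cache tags can be sketched in software. This is only an illustration of the filtering principle, not CIDR's circuit: the names `BloomFilter` and `FaultRemapCache`, the filter size, and the SHA-256-based hashing are assumptions, and the lookups here are sequential whereas the abstract describes the cache and DRAM arrays being accessed in parallel.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class FaultRemapCache:
    """Redirects accesses to known-faulty addresses into spare storage."""

    def __init__(self, faulty_addresses):
        self.filter = BloomFilter()
        self.spare = {}        # stands in for the cache tag + data arrays
        self.tag_lookups = 0   # how often the (costly) tag check ran
        for addr in faulty_addresses:  # programmed once, at test time
            self.filter.add(addr)
            self.spare[addr] = 0

    def _hit(self, addr):
        if not self.filter.might_contain(addr):
            return False       # definitely not faulty: skip the tag lookup
        self.tag_lookups += 1
        return addr in self.spare

    def read(self, addr, dram):
        return self.spare[addr] if self._hit(addr) else dram[addr]

    def write(self, addr, value, dram):
        if self._hit(addr):
            self.spare[addr] = value
        else:
            dram[addr] = value


dram = {addr: 0 for addr in range(256)}
cache = FaultRemapCache(faulty_addresses=[0x10, 0x20])
cache.write(0x10, 7, dram)   # faulty address: stored in the spare array
cache.write(0x11, 9, dram)   # healthy address: goes to DRAM
print(cache.read(0x10, dram), cache.read(0x11, dram))
```

Because a Bloom filter never reports a false negative, every faulty address still reaches the tag check, while most healthy addresses are filtered out before it.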
PrePrint: Constrained Energy Optimization in Heterogeneous Platforms using Generalized Scaling Models
Platform energy consumption and responsiveness are two major considerations for mobile systems since they determine the battery life and user satisfaction, respectively. We first present models for power consumption, response time and energy consumption of heterogeneous mobile platforms. Then, we use these models to optimize the energy consumption of baseline platforms under response time and temperature constraints, with and without introducing new resources. We show that the optimal design choices depend on the dynamic power management algorithm, and that adding new resources is more energy efficient than scaling existing resources alone.