IEEE Computer Architecture Letters - new TOC

TOC Alert for Publication# 10208


Cache Calculus: Modeling Caches through Differential Equations

Jan.-June 1 2017

Caches are critical to performance, yet their behavior is hard to understand and model. In particular, prior work does not provide closed-form solutions of cache performance, i.e., simple expressions for the miss rate of a specific access pattern. Existing cache models instead use numerical methods that, unlike closed-form solutions, are computationally expensive and yield limited insight. We present cache calculus, a technique that models cache behavior as a system of ordinary differential equations, letting standard calculus techniques find simple and accurate solutions of cache performance for common access patterns.

CARB: A C-State Power Management Arbiter for Latency-Critical Workloads

Jan.-June 1 2017

Latency-critical workloads in datacenters have tight response time requirements to meet service-level agreements (SLAs). Sleep states (c-states) enable servers to reduce their power consumption during idle times; however, entering and exiting c-states is not instantaneous, leading to increased transaction latency. In this paper we propose a c-state arbitration technique, CARB, that minimizes response time while simultaneously realizing the power savings that could be achieved from enabling c-states. CARB adapts to incoming request rates and processing times and activates the smallest number of cores for processing the current load. CARB reshapes the distribution of c-states and minimizes the latency cost of sleep by avoiding going into deep sleeps too often. We quantify the improvements from CARB with memcached running on an 8-core Haswell-based server.
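
The idea of activating the smallest number of cores for the current load can be illustrated with a back-of-the-envelope calculation. The sketch below is not CARB's algorithm, which the abstract does not describe in detail; it only assumes that the average number of busy cores equals the request rate times the mean processing time, and that some utilization headroom is kept. The function name, its parameters, and the 0.7 utilization target are all hypothetical.

```python
import math


def min_active_cores(arrival_rate_rps: float,
                     mean_service_time_s: float,
                     total_cores: int,
                     target_utilization: float = 0.7) -> int:
    """Estimate how many cores must stay awake for the current load.

    Assumption (not from the abstract): each active core should stay at or
    below `target_utilization` on average; 0.7 is an arbitrary example value.
    """
    if arrival_rate_rps < 0 or mean_service_time_s < 0:
        raise ValueError("rate and service time must be non-negative")
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")

    # Offered load: the average number of cores kept busy by the requests.
    offered_load = arrival_rate_rps * mean_service_time_s
    # Add headroom, round up to whole cores, and clamp to what the server has.
    needed = math.ceil(offered_load / target_utilization)
    return max(1, min(total_cores, needed))


if __name__ == "__main__":
    # 20,000 requests/s at 100 microseconds each on an 8-core server.
    print(min_active_cores(20_000, 100e-6, total_cores=8))
```

Under these assumptions the remaining cores could be left in a sleep state; a real arbiter would also have to account for wake-up latency and bursts in the request rate.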

CasHMC: A Cycle-Accurate Simulator for Hybrid Memory Cube

Jan.-June 1 2017

3D-stacked DRAM has been actively studied to overcome the limits of conventional DRAM. The Hybrid Memory Cube (HMC) is a type of 3D-stacked DRAM that has drawn great attention because of its usability for server systems and processing-in-memory (PIM) architectures. Since HMC is not directly stacked on the processor die where the central processing units (CPUs) and graphics processing units (GPUs) are integrated, HMC has to be linked to other processor components through high-speed serial links. Therefore, the communication bandwidth and latency should be carefully estimated to evaluate the performance of HMC. However, most existing HMC simulators employ only simple HMC modeling. In this paper, we propose a cycle-accurate simulator for the Hybrid Memory Cube called CasHMC. It provides a cycle-by-cycle simulation of every module in an HMC and generates analysis results including a bandwidth graph and statistical data. Furthermore, CasHMC is implemented in C++ as a single wrapped object that includes an HMC controller, communication links, and HMC memory. Instantiating this single wrapped object facilitates simultaneous simulation in parallel with other simulators that generate memory access patterns, such as a processor simulator or a memory trace generator.

Cloud Server Benchmark Suite for Evaluating New Hardware Architectures

Jan.-June 1 2017

Adding new hardware features to a cloud computing server requires testing both the functionality and the performance of the new hardware mechanisms. However, commonly used cloud computing server workloads are not well represented by the SPEC integer and floating-point benchmarks and the Parsec suite typically used by the computer architecture community. Existing cloud benchmark suites for scale-out or scale-up computing are not representative of the most common cloud usage, and are very difficult to run on a cycle-accurate simulator that can accurately model new hardware, like gem5. In this paper, we present PALMScloud, a suite of cloud computing benchmarks for performance evaluation of cloud servers that is ready to run on the gem5 cycle-accurate simulator. We conduct a behavior characterization and analysis of the benchmarks. We hope that these cloud benchmarks, ready to run on a dual-machine gem5 simulator or on real machines, can be useful to other researchers interested in improving hardware micro-architecture and cloud server performance.

Counter-Based Tree Structure for Row Hammering Mitigation in DRAM

Jan.-June 1 2017

Scaling down DRAM technology degrades cell reliability due to increased coupling between adjacent DRAM cells, commonly referred to as crosstalk. Moreover, high access frequency of certain cells (hot cells) may cause data loss in neighboring cells in adjacent rows due to crosstalk, which is known as row hammering. In this work, the goal is to mitigate row hammering in DRAM cells through a Counter-Based Tree (CBT) approach. This approach uses a tree of counters to detect hot rows and then refreshes neighboring cells. In contrast to existing deterministic solutions, CBT utilizes fewer counters, which makes it practical to implement on-chip. Compared to existing probabilistic approaches, CBT more precisely refreshes rows vulnerable to row hammering based on their access frequency. Experimental results on workloads from three benchmark suites show that CBT can reduce the refresh energy by more than 60 percent and nearly 70 percent in comparison to leading probabilistic and deterministic approaches, respectively. Furthermore, hardware evaluation shows that CBT can be easily implemented on-chip with only a nominal overhead.

Covert Channels on GPGPUs

Jan.-June 1 2017

GPUs are increasingly used to accelerate the performance of not only graphics workloads, but also data intensive applications. In this paper, we explore the feasibility of covert channels in General Purpose Graphics Processing Units (GPGPUs). We consider the possibility of two colluding malicious applications using the GPGPU as a covert channel to communicate, in the absence of a direct channel between them. Such a situation may arise in cloud environments, or in environments employing containment mechanisms such as dynamic information flow tracking. We reverse engineer the block placement algorithm to understand co-residency of blocks from different applications on the same Streaming Multiprocessor (SM) core, or on different SMs concurrently. In either mode, we identify the shared resources that may be used to create contention. We demonstrate the bandwidth of two example channels: one that uses the L1 constant memory cache to enable communication on the same SM, and another that uses the L2 constant memory caches to enable communication between different SMs. We also examine the possibility of increasing the bandwidth of the channel by using the available parallelism on the GPU, achieving a bandwidth of over 400 Kbps. This study demonstrates that GPGPUs are a feasible medium for covert communication.

Evaluation of Performance Unfairness in NUMA System Architecture

Jan.-June 1 2017

NUMA (non-uniform memory access) system architectures are commonly used in high-performance computing and datacenters. Within each architecture, a processor-interconnect is used for communication between the different sockets; examples of such interconnects include Intel QPI and AMD HyperTransport. In this work, we explore the impact of the processor-interconnect on overall performance; in particular, we explore the impact on performance fairness from the processor-interconnect arbitration. It is well known that locally-fair arbitration does not guarantee globally-fair bandwidth sharing, as closer nodes receive more bandwidth in a multi-hop network. However, this paper is the first to demonstrate that the opposite can occur in commodity NUMA servers, where remote nodes receive higher bandwidth (and perform better). This problem occurs because router micro-architectures for processor-interconnects commonly employ external concentration. While accessing remote memory can occur in any NUMA system, performance unfairness (or performance variation) is more critical in cloud computing and virtual machines with shared resources. We demonstrate how this unfairness creates significant performance variation when executing workloads on the Xen virtualization platform. We then provide analysis using synthetic workloads to better understand the source of unfairness.

Extending Amdahl’s Law for Multicores with Turbo Boost

Jan.-June 1 2017

Rewriting sequential programs to make use of multiple cores requires considerable effort. For many years, Amdahl's law has served as a guideline to assess the performance benefits of parallel programs over sequential ones, but recent advances in multicore design introduced variability in the performance of the cores and motivated the reexamination of the underlying model. This paper extends Amdahl's law for multicore processors with built-in dynamic frequency scaling mechanisms such as Intel's Turbo Boost. Using a model that captures performance dependencies between cores, we present tighter upper bounds for the speedup and reduction in energy consumption of a parallel program over a sequential one on a given multicore processor and validate them on Haswell and Sandy Bridge Intel CPUs. Previous studies have shown that from a processor design perspective, Turbo Boost mitigates the speedup limitations obtained under Amdahl's law by providing higher performance for the same energy budget. However, our new model and evaluation show that from a software development perspective, Turbo Boost aggravates these limitations by making parallelization of sequential codes less profitable.
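
For reference, classic Amdahl's law gives the speedup of a program with parallelizable fraction p on n identical cores as 1 / ((1 - p) + p / n). The sketch below computes that bound and, next to it, a deliberately simplified variant in which a single active core runs faster than cores do when all are busy. The second function is an illustrative assumption and not the model from the paper; the name `boosted_speedup` and the `boost` parameter are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Classic Amdahl's law for `cores` identical cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def boosted_speedup(parallel_fraction: float, cores: int, boost: float) -> float:
    """Simplified illustration, NOT the paper's model.

    Assumption: when only one core is active it runs `boost` times faster
    than each core does when all cores are active. The sequential baseline
    and the serial phase therefore both run at the boosted speed.
    """
    if boost <= 0:
        raise ValueError("boost must be positive")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    sequential_time = 1.0 / boost
    parallel_time = serial_fraction / boost + parallel_fraction / cores
    return sequential_time / parallel_time


if __name__ == "__main__":
    print(amdahl_speedup(0.9, 8))          # classic bound
    print(boosted_speedup(0.9, 8, 1.25))   # lower when the baseline is boosted
```

With `boost` set to 1.0 the variant reduces to the classic formula. With a boost above 1.0 the sequential baseline gets faster, so the relative gain from parallelizing shrinks, which is consistent in direction with the abstract's conclusion about the software development perspective.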

Heavy Tails in Program Structure

Jan.-June 1 2017

Designing and optimizing computer systems require a deep understanding of the underlying system behavior. Historically, many important observations that led to the development of essential hardware and software optimizations were driven by empirical observations about program behavior. In this paper, we report an interesting property of program structures by viewing dynamic program execution as a changing network. By analyzing the communication network created as a result of dynamic program execution, we find that communication patterns follow heavy-tailed distributions. In other words, a few instructions have orders of magnitude more consumers than most instructions in a program. Surprisingly, these heavy-tailed distributions follow the iconic power law previously seen in man-made and natural networks. We provide empirical measurements based on the SPEC CPU2006 benchmarks to validate our findings and perform semantic analysis of the source code to reveal the causes of such behavior.

HeteroSim: A Heterogeneous CPU-FPGA Simulator

Jan.-June 1 2017

Heterogeneous Computing is a promising direction to address the challenges of performance and power walls in high-performance computing, where CPU-FPGA architectures are particularly promising for application acceleration. However, the development of such architectures associated with optimal memory hierarchies is challenging due to the absence of an integrated simulator to support full system simulation and architectural exploration. In this work, we present HeteroSim, a full system simulator supporting x86 multi-cores integrated with an FPGA via bus connection. It can support fast architectural exploration with respect to number of cores, number of accelerated kernels on FPGA, and different memory hierarchies between CPU and FPGA. Various performance metrics are returned for further performance analysis and architectural configuration optimization.

LA-LLC: Inter-Core Locality-Aware Last-Level Cache to Exploit Many-to-Many Traffic in GPGPUs

Jan.-June 1 2017

The reply network is a severe performance bottleneck in General Purpose Graphics Processing Units (GPGPUs), as the communication path from memory controllers (MCs) to cores is often congested. In this paper, we find that instead of relying on the congested communication path between MCs and cores, the unused core-to-core communication path can be leveraged to transfer data blocks between cores. We propose the inter-core Locality-Aware Last-Level Cache (LA-LLC), which requires only a few bits per cache block and enables a core to fetch shared data from another core's private cache instead of the LLC. Leveraging inter-core communication, LA-LLC transforms few-to-many traffic to many-to-many traffic, thereby mitigating the reply network bottleneck. For a set of applications exhibiting varying degrees of inter-core locality, LA-LLC reduces memory access latency and increases performance by 21.1 percent on average and up to 68 percent, with negligible hardware cost.

LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory

Jan.-June 1 2017

Processing-in-memory (PIM) architectures cannot use traditional approaches to cache coherence due to the high off-chip traffic consumed by coherence messages. We propose LazyPIM, a new hardware cache coherence mechanism designed specifically for PIM. LazyPIM uses a combination of speculative cache coherence and compressed coherence signatures to greatly reduce the overhead of keeping PIM coherent with the processor. We find that LazyPIM improves average performance across a range of PIM applications by 49.1 percent over the best prior approach, coming within 5.5 percent of an ideal PIM mechanism.

Measuring the Impact of Memory Errors on Application Performance

Jan.-June 1 2017

Memory reliability is a key factor in the design of warehouse-scale computers. Prior work has focused on the performance overheads of memory fault-tolerance schemes when errors do not occur at all, and when detected but uncorrectable errors occur, which result in machine downtime and loss of availability. We focus on a common third scenario, namely, situations when hard but correctable faults exist in memory; these may cause an “avalanche” of errors to occur on affected hardware. We expose how the hardware/software mechanisms for managing and reporting memory errors can cause severe performance degradation in systems suffering from hardware faults. We inject faults in DRAM on a real cloud server and quantify the single-machine performance degradation for both batch and interactive workloads. We observe that for SPEC CPU2006 benchmarks, memory errors can slow down average execution time by up to 2.5×. For an interactive web-search workload, average query latency degrades by up to 2.3× for a light traffic load, and up to an extreme 3746× under peak load. Our analyses of the memory error-reporting stack reveal architecture, firmware, and software opportunities to improve performance consistency by mitigating the worst-case behavior on faulty hardware.

Mind The Power Holes: Sifting Operating Points in Power-Limited Heterogeneous Multicores

Jan.-June 1 2017

Heterogeneous chip multicore processors (HCMPs) equipped with multiple voltage-frequency (V-F) operating points provide a wide spectrum of power-performance tradeoff opportunities. This work targets the performance of HCMPs under a power cap. We show that for any performance optimization technique to work under power constraints, the default set of V-F operating points in HCMPs must first be filtered based on the application's power and performance characteristics. Attempting to find operating points of maximum performance by naively walking the default set of operating points leads the application to inefficient operating points which drain power without significant performance benefit. We call these points Power Holes (PH). Contrary to intuition, we show that even using a power-performance curve of Pareto-optimal operating points still degrades performance significantly for the same reason. We propose PH-Sifter, a fast and scalable technique that sifts the default set of operating points and eliminates power holes. We show significant performance improvement of PH-Sifter compared to Pareto sifting for three use cases: (i) maximizing performance for a single application, (ii) maximizing system throughput for multi-programmed workloads, and (iii) maximizing performance of a system in which a fraction of the power budget is reserved for a high-priority application. Our results show average performance improvements of 13, 27, and 28 percent, reaching up to 52 percent, 91 percent, and 2.3×, respectively, for the three use cases.
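
To make the Pareto-sifting baseline mentioned in the abstract concrete, the sketch below filters a set of operating points down to those that are Pareto-optimal in power and performance, then picks the fastest one under a power cap. It illustrates only the baseline the authors compare against; it does not implement PH-Sifter, whose criteria are not given in the abstract. The tuple layout, function names, and sample numbers are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (power in watts, performance in arbitrary units).
OperatingPoint = Tuple[float, float]


def pareto_front(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep points for which no other point has lower-or-equal power
    and higher-or-equal performance (with at least one strictly better)."""
    # Sort by power ascending; for equal power, highest performance first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    front: List[OperatingPoint] = []
    best_perf = float("-inf")
    for power, perf in ordered:
        # A point survives only if it beats every cheaper-or-equal point.
        if perf > best_perf:
            front.append((power, perf))
            best_perf = perf
    return front


def best_under_cap(points: List[OperatingPoint],
                   power_cap: float) -> Optional[OperatingPoint]:
    """Fastest Pareto-optimal point whose power fits within the cap."""
    feasible = [p for p in pareto_front(points) if p[0] <= power_cap]
    return max(feasible, key=lambda p: p[1]) if feasible else None


if __name__ == "__main__":
    sample = [(10, 1.0), (15, 1.4), (20, 1.3), (25, 1.8), (15, 1.2)]
    print(pareto_front(sample))
    print(best_under_cap(sample, power_cap=22))
```

Note that a Pareto filter keeps any point that adds even a marginal performance gain, however much extra power it costs, which may be why such points can still behave as power holes.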

Mitigating Power Contention: A Scheduling Based Approach

Jan.-June 1 2017

Shared resource contention has been a major performance issue for CMPs. In this paper, we tackle the power contention problem in power constrained CMPs by considering and treating power as a first-class shared resource. Power contention occurs when multiple processes compete for power, and leads to degraded system performance. In order to solve this problem, we develop a shared resource contention-aware scheduling algorithm that mitigates the contention for power and the shared memory subsystem at the same time. The proposed scheduler improves system performance by balancing the shared resource usage among scheduling groups. Evaluation results across a variety of multiprogrammed workloads show performance improvements over a state-of-the-art scheduling policy which only considers memory subsystem contention.

Mth: Codesigned Hardware/Software Support for Fine Grain Threads

Jan.-June 1 2017

Multi-core processors are ubiquitous in all market segments from embedded to high performance computing, but only a few applications can efficiently utilize them. Existing parallel frameworks aim to support thread-level parallelism in applications, but the imposed overhead prevents their usage for small problem instances. This work presents Micro-threads (Mth), a hardware-software proposal focused on a shared thread management model enabling the use of parallel resources in applications that have small chunks of parallel code or small problem inputs by a combination of software and hardware: delegation of the resource control to the application, an improved mechanism to store and fill the processor's context, and an efficient synchronization system. Four sample applications are used to test our proposal: HSL filter (trivially parallel), FFT Radix2 (recursive algorithm), LU decomposition (barrier every cycle) and Dantzig algorithm (graph based, matrix manipulation). The results encourage the use of Mth and could smooth the use of multiple cores for applications that currently cannot take advantage of the proliferation of the available parallel resources in each chip.

Optimizing Read-Once Data Flow in Big-Data Applications

Jan.-June 1 2017

Memory hierarchies in modern computing systems work well for workloads that exhibit temporal data locality. Data that is accessed frequently is brought closer to the computing cores, allowing faster access times, higher bandwidth, and reduced transmission energy. Many applications that work on big data, however, read data only once. When running these applications on modern computing systems, data that is not reused is nevertheless transmitted and copied into all memory hierarchy levels, leading to energy and bandwidth waste. In this paper we evaluate workloads dealing with read-once data and measure their energy consumption. We then modify the workloads so that data that is known to be used only once is transferred directly from storage into the CPU's last level cache, effectively bypassing DRAM and avoiding keeping unnecessary copies of the data. Our measurements on a real system show savings of up to 5 Watts in server power and up to 3.9 percent reduction in server energy when 160 GB of read-once data bypasses DRAM.

Power-Efficient Accelerator Design for Neural Networks Using Computation Reuse

Jan.-June 1 2017

Applications of neural networks in various fields of research and technology have expanded widely in recent years. In particular, applications with inherent tolerance to accuracy loss, such as signal processing and multimedia applications, are highly suited to the approximation property of neural networks. This approximation property has been exploited in many existing neural network accelerators to trade off accuracy for power-efficiency and speed. In addition to the power saving obtained by approximation, we observed that a considerable number of arithmetic operations in neural networks are repetitive and can be eliminated to further decrease power consumption. Given this observation, we propose CORN, a COmputation Reuse-aware Neural network accelerator that allows neurons to share their computation results, effectively eliminating the power usage of redundant computations. We will show that CORN lowers power consumption by 26 percent on average over low-power neural network accelerators.

SALAD: Achieving Symmetric Access Latency with Asymmetric DRAM Architecture

Jan.-June 1 2017

Memory access latency has significant impact on application performance. Unfortunately, the random access latency of DRAM has been scaling relatively slowly, and often directly affects the critical path of execution, especially for applications with insufficient locality or memory-level parallelism. The existing low-latency DRAM organizations either incur significant area overhead or burden the software stack with non-uniform access latency. This paper proposes SALAD, a new DRAM device architecture that provides symmetric access latency with asymmetric DRAM bank organizations. Since local banks have lower data transfer time due to their proximity to the I/O pads, SALAD applies high-aspect-ratio (i.e., low-latency) mats only to remote banks to offset the difference in data transfer time, thus providing uniformly low access time (tAC) over the whole device. Our evaluation demonstrates that SALAD improves the IPC by 13 percent (10 percent) without any software modifications, while incurring only 6 percent (3 percent) area overhead.

Stripes: Bit-Serial Deep Neural Network Computing

Jan.-June 1 2017

The numerical representation precision required by the computations performed by Deep Neural Networks (DNNs) varies across networks and between layers of the same network. This observation motivates a precision-based approach to acceleration which takes into account both the computational structure and the required numerical precision representation. This work presents Stripes (STR), a hardware accelerator that uses bit-serial computations to improve energy efficiency and performance. Experimental measurements over a set of state-of-the-art DNNs for image classification show that STR improves performance over a state-of-the-art accelerator from 1.35x to 5.33x and by 2.24x on average. STR's area and power overhead are estimated at 5 percent and 12 percent respectively. STR is 2.00x more energy efficient than the baseline.
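
The general idea of bit-serial computation can be sketched in software: a dot product is evaluated one activation bit position at a time, so the number of steps scales with the precision used. The code below is a minimal illustration of bit-serial arithmetic in general, assuming unsigned activations; it does not model the STR hardware, and the function name and parameters are hypothetical.

```python
from typing import Sequence


def bit_serial_dot(activations: Sequence[int],
                   weights: Sequence[int],
                   precision_bits: int) -> int:
    """Dot product computed one activation bit position per step.

    Assumption: activations are unsigned integers that fit in
    `precision_bits` bits; weights may be any integers.
    """
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have equal length")
    limit = 1 << precision_bits
    if any(a < 0 or a >= limit for a in activations):
        raise ValueError("activation does not fit in precision_bits")

    total = 0
    # One outer iteration per bit: fewer bits of precision means fewer steps.
    for bit in range(precision_bits):
        partial = 0
        for a, w in zip(activations, weights):
            # Add the weight only where this activation bit is set.
            if (a >> bit) & 1:
                partial += w
        # Weight the partial sum by the significance of the current bit.
        total += partial << bit
    return total


if __name__ == "__main__":
    acts = [3, 5, 0, 7]
    wts = [2, -1, 4, 3]
    print(bit_serial_dot(acts, wts, precision_bits=3))
    print(sum(a * w for a, w in zip(acts, wts)))  # conventional result
```

Because the outer loop runs once per bit, a layer that tolerates lower precision needs fewer steps in this scheme, which is the intuition behind a precision-based approach.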

Timing Speculation in Multi-Cycle Data Paths

Jan.-June 1 2017

Modern processors set timing margins conservatively at design time to support extreme variations in workload and environment, in order to operate reliably and produce expected outputs. Unfortunately, the conservative guard bands set to achieve this reliability are detrimental to processor performance and energy efficiency. In this paper, we propose the use of processors with internal transparent pipelines, which allow data to flow between stages without latching, to maximize timing speculation efficiency as they are inherently suited to slack conservation. We design a synchronous tracking mechanism which runs in parallel with the multi-cycle data path to estimate the accumulated slack across instructions/pipeline stages and then appropriately clock synchronous boundaries early to minimize wasted slack and achieve maximum clock cycle savings. Preliminary evaluations atop the CRIB processor show performance improvements of greater than 10% on average and as high as 30% for an assumed 25% slack per clock cycle.

2016 Index IEEE Computer Architecture Letters Vol. 15

Jan.-June 2017

Presents the 2016 author/subject index for this issue of the publication.