Preview: IEEE Micro
IEEE Micro, a bimonthly publication of the IEEE Computer Society, reaches an international audience of microcomputer and microprocessor designers, system integrators, and users. Readers want to increase their technical knowledge of computers and periph
Published: Mon, 3 Nov 2014 15:35:24 GMT
PrePrint: Decentralized NIC-switching Architecture Using SR-IOV PCIe Network Device
To increase the flexibility and bandwidth of intra-rack communication, we propose a decentralized NIC-switching architecture that enables rack-level network bandwidth disaggregation. This is the first solution that uses the built-in switch of SR-IOV-compliant PCIe NICs in data center environments to handle the traffic of a rack. We design a decentralized switching topology in which a pool of virtual NICs (vNICs) from several SR-IOV-compliant PCIe NICs can be shared among multiple servers in a rack instead of each being permanently assigned to one server. To take full advantage of this new architecture, we also develop a mechanism to dynamically assign vNICs to each server and distribute the traffic from each server across the vNICs allocated to it. By utilizing the high bandwidth of PCIe technology, the new architecture provides higher switching capacity and more flexibility than the traditional ToR-centric architecture. In addition, the dynamic allocation of vNICs to servers enables flexible bandwidth adjustment for servers according to traffic demands. As a preliminary stage, this paper focuses on exploiting this unique design point and implements an FPGA prototype to prove the technical feasibility of the proposed architecture.
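The abstract does not describe the allocation mechanism itself, so the following is only an illustrative sketch of the general idea: vNICs from a shared pool are handed out in proportion to each server's traffic demand, and each server's flows are then spread over the vNICs it holds. The function names, the proportional (largest-remainder) policy, and the CRC-based flow hashing are all assumptions made for illustration, not the authors' design.

```python
import zlib


def allocate_vnics(demands, pool_size):
    """Split a shared vNIC pool among servers in proportion to demand.

    demands:   dict mapping server name -> traffic demand (any unit)
    pool_size: total number of vNICs available in the rack
    Every server gets at least one vNIC (an assumption of this sketch).
    """
    if pool_size < len(demands):
        raise ValueError("need at least one vNIC per server")
    alloc = {server: 1 for server in demands}
    spare = pool_size - len(demands)
    total = sum(demands.values())
    if spare == 0 or total == 0:
        return alloc
    # Ideal fractional share of the spare vNICs for each server.
    shares = {s: spare * d / total for s, d in demands.items()}
    for s in demands:
        alloc[s] += int(shares[s])
    # Hand out what is left to the largest fractional remainders.
    leftover = pool_size - sum(alloc.values())
    ranked = sorted(demands, key=lambda s: shares[s] - int(shares[s]), reverse=True)
    for s in ranked[:leftover]:
        alloc[s] += 1
    return alloc


def pick_vnic(flow_id, vnics):
    """Map a flow to one of the server's vNICs with a deterministic hash."""
    return vnics[zlib.crc32(flow_id.encode()) % len(vnics)]
```

Re-running `allocate_vnics` whenever measured demand changes would give the kind of flexible bandwidth adjustment the abstract refers to; a real implementation would also have to handle the cost of remapping vNICs between servers.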
PrePrint: Photonic Interconnects for Exascale and Datacenter Architectures
Exascale and datacenter systems require terabits per second (Tb/s) of inter-node communication bandwidth to meet the performance demands of High-Performance Computing (HPC) applications. High-radix routers combined with the scalable Dragonfly topology have been proposed for reducing execution time and improving power dissipation. While the Dragonfly network has a low diameter for exascale networks, its fewer global links reduce the bisection bandwidth and require adaptive routing to prevent hot-spots due to congestion. Moreover, the number of ports in a high-radix router impacts the router cost when implemented with alternate emerging technologies. In this paper, we advocate that multi-tier network topologies that combine scalable topologies for local (intra-cabinet) and global (inter-cabinet) interconnects, such as the k-ary n-cube, the Flattened Butterfly, and the Dragonfly, can lead to improved bisection, manageable radix, and reduced link costs, albeit at higher packet latency due to increased diameter. As the performance/Watt delivered by metallic interconnects or coaxial cables will significantly exceed the limited power budget available, we envision that the entire exascale network will be composed of photonic links for communication and complementary metal-oxide-semiconductor (CMOS) routers for switching. Our results indicate that multi-tier topologies are comparable to the single-level Dragonfly topology in terms of power and latency while providing higher bisection and reduced area overhead.
PrePrint: Twice as nice without the double of the trouble: Twin data center interconnection topology
In this paper we propose the Twin server-centric data center network topology, which is based on a particular class of graphs called twin graphs. Using the properties of these graphs, we show that Twin topologies are fault tolerant, resilient, scalable, and cost-efficient. They can be easily generated by a recursive method and have growing and merging processes that allow incremental expansion from a single server to the interconnection of two modular data center containers. Therefore, Twin data centers can be built with an arbitrary number of servers. Moreover, Twin topologies have a lower link cost than previous data center topologies, which reduces power consumption and decreases CAPEX and OPEX.
PrePrint: NetFPGA SUME: Toward Research Commodity 100Gb/s
The demand-led growth of datacenter networks has meant that many such technologies are beyond the budget of the research community. To make and validate timely and relevant research contributions, the wider research community requires accessible evaluation, experimentation, and demonstration environments with specifications comparable to the subsystems of the most massive datacenter networks. We present NetFPGA-SUME, an FPGA-based PCIe board with I/O capabilities for 100Gb/s operation as a NIC, multiport switch, firewall, or test/measurement environment. As a powerful new NetFPGA platform, SUME provides an accessible development environment that both reuses existing codebases and enables new designs.
PrePrint: R3TOS-based Autonomous Fault-Tolerant Systems
An Autonomous Fault-Tolerant System (AFTS) is a system that is able to (re)configure its own resources in the presence of permanent defects and spontaneous random faults occurring in its silicon substrate in order to maintain its original functionality. This capability makes AFTSes especially suitable for harsh environments, where traditional electronics technology is susceptible to failure. This article describes the contributions of our Reliable Reconfigurable Real-Time Operating System (R3TOS) to building an AFTS using currently available Xilinx partially reconfigurable FPGAs. Namely, this article discusses what R3TOS offers for developing durable, dependable, and real-time embedded systems to be used in rugged environments. In this context, the article presents an R3TOS-based inverter controller of a real-world railway traction system that is proven to recover from most of the errors induced in it without requiring any human intervention.
PrePrint: Timing Verification of Fault-Tolerant Chips for Safety-Critical Applications in Harsh Environments
Critical Real-Time Embedded Systems (CRTES), which are deployed in cars, planes, and satellites, among others, feature increasingly complex, safety-related, performance-demanding functionality. Such functionality can only realistically be provided by means of advanced (high-performance) hardware and software. This will inevitably shift CRTES from using simple control software running on in-order, single-core processors with no caches to complex multi-sensor and multi-actuator software running on 'aggressive' processors implemented in nanoscale technology and deploying several computing cores and a cache hierarchy. However, the use of aggressive technologies and architectures challenges time predictability and reliability, which are mandatory features in CRTES. In this paper we present a processor design that reconciles all three goals, namely predictability, reliability, and high performance. Our design obtains trustworthy and tight worst-case execution time (WCET) estimates for safety-critical applications running on high-performance hardware facing hard and soft errors by means of a smart use of timing analysis techniques in combination with minor hardware modifications.
PrePrint: The Combined Input-Output Queued (CIOQ) Crossbar Architecture for High-Radix On-Chip Switches
High-radix single-chip routers have emerged as efficient building blocks for interconnection networks. It is commonly believed that at high radices hierarchical switch architectures are needed, as crossbars scale with the square of the router radix. This article proposes a novel micro-architecture that allows flat crossbar switches to scale to 128 ports supporting 32 Gb/s/port while occupying 4.9 mm² and consuming 4.2 W, or supporting 64 Gb/s/port at 7.5 mm² and 7.5 W, in 45 nm CMOS. Key features include deep crossbar pipelining to cope with wire delay, a novel cross scheduler architecture to reduce wiring complexity, and catalytic custom gate placement within standard Electronic Design Automation (EDA) flows. It is also shown that, on chip, crossbar speedup and Combined Input-Output Queuing (CIOQ) are better than Hierarchical Queueing (HQ), providing top performance with orders of magnitude lower memory cost. Finally, a comparison with the recently developed Swizzle-Switch prototypes is presented and the potential of high-radix crossbars for System-on-a-Chip interconnects is advocated.
PrePrint: Silicon Odometers: Compact In-situ Aging Sensors for Robust System Design
Circuit reliability issues such as Bias Temperature Instability (BTI), Hot Carrier Injection (HCI), Time Dependent Dielectric Breakdown (TDDB), Electromigration (EM), and Random Telegraph Noise (RTN) have become a growing concern with technology scaling. Precise measurements of circuit degradation induced by these reliability mechanisms are a key aspect of robust design. This article reviews a number of unique test chip designs pursued by our group that demonstrate the benefits of utilizing on-chip logic and a simple test interface to automate circuit aging experiments. This new class of compact on-chip sensors can reveal important aspects of circuit aging that would otherwise be impossible to measure, can facilitate the collection of reliability data from systems deployed in the field, and can eventually lead to real-time aging compensation in future processors.
PrePrint: Reliable Computing with Ultra-Reduced Instruction Set Co-processors
This work presents a method to reliably perform computations in the presence of hard faults arising from aggressive technology scaling and design defects arising from human error. Our method is based on the observation that a single Turing-complete instruction can mirror the semantics of any other instruction. One such instruction is the subleq instruction, which has been used for instructional purposes in the past. The scope for using such a Turing-complete instruction is far greater than instructional purposes, and thus we present its applicability to fault tolerance. In particular, we extend a MIPS processor with a co-processor (called the ultra-reduced instruction set co-processor, or URISC) that implements the subleq instruction. The URISC executes sequences of subleq to mimic the semantics of instructions that are known, after testing, to be faulty on the MIPS core. We implement a new back-end for the LLVM compiler that generates the sequence of subleq for instructions marked as faulty. This presents a hardware-software approach to fault recovery. Our hardware prototype, called MIPS-URISC, synthesizes onto an Altera FPGA. We experimentally evaluate the following: the impact of single-upset faults on the instructions that are rendered faulty, the area overhead of the URISC, and the performance overhead of using subleq with the URISC.
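To make the idea of mimicking an instruction with subleq concrete, here is a minimal sketch of a subleq interpreter and a three-instruction sequence that performs an addition. It uses the common textbook definition of subleq (subtract mem[a] from mem[b], then branch to c if the result is less than or equal to zero); the memory layout, the negative-address halt convention, and the function names are assumptions of this sketch and are not taken from the MIPS-URISC design or its LLVM back-end.

```python
def run_subleq(mem, pc=0, max_steps=10_000):
    """Run a subleq program held in `mem` until pc becomes negative.

    Each instruction is three words (a, b, c):
        mem[b] -= mem[a]; jump to c if mem[b] <= 0, else fall through.
    A negative jump target is used here as the halt convention.
    """
    steps = 0
    while pc >= 0:
        if steps >= max_steps:
            raise RuntimeError("step limit exceeded")
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        steps += 1
    return mem


def add_via_subleq(x, y):
    """Compute x + y using only subleq instructions."""
    X, Y, Z = 9, 10, 11          # data addresses; Z is a scratch cell
    mem = [
        X, Z, 3,                 # Z = Z - X      -> Z = -x
        Z, Y, 6,                 # Y = Y - Z      -> Y = y + x
        Z, Z, -1,                # Z = Z - Z = 0  -> 0 <= 0, so halt
        x, y, 0,                 # data: X, Y, Z
    ]
    return run_subleq(mem)[Y]
```

In the first two instructions the branch target equals the fall-through address, so control continues to the next instruction whichever way the comparison goes; only the last instruction uses the branch, to halt.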