Feed: Untitled
URL: http://www.freepatentsonline.com/rssfeed/rssapp712.xml
Language: English
Tags: branch, data, elements, execution, includes, instruction, instructions, memory, method, processing, processor, register, vector

METHOD FOR OBSERVING SOFTWARE EXECUTION, DEBUG HOST AND DEBUG TARGET

Thu, 23 Feb 2017 08:00:00 EST

A method, performed in a debug host, for observing software execution on a computer having one or more processor cores, a cache attached to the one or more processor cores via respective execution pipelines forming a cache arrangement, and a memory, comprises obtaining an instruction trace of the cache arrangement and a data trace for data being loaded from the memory into the cache. The instruction trace is synchronized with the data trace to generate a synchronized data trace and/or a synchronized instruction trace. A state of a memory model, representing a memory readable by the one or more processor cores via a respective instruction, is updated using the synchronized data trace and the synchronized instruction trace.



INSTRUCTION FOR FAST ZUC ALGORITHM PROCESSING

Thu, 23 Feb 2017 08:00:00 EST

Vector instructions for performing ZUC stream cipher operations are received and executed by the execution circuitry of a processor. The execution circuitry receives a first vector instruction to perform an update to a linear feedback shift register (LFSR), and receives a second vector instruction to perform an update to a state of a finite state machine (FSM), where the FSM receives inputs from re-ordered bits of the LFSR. The execution circuitry executes the first vector instruction and the second vector instruction in a single-instruction multiple-data (SIMD) pipeline.
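
For orientation, here is a minimal scalar sketch in C of the LFSR working-mode step that such a first vector instruction would accelerate, following the public ZUC specification; the patent's SIMD encoding and the FSM update are not shown, and the initial state below is a toy value rather than a key-loaded one.

```c
#include <stdint.h>
#include <stdio.h>

#define M31 0x7FFFFFFFu  /* 2^31 - 1, the prime modulus of the ZUC LFSR */

/* Addition modulo 2^31 - 1, folding the carry back in. */
static uint32_t add31(uint32_t a, uint32_t b) {
    uint32_t c = a + b;
    return (c & M31) + (c >> 31);
}

/* Multiplication by 2^k modulo 2^31 - 1 is a 31-bit rotate. */
static uint32_t rot31(uint32_t x, int k) {
    return ((x << k) | (x >> (31 - k))) & M31;
}

/* One working-mode step: compute the feedback term and shift. */
static void lfsr_work(uint32_t s[16]) {
    uint32_t f = s[0];
    f = add31(f, rot31(s[0], 8));    /* (1 + 2^8)  * s0  */
    f = add31(f, rot31(s[4], 20));   /*  2^20      * s4  */
    f = add31(f, rot31(s[10], 21));  /*  2^21      * s10 */
    f = add31(f, rot31(s[13], 17));  /*  2^17      * s13 */
    f = add31(f, rot31(s[15], 15));  /*  2^15      * s15 */
    if (f == 0) f = M31;             /* 0 and 2^31-1 are identified */
    for (int i = 0; i < 15; i++) s[i] = s[i + 1];
    s[15] = f;
}

int main(void) {
    uint32_t s[16];
    for (int i = 0; i < 16; i++) s[i] = i + 1;  /* toy state, not key-loaded */
    lfsr_work(s);
    printf("new s15 = %08x\n", (unsigned)s[15]);
    return 0;
}
```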



FUSIBLE INSTRUCTIONS AND LOGIC TO PROVIDE OR-TEST AND AND-TEST FUNCTIONALITY USING MULTIPLE TEST SOURCES

Thu, 23 Feb 2017 08:00:00 EST

Fusible instructions and logic provide OR-test and AND-test functionality on multiple test sources. Some embodiments include a processor decode stage to decode a test instruction for execution, the instruction specifying first, second and third source data operands, and an operation type. Execution units, responsive to the decoded test instruction, perform a first logical operation, according to the specified operation type, between data from the first and second source data operands, and perform a second logical operation between the data from the third source data operand and the result of the first logical operation to set a condition flag. Some embodiments generate the test instruction dynamically by fusing one logical instruction with a prior-art test instruction. Other embodiments generate the test instruction through a just-in-time compiler. Some embodiments also fuse the test instruction with a subsequent conditional branch instruction, and perform a branch according to how the condition flag is set.
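
A minimal sketch of the fused flag semantics described above, assuming the common case where the second (TEST-style) operation is an AND and the result sets a zero flag consumed by a later branch; the names fused_test_zf, OR_TEST and AND_TEST are illustrative, not from the patent.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { OR_TEST, AND_TEST } test_op;

/* Returns the zero flag the fused test instruction would produce. */
static bool fused_test_zf(test_op op, uint32_t src1, uint32_t src2,
                          uint32_t src3) {
    uint32_t t = (op == OR_TEST) ? (src1 | src2) : (src1 & src2);
    return (t & src3) == 0;   /* TEST-style AND sets ZF from the result */
}

int main(void) {
    /* e.g. branch if neither a nor b has a bit set in mask c */
    uint32_t a = 2, b = 4, c = 1;
    if (fused_test_zf(OR_TEST, a, b, c))
        puts("branch taken: (a|b) & c == 0");
    return 0;
}
```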



ELECTRONIC CONTROL UNIT

Thu, 23 Feb 2017 08:00:00 EST

In an electronic control unit, an important process includes several instructions that are successively executed and contain no branch instruction. Each of the instructions is stored in a respective storage area of memory according to an execution sequence. The storage areas are respectively assigned addresses that vary in increments of a specified value according to the execution sequence. The important process stores, in an expected value counter, the value of the program counter when control transitions to the important process. If a comparison result indicates a difference between the value of the program counter and the value of the expected value counter, an occurrence of an error is determined.
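
A software sketch of the checking scheme, assuming an illustrative 4-byte instruction stride; the simulated skip at step 2 stands in for the kind of control-flow fault the comparison is meant to catch.

```c
#include <stdint.h>
#include <stdio.h>

#define STRIDE 4u   /* illustrative address increment per instruction */

int main(void) {
    uint32_t pc = 0x1000;      /* program counter on entry to the process */
    uint32_t expected = pc;    /* expected value counter latches entry PC */

    for (int i = 0; i < 5; i++) {
        if (pc != expected) {  /* comparison detects the sequence error */
            printf("error: pc=0x%x expected=0x%x\n",
                   (unsigned)pc, (unsigned)expected);
            return 1;
        }
        /* sequential, branch-free execution; a fault skips at step 2 */
        pc += (i == 2) ? 2 * STRIDE : STRIDE;
        expected += STRIDE;
    }
    puts("important process completed without sequence error");
    return 0;
}
```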



RECONFIGURABLE PROCESSOR AND CONDITIONAL EXECUTION METHOD FOR THE SAME

Thu, 23 Feb 2017 08:00:00 EST

A reconfigurable processor and a conditional execution method for the same are provided. The reconfigurable processor includes: a routing unit, configured to assign a conditional judgment statement and a conditional execution statement to process the conditional judgment statement and the conditional execution statement in parallel; a first arithmetic logic unit, configured to process the conditional judgment statement according to an assignment of the routing unit to obtain a single-bit signal; a second arithmetic logic unit, configured to: process the conditional execution statement according to the assignment of the routing unit to obtain a conditional execution result; receive the single-bit signal; and control an output of the conditional execution result according to the single-bit signal. The reconfigurable processor according to embodiments of the present disclosure may reduce a dependence distance and a running time of the conditional branch statement and enhance an execution efficiency of the conditional branch statement by executing the conditional judgment statement and the conditional execution statement in parallel.
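
A sketch of the parallel arrangement under stated assumptions: one ALU reduces the conditional judgment to a single-bit signal while the other computes the conditional body speculatively, and the bit gates which value is output. This models the dataflow of the abstract, not a real routing unit.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t a = 7, b = 3, old = 0;

    /* first ALU: conditional judgment statement -> single-bit signal */
    int cond = (a > b);

    /* second ALU, conceptually in parallel: conditional execution body */
    int32_t result = a - b;

    /* the single-bit signal controls the output of the execution result */
    int32_t out = cond ? result : old;

    printf("out = %d\n", out);   /* 4, since the condition held */
    return 0;
}
```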



INSTRUCTIONS AND LOGIC TO VECTORIZE CONDITIONAL LOOPS

Thu, 23 Feb 2017 08:00:00 EST

A processing device to provide vectorization of conditional loops includes vector physical registers to store a source vector having a first plurality of n data fields, and a destination vector comprising a second plurality of data fields corresponding to the first plurality of data fields, wherein each of the second plurality of data fields corresponds to a mask value in a vector conditions mask. The processing device includes a decode stage to decode a first processor instruction specifying a vector expand operation and a data partition size, and execution units to set elements of the source vector to n count values, obtain a decisions vector, generate the vector conditions mask according to the decisions vector, and copy data from consecutive vector elements in the source vector, into unmasked vector elements of the destination vector, without copying data from the source vector into masked vector elements of the destination vector.
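
A scalar model of the expand step the execution units perform: consecutive source elements are copied only into destination lanes whose condition-mask bit is set, and masked lanes are left untouched. Sizes and values are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define N 8

static void vexpand(const int32_t src[N], const uint8_t mask[N],
                    int32_t dst[N]) {
    int k = 0;                        /* next consecutive source element */
    for (int i = 0; i < N; i++) {
        if (mask[i])
            dst[i] = src[k++];        /* unmasked lane gets next element */
        /* masked lanes keep their previous destination contents */
    }
}

int main(void) {
    int32_t src[N] = {0, 1, 2, 3, 4, 5, 6, 7};   /* n count values */
    uint8_t mask[N] = {1, 0, 1, 1, 0, 0, 1, 0};  /* vector conditions mask */
    int32_t dst[N] = {0};
    vexpand(src, mask, dst);
    for (int i = 0; i < N; i++) printf("%d ", dst[i]);
    putchar('\n');   /* prints: 0 0 1 2 0 0 3 0 */
    return 0;
}
```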



Vector Compare Instructions for Sliding Window Encoding

Thu, 23 Feb 2017 08:00:00 EST

A processor is described having an instruction execution pipeline with a functional unit to execute an instruction that compares vector elements against an input value. Each of the vector elements and the input value have a first respective section identifying a location within data and a second respective section having a byte sequence of the data. The functional unit has comparison circuitry to compare respective byte sequences of the input vector elements against the input value's byte sequence to identify a number of matching bytes for each comparison. The functional unit also has difference circuitry to determine respective distances between the input vector elements' byte sequences and the input value's byte sequence within the data.
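
A sketch of what one comparison lane produces, assuming 4-byte sequences per element: a leading-match length from the comparison circuitry and a position distance from the difference circuitry, the (length, distance) pair an LZ77-style sliding-window encoder needs.

```c
#include <stdint.h>
#include <stdio.h>

#define SEQ 4   /* bytes of data carried per element, an assumption */

typedef struct { uint32_t pos; uint8_t bytes[SEQ]; } elem;

static void lane_compare(const elem *cand, const elem *input,
                         int *match_len, uint32_t *distance) {
    int n = 0;                        /* comparison circuitry: count the */
    while (n < SEQ && cand->bytes[n] == input->bytes[n]) n++;  /* matches */
    *match_len = n;
    *distance = input->pos - cand->pos;  /* difference circuitry */
}

int main(void) {
    elem cand  = {100, {'a', 'b', 'c', 'x'}};
    elem input = {164, {'a', 'b', 'c', 'd'}};
    int len; uint32_t dist;
    lane_compare(&cand, &input, &len, &dist);
    printf("match=%d bytes, distance=%u\n", len, (unsigned)dist);
    return 0;   /* prints: match=3 bytes, distance=64 */
}
```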



MULTI-ELEMENT INSTRUCTION WITH DIFFERENT READ AND WRITE MASKS

Thu, 23 Feb 2017 08:00:00 EST

A method is described that includes reading a first read mask from a first register. The method also includes reading a first vector operand from a second register or memory location. The method also includes applying the read mask against the first vector operand to produce a set of elements for operation. The method also includes performing an operation on the set of elements. The method also includes creating an output vector by producing multiple instances of the operation's result. The method also includes reading a first write mask from a third register, the first write mask being different than the first read mask. The method also includes applying the write mask against the output vector to create a resultant vector. The method also includes writing the resultant vector to a destination register.
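
A scalar walk-through of the described flow, with a sum standing in for "the operation" (the abstract leaves it unspecified): the read mask selects elements, the scalar result is replicated into an output vector, and a different write mask selects which destination lanes receive it.

```c
#include <stdint.h>
#include <stdio.h>

#define N 8

int main(void) {
    int32_t src[N]   = {1, 2, 3, 4, 5, 6, 7, 8};
    uint8_t rmask[N] = {1, 1, 0, 0, 1, 0, 0, 0};  /* read mask */
    uint8_t wmask[N] = {0, 0, 1, 1, 0, 0, 1, 1};  /* different write mask */
    int32_t dst[N]   = {0};

    int32_t acc = 0;                  /* operate on the read-selected set */
    for (int i = 0; i < N; i++)
        if (rmask[i]) acc += src[i];  /* 1 + 2 + 5 = 8 */

    for (int i = 0; i < N; i++)       /* replicate the result, then gate */
        if (wmask[i]) dst[i] = acc;   /* the output with the write mask  */

    for (int i = 0; i < N; i++) printf("%d ", dst[i]);
    putchar('\n');                    /* prints: 0 0 8 8 0 0 8 8 */
    return 0;
}
```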



DELAYED ZERO-OVERHEAD LOOP INSTRUCTION

Thu, 23 Feb 2017 08:00:00 EST

An apparatus may include a counter circuit and an execution unit. The execution unit may be configured to receive and execute a first instruction. The first instruction may include a first number corresponding to a first number of instructions of a plurality of instructions, a second number corresponding to a number of times to execute a subset of the plurality of instructions, and a third number corresponding to a number of instructions in the subset. The execution unit may be further configured to initialize a first count value in the counter circuit to the second number in response to the execution of the first instruction, to execute the first number of the plurality of instructions, and to execute the subset of the plurality of instructions. The counter circuit may be configured to modify the first count value in response to determining a last instruction of the subset has been retired.
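
A software model of the delayed zero-overhead loop, with illustrative values for the three numbers: two lead-in instructions execute once, then a three-instruction subset repeats four times, with the counter circuit standing in for an explicit branch.

```c
#include <stdio.h>

int main(void) {
    const int delay = 2;   /* first number: instructions before the body */
    const int size  = 3;   /* third number: instructions in the subset   */
    const int count = 4;   /* second number: times to run the subset     */

    const char *prog[] = {"i0", "i1", "b0", "b1", "b2", "after"};
    int pc = 0;

    for (int i = 0; i < delay; i++)       /* delayed lead-in runs once */
        printf("exec %s\n", prog[pc++]);

    int body = pc;                        /* counter circuit repeats the */
    for (int c = 0; c < count; c++)       /* body with no branch fetched */
        for (pc = body; pc < body + size; pc++)
            printf("exec %s\n", prog[pc]);

    printf("exec %s\n", prog[pc]);        /* fall through after the loop */
    return 0;
}
```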



PROCESSOR AND METHOD

Thu, 23 Feb 2017 08:00:00 EST

A processor includes an arithmetic processing circuit; a cache memory including a plurality of ways; a usage information register storing usage information indicating whether to use each of the plurality of ways; a purge control circuit; and an access control circuit. The purge control circuit performs purge processing on the basis of rewriting of the usage information within the usage information register according to an instruction executed by the arithmetic processing circuit. The purge processing includes deleting, from the cache memory, target data retained in a target way to be stopped, and writing back part of the target data, namely the data rewritten in the cache memory, to a main memory at a lower level than the cache memory. The access control circuit controls access to the cache memory on the basis of a memory access request received from the arithmetic processing circuit and the status of the purge processing.



DATA PROCESSING METHOD, PROCESSOR, AND DATA PROCESSING DEVICE

Thu, 16 Feb 2017 08:00:00 EST

Disclosed are a data processing method, a processor, and a data processing device. The method comprises: an arbiter sends data D(a,1) to a first processing circuit; the first processing circuit processes the data D(a,1) to obtain data D(1,2), the first processing circuit being a processing circuit among m processing circuits; the first processing circuit sends the data D(1,2) to a second processing circuit; the second processing circuit to an mth processing circuit separately process the received data; and the arbiter receives data D(m,a) sent by the mth processing circuit. The processor further comprises an (m+1)th processing circuit. Each processing circuit from the first processing circuit to the (m+1)th processing circuit can receive first data to be processed sent by the arbiter, and process the first data to be processed. The scheme helps to improve the efficiency of data processing.



SCALABLE SINGLE-INSTRUCTION-MULTIPLE-DATA INSTRUCTIONS

Thu, 16 Feb 2017 08:00:00 EST

A method for executing scalable single-instruction-multiple-data (SIMD) instructions includes performing a query to determine a hardware vector length of a SIMD processor. The method also includes scaling a first instruction of the scalable SIMD instructions to a first scaled vector length to generate a first scaled instruction. The first scaled vector length is based on the hardware vector length, and the first instruction is a compiled instruction having an adaptable vector length. The method also includes adjusting a first number of iterations to be used by the SIMD processor to perform first operations associated with the first instruction based on the first scaled vector length.
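
A sketch of the scaling and iteration-adjustment steps; query_hw_vector_length is a stand-in for the hardware query the abstract describes, and the element count is illustrative.

```c
#include <stdio.h>

/* Stand-in for the hardware vector-length query in the abstract. */
static int query_hw_vector_length(void) { return 8; }

int main(void) {
    const int total_elems = 100;           /* elements the loop must cover */
    int vlen  = query_hw_vector_length();  /* e.g. 8 lanes on this part    */
    int iters = (total_elems + vlen - 1) / vlen;  /* adjusted iteration count */

    int done = 0;
    for (int i = 0; i < iters; i++) {
        int n = (total_elems - done < vlen) ? (total_elems - done) : vlen;
        done += n;   /* one vlen-wide (or partial) operation per iteration */
    }
    printf("vlen=%d -> %d iterations, %d elements processed\n",
           vlen, iters, done);
    return 0;
}
```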



PREDICTING MEMORY INSTRUCTION PUNTS IN A COMPUTER PROCESSOR USING A PUNT AVOIDANCE TABLE (PAT)

Thu, 16 Feb 2017 08:00:00 EST

Predicting memory instruction punts in a computer processor using a punt avoidance table (PAT) is disclosed. In one aspect, an instruction processing circuit accesses a PAT containing entries, each comprising an address of a memory instruction. Upon detecting a memory instruction in an instruction stream, the instruction processing circuit determines whether the PAT contains an entry having the address of the memory instruction. If so, the instruction processing circuit prevents the detected memory instruction from taking effect before at least one pending memory instruction older than the detected memory instruction, to preempt a memory instruction punt. In some aspects, the instruction processing circuit may determine, upon execution of a pending memory instruction, whether a hazard associated with the detected memory instruction has occurred. If so, an entry for the detected memory instruction is generated in the PAT.



BRANCH SYNTHETIC GENERATION ACROSS MULTIPLE MICROARCHITECTURE GENERATIONS

Thu, 16 Feb 2017 08:00:00 EST

Branch sequences for branch prediction performance testing are generated by performing the following steps: (i) generating a branch node graph, by a branch node graph generator machine logic set, based, at least in part, upon a set of branch traces of a workload or benchmark code; (ii) generating a first assembly pattern file, for use with a first instruction set architecture (ISA)/microarchitecture set, by an assembly pattern generator machine logic set, based, at least in part, upon the branch node graph so as to mimic the control-flow pattern of the workload or benchmark code; and (iii) running the assembly pattern file on the first ISA/microarchitecture set to obtain first execution results.



METHOD AND SYSTEM USING EXCEPTIONS FOR CODE SPECIALIZATION IN A COMPUTER ARCHITECTURE THAT SUPPORTS TRANSACTIONS

Thu, 16 Feb 2017 08:00:00 EST

A method and system use exceptions for code specialization in a system that supports transactions. The method and system include inserting one or more branchless instructions into a sequence of computer instructions. The branchless instructions include one or more instructions that are executable if a commonly occurring condition is satisfied, and one or more instructions that are configured to raise an exception if the commonly occurring condition is not satisfied.



HIGH PERFORMANCE RECOVERY FROM MISSPECULATION OF LOAD LATENCY

Thu, 16 Feb 2017 08:00:00 EST

A load instruction, for loading a register among a set of registers, is scheduled. Associated with scheduling the load instruction, a register dependency vector, corresponding to the register, is set to a state identifying the load instruction. A consumer instruction is scheduled, having a set of operand registers and a target register, the register being in the set of operand registers. A target register dependency vector, corresponding to the target register, is set in the memory. Based at least in part on the register being in the set of operand registers, a value of the target register dependency vector identifies the load instruction. Optionally, upon receiving a cache miss notice associated with the load instruction, the target register dependency vector is retrieved.



PROCESSOR INSTRUCTION SEQUENCE TRANSLATION

Thu, 16 Feb 2017 08:00:00 EST

A method for translating a sequence of instructions is disclosed herein. In one embodiment, the method includes recognizing a candidate multi-instruction sequence, determining that the multi-instruction sequence corresponds to a single instruction, and executing the multi-instruction sequence by executing the single instruction.



BRANCH PREDICTION USING MULTIPLE VERSIONS OF HISTORY DATA

Thu, 16 Feb 2017 08:00:00 EST

Branch prediction is provided by generating a first index from a previous instruction address and from a first branch history vector having a first length. A second index is generated from the previous instruction address and from a second branch history vector that is longer than the first vector. Using the first index, a first branch prediction is retrieved from a first branch prediction table. Using the second index, a second branch prediction is retrieved from a second branch prediction table. Based upon additional branch history data, the first branch history vector and the second branch history vector are updated. A first hash value is generated from a current instruction address and the updated first branch history vector. A second hash value is generated from the current instruction address and the updated second branch history vector. One of the branch predictions is selected based upon the hash values.
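
A compact model of the two-table structure, assuming gshare-style XOR indexing, 2-bit saturating counters, and 8/16-bit history lengths; the abstract does not fix the index, hash, or selection functions, so the chooser here simply prefers the longer history on disagreement.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TBL 1024   /* entries per prediction table, an assumption */

static uint8_t t_short[TBL], t_long[TBL];  /* 2-bit saturating counters */
static uint32_t hist_short, hist_long;     /* 8- and 16-bit history vectors */

static bool predict(uint32_t pc) {
    uint32_t i1 = (pc ^ (hist_short & 0xFF))   % TBL;  /* short-history index */
    uint32_t i2 = (pc ^ (hist_long  & 0xFFFF)) % TBL;  /* long-history index  */
    bool p1 = t_short[i1] >= 2, p2 = t_long[i2] >= 2;
    return (p1 == p2) ? p1 : p2;   /* prefer the longer history on a tie-break */
}

static void update(uint32_t pc, bool taken) {
    uint32_t i1 = (pc ^ (hist_short & 0xFF))   % TBL;
    uint32_t i2 = (pc ^ (hist_long  & 0xFFFF)) % TBL;
    if (taken) { if (t_short[i1] < 3) t_short[i1]++; if (t_long[i2] < 3) t_long[i2]++; }
    else       { if (t_short[i1] > 0) t_short[i1]--; if (t_long[i2] > 0) t_long[i2]--; }
    hist_short = (hist_short << 1) | taken;   /* shift the outcome into both */
    hist_long  = (hist_long  << 1) | taken;   /* history vectors             */
}

int main(void) {
    uint32_t pc = 0x40321c;
    for (int i = 0; i < 20; i++) update(pc, true);   /* train: always taken */
    printf("prediction: %s\n", predict(pc) ? "taken" : "not taken");
    return 0;
}
```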



EFFICIENT HANDLING OF REGISTER FILES

Thu, 16 Feb 2017 08:00:00 EST

Systems and methods of handling a register file include, in a first instruction set architecture (ISA) mode, assigning a first subset of tracking resources to logical registers of a first logical register subset for tracking mappings to full granularity and first lower granularity physical registers of a first physical register subset, and assigning a second subset of tracking resources to logical registers of a second logical register subset for tracking mappings to second lower granularity physical registers of the first physical register subset. The second subset of tracking resources is configured for tracking at least the mappings of the logical registers of the second logical register subset to physical registers of a second physical register subset in a second ISA mode, wherein the second physical register subset is available to the second ISA mode but not the first ISA mode.



POWER EFFICIENT FETCH ADAPTATION

Thu, 16 Feb 2017 08:00:00 EST

Systems and methods relate to an instruction fetch unit of a processor, such as a superscalar processor. The instruction fetch unit includes a fetch bandwidth predictor (FBWP) configured to predict a number of instructions to be fetched in a fetch group of instructions in a pipeline stage of the processor. An entry of the FBWP corresponding to the fetch group holds a prediction of the number of instructions to be fetched, based on the occurrence and location of a predicted-taken branch instruction in the fetch group, along with a confidence level associated with the predicted number. The instruction fetch unit is configured to fetch only the predicted number of instructions, rather than the maximum number of instructions that can be fetched in the pipeline stage, if the confidence level is greater than a predetermined threshold. In this manner, wasteful fetching of instructions is avoided.



DETERMINING PREFETCH INSTRUCTIONS BASED ON INSTRUCTION ENCODING

Thu, 16 Feb 2017 08:00:00 EST

Systems and methods for identifying candidate load instructions for prefetch operations, based on at least the instruction encoding of the load instructions, use an identifier derived from a function of one or more fields of a load instruction and, optionally, a subset of bits of the program counter (PC) value of the load instruction, wherein the one or more fields exclude the full address or PC value of the load instruction. Prefetch mechanisms, including a prefetch table indexed by the identifier, can determine whether the load instruction is a candidate load instruction for prefetching load data, based on the identifier. The function may be a hash, a concatenation, or a combination thereof, of one or more bits of the one or more fields. The fields include one or more of a base register, a destination register, an immediate offset, an offset register, or other bits of the instruction encoding of the load instruction.



PROCESSOR INSTRUCTION SEQUENCE TRANSLATION

Thu, 16 Feb 2017 08:00:00 EST

A computer readable medium and an apparatus for translating a sequence of instructions are disclosed herein. In one embodiment, an operation includes recognizing a candidate multi-instruction sequence, determining that the multi-instruction sequence corresponds to a single instruction, and executing the multi-instruction sequence by executing the single instruction.



TABLE LOOKUP USING SIMD INSTRUCTIONS

Thu, 16 Feb 2017 08:00:00 EST

Systems and methods pertain to looking up entries of a table. A processor receives one or more single instruction multiple data (SIMD) instructions, including a first SIMD instruction which specifies a first subset of indices. A first subset of table entries is looked up, using a crossbar, with the first subset of indices. A first vector output of the first SIMD instruction is based on whether the outputs of the crossbar belong to a desired subset of table entries. Similarly, second, third, and fourth SIMD instructions specify corresponding second, third, and fourth subsets of indices to look up the remaining table entries using the crossbar. The size of the crossbar is based on the number of indices in the subset of indices used to look up table entries.



MULTI-THREAD PROCESSOR AND ITS HARDWARE THREAD SCHEDULING METHOD

Thu, 16 Feb 2017 08:00:00 EST

A multi-thread processor including a plurality of hardware threads each of which is configured to generate an independent instruction flow, a thread scheduler configured to output a thread selection signal according to a schedule, the thread selection signal being a signal for selecting a hardware thread to be used in a next execution cycle from among the plurality of hardware threads, a first selector configured to select and output an instruction generated by the hardware thread selected according to the thread selection signal, and an arithmetic circuit configured to execute the instruction output from the first selector. The thread scheduler selects at least one hardware thread selected in a fixed manner from among the plurality of hardware threads in a predetermined first execution period and selects an arbitrary hardware thread in a second execution period other than the first execution period.



STORING NARROW PRODUCED VALUES FOR INSTRUCTION OPERANDS DIRECTLY IN A REGISTER MAP IN AN OUT-OF-ORDER PROCESSOR

Thu, 16 Feb 2017 08:00:00 EST

Storing narrow produced values for instruction operands directly in a register map in an out-of-order processor (OoP) is provided. An OoP is provided that includes an instruction processing system. The instruction processing system includes a number of instruction processing stages configured to pipeline the processing and execution of instructions according to a dataflow execution. The instruction processing system also includes a register map table (RMT) configured to store address pointers mapping logical registers to physical registers in a physical register file (PRF) for storing produced data for use by consumer instructions without overwriting logical registers for later executed, out-of-order instructions. In certain aspects, the instruction processing system is configured to write back (i.e., store) narrow values produced by executed instructions directly into the RMT, as opposed to writing the narrow produced values into the PRF in a write back stage.



SIMD MULTIPLY AND HORIZONTAL REDUCE OPERATIONS

Thu, 16 Feb 2017 08:00:00 EST

Systems and methods relate to multiply-and-horizontal-reduce operations, implemented in a digital filter, for example. A single instruction multiple data (SIMD) instruction is received, comprising a first vector comprising M+C multiplicand elements, wherein M and C are positive integers, and a second vector comprising M+C corresponding multiplier elements, wherein the C multiplier elements have a value of 1. Using M multipliers in a processor, M multiplications of the M multiplicand elements with the corresponding M multiplier elements, which do not include the C multiplier elements whose values are 1, are performed to generate M products. The C multiplicand elements whose corresponding C multiplier elements have values of 1 are added to or vertically accumulated with the M products.
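
A scalar sketch of the M+C arrangement: the C lanes whose multipliers are 1 bypass the multipliers and are accumulated directly with the M real products. M, C, and the data are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define M 4
#define C 2

int main(void) {
    int32_t x[M + C] = {2, 3, 4, 5, 10, 20};  /* multiplicands */
    int32_t y[M + C] = {7, 7, 7, 7, 1, 1};    /* last C multipliers are 1 */

    int64_t acc = 0;
    for (int i = 0; i < M + C; i++)
        acc += (y[i] == 1) ? x[i]             /* unit lanes skip the multiply */
                           : (int64_t)x[i] * y[i];

    /* (2+3+4+5)*7 + 10 + 20 = 98 + 30 = 128 */
    printf("horizontal reduce = %lld\n", (long long)acc);
    return 0;
}
```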



DATA RETURNED RESPONSIVE TO EXECUTING A START SUBCHANNEL INSTRUCTION

Thu, 09 Feb 2017 08:00:00 EST

An abstraction for storage class memory is provided that hides the details of the implementation of storage class memory from a program, and provides a standard channel programming interface for performing certain actions, such as controlling movement of data between main storage and storage class memory or managing storage class memory.



PROCESSING BYPASS DIRECTORY TRACKING SYSTEM AND METHOD

Thu, 09 Feb 2017 08:00:00 EST

A processing bypass directory system and method are disclosed. In one embodiment, a bypass directory tracking process includes setting bits in a bypass directory when a corresponding architectural register is written. The bits are selectively cleared in the bypass directory each cycle. The configuration of the bits is utilized to determine which stage of a bypass path the processing information is at.



METHOD FOR BRANCH PREDICTION

Thu, 09 Feb 2017 08:00:00 EST

The invention relates to a method for predicting branch instructions in a processor and a processor configured for this method. The processor includes an execution unit, an instruction fetch unit and a branch prediction unit. The execution unit is configured for executing machine instructions of a binary computer program. The branch prediction unit is configured for predicting the behavior of branch instructions executed by the execution unit. The instruction fetch unit is configured for fetching and pipelining instructions to be executed by the execution unit.



MECHANISM FOR FACILITATING DYNAMIC AND EFFICIENT MANAGEMENT OF INSTRUCTION ATOMICITY VIOLATIONS IN SOFTWARE PROGRAMS AT COMPUTING SYSTEMS

Thu, 09 Feb 2017 08:00:00 EST

A mechanism is described for facilitating dynamic and efficient management of instruction atomicity violations in software programs according to one embodiment. A method of embodiments, as described herein, includes receiving, at a replay logic from a recording system, a recording of a first software thread running a first macro instruction, and a second software thread running a second macro instruction. The first software thread and the second software thread are executed by a first core and a second core, respectively, of a processor at a computing device. The recording system may record interleavings between the first and second macro instructions. The method includes correctly replaying the recording of the interleavings of the first and second macro instructions precisely as they occurred. The correctly replaying may include replaying a local memory state of the first and second macro instructions and a global memory state of the first and second software threads.



ADAPTIVE CORE GROUPING

Thu, 09 Feb 2017 08:00:00 EST

The present invention relates to a system, method, and non-transitory storage medium executable by one or more processors at a multi-processor system that improves load monitoring and processor-core assignments as compared to conventional approaches. A method consistent with the present invention includes a first data packet being received at a multi-processor system. After the first packet is received it may be sent to a first processor where the first processor identifies a first processing task associated with the first data packet. The first data packet may then be forwarded to a second processor that is optimized for processing the first processing task of the first data packet. The second processor may then process the first processing task of the first data packet. Program code associated with the first processing task may be stored in a level one (L1) cache at the first processor.



HYBRID HARDWARE AND SOFTWARE IMPLEMENTATION OF TRANSACTIONAL MEMORY ACCESS

Thu, 09 Feb 2017 08:00:00 EST

Embodiments of the invention relate to a hybrid hardware and software implementation of transactional memory accesses in a computer system. A processor including a transactional cache and a regular cache is utilized in a computer system that includes a policy manager to select one of a first mode (a hardware mode) or a second mode (a software mode) to implement transactional memory accesses. In the hardware mode the transactional cache is utilized to perform read and write memory operations and in the software mode the regular cache is utilized to perform read and write memory operations.



VECTOR CHECKSUM INSTRUCTION

Thu, 09 Feb 2017 08:00:00 EST

A Vector Checksum instruction. Elements from a second operand are added together one-by-one to obtain a first result. The adding includes performing one or more end around carry add operations. The first result is placed in an element of a first operand of the instruction. After each addition of an element, a carry out of a chosen position of the sum, if any, is added to a selected position in an element of the first operand.
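
A sketch of the accumulation with end-around carry, modeled on a one's-complement checksum over 32-bit elements; the element width and values are assumptions, since the abstract does not fix them.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t elems[4] = {0xFFFF0000u, 0x0001FFFFu, 0x12345678u, 0x9ABCDEF0u};
    uint32_t sum = 0;

    for (int i = 0; i < 4; i++) {
        uint64_t t = (uint64_t)sum + elems[i];
        sum = (uint32_t)t + (uint32_t)(t >> 32);  /* add carry back in */
        if (sum < (uint32_t)t) sum++;             /* fold any second carry */
    }
    printf("checksum element = %08x\n", (unsigned)sum);
    return 0;
}
```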



METHOD AND APPARATUS FOR SHUFFLING DATA

Thu, 09 Feb 2017 08:00:00 EST

Method, apparatus, and program means for shuffling data. The method of one embodiment comprises receiving a first operand having a set of L data elements and a second operand having a set of L control elements. For each control element, data from the first operand data element designated by the individual control element is shuffled to an associated resultant data element position if its flush to zero field is not set, and a zero is placed into the associated resultant data element position if its flush to zero field is set.
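
A scalar model of the per-element semantics as corrected above: each control element either selects a designated source element or, when its flush-to-zero field is set, forces a zero into the result position.

```c
#include <stdint.h>
#include <stdio.h>

#define L 4

typedef struct { uint8_t index; uint8_t flush_to_zero; } ctl;

int main(void) {
    int32_t src[L] = {10, 20, 30, 40};
    ctl c[L] = {{2, 0}, {0, 0}, {3, 1}, {1, 0}};   /* third lane flushes */
    int32_t dst[L];

    for (int i = 0; i < L; i++)   /* shuffle or flush, per control element */
        dst[i] = c[i].flush_to_zero ? 0 : src[c[i].index];

    for (int i = 0; i < L; i++) printf("%d ", dst[i]);
    putchar('\n');   /* prints: 30 10 0 20 */
    return 0;
}
```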



Computer with Hybrid Von-Neumann/Dataflow Execution Architecture

Thu, 02 Feb 2017 08:00:00 EST

A dataflow computer processor is teamed with a general computer processor so that program portions of an application program particularly suited to dataflow execution may be transferred to the dataflow processor during portions of the execution of the application program by the general computer processor. During this time the general computer processor may be placed in partial shutdown for energy conservation.



APPARATUS AND METHOD FOR TRANSFERRING A PLURALITY OF DATA STRUCTURES BETWEEN MEMORY AND A PLURALITY OF VECTOR REGISTERS

Thu, 02 Feb 2017 08:00:00 EST

An apparatus and method are provided for transferring a plurality of data structures between memory and a plurality of vector registers, each vector register being arranged to store a vector operand comprising a plurality of data elements. Access circuitry is used to perform access operations to move data elements of vector operands between the data structures in memory and specified vector registers, each data structure comprising multiple data elements stored at contiguous addresses in the memory. Decode circuitry is responsive to a single access instruction identifying a plurality of vector registers and a plurality of data structures that are located discontiguously with respect to each other in the memory, to generate control signals to control the access circuitry to perform a sequence of access operations to move the plurality of data structures between the memory and the plurality of vector registers such that the vector operand in each vector register holds a corresponding data element from each of the plurality of data structures. This provides a very efficient mechanism for performing complex access operations, resulting in an increase in execution speed, and potential reductions in power consumption.
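
A sketch of the de-interleaving effect of the single access instruction, assuming two-field structures and four-element vector registers: after the sequence of access operations, one register holds every structure's first field and the other every second field. Addresses and counts are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define NREG 4   /* elements per vector register */

typedef struct { int32_t x, y; } pair;   /* one in-memory data structure */

int main(void) {
    pair mem[16];
    for (int i = 0; i < 16; i++) mem[i] = (pair){i, 100 + i};

    /* discontiguous structure locations covered by the one instruction */
    int idx[NREG] = {0, 5, 9, 14};

    int32_t r0[NREG], r1[NREG];        /* destination vector registers */
    for (int e = 0; e < NREG; e++) {   /* one access op per structure  */
        r0[e] = mem[idx[e]].x;         /* all x fields land in r0      */
        r1[e] = mem[idx[e]].y;         /* all y fields land in r1      */
    }

    for (int e = 0; e < NREG; e++) printf("(%d,%d) ", r0[e], r1[e]);
    putchar('\n');   /* prints: (0,100) (5,105) (9,109) (14,114) */
    return 0;
}
```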



System and Method for Variable Lane Architecture

Thu, 02 Feb 2017 08:00:00 EST

A system and method for variable lane architecture includes memory blocks located in a memory bank, one or more computing nodes forming a vector instruction pipeline for executing a task, each of the computing nodes located in the memory bank, each of the computing nodes executing a portion of the task independently of other ones of the computing nodes, and a global program controller unit (GPCU) forming a scalar instruction pipeline for executing the task, the GPCU configured to schedule instructions for the task at one or more of the computing nodes, the GPCU further configured to dispatch an address for the memory blocks used by each of the computing nodes to the computing nodes.



REGISTER COMPARISON FOR OPERAND STORE COMPARE (OSC) PREDICTION

Thu, 02 Feb 2017 08:00:00 EST

Embodiments relate to register comparison for operand store compare (OSC) prediction. An aspect includes, for each instruction in an instruction group of a processor pipeline: determining a base register value of the instruction; determining an index register value of the instruction; and determining a displacement of the instruction. Another aspect includes comparing the base register value, index register value, and displacement of each instruction in the instruction group to the base register value, index register value, and displacement of all other instructions in the instruction group. Another aspect includes, based on the comparison, determining that a load instruction of the instruction group has a probable OSC conflict with a store instruction of the instruction group. Yet another aspect includes delaying the load instruction based on the determined probable OSC conflict.
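
A minimal model of the comparison aspect: each instruction's operand address is summarized as a (base register, index register, displacement) triple, and a load matching an older store's triple is flagged as a probable OSC conflict to be delayed. For brevity this sketch compares loads only against older stores rather than all other instructions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    bool is_store;
    uint8_t base, index;   /* register numbers */
    int32_t disp;          /* displacement */
} inst;

int main(void) {
    inst group[3] = {
        { true,  5, 0, 16 },  /* store 16(r5)                           */
        { false, 7, 0,  8 },  /* load   8(r7): no match, issues freely  */
        { false, 5, 0, 16 },  /* load  16(r5): matches the older store  */
    };

    for (int i = 0; i < 3; i++) {
        if (group[i].is_store) continue;
        for (int j = 0; j < i; j++)     /* compare against older stores */
            if (group[j].is_store &&
                group[j].base  == group[i].base &&
                group[j].index == group[i].index &&
                group[j].disp  == group[i].disp)
                printf("load %d: probable OSC conflict, delay it\n", i);
    }
    return 0;
}
```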



AGE BASED FAST INSTRUCTION ISSUE

Thu, 02 Feb 2017 08:00:00 EST

In an approach for selecting and issuing an oldest ready instruction in an issue queue, one or more processors receive one or more instructions in an issue queue. Ready-to-execute instructions are identified. The age of the instructions is represented in a first age array. One or more subsets of the instructions are generated for subset age arrays that each hold the age of the instructions in a subset. A major signal is generated that identifies an oldest ready instruction in the first age array, and a subset signal is simultaneously generated that identifies an oldest ready instruction in each subset age array. A candidate instruction is selected with each subset signal that is represented in the subset age array of the subset signal, wherein a candidate instruction is an oldest ready instruction in the subset age array. A candidate instruction is selected with the major signal and issued.



APPARATUS WITH REDUCED HARDWARE REGISTER SET

Thu, 02 Feb 2017 08:00:00 EST

An apparatus comprises processing circuitry for processing program instructions according to a predetermined architecture defining a number of architectural registers accessible in response to the program instructions. A set of hardware registers is provided in hardware. A storage capacity of the set of hardware registers is insufficient for storing all the data associated with the architectural registers of the predetermined architecture. Control circuitry is responsive to the program instructions to transfer data between the hardware registers and at least one register-emulating memory location in memory for storing data corresponding to the architectural registers of the architecture.



BRANCH SYNTHETIC GENERATION ACROSS MULTIPLE MICROARCHITECTURE GENERATIONS

Thu, 02 Feb 2017 08:00:00 EST

Branch sequences for branch prediction performance testing are generated by performing the following steps: (i) generating a branch node graph, by a branch node graph generator machine logic set, based, at least in part, upon a set of branch traces of a workload or benchmark code; (ii) generating a first assembly pattern file, for use with a first instruction set architecture (ISA)/microarchitecture set, by an assembly pattern generator machine logic set, based, at least in part, upon the branch node graph so as to mimic the control-flow pattern of the workload or benchmark code; and (iii) running the assembly pattern file on the first ISA/microarchitecture set to obtain first execution results.



VECTOR CHECKSUM INSTRUCTION

Thu, 02 Feb 2017 08:00:00 EST

A Vector Checksum instruction. Elements from a second operand are added together one-by-one to obtain a first result. The adding includes performing one or more end around carry add operations. The first result is placed in an element of a first operand of the instruction. After each addition of an element, a carry out of a chosen position of the sum, if any, is added to a selected position in an element of the first operand.



ELEMENT SIZE INCREASING INSTRUCTION

Thu, 02 Feb 2017 08:00:00 EST

An apparatus comprises processing circuitry to generate a result vector including at least one N-bit data element in response to an element size increasing instruction identifying at least a first input vector including M-bit data elements, where N>M. First and second forms of the element size increasing instruction are provided for generating the result vector using first and second subsets of data elements of the first input vector respectively. Positions of the first and second subsets of data elements in the first input vector are interleaved.
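
A sketch of the two instruction forms, assuming 16-bit inputs widened to 32-bit results: the first form widens the even-indexed elements and the second the odd-indexed ones, so the two subsets interleave as described.

```c
#include <stdint.h>
#include <stdio.h>

#define IN 8   /* 16-bit input elements */

/* form 0 -> elements 0,2,4,...; form 1 -> elements 1,3,5,... */
static void widen(const int16_t src[IN], int32_t dst[IN / 2], int form) {
    for (int i = 0; i < IN / 2; i++)
        dst[i] = src[2 * i + form];   /* M-bit element to N-bit result */
}

int main(void) {
    int16_t src[IN] = {1, -2, 3, -4, 5, -6, 7, -8};
    int32_t lo[IN / 2], hi[IN / 2];
    widen(src, lo, 0);   /* first form:   1  3  5  7 */
    widen(src, hi, 1);   /* second form: -2 -4 -6 -8 */
    for (int i = 0; i < IN / 2; i++) printf("%d %d\n", lo[i], hi[i]);
    return 0;
}
```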



SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING DELTA ENCODING ON PACKED DATA ELEMENTS

Thu, 02 Feb 2017 08:00:00 EST

Embodiments of systems, apparatuses, and methods for performing delta encoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta encode instruction are described.
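
A scalar model of the per-lane delta encode, assuming the first destination element passes the first source element through; a single vector instruction would produce all lanes at once.

```c
#include <stdint.h>
#include <stdio.h>

#define N 8

int main(void) {
    int32_t src[N] = {100, 103, 103, 110, 108, 108, 115, 120};
    int32_t dst[N];

    dst[0] = src[0];                     /* first lane passes through */
    for (int i = 1; i < N; i++)
        dst[i] = src[i] - src[i - 1];    /* per-lane delta */

    for (int i = 0; i < N; i++) printf("%d ", dst[i]);
    putchar('\n');   /* prints: 100 3 0 7 -2 0 7 5 */
    return 0;
}
```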



HYBRID PROGRAMMABLE MANY-CORE DEVICE WITH ON-CHIP INTERCONNECT

Thu, 26 Jan 2017 08:00:00 EST

The present invention provides a hybrid programmable logic device which includes a programmable field programmable gate array logic fabric and a many-core distributed processing subsystem. The device integrates both a fabric of programmable logic elements and processors in the same device, i.e., the same chip. The programmable logic elements may be sized and arranged such that place and route tools can address the processors and logic elements as a homogenous routing fabric. The programmable logic elements may provide hardware acceleration functions to the processors that can be defined after the device is fabricated. The device may include scheduling circuitry that can schedule the transmission of data on horizontal and vertical connectors in the logic fabric to transmit data between the programmable logic elements and processor in an asynchronous manner.



MICROPROCESSOR ACCELERATED CODE OPTIMIZER

Thu, 26 Jan 2017 08:00:00 EST

A method for accelerating code optimization in a microprocessor. The method includes fetching an incoming macroinstruction sequence using an instruction fetch component and transferring the fetched macroinstructions to a decoding component for decoding into microinstructions. Optimization processing is performed by reordering the microinstruction sequence into an optimized microinstruction sequence comprising a plurality of dependent code groups. The optimized microinstruction sequence is output to a microprocessor pipeline for execution. A copy of the optimized microinstruction sequence is stored in a sequence cache for subsequent use upon a subsequent hit on the optimized microinstruction sequence.



SLIDING WINDOW OPERATION

Thu, 26 Jan 2017 08:00:00 EST

A first register has a lane storing first input data elements and a second register has a lane storing second input data elements. A width of the lane of the second register is equal to a width of the lane of the first register. A single-instruction-multiple-data (SIMD) lane has a lane width equal to the width of the lane of the first register. The SIMD lane is configured to perform a sliding window operation on the first input data elements in the lane of the first register and the second input data elements in the lane of the second register. Performing the sliding window operation includes determining a result based on a first input data element stored in a first position of the first register and a second input data element stored in a second position of the second register. The second position is different from the first position.