Subscribe: Untitled
http://www.freepatentsonline.com/rssfeed/rssapp712.xml
Added By: Feedage Forager Feedage Grade B rated
Language: English
Tags:
branch  data  includes  instruction  instructions  load  plurality  processing  processor  program  register  unit  vector 

SYSTEMS, DEVICES, AND METHODS FOR ANALOG PROCESSING

Thu, 27 Oct 2016 08:00:00 EDT

A system may include first and second qubits that cross one another and a first coupler having a perimeter that encompasses at least a part of the portions of the first and second qubits, the first coupler being operable to ferromagnetically or anti-ferromagnetically couple the first and the second qubits together. A multi-layered computer chip may include a first plurality N of qubits laid out in a first metal layer, a second plurality M of qubits laid out at least partially in a second metal layer that cross each of the qubits of the first plurality of qubits, and a first plurality N times M of coupling devices that at least partially encompasses an area where a respective pair of the qubits from the first and the second plurality of qubits cross each other.



APPLICATION PROCESSOR AND SYSTEM ON CHIP

Thu, 27 Oct 2016 08:00:00 EDT

An application processor includes a first processor configured to generate a control signal based on whether user data is changed, wherein the application processor is configured to implement a power manager that dynamically controls power provided to the first processor in response to the control signal.



Architecture for long latency operations in emulated shared memory architectures

Thu, 27 Oct 2016 08:00:00 EDT

A processor architecture arrangement for emulated shared memory (ESM) architectures, comprises a number of, preferably a plurality of, multi-threaded processors each provided with interleaved inter-thread pipeline, wherein the pipeline comprises a plurality of functional units arranged in series for executing arithmetic, logical and optionally further operations on data, wherein one or more functional units of lower latency are positioned prior to the memory access segment in said pipeline and one or more long latency units (LLU) for executing more complex operations associated with longer latency are positioned operatively in parallel with the memory access segment. In some embodiments, the pipeline may contain multiple branches in parallel with the memory access segment, each branch containing at least one long latency unit.



INSTRUCTION AND LOGIC FOR IDENTIFYING INSTRUCTIONS FOR RETIREMENT IN A MULTI-STRAND OUT-OF-ORDER PROCESSOR

Thu, 27 Oct 2016 08:00:00 EDT

A processor includes a first logic to execute an instruction stream out-of-order, the instruction stream divided into a plurality of strands, the instruction stream and each strand ordered by program order (PO). The processor also includes a second logic to determine an oldest undispatched instruction in the instruction stream and store an associated PO value of the oldest undispatched instruction as an executed instruction pointer. The instruction stream includes dispatched and undispatched instructions. The processor also includes a third logic to determine a most recently retired instruction in the instruction stream and store an associated PO value of the most recently retired instruction as a retirement pointer, a fourth logic to select a range of instructions between the retirement pointer and the executed instruction pointer, and a fifth logic to identify the range of instructions as eligible for retirement.



ENERGY EFFICIENT PROCESSOR CORE ARCHITECTURE FOR IMAGE PROCESSOR

Thu, 27 Oct 2016 08:00:00 EDT

An apparatus is described. The apparatus includes a program controller to fetch and issue instructions. The apparatus includes an execution lane having at least one execution unit to execute the instructions. The execution lane is part of an execution lane array that is coupled to a two dimensional shift register array structure, wherein, execution lanes of the execution lane array are located at respective array locations and are coupled to dedicated registers at same respective array locations in the two-dimensional shift register array.



INSTRUCTION PREDICATION USING UNUSED DATAPATH FACILITIES

Thu, 27 Oct 2016 08:00:00 EDT

A method and circuit arrangement selectively predicate an instruction in an instruction stream based upon a value corresponding to a predication register address indicated by a portion of an operand associated with the instruction. A first compare instruction in an instruction stream stores a compare result at a register address of a predication register. The register address of the predication register is stored in a portion of an operand associated with a second instruction. During decoding of the second instruction, the predication register is accessed to determine the value stored at that register address, and the second instruction is selectively predicated based on that value.



TECHNIQUES FOR FACILITATING CRACKING AND FUSION WITHIN A SAME INSTRUCTION GROUP

Thu, 27 Oct 2016 08:00:00 EDT

A technique includes determining whether one or more instructions in an instruction group require cracking. Whether the instructions that require cracking are associated with a decode-time instruction optimization (DTIO) sequence is also determined. In response to a first instruction, included in the one or more instructions, requiring cracking and the first instruction not being part of a DTIO sequence, the first instruction is cracked into internal operations (IOPs). In response to a second instruction, included in the one or more instructions, requiring cracking and the second instruction being part of a DTIO sequence, an IOP sequence (that includes at least one IOP that is associated with at least a cracked version of the second instruction and at least a third instruction that is included in the one or more instructions and at least one other IOP that is associated with the cracked version of the second instruction) is generated.



COMPUTER PROCESSOR WITH INDIRECT ONLY BRANCHING

Thu, 27 Oct 2016 08:00:00 EDT

A computer processor with indirect only branching is disclosed. The computer processor may include one or more target registers. The computer processor may include processing logic in signal communication with the one or more target registers. The processing logic may execute a non-interrupting branch instruction based on a value stored in a target register of the one or more target registers. The non-interrupting branch instruction may use the one or more target registers to specify a destination address of a branch specified by the non-interrupting branch instruction.



APPARATUS AND METHOD OF EXECUTION UNIT FOR CALCULATING MULTIPLE ROUNDS OF A SKEIN HASHING ALGORITHM

Thu, 27 Oct 2016 08:00:00 EDT

An apparatus is described that includes an execution unit within an instruction pipeline. The execution unit has multiple stages of a circuit that includes a) and b) as follows: a) a first logic circuitry section having multiple mix logic sections each having: i) a first input to receive a first quad word and a second input to receive a second quad word; ii) an adder having a pair of inputs that are respectively coupled to the first and second inputs; iii) a rotator having a respective input coupled to the second input; iv) an XOR gate having a first input coupled to an output of the adder and a second input coupled to an output of the rotator. b) permute logic circuitry having inputs coupled to the respective adder and XOR gate outputs of the multiple mix logic sections.
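The mix and permute structure described in this abstract can be sketched in software. The following is a minimal illustration, not the patented circuit: the function names, the left rotation direction, and the rotation amounts and word order passed in by the caller are assumptions chosen for the example, not the constants of the Skein specification.

```python
MASK64 = (1 << 64) - 1  # a quad word is treated as an unsigned 64-bit value


def rotl64(x, r):
    """Rotate a 64-bit word left by r bits (rotation direction is an assumption)."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK64


def mix(x0, x1, r):
    """One mix section: adder on both inputs, rotator on the second input,
    XOR of the adder output with the rotator output."""
    y0 = (x0 + x1) & MASK64
    y1 = rotl64(x1, r) ^ y0
    return y0, y1


def one_round(words, rotations, order):
    """Apply a mix section to each pair of words, then permute the outputs.

    `rotations` and `order` are caller-supplied placeholders, not Skein constants.
    """
    mixed = []
    for j in range(0, len(words), 2):
        y0, y1 = mix(words[j], words[j + 1], rotations[j // 2])
        mixed += [y0, y1]
    return [mixed[i] for i in order]
```

Several such rounds could be chained by feeding the output of `one_round` back in as `words`, which mirrors the multiple circuit stages the abstract mentions.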



METHODS AND APPARATUS FOR PARALLEL PROCESSING

Thu, 27 Oct 2016 08:00:00 EDT

Methods and apparatus for parallel processing are provided. A multicore processor is described. The multicore processor may include a distributed memory unit with memory nodes coupled to the processor's cores. The cores may be configured to execute parallel threads, and at least one of the threads may be data-dependent on at least one of the other threads. The distributed memory unit may be configured to proactively send shared memory data from a thread that produces the shared memory data to one or more of the threads.



SYSTEMS AND METHODS FOR EXECUTING SOFTWARE THREADS USING SOFT PROCESSORS

Thu, 20 Oct 2016 08:00:00 EDT

A hardware acceleration component is provided that includes a plurality of hardware clusters, each hardware cluster comprising a plurality of soft processor cores and a functional circuit. The plurality of soft processor cores share the functional circuit.



RUN-TIME PARALLELIZATION OF CODE EXECUTION BASED ON AN APPROXIMATE REGISTER-ACCESS SPECIFICATION

Thu, 20 Oct 2016 08:00:00 EDT

A method includes, in a processor that processes instructions of program code, processing a first segment of the instructions. One or more destination registers are identified in the first segment using an approximate specification of register access by the instructions. Respective values of the destination registers are made available to a second segment of the instructions only upon verifying that the values are valid for readout by the second segment in accordance with the approximate specification. The second segment is processed at least partially in parallel with processing of the first segment, using the values made available from the first segment.



BRANCH PREDICTION

Thu, 20 Oct 2016 08:00:00 EDT

A tagged geometric length (TAGE) branch predictor 16 incorporates multiple prediction tables 20, 22, 24, 26. Each of these prediction tables has prediction storage lines which store a common stored TAG value 50 and a plurality of branch predictions 52, 54 in respect of different offset positions within a block of program instructions read in parallel. Each of the branch predictions has an associated validity indicator 56, 58. Update of predictions stored may be made by a partial allocation mechanism in which a TAG match occurs and a branch storage line is partially overwritten or by full allocation in which no already matching TAG victim storage line can be identified and instead a whole prediction storage line is cleared and the new prediction stored therein.



PROVIDING CODE SECTIONS FOR MATRIX OF ARITHMETIC LOGIC UNITS IN A PROCESSOR

Thu, 20 Oct 2016 08:00:00 EDT

The present invention relates to a processor having a trace cache and a plurality of ALUs arranged in a matrix, comprising an analyser unit located between the trace cache and the ALUs, wherein the analyser unit analyses the code in the trace cache, detects loops, transforms the code, and issues to the ALUs sections of the code combined to blocks for joint execution for a plurality of clock cycles.



INTERRUPT RETURN INSTRUCTION WITH EMBEDDED INTERRUPT SERVICE FUNCTIONALITY

Thu, 20 Oct 2016 08:00:00 EDT

An instruction pipeline implemented on a semiconductor chip is described. The semiconductor chip includes an execution unit having the following to execute an interrupt handling instruction. Storage circuitry to hold different sets of micro-ops where each set of micro-ops is to handle a different interrupt. First logic circuitry to execute a set of said sets of micro-ops to handle an interrupt that said set is designed for. Second logic circuitry to return program flow to an invoking program upon said first logic circuitry having handled said interrupt.



SYSTEM AND METHOD FOR PIPELINE MANAGEMENT OF ARTIFACTS

Thu, 13 Oct 2016 08:00:00 EDT

In the management of deleted content, deleted data is input into a data analysis engine from one or more first computing devices. A parsing module parses the attributes of the deleted data and modifies the metadata of the deleted data based on results of the parsing. A routing module determines a pipeline with attributes matching the modified metadata of the deleted data and routes the modified deleted data to the pipeline. The modified deleted data in the pipeline is managed based on the pipeline configuration. One or more second computing devices may access the pipeline and evaluate the metadata of the modified deleted data in the pipeline. The one or more second computing devices determine whether or not to inherit the modified deleted data. In determining to inherit the modified deleted data, the one or more second computing devices assume ownership of the modified deleted data.



Methods, Apparatus, Instructions and Logic to Provide Permute Controls With Leading Zero Count Functionality

Thu, 13 Oct 2016 08:00:00 EDT

Instructions and logic provide SIMD permute controls with leading zero count functionality. Some embodiments include processors with a register with a plurality of data fields, each of the data fields to store a second plurality of bits. A destination register has corresponding data fields, each of these data fields to store a count of the number of most significant contiguous bits set to zero for the corresponding data field. Responsive to decoding a vector leading zero count instruction, execution units count the number of most significant contiguous bits set to zero for each of the data fields in the register, and store the counts in the corresponding data fields of the destination register. Vector leading zero count instructions can be used to generate permute controls and completion masks to be used along with the set of permute controls, to resolve dependencies in gather-modify-scatter SIMD operations.
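As a rough software model of the per-field count described here, the sketch below treats a register as a Python list of unsigned integers. The function names and the default 32-bit field width are assumptions made for illustration. The abstract does not name a field width or any instruction mnemonic.

```python
def lzcnt(value, width):
    """Count the most significant contiguous zero bits of a width-bit field."""
    value &= (1 << width) - 1          # keep only the bits that fit in the field
    return width - value.bit_length()  # bit_length() is 0 for a zero field


def vector_lzcnt(src, width=32):
    """Apply the count to every data field; the result models the destination register."""
    return [lzcnt(v, width) for v in src]
```

For example, `vector_lzcnt([0, 1, 0x80000000, 0x00FF0000])` returns `[32, 31, 0, 8]`.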



METHOD AND APPARATUS FOR PERFORMING AN EFFICIENT SCATTER

Thu, 13 Oct 2016 08:00:00 EDT

An apparatus and method for performing an efficient scatter operation. For example, one embodiment of a processor comprises: an allocator unit to receive a scatter operation comprising a number of data elements and responsively allocate resources to execute the scatter operation; a memory execution cluster comprising at least a portion of the resources to execute the scatter operation, the resources including one or more store data buffers and one or more store address buffers; and a senior store pipeline to transfer store data elements from the store data buffers to system memory using addresses from the store address buffers prior to retirement of the scatter operation.



COMMON ARCHITECTURAL STATE PRESENTATION FOR PROCESSOR HAVING PROCESSING CORES OF DIFFERENT TYPES

Thu, 13 Oct 2016 08:00:00 EDT

Methods and apparatuses relating to a common architectural state presentation for a processor having cores of different types are described. In one embodiment, a processor includes a first core, a second core, wherein the first core comprises a unique architectural state and a common architectural state with the second core, and circuitry to migrate a thread from said first core to said second core, said circuitry to migrate the common architectural state from the first core to the second core, and migrate the unique architectural state to a storage external to the second core.



METHODS AND SYSTEMS FOR PERFORMING A REPLAY EXECUTION

Thu, 13 Oct 2016 08:00:00 EDT

One or more embodiments may provide a method for performing a replay. The method includes initiating execution of a program, the program having a plurality of sets of instructions, and each set of instructions has a number of chunks of instructions. The method also includes intercepting, by a virtual machine unit executing on a processor, an instruction of a chunk of the number of chunks before execution. The method further includes determining, by a replay module executing on the processor, whether the chunk is an active chunk, and responsive to the chunk being the active chunk, executing the instruction.



A METHOD AND AN APPARATUS FOR EFFICIENT DATA PROCESSING

Thu, 06 Oct 2016 08:00:00 EDT

The subject of the invention is a method and an apparatus for efficient data processing, using principles of quantum mechanics. The method evolves a quantum register from an initial state ψ to a desired final state ψyes of said register. When a projection measurement of the quantum register, or of parts of said register, renders it in an undesirable state ψnot, the method backtracks to a state computationally equivalent to the initial state ψ by mapping each and every unknown, undesirable final state ψnot of the quantum register to a superposition or ensemble of orthogonal states in the computation space.



Housing Qubit Devices in an Electromagnetic Waveguide System

Thu, 06 Oct 2016 08:00:00 EDT

In some aspects, a quantum computing system includes an electromagnetic waveguide system. The waveguide system has an interior surface that defines an interior volume of intersecting waveguides. Qubit devices are housed in the waveguide system. In some cases, the intersecting waveguides each define a cutoff frequency, and the qubit devices have qubit operating frequencies below the cutoff frequency. In some cases, coupler devices are housed in the waveguide system; each coupler device is configured to selectively couple a pair of neighboring qubit devices based on control signals received from a control source.



MULTITHREADING IN VECTOR PROCESSORS

Thu, 06 Oct 2016 08:00:00 EDT

In one embodiment, a system includes a processor having a vector processing mode and a multithreading mode. The processor is configured to operate on one thread per cycle in the multithreading mode. The processor includes a program counter register having a plurality of program counters, and the program counter register is vectorized. Each program counter in the program counter register represents a distinct corresponding thread of a plurality of threads. The processor is configured to execute the plurality of threads by activating the plurality of program counters in a round robin cycle.
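The round robin activation of per-thread program counters can be modelled in a few lines. This is a sketch under stated assumptions: the class name `RoundRobinPCs`, the list used for the vectorized program counter register, and the rule that each step advances the active counter by one are all illustrative and do not come from the abstract.

```python
class RoundRobinPCs:
    """Toy model of a vectorized program counter register, one counter per thread."""

    def __init__(self, start_pcs):
        self.pcs = list(start_pcs)  # one program counter per thread
        self.current = 0            # index of the thread active this cycle

    def step(self):
        """Run one cycle: use the active thread's counter, then move to the next thread."""
        tid = self.current
        pc = self.pcs[tid]
        self.pcs[tid] = pc + 1  # assumption: sequential, word-addressed instructions
        self.current = (tid + 1) % len(self.pcs)
        return tid, pc
```

With three threads starting at 100, 200 and 300, four calls to `step()` yield `(0, 100)`, `(1, 200)`, `(2, 300)`, `(0, 101)`, so exactly one thread is served per cycle.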



Low Energy Accelerator Processor Architecture with Short Parallel Instruction Word

Thu, 06 Oct 2016 08:00:00 EDT

Methods and apparatus for a low energy accelerator processor architecture with short parallel instruction word. An integrated circuit includes a system bus having a data width N, where N is a positive integer; a central processor unit coupled to the system bus and configured to execute instructions retrieved from a memory coupled to the system bus; and a low energy accelerator processor coupled to the system bus and configured to execute instruction words retrieved from a low energy accelerator code memory, the low energy accelerator processor having a plurality of execution units including a load store unit, a load coefficient unit, a multiply unit, and a butterfly/adder ALU unit, each of the execution units configured to perform operations responsive to op-codes decoded from the retrieved instruction words, wherein the width of the instruction words is equal to the data width N. Additional methods and apparatus are disclosed.



PARALLELIZED EXECUTION OF INSTRUCTION SEQUENCES BASED ON PRE-MONITORING

Thu, 06 Oct 2016 08:00:00 EDT

A method includes, in a processor that processes instructions of program code, processing one or more of the instructions in a first segment of the instructions by a first hardware thread. Upon detecting that an instruction defined as a parallelization point has been fetched for the first thread, a second hardware thread is invoked to process at least one of the instructions in a second segment of the instructions, at least partially in parallel with processing of the instructions of the first segment by the first hardware thread, in accordance with a specification of register access that is indicative of data dependencies between the first and second segments.



REMOVING INVALID LITERAL LOAD VALUES, AND RELATED CIRCUITS, METHODS, AND COMPUTER-READABLE MEDIA

Thu, 06 Oct 2016 08:00:00 EDT

Removing invalid literal load values, and related circuits, methods, and computer-readable media are disclosed. In one aspect, an instruction processing circuit provides a literal load table containing one or more entries comprising an address and a cached literal load value. Upon detecting a literal load instruction in an instruction stream, the instruction processing circuit determines whether the literal load table contains an entry having an address of the literal load instruction. If so, the instruction processing circuit removes the literal load instruction from the instruction stream, and provides the cached literal load value stored in the entry to at least one dependent instruction. The instruction processing circuit further determines whether an invalidity indicator for the literal load table has been received. If so, the instruction processing circuit flushes the literal load table. The invalidity indicator may be generated responsive to modification of a constant table.
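The lookup, forward and flush behaviour described here resembles a small keyed cache. The sketch below is one possible reading: the names `LiteralLoadTable` and `handle_literal_load`, the `record` method used to fill the table, and the `"forward"` / `"execute"` return labels are hypothetical, since the abstract does not say how entries are created.

```python
class LiteralLoadTable:
    """Maps the address of a literal load instruction to a cached literal value."""

    def __init__(self):
        self.entries = {}

    def record(self, address, value):
        # Assumption: an entry is created after a literal load has executed once.
        self.entries[address] = value

    def flush(self):
        # Called when an invalidity indicator is received, e.g. after the
        # constant table has been modified.
        self.entries.clear()


def handle_literal_load(table, address):
    """Decide what to do with a literal load detected in the instruction stream."""
    if address in table.entries:
        # Hit: the load is removed and the cached value goes to dependent instructions.
        return ("forward", table.entries[address])
    # Miss: the load stays in the stream and executes normally.
    return ("execute", None)
```

Flushing the whole table on an invalidity indicator is the simple policy the abstract states. The sketch makes no attempt at selective invalidation.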



METHOD AND APPARATUS FOR A SUPERSCALAR PROCESSOR

Thu, 06 Oct 2016 08:00:00 EDT

A superscalar processor, for out of order self-timed execution, comprising a plurality of independent self-timed function units, having corresponding instruction queues for holding instructions to be executed by the function unit. The processor further comprising an instruction dispatcher configured for inputting instructions in program counter order; and determining an appropriate function unit for execution of the instruction and a resource management unit configured for monitoring the function units and signaling availability of the appropriate function unit, wherein the dispatcher only dispatches the instruction to the appropriate function unit in response to the availability signal from the resource management unit.



PARALLELIZED EXECUTION OF INSTRUCTION SEQUENCES

Thu, 06 Oct 2016 08:00:00 EDT

A method includes, in a processor that processes instructions of program code, processing one or more of the instructions by a first hardware thread. Upon detecting that an instruction defined as a parallelization point has been fetched for the first thread, a second hardware thread is invoked to process at least one of the instructions at least partially in parallel with processing of the instructions by the first hardware thread.



SPECULATIVE LOAD ISSUE

Thu, 06 Oct 2016 08:00:00 EDT

A method and load and store buffer for issuing a load instruction to a data cache. The method includes determining whether there are any unresolved store instructions in the store buffer that are older than the load instruction. If there is at least one unresolved store instruction in the store buffer older than the load instruction, it is determined whether the oldest unresolved store instruction in the store buffer is within a speculation window for the load instruction. If the oldest unresolved store instruction is within the speculation window for the load instruction, the load instruction is speculatively issued to the data cache. Otherwise, the load instruction is stalled until any unresolved store instructions outside the speculation window are resolved. The speculation window is a short window that defines a number of instructions or store instructions that immediately precede the load instruction.
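The issue decision described in this abstract can be written as a short function. This is a minimal sketch, assuming that instructions carry increasing sequence numbers and that the speculation window is measured as a distance in sequence numbers. The `Store` record, the `load_decision` name and the three return labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Store:
    seq: int        # program-order sequence number (assumed representation)
    resolved: bool  # True once the store's address is known


def load_decision(load_seq, store_buffer, window):
    """Return 'issue', 'speculate' or 'stall' for a load with sequence number load_seq."""
    older_unresolved = [s.seq for s in store_buffer
                        if s.seq < load_seq and not s.resolved]
    if not older_unresolved:
        return "issue"      # no older unresolved store: nothing to speculate about
    oldest = min(older_unresolved)
    if load_seq - oldest <= window:
        return "speculate"  # oldest unresolved store lies inside the window
    return "stall"          # wait until stores outside the window resolve
```

Checking only the oldest unresolved store is sufficient because every younger unresolved store is closer to the load and therefore also inside the window.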



METHOD OF COMPILING PROGRAM, STORAGE MEDIUM, AND APPARATUS

Thu, 06 Oct 2016 08:00:00 EDT

A method of compiling a program that executes a plurality of unit processes in parallel, the method includes: replacing a load instruction of a volatile variable, the volatile variable being a variable included in the program and having a possibility of being overwritten by another unit process, with a beginning load instruction indicating a beginning of a range of transactionization and a load, and an end instruction indicating an ending of the range of the transactionization; moving the beginning load instruction before a position of the load instruction of the volatile variable in the program by instruction scheduling; and generating a beginning instruction indicating a beginning of a range of the transactionization and a load instruction of the volatile variable from the moved beginning load instruction.



EMBEDDING ELECTRONIC STRUCTURE IN CONTROLLABLE QUANTUM SYSTEMS

Thu, 29 Sep 2016 08:00:00 EDT

Generating a computing specification to be executed by a quantum processor includes: accepting a problem specification that corresponds to a second-quantized representation of a fermionic Hamiltonian, and transforming the fermionic Hamiltonian into a first qubit Hamiltonian including a first set of qubits that encode a fermionic state specified by occupancy of spin orbitals. An occupancy of any spin orbital is encoded in a number of qubits that is logarithmic in the number of spin orbitals, and a parity for a transition between any two spin orbitals is encoded in a number of qubits that is logarithmic in the number of spin orbitals. An eigenspectrum of a second qubit Hamiltonian, including the first set of qubits and a second set of qubits, includes a low-energy subspace and a high-energy subspace, and an eigenspectrum of the first qubit Hamiltonian is approximated by a set of low-energy eigenvalues of the low-energy subspace.



SIMD IMPLEMENTATION OF STENCIL CODES

Thu, 29 Sep 2016 08:00:00 EDT

Implementing a 1D stencil code via SIMD instructions on a computer with vector registers having N processing elements (PEs), among them a set of coefficient vector registers, a set of at most N data vector registers, and a set of result vector registers. The M stencil coefficients are loaded in a particular pattern into M+N−1 coefficient vector registers. Successive sets of N consecutive data values are received, and each data value of a set is loaded into all PEs of a data vector register of the set of data vector registers. The result vector registers accumulate sums of products of consecutive coefficient vector registers with corresponding data vector registers. The contents of any result vector register containing a sum of all coefficient vector register-data vector register products is output, and the result vector register is reused for accumulating.
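One way to read the M+N−1 coefficient registers is as shifted copies of the coefficient list, so that broadcasting a single data value to all N lanes contributes to N neighbouring outputs at once. The sketch below models that reading with Python lists as vector registers. The shift pattern, the function names, and the block-by-block processing (rather than streaming data through several live result registers) are assumptions, since the abstract only says "a particular pattern".

```python
def coefficient_registers(coeffs, n):
    """Build M+N-1 registers; lane p of register j holds coeffs[j - p], or 0 if out of range."""
    m = len(coeffs)
    return [[coeffs[j - p] if 0 <= j - p < m else 0 for p in range(n)]
            for j in range(m + n - 1)]


def stencil_simd(data, coeffs, n):
    """Compute y[i] = sum_k coeffs[k] * data[i + k] using N-lane multiply-accumulates."""
    m = len(coeffs)
    regs = coefficient_registers(coeffs, n)
    out_len = len(data) - m + 1
    out = []
    for base in range(0, out_len, n):        # each block yields up to N outputs
        acc = [0] * n                        # result vector register
        for j, creg in enumerate(regs):
            if base + j >= len(data):
                break                        # ran past the end of the input
            dreg = [data[base + j]] * n      # one data value loaded into all lanes
            acc = [a + c * d for a, c, d in zip(acc, creg, dreg)]
        out.extend(acc)
    return out[:out_len]                     # drop partial sums of the final block


def stencil_reference(data, coeffs):
    """Plain scalar definition of the same stencil, used as a check."""
    m = len(coeffs)
    return [sum(coeffs[k] * data[i + k] for k in range(m))
            for i in range(len(data) - m + 1)]
```

With three coefficients and four lanes, `coefficient_registers` returns six registers, which matches the M+N−1 count in the abstract.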



SIMD PROCESSING MODULE HAVING MULTIPLE VECTOR PROCESSING UNITS

Thu, 29 Sep 2016 08:00:00 EDT

A SIMD processing module is provided, comprising multiple vector processing units (“VUs”), which can be used to execute an instruction on respective parts (or “subvectors”) within a vector. A control unit determines a vector position indication for each of the VUs to indicate which part of the vector that VU is to execute the instruction on. Therefore, the vector is conceptually divided into subvectors with the respective VUs executing the instruction on the respective subvectors in parallel. Each VU can then execute the instruction as intended, but only on a subsection of the whole vector. This allows an instruction that is written for execution on an n-way VU to be executed by multiple n-way VUs, each starting at different points of the vector, such that the instruction can be executed on more than n of the data items of the vector in parallel.



SYSTEM-ON-A-CHIP (SOC) INCLUDING HYBRID PROCESSOR CORES

Thu, 29 Sep 2016 08:00:00 EDT

A processing device includes a first processor module comprising a first core designed according to a first instruction set architecture (ISA), and a second processor module comprising a second core designed according to a second ISA. The first and second processor modules are fabricated on a same die.



Floating-point supportive pipeline for emulated shared memory architectures

Thu, 29 Sep 2016 08:00:00 EDT

A processor architecture arrangement for emulated shared memory (ESM) architectures, comprising a number of multi-threaded processors each provided with interleaved inter-thread pipeline (400) and a plurality of functional units (402, 402b, 402c, 404, 404b, 404c) for carrying out arithmetic and logical operations on data, wherein the pipeline (400) comprises at least two operatively parallel pipeline branches (414, 416), first pipeline branch (414) comprising a first sub-group of said plurality of functional units (402, 402b, 402c), such as ALUs (arithmetic logic unit), arranged for carrying out integer operations, and second pipeline branch (416) comprising a second, non-overlapping sub-group of said plurality of functional units (404, 404b, 404c), such as FPUs (floating point unit), arranged for carrying out floating point operations, and further wherein one or more of the functional units (404b) of at least said second sub-group arranged for floating point operations are located operatively in parallel with the memory access segment (412, 412a) of the pipeline (400).



SCHEDULERS WITH LOAD-STORE QUEUE AWARENESS

Thu, 29 Sep 2016 08:00:00 EDT

In one embodiment, a computer-implemented method includes tracking a size of a load-store queue (LSQ) during compile time of a program. The size of the LSQ is time-varying and indicates how many memory access instructions of the program are on the LSQ. The method further includes scheduling, by a computer processor, a plurality of memory access instructions of the program based on the size of the LSQ.



APPARATUSES AND METHODS TO SELECTIVELY EXECUTE A COMMIT INSTRUCTION

Thu, 29 Sep 2016 08:00:00 EDT

Methods and apparatuses relating to selectively executing a commit instruction. In one embodiment, a data storage device stores code that when executed by a hardware processor causes the hardware processor to perform the following: translating an instruction into a translated instruction to be executed by the hardware processor, marking a commit instruction one of for execution and for optional execution by the hardware processor, and including a hint for a commit instruction marked for optional execution; and a hardware commit unit to determine if the commit instruction marked for optional execution is to be executed based on the hint.



Systems, Methods, and Apparatuses for Resource Monitoring

Thu, 29 Sep 2016 08:00:00 EDT

Systems, methods, and apparatuses for resource monitoring identification reuse are described. In an embodiment, a system comprises a hardware processor core to execute instructions, storage for resource monitoring identification (RMID) recycling instructions to be executed by the hardware processor core, and a logical processor to execute on the hardware processor core, the logical processor including associated storage for an RMID and state.



Embedded Branch Prediction Unit

Thu, 29 Sep 2016 08:00:00 EDT

In accordance with some embodiments of the present invention, a branch prediction unit for an embedded controller may be placed in association with the instruction fetch unit instead of the decode stage. In addition, the branch prediction unit may include no branch predictor. Also, the return address stack may be associated with the instruction decode stage and is structurally separate from the branch prediction unit. In some cases, this arrangement reduces the area of the branch prediction unit, as well as power consumption.



BRANCH LOOK-AHEAD INSTRUCTION DISASSEMBLING, ASSEMBLING, AND DELIVERING SYSTEM APPARATUS AND METHOD FOR MICROPROCESSOR SYSTEM

Thu, 29 Sep 2016 08:00:00 EDT

A method and system for branch look-ahead (BLA) instruction disassembling, assembling, and delivering are designed to improve the speed of branch prediction and instruction fetch in microprocessor systems by reducing the number of clock cycles required to deliver branch instructions to a branch predictor located inside the microprocessors. The invention is also designed to reduce the run-length of the instructions found between branch instructions by disassembling the instructions in a basic block of the software/assembly program into a BLA instruction and one or more non-BLA instructions. The invention is also designed to dynamically reassemble the BLA and non-BLA instructions and deliver them to one or more microprocessors in a compatible sequence. In particular, the reassembled instructions are delivered concurrently to one or more microprocessors in a timely and precise manner while preserving compatibility with the software/assembly program.



APPARATUS AND METHOD FOR VECTOR HORIZONTAL LOGICAL INSTRUCTION

Thu, 29 Sep 2016 08:00:00 EDT

An apparatus and method are described for performing a vector horizontal logical instruction. For example, one embodiment of a processor comprises: fetch logic to fetch an instruction from memory, and execution logic to determine a value of a first set of one or more data elements from a first specified set of bits of an immediate operand, wherein positions of the first set of one or more data elements determined from the first specified set of bits of the immediate operand are based on a first set of one or more index values that have a most significant bit corresponding to a packed data element at a first set of one or more positions of a destination packed data operand and that have a least significant bit corresponding to a data element at a corresponding position of a first source packed data operand.



PARALLEL DATA PROCESSING APPARATUS

Thu, 29 Sep 2016 08:00:00 EDT

A data processing apparatus includes a plurality of processing elements arranged in a single instruction multiple data array. The apparatus includes an instruction controller operable to receive instructions from a plurality of instruction streams, and to transfer instructions from those instruction streams to the processing elements in the array, such that the data processing apparatus is operable to process a plurality of processing threads substantially in parallel with one another. A data transfer controller is provided which is operable to control transfer of data between the internal memory units associated with the processing elements, and memory external to the array.



APPARATUSES AND METHODS TO ACCELERATE VECTOR MULTIPLICATION

Thu, 29 Sep 2016 08:00:00 EDT

Methods and apparatuses relating to accelerating vector multiplication. In one embodiment, an apparatus includes a first buffer to store a first cache line of indices for elements of a first vector, a second buffer to store a second cache line of indices for elements of a second vector, a comparison unit to compare each index of the first cache line of indices with each index of the second cache line of indices, a plurality of multipliers to each multiply an element from the first vector and an element from the second vector for an index match from the comparison unit to produce a product, and an adder to add together the product from each of the plurality of multipliers.
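
The compare, multiply, and add steps described above can be modeled in software as a sparse dot product. This is a behavioral sketch only: the function name `sparse_dot` and the list-based representation are assumptions, each vector's indices are assumed to be unique, and the nested loop stands in for comparisons that the described hardware would perform on a whole cache line of indices at once.

```python
def sparse_dot(idx_a, val_a, idx_b, val_b):
    """Dot product of two sparse vectors stored as parallel index/value lists."""
    products = []
    for i, ia in enumerate(idx_a):        # compare each index of the first vector...
        for j, ib in enumerate(idx_b):    # ...with each index of the second
            if ia == ib:                  # index match: multiply the two elements
                products.append(val_a[i] * val_b[j])
    return sum(products)                  # adder combines the products

# Indices 3 and 7 match: 1.0 * 10.0 + 4.0 * 0.5
print(sparse_dot([0, 3, 7], [2.0, 1.0, 4.0], [3, 5, 7], [10.0, 1.0, 0.5]))  # 12.0
```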



GUEST INSTRUCTION BLOCK WITH NEAR BRANCHING AND FAR BRANCHING SEQUENCE CONSTRUCTION TO NATIVE INSTRUCTION BLOCK

Thu, 29 Sep 2016 08:00:00 EDT

A method for translating instructions for a processor is described. The method includes accessing a plurality of guest instructions that comprise multiple guest branch instructions comprising at least one guest far branch, and building an instruction sequence from the plurality of guest instructions by using branch prediction on the at least one guest far branch. The method further includes assembling a guest instruction block from the instruction sequence. The guest instruction block is translated to a corresponding native conversion block that includes at least one native far branch corresponding to the at least one guest far branch, wherein the at least one native far branch includes an opposite guest address for an opposing branch path of the at least one guest far branch. Upon encountering a misprediction, a correct instruction sequence is obtained by accessing the opposite guest address.



SYSTEM MANAGEMENT MODE TRUST ESTABLISHMENT FOR OS LEVEL DRIVERS

Thu, 29 Sep 2016 08:00:00 EDT

Various embodiments are generally directed to establishing trust in system management mode. An operating system management mode driver can invoke a system management mode and provide a signature to the system management mode to authenticate the driver with. Additionally, a hash value of the driver can be used to determine whether the driver is authorized to invoke system management mode or particular operations or features of system management mode.



INSTRUCTIONS AND LOGIC TO PROVIDE ATOMIC RANGE OPERATIONS

Thu, 29 Sep 2016 08:00:00 EDT

Instructions and logic provide atomic range operations in a multiprocessing system. In one embodiment, an atomic range modification instruction specifies an address for a set of range indices. The instruction locks access to the set of range indices and loads the range indices to check the range size. The range size is compared with a size sufficient to perform the range modification. If the range size is sufficient, the range modification is performed and one or more modified range indices of the set of range indices are stored back to memory. Otherwise, an error signal is set. Access to the set of range indices is unlocked responsive to completion of the atomic range modification instruction. Embodiments may include atomic increment next instructions, add next instructions, decrement end instructions, and/or subtract end instructions.
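
The lock, check, modify, and unlock sequence can be approximated in software as follows. This is a sketch under stated assumptions rather than the instruction's actual semantics: the class `AtomicRange`, the exception `RangeTooSmall` (standing in for the error signal), and the half-open `[next, end)` representation of the range indices are all hypothetical, and a `threading.Lock` substitutes for the hardware locking.

```python
import threading

class RangeTooSmall(Exception):
    """Hypothetical stand-in for the error signal in the abstract."""

class AtomicRange:
    """A range [next, end) whose indices are modified under a lock."""

    def __init__(self, next_index, end_index):
        self._lock = threading.Lock()
        self.next = next_index
        self.end = end_index

    def increment_next(self, count=1):
        """Claim `count` items from the front; return the first claimed index."""
        with self._lock:                    # lock access to the range indices
            size = self.end - self.next     # load the indices, check range size
            if size < count:                # not sufficient: signal an error
                raise RangeTooSmall(f"need {count}, have {size}")
            start = self.next
            self.next = start + count       # store the modified index back
            return start                    # lock is released on leaving `with`

r = AtomicRange(0, 3)
print(r.increment_next(2))   # 0
print(r.next)                # 2
```

A second call to `r.increment_next(2)` would raise `RangeTooSmall`, because only one item remains, and the indices would be left unchanged.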



History Buffer with Single Snoop Tag for Multiple-Field Registers

Thu, 29 Sep 2016 08:00:00 EDT

An approach is provided in which a mapper control unit matches a result instruction tag corresponding to an executed instruction to a history buffer entry's instruction tag. The matched history buffer entry includes multiple history buffer field sets that each include a field set state indicator. The mapper control unit identifies a subset of the history buffer field sets having a valid field set state indicator and stores result data corresponding to the result instruction tag in the identified subset of history buffer field sets. In turn, the mapper control unit restores a subset of a register's fields utilizing content from the subset of history buffer field sets.



CONTROLLING DATA FLOW BETWEEN PROCESSORS IN A PROCESSING SYSTEM

Thu, 29 Sep 2016 08:00:00 EDT

A processing system includes a program processor for executing a program, and a dedicated processor for executing operations of a particular type (e.g. vector processing operations). The program processor uses an interfacing module and a group of two or more register banks to offload operations of the particular type to the dedicated processor for execution thereon. Whilst the dedicated processor is accessing one register bank for executing a current operation, the interfacing module can concurrently load data for a subsequent operation into a different one of the register banks. The use of multiple register banks allows the dedicated processor to spend a greater proportion of its time executing operations.
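
The benefit of overlapping loads with execution can be estimated with a small timing model. This is a back-of-the-envelope sketch, not a model taken from the patent: the function `total_cycles`, the per-operation `load` and `execute` cycle counts, and the assumption of a single interfacing module that loads one bank at a time are all illustrative.

```python
def total_cycles(load, execute, double_buffered):
    """Cycles to run all operations; load[k] and execute[k] are per-operation costs."""
    if not double_buffered:
        # One register bank: every load must finish before its execute starts,
        # and the next load waits for that execute to finish.
        return sum(load) + sum(execute)
    # Two register banks: while operation k executes from one bank,
    # operation k + 1 is loaded into the other.
    t = load[0]
    for k in range(len(execute)):
        next_load = load[k + 1] if k + 1 < len(load) else 0
        t += max(execute[k], next_load)
    return t

load, execute = [2, 2, 2], [5, 5, 5]
print(total_cycles(load, execute, double_buffered=False))  # 21
print(total_cycles(load, execute, double_buffered=True))   # 17
```

With these example numbers only the first load is exposed, so the dedicated processor is busy for 15 of 17 cycles instead of 15 of 21.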