Introduction

This document describes how the PSP VFPU instruction set operates. We attempted to collect all the knowledge available in the community and put it together in a document that can be used as a reference by developers and enthusiasts.

The goal is to describe the behaviour of the hardware unit in as much detail as possible, in a way that lets every statement be verified. For this reason, every functional detail described in the docs must have a test that validates it. Of course some things are harder to validate (like hardware bugs), so there are some statements that do not have tests at this time.

MIPS Allegrex CPU

The Allegrex CPU is a 32 bit MIPS CPU based on the MIPS II architecture, and it has many similarities with other CPUs of that family. However, if we only focus on the instruction set, the main differences with other CPUs in the MIPS II family would be:

  • Lack of 64 bit FPU support (only single float support).
  • Lack of MMU/TLB (only certain memory protections are available).
  • Some extra MIPS32r2 instructions (mostly arithmetic and bit manipulation).
  • Other COP0 instructions, some borrowed from MIPS32
  • Lack of Coprocessor 3, and a custom Coprocessor 2 (the VFPU)

Most of the extra instructions that are present in the CPU are identical to their MIPS32 counterparts. In some cases though, the encoding is slightly different.

VFPU unit

The PSP VFPU is a coprocessor unit that can perform vector/matrix float and integer operations on a set of 128 registers, each 32 bits wide. It features dedicated units to perform the most usual operations that 3D videogames require.

Register set

The VFPU features 128 registers, each of them 32 bit wide. Most of the time they are interpreted as IEEE-754 compliant floating point registers, although some instructions will interpret them as integers (or other formats such as 8/16 bit packed integers). The registers can be addressed individually but also in a more powerful way by grouping them as vectors or matrices.

Registers will usually be represented in their matrix layout. The VFPU has 8 matrices, each of them containing 16 elements (4 rows by 4 columns). For each of the 8 total available matrices, the elements are arranged in the following fashion (X represents the matrix number, 0 to 7):

Single 32 bit elements

  SX00  SX10  SX20  SX30
  SX01  SX11  SX21  SX31
  SX02  SX12  SX22  SX32
  SX03  SX13  SX23  SX33

When the registers are referenced as vectors, they are grouped as rows and columns of a given matrix. This is important since it means that a vector is composed of elements from a single matrix and cannot access elements across multiple matrices. There are 2D, 3D and 4D vectors, usually called pair, trio and quad respectively. Single elements can be viewed as 1D vectors, and most instructions are available in all four possible vector sizes (which makes the instruction set very uniform). Not all access patterns are possible: pair and trio registers have 128 possible addressing modes while quad has only 64. The available patterns are described as follows:

2D vector rows

  RX00  RX20
  RX01  RX21
  RX02  RX22
  RX03  RX23

3D vector rows

  RX00
  RX01
  RX02
  RX03

3D vector rows

  RX10
  RX11
  RX12
  RX13

4D vector rows

  RX00
  RX01
  RX02
  RX03

2D vector cols

  CX00  CX10  CX20  CX30
  CX02  CX12  CX22  CX32

3D vector cols

  CX00  CX10  CX20  CX30

3D vector cols

  CX01  CX11  CX21  CX31

4D vector cols

  CX00  CX10  CX20  CX30
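
The addressing-mode counts quoted above can be re-derived by simple enumeration. The following C sketch is only an illustration: it assumes, based on the tables above, that a pair can start at two positions within a row or column, a trio at two positions, and a quad at a single position. The function name `vfpu_vector_modes` is hypothetical.

```c
#include <assert.h>

enum { VFPU_MATRICES = 8, VFPU_LINES = 4, VFPU_DIRECTIONS = 2 };

/* Number of addressable vectors with the given element count (2, 3 or 4).
 * Assumption: the number of start positions per row/column is read off the
 * tables above (pair: 2, trio: 2, quad: 1). */
static int vfpu_vector_modes(int size)
{
    assert(size >= 2 && size <= 4);
    int starts = (size == 4) ? 1 : 2;
    /* 8 matrices, rows or columns, 4 lines each, `starts` positions per line */
    return VFPU_MATRICES * VFPU_DIRECTIONS * VFPU_LINES * starts;
}
```

With these assumptions the function yields 128 for pairs and trios and 64 for quads, matching the figures in the text.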

Matrix addressing is similar to vectors: registers can be read vertically or horizontally. That means matrices can be accessed in a row major and column major mode (i.e. by accessing them as a set of rows or columns). Similarly there are three possible sizes: 2x2, 3x3 and 4x4, containing 4, 9 and 16 registers respectively. Again, not all addressing patterns are available: there are 64 possible addressing modes for 2x2 and 3x3 matrices, but only 16 for 4x4 matrices. These are:

2D matrix

  MX00  MX20
  MX02  MX22

3D matrix

  MX00

3D matrix

  MX10

3D matrix

  MX01

3D matrix

  MX11

4D matrix

  MX00
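
The matrix counts can be checked the same way. This C sketch assumes, from the tables above, four possible positions for a 2x2 or 3x3 matrix inside its 4x4 block and a single position for a 4x4 matrix, each readable in row major or column major mode. The helper names are hypothetical.

```c
#include <assert.h>

/* Registers covered by a dim x dim matrix (4, 9 or 16). */
static int vfpu_matrix_registers(int dim)
{
    assert(dim >= 2 && dim <= 4);
    return dim * dim;
}

/* Number of addressable dim x dim matrices.
 * Assumption: 4 positions for 2x2 and 3x3, 1 position for 4x4 (per the
 * tables above), times 2 access modes, times 8 matrices. */
static int vfpu_matrix_modes(int dim)
{
    assert(dim >= 2 && dim <= 4);
    int positions = (dim == 4) ? 1 : 4;
    return 8 * 2 * positions;
}
```

Under these assumptions the result is 64 for 2x2 and 3x3 matrices and 16 for 4x4 matrices.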

There is also a small set of "control" registers that are used for a variety of things, such as prefix state, comparison flag bits, etc. These registers are defined as follows:

  • Reg 128 (VFPU_PFXS): holds the rs prefix value.
  • Reg 129 (VFPU_PFXT): holds the rt prefix value.
  • Reg 130 (VFPU_PFXD): holds the rd prefix value.
  • Reg 131 (VFPU_CC): holds the condition code value.
  • Reg 135 (VFPU_REV): read only register with VFPU revision information.
  • Regs 136 to 143 (VFPU_RCX0 to VFPU_RCX7): Pseudorandom generator state.

Some of these registers are never accessed directly but rather using some VFPU instructions (ie. prefixes, condition code, etc). However these can be read and written in some useful cases, for instance thread context saving and restoration (so that the VFPU state is preserved across thread rescheduling).

Register hazards

Most CPUs have what's called "hazard detection logic", which tracks register reads and writes so that things happen in the right order and results actually make sense. In the VFPU this is also the case; however, some operations are quite complex and their hazards are hard to track.

Control registers seem to have some hazards: for instance, the mfvc instruction has a one cycle hazard with any previous vcmp instruction. That means a vnop or some other VFPU instruction should be inserted between a vcmp and mfvc instruction pair to get the right VFPU_CC value.

Some VFPU instructions (mostly dealing with matrices and transformations) require that the input and output registers do not overlap. This has to do with how the hardware performs the operations internally: the VFPU can perform most vector-vector operations in a native way, but matrix operations seem to be decomposed into series of vector-vector operations (i.e. a vmmul seems to be a sequence of vtfm operations). Since the output is written piece by piece as these partial operations complete, an overlapping input can be overwritten before the hardware has read it, causing incorrect results for the operation.

The affected instructions are divided in two groups: a group that does not allow any sort of overlap, and another group that allows some limited overlap. Instructions vmmul, vtfm2/3/4, vhtfm2/3/4, vqmul and vcrsp do not allow any sort of overlap between input and output registers. These instructions perform operations by repeating a dot product operation multiple times, which results in partial updates of the output register. These partial updates overwrite the input register, causing the result to be incorrect.

Instructions that allow partial overlaps are vsin, vcos, vasin, vnsin, vexp2, vrexp2, vlog2, vsqrt, vrsq, vrcp, vnrcp, vdiv, vmscl and vmmov. Single versions (.s) are not affected by this restriction. These instructions are also internally decomposed into a bunch of smaller operations (for instance trigonometric operations are decomposed into a series of single (.s) operations). The registers are allowed to overlap as long as they are compatible in terms of element count and access "direction" (ie. a matrix must be read using the same mode).

Examples

  vmscl.p M000, M022, S100    # No overlap, always OK
  vmscl.p M000, M000, S100    # M000 overlaps with itself, OK
  vmscl.p M000, E000, S100    # Invalid overlap, matrix order is different
  vmscl.t M000, M011, S100    # Overlapping registers are not identical
  vcos.q R000, C000           # Invalid overlap (one element only)
  vcos.q R000, R000           # Identical overlap, OK

Floating point format

Although the FPU seems IEEE-754 compliant, it has a couple of non-standard features that break this compatibility. Its rounding mode is hardwired to "round to nearest" mode, so users cannot choose another rounding mode. It also lacks support for denormal numbers (also called subnormals): when an operation produces a subnormal number, the result is flushed to zero. If the input of an operation is a denormal number, it will also be treated as zero.

See the ieee754-fun.c file for tests.
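
The denormal handling described above can be approximated on a host machine as follows. This C sketch is not taken from the test suite: the helper names are hypothetical, and keeping the sign of a flushed value is an assumption that the text does not confirm.

```c
#include <stdint.h>
#include <string.h>

/* Replaces a subnormal value by zero and leaves every other value untouched.
 * Assumption: the sign bit of the flushed value is preserved. */
static float vfpu_flush_denormal(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;
    if (exponent == 0 && mantissa != 0)
        bits &= 0x80000000u;            /* keep only the sign bit */
    memcpy(&x, &bits, sizeof bits);
    return x;
}

/* Host-side model of a multiply that treats subnormal inputs as zero and
 * flushes a subnormal result to zero. */
static float vfpu_mul_model(float a, float b)
{
    float product = vfpu_flush_denormal(a) * vfpu_flush_denormal(b);
    return vfpu_flush_denormal(product);
}
```

For example, `vfpu_mul_model(1e-40f, 1e30f)` gives zero in this model because the first input is subnormal, whereas a fully IEEE-754 compliant multiply would give a non-zero result.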

Instruction execution

The VFPU is a pipelined CPU with an issue width of one. That means that instructions take multiple cycles to execute, since they execute partially during each cycle, and a maximum of one new instruction begins execution each cycle. Instructions that block the pipeline for more than one cycle can be identified by having a throughput different than one. These block the pipeline for a certain number of cycles before a new instruction can enter it.

An instruction usually begins executing whenever its input registers are ready, that is, when any previous instruction writing those registers has fully completed its execution. For this reason it is important to closely observe the instruction latency, measured in cycles, since an instruction might have to wait for its inputs to become available, reducing efficiency. A common strategy is to interleave non-dependent instructions to hide latency and avoid wasting CPU cycles.
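
The effect of interleaving can be illustrated with a deliberately simplified cycle model. The C sketch below assumes a single-issue, in-order pipeline in which an instruction issues once its two source registers are ready and its result becomes available `latency` cycles later; it ignores throughput stalls and everything else the real hardware does. All names are hypothetical.

```c
#include <stddef.h>

enum { VFPU_MODEL_REGS = 128 };

typedef struct {
    int dst;      /* register written */
    int src_a;    /* first register read */
    int src_b;    /* second register read */
    int latency;  /* cycles until dst becomes available */
} vfpu_model_op;

/* Returns the cycle at which the last result becomes available. */
static int vfpu_model_run(const vfpu_model_op *ops, size_t count)
{
    int ready[VFPU_MODEL_REGS] = {0};  /* cycle at which each register is valid */
    int issue = 0;                     /* earliest cycle for the next issue */
    int finish = 0;

    for (size_t i = 0; i < count; i++) {
        int start = issue;
        if (ready[ops[i].src_a] > start) start = ready[ops[i].src_a];
        if (ready[ops[i].src_b] > start) start = ready[ops[i].src_b];
        ready[ops[i].dst] = start + ops[i].latency;
        if (ready[ops[i].dst] > finish) finish = ready[ops[i].dst];
        issue = start + 1;             /* issue width of one */
    }
    return finish;
}
```

In this model, two independent two-instruction chains with a latency of 5 cycles finish after 16 cycles when executed one chain after the other, and after 11 cycles when the chains are interleaved.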

The pipeline structure looks more or less as follows:

  • Register read
  • Input prefix operations
  • VFPU operation (arithmetic, logic)
  • Output prefix operation
  • Register write

Prefix operations make it possible to apply certain operations to the inputs before the actual instruction operation, and some other operations to the output.

Prefix operations

VFPU operations can operate on one or two inputs (rs and rt) and one output (rd). The input values can be pre-processed by using the VFPU_PFXS and VFPU_PFXT registers (and therefore vpfxs and vpfxt instructions). The result of the operation being written to rd can be post-processed by using the VFPU_PFXD register (vpfxd instruction).

Valid operations for input registers are:

  • Sign change (negation)
  • Absolute value
  • Swizzle (rearranging elements in a row/col)
  • Override element with constant value.

Operations available to the output register post-processing are:

  • Value clamping (to ranges 0..1 or -1..1)
  • Write masking (disable register write)

There are some restrictions on their usage. The assembler will signal an error should you violate any of the restrictions.

  • Constant values can only be 0, 1, 2, 3, 1/2, 1/3, 1/4, 1/6 or any of their negative counterparts
  • Swizzle cannot extend beyond the operand size (i.e. you cannot use .z with an instruction that uses single or pair elements).

A few examples to showcase input prefixes:

  # Sign change prefix
  vmul.p R000, R001, R002[-x,-y]     # Multiplies two rows negating one of the inputs
                                     # S000 = S001 * -S002;  S010 = S011 * -S012

  vfad.q S000, R001[x,-y,z,-w]       # Funnel-add all elements with some changed signs
                                     # S000 = S001 - S011 + S021 - S031

  # Absolute value prefix
  vdot.p S000, R001[|x|,|y|], R002   # Dot product with forced absolute value for R001
                                     # S000 = |S001| * S002 + |S011| * S012

  # Negative and absolute value prefixes
  vdot.p S000, R001[-|x|,-|y|], R002   # Dot product with forced negative values
                                       # S000 = -|S001| * S002 - |S011| * S012

  # Swizzle prefix
  vmul.q R000, R001, R002[x,y,x,y]   # Multiplies with repeating values
                                     # S000 = S001 * S002;  S010 = S011 * S012
                                     # S020 = S021 * S002;  S030 = S031 * S012

  # Constant value prefixes
  vdot.t S000, R001, R002[1,2,3]     # Second operand ignored, overrides to (1,2,3)
                                     # S000 = S001 + S011 * 2 + S021 * 3

  vdot.t S000, R001, R002[x,-2,-y]   # Mix swizzle and constant elements
                                     # S000 = S001 * S002 - S011 * 2 - S021 * S012
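
The input prefix examples above can be mimicked with a small host-side model. The C sketch below is an illustration only: the structure layout and names are hypothetical and do not reflect how the prefix registers are actually encoded, and applying the absolute value before the negation is an assumption inferred from the `-|x|` notation.

```c
/* Per-element description of an input prefix (hypothetical layout). */
typedef struct {
    int   use_const;  /* non-zero: use `constant` instead of a register element */
    int   swizzle;    /* source element index: 0..3 for x, y, z, w */
    float constant;   /* one of the allowed constants (0, 1, 2, 3, 1/2, ...) */
    int   absolute;   /* take the absolute value */
    int   negate;     /* negate, applied after the optional absolute value */
} vfpu_lane_prefix;

/* Builds the operand actually seen by the instruction.
 * `in` and `out` must not overlap. */
static void vfpu_apply_input_prefix(const float *in, float *out, int size,
                                    const vfpu_lane_prefix *pfx)
{
    for (int i = 0; i < size; i++) {
        float v = pfx[i].use_const ? pfx[i].constant : in[pfx[i].swizzle];
        if (pfx[i].absolute && v < 0.0f)
            v = -v;
        if (pfx[i].negate)
            v = -v;
        out[i] = v;
    }
}
```

For instance, the operand `R002[x,-2,-y]` corresponds to three entries: element x unchanged, the constant 2 negated, and element y negated.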


Some more examples for output prefixes.

  vmul.p R000[[-1:1],[-1:1]], R001, R002  # Multiplies with output saturation
                                          # S000 = min(1.0f, max(-1.0f, S001 * S002))
                                          # S010 = min(1.0f, max(-1.0f, S011 * S012))
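
Output prefixes can be modelled in the same spirit. In the C sketch below, each element of the result is optionally clamped and then written unless its write is masked. The names are hypothetical, and the treatment of NaN values is left unspecified since the text does not describe it.

```c
typedef enum {
    VFPU_SAT_NONE,     /* no clamping */
    VFPU_SAT_0_TO_1,   /* clamp to the range 0..1 */
    VFPU_SAT_M1_TO_1   /* clamp to the range -1..1 */
} vfpu_sat_mode;

static float vfpu_clamp(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Writes `result` into `rd`, applying saturation and write masking per element. */
static void vfpu_apply_output_prefix(float *rd, const float *result, int size,
                                     const vfpu_sat_mode *sat, const int *masked)
{
    for (int i = 0; i < size; i++) {
        if (masked[i])
            continue;                  /* masked element keeps its old value */
        float v = result[i];
        if (sat[i] == VFPU_SAT_0_TO_1)
            v = vfpu_clamp(v, 0.0f, 1.0f);
        else if (sat[i] == VFPU_SAT_M1_TO_1)
            v = vfpu_clamp(v, -1.0f, 1.0f);
        rd[i] = v;
    }
}
```

A masked element is skipped entirely, so the destination register keeps whatever value it held before the instruction.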

Adding a prefix modifier to an operand will result in vpfxs/t/d instructions being emitted before the actual instruction. This syntax exists just to make assembly coding more comfortable to the user. When using the disassembler the prefix instructions will be clearly visible.

  # The following operand-decorated instruction:
  vmul.q R000, R100[x,y,x,y], R200[-x,-y,z,w]

  # is actually encoded as a sequence of instructions:
  vpfxs [x,y,x,y]
  vpfxt [-x,-y,z,w]
  vmul.q R000, R100, R200

Prefix instructions consume one cycle and have no visible latency (the "decorated" instruction doesn't have to wait any extra cycles). In some cases it might be faster to not use prefixes and use other instructions (vcst, vabs, vneg, vsat0/1 are some similar alternatives), particularly when optimizing for throughput. The advantage of using prefixes is that latency is kept low (since they have no latency and the extra operation is "included" in the instruction pipeline).

Allegrex Instructions

Bit manipulation instructions

The following instructions exist in the Allegrex CPU and share the same MIPS32 encodings:

  • seb: Sign extend byte (byte to word signed extension)
  • seh: Sign extend half-word (half-word to word signed extension)
  • ext: Extract bit field (extract a bit field in a zeroed register)
  • ins: Insert bit field (insert lower bits into another register)
  • wsbh: Swap bytes within a half-word

Other instructions that are borrowed from MIPS32 but have a different encoding are:

  • clo: Count leading ones (uses some unused SPECIAL encodings)
  • clz: Count leading zeros (uses some unused SPECIAL encodings)

The bit manipulation Allegrex specific instructions are:

  • wsbw: Swap bytes in word (uses BSHFL encoding adjacent to wsbh)
  • bitrev: Reverse bits in a word (uses unused BSHFL encoding)
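
As a reference for the bit manipulation instructions listed above, the following C sketch gives one possible host-side model of their results. The function names are hypothetical; the semantics of seb, seh, ext, ins, wsbh, clz and clo follow their MIPS32 definitions, while wsbw and bitrev follow the one-line descriptions given above.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t field_mask(unsigned size)
{
    return size >= 32 ? 0xFFFFFFFFu : (1u << size) - 1u;
}

/* seb / seh: sign extend the low byte / half-word to a full word. */
static uint32_t allegrex_seb(uint32_t rt) { return (rt & 0x80u) ? (rt | 0xFFFFFF00u) : (rt & 0xFFu); }
static uint32_t allegrex_seh(uint32_t rt) { return (rt & 0x8000u) ? (rt | 0xFFFF0000u) : (rt & 0xFFFFu); }

/* ext: extract `size` bits starting at bit `pos`, zero extended. */
static uint32_t allegrex_ext(uint32_t rs, unsigned pos, unsigned size)
{
    assert(size >= 1 && pos + size <= 32);
    return (rs >> pos) & field_mask(size);
}

/* ins: insert the low `size` bits of rs into rt at bit `pos`. */
static uint32_t allegrex_ins(uint32_t rt, uint32_t rs, unsigned pos, unsigned size)
{
    assert(size >= 1 && pos + size <= 32);
    uint32_t mask = field_mask(size) << pos;
    return (rt & ~mask) | ((rs << pos) & mask);
}

/* wsbh: swap the two bytes inside each half-word. */
static uint32_t allegrex_wsbh(uint32_t rt)
{
    return ((rt & 0x00FF00FFu) << 8) | ((rt >> 8) & 0x00FF00FFu);
}

/* wsbw: reverse the four bytes of the word. */
static uint32_t allegrex_wsbw(uint32_t rt)
{
    return (rt >> 24) | ((rt >> 8) & 0x0000FF00u) | ((rt << 8) & 0x00FF0000u) | (rt << 24);
}

/* bitrev: reverse the 32 bits of the word. */
static uint32_t allegrex_bitrev(uint32_t rt)
{
    uint32_t out = 0;
    for (int i = 0; i < 32; i++) {
        out = (out << 1) | (rt & 1u);
        rt >>= 1;
    }
    return out;
}

/* clz / clo: count leading zero / one bits (32 when all bits match). */
static unsigned allegrex_clz(uint32_t rs)
{
    unsigned n = 0;
    while (n < 32 && !(rs & (0x80000000u >> n)))
        n++;
    return n;
}
static unsigned allegrex_clo(uint32_t rs) { return allegrex_clz(~rs); }
```

These helpers only model the register results; they say nothing about the instruction encodings.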

Arithmetic-Logical instructions

Allegrex features some instructions present in MIPS32 and MIPS32r2 with identical encoding to these:

  • rotr: Rotate word right by a fixed amount
  • rotrv: Rotate word right by a variable amount
  • movz: Conditional register move on zero
  • movn: Conditional register move on non-zero

Other instructions that have some particular encoding are multiply-accumulate instructions. Some overlap with MIPS R4010 encodings and some others just use unused encodings. They all use unused SPECIAL opcodes:

  • madd: Signed multiply-accumulate integer
  • maddu: Unsigned multiply-accumulate integer
  • msub: Signed multiply-subtract integer
  • msubu: Unsigned multiply-subtract integer

There are also two novel Allegrex instructions that are used to perform faster compare-and-move operations. These use free SPECIAL opcodes as well:

  • min: Selects smallest (signed) value between two registers.
  • max: Selects greatest (signed) value between two registers.
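
The arithmetic-logical instructions above can be summarised by the following C sketch of their results. The function names are hypothetical. Modelling the multiply-accumulate target as a single 64 bit value standing for the HI:LO register pair is an assumption based on the usual MIPS convention, which the text does not spell out.

```c
#include <stdint.h>

/* rotr / rotrv: rotate a word right; only the low 5 bits of the amount count. */
static uint32_t allegrex_rotr(uint32_t value, unsigned amount)
{
    amount &= 31u;
    return amount ? (value >> amount) | (value << (32u - amount)) : value;
}

/* movz / movn: return the new value of rd given its previous value. */
static uint32_t allegrex_movz(uint32_t rd, uint32_t rs, uint32_t rt) { return rt == 0 ? rs : rd; }
static uint32_t allegrex_movn(uint32_t rd, uint32_t rs, uint32_t rt) { return rt != 0 ? rs : rd; }

/* min / max: signed comparison between two registers. */
static int32_t allegrex_min(int32_t rs, int32_t rt) { return rs < rt ? rs : rt; }
static int32_t allegrex_max(int32_t rs, int32_t rt) { return rs > rt ? rs : rt; }

/* madd / msub / maddu / msubu: `hilo` stands for the 64 bit accumulator
 * (assumed to be the HI:LO pair). Unsigned arithmetic gives wrap-around. */
static uint64_t allegrex_madd(uint64_t hilo, int32_t rs, int32_t rt)
{
    return hilo + (uint64_t)((int64_t)rs * (int64_t)rt);
}
static uint64_t allegrex_msub(uint64_t hilo, int32_t rs, int32_t rt)
{
    return hilo - (uint64_t)((int64_t)rs * (int64_t)rt);
}
static uint64_t allegrex_maddu(uint64_t hilo, uint32_t rs, uint32_t rt)
{
    return hilo + (uint64_t)rs * (uint64_t)rt;
}
static uint64_t allegrex_msubu(uint64_t hilo, uint32_t rs, uint32_t rt)
{
    return hilo - (uint64_t)rs * (uint64_t)rt;
}
```

Without min and max, the same result would typically need a comparison followed by a conditional move, which is why the text describes them as faster compare-and-move operations.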

VFPU Instructions

bvf

VFPU branch on false

Encoding: fixed bits 0 1 0 0 1 0 0 1 0 0 0 0 0; fields: vfpucc, offset

Syntax

bvf imm3, offset

Description

Branch on VFPU CC register being false

Instruction performance

Throughput: 1 cycles/instruction
Latency: 4 cycles

bvfl

VFPU likely branch on false

Encoding: fixed bits 0 1 0 0 1 0 0 1 0 0 0 1 0; fields: vfpucc, offset

Syntax

bvfl imm3, offset

Description

Branch on VFPU CC register being false (likely)

Instruction performance

Throughput: 1 cycles/instruction
Latency: 4 cycles

bvt

VFPU branch on true

Encoding: fixed bits 0 1 0 0 1 0 0 1 0 0 0 0 1; fields: vfpucc, offset

Syntax

bvt imm3, offset

Description

Branch on VFPU CC register being true

Instruction performance

Throughput: 1 cycles/instruction
Latency: 4 cycles

bvtl

VFPU likely branch on true

Encoding: fixed bits 0 1 0 0 1 0 0 1 0 0 0 1 1; fields: vfpucc, offset

Syntax

bvtl imm3, offset

Description

Branch on VFPU CC register being true (likely)

Instruction performance

Throughput: 1 cycles/instruction
Latency: 4 cycles
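
The condition tested by the four branch instructions above can be sketched in C as follows. This is an interpretation rather than a statement from the text: it assumes that the imm3 operand selects one bit of the VFPU_CC register, and it does not model the branch target computation or the behaviour of the "likely" variants. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns non-zero when the branch would be taken.
 * Assumption: imm3 is the index of the VFPU_CC bit being tested.
 * branch_on_true is non-zero for bvt/bvtl and zero for bvf/bvfl. */
static int vfpu_branch_taken(uint32_t vfpu_cc, unsigned imm3, int branch_on_true)
{
    assert(imm3 < 8);
    int bit = (int)((vfpu_cc >> imm3) & 1u);
    return branch_on_true ? bit : !bit;
}
```

With VFPU_CC equal to 1, a bvt on bit 0 would be taken in this model and a bvf on bit 0 would not.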

mtvc

Move GPR to VFPU control register

Encoding: fixed bits 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0; fields: gpr, vfpucc

Syntax

mtvc rt, imm8

Description

Writes the contents of a CPU general purpose register to the specified VFPU control register

mfvc

Move VFPU control register to GPR

Encoding: fixed bits 0 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0; fields: gpr, vfpucc

Syntax

mfvc rt, imm8

Description

Writes the contents of the specified VFPU control register into a CPU general purpose register

Hazards

The instruction does not have interlocks, so the result of a vcmp instruction is only available one cycle later. You will need to interleave at least one VFPU instruction between a vcmp and mfvc (ie. a vnop).

vmtvc

Move vector register to VFPU control register

Encoding: fixed bits 1 1 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0; fields: rs, vfpucc

Syntax

vmtvc imm8, rs

Description

Writes the contents of a VFPU vector register to the specified VFPU control register

vmfvc

Move VFPU control register to vector register

Encoding: fixed bits 1 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0; fields: rd, vfpucc

Syntax

vmfvc rd, imm8

Description

Writes the contents of the specified VFPU control register into a VFPU vector register

Hazards

The instruction does not have interlocks, so the result of a previous vcmp instruction is only available one cycle later. You will need to interleave at least one VFPU instruction between a vcmp and vmfvc (i.e. a vnop).

lv.s

Load VFPU element

Encoding: fixed bits 1 1 0 0 1 0; fields: gpr, rtlo, rthi, offset

Syntax

lv.s rd, imm14(rt)

Description

Performs a 4 byte memory load to a VFPU register. Address must be 4 byte aligned or a fault is generated.

Allowed prefixes

  • rd: Not supported

lv.q

Load VFPU quad element

Encoding: fixed bits 1 1 0 1 1 0 0; fields: gpr, rtlo, rthi, offset

Syntax

lv.q rd, imm14(rt)

Description

Performs a 16 byte memory load to a VFPU quad register. Address must be 16 byte aligned or a fault is generated.

Allowed prefixes

  • rd: Not supported

lvl.q

Load left VFPU quad element

Encoding: fixed bits 1 1 0 1 0 1 0; fields: gpr, rtlo, rthi, offset

Syntax

lvl.q rd, imm14(rt)

Description

Performs a 16 byte left unaligned memory load to a VFPU quad register. Instruction ignores the two LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS LWL instruction: loads the most significant elements from the specified address leaving the other elements unchanged. Users can use `ulv.q` pseudoinstruction to generate a sequence of `lvl.q` and `lvr.q` instructions in order to load unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.

Bugs

The instruction has an erratum on PSP-1000 models that causes FPU register corruption (these are the MIPS CPU FPU registers, not the VFPU registers). The bottom 5 bits of the VFPU destination register determine which FPU register will be corrupted. A workaround is to assume the side effect (i.e. mark the register as clobbered).

Allowed prefixes

  • rd: Not supported

lvr.q

Load right VFPU quad element

Encoding: fixed bits 1 1 0 1 0 1 1; fields: gpr, rtlo, rthi, offset

Syntax

lvr.q rd, imm14(rt)

Description

Performs a 16 byte right unaligned memory load to a VFPU quad register. Instruction ignores the two LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS LWR instruction: loads the least significant elements from the specified address leaving the other elements unchanged. Users can use `ulv.q` pseudoinstruction to generate a sequence of `lvl.q` and `lvr.q` instructions in order to load unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.

Bugs

The instruction has an erratum on PSP-1000 models that causes FPU register corruption (these are the MIPS CPU FPU registers, not the VFPU registers). The bottom 5 bits of the VFPU destination register determine which FPU register will be corrupted. A workaround is to assume the side effect (i.e. mark the register as clobbered).

Allowed prefixes

  • rd: Not supported

sv.s

Store VFPU element

Encoding: fixed bits 1 1 1 0 1 0; fields: gpr, rtlo, rthi, offset

Syntax

sv.s rs, imm14(rt)

Description

Performs a 4 byte memory store from a VFPU register. Address must be 4 byte aligned or a fault is generated.

Allowed prefixes

  • rd: Not supported

sv.q

Store VFPU quad element

Encoding: fixed bits 1 1 1 1 1 0 0; fields: gpr, rtlo, rthi, offset

Syntax

sv.q rs, imm14(rt)

Description

Performs a 16 byte memory store from a VFPU quad register. Address must be 16 byte aligned or a fault is generated.

Allowed prefixes

  • rd: Not supported

svl.q

Store left VFPU quad element

Encoding: fixed bits 1 1 1 1 0 1 0; fields: gpr, rtlo, rthi, offset

Syntax

svl.q rs, imm14(rt)

Description

Performs a 16 byte left unaligned memory store from a VFPU quad register. Instruction ignores the two address LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS SWL instruction: stores the most significant part of the elements to the specified address leaving any other elements unchanged. Users can use `usv.q` pseudoinstruction to generate a sequence of `svl.q` and `svr.q` instructions in order to store unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.

Allowed prefixes

  • rd: Not supported

svr.q

Store right VFPU quad element

Encoding: fixed bits 1 1 1 1 0 1 1; fields: gpr, rtlo, rthi, offset

Syntax

svr.q rs, imm14(rt)

Description

Performs a 16 byte right unaligned memory store from a VFPU quad register. Instruction ignores the two address LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS SWR instruction: stores the least significant part of the elements to the specified address leaving any other elements unchanged. Users can use `usv.q` pseudoinstruction to generate a sequence of `svl.q` and `svr.q` instructions in order to store unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.

Allowed prefixes

  • rd: Not supported

vadd.s

Add elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 0 0 0; fields: rt, rs, rd

Syntax

vadd.s rd, rs, rt

Description

Performs element-wise floating point addition

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] + rt[0]

vadd.p

Add elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 0 0 1; fields: rt, rs, rd

Syntax

vadd.p rd, rs, rt

Description

Performs element-wise floating point addition

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] + rt[0]
rd[1] = rs[1] + rt[1]

vadd.t

Add elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 0 1 0; fields: rt, rs, rd

Syntax

vadd.t rd, rs, rt

Description

Performs element-wise floating point addition

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] + rt[0]
rd[1] = rs[1] + rt[1]
rd[2] = rs[2] + rt[2]

vadd.q

Add elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 0 1 1; fields: rt, rs, rd

Syntax

vadd.q rd, rs, rt

Description

Performs element-wise floating point addition

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] + rt[0]
rd[1] = rs[1] + rt[1]
rd[2] = rs[2] + rt[2]
rd[3] = rs[3] + rt[3]

vsub.s

Subtract elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 1 0 0; fields: rt, rs, rd

Syntax

vsub.s rd, rs, rt

Description

Performs element-wise floating point subtraction

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] - rt[0]

vsub.p

Subtract elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 1 0 1; fields: rt, rs, rd

Syntax

vsub.p rd, rs, rt

Description

Performs element-wise floating point subtraction

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] - rt[0]
rd[1] = rs[1] - rt[1]

vsub.t

Subtract elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 1 1 0; fields: rt, rs, rd

Syntax

vsub.t rd, rs, rt

Description

Performs element-wise floating point subtraction

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] - rt[0]
rd[1] = rs[1] - rt[1]
rd[2] = rs[2] - rt[2]

vsub.q

Subtract elements

Encoding: fixed bits 0 1 1 0 0 0 0 0 1 1 1; fields: rt, rs, rd

Syntax

vsub.q rd, rs, rt

Description

Performs element-wise floating point subtraction

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] - rt[0]
rd[1] = rs[1] - rt[1]
rd[2] = rs[2] - rt[2]
rd[3] = rs[3] - rt[3]

vmul.s

Multiply elements

Encoding: fixed bits 0 1 1 0 0 1 0 0 0 0 0; fields: rt, rs, rd

Syntax

vmul.s rd, rs, rt

Description

Performs element-wise floating point multiplication

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] * rt[0]

vmul.p

Multiply elements

Encoding: fixed bits 0 1 1 0 0 1 0 0 0 0 1; fields: rt, rs, rd

Syntax

vmul.p rd, rs, rt

Description

Performs element-wise floating point multiplication

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[1]

vmul.t

Multiply elements

Encoding: fixed bits 0 1 1 0 0 1 0 0 0 1 0; fields: rt, rs, rd

Syntax

vmul.t rd, rs, rt

Description

Performs element-wise floating point multiplication

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[1]
rd[2] = rs[2] * rt[2]

vmul.q

Multiply elements

Encoding: fixed bits 0 1 1 0 0 1 0 0 0 1 1; fields: rt, rs, rd

Syntax

vmul.q rd, rs, rt

Description

Performs element-wise floating point multiplication

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[1]
rd[2] = rs[2] * rt[2]
rd[3] = rs[3] * rt[3]

vdiv.s

Divide elements

Encoding: fixed bits 0 1 1 0 0 0 1 1 1 0 0; fields: rt, rs, rd

Syntax

vdiv.s rd, rs, rt

Description

Performs element-wise floating point division

Instruction performance

Throughput: 14 cycles/instruction
Latency: 17 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rt: Full support (swizzle, abs(), neg() and constants)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] / rt[0]

vdiv.p

Divide elements

Encoding: fixed bits 0 1 1 0 0 0 1 1 1 0 1; fields: rt, rs, rd

Syntax

vdiv.p rd, rs, rt

Description

Performs element-wise floating point division

Instruction performance

Throughput: 28 cycles/instruction
Latency: 31 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rt: Not supported
  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = rs[0] / rt[0]
rd[1] = rs[1] / rt[1]

vdiv.t

Divide elements

Encoding: fixed bits 0 1 1 0 0 0 1 1 1 1 0; fields: rt, rs, rd

Syntax

vdiv.t rd, rs, rt

Description

Performs element-wise floating point division

Instruction performance

Throughput: 42 cycles/instruction
Latency: 45 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rt: Not supported
  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = rs[0] / rt[0]
rd[1] = rs[1] / rt[1]
rd[2] = rs[2] / rt[2]

vdiv.q

Divide elements

Encoding: fixed bits 0 1 1 0 0 0 1 1 1 1 1; fields: rt, rs, rd

Syntax

vdiv.q rd, rs, rt

Description

Performs element-wise floating point division

Instruction performance

Throughput: 56 cycles/instruction
Latency: 59 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rt: Not supported
  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = rs[0] / rt[0]
rd[1] = rs[1] / rt[1]
rd[2] = rs[2] / rt[2]
rd[3] = rs[3] / rt[3]
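
The throughput and latency figures listed for vdiv.p, vdiv.t and vdiv.q grow linearly with the number of elements: they fit 14 cycles per element, plus 3 extra cycles of latency. The sketch below captures that observation. The function names are hypothetical, and the formula is only fitted to the three documented sizes; it is not a statement about how the divider is implemented.

```c
/* Hypothetical helpers: cycle estimates fitted to the vdiv.p/.t/.q figures
   listed above (n = 2, 3 or 4 elements). */

/* Cycles between two consecutive vdiv instructions on n-element vectors. */
int vdiv_throughput_cycles(int n) {
    return 14 * n;
}

/* Cycles until the result of an n-element vdiv becomes available. */
int vdiv_latency_cycles(int n) {
    return 14 * n + 3;
}
```

For example, vdiv_latency_cycles(4) gives the 59 cycles listed for vdiv.q.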

vmin.s

Select smallest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 0 0 0 rt rs rd

Syntax

vmin.s rd, rs, rt

Description

Performs element-wise floating point min(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(rs[0], rt[0])

vmin.p

Select smallest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 0 0 1 rt rs rd

Syntax

vmin.p rd, rs, rt

Description

Performs element-wise floating point min(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(rs[0], rt[0])
rd[1] = fminf(rs[1], rt[1])

vmin.t

Select smallest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 0 1 0 rt rs rd

Syntax

vmin.t rd, rs, rt

Description

Performs element-wise floating point min(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(rs[0], rt[0])
rd[1] = fminf(rs[1], rt[1])
rd[2] = fminf(rs[2], rt[2])

vmin.q

Select smallest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 0 1 1 rt rs rd

Syntax

vmin.q rd, rs, rt

Description

Performs element-wise floating point min(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(rs[0], rt[0])
rd[1] = fminf(rs[1], rt[1])
rd[2] = fminf(rs[2], rt[2])
rd[3] = fminf(rs[3], rt[3])

vmax.s

Select biggest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 1 0 0 rt rs rd

Syntax

vmax.s rd, rs, rt

Description

Performs element-wise floating point max(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fmaxf(rs[0], rt[0])

vmax.p

Select biggest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 1 0 1 rt rs rd

Syntax

vmax.p rd, rs, rt

Description

Performs element-wise floating point max(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fmaxf(rs[0], rt[0])
rd[1] = fmaxf(rs[1], rt[1])

vmax.t

Select biggest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 1 1 0 rt rs rd

Syntax

vmax.t rd, rs, rt

Description

Performs element-wise floating point max(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fmaxf(rs[0], rt[0])
rd[1] = fmaxf(rs[1], rt[1])
rd[2] = fmaxf(rs[2], rt[2])

vmax.q

Select biggest elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 0 1 1 1 1 rt rs rd

Syntax

vmax.q rd, rs, rt

Description

Performs element-wise floating point max(rs, rt) operation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fmaxf(rs[0], rt[0])
rd[1] = fmaxf(rs[1], rt[1])
rd[2] = fmaxf(rs[2], rt[2])
rd[3] = fmaxf(rs[3], rt[3])
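
A common use of vmin and vmax is clamping a vector to a range: a vmax against the lower bound followed by a vmin against the upper bound. The following is a minimal C sketch of that idea, modelled on the pseudocode above. All function names are hypothetical, and min2/max2 are stand-ins for fminf/fmaxf (they return the other operand when one is NaN) so that the sketch does not depend on the math library. Hardware behaviour for NaN and signed zero inputs is not modelled.

```c
/* Stand-ins for fminf/fmaxf: if one operand is NaN the other one is returned. */
float min2(float a, float b) { return (a < b || b != b) ? a : b; }
float max2(float a, float b) { return (a > b || b != b) ? a : b; }

/* Model of vmin.q: rd[i] = min(rs[i], rt[i]). */
void vmin_q(float rd[4], const float rs[4], const float rt[4]) {
    for (int i = 0; i < 4; i++)
        rd[i] = min2(rs[i], rt[i]);
}

/* Model of vmax.q: rd[i] = max(rs[i], rt[i]). */
void vmax_q(float rd[4], const float rs[4], const float rt[4]) {
    for (int i = 0; i < 4; i++)
        rd[i] = max2(rs[i], rt[i]);
}

/* Element-wise clamp of v to [lo, hi]: one vmax followed by one vmin. */
void clamp_q(float rd[4], const float v[4], const float lo[4], const float hi[4]) {
    float tmp[4];
    vmax_q(tmp, v, lo);   /* raise elements below the lower bound */
    vmin_q(rd, tmp, hi);  /* lower elements above the upper bound */
}
```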

vscmp.s

Compare and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 0 1 0 0 rt rs rd

Syntax

vscmp.s rd, rs, rt

Description

Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f

vscmp.p

Compare and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 0 1 0 1 rt rs rd

Syntax

vscmp.p rd, rs, rt

Description

Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? -1.0f : rs[1] > rt[1] ? 1.0f : 0.0f

vscmp.t

Compare and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 0 1 1 0 rt rs rd

Syntax

vscmp.t rd, rs, rt

Description

Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? -1.0f : rs[1] > rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? -1.0f : rs[2] > rt[2] ? 1.0f : 0.0f

vscmp.q

Compare and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 0 1 1 1 rt rs rd

Syntax

vscmp.q rd, rs, rt

Description

Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? -1.0f : rs[1] > rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? -1.0f : rs[2] > rt[2] ? 1.0f : 0.0f
rd[3] = rs[3] < rt[3] ? -1.0f : rs[3] > rt[3] ? 1.0f : 0.0f
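
The vscmp pseudocode can be written as a small C reference model, which is handy when comparing emulator output against expected values. The function names below are hypothetical. Note that this C model yields 0.0f when either input is NaN; that is a property of the C comparison operators, not a claim about what the hardware does with NaN.

```c
/* Model of a single vscmp element: -1.0f, 0.0f or 1.0f. */
float vscmp_elem(float s, float t) {
    return s < t ? -1.0f : (s > t ? 1.0f : 0.0f);
}

/* Model of vscmp.q: applies the comparison to each of the four elements. */
void vscmp_q(float rd[4], const float rs[4], const float rt[4]) {
    for (int i = 0; i < 4; i++)
        rd[i] = vscmp_elem(rs[i], rt[i]);
}
```

When rt is all zeroes the result is the sign of each rs element, so the same model doubles as a sign function.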

vsge.s

Compare greater or equal and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 0 0 0 rt rs rd

Syntax

vsge.s rd, rs, rt

Description

Performs element-wise floating point greater-or-equal comparison. The result is 1.0f if rs is greater than or equal to rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f

vsge.p

Compare greater or equal and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 0 0 1 rt rs rd

Syntax

vsge.p rd, rs, rt

Description

Performs element-wise floating point greater-or-equal comparison. The result is 1.0f if rs is greater than or equal to rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] >= rt[1] ? 1.0f : 0.0f

vsge.t

Compare greater or equal and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 0 1 0 rt rs rd

Syntax

vsge.t rd, rs, rt

Description

Performs element-wise floating point greater-or-equal comparison. The result is 1.0f if rs is greater than or equal to rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] >= rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] >= rt[2] ? 1.0f : 0.0f

vsge.q

Compare greater or equal and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 0 1 1 rt rs rd

Syntax

vsge.q rd, rs, rt

Description

Performs element-wise floating point greater-or-equal comparison. The result is 1.0f if rs is greater than or equal to rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] >= rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] >= rt[2] ? 1.0f : 0.0f
rd[3] = rs[3] >= rt[3] ? 1.0f : 0.0f

vslt.s

Compare less-than and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 1 0 0 rt rs rd

Syntax

vslt.s rd, rs, rt

Description

Performs element-wise floating point less-than comparison. The result is 1.0f if rs is less than rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f

vslt.p

Compare less-than and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 1 0 1 rt rs rd

Syntax

vslt.p rd, rs, rt

Description

Performs element-wise floating point less-than comparison. The result is 1.0f if rs is less than rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? 1.0f : 0.0f

vslt.t

Compare less-than and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 1 1 0 rt rs rd

Syntax

vslt.t rd, rs, rt

Description

Performs element-wise floating point less-than comparison. The result is 1.0f if rs is less than rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? 1.0f : 0.0f

vslt.q

Compare less-than and set elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 1 1 1 1 1 1 1 rt rs rd

Syntax

vslt.q rd, rs, rt

Description

Performs element-wise floating point less-than comparison. The result is 1.0f if rs is less than rt, and 0.0f otherwise.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? 1.0f : 0.0f
rd[3] = rs[3] < rt[3] ? 1.0f : 0.0f
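
Because vsge and vslt produce 1.0f/0.0f masks, they can be combined with multiplications to select between two values without branching. The sketch below models one element of each instruction from the pseudocode above and builds such a select. The names are hypothetical, and the two masks are only guaranteed to be complementary in this model when neither input is NaN.

```c
/* Model of one vsge element: 1.0f when s >= t, 0.0f otherwise. */
float vsge_elem(float s, float t) { return s >= t ? 1.0f : 0.0f; }

/* Model of one vslt element: 1.0f when s < t, 0.0f otherwise. */
float vslt_elem(float s, float t) { return s < t ? 1.0f : 0.0f; }

/* Branchless select: returns a where s >= t and b where s < t.
   Exactly one of the two masks is 1.0f for non-NaN inputs. */
float select_ge(float s, float t, float a, float b) {
    return vsge_elem(s, t) * a + vslt_elem(s, t) * b;
}
```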

vcrs.t

Partial vector cross product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 1 0 1 1 0 rt rs rd

Syntax

vcrs.t rd, rs, rt

Description

Performs a partial cross-product operation: it computes only the first product of each cross-product component, without the subtraction

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rt: Not supported
  • rs: Not supported

Pseudocode

rd[0] = rs[1] * rt[2]
rd[1] = rs[2] * rt[0]
rd[2] = rs[0] * rt[1]

vcrsp.t

Vector cross product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 1 1 0 0 1 0 1 1 0 rt rs rd

Syntax

vcrsp.t rd, rs, rt

Description

Performs a full cross-product operation

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register cannot overlap with input registers

Allowed prefixes

  • rt: Not supported
  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = rs[1] * rt[2] - rs[2] * rt[1]
rd[1] = rs[2] * rt[0] - rs[0] * rt[2]
rd[2] = rs[0] * rt[1] - rs[1] * rt[0]
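
Comparing the two pseudocode listings shows how vcrs.t relates to vcrsp.t: the full cross product equals vcrs(a, b) minus vcrs(b, a). The following C sketch models both instructions and checks that relationship. The function names are hypothetical, and the models use temporaries so they tolerate aliased arguments, which says nothing about the overlap rules of the real instructions.

```c
/* Model of vcrs.t: first product term of each cross-product component. */
void vcrs_t(float rd[3], const float rs[3], const float rt[3]) {
    float r0 = rs[1] * rt[2];
    float r1 = rs[2] * rt[0];
    float r2 = rs[0] * rt[1];
    rd[0] = r0; rd[1] = r1; rd[2] = r2;
}

/* Model of vcrsp.t: full cross product rs x rt. */
void vcrsp_t(float rd[3], const float rs[3], const float rt[3]) {
    float r0 = rs[1] * rt[2] - rs[2] * rt[1];
    float r1 = rs[2] * rt[0] - rs[0] * rt[2];
    float r2 = rs[0] * rt[1] - rs[1] * rt[0];
    rd[0] = r0; rd[1] = r1; rd[2] = r2;
}

/* Cross product assembled from two partial products and a subtraction. */
void cross_from_vcrs(float rd[3], const float a[3], const float b[3]) {
    float p[3], q[3];
    vcrs_t(p, a, b);              /* a1*b2, a2*b0, a0*b1 */
    vcrs_t(q, b, a);              /* b1*a2, b2*a0, b0*a1 */
    for (int i = 0; i < 3; i++)
        rd[i] = p[i] - q[i];
}
```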

vqmul.q

Quaternion multiplication

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 1 1 0 0 1 0 1 1 1 rt rs rd

Syntax

vqmul.q rd, rs, rt

Description

Performs a quaternion multiplication (Hamilton product) of rs by rt, with a quaternion result. Each quaternion holds its vector part in elements 0 to 2 and its scalar part in element 3.

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register cannot overlap with input registers

Allowed prefixes

  • rt: Not supported
  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = rs[3] * rt[0] - rs[2] * rt[1] + rs[1] * rt[2] + rs[0] * rt[3]
rd[1] = rs[3] * rt[1] + rs[2] * rt[0] + rs[1] * rt[3] - rs[0] * rt[2]
rd[2] = rs[3] * rt[2] + rs[2] * rt[3] - rs[1] * rt[0] + rs[0] * rt[1]
rd[3] = rs[3] * rt[3] - rs[2] * rt[2] - rs[1] * rt[1] - rs[0] * rt[0]
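
The pseudocode above matches the Hamilton product rs * rt for quaternions laid out as {x, y, z, w}, with the scalar part in element 3. The C sketch below is a direct transcription that can be used to sanity-check that reading with the basis quaternions (for instance i * j = k). The function name is hypothetical.

```c
/* Model of vqmul.q, transcribed from the pseudocode above.
   Quaternions are stored as {x, y, z, w}: scalar part in element 3. */
void vqmul_q(float rd[4], const float rs[4], const float rt[4]) {
    /* A temporary keeps the model correct if rd aliases an input. The real
       instruction does not allow that overlap. */
    float r[4];
    r[0] = rs[3] * rt[0] - rs[2] * rt[1] + rs[1] * rt[2] + rs[0] * rt[3];
    r[1] = rs[3] * rt[1] + rs[2] * rt[0] + rs[1] * rt[3] - rs[0] * rt[2];
    r[2] = rs[3] * rt[2] + rs[2] * rt[3] - rs[1] * rt[0] + rs[0] * rt[1];
    r[3] = rs[3] * rt[3] - rs[2] * rt[2] - rs[1] * rt[1] - rs[0] * rt[0];
    for (int i = 0; i < 4; i++)
        rd[i] = r[i];
}
```

Since quaternion multiplication is not commutative, swapping rs and rt generally changes the result: j * i yields -k.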

vsbn.s

Change exponent scale

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 0 0 1 0 0 0 rt rs rd

Syntax

vsbn.s rd, rs, rt

Description

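Rescales the rs operand so that it has rt as its exponent, keeping its sign and mantissa. The rt operand is interpreted as an integer. As long as the resulting exponent stays in the normal range, this is equivalent to ldexp(frexp(rs, &ignored), rt + 1). In IEEE 754 terms, if rs is encoded as ±1.m * 2^(e - 127), the instruction replaces the biased exponent field "e" with (rt + 127) mod 256. Zero, infinity and NaN inputs are returned unchanged.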

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = (fpiszero(rs[0]) || fpisnanorinf(rs[0])) ? rs[0] : (rs[0] & 0x807FFFFF) | (((rt[0] + 127) & 0xFF) << 23)
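
The pseudocode above mixes float and integer views of the same register. A C model has to make the reinterpretation explicit, as in the sketch below. The function name is hypothetical, and rt is passed as the integer value held in the register, which is an assumption about how the operand is supplied rather than something this sketch verifies.

```c
#include <stdint.h>
#include <string.h>

/* Model of vsbn.s following the pseudocode above.
   Assumption: rt is the integer value stored in the rt register. */
float vsbn_s(float rs, int32_t rt) {
    uint32_t bits;
    memcpy(&bits, &rs, sizeof bits);            /* view the float as raw bits */

    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x007FFFFFu;

    /* Zero, infinity and NaN are returned unchanged. */
    if ((exponent == 0 && mantissa == 0) || exponent == 0xFFu)
        return rs;

    /* Keep sign and mantissa, replace the biased exponent field. */
    bits = (bits & 0x807FFFFFu) | ((((uint32_t)rt + 127u) & 0xFFu) << 23);

    float out;
    memcpy(&out, &bits, sizeof out);
    return out;
}
```

For example, 3.0f is 1.5 * 2^1, so vsbn_s(3.0f, 0) gives 1.5f and vsbn_s(3.0f, 2) gives 6.0f.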

vscl.p

Vector scalar scale

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 0 1 0 0 1 rt rs rd

Syntax

vscl.p rd, rs, rt

Description

Scales a vector (element-wise) by a scalar factor

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Not supported

Pseudocode

rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[0]

vscl.t

Vector scalar scale

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 0 1 0 1 0 rt rs rd

Syntax

vscl.t rd, rs, rt

Description

Scales a vector (element-wise) by a scalar factor

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Not supported

Pseudocode

rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[0]
rd[2] = rs[2] * rt[0]

vscl.q

Vector scalar scale

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 0 1 0 1 1 rt rs rd

Syntax

vscl.q rd, rs, rt

Description

Scales a vector (element-wise) by a scalar factor

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Not supported

Pseudocode

rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[0]
rd[2] = rs[2] * rt[0]
rd[3] = rs[3] * rt[0]

vdot.p

Vector dot product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 0 0 1 0 1 rt rs rd

Syntax

vdot.p rd, rs, rt

Description

Performs vector floating point dot product

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] * rt[0] + rs[1] * rt[1]

vdot.t

Vector dot product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 0 0 1 1 0 rt rs rd

Syntax

vdot.t rd, rs, rt

Description

Performs vector floating point dot product

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2]

vdot.q

Vector dot product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 0 0 1 1 1 rt rs rd

Syntax

vdot.q rd, rs, rt

Description

Performs vector floating point dot product

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2] + rs[3] * rt[3]
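
As a reference for testing, the vdot.q pseudocode translates directly to C. The sketch below adds a squared-length helper as a typical use (a vector dotted with itself). Both function names are hypothetical, and the left-to-right summation order is simply how C evaluates the expression; the order and intermediate rounding used by the hardware are not modelled.

```c
/* Model of vdot.q: sum of the four element-wise products. */
float vdot_q(const float rs[4], const float rt[4]) {
    return rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2] + rs[3] * rt[3];
}

/* Squared length of a vector: its dot product with itself. */
float length_squared_q(const float v[4]) {
    return vdot_q(v, v);
}
```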

vdet.p

2x2 matrix determinant

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 1 1 0 0 1 rt rs rd

Syntax

vdet.p rd, rs, rt

Description

Computes the determinant of the 2x2 matrix whose two rows are rs and rt

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)
  • rt: Not supported

Pseudocode

rd[0] = rs[0] * rt[1] - rs[1] * rt[0]
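
The same expression is the 2D cross product of rs and rt, so its sign tells on which side of rs the vector rt lies. A minimal C model, with a hypothetical function name:

```c
/* Model of vdet.p: determinant of the 2x2 matrix with rows rs and rt. */
float vdet_p(const float rs[2], const float rt[2]) {
    return rs[0] * rt[1] - rs[1] * rt[0];
}
```

With rs = (1, 0) and rt = (0, 1) the result is 1.0f; swapping the operands flips the sign, and parallel vectors give 0.0f.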

vhdp.p

Homogeneous dot product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 1 0 0 0 1 rt rs rd

Syntax

vhdp.p rd, rs, rt

Description

Performs a floating point homogeneous dot product: a regular dot product in which the last element of rs is taken to be 1.0f

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rt: Full support (swizzle, abs(), neg() and constants)
  • rs: Not supported

Pseudocode

rd[0] = rs[0] * rt[0] + rt[1]

vhdp.t

Homogeneous dot product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 1 0 0 1 0 rt rs rd

Syntax

vhdp.t rd, rs, rt

Description

Performs a floating point homogeneous dot product: a regular dot product in which the last element of rs is taken to be 1.0f

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rt: Full support (swizzle, abs(), neg() and constants)
  • rs: Not supported

Pseudocode

rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rt[2]

vhdp.q

Homogeneous dot product

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 0 1 1 0 0 1 1 0 0 1 1 rt rs rd

Syntax

vhdp.q rd, rs, rt

Description

Performs a floating point homogeneous dot product: a regular dot product in which the last element of rs is taken to be 1.0f

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rt: Full support (swizzle, abs(), neg() and constants)
  • rs: Not supported

Pseudocode

rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2] + rt[3]
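
One way to read the vhdp.q pseudocode: if rt holds plane coefficients (a, b, c, d) and rs holds a point (x, y, z), the result is a*x + b*y + c*z + d, and the fourth element of rs is never read. The C sketch below models this; the function name and the plane/point interpretation are illustrative assumptions, not something the instruction mandates.

```c
/* Model of vhdp.q: dot product where rs[3] is ignored and treated as 1.0f. */
float vhdp_q(const float rs[4], const float rt[4]) {
    return rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2] + rt[3];
}
```

For instance, evaluating the plane z - 1 = 0, stored as (0, 0, 1, -1), at the point (1, 2, 3) gives 2.0f whatever value sits in the point's fourth element.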

vmov.s

Vector copy

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 rs rd

Syntax

vmov.s rd, rs

Description

Element-wise data copy

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0]

vmov.p

Vector copy

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 rs rd

Syntax

vmov.p rd, rs

Description

Element-wise data copy

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0]
rd[1] = rs[1]

vmov.t

Vector copy

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 rs rd

Syntax

vmov.t rd, rs

Description

Element-wise data copy

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0]
rd[1] = rs[1]
rd[2] = rs[2]

vmov.q

Vector copy

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 rs rd

Syntax

vmov.q rd, rs

Description

Element-wise data copy

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = rs[0]
rd[1] = rs[1]
rd[2] = rs[2]
rd[3] = rs[3]

vabs.s

Absolute value

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 rs rd

Syntax

vabs.s rd, rs

Description

Performs element-wise floating point absolute value

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = fabsf(rs[0])

vabs.p

Absolute value

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 rs rd

Syntax

vabs.p rd, rs

Description

Performs element-wise floating point absolute value

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = fabsf(rs[0])
rd[1] = fabsf(rs[1])

vabs.t

Absolute value

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 rs rd

Syntax

vabs.t rd, rs

Description

Performs element-wise floating point absolute value

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = fabsf(rs[0])
rd[1] = fabsf(rs[1])
rd[2] = fabsf(rs[2])

vabs.q

Absolute value

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 rs rd

Syntax

vabs.q rd, rs

Description

Performs element-wise floating point absolute value

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = fabsf(rs[0])
rd[1] = fabsf(rs[1])
rd[2] = fabsf(rs[2])
rd[3] = fabsf(rs[3])

vneg.s

Floating point negation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 rs rd

Syntax

vneg.s rd, rs

Description

Performs element-wise floating point negation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = -rs[0]

vneg.p

Floating point negation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 rs rd

Syntax

vneg.p rd, rs

Description

Performs element-wise floating point negation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = -rs[0]
rd[1] = -rs[1]

vneg.t

Floating point negation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 rs rd

Syntax

vneg.t rd, rs

Description

Performs element-wise floating point negation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = -rs[0]
rd[1] = -rs[1]
rd[2] = -rs[2]

vneg.q

Floating point negation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 rs rd

Syntax

vneg.q rd, rs

Description

Performs element-wise floating point negation

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Partial support (swizzle only)

Pseudocode

rd[0] = -rs[0]
rd[1] = -rs[1]
rd[2] = -rs[2]
rd[3] = -rs[3]

vsat0.s

Saturate float to 0..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 rs rd

Syntax

vsat0.s rd, rs

Description

Saturates inputs to the [0.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)

vsat0.p

Saturate float to 0..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 rs rd

Syntax

vsat0.p rd, rs

Description

Saturates inputs to the [0.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], 0.0f), 1.0f)

vsat0.t

Saturate float to 0..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 rs rd

Syntax

vsat0.t rd, rs

Description

Saturates inputs to the [0.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], 0.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], 0.0f), 1.0f)

vsat0.q

Saturate float to 0..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 rs rd

Syntax

vsat0.q rd, rs

Description

Saturates inputs to the [0.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], 0.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], 0.0f), 1.0f)
rd[3] = fminf(fmaxf(rs[3], 0.0f), 1.0f)

vsat1.s

Saturate float to -1..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 rs rd

Syntax

vsat1.s rd, rs

Description

Saturates inputs to the [-1.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)

vsat1.p

Saturate float to -1..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 1 rs rd

Syntax

vsat1.p rd, rs

Description

Saturates inputs to the [-1.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], -1.0f), 1.0f)

vsat1.t

Saturate float to -1..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 rs rd

Syntax

vsat1.t rd, rs

Description

Saturates inputs to the [-1.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], -1.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], -1.0f), 1.0f)

vsat1.q

Saturate float to -1..1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 rs rd

Syntax

vsat1.q rd, rs

Description

Saturates inputs to the [-1.0f ... 1.0f] range

Instruction performance

Throughput: 1 cycles/instruction
Latency: 3 cycles

Allowed prefixes

  • rd: Partial support (masking only)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], -1.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], -1.0f), 1.0f)
rd[3] = fminf(fmaxf(rs[3], -1.0f), 1.0f)
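
The vsat0 and vsat1 pseudocode differs only in the lower bound, so both can be modelled with one clamp helper. In the C sketch below all names are hypothetical, and min2/max2 are stand-ins for fminf/fmaxf (they return the other operand when one is NaN) so that no math library is needed. Like the pseudocode, the model maps a NaN input to the lower bound; whether the hardware does the same is not established here.

```c
/* Stand-ins for fminf/fmaxf: if one operand is NaN the other one is returned. */
float min2(float a, float b) { return (a < b || b != b) ? a : b; }
float max2(float a, float b) { return (a > b || b != b) ? a : b; }

/* Clamp x to [lo, hi], in the same order as the pseudocode: max, then min. */
float saturate(float x, float lo, float hi) {
    return min2(max2(x, lo), hi);
}

/* Model of one vsat0 element: clamp to [0.0f, 1.0f]. */
float vsat0_elem(float x) { return saturate(x, 0.0f, 1.0f); }

/* Model of one vsat1 element: clamp to [-1.0f, 1.0f]. */
float vsat1_elem(float x) { return saturate(x, -1.0f, 1.0f); }
```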

vrcp.s

Reciprocate elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 rs rd

Syntax

vrcp.s rd, rs

Description

Performs element-wise floating point reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = 1.0f / rs[0]

vrcp.p

Reciprocate elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 rs rd

Syntax

vrcp.p rd, rs

Description

Performs element-wise floating point reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = 1.0f / rs[0]
rd[1] = 1.0f / rs[1]

vrcp.t

Reciprocate elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 rs rd

Syntax

vrcp.t rd, rs

Description

Performs element-wise floating point reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = 1.0f / rs[0]
rd[1] = 1.0f / rs[1]
rd[2] = 1.0f / rs[2]

vrcp.q

Reciprocate elements

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 rs rd

Syntax

vrcp.q rd, rs

Description

Performs element-wise floating point reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = 1.0f / rs[0]
rd[1] = 1.0f / rs[1]
rd[2] = 1.0f / rs[2]
rd[3] = 1.0f / rs[3]
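
The accuracy figures quoted for vrcp can be checked against captured hardware output with a small host-side helper. The sketch below is an assumption about how such a check might look: the names rel_error, ulp_distance and vrcp_within_bound are hypothetical and are not taken from psp-tests.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Relative error of an approximation against a non-zero reference value. */
double rel_error(float approx, double exact)
{
    return fabs(((double)approx - exact) / exact);
}

/* Distance in units in the last place between two finite floats
 * of the same sign (compares the raw IEEE-754 bit patterns). */
uint32_t ulp_distance(float a, float b)
{
    uint32_t ia, ib;
    memcpy(&ia, &a, sizeof ia);
    memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}

/* Returns non-zero if hw_result is within the documented vrcp bound
 * (relative error smaller than 6.3e-07) of the exact reciprocal. */
int vrcp_within_bound(float input, float hw_result)
{
    return rel_error(hw_result, 1.0 / (double)input) < 6.3e-07;
}
```

The same rel_error helper applies to the other instructions that quote a relative error bound; only the reference function and the constant change.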

vrsq.s

Reciprocal square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 rs rd

Syntax

vrsq.s rd, rs

Description

Performs element-wise floating point reciprocal square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.3e-07

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = 1.0f / sqrt(rs[0])

vrsq.p

Reciprocal square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 rs rd

Syntax

vrsq.p rd, rs

Description

Performs element-wise floating point reciprocal square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.3e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = 1.0f / sqrt(rs[0])
rd[1] = 1.0f / sqrt(rs[1])

vrsq.t

Reciprocal square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 0 rs rd

Syntax

vrsq.t rd, rs

Description

Performs element-wise floating point reciprocal square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.3e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = 1.0f / sqrt(rs[0])
rd[1] = 1.0f / sqrt(rs[1])
rd[2] = 1.0f / sqrt(rs[2])

vrsq.q

Reciprocal square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 1 1 rs rd

Syntax

vrsq.q rd, rs

Description

Performs element-wise floating point reciprocal square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.3e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = 1.0f / sqrt(rs[0])
rd[1] = 1.0f / sqrt(rs[1])
rd[2] = 1.0f / sqrt(rs[2])
rd[3] = 1.0f / sqrt(rs[3])

vsin.s

Sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 rs rd

Syntax

vsin.s rd, rs

Description

Performs element-wise floating point sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = sin(rs[0] * M_PI_2)

vsin.p

Sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 rs rd

Syntax

vsin.p rd, rs

Description

Performs element-wise floating point sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = sin(rs[0] * M_PI_2)
rd[1] = sin(rs[1] * M_PI_2)

vsin.t

Sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 rs rd

Syntax

vsin.t rd, rs

Description

Performs element-wise floating point sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = sin(rs[0] * M_PI_2)
rd[1] = sin(rs[1] * M_PI_2)
rd[2] = sin(rs[2] * M_PI_2)

vsin.q

Sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 rs rd

Syntax

vsin.q rd, rs

Description

Performs element-wise floating point sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = sin(rs[0] * M_PI_2)
rd[1] = sin(rs[1] * M_PI_2)
rd[2] = sin(rs[2] * M_PI_2)
rd[3] = sin(rs[3] * M_PI_2)

vcos.s

Cosine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 rs rd

Syntax

vcos.s rd, rs

Description

Performs element-wise floating point cos(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4e-07

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = cos(rs[0] * M_PI_2)

vcos.p

Cosine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 1 rs rd

Syntax

vcos.p rd, rs

Description

Performs element-wise floating point cos(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = cos(rs[0] * M_PI_2)
rd[1] = cos(rs[1] * M_PI_2)

vcos.t

Cosine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 0 rs rd

Syntax

vcos.t rd, rs

Description

Performs element-wise floating point cos(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = cos(rs[0] * M_PI_2)
rd[1] = cos(rs[1] * M_PI_2)
rd[2] = cos(rs[2] * M_PI_2)

vcos.q

Cosine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 0 1 1 1 1 rs rd

Syntax

vcos.q rd, rs

Description

Performs element-wise floating point cos(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = cos(rs[0] * M_PI_2)
rd[1] = cos(rs[1] * M_PI_2)
rd[2] = cos(rs[2] * M_PI_2)
rd[3] = cos(rs[3] * M_PI_2)

vexp2.s

Base-2 exponentiation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 rs rd

Syntax

vexp2.s rd, rs

Description

Performs element-wise floating point exp2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).

Relative error is smaller than 7.2e-07

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])

vexp2.p

Base-2 exponentiation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 rs rd

Syntax

vexp2.p rd, rs

Description

Performs element-wise floating point exp2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).

Relative error is smaller than 7.2e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])
rd[1] = (rs[1] >= 128) ? INFINITY : (rs[1] <= -127) ? 0.0f : exp2(rs[1])

vexp2.t

Base-2 exponentiation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 rs rd

Syntax

vexp2.t rd, rs

Description

Performs element-wise floating point exp2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).

Relative error is smaller than 7.2e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])
rd[1] = (rs[1] >= 128) ? INFINITY : (rs[1] <= -127) ? 0.0f : exp2(rs[1])
rd[2] = (rs[2] >= 128) ? INFINITY : (rs[2] <= -127) ? 0.0f : exp2(rs[2])

vexp2.q

Base-2 exponentiation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 1 rs rd

Syntax

vexp2.q rd, rs

Description

Performs element-wise floating point exp2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).

Relative error is smaller than 7.2e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])
rd[1] = (rs[1] >= 128) ? INFINITY : (rs[1] <= -127) ? 0.0f : exp2(rs[1])
rd[2] = (rs[2] >= 128) ? INFINITY : (rs[2] <= -127) ? 0.0f : exp2(rs[2])
rd[3] = (rs[3] >= 128) ? INFINITY : (rs[3] <= -127) ? 0.0f : exp2(rs[3])

vlog2.s

Base-2 logarithm

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 rs rd

Syntax

vlog2.s rd, rs

Description

Performs element-wise floating point log2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 3e-05

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = log2(rs[0])

vlog2.p

Base-2 logarithm

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 rs rd

Syntax

vlog2.p rd, rs

Description

Performs element-wise floating point log2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 3e-05

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = log2(rs[0])
rd[1] = log2(rs[1])

vlog2.t

Base-2 logarithm

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 rs rd

Syntax

vlog2.t rd, rs

Description

Performs element-wise floating point log2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 3e-05

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = log2(rs[0])
rd[1] = log2(rs[1])
rd[2] = log2(rs[2])

vlog2.q

Base-2 logarithm

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 1 rs rd

Syntax

vlog2.q rd, rs

Description

Performs element-wise floating point log2(rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 3e-05

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = log2(rs[0])
rd[1] = log2(rs[1])
rd[2] = log2(rs[2])
rd[3] = log2(rs[3])

vlgb.s

LogB calculation

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 1 1 0 1 1 1 0 0 rs rd

Syntax

vlgb.s rd, rs

Description

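Performs element-wise logB() calculation, that is, it returns the unbiased binary exponent of rs as a floating point value (see logbf in the pseudocode).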

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = logbf(rs[0])

vsbz.s

Reset exponent scale

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 1 1 0 1 1 0 0 0 rs rd

Syntax

vsbz.s rd, rs

Description

Rescales rs operand to have zero as exponent, so that it is reduced to the [1.0, 2.0) interval. This is essentially equivalent to the vsbn instruction with rt=0.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = (fpiszero(rs[0]) || fpisnan(rs[0])) ? rs[0] : (rs[0] & 0x007FFFFF) | 0x3F800000
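
The pseudocode applies bitwise operators to a float, which C does not allow directly. A compilable host-side version, assuming that fpiszero and fpisnan mean "is zero" and "is NaN", could look like the sketch below (vsbz_ref is a hypothetical name). Note that, as written, the mask also clears the sign bit, so the result for non-zero, non-NaN inputs is always positive.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Host reference for one vsbz lane, following the pseudocode:
 * keep the 23 mantissa bits and force the exponent field to 127 (2^0). */
float vsbz_ref(float x)
{
    uint32_t bits;

    if (x == 0.0f || isnan(x))
        return x;                              /* zero and NaN pass through */

    memcpy(&bits, &x, sizeof bits);            /* reinterpret float as integer */
    bits = (bits & 0x007FFFFFu) | 0x3F800000u; /* mantissa | exponent of 1.0f */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, 6.0 = 1.5 * 2^2 and 0.375 = 1.5 * 2^-2 both map to 1.5, while 8.0 maps to 1.0.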

vwbn.s

Floating point modulus

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 1 1 0 0 imval rs rd

Syntax

vwbn.s rd, rs, scale

Description

TODO: Document this better. Performs some sort of modulus operation.

Instruction performance

Throughput: 1 cycles/instruction
Latency: 5 cycles

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = ivwbn(rs[0], imval)

vsqrt.s

Square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 rs rd

Syntax

vsqrt.s rd, rs

Description

Performs element-wise floating point approximate square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.1e-07

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = sqrt(rs[0])

vsqrt.p

Square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 1 rs rd

Syntax

vsqrt.p rd, rs

Description

Performs element-wise floating point approximate square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.1e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = sqrt(rs[0])
rd[1] = sqrt(rs[1])

vsqrt.t

Square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 0 rs rd

Syntax

vsqrt.t rd, rs

Description

Performs element-wise floating point approximate square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.1e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = sqrt(rs[0])
rd[1] = sqrt(rs[1])
rd[2] = sqrt(rs[2])

vsqrt.q

Square root

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 rs rd

Syntax

vsqrt.q rd, rs

Description

Performs element-wise floating point approximate square root

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 7.1e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = sqrt(rs[0])
rd[1] = sqrt(rs[1])
rd[2] = sqrt(rs[2])
rd[3] = sqrt(rs[3])

vasin.s

Arc sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 0 0 rs rd

Syntax

vasin.s rd, rs

Description

Performs element-wise floating point asin(rs)⋅2/π operation

Accuracy

This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 0.02

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Full support (swizzle, abs(), neg() and constants)

Pseudocode

rd[0] = asin(rs[0]) / M_PI_2

vasin.p

Arc sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 0 1 rs rd

Syntax

vasin.p rd, rs

Description

Performs element-wise floating point asin(rs)⋅2/π operation

Accuracy

This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 0.02

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = asin(rs[0]) / M_PI_2
rd[1] = asin(rs[1]) / M_PI_2

vasin.t

Arc sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 0 rs rd

Syntax

vasin.t rd, rs

Description

Performs element-wise floating point asin(rs)⋅2/π operation

Accuracy

This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 0.02

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = asin(rs[0]) / M_PI_2
rd[1] = asin(rs[1]) / M_PI_2
rd[2] = asin(rs[2]) / M_PI_2

vasin.q

Arc sine function

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 rs rd

Syntax

vasin.q rd, rs

Description

Performs element-wise floating point asin(rs)⋅2/π operation

Accuracy

This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 0.02

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = asin(rs[0]) / M_PI_2
rd[1] = asin(rs[1]) / M_PI_2
rd[2] = asin(rs[2]) / M_PI_2
rd[3] = asin(rs[3]) / M_PI_2

vnrcp.s

Negative reciprocal

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 rs rd

Syntax

vnrcp.s rd, rs

Description

Performs element-wise floating point negated reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 1 cycles/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Not supported

Pseudocode

rd[0] = -1.0f / rs[0]

vnrcp.p

Negative reciprocal

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 1 rs rd

Syntax

vnrcp.p rd, rs

Description

Performs element-wise floating point negated reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = -1.0f / rs[0]
rd[1] = -1.0f / rs[1]

vnrcp.t

Negative reciprocal

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 rs rd

Syntax

vnrcp.t rd, rs

Description

Performs element-wise floating point negated reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE-754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = -1.0f / rs[0]
rd[1] = -1.0f / rs[1]
rd[2] = -1.0f / rs[2]

vnrcp.q

Negative reciprocal

Bit index: 0 to 31
Fixed bits: 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1
Operand fields: rs, rd

Syntax

vnrcp.q rd, rs

Description

Performs element-wise floating point negated reciprocal

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE 754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Relative error is smaller than 6.3e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles
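
Across the four vnrcp variants the listed figures follow a simple pattern: throughput equals the element count and latency is six cycles more. The C sketch below encodes that pattern and adds a rough cost estimate for a run of independent instructions. The estimate is an assumption based on a simple pipeline model (one instruction issued every throughput cycles, last result ready after its latency) and has not been measured. All function names are hypothetical.

```c
#include <assert.h>

/* Figures taken from the performance tables; n is the element count (1..4). */
static int vnrcp_throughput_cycles(int n)
{
    assert(n >= 1 && n <= 4);
    return n;
}

static int vnrcp_latency_cycles(int n)
{
    assert(n >= 1 && n <= 4);
    return 6 + n;
}

/* Assumed model, not a measurement: `count` independent back-to-back
 * instructions issue one per throughput interval, and the final result
 * becomes available one full latency after the last issue. */
static int estimate_cycles(int n, int count)
{
    assert(count >= 1);
    return (count - 1) * vnrcp_throughput_cycles(n) + vnrcp_latency_cycles(n);
}
```

Under this model one vnrcp.q and four independent vnrcp.s instructions both finish after 10 cycles, which suggests the wider variants mainly save instruction slots rather than time.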

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = -1.0f / rs[0]
rd[1] = -1.0f / rs[1]
rd[2] = -1.0f / rs[2]
rd[3] = -1.0f / rs[3]
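
The four vnrcp variants differ only in how many elements they process. A minimal host-side C sketch of that pattern follows. The helper name `ref_vnrcp` is hypothetical, and it uses an exact IEEE division, so its results may differ from the hardware in the lowest mantissa bits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference model, not part of any PSP SDK.
 * n is the element count: 1 for .s, 2 for .p, 3 for .t, 4 for .q. */
static void ref_vnrcp(float *rd, const float *rs, size_t n)
{
    assert(n >= 1 && n <= 4);
    float tmp[4];
    /* Read every input before writing any output, so that calling the
     * helper with rd == rs (identical overlap) gives the same result. */
    for (size_t i = 0; i < n; i++)
        tmp[i] = -1.0f / rs[i];
    for (size_t i = 0; i < n; i++)
        rd[i] = tmp[i];
}
```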

vnsin.s

Negative sine function

Bit index: 0 to 31
Fixed bits: 1 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0
Operand fields: rs, rd

Syntax

vnsin.s rd, rs

Description

Performs element-wise floating point -sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE 754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 1 cycle/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Not supported

Pseudocode

rd[0] = -sin(rs[0] * M_PI_2)

vnsin.p

Negative sine function

Bit index: 0 to 31
Fixed bits: 1 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 1
Operand fields: rs, rd

Syntax

vnsin.p rd, rs

Description

Performs element-wise floating point -sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE 754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = -sin(rs[0] * M_PI_2)
rd[1] = -sin(rs[1] * M_PI_2)

vnsin.t

Negative sine function

Bit index: 0 to 31
Fixed bits: 1 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 0
Operand fields: rs, rd

Syntax

vnsin.t rd, rs

Description

Performs element-wise floating point -sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE 754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 3 cycles/instruction
Latency: 9 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = -sin(rs[0] * M_PI_2)
rd[1] = -sin(rs[1] * M_PI_2)
rd[2] = -sin(rs[2] * M_PI_2)

vnsin.q

Negative sine function

Bit index: 0 to 31
Fixed bits: 1 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 1
Operand fields: rs, rd

Syntax

vnsin.q rd, rs

Description

Performs element-wise floating point -sin(π/2⋅rs) operation

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE 754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.

Absolute error is smaller than 4.8e-07

Instruction performance

Throughput: 4 cycles/instruction
Latency: 10 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = -sin(rs[0] * M_PI_2)
rd[1] = -sin(rs[1] * M_PI_2)
rd[2] = -sin(rs[2] * M_PI_2)
rd[3] = -sin(rs[3] * M_PI_2)

vrexp2.s

Base-2 negative exponentiation

Bit index: 0 to 31
Fixed bits: 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0
Operand fields: rs, rd

Syntax

vrexp2.s rd, rs

Description

Performs element-wise floating point 1/exp2(rs) operation (equivalent to exp2(-rs))

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE 754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Out-of-range inputs saturate: as the pseudocode shows, inputs of 127 or above return 0 and inputs of -128 or below overflow to infinity (results of 2^128 or more cannot be represented).

Relative error is smaller than 7.2e-07

Instruction performance

Throughput: 1 cycle/instruction
Latency: 7 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rd: Full support (masking and saturation)
  • rs: Not supported

Pseudocode

rd[0] = (rs[0] >= 127) ? 0.0f : (rs[0] <= -128) ? INFINITY : exp2(-rs[0])

vrexp2.p

Base-2 negative exponentiation

Bit index: 0 to 31
Fixed bits: 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1
Operand fields: rs, rd

Syntax

vrexp2.p rd, rs

Description

Performs element-wise floating point 1/exp2(rs) operation (equivalent to exp2(-rs))

Accuracy

This function provides an approximate value, with lower accuracy than FP32 IEEE 754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Out-of-range inputs saturate: as the pseudocode shows, inputs of 127 or above return 0 and inputs of -128 or below overflow to infinity (results of 2^128 or more cannot be represented).

Relative error is smaller than 7.2e-07

Instruction performance

Throughput: 2 cycles/instruction
Latency: 8 cycles

Register overlap compatibility

Output register can only overlap with input registers if they are identical

Allowed prefixes

  • rs: Not supported
  • rd: Not supported

Pseudocode

rd[0] = (rs[0] >= 127) ? 0.0f : (rs[0] <= -128) ? INFINITY : exp2(-rs[0])
rd[1] = (rs[1] >= 127) ? 0.0f : (rs[1] <= -128) ? INFINITY : exp2(-rs[1])

vrexp2.t

Base-2 negative exponentiation

0 1