This document describes how the PSP VFPU instruction set operates. We attempted to collect all the knowledge available in the community and put it together in a document that can be used as a reference for developers and enthusiasts.
The goal is to describe the behaviour of the hardware unit in as much detail as possible, in a way that every statement can be verified. For this reason, every functional detail described in the docs must have a test that validates it. Of course some things are harder to validate (like hardware bugs), so there are some statements that do not have tests at this time.
The Allegrex CPU is a MIPS CPU based on the MIPS II architecture. This is a 32 bit CPU and architecture that has many similarities with other CPUs of the same architecture. However, if we only focus on the instruction set, the main differences with other CPUs in the MIPS II family would be:
Most of the extra instructions that are present in the CPU are identical to their MIPS32 counterparts. In some cases though, the encoding is slightly different.
The PSP VFPU is a coprocessor unit that can perform vector/matrix float and integer operations on a set of 128 bit registers. It features dedicated units to perform the most usual operations that 3D videogames require.
The CPU features 128 registers, each of them 32 bit wide. Most of the time they are interpreted as IEEE-754 compliant floating point registers, although some instructions will interpret them as integers (or other formats such as 8/16 bit packed integers). The registers can be addressed individually but also in a more powerful way by grouping them as vectors or matrices.
Registers will usually be represented in their matrix layout. The VFPU has 8 matrices, each of them containing 16 elements (4 rows by 4 columns). For each of the 8 total available matrices, the elements are arranged in the following fashion (X represents the matrix number, 0 to 7):
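As a rough illustration of the register file size and naming, the sketch below enumerates element names of the form S<matrix><column><row>. The digit order is an assumption inferred from the prefix examples later in this document (where row R001 covers S001, S011, S021 and S031). The helper names are made up for this example, and only row vectors that start at column 0 are covered.

```python
# Sketch of the register naming, assuming names read S<matrix><column><row>
# (inferred from this document's examples, e.g. R001 = S001, S011, S021, S031).

NUM_MATRICES = 8   # matrices 0..7
MATRIX_DIM = 4     # 4 rows by 4 columns

def single_name(matrix, column, row):
    """Name of one 32 bit element (hypothetical helper)."""
    return f"S{matrix}{column}{row}"

def all_singles():
    """Every element name in the register file."""
    return [single_name(m, c, r)
            for m in range(NUM_MATRICES)
            for c in range(MATRIX_DIM)
            for r in range(MATRIX_DIM)]

def row_elements(matrix, row, size=4):
    """Elements of a row vector starting at column 0 (pair=2, trio=3, quad=4)."""
    return [single_name(matrix, c, row) for c in range(size)]

print(len(all_singles()))     # 8 matrices * 16 elements each
print(row_elements(0, 1))     # elements covered by R001 when used as a quad
```

The count printed first should match the 128 registers mentioned above.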
When the registers are referenced as vectors, they are grouped as rows and columns of a given matrix. This is important since it means that a vector is composed of elements from a single matrix and cannot access elements across multiple matrices. There's 2D, 3D and 4D vectors, usually called pair, trio and quad respectively. Single elements can be viewed as 1D vectors, and most instructions are available in all four possible vector sizes (which makes the instruction set very uniform). Not all access patterns are possible: pair and trio registers have 128 possible addressing modes while quad has only 64. The available patterns are described as follows:
Matrix addressing is similar to vectors: registers can be read vertically or horizontally. That means matrices can be accessed in a row major and column major mode (ie. by accessing them as a set of rows or columns). Similarly there's three possible sizes: 2x2, 3x3 and 4x4, containing 4, 9 and 16 registers respectively. Again not all addressing patterns are available, having 64 possible addressing modes for 2x2 and 3x3 matrices, but only 16 for 4x4 matrices. These are:
There's also a small set of eight "control" registers that are used for a variety of things, such as prefix state, comparison flag bits, etc. These registers are defined as follows:
Some of these registers are never accessed directly but rather using some VFPU instructions (ie. prefixes, condition code, etc). However these can be read and written in some useful cases, for instance thread context saving and restoration (so that the VFPU state is preserved across thread rescheduling).
Most CPUs have what's called "hazard detection logic", which tracks register reads and writes so that things happen in the right order and results actually make sense. This is also the case in the VFPU, however some operations are quite complex and their register accesses can be hard to track.
Control registers seem to have some hazards, for instance the "mfvc" instruction has a one cycle hazard with any previous vcmp instruction. That means a vnop or some other VFPU instruction should be inserted between a vcmp and mfvc instruction pair to get the right VFPU_CC value.
Some VFPU instructions (mostly dealing with matrices and transformations) require that the input and output registers do not overlap. This has to do with how the hardware performs the operations internally: the VFPU can perform most vector-vector operations in a native way, but matrix operations seem to be decomposed into series of vector-vector operations (ie. a vmmul seems to be a sequence of vtfm operations). Since the result is written back in partial steps, an overlapping input can be overwritten before the hardware has read it, causing incorrect results for the operation.
The affected instructions are divided in two groups, a group that does not allow any sort of overlap, and another group that allows some limited overlap. Instructions vmmul, vtfm2/3/4, vhtfm2/3/4, vqmul and vcrsp do not allow any sort of overlap between input and output registers. These instructions perform operations by repeating a dot product operation multiple times, which results in partial updates of the output register. These partial updates overwrite the input register, causing the result to be incorrect.
Instructions that allow partial overlaps are vsin, vcos, vasin, vnsin, vexp2, vrexp2, vlog2, vsqrt, vrsq, vrcp, vnrcp, vdiv, vmscl and vmmov. Single versions (.s) are not affected by this restriction. These instructions are also internally decomposed into a bunch of smaller operations (for instance trigonometric operations are decomposed into a series of single (.s) operations). The registers are allowed to overlap as long as they are compatible in terms of element count and access "direction" (ie. a matrix must be read using the same mode).
Examples
vmscl.p M000, M022, S100 # No overlap, always OK
vmscl.p M000, M000, S100 # M000 overlaps with itself, OK
vmscl.p M000, E000, S100 # Invalid overlap, matrix order is different
vmscl.t M000, M011, S100 # Overlapping registers are not identical
vcos.q R000, C000 # Invalid overlap (one element only)
vcos.q R000, R000 # Identical overlap, OK
Although the FPU seems IEEE-754 compliant, it has a couple of non-standard features that break this compatibility. Its rounding mode is hardwired to "round to nearest" mode, so that users cannot choose another rounding mode. It also lacks support for denormal numbers (also called subnormals): when an operation produces a subnormal number, it rounds it to zero. If the input of an operation is a denormal number, it will also be treated as zero.
See the ieee754-fun.c file for tests.
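The flush-to-zero behaviour described above can be sketched in software. This is a minimal model, not a description of the hardware: the helper names are hypothetical, and keeping the sign of a flushed value is an assumption (the text only says the value becomes zero).

```python
import struct

def f32_bits(x):
    """Bit pattern of x after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits):
    """Single precision value for a 32 bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def flush_denormal(x):
    """Replace a single precision subnormal by zero, leave other values alone."""
    bits = f32_bits(x)
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:
        # Assumption: the sign bit survives the flush.
        return bits_f32(bits & 0x80000000)
    return bits_f32(bits)

def mul_flush_to_zero(a, b):
    """Multiply with subnormal inputs and subnormal results treated as zero."""
    a, b = flush_denormal(a), flush_denormal(b)
    return flush_denormal(a * b)

smallest_normal = bits_f32(0x00800000)            # 2^-126
print(mul_flush_to_zero(smallest_normal, 0.5))    # result would be subnormal
```

Halving the smallest normal number would give a subnormal under full IEEE-754 rules; in this model it comes out as zero instead.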
The VFPU is a pipelined CPU with an issue width of one. That means that instructions take multiple cycles to execute, since they execute partially during each cycle, and a maximum of one new instruction begins execution each cycle. Instructions that block the pipeline for more than one cycle can be identified by having a throughput different than one. These block the pipeline for a certain number of cycles before a new instruction can enter it.
An instruction usually begins executing whenever its input registers are ready, that is, once any previous instructions writing those registers have fully completed their execution. For this reason it is important to closely observe the instruction latency, measured in cycles, since an instruction might have to wait for its inputs to become available, reducing efficiency. A common strategy is to interleave non-dependent instructions to hide latency and avoid wasting CPU cycles.
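To make the effect of interleaving concrete, here is a deliberately simplified timing model. It assumes in-order issue, one instruction per cycle, a stall until all inputs are ready, and a uniform latency of 5 cycles (the figure listed for vmul). The function and register names are invented for the example and the real pipeline is likely more subtle.

```python
def finish_cycle(program, latency=5):
    """Cycle at which the last result is available.

    program is a list of (dest, sources) tuples issued in order.
    """
    ready = {}        # register name -> cycle at which its value is available
    next_issue = 0    # earliest cycle at which the next instruction may issue
    last_done = 0
    for dest, sources in program:
        # Wait for the issue slot and for every input to be ready.
        start = max([next_issue] + [ready.get(src, 0) for src in sources])
        ready[dest] = start + latency
        last_done = max(last_done, ready[dest])
        next_issue = start + 1   # issue width of one, throughput of one
    return last_done

# Two independent chains: b depends on a, d depends on c.
back_to_back = [("a", ("x", "y")), ("b", ("a", "z")),
                ("c", ("p", "q")), ("d", ("c", "r"))]
interleaved = [("a", ("x", "y")), ("c", ("p", "q")),
               ("b", ("a", "z")), ("d", ("c", "r"))]

print(finish_cycle(back_to_back), finish_cycle(interleaved))
```

Under these assumptions the interleaved ordering finishes earlier, because the second chain starts while the first one is still waiting on its own result.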
The pipeline structure looks more or less as follows:
Prefixes make it possible to apply certain operations to the inputs before the actual instruction operation, and some other operations to the output afterwards.
VFPU operations can operate on one or two inputs (rs and rt) and one output (rd). The input values can be pre-processed by using the VFPU_PFXS and VFPU_PFXT registers (and therefore vpfxs and vpfxt instructions). The result of the operation being written to rd can be post-processed by using the VFPU_PFXD register (vpfxd instruction).
Valid operations for input registers are:
Operations available to the output register post-processing are:
There are some restrictions on their usage. The assembler will signal an error should you violate any of them.
A few examples to showcase input prefixes:
# Sign change prefix
vmul.p R000, R001, R002[-x,-y] # Multiplies two rows negating one of the inputs
# S000 = S001 * -S002; S010 = S011 * -S012
vfad.q R000, R001[x,-y,z,-w] # Funnel-add all elements with some changed signs
# S000 = S001 - S011 + S021 - S031
# Absolute value prefix
vdot.p S000, R001[|x|,|y|], R002 # Dot product with forced absolute value for R001
# S000 = |S001| * S002 + |S011| * S012
# Negative and absolute value prefixes
vdot.p S000, R001[-|x|,-|y|], R002 # Dot product with forced negative values
# S000 = -|S001| * S002 - |S011| * S012
# Swizzle prefix
vmul.q R000, R001, R002[x,y,x,y] # Multiplies with repeating values
# S000 = S001 * S002; S010 = S011 * S012
# S020 = S021 * S002; S030 = S031 * S012
# Constant value prefixes
vdot.t S000, R001, R002[1,2,3] # Second operand ignored, overrides to (1,2,3)
# S000 = S001 + S011 * 2 + S021 * 3
vdot.t S000, R001, R002[x,-2,-y] # Mix swizzle and constant elements
# S000 = S001 * S002 - S011 * 2 - S021 * S012
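The input prefix examples above can be mimicked with a small software model. This is only a sketch: the function names are hypothetical, constants are parsed as arbitrary numbers (which constants the hardware actually accepts is not covered here), and no assembler restrictions are enforced.

```python
def apply_source_prefix(vec, pattern):
    """Return the operand as the instruction would see it.

    pattern entries look like 'x', '-y', '|z|', '-|w|' or a constant such as '2'.
    """
    lanes = {"x": 0, "y": 1, "z": 2, "w": 3}
    out = []
    for item in pattern:
        negate = item.startswith("-")
        body = item[1:] if negate else item
        absolute = body.startswith("|") and body.endswith("|")
        if absolute:
            body = body[1:-1]
        if body in lanes:
            value = vec[lanes[body]]      # swizzle: pick the source element
            if absolute:
                value = abs(value)
        else:
            value = float(body)           # constant replaces the register element
        out.append(-value if negate else value)
    return out

def vdot(rs, rt):
    """Plain dot product, as in the vdot pseudocode."""
    return sum(a * b for a, b in zip(rs, rt))

r001 = [1.0, 2.0, 3.0]
r002 = [4.0, 5.0, 6.0]
# Mirrors: vdot.t with R002[x,-2,-y]
print(vdot(r001, apply_source_prefix(r002, ["x", "-2", "-y"])))
```

With these inputs the model computes 1*4 - 2*2 - 3*5, following the same element pairing as the commented example.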
Some more examples for output prefixes.
vmul.p R000[[-1:1],[-1:1]], R001, R002 # Multiplies with output saturation
# S000 = min(1.0f, max(-1.0f, S001 * S002))
# S010 = min(1.0f, max(-1.0f, S011 * S012))
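The saturation shown in the comments above can be written out as a tiny model. The function names are hypothetical and NaN handling is not modelled, so treat this as a sketch of the arithmetic only.

```python
def saturate(value, low=-1.0, high=1.0):
    """Clamp value into [low, high], as in min(high, max(low, value))."""
    return min(high, max(low, value))

def vmul_saturated(rs, rt, low=-1.0, high=1.0):
    """Element-wise multiply followed by output saturation."""
    return [saturate(a * b, low, high) for a, b in zip(rs, rt)]

print(vmul_saturated([2.0, 0.5], [3.0, -0.5]))
```

The first product exceeds the range and is clamped, while the second is already inside the range and passes through unchanged.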
Adding a prefix modifier to an operand will result in vpfxs/t/d instructions being emitted before the actual instruction. This syntax exists just to make assembly coding more comfortable to the user. When using the disassembler the prefix instructions will be clearly visible.
# The following operand-decorated instruction:
vmul.q R000, R100[x,y,x,y], R200[-x,-y,z,w]
# is actually encoded as a sequence of instructions:
vpfxs [x,y,x,y]
vpfxt [-x,-y,z,w]
vmul.q R000, R100, R200
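The expansion above is mechanical, so it can be illustrated with a few lines of text processing. This sketch is not the assembler's actual implementation: it only handles source operand prefixes (destination prefixes and trailing comments are not parsed) and the function name is made up.

```python
import re

# A register name optionally followed by a bracketed prefix, e.g. R100[x,y,x,y]
_OPERAND = re.compile(r"([^,\s\[]+)(\[[^\]]*\])?")

def expand_prefixes(line):
    """Split a decorated instruction into prefix instructions plus the plain one."""
    mnemonic, operands = line.split(None, 1)
    parsed = _OPERAND.findall(operands)     # [(register, prefix or ''), ...]
    dest, sources = parsed[0], parsed[1:]
    out = []
    # First source operand maps to vpfxs, second to vpfxt.
    for prefix_op, (_, prefix) in zip(("vpfxs", "vpfxt"), sources):
        if prefix:
            out.append(f"{prefix_op} {prefix}")
    registers = [dest[0]] + [reg for reg, _ in sources]
    out.append(f"{mnemonic} " + ", ".join(registers))
    return out

for text in expand_prefixes("vmul.q R000, R100[x,y,x,y], R200[-x,-y,z,w]"):
    print(text)
```

An operand without a decoration produces no prefix instruction, so undecorated code passes through as a single line.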
Prefix instructions consume one cycle and have no visible latency (the "decorated" instruction doesn't have to wait any extra cycles). In some cases it might be faster to not use prefixes and use other instructions (vcst, vabs, vneg, vsat0/1 are some similar alternatives), particularly when optimizing for throughput. The advantage of using prefixes is that latency is kept low (since they have no latency and the extra operation is "included" in the instruction pipeline).
The following instructions exist in the Allegrex CPU and share the same MIPS32 encodings:
Other instructions that are borrowed from MIPS32 but have a different encoding are:
(... SPECIAL encodings)
(... SPECIAL encodings)
The bit manipulation Allegrex specific instructions are:
(... BSHFL encoding adjacent to wsbh)
(... BSHFL encoding)
Allegrex features some instructions present in MIPS32 and MIPS32r2 with identical encoding to these:
Other instructions that have some particular encoding are multiply-accumulate instructions. Some overlap with MIPS R4010 encodings and some others just use unused encodings. They all use unused SPECIAL opcodes:
There are also two novel Allegrex instructions that are used to perform faster compare-and-move operations. These use free SPECIAL opcodes as well:
VFPU branch on false
Syntax
bvf imm3, offset
Description
Branch on VFPU CC register being false
Instruction performance
Throughput: 1 cycles/instruction
Latency: 4 cycles
VFPU likely branch on false
Syntax
bvfl imm3, offset
Description
Branch on VFPU CC register being false (likely)
Instruction performance
Throughput: 1 cycles/instruction
Latency: 4 cycles
VFPU branch on true
Syntax
bvt imm3, offset
Description
Branch on VFPU CC register being true
Instruction performance
Throughput: 1 cycles/instruction
Latency: 4 cycles
VFPU likely branch on true
Syntax
bvtl imm3, offset
Description
Branch on VFPU CC register being true (likely)
Instruction performance
Throughput: 1 cycles/instruction
Latency: 4 cycles
Move GPR to VFPU control register
Syntax
mtvc rt, imm8
Description
Writes the contents of a CPU general purpose register to the specified VFPU control register
Move VFPU control register to GPR
Syntax
mfvc rt, imm8
Description
Writes the contents of the specified VFPU control register into a CPU general purpose register
Hazards
The instruction does not have interlocks, so the result of a vcmp instruction is only available one cycle later. You will need to interleave at least one VFPU instruction between a vcmp and mfvc (ie. a vnop).
Move vector register to VFPU control register
Syntax
vmtvc imm8, rs
Description
Writes the contents of a VFPU vector register to the specified VFPU control register
Move VFPU control register to vector register
Syntax
vmfvc rd, imm8
Description
Writes the contents of the specified VFPU control register into a VFPU vector register
Hazards
The instruction does not have interlocks, so the result of a previous vcmp instruction is only available one cycle later. You will need to interleave at least one VFPU instruction between a vcmp and vmfvc (ie. a vnop).
Load VFPU element
Syntax
lv.s rd, imm14(rt)
Description
Performs a 4 byte memory load to a VFPU register. Address must be 4 byte aligned or a fault is generated.
Allowed prefixes
Load VFPU quad element
Syntax
lv.q rd, imm14(rt)
Description
Performs a 16 byte memory load to a VFPU quad register. Address must be 16 byte aligned or a fault is generated.
Allowed prefixes
Load left VFPU quad element
Syntax
lvl.q rd, imm14(rt)
Description
Performs a 16 byte left unaligned memory load to a VFPU quad register. Instruction ignores the two LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS LWL instruction: loads the most significant elements from the specified address leaving the other elements unchanged. Users can use `ulv.q` pseudoinstruction to generate a sequence of `lvl.q` and `lvr.q` instructions in order to load unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.
Bugs
The instruction has an erratum on PSP-1000 models that causes FPU register corruption (these are the MIPS CPU FPU registers, not the VFPU registers). The bottom 5 bits of the VFPU destination register determine which FPU register will be corrupted. A workaround is to assume the side effect (ie. mark the register as clobbered).
Allowed prefixes
Load right VFPU quad element
Syntax
lvr.q rd, imm14(rt)
Description
Performs a 16 byte right unaligned memory load to a VFPU quad register. Instruction ignores the two LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS LWR instruction: loads the least significant elements from the specified address leaving the other elements unchanged. Users can use `ulv.q` pseudoinstruction to generate a sequence of `lvl.q` and `lvr.q` instructions in order to load unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.
Bugs
The instruction has an erratum on PSP-1000 models that causes FPU register corruption (these are the MIPS CPU FPU registers, not the VFPU registers). The bottom 5 bits of the VFPU destination register determine which FPU register will be corrupted. A workaround is to assume the side effect (ie. mark the register as clobbered).
Allowed prefixes
Store VFPU element
Syntax
sv.s rs, imm14(rt)
Description
Performs a 4 byte memory store from a VFPU register. Address must be 4 byte aligned or a fault is generated.
Allowed prefixes
Store VFPU quad element
Syntax
sv.q rs, imm14(rt)
Description
Performs a 16 byte memory store from a VFPU quad register. Address must be 16 byte aligned or a fault is generated.
Allowed prefixes
Store left VFPU quad element
Syntax
svl.q rs, imm14(rt)
Description
Performs a 16 byte left unaligned memory store from a VFPU quad register. Instruction ignores the two address LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS SWL instruction: stores the most significant part of the elements to the specified address leaving any other elements unchanged. Users can use `usv.q` pseudoinstruction to generate a sequence of `svl.q` and `svr.q` instructions in order to store unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.
Allowed prefixes
Store right VFPU quad element
Syntax
svr.q rs, imm14(rt)
Description
Performs a 16 byte right unaligned memory store from a VFPU quad register. Instruction ignores the two address LSB (forces them to zero), so the address is assumed aligned to 4 bytes. This instruction is similar to MIPS SWR instruction: stores the least significant part of the elements to the specified address leaving any other elements unchanged. Users can use `usv.q` pseudoinstruction to generate a sequence of `svl.q` and `svr.q` instructions in order to store unaligned data. You can check `psp-tests/manual/memops.c` to see examples on how the instruction behaves.
Allowed prefixes
Add elements
Syntax
vadd.s rd, rs, rt
Description
Performs element-wise floating point addition
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] + rt[0]
Add elements
Syntax
vadd.p rd, rs, rt
Description
Performs element-wise floating point addition
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] + rt[0]
rd[1] = rs[1] + rt[1]
Add elements
Syntax
vadd.t rd, rs, rt
Description
Performs element-wise floating point addition
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] + rt[0]
rd[1] = rs[1] + rt[1]
rd[2] = rs[2] + rt[2]
Add elements
Syntax
vadd.q rd, rs, rt
Description
Performs element-wise floating point addition
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] + rt[0]
rd[1] = rs[1] + rt[1]
rd[2] = rs[2] + rt[2]
rd[3] = rs[3] + rt[3]
Subtract elements
Syntax
vsub.s rd, rs, rt
Description
Performs element-wise floating point subtraction
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] - rt[0]
Subtract elements
Syntax
vsub.p rd, rs, rt
Description
Performs element-wise floating point subtraction
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] - rt[0]
rd[1] = rs[1] - rt[1]
Subtract elements
Syntax
vsub.t rd, rs, rt
Description
Performs element-wise floating point subtraction
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] - rt[0]
rd[1] = rs[1] - rt[1]
rd[2] = rs[2] - rt[2]
Subtract elements
Syntax
vsub.q rd, rs, rt
Description
Performs element-wise floating point subtraction
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] - rt[0]
rd[1] = rs[1] - rt[1]
rd[2] = rs[2] - rt[2]
rd[3] = rs[3] - rt[3]
Multiply elements
Syntax
vmul.s rd, rs, rt
Description
Performs element-wise floating point multiplication
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0]
Multiply elements
Syntax
vmul.p rd, rs, rt
Description
Performs element-wise floating point multiplication
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[1]
Multiply elements
Syntax
vmul.t rd, rs, rt
Description
Performs element-wise floating point multiplication
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[1]
rd[2] = rs[2] * rt[2]
Multiply elements
Syntax
vmul.q rd, rs, rt
Description
Performs element-wise floating point multiplication
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[1]
rd[2] = rs[2] * rt[2]
rd[3] = rs[3] * rt[3]
Divide elements
Syntax
vdiv.s rd, rs, rt
Description
Performs element-wise floating point division
Instruction performance
Throughput: 14 cycles/instruction
Latency: 17 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = rs[0] / rt[0]
Divide elements
Syntax
vdiv.p rd, rs, rt
Description
Performs element-wise floating point division
Instruction performance
Throughput: 28 cycles/instruction
Latency: 31 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = rs[0] / rt[0]
rd[1] = rs[1] / rt[1]
Divide elements
Syntax
vdiv.t rd, rs, rt
Description
Performs element-wise floating point division
Instruction performance
Throughput: 42 cycles/instruction
Latency: 45 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = rs[0] / rt[0]
rd[1] = rs[1] / rt[1]
rd[2] = rs[2] / rt[2]
Divide elements
Syntax
vdiv.q rd, rs, rt
Description
Performs element-wise floating point division
Instruction performance
Throughput: 56 cycles/instruction
Latency: 59 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = rs[0] / rt[0]
rd[1] = rs[1] / rt[1]
rd[2] = rs[2] / rt[2]
rd[3] = rs[3] / rt[3]
Select smallest elements
Syntax
vmin.s rd, rs, rt
Description
Performs element-wise floating point min(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(rs[0], rt[0])
Select smallest elements
Syntax
vmin.p rd, rs, rt
Description
Performs element-wise floating point min(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(rs[0], rt[0])
rd[1] = fminf(rs[1], rt[1])
Select smallest elements
Syntax
vmin.t rd, rs, rt
Description
Performs element-wise floating point min(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(rs[0], rt[0])
rd[1] = fminf(rs[1], rt[1])
rd[2] = fminf(rs[2], rt[2])
Select smallest elements
Syntax
vmin.q rd, rs, rt
Description
Performs element-wise floating point min(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(rs[0], rt[0])
rd[1] = fminf(rs[1], rt[1])
rd[2] = fminf(rs[2], rt[2])
rd[3] = fminf(rs[3], rt[3])
Select biggest elements
Syntax
vmax.s rd, rs, rt
Description
Performs element-wise floating point max(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fmaxf(rs[0], rt[0])
Select biggest elements
Syntax
vmax.p rd, rs, rt
Description
Performs element-wise floating point max(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fmaxf(rs[0], rt[0])
rd[1] = fmaxf(rs[1], rt[1])
Select biggest elements
Syntax
vmax.t rd, rs, rt
Description
Performs element-wise floating point max(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fmaxf(rs[0], rt[0])
rd[1] = fmaxf(rs[1], rt[1])
rd[2] = fmaxf(rs[2], rt[2])
Select biggest elements
Syntax
vmax.q rd, rs, rt
Description
Performs element-wise floating point max(rs, rt) operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = fmaxf(rs[0], rt[0])
rd[1] = fmaxf(rs[1], rt[1])
rd[2] = fmaxf(rs[2], rt[2])
rd[3] = fmaxf(rs[3], rt[3])
Compare and set elements
Syntax
vscmp.s rd, rs, rt
Description
Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f
Compare and set elements
Syntax
vscmp.p rd, rs, rt
Description
Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? -1.0f : rs[1] > rt[1] ? 1.0f : 0.0f
Compare and set elements
Syntax
vscmp.t rd, rs, rt
Description
Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? -1.0f : rs[1] > rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? -1.0f : rs[2] > rt[2] ? 1.0f : 0.0f
Compare and set elements
Syntax
vscmp.q rd, rs, rt
Description
Performs element-wise floating point comparison. The result is -1.0f, 0.0f or 1.0f depending on whether rs is less than, equal to, or greater than rt, respectively.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? -1.0f : rs[0] > rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? -1.0f : rs[1] > rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? -1.0f : rs[2] > rt[2] ? 1.0f : 0.0f
rd[3] = rs[3] < rt[3] ? -1.0f : rs[3] > rt[3] ? 1.0f : 0.0f
Compare greater or equal and set elements
Syntax
vsge.s rd, rs, rt
Description
Performs element-wise floating point greater-or-equal comparison. The result will be 1.0 if rs is greater than or equal to rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f
Compare greater or equal and set elements
Syntax
vsge.p rd, rs, rt
Description
Performs element-wise floating point greater-or-equal comparison. The result will be 1.0 if rs is greater than or equal to rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] >= rt[1] ? 1.0f : 0.0f
Compare greater or equal and set elements
Syntax
vsge.t rd, rs, rt
Description
Performs element-wise floating point greater-or-equal comparison. The result will be 1.0 if rs is greater than or equal to rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] >= rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] >= rt[2] ? 1.0f : 0.0f
Compare greater or equal and set elements
Syntax
vsge.q rd, rs, rt
Description
Performs element-wise floating point greater-or-equal comparison. The result will be 1.0 if rs is greater than or equal to rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] >= rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] >= rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] >= rt[2] ? 1.0f : 0.0f
rd[3] = rs[3] >= rt[3] ? 1.0f : 0.0f
Compare less-than and set elements
Syntax
vslt.s rd, rs, rt
Description
Performs element-wise floating point less-than comparison. The result will be 1.0 if rs is less than rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f
Compare less-than and set elements
Syntax
vslt.p rd, rs, rt
Description
Performs element-wise floating point less-than comparison. The result will be 1.0 if rs is less than rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? 1.0f : 0.0f
Compare less-than and set elements
Syntax
vslt.t rd, rs, rt
Description
Performs element-wise floating point less-than comparison. The result will be 1.0 if rs is less than rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? 1.0f : 0.0f
Compare less-than and set elements
Syntax
vslt.q rd, rs, rt
Description
Performs element-wise floating point less-than comparison. The result will be 1.0 if rs is less than rt, otherwise it will be zero.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] < rt[0] ? 1.0f : 0.0f
rd[1] = rs[1] < rt[1] ? 1.0f : 0.0f
rd[2] = rs[2] < rt[2] ? 1.0f : 0.0f
rd[3] = rs[3] < rt[3] ? 1.0f : 0.0f
Partial vector cross product
Syntax
vcrs.t rd, rs, rt
Description
Performs a partial cross-product operation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[1] * rt[2]
rd[1] = rs[2] * rt[0]
rd[2] = rs[0] * rt[1]
Vector cross product
Syntax
vcrsp.t rd, rs, rt
Description
Performs a full cross-product operation
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register cannot overlap with input registers
Allowed prefixes
Pseudocode
rd[0] = rs[1] * rt[2] - rs[2] * rt[1]
rd[1] = rs[2] * rt[0] - rs[0] * rt[2]
rd[2] = rs[0] * rt[1] - rs[1] * rt[0]
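The relation between the partial and the full cross product can be checked with a short model built directly from the two pseudocode listings. The function names simply mirror the mnemonics. This is plain Python arithmetic, so any rounding of intermediate results on real hardware is not modelled.

```python
def vcrs(rs, rt):
    """Partial cross product: only the positive terms."""
    return [rs[1] * rt[2], rs[2] * rt[0], rs[0] * rt[1]]

def vcrsp(rs, rt):
    """Full cross product."""
    return [rs[1] * rt[2] - rs[2] * rt[1],
            rs[2] * rt[0] - rs[0] * rt[2],
            rs[0] * rt[1] - rs[1] * rt[0]]

def cross_from_partials(rs, rt):
    """Full cross product rebuilt as vcrs(rs, rt) - vcrs(rt, rs)."""
    return [p - q for p, q in zip(vcrs(rs, rt), vcrs(rt, rs))]

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(vcrsp(a, b), cross_from_partials(a, b))
```

Swapping the operands of the partial product yields exactly the terms that the full product subtracts, which is why the two results agree.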
Quaternion multiplication
Syntax
vqmul.q rd, rs, rt
Description
Performs a quaternion multiplication of rs by rt, with a quaternion (quad vector) result
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register cannot overlap with input registers
Allowed prefixes
Pseudocode
rd[0] = rs[3] * rt[0] - rs[2] * rt[1] + rs[1] * rt[2] + rs[0] * rt[3]
rd[1] = rs[3] * rt[1] + rs[2] * rt[0] + rs[1] * rt[3] - rs[0] * rt[2]
rd[2] = rs[3] * rt[2] + rs[2] * rt[3] - rs[1] * rt[0] + rs[0] * rt[1]
rd[3] = rs[3] * rt[3] - rs[2] * rt[2] - rs[1] * rt[1] - rs[0] * rt[0]
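The pseudocode can be transcribed into a small function to sanity-check it against well known quaternion identities. Reading the elements as (x, y, z, w) with the real part last is an interpretation that happens to fit the formulas, not something stated by the pseudocode itself.

```python
def vqmul(rs, rt):
    """Quaternion product following the vqmul pseudocode, elements as (x, y, z, w)."""
    return [
        rs[3] * rt[0] - rs[2] * rt[1] + rs[1] * rt[2] + rs[0] * rt[3],
        rs[3] * rt[1] + rs[2] * rt[0] + rs[1] * rt[3] - rs[0] * rt[2],
        rs[3] * rt[2] + rs[2] * rt[3] - rs[1] * rt[0] + rs[0] * rt[1],
        rs[3] * rt[3] - rs[2] * rt[2] - rs[1] * rt[1] - rs[0] * rt[0],
    ]

I = [1, 0, 0, 0]
J = [0, 1, 0, 0]
K = [0, 0, 1, 0]
ONE = [0, 0, 0, 1]

print(vqmul(I, J))   # i * j
print(vqmul(J, I))   # j * i, expected to be the negation of the above
```

Under that reading, i * j gives k, j * i gives -k, and multiplying by the unit quaternion leaves the other operand unchanged.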
Change exponent scale
Syntax
vsbn.s rd, rs, rt
Description
Rescales the rs operand to have rt as exponent. This would be equivalent to ldexp(frexp(rs, NULL), rt + 128). If we express the number in its IEEE754 terms, that is, if rs can be expressed as ±m * 2^e, the instruction will replace "e" with the value of (rt + 127) mod 256.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = (fpiszero(rs[0]) || fpisnanorinf(rs[0])) ? rs[0] : (rs[0] & 0x807FFFFF) | (((rt[0] + 127) & 0xFF) << 23)
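The bit manipulation in the pseudocode can be tried out on a desktop machine with the sketch below. It assumes rt is supplied as a plain integer and does not model how denormal inputs are treated; the function name just mirrors the mnemonic.

```python
import struct

def vsbn(rs, rt):
    """Replace the exponent field of the float rs, following the vsbn pseudocode."""
    bits = struct.unpack("<I", struct.pack("<f", rs))[0]
    is_zero = (bits & 0x7FFFFFFF) == 0
    is_nan_or_inf = ((bits >> 23) & 0xFF) == 0xFF
    if is_zero or is_nan_or_inf:
        return rs                                  # passed through unchanged
    # Keep sign and mantissa, overwrite the 8 bit biased exponent.
    new_bits = (bits & 0x807FFFFF) | (((rt + 127) & 0xFF) << 23)
    return struct.unpack("<f", struct.pack("<I", new_bits))[0]

print(vsbn(1.0, 3))     # mantissa 1.0 with exponent 3
print(vsbn(-3.0, 0))    # -3.0 is -1.5 * 2^1, rescaled to exponent 0
```

The sign and mantissa are preserved while the exponent is overwritten, so -3.0 rescaled to exponent 0 becomes -1.5.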
Vector scalar scale
Syntax
vscl.p rd, rs, rt
Description
Scales a vector (element-wise) by a scalar factor
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[0]
Vector scalar scale
Syntax
vscl.t rd, rs, rt
Description
Scales a vector (element-wise) by a scalar factor
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[0]
rd[2] = rs[2] * rt[0]
Vector scalar scale
Syntax
vscl.q rd, rs, rt
Description
Scales a vector (element-wise) by a scalar factor
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0]
rd[1] = rs[1] * rt[0]
rd[2] = rs[2] * rt[0]
rd[3] = rs[3] * rt[0]
Vector dot product
Syntax
vdot.p rd, rs, rt
Description
Performs vector floating point dot product
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0] + rs[1] * rt[1]
Vector dot product
Syntax
vdot.t rd, rs, rt
Description
Performs vector floating point dot product
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2]
Vector dot product
Syntax
vdot.q rd, rs, rt
Description
Performs vector floating point dot product
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2] + rs[3] * rt[3]
2x2 matrix determinant
Syntax
vdet.p rd, rs, rt
Description
Computes the determinant of the 2x2 matrix whose two rows are rs and rt
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[1] - rs[1] * rt[0]
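As a small illustration of the formula above, the following C sketch treats rs as the first row and rt as the second row of a 2x2 matrix. The helper name vdet_ref is hypothetical and hardware rounding is not modelled.

```c
/* Hypothetical helper: rs and rt are the two rows of the 2x2 matrix. */
static float vdet_ref(const float rs[2], const float rt[2])
{
    return rs[0] * rt[1] - rs[1] * rt[0];
}
```

For the rows {1, 2} and {3, 4} this evaluates 1 * 4 - 2 * 3, that is, -2.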
Homogeneous dot product
Syntax
vhdp.p rd, rs, rt
Description
Performs vector floating point homogeneous dot product
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0] + rt[1]
Homogeneous dot product
Syntax
vhdp.t rd, rs, rt
Description
Performs vector floating point homogeneous dot product
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rt[2]
Homogeneous dot product
Syntax
vhdp.q rd, rs, rt
Description
Performs vector floating point homogeneous dot product
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0] * rt[0] + rs[1] * rt[1] + rs[2] * rt[2] + rt[3]
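Compared with the plain dot product, the homogeneous variant ignores the last element of rs and behaves as if it were 1.0. A minimal C sketch covering the three sizes is shown below; the name vhdp_ref and the size parameter n are assumptions made for illustration, and the hardware summation order and rounding may differ.

```c
#include <stddef.h>

/* Hypothetical helper: n is 2, 3 or 4 for the .p, .t and .q variants. */
static float vhdp_ref(const float *rs, const float *rt, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i + 1 < n; i++)
        acc += rs[i] * rt[i];
    /* The last element of rs is not read; it is treated as 1.0. */
    return acc + rt[n - 1];
}
```

One plausible use is evaluating a plane equation a*x + b*y + c*z + d for a point stored in the first three elements of rs, with (a, b, c, d) held in rt.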
Vector copy
Syntax
vmov.s rd, rs
Description
Element-wise data copy
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0]
Vector copy
Syntax
vmov.p rd, rs
Description
Element-wise data copy
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0]
rd[1] = rs[1]
Vector copy
Syntax
vmov.t rd, rs
Description
Element-wise data copy
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0]
rd[1] = rs[1]
rd[2] = rs[2]
Vector copy
Syntax
vmov.q rd, rs
Description
Element-wise data copy
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = rs[0]
rd[1] = rs[1]
rd[2] = rs[2]
rd[3] = rs[3]
Absolute value
Syntax
vabs.s rd, rs
Description
Performs element-wise floating point absolute value
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fabsf(rs[0])
Absolute value
Syntax
vabs.p rd, rs
Description
Performs element-wise floating point absolute value
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fabsf(rs[0])
rd[1] = fabsf(rs[1])
Absolute value
Syntax
vabs.t rd, rs
Description
Performs element-wise floating point absolute value
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fabsf(rs[0])
rd[1] = fabsf(rs[1])
rd[2] = fabsf(rs[2])
Absolute value
Syntax
vabs.q rd, rs
Description
Performs element-wise floating point absolute value
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fabsf(rs[0])
rd[1] = fabsf(rs[1])
rd[2] = fabsf(rs[2])
rd[3] = fabsf(rs[3])
Floating point negation
Syntax
vneg.s rd, rs
Description
Performs element-wise floating point negation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = -rs[0]
Floating point negation
Syntax
vneg.p rd, rs
Description
Performs element-wise floating point negation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = -rs[0]
rd[1] = -rs[1]
Floating point negation
Syntax
vneg.t rd, rs
Description
Performs element-wise floating point negation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = -rs[0]
rd[1] = -rs[1]
rd[2] = -rs[2]
Floating point negation
Syntax
vneg.q rd, rs
Description
Performs element-wise floating point negation
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = -rs[0]
rd[1] = -rs[1]
rd[2] = -rs[2]
rd[3] = -rs[3]
Saturate float to 0..1
Syntax
vsat0.s rd, rs
Description
Saturates inputs to the [0.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)
Saturate float to 0..1
Syntax
vsat0.p rd, rs
Description
Saturates inputs to the [0.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], 0.0f), 1.0f)
Saturate float to 0..1
Syntax
vsat0.t rd, rs
Description
Saturates inputs to the [0.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], 0.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], 0.0f), 1.0f)
Saturate float to 0..1
Syntax
vsat0.q rd, rs
Description
Saturates inputs to the [0.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], 0.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], 0.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], 0.0f), 1.0f)
rd[3] = fminf(fmaxf(rs[3], 0.0f), 1.0f)
Saturate float to -1..1
Syntax
vsat1.s rd, rs
Description
Saturates inputs to the [-1.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)
Saturate float to -1..1
Syntax
vsat1.p rd, rs
Description
Saturates inputs to the [-1.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], -1.0f), 1.0f)
Saturate float to -1..1
Syntax
vsat1.t rd, rs
Description
Saturates inputs to the [-1.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], -1.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], -1.0f), 1.0f)
Saturate float to -1..1
Syntax
vsat1.q rd, rs
Description
Saturates inputs to the [-1.0f ... 1.0f] range
Instruction performance
Throughput: 1 cycles/instruction
Latency: 3 cycles
Allowed prefixes
Pseudocode
rd[0] = fminf(fmaxf(rs[0], -1.0f), 1.0f)
rd[1] = fminf(fmaxf(rs[1], -1.0f), 1.0f)
rd[2] = fminf(fmaxf(rs[2], -1.0f), 1.0f)
rd[3] = fminf(fmaxf(rs[3], -1.0f), 1.0f)
Reciprocate elements
Syntax
vrcp.s rd, rs
Description
Performs element-wise floating point reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / rs[0]
Reciprocate elements
Syntax
vrcp.p rd, rs
Description
Performs element-wise floating point reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / rs[0]
rd[1] = 1.0f / rs[1]
Reciprocate elements
Syntax
vrcp.t rd, rs
Description
Performs element-wise floating point reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / rs[0]
rd[1] = 1.0f / rs[1]
rd[2] = 1.0f / rs[2]
Reciprocate elements
Syntax
vrcp.q rd, rs
Description
Performs element-wise floating point reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / rs[0]
rd[1] = 1.0f / rs[1]
rd[2] = 1.0f / rs[2]
rd[3] = 1.0f / rs[3]
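The relative error bound quoted for the reciprocal can be read as |approx - exact| / |exact| < 6.3e-07. A minimal sketch of how an emulator or test harness might check a candidate result against such a bound is given below; within_rel_error is a hypothetical helper and this reading of "relative error" is an assumption, not something taken from psp-tests.

```c
#include <math.h>

/* Hypothetical test helper: checks an approximate FP32 result against a
 * double precision reference value using a relative error bound. */
static int within_rel_error(float approx, double exact, double bound)
{
    if (exact == 0.0)
        return approx == 0.0f;   /* relative error is undefined at zero */
    return fabs(((double)approx - exact) / exact) < bound;
}
```

A call such as within_rel_error(result, 1.0 / (double)input, 6.3e-7) would then accept any result inside the documented bound.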
Reciprocal square root
Syntax
vrsq.s rd, rs
Description
Performs element-wise floating point reciprocal square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.3e-07
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / sqrt(rs[0])
Reciprocal square root
Syntax
vrsq.p rd, rs
Description
Performs element-wise floating point reciprocal square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.3e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / sqrt(rs[0])
rd[1] = 1.0f / sqrt(rs[1])
Reciprocal square root
Syntax
vrsq.t rd, rs
Description
Performs element-wise floating point reciprocal square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.3e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / sqrt(rs[0])
rd[1] = 1.0f / sqrt(rs[1])
rd[2] = 1.0f / sqrt(rs[2])
Reciprocal square root
Syntax
vrsq.q rd, rs
Description
Performs element-wise floating point reciprocal square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.3e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = 1.0f / sqrt(rs[0])
rd[1] = 1.0f / sqrt(rs[1])
rd[2] = 1.0f / sqrt(rs[2])
rd[3] = 1.0f / sqrt(rs[3])
Sine function
Syntax
vsin.s rd, rs
Description
Performs element-wise floating point sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sin(rs[0] * M_PI_2)
Sine function
Syntax
vsin.p rd, rs
Description
Performs element-wise floating point sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sin(rs[0] * M_PI_2)
rd[1] = sin(rs[1] * M_PI_2)
Sine function
Syntax
vsin.t rd, rs
Description
Performs element-wise floating point sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sin(rs[0] * M_PI_2)
rd[1] = sin(rs[1] * M_PI_2)
rd[2] = sin(rs[2] * M_PI_2)
Sine function
Syntax
vsin.q rd, rs
Description
Performs element-wise floating point sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sin(rs[0] * M_PI_2)
rd[1] = sin(rs[1] * M_PI_2)
rd[2] = sin(rs[2] * M_PI_2)
rd[3] = sin(rs[3] * M_PI_2)
Cosine function
Syntax
vcos.s rd, rs
Description
Performs element-wise floating point cos(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4e-07
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = cos(rs[0] * M_PI_2)
Cosine function
Syntax
vcos.p rd, rs
Description
Performs element-wise floating point cos(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = cos(rs[0] * M_PI_2)
rd[1] = cos(rs[1] * M_PI_2)
Cosine function
Syntax
vcos.t rd, rs
Description
Performs element-wise floating point cos(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = cos(rs[0] * M_PI_2)
rd[1] = cos(rs[1] * M_PI_2)
rd[2] = cos(rs[2] * M_PI_2)
Cosine function
Syntax
vcos.q rd, rs
Description
Performs element-wise floating point cos(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 2.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = cos(rs[0] * M_PI_2)
rd[1] = cos(rs[1] * M_PI_2)
rd[2] = cos(rs[2] * M_PI_2)
rd[3] = cos(rs[3] * M_PI_2)
Base-2 exponentiation
Syntax
vexp2.s rd, rs
Description
Performs element-wise floating point exp2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).
Relative error is smaller than 7.2e-07
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])
Base-2 exponentiation
Syntax
vexp2.p rd, rs
Description
Performs element-wise floating point exp2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).
Relative error is smaller than 7.2e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])
rd[1] = (rs[1] >= 128) ? INFINITY : (rs[1] <= -127) ? 0.0f : exp2(rs[1])
Base-2 exponentiation
Syntax
vexp2.t rd, rs
Description
Performs element-wise floating point exp2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).
Relative error is smaller than 7.2e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])
rd[1] = (rs[1] >= 128) ? INFINITY : (rs[1] <= -127) ? 0.0f : exp2(rs[1])
rd[2] = (rs[2] >= 128) ? INFINITY : (rs[2] <= -127) ? 0.0f : exp2(rs[2])
Base-2 exponentiation
Syntax
vexp2.q rd, rs
Description
Performs element-wise floating point exp2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. Inputs larger than 127 result in overflow (cannot represent over 2^127).
Relative error is smaller than 7.2e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = (rs[0] >= 128) ? INFINITY : (rs[0] <= -127) ? 0.0f : exp2(rs[0])
rd[1] = (rs[1] >= 128) ? INFINITY : (rs[1] <= -127) ? 0.0f : exp2(rs[1])
rd[2] = (rs[2] >= 128) ? INFINITY : (rs[2] <= -127) ? 0.0f : exp2(rs[2])
rd[3] = (rs[3] >= 128) ? INFINITY : (rs[3] <= -127) ? 0.0f : exp2(rs[3])
Base-2 logarithm
Syntax
vlog2.s rd, rs
Description
Performs element-wise floating point log2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 3e-05
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = log2(rs[0])
Base-2 logarithm
Syntax
vlog2.p rd, rs
Description
Performs element-wise floating point log2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 3e-05
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = log2(rs[0])
rd[1] = log2(rs[1])
Base-2 logarithm
Syntax
vlog2.t rd, rs
Description
Performs element-wise floating point log2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 3e-05
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = log2(rs[0])
rd[1] = log2(rs[1])
rd[2] = log2(rs[2])
Base-2 logarithm
Syntax
vlog2.q rd, rs
Description
Performs element-wise floating point log2(rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. Accuracy varies greatly depending on the input value. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 3e-05
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = log2(rs[0])
rd[1] = log2(rs[1])
rd[2] = log2(rs[2])
rd[3] = log2(rs[3])
LogB calculation
Syntax
vlgb.s rd, rs
Description
Performs element-wise logb() calculation, that is, extracts the exponent of rs as a floating point value
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = logbf(rs[0])
Reset exponent scale
Syntax
vsbz.s rd, rs
Description
Rescales rs operand to have zero as exponent, so that it is reduced to the [1.0, 2.0) interval. This is essentially equivalent to the vsbn instruction with rt=0.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = (fpiszero(rs[0]) || fpisnan(rs[0])) ? rs[0] : (rs[0] & 0x007FFFFF) | 0x3F800000
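A literal C transcription of the pseudocode above is sketched below. The name vsbz_ref is hypothetical, and the zero and NaN checks are written out from the usual FP32 bit layout. Read as written, the 0x007FFFFF mask also clears the sign bit and only NaN (not infinity) is passed through, which differs slightly from the vsbn pseudocode; the sketch simply follows the formula and makes no claim beyond it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper following the vsbz pseudocode literally. */
static float vsbz_ref(float rs)
{
    uint32_t bits;
    memcpy(&bits, &rs, sizeof bits);            /* reinterpret float as raw bits */

    int is_zero = (bits & 0x7FFFFFFFu) == 0;    /* +0.0 or -0.0 */
    int is_nan = ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x007FFFFFu) != 0;
    if (is_zero || is_nan)
        return rs;

    /* Keep only the mantissa and force the biased exponent to 127 (2^0).
     * The 0x007FFFFF mask also drops the sign bit. */
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;

    float rd;
    memcpy(&rd, &bits, sizeof rd);
    return rd;
}
```

Since 12.0f is 1.5 * 2^3, vsbz_ref(12.0f) yields 1.5f, which lies in the [1.0, 2.0) interval mentioned in the description.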
Floating point modulus
Syntax
vwbn.s rd, rs, scale
Description
TODO: Document this better. Performs some sort of modulus operation.
Instruction performance
Throughput: 1 cycles/instruction
Latency: 5 cycles
Allowed prefixes
Pseudocode
rd[0] = ivwbn(rs[0], imval)
Square root
Syntax
vsqrt.s rd, rs
Description
Performs element-wise floating point approximate square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.1e-07
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sqrt(rs[0])
Square root
Syntax
vsqrt.p rd, rs
Description
Performs element-wise floating point approximate square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.1e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sqrt(rs[0])
rd[1] = sqrt(rs[1])
Square root
Syntax
vsqrt.t rd, rs
Description
Performs element-wise floating point approximate square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.1e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sqrt(rs[0])
rd[1] = sqrt(rs[1])
rd[2] = sqrt(rs[2])
Square root
Syntax
vsqrt.q rd, rs
Description
Performs element-wise floating point approximate square root
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 7.1e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = sqrt(rs[0])
rd[1] = sqrt(rs[1])
rd[2] = sqrt(rs[2])
rd[3] = sqrt(rs[3])
Arc sine function
Syntax
vasin.s rd, rs
Description
Performs element-wise floating point asin(rs)⋅2/π operation
Accuracy
This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 0.02
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = asin(rs[0]) / M_PI_2
Arc sine function
Syntax
vasin.p rd, rs
Description
Performs element-wise floating point asin(rs)⋅2/π operation
Accuracy
This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 0.02
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = asin(rs[0]) / M_PI_2
rd[1] = asin(rs[1]) / M_PI_2
Arc sine function
Syntax
vasin.t rd, rs
Description
Performs element-wise floating point asin(rs)⋅2/π operation
Accuracy
This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 0.02
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = asin(rs[0]) / M_PI_2
rd[1] = asin(rs[1]) / M_PI_2
rd[2] = asin(rs[2]) / M_PI_2
Arc sine function
Syntax
vasin.q rd, rs
Description
Performs element-wise floating point asin(rs)⋅2/π operation
Accuracy
This function provides an approximate value. The precision seems quite good for arguments between -0.5 and 0.5 (around 2.5e-7), but it becomes very inaccurate outside of this range, as it approaches +/-1. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 0.02
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = asin(rs[0]) / M_PI_2
rd[1] = asin(rs[1]) / M_PI_2
rd[2] = asin(rs[2]) / M_PI_2
rd[3] = asin(rs[3]) / M_PI_2
Negative reciprocal
Syntax
vnrcp.s rd, rs
Description
Performs element-wise floating point negated reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 1 cycles/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -1.0f / rs[0]
Negative reciprocal
Syntax
vnrcp.p rd, rs
Description
Performs element-wise floating point negated reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -1.0f / rs[0]
rd[1] = -1.0f / rs[1]
Negative reciprocal
Syntax
vnrcp.t rd, rs
Description
Performs element-wise floating point negated reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -1.0f / rs[0]
rd[1] = -1.0f / rs[1]
rd[2] = -1.0f / rs[2]
Negative reciprocal
Syntax
vnrcp.q rd, rs
Description
Performs element-wise floating point negated reciprocal
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3.5 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Relative error is smaller than 6.3e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -1.0f / rs[0]
rd[1] = -1.0f / rs[1]
rd[2] = -1.0f / rs[2]
rd[3] = -1.0f / rs[3]
Negative sine function
Syntax
vnsin.s rd, rs
Description
Performs element-wise floating point -sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what FP32 IEEE754 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 1 cycle/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -sin(rs[0] * M_PI_2)
Negative sine function
Syntax
vnsin.p rd, rs
Description
Performs element-wise floating point -sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what IEEE-754 FP32 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -sin(rs[0] * M_PI_2)
rd[1] = -sin(rs[1] * M_PI_2)
Negative sine function
Syntax
vnsin.t rd, rs
Description
Performs element-wise floating point -sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what IEEE-754 FP32 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 3 cycles/instruction
Latency: 9 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -sin(rs[0] * M_PI_2)
rd[1] = -sin(rs[1] * M_PI_2)
rd[2] = -sin(rs[2] * M_PI_2)
Negative sine function
Syntax
vnsin.q rd, rs
Description
Performs element-wise floating point -sin(π/2⋅rs) operation
Accuracy
This function provides an approximate value, with lower accuracy than what IEEE-754 FP32 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details.
Absolute error is smaller than 4.8e-07
Instruction performance
Throughput: 4 cycles/instruction
Latency: 10 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = -sin(rs[0] * M_PI_2)
rd[1] = -sin(rs[1] * M_PI_2)
rd[2] = -sin(rs[2] * M_PI_2)
rd[3] = -sin(rs[3] * M_PI_2)
Base-2 negative exponentiation
Syntax
vrexp2.s rd, rs
Description
Performs element-wise floating point 1/exp2(rs) operation (equivalent to exp2(-rs))
Accuracy
This function provides an approximate value, with lower accuracy than what IEEE-754 FP32 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. As the pseudocode shows, inputs of -128 or lower overflow to infinity (values above 2^127 cannot be represented), and inputs of 127 or higher produce zero.
Relative error is smaller than 7.2e-07
Instruction performance
Throughput: 1 cycle/instruction
Latency: 7 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = (rs[0] >= 127) ? 0.0f : (rs[0] <= -128) ? INFINITY : exp2(-rs[0])
Base-2 negative exponentiation
Syntax
vrexp2.p rd, rs
Description
Performs element-wise floating point 1/exp2(rs) operation (equivalent to exp2(-rs))
Accuracy
This function provides an approximate value, with lower accuracy than what IEEE-754 FP32 numbers can represent. The lowest 3 mantissa bits seem to be inaccurate. Please refer to psp-tests/accuracy for more details. As the pseudocode shows, inputs of -128 or lower overflow to infinity (values above 2^127 cannot be represented), and inputs of 127 or higher produce zero.
Relative error is smaller than 7.2e-07
Instruction performance
Throughput: 2 cycles/instruction
Latency: 8 cycles
Register overlap compatibility
Output register can only overlap with input registers if they are identical
Allowed prefixes
Pseudocode
rd[0] = (rs[0] >= 127) ? 0.0f : (rs[0] <= -128) ? INFINITY : exp2(-rs[0])
rd[1] = (rs[1] >= 127) ? 0.0f : (rs[1] <= -128) ? INFINITY : exp2(-rs[1])
Base-2 negative exponentiation