IPU Assembly Instruction Reference
This document describes all available IPU assembly instructions.
Compound Instruction Layout
XMEM Instructions
Memory access instructions for loading and storing data between registers and memory.
str_acc_reg - Store Accumulator
Store accumulator to memory.
Syntax: str_acc_reg offset base
Operands:
- offset: Offset register (lr0-lr15)
- base: Base address register (cr0-cr15)
Operation:
Example:
ldr_mult_reg - Load Register
Load data from memory into a multiplication stage register.
Syntax: ldr_mult_reg dest offset base
Operands:
- dest: Mult stage register (r0, r1, or mem_bypass)
- offset: Offset register (lr0-lr15)
- base: Base address register (cr0-cr15)
Operation:
Example:
ldr_cyclic_mult_reg - Load Cyclic Register
Load with cyclic addressing into r_cyclic.
Syntax: ldr_cyclic_mult_reg offset base index
Operands:
- offset: Offset register (lr0-lr15)
- base: Base address register (cr0-cr15)
- index: Index inside cyclic register (lr0-lr15)
Operation:
ldr_mult_mask_reg - Load Mask Register
Load mask data from memory.
Syntax: ldr_mult_mask_reg offset base mask_idx
Operands:
- offset: Offset register (lr0-lr15)
- base: Base address register (cr0-cr15)
Operation:
xmem_nop - No Operation (XMEM)
No operation for xmem slot.
Syntax: xmem_nop
xmem.store_aaq_result - Store AAQ Result
Write the 128-byte AAQ quantization result register to external memory.
Syntax: xmem.store_aaq_result offset base
Operands:
- offset: Offset register (lr0-lr15)
- base: Base address register (cr0-cr15)
Operation:
Example:
MULT Instructions
Multiplication instructions for element-wise and element-vector operations.
The multiplication result (mult_result) is forwarded directly to the ACC stage in the CPU and is not stored in any register along the way.
mult.ee - Element-wise Multiply
Multiply elements of two registers element by element.
Syntax: mult.ee ra cyclic_offset mask_offset mask_shift
Operands:
- ra: Multiplicand register (r0, r1, or mem_bypass)
- cyclic_offset: Base offset for multiplier from RC (cyclic register)
- mask_offset: Offset to select mask from RM (mask register)
- mask_shift: Shift applied to the mask register
Operation:
Example:
mult.ev - Element-Cyclic Multiply (Deprecated)
[DEPRECATED: use mult.ve.cr or mult.ve.aaq] Multiply Ra elements against a fixed element from cyclic register.
Syntax: mult.ev ra fixed_cyclic_idx mask_offset mask_shift
Operands:
- ra: Multiplicand register (r0, r1, or mem_bypass)
- fixed_cyclic_idx: Fixed index for element selection from cyclic register
- mask_offset: Offset to select mask from RM (mask register)
- mask_shift: Shift applied to the mask register
Operation:
Example:
mult.ve - Vector-Element Multiply
Multiply a fixed element from Ra register against cyclic register elements.
Syntax: mult.ve ra cyclic_offset mask_offset mask_shift fixed_ra_idx
Operands:
- ra: Multiplicand register (r0, r1, or mem_bypass)
- cyclic_offset: Base offset for multiplier from RC (cyclic register)
- mask_offset: Offset to select mask from RM (mask register)
- mask_shift: Shift applied to the mask register
- fixed_ra_idx: Fixed index for element selection from Ra register
Operation:
Example:
mult_nop - No Operation (MULT)
No operation for multiply slot.
Syntax: mult_nop
mult.ve.cr - Vector-Element Multiply (CR scalar)
Multiply each element of RC[cyclic_offset:cyclic_offset+128] by a scalar from a CR register. Elements beyond RC boundary are treated as 1 (dtype-specific).
Syntax: mult.ve.cr cyclic_offset mask_offset mask_shift cr_idx
Operands:
- cyclic_offset: Base offset into RC (cyclic register); addressing is non-cyclic, so out-of-bounds elements are padded with 1
- mask_offset: Offset to select mask from RM (mask register)
- mask_shift: Shift applied to the mask register
- cr_idx: CR register whose low byte supplies the fixed scalar multiplier (cr0-cr15)
Operation:
For i in [0,128): rb = RC[cyclic_offset+i] if in bounds else dtype_one; mult_res[i] = CR[cr_idx][0] * rb
mult.ve.aaq - Vector-Element Multiply (AAQ scalar)
Multiply each element of RC[cyclic_offset:cyclic_offset+128] by a scalar from an AAQ register. Elements beyond RC boundary are treated as 1 (dtype-specific).
Syntax: mult.ve.aaq cyclic_offset mask_offset mask_shift aaq_rf_idx
Operands:
- cyclic_offset: Base offset into RC (cyclic register); addressing is non-cyclic, so out-of-bounds elements are padded with 1
- mask_offset: Offset to select mask from RM (mask register)
- mask_shift: Shift applied to the mask register
- aaq_rf_idx: AAQ register whose low byte supplies the fixed scalar multiplier (aaq0-aaq3)
Operation:
For i in [0,128): rb = RC[cyclic_offset+i] if in bounds else dtype_one; mult_res[i] = AAQ[aaq_rf_idx][0] * rb
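The padded-window rule shared by mult.ve.cr and mult.ve.aaq can be illustrated with a small Python model. This is only a behavioural sketch under stated assumptions: the helper names `mult_ve_scalar` and `low_byte`, the 130-element RC size, the unsigned reading of the low byte, and `dtype_one = 1` are all hypothetical, and the mask operands are not modelled.

```python
DTYPE_ONE = 1  # assumption: the dtype-specific "one" is the integer 1
LANES = 128    # number of result elements, as stated in the operation text


def mult_ve_scalar(rc, cyclic_offset, scalar, lanes=LANES):
    """Multiply a window of RC by one scalar, padding out-of-bounds reads with 1."""
    result = []
    for i in range(lanes):
        idx = cyclic_offset + i
        # Non-cyclic window: indices past the end of RC read as dtype_one.
        rb = rc[idx] if 0 <= idx < len(rc) else DTYPE_ONE
        result.append(scalar * rb)
    return result


def low_byte(register_value):
    """Return the low byte of a register value (unsigned reading is an assumption)."""
    return register_value & 0xFF


# Hypothetical register contents, for illustration only.
rc = list(range(130))   # RC modelled as 130 elements; the real size is not given here
cr = [0] * 16
cr[3] = 0x0102          # low byte is 0x02

# Models "mult.ve.cr" with cyclic_offset=4 and cr_idx=3.
mult_res = mult_ve_scalar(rc, cyclic_offset=4, scalar=low_byte(cr[3]))
```

With these inputs the last two lanes fall outside RC, so they evaluate to the scalar multiplied by 1. Passing the low byte of an AAQ register value instead of a CR value models the mult.ve.aaq variant in the same way.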
ACC Instructions
Accumulation instructions for combining values with optional masking and shifting.
acc - Accumulate
Accumulate multiply result.
Syntax: acc
Operation:
acc.first - Accumulate First
Set accumulator to multiply result (do not add to previous r_acc).
Syntax: acc.first
Operation:
Example:
reset_acc - Reset Accumulator
Reset accumulator to zero.
Syntax: reset_acc
Operation:
acc_nop - No Operation (ACC)
No operation for accumulator slot.
Syntax: acc_nop
acc.add_aaq - Accumulate and Add AAQ
Accumulate multiply result, then add the selected AAQ register (32-bit) to each of the 128 accumulator words.
Syntax: acc.add_aaq aaq_rf_idx
Operands:
- aaq_rf_idx: AAQ register index (aaq0-aaq3)
Operation:
Example:
acc.add_aaq.first - Accumulate and Add AAQ (First)
Set accumulator to multiply result plus selected AAQ register (do not add to previous r_acc).
Syntax: acc.add_aaq.first aaq_rf_idx
Operands:
- aaq_rf_idx: AAQ register index (aaq0-aaq3)
Operation:
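A minimal Python sketch of the accumulate family (reset_acc, acc, acc.first, acc.add_aaq, acc.add_aaq.first), based only on the prose descriptions above. The function names are hypothetical, plain Python integers stand in for the accumulator words, and 32-bit overflow behaviour is not modelled because it is not described here.

```python
LANES = 128  # number of accumulator words


def reset_acc():
    """reset_acc: every accumulator word becomes zero."""
    return [0] * LANES


def acc(r_acc, mult_res):
    """acc: add the multiply result to the previous accumulator."""
    return [a + m for a, m in zip(r_acc, mult_res)]


def acc_first(mult_res):
    """acc.first: overwrite the accumulator with the multiply result."""
    return list(mult_res)


def acc_add_aaq(r_acc, mult_res, aaq_value):
    """acc.add_aaq: accumulate, then add one AAQ value to every word."""
    return [a + m + aaq_value for a, m in zip(r_acc, mult_res)]


def acc_add_aaq_first(mult_res, aaq_value):
    """acc.add_aaq.first: multiply result plus AAQ value, ignoring the old r_acc."""
    return [m + aaq_value for m in mult_res]


# Three successive multiply results of all 2s, with a hypothetical AAQ value of 10.
mult_res = [2] * LANES
r_acc = acc_first(mult_res)
r_acc = acc(r_acc, mult_res)
r_acc = acc_add_aaq(r_acc, mult_res, aaq_value=10)
```

The ".first" variants differ from their counterparts only in dropping the previous r_acc term, which is why they take no accumulator argument in this model.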
Example:
acc.max - Accumulator Max
For each element, set r_acc[i] = max(r_acc[i], mult_res[i], aaq_reg[aaq_rf_idx]).
Syntax: acc.max aaq_rf_idx
Operands:
- aaq_rf_idx: AAQ register index (aaq0-aaq3)
Operation:
Example:
acc.max.first - Accumulator Max (First)
For each element, set r_acc[i] = max(mult_res[i], aaq_reg[aaq_rf_idx]). Previous r_acc is ignored (treated as 0).
Syntax: acc.max.first aaq_rf_idx
Operands:
- aaq_rf_idx: AAQ register index (aaq0-aaq3)
Operation:
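The two max formulas stated for acc.max and acc.max.first can be written as a short Python model. This is a sketch with hypothetical function names; it follows the formulas literally, so acc.max.first takes the maximum of the multiply result and the AAQ value only and does not additionally compare against zero.

```python
def acc_max(r_acc, mult_res, aaq_value):
    """acc.max: r_acc[i] = max(r_acc[i], mult_res[i], aaq value)."""
    return [max(a, m, aaq_value) for a, m in zip(r_acc, mult_res)]


def acc_max_first(mult_res, aaq_value):
    """acc.max.first: r_acc[i] = max(mult_res[i], aaq value); old r_acc is ignored."""
    return [max(m, aaq_value) for m in mult_res]


# Hypothetical values: only the first three lanes are interesting.
first = [-5, 3, 7] + [0] * 125
second = [9, -1, 4] + [0] * 125

r_acc = acc_max_first(first, aaq_value=2)
r_acc = acc_max(r_acc, second, aaq_value=2)
```

In this model the AAQ value acts as a per-element floor: no accumulator word can end up below it.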
Example:
acc.stride - Accumulator Stride
Reorder the multiplication result into r_acc using horizontal/vertical stride decimation. Only updates the RACC indexes written; leaves the rest unchanged.
Syntax: acc.stride elements_in_row horizontal_stride vertical_stride offset
Operands:
- elements_in_row: Elements per row (8, 16, 32, or 64)
- horizontal_stride: Horizontal stride mode (enabled, inverted, expand)
- vertical_stride: Vertical stride mode (enabled, inverted)
- offset: LR register; (value % 4) * 32 gives the start index in RACC (0, 32, 64, or 96)
Operation:
Decimate mult_res as rows×cols; apply horizontal stride (take every 2nd column, optional expand); then vertical stride (take every 2nd row). Write result into r_acc[start:start+N] where start = (offset%4)*32, N = 32|64|128.
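The decimation and placement rule can be sketched in Python as follows. Several points are assumptions rather than documented behaviour: an enabled stride is taken to keep the even-numbered columns or rows (0, 2, 4, ...), each stride is modelled as a simple on/off flag, the inverted and expand modes are left out, and a block that does not fit at the start index raises an error. The names `stride_start` and `acc_stride` are hypothetical.

```python
LANES = 128


def stride_start(offset_value):
    """Start index in r_acc: (offset % 4) * 32, i.e. 0, 32, 64 or 96."""
    return (offset_value % 4) * 32


def acc_stride(r_acc, mult_res, elements_in_row, horizontal, vertical, offset_value):
    """Decimate mult_res as a rows x cols grid and write it into a slice of r_acc."""
    rows = [mult_res[i:i + elements_in_row] for i in range(0, LANES, elements_in_row)]
    if horizontal:
        rows = [row[::2] for row in rows]   # assumption: keep columns 0, 2, 4, ...
    if vertical:
        rows = rows[::2]                    # assumption: keep rows 0, 2, 4, ...
    flat = [value for row in rows for value in row]
    start = stride_start(offset_value)
    if start + len(flat) > LANES:
        # Assumption: overflow past r_acc[127] is treated as an error in this sketch.
        raise ValueError("decimated block does not fit at this start index")
    out = list(r_acc)
    out[start:start + len(flat)] = flat     # words outside the slice stay unchanged
    return out


mult_res = list(range(LANES))
r_acc = [-1] * LANES
out = acc_stride(r_acc, mult_res, elements_in_row=16,
                 horizontal=True, vertical=True, offset_value=5)
```

With both strides on, 128 inputs shrink to 32 outputs, which matches the smallest N listed above; an LR value of 5 selects start index 32.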
AAQ Instructions
Activation and quantization: aggregate r_acc into AAQ registers.
aaq_nop - No Operation (AAQ)
No operation for AAQ slot.
Syntax: aaq_nop
agg - Accumulator Aggregate
Collapse 128 r_acc words into one value (SUM or MAX), apply post function, store to selected AAQ register.
Syntax: agg agg_mode post_fn cr_idx aaq_rf_idx
Operands:
- agg_mode: sum or max
- post_fn: value, value_cr, inv, or inv_sqrt
- cr_idx: CR register for value_cr post function (cr0-cr15)
- aaq_rf_idx: AAQ register to store result (aaq0-aaq3)
Operation:
If sum: v = sum(r_acc[0..127]). If max: v = max(r_acc[0..127], aaq[aaq_rf_idx]). Apply post_fn(v): value→v, value_cr→v*cr[cr_idx], inv→1/v, inv_sqrt→1/sqrt(v). aaq[aaq_rf_idx] = result.
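The aggregation rule above maps directly onto a few lines of Python. This is a hedged sketch: the function name `agg` and its argument layout are hypothetical, and Python floats stand in for whatever numeric format the hardware uses, which is not specified here.

```python
import math


def agg(r_acc, agg_mode, post_fn, cr_value, aaq_value):
    """Collapse r_acc to one value, apply post_fn, and return the new AAQ value."""
    if agg_mode == "sum":
        v = sum(r_acc)
    elif agg_mode == "max":
        # The current AAQ register value takes part in the max.
        v = max(max(r_acc), aaq_value)
    else:
        raise ValueError(f"unknown agg_mode: {agg_mode}")

    if post_fn == "value":
        return v
    if post_fn == "value_cr":
        return v * cr_value
    if post_fn == "inv":
        return 1 / v
    if post_fn == "inv_sqrt":
        return 1 / math.sqrt(v)
    raise ValueError(f"unknown post_fn: {post_fn}")


# Hypothetical accumulator: a single non-zero word.
r_acc = [0] * 127 + [4]

total = agg(r_acc, "sum", "value", cr_value=0, aaq_value=0)
scaled_max = agg(r_acc, "max", "value_cr", cr_value=3, aaq_value=9)
inv_norm = agg(r_acc, "sum", "inv_sqrt", cr_value=0, aaq_value=0)
```

Because max mode folds in the existing AAQ value, repeated agg calls over several accumulator tiles keep a running maximum in the same AAQ register.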
aaq - AAQ Quantize
Quantize the 128-word accumulator from INT32 to INT8, storing clamped results in the aaq_result register. Requires INT8 mode.
Syntax: aaq
Operation:
Requires INT8 mode (cr15 == DType.INT8). For i in [0, 128): aaq_result[i] = clamp(trunc(r_acc[i]), -128, 127)
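The clamp-and-truncate rule can be checked with a short Python model. The function name `aaq_quantize` and the boolean `int8_mode` flag (standing in for the cr15 == DType.INT8 check) are hypothetical, as is raising an error when the mode is wrong.

```python
import math


def aaq_quantize(r_acc, int8_mode=True):
    """Quantize accumulator words to the INT8 range [-128, 127]."""
    if not int8_mode:
        # Assumption: the sketch rejects the call outside INT8 mode.
        raise RuntimeError("aaq requires INT8 mode")
    # Truncate toward zero, then clamp to the signed 8-bit range.
    return [max(-128, min(127, math.trunc(x))) for x in r_acc]


aaq_result = aaq_quantize([300, -300, 5, -7, 127, -128])
```

Values already inside the INT8 range pass through unchanged; only out-of-range words saturate.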
LR Instructions
Loop register manipulation instructions for controlling loop counters and addresses.
incr - Increment Loop Register
Increment a loop register by an immediate value.
Syntax: incr reg value
Operands:
- reg: Loop register to increment (lr0-lr15)
- value: Immediate value to add
Operation:
Example:
set - Set Loop Register
Set a loop register to an immediate value.
Syntax: set reg value
Operands:
- reg: Loop register (lr0-lr15)
- value: 32-bit immediate value
Operation:
Example:
add - Add Registers
Add two registers and store in destination.
Syntax: add dest src_a src_b
Operands:
- dest: Destination loop register (lr0-lr15)
- src_a: First source register (lr0-lr15 or cr0-cr15)
- src_b: Second source register (lr0-lr15 or cr0-cr15)
Operation:
Example:
sub - Subtract Registers
Subtract two registers and store in destination.
Syntax: sub dest src_a src_b
Operands:
- dest: Destination loop register (lr0-lr15)
- src_a: First source register (lr0-lr15 or cr0-cr15)
- src_b: Second source register (lr0-lr15 or cr0-cr15)
Operation:
Example:
Conditional Branch Instructions
Control flow instructions for branching based on conditions or unconditionally.
beq - Branch if Equal
Branch if two registers are equal.
Syntax: beq reg1 reg2 label
Operands:
- reg1: First register to compare (lr0-lr15)
- reg2: Second register to compare (lr0-lr15)
- label: Branch target label
Operation:
Example:
bne - Branch if Not Equal
Branch if two registers are not equal.
Syntax: bne reg1 reg2 label
Operands:
- reg1: First register to compare (lr0-lr15)
- reg2: Second register to compare (lr0-lr15)
- label: Branch target label
Operation:
Example:
blt - Branch if Less Than
Branch if first register is less than second.
Syntax: blt reg1 reg2 label
Operands:
- reg1: First register to compare (lr0-lr15)
- reg2: Second register to compare (lr0-lr15)
- label: Branch target label
Operation:
Example:
bnz - Branch if Not Zero
Branch if test register not equal to base register.
Syntax: bnz test_reg base_reg label
Operands:
- test_reg: Register to test (lr0-lr15)
- base_reg: Base comparison register (lr0-lr15)
- label: Branch target label
Operation:
Example:
bz - Branch if Zero
Branch if test register equals base register.
Syntax: bz test_reg base_reg label
Operands:
- test_reg: Register to test (lr0-lr15)
- base_reg: Base comparison register (lr0-lr15)
- label: Branch target label
Operation:
Example:
b - Unconditional Branch
Always branch to label.
Syntax: b label
Operands:
- label: Branch target label
Operation:
Example:
br - Branch Register
Branch to address in register.
Syntax: br reg
Operands:
- reg: Register containing target address (lr0-lr15)
Operation:
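To show how the LR and branch instructions combine into a counted loop, here is a toy Python interpreter. It is a sketch under several assumptions and not the assembler's behaviour: instructions are written as Python tuples rather than real assembly syntax, labels occupy a slot in the program list, sub computes src_a - src_b, blt compares plain Python integers, and br treats the register value as an index into the program list. The name `run` is hypothetical.

```python
def run(program, max_steps=1000):
    """Execute (mnemonic, *operands) tuples and return the final register values."""
    regs = {f"lr{i}": 0 for i in range(16)}
    regs.update({f"cr{i}": 0 for i in range(16)})
    labels = {ins[1]: pc for pc, ins in enumerate(program) if ins[0] == "label"}
    compare = {
        "beq": lambda a, b: a == b,
        "bne": lambda a, b: a != b,
        "blt": lambda a, b: a < b,
        "bnz": lambda a, b: a != b,   # test register differs from base register
        "bz": lambda a, b: a == b,    # test register equals base register
    }
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            return regs
        op, *args = program[pc]
        pc += 1
        if op == "label":
            continue
        if op == "set":
            regs[args[0]] = args[1]
        elif op == "incr":
            regs[args[0]] += args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "sub":
            regs[args[0]] = regs[args[1]] - regs[args[2]]  # assumed operand order
        elif op in compare:
            if compare[op](regs[args[0]], regs[args[1]]):
                pc = labels[args[2]]
        elif op == "b":
            pc = labels[args[0]]
        elif op == "br":
            pc = regs[args[0]]
        else:
            raise ValueError(f"unknown mnemonic: {op}")
    raise RuntimeError("step limit reached")


# Sum the counter values 0..3 into lr2.
program = [
    ("set", "lr0", 0),              # loop counter
    ("set", "lr1", 4),              # loop bound
    ("set", "lr2", 0),              # running total
    ("label", "loop"),
    ("add", "lr2", "lr2", "lr0"),
    ("incr", "lr0", 1),
    ("blt", "lr0", "lr1", "loop"),
]
final = run(program)
```

The loop body runs four times; blt falls through once lr0 reaches the bound in lr1.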
bkpt - Breakpoint
Conditional breakpoint.
Syntax: bkpt
Operation:
Break Instructions
Debug break instructions for halting execution and entering debug mode.
break - Break
Unconditional break.
Syntax: break
Operation:
break.ifeq - Break if Equal
Break execution if register equals value.
Syntax: break.ifeq reg value
Operands:
- reg: Register to test (lr0-lr15)
- value: Immediate value to compare against
Operation:
Example:
break_nop - No Operation (BREAK)
No operation for break slot.
Syntax: break_nop