Building IPU Applications¶
This guide shows how to build a complete IPU application using the fully connected neural network layer as an example. The complete code is in src/tools/ipu-apps/src/ipu_apps/fully_connected.
Application Structure¶
Each IPU application is a subpackage under ipu_apps/ containing:
- Assembly code (
.asm) — IPU program with compute operations (see Assembly Syntax Guide) - Python app class (
__init__.py) — Subclass ofIpuAppthat implementssetup()andteardown() - Debug runner (
__main__.py) — Optional CLI entry point for interactive debugging - Test data — Input/output binary files for validation
- Regression tests (
test/test_*.py) — Automated tests comparing outputs against golden references
Everything lives together in one directory.
Step 1: Write the Assembly Program¶
Create your IPU assembly program (e.g., fully_connected.asm). The assembly program contains the compute logic that runs on the IPU. See the Assembly Syntax Guide for details on writing IPU assembly code with Jinja2 preprocessing.
Step 2: Define Bazel Build Targets¶
Create BUILD.bazel entries to assemble the program and define your app targets:
load("//:asm_rules.bzl", "assemble_asm")
load("@rules_python//python:defs.bzl", "py_binary", "py_library")
load("@rules_python_pytest//python_pytest:defs.bzl", "py_pytest_test")
# Assemble the .asm file to .bin
assemble_asm(
name = "assemble_my_app",
src = "src/ipu_apps/my_app/my_app.asm",
)
# Collect test data files
filegroup(
name = "my_app_test_data",
srcs = glob(["src/ipu_apps/my_app/test_data/**/*.bin"]),
)
# Binary for interactive debugging (optional)
py_binary(
name = "my_app",
srcs = ["src/ipu_apps/my_app/__main__.py"],
main = "src/ipu_apps/my_app/__main__.py",
data = [
":assemble_my_app.bin",
":my_app_test_data",
],
env = {
"MY_APP_INST_BIN": "$(rootpath :assemble_my_app.bin)",
"MY_APP_DATA_DIR": "src/tools/ipu-apps/src/ipu_apps/my_app/test_data",
},
imports = ["src"],
deps = [":ipu_apps_lib"],
)
# Regression tests
py_pytest_test(
name = "test_my_app",
srcs = ["test/test_my_app.py"],
data = [
":assemble_my_app.bin",
":my_app_test_data",
],
env = {
"MY_APP_INST_BIN": "$(rootpath :assemble_my_app.bin)",
"MY_APP_DATA_DIR": "src/tools/ipu-apps/src/ipu_apps/my_app/test_data",
},
imports = ["src"],
deps = [
":ipu_apps_lib",
requirement("pytest"),
],
)
Build the assembly:
Step 3: Write the Application Class¶
Create src/ipu_apps/my_app/__init__.py and subclass IpuApp:
"""My IPU application — description of what it does."""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
from ipu_emu.emulator import load_binary_to_xmem, dump_xmem_to_binary
from ipu_apps.base import IpuApp
if TYPE_CHECKING:
from ipu_emu.ipu_state import IpuState
class MyApp(IpuApp):
"""My application harness.
Args:
inst_path: Path to assembled instruction binary.
inputs_path: Path to input data binary.
output_path: Optional path to write output.
"""
def __init__(self, **kwargs) -> None:
super().__init__(**kwargs)
self.inputs_path = Path(self.inputs_path)
def setup(self, state: "IpuState") -> None:
"""Load data into XMEM and configure registers before execution."""
# Load input data
load_binary_to_xmem(
state, self.inputs_path,
base_addr=0x0000,
chunk_size=128,
num_chunks=10,
)
# Set control registers
state.regfile.set_cr(0, 0x0000) # input base address
state.regfile.set_cr(1, 0x20000) # weight base address
state.regfile.set_cr(2, 0x40000) # output base address
def teardown(self, state: "IpuState") -> None:
"""Dump results from XMEM after execution."""
if self.output_path is not None:
dump_xmem_to_binary(
state, self.output_path,
base_addr=0x40000,
chunk_size=256,
num_chunks=10,
)
Key points:
- Only implement setup() and teardown() — the base class handles everything else
- Use **kwargs in __init__ and call super().__init__(**kwargs) to auto-store all parameters as attributes
- setup() prepares the IPU state before execution (load data, set registers)
- teardown() collects results after execution (dump outputs)
- The base class run() method orchestrates: create state → load program → setup → execute → teardown
Step 4: Write the Interactive Runner (Optional)¶
Create src/ipu_apps/my_app/__main__.py for interactive debugging:
"""Debug runner for my_app.
Usage::
bazel run //src/tools/ipu-apps:my_app -- --input data.bin
"""
import argparse
import os
from pathlib import Path
from ipu_emu.debug_cli import debug_prompt
from ipu_apps.my_app import MyApp
def main() -> None:
parser = argparse.ArgumentParser(description="Run my_app with debug CLI")
parser.add_argument("--input", type=Path, required=True)
parser.add_argument("--output", type=Path, default=None)
parser.add_argument("--max-cycles", type=int, default=1_000_000)
args = parser.parse_args()
# Bazel sets these via env vars (see BUILD.bazel)
inst_path = Path(os.environ["MY_APP_INST_BIN"])
app = MyApp(
inst_path=inst_path,
inputs_path=args.input,
output_path=args.output,
)
state, cycles = app.run(
max_cycles=args.max_cycles,
debug_callback=debug_prompt, # Interactive debug CLI
)
print(f"Finished in {cycles} cycles")
if __name__ == "__main__":
main()
Key points:
- Use debug_prompt as the callback to enable interactive debugging
- Read MY_APP_INST_BIN from environment (set by Bazel in BUILD.bazel)
- Accept command-line arguments for input/output paths
- This is optional — only needed if you want a standalone debug runner
Run interactively:
Step 5: Write Regression Tests¶
Create test/test_my_app.py for automated validation:
"""End-to-end regression tests for my_app."""
from __future__ import annotations
import os
from pathlib import Path
import pytest
from ipu_apps.my_app import MyApp
# Bazel sets these via env vars (see BUILD.bazel)
_INST_BIN = Path(os.environ["MY_APP_INST_BIN"])
_DATA_DIR = Path(os.environ["MY_APP_DATA_DIR"])
def _run_app(tmp_path: Path, test_case: str) -> tuple[bytes, int]:
"""Run the app for a test case, return (output_bytes, cycles)."""
case_dir = _DATA_DIR / test_case
if not case_dir.exists():
pytest.skip(f"Test data not found: {case_dir}")
inputs = case_dir / "inputs.bin"
if not inputs.exists():
pytest.skip(f"Missing input file: {inputs}")
output = tmp_path / "output.bin"
app = MyApp(
inst_path=_INST_BIN,
inputs_path=inputs,
output_path=output,
)
_, cycles = app.run(max_cycles=1_000_000)
return output.read_bytes(), cycles
@pytest.mark.parametrize("test_case,golden_name", [
("case1", "golden_case1.bin"),
("case2", "golden_case2.bin"),
])
def test_my_app(tmp_path: Path, test_case: str, golden_name: str) -> None:
"""Compare app output against golden reference."""
actual, cycles = _run_app(tmp_path, test_case)
assert cycles > 0
golden = _DATA_DIR / test_case / golden_name
if golden.exists():
assert actual == golden.read_bytes()
Key points:
- Read MY_APP_INST_BIN and MY_APP_DATA_DIR from environment variables
- Use pytest.mark.parametrize to test multiple cases with one test function
- Compare actual output against golden reference files
- Skip gracefully if test data is missing
- No debug_callback — tests run headless and fast
Run tests:
Example Application: Fully Connected Layer¶
This section walks through a complete real-world implementation: a fully-connected neural network layer that processes multiple samples, each with 128 input neurons and produces 64 output neurons.
Assembly Program¶
The IPU assembly implements the core computation: for each input sample, compute the dot product of the input vector with each weight row (after transposition for efficiency):
set lr0 0 ;;
set lr1 1280 ;;
set lr2 0 ;;
input_loop:
reset_acc;;
ldr_cyclic_mult_reg lr0 cr0 lr15;;
set lr4 -128;;
set lr5 -1;;
set lr6 127;;
element_loop:
ldr_mult_reg mem_bypass lr4 cr1;
incr lr4 128;
incr lr5 1;
mult.ev mem_bypass lr5 lr15 lr15;
acc;
blt lr5 lr6 element_loop;;
str_acc_reg lr7 cr2;;
incr lr7 256;
incr lr0 128;;
break;;
blt lr0 lr1 input_loop;;
end:
bkpt;;
Python Application Class¶
The FullyConnectedApp class handles data setup and result collection:
"""Fully-connected layer test harness — Python port of fully_connected.c."""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
from ipu_emu.ipu_math import DType
from ipu_emu.emulator import load_binary_to_xmem, dump_xmem_to_binary
from ipu_apps.base import IpuApp
if TYPE_CHECKING:
from ipu_emu.ipu_state import IpuState
# Constants
SAMPLES_NUM = 10
INPUT_BASE_ADDR = 0x0000
INPUT_NEURONS = 128
WEIGHTS_BASE_ADDR = 0x20000
OUTPUT_BASE_ADDR = 0x40000
OUTPUT_NEURONS = 64
def parse_dtype(dtype_str: str) -> DType:
"""Parse a dtype string into a DType enum value."""
dtype_map = {
"INT8": DType.INT8,
"FP8_E4M3": DType.FP8_E4M3,
"FP8_E5M2": DType.FP8_E5M2,
}
dt = dtype_map.get(dtype_str)
if dt is None:
raise ValueError(
f"Invalid dtype '{dtype_str}'. Supported: INT8, FP8_E4M3, FP8_E5M2"
)
return dt
def _load_and_transpose_weights(state: "IpuState", weights_path: str | Path) -> None:
"""Load weights from file and transpose into XMEM.
Original: (OUTPUT_NEURONS × INPUT_NEURONS).
Transposed: (INPUT_NEURONS × INPUT_NEURONS), zero-padded.
"""
raw = Path(weights_path).read_bytes()
expected = OUTPUT_NEURONS * INPUT_NEURONS
if len(raw) < expected:
raise ValueError(
f"Weights file too small: {len(raw)} bytes, expected {expected}"
)
original: list[bytes] = []
for j in range(OUTPUT_NEURONS):
row_start = j * INPUT_NEURONS
original.append(raw[row_start : row_start + INPUT_NEURONS])
for i in range(INPUT_NEURONS):
transposed_vector = bytearray(INPUT_NEURONS)
for j in range(OUTPUT_NEURONS):
transposed_vector[j] = original[j][i]
state.xmem.write_address(WEIGHTS_BASE_ADDR + i * INPUT_NEURONS, transposed_vector)
class FullyConnectedApp(IpuApp):
"""Fully-connected layer application harness.
Args:
inst_path: Path to assembled instruction binary.
inputs_path: Path to input activations binary.
weights_path: Path to weights binary.
output_path: Optional path to write output.
dtype: Data type string or DType.
"""
def __init__(self, *, dtype: str | DType = "INT8", **kwargs) -> None:
super().__init__(**kwargs)
self.inputs_path = Path(self.inputs_path)
self.weights_path = Path(self.weights_path)
self.dtype = parse_dtype(dtype) if isinstance(dtype, str) else dtype
def setup(self, state: "IpuState") -> None:
"""Load inputs and weights, set control registers."""
state.set_cr_dtype(int(self.dtype))
load_binary_to_xmem(
state, self.inputs_path, INPUT_BASE_ADDR, INPUT_NEURONS, SAMPLES_NUM
)
_load_and_transpose_weights(state, self.weights_path)
state.regfile.set_cr(0, INPUT_BASE_ADDR)
state.regfile.set_cr(1, WEIGHTS_BASE_ADDR)
state.regfile.set_cr(2, OUTPUT_BASE_ADDR)
def teardown(self, state: "IpuState") -> None:
"""Dump output activations from XMEM."""
if self.output_path is not None:
dump_xmem_to_binary(
state, self.output_path,
OUTPUT_BASE_ADDR, OUTPUT_NEURONS * 4, SAMPLES_NUM,
)
Interactive Debug Runner¶
The __main__.py provides an interactive CLI for development:
"""Debug runner for the fully-connected app.
Usage::
bazel run //src/tools/ipu-apps:fully_connected -- --dtype INT8
"""
import argparse
import os
from pathlib import Path
from ipu_emu.debug_cli import debug_prompt
from ipu_apps.fully_connected import FullyConnectedApp
def main() -> None:
parser = argparse.ArgumentParser(description="Run fully-connected with debug CLI")
parser.add_argument("--dtype", default="INT8", choices=["INT8", "FP8_E4M3", "FP8_E5M2"])
parser.add_argument("--output", type=Path, default=None)
parser.add_argument("--max-cycles", type=int, default=2_000_000)
args = parser.parse_args()
# Bazel sets these via env vars (see BUILD.bazel)
inst_path = Path(os.environ["FC_INST_BIN"])
data_dir = Path(os.environ["FC_DATA_DIR"])
app = FullyConnectedApp(
inst_path=inst_path,
inputs_path=data_dir / args.dtype.lower() / f"inputs_{args.dtype.lower()}.bin",
weights_path=data_dir / args.dtype.lower() / f"weights_{args.dtype.lower()}.bin",
output_path=args.output,
dtype=args.dtype,
)
state, cycles = app.run(max_cycles=args.max_cycles, debug_callback=debug_prompt)
print(f"Finished in {cycles} cycles")
if __name__ == "__main__":
main()
Key observations:
- The setup() method transcodes weights (transpose for cache efficiency) and loads both inputs and weights into XMEM
- Control registers point the assembly code to the base addresses: CR0=inputs, CR1=weights_transposed, CR2=outputs
- The dtype is configurable (INT8, FP8_E4M3, FP8_E5M2) and passed to the IPU state
- The assembly program is written to operate on transposed weights for better performance
- All paths are resolved by Bazel and passed via environment variables
See src/tools/ipu-apps/src/ipu_apps/fully_connected for the complete working code including tests and Bazel build targets.
Summary: The Two Usage Patterns¶
Pattern 1: Interactive Debugging¶
- Purpose: Develop and debug your application
- File:
__main__.pywithdebug_promptcallback - Target:
py_binaryin BUILD.bazel - Run:
bazel run //...:my_app -- --input data.bin - Features: Breakpoints, register inspection, step-through execution
Pattern 2: Regression Tests¶
- Purpose: Validate correctness against known-good outputs
- File:
test/test_*.pywith pytest assertions - Target:
py_pytest_testin BUILD.bazel - Run:
bazel test //...:test_my_app - Features: Fast, headless, CI-friendly, parametrized test cases
Both patterns share the same MyApp class — you just write setup() and teardown() once, then use it in both contexts.
Key Concepts¶
- Memory Layout: Define base addresses for inputs, weights, and outputs in external memory (XMEM)
- Register Setup: Initialize LR/CR registers before execution in
setup() - Auto-attribute Storage: Pass all parameters to
IpuApp.__init__(**kwargs)— they're automatically stored asself.param_name - Path Handling: Use Bazel
envwith$(rootpath ...)to set environment variables with resolved paths - Emulator Run: The emulator executes instructions until the program counter exceeds instruction memory
- Bazel Integration: The
assemble_asmrule compiles.asmfiles to.binexecutables
See the Assembly Syntax Guide for more details on writing IPU programs and the complete fully_connected example at src/tools/ipu-apps/src/ipu_apps/fully_connected/ for a real-world implementation.