Integrated matrix extension#2731

Open
joseemoreira wants to merge 50 commits into main from integrated-matrix-extension

Conversation

@joseemoreira

Update the main branch.

rpsene and others added 30 commits March 6, 2026 11:22
Use ghcr.io/riscv/riscv-docs-base-container-image:latest
Update GitHub Actions to requested versions
…fication

Add a full first-draft specification for the Integrated Matrix family of
extensions (Zvvmm, Zvvfmm, Zvvmtls), which accelerates matrix multiply-
accumulate (GEMM) using the existing RISC-V V register file without
introducing new architectural state.

New vtype CSR fields:
  - lambda[2:0]: selected lambda (K dimension), encoded as powers of two
  - altfmt_A, altfmt_B: FP/signedness format selection for A/B operands

Sub-extensions and instructions:

Zvvmm — integer matrix multiply-accumulate (C ← C + A × B):
  - vmmacc.vv:   non-widening; all operands SEW (funct6=0x38, OPIVV)
  - vwmmacc.vv:  widening ×2; A/B at SEW/2, C at SEW (funct6=0x39)
  - vqwmmacc.vv: quad-widening ×4; A/B at SEW/4, C at SEW (funct6=0x3a)
  Signedness of A and B controlled independently via altfmt_A/altfmt_B.
  Accumulation wraps modulo 2^SEW.

Zvvfmm — floating-point matrix multiply-accumulate:
  - vfmmacc.vv:   non-widening (funct6=0x14, OPFVV)
  - vfwmmacc.vv:  widening ×2 (funct6=0x15)
  - vfqwmmacc.vv: quad-widening ×4 (funct6=0x16)
  FP format for A/B selected by altfmt_A/altfmt_B; accumulator format by
  altfmt (as defined by Zvfbfa).  Operations use the dynamic rounding mode
  (frm); exception flags accumulate in fflags.

Zvvmtls — 2D tile load/store with optional in-situ transpose:
  - vmtl.v:  order-preserving load  (row-major A or column-major B)
  - vmts.v:  order-preserving store
  - vmttl.v: transposing load  (row-major B or column-major A/C)
  - vmtts.v: transposing store
  All four instructions accept an optional inline lambda override (Lλ)
  and an optional mask operand (v0.t).

Tile register layout conventions:
  - A stored row-major; B and C stored column-major in vector register
    groups.  K_effective = λ × W × LMUL; MUL_C = VLEN / (SEW × λ²).
  - tile_reg_idx() maps sequential tile element indices to flat register
    positions, correctly handling the λ×LMUL segment width for LMUL > 1.
  - Load/store address formula uses linesize = λ × LMUL throughout,
    making the order-preserving instructions correct for all LMUL values.
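
A minimal sketch of the two geometry formulas above, written in Python rather than the spec's SAIL. The function name `tile_geometry` and its argument names are illustrative and do not appear in the specification. `Fraction` is used so the sketch makes no claim about which (VLEN, SEW, λ) combinations are legal.

```python
from fractions import Fraction


def tile_geometry(vlen: int, sew: int, lam: int, lmul: int, w: int = 1):
    """Return (K_effective, MUL_C) from the formulas quoted in the commit message.

    K_effective = lambda * W * LMUL
    MUL_C       = VLEN / (SEW * lambda^2)
    """
    k_effective = lam * w * lmul
    mul_c = Fraction(vlen, sew * lam * lam)
    return k_effective, mul_c


# Example: VLEN=256, SEW=32, lambda=2, LMUL=1, widening factor W=2.
k_eff, mul_c = tile_geometry(vlen=256, sew=32, lam=2, lmul=1, w=2)
print(k_eff, mul_c)
```

For the example values this gives K_effective = 2 × 2 × 1 = 4 and MUL_C = 256 / (32 × 4) = 2.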

Each instruction entry includes Synopsis, Mnemonic, Encoding (Wavedrom),
Description, Exceptions (where applicable), Operation (SAIL pseudocode),
and an "Included in" extension table.  All entries appear in a single
alphabetically-ordered instruction index shared across all three
sub-extensions.
* Reformulate GEMM as C ← A × B^T + C with row-major panels
* Rename 'widening' to 'packing' (non-/double-/quad-packing)
* Rename accumulator register group multiplier from MUL to CMUL
  and define as (VLEN / SEW) / λ² (instead of LMUL / λ²)
* Rename 'Alternate Floating-Point Format' to 'Alternate Format'
  to reflect applicability to both integer and floating-point
* Normalize [U]Int notation in sub-extension table
* Add cross-reference anchor for sub-extension table
* Add placeholder 'Storage formats' section
Specify when intermediate rounding is permitted during the K_eff-deep
accumulation of a matrix multiply-accumulate instruction.

For widening instructions (W=2, W=4), define the sub-dot-product as
the W products of (SEW/W)-bit elements within one SEW-wide slot and
note that each individual product is exact at SEW precision.

For floating-point: the implementation partitions the λ×LMUL
sub-dot-products into groups of G (power-of-two, 1 ≤ G ≤ λ),
accumulates each group at ≥ 2×SEW internal precision, then rounds
once and adds to C.  G is implementation-defined, allowing both
systolic (G=1) and outer-product (G ≈ λ) datapaths.  Bit-exact
reproducibility across implementations is explicitly not guaranteed.
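
The effect of the implementation-defined group size G can be illustrated with a small Python model. This is a sketch under assumptions, not the spec's pseudocode: SEW = 32 is assumed, binary64 stands in for the "≥ 2×SEW" internal precision, and the names `round_to_f32` and `grouped_accumulate` are hypothetical.

```python
import math
import struct


def round_to_f32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def grouped_accumulate(c: float, sub_dots: list[float], g: int) -> float:
    """Accumulate sub-dot-products into c in groups of g.

    Each group is summed at higher precision, rounded once to binary32,
    and then added to the binary32 accumulator c.
    """
    for start in range(0, len(sub_dots), g):
        group_sum = math.fsum(sub_dots[start:start + g])
        c = round_to_f32(c + round_to_f32(group_sum))
    return c


sub_dots = [2.0 ** -25] * 4
print(grouped_accumulate(1.0, sub_dots, g=1))  # each addend is below half an ulp of 1.0
print(grouped_accumulate(1.0, sub_dots, g=4))  # the group sum is exactly one ulp of 1.0
```

With G=1 every addend is lost to rounding and the accumulator stays at 1.0, while with G=4 the group sum 2^-23 survives. This is the kind of cross-implementation difference the commit message says is permitted.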

For integer: modular (wrapping) arithmetic makes the result uniquely
defined regardless of accumulation order.
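
A short Python check of the order-independence claim for integer accumulation. The helper name `wrapping_dot` and the sample values are illustrative only; SEW = 8 is chosen so that wrapping actually occurs.

```python
from itertools import permutations


def wrapping_dot(c: int, a: list[int], b: list[int], sew: int, order) -> int:
    """Accumulate a[k] * b[k] into c in the given order, wrapping modulo 2**sew."""
    mask = (1 << sew) - 1
    for k in order:
        c = (c + a[k] * b[k]) & mask
    return c


a, b = [200, 100, 250], [3, 7, 9]
results = {wrapping_dot(17, a, b, 8, order) for order in permutations(range(3))}
print(results)  # a single value: every accumulation order agrees
```

Because reduction modulo 2^SEW commutes with addition, all six orderings yield the same accumulator value.
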
Replace undefined term "VLENE" with the correct expression
"(VLEN / SEW)" in the accumulator register group multiplier formula.
Introduce N_tile_max = M_tile to the tile geometry formulas and add a
new "C tile tail policy" section that specifies the behaviour of C
elements beyond the active N_tile columns when VL is partial.

The key property is that tail elements are never read or written by
multiply-accumulate instructions, so tail-undisturbed (vta=0) is
achieved by write-skip rather than read-merge.  Implementations are
therefore not required to read the tail portion of the C register group,
which benefits outer-product engines and register-renaming machines.
…l policy with base V

Bring vmtl.v, vmts.v, vmttl.v, vmtts.v into alignment with the
standard vector load/store element-status semantics (sec-inactive-defs):

* Add a four-category element-status summary (active / inactive / tail /
  prestart) to the shared Instructions section, with explicit vma and
  vta references and a cross-link to <<sec-inactive-defs>>.
* Fix "active element index i in [0, VL)" to "element index i in the
  body [vstart, VL) where the mask is enabled" throughout.
* Extend each instruction's Description to cover the vma=1 (inactive
  may overwrite with 1s) and vta=1 (tail may overwrite with 1s) cases;
  stores now explicitly state that inactive and tail elements are not
  written to memory and do not raise exceptions.
* Update load pseudocode comments to distinguish inactive (body,
  mask=0) from tail (unreachable by the loop) and name the governing
  vma/vta policies.
* Remove init_masked_source from vmts.v and vmtts.v pseudocode;
  replace with read_vmask + vm_val[i] to match the loads and regular
  vector stores.
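
The four element-status categories referenced above follow the base V definitions and can be sketched as a small classifier. The function name `element_status` is hypothetical; `vm=1` denotes an unmasked instruction, as in the base V encoding.

```python
def element_status(i: int, vstart: int, vl: int, vm: int, mask_bit: int) -> str:
    """Classify element index i as prestart, active, inactive, or tail."""
    if i < vstart:
        return "prestart"
    if i >= vl:
        return "tail"
    # Body elements: vstart <= i < vl.
    if vm == 1 or mask_bit == 1:
        return "active"
    return "inactive"


# Masked operation (vm=0) with vstart=1, vl=4, and v0 mask bits 0b0101.
mask = 0b0101
print([element_status(i, 1, 4, 0, (mask >> i) & 1) for i in range(6)])
```

Only elements classified as "active" are loaded or stored; the vma and vta policies govern what may happen to the "inactive" and "tail" destination elements of a load.
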
…operations

Add an "Element packing in input tiles" subsection under "Storage
formats" that defines the ordering of narrow elements within each
SEW-wide slot for widening multiply-accumulate instructions (W=2, W=4):

- For byte-sized and wider elements (EEW >= 8): standard RISC-V V
  element ordering applies.
- For sub-byte elements (EEW = 4): little-endian nibble packing with
  element k at bits [4k+3 : 4k].
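
The little-endian nibble packing for EEW = 4 can be sketched as follows. The helper names `pack_nibbles` and `unpack_nibble` are illustrative and are not taken from the specification.

```python
def pack_nibbles(elems: list[int]) -> int:
    """Pack 4-bit elements so that element k occupies bits [4k+3 : 4k]."""
    word = 0
    for k, value in enumerate(elems):
        word |= (value & 0xF) << (4 * k)
    return word


def unpack_nibble(word: int, k: int) -> int:
    """Extract element k (bits [4k+3 : 4k]) from a packed slot."""
    return (word >> (4 * k)) & 0xF


slot = pack_nibbles([0x1, 0x2, 0x3, 0x4])
print(hex(slot))               # element 0 lands in the least-significant nibble
print(unpack_nibble(slot, 2))  # recovers element 2
```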

Also remove a stale empty "Arithmetic considerations" section header
left over from a previous merge.
Add a C-Language Intrinsics subsection under "Software considerations"
specifying the naming convention, type system, and representative
prototypes for all IME instruction families (Zvvmtls, Zvvmm, Zvvfmm).

Naming convention:
- Type-suffix encodes the accumulator C register-group multiplier
  (CMUL = VLENE / λ²), independent of LMUL.
- `_lm{N}` qualifier selects LMUL when CMUL ≠ LMUL; `_lm1` may be
  omitted (LMUL=1 is the default).
- Non-ISO input types (BFloat16, OFP8, OFP4, Int4) always carry an
  explicit input-type suffix; standard IEEE types do not.
- Mixed altfmt_A ≠ altfmt_B uses dual suffixes: _{inputA}_{inputB}.
- Canonical suffix order:
  {type}[_{inputA}[_{inputB}]][_su|_us][_lm{N}][_L{N}]
- Overloaded short forms: GCC uses resolve_overloaded_builtin, Clang
  uses __attribute__((overloadable)), C++ uses standard overloading.

Tile load/store (Zvvmtls):
- Prototypes for vmtl, vmts, vmttl, vmtts across all SEW variants.
- Lambda override via compile-time `_L{N}` suffix (no runtime arg),
  covering all four instructions.

Integer multiply-accumulate (Zvvmm):
- vmmacc.vv (W=1), vwmmacc.vv (W=2), vqwmmacc.vv (W=4) prototypes.
- Int4 vector types (vint4m{N}_t, vuint4m{N}_t) and prototypes for
  Zvvmmi4b (Int4→Int8) and Zvvmmi4h (Int4→Int16).
- Mixed-sign _su/_us variants for independent altfmt_A/altfmt_B.
- Note: W=8 gap for Zvvmmi4w (Int4→Int32) and Zvvmmbd (Int8→Int64).

Floating-point multiply-accumulate (Zvvfmm):
- OFP8 types (E4M3, E5M2), OFP4 types (E2M1, E3M0) with non-ISO
  scalar type names (_Float8E4M3, etc.).
- vfmmacc.vv prototypes including BF16→BF16 (Zvvfmmbf16).
- vfwmmacc.vv covering OFP4→OFP8, OFP8→FP16/BF16, FP16→FP32,
  BF16→FP32, FP32→FP64.
- vfqwmmacc.vv covering OFP4→FP16/BF16, OFP8→FP32, BF16→FP64,
  FP16→FP64.
- Consolidated altfmt=1 examples (BF16, E5M2, E3M0 inputs).

Portability:
- VLEN-portable code guidance: write for the largest CMUL the code
  targets, select code paths at runtime (mirroring RVV practice).
Disallow fractional LMUL (LMUL < 1) for all Integrated Matrix
instructions.  Only LMUL ∈ {1, 2, 4, 8} is supported; fractional
settings are reserved and shall raise an illegal-instruction exception.

Remove mf2/mf4/mf8 from the tile load/store intrinsic prototypes
to reflect this restriction.
…inputs section

Add an explicit anchor to the "Alternate formats for inputs" heading
and update the cross-reference in the Zvvfmm section to use it,
fixing the asciidoctor warning about an unknown anchor.
…accumulate (#9)

Reserve vm=0 on all six matrix multiply-accumulate instructions
(vfmmacc, vfqwmmacc, vfwmmacc, vmmacc, vqwmmacc, vwmmacc):

- Set the vm bit to 1 (hardwired) in the encoding diagrams.
- Add vm=0 as an illegal-instruction condition in the Exceptions
  sections and SAIL pseudocode.
- Add a forward-looking note that a future extension may redefine
  vm=0 to source per-element scaling factors from v0 for
  microscaling floating-point formats.

Tile load/store instructions are unaffected and continue to support
vector masking.
…P widening instructions

Allow the A and B input tiles to use different floating-point formats
(altfmt_A ≠ altfmt_B) for widening multiply-accumulate instructions
(vfwmmacc.vv, vfqwmmacc.vv).  For non-widening vfmmacc.vv, altfmt_A
must still equal altfmt_B.

The restriction is based on exact-product representability: mixed-format
inputs are permitted only when p_A + p_B ≤ p_C (the product significand
fits in the accumulator format without rounding).  This condition holds
for all widening combinations but fails for every non-widening mixed case.
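
The representability condition p_A + p_B ≤ p_C can be checked mechanically. The sketch below uses the usual significand precisions (explicit mantissa bits plus the implicit leading bit); the table `P` and the function `product_is_exact` are illustrative names. Like the condition itself, it looks only at significand width and says nothing about exponent range.

```python
# Significand precision p, counting the implicit leading bit.
P = {"E4M3": 4, "E5M2": 3, "FP16": 11, "BF16": 8, "FP32": 24, "FP64": 53}


def product_is_exact(fmt_a: str, fmt_b: str, fmt_c: str) -> bool:
    """True when the product significand of A x B fits in C without rounding."""
    return P[fmt_a] + P[fmt_b] <= P[fmt_c]


print(product_is_exact("FP16", "BF16", "FP32"))  # widening, mixed 16-bit inputs
print(product_is_exact("FP16", "BF16", "FP16"))  # non-widening, mixed 16-bit inputs
print(product_is_exact("E4M3", "E5M2", "FP16"))  # widening, mixed 8-bit inputs
print(product_is_exact("E4M3", "E5M2", "E4M3"))  # non-widening, mixed 8-bit inputs
```

The widening cases shown satisfy the condition (19 ≤ 24 and 7 ≤ 11), while the non-widening cases do not (19 > 11 and 7 > 4).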

Changes:
- Add a new "Mixed-format inputs" section defining the restriction,
  IEEE 754 multiplication semantics, significand-width analysis, and
  subextension gating rules.
- For sub-word inputs (OFP4, OFP8), all format combinations within a
  width class are covered by the existing subextension.
- For 16-bit mixed inputs (FP16 × BF16), both the IEEE binary16 and
  BFloat16 subextensions must be present (widening only).
- Update the subextensions table to clarify that OFP4 and OFP8 entries
  cover all format combinations (E2M1 or E3M0, E4M3 or E5M2).
- Add mixed-format intrinsic examples (E4M3 × E5M2, FP16 × BF16).
- Add IEEE 754 mixed-format note to fp_mul_to documentation.
- Update vfmmacc.vv description to require altfmt_A == altfmt_B;
  vfwmmacc.vv and vfqwmmacc.vv descriptions allow independent selection.

The SAIL pseudocode already decodes fmt_A and fmt_B independently and
passes them separately to fp_mul_to, so no pseudocode changes are needed.
…atrix Extensions

Update the title and all references throughout the spec to use the
canonical name "Zvvm family of Integrated Matrix extensions" instead of
the informal "Integrated Matrix family of extensions".
Add microscaling support for floating-point multiply-accumulate
instructions using paired E8M0 block-scale factors supplied through v0
when vm=0.

- Define microscaling semantics: per-block power-of-two scales applied
  as exact exponent additions with no rounding error.
- Specify the paired scale layout in v0 (scale_A in lower byte,
  scale_B in upper byte of 16-bit elements).
- Support block sizes of 32 (standard OCP MX) and 16 elements,
  selected by the bs field in vtype.
- Add capacity proof and configuration table for VLEN=256.
- Add scale-packing code examples (base V, Zvbb, Zvzip) with
  instruction count comparison.
- Define Zvvfmmmx* (BS=32) and Zvvfmmnx* (BS=16) microscaling
  subextensions with implication chains to base subextensions.
- Integer multiply-accumulate instructions reserve vm=0.
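
A sketch of how one paired scale element from v0 might be unpacked and applied, assuming the layout above (scale_A in the lower byte, scale_B in the upper byte) and the E8M0 decoding of the OCP MX specification (bias 127, encoding 0xFF is NaN). The names `decode_e8m0` and `combined_scale` are hypothetical and are not the spec's helpers.

```python
import math

E8M0_BIAS = 127
E8M0_NAN = 0xFF


def decode_e8m0(byte: int) -> float:
    """Decode an E8M0 scale: a pure power of two, or NaN for 0xFF."""
    if byte == E8M0_NAN:
        return math.nan
    return 2.0 ** (byte - E8M0_BIAS)


def combined_scale(pair16: int) -> float:
    """Split a 16-bit scale pair and return scale_A * scale_B."""
    scale_a = decode_e8m0(pair16 & 0xFF)         # lower byte
    scale_b = decode_e8m0((pair16 >> 8) & 0xFF)  # upper byte
    return scale_a * scale_b


print(combined_scale(0x807F))  # scale_A = 2**0, scale_B = 2**1
```

Since both factors are powers of two, their product is again a power of two, which is why applying it amounts to an exponent addition.
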
E3M0 is a mantissa-free format (all values are exact powers of two),
making it indistinguishable in utility from a wider integer exponent
field.  It adds implementation complexity without meaningful benefit
over OFP4 (E2M1), which already covers the 4-bit FP use case with
one mantissa bit.

Changes:
- Remove Zvvfmme3m0ofp8, Zvvfmme3m0h, Zvvfmme3m0bf16 subextensions
- Mark altfmt_A/B = 1 for 4-bit inputs as _reserved_ in encoding maps
  (vfwmmacc and vfqwmmacc)
- Mark altfmt = 1 / 4-bit as _reserved_ in the altfmt encoding table
- Remove E3M0 from significand widths table
- Remove E3M0 from sub-byte packing description
- Simplify sub-word inputs prose to OFP4 (E2M1) only
- Remove E3M0 C intrinsic prototypes and type references
…pport

Introduce two new mnemonics vfwimmacc.vv and vfqwimmacc.vv for integer
inputs with FP accumulator under microscaling.  These reuse the vm=0
encoding of the existing integer opcodes vwmmacc (funct6=0x39) and
vqwmmacc (funct6=0x3a) in OPIVV, since FP8 exhausts the altfmt_A/B
encoding space for FP inputs.

- vfwimmacc.vv (W=2): Int8/UInt8 → FP16 or BF16 accumulator
- vfqwimmacc.vv (W=4): Int4/UInt4 → FP16/BF16, or Int8/UInt8 → FP32

For both instructions vm=0 selects v0.scale (E8M0 tile-strided layout),
altfmt_A selects signed(0)/unsigned(1) for A, altfmt_B for B, and the
existing BS and LMUL constraints carry over.  SEW=8 operands with EEW=4
(vfqwimmacc only) are reserved.

Add encoding map table, subextension table rows (Zvvfmmmxi8h,
Zvvfmmmxi8bf16, Zvvfmmmxi8w, Zvvfmmmxi4h, Zvvfmmmxi4bf16 and their
Zvvfmmnxi* BS=16 counterparts), full instruction sections with wavedrom
encoding diagrams, description, exceptions, and SAIL pseudocode.  A new
int_to_fp helper converts the exact integer dot-product sum to FP before
scaling and accumulation.  C intrinsic prototypes added for all variants.
- Remove stale "bs=1 and block_size=32 is reserved" bullet from vfmmacc.vv
  that duplicated encoding-table information no longer present.
- Add missing "vd/vs1/vs2 overlaps v0" Illegal Instruction bullet to all
  three microscaling instructions (vfmmacc.vv, vfwmmacc.vv, vfqwmmacc.vv).
- Define normative behaviour when an E8M0 scale decodes as Inf: the scale
  pair is treated as NaN and the accumulator element is set to canonical NaN.
- Cite the encoding tables (not a separate proof) as the source for the
  R ≥ S constraint in the scale-layout section.
- Complete the memory layout / instruction selection table with the
  missing column-major A and row-major C rows.
- Rename CMUL → MUL_C throughout (52 occurrences) for consistency with
  the geometry section.
- Replace the undefined VLENE shorthand with its explicit expansion
  VLEN ÷ (SEW × λ²) / VLEN ÷ (SEW × 4) at all three sites.
- Add missing reserved-encoding Exceptions bullets and SAIL guards for
  vfwimmacc.vv (SEW=32 and SEW=64) and vfqwimmacc.vv (SEW=64).
- Fix VL alignment constraint for vwmmacc.vv and vqwmmacc.vv: the
  Description says "multiple of λ"; align the Exceptions bullet and
  SAIL check accordingly (was incorrectly "multiple of K_effective").
- Allow mixed altfmt_A ≠ altfmt_B for vfmmacc.vv: remove the false
  restriction, update the mixed-format-inputs section and vfmmacc.vv
  Description, and add the missing mixed-format encoding table rows.
- Mark vm=1 as a fixed bit in the wavedrom for vmmacc.vv, vwmmacc.vv,
  and vqwmmacc.vv (vm=0 decodes as a different instruction for the
  latter two; vm=0 is reserved for vmmacc.vv).
* Updates to the introduction

* Editorial notes on bit locations

* Revised floating-point rounding rules

* Special case for tile loads/stores when (rs2) = 0

* Inputs don't have their own SEW, just EEW

* Added arithmetic considerations to mixed-format inputs

* Made semantics of micro-scaling computations clearer

* Used byte addresses in the definitions of tile load/store

* Clarify valid values of VL

* Clarify that tile loads must use target SEW

* Clarify guidelines for portable IME code
math formulas.

Signed-off-by: Jose Moreira <jmoreira@us.ibm.com>
Co-authored-by: Jose Moreira <jmoreira@us.ibm.com>
Remove the spurious `images/` prefix from the four figure references;
the build system already resolves images relative to `src/`, so the
correct prefix is `png/` not `images/png/`.
…qwimmacc

MXINT is defined as signed-only; unsigned inputs via altfmt_A=1 or
altfmt_B=1 are not part of the format and must be reserved.

- Encoding table: mark all UInt8/UInt4 input rows as _reserved_
- Subextension table: "Int8/UInt8" → "Int8", "Int4/UInt4" → "Int4"
- Description prose: remove "0=signed, 1=unsigned" wording; state
  that altfmt_A and altfmt_B must be 0
- Exceptions: add reserved bullets for altfmt_A=1 and altfmt_B=1
- SAIL: add Illegal_Instruction() guards before the compute loop;
  simplify inner loop to unconditional signed() reads
- Intro prose: clarify that altfmt_A/B=1 is reserved for these
  integer-input MX instructions
joseemoreira and others added 5 commits March 9, 2026 13:31
* Added figures for tile load/store.
Added text on optimization of cache access through LMUL usage.
Removed some confusing text on resulting tile dimensions.

* Converted formulas to LaTeX-style math.

* Reordered sections:
 - common geometry description was moved from the integer Zvvmm
   section to the overview.
 - Full list of subextensions and list of encoding were moved to
   the end.
Added some details and an overview of the instructions to the
Zvvfmm section.

* Moved table of microscaling subextensions also to the end, below
the Zvvm subextensions list.
Removed some empty lines and added intentionally one empty line
in front of the major chapters.

* Added comment on transposing with proper data type.
Page breaks "<<<" need adjustment!

* Cosmetics.

* Moved Subextensions section back to right behind the Overview.
Fixed rendering issues of some formulas.
Removed 8x widening extensions.
* Extra tag to make image preview work

* Updated SAIL for tile loads/stores when (rs2) == 0
* Extra tag to make image preview work

* Updated SAIL for tile loads/stores when (rs2) == 0

* Separate order-preserving vs transposing tile loads/stores
@joseemoreira
Author

Philipp, I don't know how to request your review here. Can you still do it?

efocht-oct and others added 2 commits March 9, 2026 20:24
…t and partial-VL load/store

Add three figures illustrating:
- The block-scale pair layout in v0, including the R-strided distribution
  across hardware lanes in a multi-lane setup (ime-mx-v0-format,
  ime-mx-v0-lanes)
- The element distribution for full and reduced VL tile load/store
  operations (ime-load-store-vl)

Update the prose to reference the new figures and add a cross-reference
anchor for the microscaling subextensions section.
…carve-out

Restore octal-widening (W=8) subextension naming and encoding space
reservation in the integer, base FP, and microscaling subextension tables:

  Integer:       Zvvmmbd (Int8 → Int64)
  Base FP:       Zvvfmmofp4f (OFP4 → FP32), Zvvfmmofp8d (OFP8 → FP64)
  Microscaling:  Zvvfmmmxfp4f/Zvvfmmnxfp4f (MXFP4 → FP32),
                 Zvvfmmmxfp8d/Zvvfmmnxfp8d (MXFP8 → FP64),
                 Zvvfmmmxi8d/Zvvfmmnxi8d   (MXINT8 → FP64)

Update the W=8 NOTE to enumerate all affected subextensions and document
that encoding space has been reserved: funct6 = 0x3b in OPIVV (following
vmmacc/vwmmacc/vqwmmacc at 0x38–0x3a) and funct6 = 0x17 in OPFVV
(following vfmmacc/vfwmmacc/vfqwmmacc at 0x14–0x16).
[cols="1,1,4,2", options="header"]
|===
|Extension | Dependencies | Multiplicand Types | Accumulator Type
|Zvvi4i8mm ^| Zve64d | [U]Int4, [U]Int4 | Int8
Contributor

What is the rationale for making everything depend on Zve64d, even the extensions with the smallest multiplicand and accumulator types?


==== Alternate floating-point format for the output (`altfmt`)

The Zvvm Integrated Matrix floating-point extensions use the _altfmt_ field (as defined by Zvfbfa) to select the floating-point format of the elements in the output accumulator matrix C, in conjunction with `vsew`.
Contributor

It would be good to mention in this section that the altfmt field is part of vtype.

| altfmt_A / altfmt_B | Interpretation

| 0 | Signed
| 1 | Unsigned
Contributor

I believe this is the opposite encoding compared to VME: https://github.com/aswaterman/riscv-misc/blob/main/isa/zvt/zvt.adoc

joseemoreira and others added 4 commits March 25, 2026 12:57
…bextension table (#24)

* unprivileged/integrated-matrix: Restore Zvvfmmofp8w (OFP8→FP32) to subextension table

Zvvfmmofp8w was referenced in the FP encoding map (vfqwmmacc.vv, SEW=32,
EEW=8) and as the implied base subextension of Zvvfmmmxfp8w, but was
missing from the base FP subextension table.  Restore the entry between
Zvvfmmofp8bf16 and Zvvfmmofp8d.

* unprivileged/integrated-matrix: Rename subextensions per ARC and document naming scheme (#25)

As handed down on a stone tablet by the guardians of the spirit and intent
of the specification, rename all computational subextensions to the new
scheme: Zvv + input-types + output-types (omitted if same as input) + mm.

Type tokens: i4/i8/i16/i32/i64 for integer; ofp4/ofp8/fp16/bf16/fp32/fp64
for floating-point; x<type>mm for microscaling BS=32, xn<type>mm for BS=16.

Examples: Zvvmmb → Zvvi8mm, Zvvfmmhf → Zvvfp16fp32mm,
          Zvvfmmmxfp8h → Zvvxofp8fp16mm, Zvvfmmnxi8f → Zvvxni8fp32mm.
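
The naming rule can be expressed as a small helper and checked against the new names listed above. This is an illustrative sketch: the function `subextension_name` is hypothetical, it handles only a single input-type token, and the Int8 to Int8 reading of `Zvvi8mm` is an assumption.

```python
def subextension_name(input_type: str, output_type: str, block_size=None) -> str:
    """Build a name as Zvv + [x|xn] + input type + output type + mm.

    The output type is omitted when it equals the input type.
    block_size is None for no microscaling, 32 for "x", 16 for "xn".
    """
    prefix = {None: "", 32: "x", 16: "xn"}[block_size]
    out = "" if output_type == input_type else output_type
    return f"Zvv{prefix}{input_type}{out}mm"


print(subextension_name("i8", "i8"))
print(subextension_name("fp16", "fp32"))
print(subextension_name("ofp8", "fp16", block_size=32))
print(subextension_name("i8", "fp32", block_size=16))
```

The four calls reproduce Zvvi8mm, Zvvfp16fp32mm, Zvvxofp8fp16mm, and Zvvxni8fp32mm respectively.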

Add a "Naming conventions" subsection to the Subextensions of Zvvm section
documenting the two-level naming scheme (family-level vs. individual
subextensions) and all type tokens. Zvvmtls and Zvvmttls are unchanged.
* Extra tag to make image preview work

* Updated SAIL for tile loads/stores when (rs2) == 0

* Separate order-preserving vs transposing tile loads/stores

* Fixes based on feedback from initial IME TG internal review
@ptomsich force-pushed the integrated-matrix-extension branch from b976f48 to 98710a9 on March 25, 2026 11:59
ptomsich and others added 9 commits March 26, 2026 20:20
Process all 28 items from the IME TG internal review feedback tracker.

Subextension dependencies (#3):
  Replace blanket Zve64d dependency with the minimum Zve subset per
  subextension: Zve32x for integer accumulators ≤ 32-bit, Zve64x for
  Int64 accumulators, Zve32f for FP accumulators ≤ 32-bit, and Zve64d
  only for FP64 accumulators.
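
The dependency rule above reduces to a two-way decision on the accumulator type. A minimal Python sketch, with `min_zve` as a hypothetical helper name:

```python
def min_zve(acc_is_fp: bool, acc_bits: int) -> str:
    """Return the minimum Zve subset for a given accumulator type."""
    if acc_bits > 64:
        raise ValueError("accumulators wider than 64 bits are not covered")
    if acc_is_fp:
        return "Zve64d" if acc_bits == 64 else "Zve32f"
    return "Zve64x" if acc_bits == 64 else "Zve32x"


print(min_zve(acc_is_fp=False, acc_bits=8))   # small integer accumulator
print(min_zve(acc_is_fp=True, acc_bits=64))   # FP64 accumulator
```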

8× widening instructions (#7, #8, #9, #24):
  Add v8wmmacc.vv (funct6=0x3b, OPIVV), vf8wmmacc.vv (funct6=0x17,
  OPFVV), and vf8wimmacc.vv (integer-input MX variant, vm=0 of
  v8wmmacc) with full instruction definitions, SAIL pseudocode,
  encoding diagrams, and exception tables.  Update encoding maps (FP,
  integer, integer MX) with W=8 entries.  Add Zvvxi4fp32mm and
  Zvvxni4fp32mm to the MX subextension table.  Replace the informative
  NOTE about reserved W=8 encoding space with normative text.  Remove
  the undefined term "octal-widening".

MXINT4 clarification and OCP citation (#14):
  Define MXINT4 as analogous to OCP MX's MXINT8 but with 4-bit signed
  elements.  Add proper citation of the OCP Microscaling Formats (MX)
  v1.0 Specification with URL.  Update microscaling applicability to
  include vf8wmmacc.vv.

vfmmacc.vv vm=0 cleanup (#13, #28):
  Remove contradictory "When vm=0" exception bullets (vm=0 is reserved
  for non-widening FP).  Replace dead microscaling SAIL code with a
  straightforward non-widening FP GEMM loop.  Add explicit note that
  microscaling is not supported for non-widening multiply-accumulate.

Terminology fixes (#15, #21):
  Add forward cross-reference at first use of altfmt_A/altfmt_B.
  Correct two occurrences where λ was described as "the K dimension"
  to "tile-layout parameter", clarifying that K_eff = λ × W × LMUL is
  the derived effective K dimension.
…helpers

Extract common patterns from the 11 GEMM instruction SAIL blocks into
5 shared helper functions, eliminating ~300 lines of duplication:

  decode_gemm_geometry(W, vl_divisor_is_lambda) — unified preamble
    that decodes SEW, LMUL, lambda, computes K_eff/M/N/MUL_C, and
    checks VL divisibility and MUL_C legality.  Parameterised by the
    widening factor W and the VL divisor convention (K_eff vs lambda).

  read_block_scales(vm, i, j, s, R, EEW_C, fmt_C, rm) — reads and
    unpacks paired E8M0 block scales from v0, returning the combined
    scale and a NaN flag.  When vm=1, returns (1.0, false).

  fp_block_dot(i, j, k_lo, k_hi, g, fmt_A, fmt_B, fmt_C, rm, vs1, vs2)
    — FP inner product over a K-dimension block with widening.

  int_block_dot(i, j, k_lo, k_hi, g, signed_A, signed_B, vs1, vs2)
    — exact signed/unsigned integer inner product over a block.

  check_microscaling_legality(W, LMUL, EEW_C, lambda) — checks the
    SEW×λ≥16 and BS=16 LMUL constraints.

Each instruction body now consists of instruction-specific checks
(vm=0 reserved, SEW restrictions, altfmt checks) followed by a call
to decode_gemm_geometry and a compact main loop using the helpers.
Tile load/store instructions (vmtl, vmts, vmttl, vmtts) are unchanged.
…vtype bits

Place altfmt_A at vtype[XLEN-5], altfmt_B at vtype[XLEN-6], and bs at
vtype[XLEN-7], immediately below the lambda[2:0] field.  These positions
are outside the vsetvli immediate field and require vsetvl or vsetivli
to configure.
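
For illustration, the three bit positions named above can be extracted from a vtype value as follows. The function name `decode_ime_vtype_bits` is hypothetical, and only the fields whose positions are stated in this commit message are decoded.

```python
def decode_ime_vtype_bits(vtype: int, xlen: int = 64) -> dict:
    """Extract altfmt_A, altfmt_B, and bs from their stated vtype positions."""
    return {
        "altfmt_A": (vtype >> (xlen - 5)) & 1,
        "altfmt_B": (vtype >> (xlen - 6)) & 1,
        "bs": (vtype >> (xlen - 7)) & 1,
    }


# XLEN=64: altfmt_A is bit 59, altfmt_B is bit 58, bs is bit 57.
vtype = (1 << 59) | (1 << 57)
print(decode_ime_vtype_bits(vtype))
```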

Remove the two editorial notes that described the old positions
(vtype[9], vtype[10], vtype[11]) as provisional and flagged the
expected move.  Replace with normative prose stating the final
locations.

The SAIL pseudocode is unaffected as it uses symbolic field names
(vtype[altfmt_A], vtype[bs], etc.) throughout.
* Update integrated-matrix.adoc

Signed-off-by: Jose Moreira <jmoreira@us.ibm.com>
…section

Replace auto-generated anchor _alternate_formats_for_inputs_altfmt_a_altfmt_b
with the explicit anchor integrated-matrix-altfmt-inputs defined at the
section heading.
The v0.scale mechanism now specifies only that paired block-scale
factors are present in v0; the scale format (e.g. E8M0) is an inherent
property of the input data type, not of the v0.scale encoding itself.

Restructure the microscaling section:
- Add a "Scale formats" subsection with an explicit E8M0 definition,
  leaving room for future scale formats (E4M3, UE5M3, etc.)
- Make the microscaling semantics format-agnostic: scale decoding
  references "the scale format associated with the input data type"
- Make the v0 layout description format-agnostic: byte positions and
  pairing are independent of the scale format
- Add cross-reference from the v0 layout to the Scale formats section

Update SAIL pseudocode:
- Rename e8m0_to_fp to decode_scale with a scale_format parameter
- Add scale_fmt parameter to read_block_scales; all call sites pass
  E8M0 explicitly
- Update helper comments to describe the generic interface

Update per-instruction descriptions:
- "paired E8M0 block-scale factors" → "paired block-scale factors;
  scale format determined by input data type"
- "with E8M0 Microscaling" → "with Microscaling" in synopses
- Remove "E8M0" from generic microscaling references in the overview

Subextension names and MX type definitions (MXFP8, MXINT8, etc.)
retain "E8M0" as it is part of the format name, not a property of
the v0.scale mechanism.
Extract the main loop bodies from all 11 GEMM instruction SAIL blocks
into four top-level operation functions:

  int_gemm         — integer C += integer A × integer B (no scaling)
  fp_gemm          — FP C += FP A × FP B (no scaling)
  fp_scaled_gemm   — FP C += scale × (FP A × FP B), block scales from v0
  int_scaled_gemm  — FP C += scale × (int A × int B), block scales from v0

Each instruction's SAIL Operation block is now a thin wrapper that
performs instruction-specific legality checks, calls
decode_gemm_geometry, sets up formats/signedness, and dispatches to
the appropriate operation function.

For FP widening instructions (vfwmmacc, vfqwmmacc, vf8wmmacc), the
dispatch is conditional on vm: vm=1 calls fp_gemm, vm=0 calls
fp_scaled_gemm.  This directly mirrors the hardware decode.

The four operation functions use the existing building blocks
(fp_block_dot, int_block_dot, read_block_scales) internally.
…mats

Introduce scale_width_of(scale_fmt) to derive the per-scale bit width
(sw) and scale-pair width (pw = 2 × sw) from the scale format, rather
than hardcoding 16-bit pairs throughout.  For E8M0: sw=8, pw=16 —
all existing behavior is unchanged.

SAIL changes:
- Add scale_width_of() helper (returns 8 for E8M0, extensible)
- read_block_scales: read pw-bit elements from v0, extract sw-bit
  halves via pair[sw-1..0] and pair[2*sw-1..sw]
- fp_scaled_gemm / int_scaled_gemm: derive R = λ × SEW / pw
- check_microscaling_legality: check SEW × λ ≥ pw (takes scale_fmt)
- All call sites pass E8M0 explicitly
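
The relationships between sw, pw, and R described above can be sketched in Python. The name `scale_width_of` mirrors the SAIL helper named in this commit message, but this Python rendering, along with `scale_geometry` and `split_pair`, is an illustrative assumption and not the spec's code.

```python
def scale_width_of(scale_fmt: str) -> int:
    """Per-scale bit width sw; only E8M0 is defined here."""
    return {"E8M0": 8}[scale_fmt]


def scale_geometry(sew: int, lam: int, scale_fmt: str = "E8M0"):
    """Return (sw, pw, R) with pw = 2 * sw and R = lambda * SEW / pw."""
    sw = scale_width_of(scale_fmt)
    pw = 2 * sw
    if sew * lam < pw:
        raise ValueError("illegal: SEW * lambda must be at least pw")
    return sw, pw, (lam * sew) // pw


def split_pair(pair: int, sw: int):
    """Extract the two sw-bit halves: pair[sw-1..0] and pair[2*sw-1..sw]."""
    mask = (1 << sw) - 1
    return pair & mask, (pair >> sw) & mask


print(scale_geometry(sew=32, lam=2))  # E8M0: sw=8, pw=16
print(split_pair(0x807F, 8))
```

For E8M0 the legality check reduces to the concrete "SEW × λ ≥ 16" constraint that the per-instruction exception lists retain.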

Prose changes:
- Scale layout description uses pw/sw instead of hardcoded 16/8
- Capacity proof generalized: M × R = VLEN / pw
- R ≥ S proof annotated with "(for pw ≤ 16)"
- Legality constraint in Applicability section generalized
- Figure caption uses pw for stride formula

Per-instruction exception lists retain "SEW × λ < 16" as the concrete
constraint for the currently defined E8M0 scale format.
@ptomsich force-pushed the integrated-matrix-extension branch from 1cd9728 to 7f0b4dd on April 12, 2026 21:02


6 participants