Open
Use ghcr.io/riscv/riscv-docs-base-container-image:latest
Update GitHub Actions to requested versions
…fication
Add a full first-draft specification for the Integrated Matrix family of
extensions (Zvvmm, Zvvfmm, Zvvmtls), which accelerates matrix multiply-
accumulate (GEMM) using the existing RISC-V V register file without
introducing new architectural state.
New vtype CSR fields:
- lambda[2:0]: selected lambda (K dimension), encoded as powers of two
- altfmt_A, altfmt_B: FP/signedness format selection for A/B operands
Sub-extensions and instructions:
Zvvmm — integer matrix multiply-accumulate (C ← C + A × B):
- vmmacc.vv: non-widening; all operands SEW (funct6=0x38, OPIVV)
- vwmmacc.vv: widening ×2; A/B at SEW/2, C at SEW (funct6=0x39)
- vqwmmacc.vv: quad-widening ×4; A/B at SEW/4, C at SEW (funct6=0x3a)
Signedness of A and B controlled independently via altfmt_A/altfmt_B.
Accumulation wraps modulo 2^SEW.
Zvvfmm — floating-point matrix multiply-accumulate:
- vfmmacc.vv: non-widening (funct6=0x14, OPFVV)
- vfwmmacc.vv: widening ×2 (funct6=0x15)
- vfqwmmacc.vv: quad-widening ×4 (funct6=0x16)
FP format for A/B selected by altfmt_A/altfmt_B; accumulator format by
altfmt (as defined by Zvfbfa). Operations use the dynamic rounding mode
(frm); exception flags accumulate in fflags.
Zvvmtls — 2D tile load/store with optional in-situ transpose:
- vmtl.v: order-preserving load (row-major A or column-major B)
- vmts.v: order-preserving store
- vmttl.v: transposing load (row-major B or column-major A/C)
- vmtts.v: transposing store
All four instructions accept an optional inline lambda override (Lλ)
and an optional mask operand (v0.t).
Tile register layout conventions:
- A stored row-major; B and C stored column-major in vector register
groups. K_effective = λ × W × LMUL; MUL_C = VLEN / (SEW × λ²).
- tile_reg_idx() maps sequential tile element indices to flat register
positions, correctly handling the λ×LMUL segment width for LMUL > 1.
- Load/store address formula uses linesize = λ × LMUL throughout,
making the order-preserving instructions correct for all LMUL values.
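As a rough illustration of the two geometry formulas quoted above, the following sketch evaluates K_effective = λ × W × LMUL and MUL_C = VLEN / (SEW × λ²) for one configuration. The function name `tile_geometry` and the example parameter values are assumptions made for this example; they do not come from the specification.

```python
from fractions import Fraction


def tile_geometry(vlen, sew, lmul, lam, w):
    """Evaluate the two formulas from the commit message (hypothetical helper).

    K_effective = lam * w * lmul
    MUL_C       = vlen / (sew * lam**2)
    """
    k_eff = lam * w * lmul
    # Fraction keeps MUL_C exact even when it is not an integer.
    mul_c = Fraction(vlen, sew * lam * lam)
    return k_eff, mul_c


# Example values chosen only for illustration.
k_eff, mul_c = tile_geometry(vlen=256, sew=32, lmul=1, lam=2, w=2)
print(k_eff, mul_c)
```

Using `Fraction` is a design choice for the sketch: it makes configurations where MUL_C is not a whole number visible instead of silently truncating them.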
Each instruction entry includes Synopsis, Mnemonic, Encoding (Wavedrom),
Description, Exceptions (where applicable), Operation (SAIL pseudocode),
and an "Included in" extension table. All entries appear in a single
alphabetically-ordered instruction index shared across all three
sub-extensions.
* Reformulate GEMM as C ← A × B^T + C with row-major panels
* Rename 'widening' to 'packing' (non-/double-/quad-packing)
* Rename accumulator register group multiplier from MUL to CMUL and define as (VLEN / SEW) / λ² (instead of LMUL / λ²)
* Rename 'Alternate Floating-Point Format' to 'Alternate Format' to reflect applicability to both integer and floating-point
* Normalize [U]Int notation in sub-extension table
* Add cross-reference anchor for sub-extension table
* Add placeholder 'Storage formats' section
Specify when intermediate rounding is permitted during the K_eff-deep accumulation of a matrix multiply-accumulate instruction. For widening instructions (W=2, W=4), define the sub-dot-product as the W products of (SEW/W)-bit elements within one SEW-wide slot and note that each individual product is exact at SEW precision.

For floating-point: the implementation partitions the λ×LMUL sub-dot-products into groups of G (power-of-two, 1 ≤ G ≤ λ), accumulates each group at ≥ 2×SEW internal precision, then rounds once and adds to C. G is implementation-defined, allowing both systolic (G=1) and outer-product (G ≈ λ) datapaths. Bit-exact reproducibility across implementations is explicitly not guaranteed.

For integer: modular (wrapping) arithmetic makes the result uniquely defined regardless of accumulation order.
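The effect of the group size G can be illustrated with a small model. This is a simplified sketch, not the specification's SAIL: it assumes SEW=32 (binary32 accumulator), uses Python's binary64 as the "at least 2×SEW" internal precision, and works on a flat list of already-exact products instead of sub-dot-products. The helper names `round_f32`, `fp_accumulate`, and `int_accumulate` are invented for the example.

```python
import struct


def round_f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fp_accumulate(c, products, g):
    """Add products to c in groups of g, rounding once per group."""
    for start in range(0, len(products), g):
        # binary64 stands in for the wide internal accumulator.
        group_sum = sum(products[start:start + g])
        c = round_f32(c + round_f32(group_sum))
    return c


def int_accumulate(c, products, sew):
    """Wrapping integer accumulation modulo 2**sew."""
    mask = (1 << sew) - 1
    for p in products:
        c = (c + p) & mask
    return c


products = [1e8, 1.0, -1e8, 0.0]
print(fp_accumulate(0.0, products, g=1))  # rounds after every product
print(fp_accumulate(0.0, products, g=4))  # rounds once for the whole group

ints = [200, 100, -50]
print(int_accumulate(0, ints, 8), int_accumulate(0, ints[::-1], 8))
```

With these inputs the two floating-point runs disagree because the 1.0 term is absorbed when rounding happens after every product, while the wrapping integer result is the same in either order.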
Replace undefined term "VLENE" with the correct expression "(VLEN / SEW)" in the accumulator register group multiplier formula.
Introduce N_tile_max = M_tile to the tile geometry formulas and add a new "C tile tail policy" section that specifies the behaviour of C elements beyond the active N_tile columns when VL is partial. The key property is that tail elements are never read or written by multiply-accumulate instructions, so tail-undisturbed (vta=0) is achieved by write-skip rather than read-merge. Implementations are therefore not required to read the tail portion of the C register group, which benefits outer-product engines and register-renaming machines.
…l policy with base V

Bring vmtl.v, vmts.v, vmttl.v, vmtts.v into alignment with the standard vector load/store element-status semantics (sec-inactive-defs):
* Add a four-category element-status summary (active / inactive / tail / prestart) to the shared Instructions section, with explicit vma and vta references and a cross-link to <<sec-inactive-defs>>.
* Fix "active element index i in [0, VL)" to "element index i in the body [vstart, VL) where the mask is enabled" throughout.
* Extend each instruction's Description to cover the vma=1 (inactive may overwrite with 1s) and vta=1 (tail may overwrite with 1s) cases; stores now explicitly state that inactive and tail elements are not written to memory and do not raise exceptions.
* Update load pseudocode comments to distinguish inactive (body, mask=0) from tail (unreachable by the loop) and name the governing vma/vta policies.
* Remove init_masked_source from vmts.v and vmtts.v pseudocode; replace with read_vmask + vm_val[i] to match the loads and regular vector stores.
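The four element categories named in this commit can be sketched as a small classifier. This is an illustrative model only: the function name `element_status` is hypothetical, and `mask_enabled` is assumed to be true for every element when the instruction is unmasked.

```python
def element_status(i, vstart, vl, mask_enabled):
    """Classify element index i as prestart, active, inactive, or tail."""
    if i < vstart:
        return "prestart"   # before vstart: neither read nor written
    if i >= vl:
        return "tail"       # at or past VL: governed by vta
    # Body elements, i in [vstart, VL): the mask decides.
    return "active" if mask_enabled else "inactive"  # inactive: governed by vma


# Example: vstart=1, VL=4, mask bits 1,1,0,1 for elements 0..3, six elements total.
mask = [1, 1, 0, 1, 1, 1]
print([element_status(i, 1, 4, bool(mask[i])) for i in range(6)])
```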
…operations

Add an "Element packing in input tiles" subsection under "Storage formats" that defines the ordering of narrow elements within each SEW-wide slot for widening multiply-accumulate instructions (W=2, W=4):
- For byte-sized and wider elements (EEW >= 8): standard RISC-V V element ordering applies.
- For sub-byte elements (EEW = 4): little-endian nibble packing with element k at bits [4k+3 : 4k].

Also remove a stale empty "Arithmetic considerations" section header left over from a previous merge.
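The little-endian nibble packing rule (element k at bits [4k+3 : 4k]) can be written out as a short sketch. The helper names `pack_nibbles` and `unpack_nibbles` are made up for this example and operate on plain Python integers, not on vector registers.

```python
def pack_nibbles(elems):
    """Pack 4-bit elements so that element k occupies bits [4k+3 : 4k]."""
    word = 0
    for k, e in enumerate(elems):
        word |= (e & 0xF) << (4 * k)
    return word


def unpack_nibbles(word, count):
    """Inverse of pack_nibbles: extract `count` 4-bit elements."""
    return [(word >> (4 * k)) & 0xF for k in range(count)]


packed = pack_nibbles([0x1, 0x2, 0x3, 0x4])
print(hex(packed))                 # element 0 ends up in the lowest nibble
print(unpack_nibbles(packed, 4))
```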
Add a C-Language Intrinsics subsection under "Software considerations"
specifying the naming convention, type system, and representative
prototypes for all IME instruction families (Zvvmtls, Zvvmm, Zvvfmm).
Naming convention:
- Type-suffix encodes the accumulator C register-group multiplier
(CMUL = VLENE / λ²), independent of LMUL.
- `_lm{N}` qualifier selects LMUL when CMUL ≠ LMUL; `_lm1` may be
omitted (LMUL=1 is the default).
- Non-ISO input types (BFloat16, OFP8, OFP4, Int4) always carry an
explicit input-type suffix; standard IEEE types do not.
- Mixed altfmt_A ≠ altfmt_B uses dual suffixes: _{inputA}_{inputB}.
- Canonical suffix order:
{type}[_{inputA}[_{inputB}]][_su|_us][_lm{N}][_L{N}]
- Overloaded short forms: GCC uses resolve_overloaded_builtin, Clang
uses __attribute__((overloadable)), C++ uses standard overloading.
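The canonical suffix order listed above can be sketched as a string builder. This is only an illustration of the ordering rule: the function name `intrinsic_suffix` and the type tokens in the example ("f32", "e4m3", "e5m2") are placeholders, not names taken from the specification, and the sketch simply omits `_lm1` instead of modelling the CMUL ≠ LMUL condition.

```python
def intrinsic_suffix(acc_type, input_a=None, input_b=None,
                     sign=None, lmul=1, lam=None):
    """Assemble {type}[_{inputA}[_{inputB}]][_su|_us][_lm{N}][_L{N}]."""
    if input_b is not None and input_a is None:
        raise ValueError("inputB suffix requires an inputA suffix")
    if sign not in (None, "su", "us"):
        raise ValueError("sign must be 'su', 'us', or None")

    parts = [acc_type]
    if input_a is not None:
        parts.append(input_a)
    if input_b is not None:
        parts.append(input_b)
    if sign is not None:
        parts.append(sign)
    if lmul != 1:               # _lm1 may be omitted (LMUL=1 is the default)
        parts.append(f"lm{lmul}")
    if lam is not None:         # compile-time lambda override
        parts.append(f"L{lam}")
    return "_".join(parts)


print(intrinsic_suffix("f32", "e4m3", "e5m2", lmul=2, lam=4))
```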
Tile load/store (Zvvmtls):
- Prototypes for vmtl, vmts, vmttl, vmtts across all SEW variants.
- Lambda override via compile-time `_L{N}` suffix (no runtime arg),
covering all four instructions.
Integer multiply-accumulate (Zvvmm):
- vmmacc.vv (W=1), vwmmacc.vv (W=2), vqwmmacc.vv (W=4) prototypes.
- Int4 vector types (vint4m{N}_t, vuint4m{N}_t) and prototypes for
Zvvmmi4b (Int4→Int8) and Zvvmmi4h (Int4→Int16).
- Mixed-sign _su/_us variants for independent altfmt_A/altfmt_B.
- Note: W=8 gap for Zvvmmi4w (Int4→Int32) and Zvvmmbd (Int8→Int64).
Floating-point multiply-accumulate (Zvvfmm):
- OFP8 types (E4M3, E5M2), OFP4 types (E2M1, E3M0) with non-ISO
scalar type names (_Float8E4M3, etc.).
- vfmmacc.vv prototypes including BF16→BF16 (Zvvfmmbf16).
- vfwmmacc.vv covering OFP4→OFP8, OFP8→FP16/BF16, FP16→FP32,
BF16→FP32, FP32→FP64.
- vfqwmmacc.vv covering OFP4→FP16/BF16, OFP8→FP32, BF16→FP64,
FP16→FP64.
- Consolidated altfmt=1 examples (BF16, E5M2, E3M0 inputs).
Portability:
- VLEN-portable code guidance: write for the largest CMUL the code
targets, select code paths at runtime (mirroring RVV practice).
Disallow fractional LMUL (LMUL < 1) for all Integrated Matrix
instructions. Only LMUL ∈ {1, 2, 4, 8} is supported; fractional
settings are reserved and shall raise an illegal-instruction exception.
Remove mf2/mf4/mf8 from the tile load/store intrinsic prototypes
to reflect this restriction.
…inputs section

Add an explicit anchor to the "Alternate formats for inputs" heading and update the cross-reference in the Zvvfmm section to use it, fixing the asciidoctor warning about an unknown anchor.
…accumulate (#9)

Reserve vm=0 on all six matrix multiply-accumulate instructions (vfmmacc, vfqwmmacc, vfwmmacc, vmmacc, vqwmmacc, vwmmacc):
- Set the vm bit to 1 (hardwired) in the encoding diagrams.
- Add vm=0 as an illegal-instruction condition in the Exceptions sections and SAIL pseudocode.
- Add a forward-looking note that a future extension may redefine vm=0 to source per-element scaling factors from v0 for microscaling floating-point formats.

Tile load/store instructions are unaffected and continue to support vector masking.
…P widening instructions

Allow the A and B input tiles to use different floating-point formats (altfmt_A ≠ altfmt_B) for widening multiply-accumulate instructions (vfwmmacc.vv, vfqwmmacc.vv). For non-widening vfmmacc.vv, altfmt_A must still equal altfmt_B.

The restriction is based on exact-product representability: mixed-format inputs are permitted only when p_A + p_B ≤ p_C (the product significand fits in the accumulator format without rounding). This condition holds for all widening combinations but fails for every non-widening mixed case.

Changes:
- Add a new "Mixed-format inputs" section defining the restriction, IEEE 754 multiplication semantics, significand-width analysis, and subextension gating rules.
- For sub-word inputs (OFP4, OFP8), all format combinations within a width class are covered by the existing subextension.
- For 16-bit mixed inputs (FP16 × BF16), both the IEEE binary16 and BFloat16 subextensions must be present (widening only).
- Update the subextensions table to clarify that OFP4 and OFP8 entries cover all format combinations (E2M1 or E3M0, E4M3 or E5M2).
- Add mixed-format intrinsic examples (E4M3 × E5M2, FP16 × BF16).
- Add IEEE 754 mixed-format note to fp_mul_to documentation.
- Update vfmmacc.vv description to require altfmt_A == altfmt_B; vfwmmacc.vv and vfqwmmacc.vv descriptions allow independent selection.

The SAIL pseudocode already decodes fmt_A and fmt_B independently and passes them separately to fp_mul_to, so no pseudocode changes are needed.
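The exact-product condition p_A + p_B ≤ p_C stated in this commit can be checked mechanically. The sketch below assumes the usual significand precisions (counting the implicit leading bit) for each format; the table `SIGNIFICAND_BITS` and the function `product_is_exact` are names invented for the example, and only the significand width is modelled, not the exponent range.

```python
# Significand precision p, including the implicit leading bit.
SIGNIFICAND_BITS = {
    "E2M1": 2,   # OFP4
    "E5M2": 3,   # OFP8
    "E4M3": 4,   # OFP8
    "BF16": 8,
    "FP16": 11,
    "FP32": 24,
    "FP64": 53,
}


def product_is_exact(fmt_a, fmt_b, fmt_c):
    """True when p_A + p_B <= p_C, i.e. the product needs no rounding in C."""
    p_a = SIGNIFICAND_BITS[fmt_a]
    p_b = SIGNIFICAND_BITS[fmt_b]
    return p_a + p_b <= SIGNIFICAND_BITS[fmt_c]


print(product_is_exact("FP16", "BF16", "FP32"))  # widening mixed case
print(product_is_exact("FP16", "BF16", "FP16"))  # non-widening mixed case
print(product_is_exact("E4M3", "E5M2", "FP16"))  # widening OFP8 mixed case
```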
…atrix Extensions

Update the title and all references throughout the spec to use the canonical name "Zvvm family of Integrated Matrix extensions" instead of the informal "Integrated Matrix family of extensions".
Add microscaling support for floating-point multiply-accumulate instructions using paired E8M0 block-scale factors supplied through v0 when vm=0.
- Define microscaling semantics: per-block power-of-two scales applied as exact exponent additions with no rounding error.
- Specify the paired scale layout in v0 (scale_A in lower byte, scale_B in upper byte of 16-bit elements).
- Support block sizes of 32 (standard OCP MX) and 16 elements, selected by the bs field in vtype.
- Add capacity proof and configuration table for VLEN=256.
- Add scale-packing code examples (base V, Zvbb, Zvzip) with instruction count comparison.
- Define Zvvfmmmx* (BS=32) and Zvvfmmnx* (BS=16) microscaling subextensions with implication chains to base subextensions.
- Integer multiply-accumulate instructions reserve vm=0.
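The paired scale layout (scale_A in the lower byte, scale_B in the upper byte of a 16-bit element) and the "exact exponent addition" property can be sketched as follows. The E8M0 decoding used here (bias 127, encoding 0xFF meaning NaN) is an assumption taken from the OCP Microscaling Formats v1.0 definition, and the helper names are invented for the example.

```python
import math

E8M0_BIAS = 127   # assumed from the OCP MX definition of E8M0
E8M0_NAN = 0xFF   # assumed NaN encoding


def split_scale_pair(pair16):
    """Return (scale_A, scale_B) raw bytes from one 16-bit v0 element."""
    return pair16 & 0xFF, (pair16 >> 8) & 0xFF


def combined_scale(pair16):
    """Combined scale 2**(eA - bias) * 2**(eB - bias), computed by adding exponents."""
    raw_a, raw_b = split_scale_pair(pair16)
    if raw_a == E8M0_NAN or raw_b == E8M0_NAN:
        return math.nan
    # Both scales are powers of two, so the product is an exact exponent sum.
    return math.ldexp(1.0, (raw_a - E8M0_BIAS) + (raw_b - E8M0_BIAS))


print(split_scale_pair(0x807F))
print(combined_scale(0x807F))
```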
E3M0 is a mantissa-free format (all values are exact powers of two), making it indistinguishable in utility from a wider integer exponent field. It adds implementation complexity without meaningful benefit over OFP4 (E2M1), which already covers the 4-bit FP use case with one mantissa bit.

Changes:
- Remove Zvvfmme3m0ofp8, Zvvfmme3m0h, Zvvfmme3m0bf16 subextensions
- Mark altfmt_A/B = 1 for 4-bit inputs as _reserved_ in encoding maps (vfwmmacc and vfqwmmacc)
- Mark altfmt = 1 / 4-bit as _reserved_ in the altfmt encoding table
- Remove E3M0 from significand widths table
- Remove E3M0 from sub-byte packing description
- Simplify sub-word inputs prose to OFP4 (E2M1) only
- Remove E3M0 C intrinsic prototypes and type references
…pport

Introduce two new mnemonics vfwimmacc.vv and vfqwimmacc.vv for integer inputs with FP accumulator under microscaling. These reuse the vm=0 encoding of the existing integer opcodes vwmmacc (funct6=0x39) and vqwmmacc (funct6=0x3a) in OPIVV, since FP8 exhausts the altfmt_A/B encoding space for FP inputs.
- vfwimmacc.vv (W=2): Int8/UInt8 → FP16 or BF16 accumulator
- vfqwimmacc.vv (W=4): Int4/UInt4 → FP16/BF16, or Int8/UInt8 → FP32

For both instructions vm=0 selects v0.scale (E8M0 tile-strided layout), altfmt_A selects signed(0)/unsigned(1) for A, altfmt_B for B, and the existing BS and LMUL constraints carry over. SEW=8 operands with EEW=4 (vfqwimmacc only) are reserved.

Add encoding map table, subextension table rows (Zvvfmmmxi8h, Zvvfmmmxi8bf16, Zvvfmmmxi8w, Zvvfmmmxi4h, Zvvfmmmxi4bf16 and their Zvvfmmnxi* BS=16 counterparts), full instruction sections with wavedrom encoding diagrams, description, exceptions, and SAIL pseudocode. A new int_to_fp helper converts the exact integer dot-product sum to FP before scaling and accumulation. C intrinsic prototypes added for all variants.
- Remove stale "bs=1 and block_size=32 is reserved" bullet from vfmmacc.vv that duplicated encoding-table information no longer present.
- Add missing "vd/vs1/vs2 overlaps v0" Illegal Instruction bullet to all three microscaling instructions (vfmmacc.vv, vfwmmacc.vv, vfqwmmacc.vv).
- Define normative behaviour when an E8M0 scale decodes as Inf: the scale pair is treated as NaN and the accumulator element is set to canonical NaN.
- Cite the encoding tables (not a separate proof) as the source for the R ≥ S constraint in the scale-layout section.

- Complete the memory layout / instruction selection table with the missing column-major A and row-major C rows.
- Rename CMUL → MUL_C throughout (52 occurrences) for consistency with the geometry section.
- Replace the undefined VLENE shorthand with its explicit expansion VLEN ÷ (SEW × λ²) / VLEN ÷ (SEW × 4) at all three sites.

- Add missing reserved-encoding Exceptions bullets and SAIL guards for vfwimmacc.vv (SEW=32 and SEW=64) and vfqwimmacc.vv (SEW=64).
- Fix VL alignment constraint for vwmmacc.vv and vqwmmacc.vv: the Description says "multiple of λ"; align the Exceptions bullet and SAIL check accordingly (was incorrectly "multiple of K_effective").
- Allow mixed altfmt_A ≠ altfmt_B for vfmmacc.vv: remove the false restriction, update the mixed-format-inputs section and vfmmacc.vv Description, and add the missing mixed-format encoding table rows.
- Mark vm=1 as a fixed bit in the wavedrom for vmmacc.vv, vwmmacc.vv, and vqwmmacc.vv (vm=0 decodes as a different instruction for the latter two; vm=0 is reserved for vmmacc.vv).
* Updates to the introduction
* Editorial notes on bit locations
* Revised floating-point rounding rules
* Special case for tile loads/stores when (rs2) = 0
* Inputs don't have their own SEW, just EEW
* Added arithmetic considerations to mixed-format inputs
* Made semantics of micro-scaling computations clearer
* Used byte addresses in the definitions of tile load/store
* Clarify valid values of VL
* Clarify that tile loads must use target SEW
* Clarify guidelines for portable IME code
math formulas.

Signed-off-by: Jose Moreira <jmoreira@us.ibm.com>
Co-authored-by: Jose Moreira <jmoreira@us.ibm.com>
Remove the spurious `images/` prefix from the four figure references; the build system already resolves images relative to `src/`, so the correct prefix is `png/` not `images/png/`.
…qwimmacc

MXINT is defined as signed-only; unsigned inputs via altfmt_A=1 or altfmt_B=1 are not part of the format and must be reserved.
- Encoding table: mark all UInt8/UInt4 input rows as _reserved_
- Subextension table: "Int8/UInt8" → "Int8", "Int4/UInt4" → "Int4"
- Description prose: remove "0=signed, 1=unsigned" wording; state that altfmt_A and altfmt_B must be 0
- Exceptions: add reserved bullets for altfmt_A=1 and altfmt_B=1
- SAIL: add Illegal_Instruction() guards before the compute loop; simplify inner loop to unconditional signed() reads
- Intro prose: clarify that altfmt_A/B=1 is reserved for these integer-input MX instructions
* Added figures for tile load/store. Added text on optimization of cache access through LMUL usage. Removed some confusing text on resulting tile dimensions.
* Transformed formulas to LaTeX-like math formulas.
* Reordered sections:
  - Common geometry description was moved from the integer Zvvmm section to the overview.
  - Full list of subextensions and list of encodings were moved to the end.
  Added some details and an overview of the instructions to the Zvvfmm section.
* Moved table of microscaling subextensions also to the end, below the Zvvm subextensions list. Removed some empty lines and intentionally added one empty line in front of the major chapters.
* Added comment on transposing with proper data type. Page breaks "<<<" need adjustment!
* Cosmetics.
* Moved Subextensions section back to right behind the Overview. Fixed rendering issues of some formulas. Removed 8x widening extensions.
* Extra tag to make image preview work
* Updated SAIL for tile loads/stores when (rs2) == 0

* Extra tag to make image preview work
* Updated SAIL for tile loads/stores when (rs2) == 0
* Separate order-preserving vs transposing tile loads/stores
Author
Philipp, I don't know how to request your review here. Can you still do it?
…t and partial-VL load/store

Add three figures illustrating:
- The block-scale pair layout in v0, including the R-strided distribution across hardware lanes in a multi-lane setup (ime-mx-v0-format, ime-mx-v0-lanes)
- The element distribution for full and reduced VL tile load/store operations (ime-load-store-vl)

Update the prose to reference the new figures and add a cross-reference anchor for the microscaling subextensions section.
…carve-out
Restore octal-widening (W=8) subextension naming and encoding space
reservation in the integer, base FP, and microscaling subextension tables:
Integer: Zvvmmbd (Int8 → Int64)
Base FP: Zvvfmmofp4f (OFP4 → FP32), Zvvfmmofp8d (OFP8 → FP64)
Microscaling: Zvvfmmmxfp4f/Zvvfmmnxfp4f (MXFP4 → FP32),
Zvvfmmmxfp8d/Zvvfmmnxfp8d (MXFP8 → FP64),
Zvvfmmmxi8d/Zvvfmmnxi8d (MXINT8 → FP64)
Update the W=8 NOTE to enumerate all affected subextensions and document
that encoding space has been reserved: funct6 = 0x3b in OPIVV (following
vmmacc/vwmmacc/vqwmmacc at 0x38–0x3a) and funct6 = 0x17 in OPFVV
(following vfmmacc/vfwmmacc/vfqwmmacc at 0x14–0x16).
src/integrated-matrix.adoc
[cols="1,1,4,2", options="header"]
|===
|Extension | Dependencies | Multiplicand Types | Accumulator Type
|Zvvi4i8mm ^| Zve64d | [U]Int4, [U]Int4 | Int8
Contributor
What is the rationale for having everything depend on Zve64d (even the extensions with the smallest multiplicand and accumulator types)?
==== Alternate floating-point format for the output (`altfmt`)

The Zvvm Integrated Matrix floating-point extensions use the _altfmt_ field (as defined by Zvfbfa) to select the floating-point format of the elements in the output accumulator matrix C, in conjunction with `vsew`.
Contributor
It would be good to mention in this section that the altfmt field is part of vtype.
| altfmt_A / altfmt_B | Interpretation
| 0 | Signed
| 1 | Unsigned
Contributor
I believe this is the opposite encoding compared to VME: https://github.com/aswaterman/riscv-misc/blob/main/isa/zvt/zvt.adoc
…bextension table (#24)

* unprivileged/integrated-matrix: Restore Zvvfmmofp8w (OFP8→FP32) to subextension table

  Zvvfmmofp8w was referenced in the FP encoding map (vfqwmmacc.vv, SEW=32, EEW=8) and as the implied base subextension of Zvvfmmmxfp8w, but was missing from the base FP subextension table. Restore the entry between Zvvfmmofp8bf16 and Zvvfmmofp8d.

* unprivileged/integrated-matrix: Rename subextensions per ARC and document naming scheme (#25)

  As handed down on a stone tablet by the guardians of the spirit and intent of the specification, rename all computational subextensions to the new scheme: Zvv + input-types + output-types (omitted if same as input) + mm.

  Type tokens: i4/i8/i16/i32/i64 for integer; ofp4/ofp8/fp16/bf16/fp32/fp64 for floating-point; x<type>mm for microscaling BS=32, xn<type>mm for BS=16.

  Examples: Zvvmmb → Zvvi8mm, Zvvfmmhf → Zvvfp16fp32mm, Zvvfmmmxfp8h → Zvvxofp8fp16mm, Zvvfmmnxi8f → Zvvxni8fp32mm.

  Add a "Naming conventions" subsection to the Subextensions of Zvvm section documenting the two-level naming scheme (family-level vs. individual subextensions) and all type tokens. Zvvmtls and Zvvmttls are unchanged.
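The renaming scheme described in this commit (Zvv + input types + output types, omitted if the same as the input, + mm, with an `x` or `xn` prefix for microscaling) can be sketched as a small name builder. The function `subextension_name` is hypothetical and handles a single input-type token only; the expected names in the usage lines are the examples quoted in the commit message.

```python
def subextension_name(input_type, output_type, block_size=None):
    """Build a subextension name from type tokens.

    block_size: None for non-microscaling, 32 for the 'x' prefix,
    16 for the 'xn' prefix.
    """
    prefix = {None: "", 32: "x", 16: "xn"}[block_size]
    # The output token is omitted when it matches the input token.
    out = "" if output_type == input_type else output_type
    return f"Zvv{prefix}{input_type}{out}mm"


print(subextension_name("i8", "i8"))                      # Zvvi8mm
print(subextension_name("fp16", "fp32"))                  # Zvvfp16fp32mm
print(subextension_name("ofp8", "fp16", block_size=32))   # Zvvxofp8fp16mm
print(subextension_name("i8", "fp32", block_size=16))     # Zvvxni8fp32mm
```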
* Extra tag to make image preview work
* Updated SAIL for tile loads/stores when (rs2) == 0
* Separate order-preserving vs transposing tile loads/stores
* Fixes based on feedback from initial IME TG internal review
b976f48 to 98710a9
Process all 28 items from the IME TG internal review feedback tracker.

Subextension dependencies (#3): Replace blanket Zve64d dependency with the minimum Zve subset per subextension: Zve32x for integer accumulators ≤ 32-bit, Zve64x for Int64 accumulators, Zve32f for FP accumulators ≤ 32-bit, and Zve64d only for FP64 accumulators.

8× widening instructions (#7, #8, #9, #24): Add v8wmmacc.vv (funct6=0x3b, OPIVV), vf8wmmacc.vv (funct6=0x17, OPFVV), and vf8wimmacc.vv (integer-input MX variant, vm=0 of v8wmmacc) with full instruction definitions, SAIL pseudocode, encoding diagrams, and exception tables. Update encoding maps (FP, integer, integer MX) with W=8 entries. Add Zvvxi4fp32mm and Zvvxni4fp32mm to the MX subextension table. Replace the informative NOTE about reserved W=8 encoding space with normative text. Remove the undefined term "octal-widening".

MXINT4 clarification and OCP citation (#14): Define MXINT4 as analogous to OCP MX's MXINT8 but with 4-bit signed elements. Add proper citation of the OCP Microscaling Formats (MX) v1.0 Specification with URL. Update microscaling applicability to include vf8wmmacc.vv.

vfmmacc.vv vm=0 cleanup (#13, #28): Remove contradictory "When vm=0" exception bullets (vm=0 is reserved for non-widening FP). Replace dead microscaling SAIL code with a straightforward non-widening FP GEMM loop. Add explicit note that microscaling is not supported for non-widening multiply-accumulate.

Terminology fixes (#15, #21): Add forward cross-reference at first use of altfmt_A/altfmt_B. Correct two occurrences where λ was described as "the K dimension" to "tile-layout parameter", clarifying that K_eff = λ × W × LMUL is the derived effective K dimension.
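The per-subextension dependency rule from item #3 can be summarised as a lookup on the accumulator type. This is a sketch of the rule as worded in the commit message; the function name `minimum_zve` and its arguments are assumptions made for the example.

```python
def minimum_zve(acc_is_fp, acc_bits):
    """Return the minimum Zve subset for a given accumulator type."""
    if acc_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported accumulator width")
    if acc_is_fp:
        # FP64 accumulators need Zve64d; narrower FP accumulators need Zve32f.
        return "Zve64d" if acc_bits == 64 else "Zve32f"
    # Int64 accumulators need Zve64x; narrower integer accumulators need Zve32x.
    return "Zve64x" if acc_bits == 64 else "Zve32x"


print(minimum_zve(acc_is_fp=False, acc_bits=8))    # e.g. an Int8 accumulator
print(minimum_zve(acc_is_fp=True, acc_bits=64))    # an FP64 accumulator
```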
…helpers
Extract common patterns from the 11 GEMM instruction SAIL blocks into
5 shared helper functions, eliminating ~300 lines of duplication:
decode_gemm_geometry(W, vl_divisor_is_lambda) — unified preamble
that decodes SEW, LMUL, lambda, computes K_eff/M/N/MUL_C, and
checks VL divisibility and MUL_C legality. Parameterised by the
widening factor W and the VL divisor convention (K_eff vs lambda).
read_block_scales(vm, i, j, s, R, EEW_C, fmt_C, rm) — reads and
unpacks paired E8M0 block scales from v0, returning the combined
scale and a NaN flag. When vm=1, returns (1.0, false).
fp_block_dot(i, j, k_lo, k_hi, g, fmt_A, fmt_B, fmt_C, rm, vs1, vs2)
— FP inner product over a K-dimension block with widening.
int_block_dot(i, j, k_lo, k_hi, g, signed_A, signed_B, vs1, vs2)
— exact signed/unsigned integer inner product over a block.
check_microscaling_legality(W, LMUL, EEW_C, lambda) — checks the
SEW×λ≥16 and BS=16 LMUL constraints.
Each instruction body now consists of instruction-specific checks
(vm=0 reserved, SEW restrictions, altfmt checks) followed by a call
to decode_gemm_geometry and a compact main loop using the helpers.
Tile load/store instructions (vmtl, vmts, vmttl, vmtts) are unchanged.
…vtype bits

Place altfmt_A at vtype[XLEN-5], altfmt_B at vtype[XLEN-6], and bs at vtype[XLEN-7], immediately below the lambda[2:0] field. These positions are outside the vsetvli immediate field and require vsetvl or vsetivli to configure.

Remove the two editorial notes that described the old positions (vtype[9], vtype[10], vtype[11]) as provisional and flagged the expected move. Replace with normative prose stating the final locations.

The SAIL pseudocode is unaffected as it uses symbolic field names (vtype[altfmt_A], vtype[bs], etc.) throughout.
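The bit positions named in this commit can be read out of a raw vtype value as sketched below. Only the three positions stated in the message are modelled; the function name `decode_matrix_vtype_bits` and the default XLEN of 64 are assumptions for the example, and the lambda field is deliberately left out because its exact bit range is not given here.

```python
def decode_matrix_vtype_bits(vtype, xlen=64):
    """Extract altfmt_A, altfmt_B, and bs from a raw vtype value."""
    return {
        "altfmt_A": (vtype >> (xlen - 5)) & 1,
        "altfmt_B": (vtype >> (xlen - 6)) & 1,
        "bs": (vtype >> (xlen - 7)) & 1,
    }


# Example: set only altfmt_A and bs on an XLEN=64 machine.
raw = (1 << (64 - 5)) | (1 << (64 - 7))
print(decode_matrix_vtype_bits(raw))
```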
* Update integrated-matrix.adoc

Signed-off-by: Jose Moreira <jmoreira@us.ibm.com>
…section

Replace auto-generated anchor _alternate_formats_for_inputs_altfmt_a_altfmt_b with the explicit anchor integrated-matrix-altfmt-inputs defined at the section heading.
The v0.scale mechanism now specifies only that paired block-scale factors are present in v0; the scale format (e.g. E8M0) is an inherent property of the input data type, not of the v0.scale encoding itself.

Restructure the microscaling section:
- Add a "Scale formats" subsection with an explicit E8M0 definition, leaving room for future scale formats (E4M3, UE5M3, etc.)
- Make the microscaling semantics format-agnostic: scale decoding references "the scale format associated with the input data type"
- Make the v0 layout description format-agnostic: byte positions and pairing are independent of the scale format
- Add cross-reference from the v0 layout to the Scale formats section

Update SAIL pseudocode:
- Rename e8m0_to_fp to decode_scale with a scale_format parameter
- Add scale_fmt parameter to read_block_scales; all call sites pass E8M0 explicitly
- Update helper comments to describe the generic interface

Update per-instruction descriptions:
- "paired E8M0 block-scale factors" → "paired block-scale factors; scale format determined by input data type"
- "with E8M0 Microscaling" → "with Microscaling" in synopses
- Remove "E8M0" from generic microscaling references in the overview

Subextension names and MX type definitions (MXFP8, MXINT8, etc.) retain "E8M0" as it is part of the format name, not a property of the v0.scale mechanism.
Extract the main loop bodies from all 11 GEMM instruction SAIL blocks into four top-level operation functions:

int_gemm — integer C += integer A × integer B (no scaling)
fp_gemm — FP C += FP A × FP B (no scaling)
fp_scaled_gemm — FP C += scale × (FP A × FP B), block scales from v0
int_scaled_gemm — FP C += scale × (int A × int B), block scales from v0

Each instruction's SAIL Operation block is now a thin wrapper that performs instruction-specific legality checks, calls decode_gemm_geometry, sets up formats/signedness, and dispatches to the appropriate operation function.

For FP widening instructions (vfwmmacc, vfqwmmacc, vf8wmmacc), the dispatch is conditional on vm: vm=1 calls fp_gemm, vm=0 calls fp_scaled_gemm. This directly mirrors the hardware decode.

The four operation functions use the existing building blocks (fp_block_dot, int_block_dot, read_block_scales) internally.
…mats

Introduce scale_width_of(scale_fmt) to derive the per-scale bit width (sw) and scale-pair width (pw = 2 × sw) from the scale format, rather than hardcoding 16-bit pairs throughout. For E8M0: sw=8, pw=16 — all existing behavior is unchanged.

SAIL changes:
- Add scale_width_of() helper (returns 8 for E8M0, extensible)
- read_block_scales: read pw-bit elements from v0, extract sw-bit halves via pair[sw-1..0] and pair[2*sw-1..sw]
- fp_scaled_gemm / int_scaled_gemm: derive R = λ × SEW / pw
- check_microscaling_legality: check SEW × λ ≥ pw (takes scale_fmt)
- All call sites pass E8M0 explicitly

Prose changes:
- Scale layout description uses pw/sw instead of hardcoded 16/8
- Capacity proof generalized: M × R = VLEN / pw
- R ≥ S proof annotated with "(for pw ≤ 16)"
- Legality constraint in Applicability section generalized
- Figure caption uses pw for stride formula

Per-instruction exception lists retain "SEW × λ < 16" as the concrete constraint for the currently defined E8M0 scale format.
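The relationships between sw, pw, and R described in this commit can be sketched in a few lines. This is an illustrative Python model, not the SAIL from the specification: `scale_width_of` mirrors the helper named in the commit, while `scale_pair_width` and `scale_stride` are hypothetical names, and only E8M0 is listed because it is the only scale format the message defines.

```python
def scale_width_of(scale_fmt):
    """Per-scale bit width sw for a scale format (only E8M0 is modelled)."""
    widths = {"E8M0": 8}
    return widths[scale_fmt]


def scale_pair_width(scale_fmt):
    """Scale-pair width pw = 2 * sw."""
    return 2 * scale_width_of(scale_fmt)


def scale_stride(lam, sew, scale_fmt="E8M0"):
    """R = lam * SEW / pw, after the legality check SEW * lam >= pw."""
    pw = scale_pair_width(scale_fmt)
    if sew * lam < pw:
        raise ValueError("reserved: SEW * lambda < pw")
    return (lam * sew) // pw


print(scale_pair_width("E8M0"))   # sw=8, so pw=16
print(scale_stride(lam=4, sew=32))
```

For E8M0 the legality check reduces to the concrete "SEW × λ < 16" condition that the per-instruction exception lists retain.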
1cd9728 to 7f0b4dd
Update the main branch.