NVVM Dialect
Refer to the official documentation for more details.
Reactant.MLIR.Dialects.nvvm.barrier_arrive Function
barrier_arrive
Threads that execute this op announce their arrival at the barrier with the given id and continue their execution. The default barrier id is 0, matching the `nvvm.barrier` Op. When `barrierId` is not present, the default barrier id is used.
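A minimal sketch, assuming the keyword-style assembly suggested by the operand names (not a confirmed format):

```mlir
// Announce arrival at barrier %bar without blocking; %n threads
// are expected to participate in the barrier.
nvvm.barrier.arrive id = %bar number_of_threads = %n
```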
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.breakpoint Method
breakpoint
Breakpoint suspends execution of the program for debugging. For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cluster_arrive Method
cluster_arrive
The `cluster.arrive` can be used by the threads within the cluster for synchronization and communication. The `cluster.arrive` instruction marks the warps' arrival at the barrier without causing the executing thread to wait for other participating threads.

The `aligned` attribute, when provided, generates the .aligned version of the PTX instruction.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cluster_arrive_relaxed Method
cluster_arrive_relaxed
The `cluster.arrive` can be used by the threads within the cluster for synchronization and communication. The `cluster.arrive` instruction marks the warps' arrival at the barrier without causing the executing thread to wait for other participating threads.

The `aligned` attribute, when provided, generates the .aligned version of the PTX instruction. The .relaxed qualifier on `cluster.arrive` specifies that there are no memory ordering and visibility guarantees provided for the memory accesses performed prior to `cluster.arrive`.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cluster_wait Method
cluster_wait
The `cluster.wait` causes the executing thread to wait for all non-exited threads of the cluster to perform `cluster.arrive`. The `aligned` attribute, when provided, generates the .aligned version of the PTX instruction.
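As a hedged sketch (the attribute spelling is assumed), a full cluster-wide synchronization point combines the two ops:

```mlir
// Every thread first announces arrival, then blocks until all
// non-exited threads in the cluster have arrived.
nvvm.cluster.arrive {aligned}
nvvm.cluster.wait {aligned}
```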
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_commit_group Method
cp_async_bulk_commit_group
This Op commits all prior initiated but uncommitted cp.async.bulk instructions into a cp.async.bulk-group.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_global_shared_cta Function
cp_async_bulk_global_shared_cta
Initiates an asynchronous copy operation from Shared CTA memory to global memory.
The `l2CacheHint` operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_shared_cluster_global Function
cp_async_bulk_shared_cluster_global
Initiates an asynchronous copy operation from global memory to cluster's shared memory.
The `multicastMask` operand is optional. When it is present, the Op copies data from global memory to shared memory of multiple CTAs in the cluster. Operand `multicastMask` specifies the destination CTAs in the cluster such that each bit position in the 16-bit `multicastMask` operand corresponds to the `nvvm.read.ptx.sreg.ctaid` of the destination CTA.

The `l2CacheHint` operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.
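For illustration, a hypothetical mask value (the constant below is an assumption, not taken from the upstream docs): setting bits 0 and 1 of the 16-bit mask targets the two CTAs whose ctaid is 0 and 1.

```mlir
// Hypothetical: multicast the copy to CTA 0 and CTA 1 of the cluster.
%mask = llvm.mlir.constant(3 : i16) : i16  // 0b0000000000000011
```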
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_shared_cluster_shared_cta Method
cp_async_bulk_shared_cluster_shared_cta
Initiates an asynchronous copy operation from Shared CTA memory to Shared cluster memory.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_tensor_prefetch Function
cp_async_bulk_tensor_prefetch
Initiates an asynchronous prefetch operation on the tensor data from global memory to L2 cache.
The Op has two modes:

- Tiled Mode: It's the default mode. The source multi-dimensional tensor layout is preserved at the destination.
- Im2col Mode: This mode is used when the `im2colOffsets` operands are present. The elements in the Bounding Box of the source tensor are rearranged into columns at the destination. In this mode, the tensor has to be at least 3-dimensional.
The `l2CacheHint` operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_tensor_reduce Function
cp_async_bulk_tensor_reduce
Initiates an asynchronous reduction operation of tensor data in global memory with tensor data in shared memory.
The `mode` attribute indicates whether the copy mode is tile or im2col. The `redOp` attribute specifies the reduction operation applied; the supported reduction operations are: add, min, max, inc, dec, and, or, xor.

The `l2CacheHint` operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_tensor_shared_cluster_global Function
cp_async_bulk_tensor_shared_cluster_global
Initiates an asynchronous copy operation on the tensor data from global memory to shared memory.
The Op has two load modes:

- Tiled Mode: It's the default mode. The source multi-dimensional tensor layout is preserved at the destination.
- Im2col Mode: This mode is used when the `im2colOffsets` operands are present. The elements in the Bounding Box of the source tensor are rearranged into columns at the destination. In this mode, the tensor has to be at least 3-dimensional.
The `multicastMask` operand is optional. When it is present, the Op copies data from global memory to shared memory of multiple CTAs in the cluster. Operand `multicastMask` specifies the destination CTAs in the cluster such that each bit position in the 16-bit `multicastMask` operand corresponds to the `nvvm.read.ptx.sreg.ctaid` of the destination CTA.

The `l2CacheHint` operand is optional, and it is used to specify cache eviction policy that may be used during the memory access.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_bulk_wait_group Method
cp_async_bulk_wait_group
This Op waits for the completion of the most recent bulk async-groups. The `$group` operand indicates that waiting is done until at most `$group` of the most recent bulk async-groups are still pending. The `$read` attribute indicates that the waiting has to be done until all the bulk async operations in the specified bulk async-group have completed reading from their source locations.
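A hedged sketch (the bare-integer attribute syntax is an assumption based on similar wait-group ops):

```mlir
// Block until at most one bulk async-group is still pending.
nvvm.cp.async.bulk.wait_group 1
// With `read`, wait only until pending groups finish reading their sources.
nvvm.cp.async.bulk.wait_group 1 {read}
```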
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_mbarrier_arrive Method
cp_async_mbarrier_arrive
The `cp.async.mbarrier.arrive` Op makes the mbarrier object track all prior cp.async operations initiated by the executing thread. The `addr` operand specifies the address of the mbarrier object in generic address space. The `noinc` attr impacts how the mbarrier's state is updated.
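A minimal sketch, assuming the Op takes a single generic-space pointer operand:

```mlir
// Make the mbarrier at %addr track all prior cp.async ops of this thread.
nvvm.cp.async.mbarrier.arrive %addr : !llvm.ptr
```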
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cp_async_mbarrier_arrive_shared Method
cp_async_mbarrier_arrive_shared
The `cp.async.mbarrier.arrive.shared` Op makes the mbarrier object track all prior cp.async operations initiated by the executing thread. The `addr` operand specifies the address of the mbarrier object in shared memory. The `noinc` attr impacts how the mbarrier's state is updated.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.cvt_float_to_tf32 Method
cvt_float_to_tf32
This Op converts the given f32 input to tf32. The result `res` is represented as an i32 type. The `relu` attribute, when set, lowers to the '.relu' variant of the cvt instruction. The `rnd` and `sat` attributes specify the rounding and saturation modes respectively.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.elect_sync Method
elect_sync
The `elect.sync` instruction elects one predicated active leader thread from among a set of threads specified in membermask. The membermask is set to `0xFFFFFFFF` for the current version of this Op. The predicate result is set to `True` for the leader thread, and `False` for all other threads.
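For example (the result type follows the description above; the assembly mirrors the upstream NVVM dialect docs):

```mlir
// Exactly one active thread in the warp receives %pred == true;
// every other thread receives false.
%pred = nvvm.elect.sync -> i1
```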
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.exit Method
exit
Ends execution of a thread. For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.fence_mbarrier_init Method
fence_mbarrier_init
Fence operation that applies on the prior `nvvm.mbarrier.init`.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.fence_proxy Method
fence_proxy
Fence operation with proxy to establish an ordering between memory accesses that may happen through different proxies.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.fence_proxy_acquire Method
fence_proxy_acquire
`fence.proxy.acquire` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy.

The address operand `addr` and the operand `size` together specify the memory range `[addr, addr+size)` on which the ordering guarantees on the memory accesses across the proxies are to be provided. The only supported value for the `size` operand is 128, and it must be an immediate. Generic Addressing is used unconditionally, and the address specified by the operand `addr` must fall within the `.global` state space. Otherwise, the behavior is undefined.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.fence_proxy_release Method
fence_proxy_release
`fence.proxy.release` is a uni-directional fence used to establish ordering between a prior memory access performed via the generic proxy and a subsequent memory access performed via the tensormap proxy. The `fence.proxy.release` operation can form a release sequence that synchronizes with an acquire sequence that contains the `fence.proxy.acquire` proxy fence operation.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.griddepcontrol_launch_dependents Method
griddepcontrol_launch_dependents
Signals that specific dependents the runtime system designated to react to this instruction can be scheduled as soon as all other CTAs in the grid issue the same instruction or have completed.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.griddepcontrol_wait Method
griddepcontrol_wait
Causes the executing thread to wait until all prerequisite grids in flight have completed and all the memory operations from the prerequisite grids are performed and made visible to the current grid.
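As a hedged sketch of how the two griddepcontrol ops pair up in programmatic dependent launch (the op spellings are inferred from the binding names):

```mlir
// In the upstream (primary) kernel: let dependents be scheduled once
// every CTA of this grid has issued the instruction or exited.
nvvm.griddepcontrol.launch.dependents

// In the dependent kernel: block until prerequisite grids complete
// and their memory writes are visible.
nvvm.griddepcontrol.wait
```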
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.mma_sync Method
mma_sync
The `nvvm.mma.sync` operation collectively performs the operation `D = matmul(A, B) + C` using all threads in a warp.

All the threads in the warp must execute the same `mma.sync` operation.
For each possible multiplicand PTX data type, there are one or more possible instruction shapes given as "mMnNkK". The table below describes the possibilities as well as the types required for the operands. Note that the data type for C (the accumulator) and D (the result) can vary independently when there are multiple possibilities in the "C/D Type" column.
When an optional attribute cannot be immediately inferred from the types of the operands and the result during parsing or validation, an error will be raised.
`b1Op` is only relevant when the binary (b1) type is given to `multiplicandDataType`. It specifies how the multiply-and-accumulate is performed and is either `xor_popc` or `and_popc`. The default is `xor_popc`.
`intOverflowBehavior` is only relevant when the `multiplicandType` attribute is one of `u8, s8, u4, s4`; this attribute describes how overflow is handled in the accumulator. When the attribute is `satfinite`, the accumulator values are clamped in the int32 range on overflow. This is the default behavior. Alternatively, accumulator behavior `wrapped` can also be specified, in which case overflow wraps from one end of the range to the other.
`layoutA` and `layoutB` are required and should generally be set to `#nvvm.mma_layout<row>` and `#nvvm.mma_layout<col>` respectively, but other combinations are possible for certain layouts according to the table below.
| A/B Type | Shape      | ALayout | BLayout | A Type   | B Type   | C/D Type           |
|----------|------------|---------|---------|----------|----------|--------------------|
| f64      | .m8n8k4    | row     | col     | 1x f64   | 1x f64   | 2x f64             |
| f16      | .m8n8k4    | row/col | row/col | 2x f16x2 | 2x f16x2 | 4x f16x2 or 8x f32 |
|          | .m16n8k8   | row     | col     | 2x f16x2 | 1x f16x2 | 2x f16x2 or 4x f32 |
|          | .m16n8k16  | row     | col     | 4x f16x2 | 2x f16x2 | 2x f16x2 or 4x f32 |
| bf16     | .m16n8k8   | row     | col     | 2x i32   | 1x i32   | 4x f32             |
|          | .m16n8k16  | row     | col     | 4x i32   | 2x i32   | 4x f32             |
| tf32     | .m16n8k4   | row     | col     | 2x i32   | 1x i32   | 4x f32             |
|          | .m16n8k8   | row     | col     | 4x i32   | 2x i32   | 2x f16x2 or 4x f32 |
| u8/s8    | .m8n8k16   | row     | col     | 1x i32   | 1x i32   | 2x i32             |
|          | .m16n8k16  | row     | col     | 2x i32   | 1x i32   | 4x i32             |
|          | .m16n8k32  | row     | col     | 4x i32   | 2x i32   | 4x i32             |
| u4/s4    | .m8n8k32   | row     | col     | 1x i32   | 1x i32   | 2x i32             |
|          | .m16n8k32  | row     | col     | 2x i32   | 1x i32   | 4x i32             |
|          | .m16n8k64  | row     | col     | 4x i32   | 2x i32   | 4x i32             |
| b1       | .m8n8k128  | row     | col     | 1x i32   | 1x i32   | 2x i32             |
|          | .m16n8k128 | row     | col     | 2x i32   | 1x i32   | 4x i32             |
Example

```mlir
%128 = nvvm.mma.sync A[%120, %121, %122, %123]
                     B[%124, %125]
                     C[%126, %127]
                     {layoutA = #nvvm.mma_layout<row>,
                      layoutB = #nvvm.mma_layout<col>,
                      shape = {k = 16 : i32, m = 16 : i32, n = 8 : i32}}
    : (vector<2xf16>, vector<2xf16>, vector<2xf16>)
    -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>
```
Reactant.MLIR.Dialects.nvvm.redux_sync Method
redux_sync
`redux.sync` performs a reduction operation `kind` of the 32-bit source register across all non-exited threads in the membermask.

The `abs` and `nan` attributes can be used in the case of an f32 input type, where the `abs` attribute causes the absolute value of the input to be used in the reduction operation, and the `nan` attribute causes the reduction operation to return NaN if any of the inputs to participating threads are NaN.
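A hedged sketch (the `kind`-first assembly and the type signature are assumptions):

```mlir
// Sum %val across all non-exited threads named in %membermask.
%sum = nvvm.redux.sync add %val, %membermask : i32 -> i32
```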
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.shfl_sync Method
shfl_sync
The `shfl.sync` Op implements data shuffle within threads of a warp. The `thread_mask` denotes the threads participating in the Op where the bit position corresponds to a particular thread's laneid. The `offset` specifies a source lane or source lane offset (depending on `kind`). The `val` is the input value to be copied from the source. The `mask_and_clamp` contains two packed values specifying a mask for logically splitting warps into sub-segments and an upper bound for clamping the source lane index.
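A sketch of a butterfly shuffle over the whole warp, mirroring the example in the upstream NVVM dialect docs:

```mlir
// %mask selects all 32 lanes (0xffffffff); %offset is the XOR distance;
// %clamp packs the segment mask and the clamp bound.
%res = nvvm.shfl.sync bfly %mask, %val, %offset, %clamp : i32 -> i32
```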
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.stmatrix Method
stmatrix
Collectively store one or more matrices across all threads in a warp to the location indicated by the address operand `ptr` in shared memory.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.tcgen05_alloc Method
tcgen05_alloc
The `tcgen05.alloc` Op allocates tensor core memory for the amount specified by `nCols` and writes the destination address to the `addr` argument. The `nCols` operand specifies the number of columns to be allocated and it must be a power-of-two. For more information, see PTX ISA
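A hypothetical sketch (the operand order and types here are assumptions, not the confirmed assembly format):

```mlir
// Allocate %ncols (a power of two) columns of Tensor Memory; the
// resulting Tensor Memory address is written to shared memory at %addr.
nvvm.tcgen05.alloc %addr, %ncols : !llvm.ptr<3>, i32
```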
Reactant.MLIR.Dialects.nvvm.tcgen05_commit Function
tcgen05_commit
The `tcgen05.commit` makes the mbarrier object, specified by the operand `addr`, track the completion of all the prior async-tcgen05 operations initiated by the executing thread. The multicast variants allow signaling on the mbarrier objects of multiple CTAs within the cluster. Operand `multicastMask`, when present, specifies the destination CTAs in the cluster such that each bit position in the 16-bit `multicastMask` operand corresponds to the `nvvm.read.ptx.sreg.ctaid` of the destination CTA. For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.tcgen05_cp Method
tcgen05_cp
Instruction tcgen05.cp initiates an asynchronous copy operation from shared memory to the location specified by the address operand `taddr` in the Tensor Memory. The 64-bit register operand `smem_desc` specifies the matrix descriptor representing the source matrix in the shared memory that needs to be copied.
Example

```mlir
nvvm.tcgen05.cp %taddr, %smem_desc {
  group = #nvvm.tcgen05_group<cta_2>,
  shape = #nvvm.tcgen05_cp_shape<shape_64x128b>,
  multicast = #nvvm.tcgen05_cp_multicast<warpx2_01_23>,
  srcFormat = #nvvm.tcgen05_cp_src_fmt<b6x16_p32>
}
```
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.tcgen05_dealloc Method
tcgen05_dealloc
The `tcgen05.dealloc` Op de-allocates the tensor core memory specified by `tmemAddr`, which must be from a previous tensor memory allocation. The `nCols` operand specifies the number of columns to be de-allocated, and it must be a power-of-two. For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.tcgen05_fence Method
tcgen05_fence
The `tcgen05.fence<before>` orders all prior async tcgen05 operations with respect to the subsequent tcgen05 and execution ordering operations. The `tcgen05.fence<after>` orders all subsequent async tcgen05 operations with respect to the prior tcgen05 and execution ordering operations.
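A hedged sketch mirroring the `<before>`/`<after>` notation used above (the exact assembly syntax is an assumption):

```mlir
// Order all prior async tcgen05 operations before what follows.
nvvm.tcgen05.fence <before>
// ... dependent tcgen05 or execution-ordering operations ...
nvvm.tcgen05.fence <after>
```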
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.tcgen05_relinquish_alloc_permit Method
tcgen05_relinquish_alloc_permit
The `tcgen05.relinquish_alloc_permit` Op specifies that the CTA of the executing thread is relinquishing the right to allocate Tensor Memory. So, it is illegal for a CTA to perform `tcgen05.alloc` after any of its constituent threads execute `tcgen05.relinquish_alloc_permit`. For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.tcgen05_shift Method
tcgen05_shift
The `tcgen05.shift` is an asynchronous instruction which initiates the shifting of 32-byte elements downwards across all the rows, except the last, by one row. The operand `taddr` specifies the base address of the matrix in Tensor Memory whose rows must be down shifted.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.tcgen05_wait Method
tcgen05_wait
The `tcgen05.wait<load>` causes the executing thread to block until all prior `tcgen05.ld` operations issued by the executing thread have completed. Similarly, the `tcgen05.wait<store>` causes the executing thread to block until all prior `tcgen05.st` operations issued by the executing thread have completed. For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.wgmma_commit_group_sync_aligned Method
wgmma_commit_group_sync_aligned
Commits all prior uncommitted warpgroup level matrix multiplication operations.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.wgmma_fence_aligned Method
wgmma_fence_aligned
Enforce an ordering of register accesses between warpgroup level matrix multiplication and other operations.
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.wgmma_mma_async Method
wgmma_mma_async
The warpgroup (128 threads) level matrix multiply and accumulate operation has either of the following forms, where matrix D is called the accumulator:

- D = A * B + D
- D = A * B, where the input from the accumulator D is disabled.
Supported shapes:
|--------------|--------------|------------|--------------|---------------|
|              |              |            |              |f16+=e4m3*e4m3 |
|              |              |            |              |f16+=e5m2*e5m2 |
|f32+=tf32*tf32|f16+=f16 *f16 | s32+=s8*s8 |s32 += b1 * b1|f16+=e5m2*e4m3 |
|              |f32+=f16 *f16 | s32+=u8*u8 |              |f16+=e4m3*e5m2 |
|              |f32+=bf16*bf16| s32+=s8*u8 |              |f32+=e4m3*e4m3 |
|              |              | s32+=u8*s8 |              |f32+=e5m2*e5m2 |
|              |              |            |              |f32+=e5m2*e4m3 |
|              |              |            |              |f32+=e4m3*e5m2 |
|--------------|--------------|------------|--------------|---------------|
| .m64n8k8 | .m64n8k16 | .m64n8k32 | .m64n8k256 | .m64n8k32 |
| .m64n16k8 | .m64n16k16 | .m64n16k32 | .m64n16k256 | .m64n16k32 |
| .m64n24k8 | .m64n24k16 | .m64n24k32 | .m64n24k256 | .m64n24k32 |
| .m64n32k8 | .m64n32k16 | .m64n32k32 | .m64n32k256 | .m64n32k32 |
| .m64n40k8 | .m64n40k16 | .m64n48k32 | .m64n48k256 | .m64n40k32 |
| .m64n48k8 | .m64n48k16 | .m64n64k32 | .m64n64k256 | .m64n48k32 |
| .m64n56k8 | .m64n56k16 | .m64n80k32 | .m64n80k256 | .m64n56k32 |
| .m64n64k8 | .m64n64k16 | .m64n96k32 | .m64n96k256 | .m64n64k32 |
| .m64n72k8 | .m64n72k16 | .m64n112k32| .m64n112k256 | .m64n72k32 |
| .m64n80k8 | .m64n80k16 | .m64n128k32| .m64n128k256 | .m64n80k32 |
| .m64n88k8 | .m64n88k16 | .m64n144k32| .m64n144k256 | .m64n88k32 |
| .m64n96k8 | .m64n96k16 | .m64n160k32| .m64n160k256 | .m64n96k32 |
| .m64n104k8 | .m64n104k16 | .m64n176k32| .m64n176k256 | .m64n104k32 |
| .m64n112k8 | .m64n112k16 | .m64n192k32| .m64n192k256 | .m64n112k32 |
| .m64n120k8 | .m64n120k16 | .m64n208k32| .m64n208k256 | .m64n120k32 |
| .m64n128k8 | .m64n128k16 | .m64n224k32| .m64n224k256 | .m64n128k32 |
| .m64n136k8 | .m64n136k16 | .m64n240k32| .m64n240k256 | .m64n136k32 |
| .m64n144k8 | .m64n144k16 | .m64n256k32| .m64n256k256 | .m64n144k32 |
| .m64n152k8 | .m64n152k16 | | | .m64n152k32 |
| .m64n160k8 | .m64n160k16 | | | .m64n160k32 |
| .m64n168k8 | .m64n168k16 | | | .m64n168k32 |
| .m64n176k8 | .m64n176k16 | | | .m64n176k32 |
| .m64n184k8 | .m64n184k16 | | | .m64n184k32 |
| .m64n192k8 | .m64n192k16 | | | .m64n192k32 |
| .m64n200k8 | .m64n200k16 | | | .m64n200k32 |
| .m64n208k8 | .m64n208k16 | | | .m64n208k32 |
| .m64n216k8 | .m64n216k16 | | | .m64n216k32 |
| .m64n224k8 | .m64n224k16 | | | .m64n224k32 |
| .m64n232k8 | .m64n232k16 | | | .m64n232k32 |
| .m64n240k8 | .m64n240k16 | | | .m64n240k32 |
| .m64n248k8 | .m64n248k16 | | | .m64n248k32 |
| .m64n256k8 | .m64n256k16 | | | .m64n256k32 |
|--------------|--------------|------------|--------------|---------------|
For more information, see PTX ISA
Reactant.MLIR.Dialects.nvvm.wgmma_wait_group_sync_aligned Method
wgmma_wait_group_sync_aligned
Signal the completion of a preceding warpgroup operation.
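Putting the four wgmma ops together, a hedged sketch of the canonical issue sequence (op spellings are inferred from the binding names; the full `wgmma.mma_async` operand list is elided):

```mlir
nvvm.wgmma.fence.aligned              // order register accesses first
// %d = nvvm.wgmma.mma_async ...      // issue the async matmul(s)
nvvm.wgmma.commit.group.sync.aligned  // close the current wgmma-group
nvvm.wgmma.wait.group.sync.aligned 0  // block until all groups complete
```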
For more information, see PTX ISA