Triton Dialect
Refer to the official documentation for more details.
Reactant.MLIR.Dialects.tt.assert Method
assert
tt.assert takes a condition tensor and a message string. If the condition is false, the message is printed and the program is aborted.
Reactant.MLIR.Dialects.tt.atomic_cas Method
atomic_cas
Compares cmp with the data old at location ptr.
If old == cmp, stores val to ptr;
otherwise stores old back to ptr.
Returns old.
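The compare-and-swap rule above can be sketched as a small reference model. This is only an illustration under stated assumptions: memory is modelled as a Python list, ptr as an index into it, and the function runs sequentially, so it does not model the atomicity or memory ordering of the real op. The name atomic_cas_model is hypothetical.

```python
def atomic_cas_model(memory, ptr, cmp, val):
    """Sequential model of compare-and-swap on a list-backed memory (hypothetical helper)."""
    old = memory[ptr]          # load the current value at ptr
    if old == cmp:
        memory[ptr] = val      # match: store the new value
    # no match: memory[ptr] keeps old, which is equivalent to storing old back
    return old                 # the previous value is returned in both cases


mem = [5, 7]
first = atomic_cas_model(mem, 0, cmp=5, val=9)   # comparison succeeds, mem[0] becomes 9
second = atomic_cas_model(mem, 1, cmp=0, val=9)  # comparison fails, mem[1] is unchanged
```

Because the old value is returned either way, a caller can tell whether the swap happened by comparing the result with cmp.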
Reactant.MLIR.Dialects.tt.atomic_rmw Function
atomic_rmw
Loads the data at ptr, applies rmw_op with val, and stores the result to ptr.
Returns the old value at ptr.
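A minimal sketch of the read-modify-write rule, assuming memory is a Python list, ptr is an index, and rmw_op is any two-argument callable. The helper name atomic_rmw_model is hypothetical, and the model is sequential, so it says nothing about atomicity on real hardware.

```python
import operator


def atomic_rmw_model(memory, ptr, rmw_op, val):
    """Sequential model of read-modify-write (hypothetical helper)."""
    old = memory[ptr]                  # load data at ptr
    memory[ptr] = rmw_op(old, val)     # combine with val and store the result
    return old                         # return the value that was there before


mem = [10]
prev_add = atomic_rmw_model(mem, 0, operator.add, 5)  # mem[0]: 10 -> 15
prev_max = atomic_rmw_model(mem, 0, max, 3)           # mem[0]: max(15, 3) stays 15
```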
Reactant.MLIR.Dialects.tt.broadcast Method
broadcast
For a given tensor, broadcast changes one or more dimensions with size 1 to a new size, e.g. tensor<1x32x1xf32> -> tensor<2x32x4xf32>. You cannot change the size of a non-1 dimension.
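The shape rule can be checked with a short helper. This is a sketch, not the dialect's verifier: the function name broadcast_shape is hypothetical, and it assumes that the source and destination shapes have the same rank.

```python
def broadcast_shape(src_shape, dst_shape):
    """Return dst_shape if src_shape can be broadcast to it, else raise ValueError."""
    if len(src_shape) != len(dst_shape):
        # Assumption of this sketch: broadcast does not add or remove dimensions.
        raise ValueError("source and destination must have the same rank")
    for src_dim, dst_dim in zip(src_shape, dst_shape):
        # Only size-1 dimensions may change; every other dimension must match.
        if src_dim != dst_dim and src_dim != 1:
            raise ValueError(f"cannot change non-1 dimension {src_dim} to {dst_dim}")
    return tuple(dst_shape)


ok = broadcast_shape((1, 32, 1), (2, 32, 4))  # the example from the text
```

Calling broadcast_shape((2, 32, 1), (4, 32, 1)) raises ValueError, because the leading dimension has size 2 rather than 1.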
Reactant.MLIR.Dialects.tt.call Method
call
The tt.call operation represents a direct call to a function that is within the same symbol scope as the call. The operand and result types of the call must match the specified function type. The callee is encoded as a symbol reference attribute named "callee".
Example
%2 = tt.call @my_add(%0, %1) : (f32, f32) -> f32
Reactant.MLIR.Dialects.tt.clampf Method
clampf
Clamp operation for floating point types.
The operation takes three arguments: x, min, and max. It returns a tensor of the same shape as x with its values clamped to the range [min, max].
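As a rough illustration of the clamping rule, the sketch below treats x as a flat Python list and min and max as scalars. Both simplifications are assumptions of this example, the name clampf_model is hypothetical, and NaN handling is not modelled.

```python
def clampf_model(x, lo, hi):
    """Clamp every element of x to the closed range [lo, hi] (hypothetical helper)."""
    # max(v, lo) raises values below the range; min(..., hi) lowers values above it.
    return [min(max(v, lo), hi) for v in x]


clamped = clampf_model([-2.0, 0.5, 3.0], 0.0, 1.0)
```

The result has the same length as x, mirroring the "same shape as x" rule.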
Reactant.MLIR.Dialects.tt.dot Method
dot
d = matrix_multiply(a, b) + c. inputPrecision describes how to exercise the Tensor Cores (TC) when the inputs are f32. It can be one of: tf32, tf32x3, ieee.
tf32: use TC with tf32 ops.
tf32x3: implement the 3xTF32 trick. For more info see the pass in F32DotTC.cpp.
ieee: don't use TC, implement dot in software.
If the GPU does not have Tensor Cores or the inputs are not f32, this flag is ignored.
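Reading the expression as a matrix product of a and b accumulated into c, the arithmetic can be sketched in plain Python. The helper name dot_model is hypothetical, matrices are nested lists, and the inputPrecision modes are not modelled; this only shows the multiply-accumulate shape of the op.

```python
def dot_model(a, b, c):
    """Compute a @ b + c for nested-list matrices (hypothetical helper)."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        # Each output element is a length-k dot product plus the accumulator entry.
        [sum(a[i][p] * b[p][j] for p in range(k)) + c[i][j] for j in range(n)]
        for i in range(m)
    ]


d = dot_model([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]])
```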
Reactant.MLIR.Dialects.tt.dot_scaled Function
dot_scaled
d = matrix_multiply(scale(lhs, lhs_scale), scale(rhs, rhs_scale)) + c, where scale(x, s) is a function that applies the scale per block following the microscaling spec.
Reactant.MLIR.Dialects.tt.elementwise_inline_asm Method
elementwise_inline_asm
Runs an inline asm block to generate one or more tensors.
The asm block is given packed_element elements at a time. Exactly which elements it receives is unspecified.
Reactant.MLIR.Dialects.tt.experimental_descriptor_gather Method
experimental_descriptor_gather
The tt.experimental_descriptor_gather op will be lowered to NVIDIA TMA load operations on targets that support it.
desc_ptr is a pointer to the TMA descriptor allocated in global memory. The descriptor block must have 1 row and the indices must be a 1D tensor. Accordingly, the result is a 2D tensor with multiple rows.
This is an escape hatch and is only there for testing/experimenting. This op will be removed in the future.
Reactant.MLIR.Dialects.tt.experimental_descriptor_load Method
experimental_descriptor_load
This operation will be lowered to an NVIDIA TMA load operation on targets supporting it. desc is a tensor descriptor object. The destination tensor type and shape must match the descriptor; otherwise the result is undefined.
This is an escape hatch and is only there for testing/experimenting. This op will be removed in the future.
Reactant.MLIR.Dialects.tt.experimental_descriptor_store Method
experimental_descriptor_store
This operation will be lowered to an NVIDIA TMA store operation on targets supporting it. desc is a tensor descriptor object. The shape and types of src must match the descriptor; otherwise the result is undefined.
This is an escape hatch and is only there for testing/experimenting. This op will be removed in the future.
Reactant.MLIR.Dialects.tt.extern_elementwise Method
extern_elementwise
call an external function $symbol implemented in
Reactant.MLIR.Dialects.tt.fp_to_fp Method
fp_to_fp
Floating point casting for custom types (F8), and non-default rounding modes.
F8 <-> FP16, BF16, FP32, FP64
Reactant.MLIR.Dialects.tt.func Method
func
Operations within the function cannot implicitly capture values defined outside of the function, i.e. functions are IsolatedFromAbove. All external references must use function arguments or attributes that establish a symbolic connection (e.g. symbols referenced by name via a string attribute like SymbolRefAttr). An external function declaration (used when referring to a function declared in some other module) has no body. While the MLIR textual form provides a nice inline syntax for function arguments, they are internally represented as “block arguments” to the first block in the region.
Only dialect attribute names may be specified in the attribute dictionaries for function arguments, results, or the function itself.
Example
// External function declarations.
tt.func @abort()
tt.func @scribble(i32, i64, memref<? x 128 x f32, #layout_map0>) -> f64
// A function that returns its argument twice:
tt.func @count(%x: i64) -> (i64, i64)
attributes {fruit = "banana"} {
tt.return %x, %x : i64, i64
}
// A function with an argument attribute
tt.func @example_fn_arg(%x: i32 {swift.self = unit})
// A function with a result attribute
tt.func @example_fn_result() -> (f64 {dialectName.attrName = 0 : i64})
// A function with an attribute
tt.func @example_fn_attr() attributes {dialectName.attrName = false}
Reactant.MLIR.Dialects.tt.gather Method
gather
Gather elements from the input tensor using the indices tensor along a single specified axis. The output tensor has the same shape as the indices tensor. The input and indices tensors must have the same number of dimensions, and each dimension of the indices tensor that is not the gather dimension cannot be greater than the corresponding dimension in the input tensor.
The efficient_layout attribute is set when the compiler has determined an optimized layout for the operation, indicating that it should not be changed.
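The shape rules above can be illustrated with a 2D reference model. This is a sketch under assumptions: it uses the element rule out[i][j] = src[indices[i][j]][j] for axis 0 and out[i][j] = src[i][indices[i][j]] for axis 1, which is consistent with the stated shape constraints but is not spelled out in the text. The name gather_2d is hypothetical.

```python
def gather_2d(src, indices, axis):
    """Gather from a 2D nested list along axis 0 or 1 (hypothetical helper)."""
    rows, cols = len(indices), len(indices[0])
    out = [[None] * cols for _ in range(rows)]   # output has the shape of indices
    for i in range(rows):
        for j in range(cols):
            k = indices[i][j]
            # Replace the coordinate along the gather axis with the looked-up index.
            out[i][j] = src[k][j] if axis == 0 else src[i][k]
    return out


src = [[1, 2], [3, 4]]
along_cols = gather_2d(src, [[0, 0], [1, 0]], axis=1)
along_rows = gather_2d(src, [[1, 0]], axis=0)  # indices may be longer or shorter along the gather axis
```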
Reactant.MLIR.Dialects.tt.histogram Method
histogram
Return the histogram of the input tensor. The number of bins is equal to the dimension of the output tensor. Each bin has a width of 1 and bins start at 0.
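With unit-width bins starting at 0, bin i simply counts how many inputs equal i. The sketch below assumes integer inputs held in a flat list and silently ignores values outside the bin range; that out-of-range behaviour is an assumption of this example, and the name histogram_model is hypothetical.

```python
def histogram_model(values, num_bins):
    """Count integer values into num_bins unit-width bins starting at 0 (hypothetical helper)."""
    bins = [0] * num_bins          # output size determines the number of bins
    for v in values:
        if 0 <= v < num_bins:      # assumption: values outside the range are dropped
            bins[v] += 1
    return bins


counts = histogram_model([0, 1, 1, 3], num_bins=4)
```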
Reactant.MLIR.Dialects.tt.join Method
join
For example, if the two input tensors are 4x8xf32, returns a tensor of shape 4x8x2xf32.
Because Triton tensors always have a power-of-two number of elements, the two input tensors must have the same shape.
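A small model of the shape behaviour, using nested Python lists as tensors: the two inputs are paired element by element, which adds a trailing dimension of size 2. The name join_model is hypothetical and the sketch does not model layouts or element types.

```python
def join_model(a, b):
    """Pair up elements of two same-shaped nested lists along a new last dimension."""
    if isinstance(a, list) != isinstance(b, list):
        raise ValueError("inputs must have the same shape")
    if isinstance(a, list):
        if len(a) != len(b):
            raise ValueError("inputs must have the same shape")
        return [join_model(x, y) for x, y in zip(a, b)]
    return [a, b]   # scalars become a length-2 innermost list


joined = join_model([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # 2x2 inputs give a 2x2x2 result
```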
Reactant.MLIR.Dialects.tt.make_range Method
make_range
Returns a 1D int32 tensor.
Values span from start to end (exclusive), with step = 1.
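The half-open range behaves like Python's built-in range. The helper below is a hypothetical one-line model that returns a list of Python integers; the int32 element type is not modelled.

```python
def make_range_model(start, end):
    """Values start, start + 1, ..., end - 1 as a list (hypothetical helper)."""
    return list(range(start, end))   # end is exclusive, step is 1


r = make_range_model(2, 6)
```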
Reactant.MLIR.Dialects.tt.make_tensor_descriptor Method
make_tensor_descriptor
tt.make_tensor_descriptor takes both meta information of the parent tensor and the block size, and returns a descriptor object which can be used to load/store from the tensor in global memory.
Reactant.MLIR.Dialects.tt.make_tensor_ptr Method
make_tensor_ptr
tt.make_tensor_ptr takes both meta information of the parent tensor and the block tensor, then it returns a pointer to the block tensor, e.g. returns a type of tt.ptr<tensor<8x8xf16>>.
Reactant.MLIR.Dialects.tt.mulhiui Method
mulhiui
Most significant N bits of the 2N-bit product of two integers.
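For unsigned N-bit operands, the high half of the product is the full product shifted right by N bits. The sketch below relies on Python's arbitrary-precision integers and takes the bit width as a parameter; the name mulhiui_model and the default of 32 bits are choices made for this example.

```python
def mulhiui_model(a, b, bits=32):
    """High `bits` bits of the 2*bits-bit unsigned product of a and b (hypothetical helper)."""
    mask = (1 << bits) - 1
    # Mask to N bits so the inputs are treated as unsigned N-bit values.
    return ((a & mask) * (b & mask)) >> bits


hi_max = mulhiui_model(0xFFFFFFFF, 0xFFFFFFFF)  # (2**32 - 1)**2 >> 32
hi_small = mulhiui_model(2, 3)                  # product fits in the low half
```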
Reactant.MLIR.Dialects.tt.precise_divf Method
precise_divf
Precise div for floating point types.
Reactant.MLIR.Dialects.tt.precise_sqrt Method
precise_sqrt
Precise sqrt for floating point types.
Reactant.MLIR.Dialects.tt.print Method
print
tt.print takes a literal string prefix and an arbitrary number of scalar or tensor arguments that should be printed. Formats are generated automatically from the arguments.
Reactant.MLIR.Dialects.tt.reinterpret_tensor_descriptor Method
reinterpret_tensor_descriptor
This Op exists to help the transition from untyped raw TMA objects to typed Tensor descriptor objects. Ideally, we can remove this once the APIs are fully fleshed out.
Reactant.MLIR.Dialects.tt.reshape Method
reshape
Reinterpret a tensor to a different shape.
If allow_reorder is set, the compiler is free to change the order of elements to generate more efficient code.
If efficient_layout is set, this is a hint that the destination layout should be kept for performance reasons. The compiler is still free to change it for better performance.
Reactant.MLIR.Dialects.tt.return_ Method
return_
The tt.return operation represents a return operation within a function. The operation takes a variable number of operands and produces no results. The number and types of the operands must match the signature of the function that contains the operation.
Example
tt.func @foo() -> (i32, f8) {
...
tt.return %0, %1 : i32, f8
}
Reactant.MLIR.Dialects.tt.split Method
split
The input must be a tensor whose last dimension has size 2. Returns two tensors, src[..., 0] and src[..., 1].
For example, if the input shape is 4x8x2xf32, returns two tensors of shape 4x8xf32.
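The slicing src[..., 0] and src[..., 1] can be modelled on nested Python lists. The helper name split_model is hypothetical, and the sketch only covers the shape behaviour, not layouts.

```python
def split_model(src):
    """Return (src[..., 0], src[..., 1]) for a nested list whose last dimension has size 2."""
    if isinstance(src[0], list):
        # Not yet at the last dimension: split each sub-tensor and regroup the halves.
        parts = [split_model(sub) for sub in src]
        return [p[0] for p in parts], [p[1] for p in parts]
    if len(src) != 2:
        raise ValueError("last dimension must have size 2")
    return src[0], src[1]


lhs, rhs = split_model([[[1, 5], [2, 6]], [[3, 7], [4, 8]]])  # 2x2x2 input gives two 2x2 outputs
```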
Reactant.MLIR.Dialects.tt.trans Method
trans
For example, given a tensor x with shape [1,2,4], transpose(x) with order=[2,0,1] rearranges the tensor to have shape [4,1,2].
Although this op is called "trans", it implements both tl.trans() and tl.permute(). ("permute" might be a better name, but it's called "trans" because originally it only supported 2D tensors.)
Implementation note on encodings:
In the TritonGPU dialect (and probably others), an encoding is chosen for this op's output so it's a nop from the perspective of code generation.
For example, suppose tensor x has an encoding such that GPU thread [i,j,k] has a register containing element [i,j,k] of the tensor. Now we transpose x with order [2,1,0], i.e. we reverse the order of its dimensions. In TritonGPU, we will choose a layout for the output of the transpose so that GPU thread [i,j,k] has element [k,j,i] of transpose(x). But this is the same element it had before! All we've done is "rename" the element that thread [i,j,k] has.
The "real" transpose – i.e. moving data between GPU threads – occurs in convertLayout ops that appear before and/or after the operation.
We do this so that you can chain multiple data-movement ops (e.g. transpose+reshape+concat) without going to shared memory after each one.
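The shape and index bookkeeping described above can be sketched in a few lines. Both helper names are hypothetical. The convention assumed here is that output dimension i takes input dimension order[i], which reproduces the [1,2,4] to [4,1,2] example and the [i,j,k] to [k,j,i] renaming from the text.

```python
def trans_shape(shape, order):
    """Shape of the permuted tensor: output dim i has the size of input dim order[i]."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the dimensions")
    return [shape[d] for d in order]


def source_index(out_index, order):
    """Input coordinates of the element found at out_index in the permuted tensor."""
    src = [0] * len(order)
    for out_dim, in_dim in enumerate(order):
        src[in_dim] = out_index[out_dim]
    return src


new_shape = trans_shape([1, 2, 4], [2, 0, 1])
same_element = source_index([7, 5, 3], [2, 1, 0])  # element [k,j,i] of the output is [i,j,k] of x
```

The second call mirrors the encoding note: reversing the dimensions only relabels which coordinates name an element, so no data has to move.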