
Triton Dialect

Refer to the official documentation for more details.

Reactant.MLIR.Dialects.tt.assert Method

assert

tt.assert takes a condition tensor and a message string. If the condition is false, the message is printed, and the program is aborted.
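
Example (a minimal sketch; the SSA value name, message, and tensor shape are illustrative assumptions, not from the docstring):

mlir
tt.assert %cond, "expected a non-negative value" : tensor<128xi1>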

source
Reactant.MLIR.Dialects.tt.atomic_cas Method

atomic_cas

Compares cmp with the value old stored at location ptr: if old == cmp, val is stored to ptr; otherwise old is stored back to ptr. In either case, the original value old is returned.
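
Example (a minimal sketch; the memory ordering, scope, value names, and types are illustrative assumptions):

mlir
%old = tt.atomic_cas acq_rel, gpu, %ptr, %cmp, %val : (!tt.ptr<i32>, i32, i32) -> i32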

source
Reactant.MLIR.Dialects.tt.atomic_rmw Function

atomic_rmw

Loads the value at ptr, applies rmw_op with val, and stores the result back to ptr.

Returns the old value that was stored at ptr.
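
Example (a minimal sketch; the rmw op, ordering, scope, value names, and tensor shapes are illustrative assumptions):

mlir
%old = tt.atomic_rmw fadd, acq_rel, gpu, %ptr, %val, %mask : (tensor<256x!tt.ptr<f32>>, tensor<256xf32>, tensor<256xi1>) -> tensor<256xf32>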

source
Reactant.MLIR.Dialects.tt.broadcast Method

broadcast

For a given tensor, broadcast changes one or more dimensions with size 1 to a new size, e.g. tensor<1x32x1xf32> -> tensor<2x32x4xf32>. You cannot change the size of a non-1 dimension.
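
Example (a minimal sketch; value names are illustrative, shapes follow the docstring):

mlir
%dst = tt.broadcast %src : tensor<1x32x1xf32> -> tensor<2x32x4xf32>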

source
Reactant.MLIR.Dialects.tt.call Method

call

The tt.call operation represents a direct call to a function that is within the same symbol scope as the call. The operands and result types of the call must match the specified function type. The callee is encoded as a symbol reference attribute named "callee".

Example

mlir
%2 = tt.call @my_add(%0, %1) : (f32, f32) -> f32
source
Reactant.MLIR.Dialects.tt.clampf Method

clampf

Clamp operation for floating point types.

The operation takes three arguments: x, min, and max. It returns a tensor of the same shape as x with its values clamped to the range [min, max].

source
Reactant.MLIR.Dialects.tt.dot Method

dot

d = matrix_multiply(a, b) + c.

inputPrecision describes how to exercise the TC (Tensor cores) when the inputs are f32. It can be one of: tf32, tf32x3, ieee.

tf32: use TC with tf32 ops.

tf32x3: implement the 3xTF32 trick. For more info see the pass in F32DotTC.cpp.

ieee: don't use TC; implement dot in software.

If the GPU does not have Tensor cores or the inputs are not f32, this flag is ignored.
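
Example (a minimal sketch; value names, shapes, and element types are illustrative assumptions):

mlir
%d = tt.dot %a, %b, %c : tensor<128x64xf16> * tensor<64x128xf16> -> tensor<128x128xf32>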

source
Reactant.MLIR.Dialects.tt.dot_scaled Function

dot_scaled

d = matrix_multiply(scale(lhs, lhs_scale), scale(rhs, rhs_scale)) + c, where scale(x, s) is a function that applies the scale per block following the microscaling spec.

source
Reactant.MLIR.Dialects.tt.elementwise_inline_asm Method

elementwise_inline_asm

Runs an inline asm block to generate one or more tensors.

The asm block is given packed_element elements at a time. Exactly which elements it receives is unspecified.

source
Reactant.MLIR.Dialects.tt.experimental_descriptor_gather Method

experimental_descriptor_gather

The tt.experimental_descriptor_gather op will be lowered to NVIDIA TMA load operations on targets that support it.

desc_ptr is a pointer to the TMA descriptor allocated in global memory. The descriptor block must have 1 row and the indices must be a 1D tensor. Accordingly, the result is a 2D tensor with multiple rows.

This is an escape hatch and is only there for testing/experimenting. This op will be removed in the future.

source
Reactant.MLIR.Dialects.tt.experimental_descriptor_load Method

experimental_descriptor_load

This operation will be lowered to an NVIDIA TMA load operation on targets that support it. desc is a tensor descriptor object. The destination tensor type and shape must match the descriptor, otherwise the result is undefined.

This is an escape hatch and is only there for testing/experimenting. This op will be removed in the future.

source
Reactant.MLIR.Dialects.tt.experimental_descriptor_store Method

experimental_descriptor_store

This operation will be lowered to an NVIDIA TMA store operation on targets that support it. desc is a tensor descriptor object. The shape and types of src must match the descriptor, otherwise the result is undefined.

This is an escape hatch and is only there for testing/experimenting. This op will be removed in the future.

source
Reactant.MLIR.Dialects.tt.extern_elementwise Method

extern_elementwise

Calls an external function $symbol implemented in $libpath/$libname with $args; returns $libpath/$libname:$symbol($args...).

source
Reactant.MLIR.Dialects.tt.fp_to_fp Method

fp_to_fp

Floating point casting for custom types (F8) and non-default rounding modes.

F8 <-> FP16, BF16, FP32, FP64
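
Example (a minimal sketch; value names, shapes, the F8 element type, and the rounding mode are illustrative assumptions):

mlir
%dst = tt.fp_to_fp %src, rounding = rtne : tensor<128xf32> -> tensor<128xf8E4M3FN>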

source
Reactant.MLIR.Dialects.tt.func Method

func

Operations within the function cannot implicitly capture values defined outside of the function, i.e. Functions are IsolatedFromAbove. All external references must use function arguments or attributes that establish a symbolic connection (e.g. symbols referenced by name via a string attribute like SymbolRefAttr). An external function declaration (used when referring to a function declared in some other module) has no body. While the MLIR textual form provides a nice inline syntax for function arguments, they are internally represented as “block arguments” to the first block in the region.

Only dialect attribute names may be specified in the attribute dictionaries for function arguments, results, or the function itself.

Example

mlir
// External function definitions.
tt.func @abort()
tt.func @scribble(i32, i64, memref<? x 128 x f32, #layout_map0>) -> f64

// A function that returns its argument twice:
tt.func @count(%x: i64) -> (i64, i64)
  attributes {fruit: "banana"} {
  return %x, %x: i64, i64
}

// A function with an argument attribute
tt.func @example_fn_arg(%x: i32 {swift.self = unit})

// A function with a result attribute
tt.func @example_fn_result() -> (f64 {dialectName.attrName = 0 : i64})

// A function with an attribute
tt.func @example_fn_attr() attributes {dialectName.attrName = false}
source
Reactant.MLIR.Dialects.tt.gather Method

gather

Gather elements from the input tensor using the indices tensor along a single specified axis. The output tensor has the same shape as the indices tensor. The input and indices tensors must have the same number of dimensions, and each dimension of the indices tensor that is not the gather dimension cannot be greater than the corresponding dimension in the input tensor.

The efficient_layout attribute is set when the compiler has determined an optimized layout for the operation, indicating that it should not be changed.
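
Example (a minimal sketch; the axis, value names, and shapes are illustrative assumptions):

mlir
%out = tt.gather %src[%indices] {axis = 1 : i32} : (tensor<16x16xf32>, tensor<16x8xi32>) -> tensor<16x8xf32>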

source
Reactant.MLIR.Dialects.tt.histogram Method

histogram

Returns the histogram of the input tensor. The number of bins is equal to the dimension of the output tensor. Each bin has a width of 1, and the bins start at 0.
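
Example (a minimal sketch; value names and shapes are illustrative assumptions):

mlir
%hist = tt.histogram %src : tensor<256xi32> -> tensor<16xi32>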

source
Reactant.MLIR.Dialects.tt.join Method

join

Joins two tensors along a new, minor (last) dimension. For example, if the two input tensors are 4x8xf32, the result is a tensor of shape 4x8x2xf32.

Because Triton tensors always have a power-of-two number of elements, the two input tensors must have the same shape.
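
Example (a minimal sketch; value names are illustrative, shapes follow the docstring):

mlir
%res = tt.join %a, %b : tensor<4x8xf32> -> tensor<4x8x2xf32>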

source
Reactant.MLIR.Dialects.tt.make_range Method

make_range

Returns a 1D int32 tensor.

Values span from start to end (exclusive), with step = 1.
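
Example (a minimal sketch; the range bounds and result shape are illustrative):

mlir
%range = tt.make_range {end = 128 : i32, start = 0 : i32} : tensor<128xi32>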

source
Reactant.MLIR.Dialects.tt.make_tensor_descriptor Method

make_tensor_descriptor

tt.make_tensor_descriptor takes both meta information of the parent tensor and the block size, and returns a descriptor object which can be used to load/store from the tensor in global memory.

source
Reactant.MLIR.Dialects.tt.make_tensor_ptr Method

make_tensor_ptr

tt.make_tensor_ptr takes both meta information of the parent tensor and the block tensor, and returns a pointer to the block tensor, e.g. a value of type tt.ptr<tensor<8x8xf16>>.

source
Reactant.MLIR.Dialects.tt.mulhiui Method

mulhiui

Most significant N bits of the 2N-bit product of two integers.

source
Reactant.MLIR.Dialects.tt.precise_divf Method

precise_divf

Precise div for floating point types.

source
Reactant.MLIR.Dialects.tt.precise_sqrt Method

precise_sqrt

Precise sqrt for floating point types.

source
Reactant.MLIR.Dialects.tt.print Method

print

tt.print takes a literal string prefix and an arbitrary number of scalar or tensor arguments that should be printed. The format string is generated automatically from the arguments.

source
Reactant.MLIR.Dialects.tt.reinterpret_tensor_descriptor Method

reinterpret_tensor_descriptor

This Op exists to help the transition from untyped raw TMA objects to typed Tensor descriptor objects. Ideally, we can remove this once the APIs are fully fleshed out.

source
Reactant.MLIR.Dialects.tt.reshape Method

reshape

Reinterprets a tensor to a different shape.

If allow_reorder is set, the compiler is free to change the order of elements to generate more efficient code.

If efficient_layout is set, this is a hint that the destination layout should be kept for performance reasons. The compiler is still free to change it for better performance.
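
Example (a minimal sketch; value names and shapes are illustrative assumptions):

mlir
%dst = tt.reshape %src : tensor<2x32xf32> -> tensor<64xf32>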

source
Reactant.MLIR.Dialects.tt.return_ Method

return_

The tt.return operation represents a return operation within a function. The operation takes a variable number of operands and produces no results. The operand number and types must match the signature of the function that contains the operation.

Example

mlir
tt.func @foo() -> (i32, f8) {
  ...
  tt.return %0, %1 : i32, f8
}
source
Reactant.MLIR.Dialects.tt.split Method

split

The input must be a tensor whose last dimension has size 2. Returns two tensors, src[..., 0] and src[..., 1].

For example, if the input shape is 4x8x2xf32, returns two tensors of shape 4x8xf32.
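
Example (a minimal sketch; value names are illustrative, shapes follow the docstring):

mlir
%lhs, %rhs = tt.split %src : tensor<4x8x2xf32> -> tensor<4x8xf32>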

source
Reactant.MLIR.Dialects.tt.trans Method

trans

For example, given a tensor x with shape [1,2,4], transpose(x) with order=[2,0,1] rearranges the tensor to have shape [4,1,2].
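
Example (a minimal sketch; value names are illustrative, the order and shapes follow the docstring):

mlir
%t = tt.trans %x {order = array<i32: 2, 0, 1>} : tensor<1x2x4xf32> -> tensor<4x1x2xf32>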

Although this op is called "trans", it implements both tl.trans() and tl.permute(). ("permute" might be a better name, but it's called "trans" because originally it only supported 2D tensors.)

Implementation note on encodings:

In the TritonGPU dialect (and probably others), an encoding is chosen for this op's output so it's a nop from the perspective of code generation.

For example, suppose tensor x has an encoding such that GPU thread [i,j,k] has a register containing element [i,j,k] of the tensor. Now we transpose x with order [2,1,0], i.e. we reverse the order of its dimensions. In TritonGPU, we will choose a layout for the output of the transpose so that GPU thread [i,j,k] has element [k,j,i] of transpose(x). But this is the same element it had before! All we've done is "rename" the element that thread [i,j,k] has.

The "real" transpose – i.e. moving data between GPU threads – occurs in convertLayout ops that appear before and/or after the operation.

We do this so that you can chain multiple data-movement ops (e.g. transpose+reshape+concat) without going to shared memory after each one.

source