Shader Printf in HLSL and DX12
Unless you're lucky enough to be working exclusively in CUDA, debugging GPU shaders is still very much "not great" in the year 2024. Tools like RenderDoc and PIX are excellent and do provide the ability to step through a shader and inspect variables, but they're fundamentally tied to a "capture" workflow. This means you need to run your game or app from RenderDoc/PIX in capture mode (which adds overhead), and then capture at least a single frame. Only after the capture is finished can you analyze it, find your draw or dispatch, and finally choose which thread you'd like to debug. Once you're debugging, things generally work, but it's still possible you'll hit issues due to the way that debugging works in these tools. They are generally not simply inspecting the state of an executing (suspended) thread like you would expect from a CPU debugger; instead they work by either emulating shader instructions on the CPU or by patching the shader bytecode to emit intermediate program state to a buffer that can be read by the CPU. Don't get me wrong, these tools are great to have and it's fantastic that they can be used to debug shaders at all. But it's still not enough, and generally more debugging tools are needed for unusually tough problems. Even if vendor-agnostic GPU debuggers were as good as they are on CPU, you'd still probably want to reach for other tools depending on the situation.
In CPU land, the venerable printf and all of its related functions are commonly used for debugging and diagnostics. However, we have not yet had a cross-vendor/cross-API way to do the same for shaders. To be fair, it's much more complicated on a GPU! Shaders run in batches of thousands or even millions of threads, and run on a completely separate processor from the one the OS and all of its services run on. Despite these issues, Vulkan/SPIR-V does in fact offer a printf that's available from both GLSL and HLSL. It comes with caveats though. In particular, it's set up so that the messages are intercepted by a validation layer or by RenderDoc, which makes it harder for the engine/app itself to obtain the messages and process them. And of course this doesn't help if you're targeting D3D12 or other APIs and you'd like your prints to work on those platforms as well.
In this article I'll walk through how to build your own shader printf entirely in software, using HLSL and D3D12 as the target language and API. The concepts here can be adapted to any API and shader language, although the shading language and compiler can (significantly) impact how you handle strings in your shaders.
Overall Approach
The printf implementation I'm going to describe works like this:
- Every frame, have the GPU clear a big buffer that we'll use to store the print strings and arguments
- Make a debug info buffer SRV available at a "magic" known descriptor index, which effectively makes it globally available to all shaders via a #include. This buffer will hold the descriptor index of the print buffer, along with some additional info that's useful for debugging
- When a shader wants to print, it uses an atomic to allocate some space in the big print buffer and stuffs the string + data into it
- After recording every frame, copy the print buffer to a CPU-accessible readback buffer so that the data can be read back. Then each print can be decoded from the buffer and logged somewhere.
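Stated as code, the per-frame flow on the CPU ends up looking roughly like the sketch below. This is just a scaffold tying the later sections together; UpdateDebugInfoBuffer, RecordFrame, and ProcessPrintReadback are hypothetical names for steps that are described in detail further down.

// Rough per-frame flow (a sketch; only ClearRawBuffer and EndRender match
// names that appear later in this article, the rest are placeholders).
void RenderFrame(ID3D12GraphicsCommandList7* cmdList)
{
    // 1. Clear the GPU print buffer and refresh the "magic" debug info buffer
    DX12::ClearRawBuffer(cmdList, PrintBuffer, Uint4(0, 0, 0, 0));
    UpdateDebugInfoBuffer();

    // 2. Record the frame's draws and dispatches; any shader can call DebugPrint()
    RecordFrame(cmdList);

    // 3. Copy the print buffer into this frame's readback buffer
    ShaderDebug::EndRender(cmdList);

    // 4. Once the GPU has finished an earlier frame, map its readback buffer,
    //    decode each print, and send it to the log
    ProcessPrintReadback();
}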
Pretty simple in concept! But as always, the devil is in the details. Let's go through them one by one.
Setting Up The Print Buffer
For our print buffer we don't need anything fancy at all, just a "big enough" buffer that's writable by the GPU. In my sample framework I do it like this, and also create a pair of matching readback buffers:
PrintBuffer.Initialize({
    .NumElements = 1024 * 1024 * 4,
    .CreateUAV = true,
    .Name = L"Shader Debug Print Buffer",
});

for(ReadbackBuffer& buffer : PrintReadbackBuffers)
    buffer.Initialize(PrintBuffer.InternalBuffer.Size);
PrintBuffer in this case is a RawBuffer, which ends up being a RWByteAddressBuffer in the shader. Every frame it gets cleared to all 0's, which I do with a utility function that just uses a compute shader to do the clear (since ClearUnorderedAccessViewUint is painful to use in D3D12):
DX12::ClearRawBuffer(cmdList, PrintBuffer, Uint4(0, 0, 0, 0));
DX12::Barrier(cmdList, PrintBuffer.InternalBuffer.WriteToWriteBarrier());
void ClearRawBuffer(ID3D12GraphicsCommandList* cmdList, const RawBuffer& buffer, const Uint4& clearValue)
{
    cmdList->SetComputeRootSignature(UniversalRootSignature);
    cmdList->SetPipelineState(clearRawBufferPSO);

    Assert_(buffer.UAV != uint32(-1));
    ClearRawBufferConstants cbData =
    {
        .ClearValue = clearValue,
        .DescriptorIdx = buffer.UAV,
        .Num16ByteElements = uint32(AlignTo(buffer.NumElements * buffer.Stride, 16) / 16),
    };
    BindTempConstantBuffer(cmdList, cbData, URS_ConstantBuffers + 0, CmdListMode::Compute);

    uint32 dispatchX = DispatchSize(cbData.Num16ByteElements, clearRawBufferTGSize);
    cmdList->Dispatch(dispatchX, 1, 1);
}
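The clear compute shader itself isn't shown in the snippet above, but a minimal sketch of what it could look like follows. The threadgroup size of 64, the b0 register, and the exact constant buffer layout are assumptions on my part; the actual shader in the framework may differ.

// Hypothetical clear shader matching the C++ dispatch above: each thread
// clears one 16-byte chunk of the target buffer.
struct ClearRawBufferConstants
{
    uint4 ClearValue;
    uint DescriptorIdx;
    uint Num16ByteElements;
};

ConstantBuffer<ClearRawBufferConstants> CB : register(b0);

[numthreads(64, 1, 1)]
void ClearRawBufferCS(uint3 dispatchID : SV_DispatchThreadID)
{
    if(dispatchID.x >= CB.Num16ByteElements)
        return;

    RWByteAddressBuffer buffer = ResourceDescriptorHeap[CB.DescriptorIdx];
    buffer.Store4(dispatchID.x * 16, CB.ClearValue);
}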
That's it for the print buffer setup! Next we'll take a look at our "magic" debug info buffer.
The "Magic" Debug Info Buffer
In my sample framework I use shader hot-reloading pretty frequently, since it provides a super-quick iteration loop that doesn't require restarting the app. Often one of the reasons for hot-reloading a shader is to add some temporary debugging code in order to figure out why something isn't working correctly. To that end, I wanted the ability to add debug prints to any shader without having to change the bindings and re-compile the C++ code. This could potentially be done by either always adding an extra descriptor index to binding structs or by ensuring it was always bound to the root signature, but instead I opted to lean on Shader Model 6.6 bindless by just placing a buffer SRV descriptor at a "known" static index shared between C++ and GPU code. That makes the buffer available to anyone regardless of which root signature they use or what else is going on in the shader, which is nice. On the C++ side it's simple: I just added a way to allocate a specific descriptor index from my DescriptorHeap type, create an SRV in that descriptor slot, and free the original SRV descriptor:
DebugInfoBuffer.Initialize({
    .NumElements = sizeof(DebugInfo) / 4,
    .Dynamic = true,
    .CPUAccessible = true,
    .Name = L"Debug Info Buffer",
});

const PersistentDescriptorAlloc alloc = DX12::SRVDescriptorHeap.AllocatePersistent(MagicDebugBufferIndex);
DX12::SRVDescriptorHeap.FreePersistent(DebugInfoBuffer.SRV);
DebugInfoBuffer.SRV = alloc.Index;
for(uint32 i = 0; i < ArraySize_(alloc.Handles); ++i)
{
    const D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = DebugInfoBuffer.SRVDesc(i);
    DX12::Device->CreateShaderResourceView(DebugInfoBuffer.Resource(), &srvDesc, alloc.Handles[i]);
}
Then every CPU frame we just need to fill that buffer with fresh data:
DebugInfo debugInfo =
{
    .PrintBuffer = PrintBuffer.SRV,
    .PrintBufferSize = uint32(PrintBuffer.InternalBuffer.Size),
    .CursorXY = { cursorX, cursorY },
};
DebugInfoBuffer.MapAndSetData(&debugInfo, sizeof(debugInfo) / 4);
In the shader it's no fuss at all to get the buffer through ResourceDescriptorHeap:
// In a header file shared between shaders and C++:
struct DebugInfo
{
    DescriptorIndex PrintBuffer;
    ShaderUint PrintBufferSize;
    ShaderUint2 CursorXY;
};

SharedConstant_ DescriptorIndex MagicDebugBufferIndex = 1024;

// In ShaderDebug.hlsli
DebugInfo GetDebugInfo()
{
    ByteAddressBuffer debugBuffer = ResourceDescriptorHeap[MagicDebugBufferIndex];
    return debugBuffer.Load<DebugInfo>(0);
}
In addition to the index of the print buffer descriptor, I have the mouse cursor position in there as well, since that's useful for "print data for the pixel under the cursor" scenarios. But you can add other things too: debug flags or floats that you can use in the shader without having to explicitly add new ones, frame indices or other program state, whatever you want.
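For example, combined with the DebugPrint macro described later in this article, a shader helper for the cursor case might look something like this (a sketch; the helper and parameter names are just for illustration):

// Hypothetical helper: only print for the pixel currently under the mouse cursor.
// pixelPos would typically come from SV_Position in a pixel shader or the
// dispatch thread ID in a compute shader.
void DebugPrintUnderCursor(uint2 pixelPos, float3 value)
{
    const DebugInfo debugInfo = GetDebugInfo();
    if(all(pixelPos == debugInfo.CursorXY))
        DebugPrint("Pixel {0}: {1}", pixelPos, value);
}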
Dealing With The String Problem
Now that we have access to our magic debug info buffer, which in turn gives us access to our print buffer, we can start building up the functionality needed to write data into that buffer. This is where things unfortunately get rather dicey in HLSL, which has no native support for working with strings or even a char type. The SPIR-V printf kind of works around this, since you only pass a literal to a printf intrinsic and then the compiler handles the rest. But that doesn't help us at all if we need to do our own processing of the string, which we do for our home-grown printf implementation. One workaround I've used myself and seen used elsewhere is to declare uint arrays of character literals, and pass those around. This works well enough with current language support, but it's about as ugly as you'd imagine:
const uint printStr[] = { 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ',
                          't', 'e', 'r', 'r', 'i', 'b', 'l', 'e', ' ',
                          'w', 'a', 'y', ' ', 't', 'o', ' ',
                          'w', 'r', 'i', 't', 'e', ' ', 'a', ' ',
                          's', 't', 'r', 'i', 'n', 'g' };
DebugPrint(printStr);
While this isn't unworkable, it's very ugly and takes significantly longer to write strings this way. You could potentially use editor tooling to help alleviate this… in fact I used to have a Sublime Text plugin that I wrote which could convert back and forth between a string literal and an array of char literals. But that's a band-aid at best.
Another alternative would be to write or use some kind of custom pre-processor that can extract the string literals from the shader and replace them with something else. This could seamlessly replace a literal with that uint array monstrosity above. Or you could potentially take it even further and replace the string with some kind of hash or token, and then store the string somewhere accessible on the CPU side so that it can be looked up when the print buffer is read back and resolved. That last approach could potentially be quite a bit more efficient on the shader side by drastically reducing the amount of data that needs to be written to the buffer, but it's more complicated. In particular, adding any kind of shader pre-processing is a heavy-handed step, and the decision to take on the associated burdens shouldn't be made lightly. Some engines have already chosen to do this for other reasons, in which case adding string literal handling would be a smaller incremental cost.
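As a rough illustration of that hashed-string alternative (not something my framework actually does), the CPU side could keep a table that maps hashes back to the original literals that the pre-processor stripped out:

// Hypothetical CPU-side lookup for the hashed-string approach: a pre-processing
// pass replaces each shader literal with a 32-bit hash and registers the
// original text, so the readback code only ever sees the hash.
#include <cstdint>
#include <string>
#include <unordered_map>

static std::unordered_map<uint32_t, std::string> gShaderStringTable;

// Called by the (hypothetical) shader pre-processor for every literal it extracts
void RegisterShaderString(uint32_t hash, std::string str)
{
    gShaderStringTable[hash] = std::move(str);
}

// Called when decoding the print buffer, where the shader wrote only the hash
const std::string& LookupShaderString(uint32_t hash)
{
    static const std::string missing = "<unknown string hash>";
    const auto it = gShaderStringTable.find(hash);
    return it != gShaderStringTable.end() ? it->second : missing;
}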
Really, neither of these solutions is ideal, particularly for a smaller codebase used for experimenting or messing around. I filed a GitHub issue on the DXC repo with a request for better string support back in October 2021, but as of January 2024 it still has not had a resolution.
A Cursed Path
Before I go any further I'll warn that this code is going to be ugly, and is almost certainly relying on unintended functionality in the compiler that just happened to be exposed through templates. It's very possible this won't work in the future, so I would not rely on it for anything critical, and perhaps it would be safest to avoid it altogether. You've been warned!
While experimenting with HLSL 2021, I stumbled upon a way to use a combination of template and macro hacks to work with string literals and extract the characters as integers. It turns out that literals can be passed to templated functions that expect an array of type T, where T is ultimately char even though that type is intentionally not exposed. While indexing into that array doesn't seem to work, we can make a simple strlen implementation that just uses the size of the array to determine the length of the string literal:
template<typename T, uint N> uint StrLen(T str[N])
{
    // Includes the null terminator
    return N;
}
Nice! As for indexing into the characters of the literal, while we can't do that from a templated function, we can do it by indexing into the literal itself. For example, "Hello"[2]. This means we have to resort to a loop in a macro, but for me that's an acceptable cost of doing business. We still have one more problem though: because of the way that char has been disabled within DXC, you can't actually do useful things with it. This includes any arithmetic, or even casting it to an int or uint. However it turns out there's one thing you can do with it, which is compare it to another character literal. Therefore "Hello"[2] == 'l' will evaluate to true. While this isn't immediately useful, it does mean we can write the world's most cursed CharToUint function:
template<typename T> uint CharToUint(in T c)
{
    if(c == 'A')
        return 65;
    if(c == 'B')
        return 66;
    if(c == 'C')
        return 67;
    if(c == 'D')
        return 68;
    if(c == 'E')
        return 69;
    if(c == 'F')
        return 70;
    if(c == 'G')
        return 71;

    // ...and about 90 more cases to handle
}
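As a quick sanity check of the mechanism (a hypothetical example, using only the cases shown above):

// "HELP"[1] == 'E' evaluates to true inside CharToUint, so this yields 69
// ('E' in ASCII). StrLen("HELP") would return 5, since the deduced array
// size includes the null terminator.
const uint c = CharToUint("HELP"[1]);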
Putting it all together, we can finally make a macro that processes the string literal one character at a time so that it can be packed into a buffer:
#define DebugPrint(str, ...) do {                                   \
    ShaderDebug::DebugPrinter printer;                               \
    printer.Init();                                                  \
    const uint strLen = ShaderDebug::StrLen(str);                    \
    for(uint i = 0; i < strLen; ++i)                                 \
        printer.AppendChar(ShaderDebug::CharToUint(str[i]));         \
    printer.StringSize = printer.ByteCount;                          \
    printer.AppendArgs(__VA_ARGS__);                                 \
    printer.Commit(ShaderDebug::GetDebugInfo());                     \
} while(0)
Brilliant? Horrible? Useful? Morally dubious? I'm honestly not sure; you'll have to decide for yourself.
Packing It All Into A Buffer
Now that we've covered strings, we're ready to look at how the string and any arguments get packed into our print buffer. You may have noticed that in my DebugPrint macro, I've got a DebugPrinter type that's doing most of the heavy lifting. Let's look at how that's written:
struct DebugPrinter
{
    static const uint BufferSize = 256;
    static const uint BufferSizeInBytes = BufferSize * sizeof(uint);

    uint InternalBuffer[BufferSize];
    uint ByteCount;
    uint StringSize;
    uint ArgCount;

    void Init()
    {
        for(uint i = 0; i < BufferSize; ++i)
            InternalBuffer[i] = 0;
        ByteCount = 0;
        StringSize = 0;
        ArgCount = 0;
    }

    uint CurrBufferIndex()
    {
        return ByteCount / 4;
    }

    uint CurrBufferShift()
    {
        return (ByteCount % 4) * 8;
    }

    void AppendChar(uint c)
    {
        if(ByteCount < BufferSizeInBytes)
            InternalBuffer[CurrBufferIndex()] |= ((c & 0xFF) << CurrBufferShift());
        ByteCount += 1;
    }
Basically we have an internal buffer of uint that we use to store the converted characters, packed as 1 of the 4 bytes in each uint. As we loop over the string literal, we just keep incrementing ByteCount and packing in the data. However, for a printf we also want to be able to handle arguments, meaning we can print integers and floats and the like. For these we will also pack the data into the internal buffer, but we will additionally store a special code before each argument to provide the CPU code with the type and size of the argument:
enum ArgCode
{
    DebugPrint_Uint = 0,
    DebugPrint_Uint2,
    DebugPrint_Uint3,
    DebugPrint_Uint4,
    DebugPrint_Int,
    DebugPrint_Int2,
    DebugPrint_Int3,
    DebugPrint_Int4,
    DebugPrint_Float,
    DebugPrint_Float2,
    DebugPrint_Float3,
    DebugPrint_Float4,

    NumDebugPrintArgCodes,
};
template<typename T, uint N> void AppendArgWithCode(ArgCode code, T arg[N])
{
    if(ByteCount + sizeof(arg) > BufferSizeInBytes)
        return;
    if(ArgCount >= MaxDebugPrintArgs)
        return;

    AppendChar(code);
    for(uint elem = 0; elem < N; ++elem)
    {
        for(uint b = 0; b < sizeof(T); ++b)
        {
            AppendChar(asuint(arg[elem]) >> (b * 8));
        }
    }

    ArgCount += 1;
}
To make things work nicely with the arguments passed to the macro, we have some trampoline functions that append each argument individually and pass the right code to AppendArgWithCode:
void AppendArg(uint x)
{
    uint a[1] = { x };
    AppendArgWithCode(DebugPrint_Uint, a);
}

void AppendArg(uint2 x)
{
    uint a[2] = { x.x, x.y };
    AppendArgWithCode(DebugPrint_Uint2, a);
}

void AppendArg(uint3 x)
{
    uint a[3] = { x.x, x.y, x.z };
    AppendArgWithCode(DebugPrint_Uint3, a);
}

// More of these for floats, signed integers, etc.

void AppendArgs()
{
}

template<typename T0> void AppendArgs(T0 arg0)
{
    AppendArg(arg0);
}

template<typename T0, typename T1> void AppendArgs(T0 arg0, T1 arg1)
{
    AppendArg(arg0);
    AppendArg(arg1);
}

template<typename T0, typename T1, typename T2> void AppendArgs(T0 arg0, T1 arg1, T2 arg2)
{
    AppendArg(arg0);
    AppendArg(arg1);
    AppendArg(arg2);
}

// More of these for higher arg counts
This allows the macro to simply do printer.AppendArgs(__VA_ARGS__); and it all works.
Finally we have one last Commit() method on DebugPrinter that actually stores everything into the RWByteAddressBuffer along with a special header:
void Commit(in DebugInfo debugInfo)
{
    if(ByteCount < 2)
        return;

    // Round up to the next multiple of 4 since we work with 4-byte alignment for each print
    ByteCount = ((ByteCount + 3) / 4) * 4;

    RWByteAddressBuffer printBuffer = ResourceDescriptorHeap[debugInfo.PrintBuffer];

    // Increment the atomic counter to allocate space to store the bytes
    const uint numBytesToWrite = ByteCount + sizeof(DebugPrintHeader);
    uint offset = 0;
    printBuffer.InterlockedAdd(0, numBytesToWrite, offset);

    // Account for the atomic counter at the beginning of the buffer
    offset += sizeof(uint);

    if((offset + numBytesToWrite) > debugInfo.PrintBufferSize)
        return;

    // Store the header
    DebugPrintHeader header;
    header.NumBytes = ByteCount;
    header.StringSize = StringSize;
    header.NumArgs = ArgCount;
    printBuffer.Store<DebugPrintHeader>(offset, header);
    offset += sizeof(DebugPrintHeader);

    // Store the buffer data
    for(uint i = 0; i < ByteCount / 4; ++i)
        printBuffer.Store(offset + (i * sizeof(uint)), InternalBuffer[i]);
}
In Commit() we're assuming there's a counter at the beginning of the print buffer that indicates how many bytes have been written to it. By atomically incrementing that counter we can safely "allocate" some space for our print data, and also ensure that enough space is left so that we can early-out if the buffer is full. Once the counter is incremented we fill out our small header and write that first, then write the contents of the internal buffer one uint at a time.
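The DebugPrintHeader type itself isn't listed above, but based on how Commit() fills it out it's just three counters. A minimal version might look like the sketch below (the field order is an assumption, and in the real shared header it would presumably use the ShaderUint-style typedefs shown earlier):

// Header written before each print's payload in the print buffer
struct DebugPrintHeader
{
    uint NumBytes;      // payload size that follows, rounded up to a multiple of 4
    uint StringSize;    // number of characters in the format string
    uint NumArgs;       // number of arguments packed after the string
};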
While there's some complexity and ugliness here that we need to hide away in an included file, actually doing a print in shader code is about as easy as it gets:
float3 color = ComputeColor();
DebugPrint("The color is {0}", color);
Since we're embedding the argument type as a code in the buffer, there's no need for printf-style format specifiers, and we can instead use argument IDs like in std::format or C#'s String.Format.
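One practical note: every thread that executes the macro appends its own message, so for large dispatches it's usually worth guarding the print. A hypothetical compute shader example (ComputeColor is just a placeholder):

[numthreads(8, 8, 1)]
void ExampleCS(uint3 dispatchThreadID : SV_DispatchThreadID)
{
    const float3 color = ComputeColor();

    // Without this guard, every thread in the dispatch would write its own
    // copy of the message and quickly fill up the print buffer.
    if(all(dispatchThreadID.xy == uint2(0, 0)))
        DebugPrint("The color is {0}", color);
}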
Reading Back On The CPU
In order to have the CPU process all of the print data that's been generated on the GPU, we need to issue some commands that copy the print buffer to a CPU-accessible readback buffer:
void EndRender(ID3D12GraphicsCommandList7* cmdList)
{
    PIXMarker marker(cmdList, "ShaderDebug - EndRender");

    DX12::Barrier(cmdList, PrintBuffer.WriteToReadBarrier({ .SyncAfter = D3D12_BARRIER_SYNC_COPY,
                                                            .AccessAfter = D3D12_BARRIER_ACCESS_COPY_SOURCE }));

    const ReadbackBuffer& readbackBuffer = PrintReadbackBuffers[DX12::CurrentCPUFrame % DX12::RenderLatency];
    cmdList->CopyResource(readbackBuffer.Resource, PrintBuffer.Resource());
}
Doing the copy to the readback buffer allows us to map and read the oldest generated print buffer data every frame so that we can process it:
if(DX12::CurrentCPUFrame >= DX12::RenderLatency)
{
    const ReadbackBuffer& readbackBuffer = PrintReadbackBuffers[(DX12::CurrentCPUFrame + 1) % DX12::RenderLatency];
    DebugPrintReader printReader(readbackBuffer.Map<uint8>(), uint32(readbackBuffer.Size));

    while(printReader.HasMoreData(sizeof(DebugPrintHeader)))
    {
        const DebugPrintHeader header = printReader.Consume<DebugPrintHeader>(DebugPrintHeader{});
        if(header.NumBytes == 0 || printReader.HasMoreData(header.NumBytes) == false)
            break;

        std::string formatStr = printReader.ConsumeString(header.StringSize);
        if(formatStr.size() == 0)
            break;

        if(header.NumArgs > MaxDebugPrintArgs)
            break;

        argStrings.Reserve(header.NumArgs);
        for(uint32 argIdx = 0; argIdx < header.NumArgs; ++argIdx)
        {
            const ArgCode argCode = (ArgCode)printReader.Consume<uint8>(0xFF);
            if(argCode >= NumDebugPrintArgCodes)
                break;

            const uint32 argSize = ArgCodeSizes[argCode];
            if(printReader.HasMoreData(argSize) == false)
                break;

            const std::string argStr = MakeArgString(printReader, argCode);
            ReplaceStringInPlace(formatStr, ArgPlaceHolders[argIdx], argStr);
        }

        GlobalApp->AddToLog(formatStr.c_str());
    }

    readbackBuffer.Unmap();
}
The processing involves a loop that iterates until all of the print data in the buffer has been processed. For each iteration of the loop, we pull out a DebugPrintHeader that marks the beginning of a single DebugPrint call from a single thread of a shader. That header is then used to gather the actual format string, as well as the number of arguments. Each argument is extracted using the code we embedded on the shader side, and a stringified version of the argument is then inserted into the original print string to replace the argument index. The final expanded string with arguments is then passed on to my simple log system, but at that point you can do whatever you'd like with it. In my case the log outputs to a simple ImGui text window, but you can get as fancy as you want.
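MakeArgString and ReplaceStringInPlace aren't shown above; a rough sketch of what they might look like follows, assuming the DebugPrintReader::Consume interface used in the loop (the real framework code may format things differently):

// Hypothetical helpers for the readback loop. MakeArgString consumes the raw
// argument bytes and converts them to text based on the ArgCode the shader
// embedded; ReplaceStringInPlace swaps "{N}" placeholders for that text.
// ArgCodeSizes (used above) would just be a lookup table of per-code byte
// sizes, e.g. { 4, 8, 12, 16, 4, 8, 12, 16, 4, 8, 12, 16 }.
static std::string MakeArgString(DebugPrintReader& reader, ArgCode argCode)
{
    switch(argCode)
    {
        case DebugPrint_Uint:  return std::to_string(reader.Consume<uint32>(0));
        case DebugPrint_Int:   return std::to_string(reader.Consume<int32>(0));
        case DebugPrint_Float: return std::to_string(reader.Consume<float>(0.0f));
        // ...vector codes would consume N elements and join them with ", "
        default:               return "<invalid arg>";
    }
}

static void ReplaceStringInPlace(std::string& str, const std::string& placeholder, const std::string& replacement)
{
    size_t pos = str.find(placeholder);
    while(pos != std::string::npos)
    {
        str.replace(pos, placeholder.length(), replacement);
        pos = str.find(placeholder, pos + replacement.length());
    }
}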
Going Beyond Printf
Having a shader printf is already hugely valuable and worthwhile, but once you have this kind of infrastructure in place there are all sorts of things you can do with it. For example, you can build a higher-level assert mechanism on top of DebugPrint that logs information when a condition isn't met. The __FILE__ and __LINE__ predefined macros work in DXC, so you can include those along with the condition itself in the log message. Unfortunately there's no way I know of to get any kind of "stack trace", so you're on your own for that. There also isn't a standard __debugbreak() intrinsic or anything like that for shaders. The best you can do is enter an infinite loop or force a deliberate page fault, and hope that your GPU crash reporting systems catch it appropriately.
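A hedged sketch of what such an assert could look like, built directly on top of the DebugPrint macro (the message format is just illustrative; __FILE__ would have to be baked into the format string rather than passed as an argument, since the argument path only handles ints, uints, and floats):

// Hypothetical assert macro: logs the line number when the condition fails.
#define DebugAssert(condition) do {                                  \
    if(!(condition))                                                 \
        DebugPrint("Assert failed at line {0}", uint(__LINE__));     \
} while(0)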
Another common trick is to implement shader-driven debug rendering. If you have a CPU-based debug renderer, then it's pretty easy to add special "print" code that can be detected on the CPU, which can then extract and forward the draw arguments to the debug renderer. The main downside of doing it this way is that you'll need to accept a few frames of latency between when the shader draws and when it shows up. For a lot of cases this is totally fine, but it can be a dealbreaker for certain debugging scenarios. A zero-latency alternative is to have a separate buffer for debug draws, and run a compute shader to process that and convert it into indirect draw commands.
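As a tiny sketch of the CPU-detected approach (the sentinel string and helper name are made up for illustration):

// Hypothetical shader-side helper that re-uses the print path to request a
// debug sphere draw. The CPU readback code would look for the "!DrawSphere"
// sentinel (or a dedicated arg code) and forward the arguments to the debug
// renderer instead of logging the message.
void DebugDrawSphere(float3 center, float radius, float3 color)
{
    DebugPrint("!DrawSphere {0} {1} {2}", center, radius, color);
}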
Once you take it to the extreme, your print system can basically become a deferred message-passing interface between your shaders and the engine. This could potentially be used for all kinds of powerful or wacky things if you're sufficiently motivated. For example, you could replicate a subset of the Dear ImGui interface in your shader, and give your shaders the ability to draw their own debug UI. The sky is the limit!
CR LF
That's it for the article! Hopefully what I've described is helpful for people implementing similar systems, or provides some ideas for how to improve existing ones. Good luck, and happy printing!