Introducing Hidet: A Deep Learning Compiler for Efficient Model Serving

2023-04-27 22:47:38

by

Team Hidet

Hidet is a powerful deep learning compiler that simplifies the process of implementing high-performing deep learning operators on modern accelerators (e.g., NVIDIA GPUs). With the new torch.compile(...) feature in PyTorch 2.0, integrating a novel compiler into PyTorch is easier than ever: Hidet can now be used as a torch.compile(...) backend to accelerate PyTorch models. This makes it an attractive option for PyTorch users who want to improve the inference performance of their models, especially those who also need to implement extremely optimized custom operators.

Using Hidet to Compile a PyTorch Model

To use Hidet in PyTorch, you first need to install the hidet package via pip:
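
pip install hidet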

Hidet is integrated with PyTorch as a torch.compile(...) backend following the Custom Backends tutorial. You can specify hidet as the backend when you compile a model. (Note: this requires PyTorch version 2.0+):

torch.compile(..., backend='hidet')

Hidet converts the given PyTorch model in the torch.fx.Graph format into its internal graph representation and conducts a series of optimizations. Hidet provides a few options to configure these optimizations. For example, we can use hidet.torch.dynamo_config.use_tensor_core(True) to allow Hidet to generate CUDA kernels that leverage the Tensor Cores on NVIDIA GPUs, and use hidet.torch.dynamo_config.search_space(2) to allow Hidet to search for the best operator schedule for your specific hardware and input sizes. More configurations can be found in Hidet's documentation.

Here is a complete example of how to use Hidet to compile and optimize a pre-trained ResNet50 model from torchvision:

import hidet
import torch

# Load a pre-trained ResNet50 model
x = torch.randn(1, 3, 224, 224, device="cuda").half()
model = torch.hub.load(
    'pytorch/vision:v0.6.0', 'resnet50', pretrained=True
).cuda().half().eval()

# Configure hidet to use Tensor Cores and enable tuning
hidet.torch.dynamo_config.use_tensor_core(True)
hidet.torch.dynamo_config.search_space(2)

# Compile the model using Hidet
model_opt = torch.compile(model, backend='hidet')

# Check correctness
torch.testing.assert_close(actual=model_opt(x), expected=model(x), rtol=1e-2, atol=1e-2)

# Benchmark
from hidet.utils import benchmark_func
print('eager: {:2f}'.format(benchmark_func(lambda: model(x))))
print('hidet: {:2f}'.format(benchmark_func(lambda: model_opt(x))))

We encourage you to try out the above script on your own NVIDIA GPU(s)! If you run this script on an aws.g5.2xlarge instance, you would get the result shown in the following figure. Hidet achieves the speedup because it can automatically fuse multiple operators, tune operator schedules, and use CUDA Graphs to reduce framework-level overhead. More results can be found in the ASPLOS'23 publication of Hidet and our performance tracking.

[Figure: eager vs. Hidet latency]

Using Hidet Script to Write Custom Operators

Hidet Script is one approach to implementing tensor operators in Python. The following example shows how to implement a naive matrix multiplication using Hidet Script and integrate it as a PyTorch operator.

import torch
import hidet


def matmul(m_size, n_size, k_size):
    from hidet.lang import f32, attr
    from hidet.lang.cuda import threadIdx, blockIdx, blockDim

    with hidet.script_module() as script_module:
        @hidet.script
        def matmul(
            a: f32[m_size, k_size],
            b: f32[k_size, n_size],
            c: f32[m_size, n_size]
        ):
            # Launch a 2D grid of 32x32 thread blocks; each thread computes one element of c
            attr.cuda_grid_dim = ((m_size + 31) // 32, (n_size + 31) // 32)
            attr.cuda_block_dim = (32, 32)
            i = threadIdx.x + blockIdx.x * blockDim.x
            j = threadIdx.y + blockIdx.y * blockDim.y
            if i < m_size and j < n_size:
                c[i, j] = 0.0
                for k in range(k_size):
                    c[i, j] += a[i, k] * b[k, j]

    ir_module = script_module.ir_module()
    func = hidet.driver.build_ir_module(ir_module)
    return func


class NaiveMatmul(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        m, k = a.shape
        k, n = b.shape
        c = torch.empty([m, n], dtype=a.dtype, device=a.device)
        # Build the kernel for this problem size and launch it on the output buffer
        func = matmul(m, n, k)
        func(a, b, c)
        return c


a = torch.randn([3, 4], device="cuda")
b = torch.randn([4, 5], device="cuda")
c = NaiveMatmul.apply(a, b)
cc = torch.matmul(a, b)
torch.testing.assert_close(c, cc)
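
Note that the example above rebuilds and recompiles the kernel on every forward call. As a small, purely illustrative tweak (not part of the original example), the compiled function can be cached per input shape so repeated calls with the same shapes reuse the kernel:

from functools import lru_cache

@lru_cache(maxsize=None)
def compiled_matmul(m_size, n_size, k_size):
    # Compile once per (m_size, n_size, k_size) and reuse the kernel afterwards
    return matmul(m_size, n_size, k_size)

Inside forward, calling compiled_matmul(m, n, k) in place of matmul(m, n, k) would then avoid the repeated compilation.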

More optimizations can be applied; see the example in our documentation to learn more.

Hidet Script vs. Triton: Triton greatly simplifies CUDA programming by introducing a tile-based programming model in which the unit of parallel execution is the thread block instead of the individual thread. However, this simplification also prevents tensor program developers from manipulating fine-grained compute and memory resources (e.g., warps, shared memory) in their preferred ways. It can be challenging to implement an optimization that requires fine-grained control of these resources using Triton if it has not already been implemented by the Triton compiler itself. Hidet Script, on the other hand, simplifies tensor programming while still enabling users to implement their own optimizations with extensive flexibility. It is worth noting that the more granular control of Hidet Script also brings added complexity compared to Triton.

More about Hidet

Hidet originates from a research project led by the EcoSystem lab at the University of Toronto (UofT) and AWS. The authors propose a new approach, named the task-mapping programming paradigm, to construct tensor programs. It aims to simplify tensor programming without sacrificing any optimization opportunity. Hidet is now an open-source project, jointly supported by CentML and the EcoSystem lab, that aims to provide an efficient solution for end-to-end inference on modern accelerators (e.g., NVIDIA GPUs).

More Resources

Acknowledgement

We would like to thank Jerry Park, Mark Saroufim, Jason Liang and Helen Suk for their valuable help in preparing the blog post and their feedback on the text. We would also like to thank Nikita Shulga, Jason Ansel, and Dmytro Dzhulgakov for reviewing and improving our PR https://github.com/pytorch/pytorch/pull/93873 on the third-party dynamo backend registration.


