GPU-accelerated SMT constraint solving – OSIRIS Lab at NYU Tandon
By applying conventional fuzzing strategies, we achieved high-throughput SMT constraint solving. We were able to reach 23 billion execs/s using GPU acceleration.
SMT solvers typically apply very sophisticated algorithms to determine whether a set of constraints is satisfiable (and to produce a solution). The approach we have taken is to abandon these fancy techniques and instead throw random data at the system over many, many thousands of threads. That is essentially what fuzzing is: we throw random data at a function in an attempt to find a crash. Instead of a crash, here we try to find a solution to the SMT system.
By choosing to brute-force the SMT solution, the problem quickly becomes: how do we maximize throughput?
https://github.com/moyix/fpsmt_gpu
The approach
GPU acceleration
GPUs offer the ability to spin up many thousands of threads, each doing pure computation. To understand how we took advantage of CUDA, we first need to talk at a very high level about how GPUs work. When you run a program on a GPU, it is essentially running on an entirely separate computer. The program running on the GPU is called the kernel. Even though both the host program and the CUDA kernel can be defined in the same files, there are a number of caveats to managing input and output. If you want to cross the boundary between the two, for example to retrieve GPU data back to the CPU, you use the provided cudaMemcpy function to move the buffer from device (GPU) memory to host (CPU) memory.
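For instance, copying a buffer of results from the GPU back to the host might look like this minimal sketch (the buffer names and size are made up for illustration, not taken from our code):

#include <cstdint>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  const size_t kNumBytes = 4096;
  uint8_t *host_buf = (uint8_t *)malloc(kNumBytes);
  uint8_t *device_buf = nullptr;

  cudaMalloc((void **)&device_buf, kNumBytes);  // allocate device (GPU) memory
  // ... launch a kernel that writes its output into device_buf ...
  cudaMemcpy(host_buf, device_buf, kNumBytes,
             cudaMemcpyDeviceToHost);           // copy device memory back to the host
  cudaFree(device_buf);
  free(host_buf);
  return 0;
}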
With CUDA programming, we need to add annotations to variables and functions to indicate whether they are meant to run on the host or on the GPU. Wherever you see __host__, that means the variable or function belongs to the host. Where you see __device__, the variable or function is meant for the GPU. These annotations define the boundaries between the host program and the CUDA kernel. When you are in a __device__ function on the GPU, you won't be able to access anything annotated with __host__, and vice versa.
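As a small sketch of how these annotations split up a program (illustrative code rather than anything from our solver):

#include <cuda_runtime.h>

// Runs only on the GPU; callable from kernels and other __device__ functions.
__device__ int square(int x) { return x * x; }

// Kernels are marked __global__: launched from the host, executed on the device.
__global__ void square_kernel(int *out) {
  out[threadIdx.x] = square(threadIdx.x);
}

// Runs only on the CPU (__host__ is the default when no annotation is given).
__host__ void launch() {
  int *device_out = nullptr;
  cudaMalloc((void **)&device_out, 32 * sizeof(int));
  square_kernel<<<1, 32>>>(device_out);
  cudaDeviceSynchronize();
  cudaFree(device_out);
}

int main() {
  launch();
  return 0;
}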
SMTLib
We have slightly modified the JFS SMT constraint solver to run on a CUDA-capable GPU. For the most part, this modification was simply changing the .cpp extensions to .cu and adding __device__ to things. Using this library, we can generate LLVMFuzzerTestOneInput functions from SMT theories. For example, the following SMT system
(set-info :smt-lib-version 2.6)
(set-logic QF_FP)
(declare-fun x () Float32)
(define-fun f1 () Float32 ((_ to_fp 8 24) #x3f800000))
(assert (fp.eq x f1))
(check-sat)
(exit)
Becomes…
#include "theory.h"
__device__ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  BufferRef<const uint8_t> jfs_buffer_ref = BufferRef<const uint8_t>(data, size);
  const Float<8, 24> x = makeFloatFrom<8, 24>(jfs_buffer_ref, 0, 31);
  uint64_t jfs_num_const_sat = 0;
  const BitVector<1> jfs_ssa_0 = BitVector<1>(UINT64_C(0));
  const BitVector<8> jfs_ssa_1 = BitVector<8>(UINT64_C(127));
  const BitVector<23> jfs_ssa_2 = BitVector<23>(UINT64_C(0));
  const Float<8, 24> jfs_ssa_3 = Float<8, 24>(jfs_ssa_0, jfs_ssa_1, jfs_ssa_2);
  const bool jfs_ssa_4 = x.ieeeEquals(jfs_ssa_3);
  if (jfs_ssa_4) {
    ++jfs_num_const_sat;
  }
  if (jfs_num_const_sat == 1) {
    // Fuzzing target
    return 1;
  } else {
    return 0;
  }
}
Notice the __device__ annotation next to LLVMFuzzerTestOneInput. That lets CUDA know that the function should run on the GPU rather than on the host. Each of our GPU threads will call the LLVMFuzzerTestOneInput function in a loop.
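In rough outline, the driver kernel each thread runs looks something like the sketch below; the constants and the generate_random_input helper are illustrative stand-ins, not our exact implementation:

#include <cstddef>
#include <cstdint>

#define INPUT_SIZE 16  // illustrative input size

// Generated by the modified JFS from the SMT theory (see above).
__device__ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);
// Fills buf with pseudorandom bytes; provided by one of the RNG schemes below.
__device__ void generate_random_input(uint8_t *buf, uint64_t counter);

__global__ void fuzz_kernel(uint64_t iterations, int *found) {
  uint8_t input[INPUT_SIZE];  // per-thread candidate assignment

  for (uint64_t i = 0; i < iterations && *found == 0; i++) {
    generate_random_input(input, i);               // fresh random bytes
    if (LLVMFuzzerTestOneInput(input, INPUT_SIZE)) {
      atomicExch(found, 1);                        // a satisfying assignment was found
    }
  }
}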
Fuzzing
Our approach to fuzzing the SMT formulas is relatively simple. We choose some random number generator to generate input, then throw it at the SMT formula. There is no feedback or corpus system for potentially improving the quality of the input. This is referred to as “blind” fuzzing.
We did not have an opportunity to test coverage-guided fuzzing. We would expect it to significantly reduce the throughput we currently achieve. Whether the increased efficiency of the fuzzing outweighs the reduced throughput, we can't say. That is to say nothing of the complexity of implementing such systems in a multi-GPU setup.
Random Number Generation
We tested several methods of random number generation: reduced-round AES (specifically, two rounds), CHAM, and cuRAND. Since we don't require a cryptographically secure pseudorandom number generator, we can get away with using encryption algorithms like AES and CHAM. The AES variant we used is CTR mode. The structure of our fuzzing loop varied depending on which number generator we were using.
The way we used AES and CHAM was fairly simple. We initialize a key for each thread from urandom. We then cudaMemcpy the keys into a global buffer on the GPU. In our fuzzing loop, we encrypt a block buffer and then throw it at LLVMFuzzerTestOneInput. The urandom initialization is an extra step in the host process before launching the CUDA kernel. Beyond initialization, it is important that all random number generation happens on the GPU.
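Sketched out, with a hypothetical encrypt_block function standing in for the reduced-round AES and CHAM implementations (names and sizes here are illustrative, not our exact code):

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define NUM_THREADS 4096
#define KEY_BYTES   16
#define BLOCK_BYTES 16  // one block easily covers the example's 32-bit input

// Hypothetical stand-in for the reduced-round AES / CHAM block cipher.
__device__ void encrypt_block(const uint8_t *key, uint8_t *block);

// Per-thread keys, filled in by the host before the kernel launches.
__device__ uint8_t g_keys[NUM_THREADS * KEY_BYTES];

// Host side: draw one key per thread from /dev/urandom and copy them all
// into the device-global key buffer.
void init_keys() {
  static uint8_t host_keys[NUM_THREADS * KEY_BYTES];
  FILE *urandom = fopen("/dev/urandom", "rb");
  fread(host_keys, 1, sizeof(host_keys), urandom);
  fclose(urandom);
  cudaMemcpyToSymbol(g_keys, host_keys, sizeof(host_keys));
}

// Device side: each iteration, encrypt a counter block under this thread's
// key (CTR style) to get the next batch of pseudorandom bytes.
__device__ void generate_random_input(uint8_t *buf, uint64_t counter) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  memset(buf, 0, BLOCK_BYTES);
  memcpy(buf, &counter, sizeof(counter));        // block = counter value
  encrypt_block(&g_keys[tid * KEY_BYTES], buf);  // buf now holds random bytes
}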
cuRAND is a library for random number generation with support for generating random numbers on the GPU and storing them in GPU memory. There was no additional initialization step on the host before launching the CUDA kernel.
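With the cuRAND device API, each thread keeps its own generator state and draws random words directly inside the kernel, roughly like this (again, a sketch rather than our exact code):

#include <cstddef>
#include <cstdint>
#include <curand_kernel.h>

__device__ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);

__global__ void fuzz_kernel_curand(uint64_t iterations, unsigned long long seed,
                                   int *found) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  // One generator state per thread, initialized on the device itself.
  curandState state;
  curand_init(seed, tid, 0, &state);  // seed, sequence number, offset

  for (uint64_t i = 0; i < iterations && *found == 0; i++) {
    // Draw a 32-bit candidate; enough for the example's single Float32 variable.
    unsigned int input = curand(&state);
    if (LLVMFuzzerTestOneInput((const uint8_t *)&input, sizeof(input))) {
      atomicExch(found, 1);
    }
  }
}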
A word on heat
As with nearly any computationally intensive process, heat is the enemy. When we run our fuzzer, we use very little VRAM but maintain 100% core utilization the whole time. There is no IO to slow things down; each thread simply moves from one computation to the next.
We initially used a single K80 to test on. As it turns out, when a K80 die temperature hits 100°C, it shuts off and won't be usable until you fully reboot the machine. We needed to rework the entire cooling solution of the one server we worked on specifically to deal with the heat from the GPU.
In later tests we used a dual RTX 3090 machine. In this setup, thermal throttling was the only real issue.
Results
The name of the game here is speed. The best configuration is the one that maximizes executions per second, that is, how many times we can run our fuzz loop in a second across all threads. We created some simple benchmarks where we run the fuzzer with a set number of iterations of the fuzzing loop in each thread, for each of the random number generators we used. Since we always launch the same number of threads and run the loop the same number of times in each, we have a fixed number of computations per run.
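Concretely, the measurement boils down to total work divided by wall-clock time; a benchmark harness along these lines (with made-up numbers, not our actual configuration) illustrates the calculation:

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const uint64_t num_threads = 65536;        // blocks * threads per block
  const uint64_t iters_per_thread = 1 << 20; // fixed loop count in each thread

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);
  // fuzz_kernel<<<256, 256>>>(iters_per_thread, found);  // launch the fuzzing kernel
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);    // elapsed wall-clock time in milliseconds

  double execs_per_sec = (double)(num_threads * iters_per_thread) / (ms / 1000.0);
  printf("%.2f billion execs/s\n", execs_per_sec / 1e9);
  return 0;
}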
Over 750 runs of each number generator on the dual 3090 machine, we can see that cuRAND (24 billion execs/second) is a bit faster than CHAM (20 billion execs/second), with AES (8 billion execs/second) being significantly slower than both.
Another way to view this is the amount of time it takes