All GB/s without FLOPS – Nvidia CMP 170HX Review, Performance Lockdown Workaround, Teardown, Watercooling, and Repair

2023-10-28 23:30:13

Introduction

In 2021, at the peak of cryptocurrency mining, Nvidia launched the Nvidia
CMP 170HX. Designed as a compute-only card to accelerate Ethereum's
memory-hard Ethash Proof-of-Work mining algorithm with its 1500 GB/s
HBM2e memory bus, Nvidia implemented the hardware using the GA100 silicon
from their Ampere architecture. Thus, the CMP 170HX is essentially a
variant of the almighty Nvidia A100, Nvidia's top-performing datacenter
GPU at the time.

Naturally, the existence of the CMP 170HX raised many questions, including
its potential in applications beyond mining. Today, following the
discontinuation of Ethash, these $5000 GPUs from closed mining farms are
sold on second-hand markets for $400-$500 in China. It's time to
answer these questions.

This article contains a basic performance overview, a hardware teardown,
a watercooling installation guide, and a repair log.

Performance Overview

Hardware Specification

The Nvidia CMP 170HX contains 8960 shaders. Without artificial limitations
on instruction throughput, these shaders should reach a peak FP32 performance
of 25 TFLOPS.

------------------------ Device specifications ------------------------
Device:              NVIDIA Graphics Device
CUDA driver version: 12.20
GPU clock rate:      1410 MHz
Memory clock rate:   729 MHz
Memory bus width:    4096 bits
WarpSize:            32
L2 cache size:       32768 KB
Total global mem:    7961 MB
ECC enabled:         No
Compute Capability:  8.0
Total SPs:           8960 (70 MPs x 128 SPs/MP)
Compute throughput:  25267.20 GFlops (theoretical single precision FMAs)
Memory bandwidth:    1492.99 GB/sec
-----------------------------------------------------------------------
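(The reported compute throughput is simply the shader count times the clock, at 2 FLOPs per FMA per cycle: 8960 × 2 × 1410 MHz ≈ 25267 GFLOPS, i.e. the 25 TFLOPS quoted above.)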

Memory Bandwidth

clpeak shows the Nvidia CMP 170HX has a peak memory bandwidth of 1355 GB/s for
real, without any downsizing. Thanks to the use of HBM2e and a wide bus, this is
currently the GPU with the fastest memory that money can buy at a reasonable price.
The speed is almost identical to the Nvidia A100 (HBM2e), and ~1.5x as fast as
the current top consumer-grade GPUs, including the Nvidia RTX 4090 (GDDR6X),
AMD Radeon 7900 XTX (GDDR6), or the AMD Radeon VII / Instinct MI50 (HBM2), which
all top out at "only" 700-800 GB/s with their 1 TB/s memory buses.

Global memory bandwidth (GBPS)
  float   : 1165.79
  float2  : 1269.69
  float4  : 1343.50
  float8  : 1355.40
  float16 : 1350.14

However, this is where all the good news ends.

Floating-Point Performance

FP32 & FP64 with FMA (clpeak, standard version)

To measure the peak floating-point performance of a processor, most benchmarks
use the Multiply-Add operation in the form of a * b + c in their compute
kernels (i.e. the inner loop), which is accelerated by the Fused Multiply-Add (FMA)
or Multiply-Add (MAD) instructions on most processors – clpeak is no
exception.
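As an illustration only (a minimal sketch of the technique, not clpeak's actual
source), such a kernel chains dependent multiply-adds so the compiler emits
back-to-back FMA/MAD instructions:

/* Sketch of a clpeak-style peak-FLOPS inner loop (illustrative, not the real code).
 * Each mad(a, b, c) computes a * b + c and maps to an FMA/MAD instruction. */
__kernel void peak_fma(__global float *out, float seed)
{
    float x = seed, y = (float)get_global_id(0);
    for (int i = 0; i < 1024; i++) {
        x = mad(y, x, y);  x = mad(y, x, y);
        x = mad(y, x, y);  x = mad(y, x, y);
        x = mad(y, x, y);  x = mad(y, x, y);
        x = mad(y, x, y);  x = mad(y, x, y);
    }
    out[get_global_id(0)] = x; /* keep the result live so the loop isn't optimized away */
}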

Unfortunately, tests show heavy performance restrictions on FMA and MAD
floating-point operations, with an FP32 FMA speed of 394 GFLOPS and an FP64 FMA speed of 182
GFLOPS. The FP32 FMA performance is 1/100 of today's high-end GPUs. To put
this into perspective, this is not only slower than the Nvidia GTX 280 GPU
from 15 years ago – it's even slower than a 2020s multi-core CPU (> 1 TFLOPS).

This kind of restriction is often applied to FP64 performance on consumer-grade
GPUs to prevent their use in computing workloads, but on the CMP 170HX,
FP32 basically got the same treatment.

Single-precision compute (GFLOPS)
  float   : 394.77
  float2  : 394.77
  float4  : 394.77
  float8  : 394.77
  float16 : 394.77

No half precision support! Skipped

Double-precision compute (GFLOPS)
  double   : 182.72
  double2  : 182.55
  double4  : 182.19
  double8  : 181.48
  double16 : 180.08

FP32 and FP64 without FMA (clpeak, modified version)

Upon further investigation using a self-written benchmark, I found that regular
FP32 add or multiply is not restricted by nearly as much. Thus, using a
non-standard, patched version of clpeak that avoids FMA instructions
entirely, the following results are obtained.

Single-precision compute (GFLOPS)
  float   : 6285.48
  float2  : 6287.64
  float4  : 6294.30
  float8  : 6268.80
  float16 : 6252.80

No half precision support! Skipped

Double-precision compute (GFLOPS)
  double   : 94.98
  double2  : 94.93
  double4  : 94.84
  double8  : 94.65
  double16 : 94.29

Without FMA, the FP32 performance is around 6250 GFLOPS. 6250 GFLOPS is
around the speed of a current low-end desktop GPU such as an RTX 2060, or
an old high-end GPU from yesteryear, such as a Titan X (Maxwell).

This is rather low but should be very acceptable, since many physics
simulations for scientific and engineering applications don't benefit
from improved core performance in terms of FLOPS, and are instead
primarily limited by memory bandwidth.

On the other hand, FP64 performance is now even lower without FMA, at a
nearly useless level of 94 GFLOPS. Thus, one may argue that it's
useless for HPC purposes. However, the use of FP64 is not always
required, depending on the research subject. For many Partial Differential
Equation solvers for physics simulation, FP32 is in fact the standard.
First, the meshing and discretization errors are often much larger
sources of error, not to mention that FP64 would slow down these already
memory-bound simulations even further. Thus, FP32-only GPUs still have
their uses.

Unfortunately, in the case of the CMP 170HX, it can be difficult if not
impossible to disable FMA in practice. Modifying the applications (and
possibly also their runtimes) is already challenging enough, but even when
both are feasible (and they often are not), the problem of
standard libraries remains. Most HPC libraries contain hand-tuned
routines that heavily rely on FMA.
Not to mention that removing FMA changes the rounding behavior from
1 rounding to 2 roundings,
thus potentially compromising the numerical integrity of a simulation
and making it untrustworthy if the code was written in a way that
requires FMA to preserve precision
(see the Appendix at the end of the
article for the technical details of modifying the OpenCL/SYCL/CUDA
runtime).

Thus, to achieve the goal of preventing the use of these GPUs in general
compute applications, Nvidia decided to simply throttle the FMA
throughput. This way, these GPUs are incompatible with most pre-existing
software in existence, without using more invasive performance
restrictions that would affect their intended mining applications.

FP16 (mixbench, SYCL version)

Half-precision is unsupported by Nvidia's OpenCL runtime, but it's accessible
via CUDA or SYCL (with a CUDA backend). According to mixbench, FP16
performance is restricted in the same manner, to a level similar to a
low-end GPU: ~42 TFLOPS (41869 GFLOPS). The use of FMA or not doesn't
seem to affect the result.

Mixed-Precision

For mixed-precision FP32 computation, gpu_burn achieved 6.2 TFLOPS
when the "Tensor Cores" option is enabled. When gpu_burn is instructed
to use Tensor Cores, the flag CUBLAS_TENSOR_OP_MATH (a.k.a.
CUBLAS_COMPUTE_32F_FAST_16F) is set before calling the matrix multiplication
routines in cuBLAS.
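In cuBLAS terms, that corresponds roughly to the following host-side sequence
(a simplified sketch of the idea, not gpu_burn's verbatim code; error checking
omitted):

/* Sketch: opt a cuBLAS handle into Tensor-Core math before an FP32 GEMM. */
#include <cublas_v2.h>

void sgemm_with_tensor_ops(int n, const float *A, const float *B, float *C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    /* allow cuBLAS to use FP16 Tensor Core paths for FP32 GEMM */
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
}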

$ ./gpu_burn -d -tc 43200
Using compare file: compare.ptx
Burning for 43200 seconds.
GPU 0: NVIDIA Graphics Device
Initialized device 0 with 7961 MB of memory (7660 MB available, using 6894 MB of it), using FLOATS, using Tensor Cores
Results are 268435456 bytes each, thus performing 24 iterations
10.0%  proc'd: 24504 (6236 Gflop/s)   errors: 0   temps: 32 C 
        Summary at:   Tue Oct 24 23:21:38 UTC 2023

Since the performance is the same as the standard non-FMA FP32 throughput
as I previously showed, I have to conclude that the Tensor Cores probably
aren't working, or at least aren't helpful. All CUBLAS_TENSOR_OP_MATH does is
prevent cuBLAS from using the standard FP32 FMA code path, thereby
avoiding the FMA performance lockdown, but it has no additional benefits.

Other Tensor Core modes are currently untested; more tests are needed
to confirm the status of the Tensor Cores. Benchmark suggestions are welcome.

Integer Performance

Given this GPU's focus on mining, which is an integer workload,
let's also take a look at its integer performance.

clpeak

The Nvidia CMP 170HX shows reasonable integer performance of 12500
GIOPS (12.5 TIOPS), almost as fast as the RTX 2080 Ti (15 TIOPS), and
62.5% as fast as the Nvidia A100 (20 TIOPS).

Integer compute (GIOPS)
  int   : 12499.07
  int2  : 12518.02
  int4  : 12494.16
  int8  : 12547.55
  int16 : 12540.22

Hashcat

As previously mentioned, standard tests and benchmarks use floating-point
operations, which are unsuitable for stress-testing this GPU. On the other
hand, an all-integer workload like hashcat ought to do it:

-------------------
* Hash-Mode 0 (MD5)
-------------------

Speed.#1.........: 43930.0 MH/s (53.01ms) @ Accel:64 Loops:512 Thr:1024 Vec:1

The hashcat performance closely mirrors the clpeak result: it's 67% as fast
as the Nvidia A100 (~64900 MH/s). Still, password crackers should probably
look elsewhere, as even an RTX 3080 runs faster at 54000.1 MH/s.

Remember that hashing or password cracking is a purely compute-bound application,
for which the HBM2e memory is not helpful. Ethereum's Ethash is memory-bound
only because it involves constantly reading a large Directed Acyclic Graph (DAG).

Simulation Performance

Many physics simulations are memory-bound; most already run at 1%
to 10% of a CPU's peak FLOPS with no hope of saturating the FPU, due to the iterative
timestepping nature of the algorithms. Thus, many programmers in this field (myself
included) dislike the FLOPS-maximalist angle on processor performance, seeing
it as a shallow view taken by outsiders.

Thus, perhaps the slow FLOPS of the CMP 170HX would be a non-issue, or at least not a
serious concern, for these use cases? Since this is what I bought this GPU for, let's
give it a try.

FluidX3D (with FMA)

FluidX3D is a Lattice-Boltzmann Method (LBM) solver in the field of Computational Fluid
Dynamics (CFD). The kernel of this algorithm is a 3-dimensional, 19-point stencil.
Its performance in FP32/FP32 mode is:

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2276 |    348 GB/s |       136 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2276                                                   |

At 2276 MLUPs/s, the performance is similar to the GeForce RTX 3060 (2108 MLUPs/s),
even though the RTX 3060's FP32 throughput is 12500 GFLOPS. In this application,
the RTX 3060's 360 GB/s of memory bandwidth severely limits its performance,
allowing the CMP 170HX to match it even though it's much slower in terms of FP32 FLOPS.

Unfortunately, the 1500 GB/s HBM2e didn't enable the CMP 170HX to become the world's
top-performing FluidX3D platform either, because it has the opposite problem –
the locked-down FP32 performance is so slow on this GPU that it turns the Lattice Boltzmann
Method from a memory-bound kernel into a compute-bound kernel; the GPU can't work
faster than 348 GB/s.

This is further illustrated by running FluidX3D in two bandwidth-saving,
mixed-precision modes.

FP32/FP16S:

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2250 |    173 GB/s |       134 |         9996  60% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2250

FP32/FP16C:

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    2266 |    174 GB/s |       135 |         9999  90% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 2266

As one can see, running simulations at reduced precision to reduce memory traffic
has no effect on simulation speed on the CMP 170HX. On the other hand, FluidX3D's
performance doubles on the RTX 3060, to 4070 MLUPs/s (FP32/FP16S) and 3566 (FP32/FP16C).
Still, note that the CMP 170HX is only 2x slower than the RTX 3060, not 30x slower
as its FLOPS may suggest.

FluidX3D (without FMA)

Thanks to hints from FluidX3D's author Dr. Moritz Lehmann, I was able to remove
the FMA usages from FluidX3D by modifying two lines of code.

After this modification, FluidX3D is able to unleash the full power of Nvidia's GA100
silicon in FP32/FP32 mode:

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    7681 |   1175 GB/s |       458 |         9985  50% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 7684

This is 90% as fast as a real Nvidia A100's 8526 MLUPs/s (40 GB, PCIe version).

According to Dr. Moritz Lehmann, although FluidX3D uses FMA to improve performance
and to avoid unnecessary loss of precision, the main errors in a simulation are
discretization errors, not floating-point errors. Thus, with the no-FMA workaround,
the CMP 170HX becomes a performance-leading GPU for FluidX3D simulations.

In mixed-precision FP32/FP16S mode, the CMP 170HX's performance is equally impressive.
The simulation speed is 80% as fast as the Nvidia A100 (16035 MLUPs/s), and 10% faster
than the RTX 4090 (11091 MLUPs/s), thanks to GA100's high memory bandwidth:

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|   12386 |    954 GB/s |       738 |         9997  70% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 12392                                                  |

The mixed-precision FP32/FP16C mode doesn't perform as well, likely due to the use
of a custom floating-point format that needs many more calculations, in a way
that exceeds the 6200 GFLOPS hardware restriction (it's also possible that other
floating-point operations have different limits).

|---------.-------'-----.-----------.-------------------.---------------------|
| MLUPs   | Bandwidth   | Steps/s   | Current Step      | Time Remaining      |
|    6853 |    528 GB/s |       408 |         9985  50% |                  0s |
|---------'-------------'-----------'-------------------'---------------------|
| Info: Peak MLUPs/s = 6859                                                   |

However this shouldn’t be a critical limitation because the accuracy of FP32/FP16C and
FP32/FP16S are comparable in most however just a few edge instances.

FDTD (work in progress, to be added into openEMS)

openEMS is a Finite-Difference Time-Domain (FDTD) solver in the field of Computational
Electromagnetics (CEM) for simulating Maxwell's equations. The kernel of this
algorithm is a 3-dimensional, 7-point stencil.
I'm currently working on a GPU-accelerated engine to increase its performance, with the
hope of advancing the state of the art of free and open circuit board design tools.

Let's run
my minimal test kernel (disclaimer: it's not meant to represent practical performance,
it's only made for development use).

$ ./test1.elf 
simulate 16777216 cells for 1000 timesteps.
speed: 1156992 MiB/s
speed: 10110 MC/s in 1.66 seconds

Here, the CMP 170HX has a win.
This portable kernel was written in SYCL and was previously developed on the
AMD Radeon VII (Instinct MI50), where I saw a speed of around 6000 MC/s (megacells per second)
thanks to its 4096-bit 1000 GB/s (theoretical) HBM2 memory. With the Nvidia CMP
170HX's 1500 GB/s (theoretical) HBM2e memory, I'm really getting a 1.5x speedup.

It's worth noting that this speedup was demonstrated using the original source
code without disabling FMA. It's well known that FDTD in its basic form has an
extreme memory bottleneck, even more extreme than LBM. The benchmark result
shows that this is indeed the case (work is ongoing on advanced optimizations, which
I'll share in the future on this blog).

Note: Memory bandwidth is traditionally benchmarked in base-10 sizes, but
my FDTD benchmark uses base-2 sizes, thus 1156992 MiB/s appears lower than
the 1300 GB/s I previously reported.

Power Consumption

Due to the FMA floating-point performance lockdown, the GPU runs so slowly
that it can't even be properly stress tested using standard tools.
The power consumption of gpu_burn for FP32 and FP64 is around 60 watts.
When running mixed-precision FP32/FP16 workloads (and likely also for
non-FMA FP32 workloads) by enabling Tensor Cores in gpu_burn, the power
consumption is around 75 watts with spikes to 100+ watts.

However, this is still far from the 250-watt TDP or TBP of the GPU.
Before I realized this situation, the low power consumption and slow speed
initially made me suspect that the low FP32 performance was caused by thermal
or power throttling due to a hardware failure (an unrelated one indeed
occurred and was fixed later), before I found another forum post with confirmed
AIDA64 benchmark results on Windows that also mentioned the low power consumption.
This confirmed my results – the cause was the FP32 lockdown, and the effect is
low power consumption, not vice versa.

To put a reasonable level of stress on the GPU, other kinds of benchmarks
should be used as stress tests – an integer benchmark or a memory benchmark.
A typical integer workload such as a password cracker like hashcat can push
the power consumption to 160+ watts. Integer-only machine-learning inference
may be another choice – but I was unable to find a representative example –
comments welcome. An alternative way is generating maximum memory traffic –
I found a self-written STREAM-like program can also push the power consumption
to 160+ watts.
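For reference, the kind of STREAM-like kernel I mean is trivial: a minimal sketch
(illustrative, not my exact program) in OpenCL C looks like this and is purely
bandwidth-bound:

/* STREAM-triad-style kernel: 8 bytes read + 4 bytes written per work-item,
 * with almost no arithmetic, so it stresses the HBM2e rather than the shaders. */
__kernel void triad(__global const float *a, __global const float *b,
                    __global float *c, float scalar)
{
    size_t i = get_global_id(0);
    c[i] = a[i] + scalar * b[i];
}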

But note that neither kind of test can put the highest possible load
on the GPU – to do that, you have to stress both the core with a heavy
integer workload and the memory with heavy streaming traffic – the GPU
was made for Ethereum mining after all. Thus, mining should always have
the highest power consumption (which isn't tested here).

When FMA is disabled using a modified copy of FluidX3D, and when FP32/FP16S
mode is used to increase the arithmetic intensity of the simulation, power
consumption as high as 180 watts was observed in nvidia-smi.

I won't be surprised if a carefully crafted hypothetical integer-heavy plus
memory-heavy workload can push the GPU above 200 watts.

I/O Performance

The CMP 170HX's PCIe bandwidth is limited to PCIe 1.0 x4, at 0.85 GB/s. Thus,
if the application performs frequent data exchanges between CPU and GPU, or
between different GPUs, it may be too slow to be acceptable – exactly according
to Nvidia's plan.

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 0.85
  enqueueReadBuffer               : 0.84
  enqueueWriteBuffer non-blocking : 0.85
  enqueueReadBuffer non-blocking  : 0.83
  enqueueMapBuffer(for read)      : 0.83
    memcpy from mapped ptr        : 10.43
  enqueueUnmap(after write)       : 0.85
    memcpy to mapped ptr          : 10.43

There are in fact two layers of PCIe bandwidth lockdown. The first layer is the
on-chip fuse, hardware, or VBIOS lockdown (who knows) to PCIe Gen 1.

The next is the circuit-board-level lockdown from x16 to x4, achieved by omitting
the AC coupling capacitors on some PCIe lanes to force a link degradation.

LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM not supported
        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkSta: Speed 2.5GT/s, Width x4 (downgraded)
        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
         Retimer- 2Retimers- CrosslinkRes: unsupported

nvidia-smi reports the following:

GPU Link Info
    PCIe Generation
        Max                       : 2
        Current                   : 1
        Device Current            : 1
        Device Max                : 1
        Host Max                  : 3
    Link Width
        Max                       : 16x
        Current                   : 4x

The PCB-level lockdown should be trivially bypassable by a hardware modification,
soldering the missing 0402 AC coupling capacitors onto the extra PCIe lanes,
but due to the hardware lockdown, it would still operate at PCIe Gen1 (4 GB/s at x16)
with limited usefulness, so I didn't attempt that (I don't want to dismantle
my watercooled test bench again). But apparently users of other CMP card models
have reported successes.

Resizable BAR is unavailable; the capability is surprisingly reported, but the
largest BAR size is 64 MiB.

Capabilities: [bb0 v1] Physical Resizable BAR
        BAR 0: current size: 16MB, supported: 16MB
        BAR 1: current size: 64MB, supported: 64MB
        BAR 3: current size: 32MB, supported: 32MB

Finally, because the CMP 170HX uses the same Nvidia A100 circuit board, the gold
fingers of the NV-Link interface exist, but the feature is unsupported, with all
related components unpopulated on the PCB.

Discussion

Machine Balance and Arithmetic Intensity

As already mentioned, even with the CMP 170HX's locked-down
FP32 performance, there are still some floating-point workloads that can be
accelerated by the CMP 170HX – in some cases even at the full HBM2e memory
speed, beating the top consumer-grade GPU, the RTX 4090.
In the most extreme case (as in my FDTD demo), this can be done without
bypassing the FMA throttling. One can often find such workloads in memory-bound physics
simulations. To understand why, one needs to use the concepts of
machine balance, arithmetic intensity, and roofline analysis.

Machine balance describes the balance between a processor's memory and
floating-point throughput. In an ideal world, this balance would be 1:1,
so a processor can read or write memory as fast as it can perform an arithmetic
operation. However, processors were and are improving at a faster rate than
DRAM bandwidth, for several reasons: the time it takes for the sense
amplifier to read bits from capacitors, the impracticality of integrating
large DRAM and a CPU on the same silicon, or the resulting interconnect bottleneck
between a processor and DRAM, as off-chip data transfer is always slower than
on-chip access.

Processor and machine balance increasing, making communication relatively more expensive. Plot for 64-bit floating point data movement & operations; bandwidth from CPU or GPU memory to registers. Data from vendor specs and STREAM benchmark

Hence, modern CPUs and GPUs tend to have a machine balance of 100:1 (for FP64).
That is, one can only achieve the peak FLOPS of the processor if 100 floating-point operations
are performed for every number read from memory – in other words (no pun intended), a compute
kernel should have a high Arithmetic Intensity. Arithmetic Intensity is the
balance between a program's memory access operations and floating-point
operations. For example, if a program needs to read 4 FP32s to calculate their
average (using 3 adds and 1 multiplication), its arithmetic intensity is 0.25 FLOPS/byte.
That is, arithmetic intensity can be understood as a software balance,
analogous to the machine balance.
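(Working out that example: 3 additions plus 1 multiplication is 4 FLOPs, against 4 × 4 bytes = 16 bytes read, so 4 / 16 = 0.25 FLOPS/byte, ignoring the single 4-byte write of the result.)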

In practical programs, the programmer will attempt to do as many useful computations
as possible after memory is read, using many tricks. But ultimately, the achievable
intensity depends on the algorithm itself. For example, a good matrix multiplication
program (the archetypal test is the LINPACK benchmark) is purely compute-bound, as
there is a whole lot of multiplication to do after reading a single submatrix.
Memory bandwidth doesn't matter in this application.

Arithmetic Intensity of different algorithms. Physics simulations using stencil computation have low arithmetic intensity below 1 FLOPS/byte, dense linear algebra (BLAS 3) has high Arithmetic Intensity over O(10) FLOPS/byte

On the other hand, many kinds of scientific and engineering simulations are memory
bandwidth bound. All they do is read the simulation domain from memory, move it
forward by 1 timestep, and write it back to memory (think of a 2D or 3D convolution
function in which the value of a single cell depends on the values of its surrounding
cells, or think of a naive implementation of Conway's Game of Life – real
physics simulations of partial differential equations often work like that). The
arithmetic intensity is often under 1.0 and performs poorly on modern hardware. The
usable performance is 1% to 10% of the peak performance due to the bandwidth limitation,
which is far from what the processor core itself can do.

This is the infamous memory wall problem – to be precise, the memory bandwidth
wall
problem (which is different from the memory latency wall problem).

Roofline Model

Since memory-bound number-crunching code is so prevalent in the scientific
computing (or High-Performance Computing) world, practitioners established
a simple graphical method named the roofline model as a quick guideline for
program optimization, using only the ratio between floating-point operations
and memory accesses.

A simplified graph of the roofline model

A naive roofline model is simple to graph. First, divide a processor's peak
floating-point throughput by its peak memory bandwidth. This gives the
critical arithmetic intensity at which a memory-bound algorithm becomes a
compute-bound one, called the ridge point. Then, plot two segments:
one from the ridge point to the origin on the left, the other a constant
function equal to the machine's peak floating-point performance, from the
ridge point to the right. Then rescale the chart on a log-log scale.
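For the CMP 170HX, using the numbers measured earlier in this article, the ridge point sits at roughly 6285 GFLOPS / 1355 GB/s ≈ 4.6 FLOPS/byte without FMA, or 394 GFLOPS / 1355 GB/s ≈ 0.3 FLOPS/byte with the FMA lockdown in effect – the two thresholds quoted later in the conclusion.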

Next, calculate the number of floating-point operations and the arithmetic
intensity of your application and mark it on the plot. Many number-crunching
kernels are simple; one can often look at the inner loop and count them
by hand, so only the source code and its runtime are needed for this. To
automate this process, one can use CPU performance counters and static
analysis as well.

In the case of my CMP 170HX, I found my FDTD kernel performs 18 floating-point
operations per cell, with an arithmetic intensity of 0.25 FLOPS/byte. At
10110 million cell updates per second, the achieved performance is 181.98 GFLOPS.
For FluidX3D, 1 cell update takes 363 arithmetic operations, consisting of 261
FP32 flops + 102 INT32 ops, with 153 bytes of memory traffic. If one ignores the
INT32 bottleneck (it's much faster than FP32 on the CMP 170HX),
the arithmetic intensity would be 1.7 FLOPS/byte, and the achieved performance
at 7684 million cell updates per second is 1979.685 GFLOPS. Finally, I can create
the following roofline model of this GPU with applications to two kernels:
FDTD and FluidX3D (FP32/FP32, no FMA).
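(As a cross-check of the FDTD point: 18 FLOPs/cell at 0.25 FLOPS/byte implies 72 bytes of memory traffic per cell, and 18 FLOPs/cell × 10110 × 10⁶ cells/s = 181.98 GFLOPS.)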

Example of a hardware roofline curve and data points of different compute kernels

As one can see, even though the computing power is limited to 6 TFLOPS,
similar to a low-end desktop GPU, both applications can greatly
benefit from the 1350 GB/s memory bandwidth. Naive FDTD, due to
its extremely low arithmetic intensity, can't even saturate the
394 GFLOPS FMA performance, explaining why it's unaffected by Nvidia's
throttling.

Conclusion

Nvidia knows the hardware specifications of the GA100 silicon are attractive
to all HPC and AI users beyond mining, so they tried as best as they
could to make this GPU completely useless for those purposes, from multiple
angles.

Firstly, the FP32 FMA throughput restriction makes the
GPU nearly useless for running ready-made software. As demonstrated
here, this performance limit can sometimes be worked around by disabling
FMA, making it acceptable for many memory-bound simulations. Unfortunately,
due to the existence of other limitations, such as its small 8 GiB memory
size, PCIe 1.0 x4 I/O bandwidth, the lack of NV-Link, and the lack of
FP64 capability, these measures guaranteed the uselessness of the GPU
in the overwhelming majority of applications.

Overall, this GPU is mostly useless. It can still be useful only when
you have a specific niche in which…

  1. Small-to-medium size runs fit in 8 GiB.
  2. The computation is mostly in-situ, without any PCIe or NVLink transfers.
  3. You have a floating-point algorithm that is so extremely memory-bound that FLOPS is
    completely irrelevant.

    • Without disabling FMA, with an Arithmetic Intensity (AI) under 0.3 (FP32).
    • After disabling FMA, with an Arithmetic Intensity under 4.6, though this possibly
      requires modifications to standard and third-party libraries.
  4. Or you have a memory-bound, all-integer workload (e.g. mining)…

Only then may it be wise to purchase this GPU.

I bought this GPU specifically for running FDTD electromagnetic field
simulations; the AI of its naive implementation is 0.25 FLOPS/byte, so it
still gives me a huge acceleration, on par with the real A100 and currently
unmatched by everything else (if rumors are to be believed, the RTX 5090 is
eventually going to match it at 1.5 TB/s in 2024). Not the worst $500 I've
spent.

It doesn't change the fact that it's still a case of caveat emptor: do
not buy unless you know exactly what you're buying.

VBIOS modding?

A natural question is whether the hardware performance limitation is implemented via
fuses or via the VBIOS firmware, and whether modifying the VBIOS can bypass
these limits.

Unfortunately, all Nvidia GPUs from recent years have VBIOS digital signature checks,
making VBIOS modification impossible. A few months ago, a bypass was found for RTX
20-series "Turing" based GPUs; unfortunately, it doesn't apply to Ampere GPUs.
Currently, VBIOS modding remains impossible for the foreseeable future.

Radeon VII – the alternative from AMD

The mining-special GPU market was not monopolized by Nvidia; there was
strong competition from AMD as well – the mining-special Radeon VII
(one could also call it a mining-special Instinct MI50).

Long story short, during the peak of Ethereum mining, many board makers
in China unofficially created a special version of the
Radeon VII by omitting components of the Instinct MI50
circuit board – the resulting GPU is
a weird hybrid between the Radeon VII and the Instinct MI50 – it has the
Instinct MI50 circuit board with the Instinct MI50 passive cooler suitable
for datacenter usage; on the other hand, the circuit configuration is actually
that of a desktop
Radeon VII, but it can't be called a desktop card either, since the video output
is often omitted. Regardless, since the Radeon
VII is more or less a remarketed accelerator GPU for gaming and uses basically
the same Instinct MI50 silicon, both cards are practically the same thing,
and it doesn't matter which one is which.

In China, these GPUs are currently sold at ~$100. More amazingly,
they're pre-modded with the Radeon Pro VII VBIOS with 11 TFLOPS of
FP32 and unlocked 5.5 TFLOPS FP64 – just like the real MI50 – making them
possibly the
accelerators with the
highest price-performance ratio of all time. The only criticism I have
is the Radeon VII's relatively inefficient memory controller, which often causes
considerable under-utilization of its 1 TB/s HBM2 bandwidth.

Still, if you want to buy a mining-special GPU on the cheap to run compute
workloads, I personally highly recommend this mining-special Instinct MI50 (Radeon
VII) over the Nvidia CMP 170HX if your application is not CUDA-specific.

I plan to review this GPU soon. Due to the lack of restrictions,
I'm not going to analyze that GPU in depth, since it behaves as a standard
GPU as one normally expects; the focus of that article is going to be
a watercooling mod tutorial for homelab use… There's much that
could be said about its bandwidth efficiency problem, which seems to be an
architectural problem across Vega GPUs and is currently poorly understood
due to the lack of documentation – but I'm not sure if I have the
patience to cover those topics.

Appendix 0: Disabling FMA Usage by Source Code or Runtime Modification

OpenCL

Source Code Modification

An FMA or MAD instruction – both of which are throttled on the CMP 170HX – may
be generated by the OpenCL compiler under the following two circumstances:

  1. The source code explicitly uses the OpenCL functions fma() or mad().

  2. The compiler implicitly transforms the expression a * b + c into
    fma(a, b, c) or mad(a, b, c) through a process called FP contraction.
    This can occur even without enabling fast math or unsafe math
    optimizations – those merely increase the compiler's aggressiveness
    in doing so.

Thus, to prevent the OpenCL compiler from using MAD and FMA instructions,
two things need to be done – first, remove all usages of fma() and mad(),
and second, turn FP contraction off. A simple workaround is to
include the following code in front of all OpenCL source files:

/* shadow the OpenCL built-in fma() and mad() functions */
#define fma(a, b, c) ((a) * (b) + (c))
#define mad(a, b, c) ((a) * (b) + (c))

/* disable FP contraction */
#pragma OPENCL FP_CONTRACT OFF

FluidX3D

When testing FluidX3D, a similar two-line patch was used, which is reproduced
below:

diff --git a/src/lbm.cpp b/src/lbm.cpp
index d99202f..28aeb25 100644
--- a/src/lbm.cpp
+++ b/src/lbm.cpp
@@ -286,6 +286,8 @@ void LBM_Domain::enqueue_unvoxelize_mesh_on_device(const Mesh* mesh, const uchar
 }
 
 string LBM_Domain::device_defines() const { return
+       "\n     #pragma OPENCL FP_CONTRACT OFF"  // prevent implicit FMA optimizations
+       "\n     #define fma(a, b, c) ((a) * (b) + (c))"  // shadow the OpenCL built-in function fma()
        "\n     #define def_Nx "+to_string(Nx)+"u"
        "\n     #define def_Ny "+to_string(Ny)+"u"
        "\n     #define def_Nz "+to_string(Nz)+"u"

PoCL

The Nvidia OpenCL compiler is proprietary, inside libnvidia-opencl.so, unlike
AMD GPUs, for which the driver simply invokes clang. Thus, there's no clean
way to disable FMA and MAD globally at the compiler level (apart from reverse
engineering and modifying the Nvidia binaries). Each OpenCL program has to
be painstakingly modified individually.

A potential workaround is intercepting the call into clBuildProgram() to
modify the compiler options. Unfortunately, the OpenCL specification provides
no compile-time option to disable floating-point contraction in the form of
a command-line off/on switch. Yet another workaround can be intercepting
clCreateProgramWithSource() to insert the aforementioned code on the fly. But
this is rather tedious and is left as an exercise for the reader.
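For the curious, a minimal sketch of such an interceptor (my own illustration,
untested against every runtime) could be an LD_PRELOAD shim around
clCreateProgramWithSource() that prepends the macro header shown above:

/* Hypothetical LD_PRELOAD interceptor sketch (not part of any existing runtime):
 * build with: gcc -shared -fPIC -o libfmaoff.so fmaoff.c -ldl
 * run with:   LD_PRELOAD=./libfmaoff.so ./your_opencl_app                   */
#define _GNU_SOURCE
#define CL_TARGET_OPENCL_VERSION 300
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <CL/cl.h>

static const char *header =
    "#define fma(a, b, c) ((a) * (b) + (c))\n"
    "#define mad(a, b, c) ((a) * (b) + (c))\n"
    "#pragma OPENCL FP_CONTRACT OFF\n";

typedef cl_program (*create_fn)(cl_context, cl_uint, const char **,
                                const size_t *, cl_int *);

cl_program clCreateProgramWithSource(cl_context ctx, cl_uint count,
                                     const char **strings, const size_t *lengths,
                                     cl_int *errcode_ret)
{
    create_fn real = (create_fn)dlsym(RTLD_NEXT, "clCreateProgramWithSource");

    /* copy each fragment into a NUL-terminated string, with the header prepended */
    const char **patched = malloc((count + 1) * sizeof(char *));
    patched[0] = header;
    for (cl_uint i = 0; i < count; i++) {
        size_t len = (lengths && lengths[i]) ? lengths[i] : strlen(strings[i]);
        char *s = malloc(len + 1);
        memcpy(s, strings[i], len);
        s[len] = '\0';
        patched[i + 1] = s;
    }

    /* all strings are now NUL-terminated, so a NULL lengths array is fine */
    cl_program prog = real(ctx, count + 1, patched, NULL, errcode_ret);

    for (cl_uint i = 1; i <= count; i++)
        free((void *)patched[i]);
    free(patched);
    return prog;
}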

Another possible and more straightforward hack is to replace the Nvidia OpenCL
runtime altogether with PoCL – a free and portable OpenCL runtime based on
LLVM/clang with the ability to target Nvidia PTX. By modifying the PoCL function
pocl_llvm_build_program(), the clang flag -ffp-contract=off is passed
unconditionally to disable FP contraction. In addition, PoCL's OpenCL standard
library functions fma() and mad() also need to be redefined as plain
a * b + c without FP contraction. This is implemented by the following patch,
allowing one to run pre-existing OpenCL programs without any source code
modification.

diff -upr pocl-4.0/lib/kernel/fma.cl pocl-4.0/lib/kernel/fma.cl
--- pocl-4.0/lib/kernel/fma.cl	2023-06-21 13:02:44.000000000 +0000
+++ pocl-4.0/lib/kernel/fma.cl	2023-10-28 12:44:58.283185457 +0000
@@ -24,4 +24,5 @@

 #include "templates.h"

-DEFINE_BUILTIN_V_VVV(fma)
+#pragma OPENCL FP_CONTRACT OFF
+DEFINE_EXPR_V_VVV(fma, a*b+c)
diff -upr pocl-4.0/lib/kernel/mad.cl pocl-4.0/lib/kernel/mad.cl
--- pocl-4.0/lib/kernel/mad.cl	2023-06-21 13:02:44.000000000 +0000
+++ pocl-4.0/lib/kernel/mad.cl	2023-10-28 12:45:02.093091566 +0000
@@ -24,4 +24,5 @@

 #include "templates.h"

+#pragma OPENCL FP_CONTRACT OFF
 DEFINE_EXPR_V_VVV(mad, a*b+c)
diff -upr pocl-4.0/lib/CL/pocl_llvm_build.cc pocl-4.0/lib/CL/pocl_llvm_build.cc
--- pocl-4.0/lib/CL/pocl_llvm_build.cc	2023-06-21 13:02:44.000000000 +0000
+++ pocl-4.0/lib/CL/pocl_llvm_build.cc	2023-10-28 11:36:33.961059705 +0000
@@ -333,10 +333,8 @@ int pocl_llvm_build_program(cl_program p
   if (fastmath_flag != std::string::npos) {
 #ifdef ENABLE_CONFORMANCE
     user_options.replace(fastmath_flag, 21,
-                         "-cl-finite-math-only -cl-unsafe-math-optimizations");
+                         " ");
 #endif
-    ss << "-D__FAST_RELAXED_MATH__=1 ";
-    fp_contract = "fast";
   }
 
   size_t unsafemath_flag = user_options.find("-cl-unsafe-math-optimizations");
@@ -346,11 +344,14 @@ int pocl_llvm_build_program(cl_program p
     // this should be almost the same but disables -freciprocal-math.
     // required for conformance_math_divide test to pass with OpenCL 3.0
     user_options.replace(unsafemath_flag, 29,
-                         "-cl-no-signed-zeros -cl-mad-enable -ffp-contract=fast");
+                         " ");
 #endif
-    fp_contract = "fast";
   }
 
+  // HACK: disable FP contraction unconditionally
+  fp_contract = "off";
+  ss << " -ffp-contract=off ";
+
   ss << user_options << " ";
 
   if (device->endian_little)

It’s value noting that the patch is incomplete, as OpenCL has extra fma() and mad()
variants than what are patched right here, comparable to mad24(), mad_hi(), and mad_sat().

CUDA

To disable FP contraction, pass the option -fmad=false to nvcc.

Again, like OpenCL, it's also possible to use FMA and MAD via several
built-in functions. Since nvcc is proprietary, once again there's no
clean way to disable FMA and MAD globally at the compiler level –
apart from reverse engineering and modifying the Nvidia binaries, or
implementing a preprocessor or debugger-like interceptor to modify
the source on the fly.
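Within code one controls, contraction can also be suppressed per expression.
A small illustrative sketch (my own, not from any particular codebase) using the
round-to-nearest intrinsics, which Nvidia's floating-point documentation states
are never merged into FMA:

/* Illustrative CUDA kernel: two ways to compute a*b + c without FMA contraction. */
__global__ void axpy_no_fma(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        /* Option 1: plain expression; only safe if the file is compiled
           with `nvcc -fmad=false`, otherwise it may be contracted to FMA. */
        float plain = a[i] * b[i] + c[i];
        (void)plain;

        /* Option 2: explicit round-to-nearest intrinsics, which are not
           merged into FMA regardless of the -fmad setting. */
        c[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);
    }
}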

Alternatively, just like OpenCL, it's possible to compile CUDA
via LLVM/clang and target Nvidia PTX. Thus, in principle, it should
be possible to create a patched LLVM/clang for this purpose, analogous
to PoCL.

SYCL

Currently there are two full-featured SYCL implementations – an LLVM
fork named Intel DPC++ and an LLVM plugin named AdaptiveCpp (previously
called OpenSYCL and hipSYCL). One can use the standard clang
command-line option -ffp-contract=off. As with CUDA, the problem of
built-in functions remains.

Limitations

Although this section demonstrated some possible ways to modify the
program source code or GPGPU runtime to avoid FMA and MAD usages, in
practice, it can be difficult if not impossible to completely avoid
FMA.

As I already showed, a clean compiler/runtime patch is not
as easy as toggling an option, since it doesn't affect the behavior of
built-in functions. An invasive patch to the compiler may be needed
to completely remove FMA/MAD code generation.

Furthermore, even if one can patch the affected applications and
runtimes, the problem of binary packages and standard libraries
remains. Sometimes, GPU kernels are precompiled and shipped in
machine code or assembly form; this is the case for proprietary
software, and sometimes also the case for standard libraries.
In fact, most HPC libraries contain hand-tuned assembly routines
that heavily rely on FMA. Thus, the modification required to remove
FMA/MAD usages from a codebase is likely very complicated, unless
the codebase is self-contained.

Not to mention that removing FMA changes the rounding behavior from
1 rounding to 2 roundings,
thus potentially compromising the numerical integrity of a simulation
and making it untrustworthy if the code was written in a way that
requires FMA to preserve precision.

Appendix 1: Teardown

Circuit Board Separation

The following procedures separate the circuit board from its chassis
and cooler.

CMP 170HX in the original metal chassis

Videos

Step 6, Step 7 and Step 9 can be tricky. Considering that Step 7
initially confused even LinusTechTips, refer to the following two YouTube
videos (in Chinese) for visual demonstrations of the removal process.

Understanding the language is optional, but the second video has YouTube
captioning and can be auto-translated to English. It's posted by
技数犬, a Taobao vendor of aftermarket mod kits that allow many
compute cards to use off-the-shelf AIO watercoolers
(not used in my watercooling setup).

Procedures

It boils down to the following steps:

  1. Remove the 4 screws on the PCIe mounting bracket on the left. Save
    the PCIe bracket for later use.

  2. It's not necessary to remove the mounting bracket on the right
    (the opposite side of the PCIe slot bracket). Don't waste time
    on it.

  3. Flip the card, remove the 10 screws on the back side of the card.

  4. Flip the card again, now open the front cover of the cooler; the PCIe
    bracket should also come off, revealing a huge heatsink.

  5. Unscrew and remove the bracket on top of the power cable at the
    top right, which is holding the power cable and the connector in
    place.

  6. Free the power cable from the backplane by prying it away. The
    cable is squeezed into the backplane and can be difficult to remove.
    Use a plastic spudger to pry it away from the board. Unless the power cable
    / connector is freed, the circuit board can't move freely.

  7. Pay attention to the fact that the entire circuit board is inserted
    into the bottom backplane by sliding the PCIe connector into an open
    slot on the backplane; thus, the circuit board can't be lifted off
    vertically, as it's embedded into the backplane.

    • To remove the circuit
      board (and the bolted-on heatsink), move the circuit board upward
      horizontally (in the opposite direction of the PCIe connector). After
      the PCIe connector moves a bit away from the slot on the backplane,
      lift the circuit board vertically while continuing to move it upward
      horizontally. The circuit board is now separated from the backplane
      (as shown in the photo below).

    • Use the videos for guidance.

Circuit board with heatsink, freed from the backplane

  8. Remove the 4 spring-loaded screws on the bottom of the PCB, keep the
    4 washers for later use, and remove the mounting bracket.

Bottom of the PCB

  9. There are no more screws holding the heatsink, and it can be directly
    removed. However, the thermal paste often works like an adhesive.
    In this case, pry the cooler away from the circuit board using a plastic
    spudger from the top side of the cooler, where there are no components
    on the circuit board. Put the card on a flat desk, rotate the card by
    180 degrees, insert the plastic spudger at the bottom (top before
    the rotation) side of the card, and pry it upward while holding the
    heatsink (to prevent the heatsink from smashing into the circuit board).

  10. Cooler and circuit board are separated.

Front side of the bare PCB

Circuit Board

The Nvidia CMP 170HX uses a circuit board nearly – if not completely – identical
to the Nvidia A100 (40 GiB). The only difference is the model number of the
GPU ASIC, GA100-105F-A1, which seems to be a cut-down version of the full
GA100. The circuit board also has many unpopulated components, including
omitted VRM phases with unpopulated DrMOS transistors and their output
inductors; ICs related to the NV-Link interface are also missing.

Otherwise, even the component reference designators on the
circuit board perfectly match the leaked electrical schematics of the Nvidia
Tesla A100 reference design (found online) – yes, it was still internally
named Tesla when the GPU was being designed.

All hardware modifications for the Nvidia A100 (40 GiB, not 80 GiB), such as
watercooling solutions like waterblocks and modification kits, are equally
applicable to the CMP 170HX. Thus, for those who're playing along at home,
if anyone wants to perform a risky modification to the Nvidia A100 for workstation
use, it might be a good idea to practice the modification on a cheap
CMP 170HX first.

Appendix 2: Watercooling

When attempting to use server GPUs for home/office workstation applications,
cooling is a general problem. Most server-grade GPUs use passive coolers
and they're designed to be cooled by high-RPM fans in a server chassis.
Even if fans can be installed in the original way or via a custom
mod, these passive coolers often need a strong airflow to work properly,
creating a noise level unacceptable outside a data center.

Fortunately, for modern compute-only GPUs, ready-made waterblocks often
exist, making watercooling a convenient solution using off-the-shelf parts.
Watercooling has its downsides,
but as long as precautions are carefully followed – which shouldn't be
difficult for the experienced home lab experimenters who regularly
buy used hardware – watercooling provides cool temperatures and quiet
operation in exchange. This is quite important if you're running
number-crunching simulations while sleeping next to the machine…

To cool my Nvidia CMP 170HX, I found the Bykski N-TESLA-A100-X-V2 waterblock,
which is compatible with the Nvidia A100 40 GB, Nvidia CMP 170HX, and
Nvidia A30 24 GB circuit boards. In China, it's currently sold at a
price of 680 CNY (~100 USD). The bykski.us distributor site is selling
it for $300; I suggest ordering it from AliExpress for less than $150.
The waterblock accepts standard G1/4 fittings and can be included in
any water loop. If this is your first installation, you can find more
information elsewhere on connecting a pump, radiator, or reservoir into the water
loop (none is included).

Make sure you don't confuse the Nvidia Tesla 40 GB version with the 80 GB
version, which are incompatible. There is also an earlier version of the
same waterblock without V2 in its model number; the V2 uses all-metal
construction while the non-V2 uses transparent acrylic. Since the metal
revision was made for a reason (enhanced durability), it's better to
avoid the non-V2 models.

Procedures

The Bykski waterblock has no user manual. The provided online user guide
is more of a quick start guide than a user manual. I also found the included
hex wrench was the wrong size (so be sure to order a set of metric hex
wrenches if you don't have one already). Otherwise, it performs great
if you manage to install it correctly.

Which is a big if… My GPU stopped working soon after the waterblock
installation (but I was able to fix that myself). I suspect it was
either a pre-existing fault or caused by my improper waterblock installation
in my first attempt – if that was the case, the lack of a proper user manual
from Bykski was certainly an important contributing factor – but fear not,
this blog post already went ahead of you and ironed out a smooth path.
You can refer to the following instructions to do the same watercooling
mod, with the foresight of many pitfalls in order to avoid damaging your
GPU.

Thermal pad placement locations in the Bykski manual.

  1. Cut and place the included thermal pads on the MOSFETs (DrMOS) on the left
    and right sides of the GPU ASIC.
    Important: All unpopulated IC footprints must also be covered by thermal
    pads to prevent a short circuit. I suspect a short thermal pad cut missing
    the empty IC footprint at the topmost position was how I damaged my card.

  2. Similarly, apply thermal pads to the bottom left and right sides of
    the GPU ASIC.

  3. Similarly, apply thermal pads to the PMIC on the right side of the
    GPU ASIC between an inductor and a capacitor.

  4. Similarly, apply thermal pads to the two PMICs on the left side of
    the GPU ASIC below the 3.3 µH inductor.

  5. Apply thermal paste to the GPU ASIC. After completing these steps, the
    GPU should look similar to the following photo.

PCB with thermal pads and thermal paste applied

  6. Flip the waterblock, remove the power connector holder cover (labeled
    Item 4 in the exploded-view diagram below) by unscrewing the two hex nuts
    (labeled Item 5) with a hex wrench. The included hex wrench from the
    waterblock was the wrong size, so use your own.

Exploded-view diagram of the waterblock

  7. Place the waterblock onto the GPU. Important: Before placement,
    it's a good idea to visually double-check whether the IC areas in contact with
    the pillars on the waterblock are fully covered with thermal pads.

  8. Flip the GPU, move the waterblock to align the 4 mounting screw
    holes of the waterblock with the PCB.

  9. Find the 4 washers you saved in teardown step 8.
    For washers, use the original ones
    from the GPU; avoid the plastic washers from the waterblock. For spring-loaded
    screws, avoid the original screws and use the screws from the waterblock.
    Place one washer on one mounting hole, and install only one spring-loaded
    screw for now; don't install all 4 screws yet.

  10. Insert the power connector by squeezing it into the slot on the
    waterblock at the top right corner (make sure the cover has already been
    removed in Step 6). The stiff power cable exerts force on the entire
    PCB, making it difficult to install if the waterblock is already fully
    screwed down.

    • If there are still difficulties during installation, unscrew and
      lift the waterblock completely, and insert the power connector in place
      before putting the waterblock back. In this case, considerable force is
      needed to overcome the stiffness of the power cable when aligning the
      circuit board with the waterblock. Screw the waterblock down immediately after
      the first signs of alignment, using Step 9.

    • If the waterblock is lifted and remounted, be sure to double-check that
      no thermal pad has shifted out of place. The thermal pads can stick onto
      the waterblock and land in an unpredictable location later.

  11. Adjust the waterblock for any misalignment, place the remaining
    3 washers onto the mounting holes, and install the 3 spring-loaded screws
    included with the waterblock. Important: After the waterblock is
    screwed in, now is a good time to peek into the circuit board to see
    whether the thermal pads are making good contact with the pillars to
    cool the MOSFETs. The thermal pads included with the waterblock should
    always work, but in case you're using your own, now is the time to spot
    any thickness issues.

  12. Reinstall the power connector holder bracket onto the waterblock
    with a hex wrench.

CMP 170HX with waterblock installed

  13. Grab the backplane (included with the waterblock) from its packaging.
    Using the backplane is optional, but it reduces the chance of damaging the
    back side of the PCB in case of mishandling.

  14. Find the PCIe slot bracket you saved during teardown Step 1,
    align and place it on the left side of the circuit board. Important:
    Don't skip this step; the extra height of the PCIe slot bracket
    likely ensures a proper spacing between the backplane and the circuit
    board, thus preventing potential short circuits. I suspect this was
    how I damaged my card.

  15. Align and place the included backplane on top of the circuit board
    (on the left, the PCIe bracket is sandwiched between the PCB and the
    backplane). On the left, the screw holes of the PCB, the PCIe bracket,
    and the backplane should be aligned. After successful alignment, immediately
    install two 9.5 mm M2 screws (included with the waterblock).

  16. Install the remaining two 9.5 mm M2 screws to secure the backplane.

CMP 170HX with backplane installed, running on a motherboard

  17. Waterblock installation complete; connect the water block into your
    water loop. Use a leak tester to pressurize the water loop for 15 minutes
    to verify the loop's air tightness before adding coolant.

Temperature

Using a 360 mm, 3-fan radiator, the GPU's idle temperature is 30 degrees
C at 30 watts.

In a FluidX3D simulation using FP32/FP16S mode with FMA disabled, the power
consumption is 180 watts. After running this stress test for half an hour,
the GPU temperature reached 45 degrees C. Both the fan and pump speeds were
kept at a minimum, allowing silent operation during GPU number-crunching.
Even lower temperatures may be possible at a higher fan speed.

CMP 170HX installed into the water loop onto a motherboard, running on the test bench

When I was dry-running the card for an initial test without coolant (not
recommended! Make sure to power the system off within 5 minutes), I also
noticed a curious positive feedback loop between the GPU temperature and its
power consumption, creating thermal runaway. Initially, the idle power
consumption was around 40 watts, but it gradually rose to 60 watts at
80 degrees C and was still increasing before I powered it off. I later
found that many other PC reviewers reported similar findings. This was likely
an effect caused by higher CMOS leakage current at higher temperatures,
which in turn creates more leakage current and an even higher temperature...
Hence, good cooling can sometimes reduce an IC's junction temperature by more
than its cooling capability alone, thanks to this second-order effect.
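
The feedback can be illustrated with a toy model: treat leakage power as
increasing linearly with junction temperature, and junction temperature as
ambient plus thermal resistance times total power. The coefficients below are
made-up illustrative numbers, not measurements of this card; the point is
that a large thermal resistance (poor cooling) pushes the settling point far
beyond the first-order estimate, or prevents it from converging at all.

# Toy model of the temperature/leakage feedback loop (illustrative numbers only).
# P(T) = P_BASE + K_LEAK * (T - T_AMB); T(P) = T_AMB + R_TH * P
P_BASE = 40.0   # W, idle power at ambient temperature (assumed)
K_LEAK = 0.4    # W per degC of temperature rise (assumed)
T_AMB  = 25.0   # degC ambient

def settle(r_th, iterations=100):
    """Iterate the feedback loop for a given junction-to-ambient thermal resistance."""
    temp = T_AMB
    power = P_BASE
    for _ in range(iterations):
        power = P_BASE + K_LEAK * (temp - T_AMB)
        temp = T_AMB + r_th * power
    return temp, power

for r_th in (0.3, 0.8, 1.5):   # K/W: watercooled, air-cooled, bare die (assumed)
    temp, power = settle(r_th)
    print(f"R_th = {r_th:.1f} K/W -> settles near {temp:.0f} degC, {power:.0f} W")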

Appendix 3: Repair

My Nvidia CMP 170HX malfunctioned soon after I received it and performed
my first attempt at the watercooling modification. The GPU operated for an
hour before it dropped off the PCIe bus, and could no longer be
detected after this incident.

Such is the well-known risk of buying or modding decommissioned hardware,
or doing both at the same time! Never bet more than you can afford to lose.

It was either a pre-existing fault (perhaps the circuit was already marginal
after heavy mining), or the result of my improper watercooling installation.
A few sentences in the installation procedures are marked as important in bold
for a reason... If it was indeed damaged by my improper installation,
the lack of a proper user manual from Bykski was certainly an important
contributing factor. But fear not: this blog post has already gone ahead of you
and ironed out a smooth path. You can refer to the earlier instructions
to see the many pitfalls to avoid damaging your GPU.

I was able to repair the problem myself with the help of my electronics
home lab (which is not a coincidence, since I bought this GPU specifically
to write electromagnetic simulation code for circuit board designs...)
and the leaked Nvidia Tesla A100 schematics. Thanks to whoever leaked
them; my repair would otherwise have been impossible.

CMP 170HX under repair, being probed by an oscilloscope; the vertical monitor nearby shows the schematic diagrams.

Reference designs and OEM circuit schematics are routinely leaked online or
sold on gray markets to repair technicians. This includes most desktop
and laptop motherboards and GPUs. I'm just surprised that someone actually
decided to leak the $10,000 server-grade A100 alongside the usual
RTX 3080 and 3090 desktop GPUs.

As I had no prior experience troubleshooting video cards, I also used the
Nvidia Pascal GPU Diagnosing Guide
on repair.wiki and a few repair videos to learn what to expect,
including the trick of using output inductors as test points for
DC/DC converters. On my own circuit boards, I always used the output
capacitors or dedicated test points and never realized this shortcut...

The GPU repair song

To the tune of The Army Goes Rolling Along:

Short the rail, fry the core,
Roll the coils across the floor,
And the system goes to crash.
MCUs smashed to bits
Give the scopes some nasty hits
And the system goes to crash.
And we've also found
When you turn the power up
You turn VRMs into trash.

Oh, it's so much fun,
Now the GPU won't run
And the system goes to crash.

Shut it down, pull the plug
Give the core an extra tug
And the system goes to crash.
Mem'ry chips, every one,
Tossed out halfway down the hall
And the system goes to crash.
Just flip one switch
The PWM will cease to twitch
MOSFETs will crumble in a flash.
When the GPU
Only renders magic smoke
Then the system goes to crash.

Power Tree

When a GPU can no longer be detected on PCIe, its on-board DC/DC power supplies
and associated power-on sequencing circuits are the first things to check. Helpfully,
the first page of the Nvidia Tesla A100 schematics contains a power tree overview.

Nvidia Tesla A100 power topology schematics

The PCB accepts four power inputs:

  1. 3V3_PEX is provided by the PCI Express (PEX) slot and is stepped down to 1.8 V
    for use by auxiliary control circuits, via a 1.8 V Low-Dropout Linear
    Regulator (LDO).

  2. 12V_PEX is provided by the PCI Express slot and is stepped down

    • To 5 V via a DC/DC buck converter, the MP1475.

    • To 1.35 V via an MP2988 DC/DC controller (PWM controller), with one phase
      output to drive DrMOS switching transistors.

    • To HBMVPP (2.5 V) via a DC/DC buck converter, another MP1475. This voltage
      is hooked into the GA100 silicon, but the power supply circuit is significantly
      simpler compared to the multi-phase HBMVDD power rail, so I believe it's used
      to power the memory controller itself.

    • To PEXVDD via an MP2988 DC/DC controller, with one phase output to drive
      DrMOS switching transistors. PEXVDD belongs to the I/O power domain for
      PCIe signaling.

  3. 12V_EXT1 is provided by the first external 12 V power connector, which is used
    to derive the 1.0 V core voltage (NVVDD) and HBM memory voltage (HBMVDD) via
    the MP2988 DC/DC controller, which in turn drives DrMOS switching transistors using
    multiple phases. 12V_EXT2 serves the same purpose, but powers another group of
    power supplies providing HBMVDD and NVVDD. On the CMP 170HX and Tesla A100,
    both connectors are physically combined into an 8-pin CPU power connector, but
    one can still see two inputs when an adapter cable is used.
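
To keep the fault isolation organized, it helps to jot the power tree down as
a probing checklist before reaching for the multimeter. The sketch below is my
own reading of the schematics as summarized above, not a reproduction of the
leaked document; the rail names follow the schematics, but the grouping and
the checklist wording are mine.

# Power tree of the CMP 170HX / Tesla A100 board as summarized above
# (my own reading of the schematics; not exhaustive).
POWER_TREE = {
    "3V3_PEX":  ["1V8 (LDO, auxiliary control logic)"],
    "12V_PEX":  ["5V (MP1475 buck)",
                 "1V35 (MP2988 + DrMOS, one phase)",
                 "HBMVPP 2.5V (MP1475 buck)",
                 "PEXVDD (MP2988 + DrMOS, one phase, PCIe I/O domain)"],
    "12V_EXT1": ["NVVDD 1.0V core (MP2988 + DrMOS, multi-phase)",
                 "HBMVDD (MP2988 + DrMOS, multi-phase)"],
    "12V_EXT2": ["NVVDD 1.0V core (second group)",
                 "HBMVDD (second group)"],
}

# Print a simple probe checklist: check each input, then its derived rails.
for source, rails in POWER_TREE.items():
    print(f"[ ] {source}: input present, no short to ground?")
    for rail in rails:
        print(f"    [ ] {rail}: output voltage / switching waveform?")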

Initial Check

With the power tree in mind, it's time for some basic checks of the voltage rails.

Initial troubleshooting result.

I measured the resistance from both 12 V external input filter inductors to ground,
and found both have high impedance. I also measured the resistance from the left
side of the PCIe connector to ground (which is the slot 12 V input), and likewise
found no short circuit. Furthermore, after powering the card on and probing the
output inductors on the right side of the core, I found none of them had any
voltage output or switching waveforms.

This strongly suggests that a problem exists within the GPU's power-on sequencing.
Probably, a needed logic voltage is missing or has entered protection due to a fault,
preventing its downstream circuits from being enabled, including Vcore.

Power-On Sequencing

The next step of the investigation was understanding how power-on sequencing works,
which is answered by the schematics on page 48.

Power-On Sequencing circuit of the Nvidia Tesla A100

From this schematic, it's apparent that the power-on sequencing is triggered by
applying 12 V input power (12V_F, where F stands for "filtered", meaning it's
downstream of the input LC filter). This causes the resistive
divider R391/R392 to produce a 2.5 V output at a suitable logic level, which
is then used as the enabling signal 5V_PS_EN to start the 5 V DC/DC converter.
There's also a tiny 10 nF capacitor to de-glitch this signal.

Resistive Divider for generating PS_5V_EN

5 V supply

The 5V_PS_EN signal is then received by the MP1475DJ on page 18. It's
an adjustable DC/DC converter made by Monolithic Power Systems
with a very basic circuit topology: 12 V input, a switching node
SW driven by an internal switching transistor to generate a
switching waveform, which is then filtered by an output inductor
and output capacitor to produce a reduced voltage. The output
voltage is sampled via a voltage divider and fed back into the FB
pin for closed-loop control.

Though I’ve by no means used the precise chip, it’s a indisputable fact that
nearly each microcontroller circuit board I’ve beforehand designed
contained a primary buck converter like this one. There is no such thing as a complicated
controller-transistor mixtures and no multi-phase switching,
because it’s solely a easy auxiliary voltage for powering the logic
chips, not the GPU.
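
For reference, the output voltage of this style of adjustable buck converter
is set by the feedback divider: Vout = Vref x (1 + Rtop / Rbot). The sketch
below works through the arithmetic; the 0.8 V reference and the divider values
are illustrative assumptions (typical of MPS parts), not values read off this
board's schematics.

# Feedback-divider arithmetic for an adjustable buck converter like the MP1475.
# Vout = Vref * (1 + R_top / R_bot). Values below are illustrative assumptions,
# not taken from the CMP 170HX board.
V_REF = 0.8      # V, feedback reference voltage (assumed)
R_TOP = 52.3e3   # ohms, upper divider resistor (assumed)
R_BOT = 10.0e3   # ohms, lower divider resistor (assumed)

v_out = V_REF * (1 + R_TOP / R_BOT)
print(f"Vout = {v_out:.2f} V")   # ~4.98 V with these example values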

5 V DC/DC buck converter

After pinpointing the physical location of the DC/DC converter and
measuring its switching node, I immediately found the first problem:
the 5 V DC/DC converter was refusing to start. On the switching node
SW, there was a momentary pulse that lasted for dozens of nanoseconds
before it stopped. The IC would then retry after dozens of
microseconds, fail to start again, and the cycle repeated. This either
suggested the existence of a short circuit on the 5 V rail, causing
the power chip to enter hiccup-mode short-circuit protection, or
implied the DC/DC converter itself was faulty.

Thus, I desoldered the MP1475DJ IC using my hot air gun. I also
soldered a jumper wire to its output inductor for injecting an
external 5 V voltage from my benchtop power supply.

5 V DC/DC desoldered, with a jumper wire for voltage injection

Voltage injection showed no short circuit. However, I discovered that
the resistance from Pin 1 of the empty footprint (the Power Good signal) to
ground was 5 Ω. This is not good!

Hence, the next step was to investigate where the signal PS_5V_PGOOD
went.

PS_5V_PGOOD short circuit

After chasing the net label PS_5V_PGOOD across the entire schematics,
I found the vast majority of its users are the logic chips, SN74LV1T08.
This chip is one of the modern remakes of the classic 74xx logic gates,
providing a 2-input AND gate with level shifting for controlling power
sequencing elsewhere, including PEXVDD, NVVDD, 1V35, and 1V8.

The PS_5V_PGOOD signal is mainly used for sequencing other rails via the SN74LV1T08 logic gate.

If any of these 74-logic chips were shorted, isolating the fault would
be a painstaking task.
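
Functionally, each of these gates just ANDs PS_5V_PGOOD with one other
condition to produce a downstream enable, so a missing 5 V Power Good freezes
everything behind it. A tiny sketch of this behaviour (rail names from the
schematics, the gate pairings are assumed for illustration):

# The SN74LV1T08 is a single 2-input AND gate; the sequencer uses several of
# them to gate downstream enables on PS_5V_PGOOD. Pairings here are illustrative.
def and_gate(a: bool, b: bool) -> bool:
    return a and b

ps_5v_pgood = False          # the fault observed on this card
other_input_ok = {"PEXVDD": True, "NVVDD": True, "1V35": True, "1V8": True}

for rail, other_ok in other_input_ok.items():
    enable = and_gate(ps_5v_pgood, other_ok)
    print(f"{rail}_EN = {enable}")   # everything stays disabled until 5 V is good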

One of these things is not like the others, though: the 5 V Power Good
signal is also used to enable the 5 V to 3.3 V Low-Dropout Regulator (LDO)
that provides 3V3_SEQ, an auxiliary 3.3 V logic voltage, which is only
used for continuing the power sequencing after 5 V (including powering
these 74 logic chips).

GS7155NVTD 3.3 V LDO on the schematics

Conveniently, there’s a 0-ohm resistor jumper R1200 between the PS_5V_PGOOD
enter and the ENABLE pin of the three.3 V LDO. Thus, eradicating R1200 grew to become
a no brainer.

After I pushed away this 0402 resistor R1200 with a blob of molten
solder on the tip of my soldering iron, the short circuit at Pin 1
of the MP1475DJ (in other words, PS_5V_PGOOD) cleared, confirming
that the GS7155NVTD LDO was dead due to an internal short circuit.

5 V repair success!

Once the problems were identified, I immediately ordered replacement
chips for the MP1475DJ and GS7155NVTD. As soon as the MP1475DJs
arrived, I soldered a new one onto the circuit board and applied
power. Power-on tests confirmed the 5 V logic rail was successfully
restored.

3.3 V supply

The next problem was replacing the GS7155NVTD.

GS7155NVTD 3.3 V LDO on the schematics

The adjustable LDO chip GS7155NVTD is made by GSTEK. Although I've
never used this chip or vendor before (apparently it's specifically
designed for powering Nvidia GPUs and its full datasheet is under NDA,
though one can find the full datasheet of a similar model), it also
uses the most basic LDO circuit topology.

It accepts 5 V and generates 3.3 V by dropping the excess voltage
away as heat using its internal series pass transistor. The output
voltage is sampled by a resistive divider into the feedback pin for
closed-loop control; the output is adjustable by using a different
divider ratio. There's also a capacitor across a feedback resistor,
likely for frequency compensation of its control loop to improve its
transient response. Usually one just copies these components based
on the manufacturer's recommendations in the datasheet.

Using a hot-air gun, I quickly replaced the IC, but it was risky to
power the board back on at this point. This chip uses the Quad Flat
No-Lead (QFN) package, which is hard to solder properly by hand.
One must pre-tin the chip with solder, pre-tin the circuit board
footprint with solder, then heat the PCB with a hot air gun. An
inexperienced repair tech (like me) can easily create a cold solder
joint, which is no joke.

If the feedback pin of the chip is open, from the perspective of the
LDO, it would see a constant output undervoltage, and the LDO would
put the maximum voltage on its output in an attempt to boost the
output voltage. In reality, this causes the output voltage to shoot
up to the max, putting the full 5 V on the 3.3 V rail and
blowing up almost all the 3.3 V logic chips with disastrous
results. Therefore, I came up with a crude plan to distinguish a good
LDO from a bad one: I replaced the 7.68 kΩ feedback resistor at the
bottom with a 20 kΩ resistor to temporarily reprogram the LDO's output
voltage from 3.3 V to 1.8 V, then I injected 3.3 V into the 5 V rail.
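
Assuming the usual adjustable-LDO relation Vout = Vref x (1 + Rtop / Rbot),
the two set points quoted above (3.3 V with 7.68 kΩ, about 1.8 V with 20 kΩ)
can be back-solved for the unknowns; the sketch below does this. The derived
reference (~0.87 V) and top resistor (~21.6 kΩ) are inferences from that
assumption, not values from the NDA'd datasheet.

# Back-solve the assumed adjustable-LDO relation Vout = Vref * (1 + R_top / R_bot)
# from the two set points mentioned above. Purely illustrative: the real
# GS7155NVTD parameters are under NDA and may differ.
R_BOT_A, V_A = 7.68e3, 3.3    # original bottom resistor -> 3.3 V output
R_BOT_B, V_B = 20.0e3, 1.8    # temporary bottom resistor -> ~1.8 V output

# With k = Vref * R_top:  V = Vref + k / R_bot at both set points.
k = (V_A - V_B) / (1 / R_BOT_A - 1 / R_BOT_B)
v_ref = V_A - k / R_BOT_A
r_top = k / v_ref
print(f"implied Vref ~= {v_ref:.2f} V, R_top ~= {r_top / 1e3:.1f} kOhm")
# -> roughly 0.87 V and 21.6 kOhm; a larger bottom resistor lowers Vout.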

In the first attempt, I found the LDO couldn't output any voltage.
Oh no... Circuit boards for computer motherboards and graphics cards
contain many layers of power and ground planes, making them act as
an excellent heatsink and a nightmare to repair. This is already
sometimes a problem on my 4-layer PCBs, but modern computer cards often
contain 8 to 12 PCB layers to satisfy their high-speed signaling
needs. On this card, I found I needed to set my hot-air gun to 420
degrees Celsius and to heat the area for 2 minutes before I was able
to remove any chip.
I really didn't want to stress the circuit board further and risk
lifting the copper pads or traces off the board with another round
of hot-air gun torture (this is also why a pre-heater is required for
any serious BGA rework...)

Suddenly I realized that the ENABLE signal for the LDO was
missing. After bridging the LDO's ENABLE signal to 3.3 V,
thankfully, 1.8 V immediately appeared on the output side of the
LDO. Bingo! This test confirmed that the LDO could output a
regulated 1.8 V, proving its correct operation.

Enabling ENABLE

After this test, I reinstalled the 7.68 kΩ feedback resistor to
reprogram the LDO from 1.8 V back to 3.3 V. Now it was the moment
of truth: I replugged the card back into the test bench and
powered it on...

Unfortunately, the 3.3 V rail had no output; the symptom was
identical to the earlier case of a missing ENABLE signal.
What was happening?

R1200 was in fact a short circuit on the PCB, unlike what the schematics suggest

Upon further probing with a multimeter, I found the left and right
solder pads of the 0-ohm jumper R1200 were shorted together. Indeed, a close
inspection under the microscope showed that there's a trace directly bridging
the pads! It must
have been the case that the PCB designers wanted to remove the useless jumper
R1200 (perhaps it's a common point of failure), hence a direct copper trace
was used to bypass R1200. But for some manufacturing reason, possibly to
avoid reprogramming the pick-and-place machine or the automated optical
inspection machine, the now-useless resistor itself was kept as-is and
not removed.

Hence, removing the resistor should not have isolated the original short
circuit inside the GS7155NVTD. Instead, in the process of removing R1200
or replacing other components, I may have unintentionally damaged a via or
a trace on the circuit board, thus isolating the short circuit by breaking
the PCB itself. It's also possible that the via or trace involved had already
become fragile due to the earlier short circuit. After the breakage, the
enable signal of the GS7155NVTD became an isolated island unconnected to the
wider PS_5V_PGOOD net.

After the root cause was understood, fixing the problem became trivial.
As I previously mentioned, the PS_5V_PGOOD net is used by many
SN74LV1T08 logic chips across the schematics for power sequencing.
Thus, I decided to borrow this signal from U816 and route it to the
GS7155NVTD using a jumper wire.

Finally, I replugged the GPU into my test bench and reapplied power. 3.3 Vseq
was back! I could see switching waveforms on the oscilloscope! 1.0 V Vcore
(NVVDD) was back! PEXVDD was back! At last, the GPU was redetected on
PCIe; the GA100 is alive.

The photo above showed the use of a PVC-insulated AWG-30 wire-wrap wire.
This was only used for a quick test and is not recommended in an actual
repair, since it's difficult to strip the wire to the correct length and to solder
it without damaging its insulation, risking a short circuit due to
wire movement. For the eventual repair, I used a thin enameled wire; its
heat-stripped insulation coating doesn't suffer from this problem.
