Inside Kepler, Nvidia’s Strong Start on 28 nm – Chips and Cheese

2023-11-24 16:23:44

Nvidia’s Fermi architecture was ambitious and revolutionary, offering advances in GPU compute along with features like high tessellation performance. However, Terascale 2’s more conventional approach delivered better power efficiency. Fermi’s GTX 500 series refresh narrowed the efficiency gap, but then AMD hit back with the GCN architecture on 28 nm. In developing their first 28 nm GPU architecture, Nvidia threw away much of the Fermi playbook and focused hard on power efficiency.

The GTX 680 features a clean shroud design

As part of this effort, Kepler trades some of Fermi’s compute credentials for a more balanced approach. Nvidia had high hopes for GPU compute, but users didn’t end up using GPUs for extra floating point throughput in everyday applications. GPUs were still primarily for gaming, and had to be good at it.

I waited a long time for a site like Real World Tech to do a deep dive on Kepler’s architecture, but it never happened. I guess I’ll do it myself.

Kepler has several variants, and I’ll be focusing on GK104 and GK210.

Die Name | Tested Card | Comments
GK104 | Nvidia GeForce GTX 680 | Midrange consumer variant targeted primarily at gaming
GK210 | Nvidia Tesla K80 | Datacenter variant for high performance compute. I only tested a single die on the K80
GF114 | Nvidia GeForce GTX 560 Ti | Midrange Fermi (prior generation) card. Data kindly provided by Cha0s
GF108 | Nvidia Quadro 600 | Low end Fermi card
Tahiti | AMD Radeon HD 7950 | AMD’s first generation GCN card, which primarily competed with Nvidia’s GK104
Hawaii | AMD Radeon R9 390 | A scaled up GCN chip that competed with larger Kepler variants like GK110

I’ll occasionally make references to AMD’s GCN and Nvidia’s prior generation Fermi architecture for comparisons.

System Level

Kepler is built from SMXes, or Next Generation Streaming Multiprocessors. The whole GPU gets a shared L2 cache as with Fermi, though some Kepler SKUs get doubled L2 cache capacity.

A GDDR5 setup provides high DRAM bandwidth to the GPU. Fermi also used GDDR5, but Kepler’s GDDR5 controllers can handle higher GDDR5 clocks for increased bandwidth. A PCIe 3.0 connection links Kepler to the host system, and provides twice as much bandwidth compared to Fermi’s PCIe 2.0 setup.

Kepler SMX Overview

Kepler’s SMX is distantly related to Fermi’s SMs, but is much larger and prioritizes power efficiency. Fermi ran the execution units at twice the GPU core clock to maximize compute power within area constraints, but that resulted in high power consumption. Kepler drops the hot clock in favor of more execution units, using more die area to achieve better performance per watt. Therefore 96 “CUDA cores” on a GK104 chip are equivalent to 48 on a GF104 chip running at the same clocks.
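The tradeoff can be sanity-checked with a little arithmetic: counting an FMA as two FLOPs per lane per cycle, doubling the unit count at half the clock leaves peak throughput unchanged. The clock value below is illustrative, not a real product spec:

```python
def fma_flops_per_sec(cuda_cores: int, clock_hz: float) -> float:
    """Peak FP32 throughput, counting an FMA as 2 FLOPs per lane per cycle."""
    return cuda_cores * clock_hz * 2

core_clock = 1_000_000_000  # 1 GHz core clock, illustrative value

# Kepler style: 96 units running at the core clock
kepler = fma_flops_per_sec(96, core_clock)
# Fermi style: 48 units running at a 2x "hot clock"
fermi = fma_flops_per_sec(48, 2 * core_clock)

assert kepler == fermi  # same peak FLOPs, different area/power tradeoff
print(kepler / 1e9, "GFLOPS")  # 192.0 GFLOPS
```

The peak numbers match; the win comes from the lower voltage and simpler circuit design that a slower clock permits.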

To claw back some area efficiency, Nvidia packs more execution power into the SMX. Doing so reduces instances of shared resources like caches, scratchpad storage, and off-SMX interfaces. An SMX thus ends up looking like a doubled-up Fermi SM. It’s capable of completing 192 FP32 operations per cycle, making it a huge basic building block.

SMX Frontend

An SMX feeds its four scheduler partitions with an 8 KB instruction cache. Each scheduler partition, or SMSP (SM sub-partition), can exploit thread-level parallelism by tracking up to 16 threads. Nvidia calls these threads warps, since each instruction typically operates on a 32-wide vector of 32-bit elements. Every cycle, the scheduler can select a thread and issue up to two instructions from it to the execution units. Unlike Fermi’s scheduler, Kepler uses static, software-assisted scheduling.

Static Scheduling

Fermi’s instruction set looks a bit like ARM or x86. The instructions say what to do, but not how fast to do it. That job falls to Fermi’s hardware, which uses a scoreboard to track register state and stall instructions until their input registers are ready.

Fermi could dual issue FMA instructions, and an FMA instruction has three inputs. The register scoreboard would therefore need at least six read ports and two write ports. Covering the register file would require 1024 entries, since a Fermi SM has a 1024 entry register file.

Kepler does away with this structure by taking advantage of fixed execution latencies. An FMA operation will always produce a result in 9 cycles, and the compiler knows this. The hazard detection process for these fixed latency operations can be avoided if the hardware can be told to wait for a set number of cycles. To do this, Kepler adds scheduling info to its instruction set. Every seven instructions are preceded by a 64-bit control word.

Some GK110 disassembly from a Blender Cycles kernel, with management codes annotated as specified by Xiuxia Zhang et al1.

Within the control word, one byte identifies it as control information (not a regular instruction), while the other seven bytes contain scheduling data for each of the following instructions. The control byte can tell Kepler to stall the thread for a number of cycles, after which register inputs will be ready. Hardware unconditionally trusts the scheduling data. If you set the control code yourself and don’t stall for enough cycles, you’ll get incorrect results.
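As a rough illustration of the scheme, here is a sketch that splits a 64-bit control word into the marker byte plus seven per-instruction control bytes. The exact bit-field layout (stall count position, dual issue flag, which byte is the marker) is documented by Zhang et al.1; the decode below is simplified and purely illustrative, treating the high byte as the marker and the low 4 bits of each remaining byte as a stall count:

```python
def split_control_word(word: int):
    """Split a 64-bit Kepler-style control word into its 8 bytes.

    Assumed (illustrative) layout: the high byte is the marker identifying
    this as control info rather than an instruction; each of the other 7
    bytes carries scheduling data for one of the next 7 instructions, with
    the low 4 bits read here as a stall count. The real field layout per
    Zhang et al. is more involved.
    """
    data = word.to_bytes(8, "little")
    marker, ctrl_bytes = data[7], data[:7]
    stalls = [b & 0xF for b in ctrl_bytes]  # hypothetical stall decode
    return marker, stalls

# First instruction after this word stalls 9 cycles; the rest don't stall
marker, stalls = split_control_word(0x2000_0000_0000_0009)
print(hex(marker), stalls)  # 0x20 [9, 0, 0, 0, 0, 0, 0]
```

The key point is architectural, not the bit layout: hardware reads these counts instead of checking a scoreboard, so a compiler (or a brave hand-assembler) carries full responsibility for correctness.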

Besides stall cycles, Kepler’s control codes can tell the scheduler to dual issue a pair of instructions. Fermi’s dual issue capability required the hardware to check adjacent instructions for dependencies, but Kepler eliminates that.

Thus Nvidia has replaced Fermi’s large, multi-ported register scoreboard with a simpler structure that tracks readiness on a per-thread basis. Each scheduler partition can only track 16 threads, so the structure can be small. Because the scheduler will only select one thread for execution per cycle, the thread scoreboard only has to provide one output and accept one write per cycle.

Execution Units

The instruction(s) selected by the scheduler have their operands read from the register file. Consumer Kepler variants have a 64 KB register file for each scheduler partition, letting it track state for up to 512 32-wide vectors. GK210 has twice as much register file capacity, likely to improve occupancy with FP64 workloads. The register file is built with four single-ported banks and can therefore only provide four operands per cycle. Dual issuing FMA instructions would require up to six register inputs. Kepler therefore has an operand collector capable of caching recently read register values.
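To see how register file capacity interacts with thread tracking, here’s a rough occupancy sketch. It assumes 4-byte registers, 32-wide warps, and the 16-warp SMSP tracking limit described earlier, and ignores allocation granularity:

```python
def max_warps(regfile_bytes: int, regs_per_thread: int,
              warp_width: int = 32, reg_bytes: int = 4,
              hw_warp_limit: int = 16) -> int:
    """Warps an SMSP can keep in flight: the smaller of the hardware
    tracking limit and what the register file capacity allows."""
    regs_per_warp = regs_per_thread * warp_width * reg_bytes
    return min(hw_warp_limit, regfile_bytes // regs_per_warp)

# Consumer Kepler: 64 KB register file per scheduler partition
print(max_warps(64 * 1024, 32))    # 16 -> capped by the tracking limit
print(max_warps(64 * 1024, 128))   # 4  -> heavy register use cuts occupancy
# GK210 doubles the register file, helping register-hungry FP64 kernels
print(max_warps(128 * 1024, 128))  # 8
```

This is why GK210’s doubled register file matters: FP64 code tends to burn register pairs, and more capacity keeps more warps available to hide latency.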

Once operands are ready, instructions are sent to two execution ports. Each can handle a 32-wide vector per cycle, but the second port is shared by two scheduler partitions. Integer operations are handled by the first, non-shared port, and thus execute at 2/3 of FP32 rate. Special function operations like reciprocals and inverse square roots execute at a quarter of the integer rate.
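Putting those ratios together, the per-SMX, per-cycle rates implied by the paragraph above work out as follows (my reading of the ratios, not an official spec sheet):

```python
# Per-SMX, per-cycle operation rates implied by the description above:
# six 32-wide FP32 ports total, integers on the non-shared ports only,
# special function units at a quarter of the integer rate.
FP32_PER_CYCLE = 192                       # 6 x 32-wide
INT32_PER_CYCLE = FP32_PER_CYCLE * 2 // 3  # -> 128
SFU_PER_CYCLE = INT32_PER_CYCLE // 4       # -> 32

print(FP32_PER_CYCLE, INT32_PER_CYCLE, SFU_PER_CYCLE)  # 192 128 32
```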

Common INT32 and FP32 operations execute with 9 cycle latency5, making Kepler more vulnerable to data hazard stalls than newer GPU architectures. But if execution dependencies are rare, Kepler can get through individual kernel invocations with blazing speed. Like RDNA, it can execute instructions from the same thread on back-to-back cycles. Combine that with dual issue capability, and Kepler’s per-thread, per-cycle throughput is second only to AMD’s RDNA 3.

Kepler has less execution throughput compared to AMD’s competing GCN GPUs. Nvidia specifically de-prioritized FP64 execution hardware on consumer SKUs. Games don’t use FP64 because graphics rendering doesn’t require high accuracy. Having extra FP64 units would be a waste of die area.

Kepler products aimed at datacenter and HPC, like the GK210-based Tesla K80, did offer high FP64 throughput. However, taking advantage of that FP64 capability comes at a substantial power cost, even at low clocks. The Tesla K80 is nominally capable of 824 MHz, but clocks drop as low as 744 MHz under sustained FP64 load. As temperatures rise, leakage power gets worse and the card clocks down to stay within power limits.

Monitoring data from GPU-Z on an Azure cloud instance running a custom FP64 OpenCL workload that calculates gravitational potential.

Where Kepler shines is bringing more execution power to bear when faced with small kernels. An SMX sub-partition can nominally issue one instruction per clock and often dual issue. A GCN SIMD can only issue one instruction every four clocks.

Cache and Memory Access

A Kepler SMX has several types of private caches for different memory spaces. Global memory is the GPU’s main memory space and acts like CPU memory. It’s backed by VRAM, and writes are visible across the GPU. Accesses to global memory can be cached by L1. Texture memory is read-only and backed by separate caches on Kepler. Nvidia’s GK110/GK210 whitepaper refers to the texture cache as a “read-only data cache”.

From the perspective of a single thread

Constant memory is meant for read-only variables, and is specified via the “__constant” qualifier in OpenCL. Kepler’s SMX has a two-level constant cache.

Finally, each OpenCL workgroup or CUDA thread block can allocate scratchpad memory. OpenCL calls this local memory, while CUDA calls it shared memory.

Global Memory

Like Fermi, L1 cache and shared memory (a software managed scratchpad) are dynamically allocated out of a 64 KB block of storage. In addition to supporting Fermi’s 16+48 KB split, Kepler adds an evenly split 32+32 KB option.

Using CUDA’s cudaDeviceSetCacheConfig API to pick different L1/scratchpad splits

GK210 Kepler doubles L1/shared memory storage to 128 KB, potentially helping occupancy with programs that allocate a lot of scratchpad storage. L1 cache capacity remains capped at 48 KB, suggesting Nvidia didn’t increase the tag array size. Latency to the L1 data cache remains constant regardless of the L1/scratchpad storage split.

Compared to GF114 Fermi, GK210 Kepler has significantly higher cache and memory latency. The Tesla K80 is a rather low clocked Kepler product with a core clock similar to the GTX 560 Ti’s (824 vs 823 MHz), which likely accounts for the lack of improvement.

Frustratingly, OpenCL and CUDA kernels can’t access GK104’s L1 data cache. I tried both OpenCL and CUDA with the opt-in compiler flags set. Somehow it works with Vulkan, so I’m using Nemes’ Vulkan microbenchmark to compare with AMD’s HD 7950.

Nvidia’s GTX 680 has better vector access latency throughout the memory hierarchy. AMD enjoys better latency if it can leverage the scalar memory path, which only works for values known to be constant across a 64-wide wavefront. Nvidia’s driver, in its infinite wisdom, has chosen the smallest L1 data cache allocation for Vulkan even though the kernel doesn’t use scratchpad memory. Otherwise, Nvidia could have an L1D capacity and speed advantage over AMD.

Texture Caching

Nvidia’s GK110/210 whitepaper2 says each SMX has 48 KB of texture cache. However, my testing only sees 12 KB. Größler’s thesis3 shows similar results. An investigation by Xinxin Mei and Xiaowen Chu4 suggests both Kepler and Fermi use a 12 KB, 96-way set associative texture cache. This discrepancy is because Kepler appears to have a private texture cache for each scheduler partition.

I created disjoint pointer chasing chains within the test array and spawned multiple sets of 32 threads within the same workgroup. Nvidia’s compiler appears to assign adjacent OpenCL threads to the same wave (vector), so I had each set of 32 threads follow the same chain to avoid divergent memory accesses. Keeping the test within one workgroup meant all waves would be assigned to the same SMX. Spawning two waves shows low latency over 24 KB, and spawning four shows the 48 KB of texture cache capacity claimed by Nvidia’s whitepaper.
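The chain-construction logic can be sketched as follows. The real test runs as an OpenCL kernel; this Python version only shows how disjoint chains can share one array so that each wave walks its own dependent-load sequence (the interleaving scheme here is one reasonable choice, not necessarily the exact one used):

```python
import random

def make_chains(num_elements: int, num_chains: int):
    """Build disjoint pointer-chasing chains in one array.

    Element i holds the index of the next element in its chain, so a
    thread doing `idx = arr[idx]` repeatedly walks its chain, and every
    access depends on the previous one (defeating overlap/prefetch).
    Chain c owns the indices congruent to c mod num_chains, so chains
    never touch each other's elements.
    """
    arr = [0] * num_elements
    for c in range(num_chains):
        members = list(range(c, num_elements, num_chains))
        random.shuffle(members)                       # randomize the order
        for i, m in enumerate(members):
            arr[m] = members[(i + 1) % len(members)]  # close the cycle
    return arr

arr = make_chains(1024, 4)

# Walk chain 0: after 256 steps (1024 elements / 4 chains) we're back at
# the start, never having visited another chain's elements.
idx = 0
for _ in range(256):
    assert idx % 4 == 0
    idx = arr[idx]
assert idx == 0
```

With four such chains, four waves in one workgroup each traverse a private 12 KB footprint, which is how a per-partition cache shows up as 48 KB of aggregate capacity.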

Thus, I disagree with Mei and Chu’s conclusion that each Kepler SM(X) has 12 KB of texture cache. Rather, I believe it has 12 KB of texture cache per SMX scheduler partition, giving each SMX four 12 KB texture caches and 48 KB of total texture cache capacity.

What GK104’s memory hierarchy probably looks like

Compared to GCN, Kepler enjoys lower latency for image1d_buffer_t accesses. GCN’s vector cache doubles as a texture cache, so we’re comparing Kepler’s texture cache to GCN’s vector cache accessed (probably) through the TMUs. Latencies are closer this time, likely because array address calculations are handled by the TMUs, and they’re faster than doing the equivalent shift + add sequence through the vector ALUs.

Three tested Kepler GPUs show 12 KB of texture cache capacity

Texture cache misses go to L2. Larger Kepler chips like GK210 enjoy larger L2 capacity, but latency is higher compared to consumer GK104 and GK107 chips. Part of this can be attributed to the Tesla K80’s lower 824 MHz clock speed. The GTX 680 and GTX 650 run at 1058 MHz.

Constant Memory

To speed up access to variables that are constant across a vector (warp/wave), Kepler has a small but fast 2 KB constant cache. A mid-level 32 KB constant cache helps cache more constants. Kepler supports up to 64 KB of constants.

Compared to Fermi, Kepler shrinks the constant cache from 4 KB to 2 KB. Latency regresses as well, rising to 46.9 ns compared to Fermi’s 30.58 ns. Kepler and Fermi both have a 32 KB mid-level constant cache that’s probably private to an SM(X). Latency is slightly higher on Kepler, suggesting Nvidia de-prioritized constant caching. Perhaps constant memory access was already fast enough on Fermi, and pipelines could be lengthened to allow higher clocks at lower voltage.

AMD doesn’t have a separate constant cache hierarchy. Instead, the 16 KB scalar cache handles both constant memory and scalar global memory accesses. GCN therefore enjoys more first level caching capacity for constant memory, and a latency advantage out to 16 KB. Past that, accesses spill into L2.

Unusually, Fermi’s compiler can optimize scalar accesses to use the constant cache, making it play a role similar to AMD’s scalar cache. This optimization is gone with Kepler, and the constant cache hierarchy can only be accessed if memory is marked with the “__constant” qualifier in OpenCL.

Local Memory

Unlike other memory types, local memory is not backed by DRAM. It’s a small software managed scratchpad local to a group of threads running on the same SMX. Nvidia confusingly calls this “Shared Memory” even though it’s less shared than global memory. AMD calls their equivalent the “Local Data Share”.

Local memory provides better latency than L1 cache accesses even though Kepler allocates both out of the same pool of storage. That’s because local memory is directly addressed and doesn’t require checking cache tags. On Kepler, local memory offers the lowest latency storage (besides registers, of course). It’s a bit faster than Fermi and much faster than competing GCN GPUs. On the HD 7950, pointer chasing in the LDS is only slightly faster than doing so within the vector cache. The R9 390 improves, but is still far off from Kepler.


Atomic operations help pass values between threads running on a GPU. Kepler features improvements to atomic handling, with the L2 getting increased bandwidth for atomics compared to Fermi6. Nvidia has evidently improved latency for atomic operations too. When using atomics on global memory, Kepler offers far lower latency than Fermi and edges out AMD’s GCN architecture.

Roughly analogous to a core-to-core latency test on a CPU

We can also do atomic operations on local memory. Doing so passes values between threads running on the same SMX, and should result in lower latency. AMD benefits nicely if atomic operations are kept within a compute unit, but Nvidia doesn’t. Fermi showed the same behavior.


For a 2012 GPU, Kepler has good latency characteristics. However, graphics rendering is inherently bandwidth hungry. Kepler is outclassed by AMD’s GCN throughout the global memory hierarchy.


Shared caches are particularly interesting because they have to serve a lot of clients and meet their bandwidth demands. Nvidia has doubled per-cycle L2 cache throughput on Kepler, giving GK104 512 bytes per cycle of theoretical L2 bandwidth6. Assuming Nvidia is using one L2 slice per 32-bit GDDR5 channel, GK104’s L2 cache would consist of eight 64 KB slices, each capable of 64 bytes per cycle of bandwidth. At 1058 MHz, the L2 cache should deliver 541 GB/s of total bandwidth. I don’t see anything close to that.
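The arithmetic above can be written out as a short sketch; the slice count and per-slice width are the assumptions stated in the paragraph, not confirmed hardware details:

```python
def l2_bandwidth_gbps(slices: int, bytes_per_cycle_per_slice: int,
                      clock_mhz: float) -> float:
    """Theoretical L2 bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return slices * bytes_per_cycle_per_slice * clock_mhz * 1e6 / 1e9

# GK104 assumption: eight slices (one per 32-bit GDDR5 channel),
# 64 bytes/cycle each, at the GTX 680's 1058 MHz core clock
print(l2_bandwidth_gbps(8, 64, 1058))  # ~541.7 GB/s
```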

Eight workgroups would load all eight SMXes on the GTX 680. OpenCL used for this test

Varying workgroup count doesn’t create much of a difference in measured bandwidth. I expect AMD’s HD 7950 to have more L2 bandwidth because L2 slices are often tied to DRAM controllers. Tahiti’s wider DRAM bus would imply more L2 slices and more bandwidth. But more than a 2x difference in measured bandwidth is unexpected. Still, GK104’s L2 offers twice as much measured bandwidth as DRAM access.

GK210 Kepler (Tesla K80) features a larger 1536 KB L2 cache, likely with twelve 128 KB slices. Since GK210 has a 384-bit memory bus like AMD’s Tahiti, bandwidth should be more comparable. However, GK210 still falls short. Running two workgroups on each SMX slightly improves bandwidth from 338 to 360 GB/s, but providing even more parallelism doesn’t improve results further. Zhe Jia et al7 achieved a similar 339 GB/s on a Tesla K80 using CUDA with PTX, so I’m not crazy.

Kepler has less VRAM bandwidth than GCN cards, thanks to smaller memory buses. Vulkan microbenchmarking achieved 142 GB/s from the GTX 680’s VRAM setup, compared to 192 GB/s theoretical. The same test got 204 GB/s on the HD 7950, out of 240 GB/s theoretical. This difference persists as we move to larger chips. GK210 gets a 384-bit memory bus, but Hawaii’s 512-bit bus keeps it ahead.
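As a sanity check on those theoretical figures, GDDR5 bandwidth is just bus width times per-pin data rate. The data rates below are the usual launch specs (roughly 6 GT/s for the GTX 680, 5 GT/s for the HD 7950):

```python
def vram_bandwidth_gbps(bus_width_bits: int, data_rate_gtps: float) -> float:
    """Theoretical GDDR5 bandwidth in GB/s: bus width (bytes) x per-pin rate."""
    return bus_width_bits / 8 * data_rate_gtps

print(vram_bandwidth_gbps(256, 6.0))  # GTX 680:  192.0 GB/s theoretical
print(vram_bandwidth_gbps(384, 5.0))  # HD 7950:  240.0 GB/s theoretical
```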

Some Light Benchmarking

Various reviewers covered Kepler’s gaming performance when the card first launched. Anandtech’s review is a good example, and I highly encourage going over that to get an idea of where Kepler stood against GCN around 2012. That said, I think it’ll be fun to throw a few modern compute workloads at the card and see what happens.

The two cards being compared

VkFFT implements Fast Fourier Transforms using Vulkan. Nvidia’s GTX 680 couldn’t finish the benchmark, so I’m presenting results from the sub-tests that did complete.

AMD’s GCN architecture pulls ahead in VkFFT, often by a significant margin. VkFFT is not cache friendly and heavily loads the VRAM subsystem. The author of VkFFT, Dmitrii Tolmachev, collected a profile on the mobile RTX 3070 showing 93% of memory throughput and 30% of compute throughput used.

I also profiled the benchmark on my RX 6900 XT and saw occupancy limited by local memory (LDS) capacity. Kepler’s unified 64 KB L1/shared memory setup may work against it compared to AMD’s separate 64 + 16 KB LDS and L1 vector cache configuration.

VkFFT also prints out estimated memory bandwidth usage. The figures suggest both cards are memory bandwidth bound, with AMD’s GCN-based HD 7950 pulling ahead thanks to its wider memory bus.

FluidX3D uses the lattice Boltzmann method to simulate fluid behavior. It features special optimizations that let it use FP32 and still deliver accurate results in “all but extreme edge cases”. I’m using FluidX3D’s built-in benchmark.

FluidX3D can demand a lot of memory bandwidth, but somehow the situation reverses compared to VkFFT. AMD’s HD 7950 now falls behind despite a theoretical advantage in both FP32 compute and VRAM bandwidth. Interestingly, GCN’s performance improves dramatically as we move to Hawaii. The R9 390 is able to beat out the Tesla K80.

Upscayl uses machine learning models to upscale images. I’m using the default Real-ESRGAN model to double a photo’s resolution. Upscayl doesn’t feature a benchmark mode, so I’m timing runs with a stopwatch.

Kepler compares badly to the HD 7950, taking several times longer to process the image.

Final Words

GPU designers have to achieve high performance while balancing area and power consumption. Kepler re-balances area and power priorities compared to Fermi by focusing more on power efficiency. Compared to AMD’s GCN architecture, Kepler falls behind on headline specs like compute throughput and memory bandwidth. To make up for this, Kepler makes its compute throughput easier to utilize. A single thread can access more execution resources, and vector memory accesses enjoy lower latency. The result is a high performance, power efficient architecture that put Nvidia on solid ground as the 28 nm era started. After its GTX 600 series debut, Kepler went on to serve in the GTX 700 generation.

Of course they compared to Kepler. Everyone compares to Kepler

Bill Dally at Hot Chips 2023 talking about the first TPU paper

Despite Kepler’s focus on better gaming performance, compute variants of the architecture saw success thanks to Nvidia’s strong software ecosystem. Tesla K20X GPUs secured several supercomputer wins, including Oak Ridge National Laboratory’s Titan.

Besides being a solid product, Kepler served as a proving ground for techniques that would feature in subsequent Nvidia architectures. Kepler’s static scheduling scheme was successful enough to be used in every Nvidia generation afterward. The control code format would change to allow more flexibility and hand even more responsibility over to the compiler, but the core strategy remained. Kepler’s division of the basic SM building block into four scheduler partitions was also carried forward.

Kepler, seen at the Supercomputing 2023 conference

From the consumer point of view, Kepler was part of a series of Nvidia generations that each offered compelling upgrade options over the previous generation. Fermi, Kepler, Maxwell, and Pascal delivered such an improvement in performance, price, and power efficiency that used card prices would crater as new ones became available. It’s a stark contrast to today’s GPU market, where a last-generation RX 6700 XT still commands midrange prices more than two years after its launch. I can only hope that the GPU market moves back in that direction.

If you like our articles and journalism, and you want to support us in our endeavors, then consider heading over to our Patreon or our PayPal if you want to toss a few bucks our way. If you would like to talk with the Chips and Cheese staff and the people behind the scenes, then consider joining our Discord.


  1. Xiuxia Zhang et al, “Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning”
  2. Nvidia’s Next Generation CUDA Compute Architecture: Kepler GK110/210
  3. Dominik Größler, “Capturing the Memory Topology of GPUs”
  4. Xinxin Mei, Xiaowen Chu, “Dissecting GPU Memory Hierarchy through Microbenchmarking”
  5. Yehia Arafa et al, “Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs”
  6. Technology Overview: Nvidia GeForce GTX 680
  7. Zhe Jia et al, “Dissecting the NVidia Turing T4 GPU via Microbenchmarking”
