Nvidia’s Pascal GP100 GPU: massive bandwidth, enormous double-precision performance

For the past year, enthusiasts have been champing at the bit for the next generation of graphics cards to arrive. The 28nm node has persisted far longer than any previous generation, and while both AMD and Nvidia have introduced multiple products on that node, customers have clearly wanted the power efficiency and performance improvements that the 14/16nm node could provide. Today, Nvidia showcased the full HPC version of Pascal and detailed what the card will offer compared with its previous Maxwell and Kepler products.

Pascal’s renewed focus on high-speed compute

When Nvidia designed Maxwell, it made the decision to remove much of the double-precision floating point capability that was baked into its previous Kepler architecture. The old Tesla K40, based on the GK110 GPU, was capable of up to 1.68 TFLOPS of double-precision compute, while the Tesla M40, which used the Maxwell GM200, could only reach 213 GFLOPS. The M40 still held an advantage over the K40 in single-precision floating point, but its double-precision performance was sharply curtailed. As we discussed last week when AMD launched its FirePro S9300 x2, this limited the kinds of workloads where the M40 could excel.
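To put those numbers in ratio terms, here's a quick sketch. The double-precision figures come from the spec sheets cited above; the single-precision peaks (roughly 4.29 TFLOPS for the K40 and 6.8 TFLOPS for the M40) are commonly cited spec-sheet values and should be treated as assumptions here:

```python
# FP64:FP32 ratios for Tesla K40 (Kepler GK110) vs Tesla M40 (Maxwell GM200).
# DP figures are from the article; the SP peaks (~4.29 and ~6.8 TFLOPS) are
# commonly cited spec-sheet values, included here as assumptions.
k40_dp, k40_sp = 1.68, 4.29   # TFLOPS
m40_dp, m40_sp = 0.213, 6.8   # TFLOPS

print(f"K40 DP:SP ratio ~ 1:{k40_sp / k40_dp:.0f}")  # roughly 1:3
print(f"M40 DP:SP ratio ~ 1:{m40_sp / m40_dp:.0f}")  # roughly 1:32
```

A 1:32 ratio is why the M40 was effectively a single-precision-only part for HPC purposes.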

Pascal’s current GP100 variant adds back all the double-precision floating point that Maxwell was missing — then stuffs some more in, just for good measure. The chart below compares Kepler, Maxwell, and Pascal. Note that the dev blog post states that Pascal can include up to 60 SMs, while the variant described below has just 56.


One interesting aspect of Pascal’s design is that Nvidia has again reduced the number of streaming cores in each processing block, or SM, and adopted the same ratio AMD uses, with each compute block containing 64 processors. The total number of streaming processors has increased 17%, as has the number of texture units. There’s no word yet on ROP counts, but assuming Nvidia follows its historic pattern, the GP100 should have at least 96 ROPs and possibly 128. Base clock is also up 40% over Maxwell, and while Tesla clocks are typically more conservative than their desktop counterparts, the fact that Nvidia squeezed a 40% clock jump out of this silicon suggests we can look forward to similar gains when Pascal comes to the consumer market.
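Those two figures compound: 17% more streaming processors at a 40% higher clock implies a theoretical peak-throughput gain of roughly 1.64x, as a quick back-of-envelope check shows (real-world gains will differ, since workloads rarely hit theoretical peaks):

```python
# Back-of-envelope: how a 17% core-count increase and a 40% clock bump
# compound into theoretical peak throughput. Actual performance gains
# will differ from this idealized figure.
core_gain = 1.17
clock_gain = 1.40
combined = core_gain * clock_gain
print(f"Theoretical peak-throughput gain ~ {combined:.2f}x")  # ~1.64x
```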

The memory interface is the largest generational upgrade. HBM2 offers a 4096-bit bus and 720 GB/s of memory bandwidth, compared with the 336 GB/s available on the highest-end Titan X.
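The 720 GB/s figure follows from the bus width times the per-pin data rate. The article doesn't state the pin rate, so the ~1.4 Gb/s value below is an assumption, chosen because it's consistent with the quoted total:

```python
# HBM2 bandwidth = bus width (bits) * per-pin data rate (Gb/s) / 8 bits-per-byte.
# The 1.4 Gb/s pin rate is an assumption consistent with the quoted 720 GB/s.
bus_width_bits = 4096
pin_rate_gbps = 1.4
bandwidth_gbs = bus_width_bits * pin_rate_gbps / 8
print(f"HBM2: {bandwidth_gbs:.0f} GB/s")           # ~717 GB/s, quoted as 720
print(f"vs Titan X: {bandwidth_gbs / 336:.1f}x")   # just over 2x the Titan X
```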


Pascal also utilizes a simpler datapath organization, superior scheduling with better power efficiency, overlapped load/store instructions, support for Nvidia’s NVLink interface, support for 16-bit floating point (half precision), and improved atomic functions. GP100 also supports ECC memory natively, meaning there’s no performance or storage penalty for activating the feature.
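Half precision is worth a closer look, since it's what lets GP100 pack two FP16 operations into each FP32 lane for deep-learning workloads. NumPy's float16 uses the same IEEE 754 binary16 format, so it can illustrate the range and precision trade-offs on a CPU (this is a format demo, not GPU code):

```python
import numpy as np

# FP16 (half precision) halves storage and doubles per-lane throughput on
# GP100, at the cost of range and precision. NumPy's float16 is the same
# IEEE 754 binary16 format, so it shows the trade-offs.
x = np.float16(2048.0)
y = x + np.float16(1.0)      # 1.0 is below float16's resolution at 2048
print(y == x)                # True: the increment is silently lost
print(np.float16(70000.0))   # inf: beyond float16's maximum (~65504)
```

Those limits are tolerable for neural-network training, where small errors wash out, but they're exactly why FP64 still matters for traditional HPC.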


One note on NVLink: There’s been confusion over where and how this bus is used. For the most part, NVLink is a method of connecting multiple GPUs to each other, especially cross-connections in a multi-socket system, where forcing GPUs attached to two different CPUs to talk to each other would significantly degrade performance.

NVLink can be used to connect the GPU to the CPU directly, but Nvidia’s blog post specifies that this is only applicable to POWER processors.


The diagram above is described as follows: “The [above] figure highlights an example of a four-GPU system with dual NVLink-capable CPUs connected with NVLink. In this configuration, each GPU has 120 combined GB/s bidirectional bandwidth to the other 3 GPUs in the system, and 40 GB/s bidirectional bandwidth to a CPU.”
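Those quoted figures are consistent with each GPU exposing four 40 GB/s bidirectional links: three to its peer GPUs and one to a CPU. That per-link breakdown is an inference from the numbers, not something Nvidia's post states outright:

```python
# Sanity check on the quoted NVLink figures, assuming 40 GB/s bidirectional
# per link and four links per GPU (three to peer GPUs, one to a CPU).
# The per-link topology is an inference from the quoted totals.
link_bw_gbs = 40
peer_links = 3
print(peer_links * link_bw_gbs)  # 120 GB/s combined to the other three GPUs
```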

Nvidia is also claiming that Pascal will offer “Compute Preemption” with a significantly improved computing model. This is one area where Team Green has notably lagged AMD, whose asynchronous compute performance has been much stronger than anything NV has brought to bear. Asynchronous compute and compute preemption are not the same thing, however — we’ll have to wait for shipping hardware to see how Nvidia's approach compares with AMD’s implementation and what the differences are.

An impressive leap forward for HPC, but no consumer launch date yet

It’s obvious that Pascal will significantly improve Nvidia’s HPC position, and that’s important since the company has huge plans for deep learning, self-driving cars, and other HPC workloads. Pascal looks like it’ll be a potent match for Intel's Xeon Phi, Nvidia's primary competitor in this space.

Nvidia has remained mum on consumer launch dates, however, so we’ll have to wait and see when this tech makes it to the mass market. Rumors we’ve heard in other contexts suggest that HBM2 hardware won’t hit the consumer market until later this year due to high initial prices for first-run hardware. It’s entirely possible that Nvidia is using GP100 to fill out its initial high-end products, but will only move to the HBM2 standard for upper-end consumer tiers in the back half of 2016.

When those cards do arrive, they should be a significant upgrade over Maxwell. Pascal's core counts aren’t much higher than Maxwell’s, but the improved clock speeds will drive performance substantially higher, and that’s before any improvement from efficiency gains. If you’re in the market for a new GPU this year, I strongly advise waiting to see what NV and AMD ship in the consumer space, if at all possible.
