
  • by improving the RT & tensor cores

    and HW support for DLSS features and CUDA as a programming platform.

    It might be “a major architecture update” in terms of the amount of work that Nvidia engineering will have to put in to pull off all the new features and RT/TC/DLSS/CUDA improvements without regressing PPA - that’s where the years of effort will be sunk. That could mean large perf improvements in selected application categories and operating modes, but only a very minor improvement in “perf per SM per clock” in no-DLSS rasterization on average.


  • Why actually build the 36 GB one, though? What gaming application will be able to take advantage of more than 24 during the lifetime of the 5090? The 5090 will be irrelevant by the time the next gen of consoles releases, and the current one has 16 GB for VRAM and system RAM combined. 24 is basically perfect for a top-end gaming card.

    And 36 would be even more self-cannibalizing for the professional cards market.

    So it’s unnecessary, expensive, and cannibalizing. Not happening.


  • GA102 to AD102 increased by about 80%

    without scaling DRAM bandwidth anywhere near as much, only partially compensating for that with a much bigger L2.

    For the 5090, on the other hand, we might also have a clock increase in play (another 1.15x?), plus a proportional, 1:1 (unlike Ampere -> Ada) DRAM bandwidth increase by a factor of 1.5 thanks to GDDR7 (no bus width increase necessary; 1.5 = 1.3 * 1.15). So that’s a 1.5x perf increase 4090 -> 5090, which then has to be multiplied by whatever the u-architectural improvements might bring, like Qesa is saying.
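    To spell out that back-of-the-envelope arithmetic (a quick sketch; interpreting the 1.3x as a unit-count gain and the 1.15x as the clock gain, per the assumptions above - none of these are confirmed specs):

    ```python
    # 4090 -> 5090 scaling from the assumptions above; none of these
    # factors are confirmed, they're the rumored/guessed values.
    unit_count_gain = 1.3   # assumed increase in shader resources
    clock_gain      = 1.15  # assumed clock increase

    compute_gain = unit_count_gain * clock_gain
    print(f"compute gain: {compute_gain:.2f}x")  # ~1.5x

    # GDDR7 is assumed to deliver the same ~1.5x DRAM bandwidth on an
    # unchanged bus width, so bandwidth scales 1:1 with compute this time.
    bandwidth_gain = 1.5
    assert abs(compute_gain - bandwidth_gain) < 0.01
    ```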

    Unlike Qesa, though, I’m personally not very optimistic about those u-architectural improvements being very major. To get from the 1.5x that comes out of the node speed increase and the node shrink (subdued and downscaled by the node cost increase) to the recently rumored 1.7x, one would need (1.7 / 1.5 = 1.13) a 13% perf and perf/W improvement, which sounds just about realistic. I’m betting it’ll be even a little bit less, yielding more like a 1.6x proper average; that 1.7x might have been the result of measuring very few apps, or an outright “up to 1.7x” with the “up to” getting lost during the leak (if there even was a leak).
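    The residual that u-architecture alone would have to contribute, spelled out (again, just the rumor math, nothing confirmed):

    ```python
    baseline_gain = 1.5   # from clocks + cost-constrained shrink, per above
    rumored_gain  = 1.7   # the recently rumored overall 4090 -> 5090 figure

    uarch_gain = rumored_gain / baseline_gain
    print(f"required u-arch gain: {uarch_gain:.2f}x")  # ~1.13x, i.e. ~13%
    ```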

    1.6x is absolutely huge, and no wonder nobody’s increasing the bus width: it’s unnecessary for yielding a great product and even more expensive now than it was on 5nm (DRAM controllers almost don’t shrink and are big).




  • but all they would need to do is look at like the top 100 games played every year

    My main hypothesis on this subject: perhaps they already did, and out of the top 100 games only 2 turned out to be possible to accelerate via this method, even after exhaustively checking all possible affinities and scheduling schemes (a minimal sketch of what that looks like at the OS level follows the list below), and only on CPUs with 2 or more 4-core E-clusters.

    Support for the hypothesis comes from the following observations:

    1. how many behavioral requirements the game threads might need to satisfy
    2. how temporally stable the thread behaviors might need to be, probably disqualifying apps with any in-app task scheduling / load balancing
    3. the signal that they possibly didn’t find a single game where 1 4-core E-cluster is enough (how rarely must this be applicable if they apparently needed 2+, for… some reason?)
    4. the odd choice of Metro Exodus, as pointed out by HUB - it’s a single-player game with very high visual fidelity, pretty far down the list of CPU-limited games (did nothing else benefit?)
    5. the fact that neither of the supported games (Metro and Rainbow 6) is based on either of the two most popular game engines (Unity and Unreal), possibly reducing how many apps could be hoped to have similar behavior and benefit.
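    For reference, here’s a minimal sketch of what probing one affinity scheme could look like at the OS level. This is Linux-only (os.sched_setaffinity), and the CPU numbering for the E-clusters is entirely hypothetical and machine-specific:

    ```python
    import os

    # Hypothetical topology: logical CPUs 16-23 form two 4-core E-clusters.
    # The real numbering depends on the CPU and must be read from the OS.
    E_CLUSTER_0 = {16, 17, 18, 19}
    E_CLUSTER_1 = {20, 21, 22, 23}

    def pin_to(cpus, tid=0):
        # On Linux, a tid of 0 means "the calling thread"; passing a real
        # thread id lets a tool pin someone else's (e.g. a game's) threads.
        os.sched_setaffinity(tid, cpus)

    # E.g. confine a worker thread to one E-cluster (or two of them) and
    # then measure whether frame times improve under this scheme.
    pin_to(E_CLUSTER_0 | E_CLUSTER_1)
    print("affinity now:", sorted(os.sched_getaffinity(0)))
    ```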

    Now, perhaps the longer list of games they show on their screenshot is actually the list of games that benefit, and we only got 2 for now because those are the only ones they’ve figured out (at the moment) how to detect thread identities in (possibly not too far off from something as curious as this), or maybe that list is something else entirely and not indicative of anything. Who knows.

    And then there’s the discussion you’re having re: implementation, scaling, and maintenance, with its own can of worms.