
  • by improving the RT & tensor cores

    and HW support for DLSS features and CUDA as a programming platform.

    It might be “a major architecture update” in terms of the amount of work that Nvidia engineering will have to put in to pull off all the new features and RT/TC/DLSS/CUDA improvements without regressing PPA - that’s where the years of effort will be sunk. That could mean large perf improvements in selected application categories and operating modes, but only a very minor improvement in “perf per SM per clock” in no-DLSS rasterization on average.


  • Why actually build the 36 GB one, though? What gaming application will be able to take advantage of more than 24 during the lifetime of the 5090? The 5090 will be irrelevant by the time the next gen of consoles releases, and the current one has 16 GB for VRAM and system RAM combined. 24 is basically perfect for a top-end gaming card.

    And 36 would be even more self-cannibalizing for the professional cards market.

    So it’s unnecessary, expensive, and cannibalizing. Not happening.


  • GA102 to AD102 increased by about 80%

    without scaling DRAM bandwidth anywhere near as much, only partially compensating for that with a much bigger L2.

    For the 5090, on the other hand, we might also have a clock increase in play (another 1.15x?), plus a proportional, 1:1 (unlike Ampere -> Ada) DRAM bandwidth increase by a factor of 1.5 thanks to GDDR7 (no bus width increase necessary; 1.5 = 1.3 * 1.15). So that’s a 1.5x perf increase 4090 -> 5090, which then has to be multiplied by whatever the u-architectural improvements might bring, like Qesa is saying.
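    To spell out that back-of-the-envelope arithmetic (a quick sketch; interpreting the 1.3x as a unit-count gain and the 1.15x as the clock gain, per the assumptions above - none of these are confirmed specs):

    ```python
    # 4090 -> 5090 scaling from the assumptions above; none of these
    # factors are confirmed, they're the rumored/guessed values.
    unit_count_gain = 1.3   # assumed increase in shader resources
    clock_gain      = 1.15  # assumed clock increase

    compute_gain = unit_count_gain * clock_gain
    print(f"compute gain: {compute_gain:.2f}x")  # ~1.5x

    # GDDR7 is assumed to deliver the same ~1.5x DRAM bandwidth on an
    # unchanged bus width, so bandwidth scales 1:1 with compute this time.
    bandwidth_gain = 1.5
    assert abs(compute_gain - bandwidth_gain) < 0.01
    ```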

    Unlike Qesa, though, I’m personally not very optimistic about those u-architectural improvements being very major. To get from the 1.5x that comes out of the node speed increase and the node shrink (subdued and downscaled by the node cost increase) to the recently rumored 1.7x, one would need (1.7 / 1.5 = 1.13) a 13% perf and perf/W improvement, which sounds just about realistic. I’m betting it’ll be even a little bit less, yielding more like a 1.6x proper average; that 1.7x might have been the result of measuring very few apps, or an outright “up to 1.7x” with the “up to” getting lost during the leak (if there even was a leak).
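    The residual that u-architecture alone would have to contribute, spelled out (again, just the rumor math, nothing confirmed):

    ```python
    baseline_gain = 1.5   # from clocks + cost-constrained shrink, per above
    rumored_gain  = 1.7   # the recently rumored overall 4090 -> 5090 figure

    uarch_gain = rumored_gain / baseline_gain
    print(f"required u-arch gain: {uarch_gain:.2f}x")  # ~1.13x, i.e. ~13%
    ```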

    1.6x is absolutely huge, and no wonder nobody’s increasing the bus width: it’s unnecessary for yielding a great product and even more expensive now than it was on 5nm (DRAM controllers almost don’t shrink and are big).




  • but all they would need to do is look at like the top 100 games played every year

    My main hypothesis on this subject: perhaps they already did, and out of the top 100 games only 2 turned out to be possible to accelerate via this method, even after exhaustively checking all possible affinities and scheduling schemes (a minimal sketch of what that looks like at the OS level follows the list below), and only on CPUs with 2 or more 4-core E-clusters.

    Support for the hypothesis comes from the following observations:

    1. how many behavioral requirements the game threads might need to satisfy
    2. how temporally stable the thread behaviors might need to be, probably disqualifying apps with any in-app task scheduling / load balancing
    3. the signal that they possibly didn’t find a single game where 1 4-core E-cluster is enough (how rarely must this be applicable if they apparently needed 2+, for… some reason?)
    4. the odd choice of Metro Exodus, as pointed out by HUB - it’s a single-player game with very high visual fidelity, pretty far down the list of CPU-limited games (did nothing else benefit?)
    5. the fact that neither of the supported games (Metro and Rainbow 6) is based on either of the two most popular game engines (Unity and Unreal), possibly reducing how many apps could be hoped to have similar behavior and benefit.
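    For reference, here’s a minimal sketch of what probing one affinity scheme could look like at the OS level. This is Linux-only (os.sched_setaffinity), and the CPU numbering for the E-clusters is entirely hypothetical and machine-specific:

    ```python
    import os

    # Hypothetical topology: logical CPUs 16-23 form two 4-core E-clusters.
    # The real numbering depends on the CPU and must be read from the OS.
    E_CLUSTER_0 = {16, 17, 18, 19}
    E_CLUSTER_1 = {20, 21, 22, 23}

    def pin_to(cpus, tid=0):
        # On Linux, a tid of 0 means "the calling thread"; passing a real
        # thread id lets a tool pin someone else's (e.g. a game's) threads.
        os.sched_setaffinity(tid, cpus)

    # E.g. confine a worker thread to one E-cluster (or two of them) and
    # then measure whether frame times improve under this scheme.
    pin_to(E_CLUSTER_0 | E_CLUSTER_1)
    print("affinity now:", sorted(os.sched_getaffinity(0)))
    ```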

    Now, perhaps the longer list of games they show on their screenshot is actually the list of games that benefit, and we only got 2 for now because those are the only ones they’ve figured out (at the moment) how to detect thread identities in (possibly not too far off from something as curious as this), or maybe that list is something else entirely and not indicative of anything. Who knows.

    And then there’s the discussion you’re having re: implementation, scaling, and maintenance, with its own can of worms.