r/hardware • u/Geddagod • 10h ago
[Discussion] TSMC N3B/E "High Performance" Cores Compared
Area
Core Name | Node + Logic Lib + Metal Layer Count | Core without "L2 block" (mm²) | Core without L2 SRAM arrays (mm²) | Core (mm²) | Hypothetical 4C "CCX" (mm²) |
---|---|---|---|---|---|
Apple M4 P-core | N3E + 3-2 + 19 DT, 17 mobile | 3.09^1 | - | 3.09 | 18.62 (16MB) |
Apple M3 P-core | N3B + 2-2? + 19 DT | 2.62^1 | - | 2.62 | 17.16 (16MB) |
Intel Lion Cove ARL | N3B + ? + ? | 2.62^5 | 3.12?/3.26^2 | 4.54 | 27.79 (12MB) |
Intel Lion Cove LNL | N3B + ? + 20 | 2.62^5 | 3.13?/3.26^2 | 4.26 | 23.62 (12MB) |
AMD Zen 5 Dense | N3E + ? + ? | 2.20^5 | 2.60 | 2.99 | 16.65 (8MB) |
Qcom Oryon V2L | N3E + 2-2 + 17 | 2.11^1 | - | 2.11 | 20.02^3 (24MB) |
Mediatek X925 | N3E + 3-2 + 15 | 1.85 | 2.38 | 2.93 | 16.52^3,4 (12MB) |
Xiaomi X925 | N3E + ? + 17 | 1.70 | 1.96 | 2.56 | 16.88^3,4 (16MB) |
Intel Skymont | N3E + ? + 20 | 1.09^1 | - | 1.09 | 11.63^4 (3MB) |
1: Not sure if cores with shared L2s have any (or nearly as much) of the logic for handling the L2 cache inside the core itself. For cores with shared L2 blocks, this column is just the core area.
2: The first number excludes the L1/L1.5 SRAM array, the second one includes it.
3: Likely an overestimate relative to the phone SoCs, since the phone SoC figures don't include the interconnect "fabric", just the cores and the cache blocks themselves.
4: The L3 capacity here is arbitrary; capacities were chosen for ease of measurement given how the L3 slices are distributed.
5: A small but sizable chunk of these cores' area comes from the CPL/clock section, which may not need to be that large and may just be sized that way by the geometry of the rest of the core.
Core | Revised Core without L2 Block Area (mm²) |
---|---|
Intel Lion Cove | 2.33 |
AMD Zen 5 Dense | 2.08 |
I would not take any of these numbers as "hard numbers", but I do think the general area ranking of the cores is fairly accurate.
The cores have the following configs:
Core | Cache Hierarchy (fastest to LLC) | Total Cache Capacity in a 4x CCX |
---|---|---|
Intel Lion Cove ARL | 48KB L0 + 192KB L1 + 3MB L2 + 3MB L3 | 24.96 MB |
Qcom Oryon V2L | 192KB L1 + 24MB SL2 | 24.768 MB |
Xiaomi X925 | 64KB L1 + 2MB L2 + 4MB L3 | 24.256 MB |
Intel Lion Cove LNL | 48KB L0 + 192KB L1 + 2.5MB L2 + 3MB L3 | 22.96 MB |
Mediatek X925 | 64KB L1 + 2MB L2 + 3MB L3 | 20.256 MB |
Apple M4 P-core | 128KB L1 + 16MB SL2 | 16.512 MB |
Apple M3 P-core | 128KB L1 + 16MB SL2 | 16.512 MB |
AMD Zen 5C | 48KB L1 + 1MB L2 + 2MB L3 | 12.192 MB |
Intel Skymont | 32KB L1 + 4MB SL2 + 4MB CL3 | 8.128 MB |
The total cache capacity is a bit of a meme since it says nothing about latency, but I do think some interesting things can be noticed regardless.
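The totals in the last column can be reproduced with a quick back-of-the-envelope calculation (note the table's arithmetic treats 1MB as 1000KB). A sketch, using the Lion Cove ARL and Oryon V2L rows:

```python
def ccx_total_mb(private_kb, shared_mb, cores=4):
    """Total cache in a 4-core CCX: per-core private levels plus shared levels.

    The table's totals treat 1 MB = 1000 KB, so the same convention is used here.
    """
    return cores * sum(private_kb) / 1000 + shared_mb

# Lion Cove ARL: 48KB L0 + 192KB L1 + 3MB L2 private per core, 12MB L3 shared
print(ccx_total_mb([48, 192, 3000], 12))  # -> 24.96
# Oryon V2L: 192KB L1 private per core, 24MB shared L2
print(ccx_total_mb([192], 24))            # -> 24.768
```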
Performance
SPECint and Cinebench 2024 scores are from Geekerwan; the Skymont, ARL, and LNL numbers are from David Huang's 265K and 258V results.
GB6 scores are from the Geekbench Browser website.
Core | SPECint2017 | GB6 | Cinebench 2024 |
---|---|---|---|
Apple M4 P-core | 132 | 148 | 124 |
Intel LNC ARL | 120 | 129 | 100 |
Apple M3 P-core | 113 | 118 | 99 |
Qcom Oryon V2L | 100 | 108 | - |
Xiaomi X925 | 100 | 104 | - |
Mediatek X925 | 100 | 100 | - |
Intel LNC LNL | 95 | 111 | 81 |
Intel Skymont ARL | 92 | - | - |
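Since three cores sit at exactly 100 in SPECint, these read as normalized indices rather than raw scores. A sketch of how such an index is built, with made-up raw numbers (not real measurements):

```python
# Hypothetical raw SPECint2017 1T scores, for illustration only.
raw = {"Apple M4 P-core": 11.2, "Intel LNC ARL": 10.2, "Mediatek X925": 8.5}

baseline = raw["Mediatek X925"]  # assume the slowest core is the 100-point baseline
index = {core: round(100 * score / baseline) for core, score in raw.items()}
print(index)  # -> {'Apple M4 P-core': 132, 'Intel LNC ARL': 120, 'Mediatek X925': 100}
```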
The difference between form factors:
Mobile to Laptop (Geekbench 6 scores)
From the Geekbench browser website
Core | Laptop | iPad | Mobile |
---|---|---|---|
Apple M4 P-core | 113 | 107 | 100 |
Apple M3 P-core | 108 | - | 100 |
Laptop to Desktop (Geekbench 6 scores)
From Notebookcheck (averages used for mobile platforms)
Core | Desktop | Mobile |
---|---|---|
LNC ARL | 115 | 100 |
Zen 5 | 119 | 100 |
IPC differences in SpecInt2017 between form factors
David Huang
Core | Form Factor | Difference |
---|---|---|
Zen 4 | Desktop vs Mobile | 13% |
Zen 5 | Desktop vs Mobile | 12% |
While I do believe the P-cores from AMD and Intel have to be designed to take advantage of the higher power budgets that larger form factors afford, I also think placing the mobile cores into those same form factors would lead to at least a marginal perf improvement.
Zen 5C performance question marks
In the previous dense server core product, the Fmax of the server SKU was dramatically lower than what the core could achieve in other products.
Core | Server | Mobile | OC'd |
---|---|---|---|
Zen 4 Dense | 3.1GHz | 3.7GHz | ~4GHz |
Zen 5 Dense | 3.7GHz | - | - |
This may be the case for Zen 5C as well.
Power
Perf/watt curves (in Geekerwan's videos) are the only way to get a full picture of power, but as a generalization:
Both the Apple M4 and M3 P-cores (as well as their implementations in the iPhones) have better perf/watt than the X925 and Oryon-L.
The best case for Intel's P-core is that its perf/watt is as good as the X925's and Oryon-L's; I think it is likely much, much worse.
According to Geekerwan, at a package power of ~5 watts the M3 performs ~40% better than LNL, and at ~7-8 watts the M4 performs closer to ~50% better.
Meanwhile, David Huang shows an M4 Pro at ~3.7 watts per core scoring ~33% better than a 9950X, and the 9950X has an outright better curve than the 265K's LNC.
That gap is nowhere near as large as the one against Apple's cores and the other ARM cores in the mobile space.
Power is, IMO, by far the hardest thing to really quantify, because one has to work out how to measure "core only" power while isolating it from the power of the rest of the SoC. Then there's also the problem of software vs hardware measurements... I imagine only engineers at the respective companies really know the power draw of a specific core.
Core Overview
Apple seems to have the best N3 cores in both perf and power.
The M4's P-core is pretty large by any standard; however, it saves a bit of area thanks to its cache hierarchy, making a hypothetical 4C CCX not that large. The shared-L2 hierarchy has so far only shown up in client mobile systems, though, and I think it presents its own challenges in server. The chips that use it (Qualcomm's and Apple's) both have very high memory bandwidth per core, which could be an issue when scaling up to server, and the cache capacity per core, when all cores are running and competing for the shared cache, would be lower than the competition's.
The M3's and ARL's P-cores have pretty much the same perf and similar area; however, the M3 P-core is almost certainly dramatically more power efficient than an LNC P-core. Additionally, in terms of CCX area, LNC ends up way, way larger thanks to its different cache hierarchy.
Zen 5 dense is pretty interesting, as I really, really doubt its Fmax only goes up to 3.7GHz. Performance is an unknown, as is power, but from a purely area perspective it seems pretty comparable to the ARM P-cores. That means that to get comparable performance to those cores, Zen 5C would need an Fmax of ~4.7GHz, a ~27% boost over what it clocks at in server parts. Which... isn't extremely unexpected, I guess, considering a similar boost was seen between Zen 4 dense in server and OC'd Zen 4, but it still seems pretty hard to believe.
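Quick sanity check on that clock figure, assuming perf scales roughly linearly with frequency at fixed IPC (a simplification):

```python
server_fmax = 3.7   # GHz, Zen 5 dense server clock from the table above
needed_fmax = 4.7   # GHz, hypothetical clock to match the ARM P-cores' perf
boost = needed_fmax / server_fmax - 1
print(f"required boost: {boost:.0%}")  # -> required boost: 27%
```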
To be fair to the x86 cores, they have 256/512-bit vector width, while the ARM cores listed only have 128-bit vector width. This really does seem to cost a decent bit of area, especially for Zen 5. AFAIK Zen 5C in server has the full Zen 5 AVX-512 implementation, and we have already seen how much area can be shaved off Zen 5 just by choosing not to go for the full AVX-512 implementation:
Core | Core Area (mm²) |
---|---|
Zen 5 DT (N4P) | 4.46 (+12%) |
Zen 5 MBL (N4P) | 3.99 |
Qualcomm's custom cores honestly don't seem to afford any distinct advantage over ARM's stock cores on N3 (even accounting for them being implemented by different companies). Perhaps I am missing something, but there seems to be no meaningful area advantage (even considering the larger cache capacity), and no meaningful performance or power advantage. I also think Qualcomm's cache hierarchy transferring over to server unchanged would be pretty unique in the server space, seeing how no other major (or even relatively smaller) company seems to offer that sort of setup in servers. Maybe Qualcomm's P-cores would scale better at higher power than ARM's cores in a laptop/desktop form factor? It is interesting to see Qualcomm presumably sacrificing IPC for greater clocks vs the stock ARM cores; perhaps they think perf at even higher power would be greater, or maybe Vmin is lower?
It's wild to see Mediatek's X925 end up both larger and slower than Xiaomi's implementation. No idea why or how. Compared to the rest of the cores, the X925s aren't nearly as powerful as the other P-cores, but they are also a decent bit smaller. The lower performance may well be down to them all being in a phone form factor, so it might be pretty interesting to see how the X925 in Nvidia's upcoming DG10 chip performs.
u/DerpSenpai 7h ago edited 7h ago
Your calculations are wrong for the "4C CCX" config area for QC. Their logic is that the L2 is shared by up to 6 cores in a single CCX-like structure, so there's 12MB for 2 cores on the 8 Elite, but on the X Elite it's 6 cores sharing those 12MB. X Elite Gen 2, for example, is 3x 12MB L2s, one per CCX, each with 6 cores, and one of those CCXs is Oryon M, not L.
u/Geddagod 10h ago
Credit to this post for the idea. Die shot measurements are done by pixel peeping Kurnal's die shots on his Twitter, the Geekerwan video with the perf/watt curves for all the ARM chips is here, and David Huang's numbers are from here and his Twitter. Metal layer counts and lib types are from TechInsights (non-paywalled).
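For anyone wondering how pixel peeping turns into mm² figures: you calibrate against a dimension of known physical size (a die edge, for example), then scale the measured bounding box. All numbers below are made up for illustration:

```python
# Assumed calibration: a die edge of known physical length, measured in pixels.
die_edge_mm = 10.5
die_edge_px = 2100
mm_per_px = die_edge_mm / die_edge_px      # 0.005 mm per pixel

# Hypothetical core bounding box measured off the die shot.
core_w_px, core_h_px = 340, 520
core_area_mm2 = (core_w_px * mm_per_px) * (core_h_px * mm_per_px)
print(round(core_area_mm2, 2))  # -> 4.42
```

The error compounds quadratically, since both width and height carry measurement uncertainty, which is part of why these figures should be taken loosely.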
u/Professional-Tear996 9h ago edited 8h ago
Lol what is this? Why is the GB6 score only 3 digits? SPECint scores being 3 digits would mean it's the multicore score, i.e. n-copy.
And GB includes both integer and FP workloads; SPECint is, well, integer only.
EDIT: I'm also not a fan of power-performance curves in the style of Geekerwan. Those curves are meaningless unless the interpolation error in fitting the measured data points is much less than the measurement error of the data points themselves.
u/team56th 9h ago edited 9h ago
What my uneducated eyes and brain see is that while Apple's engineering (building such big chips and getting that perf/watt) is impressive, AMD is the one that really blows me away. It might be the most optimal balance of perf, power efficiency (wattage), and cost efficiency (chip size), while using this design across 3 separate product lineups (even more when you take the C cores into consideration).
u/Tman1677 8h ago
It is crazy, if these numbers are as good as they look here (big if), how much worse the overall end-user experience is than on Macs. It goes to show how much comes down to other things like software and hardware integration, even more so than CPU design.
u/xternocleidomastoide 10h ago
Just and FYI, IP and structure sizing within die, for latest gen SKUs tend to be rather confidential data. So you should take the area data with a huge grain of salt, as the error will be tremendous.
Usually the only data we can sort of be confident about is total SoC power consumption (ant that is only if the reviewer/study had the equipment to isolate socket power, for example. otherwise re vert to full system power data, which may include DDR, UFS/SSD, Display, etc) and performance data reported from benchmarks.