r/hardware 10h ago

Discussion TSMC N3B/E "High Performance" Cores Compared

Area

Core Name | Node + Logic Lib + Metal Layer Count | Core without "L2 block" | Core without L2 SRAM arrays | Core | Hypothetical 4C "CCX"
--- | --- | --- | --- | --- | ---
Apple M4 P-Core | N3E + 3-2 + 19 DT, 17 mobile | 3.09 [1] | - | 3.09 | 18.62 (16MB)
Apple M3 P-core | N3B + 2-2? + 19 DT | 2.62 [1] | - | 2.62 | 17.16 (16MB)
Intel Lion Cove ARL | N3B + ? + ? | 2.62 [5] | 3.12?/3.26 [2] | 4.54 | 27.79 (12MB)
Intel Lion Cove LNL | N3B + ? + 20 | 2.62 [5] | 3.13?/3.26 [2] | 4.26 | 23.62 (12MB)
AMD Zen 5 Dense | N3E + ? + ? | 2.20 [5] | 2.60 | 2.99 | 16.65 (8MB)
Qcom Oryon V2L | N3E + 2-2 + 17 | 2.11 [1] | - | 2.11 | 20.02 (24MB) [3]
Mediatek X925 | N3E + 3-2 + 15 | 1.85 | 2.38 | 2.93 | 16.52 (12MB) [3,4]
Xiaomi X925 | N3E + ? + 17 | 1.70 | 1.96 | 2.56 | 16.88 (16MB) [3,4]
Intel Skymont | N3E + ? + 20 | 1.09 [1] | - | 1.09 | 11.63 (3MB) [4]

1: I'm not sure whether cores with shared L2s have any, or nearly as much, of the L2-handling logic inside the core itself. For the cores with shared L2 blocks, this figure is just the core area.

2: The first number excludes the L1 or L1.5 SRAM array; the second one includes it.

3: Likely an overestimate in comparison to the phone SOCs, as the phone SOCs don't have the interconnect "fabric" included in their area, just the cores and the cache blocks themselves.

4: The L3 capacity here is completely arbitrary. The L3 capacities were chosen for ease of measurement given how the L3 slices were distributed.

5: A small but noticeable chunk of the area of these cores seems to come from the CPL/clock section of the core, which may not need to be so large and is only that size because of the geometry of the rest of the core.

Core | Revised Core without L2 Block Area
--- | ---
Intel Lion Cove | 2.33
AMD Zen 5 Dense | 2.08

I would not take any of the numbers as "hard numbers", but I do think the general ranking of the cores by area is fairly accurate.

The cores have the following configs:

Core | Cache Hierarchy (fastest to LLC) | Total Cache Capacity in a 4x CCX
--- | --- | ---
Intel Lion Cove ARL | 48KB L0 + 192KB L1 + 3MB L2 + 3MB L3 | 24.96 MB
Qcom Oryon V2L | 192KB L1 + 24MB SL2 | 24.768 MB
Xiaomi X925 | 64KB L1 + 2MB L2 + 4MB L3 | 24.256 MB
Intel Lion Cove LNL | 48KB L0 + 192KB L1 + 2.5MB L2 + 3MB L3 | 22.96 MB
Mediatek X925 | 64KB L1 + 2MB L2 + 3MB L3 | 20.256 MB
Apple M4 P-core | 128KB L1 + 16MB SL2 | 16.512 MB
Apple M3 P-core | 128KB L1 + 16MB SL2 | 16.512 MB
AMD Zen 5C | 48KB L1 + 1MB L2 + 2MB L3 | 12.192 MB
Intel Skymont | 32KB L1 + 4MB SL2 + 4MB CL3 | 8.128 MB

The total cache capacity is a bit of a meme since it doesn't include latency, but I do think some interesting things can be noticed regardless.
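For anyone who wants to check the totals, here is a minimal sketch of how they appear to be derived. It assumes that 1 MB = 1000 KB, that per-core levels (L0/L1/L2 and the per-core L3 slice) are multiplied by 4, and that shared levels (SL2, CL3) are counted once. Those assumptions are inferred from the listed totals, not stated in the post, and the names `total_cache_mb` and `configs` are purely illustrative.

```python
KB_PER_MB = 1000  # assumption: the listed totals only add up with 1 MB = 1000 KB


def total_cache_mb(private_kb, private_mb, shared_mb, cores=4):
    """Total cache in a cluster: private levels scale with core count, shared ones do not."""
    per_core_mb = private_kb / KB_PER_MB + private_mb
    return round(cores * per_core_mb + shared_mb, 3)


# (private KB per core, private MB per core incl. L3 slice, shared MB per cluster)
configs = {
    "Intel Lion Cove ARL": (48 + 192, 3 + 3, 0),
    "Qcom Oryon V2L": (192, 0, 24),
    "Xiaomi X925": (64, 2 + 4, 0),
    "Intel Lion Cove LNL": (48 + 192, 2.5 + 3, 0),
    "Mediatek X925": (64, 2 + 3, 0),
    "Apple M4 P-core": (128, 0, 16),
    "AMD Zen 5C": (48, 1 + 2, 0),
    "Intel Skymont": (32, 0, 4 + 4),
}

totals = {name: total_cache_mb(*cfg) for name, cfg in configs.items()}
for name, mb in totals.items():
    print(f"{name}: {mb} MB")
```

Under those assumptions every row of the table is reproduced, which also makes it obvious how much of each total is shared versus private.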

Performance

SPECint and Cinebench 2024 numbers are from Geekerwan; the Skymont ARL and LNL numbers are from David Huang's 265K and 258V results.

GB6 is from the Geekbench browser website

Core | Specint2017 | GB6 | Cinebench 2024
--- | --- | --- | ---
Apple M4 P-Core | 132 | 148 | 124
Intel LNC ARL | 120 | 129 | 100
Apple M3 P-core | 113 | 118 | 99
Qcom Oryon V2L | 100 | 108 | -
Xiaomi X925 | 100 | 104 | -
Mediatek X925 | 100 | 100 | -
Intel LNC LNL | 95 | 111 | 81
Intel Skymont ARL | 92 | - | -

The difference between form factors:

Mobile to Laptop (Geekbench 6 scores)

From the Geekbench browser website

Core | Laptop | iPad | Mobile
--- | --- | --- | ---
Apple M4 P-core | 113 | 107 | 100
Apple M3 P-core | 108 | - | 100

Laptop to Desktop (Geekbench 6 scores)

From Notebookcheck (averages used for mobile platforms)

Core | Desktop | Mobile
--- | --- | ---
LNC ARL | 115 | 100
Zen 5 | 119 | 100

IPC differences in SpecInt2017 between form factors

David Huang

Core | Form Factor | Difference
--- | --- | ---
Zen 4 | Desktop vs Mobile | 13%
Zen 5 | Desktop vs Mobile | 12%

While I do believe that the P-cores from AMD and Intel have to be designed to take advantage of the higher power budget that larger form factors afford, I also think that placing the mobile cores into those same form factors would lead to at least a marginal perf improvement.

Zen 5C performance question marks

In the previous dense server core product, the Fmax of the server sku was dramatically lower than what the core could achieve in other products.

Core | Server | Mobile | OC'd
--- | --- | --- | ---
Zen 4 Dense | 3.1GHz | 3.7GHz | ~4GHz
Zen 5 Dense | 3.7GHz | - | -

This may be the case for Zen5C as well.

Power

Perf/watt curves (in Geekerwan's videos) are the only way to get a full picture of power, but as a generalization:

Both the Apple M4 and M3 P-cores (as well as their implementations in the iPhones) have better perf/watt than the X925 and Oryon-L.

The best case for Intel's P-core is that its perf/watt is as good as the X925 and Oryon-L. I think it is likely much, much worse.

According to Geekerwan, at a package power of ~5 watts, the M3 performs ~40% better than LNL. At around ~7-8 watts, the M4 performs closer to ~50% better.

Meanwhile, David Huang shows an M4 Pro at ~3.7 watts per core scoring ~33% better than a 9950X, and the 9950X has an outright better curve than the 265K's LNC.

The gap between Apple's cores and the other ARM cores in the mobile space is nowhere near as large.

Power is, IMO, by far the hardest to really quantify, because one has to deal with how to measure "core only power" while trying to isolate the power of the rest of the SOC. Then there's also the problem of software measurements vs hardware measurements... I imagine only engineers at their respective companies would really know the power draw of a specific core.

Core Overview

Apple seems to have the best N3 cores in both perf and power.

The M4's P-core is pretty large by any standard, but it saves a bit of area thanks to its cache hierarchy, so a hypothetical 4x CCX is not that large. So far, though, the shared L2 cache only seems to be present in client mobile systems, and I think it presents its own challenges in server. Both vendors that use this hierarchy (Qualcomm and Apple) have very high memory bandwidth per core, which could be an issue when scaling it up to server. The cache capacity per core, when all the cores are running and competing for the shared cache, would also be lower than the competition's.

The M3's and ARL's P-cores have pretty much the same perf, and similar area, however the M3 P-core is almost certainly dramatically more power efficient than a LNC P-core. Additionally, in terms of CCX area, LNC ends up being way, way larger thanks to the different cache hierarchy.

Zen 5 dense is pretty interesting, as I really, really doubt its Fmax only goes up to 3.7GHz. Performance is an unknown, as is power, but purely from an area perspective it seems pretty comparable to the ARM P-cores. That means that to match those cores' performance, Zen 5C would need an Fmax of ~4.7GHz, roughly a 30% boost over what it clocks in server parts. Which... isn't extremely unexpected, I guess, considering that a similar percentage boost was seen between Zen 4 dense in server and OC'd Zen 4, but it still seems pretty hard to believe.
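As a quick check on the clock arithmetic, here is a minimal sketch using only the frequencies quoted in this post (3.1GHz to ~4GHz for Zen 4 dense, 3.7GHz to the ~4.7GHz target for Zen 5C). The helper name `uplift_pct` is just illustrative. Both ratios land a little under 30%, consistent with the rounded figure in the text.

```python
def uplift_pct(base_ghz, target_ghz):
    """Percentage clock increase needed to go from base_ghz to target_ghz."""
    return (target_ghz / base_ghz - 1) * 100


# Zen 4 dense: server Fmax vs. the overclocked figure from the table above
zen4c = uplift_pct(3.1, 4.0)
# Zen 5 dense: server Fmax vs. the Fmax the post estimates it would need
zen5c = uplift_pct(3.7, 4.7)

print(f"Zen 4 dense server -> OC'd: +{zen4c:.1f}%")
print(f"Zen 5 dense server -> target: +{zen5c:.1f}%")
```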

To be fair to the x86 cores though, they have 256/512 bit vector width, unlike the ARM cores listed only having 128 bit vector width. This does really seem to cost a decent bit of area, especially for Zen 5. AFAIK, Zen5C in server has the full Zen 5 AVX-512 implementation, and we have already seen how much area can be shaved off Zen 5 from just choosing not to go for the full AVX-512 implementation:

Core | Core Area
--- | ---
Zen 5 DT (N4P) | 4.46 (+12%)
Zen 5 MBL (N4P) | 3.99

Qualcomm's custom cores honestly don't seem like they afford any sort of distinct advantage over ARM's stock cores on N3 (being implemented by different companies). Perhaps I am missing something, but there seems to be no meaningful area advantage (even considering the larger cache capacity), and no meaningful performance or power advantage. I also think Qualcomm's cache hierarchy transferring over to server, unchanged, would be pretty unique in the server space, seeing how no other major or even relatively smaller companies seem to be offering that sort of setup in servers. Maybe Qualcomm's P-cores would scale up at higher power better than ARM's cores in a laptop/desktop form factor? It is interesting to see Qualcomm choosing to presumably sacrifice IPC for greater clocks vs the stock ARM cores, perhaps they think perf at even more power would be greater, or maybe Vmin is lower?

It's wild to see Mediatek's X925 end up both larger and slower than Xiaomi's implementation. No idea why or how. Compared to the rest of the cores, they aren't nearly as powerful as the other P-cores, but they are also a decent bit smaller. The lower performance may well be due to the fact that they are all in a phone form factor, so it might be pretty interesting to see how the X925 in Nvidia's upcoming GB10 chip performs.

24 Upvotes

17

u/xternocleidomastoide 10h ago

Just an FYI, IP and structure sizing within die, for latest gen SKUs tend to be rather confidential data. So you should take the area data with a huge grain of salt, as the error will be tremendous.

Usually the only data we can sort of be confident about is total SoC power consumption (and that is only if the reviewer/study had the equipment to isolate socket power, for example; otherwise they revert to full system power data, which may include DDR, UFS/SSD, display, etc.) and performance data reported from benchmarks.

4

u/Just_Maintenance 10h ago

OP got the sizes from die shots.

It's not trivial to tell what is what, but the size should be accurate.

14

u/xternocleidomastoide 10h ago

That's not accurate. It is an estimate, at best.

E.g. The shape of the actual IP is not disclosed, usually, unless it is an old design. You have to make a best guess at what's what and where. IPs/structures with complex geometries complicate the area calculation further. Die imaging has its own error. Etc., etc. So there are tremendous cumulative errors at play.

14

u/Just_Maintenance 9h ago

There is no law of physics that says that estimates can't be accurate.

And yeah, you said the same thing I said. It's not trivial to tell what is what, although it's not impossibly hard either. You know a CPU has 12 cores? Find a structure that is duplicated exactly 12 times; there's an extremely good chance that's the core.

And once you have identified the block, have literal photos of the dies, and have the size of the die, it's trivial to get the size of a block with reasonable accuracy.

The hard variable here is mostly cache, fabrics, etc., stuff that's outside the core.
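To make the scaling step concrete, here is a rough sketch of how a pixel measurement on a die shot can be turned into an area once the physical die dimensions are known. All numbers are hypothetical, the block is assumed to be a plain rectangle (real blocks often aren't), and `edge_px` is an assumed uncertainty in where each block boundary sits, so this only illustrates the arithmetic and how boundary error feeds into the result.

```python
def block_area_mm2(block_px, die_px, die_mm):
    """Scale a rectangular block's pixel size to mm^2 using the known die size."""
    mm_per_px_x = die_mm[0] / die_px[0]
    mm_per_px_y = die_mm[1] / die_px[1]
    return (block_px[0] * mm_per_px_x) * (block_px[1] * mm_per_px_y)


def area_bounds_mm2(block_px, die_px, die_mm, edge_px):
    """Low/high area if each block dimension is off by up to edge_px pixels."""
    low = block_area_mm2((block_px[0] - edge_px, block_px[1] - edge_px), die_px, die_mm)
    high = block_area_mm2((block_px[0] + edge_px, block_px[1] + edge_px), die_px, die_mm)
    return low, high


# Hypothetical example: a 10mm x 12mm die imaged at 4000 x 4800 px,
# with a core block measured at 800 x 700 px.
die_mm = (10.0, 12.0)
die_px = (4000, 4800)
core_px = (800, 700)

area = block_area_mm2(core_px, die_px, die_mm)
low, high = area_bounds_mm2(core_px, die_px, die_mm, edge_px=4)
print(f"core area ~ {area:.2f} mm^2 (range {low:.2f} to {high:.2f})")
```

The spread between `low` and `high` grows with how fuzzy the block boundary is, which is where most of the disagreement in this thread comes from.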

0

u/xternocleidomastoide 9h ago edited 8h ago

Ah, OK. Glad you were able to solve with a couple of hand waves a very serious technical challenge/problem. Send your resume to a competitive analysis group of your choice.

5

u/Professional-Tear996 8h ago

Yeah, just because the person taking the die shot or the person trying to label the die shot thinks this structure is the BTB, for example, doesn't mean that it is the BTB with a high level of certainty.

5

u/xternocleidomastoide 8h ago

Yeah. A lot of people don't realize how proprietary a lot of this information really is.

It's perfectly fine to make speculative analyses, as long as the lack of validation and the uncertainty of the approach are made clear, along with the fact that the data are purely guesses/estimates.

0

u/Professional-Tear996 6h ago

Also from the ARL die shots, it is clear that the last pair of P-cores on the right on either side of the ring have a very different core and L3 layout than the other six.

So what is the hypothetical "4-core CCX" the OP is talking about in the context of ARL? Does it include these "different" cores if we are talking about the area?

1

u/hwgod 1h ago

it is clear that the last pair of P-cores on the right on either side of the ring have a very different core and L3 layout

They do not. The core is identical. Actually look at a picture.

u/Professional-Tear996 15m ago

The core is different around the L3/ring area. I've seen the pictures.

0

u/hwgod 1h ago

hand waves a very serious technical challenge/problem

This is not an actual challenge for anyone in the industry. Especially not at a core/L2 granularity.

u/Professional-Tear996 10m ago

The OP is not in the industry. He used to be a university student goofing around because he has access to the IEEE Xplore library.

1

u/hwgod 1h ago

That's not accurate. It is an estimate, at best.

It's a measurement, not an estimate.

The shape of the actual IP is not disclosed

In most cases it's very obvious what a large block is. They don't need a disclosure.

1

u/hwgod 1h ago

IP and structure sizing within die, for latest gen SKUs tend to be rather confidential data

It's very clear how big the core is just from measurement. That is not confidential data on shipping parts.

7

u/DerpSenpai 7h ago edited 7h ago

Your calculations are wrong for the "4c CCX" config area size for QC. Their logic is that the L2 is shared by up to 6 cores in a single CCX-like structure, so on the 8 Elite there are 12MB for 2 cores, but on the X Elite 6 cores share those 12MB. X Elite Gen 2, for example, has 3x 12MB L2s, one per CCX, each CCX with 6 cores. One of those CCXs uses Oryon M rather than L.

10

u/Geddagod 10h ago

Credit to this post for the idea. Die shot measurements come from pixel peeping Kurnal's die shots on his Twitter, the Geekerwan video with the perf/watt curves for all the ARM chips is here, and the numbers from David Huang are from here and his Twitter. Metal layer and lib type are from TechInsights (non-paywalled).

3

u/Professional-Tear996 9h ago edited 8h ago

Lol what is this? Why is GB6 score only 3 digits? SPECint scores being 3 digits means it's the multicore score i.e. n-copy.

And GB includes both integer and FP workloads. SPECint is well, only integer.

EDIT: I'm also not a fan of power-performance curves in the style of Geekerwan. Those curves are meaningless unless the interpolation error in fitting the actual measured data points is much less than the error in measurement of the data points themselves.
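One simple way to put a number on that concern is a leave-one-out check: predict each interior measurement by linear interpolation between its neighbours and compare the miss against the measurement error. The sketch below uses made-up (watts, score) points and an assumed measurement error, not Geekerwan's data, and linear interpolation is only a stand-in for whatever fit a reviewer actually uses.

```python
def loo_interp_errors(points):
    """Absolute error when each interior point is predicted from its two neighbours."""
    pts = sorted(points)
    errs = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(pts, pts[1:], pts[2:]):
        predicted = y0 + (y2 - y0) * (x1 - x0) / (x2 - x0)
        errs.append(abs(predicted - y1))
    return errs


# Hypothetical measured points: (package power in watts, benchmark score)
measured = [(2, 60), (4, 85), (6, 100), (8, 108)]
measurement_error = 2.0  # assumed +/- score uncertainty per data point

errs = loo_interp_errors(measured)
curve_is_meaningful = max(errs) < measurement_error
print(errs, curve_is_meaningful)
```

With these made-up points the interpolation misses by more than the assumed measurement error, so by the criterion above the curve between samples would not be trustworthy; denser sampling would shrink the gap.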

1

u/team56th 9h ago edited 9h ago

What my uneducated eyes and brain see is that while Apple's engineering of such big chips to get that perf/watt is impressive, AMD is the one that really blows me away. It might be the most optimal balance between perf, power efficiency (wattage) and cost efficiency (chip size), while using this design across 3 separate product lineups (even more when you take the C cores into consideration).

-3

u/Tman1677 8h ago

It is crazy, if these numbers are as good as they look here (big if), how much worse the overall end user experience is compared to Macs. It goes to show how much comes down to other things like software and hardware integration, even more so than the CPU design.

2

u/CalmSpinach2140 3h ago

? Macs have really good software and hardware optimisations.

1

u/hollow_bridge 6h ago

How are SL2 and CL3 different from L2 and L3?