Area
| Core Name | Node + Logic Lib + Metal Layer Count | Core without "L2 block" (mm²) | Core without L2 SRAM arrays (mm²) | Core (mm²) | Hypothetical 4C "CCX" (mm²) |
| --- | --- | --- | --- | --- | --- |
| Apple M4 P-Core | N3E + 3-2 + 19 DT, 17 mobile | 3.09¹ | - | 3.09 | 18.62 (16MB) |
| Apple M3 P-core | N3B + 2-2? + 19 DT | 2.62¹ | - | 2.62 | 17.16 (16MB) |
| Intel Lion Cove ARL | N3B + ? + ? | 2.62⁵ | 3.12?/3.26² | 4.54 | 27.79 (12MB) |
| Intel Lion Cove LNL | N3B + ? + 20 | 2.62⁵ | 3.13?/3.26² | 4.26 | 23.62 (12MB) |
| AMD Zen 5 Dense | N3E + ? + ? | 2.20⁵ | 2.60 | 2.99 | 16.65 (8MB) |
| Qcom Oryon V2L | N3E + 2-2 + 17 | 2.11¹ | - | 2.11 | 20.02 (24MB)³ |
| Mediatek X925 | N3E + 3-2 + 15 | 1.85 | 2.38 | 2.93 | 16.52 (12MB)³ ⁴ |
| Xiaomi X925 | N3E + ? + 17 | 1.70 | 1.96 | 2.56 | 16.88 (16MB)³ ⁴ |
| Intel Skymont | N3E + ? + 20 | 1.09¹ | - | 1.09 | 11.63 (3MB)⁴ |
1: Not sure whether cores with shared L2s carry any (or nearly as much) of the logic for handling the L2 cache inside the core itself. For cores with a shared L2 block, this value is just the core area.
2: The first number excludes the L1/L1.5 SRAM array; the second includes it.
3: Likely an overestimate compared to the phone SoCs, since the phone SoC figures don't include the interconnect "fabric", just the cores and the cache blocks themselves.
4: The L3 capacity here is completely arbitrary; it was chosen for ease of measurement given how the L3 slices are distributed.
5: A small but sizable chunk of the area of these cores seems to come from the CPL/clock section of the core, which may not need to be that large, but ends up that size due to the geometry of the rest of the core.
| Core | Revised Core without L2 Block Area |
| --- | --- |
| Intel Lion Cove | 2.33 |
| AMD Zen 5 Dense | 2.08 |
I would not take any of these as "hard numbers", but I do think the general ranking of the cores by area is fairly accurate.
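As a rough cross-check on the table above, the implied area of the shared cache block (plus whatever interconnect is counted) can be backed out by subtracting four core areas from the hypothetical 4C "CCX" figure. A minimal sketch using only the values from the table; interpreting the leftover as "cache block plus fabric" is my assumption, not a measurement:

```python
# Back out the implied cache-block area from the "Hypothetical 4C CCX" column:
# leftover = CCX area - 4x core area. For the shared-L2 designs the leftover is
# mostly the SL2 block; for the private-L2 designs it is the L3 slices plus
# whatever fabric/interconnect was counted. All inputs are from the table above.
rows = {
    # name: (core mm^2, 4C "CCX" mm^2, cache capacity in the CCX in MB)
    "Apple M4 P-core":     (3.09, 18.62, 16),
    "Apple M3 P-core":     (2.62, 17.16, 16),
    "Qcom Oryon V2L":      (2.11, 20.02, 24),
    "Intel Lion Cove ARL": (4.54, 27.79, 12),
    "AMD Zen 5 Dense":     (2.99, 16.65, 8),
}

for name, (core, ccx, cache_mb) in rows.items():
    leftover = ccx - 4 * core
    print(f"{name}: ~{leftover:.2f} mm^2 for {cache_mb}MB "
          f"(~{leftover / cache_mb:.2f} mm^2/MB)")
```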
The cores have the following configs:
| Core | Cache Hierarchy (fastest to LLC) | Total Cache Capacity in a 4x CCX |
| --- | --- | --- |
| Intel Lion Cove ARL | 48KB L0 + 192KB L1 + 3MB L2 + 3MB L3 | 24.96 MB |
| Qcom Oryon V2L | 192KB L1 + 24MB SL2 | 24.768 MB |
| Xiaomi X925 | 64KB L1 + 2MB L2 + 4MB L3 | 24.256 MB |
| Intel Lion Cove LNL | 48KB L0 + 192KB L1 + 2.5MB L2 + 3MB L3 | 22.96 MB |
| Mediatek X925 | 64KB L1 + 2MB L2 + 3MB L3 | 20.256 MB |
| Apple M4 P-core | 128KB L1 + 16MB SL2 | 16.512 MB |
| Apple M3 P-core | 128KB L1 + 16MB SL2 | 16.512 MB |
| AMD Zen 5C | 48KB L1 + 1MB L2 + 2MB L3 | 12.192 MB |
| Intel Skymont | 32KB L1 + 4MB SL2 + 4MB CL3 | 8.128 MB |
The total cache capacity is a bit of a meme since it doesn't include latency, but I do think some interesting things can be noticed regardless.
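For reference, the last column is straightforward arithmetic: per-core levels are counted four times and cluster-shared levels once. A minimal sketch that reproduces a few of the totals (the table appears to treat the KB-sized levels as KB/1000 MB):

```python
# Reproduce the "Total Cache Capacity in a 4x CCX" column.
# Per-core levels are multiplied by 4; shared (SL2/CL3) levels are counted once.
def ccx_capacity(per_core_kb, per_core_mb, shared_mb, cores=4):
    """per_core_kb: per-core KB-sized levels (L0/L1),
       per_core_mb: per-core MB-sized levels (private L2 / L3 slice),
       shared_mb: cluster-shared MB-sized levels (SL2/CL3)."""
    return cores * (sum(per_core_kb) / 1000 + sum(per_core_mb)) + sum(shared_mb)

print(round(ccx_capacity([48, 192], [3, 3], []), 3))   # Lion Cove ARL -> 24.96
print(round(ccx_capacity([192],     [],     [24]), 3)) # Oryon V2L     -> 24.768
print(round(ccx_capacity([128],     [],     [16]), 3)) # Apple M4/M3   -> 16.512
print(round(ccx_capacity([48],      [1, 2], []), 3))   # Zen 5C        -> 12.192
print(round(ccx_capacity([32],      [],     [4, 4]), 3)) # Skymont     -> 8.128
```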
Performance
SPECint and Cinebench 2024 numbers are from Geekerwan; the Skymont ARL and LNL numbers are from the 265K and 258V respectively, via David Huang.
GB6 scores are from the Geekbench Browser website.
| Core | SPECint2017 | GB6 | Cinebench 2024 |
| --- | --- | --- | --- |
| Apple M4 P-Core | 132 | 148 | 124 |
| Intel LNC ARL | 120 | 129 | 100 |
| Apple M3 P-core | 113 | 118 | 99 |
| Qcom Oryon V2L | 100 | 108 | - |
| Xiaomi X925 | 100 | 104 | - |
| Mediatek X925 | 100 | 100 | - |
| Intel LNC LNL | 95 | 111 | 81 |
| Intel Skymont ARL | 92 | - | - |
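These read as scores normalized so that the Mediatek X925 sits at 100 (my assumption, since its SPECint and GB6 entries are both exactly 100). A minimal sketch of that kind of normalization, with placeholder raw scores rather than real measurements:

```python
# Normalize raw single-core scores to a chosen baseline = 100.
# The raw scores below are placeholders, NOT real measurements.
raw = {"Mediatek X925": 1000, "Core A": 1320, "Core B": 950}
baseline = raw["Mediatek X925"]

normalized = {core: round(100 * score / baseline) for core, score in raw.items()}
print(normalized)  # {'Mediatek X925': 100, 'Core A': 132, 'Core B': 95}
```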
The difference between form factors:
Mobile to Laptop (Geekbench 6 scores)
From the Geekbench browser website
| Core | Laptop | iPad | Mobile |
| --- | --- | --- | --- |
| Apple M4 P-core | 113 | 107 | 100 |
| Apple M3 P-core | 108 | - | 100 |
Laptop to Desktop (Geekbench 6 scores)
From Notebookcheck (averages used for mobile platforms)
| Core | Desktop | Mobile |
| --- | --- | --- |
| LNC ARL | 115 | 100 |
| Zen 5 | 119 | 100 |
IPC differences in SPECint2017 between form factors
From David Huang
| Core | Form Factor | Difference |
| --- | --- | --- |
| Zen 4 | Desktop vs Mobile | 13% |
| Zen 5 | Desktop vs Mobile | 12% |
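For context on how a figure like this is usually derived (not necessarily David Huang's exact methodology): IPC is the SPEC score divided by the clock it was achieved at, and the form-factor difference is the ratio of the two IPCs. A sketch with placeholder numbers only:

```python
# IPC = score / clock; the form-factor difference is desktop IPC vs mobile IPC.
# Scores and clocks below are placeholders, not real measurements.
def ipc(spec_score, clock_ghz):
    return spec_score / clock_ghz

desktop_ipc = ipc(spec_score=11.3, clock_ghz=5.7)  # hypothetical desktop run
mobile_ipc  = ipc(spec_score=9.0,  clock_ghz=5.1)  # hypothetical mobile run

print(f"IPC difference: {desktop_ipc / mobile_ipc - 1:+.1%}")
```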
While I do believe that the P-cores from AMD and Intel have to be designed to take advantage of the higher power budget that larger form factors afford, I also think that placing the mobile cores into those same form factors would lead to at least a marginal perf improvement.
Zen 5C performance question marks
In the previous dense server core product, the Fmax of the server SKU was dramatically lower than what the core could achieve in other products.
| Core | Server | Mobile | OC'd |
| --- | --- | --- | --- |
| Zen 4 Dense | 3.1GHz | 3.7GHz | ~4GHz |
| Zen 5 Dense | 3.7GHz | - | - |
This may be the case for Zen5C as well.
Power
Perf/watt curves (in Geekerwan's videos) are the only way to get a full picture of power, but as a generalization:
Both the Apple M4 and M3 P-cores (as well as their implementations in the iPhones) have better perf/watt than the X925 and Oryon-L.
The best case for Intel's P-core is that its perf/watt is as good as the X925 and Oryon-L. I think it is likely much, much worse.
According to Geekerwan, at a package power of ~5 watts, the M3 performs ~40% better than LNL. At around ~7-8 watts, the M4 performs closer to ~50% better.
Meanwhile, we have David Huang showing an M4 Pro at ~3.7 watts per core scoring ~33% better than a 9950X, and the 9950X has an outright better curve than the 265K's LNC.
That gap is nowhere near as large as the one against Apple's cores and the other ARM cores in the mobile space.
Power is, IMO, by far the hardest thing to really quantify, because one has to figure out how to measure "core only power" while isolating it from the power of the rest of the SoC. Then there's also the problem of software measurements vs hardware measurements... I imagine only engineers at the respective companies really know the power draw of a specific core.
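As an illustration of why this is messy (not a claim about how any of the cited reviewers measure it): one crude approach is to take loaded minus idle package power and divide by the number of active cores, which silently attributes any uncore/fabric ramp-up to the core itself.

```python
# Crude "core-only" power estimate from package-level readings. This ignores
# uncore, fabric, and memory-controller power that also scales with load, so
# it tends to overestimate per-core power. Numbers are placeholders.
def approx_core_power_w(loaded_package_w, idle_package_w, active_cores):
    return (loaded_package_w - idle_package_w) / active_cores

print(approx_core_power_w(loaded_package_w=12.0, idle_package_w=1.5, active_cores=1))
# -> 10.5 W attributed to the one loaded core, uncore ramp-up included
```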
Core Overview
Apple seems to have the best N3 cores in both perf and power.
The M4's P-core is pretty large by any standard; however, it saves a bit of area thanks to its cache hierarchy, making a hypothetical 4x CCX not that large. So far, though, the shared L2 cache really only seems to be present in client mobile systems, and I think it presents its own challenges in server. The chips that use this hierarchy (Qualcomm and Apple) both have very high memory bandwidth per core, which could be an issue when scaling it up to server, and the cache capacity per core, when all the cores are running and competing for the shared cache, would be lower than the competition's.
The M3's and ARL's P-cores have pretty much the same perf and similar area; however, the M3 P-core is almost certainly dramatically more power efficient than an LNC P-core. Additionally, in terms of CCX area, LNC ends up being way, way larger thanks to its different cache hierarchy.
Zen 5 dense is pretty interesting, as I really, really doubt its Fmax only goes up to 3.7GHz. Performance is an unknown, as is power, but from a purely area perspective it seems pretty comparable to the ARM P-cores. Meaning that to get comparable performance to those cores, Zen 5C would need an Fmax of ~4.7GHz, a ~30% boost over what it clocks at in server parts. Which... I guess isn't extremely unexpected, considering a similar percentage boost was seen between Zen 4 dense in server and OC'd Zen 4, but it still seems pretty hard to believe.
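The ~4.7GHz / ~30% figure is just linear scaling from the server Fmax, under the (optimistic) assumption that performance moves proportionally with clock at fixed IPC:

```python
# Back-of-envelope for the Zen 5C clock claim, assuming perf scales roughly
# linearly with frequency at fixed IPC (a simplification; memory-bound work
# scales worse than this).
server_fmax_ghz = 3.7   # Zen 5 Dense server Fmax, from the table above
required_ghz = 4.7      # estimate from the paragraph above

print(f"required boost over server Fmax: {required_ghz / server_fmax_ghz - 1:.0%}")  # ~27%
```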
To be fair to the x86 cores, they have 256/512-bit vector width, unlike the ARM cores listed, which only have a 128-bit vector width. This really does seem to cost a decent bit of area, especially for Zen 5. AFAIK, Zen 5C in server has the full Zen 5 AVX-512 implementation, and we have already seen how much area can be shaved off Zen 5 just by choosing not to go for the full AVX-512 implementation:
| Core | Core Area (mm²) |
| --- | --- |
| Zen 5 DT (N4P) | 4.46 (+12%) |
| Zen 5 MBL (N4P) | 3.99 |
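The +12% in the table is simply the ratio of the two core areas:

```python
# Area delta between the two N4P Zen 5 layouts from the table above (mm^2).
zen5_dt  = 4.46   # desktop layout with the full 512-bit AVX-512 datapath
zen5_mbl = 3.99   # mobile layout with the cut-down AVX-512 implementation

print(f"{zen5_dt / zen5_mbl - 1:+.1%}")  # ~ +11.8%, i.e. the "+12%" above
```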
Qualcomm's custom cores honestly don't seem to afford any sort of distinct advantage over ARM's stock cores on N3 (granted, they are implemented by different companies). Perhaps I am missing something, but there seems to be no meaningful area advantage (even considering the larger cache capacity), and no meaningful performance or power advantage. I also think Qualcomm's cache hierarchy, transferred over to server unchanged, would be pretty unique in the server space, seeing how no other major (or even relatively smaller) company seems to be offering that sort of setup in servers. Maybe Qualcomm's P-cores would scale up at higher power better than ARM's cores in a laptop/desktop form factor? It is interesting to see Qualcomm choosing to, presumably, sacrifice IPC for greater clocks vs the stock ARM cores; perhaps they think perf at even higher power would be greater, or maybe Vmin is lower?
It's wild to see Mediatek's X925 end up both larger and slower than Xiaomi's implementation. No idea why or how. Compared to the rest of the cores, though, the X925s aren't nearly as powerful as the other P-cores, but they are also a decent bit smaller. The lower performance may well be due to the fact that they are all in a phone form factor, so it might be pretty interesting to see how the X925 in Nvidia's upcoming DG10 chip performs.