Performance Index

ID: 615781
Date: 11/18/2024
Classification: Public

Architecture Day 2021

Efficient-core

Statement Details
If we compare our Efficient-core to a single Skylake core for a single logical processor, we deliver 40% more performance at the same power.

Internal Estimates as of June 22, 2021 using internal architecture simulation.

Workload: SPECrate2017_int_base estimates with GCC 8.1.0 -O2 binaries

If we compare our Efficient-core to a single Skylake core for a single logical processor, we deliver the same performance while consuming less than 40% of the power.

Alternatively, a Skylake core would consume 2.5X the power to achieve the same performance as our Efficient-core.

Internal Estimates as of June 22, 2021 using internal architecture simulation.

Workload: SPECrate2017_int_base estimates with GCC 8.1.0 -O2 binaries

If we compare four of our new Efficient-cores against two Skylake cores running four threads, we deliver 80% more performance while still consuming less power.

Alternatively, we deliver the same throughput while consuming 80% less power. This means that Skylake would need to consume 5 times the power for the same performance.

Internal Estimates as of June 22, 2021 using internal architecture simulation.

Workload: SPECrate2017_int_base estimates with GCC 8.1.0 -O2 binaries
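The single-core and four-vs-two power claims above are different views of the same ratios; a quick arithmetic sketch, using only the 40% and 80% figures stated above, shows they are internally consistent:

```python
# Consistency check of the Efficient-core power claims above
# (arithmetic on the stated figures, not a measurement).

# Same performance at <40% of the power implies Skylake needs
# more than 1/0.40 = 2.5x the power for the same performance.
ecore_power_fraction = 0.40
print(1 / ecore_power_fraction)  # 2.5

# Same throughput (4 E-cores vs. 2 Skylake cores) at 80% less power
# implies Skylake needs 1/(1 - 0.80) = 5x the power.
power_savings = 0.80
print(round(1 / (1 - power_savings), 6))  # 5.0
```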

Performance-core

Statement Details
Comparing our current 11th Gen Intel Core architecture to the new Performance-core at ISO frequency shows that, on general-purpose performance, there is an average (i.e., geomean) improvement of 19%, across a wide range of existing workloads.

Testing as of May 28, 2021.

Intel® Core™ i9-11900K, 4x16GB 1R DDR4 UDIMM 2DPC 3200 Max Memory Frequency, Samsung 980 Pro 500GB PCIe SSD, WIN 10 20H2 19042.ent.rx64.789, High Performance Power Plan, 1920x1080 display resolution

Alder Lake Desktop S801, RVP board with 2x16GB 1R DDR5 UDIMM 1DPC, 4400 Max Memory Frequency, Samsung 980 Pro 500GB PCIe SSD, WIN 10 20H2 19042.ent.rx64.508_update.906, High Performance Power Plan, 1920x1080 display resolution

Based on overall scores and individual subcomponent scores on: SYSmark 25, CrossMark, PCMark 10, SPEC CPU 2017, WebXPRT 3, Geekbench 5

With int8, used for inference, our VNNI technology delivers 256 int8 operations per cycle per core. That's already over 2X our x86 CPU competition. With AMX, we will expand that by 8x, delivering 2048 int8 operations per cycle per core.

Based on peak architectural capability of matrix multiply + accumulate operations per cycle per core assuming 100% CPU utilization

Intel: Peak architectural capabilities in the new Performance-core for VNNI (running on 2 wide FMAs) and AMX implementations as of August 2021

Competition: Peak architectural capabilities of the Zen 3 architecture based on published architectural details as of August 2021, at

https://www.amd.com/system/files/TechDocs/56665.zip

https://www.amd.com/system/files/TechDocs/55723_3_01_0.zip

www.amd.com/system/files/TechDocs/26568.pdf
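The 256 and 2048 ops/cycle figures can be sanity-checked with lane arithmetic. The 512-bit vector width and the convention of counting a multiply-add as two operations are our assumptions here; the FMA count and the 256, 2048, and 8x figures come from the statement above:

```python
# Back-of-the-envelope check of the per-core peak int8 figures above.
VECTOR_BITS = 512    # assumed 512-bit vector registers
INT8_BITS = 8
OPS_PER_MAC = 2      # assumed convention: multiply + accumulate = 2 ops
FMA_UNITS = 2        # "2 wide FMAs" per the statement

lanes = VECTOR_BITS // INT8_BITS        # 64 int8 lanes per register
vnni_ops = FMA_UNITS * lanes * OPS_PER_MAC
amx_ops = 2048                          # per the statement

print(vnni_ops)             # 256 int8 ops per cycle per core with VNNI
print(amx_ops // vnni_ops)  # 8 -> the "expand that by 8x" claim
```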

Alder Lake

Statement Details
PCIe Gen 5 has up to 2X the bandwidth of Gen 4.

Based on the PCIe Gen 5 specification of 32 GT/s vs. the PCIe Gen 4 specification of 16 GT/s.

Interconnect bandwidth is up to 1000 GB/s for compute, up to 204 GB/s for memory, and up to 64 GB/s for I/O fabrics.

Internal specification of peak bandwidth of each fabric for the Alder Lake desktop 125W configuration.

Intel Thread Director demo

System: Pre-production Intel internal validation platform with Alder Lake-S running Windows 11 Enterprise Build 22000.150; Application: Intel® Thread Director demo. As of July 2021.

Xe HPG

Statement Details
We have doubled performance year over year, 2 years in a row. First with Gen 9 to Gen 11, and then with Gen 11 to Xe LP.

4x graphics improvement based on in-game benchmark testing of Hitman 2. Testing as of November 2020 at 1080p and low settings. Systems tested:

Core i7-8565U (PL1=15W), measured on HP Spectre x360, 512 GB Intel SSD 660p Series, 2 x 8 GB DDR4-2400, Windows 10 Home 19042.63, Intel UHD 620 27.20.100.8935, F.34

Core i7-1065G7 (PL1=25W), measured on Razer Blade Stealth 13 (2019), 256GB Samsung MZVLB256HAHQ (PM981), 2 x 8 GB LPDDR4X-3733, Windows 10 Home 19042.63, Intel Iris Plus 27.20.100.8935, 1.02

Core i7-1185G7 (B0) (PL1=28W), measured on Intel Reference Platform, Samsung MZVLB512HBJQ, 2 x 8 GB LPDDR4X-4266, Windows 10 Pro 19042.63, Intel Iris Xe 27.20.100.8935, 92A.

With Intel's graphics drivers, we have seen throughput improvements of 15% to 80% for CPU-bound titles.

Testing on Sid Meier's Civilization VI shows FPS improvements of 15% (at 1440p) and 80% (at 1080p), comparing a 30.0.100.9805 driver to a 27.20.100.8587 driver on DG1 with a TGL-U platform. As of July-August 2021.

With Intel's graphics drivers, we have seen game load times improve by up to 25%.

Game loading time reductions of 25% and more on titles including Control, Cyberpunk 2077, and World of Warcraft when enabling per-stage PSO with the 30.0.100.9684 driver on DG1 with a TGL-U platform. As of July-August 2021.

Xe HPG - High Quality Super Sampling demo

Up to 2x performance boost (FPS) with 4K using Xe SS technology vs. 4K native

Up to 2x performance boost (FPS) with 4K using Xe SS technology vs. 4K native, Rens fly-through, running on Alchemist pre-production silicon. As of July 2021.
Compared to the Xe LP IP in our Iris Xe Max product, we increased the relative operating frequency by roughly 1.5X. When comparing the clock frequency of pre-production Xe HPG-based silicon vs. the Xe LP-based DG1 engineering platform. As of July 2021.
Compared to the Xe LP IP in our Iris Xe Max product, we increased the performance per watt by roughly 1.5X. When comparing the internal graphics IP perf/watt of pre-production Xe HPG-based silicon vs Xe LP-based DG1, at approximately 18.5W IP power, running 3DMark Firestrike-P. Simulated results. As of August 2021.

Sapphire Rapids

Statement Details

We introduced AMX capabilities that provide massive speedup to the tensor processing that is at the heart of deep learning algorithms.

With AMX we can perform 2048 int8 operations per cycle (vs. 256 without AMX) and 1024 bfloat16 operations per cycle (vs. 64 without AMX).

Based on peak architectural capability of matrix multiply + accumulate operations per cycle per core assuming 100% CPU utilization. As of August 2021.
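The speedup ratios implied by these per-cycle figures are plain arithmetic on the quoted numbers:

```python
# Ratios implied by the stated per-cycle operation counts.
int8_amx, int8_base = 2048, 256
bf16_amx, bf16_base = 1024, 64

print(int8_amx // int8_base)   # 8x for int8
print(bf16_amx // bf16_base)   # 16x for bfloat16
```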
Sapphire Rapids - Advanced Matrix Extension (AMX) demo

Sapphire Rapids shows approximately 7.8x more operations with the Advanced Matrix Extensions

Optimized internal matrix-multiply micro benchmark runs approximately 7.8X faster using new Intel AMX instruction set extensions compared to another version of the same micro benchmark using Intel AVX-512 VNNI instructions, both on pre-production Intel Xeon processor (Sapphire Rapids).

The Sapphire Rapids AMX HW block contains 8X the 8-bit integer multiply-add capability compared to the Sapphire Rapids AVX-512 VNNI HW block. The ability of the AMX HW block to execute more multiply-adds per cycle than the AVX-512 VNNI HW block enables matrix-multiply implementations using the Intel AMX instruction set extensions to achieve higher performance than AVX-512 VNNI implementations. The internal matrix-multiply micro benchmark demonstrates that increased capability.

HW: 1-node, 2-socket pre-production Intel Xeon processor (Sapphire Rapids) on Intel reference platform with 512 GB total DDR5 memory, HT ON, Turbo ON, pre-production bios and software running Intel internal matrix-multiply micro benchmark.

Workload: Internal matrix-multiply micro benchmark, similar to GEMM. One version implemented using Intel AMX instruction set extensions and the other using Intel AVX-512 VNNI instructions. The GEMM input is A=32x512 and B=512x32 int8 matrices and the output is C=32x32 int32 matrix. The B matrix is pre-formatted to a blocked VNNI-friendly format outside of the measured region.

As of August 10, 2021.
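For illustration only, the GEMM shape described in the workload can be sketched in NumPy; plain int32 matrix math stands in for the AMX and AVX-512 VNNI kernels, and this is not the Intel microbenchmark itself:

```python
# Sketch of the microbenchmark's GEMM shape:
# C (32x32, int32) = A (32x512, int8) @ B (512x32, int8).
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(32, 512), dtype=np.int8)
B = rng.integers(-128, 128, size=(512, 32), dtype=np.int8)

# Widen to int32 before multiplying so products accumulate without overflow.
C = A.astype(np.int32) @ B.astype(np.int32)

print(C.shape, C.dtype)  # (32, 32) int32
```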

On an Open vSwitch (OVS) use case with up to 4 instances of the Data Streaming Accelerator (DSA), we see a nearly 40% reduction in CPU utilization and a 2.5x improvement in data movement performance. This results in nearly doubling the effective core performance for this workload.

Results have been estimated or simulated as of July 2021 based on testing on pre-production hardware and software.

Platform: Archer City SDV; CPU: Sapphire Rapids C1 pre-production; MEMORY: 128GB DDR5 (16GB PC5-4800E); BIOS: EGSDCRB1.86B.0056.D18.2104081151; OS: Ubuntu 20.04; NIC: 2x100Gb/s E810 (CVL); Virtual Switch: OVS 2.15.9; Data Plane: DPDK 21.08-rc0.

With the Zlib L9 compression algorithm, we see a 50x drop in CPU utilization (i.e., a 98% decrease in expected core utilization), while also speeding up the compression by 22 times. Without QAT, this level of performance would require upwards of 1,000 Performance-cores to achieve.

Results have been estimated or simulated as of July 2021. Sapphire Rapids estimation based on architecture models scaling of baseline measurements taken on Ice Lake.

Baseline testing with Ice Lake and Intel QAT: Platform: Ice Lake XCC, SKU: 8380, Cores: 40, Freq: 2.3 GHz, TDP: 270W, LLC: 60MB, Board: Coyote Pass, RAM: 16 x 32GB DDR4 3200, Hynix HMA84GR7CJR4N-XN, BIOS: SE5C6200.86B.3021.D40.2103160200, Microcode: 0x8d05a260 (03/16)

OS: Ubuntu 20.04.2, Kernel: 5.4.0-65-generic, GCC: 9.3.0, yasm: 1.3.0, nasm: 2.14.02, ISA-L: 2.3, ISA-L Crypto: 2.23, OpenSSL: 1.1.1i, zlib: 1.2.11, lzbench: 1.7.3
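The 50x/98% relationship in the statement above is direct arithmetic:

```python
# A 50x drop in CPU utilization leaves 1/50 = 2% of the original
# utilization, i.e. a 98% decrease.
drop_factor = 50
remaining_fraction = 1 / drop_factor
decrease_pct = (1 - remaining_fraction) * 100
print(round(decrease_pct))  # 98
```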

On microservices performance, we show an improvement in throughput per core (under a latency SLA of p99 <30ms) of:

24% comparing Ice Lake Server to Cascade Lake

69% comparing Sapphire Rapids to Cascade Lake

Workloads: DeathStarBench 'hotelReservation', 'socialNetwork' (https://github.com/delimitrou/DeathStarBench ) and Google Microservices demo (https://github.com/GoogleCloudPlatform/microservices-demo )

OS: Ubuntu 20.04 with kernel version v5.10, Kubernetes v1.21.0; Testing as of July 2021.

Cascade Lake Measurements on 3-node Kubernetes setup on AWS M5.metal instances (2S 24 core 8259CL with 384GB DDR4 RAM and 25Gbps network) in us-west2b

Ice Lake Measurements on 3-node 2S 32 core, 2.5GHz, 300W TDP SKU with 512GB DDR4 RAM and 40Gbps network
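Because both improvements are quoted against the same Cascade Lake baseline, an implied Sapphire Rapids vs. Ice Lake gain can be derived; this is an arithmetic inference from the figures above, not a measured comparison:

```python
# Implied Sapphire Rapids vs. Ice Lake gain, derived from the two
# Cascade Lake-relative figures above.
spr_vs_clx = 1.69   # +69% throughput/core vs. Cascade Lake
icx_vs_clx = 1.24   # +24% throughput/core vs. Cascade Lake

implied_gain = spr_vs_clx / icx_vs_clx - 1
print(f"{implied_gain:.0%}")  # 36%
```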

Ponte Vecchio

Statement Details
Intel set ambitious goals with Ponte Vecchio to eclipse industry standards for HPC (FP64), AI (FP16/BF16) and Bandwidth (GB/s). Competitive data between 2011 and present is compiled from published Nvidia specifications. Intel data between 2011 and present is estimated based on internal testing of and specifications for Intel graphics architectures. Graphs shown are for illustrative purposes and not to scale.
We can deliver leadership compute and memory bandwidth density for a wide range of HPC and AI systems with a single design.

The Xe HPC architecture and Intel's modular construction allow Ponte Vecchio to scale from 1 stack to 2 stacks through a single design.

Intel's competition achieves this coverage only through different chips and different designs (i.e., Nvidia: A100, A102, A104; AMD: Navi 21, Navi 22, Navi 23).

Compute - see statement and details immediately below regarding the Ponte Vecchio - Performance with oneAPI AI Analytics Toolkit demo.

Memory bandwidth density - greater than 5 terabytes per second of sustained memory fabric bandwidth on Ponte Vecchio and over 2 terabytes per second of aggregate memory and scale-up bandwidth. See details below regarding these statements.

Ponte Vecchio - Performance with oneAPI AI Analytics Toolkit demo

Ponte Vecchio is powered on and hitting industry-leading performance numbers on A0 silicon.

Early Ponte Vecchio silicon has set an industry record in both inference and training throughput on a popular AI benchmark.

ResNet-50 inference throughput on Ponte Vecchio with Sapphire Rapids exceeds 43 thousand images per second, surpassing the standard you see today in market.

Based on Intel engineering platform with a single Sapphire Rapids and Ponte Vecchio A0 2 stacks; ResNet-50 v1.5, engineering software framework, mixed precision (INT8 and FP32), 76.36% Top 1 and 93.06% Top 5 accuracy; images/sec measured as total images processed divided by execution time; Linux. Testing as of July 15, 2021.

Today we are already seeing leadership performance on Ponte Vecchio's ResNet-50 training throughput, with over 3,400 images per second.

Based on Intel engineering platform with a single Sapphire Rapids and Ponte Vecchio A0 2 stacks; ResNet-50 v1.5, engineering software framework, batch size 256 per GPU, mixed precision (BF16 and FP32); images/sec measured as total images processed divided by execution time; Linux. Testing as of August 12, 2021.

Competition's results as of August 10, 2021, published at:

https://developer.nvidia.com/deep-learning-performance-training-inference

Ponte Vecchio produces greater than 45 teraflops of sustained vector single-precision performance (FP32).

Measured over 45 teraflops of sustained vector single-precision (FP32) performance using the clpeak benchmark. As of July 30, 2021, based on Intel engineering platform with a single Sapphire Rapids and Ponte Vecchio A0 2 stacks, Linux.

We measured greater than 5 terabytes per second of sustained memory fabric bandwidth on Ponte Vecchio.

Measured over 5 terabytes per second of sustained memory fabric bandwidth. As of July 30, 2021, based on Intel engineering platform with a single Sapphire Rapids and Ponte Vecchio A0 2 stacks, Linux.

We measured over 2 terabytes per second of aggregate memory and scale-up bandwidth.

Measured over 2 terabytes per second of aggregate memory and stack-to-stack bandwidth. As of July 30, 2021, based on Intel engineering platform with a single Sapphire Rapids and Ponte Vecchio A0 2 stacks, Linux.
Xe Core - Intel oneAPI Rendering Toolkit demo

System 1 (scene showing Intel® Embree using Houdini tool from Side Effects): Dual Intel® Xeon® Platinum 8280 Processor @ 2.70GHz; Memory: 256GB RAM; Graphics: NVIDIA GeForce GTX 970 (only used for the Houdini UI, not rendering); Storage: Samsung SSD 850 EVO 2TB; OS: Ubuntu 20.10

System 2 (using the oneAPI software architecture and showing Embree and AI based Intel® Open Image Denoise running cross-architecture on CPUs and GPUs.): 11th Gen Intel Core i7-11850H @ 2.50GHz; Graphics: pre-production Xe architecture-based silicon for data center; Memory: 2x16GiB SODIMM DDR4 Synchronous 3200 MHz; OS: Ubuntu 20.04.2 LTS

System 3 (4K full fidelity frame rendered with a ray tracing capable Xe GPU): Tiger Lake H DDR4 SODIMM RVP, CPU: 11th Gen Intel Core i7-11800H @ 2.30GHz (8 cores); Graphics: pre-production Xe architecture-based silicon for data center; Memory: 2x 16GiB SODIMM DDR4 Synchronous 3200 MHz; Storage: Corsair SKU CSSD-F240GBMP510 Force Series™ MP510 240GB M.2 SSD; OS: Ubuntu 20.04

As of August 2021.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See above for configuration details. No product or component can be absolutely secure.

Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

All product plans and roadmaps are subject to change without notice.

Some results have been estimated or simulated. Results that are based on pre-production systems and components, as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example of a new system), internal Intel analysis, or architecture simulation or modeling, are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. Intel technologies may require enabled hardware, software or service activation.

Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies.

Statements that refer to future plans and expectations are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "goals," "plans," "believes," "seeks," "estimates," "continues," "may," "will," "would," "should," "could," and variations of such words and similar expressions are intended to identify such forward-looking statements. Statements that refer to or are based on estimates, forecasts, projections, uncertain events or assumptions, including statements relating to future products and technology and the expected availability and benefits of such products and technology, market opportunity, and anticipated trends in our businesses or the markets relevant to them, also identify forward-looking statements. Such statements are based on management's current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in these forward-looking statements. Important factors that could cause actual results to differ materially from the company's expectations are set forth in Intel's reports filed or furnished with the Securities and Exchange Commission (SEC), including Intel's most recent reports on Form 10-K and Form 10-Q, available at Intel's investor relations website at https://www.intc.com/ and the SEC's website at https://www.sec.gov/. Intel does not undertake, and expressly disclaims any duty, to update any statement made in this presentation, whether as a result of new information, new developments or otherwise, except to the extent that disclosure may be required by law.

Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available. These are not "commercial" names and are not intended to function as trademarks.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.