
Hot Chips 2021

Alder Lake

Statement Details
P-core delivers higher performance on single and lightly threaded scalable apps (greater than 50% improved single-threaded performance estimated). Estimated based on pre-production Intel internal Alder Lake validation platform (with 8 P-cores + 8 E-cores) running Spec Int 544.nab_r on P-core vs. on E-core. As of July 2021. Charts shown for illustrative purposes only and not to scale.
E-core provides higher computational density under given physical constraints (greater than 50% improved multi-threaded performance estimated). Estimated based on pre-production Intel internal Alder Lake validation platform running Spec Int 544.nab_r on 8 P-core/8 E-core platform configured to run 4 P-core, 2 P-core and 8 E-core. As of July 2021. Charts shown for illustrative purposes only and not to scale.
PCIe Gen 5 has up to 2X bandwidth vs. Gen 4. Based on PCIe Gen 5 specification of 32 GT/s vs. PCIe Gen 4 specification of 16 GT/s.
Interconnect bandwidth up to 1000 GB/s for compute, up to 204 GB/s for memory, and up to 64 GB/s for I/O fabrics. Internal specification of peak bandwidth of each fabric for the Alder Lake desktop 125W configuration.
Intel Thread Director uses machine-learning based thread telemetry to predict IPC gain of P-core vs. E-core and assign threads based on workload class and performance and efficiency considerations. Based on simulation on Intel internal Alder Lake architectural simulator as of March 2021. Charts shown for illustrative purposes only and not to scale.
Intel Thread Director leads to significant performance improvements. Comparing performance on pre-production Intel internal Alder Lake validation platform (with 8 P-cores + 8 E-cores) running Microsoft Excel and an internal AI workload and combinations of Linpack with micro-benchmarks. Measured with Intel Thread Director and without Intel Thread Director. As of March 2021. Charts shown for illustrative purposes only and not to scale.
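
The PCIe Gen 5 versus Gen 4 comparison above follows directly from the per-lane transfer rates. As a minimal sketch, the ratio and an approximate one-direction link throughput can be worked out as below. The x16 link width and the helper name `link_gbytes_per_s` are illustrative assumptions and not part of the disclosure; 128b/130b is the line encoding PCIe uses from Gen 3 onward, and packet/protocol overhead is ignored.

```python
# Illustrative arithmetic only. The x16 link width is an assumption;
# the disclosure states only the per-lane rates (32 GT/s vs. 16 GT/s).
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding (PCIe Gen 3 and later)


def link_gbytes_per_s(gt_per_s: float, lanes: int = 16) -> float:
    """Approximate one-direction bandwidth in GB/s, ignoring protocol overhead."""
    return gt_per_s * lanes * ENCODING_EFFICIENCY / 8  # 8 bits per byte


gen4 = link_gbytes_per_s(16)
gen5 = link_gbytes_per_s(32)
ratio = gen5 / gen4
print(f"Gen 4 x16: {gen4:.1f} GB/s, Gen 5 x16: {gen5:.1f} GB/s, ratio: {ratio:.1f}x")
```

Because both generations use the same encoding, the ratio depends only on the transfer rates, which is why the statement is phrased as "up to 2X".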

Sapphire Rapids

Statement Details
On an Open vSwitch use case with up to 4 instances of Data Streaming Accelerator (DSA), we see a nearly 40% reduction in CPU utilization, and a 2.5x improvement in data movement performance. This results in nearly doubling the effective core performance for this workload.

Results have been estimated or simulated as of July 2021 based on testing on pre-production hardware and software.

Platform: Archer City SDV; CPU: Sapphire Rapids C1 pre-production; MEMORY: 128GB DDR5 (16GB PC5-4800E); BIOS: EGSDCRB1.86B.0056.D18.2104081151; OS: Ubuntu 20.04; NIC: 2x100Gb/s E810 (CVL); Virtual Switch: OVS 2.15.9; Data Plane: DPDK 21.08-rc0.

With the Zlib L9 compression algorithm, we see a 50x drop in CPU utilization (i.e., a 98% decrease in expected core utilization), while also speeding up the compression by 22 times. Without QAT, this level of performance would require upwards of 1,000 Performance-cores to achieve.

Results have been estimated or simulated as of July 2021. Sapphire Rapids estimation based on architecture models scaling of baseline measurements taken on Ice Lake.

Baseline testing with Ice Lake and Intel QAT: Platform: Ice Lake XCC, SKU: 8380, Cores: 40, Freq: 2.3 GHz, TDP: 270W, LLC: 60MB, Board: Coyote Pass, RAM: 16 x 32GB DDR4 3200, Hynix HMA84GR7CJR4N-XN, BIOS: SE5C6200.86B.3021.D40.2103160200, Microcode: 0x8d05a260 (03/16)

OS: Ubuntu 20.04.2, Kernel: 5.4.0-65-generic, GCC: 9.3.0, yasm: 1.3.0, nasm: 2.14.02, ISA-L: 2.3, ISA-L Crypto: 2.23, OpenSSL: 1.1.1i, zlib: 1.2.11, lzbench: 1.7.3
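
The figures in the Zlib L9 statement above can be cross-checked with simple arithmetic. The sketch below reproduces the 98% figure from the 50x drop. The core-equivalent estimate is an assumption about how the "upwards of 1,000" number might arise (the product of the two factors); the disclosure does not spell out that derivation.

```python
# Factors taken from the statement; the interpretation of their product is an assumption.
cpu_utilization_drop = 50  # "50x drop in CPU utilization"
compression_speedup = 22   # "speeding up the compression by 22 times"

# A 50x drop leaves 1/50 of the original utilization, i.e. a 98% decrease.
utilization_decrease = 1 - 1 / cpu_utilization_drop

# Assumption: doing 22x the work at 1/50 of the core cost is equivalent to
# 50 * 22 times the baseline core usage without the accelerator.
core_equivalents = cpu_utilization_drop * compression_speedup

print(f"{utilization_decrease:.0%} decrease; ~{core_equivalents} baseline core-equivalents")
```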

With AMX we can perform 2048 int8 operations per cycle (vs. 256 without AMX) and 1024 bfloat16 operations per cycle (vs. 64 without AMX). Based on peak architectural capability of matrix multiply + accumulate operations per cycle per core assuming 100% CPU utilization. As of August 2021.
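
The peak figures above imply the following per-core ratios. This is a small sketch using only the numbers quoted in the statement; the dictionary layout and variable names are illustrative.

```python
# Peak operations per cycle per core, as quoted in the statement above.
peak_ops_per_cycle = {
    "int8": {"amx": 2048, "no_amx": 256},
    "bfloat16": {"amx": 1024, "no_amx": 64},
}

# Ratio of peak architectural capability with AMX vs. without AMX.
speedups = {
    dtype: ops["amx"] / ops["no_amx"] for dtype, ops in peak_ops_per_cycle.items()
}

for dtype, factor in speedups.items():
    print(f"{dtype}: {factor:.0f}x peak ops/cycle/core with AMX")
```

These are ratios of peak capability at 100% utilization, as the statement notes, not measured application speedups.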

On microservices performance, we show an improvement in throughput per core (under a latency SLA of p99 <30ms) of:

24% comparing Ice Lake Server to Cascade Lake.

69% comparing Sapphire Rapids to Cascade Lake.

Workloads: DeathStarBench 'hotelReservation', 'socialNetwork' (https://github.com/delimitrou/DeathStarBench) and Google Microservices demo (https://github.com/GoogleCloudPlatform/microservices-demo)

OS: Ubuntu 20.04 with kernel version v5.10, Kubernetes v1.21.0; Testing as of July 2021.

Cascade Lake Measurements on 3-node Kubernetes setup on AWS M5.metal instances (2S 24 core 8259CL with 384GB DDR4 RAM and 25Gbps network) in us-west-2b

Ice Lake Measurements on 3-node 2S 32 core, 2.5GHz, 300W TDP SKU with 512GB DDR4 RAM and 40Gbps network
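
Since both microservices percentages above are quoted against Cascade Lake, the implied Sapphire Rapids versus Ice Lake Server gain can be derived by dividing the two factors. This is a sketch that assumes the two figures are directly comparable (same workloads and latency SLA); the disclosure itself does not state this derived number.

```python
# Throughput-per-core factors relative to Cascade Lake, from the statement above.
icx_vs_clx = 1.24  # Ice Lake Server: +24%
spr_vs_clx = 1.69  # Sapphire Rapids: +69%

# Derived (not stated in the disclosure): Sapphire Rapids relative to Ice Lake Server.
spr_vs_icx = spr_vs_clx / icx_vs_clx

print(f"Implied Sapphire Rapids vs. Ice Lake Server: +{spr_vs_icx - 1:.0%}")
```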

Ponte Vecchio

Statement Details
Ponte Vecchio produces greater than 45 teraflops of sustained vector single-precision performance (FP32). Measured over 45 teraflops of sustained vector single-precision performance (FP32) using clpeak benchmark. As of July 30, 2021, based on Intel engineering platform with single Sapphire Rapids and Ponte Vecchio A0 2 stacks, Linux.
We measured greater than 5 terabytes per second of sustained memory fabric bandwidth on Ponte Vecchio. Measured over 5 terabytes per second of sustained memory fabric bandwidth. As of July 30, 2021, based on Intel engineering platform with single Sapphire Rapids and Ponte Vecchio A0 2 stacks, Linux.
We measured over 2 terabytes per second of aggregate memory and scale-up bandwidth. Measured over 2 terabytes per second of aggregate memory and stack-to-stack bandwidth. As of July 30, 2021, based on Intel engineering platform with single Sapphire Rapids and Ponte Vecchio A0 2 stacks, Linux.

Early Ponte Vecchio silicon has set an industry record in both inference and training throughput on a popular AI benchmark.

ResNet-50 inference throughput on Ponte Vecchio with Sapphire Rapids exceeds 43 thousand images per second, surpassing the standard you see today in the market. Based on Intel engineering platform with single Sapphire Rapids and Ponte Vecchio A0 2 stacks; ResNet-50 v1.5, engineering software framework, mixed precision (INT8 and FP32), 76.36% Top 1 and 93.06% Top 5 accuracy, image/sec as measured by total images processed divided by the execution time, Linux. Testing as of July 15, 2021.

Today we are already seeing leadership performance on Ponte Vecchio's ResNet-50 training throughput, with over 3,400 images per second. Based on Intel engineering platform with single Sapphire Rapids and Ponte Vecchio A0 2 stacks; ResNet-50 v1.5, engineering software framework, batch size 256 per GPU, mixed precision (BF16 and FP32), image/sec as measured by total images processed divided by the execution time, Linux. Testing as of August 12, 2021

Competition's results as of August 10, 2021, published at:

https://developer.nvidia.com/deep-learning-performance-training-inference

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See above for configuration details. No product or component can be absolutely secure.

Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.

All product plans and roadmaps are subject to change without notice.

Some results have been estimated or simulated. Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. Intel technologies may require enabled hardware, software or service activation.

Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies.

Statements that refer to future plans and expectations are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "goals," "plans," "believes," "seeks," "estimates," "continues," "may," "will," "would," "should," "could," and variations of such words and similar expressions are intended to identify such forward-looking statements. Statements that refer to or are based on estimates, forecasts, projections, uncertain events or assumptions, including statements relating to future products and technology and the expected availability and benefits of such products and technology, market opportunity, and anticipated trends in our businesses or the markets relevant to them, also identify forward-looking statements. Such statements are based on management's current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in these forward-looking statements. Important factors that could cause actual results to differ materially from the company's expectations are set forth in Intel's reports filed or furnished with the Securities and Exchange Commission (SEC), including Intel's most recent reports on Form 10-K and Form 10-Q, available at Intel's investor relations website at https://www.intc.com/ and the SEC's website at https://www.sec.gov/. Intel does not undertake, and expressly disclaims any duty, to update any statement made in this presentation, whether as a result of new information, new developments or otherwise, except to the extent that disclosure may be required by law.

Code names are used by Intel to identify products, technologies, or services that are in development and not publicly available. These are not "commercial" names and not intended to function as trademarks.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.