Performance Index

ID Date Classification
615781 12/05/2024 Public

Intel® Innovation 2023

Keynotes

Session Section Speaker Claim Claim Details/Citation Testing Date
Pat Gelsinger Keynote Intel® Xeon® Roadmap Pat Gelsinger

Estimated performance of 5th Gen Intel® Xeon® vs. 4th Gen Intel® Xeon® on a range of AI inferencing and training workloads.

5th Gen Intel® Xeon® boasts more compute and faster memory while still using the same power as Intel's previous generation - up to 40% more AI performance out of the box.

AI Performance Gain (PyTorch): Estimated performance comparing:

8592+ (Archer City): 1-node, pre-production 2x 5th Gen Intel® Xeon® Platinum processor (Emerald Rapids) 64C, 350W TDP; HT on, Turbo on, Total Memory 1024GB (16x64GB DDR5-5600 MT/s [5600 MT/s]); BIOS Version EGSDCRB1.E9I.0102.D48.2305231333; ucode revision=0xa10000c0, CentOS Stream 9, 1x Samsung SSD 860 EVO 1TB (TF), test by Intel on 5/26/2023.

8480+ (Archer City): 1-node, with 2x Intel® Xeon® Platinum processor (Sapphire Rapids) 56C, 350W TDP, HT on, Turbo on, Total Memory 1024GB (16x64GB DDR5-4800 MT/s [4800 MT/s]); BIOS Version EGSDCRB1.SYS.9409.P01.2211280753, ucode revision=0x2b000161, CentOS Stream 8, 5.15.0, 1x INTEL SSDSC2KW256G8 (PT)/Samsung SSD 860 EVO 1TB (TF), test by Intel on 5/29/2023.

05/26/2023 and 05/29/2023
Greg Lavender Keynote Prediction Guard David Sidd Up to 2x increase in model throughput when moving from Nvidia A100 to Intel® Gaudi® 2

Intel Gaudi 2

• 8 Intel Gaudi 2 HL-225H mezzanine cards (although results are for running models on only one card at a time) • 3rd Gen Intel® Xeon® processors • 1 TB RAM • 30 TB disk • Workload running in Habana’s pre-built Docker image (vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest)

Nvidia A100: • 1 Nvidia A100 (80 GiB) • 12 vCPUs • 140.625 GiB memory • Workload running in Docker via Baseten’s Truss package

Results:

Nous-Hermes-Llama2-13B: • Intel Gaudi 2 (single card): 71.8 tokens/second • Our previous deployment on 1 Nvidia A100 80GB: 36.9 tokens/second

WizardCoder 15B: • Intel Gaudi 2 (single card): 63.24 tokens/second • Our previous deployment on 1 Nvidia A100 80GB: 58.1 tokens/second

09/12/2023 by Prediction Guard
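
The "up to 2x" figure can be checked against the reported tokens/second values. The following is a minimal sketch, assuming the claim is the simple ratio of Gaudi 2 throughput to A100 throughput per model; the dictionary layout and the `speedup` helper are illustrative names, not part of the Prediction Guard disclosure.

```python
# Tokens/second as reported in the Results above.
results = {
    "Nous-Hermes-Llama2-13B": {"gaudi2": 71.8, "a100": 36.9},
    "WizardCoder 15B": {"gaudi2": 63.24, "a100": 58.1},
}


def speedup(new: float, baseline: float) -> float:
    """Throughput of `new` relative to `baseline` (1.0 means parity)."""
    return new / baseline


# Assumption: the claim compares single-card Gaudi 2 against one A100 80GB.
speedups = {
    model: speedup(r["gaudi2"], r["a100"]) for model, r in results.items()
}

for model, ratio in speedups.items():
    print(f"{model}: {ratio:.2f}x")
```

Under that assumption the ratios come out at roughly 1.95x for Nous-Hermes-Llama2-13B and 1.09x for WizardCoder 15B, so the "up to 2x" wording corresponds to the best case of the two models.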

Demos

Session Section Speaker Claim Claim Details/Citation Testing Date
AI Booth Demo Enablement and optimizations with Triton for PyTorch Pramod Pai Running on Triton backend is faster than on PyTorch in eager mode The demo shows how PyTorch's torch.compile(), a new feature in PyTorch 2.0, with its backend set to inductor using Triton, can be used to run LLMs on Data Center Max Series and obtain performance speedups. The performance comparison is between eager mode and torch.compile() mode with Triton. 07/07/23
AI Booth Demo Accelerate Diffusion Models for Medical Image Segmentation with PyTorch Kevin Ta IPEX-written code runs faster than non-IPEX-written code for PyTorch The demo will include a UI where users will run the MedSegDiff model to segment some MRI brain scans. Users will be able to select whether or not to enable Intel® Extension for PyTorch*. The model will output inference time. 09/14/23

Sessions

Session Section Speaker Claim Claim Details/Citation Testing Date
NGS001 Designing Solutions for Intel® Xeon® with E-cores and P-cores Slide 13 Matt Langman 2.57x throughput for latency-sensitive queries running MLPerf GPT-J 6B on GNR-AP 12 Channel MCRDIMM 8800 vs SPR-SP 8480 with DDR5.

1-node, 2x 4th Gen Intel® Xeon® 8480+, with 1024GB 16*64GB 4800MT/s DDR5, Hynix total memory, HT on, Turbo on, ucode 0x2b000461, CentOS Stream 8, 6.1.11-1.el8.elrepo.x86_64, SAMSUNG MZQL21T9HCJR-00A07 1.8TB, PyTorch / GPT-J 6B, dataset CNN_Dailymail, MLPerf approved GPT-J model, precision: INT8, BF16, INT4, QPS=0.14, test by Intel on 09/09/2023.

1-node, 2x Future Gen Intel® Xeon®, codenamed Granite Rapids with 1536GB, 24*64GB 8800MT/s MCRDIMM total memory, HT on, Turbo on, ucode 0xf81913a0, Redhat 9.0 5.19.0-gnr.po.bkc.1.0.1.x86_646.1.11-1.el8.elrepo.x86_64, 1x SSK Storage 953.9 GB, PyTorch / GPT-J 6B, dataset CNN_Dailymail, MLPerf approved GPT-J model, precision: INT8, BF16, INT4, QPS=0.36, test by Intel on 09/09/2023.

09/09/23
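
The 2.57x figure is consistent with the two QPS values listed in the configurations. A minimal sketch, assuming the claim is simply the ratio of the Granite Rapids QPS to the 8480+ QPS (the variable names are illustrative):

```python
# QPS values as listed in the two test configurations above.
spr_qps = 0.14  # 2x 4th Gen Intel Xeon 8480+
gnr_qps = 0.36  # 2x Granite Rapids with MCRDIMM

# Assumption: the throughput claim is the plain ratio of the two QPS figures.
throughput_ratio = gnr_qps / spr_qps

print(f"{throughput_ratio:.2f}x")  # prints "2.57x"
```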
NGS007 Advancing Datacenter Performance With The Intel® Infrastructure Processing Unit (IPU) & IPDK Slide 6 Naru Sundar Using an IPU to offload storage initiator logic consumes only 2.7 cores on the host, compared to 14.3 cores when not offloading. Offloading maintains throughput and performance.

Workload: • Flexible I/O (fio) benchmark on 2x hosts with 4x storage targets. Tested by Intel on 9/12/2022. 2x Host Systems: Archer City server platform, 2x 4th Gen Intel® Xeon® Scalable Processors with 16x 16GB DDR4 DRAM • Host 1: 1x Intel® Ethernet Controller E810. • Host 2: 1x Intel® Infrastructure Processing Unit (IPU) E2000

Host firmware: • EGSDCRB1.SYS.0084.D24.2207132145

4x Target Systems: • Intel® Server M50CYP Family platform, 2x Intel® Xeon® Gold 6342 processors, 16x 16GB DDR4 DRAM, 8x 128GB Intel® Optane™ PMEM, 8x NVMe P5316 U.2 SSD, 1x Intel® Ethernet Controller E810, CentOS 7.9, LightBits LightOS 2.3.17 • LightOS Volume 1: 10GB, no-replication, residing on Target 1 • LightOS Volume 2: 10GB, no-replication, residing on Target 2

Benchmark: fio, 100% random read, number of jobs=16, queue depth=32, block size=32k • Host 1 configured to use LightOS Volume 1 • Host 2 configured to use LightOS Volume

Results: • Host 1: IOPS: 274000, 98.3Gbps, Host CPU Utilization: 14.3 Cores • Host 2: IOPS: 348000, 91.2Gbps, Host CPU Utilization: 2.7 Cores

09/12/22
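
The host-core saving and the bandwidth implied by each IOPS figure can be derived from the reported results. This is a rough sketch under stated assumptions: that fio's "32k" block size means 32 KiB, and that bandwidth counts payload bytes only, in decimal gigabits. The `payload_gbps` helper and the dictionary layout are illustrative, not taken from the test report.

```python
# Figures as reported in the Results line above.
hosts = {
    "Host 1": {"iops": 274_000, "cores": 14.3},  # Intel Ethernet Controller E810
    "Host 2": {"iops": 348_000, "cores": 2.7},   # Intel IPU E2000
}

BLOCK_SIZE_BYTES = 32 * 1024  # assumption: fio "32k" means 32 KiB


def payload_gbps(iops: int, block_size_bytes: int = BLOCK_SIZE_BYTES) -> float:
    """Payload bandwidth in Gbit/s (decimal giga) implied by an IOPS figure."""
    return iops * block_size_bytes * 8 / 1e9


cores_saved = hosts["Host 1"]["cores"] - hosts["Host 2"]["cores"]
core_reduction = cores_saved / hosts["Host 1"]["cores"]

print(f"Host cores freed by offload: {cores_saved:.1f} ({core_reduction:.1%})")
for name, h in hosts.items():
    print(
        f"{name}: {payload_gbps(h['iops']):.1f} Gbps payload, "
        f"{h['iops'] / h['cores']:,.0f} IOPS per host core"
    )
```

Under these assumptions the offload frees about 11.6 host cores (roughly 81%), and the Host 2 IOPS figure reproduces the reported 91.2 Gbps. The reported Host 1 bandwidth is not reproduced by the same arithmetic, so it presumably rests on a different measurement basis.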
NGS007 Advancing Datacenter Performance With The Intel® Infrastructure Processing Unit (IPU) & IPDK Slide 6 Naru Sundar Using SW to implement a representative infrastructure network function (VxLAN + NAT + Metering) resulted in only 22% of the packet rate achieved by an IPU while consuming 8C.

Claim & Setup:
• In the particular measured example, using SW to implement a representative infrastructure network function (VxLAN + NAT + Metering) resulted in only 22% of the packet rate achieved by an IPU while consuming 8C.
• P4 Program:
  o Both the Intel® Xeon® only setup and the Intel® IPU setup used a P4 program implementing the following networking workload:
    ▪ VxLAN termination
    ▪ Network address translation (NAT)
• CPU HW setup
  o CPU: Intel® Xeon® Gold 6252N CPU @ 2.30GHz
  o Sockets: 2
  o Cores/socket: 24
  o Network Interface: Intel® Ethernet Controller X710 for 10GbE SFP+ (rev 02)
  o Traffic Generator: Ixia 4x10GbE ports
  o OS: Ubuntu 22.04 LTS
  o C Compiler: GCC 11.4.0
  o DPDK version: 23.07 w/P4-DPDK extension
  o P4 program: as described earlier
• IPU HW Setup
  o IPU: Intel® IPU E2100 Adapter
  o IPU SW: IPU SDK 0.9.3
  o IPU configuration: 2x100G network topology, default configuration
  o P4 program: as described earlier
• Common test stimulus
  o Runtime configuration: 1M flows with randomly generated fields
  o Input stimulus was generated from a traffic generator with values randomly selected from the programmed runtime ruleset
  o Observed output packet rate was measured for both Intel® Xeon® and Intel® IPU, demonstrating the max packet rate achievable given the workload. In the Intel® Xeon® case this measurement was taken for 1, 2, 4 and 8 physical cores to demonstrate scaling on this workload

Raw data:
• Intel® Xeon® data gathered 8/30/23
  o With 1 physical core: 3.99 Mpps
  o With 2 physical cores: 7.95 Mpps
  o With 4 physical cores: 15.71 Mpps
  o With 8 physical cores: 29.75 Mpps
• Intel® IPU data gathered 8/11/23
  o 134 Mpps

Data presentation: • The data is presented in a graph normalizing the packet rate against the IPU max rate. • The workload vs infrastructure tax core count was derived by counting the % of cores used for network processing against the total set of cores available on this device.

08/30/23
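
The 22% figure follows from normalizing the Xeon packet rates against the IPU rate, as the data presentation note describes. A minimal sketch using the raw data above; the scaling-efficiency calculation (each result divided by an ideal linear multiple of the 1-core rate) is an added illustration, not something the disclosure reports.

```python
# Raw data as reported above: physical cores -> packet rate in Mpps.
xeon_mpps = {1: 3.99, 2: 7.95, 4: 15.71, 8: 29.75}
ipu_mpps = 134.0

# Packet rate normalized against the IPU max rate.
normalized = {cores: rate / ipu_mpps for cores, rate in xeon_mpps.items()}

# Illustrative extra: how close each result is to ideal linear scaling
# from the single-core rate (1.0 would be perfectly linear).
scaling_efficiency = {
    cores: rate / (cores * xeon_mpps[1]) for cores, rate in xeon_mpps.items()
}

for cores in xeon_mpps:
    print(
        f"{cores} core(s): {normalized[cores]:.1%} of IPU rate, "
        f"{scaling_efficiency[cores]:.1%} of linear scaling"
    )
```

With 8 physical cores the Xeon setup reaches about 22.2% of the IPU packet rate, which matches the 22% in the claim.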
NGS012 Developing Next-Gen Games with Intel® Graphics Software and Hardware Slide 22 Damien Triolet Intel® Core™ i7-1370P with Intel® Iris® Xe Graphics and XeSS deliver increased performance at 1080p as measured by FPS when compared to gameplay without XeSS All games tested at 1080p using medium settings. All FPS (frames per second) scores are measured either with PresentMon or with an in-game benchmark. All gameplay has a documented workload running the same replay or game scenario across all configurations and test runs. Game workloads that support this claim are Call of Duty: Modern Warfare 2, Hitman 3, Shadow of the Tomb Raider, The DioField Chronicle, Gotham Knights, Ghostbusters Spirits Unleashed, Death Stranding Director's Cut and Arcadegeddon. 03/10/23
NGS012 Developing Next-Gen Games with Intel® Graphics Software and Hardware Slide 17 Damien Triolet Intel® Arc™ Graphics found on pre-production CPU codenamed MTL achieves up to 6.6x faster depth scaling compared to a 13th Gen Core i7 with Intel Iris Xe Graphics As part of our internal micro-benchmark testing, we've measured significant improvements for key graphics metrics when comparing a pre-release MTL system with Intel Arc Graphics to a 13th Gen Core i7 with Intel Iris Xe Graphics: - Up to 6.6x faster depth test rate - Up to 2.6x faster vertex processing rate - Up to 2.6x faster triangle draw rate - Up to 2.3x faster compute instruction rate - Up to 2.1x faster pixel blend rate 07/25/23
NGS012 Developing Next-Gen Games with Intel® Graphics Software and Hardware Slide 4 Damien Triolet Intel® Arc™ Graphics found on pre-production CPU codenamed MTL achieves up to 2x better perf/watt versus Intel® Iris® Xe Graphics 2x perf/watt target successfully observed when running Shadow of the Tomb Raider (DX12) at 1080p Medium. 07/17/23