Intel® Innovation 2023

Performance Index

ID	Date	Classification
615781	03/18/2025	Public

Keynotes

Session	Section	Speaker	Claim	Claim Details/Citation	Testing Date
Pat Gelsinger Keynote	Intel® Xeon® Roadmap	Pat Gelsinger	Estimated performance 5th Gen Intel® Xeon® vs. 4th Gen Intel® Xeon® on a range of AI inferencing and training workloads. 5th Gen Intel® Xeon® boasts more compute and faster memory while still using the same power as Intel's previous generation - up to 40% more AI performance out of the box.	AI Performance Gain (PyTorch) Estimated performance comparing: 8592+ (Archer City): 1-node, pre-production 2x 5th Gen Intel® Xeon® Platinum processor (Emerald Rapids) 64C, 350W TDP; HT on, Turbo on, Total Memory 1024GB (16x64GB DDR5-5600 MT/s [5600 MT/s]); BIOS Version EGSDCRB1.E9I.0102.D48.2305231333; ucode revision=0xa10000c0, CentOS Stream 9, 1x Samsung SSD 860 EVO 1TB (TF), test by Intel on 5/26/2023. 8480+ (Archer City): 1-node, with 2x Intel® Xeon® Platinum processor (Sapphire Rapids) 56C, 350W TDP, HT on, Turbo on, Total Memory 1024GB (16x64GB DDR5-4800 MT/s [4800 MT/s]); BIOS Version EGSDCRB1.SYS.9409.P01.2211280753, ucode revision=0x2b000161, CentOS Stream 8, 5.15.0, 1x INTEL SSDSC2KW256G8 (PT)/Samsung SSD 860 EVO 1TB (TF), test by Intel on 5/29/2023.	05/26/2023 and 05/29/2023
Greg Lavender Keynote	Prediction Guard	David Sidd	Up to 2x throughput increase in performance of models from Nvidia A100 to Intel® Gaudi® 2	Intel Gaudi 2: • 8 Intel Gaudi 2 HL-225H mezzanine cards (although results are for running models on only one card at a time) • 3rd Gen Intel® Xeon® processors • 1 TB RAM • 30 TB disk • Workload running in Habana’s pre-built Docker image (vault.habana.ai/gaudi-docker/1.11.0/ubuntu20.04/habanalabs/pytorch-installer-2.0.1:latest) Nvidia A100: • 1 Nvidia A100 (80 GiB) • 12 vCPUs • 140.625 GiB memory • Workload running in Docker via Baseten’s Truss package Results: Nous-Hermes-Llama2-13B: • Intel Gaudi 2 (single card): 71.8 tokens/second • Our previous deployment on 1 Nvidia A100 80GB: 36.9 tokens/second WizardCoder 15B: • Intel Gaudi 2 (single card): 63.24 tokens/second • Our previous deployment on 1 Nvidia A100 80GB: 58.1 tokens/second	09/12/2023 by Prediction Guard
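
The "up to 2x" figure in the Prediction Guard row can be checked against the tokens/second results quoted in the same row. The following is a minimal sketch of that arithmetic; the names `speedup` and `RESULTS` are hypothetical helpers introduced here for illustration, and the only inputs are the four throughput numbers from the table.

```python
def speedup(new_rate: float, baseline_rate: float) -> float:
    """Return how many times faster new_rate is than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return new_rate / baseline_rate


# (Gaudi 2 single card, Nvidia A100 80GB) throughput in tokens/second,
# copied from the Prediction Guard row above.
RESULTS = {
    "Nous-Hermes-Llama2-13B": (71.8, 36.9),
    "WizardCoder 15B": (63.24, 58.1),
}

ratios = {model: speedup(gaudi, a100) for model, (gaudi, a100) in RESULTS.items()}

for model, r in ratios.items():
    print(f"{model}: {r:.2f}x")
```

Under these numbers the larger ratio comes from Nous-Hermes-Llama2-13B, which is the result the "up to 2x" wording rests on; the WizardCoder 15B gain is considerably smaller.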

Demos

Session	Section	Speaker	Claim	Claim Details/Citation	Testing Date
AI Booth Demo	Enablement and optimizations with Triton for PyTorch	Pramod Pai	Running on Triton backend is faster than on PyTorch in eager mode	The demo shows how torch.compile(), a new feature in PyTorch 2.0, can be used with the Inductor backend set to Triton to run LLMs on Data Center Max Series and obtain performance speedups. The performance comparison is between eager mode and torch.compile() mode with Triton.	07/07/23
AI Booth Demo	Accelerate Diffusion Models for Medical Image Segmentation with PyTorch	Kevin Ta	IPEX written code runs faster than non-IPEX written code for PyTorch	The demo includes a UI where users run the MedSegDiff model to segment MRI brain scans. Users can select whether or not to enable Intel® Extension for PyTorch*. The model outputs inference time.	09/14/23

Sessions

Session	Section	Speaker	Claim	Claim Details/Citation	Testing Date
NGS001 Designing Solutions for Intel® Xeon® with E-cores and P-cores	Slide 13	Matt Langman	2.57x throughput for latency sensitive queries running MLPerf GPT-J 6B on GNR-AP 12 Channel MCRDIMM 8800 vs SPR-SP 8480 with DDR5.	1-node, 2x 4th Gen Intel® Xeon® 8480+, with 1024GB (16x64GB 4800MT/s DDR5, Hynix) total memory, HT on, Turbo on, ucode 0x2b000461, CentOS Stream 8, 6.1.11-1.el8.elrepo.x86_64, SAMSUNG MZQL21T9HCJR-00A07 1.8TB, PyTorch / GPT-J 6B, dataset CNN_Dailymail, MLPerf approved GPT-J model, precision: INT8, BF16, INT4, QPS=.14, test by Intel on 09/09/2023. 1-node, 2x Future Gen Intel® Xeon®, codenamed Granite Rapids, with 1536GB (24x64GB 8800MT/s MCRDIMM) total memory, HT on, Turbo on, ucode 0xf81913a0, Redhat 9.0 5.19.0-gnr.po.bkc.1.0.1.x86_646.1.11-1.el8.elrepo.x86_64, 1x SSK Storage 953.9 GB, PyTorch / GPT-J 6B, dataset CNN_Dailymail, MLPerf approved GPT-J model, precision: INT8, BF16, INT4, QPS=.36, test by Intel on 09/09/2023.	09/09/23
NGS007 Advancing Datacenter Performance With The Intel® Infrastructure Processing Unit (IPU) & IPDK	Slide 6	Naru Sundar	Using an IPU to offload storage initiator logic consumes only 2.7 cores on the host, compared to 14.3 cores when not offloading. Offloading maintains throughput and performance.	Workload: • Flexible I/O (fio) benchmark on 2x hosts with 4x storage targets. Tested by Intel on 9/12/2022. 2x Host Systems: Archer City server platform, 2x 4th Gen Intel® Xeon® Scalable Processors with 16x 16GB DDR4 DRAM • Host 1: 1x Intel® Ethernet Controller E810. • Host 2: 1x Intel® Infrastructure Processing Unit (IPU) E2000 Host firmware: • EGSDCRB1.SYS.0084.D24.2207132145 4x Target Systems: • Intel® Server M50CYP Family platform, 2x Intel® Xeon® Gold 6342 processors, 16x 16GB DDR4 DRAM, 8x 128GB Intel® Optane™ PMEM, 8x NVME P5316 U.2 SSD, 1x Intel® Ethernet Controller E810, CentOS 7.9, LightBits LightOS 2.3.17 • LightOS Volume 1: 10GB, no-replication, residing on Target 1 • LightOS Volume 2: 10GB, no-replication, residing on Target 2 Benchmark: fio, 100% random read, number of jobs=16, queue depth=32, block size=32k • Host 1 configured to use LightOS Volume 1 • Host 2 configured to use LightOS Volume Results: • Host 1: IOPS: 274000, 98.3Gbps, Host CPU Utilization: 14.3 Cores • Host 2: IOPS: 348000, 91.2Gbps, Host CPU Utilization: 2.7 Cores	09/12/22
NGS007 Advancing Datacenter Performance With The Intel® Infrastructure Processing Unit (IPU) & IPDK	Slide 6	Naru Sundar	Using SW to implement a representative infrastructure network function (VxLAN + NAT + Metering) resulted in only 22% of the packet rate achieved by an IPU while consuming 8C.	Claim & Setup: • In the particular measured example, using SW to implement a representative infrastructure network function (VxLAN + NAT + Metering) resulted in only 22% of the packet rate achieved by an IPU while consuming 8C. • P4 Program: o Both the Intel® Xeon® only setup and the Intel® IPU setup used a P4 program implementing the following networking workload: VxLAN termination; network address translation (NAT) • CPU HW setup o CPU: Intel® Xeon® Gold 6252N CPU @ 2.30GHz o Sockets: 2 o Cores/socket: 24 o Network Interface: Intel® Ethernet Controller X710 for 10GbE SFP+ (rev 02) o Traffic Generator: Ixia 4x10GbE ports o OS: Ubuntu 22.04 LTS o C Compiler: GCC 11.4.0 o DPDK version: 23.07 w/P4-DPDK extension o P4 program: as described earlier • IPU HW Setup o IPU: Intel® IPU E2100 Adapter o IPU SW: IPU SDK 0.9.3 o IPU configuration: 2x100G network topology, default configuration o P4 program: as described earlier • Common test stimulus o Runtime configuration: 1M flows with randomly generated fields o Input stimulus was generated from a traffic generator with values randomly selected from the programmed runtime ruleset o Observed output packet rate was measured for both Intel® Xeon® and Intel® IPU demonstrating max packet rate achievable given the workload. In the Intel® Xeon® case this measurement was taken for 1, 2, 4 and 8 physical cores used to demonstrate scaling on this workload Raw data: • Intel® Xeon® data gathered 8/30/23 o With 1 physical core: 3.99 Mpps o With 2 physical cores: 7.95 Mpps o With 4 physical cores: 15.71 Mpps o With 8 physical cores: 29.75 Mpps • Intel® IPU data gathered 8/11/23 o 134 Mpps Data presentation: • The data is presented in a graph normalizing the packet rate against the IPU max rate. • The workload vs infrastructure tax core count was derived by counting the % of cores used for network processing against the total set of cores available on this device.	08/30/23
NGS012 Developing Next-Gen Games with Intel® Graphics Software and Hardware	Slide 22	Damien Triolet	Intel® Core™ i7-1370P with Intel® Iris® Xe Graphics and XeSS deliver increased performance at 1080p as measured by FPS when compared to gameplay without XeSS	All games tested at 1080p using medium settings. All FPS (frames per second) scores are either measured with PresentMon or in-game benchmark. All gameplay has a documented workload running the same replay or game scenario across all configurations and test runs. Game workloads that support this claim are Call of Duty: Modern Warfare 2, Hitman 3, Shadow of the Tomb Raider, The DioField Chronicle, Gotham Knights, Ghostbusters Spirits Unleashed, Death Stranding Director's Cut and Arcadegeddon. Iris® Xe Graphics and XeSS deliver increased performance at 1080p as measured by FPS when compared to gameplay without XeSS	03/10/23
NGS012 Developing Next-Gen Games with Intel® Graphics Software and Hardware	Slide 17	Damien Triolet	Intel® Arc™ Graphics found on pre-production CPU codenamed MTL achieves up to 6.6x faster depth scaling compared to a 13th Gen Core i7 with Intel Iris Xe Graphics	As part of our internal micro-benchmark testing, we've measured significant improvements for key graphics metrics when comparing a pre-release MTL system with Intel Arc Graphics to a 13th Gen Core i7 with Intel Iris Xe Graphics: - Up to 6.6x faster depth test rate - Up to 2.6x faster vertex processing rate - Up to 2.6x faster triangle draw rate - Up to 2.3x faster compute instruction rate - Up to 2.1x faster pixel blend rate	07/25/23
NGS012 Developing Next-Gen Games with Intel® Graphics Software and Hardware	Slide 4	Damien Triolet	Intel® Arc™ Graphics found on pre-production CPU codenamed MTL achieves up to 2x better perf/watt versus Intel® Iris® Xe Graphics	2x perf/watt target successfully observed when running Shadow of the Tomb Raider (DX12) at 1080p Medium.	07/17/23
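
Several headline figures in the Sessions table follow directly from the raw numbers quoted in the Claim Details column. The sketch below reproduces that arithmetic for the NGS001 and NGS007 rows. All variable names are hypothetical and introduced only for illustration; the per-core scaling efficiency at the end is an extra derived quantity that does not appear in the source.

```python
# NGS001: GPT-J 6B queries per second, as listed in the row's details.
qps_granite_rapids = 0.36
qps_sapphire_rapids = 0.14
gptj_speedup = qps_granite_rapids / qps_sapphire_rapids

# NGS007 (storage offload): host cores consumed with and without the IPU.
cores_without_offload = 14.3
cores_with_offload = 2.7
cores_freed = cores_without_offload - cores_with_offload

# NGS007 (network function): Xeon packet rate by physical core count, in Mpps,
# and the single IPU measurement, in Mpps.
xeon_mpps = {1: 3.99, 2: 7.95, 4: 15.71, 8: 29.75}
ipu_mpps = 134.0

# Packet rate normalized against the IPU max rate, as the row describes.
normalized = {cores: rate / ipu_mpps for cores, rate in xeon_mpps.items()}

# Not in the source: how close the Xeon results are to linear scaling,
# relative to the 1-core measurement.
scaling_efficiency = {
    cores: rate / (cores * xeon_mpps[1]) for cores, rate in xeon_mpps.items()
}

print(f"GPT-J throughput ratio: {gptj_speedup:.2f}x")
print(f"Host cores freed by offload: {cores_freed:.1f}")
for cores in sorted(xeon_mpps):
    print(
        f"{cores} core(s): {normalized[cores]:.1%} of IPU rate, "
        f"{scaling_efficiency[cores]:.1%} scaling efficiency"
    )
```

With these inputs, the 8-core Xeon result lands at roughly 22% of the IPU packet rate, which matches the figure stated in the claim, and the QPS ratio matches the 2.57x stated for NGS001.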
