Innovation Event Claims

Performance Index

ID 615781

Date 07/22/2024

Keynote

Session Code, Speaker Name and/or Slide Number	Claim	Claim Details
KEY100 UC/ GB	Perform video export from Adobe Premiere Pro and mass batch photo import+color correction+export with Adobe Lightroom together, up to 30% faster with 12th Gen Intel Core i9-12900K and Nvidia RTX 3080 vs. AMD Ryzen 9 5950X and Nvidia RTX 3080	Performance results are based on testing by Intel as of October 14, 2021 and may not reflect all publicly available updates. As measured by comparing the 12th Gen Intel Core i9-12900K against AMD Ryzen 9 5950X using the following operations: Processor: 12th Gen Intel® Core™ i9-12900K processor (ADL-S) PL1 set to 241W TDP, 16C24T (8P + 8E); Motherboard: Pre-production Asus ROG Strix-E Z690; Memory: SK Hynix DDR5 CL 36-36-36-70, 2X 32GB DDR5-4400MHz; Storage: Samsung 980 Pro 1TB; Display Resolution: 1920x1080; OS: Microsoft Windows 11 Pro 22000.9; Graphics card: NVIDIA RTX 3090 (FTW3), Graphics driver: 471.68; Motherboard BIOS version: 0007. Processor: AMD Ryzen 9 5950X processor PL1=105W TDP, 16C32T, Motherboard: Asus ROG Crosshair Hero VIII; Memory: G.Skill DDR4 CL 14-14-14-34, 4X 16GB DDR4-3200 MHz; Storage: Samsung 980 Pro 1TB; Display Resolution: 1920x1080; OS: Microsoft Windows 11 Pro 22000.9; Graphics card: NVIDIA RTX 3090 (FTW3), Graphics driver: 471.68; Motherboard BIOS version: 3801.
KEY100 UC/ GB	Linux-based workstation has up to 3 times the memory of competitive offerings with Intel Optane	Based on a technical specification, comparing Nvidia's RTX A6000 memory of 48GB per PCI Express Gen 4x16 graphics card with Intel Xeon Gold 6238L Processor with a max memory size of 4.5TB using Intel Optane Persistent Memory. Top-tier data science workstations with Nvidia RTX A6000 can fit up to 96GB of memory compared to top-tier data science workstations with up to 4.5TB of Intel Optane Persistent Memory with Intel Xeon Gold 6238L processors.
KEY100 UC/ GB	Linux-based workstation has up to 100 times faster hardware accelerations.	Please see https://medium.com/intel-analytics-software  for relevant performance benchmark details, including performance results for various datasets tested against Stock Scikit-Learn available at https://medium.com/intel-analytics-software/leverage-intel-optimizations-in-scikit-learn-f562cb9d5544.
KEY100 UC/ GB	World's best gaming processor	As measured by unique features and superior in-game benchmark mode performance (score or frames per second) on majority of the 31 game titles tested (as of Oct 1, 2021), including in comparison to AMD Ryzen 5950X. Additional details available at http://www.intel.com/PerformanceIndex (12th Gen Intel Core desktop processors). Results may vary.
KEY100 UC/ GB	Best overclocking experience	Based on enhanced overclocking ability enabled by Intel's comprehensive tools and unique architectural tuning capabilities. Overclocking may void warranty or affect system health. Learn more at intel.com/overclocking. Results may vary.
KEY100 UC/ GB	Performance hybrid architecture	Performance hybrid architecture combines two new core microarchitectures, Performance-cores (P-cores) and Efficient-cores (E-cores), on a single processor die. Select 12th Gen Intel® Core™ processors (certain 12th Gen Intel Core i5 processors and lower) do not have performance hybrid architecture, only P-cores.
KEY100 UC/ GB	Intel Thread Director	Built into the hardware, Intel® Thread Director is provided only in performance hybrid architecture configurations of 12th Gen Intel® Core™ processors; OS enablement is required. Available features and functionality vary by OS.
KEY100 UC/ DELL	The Dell Alienware Aurora R13 is 87% faster than the prior gen, up to 16% quieter, runs cooler and, as a result, customers can get up to 5% better performance from the same graphics running in this chassis.	As measured by Dell analysis, October 2021, based on multi-core tests with the CPU overclocked to 175W over the Aurora R12 with a 160W CPU overclock; 5% improvement in graphics performance of The Alienware Aurora R13 compared to the legend 2.0 industrial design, using the same graphics card.
KEY100 UC/ GB	Overclocking demo world record	To learn more about the overclocking world records, visit https://hwbot.org/benchmarks/world_records.
KEY100 Greg Lavender	DeepRec is powered by oneAPI with AVX-512, VNNI and BF16 acceleration, resulting in up to 7x gains	Based on Alibaba internal measurements as of Oct. 2021, using custom SKU and software stack.
KEY100 Greg Lavender	Intel DPC++/C++ Compiler 1.8x faster than GCC* for SPECspeed 2017 Floating Point suite	Claim ID: 07-09-2021-04 Configuration: Testing by Intel as of Jun 10, 2021. Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 2 socket, Hyper Thread on, Turbo on, 32G x16 DDR4 3200 (1DPC). Red Hat Enterprise Linux release 8.2 (Ootpa), 4.18.0-193.el8.x86_64. Software: Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64, Version 2021.3.0 Build 20210604. Intel(R) C++ Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.3.0 Build 20210604_000000, GCC 11.1, Clang/LLVM 12.0.0. SPECint®_speed_base_2017 compiler switches: Intel(R) oneAPI DPC++/C++ Compiler: -xCORE-AVX512 -O3 -ffast-math -flto -mfpmath=sse -funroll-loops -qopt-mem-layout-trans=4 -fiopenmp. Intel(R) C++ Intel(R) 64 Compiler Classic: -xCORE-AVX512 -ipo -O3 -no-prec-div -qopt-mem-layout-trans=4 -qopt-multiple-gather-scatter-by-shuffles -qopenmp. GCC: -march=skylake-avx512 -mfpmath=sse -Ofast -funroll-loops -flto -fopenmp. LLVM: -march=skylake-avx512 -mfpmath=sse -Ofast -funroll-loops -flto -fopenmp=libomp. jemalloc 5.0.1 used for Intel compiler, GCC and LLVM. SPECfp®_speed_base_2017 compiler switches: Intel(R) oneAPI DPC++/C++ Compiler: -xCORE-AVX512 -Ofast -ffast-math -flto -mfpmath=sse -funroll-loops -qopt-mem-layout-trans=4 -fiopenmp. Intel(R) C++ Intel(R) 64 Compiler Classic: -xCORE-AVX512 -ipo -O3 -no-prec-div -qopt-prefetch -ffinite-math-only -qopt-multiple-gather-scatter-by-shuffles -qopenmp. GCC: -march=skylake-avx512 -mfpmath=sse -Ofast -fno-associative-math -funroll-loops -flto -fopenmp. LLVM: -march=skylake-avx512 -mfpmath=sse -Ofast -funroll-loops -flto -fopenmp=libomp. jemalloc 5.0.1 used for Intel compiler, GCC and LLVM.
KEY100 Greg Lavender / Sandra Rivera	Intel oneAPI Deep Neural Network Library up to 1.5x speedup in TensorFlow	1-node, 2x 3rd Gen Intel Xeon Platinum 8380 with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0x8d9522d4, HT on, Turbo on, Ubuntu 20.04.2 LTS (docker), 5.4.0-77-generic, TensorFlow v2.5.0 w/o oneDNN, TensorFlow v2.6.0 w/ oneDNN, test by Intel on 09/28/2021.
KEY100 Greg Lavender	IPP Cryptography Library up to 5.63x speedup for multi-buffer implementations of cryptographic algorithms on 3rd Gen Xeon Scalable Processors (codenamed Ice Lake).	5.63x higher OpenSSL RSA Sign 2048 performance, 1.90x higher OpenSSL ECDSA Sign p256 performance, 4.12x higher OpenSSL ECDHE x25519 performance, 2.73x higher OpenSSL ECDHE p256 performance. 8280M: 1-node, 2x Intel(R) Xeon(R) Platinum 8280M CPU on S2600WFT with 384 GB (12 slots/ 32GB/ 2933) total DDR4 memory, ucode 0x5003003, HT On, Turbo Off, Ubuntu 20.04.1 LTS, 5.4.0-65-generic, 1x INTEL_SSDSC2KG01, OpenSSL 1.1.1j, GCC 9.3.0, test by Intel on 3/5/2021. 8380: 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU on M50CYP2SB2U with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, ucode 0xd000270, HT On, Turbo Off, Ubuntu 20.04.1 LTS, 5.4.0-65-generic, 1x INTEL_SSDSC2KG01, OpenSSL 1.1.1j, GCC 9.3.0, QAT Engine v0.6.4, test by Intel on 3/24/2021. 8380: 1-node, 2x Intel(R) Xeon(R) Platinum 8380 CPU on M50CYP2SB2U with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, ucode 0xd000270, HT On, Turbo Off, Ubuntu 20.04.1 LTS, 5.4.0-65-generic, 1x INTEL_SSDSC2KG01, OpenSSL 1.1.1j, GCC 9.3.0, QAT Engine v0.6.5, test by Intel on 3/24/2021.
KEY100 Greg Lavender AITI001 Pradeep Dubey Slide 20	Modin is an open source library which accelerates Pandas applications by up to 20x, with near infinite scaling from PC to cloud with a Jupyter notebook, all with just a 1-line code change.	Performance results are based on testing by Intel as of October 16, 2020 and may not reflect all publicly available updates. Configuration details and Workload Setup: 2 x Intel® Xeon® Platinum 8280 @ 28 cores, OS: Ubuntu 19.10, 5.3.0-64-generic, Mitigated, 384GB RAM (192 GB RAM (12x 32GB 2933). SW: Modin 0.81, Scikit-learn 0.22.2, Pandas 1.01, Python 3.8.5, DAL (DAAL4Py) 2020.2. Census Data [21721922, 45]: Dataset is from IPUMS USA, University of Minnesota, www.ipums.org [Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0], test by Intel on 10/16/2020.
KEY100 Sandra Rivera AITI001 Pradeep Dubey Slide 21	Scikit-learn is the most popular machine learning library, and Intel's optimizations enable up to a 100x speedup for some models with a 1-line code change to enable Intel's extensions.	Performance results are based on testing by Intel® as of October 23, 2020 and may not reflect all publicly available updates. Configuration Details and Workload Setup: Intel® oneAPI Data Analytics Library 2021.1 (oneDAL). Scikit-learn 0.23.1, Intel® Distribution for Python 3.8; Intel® Xeon® Platinum 8280L CPU @ 2.70GHz, 2 sockets, 28 cores per socket, 10M samples, 10 features, 100 clusters, 100 iterations, float32.
KEY100 Sandra Rivera AITI001 Pradeep Dubey Slide 37	Dual socket server with the next gen general purpose cpu can infer over 24K images/second compared with 16K on an Nvidia A30 GPU. This means we can deliver better than 1.5X the performance of Nvidia's mainstream inferencing GPU for 2022.	Baseline: 1-node, 2x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids, > 40 cores) on Intel pre-production platform with 512 GB DDR memory (8(1DPC)/64GB/4800 MT/s), HT on, Turbo on, Ubuntu Linux 18.04.4 LTS, internal pre-production bios and software running ResNet50-v1.5, BS=504, INT8 with intel internal optimization, test by Intel on 10/11/2021  Comp: EPYC 7742@2.25GHz with 1x NVIDIA A30 on GIGABYTE G482-Z52-00, TensorRT 8.0, Batch Size = 128, 21.08-py3, INT8, Dataset: Synthetic. Last updated: September 27th, 2021 Source: https://developer.nvidia.com/deep-learning-performance-training-inference.
KEY100 Sandra Rivera	Developers using INC can increase their productivity by up to 18x vs. NV TensorRT when quantizing models from FP32 to INT8 numerical formats.	Baseline: 1-node, 1x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids, > 40 cores) on Intel pre-production platform with 512 GB DDR memory (8(1DPC)/64GB/4800 MT/s), HT on, Turbo off, CentOS Linux 8.4, internal pre-production BIOS and software running SSD-ResNet34 BS=1 using TensorFlow 2.6 with Intel internal optimization, test by Intel on 10/25/2021. Comp: 1-node, 1x AMD EPYC 7742 64-Core Processor with A100-40GB-PCIE, on memory 8x32GB DDR4-3200, HT on, Turbo off, Ubuntu 20.04, TensorRT 8.0.1.6 running SSD-ResNet34 using CUDA 11.3, CUDNN 8.2.1.32, test by Intel on 10/18/2021.
KEY100 Lisa Pearce	In this example, Deep Link delivers up to 40% faster transcode time.	Up to 40% higher FPS in video encoding through an internal release of HandBrake on integrated Intel Xe graphics + discrete Intel Arc graphics compared to using Intel Arc graphics alone. HandBrake running on Alchemist pre-production silicon. As of October 2021.
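
The "1-line code change" cited in the Modin and scikit-learn rows above usually refers to swapping or adding a single import-level statement. For Intel's scikit-learn extension this is a call to `patch_sklearn()` from the `sklearnex` package (scikit-learn-intelex); for Modin it is replacing `import pandas as pd` with `import modin.pandas as pd`. The sketch below is illustrative only: the helper name `enable_intel_sklearn` is hypothetical, and the availability check is an added assumption so the snippet still runs where the optional package is not installed.

```python
import importlib.util


def enable_intel_sklearn() -> bool:
    """Try to switch scikit-learn to Intel's accelerated implementations.

    Returns True if the patch was applied, and False if the optional
    scikit-learn-intelex package (import name: sklearnex) is not installed.
    """
    if importlib.util.find_spec("sklearnex") is None:
        return False
    # The code change itself: patch before importing any sklearn estimators.
    from sklearnex import patch_sklearn

    patch_sklearn()
    return True


patched = enable_intel_sklearn()
print("Intel extension active:", patched)
```

Estimators imported from `sklearn` after a successful patch keep their usual API; whether a given model is actually accelerated depends on the extension version and the algorithm.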

Tech Insights

AITI001

Pradeep Dubey

Slide 13

End-to-end AI performance on 3rd Gen Xeon versus Nvidia Ampere A100 for Census

3rd Gen Intel Xeon Platinum 8380 CPU: 2x 3rd Gen Intel Xeon Platinum 8380 with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0x8d055260, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-65-generic kernel, 1x INTEL SSDSC2KG960G8, Python 3.7.9, Modin 0.8.3, Omniscidbe v5.4.1, scikit-learn v0.24.1 accelerated by daal4py v2021.2, test by Intel on 03/15/2021.

Nvidia Ampere A100 GPU: Nvidia Ampere A100 GPU hosted on 2x AMD EPYC 7742 CPU with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0x8301034, HT on, Turbo on, Ubuntu 18.04.5 LTS, 5.4.0-42-generic kernel, 1x SAMSUNG 3.5TB SSD, Python 3.7.9, RAPIDS 0.17, cuDF 0.17, cuML 0.17, scikit-learn v0.24.1, CUDA 11.0.221, test by Intel on 02/04/2021. Census Data [21721922, 45]: Dataset is from IPUMS USA, University of Minnesota, www.ipums.org [Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0]

AITI001

Pradeep Dubey

Slide 13

End-to-end AI performance on 3rd Gen Xeon versus Nvidia Ampere A100 for PLAsTiCC

PLAsTiCC Data: Training set: (1421705, 6); Test set: (189022127, 6). Dataset is from Kaggle challenge "PLAsTiCC Astronomical Classification" https://www.kaggle.com/c/PLAsTiCC-2018/data

AITI001

Pradeep Dubey

Slide 13

End-to-end AI performance on 3rd Gen Xeon versus Nvidia Ampere A100 for DLSA

3rd Gen Intel Xeon Platinum 8380 CPU: 2x 3rd Gen Intel Xeon Platinum 8380 with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0xd0002b1, HT off, Turbo on, Ubuntu 20.04 LTS, 5.4.0-84-generic kernel, 1x Intel 960GB SSD, Intel® Extension for PyTorch v1.8.1, Transformers 4.6.1, MKL 2021.3.0, Bert-large-uncased ( https://huggingface.co/bert-large-uncased) model, BS=1 per instance, 20 instances/node, 4 cores/instance, test by Intel on 09/17/2021.

Nvidia Ampere A100 GPU: Nvidia Ampere A100 GPU hosted on 2x AMD EPYC 7742 CPU with 1024GB (16 slots/ 64GB/ 3200MHz) total DDR4 memory, microcode 0x8301034, HT off, Turbo on, Ubuntu 20.04 LTS, 5.4.0-80-generic kernel, 1x SAMSUNG 3.5TB SSD, PyTorch 1.8.1, Transformers 4.6.1, CUDA 11.1, Bert-large-uncased ( https://huggingface.co/bert-large-uncased) model, BS=1 per instance, 7 total instances with MIG enabled, test by Intel on 09/22/2021

AITI001

Pradeep Dubey

Slide 13

End-to-end AI performance on 3rd Gen Xeon versus Nvidia Ampere A100 for DIEN

3rd Gen Intel Xeon Platinum 8380 CPU: 1-node, 2x 3rd Gen Intel Xeon Platinum 8380 on Coyote Pass with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0xd0002b1, HT off, Turbo on, Ubuntu 20.04 LTS, 5.4.0-84-generic, 1x Intel 960GB SSD OS Drive, Modin 0.10.2, Intel-tensorflow-avx512 2.6.0, oneDNN v2.3, test by Intel on 09/29/2021

Nvidia Ampere A100 GPU: 1-node, 2x AMD EPYC 7742 on Nvidia DGXA100 920-23687-2530-000 utilizing 1x A100 GPU with 1024 GB (16 slots/ 64GB/ 3200) total DDR4 memory, microcode 0x8301034, HT off, Turbo on, Ubuntu 20.04 LTS, 5.4.0-84-generic, 1x SAMSUNG 3.5TB SSD OS Drive, Modin 0.10.2, tensorflow 2.6.0+nv, CUDA 11.4, test by Intel on 09/29/2021

AITI001

Pradeep Dubey

Slide 20

Intel optimized these libraries (NumPy and SciPy) by using the oneAPI core building blocks such as oneMKL to deliver up to 100x speedups.

New:

1-node, 2x 3rd Gen Intel Xeon 8368Q on Coyote Pass platform with 512GB (16 slots/ 32GB/ 3200[run speed 2933]) total DDR4 memory, microcode 0xd0002a0, HT on, Turbo on, CentOS Linux 7, 3.10.0-1160.36.2.el7.x86_64, 1x 1TB SSD Drive, iBench https://github.com/IntelPython/ibench, Intel Distribution for Python 2021.4, test by Intel on 10/13/2021

Baseline: 1-node, 2x 3rd Gen Intel Xeon 8368Q on Coyote Pass platform with 512GB (16 slots/ 32GB/ 3200[run speed 2933]) total DDR4 memory, microcode 0xd0002a0, HT on, Turbo on, CentOS Linux 7, 3.10.0-1160.36.2.el7.x86_64, 1x 1TB SSD Drive, iBench https://github.com/IntelPython/ibench, NumPy 1.20.3 (pypi), SciPy 1.7.1 (pypi), test by Intel on 10/13/2021

AITI001

Pradeep Dubey

Slide 36

As you saw in the opening keynote, we demonstrated the performance gains of using oneDNN optimizations and the neural compressor, plus the productivity savings of automating inference optimization, plus measured gains on Sapphire Rapids… added up to a 30X performance gain!

Baseline: 1-node, 2x 3rd Gen Intel Xeon Platinum 8380 with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0x8d9522d4, HT on, Turbo on, Ubuntu 20.04.2 LTS (docker), 5.4.0-77-generic, TensorFlow v2.5.0 w/o oneDNN, TensorFlow v2.6.0 w/ oneDNN, test by Intel on 09/28/2021

New:

1-node, 2x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids, > 40 cores) on Intel pre-production platform with 512 GB DDR memory (8(1DPC)/64GB/4800 MT/s), HT on, Turbo on, CentOS Linux 8.4, internal pre-production bios and software running SSD-ResNet34 BS=1 using TensorFlow 2.6 with intel internal optimization, test by Intel on 09/28/2021

AITI001

Pradeep Dubey

Slide 41

Gaudi accelerators are bringing that efficiency to Amazon EC2 training instances to deliver up to 40% better price performance than current GPU-based instances, so AWS customers can train more and spend less.

The price performance claim is made by AWS and based on AWS internal performance testing and publicly available pricing at https://aws.amazon.com/ec2/pricing/on-demand/. Habana Labs, a subsidiary of Intel Corporation, does not control or audit third-party data; your price performance may vary.

CLDTI002

Kamhout/Weekly Slide 29

CLD005

Arijit Biswas

Slide 13

39% additional CPU core cycles after DSA offload. On an Open vSwitch (OVS) use case with up to 4 instances of Data Streaming Accelerator (DSA), we see a nearly 40% reduction in CPU utilization, and a 2.5x improvement in data movement performance.

Results have been estimated or simulated as of July 2021 based on testing on pre-production hardware and software.

Platform: Archer City SDV; CPU: Sapphire Rapids C1 pre-production; MEMORY: 128GB DDR5 (16GB PC5-4800E); BIOS: EGSDCRB1.86B.0056.D18.2104081151; OS: Ubuntu 20.04; NIC: 2x100Gb/s E810 (CVL); Virtual Switch: OVS 2.15.9; Data Plane: DPDK 21.08-rc0.

CLD005

Arijit Biswas

Slide 17

CLD001

Giri

Slide 19

CLDTI002, Kamhout/Weekly

Slide 25

On microservices performance, we show an improvement in throughput per core (under a latency SLA of p99 <30ms) of:

up to 24% comparing Ice Lake 3rd Gen Xeon to 2nd Gen Xeon;

up to 69% comparing Next Gen Xeon (Sapphire Rapids) to 2nd Gen Xeon.

Workloads: DeathStarBench 'hotelReservation', 'socialNetwork' ( https://github.com/delimitrou/DeathStarBench) and Google Microservices demo ( https://github.com/GoogleCloudPlatform/microservices-demo)

OS: Ubuntu 20.04 with kernel version v5.10, Kubernetes v1.21.0; Testing as of July 2021.

2nd Gen Xeon Measurements on 3-node Kubernetes setup on AWS M5.metal instances (2S 24 core 8259CL with 384GB DDR4 RAM and 25Gbps network) in us-west-2b.

3rd Gen Xeon (codenamed Ice Lake) Measurements on 3-node 2S 32 core, 2.5GHz, 300W TDP SKU with 512GB DDR4 RAM and 40Gbps network.
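
The metric above, throughput per core under a p99 latency SLA, can be sketched as follows. This is a minimal illustration and not the harness used for the published numbers: the function names, the nearest-rank percentile method, and every input value are assumptions made up for the example.

```python
def p99_latency_ms(latencies_ms):
    """Nearest-rank 99th percentile of request latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one latency sample")
    rank = (99 * n + 99) // 100  # integer ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def throughput_per_core(requests_per_s, cores, latencies_ms, sla_ms=30.0):
    """Requests/s per core, or None when the p99 latency misses the SLA."""
    if p99_latency_ms(latencies_ms) >= sla_ms:
        return None
    return requests_per_s / cores


def improvement_pct(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# Made-up inputs, for illustration only (not measured data).
ok_latencies = [5.0] * 99 + [20.0]
slow_latencies = [5.0] * 90 + [40.0] * 10

baseline = throughput_per_core(1000.0, 10, ok_latencies)
candidate = throughput_per_core(1500.0, 12, ok_latencies)
rejected = throughput_per_core(1500.0, 12, slow_latencies)

print(f"improvement: {improvement_pct(candidate, baseline):.1f}%")
print("SLA violated:", rejected is None)
```

A run that misses the SLA yields no throughput figure at all, which is why the comparison is only meaningful at load levels where both systems keep p99 below 30 ms.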

CLD005

Arijit Biswas

Slide 15 CLDTI002

Kamhout/Weekly, Slide 16

With the Zlib L9 compression algorithm, we see up to a 50x drop in CPU utilization (i.e., a 98% decrease in expected core utilization)

Results have been estimated or simulated as of July 2021. Sapphire Rapids estimation based on architecture models scaling of baseline measurements taken on 3rd Gen Xeon.

Baseline testing with 3rd Gen Xeon and Intel QAT: Platform: Ice Lake XCC, SKU: 8380, Cores: 40, Freq: 2.3 GHz, TDP: 270W, LLC: 60MB, Board: Coyote Pass, RAM: 16 x 32GB DDR4 3200, Hynix HMA84GR7CJR4N-XN, BIOS: SE5C6200.86B.3021.D40.2103160200, Microcode: 0x8d05a260 (03/16)

OS: Ubuntu 20.04.2, Kernel: 5.4.0-65-generic, GCC: 9.3.0, yasm: 1.3.0, nasm: 2.14.02, ISA-L: 2.3, ISA-L Crypto: 2.23, OpenSSL: 1.1.1i, zlib: 1.2.11, lzbench: 1.7.3
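
The equivalence quoted for the Zlib L9 claim above (a 50x drop in CPU utilization, i.e., a 98% decrease) follows from decrease = (1 - 1/N) x 100 for an N-times reduction. A small sketch of the conversion in both directions; the function names are hypothetical and only the arithmetic comes from the claim text.

```python
def reduction_pct_from_factor(factor: float) -> float:
    """Percent decrease implied by an 'N times lower' figure: (1 - 1/N) * 100."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0


def factor_from_reduction_pct(pct: float) -> float:
    """Inverse: a pct% decrease corresponds to a 1 / (1 - pct/100) times drop."""
    if not 0 <= pct < 100:
        raise ValueError("pct must be in [0, 100)")
    return 1.0 / (1.0 - pct / 100.0)


print(f"50x lower utilization = {reduction_pct_from_factor(50):.1f}% decrease")
print(f"98% decrease = about {factor_from_reduction_pct(98):.0f}x lower")
```

The same conversion shows why large factors compress into a narrow percentage range: a 10x drop is already a 90% decrease.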

CLDTI002, Kamhout/Weekly Verbal at Slide 14

Crypto Acceleration Example: up to 7x for NGINX TLS Webserver

NGINX 1.20.1 measured by Intel, Sept 13, 2021 using 3rd Gen Xeon Scalable Processor (Ice Lake) based n2-standard-1 and -16 16 (us-central1-a region) instances comparing crypto performance with and without qatengine using IFMA crypto instructions. Measured with TLS version 1.2, ECDH Curves used: secp384r1, Cipher: ECDHE-RSA-AES128-GCM-SHA256

CLITI001

Chris Kelly

Showcase Demo

No Claims

Platform:	ADL Whitebox
Processor:	Intel® Core™ i9-12900K Processor (16C/24T)
Memory:	2x16GB DDR5 4800 MHz SDRAM
Storage:	Samsung SSD 980 PRO (1 TB)
Display Resolution:	1920 x 1080 display
OS:	Microsoft Windows 11 Pro, Build 22000.120
Graphics:	RTX 3090

CLITI001

Chris Kelly

The hybrid architecture combines both Performance cores (P-cores) and Efficiency cores (E-cores) and delivers higher multi-thread performance for the same power - so it's a much more efficient approach

Estimated based on pre-production Intel internal Alder Lake validation platform (with 8 P-cores + 8 E-cores) running Spec Int 544.nab_r on P-core vs. on E-core. As of July 2021. Charts shown for illustrative purposes only and not to scale. Estimated based on pre-production Intel internal Alder Lake validation platform running Spec Int 544.nab_r on 8 P-Core/8 E-core platform configured to run 4 P-core, 2 P-core and 8 E-core. As of July 2021. Charts shown for illustrative purposes only and not to scale.

CLITI001

Chris Kelly

Intel Thread Director helps the OS keep low scalability threads from consuming precious power resources.

Intel® Thread Director requires 12th gen Intel® Core™ performance hybrid architecture and OS enablement. Available features and functionality will vary by OS.

CLITI001

Chris Kelly

Intel Thread Director Demo

Platform:	ADL Whitebox
Processor:	Intel® Core™ i9-12900K Processor (16C/24T)
Memory:	2x16GB DDR5 4800 MHz SDRAM
Storage:	Samsung SSD 980 PRO (1 TB)
Display Resolution:	1920 x 1080 display
OS:	Microsoft Windows 11 Pro, Build 22000.120
Graphics:	RTX 3090

CLITI001

Chris Kelly

Intel® Evo™ - which delivers the best overall laptop experience to consumers

As measured by Intel® Evo™ platform-based laptops powered by 11th Gen Intel Core i7 processors, the best processors for thin & light devices, as measured by systems with 11th Gen Intel® Core™ i7-1185G7 on industry benchmarks, Representative Usage Guide testing, and unique features, including as compared with AMD Ryzen 7 4800U. For details on why the Intel® Core™ i7-1185G7 processor is the world's best processor for productivity, creation, gaming, collaboration and entertainment on a thin and light laptop, see here .

Intel's comprehensive laptop innovation program Project Athena ensures all designs with the Intel Evo brand have been tested, measured and verified against a premium specification and key experience indicators. Testing results as of August 2020, and do not guarantee individual laptop performance. Power and performance vary by use, configuration and other factors.

CLITI001

Chris Kelly

Intel Bridge Technology

Phone: Model: Samsung Galaxy S21 5G; Model Number: SM-G991N; Android version 11; Build RP1A.200720.012.G991NKSU3AUF6.

Chromebook: Acer Spin 713; ChromeOS version: 93.0.4577.95 (Official Build) (64-bit); CPU: 11th Gen Intel Core i5-1135G7.

CLITI001

Chris Kelly

Our work with partners in the W3C and through development of Web standards is continually enriching the Web platform experience and exposing our new hardware platform capabilities.

More about the Worldwide Web Consortium (W3C) and the standards mentioned can be found here

E5GTI001

Time-Series Inference Performance

Algo for both cases:

A standard 80/20 train and test split

The training dataframe was run through the standard RandomForestClassifier with nTrees=500 in both cases

Note: For Intel, a standard pandas dataframe is used. For Nvidia, cuDF (the Nvidia dataframe) is used.

For each test size (1, 10, 100, 200 and 500 points), the head of the dataframe was used; the inference function was run 3 times and the average time taken, in seconds, was reported
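
The measurement procedure described above (take the head of the data at each test size, run inference 3 times, report the mean time in seconds) can be sketched as below. This is an assumption-laden illustration, not the original benchmark script: `time_inference` and `dummy_predict` are hypothetical names, a plain list of rows stands in for the dataframe, and the dummy callable stands in for the trained random forest.

```python
import time
from statistics import mean

TEST_SIZES = (1, 10, 100, 200, 500)  # number of points taken from the head
REPEATS = 3                          # runs averaged per test size


def time_inference(predict, rows, sizes=TEST_SIZES, repeats=REPEATS):
    """Return {size: mean seconds} for predict(rows[:size]) over `repeats` runs."""
    results = {}
    for size in sizes:
        batch = rows[:size]  # stands in for dataframe.head(size)
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict(batch)
            timings.append(time.perf_counter() - start)
        results[size] = mean(timings)
    return results


def dummy_predict(batch):
    """Hypothetical stand-in model: any callable that accepts a batch of rows."""
    return [sum(row) > 0 for row in batch]


rows = [[float(i), float(-i)] for i in range(1000)]
averages = time_inference(dummy_predict, rows)
for size, seconds in averages.items():
    print(f"{size:>4} points: {seconds:.6f} s")
```

With only 3 repeats, the mean is sensitive to one-off effects such as first-call warm-up, so a real comparison would normally note whether a warm-up run was discarded.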

	Intel	Nvidia
OS	Ubuntu 18.04	Ubuntu 18.04
Python Environment	Installed Intel Python distribution through public conda 'intel' channel using the Intel sklearn extension libraries	Conda stable release from Nvidia 'rapidsai' channel.
Version of software	python 3.7.9 and scikit-learn-intelex 2021.2.2	Python 3.7, cudatoolkit=10.1, rapids 0.19

Xeon® SP System Configuration

CPU	Product	Intel® Xeon™ SP-6242
Frequency	2.8GHz
Cores/Threads	16 Cores/32 Threads
Cache (MB)	22
Graphics	Frequency
Graphics	NVIDIA T4

Memory
Type	DDR4 DIMM @3200 MHz
Size (GB)	16x32

NOTES:

ICPS = Inference Cycles per Second

Intel® Edge Insights for Industrial v2.3

Algorithm: Random Forest

Software Environment: Ubuntu 18.04

Intel Distribution: Intel Python distribution through public Conda 'intel' channel

NVIDIA Distribution: Conda stable release from NVIDIA 'rapidsai' channel

Dataset = Bosch Manufacturing Data dataset.shape = (406879, 102) size: 230MB (all preprocessing completed, ready for modelling).

Names and Brands may be property of others. Date of testing May 2021. See backup for workloads and configurations. Results may vary.

GAMTI001

Roger Chandler

We have doubled performance year over year, 2 years in a row. First with Gen 9 to Gen 11, and then with Gen 11 to Xe LP.

4x graphics improvement based on in-game benchmark testing of Hitman 2. Testing as of November 2020 at 1080p and low settings. Systems tested: 

Core i7-8565U (PL1=15W), measured on HP Spectre x360, 512 GB Intel SSD 660p Series, 2 x 8 GB DDR4-2400, Windows 10 Home 19042.63, Intel UHD 620 27.20.100.8935, F.34. 

Core i7-1065G7 (PL1=25W), measured on Razer Blade Stealth 13 (2019), 256GB Samsung MZVLB256HAHQ (PM981), 2 x 8 GB LPDDR4X-3733, Windows 10 Home 19042.63, Intel Iris Plus 27.20.100.8935, 1.02. 

Core i7-1185G7 (B0) (PL1=28W), measured on Intel Reference Platform, Samsung MZVLB512HBJQ, 2 x 8 GB LPDDR4X-4266, Windows 10 Pro 19042.63, Intel Iris Xe 27.20.100.8935, 92A. 
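
The link between "doubled performance year over year, 2 years in a row" and the 4x figure in the details is simple compounding: successive gains multiply rather than add. A minimal sketch, with a hypothetical helper name and the two 2x steps taken from the claim text:

```python
import math


def compound_gain(step_gains):
    """Overall speedup from a sequence of generation-over-generation gains."""
    return math.prod(step_gains)


# Gen 9 -> Gen 11 (2x), then Gen 11 -> Xe LP (2x), as stated in the claim.
total = compound_gain([2.0, 2.0])
print(f"cumulative gain: {total:.0f}x")
```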

Demonstrations

Showcase Demo - Game, Stream, Record & Content Creation with Multi-Tasking	Up to 19% more FPS on Mount and Blade II: Bannerlord: As measured by Mount and Blade II: Bannerlord on 12th Gen Intel® Core™ i9-12900K vs. 11th Gen Intel® Core™ i9-11900K	Game, Stream and Record Demo: Open Broadcaster Software (OBS) for Game + Stream Workload - OBS Version 27.1.2. Encoding and streaming using 1920x1080 Resolution, 60FPS at 6000kbps encoding with Medium encoding preset. Audio codec at default values. Mount and Blade II: Bannerlord using in-game benchmark on "High" graphical preset. Performance results are based on testing by Intel as of October 1st, 2021 and may not reflect all publicly available updates. CPU: Intel® Core™ i9-12900K Processor, Mainboard: MSI MPG Z690 CARBON WIFI (MS-7D30), Memory: 32GB DDR5-4400 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3090 Founders Edition, Storage SSD: Samsung SSD 980 PRO 1TB, OS Version: Microsoft Windows 11 Professional (x64) Build 22000.194, OBS Version: 27.0.1, Mount and Blade II: Bannerlord e1.6.2.284832. CPU: Intel® Core™ i9-11900K Processor, Mainboard: MSI MPG Z590 CARBON WIFI (MS-7D06), Memory: 32GB DDR4-3200 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3090 Founders Edition, Storage SSD: Samsung SSD 980 PRO 1TB, OS Version: Microsoft Windows 11 Professional (x64) Build 22000.194, OBS Version: 27.0.1, Mount and Blade II: Bannerlord e1.6.2.284832.
Showcase Demo - Game, Stream, Record & Content Creation with Multi-Tasking	Up to 84% more FPS on Mount and Blade II: Bannerlord with a 12th Gen Intel® Core™ i9-12900K processor while Gaming, Streaming and Recording using OBS	Game, Stream and Record Demo: Open Broadcaster Software (OBS) for Game + Stream Workload - OBS Version 27.1.2. Encoding and streaming using 1920x1080 Resolution, 60FPS at 6000kbps encoding with Medium encoding preset. Audio codec at default values. Mount and Blade II: Bannerlord using in-game benchmark on "High" graphical preset. Performance results are based on testing by Intel as of October 1st, 2021 and may not reflect all publicly available updates. CPU: Intel® Core™ i9-12900K Processor, Mainboard: MSI MPG Z690 CARBON WIFI (MS-7D30), Memory: 32GB DDR5-4400 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3090 Founders Edition, Storage SSD: Samsung SSD 980 PRO 1TB, OS Version: Microsoft Windows 11 Professional (x64) Build 22000.194, OBS Version: 27.0.1, Mount and Blade II: Bannerlord e1.6.2.284832. CPU: Intel® Core™ i9-11900K Processor, Mainboard: MSI MPG Z590 CARBON WIFI (MS-7D06), Memory: 32GB DDR4-3200 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3090 Founders Edition, Storage SSD: Samsung SSD 980 PRO 1TB, OS Version: Microsoft Windows 11 Professional (x64) Build 22000.194, OBS Version: 27.0.1, Mount and Blade II: Bannerlord e1.6.2.284832.
Showcase Demo - Game, Stream, Record & Content Creation with Multi-Tasking	Up to 29% faster sequential content creation performance: As measured by sequential Multitasking Content creation workflow on 12th Gen Intel® Core™ i9-12900K vs. 11th Gen Intel® Core™ i9-11900K	Content Creation Demo: There are (2) workloads that run simultaneously. Both are measured. The first workload is in Adobe Premiere Pro. This workload measures the time it takes Adobe Premiere Pro to export an 11:19 sequence of AVC clips with (2) audio tracks and Lumetri color effects applied. The video is exported with Hardware Acceleration. The second workload is in Adobe Lightroom Classic. This workload measures the time it takes Adobe Lightroom Classic to import 100 images to a catalog with a preset applied during import, and then export the files as JPEG at 60% quality. The video source is a collection of 4K (3840X2160), H.264, 29.97 FPS, MP4 footage, recorded at a bitrate of approximately 90 Mb/s. The sequence includes .MP3 audio. The video clips have Lumetri color effects applied. The Sequence is 11 minutes and 19 seconds in length. The preset settings for export are based on the Premiere Preset named 'YouTube 2160p 4K Ultra HD', but has 'Time Interpolation' set to 'Frame Blending'. Performance results are based on testing by Intel as of October 1st, 2021 and may not reflect all publicly available updates. CPU: Intel® Core™ i9-12900K Processor, Mainboard: Gigabyte Aorus Z690 Sabre Master, Memory: 64GB DDR5-4400 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3080, Storage SSD: PCIe Gen4 NVMe SSD, OS Version: Microsoft Windows 11 Professional (x64), Adobe Lightroom Classic ver 10.4, Adobe Premiere Pro ver 15.4.1. CPU: Intel® Core™ i9-11900K Processor, Mainboard: ROG Maximus XIII Hero, Memory: 64GB DDR4-3200 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3080, Storage SSD: PCIe Gen4 NVMe SSD, OS Version: Microsoft Windows 11 Professional (x64), Adobe Lightroom Classic ver 10.4, Adobe Premiere Pro ver 15.4.1.
Showcase Demo - Game, Stream, Record & Content Creation with Multi-Tasking	Up to 47% faster concurrent content creation performance: As measured by Multitasking Content creation workflow on 12th Gen Intel® Core™ i9-12900K vs. 11th Gen Intel® Core™ i9-11900K	Content Creation Demo: There are (2) workloads that run simultaneously. Both are measured. The first workload is in Adobe Premiere Pro. This workload measures the time it takes Adobe Premiere Pro to export an 11:19 sequence of AVC clips with (2) audio tracks and Lumetri color effects applied. The video is exported with Hardware Acceleration. The second workload is in Adobe Lightroom Classic. This workload measures the time it takes Adobe Lightroom Classic to import 100 images to a catalog with a preset applied during import, and then export the files as JPEG at 60% quality. The video source is a collection of 4K (3840X2160), H.264, 29.97 FPS, MP4 footage, recorded at a bitrate of approximately 90 Mb/s. The sequence includes .MP3 audio. The video clips have Lumetri color effects applied. The Sequence is 11 minutes and 19 seconds in length. The preset settings for export are based on the Premiere Preset named 'YouTube 2160p 4K Ultra HD', but has 'Time Interpolation' set to 'Frame Blending'. CPU: Intel® Core™ i9-12900K Processor, Mainboard: Gigabyte Aorus Z690 Sabre Master, Memory: 64GB DDR5-4400 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3080, Storage SSD: PCIe Gen4 NVMe SSD, OS Version: Microsoft Windows 11 Professional (x64), Adobe Lightroom Classic ver 10.4, Adobe Premiere Pro ver 15.4.1. CPU: Intel® Core™ i9-11900K Processor, Mainboard: ROG Maximus XIII Hero, Memory: 64GB DDR4-3200 DDR SDRAM, Graphics Driver Version: 472.12, Graphics Card: NVIDIA GeForce RTX3080, Storage SSD: PCIe Gen4 NVMe SSD, OS Version: Microsoft Windows 11 Professional (x64), Adobe Lightroom Classic ver 10.4, Adobe Premiere Pro ver 15.4.1.
Showcase Demo - Overclocking 12th Gen Intel® Core™ Processors with Intel® XTU	"Best overclocking experience" Based on enhanced overclocking ability enabled by Intel's comprehensive tools and unique architectural tuning capabilities. Your results may vary. Overclocking may void warranty or affect system health. For details see intel.com/overclocking	Discover new features in the Intel® Extreme Tuning Utility (Intel® XTU), which enables novice and experienced enthusiasts to overclock, monitor, and stress a system. XTU provides the ultimate tuning and control of our new 12th Gen desktop processors. CPU: 12th Gen Intel® Core ™ i9-12900KF Processor, Mainboard: Asus Strix-E Z690 Motherboard, Memory: 32GB (2x16GB) DDR5-6000MHz XMP3.0, Graphics Driver Version: 471.68, Graphics Card: EVGA NVIDIA GeForce RTX3090 FTW3, Storage SSD: Samsung 980 Pro 1TB PCIe Gen4 NVMe SSD , OS Version: Windows 11 Pro Version 22000.194 Performance results are based on testing by Intel as of October 1st, 2021 and may not reflect all publicly available updates.
Python Data Science Demo	Taxi Demo portion	For the Python Data Science NYC Taxi demo, please also note that we made 3 separate attempts to get the Nvidia RAPIDS notebook shown in the demo video working, each of which resulted in errors. The attempts were: 1) On AWS (not shown in the demo, only mentioned), on a p3.16xlarge Ubuntu 20.04 instance where CUDA (Nvidia video driver and other system-related libraries) is preinstalled by Amazon. We followed the instructions from the CUDF repo to install Anaconda packages for revision 11.2. The notebook failed to execute a complex "query" statement with an "unknown error". Because the error message was vague, further attempts to resolve it were unsuccessful. 2) We attempted to install the nightly build (not shown in the demo) using the same README instructions as in attempt 1, but could not even start the dask_cuda cluster because of another unexplained error. Please note that installing an environment for the nightly build is much harder than for the stable version because the RAPIDS notebook has many dependencies, and the nightly channel caused many version conflicts. 3) On a Lenovo workstation (shown in the demo video), we followed instructions from an Nvidia webpage to install the video driver and CUDA libraries for Ubuntu 20.04 and then used the same instructions from the CUDF repo page to install Anaconda packages for CUDF 11.2. This time the "query" statement executed successfully, but execution failed on XGBoost training (which we display in the demo). Again, the error description does not explain what went wrong or what could be changed to avoid it. We did not try a nightly installation on this system after the experience with attempt 2, because the instability makes it close to impossible to compose an Anaconda environment with all required packages.
DEMO CODE-TBC Title Optimized for Microservices Performance	POC webserver load balancing on cores in software and using Dynamic Load Balancer (DLB). Claim: 1. Latency reduction of up to 22-42% 2. Cycle utilization reduced by up to 30-60%	Hardware: 1-node, 2x Intel Xeon Scalable processor (pre-production Sapphire Rapids) on Intel pre-production platform with 256 GB DDR5 memory, 0x8c0003b0, HT ON, Turbo ON, 2 x P4510 2TB, CentOS 8.4, 5.12.0-0507.intel_next, test by Intel on 09/24/2021. Baseline: Nginx 1.18.0. New1: Internal pre-production webserver (similar to NGINX at a smaller scale) with software load balancing. New2: Internal pre-production webserver (similar to NGINX at a smaller scale) with Dynamic Load Balancer HW accelerator

Sessions

AI001, Meena Arunachalam, #13

MacBook Pro outperforms a Spark cluster with m3.xlarge instances on AWS by up to 1.7x, and a dual-socket single-node Cascade Lake system outperforms the Spark cluster by up to 15.7x.

OmniSci Tested

2nd Gen Intel Xeon Gold 6226R CPU: 2x 2nd Gen Intel Xeon Gold 6226R with 384GB (12 slots/ 32GB/ 2933MHz) total DDR4 memory, microcode 0x5003003, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-65-generic kernel, 1x INTEL SSDSC2KG960G8, Python 3.7.9, Modin 0.8.3, Omniscidbe v5.4.1, scikit-learn v0.24.1 accelerated by daal4py v2021.2, tested by OmniSci on 03/15/2021.

MacBook Pro: 9th Gen i9-9880H CPU with 64GB (2666MHz) total DDR4 memory, HT on, Turbo on, Mac OS X ver 11, 1x4TB SSD, Python 3.7.9, Omnisci DB v5.4.1, tested by OmniSci on 03/15/2021.

AI001, Meena Arunachalam, #14

MacBook Pro outperforms a Spark cluster with m3.xlarge instances on AWS by up to 1.7x, and a dual-socket single-node Cascade Lake system outperforms the Spark cluster by up to 15.7x.

Optane-powered Xeon can process the 1.2 Bn NYC Taxi benchmark data in memory.

OmniSci Tested

2nd Gen Intel Xeon Gold 6226R CPU : 2x 2nd Gen Intel Xeon Gold 6226R with 384GB (12 slots/ 32GB/ 2933MHz) total DDR4 memory, microcode 0x5003003, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-65-generic kernel, 1x INTEL SSDSC2KG960G8, Python 3.7.9, Modin 0.8.3, Omniscidbe v5.4.1, scikit-learn v0.24.1 accelerated by daal4py v2021.2, tested by OmniSci on 03/15/2021.

MacBook Pro: 9th Gen i9-9880H CPU with 64GB (2666MHz) total DDR4 memory, HT on, Turbo on, Mac OS X ver 11, 1x4TB SSD, Python 3.7.9, Omnisci DB v5.4.1, tested by OmniSci on 03/15/2021.

2nd Gen Intel Xeon Platinum 8276L CPU: 2x 2nd Gen Intel Xeon Platinum 8276L with 384GB (12 slots/ 32GB/ 2666MHz) total DDR4 memory and 4TB 1st generation Optane memory (2666MHz), microcode 0x5003003, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-65-generic kernel, 1x INTEL SSDSC2KG960G8, Python 3.7.9, Modin 0.8.3, Omniscidbe v5.4.1, scikit-learn v0.24.1 accelerated by daal4py v2021.2, tested by Intel on 03/15/2021.

AI001, Meena Arunachalam, #24

On Census training+inference, Xeon 8380 >5x faster on ML vs DGX A100 (utilizing 1xA100).

3rd Gen Intel Xeon Platinum 8380 CPU: 2x 3rd Gen Intel Xeon Platinum 8380 with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0x8d055260, HT on, Turbo on, Ubuntu 20.04.2 LTS, 5.4.0-65-generic kernel, 1x INTEL SSDSC2KG960G8, Python 3.7.9, Modin 0.8.3, Omniscidbe v5.4.1, scikit-learn v0.24.1 accelerated by daal4py v2021.2, test by Intel on 03/15/2021.

Nvidia Ampere A100 GPU : Nvidia Ampere A100 GPU hosted on 2x AMD EPYC 7742 CPU with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0x8301034, HT on, Turbo on, Ubuntu 18.04.5 LTS, 5.4.0-42-generic kernel, 1x SAMSUNG 3.5TB SSD, Python 3.7.9, RAPIDS0.17, cuDF 0.17, cuML 0.17, scikit-learn v0.24.1, CUDA 11.0.221, test by Intel on 02/04/2021. Census Data [21721922, 45]: Dataset is from IPUMS USA, University of Minnesota, www.ipums.org [Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0]

AI001, Meena Arunachalam, #25

On DL inference for the E2E document-level sentiment analysis (DLSA) workload with the Huggingface SST dataset, Nvidia A100 is 2x faster than Xeon 8380.

3rd Gen Intel Xeon Platinum 8380 CPU: 2x 3rd Gen Intel Xeon Platinum 8380 with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0xd0002b1, HT off, Turbo on, Ubuntu 20.04 LTS, 5.4.0-84-generic kernel, 1x Intel 960GB SSD, Intel® Extension for PyTorch v1.8.1, Transformers 4.6.1, MKL 2021.3.0, Bert-large-uncased (https://huggingface.co/bert-large-uncased) model, BS=1 per instance, 20 instances/node, 4 cores/instance, test by Intel on 09/17/2021.

Nvidia Ampere A100 GPU: Nvidia Ampere A100 GPU hosted on 2x AMD EPYC 7742 CPU with 1024GB (16 slots/ 64GB/ 3200MHz) total DDR4 memory, microcode 0x8301034, HT off, Turbo on, Ubuntu 20.04 LTS, 5.4.0-80-generic kernel, 1x SAMSUNG 3.5TB SSD, PyTorch 1.8.1, Transformers 4.6.1, CUDA 11.1, Bert-large-uncased (https://huggingface.co/bert-large-uncased) model, BS=1 per instance, 7 total instances with MIG enabled, test by Intel on 09/22/2021

AI001, Meena Arunachalam, #24

End-to-End Census with all phases, Xeon 8380 is 7% slower vs DGX A100 (utilizing 1xA100).

Nvidia Ampere A100 GPU: Nvidia Ampere A100 GPU hosted on 2x AMD EPYC 7742 CPU with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0x8301034, HT on, Turbo on, Ubuntu 18.04.5 LTS, 5.4.0-42-generic kernel, 1x SAMSUNG 3.5TB SSD, Python 3.7.9, RAPIDS0.17, cuDF 0.17, cuML 0.17, scikit-learn v0.24.1, CUDA 11.0.221, test by Intel on 02/04/2021. Census Data [21721922, 45]: Dataset is from IPUMS USA, University of Minnesota, www.ipums.org [Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0]

AI001, Meena Arunachalam, #25

End-to-End DLSA with all phases Xeon 8380 is 13% faster than DGX A100 (utilizing 1xA100).

3rd Gen Intel Xeon Platinum 8380 CPU : 2x 3rd Gen Intel Xeon Platinum 8380 with 512GB (16 slots/ 32GB/ 3200MHz) total DDR4 memory, microcode 0xd0002b1, HT off, Turbo on, Ubuntu 20.04 LTS, 5.4.0-84-generic kernel, 1x Intel 960GB SSD, Intel® Extension for PyTorch v1.8.1, Transformers 4.6.1, MKL 2021.3.0, Bert-large-uncased (https://huggingface.co/bert-large-uncased) model, BS=1 per instance, 20 instances/node, 4 cores/instance, test by Intel on 09/17/2021.

AI001, Meena Arunachalam, #30

DLSA multi-instance configuration (10 instances per socket gives optimal performance).
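To make the multi-instance layout concrete, the following is a minimal sketch of how 20 instances of 4 cores each could be laid out across a 2-socket node, which yields 10 instances per socket. It assumes 40 physical cores per socket and contiguous core numbering within each socket; the function name `instance_core_ranges` is hypothetical and is not part of any tool named in this document.

```python
def instance_core_ranges(sockets, cores_per_socket, cores_per_instance):
    """Return (socket, first_core, last_core) for each instance.

    Assumes cores are numbered contiguously per socket, so that no
    instance spans a socket boundary.
    """
    ranges = []
    for socket in range(sockets):
        base = socket * cores_per_socket
        last_start = base + cores_per_socket - cores_per_instance
        for start in range(base, last_start + 1, cores_per_instance):
            ranges.append((socket, start, start + cores_per_instance - 1))
    return ranges


# Assumed topology: 2 sockets x 40 cores, 4 cores per instance.
layout = instance_core_ranges(sockets=2, cores_per_socket=40, cores_per_instance=4)
per_socket = sum(1 for socket, _, _ in layout if socket == 0)
print(len(layout), "instances,", per_socket, "per socket")
```

Each tuple could then be handed to whatever core-pinning mechanism the deployment uses; that step is outside the scope of this sketch.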

AI001, Meena Arunachalam, #31; also in AI Productivity and Performance Demo

The Payoff: Higher Performance/$

Disclaimer (for pricing only): System pricing is based on an average of comparable configurations as the test systems as priced on www.colfax-intl.com and www.thinkmate.com on September 20, 2021.  4U rackmount systems used for 3rd Gen Intel® Xeon® Scalable 8380 processors:  Thinkmate GPX XN6-24S3-10GPU and Colfax CX41060s-XK8.  4U rackmount servers used for AMD EPYC 7742 with Nvidia A100 GPU:  Thinkmate GPX QT24-24E2-8GPU and Colfax CX4860s-EK8.  See www.colfax-intl.com and www.thinkmate.com for more details.

AI001, Meena Arunachalam, #21

Lower processing time for 3TB dataset of TPC-DS benchmark

Baseline: Azure-US-East-2, Standard_E16s_v3, 16 vCPUs, 10 instances, Platinum 8171M @ 2.60GHz, 128GB RAM Memory/Instance, 256GB Storage/Instance (NW or Direct Attached), NW bandwidth/Instance 32000 (IOPS)/256 (MBps)/400(Cache), 8000 Storage BW/Instance New: Azure-US-East-2, Standard_E16s_v4, 16vCPUs, 10 instances, Platinum 8272CL @ 2.60GHz , 128GB RAM Memory/Instance, 600GB Storage/Instance (NW or Direct Attached), NW bandwidth/Instance 154000 (IOPS)/968 (MBps)/400(Cache), 8000 Storage BW/Instance; 18.04.1-Ubuntu, 5.4.0-1046-azure, Databricks 7.5 (Includes Apache Spark 3.0.1, Scala 2.12), Tested by Intel on 5-July-2021

New: Azure-US-East-2, Standard_E8s_v4, 8vCPUs, 20 instances, Platinum 8272CL @ 2.60GHz , 256GB RAM Memory/Instance, 1200GB Storage/Instance (NW or Direct Attached), NW bandwidth/Instance 308000 (IOPS)/1936 (MBps)/800(Cache), 16000 Storage BW/Instance; 18.04.1-Ubuntu, 5.4.0-1046-azure, Databricks 7.5 (Includes Apache Spark 3.0.1, Scala 2.12), Tested by Intel on 5-July-2021

AI001, Meena Arunachalam, #21

Lower processing time for 10TB dataset of TPC-DS benchmark

AI001, Meena Arunachalam, #22

Processing times speedup with Intel optimized scikit-learn.

Processing Times Speedup with Intel Optimized Scikit-learn: Azure-US-West, Standard_F16s_V2, 16 vCPUs, 1 instance, Platinum 8168 @ 2.70 GHz / Platinum 8272CL @ 2.60 GHz, 32GB Memory Capacity/Instance, Direct Attached Storage, Ubuntu 18.04.5 LTS, 5.4.0-1051-azure, Databricks 9.0 ML Runtime, Stock scikit-learn-0.22.1 vs Intel scikit-learn-0.24.2, Tested by Intel on 23-September-2021

AI001, Meena Arunachalam, #22

Processing times speedup with Intel optimized TensorFlow/BERT-large.

Baseline: Processing Times Speedup: Intel Optimized TensorFlow/BERT-large: Azure-US-West, Standard_F32s_V2, 32 vCPUs, 1 instance, Platinum 8168 @ 2.70 GHz / Platinum 8272CL @ 2.60 GHz, 64GB Memory Capacity/Instance, Direct Attached Storage, Ubuntu 18.04.5 LTS, 5.4.0-1051-azure, Databricks 9.0 ML Runtime, Stock TensorFlow 2.3.1 vs Intel TensorFlow 2.3.0, Tested by Intel on 23-September-2021

New: Processing Times Speedup: Intel Optimized TensorFlow/BERT-large: Azure-US-West, Standard_F64s_V2, 64 vCPUs, 1 instance, Platinum 8168 @ 2.70 GHz / Platinum 8272CL @ 2.60 GHz, 128GB Memory Capacity/Instance, Direct Attached Storage, Ubuntu 18.04.5 LTS, 5.4.0-1051-azure, Databricks 9.0 ML Runtime, Stock TensorFlow 2.3.1 vs Intel TensorFlow 2.3.0 New: Processing Times Speedup: Intel Optimized TensorFlow/BERT-large: Azure-US-West, Standard_F72s_V2, 72 vCPUs, 1 instance, Platinum 8168 @ 2.70 GHz / Platinum 8272CL @ 2.60 GHz, 144GB Memory Capacity/Instance, Direct Attached Storage, Ubuntu 18.04.5 LTS, 5.4.0-1051-azure, Databricks 9.0 ML Runtime, Stock TensorFlow 2.3.1 vs Intel TensorFlow 2.3.0, Tested on 23-September-2021

AI002, Rachel Oberman, #9

Up to 100x faster performance than stock scikit-Learn

Up to 1.55x faster DLRM training (FP32 vs BF16) and up to 2.8x faster DLRM inference (FP32 vs Int8) with PyTorch Opts

Up to 2.8x faster quantized inference (FP32 v Int8) with TF optimizations and LPOT

SKLearn Fit/Predict: Intel SKLearnEx Up to 10x faster compared to Nvidia, Up to 5x faster compared to AMD CPU

4.5x faster inference compared to Nvidia GPU with XGBoost inference opts

On Census 2020, 38x faster ETL with Modin, 21x faster fit&predict with ML ops in SKlearn

Overall 40% faster performance on census workload with Intel optimizations Modin and SKLearnEx

See all benchmarks and configurations:

https://software.intel.com/content/www/us/en/develop/articles/blazing-fast-python-data-science-ai-performance.html. Each performance claim and configuration data is available in the body of the article listed under sections 1, 2, 3, 4, and 5. Please also visit this page for more details on all scores, and measurements derived.

Testing Date: Performance results are based on testing by Intel as of October 16, 2020 and may not reflect all publicly available updates. Configuration Details and Workload Setup: 2 x Intel® Xeon® Platinum 8280 @ 28 cores, OS: Ubuntu 19.10, 5.3.0-64-generic, Mitigated, 384GB RAM (192 GB RAM (12x 32GB 2933)). SW: Modin 0.81, Scikit-learn 0.22.2, Pandas 1.01, Python 3.8.5, DAL (DAAL4Py) 2020.2. Census Data [21721922, 45]: Dataset is from IPUMS USA, University of Minnesota, www.ipums.org [Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0]

Testing Date: Performance results are based on testing by Intel® as of October 23, 2020 and may not reflect all publicly available updates. Configuration Details and Workload Setup: Intel® oneAPI Data Analytics Library 2021.1 (oneDAL), Scikit-learn 0.23.1, Intel® Distribution for Python 3.8; Intel® Xeon® Platinum 8280L CPU @ 2.70GHz, 2 sockets, 28 cores per socket, 10M samples, 10 features, 100 clusters, 100 iterations, float32.

Testing Date: Performance results are based on testing by Intel® as of October 23, 2020 and may not reflect all publicly available updates. Configuration Details and Workload Setup: Intel® oneAPI AI Analytics Toolkit v2021.1; Intel® oneAPI Data Analytics Library (oneDAL) beta10, Scikit-learn 0.23.1, Intel® Distribution for Python 3.7, Intel® Xeon® Platinum 8280 CPU @ 2.70GHz, 2 sockets, 28 cores per socket, microcode: 0x4003003, total available memory 376 GB, 12X32GB modules, DDR4. AMD Configuration: AMD Rome 7742 @2.25 GHz, 2 sockets, 64 cores per socket, microcode: 0x8301038, total available memory 512 GB, 16X32GB modules, DDR4, oneDAL beta10, Scikit-learn 0.23.1, Intel® Distribution for Python 3.7. NVIDIA Configuration: NVIDIA Tesla V100 - 16 GB, total available memory 376 GB, 12X32GB modules, DDR4, Intel® Xeon® Platinum 8280 CPU @ 2.70GHz, 2 sockets, 28 cores per socket, microcode: 0x5003003, cuDF 0.15, cuML 0.15, CUDA 10.2.89, driver 440.33.01, Operating System: CentOS Linux 7 (Core), Linux 4.19.36 kernel.

Testing Date: Performance results are based on testing by Intel® as of October 13, 2020 and may not reflect all publicly available updates. Configurations details and Workload Setup:

CPU: c5.18xlarge AWS Instance (2 x Intel® Xeon® Platinum 8124M @ 18 cores), OS: Ubuntu 20.04.2 LTS, 193 GB RAM. GPU: p3.2xlarge AWS Instance (GPU: NVIDIA Tesla V100 16GB, 8 vCPUs), OS: Ubuntu 18.04.2 LTS, 61 GB RAM. SW: XGBoost 1.1: build from sources, compiler - G++ 7.4, nvcc 9.1; Intel® DAAL: 2019.4 version; Python env: Python 3.6, Numpy 1.16.4, Pandas 0.25, Scikit-learn 0.21.2.

Testing Date: Performance results are based on testing by Intel® as of October 26, 2020 and may not reflect all publicly available updates. Configuration Details and Workload Setup: Intel® Optimization for Tensorflow v2.2.0; oneDNN v1.2.0; Intel® Low Precision Optimization Tool v1.0; Platform: Intel® Xeon® Platinum 8280 CPU; #Nodes: 1; #Sockets: 2; Cores/socket: 28; Threads/socket: 56; HT: On; Turbo: On; BIOS version: SE5C620.86B.02.01.0010.010620200716; System DDR Mem Config: 12 slots/16GB/2933; OS: CentOS Linux 7.8; Kernel: 4.4.240-1.el7.elrepo x86_64.

Testing Date: Performance results are based on testing by Intel® as of February 3, 2021 and may not reflect all publicly available updates. Configuration Details and Workload Setup: Intel® Optimization for PyTorch v1.5.0; Intel® Extension for PyTorch (IPEX) 1.1.0; oneDNN version: v1.5; DLRM: Training batch size (FP32/BF16): 2K/instance, 1 instance; DLRM dataset (FP32/BF16): Criteo Terabyte Dataset; BERT-Large: Training batch size (FP32/BF16): 24/Instance, 1 Instance on a CPU socket, Dataset (FP32/BF16): WikiText-2 [https://www.salesforce.com/products/einstein/ai-research/the-wiktext-dependency-language-modeling-dataset/]; ResNext101-32x4d: Training batch size (FP32/BF16): 128/Instance, 1 instance on a CPU socket, Dataset (FP32/BF16): ILSVRC2012; DLRM: Inference batch size (INT8): 16/instance, 28 instances, dummy data. Intel® Xeon® Platinum 8380H Processor, 4 socket, 28 cores, HT On, Turbo ON, Total memory 768 GB (24 slots/32GB/3200 MHz), BIOS: WLYDCRBLSYS.0015.P96.2005070242 (ucode: 0x700001b), Ubuntu 20.04 LTS, kernel 5.4.0-29-generic. ResNet50: [https://github.com/Intel/optimized-models/tree/master/pytorch/ResNet50]; ResNext101 32x4d: [https://github.com/intel/optimized-models/tree/master/pytorch/ResNext101_32x4d]; DLRM: [https://github.com/intel/optimized-models/tree/master/pytorch/dlrm].

AI002, Rachel Oberman, Demo (part of talk)

Modin stays <1s interactive even on big data

XGBoost unoptimized 2 hours vs optimized 11 minutes

Sklearn unoptimized 50 seconds vs optimized 7 seconds

Showcase alternative with RAPIDS requiring code changes and unresolved errors (i.e., qualitative, not performance comparisons)

Testing Date: Performance results are based on testing by Intel® as of October 4, 2021 and may not reflect all publicly available updates. Configuration Details and Workload Setup: Hardware (same for all configurations): 1-node, 2x 2nd Gen Intel® Xeon® Gold 6258R on Lenovo 30BC003DUS with 768GB (12 slots/ 64GB/ 2666) total DDR4 memory and 2TB (4 slots/ 512GB/ 2666) DCPMM memory, microcode 0x5003102, HT on, Turbo on, Ubuntu 20.04.3 LTS, 5.10.0-1049-oem, 1x Samsung 1TB SSD OS Drive, 4x Samsung 2TB SSD in RAID0 data drive, 3x NVIDIA Quadro RTX 8000. 3 months of NYCTaxi Data on Stock Software Configuration: Python 3.9.7, Pandas 1.3.3, Scikit-Learn 1.0, XGBoost 0.81, IPython 7.28.0, IPKernel 6.4.1. Full 30 months of NYCTaxi Data on Nvidia RAPIDS Software Configuration: Python 3.7.10, Pandas 1.2.5, XGBoost 1.4.2, cuDF 21.08.03, cudatoolkit 11.2.72, dask-cudf 21.08.03, dask-cuda 21.08.00, IPython 7.28.0, IPKernel 6.4.1. Full 30 months of NYCTaxi Data on Intel Optimized Software Configuration: Python 3.9.7, Pandas 1.3.3, Modin 0.11.0, OmniSci 5.7.0, Scikit-Learn 1.0, Intel® Extension for Scikit-Learn* 2021.3.0, XGBoost 1.4.2, IPython 7.28.0, IPKernel 6.4.1. NYCTaxi Dataset from New York City (nyc.gov): [https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page]

AI003 AG Ramesh Andres Rodriguez #18

30x total performance improvement of SSD-RN34 workload from 3rd Gen Intel Xeon (codenamed Ice Lake) to Next Gen Intel Xeon Scalable (codenamed Sapphire Rapids).

Baseline: 1-node, 2x 3rd Gen Intel Xeon Platinum 8380 with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0x8d9522d4, HT on, Turbo on, Ubuntu 20.04.2 LTS (docker), 5.4.0-77-generic, TensorFlow v2.5.0 w/o oneDNN, TensorFlow v2.6.0 w/ oneDNN, test by Intel on 09/28/2021

New: 1-node, 2x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids, > 40 cores) on Intel pre-production platform with 512 GB DDR memory (8(1DPC)/32GB/4800 MT/s), HT on, Turbo on, CentOS Linux 8.4, internal pre-production bios and software running SSD-ResNet34 BS=1 using TensorFlow 2.6 with intel internal optimization, test by Intel on 09/28/2021

AI004, Susan Lansing, Slide #9

Up to 40% better price performance than latest GPU instances

At launch, AWS EC2 DL1 pricing is published here: https://aws.amazon.com/ec2/pricing/ondemand/

The price performance claim is made by AWS and based on AWS internal performance testing for the Gaudi-based instance and the Nvidia A100 and V100-based instances, expressed as a ratio to AWS EC2 pricing of the respective instances. Net: it expresses how much training performance a customer can achieve for the cost.

Gaudi performance metrics were derived based on: AI processor: 1-card and 8-card configuration of Habana Gaudi HL-205 processors on the AWS custom designed server for the DL1 instance CPU: AWS Custom 2nd Gen Intel® Xeon® Scalable Processors

Computer vision metrics based on ResNet-50 model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras and Habana Test Container: http://vault.habana.ai/gaudi-docker/0.15.0/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-2.5.1:1.0.1-81

A100 / V100 performance based on software build: https://ngc.nvidia.com/catalog/resources/nvidia:resnet_50_v1_5_for_tensorflow/performance Results published on DGXA100 and DGX-1;

Batch size: 256 for all accelerators.

Natural Language Processing metrics based on BERT-Large, Pre-Training, Phase 1 DL1 performance based on Habana BERT model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/nlp/bert Habana Test Container: https://vault.habana.ai/artifactory/gaudi-docker/1.0.1/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-2.5.1/1.0.1-81

A100 / V100 Benchmark Source*: https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_tensorflow/performance

* Results published on DGXA100 and DGX-1

Batch Size: 64 for A100 & Gaudi, 16 for V100.

AI004, Susan Lansing, Slide #11

400 Gbps Networking: Fastest networking bandwidth provided to any EC2 instance to support large clusters with high throughput internode connectivity

Claim by AWS that the networking that is provided to customers on the EC2 DL1 is the highest that any EC2 instance offers.

This information is offered publicly at this link which also addresses pricing: https://aws.amazon.com/ec2/pricing/ondemand/

AI004, Susan Lansing, Slide #11

Amazon FSx for Lustre: High performance, scalable storage delivering sub-millisecond latency and hundreds of GB/s of throughput

Claim by AWS regarding the storage speed and throughput of its EC2 service, FSx for Lustre.

For more information on the service, see https://aws.amazon.com/fsx/lustre/?nc2=type_

AI004, Susan Lansing, Slide #13

Chart contains computer vision/ResNet-50 performance metrics of EC2 instances: Gaudi-based DL1, A100/P4d and V100/p3 for 1-card and 8-card configurations.

Model: ResNet-50

Mixed Precision: BF16/FP16

Framework: TensorFlow

Configurations: 1- and 8-card for all tested

Batch: 256

DL1 metrics based on Habana and AWS internal performance testing based on ResNet50 Model: Habana software: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras

Habana Test Container: http://vault.habana.ai/gaudi-docker/0.15.0/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-2.5.1:1.0.1-81

A100/p4d and V100/p3 Performance Source*: https://ngc.nvidia.com/catalog/resources/nvidia:resnet_50_v1_5_for_tensorflow/performance Results published on DGXA100 and DGX-1

AI004, Susan Lansing, Slide #14

Chart contains NLP performance metrics of EC2 instances: Gaudi-based DL1, A100/P4d and V100/p3 for 1-card and 8-card configurations.

Model: BERT-Large; Pre-Training (Phase 1)

Mixed Precision: BF16/FP16

Batch size: 64 for A100 and Gaudi; 16 for V100.

DL1 metrics based on Habana and AWS performance testing based on BERT model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/nlp/bert

Habana Test Container: https://vault.habana.ai/artifactory/gaudi-docker/1.0.1/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-2.5.1/1.0.1-81

A100 / V100 Benchmark Source*: https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_tensorflow/performance

Results published on DGXA100 and DGX-1

AI004, Susan Lansing, Slide #15

Chart shows the combined factors of EC2 instance pricing and performance on ResNet-50 and BERT models to demonstrate price/performance value (amount of training achievable relative to price of the instance time)

Habana and AWS performance testing based on:

Habana BERT model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/nlp/bert

Habana ResNet50 Model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras

Habana Test Container : https://vault.habana.ai/artifactory/gaudi-docker/1.0.1/ubuntu18.04/habanalabs/tensorflow-installer-tf-cpu-2.5.1/1.0.1-81

A100 / V100 Benchmark Sources*:

https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_tensorflow/performance and

https://ngc.nvidia.com/catalog/resources/nvidia:resnet_50_v1_5_for_tensorflow/performance

* Results published on DGXA100 and DGX-1. Pricing published at https://aws.amazon.com/ec2/pricing/on-demand/

AI004, Susan Lansing, Slide #18

Chart shows details behind price/performance metrics, displaying cost-savings AWS customers can derive from training ResNet-50 on Gaudi/EC2 DL1 relative to price/performance of EC2 training on Nvidia-based instances. Comparisons: -ResNet-50 performance -Hourly on-demand pricing per instance type -Price/performance - millions of images trained per dollar -Performance/$ with A100 40 GB as baseline -Performance/$ compared to A100 40 GB

Assessment is made with:

Model: ResNet-50

Framework: TensorFlow

Mixed precision Configuration: 8-card for all instances

A100 and V100-based instances measured by Habana on AWS EC2 GPU-based instances, on June 28th, using Nvidia Deep Learning AMI (Ubuntu 18.04) + Docker 21.06-tf1-py3 available at: https://ngc.nvidia.com/catalog/containers/nvidia.tensorflow/tags,

Model Trained:https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/ConvNets/resnet50v1.5

Gaudi measured by Habana on AWS EC2 DL1.24xlarge instance, using DLAMI integrating SynapseAI 1.0.1-81 Tensorflow 2.5.1 Container at Habana's Vault, model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/computer_vision/Resnets/resnet_keras

AWS pricing published at: https://aws.amazon.com/ec2/pricing/on-demand/. Results may vary.
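The price/performance comparison described above (images trained per dollar, then expressed relative to a baseline instance) can be sketched as follows. This is an illustration only: the instance names, throughput figures, and hourly prices are hypothetical placeholders and are not the measured or published values behind the chart.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    images_per_second: float  # training throughput (hypothetical value)
    usd_per_hour: float       # on-demand hourly price (hypothetical value)


def images_per_dollar(inst: Instance) -> float:
    """Images trained for each dollar of instance time."""
    return inst.images_per_second * 3600 / inst.usd_per_hour


def relative_value(inst: Instance, baseline: Instance) -> float:
    """Performance/$ of `inst` expressed relative to `baseline`."""
    return images_per_dollar(inst) / images_per_dollar(baseline)


# Hypothetical numbers chosen only to show the arithmetic.
baseline = Instance("instance_a", images_per_second=1000.0, usd_per_hour=10.0)
candidate = Instance("instance_b", images_per_second=700.0, usd_per_hour=5.0)

ratio = relative_value(candidate, baseline)
print(f"{candidate.name} delivers {ratio:.2f}x the images per dollar of {baseline.name}")
```

With these placeholder inputs, a slower but cheaper instance still comes out ahead on a per-dollar basis, which is the kind of relationship a price/performance chart expresses.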

AI004, Susan Lansing, Slide #19

Chart shows details behind price/performance metrics, displaying cost-savings AWS customers can derive from training BERT on Gaudi/EC2 DL1 relative to price/performance of EC2 training on Nvidia-based instances. Comparisons: -ResNet-50 performance -Hourly on-demand pricing per instance type -Price/performance - millions of images trained per dollar -Performance/$ with A100 40 GB as baseline -Performance/$ compared to A100 40 GB

Assessment is made on:

Model: BERT-Large, Pre-training, Phase 1

Framework: TensorFlow

Mixed precision Configuration: 8-card for all instances

A100 and V100 instances measured by Habana on AWS EC2 GPU-based instances, on June 28th, using Nvidia Deep Learning AMI (Ubuntu 18.04) + Docker 21.06-tf1-py3 available at: https://ngc.nvidia.com/catalog/containers/nvidia.tensorflow/tags

Model: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT

Gaudi instances measured by Habana on AWS EC2 DL1.24xlarge instance, using DLAMI integrating SynapseAI 1.0.1-81 Tensorflow 2.5.1 Container at Habana's Vault, model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/nlp/bert. EC2 instance pricing published at: https://aws.amazon.com/ec2/pricing/on-demand/. Results may vary.

AI004, Susan Lansing, Slide #20

MLPerf performance is based on metrics submitted by the respective vendors. Note that the MLPerf metrics for Nvidia A100 were not reported on TensorFlow, but rather on MXNet and PyTorch.

The next chart shows performance for ResNet and BERT based on publicly available software (NGC and Habana containers) for training throughput on TensorFlow.

The measurements are then compared relative to customer pricing of the EC2 instances.

MLPerf time-to-train metric published at https://mlcommons.org/en/training-normal-10/

Performance measured by Habana on Nvidia GPUs using TensorFlow framework based on public software https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_tensorflow/performance;

DL1 performance measured by Habana on AWS EC2 DL1.24xlarge instance, using DLAMI integrating SynapseAI 1.0.1-81 Tensorflow 2.5.1 Container at Habana's Vault.

Model: https://github.com/HabanaAI/Model-References/tree/master/TensorFlow/nlp/bert. EC2 instance pricing published here: https://aws.amazon.com/ec2/pricing/on-demand/

Your measured performance results may vary.

CLD005, Arijit Biswas, #17

CLD001, Giri, #19

CLDTI002, Kamhout/Weekly, #25

On microservices performance, we show an improvement in throughput per core (under a latency SLA of p99 <30ms) of:

24% comparing 3rd Gen Xeon to Second Gen Xeon

69% comparing Next Gen Xeon (Sapphire Rapids) to Second Gen Xeon

Workloads: DeathStarBench 'hotelReservation', 'socialNetwork' (https://github.com/delimitrou/DeathStarBench) and Google Microservices demo (https://github.com/GoogleCloudPlatform/microservices-demo)

OS: Ubuntu 20.04 with kernel version v5.10, Kubernetes v1.21.0; Testing as of July 2021.

Cascade Lake Measurements on 3-node Kubernetes setup on AWS M5.metal instances (2S 24 core 8259CL with 384GB DDR4 RAM and 25Gbps network) in us-west2b

Ice Lake Measurements on 3-node 2S 32 core, 2.5GHz, 300W TDP SKU with 512GB DDR4 RAM and 40Gbps network
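As an illustration of how a "throughput per core under a latency SLA" figure can be derived, the sketch below picks the highest load level whose p99 latency stays under the SLA and divides that throughput by the core count. The function names, the nearest-rank percentile method, the core count, and all sample numbers are assumptions made for illustration; they are not taken from the tested configurations or from the benchmark tooling.

```python
def p99(latencies_ms):
    """99th percentile by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integers
    return ordered[rank - 1]


def best_throughput_per_core(runs, cores, sla_ms=30.0):
    """Highest requests/second per core among runs meeting the SLA.

    `runs` is a list of (requests_per_second, latencies_ms) pairs, one per
    load level. Returns None if no run has p99 latency below `sla_ms`.
    """
    passing = [rps for rps, latencies in runs if p99(latencies) < sla_ms]
    if not passing:
        return None
    return max(passing) / cores


# Hypothetical load sweep: three load levels with 100 latency samples each.
runs = [
    (1000, [10] * 99 + [25]),      # p99 = 10 ms, meets the SLA
    (1500, [15] * 100),            # p99 = 15 ms, meets the SLA
    (2000, [20] * 95 + [40] * 5),  # p99 = 40 ms, violates the SLA
]
result = best_throughput_per_core(runs, cores=48, sla_ms=30.0)
print(result)
```

A generation-over-generation improvement would then be the ratio of this per-core figure between two platforms measured with the same workload and SLA.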

CLD001, Giri, #16

Up to 83% improvement in Wordpress v4.2 with optimizations from Pathlength and CPU Stall Reductions

Wordpress 4.2 HTTPS comparison of 3rd Gen Intel Xeon Scalable platform (ICX) with optimizations vs 2nd Gen Intel Xeon Scalable platform (CLX).

CLX baseline: 1-node, 2x Intel Xeon Gold 6238R (28-core) on S2600WFT with 384GB (12 slots / 32 GB / 2933) total DDR4 memory, microcode 0x5003003, HT on, Turbo on, Ubuntu 20.04, Linux 5.4.0-65-generic , 1xIntel 1.8T SSDSC2KG01, wordpress 4.2.0, PHP 7.4, gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU C Library (Ubuntu GLIBC 2.31-0ubuntu9.1), mysqldVer: 10.3.25-MariaDB-0ubuntu0.20.04.1, 1x Intel X722, TLSv1.3 - TLS_AES_256_GCM_SHA384, test by Intel on 05/19/2021.

ICX: 1-node, 2x Intel Xeon Gold 6348 (28-core) on Coyote Pass with 512GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0xd000270, HT on, Turbo on, Ubuntu 20.04, Linux 5.4.0-72-generic , 1xIntel 895GB SSDSC2KG96, wordpress 4.2.0, PHP 7.4, gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU C Library (Ubuntu GLIBC 2.31-0ubuntu9.1), mysqldVer: 10.3.25-MariaDB-0ubuntu0.20.04.1, 1x XL710-Q2, TLSv1.3 - TLS_AES_256_GCM_SHA384, test by Intel on 05/19/2021.

CLD005, Arijit Biswas, #18

We introduced AMX capabilities that provide massive speedup to the tensor processing that is at the heart of deep learning algorithms. With AMX we can perform 2048 int8 operations per cycle (vs. 256 without AMX) and 1024 bfloat16 operations per cycle (vs. 64 without AMX).

Based on peak architectural capability of matrix multiply + accumulate operations per cycle per core assuming 100% CPU utilization. As of August 2021.
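The per-cycle figures in this claim imply a fixed ratio for each data type. A minimal sketch of that arithmetic follows; it uses only the numbers stated in the claim, and the dictionary and function names are illustrative, not part of any Intel tool.

```python
# Peak per-core operations per cycle, copied from the claim text.
PEAK_OPS_PER_CYCLE = {
    "int8": {"amx": 2048, "no_amx": 256},
    "bfloat16": {"amx": 1024, "no_amx": 64},
}


def amx_ratio(dtype: str) -> float:
    """Return the peak AMX-to-non-AMX ratio for one data type."""
    ops = PEAK_OPS_PER_CYCLE[dtype]
    return ops["amx"] / ops["no_amx"]


for dtype in PEAK_OPS_PER_CYCLE:
    print(f"{dtype}: {amx_ratio(dtype):.0f}x peak operations per cycle")
```

These are peak architectural ratios under the stated 100% utilization assumption, not measured application speedups.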

CLD011, Slide #9 Pranav Kalavade, Robbie Frickie

Bits per cell technology further reduces cost beyond QLC

Measurements were performed on components from QLC SSDs using FG and CTF technology. The measurement platform was Teradyne Magnum 2 memory test systems; programming used random patterns, and margins were quantified using customer commands. Data measured in 08/2019. Your results may vary.

CLD011, Slide #7 Pranav Kalavade, Robbie Frickie

Floating gate technology delivers areal density leadership

Source: ISSCC 2015, J. Im; ISSCC 2017, R. Yamashita; ISSCC 2017, C. Kim; ISSCC 2018, H. Maejima; ISSCC 2019, C. Siau

CLD011, Slide #12 Pranav Kalavade, Robbie Frickie

The Intel SSD D5-P5316 delivers far greater lifetime write capability versus TLC SSDs and high-capacity hard disk drives.

Most SSDs never approach DWPD rating: UoT study of 1.4 million industry SSDs in Enterprise Storage Deployment

CLD011, Slide #13 Pranav Kalavade, Robbie Frickie

Previously, 10TB tech did 1 DWPD, whereas the new 20TB tech writes the same number of petabytes at 0.5 DWPD

Theoretical TBW Comparison. 10TB drive PBW = 10TB x 1DWPD x 5 years x 365 days/year = 18.25 Petabytes written. Theoretical 20TB drive PBW = 20TB x 0.5DWPD x 5 years x 365 days/year = 18.25 Petabytes written.
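The theoretical comparison above can be reproduced with a short calculation. This is a sketch only: the function name is illustrative, and it assumes decimal units (1 PB = 1000 TB), which is consistent with the 18.25 figure in the claim.

```python
def petabytes_written(capacity_tb: float, dwpd: float,
                      years: float = 5, days_per_year: int = 365) -> float:
    """Theoretical lifetime writes in PB: capacity x DWPD x years x days/year.

    Assumes 1 PB = 1000 TB (decimal units).
    """
    terabytes_written = capacity_tb * dwpd * years * days_per_year
    return terabytes_written / 1000


pbw_10tb = petabytes_written(capacity_tb=10, dwpd=1.0)
pbw_20tb = petabytes_written(capacity_tb=20, dwpd=0.5)
print(f"10TB @ 1 DWPD:   {pbw_10tb:.2f} PBW")
print(f"20TB @ 0.5 DWPD: {pbw_20tb:.2f} PBW")
```

Doubling capacity while halving DWPD leaves the product unchanged, which is why both drives arrive at the same theoretical total.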

CLD011, Slide #13 Pranav Kalavade, Robbie Frickie

SSDs have a higher endurance than HDDs. HDDs consistently wear out over time.

HDD Endurance. Endurance of 20TB WD HC650 and 18TB Seagate EXOS X18.

CLD011, Slide #13 Pranav Kalavade, Robbie Frickie

QLC SSDs have great endurance. QLC technology is rated at less than 1 DWPD and can serve the market well.

QLC SSD Endurance. Intel® SSD D5-P5316 specification.

CLI006, Ajith Illendula, Session & Showcase Demo - Denoise Images Using Intel® DL Boost on 12th Gen Intel® Core™ Processors

Up to 1.28x denoise time on ADL when using FP32 vs Int-8

Testing Date: Performance results are based on testing by Intel® as of September 23, 2021 and may not reflect all publicly available updates.

Platform: Alderlake Desktop

Processor: Intel® Core™ i9-12900K Processor (16C/24T)

Memory: 4x8GB DDR5 4800 MHz SDRAM

Storage: Samsung SSD 980 Pro 1TB

Display Resolution: 1920x1080 display

OS: Microsoft Windows 11, Build 10.0.22000.132

Graphics: NVIDIA GeForce RTX™ 3080

CLI006, Ajith Illendula, #20

Up to 1.38x faster with Int-8 than FP32 using OpenVINO benchmark tool

Testing Date: Performance results are based on testing by Intel® as of September 23, 2021 and may not reflect all publicly available updates.

Platform: Alderlake Desktop

Processor: Intel® Core™ i9-12900K Processor (16C/24T)

Memory: 4x8GB DDR5 4800 MHz SDRAM

Storage: Samsung SSD 980 Pro 1TB

Display Resolution: 1920x1080 display

OS: Microsoft Windows 11, Build 10.0.22000.132

Graphics: NVIDIA GeForce RTX™ 3080

CLI006, Ajith Illendula, #12

Up to 4.5x performance boost with OpenVINO (DeNoise 1.3) vs. TensorFlow CPU implementation

Testing Date: Performance results are based on testing by Intel® as of September 23, 2021 and may not reflect all publicly available updates.

Platform: Alderlake Desktop

Processor: Intel® Core™ i9-12900K Processor (16C/24T)

Memory: 4x8GB DDR5 4800 MHz SDRAM

Storage: Samsung SSD 980 Pro 1TB

Display Resolution: 1920x1080 display

OS: Microsoft Windows 11, Build 10.0.22000.132

Graphics: NVIDIA GeForce RTX™ 3080

CLI006, Ajith Illendula, #22

Int8 output quality is similar to FP32 output quality

Testing Date: Performance results are based on testing by Intel® as of September 23, 2021 and may not reflect all publicly available updates.

Output Quality: LPIPS metric of 0.19 comparing Int8 output vs FP32 Output.

Platform: Alderlake Desktop

Processor: Intel® Core™ i9-12900K Processor (16C/24T)

Memory: 4x8GB DDR5 4800 MHz SDRAM

Storage: Samsung SSD 980 Pro 1TB

Display Resolution: 1920x1080 display

OS: Microsoft Windows 11, Build 10.0.22000.132

Graphics: NVIDIA GeForce RTX™ 3080

CLI011, Erik Niemeyer, Session & Showcase Demo - Intel and Autodesk Accelerate Ray Tracing Workflows

Up to 1.9x performance improvement on 12900K vs. 11900K in rendering time, compared using a Maya scene at High quality, or +4AA.

Platform: ADL Whitebox

Processor: Intel® Core™ i9-12900K Processor (16C/24T)

Memory: 2x16GB DDR5 4800 MHz SDRAM

Storage: Samsung SSD 980 PRO (1 TB)

Display Resolution: 1920 x 1080 display

OS: Microsoft Windows 11 Pro, Build 22000.120

Graphics: NVIDIA GeForce RTX™ 3090

Platform: RKL Whitebox

Processor: Intel® Core™ i9-11900K Processor (8C/16T)

Memory: 2x16GB DDR4 3200 MHz SDRAM

Storage: WDS400T3X0C-00SJG0 (4 TB)

Display Resolution: 1920x1080 display

OS: Microsoft Windows 11 Pro, Build 22000.120

Graphics: NVIDIA GeForce RTX™ 3090

Performance results are based on testing by Intel as of September 23, 2021 and may not reflect all publicly available updates.

CLI011, Erik Niemeyer, Session & Showcase Demo - Intel and Autodesk Accelerate Ray Tracing Workflows

12th Gen performs over twice as fast as the previous generation in IPR scenarios: 11900K = 11 sec, 12900K = 4 sec

Platform: ADL Whitebox

Processor: Intel® Core™ i9-12900K Processor (16C/24T)

Memory: 2x16GB DDR5 4800 MHz SDRAM

Storage: Samsung SSD 980 PRO (1 TB)

Display Resolution: 1920 x 1080 display

OS: Microsoft Windows 11 Pro, Build 22000.120

Graphics: NVIDIA GeForce RTX™ 3090

Platform: RKL Whitebox

Processor: Intel® Core™ i9-11900K Processor (8C/16T)

Memory: 2x16GB DDR4 3200 MHz SDRAM

Storage: WDS400T3X0C-00SJG0 (4 TB)

Display Resolution: 1920x1080 display

OS: Microsoft Windows 11 Pro, Build 22000.120

Graphics: NVIDIA GeForce RTX™ 3090

Performance results are based on testing by Intel as of September 23, 2021 and may not reflect all publicly available updates.

E5G005, Deepthi Karkada, Ryan Loney, GE

11th Gen Intel® Core combined with OpenVINO offers scalable performance improvements (slide #36)

CPU	11th Gen Intel® Core™ i7-1185G7E @ 2.80GHz (Referred to as TGL i7-1185G7E in the charts)	8th Generation Intel® Core™ i7-8700T @ 2.40GHz (Referred to as CFL i7-8700T in the charts)
Motherboard	Intel Reference Validation Platform	Intel Reference Validation Platform
Hyper Threading	on	on
Turbo Setting	on	on
Memory	2 x 8 GB DDR4 3200MHz	16 GB DDR4 2666 MT/s
BIOS Vendor	Intel Corporation	American Megatrends Inc
BIOS Version	TGLSFWI1.R00.3373.A00.2009091720	IG1b IA
BIOS Release	09/09/2020	09/17/2019
Microcode	0x88	0xea
GPU	Iris Xe 96EU	Intel® UHD Graphics 630
Batch size	1	1
Precision	FP32	FP32
Number of concurrent inference requests	1 for CPU & GPU plugin latency & throughput results, Multi plugin results: 4 streams for CPU, 2 streams for GPU	1 for CPU & GPU plugin latency & throughput results, Multi plugin results: 4 streams for CPU, 2 streams for GPU, 1 stream for TF results
OS Name	Ubuntu 18.04.3 LTS	Ubuntu 18.04.3 LTS
OS Kernel	Linux 5.9.0-050900-generic	Linux 5.9.0-050900-generic
SW	OV-2021.4.1	OV-2021.4.1, Stock TF 2.0.0 without oneDNN optimizations, TensorFlow does not support inference on iGPU [1]
Test Date	9/15/2021	9/21/2021
Power dissipation, TDP in Watt	28	35
Cost	431	303
Perf/Watt	Throughput/TDP in Watt for OV = 7.63/28 = 0.2725	Throughput/TDP in Watt for OV = 4.02/35 = 0.114
Perf/$	Throughput/Cost for OV = 7.63/431 = 0.0177	Throughput/Cost for OV = 4.02/303 = 0.0132
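The Perf/Watt and Perf/$ rows are simple quotients of the OV throughput by TDP and by cost. A minimal sketch that recomputes them from the table values is shown below; the class and field names are illustrative, and the cost unit is assumed to be US dollars because the table does not state one.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    throughput: float  # OV throughput as reported in the table
    tdp_watts: float   # "Power dissipation, TDP in Watt" row
    cost: float        # "Cost" row; unit assumed to be USD

    @property
    def perf_per_watt(self) -> float:
        return self.throughput / self.tdp_watts

    @property
    def perf_per_cost(self) -> float:
        return self.throughput / self.cost


tgl = Platform("TGL i7-1185G7E", throughput=7.63, tdp_watts=28, cost=431)
cfl = Platform("CFL i7-8700T", throughput=4.02, tdp_watts=35, cost=303)

for p in (tgl, cfl):
    print(f"{p.name}: perf/W = {p.perf_per_watt:.4f}, "
          f"perf/$ = {p.perf_per_cost:.4f}")
```

The table appears to truncate the CFL quotients rather than round them: computed to more digits they are about 0.1149 and 0.01327.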

IOF005 Randi Rost slide 14

2-20X speed-up of Deep Learning predictions

30-90% cost savings on Deep Learning inference

Model: Squeezenet 1.1

OS: Ubuntu 20.04 LTS; Linux Version: Linux ip-172-31-30-130 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021

Hardware platform: AWS c5.12xlarge; Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz; cores=24, enabled cores=24, threads=48

Memory: 96GB DIMM DDR4 Static column Pseudo-static Synchronous Window DRAM 2933 MHz (0.3 ns)

Baseline test: Pytorch v1.7.1 standard version; Test date: 2021-10-22T23:23:14Z

Baseline result: 46.9 microseconds

Optimized test: Inteon platform v.JF0.5; Test date: 2021-10-05T03:28:12Z

Optimized result: 1.99 microseconds

Benchmark framework: MLperf

Results reported: 90th percentile

Batch size: 1

Baseline/optimized = 46.9/1.99 = 23.6x speed improvement

AWS compute time is purchased by the hour, therefore estimated cost savings on a c5.12xlarge instance = 1 – 1.99/46.9 = 95.8% cost savings
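The speed-up and cost-savings figures above follow directly from the two latency results. A minimal sketch of the calculation follows; the function names are illustrative, and it relies on the assumption stated in the claim that cost scales linearly with compute time.

```python
BASELINE_US = 46.9   # baseline result, microseconds (from the claim)
OPTIMIZED_US = 1.99  # optimized result, microseconds (from the claim)


def speedup(baseline: float, optimized: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline / optimized


def cost_savings(baseline: float, optimized: float) -> float:
    """Fraction of compute cost saved, assuming cost is proportional to time."""
    return 1 - optimized / baseline


print(f"Speed-up: {speedup(BASELINE_US, OPTIMIZED_US):.1f}x")
print(f"Cost savings: {cost_savings(BASELINE_US, OPTIMIZED_US):.1%}")
```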

IOF005 Randi Rost slide 33

Up to 7.76x faster inference performance on Resnet 50 v2.7 when optimized by Inteon

Model: Resnet 50 v2.7

OS: Ubuntu 20.04 LTS; Linux Version: Linux ip-172-31-30-130 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux; Linux Kernel details: Linux version 5.11.0-1019-aws (buildd@lgw01-amd64-037) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021

Hardware platform: AWS c5.12xlarge; Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz; cores=24, enabled cores=24, threads=48

Memory: 96GB DIMM DDR4 Static column Pseudo-static Synchronous Window DRAM 2933 MHz (0.3 ns)

Baseline test: Pytorch v1.8.1 standard version; Test date: 2020-10-02T13:32:54Z

Optimized test: Inteon platform v.JF0.5; Test date: 2021-09-30T15:05:18Z

Benchmark framework: MLperf

Results reported: 90th percentile

Batch size: 1

IOF005 Randi Rost slide 34

Up to 4.22x faster inference performance on VGG16 when optimized by Inteon (slide 34)

Model: VGG16

OS: Ubuntu 20.04 LTS; Linux Version: Linux ip-172-31-30-130 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux; Linux Kernel details: Linux version 5.11.0-1019-aws (buildd@lgw01-amd64-037) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021

Hardware platform: AWS c5.12xlarge; Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz; cores=24, enabled cores=24, threads=48

Memory: 96GB DIMM DDR4 Static column Pseudo-static Synchronous Window DRAM 2933 MHz (0.3 ns)

Baseline test: TensorFlow 2.4.1; Test date: 2021-04-12T18:57:16Z

Optimized test: Inteon platform v.JF0.5; Test date: 2021-09-30T15:02:45Z

Benchmark framework: MLperf

Results reported: 90th percentile

Batch size: 1

IOF005 Randi Rost slide 35

Up to 9.25x faster inference performance on Squeezenet 1.1 when optimized by Inteon (slide 35)

Model: Squeezenet 1.1

OS: Ubuntu 20.04 LTS; Linux Version: Linux ip-172-31-30-130 5.11.0-1019-aws #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux; Linux Kernel details: Linux version 5.11.0-1019-aws (buildd@lgw01-amd64-037) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #20~20.04.1-Ubuntu SMP Tue Sep 21 10:40:39 UTC 2021

Hardware platform: AWS c5.12xlarge; Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz; cores=24, enabled cores=24, threads=48

Memory: 96GB DIMM DDR4 Static column Pseudo-static Synchronous Window DRAM 2933 MHz (0.3 ns)

Baseline test: Pytorch v1.8.1 standard version; Test date: 2020-09-29T17:30:36Z

Optimized test: Inteon platform v.JF0.5; Test date: 2021-10-05T03:28:12Z

Benchmark framework: MLperf

Results reported: 90th percentile

Batch size: 1

Intel® Notices and Disclaimers

Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.  

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See configuration disclosure for details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Intel technologies may require enabled hardware, software, or service activation.

Results that are based on systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications, or configurations. 

Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies.

Read our Benchmarks and Measurements disclosures and our Battery Life disclosures.

All product plans and roadmaps are subject to change without notice.

Statements in this document that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. For more information on the factors that could cause actual results to differ materially, see our most recent earnings release and SEC filings at www.intc.com.

Altering clock frequency or voltage may void any product warranties and reduce stability, security, performance, and life of the processor and other components.  Check with system and component manufacturers for details.