Performance Index

ID 615781
Date 11/16/2022

ISC 2022

Claim Claim Details/Citation

Demo: Breakthrough Performance for Financial Services

Ponte Vecchio outperforms the competition in Financial Services by 2.6x on Binomial Options, 1.9x on Black Scholes, and 1.7x on Monte Carlo

Testing as of 2/14/2022.
Intel Platform: 1-node, 1x Intel® Xeon® 6336Y, HT On, Turbo Off, total memory 128GB DDR, BIOS Version WLYDCRB1.SYS.0021.P16.2105280638, Ubuntu 20.04, Linux Version 5.10.54+pvc-xtb-po67perf, Ucode 0x8d0002c1, 1x Intel pre-production Ponte Vecchio GPU
Competing Platform: 1-node, 2x Intel® Xeon® Platinum 8360Y, HT On, Turbo On, total memory 256GB DDR, BIOS Version SE5C6200.86B.0022.D08.2103221623, Ubuntu 21.10, Linux Version 5.13.0-27-generic, Ucode 0xd0002a0, 1x NVIDIA A100 80GB PCIe
Intel Binomial Options Build notes: Tools: Intel oneAPI 2022.1, Build knobs: -g -fdebug-info-for-profiling -gline-tables-only -fsycl-targets=spir64_gen -Xsycl-target-backend "-device 0x0bd5 -revision_id 3" -O3 -fp-model precise -std=c++17 -flto -o binomial.sycl.gpu.precise -I -lpthread
Intel Black-Scholes Build notes: Tools: Intel oneAPI 2022.1, Build knobs: -g -O2 -I/opt/intel/opencl/include/ -L/opt/intel/opencl/lib64/ -ltbb -ltbbmalloc -lOpenCL
Intel Monte Carlo Build notes: Tools: Intel oneAPI 2022.1, Build knobs: -DUSE_VML=0 -DUSE_MCG59 -DVEC_SIZE=8 -DMKL_ILP64 -Iinclude -I"${MKLROOT}/include" -L"${MKLROOT}/lib/intel64" -lpthread -lmkl_core -lmkl_intel_ilp64 -lmkl_sequential -lm -ldl -fsycl -fsycl-unnamed-lambda -O /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_sycl.a
Competing Platform Binomial Options Build notes: Tools: CUDA SDK 11.4, Build knobs: -I../../common/inc -m64 --threads 0 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86
Competing Platform Black-Scholes Build notes: Tools: CUDA SDK 11.4, Build knobs: -ccbin g++ -I/usr/local/cuda-11.4/samples/common/inc -m64 -maxrregcount=16 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86
Competing Platform Monte Carlo Build notes: Tools: CUDA SDK 11.4, Build knobs: -ccbin g++ -I/usr/local/cuda-11.4/samples/common/inc -m64 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86

Demo: Accelerate AI for HPC using DeepCAM

3rd Gen Intel® Xeon® Scalable processor outperforms AMD EPYC by 1.2x

4th Gen Intel® Xeon® Scalable processor outperforms AMD EPYC by 1.9x

Intel® Xeon® Scalable processors using mixed precision (FP32 and BFloat16), enabled by Intel® AMX and TMUL instructions, show a scaling improvement of up to 3.2x over AMD EPYC

4th Gen Intel® Xeon® Scalable processor outperforms NVIDIA A100 by 1.3x

Baseline: Test by Intel as of 04/07/2022. 1-node, 2x AMD EPYC 7763, 64 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS AMI 1.1b, ucode 0xa001144, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), MLPerf HPC-AI v0.7, DeepCAM DeepLabv3+, torch1.11.0a0+git13cdb98 AVX2, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AVX-2), Intel MPI 2021.5, Python3.8.
Test by Intel as of 04/07/2022. 1-node, 2x 3rd Gen Intel® Xeon® Platinum 8380 processor, 40 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS SE5C6200.86B.0022.D64.2105220049, ucode 0xd0002b1, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), MLPerf HPC-AI v0.7, DeepCAM DeepLabv3+, torch1.11.0a0+git13cdb98 AVX-2, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AVX-512), Intel MPI 2021.5, Python3.8.
Test by Intel as of 05/05/2022. 1-node, 2x pre-production 4th Gen Intel® Xeon® processor (codenamed Sapphire Rapids, > 40 cores), HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 4800 MHz, DDR5), BIOS EGSDCRB1.86B.0078.D10.2204072027, ucode 0x8f000320, OS CentOS Stream 8, kernel 5.15.0-spr.bkc.pc.4.24.0.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), MLPerf HPC-AI v0.7, DeepCAM DeepLabv3+, torch1.11.0a0+git13cdb98 AVX-512, FP32, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AVX-512), Intel MPI 2021.5, Python3.8.
Test by Intel as of 04/13/2022. 1-node, 2x 3rd Gen Intel® Xeon® Platinum 8360Y, 36 cores, HT On, Turbo On, Total Memory 256 GB (16 slots/ 16 GB/ 3200 MHz), NVIDIA A100 GPU, 80GB HBM, PCIe ID 20B5, BIOS AMI 1.1b, ucode 0xd000311, OS Red Hat Enterprise Linux 8.4 (Ootpa), kernel 4.18.0-305.el8.x86_64, compiler gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), MLPerf HPC-AI v0.7, DeepCAM DeepLabv3+, pytorch1.11.0 py3.7_cuda11.3_cudnn8.2.0_0, cudnn 8.2.1, cuda11.3_0, intel-openmp 2022.0.1 h06a4308_3633, python3.7.
Test by Intel as of 05/05/2022. 1-node, 2x pre-production 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids, > 40 cores), HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 4800 MHz, DDR5), BIOS EGSDCRB1.86B.0078.D10.2204072027, ucode 0x8f000320, OS CentOS Stream 8, kernel 5.15.0-spr.bkc.pc.4.24.0.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), MLPerf HPC-AI v0.7, DeepCAM DeepLabv3+, torch1.11.0a0+git13cdb98 AVX-512 FP32, AMX, BFloat16 Enabled, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AVX-512), Intel MPI 2021.5, Python3.8.
Test by Intel as of 04/09/2022. 16-node cluster, each node: 2x pre-production 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids, > 40 cores), HT On, Turbo On, Total Memory 256 GB (16 slots/ 16 GB/ 4800 MHz, DDR5), BIOS Intel SE5C6301.86B.6712.D23.2111241351, ucode 0x8d000360, OS Red Hat Enterprise Linux 8.4 (Ootpa), kernel 4.18.0-305.el8.x86_64, compiler gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), MLPerf HPC-AI v0.7, DeepCAM DeepLabv3+, torch1.11.0a0+git13cdb98 AVX-512, FP32, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AVX-512), Intel MPI 2021.5, Python3.8, 16-Node.
MLPerf™ HPC-AI v0.7 Training benchmark DeepCAM Performance. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf™ review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf™ specification for verified results. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information

Demo: Accelerate AI for HPC using Cosmoflow

3rd Gen Intel® Xeon® Scalable processor outperforms AMD EPYC by 1.5x

Intel® Xeon® Scalable processor codenamed Sapphire Rapids with HBM outperforms AMD EPYC 7763 by 2.8x

4th Gen Intel® Xeon® Scalable processor outperforms NVIDIA A100 by 1.8x

Baseline: AMD EPYC 7763: Test by Intel as of 03/27/2022. 1-node, 2x AMD EPYC 7763, 64 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS AMI 1.1b, ucode 0xa001144, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), https://github.com/mlcommons/hpc/tree/main/cosmoflow, TensorFlow 2.8, tf_nightly-2.8.0.202149-cp38-cp38-linux_x86_64.whl (AVX-2), horovod 0.22.1, keras 2.8.0, oneCCL-2021.4, oneAPI MPI 2021.4.0, Python 3.8.
NVIDIA A100 + 3rd Gen Intel® Xeon® Platinum 8360Y Scalable Processor: Test by Intel as of 04/08/2022. 1-node, 2x Intel® Xeon® Platinum 8360Y Scalable Processor, 36 cores, HT On, Turbo On, Total Memory 256 GB (16 slots/ 16 GB/ 3200 MHz), NVIDIA A100 GPU, 80GB HBM, PCIe ID 20B5, BIOS AMI 1.1b, ucode 0xd000311, OS Red Hat Enterprise Linux 8.4 (Ootpa), kernel 4.18.0-305.el8.x86_64, compiler gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), https://github.com/mlcommons/hpc/tree/main/cosmoflow, TensorFlow 2.6.0, keras 2.6.0, cudnn 8.2.1, horovod 0.24.2, Python 3.7.
3rd Gen Intel® Xeon® Platinum 8380 Scalable Processor: Test by Intel as of 04/05/2022. 1-node, 2x Intel® Xeon® Platinum 8380 Scalable Processor, 40 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS SE5C6200.86B.0022.D64.2105220049, ucode 0xd0002b1, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), https://github.com/mlcommons/hpc/tree/main/cosmoflow, AVX-512, FP32, TensorFlow 2.6.0, horovod 0.23.0, keras 2.6.0, oneCCL-2021.4, oneAPI MPI 2021.4.0, Python 3.8.
Pre-production Intel® Xeon® Scalable Processors: Test by Intel as of 05/05/2022. 1-node, 2x pre-production 4th Gen Intel® Xeon® Scalable Processors, >40 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 4800 MHz, DDR5), BIOS EGSDCRB1.86B.0078.D10.2204072027, ucode 0x8f000320, OS CentOS Stream 8, kernel 5.15.0-spr.bkc.pc.4.24.0.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), https://github.com/mlcommons/hpc/tree/main/cosmoflow, AVX-512, FP32, TensorFlow 2.6.0, horovod 0.23.0, keras 2.6.0, oneCCL-2021.4, oneAPI MPI 2021.4.0, Python 3.8.
4th Gen Intel® Xeon® Scalable Processors Multi-Node cluster: Test by Intel as of 04/02/2022. 16-node cluster, each node: 2x pre-production 4th Gen Intel® Xeon® Scalable Processors, >40 cores, HT On, Turbo On, Total Memory 256 GB (16 slots/ 16 GB/ 4800 MHz, DDR5), BIOS Intel SE5C6301.86B.6712.D23.2111241351, ucode 0x8d000360, OS Red Hat Enterprise Linux 8.4 (Ootpa), kernel 4.18.0-305.el8.x86_64, compiler gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), https://github.com/mlcommons/hpc/tree/main/cosmoflow, AVX-512, FP32, TensorFlow 2.6.0, horovod 0.23.0, keras 2.6.0, oneCCL-2021.4, oneAPI MPI 2021.4.0, Python 3.8.

High Performance with High Bandwidth Memory

The Intel® Xeon® Scalable processor codenamed Sapphire Rapids with high bandwidth memory (HBM) outperforms AMD Milan by 2.8x and AMD Milan-X by 2.1x. It also outperforms the 3rd Gen Intel® Xeon® Scalable processor by 2.8x.

Test by Intel as of 01/26/2022. 1-node, 2x Intel® Xeon® Platinum 8380 CPU @ 2.30GHz (Family 6 Model 106 Stepping 6), 80 cores, HT On, Turbo On, Total Memory 256 GB (16x16GB 3200MT/s, Dual-Rank), BIOS Version SE5C6200.86B.0020.P23.2103261309, ucode revision=0xd000270, Rocky Linux 8.5, Linux version 4.18.0-240.22.1.el8_3.crt6.x86_64, OpenFOAM® v1912, Motorbike 28M @ 250 iterations; Build notes: Tools: Intel Parallel Studio 2020u4, Build knobs: -O3 -ip -xCORE-AVX512.
Test by Intel as of 01/26/2022. 1-node, 2x AMD EPYC 7763 64-Core Processor @ 2.45GHz (Family 25 Model 1 Stepping 1), 128 cores, HT On, Turbo On, Total Memory 256 GB (16x16GB 3200MT/s, Dual-Rank), BIOS Version 2.1, ucode revision=0xa00111d, Rocky Linux 8.5, Linux version 4.18.0-240.22.1.el8_3.crt6.x86_64, OpenFOAM® v1912, Motorbike 28M @ 250 iterations; Build notes: Tools: Intel Parallel Studio 2020u4, Build knobs: -O3 -ip -xCORE-AVX2.
Test by Microsoft® Azure as of 11/08/21. 1-node, 2x AMD EPYC 7V73X on Azure HBv3, 128 cores (120 available), HT Off, Total Memory 448 GB, CentOS 8.1 HPC Image, GNU compiler 9.2.0, OpenFOAM® v1912, Motorbike 28M @ 250 iterations.
Test by Intel as of 01/26/2022. 1-node, 2x 4th Gen Intel® Xeon® Scalable processor (codenamed Sapphire Rapids, > 40 cores), HT On, Turbo On, Total Memory 512 GB (16x32GB 4800MT/s, Dual-Rank), preproduction platform and BIOS, Red Hat Enterprise Linux 8.4, Linux version 4.18.0-305.el8.x86_64, OpenFOAM® v1912, Motorbike 28M @ 250 iterations; Build notes: Tools: Intel Parallel Studio 2020u4, Build knobs: -O3 -ip -xCORE-AVX512.
Test by Intel as of 01/26/2022. 1-node, 2x Intel® Xeon® Scalable processor (codenamed Sapphire Rapids, > 40 cores) with HBM, HT Off, Turbo Off, Total Memory 128 GB (HBM2e at 3200 MHz), preproduction platform and BIOS, CentOS 8, Linux version 5.12.0-0507.intel_next.06_02_po.5.x86_64+server, OpenFOAM® v1912, Motorbike 28M @ 250 iterations; Build notes: Tools: Intel Parallel Studio 2020u4, Build knobs: -O3 -ip -xCORE-AVX512.
OpenFOAM Disclaimer: This offering is not approved or endorsed by OpenCFD Limited, producer and distributor of the OpenFOAM software via www.openfoam.com, and owner of the OPENFOAM® and OpenCFD® trademarks.

HPC-AI Benchmark Performance on Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM

DeepCAM Training Performance

Baseline: AMD EPYC 7763: Test by Intel as of 04/07/2022. 1-node, 2x EPYC 7763, 64 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS AMI 1.1b, ucode 0xa001144, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), https://github.com/mlcommons/hpc/tree/main/deepcam, torch1.11.0a0+git13cdb98, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl, Intel MPI 2021.5, Python3.8.
Intel® Xeon® Platinum 8380 SP: Test by Intel as of 04/07/2022. 1-node, 2x Intel® Xeon® Platinum 8380 SP, 40 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS SE5C6200.86B.0022.D64.2105220049, ucode 0xd0002b1, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), https://github.com/mlcommons/hpc/tree/main/deepcam, torch1.11.0a0+git13cdb98, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AVX-512), Intel MPI 2021.5, Python3.8.
Intel® Xeon® Processor codenamed Sapphire Rapids with HBM (pre-production): Test by Intel as of 05/25/2022. 1-node, 2x Intel® Xeon® Processor codenamed Sapphire Rapids with HBM (pre-production), > 40 cores, HT On, Turbo Off, Total Memory 128GB HBM and 1TB (16 slots/ 64 GB/ 4800 MHz, DDR5), BIOS EGSDCRB1.86B.0080.D05.2205081330, ucode 0x8f000320, OS CentOS Stream 8, kernel 5.18.0-0523.intel_next.1.x86_64+server, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), https://github.com/mlcommons/hpc/tree/main/deepcam, torch1.11.0a0+git13cdb98, AVX-512, FP32, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AVX-512), Intel MPI 2021.5, Python3.8.
A100 80GB with 3rd Gen Intel® Xeon® Scalable Processor 8360Y: Test by Intel as of 04/13/2022. 1-node, 2x 3rd Gen Intel® Xeon® Scalable Processor 8360Y, 36 cores, HT On, Turbo On, Total Memory 256 GB (16 slots/ 16 GB/ 3200 MHz), NVIDIA A100 80GB PCIe (UUID: GPU-59998403-852d-2573-b3a9-47695dca0604), PCIe ID 20B5, BIOS AMI 1.1b, ucode 0xd000311, OS Red Hat Enterprise Linux 8.4 (Ootpa), kernel 4.18.0-305.el8.x86_64, compiler gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), https://github.com/mlcommons/hpc/tree/main/deepcam, pytorch1.11.0 py3.7_cuda11.3_cudnn8.2.0_0, cudnn 8.2.1, cuda11.3_0, intel-openmp 2022.0.1 h06a4308_3633, python3.7.
Intel® Xeon® Processor codenamed Sapphire Rapids with HBM (pre-production): Test by Intel as of 05/25/2022. 1-node, 2x Intel® Xeon® Processor codenamed Sapphire Rapids with HBM (pre-production), >40 cores, HT On, Turbo Off, Total Memory 128GB HBM and 1TB (16 slots/ 64 GB/ 4800 MHz, DDR5), BIOS EGSDCRB1.86B.0080.D05.2205081330, ucode 0x8f000320, OS CentOS Stream 8, kernel 5.18.0-0523.intel_next.1.x86_64+server, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), https://github.com/mlcommons/hpc/tree/main/deepcam, torch1.11.0a0+git13cdb98, FP32 and AMX, BFloat16, torch-1.11.0a0+git13cdb98-cp38-cp38-linux_x86_64.whl, torch_ccl-1.2.0+44e473a-cp38-cp38-linux_x86_64.whl, intel_extension_for_pytorch-1.10.0+cpu-cp38-cp38-linux_x86_64.whl (AMX, BFloat16 Enabled), Intel MPI 2021.5, Python3.8.

* MLPerf™ HPC-AI v0.7 Training benchmark Performance. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf™ review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf™ specification for verified results. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

HPC-AI Benchmark Performance on Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM

Cosmoflow Training Performance

Baseline: AMD EPYC 7763: Test by Intel as of 03/27/2022. 1-node, 2x EPYC 7763, 64 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS AMI 1.1b, ucode 0xa001144, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), https://github.com/mlcommons/hpc/tree/main/cosmoflow, TensorFlow 2.8, tf_nightly-2.8.0.202149-cp38-cp38-linux_x86_64.whl, horovod 0.22.1, keras 2.8.0, oneCCL-2021.4, oneAPI MPI 2021.4.0, Python 3.8.
A100 80GB GPU with 3rd Gen Intel® Xeon® Scalable Processor 8360Y: Test by Intel as of 04/08/2022. 1-node, 2x Intel® Xeon® Scalable Processor 8360Y, 36 cores, HT On, Turbo On, Total Memory 256 GB (16 slots/ 16 GB/ 3200 MHz), NVIDIA A100 80GB PCIe (UUID: GPU-59998403-852d-2573-b3a9-47695dca0604), PCIe ID 20B5, BIOS AMI 1.1b, ucode 0xd000311, OS Red Hat Enterprise Linux 8.4 (Ootpa), kernel 4.18.0-305.el8.x86_64, compiler gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), https://github.com/mlcommons/hpc/tree/main/cosmoflow, TensorFlow 2.6.0, keras 2.6.0, cudnn 8.2.1, horovod 0.24.2, Python 3.7.
Intel® Xeon® Platinum 8380 SP: Test by Intel as of 04/05/2022. 1-node, 2x Intel® Xeon® Platinum 8380 SP, 40 cores, HT On, Turbo Off, Total Memory 512 GB (16 slots/ 32 GB/ 3200 MHz, DDR4), BIOS SE5C6200.86B.0022.D64.2105220049, ucode 0xd0002b1, OS Red Hat Enterprise Linux 8.5 (Ootpa), kernel 4.18.0-348.7.1.el8_5.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-4), https://github.com/mlcommons/hpc/tree/main/cosmoflow, AVX-512, FP32, TensorFlow 2.6.0, horovod 0.23.0, keras 2.6.0, oneCCL-2021.4, oneAPI MPI 2021.4.0, Python 3.8.
Intel® Xeon® Processor codenamed Sapphire Rapids with HBM (pre-production): Test by Intel as of 05/25/2022. 1-node, 2x Intel® Xeon® Processors codenamed Sapphire Rapids with HBM (pre-production), >40 cores, HT On, Turbo Off, Total Memory 128GB HBM and 1TB (16 slots/ 64 GB/ 4800 MHz, DDR5), BIOS EGSDCRB1.86B.0080.D05.2205081330, ucode 0x8f000320, OS CentOS Stream 8, kernel 5.15.0-spr.bkc.pc.4.24.0.x86_64, compiler gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-10), https://github.com/mlcommons/hpc/tree/main/cosmoflow, AVX-512, FP32, TensorFlow 2.6.0, horovod 0.23.0, keras 2.6.0, oneCCL-2021.4, oneAPI MPI 2021.4.0, Python 3.8.

* MLPerf™ HPC-AI v0.7 Training benchmark Performance. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf™ review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf™ specification for verified results. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM leads on CloverLeaf by >3x. Test by Intel as of 04/26/2022. 1-node, 2x Intel® Xeon® Platinum 8360Y CPU, 72 cores, HT On, Turbo On, Total Memory 256GB (16x16GB DDR4 3200 MT/s), SE5C6200.86B.0021.D40.2101090208, Ubuntu 20.04, Kernel 5.10, 0xd0002a0, ifort 2021.5, Intel MPI 2021.5.1, build knobs: -xCORE-AVX512 -qopt-zmm-usage=high. Test by Intel as of 04/19/22. 1-node, 2x Pre-production Intel® Xeon® Scalable Processor codenamed Sapphire Rapids with HBM, >40 cores, HT On, Turbo On, Total Memory 128 GB (HBM2e at 3200 MHz), BIOS Version EGSDCRB1.86B.0077.D11.2203281354, ucode revision=0x83000200, CentOS Stream 8, Linux version 5.16, ifort 2021.5, Intel MPI 2021.5.1, build knobs: -xCORE-AVX512 -qopt-zmm-usage=high.
Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM leads on OpenFOAM by >2x. Test by Intel as of 01/26/2022. 1-node, 2x Intel® Xeon® Platinum 8380 CPU, 80 cores, HT On, Turbo On, Total Memory 256 GB (16x16GB 3200MT/s, Dual-Rank), BIOS Version SE5C6200.86B.0020.P23.2103261309, 0xd000270, Rocky Linux 8.5, Linux version 4.18, OpenFOAM® v1912, Motorbike 28M @ 250 iterations; Build notes: Tools: Intel Parallel Studio 2020u4, Build knobs: -O3 -ip -xCORE-AVX512. Test by Intel as of 01/26/2022. 1-node, 2x Pre-production Intel® Xeon® Scalable Processor codenamed Sapphire Rapids with HBM, >40 cores, HT Off, Turbo Off, Total Memory 128 GB (HBM2e at 3200 MHz), preproduction platform and BIOS, CentOS 8, Linux version 5.12, OpenFOAM® v1912, Motorbike 28M @ 250 iterations; Build notes: Tools: Intel Parallel Studio 2020u4, Build knobs: -O3 -ip -xCORE-AVX512. OpenFOAM Disclaimer: This offering is not approved or endorsed by OpenCFD Limited, producer and distributor of the OpenFOAM software via www.openfoam.com, and owner of the OPENFOAM® and OpenCFD® trademarks.
Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM leads on WRF by >2x. Test by Intel as of 05/03/2022. 1-node, 2x Intel® Xeon® 8380 CPU, 80 cores, HT On, Turbo On, Total Memory 256 GB (16x16GB 3200MT/s, Dual-Rank), BIOS Version SE5C6200.86B.0020.P23.2103261309, ucode revision=0xd000270, Rocky Linux 8.5, Linux version 4.18, WRF v4.2.2. Test by Intel as of 05/03/2022. 1-node, 2x Pre-production Intel® Xeon® Scalable Processor codenamed Sapphire Rapids with HBM, >40 cores, HT On, Turbo On, Total Memory 128 GB (HBM2e at 3200 MHz), BIOS Version EGSDCRB1.86B.0077.D11.2203281354, ucode revision=0x83000200, CentOS Stream 8, Linux version 5.16, WRF v4.2.2.
Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM leads on YASK iso3dfd by >3x. Test by Intel as of 05/9/2022. 1-node, 2x Intel® Xeon® Platinum 8360Y CPU, 72 cores, HT On, Turbo On, Total Memory 256GB (16x16GB DDR4 3200 MT/s), SE5C6200.86B.0021.D40.2101090208, Rocky Linux 8.5, kernel 4.18.0, 0xd000270, Build knobs: make -j YK_CXX='mpiicpc -cxx=icpx' arch=avx2 stencil=iso3dfd radius=8. Test by Intel as of 05/03/22. 1-node, 2x Pre-production Intel® Xeon® Scalable Processor codenamed Sapphire Rapids with HBM, >40 cores, HT On, Turbo On, Total Memory 128 GB (HBM2e at 3200 MHz), BIOS Version EGSDCRB1.86B.0077.D11.2203281354, ucode revision=0x83000200, CentOS Stream 8, Linux version 5.16, Build knobs: make -j YK_CXX='mpiicpc -cxx=icpx' arch=avx2 stencil=iso3dfd radius=8.
Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM leads on ParSeNet by 2x. Test by Intel as of 05/24/2022. 1-node, 2x Intel® Xeon® Platinum 8380 CPU, 80 cores, HT On, Turbo On, Total Memory 256GB (16x16GB DDR4 3200 MT/s), SE5C6200.86B.0021.D40.2101090208, Ubuntu 20.04.1 LTS, 5.10, ParSeNet (SplineNet), PyTorch 1.11.0, Torch-CCL 1.2.0, IPEX 1.10.0, MKL (2021.4-Product Build 20210904), oneDNN (v2.5.0). Test by Intel as of 04/18/2022. 1-node, 2x Pre-production Intel® Xeon® Scalable Processor codenamed Sapphire Rapids with HBM, >40 cores, HT On, Turbo On, Total Memory 128GB (HBM2e 3200 MT/s), EGSDCRB1.86B.0077.D11.2203281354, CentOS Stream 8, 5.16, ParSeNet (SplineNet), PyTorch 1.11.0, Torch-CCL 1.2.0, IPEX 1.10.0, MKL (2021.4-Product Build 20210904), oneDNN (v2.5.0).
Intel® Xeon® Scalable Processor Codenamed Sapphire Rapids with HBM leads on Ansys Fluent by 2x. Test by Intel as of 2/2022. 1-node, 2x Intel® Xeon® Platinum 8380 CPU, 80 cores, HT On, Turbo On, Total Memory 256 GB (16x16GB 3200MT/s, Dual-Rank), BIOS Version SE5C6200.86B.0020.P23.2103261309, ucode revision=0xd000270, Rocky Linux 8.5, Linux version 4.18, Ansys Fluent 2021 R2 Aircraft_wing_14m; Build notes: Commercial release using Intel 19.3 compiler and Intel MPI 2019u. Test by Intel as of 2/2022. 1-node, 2x Pre-production Intel® Xeon® Scalable Processor codenamed Sapphire Rapids with HBM, >40 cores, HT Off, Turbo Off, Total Memory 128 GB (HBM2e at 3200 MHz), preproduction platform and BIOS, CentOS 8, Linux version 5.12, Ansys Fluent 2021 R2 Aircraft_wing_14m; Build notes: Commercial release using Intel 19.3 compiler and Intel MPI 2019u8.
Intel data center GPU codenamed Ponte Vecchio leads on OpenMC by 2x. Test by Argonne National Laboratory as of 5/23/2022, 1-node, 2x AMD EPYC 7532, 256 GB DDR4 3200, HT On, Turbo On, ucode 0x8301038. 1x A100 40GB PCIe. OpenSUSE Leap 15.3, Linux Version 5.3.18, Libraries: CUDA 11.6 with OpenMP clang compiler. Build Knobs: cmake --preset=llvm_a100 -DCMAKE_UNITY_BUILD=ON -DCMAKE_UNITY_BUILD_MODE=BATCH -DCMAKE_UNITY_BUILD_BATCH_SIZE=1000 -DCMAKE_INSTALL_PREFIX=./install -Ddebug=off -Doptimize=on -Dopenmp=on -Dnew_w=on -Ddevice_history=off -Ddisable_xs_cache=on -Ddevice_printf=off. Benchmark: Depleted Fuel Inactive Batch Performance on HM-Large Reactor with 40M particles. Test by Intel as of 5/25/2022, 1-node, 2x Intel® Xeon® Scalable Processor 8360Y, 256GB DDR4 3200, HT On, Turbo On, ucode 0xd0002c1. 1x Pre-production Ponte Vecchio. Ubuntu 20.04, Linux Version 5.10.54, agama 434, Build Knobs: cmake -DCMAKE_CXX_COMPILER="mpiicpc" -DCMAKE_C_COMPILER="mpiicc" -DCMAKE_CXX_FLAGS="-cxx=icpx -mllvm -indvars-widen-indvars=false -Xclang -fopenmp-declare-target-global-default-no-map -std=c++17 -Dgsl_CONFIG_CONTRACT_CHECKING_OFF -fsycl -DSYCL_SORT -D_GLIBCXX_USE_TBB_PAR_BACKEND=0" --preset=spirv -DCMAKE_UNITY_BUILD=ON -DCMAKE_UNITY_BUILD_MODE=BATCH -DCMAKE_UNITY_BUILD_BATCH_SIZE=1000 -DCMAKE_INSTALL_PREFIX=../install -Ddebug=off -Doptimize=on -Dopenmp=on -Dnew_w=on -Ddevice_history=off -Ddisable_xs_cache=on -Ddevice_printf=off. Benchmark: Depleted Fuel Inactive Batch Performance on HM-Large Reactor with 40M particles.