Performance Index

ID 615781
Date 11/28/2022

Intel® Innovation 2022

Keynotes

Session Section Speaker Claim Claim Details/Citation Testing Date
Pat Gelsinger Keynote Intel Arc Graphics Pat Gelsinger The Intel® Arc A770 Limited Edition delivers up to 65% better peak performance versus the RTX 3060 in the 3DMark DirectX Raytracing feature test. System Configuration: Graphics: Intel® Arc™ A770 Graphics, Graphics Driver: Engineering Driver 3262, Processor: Intel® Core™ i9-12900K, Asus ROG MAXIMUS Z690 Hero, BIOS: 1601, Memory: 32GB (2x16GB) DDR5 @ 4800MHz, Storage: Corsair MP600 Pro XT 4TB NVMe, OS: Windows 11 Version 22000.795. Graphics: NVIDIA GeForce RTX 3060, Graphics Driver: 516.59, Processor: Intel® Core™ i9-12900K, Asus ROG MAXIMUS Z690 Hero, BIOS: 1601, Memory: 32GB (2x16GB) DDR5 @ 4800MHz, Storage: Corsair MP600 Pro XT 4TB NVMe, OS: Windows 11 Version 22000.795. Measurement: UL's 3DMark DirectX Raytracing feature test (synthetic benchmark) run at 2560x1440 resolution with 2, 6, 12 and 20 samples per pixel (SPP) presets. FPS output converted to Gsample/s with the following formula: Gsample/s = FPS x SPP x 2.56 x 1.44 / 1000. August 18-22, 2022
Pat Gelsinger Keynote 13th Gen Announcement Pat Gelsinger The 13th Gen Intel® Core™ delivers the World's Best Desktop Experience Based on performance testing (as of September 7, 2022) and other attributes of 13th Gen Intel Core processors that combine to form the best overall desktop experience. These include: Fast speeds: up to Max Turbo Frequency of 5.8GHz - the highest for any desktop processor. Strong processor performance across a collection of benchmarks and real-world Gaming, Productivity, & Content Creation workloads, including in relation to prior generation (12th Gen Intel Core) and competitive processor offerings such as AMD Ryzen 9 5950X and AMD Ryzen 7 5800X3D. Broad memory support for both DDR4 and DDR5 memory modules. Support for best in class wired and wireless connectivity. See intel.com/PerformanceIndex (connectivity) for details. Intel's unparalleled approach to security like security assurance programs founded on security by design principles, transparency and disclosure of vulnerabilities and a robust Intel Platform Update process, an esteemed bug bounty program as well as internal research through red teams and more. Breadth of price and performance options available in 13th Gen Intel Core family. Extensive open ecosystem enablement (e.g., OEMs, ODMs, OSs, ISVs, etc.) Additional details available at intel.com/13thgen.
Pat Gelsinger Keynote 13th Gen Announcement Pat Gelsinger The 13th Gen Intel® Core™ delivers the world's best overclocking experience. Based on enhanced overclocking ability enabled by Intel's comprehensive tools and unique architectural tuning capabilities of unlocked 13th Gen Intel Core processors. Overclocking may void warranty or affect system health. Learn more at www.intel.com/overclocking. Results may vary.
Pat Gelsinger Keynote 13th Gen Announcement Pat Gelsinger 13th Gen Intel® Core™ i9-13900KS is the World's Fastest Desktop Processor at 5.8 GHz Based on its Max Turbo Frequency of 5.8GHz, which is the highest for any Desktop processor. Additional details at intel.com/PerformanceIndex.
Pat Gelsinger Keynote Intel Unison Announcement Pat Gelsinger Intel Unison Market Availability Intel® Unison™ solution is currently only available on eligible Intel® Evo™ designs on Windows-based PCs and only pairs with Android- or iOS-based phones; all devices must run a supported OS version. See intel.com/performance-evo for details, including set-up requirements. Results may vary.
Pat Gelsinger Keynote Haukke Fedderson Walkon Haukke Fedderson Careful quantization gave us a speedup of 2X; generational jump from 11 to 12th gen took one afternoon and resulted in 30% more fps Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
Pat Gelsinger Keynote Gaudi2 Demo Pat Gelsinger Gaudi 2 accelerator comfortably outperforms the A100 on ResNet-50 and BERT. https://www.hpcwire.com/off-the-wire/intel-announces-mlperf-results-for-habana-gaudi2/
Greg Lavender Keynote AI on SPR demo Greg Lavender/Brian Martin, AbbVie The Katana Graph team working hand in hand with the Intel team were able to deliver 16x speedups on distributed graph partitioning, and 4.7x speedups on distributed GNN training using AMX for BFloat16 precision. Graph partitioning: 16x faster using Katana optimized graph on 8 nodes of SPR vs. a single SPR node of METIS graph partition algorithm from open source DGL. GNN training: 4.7x faster in total using Katana Graph (BF16) on 8 nodes of SPR vs. OOB distDGL (FP32) on 8 nodes of SPR. Distributed GNN Training: 8-node, each with 2x 4th Gen Intel Xeon Scalable processor (pre-production Sapphire Rapids >40cores) on Intel pre-production platform and software with 512 GB DDR5 memory, microcode 0x90000c0, HT on, Turbo off, Rocky Linux 8.6, 4.18.0-372.26.1.el8_6.crt1.x86_64, 931 GB SSD, 455 TB Lustre filesystem with HDR fabric, Katana Graph 0.4.1 vs. DGL 0.9, test by Intel Corporation on 09/19/2022. Single node Graph Partitioning: 1-node, 2x 4th Gen Intel Xeon Scalable processor (pre-production Sapphire Rapids >40cores) on Intel pre-production platform and software with 1024 GB DDR5 memory, microcode 0x90000c0, HT on, Turbo off, Rocky Linux 8.6, 4.18.0-372.26.1.el8_6.crt1.x86_64, 894 GB SSD, 105 TB Lustre filesystem with OPA fabric, DGL 0.9.0 random graph partition on single node, test by Intel Corporation on 08/17/2022.

9/19/2022

8/17/2022
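The Gsample/s conversion quoted for the 3DMark DirectX Raytracing measurement can be checked with a few lines of Python. This is a minimal sketch: the function name and the frame rate used in the demo loop are hypothetical and are not measured results. The constants 2.56 and 1.44 are the 2560x1440 resolution expressed in thousands of pixels, so their product is the frame size in megapixels.

```python
def gsamples_per_second(fps: float, spp: int) -> float:
    """Convert a frame rate to Gsample/s using the formula in the claim details.

    Gsample/s = FPS x SPP x 2.56 x 1.44 / 1000
    """
    megapixels_per_frame = 2.56 * 1.44  # 2560x1440 pixels, in millions
    return fps * spp * megapixels_per_frame / 1000


# The four samples-per-pixel presets named in the measurement description.
SPP_PRESETS = (2, 6, 12, 20)

if __name__ == "__main__":
    example_fps = 30.0  # hypothetical frame rate, for illustration only
    for spp in SPP_PRESETS:
        rate = gsamples_per_second(example_fps, spp)
        print(f"SPP={spp:>2}: {rate:.3f} Gsample/s")
```

Because the result scales linearly with both FPS and SPP, comparing two cards at the same preset reduces to comparing their frame rates.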

Tech Insight

Session Section Speaker Claim Claim Details/Citation Testing Date
Tech Insights: NEX Sachin Katti Every Xeon processor that is built today is built with 80% renewable electricity. Intel is committed to the continued development of more sustainable products, processes, and supply chain as we strive to prioritize greenhouse gas reduction and improve our global environmental impact. Where applicable, environmental attributes of a product family or specific SKU will be stated with specificity. Refer to the 2022 Corporate Responsibility Report (p. 64) for further information.
Tech Insight#5: AI and Data Science Slides 11-14 Vasudev Lal

Claim 1: Intel's submitted solution using custom Multimodal Transformer systems made it onto the winners list of the NeurIPS21 WebQA competition

Claim 2: Intel's submitted solution using a custom Multimodal Transformer holds the #1 spot on the public leaderboard of the Visual Commonsense Reasoning in Time benchmark

Claim 3: Intel's submitted solution placed #1 in the TCO category in the NeurIPS21 Billion-scale Approximate Nearest Neighbor challenge

Claim 4: Intel's paper VL-InterpreT won the "Best Demo Award" at CVPR2022

Claim 1: 3rd party testing by NeurIPS21 WebQA competition organizers (Microsoft Bing + CMU). Intel's solution is listed under challenge winners at: https://webqna.github.io/

Claim 2: 3rd party testing by Visual Commonsense Reasoning in Time benchmark organizers (Paul Allen Institute of AI). Intel's submission is at the #1 spot on the public leaderboard at this link: https://leaderboard.allenai.org/visualcomet/submissions/public

Claim 3: 3rd party testing by NeurIPS21 competition organizers. Public leaderboard at: https://github.com/harsha-simhadri/big-ann-benchmarks/blob/main/t3/LEADERBOARDS_PUBLIC.md

Claim 4: Public announcement by CVPR chair of the Best Demo Award on Twitter at: https://twitter.com/humphrey_shi/status/1547109616726720513

9/25/2022

Session

Session Section Speaker Claim Claim Details/Citation Testing Date
AIML 003: Achieving efficient, productive and good quality End-to-End AI Pipelines Slide 11 Vrushabh Sanghavi Performance boost from optimized Intel software vs stock software on the same Xeon hardware. Overall, a 15X speedup is observed from the software acceleration. Performance acceleration observed with Modin (up to 27X) and hyperparameter optimization with SigOpt (3.6x) for the PLAsTiCC application. PLAsTiCC is an open data challenge on Kaggle to classify objects in the sky that vary in brightness. It uses simulated astronomical time-series data resembling observations from the Large Synoptic Survey Telescope being set up in Northern Chile. The challenge is to determine the probability that each object belongs to one of 14 classes of astronomical filters.
AIML 003: Achieving efficient, productive and good quality End-to-End AI Pipelines Slide 13 Vrushabh Sanghavi Performance boost from a coherent optimization strategy that uses optimized Intel software, tuned parameters, INT8 quantization and multi-instance data parallel execution. Overall, a 3.4X speedup is achieved using these optimizations. Observed for the Document Analysis inference application measured on the IMDb dataset
AIML 003: Achieving efficient, productive and good quality End-to-End AI Pipelines Slide 14 Vrushabh Sanghavi Over 3.7x performance gain with AMX on Intel Sapphire Rapids Observed for the Document Analysis Fine-Tuning application measured on the IMDb dataset
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform Slide 4 Julien Simon, chief evangelist, Hugging Face 2022: Transformers are Eating Deep Learning

"Transformers are emerging as a general-purpose architecture for ML" https://www.stateof.ai/

RNN and CNN usage down, Transformers usage up https://www.kaggle.com/kaggle-survey-2021

AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform Slide 5 Julien Simon, chief evangelist, Hugging Face Hugging Face: One of the Fastest Growing Open Source Projects star-history.com Jul-05
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform Slide 15 Julien Simon, chief evangelist, Hugging Face Up to 40% better price performance than latest GPU-based instances See Amazon press announcement for more information: https://press.aboutamazon.com/news-releases/news-release-details/aws-announces-general-availability-amazon-ec2-dl1-instances EC2 instance pricing published here: https://aws.amazon.com/ec2/pricing/on-demand/
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform Slide 16 Julien Simon, chief evangelist, Hugging Face Gaudi2 outperformed Nvidia A100 on MLPerf benchmark for ResNet and BERT https://mlcommons.org/en/training-normal-20/ May-22
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform Slide 20 Julien Simon, chief evangelist, Hugging Face Near-Linear Scaling from 1 to 8 HPUs dl1.24xlarge, Amazon EC2; 8x Habana Gaudi1, $13.11/hour (us-east-1, on-demand); Ubuntu 20.08, Habana Deep Learning Base AMI, Habana PyTorch container 1.11.0:1.5.0 - 610, measured by Hugging Face Sep-22
AIML004: Accelerate Transformer Training and Inference with Hugging Face, Habana Gaudi Accelerators, and Intel® Xeon® Platform Slide 25 Julien Simon, chief evangelist, Hugging Face Up to 3x Speedup with Negligible Accuracy Drop Configuration: Test by Intel as of 07/30/2021. 2-node, 2x Intel® Xeon® Platinum 8380 Processor, 40 cores, HT On, Turbo ON, Total Memory 256 GB (16 slots/ 16GB/ 3200 MHz), BIOS: SE5C6200.86B.0022.D64.2105220049(0xd0002b1), Ubuntu 20.04.1 LTS, gcc 9.3.0 compiler, Transformer-Based Models, Deep Learning Framework: PyTorch 1.12, https://download.pytorch.org/whl/cpu/torch-1.12.0+cpu-cp38-cp38-linux_x86_64.whl, BS=1, Public Data, 10 instances/1 sockets, Datatype: FP32/INT8 Jul-21
AIML005: BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster Slide 12 Guoqiong Song, Jiao Wang Up to ~5.8x Training Speedup and ~9.6x Inference Speedup using BigDL-Nano CVPR 2022 Open Access Repository (thecvf.com)
AIML005: BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster Slide 23 Guoqiong Song, Jiao Wang 3x reduction in inference time; 30-50% increase in training throughput; Training Speedup and ~9.6x Inference Speedup using BigDL-Nano https://networkbuilders.intel.com/solutionslibrary/sk-telecom-intel-build-ai-pipeline-to-improve-network-quality
AIML006: Accelerate End-to-End AI and Data Science Pipelines with Intel® Optimized Libraries for Python Slide 7 Rachel Oberman Starting on the left in the data preprocessing section, data scientists can expect to see anywhere from 1-100x faster Pandas workloads by using Intel Distribution of Modin, with a 38x performance increase seen on a workload using 2020 US Census data. With Scikit-Learn, users see up to 100x speed-up in model training and inference by using Intel Extension for Scikit-Learn, and by using this simple 2-line code change extension, workloads can be up to 10x faster when compared to Nvidia GPU, and up to 5x faster than AMD CPUs. With PyTorch, users can see up to 1.4x faster DLRM training, and up to 2.8x faster DLRM inference by using Intel optimizations and more efficient instruction sets. With TensorFlow, users can see up to 2.8x faster quantized inference with Intel optimizations and more efficient instruction sets. https://www.intel.com/content/www/us/en/developer/articles/technical/blazing-fast-python-data-science-ai-performance.html#gs.c7d2kv Oct-23-2020 (Scikit-learn claim) Feb-3-2020 (PyTorch claim) Oct-16-2020 (Modin claim) Oct-26-2020 (TensorFlow claim)
AIML006: Accelerate End-to-End AI and Data Science Pipelines ​ with Intel® Optimized Libraries for Python Slide 17 Rachel Oberman We observe a considerable speedup for Modin vs stock Pandas for various operations : CSV reading-9x Query1- 1.8x Query2-17x Query3- 7x Query4- 6.5x https://www.codeproject.com/Articles/5330204/Scale-Your-Pandas-Workflow-with-Modin 10/5/2021
AIML006: Accelerate End-to-End AI and Data Science Pipelines with Intel® Optimized Libraries for Python Slide 20 Rachel Oberman With Intel Extension for Scikit-Learn, users can see up to 300x speed-up in model training for some algorithms and up to 4000x speed-up in inference for some algorithms https://medium.com/intel-analytics-software/save-time-and-money-with-intel-extension-for-scikit-learn-33627425ae4 June-8-2021
AIML006: Accelerate End-to-End AI and Data Science Pipelines ​ with Intel® Optimized Libraries for Python Slide 23 Rachel Oberman We see a speedup across the board for higgs1m, letters, airline, mortgage and MSRank datasets from the initial XGBoost version to subsequent Xgboost versions when we upstreamed the Intel optimizations into XGBoost. https://www.intel.com/content/www/us/en/developer/articles/technical/improve-performance-xgboost-lightgbm-inference.html#gs.bnk85q Nov-10-2020
AIML011 : Ease of use in leveraging default Intel optimizations for TensorFlow slide 5 Sachin Muradi, Om Thakkar, Tsai Louie AI performance on cloud instances with tensorflow-oneDNN Configuration : For Table 1 (Batch mode): Hardware configuration : AWS instance type : C6i.12xlarge 48 vCPUs, 1 socket, 2 threads per core, Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz, memory : 96 GB, OS: Ubuntu 20.04.2, kernel: 5.11.0-1019-aws x86_​64 Testing done by Intel on May 19, 2021 Baseline performance : Tensorflow version 2.8 Improved performance : Tensorflow version 2.9 For Table 2 (Online mode): Hardware configuration : AWS instance type : C6i.2xlarge 8 vCPUs, 1 socket, 2 threads per core, Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz, memory : 16 GB, OS: Ubuntu 20.04.2, kernel: 5.11.0-1019-aws x86_​64 Testing done by Intel on May 19, 2021 Baseline performance : Tensorflow version 2.8 Improved performance : Tensorflow version 2.9 May-19-2021
AIML011 : Ease of use in leveraging default Intel optimizations for TensorFlow slide 12 Sachin Muradi, Om Thakkar, Tsai Louie Sapphire Rapids speed-up presented at Intel Innovation 2021 (https://www.youtube.com/watch?v=38wrDHEQZuM , https://edc.intel.com/content/www/us/en/products/performance/benchmarks/innovation-event-claims/) and also presented at oneAPI Developer Summit 2022 (https://www.oneapi.io/event-sessions/tbd-session-1/)
AIML011 : Ease of use in leveraging default Intel optimizations for TensorFlow slide 13 Sachin Muradi, Om Thakkar, Tsai Louie Windows Alderlake tensorflow with oneDNN enabled https://medium.com/intel-analytics-software/accelerate-ai-model-performance-on-the-alder-lake-platform-a5c24ae3f522 May-13-2022
CLI001 Intelligent Collaboration on the Web Slide 8,13 Rijubrata Bhaumik Relative Power Consumption. https://apps.powerapps.com/play/95b66612-ae95-4105-9972-13577ac9aa05?tenantId=46c98d88-e344-4ed4-8496-4ed7712e255d&source=portal&screenColor=rgba(0, 176, 240, 1)&initscreen=viewitem&item=1854
CLI002 Progressive Web Applications, Optimized for Intel XPU Architecture Slide 11 Moh Haghighat 100+mW power savings achieved for video call on Windows w/ 12th Gen Intel Core

BASELINE: Tested by Intel as of 01/12/2022. Intel ADL-P i7-1255 2/8/2 15W TDP. DDR5 16GB. OS: Win11 21H2 (22000.376). Chromium version: M105 self-built on 01/12/2022 with patches https://chromium-review.googlesource.com/c/chromium/src/+/3754088 and https://chromium-review.googlesource.com/c/chromium/src/+/3329026 applied. Tool used: Intel SoCWatch. Application tested: Microsoft Teams on Chromium browser at 720p30fps, two participants at full screen mode. Camera model: Logitech C920. Hardware acceleration for H.264 encoding/decoding enabled. Chromium command line specified: "--disable-features=WebRtcThreadsUseResourceEfficientType"

OPTIMIZED: baseline configuration with Chromium command line specified: "--enable-features=WebRtcThreadsUseResourceEfficientType"

CLI002 Progressive Web Applications, Optimized for Intel XPU Architecture Slide 12 Moh Haghighat 10% of SoC power saving for video playback on Windows

BASELINE: Tested by Intel as of 07/20/2022. Intel ADL-P i7-1255 2/8/2 15W TDP. DDR5 16GB. OS: Win11 21H2 (22000.376). Chromium version: M105 self-built on 07/20/2022 with patch https://chromium-review.googlesource.com/c/chromium/src/+/3737284 applied. Tool used: Intel SoCWatch. Test clip: Tears of Steel - AVC 1080p 24FPS 10Mbps. Chromium command line specified: "--disable-features=UseBatchDecoderBufferForMediaEngine".

OPTIMIZED: baseline configuration with Chromium command line specified: "--enable-features=UseBatchDecoderBufferForMediaEngine,MediaFoundationClearPlayback"
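The baseline and optimized runs in the two CLI002 rows differ only in their Chromium feature switches. As a sketch of how such paired command lines could be assembled, the helper below builds argument lists from Chromium's `--enable-features` and `--disable-features` switches and the feature names quoted in the rows. The binary name `chrome` and the helper `chromium_args` are assumptions for illustration; this does not reproduce Intel's test harness.

```python
from typing import List, Sequence


def chromium_args(binary: str,
                  enable: Sequence[str] = (),
                  disable: Sequence[str] = ()) -> List[str]:
    """Build a Chromium argument list with comma-separated feature switches."""
    args = [binary]
    if enable:
        args.append("--enable-features=" + ",".join(enable))
    if disable:
        args.append("--disable-features=" + ",".join(disable))
    return args


BINARY = "chrome"  # hypothetical path to the self-built Chromium binary

# Video-call test: the baseline disables the feature, the optimized run enables it.
video_call_baseline = chromium_args(
    BINARY, disable=["WebRtcThreadsUseResourceEfficientType"])
video_call_optimized = chromium_args(
    BINARY, enable=["WebRtcThreadsUseResourceEfficientType"])

# Video-playback test: the optimized run enables two features together.
playback_baseline = chromium_args(
    BINARY, disable=["UseBatchDecoderBufferForMediaEngine"])
playback_optimized = chromium_args(
    BINARY, enable=["UseBatchDecoderBufferForMediaEngine",
                    "MediaFoundationClearPlayback"])

if __name__ == "__main__":
    for label, args in [("video call baseline", video_call_baseline),
                        ("video call optimized", video_call_optimized),
                        ("playback baseline", playback_baseline),
                        ("playback optimized", playback_optimized)]:
        print(f"{label}: {' '.join(args)}")
```

Each list can be passed directly to `subprocess.run`, which avoids shell quoting issues when several feature names are joined with commas.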

CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) Slide 4 Mark Subotnick Delivering up to 15% ST and 41% MT Performance 13th Gen Intel® Core™ Desktop Processor By SPECrate2017_​int_​base 1 copy and n copy estimates based on measurements on Intel internal reference platforms, comparing i9-13900K to i9-12900K. See the 13th Gen Intel Core Desktop Processor Appendix for additional details. Results may vary. Performance hybrid architecture: Not available on certain 13th Gen Intel Core processors.
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) Slide 7 Mark Subotnick Leadership in Gaming Performance - Gen on Gen For all workload and configuration see www.intel.com/PerformanceIndex. Results may vary. Go to Processors, Intel® Core™, and Desktop.
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) Slide 10 Mark Subotnick Thermal Velocity ⨥ On 13th Gen Core i9 125W and 65W SKUs Max Turbo Frequency refers to the maximum single-core processor frequency that can be achieved with Intel® Turbo Boost Technology. See www.intel.com/technology/turboboost/ for more information. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) Slide 12 Mark Subotnick 13th Gen Desktop Overclocking Altering clock frequency or voltage may damage or reduce the useful life of the processor and other system components, and may reduce system stability and performance. Product warranties may not apply if the processor is operated beyond its specifications. Check with the manufacturers of system and components for additional details. Overclocking results will vary based on system configuration, board power delivery, cooling capability, component or module capabilities, risk tolerance, tuning configuration, unit to unit variations, and other factors.
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) Slide 14 Mark Subotnick 13th Gen Intel Core Desktop Processors For all workload and configuration see www.intel.com/PerformanceIndex. Results may vary. Go to Processors, Intel® Core™, and Desktop.
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) Slide 17 Mark Subotnick Leadership Gaming Frame Consistency As measured by benchmark mode score and/or fps measurements of 13th Gen Intel Core i9-13900K with internal reference board and DDR5 5600 MT/s DRAM; and AMD Ryzen 9 5950X with Asus ROG Crosshair Hero 8 board and DDR4 3200MT/s DRAM. The Configurations for all systems include Windows 11 Pro, 1920x1080 Resolution - High Quality Graphics Preset with EVGA RTX 3090 GPU.
CLI003 How Intel's Performance Hybrid Architecture Is Defining the Future of Multitasking for Developers, Gamers and Creators (RPL Launch Session) Slide 20 Mark Subotnick Leap in Performance for Content Creation For all workload and configuration see www.intel.com/PerformanceIndex. Results may vary. Go to Processors, Intel® Core™, and Desktop.
CLI007 Learn How Intel's Latest Processors for Workstations Increase Productivity of Visualization Workflows of Dassault Systèmes Visualization Solutions Slide 34 Partner speaker (Dassault) Scalability results of Intel Xeon 8168 Performance results are based on testing by Intel® as of September 12, 2022 and may not reflect all publicly available updates. Date testing performed: September 12, 2022. System configuration (8168): CPU 2x Intel® Xeon® Platinum 8168+, Memory 12x 16GB DDR4-2666, GPU NVIDIA GeForce GTX 1050 Ti, Storage 480GB Intel® SSD DC S3500 Series, OS Windows 10 21H2 build 19044.1889. Relevant testing/workload setup details (2x8168): Application settings: python.exe stellarscript\examples\benchmark.py benchmark_WFCPU.yml output.png. OS settings: Power & Sleep - Screen - When plugged in, turn off after: Never; Power & Sleep - Sleep - When plugged in, turn off after: Never. Link HERE for testing details.
CLI007 Learn How Intel's Latest Processors for Workstations Increase Productivity of Visualization Workflows of Dassault Systèmes Visualization Solutions Slide 35 Partner speaker (Dassault) Scalability results of Intel Xeon 8168 and Gen13 (13900K) Performance results are based on testing by Intel® as of September 12, 2022 and may not reflect all publicly available updates. Date testing performed: September 12, 2022. System configuration (8168): CPU 2x Intel® Xeon® Platinum 8168+, Memory 12x 16GB DDR4-2666, GPU NVIDIA GeForce GTX 1050 Ti, Storage 480GB Intel® SSD DC S3500 Series, OS Windows 10 21H2 build 19044.1889. Relevant testing/workload setup details (2x8168): Application settings: python.exe stellarscript\examples\benchmark.py benchmark_WFCPU.yml output.png. OS settings: Power & Sleep - Screen - When plugged in, turn off after: Never; Power & Sleep - Sleep - When plugged in, turn off after: Never. Link HERE for testing details. 9/12/2022
CLI007 Learn How Intel's Latest Processors for Workstations Increase Productivity of Visualization Workflows of Dassault Systèmes Visualization Solutions Slide 36 Partner speaker (Dassault) Performance advantages of Gen13 over Gen12 and Gen11 based on Dassault's proprietary benchmarks Performance results are based on testing by Intel® as of September 12, 2022 and may not reflect all publicly available updates. Date testing performed: September 09, 2022. System configuration: i9-11900K: CPU Intel® Core™ i9-11900K, Memory 4x 8GB DDR5-3200, GPU NVIDIA GeForce RTX 3070, Storage 1TB Samsung SSD 980 PRO, OS Windows 11 22H2 build 22622.436. i9-12900K: CPU Intel® Core™ i9-12900K, Memory 2x 16GB DDR5-4800, GPU NVIDIA GeForce GTX 1650, Storage 1TB Samsung SSD 980 PRO, OS Windows 11 21H2 build 22000.527. i9-13900K: CPU Intel® Core™ i9-13900K, Memory 2x 16GB DDR5-4800, GPU NVIDIA GeForce RTX 3080, Storage 500GB WD_BLACK SN850 NVMe SSD, OS Windows 11 21H2 build 22000.978. Relevant testing/workload setup details (i9-11900K, i9-12900K, i9-13900K): Application settings: python.exe stellarscript\examples\benchmark.py benchmark_WFCPU.yml output.png. OS settings: Power & Sleep - Screen - When plugged in, turn off after: Never; Power & Sleep - Sleep - When plugged in, turn off after: Never.
CLI009 Learn How To Run A Model on Intel® Movidius™ VPU Using OpenVINO™ Slide 24 Gideon Damaryam We do not make any claims per se, but our lab will have participants generate performance numbers for fp32 and int8 themselves and compare them on the same device. Our material will also show sample results in which the int8 version of the same fp32 model vastly increases performance
CLI011 Rajshree Chabukswar The fastest Performance-cores, with an industry-leading 5.8 GHz Based on its Max Turbo Frequency of 5.8GHz, which is the highest for any Desktop processor. Additional details at intel.com/PerformanceIndex.
CLI011 Rajshree Chabukswar Delivering up to 15% single threaded and 41% multi-threaded performance improvement By SPECrate2017_​int_​base 1 copy and n copy estimates based on measurements on Intel internal reference platforms, comparing i9-13900K to i9-12900K. See the 13th Gen Intel Core Desktop Processor Appendix for additional details. Results may vary. Performance hybrid architecture: Not available on certain 13th Gen Intel Core processors.
CLI011 Rajshree Chabukswar In fact, we conducted an internal analysis across the top 150 developer workloads and use cases and found that the added 8 E cores can provide up to a 40% performance increase gen-on-gen (depending on app scalability). Based on internal data and analysis.
CLI011 Rajshree Chabukswar This is an example of one of CyberLink's key use cases, Sky Replacement effect in PowerDirector. We worked with them to analyze application hotspots using Intel VTune, carefully identifying optimization areas and then using Intel Compiler with the right compiler flags. All this work provided anywhere from 10% to 5X improvement, depending on the scene and complexity of the Sky mask. Based on internal data and analysis.
CLI011 Rajshree Chabukswar In this case shown here, for a MT application, we have a potential to increase the multi-threaded gains from 60% to almost 80%. Based on internal data and analysis.
DCC001 What's New with Intel Infrastructure Processing Units Slides 18/19/20 Nick Tausanovitch CMCC/Intel - Up to 5.5x improvement in bandwidth and packet forwarding rate; up to 5.5x IOPS increase in storage performance. Baidu/Intel - Supports up to 1024 device hot-plug/unplug. PCRs 08-24-2022-08 and 09-10-2022-01 Aug-22
DCC003 Optimized Microservices workloads with 4th Gen Intel® Xeon processor, Intel® IPU, Intel® Ethernet and Computational Storage Slide #24 Brad Burres Graph showing performance benefit obtained by moving the load balancer from 4th Gen Xeon to the Intel IPU E2000 (Mount Evans) In backup slide as well as available on the showcase floor
DCC003 - Optimized Microservices workloads with 4th Gen Intel® Xeon Scalable Processors, Intel® Infrastructure Processing Units (Intel® IPUs), Intel® Ethernet, and Computational Storage Slide 8 Suzi, Brad, Mrittika, Anil, Michael Up to 85% fewer cores vs AMD EPYC for ~65K CPS SLA 1-node, 2x pre-production 4th Gen Intel® Xeon® Scalable Processor (60 cores) with integrated Intel® Quick Assist Accelerator (Intel® QAT), on pre-production Intel® platform and software with DDR5 memory total 1024GB (16x64 GB), microcode 0xf000380, HT On, Turbo Off, SNC Off, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, QAT engine v0.6.14, QAT v20.l.0.9.1, NGINX 1.20.1, OpenSSL 1.1.1l, IPP crypto v2021_​5, IPSec v1.1 , TLS 1.3 AES_​128_​GCM_​SHA256, ECDHE-X25519-RSA2K, tested by Intel September 2022. 1-node, 2x AMD EPYC 7763 processor (64 core) on GIGABYTE R282-Z92 with 1024GB DDR4 memory (16x64 GB), microcode 0xa001144, SMT On, Boost Off, NPS=1, Ubuntu 22.04.1 LTS, 5.15.0-47-generic, 1x 1.92TB Intel SSDSC2KG01, 1x Intel® Ethernet Network Adapter E810-2CQDA2, 2x100GbE, NGINX 1.20.1, OpenSSL 1.1.1l, AES_​128_​GCM_​SHA256, ECDHE-X25519-RSA2K, tested by Intel September 2022. 9/19/2022

DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors

Slide 11 Ronak Singhal For NGINX webserver with RSA2K public key crypto, 4th Gen Intel Xeon SP with built-in acceleration with Intel QAT delivers high client density while saving 6 CPU cores Intel QAT for NGINX Webserver: Estimated performance comparing 4th Gen Intel Xeon Scalable processor configuration with Intel® QAT enabled, versus same processor without QAT offload, with software optimizations. Configuration: 1-node, 2-sockets (1 socket tested, SPR-E3) 4th Gen Intel Xeon Scalable processor, 52C, 300W TDP with 1, 2 and 4 Intel QAT active devices (with 8 cores/16 threads). Pre-production platform with 512GB (16x32GB 4800MT/s [4800MT/s]) total memory, HT on, Turbo off, internal pre-production BIOS 0x890000a0, Ubuntu 22.04 LTS, 5.15.0-27-generic; Workload: Async NGINX 0.4.7; NGINX TLS 1.3 Webserver with ECDHE-X25519-RSA2K algorithm. Software Configuration: GCC 11.2.0, libraries: OpenSSL 1.1.1o, QAT engine v0.6.12, Intel IPsec MB v1.2, IPP Crypto ippcp_​2021.5. Test by Intel as of 6/30/2022. 9/28/2021
DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors Slide 14 Ronak Singhal Intel AMX delivers 4.8x additional speedup in addition to OneDNN and quantization optimizations, for a total of 30x total improvement compared to TensorFlow baseline on 4th Gen Intel Xeon SP Intel AMX for SSD-ResNet34: Baseline Configuration: 1-node, 2x 3rd Gen Intel Xeon Platinum 8380 with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0x8d9522d4, HT on, Turbo on, Ubuntu 20.04.2 LTS(docker), 5.4.0-77-generic, TensorFlow v2.5.0 w/o oneDNN, TensorFlow v2.6.0 w oneDNN, test by Intel on 09/28/2021. New Configuration: 1-node, 2x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids, > 40 cores) on Intel pre-production platform with 512 GB DDR memory (8(1DPC)/64GB/4800 MT/s), HT on, Turbo on, CentOS Linux 8.4, internal pre-production bios and software running SSD-ResNet34 BS=1 using TensorFlow 2.6 with intel internal optimization, test by Intel on 09/28/2021. 4/25/2022
DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors Slide 17 Ronak Singhal 4th Gen Intel Xeon SP with Intel IAA delivers 2X the operations/sec throughput on data compression for RocksDB (read only) versus Zstd software on CPU cores only Intel IAA for RocksDB: Estimated performance comparing 4th Gen Intel® Xeon® Scalable processor configuration with Intel® IAA enabled, versus same processor running software on CPU cores without IAA offload. Configuration: 1-node, 2 sockets (1 socket tested) 4th Gen Intel Xeon Scalable processor (56-cores, 4x IAA devices) pre-production platform with 512GB (16x32GB <OUT OF SPEC> 4800MT/s [4800MT/s]) total memory, HT on, Turbo on, internal pre-production BIOS 0x8e000260, CentOS Linux 8.4.2105, 5.15.0-spr.bkc.pc.3.21.0.x86_​64, internal RocksDB v7.1 with pluggable compression support (db_​bench, read only). Software Configuration GCC 8.5.0, ZSTD v1.4.4, p99 latency used. Results depend on block size and database entry size. Read-Only results (Relative ops/s vs data size - 16kB block, 32B value) and Read-Write results (Relative ops/s vs data size - 16kB block, 32B value). Tradeoff up to 16% compressed data size.​ Test by Intel as of 4/25/2022. 2/28/2022
DCC008: Develop Like a Rockstar With 4th Gen Intel Xeon Scalable Processors Slide 18 Ronak Singhal For the ClickHouse database, 4th Gen Intel Xeon SP with Intel IAA delivers up to 40% higher query throughput with better compression and saves up to 20% memory bandwidth versus LZ4 software running on CPU cores only Intel IAA for ClickHouse: Baseline Configuration: Test by Intel as of <02/28/2022>. 1-node, 2x Intel® Xeon® <SKU, processor>, 48 cores, HT On, Turbo On, Total Memory 1024 GB (16 slots/ 64 GB/ 4800 MHz [run @ 4800 MHz] ), <Bios: EGSDCRB1.86B.0072.D01.2201101353>, <ucode version: 0x8e0001a0>, <OS Version: CentOS Stream 8>, <kernel version: 5.12.0-09_15_spr.22.x86_64+server>, <compiler version: CLANG 14>, <ClickHouse 21.12/SSB>, Normalized score=1, Compression Ratio: 1.84 New Configuration: Test by Intel as of <02/28/2022>. 1-node, 2x Intel® Xeon® <SKU, processor>, 48 cores, HT On, Turbo On, Total Memory 1024 GB (16 slots/ 64 GB/ 4800 MHz [run @ 4800 MHz] ), <Bios: EGSDCRB1.86B.0072.D01.2201101353>, <ucode version: 0x8e0001a0>, <OS Version: CentOS Stream 8>, <kernel version: 5.12.0-09_15_spr.22.x86_64+server>, <compiler version: CLANG 14>, <ClickHouse 21.12/SSB>, score= 1.06<QPS>, Compression Ratio: 2.62
DCC013 - Intel AI Optimizations through the lens of sustainability Slide 20,21 Alex Sin Claim 1: Optimizations delivered through Intel Extension for PyTorch, Intel OpenMP, PyTorch JIT tracing, channels-last memory layout and TCMalloc provide up to 3.25 times better inference latency than the stock PyTorch baseline. Claim 2: With the above optimizations, the inference application instance consumes up to 3.06 times less energy per image than the stock PyTorch baseline. BASELINE: Tested by Intel as of 09/15/2022. AWS c6i.metal instance, 2 socket Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz, 32 cores per socket, HT On, Turbo On, OS Ubuntu 22.04 LTS, Kernel 5.15.0-1017-aws, Microcode 0xd000363, Total Memory 256GB, Framework: PyTorch 1.12, Torchvision, Compiler: GCC 11.2.0, Dummy Test Data: 10k Images 3x224x224, Model Topology: ResNet50_v1.5, Model Weights: Pretrained Imagenet, Workload type: Model inference, Inference Config: 4 instances, 16 CPU cores / instance, Benchmark metric: latency in ms/image, Optimizations applied: None. Energy measurement: Metric - joules/image, kWh/image, Tool used - Linux Turbostat, Emissions Factor (Avoided Energy US average from AVERT EPA) - 0.709 kg-CO2/kWh OPTIMIZED: Tested by Intel as of 09/15/2022. AWS c6i.metal instance, 2 socket Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz, 32 cores per socket, HT On, Turbo On, OS Ubuntu 22.04 LTS, Kernel 5.15.0-1017-aws, Microcode 0xd000363, Total Memory 256GB, Framework: PyTorch 1.12, Torchvision, Intel Extension for PyTorch 1.12, Compiler: GCC 11.2.0, Dummy Test Data: 10k Images 3x224x224, Model Topology: ResNet50_v1.5, Model Weights: Pretrained Imagenet, Workload type: Model inference, Inference Config: 4 instances, 16 CPU cores / instance, Benchmark metric: latency in ms/image, Optimizations applied: KMP_AFFINITY=granularity=fine,compact,1,0 OMP_NUM_THREADS=16 LD_PRELOAD=Intel_OpenMP:TCMalloc, Intel Extension for PyTorch FP32 optimizations, PyTorch TorchScript JIT Trace and Freeze. Energy measurement: Metric - joules/image, kWh/image, Tool used - Linux Turbostat, Emissions Factor (Avoided Energy US average from AVERT EPA) - 0.709 kg-CO2/kWh
NEC003 Providing an Optimal Network Stack Slide #12 Edwin Verplanke/Keith Wiles Graphs showing performance. In the backup, slide #25 shows the configuration.
NEC003 Providing an Optimal Network Stack OVS-DPDK (Slide 6) Edwin Verplanke/Keith Wiles Chart showing performance Architecture: x86_​64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 88 On-line CPU(s) list: 0-87 Thread(s) per core: 2 Core(s) per socket: 22 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz Stepping: 4 CPU MHz: 3163.438 BogoMIPS: 4200.00 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 1024K L3 cache: 30976K NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_​tsc art arch_​perfmon pebs bts rep_​good nopl xtopology nonstop_​tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_​cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_​1 sse4_​2 x2apic movbe popcnt tsc_​deadline_​timer aes xsave avx f16c rdrand lahf_​lm abm 3dnowprefetch cpuid_​fault epb cat_​l3 cdp_​l3 invpcid_​single pti intel_​ppin ssbd mba ibrs ibpb stibp tpr_​shadow vnmi flexpriority ept vpid fsgsbase tsc_​adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_​a avx512f avx512dq rdseed adx smap clflushopt clwb intel_​pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_​llc cqm_​occup_​llc cqm_​mbm_​total cqm_​mbm_​local dtherm ida arat pln pts pku ospke flush_​l1d ####
NEC004 Designing Cloud Native Enterprise Network and Security Edge Platforms on Intel Architecture Slide 21 Brian Will & Tarun Viswanathan NGINX TLS handshake performance for ECDHE-X25519-RSA2K is >3x higher using Intel crypto instruction optimisations than with default OpenSSL at the same core count on Intel Xeon Platinum 8470N (SPR E3). Intel QAT on SPR increases performance by a further 1.6x compared to Intel crypto instructions. Backup slides have the detailed system configuration, workload and raw performance scaling (slides 33-38) 06/22/2022
NEC004 Designing Cloud Native Enterprise Network and Security Edge Platforms on Intel Architecture Slide 22 Brian Will & Tarun Viswanathan Snort with Hyperscan performs 2.4 times better than unmodified Snort using ac-bnfa. Backup slides have the detailed system configuration (slides 39-41) 09/16/2022
NEC006 Capturing, Managing and Analyzing Real Time Data Where it Matters Slide 8 Rita H. Wouhaybi/Sam Kaira We are able to ingest 1250 FPS in our video pipeline on Comet Lake and EII 3.0. Details of the system setup are in the backup slides (slides 22-23) ####
OAC005 Crack the challenges of serverless slide 19, 20, 22, 23 Cathy Zhang The cold start latency is reduced by more than 50% using the snapshot-based way of creating the application instance. The application container image size is reduced by more than 50%. The new scaling approach more than doubles or triples the scaling speed. The higher the scaling concurrency requirement, the greater the performance gain from using the new scaling approach #####
OAC010: Driving toward standard Web APIs for XPU Accelerated AI Slide 8 Bryan Bernhart Dynamically converting 128-bit Wasm SIMD instructions into 256-bit IA instructions gives up to a 56% speed-up on MobileNet in the TensorFlow.js Model Benchmark. Aug-03-2022
OAC010: Driving toward standard Web APIs for XPU Accelerated AI Slide 23 Bryan Bernhart A new Web API called WebNN (Neural Network) can be used as an execution backend by Web machine learning (ML) frameworks, such as TensorFlow-Lite Web and ONNXRuntime Web, and delivers near-native inference performance for various computer vision models on CPU and GPU. Compared to MobileNet V2 inference on CPU through legacy WebAssembly, it delivers ~9.6x performance speedup vs Wasm SIMD, and ~3.1x performance speedup vs the fastest Wasm (SIMD + Multithreads) for client AI use cases such as image classification. Aug-03-2022
OAC012 Slide 31 Andrew Richards N-body is a well-known algorithm for simulating a fictional galaxy. From the gif there you can see how it simulates the movement of fictional stars. This is the formula that is used: it calculates the force each star in our fictional galaxy experiences, which is just the sum of all the other interactions in the galaxy. It's intentionally simple; for example, it doesn't use any shared memory, and computation scales as O(N²), but it could be made a much bigger problem. In terms of rendering, this uses OpenGL, which is in a separate translation unit. For this simple project we ran the original CUDA code and then the SYCL code that we migrated on the same Nvidia GPU. In this example the performance of the kernel code was quite comparable. I would emphasise that this can vary a lot depending on your own project, but this is an example of a comparison. Machine used: Lenovo ThinkBook 16p RTX3060 Laptop (AMD Ryzen 7 5800H Processors, Nvidia GeForce RTX 3060) with Ubuntu 20.04, DPC++ built from source and CUDA 16. Compiler flags used are available in the GitHub project build scripts at https://github.com/codeplaysoftware/cuda-to-sycl-nbody. Sep-22
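The OAC012 row describes the N-body kernel as a sum over all other bodies with O(N²) cost. A minimal Python sketch of that direct-sum pattern follows. It is not the Codeplay CUDA or SYCL kernel: the function name, the gravitational constant `g=1.0`, and the `softening` term are assumptions made for illustration, and it returns acceleration (force divided by the body's own mass) rather than force.

```python
def accelerations(positions, masses, g=1.0, softening=1e-3):
    """Direct-sum gravitational acceleration on every body.

    Each body visits every other body once, so the work grows as O(N^2).
    `g` and `softening` are illustrative assumptions, not values from the talk.
    """
    result = []
    for i, (xi, yi, zi) in enumerate(positions):
        ax = ay = az = 0.0
        for j, (xj, yj, zj) in enumerate(positions):
            if i == j:
                continue  # a body exerts no force on itself
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            # Softening keeps the term finite when two bodies nearly coincide.
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            scale = g * masses[j] * dist_sq ** -1.5
            ax += scale * dx
            ay += scale * dy
            az += scale * dz
        result.append((ax, ay, az))
    return result


if __name__ == "__main__":
    # Three hypothetical stars of equal mass.
    stars = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    for acc in accelerations(stars, [1.0, 1.0, 1.0]):
        print(acc)
```

The nested loop is the part a GPU port would parallelize, typically one thread or work-item per outer iteration.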

Demos

Session Section Speaker Claim Claim Details/Citation Testing Date
Neural Coder Demo in Pat's Keynote Demo Jon Markee Intel Neural Compressor optimizes stock PyTorch and quantizes the model from FP32 to INT8. These optimizations accelerate inference speed of the ResNet50 workload in PyTorch by 10.8X and utilize the Intel Advanced Matrix Extensions on 4th Gen Intel Xeon Scalable Processor. New: 1-node, 1 x 4th Gen Intel Xeon Scalable Processor >40 cores on QuantaGrid D54Q-2U with 1024GB (16x64GB <OUT OF SPEC> 4800 MT/s [4800 MT/s]) total DDR4 memory, microcode 0x2b00002, HT enabled, Turbo enabled, CentOS Stream 8, 5.19.9-1.el8.elrepo.x86_64, 1x <S3700 400GB SSD>, PyTorch 1.12.1 optimized with INC with ResNet50 from torchvision 0.13.1 running Imagenet Image size: (3,224,244), oneDNN, Precision Int8, Throughput measured is 3450 FPS. Test by Intel on 9/19/2022 Baseline: 1-node, 1 x 4th Gen Intel Xeon Scalable Processor >40 cores on QuantaGrid D54Q-2U with 1024GB (16x64GB <OUT OF SPEC> 4800 MT/s [4800 MT/s]) total DDR4 memory, microcode 0x2b00002, HT enabled, Turbo enabled, CentOS Stream 8, 5.19.9-1.el8.elrepo.x86_64, 1x <S3700 400GB SSD>, stock PyTorch 1.12.1 with ResNet50 from torchvision 0.13.1 running Imagenet Image size: (3,224,244), oneDNN, Precision FP32, Throughput measured is 319 FPS. Test by Intel on 9/19/2022
AI Inference on Intel® Data Center GPU Demo booth #554 and in Jeff McVeigh's PR and Analyst briefing session Sajeer Shamsudeen Using YOLO v5 (object detection), ATSM can handle 90 AVC streams, while Nvidia A10 can handle only 58. The AVC streams are for decode and inference, and ATSM's media capabilities make this possible. The test shows the capability of the GPUs. Intel is developing a Smart Cities POC for a customer with Intel Data Center GPU Flex Series, showing how those AVC streams are used: object detection for traffic cameras. Tests by Intel as of 9/16/2022. Configs: Intel Configuration: 2S Intel(r) Xeon(r) 4309Y, 256GB DDR4-2933, 1x Intel(r) Data Center GPU Flex 170, Ubuntu 20.04 Kernel 5.10, Agama 419, OpenVINO 2022.2.0 - 90 1080p AVC Streams at 25 FPS with YOLOv5 NVIDIA Configuration: 2S Intel(r) Xeon(r) 6336Y, 128GB DDR4-3200, 1x NVIDIA A10 GPU, Ubuntu 20.04, Kernel 5.15, NVIDIA Driver 515.65.01, CUDA 11.7 Update 1, DeepStream 6.1.1, TensorRT 8.4.1.5 - 57 1080p AVC Streams at 25 FPS with YOLOv5 9/14/2022
Acceleration with High Bandwidth Memory (HBM) Demo booth #555 Andrey Ovsyannikov MPAS-A is a memory bandwidth-bound workload, and HBM technology on Sapphire Rapids unlocks significant performance gains in comparison with other CPUs with conventional DDR memory. We run an MPAS-A simulation of atmosphere dynamics at 1° and 0.5° global resolution on a dual-socket Sapphire Rapids with HBM and compare its performance with 3rd Gen Intel® Xeon® Scalable Processor (codenamed Icelake) and with Sapphire Rapids with DDR5 memory. Our results showed that Sapphire Rapids with HBM significantly outperforms CPUs with DDR memory: by 3.2x and 1.9x compared to Icelake and Sapphire Rapids with DDR, respectively. This performance gain significantly reduces the total time-to-solution and improves the climate simulation throughput (e.g., number of simulated years per day). Baseline Configuration: 1-node, 2x 3rd Gen Intel® Xeon® Platinum 8380 with 512 GB (16 slots/ 32GB/ 3200) total DDR4 memory, microcode 0xd000270, HT on, Turbo on, Rocky Linux 8.5, Linux version 4.18.0-372.19.1.el8_6.crt1.x86_64, MPAS-A v7.3 compiled with Intel® Fortran Compiler Classic and Intel® MPI from 2022.1 Intel® oneAPI HPC Toolkit, benchmark 120-km dycore, score=0.3041 seconds per timestep; benchmark 60-km dycore, score=1.3321 seconds per timestep, test by Intel on 09/14/2022. New Configuration 1: 1-node, 2x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids) on Intel pre-production platform with 512 GB (16 slots/ 32GB/ 4800) total DDR5 memory, HT on, Turbo on, CentOS Stream 8, Linux version 5.19.0-rc6.0712.intel_next.1.x86_64+server, MPAS-A v7.3 compiled with Intel® Fortran Compiler Classic and Intel® MPI from 2022.2.0 Intel® oneAPI HPC Toolkit, benchmark 120-km dycore, score=0.1761 seconds per timestep; benchmark 60-km dycore, score=0.7782 seconds per timestep, test by Intel on 09/14/2022. New Configuration 2: 1-node, 2x Next Gen Intel Xeon Scalable processor (codenamed Sapphire Rapids) on Intel pre-production platform with 128 GB (HBM2e at 3200 MHz) memory, HT on, Turbo on, CentOS Stream 8, Linux version 4.18.0-365.el8.x86_64, MPAS-A v7.3 compiled with Intel® Fortran Compiler Classic and Intel® MPI from 2022.2.0 Intel® oneAPI HPC Toolkit, benchmark 120-km dycore, score=0.0938 seconds per timestep; benchmark 60-km dycore, score=0.3850 seconds per timestep, test by Intel on 09/14/2022.
INTEL DL BOOST PERF BOOST WITH TENSORFLOW ON AWS Demo Louie Tsai By using a 3rd Gen Xeon Scalable processor AWS instance type, users could see a great inference speedup by leveraging the DL Boost feature with Intel Optimized TensorFlow. Intel optimization is also in official TensorFlow 2.9 by default. https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Deep-Learning-Performance-Boost-by-Intel-VNNI/post/1335670 https://medium.com/intel-analytics-software/choosing-the-right-intel-workstation-processor-for-tensorflow-inference-and-development-4afeec41b2a9
Intel® Agilex™ FPGA-based CXL demos demo booth # 564 Tom Schulte 4x more CXL bandwidth per port vs. other FPGA competitors (hard IP based solution running at Gen5 x16 vs. soft IP based solution running at Gen4 x8). 2x more PCIe bandwidth per port vs. Xilinx Versal Premium FPGA. Based on PCI-SIG integrators 5.0 details (Gen 5 x16 at 32GT/s vs. Gen5 x16 at 16 GT/s) PCR numbers: 09-14-2022-03 09-14-2022-04 20-Sep
Content Delivery with Intel® Flex Series GPU Demo David Warberg 5X media transcode throughput at half the power (Intel Flex 140 compared to competition NVIDIA A10 - HEVC 1080p60) https://edc.intel.com/content/www/us/en/products/performance/benchmarks/intel-data-center-gpu-flex-series/?r=698141916
Future of Intel® Developer Cloud: Storage moved onto the Intel® Infrastructure Processing Unit Demo Dave Minturn Moving Storage onto the Intel Infrastructure Processing Unit results in: "Xeon CPU cycle reduction (at least 2x)" and a "high bandwidth storage solution on IPU, up to line rate" Graphical interface showing ~90.8 Gbps bandwidth using the Intel IPU E2000 versus ~93.8 Gbps using the Intel E810 NIC for storage traffic. The graphical interface also shows a comparison of the number of cores utilized in storage processing. When offloading storage onto the Intel IPU E2000, only ~2.6 cores are used, versus ~14 cores for the Xeon 4th Gen plus Intel E810 NIC, an ~11 core reduction in cores needed for storage processing. A results disclaimer slide will be included in the deck being used for the demo and available for the showcase floor. Workload: Flexible I/O (fio) benchmark on 2x hosts with 4x storage targets. Tested by Intel on 9/12/2022. 2x Host Systems: Intel® Archer City server platform, 2x 4th Gen Intel Xeon Scalable Processor with 16x 16GB DDR4 DRAM Host 1: 1x Intel E810 NIC.
Host 2: 1x Intel E2000 IPU Host firmware: EGSDCRB1.SYS.0084.D24.2207132145 4x Target Systems: Intel Server M50CYP Family platform, 2x Intel® Xeon Gold 6342, 16x 16GB DDR4 DRAM, 8x 128GB Optane PMEM, 8x NVME P5316 U.2 SSD, 1x Intel E810 NIC, CentOS 7.9, LightBits LightOS 2.3.17 LightOS Volume 1: 10GB, no-replication, residing on Target 1 LightOS Volume 2: 10GB, no-replication, residing on Target 2 Benchmark: fio, 100% random read, number of jobs=16, queue depth=32, block size=32k Host 1 configured to use LightOS Volume 1 Host 2 configured to use LightOS Volume Results: Host 1: IOPS: 274000, 98.3Gbps, Host CPU Utilization: 14.3 Cores Host 2: IOPS: 348000, 91.2Gbps, Host CPU Utilization: 2.7 Cores Handle 0x0010, DMI type 4, 48 bytes Processor Information Socket Designation: CPU0 Type: Central Processor Family: Xeon Manufacturer: Intel(R) Corporation ID: F7 06 08 00 FF FB EB BF Signature: Type 0, Family 6, Model 143, Stepping 7 Flags: FPU (Floating-point unit on-chip) VME (Virtual mode extension) DE (Debugging extension) PSE (Page size extension) TSC (Time stamp counter) MSR (Model specific registers) PAE (Physical address extension) MCE (Machine check exception) CX8 (CMPXCHG8 instruction supported) APIC (On-chip APIC hardware supported) SEP (Fast system call) MTRR (Memory type range registers) PGE (Page global enable) MCA (Machine check architecture) CMOV (Conditional move instruction supported) PAT (Page attribute table) PSE-36 (36-bit page size extension) CLFSH (CLFLUSH instruction supported) DS (Debug store) ACPI (ACPI supported) MMX (MMX technology supported) FXSR (FXSAVE and FXSTOR instructions supported) SSE (Streaming SIMD extensions) SSE2 (Streaming SIMD extensions 2) SS (Self-snoop) HTT (Multi-threading) TM (Thermal monitor supported) PBE (Pending break enabled) Version: Intel(R) Xeon(R) Platinum 8480CL Voltage: 1.6 V External Clock: 100 MHz Max Speed: 4000 MHz Current Speed: 2000 MHz
Status: Populated, Enabled Upgrade: <OUT OF SPEC> L1 Cache Handle: 0x000D 9/12/2022
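The Neural Coder and HBM rows above report raw scores alongside their headline multipliers, so the multipliers can be checked by simple division. The sketch below does that in Python; the helper names are hypothetical, and the inputs are the throughput and 120-km dycore figures quoted in those rows. Throughput is higher-is-better while seconds per timestep is lower-is-better, so the ratio is inverted for the second helper.

```python
def speedup_from_throughput(new_fps, baseline_fps):
    """Speedup when the metric is higher-is-better (e.g. frames per second)."""
    return new_fps / baseline_fps


def speedup_from_time(baseline_seconds, new_seconds):
    """Speedup when the metric is lower-is-better (e.g. seconds per timestep)."""
    return baseline_seconds / new_seconds


# Neural Coder demo: INT8 with INC (3450 FPS) vs stock FP32 PyTorch (319 FPS).
neural_coder = speedup_from_throughput(3450, 319)

# MPAS-A, 120-km dycore, seconds per timestep as listed in the HBM row.
hbm_vs_icelake = speedup_from_time(0.3041, 0.0938)
hbm_vs_ddr5 = speedup_from_time(0.1761, 0.0938)

print(f"{neural_coder:.1f}x {hbm_vs_icelake:.1f}x {hbm_vs_ddr5:.1f}x")
# prints "10.8x 3.2x 1.9x"
```

The results line up with the 10.8X, 3.2x and 1.9x figures stated in the claims.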

Press Briefings

Session Section Speaker Claim Claim Details/Citation Testing Date
Jeff McVeigh Press & Analyst Briefing at Innovation 2022 Press Briefing Deck Jeff McVeigh Intel Flex Series GPU 170 running AI inference workloads on OpenVINO: ResNet50v1.5, SSD-MobileNetv1, YOLOv4. 1xFlex170 / 3xFlex170 ResNet50v1.5 Batch Size 1/256 on OpenVINO: 2891/9673 inferences per sec SSD-MobileNetv1 Batch Size 1/256 on OpenVINO: 3532/9662 inferences per sec YOLOv4 Batch Size 1/256 on PyTorch: 456/1115 inferences per sec 3X Intel Flex Series GPU 170 running AI inference workloads: ResNet50v1.5, SSD-MobileNetv1, YOLOv4. ResNet50v1.5 Batch Size 1/256 on OpenVINO: 8541/26632 inferences per sec SSD-MobileNetv1 Batch Size 1/256 on OpenVINO: 10139/25144 inferences per sec YOLOv4 Batch Size 1/256 on PyTorch: 1238/2555 inferences per sec 1 GPU Numbers on: 2S Intel® Xeon® Gold 6336Y, 256GB DDR4-3200, Ubuntu 20.04, Kernel 5.15, 1x Intel® Datacenter GPU Flex 170, GCC 9.4, Pre-production OpenVINO 2022.3, AUTO mode; Model Optimizer: OV-2022.3.0000-8ff193e6f66-temp_2022/Q3/demo, Inference Engine: OV-2022.3.0000-8ff193e6f66-temp_2022/Q3/demo, Int8 precision 3 GPU Numbers on: 2S Intel® Xeon® Gold 6342, 256GB DDR4-3200, Ubuntu 20.04, Kernel 5.15, 1x Intel® Datacenter GPU Flex 170, GCC 9.4, Pre-production OpenVINO 2022.3, AUTO mode; Model Optimizer: OV-2022.3.0000-8ff193e6f66-temp_2022/Q3/demo, Inference Engine: OV-2022.3.0000-8ff193e6f66-temp_2022/Q3/demo, Int8 precision 20-Sep
Jeff McVeigh Press & Analyst Briefing at Innovation 2022 Press Briefing Deck Jeff McVeigh Nvidia® Data Center GPU A10 can achieve 96 streams of HEVC Decode and ResNet50 Classification 1080p30fps. Compared to Intel Flex Series GPU 170 running Media + AI inference workloads HEVC+ Resnet50: Flex 170: 130 streams; HEVC+ SSD-MobileNet: Flex 170: 128 streams; 2S Intel® Xeon® Gold 6336Y, 128GB DDR4-3200, Ubuntu 20.04, Kernel 5.15, 1x NVIDIA A10 GPU, DeepStream 6.1.1 from nvcr.io/nvidia/deepstream:6.1-devel container, Test scenario: Offline, No Display, Input Res/FPS/Decode Codec: 1080p/25/H.265, Inference Model/Res/BatchSize/FramesSkipped: ResNet50/224x224/64/0 2S Intel® Xeon® Gold 6346, 128GB DDR4-3200, Ubuntu 20.04, Kernel 5.10, 1x Intel® Data Center GPU Flex 170, Agama-devel-419.38, pre-production FFMPEG, OpenVINO 2022.2; HEVC 1080p30 3Mbps, ~50C-65C Temperature 23-Sep
Jeff McVeigh Press & Analyst Briefing at Innovation 2022 Press Briefing Deck Jeff McVeigh Intel Flex Series GPU 170 running AI inference workloads on the TensorFlow and PyTorch frameworks: ResNet50v1.5, SSD-MobileNetv1-coco, BERT-Large workloads 1xFlex170 / 3xFlex170 ResNet50 Inference PyTorch INT8, BS 1024, 1 stream: 8296/25522 ResNet50 Inference TensorFlow INT8, BS 1024, 1 stream: 9385/28717 SSD-MobileNetv1-coco Inference TensorFlow INT8, BS 1024, 1 stream: 10468/32926 BERT-Large (seq=384) TensorFlow FP16, BS 64, 1 stream: 130/401 2S Intel® Xeon® Gold 6342, 256GB DDR4-3200, Ubuntu 22.04, Kernel 5.15, agama:devel-463; Intel® Extension for TensorFlow* for Flex Series GPU; Intel® Extension for PyTorch* for Flex Series GPU; Version: -- Beta_0.65 Build date: 08 Sep 2022.

Posters

Session Section Speaker Claim Claim Details/Citation Testing Date
Poster "SYCL* Performs Across GPUs" Poster Thorsten Moeller SYCL is a highly performant language on Nvidia and AMD devices and performs comparably to native CUDA or HIP code for diverse workloads. Development environment and supporting components (compilers, tools, libraries) are highly efficient and competitive. Available tools simplify porting code from CUDA to SYCL. On an NVIDIA A100* system, for 10 workloads, SYCL performance is comparable to CUDA. On an AMD Instinct* MI100 system, for 9 workloads, SYCL performance is comparable to HIP. PCR Claim ID: 09-01-2022-08 9/16/2022