



# Tools for Moving AltiVec\* DSP Applications to Intel® Processors

#### **Disclaimer**

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

This document contains information on products in the design phase of development. The information here is subject to change without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel Xeon processor, Intel Core™ 2, Intel, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

\*Other names and brands may be claimed as the property of others.

Copyright © 2009 Intel Corporation.



# Why Intel® Processors for Digital Signal and Image Processing?

Intel® SSE SIMD performance/watt is now comparable with PowerPC\* AltiVec\* SIMD performance/watt

- For processors based on the Intel Core™ micro-architecture
  - Includes all processors branded "Core", "Core 2" and Xeon® since 2006
- See benchmark results later in this presentation
- Intel Atom™ branded processors also have good SIMD performance per watt but are not covered by this presentation

Intel has strong roadmap support for SSE SIMD going forward

- Currently shipping 45nm processors; support > 50 new SSE4.1 instructions
- 32nm processors will begin shipping in commercial quantities in Q4 2009 (earlier than previous plans); add SSE4.2 instructions
- Continually improving performance/watt ratios generation to generation

Intel Advanced Vector Extensions™ (AVX): 256-bit SIMD registers

- Current Intel and PPC processors have 128-bit wide registers
- Announced for "Sandy Bridge" generation processors (beginning 2012)



SIMD=Single Instruction Multiple Data

SSE = Streaming SIMD Extensions: Intel's term for an Instruction Set Architecture similar to AltiVec. There have been six extensions to Intel MMX and SSE technologies since they debuted on Intel Pentium® with MMX and Intel Pentium® II processors.



# NA Software Tools for Intel® Processors

The information on these slides is copyright NAS Ltd.



# **Three Separate Tools**

- Tool 1: VSIPL\* for Intel® Architecture (IA)
- Tool 2: AltiVec.h library for IA
- Tool 3: AltiVec\* binary to IA SSE conversion utility



# Background: VSIPL\*



#### **Vector Signal Image Processing Library\***

- Highly efficient computational middleware for signal and image processing applications
- Application programming interface (API) defined by the VSIPL Forum\*
  - Open standard group -- embedded signal and image processing hardware and software vendors, academia, application developers, government labs
  - http://www.vsipl.org/
- Abstracts hardware implementation details; applications are portable across processor types and generations



# Why Use VSIPL?

- Standard API allows you to capture customers from other hardware platforms
- Hides hardware details; allows upgrades to a system without rewriting the software
- Software that uses portable interfaces can evolve over time more efficiently than low-level code designed for one architecture
- NA Software can provide "plain C" CSIPL or convert your in-house libraries to IA
  - We develop DSP performance libraries using "Liberator", a proprietary semi-automated tool that maintains efficiency but cuts the cost of providing new APIs

#### Examples:

- Optimised FFTW FFT library for IA
- CSIPL: "plain C" equivalent of VSIPL for IA
- Customers' in-house DSP libraries ported to IA



The VSIPL standard defines subsets (profiles):

- Core Lite: 127 functions
- Core: 511 functions
- Full VSIPL (functions not counted here)
- Image: image processing add-on

NASL VSIPL supports "Core Plus" (almost full VSIPL); single precision (32 bit) reals

Image Processing extension is also available



# **VSIPL** Image Functionality

- Histogram Operations
- Convolution
- Diff / Edge Detection
- Image Pad
- Arithmetic Operations
- Logical Functions
- Morphological Operations
- Image Resize
- Object Functionality (e.g., bind/rebind)
- Conversion



# Tool 1: VSIPL Library for IA

DSP portion of application <u>remains unchanged</u>

Recompile target application with NASL VSIPL library; VSIPL calls <u>execute automatically</u> on Intel® processors

- Full VSIPL including optimized, single-precision 1D and 2D FFTs
  - Complex-to-complex, real-to-complex, complex-to-real single and multiple 1D FFTs, filters/other DSP functions; vector/matrix functions; linear algebra
- Based on NASL PowerPC VSIPL Library
- CSIPL "Plain C" API also available
- Supports Intel SSE2-SSE4 processors
- Fully multithreaded
- Aims to match or beat efficiency of Intel® Math Kernel Library
- Linux\* Gold (2.6 kernel) release available for evaluation from NASL
- VxWorks\* 6.6 SMP Gold release available for evaluation from NASL



# VSIPL For PPC and Intel Architecture Performance Comparison White Paper

- GE Fanuc\* Intelligent Systems and NASL VSIPL performance White Paper
  - Freescale\* PPC 8641D (~25W) vs Intel® SL9400 (~28W total)
- Download entire white paper from NASL:

http://www.nasoftware.co.uk/home/attachments/018\_PPC\_Intel\_comparison\_whitepaper.pdf



Table 3 complex to complex multiple 1D in-place FFTs:

Data: M rows of length N; FFT the rows.

Data in warm caches

Times in microseconds; Single core only

| N*M                                                                   | 256*256 | 1K*100 | 4K*50 | 16K*20 | 64K*20 | 128K*20 |
|-----------------------------------------------------------------------|---------|--------|-------|--------|--------|---------|
| Freescale* 8641D<br>1.0 GHz; ~ 25W                                    | 698     | 1,164  | 5,941 | 13,111 | 67,307 | 231,970 |
| Intel® Core™ 2 Duo processor SL9400, 1.86 GHz; ~28W including chipset | 361     | 661    | 2,004 | 4,552  | 26,577 | 61,178  |

Figure 3 complex to complex multiple 1D in-place FFTs:

MFLOPS = 5 N Log<sub>2</sub>(N) / (time for one row FFT in microseconds)

Data: M rows of length N; FFT the rows.

Data in warm caches; Times in microseconds; Single core only
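The MFLOPS formula above can be applied to the Table 3 timings with a few lines of C. This is only a sketch: the function names are hypothetical, and it assumes that each tabulated time covers all M rows, so the time for one row FFT is the table entry divided by M. It also assumes N is a power of two, as in every column of the table.

```c
#include <assert.h>

/* Integer log2 for a power-of-two length (all N in Table 3 are powers of two). */
static unsigned ilog2_pow2(unsigned long n)
{
    unsigned k = 0;
    while (n > 1) { n >>= 1; ++k; }
    return k;
}

/* MFLOPS = 5 N log2(N) / (time for one row FFT in microseconds).
   ASSUMPTION: total_us is the time for all m rows, as read from Table 3. */
double fft_mflops(unsigned long n, unsigned long m, double total_us)
{
    assert(n > 1 && m > 0 && total_us > 0.0);
    double row_us = total_us / (double)m;   /* time for one row FFT */
    return 5.0 * (double)n * (double)ilog2_pow2(n) / row_us;
}

/* MFLOPS per watt for an approximate power figure such as ~25W or ~28W. */
double mflops_per_watt(double mflops, double watts)
{
    assert(watts > 0.0);
    return mflops / watts;
}
```

Under that assumption, the 256\*256 column works out to roughly 3,756 MFLOPS (about 150 MFLOPS/W at ~25W) for the 8641D and roughly 7,262 MFLOPS (about 259 MFLOPS/W at ~28W) for the SL9400.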



# FFT MFLOPS/Watt Measurements

#### FFT Performance comparison MFLOPS / Watt



Table 5a complex vector multiply v1(i) := v2(i)\*v3(i);

Times in italic indicate that the data occupies a significant portion of, or is too large to fit into, the processor's L2 cache. Times in microseconds; Single Core only; Data in warm caches

| N                                                                         | 256  | 1K  | 4K   | 16K | 32K | 64K     | 128K    |
|---------------------------------------------------------------------------|------|-----|------|-----|-----|---------|---------|
| Freescale* 8641D<br>~ 25W                                                 | 0.78 | 2.5 | 18.7 | 74  | 145 | 3,391** | 9,384** |
| Intel® Core™ 2 Duo processor SL9400, single thread ~28W including chipset | 0.44 | 2.0 | 8.8  | 35  | 75  | 151     | 300     |

\*\* Data does not fit in the 8641D's 1MB L2 cache



# **Complex Vector Multiply Performance**

Complex vector multiply v1(i) := v2(i)\*v3(i);

MFLOPS = 6 \* N / (time for one vector multiply in microseconds)
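For reference, a plain C version of the measured operation and of the MFLOPS formula might look like the sketch below. The `cfloat` type and both function names are hypothetical and are not taken from the VSIPL API or from the benchmark code. The factor of 6 reflects the four real multiplies and two real additions in each complex multiply.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative single-precision complex type (not a VSIPL type). */
typedef struct { float re, im; } cfloat;

/* v1[i] = v2[i] * v3[i]: 4 multiplies + 2 adds = 6 flops per element.
   Temporaries make the routine safe when v1 aliases v2 or v3. */
void cvmul(cfloat *v1, const cfloat *v2, const cfloat *v3, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float re = v2[i].re * v3[i].re - v2[i].im * v3[i].im;
        float im = v2[i].re * v3[i].im + v2[i].im * v3[i].re;
        v1[i].re = re;
        v1[i].im = im;
    }
}

/* MFLOPS = 6 * N / (time for one vector multiply in microseconds). */
double cvmul_mflops(size_t n, double time_us)
{
    assert(time_us > 0.0);
    return 6.0 * (double)n / time_us;
}
```

Applied to the N = 256 column of Table 5a, this gives roughly 1,969 MFLOPS for the 8641D (0.78 microseconds) and roughly 3,491 MFLOPS for the SL9400 (0.44 microseconds).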



# **Measurement Systems Configuration**

| Processor            | Freescale* MPC8641D                                                | Intel® Core™ 2 Duo SL9400                                                                                      |
|----------------------|--------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|
| Approximate Thermals | About 25W (no ancillary chipset needed)                            | About 28W inc. ancillary chipset                                                                               |
| Process Technology   | 90nm                                                               | 45nm                                                                                                           |
| Clock rate           | Up to 1.5GHz                                                       | Up to 2GHz                                                                                                     |
| Cores                | 2                                                                  | 2                                                                                                              |
| L1 cache             | 32KB (each)                                                        | 32KB (each)                                                                                                    |
| L2 cache             | 1MB (each)                                                         | Up to 6MB (shared)                                                                                             |
| Front Side Bus       | Up to 600MHz                                                       | Up to 1066MHz                                                                                                  |
| Vector Capability    | AltiVec (per core)                                                 | SSE4.1 (per core)                                                                                              |
| Hardware details     | Freescale MPC8641D @ 1GHz, 400MHz Front Side Bus (GE Fanuc DSP230) | Intel Core 2 Duo SL9400 @ 1.86GHz, 1066MHz Front Side Bus, Intel GS45 Express mobile chipset (HP 2530P laptop) |
| Software Environment | VxWorks 6.6; GE Fanuc AXISLib VSIPL library                        | Linux (2.4 kernel); N.A. Software VSIPL library for Intel Architecture                                         |

The Freescale\* MPC8641D processor was measured as installed in a GE Fanuc\* DSP230 embedded board. The VxWorks\* 6.6 version of NASL's VSIPL\* library was used. The Intel® Core™ 2 Duo processor SL9400 was measured in an HP\* 2530P laptop with the NASL Linux VSIPL library for IA (Core™ 2 Duo processor architecture). NA Software\* has both Linux\* and VxWorks 6.6 versions of its VSIPL libraries for Intel® architecture, and used the Linux version with the Intel® processor. There is no significant performance difference between the VxWorks and Linux versions in these applications. NASL chose the Linux version for these tests because Linux was easier to install on the HP laptop. All timings are with warm caches.



# Tool 1: Evaluation and Licensing

- Evaluation copies for VSIPL for IA are available free of charge
- Support will be handled directly through NA Software Ltd.
- NA Software Ltd. owns the intellectual property for the IA VSIPL library code
  - Licensing for the Gold Release of Tool 1 will be handled directly by NA Software Ltd



# Tool 2: AltiVec.h Header File for IA

- "altivec.h": #include file often used by PowerPC\* DSP software to access AltiVec\* SIMD functionality
- NASL's altivec.h is a replacement include file targeting IA
- Application's DSP code remains unchanged
- Recompile target application with new altivec.h file; AltiVec SIMD instructions <u>automatically converted</u> to Intel SSE SIMD code
  - Supports Intel SSE2-SSE4.1 processors
  - Fully multithreaded
  - Intel funded; available from NASL
- Linux Gold Release available now
- VxWorks 6.6 SMP Gold version available now

All dates, product descriptions, availability and plans are forecasts and subject to change without notice.
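To give a feel for what a replacement header has to do, the sketch below maps two AltiVec-style operations onto SSE intrinsics. It is an illustration only and is not NASL's altivec.h: the `sketch_` names are hypothetical, only single-precision add and multiply-add are shown, and a real header would have to cover the full AltiVec type and intrinsic set (including details such as rounding and alignment behaviour that this sketch ignores). The SSE calls used (`_mm_add_ps`, `_mm_mul_ps`, `_mm_set1_ps`, `_mm_loadu_ps`, `_mm_storeu_ps`) are standard Intel intrinsics.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE/SSE2 intrinsics */

/* Hypothetical stand-in for AltiVec's "vector float" (four 32-bit floats). */
typedef __m128 sketch_vector_float;

/* AltiVec vec_add(a, b): per-lane a + b. */
static inline sketch_vector_float
sketch_vec_add(sketch_vector_float a, sketch_vector_float b)
{
    return _mm_add_ps(a, b);
}

/* AltiVec vec_madd(a, b, c): per-lane a * b + c, done here as two SSE ops. */
static inline sketch_vector_float
sketch_vec_madd(sketch_vector_float a, sketch_vector_float b,
                sketch_vector_float c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* Example DSP-style loop written against the shim:
   y[i] = alpha * x[i] + y[i], with n assumed to be a multiple of 4. */
void sketch_saxpy(float alpha, const float *x, float *y, size_t n)
{
    assert(n % 4 == 0);
    sketch_vector_float va = _mm_set1_ps(alpha);   /* broadcast alpha */
    for (size_t i = 0; i < n; i += 4) {
        sketch_vector_float vx = _mm_loadu_ps(x + i);
        sketch_vector_float vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, sketch_vec_madd(va, vx, vy));
    }
}
```

The point of the design is that the application loop is written once against the AltiVec-style names; only the header underneath changes between PowerPC and IA builds.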



# Tool 2: Evaluation and Licensing

- Gold release evaluation copies for Tool 2 are available from NASL free of charge
- Support will be handled directly through NA Software Ltd.
- Intel owns the intellectual property for the AltiVec.h→SSE conversion utility
  - Tool has been funded by Intel
  - Intel can provide the Gold AltiVec.h tool to customers at no charge
  - Long-term support would have to be negotiated directly with NA Software Ltd.
  - Intel is also discussing other RTOS versions and long-term support possibilities with several important RTOS vendors



# Tool 3: AltiVec Assembler Compiler for IA

- Input: PPC AltiVec assembler code
  - Utility builds internal representation of object code
  - Feeds into gcc's back-end optimization and object code generation functionality
- Output: Intel SSE assembler code
- Feasibility study results were positive
  - Plan to proceed after more feedback on Tools 1 and 2
- AltiVec\* Binary to SSE Binary Conversion Tool feasibility study complete
- Soliciting feedback on VSIPL and AltiVec.h for IA tools before proceeding with next phase








#### **SW Conversion Tools Release Schedule**



**Tool 1 - VSIPL Library for IA** 

**Tool 2 - AltiVec.h Header File for IA**

**Tool 3 - AltiVec Assembler Compiler for IA** 



#### **Summary**

Intel processors are now very effective for DSP workloads

Three Tools from NA Software can simplify DSP software conversion

We are actively looking for additional evaluation customers



#### **More Information**

#### Mike Delves: delves@nasoftware.co.uk

NA Software Ltd.

1 Prospect Road

Birkenhead, CH42 8LE

U.K.

+44 151 609 1911

FAX: +44 151 550 7830

#### Peter Carlston: peter.carlston@intel.com

**Intel Corporation** 

CH6-236

5000 W. Chandler Blvd.

Chandler, Arizona 85226

+1 480/678-2710



# **NASoftware Ltd (NASL)**

Focused, UK-based software company specialising in optimised digital signal processing (DSP) libraries, algorithms, and services

- Real-time SAR processing and image interpretation tools
- DSP libraries: VSIPL, FFTW, VECLIB, CSIPL, etc.
- Code conversion
- Processor experience: PowerPC, MIPS, Intel

Formed in 1978

Customers in the UK, Europe and USA



#### **DSP Libraries from NA Software**

NASL develops DSP performance libraries using Liberator, a proprietary semi-automated tool

- Maintains efficiency but cuts the cost of providing new APIs
- Provides optimized VSIPL libraries to several major board manufacturers, DRS Technology, and other defence contractors
- Optimised FFTW FFT library
- CSIPL: "plain C" equivalent of VSIPL
- Customers' in-house DSP libraries ported to IA

