

# Reconfigurable Computing in the Multi-Core Era

### Khaled Benkrid, PhD, CEng, MBA

International Workshop on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2010)

The University of Edinburgh, Institute of Integrated Systems, System Level Integration Group

k.benkrid@ieee.org

24th International Conference on Supercomputing, June 1-4, 2010, Tsukuba, Japan

# Outline

1. Introduction: the semantic gap between applications and hardware
2. Parallel computer technologies
   - Multi-core general purpose processors (GPPs)
   - Application-specific processors
   - Field Programmable Gate Arrays (FPGAs)
   - Fixed ASICs
3. Comparative studies: FPGAs vs. GPUs vs. IBM Cell vs. GPPs
4. Reconfigurable computing, is it finally the time?
5. Heterogeneous computing, is it the way forward?
6. Conclusions

## **Introduction 1/2**




## **Introduction 2/2**

- A **semantic gap** is opening between applications (traditionally written in sequential code) and hardware (now essentially parallel)
- Different approaches to parallel hardware programming are being followed stretching from:
  - Attempting to parallelise sequential code automatically to
  - Programming/designing from pure parallel code
- This gap is, ironically perhaps, opening a window of opportunity for new parallel technologies in mainstream computing


## Parallel computer technologies 1/2

| Technology | Performance / Cost | Time to market | Time to change code functionality | Power consumption |
|------------|--------------------|----------------|-----------------------------------|-------------------|
| GPPs | Low-Medium | Very short | Very short | High |
| Application-specific processors, e.g. DSPs, GPUs | Medium | Medium | Medium | Medium-High |
| FPGAs | Medium-High | Long | Long | Low-Medium |
| Fixed ASICs | Very high (for high volumes) | Very long | Impossible | Low |

Speed performance increases down the table (from GPPs to fixed ASICs); flexibility increases up the table (from fixed ASICs to GPPs).


## Parallel computer technologies 2/2

- Different technologies offer different advantages
- On the performance scale, fixed ASICs offer the highest speed and the lowest power consumption, whereas GPPs offer the greatest flexibility
- FPGAs and application-specific processors, e.g. DSPs and GPUs, occupy the middle ground
- FPGAs offer ASIC-like performance and are now at the leading edge of process technology
  - However, they suffer from a low-level programming model
- GPUs can offer much higher performance than GPPs at a low cost
  - However, they suffer from a longer development time than GPPs and from high power consumption



### Comparative Study 1: Smith-Waterman Algorithm

Pairwise sequence alignment, e.g. of DNA sequences




### **Comparative Study 1: Smith-Waterman Reconfigurable Hardware Skeleton**



Reduces time complexity from $O(m \cdot n)$ to $O(m+n)$

K. Benkrid et al., "A Highly Parameterised and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 17, Issue 4, pp. 561-570, April 2009
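The complexity claim can be illustrated in software. The sketch below is not the hardware skeleton from the cited paper; it assumes a linear gap penalty and illustrative scoring values (`match`, `mismatch`, `gap` are hypothetical parameters), and fills the Smith-Waterman matrix one anti-diagonal at a time. Cells on the same anti-diagonal do not depend on each other, so an array of $m$ processing elements could update them in a single step, giving $m+n-1$ steps instead of $m \cdot n$ sequential cell updates.

```python
def smith_waterman_wavefront(query, subject, match=2, mismatch=-1, gap=-1):
    """Return (best local alignment score, number of wavefront steps).

    Scoring values are illustrative assumptions, with a linear gap penalty.
    """
    m, n = len(query), len(subject)
    # H has an extra zero row and column for the boundary condition.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    steps = 0
    # Anti-diagonal d holds the cells (i, j) with i + j == d.
    for d in range(2, m + n + 1):
        steps += 1
        # Cells on one anti-diagonal depend only on diagonals d-1 and d-2,
        # so this inner loop is the part that hardware could run in parallel.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best, steps
```

For a query of length 256 and a subject of length $n$, the step count returned is $256 + n - 1$, which is the $O(m+n)$ behaviour quoted above.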


- Xilinx Virtex 4 LX160 -11 vs. NVIDIA GeForce 8800GTX GPU vs. IBM's Cell BE processor vs. 3.4 GHz Pentium 4 Prescott processor
- Designs and implementations were carried out by four PhD students, each with comparable experience on their respective platform
- Comparative criteria:
  - Speed performance
  - Energy Consumption
  - Development Time
  - Cost of development
  - Performance per \$
  - Performance per Watt

# Speed performance comparison for a query sequence of length 256

| Platform | GCUPS* | Speed-Up |
|----------|--------|----------|
| FPGA     | 19.4   | 228:1    |
| GPU      | 1.2    | 14:1     |
| Cell BE  | 3.84   | 45:1     |
| GPP      | 0.085  | 1:1      |

\* Giga Cell Updates Per Second
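As a worked example of the metric, the sketch below assumes the usual definition of a cell update (one alignment-matrix cell per pair of query and database residues) and recomputes the speed-up column from the GCUPS figures in the table. The `gcups` helper and its argument names are illustrative; the database length and run time behind the measurements are not given on the slide.

```python
def gcups(query_len, db_len, seconds):
    """Giga cell updates per second: one cell per (query, database) residue pair."""
    return query_len * db_len / seconds / 1e9

# GCUPS figures from the table above.
measured = {"FPGA": 19.4, "GPU": 1.2, "Cell BE": 3.84, "GPP": 0.085}

# Speed-up relative to the GPP baseline.
speed_up = {name: value / measured["GPP"] for name, value in measured.items()}

for name, ratio in speed_up.items():
    print(f"{name}: {ratio:.0f}:1")
```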


### **Development times**

| Platform | Development time in Days |
|----------|--------------------------|
| FPGA     | 300                      |
| GPU      | 45                       |
| Cell BE  | 90                       |
| GPP      | 1                        |


### **Cost of development**

| Platform | Purchase cost (\$) | Development cost (\$) | Overall cost (\$) | Normalised overall cost |
|----------|--------------------|-----------------------|-------------------|-------------------------|
| FPGA     | 10,000             | 48,000                | 58,000            | 50                      |
| GPU      | 1,450              | 7,200                 | 8,650             | 8                       |
| Cell BE  | 8,000              | 14,400                | 22,400            | 19                      |
| GPP      | 1,000              | 160                   | 1,160             | 1                       |
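The development cost column is consistent with a flat rate of \$160 per development day applied to the development times table (for example 48,000 / 300). That rate is inferred from the figures and is not stated on the slides, so the sketch below treats it as an assumption. Normalised ratios computed this way can differ slightly from the rounded values in the table.

```python
DAILY_RATE = 160  # $ per development day; inferred from 48,000 / 300, an assumption

# (purchase cost in $, development time in days), taken from the tables above
platforms = {
    "FPGA": (10_000, 300),
    "GPU": (1_450, 45),
    "Cell BE": (8_000, 90),
    "GPP": (1_000, 1),
}

def overall_cost(purchase, days, rate=DAILY_RATE):
    """Purchase cost plus development time priced at a flat daily rate."""
    return purchase + days * rate

costs = {name: overall_cost(p, d) for name, (p, d) in platforms.items()}
normalised = {name: c / costs["GPP"] for name, c in costs.items()}

for name in platforms:
    print(f"{name}: ${costs[name]:,} ({normalised[name]:.1f}x GPP)")
```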


### **Performance per \$**

| Platform | Performance<br>(MCUPS*) per \$ spent | Normalised Performance per \$<br>spent |
|----------|--------------------------------------|----------------------------------------|
| FPGA     | 0.34                                 | 4.6                                    |
| GPU      | 0.14                                 | 1.9                                    |
| Cell BE  | 0.17                                 | 2.3                                    |
| GPP      | 0.07                                 | 1                                      |

\* Mega Cell Updates Per Second


#### **Power and energy consumption**

| Platform                  | Power (W) | Energy (J) | Normalised energy consumption |
|---------------------------|-----------|------------|-------------------------------|
| FPGA (clocked at 80 MHz)  | 39        | 73         | 0.0017                        |
| GPU                       | 56        | 1,682      | 0.04                          |
| Cell BE                   | 140       | 1,317      | 0.03                          |
| GPP                       | 100       | 42,400     | 1                             |
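Energy is power multiplied by run time, which is why the FPGA's energy advantage is far larger than its power advantage: it draws less power and also finishes sooner. The sketch below recovers the implied run times from the table and recomputes the normalised energy column; the variable names are illustrative.

```python
# (power in W, energy in J) from the table above
measurements = {
    "FPGA": (39, 73),
    "GPU": (56, 1682),
    "Cell BE": (140, 1317),
    "GPP": (100, 42400),
}

# Energy = power x time, so the implied run time is energy / power.
run_time = {name: e / p for name, (p, e) in measurements.items()}

gpp_energy = measurements["GPP"][1]
normalised_energy = {name: e / gpp_energy for name, (_, e) in measurements.items()}

for name in measurements:
    print(f"{name}: {run_time[name]:.1f} s, {normalised_energy[name]:.4f} of GPP energy")
```

The ratio of the implied GPP and FPGA run times comes out close to the speed-up reported in the speed comparison, so the two tables are mutually consistent.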


### **Performance per Watt**

| Platform | Performance<br>(MCUPS) per Watt | Normalised<br>Performance per Watt |
|----------|---------------------------------|------------------------------------|
| FPGA     | 508                             | 584                                |
| GPU      | 22                              | 25                                 |
| Cell BE  | 27                              | 31                                 |
| GPP      | 0.87                            | 1                                  |


### **Comparative Study 2: Quasi-Monte-Carlo-based Financial Option Pricing**

- Quasi-Monte Carlo (QMC) simulation of European options
  - Simulation of stochastic processes
  - Random sampling using Sobol numbers
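The three ingredients above can be sketched in a few lines. This is a minimal illustration, not the hardware design evaluated in the study: it assumes a European call under geometric Brownian motion, where only the terminal price is needed, and it uses a base-2 van der Corput sequence as a stand-in for a Sobol generator (it matches the first Sobol dimension up to ordering). A multi-dimensional simulation would need a real Sobol generator such as `scipy.stats.qmc.Sobol`. All function and parameter names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def van_der_corput(i):
    """Base-2 radical inverse of the integer i >= 1, a value in (0, 1)."""
    x, weight = 0.0, 0.5
    while i:
        if i & 1:
            x += weight
        i >>= 1
        weight *= 0.5
    return x

def qmc_european_call(s0, strike, rate, sigma, maturity, n_paths):
    """Price a European call by quasi-Monte Carlo over terminal prices."""
    drift = (rate - 0.5 * sigma ** 2) * maturity
    vol = sigma * sqrt(maturity)
    total = 0.0
    for i in range(1, n_paths + 1):
        z = _STD_NORMAL.inv_cdf(van_der_corput(i))  # uniform -> standard normal
        terminal = s0 * exp(drift + vol * z)        # geometric Brownian motion
        total += max(terminal - strike, 0.0)        # call payoff
    return exp(-rate * maturity) * total / n_paths

def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form reference price, used only to check the simulation."""
    d1 = (log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * sqrt(maturity))
    d2 = d1 - sigma * sqrt(maturity)
    return s0 * _STD_NORMAL.cdf(d1) - strike * exp(-rate * maturity) * _STD_NORMAL.cdf(d2)

estimate = qmc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 2 ** 14)
reference = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"QMC estimate {estimate:.4f}, closed form {reference:.4f}")
```

Each path is independent of the others, which is what makes this workload a natural fit for the deeply pipelined and replicated hardware compared in this study.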



### **Comparative Study 2: Generic Hardware Skeleton for QMC Simulation**



X. Tian and K. Benkrid, "High Performance Quasi-Monte Carlo Financial Simulation: FPGA vs. GPP vs. GPU", to appear in ACM Transactions on Reconfigurable Technology and Systems, 2010.


- Xilinx Virtex4 VFX100-10 vs. NVIDIA GeForce 8800GTX GPU vs. 2.8GHz Intel Xeon Processor
- Designs and implementations were carried out by three PhD students, each with comparable experience on their respective platform
- Comparative criteria:
  - Speed performance
  - Energy Consumption
  - Development Time
  - Cost of development
  - Performance per \$
  - Performance per Watt

# Speed performance for pricing a single option, using 524,288 simulation paths

| Platform | Execution time (ms) | Performance (paths per second) | Normalised speed-up |
|----------|---------------------|--------------------------------|---------------------|
| FPGA     | 7.9                 | 66,618,551                     | 545:1               |
| GPU      | 86                  | 6,110,583                      | 50:1                |
| GPP      | 4291                | 122,180                        | 1:1                 |
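The throughput and speed-up columns follow from the time column (in ms) and the path count. The sketch below recomputes them; because the published times are rounded, the results agree with the table only to within rounding.

```python
PATHS = 524_288  # simulation paths per option, from the slide title

# Time per option in milliseconds, from the table above.
time_ms = {"FPGA": 7.9, "GPU": 86, "GPP": 4291}

# Throughput in paths per second, and speed-up relative to the GPP.
paths_per_sec = {name: PATHS / (ms / 1000) for name, ms in time_ms.items()}
speed_up = {name: time_ms["GPP"] / ms for name, ms in time_ms.items()}

for name in time_ms:
    print(f"{name}: {paths_per_sec[name]:,.0f} paths/s, {speed_up[name]:.0f}:1")
```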


### **Development times**

| Platform | Development time in Days |
|----------|--------------------------|
| FPGA     | 60                       |
| GPU      | 3                        |
| GPP      | 1                        |


### **Cost of development**

| Platform | Purchase cost (\$) | Development cost (\$) | Overall cost (\$) | Normalised overall cost |
|----------|--------------------|-----------------------|-------------------|-------------------------|
| FPGA     | 10,000             | 9,600                 | 19,600            | 17:1                    |
| GPU      | 1,350              | 480                   | 1,830             | 1.6:1                   |
| GPP      | 1,000              | 160                   | 1,160             | 1:1                     |


### **Performance per \$**

| Platform | Performance (paths/sec) per \$ spent | Normalised performance per \$ spent |
|----------|--------------------------------------|-------------------------------------|
| FPGA     | 3399                                 | 32:1                                |
| GPU      | 3339                                 | 32:1                                |
| GPP      | 105                                  | 1:1                                 |


#### **Power and energy consumption**

| Platform                 | Power (W) | Energy (J) | Normalised energy consumption |
|--------------------------|-----------|------------|-------------------------------|
| FPGA (clocked at 75 MHz) | 20        | 0.16       | 0.001:1                       |
| GPU                      | 95        | 8.5        | 0.05:1                        |
| GPP                      | 170       | 172        | 1:1                           |


### **Performance per Watt**

| Platform | Paths per Second Per<br>Watt | Normalised Performance<br>per Watt |
|----------|------------------------------|------------------------------------|
| FPGA     | 3,330,928                    | 1090:1                             |
| GPU      | 64,322                       | 21:1                               |
| GPP      | 3,055                        | 1:1                                |


# Reconfigurable computing, is it finally the time?

- The main competitive advantage of FPGA technology is performance per watt, although similar experiments on other applications are needed to confirm this
- High-performance computing applications, where power consumption is often a bottleneck, should therefore benefit greatly from this technology
- For many applications, FPGAs are more competitive than alternative technologies on performance per \$
- For this to hold, a speed-up of at least two orders of magnitude over GPPs and one order of magnitude over GPUs is needed
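The threshold in the last point follows from a simple ratio: relative performance per \$ is the speed-up divided by the relative overall cost, so an accelerator only wins when its speed-up exceeds its cost ratio. The sketch below applies this to the FPGA figures from the two studies; the helper name is illustrative.

```python
def perf_per_dollar_gain(speed_up, cost_ratio):
    """Performance per $ relative to the baseline (values > 1 favour the accelerator)."""
    return speed_up / cost_ratio

# (speed-up over GPP, overall cost normalised to GPP), from the two studies
fpga_cases = {
    "Smith-Waterman": (228, 50),
    "QMC option pricing": (545, 17),
}

gains = {name: perf_per_dollar_gain(s, c) for name, (s, c) in fpga_cases.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}x the GPP's performance per $")
```

The same reasoning explains the tie with the GPU in the QMC study: the FPGA solution costs about 10.7 times as much overall (19,600 / 1,830) and runs about 10.9 times faster (86 ms / 7.9 ms), so the two effects cancel.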

- GPUs are very competitive with FPGAs and GPPs on performance per \$. They are also competitive with GPPs on performance per watt
- The Achilles' heel of FPGAs is their long development time
  - Relatively low-level HDLs (VHDL/Verilog) are still dominant
  - A large part of FPGA solution development time is spent learning board-specific APIs and debugging in hardware (70% in our experiments!)
  - Unlike software, FPGAs do not currently offer forward/backward compatibility, not even within the same family!
  - FPGAs have relatively low technology maturity and a small user base compared to software


- Standard FPGA boards with standard languages and APIs can lower this hurdle drastically:
  - Standard FPGA boards' I/O
  - Standard High Level Languages (HLLs) for FPGA programming (C-to-gate)
  - Standard MPI-like support for FPGA-based process communication
- Unless standardisation efforts materialise, FPGAs will continue to struggle to gain a foothold in mainstream computing

- However, reconfigurable hardware has many advantages to bring to a heterogeneous computer platform, perhaps as a pre/co-processor:
  - Low latency and high bandwidth I/O
  - Reprogrammable custom hardware (high performance) and low power
- The recent announcement from Xilinx of a new FPGA architecture built around the ARM Cortex-A9 MPCore could serve as a template for standardisation


# Heterogeneous computing, is it the way forward?

- Increased choice in computer platforms means that heterogeneity is increasingly a practical and economical solution to many modern application needs
- The trend towards more consumer choice and increasingly diverse needs also favours heterogeneous computer platforms
- Many issues need to be addressed including:
  - Which Architecture?
  - Which Programming model?
  - Which Compiler?
  - Which Runtime System?

# Heterogeneous computing, is it the way forward?

### **Possible Heterogeneous Platform**




# Heterogeneous computing, is it the way forward?

- Heterogeneity makes programming more difficult, but the potential gains can justify the extra effort
- A multi-language programming flow is perhaps the answer
  - With standard interfaces between languages that allow for dynamic exchange of data and remote procedure invocation
- Heterogeneity facilitates performance and power scalability because it increases design space options for system developers
- Effective and efficient compiler and run-time system support for such systems is critical!



## Conclusions

- The lack of standards (APIs, boards, communication interfaces) is holding reconfigurable hardware technology back
- Reconfigurable hardware is likely to be part of a heterogeneous computing mix in the future because of its unique position in the mix
  - Low-latency and high-bandwidth I/O, custom hardware performance, low power and re-programmability
- Recent processor-centric architectures, e.g. the Xilinx platform based on the ARM Cortex-A9 MPCore, are a good sign
  - They ride the processor standards curve
- The custom computing community should encourage standards, open source IPs and platforms

## Acknowledgement

- PhD students
  - Mr Xiang Tian
  - Ms Ying Liu
  - Mr Server Kasap
- Research Collaborators
  - Dr. Tsuyoshi Hamada, Nagasaki University
  - Dr. Ali Akoglu, University of Arizona


# **Thank You For Your Attention!**

# **Questions?**
