

#### Department of Information Engineering University of Padova

### Exploiting Fine Grained Parallelism in SPE

E. Milani N. Zago

#### ICTCS, Varese, September 21st, 2012

### Table of contents

- 1 Introduction
- 2 Background and Previous Work
  - Models
  - Previous Work
  - Our Work
- 3 WT Implementation on SPE
  - Single Step
  - Whole Algorithm
  - Applications



#### Introduction

#### Fundamental Problem

- The RAM model does not capture memory access complexity
- Computational complexity is not enough on actual machines

#### Strategy

- Machine models (memory and processor)
- Algorithmic techniques

#### Questions

- What can be imported from other settings/contexts?
- Is it possible/convenient to exploit parallelism in a scalar setting?

## Models and Algorithms

Two major strategies to cope with latency:

- Temporal/Spatial Locality
  - $\Rightarrow$  Hierarchical Memories, Block Transfer
  - $\Rightarrow$  Memory access function $a(x)$
- Concurrency
  - $\Rightarrow$  Pipelined Memories
  - $\Rightarrow$  Constant access request rate
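
A toy cost model can make the two strategies concrete. The sketch below is an illustration only, not a definition taken from the talk: it assumes that an access to an address below `x` costs `a(x)`, that dependent accesses pay this latency one after another, and that a pipelined memory accepts one request per time unit, so `k` independent accesses cost `a(x) + k - 1`. All function names are hypothetical.

```python
import math

def a_poly(x, alpha=0.5):
    """Example access function a(x) = x^alpha (assumed for illustration)."""
    return x ** alpha

def a_log(x):
    """Example access function a(x) = log2(x), never below 1 (assumed)."""
    return max(1.0, math.log2(x))

def serial_cost(k, x, a):
    """k accesses below address x, each one waiting for the previous one."""
    return k * a(x)

def pipelined_cost(k, x, a):
    """k independent accesses issued at a constant rate of one per time unit:
    the latency a(x) is paid once, then one result arrives per time unit."""
    return a(x) + (k - 1)

if __name__ == "__main__":
    k, x = 1024, 2 ** 20
    for name, a in (("x^0.5", a_poly), ("log x", a_log)):
        print(name, serial_cost(k, x, a), pipelined_cost(k, x, a))
```

With concurrency the latency term is added to the number of accesses instead of multiplying it.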


### Memory Models




### Speculative Prefetcher and Evaluator

- Instructions are executed in segments of variable size (`segmentsize()`)
- No slowdown because of data dependencies...
- ... or even because of address dependencies of depth $O(1)$
- Enhancements such as dynamic loop unrolling and branch prediction are allowed

### Parallel Random Access Machine

Many equivalent flavours: SIMD, MIMD...

The actual difference is in how shared memory is managed:

- EREW: exclusive read, exclusive write
- CREW: concurrent read, exclusive write
- CRCW: concurrent read, concurrent write (contention policy)


#### Work-Time Framework

- Parallel programming model that targets PRAMs
- The `pardo` statement defines parallel steps
- An algorithm is a sequence of sets $s_1, \ldots, s_T$ of instructions operating on $M$ cells of memory
- In general, each $s_i$ has a different size $p_i$
- Time $T$; Work $W = \sum_{i=1}^{T} p_i$
- Easily scheduled on a PRAM with $P$ processors: $O(W/P + T)$ time
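
The quantities above can be computed directly from the list of step sizes. The following is a minimal sketch, assuming a WT algorithm is described only by its step sizes $p_1, \ldots, p_T$ and that step $s_i$ is run on $P$ processors in $\lceil p_i / P \rceil$ rounds (the usual Brent-style scheduling argument). The function names and the example algorithm are illustrative assumptions.

```python
import math

def work_and_time(step_sizes):
    """Return (W, T) for a WT algorithm whose i-th parallel step has p_i operations."""
    return sum(step_sizes), len(step_sizes)

def schedule_length(step_sizes, P):
    """Rounds needed when each step s_i is split into ceil(p_i / P) rounds
    on P processors; the total never exceeds W/P + T."""
    return sum(math.ceil(p / P) for p in step_sizes)

if __name__ == "__main__":
    # Balanced-tree sum of 16 values: parallel steps of size 8, 4, 2, 1.
    steps = [8, 4, 2, 1]
    W, T = work_and_time(steps)
    rounds = schedule_length(steps, P=4)
    print(W, T, rounds)  # prints: 15 4 5
```

Here $W/P + T = 15/4 + 4 = 7.75$, and the schedule needs 5 rounds, within the bound.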



### Exploiting Parallelism

- PRAM to Disk Model
  - Chiang, Y., Goodrich, M. T., Grove, E. F., Tamassia, R., Vengroff, D. E., Vitter, J. S.: External-Memory Graph Algorithms. SODA '95
- D-BSP to Hierarchical Memory
  - Fantozzi, C., Pietracaprina, A. A., Pucci, G.: Translating Submachine Locality into Locality of Reference. *Journal of Parallel and Distributed Computing 66*
- PRAM to Pipelined Memory
  - Luccio, F., and Pagli, L.: A model of sequential computation with pipelined access to memory. *Math. Syst. Theory 26*


### Our Work

#### Our Work

- Parallel model: Work-Time Framework
- Sequential model: SPE
- A general technique for implementing WT Algorithms on SPE
- Large classes of optimal SPE programs

#### Novelty

- Explicitly refer to a feature of problems: available parallelism
- Target physically implementable machines, not bound to a particular memory access function


## WT Simulation - Single Step

#### Simulation of parallel step *i* (exclusive write)

#### WT

$\begin{array}{l} \text{for } j,\ 1 \leq j \leq p \text{ pardo} \\ \quad \text{operation}_j \end{array}$

#### SPE

$\begin{array}{l} \text{segmentsize}(\min(k, p)) \\ \text{for } j,\ 1 \leq j \leq p \text{ do} \\ \quad \text{instructions}_j \end{array}$

- Program size: $O(1)$
- $\Delta$ Space: $O(p_i)$
- Time: $O(p_i + a(M_{SPE}^i))$, where $p_i$ is the size $p$ of step $i$

## WT Simulation - Single Step

- Program size does not depend on input size
  - $\Rightarrow$  negligible instruction load latency
- Data memory use may increase by up to $p_i$
  - $\Rightarrow$  heavily depends on the WT algorithm: its degree of memory reuse and the amount of output produced
- Memory accesses fully overlap
  - $\Rightarrow$  time proportional to $p_i + a(M_{SPE}^i)$


## WT Simulation - Concurrent Write

Different solutions, depending on the concurrent write policy:

- priority policy  $\rightarrow$  predicated instructions
- associative op. policy  $\rightarrow$  accumulation

```
segmentsize(min(k, p))
for j, 1 ≤ j ≤ p do
    instructions_j
    acc ← max{acc; output_j}
```

- An `if (test)` statement can also be used, provided the test is simple enough.
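
The two policies can be sketched sequentially as follows. This is an illustration under assumptions, not the SPE code from the talk: a write request is modelled as an `(address, value)` pair whose list position is the index $j$, lower $j$ is assumed to have higher priority, and the predicate is modelled with a plain `if`. The function names are hypothetical.

```python
def priority_write(memory, requests):
    """Priority policy: among concurrent writers to one cell, the lowest index wins.
    Each write is predicated on the cell not having been written in this step."""
    written = set()
    for addr, value in requests:      # j = 1, ..., p in priority order
        if addr not in written:       # predicate guarding the store
            memory[addr] = value
            written.add(addr)
    return memory

def max_write(memory, requests):
    """Associative-operator policy with max: every cell written in this step
    receives the maximum of the values sent to it (accumulation)."""
    acc = {}
    for addr, value in requests:
        acc[addr] = value if addr not in acc else max(acc[addr], value)
    for addr, value in acc.items():
        memory[addr] = value
    return memory

if __name__ == "__main__":
    requests = [(0, 5), (1, 7), (0, 9)]
    print(priority_write([0, 0], requests))  # prints: [5, 7]
    print(max_write([0, 0], requests))       # prints: [9, 7]
```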

## WT Simulation - Whole Algorithm

- Total space complexity:  $M_{PH} = O(n+W)$
- Total time complexity:  $O(W + T \cdot a(M_{PH}))$

#### Optimal SPE programs if

- The WT algorithm is work-optimal
- The average parallelism $W/T$ is larger than the worst-case latency $a(M_{PH})$
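
The total time can be obtained by summing the single-step bound $O(p_i + a(M_{SPE}^i))$ over the $T$ steps. The derivation below is a sketch; it assumes that the access function $a$ is nondecreasing and that $M_{SPE}^i \le M_{PH}$ for every step $i$, so that each step pays at most $a(M_{PH})$ of latency.

$$\sum_{i=1}^{T} O\bigl(p_i + a(M_{SPE}^i)\bigr) = O\Bigl(\sum_{i=1}^{T} p_i + T \cdot a(M_{PH})\Bigr) = O\bigl(W + T \cdot a(M_{PH})\bigr)$$

When $W/T \ge a(M_{PH})$, the latency term satisfies $T \cdot a(M_{PH}) \le W$, and the bound reduces to $O(W)$.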

## Merge

#### Problem: merging 2 sorted lists of *n* elements.

- Kruskal algorithm:  $T = O(\log n)$, $W = O(n)$
- $\Rightarrow$  $T_{SPE} = O(W + T \cdot a(M)) = O(n + \log n \cdot a(n))$
- Linear for $a(x) = x^{\alpha}$, $0 < \alpha < 1$, and for $a(x) = \log x$

On other hierarchical models:

- $O(n \log n)$  if  $a(x) = x^{\alpha}, 0 < \alpha < 1$
- $O(n \log^* n)$  if  $a(x) = \log x$
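
A quick numeric check illustrates why the SPE bound is linear for both access functions. The sketch below drops all hidden constants and takes $W = n$, $T = \log_2 n$ and $M = n$; these simplifications and the function name `t_spe` are assumptions made for illustration.

```python
import math

def t_spe(n, a):
    """Bound W + T * a(M) for merging, with W = n, T = log2(n), M = n
    (hidden constants dropped)."""
    return n + math.log2(n) * a(n)

if __name__ == "__main__":
    for e in (10, 20, 30):
        n = 2 ** e
        # Ratio to n for a(x) = x^0.5 and for a(x) = log2(x).
        print(e, t_spe(n, math.sqrt) / n, t_spe(n, math.log2) / n)
```

Both ratios stay bounded and approach 1 as $n$ grows, because $\log n \cdot a(n) = o(n)$ in either case.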


## MergeSort

#### Problem: sorting a list of *n* elements.

#### Warning

Merge is linear only if the input is in the fastest $O(n)$ locations.

#### Solution

When merge instances are too small with respect to latency, execute them in an interleaved fashion.
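
The interleaving idea can be sketched as a round-robin over independent merge instances. This is only an illustration of the instruction order: in plain Python nothing is overlapped, and the generator-based structure and the function names are assumptions rather than the construction used in the talk. The point is that consecutive operations belong to different instances, so they do not depend on each other.

```python
def merge_steps(left, right):
    """Merge two sorted lists, yielding one output element per step."""
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            yield left[i]
            i += 1
        else:
            yield right[j]
            j += 1
    yield from left[i:]
    yield from right[j:]

def interleaved_merges(pairs):
    """Advance all merge instances round-robin, one output element at a time."""
    gens = [merge_steps(left, right) for left, right in pairs]
    outputs = [[] for _ in pairs]
    active = list(range(len(gens)))
    while active:
        still_active = []
        for idx in active:
            try:
                outputs[idx].append(next(gens[idx]))
                still_active.append(idx)
            except StopIteration:
                pass  # this instance has produced all its output
        active = still_active
    return outputs

if __name__ == "__main__":
    print(interleaved_merges([([1, 4], [2, 3]), ([5], [0, 6, 7])]))
    # prints: [[1, 2, 3, 4], [0, 5, 6, 7]]
```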

## Non–local Matrix Multiplication

#### Problem: multiplying 2 $n \times n$ matrices.

- The well-known WT algorithm has  $T = O(\log n)$,  $W = O(n^3)$
- $\Rightarrow$   $T_{SPE} = O(W + T \cdot a(M)) = O(n^3 + \log n \cdot a(n^3))$
- Optimal even if no locality is exploited

#### Space complexity

$O(n^3)$  memory is required!
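
The space requirement is visible in a direct sequential rendering of the parallel algorithm. The sketch below assumes the standard formulation (one parallel step computing all $n^3$ products, followed by $\log_2 n$ pairwise-sum steps) and assumes $n$ is a power of two; the function name and the work/time counters are illustrative.

```python
def wt_matmul(A, B):
    """Return (C, work, time) for C = A * B, counting WT work and parallel steps.
    Assumes square matrices whose size n is a power of two."""
    n = len(A)
    # One parallel step with n^3 operations: all products, stored in n^3 cells.
    prod = [[[A[i][k] * B[k][j] for k in range(n)]
             for j in range(n)] for i in range(n)]
    work, time = n ** 3, 1
    # log2(n) parallel steps: pairwise sums halve each length-n product vector.
    width = n
    while width > 1:
        half = width // 2
        for i in range(n):
            for j in range(n):
                for k in range(half):
                    prod[i][j][k] += prod[i][j][k + half]
        work += n * n * half
        time += 1
        width = half
    C = [[prod[i][j][0] for j in range(n)] for i in range(n)]
    return C, work, time

if __name__ == "__main__":
    C, work, time = wt_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(C, work, time)  # prints: [[19, 22], [43, 50]] 12 2
```

The `prod` array holds all $n^3$ partial products at once, which is where the $O(n^3)$ memory comes from.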

## Conclusions and Future Work

#### Conclusions

- Parallelism is a viable strategy for overlapping memory accesses
- Too much parallelism leads to a large memory footprint

#### Future Work

- Integration with locality exploitation
- Exploitation of *coarse grained* parallelism (D-BSP)

# Thank you for your attention!

... questions?