Servet: A Benchmark Suite for Autotuning on Multicore Clusters

Jorge González-Domínguez\*, Guillermo L. Taboada, Basilio B. Fraguela, María J. Martín, Juan Touriño

> Computer Architecture Group, University of A Coruña (Spain), {jgonzalezd,taboada,basilio.fraguela,mariam,juan}@udc.es

24th IEEE International Parallel and Distributed Processing Symposium (IPDPS'10), Atlanta, GA, USA


1. Introduction
   - Autotuned Codes
   - Extraction of System Parameters
2. Cache Topology
   - Cache Size Estimate
   - Determination of Shared Caches
3. Memory Access Overhead Characterization
4. Determination of Communication Costs
5. Conclusions


### Autotuning

Codes that automatically adapt to the machine on which they run in order to maximize performance.

#### Autotuned Sequential Libraries

- ATLAS -> Numerical computing (BLAS)
- FFTW3 -> Discrete Fourier transform
- Spiral -> Digital signal processing (DSP) algorithms
- They rely on a wide search to find the most suitable algorithm
- Knowing some hardware characteristics can reduce their search times


### Autotuning Techniques (I)

#### Examples of Autotuning Techniques

- Tiling -> Divide the computation into blocks of data that fit in cache, to minimize the number of cache misses.
- Efficient communications:
  - Minimizing the use of interconnection networks
  - Increasing the use of shared memory -> usually faster
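As a rough illustration of tiling, the sketch below derives a tile side from a cache size and multiplies two square matrices block by block. The function names, the rule of keeping three tiles in cache and the 8-byte element size are assumptions made for this example; they are not part of Servet.

```python
import math

def tile_side(cache_bytes, elem_bytes=8, tiles_in_cache=3):
    """Largest square tile side such that `tiles_in_cache` tiles fit in cache.

    The "three tiles" rule (one tile each of A, B and C) is an assumption.
    """
    return max(1, math.isqrt(cache_bytes // (tiles_in_cache * elem_bytes)))

def matmul_tiled(a, b, n, t):
    """Multiply two n x n matrices (lists of lists) using t x t tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                # Work on one tile of C using one tile of A and one of B.
                for i in range(ii, min(ii + t, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + t, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + t, n)):
                            row_c[j] += aik * row_b[j]
    return c
```

An autotuned code would call something like `tile_side` with the cache size reported by a benchmark instead of searching over many block sizes.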


### Autotuning Techniques (II)

[Figure not recovered.]

### Communications Algorithm: Option 1

[Figure sequence in five steps; only the labels "Node 0" and "Node 1" are recoverable.]

### Communications Algorithm: Option 2

[Figure sequence in four steps; only the label "Node 1" is recoverable.]

### Autotuning Techniques (and III)

#### Examples of Autotuning Techniques

- Tiling -> Divide the computation into blocks of data that fit in cache, to minimize the number of cache misses.
- Efficient communications:
  - Minimizing the use of interconnection networks
  - Increasing the use of shared memory -> usually faster
- Mapping policies to minimize:
  - Cache misses because of shared caches
  - Memory access overheads
  - Use of interconnection networks

### Mapping Policies: Reducing Communication Costs

[Figure not recovered.]

### Mapping Policies: Reducing Memory Access Overhead

[Figure not recovered.]

### **Obtaining the System Parameters**

#### Option 1 -> From the machine specifications

- Always available?
- Where?
- Restricted access?
- In what format?
- Accurate?

#### Option 2 -> With benchmarks

- General -> Can always be used, without restrictions
- Portable -> Where and in what format the parameters are obtained does not depend on the vendor
- Accurate -> They measure the real behavior of the machine


### **Related Work**

#### Shortcomings of Previous Work

- Non-portable methods to estimate the sizes of physically indexed caches
- Do not consider different memory access overheads
- Poor communication characterization -> Not necessary in multicores
- Code not available


### Main Idea

[Figure not recovered.]

T -> Average access time, $T_A < T_B < T_C$

### **Dunnington Example: Cycles View**

[Figure not recovered.]

### Dunnington Example: Gradient View (I)

[Figure not recovered.]

$L1 = 32KB\checkmark$ $L2 = 1MB\times$

### L2 Problem

#### Physically Indexed Caches

- Most L2 and L3 caches
- If the cache size is larger than the page size, contiguity in virtual memory does not imply adjacency in physical memory
- Cache misses occur in tests with array sizes smaller than the cache under study

#### Solutions

- Making the cache behave as if it were virtually indexed
  - Page coloring by the OS -> Not in Linux
  - Calls to OS functions -> Previous works -> Not portable
- Estimating the size from the physically indexed behavior -> Servet


### Probabilistic Algorithm (I)

#### Statistical Aspects of Cache Misses

- Page size -> PS
- Cache: size -> CS; associativity -> K; number of page sets -> CS/(K \* PS)
- Number of pages in a test -> NP
- The probability that a given virtual page is mapped to a given page set is uniform ⇒ the number of pages X per page set follows B(NP, (K \* PS)/CS)
- As each page set can hold up to K pages without conflict ⇒ P(X > K) is the miss rate when accessing NP pages
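Written out, the miss rate in the last bullet is the upper tail of the binomial distribution. The shorthand $p$ and the numeric values in the example are illustrative assumptions, not taken from the slides.

$$
p = \frac{K \cdot PS}{CS}, \qquad P(X > K) = 1 - \sum_{x=0}^{K} \binom{NP}{x}\, p^{x} (1-p)^{NP-x}
$$

For instance, with PS = 4 KB, K = 12 and CS = 3 MB, $p = 1/64$ and the expected number of pages per page set is NP/64. An array of NP = 384 pages (half the cache size) averages 6 pages per set, yet some sets can still receive more than 12, which is why misses appear before the array size reaches the cache size.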


### Probabilistic Algorithm (and II)

#### Algorithm

Inputs: S[n], C[n] (array sizes and measured cycles of the n tests)

- hit_time = MIN(C); miss_overhead = MAX(C) - MIN(C)
- for (i = 0; i < n; i = i + 1)
  - MR[i] = (C[i] - hit_time) / miss_overhead
  - NP[i] = S[i] / PS
- foreach (CS, K)
  - div[CS][K] = 0
  - for (i = 0; i < n; i = i + 1)
    - div[CS][K] = div[CS][K] + |MR[i] - P(X > K)|, with X ∈ B(NP[i], (K \* PS)/CS)

Result: the statistical mode of CS among the five elements of div with the lowest values
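The algorithm above can be sketched in a few lines. This is a minimal sketch, not Servet's actual code: the function names, the candidate list of (CS, K) pairs and the handling of edge cases are assumptions, and it expects the measured cycles not to be all equal.

```python
from math import comb
from statistics import mode

def tail(np_pages, p, k):
    """P(X > k) for X ~ B(np_pages, p)."""
    if np_pages <= k:
        return 0.0  # a page set cannot receive more pages than exist
    if p >= 1.0:
        return 1.0  # a single page set receives every page
    hit = sum(comb(np_pages, x) * p**x * (1 - p)**(np_pages - x)
              for x in range(k + 1))
    return max(0.0, 1.0 - hit)

def estimate_cache_size(sizes, cycles, page_size, candidates):
    """sizes[i]: array size in bytes; cycles[i]: measured cycles per access.

    candidates: list of (CS, K) pairs to try (an assumption of this sketch).
    """
    hit_time = min(cycles)
    miss_overhead = max(cycles) - hit_time
    mr = [(c - hit_time) / miss_overhead for c in cycles]   # measured miss rates
    npages = [s // page_size for s in sizes]
    div = {}
    for cs, k in candidates:
        p = (k * page_size) / cs
        div[(cs, k)] = sum(abs(m - tail(n, p, k)) for m, n in zip(mr, npages))
    best = sorted(div, key=div.get)[:5]       # five lowest divergences
    return mode([cs for cs, _ in best])       # most frequent CS among them
```

Taking the mode over the five best candidates, rather than the single best one, makes the estimate less sensitive to the associativity guess, since candidates with the right CS and different K tend to fit similarly well.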

### Dunnington Example: Gradient View (and II)

[Figure not recovered.]

$L1 = 32KB\checkmark$ $L2 = 3MB\checkmark$ $L3 = 12MB\checkmark$

### **Experimental Evaluation**

#### Academic Environment

- 70 different machines
- 147 different caches
- 140 correct estimates -> 95%
- Fixing the wrong estimates -> Next version of Servet


### Main Idea

#### Not Shared Cache

[Figure not recovered.]

### **Determination of Shared Caches**

#### Algorithm

- foreach (CS)
  - ref = number of cycles when a single core accesses an array of size (CS \* 2)/3
  - foreach (pair of cores)
    - c = number of cycles when both cores simultaneously access an array of size (CS \* 2)/3
    - ratio = c/ref
    - if ratio > 2 then SHARED CACHE
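A minimal sketch of this pairwise test follows. The `measure_cycles` callback is hypothetical: it stands for whatever routine pins processes to the given cores, traverses the arrays and returns the cycles per access; only the control flow mirrors the algorithm above.

```python
from itertools import combinations

def shared_cache_pairs(cores, cache_sizes, measure_cycles, threshold=2.0):
    """Return {CS: [pairs of cores that appear to share the cache of size CS]}.

    measure_cycles(active_cores, array_bytes) is a hypothetical callback that
    returns the cycles per access while all active cores traverse an array.
    """
    shared = {}
    for cs in cache_sizes:
        array_bytes = (cs * 2) // 3
        # Reference: one isolated core, whose array fits in the cache.
        ref = measure_cycles((cores[0],), array_bytes)
        shared[cs] = [pair for pair in combinations(cores, 2)
                      if measure_cycles(pair, array_bytes) / ref > threshold]
    return shared
```

With an array of two thirds of the cache size, one array fits but two do not, so two cores that share the cache evict each other's data and the cycle count rises sharply.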

### **Initial Topology**

[Figure not recovered.]

### **Dunnington Example: L1 Results**

[Figure not recovered.]

L1 -> not shared

### Topology with L1

[Figure not recovered.]

### **Dunnington Example: L2 Results**

[Figure not recovered.]

L2 -> shared by cores 0 and 12

### Topology with L2

[Figure not recovered.]

### **Dunnington Example: L3 Results**

[Figure not recovered.]

L3 -> shared by cores 0, 1, 2, 12, 13 and 14

### Topology with L3

[Figure not recovered.]

### **Detection of Different Memory Access Overheads**

#### Algorithm

```
n = 0
ref = memory bandwidth when one isolated core accesses memory
foreach(pair of cores)
   b = bandwidth of one process when both cores access memory
   if(b < ref)
      if(b similar to a given BW[i])
         Add the pair to P_m[i]
      else
         BW[n] = b
         P_m[n] = The used pair
         n = n + 1
```
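The grouping step can be sketched as follows. The `measure_bw` callback and the relative tolerance used to decide that two bandwidths are "similar" are assumptions of this sketch; the slides do not state how similarity is defined.

```python
from itertools import combinations

def group_by_overhead(cores, measure_bw, rel_tol=0.05):
    """Group pairs of cores by the bandwidth one process gets when both run.

    measure_bw(active_cores) is a hypothetical callback returning the memory
    bandwidth (MB/s) seen by one process while all active cores access memory.
    """
    ref = measure_bw((cores[0],))         # isolated access
    bw, pairs = [], []                    # BW[i] and P_m[i] in the algorithm
    for pair in combinations(cores, 2):
        b = measure_bw(pair)
        if b >= ref * (1 - rel_tol):
            continue                      # no noticeable overhead for this pair
        for i, known in enumerate(bw):
            if abs(b - known) <= rel_tol * known:
                pairs[i].append(pair)     # same kind of overhead as group i
                break
        else:
            bw.append(b)                  # first pair showing this overhead
            pairs.append([pair])
    return ref, bw, pairs
```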

### **Dunnington Results**

[Figure not recovered.]

60% of the bandwidth when cores access memory in pairs

### Finis Terrae Architecture

[Figure sequence in three steps; not recovered.]

### Finis Terrae Results (I)

[Figure not recovered.]

### Finis Terrae Results (and II)

#### Memory Bandwidth in Finis Terrae

- Isolated accesses
  - 2200 MBytes/s
- Cores on the same bus
  - 990 MBytes/s
  - Only 45% of the bandwidth
- Cores in the same cell on different buses
  - 1650 MBytes/s
  - 75% of the bandwidth
- Cores in different cells
  - The same bandwidth as isolated accesses

#### Memory Access Bandwidth per Overhead

[Figure not recovered.]

### **Determination of Communication Layers**

#### Algorithm

```
n = 0
foreach(pair of cores)
   l = latency with a message of L1 size between the two cores
   if(l similar to a given L[i])
      Add the pair to P_l[i]
   else
      L[n] = l
      P_l[n] = The used pair
      n = n + 1
```
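A minimal sketch of this layer detection is shown below. The `measure_latency` callback and the relative tolerance are assumptions, and the final sort by latency is an addition of this sketch rather than part of the algorithm on the slide.

```python
from itertools import combinations

def communication_layers(cores, measure_latency, rel_tol=0.1):
    """Cluster pairs of cores into layers with similar message latency.

    measure_latency(a, b) is a hypothetical callback returning the latency of
    a message of L1 size exchanged between cores a and b.
    """
    layers = []                               # each entry: [latency, [pairs]]
    for pair in combinations(cores, 2):
        l = measure_latency(*pair)
        for layer in layers:
            if abs(l - layer[0]) <= rel_tol * layer[0]:
                layer[1].append(pair)         # pair belongs to a known layer
                break
        else:
            layers.append([l, [pair]])        # a new communication layer
    return sorted(layers, key=lambda layer: layer[0])
```

Each resulting layer can then be characterized once, using a single representative pair, instead of benchmarking every pair of cores in the cluster.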









### Dunnington Results (and II)

#### Dunnington (messages of L1 size)

- Intra-processor communications (sharing L2 cache)
  - 2130 MBytes/s
- Intra-processor communications (not sharing L2 cache)
  - 1780 MBytes/s
  - 83% of the highest possible bandwidth
- Inter-processor communications
  - 750 MBytes/s
  - Only 35% of the highest possible bandwidth

### Characterization of Communication Layers

[Figure not recovered.]

### Conclusions

#### Suite to detect hardware parameters:

- Cache sizes
- Shared cache topology
- Memory access overheads
- Communication bandwidths

#### Characteristics

- Portable
- Highly accurate
- Designed to support autotuning on multicore clusters
- Freely available: http://servet.des.udc.es

### Give it a try and enjoy!

http://servet.des.udc.es

**Contact:** Jorge González-Domínguez, *jgonzalezd@udc.es*, Computer Architecture Group, Dept. of Electronics and Systems, University of A Coruña, Spain