http://www.tacpa.org
Taiwanese American Computer Professional Association
Monthly Lecture Series
Introduction to High Performance Computing
hai.tang@nist.gov
What is High Performance Computing?
Fast data crunching; a lot of work done per unit of time.
How to achieve High Performance Computing?
Historically,
- Fast CPUs - 50 to 200 MHz
- Vector processing - 100 to 330 MFLOPS
- Parallel/vector processing - multiples of 100 to 330 MFLOPS
- Multiple superscalar processors - up to 300 MFLOPS/CPU
- Smart Math Algorithms
Parallel processing models:
- Shared-Memory - all processors share a common memory.
  Shared-Memory systems are usually huge and expensive.
- Distributed-Memory - each processor has its own memory
  and exchanges data with the others by message passing.
  Distributed-Memory systems are scalable.
Shared-Memory SIMD Models:
- Dividing a Shared-Memory into Blocks
|-------------|
| Processor 1 |-----------o-------------o-------------o
|-------------|           |             |             |
                          |             |             |
|-------------|           |             |             |
| Processor 2 |-----------o-------------o-------------o
|-------------|           |             |             |
                          |             |             |
|-------------|           |             |             |
| Processor 3 |-----------o-------------o-------------o
|-------------|           |             |             |
      .                   .             .             .
      .                   .             .             .
|-------------|           |             |             |
| Processor N |-----------o-------------o-------------o
|-------------|           |             |             |
                          |             |             |
                    |----------|  |----------|  |----------|
                    |  Memory  |  |  Memory  |  |  Memory  |
                    | Block 1  |  | Block 2  |  | Block 3  |
                    |----------|  |----------|  |----------|
Cost of decoding circuitry for dividing a shared memory of size
M into R blocks: R x f(M/R)
Cost of switches (relatively inexpensive): N x R
- Interconnected-Network Model
Fully interconnecting N processors requires N x (N-1)/2 lines
- Linear Array
- Two-Dimensional Array (Mesh)
- Tree Connection
- Perfect Shuffle Connection - a one-way line links Pi to Pj
  (see the sketch after this list), where

          / 2i       for 0 <= i <= N/2-1
      j = |
          \ 2i+1-N   for N/2 <= i <= N-1
- Cube Connection (q-dimensional cube, or hypercube)
  N = 2**q
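
A small C sketch of the perfect-shuffle formula above (N is
assumed to be a power of two):

    #include <stdio.h>

    /* Perfect-shuffle successor: j = 2i for 0 <= i <= N/2-1, and
       j = 2i+1-N for N/2 <= i <= N-1; equivalently a one-bit left
       rotation of i's q-bit binary index when N = 2**q. */
    unsigned shuffle(unsigned i, unsigned N)
    {
        return (i < N / 2) ? 2 * i : 2 * i + 1 - N;
    }

    int main(void)
    {
        unsigned N = 8;                      /* q = 3, N = 2**3 */
        for (unsigned i = 0; i < N; i++)
            printf("P%u -> P%u\n", i, shuffle(i, N));
        return 0;
    }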
Origin 2000 Multilayer Shared-Memory:
|-----------|   |-----------|           |-----------|
| Processor |   | Processor |           | Processor |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
|  Memory   |   |  Memory   |           |  Memory   |
|     1     |   |     2     |           |     N     |
|-----------|   |-----------|           |-----------|
      ^               ^                       ^
      |               |                       |
      |               |                       |
      v               v                       v
|----------------------------------------------|
|                                              |
|          ORIGIN 2000 Shared-Memory           |
|                                              |
|----------------------------------------------|
MIMD Models:
|----------------------------------------------|
|                SHARED MEMORY                 |
|                                              |
|                      OR                      |
|                                              |
|            INTERCONNECTED NETWORK            |
|                                              |
|----------------------------------------------|
      |               |                       |
      |DATA           |DATA                   |DATA
      |STREAM         |STREAM                 |STREAM
      | 1             | 2                     | N
      v               v                       v
|-----------|   |-----------|           |-----------|
| Processor |   | Processor |           | Processor |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
      ^               ^                       ^
      |INSTRUCTION    |INSTRUCTION            |INSTRUCTION
      |STREAM         |STREAM                 |STREAM
      | 1             | 2                     | N
      |               |                       |
|-----------|   |-----------|           |-----------|
|  Control  |   |  Control  |           |  Control  |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
Distributed-Memory Models:
|----------------------------------------------|
|           Closely Connected Switch           |
|                                              |
|                      OR                      |
|                                              |
|            INTERCONNECTED NETWORK            |
|          (Fast-Ethernet, ATM, etc.)          |
|                                              |
|----------------------------------------------|
      ^               ^                       ^
      |DATA           |DATA                   |DATA
      |EXCHANGE       |EXCHANGE               |EXCHANGE
      | 1             | 2                     | N
      v               v                       v
|-----------|   |-----------|           |-----------|
| Processor |   | Processor |           | Processor |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
|  Memory   |   |  Memory   |           |  Memory   |
|     1     |   |     2     |           |     N     |
|-----------|   |-----------|           |-----------|
- NOWs, Network of Workstations
- POP, Pile of PCs
Operating Systems:
High Performance Software:
- PVM, Parallel Virtual Machine
- MPI, Message Passing Interface (see the sketch below)
- HPF, High Performance Fortran
- Java - multithreading
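
As a minimal sketch of the message-passing style MPI supports
(not tied to any particular system above), each process reports
its rank; compile with an MPI wrapper such as mpicc and launch
with mpirun:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start MPI       */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process id   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total processes */
        printf("Process %d of %d\n", rank, size);
        MPI_Finalize();                         /* shut down MPI   */
        return 0;
    }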
Comparison of NIST High Performance Systems
Machine             Convex C3820   Cray C90/6256   IBM SP2    SGI Origin 2000
Number of CPUs      2              6               32         8
CPU speed - MHz     62.5           250             66.7       200
Data cache size     N.A.           N.A.            128/256K   4M L2
Memory size - MB    1024           2048            512/node   8192
Vector processing   Yes            Yes             No         No
Disk storage - GB   27             110             2/node     80
IBM SP2 (Scalable POWERparallel 2) Machine
- POWER2-based RISC processors - 66.7 MHz, 4 floating-point
  operations per cycle, for a peak of 266 MFLOPS.
- Large local memory (64-256 MB) and hard disk drive capacity
  (2-4.5 GB) per processor.
- High-Performance Switch (LC8) at 40 MB/sec.
- Low latency and high bandwidth on all processors.
- Scalability via message passing, up to 512 processors.
- High-speed connectivity to the world - HIPPI, ATM, FDDI.
- Superior batch performance.
SP1/SP2 Machine Parameters
                         SP1      SP2-thin   SP2-wide
RS6K equiv. model no.    370      390        590
CPU speed, MHz           62.5     66.7       66.7
Data cache size, bytes   32K      64K        256K
Cache line size, bytes   64       128        256
No. of cache lines       128x4    128x4      256x4
TLB entries              128      512        512
"TLB space", bytes       512K     2M         2M
Main memory, bytes       64M      128M       >=256M
SP2 Architecture Performance
SP2 Cache Classes and Associativity
The 64KB data cache is organized into four 16KB congruence
classes. Each congruence class consists of 128 lines, and each
line is 128 bytes. Logically, virtual storage is similarly
divided into congruence classes of 16KB. Any memory location
in virtual storage can be associated with one of a group of
four cache lines out of the 512 cache lines available.
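
A small sketch of that mapping (the helper name is illustrative):
with 128-byte lines and 16KB congruence classes, an address's
offset within its 16KB block selects its line slot, so addresses
16KB apart compete for the same group of four lines.

    #include <stdio.h>

    #define LINE_BYTES  128
    #define CLASS_BYTES (16 * 1024)

    /* Line slot an address maps to within each 16KB congruence
       class; the hardware may place the data in that slot of any
       of the four classes. */
    unsigned line_slot(unsigned long addr)
    {
        return (addr % CLASS_BYTES) / LINE_BYTES;   /* 0..127 */
    }

    int main(void)
    {
        /* Three addresses 16KB apart share one group of four
           lines; a fifth such address would force an eviction. */
        printf("%u %u %u\n", line_slot(0x0000),
                             line_slot(0x4000),
                             line_slot(0x8000));
        return 0;
    }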
Serial and Parallel Execution Times

Let T(1) be the execution time on one processor and T(N) the
execution time on N processors. The speed-up is

    S(N) = T(1) / T(N)

where, if f is the fraction of the work that must run serially,

    T(N) = f * T(1) + (1 - f) * T(1) / N

We further need to take into account the parallel overhead To(N)
introduced by parallel execution:
- Scheduling, initializing, and synchronizing parallel tasks.
- Resource contention by locks and semaphores.
- Waiting for unbalanced tasks to be completed.
- Communication among processors.
so that

    T(N) = f * T(1) + (1 - f) * T(1) / N + To(N)

where To(N) is the combined cost of the items above.

Limits on Scalability

Even with zero overhead, S(N) = 1 / (f + (1 - f)/N) <= 1/f, so
the serial fraction f alone caps the achievable speed-up
(Amdahl's law); To(N) typically grows with N and lowers it
further.
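
A quick numeric check of these formulas (the serial fraction
f = 0.05 is illustrative, and overhead is ignored):

    #include <stdio.h>

    /* Speed-up for serial fraction f on n processors, ignoring
       overhead: S(n) = 1 / (f + (1 - f)/n). */
    double speedup(double f, int n)
    {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void)
    {
        /* With f = 0.05 the limit is 1/f = 20x, however many
           processors are used. */
        printf("N=8:   %.1f\n", speedup(0.05, 8));    /* ~5.9  */
        printf("N=512: %.1f\n", speedup(0.05, 512));  /* ~19.3 */
        return 0;
    }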
Parallel programming considerations
- Performance programming.
- FMA (fused multiply-add).
- Data independence.
- Loop unrolling (see the sketch after this list).
- Maximum utilization of cache.
- Granularity.
- Communication overhead.
- Load balancing.
- Compiler options and directives for performance.
- Fortran compiler options - -Pk, -Pv, -O3, -qhot
- KAP C preprocessor: kxlc file.c -r=3 -O3
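
A minimal sketch of loop unrolling (the 4-way unrolled dot
product is illustrative; n is assumed divisible by 4):

    /* Four independent partial sums break the dependence chain
       on a single accumulator, letting a superscalar CPU (or an
       FMA unit) overlap the multiply-adds. */
    double dot(const double *x, const double *y, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }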
Parallel Programming Steps
- Profile a program to identify hot spots.
- Check the ratio of the parallel part to the serial part.
- Consider communication overhead:
  - Scheduling, initializing, and synchronizing parallel tasks.
  - Resource contention by locks and semaphores.
  - Waiting for unbalanced tasks to be completed.
  - Communication among processors.
- Examine granularity and load balancing.
Scalable Programming
- Scalability via message passing.
- Serial and parallel execution times.
- Scalable algorithms.
- Maximize computation/communication ratio
- Maximum utilization of data in cache - data locality.
- Minimize "serial" section - minimize data dependences,
avoid central management that overloads the master node,
balance load among processes, minimize barriers.
- Maximize parallelism - data parallelism vs. functional
parallelism.
- A serial or shared memory algorithm may not be optimum
for a distributed memory machine.
- Parallel Programming Models - SPMD and MPMD.
- Fields of applications - Chemical engineering, computational
fluid dynamics, molecular modeling and molecular dynamics,
photo-lithography, quantum chemistry, seismic processing,
weather and climate simulation.
Parallel Programming Models - Master/Workers
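
A hedged MPI sketch of the master/workers pattern (the task
count, message tags, and squaring "work" are all illustrative):
rank 0 hands out task indices one at a time, so faster workers
naturally receive more tasks and the load balances itself.

    #include <stdio.h>
    #include <mpi.h>

    #define NTASKS 100

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                     /* master           */
            int next = 0, done = 0;
            MPI_Status st;
            /* prime each worker with one task */
            for (int w = 1; w < size && next < NTASKS; w++, next++)
                MPI_Send(&next, 1, MPI_INT, w, 1, MPI_COMM_WORLD);
            /* collect results; hand the sender its next task */
            while (done < next) {
                int result;
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 2,
                         MPI_COMM_WORLD, &st);
                done++;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 1,
                             MPI_COMM_WORLD);
                    next++;
                }
            }
            for (int w = 1; w < size; w++)   /* tag 0 = quit     */
                MPI_Send(&done, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        } else {                             /* worker           */
            MPI_Status st;
            for (;;) {
                int task, result;
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == 0)
                    break;
                result = task * task;        /* stand-in work    */
                MPI_Send(&result, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }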
Parallel Programming Models - SPMD
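
A minimal SPMD sketch: every process runs this same program on
its own share of the data (the sum of 1..N and the interleaved
decomposition are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    #define N 10000L

    int main(int argc, char *argv[])
    {
        int rank, size;
        long local = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each rank sums an interleaved share of 1..N */
        for (long i = rank + 1; i <= N; i += size)
            local += i;

        /* combine the partial sums on rank 0 */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %ld\n", total);   /* N*(N+1)/2 */
        MPI_Finalize();
        return 0;
    }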