tacpa http://www.tacpa.org

Taiwanese American Computer Professional Association

Monthly Lecture Series

Introduction to High Performance Computing

hai.tang@nist.gov


What is High Performance Computing?

Fast data crunching; a lot of work done per unit of time.

How to achieve High Performance Computing?

Historically,

Parallel processing models:


Shared-Memory SIMD Models:

  1. Dividing a Shared-Memory into Blocks
    
       |-------------|
       | Processor 1 |-----------o-------------o-------------o
       |-------------|           |             |             |
    			     |             |             |
    			     |             |             |
       |-------------|           |             |             |
       | Processor 2 |-----------o-------------o-------------o
       |-------------|           |             |             |
    			     |             |             |
    			     |             |             |
       |-------------|           |             |             |
       | Processor 3 |-----------o-------------o-------------o
       |-------------|           |             |             |
    			     .             .             .
    			     .             .             .
    			     .             .             .
    
       |-------------|           |             |             |
       | Processor N |-----------o-------------o-------------o
       |-------------|           |             |             |
    			     |             |             |
    			     |             |             |
                           |----------|  |----------|  |----------|
    		       |  Memory  |  |  Memory  |  |  Memory  |
    		       | Block  1 |  | Block  2 |  | Block  3 |
                           |----------|  |----------|  |----------|
    		     
    
       Decoding circuitry for Dividing Shared-Memory M into R blocks
    				     R x f(M/R)
       Switches(relatively inexpensive)  N x R
    
  2. Interconnected-Network Model

    Fully interconnected N processors N x (N-1)/2 lines

    1. Linear Array
    2. Two-Dimensional Array (Mesh)
    3. Tree Connection
    4. Perfect Shuffle Connection(in a one-way line link Pi to Pj, where
      
                      / 2i     for 0<=i<=N/2-1
                 j =  |
                      \ 2i+1-N for N/2<=i<=N-1
      
    5. Cube Connection ( q-dimensioned cube or hypercube ) N = 2**q

Origin 2000 Multilayer Shared-Memory:



      |-----------|   |-----------|           |-----------|
      | Processor |   | Processor |           | Processor |
      |     1     |   |     2     |  . . .    |     N     |
      |-----------|   |-----------|           |-----------|
      |  Memory   |   |  Memory   |           |   Memory  |
      |     1     |   |     2     |  . . .    |     N     |
      |-----------|   |-----------|           |-----------|
	     ^              ^                       ^
	     |              |                       |
	     |              |                       |
	     |              |                       |
	     v              v                       v
	 |----------------------------------------------|
	 |                                              |
	 |       ORIGIN 2000 Shared-Memory              |
	 |                                              |
	 |----------------------------------------------|





MIMD Models:



	 |----------------------------------------------|
	 |           SHRED MEMORY                       |
	 |                                              |
	 |              OR                              |
	 |                                              |
	 |         INTERCONNECTED NETWORK              |
	 |                                              |
	 |----------------------------------------------|
	     |              |                       |
	     |DATA          |DATA                   |DATA
	     |STREAM        |STREAM                 |STREAM
	     | 1            | 2                     | N
	     v              v                       v
      |-----------|   |-----------|           |-----------|
      | Processor |   | Processor |           | Processor |
      |     1     |   |     2     |  . . .    |     N     |
      |-----------|   |-----------|           |-----------|
	     ^              ^                       ^
	     |              |                       |
	     |INSTRUCTION   |INSTRUCTION            |INSTRUCTION 
	     |STREAM        |STREAM                 |STREAM
	     |  1           |  2                    |  N
	     |              |                       |
      |-----------|   |-----------|           |-----------|
      |  Control  |   |  Control  |           |   Control |
      |     1     |   |     2     |  . . .    |     N     |
      |-----------|   |-----------|           |-----------|




Distributed-Memory Models:


	 |----------------------------------------------|
	 |      Closely Connected Switch                |
	 |                                              |
	 |              OR                              |
	 |                                              |
	 |       INTERCONNECTED NETWORK                |
	 |      (Fast-Ethernet, ATM, etc.)              |
	 |                                              |
	 |----------------------------------------------|
	     ^              ^                       ^
	     |              |                       |
	     |DATA          |DATA                   |DATA
	     |EXCHANGE      |EXCHANGE               |EXCHANGE
	     | 1            | 2                     | N
	     v              v                       v
      |-----------|   |-----------|           |-----------|
      | Processor |   | Processor |           | Processor |
      |     1     |   |     2     |  . . .    |     N     |
      |-----------|   |-----------|           |-----------|
      |  Memory   |   |  Memory   |           |   Memory  |
      |     1     |   |     2     |  . . .    |     N     |
      |-----------|   |-----------|           |-----------|




Operating Systems:

High Performance Software:


Comparison of NIST High Performance Systems


Machine          Convex C3820   Cray C90/6256    IBM SP2     SGI Origin 2000

Number of CPU           2             6             32             8

CPU speed - MHz        62.5         250             66.7         200

Data cache size        N.A.         N.A.         128/256K         4M L2

Memory size - MB       1024         2048         512/node       8198

Vector processing       Yes          Yes            No            No

Disk storage - GB        27          110           2/node         80



IBM SP2(Scalable POWERparallel2) Machine

SP1/SP2 Machine Parameters


                              SP1      SP2-thin    SP2-wide

    RS6K equiv. model no.     370        390         590
    CPU speed, MHz             62.5       66.7        66.7

    Data Cache size, bytes     32K        64K       256K
    Cache line size, bytes     64        128        256
    NO. OF CACHE LINES        128X4      128X4      256X4

    TLB ENTRIES               128        512        512
    "TLB SPACE", BYTES        512K         2M         2M

    MAIN MEMORY, BYTES        64M        128M     >=256M

 


SP2 Architecture Performance


SP2 Cache Classes and Associativity

The 64KB data cache is organized into four 16KB congruence classes. Each congruence class consists of 128 lines,and each line is 128 bytes. Logically, virtual storage is similar divided into congruence classes of 16KB. Any memory location in virtual storage can be associated with one of a group of four cache lines out of 512 cache lines available.


Serial and Parallel Execution Times

Speed-up

where

We need further take into account of the parallel overhead introduced by parallel execution:

where


Limits on Scalability


Parallel programming considerations

Parallel Programming Steps


Scalable Programming


Parallel Programming Models - Master/Workers


Parallel Programming Models - SPMD