http://www.tacpa.org
Taiwanese American Computer Professional Association
Monthly Lecture Series
Introduction to High Performance Computing
hai.tang@nist.gov
What is High Performance Computing?
Fast data crunching; a lot of work done per unit of time.
How to achieve High Performance Computing?
Historically,
- Fast CPUs - 50 to 200 MHz
- Vector processing - 100 to 330 MFLOPS
- Parallel/vector processing - multiples of 100 to 330 MFLOPS
- Multiple superscalar processors - up to 300 MFLOPS/CPU
- Smart Math Algorithms
Parallel processing models:
- Shared-Memory - all processors share a common memory.
  Shared-Memory systems are usually huge and expensive.
- Distributed-Memory - each processor has its own memory
  and exchanges data with the others by message passing.
  Distributed-Memory systems are scalable.
Shared-Memory SIMD Models:
- Dividing a Shared-Memory into Blocks
|-------------|
| Processor 1 |-----------o-------------o-------------o
|-------------|           |             |             |
                          |             |             |
|-------------|           |             |             |
| Processor 2 |-----------o-------------o-------------o
|-------------|           |             |             |
                          |             |             |
|-------------|           |             |             |
| Processor 3 |-----------o-------------o-------------o
|-------------|           |             |             |
      .                   .             .             .
      .                   .             .             .
|-------------|           |             |             |
| Processor N |-----------o-------------o-------------o
|-------------|           |             |             |
                          |             |             |
                    |----------|  |----------|  |----------|
                    |  Memory  |  |  Memory  |  |  Memory  |
                    | Block 1  |  | Block 2  |  | Block 3  |
                    |----------|  |----------|  |----------|
Cost of decoding circuitry for dividing a shared memory of size
M into R blocks: R x f(M/R)
Cost of switches (relatively inexpensive): N x R
- Interconnected-Network Model
Fully interconnecting N processors requires N x (N-1)/2 lines
- Linear Array
- Two-Dimensional Array (Mesh)
- Tree Connection
- Perfect Shuffle Connection - a one-way line links Pi to Pj
  (see the sketch after this list), where

          / 2i       for 0 <= i <= N/2-1
      j = |
          \ 2i+1-N   for N/2 <= i <= N-1
- Cube Connection (q-dimensional cube, or hypercube)
  N = 2**q
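
A small C sketch of the perfect-shuffle formula above (N is
assumed to be a power of two):

    #include <stdio.h>

    /* Perfect-shuffle successor: j = 2i for 0 <= i <= N/2-1, and
       j = 2i+1-N for N/2 <= i <= N-1; equivalently a one-bit left
       rotation of i's q-bit binary index when N = 2**q. */
    unsigned shuffle(unsigned i, unsigned N)
    {
        return (i < N / 2) ? 2 * i : 2 * i + 1 - N;
    }

    int main(void)
    {
        unsigned N = 8;                      /* q = 3, N = 2**3 */
        for (unsigned i = 0; i < N; i++)
            printf("P%u -> P%u\n", i, shuffle(i, N));
        return 0;
    }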
Origin 2000 Multilayer Shared-Memory:
|-----------|   |-----------|           |-----------|
| Processor |   | Processor |           | Processor |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
|  Memory   |   |  Memory   |           |  Memory   |
|     1     |   |     2     |           |     N     |
|-----------|   |-----------|           |-----------|
      ^               ^                       ^
      |               |                       |
      |               |                       |
      v               v                       v
|----------------------------------------------|
|                                              |
|          ORIGIN 2000 Shared-Memory           |
|                                              |
|----------------------------------------------|
MIMD Models:
|----------------------------------------------|
|                SHARED MEMORY                 |
|                                              |
|                      OR                      |
|                                              |
|            INTERCONNECTED NETWORK            |
|                                              |
|----------------------------------------------|
      |               |                       |
      |DATA           |DATA                   |DATA
      |STREAM         |STREAM                 |STREAM
      | 1             | 2                     | N
      v               v                       v
|-----------|   |-----------|           |-----------|
| Processor |   | Processor |           | Processor |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
      ^               ^                       ^
      |INSTRUCTION    |INSTRUCTION            |INSTRUCTION
      |STREAM         |STREAM                 |STREAM
      | 1             | 2                     | N
      |               |                       |
|-----------|   |-----------|           |-----------|
|  Control  |   |  Control  |           |  Control  |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
Distributed-Memory Models:
|----------------------------------------------|
|           Closely Connected Switch           |
|                                              |
|                      OR                      |
|                                              |
|            INTERCONNECTED NETWORK            |
|          (Fast-Ethernet, ATM, etc.)          |
|                                              |
|----------------------------------------------|
      ^               ^                       ^
      |DATA           |DATA                   |DATA
      |EXCHANGE       |EXCHANGE               |EXCHANGE
      | 1             | 2                     | N
      v               v                       v
|-----------|   |-----------|           |-----------|
| Processor |   | Processor |           | Processor |
|     1     |   |     2     |   . . .   |     N     |
|-----------|   |-----------|           |-----------|
|  Memory   |   |  Memory   |           |  Memory   |
|     1     |   |     2     |           |     N     |
|-----------|   |-----------|           |-----------|
- NOWs, Network of Workstations
- POP, Pile of PCs
Operating Systems:
High Performance Software:
- PVM, Parallel Virtual Machine
- MPI, Message Passing Interface (see the sketch below)
- HPF, High Performance Fortran
- Java - multithreading
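
As a minimal sketch of the message-passing style MPI supports
(not tied to any particular system above), each process reports
its rank; compile with an MPI wrapper such as mpicc and launch
with mpirun:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start MPI       */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my process id   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total processes */
        printf("Process %d of %d\n", rank, size);
        MPI_Finalize();                         /* shut down MPI   */
        return 0;
    }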
Comparison of NIST High Performance Systems
Machine             Convex C3820   Cray C90/6256   IBM SP2    SGI Origin 2000
Number of CPUs      2              6               32         8
CPU speed - MHz     62.5           250             66.7       200
Data cache size     N.A.           N.A.            128/256K   4M L2
Memory size - MB    1024           2048            512/node   8192
Vector processing   Yes            Yes             No         No
Disk storage - GB   27             110             2/node     80
IBM SP2 (Scalable POWERparallel 2) Machine
- POWER2-based RISC processors - 66.7 MHz, 4 floating-point
  operations per cycle, for a peak of 266 MFLOPS.
- Large local memory (64-256 MB) and hard disk drive capacity
  (2-4.5 GB) per processor.
- High-Performance Switch (LC8) at 40 MB/sec.
- Low latency and high bandwidth on all processors.
- Scalability via message passing, up to 512 processors.
- High-speed connectivity to the world - HIPPI, ATM, FDDI.
- Superior batch performance.
SP1/SP2 Machine Parameters
                         SP1      SP2-thin   SP2-wide
RS6K equiv. model no.    370      390        590
CPU speed, MHz           62.5     66.7       66.7
Data cache size, bytes   32K      64K        256K
Cache line size, bytes   64       128        256
No. of cache lines       128x4    128x4      256x4
TLB entries              128      512        512
"TLB space", bytes       512K     2M         2M
Main memory, bytes       64M      128M       >=256M
SP2 Architecture Performance
SP2 Cache Classes and Associativity
The 64KB data cache is organized into four 16KB congruence
classes. Each congruence class consists of 128 lines, and each
line is 128 bytes. Logically, virtual storage is similarly
divided into congruence classes of 16KB. Any memory location
in virtual storage can be associated with one of a group of
four cache lines out of the 512 cache lines available.
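
A small sketch of that mapping (the helper name is illustrative):
with 128-byte lines and 16KB congruence classes, an address's
offset within its 16KB block selects its line slot, so addresses
16KB apart compete for the same group of four lines.

    #include <stdio.h>

    #define LINE_BYTES  128
    #define CLASS_BYTES (16 * 1024)

    /* Line slot an address maps to within each 16KB congruence
       class; the hardware may place the data in that slot of any
       of the four classes. */
    unsigned line_slot(unsigned long addr)
    {
        return (addr % CLASS_BYTES) / LINE_BYTES;   /* 0..127 */
    }

    int main(void)
    {
        /* Three addresses 16KB apart share one group of four
           lines; a fifth such address would force an eviction. */
        printf("%u %u %u\n", line_slot(0x0000),
                             line_slot(0x4000),
                             line_slot(0x8000));
        return 0;
    }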
Serial and Parallel Execution Times

Let T(1) be the execution time on one processor and T(N) the
execution time on N processors. The speed-up is

    S(N) = T(1) / T(N)

where, if f is the fraction of the work that must run serially,

    T(N) = f * T(1) + (1 - f) * T(1) / N

We further need to take into account the parallel overhead To(N)
introduced by parallel execution:
- Scheduling, initializing, and synchronizing parallel tasks.
- Resource contention by locks and semaphores.
- Waiting for unbalanced tasks to be completed.
- Communication among processors.
so that

    T(N) = f * T(1) + (1 - f) * T(1) / N + To(N)

where To(N) is the combined cost of the items above.

Limits on Scalability

Even with zero overhead, S(N) = 1 / (f + (1 - f)/N) <= 1/f, so
the serial fraction f alone caps the achievable speed-up
(Amdahl's law); To(N) typically grows with N and lowers it
further.
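
A quick numeric check of these formulas (the serial fraction
f = 0.05 is illustrative, and overhead is ignored):

    #include <stdio.h>

    /* Speed-up for serial fraction f on n processors, ignoring
       overhead: S(n) = 1 / (f + (1 - f)/n). */
    double speedup(double f, int n)
    {
        return 1.0 / (f + (1.0 - f) / n);
    }

    int main(void)
    {
        /* With f = 0.05 the limit is 1/f = 20x, however many
           processors are used. */
        printf("N=8:   %.1f\n", speedup(0.05, 8));    /* ~5.9  */
        printf("N=512: %.1f\n", speedup(0.05, 512));  /* ~19.3 */
        return 0;
    }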
Parallel programming considerations
- Performance programming.
- FMA (fused multiply-add).
- Data independence.
- Loop unrolling (see the sketch after this list).
- Maximum utilization of cache.
- Granularity.
- Communication overhead.
- Load balancing.
- Compiler options and directives for performance.
- Fortran compiler options - -Pk, -Pv, -O3, -qhot
- KAP C preprocessor: kxlc file.c -r=3 -O3
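
A minimal sketch of loop unrolling (the 4-way unrolled dot
product is illustrative; n is assumed divisible by 4):

    /* Four independent partial sums break the dependence chain
       on a single accumulator, letting a superscalar CPU (or an
       FMA unit) overlap the multiply-adds. */
    double dot(const double *x, const double *y, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }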
Parallel Programming Steps
- Profile a program to identify hot spots.
- Check the ratio of the parallel part to the serial part.
- Consider communication overhead:
  - Scheduling, initializing, and synchronizing parallel tasks.
  - Resource contention by locks and semaphores.
  - Waiting for unbalanced tasks to be completed.
  - Communication among processors.
- Examine granularity and load balancing.
Scalable Programming
- Scalability via message passing.
- Serial and parallel execution times.
- Scalable algorithms.
- Maximize computation/communication ratio
- Maximum utilization of data in cache - data locality.
- Minimize "serial" section - minimize data dependences,
avoid central management that overloads the master node,
balance load among processes, minimize barriers.
- Maximize parallelism - data parallelism vs. functional
parallelism.
- A serial or shared memory algorithm may not be optimum
for a distributed memory machine.
- Parallel Programming Models - SPMD and MPMD.
- Fields of applications - Chemical engineering, computational
fluid dynamics, molecular modeling and molecular dynamics,
photo-lithography, quantum chemistry, seismic processing,
weather and climate simulation.
Parallel Programming Models - Master/Workers
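
A hedged MPI sketch of the master/workers pattern (the task
count, message tags, and squaring "work" are all illustrative):
rank 0 hands out task indices one at a time, so faster workers
naturally receive more tasks and the load balances itself.

    #include <stdio.h>
    #include <mpi.h>

    #define NTASKS 100

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                     /* master           */
            int next = 0, done = 0;
            MPI_Status st;
            /* prime each worker with one task */
            for (int w = 1; w < size && next < NTASKS; w++, next++)
                MPI_Send(&next, 1, MPI_INT, w, 1, MPI_COMM_WORLD);
            /* collect results; hand the sender its next task */
            while (done < next) {
                int result;
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 2,
                         MPI_COMM_WORLD, &st);
                done++;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 1,
                             MPI_COMM_WORLD);
                    next++;
                }
            }
            for (int w = 1; w < size; w++)   /* tag 0 = quit     */
                MPI_Send(&done, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        } else {                             /* worker           */
            MPI_Status st;
            for (;;) {
                int task, result;
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == 0)
                    break;
                result = task * task;        /* stand-in work    */
                MPI_Send(&result, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }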
Parallel Programming Models - SPMD
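
A minimal SPMD sketch: every process runs this same program on
its own share of the data (the sum of 1..N and the interleaved
decomposition are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    #define N 10000L

    int main(int argc, char *argv[])
    {
        int rank, size;
        long local = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each rank sums an interleaved share of 1..N */
        for (long i = rank + 1; i <= N; i += size)
            local += i;

        /* combine the partial sums on rank 0 */
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %ld\n", total);   /* N*(N+1)/2 */
        MPI_Finalize();
        return 0;
    }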