Computer System Architecture Lecturer Notes

June 28, 2017 | Author: Budditha Hettige | Category: Computer Science



Available at: http://www.researchgate.net/publication/282845738

Computer System Architecture Lecturer Notes RESEARCH · OCTOBER 2015 DOI: 10.13140/RG.2.1.2592.8407

Author: Budditha Hettige, General Sir John Kotelawala Defence University

Available from: Budditha Hettige Retrieved on: 17 October 2015

CSC 203 1.5

Computer System Architecture

By Budditha Hettige Department of Statistics and Computer Science University of Sri Jayewardenepura (2011)

Computer System architectures

1

Course Outline
• Course Type: Core
• Credit Value: 1.5
• Duration: 22 lecture hours
• Pre-requisites: CSC 106 2.0

Course contents
• Introduction and Historical Developments
  – Historical system development
  – Processor families
• Computer Architecture and Organization
  – Instruction Set Architecture (ISA)
  – Microarchitecture
  – System architecture
  – Processor architecture
  – Processor structures
• Interfacing and I/O Strategies
  – I/O fundamentals, interrupt mechanisms, buses

Course contents
• Memory Architecture
  – Primary memory, cache memory, secondary memory
• Functional Organization
  – Instruction pipelining
  – Instruction level parallelism (ILP)
  – Superscalar architectures
  – Processor and system performance
• Multiprocessing
  – Amdahl's law
  – Short vector processing
  – Multi-core multithreaded processors

Introduction


What is a Computer?
• A machine that can solve problems for people by carrying out instructions given to it
• The sequence of instructions is called a program
• The language the machine can understand is called machine language


What is Machine Language?
• Machine language (ML) is a system of instructions and data executed directly by a computer's Central Processing Unit
• The codes are strings of 0s and 1s, or binary digits ("bits")
• Instructions typically use some bits to represent
  – Operations (e.g. addition)
  – Operands, or
  – The location of the next instruction
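As an illustration of how one instruction word can carry these fields, here is a small Python sketch. The 4-bit opcode / 4-bit operand split is hypothetical, chosen to match the simple example CPU used later in these notes.

```python
# Hypothetical 8-bit instruction word: high 4 bits = operation,
# low 4 bits = operand/address (layout is illustrative, not a real ISA).
def decode(word):
    opcode = (word >> 4) & 0b1111   # which operation to perform
    operand = word & 0b1111         # which value or address it applies to
    return opcode, operand

op, arg = decode(0b0001_0010)
print(op, arg)  # opcode 1, operand 2
```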


Machine Language contd.
• Advantages
  – The machine can execute it directly (electronic circuits)
  – High speed
• Disadvantages
  – Hard for humans to read and write
  – Machine dependent (hardware dependent)


More on Machines
• A machine defines a language
  – The set of instructions the machine can carry out
• A language defines a machine
  – The machine that can execute all programs written in that language

[Figure: a machine and its language define each other]

Two-Layer (Level) Machine
• This machine contains only the new language (L1) and the machine language (L0)

[Figure: programs for the Virtual Machine (L1) are translated or interpreted into Machine Language (L0), which runs on the actual machine]

Translation (L1 → L0)
1. Each instruction written in L1 is replaced by an equivalent sequence of L0 instructions
2. The machine then executes the new L0 program
3. The program that performs the translation is called a compiler (translator)


Interpretation
• Each instruction in L1 is executed directly by carrying out the equivalent L0 instructions
• The program that does this is called an interpreter
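The difference can be sketched in a few lines of Python. The two-instruction L1 language here is invented for illustration: the interpreter carries out each L1 instruction immediately using host ("L0") operations, whereas a translator would first emit a complete new program and run that instead.

```python
# Toy interpreter (illustrative): L1 has SET and ADD instructions,
# each executed immediately via equivalent host operations.
def interpret(program):
    acc = 0
    for instr, arg in program:
        if instr == "SET":    # load the accumulator
            acc = arg
        elif instr == "ADD":  # carry out the L1 ADD via host arithmetic
            acc += arg
    return acc

result = interpret([("SET", 2), ("ADD", 5)])
print(result)  # 7
```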


Multilevel Machine

High-level Language Program (C, C++)
    ↓
Assembly Language Program
    ↓
Machine Language


Multilevel Machine

Virtual Machine Ln
    ↓
Virtual Machine Ln-1
    ↓
    …
Machine Language L0


Six-Level Machine
• A computer designed as a hierarchy of six levels of abstraction, from the digital logic level up to the problem-oriented language level


Digital Logic Level
• The interesting objects at this level are gates
• Each gate has one or more digital inputs (0 or 1)
• Each gate is built of at most a handful of transistors
• A small number of gates can be combined to form a 1-bit memory, which can store a 0 or 1
• The 1-bit memories can be combined in groups of, for example, 16, 32 or 64 to form registers
• Each register can hold a single binary number up to some maximum
• Gates can also be combined to form the main computing engine itself
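A minimal sketch of these ideas in Python, assuming only the standard result (not stated in the notes) that every logic gate can be built from NAND:

```python
# Every gate below is built from NAND, mirroring the idea that a handful
# of transistors make one gate and gates combine into larger units.
def nand(a, b): return 0 if (a and b) else 1
def not_(a):    return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b):  return nand(not_(a), not_(b))
def xor(a, b):  return and_(or_(a, b), nand(a, b))

# A half adder: a few gates combined into a 1-bit computing element.
def half_adder(a, b):
    return xor(a, b), and_(a, b)  # (sum, carry)
```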

Microarchitecture Level
• A collection of 8-32 registers that form a local memory, and a circuit called an ALU (Arithmetic Logic Unit) that is capable of performing simple arithmetic operations
• The registers are connected to the ALU to form a data path over which the data flow
• The basic operation of the data path consists of selecting one or two registers and having the ALU operate on them
• On some machines the operation of the data path is controlled by a program called a microprogram; on other machines it is controlled directly by hardware
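The data-path operation described above can be sketched as follows; the register count and the set of ALU operations are illustrative, not taken from any particular machine.

```python
# Minimal data-path sketch: a register file feeds an ALU, and the
# result is written back to a register -- one data-path cycle.
registers = [0] * 8          # a small local memory of 8 registers

def alu(op, x, y):
    # a few simple arithmetic/logic operations (illustrative set)
    return {"ADD": x + y, "SUB": x - y, "AND": x & y}[op]

def datapath_cycle(op, src1, src2, dest):
    # select one or two registers, have the ALU operate on them,
    # then store the result back into the register file
    registers[dest] = alu(op, registers[src1], registers[src2])

registers[0], registers[1] = 6, 7
datapath_cycle("ADD", 0, 1, 2)
print(registers[2])  # 13
```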

Data Path

[Figure: the data path: registers feed the ALU and the result is written back to a register]

Instruction Set Architecture Level
• The ISA level is defined by the machine's instruction set
• This is the set of instructions carried out interpretively by the microprogram or directly by hardware execution circuits


Operating System Level
• Provides a different memory organization, a new set of instructions, and the ability to run one or more programs concurrently
• Those level 3 instructions identical to level 2's are carried out directly by the microprogram (or hardwired control), not by the operating system
• In other words, some of the level 3 instructions are interpreted by the OS and some are interpreted directly by the microprogram
• This level is therefore a hybrid

Assembly Language Level
• This level is really a symbolic form for one of the underlying languages
• It provides a method for people to write programs for levels 1, 2 and 3 in a form that is not as unpleasant as the virtual machine languages themselves
• Programs in assembly language are first translated to level 1, 2 or 3 language and then interpreted by the appropriate virtual or actual machine
• The program that performs the translation is called an assembler

Between Levels 3 and 4
• The lower three levels are not for the average programmer
  – Instead they are primarily for running the interpreters and translators needed to support the higher levels
• These are written by system programmers who specialise in developing new virtual machines
• Levels 4 and above are intended for the applications programmer
• Levels 2 and 3 are always interpreted; levels 4 and above are usually, but not always, supported by translation


Problem-oriented Language Level
• This level usually consists of languages designed to be used by applications programmers
• These languages are generally called higher-level languages
• Some examples: Java, C, BASIC, LISP, Prolog
• Programs written in these languages are generally translated to level 3 or 4 by translators known as compilers, although occasionally they are interpreted instead


Multilevel Machines: Hardware
• Programs written in a computer's true machine language (level 1) can be directly executed by the computer's electronic circuits (level 0), without any intervening interpreters or translators
• These electronic circuits, along with the memory and input/output devices, form the computer's hardware
• Hardware consists of tangible objects:
  – integrated circuits
  – printed circuit boards
  – cables
  – power supplies
  – memories
  – printers
• Hardware is not abstract ideas, algorithms, or instructions

Multilevel Machines: Software
• Software consists of algorithms (detailed instructions telling how to do something) and their computer representations, namely programs
• Programs can be stored on hard disk, floppy disk, CD-ROM, or other media, but the essence of software is the set of instructions that makes up the programs, not the physical media on which they are recorded
• In the very first computers, the boundary between hardware and software was crystal clear
• Over time, however, it has blurred considerably, primarily due to the addition, removal, and merging of levels as computers have evolved
• Hardware and software are logically equivalent


The Hardware/Software Boundary
• Any operation performed by software can also be built directly into the hardware
• Also, any instruction executed by the hardware can also be simulated in software
• The decision to put certain functions in hardware and others in software is based on such factors as:
  – Cost
  – Speed
  – Reliability
  – Frequency of expected changes

Exercises
1. Explain each of the following terms in your own words
  – Machine Language
  – Instruction
2. What are the differences between interpretation and translation?
3. What are multilevel machines?
4. What are the differences between a two-level machine and a six-level machine?


Historical Developments


Computer Generations
1. Zeroth generation – Mechanical Computers (1642-1940)
2. First generation – Vacuum Tubes (1940-1955)
3. Second generation – Transistors (1956-1963)
4. Third generation – Integrated Circuits (1964-1971)
5. Fourth generation – VLSI (1971-present)
6. Fifth generation – Artificial Intelligence (present and beyond)


Historical Milestones (1)

Year | Name              | Made by         | Comments
1834 | Analytical Engine | Babbage         | First attempt to build a digital computer
1936 | Z1                | Zuse            | First working relay calculating machine
1943 | COLOSSUS          | British gov't   | First electronic computer
1944 | Mark I            | Aiken           | First American general-purpose computer
1946 | ENIAC I           | Eckert/Mauchley | Modern computer history starts here
1949 | EDSAC             | Wilkes          | First stored-program computer
1951 | Whirlwind I       | M.I.T.          | First real-time computer
1952 | IAS               | Von Neumann     | Most current machines use this design
1960 | PDP-1             | DEC             | First minicomputer (50 sold)
1961 | 1401              | IBM             | Enormously popular small business machine
1962 | 7094              | IBM             | Dominated scientific computing in the early 1960s

Historical Milestones (2)

Year | Name   | Made by   | Comments
1963 | B5000  | Burroughs | First machine designed for a high-level language
1964 | 360    | IBM       | First product line designed as a family
1964 | 6600   | CDC       | First scientific supercomputer
1965 | PDP-8  | DEC       | First mass-market minicomputer (50,000 sold)
1970 | PDP-11 | DEC       | Dominated minicomputers in the 1970s
1974 | 8080   | Intel     | First general-purpose 8-bit computer on a chip
1974 | CRAY-1 | Cray      | First vector supercomputer
1978 | VAX    | DEC       | First 32-bit superminicomputer
1981 | IBM PC | IBM       | Started the modern personal computer era
1985 | MIPS   | MIPS      | First commercial RISC machine
1987 | SPARC  | Sun       | First SPARC-based RISC workstation
1990 | RS6000 | IBM       | First superscalar machine

The Zeroth Generation (3)
• Pascal's machine
  – Addition and subtraction
• Analytical Engine
  – Four components (store, mill, input, output)


Charles Babbage
• Difference Engine (1823)
• Analytical Engine (1833)
  – The forerunner of the modern digital computer
  – The first conception of a general-purpose computer


Von Neumann Machine

[Figure: the von Neumann architecture]

First Generation-Vacuum Tubes (1945-1955) • First generation computers are characterized by the use of vacuum tube logic • Developments – ABC – ENIAC – UNIVAC I


Brief Early Computer Timeline – First Generation

Date | Event    | Description                                     | Arithmetic | Logic        | Memory
1942 | ABC      | Atanasoff-Berry Computer                        | binary     | vacuum tubes | capacitors
1946 | ENIAC    | Electronic Numerical Integrator And Computer    | decimal    | vacuum tubes | vacuum tubes
1947 | EDVAC    | Electronic Discrete Variable Automatic Computer | binary     | vacuum tubes | mercury delay lines
1948 | The Baby | Manchester Small Scale Experimental Machine     | binary     | vacuum tubes | CRTs
1949 | UNIVAC I | Universal Automatic Computer                    | decimal    | vacuum tubes | mercury delay lines
1949 | EDSAC    | Electronic Delay Storage Automatic Computer     | binary     | vacuum tubes | mercury delay lines
1952 | IAS      | Institute for Advanced Study                    | binary     | vacuum tubes | cathode ray tubes
1953 | IBM 701  |                                                 | binary     | vacuum tubes | mercury delay lines

ABC – Atanasoff-Berry Computer
• The world's first electronic digital computer
• The ABC used binary arithmetic


ENIAC – First General-Purpose Computer
• Electronic Numerical Integrator And Computer
• Designed and built by Eckert and Mauchly at the University of Pennsylvania during 1943-45
• Capable of being reprogrammed to solve a full range of computing problems
• The first completely electronic, operational, general-purpose analytical calculator
  – 30 tons, 72 square meters, 200 kW
• Performance
  – Read in 120 cards per minute
  – Addition took 200 µs, division 6 ms


UNIVAC - UNIVersal Automatic Computer • The first commercial computer • UNIVAC was delivered in 1951 • designed at the outset for business and administrative use • The UNIVAC I had 5200 vacuum tubes, weighed 29,000 pounds, and consumed 125 kilowatts of electrical power • Originally priced at US$159,000


The Second Generation – Transistors (1955-1965)
• Second generation computers are characterized by the use of discrete transistor logic
• Use of magnetic core for primary storage
• Developments
  – IBM 1620 System
  – IBM 7030 System
  – IBM 7090 System
  – IBM 7094 System

IBM 7090
• The IBM 7090 system was announced in 1958.
• The 7090 included a multiplexor which supported up to 8 I/O channels.
• The 7090 supported both fixed point and floating point arithmetic.
• Two fixed point numbers could be added in 4.8 microseconds, and two floating point numbers could be added in 16.8 microseconds.
• The 7090 had 32,768 thirty-six bit words of core storage.
• In 1960, the American Airlines SABRE system used two 7090 systems.
• Cost of a 7090 system was in the $3,000,000 range.


IBM 1620 • The IBM 1620 system was announced in 1959. • The IBM 1620 system had up to 60,000 digits of core storage (6 bits each.) • Floating point hardware was optional. • The IBM 1620 system performed decimal arithmetic. • The system was digit oriented, not word oriented.


IBM 7030 • The IBM 7030 system was announced in 1960. • The IBM 7030 system used magnetic core for main memory, and magnetic disks for secondary storage. • The ALU could perform 1,000,000 operations per second. • Up to 32 I/O channels were supported. • The 7030 was also referred to as "Stretch." • Cost of a 7030 system was in the $10,000,000 range.


IBM 7094 • The IBM 7094 system was announced in 1962. • The 7094 was an improved 7090. • The 7094 introduced double precision floating point arithmetic.


Third Generation • Third generation computers are characterized by the use of integrated circuit logic. • Development – IBM System/360


IBM S/360 • The IBM S/360 family was announced in 1964. • Included both multiplexor and selector I/O channels. • Supported both fixed point and floating point arithmetic. • Had a microprogrammed instruction set. • Cost between $133,000 and $12,500,000.


Fourth Generation
• Very Large Scale Integration (VLSI) and Ultra Large Scale Integration (ULSI)
• Fourth generation computers are characterized by the use of microprocessors
• Semiconductor memory was commonly used
• Developments
  – Intel
  – AMD etc.


Intel 4004
• The Intel 4004 microprocessor was announced in 1971.
• The Intel 4004 microprocessor had
  – 2,300 transistors
  – A clock speed of 108 KHz
  – A die size of 12 sq mm
  – 4-bit memory access
  – 4-bit registers
• The Intel 4004 microprocessor supported
  – Up to 32,768 bits of program storage
  – Up to 5,120 bits of data storage
• The 4004 was used mainly in calculators.


Intel 4004 - 1971


MOS 6502
• The MOS 6502 microprocessor was announced in 1975.
• The MOS 6502 microprocessor had
  – A clock speed of 1 MHz
  – 8-bit memory access
  – 8-bit registers
• The MOS 6502 microprocessor supported
  – Up to 65,536 bytes of main memory
• The MOS 6502 was used in
  – The Apple II personal computer
  – The Commodore PET personal computer
  – The KIM-1 computer kit
  – The Atari 2600 game system
  – The Nintendo Famicom game system
• Initial price of the 6502 was $25.00.


Intel Pentium IV - 2001 • "State of the art" • 42 million transistors • 2GHz • 0.13µm process • Could fit ~15,000 4004s on this chip!


Now – zEnterprise 196 Microprocessor
• 1.4 billion transistors, quad-core design
• Up to 96 cores (80 visible to the OS) in one multichip module
• 5.2 GHz, IBM 45nm SOI CMOS technology
• 64-bit virtual addressing
  – original 360 was 24-bit; 370 was a 31-bit extension
• Superscalar, out-of-order
  – Up to 72 instructions in flight
• Variable length instruction pipeline: 15-17 stages
• Each core has 2 integer units, 2 load-store units and 2 floating point units
• 8K-entry Branch Target Buffer
  – Very large buffer to support commercial workloads
• Four levels of caches:
  – 64KB L1 I-cache, 128KB L1 D-cache
  – 1.5MB L2 cache per core
  – 24MB shared on-chip L3 cache
  – 192MB shared off-chip L4 cache


Fifth Generation
• Computing devices based on artificial intelligence
• Features
  – Voice recognition
  – Parallel processing
  – Quantum computation, molecular and nanotechnology will radically change the face of computers in years to come
• The goal of fifth-generation computing is to develop devices that respond to natural language input and are capable of learning and self-organization


Computer Architecture


What is Computer Architecture?
• The set of data types, operations, and features visible at a given level is called its architecture
• It deals with those aspects that are visible to the user of that level
• The study of how to design those parts of a computer is called computer architecture


Why Computer Architecture?
• Maximize overall system performance while keeping within cost constraints
• Bridge the performance gap between the slowest and fastest components in a computer
• Architecture design
  – Search the space of possible designs
  – Evaluate the performance of the chosen design
  – Identify bottlenecks, redesign and repeat the process


Computer Organization
• The simple computer consists of
  – CPU
  – I/O devices
  – Memory
  – Bus (connection method)


Simple Computer

[Figure: CPU, memory and I/O devices connected by a bus]

CPU – Central Processing Unit
• The "brain" of the computer
• It executes the programs stored in the main memory
• It is composed of several parts
  – Control Unit
  – Arithmetic and Logic Unit
  – Registers


Registers
• High-speed memory
• Sit at the top of the memory hierarchy and provide the fastest way to access data
• Store temporary results
• Some useful registers
  – PC – Program Counter
    • Points to the next instruction
  – IR – Instruction Register
    • Holds the instruction currently being executed


Registers more… • Types – User-accessible Registers – Data registers – Address registers – General purpose registers – Special purpose registers – Etc.


Instructions
• Types
  – Data handling and memory operations
    • Set, Move, Read, Write
  – Arithmetic and logic
    • Add, subtract, multiply, or divide
    • Compare
  – Control flow
• Complex instructions
  – Take many instructions on other computers
    • Saving many registers on the stack at once
    • Moving large blocks of memory


Parts of an instruction
• Opcode
  – Specifies the operation to be performed
• Operands
  – Register values
  – Values in the stack
  – Other memory values
  – I/O ports

Computer System Architecture

63

Types of operations
• Register-register operations
  – Add, subtract, compare, and logical operations
• Memory references
  – All loads from memory
• Multi-cycle instructions
  – Integer multiply and divide and all floating-point operations


Fetch-Decode-Execute Cycle
• Instruction fetch
  – A 32-bit instruction is fetched from the cache
• Decode
• Execute
• Memory access
• Write back
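The stages above can be sketched as a loop over a made-up accumulator machine; the instruction names (LOAD, ADD, HALT) and memory layout are illustrative, not from a real ISA.

```python
# Made-up accumulator machine: instructions at low addresses, data at 10+.
memory = {0: ("LOAD", 10), 1: ("ADD", 11), 2: ("HALT", 0), 10: 2, 11: 5}
pc, acc = 0, 0
while True:
    instr = memory[pc]              # fetch the next instruction
    op, addr = instr                # decode into operation and address
    if op == "HALT":
        break
    value = memory[addr]            # memory access
    if op == "LOAD":                # execute + write back to accumulator
        acc = value
    elif op == "ADD":
        acc += value
    pc += 1
print(acc)  # 7
```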


Microprocessors
• Processors can be identified by two main parameters
  – Speed (MHz/GHz)
  – Processor width
    • Data bus
    • Address bus
    • Internal registers


Data Bus
• Also known as the front-side bus, CPU bus or processor-side bus
• Used between the CPU and the main chipset
• Its width defines how much data moves in one transfer
  – 32-bit
  – 64-bit etc.



The division of I/O buses is according to data transfer rate. Specifically:

I/O Ports with typical data transfer rates

Controller  | Port / Device           | Typical Data Transfer Rate
Super I/O   | PS/2 (keyboard / mouse) | 2 KB/s
Super I/O   | Serial Port             | 25 KB/s
Super I/O   | Floppy Disk             | 125 KB/s
Super I/O   | Parallel Port           | 200 KB/s
Southbridge | Integrated Audio        | 1 MB/s
Southbridge | Integrated LAN          | 12 MB/s
Southbridge | USB                     | 60 MB/s
Southbridge | Integrated Video        | 133 MB/s
Southbridge | IDE (HDD, DVD)          | 133 MB/s
Southbridge | SATA (HDD, DVD)         | 300 MB/s

Address Bus • Carries addressing information • Each wire carries a single bit • Width indicates maximum amount of RAM the processor can handle • Data bus and address bus are independent
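Since each address wire carries one bit, an n-wire address bus can select 2^n distinct addresses; a quick check in Python:

```python
# Maximum number of addresses reachable over an n-wire address bus.
def max_addresses(wires):
    return 2 ** wires

print(max_addresses(16))  # 65536 addresses (64 KB if each holds one byte)
print(max_addresses(32))  # 4294967296 addresses (4 GB)
```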


How CPU works?
• A Simple CPU
  – 4-bit address bus
  – Registers A, B and C (4-bit)
  – 8-bit program words (4-bit instruction, 4-bit data)


How CPU works? – Instruction Set

Opcode | Instruction
0000   | Sleep
0001   | LOAD M → A
0010   | LOAD M → B
0101   | SET A → M
0110   | SET B → M
0111   | SET C → M
1000   | ADD A + B → C
1111   | MOVE
1001   | RESET

[Figure: registers A, B and C, the ALU, and the Instruction Counter (IC) connected to an 8-word memory]

[Figure sequence (slides 74-82): step-by-step execution of a sample program on the simple CPU. The program loads A with 0010 and B with 0101, ADD A + B → C leaves 0111 in C, SET C → M writes the result back into memory, and RESET clears the registers and returns the instruction counter to the start.]
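The example CPU from these slides can be simulated in a few lines of Python. The opcode values follow the instruction-set table above; the RESET behaviour is simplified here to just clearing the registers (the slides also return the instruction counter to the start), and MOVE is left unimplemented.

```python
def run(memory):
    """Run 8-bit program words: high nibble = opcode, low nibble = address."""
    A = B = C = 0
    ic = 0                      # instruction counter
    while ic < len(memory):
        word = memory[ic]
        opcode, addr = word >> 4, word & 0b1111
        ic += 1
        if opcode == 0b0000:    # Sleep: halt
            break
        elif opcode == 0b0001:  # LOAD M -> A
            A = memory[addr]
        elif opcode == 0b0010:  # LOAD M -> B
            B = memory[addr]
        elif opcode == 0b0101:  # SET A -> M
            memory[addr] = A
        elif opcode == 0b0110:  # SET B -> M
            memory[addr] = B
        elif opcode == 0b0111:  # SET C -> M
            memory[addr] = C
        elif opcode == 0b1000:  # ADD A + B -> C (4-bit result)
            C = (A + B) & 0b1111
        elif opcode == 0b1001:  # RESET (simplified: clear registers only)
            A = B = C = 0
    return A, B, C, memory

# The sample program from the slides: load A from address 5, load B from
# address 6, add them into C, store C at address 7, then sleep.
program = [0b0001_0101, 0b0010_0110, 0b1000_0000, 0b0111_0111,
           0b0000_0000, 0b0010, 0b0101, 0b0000]
A, B, C, mem = run(program)
print(A, B, C, mem[7])  # 2 5 7 7
```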

How BUS System works

[Figure: the CPU and Devices A, B and C are all connected to the DATA BUS, ADDRESS BUS and CONTROL BUS]


How BUS System works
• ADDRESS BUS: 4 bit
• DATA BUS: 4 bit
• CONTROL BUS: 2 bit

[Figure: the CPU and Devices A, B and C share the three buses]

How BUS System works
• Each device is assigned an address: Device A = 0100, Device B = 0010, Device C = 0001
• The 2-bit control bus signals the operation: 01 = READ, 10 = WRITE

[Figure sequence (slides 86-91): the CPU places 0100 on the address bus with control 10 to write the value 1010 to Device A, then places 0010 on the address bus with control 01 to read from Device B]
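The read/write protocol sketched above can be modelled in Python; the Device class and tick method are illustrative names, not part of any real bus standard.

```python
# Model of the slides' bus protocol: 4-bit address bus, 4-bit data bus,
# 2-bit control bus (01 = READ, 10 = WRITE).
READ, WRITE = 0b01, 0b10

class Device:
    def __init__(self, address):
        self.address = address   # the 4-bit address this device answers to
        self.latch = 0           # the 4-bit value the device holds

    def tick(self, addr_bus, data_bus, ctrl_bus):
        if addr_bus != self.address:
            return data_bus      # not selected: ignore the transaction
        if ctrl_bus == WRITE:
            self.latch = data_bus  # take the value off the data bus
        elif ctrl_bus == READ:
            return self.latch      # drive the data bus with the stored value
        return data_bus

a, b = Device(0b0100), Device(0b0010)
# The CPU writes 1010 to Device A, then reads from Device B.
a.tick(0b0100, 0b1010, WRITE)
print(b.tick(0b0010, 0b0000, READ))  # 0 -- Device B has nothing stored yet
```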

Intel Microprocessor History


Microprocessor History
• Intel 4004 (1971)
  – 0.1 MHz
  – 4-bit
  – World's first single-chip microprocessor
  – Instruction set contained 46 instructions
  – Register set contained 16 registers of 4 bits each

Microprocessor History
• Intel 8008 (1972)
  – Max. CPU clock rate 0.5 MHz to 0.8 MHz
  – 8-bit CPU with an external 14-bit address bus
  – Could address 16KB of memory
  – Had 3,500 transistors

Microprocessor History
• Intel 8080 (1974)
  – Intel's second 8-bit microprocessor
  – Max. CPU clock rate 2 MHz
  – Large 40-pin DIP packaging
  – 16-bit address bus and an 8-bit data bus
  – Easy access to 64 kilobytes of memory
  – Processor had seven 8-bit registers (A, B, C, D, E, H, and L)

Microprocessor History • Intel 8086 (1978) – 16-bit microprocessor – Max. CPU clock rate 5 MHz to 10 MHz – 20-bit external address bus gave a 1 MB physical address – 16-bit registers including the stack pointer,


Microprocessor History • Intel 80286 (1982) – 16-bit x86 microprocessor – 134,000 transistors – Max. CPU clock rate 6 MHz to 25 MHz – Runs in two modes • Protected mode • Real mode


Microprocessor History • Intel 80386 (1985) – 32-bit microprocessor – 275,000 transistors – 32-bit data bus – Max. CPU clock rate 12 MHz to 40 MHz – Instruction set • x86 (IA-32)


Microprocessor History
• Intel 80486 (1989)
  – Max. CPU clock rate 16 MHz to 100 MHz
  – FSB speeds 16 MHz to 50 MHz
  – Instruction set x86 (IA-32)
  – An 8 KB on-chip SRAM cache
  – The 486 has a 32-bit data bus and a 32-bit address bus
  – Power Management Features and System Management Mode (SMM) became a standard feature

Microprocessor History • Intel Pentium I (1993) – Intel's 5th generation micro architecture – Operated at 60 MHz – powered at 5V and generated enough heat to require a CPU cooling fan – Level 1 CPU cache from 16 KB to 32 KB – Contained 4.5 million transistors – compatible with the common Socket 7 motherboard configuration


Microprocessor History • Intel Pentium II (1997) – Intel's sixth-generation microarchitecture – packaged in a Single Edge Contact cartridge (Slot 1) – speeds from 233 MHz to 450 MHz – Instruction set IA-32, MMX – cache size was increased to 512 KB – better choice for consumer-level operating systems, such as Windows 9x, and multimedia applications


Microprocessor History
• Intel Pentium III (1999)
  – 400 MHz to 1.4 GHz
  – Instruction set IA-32, MMX, SSE
  – L1-Cache: 16 + 16 KB (Data + Instructions)
  – L2-Cache: 512 KB, external chips on CPU module at 50% of CPU speed
  – The first x86 CPU to include a unique, retrievable identification number

Microprocessor History • Intel Pentium IV (2000) – Max. CPU clock rate 1.3 GHz to 3.8 GHz – Instruction set x86 (i386), x86-64, MMX, SSE, SSE2, SSE3 – Featured Hyper-Threading Technology (HTT) – 64-bit external data bus – More than 42 million transistors – Processor (front-side) bus runs at 400MHz, 533MHz, 800MHz, or 1066MHz – Up to 4GB RAM supported – 2MB of full-speed L3 cache


Microprocessor History • Intel Core Duo – Processing Die Transistors 151 million – Consists of two cores – 2 MB L2 cache – All models support: MMX, SSE, SSE2, SSE3, EIST, XD bit – FSB Speed 533 MHz – Intel® Virtualization Technology (VT-x) – Execute Disable Bit


Microprocessor History • Pentium Dual-Core – Max. CPU clock rate 1.3 GHz to 2.6 GHz – based on either the 32-bit Yonah or (with quite different microarchitectures) 64-bit Merom-2M – Instruction set MMX, SSE, SSE2, SSE3, SSSE3, x86-64 – FSB speeds 533 MHz to 800 MHz – Cores 2


Microprocessor History • Intel Core Duo
– Clock Speed 1.2 GHz
– L2 Cache 2 MB
– FSB Speed 533 MHz
– Instruction Set 32-bit
– Processing Die Transistors 151 million
– Advanced Technologies
• Intel® Virtualization Technology (VT-x)
• Enhanced Intel SpeedStep® Technology
• Execute Disable Bit


Microprocessor History • Intel Core 2 Duo
– Cores 2, Threads 2
– Clock Speed 3.33 GHz
– L2 Cache 6 MB
– FSB Speed 1333 MHz
– Processing Die Transistors 410 million
– Advanced Technologies
• Intel® Virtualization Technology (VT-x)
• Intel® Virtualization Technology for Directed I/O (VT-d)
• Intel® Trusted Execution Technology
• Intel® 64
• Idle States
• Enhanced Intel SpeedStep® Technology
• Thermal Monitoring Technologies
• Execute Disable Bit


Microprocessor History • Intel Core 2 Quad
– Cores 4, Threads 4
– Clock Speed 3.0 GHz
– L2 Cache 12 MB
– FSB Speed 1333 MHz
– Processing Die Transistors 410 million
– Advanced Technologies
• Intel® Virtualization Technology (VT-x)
• Intel® Virtualization Technology for Directed I/O (VT-d)
• Intel® Trusted Execution Technology
• Intel® 64
• Idle States
• Enhanced Intel SpeedStep® Technology
• Thermal Monitoring Technologies
• Execute Disable Bit


Microprocessor History • Core i3
– Cores 2, Threads 4
– Clock Speed 2.13 GHz
– Intel® Smart Cache 3 MB
– Instruction Set 64-bit
– Instruction Set Extensions SSE4.1, SSE4.2
– Max Memory Size 8 GB
– Processing Die Transistors 382 million
– Technologies
• Intel® Trusted Execution Technology
• Intel® Fast Memory Access
• Intel® Flex Memory Access


Microprocessor History • Core i5
– Cores 2, Threads 4
– Clock Speed 1.7 GHz to 3.0 GHz
– Max Memory Size 8 GB
– Processing Die Transistors 382 million
– Technologies
• Intel® Trusted Execution Technology
• Intel® Fast Memory Access
• Intel® Flex Memory Access
• Intel® Anti-Theft Technology
• Intel® My WiFi Technology
• 4G WiMAX Wireless Technology
• Idle States


Microprocessor History • Core i7
– Cores 4, Threads 8
– Clock Speed 3.4 GHz
– Max Turbo Frequency 3.8 GHz
– Intel® Smart Cache 8 MB
– Technologies
• Intel® Turbo Boost Technology 2.0
• Intel® vPro Technology
• Intel® Hyper-Threading Technology
• Intel® Virtualization Technology (VT-x)
• Intel® Virtualization Technology for Directed I/O (VT-d)
• Intel® Trusted Execution Technology
• AES New Instructions
• Intel® 64
• Idle States
• Enhanced Intel SpeedStep® Technology
• Thermal Monitoring Technologies
• Intel® Fast Memory Access
• Intel® Flex Memory Access
• Execute Disable Bit


Summary – Processor Family Vs Buses


Summary - Intel processors (1)


AMD processors (1)


AMD processors (2)


Microprocessors


Processor Instructions • Intel 80386 (1985) – x86 (IA-32)

• Intel 80486 (1989) – x86 (IA-32)

• Intel Pentium I (1993) – x86 (IA-32)

• Intel Pentium II (1997) – IA-32, MMX


Processor Instructions(2) • Intel Pentium III (1999) – IA-32, MMX, SSE

• Intel Pentium IV (2000) – x86 (i386), x86-64, MMX, SSE, SSE2, SSE3

• Intel Core Duo – MMX, SSE, SSE2, SSE3, EIST, XD bit

• Pentium Dual-Core – MMX, SSE, SSE2, SSE3, SSSE3, x86-64


Processor Modes


Processor modes • Intel and compatible processors run in several modes – Real mode – IA-32 mode • Protected mode • Virtual real mode

– IA-32e 64-bit mode • 64-bit mode • Compatibility mode


8086 Real Mode (x86)
• 80286 and later x86-compatible CPUs
• Executes 16-bit instructions
• Addresses only 1 MB of memory
• Single-tasking: MS-DOS programs run in this mode
– Windows 1.x, 3.x
– 16-bit instructions
• No built-in protection to keep one program from overwriting another in memory


IA-32 - Protected Mode • First implemented in the Intel 80386 as a 32-bit extension of the x86 architecture • Can run 32-bit instructions • A 32-bit OS and 32-bit applications are required • Programs are protected to keep one program from overwriting another in memory


Virtual Real mode (IA-32 Mode) • Backward compatibility (can run 16-bit apps) – used to execute DOS programs in Windows/386, Windows 3.x, Windows 9x/Me

• 16-bit programs run on top of the 32-bit protected mode • Addresses only up to 1 MB • All Intel and Intel-compatible processors power up in real mode


IA-32e 64-bit Execution Mode • Originally designed by AMD, later adopted by Intel • The processor can run – Real mode – IA-32 mode – IA-32e mode

• IA-32e 64-bit mode runs a 64-bit OS and 64-bit apps • Needs a 64-bit OS and all-64-bit hardware


64-Bit Operating Systems • Windows XP – 64-bit Edition for Itanium (IA-64 processors) • Windows XP Professional x64 (IA-32e, Athlon 64) • 32-bit applications can run without any problem • 16-bit and DOS applications do not run • Problem? – All 32-bit and 64-bit drivers are required


Physical memory limit


Processors Features


Processors Features • • • • • • • • • • • 2011

System Management Mode (SMM) MMX Technology SSE, SSE2, SSE3, SSE4 etc 3DNow!, Technology Math core processor Hyper Threading Dual core technology Quad core technology Intel Virtualization Execute Disable bit Intel® Turbo Boost Technology Computer System Architecture

128

System Management Mode (SMM) • Is an operating mode in which normal execution (including the operating system) is suspended, and special separate software is executed in a high-privilege mode • It is available in all later microprocessors in the x86 architecture • Some uses of SMM are – Handle system events like memory or chipset errors – Manage system safety functions, such as shutdown on high CPU temperature and turning the fans on and off – Control power management operations, such as managing the voltage regulator modules


MMX Technology • Multimedia extension / matrix math extension • Improves audio/video compression • MMX defined eight registers, known as MM0 through MM7 • Each of the MMn registers holds 64 bits • MMX provides only integer operations • Used for both 2D and 3D calculations • 57 new instructions + SIMD (Single Instruction, Multiple Data) operation


SSE - Streaming SIMD Extensions • Used to accelerate floating-point and parallel calculations • Is a SIMD instruction set extension to the x86 architecture • Subsequently expanded by Intel to SSE2, SSE3, SSSE3, and SSE4 • Supports floating-point math • SSE originally added eight new 128-bit registers known as XMM0 through XMM7 • SSE instructions – Floating-point instructions – Integer instructions – Other instructions


SSE2 - Streaming SIMD Extensions 2
• Introduced in the Pentium IV
• Adds 144 additional instructions
• Also includes the MMX and SSE instructions
• SSE2 is an extension of the IA-32 architecture


SSE3 - Streaming SIMD Extensions 3 • Introduced in the Pentium IV Prescott processor • Code-named Prescott New Instructions (PNI) • Contains 13 new instructions • Also includes MMX, SSE, SSE2


SSSE3 - Supplemental SSE3 • Introduced in Xeon and Core 2 processors • Adds 32 new SIMD instructions to SSE3


SSE4 (HD Boost) • Introduced by Intel in 2008 • Adds 54 new instructions • 47 of the SSE4 instructions are referred to as SSE4.1 • The other 7 instructions as SSE4.2 • SSE4.1 – targeted at improving the performance of media, imaging and 3D workloads • SSE4.2 improves string and text processing


SSE - Advantages • Higher-quality images at higher resolution • High-quality audio and MPEG-2 video; better multimedia application support • Reduced CPU utilization for speech-recognition software • SSEx instructions are useful in MPEG-2 decoding


3DNow! Technology • AMD's alternative to SSE • Uses 21 instructions based on SIMD technology • Enhanced 3DNow! adds 24 more instructions • Professional 3DNow! adds 51 SSE commands to Enhanced 3DNow!


Math coprocessor • Provides hardware for floating-point math • Speeds up computer operations • All Intel processors since the 486DX include a built-in floating-point unit (FPU) • Can perform high-level mathematical operations • Its instruction set differs from that of the main CPU


Hyper-Threading Technology • Is an Intel-proprietary technology used to improve parallelization of computations by doing multiple tasks at once • The operating system addresses two virtual processors, and shares the workload between them when possible • Allows multiple threads to run simultaneously


Hyper-Threading Technology • Originally introduced in the Xeon processor for servers (2002) • Available in all PIV processors with an 800 MHz bus speed • HT-enabled processors have two sets of general-purpose and control registers • Only a single cache memory and a single bus, which the two threads share


HT - Requirements
• Processor with HT Technology
• Compatible motherboard (chipset)
• BIOS support
• Compatible OS
• Software written to support HT


Dual Core Technology • Introduced in 2005 • Consists of 2 CPU cores (enables a single processor to work as 2 processors) • Multi-tasking performance is improved


Quad-Core Technology • Consists of 4 CPU cores (enables a single processor to work as 4 processors) • Less power consumption • Designed to provide a better multimedia and multi-tasking experience


Intel Virtualization • Allows a single hardware platform to run multiple operating systems • Available in Core 2 Quad processors


Execute Disable Bit • Is a hardware-based security feature • Can reduce exposure to viruses and malicious-code attacks, and prevent harmful software from executing and propagating on the server or network • Helps protect business assets and reduces the need for costly virus-related repairs


Intel® Turbo Boost Technology • Provides more performance when needed • Automatically allows processor cores to run faster than the base operating frequency • Depends on the workload and operating environment • The processor frequency dynamically increases until the upper frequency limit is reached • Has multiple algorithms operating in parallel to manage current, power, and temperature to maximize performance and energy efficiency


Bugs


Bugs • A processor can contain defects or errors • Previously, the only way to fix a bug was to – work around it, or replace the processor with a bug-free one

• Now… – Many bugs can be fixed by altering the microcode – Microcode is a set of information describing how the processor works – Processors incorporate reprogrammable microcode


Fixing the Bugs • Microcode updates reside in the ROM BIOS • Each time the system is rebooted, the fixed microcode is loaded • The microcode is provided by Intel to motherboard manufacturers, who incorporate it into the ROM BIOS • The most recent BIOS should therefore be installed


CPU Design Strategy CISC & RISC


What is CISC? • CISC is an acronym for Complex Instruction Set Computer • Most common microprocessor designs such as the Intel 80x86 and Motorola 68K series followed the CISC philosophy. • But recent changes in software and hardware technology have forced a re-examination of CISC and many modern CISC processors are hybrids, implementing many RISC principles. • CISC was developed to make compiler development simpler.

CISC Characteristics • 2-operand format, • Variable length instructions where the length often varies according to the addressing mode • Instructions which require multiple clock cycles to execute. • E.g. Pentium is considered a modern CISC processor • Complex instruction-decoding logic, driven by the need for a single instruction to support multiple addressing modes. • A small number of general purpose registers • Several special purpose registers. • A 'Condition code" register which is set as a side-effect of most instructions.

CISC Advantages • Microprogramming is as easy as assembly language to implement • The ease of microcoding new instructions allowed designers to make CISC machines upwardly compatible: a new computer could run the same programs as earlier computers because the new computer would contain a superset of the instructions of the earlier computers. • As each instruction became more capable, fewer instructions could be used to implement a given task. This made more efficient use of the relatively slow main memory.


CISC Disadvantages • Instruction set & chip hardware become more complex with each generation of computers. • Many specialized instructions aren't used frequently enough to justify their existence • CISC instructions typically set the condition codes as a side effect of the instruction.

What is RISC? • RISC - Reduced Instruction Set Computer. – is a type of microprocessor architecture – utilizes a small, highly-optimized set of instructions, rather than a more specialized set of instructions often found in other types of architectures.

• History – The first RISC projects came from IBM, Stanford, and UC-Berkeley in the late 70s and early 80s. – The IBM 801, Stanford MIPS, and Berkeley RISC 1 and 2 were all designed with a similar philosophy which has become known as RISC.

RISC - Characteristics • one-cycle execution time: RISC processors have a CPI (clock cycles per instruction) of one cycle. This is due to the optimization of each instruction on the CPU and a technique called pipelining • pipelining: a technique that allows the simultaneous execution of parts, or stages, of instructions, to process instructions more efficiently • large number of registers: the RISC design philosophy generally incorporates a larger number of registers to reduce interactions with memory


RISC Attributes
The main characteristics of CISC microprocessors are:
• Extensive instructions
• Complex and efficient machine instructions
• Microencoding of the machine instructions
• Extensive addressing capabilities for memory operations
• Relatively few registers
In comparison, RISC processors are more or less the opposite of the above:
• Reduced instruction set
• Less complex, simple instructions
• Hardwired control unit and machine instructions
• Few addressing schemes for memory operands, with only two basic instructions, LOAD and STORE

CISC Vs RISC

CISC                                          | RISC
Emphasis on hardware                          | Emphasis on software
Includes multi-clock complex instructions     | Single-clock, reduced instructions only
Memory-to-memory: "LOAD" and "STORE"          | Register-to-register: "LOAD" and "STORE"
incorporated in instructions                  | are independent instructions
Small code sizes, high cycles per second      | Low cycles per second, large code sizes
Transistors used for storing complex          | Spends more transistors on memory
instructions                                  | registers

Performance of Computers

Improving Performance of Computers • Increasing clock speed – Physical limitation (Need new hardware)

• Parallelism (doing more things at once) – Instruction-level parallelism • Getting more instructions executed per second

– Processor-level parallelism • Having multiple CPUs working on the same problem

Instruction-level parallelism • Pipelining – Instruction execution speed is limited by the time taken to fetch instructions from memory – Early computers fetched instructions in advance and stored them in registers (a prefetch buffer) • Prefetching divides instruction execution into two parts – Fetching – Actual execution

– Pipelining divides instruction execution into many parts, each handled by dedicated hardware that can run in parallel

Pipelining example • Packaging cakes – W1: Place an empty box on the belt every 10 seconds – W2: Place the cake in the empty box – W3: Close and seal the box – W4: Label the box – W5: Remove the box and place it in the large container


Computer Pipelines

• S1: Fetch instruction from memory and place it in a buffer until it is needed • S2: Decode the instruction; determine its type and the operands it needs • S3: Locate and fetch the operands from memory (or registers) • S4: Execute the instruction • S5: Write the result back into a register


Example
Let T = cycle time (in ns) and N = number of stages in the pipeline.

Latency: time taken to execute one instruction = N x T
Processor bandwidth: number of MIPS the CPU delivers = 1000 / T
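The latency and bandwidth relations above can be sketched in a few lines of code. The sample values (a 5-stage pipeline with a 2 ns cycle time) are illustrative, not from the notes:

```python
def latency_ns(n_stages, cycle_ns):
    # Time for one instruction to pass through all N stages: N x T.
    return n_stages * cycle_ns

def bandwidth_mips(cycle_ns):
    # Once the pipeline is full, one instruction completes per cycle,
    # so throughput is 1000 / T million instructions per second (T in ns).
    return 1000 / cycle_ns

print(latency_ns(5, 2))    # latency of one instruction, in ns
print(bandwidth_mips(2))   # steady-state throughput, in MIPS
```

Note that a deeper pipeline raises the latency of each instruction while leaving the steady-state bandwidth set only by the cycle time.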

Processor - pipeline depth


Dual pipelines

• Instruction fetch unit fetches a pair of instructions and puts each one into its own pipeline • The Pentium has two five-stage pipelines – U pipeline (main) executes an arbitrary Pentium instruction – V pipeline (second) executes only integer instructions, plus one simple floating-point instruction

• If the instructions in a pair conflict, the instruction in the U pipeline is executed; the other instruction is held and paired with the next instruction


Superscalar architecture • Single pipeline with multiple functional units

Processor level parallelism • Multiprocessors (shared memory, single bus): high bus traffic

• Multicomputers (message passing): low bus traffic

Measuring Performance

Moore’s law • Describes a long-term trend in the history of computing hardware • Formulated by Dr. Gordon Moore in 1965 • Predicts an exponential increase in component density over time, with a doubling time of 18 months • Applicable to microprocessors, DRAMs, DSPs and other microelectronics

Moore's Law and Performance • The performance of computers is determined by architecture and clock speed. • Clock speed doubles over a 3 year period due to the scaling laws on chip. • Processors using identical or similar architectures gain performance directly as a function of Moore's Law. • Improvements in internal architecture can yield better gains than predicted by Moore's Law.

Measuring Performance • Execution time: – Time between the start and completion of a task (including disk accesses and memory accesses)

• Throughput: – Total amount of work done in a given time

Performance of a Computer

Two computers X and Y. If Performance(X) > Performance(Y), then

Execution Time(Y) > Execution Time(X)

Comparing the performance of two computers: X is n times faster than Y when

Execution Time(Y) / Execution Time(X) = n

CPU Time • Time CPU spends on a task • User CPU time – CPU time spent in the program

• System CPU time – CPU time spent in OS performing tasks on behalf of the program

CPU Time (Example)
• User CPU time = 90.7 s
• System CPU time = 12.9 s
• Execution time = 2 min 39 s = 159 s

% of CPU time = (User CPU time + System CPU time) / Execution time x 100%
              = (90.7 + 12.9) / 159 x 100 ≈ 65%
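The percentage-of-CPU-time calculation above can be checked in code. The values are the slide's; the variable names are mine:

```python
user_cpu = 90.7      # seconds the CPU spent in the program itself
system_cpu = 12.9    # seconds the CPU spent in the OS on the program's behalf
elapsed = 159.0      # total execution time: 2 min 39 s

# Fraction of elapsed time during which the CPU was busy on this task.
cpu_percent = (user_cpu + system_cpu) / elapsed * 100
print(round(cpu_percent))   # roughly 65 percent
```

The remaining ~35% of the elapsed time is spent waiting, e.g. on I/O or on other processes.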

Clock Rate
• The computer clock runs at a constant rate and determines when events take place in the hardware

Clock Rate = 1 / Clock Cycle Time

Amdahl’s law • Performance improvement that can be gained from some faster mode of execution is limited by fraction of the time the faster mode can be used

Amdahl’s law • Speedup depends on – the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement (Fraction enhanced) – the improvement gained by the enhanced execution mode (Speedup enhanced)

Example
Total execution time of a program = 50 s
Execution time that can be enhanced = 30 s
Fraction enhanced = 30 / 50 = 0.6

Speedup
Speedup enhanced = (execution time in normal mode) / (execution time in enhanced mode)

Example
Normal-mode execution time for some portion of a program = 6 s
Enhanced-mode execution time for the same portion = 2 s
Speedup enhanced = 6 / 2 = 3

Execution Time
Execution time (new) = Execution time (old) x ((1 − Fraction enhanced) + Fraction enhanced / Speedup enhanced)
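A minimal Amdahl's-law helper, combining the two quantities defined above; the function name is mine, and the figures come from the two examples (F = 30/50 and S = 6/2):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    # New time is (1 - F) + F / S of the old time, so the overall
    # speedup is the reciprocal of that sum (Amdahl's law).
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

s = overall_speedup(0.6, 3)
print(round(s, 3))   # about 1.667: the 50 s program would now take 30 s
```

Even though the enhanced portion runs 3x faster, the unenhanced 40% of the program caps the overall gain well below 3.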

Example • Suppose we consider an enhancement to the processor of a server system used for Web serving. New CPU is 10 times faster on computation in Web application than original CPU. Assume original CPU is busy with computation 40% of the time and is waiting for I/O 60% of time.

What is the overall speedup gained from enhancement?

Answer
Fraction enhanced = 0.4, Speedup enhanced = 10
Speedup overall = 1 / ((1 − 0.4) + 0.4 / 10) = 1 / 0.64 ≈ 1.56

Remark • If an enhancement is only usable for a fraction of a task, we cannot speed the task up by more than

1 / (1 − Fraction enhanced)

Example • A common transformation required in graphics engines is the square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics • Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics program • Design alternatives 1. Enhance the FPSQR hardware and speed up this operation by a factor of 10 2. Make all FP instructions run faster by a factor of 1.6

Example • FP instructions are responsible for a total of 50% of the execution time. The design team believes they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives

Answer
Speedup (FPSQR) = 1 / ((1 − 0.2) + 0.2 / 10) = 1 / 0.82 ≈ 1.22
Speedup (all FP) = 1 / ((1 − 0.5) + 0.5 / 1.6) = 1 / 0.8125 ≈ 1.23
Speeding up all FP instructions is slightly better overall.

CPU performance equation
CPU time = CPU clock cycles for a program x Clock cycle time
         = CPU clock cycles / Clock rate

Example A program runs in 10s on computer A having 400 MHz clock. A new machine B, which could run the same program in 6s, has to be designed. Further, B should have 1.2 times as many clock cycles as A. What should be the clock rate of B?

Answer
CPU clock cycles (A) = 10 s x 400 MHz = 4 x 10^9 cycles
CPU clock cycles (B) = 1.2 x 4 x 10^9 = 4.8 x 10^9 cycles
Clock rate (B) = 4.8 x 10^9 cycles / 6 s = 800 MHz

CPU Clock Cycles
• CPI (clock cycles per instruction): average number of clock cycles each instruction takes to execute
• IC (instruction count): number of instructions executed in the program

CPU clock cycles = CPI x IC

Note: CPI can be used to compare two different implementations of the same instruction set architecture (as the IC required for a program is the same)

Example • Consider two implementations of same instruction set architecture. For a certain program, details of time measurements of two machines are given below

• Which machine is faster for this program and by how much?

Answer

Measuring components of the CPU performance equation • CPU Time: by running the program • Clock Cycle Time: published in the documentation • IC: by software tools / a simulator of the architecture (more difficult to obtain) • CPI: by simulation of an implementation (more difficult to obtain)

CPU clock cycles
Suppose there are n different types of instruction.
Let IC_i = number of times instruction type i is executed in the program, and CPI_i = average number of clock cycles for instruction type i. Then

CPU clock cycles = Σ (CPI_i x IC_i), summed over i = 1 … n

Example
Suppose we have made the following measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4.0
– Average CPI of other instructions = 1.33
– Frequency of FPSQR = 2%
– CPI of FPSQR = 20

Design alternatives:
1. Decrease the CPI of FPSQR to 2
2. Decrease the average CPI of all FP operations to 2.5

Compare these two design alternatives using the CPU performance equation

Answers
• Note that only the CPI changes; the clock rate and IC remain identical
• CPI (original) = 0.25 x 4.0 + 0.02 x 20 + 0.73 x 1.33 ≈ 2.37
• CPI (alternative 1, FPSQR CPI = 2) = 0.25 x 4.0 + 0.02 x 2 + 0.73 x 1.33 ≈ 2.01
• CPI (alternative 2, all FP CPI = 2.5) = 0.27 x 2.5 + 0.73 x 1.33 ≈ 1.65
• Alternative 2 gives the lower CPI and is therefore the faster design
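The weighted-CPI comparison can be written as a small helper over (frequency, CPI) pairs. The instruction mix is the slide's; the function and variable names are mine:

```python
def avg_cpi(mix):
    # mix: list of (frequency, cpi) pairs whose frequencies sum to 1.
    return sum(freq * cpi for freq, cpi in mix)

other = 1.0 - 0.25 - 0.02   # remaining 73% of instructions, CPI 1.33

original = avg_cpi([(0.25, 4.0), (0.02, 20.0), (other, 1.33)])
alt1     = avg_cpi([(0.25, 4.0), (0.02, 2.0),  (other, 1.33)])  # FPSQR CPI -> 2
alt2     = avg_cpi([(0.27, 2.5), (other, 1.33)])                # all FP CPI -> 2.5

print(round(original, 2), round(alt1, 2), round(alt2, 2))
```

Since IC and clock rate are unchanged, the design with the lowest average CPI (alternative 2) is the fastest.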

MIPS as a performance measure

MIPS = Instruction count / (Execution time x 10^6)

Problems with MIPS as a performance measure • MIPS is dependent on the instruction set – difficult to compare the MIPS of computers with different instruction sets

• MIPS can vary inversely to performance

MFLOPS as a performance measure

MFLOPS = Number of floating-point operations / (Execution time x 10^6)

Problems with MFLOPS as a performance measure • MFLOPS is not dependable – the Cray C90 has no divide instruction, while the Pentium has

• MFLOPS depends on the mixture of fast and slow floating point operations – add (fast) and divide (slow) operations

Instruction Set Architecture (ISA) Level


Introduction


Instruction Set Architecture • Positioned between microarchitecture level and operating system level • Important to system architects – interface between software and hardware


Instruction Set Architecture


ISA contd.. • General approach of system designers: – Build programs in high-level languages – Translate to ISA level – Build hardware that executes ISA level programs directly

• Key challenge: – Build better machines subject to the backward-compatibility constraint

Features of a good ISA • Define a set of instructions that can be implemented efficiently in current and future technologies, resulting in cost-effective designs over several generations • Provide a clean target for compiled code


Properties of the ISA level • ISA level code is what a compiler outputs • To produce ISA code, the compiler writer has to know – What the memory model is – What registers there are – What data types and instructions are available

ISA level memory models • Computers divide memory into cells (8 bits) that have consecutive addresses • Bytes are grouped into words (4- or 8-byte) with instructions available for manipulating entire words • Many architectures require words to be aligned on their natural boundaries – Memories operate more efficiently that way

ISA level Memory Models

• On the Pentium II (which fetches 8 bytes at a time from memory), ISA programs can make memory references to words starting at any address – Requires extra logic circuits on the chip – Intel allows it because of the backward-compatibility constraint (8088 programs made non-aligned memory references)

ISA level registers • Main function of ISA level registers: – provide rapid access to heavily used data

• Registers are divided into 2 categories – special-purpose registers (program counter, stack pointer) – general-purpose registers (hold key local variables and intermediate results of calculations) • The general-purpose registers are interchangeable

Instructions • The main feature of the ISA level is its set of machine instructions • They control what the machine can do • Ex: – LOAD and STORE instructions move data between memory and registers – the MOVE instruction copies data among registers

Pentium II ISA level (Intel’s IA-32) • Maintains full support for execution of programs written for the 8086 and 8088 processors (16-bit) • The Pentium II has 3 operating modes (Real mode, Virtual 8086 mode, Protected mode) • Address space: memory is divided into 16,384 segments, each going from address 0 to address 2^32 − 1 (Windows supports only one segment) • Every byte has its own address, with words being 32 bits long • Words are stored in little-endian format (the low-order byte has the lowest address)


Little endian and Big endian format

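The byte-order distinction can be made concrete with the standard `struct` module: the same 32-bit word serialized both ways. The word value is an illustrative one:

```python
import struct

word = 0x01020304
little = struct.pack('<I', word)   # little-endian: low-order byte first
big    = struct.pack('>I', word)   # big-endian: high-order byte first

print(little.hex())   # low-order byte 04 lands at the lowest address
print(big.hex())      # high-order byte 01 lands at the lowest address
```

On a little-endian machine such as the Pentium II, the in-memory layout matches the `'<I'` case.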

Pentium II’s primary registers


Pentium II’s primary registers • EAX: Main arithmetic registers, 32-bit – 16-bit register in low-order 16 bits – 8-bit register in low-order 8 bits – easy to manipulate 16-bit (in 80286) and 8-bit (in 8088) quantities

• EBX: holds pointers • ECX: used in looping • EDX: used for multiplication and division, where together with EAX, it holds 64-bit products and dividends 221

Pentium II’s primary registers
• ESI, EDI: hold pointers into memory – especially for the hardware string-manipulation instructions (ESI points to the source string, EDI to the destination string)
• EBP: pointer register
• ESP: stack pointer
• CS through GS: segment registers
• EIP: program counter
• EFLAGS: flag register (holds various miscellaneous bits such as condition codes)

Pentium II data Types


Instruction Formats • An instruction consists of an opcode, plus additional information such as where the operands come from and where the results go • The opcode tells what the instruction does • On some machines, all instructions have the same length – Advantages: simple, easy to decode – Disadvantages: wastes space

Common Instruction Formats

(a) Zero address instruction (b) One address instruction (c) Two address instruction (d) Three address instruction


Instruction and Word length Relationships


Example • An instruction with a 4-bit opcode and three 4-bit addresses

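Packing such a 16-bit instruction word can be sketched with shifts and masks. The field order (opcode in the high 4 bits, then the three addresses) is an assumption for illustration:

```python
def encode(opcode, a1, a2, a3):
    # Each of the four fields must fit in 4 bits (values 0..15).
    for field in (opcode, a1, a2, a3):
        assert 0 <= field < 16
    return (opcode << 12) | (a1 << 8) | (a2 << 4) | a3

def decode(word):
    # Recover the four 4-bit fields from a 16-bit instruction word.
    return (word >> 12) & 0xF, (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF

w = encode(0b0010, 1, 2, 3)
print(hex(w))        # the packed 16-bit word
print(decode(w))     # the four fields recovered
```

With 4 bits per field there can be at most 16 distinct opcodes and 16 addressable locations per operand, which shows why address-field width is a central trade-off in format design.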

Design of Instruction Formats • Factors: – Length of instruction • short instructions are better than long instructions (modern processors can execute multiple instructions per clock cycle)

– Sufficient room in the instruction format to express all operations required – No. of bits in an address field


Intel® 64 and IA-32 Architectures
• Intel 64 and IA-32 instruction groups:
– General purpose
– x87 FPU
– x87 FPU and SIMD state management
– Intel MMX technology
– SSE extensions
– SSE2 extensions
– SSE3 extensions
– SSSE3 extensions
– SSE4 extensions
– AESNI and PCLMULQDQ
– Intel AVX extensions
– F16C, RDRAND, FS/GS base access
– System instructions
– IA-32e mode: 64-bit mode instructions
– VMX instructions
– SMX instructions


Addressing


Addressing • The subject of specifying where the operands (addresses) are – an ADD instruction requires 2 or 3 operands, and the instruction must tell where to find the operands and where to put the result

• Addressing Modes – methods of interpreting the bits of an address field to find the operand:
– Immediate Addressing
– Direct Addressing
– Register Addressing
– Register Indirect Addressing
– Indexed Addressing

Immediate Addressing • The simplest way to specify where the operand is • The address part of the instruction contains the operand itself (the immediate operand) • The operand is automatically fetched from memory at the same time the instruction itself is fetched – immediately available for use

• No additional memory references are required • Disadvantages – only a constant can be supplied – the value of the constant is limited by the size of the address field

• Good for specifying small integers

Example Immediate Addressing MOV R1, #8 ; Reg[R1] ← 8 ADD R2, #3 ; Reg[R2] ← Reg[R2] + 3


Direct Addressing • The operand is in memory, and is specified by giving its full address (the memory address is hardwired into the instruction) • The instruction will always access exactly the same memory location, which cannot change • Can only be used for global variables whose address is known at compile time

• Example instruction: – ADD R1, (1001) ; Reg[R1] ← Reg[R1] + Mem[1001]

Direct Addressing Example


Register Addressing • Same as direct addressing except that it specifies a register instead of a memory location • The most common addressing mode on most computers, since register accesses are very fast • Compilers try to put the most commonly accessed variables in registers • Cannot be the only mode in LOAD and STORE instructions (one operand there is always a memory address)

• Example instruction: – ADD R3, R4 ; Reg[R3] ← Reg[R3] + Reg[R4]

Register Indirect Addressing • The operand being specified comes from memory or goes to memory • Its address is not hardwired into the instruction, but is contained in a register (a pointer) • Can reference memory without having a full memory address in the instruction • Different memory words can be used on different executions of the instruction

• Example instruction: – ADD R1, (R2) ; Reg[R1] ← Reg[R1] + Mem[Reg[R2]]


Example • The following generic assembly program calculates the sum of the elements (1024 of them) of an array A of 4-byte integers, and stores the result in register R1

      MOV R1, #0        ; sum in R1 (0 initially)
      MOV R2, #A        ; Reg[R2] = address of array A
      MOV R3, #A+4096   ; Reg[R3] = address of first word beyond A
LOOP: ADD R1, (R2)      ; register indirect via R2 to get operand
      ADD R2, #4        ; increment R2 by one word
      CMP R2, R3        ; is R2 < R3?
      BLT LOOP          ; loop if R2 < R3
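The register-indirect loop above can be modeled in a few lines, with memory as a flat word-addressed store and R2 walking through the array in 4-byte steps. The base address and array contents are made up for illustration:

```python
memory = {}                     # byte address -> 4-byte word, one entry per word
A = 0x1000                      # assumed base address of array A
for i in range(1024):
    memory[A + 4 * i] = i + 1   # fill the array with 1, 2, ..., 1024

r1, r2, r3 = 0, A, A + 4096     # R1 = sum, R2 = pointer, R3 = end address
while r2 < r3:                  # the CMP / BLT loop condition
    r1 += memory[r2]            # ADD R1, (R2): register indirect access
    r2 += 4                     # ADD R2, #4: advance one 4-byte word

print(r1)                       # sum of the 1024 array elements
```

The point of the mode is visible here: the instruction never names a memory address directly; R2 supplies a different address on every iteration.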

Indexed Addressing • Memory is addressed by giving a register plus a constant offset • Used to access local variables

• Example instruction: – ADD R3, 100(R2) ; Reg[R3] ← Reg[R3] + Mem[100+Reg[R2]]

239

Based-Indexed Addressing • Memory address is computed by adding up two registers plus an optional offset • Example instruction: ADD R3, (R1+R2) ;Reg[R3] ← Reg[R3] + Mem[Reg[R1] + Reg[R2]] 240
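The addressing modes above differ only in how the operand is located. A minimal sketch, with illustrative register and memory contents (none of these values come from the notes):

```python
# How each addressing mode resolves an operand's value.
regs = {"R1": 8, "R2": 100}           # illustrative register file
mem = {100: 42, 103: 7, 200: 5}       # illustrative memory contents

def operand(mode, **kw):
    if mode == "immediate":           # value is inside the instruction
        return kw["value"]
    if mode == "direct":              # full address in the instruction
        return mem[kw["addr"]]
    if mode == "register":            # value is in a register
        return regs[kw["reg"]]
    if mode == "register_indirect":   # register holds a pointer
        return mem[regs[kw["reg"]]]
    if mode == "indexed":             # register + constant offset
        return mem[regs[kw["reg"]] + kw["offset"]]
    raise ValueError(mode)

print(operand("immediate", value=8))            # 8
print(operand("register_indirect", reg="R2"))   # Mem[100] = 42
print(operand("indexed", reg="R2", offset=3))   # Mem[100+3] = 7
```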

Instruction Types • ISA level instructions are divided into few categories – Data Movement Instructions • Copy data from one location to another

– Examples (Pentium II integer instructions): • MOV DST, SRC – copies SRC (source) to DST (destination) • PUSH SRC – push SRC into the stack • XCHG DS1, DS2 – exchanges DS1 and DS2 • CMOV DST, SRC – conditional move 241

Instruction Types contd.. – Dyadic Operations • Combine two operands to produce a result (arithmetic instructions, Boolean instructions)

– Examples (Pentium II integer instructions): • ADD DST, SRC – adds SRC to DST, puts result in DST • SUB DST, SRC – subtracts SRC from DST • AND DST, SRC – Boolean AND SRC into DST • OR DST, SRC - Boolean OR SRC into DST • XOR DST, SRC – Boolean Exclusive OR SRC into DST 242

Instruction Types contd.. • Monadic Operations – Have one operand and produce one result – Shorter than dyadic instructions

• Examples (Pentium II integer instructions): – INC DST – adds 1 to DST – DEC DST – subtracts 1 from DST – NOT DST – replace DST with 1’s complement 243

Instruction Types contd.. • Comparison and Conditional Branch Instructions • Examples (Pentium II integer instructions): – TST SRC1, SRC2 – Boolean AND operands, set flags (EFLAGS) – CMP SRC1, SRC2 – sets flags based on SRC1-SRC2

244

Instruction Types contd.. • Procedure (Subroutine) call Instructions – When the procedure has finished its task, transfer is returned to statement after the call

• Examples (Pentium II integer instructions): – CALL ADDR -Calls procedure at ADDR – RET - Returns from procedure 245

Instruction Types contd.. • Loop Control Instructions – LOOPxx – loops until condition is met

• Input / Output Instructions There are several input/output schemes currently used in personal computers – Programmed I/O with busy waiting – Interrupt-driven I/O – DMA (Direct Memory Access) I/O

246

Programmed I/O with busy waiting • Simplest I/O method • Commonly used in low-end processors • Processors have a single input instruction and a single output instruction, and each of them selects one of the I/O devices • A single character is transferred between a fixed register in the processor and selected I/O device • Processor must execute an explicit sequence of instructions for each and every character read or written 247

DMA I/O • DMA controller is a chip that has a direct access to the bus • It consists of at least four registers, each can be loaded by software. – Register 1 contains memory address to be read/written – Register 2 contains the count of how many bytes / words to be transferred – Register 3 specifies the device number or I/O space address to use – Register 4 indicates whether data are to be read from or written to I/O device 248

Structure of a DMA

249

Registers in the DMA • • •





Status register: readable by the CPU to determine the status of the DMA device (idle, busy, etc.) • Command register: writable by the CPU to issue a command to the DMA • Data register: readable and writable; the buffering place for data being transferred between memory and the I/O device • Address register: contains the starting memory location from which, or to which, the data will be transferred; must be programmed by the CPU before issuing a "start" command to the DMA • Count register: contains the number of bytes that need to be transferred; the address and count registers combined specify exactly what information is to be transferred 250

Example • Writing a block of 32 bytes from memory address 100 to a terminal device (4)

251

Example contd.. • CPU writes numbers 32, 100, and 4 into first three DMA registers, and writes the code for WRITE (1, for example) in the fourth register • DMA controller makes a bus request to read byte 100 from memory • DMA controller makes an I/O request to device 4 to write the byte to it • DMA controller increments its address register by 1 and decrements its count register by 1 • If the count register is > 0, another byte is read from memory and then written to device • DMA controller stops transferring data when count = 0 252
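The transfer sequence above can be mimicked with a toy simulation; the memory contents and the "device" list are illustrative stand-ins, not part of the notes.

```python
# Toy simulation of the DMA write example above: 32 bytes from
# memory address 100 to device 4, with no CPU involvement per byte.
memory = bytes(range(256))     # illustrative memory contents
device_output = []             # bytes received by device 4

# CPU programs the four DMA registers, then issues WRITE
count, address, device, command = 32, 100, 4, 1   # 1 = WRITE

# DMA controller loops on its own until the count reaches zero
while count > 0:
    byte = memory[address]     # bus request: read byte from memory
    device_output.append(byte) # I/O request: write byte to device
    address += 1               # increment address register
    count -= 1                 # decrement count register

print(len(device_output), device_output[0])   # 32 100
```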

Sample Questions Q1. 1. Explain the processor architecture of the 8086. 2. What are the differences between the Intel Pentium processor and a dual-core processor? 3. What are the advantages and disadvantages of multi-core processors?

253

Sample Questions Q2. 1. What is an addressing mode? 2. Comparing advantages, disadvantages and features, briefly explain each addressing mode. 3. What is DMA and why is it useful? Explain your answer.

254

Computer Memory • Primary Memory • Secondary Memory • Virtual Memory

255

Levels in Memory Hierarchy

Level          Size          Speed      $/MB           Line/transfer size
Registers      32 B          0.3 ns     –              4 B
Cache          32 KB–4 MB    ~2 ns      ~$75/MB        32 B
Main memory    4096 MB       7.5 ns     $0.014/MB      4 KB
Disk           1 TB          8 ms       $0.00012/MB    –

(moving down the hierarchy: larger, slower, cheaper)

Primary Memory

257

Primary memory • Memory is the workspace for the CPU • When a file is loaded into memory, it is a copy of the file that is actually loaded • Consists of a number of cells, each having a number (address) • n cells → addresses 0 to n−1 • Same no. of bits in each cell • Adjacent cells have consecutive addresses • An m-bit address → 2^m addressable cells • A portion of the RAM address space is mapped into one or more ROM chips 258

Ways of organizing a 96-bit memory

259

SRAM (Static RAM) • • • •

Constructed using flip flops 6 transistors for each bit of storage Very fast Contents are retained as long as power is kept on • Expensive • Used in level 2 cache

260

DRAM (Dynamic RAM) • No flip-flops • Array of cells, each consisting of a transistor and a capacitor • Capacitors can be charged or discharged, allowing 0s and 1s to be stored • Electric charge tends to leak out, so each bit in a DRAM must be reloaded (refreshed) every few milliseconds (~15 ms) to prevent data from leaking away • Refreshing takes several CPU cycles to complete (less than 1% of overall bandwidth) • High density (about 30 times smaller than SRAM) • Used in main memories • Slower than SRAM • Inexpensive (about 30 times cheaper than SRAM)

261

SDRAM (Synchronous DRAM) • • • •

Hybrid of SRAM and DRAM Runs in synchronization with the system bus Driven by a single synchronous clock Used in large caches, main memories

262

DDR (Double Data Rate) SDRAM • An upgrade to standard SDRAM • Performs 2 transfers per clock cycle (one at falling edge, one at rising edge) without doubling actual clock rate

263

Dual channel DDR •

Technique in which 2 DDR DIMMs are installed at one time and function as a single bank doubling the bandwidth of a single module



DDR2 SDRAM – – – –



A faster version of DDR SDRAM (doubles the data rate of DDR) Less power consumption than DDR Achieves higher throughput by using differential pairs of signal wires Additional signals add to the pin count

DDR3 SDRAM – – – – –

An improved version of DDR2 SDRAM Same no. of pins as in DDR2, but not compatible with DDR2 Can transfer twice the data rate of DDR2 The DDR3 standard allows chip sizes of 512 Megabits to 8 Gigabits (max module size – 16 GB)

264

DRAM Memory module

265

DRAM Memory module

266

SDRAM and DDR DIMM versions • Buffered • Unbuffered • Registered

267

SDRAM and DDR DIMM • Buffered Module – Has additional buffer circuits between memory chips and the connector to buffer signals – New motherboards are not designed to use buffered modules

• Unbuffered Module – Allows memory controller signals to pass directly to memory chips with no interference – Fast and most efficient design – Most motherboards are designed to use unbuffered modules 268

SDRAM and DDR DIMM • Registered Module – Uses register chips on the module that act as an interface between RAM chip and chipset – Used in systems designed to accept extremely large amounts of RAM (server motherboards)

269

Memory Errors

270

Memory errors • Hard errors – Permanent failure – How to fix? (replace the chip)

• Soft errors – Non permanent failure – Occurs at infrequent intervals – How to fix? (restart the system)

• Best way to deal with soft errors is to increase system’s fault tolerance (implement ways of detecting and correcting errors) 271

Techniques used for fault tolerance • Parity • ECC (Error Correcting Code)

272

Parity Checking • 9 bits are used in the memory chip to store 1 byte of information • Extra bit (parity bit) keeps tabs on other 8 bits • Parity can only detect errors, but cannot correct them

273

ODD Parity standard for error checking • The parity generator/checker is part of the CPU or located in a special chip on the motherboard • The parity checker evaluates the 8 data bits by counting the number of 1s in the byte • If an even number of 1s is found, the parity generator creates a 1 and stores it as the parity bit in the memory chip 274

ODD Parity standard for error checking (contd.) • If the count is odd, the parity bit is 0 • If a (9-bit) byte has an even number of 1s, that byte must have an error • The system cannot tell which bit or bits have changed • If 2 bits changed, the bad byte could pass unnoticed • Multiple-bit errors in a single byte are very rare • The system halts when a parity check error is detected 275
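The odd-parity scheme above is easy to demonstrate in code; the data byte below is an arbitrary example.

```python
# Odd-parity generation and checking for one data byte.
def parity_bit(byte):
    """Return the bit that makes the total number of 1s odd."""
    ones = bin(byte).count("1")
    return 1 if ones % 2 == 0 else 0

def check(byte, p):
    """True if the 9-bit pattern has an odd number of 1s (no error)."""
    return (bin(byte).count("1") + p) % 2 == 1

data = 0b10110010            # four 1s -> parity bit must be 1
p = parity_bit(data)
print(p, check(data, p))     # 1 True

corrupted = data ^ 0b00001000    # flip one bit
print(check(corrupted, p))       # False: single-bit error detected

double = data ^ 0b00011000       # flip two bits
print(check(double, p))          # True: double-bit error goes unnoticed
```

Note the last line: as the notes say, parity detects but cannot correct errors, and an even number of flipped bits passes unnoticed.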

ECC - Error Correcting Code • Successor to parity checking • Can detect and correct memory errors • Only a single-bit error can be corrected, though it can detect double-bit errors • This type of ECC is known as single-bit error correction, double-bit error detection (SEC-DED) • SEC-DED requires an additional 7 check bits over 32 bits in a 4-byte system, or 8 check bits over 64 bits in an 8-byte system 276

ECC- Error Correcting Code • ECC entails memory controller calculating check bits on a memory write operation, performing a compare between read and calculated check bits on a read operation • Cost of additional ECC logic in memory controller is not significant • It affects memory performance on a write 277

Cache memory

278

Cache Memory • A high-speed, small memory • The most frequently used memory words are kept in it • When the CPU needs a word, it first checks the cache; if not found, it checks main memory

279

Cache and Main Memory

280

Cache memory Vs Main Memory

281

Cache Hit and Miss • Cache Hit: a memory read request that can be satisfied from the cache without using main memory. • Cache Miss: a memory read request that cannot be satisfied from the cache, for which main memory has to be consulted. 282

Locality Principle • PRINCIPLE OF LOCALITY is the tendency to reference data items that are near other recently referenced data items, or that were recently referenced themselves. • TEMPORAL LOCALITY: a memory location that is referenced once is likely to be referenced again multiple times in the near future. • SPATIAL LOCALITY: if a memory location is referenced once, the program is likely to reference a nearby memory location in the near future. 283

Locality Principle

Let
  c – cache access time
  m – main memory access time
  h – hit ratio (fraction of all references that can be satisfied from the cache)
  miss ratio = 1 − h

Average memory access time = c + (1 − h) · m
  h = 1 → no main-memory references; h = 0 → every reference goes to main memory 284

Example: Suppose that a word is read k times in a short interval.

First reference: main memory; other k − 1 references: cache, so h = (k − 1)/k

Average memory access time = c + m/k 285
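The access-time formula can be checked numerically; the timings below are illustrative, not from the notes.

```python
# Average memory access time: t_avg = c + (1 - h) * m
# (the cache is always checked first).
def avg_access_time(c, m, h):
    return c + (1 - h) * m

c, m = 2, 70                           # cache / main memory, in ns
print(avg_access_time(c, m, 1.0))      # 2.0  -- every reference hits
print(avg_access_time(c, m, 0.0))      # 72.0 -- every reference misses

# Word read k times: h = (k - 1)/k, so t_avg = c + m/k
k = 10
print(round(avg_access_time(c, m, (k - 1) / k), 2))   # 9.0 == c + m/k
```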

Cache Memory • Main memories and caches are divided into fixed sized blocks • Cache lines – blocks inside the cache • On a cache miss, entire cache line is loaded into cache from memory • Example: – 64K cache can be divided into 1K lines of 64 bytes, 2K lines of 32 byte etc

• Unified cache – instruction and data use the same cache

• Split cache – Instructions in one cache and data in another 286

A system with three levels of cache

287

Pentium 4 Block Diagram

288

Replacement Algorithm • Optimal Replacement: replace the block which is no longer needed in the future. If all blocks currently in Cache Memory will be used again, replace the one which will not be used in the future for the longest time. • Random selection: replace a randomly selected block among all blocks currently in Cache Memory. 289

Replacement Algorithm • FIFO (first-in first-out): replace the block that has been in Cache Memory for the longest time. • LRU (Least recently used): replace the block in Cache Memory that has not been used for the longest time. • LFU (Least frequently used): replace the block in Cache Memory that has been used for the least number of times 290
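LRU, the most common of the policies above, can be sketched with an ordered dictionary; the tiny fully associative cache and the access sequence below are illustrative.

```python
from collections import OrderedDict

# Sketch of LRU replacement for a tiny fully associative cache:
# the least recently used block is evicted when the cache is full.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # block address -> data

    def access(self, addr):
        if addr in self.blocks:                # hit: mark most recent
            self.blocks.move_to_end(addr)
            return True
        if len(self.blocks) >= self.capacity:  # miss: evict LRU block
            self.blocks.popitem(last=False)
        self.blocks[addr] = "data"
        return False

cache = LRUCache(2)
hits = [cache.access(a) for a in [1, 2, 1, 3, 2]]
print(hits)   # [False, False, True, False, False]
```

Accessing block 1 again makes it the most recent, so block 2 is the one evicted when 3 arrives.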

Cache Memory Placement Policy • Three commonly used methods to translate main memory addresses to cache memory addresses. – Associative Mapped Cache – Direct-Mapped Cache – Set-Associative Mapped Cache

• The choice of cache mapping scheme affects cost and performance, and there is no single best method that is appropriate for all situations 291

Associative Mapping

292

Associative Mapping • A block in the Main Memory can be mapped to any block in the Cache Memory available (not already occupied) • Advantage: Flexibility. An Main Memory block can be mapped anywhere in Cache Memory. • Disadvantage: Slow or expensive. A search through all the Cache Memory blocks is needed to check whether the address can be matched to any of the tags. 293

Direct Mapping

294

Direct Mapping  To avoid the search through all CM blocks needed by associative mapping, this method only allows (# blocks in main memory) / (# blocks in cache memory) main-memory blocks to be mapped to each Cache Memory block • Each entry (row) in the cache can hold exactly one cache line from main memory • With a 32-byte cache line size, a 2K-line cache holds 64 KB 295

Direct Mapping • Advantage: Direct mapping is faster than the associative mapping as it avoids searching through all the CM tags for a match. • Disadvantage: But it lacks mapping flexibility. For example, if two MM blocks mapped to same CM block are needed repeatedly (e.g., in a loop), they will keep replacing each other, even though all other CM blocks may be available. 296
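The direct-mapped address split, and the repeated-eviction problem just described, can be shown concretely. The 64 KB cache with 32-byte lines follows the earlier example; the addresses themselves are illustrative.

```python
# How a direct-mapped cache splits an address, assuming a 64 KB
# cache of 2048 lines of 32 bytes each.
LINE_SIZE = 32          # bytes per cache line
NUM_LINES = 2048        # 64 KB / 32 B

def split(addr):
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_LINES   # which cache line
    tag = addr // (LINE_SIZE * NUM_LINES)     # identifies the block
    return tag, index, offset

# Two addresses exactly one cache-size (64 KB) apart get the same
# index but different tags: they keep evicting each other.
print(split(0x12345))               # (1, 282, 5)
print(split(0x12345 + 64 * 1024))   # (2, 282, 5)
```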

Set-Associative Mapping

297

Set-Associative Mapping • This is a trade-off between associative and direct mappings where each address is mapped to a certain set of cache locations. • The cache is broken into sets where each set contains "N" cache lines, let's say 4. Then, each memory address is assigned a set, and can be cached in any one of those 4 locations within the set that it is assigned to. In other words, within each set the cache is associative, and thus the name. 298

Set Associative cache • LRU (Least Recently Used) algorithm is used – keep an ordering of each set of locations that could be accessed from a given memory location – whenever any of present lines are accessed, it updates list, making that entry the most recently accessed – when it comes to replace an entry, one at the end of list is discarded 299

Load-Through and Store-Through •

Load-Through : When the CPU needs to read a word from the memory, the block containing the word is brought from MM to CM, while at the same time the word is forwarded to the CPU.



Store-Through : If store-through is used, a word to be stored from CPU to memory is written to both CM (if the word is in there) and MM. By doing so, a CM block to be replaced can be overwritten by an in-coming block without being saved to MM. 300

Cache Write Methods • Words in a cache have been viewed simply as copies of words from main memory that are read from the cache to provide faster access. However this view point changes. • There are 3 possible write actions: – Write the result into the main memory – Write the result into the cache – Write the result into both main memory and cache memory

301

Cache Write Methods • Write Through: A cache architecture in which data is written to main memory at the same time as it is cached. • Write Back / Copy Back: CPU performs write only to the cache in case of a cache hit. If there is a cache miss, CPU performs a write to main memory. • When the cache is missed : – Write Allocate: loads the memory block into cache and updates the cache block – No-Write allocation: this bypasses the cache and writes the word directly into the memory. 302

Cache Evolution
• Problem: External memory is slower than the system bus.
  Solution: Add an external cache using faster memory technology. (First appears: 386)
• Problem: Increased processor speed results in the external bus becoming a bottleneck for cache access.
  Solution: Move the external cache on-chip, operating at the same speed as the processor. (486)
• Problem: The internal cache is rather small, due to limited space on the chip.
  Solution: Add an external L2 cache using faster technology than main memory. (486) 303

Cache Evolution (contd.)
• Problem: Increased processor speed results in the external bus becoming a bottleneck for L2 cache access.
  Solution: Create a separate back-side bus (BSB) that runs at a higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache. (Pentium Pro)
  Solution: Move the L2 cache onto the processor chip. (Pentium II)
• Problem: Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small.
  Solution: Add an external L3 cache. (Pentium III)
  Solution: Move the L3 cache on-chip. (Pentium 4) 304

Comparison of Cache Sizes

Processor        Type             Year  L1 cache       L2 cache        L3 cache
IBM 360/85       Mainframe        1968  16 to 32 KB    —               —
PDP-11/70        Minicomputer     1975  1 KB           —               —
VAX 11/780       Minicomputer     1978  16 KB          —               —
IBM 3033         Mainframe        1978  64 KB          —               —
IBM 3090         Mainframe        1985  128 to 256 KB  —               —
Intel 80486      PC               1989  8 KB           —               —
Pentium          PC               1993  8 KB/8 KB      256 to 512 KB   —
PowerPC 601      PC               1993  32 KB          —               —
PowerPC 620      PC               1996  32 KB/32 KB    —               —
PowerPC G4       PC/server        1999  32 KB/32 KB    256 KB to 1 MB  2 MB
IBM S/390 G4     Mainframe        1997  32 KB          256 KB          2 MB
IBM S/390 G6     Mainframe        1999  256 KB         8 MB            —
Pentium 4        PC/server        2000  8 KB/8 KB      256 KB          —
IBM SP           High-end server  2000  64 KB/32 KB    8 MB            —
CRAY MTAb        Supercomputer    2000  8 KB           2 MB            —
Itanium          PC/server        2001  16 KB/16 KB    96 KB           4 MB
SGI Origin 2001  High-end server  2001  32 KB/32 KB    4 MB            —
Itanium 2        PC/server        2002  32 KB          256 KB          6 MB
IBM POWER5       High-end server  2003  64 KB          1.9 MB          36 MB
CRAY XD-1        Supercomputer    2004  64 KB/64 KB    1 MB            —

Memory stall cycles
• No. of clock cycles during which the CPU is stalled waiting for a memory access

CPU time = (CPU clock cycles + Memory stall cycles) × Clock cycle time

Memory stall cycles = No. of misses × Miss penalty
                    = IC × Misses per instruction × Miss penalty
                    = IC × Memory accesses per instruction × Miss ratio × Miss penalty 306
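The stall-cycle formula can be wrapped in a small calculator; the numbers plugged in below are those of the example that follows (CPI 2.0, 1.4 memory accesses per instruction counting the instruction fetch, 2% miss ratio, 25-cycle miss penalty).

```python
# Effective CPI once memory stalls are added, per the formula above:
# stall cycles per instruction = accesses/instr * miss ratio * penalty
def cpi_with_stalls(cpi, accesses_per_instr, miss_ratio, miss_penalty):
    return cpi + accesses_per_instr * miss_ratio * miss_penalty

real = cpi_with_stalls(2.0, 1.4, 0.02, 25)
ideal = 2.0                         # CPI if every access hit the cache
print(round(real, 2))               # 2.7
print(round(real / ideal, 2))       # 1.35: speedup if all were hits
```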

Example Assume we have a machine where CPI is 2.0 when all memory accesses hit in the cache. Only data accesses are loads and stores, and these total 40% of instructions. If the miss penalty is 25 clock cycles and miss ratio is 2%, how much faster would the machine be if all instructions were cache hits?

307

Answer
• Memory accesses per instruction = 1 (instruction fetch) + 0.4 (data) = 1.4
• Memory stall cycles per instruction = 1.4 × 0.02 × 25 = 0.7
• CPI with misses = 2.0 + 0.7 = 2.7
• Speedup if all accesses hit = 2.7 / 2.0 = 1.35, i.e. the machine would be 1.35 times faster

308

Secondary Memory

309

Technologies • Magnetic storage – Floppy, Zip disk, Hard drives, Tapes

• Optical storage – CD, DVD, Blue-Ray, HD-DVD

• Solid state memory – USB flash drive, Memory cards for mobile phones/digital cameras/MP3 players, Solid State Drives

310

Magnetic Disk •

Purpose: – Long term, nonvolatile storage – Large, inexpensive, and slow – Lowest level in the memory hierarchy



Two major types: – Floppy disk – Hard disk



Both types of disks: – Rely on a rotating platter coated with a magnetic surface – Use a moveable read/write head to access the disk



Advantages of hard disks over floppy disks: – – – –

Platters are more rigid ( metal or glass) so they can be larger Higher density because it can be controlled more precisely Higher data rate because it spins faster Can incorporate more than one platter

Disk Track

Components of a Disk
• The arm assembly is moved in or out to position a head on a desired track
• The tracks under the heads at one arm position form a cylinder (imaginary!)
• Only one head reads/writes at any one time
• Block size is a multiple of sector size (which is often fixed)

[Figure: platters on a spindle; each surface divided into tracks and sectors; disk heads on an arm assembly that moves across the platters] 313

Internal Hard-Disk


Magnetic Disk • A stack of platters, a surface with a magnetic coating • Typical numbers (depending on the disk size): – 500 to 2,000 tracks per surface – 32 to 128 sectors per track

• A sector is the smallest unit that can be read or written • Traditionally, all tracks have the same number of sectors • Alternative – constant bit density: record more sectors on the outer tracks

Magnetic Disk Characteristic • • •

Disk head: each side of a platter has separate disk head Cylinder: all the tracks under the head at a given point on all surface Read/write data is a three-stage process: – Seek time: position the arm over the proper track – Rotational latency: wait for the desired sector to rotate under the read/write head – Transfer time: transfer a block of bits (sector) under the read-write head



Average seek time as reported by the industry: – Typically in the range of 8 ms to 15 ms – (Sum of the time for all possible seek) / (total # of possible seeks)



Due to locality of disk reference, actual average seek time may: – Only be 25% to 33% of the advertised number

Typical Numbers of a Magnetic Disk • Rotational Latency: – Most disks rotate at 3,600/5,400/7,200 RPM – At 3,600 RPM, approximately 16 ms per revolution – Average latency to the desired information is halfway around the disk: 8 ms

• Transfer Time is a function of : – Transfer size (usually a sector): 1 KB / sector – Rotation speed: 3600 RPM to 5400 RPM to 7200 – Recording density: typical diameter ranges from 2 to 14 in – Typical values: 2 to 4 MB per second

Disk I/O Performance

Disk Access Time = Seek time + Rotational Latency + Transfer time + Controller Time + Queueing Delay

Disk I/O Performance • Disk Access Time = Seek time + Rotational Latency + Transfer time + Controller Time + Queueing Delay • Estimating Queue Length: – Utilization U = Request Rate / Service Rate – Mean Queue Length = U / (1 − U) – As Request Rate → Service Rate, Mean Queue Length → infinity
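The queue-length estimate can be checked numerically; the request and service rates below are illustrative.

```python
# Mean queue length from utilization, U / (1 - U); note how it
# blows up as the request rate approaches the service rate.
def mean_queue_length(request_rate, service_rate):
    u = request_rate / service_rate
    return u / (1 - u)

for r in (50, 80, 95, 99):   # requests/s against a 100 req/s server
    print(r, round(mean_queue_length(r, 100), 2))
# 50 1.0 / 80 4.0 / 95 19.0 / 99 99.0
```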

Example • Setup parameters: – 16,383 cylinders, 63 sectors per track, 3 platters, 6 heads

• • • •

Bytes per sector: 512 RPM: 7200 Transfer mode: 66.6MB/s Average Read Seek time: 9.0ms (read), 9.5ms (write) • Average latency: 4.17ms • Physical dimension: 1’’ x 4’’ x 5.75’’ • Interleave: 1:1

Disk performance • • • •

Preamble: allows the head to be synchronized before a read/write ECC (Error Correction Code): corrects errors Unformatted capacity: preambles, ECCs and inter-sector gaps are counted as data Disk performance depends on:

– seek time – time to move the arm to the desired track – rotational latency – time needed for the requested sector to rotate under the head • Rotational speed: 5,400, 7,200, 10,000, 15,000 rpm

• Transfer time – time needed to transfer a block of bits under the head (e.g., 40 MB/s) 321

Disk performance Disk controller – chip that controls the drive; its tasks include accepting commands (READ, WRITE, FORMAT) from software, controlling arm motion, and detecting and correcting errors

Controller time – overhead the disk controller imposes in performing an I/O access

Avg. disk access time = avg. seek time + avg. rotational delay + transfer time + controller overhead

322

Example • The advertised average seek time of a disk is 5 ms, the transfer rate is 40 MB per second, and it rotates at 10,000 rpm. Controller overhead is 0.1 ms. Calculate the average time to read a 512-byte sector.

323
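A worked solution to the example above, term by term from the access-time formula:

```python
# Average time to read one 512-byte sector.
seek = 5.0                          # ms (advertised average)
rpm = 10_000
rotational = 0.5 * (60_000 / rpm)   # half a revolution, in ms -> 3.0
transfer = 512 / 40e6 * 1000        # 512 B at 40 MB/s, in ms -> 0.0128
controller = 0.1                    # ms

total = seek + rotational + transfer + controller
print(round(total, 2))              # 8.11 ms
```

Note that seek and rotational delay dominate: the actual data transfer is only about 13 microseconds.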

RAID(Redundant Array of Inexpensive Disks) • A disk organization used to improve performance of storage systems • An array of disks controlled by a controller (RAID Controller) • Data are distributed over disks (striping) to allow parallel operation

324

RAID 0 - No redundancy • No redundancy to tolerate disk failure • Each strip has k sectors (say) – Strip 0: sectors 0 to k−1 – Strip 1: sectors k to 2k−1 ... etc.

• Works well with large accesses • Less reliable than having a single large disk

325

Example (RAID 0) • Suppose that a RAID consists of 4 disks, each with an MTTF (mean time to failure) of 20,000 hours. – Some drive will fail once every 5,000 hours on average – A single large drive with an MTTF of 20,000 hours is 4 times as reliable

326
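The reliability argument above is just a division: with N independent disks, some disk fails N times as often as a single disk.

```python
# Array MTTF for RAID 0 from the example above.
disk_mttf = 20_000                  # hours, per disk
n_disks = 4
array_mttf = disk_mttf / n_disks    # first failure expected here
print(array_mttf)                   # 5000.0 hours (~7 months)
```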

RAID 1 (Mirroring) • Uses twice as many disk as does RAID 0 (first half: primary, next half: backup) • Duplicates all disks

• On a write, every strip is written twice • Excellent fault tolerance (if a disk fails, backup copy is used) • Requires more disks 327

RAID 3 (Bit Interleaved Parity) • Reads/writes go to all disks in the group, with one extra disk (parity disk) to hold check information in case of a failure

• Parity contains sum of all data in other disks • If a disk fails, subtract all data in good disks from parity disk 328

RAID 4 (Block Interleaved Parity) • RAID 4 is much like RAID 3, with strip-for-strip parity written onto an extra disk – A write involves accessing 2 disks instead of all – The parity disk must be updated on every write

329

RAID 5- Block Interleaved Distributed Parity • In RAID 5, parity information is spread throughout all disks • In RAID 5, multiple writes can occur simultaneously as long as stripe units are not located in same disks, but it is not possible in RAID 4

330

Secondary Storage Devices: CD-ROM

331

Physical Organization of CD-ROM • • • •



Compact Disk – read-only memory (write once) Data is encoded and read optically with a laser Can store around 600 MB of data Digital data is represented as a series of Pits and Lands: – Pit = a little depression, forming a lower level in the track – Land = the flat part between pits, i.e. the upper level in the track Reading a CD is done by shining a laser at the disc and detecting changing reflection patterns. – 1 = change in height (land to pit or pit to land) – 0 = a "fixed" amount of time between 1's

332

Organization of data LAND

PIT

LAND

PIT

LAND

...------+ +-------------+ +---... |_____| |_______| ..0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 ..

• • • • • •

Cannot have two 1's in a row! ⇒ uses the Eight-to-Fourteen Modulation (EFM) encoding table. Since 0's are represented by the length of time between transitions, the disc must travel at constant linear velocity (CLV) on the tracks. Sectors are organized along a spiral Sectors have the same linear length Advantage: takes advantage of all the storage space available. Disadvantage: has to change rotational speed when seeking (slower towards the outside) 333

CD-ROM •

Addressing – 1 second of play time is divided up into 75 sectors. – Each sector holds 2KB – 60 min CD: 60min * 60 sec/min * 75 sectors/sec = 270,000 sectors = 540,000 KB ~ 540 MB – A sector is addressed by: Minute:Second:Sector e.g. 16:22:34
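The Minute:Second:Sector addressing above converts to a linear sector number in one line:

```python
# CD sector addressing: 75 sectors per second, 2 KB per sector.
def to_linear(minute, second, sector):
    return (minute * 60 + second) * 75 + sector

print(to_linear(16, 22, 34))       # sector 73684
print(to_linear(60, 0, 0) * 2)     # 540000 KB ~ 540 MB on a 60-min CD
```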



Type of laser – CD: 780nm (infrared) – DVD: 635nm or 650nm (visible red) – HD-DVD/Blu-ray Disc: 405nm (visible blue)



Capacity – – – –

CD: 650 MB, 700 MB DVD: 4.7 GB per layer, up to 2 layers HD-DVD: 15 GB per layer, up to 3 layers BD: 25 GB per layer, up to 2 layers 334

Solid state storage

335

Solid state storage • Memory cards – For Digital cameras, mobile phones, MP3 players... – Many types: Compact flash, Smart Media, Memory Stick, Secure Digital card... • USB flash drives – Replace floppies/CD-RW • Solid State Drives – Replace traditional hard disks • Uses flash memory – Type of EEPROM • Electrically erasable programmable read only memory – Grid of cells (1 cell = 1 bit) – Write/erase cells by blocks 336

Solid state storage • Cell=two transistors – Bit 1: no electrons in between – Bit 0: many electrons in between

• Performance – Access time: ~10× faster than a hard drive – Transfer rate • 1× = 150 KB/sec, up to 100× for memory cards • Similar to a normal hard drive for SSDs (100–150 MB/sec)

– Limited write: 100k to 1,000k cycles 337

Solid state storage • Size – Very small: 1cm² for some memory cards

• Capacity – Memory cards: up to 32 GB – USB flash drives: up to 32 GB – Solid State Drives: up to 256 GB

338

Solid state storage • Reliability – Resists shocks – Silent! – Avoid extreme heat/cold – Limited number of erase/write cycles

• Challenges – Increasing size – Improving writing limits 339

Virtual Memory

340

Virtual Memory • Virtual memory is a memory management technique developed for multitasking kernels • Separation of user logical memory from physical memory. • Logical address space can therefore be much larger than physical address space

341

A System with Physical Memory Only
• Examples: Most Cray machines, early PCs, nearly all embedded systems, etc.
[Figure: CPU sends physical addresses 0 … N−1 straight to memory]
• Addresses generated by the CPU correspond directly to bytes in physical memory

A System with Virtual Memory
• Examples: Workstations, servers, modern PCs, etc.
[Figure: CPU issues virtual addresses 0 … N−1; an OS-managed page table maps each one to a physical address 0 … P−1 in memory, or out to disk]
• Address Translation: Hardware converts virtual addresses to physical ones via an OS-managed lookup table (page table)
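The translation step can be sketched as a table lookup; the page size is the usual 4 KB, and the table contents are illustrative.

```python
# Sketch of page-table address translation: split the virtual
# address into (virtual page number, offset), look up the frame.
PAGE_SIZE = 4096
page_table = {0: 5, 1: 2, 3: 7}    # virtual page -> physical frame
                                   # (page 2 not present: on disk)

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError("page fault: OS must fetch page from disk")
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1234)))      # page 1 -> frame 2: 0x2234
```

In hardware this lookup is done by the MMU, with recent translations cached so that most accesses avoid the memory-resident table.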

Page Tables
[Figure: the virtual page number indexes a memory-resident page table; each entry has a valid bit and either a physical page address (page in Physical Memory) or a disk address (page in Disk Storage – a swap file or regular file-system file)]

VM – Windows • Can change the paging file size • Can set up paging files on multiple different drives

345

Windows Memory management

346

IO Fundamentals

I/O Fundamentals • Computer System has three major functions – CPU – Memory – I/O

PC with PCI and ISA bus

Types and Characteristics of I/O Devices • Behavior: how does an I/O device behave? – Input – Read only – Output - write only, cannot read – Storage - can be reread and usually rewritten

• Partner: – Either a human or a machine is at the other end of the I/O device – Either feeding data on input or reading data on output

• Data rate: – The peak rate at which data can be transferred • between the I/O device and the main memory • Or between the I/O device and the CPU

Data Rate

Buses • A bus is a shared communication link • Multiple sources and multiple destinations • It uses one set of wires to connect multiple subsystems • Different uses: – Data – Address – Control

Motherboard

Advantages • Versatility: – New devices can be added easily – Peripherals can be moved between computer systems that use the same bus standard

• Low Cost: – A single set of wires is shared in multiple ways

Disadvantages • It creates a communication bottleneck – The bandwidth of that bus can limit the maximum I/O throughput

• The maximum bus speed is largely limited by: – The length of the bus – The number of devices on the bus – The need to support a range of devices with: • Widely varying latencies • Widely varying data transfer rates

The General Organization of a Bus • Control lines: – Signal requests and acknowledgments – Indicate what type of information is on the data lines

• Data lines carry information between the source and the destination: – Data and Addresses – Complex commands

• A bus transaction includes two parts: – Sending the address – Receiving or sending the data

Master Vs Slave • A bus transaction includes two parts: – Sending the address – Receiving or sending the data

• Master is the one who starts the bus transaction by: – Sending the address

• Slave is the one who responds to the address by: – Sending data to the master if the master asks for data – Receiving data from the master if the master wants to send data

Output Operation

Input Operation • Input is defined as the Processor receiving data from the I/O device

Type of Buses •

Processor-Memory Bus (design specific or proprietary) – – – –



Short and high speed Only need to match the memory system Maximize memory-to-processor bandwidth Connects directly to the processor

I/O Bus (industry standard) – Usually is lengthy and slower – Need to match a wide range of I/O devices – Connects to the processor-memory bus or backplane bus



Backplane Bus (industry standard) – Backplane: an interconnection structure within the chassis – Allow processors, memory, and I/O devices to coexist



Cost advantage: one single bus for all components

Increasing the Bus Bandwidth
• Separate versus multiplexed address and data lines:
– Address and data can be transmitted in one bus cycle if separate address and data lines are available
– Cost: (a) more bus lines, (b) increased complexity

• Data bus width:
– By increasing the width of the data bus, transfers of multiple words require fewer bus cycles
– Example: the SPARCstation 20’s memory bus is 128 bits wide
– Cost: more bus lines

• Block transfers:
– Allow the bus to transfer multiple words in back-to-back bus cycles
– Only one address needs to be sent at the beginning
– The bus is not released until the last word is transferred
– Cost: (a) increased complexity, (b) decreased response time for other requests
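The saving from block transfers can be shown with a back-of-the-envelope count of bus cycles (the per-cycle costs below are illustrative assumptions, not figures from the notes):

```python
# Illustrative comparison of single-word vs block bus transfers.
# Assumed costs (hypothetical): 1 cycle to send an address, 1 cycle per data word.

def cycles_single_word(words):
    # Every word needs its own address cycle plus a data cycle.
    return words * (1 + 1)

def cycles_block(words):
    # One address cycle, then back-to-back data cycles for the whole block.
    return 1 + words

print(cycles_single_word(8))  # 16 cycles
print(cycles_block(8))        # 9 cycles
```

With an 8-word block, the block transfer needs 9 cycles instead of 16, at the cost of holding the bus for the whole transfer.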

Operating System Requirements • Provides protection to shared I/O resources – Guarantees that a user’s program can only access the portions of an I/O device to which the user has rights

• Provides abstraction for accessing devices: – Supplies routines that handle low-level device operation

• Handles the interrupts generated by I/O devices • Provides equitable access to the shared I/O resources – All user programs must have equal access to the I/O resources

• Schedules accesses in order to enhance system throughput

OS and I/O Systems Communication Requirements • The Operating System must be able to prevent: – The user program from communicating with the I/O device directly

• If user programs could perform I/O directly: – Protection to the shared I/O resources could not be provided

• Three types of communication are required:
– The OS must be able to give commands to the I/O devices
– The I/O device must be able to notify the OS when it has completed an operation or has encountered an error
– Data must be transferred between memory and an I/O device

Commands to I/O Devices
• Two methods are used to address the device:
– Special I/O instructions
– Memory-mapped I/O

• Special I/O instructions specify:
– Both the device number and the command word
– Device number: the processor communicates this via a set of wires normally included as part of the I/O bus
– Command word: this is usually sent on the bus’s data lines

• Memory-mapped I/O:
– Portions of the address space are assigned to I/O devices
– Reads and writes to those addresses are interpreted as commands to the I/O devices
– User programs are prevented from issuing I/O operations directly:
• The I/O address space is protected by the address translation
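The memory-mapped idea can be sketched in a few lines: stores to a reserved address range are routed by the address decoder to a device instead of RAM (the addresses and register names below are made up for illustration):

```python
# Minimal sketch of memory-mapped I/O: loads/stores to a reserved
# address range are routed to a device instead of RAM.
# All addresses and register names here are hypothetical.

UART_BASE = 0xF000          # start of the device's address range
UART_DATA = UART_BASE + 0   # the device's "data" register

class MemoryBus:
    def __init__(self):
        self.ram = {}
        self.uart_output = []  # values the device has received

    def store(self, addr, value):
        if addr >= UART_BASE:           # the address decoder selects the device
            self.uart_output.append(value)
        else:                           # otherwise it is an ordinary RAM write
            self.ram[addr] = value

    def load(self, addr):
        return self.ram.get(addr, 0)

bus = MemoryBus()
bus.store(0x0010, 42)        # ordinary memory write
bus.store(UART_DATA, 0x48)   # same store instruction, but it drives the device
print(bus.uart_output)       # [72]
```

The point is that the processor needs no special I/O instructions: the same load/store instructions reach the device because of where its registers sit in the address space.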

I/O Device Notifying the OS • The OS needs to know when: – The I/O device has completed an operation – The I/O operation has encountered an error

• This can be accomplished in two different ways: – Polling: • The I/O device puts information in a status register • The OS periodically checks the status register

– I/O Interrupt: • Whenever an I/O device needs attention from the processor, it interrupts the processor from what it is currently doing.

Polling • Advantage: – Simple: the processor is totally in control and does all the work

• Disadvantage: – Polling overhead can consume a lot of CPU time
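The polling overhead can be seen in a tiny simulation (the device model below is hypothetical, not from the notes): every iteration of the loop is CPU time spent doing nothing useful.

```python
# Sketch of polling: the CPU repeatedly reads a status register
# until the device signals it is ready. The device model is hypothetical.

class Device:
    def __init__(self, ticks_until_ready):
        self.ticks = ticks_until_ready
        self.data = 0x5A           # the value the device will deliver

    def status(self):              # the status register the OS checks
        self.ticks -= 1
        return self.ticks <= 0     # True means "ready"

dev = Device(ticks_until_ready=3)
polls = 0
while not dev.status():            # CPU time burned in this loop is the overhead
    polls += 1
print(polls, hex(dev.data))        # 2 0x5a
```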

Interrupts • An interrupt is an asynchronous signal indicating the need for attention, or a synchronous software event indicating the need for a change in execution • Advantage: – User program progress is only halted during the actual transfer

• Disadvantage, special hardware is needed to: – Cause an interrupt (I/O device) – Detect an interrupt (processor) – Save the proper states to resume after the interrupt (processor)

Interrupt Driven Data Transfer • An I/O interrupt is just like an exception, except: – An I/O interrupt is asynchronous – Further information needs to be conveyed

• An I/O interrupt is asynchronous with respect to instruction execution: – I/O interrupt is not associated with any instruction – I/O interrupt does not prevent any instruction from completion – You can pick your own convenient point to take an interrupt
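As a rough contrast with polling, interrupt-driven I/O can be sketched with the device invoking a registered handler on completion (the device model and all names here are hypothetical; a real interrupt is raised in hardware, not as a function call):

```python
# Interrupt-driven sketch: the CPU registers a handler, goes on with
# other work, and the device invokes the handler when it finishes.
# The "interrupt" is modelled as a plain callback.

class Device:
    def __init__(self):
        self.handler = None

    def register_interrupt_handler(self, fn):
        self.handler = fn

    def complete_io(self, data):
        # Hardware raises the interrupt line; the CPU vectors to the handler.
        self.handler(data)

received = []
dev = Device()
dev.register_interrupt_handler(lambda d: received.append(d))

# The CPU runs the user program here, with no status polls...
dev.complete_io(0x5A)          # ...until the device signals completion
print(received)                # [90]
```

No CPU cycles are spent checking a status register; the cost is the extra hardware to raise, detect, and prioritize the interrupt.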

I/O Interrupt
• An I/O interrupt is more complicated than an exception:
– It needs to convey the identity of the device generating the interrupt
– Interrupt requests can have different urgencies, so requests need to be prioritized

• Interrupt Logic
– Detect and synchronize interrupt requests
– Ignore interrupts that are disabled (masked off)
– Rank the pending interrupt requests
– Create the interrupt microsequence address
– Provide select signals for the interrupt microsequence

Multi-core architectures

Single Computer

Single Core CPU

Multi core architecture • Replicate multiple processor cores on a single die

Multi-core CPU chip • The cores fit on a single processor socket • Also called CMP (Chip Multi-Processor)

Why Multi-core
• Difficult to make single-core clock frequencies even higher
• Deeply pipelined circuits:
– heat problems
– speed of light problems
– difficult design and verification
– large design teams necessary
– server farms need expensive air-conditioning

• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)

Instruction-level parallelism • Parallelism at the machine-instruction level • The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. • Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years

Thread-level parallelism (TLP) • This is parallelism at a coarser scale • A server can serve each client in a separate thread (Web server, database server) • A computer game can do AI, graphics, and physics in three separate threads • Single-core superscalar processors cannot fully exploit TLP • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
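The server example above can be sketched with threads (a minimal sketch; note that CPython's GIL limits true parallelism for CPU-bound work, but the threading structure is the same one a multi-core chip exploits):

```python
# Thread-level parallelism sketch: each "client request" is handled
# in its own thread, the way a web or database server might.
import threading

results = {}
lock = threading.Lock()

def handle_client(client_id):
    reply = client_id * client_id      # stand-in for real request processing
    with lock:                         # shared memory: protect the shared dict
        results[client_id] = reply

threads = [threading.Thread(target=handle_client, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)   # results == {0: 0, 1: 1, 2: 4, 3: 9}
```

On a multi-core machine the OS can schedule these threads on different cores; the program itself does not change.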

Multiprocessor memory types • Shared memory: In this model, there is one (large) common shared memory for all processors • Distributed memory: In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else

• A multi-core processor is a special kind of multiprocessor: all processors are on the same chip • Multi-core processors are MIMD: different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data) • A multi-core processor is a shared-memory multiprocessor: all cores share the same memory

What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Compilers
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
• Each can run on its own core

More examples • Editing a photo while recording a TV show through a digital video recorder • Downloading software while running an anti-virus program • “Anything that can be threaded today will map efficiently to multi-core” • BUT: some applications are difficult to parallelize

A technique complementary to multi-core: Simultaneous multithreading

• Problem addressed: the processor pipeline can get stalled:
– Waiting for the result of a long floating point (or integer) operation
– Waiting for data to arrive from memory
• While the pipeline is stalled, the other execution units wait unused

[Diagram: a superscalar core (L1 D-Cache/D-TLB, L2 cache and control, integer and floating-point units, schedulers, uop queues, rename/alloc, trace cache, uCode ROM, decoder, BTB and I-TLB). Source: Intel]

Simultaneous multithreading (SMT) • Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core

• Weaving together multiple “threads” on the same core • Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units

Without SMT, only a single thread can run at any given time
[Diagram: a single core; Thread 1 (floating point) uses the FP units while the integer units sit idle]

Without SMT, only a single thread can run at any given time
[Diagram: the same core; Thread 2 (integer operation) uses the integer units while the FP units sit idle]

SMT processor: both threads can run concurrently
[Diagram: Thread 1 (floating point) and Thread 2 (integer operation) issue to different functional units of the same core at the same time]

But: can’t simultaneously use the same functional unit
[Diagram: Thread 1 and Thread 2 both targeting the single integer unit, marked IMPOSSIBLE]
This scenario is impossible with SMT on a single core (assuming a single integer unit)

SMT not a “true” parallel processor • Enables better threading (e.g. up to 30% better performance) • OS and applications perceive each simultaneous thread as a separate “virtual processor” • The chip has only a single copy of each resource • Compare to multi-core: each core has its own copy of resources

Multi-core: threads can run on separate cores
[Diagram: a dual-core chip; each core has its own L1 D-Cache/D-TLB, L2 cache and control, integer and floating-point units, schedulers, trace cache, and decoder; Thread 1 runs on one core and Thread 2 on the other]

Multi-core: threads can run on separate cores
[Diagram: the same dual-core chip running Thread 3 on one core and Thread 4 on the other]

Combining Multi-core and SMT • Cores can be SMT-enabled (or not) • The different combinations: – Single-core, non-SMT: standard uniprocessor – Single-core, with SMT – Multi-core, non-SMT – Multi-core, with SMT: our fish machines

• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads • Intel calls them “hyper-threads”

SMT Dual-core: all four threads can run concurrently
[Diagram: a dual-core chip with SMT; Threads 1 and 3 share one core while Threads 2 and 4 share the other]

Comparison: multi-core vs SMT • Advantages/disadvantages?

Comparison: multi-core vs SMT • Multi-core: – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture) – However, great with thread-level parallelism

• SMT – Can have one large and fast superscalar core – Great performance on a single thread – Mostly still only exploits instruction-level parallelism

The memory hierarchy • If simultaneous multithreading only: – all caches shared

• Multi-core chips: – L1 caches private – L2 caches private in some architectures and shared in others

• Memory is always shared

“Fish” machines
• Dual-core Intel Xeon processors
• Each core is hyper-threaded
• Private L1 caches, shared L2 cache
[Diagram: CORE0 and CORE1, each with hyper-threads and a private L1 cache, sharing a single L2 cache in front of memory]

Designs with private L2 caches
[Diagram: each core with private L1 and L2 caches connected directly to memory (both L1 and L2 are private); examples: AMD Opteron, AMD Athlon, Intel Pentium D]
[Diagram: each core with private L1 and L2 caches backed by L3 caches (a design with L3 caches)]

Example: Intel Itanium 2

Private vs shared caches? • Advantages/disadvantages?

Private vs shared caches • Advantages of private: – They are closer to core, so faster access – Reduces contention

• Advantages of shared: – Threads on different cores can share the same cache data – More cache space available if a single (or a few) high-performance thread runs on the system

Windows Task Manager
[Screenshot: Windows Task Manager showing separate CPU usage graphs for core 1 and core 2]
