PRESSA: from system to chip


PRESSA: from system to chip IT-ACS Ltd by Igor Schagaev

History

Theory

HW

SSW

Design for Reliability -

"From Chip to System”

Theory

Greenwich Uni

05.12.13

History: Theory

- Theory of fault tolerant computer design: 1978 to now
- Theory of active system safety and active system control: 1989 to now
- Design, development and reliability modelling of fault tolerant RAM: new triplicated memory, sliding reserve RAM; design and development done 1999
- Processors: concept, design, development, simulation, assembler, language run-time system, prototyping and reliability analysis, from 1983 to now (see ERA…)
- System software: language and run-time system design and development for reconfigurable architectures (see PRESSA below and book 2013)…
- Method of active system safety and system control (since 1984, UK patent 2007)
- Details: http://www.researchgate.net/profile/Igor_Schagaev/?ev=hdr_xprf

History: Hardware

RELIABLE DESIGN FROM RELIABLE COMPONENT

- 24 NODE FT NAVY COMPUTER BASED ON LSI11-23, 1978-1982
- 64 PE FAULT TOLERANT M-SIMD BASED ON AM2901, until 1987
- FAULT TOLERANT AVIONICS FOR SUKHOY: active safety system for aircraft with dual Motorola 68020, fault tolerant memory for applications (41 chips of SRAM) and new tripled memory, together with a flight data recorder with a unique thermo-resistant system; developed, prototyped and tested. Completed 1994.
- ERRIC: embedded recoverable reduced instruction computer, designed and prototyped in 1998-2009, before and within the FP6 ONBASS project. Malfunction tolerance and rigorous design made it possible to achieve fault tolerance with 12% structural redundancy and zero time redundancy. ERRIC requires 6.5 times less power than ARM, has similar performance, and… 104 more reliable… 1998 up to now.
- NEXT STOP - NEW ERA: the idea to combine ERRIC and ITACS memory designs to make a fault tolerant reconfigurable architecture on a wafer became known as ERA (evolving reconfigurable architecture). In progress.
- PRESSA: Performance-, Reliability-, Energy- Smart System Architecture. Multi-chip development of ERA… Started in 2009.

More details: www.it-acs.co.uk

from system to chip: WHY?

[Diagram labels: HW; Software; System Software; Application Software]

- Theory, again, of system software support of hardware efficiency…
- see for example: www.it-acs.co.uk
- more is needed…

As a third optimization axis beyond performance and reliability, PRESSA aims to facilitate advanced resource management to reduce power consumption in battery-driven applications. A high degree of reconfigurability, combined with the fact that we are designing an entirely new computing paradigm consisting of processor hardware, memory architecture, a modelling language, a programming language, and the run-time system, opens up new dimensions of dynamic power management.

from system to chip: concept first

The PRESSA project is based on previous theoretical results in the study of redundancy classification and management introduced in the late 80’s [SCH86-11]. PRESSA scientific development pursues the study of redundancy and reconfigurability further, as shown in Figure 1 and explained in Table 1 below:

Figure 1 PRESSA areas of theoretical and technological contribution

Therefore the proposed project defines essential features and their impact on basic system elements when exceptional system reconfigurability is required.

from system to chip: principles

Hardware reconfigurability will be reflected and supported at the system software level by language and run-time system.

Table 1: PRESSA holistic design principles and reasoning

Simplicity

Complex things tend not to work properly. PRESSA avoids introducing extra hardware and software ‘bells and whistles’ in the architecture to placate history (compatibility with main market players) or conventions (pipelines and caches etc.), and which often adds enormous complexity for very little gain in performance or reliability.

Redundancy

Deliberate introduction of hardware and system software redundancy together with monitoring schemes provides the means for PRESSA to use reconfiguration to improve reliability

Reconfigurability

PRESSA reconfigurability has three main purposes: performance, reliability and power awareness. Handling reconfigurability using language and run-time support provides unique flexibility in trading off reliability, performance and energy-wise use.

Scalability

Design and development of hardware and software to achieve high reliability, and monitor graceful degradation of hardware in terms of performance and reliability. Active support of reconfiguration is managed in real time by means of control of hardware and system software resources. The software and hardware are both specifically designed to scale up.

Reliability and fault tolerance

Resource-awareness


Our approach is to use minimum redundancy by designing the main elements to be as reliable as possible and combine them together with minimum complexity of connections. Redundancy of resources is deliberately introduced, both in hardware and software, and then managed to maximize tolerance to malfunction and permanent faults. Mission critical systems as well as everyday applications may have significant limitations, in terms of hardware (computational and memory) resources and power consumption constraints (e.g. battery life). All of the above features must be taken into account by using systems engineering based on hardware-software co-design.


Historically, computer technologies did not address the potential work of a computer within connected…

PRESSA: from system to chip

[Diagram labels: FAULT TOLERANCE; Fault model; Redundancy; Reconfigurability; Recoverability?; PRE-smart CC; P = Performance, R = Reliability, E = Energy]

- P, R, E trading?
- Big Q: how much?


… malfunction tolerance efficient [18]. In comparison with Motorola, ARM and Intel, ERA is much simpler, and a higher level of parallelism and frequency can be achieved, as ERA needs only 10% of the power of the competitors to reach the same clock speed. When an application requires maximum reliability, the T-logic scheme might configure the memory as a 3-unit with voter. The configurations "two to compare and one spare" or three independent memory elements are also possible.

[Diagram labels: for one computer; for multiprocessor system]
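The "3-unit with voter" memory configuration mentioned above can be illustrated with a bitwise 2-of-3 majority vote. This is only a minimal sketch in Python: the function names `vote` and `read_triplicated` are hypothetical, and the slides do not describe how the actual T-logic voter is implemented.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three memory words."""
    return (a & b) | (a & c) | (b & c)


def read_triplicated(copies):
    """Return the voted word and the indices of copies that disagree with it.

    `copies` holds the same word as read from three memory modules
    (an assumption made for this illustration).
    """
    a, b, c = copies
    word = vote(a, b, c)
    suspects = [i for i, value in enumerate(copies) if value != word]
    return word, suspects


# The third module returns a corrupted word; the vote masks the error
# and the module is reported as suspect.
word, suspects = read_triplicated([0b1011, 0b1011, 0b0011])
```

A suspect index is the kind of information that could be used to move a module out of the active configuration, which is the behaviour the following slides attribute to the syndrome and the run-time system.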

PRESSA: from system to chip

Fig. 7. ERA element [diagram labels: ACTIVE ZONE; ERRIC; Memory used by ERRIC; PASSIVE ZONE; Idle memory; RAM modules; ARCHITECTURE BUS]

- A HW element that is “suspected” should “switch itself” (left RAM above).
- The system should be able to return it to action after a full-size check, if it was recovered.

Fig. 8. Indicative ERA structure

Each element can be turned off individually to decrease power consumption. Note that the structure assumes only one leading element at a time, enforced by a “rotation” of the T-logic.

To chip with reconfigurability

File name: FT resolved, Sept 2010

First version of syndrome concept: witnessed by PhD students V Castano and A Petukhov

now: to chip with reconfigurability

[Syndrome diagram. Element labels: Arithmetic Unit, Logical Unit, Timer, Random Number Gen., Interrupt Controller, Console, Stable Storage, UART1, UART2, UART3, ROM1, ROM2, RAM1, RAM2, RAM3, RAM4. Group labels: Memory, Registers, Devices, CU, Processor, Power. All entries are shown as 0.]

Slightly better syndrome picture...

From a system software point of view the syndrome is represented as a set of special hardware registers.

- Syndrome registers indicate the current hardware state (current configuration, detected faults, power)...
- Fault detection schemes signal to the syndrome, causing hardware interrupts and initiation of GAFT by the run-time system.
- The run-time system, when necessary, executes reconfiguration of hardware.

New control functions of the run-time system are:

a) reconfiguration for reliability, performance or power-saving
b) control of graceful degradation

Figure 7.12 Syndrome fault management

Figure 7.13 Syndrome power configuration

NB. Pictures of the syndrome (Figures 7.12, 7.13 and 7.14) for our proposed architecture ERA were prepared by Victor Castano.

As an example platform to illustrate the syndrome, we use here the ERRIC simulator with
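To make the idea of the syndrome as a set of registers more concrete, here is a rough Python sketch. Everything in it is an assumption made for illustration: the element list, the one-bit-per-element layout, and the names `Syndrome`, `signal_fault` and `handle_fault_interrupt` are hypothetical and are not taken from the ERRIC simulator or the ERA hardware.

```python
from dataclasses import dataclass

# Hypothetical element list; the syndrome on the slide covers more units.
ELEMENTS = ("RAM1", "RAM2", "RAM3", "RAM4")


@dataclass
class Syndrome:
    """Illustrative register set with one bit per element in each register."""
    faults: int = 0b0000  # bit i set: a fault was detected in ELEMENTS[i]
    power: int = 0b1111   # bit i set: ELEMENTS[i] is powered on

    def signal_fault(self, name: str) -> None:
        """What a fault detection scheme would do: flag the element."""
        self.faults |= 1 << ELEMENTS.index(name)


def handle_fault_interrupt(syndrome: Syndrome) -> list:
    """Stand-in for the run-time reaction to a syndrome interrupt.

    Powers off every element that is flagged faulty and still powered,
    and returns the names of the elements it excluded.
    """
    excluded = []
    for i, name in enumerate(ELEMENTS):
        if (syndrome.faults >> i) & 1 and (syndrome.power >> i) & 1:
            syndrome.power &= ~(1 << i)
            excluded.append(name)
    return excluded


syndrome = Syndrome()
syndrome.signal_fault("RAM3")
excluded = handle_fault_interrupt(syndrome)
```

In this sketch the reconfiguration step is reduced to clearing a power bit; a real run-time system would also have to choose a new memory mode and, after a successful check, could set the bit again to return the element to service.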

Reconfigurability: use of syndrome

… uncertain. Software could, for example, switch periodically from mode 1 to mode 3 and check the integrity of the spare module, preferably in idle time of the system.

If no safety-critical applications run on the system, the memory configuration can be set to mode 9, where maximum capacity is available but no fault tolerance.

From system HW configurations defined by hardware design we set memory configurations:

Table 7.1: Possible memory configurations

Mode | Used banks | Redundancy mode | Used memory modules | Usable size (Mb)
---- | ---------- | --------------- | ------------------- | ----------------
1 | 1 | Triplicated + 1 Spare | 4 | 4
2 | 1 | Triplicated | 3 | 4
3 | 2 | Triplicated + 1 Linear | 4 | 8
4 | 1 | Duplicated + 2 Spare | 4 | 4
5 | 1 | Duplicated + 1 Spare | 3 | 4
6 | 2 | Duplicated | 4 | 8
7 | 3 | Duplicated + 2 Linear | 4 | 12
8 | 2 | Duplicated + 1 Linear | 3 | 8
9 | 4 | Linear | 4 | 16
10 | 3 | Linear | 3 | 12
11 | 2 | Linear | 2 | 8
12 | 1 | Linear | 1 | 4
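As an illustration of how system software might use Table 7.1, the sketch below encodes the mode number, redundancy mode, number of used memory modules and usable size of each mode, and picks the largest-capacity mode that fits the modules still working. The names `Mode`, `MODES` and `pick_mode`, and the selection policy itself, are assumptions made for this example; the slides do not say how the run-time system actually chooses a mode.

```python
from typing import NamedTuple, Optional


class Mode(NamedTuple):
    number: int
    redundancy: str  # "triplicated", "duplicated" or "linear"
    modules: int     # number of used memory modules
    usable_mb: int   # usable size in Mb


# Rows as read from Table 7.1 (spare/linear additions are omitted here).
MODES = [
    Mode(1, "triplicated", 4, 4), Mode(2, "triplicated", 3, 4),
    Mode(3, "triplicated", 4, 8), Mode(4, "duplicated", 4, 4),
    Mode(5, "duplicated", 3, 4), Mode(6, "duplicated", 4, 8),
    Mode(7, "duplicated", 4, 12), Mode(8, "duplicated", 3, 8),
    Mode(9, "linear", 4, 16), Mode(10, "linear", 3, 12),
    Mode(11, "linear", 2, 8), Mode(12, "linear", 1, 4),
]


def pick_mode(redundancy: str, working_modules: int) -> Optional[Mode]:
    """Hypothetical policy: the mode with the given redundancy that needs
    no more modules than are working and offers the most usable memory.
    Returns None when no such mode exists."""
    fitting = [m for m in MODES
               if m.redundancy == redundancy and m.modules <= working_modules]
    return max(fitting, key=lambda m: m.usable_mb, default=None)


no_ft = pick_mode("linear", 4)          # maximum capacity, no fault tolerance
degraded = pick_mode("triplicated", 3)  # one module lost, triplication kept
```

With all four modules working and no fault tolerance requested, this policy lands on mode 9, which matches the remark in the text about running without safety-critical applications.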

An example of system software control of memory degradation for triplicated memory

Degradation Modes starting from Triplication

- Phase 1, Triplication + Spare: 111s
- Phase 2, Triplication: 111x, 11x1, 1x11, x111
- Phase 3, Duplication: 11xx, 1x1x, ..., x1x1, xx11
- Phase 4, No FT: 1xxx, x1xx, xxx1
- Phase 5, Failure: F

Figure 7.17: Degradation phases of a triplicated memory system

16-bit wide memory modules could also be used instead of 32-bit modules. In this case, two memory modules must be combined to allow 32-bit memory access. The possible configurations with four 16-bit modules are limited to duplication only, as triplication would need at least six memory modules. If 16-bit modules are used, an emergency mode could be implemented, using only one 16-bit module, mainly for signaling the need for maintenance or, if space and speed (two memory accesses for loading one 32-bit word) are sufficient, to run the most critical applications.

- Areas (processor, interfacing zone, passive zone) can be defined in terms of configurations, together with their degradation sequences.
- Configuration changes are supported by the run-time system, in principle enabling sequential degradation “up to the last soldier”: when a single element of each section is left, the system remains operable.
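The degradation phases can be expressed as a small classification over the module-state patterns. The sketch below assumes, as the patterns suggest, that "1" marks a working module, "x" a failed one and "s" an unused spare; the function name `degradation_phase` is hypothetical, and only pattern shapes like those shown for the phases are handled.

```python
def degradation_phase(state: str) -> str:
    """Map a four-module state pattern such as "11x1" to its phase.

    Assumed legend: "1" = working module, "x" = failed module,
    "s" = spare module that is not yet in use.
    """
    if len(state) != 4 or set(state) - set("1xs"):
        raise ValueError(f"unexpected state pattern: {state!r}")
    working = state.count("1")
    if working >= 3:
        # A remaining spare distinguishes phase 1 from phase 2.
        if "s" in state:
            return "Phase 1: Triplication + Spare"
        return "Phase 2: Triplication"
    if working == 2:
        return "Phase 3: Duplication"
    if working == 1:
        return "Phase 4: No FT"
    return "Phase 5: Failure"


phases = [degradation_phase(s) for s in ("111s", "11x1", "x1x1", "xxx1", "xxxx")]
```

Counting working modules is enough here because each phase is defined by how many modules remain, not by which ones; a run-time system that cares about bank placement would need the positions as well.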

Reliability…

…re, an analysis of the surface shape and an evaluation of the performance and reliability …tion caused by the used redundancies should be performed for every fault tolerant … Figure 4.4 presents qualitatively a slope where a fault tolerant system should be … between the plane of requirements and the curves of reliability and performance …tion.

Performance… PRESSA again

We actually need:

- Performance-, Reliability-, Energy- reconfigurable systems design and their analysis
- done by a good team of collaborators…

Figure 4.4: Tradeoffs to be made in fault tolerant system design:

Thanks for… and…

- Discussions, efforts: T Kaegi, S Monkman, B Kirk
- Discussions on redundancy: J C Laprie (late 80’s)
- Discussions on reliability vs. FT: S Birolini (2005-) (see Birolini, Reliability Engineering, Springer, Ed. 7, 2013)
- Discussions on Graph Logic Model: Felix Friedrich
- Pictures: S Monkman, V Castano
- NMI team: Paul Jarvie, Rebecca Mann, Jon Older, Mark Hodgetts
