PRESSA: from system to chip
PRESSA: from system to chip IT-ACS Ltd by Igor Schagaev
History
Theory
HW
SSW
Design for Reliability - “From Chip to System”
Theory
Greenwich Uni
05.12.13
History: Theory
- Theory of fault tolerant computer design, 1978 till now
- Theory of active system safety and active system control, 1989 till now
- Design, development and reliability modelling of fault tolerant RAM: new triplicated memory, sliding reserve RAM; designed and developed, done 1999
- Processors: concept, design, development, simulation, assembler, language run-time system, prototyping and reliability analysis, since 1983 till now (see ERA…)
- System software language and run-time system design and development for reconfigurable architectures (see PRESSA below and book 2013)…
- Method of active system safety and system control (since 1984, UK patent 2007)
- Details: http://www.researchgate.net/profile/Igor_Schagaev/?ev=hdr_xprf
History: Hardware
RELIABLE DESIGN FROM RELIABLE COMPONENT
- 24 NODE FT NAVY COMPUTER BASED ON LSI11-23, 1978-1982
- 64 PE FAULT TOLERANT M-SIMD BASED ON AM2901, TILL 1987
- FAULT TOLERANT AVIONICS FOR SUKHOY: active safety system for aircraft with dual Motorola 68020, fault tolerant memory for applications (41 SRAM chips) and new tripled memory, together with a flight data recorder with a unique thermo-resistant system, developed, prototyped and tested. Completed 1994
- ERRIC: embedded recoverable reduced instruction computer, designed and prototyped in 1998-2009, before and within the FP6 ONBASS project. Malfunction tolerance and rigorous design made it possible to achieve fault tolerance with 12% structural redundancy and zero time redundancy. ERRIC requires 6.5 times less power than ARM, has similar performance, and… 104 more reliable… 1998 up to now
- NEXT STOP - NEW ERA: the idea to combine ERRIC and ITACS memory designs to make a fault tolerant reconfigurable architecture on a wafer became known as ERA (evolving reconfigurable architecture). In progress
PRESSA: Performance-, Reliability-, Energy-Smart System Architecture. Multi-chip development of ERA… Started in 2009
more details: www.it-acs.co.uk
from system to chip: WHY?
[Figure: hardware and software layers, labelled HW, System Software and Application Software]
Theory, again, of system software support of hardware efficiency…
- see for example: www.it-acs.co.uk
- more is needed…
As a third optimization axis beyond performance and reliability, PRESSA aims to facilitate advanced resource management to reduce power consumption in battery-driven applications. A high degree of reconfigurability, combined with the fact that we are designing an entirely new computing paradigm consisting of processor hardware, memory architecture, a modelling language, a programming language and the run-time system, opens up new dimensions of dynamic power management.
from system to chip: concept first
The PRESSA project is based on previous theoretical results in the study of redundancy classification and management introduced in the late 80’s [SCH86-11]. PRESSA scientific development pursues the study of redundancy and reconfigurability further, as shown in Figure 1 and explained in Table 1 below:
Figure 1 PRESSA areas of theoretical and technological contribution
Therefore the proposed project defines essential features and their impact on basic system elements when exceptional system reconfigurability is required.
from system to chip: principles
Hardware reconfigurability will be reflected and supported at the system software level by the language and run-time system.
Table 1: PRESSA holistic design principles and reasoning
Simplicity
Complex things tend not to work properly. PRESSA avoids introducing extra hardware and software ‘bells and whistles’ in the architecture to placate history (compatibility with main market players) or conventions (pipelines, caches etc.), which often add enormous complexity for very little gain in performance or reliability.
Redundancy
Deliberate introduction of hardware and system software redundancy, together with monitoring schemes, provides the means for PRESSA to use reconfiguration to improve reliability.
Reconfigurability
PRESSA reconfigurability has three main purposes: performance, reliability and power awareness. Handling reconfigurability using language and run-time support provides unique flexibility in trading off reliability, performance and energy use.
Scalability
Design and development of hardware and software to achieve high reliability, and monitor graceful degradation of hardware in terms of performance and reliability. Active support of reconfiguration is managed in real time by means of control of hardware and system software resources. The software and hardware are both specifically designed to scale up.
Reliability and fault tolerance
Resource-awareness
Our approach is to use minimum redundancy by designing the main elements to be as reliable as possible and combine them together with minimum complexity of connections. Redundancy of resources is deliberately introduced, both in hardware and software, and then managed to maximize tolerance to malfunction and permanent faults. Mission critical systems as well as everyday applications may have significant limitations, in terms of hardware (computational and memory) resources and power consumption constraints (e.g. battery life). All of the above features must be taken into account by using systems engineering based on hardware-software co-design.
Historically computer technologies were not addressing potentially work of computer within connected…
PRESSA: from system to chip
[Figure: FAULT TOLERANCE (fault model, redundancy, reconfigurability, recoverability?) and PRE-smart CC, where P = Performance, R = Reliability, E = Energy]
P, R, E trading?
Big Q: how much?
© IT-ACS
…malfunction tolerance efficient [18]. In comparison with Motorola, ARM and Intel, ERA is much simpler, and a higher level of parallelism and frequency can be achieved, as ERA needs only 10% of the power of its competitors to reach the same clock speed. When an application requires maximum reliability, the T-logic scheme might configure the memory as a three-unit memory with voter. The configurations “two to compare and one spare” or “three independent memory elements” are also possible.
[Figure labels: for one computer; for multiprocessor system]
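The memory configurations named above (three units with voter, two to compare) can be illustrated with a short sketch. This is a minimal Python illustration under stated assumptions, not the ERA T-logic itself: the function names `majority_vote`, `read_triplicated` and `read_duplicated` are hypothetical, and each memory unit is modelled as a plain integer word.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def majority_vote(copies: Sequence[int]) -> Optional[int]:
    """Return the word held by a strict majority of the copies, else None."""
    word, count = Counter(copies).most_common(1)[0]
    return word if 2 * count > len(copies) else None


def read_triplicated(copies: Sequence[int]) -> Tuple[int, List[int]]:
    """Three units with voter: mask a single faulty unit.

    Returns the voted word and the indices of the units that disagreed.
    """
    voted = majority_vote(copies)
    if voted is None:
        raise RuntimeError("no majority among the memory units")
    suspects = [i for i, word in enumerate(copies) if word != voted]
    return voted, suspects


def read_duplicated(first: int, second: int) -> Optional[int]:
    """Two to compare: a mismatch is detected but not masked (returns None)."""
    return first if first == second else None
```

In this sketch the triplicated read masks one faulty unit and reports it as a suspect, while the duplicated read can only detect a disagreement; a caller would then bring in the spare to decide which unit is wrong.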
PRESSA: from system to chip
[Figure labels: ACTIVE ZONE, ERRIC, RAM (three blocks), idle memory, memory used by ERRIC, ARCHITECTURE BUS, PASSIVE ZONE]
Fig. 7. ERA element
- A HW element that is “suspected” should “switch itself” (left RAM above);
- The system should be able to return it to action after a full-size check, if it was recovered.
Fig. 8. Indicative ERA structure
Each element can be turned off individually to decrease power consumption. Note that the structure assumes only one leading element at a time, enforced by a “rotation” of the T-logic.
To chip with reconfigurability
File name: FT resolved,
Sept 2010
First version of syndrome concept: witnessed by PhD students V Castano and A Petukhov
now: to chip with reconfigurability
[Figure: syndrome layout. Groups: Processor, CU, Registers, Memory, Devices, Power. Elements: Arithmetic Unit, Logical Unit, Timer, Random Number Gen., Interrupt Controller, Console, Stable Storage, UART1, UART2, UART3, ROM1, ROM2, RAM1, RAM2, RAM3, RAM4. All syndrome bits are shown as 0.]
Slightly better syndrome picture...
From a system software point of view the syndrome is represented as a set of special hardware registers.
- Syndrome registers indicate the current hardware state (current configuration, detected faults, power)...
- Fault detection schemes signal to the syndrome, causing hardware interrupts and initiation of GAFT by the run-time system.
- The run-time system, when necessary, executes reconfiguration of hardware.
New control functions of the run-time system are:
a) reconfiguration for reliability, performance or power-saving
b) control of graceful degradation
As an example platform to illustrate the syndrome, we use here the ERRIC simulator with…
Figure 7.12 Syndrome fault management
Figure 7.13 Syndrome power configuration
NB. Pictures of syndrome (Figures 7.12, 7.13 and 7.14) for our proposed architecture ERA were prepared by Victor Castano.
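The syndrome idea described here (registers for configuration, detected faults and power, updated when a fault is signalled) can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the bit assignment in `BIT`, the `Syndrome` class and the handler `on_fault_interrupt` are hypothetical names, the element list is only a subset of the elements shown in the syndrome picture, and GAFT itself is not modelled.

```python
from dataclasses import dataclass

# Hypothetical bit assignment: one bit per hardware element.
ELEMENTS = ["RAM1", "RAM2", "RAM3", "RAM4", "UART1", "TIMER"]
BIT = {name: 1 << position for position, name in enumerate(ELEMENTS)}


@dataclass
class Syndrome:
    """Three registers describing the current hardware state."""
    configuration: int = 0  # elements used in the current configuration
    faults: int = 0         # elements flagged by fault detection schemes
    power: int = 0          # elements currently powered on


def on_fault_interrupt(syndrome: Syndrome, element: str) -> None:
    """Hypothetical interrupt handler of the run-time system.

    Records the detected fault, then reconfigures by removing the
    suspected element from the configuration and powering it down.
    """
    mask = BIT[element]
    syndrome.faults |= mask
    syndrome.configuration &= ~mask
    syndrome.power &= ~mask


def active_elements(syndrome: Syndrome) -> list:
    """Names of the elements in the current configuration."""
    return [name for name in ELEMENTS if syndrome.configuration & BIT[name]]
```

A run-time system built along these lines would read the three registers to decide between reconfiguration for reliability, performance or power saving; the handler above shows only the simplest reaction, excluding the suspected element.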
Reconfigurability: use of syndrome
…uncertain. Software could, for example, switch periodically from mode 1 to mode 3 and check the integrity of the spare module, preferably in idle time of the system.
If no safety critical applications run on the system, the memory configuration can be set to mode 9, where maximum capacity is available but no fault tolerance.
From system HW configurations defined by hardware design we set memory configurations:
Table 7.1: Possible memory configurations
Mode  Redundancy mode         Used banks  Used memory modules  Usable size (Mb)
1     Triplicated + 1 Spare   1           4                    4
2     Triplicated             1           3                    4
3     Triplicated + 1 Linear  2           4                    8
4     Duplicated + 2 Spare    1           4                    4
5     Duplicated + 1 Spare    1           3                    4
6     Duplicated              2           4                    8
7     Duplicated + 2 Linear   3           4                    12
8     Duplicated + 1 Linear   2           3                    8
9     Linear                  4           4                    16
10    Linear                  3           3                    12
11    Linear                  2           2                    8
12    Linear                  1           1                    4
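To make the memory configuration table easier to reason about, the sketch below encodes it as data and selects a mode for a given requirement. It is a hedged illustration only: the placement of the "+ Spare" and "+ Linear" modifiers on individual modes is inferred from the bank and module counts, and `MemoryMode`, `TABLE_7_1` and `pick_mode` (with its "fewest modules first" policy) are hypothetical names and choices, not part of the original design.

```python
from typing import List, NamedTuple, Optional


class MemoryMode(NamedTuple):
    mode: int
    redundancy: str
    banks: int
    modules: int
    usable_mb: int


# Table 7.1 as data; modifier placement is inferred (see lead-in).
TABLE_7_1: List[MemoryMode] = [
    MemoryMode(1, "Triplicated + 1 Spare", 1, 4, 4),
    MemoryMode(2, "Triplicated", 1, 3, 4),
    MemoryMode(3, "Triplicated + 1 Linear", 2, 4, 8),
    MemoryMode(4, "Duplicated + 2 Spare", 1, 4, 4),
    MemoryMode(5, "Duplicated + 1 Spare", 1, 3, 4),
    MemoryMode(6, "Duplicated", 2, 4, 8),
    MemoryMode(7, "Duplicated + 2 Linear", 3, 4, 12),
    MemoryMode(8, "Duplicated + 1 Linear", 2, 3, 8),
    MemoryMode(9, "Linear", 4, 4, 16),
    MemoryMode(10, "Linear", 3, 3, 12),
    MemoryMode(11, "Linear", 2, 2, 8),
    MemoryMode(12, "Linear", 1, 1, 4),
]


def pick_mode(min_mb: int, redundancy: str,
              available_modules: int = 4) -> Optional[MemoryMode]:
    """Pick a mode of the requested redundancy class with enough capacity.

    Assumed policy: prefer the mode that uses the fewest modules,
    then the lowest mode number. Returns None if nothing fits.
    """
    candidates = [
        m for m in TABLE_7_1
        if m.redundancy.startswith(redundancy)
        and m.usable_mb >= min_mb
        and m.modules <= available_modules
    ]
    return min(candidates, key=lambda m: (m.modules, m.mode), default=None)
```

For example, `pick_mode(16, "Linear")` selects mode 9, the maximum-capacity configuration without fault tolerance mentioned in the text. The table is also consistent with 4 Mb of usable memory per bank.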
An example of system software control of memory degradation for triplicated memory
Degradation modes starting from triplication:
Phase 1  Triplication + Spare: 111s
Phase 2  Triplication: 111x, 11x1, 1x11, x111
Phase 3  Duplication: 11xx, 1x1x, ..., x1x1, xx11
Phase 4  No FT: 1xxx, x1xx, xxx1
Phase 5  Failure: F
Figure 7.17: Degradation phases of a triplicated memory system
The areas processor, interfacing zone and passive zone can be defined in terms of configurations, together with their degradation sequences. Configurations and their changes are supported by the run-time system, in principle enabling sequential degradation “up to the last soldier”, when a single element of each section is left but the system remains operable.
16-bit wide memory modules could also be used instead of 32-bit modules. In this case, two memory modules must be combined to allow 32-bit memory access. The possible configurations with four 16-bit modules are limited to duplication only, as triplication would need at least six memory modules. If 16-bit modules are used, an emergency mode could be implemented using only one 16-bit module, mainly for signaling the need for maintenance or, if space and speed (two memory accesses for loading one 32-bit word) are sufficient, to run the most critical applications.
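The degradation phases of the triplicated memory can be sketched as a small state function over the four-character patterns used in the figure. This is an illustrative sketch under assumptions: the patterns are read as `1` for a module in use, `s` for a spare and `x` for a module out of use, the final failure state is written `xxxx` rather than `F`, and the names `phase` and `fail` and the rule "promote the spare when fewer than three modules remain in use" are hypothetical.

```python
PHASES = {
    1: "Triplication + Spare",
    2: "Triplication",
    3: "Duplication",
    4: "No FT",
    5: "Failure",
}


def phase(pattern: str) -> int:
    """Map a module pattern such as '111s' or 'x1x1' to its phase number."""
    working = pattern.count("1")
    if working >= 3:
        return 1 if "s" in pattern else 2
    if working == 2:
        return 3
    if working == 1:
        return 4
    return 5


def fail(pattern: str, index: int) -> str:
    """Mark one module as out of use and promote the spare if one is left."""
    cells = list(pattern)
    cells[index] = "x"
    if "s" in cells and cells.count("1") < 3:
        cells[cells.index("s")] = "1"  # spare replaces the lost module
    return "".join(cells)
```

Starting from `111s` and failing one module at a time walks through the phases in order, which is the sequential degradation “up to the last soldier” described in the text.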
Reliability…
…re, an analysis of the surface shape and evaluation of performance and reliability …tion caused by the used redundancies should be performed for every fault tolerant … Figure 4.4 presents qualitatively a slope where a fault tolerant system should be between the plane of requirements and curves of reliability and performance …tion.
Performance… PRESSA again
We actually need:
- Performance, Reliability, Energy
- reconfigurable systems design and their analysis
- done by a good team of collaborators…
Figure 4.4: Tradeoffs to be made in fault tolerant system design:
Thanks for… and…
- Discussions, efforts: T Kaegi, S Monkman, B Kirk
- Discussions on redundancy: J C Laprie (late 80’s)
- Discussions on reliability vs. FT: S Birolini (2005-)
  (See Birolini, Reliability Engineering, Springer, Ed. 7, 2013)
- Discussions on Graph Logic Model: Felix Friedrich
- Pictures: S Monkman, V Castano
- NMI team: Paul Jarvie, Rebecca Mann, Jon Older, Mark Hodgetts