Fault Mgmngt Techniques


Pergamon. Prog. Aerospace Sci. Vol. 32, pp. 415-431, 1996. Copyright 1996 Elsevier Science Ltd

Printed in Great Britain. All rights reserved. 0376-0421/96 $29.00

0376-0421 (95) 000010-0

A REVIEW OF FAULT MANAGEMENT TECHNIQUES USED IN SAFETY-CRITICAL AVIONIC SYSTEMS

David M. Johnson

Department of Aerospace Engineering, University of Bristol, Bristol BS8 1TH, U.K.

(Received for publication 14 November 1995)

Abstract: In order to achieve high integrity levels in complex, real-time, safety-critical systems, it is necessary to detect failures and take appropriate fault recovery action, to maintain safe system operation or fail to a safe state. It may also be necessary to alert the operator of the failure. In order to take appropriate maintenance action it is also necessary to isolate the failed component. This process is termed fault management. Airline experience with modern avionic systems is that, despite the apparent sophistication of the Built-In Test Equipment and Centralised Maintenance Systems, spurious fault detection is unacceptably high. Fault detection coverage is not uniformly good and fault isolation is often inaccurate or imprecise. This paper presents a critical analysis of the methods currently used in fault management, in the light of personal experience of safety-critical systems development within the aircraft industry and work by other researchers. It makes recommendations about the use of the various approaches and attempts to highlight areas where future research could be most usefully directed. It also assesses the impact that new avionics architectures may have on the utility of the various approaches to fault management in future aircraft systems. Copyright 1996 Elsevier Science Ltd.

CONTENTS

1. INTRODUCTION
2. FAULT DETECTION AND FAULT ISOLATION
  2.1. Analysis of fault detection requirements
    2.1.1. Failure modes and effect analysis (FMEA)
    2.1.2. Fault tree analysis (FTA)
    2.1.3. Byzantine resilience (BR)
    2.1.4. Analysis of common mode failure (CMF)
    2.1.5. Analysis of human error
    2.1.6. Summary of fault analysis methods
  2.2. Fault detection and isolation methods
    2.2.1. Reasonableness tests
    2.2.2. Behaviour tests
    2.2.3. Continuity tests
    2.2.4. Performance tests
    2.2.5. Comparisons between redundant elements
    2.2.6. Off-line tests
    2.2.7. Summary of fault detection and fault isolation methods
3. FAULT TOLERANCE AND SYSTEM RECONFIGURATION
  3.1. Fault recovery
    3.1.1. Continued operation in degraded mode
    3.1.2. Failure to a safe passive state
    3.1.3. Failure absorption
    3.1.4. System reconfiguration
    3.1.5. Direct failure recovery
    3.1.6. Operational limitation
  3.2. Fault tolerant design
    3.2.1. FTA/FMEA approach
    3.2.2. Byzantine resilience
  3.3. Fault recovery and system reconfiguration summary
4. THE IMPACT OF NEW INTEGRATED AVIONICS ARCHITECTURES
  4.1. Conventional systems architectures
  4.2. Integrated modular avionics (IMA)
  4.3. Analysis of the effects of IMA on existing fault management techniques
5. CONCLUSIONS
REFERENCES


1. INTRODUCTION

The aim of this paper is to review the fault management techniques that are currently used in the design of avionic systems. It is not intended as a review of techniques that might be applied in the future, but aims to highlight the principal problems that any new techniques should resolve. It is primarily concerned with the problems of fault detection and diagnosis and of fault tolerance, rather than the prevention of specification, design or implementation errors. These problems, and the general problem of software integrity, are, however, important issues in choosing and developing a fault management strategy and they are therefore discussed briefly. Safety and the regulatory requirements placed on avionic systems are also discussed briefly, in order to place the fault management problem in context. The paper does not, however, attempt to provide a comprehensive review of recent developments in the regulatory requirements, as these relate more to the evidence of compliance with safety objectives than to the techniques used in detecting, diagnosing and tolerating faults.

The severity of the fault management problem facing the designers of complex, high-integrity systems can be illustrated by Lufthansa's early experience of A320 operations. From their A320 fleet, they reported an average of 2000 entries in the Post Flight Reports each day. Of these 2000 entries, only 70 could be correlated with a pilot report, though there were a further 70 pilot reports with no corresponding entry in the Post Flight Report. As a result of analysis of the Post Flight Reports and pilot reports, on average 17 Line Replaceable Units were removed each day. Of these 17, only two were confirmed faulty with a fault that correlated with the reports.

This high level of spurious fault detection can largely be attributed to the requirement to achieve very high levels of safety. The regulations for commercial aircraft, as defined by JAR 25 (1) and FAR 25 (2), require that the probability of a catastrophic system failure (leading to loss of the aircraft) must be less than 10^-9 per hour of flight. Achievement of this requires the use of redundancy, the detection of failures with very low probabilities of occurrence and often an immediate response to the detection of a failure.
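The link between the 10^-9 per flight hour target and the need for redundancy can be illustrated with a small calculation. The sketch below is illustrative only: it assumes a constant failure rate per channel and fully independent channels (no common mode failures), and the channel failure rate of 1e-4 per hour is a hypothetical value, not a figure taken from this paper.

```python
import math


def channel_failure_probability(rate_per_hour: float, hours: float = 1.0) -> float:
    """Probability that one channel fails within `hours` (constant failure rate assumed)."""
    # -expm1(-x) equals 1 - exp(-x) but keeps precision for very small x.
    return -math.expm1(-rate_per_hour * hours)


def all_channels_fail_probability(rate_per_hour: float, channels: int, hours: float = 1.0) -> float:
    """Probability that every channel fails, assuming the channels fail independently."""
    return channel_failure_probability(rate_per_hour, hours) ** channels


# Hypothetical channel failure rate, chosen for illustration only.
RATE = 1e-4

for n in (1, 2, 3):
    print(n, "channel(s):", all_channels_fail_probability(RATE, n))
```

Under these assumptions two channels are not enough to fall below 10^-9 per hour, whereas three are. The independence assumption is the weak point: a common mode failure affects several channels at once, so the simple product no longer applies.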

A number of approaches have been taken in attempting to achieve very high integrity levels in digital systems and to resolve the fault management problems. These differing approaches attempt, in various ways, to prevent or to combat the effects of random hardware failures and various types of common mode failure, including specification, design or implementation errors and external interference. A number of approaches have also been used to minimise risks due to human error.

There are four main aspects that must be considered when designing high integrity digital systems. The first is fault prevention, the second is fault detection and isolation, the third is fault tolerance and system reconfiguration, and the fourth is fault reporting and alerting of the operator. Within each of these, the possibilities of random hardware failures, of common mode failures due to specification, design or implementation errors, or external interference, or of human error must be addressed.

Fault prevention will not be discussed in detail except as it relates to human error. This will include errors in specification, design or implementation of the system as well as operator error. The human factors problems related to fault warning systems, though extremely important, are beyond the scope of this paper and will be addressed only briefly. Methods of fault detection and isolation will be examined in detail. Fault recovery procedures will then be addressed and these two aspects will then be drawn together to provide an overall view of the fault management problem. Finally, the impact of new, integrated avionics architectures will be examined so that recommendations can be made for the direction of future research.

2. FAULT DETECTION AND FAULT ISOLATION

Faults may occur in any component of the system, or in the connections or interfaces between system components. As already stated these faults fall into two main categories:



  Random hardware failures, assumed to occur with a constant random failure rate.
  Common mode failures (including specification and design errors).

It is important to recognise the existence of these two types of failure. A single random hardware failure will affect only one component or element of the system whereas a common mode failure may affect more than one component or element. This is of great significance when considering the use of redundancy. Many of the fault detection schemes developed over the past 20 years have concentrated on the detection and isolation of random hardware failures. Fault detection schemes have progressed from ad hoc methods to more rigorous and comprehensive approaches. The principal approaches used today are described below. The problem of error on the part of the operator is also analysed and approaches to the prevention of errors and to ameliorating the effects of error are also discussed.

2.1. ANALYSIS OF FAULT DETECTION REQUIREMENTS

Before choosing any method of failure detection, it is first necessary to identify the failures that must be detected. Discussed below are the principal analysis techniques used today.

2.1.1. Failure Modes and Effect Analysis (FMEA)

This is a bottom-up method of analysis, and as such cannot be completed until the system design has been completed. This is a problem, because it is quite common when using this approach to find that there are many failures that either cannot be detected or cannot be accurately isolated. The system design may then require modification, or periodic tests or inspections may have to be added, in order to satisfy safety, availability or maintainability requirements.

Basically the method looks at each component of the system. It examines the failure modes of each component (and the failure rate) and assesses the impact of each failure mode on the operation of the system. This is a time-consuming process and it is necessary to show, from the FMEA, that a sufficient level of fault detection can be achieved to accomplish the system safety objectives. Thus, probable failures with a severe effect must be detected and appropriate recovery action must be taken, whereas it may be acceptable not to detect highly improbable failures or failures that have no significant impact on system behaviour. Lala and Harper (3) suggest that such an analysis is not only time-consuming to perform, but also next to impossible to prove. It is very difficult to ensure that all possible failure modes and, where necessary, combinations of failure modes, have been considered and that their effects have been correctly analysed.
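The bookkeeping behind an FMEA can be sketched as a table of failure modes from which the residual rate of undetected hazardous failures is summed. The following is a minimal sketch, not a certified method: the components, failure modes, failure rates and the `hazardous`/`detected` flags are all hypothetical entries invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    mode: str
    rate_per_hour: float  # assumed constant failure rate
    hazardous: bool       # does the effect threaten a safety objective?
    detected: bool        # does some monitor detect this failure mode?


def undetected_hazardous_rate(table):
    """Sum the rates of hazardous failure modes that no monitor detects."""
    return sum(fm.rate_per_hour for fm in table if fm.hazardous and not fm.detected)


def detection_coverage(table):
    """Fraction of the total failure rate that is covered by fault detection."""
    total = sum(fm.rate_per_hour for fm in table)
    covered = sum(fm.rate_per_hour for fm in table if fm.detected)
    return covered / total if total else 1.0


# Hypothetical FMEA entries (illustrative values only).
FMEA_TABLE = [
    FailureMode("pressure sensor", "output stuck high", 2e-6, True, True),
    FailureMode("pressure sensor", "slow drift", 1e-6, True, False),
    FailureMode("servo-valve", "coil open circuit", 5e-7, True, True),
    FailureMode("indicator lamp", "filament failure", 1e-5, False, False),
]

print("undetected hazardous rate:", undetected_hazardous_rate(FMEA_TABLE))
print("detection coverage:", detection_coverage(FMEA_TABLE))
```

The sketch makes the weakness noted above visible: the result is only as good as the table, and nothing in the calculation can reveal a failure mode that was never entered.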

2.1.2. Fault Tree Analysis (FTA)

This is a top-down approach and can therefore be performed at the start of the design process and updated throughout the development. The principle here is to take hazardous events identified by a functional hazard analysis (FHA) and to analyse possible causes of those hazardous events.

This is a useful method in defining the basic architecture and overall system design. It is also useful in identifying, at an early stage of the development cycle, the fault detection requirements arising from considerations of system safety. Due to its top-down nature though, there is little hope of guaranteeing that every possible component failure mode that could cause or contribute to causing a hazardous event has been considered. For this reason, it is normal to use the FTA approach to assist in the earlier stages of the design and then to use the FMEA to try to ensure that all types of failure have been considered. This combined FTA/FMEA approach has been widely applied, and is still essentially the approach recommended by the SAE "Guidelines for Certification of Highly-Integrated or Complex Aircraft Systems" (4).
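A fault tree of the kind described above can be evaluated numerically once probabilities are assigned to its basic events. The sketch below assumes independent basic events and uses a hypothetical tuple representation (`"basic"`, `"and"`, `"or"`); the tree itself and its probabilities are invented for illustration.

```python
from math import prod


def probability(node):
    """Evaluate a fault tree node, assuming all basic events are independent."""
    kind = node[0]
    if kind == "basic":
        return node[1]
    child_probs = [probability(child) for child in node[1]]
    if kind == "and":   # the event occurs only if every input event occurs
        return prod(child_probs)
    if kind == "or":    # the event occurs if at least one input event occurs
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate type: {kind}")


# Hypothetical top event: loss of a function if both redundant channels fail,
# or if a shared power supply fails (illustrative probabilities per flight hour).
TOP_EVENT = ("or", [
    ("and", [("basic", 1e-4), ("basic", 1e-4)]),
    ("basic", 1e-7),
])

print("top event probability:", probability(TOP_EVENT))
```

In this example the shared supply dominates the result, which is the kind of architectural insight the top-down analysis is intended to give early in the design.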



2.1.3. Byzantine Resilience (BR)

The FTA/FMEA approach requires extensive and detailed analysis, and it is argued that this analysis process is prone to error (3) and is unlikely ever to fully succeed. An alternative approach is to design the system to provide 'Byzantine resilience'. A Byzantine fault is a fault that may exhibit any arbitrary behaviour. This may include any behaviour within the physical limits of the component that may corrupt the system. A system has Byzantine resilience if it can tolerate this class of fault. If a system has Byzantine resilience, then it is guaranteed to tolerate any actually occurring component failure mode and hence it is no longer necessary to conduct a detailed analysis of the actual failure modes that may occur. Design for BR thus attempts to remove the tedium of the FTA/FMEA approach and to provide a better guarantee of fault tolerance.

The principle of the BR approach is that any deviation from expected behaviour is regarded as a fault. For components that are subject to variations in performance, due to external conditions, manufacturing tolerances, etc., the BR approach is difficult to apply. It also requires quite high levels of redundancy that may not be cost effective for all types of component. It is thus primarily applicable to fault tolerant computers, since these are less susceptible to random variations due to component tolerances, wear, ageing, etc. Using the BR approach, a computer must consist of 3n + 1 redundant channels, each identical, where n is the number of faults that must be tolerated. The redundant channels must be initialised to exactly the same state, they must receive identical inputs, they must perform identical instructions and they must be synchronised. The outputs of each of the redundant channels are compared and a failure is detected if an exact bit-wise comparison of the outputs reveals any discrepancy between the redundant channels.
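The channel count and the exact bit-wise comparison described above can be sketched as follows. This is a simplified illustration only: it shows the final output comparison, not the initialisation, input distribution or synchronisation that the approach also requires, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def channels_required(faults_to_tolerate: int) -> int:
    """Number of identical channels needed to tolerate n faults (3n + 1)."""
    return 3 * faults_to_tolerate + 1


def compare_outputs(outputs: List[bytes]) -> Tuple[bytes, List[int]]:
    """Exact bit-wise comparison of channel outputs.

    Returns the majority output and the indices of the channels whose
    output differs from it in any bit.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no majority among channel outputs")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Four channels tolerate one fault; channel 3 differs in a single bit.
outputs = [b"\x01\x02", b"\x01\x02", b"\x01\x02", b"\x01\x03"]
print(channels_required(1), compare_outputs(outputs))
```

Because the comparison is exact, any single-bit difference is treated as a failure, which is why the channels must be identical and tightly synchronised.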

Lala and Harper describe the principles of this approach in some detail (3) and claim that it eliminates the need for detailed failure analysis. This claim is valid up to a point but there are a number of important limitations to the approach, not least that the approach is only really applicable to the computing element. These will be addressed in detail later in the paper.

2.1.4. Analysis of Common Mode Failure (CMF)

All of the above analysis techniques deal fundamentally with the problem of random hardware failure of components. Failures due to external interference are generally dealt with by specific analyses and testing to show resilience to the various types of external threat. This might typically include analysis and testing of the effects of electro-magnetic interference (EMI) and lightning, disc burst analysis for uncontained engine rotor failures, fire containment analysis, etc. Alternatively, if the effects of external interference can be predicted, and a probability associated to the occurrence of the interference, then the FTA and/or FMEA methods may be used.

Failures due to specification, design or implementation errors present a greater problem. The FMEA will not reveal these types of failure and the BR approach is of no use if the redundant channels contain a common fault. FTA may be used in a limited manner, in that the fault tree for a particular event may include a contribution from this category of failure. A probability of the failure cannot be included in the analysis, but an objective or integrity target can be set and fault detection requirements can be identified. It is this FTA approach that is generally used to define the software integrity requirements of a system. Software can effectively be treated as a component with Byzantine failure properties. The BR approach though is inappropriate for detecting software failures since it relies on identical software running synchronously on identical redundant computing channels.

The SAE guidelines (4) define "System Development Assurance Levels" depending on the criticality of failure for a particular system or system function. Similarly RTCA DO-178B (5) defines software levels dependent on the worst effect of software failure. The basic principle is that the worse the possible effect of error, the greater the effort that should be put into ensuring that there are no errors. For complex systems, and for software in particular, proof of integrity to the levels required by JAR 25 (1) and FAR 25 (2) remains beyond the scope of the current state of the art (6), despite improvements in the software development process, the continuing development of formal methods and progress in software reliability analysis techniques. Although the idea of proving the correctness of software is appealing, the use of formal proofs does not take into account the possibility of errors in the original specification on which the proof is based, or the possibility that the proof itself might be wrong (7). The ability of reliability analysis techniques is also limited and they can only be used to demonstrate relatively modest levels of reliability (8).

Together, the above guidelines and airworthiness regulations provide a much clearer definition of the assurance activities required of system and software developers than previous guidelines, but do not provide developers with any significant new techniques with which to address the fault management problem or to prove very high levels of software reliability. They represent in effect a definition of good practice that has evolved over a number of years, but that has not previously been universally applied.

Dissimilarity between software (and hardware) used in redundant channels may be used to reduce the probability of malfunction due to design or implementation errors, but the benefit gained from this is difficult to quantify. The independence of dissimilar sets of software, produced from a common requirement or specification, has been questioned by many and has been shown to be unreliable (9). To achieve greater independence, functional dissimilarity should be employed. This, ideally, should use independently developed specifications. The SAE guidelines (4) allow a reduction in assurance level for the use of functional dissimilarity, though the dissimilarity must be assured to the higher level.

2.1.5. Analysis of Human Error

The analysis of the possibilities and effects of human error is often neglected in system design or at least is not performed explicitly. It is often assumed that the operator will not give erroneous commands to the system, or that if (s)he does then it is not the fault of the system. Similarly, it is often assumed that the operator can be relied upon to take the correct and timely course of action in order to overcome the effects of failure elsewhere in the system. To some extent this approach has to be accepted. It will always be possible for the operator to cause an accident through malicious intent, gross negligence or incompetence. However, the system designer should try to design the system to minimise the risks from accidental errors made by a qualified operator. The SAE guidelines (4) require that analyses are performed to justify the assumptions made about operator behaviour and to show that reasonable precautions have been taken in the design, to protect the system from human error.

There are two separate approaches that may be taken: the prevention of error and the amelioration of the effect of error (10), but their application has not been as systematic as might be hoped. Application of the SAE guidelines (4) should help ensure that a more systematic approach is adopted in the future, but many of the system design features discussed below are the result of accident investigation leading to system modification, rather than the systematic analysis of the possibilities of human error.

2.1.5.1. Prevention of human error

Human error may be prevented, or at least the risks reduced, by careful design of the man-machine interface. Consideration should be given to the presentation of information to the operator and to the design and positioning of controls. The workload and skill requirements placed on the operator should also be assessed to ensure that they are reasonable and environmental considerations should also be taken into account. The field of ergonomics, the design of man-machine interfaces and the associated physiological and psychological considerations are too large for detailed discussion here, but some examples from aircraft systems will illustrate the type of consideration required of the system designer.

Controls performing different functions should be physically separated and differentiated to prevent operation of the wrong control, for example to prevent inadvertent retraction of the landing gear when trying to operate the flaps. Such design features may seem obvious, but this is one of the problems: systems' designers, being human, tend to believe that they are automatically expert in human factors and therefore do not see the need for specialist involvement or the need to make a specific analysis.

Altimeter design evolved due to accidents caused by misreading of three pointer altimeters fitted on early aircraft. This reduced the incidence of 'controlled flight into terrain' (CFIT) accidents but did not eliminate them. Pilots continued to rely on their visual perception and neglected to monitor the altimeter readings on approach. The use of Ground Proximity Warning Systems to try to prevent CFIT accidents is an example of a system dedicated to the prevention of human error. By providing audio warnings of the aircraft height, awareness of the aircraft situation is increased. This has further reduced CFIT accidents but has still not eliminated them.

2.1.5.2. Amelioration of the effects of human error

It is generally accepted that, whilst it is possible to reduce the probability of human error, it will not be possible to eliminate it completely (as in the case of CFIT accidents). It is, therefore, also desirable to design features which reduce the effects of human error. Examples from aircraft systems will again illustrate the approach.

It is common for landing gear control levers to include a mechanism that will prevent the pilot from selecting UP when the aircraft is on the ground. Additionally, safety features may be included to prevent the extension of the landing gear above a certain speed. The pilot may still make the error of trying to operate the landing gear but (s)he will be prevented from doing so.

The Fly-By-Wire system of the A320 includes a number of features to protect the aircraft from the effects of pilot error. These effectively prevent the pilot from operating the aircraft outside the flight envelope for which it was designed, thus preventing stall, over speed, excessive manoeuvre, etc.

A simpler example is the confirmation of inputs made from a keypad before executing the instructions. This allows the pilot to correct an erroneous input before any action is taken.
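The landing gear interlock described above can be expressed as a simple permission check that sits between the pilot's selection and the actuation. The sketch below is purely illustrative: the state fields, the command names and the speed limit of 250 kt are hypothetical and are not taken from any particular aircraft.

```python
from dataclasses import dataclass

# Hypothetical limit above which gear extension is inhibited (illustrative value).
MAX_GEAR_EXTENSION_SPEED_KT = 250.0


@dataclass
class AircraftState:
    on_ground: bool
    airspeed_kt: float


def gear_command_permitted(state: AircraftState, command: str) -> bool:
    """Return True if the selected landing gear command may be executed."""
    if command == "UP":
        # Retraction is inhibited while the aircraft is on the ground.
        return not state.on_ground
    if command == "DOWN":
        # Extension is inhibited above the assumed speed limit.
        return state.airspeed_kt <= MAX_GEAR_EXTENSION_SPEED_KT
    raise ValueError(f"unknown command: {command}")


print(gear_command_permitted(AircraftState(on_ground=True, airspeed_kt=0.0), "UP"))
print(gear_command_permitted(AircraftState(on_ground=False, airspeed_kt=180.0), "DOWN"))
```

The design choice is that the erroneous selection is still possible but has no effect, which ameliorates the error rather than preventing it.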

2.1.6. Summary of Fault Analysis Methods

The FTA/FMEA approach can be used to assist in the system design, to identify fault detection requirements and to show that an adequate level of fault detection coverage has been achieved. The method is widely used and accepted, though it is labour intensive and prone to error. It is important to perform both types of analysis. Neither analysis method on its own is sufficient. The effects of common mode failures can be incorporated into the FTA, though the failure probabilities may not be known. The BR approach, though it only really addresses computer faults, does not require detailed failure analysis. This is an important advantage but unfortunately the BR approach does not address the problem of common mode failures. This does not necessarily rule out the BR approach, but common mode failures must be tackled in some other way. Current certification guidelines (4, 5) allow this approach to be used, on the basis that the application of the highest assurance levels will result in system and software integrity levels of the order required, despite the fact that these levels of integrity cannot be proven.

Both the FTA/FMEA approach and the BR approach, despite their limitations, offer a more systematic analysis of failures than any method currently employed for assessing the possibilities of human error. Human factors research has produced a vast quantity of heuristic rules and guidelines defining good practice but has not yet provided the system designer with a systematic method of attacking the problem. The development and application of failure analysis and fault tolerant design techniques to the problem of human error must therefore be considered.

One further option available to the system designer is to automate functions such that intervention by the human operator is reduced or eliminated. This clearly can remove the possibility of human error in operation of the system, but it must be remembered that humans still perform certain tasks better than machines (pattern recognition and most sensory tasks, reasoning with uncertainty, etc.) and that human error remains possible in the design and implementation of the system. A further problem with automation is that it can result in a degradation of the performance of the human operator, due to an effect known as peripheralisation (11).

2.2. FAULT DETECTION AND ISOLATION METHODS

Having identified the fault detection requirements it is necessary to design the system such that it can detect and isolate the faults to allow appropriate recovery action to be taken. This is a major area of difficulty in the development of real-time safety-critical systems. Problems are frequently encountered with spurious fault detection and with failure to detect faults.

The BR approach to the problem uses redundancy, comparisons and voting to detect and isolate failures. It does not attempt to detect specific failure modes of components or to detect specific types of malfunction. Of the fault detection methods described below, only the comparison between redundant channels is directly applicable in the BR approach. The other types of test may, however, be used in conjunction with a BR approach, in order to improve fault isolation and to prevent propagation of failures through the BR fault containment regions.

The following types of fault detection may be appropriate, depending on the type of failure: reasonableness tests; behaviour tests; continuity tests; performance tests; comparisons between redundant elements; and off-line tests. These are discussed below.

2.2.1. Reasonableness Tests

This type of test is particularly applicable to the monitoring of sensors or other system inputs. Such parameters are normally expected to remain within prescribed limits and may be expected to change in prescribed ways. Fault detection schemes can therefore be devised which detect a failure if the parameter is not within prescribed limits or is not changing in the prescribed manner.

For example, if a pressure sensor is expected to produce an analogue output in the range 1-5 V, a failure may be detected if the signal lies outside this range. If the thresholds for fault detection are set too tight, then nuisance fault detection may occur, for example, as a result of component tolerances. The thresholds would not therefore normally be set at 1 V and 5 V. However, if the thresholds are set too wide, then some failures may not be detected. Problems may also occur due to noise or other external, transient disturbance, so it is normal to include some filtering or confirmation of the fault condition.
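The range check with fault confirmation described above might be sketched as follows. The thresholds of 0.8 V and 5.2 V (slightly wider than the nominal 1-5 V range) and the confirmation count of three cycles are assumptions chosen for illustration; the class name is hypothetical.

```python
class RangeMonitor:
    """Reasonableness test: latch a failure once the signal has been out of
    range for a number of consecutive computing cycles."""

    def __init__(self, low: float, high: float, confirm_cycles: int):
        self.low = low
        self.high = high
        self.confirm_cycles = confirm_cycles
        self._count = 0
        self.failed = False  # latched once the fault is confirmed

    def update(self, value: float) -> bool:
        if self.low <= value <= self.high:
            self._count = 0          # a good sample resets the confirmation
        else:
            self._count += 1
            if self._count >= self.confirm_cycles:
                self.failed = True
        return self.failed


# Assumed thresholds set outside the nominal 1-5 V range to allow for tolerances.
monitor = RangeMonitor(low=0.8, high=5.2, confirm_cycles=3)
samples = [3.0, 5.6, 3.1, 0.2, 0.1, 0.0]
print([monitor.update(v) for v in samples])
```

In this run the isolated spike of 5.6 V is ignored, whereas three consecutive samples below the lower threshold confirm the failure. Widening the thresholds or lengthening the confirmation time reduces nuisance detection at the cost of slower or missed detection.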

There are two approaches that can be taken to determine the optimum detection thresholds and filtering or fault confirmation times, though it is not unusual for a more haphazard approach to be used. The first approach is to use the FMEA to identify the actual failure modes and to design the monitor to detect these specific failures. The second approach is to use the FHA and FTA to identify system level failure events and to design the monitors to protect the system from the feared events.

This type of test is usually performed within a single computing channel and is necessary in a BR system, to prevent fault propagation from one channel to another as a result of a failed input.



2.2.2. Behaviour Tests

Components of the system are expected to perform, within some prescribed limits, according to the inputs applied to them. Failure of a component to behave within such limits may be detected as a failure. Determination of monitoring algorithms for this type of failure is generally more difficult and may require the expected component behaviour to be modelled for comparison. For example, the movement of the spool of an electro-hydraulic servo-valve may be expected to respond to the servo-valve current according to a first order lag with a time constant of 10 ms. The monitoring algorithm may then use measured servo-valve current, with a model of the nominal servo-valve behaviour to determine the predicted position of the spool. This may then be compared with the measured spool position and a failure may be detected if the difference exceeds a threshold. The failure condition would normally be confirmed over several computing cycles. This technique may also be referred to as analytical redundancy.
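The servo-valve monitor described above can be sketched as a discrete first order lag model running alongside the real valve. Only the 10 ms time constant comes from the text; the 1 ms computing cycle, the unity gain, the threshold of 0.1 and the confirmation over five cycles are assumptions, and the class name is hypothetical.

```python
import math


class ServoValveMonitor:
    """Behaviour test: compare measured spool position with a first order
    lag model driven by the measured servo-valve current."""

    def __init__(self, tau_s=0.010, dt_s=0.001, gain=1.0,
                 threshold=0.1, confirm_cycles=5):
        # Exact discretisation of a first order lag for a constant input
        # over one computing cycle.
        self.alpha = 1.0 - math.exp(-dt_s / tau_s)
        self.gain = gain
        self.threshold = threshold
        self.confirm_cycles = confirm_cycles
        self.predicted = 0.0
        self._count = 0
        self.failed = False

    def update(self, current: float, measured_position: float) -> bool:
        # Advance the model by one cycle, then compare with the measurement.
        self.predicted += self.alpha * (self.gain * current - self.predicted)
        if abs(measured_position - self.predicted) > self.threshold:
            self._count += 1
        else:
            self._count = 0
        if self._count >= self.confirm_cycles:
            self.failed = True
        return self.failed


# A jammed spool: the current steps to 1.0 but the measured position stays at 0.
monitor = ServoValveMonitor()
results = [monitor.update(current=1.0, measured_position=0.0) for _ in range(20)]
print(results[0], results[-1])
```

The failure is not declared on the first cycle, because the model itself takes time to move away from the jammed position and the discrepancy must persist for the confirmation period.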

Once again, either the FMEA or the FHA and FTA may be used to assist in the definition of the algorithm, but the analysis is usually more complex. Once again, it is quite common that neither approach is used.

2.2.3. Continuity Tests

In addition to checking the reasonableness of signals and of the behaviour of components, it is also possible to check electrical continuity between components. This type of test, which may detect open circuit or short circuit failures, is particularly useful in distinguishing component failures from wiring faults. For example, electrical continuity to an electro-hydraulic servo-valve may be tested by measuring voltages across a resistance placed in the line and at the output from and return to the computer. When data bus communications are used to connect two (or more) components, the level and accuracy of fault diagnosis is usually improved, since it is possible to monitor the transmission, refreshment and parity of data.

Requirements for this type of test can be obtained from the FTA or the FMEA, if the connections between system components are included in the analysis. The BR approach would not provide specific monitoring requirements, but would offer a fault detection method based on component and communication redundancy and comparisons between the redundant elements.
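Monitoring of the refreshment and parity of data received over a data bus might look like the sketch below. It assumes, for illustration only, that each received word carries an odd parity bit and that data older than a configured maximum age is treated as stale; the word format, the class name and the 0.1 s limit are hypothetical and do not describe any specific bus standard.

```python
def odd_parity_ok(word: int) -> bool:
    """True if the word, including its parity bit, has an odd number of 1 bits."""
    return bin(word).count("1") % 2 == 1


class BusInputMonitor:
    """Tracks whether valid data has been refreshed recently enough."""

    def __init__(self, max_age_s: float):
        self.max_age_s = max_age_s
        self.last_valid_time = None

    def receive(self, word: int, now_s: float) -> bool:
        """Record a received word; only words with good parity count as a refresh."""
        if odd_parity_ok(word):
            self.last_valid_time = now_s
            return True
        return False

    def is_fresh(self, now_s: float) -> bool:
        """True if a valid word has been received within the maximum age."""
        if self.last_valid_time is None:
            return False
        return (now_s - self.last_valid_time) <= self.max_age_s


monitor = BusInputMonitor(max_age_s=0.1)
print(monitor.receive(0b0111, now_s=0.0), monitor.is_fresh(0.05), monitor.is_fresh(0.2))
```

Separating the parity result from the freshness result helps diagnosis: a corrupted word points towards the transmission path, whereas a loss of refresh points towards the transmitter or an open connection.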

2.2.4. Performance Tests

These tests are associated with the behaviour of the processing elements of the system. Their aim is to trap software or processor failures that lead to exceptions or failure to execute in some prescribed manner within real-time constraints. Built-in exception handling (e.g. bus contention, illegal address, divide by zero), often with automatic recovery, is normally included in avionic and other safety-critical software.

In addition, the real-time execution of the software may be monitored by the use of watchdog timers, of various levels of sophistication. There are also state based methods of monitoring software operation and other methods of monitoring the performance of real-time schedulers. These more sophisticated monitoring methods, though interesting, will not be discussed further. With the availability of greatly increased processing power, it should be possible to achieve system functions using simple, deterministic scheduling methods, which do not require sophisticated monitoring techniques.
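The simplest form of watchdog timer can be sketched as follows. In a real system the timer would normally be an independent hardware device; here the current time is passed in explicitly so that the logic can be shown deterministically. The class name and the 20 ms timeout are hypothetical.

```python
class WatchdogTimer:
    """Simple watchdog: the software must 'kick' the timer within the timeout,
    otherwise the watchdog is considered to have expired."""

    def __init__(self, timeout_s: float, start_s: float = 0.0):
        self.timeout_s = timeout_s
        self._last_kick_s = start_s

    def kick(self, now_s: float) -> None:
        """Called by the monitored software once per computing cycle."""
        self._last_kick_s = now_s

    def expired(self, now_s: float) -> bool:
        """True if no kick has been received within the timeout."""
        return (now_s - self._last_kick_s) > self.timeout_s


watchdog = WatchdogTimer(timeout_s=0.020)
watchdog.kick(now_s=0.010)            # cycle completed on time
print(watchdog.expired(now_s=0.025))  # still within the timeout
print(watchdog.expired(now_s=0.050))  # software has stopped kicking
```

A watchdog of this kind only shows that the software is still reaching the kick point; it says nothing about whether the computed results are correct.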

Most of these methods are used, to some degree, in most avionic applications. Although formal processes for determining the required tests are not well defined, there is a wealth of experience, in the use of this type of test, from which to draw. The proof of the effectiveness of the tests is, however, rarely attempted.



2.2.5. Comparisons Between Redundant Elements

Redundancy, with comparisons between the redundant elements, may be used to detect and isolate failures. This is most commonly used in the sensing and processing elements of a system. Redundancy of actuation elements is often also employed, but it is less common to compare their performance to detect and isolate faults. Comparisons may be made between two or more redundant elements. These elements may be identical or dissimilar and the comparisons made may be exact or approximate.

The BR approach to fault management uses identical redundancy and exact bit-wise comparison. A similar approach is used on the space shuttle, though to protect from common mode failure there is additional, manually selectable, dissimilar redundancy. Other approaches use identical redundancy but with less strict synchronisation and initialisation requirements, and approximate comparison (e.g. the Boeing 737 Yaw Damper). Others use dissimilar redundancy and approximate comparison (e.g. the A320 Slat and Flap Control Computer and the Boeing 777 Flight Control Computers) or functionally dissimilar monitoring (e.g. some monitoring performed by the A340 Brakes and Steering Control Unit). It is not possible to use dissimilar redundancy and exact bit-wise comparisons.

Lala and Harper (3) argue that methods based on approximate comparison can never be wholly successful since they must achieve a compromise between detection of all faults and the avoidance of spurious fault detection. The achievement of a satisfactory compromise is undoubtedly problematic, but depending on the objectives of the fault diagnosis it is not impossible. The argument put by Lala and Harper assumes that it is necessary to detect any deviation between the operation of the redundant channels and that components can exhibit any arbitrary mode of failure. This is not the case. A failure may be more usefully defined as the failure of a component, or piece of equipment (or software), to perform within some specified limits under specified conditions. With this definition of failure, supported by knowledge of actual failure modes, approximate comparison schemes can be developed successfully. This task is further eased if it is recognised that it is only necessary to detect failures that represent a hazard. This distinction between reliability and safety is important but is often overlooked. (12)
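The difference between the two styles of comparison can be sketched as follows. The exact comparison flags any difference at all, whereas the approximate monitor flags only a disagreement that exceeds a threshold for a number of consecutive cycles. The class name, the threshold and the confirmation count are hypothetical; in a real design they would follow from the hazard definition and the known failure modes, and this sketch does not represent the monitors of any aircraft system cited here.

```python
def exact_match(word_a: int, word_b: int) -> bool:
    """Exact bit-wise comparison of two channel outputs."""
    return word_a == word_b


class ApproxComparator:
    """Approximate comparison with a threshold and a confirmation count.

    Both parameters are hypothetical and would be derived from the
    hazard definition and the known failure modes.
    """

    def __init__(self, threshold: float, confirm_cycles: int) -> None:
        self.threshold = threshold
        self.confirm_cycles = confirm_cycles
        self.count = 0          # consecutive cycles in disagreement
        self.latched = False    # once confirmed, the fault stays latched

    def update(self, value_a: float, value_b: float) -> bool:
        """Process one cycle; return True once a fault is confirmed."""
        if abs(value_a - value_b) > self.threshold:
            self.count += 1
        else:
            self.count = 0      # transient disagreement is forgotten
        if self.count >= self.confirm_cycles:
            self.latched = True
        return self.latched
```

Raising the threshold or the confirmation count reduces spurious detection at the cost of slower or coarser detection, which is the compromise discussed above.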

Based on the FHA and FTA, it should then be possible to define explicitly the behaviour of the system, or of a component of the system, that is considered hazardous. This definition of hazardous behaviour may then be used, independently of the functional requirements, as a basis for the definition of functionally dissimilar monitoring, between redundant system elements.

There are disadvantages to the BR approach as well. Common mode failures are not detected; thus a single failure (e.g. due to a common design error or component fault) may simultaneously affect all redundant channels. The assumption of Byzantine fault properties tends to lead to over-design of the system since it is designed to tolerate failure modes that will never occur, or that are sufficiently improbable not to require consideration. Todd and Yount (13) identify the need to maintain maximum decoupling between redundant channels to prevent the introduction of common mode failure paths. A further problem of the BR approach is the need to ensure fault containment within each of the redundant elements whilst providing identical initialisation conditions, close synchronisation and exchange and comparison of data between the channels. A failure within one channel must have no effect outside it and its operation must not be affected by any outside failure. The achievement of this is not straightforward and requires the application of other fault detection techniques to prevent failure propagation through the communication paths. Thus, design for Byzantine resilience does not completely eliminate the need for other failure analysis techniques and fault detection schemes.

2.2.6. Off-Line Tests

All of the above test methods are primarily applicable to continuous checking of some aspect of component or system behaviour or performance. This type of continuous monitoring may be supported by additional off-line tests. These tests may be run at power-up, on request, during specific phases of system operation (e.g. when the system is in stand-by mode), or following initial fault diagnosis. They may be used to detect failures that cannot be detected during normal system operation, to improve the accuracy of the fault isolation or to support an initial fault diagnosis.

Examples of this type of test include memory checksum tests and processor instruction set tests that may typically be performed at computer power-up. Requirements for this type of test may be determined from the FTA/FMEA.
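A power-up memory test of this kind might be sketched as below. A CRC-32 from the Python standard library is used here in place of a simple additive checksum, purely as an assumption for the example; the image contents and the variable names are illustrative.

```python
import zlib


def memory_image_ok(image: bytes, expected_crc: int) -> bool:
    """Compare the CRC-32 of a memory image with its stored reference."""
    return zlib.crc32(image) == expected_crc


# Reference value, assumed to be computed when the image is built.
program_image = bytes(range(256))
reference_crc = zlib.crc32(program_image)

# Simulate a single-bit memory fault.
corrupted = bytearray(program_image)
corrupted[10] ^= 0x01
```

Calling `memory_image_ok(program_image, reference_crc)` passes, while the same call on `bytes(corrupted)` fails, since a CRC detects every single-bit error.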

2.2.7. Summary of Fault Detection and Fault Isolation Methods

All of the above types of fault detection are useful and it is likely that all will be necessary in achieving good fault detection coverage and accurate fault isolation in a complex system. FTA and FMEA are useful in defining the detection requirements and designing the fault detection algorithms, but there is no single methodology available to the system designer that fully supports this task. Consequently schemes for fault detection and isolation tend to be based on a collection of different methods (often applied in an ad hoc fashion), experience and engineering judgement.

Fault detection algorithms based only on the FTA tend not to succeed. The most common problem encountered is that monitor thresholds are set unnecessarily tight, resulting in spurious fault detection. Similarly fault detection algorithms based only on the FMEA rarely succeed. The main problem encountered here is the definition of monitoring algorithms without proper consideration of the need to detect the failure. Again the monitoring algorithms may be unnecessarily stringent. Both types of analysis should be used to optimise the design of monitor algorithms, just as both types of analysis should be used to identify the fault detection requirements. Generally the FTA should be used as the primary means to identify requirements and the FMEA as the primary input to the design of the monitor algorithms to meet these requirements. In this way, monitor algorithms should be designed to detect real failure modes that might cause, or contribute to, some hazardous behaviour of the system.

The BR approach to fault management appears to offer a simpler and more reliable approach to the problem. However, its applicability is limited really to the processing elements of the system, it is susceptible to common mode failures and it is reliant on other fault detection methods to prevent failure propagation between the redundant channels.

Fault detection and isolation remain major problems for the designers of complex real-time systems. There is a pressing need for improved methodologies and design tools in this area. Reliability analysis tools have been developed that automate, to some degree, the analysis process, but these tools are of more assistance in proving compliance with reliability and safety requirements at the end of the development than in aiding the design process.

Assistance in the definition of monitoring algorithms would be of great value to the system designer. Research to define methods that would support this crucial part of the design process and to develop tools to automate the process would be of particular benefit. It is recommended that such methods should consider the fault detection requirements necessary to fulfil the system safety requirements separately from those required for other reasons, and that the safety requirements should be used directly as an input to the process.

Simulation and modelling has proved a valuable aid to system developers in refining monitoring algorithms and other system functional requirements. The use of simulation allows the effectiveness and robustness of monitors to be tested more thoroughly than is possible with the real system and allows testing and validation in advance of the implementation. Further development and the wider use of simulation techniques should, therefore, also be encouraged.

Some of the types of test described above may also be applied to the monitoring of commands from the human operator. For example, the operator input may be checked for reasonableness, the operator may be asked to confirm a keyboard input by re-entering the information, or the command may be checked against the system state. Adaptation of the techniques used for the analysis of other types of fault, or the development of new analysis techniques, would aid the system designer in minimising the risks due to human error.

3. FAULT TOLERANCE AND SYSTEM RECONFIGURATION

Having detected and isolated a failure within the system, or on the part of the operator, it may be necessary to take some action in order to ensure continued safe operation. This action will depend on the design of the system, the effect of the failure, the criticality of the system function and the availability of fail-safe states. It should also be noted that the diagnosis may depend on the type of action to be taken. (14) For example, for immediate recovery action, it may be sufficient to know that there is a fault somewhere in one lane of a replicated system, whereas for line maintenance it may be required to identify the failed unit and for shop maintenance it may be desirable to identify the failed component within the unit.

The various types of recovery action are described below, with examples of their use taken from aircraft system applications. This is followed by an assessment of the methods available to the designer, to determine the actions necessary to achieve the safety objectives for the system, and to design the system so that it can provide the required level of fault tolerance.

3.1. FAULT RECOVERY

The following types of recovery action may be taken: continued operation in a degraded mode; failure to a safe passive state; failure absorption; system reconfiguration; direct failure recovery; and operational limitation.

In addition to taking recovery action, it may also be necessary to alert the operator to the failure. This is essential if the operator is required to modify his/her control of the system as a result of the failure, or if the performance of the system is degraded. As stated previously, the methods used to alert the operator to the presence of failures are important, but will not be discussed in detail here. However, a brief outline of the warning philosophy used on the Airbus A320 is given to illustrate the key features of a warning system:

There are three classes of alert: a 'warning' (red), a 'caution' (amber) and an 'advisory' (green). Failures that have warnings require immediate pilot attention and immediate action. The pilot is alerted by a red flashing master warning light, a continuous aural warning and a message on the engine and warning display (EWD). Instructions may also be given on the EWD and additional system status information is provided on the systems display (SD).

Cautions are given for failures requiring immediate pilot attention but not requiring immediate action. The pilot's attention is drawn to the caution by illumination of an amber master caution light and a single chime. Information is provided on the EWD and SD, similar to that provided for a warning. Cautions may also be given for failures that require crew awareness but that have no specific action. In this case the master caution light is not illuminated and there is no chime.

Advisory messages and information may be displayed on the EWD or SD without specific 'attention getters'. These messages alert the pilot to minor failures and provide information about system status but do not require immediate attention.

If several failures are present simultaneously they are prioritised for presentation to the crew. If the failures have a common cause (e.g. loss of an electrical power supply), then the primary failure (the loss of power supply) is presented as the warning or caution. The secondary failures resulting from the primary failure are listed under system status, on the EWD. If the failures do not have a common cause then they are prioritised according to the need for pilot action and pilot attention. The pilot can clear the failure from the EWD (it then remains under system status) in order to see the next failure.
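The prioritisation just described can be sketched in a few lines. This is a hypothetical model written only to illustrate the logic in the text: the class names, the `caused_by` field and the failure names are invented for the example and do not represent the A320 implementation.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Tuple


class AlertClass(IntEnum):
    """Lower value means higher presentation priority."""
    WARNING = 0    # immediate attention and immediate action
    CAUTION = 1    # immediate attention, no immediate action
    ADVISORY = 2   # information only


@dataclass
class Failure:
    name: str
    alert: AlertClass
    caused_by: Optional[str] = None   # name of the primary failure, if any


def present(failures: List[Failure]) -> Tuple[List[str], List[str]]:
    """Return (alerts in priority order, secondary failures for status)."""
    active = {f.name for f in failures}
    # A failure is secondary only if its stated cause is itself active.
    primary = [f for f in failures if f.caused_by not in active]
    secondary = [f.name for f in failures if f.caused_by in active]
    primary.sort(key=lambda f: f.alert)
    return [f.name for f in primary], secondary


# Illustrative failure set (names are hypothetical).
active_failures = [
    Failure("display_unit_off", AlertClass.CAUTION, caused_by="power_supply_loss"),
    Failure("power_supply_loss", AlertClass.CAUTION),
    Failure("smoke_detected", AlertClass.WARNING),
]
alerts, status = present(active_failures)
```

Here the warning is presented first, the loss of power supply is presented as the caution, and the dependent display failure appears only under system status.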

3.1.1. Continued Operation in Degraded Mode

Not all failures require recovery action. The failure may have no overall effect upon the system or may simply degrade system performance without significant impact on safety. Alternatively the failure may affect safety, but may be of sufficiently low probability that the effect on safety can be accepted. It is, however, usually still necessary to detect and isolate the fault so that maintenance action can be taken at the appropriate time.

Loss of passive redundancy typically has no immediate effect on system operation. Subsequent failures may then have a severe effect, but no action is required as a result of the initial failure. Loss of braking on one of several braked wheels, for example, due to a tachometer failure, will degrade braking performance slightly but will not have any significant effect on safety. No failure recovery action is required. A servo-valve jam resulting in runaway of a spoiler control surface may have a significant safety effect (particularly on take-off or final approach), but the probability of the failure may be sufficiently low that the effect can be accepted, without the need to take any recovery action.

3.1.2. Failure to a Safe Passive State

A particular component may have hazardous failure modes but its continued operation may not be critical to the continued safe functioning of the system. In this case, on detecting the failure, the component may be switched to a safe passive state.

Erroneous output from a computer may be inhibited (e.g. by switching of relays) if there is redundancy available to take over the function, or if the function is not required for the continued safe operation of the system. In some cases the complete system may be shut down. For example, certain slat and flap control system failures can be catastrophic, leading to loss of the aircraft, but it is quite possible to safely continue flight and land with the system inoperative. An aileron runaway caused by a servo-valve jam may be hazardous, but lateral control of the aircraft can be safely maintained without the use of the ailerons, using spoilers and rudder. A detected servo-valve jam may therefore be recovered by removal of hydraulic power to the aileron.

3.1.3. Failure Absorption

Failure absorption is achieved by nullifying the effect of the failure, normally by use of a voting process. This generally requires at least triplex redundancy so that the effect of the failure can be overcome by the action of the un-failed elements.

An example of this is the use of aircraft control surface actuation arrangements that effectively sum the outputs from three (or more) redundant computing channels.
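One common form of voter for analogue signals is mid-value selection, sketched below for a triplex arrangement. It is offered as an assumed illustration of how a voting process absorbs a single failure, rather than as the arrangement used in any of the systems cited; the function name is hypothetical.

```python
def mid_value_select(a: float, b: float, c: float) -> float:
    """Return the median of three redundant channel outputs.

    A single channel failing to any value, high or low, cannot drag
    the selected output outside the range of the two healthy channels.
    """
    return sorted((a, b, c))[1]
```

For instance, `mid_value_select(1.0, 1.1, 50.0)` selects 1.1, so the hard-over on the third channel has no effect on the output.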

3.1.4. System Reconfiguration

If a failure occurs which degrades system performance below some acceptable level, then it is necessary to reconfigure the system in some way in order to recover an acceptable level of operation.

Failure of an active redundant element will normally require changeover to a passive element, for example switching of control between computer channels, or use of standby actuation. In other cases a degradation of system performance may be recoverable by modifying system behaviour. For example, roll control laws used to control the ailerons may be modified in the event of failure of roll spoilers.




The system reconfiguration is not necessarily required to recover fully all aspects of system performance, but to maintain system performance at a level compatible with the probability of the fault occurrence.

3.1.5. Direct Failure Recovery

Certain types of transient failure may be directly recoverable, for example, failures caused by external interference (EMI, lightning, etc.) or by software errors.

Failure recovery in these cases may be automatic or may require some action such as a processor reset. Failures detected by the type of test described in Section 2.2.4 are often dealt with in this way, though it is normal to limit the number of reset attempts.
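The limit on reset attempts can be illustrated with a small counter. The class name, the returned action strings and the choice of shutdown as the fallback are assumptions for this sketch; the appropriate fallback in a given system would be one of the other recovery actions described in this section.

```python
class ResetLimiter:
    """Allow a bounded number of resets, then fall back to a passive state."""

    def __init__(self, max_resets: int) -> None:
        self.max_resets = max_resets
        self.resets_done = 0

    def on_fault(self) -> str:
        """Return the recovery action for the next detected fault."""
        if self.resets_done < self.max_resets:
            self.resets_done += 1
            return "reset"
        # Repeated faults are treated as permanent rather than transient.
        return "shutdown"
```

With `max_resets=2`, three successive faults yield reset, reset and then shutdown, so a persistent fault cannot cause the processor to cycle indefinitely.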

3.1.6. Operational Limitation

The final type of action is to place operational limits on the system. This may be achieved by restricting system functionality (effectively the same as degraded operation) or by providing instructions or warnings to the user or to other systems.

For example, if anti-skid protection is lost on the braked wheels of an aircraft, an instruction can be provided to the pilot to restrict brake pressure so as to reduce the risk of tyre burst. An increased minimum landing distance may also be required. Loss of the ability to steer the nose-wheels of the aircraft from the autopilot may result in degradation of automatic landing capability. This limitation may be indicated to the autopilot system and/or the pilot.

3.2. FAULT TOLERANT DESIGN

All of the above types of fault recovery action may be used in order to provide a level of fault tolerance consistent with the safety requirements of the system. The problem for the system designer is to choose a system architecture that will provide the necessary fault tolerance and to decide where, when and how each type of action should be used.

3.2.1. FTA/FMEA Approach

Just as the FTA and FMEA can be used to identify fault detection requirements, so they can be used to identify requirements for fault recovery. It is, however, an iterative process since the requirements for fault recovery may change the system design which in turn will modify the analyses.

For the process to be applied successfully without excessive iteration, existing designs must be analysed, past experience must be used and engineering judgement must be exercised. Provided sufficient care is used in the design process, and provided that fault tolerance is considered from the beginning and throughout the design process, it is possible to achieve satisfactory results. The process relies heavily on the skill of the system designer, however, and it is not uncommon for considerations of fault tolerance to be put aside during some phases of the development. This may result in expensive redesigns or undesirable test and inspection tasks being required.

3.2.2. Byzantine Resilience

This approach has received more rigorous, analytical study. This has resulted in the production of hard rules defining redundancy and other requirements as a function of the level of fault tolerance to be achieved.

This is clearly of benefit to the system designer, though the approach does not cover all aspects of system fault management. The application of the BR approach does, however, tend to result in over-design. For example, for a system to be resilient to a single Byzantine failure, four redundant channels, or fault containment regions, are required. Depending on the criticality of the system, duplex or triplex redundancy may be shown to be sufficient using other methods. The cost of this over-design would have to be outweighed by savings in development costs gained from reduced fault analysis effort.
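The four-channel figure is an instance of the general bound from the Byzantine agreement literature: tolerating f arbitrary faults, with unauthenticated message exchange, requires at least 3f + 1 fault containment regions. The sketch below sets this beside the 2f + 1 channels needed for simple majority voting when faulty channels are assumed not to behave inconsistently towards different observers. The function names are illustrative only.

```python
def byzantine_channels(f: int) -> int:
    """Minimum fault containment regions to tolerate f Byzantine faults."""
    if f < 0:
        raise ValueError("number of faults must be non-negative")
    return 3 * f + 1


def majority_channels(f: int) -> int:
    """Minimum channels for majority voting to mask f non-Byzantine faults."""
    if f < 0:
        raise ValueError("number of faults must be non-negative")
    return 2 * f + 1
```

For a single fault the two functions give four and three channels respectively, which is the difference in redundancy referred to above as over-design.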

3.3. FAULT RECOVERY AND SYSTEM RECONFIGURATION SUMMARY

Just as there are numerous methods for detecting failures, so there are numerous methods for recovering from or coping with the effects of failure. All types of failure recovery action described are useful in certain situations, but the system designer has little more than past experience, engineering judgement and perhaps some heuristic rules to guide him/her in producing a fault tolerant design.

Combined use of FTA and FMEA is required to assist in the choice of system architecture, component designs and the determination of fault recovery actions, but it is an expensive, iterative and imprecise design technique.

The BR approach to the problem provides a much clearer set of rules but its application is limited really to the processing elements of the system. Though it is potentially useful, a technique that tackles only a part of the total system problem is unlikely to gain wide acceptance. The tendency to produce over-designed solutions is also likely to limit its application.

4. THE IMPACT OF NEW INTEGRATED AVIONICS ARCHITECTURES

Much of the previous discussion, because it has focused on existing approaches to fault management, has concentrated on existing, 'conventional' systems architectures. The future development of fault management techniques must consider the types of system architecture that will be used in the future. The following discussion relates to likely developments in avionic system architectures.

4.1. CONVENTIONAL SYSTEMS ARCHITECTURES

'Conventional' systems architectures, though there are wide differences between different systems and different aircraft, all exhibit the following characteristics:

Systems are largely self-contained, though information may be exchanged between systems.

The processing elements of a system are contained in one or more 'black boxes'. These black boxes are dedicated to that system.

Sensors, actuators, etc., are connected to the black boxes by dedicated wiring carrying analogue or discrete signals.

Communication between black boxes is achieved via data buses and dedicated discrete wiring.

There is little commonality between the components of different systems.

The result of this is that systems and system components (particularly the computers) are designed and developed individually for each new aircraft programme. Consequently, the total development cost is very high and development effort is spread over a large number of separate projects. The dedication of processing elements to particular systems and the separation of the individual systems result in excess total processing capacity and replication of similar functions by different systems. Again this increases the total cost. Cost of ownership is also high since only limited redundancy can be provided, leading to system availability problems. A wide variety of spares must be stocked and maintenance procedures will vary between the different systems.

4.2. INTEGRATED MODULAR AVIONICS (IMA)

New, integrated avionics architectures have been proposed. Their aim is to reduce the costs associated with conventional avionic architectures. The main features of these architectures are listed below:

The boundaries between systems are less distinct. Different systems may use common resources.

The 'black-boxes' are replaced by 'line replaceable modules' (LRMs) mounted in a rack or cabinet. These modules are not necessarily dedicated to a particular function. The module may be utilised by several functions (e.g. a power supply module), or may change its function.

Sensors, actuators, etc., are connected to the modules via data buses. Interfacing of the sensors and actuators to the data bus is achieved by localised electronics. Thus electronics and processing facilities are distributed rather than centralised.

Communication between modules is achieved via high-speed parallel data buses within the rack or cabinet.

All modules are of standard types. It is also intended to make extensive re-use of software.

This approach is known as integrated modular avionics (IMA). The introduction of IMA will have a significant impact on the fault management problem.

4.3. ANALYSIS OF THE EFFECTS OF IMA ON EXISTING FAULT MANAGEMENT TECHNIQUES

The most significant distinctions between conventional and IMA systems architectures are that the boundaries between different systems are less clearly defined with IMA and that the use of standard modules greatly increases the potential for common mode failures affecting many system functions. This will tend to complicate analysis using the existing methods and raises the need for common mode failure analysis at an aircraft level. The SAE certification guidelines clearly recognise this, and require the aircraft manufacturer to perform common cause fault analyses starting with the identification of 'aircraft-level functions'.

The results of the FHA and FTA for a particular function will no longer be able to define the processing architecture required, since processing elements will not necessarily be dedicated to that function. Instead the analysis will provide integrity requirements to be satisfied, for that function, by the processing resource. The requirements from each function will then have to be analysed together in order to determine the requirements at the rack or cabinet level. This analysis, particularly if dynamic reconfiguration is used to preserve critical functions, will require the application of new design techniques. It will also require additional coordination between the designers of the different systems so that all of their requirements can be integrated.

The use of dynamic reconfiguration offers substantial economic advantages and the potential to improve safety, but great care must be taken to avoid the introduction of new, and possibly devastating, common mode failures. One possibility that should be studied is for each IMA module to determine its function autonomously, thus avoiding reliance on one or more 'executive' modules to manage the reconfiguration.

The requirements for other components of the system, and the monitoring requirements for these components, can be addressed using the existing analysis techniques, although as already stated, there is a need for development and improvement of these techniques.



Thus there is a distinction between the central IMA processing resource and the other components of the system. The interfaces between the central processing resource and the other system components are greatly simplified with IMA, due to the replacement of dedicated wiring carrying an assortment of signal types, by data bus communications. This should allow the fault management problem to be split in two: one part considering the central processing resource, the other considering the other components of the system.

This division of the problem should make the application of the BR approach more attractive, since one of the criticisms of it was that it only considered the processing part of the system. In order to gain the maximum cost-benefit from an IMA architecture though, it is desirable to keep redundancy to the minimum required, to allow resources to be shared between different functions and to allow resources to be reallocated to new functions in the event of failure. All of this is contrary to the BR philosophy. In particular, it would seem impossible to maintain strict fault containment with dynamic system reconfiguration. The initialisation and synchronisation requirements would also be difficult to achieve.

The BR approach does not therefore appear useful with an IMA architecture. It is more likely that individual modules will employ a dual redundant architecture, probably with some form of dissimilarity, such that the modules will effectively be self-monitoring.

The use of data bus communication within a system offers potential improvements in fault diagnostic capability. It also simplifies significantly the monitoring tasks required within the avionics rack. Faults in other system components (e.g. jamming of a servo-valve) should be detected locally and reported back via the data bus. Faults in data bus communications can be readily diagnosed by monitoring data refreshment, parity, etc.

This simplification of requirements within the avionics rack is, however, at the expense of placing the monitoring and fault detection requirements in the distributed electronics and processing elements, local to the system components. There is, nevertheless, an overall benefit since analyses can be performed at a component, rather than a system level, and the problems associated with the detection and isolation of wiring failures are largely removed.

If IMA features resource sharing and dynamic reconfiguration of resources, careful consideration will be necessary of how to present system failures to the flight and maintenance crews, since failed LRMs will not be dedicated to any specific function. Generally, failures should be presented to the flight crew in terms of their functional or operational effect and to the maintenance crew in a way that uniquely and clearly identifies the failed unit.

5. CONCLUSIONS

Existing analysis methods for determining fault management requirements rely heavily on the exercise of engineering judgement and the use of heuristics based on previous experience. They are time-consuming and costly to perform and rely on iteration of the design to achieve a satisfactory end result. Design to provide Byzantine resilience is more systematic, but the approach has a number of limitations and is not readily applicable to future integrated avionic architectures since the fault containment requirements would be difficult to achieve with dynamic system reconfiguration.

The existing methods are more suited to fault management considerations associated with random hardware failures. They are less suited to the treatment of common mode failures. There is a need, for safety-critical systems, to avoid common mode failures or to protect the system from their effects. Specification, design or implementation errors must either be avoided, or detected and contained by the use of dissimilar redundancy.

Fault detection coverage, accuracy of fault isolation and spurious fault detection remain major problems in system design. The existing analysis methods support the identification of monitoring requirements and definition of monitoring algorithms, but do not provide definitive solutions to the problem. The problems of monitor design are exacerbated by the use of dissimilar redundancy.


The introduction of integrated modular avionics will simplify some aspects of the fault management problem and should improve overall diagnostic capability. Fault management, in terms of fault recovery and reconfiguration, within the rack or cabinet, will be complicated by the sharing of resources and reallocation of resources in the event of failures.

Improvements in fault management analysis and design techniques will be of limited value unless the problems caused by human error are also addressed. Much research has been carried out in this area but this has not yet yielded any systematic design technique that can be used by the system designer. Research is particularly recommended in the following areas:

The use of functionally dissimilar redundancy to avoid hazardous behaviour due to specification, design or implementation errors.

Automation of the process of identifying fault detection requirements and defining fault detection algorithms and the enhancement of existing simulation techniques.

Application of fault tolerant design techniques to the problem of human error.

Dynamic reconfiguration of resources in an IMA architecture.

