
The Use of Prolog in the Analysis of Safety Critical Software

Steve Lympany

Analytical Databases Ltd.,
Westlands, Two Mile Ash, Horsham, RH13 7LA England
+44 (0)1403 730071

steve@analytical-databases.co.uk

Abstract

 

High integrity software and safety critical applications need to be formally and rigorously specified, designed and implemented. Minimising the probability of failure of such applications is vital, since failures can lead to hazards that result in injury or death. This paper outlines some methods that have been used to analyse the safety critical software developed for a rail transport system, supporting the Safety Case that our client has submitted to HMRI (Her Majesty's Railway Inspectorate). In particular, it describes how Prolog has been used to develop software tools that support such analyses. The method can also be used effectively in the development of robust, 'error free', commercial and financial applications.

 

1. INTRODUCTION

Currently, several railways worldwide are considering upgrading their signalling system from a "fixed block" system to "moving block". The moving block system is rather more than a signalling system, and essentially provides automatic train control.

Fixed block rail systems (i.e. existing rail systems) divide the track into fixed segments; the signalling system controls the movement of trains, allowing only one train to be present in a segment at any time. In a moving block system, the usual trackside signals are dispensed with and replaced by computers that control the movement of trains. The "block" now moves along with the train (analogous to what car drivers practise by simply keeping a safe distance from the car in front). Trackside and on-board computers communicate with each other, so trains know the positions and speeds of the trains in front of and behind them, as well as the properties of the track itself (e.g. track gradients, needed to calculate braking distances). Fixed block systems typically limit throughput to 24 trains per hour; a moving block system could achieve a throughput of 36 trains per hour.

The system for which this work was undertaken is likely to be the first fully-fledged moving block system in the world. There are some partial systems currently in operation, such as the Docklands Light Railway and Bay Area Rapid Transit (San Francisco), and the West Coast Main Line is currently considering upgrading to moving block.

For the moving block system, our client is designing and developing both the computer hardware and the software that will control the trains and, at some stations, platform doors. The hardware is based on Motorola 680x0 chips, and the software is primarily written in Ada 83, the usual language for safety critical applications.

There are four main products (computers) that together control the trains and manage communications. Two products reside on the trains, and two are trackside, distributed along the track and at the stations, continuously communicating with each other.

Each product has one primary application running on it, together with some secondary applications such as Built-In Test software that continually checks for hardware faults, communications software, schedulers, etc. There is also the "operating system" that is common to all products. The analysis described here concerns the primary application software.

First, a simulator was developed (in Prolog) that is equivalent to the safety critical application in one of the products, but which runs under MS Windows. Next, a method for systematically injecting faults into the simulator was added. The simulator was run (for the many conditions the trains are to run under), systematically injecting single faults, and creating a database that contains sequences that result in hazards (for example, where the train doors open and the platform doors remain closed). The analysis helped to locate potential weaknesses in design, therefore indicating where designs should be modified (e.g. additional code added) to reduce the risk of failure, and make the application more robust. The results from safety analyses therefore fed back into software design. The simulator was also used to analyse how the application behaves when it runs under improbable conditions (i.e. conditions that may not have been foreseen when specifying the product requirements), resulting in an ultra-safe software design.

The simulators are specific to a particular moving block design, but the analysis method can be applied to any safety critical or other "risky" software.

 

2. THE BUSINESS ISSUE

The objective of the work was to perform a hazard analysis and to provide a safety report on the application software that will control the trains, and finally to recommend design changes to the applications that would maximise the safety and robustness of the software. The objective of this paper is to demonstrate how Prolog is used to support the analyses and the software safety analysts.

 

3. BACKGROUND

3.1 Software Design

It is clear that the system being delivered is safety critical, and as such the design of the system is both formal and rigorous. The safety analysis of the hardware follows methods that have matured in other safety critical systems, such as in the aviation industry.

The Safety Critical Software has been designed by our client using the Shlaer-Mellor method [5], which is an object-oriented method. Objects are designed, and are specified in PDL (Program Design Language) before being implemented in Ada 83. It is the use of the Shlaer-Mellor method that allows the systematic safety analyses described in this paper to be carried out, so the salient points of the method are now described.

The notation used for software development is an important constraint on which analysis method is appropriate. The three common notations are:

Data Flow Diagrams
State Transition Diagrams (used in this work)
Procedural Language

The Shlaer-Mellor Method is object-oriented, and all applications were designed using this method. Software objects can send and receive messages to and from other objects (and eventually to other products). Each instance of an object may be in one state only. When an object receives a message from some other object, the action it performs will depend on its state, and the action may generate new messages and possibly change the state of the object. The design of the object can be encapsulated in a State Transition Table. The table relates object states, messages (incoming) and actions.

For example, consider an invented database object (named DBO) that accepts the messages OPEN, CLOSE and LIST, together with the internal messages HAS OPENED and HAS CLOSED. Its State Transition Table might look like:

                       Object States
Incoming Messages    IS OPEN     IS CLOSED   IS OPENING   IS CLOSING
OPEN                 Error       Action 1    Ignore       Ignore
CLOSE                Action 2    Ignore      Ignore       Ignore
LIST                 Action 3    Action 4    Action 5     Action 6
HAS OPENED           Error       Error       Action 7     Error
HAS CLOSED           Error       Error       Error        Action 8

Table 1 - Object State Transition Table
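In Prolog, such a table can be encoded directly as a set of facts. The sketch below is illustrative only; the predicate name `transition/3` and the terms `error`, `ignore` and `action(N)` are invented for this paper, not taken from the client's code:

```prolog
% transition(CurrentState, IncomingMessage, Response)
% One fact per cell of Table 1.
transition("IS OPEN",    "OPEN",       error).
transition("IS CLOSED",  "OPEN",       action(1)).
transition("IS OPENING", "OPEN",       ignore).
transition("IS CLOSING", "OPEN",       ignore).
transition("IS OPEN",    "CLOSE",      action(2)).
transition("IS CLOSED",  "CLOSE",      ignore).
transition("IS OPENING", "CLOSE",      ignore).
transition("IS CLOSING", "CLOSE",      ignore).
transition("IS OPEN",    "LIST",       action(3)).
transition("IS CLOSED",  "LIST",       action(4)).
transition("IS OPENING", "LIST",       action(5)).
transition("IS CLOSING", "LIST",       action(6)).
transition("IS OPEN",    "HAS OPENED", error).
transition("IS CLOSED",  "HAS OPENED", error).
transition("IS OPENING", "HAS OPENED", action(7)).
transition("IS CLOSING", "HAS OPENED", error).
transition("IS OPEN",    "HAS CLOSED", error).
transition("IS CLOSED",  "HAS CLOSED", error).
transition("IS OPENING", "HAS CLOSED", error).
transition("IS CLOSING", "HAS CLOSED", action(8)).
```

Represented this way, a fault can be injected simply by asserting a modified fact, and an analysis tool can enumerate every cell of the table by backtracking.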

 

The actions for DBO are appropriately designed. For example, Action 1 (which is just a function, or in Prolog, a predicate) is called when DBO is in state IS CLOSED, and it receives the message OPEN:

odb_action(1):-
    is_in_state("IS CLOSED"),!, % double check this object is closed
    set_state("IS OPENING"),
    open_database.

odb_action(7):-
    is_in_state("IS OPENING"),!,
    set_state("IS OPEN").

open_database:-
    % file manipulations that may take some time
    send_message_to_self("HAS OPENED").

timer:- % or interrupt
    process_messages. % i.e. send the next message in the stack and perform the appropriate actions
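The message-processing loop hinted at above can be sketched as follows. This is not the client's implementation: the queue representation and all predicate names are invented, and it assumes a `transition/3` fact base encoding Table 1 and a `do_action/1` predicate corresponding to the numbered actions:

```prolog
% A FIFO message queue held as dynamic facts (sketch only).
:- dynamic(queue/1).

send_message_to_self(Msg) :-
    assertz(queue(Msg)).

process_messages :-
    retract(queue(Msg)), !,          % take the next queued message
    current_state(State),
    transition(State, Msg, Response),
    perform(Response),
    process_messages.                % drain the rest of the queue
process_messages.                    % queue empty - nothing to do

perform(action(N)) :- do_action(N).  % run the numbered action
perform(ignore).                     % silently discard the message
perform(error)     :- log_error.     % unexpected message - log it
```

Since `assertz/1` adds facts at the end of the database and `retract/1` removes the first matching fact, the queue behaves as first-in, first-out.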

The actual applications each contain from around 50 to several hundred software objects.

 

3.2 Safety Integrity Level - SIL

The definitions described here are not rigorous but are presented to give a flavour of the safety critical software industry. Some of the ideas presented can be applied to other platforms such as PCs and workstations to develop high integrity and fault tolerant applications (e.g. financial systems).

Software safety analysis takes place within the context of a wider product-level safety analysis, which in turn forms part of the safety analysis of the system as a whole. The analyses at these higher levels associate a hazard risk assessment with each product, which is reflected in its assigned Safety Integrity Level (SIL). Our client has used the IEC 61508 standard, which defines the Safety Integrity Levels. A product's SIL defines the potential severity of any hazards that the product can cause, and dictates the required probability of occurrence of any random failure.

The lowest level, SIL0, would usually be assigned to Windows applications: an application causing a general protection fault will not cause direct injury or death. Windows itself is not safety-aware and could not normally be used (at least as a single controlling PC) for running high integrity applications.

The highest level, SIL4, has an associated hazard severity of "catastrophic" and requires a probability of failure of 10^-4 to 10^-5 per year. For further reading, see References [1-4].

In the aviation industry, the Federal Aviation Administration (FAA) defines five categories of failure conditions and five software level definitions ranging from Level A (Catastrophic) to Level E (no effect). ISO 9000 guidelines do not address the production of Safety Critical Software.

 

3.3 Failures

It is interesting to note that hardware failures are random, and these are handled in a probabilistic way. Contacts or wires may be broken, for example, after prolonged vibration of the computers. Memory chips may develop faults over a period of time.

Software failures, on the other hand, are systematic. They may appear to be random, but given the same conditions when the code is run, a software error will consistently manifest itself. However, isolating (and therefore being able to reproduce) these conditions is tedious, and this is precisely what these safety analyses are intended to achieve.

Of course, software may fail because of a hardware failure (e.g. a memory fault), but it is also necessary for the software to cope with this in a safe manner.

 

3.4 Single Point Failures

It is a stated objective of our client's System Safety Implementation Strategy that a product should not contain any single random or systematic failure condition that can cause a potentially catastrophic hazard. For readers with experience in this area, this requirement can be met by ensuring that the fault trees for all potentially catastrophic product failure conditions have at least one AND-gate in every fault tree path. For software subsystems, protection mechanisms that introduce AND-gates into the corresponding software fault trees include:

Addition of a separate Protection/Monitor Object in the software architecture that cross-checks objects' output data items before they propagate to the product output signals
Addition of guarding conditions embedded into software objects that protect explicitly against specific conditions that might cause a hazard.
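The second mechanism can be illustrated with a small, entirely hypothetical sketch (all names below are invented). Both conditions in the first clause must succeed before the doors are commanded open; that conjunction is precisely the AND-gate the fault tree requires:

```prolog
% Hypothetical guarding condition. A single faulty "train stopped"
% message can no longer open the doors on its own: an independent
% platform-alignment check must also pass.
open_doors_action :-
    train_state(stopped),        % condition 1: train reports stopped
    aligned_with_platform,       % condition 2: independent position check
    !,
    send_message("OPEN TRAIN DOORS").
open_doors_action :-
    log_safety_event(open_doors_refused).   % fail safe: keep doors shut
```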

 

3.5 Safety Analysis

There are several analysis methods that may be employed: Threads Analysis, Event Tree Analysis, Fault Tree Analysis, Transaction Analysis and Failure Modes and Effects Analysis (FMEA). There are two reasoning modes for performing analyses:

Forwards Analysis, or "bottom up"/"inductive" analysis. FMEA and event tree analyses are inductive. This requires the systematic injection of faults into the software to discover what hazards are produced, and is the method used in the Prolog simulators described below.

Backwards Analysis, or "top down"/"deductive" analysis. Here we start with a hazard and work back to discover what object conditions can cause it. Fault tree analysis is deductive.

 

4. The Simulator - Threads Analysis

It is difficult to systematically inject software faults into the actual products (hardware faults are easier to test, e.g. by breaking connections). For software safety analysis, the analyst has many documents, ranging from requirements to designs and specifications, that must be manually reviewed and analysed to try to discover weaknesses and errors. It is very tedious to systematically check the cascade of messages sent between objects when an event occurs. The analyses of the message cascades need to be repeated for various combinations of software object states, and repeated again for each injected fault (the system must not result in a hazard for any single fault). This will usually result in many tens or hundreds of thousands of analyses.

A full analysis needs to be systematic, and the object-oriented design of the system software lent itself to the development of simulators that also allow faults to be injected systematically for every single condition (e.g. train arrives at station, driver presses open doors etc.). It should be noted that the simulators do not replace the safety analysts, but instead help them perform a more complete and more rigorous analysis than could practically be achieved manually.

Two simulators have been developed, the second of which is still in progress. The first simulator was used to test the method on one particular product. Only three software failure modes were catered for:

Message not sent
Message sent when not required
Message corrupted

As such, this simulator could only support a SIL2 analysis, which is what was required for this product.
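Prolog's backtracking makes this kind of systematic enumeration very natural. The sketch below (invented predicate names, not the actual simulator code) re-runs a scenario once for every combination of message event and fault mode, and records any run that ends in a hazard:

```prolog
% The three failure modes catered for by the first simulator.
fault_mode(not_sent).      % message not sent
fault_mode(spurious).      % message sent when not required
fault_mode(corrupted).     % message corrupted

inject_all(Scenario) :-
    message_event(Scenario, Event),   % backtrack over every message event
    fault_mode(Fault),                % ...and every fault mode
    run_with_fault(Scenario, Event, Fault, Outcome),
    ( hazard(Outcome) ->
        record_hazard(Scenario, Event, Fault, Outcome)
    ;   true
    ),
    fail.                             % force backtracking to the next case
inject_all(_).                        % succeed once every case is covered
```

The `fail`-driven loop guarantees that no single-fault case is skipped, which is exactly the completeness a manual analysis cannot practically achieve.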

The second simulator (currently under development) is designed so that all products can be analysed as an integrated signalling system. It is supported by a parser generator (also written in Prolog) that reads the PDL and creates much of the equivalent Prolog source code, although this needs a fair amount of manual modification. (The PDL is procedural; if-then-else, do-while and case blocks are converted into Prolog predicates. The parser generator is not intelligent enough to assign the correct arguments to the clauses.) The Prolog source is then compiled within the simulator, allowing a full equivalent of the source code to be analysed, and thereby supports a SIL4 analysis. In addition, there is sufficient information in the simulator to allow the automatic generation of fault trees.
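As an illustration of the kind of translation involved, an if-then-else block in procedural PDL maps naturally onto a pair of Prolog clauses. The fragment and all names below are invented; the arguments (here Speed and Limit) are exactly what must currently be threaded through by hand:

```prolog
% PDL (invented fragment):
%   IF speed > limit THEN apply_brakes ELSE maintain_speed
check_speed(Speed, Limit) :-
    Speed > Limit, !,    % the IF condition
    apply_brakes.        % THEN branch
check_speed(_, _) :-
    maintain_speed.      % ELSE branch
```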

When the simulators run, all object states, messages and actions are stored in a database (essentially a "threads" analysis) and printed in a form suitable for inclusion in the Safety Case.
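Recording the threads is again straightforward in Prolog: each processed message can be asserted into the database together with the state it found and the action it triggered, giving a replayable trace. A sketch with invented names:

```prolog
:- dynamic(thread_record/4).

% Called once per processed message during a simulator run.
record_step(Step, Object, Msg, Action) :-
    assertz(thread_record(Step, Object, Msg, Action)).

% Dump the recorded thread in tabular form for the Safety Case report.
report_thread :-
    thread_record(Step, Object, Msg, Action),
    write(Step), write('\t'), write(Object), write('\t'),
    write(Msg), write('\t'), write(Action), nl,
    fail.
report_thread.
```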

On screen, the simulators graphically display physical objects (i.e. the trains and platform doors) and their states (e.g. opening doors, rotating wheels), as well as software objects. State transition tables can be displayed, with object states highlighted. Product windows display the message stack of each product, and also a history of the messages processed by the products. Messages can be processed manually, one at a time, or under the control of a timer (when used as a simulator), or can be processed immediately in order to maximise the speed at which systematic fault injection is performed.

 

5. THE OUTCOME AND BUSINESS BENEFITS

One product analysis has been completed and accepted by the client. A systematic and complete analysis was performed in a cost-effective manner. It would have been an impossible task to manually analyse every software fault for every combination of object states; normally engineering judgement is used to select the most sensitive cases and to restrict the analyses to these. To perform a complete analysis manually would have taken several tens of man-years, whereas the design and development of the tool/simulator, together with the analysis of the product using it, took less than six man-months. A manual analysis is also prone to errors. The automatic analyses performed using this software tool were thorough, and gave confidence in the design of the safety critical product that was analysed.

The second simulator, which can support a safety case for the whole of the moving block signalling software, is currently under development.

6. COMPILER

The compiler used for this work was Visual Prolog 5 (VPRO5), developed by the Prolog Development Center, Denmark (www.pdc.dk). The simulator currently being developed is also written in VPRO5, but uses the object-oriented features built into the language.

 

7. REFERENCES

[1] CENELEC prEN 50128 (Final Draft January 1997)

[2] Def Stan 00-55 - Requirements for Safety Related Software in Defence Equipment

[3] IEC 61508 (International Electrotechnical Commission) - Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems.

[4] RIA Specification No.23 (Consultative Document 1991)

[5] Sally Shlaer and Stephen J. Mellor, Object Lifecycles: Modeling the World in States, Prentice-Hall, 1992. ISBN 0-13-629940-7.