Fault tolerance: Machine availability in degraded operation

In today’s safety concepts, in the event of a safety-relevant fault, the machinery is returned to a safe state as quickly as possible in most cases. This is standard practice, even though most safety functions are designed to be redundant for higher safety integrity levels or performance levels.

Is it possible to continue operating an automation system despite safety-critical errors? What needs to be taken into account here?

Anyone who has ever had a flat tire knows how unpleasant it can be, especially on vacation, on the way to an important appointment, or at night on a lonely country road. To allow drivers to continue a bit further in these situations, the tire industry developed run-flat tires. These are designed so that the vehicle can be driven at reduced speed until it reaches the next repair shop or its destination.

To what extent can this concept be transferred to automated manufacturing, especially in the area of safety technology?

Safe state

In today's safety concepts, if a safety-relevant fault occurs, the safe state is usually brought about as quickly as possible, even though most safety functions are designed redundantly for higher safety integrity levels (SIL) or performance levels (PL).
For example, when a cross-circuit is detected between two channels in the sensor circuit of an emergency stop button, any dangerous moving parts are immediately switched off.

Against this background, a working group at ZVEI, with the participation of various member companies and an institute, has examined to what extent it is permissible from a normative standpoint to continue operating an automation system with a safety-critical fault for a limited period of time.

Machine operation in degraded state

In process engineering plants, certain manufacturing steps with critical process parameters could still be completed, depending on when the fault occurred and the "degraded operation" status was indicated. The decision-maker must bring the equipment into the safe state at the latest when the maximum permitted operating time in the degraded state is reached.

In the context of a failure mode and effects analysis (FMEA), a distinction is made between two types of fault. With non-tolerable faults, safe continued operation cannot be guaranteed and the machine must be shut down immediately. Tolerable faults allow continued operation for a limited time, provided that, for example, a second independent shutdown path can still execute the safety function correctly.
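The distinction between the two fault types, together with the time limit on degraded operation, can be sketched as a small state machine. This is only an illustration under stated assumptions: the class `DegradedModeSupervisor`, its parameters, and the rule that any further fault during degraded operation forces the safe state are hypothetical and are not taken from the white paper or from any standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    SAFE_STATE = "safe_state"


@dataclass
class DegradedModeSupervisor:
    """Hypothetical supervisor; names and limits are illustrative only."""

    max_degraded_s: float                      # permitted time in degraded operation
    mode: Mode = Mode.NORMAL
    degraded_since_s: Optional[float] = None   # time stamp of entering degraded operation

    def report_fault(self, now_s: float, tolerable: bool, second_path_ok: bool) -> Mode:
        # Only a first, tolerable fault with a working second shutdown path
        # may lead to degraded operation. Everything else (non-tolerable fault,
        # missing second path, or a further fault while already degraded)
        # forces the safe state. The last rule is an assumption of this sketch.
        if self.mode is Mode.NORMAL and tolerable and second_path_ok:
            self.mode = Mode.DEGRADED
            self.degraded_since_s = now_s
        else:
            self.mode = Mode.SAFE_STATE
        return self.mode

    def tick(self, now_s: float) -> Mode:
        # Shut down at the latest when the permitted degraded time has elapsed.
        if (self.mode is Mode.DEGRADED
                and now_s - self.degraded_since_s >= self.max_degraded_s):
            self.mode = Mode.SAFE_STATE
        return self.mode


# Usage: a tolerable fault at t = 0 s with a one-hour limit (assumed value).
supervisor = DegradedModeSupervisor(max_degraded_s=3600.0)
print(supervisor.report_fault(0.0, tolerable=True, second_path_ok=True))
print(supervisor.tick(1800.0))
print(supervisor.tick(3600.0))
```

In a real application, this logic would itself have to be implemented in safety-rated hardware and software; the sketch only shows the decision structure.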

Calculation of the probability of failure

The relevant standards EN ISO 13849 and EN 62061 do not contain any requirements for an immediate or instantaneous reaction when a fault occurs. Furthermore, the models for calculating the probability of dangerous failure per hour (PFHd) offer the necessary design leeway, because for redundant architectures the probability of failure starts at a very low level and only increases after some time. Depending on the risk assessment and the quality of the fault control measures in place, the decision-maker can set the period of time until shutdown to a maximum of one week. The alternative calculation method on which EN 62061 is based defines a diagnostic test interval whose contribution to the PFHd is likewise practically negligible.

Both calculation methods assume, however, that the implementation of the safety function includes a sufficient failure reserve and that the requirements regarding failures with a shared cause (common cause failures) have been taken into account.
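Why the probability of failure starts low for redundant architectures can be illustrated with a deliberately simplified model. The sketch below is not the PFHd calculation of EN ISO 13849 or EN 62061. It assumes two independent channels with a constant dangerous failure rate, ignores diagnostics and common cause failures, and uses an illustrative value for `lambda_d`.

```python
import math

HOURS_PER_WEEK = 7 * 24


def p_channel_fails(lambda_d_per_h: float, hours: float) -> float:
    """Probability that one channel fails dangerously within `hours`
    (constant failure rate, exponential model)."""
    return 1.0 - math.exp(-lambda_d_per_h * hours)


def p_both_fail(lambda_d_per_h: float, hours: float) -> float:
    """Probability that both independent channels have failed within `hours`
    (no common cause failures, no diagnostics: simplifying assumptions)."""
    return p_channel_fails(lambda_d_per_h, hours) ** 2


lambda_d = 1e-6  # assumed dangerous failure rate per channel in 1/h (illustrative)

# With both channels intact, the probability grows roughly with the square
# of the elapsed time, so it stays very small at first.
for hours in (1, 24, HOURS_PER_WEEK, 8760):
    print(f"{hours:>5} h: both channels failed with p = {p_both_fail(lambda_d, hours):.2e}")

# Degraded operation: one channel is known to be faulty, so the safety function
# depends on the remaining channel alone for the chosen time window.
p_week = p_channel_fails(lambda_d, HOURS_PER_WEEK)
print(f"remaining channel fails within one week: p = {p_week:.2e}")
```

The shorter the permitted degraded time, the smaller the additional probability contributed by the single remaining channel. Whether that contribution is acceptable is a matter for the risk assessment, not for this sketch.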

Graphic: Calculation of the probability of failure of a machine (qualitative progression of risk)

Additional safety measures

Another possible approach is for the decision-maker to activate alternative or supplementary safety mechanisms in the event of a fault. For example, when safely-limited speed is monitored in a drive system (SLS in accordance with EN 61800-5-2), the decision-maker can specify that, in the event of a fault, operation is only permitted at reduced speed. The speed restriction would reduce the required level of risk reduction from PL d to PL c. Concrete areas of application include automated guided vehicle systems (AGVS), in which the travel path is monitored by a laser scanner whose protective field is dimensioned according to the speed.
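The link between vehicle speed and the size of the monitored field can be illustrated with elementary stopping-distance kinematics. The sketch below is not a sizing rule from any standard or scanner manufacturer. The function `protective_field_length_m` and all numeric values (speeds, response time, deceleration, margin) are assumptions chosen for illustration.

```python
def protective_field_length_m(speed_m_s: float,
                              response_time_s: float,
                              deceleration_m_s2: float,
                              margin_m: float = 0.0) -> float:
    """Distance travelled during the response time, plus the braking distance
    at constant deceleration, plus a fixed margin (simplified model)."""
    if speed_m_s < 0 or response_time_s < 0 or deceleration_m_s2 <= 0:
        raise ValueError("speed and response time must be >= 0, deceleration > 0")
    reaction_distance = speed_m_s * response_time_s
    braking_distance = speed_m_s ** 2 / (2.0 * deceleration_m_s2)
    return reaction_distance + braking_distance + margin_m


# Assumed example values for an automated guided vehicle.
normal = protective_field_length_m(1.6, response_time_s=0.3,
                                   deceleration_m_s2=1.0, margin_m=0.2)
degraded = protective_field_length_m(0.8, response_time_s=0.3,
                                     deceleration_m_s2=1.0, margin_m=0.2)
print(f"normal operation:   {normal:.2f} m")
print(f"degraded operation: {degraded:.2f} m")
```

Because the braking distance grows with the square of the speed, halving the speed in degraded operation shortens the required field by considerably more than half in this model.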

Outlook

The authors of the white paper published by ZVEI conclude that the measures described are consistent with the safety objectives of the Machinery Directive and do not contradict the harmonized standards EN ISO 13849 and EN 62061.

The decisive factor for acceptance will be whether the benefits of "degraded operation" can be quantified. As systems become increasingly interconnected, the diagnostic capability of individual components becomes particularly important for system availability.

Active fault reporting in the process industry

What remains a vision for the future in machine building is already state of the art in many areas of the process industry. For example, the safe coupling modules of the PSRmini family are equipped with active fault reporting that allows the higher-level safety controller of the safety instrumented system (SIS) to carry out a safety-related evaluation. No digital inputs are required for N/C contact readback. The active fault reporting of the coupling relay detunes the impedance of the safe digital output. As a result, the decision to continue operation or to initiate alternative fault reactions remains with the CPU of the safety system (SIS).

Safe coupling relays for the process industry