INM384 - Fault tolerance, redundancy and diversity: Design and analysis techniques for resilience
This module can be taken as part of a Postgraduate course or as a 5-day Continuous Professional Development (CPD) course.
Rationale
One of the primary techniques for dependability and resilience is fault tolerance, that is, the design of systems (technical or socio-technical) so that they can survive failures of their components. The application of fault tolerance has become a necessity for a growing number of companies, far beyond its traditional application areas, like aerospace and telecommunications. However, the culture of fault tolerance has achieved very limited penetration in the education and training of IT professionals and managers. The result is unnecessary vulnerabilities and/or costs in deployed IT systems and the organisations using them. In addition, professional learning of the techniques is often limited to a "house style" of design, creating unnecessary constraints on the cost-effective application of the principles.
Fault tolerance requires a combination of protective redundancy - to ensure extra resources to detect and correct the effects of failures - and diversity - making sure that the redundant components do not fail together.
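The interplay between redundancy and diversity can be illustrated with a small numerical sketch. It is not part of the module material: the function name, the parameter `beta` and the simple common-cause model (a fraction `beta` of each component's failures is assumed to affect both components at once, and the remaining failures are assumed independent) are assumptions made purely for illustration.

```python
def pair_failure_probability(p: float, beta: float = 0.0) -> float:
    """Probability that both components of a redundant pair fail.

    p    -- failure probability of each single component (assumed equal)
    beta -- ASSUMED fraction of a component's failures that are common-cause,
            i.e. that take down both components together (0 = fully diverse,
            1 = the two components always fail together)
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("p and beta must both lie in [0, 1]")
    common = beta * p               # failures shared by both components
    independent = (1.0 - beta) * p  # failures specific to one component
    # The pair fails if the common cause strikes, or, failing that,
    # if both components fail independently of each other.
    return common + (1.0 - common) * independent ** 2


if __name__ == "__main__":
    p = 0.01
    for beta in (0.0, 0.1, 1.0):
        print(f"beta={beta}: pair fails with probability "
              f"{pair_failure_probability(p, beta):.6f}")
```

Under these assumptions, a pair of fully independent components with p = 0.01 fails with probability 0.0001, while a 10% share of common-cause failures makes the pair roughly ten times more likely to fail, and with beta = 1 the redundancy brings no benefit at all. This is the sense in which redundancy without diversity can give far less protection than a naive calculation suggests.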
Educational Aims
This module introduces the basic concepts and the range of techniques of fault tolerance, redundancy and diversity. It emphasises the unifying principles that apply to all applications, both in terms of scale (computer component, computer, organisation) and of requirements to satisfy (reliability, safety, security, ...) to enable students to apply techniques as appropriate to the circumstances of any system and organisation. It introduces students to the need for quantifying dependability and the gains that fault tolerance allows, but without in-depth study of the mathematical techniques.
Learning Outcomes
Upon successful completion of this programme, a student will be expected to be able to:
Knowledge and understanding
- Analyse fault tolerance techniques in terms of functions they perform, their potential roles in a system and their limitations
- Recognise the role of a fault tolerance technique in determining trade-offs between dependability requirements
- Apply basic probabilistic concepts, like probability of failure, fault latency and coverage coefficients, to frame the evaluation of fault tolerance techniques
Values and Attitudes
- Identify possibilities for operation-time protective and mitigating measures in relation to computing risks
- Take responsibility for controlling risk and calling for investment or technical help
Cognitive/Intellectual Skills
- Critically evaluate research and literature relating to fault tolerance and resilience
- Use systematic techniques (non-probabilistic) to identify needs for fault tolerance in a system
- Evaluate alternative redundancy and diversity solutions for systems and organisations, and identify any further technical input needed towards a decision
Subject Specific Skills
- Identify candidate fault tolerance techniques for a specific problem with the factors determining their costs, benefits and limitations
- Recognise the basic differences between the benefits and costs of different fault tolerant designs; match designs to requirements
- Identify trade-offs in the application of redundancy and diversity with respect to requirements of reliability, availability, safety, security
Transferable Skills
- Create professional reports of performed research
- Detect and explain the standard fallacies with respect to redundancy and diversity
- Take into account psychological and social factors in the operation of systems and organisations
- Research and use scientific literature for research purposes
Indicative Content
- Basics of redundant design and reliability modelling
- Organisation of fault tolerance; phases of response to fault manifestation
- Modular redundancy
- Fault tolerance in distributed systems
- Fault tolerance for software and design faults; layers of defence and diversity-seeking decisions
- Fault tolerance in IT and in IT-based systems/organisations
- Humans in fault-tolerant systems
- Examples of fault tolerant products
- Decisions and trade-offs in design, procurement and deployment of fault tolerant systems
- Fault assumptions, fault tolerance and resilience