ECE Course Outline

ECE 7142

Fault Tolerant Computing (3-0-3)

Prerequisites
ECE 6100
Corequisites
None
Catalog Description
Key concepts in fault-tolerant computing. Understanding and use of modern fault-tolerant hardware and software design practices. Case studies.
Textbook(s)
Shooman, Martin, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, Wiley-Interscience, 2002. ISBN 9780471293422 (required) (used Spring 2003)

Topical Outline
Goals and Applications of Fault Tolerant Computing  
     Reliability, Availability, Safety, Dependability, etc.  
     Long Life, Critical Computation  
     High Availability Applications  
     Fault Tolerance as a Design Objective  

Fault Models  
     Faults, Errors, and Failures  
     Causes and Characteristics of Faults  
     Logical and Physical Faults  
     Error Models  

Fault Tolerant Design Techniques Based on Hardware Redundancy  
     Hardware Redundancy  
     TMR, N-modular Redundancy  
     Voting Methods  
     Duplication, Standby Sparing 
     Watchdog Timers  
     Hybrid Hardware Redundancy  
     N-modular Redundancy with Spares  
     Sift-out Modular Redundancy  
     Triple-duplex Architecture  
     Fault Tolerant Interconnection Networks  

Fault Tolerant Design Techniques Based on Information Redundancy  
     Parity, M-of-N, Duplication Codes  
     Checksums, Cyclic Codes, Arithmetic Codes  
     Berger Codes, Hamming Error Correcting Codes  
     Code Selection Issues  
     Time Redundancy, Recomputing with Shifted Operands (RESO)  
     Software Redundancy, Checks and N-version Programming 

Reliability Evaluation Techniques  
     Failure Rate, Mean Time to Repair, Mean Time Between Failures  
     Reliability Modeling, Fault Coverage  
     M-of-N Systems  
     Markov Models  
     Safety, Maintainability, Availability  

Fault Tolerance in VLSI Circuits  
     Failure Models in VLSI  
     Redundancy Techniques in VLSI  
     Self-checking Logic  
     Reconfigurable Array Structures  
     Effect on Yield  

Case Studies  
     FTSC, FTBBC  
     Space Shuttle  
     Tandem 16 NonStop System  
     Stratus/32 System  
     ESS  

Students will write a term paper on research, a literature review, or a 
design in the fault-tolerant computing area. Topics will be chosen in 
consultation with the instructor.