CS580
Spring 1998
115 MILSC, 11-11:50 MWF
Instructor:
Yashwant K. Malaiya, Professor
Office: 238 US
Office Hours: 11-11:50AM.
Text:
Required textbook: Software Reliability Assurance Handbook,
by Lakey and Neufelder. Material from other books and publications
will be used.
Some of the conferences, journals & books in this field.
Evaluation:
Distribution of points:
Option 1:
5% Participation
20% Test 1 (TBD)
20% Test 2 (TBD)
25% Final (TBD)
20% Project
10% Feedback Modules
After you have selected your project, a one-page proposal will be
due on March 25. It should include the
motivation, a brief scope of study, and some specific references.
A progress report will be due on April 15. The final report will be due on
May 4.
Option 2 (project emphasis):
5% Participation
10% Test 1 (24-hour take-home)
10% Test 2 (24-hour take-home)
15% Final (24-hour take-home)
50% Project
10% Feedback Modules
This option is for those who have a good background in another
area and want to explore fault tolerance in that area.
A half-page pre-proposal will be due on Feb. 16. Other dates are the same as for
Option 1.
A lecture (45 min.) with a two-page handout may be required.
Grading:
The grades are defined in this way:
A Excellent
B Good
C Weak
D Bad
F Worse
I For exceptional cases only
Please see the Student Information Sheet for departmental policies.
FAULT-TOLERANT COMPUTING
OUTLINE
The purpose of the course is to study techniques
for achieving high reliability in computational
systems with software, hardware and networking
components. Approaches for testing, fault handling
and assessing reliability will be examined.
- Terminology
- Fault-tolerant vs. `fault-intolerant' approaches
- Software, hardware and networking systems
- Redundancy: Spatial, temporal and procedural
- Stateless systems and finite state machines
- Defect, error and failure
- Testing
- Structural Testing:
Fault modeling
Detection and diagnosis
Error propagation
States, initialization
Coverage measures
Software vs. hardware testing
- Black-box and Probabilistic testing:
Random methods and their effectiveness
Detectability profiles
Information compression techniques
- Design for testability:
Observability and controllability
Testability enhancements
Unreachable code/redundancy
- Reliability: Schemes and evaluation
- Reliability analysis:
Reliability measures
Serial and parallel systems
Failure rate
Permanent, transient and intermittent failures
Measuring reliability
Markov modeling
Correlation and non-Markov behavior
- Redundancy:
Duplex
TMR and other schemes
N-version programming
Design diversity and correlation
Static/dynamic redundancy
Recovery by checkpointing and rollback
Study of some existing systems
- Design Faults:
Reliability growth
Static and dynamic approaches
Fault density and fault exposure ratio
Prediction capability of models
Test coverage and reliability
Design diversity and correlation
- Redundancy: lower level
Hamming distance
Error detection and correction
Linear separable codes
Cyclic redundancy codes
Effectiveness