When working on a large project, implementing a system that has to run 24/7 and handle significant peak loads of communication, you at some point have to ask yourself how robust your solution really is. You have to ascertain that it meets the goals you have set for it and will do so consistently. There are diverse ways of doing this, and some are more efficient than others. In this article, I will discuss some of the methods I have found useful in the past.
Robustness analysis is a form of behavior analysis that focuses on finding possible faults in the design and implementation of the system (or the part of the system being investigated). It can be approached from several angles, including object-behavioral analysis, static analysis, sequence-of-events analysis, etc. Each of these approaches is useful and is likely to be complementary to the others, so for a successful robustness analysis, I would recommend using a mixture of them, starting with an analysis of the sequence of events leading up to normal operation.
Before starting out analyzing your software, you should have an idea of what you’re looking for – what kind of weaknesses you expect to find, and how to recognize them.
One of the most pernicious types of weaknesses found in software is the single point of failure. Single points of failure can take the form of “fatal errors”, “kernel panics”, or other such problems that will either prevent the system from starting, or stop it from working. A sequence-of-events analysis of a boot sequence will almost inevitably turn up a few of these weaknesses.
Single points of failure are pernicious precisely because they are single points of failure: if a failure occurs at one, the system stops functioning correctly. In the best case, it will fail obviously (“loudly”, if you will) but in some cases, it will continue limping along and function badly. This brings us to error handling…
Improper error handling is, sadly, a very common error. In true C++ programs, this often translates to improper resource management (which we will talk about in a moment) but as most software written in C++ is really a mixture of C++, C with classes and plain C, error handling is still an important problem in software written in C++.
Looking for problems in error reporting usually comes down to three things:
- are all errors reported?
- can error reports easily be ignored?
- do error reports contain all the required information?
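As a minimal sketch of what a "yes, no, yes" answer to these three questions could look like in C++17, consider the following. Everything in it is hypothetical and chosen for illustration only: the Status type, the MAKE_ERROR macro, and the open_channel and describe functions are not taken from any particular library or code base.

```cpp
#include <string>

// Hypothetical error report type. [[nodiscard]] asks the compiler to warn
// whenever a caller silently drops a returned Status, which makes the report
// harder to ignore by accident.
struct [[nodiscard]] Status {
    int code = 0;            // 0 means success (an assumption of this sketch)
    std::string message;     // human-readable description of what went wrong
    const char* file = "";   // source file where the error was detected
    int line = 0;            // source line where the error was detected

    bool ok() const { return code == 0; }
};

// Captures the detection site along with the error code and message.
#define MAKE_ERROR(code, msg) Status{(code), (msg), __FILE__, __LINE__}

// Hypothetical operation that reports every failure through its return value.
Status open_channel(int port) {
    if (port <= 0 || port > 65535) {
        return MAKE_ERROR(1, "port out of range: " + std::to_string(port));
    }
    return Status{};
}

// Formats a report so that it carries the code, the message and the location.
std::string describe(const Status& s) {
    if (s.ok()) {
        return "ok";
    }
    return std::string(s.file) + ":" + std::to_string(s.line) +
           ": error " + std::to_string(s.code) + ": " + s.message;
}
```

A caller that writes open_channel(80); as a bare statement should get a compiler diagnostic about the discarded result, whereas a caller that stores the Status has everything needed to log or act on the failure. Whether that diagnostic is a warning or a build error depends on the compiler settings in use.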
In the case of C++ code, compiler settings can be very important in this context: turning exceptions off, for example, will make error reporting (and error handling) much more difficult, and code that looks like it does the right thing may actually produce unpredictable behavior (for example: what does throw do when exceptions are turned off?).
You should be careful, however, that looking for these kinds of weaknesses doesn’t turn your robustness analysis into a code review: the point of the exercise is not to find every single instance of a weakness, but rather to assess the general robustness of the system. The scope of a comprehensive code review of a significant system is much larger than that of a robustness analysis of that same system, as people who have performed both will be able to attest.
Resource management is a common source of problems in embedded software: are there clear rules for the developers to follow with respect to resource management? Are they documented? Are they followed? Are they part of what code reviewers look for? I can suggest a few, of course (use RAII; avoid dynamic allocation in small systems, use it wisely in less-small systems; avoid arbitrary limits; don’t expect users (or client code) to respect your limits; etc.). Several tools, both dynamic and static, exist to look for problems in this domain.
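To make the "use RAII" rule concrete, here is a minimal sketch. The acquire_handle and release_handle functions and the g_open_handles counter are hypothetical stand-ins for a real resource API (a driver handle, a socket, a lock); the counter exists only so the sketch can show that nothing leaks.

```cpp
#include <utility>

// Hypothetical resource API; a real system would call into a driver or OS here.
static int g_open_handles = 0;
int acquire_handle() { ++g_open_handles; return 42; }
void release_handle(int /*id*/) { --g_open_handles; }

// RAII owner: the resource is acquired in the constructor and released in the
// destructor, so no code path can forget the release.
class Handle {
public:
    Handle() : id_(acquire_handle()), owns_(true) {}
    ~Handle() {
        if (owns_) {
            release_handle(id_);
        }
    }

    // Copying would lead to a double release, so it is disabled.
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;

    // Moving transfers ownership; the moved-from object releases nothing.
    Handle(Handle&& other) noexcept
        : id_(other.id_), owns_(std::exchange(other.owns_, false)) {}
    Handle& operator=(Handle&&) = delete;

    int id() const { return id_; }

private:
    int id_;
    bool owns_;
};

// Every return path releases the handle, including the early one.
bool process(bool fail_early) {
    Handle h;
    if (fail_early) {
        return false;  // h's destructor still runs here
    }
    return h.id() == 42;
}
```

The point of the design is that the release rule is enforced by the type rather than by convention: a reviewer only has to check that raw acquire_handle calls do not appear outside the wrapper.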
Concurrency, security, unexpected language artifacts (such as the fact that the behavior of many constructs in C is deliberately left undefined), etc. are all important sources of weaknesses. Some of these should be attacked at the design phase (concurrency and security are excellent examples of that) while others should be a continuous concern throughout development (the language, resource management and error handling are all examples of that, though they should all also be taken into account during design).
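One well-known example of such a language artifact is signed integer overflow, which is undefined behavior in both C and C++. The following is a minimal sketch of one way to guard against it in C++17; the checked_add name is hypothetical, and the approach shown (testing the operands before adding) is only one of several possible techniques.

```cpp
#include <limits>
#include <optional>

// Adds two ints, returning std::nullopt instead of overflowing.
// The range checks run before the addition, so the undefined operation
// is never executed.
std::optional<int> checked_add(int a, int b) {
    if (b > 0 && a > std::numeric_limits<int>::max() - b) {
        return std::nullopt;  // a + b would exceed the largest int
    }
    if (b < 0 && a < std::numeric_limits<int>::min() - b) {
        return std::nullopt;  // a + b would fall below the smallest int
    }
    return a + b;
}
```

Returning an empty optional forces the caller to decide what an out-of-range result means for the system, instead of leaving that decision to whatever the compiler happens to generate.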
Putting your finger on it
So how do you find those weaknesses? How does one go about verifying the robustness of a significant system?
That’s the topic of the next article 🙂