Robustness analysis: finding fault(s)

When working on a large project, implementing a system that has to run 24/7 and handle significant peak loads of communication, at some point, you have to ask yourself how robust your solution really is. You have to ascertain that it meets the goals you have set out and will consistently do so. There are diverse ways of doing this. Some are more efficient than others. In this article, I will discuss some of the methods I have found useful in the past.

Robustness analysis is a form of behavior analysis that focuses on finding possible faults in the design and implementation of the system (or the part of the system being investigated). It can be approached from several angles, including object-behavioral analysis, static analysis, sequence-of-events analysis, etc. Each of these approaches is useful and is likely to be complementary to other approaches, so for a successful robustness analysis, I would recommend using a mixture of these approaches, starting with an analysis of the sequence of events leading up to normal operation,

Weakness types

Before starting out analyzing your software, you should have an idea of what you’re looking for – what kind of weaknesses you expect to find, and how to recognize them.

One of the most pernicious types of weaknesses found in software is the single point of failure. Single points of failure can take the form of “fatal errors”, “kernel panics”, or other such problems that will either prevent the system from starting, or stop it from working. A sequence-of-events analysis of a boot sequence will almost inevitably turn up a few of these weaknesses.

Single points of failure are pernicious precisely because they are single points of failure: if a failure occurs at one, the system stops functioning correctly. In the best case, it will fail obviously (“loudly”, if you will) but in some cases, it will continue limping along and function badly. This brings us to error handling…

Improper error handling is, sadly, a very common error. In true C++ programs, this often translates to improper resource management (which we will talk about in a moment) but as most software written in C++ is really a mixture of C++, C with classes and plain C, error handling is still an important problem in software written in C++.

To see the sidebar click here.To hide the sidebar click here.

Looking for problems in error reporting usually comes down to three things:

  1. are all errors reported?
  2. can error reports easily be ignored?
  3. do error reports contain all the required information?

In the case of C++ code, compiler settings can be very important in this context: turning exceptions off, for example, will make error reporting (and error handling) much more difficult and code that looks like it does the right thing will actually produce unpredictable behavior (for example: what does throw do when exceptions are turned off?)

You should be careful, however, that looking for these kinds of weaknesses doesn’t turn your robustness analysis into a code review: the point of the exercise is not to find every single instance of a weakness, but rather to assess the general robustness of the system. The scope of a comprehensive code review of a significant system is much larger than that of a robustness analysis of that same system — as people who have performed both will be able to attest to.

Resource management is a common source of problems in embedded software: are there clear rules for the developers to follow w.r.t. resource management? Are they documented? Are they followed? Are they part of what code reviewers look for? I can suggest a few, of course (use RAII; avoid dynamic allocation in small systems, use it wisely in less-small systems; avoid arbitrary limits; don’t expect users (or client code) to respect your limits; etc.) Several tools, both dynamic and static, exist to look for problems in this domain.

Concurrency, security, unexpected language artifacts (such as the fact that large parts of C are deliberately undefined), etc. are all important sources for weaknesses. Some of these should be attacked at the design phase (concurrency and security are excellent examples of that) while others should be a continuous concern throughout development (the language, resource management, error handling, etc. are all examples of that — though they should all also be taken into account during design).

Putting your finger on it

So how do you find those weaknesses? How does one go about verifying the robustness of a significant system?

That’s the topic of the next article 🙂

About rlc

Software Analyst in embedded systems and C++, C and VHDL developer, I specialize in security, communications protocols and time synchronization, and am interested in concurrency, generic meta-programming and functional programming and their practical applications. I take a pragmatic approach to project management, focusing on the management of risk and scope. I have over two decades of experience as a software professional and a background in science.
This entry was posted in Software Engineering and tagged . Bookmark the permalink.

One Response to Robustness analysis: finding fault(s)

  1. Pingback: Robustness analysis: the fundamentals | Making Life Easier

Comments are closed.