How good is the black box in your program?

I am a big fan of the television series “Air Crash Investigators”. I don’t consider myself morbid, although the concept of making a TV series from actual plane crashes is slightly disturbing. It’s the methodical investigation that appeals to me. The thoroughness with which the investigations are conducted ensures that the airline industry is always striving to improve its safety record. That gives me comfort every time I set foot on a plane. However, I don’t recommend you watch one of the shows immediately before boarding an aircraft!

As you are no doubt aware, key to most aircraft crash investigations is the black box. It contains vital details of the aircraft and, along with the cockpit voice recordings, helps investigators unravel what happened during the final moments of the doomed flight.

Fortunately, the average software developer does not have to write software with the added pressure of it costing human life if it all goes pear-shaped. Of course, there are exceptions to this. Because the cost of failure is lower, most computer applications aren’t stringently controlled to protect against program failures. When a program fails, resources need to be devoted to resolving the problem for the customer. This costs your company money. “Resolving the problem” for the customer is different from “fixing the bug that caused the error”. Resolving the problem may involve manipulating registry entries, files on disk or other methods. All’s fair in love and war… The more customers you need to resolve this problem for, the more compelling it becomes to find the underlying bug.

This is where decent error recording code comes into play. General exception handlers tend to include call stacks these days. These are invaluable for locating the moment where something went wrong, but unfortunately, they don’t provide much context as to how the program came to this point. What do your support engineers ask for when investigating problems? A registry dump? A file listing? Configuration files? Why not package these up in an archive that can be sent through to your technical support people? Is your software capable of telling you about its final moments before the crash? Can it report the features used that led to its demise? If it can’t, retrofitting this sort of code to your application can be considered a “long term investment”. It will cost you development time now, and it’s an unsellable feature for the customer, but you will reap the rewards for your effort down the track. Trust me, it will be worth it!
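To make the idea a little more concrete, here is a minimal sketch in Python of what such a black box might look like. It is only an illustration under my own assumptions: the `FlightRecorder` class, its method names and the file names inside the archive are all hypothetical, and a real application would need to decide for itself which files and settings are worth collecting. The recorder keeps the last few features the user touched in memory, and when something goes wrong it packages them with the call stack and any extra files into a single zip archive.

```python
import collections
import io
import json
import time
import traceback
import zipfile


class FlightRecorder:
    """Hypothetical in-memory 'black box' holding the most recent events."""

    def __init__(self, capacity=200):
        # A deque with maxlen discards the oldest entry once it is full,
        # so only the final moments before a crash are retained.
        self._events = collections.deque(maxlen=capacity)

    def record(self, feature, **details):
        """Note that a feature was used, with any context worth keeping."""
        self._events.append(
            {"time": time.time(), "feature": feature, "details": details}
        )

    def events(self):
        """Return the retained events, oldest first."""
        return list(self._events)

    def write_crash_report(self, target, exc, extra_files=None):
        """Write a zip archive to target (a path or a binary file object).

        extra_files maps archive names to text, for example the contents
        of a configuration file that support staff usually ask for.
        """
        stack = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
            archive.writestr("traceback.txt", stack)
            archive.writestr(
                "events.json", json.dumps(self.events(), indent=2, default=str)
            )
            for name, text in (extra_files or {}).items():
                archive.writestr(name, text)


if __name__ == "__main__":
    recorder = FlightRecorder(capacity=3)
    recorder.record("open_document", path="report.doc")
    recorder.record("apply_style", style="Heading 1")
    recorder.record("export_pdf", pages=12)
    try:
        1 / 0  # stand-in for the real failure
    except ZeroDivisionError as exc:
        report = io.BytesIO()  # a real program would write to a file on disk
        recorder.write_crash_report(
            report, exc, {"config.ini": "[ui]\ntheme=dark\n"}
        )
```

The call stack tells you where the program failed, while the recorded events tell you how the user got there. Keeping the buffer small and bounded means the recorder costs very little while everything is working, which is most of the time.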