Reporting System Errors

When I first started designing the suite of applications at work, I knew that I wanted them to log errors and process information in a sensible manner. Having previously worked in a less-than-ideal environment, where every error was emailed to an address that nobody checked because it received far too many messages, I wanted to make sure we had a sensible balance between awareness of what was going on beneath the surface and the ability to be alerted when action was needed.

The errors we generate are currently only ever exposed within the team. The systems I look after are all behind-the-scenes systems and rely heavily on NLog to provide process-level, debug-level, error-level and fatal-level logging. So-called “Fatal” errors are emailed to a specific address I set up for the purpose. These are auto-forwarded to one of the team, and I also check them on a regular basis. Generally, these emails give little information beyond the fact that there has been an issue, plus a direct, clickable link to the error log file where the full details can be found. In some of the applications I designed the email to include more details, like the one in the example below. In time I’d like to revisit all of the systems to get them to do this.
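To make the routing concrete, here is a minimal sketch of the same idea: everything goes to the full log, and only fatal errors trigger an alert that links back to it. The real systems use NLog; this uses Python's standard `logging` module purely as a stand-in, and the names `FatalAlertHandler` and `build_logger`, the `send` callback and the log URL are all hypothetical. In practice `send` would be whatever actually sends the email.

```python
import io
import logging


class FatalAlertHandler(logging.Handler):
    """Forward only CRITICAL ("fatal") records to a send function."""

    def __init__(self, send, log_url):
        # Records below CRITICAL never reach emit().
        super().__init__(level=logging.CRITICAL)
        self.send = send          # hypothetical callback, e.g. an email sender
        self.log_url = log_url    # hypothetical link to the full error log

    def emit(self, record):
        # Keep the alert short: what happened, plus where to find the details.
        body = f"{record.getMessage()}\n\nFull details: {self.log_url}"
        self.send(f"FATAL in {record.name}", body)


def build_logger(name, log_stream, send, log_url):
    """Log everything to log_stream and alert on fatal errors only."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()

    full_log = logging.StreamHandler(log_stream)
    full_log.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(full_log)
    logger.addHandler(FatalAlertHandler(send, log_url))
    return logger


# Usage: an in-memory stream and a list stand in for the log file and the inbox.
sent = []
log_stream = io.StringIO()
logger = build_logger(
    "orders",
    log_stream,
    lambda subject, body: sent.append((subject, body)),
    "https://logs.example.invalid/orders.log",
)
logger.error("Retrying download")        # logged, no alert
logger.critical("Order import stopped")  # logged and alerted
```

The point of the split is that the inbox only ever sees the handful of messages that need action, while the log keeps the full picture for whoever follows the link.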

Over the past few months I’ve revisited many of the error messages and tried to add clues and common resolutions to the errors that we’ve seen happen frequently, usually when unexpected behaviour follows a network dropout. So, if a piece of data is missing, meaning that one of the processes can’t complete, the message specifies exactly what is missing, where it should be put and what the next action is. It’s a kind of support FAQ combined with an error message. As any developer will tell you, at the time of writing a system, your head is full of what it does and how it does it. As the months go by, the specifics are forgotten, so I, along with anyone else, need clues on how to fix it.

An example message is:

Order OrderNumber doesn’t seem to contain grid references as expected. It has been moved to Folder. Look at the file at FilePath and see what it contains. If it is empty, then check the admin system and redownload the coordinates file. If the file has content, then put it back into the stoker queue. This sometimes happens if we drop network connectivity.
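One way to keep messages like this consistent is to build them from the same few parts every time: what went wrong, where the item ended up, what to do next and the likely cause. The sketch below is only an illustration of that structure, written in Python rather than the .NET code the systems actually use. `ErrorReport` and its fields are hypothetical names, and the order number, folder and file path are made-up values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ErrorReport:
    """An error message that carries its own support notes."""

    problem: str                       # what is missing or wrong
    moved_to: str                      # where the affected item was put
    next_steps: List[str]              # what the reader should do, in order
    likely_cause: Optional[str] = None

    def render(self) -> str:
        lines = [
            self.problem,
            f"It has been moved to {self.moved_to}.",
            "",
            "What to do next:",
        ]
        lines += [f"  {i}. {step}" for i, step in enumerate(self.next_steps, start=1)]
        if self.likely_cause:
            lines += ["", f"Likely cause: {self.likely_cause}"]
        return "\n".join(lines)


# Usage with made-up values for the order number, folder and file path.
report = ErrorReport(
    problem="Order 12345 doesn't seem to contain grid references as expected.",
    moved_to="the 'failed' folder",
    next_steps=[
        "Look at the file at /data/failed/12345.txt and see what it contains.",
        "If it is empty, check the admin system and redownload the coordinates file.",
        "If the file has content, put it back into the queue.",
    ],
    likely_cause="a drop in network connectivity",
)
message = report.render()
```

Making the next steps a required field is the useful part: an error cannot be raised through this path without someone having thought about how to resolve it.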

In this case, this is the content of the email, along with a link to the detailed error log. The error log contains the dreaded stack trace — a set of barely humanified lines indicating where the application errored. Reading these is an art and, in my opinion at least, these should never, ever, be exposed to anyone outside of the development team.

Even at companies which have given me full specifications to work from, I’ve rarely, if ever, been given error messaging content. The error message is such an important piece of content, and can make or break your relationship with your user, and yet it’s left to a developer to write what they think is appropriate. What I’ve tried to do here is evolve the error messages over time to get them to a helpful level. This is the luxury of being involved with the systems on an ongoing basis (which is the subject of another blog post some time). If an error is raised that sends me straight to the code to find out what is going on, then I know that my error handling isn’t at the right level. Thus far, I’ve always found time to address the problem then and there and make the systems easier to support and more robust.

What this comes down to, I guess, is being aware of your users at all times — yes, developers are users too — and pitching the error messages at an appropriate level.