This post is a brain dump of the practices that have proved useful to me over time while maintaining production Java systems. Please share your experiences, especially if you have come across any great time savers not already mentioned here.
The scale of the system that you are working on does affect how valuable each of the individual practices below would be to you, some of the practices that I mention were introduced due to experiences working with server farms in excess of 150 JVMs running in a very ‘chatty’ SOA architecture and so those practices may not be as valuable in, say a single process Swing application. However I have come to the conclusion that once these practices are wrapped into a reusable set of classes and bedded into the mind set of the developers then they cost next to nothing to use; and so I recommend putting in place at least a placeholder for each practice at the start of a project as retrofitting large code basis is often more expensive than putting in a good placeholder early.
- Easy log file access
- Self contained messages
- Logs telling the Story of what happened
- Zero tolerance to errors
- Logging as part of QA
- Logging levels
- Minimising Logging Overhead
- Machine parsable logs
- Runtime adjustable levels
- Archive the logs
The more effort and time involved in gaining access to a systems log file(s), then the natural consequence is that the log file will be used less frequently. I have worked in environments where developers had to request a copy of the production log files in order to begin investigation into a problem. Such requests would take between 1 hour and 3 days to be actioned. This barrier to entry blocks all pre-emptive problem solving from happening, as well as gaining timely feedback from experiments used to reproduce and track down a problem.
My ideal is to have real time search access to all production logs, tools suck as Splunk make this very straight forward. Especially as it is capable of combining logs from multiple servers. If a licensed solution is not viable then roll up your sleeves and write a few scripts to achieve the same results.
Each individual message needs to be complete in its own right. That is, a call to the LOG class should be one per key event being reported on. All of the
information for that event must be contained in that call, such as stack traces and values of key data that describe the uniqueness of the event that has just occurred. Do not place stack traces in a separate call to the log than the message describing its context.
If the information or data is spread across multiple calls to the LOG class then it becomes likely
in a multithreaded environment that the messages will get interlaced making them more difficult to decipher. Naturally this interlacing does not occur often during development cycle as a developer working by him/her self tends to only send one request into the system at a time.
Context is King
A single line in the log file describes a single event, a series of events tells a story, and in order to understand the meaning of the story it needs to flow; the relationships between each event needs to be clear and concise.
To achieve this with Log4J I use the MDC class to push context information that is important to the story being described, such as: the name of the user who made the request, a unique request id that was generated when the request started, the thread id at time of the message being generated, and the security principle that the request was running as at that time. In environments consisting of multiple co-operating servers you may also find the following values useful in the MDC: the name of the SOA service that generated the logged event, the machine name where the logged event was generated. Which pieces of data you place on the MDC will depend on your context, remember to make sure that each item pays for itself. It is not a free journey, and we don’t want to give away any free tickets.
Context across machine boundaries
In multi machine environments being able to tie the story together becomes harder, and yet all the more important. Consider passing some of the values that you you push onto he MDC between machine boundaries as part of the remote calls, I make it a matter of point to always pass the user name and request id across machine boundaries. This means that if an error is reported to the user, they can then be given an incident number to report to the support staff. This incident number is the request id mentioned previously and will allow a developer to reconstruct the full story of what the user was doing as the error occurred from the log files.
On projects that I have worked on that have had no compile time warnings, it has always been easy to spot when one has introduced a warning and not wanting to be the one to be caught ‘peeing in the proverbial pool’ I tend to fix new warnings fairly promptly. If on the other hand there are lots of warnings already in place then my brain shuts off and I no longer notice them, I won’t carry on the previous comparison but I do find that after a build has more than a few warnings they quickly start to multiply with out even being noticed. The same phenomena happens with errors in production logs, it is very easy to spot and respond to them when they are rare however when they have become frequent then they become the norm and in my experience it is a significant sign that the project is struggling with a vicious circle that needs to be broken. If you are lucky to be working on a green field project, then it is best to start this practice early.
The log files are part of the interface to a system, think of it as an API in its own right. They may not be used by the end customer, but in circumstances when they are used to support the end customer then they do have an audience. To ensure quality of any interface to a system it must be both tested and used frequently. Failure to perform these checks results in the rapid build up of dust, decay and entropy. No body enjoys house cleaning, so automating these tests can greatly reduce the burden. The usual suspects apply here, unit testing, mocking of the Log interface and TDD.
Tools such as log4j support multiple coarse grains of categorising the messages logged. It makes a significant difference to the quality of the log files if developers agree when to use each of the common log levels early on in the project, and adhere to them. This consistency helps to reduce spam in the log files and helps turn off the noise when going into production. Personally I start off using DEBUG for all of my messages and then increase its level when I can describe the benefit for doing so. The following table captures the type of explanations that I look for with each logging level. This table is intended to help get thoughts flowing, and is by no means a summary of the only rules that a team could use.
|FATAL||Reporting problems that until fixed will cost the company money and will carry on costing the company money until resolved. Once a problem has been reported do not immediately repeat the same message.||Fix ASAP. Even if that means waking people up in the middle of the night. As such false alarms and frequent alerts will not be taken to kindly.||The database is down and all user requests are being rejected.|
|ERROR||A problem has occurred that was not automatically recovered, but it is not critical to the revenue stream of the company.||In low to medium volumes ERROR messages will trigger investigation in a timely manner. High volumes may be escalated to FATAL. NB: A user entering ‘foo’ into a number only field should never be reported as an error at this level, it should be recovered automatically at the GUI level.||Updating a users details failed due to an unexpected database error.|
|WARN||A problem has occurred that was both unexpected and automatically recovered.||Actively monitor with the goal of pre-empting problems that could escalate. Frequent reoccurring problems will need to be fed back to development for resolution.||Connection to the exchange rate server has gone down but it is not causing an immediate problem due to caches or disk space is down to the last 10% etc.|
|INFO||Report key system state changes and business level events.||Used to answer infrequent queries about the behaviour of the system and its users.||Jim deposited 53 pounds sterling.|
|DEBUG||Detailed information about each request.||Used to investigate the causes of a problem. Will usually be disabled in production by default as it will output a lot of contextual information.||Value of field x is 92.4, or ‘starting process Y’ and ‘finished process Y in 95ms’|
Logging is not free. It costs the developer time to add it to a system, it costs the system time to execute the log statements and it costs the poor soul who has to investigate/support the system time to understand the logs. As a consequence of this think twice before logging a new event, is it really needed. If it is then how expensive is it to produce the event and how often will it appear in the logs?
- don’t log unless the event adds to the story being told, what you don’t know what the story is? go back to context is king and do not collect two hundred pounds
- if the event is very common, bulk them together; don’t spam
- if the message to be logged is a constant string use LOG.level(“msg”)
- if the message requires much string comparisons and runtime performance is important us isXXXEnabled
- to improve readability and performance I sometimes use an interface to reduce the number of ifs; also useful if the logs are to be translatable
To make processing logs easier write to the logs in a consistent format. When embedding data into the logs provide consistent names for the data etc.
For example, one approach that I have worked with is to output log messages like this:
[user=Jim] has logged on.
As discussed under ‘Minimising logging overheads’ we want to avoid spamming the log files. However when diagnosing a problem we need data, when reproducing problems we sometimes need a lot of data. For this reason it is very useful to have a mechanism that can increase the log levels on a machine, or group of machines without having to reboot the machine. The most flexible approach is to be able to do this per user; remember to not give untrusted access to such a mechanism.
You do not want to run out of disk space in production. Monitor the amount of data on the disk drives, and archive off logs for future reference.
Custom shell scripts written in grep/sed/awk, perl, ruby, etc