Logging is an irreplaceable tool for maintaining software – not only during development, but especially once it is deployed. During development we tend to place log messages wherever we need to inspect which parts are working and where problems occur. But once we get closer to releasing our product we need a solid and clean logging setup that provides the right amount of information and allows to identify real problems easily.
When we started refactoring the logging implementations for the release of a project, we realised that it is really hard to find concrete takes on how to plan and implement a solid logging infrastructure. Most articles and blogposts concerning logging point out a list of general best practices like:
use an established logging framework
be careful with sensitive data
use the right severity level
provide enough context that anyone can understand your logs
Although all these pieces of advice are very valuable and important, it remains a challenging task to find a consistent approach to a logging setup from scratch. The goal of this article is to contribute to this matter, in sharing some of the thought processes we went through reworking the logging infrastructure regarding the needs of a specific project.
To implement a solid logging structure, it is very important to find out in which ways the logs should be used and what challenges are going to be faced due to the requirements of the specific project. Some relevant questions might be:
Who will work with the logs?
Do you process data that contains private user information or could be used for an attack?
Is your application used in parallel?
Do you use some further analysis tools to filter and visualise your log data?
In our specific case the project at hand is a backend application that stores only small amounts of (pseudonymised) user data, but communicates with multiple external services. Our log messages are collected with the popular elastic ELK Stack and should not only be analysed by developers working on the project, but also be used by a support team without a development background. Also some automatic alerting should be set up based on the provided log data. Due to infrastructural limitations, we are required to store the logged data on a server that is externally available.
From there we found that we had four main goals with our logging infrastructure:
1. Concurrency: As the application is likely to receive many requests at the same time, we need to ensure that we can always find out which log messages belong together.
2. Clarity: Because we need to provide support for our customers, we also have to make sure that the log messages are understandable for people not working in the development of the project.
3. Structure: We also need to provide the logs in such a way that we can make good use of our analysis and monitoring tools.
4. Security: Since we store all logs on an external server, we also want to ensure that a possible attack on the log server doesn’t cause any harm.
A very straightforward approach for handling the problem of parallel requests is to implement some kind of tracing. Although there are existing libraries that provide tracing functionality, we decided to go for a simple standalone implementation, that generates an unique ID for each incoming request and propagates it through the services. Building the tracing mechanism ourselves made it easy to provide some useful restrictions (e.g. on the places in that request IDs can be generated) that enforce the intended usage.
Technical Details: Request ID Generation In our case, working with Scala, we used a sealed abstract class and self types to limit the creation of request ids. The Play framework, within which we built the backend, processes incoming requests with composable actions. We implemented a RequestIdAction that creates an ID for each Request and returns a RequestWithId (which is a simple wrapper). We then modified all other action to expect this wrapper type and thus ensured that the composition is carried out in the right order.
With a sealed abstract case class, we prevent instantiating a RequestId by its apply method and instead restrict it to the use of a random method which will guarantee that each RequestId is created with a new UUID. By making the companion object containing this method private, it can only be used inside the logging package. This package contains a RequestIdProvider trait that provides access to the method from outside the package, but its instantiation is limited to the RequestIdAction class using self types.
We decided to propagate the RequestId through the different services using implicit parameters since it shows that the value is determined externally and not supposed to be changed.
As we are great fans of functional programming, our first idea was to make use of the writer monad to collect log messages on the way through the programs components and carry out all gathered logs at once. This approach has some nice benefits:
Messages are grouped together for each request.
It is possible to only perform logs on specific conditions (collect logs for every step, but only log on error).
Side effects are limited to specific places.
The fact that log messages are produced is expressed in the type.
But there are also major drawbacks:
Understanding the program gets more complicated (more complex monad stacks and transformers) and thus onboarding new developers to the project will take more time.
Since all the logged data has to be kept on the stack it may lead to performance limitations.
In case of an unexpected and unhandled exception we loose all collected log data and thus the means to find out what lead to that exception.
Especially the last point led us to dismiss the approach for now and stick to a simpler solution.
To make the log messages understandable for people that aren’t working on the code itself, we cannot solely rely on the logged class name , but need to provide some information about the context in which the error occurred (e.g. a user tried to register or a push-notification got sent and so on). An even better approach is to work with error codes for specific cases that are denoted with a technical description, an easy to understand meaning, and (ideally) information about who is responsible for fixing the error.
As elastic search is based on JSON structured data, it is an obvious goal to configure our logging toolchain to export information like class name and severity level in a JSON format. Taking it one step further, we also encode the logged content in a JSON object form. This allows us, for instance, to filter for specific error codes and thus set up alerts and visualisations in a simpler and more robust manner.
To minimise the risks involved in storing our log data on an external server that could potentially be compromised, we wanted to ensure that there is no sensitive data included in the available logs. On the one hand this concerns privacy and, on the other, the risk of leaking data that could be used for an attack (e.g. access tokens, keys and the like). We decided to be very conservative in this case and ensure to not log any database values at all. Additionally we don’t log the whole error messages of runtime exceptions, except for those cases in which we are certain that no sensitive data is included.
After talking about all these goals and the cues how to tackle them, one crucial point is still untouched: When and what should be logged? And how can we ensure that our logging works in a way that helps us to achieve our goals? – The simple answer is standardisation. If we implement patterns for what should be logged in different cases, we can achieve a uniform logging throughout the application that is aligned with our goals and challenges.
For me, a good starting point was to think about the different actions carried out in the application: for instance, database operations, HTTP-requests or validations. Then continue with the outcomes of these operations that should be logged: an HTTP-request for example might fail because of either an unexpected response status, a response body that cannot be parsed or simply time out. Now think about which information you’ll need to locate the problem, like the service that was reached out to or which fields led to a parsing error. But bear in mind the security concerns here!
The usage of the standardised formats can be further encouraged with language features that restrict the usage of the logger to structured data.
Following are some examples in Scala, but similar things can be achieved with many typed languages.
We defined a sealed trait LogEntry that contains the union of all possible log message fields and provided a smart constructor for each standardised case. In Scala sealed traits cannot be instantiated directly, so developers have to use one of the provided smart constructors matching the logging case at hand. Each of the smart constructors also requires an implicit RequestId as argument (See the block Request ID Generation)
Next, we wrapped the Logger such that its methods require a LogEntry instead of a string as argument and used it throughout the project.
We provided some helper methods to simplify logging in common cases, like sending error responses.
In order to implement a good logging structure, it is crucial to reflect on the goals you want to achieve with the logged data and the challenges you face due to the way your specific project works. Keep these aspects in mind when you start creating some standardised logging patterns. Of course, it is a big task to find a standard in early stages of the project, but it will keep evolving with the development progress and will provide you with a good starting point for future iterations. Start by thinking about the different cases which should lead to log messages and what information you need in each of those cases. Additionally, make use of language features to create restrictions that help yourself and others sticking to the standardisation.