A zero-cost instrumentation framework for C++ honoring hard real-time requirements

** DRAFT **

If you are not a software person, don't be fooled! Zero-cost, in software terms, describes functionality a program gets without spending additional resources, such as memory or CPU time.

Here I'll present one such example (a good one, really) that is ideal for cases like ours, where your service's response times are bound by deadlines. Why would someone use C++ otherwise? some might argue...

Anyway, the tricky part is that the instrumentation is zero-cost only if you use an event-driven architecture or, more specifically, our Fast and Flexible Events Framework for C++. Just to motivate you: multi-threaded software services written with Mutua's Events Framework run faster than they would without it (blog post coming...). Now, when you realize that on top of that you may also get full instrumentation for free, how cool is that? Some may ask "didn't you just reinvent the wheel on this one?", but the answer lies in the requirements: zero-cost. I just couldn't find any zero-cost instrumentation framework for C++. In fact, our Events Framework was built not only to be the fastest one available, but also to allow this kind of flexibility.

Now, don't get me wrong, developers: I mean free in run-time costs. There are some architectural patterns you will have to adhere to on the implementation side. But, for what you get, the effort is very low.

First, let's see what one might want when we talk about instrumentation of SaaS products. As a guideline, remember that in SaaS everything is automated: selling, payment, user management, ... operations included? Sure! The DevOps of a SaaS product must be even more automated (fully automated, I'd say) because you want to keep your fixed costs low (but this is a subject for another post). To fully automate the operations of a piece of software you will certainly need many tools (to build, deploy, test, start/stop cloud instances... just to name a few), but here we are discussing the instrumentation that goes inside the software, which serves the purpose of monitoring. So, for the automated monitoring of a Software as a Service product, what might one want? Here is my list:

To know, several times per minute, whether the servers and the software are in a sane state, ready to fully process requests;
To know how many requests (of type X, Y or Z) are arriving per minute or, rather, to know the performance details of each "event X, Y or Z";
A way for a human to inspect those requests, any errors or other details regarding the execution of the software. The person wanting to check things manually may be the Product Manager (or Product Owner), a developer or a DBA / Data Administrator, and by this I mean a way of following the log files (respecting any sensitive information and each person's access level);
An automated notification (alarm) when something nasty happens (errors X, Y or Z... or the absence of events X, Y or Z in a given time interval).

From all the requirements above, we may divide the monitoring into artifacts that must run alongside the service and artifacts that may run on a remote machine (a monitoring machine). The artifacts running on the server must acquire, store, notify and provide, upon special request, this information. And, to our delight, the only information needed is the key events the software processed (or failed to process). Did you get the point?

To use our zero-cost instrumentation framework, the architect must carefully design the software keeping in mind that only events will be monitored. So, in the best tradition of event-driven programming, everything that matters in your software flow must be designed using the event production / notification / consumption paradigm. Easy and fast.

In the Mutua Events Framework, you will find a special class designed just for that: LogEventLink. This class, as opposed to QueueEventLink and DirectEventLink, keeps old events. It works just like a queue, but slots are never released. It is as fast as the QueueEventLink (which is faster than a std::function call). A possible caveat is that it may fill up your disks, for obvious reasons, but, if you want, you may configure it to use log rotation (and the rotated files may be deleted later). The I/O is very inexpensive, for we use MMAP, and the producer thread may be configured for low I/O priority which, when allied with the tuning of dirty_background_ratio & family, showed no measurable impact on our benchmarking tests.
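
To make the "queue whose slots are never released" idea concrete, here is a minimal sketch. It is not LogEventLink's code or API: the class name AppendOnlyEventLog, its methods and the fixed in-memory capacity are all assumptions made for illustration, and the memory-mapped file behind the real thing is left out on purpose.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical illustration only: NOT the Mutua Events Framework API.
// An append-only log: a slot is reserved once and never recycled,
// so every past event stays readable by its ID.
template <typename Event, std::size_t Capacity>
class AppendOnlyEventLog {
public:
    // Reserves the next slot with a single atomic increment and stores
    // the event there. Returns the event ID (the slot index), or
    // nullopt once the log is full (a real log would rotate instead).
    std::optional<std::uint64_t> publish(const Event& event) {
        const std::uint64_t id = next_.fetch_add(1, std::memory_order_relaxed);
        if (id >= Capacity) return std::nullopt;
        slots_[id] = event;
        // Release: readers that observe 'ready' also observe the event.
        ready_[id].store(true, std::memory_order_release);
        return id;
    }

    // Reads an event by ID; returns nullopt if it was never published.
    std::optional<Event> read(std::uint64_t id) const {
        if (id >= Capacity || !ready_[id].load(std::memory_order_acquire))
            return std::nullopt;
        return slots_[id];
    }

private:
    std::array<Event, Capacity> slots_{};
    std::array<std::atomic<bool>, Capacity> ready_{};  // all false initially
    std::atomic<std::uint64_t> next_{0};
};
```

The producer pays for one atomic increment and one store per event, which is the kind of cost profile described above. Persistence, rotation and I/O priority are outside the scope of this sketch.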

As you may be guessing, this approach is OK for the instrumentation of metrics, since we are recording every tick. But a tool is needed in order to generate the usual log files. This tool runs on the monitoring machine: it collects the events, assembles them into a timeline, compresses it and also consolidates some metrics (for instance, SEND_USER_MESSAGE may occupy only one record per second or minute, since we don't need every tick).
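
The consolidation step could look roughly like the sketch below. It assumes each recorded tick carries a timestamp in milliseconds; the Tick and ConsolidatedRecord structs and the consolidate function are hypothetical names of mine, not part of the monitoring tool.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical raw record: one per event occurrence ("tick").
struct Tick {
    std::uint64_t eventId;
    std::uint64_t timestampMs;
};

// Hypothetical consolidated record: one per time bucket.
struct ConsolidatedRecord {
    std::uint64_t bucketStartMs;
    std::uint64_t count;
};

// Collapses every tick of one event type (say, SEND_USER_MESSAGE) into
// one record per bucket of 'bucketMs' milliseconds (1000 = per second,
// 60000 = per minute). Buckets with no ticks produce no record.
std::vector<ConsolidatedRecord> consolidate(const std::vector<Tick>& ticks,
                                            std::uint64_t bucketMs) {
    assert(bucketMs > 0);  // precondition: avoids division by zero
    std::map<std::uint64_t, std::uint64_t> counts;  // ordered by bucket start
    for (const Tick& tick : ticks)
        ++counts[(tick.timestampMs / bucketMs) * bucketMs];

    std::vector<ConsolidatedRecord> records;
    records.reserve(counts.size());
    for (const auto& [bucketStart, count] : counts)
        records.push_back({bucketStart, count});
    return records;
}
```

Since this runs on the monitoring machine, it adds nothing to the service's run-time cost, and the per-tick records can be dropped once they have been consolidated.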

Now let's concentrate on the events that will produce log lines arranged in chronological order. If you have used Mutua Events Framework before, you'll have noticed that we fully respect the C++ premise "you should only pay for what you use" and that we don't add time measurements to any event. So, wouldn't that be a penalty here? Well, if you want time measurements on your events, LogEventLink does have an option for that and, when you use it, you also benefit from some good and important run-time benchmarks. But if you don't want timings, we will still be able to generate a chronological log for you, because every event has an atomic ID. This is the key, but it is not sufficient: different events have independent event IDs, and this is where you need to act when designing your event objects' structures: you must keep track of the "parent event ID".

You might be asking: what about unrelated events, like "SERVE_STATIC_FILE" and "SERVE_DYNAMIC_CONTENT"? Each one of them will have its own event ID, won't it?

Yes, they will. But remember, Mutua Events Framework is so efficient that you should not be economical with events: use them as the theory suggests, all over. So, the first event might be "REQUEST_RECEIVED"; then, when you are consuming the request, you will parse it and determine whether it is a SERVE_STATIC_FILE or a SERVE_DYNAMIC_CONTENT request, both of which will carry a parent event ID. Did you get the idea?
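
Here is a minimal sketch of how a timeline might be rebuilt from parent event IDs alone, with no timestamps involved. The RootEvent and ChildEvent structs, their fields and the buildTimeline function are assumptions for illustration; the framework's real event structures and the log tool's output format may well differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical root event: the first event produced for each request.
struct RootEvent {
    std::uint64_t eventId;  // atomic, ever-increasing ID of this event type
    std::string url;
};

// Hypothetical follow-up event (SERVE_STATIC_FILE, SERVE_DYNAMIC_CONTENT...).
// Each type counts its own IDs, so the parent ID is what ties it to a request.
struct ChildEvent {
    std::uint64_t eventId;
    std::uint64_t parentEventId;  // ID of the originating root event
    std::string type;
};

// Orders root events by ID and lists each child right after its parent.
std::vector<std::string> buildTimeline(std::vector<RootEvent> roots,
                                       const std::vector<ChildEvent>& children) {
    std::sort(roots.begin(), roots.end(),
              [](const RootEvent& a, const RootEvent& b) { return a.eventId < b.eventId; });

    // Index children by parent; equal keys keep their insertion order.
    std::multimap<std::uint64_t, const ChildEvent*> byParent;
    for (const ChildEvent& child : children)
        byParent.emplace(child.parentEventId, &child);

    std::vector<std::string> lines;
    for (const RootEvent& root : roots) {
        lines.push_back("REQUEST_RECEIVED #" + std::to_string(root.eventId) + " " + root.url);
        auto [first, last] = byParent.equal_range(root.eventId);
        for (auto it = first; it != last; ++it)
            lines.push_back("  " + it->second->type + " #" + std::to_string(it->second->eventId) +
                            " parent=#" + std::to_string(root.eventId));
    }
    return lines;
}
```

Note that both child events may carry ID 0 in their own sequences without any ambiguity: the parent event ID alone decides where each one lands in the log.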