Yesterday was the global launch of IBM PureSystems, a new and innovative solution from IBM that focuses on reducing the complexity of deploying new applications in medium to large enterprise IT environments.
“Congratulations to op5 for being among the early adopters of IBM PureSystems,” said Michael Riegel, vice president of global ISVs, IBM. “This will enable them to offer an industry leading monitoring solution to clients in a way that cannot be matched by competitors.”
“op5 is all about making unified control of customers’ IT/IP services easy. This new solution from IBM furthers this goal, as it takes away the uncertainties, the need for compliance testing and other time-consuming risks usually associated with the deployment of a new application,” says Jan Josephson, CEO, op5 AB.
op5 Monitor Enterprise is among the very first applications fully certified for IBM PureSystems; please see more on:
op5 Monitor is a highly scalable solution that enables distributed monitoring with automatic fail-over, load-balancing and redundancy. In this 30-minute webinar we will give you insights on why and how to scale op5 Monitor to your needs.
We want to share some information and give a progress report on the development being carried out by Andreas Ericsson and the op5 Development team on the Nagios project, a core component of op5 Monitor. The development aims to bring improvements to the project and enable new possibilities of importance for us and our solution op5 Monitor.
We hope that this post will shed light on why this prioritized work is of importance for op5 users and what it will bring to the product further down the road. We also want to inform and engage the Nagios community on why the work is being carried out and to give insight into what goodies will become available for the Nagios project. Any feedback and suggestions are, as always, much appreciated.
In this post we describe the work on:
Complexity reduction by moving from multithreaded to single-threaded programming
Reducing latency
Disk I/O usage reduction
CPU usage reduction
Memory usage reduction
Decreasing complexity
In short, our aim as a software company is to provide our users with high quality, high performance and rock solid products. In order to achieve this goal, we want to provide software with low complexity and without major performance bottlenecks, as well as to produce well-tested code that can be reused across a multitude of applications. That is why our development department is spending a significant amount of resources on contributing to the development of the Nagios project.
Here are some of the actions taken to remove complexities from the Nagios core:
Removal of multithreading
We have continuously been working to remove the multithreading code in the Nagios core in favor of a general-purpose I/O broker and have done away with the cumbersome and disk I/O intensive check result spoolfiles in favor of worker processes. This provides several benefits over the previous way of executing checks.
Multithreaded programming is a lot more complex than single-threaded programming, since multiple threads sharing the same resources have to deal with resource contention. Threads have to either wait for each other when they wish to use the same resource (making multithreading a moot point in the first place), create their own instance of the shared object (negating the benefits of resource sharing), or risk crashing when both threads try to use the same resource at the same time. Currently, all eventbroker modules that wish to communicate with external programs and update the status of the running Nagios with data from the external program are forced to handle these complexities. With the next generation Nagios core, several thousand lines of code in such addons can be replaced by a simple, well-defined and well-tested library call provided by Nagios core, reducing complexity by several orders of magnitude.
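To illustrate why a single-threaded I/O broker sidesteps the contention problem, here is a minimal sketch in Python. It is not the Nagios core library or its API: the class `IOBroker`, the function `collect_results` and the `host=STATE` message format are all made up for this example. The point is only that when every callback runs from one loop in one thread, shared state needs no locks.

```python
import selectors
import socket


class IOBroker:
    """Hypothetical single-threaded I/O broker: one loop, one callback per socket."""

    def __init__(self) -> None:
        self._sel = selectors.DefaultSelector()

    def register(self, sock: socket.socket, on_readable) -> None:
        # The callback is stored as the selector key's data.
        sock.setblocking(False)
        self._sel.register(sock, selectors.EVENT_READ, on_readable)

    def unregister(self, sock: socket.socket) -> None:
        self._sel.unregister(sock)

    def poll(self, timeout: float = 1.0) -> int:
        """Wait for readable sockets and run their callbacks one at a time."""
        events = self._sel.select(timeout)
        for key, _mask in events:
            key.data(key.fileobj)
        return len(events)

    def close(self) -> None:
        self._sel.close()


def collect_results(n_sources: int) -> dict:
    """Simulate n external programs reporting 'host=STATE' (an assumed format)."""
    broker = IOBroker()
    status = {}  # shared state, but only ever touched from the broker loop

    def on_result(sock: socket.socket) -> None:
        host, _, state = sock.recv(4096).decode().partition("=")
        status[host] = state  # no lock needed: callbacks never run concurrently

    pairs = [socket.socketpair() for _ in range(n_sources)]
    for i, (rx, tx) in enumerate(pairs):
        broker.register(rx, on_result)
        tx.sendall(f"host{i}=OK".encode())

    while len(status) < n_sources:
        broker.poll()

    for rx, tx in pairs:
        broker.unregister(rx)
        rx.close()
        tx.close()
    broker.close()
    return status


if __name__ == "__main__":
    print(collect_results(3))
```

An addon built this way only registers a socket and a callback; the waiting, waking and dispatching all live in one small, testable place.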
Complexity separation
Since workers run in their own process space, bugs in the workers do not affect the stability of the core scheduler. This is a good thing, since it means one can experiment more freely with the worker code, and even assign external programs to work as prototype workers. The I/O broker also makes it a lot easier to move several previously hard-to-do tasks outside the core and into a separate process, leading to even further complexity reduction in the core and even better complexity separation.
Latency reduction
The scheduling core need only write a job request to one of its workers in order to execute the external script it wishes to run. Since notifications, eventhandlers and a slew of other actions are now executed asynchronously through one of the worker processes, the time it takes to run them doesn’t add to the master process’ latency numbers.
Disk I/O usage reduction
Since worker processes communicate by copying pieces of memory from one process to another (through a socket, for those interested in details), we can do away with all the disk I/O generated by writing, scanning for and reading the check result spoolfiles.
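The job request and result exchange described in the last two sections can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Nagios worker protocol: the JSON-per-line message format and the names `spawn_worker`, `run_check`, `stop_worker` and `worker_loop` are invented for the example, and it relies on `os.fork()`, so it only runs on Unix-like systems. The master and the worker share a socket pair, so no spoolfile is ever written to disk.

```python
import json
import os
import socket
import subprocess
import sys


def worker_loop(sock: socket.socket) -> None:
    """Hypothetical worker: read one JSON job per line, run it, send back the result."""
    reader = sock.makefile("r", encoding="utf-8")
    writer = sock.makefile("w", encoding="utf-8")
    for line in reader:
        job = json.loads(line)
        proc = subprocess.run(job["argv"], capture_output=True, text=True)
        reply = {"id": job["id"], "rc": proc.returncode, "output": proc.stdout.strip()}
        writer.write(json.dumps(reply) + "\n")
        writer.flush()


def spawn_worker():
    """Fork a worker process; return the master's end of the socket and the worker pid."""
    master_end, worker_end = socket.socketpair()
    pid = os.fork()
    if pid == 0:  # child process: become the worker
        master_end.close()
        try:
            worker_loop(worker_end)
        finally:
            os._exit(0)  # never fall back into the master's code
    worker_end.close()
    return master_end, pid


def run_check(sock: socket.socket, job_id: int, argv: list) -> dict:
    """Write a job request to the worker and read back its one-line reply."""
    sock.sendall((json.dumps({"id": job_id, "argv": argv}) + "\n").encode())
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("worker closed the connection")
        buf += chunk
    return json.loads(buf)


def stop_worker(sock: socket.socket, pid: int) -> None:
    """Closing the socket makes the worker see end-of-file and exit."""
    sock.close()
    os.waitpid(pid, 0)


if __name__ == "__main__":
    sock, pid = spawn_worker()
    print(run_check(sock, 1, [sys.executable, "-c", "print('OK - demo check')"]))
    stop_worker(sock, pid)
```

For brevity `run_check` waits for the reply. A real scheduler would write the request, carry on with other work and pick up the result when the socket becomes readable, which is what keeps slow checks from adding to the master's latency.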
CPU usage reduction
No longer having to scan for check result spoolfiles at frequent intervals means we save some CPU usage. We save even more by implementing a more clever way of executing external scripts, effectively cutting the number of fork() calls in half for every running Nagios installation. Since fork() can be a very expensive call, that provides quite a huge saving.
Memory usage reduction
Another benefit of fork()’ing less is that less memory is consumed. Since worker processes are extremely lightweight, the amount of memory used to launch each check is minimized, and we thereby provide a small saving in memory usage. However, since the worker processes and the communication between workers and master do incur some memory overhead, the net gain is small.
Code reuse
The worker process code is backed by several elegant, simple and well-tested libraries which can be reused to create other addons that want to communicate with Nagios core one way or another. This is a very good thing, since it means the core of such addons will be well-tested and that they can be written very, very quickly.
The changes will also bring several future benefits. Since workers now have their complexity separated from the main Nagios daemon, it will be possible to implement checks directly in the workers, bypassing external scripts altogether. This would mostly be of benefit for highly popular checks that are run frequently enough to warrant the added complexity of building them directly into the worker. check_nrpe (or a replacement for it) comes to mind, especially since NSClient++ can handle NRPE requests. Other good candidates for building in would be check_snmp and various other SNMP-based checks. It will also be possible to write a small broker module that lets external programs subscribe to various types of events and have those events streamed directly from Nagios, avoiding unnecessary disk I/O. PNP4Nagios would be one potential user of such a subscriber service, allowing it to avoid the disk I/O-costly spoolfiles it currently uses, and as a nice bonus we would get rid of the delay between executed check and updated performance graph.
Conclusion
This work will be included continuously in Nagios core as well as in op5 Monitor, and will result in performance improvements and complexity reduction. Both the Nagios community and op5 Monitor users will benefit greatly from these changes, particularly in the long run, when old addon projects start catching on and new ones are created.
This document is intended to give the technical staff and op5 customers an understanding of some of the ongoing development projects we are currently working on to enhance and secure our op5 Monitor solution for the future. This document should be viewed as a complement to our development roadmap.
This development is in some ways thankless, as there will be no visible new features presented in a nice user interface. The work does, however, lay the foundation for future feature enhancements and enables the realization of cool new ideas, so we thought it would be a good idea to share this information with you.
This is behind-the-scenes core work: we see the need to change fundamental functionality in the foundation of op5 Monitor, the Nagios project.
This article describes the work in progress on:
Complexity reduction by moving from multithreaded to single-threaded programming
Reducing latency
Disk I/O usage reduction
CPU usage reduction
Memory usage reduction
We want to provide a “high quality, high performance and rock solid” solution.
In order to achieve this goal, we want to build a system with low complexity without major performance bottlenecks and to produce well-tested code that can be reused as building blocks across a multitude of applications. That is why our development department has been working hard on contributing to the development of the Nagios project. Here are some of the actions taken to remove complexities from the Nagios core:
Removal of multithreading: The op5 development department has continuously been working to remove the multithreading code in the Nagios core in favor of a general-purpose I/O broker and has done away with the cumbersome and disk I/O intensive check result spoolfiles in favor of worker processes. This provides several benefits over the previous way of executing checks.
Complexity reduction: Multithreaded programming is a lot more complex than single-threaded programming, since multiple threads sharing the same resources have to deal with resource contention. Threads have to either wait for each other when they wish to use the same resource (making multithreading a moot point in the first place), create their own instance of the shared object (negating the benefits of resource sharing), or risk crashing when both threads try to use the same resource at the same time. Currently, all eventbroker modules that wish to communicate with external programs and update the status of the running Nagios with data from the external program are forced to handle these complexities. With the next generation Nagios core, several thousand lines of code in such addons can be replaced by a simple, well-defined and well-tested library call provided by Nagios core, reducing complexity by several orders of magnitude.
Complexity separation: Since workers run in their own process space, bugs in the workers do not affect the stability of the core scheduler. This is a good thing, since it means one can experiment more freely with the worker code, and even assign external programs to work as prototype workers. The I/O broker also makes it a lot easier to move several previously hard-to-do tasks outside the core and into a separate process, leading to even further complexity reduction in the core and even better complexity separation.
Latency reduction: The scheduling core need only write a job request to one of its workers in order to execute the external script it wishes to run. Since notifications, eventhandlers and a slew of other actions are now executed asynchronously through one of the worker processes, the time it takes to run them doesn’t add to the master process’ latency numbers.
Disk I/O usage reduction: Since worker processes communicate by copying pieces of memory from one process to another (through a socket, for those interested in details), we can do away with all the disk I/O generated by writing, scanning for and reading the check result spoolfiles.
CPU usage reduction: No longer having to scan for check result spoolfiles at frequent intervals means we save some CPU usage. We save even more by implementing a more clever way of executing external scripts, effectively cutting the number of fork() calls in half for every running op5 Monitor installation. Since fork() can be a very expensive call, that provides quite a huge saving.
Memory usage reduction: Another benefit of fork()’ing less is that less memory is consumed. Since worker processes are extremely lightweight, the amount of memory used to launch each check is minimized, and we thereby provide a small saving in memory usage. However, since the worker processes and the communication between workers and master do incur some memory overhead, the net gain is very small indeed.
Code reuse: The worker process code is backed by several elegant, simple and well-tested libraries which can be reused to create other addons that want to communicate with Nagios core one way or another. This is a very good thing indeed, since it means the core of such addons will be well-tested and that they can be written very, very quickly.
The changes will also bring several future benefits. Since workers now have their complexity separated from the main Nagios daemon, it will be possible to implement checks directly in the workers, bypassing external scripts altogether. This would mostly be of benefit for highly popular checks that are run frequently enough to warrant the added complexity of building them directly into the worker. check_nrpe (or a replacement for it) comes to mind, especially since NSClient++ can handle NRPE requests. Other good candidates for building in would be check_snmp and various other SNMP-based checks. It will also be possible to write a small broker module that lets external programs subscribe to various types of events and have those events streamed directly from Nagios, avoiding unnecessary disk I/O. PNP4Nagios would be one potential user of such a subscriber service, allowing it to avoid the disk I/O-costly spoolfiles it currently uses, and as a nice bonus we would get rid of the delay between executed check and updated performance graph.
Conclusion: This work will be included continuously during 2012 in the Nagios core project as well as in op5 Monitor.
Scalability is a commonly used “feature” in all fields of IT and there is no question that it is a real challenge for many IT managers in the near future. IP is a shared best effort technology by default. Like any road – if you double the traffic it will get jammed!
Business wants agile IT, fast and flexible. IT operations is all about maintaining stability. Can the two really meet?
An increasing number of IT organisations are facing the challenge of having to accept demands for more flexibility, and are thus using SaaS services from the public cloud, internal or external outsourcing etc.
At a time when all vendors compete in being “extremely simple and very cheap” it can sometimes be a challenge to make a quick comparison. I have a “simple, cheap and quick” tip to help make the initial judgement on the overall quality of a product, the company behind it and all the promises that are given...
Check out the manual!
We have all done it – bought a cheap remote control or downloaded an app, only to find a piece of really thin paper in Chinese, or a web page that has obviously been auto-translated, trying to explain how to set up the device. It’s extremely annoying, takes up our time and is just plain bad.
The same goes for software products. A bad manual tells you a few things:
The vendor really does not care about how, or even if, you use the product.
The vendor has limited or no experience of its own in using the product for the function for which it is sold.
The vendor might be a great code cruncher – but that is not what you bought – you bought a product that should in most cases save you time and/or money.
A good manual on the other hand should:
Save time in finding operational and functional answers to your product.
Make the product easy to use for more people in the company, thereby reducing training costs.
Reduce risk: if the application can be used by more people in the company, it reduces the risk of creating “a single super user that needs to be in on everything relating to the product” – what happens when he/she gets sick or leaves the company?
A good manual tells you that the vendor cares about the product and how it is being used, and is trying to maximise its usage at the customer, i.e. the vendor cares about you.
And of course Google, forums, blogs etc. are great complements to a good manual, but they cannot be the starting point, as it takes way too long and gives way too many options to get the basic knowledge.
Needless to say... we do spend a good amount of time on our manuals, as we do think the above is true. Take a look at our manuals
Savings can be identified in many places, depending on your organisation and your needs. Here are some of the savings users of op5 Monitor have reported:
The largest savings can be found in the incidents that are avoided. These savings can be hard to measure and are easier to just estimate.
The most direct savings are in different areas of time savings, such as:
Faster mean time to repair (MTTR) when an incident occurs
An easier-to-use system saves time for the system administrators, who can use their time and skills more efficiently.
Decreased need of maintenance of the monitoring solution
More efficient resource and investment planning, with statistics and facts available when making decisions.
The possibility to monitor SLAs makes follow-up easier, ensuring that 3rd party services perform as expected, and that compensation is paid when they underperform.
Decreased need for external consultants. If the system is easy to use and intuitive, the reduced need for product specialists can save large amounts of money.
The technical flexibility of the monitoring solution can decrease the need for different device managers and a flora of other specialised or limited software, each of which costs money, time and resources.
We’d like to inform all our customers, partners and community members that we have a planned maintenance window on Thursday the 22nd of September between 14:00 – 16:00 UTC (16:00 – 18:00 CEST), during which our website will be unavailable.
—
It’s been a while since IT made an appearance on the blog. Trust me, we’ve been keeping busy working on the internal infrastructure and the backend for our new web and customer backend systems.
As always, not all releases have been perfect, but we’re making improvements week by week to ensure that things are getting more stable and functional for both our internal and external customers.
As a part of this we initiated a project this spring in the spirit of open source to migrate from VMware to KVM as our virtualization solution. Throughout the last months we’ve tested the new systems extensively and are now ready to get our business and mission critical services migrated.
Since the website is our face towards the world and we’ve had issues coping with the load, migrating this service is our top priority, to further reduce load and availability issues.
We’re hoping to release a whitepaper later this year further describing the changes we’ve made, so stay tuned.