
So we sometimes get asked about our tagline, Monitoring that Matters.

Doesn’t all monitoring matter?  Well, yes it does.  But to badly paraphrase George Orwell, all monitoring matters, but some monitorings are more mattering than others.

What makes monitoring matter?  Monitoring matters when it reduces your outages, reduces incidents of unacceptable performance, or shortens the time it takes to resolve either.

We like to think of monitoring as similar to Maslow’s hierarchy – there are base levels that must be met first, but the higher up you go, the better.

  1. The base level of monitoring is “is my host/site alive?”  Everyone needs this (but not everyone has it). That gives you reactive monitoring.
  2. “Is my host going to keep working in the near term?”  This means alerts about disks filling up, or about swap space and memory usage.  These alerts help prevent some outages.
  3. “How is my host performing?” Things like CPU load and rate of swapping.  Alerts on these metrics warn of impending performance issues while there is still time to address them.
  4. “How is my application performing?”  A measurement of representative application performance. This may be things such as database transaction time, the time a web site takes to process a request, or even, for a storage array, the latency of write requests. It’s really a more fundamental level of monitoring than level 3: an alert about CPU load, in level 3, may not indicate anything amiss. In the case of NetApp monitoring, high CPU load could be caused by a weekly RAID scrub, while request latency, which is what really matters, is not affected at all. However, we rank it higher than level 3, simply because it’s easier (and more common) to monitor general-purpose metrics (such as CPU) than application-specific performance metrics (such as database transaction times).
  5. “Why is my application performing as it is?” This is where monitoring really starts to matter.  The more data you collect and trend, the better your monitoring system will be at helping you quickly identify and resolve issues. (Of course, the more data you graph, the better your monitoring system needs to be at presenting and organizing that data in a meaningful way. It’s easy to show 4 graphs of a system, but when you may have 120 different graphs for one system, and you need to quickly scan them to see correlations, the UI challenges get more interesting.) In a database, an alert about response time is clearly significant, but doesn’t tell you what the cause is.  But if the monitoring system can quickly show you that your database’s sequential table scans jumped after the last software release, or that latency of a storage array volume is high because another volume sharing the same physical disks is experiencing an abnormal rate of IO operations, you will be able to resolve your issues much more quickly.
  6. “How do I fix my issue?” This is the peak level, where the monitoring system not only shows all the data, but presents directions on how to resolve any issues.  LogicMonitor can do this in some cases (for example, using data from the number of select operations, query cache hits and cache prunes due to low memory to recommend enlarging, reducing or disabling the MySQL query cache).  This is a much harder problem to generalize, especially across systems that interact, but we’re always improving.
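
To make level 6 a little more concrete, here is a minimal Python sketch of the kind of rule described above for the MySQL query cache. It is not LogicMonitor's actual logic. The inputs correspond to the MySQL status counters Com_select, Qcache_hits and Qcache_lowmem_prunes, sampled as deltas over some interval; the QueryCacheStats container, the function name and both thresholds are assumptions made up for illustration. The sketch only distinguishes "disable", "enlarge" and "keep"; recommending a smaller cache would also need data on cache size and free memory, which is left out here.

```python
from dataclasses import dataclass


@dataclass
class QueryCacheStats:
    """Counter deltas over one sampling interval (hypothetical container)."""
    com_select: int            # SELECTs actually executed, i.e. not served from cache
    qcache_hits: int           # SELECTs answered from the query cache
    qcache_lowmem_prunes: int  # cache entries evicted because memory ran low


def recommend_query_cache(stats: QueryCacheStats,
                          low_hit_ratio: float = 0.2,
                          high_prune_ratio: float = 0.1) -> str:
    """Return 'disable', 'enlarge' or 'keep'.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    lookups = stats.com_select + stats.qcache_hits
    if lookups == 0:
        return "keep"  # no SELECT traffic in this interval, nothing to conclude

    hit_ratio = stats.qcache_hits / lookups
    prune_ratio = stats.qcache_lowmem_prunes / lookups

    if hit_ratio < low_hit_ratio and prune_ratio < high_prune_ratio:
        # Few hits, and memory pressure is not the reason: the cache is mostly overhead.
        return "disable"
    if prune_ratio >= high_prune_ratio:
        # Entries are being evicted for lack of memory: a larger cache may help.
        return "enlarge"
    return "keep"
```

For example, recommend_query_cache(QueryCacheStats(com_select=1000, qcache_hits=50, qcache_lowmem_prunes=0)) falls into the first branch, because fewer than 5% of lookups are cache hits and nothing is being pruned. Note that the query cache was removed in MySQL 8.0, so a rule like this only applies to older servers.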

How far up the hierarchy is your monitoring?  If you don’t have a wealth of data about all aspects of your system, that you can trend in real time and look at historically, it’s almost certain that your outages and performance issues are both too frequent and too long.

IT people by nature are supposed to be gurus. They’re supposed to be able to build things from scratch. This expectation certainly applies to data center monitoring, where a common practice is to rely on open source monitoring tools such as Nagios. But when you consider the value of your time, these free tools can quickly wind up being far more costly than commercial tools. For instance, we did a survey and found that some system admins had spent over 100 hours getting their open source monitoring solution to do what they wanted. Further, there was ongoing work to keep the system up to date with frequent changes in their data center, and even then they only had, for the most part, coarse-level monitoring (for example, monitoring only the CPU load of a load balancer, instead of monitoring the state of all the hundreds of VIPs on the load balancer).

When the only alternatives were costly enterprise-class monitoring solutions, sweating it out with open source was understandable. But now that there are affordable tools that automate configuration and give you everything you need in 30 minutes, insisting on building your own doesn’t seem wise (especially in this era of understaffed data centers). At the root of this DIY mentality is pride. With so many open source options available, techies probably feel some sense of shame or embarrassment going to an IT director and asking for tools that cost money.

I’d suggest a better source of pride is being able to spend time on tasks that add value to the enterprise – writing Puppet scripts that automate machine and software deployments, and so greatly reduce the time to spin up machines; investigating cloud usage options; correlating resource expenses with revenue per business unit.  There are a lot of things that should be done in any enterprise but are not, for lack of time.  A good systems administrator’s time is very valuable – far too valuable to spend going through a MIB to figure out which item is important to monitor.

And no matter how good a systems administrator you are, monitoring is not going to be your top priority (nor should it be).  You’ll get monitoring going “good enough” – but there will be lots of cases that it fails to alert on, where a comprehensive monitoring system would have.  Then after every outage, you’ll have to go back and extend the monitoring, adding in metrics that could have helped predict that specific case.

So consider the cost of your time; the more in-depth monitoring that you get immediately with LogicMonitor (a typical Nagios implementation may monitor 10 metrics on a Linux host; a typical LogicMonitor deployment will monitor over 100); and the opportunity cost of the things you could be doing to add value if you weren’t configuring monitoring. Given all that, why not use an automated monitoring tool such as LogicMonitor that makes you a better system administrator, and doesn’t require a Fortune 500 budget to implement?

If you’d rather skip the tedious work, but want the peace of mind of knowing that your infrastructure is properly monitored, and that you will be alerted to any issues early, it’s perfectly okay to go the automation route. You’ll feel a sense of satisfaction in preventing an outage, whether you wrote the code or not.  And your CFO may even thank you for spending the money.

