There’s some interesting discussion around “Monitoring Sucks”, and has been for a while. (Go check the twitter hashtag #monitoringsucks). This is not a new opinion – the fact that I thought monitoring sucks is why I started LogicMonitor.
But it’s interesting to assess whether LogicMonitor meets the criteria for not sucking. Clearly our customers think we have great monitoring - but probably only 30% of our customers are SaaS type companies, and may or may not have the DevOps mentality.
So the initial criteria for why monitoring sucks, at least on the referenced blog post, were:
But does monitoring REALLY suck? Heck no! Monitoring is AWESOME. Metrics are AWESOME. I love it. Here's what I don't love: - Having my hands tied with the model of host and service bindings. - Having to set up "fake" hosts just to group arbitrary metrics together - Having to either collect metrics twice - once for alerting and another for trending - Only being able to see my metrics in 5 minute intervals - Having to chose between shitty interface but great monitoring or shitty monitoring but great interface - Dealing with a monitoring system that thinks IT is the system of truth for my environment - Perl
Let’s look at these points from the point of view of LogicMonitor
Having my hands tied with the model of host and service bindings. I’m not sure how you not tie someone’s hands to some degree, but LogicMonitor certainly tries to give flexibility. Services do generally have to associated with hosts – but can be associated by all sorts of things (hostname, group membership, SNMP agent OID, system description, WMI classes supported, kernel level, etc.)
Having to set up “fake” hosts just to group arbitrary metrics together. LogicMonitor avoids this mostly with custom graphs on dashboards, which allow you to group any metric (or set of metrics based on globs/regex’s) with any other set, filtered to the top 10, or not; aggregated together (sum, max, min, average) or not. Also, some meta-services are associated with groups, not hosts, to allow alerting on things like number of servers providing a service, rather than just whether a specific host is successfully providing the service.
Having to either collect metrics twice – once for alerting and another for trending. We certainly don’t require that. Any datapoint that is collected can be alerted on, graphed, both or neither. (Sometimes datapoints are collected as they are used in other calculated datapoints, derived from multiple inputs.)
Only being able to see my metrics in 5 minute intervals. Again, we don’t impose that restriction – you can specify the collection interval for each datasource, from 1 minute to once a day. (I know going to only 1 minute resolution is not ideal for some applications – but as a SaaS delivery model, we currently impose that limit to protect ourselves, until the next rewrite of the backend storage engine, which should remove that.)
Having to chose between shitty interface but great monitoring or shitty monitoring but great interface.I think we have a pretty good interface and great monitoring. Certainly our interface is orders of magnitude better than it was when we launched, and a lot of people give us kudos for it. But there’s lots of room for improvement.
Dealing with a monitoring system that thinks IT is the system of truth for my environment. LogicMonitor thinks it is the truth for what your monitoring should be monitoring – but it’s willing to listen. It’s easy to use the API to put hooks into puppet, kickstart, etc that automatically add hosts to monitoring, assign them to groups, etc. We’re looking at integration with Puppet Lab’s MCollective initiative and other things to get further along this issue.
Perl. Our collectors are agnostic when it comes to scripting. They support collection and discovery scripts in the native languages of whatever platform they are running on – so VBscript, powershell, C# on Windows; bash, ruby, perl, etc on linux. But as our collectors are Java based, we encourage Groovy as the scripting language for cross-platform goodness. The collectors expose a bunch of their own functionality (snmp, JMX, expect, etc) to groovy, so it makes a lot of things very easy. So it’s the language we use for writing and extending datasources for our customers. But if Perl is your thing, keep at it.
So, does LogicMonitor suck? I don’t think so, and hopefully DevOps Borat does not either.
I’ll be at the DevOps Days conference in Austin this coming week (LogicMonitor is sponsoring), so hopefully we’ll get some more feedback there.
Or post below to let us know what constitutes “non-sucky” monitoring.
Amongst its many monitoring methods, LogicMonitor supports IPMI. Many people aren’t aware of IPMI, and don’t think it’s necessary. And while I’m certainly an advocate of avoiding unnecessary complexity in a data center, sometimes it is good to wear both a belt and suspenders.
A real life example from one of our own data centers conveniently occurred just this morning, when I was looking for fodder to blog about: Read more »
When monitoring a NetApp, the thing that matters is (for most applications) the latency of requests on a volume (or LUN.)
Easy enough to get – with LogicMonitor it’s graphed and alerted on automatically, for every volume. But of course when there is an issue, the focus changes to why there is latency. Usually it’s a limitation of the disks in the aggregate being IO bound. Assuming there is no need for a reallocate (the disks are evenly loaded – I’ll write a separate article about how to determine that), how can you tell when what level of disk busy-ness is acceptable? Visualizing that performance like the below is what this post is about.
Last night our ops team (of which I am a member) got paged about the CPU load on a Cisco 3560 switch in a new datacenter, late at night. My initial reaction was “We don’t need this alert escalated to pagers or phones- 3560′s switch and route in hardware, so CPU load doesn’t matter.” Once I’d woken up a bit more, the corollary - that there is no possible way that this switch should be at a CPU level to trigger an error alert – occurred to me. Read more »
Performance monitoring for all your infrastructure & applications. In minutes, not hours.
Questions? Call Us!
(888) 415-6442 or +1 (805)-617-3884