One question that often arises in monitoring is how to define alert levels and escalations – that is, whether a given alert should be Critical, Error, or Warning.
Assuming you have Error and Critical alerts set to notify teams by pager or phone, with a shorter escalation time for Critical alerts, here are some simple guidelines:
Critical alerts should be for events that have an immediate customer-impacting effect. For example, a production Virtual IP on a monitored load balancer going down, as it has no available services to route traffic to. The site is down, so page everyone.
Error alerts should be for events that require immediate attention, and that, if unresolved, increase the likelihood that a production-affecting event will occur. To continue with the load balancer example, an error should be triggered if the Virtual IP has only one functioning backend server to route traffic to – there is now no redundancy, so one failure can take the site offline.
Warnings, which we typically recommend be sent by email only, are for all other kinds of events. The loss of a single backend server from a Virtual IP when there are 20 other servers functioning does not warrant anyone being woken in the night.
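The guidelines above amount to a simple classification rule. Here is a minimal sketch – the function name and levels are illustrative assumptions, not any particular monitoring system's actual configuration:

```python
def vip_alert_level(healthy_backends: int) -> str:
    """Map the number of healthy backends behind a Virtual IP to an alert level."""
    if healthy_backends == 0:
        # The VIP has nowhere to route traffic: the site is down. Page everyone.
        return "critical"
    if healthy_backends == 1:
        # No redundancy left: one more failure takes the site offline.
        return "error"
    # One or more backends lost, but plenty of capacity remains. Email only.
    return "warning"
```

So losing one server out of 21 yields only a Warning, while losing the 20th of the remaining two escalates to an Error, and losing the last one is Critical.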
When deciding what level to assign alerts, consider the primary function of the device. For example, the function of a NetApp storage array is to serve read and write IO requests, so monitoring for NetApps should focus primarily on the availability and performance (latency) of those read and write requests. If a volume is servicing requests with high latency – say, 70 ms per write request – that should be an Error level alert. (In some enterprises it may be appropriate to configure that as Critical, but usually a Critical performance alert should be triggered only if end-application performance degrades unacceptably.) However, if CPU load on the NetApp is at 99% for a period, alarming as it sounds, I’d suggest treating that as a Warning level alert only. If latency is not impacted, why wake people at night? Send an email alert so the issue can be investigated, but if the function of the device is not impaired, do not overreact. (If you wish, you can define your alert escalations so that such conditions result in pages if uncorrected or unacknowledged for more than, say, 5 hours.)
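The function-first principle can be sketched as follows – a hypothetical example, with made-up metric names and the illustrative thresholds from the text (70 ms write latency, and a 95% CPU warning level of my own choosing):

```python
WRITE_LATENCY_ERROR_MS = 70  # latency impairs the device's primary function: Error
CPU_WARNING_PCT = 95         # high CPU alone does not: Warning (email only)

def storage_alerts(write_latency_ms, cpu_pct):
    """Return (level, message) pairs: latency drives Error, CPU only Warning."""
    alerts = []
    if write_latency_ms >= WRITE_LATENCY_ERROR_MS:
        alerts.append(("error", f"write latency {write_latency_ms} ms"))
    if cpu_pct >= CPU_WARNING_PCT:
        # The device is busy but still doing its job; investigate by email,
        # and let escalation rules page later if it goes unacknowledged.
        alerts.append(("warning", f"CPU load {cpu_pct}%"))
    return alerts
```

The point of the structure is that the severity is attached to the device's function (serving IO with acceptable latency), not to every metric that happens to look scary.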
Alert Overload is a bigger danger to most datacenters than most people realise. The thought is often “if one alert is good, more must be better.” Instead, focus on identifying the primary functions of devices – set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high, and CPU load is also in alert, that obviously helps diagnose the issue, instead of looking for unusual volume activity.)
Reserve Critical alerts for system performance and availability as a whole.
With LogicMonitor hosted monitoring, the alert definitions for all data center devices have their alert thresholds predefined in the above manner – that’s one way we help provide meaningful monitoring in minutes.
Monitoring System Sprawl
This is often a corollary to the first point, not relying on manual processes. The number of monitoring systems you have in place should approach 1. You do not want one system to monitor Windows servers, another for Linux, another for MySQL, another for storage. Even if they are all capable of automatic updates, filtering and classifying, having multiple systems still virtually guarantees suboptimal datacenter performance. What happens when the DBA changes his pager address, and the contact information is updated in the escalation methods of 2 systems, but not the other 2? What happens when scheduled maintenance is specified in one system, but not in another that is tracking a different component of the systems undergoing maintenance?
Continuing the series on common datacenter monitoring mistakes…
This is one of the most dangerous conditions. If you have too many noisy alerts that go off too frequently, people will tune them out – and then when you get real, service-impacting alerts, those will be tuned out, too. I’ve seen critical production service outage alerts put into scheduled downtime for 8 hours because the admin assumed it was “another false alert”. How do you prevent this?
Continuing on from Part 1
No issue should be considered resolved if monitoring will not detect its recurrence.
Even with good monitoring practices in place, outages will occur. Best practices dictate that the issue not be considered resolved until monitoring is in place to detect the root cause, or provide earlier warning. For example, if a Java application experiences a service affecting outage due to a large number of users overloading the system, the earliest warning of an impending issue may be an increase in the number of busy threads, which can be tracked via JMX monitoring. An alert threshold should be placed on this metric, to give advance warning before the next event, which could allow time to add another system to share the load, or activate load shedding mechanisms, and so on. (LogicMonitor automatically includes alerts for JMX enabled applications such as Tomcat and Resin when the active threads are approaching the maximum configured – but such alerts should be present for all applications, on all monitoring systems.)
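As a rough sketch of the kind of threshold described above – assuming Tomcat-style thread pool metrics (currently busy threads and the configured maximum, as exposed via JMX), with an illustrative 80% warning ratio:

```python
def thread_pool_alert(busy_threads: int, max_threads: int,
                      warn_ratio: float = 0.8) -> bool:
    """Alert when busy threads approach the configured maximum.

    Firing at a fraction of the maximum (80% here, purely illustrative)
    gives advance warning before the pool is exhausted, leaving time to
    add capacity or activate load shedding.
    """
    return busy_threads >= warn_ratio * max_threads
```

For a pool with maxThreads of 200, this would fire at 160 busy threads – before users see failures, rather than after.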
This is a very important principle – just because things are working again does not mean an issue should be closed, unless you are happy with the warning your monitoring gave before the issue started, and with the kinds of alerts and alert escalations that occurred during it. It’s possible that an issue has no way to be warned of in advance (a sudden system panic, for example), but this process of evaluation should be undertaken for every service-impacting event.
Everyone knows they need monitoring to ensure their site uptime and keep their business humming. Yet many sites still suffer from outages that are first reported by their customers. Here at LogicMonitor, we have lots of experience with monitoring systems of all kinds, and these are some of the most common mistakes we have seen, and how to address them – so that you can know about issues before they affect your business:
Relying on people and human processes to ensure things are monitored.
People are funny, lovable, amazing creatures, but they are not very reliable. A situation we have seen many times is that during the heat of a crisis (say, you were lucky enough to get Slashdotted), some change is made to some data center equipment (such as adding a new volume to your NetApp so that it can serve as high speed storage for your web tier). But in the heat of the moment, the new volume is not put into your NetApp monitoring. (“I’ll get to that later” are famous last words.)
After the crisis is over, everyone is too busy breathing sighs of relief to worry about adding that new volume into monitoring – so when it fills up in 6 months, or starts having latency issues due to a high rate of IO operations, no one is alerted, and customers are the first to call in and complain. The CTO is the next to call. Uh oh.
One of LogicMonitor’s design goals has always been to remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable. We do this in a few different ways:
In short, never depend on monitoring to be manually updated to cover adds, moves and changes. Because you know it doesn’t happen.