LogicMonitor is, as far as I know, the most automated network monitoring system out there. But there is one area where we don’t provide much automation, even though we are often asked about it – automated scripts run in response to alerts. There are a few reasons why not, which flow from our experience running critical production datacenters:
In all these cases, use your monitoring to tell you whether your recovery mechanisms are working, not to be the recovery mechanisms. Monitor the memory usage of your mongrel processes, say, and alert only if memory consumption stays higher than you expect for longer than it would if monit were doing its job.
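That idea can be sketched in a few lines of Python. This is illustrative only – the threshold, the grace period, and the sample format are assumptions, not monit’s actual settings:

```python
# Alert only if memory stays above the threshold for longer than the
# supervisor (e.g. monit) should have taken to restart the process.
# Both numbers below are illustrative assumptions.
MEM_THRESHOLD_MB = 500   # hypothetical per-process memory limit
GRACE_PERIOD_S = 300     # longer than monit's check-and-restart cycle

def should_alert(samples):
    """samples: list of (timestamp_s, memory_mb) tuples, oldest first."""
    over_since = None
    for ts, mem in samples:
        if mem > MEM_THRESHOLD_MB:
            if over_since is None:
                over_since = ts
            if ts - over_since >= GRACE_PERIOD_S:
                return True   # monit should have restarted it by now
        else:
            over_since = None  # memory recovered (or process was restarted)
    return False
```

A restart by monit drops memory back under the threshold and resets the timer, so the alert fires only when the recovery mechanism itself has failed.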
Of course, LogicMonitor can trigger automated script actions in response to alerts – you can set up an agent inside your datacenter to pull all the alerts and send them to a script, which can do … whatever you can script. And there are cases where that’s appropriate. But you should have a good think about your architecture and design before you leap to that as a first resort.
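If you do go that route, the usual shape is a dispatch table mapping alert names to remediation handlers, with anything unmapped left for a human. Everything in this sketch – the alert field names, the handler functions, the feed of alerts – is hypothetical, not LogicMonitor’s actual API:

```python
# Hypothetical remediation handlers; names and actions are illustrative.
def restart_web_tier(alert):
    return f"restarted web tier on {alert['host']}"

def clear_tmp(alert):
    return f"cleared /tmp on {alert['host']}"

# Map alert names to automated actions. Anything not listed here
# deliberately falls through to normal human escalation.
HANDLERS = {
    "HTTP_down": restart_web_tier,
    "disk_full_tmp": clear_tmp,
}

def dispatch(alerts):
    """alerts: list of dicts with (at least) 'name' and 'host' keys."""
    actions = []
    for alert in alerts:
        handler = HANDLERS.get(alert["name"])
        if handler:
            actions.append(handler(alert))
    return actions
```

The explicit whitelist of handled alerts is the point: automation acts only where you have decided in advance that a scripted response is safe.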
One of the difficulties in IT environments is that redundancy can sometimes make outages worse. The problem is that redundancy often gives people (mostly justified) confidence in the availability of their systems, so they design architectures on the assumption that their core switch (or database, or load balancing cluster, or what have you) will not go down.
And they even have monitoring.
But they don’t monitor the state of the redundant server or component. So then the redundant server or component fails, or is unplugged, or synchronization fails, or what have you, and stays that way for weeks with no one noticing. Then the active server or component fails, the other one is already out of commission – and boom – Bad Things happen.
So if you run redundant supervisor modules in your core switches to get high availability, make sure your Cisco switch monitoring is capable of monitoring them. Same for redundant power supplies.
Same for active-standby NetScalers, F5 BIG-IPs, NetApp clusters, or anything else that you want to make sure works when needed.
If it’s not monitored, chances are it won’t be there when you need it.
When designing infrastructure architecture, there is usually a choice between complexity and fault tolerance. It’s not just an inverse relationship, however. It’s a curve. You want the minimal complexity possible to achieve your availability goals. And you may even want to reduce your availability goals to reduce your complexity (which will end up increasing your availability.)
The rule to adopt is: if you don’t understand something well enough that it seems simple to you (or your staff), even in its failure modes, you are better off without it.
Back in the day, clever people suggested that most web sites would have the best availability by running everything – DB, web application, everything – on a single server. This was the simplest configuration, and the easiest to understand.
With no complexity – one of everything (one switch, one load balancer, one web server, one database, for example) – you can tolerate zero failures, but it’s easy to know when there is a failure.
With 2 of everything, connected the right way, you can keep running with one failure, but you may not be aware of the failure.
So is it a good idea to add more connections, and plan to be able to tolerate multiple failures? Not usually. For example, with a redundant pair of load balancers, you can connect one load balancer to one switch, and the other load balancer to another switch. In the event of a load balancer failure, the surviving load balancer will automatically take over, and all is good. If a switch fails, it may be the one that the active load balancer is connected to – this would also trigger a load balancer fail over, and everything is still running correctly. It would be possible to connect each load balancer to each switch, so that failure of a switch does not impact the load balancers, but is it worth it?
This would allow the site to survive two simultaneous unrelated failures – one switch and one load balancer – but the added complexity of engineering the multiple traffic paths increases the likelihood that something will go wrong in one of the 4 possible states. There are now 4 possible traffic paths instead of 2 – so more testing is needed, more maintenance is needed on any change, and so on. The benefit seems outweighed by the complexity.
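The growth in testing burden can be made concrete with a little counting. The sketch below just enumerates the load-balancer-to-switch pairings in the hypothetical topology described above; the device names are placeholders:

```python
from itertools import product

def traffic_paths(load_balancers, switches, cross_connected):
    """Each (load balancer, switch) pairing is a path that must be
    engineered and tested."""
    if cross_connected:
        # every load balancer can reach every switch
        return list(product(load_balancers, switches))
    # each load balancer is wired to exactly one switch
    return list(zip(load_balancers, switches))

simple = traffic_paths(["lb1", "lb2"], ["sw1", "sw2"], cross_connected=False)
meshed = traffic_paths(["lb1", "lb2"], ["sw1", "sw2"], cross_connected=True)
# len(simple) == 2, len(meshed) == 4
```

Cross-connecting a 2×2 topology doubles the paths from 2 to 4, and every one of them needs to be exercised on every change.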
The same concept – “if it seems complex, it doesn’t belong” – can be applied to software, too. Load balancing, whether via an appliance such as Citrix NetScalers or software such as HAProxy, is simple enough for most people nowadays. The same is not generally true of clustered file systems, or DRBD. If you truly need these technologies, you had better have a thorough understanding of them, invest the time to induce all the failure modes you can, and train your staff so that it is not complex for them to deal with any of the failures.
If you have a consultant come in and set up BGP routing, but no one on your NOC or on call staff knows how to do anything with BGP, you just greatly reduced your site’s operational availability.
The “Complexity Filter” can be applied to monitoring systems as well. If your monitoring system stops and you don’t have staff immediately available to troubleshoot and restart the service processes; or if the majority of your staff cannot easily interpret the monitoring, create new checks, or use it to see trends over time – your monitoring is not contributing to your operational uptime. It is instead a resource sink, and is likely to bite you when you least expect it. Datacenter monitoring, like all things in your datacenter, should be as automated and simple as possible.
If it seems complex – it will break. Learn it until it’s not complex, or do without it.
One question that often arises in monitoring is how to define alert levels and escalations, and what level to set various alerts at – Critical, Error or Warning.
Assuming you have Error and Critical alerts set to notify teams by pager or phone, with Critical alerts on a shorter escalation time, here are some simple guidelines:
Critical alerts should be for events that have an immediate customer-impacting effect. For example, a production Virtual IP on a monitored load balancer going down, as it has no available services to route traffic to. The site is down, so page everyone.
Error alerts should be for events that require immediate attention, and that, if unresolved, increase the likelihood that a production-affecting event will occur. To continue with the load balancer example, an error should be triggered if the Virtual IP only has one functioning backend server to route traffic to – there is now no redundancy, so one failure can take the site offline.
Warnings, which we typically recommend be sent by email only, are for all other kinds of events. The loss of a single backend server from a Virtual IP when there are 20 other servers functioning does not warrant anyone being woken in the night.
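The three guidelines above reduce to a very small decision rule. Here it is as a sketch, using the Virtual IP example – the function and its return labels are illustrative, not a LogicMonitor feature:

```python
def vip_alert_level(healthy_backends):
    """Severity of a Virtual IP alert, based on how many healthy
    backend servers remain behind it."""
    if healthy_backends == 0:
        return "critical"   # site is down: page everyone
    if healthy_backends == 1:
        return "error"      # no redundancy left: needs immediate attention
    return "warning"        # degraded but still redundant: email only
```

Losing one server out of 21 lands in the last branch: an email in the morning, not a page at 3 a.m.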
When deciding what level to assign alerts, consider the primary function of the device. For example, in the case of a NetApp storage array, the function of the device is to serve read and write IO requests. So the primary focus when monitoring NetApps should be the availability and performance (latency) of these read and write requests. If a volume is servicing requests with high latency – such as 70 ms per write request – that should be an Error level alert (in some enterprises, that may be appropriate to configure as a Critical level alert, but usually a Critical performance alert should be triggered only if the end-application performance degrades unacceptably.) However, if CPU load on the NetApp is 99% for a period, even though it sounds alarming, I’d suggest that be treated as a Warning level alert only. If latency is not impacted, why wake people at night? Send an email alert so the issue can be investigated, but if the function of the device is not impaired, do not overreact. (If you wish, you can define your alert escalations so that such conditions result in pages if uncorrected or unacknowledged for more than 5 hours, say.)
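The function-first approach above can be sketched as code. The specific numbers (70 ms, 99%, 5 hours) come from the example; the function names and structure are illustrative, not how any particular monitoring product is configured:

```python
LATENCY_ERROR_MS = 70      # write latency that impairs the device's function
CPU_WARN_PCT = 99          # alarming-sounding symptom, but function may be fine
ESCALATE_AFTER_S = 5 * 3600  # a warning ignored this long becomes a page

def storage_alerts(write_latency_ms, cpu_pct):
    """Return (level, reason) pairs for a storage array's metrics."""
    alerts = []
    if write_latency_ms >= LATENCY_ERROR_MS:
        alerts.append(("error", "write latency"))  # function impaired: page
    if cpu_pct >= CPU_WARN_PCT:
        alerts.append(("warning", "cpu load"))     # symptom only: email
    return alerts

def escalate(level, unacknowledged_s):
    """A warning left unacknowledged long enough is promoted to a page."""
    if level == "warning" and unacknowledged_s >= ESCALATE_AFTER_S:
        return "error"
    return level
```

Note that 99% CPU alone never pages anyone immediately; it only escalates if it sits unacknowledged past the cutoff.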
Alert Overload is a bigger danger to most datacenters than most people realize. The thought is often “if one alert is good, more must be better.” Instead, focus on identifying the primary functions of devices – set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high, and CPU load is also in alert, that obviously helps diagnose the issue, instead of looking for unusual volume activity.)
Reserve Critical alerts for system performance and availability as a whole.
With LogicMonitor hosted monitoring, the alert definitions for all data center devices have their alert thresholds predefined in the above manner – that’s one way we help provide meaningful monitoring in minutes.
Monitoring System Sprawl
This is often a corollary to the first point, not relying on manual processes. The number of monitoring systems you have in place should approach 1. You do not want one system to monitor Windows servers, another for Linux, another for MySQL, another for storage. Even if they are all capable of automatic updates, filtering and classifying, having multiple systems still virtually guarantees suboptimal datacenter performance. What happens when the DBA changes their pager address, and the contact information is updated in the escalation methods of 2 systems, but not the other 2? What happens when scheduled maintenance is specified in one system, but not in another that is tracking another component of the systems undergoing maintenance?
Continuing the series on common datacenter monitoring mistakes…
This is one of the most dangerous conditions. If you have too many noisy alerts that go off too frequently, people will tune them out – and then the real, service-impacting alerts will be tuned out, too. I’ve seen critical production service outage alerts put into scheduled downtime for 8 hours, because the admin assumed it was “another false alert”. How do you prevent this?
Continuing on from Part 1
No issue should be considered resolved if monitoring will not detect its recurrence.
Even with good monitoring practices in place, outages will occur. Best practices dictate that the issue not be considered resolved until monitoring is in place to detect the root cause, or provide earlier warning. For example, if a Java application experiences a service-affecting outage due to a large number of users overloading the system, the earliest warning of an impending issue may be an increase in the number of busy threads, which can be tracked via JMX monitoring. An alert threshold should be placed on this metric to give advance warning before the next event, which could allow time to add another system to share the load, activate load shedding mechanisms, and so on. (LogicMonitor automatically includes alerts for JMX enabled applications such as Tomcat and Resin when the active threads are approaching the maximum configured – but such alerts should be present for all applications, on all monitoring systems.)
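The thread-count early warning described above amounts to a simple ratio check. In this sketch the 0.9 ratio is an assumption, not a LogicMonitor default; in practice the inputs would come from JMX (e.g. a Tomcat ThreadPool MBean’s busy and maximum thread counts):

```python
# Warn before the thread pool is exhausted, not after. The ratio is
# an illustrative assumption; tune it to your application's behavior.
WARN_RATIO = 0.9

def threads_warning(busy, max_threads):
    """True if busy threads are approaching the configured maximum."""
    return busy / max_threads >= WARN_RATIO
```

A pool at 180 of 200 threads trips the warning while the application is still serving requests, which is exactly the head start you want before the next overload.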
This is a very important principle – just because things are working again does not mean an issue should be closed, unless you are happy both with the warning your monitoring gave before the issue started and with the alerts and alert escalations that occurred during it. It’s possible that the issue is one with no way to warn in advance (for example, a sudden panic of a system), but this process of evaluation should be undertaken for every service-impacting event.
Everyone knows they need monitoring to ensure their site uptime and keep their business humming. Yet many sites still suffer from outages that are first reported by their customers. Here at LogicMonitor, we have lots of experience with monitoring systems of all kinds, and these are some of the most common mistakes we have seen, and how to address them – so that you can know about issues before they affect your business:
Relying on people and human processes to ensure things are monitored.
People are funny, lovable, amazing creatures, but they are not very reliable. A situation we have seen many times is that during the heat of a crisis (say, you were lucky enough to get Slashdotted), some change is made to some data center equipment (such as adding a new volume to your NetApp so that it can serve as high speed storage for your web tier). But in the heat of the moment, the new volume is not put into your NetApp monitoring. (“I’ll get to that later” are famous last words.)
After the crisis is over, everyone is too busy breathing sighs of relief to worry about adding that new volume into monitoring – so when it fills up in 6 months, or starts having latency issues due to high IO operations, no one is alerted, and customers are the first to call in and complain. The CTO is the next to call. Uh oh.
One of LogicMonitor’s design goals has always been to remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable. We do this in a few different ways:
In short, never depend on monitoring to be manually updated to cover adds, moves and changes. Because you know it doesn’t happen.