
One question that often arises in monitoring is how to define alert levels and escalations, and what level to set various alerts at – Critical, Error or Warning.

Assuming you have Error and Critical alerts set to notify teams by pager or phone, with Critical alerts on a shorter escalation time, here are some simple guidelines:

Critical alerts should be for events with an immediate, customer-impacting effect. For example, a production Virtual IP on a monitored load balancer going down because it has no available backend services to route traffic to. The site is down, so page everyone.

Error alerts should be for events that require immediate attention and that, if unresolved, increase the likelihood of a production-affecting event. To continue with the load balancer example, an error should be triggered if the Virtual IP has only one functioning backend server to route traffic to – there is now no redundancy, so a single failure can take the site offline.

Warnings, which we typically recommend be sent by email only, are for all other kinds of events. The loss of a single backend server from a Virtual IP when there are 20 other servers functioning does not warrant anyone being woken in the night.

When deciding what level to assign alerts, consider the primary function of the device. For example, in the case of a NetApp storage array, the function of the device is to serve read and write IO requests, so the primary focus of NetApp monitoring should be the availability and performance (latency) of those read and write requests. If a volume is servicing requests with high latency – such as 70 ms per write request – that should be an Error level alert. (In some enterprises it may be appropriate to configure that as a Critical level alert, but usually a Critical performance alert should be triggered only if end-application performance degrades unacceptably.) However, if CPU load on the NetApp is 99% for a period, even though it sounds alarming, I’d suggest that be treated as a Warning level alert only. If latency is not impacted, why wake people at night? Send an email alert so the issue can be investigated, but if the function of the device is not impaired, do not overreact. (If you wish, you can define your alert escalations so that such conditions result in pages if uncorrected or unacknowledged for more than, say, 5 hours.)
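To make that concrete, here is a minimal sketch in Python of picking an alert level from the device’s primary function rather than from secondary symptoms. The function name and thresholds are purely illustrative, not a recommendation for your environment:

```python
# Hypothetical sketch only: choose an alert level for a NetApp volume based on
# whether its primary function (serving IO with acceptable latency) is
# impaired, not on secondary symptoms such as CPU load.

def netapp_alert_level(write_latency_ms, cpu_percent):
    """Return 'error', 'warning', or None for a storage volume."""
    if write_latency_ms >= 70:
        # The device's primary function is impaired: notify by pager.
        return "error"
    if cpu_percent >= 99:
        # Sounds alarming, but latency is fine, so the device is still doing
        # its job: email only, and escalate later if it stays uncorrected.
        return "warning"
    return None

print(netapp_alert_level(write_latency_ms=85, cpu_percent=40))    # error
print(netapp_alert_level(write_latency_ms=5, cpu_percent=99.5))   # warning
```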

Alert Overload is a bigger danger to most datacenters than most people realise. The thought is often “if one alert is good, more must be better.” Instead, focus on identifying the primary functions of devices – set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high and CPU load is also in alert, that immediately helps diagnose the issue, instead of leaving you hunting for unusual volume activity.)

Reserve Critical alerts for system performance and availability as a whole.

With LogicMonitor hosted monitoring, the alert thresholds for all data center devices are predefined in the above manner – that’s one way we help provide meaningful monitoring in minutes.


Monitoring System Sprawl
This is often a corollary to the first point (not relying on manual processes). The number of monitoring systems you have in place should approach 1. You do not want one system to monitor Windows servers, another for Linux, another for MySQL, another for storage. Even if they are all capable of automatic updates, filtering and classifying, having multiple systems still virtually guarantees suboptimal datacenter performance. What happens when the DBA changes his pager address, and the contact information is updated in the escalation methods of two systems, but not the other two? What happens when scheduled maintenance is specified in one system, but not in another that is tracking a different component of the systems undergoing maintenance?

You will end up with alerts that are not routed correctly, and alert overload.  You may also end up with a system that notifies people about issues they have no ability to acknowledge, leading to “Oh…I turned my pager off…”
A variant of this problem is when your DBAs, sysadmins or others ‘automate’ things by writing cron jobs or stored procedures to check and alert on things. The first part is great – the checking. The alerting, however, should happen through your monitoring system. Just have the monitoring system run the script and check the output, call the stored procedure, or read the web page. You do not want yet another place to adjust thresholds, acknowledge alerts, deal with escalations, and so on – all things your sysadmins’ scripts are unlikely to handle.
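For instance, a check script in this model might look like the hypothetical sketch below (the mysql invocation and the key=value output format are assumptions, not a prescription). The script only collects and prints a value; the monitoring system that runs it owns the thresholds, alerting, escalation and acknowledgement:

```python
#!/usr/bin/env python3
# Hypothetical check script: it only collects and prints a metric.
# Thresholds, alerting, escalation and acknowledgement are left to the
# monitoring system, which runs this script and parses its output.
import subprocess
import sys


def replication_lag_seconds():
    """Illustrative only: read Seconds_Behind_Master via the mysql CLI."""
    out = subprocess.run(
        ["mysql", "-e", "SHOW SLAVE STATUS\\G"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "Seconds_Behind_Master" in line:
            value = line.split(":", 1)[1].strip()
            return int(value) if value.isdigit() else -1
    return -1  # not a replica, or replication not configured


if __name__ == "__main__":
    # Emit a simple key=value line for the monitoring system to parse.
    print(f"replication_lag_seconds={replication_lag_seconds()}")
    sys.exit(0)
```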


Continuing the series on common datacenter monitoring mistakes…

Alert overload
This is one of the most dangerous conditions. If you have too many noisy alerts that go off too frequently, people will tune them out – and then the real, service-impacting alerts will be tuned out, too. I’ve seen a critical production service outage alert put into scheduled downtime for 8 hours because the admin assumed it was “another false alert”. How to prevent this?

  • start with sensible defaults, and sensible escalation policies. Distinguish between warnings (which admins should be aware of, but which do not require immediate action) and error or critical level alerts, which require pager notifications. (No need to wake people if NTP is out of synchronization – but if the primary database volume is experiencing 200 ms latency for read requests from its disk storage, and end user transaction time is now 8 seconds, then it’s all hands on deck.)
  • route the right set of alerts to the right group of people. There is no point in the DBA being alerted about network issues, or vice versa.
  • make sure you tune your thresholds appropriately. Every alert should be real and meaningful. If any alerts are ‘false positives’ (such as alerts about QA systems being down), tune the monitoring. LogicMonitor alerts are easily tuned at the global, host or group level, or even for an individual instance (such as a file system or an interface); and ActiveDiscovery filters make it simple to classify discovered instances into the appropriate group, with the appropriate alert levels. A common example is to configure all load balancing VIPs or storage system volumes with “stage” or “qa” in the name to have no error or critical alerts – this then applies to all VIPs or volumes created now and in the future, on all devices, greatly simplifying alert management. (A sketch of this classification idea follows the list.)
  • ensure alerts are acknowledged, dealt with, and cleared. You don’t want to see hundreds of alerts on your monitoring system. For large enterprises, make sure you can see a filtered view of just the groups of systems you are responsible for, allowing focus. You should also periodically sort alerts by duration, and focus on cleaning out those that have been in alert for longer than a day.
  • Another useful report is an analysis of your top alerts, by host or by alert type. Investigate whether there are issues in monitoring, in the systems, or in operational processes that could reduce the frequency of these alerts.
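As a sketch of the classification idea mentioned above (this is not LogicMonitor’s actual API, and the name patterns are only examples), discovered instances whose names look like staging or QA can be dropped into a group that never raises anything above a warning:

```python
# Minimal sketch of discovery-time classification: instances whose names look
# like staging or QA go into a group whose alerts stop at "warning", so the
# rule automatically covers every VIP or volume discovered now or later.
import re

QA_PATTERN = re.compile(r"(stage|staging|qa)", re.IGNORECASE)

def classify_instance(name):
    """Return (group, max_alert_level) for a newly discovered instance."""
    if QA_PATTERN.search(name):
        return "non-production", "warning"   # never pages anyone
    return "production", "critical"          # full escalation applies

for vip in ["web-prod-vip", "web-stage-vip", "checkout-qa-vip"]:
    print(vip, classify_instance(vip))
```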


Continuing on from Part 1

No issue should be considered resolved if monitoring will not detect its recurrence.

Even with good monitoring practices in place, outages will occur.  Best practices dictate that the issue not be considered resolved until monitoring is in place to detect the root cause, or provide earlier warning.  For example, if a Java application experiences a service affecting outage due to a large number of users overloading the system, the earliest warning of an impending issue may be an increase in the number of busy threads, which can be tracked via JMX monitoring.  An alert threshold should be placed on this metric, to give advance warning before the next event, which could allow time to add another system to share the load, or activate load shedding mechanisms, and so on.  (LogicMonitor automatically includes alerts for JMX enabled applications such as Tomcat and Resin when the active threads are approaching the maximum configured – but such alerts should be present for all applications, on all monitoring systems.)
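Here is a hedged sketch of the “approaching the configured maximum” idea, assuming the busy-thread and maximum-thread values have already been collected over JMX from a thread pool such as Tomcat’s or Resin’s. The 80% and 90% cutoffs are illustrative only:

```python
# Hypothetical sketch: alert before the thread pool is exhausted, given
# busy-thread and max-thread values already collected over JMX.

def thread_pool_alert_level(busy_threads, max_threads):
    utilisation = busy_threads / max_threads
    if utilisation >= 0.90:
        return "error"      # about to run out of threads: page someone
    if utilisation >= 0.80:
        return "warning"    # early warning: add capacity or shed load
    return None

print(thread_pool_alert_level(busy_threads=185, max_threads=200))  # error
```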

This is a very important principle – just because things are working again does not mean an issue should be closed, unless you are happy with the warning your monitoring gave before the issue started, and with the alerts and alert escalations that occurred during it. It’s possible the issue is one with no way to warn in advance (for example, a sudden system panic), but this evaluation should be undertaken for every service impacting event.


Everyone knows they need monitoring to ensure their site uptime and keep their business humming.  Yet many sites still suffer from outages that are first reported by their customers.  Here at LogicMonitor, we have lots of experience with monitoring systems of all kinds, and these are some of the most common mistakes we have seen, and how to address them – so that you can know about issues before they affect your business:

Relying on people and human processes to ensure things are monitored.
People are funny, lovable, amazing creatures, but they are not very reliable.  A situation we have seen many times is that during the heat of a crisis (say, you were lucky enough to get Slashdotted), some change is made to some data center equipment (such as adding a new volume to your NetApp so that it can serve as high speed storage for your web tier).  But in the heat of the moment, the new volume is not put into your NetApp monitoring.  (“I’ll get to that later” are famous last words.)
After the crisis is over, everyone is too busy breathing sighs of relief to worry about adding that new volume into monitoring – so when it fills up in 6 months, or starts having latency issues due to a high rate of IO operations, no one is alerted, and customers are the first to call in and complain. The CTO is the next to call. Uh oh.

One of LogicMonitor’s design goals has always been to remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable.  We do this in a few different ways:

  • LogicMonitor’s ActiveDiscovery (TM) process continuously scans all monitored devices for changes, automatically adding new volumes, interfaces, load balancer VIPs, database instances, etc into monitoring, and informing you via email in real time (or batched notifications, as you prefer).  However, in order to avoid Alert Overload, you’ll need to ensure your monitoring supports filtering and classification of discovered objects.
  • LogicMonitor can scan your subnets, or even your Amazon EC2 account, and add new machines or instances to monitoring automatically.
  • Just as important, graphs and dashboards should not have to be manually updated. If you have a dashboard graph, commonly used to view the health of your service, that is the sum of the sessions on 10 web servers, what happens when you add 4 more servers? A good monitoring system will automatically add them to your dashboard graph (see the sketch below). A poor system will require you to remember to update all the places where such information is aggregated, virtually ensuring your “business overview” will be anything but.
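As a toy illustration of that last point, the aggregation below is driven by whatever hosts monitoring currently knows about, not by a hardcoded server list. The helper names are hypothetical stand-ins for whatever your monitoring system actually provides:

```python
# Hypothetical sketch: the dashboard value sums sessions over the hosts the
# monitoring system has discovered, so adding four more web servers changes
# the graph with no manual edits anywhere.

def total_web_sessions(get_monitored_hosts, get_metric):
    hosts = get_monitored_hosts(group="web-tier")   # discovered, not hardcoded
    return sum(get_metric(h, "active_sessions") for h in hosts)

# Toy stand-ins so the sketch runs on its own:
fake_inventory = {"web-tier": ["web01", "web02", "web03"]}
fake_metrics = {"web01": 120, "web02": 98, "web03": 133}

print(total_web_sessions(
    lambda group: fake_inventory[group],
    lambda host, metric: fake_metrics[host],
))
```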

In short, never depend on monitoring to be manually updated to cover adds, moves and changes. Because you know it doesn’t happen.
