Top I.T./Datacenter Monitoring Mistakes, Part 3 in a series.

November 6, 2009 – 10:06 am

Continuing the series on common datacenter monitoring mistakes…

Alert overload
This is one of the most dangerous conditions. If too many noisy alerts go off too frequently, people will tune them out, and when real, service-impacting alerts arrive, those will be tuned out too. I’ve seen a critical production service outage alert put into scheduled downtime for 8 hours because the admin assumed it was “another false alert”. How do you prevent this?

  • start with sensible defaults and sensible escalation policies. Distinguish between warnings (which admins should be aware of, but which do not require immediate action) and error or critical level alerts, which require pager notifications. (There is no need to wake people if NTP is out of synchronization, but if the primary database volume is seeing 200 ms latency for read requests from its disk storage, and end user transaction time is now 8 seconds, then it is all hands on deck.)
  • route the right set of alerts to the right group of people. There is no point in the DBA being alerted about network issues, or vice versa.
  • make sure you tune your thresholds appropriately. Every alert should be real and meaningful. If any alerts are ‘false positives’ (such as alerts about QA systems being down), tune the monitoring. LogicMonitor alerts are easily tuned at the global, host or group level, or even for an individual instance (such as a file system or an interface), and ActiveDiscovery filters make it simple to classify discovered instances into the appropriate group, with the appropriate alert levels. A common example is to set all discovered load balancing VIPs or storage system volumes with “stage” or “qa” in the name to have no error or critical alerts. This then applies to all such VIPs or volumes created now and in the future, on all devices, which greatly simplifies alert management.
  • ensure alerts are acknowledged, dealt with, and cleared. You don’t want to see hundreds of alerts on your monitoring system. For large enterprises, make sure you can see a filtered view of just the groups of systems you are responsible for, so you can focus on them. You should also periodically sort alerts by duration, and focus on cleaning out those that have been in alert for longer than a day.
  • Another useful report analyzes your top alerts, by host or by alert type. Investigate whether there are issues in the monitoring, the systems, or the operational processes that can be fixed to reduce the frequency of these alerts.
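
To make the points above concrete, here is a minimal sketch in Python of how severity-based routing, name-based tuning, and the two reports might fit together. It is not LogicMonitor's API or configuration format: the `Alert` fields, the severity names, the "pager" and "email" channels, and all function names are hypothetical, chosen only for illustration. The "stage"/"qa" check is a plain substring match on the instance name, as in the example above.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical severity levels and name markers (assumptions for this sketch).
WARNING, ERROR, CRITICAL = "warning", "error", "critical"
NON_PROD_MARKERS = ("stage", "qa")


@dataclass
class Alert:
    host: str
    instance: str        # e.g. a volume or VIP name
    alert_type: str      # e.g. "read_latency"
    severity: str
    team: str            # owning group, e.g. "dba" or "network"
    hours_open: float = 0.0


def effective_severity(alert):
    """Cap alerts on instances with 'stage' or 'qa' in the name at warning."""
    name = alert.instance.lower()
    if any(marker in name for marker in NON_PROD_MARKERS):
        return WARNING
    return alert.severity


def route(alert):
    """Return (team, channel); only error and critical alerts page someone."""
    paged = effective_severity(alert) in (ERROR, CRITICAL)
    return alert.team, "pager" if paged else "email"


def top_alerts(alerts, n=3):
    """Count alerts by (host, alert_type) to find the noisiest sources."""
    return Counter((a.host, a.alert_type) for a in alerts).most_common(n)


def stale_alerts(alerts, max_hours=24):
    """Alerts that have been open longer than max_hours, longest first."""
    stale = [a for a in alerts if a.hours_open > max_hours]
    return sorted(stale, key=lambda a: a.hours_open, reverse=True)
```

With this sketch, a critical alert on a volume named `vol_prod_01` owned by the DBA group pages the DBAs, while a critical alert on a VIP named `vip_qa_web` is capped at warning and only sends email to the network group. `top_alerts` and `stale_alerts` correspond to the "top alerts" and "sort by duration" reports described in the list.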

2 Responses to “Top I.T./Datacenter Monitoring Mistakes, Part 3 in a series.”

  1. [...] Alert Overload is a bigger danger to most datacenters than most people realise. The thought is often “if one alert is good, more must be better.” Instead, focus on identifying the primary functions of devices – set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high, and CPU load is also in alert, that obviously helps diagnose the issue, instead of looking for unusual volume activity.) [...]

  2. [...] said it before and we’ll probably say it again, but alert overload is dangerous, no matter what the [...]
