
Tag Archive alerts

Even with a great monitoring system, it can be hard sometimes to keep the noise down. (Indeed, the more powerful the monitoring, the more difficult this can be, as more data is collected and tested, automatically.) And keeping noise down in monitoring is vital, as you do not want staff to start ignoring alerts – which they will if there are too many meaningless alerts.

There are of course best practices to help with this process, but one of the best ways to start attacking your alert noise is also one of the easiest – simply set up a report to highlight where the noise is coming from, and review it once a week.

Under the Reports tab, select New Report, then fill it out as described below – the important thing is to select Alert Report as the report type.

The magic of the report is in the details:

I suggest setting the report to cover the last week, for all hosts (although if you are responsible for only a set of hosts, by all means change the report to reflect only those you are getting alerted about). Exclude alerts that occurred during periods of Scheduled DownTime (SDT), since those alerts would not have been sent out anyway. Check the Summarize Alert Counts box, THEN select the sort method of sorting by alert count. (This sort order is not available until the Summarize Alert Counts box is checked.)

Run this report, and you’ll get output like the below:

 

This makes it very easy to see that, in this case, we could eliminate 80% of the alerts for the last week simply by changing the monitoring on the IPMI event logs of one development host – filtering out alerts, using SDT, or even disabling that monitoring, given it’s just a development host.

We can then work through the top noise makers, tuning, disabling, or fixing issues (such as increasing the MySQL cache on prod5.iad), which will greatly reduce the amount of alert noise with the least work.
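If you ever want to reproduce the same kind of summary from exported alert data, the summarize-and-sort step is easy to sketch. The Python below is only an illustration under stated assumptions, not how the report itself is built: the `Alert` record, its field names, the `summarize_alerts` function, and the sample host `dev1` are all made up for this example.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable, NamedTuple


class Alert(NamedTuple):
    """Hypothetical alert record; the field names are assumptions."""
    host: str
    datasource: str
    in_sdt: bool  # True if the alert fired during Scheduled DownTime


def summarize_alerts(alerts: Iterable[Alert]) -> list[tuple[str, str, int, float]]:
    """Count alerts per (host, datasource), noisiest first.

    Alerts raised during SDT are skipped, since they would not have been
    sent out anyway. Each row is (host, datasource, count, percent_of_total).
    """
    counts = Counter((a.host, a.datasource) for a in alerts if not a.in_sdt)
    total = sum(counts.values())
    # most_common() returns entries sorted by count, highest first.
    return [
        (host, datasource, n, 100.0 * n / total)
        for (host, datasource), n in counts.most_common()
    ]


# Made-up sample data: one noisy development host and one production issue.
sample = (
    [Alert("dev1", "IPMI event log", False)] * 8
    + [Alert("prod5.iad", "MySQL cache", False)] * 2
    + [Alert("dev1", "IPMI event log", True)]  # during SDT, so ignored
)

for host, datasource, count, percent in summarize_alerts(sample):
    print(f"{host:12} {datasource:16} {count:3d} {percent:5.1f}%")
```

Sorting by count puts the worst offender on the first row, which is the whole point: the top one or two rows usually tell you where to spend your tuning effort.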

And then we’ll get this report emailed to us every Monday, so we can stay on top of the issues, and keep our monitoring meaningful. That way, we’ll have improved the performance of our systems, eliminated any alert noise, and if we do get an alert – we can be sure it’s meaningful, and that people will react to it.

 

Tags:

Last night our ops team (of which I am a member) got paged about the CPU load on a Cisco 3560 switch in a new datacenter, late at night. My initial reaction was “We don’t need this alert escalated to pagers or phones – 3560s switch and route in hardware, so CPU load doesn’t matter.” Once I’d woken up a bit more, the corollary – that there is no possible way that this switch should be at a CPU level to trigger an error alert – occurred to me. Read more »

Tags:

Agile Monitoring Support

Filed under SysAdmins, Tips & Troubleshooting.

We recently had a customer come into trial looking for a new monitoring solution. This is always good for us. We love the takeaway (customers defecting from other monitoring systems to us). As in most takeaway situations, this customer had specific needs. There are the obvious ones where LogicMonitor easily fits the bill, such as alerting, dashboards, performance monitoring, etc. (and if you fall into that VMware, Cisco, NetApp sweet spot, game over!). This guy, however, had a very specific need we didn’t fulfill directly out of the gate. I think anyone who has ever worked with a monitoring solution knows that it’s hard to find one that does everything. Well, in the case of LogicMonitor, this is no different. We don’t do EVERYTHING. I know, you thought I was going to get all high and mighty and talk about how LogicMonitor is the one monitoring tool that CAN do everything. Well Read more »

Tags:

We received some alerts tonight that one Tomcat server was using about 95% of its configured thread maximum.

The Tomcat process on http-443  on prod4 now has 96.2 %
of the max configured threads in the busy state.

These were SMS alerts, as that was close enough to exhausting the available threads to warrant waking someone up if needed.
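The arithmetic behind an alert like this is simple enough to sketch. The Python below is a minimal illustration, not the actual datasource logic: the function names are hypothetical, and the 95% paging threshold is an assumption chosen to match the level mentioned above.

```python
def busy_thread_percent(busy_threads: int, max_threads: int) -> float:
    """Return busy threads as a percentage of the configured maximum."""
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    return 100.0 * busy_threads / max_threads


def should_page(busy_threads: int, max_threads: int,
                threshold_percent: float = 95.0) -> bool:
    """Decide whether thread usage is high enough to wake someone up.

    The default threshold is an assumed value for illustration only.
    """
    return busy_thread_percent(busy_threads, max_threads) >= threshold_percent


# Made-up numbers: 481 busy threads out of a configured maximum of 500.
print(f"{busy_thread_percent(481, 500):.1f}% busy, page: {should_page(481, 500)}")
```

Expressing the threshold as a percentage of the configured maximum, rather than as a raw thread count, means the same rule keeps working if the connector's maximum is later raised.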

The other alert we got was that Tomcat was taking an unusual time to process requests, as seen in this graph: Read more »

Tags:
