Monthly Archives

LogicMonitor is, as far as I know, the most automated network monitoring system out there. But there is one area we are often asked about where we don’t provide much automation: automated scripts in response to alerts. There are a few reasons for that, which flow from our experience running critical production datacenters:

  • There are many cases where you don’t want automated recovery: you want a human to pinpoint the cause of failure and ensure the recovery is done safely. For example, after a master database crash, many DBAs don’t want to restart the database without determining the cause, whether transactions need to be backed out, whether slaves are still valid replicas, etc.
  • If a system is important enough to need automated recovery, the right way to do that is to have standby systems, clustered or otherwise available. For example: multiple web servers behind a load balancer; master-master databases; switches with rapid spanning tree; routers with a rapidly converging IGP (OSPF, EIGRP).
  • If a service or process does need to be automatically restarted on a host, the monitoring system is almost certainly not the right way to do it. Use daemontools or init on Linux, or configure the service to restart in the Services control panel on Windows. Using the monitoring system for this remediation will necessarily be more fragile than OS-level tools.
  • If there are processes that need to be killed and restarted in response to the state of monitored metrics (if memory use grows too much because of a leak, say; I’m looking at you, mongrel), then use a tool designed for that, such as monit.

In all these cases, use your monitoring to tell you if your recovery mechanisms are working, not to be the recovery mechanisms. For example, monitor the memory usage of your mongrel processes, and alert only if memory consumption stays higher than you expect for longer than it should if monit were doing its job.
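The "higher than you expect, for longer than it should be" rule can be sketched as a sustained-threshold check. This is only an illustration in Python, not how LogicMonitor or monit evaluates anything; the `SustainedThreshold` class, its field names, and the numbers in the usage note are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SustainedThreshold:
    # Hypothetical rule: both values are placeholders you would tune yourself.
    limit_mb: float        # memory ceiling the process should stay under
    grace_seconds: float   # time the recovery tool should need to act


def should_alert(samples: Iterable[Tuple[float, float]],
                 rule: SustainedThreshold) -> bool:
    """Return True if memory stayed above the limit for longer than the grace period.

    `samples` is a time-ordered sequence of (timestamp_seconds, memory_mb).
    A single sample back under the limit resets the clock, since that is
    what a successful restart by the recovery tool would look like.
    """
    breach_start = None
    for timestamp, memory_mb in samples:
        if memory_mb > rule.limit_mb:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start > rule.grace_seconds:
                return True
        else:
            breach_start = None
    return False
```

With a made-up rule of `SustainedThreshold(limit_mb=300, grace_seconds=120)`, a brief spike that drops back within two minutes stays quiet, while a process that sits above 300 MB for three minutes raises an alert, which suggests the recovery tool itself is not working.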

Of course, LogicMonitor can trigger automated script actions in response to alerts – you can set an agent inside your datacenter to pull all the alerts, send them to a script, which can do … whatever you can script.  And there are cases where that’s appropriate.  But you should have a good think about your architecture and design before you leap to that as a first resort.
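If you do go down that road, one cautious shape for such a script is an explicit allowlist that falls back to notifying a human. The sketch below is purely illustrative: the JSON alert format, the `name` field, and the action names are all hypothetical and are not LogicMonitor's actual alert format or API.

```python
import json

# Hypothetical allowlist: alert names for which an automated action has been
# judged safe in advance. Anything not listed here goes to a human.
SAFE_ACTIONS = {
    "tmp_disk_full": "clean_tmp",
}


def choose_action(alert_json: str) -> str:
    """Pick an action for one alert, defaulting to human notification.

    `alert_json` is assumed to be a JSON object with a "name" key; this
    layout is invented for the example.
    """
    alert = json.loads(alert_json)
    return SAFE_ACTIONS.get(alert.get("name"), "notify_human")
```

The point of the default is that a master database crash, or any alert nobody has thought through, never triggers a scripted restart by accident.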


We got a question internally about why one of our demo servers was slow, and how to use LogicMonitor to help identify the issue. The person asking comes from a VoIP, networking and Windows background, not Linux, so his questions reflect those of a less-experienced sysadmin (in this case). I thought it interesting that he documented his thought processes, and I’ll intersperse my interpretation of the same data, and some thoughts on why LogicMonitor alerts as it does…


MySQL Linux Tuning talk

Posted by & filed under Uncategorized .

Not really monitoring, but I just finished giving a talk at the MySQL conference.  (It was gratifyingly packed with people, too.)

Thought I’d post the slides here. The summary is:

  • you need to be able to measure and trend on your OWN infrastructure – your kernel, hardware, MySQL version, application. (Of course, if you are using LogicMonitor, that issue is solved.)
  • solve your problems in the simplest way possible.
  • Test different IO schedulers. There may not be any benefit, but it’s so easy to do that you should try.
  • Test different levels of InnoDB thread concurrency. It can make a big difference, and it is easy to test.
  • Eliminate swapping in the simplest way you can (tune swappiness first, then try NUMA tricks, then hugepages).
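Before testing schedulers or swappiness, it helps to record what a host is currently using. The Python sketch below reads the standard Linux locations `/sys/block/<device>/queue/scheduler` and `/proc/sys/vm/swappiness`; the function names and the default device `sda` are assumptions for illustration and are not taken from the slides.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def parse_scheduler(line: str) -> Tuple[Optional[str], List[str]]:
    """Parse a scheduler line such as 'noop deadline [cfq]'.

    The kernel marks the active scheduler with square brackets.
    Returns (active_scheduler, all_available_schedulers).
    """
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_tunables(device: str = "sda") -> dict:
    """Read the current IO scheduler and swappiness on a Linux host.

    `device` is a placeholder; use the block device backing your MySQL data.
    """
    scheduler_line = Path(f"/sys/block/{device}/queue/scheduler").read_text().strip()
    swappiness = int(Path("/proc/sys/vm/swappiness").read_text())
    active, available = parse_scheduler(scheduler_line)
    return {
        "device": device,
        "active_scheduler": active,
        "available_schedulers": available,
        "swappiness": swappiness,
    }
```

Calling `read_tunables()` before and after each change gives a record to line up against your own trended metrics, which is the only way to tell whether a different scheduler or swappiness value helped on your hardware and workload.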

Download the presentation here.

Feel free to post questions.
