
Tag Archive: best practices

SSD Stats
[Written by Perry Yang, Technical Operations Engineer at LogicMonitor]

In recent years, Solid-State Drives (SSDs) have become a standard part of data center architecture. They handle more simultaneous read/write operations than traditional disks and use a fraction of the power. Of course, as a leading infrastructure, software and server monitoring platform vendor, we are very interested in monitoring our SSDs, not only because we want to make sure we’re getting what we paid for, but because we would also like to avoid a disk failure on a production machine at 3:00 AM…and the Shaquille O’Neal-sized headache to follow. But how do we know for sure that our SSDs are performing the way we want them to? As one of the newest members of our technical operations team, I was, unsurprisingly, the one tasked with answering this question.


LogicMonitor is, as far as I know, the most automated network monitoring system out there.  But there is one area where we don’t provide much in the way of automation, and that we are often asked about: automated scripts in response to alerts.  There are a few reasons why not, which flow from our experience running critical production datacenters:

  • There are many cases where you don’t want automated recovery – you want a human to pinpoint the cause of failure and ensure the recovery is done safely. For example, after a master database crash, many DBAs don’t want to restart the database without determining the cause, whether transactions need to be backed out, whether slaves are still valid replicas, etc.
  • If a system is important enough to need automated recovery, the right way to do that is to have standby systems, clustered or otherwise available. For example: multiple web servers behind a load balancer; master-master databases; switches with rapid spanning tree; routers with a rapidly converging IGP (OSPF, EIGRP).
  • If a service or process does need to be automatically restarted on a host, the monitoring system is almost certainly not the right way to do it. Use daemontools or init on Linux, or configure the service to restart in the Services control panel on Windows.  Using the monitoring system to attempt this remediation will necessarily be more fragile than OS-level tools.
  • If there are processes that need to be killed and restarted in response to the state of monitored metrics – if memory leaks and grows too much, say (I’m looking at you, mongrel) – then use a tool designed for that – monit, say.

In all these cases, use your monitoring to tell you whether your recovery mechanisms are working, not to be the recovery mechanisms.  Monitor the memory usage of your mongrel processes, for example, and alert only if memory consumption stays higher than you expect for longer than it would if monit were doing its job.
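To make that concrete, here is a minimal sketch (not LogicMonitor’s implementation) of the kind of check that only reports the datapoint, leaving the “higher than expected, for longer than monit’s restart window” logic to the monitoring system’s alert threshold and trigger interval. The process name and the use of the psutil library are assumptions for illustration.

```python
# A sketch: emit the largest mongrel memory footprint as a datapoint the
# monitoring system can graph and alert on. Duration-based alerting ("high
# for longer than monit's restart window") belongs in the monitoring
# system's alert trigger interval, not in this script.
import psutil

def max_mongrel_rss_mb():
    """Return the largest resident set size among mongrel processes, in MB."""
    rss = [p.info["memory_info"].rss
           for p in psutil.process_iter(["name", "memory_info"])
           if p.info["name"] and "mongrel" in p.info["name"]
           and p.info["memory_info"] is not None]
    return max(rss) / (1024 * 1024) if rss else 0.0

if __name__ == "__main__":
    # Simple key=value output that most collectors can parse.
    print("mongrel_max_rss_mb=%.1f" % max_mongrel_rss_mb())
```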

Of course, LogicMonitor can trigger automated script actions in response to alerts – you can set up an agent inside your datacenter to pull all the alerts and pass them to a script, which can do … whatever you can script.  And there are cases where that’s appropriate.  But you should have a good think about your architecture and design before you leap to that as a first resort.
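For those cases, the pattern looks roughly like the sketch below: a small agent-side script pulls active alerts and hands each one to a remediation script. The feed URL, authentication, JSON fields, and the /usr/local/bin/handle-alert script are all hypothetical placeholders, not LogicMonitor’s actual API.

```python
# Generic sketch of the "agent pulls alerts and hands them to a script"
# pattern. The URL and JSON shape are hypothetical placeholders --
# substitute whatever your monitoring system's alert API actually provides.
import json
import subprocess
import urllib.request

ALERT_FEED = "https://monitoring.example.com/api/alerts?state=active"  # placeholder

def fetch_alerts():
    with urllib.request.urlopen(ALERT_FEED) as resp:
        return json.load(resp)

def dispatch(alert):
    # Hand each alert to a remediation script; the script decides what,
    # if anything, to do. Keep such actions narrow and well understood.
    subprocess.run(["/usr/local/bin/handle-alert",
                    alert.get("host", ""), alert.get("datasource", "")],
                   check=False)

if __name__ == "__main__":
    for alert in fetch_alerts():
        dispatch(alert)
```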


One of the difficulties in IT environments is that redundancy can sometimes make outages worse.  The problem is that redundancy often gives people (mostly justified) confidence in the availability of their systems, so they design architectures on the assumption that their core switch (or database, or load balancing cluster, or what have you) will not go down.

And they even have monitoring.

But they don’t monitor the state of the redundant server or component. So then the redundant server or component fails, or is unplugged, or synchronization fails, or what have you, and stays that way for weeks with no one noticing. Then the active server or component fails, the other one is already out of commission – and boom – Bad Things happen.

So if you run redundant supervisor modules in your core switches to get high availability, make sure your Cisco switch monitoring is capable of monitoring them. Same for redundant power supplies.

Same for active-standby NetScalers, or F5 BIG-IPs, or NetApp clusters, or anything else that you want to make sure works when needed.

If it’s not monitored, chances are it won’t be there when you need it.
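As a rough illustration of monitoring the standby component itself, the sketch below shells out to net-snmp’s snmpget and emits the standby unit’s state as a datapoint. The host name, community string, and OID are placeholders; you would substitute the vendor MIB that actually covers your redundant supervisors or power supplies.

```python
# Sketch: poll the standby component's state so the "spare" is itself
# monitored. The OID below is a placeholder -- use the vendor MIB for your
# redundant supervisors or power supplies and alert when the standby is
# not in its ready state.
import subprocess

HOST = "core-switch-1"                 # hypothetical device name
COMMUNITY = "public"                   # assumed read community
STANDBY_STATE_OID = "1.3.6.1.4.1.X"    # placeholder: standby unit state OID

def standby_state():
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, STANDBY_STATE_OID],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    # Emit the raw state; the monitoring system alerts when it is not
    # the vendor's "standby ready" value.
    print("standby_state=%s" % standby_state())
```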


When designing infrastructure architecture, there is usually a choice between complexity and fault tolerance.  It’s not just an inverse relationship, however – it’s a curve. You want the minimal complexity possible to achieve your availability goals. And you may even want to reduce your availability goals in order to reduce your complexity (which will end up increasing your availability).

The rule to adopt is: if you don’t understand something well enough that it seems simple to you (or your staff), even in its failure modes, you are better off without it.

Back in the day, clever people suggested that most web sites would have the best availability by running everything – DB, web application, everything – on a single server. This was the simplest configuration, and the easiest to understand.

With no complexity – one of everything (one switch, one load balancer, one web server, one database, for example) – you can tolerate zero failures, but it’s easy to know when there is a failure.

With 2 of everything, connected the right way, you can keep running with one failure, but you may not be aware of the failure.

So is it a good idea to add more connections, and plan to be able to tolerate multiple failures?  Not usually.  For example, with a redundant pair of load balancers, you can connect one load balancer to one switch, and the other load balancer to another switch.  In the event of a load balancer failure, the surviving load balancer will automatically take over, and all is good.  If a switch fails, it may be the one that the active load balancer is connected to – this would also trigger a load balancer fail over, and everything is still running correctly.  It would be possible to connect each load balancer to each switch, so that failure of a switch does not impact the load balancers, but is it worth it?

This would allow the site to survive two simultaneous unrelated failures – one switch and one load balancer – but the added complexity of engineering the multiple traffic paths increases the likelihood that something will go wrong in one of the 4 possible states. There are now 4 possible traffic paths instead of 2 – so more testing is needed, more maintenance is needed on any change, etc.  The benefit seems outweighed by the complexity.

The same concept of “if it seems complex, it doesn’t belong” can be applied to software, too.  Load balancing, whether via an appliance such as Citrix NetScalers or software such as HAProxy, is simple enough for most people nowadays. The same is not generally true of clustered file systems, or DRBD.  If you truly need these technologies, you had better have a thorough understanding of them, invest the time to exercise all the failure modes you can, and train your staff so that it is not complex for them to deal with any of the failures.

If you have a consultant come in and set up BGP routing, but no one on your NOC or on call staff knows how to do anything with BGP, you just greatly reduced your site’s operational availability.

The “Complexity Filter” can be applied to monitoring systems as well.  If your monitoring system stops and you don’t have staff immediately available to troubleshoot and restart the service processes; or if the majority of your staff cannot easily interpret the monitoring, create new checks, or use it to see trends over time – then your monitoring is not contributing to your operational uptime.  It is instead a resource sink, and is likely to bite you when you least expect it.   Datacenter monitoring, like all things in your datacenter, should be as automated and simple as possible.

If it seems complex – it will break.  Learn it until it’s not complex, or do without it.


One question that often arises in monitoring is how to define alert levels and escalations, and what level to set various alerts at – Critical, Error or Warning.

Assuming you have Error and Critical alerts set to notify teams by pager/phone, with Critical alerts on a shorter escalation time, here are some simple guidelines:

Critical alerts should be for events that have immediate customer impacting effect.  For example, a production Virtual IP on a monitored load balancer going down, as it has no available services to route the traffic to.  The site is down, so page everyone.

Error alerts should be for events that require immediate attention, and that, if unresolved, increase the likelihood that a production affecting event will occur.  To continue with the load balancer example, an error should be triggered if the Virtual IP only has one functioning backend server to route traffic to – there is now no redundancy, so one failure can take the site offline.

Warnings, which we typically recommend be sent by email only, are for all  other kinds of events.  The loss of a single backend server from a Virtual IP when there are 20 other servers functioning does not warrant anyone being woken in the night.

When deciding what level to assign alerts, consider the primary function of the device.  For example, in the case of a NetApp storage array, the function of the device is to serve read and write IO requests.  So the primary focus of monitoring NetApps should be the availability and performance (latency) of these read and write requests. If a volume is servicing requests with high latency – such as 70 ms per write request – that should be an Error level alert. (In some enterprises it may be appropriate to configure that as a Critical level alert, but usually a Critical performance alert should be triggered only if end-application performance degrades unacceptably.)  However, if CPU load on the NetApp is 99% for a period, even though it sounds alarming, I’d suggest that be treated as a Warning level alert only.  If latency is not impacted, why wake people at night?  Send an email alert so the issue can be investigated, but if the function of the device is not impaired, do not overreact. (If you wish, you can define your alert escalations so that such conditions result in pages if uncorrected or unacknowledged for more than 5 hours, say.)
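Here is one way to express those guidelines as data, as a hedged sketch: thresholds are keyed to the device’s primary function, so latency carries the Error level while CPU only warns. The metric names and numbers are illustrative examples, not recommended values.

```python
# Sketch of the guidance above expressed as alert definitions. Metric names,
# thresholds, and directions are illustrative, not recommendations.
ALERT_DEFS = [
    # (metric, comparison, critical, error, warn)
    # A VIP with zero healthy backends is customer-impacting: page everyone.
    ("vip.healthy_backends", "<=", 0, 1, None),
    # The NetApp's job is serving IO, so latency carries the Error level.
    ("netapp.write_latency_ms", ">=", None, 70, 20),
    # CPU is only a diagnostic aid: warn by email, never page.
    ("netapp.cpu_busy_pct", ">=", None, None, 95),
]

def evaluate(metric, value):
    """Return 'critical', 'error', 'warn', or None for a sampled value."""
    for name, op, crit, err, warn in ALERT_DEFS:
        if name != metric:
            continue
        breached = (lambda t: t is not None and
                    (value <= t if op == "<=" else value >= t))
        if breached(crit):
            return "critical"
        if breached(err):
            return "error"
        if breached(warn):
            return "warn"
    return None

# Example: evaluate("netapp.cpu_busy_pct", 99) -> 'warn' (email only),
#          evaluate("vip.healthy_backends", 0) -> 'critical'.
```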

Alert Overload is a bigger danger to most datacenters than most people realise. The thought is often “if one alert is good, more must be better.”  Instead, focus on identifying the primary functions of devices – set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high and CPU load is also in alert, that obviously helps diagnose the issue, rather than leaving you to hunt for unusual volume activity.)

Reserve Critical alerts for system performance and availability as a whole.

With LogicMonitor hosted monitoring, the alert definitions for all data center devices have their alert thresholds predefined in the above manner – that’s one way we help provide meaningful monitoring in minutes.


Monitoring System Sprawl
This is often a corollary to the first point about not relying on manual processes.  The number of monitoring systems you have in place should approach 1.  You do not want one system to monitor Windows servers, another for Linux, another for MySQL, another for storage.  Even if they are all capable of automatic updates, filtering and classifying, having multiple systems still virtually guarantees suboptimal datacenter performance.  What happens when the DBA changes his pager address, and the contact information is updated in the escalation methods of 2 systems, but not the other 2?  What happens when scheduled maintenance is specified in one system, but not in another that is tracking another component of the systems undergoing maintenance?

You will end up with alerts that are not routed correctly, and alert overload.  You may also end up with a system that notifies people about issues they have no ability to acknowledge, leading to “Oh…I turned my pager off…”
A variant of this problem is when your DBAs, sysadmins or others ‘automate’ things by writing cron jobs or stored procedures to check and alert on things.  The first part is great – the checking. The alerting, however, should happen through your monitoring system.  Just have the monitoring system run the script and check the output, or call the stored procedure, or read the web page. You do not want yet another place to adjust thresholds, acknowledge alerts, deal with escalations, and so on – all things which your sysadmins’ scripts are unlikely to deal with.
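A sketch of that division of labour, with hypothetical connection details and query: the script does the checking and prints a single datapoint, and the monitoring system owns thresholds, escalations and acknowledgements.

```python
# Sketch: the check lives in the script, the alerting lives in the
# monitoring system. The database, client invocation, and query are
# illustrative stand-ins for whatever the DBA's cron job used to check.
import sys
import subprocess

def replication_lag_seconds():
    # Shell out to the database client and parse a single number.
    out = subprocess.run(
        ["psql", "-At", "-c",
         "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());"],
        capture_output=True, text=True, check=True)
    return float(out.stdout.strip() or 0)

if __name__ == "__main__":
    # No thresholds, no emailing from cron: just emit the datapoint and let
    # the monitoring system decide when (and whom) to alert.
    print("replication_lag_seconds=%.0f" % replication_lag_seconds())
    sys.exit(0)
```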


Continuing on the series of common Datacenter monitoring mistakes…

Alert overload
This is one of the most dangerous conditions.  If you have too many noisy alerts that go off too frequently, people will tune them out – and then when you get real, service impacting alerts, those will be tuned out, too.  I’ve seen critical production service outage alerts put into scheduled downtime for 8 hours, because the admin assumed it was “another false alert”. How to prevent this?

  • start with sensible defaults, and sensible escalation policies. Distinguish between warnings (which admins should be aware of, but which do not require immediate action) and error or critical level alerts, which require pager notifications.  (No need to awaken people if NTP is out of synchronization – but if the primary database volume is experiencing 200 ms latency for read requests from its disk storage, and end user transaction time is now 8 seconds, then all hands on deck.)
  • route the right set of alerts to the right group of people. There is no point in the DBA being alerted about network issues, or vice versa.
  • make sure you tune your thresholds appropriately. Every alert should be real and meaningful.  If any alerts are ‘false positives’ (such as alerts about QA systems being down), tune the monitoring.  LogicMonitor alerts are easily tuned at the global, host or group level, or even for an individual instance (such as a file system, or an interface); and ActiveDiscovery filters make it simple to classify discovered instances into the appropriate group, with the appropriate alert levels. A common example is to configure all load balancing VIPs or storage system volumes discovered with “stage” or “qa” in the name to have no error or critical alerts – this will then apply to all VIPs or volumes created now and in the future, on all devices, greatly simplifying alert management (see the sketch after this list).
  • ensure alerts are acknowledged, dealt with, and cleared.  You don’t want to see hundreds of alerts on your monitoring system.  For large enterprises, make sure you can see a filtered view of just the groups of systems you are responsible for, allowing focus.  You  should also periodically sort alerts by duration, and focus on cleaning out those that have been in alert for longer than a day.
  • Another useful report is to analyze your top alerts, by host, or by alert type. Investigate to see whether there are issues in monitoring, or the systems, or operational processes, that can reduce the frequency of these alerts.
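As referenced above, here is a minimal sketch of name-based classification at discovery time, so that stage and QA instances never page anyone. The group names and regular expression are illustrative; the point is that the rule applies automatically to anything discovered later.

```python
# Sketch of name-based classification at discovery time, so stage/QA
# instances never get error- or critical-level alerts. Group names and the
# matching rule are illustrative; a real system applies this to every
# newly discovered VIP or volume automatically.
import re

NON_PAGING = re.compile(r"(stage|qa)", re.IGNORECASE)

def classify(instance_name):
    """Return the alert policy group for a discovered VIP or volume."""
    if NON_PAGING.search(instance_name):
        return "warn-only"      # email warnings, no pages
    return "production"         # full critical/error/warning thresholds

# Example: classify("vip-qa-checkout")   -> 'warn-only'
#          classify("vip-prod-checkout") -> 'production'
```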


Continuing on from Part 1

No issue should be considered resolved if monitoring will not detect its recurrence.

Even with good monitoring practices in place, outages will occur.  Best practices dictate that the issue not be considered resolved until monitoring is in place to detect the root cause, or provide earlier warning.  For example, if a Java application experiences a service affecting outage due to a large number of users overloading the system, the earliest warning of an impending issue may be an increase in the number of busy threads, which can be tracked via JMX monitoring.  An alert threshold should be placed on this metric, to give advance warning before the next event, which could allow time to add another system to share the load, or activate load shedding mechanisms, and so on.  (LogicMonitor automatically includes alerts for JMX enabled applications such as Tomcat and Resin when the active threads are approaching the maximum configured – but such alerts should be present for all applications, on all monitoring systems.)
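A hedged sketch of that busy-thread alert: the 80% and 90% cut-offs are assumptions, and the busy and maximum counts would come from whatever JMX datapoints you already collect (for Tomcat, the ThreadPool MBean exposes currentThreadsBusy and maxThreads).

```python
# Sketch of the "approaching the configured maximum" alert. The ratios are
# illustrative; busy and max thread counts come from the JMX datapoints
# your monitoring already collects.
def thread_pool_alert(busy_threads, max_threads,
                      warn_ratio=0.8, error_ratio=0.9):
    """Return 'error', 'warn', or None for a sampled thread pool."""
    if max_threads <= 0:
        return None
    utilization = busy_threads / max_threads
    if utilization >= error_ratio:
        return "error"   # nearly saturated: add capacity or shed load now
    if utilization >= warn_ratio:
        return "warn"    # early warning, before users see queuing
    return None

# Example: thread_pool_alert(230, 250) -> 'error'
```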

This is a very important principle – just because things are working again does not mean the issue should be closed, unless you are happy with the warning your monitoring gave about the issue before it started, and with the kind of alerts and alert escalations that occurred during the issue.  It’s possible that the issue is one with no way to warn in advance (for example, a sudden panic of a system), but this process of evaluation should be undertaken for every service impacting event.


Everyone knows they need monitoring to ensure their site uptime and keep their business humming.  Yet many sites still suffer from outages that are first reported by their customers.  Here at LogicMonitor, we have lots of experience with monitoring systems of all kinds, and these are some of the most common mistakes we have seen, and how to address them – so that you can know about issues before they affect your business:

Relying on people and human processes to ensure things are monitored.
People are funny, lovable, amazing creatures, but they are not very reliable.  A situation we have seen many times is that during the heat of a crisis (say, you were lucky enough to get Slashdotted), some change is made to some data center equipment (such as adding a new volume to your NetApp so that it can serve as high speed storage for your web tier).  But in the heat of the moment, the new volume is not put into your NetApp monitoring.  (“I’ll get to that later” are famous last words.)
After the crisis is over, everyone is too busy breathing sighs of relief to worry about adding that new volume into monitoring – so when it fills up in 6 months, or starts having latency issues due to high IO operations, no one is alerted, and customers are the first to call in and complain. The CTO is the next to call.  Uh oh.

One of LogicMonitor’s design goals has always been to remove human configuration as much as possible – not just because it saves people time, but because it makes monitoring – and hence the services monitored – that much more reliable.  We do this in a few different ways:

  • LogicMonitor’s ActiveDiscovery (TM) process continuously scans all monitored devices for changes, automatically adding new volumes, interfaces, load balancer VIPs, database instances, etc into monitoring, and informing you via email in real time (or batched notifications, as you prefer).  However, in order to avoid Alert Overload, you’ll need to ensure your monitoring supports filtering and classification of discovered objects.
  • LogicMonitor can scan your subnets, or even your Amazon EC2 account, and add new machines or instances to monitoring automatically.
  • Just as important, graphs and dashboards should not have to be manually updated.  If you have a dashboard graph that is the sum of the sessions on 10 web servers, and that is commonly used to view the health of your service, what happens when you add 4 more servers? A good monitoring system will automatically add them to your dashboard graph.  A poor system will require you to remember to update all the places where such information is aggregated, virtually ensuring your “business overview” will be anything but. (A sketch of this kind of discovery-driven aggregation follows below.)
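As mentioned in the last point, here is a small sketch of the difference: the aggregate is defined against a group, not a fixed list of hosts, so newly discovered servers are included automatically. The inventory structure and metric lookup are placeholders, not a real API.

```python
# Sketch of a discovery-driven aggregate. The inventory and metric lookups
# are placeholders for whatever your monitoring system exposes; the point
# is that the graph definition references a group, not ten named hosts.
def discovered_hosts(inventory, group="web-servers"):
    """Return every host currently in the group -- 10 today, 14 tomorrow."""
    return [h for h in inventory if h["group"] == group]

def total_sessions(inventory, latest_metric):
    """Sum active sessions across whatever the group contains right now."""
    return sum(latest_metric(h["name"], "sessions.active")
               for h in discovered_hosts(inventory))

# Example usage with stand-in data:
if __name__ == "__main__":
    inventory = [{"name": "web-%02d" % i, "group": "web-servers"}
                 for i in range(1, 15)]
    print(total_sessions(inventory, lambda host, metric: 42))
```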

In short, never depend on monitoring to be manually updated to cover adds, moves and changes. Because you know it doesn’t happen.

