
How Application monitoring saves you money

February 6, 2010 – 1:11 pm

We here at LogicMonitor use our own service to monitor the various parts of our infrastructure, and doing so demonstrates the financial value that LogicMonitor brings.

The more you instrument with LogicMonitor, the more power it has.  In the cases below, the information and alerts that LogicMonitor presented allowed us to avoid spending money on more hardware – and with LogicMonitor’s availability requirements, each hardware purchase usually means 3 x the hardware (active/passive at the datacenter in question, and failover hardware present in a different datacenter.)

One case was relatively straightforward – a review of the MySQL performance monitoring metrics revealed that the number of rows read due to read_rnd_next operations was very high – in the tens of thousands per second. (For those of you who are not DBAs, this is the number of rows MySQL reads sequentially in order to satisfy a read request – an indicator that indexes are not being used.)  A quick bit of investigation by our programmers revealed a query written in such a way that MySQL was not using the existing indexes.  This was rewritten, and on our release, the MySQL table scans dropped dramatically:

This reduced the system’s CPU load and disk load, and improved the response time for users.
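
For readers who want to reproduce the "rows per second" figure, here is a minimal sketch. MySQL exposes this value as the cumulative status counter Handler_read_rnd_next, so a rate has to be derived from two readings. The function name and the sample readings below are hypothetical, chosen only for illustration.

```python
def per_second_rate(first_sample, second_sample, interval_seconds):
    """Convert two readings of a cumulative counter into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = second_sample - first_sample
    if delta < 0:
        # The counter went backwards, which suggests a reset (for example a
        # server restart); fall back to the second reading on its own.
        delta = second_sample
    return delta / interval_seconds


# Hypothetical readings of Handler_read_rnd_next taken 60 seconds apart.
rate = per_second_rate(1_200_000, 3_000_000, 60)
print(f"{rate:.0f} rows/s")  # prints "30000 rows/s"
```

A sustained rate in the tens of thousands, as in the case above, is the kind of value worth investigating for missing or unused indexes.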

However, a more dramatic demonstration came a week or so later, when one cluster started getting disk bound.  An increase in customers, combined with some newly released features that added extra load, meant that one cluster was reaching the capacity of its hardware (or so I thought.)  Average response time was hitting what we regarded as limits, and my thought was that we’d have to throw hardware (meaning money) at the issue.

However, using custom application metrics that the LogicMonitor system exposes (in our case via JMX monitoring, as our system is written in Java, but the data could have been collected by any of a variety of mechanisms, from perfmon counters, to web page content, to log files), it was apparent that the load was solely due to one particular processing queue.  Our CTO investigated the caching algorithm that is applied to the data in this queue, and was able to tune it so that it was much more effective, as can be seen from the graphs below:

This dropped the CPU load of the cluster:

And also improved the servers’ response time:

So while LogicMonitor did not directly solve the problem, the extensive application monitoring did warn us that an issue was arising, and pinpoint where in our system the bottleneck was, and allowed our staff to focus their investigation on the one particular queue, rather than all components of the system. It also allowed us to see the effectiveness of the changes on our staging systems, before we released to production.

LogicMonitor’s application monitoring saved us many thousands of dollars, and many hours of engineering time.  Both are in limited supply at any company.


Complexity doesn’t belong in your datacenter.

January 29, 2010 – 10:11 am

When designing infrastructure architecture, there is usually a choice between complexity and fault tolerance.  It’s not just an inverse relationship, however. It’s a curve. You want the minimal complexity possible to achieve your availability goals. And you may even want to reduce your availability goals to reduce your complexity (which will end up increasing your availability.)

Basically, the rule to adopt is: if you don’t understand something well enough that it seems simple to you (or your staff), even in its failure modes, you are better off without it.

Back in the day, clever people suggested that most web sites would have the best availability by running everything – DB, web application, everything – on a single server. This was the simplest configuration, and the easiest to understand.

With no complexity – one of everything (one switch, one load balancer, one web server, one database, for example) – you can tolerate zero failures.

With 2 of everything, connected the right way, you can keep running with one failure.

So is it a good idea to add more connections, and plan to be able to tolerate multiple failures?  Not usually.  For example, with a redundant pair of load balancers, you can connect one load balancer to one switch, and the other load balancer to another switch.  In the event of a load balancer failure, the surviving load balancer will automatically take over, and all is good.  If a switch fails, it may be the one that the active load balancer is connected to – this would also trigger a load balancer fail over, and everything is still running correctly.  It would be possible to connect each load balancer to each switch, so that failure of a switch does not impact the load balancers, but is it worth it?

This would allow the site to survive two simultaneous unrelated failures – one switch and one load balancer – but the added complexity of engineering the multiple traffic paths increases the likelihood that something will go wrong in one of the 4 possible states. There are now 4 possible traffic paths instead of 2 – so more testing is needed, more maintenance is needed on any change, etc.  The benefit seems outweighed by the complexity.
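
The path counts above can be checked with a short enumeration. This is only an illustrative sketch; the device names are hypothetical.

```python
from itertools import product

load_balancers = ["lb1", "lb2"]
switches = ["sw1", "sw2"]

# Simple design: each load balancer is cabled to exactly one switch.
simple_paths = list(zip(load_balancers, switches))

# Fully meshed design: every load balancer is cabled to every switch.
meshed_paths = list(product(load_balancers, switches))

print(len(simple_paths), len(meshed_paths))  # prints "2 4"
```

Each extra path is another state to test after every change, which is where the added complexity comes from.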

The same concept of “if it seems complex, it doesn’t belong” can be applied to software, too.  Load balancing, whether via an appliance such as Citrix Netscalers, or software such as HAProxy, is simple enough to most people nowadays. The same is not generally true of clustered file systems, or DRBD.  If you truly need these technologies, you had better have a thorough understanding of them, and invest the time to create all the failure modes you can, and train your staff so that it is not complex for them to deal with any of the failures.

If you have a consultant come in and set up BGP routing, but no one on your NOC or on call staff knows how to do anything with BGP, you just greatly reduced your site’s operational availability.

The “Complexity Filter” can be applied to monitoring systems, as well.  If your monitoring system stops, and you don’t have immediate staff available to troubleshoot the restart of the service processes; or the majority of your staff cannot easily interpret the monitoring, or create new checks, or use it to see trends over time – your monitoring is not contributing to your operational uptime.  It is instead a resource sink, and is likely to bite you when you least expect it.   Datacenter monitoring, like all things in your datacenter, should be as automated and simple as possible.

If it seems complex – it will break.  Learn it until it’s not complex, or do without it.


Automation of Datacenter Monitoring

January 8, 2010 – 3:03 pm

Denise Dubie wrote a recent piece in CIO magazine about “5 Must-have IT Management Technologies for 2010”, in which she identifies one of the must-haves as IT process automation. She quotes Jim Frey, research director at EMA: “On the monitoring side, automation will be able to keep up with the pace of virtual environments and recognize when changes happen in ways a human operator simply could not.”

At LogicMonitor we couldn’t agree more. It’s true that, as the article implies, virtualization and cloud computing make the need for monitoring automation more acute than previously (which is why customers use LogicMonitor to automatically detect new hosts and monitor newly created Amazon EC2 instances – having dynamic system scaling without the ability to automatically monitor the dynamic systems is just asking for undetected service-affecting issues.)

However, even in traditional non-virtualized datacenters (and despite the buzz, most datacenters and services are still built on physical machines), there is often so much change going on with systems and applications that non-automated monitoring has virtually no chance of keeping up with the additions and deletions. A typical example of an automated change report of one LogicMonitor customer from last night shows:

  • two interfaces on two different switches added to monitoring as they became active, and one removed as it was shut down
  • discovery of the Resin application newly running on 3 servers (along with discovery of all the ports, webApps, Java monitoring, etc. for each Resin server), and the removal of Resin from one server
  • 5 different virtual IP’s added to 2 different load balancers, automatically added to monitoring
  • the addition of a new class of custom application metrics exposed by JMX

And that was just one day’s changes.  Imagine the staff costs involved with tracking and implementing all these changes, every day, in a manual fashion, that are avoided by the use of automated datacenter monitoring.

And more significantly, imagine the likelihood that one or more of these changes would NOT have made it into monitoring manually – so that when a production service has issues, there is no monitoring to detect it.

Having your customers be the first to know about issues is not a situation anyone wants to be in – and monitoring automation is the only way to avoid that.  That’s one area that LogicMonitor’s datacenter monitoring excels at.
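
At its core, an automated change report like the one described above is a comparison between what is currently monitored and what discovery finds. The following is a minimal sketch of that idea, not LogicMonitor's implementation; the function name and instance names are hypothetical.

```python
def change_report(monitored, discovered):
    """Return what to add to and remove from monitoring, given two sets of instance names."""
    return {
        "added": sorted(discovered - monitored),
        "removed": sorted(monitored - discovered),
    }


# Hypothetical instance names: switch interfaces and an application on a host.
monitored = {"sw1:Gi0/1", "sw2:Gi0/4", "web1:resin"}
discovered = {"sw1:Gi0/1", "sw1:Gi0/2", "web2:resin"}

report = change_report(monitored, discovered)
print(report["added"])    # prints "['sw1:Gi0/2', 'web2:resin']"
print(report["removed"])  # prints "['sw2:Gi0/4', 'web1:resin']"
```

Running such a comparison on a schedule is what removes the manual tracking step.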


The many faces of JMX monitoring

January 4, 2010 – 4:22 pm

We like monitoring. We like Java. Not to slight other languages – we like Ruby, Perl, PHP, .NET and other platforms, too, and like to monitor them, also.

However, unlike most other languages, Java provides an explicit technology for monitoring applications and system objects.   JMX is supported on any platform running the JVM, but like most other monitoring protocols, there are lots of interesting nuances and ways to use it. Which means lots of nuances in how to detect it and monitor it.

We have quite a few customers that use LogicMonitor for JMX monitoring, of both standard and custom applications, so we’ve run into a number of small issues, and solved them.

One example is that the naming convention for JMX objects is loosely defined.  Initially, the JMX collector for LogicMonitor assumed that every object would have a “type” key property, as specified in best practices. Of course, this is a rule “more honored in the breach than in the observance”, as widespread applications such as WowzaMediaServer and even Tomcat do not adhere to it.

Another example is that JMX supports complex datatypes.  We have customers who do not register different MBeans for all their classes of data, but instead expose MBeans that return maps of maps. Our collectors and ActiveDiscovery did not initially deal with these datatypes, as we hadn’t anticipated their use.  But, there are good reasons to use them in a variety of cases, so LogicMonitor should support the wishes of the user – that’s one of our tenets, that LogicMonitor enables users to work the way they want, instead of constraining them to a preconceived idea.  So we extended our ActiveDiscovery to iterate through maps, and maps of maps, and composite datatypes.

This enables our customers to instrument their applications in the way they think is most appropriate, while automating the configuration of management and alerting.  While we think we’ve got all the permutations of JMX covered, I’m not taking any bets that a new customer won’t come along with a new variant that adds a perfectly logical use case that we do not support.  Of course, if that’s the case, we’ll support it within a month or so – and all our customers – current and future – will be able to immediately reap the benefits. That’s just one of the niceties of the hosted SaaS model.
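
To illustrate what iterating through maps of maps involves, here is a minimal sketch that walks a nested structure and yields one named datapoint per leaf value. It is not LogicMonitor's ActiveDiscovery code: a plain Python dictionary stands in for the value returned by an MBean attribute, and the attribute contents and dotted naming scheme are assumptions made for illustration.

```python
def flatten(value, prefix=""):
    """Yield (dotted_name, leaf_value) pairs from arbitrarily nested dicts."""
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(child, name)
    else:
        yield prefix, value


# Hypothetical attribute value: a map of maps describing two queues.
attribute = {
    "queues": {
        "inbound": {"depth": 12, "errors": 0},
        "outbound": {"depth": 3},
    }
}

for name, datapoint in flatten(attribute):
    print(name, datapoint)
# prints:
# queues.inbound.depth 12
# queues.inbound.errors 0
# queues.outbound.depth 3
```

Because the walk is recursive, the nesting depth does not need to be known in advance.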


Active/Active or N+1?

December 21, 2009 – 2:51 pm

If your infrastructure has to be up at all times (or as much as possible), how to best achieve that?  In an Active/Active configuration, where all parts of the infrastructure are used all the time, or in an N+1 configuration, where there are idle resources waiting to take over in the event of a failure?

The short answer is it doesn’t matter unless you have good monitoring in place.

The risk with Active/Active is that load does not scale linearly.  If you have two systems running at 40% load, that does not mean that one will be able to handle the load of both, and run at 80%.  More likely you will hit an inflection point, where you run into an unanticipated bottleneck – be it CPU, memory bandwidth, disk IO, or some system that is providing external API resources. It can even be the power system. If servers have redundant power supplies, and each PSU is attached to a separate Power Distribution Unit (PDU), the critical load for each PDU is now 40% of the rating.  If one circuit fails, all load switches to the other PDU – and if that PDU is now asked to carry more than 80% of its rating, overload circuits will trip, leading to a total outage.  There is some speculation that a cascading failure of this type was behind the recent Amazon EC2 outage.
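
The 40% figure follows from simple arithmetic, sketched below. The sketch assumes that load is shared evenly across PDUs and is redistributed evenly to the survivors when one fails; the function name is hypothetical.

```python
def per_pdu_alert_threshold(trip_fraction, pdu_count):
    """Highest steady-state load fraction per PDU such that the surviving PDUs
    stay at or under trip_fraction after one PDU fails."""
    if pdu_count < 2:
        raise ValueError("need at least two PDUs for redundancy")
    # With N PDUs each at load L, the N-1 survivors each carry L * N / (N - 1).
    return trip_fraction * (pdu_count - 1) / pdu_count


print(per_pdu_alert_threshold(0.80, 2))  # prints "0.4"
```

With two PDUs and an 80% trip point, each PDU must stay at or below 40% in normal operation.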

The risk with N+1 is that, by definition, you have a system sitting idle – so how do you know it’s ready for use?  Oftentimes, just when you need it, it’s not ready.

Of course, the only 100% certain way of knowing your infrastructure can deal with failures is to have operational procedures in place that test everything – actually failover.

But in between the regularly scheduled failover events, you need to monitor everything.  Monitor the PDUs, and if you are running Active-Active, set the thresholds to 40%.  Monitor your standby nodes, and monitor the synchronization of the configuration of the standby nodes (if you use Netscalers, F5 Big IPs, or other load balancers, you do not want to experience a hardware failure on your primary node, only to fail over to a secondary node that is unaware of the configuration of any of your production VIPs.)  Monitor all systems for RAID status,  monitor secondary router paths for BGP peerings, monitor backup switches for changes in link status, temperature sensors, memory pool usage and fragmentation.

If you notice, virtually all the outage announcements companies issue promise to improve their monitoring to prevent such issues.

At LogicMonitor, we suggest you implement thorough monitoring first, to avoid as many of these issues as you can in the first place. LogicMonitor makes that simple.


When an OID is not an OID

December 13, 2009 – 7:39 pm

It’s still surprising to me that hardware and software manufacturers do not seem to value any kind of consistency in their management interfaces.  Or maybe it’s intentional, to complicate monitoring and management of their systems to encourage the purchase of the vendor’s own monitoring systems.

In any event, it makes the case for a monitoring service such as LogicMonitor, where we actually provide the templates of what you should be monitoring for a specific kind of device, all the more compelling.

A few examples of what I mean:

  • NetApp decided to change the OIDs used for reporting fan and electronics failures from one minor release to the next.
  • Similarly, NetApp changed the units that volume latency is reported in for releases after version 7.3 from milliseconds to microseconds.
  • Cisco changed the way it responds to queries for the interface queue length of vlan interfaces between minor releases of the 12.2 code.
  • Microsoft changes all sorts of counters in all sorts of releases, and even adopts entirely different monitoring interfaces from one release of a product to the next, encouraging the use of WMI in one release of a product, then dropping support of it in the next.

If your monitoring system cannot automatically apply different monitoring templates based on the version of software being run on devices, then if you run more than one of a device, and don’t upgrade all of them at the same moment, you will be left with a tedious job of associating the correct datasource templates to each device as you update its software.  And that’s of course assuming that you know in advance what changes to apply to each upgrade of IOS, or OnTap, or MySQL, or Windows, or …..
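
As a rough illustration of version-aware handling, the sketch below normalizes the volume latency example from the list above. The function names are hypothetical, and the exact version boundary is an assumption: the code treats anything up to and including 7.3.x as reporting milliseconds and anything later as reporting microseconds.

```python
def parse_version(text):
    """Turn a dotted version string such as '7.3.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def latency_ms(raw_value, software_version):
    """Normalize a raw volume latency reading to milliseconds.

    Assumption for illustration: releases up to and including 7.3.x report
    milliseconds, later releases report microseconds.
    """
    major_minor = parse_version(software_version)[:2]
    if major_minor > (7, 3):
        return raw_value / 1000.0
    return float(raw_value)


print(latency_ms(70, "7.3"))     # prints "70.0"
print(latency_ms(70000, "8.0"))  # prints "70.0"
```

Without this kind of switch, the same threshold would fire constantly on the newer release, or never on the older one.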

It’s this kind of bundled knowledge and automation that helps LogicMonitor save our customers hours of time.  Of course, in this case, they wouldn’t even be aware of it – it’s just a series of false alerts that they do not receive, as a result of the monitoring automatically adjusting to changes in their systems.

I really am proud of our product.


Simple ways to start addressing DataCenter power needs

December 7, 2009 – 12:33 pm

Anyone who runs IT infrastructure is aware that power consumption is one of the biggest costs in datacenter provisioning and ongoing expenses.  If they are not, they will soon become aware, as energy costs are predicted to increase in the future, and are the fastest rising cost in the datacenter.

Maximizing power efficiency is a complex topic, which can involve:

  • virtualization to consolidate physical servers
  • adoption of on-demand cloud computing
  • evaluating whether your applications scale in a way such that new, more powerful equipment (which draws more power) will actually be efficient in delivering more requests per Amp (which may not be the case if your bottleneck is latency of an external storage system, or database, not CPU speed)

However, there are some simple things that all IT Managers should be on top of.

Track your power usage.

You should be tracking your power usage over time.  You should be able to see the total usage, by datacenter, so you can see how your usage is changing as you bring on new servers, or add load to those servers.  As the saying goes, you can’t manage what you don’t measure.

Consolidate underutilized servers.

If you have a pool of servers serving a web site, but they are consistently running at less than 10% load, even during peaks, it is often feasible to drastically reduce the number of servers, and just power down a subset of servers.  At LogicMonitor, we’ve seen this situation in our customers more often than you’d think. Often it arises from web functions being migrated from one set of servers to another, as new code is released. The load on the old servers drops, but the number of servers is never reduced.  (One issue to be aware of is non-linear scaling of load – but that’s a topic for another post.)
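
A first-pass estimate of how far such a pool could shrink can be sketched as follows. It deliberately assumes that load scales linearly, which the caveat above warns is not generally true, so the target utilization should be chosen conservatively. The function name and figures are hypothetical.

```python
import math


def servers_needed(current_servers, peak_percent, target_percent):
    """Estimate the pool size that carries the same total peak load at a
    chosen target utilization, assuming load scales linearly."""
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    total_load = current_servers * peak_percent
    return max(1, math.ceil(total_load / target_percent))


# Hypothetical pool: 20 servers peaking at 10%, consolidated to a 50% target.
print(servers_needed(20, 10, 50))  # prints "4"
```

Any such number is a starting point for a staged reduction with monitoring in place, not a figure to act on directly.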

Pay some attention to older servers

Older servers are often the least efficient in terms of productivity per Amp.  By definition, anything running on an older server does not require the latest CPU and memory speeds – which makes such systems a prime candidate for combining on a newer, energy efficient CPU.


Why CPU load should not (usually) be a critical alert.

November 20, 2009 – 9:29 am

One question that often arises in monitoring is how to define alert levels and escalations, and what level to set various alerts at – Critical, Error or Warning. 

Assuming you have Error and Critical alerts set to notify teams by pager/phone, and Critical alerts with a shorter escalation time, here are some simple guidelines:

Critical alerts should be for events that have immediate customer impacting effect.  For example, a production Virtual IP on a monitored load balancer going down, as it has no available services to route the traffic to.  The site is down, so page everyone.

Error alerts should be for events that require immediate attention, and that, if unresolved, increase the likelihood that a production affecting event will occur.  To continue with the load balancer example, an error should be triggered if the Virtual IP only has one functioning backend server to route traffic to – there is now no redundancy, so one failure can take the site offline.

Warnings, which we typically recommend be sent by email only, are for all other kinds of events.  The loss of a single backend server from a Virtual IP when there are 20 other servers functioning does not warrant anyone being woken in the night.
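
The load balancer example running through these three levels can be expressed as a small rule. This is a sketch of the guideline only; the function and level names are hypothetical.

```python
def vip_alert_level(healthy_backends, configured_backends):
    """Map the state of a Virtual IP's backend pool to an alert level."""
    if healthy_backends <= 0:
        return "critical"  # nothing to route traffic to: the site is down
    if healthy_backends == 1:
        return "error"     # still serving, but one more failure takes it offline
    if healthy_backends < configured_backends:
        return "warning"   # degraded, but redundancy remains
    return "ok"


print(vip_alert_level(0, 21))   # prints "critical"
print(vip_alert_level(1, 21))   # prints "error"
print(vip_alert_level(20, 21))  # prints "warning"
```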

When deciding what level to assign alerts, consider the primary function of the device.  For example, in the case of a NetApp storage array, the function of the device is to serve read and write IO requests.  So the primary thing to monitor on NetApps should be the availability and performance (latency) of these read and write requests. If a volume is servicing requests with high latency – such as 70 ms per write request – that should be an Error level alert (in some enterprises, that may be appropriate to configure as a Critical level alert, but usually a Critical performance alert should be triggered only if the end-application performance degrades unacceptably.)  However, if CPU load on the NetApp is 99% for a period, even though it sounds alarming, I’d suggest that be treated as a Warning level alert only.  If latency is not impacted, why wake people at night?  Send an email alert so the issue can be investigated, but if the function of the device is not impaired, do not over react. (If you wish, you can define your alert escalations so that such conditions result in pages if uncorrected or unacknowledged for more than 5 hours, say.)
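
A minimal sketch of the storage example, including the optional escalation of unacknowledged warnings, might look like this. The 70 ms, 99% and 5 hour values come from the text above; the function names and the way the values are used as thresholds are assumptions for illustration.

```python
from datetime import timedelta

WRITE_LATENCY_ERROR_MS = 70                # latency example from the text
CPU_WARNING_PERCENT = 99                   # CPU example from the text
WARNING_PAGE_AFTER = timedelta(hours=5)    # escalation example from the text


def storage_alerts(write_latency_ms, cpu_percent):
    """Alert on the device's primary function (latency) at Error level,
    and on a supporting metric (CPU) at Warning level only."""
    alerts = []
    if write_latency_ms >= WRITE_LATENCY_ERROR_MS:
        alerts.append(("write_latency", "error"))
    if cpu_percent >= CPU_WARNING_PERCENT:
        alerts.append(("cpu", "warning"))
    return alerts


def should_page(level, unacknowledged_for):
    """Page immediately for error and critical; page for a warning only
    once it has gone unacknowledged for longer than the escalation window."""
    if level in ("error", "critical"):
        return True
    return level == "warning" and unacknowledged_for > WARNING_PAGE_AFTER


print(storage_alerts(write_latency_ms=5, cpu_percent=99))  # prints "[('cpu', 'warning')]"
print(should_page("warning", timedelta(hours=1)))          # prints "False"
print(should_page("warning", timedelta(hours=6)))          # prints "True"
```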

Alert Overload is a bigger danger to most datacenters than most people realise. The thought is often “if one alert is good, more must be better.”  Instead, focus on identifying the primary functions of devices – set Error level alerts on those functions, and use Warnings to inform you about conditions that could impair those functions, or to aid in troubleshooting. (If latency on a NetApp is high, and CPU load is also in alert, that obviously helps diagnose the issue, instead of looking for unusual volume activity.)

Reserve Critical alerts for system performance and availability as a whole.

With LogicMonitor hosted monitoring, the alert definitions for all data center devices have their alert thresholds predefined in the above manner – that’s one way we help provide meaningful monitoring in minutes.


Top I.T./Datacenter Monitoring Mistakes, Part 4 in a series.

November 10, 2009 – 5:55 pm

Monitoring System Sprawl
This is often a corollary to the first point, not relying on manual processes.  The number of monitoring systems you have in place should approach 1.  You do not want one system to monitor Windows servers, another for Linux, another for MySQL, another for storage.  Even if they are all capable of automatic updates, filtering and classifying, having multiple systems still virtually guarantees suboptimal datacenter performance.  What happens when the DBA changes his pager address, and the contact information is updated in the escalation methods of 2 systems, but not the other 2?  What happens when scheduled maintenance is specified in one system, but not another that is tracking another component of the systems undergoing maintenance?

You will end up with alerts that are not routed correctly, and alert overload.  You may also end up with a system that notifies people about issues they have no ability to acknowledge, leading to “Oh…I turned my pager off…”
A variant of this problem is when your DBAs, sysadmins or others ‘automate’ things by writing cron jobs or stored procedures to check and alert on things.  The first part is great – the checking. The alerting, however, should happen through your monitoring system.  Just have the monitoring system run the script and check the output, or call the stored procedure, or read the web page. You do not want yet another place to adjust thresholds, acknowledge alerts, deal with escalations, and so on – all things which your sysadmin’s scripts are unlikely to deal with.
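
The split between checking and alerting can be sketched as follows: the script only produces a value, and the thresholds live with the caller, which stands in for the monitoring system. This is an illustrative sketch, not any particular product's collector. The function names and thresholds are hypothetical, and a one-line Python command stands in for the sysadmin's script.

```python
import subprocess
import sys


def run_check(command, timeout_seconds=30):
    """Run a check script and return (exit_code, stdout)."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout_seconds
    )
    return result.returncode, result.stdout.strip()


def evaluate(value_text, warn_at, error_at):
    """Apply thresholds held by the monitoring side, not by the script."""
    value = float(value_text)
    if value >= error_at:
        return "error"
    if value >= warn_at:
        return "warning"
    return "ok"


# Stand-in for a real check script: it just prints a number.
exit_code, output = run_check([sys.executable, "-c", "print(42)"])
print(exit_code, evaluate(output, warn_at=40, error_at=90))  # prints "0 warning"
```

Changing a threshold then means editing one place, and acknowledgement and escalation stay with the system built to handle them.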


Top I.T./Datacenter Monitoring Mistakes, Part 3 in a series.

November 6, 2009 – 10:06 am

Continuing on the series of common Datacenter monitoring mistakes…

Alert overload
This is one of the most dangerous conditions.  If you have too many noisy alerts, that go off too frequently, people will tune them out – then when you get real, service impacting alerts, they will be tuned out, too.  I’ve seen critical production service outage alerts be put into scheduled downtime for 8 hours, as the admin assumed it was “another false alert”. How to prevent this?

  • start with sensible defaults, and sensible escalation policies. Distinguish between warnings (that admins should be aware of, but do not require immediate actions) and error or critical level alerts, that require pager notifications.  (No need to awaken people if NTP is out of synchronization – but if the primary database volume is experiencing 200 ms latency for read requests from its disk storage, and end user transaction time is now 8 seconds, then all hands on deck).
  • route the right set of alerts to the right group of people. There is no point in the DBA being alerted about network issues, or vice versa.
  • make sure you tune your thresholds appropriately. Every alert should be real and meaningful.  If any alerts are ‘false positives’ (such as alerts about QA systems being down), tune the monitoring.  LogicMonitor alerts are easily tuned on the global, host or group level, or even the individual instance (such as a file system, or interfaces); and ActiveDiscovery filters make it simple to classify discovered instances into the appropriate group, with the appropriate alert levels. A common example is to set all discovered load balancing VIPs or storage system volumes with “stage” or “qa” in the name to have no error or critical alerts – this will then apply to all VIPs or volumes created now and in the future, on all devices – greatly simplifying alert management.
  • ensure alerts are acknowledged, dealt with, and cleared.  You don’t want to see hundreds of alerts on your monitoring system.  For large enterprises, make sure you can see a filtered view of just the groups of systems you are responsible for, allowing focus.  You should also periodically sort alerts by duration, and focus on cleaning out those that have been in alert for longer than a day.
  • Another useful report is to analyze your top alerts, by host, or by alert type. Investigate to see whether there are issues in monitoring, or the systems, or operational processes, that can reduce the frequency of these alerts.
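
Two of the ideas in the list above, name-based classification and a top-alerts report, can be sketched in a few lines. This is not how ActiveDiscovery filters are actually configured; the function names, instance names and alert records are hypothetical.

```python
from collections import Counter

NON_PRODUCTION_MARKERS = ("stage", "qa")


def allowed_levels(instance_name):
    """Instances with a non-production marker in the name get warnings only."""
    name = instance_name.lower()
    if any(marker in name for marker in NON_PRODUCTION_MARKERS):
        return {"warning"}
    return {"warning", "error", "critical"}


def top_alerts(alerts, key, limit=3):
    """Count alerts by a field such as 'host' or 'type', most frequent first."""
    return Counter(alert[key] for alert in alerts).most_common(limit)


# Hypothetical alert history.
history = [
    {"host": "db1", "type": "latency"},
    {"host": "db1", "type": "disk"},
    {"host": "web3", "type": "latency"},
]

print(sorted(allowed_levels("vip-QA-web")))  # prints "['warning']"
print(top_alerts(history, "host", limit=1))  # prints "[('db1', 2)]"
```

A plain substring match is crude (it would also catch an unrelated name that happens to contain "qa"), so a real filter would likely need a stricter naming convention.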
