
Active/Active or N+1?


If your infrastructure has to be up at all times (or as close to that as possible), how do you best achieve it?  With an Active/Active configuration, where all parts of the infrastructure are in use all the time, or with an N+1 configuration, where idle resources stand by to take over in the event of a failure?

The short answer is it doesn’t matter unless you have good monitoring in place.

The risk with Active/Active is that load does not scale linearly.  If you have two systems running at 40% load, that does not mean that one will be able to handle the load of both and run at 80%.  More likely you will hit an inflection point, where you run into an unanticipated bottleneck – be it CPU, memory bandwidth, disk IO, or some system providing external API resources. It can even be the power system. If servers have redundant power supplies, and each PSU is attached to a separate Power Distribution Unit (PDU), the safe load for each PDU is now 40% of its rating.  If one circuit fails, all load switches to the other PDU – and if that PDU is then asked to carry more than 80% of its rating, overload breakers will trip, leading to a total outage.  There is some speculation that a cascading failure of this type was behind the recent Amazon EC2 outage.
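The arithmetic behind that PDU example can be sketched as follows. The 80% figure is the usual continuous-load breaker derating; the function name and defaults are illustrative, not from any particular tool:

```python
def safe_per_pdu_threshold(breaker_derating=0.8, redundant_pdus=2):
    """Maximum fraction of each PDU's rating it may carry, such that
    the surviving PDU(s) can absorb the full load after one failure."""
    # When one PDU fails, its load shifts to the remaining units,
    # which must still stay within the breaker derating.
    surviving = redundant_pdus - 1
    return breaker_derating * surviving / redundant_pdus

# With two PDUs and each server's redundant supplies split across them,
# the alert threshold on each PDU should be 40% of rating, not 80%.
print(safe_per_pdu_threshold())  # 0.4
```

Setting monitoring thresholds at the failover-safe level (40%), rather than the standalone level (80%), is exactly the difference between catching this condition in advance and discovering it during an outage.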

The risk with N+1 is that, by definition, you have a system sitting idle – so how do you know it’s ready for use?  Oftentimes, just when you need it, it’s not ready.

Of course, the only 100% certain way of knowing your infrastructure can deal with failures is to have operational procedures in place that test everything – actually fail over.

But in between the regularly scheduled failover events, you need to monitor everything.  Monitor the PDUs, and if you are running Active/Active, set the thresholds to 40%.  Monitor your standby nodes, and monitor the synchronization of their configuration (if you use NetScalers, F5 BIG-IPs, or other load balancers, you do not want to experience a hardware failure on your primary node, only to fail over to a secondary node that is unaware of the configuration of any of your production VIPs).  Monitor all systems for RAID status; monitor secondary router paths for BGP peerings; monitor backup switches for changes in link status; monitor temperature sensors, memory pool usage, and fragmentation.
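Checking standby configuration sync can be as simple as comparing a digest of each node's running configuration on a schedule. This is a hedged sketch – the fetch mechanism (SNMP, an HTTP API, or ssh) is vendor-specific, and the sample config strings are made up:

```python
import hashlib

def config_digest(config_text: str) -> str:
    """Cheap, comparable fingerprint of a node's running configuration."""
    return hashlib.sha256(config_text.encode()).hexdigest()

def standby_in_sync(primary_config: str, standby_config: str) -> bool:
    """True when the standby's configuration matches the primary's exactly."""
    return config_digest(primary_config) == config_digest(standby_config)

# A monitoring check would fetch both configs each cycle and alert on drift:
primary = "vip web-prod 10.0.0.5:443 -> pool-a"   # illustrative config text
standby = "vip web-prod 10.0.0.5:443 -> pool-a"
assert standby_in_sync(primary, standby)
assert not standby_in_sync(primary, standby + "\nvip new-vip ...")
```

The point is not the hashing – it's that the comparison runs continuously, so a VIP added to the primary but not the standby raises an alert long before the failover that would otherwise expose it.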

Notice that virtually all the outage announcements companies issue promise to improve their monitoring to prevent such issues from recurring.

At LogicMonitor, we suggest you implement thorough monitoring first, to avoid as many of these issues as you can in the first place. LogicMonitor makes that simple.


When an OID is not an OID


It’s still surprising to me that hardware and software manufacturers do not seem to value any kind of consistency in their management interfaces.  Or maybe it’s intentional – complicating the monitoring and management of their systems encourages the purchase of the vendor’s own monitoring tools.

In any event, it makes the case for a monitoring service such as LogicMonitor, where we actually provide the templates of what you should be monitoring for a specific kind of device, all the more compelling.

A few examples of what I mean:

  • NetApp decided to change the OIDs used for reporting fan and electronics failures from one minor release to the next.
  • Similarly, NetApp changed the units in which volume latency is reported, from milliseconds to microseconds, in releases after version 7.3.
  • Cisco changed the way it responds to queries for the interface queue length of VLAN interfaces between minor releases of the 12.2 code.
  • Microsoft changes all sorts of counters in all sorts of releases, and even adopts entirely different monitoring interfaces from one release of a product to the next, encouraging the use of WMI in one release of a product, then dropping support of it in the next.

If your monitoring system cannot automatically apply different monitoring templates based on the version of software running on a device, then if you run more than one of a device, and don’t upgrade all of them at the same moment, you will be left with the tedious job of associating the correct datasource templates with each device as you update its software.  And that’s assuming, of course, that you know in advance what changes to apply for each upgrade of IOS, or OnTap, or MySQL, or Windows, or …..
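Handling a change like NetApp's latency-unit switch means keying the datasource – or at least a conversion factor – off the detected software version. A minimal sketch, assuming version strings have already been parsed into tuples:

```python
def latency_to_ms(raw_value: float, ontap_version: tuple) -> float:
    """Normalize a volume latency reading to milliseconds.

    Per NetApp's change, releases after 7.3 report microseconds;
    earlier releases report milliseconds.
    """
    if ontap_version > (7, 3):
        return raw_value / 1000.0  # microseconds -> milliseconds
    return raw_value

# The same underlying latency, reported by two different releases,
# normalizes to the same value:
assert latency_to_ms(5000.0, (8, 0)) == 5.0   # post-7.3: microseconds
assert latency_to_ms(5.0, (7, 2)) == 5.0      # pre-7.3: milliseconds
```

Without version-aware normalization like this, an upgrade from 7.3 to 8.0 would make latency graphs jump by a factor of 1000 and fire a wall of false alerts.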

It’s this kind of bundled knowledge and automation that helps LogicMonitor save our customers hours of time.  Of course, in this case, they wouldn’t even be aware of it – it’s just a series of false alerts that they never receive, as the monitoring automatically adjusts to changes in their systems.

I really am proud of our product.


Anyone who runs IT infrastructure is aware that power consumption is one of the biggest costs in datacenter provisioning and ongoing expenses.  Those who are not will soon become aware: energy is the fastest-rising cost in the datacenter, and prices are predicted to keep climbing.

Maximizing power efficiency is a complex topic, which can involve:

  • virtualization to consolidate physical servers
  • adoption of on-demand cloud computing
  • evaluating whether your applications scale in a way such that newer, more powerful equipment (which draws more power) will actually be efficient in delivering more requests per Amp (which may not be the case if your bottleneck is the latency of an external storage system or database, not CPU speed)

However, there are some simple things that all IT Managers should be on top of.

Track your power usage.

You should be tracking your power usage over time.  You should be able to see the total usage, by datacenter, so you can see how your usage is changing as you bring on new servers, or add load to those servers.  As the saying goes, you can’t manage what you don’t measure.
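If your PDUs or metered power strips expose per-unit readings, the roll-up itself is simple. A minimal sketch, with an illustrative data layout (the datacenter and PDU names are made up):

```python
from collections import defaultdict

# (datacenter, pdu, kilowatts) samples from one polling cycle
samples = [
    ("us-east", "pdu-1", 3.2),
    ("us-east", "pdu-2", 2.9),
    ("eu-west", "pdu-1", 4.1),
]

def usage_by_datacenter(samples):
    """Sum one polling cycle's PDU readings into a per-datacenter total."""
    totals = defaultdict(float)
    for dc, _pdu, kw in samples:
        totals[dc] += kw
    return dict(totals)

totals = usage_by_datacenter(samples)
# totals["us-east"] is that site's combined draw (~6.1 kW here)
```

Store these totals as a time series and the trend lines practically draw themselves: each server deployment or traffic increase shows up as a step or slope change per datacenter.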

Look for large numbers of underutilized servers that could be consolidated.

If you have a pool of servers serving a web site, but they are consistently running at less than 10% load, even during peaks, it is often feasible to drastically reduce the pool and simply power down a subset of servers.  At LogicMonitor, we’ve seen this situation among our customers more often than you’d think. Often it arises when web functions are migrated from one set of servers to another as new code is released: the load on the old servers drops, but the number of servers is never reduced.  (One issue to be aware of is non-linear scaling of load – but that’s a topic for another post.)
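As a rough sanity check on how far such a pool could shrink, here is a naive back-of-envelope sketch. It deliberately ignores the non-linear scaling caveat just mentioned and assumes load spreads evenly, so treat the result as an upper bound on savings:

```python
import math

def servers_needed(current_servers, peak_utilization, target_utilization=0.6):
    """Minimum servers to carry the same peak load at a higher
    (but still comfortable) per-server utilization."""
    total_load = current_servers * peak_utilization  # in 'server-units'
    return max(1, math.ceil(total_load / target_utilization))

# Twenty servers peaking at 10% load could, naively, be handled by four
# servers at ~50% -- leaving sixteen candidates for power-down.
print(servers_needed(20, 0.10))  # 4
```

The 60% target utilization is an assumed comfort margin, chosen so the surviving pool still has headroom for failures and spikes; pick your own based on your failover math.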

Pay some attention to older servers.

Older servers are often the least efficient in terms of productivity per Amp.  By definition, anything running on an older server does not require the latest CPU and memory speeds – which makes such workloads prime candidates for consolidation onto a newer, energy-efficient server.

