When designing infrastructure architecture, there is usually a choice between complexity and fault tolerance. It’s not just an inverse relationship, however. It’s a curve. You want the minimal complexity possible to achieve your availability goals. And you may even want to reduce your availability goals to reduce your complexity (which can end up increasing your actual availability).
The rule to adopt is: if you don’t understand something well enough that it seems simple to you (or your staff), even in its failure modes, you are better off without it.
Back in the day, clever people suggested that most web sites would have the best availability by running everything – DB, web application, everything – on a single server. This was the simplest configuration, and the easiest to understand.
With no complexity – one of everything (one switch, one load balancer, one web server, one database, for example) – you can tolerate zero failures, but it’s easy to know when there is a failure.
With 2 of everything, connected the right way, you can keep running with one failure, but you may not be aware of the failure.
So is it a good idea to add more connections, and plan to be able to tolerate multiple failures? Not usually. For example, with a redundant pair of load balancers, you can connect one load balancer to one switch, and the other load balancer to another switch. In the event of a load balancer failure, the surviving load balancer will automatically take over, and all is good. If a switch fails, it may be the one that the active load balancer is connected to – this would also trigger a load balancer failover, and everything is still running correctly. It would be possible to connect each load balancer to each switch, so that failure of a switch does not impact the load balancers, but is it worth it?
This would allow the site to survive two simultaneous unrelated failures – one switch and one load balancer – but the added complexity of engineering the multiple traffic paths increases the likelihood that something will go wrong in one of the 4 possible states. There are now 4 possible traffic paths instead of 2, so more testing is needed, more maintenance is needed on any change, and so on. The benefit seems outweighed by the complexity.
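The path count above can be checked with a small sketch. The names here (TrafficPaths, lb1, sw1) are made up for illustration; the code only enumerates which load balancer is cabled to which switch in the two designs, assuming a pair of each.

```java
import java.util.ArrayList;
import java.util.List;

public class TrafficPaths {
    // Lists every (load balancer, switch) pair that is cabled together.
    // fullMesh = false: load balancer i is cabled only to switch i.
    // fullMesh = true:  every load balancer is cabled to every switch.
    static List<String> paths(int count, boolean fullMesh) {
        List<String> result = new ArrayList<>();
        for (int lb = 1; lb <= count; lb++) {
            for (int sw = 1; sw <= count; sw++) {
                if (fullMesh || lb == sw) {
                    result.add("lb" + lb + "->sw" + sw);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(paths(2, false)); // [lb1->sw1, lb2->sw2]
        System.out.println(paths(2, true));  // [lb1->sw1, lb1->sw2, lb2->sw1, lb2->sw2]
    }
}
```

Each path in the second list is a state that has to be tested and kept working through every change, which is where the extra cost comes from.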
The same concept of “if it seems complex, it doesn’t belong” can be applied to software, too. Load balancing, whether via an appliance such as Citrix NetScaler or software such as HAProxy, is simple enough to most people nowadays. The same is not generally true of clustered file systems, or DRBD. If you truly need these technologies, you had better have a thorough understanding of them, invest the time to create all the failure modes you can, and train your staff so that it is not complex for them to deal with any of the failures.
If you have a consultant come in and set up BGP routing, but no one on your NOC or on-call staff knows how to do anything with BGP, you have just greatly reduced your site’s operational availability.
The “Complexity Filter” can be applied to monitoring systems, as well. If your monitoring system stops, and you don’t have immediate staff available to troubleshoot the restart of the service processes; or the majority of your staff cannot easily interpret the monitoring, or create new checks, or use it to see trends over time – your monitoring is not contributing to your operational uptime. It is instead a resource sink, and is likely to bite you when you least expect it. Datacenter monitoring, like all things in your datacenter, should be as automated and simple as possible.
If it seems complex – it will break. Learn it until it’s not complex, or do without it.
Denise Dubie wrote a recent piece in CIO magazine, “5 Must-have IT Management Technologies for 2010”, in which she identifies one of the must-haves as IT process automation. She quotes Jim Frey, research director at EMA: “On the monitoring side, automation will be able to keep up with the pace of virtual environments and recognize when changes happen in ways a human operator simply could not.”
At LogicMonitor we couldn’t agree more. It’s true that, as the article implies, virtualization and cloud computing make the need for monitoring automation more acute than previously. That is why customers use LogicMonitor to automatically detect new hosts and monitor newly created Amazon EC2 instances: having dynamic system scaling without the ability to automatically monitor the dynamic systems is just asking for undetected service-affecting issues.
However, even in traditional non-virtualized datacenters (and despite the buzz, most datacenters and services are still built on physical machines), there is often so much change going on with systems and applications that non-automated monitoring has virtually no chance of keeping up with the additions and deletions. A typical example of an automated change report of one LogicMonitor customer from last night shows:
And that was just one day’s changes. Imagine the staff costs involved with tracking and implementing all these changes, every day, in a manual fashion, that are avoided by the use of automated datacenter monitoring.
And more significantly, imagine the likelihood that one or more of these changes would NOT have made it into monitoring manually – so that when a production service has issues, there is no monitoring to detect it.
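The gap described here is, at its core, a set difference between what exists and what is monitored. The following is a minimal sketch of that idea only, not LogicMonitor’s implementation; the class name and host names are hypothetical, and a real system would fill both sets from discovery and from the monitoring configuration.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class MonitoringDrift {
    // Returns the elements of a that are not in b, sorted for stable output.
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.removeAll(b);
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical data: what discovery found versus what is monitored.
        Set<String> discovered = new TreeSet<>(Arrays.asList("web01", "web02", "db01", "cache01"));
        Set<String> monitored = new TreeSet<>(Arrays.asList("web01", "db01", "web03"));

        // Hosts that exist but have no monitoring: these fail silently.
        System.out.println("unmonitored: " + difference(discovered, monitored)); // unmonitored: [cache01, web02]
        // Hosts still monitored but gone: these generate false alerts.
        System.out.println("stale: " + difference(monitored, discovered));       // stale: [web03]
    }
}
```

Done by hand, both lists have to be worked out and acted on after every change; automation runs the comparison continuously.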
Having your customers be the first to know about issues is not a situation anyone wants to be in – and monitoring automation is the only way to avoid that. That’s one area that LogicMonitor’s datacenter monitoring excels at.
We like monitoring. We like Java. Not to slight other languages – we like Ruby, Perl, PHP, .NET and other platforms, too, and like to monitor them as well.
However, unlike most other languages, Java provides an explicit technology for monitoring applications and system objects. JMX is supported on any platform running the JVM, but like most other monitoring protocols, there are lots of interesting nuances and ways to use it. Which means lots of nuances in how to detect it and monitor it.
We have quite a few customers that use LogicMonitor for JMX monitoring, of both standard and custom applications, so we’ve run into quite a few little issues, and solved them.
One example is that the naming convention for JMX objects is loosely defined. Initially, the JMX collector for LogicMonitor assumed that every object would have a “type” key property, as specified in best practices. Of course, this is a rule “more honored in the breach than in the observance”, as widespread applications such as WowzaMediaServer and even Tomcat do not adhere to it.
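To make the naming issue concrete, here is a hedged sketch of tolerating a missing “type” key with the standard javax.management.ObjectName class. It is not the LogicMonitor collector’s code; the class MBeanLabel, the fallback choice and the com.example object names are assumptions for illustration.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class MBeanLabel {
    // Prefer the "type" key property; if the MBean does not define one,
    // fall back to the full key property list in canonical (sorted) order.
    static String label(ObjectName name) {
        String type = name.getKeyProperty("type"); // null when the key is absent
        return type != null ? type : name.getCanonicalKeyPropertyListString();
    }

    // Convenience wrapper that parses a string form of the object name.
    static String labelOf(String objectName) {
        try {
            return label(new ObjectName(objectName));
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Not a valid JMX object name: " + objectName, e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical object names, not taken from any real application.
        System.out.println(labelOf("com.example:type=Cache,name=users")); // Cache
        System.out.println(labelOf("com.example:name=users,kind=stats")); // kind=stats,name=users
    }
}
```

The point is only that a collector cannot treat “type” as mandatory; any deterministic fallback would do.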
Another example is that JMX supports complex datatypes. We have customers who do not register different MBeans for all their classes of data, but instead expose MBeans that return maps of maps. Our collectors and ActiveDiscovery did not initially deal with these datatypes, as we hadn’t anticipated their use. But there are good reasons to use them in a variety of cases, so LogicMonitor should support the wishes of the user – that’s one of our tenets, that LogicMonitor enables users to work the way they want, instead of constraining them to a preconceived idea. So we extended our ActiveDiscovery to iterate through maps, and maps of maps, and composite datatypes.
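Iterating through maps, maps of maps and composite datatypes amounts to a recursive walk that turns a nested attribute value into flat, dotted datapoint names. The sketch below shows that general technique under stated assumptions; it is not ActiveDiscovery itself, and the class name, the dotted naming scheme and the sample data are hypothetical. Other open types, such as arrays, are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.openmbean.CompositeData;

public class AttributeFlattener {
    // Flattens a nested attribute value into "prefix.key.subkey" -> leaf value.
    static Map<String, Object> flatten(String prefix, Object value) {
        Map<String, Object> out = new LinkedHashMap<>();
        walk(prefix, value, out);
        return out;
    }

    private static void walk(String path, Object value, Map<String, Object> out) {
        if (value instanceof Map) {
            // Map, or map of maps: recurse into every entry.
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                walk(path + "." + e.getKey(), e.getValue(), out);
            }
        } else if (value instanceof CompositeData) {
            // Composite datatype: recurse into every named item.
            CompositeData cd = (CompositeData) value;
            for (String key : cd.getCompositeType().keySet()) {
                walk(path + "." + key, cd.get(key), out);
            }
        } else {
            out.put(path, value); // leaf: a number, string, and so on
        }
    }

    // Hypothetical sample: a map of maps as an MBean attribute might return it.
    static Map<String, Object> demo() {
        Map<String, Object> orders = new LinkedHashMap<>();
        orders.put("depth", 3);
        orders.put("consumers", 2);
        Map<String, Object> queues = new LinkedHashMap<>();
        queues.put("orders", orders);
        return flatten("queues", queues);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {queues.orders.depth=3, queues.orders.consumers=2}
    }
}
```

Each flattened key can then be treated like an ordinary datapoint, so the application author is free to pick whichever structure suits the data.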
This enables our customers to instrument their applications in the way they think is most appropriate, while automating the configuration of management and alerting. While we think we’ve got all the permutations of JMX covered, I’m not taking any bets that a new customer won’t come along with a new variant that adds a perfectly logical use case that we do not support. Of course, if that’s the case, we’ll support it within a month or so – and all our customers – current and future – will be able to immediately reap the benefits. That’s just one of the niceties of the hosted SaaS model.