Here at LogicMonitor we love our happy hours, and since we will be speaking at the upcoming AnsibleWorks Fest (or AnsibleFest) we thought of no better way to tie it off than over a drink.
Our very own Jeff Behl, Chief Network Architect, will be speaking at 5:05pm about the importance of measurement and monitoring the whole IT stack in a DevOps world. Afterwards, we’d love the chance to meet you over drinks at Dillons (around the corner from the event).
Look for us at the Dillons downstairs bar between 6-8pm.
This post, written by LogicMonitor’s Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013.
If you are like us, you are running some type of Linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don’t go as planned. Failures of puppet can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.
In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it’s easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn’t get pushed, production is impacted, and it’s the client who notices.
Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual puppet runs on the end nodes themselves. For that purpose, puppet has a built in feature to export status info Read more »
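As a minimal sketch of the kind of per-node check described above: assume the agent's run summary has already been parsed into a dictionary. The key names used here (`time.last_run` as an epoch timestamp and `events.failure` as a count) follow the layout of Puppet's run summary file as we understand it, and should be treated as assumptions to verify against your own nodes. The function name and the one-hour threshold are hypothetical.

```python
import time


def check_puppet_run(summary, now=None, max_age_s=3600):
    """Return (ok, reason) for one parsed Puppet run summary.

    `summary` is a dict; the key layout is an assumption, not a
    guaranteed Puppet contract.
    """
    now = time.time() if now is None else now

    # A missing timestamp means the agent never reported a run.
    last_run = summary.get("time", {}).get("last_run")
    if last_run is None:
        return False, "no last_run timestamp recorded"

    # Catches agents that died or were disabled and never re-enabled.
    age = now - last_run
    if age > max_age_s:
        return False, f"last run was {int(age)}s ago"

    # Catches runs that happened but did not apply cleanly.
    failures = summary.get("events", {}).get("failure", 0)
    if failures:
        return False, f"{failures} failed event(s) in last run"

    return True, "ok"
```

The staleness test covers both quiet failure modes mentioned earlier (a stuck agent and a forgotten disable), while the failure count covers runs that executed but did not succeed.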
Sample SAT question: xUnit is to Continuous Integration as what is to automated server deployments?
We’ve been going through lots of growth here at LogicMonitor. Part of growth means firing up new servers to deal with more customers, but we also have been adding a variety of new services: proxies that allow our customers to route around Internet issues that BGP doesn’t catch; servers that test performance and reachability of customers’ sites from various locations, and so on. All of which means spinning up new servers: sometimes lots of times, in QA, staging and development environments.
As old hands in running datacenter operations, we have long adhered to the tenet of not trusting people – including ourselves. People make mistakes, and can’t remember things they did to make things work. So all our servers and applications are deployed by automated tools. We happen to use Puppet, but collectively we’ve worked with cfengine, chef, and even Rightscripts.
So, for us to bring up a new server – no problem. It’s scripted, repeatable, takes no time. But how about splitting the functions of what was one server into several? And how do we know that the servers being deployed are set up correctly, if there are changes and updates? Read more »
Our digs here at LogicMonitor are cozy. Being adjacent to sales, I get to hear our sales engineers work with new customers, and it’s not uncommon that a new customer gets a rude awakening when they first install LogicMonitor. Immediately, LogicMonitor starts showing warnings and alerts. “Can this be right or is this a monitoring error?!”, they ask. Delicately, our engineer will respond, “I don’t think that’s a monitoring error. It looks like you have a problem there.”
This happened recently with a customer who wanted to use LogicMonitor to watch their large VMware installation. We make excellent use of the VMware API, which provides a rich set of data sources for monitoring. In this instance, LogicMonitor’s default alert settings threw several warnings about an ESX host’s datastore. There were multiple warnings regarding write latency problems on the ESX datastore, and drilling down, we found that a single VM on that datastore was an ‘I/O hog’ that was grabbing so much disk resource that it was causing disk contention among the other VMs.
Finding the rogue host was easy with LogicMonitor’s clear, easy to read graphs. With the disk I/O of the different VMs plotted on the same graph, it was easy to spot the one whose disk operations were significantly higher than the rest.
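The same "one line towers over the rest" comparison can be roughed out in code. This is only an illustrative sketch, not how LogicMonitor's graphs or alerts work: it assumes you already have a per-VM disk operations figure for one datastore, and it flags any VM well above the median. The function name, the VM names in the usage line, and the factor of 5 are all made up for the example.

```python
from statistics import median


def find_io_hogs(iops_by_vm, factor=5.0):
    """Return the VMs whose disk ops exceed `factor` times the median.

    `iops_by_vm` maps a VM name to its disk operations per second.
    """
    if not iops_by_vm:
        return []
    # The median is a robust baseline: one hog barely moves it.
    baseline = median(iops_by_vm.values())
    # Floor the baseline at 1.0 so an idle datastore does not
    # flag every VM that does any I/O at all.
    threshold = factor * max(baseline, 1.0)
    return sorted(vm for vm, iops in iops_by_vm.items() if iops > threshold)
```

For instance, `find_io_hogs({"vm-a": 120, "vm-b": 95, "vm-c": 110, "vm-d": 4200})` flags only `vm-d`.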
We’ve seen this particular problem with VMware enough that our founder, Steve Francis, made this short video on how to quickly identify which VM on an ESX host is hogging resources: (Caveat: You must be able to understand Australian)
All our monitoring data sources have default alerting levels set that you can tune to fit your needs, but they’re pretty close out of the box as they’re the product of a LOT of monitoring experience. This customer didn’t have to make any adjustments to our alert levels to find a problem they were unaware of with potential customer-facing impacts. The resolution was easy: they moved the VM to another ESX host with a different datastore. But the detection tool was the key.
If you’re wondering about your VMware infrastructure, sign up for a free trial with LogicMonitor today and see what you’ve been missing.
- This article was contributed by Jeffrey Barteet, TechOps Engineer at LogicMonitor
Recently we rolled out a new release of LogicMonitor. Among the many improvements and fixes that users saw, there were also some backend changes to the Linux systems that store monitoring data.
The rollout went smoothly, no alerts were triggered – but it was pretty easy to see that something had changed: Read more »
We got a question internally about why one of our demo servers was slow, and how to use LogicMonitor to help identify the issue. The person asking comes from a VoIP, networking and Windows background, not Linux, so his questions reflect those of a less-experienced sysadmin (in this case). I thought it interesting that he documented his thought processes, and I’ll intersperse my interpretation of the same data, and some thoughts on why LogicMonitor alerts as it does… Read more »
Denise Dubie wrote a recent piece in CIO magazine about “5 Must-have IT Management Technologies for 2010”, in which she identifies one of the must-haves as IT process automation. She quotes Jim Frey, research director at EMA: “On the monitoring side, automation will be able to keep up with the pace of virtual environments and recognize when changes happen in ways a human operator simply could not.”
At LogicMonitor we couldn’t agree more. It’s true that, as the article implies, virtualization and cloud computing make the need for monitoring automation more acute than before, which is why customers use LogicMonitor to automatically detect new hosts and monitor newly created Amazon EC2 instances. Having dynamic system scaling without the ability to automatically monitor the dynamic systems is just asking for undetected service-affecting issues.
However, even in traditional non-virtualized datacenters (and despite the buzz, most datacenters and services are still built on physical machines), there is often so much change going on with systems and applications that non-automated monitoring has virtually no chance of keeping up with the additions and deletions. A typical example of an automated change report of one LogicMonitor customer from last night shows:
And that was just one day’s changes. Imagine the staff costs involved with tracking and implementing all these changes, every day, in a manual fashion, that are avoided by the use of automated datacenter monitoring.
And more significantly, imagine the likelihood that one or more of these changes would NOT have made it into monitoring manually – so that when a production service has issues, there is no monitoring to detect it.
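The core of any such change report is a comparison between what was discovered yesterday and what is discovered today. A minimal sketch, assuming each discovery pass yields a flat list of monitored object names (the function name and the names in the usage note are hypothetical, and this is not a description of LogicMonitor's internals):

```python
def change_report(previous, current):
    """Diff two discovery snapshots.

    Each snapshot is an iterable of object names (hosts, volumes,
    interfaces, and so on). Returns what appeared and what vanished.
    """
    prev, curr = set(previous), set(current)
    return {
        "added": sorted(curr - prev),    # needs monitoring added
        "removed": sorted(prev - curr),  # monitoring can be retired
    }
```

Calling `change_report(["db1:/data", "web1:eth0"], ["db1:/data", "web2:eth0"])` reports `web2:eth0` as added and `web1:eth0` as removed. Run automatically after every discovery pass, nothing depends on a person remembering to update the monitoring.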
Having your customers be the first to know about issues is not a situation anyone wants to be in – and monitoring automation is the only way to avoid that. That’s one area that LogicMonitor’s datacenter monitoring excels at.
We like monitoring. We like Java. Not to slight other languages – we like Ruby, perl, php, .NET and other platforms, too, and like to monitor them, also.
However, unlike most other languages, Java provides an explicit technology for monitoring applications and system objects. JMX is supported on any platform running the JVM, but like most other monitoring protocols, there are lots of interesting nuances and ways to use it. Which means lots of nuances in how to detect it and monitor it.
We have quite a few customers that use LogicMonitor for JMX monitoring, of both standard and custom applications, so we’ve run into quite a few little issues, and solved them.
One example is that the naming convention for JMX objects is loosely defined. Initially, the JMX collector for LogicMonitor assumed that every object would have a “type” key property, as specified in best practices. Of course, this is a rule “more honored in the breach than in the observance”, as widespread applications such as WowzaMediaServer and even Tomcat do not adhere to it.
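To make the naming problem concrete, here is a rough sketch of labelling an MBean without assuming a “type” key property exists. It works on the plain string form `domain:key=value,key=value` and deliberately ignores quoted values and wildcard patterns, which real JMX object names may contain. The function name and the fallback format are our own invention for illustration.

```python
def instance_label(object_name):
    """Build a display label from a JMX object name string.

    Prefers the conventional "type" key property, but falls back to
    listing every key property when "type" is absent.
    """
    domain, _, props = object_name.partition(":")
    # Split "k1=v1,k2=v2" into a dict; malformed pieces are skipped.
    pairs = dict(p.split("=", 1) for p in props.split(",") if "=" in p)

    if "type" in pairs:
        return f"{domain}/{pairs['type']}"

    # No "type": use all key properties, sorted for a stable label.
    rest = ",".join(f"{k}={v}" for k, v in sorted(pairs.items()))
    return f"{domain}/{rest}"
```

With this, `instance_label("java.lang:type=Memory")` gives `java.lang/Memory`, while a name with no “type” property still gets a usable, deterministic label.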
Another example is that JMX supports complex datatypes. We have customers who do not register different MBeans for all their classes of data, but instead expose MBeans that return maps of maps. Our collectors and ActiveDiscovery did not initially deal with these datatypes, as we hadn’t anticipated their use. But there are good reasons to use them in a variety of cases, so LogicMonitor should support the wishes of the user – that’s one of our tenets, that LogicMonitor enables users to work the way they want, instead of constraining them to a preconceived idea. So we extended our ActiveDiscovery to iterate through maps, and maps of maps, and composite datatypes.
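Iterating through maps of maps boils down to a recursive walk that turns each leaf value into a named datapoint. The sketch below shows the idea using plain Python dictionaries as a stand-in for the values an MBean attribute might return. It is an illustration only, not our collector code, and the dotted path format is an assumption.

```python
def flatten(value, prefix=""):
    """Flatten nested maps into {dotted.path: leaf_value}.

    Any value that is not a dict is treated as a leaf datapoint.
    """
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            # Recurse: a child may itself be a map of maps.
            out.update(flatten(child, path))
        return out
    return {prefix: value}
```

A map such as `{"queues": {"orders": {"depth": 3}}}` becomes the single datapoint `queues.orders.depth`, so however deeply an application nests its data, each leaf can be discovered, graphed, and alerted on individually.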
This enables our customers to instrument their applications in the way they think is most appropriate, while automating the configuration of management and alerting. While we think we’ve got all the permutations of JMX covered, I’m not taking any bets that a new customer won’t come along with a new variant that adds a perfectly logical use case that we do not support. Of course, if that’s the case, we’ll support it within a month or so – and all our customers – current and future – will be able to immediately reap the benefits. That’s just one of the niceties of the hosted SaaS model.
If your infrastructure has to be up at all times (or as much as possible), how to best achieve that? In an Active/Active configuration, where all parts of the infrastructure are used all the time, or in an N+1 configuration, where there are idle resources waiting to take over in the event of a failure?
The short answer is it doesn’t matter unless you have good monitoring in place.
The risk with Active/Active is that load does not scale linearly. If you have two systems running at 40% load, that does not mean that one will be able to handle the load of both, and run at 80%. More likely you will run into an inflection point, where you will run into an unanticipated bottleneck – be it CPU, memory bandwidth, disk IO, or some system that is providing external API resources. It can even be the power system. If servers have redundant power supplies, and each PSU is attached to separate Power Distribution Units (PDUs), the critical load for each PDU is now 40% of the rating. If one circuit fails, all load switches to the other PDU – and if that PDU is now asked to carry more than 80% of its rating, overload circuits will trip, leading to a total outage. There is some speculation that a cascading failure of this type was behind the recent Amazon EC2 outage.
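The PDU arithmetic is worth spelling out. A minimal sketch, assuming two PDUs with the same rating, loads measured in the same unit as the rating (amps, say), and the 80% trip point mentioned above; the function and parameter names are hypothetical:

```python
def pdu_failover_check(load_a, load_b, rating, trip_fraction=0.8):
    """Check whether one PDU could carry both loads after a failure.

    `load_a` and `load_b` are the current loads on each PDU, in the
    same unit as `rating`.
    """
    combined = load_a + load_b          # what the survivor must carry
    limit = trip_fraction * rating      # point where overload trips
    return {
        "combined_fraction": combined / rating,
        "survives_failover": combined <= limit,
    }
```

On 30-amp PDUs, two loads of 10 amps combine to 20 amps, safely under the 24-amp trip point. Two loads of 13 amps each look comfortable at 43% of rating, but combine to 26 amps and would trip the surviving PDU. That is why the per-PDU alert threshold belongs at 40%, not 80%.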
The risk with N+1 is that, by definition, you have a system sitting idle – so how do you know it’s ready for use? Oftentimes, just when you need it, it’s not ready.
Of course, the only 100% certain way of knowing your infrastructure can deal with failures is to have operational procedures in place that test everything – actually fail over.
But in between the regularly scheduled failover events, you need to monitor everything. Monitor the PDUs, and if you are running Active-Active, set the thresholds to 40%. Monitor your standby nodes, and monitor the synchronization of the configuration of the standby nodes (if you use Netscalers, F5 Big IPs, or other load balancers, you do not want to experience a hardware failure on your primary node, only to fail over to a secondary node that is unaware of the configuration of any of your production VIPs.) Monitor all systems for RAID status, monitor secondary router paths for BGP peerings, monitor backup switches for changes in link status, temperature sensors, memory pool usage and fragmentation.
If you notice, virtually all the outage announcements companies issue promise to improve their monitoring to prevent such issues.
At LogicMonitor, we suggest you implement thorough monitoring first, to avoid as many of these issues as you can in the first place. LogicMonitor makes that simple.
It’s still surprising to me that hardware and software manufacturers do not seem to value any kind of consistency in their management interfaces. Or maybe it’s intentional, to complicate monitoring and management of their systems to encourage the purchase of the vendor’s own monitoring systems.
In any event, it makes the case for a monitoring service such as LogicMonitor, where we actually provide the templates of what you should be monitoring for a specific kind of device, all the more compelling.
A few examples of what I mean:
If your monitoring system cannot automatically apply different monitoring templates based on the version of software being run on devices, then if you run more than one of a device, and don’t upgrade all of them at the same moment, you will be left with the tedious job of associating the correct datasource templates with each device as you update its software. And that’s of course assuming that you know in advance what changes to apply for each upgrade of IOS, or OnTap, or MySQL, or Windows, or ...
It’s this kind of bundled knowledge and automation that helps LogicMonitor save our customers hours of time. Of course, in this case, they wouldn’t even be aware of it: it’s just a series of false alerts that they do not receive, as a result of the monitoring automatically adjusting to changes in their systems.
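Version-aware template selection can be sketched as a lookup keyed on the minimum version each template applies to. Everything here is hypothetical: the template names are invented, and the parser assumes purely numeric dotted versions such as `5.1.30`, which real version strings from many vendors will not satisfy.

```python
def parse_version(text):
    """Turn "5.1.30" into (5, 1, 30) so versions compare correctly."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical table: (minimum version, template name).
TEMPLATES = [
    ((5, 5), "mysql_status_v55"),
    ((5, 0), "mysql_status_v50"),
]


def pick_template(version, templates):
    """Return the template with the highest minimum version that the
    device's version satisfies, or None if none applies."""
    v = parse_version(version)
    # Check the newest templates first.
    for min_version, name in sorted(templates, reverse=True):
        if v >= min_version:
            return name
    return None
```

A device reporting `5.1.30` gets `mysql_status_v50`; after an upgrade to `5.6.10` the same lookup returns `mysql_status_v55`, with nobody having to reassociate templates by hand.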
I really am proud of our product.