One of our long-time customers, Appfolio, which makes great SaaS property management software, asked how they could use LogicMonitor to monitor the size of some files across their fleet of Linux servers. A simple request, but not as simple as one might hope. Why not? LogicMonitor usually uses SNMP to monitor Linux servers, as that way there is no need for extra software to be installed on any server. (It should be noted that some people deploy LogicMonitor collectors as agents, deploying one per server. In this case, you could use a script-based datasource to simply run ‘ls’ on arbitrary files – but that’s for a different blog entry.) While SNMP has many defined OIDs (a fancy way of saying questions that can be asked and answered), there is no defined OID for “how big is arbitrary file X?” That means that by default, there is no way to remotely query a system, using SNMP, to determine a file size.
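As a rough illustration of the script-based approach mentioned above (one collector per server, so the files are local), a script could report file sizes as simple key=value lines. This is a minimal sketch and not LogicMonitor's datasource format: the function names and the output format are assumptions made for illustration.

```python
import os
import sys


def file_sizes(paths):
    """Return {path: size_in_bytes}; -1 marks a file that cannot be read."""
    sizes = {}
    for path in paths:
        try:
            sizes[path] = os.stat(path).st_size
        except OSError:
            sizes[path] = -1  # missing or unreadable: report a sentinel value
    return sizes


def format_report(sizes):
    """Render one 'path=bytes' line per file (hypothetical output format)."""
    return "\n".join(f"{path}={size}" for path, size in sorted(sizes.items()))


if __name__ == "__main__":
    # Usage: python filesize.py /var/log/messages /var/log/secure
    print(format_report(file_sizes(sys.argv[1:])))
```

Reporting a sentinel for a missing file, rather than failing, lets the monitoring side decide whether a vanished file is itself worth an alert.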
Sample SAT question: xUnit is to Continuous Integration as what is to automated server deployments?
We’ve been going through lots of growth here at LogicMonitor. Part of growth means firing up new servers to deal with more customers, but we also have been adding a variety of new services: proxies that allow our customers to route around Internet issues that BGP doesn’t catch; servers that test performance and reachability of customers’ sites from various locations, and so on. All of which means spinning up new servers: sometimes lots of times, in QA, staging and development environments.
As old hands in running datacenter operations, we have long adhered to the tenet of not trusting people – including ourselves. People make mistakes, and can’t remember things they did to make things work. So all our servers and applications are deployed by automated tools. We happen to use Puppet, but collectively we’ve worked with CFEngine, Chef, and even RightScripts.
So, for us to bring up a new server – no problem. It’s scripted, repeatable, takes no time. But how about splitting the functions of what was one server into several? And how do we know that the servers being deployed are set up correctly, if there are changes and updates?
One of the great things about being a customer of a SaaS-delivered monitoring service like LogicMonitor is that you get best practices in monitoring all sorts of technologies without having to have an expert in each technology on staff.
My grandpa loved cars. He worked on them with a level of passion most people reserve for things like expensive red wines and members of the opposite sex. He didn’t believe in outsourcing the care and maintenance of his wheels.
So I was shocked when one day he announced that changing his own oil was senseless. He was prideful, but he also valued his time and was adept at basic math: 4 quarts + 1 filter + 1 oil pan + 1 jack + 10 greasy fingernails + 2 trips to the auto parts store + 3 hours labor was not less than $29.99 + 45 minutes of watching television in the lobby at Oil & Tune.
This same general equation comes to mind when we hear tales of people instrumenting their own network monitoring solutions with open-source tools (see price comparison chart). When you factor in not just software costs, but hardware costs, and people costs to maintain everything, open source monitoring tools can quickly become more costly than a SaaS-based monitoring solution like LogicMonitor. (For more detail, download the network and server monitoring comparison whitepaper.)
You don’t have to take our word for it. This recent Twitter exchange between a Nagios fan* and a LogicMonitor client illustrates the difference in philosophies.
@NagiosFan: #Nagios is awesome Except for the parts that are terrible and inexcusable. But mostly awesome.
@LogicMonitorUser: @NagiosFan I cannot disagree more. Too much work for not enough gain. But we each value things differently #nagios is not for me.
@NagiosFan: @LogicMonitorUser haha that’s ok I like the extensibility and the initial ‘crafting’ for gains later. Plus, hella-automatable. What do you use?
Of course there are use cases where building your own monitoring tool makes sense. But for the greater percentage of SysAdmins, IT departments, and CTOs out there, LogicMonitor has done the hard work for you, and serves it up on a silicon platter.
You may want to take a minute and do the math, like grandpa finally did. Then take your overalls off, put your toolbox away, grab a cup of coffee, and fire up a free trial…
*Twitter exchange was excerpted and the @names changed
- This article was contributed by Blake Beltram, Community Evangelist at LogicMonitor
VMworld 2012 took place at the Moscone Center in San Francisco a few weeks ago. The weather was surprisingly nice, but the real buzz was inside the convention hall. We had a pod in the New Innovators section of the Vendor Expo in Moscone West. It being our first VMworld (as a sponsor), we were very impressed.
I was initially a little skeptical of our location, but it turned out we got a good bit of traffic and talked to dozens of prospects who were very interested in learning more about cloud-based technology infrastructure monitoring. One of the surprises of the event was how many current customers stopped by to say hello and share how LogicMonitor is working out for them.
One customer had an interesting story about how LogicMonitor saved his movie. He had gone to see the latest Batman movie, “The Dark Knight Rises,” and apparently he’s one of those guys who pays attention to his phone while in the movie (you know, like every other SysAdmin in the world). Halfway through the film he got a text message alerting him to an issue.
He immediately logged into LogicMonitor, checked the systems he was responsible for, and quickly realized the problem wasn’t on his end. He proceeded to dig around in the other systems in LogicMonitor and was able to pinpoint the issue and relay it to the team responsible. Ironically, he was the superhero at that moment. LogicMonitor not only helped him save the day, but it also saved the movie.
The takeaway here is that LogicMonitor helps provide insight into the entire infrastructure and so helps with collaboration across multiple teams.
You never know when the storage guy might help the virtualization guy or the database guy solve a major problem, even if just by proving the issue isn’t the database. This type of collaboration is invaluable when it comes to monitoring. It streamlines the troubleshooting process and gets the right professionals into action sooner, allowing them to focus on and solve the problem much more quickly.
I don’t think there is a single SysAdmin out there who enjoys how long the hunt can take when trying to pinpoint the cause of a major problem or outage.
Needless to say, we felt we had a great show. It was fun to be there and talk to really smart and interesting people. If your organization uses VMware in its infrastructure, I would highly recommend attending this conference next year – same Bat-time, same Bat-channel.
You released new code with all sorts of new features and improvements. Yay!
Now, after the obvious things like “Does it actually work in production?”, this is also the time to assess: did it impact my infrastructure performance (and thus my scalability, and thus my scaling costs) in any way?
This is yet another area where good monitoring and trending is essential.
As an example, we did a release last night on a small set of servers.
Did that help or hurt our scalability?
CPU load dropped for the same workload (we have other graphs showing which particular Java application this improvement was attributable to, but this shows the overall system CPU):
There was an improvement in a variety of MySQL performance metrics, such as the table open rate (table opens are fairly intensive).
But…not everything was improved:
While the overall disk performance and utilization is the same, the workload is much more spiky. (For those of you wondering how we get up to 2000 write operations per second – SSDs rock.)
And of course, the peak workloads are what constrain the server usage – with this change in workload, a server that was running at a steady 60% utilization may find itself spiking to 100% – leading to queuing in other parts of the system, and general Bad Things.
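The point about peaks can be made concrete with a few lines of arithmetic. This is a minimal sketch with made-up utilization samples (they are not the numbers from our graphs): both workloads have the same average, but only one leaves any headroom at its worst moment.

```python
from statistics import mean


def summarize(samples):
    """Return (average, peak) utilization for a list of percentages."""
    return mean(samples), max(samples)


def headroom(samples, capacity=100.0):
    """Capacity left at the busiest moment, which is what limits the server."""
    return capacity - max(samples)


# Made-up utilization samples (percent); both series average 60%.
steady = [60, 60, 60, 60, 60, 60]
spiky = [20, 100, 20, 100, 20, 100]

if __name__ == "__main__":
    for name, samples in (("steady", steady), ("spiky", spiky)):
        avg, peak = summarize(samples)
        print(f"{name}: avg={avg:.0f}% peak={peak}% headroom={headroom(samples):.0f}%")
```

Trending on averages alone would show no change between the two series, which is why the peak needs to be graphed as well.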
As it is, we saw this change in the workload and we can clearly attribute it to the code release. So now we can fix it before it is applied to more heavily loaded servers where it may have had an operational impact.
This keeps our Ops team happy, our customers happy, and, as it means we don’t have to spend more money on hardware for the same level of scale, it keeps our business people happy.
Just another illustration of how comprehensive monitoring can help your business in ways you may not have predicted.
Kablooee! That was the sound I (and many others) heard coming from one of Amazon Web Services (aka, the “cloud”) availability zones in Northern Virginia on June 30th (http://venturebeat.com/2012/06/29/amazon-outage-netflix-instagram-pinterest/, http://gigaom.com/cloud/some-of-amazon-web-services-are-down-again/). The sound was a weather-driven event causing one of Amazon’s data centers to lose power. And what happens when a data center loses power (and, for unspecified reasons, UPSs and generators don’t kick in)? Crickets. Computers turn off. Lights stop blinking. The “sounds of silence” (but not how Simon and Garfunkel sing about it).
By this point, you either have your monitoring outside your datacenter, and were notified about the outage, or only became aware belatedly, and regretted the decision not to put monitoring outside. But what happens after power has been restored? Well, that’s when good monitoring comes into play yet again…
As much hype as there has been surrounding “clouds” and “cloud computing” (and for good reason – they are changing the face of infrastructure), “clouds” are still a bunch of computers sitting in some data center – somewhere – requiring power, cooling, etc.
One of the nice things about going with a cloud service for your infrastructure is you are largely removed from needing to monitor hardware – this is all (presumably) done for you. No having to worry about fan speeds, system board temperatures, power supplies, RAID status, etc. However, this doesn’t remove the need for good and intricate monitoring of your application “stack”. This is everything else that makes your applications go – databases, JVM statistics, Apache status, system CPU, disk IO performance, system memory, application response time, load balancer health, etc. This is the real guts of your organization – and the things that you need to know are working after a reboot. And whether you are in the cloud or not, at some point all your systems are going to be rebooted. I guarantee it, so plan for it.
So what happens when your environment does reboot? It doesn’t matter whether you are in the cloud or not: when power is restored, you need to make sure all the components of your software stack are back up. Across all of your systems. Hopefully your disaster recovery plan does not revolve around a single “hero” sysadmin who merely needs to be pulled away from an IRC chat, a MW3 campaign, or the bar (of the three, the last is the most worrisome). Any available admin should be able to identify, via your monitoring system, which components of the stack came back up and are functioning, and which are not. Your monitoring dashboard, listing all machines and services, is your eyes and ears – without it you are blind and deaf (so to speak). When all alerts have cleared from monitoring, you should be comfortable in knowing that service has been completely restored. Good monitoring is by far the greatest safeguard you can have in making sure all systems are functioning again after a reboot, and in the shortest amount of time.
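To make the idea of “which components came back up” concrete, here is a minimal sketch of a post-reboot check. It only tests whether a TCP port accepts connections, which is far shallower than real monitoring of the stack, and the component names, hosts and ports are placeholders rather than anything from a real environment.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(components):
    """Map each component name to True (up) or False (down)."""
    return {name: port_is_open(host, port) for name, (host, port) in components.items()}


def still_down(results):
    """Names of components that have not come back yet."""
    return sorted(name for name, up in results.items() if not up)


if __name__ == "__main__":
    # Hypothetical inventory: names, hosts and ports are placeholders.
    stack = {
        "mysql": ("db1.example.com", 3306),
        "apache": ("web1.example.com", 80),
        "tomcat": ("app1.example.com", 8080),
    }
    print("still down:", still_down(check_stack(stack)))
```

An empty “still down” list plays the same role as a dashboard with no open alerts: any available admin can read it, not just the one who built the stack.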
The take-home: deploy good monitoring. Make sure all aspects of your stack are monitored. All of them. When all of your machines are rebooted (at 3 AM), how do you know all aspects of your stack are back up and functioning? Good monitoring. Good monitoring = LogicMonitor. Check us out. We eat our own dog food (see the next article on the “Leap Second” bug to get an account of this), and we are a SaaS service, meaning if all your systems do reboot, your monitoring system is not a part of it. We can help you recover faster from any outage, guaranteed.
It’s 6 AM. Bob, an entry-level IT engineer, walks into a cold, dark, lonely building – flips on the lights, fires up the coffee pot, and boots up. Depending on what he’s about to see on his computer screen, he knows the fate of the free world could rest in his soft, trembling, sun-starved hands.
Well maybe not the free world, but at least the near-term fate of his company, his company’s clients, and possibly his next paycheck. Bob is the newest engineer for a busy MSP, whose promise to its clients is very simple: your technology will always be up and working!
Fortunately for Bob, his MSP has a great ticketing system, so as soon as his coffee is hot and hard drive warm, he’ll log in to his ticketing dashboard, right? Wrong! What?! Bob! What are you logging into?! Oh. Your monitoring application? Really?
Really. True story. Dramatized for effect, name changed to protect the reasonably innocent, but true story. Eric Egolf, the owner of CIO Solutions, a thriving MSP, told us about it just last week. “The first thing the new guy does, intuitively, is open up the monitoring portal, before he ever looks at our ticketing system.” And the other engineers are following suit. Egolf says the ticketing system is great, but their comprehensive monitoring solution reveals the actual, real-time IT landscape for their entire client base within seconds. And the most critical problems practically jump off the screen at the engineers, sometimes before a ticket has even been created.
Set an easy-to-use interface on top of the comprehensive monitoring solution, and Bob can often very quickly isolate the problem, ferret out the root cause, and resolve the issue … before the asteroid plummets to earth and destroys America … or at least before a client calls screaming as if that did just happen.
“LogicMonitor makes my engineers smarter,” claims Egolf. “An entry-level engineer can basically perform all the functions of a mid-level engineer.” And without the increase in pay grade. That keeps costs down and clients up, and while that’s a particular sweet spot for MSPs and cloud providers, the same formula holds true for SaaS/Web companies and in-house IT departments. Not good, but great monitoring is the answer.
That’s how you make an engineer smarter. Next blog post: How to Make an Engineer the Life of the Party.
There’s some interesting discussion around “Monitoring Sucks”, and has been for a while. (Go check the Twitter hashtag #monitoringsucks.) This is not a new opinion – the fact that I thought monitoring sucks is why I started LogicMonitor.
But it’s interesting to assess whether LogicMonitor meets the criteria for not sucking. Clearly our customers think we have great monitoring – but probably only 30% of our customers are SaaS-type companies, and they may or may not have the DevOps mentality.
So the initial criteria for why monitoring sucks, at least on the referenced blog post, were:
But does monitoring REALLY suck? Heck no! Monitoring is AWESOME. Metrics are AWESOME. I love it. Here's what I don't love:
- Having my hands tied with the model of host and service bindings.
- Having to set up "fake" hosts just to group arbitrary metrics together
- Having to either collect metrics twice - once for alerting and another for trending
- Only being able to see my metrics in 5 minute intervals
- Having to chose between shitty interface but great monitoring or shitty monitoring but great interface
- Dealing with a monitoring system that thinks IT is the system of truth for my environment
- Perl
Let’s look at these points from the point of view of LogicMonitor.
Having my hands tied with the model of host and service bindings. I’m not sure how you avoid tying someone’s hands to some degree, but LogicMonitor certainly tries to give flexibility. Services do generally have to be associated with hosts – but can be associated by all sorts of things (hostname, group membership, SNMP agent OID, system description, WMI classes supported, kernel level, etc.).
Having to set up “fake” hosts just to group arbitrary metrics together. LogicMonitor avoids this mostly with custom graphs on dashboards, which allow you to group any metric (or set of metrics based on globs/regex’s) with any other set, filtered to the top 10, or not; aggregated together (sum, max, min, average) or not. Also, some meta-services are associated with groups, not hosts, to allow alerting on things like number of servers providing a service, rather than just whether a specific host is successfully providing the service.
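The grouping described here (select metrics by glob, optionally keep the top 10, optionally aggregate) can be sketched in a few lines. This is a generic illustration and not LogicMonitor's implementation: the metric names, values and function names are all made up.

```python
from fnmatch import fnmatchcase
from statistics import mean

# The aggregation choices named in the text.
AGGREGATES = {"sum": sum, "max": max, "min": min, "average": mean}


def select(metrics, pattern):
    """Pick the metrics whose name matches a shell-style glob."""
    return {name: value for name, value in metrics.items() if fnmatchcase(name, pattern)}


def top_n(metrics, n=10):
    """Keep the n largest metrics, highest first."""
    return sorted(metrics.items(), key=lambda item: item[1], reverse=True)[:n]


def aggregate(metrics, how):
    """Collapse the selected metrics into a single number."""
    return AGGREGATES[how](list(metrics.values()))


if __name__ == "__main__":
    # Made-up sample values keyed by hypothetical "host.metric" names.
    metrics = {"web1.cpu": 35.0, "web2.cpu": 80.0, "db1.cpu": 55.0, "web1.disk_ops": 1200.0}
    web_cpu = select(metrics, "web*.cpu")
    print("busiest:", top_n(select(metrics, "*.cpu"), n=2))
    print("web cpu average:", aggregate(web_cpu, "average"))
```

Because the selection is by name pattern rather than by host, no “fake” host is needed to put metrics from different machines on one graph.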
Having to either collect metrics twice – once for alerting and another for trending. We certainly don’t require that. Any datapoint that is collected can be alerted on, graphed, both or neither. (Sometimes datapoints are collected as they are used in other calculated datapoints, derived from multiple inputs.)
Only being able to see my metrics in 5 minute intervals. Again, we don’t impose that restriction – you can specify the collection interval for each datasource, from 1 minute to once a day. (I know going to only 1 minute resolution is not ideal for some applications – but as a SaaS delivery model, we currently impose that limit to protect ourselves, until the next rewrite of the backend storage engine, which should remove that.)
Having to chose between shitty interface but great monitoring or shitty monitoring but great interface. I think we have a pretty good interface and great monitoring. Certainly our interface is orders of magnitude better than it was when we launched, and a lot of people give us kudos for it. But there’s lots of room for improvement.
Dealing with a monitoring system that thinks IT is the system of truth for my environment. LogicMonitor thinks it is the truth for what your monitoring should be monitoring – but it’s willing to listen. It’s easy to use the API to put hooks into Puppet, Kickstart, etc. that automatically add hosts to monitoring, assign them to groups, and so on. We’re looking at integration with Puppet Labs’ MCollective initiative and other things to get further along on this issue.
Perl. Our collectors are agnostic when it comes to scripting. They support collection and discovery scripts in the native languages of whatever platform they are running on – so VBScript, PowerShell, C# on Windows; bash, Ruby, Perl, etc. on Linux. But as our collectors are Java based, we encourage Groovy as the scripting language for cross-platform goodness. The collectors expose a bunch of their own functionality (SNMP, JMX, expect, etc.) to Groovy, so it makes a lot of things very easy. So it’s the language we use for writing and extending datasources for our customers. But if Perl is your thing, keep at it.
So, does LogicMonitor suck? I don’t think so, and hopefully DevOps Borat does not either.
I’ll be at the DevOps Days conference in Austin this coming week (LogicMonitor is sponsoring), so hopefully we’ll get some more feedback there.
Or post below to let us know what constitutes “non-sucky” monitoring.