Most people know their hosts via DNS names (e.g. server1.lax.company.com) rather than IP addresses (192.168.3.45), and so enter them into their monitoring systems as DNS names. Thus there is a strong requirement that name resolution works as expected, in order to make sure that the monitoring system is in fact monitoring what the user expects it to be.
Sometimes we get support requests claiming that the LogicMonitor collector is resolving a DNS name to the wrong IP address even though DNS is set up exactly as it should be, so something must be wrong with the collector. However, the issue is simply in how hosts resolve names, which is not always the same as how DNS resolves names.
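To make that distinction concrete, here is a minimal Python sketch of a host-style lookup that consults a hosts file before falling back to DNS. The lookup order, the sample hosts entry, and the function names are assumptions for illustration, not a description of how the collector is implemented; a real host follows whatever its resolver configuration says.

```python
def parse_hosts(text):
    """Parse hosts-file text into a {name: ip} map (first entry wins)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name.lower(), ip)
    return table


def host_resolve(name, hosts_table, dns_lookup):
    """Mimic a 'hosts file first, then DNS' lookup order (assumed order)."""
    ip = hosts_table.get(name.lower())
    if ip is not None:
        return ip, "hosts file"
    return dns_lookup(name), "dns"


if __name__ == "__main__":
    # Hypothetical stale hosts entry that shadows the DNS record.
    hosts_text = "10.0.0.9 server1.lax.company.com server1  # old address\n"
    fake_dns = lambda name: "192.168.3.45"  # stand-in for a real DNS query
    print(host_resolve("server1.lax.company.com", parse_hosts(hosts_text), fake_dns))
```

On a real machine, `socket.getaddrinfo()` goes through the host's resolver, so it can return a hosts-file answer that a direct DNS query with a tool such as `dig` would never show.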
This post, written by LogicMonitor’s Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013
If you are like us, you are running some type of Linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don’t go as planned. Failures of puppet can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.
In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it’s easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn’t get pushed, production is impacted, and it’s the client who notices.
Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual puppet runs on the end nodes themselves. For that purpose, puppet has a built-in feature to export status info…
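One way to check actual runs on a node is to look at the run summary the agent writes after each run. The sketch below assumes the summary has already been loaded into a dict (for example from the agent's `last_run_summary.yaml` with a YAML parser). The key names, the two-hour staleness threshold, and the function name are assumptions for illustration, not necessarily the approach the full post describes.

```python
import time


def puppet_run_status(summary, now=None, max_age_seconds=2 * 3600):
    """Classify a node's last puppet run as ok, failed, stale, or unknown.

    `summary` is a dict assumed to look like:
      {"time": {"last_run": <epoch seconds>},
       "events": {"failure": <int>},
       "resources": {"failed": <int>}}
    """
    now = time.time() if now is None else now
    last_run = summary.get("time", {}).get("last_run")
    if last_run is None:
        return "unknown"  # no record of any run
    if now - last_run > max_age_seconds:
        return "stale"  # agent crashed, or was disabled and forgotten
    failures = summary.get("events", {}).get("failure", 0)
    failed = summary.get("resources", {}).get("failed", 0)
    if failures or failed:
        return "failed"
    return "ok"
```

The staleness check comes first on purpose: a host whose agent was disabled last week may still show a clean final run, which is exactly the case that goes unnoticed.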
Sample SAT question: xUnit is to Continuous Integration as what is to automated server deployments?
We’ve been going through lots of growth here at LogicMonitor. Part of growth means firing up new servers to deal with more customers, but we also have been adding a variety of new services: proxies that allow our customers to route around Internet issues that BGP doesn’t catch; servers that test performance and reachability of customers’ sites from various locations, and so on. All of which means spinning up new servers: sometimes lots of times, in QA, staging and development environments.
As old hands in running datacenter operations, we have long adhered to the tenet of not trusting people – including ourselves. People make mistakes, and can’t remember things they did to make things work. So all our servers and applications are deployed by automated tools. We happen to use Puppet, but collectively we’ve worked with CFEngine, Chef, and even RightScripts.
So, for us to bring up a new server – no problem. It’s scripted, repeatable, takes no time. But how about splitting the functions of what was one server into several? And how do we know that the servers being deployed are set up correctly, if there are changes and updates?
You only get noticed when things go wrong.
The burden of entire companies rests on your shoulders.
Your work day never ends at 5:30 pm.
You’re on call 24/7/365.
You keep things running 99.999% of the time.
Today, we express our gratitude for your knowledge, dedication, and patience.
VMworld 2012 took place at the Moscone Center in San Francisco a few weeks ago. The weather was surprisingly nice, but the real buzz was inside the convention hall. We had a pod in the New Innovators section of the Vendor Expo in Moscone West. It being our first VMworld (as a sponsor), we were very impressed.
I was initially a little skeptical of our location, but it turned out we got a good bit of traffic and talked to dozens of prospects who were very interested in learning more about cloud-based technology infrastructure monitoring. One of the surprises of the event was how many current customers stopped by to say hello and share how LogicMonitor is working out for them.
One customer had an interesting story about how LogicMonitor saved his movie. He had gone to see the latest Batman movie, “The Dark Knight Rises,” and apparently he’s one of those guys who pays attention to his phone while in the movie (you know, like every other SysAdmin in the world). Halfway through the film he got a text message alerting him to an issue.
He immediately logged into LogicMonitor and checked the systems he was responsible for and quickly realized the problem wasn’t on his end. He proceeded to dig around in the other systems in LogicMonitor and was able to pinpoint the issue and relay it to the team responsible. Ironically, he was the superhero at that moment. LogicMonitor not only helped him save the day, but it also saved the movie.
The takeaway here is that LogicMonitor helps provide insight into the entire infrastructure and so helps with collaboration across multiple teams.
You never know when the storage guy might help the virtualization guy or the database guy solve a major problem, even if just by proving the issue isn’t the database. This type of collaboration is invaluable when it comes to monitoring. It streamlines the troubleshooting process and gets the right professionals into action sooner, allowing them to focus on and solve the problem much more quickly.
I don’t think there is a single SysAdmin out there who, for all the thrill of the hunt, enjoys how long it can take to pinpoint the reason for a major problem or outage.
Needless to say, we felt we had a great show. It was fun to be there and talk to really smart and interesting people. If your organization uses VMware in its infrastructure, I would highly recommend attending this conference next year – same Bat-time, same Bat-channel.
As the new hire here at LogicMonitor brought in to support the operations of the organization, I had two immediate tasks: learn how LogicMonitor’s SaaS-based monitoring works to monitor our customers’ servers, and at the same time, learn our own infrastructure.
I’ve been a SysA for longer than I care to admit, and when you start a new job in a complex environment, there can be a humbling period of time while you spin up before you can provide some value to the company. There’s often a steep and sometimes painful learning curve to adopting an organization’s technologies and architecture philosophies and making them your own before you can claim to be an asset to the firm.
But this time was different. With LogicMonitor’s founder, Steve Francis, sitting to my right, and its Chief Architect to my left, I was encouraged to dive into our own LogicMonitor portal to see our infrastructure health. A portal, by the way, is an individualized web site where our customers go to see their assets. From your portal, you get a fantastic view of all your datacenter resources from servers, storage and switches to applications, power, and load balancers just to name a few. And YES, we use remote instances of LogicMonitor to watch our own infrastructure. In SysA speak, we call this ‘eating our own dog food’.
As soon as I was given a login, I figured I’d kill two birds with one stone and familiarize myself with our infrastructure and see how our software worked. Starting at the top, I began looking at our Cisco switches to see what was hooked up to what. LogicMonitor has already done the leg-work of getting hooks into the APIs on datacenter hardware, so one has only to point a collector at a device with an IP or hostname, tell it what it is (Linux or Windows host, Cisco or HP switch, etc.), provide some credentials and ‘Voila!’ out come stats and pretty graphs. Before me on our portal was all the monitoring information one could wish for from a Cisco switch.
On the first switch I looked at, I noticed that its internal temperature sensor had been reading erratic temperatures. The temperatures were still within Cisco’s spec, and they hadn’t triggered an alert yet, but they certainly weren’t as steady as they had been for months leading up to that time. For a sanity check, I looked at the same sensor in the switch right next to it. The temperature was just as erratic. Checking the same sensors in another pair of switches in a different datacenter showed steady temperature readings for months.
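The same eyeball comparison can be sketched in a few lines of Python: compare how much recent readings vary against how much they varied during a known-steady period, rather than checking them against an absolute threshold. The function name, the factor of 3, and the sample readings are all made up for illustration; this is not how LogicMonitor evaluates the sensor.

```python
from statistics import pstdev


def is_erratic(baseline, recent, factor=3.0, floor=0.1):
    """Return True if recent readings vary much more than the baseline did.

    `floor` keeps a perfectly flat baseline from making any tiny wobble
    look erratic.
    """
    base_spread = max(pstdev(baseline), floor)
    return pstdev(recent) > factor * base_spread


if __name__ == "__main__":
    steady_months = [30, 30, 31, 30, 30, 31]  # hypothetical readings, deg C
    this_week = [28, 34, 29, 35, 27, 33]      # in spec, but jumping around
    print(is_erratic(steady_months, this_week))
```

A check like this fires on a change in behavior even while every individual reading is still within the vendor's spec, which is the situation described above.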
Using the nifty ‘smart-graph’ function of LogicMonitor, I was able to switch the graph around to look at just the date range I wanted. I even added the temperature sensor’s output to a new dashboard view. With my new-found data, I shared a graph with Jeff and Steve, and asked, “Hey, guys, I’m seeing these erratic temperatures on our switches in Virginia. Is this normal?”
Jeff took a 3 second glance, scowled, and said, “No, that’s not right! Open a ticket with our datacenter and have them look at that!”
That task was a little harder. Convincing a datacenter operator they have a problem with their HVAC when all their systems are showing normal takes a little persistence. Armed with my graphs, I worked my way up the food-chain with our DC provider’s support staff. The engineer I ended up with checked the input and output air temperature of our cabinet, and verified there were no foreign objects disturbing air flow. All good there. We double-checked on our side that we hadn’t made any changes that would affect load on our systems and cause the temperature fluctuation. No changes here. But on a hunch, he changed a floor tile for one that allowed more air through to our cabinet. And behold, the result:
Looking at our graph, you’ll notice the temperature was largely stable before Sept. 13. I was poking around in LogicMonitor for the first time on Sept. 18 (literally, the FIRST TIME) and created the ticket, which got resolved on Friday, Sept. 21. You can see the moment when the temps drop and go stable again after the new ventilation tile was fitted. (In case you’re wondering, you can click on the data sources on the bottom of the graph, and that will toggle their appearance on the graph. I ‘turned off’ the sw-core1&2.lax6 switches since they were in another data center.)
Steve’s response to all this was, “Excellent! You’re providing value-add! Maybe we’ll keep you. Now write a blog post about it!”
And I’ll leave you with this: Monitoring can be an onerous task for SysAs. We usually have to build it and support it ourselves, and then we’re the only ones who can understand it enough to actually use it. Monitoring frequently doesn’t get the time it deserves until it’s too late and there’s an outage. LogicMonitor makes infrastructure monitoring easy and effective in a short period of time. We’ve built it, we support it, and we’ve made it easy to understand so your SysA can work on their infrastructure.
A company started a trial yesterday, added a bunch of Windows hosts, and immediately got warnings triggered that their hosts were “receiving 42 datagrams per second destined to non-listening ports…Check if all services are up and running.”
This was happening across many of their hosts. It was an issue they had been unaware of, and they didn’t immediately know the cause.
However, this morning we received an email:
“I need to share my excitement with discovering the cause of the UDP ‘storm.’ It was the Drobo Dashboard Service we had running on a Citrix XenApp server. Every 5 seconds, it was broadcasting to port 5002 searching for our appliance.
It was further amplified as we have Virtual IPs enabled on the Citrix server, resulting in what appeared to be a broadcast coming from each IP every 5 seconds.
We disabled that service and the UDP alarms have cleared. Thanks again.”
Their UDP error graph now looked much better:
While having 40 extra packets per second discarded by servers is not really going to affect them much (unlike the old days, when a few hundred broadcasts per second could freeze a computer entirely), the more things are controlled, and understood, the better your datacenter will perform. Sources of hidden complexity can hinder troubleshooting, slow resolution, and lead to failures later on.
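As a rough sanity check on those numbers, the arithmetic can be written out in Python. This is a back-of-the-envelope sketch that assumes every source address sends exactly one broadcast per interval and that nothing else contributes to the counter; the function names are made up, and the resulting address count is derived from that assumption, not something the customer reported.

```python
def datagrams_per_second(source_ips, interval_seconds):
    """Rate seen by each receiver if every source IP broadcasts once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return source_ips / interval_seconds


def source_ips_for_rate(rate_per_second, interval_seconds):
    """Invert the relationship: how many broadcasting IPs explain a given rate?"""
    return rate_per_second * interval_seconds


if __name__ == "__main__":
    # One service broadcasting every 5 seconds is only 0.2 datagrams/second,
    # far below the alerting rate; the virtual IPs supply the multiplier.
    print(datagrams_per_second(1, 5))
    print(source_ips_for_rate(42, 5))
```

Under that assumption, a single chatty service only becomes an alert-worthy rate once it is multiplied across a large number of addresses.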
This is just an example of the ways LogicMonitor has you covered. There are many alerts that most people will never see – but it’s nice to know there are thresholds set that will help you get your infrastructure conforming to best practices – if you happen to slip.
Even with a great monitoring system, it can be hard sometimes to keep the noise down. (Indeed, the more powerful the monitoring, the more difficult this can be, as more data is collected and tested, automatically.) And keeping noise down in monitoring is vital, as you do not want staff to start ignoring alerts – which they will if there are too many meaningless alerts.
There are of course best practices to help with this process, but one of the best ways to start attacking your alert noise is also one of the easiest – simply set up a report to highlight where the noise is coming from, and review it once a week.
Under the Reports tab, select New Report, then fill it out as shown below – the important thing being to select the report type as Alert Report.
The magic of the report is in the details:
I suggest setting the report to cover the last week, for all hosts (although if you are responsible only for a set of hosts – by all means change the report to only reflect those you are getting alerted about); exclude alerts that occurred during periods of Scheduled DownTime (those alerts would not have been sent out anyway); check the Summarize Alert Counts box, THEN select the sort method of sorting by Alert count. (This sort order is not available until the summarize alert count box is checked.)
Run this report, and you’ll get output like the following:
This makes it very easy to see that, in this case, we could eliminate 80% of the alerts for the last week simply by changing the monitoring on the IPMI event logs of one development host – filtering out alerts, or using SDT, or even disabling that monitoring, given it’s just a development host.
We can then work through the top noise makers, tuning, disabling, or fixing issues (such as increasing the MySQL cache on prod5.iad), which will greatly reduce the amount of alert noise with the least work.
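The logic behind the summarized report is simple enough to sketch in Python: count alerts per host and datasource, sort by count, and keep a running share of the total so the biggest noise makers stand out. The record layout, the function name, and the sample hosts other than prod5.iad are hypothetical, and this is not LogicMonitor's report implementation.

```python
from collections import Counter


def summarize_alerts(alerts):
    """Return rows of (host, datasource, count, cumulative_percent), noisiest first."""
    counts = Counter((a["host"], a["datasource"]) for a in alerts)
    total = sum(counts.values())
    rows = []
    running = 0
    for (host, datasource), n in counts.most_common():
        running += n
        rows.append((host, datasource, n, round(100 * running / total, 1)))
    return rows


if __name__ == "__main__":
    # Hypothetical week of alerts: one development host dominates.
    week = (
        [{"host": "dev1", "datasource": "IPMI event log"}] * 8
        + [{"host": "prod5.iad", "datasource": "MySQL cache"}]
        + [{"host": "web2", "datasource": "CPU"}]
    )
    for row in summarize_alerts(week):
        print(row)
```

Reading the cumulative column from the top shows how few sources need attention to remove most of the noise.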
And then we’ll get this report emailed to us every Monday, so we can stay on top of the issues, and keep our monitoring meaningful. That way, we’ll have improved the performance of our systems, eliminated any alert noise, and if we do get an alert – we can be sure it’s meaningful, and that people will react to it.
It’s 6 AM. Bob, an entry-level IT engineer, walks into a cold, dark, lonely building – flips on the lights, fires up the coffee pot, and boots up. Depending on what he’s about to see on his computer screen, he knows the fate of the free world could rest in his soft, trembling, sun-starved hands.
Well maybe not the free world, but at least the near-term fate of his company, his company’s clients, and possibly his next paycheck. Bob is the newest engineer for a busy MSP, whose promise to its clients is very simple: your technology will always be up and working!
Fortunately for Bob, his MSP has a great ticketing system, so as soon as his coffee is hot and hard drive warm, he’ll log in to his ticketing dashboard, right? Wrong! What?! Bob! What are you logging into?! Oh. Your monitoring application? Really?
Really. True story. Dramatized for effect, name changed to protect the reasonably innocent, but true story. Eric Egolf, the owner of CIO Solutions, a thriving MSP, told us about it just last week. “The first thing the new guy does, intuitively, is open up the monitoring portal, before he ever looks at our ticketing system.” And the other engineers are following suit. Egolf says the ticketing system is great, but their comprehensive monitoring solution reveals the actual, real-time IT landscape for their entire client-base within seconds. And the most critical problems practically jump off the screen at the engineers, sometimes before a ticket has even been created.
Set an easy-to-use interface on top of the comprehensive monitoring solution, and Bob can oftentimes very quickly isolate the problem, ferret out the root cause, and resolve the issue … before the asteroid plummets to earth and destroys America … or at least before a client calls screaming as if that did just happen.
“LogicMonitor makes my engineers smarter,” claims Egolf. “An entry-level engineer can basically perform all the functions of a mid-level engineer.” And without the increase in pay grade. That keeps costs down and clients up, and while that’s particularly a sweet-spot for MSPs and cloud providers, the same formula holds true for SaaS/Web companies and in-house IT departments. Not good, but great monitoring is the answer.
That’s how you make an engineer smarter. Next blog post: How to Make an Engineer the Life of the Party.