Could this happen to you?
Someone in your company makes an erroneous entry in DNS. After a short time, some customers begin receiving ‘Server Not Found’ reports when trying to access your site. Email doesn’t seem to be going through for some users. Help tickets start trickling in.
As your TechOps team attempts to troubleshoot, the error silently propagates through the Internet. The trickle of isolated tickets turns into a flash flood. Executives begin urgently texting to find out what is happening.
Eventually, someone on your team combs through the DNS file and catches the mistake. Working in the middle of the night, John must have fat-fingered “.con” instead of “.com.” The error is fixed! However, because DNS records are cached, it could be a couple of days before service is fully restored for all users.
Customers and executives demand a root cause analysis. “How could this have happened? Why wasn’t it caught earlier? What are you doing to prevent this ever happening again?”
No one can deny the importance of DNS in the Internet age. To help you keep on top of it, LogicMonitor, maker of the popular automated IT performance monitoring platform, has just released its first free tool: the hosted DNS Change Tracker™. In the near term, we plan to release this tool’s source code on GitHub so that everyone can make it even better.
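The excerpt does not describe how the tool works internally. As a rough illustration of the general idea behind change tracking, the sketch below compares two snapshots of DNS records and reports what was added, removed, or changed. The function name, the snapshot layout (a dict keyed by name and record type), and the sample records are all assumptions made for this example.

```python
def diff_records(before, after):
    """Compare two DNS snapshots keyed by (name, record_type).

    Returns a sorted list of (kind, key, old_value, new_value) tuples,
    where kind is "added", "removed", or "changed".
    """
    changes = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old == new:
            continue  # record is identical in both snapshots
        if old is None:
            changes.append(("added", key, None, new))
        elif new is None:
            changes.append(("removed", key, old, None))
        else:
            changes.append(("changed", key, old, new))
    return changes


# Hypothetical snapshots taken before and after the late-night edit.
before = {
    ("www.example.com", "CNAME"): "web.example.com.",
    ("example.com", "MX"): "10 mail.example.com.",
}
after = {
    ("www.example.com", "CNAME"): "web.example.con.",  # the ".con" typo
    ("example.com", "MX"): "10 mail.example.com.",
}

for change in diff_records(before, after):
    print(change)
```

Run against the two sample snapshots, the diff surfaces the single changed CNAME, which is the kind of signal that would have flagged the typo before the tickets started arriving.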
Why is Solaris any different? Two reasons: (1) it virtualizes the swap space, and includes unused parts of physical memory as swap space, and (2) it maintains the distinction between paging and swapping.
These two factors often give rise to confusion and misinterpretation of the data, especially when queried via SNMP.
In a prior blog post, I talked about what virtual memory is, the difference between swapping and paging, and why it matters. (TL;DR: swapping is moving an entire process out to disk; paging is moving just specific pages out to disk, not an entire process. Running programs that require more memory than the system has will mean pages (or processes) are moved to/from disk and memory in order to get enough physical memory to run – and system performance will suck.)
Now I’ll talk about how to monitor virtual memory, on Linux (where it’s easy) and, next time, on Solaris (where most people and systems do it incorrectly.)
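The excerpt stops before showing how the monitoring is done, so the following is only one common approach and not necessarily the one the full post describes. Linux exposes cumulative `pswpin` and `pswpout` counters in `/proc/vmstat` (pages moved in from and out to swap space). Sampling them twice and dividing the difference by the interval gives a per-second rate. The function names and the sample values are assumptions for this sketch.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(earlier, later, interval_seconds):
    """Pages swapped in and out per second between two counter samples."""
    return {
        name: (later[name] - earlier[name]) / interval_seconds
        for name in ("pswpin", "pswpout")
    }


# Two made-up samples taken 60 seconds apart.
sample_1 = "pgpgin 1000\npswpin 10\npswpout 20\n"
sample_2 = "pgpgin 2000\npswpin 70\npswpout 20\n"

print(swap_rates(parse_vmstat(sample_1), parse_vmstat(sample_2), 60))
```

On a real host the two samples would come from reading `/proc/vmstat` at each polling interval. A sustained non-zero rate, rather than the amount of swap in use, is the sign that the system is short of physical memory.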
Most people know their hosts via DNS names (e.g. server1.lax.company.com) rather than IP addresses (192.168.3.45), and so enter them into their monitoring systems as DNS names. Thus there is a strong requirement that name resolution works as expected, in order to make sure that the monitoring system is in fact monitoring what the user expects it to be.
Sometimes we get support requests claiming that the LogicMonitor collector is resolving a DNS name to the wrong IP address even though DNS is set up as it should be, so something must be wrong with the collector. However, the issue is simply in how hosts resolve names, which is not always the same as how DNS resolves names.
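To make the distinction concrete, here is a minimal simulation, assuming a host configured to consult its local hosts file before DNS (a common default order on Linux). It performs no real lookups: `dns_table` is a stand-in for what a DNS server would answer, and the function names and the stale hosts entry are invented for the example.

```python
def parse_hosts(text):
    """Map each name in hosts-file text to its IP (first entry wins)."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name.lower(), ip)
    return table


def resolve(name, hosts_table, dns_table):
    """Resolve a name as a 'files first, then DNS' host would."""
    key = name.lower()
    if key in hosts_table:
        return hosts_table[key], "hosts file"
    if key in dns_table:
        return dns_table[key], "dns"
    return None, "not found"


# Hypothetical data: DNS is correct, but the host has a stale local entry.
hosts_text = """
127.0.0.1   localhost
10.0.0.99   server1.lax.company.com server1   # stale entry
"""
dns_table = {"server1.lax.company.com": "192.168.3.45"}

print(resolve("server1.lax.company.com", parse_hosts(hosts_text), dns_table))
```

Here DNS holds the right address, yet the host answers with the stale one from its hosts file, so querying the DNS server directly and asking the host to resolve the same name can give two different results.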
This post, written by LogicMonitor’s Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013
If you are like us, you are running some type of Linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don’t go as planned. Puppet failures can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.
In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it’s easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn’t get pushed, production is impacted, and it’s the client who notices.
Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the Puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual Puppet runs on the end nodes themselves. For that purpose, Puppet has a built-in feature to export status info…
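The excerpt cuts off before naming the feature, so the sketch below should be read as an illustration of the check itself rather than the post's method. It assumes a per-node run summary that has already been parsed into a dict with a `time.last_run` epoch timestamp and an `events.failure` count, a layout modeled on the agent's `last_run_summary.yaml` file; verify the field names against your Puppet version. The function name and the one-hour staleness limit are assumptions.

```python
def check_puppet_run(summary, now, max_age_seconds=3600):
    """Return (ok, reason) for one node's last Puppet run summary.

    `summary` is an already-parsed dict; `now` is the current epoch time.
    """
    last_run = summary.get("time", {}).get("last_run")
    if last_run is None:
        return False, "no last_run timestamp"
    age = now - last_run
    if age > max_age_seconds:
        # Covers both a crashed agent and one left disabled by an admin.
        return False, f"last run was {int(age)}s ago"
    failures = summary.get("events", {}).get("failure", 0)
    if failures:
        return False, f"{failures} failed events"
    return True, "ok"


# Hypothetical summary for a node whose last run finished at t=1000.
summary = {"time": {"last_run": 1000}, "events": {"failure": 0}}
print(check_puppet_run(summary, now=1500))
print(check_puppet_run(summary, now=10000))
```

Checking the age of the last run as well as its failure count matters because both failure modes described above (an agent that stopped running and an agent that was disabled and forgotten) produce no failed events at all, only silence.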
Sample SAT question: xUnit is to Continuous Integration as what is to automated server deployments?
We’ve been going through lots of growth here at LogicMonitor. Part of growth means firing up new servers to deal with more customers, but we also have been adding a variety of new services: proxies that allow our customers to route around Internet issues that BGP doesn’t catch; servers that test performance and reachability of customers’ sites from various locations, and so on. All of which means spinning up new servers: sometimes lots of times, in QA, staging and development environments.
As old hands in running datacenter operations, we have long adhered to the tenet of not trusting people – including ourselves. People make mistakes, and can’t remember things they did to make things work. So all our servers and applications are deployed by automated tools. We happen to use Puppet, but collectively we’ve worked with CFEngine, Chef, and even RightScripts.
So, for us to bring up a new server – no problem. It’s scripted, repeatable, takes no time. But how about splitting the functions of what was one server into several? And how do we know that the servers being deployed are set up correctly, if there are changes and updates?
You only get noticed when things go wrong.
The burden of entire companies rests on your shoulders.
Your work day never ends at 5:30 pm.
You’re on call 24/7/365.
You keep things running 99.999% of the time.
Today, we express our gratitude for your knowledge, dedication, and patience.
VMworld 2012 took place at the Moscone Center in San Francisco a few weeks ago. The weather was surprisingly nice, but the real buzz was inside the convention hall. We had a pod in the New Innovators section of the Vendor Expo in Moscone West. It being our first VMworld (as a sponsor), we were very impressed.
I was initially a little skeptical of our location, but it turned out we got a good bit of traffic and talked to dozens of prospects who were very interested in learning more about cloud-based technology infrastructure monitoring. One of the surprises of the event was how many current customers stopped by to say hello and share how LogicMonitor is working out for them.
One customer had an interesting story about how LogicMonitor saved his movie. He had gone to see the latest Batman movie “The Dark Knight,” and apparently he’s one of those guys who pays attention to his phone while in the movie (you know, like every other SysAdmin in the world). Halfway through the film he got a text message alerting him to an issue.
He immediately logged into LogicMonitor, checked the systems he was responsible for, and quickly realized the problem wasn’t on his end. He proceeded to dig around in the other systems in LogicMonitor and was able to pinpoint the issue and relay it to the team responsible. Fittingly, he was the superhero at that moment. LogicMonitor not only helped him save the day, but it also saved the movie.
The takeaway here is that LogicMonitor helps provide insight into the entire infrastructure and so helps with collaboration across multiple teams.
You never know when the storage guy might help the virtualization guy or the database guy solve a major problem, even if just by proving the issue isn’t the database. This type of collaboration is invaluable when it comes to monitoring. It streamlines the troubleshooting process and gets the right professionals into action sooner, allowing them to focus on and solve the problem much more quickly.
I don’t think there is a single SysAdmin out there who enjoys how long the hunt takes when trying to pinpoint the reason for a major problem or outage.
Needless to say, we felt we had a great show. It was fun to be there and talk to really smart and interesting people. If your organization uses VMware in its infrastructure, I would highly recommend attending this conference next year – same Bat-time, same Bat-channel.
As the new hire here at LogicMonitor brought in to support the operations of the organization, I had two immediate tasks: Learn how LogicMonitor’s SaaS-based monitoring works to monitor our customers’ servers, and at the same time, learn our own infrastructure.
I’ve been a SysA for longer than I care to admit, and when you start a new job in a complex environment, there can be a humbling period while you spin up before you can provide some value to the company. There’s often a steep and sometimes painful learning curve to adopting an organization’s technologies and architecture philosophies and making them your own before you can claim to be an asset to the firm.
But this time was different. With LogicMonitor’s founder, Steve Francis, sitting to my right, and its Chief Architect to my left, I was encouraged to dive into our own LogicMonitor portal to see our infrastructure health. A portal, by the way, is an individualized web site where our customers go to see their assets. From your portal, you get a fantastic view of all your datacenter resources from servers, storage and switches to applications, power, and load balancers just to name a few. And YES, we use remote instances of LogicMonitor to watch our own infrastructure. In SysA speak, we call this ‘eating our own dog food’.
As soon as I was given a login, I figured I’d kill two birds with one stone and familiarize myself with our infrastructure and see how our software worked. Starting at the top, I looked at our Cisco switches to see what was hooked up to what. LogicMonitor has already done the leg-work of getting hooks into the APIs on datacenter hardware, so one has only to point a collector at a device with an IP or hostname, tell it what it is (Linux or Windows host, Cisco or HP switch, etc.), provide some credentials, and ‘Voila!’ out come stats and pretty graphs. Before me on our portal was all the monitoring information one could wish for from a Cisco switch.
On the first switch I looked at, I noticed that its internal temperature sensor had been reading erratic temperatures. The temperatures were still within Cisco’s spec, and they hadn’t triggered an alert yet, but they certainly weren’t as steady as they had been for months leading up to that time. For a sanity check, I looked at the same sensor in the switch right next to it. The temperature was just as erratic. Checking the same sensors in another pair of switches in a different datacenter showed steady temperature readings for months.
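Readings that stay within spec but stop being steady will not trip a simple threshold alert. One way to express "erratic" in code, offered here as an illustrative sketch and not as anything LogicMonitor is stated to do, is to compare the spread of recent readings with the spread of a known-steady baseline. The function name, the factor of 3, the floor value, and the sample readings are all assumptions.

```python
from statistics import pstdev


def is_erratic(baseline, recent, factor=3.0, floor=0.1):
    """True if recent readings vary much more than the baseline did.

    `floor` keeps a perfectly flat baseline from making any tiny
    fluctuation look erratic.
    """
    baseline_spread = max(pstdev(baseline), floor)
    return pstdev(recent) > factor * baseline_spread


# Made-up temperature readings in degrees Celsius.
steady = [30, 30, 31, 30, 30, 31]
jumpy = [30, 36, 29, 37, 31, 38]

print(is_erratic(steady, steady))
print(is_erratic(steady, jumpy))
```

Every value in `jumpy` could still be inside the vendor's allowed range; the check flags the change in variability, not the absolute temperature.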
Using the nifty ‘smart-graph’ function of LogicMonitor, I was able to switch the graph around to look at just the data range I wanted. I even added the temperature sensor’s output to a new dashboard view. With my new-found data, I shared a graph with Jeff and Steve, and asked, “Hey, guys, I’m seeing these erratic temperatures on our switches in Virginia. Is this normal?”
Jeff took a 3 second glance, scowled, and said, “No, that’s not right! Open a ticket with our datacenter and have them look at that!”
That task was a little harder. Convincing a datacenter operator they have a problem with their HVAC when all their systems are showing normal takes a little persistence. Armed with my graphs, I worked my way up the food-chain with our DC provider support staff. One of their technicians checked the input and output air temperature of our cabinet, and verified there were no foreign objects disturbing air flow. All good there. We double-checked here that we hadn’t made any changes that would affect load on our systems and cause the temperature fluctuation. No changes here. But on a hunch, he changed a floor tile for one that allowed more air through to our cabinet. And behold, the result:
Looking at our graph, you’ll notice the temperature was largely stable before Sept. 13. I was poking around in LogicMonitor for the first time on Sept. 18 (literally, the FIRST TIME) and created the ticket, which was resolved on Friday, Sept. 21. You can see the moment when the temps drop and go stable again after the new ventilation tile was fitted. (In case you’re wondering, you can click on the data sources on the bottom of the graph, and that will toggle their appearance on the graph. I ‘turned off’ the sw-core1&2.lax6 switches since they were in another data center.)
Steve’s response to all this was, “Excellent! You’re providing value-add! Maybe we’ll keep you. Now write a blog post about it!”
And I’ll leave you with this: Monitoring can be an onerous task for SysAs. We usually have to build it and support it ourselves, and then we’re the only ones who can understand it enough to actually use it. Monitoring frequently doesn’t get the time it deserves until it’s too late and there’s an outage. LogicMonitor makes infrastructure monitoring easy and effective in a short period of time. We’ve built it, we support it, and we’ve made it easy to understand so your SysA can work on their infrastructure.
A company started a trial yesterday, added a bunch of Windows hosts, and immediately got warnings triggered that their hosts were “receiving 42 datagrams per second destined to non-listening ports…Check if all services are up and running.”
This was happening across many of their hosts; it was an issue they had been unaware of, and they didn’t immediately know the cause.
However, this morning we received an email:
“I need to share my excitement with discovering the cause of the UDP ‘storm.’ It was the Drobo Dashboard Service we had running on a Citrix XenApp server. Every 5 seconds, it was broadcasting to port 5002 searching for our appliance.
It was further amplified as we have Virtual IPs enabled on the Citrix server, resulting in what appeared to be a broadcast coming from each IP every 5 seconds.
We disabled that service and the UDP alarms have cleared. Thanks again.”
Their UDP error graph now looked much better:
While having 40 extra packets per second discarded by servers is not really going to affect them much (unlike the old days, when a few hundred broadcasts per second could freeze a computer entirely), the more things are controlled, and understood, the better your datacenter will perform. Sources of hidden complexity can hinder troubleshooting, slow resolution, and lead to failures later on.
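As a rough illustration of where a "datagrams per second destined to non-listening ports" figure can come from: the standard UDP MIB defines `udpNoPorts`, a cumulative 32-bit counter of received datagrams for which no application was listening on the destination port. Sampling it twice and dividing the difference by the interval gives a rate. The post does not say how LogicMonitor collects or thresholds this value, so the function names, the sample values, and the threshold of 40 per second below are assumptions.

```python
COUNTER32_MODULUS = 2 ** 32


def counter_rate(previous, current, interval_seconds, modulus=COUNTER32_MODULUS):
    """Per-second rate between two samples of a wrapping counter."""
    delta = (current - previous) % modulus  # handles a single wraparound
    return delta / interval_seconds


def no_port_warning(previous, current, interval_seconds, threshold=40.0):
    """Return (should_warn, rate) for two udpNoPorts samples."""
    rate = counter_rate(previous, current, interval_seconds)
    return rate > threshold, rate


# Made-up samples taken 60 seconds apart.
print(no_port_warning(1000, 3520, 60))
print(counter_rate(2 ** 32 - 100, 500, 60))  # counter wrapped between samples
```

Taking the difference modulo 2^32 keeps the rate correct when the counter rolls over between polls, which would otherwise show up as a large negative spike.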
This is just an example of the ways LogicMonitor has you covered. There are many alerts that most people will never see – but it’s nice to know there are thresholds set that will help you get your infrastructure conforming to best practices – if you happen to slip.