Could this happen to you?
Someone in your company makes an erroneous entry in DNS. After a short time, some customers begin receiving ‘Server Not Found’ reports when trying to access your site. Email doesn’t seem to be going through for some users. Help tickets start trickling in.
As your TechOps team attempts to troubleshoot, the error silently propagates through the Internet. The trickle of isolated tickets turns into a flash flood. Executives begin urgently texting to find out what is happening.
Eventually, someone on your team combs through the DNS file and catches the mistake. Working in the middle of the night, John must have fat-fingered “.con” instead of “.com.” The error is fixed! However, because DNS records are cached, it could be a couple of days before service is fully restored for all users.
Customers and executives demand a root cause analysis. “How could this have happened? Why wasn’t it caught earlier? What are you doing to prevent this ever happening again?”
No one can deny the importance of DNS in the Internet age. To help you keep on top of it, LogicMonitor, maker of the popular automated IT performance monitoring platform, has just released its first free hosted tool, the DNS Change Tracker™. In the near term, we plan to release this tool’s source code on GitHub so that everyone can make it even better.
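To make the idea concrete, here is a minimal sketch of the kind of check that could catch a “.con” typo early. It is not the DNS Change Tracker’s actual implementation; the record snapshots, the `KNOWN_TLDS` set, and both function names are assumptions made up for illustration. It compares two snapshots of a zone and flags hostname targets that end in an unexpected top-level domain.

```python
# Illustrative subset only; a real check would use a full TLD list.
KNOWN_TLDS = {"com", "net", "org", "io"}


def diff_records(before, after):
    """Return (change, name, old, new) tuples describing how two snapshots differ."""
    changes = []
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old == new:
            continue
        if old is None:
            changes.append(("added", name, None, new))
        elif new is None:
            changes.append(("removed", name, old, None))
        else:
            changes.append(("modified", name, old, new))
    return changes


def suspicious_targets(records):
    """Flag records whose hostname target ends in a label outside KNOWN_TLDS."""
    flagged = []
    for name, (rtype, target) in sorted(records.items()):
        if rtype not in ("CNAME", "MX", "NS"):
            continue  # only these record types point at hostnames
        tld = target.rstrip(".").rsplit(".", 1)[-1].lower()
        if tld not in KNOWN_TLDS:
            flagged.append((name, target))
    return flagged


# Hypothetical snapshots: name -> (record type, target). MX priority is omitted.
before = {
    "www.example.com.": ("CNAME", "web.example.com."),
    "example.com.": ("MX", "mail.example.com."),
}
after = {
    "www.example.com.": ("CNAME", "web.example.con."),  # the fat-fingered entry
    "example.com.": ("MX", "mail.example.com."),
}

changes = diff_records(before, after)
flagged = suspicious_targets(after)
for change, name, old, new in changes:
    print(change, name, old, "->", new)
for name, target in flagged:
    print("suspicious target:", name, target)
```

Run against every change before it goes live, a check like this turns a multi-day outage into a warning on someone’s screen.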
What it does: Read more »
‘Meraki’ may not be the best known name in networking, but their technology is going to touch you soon if it hasn’t already. Cisco just acquired Meraki in November for a cool $1.2 billion, to fold it into its new Cloud Networking Group.
Cisco is predicting explosive growth in cloud computing, the practice of running applications and storing data on remote servers accessed over the internet instead of on your local computer. And increasingly, these cloud services will be accessed with mobile devices over wireless networks.
What Meraki brings to the table is their cloud-managed wireless network infrastructure hardware. The Access Point (AP) is the critical bridge from the wired to the wireless world. What makes the Meraki APs unique is that you plug them into your wired network, the AP connects to the mother ship at Meraki, and you go to meraki.com to configure and manage them via a web UI.
This is a stellar leap from the typically clumsy and slow embedded web interfaces found on most APs, and the emphasis is on managing your wireless network as a whole, not a bunch of individual APs. The web UI is clean and easy to use, the network can be managed from anywhere, and the APs are kept up to date by Meraki with automatic firmware and security updates.
As the new hire here at LogicMonitor, brought in to support the operations of the organization, I had two immediate tasks: learn how LogicMonitor’s SaaS-based monitoring works to monitor our customers’ servers, and at the same time, learn our own infrastructure.
I’ve been a SysA for longer than I care to admit, and when you start a new job in a complex environment, there can be a humbling period of time while you spin up before you can provide some value to the company. There’s often a steep and sometimes painful learning curve to adopting an organization’s technologies and architecture philosophies and making them your own before you can claim to be an asset to the firm.
But this time was different. With LogicMonitor’s founder, Steve Francis, sitting to my right, and its Chief Architect to my left, I was encouraged to dive into our own LogicMonitor portal to see our infrastructure health. A portal, by the way, is an individualized web site where our customers go to see their assets. From your portal, you get a fantastic view of all your datacenter resources from servers, storage and switches to applications, power, and load balancers just to name a few. And YES, we use remote instances of LogicMonitor to watch our own infrastructure. In SysA speak, we call this ‘eating our own dog food’.
As soon as I was given a login, I figured I’d kill two birds with one stone: familiarize myself with our infrastructure and see how our software worked. Starting at the top, I looked at our Cisco switches to see what was hooked up to what. LogicMonitor has already done the leg-work of getting hooks into the APIs on datacenter hardware, so one has only to point a collector at a device with an IP or hostname, tell it what it is (Linux or Windows host, Cisco or HP switch, etc.), provide some credentials, and ‘Voila!’ out come stats and pretty graphs. Before me on our portal was all the monitoring information one could wish for from a Cisco switch.
On the first switch I looked at, I noticed that its internal temperature sensor had been reading erratic temperatures. The temperatures were still within Cisco’s spec, and they hadn’t triggered an alert yet, but they certainly weren’t as steady as they had been for months leading up to that time. For a sanity check, I looked at the same sensor in the switch right next to it. The temperature was just as erratic. Checking the same sensors in another pair of switches in a different datacenter showed steady temperature readings for months.
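For the curious, the “in spec but erratic” pattern is something you can also test for numerically. The sketch below is only an illustration under my own assumptions, not how LogicMonitor evaluates its datapoints: the readings are made up, and `is_erratic`, `ratio`, and `floor` are hypothetical names. It compares the spread of recent readings against the spread of a known-steady baseline.

```python
from statistics import pstdev


def is_erratic(baseline, recent, ratio=3.0, floor=0.1):
    """Return True when recent readings vary far more than the baseline did.

    ratio: how many times the baseline spread counts as erratic (assumed value).
    floor: minimum baseline spread, so a perfectly flat baseline does not
           turn every tiny wobble into a finding (assumed value).
    """
    baseline_spread = max(pstdev(baseline), floor)
    return pstdev(recent) > ratio * baseline_spread


# Made-up readings in degrees Celsius, all well below any alert threshold.
steady_months = [30.0, 30.5, 30.0, 30.5, 30.0, 30.5]
virginia_recent = [30.0, 34.0, 29.0, 35.0, 28.0, 36.0]
other_dc_recent = [30.5, 30.0, 30.5, 30.0, 30.5, 30.0]

virginia_flag = is_erratic(steady_months, virginia_recent)
other_dc_flag = is_erratic(steady_months, other_dc_recent)
print("Virginia erratic:", virginia_flag)
print("Other datacenter erratic:", other_dc_flag)
```

A static threshold only asks whether a value is too high; comparing spread against a baseline asks whether the behavior has changed, which is what caught my eye on the graph.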
Using the nifty ‘smart-graph’ function of LogicMonitor, I was able to switch the graph around to look at just the data range I wanted. I even added the temperature sensor’s output to a new dashboard view. With my new-found data, I shared a graph with Jeff and Steve and asked, “Hey, guys, I’m seeing these erratic temperatures on our switches in Virginia. Is this normal?”
Jeff took a 3-second glance, scowled, and said, “No, that’s not right! Open a ticket with our datacenter and have them look at that!”
That task was a little harder. Convincing a datacenter operator they have a problem with their HVAC when all their systems are showing normal takes a little persistence. Armed with my graphs, I worked my way up the food chain with our DC provider’s support staff. One of their technicians checked the input and output air temperature of our cabinet and verified there were no foreign objects disturbing air flow. All good there. We double-checked on our end that we hadn’t made any changes that would affect load on our systems and cause the temperature fluctuation. No changes here. But on a hunch, he changed a floor tile for one that allowed more air through to our cabinet. And behold, the result:
Looking at our graph, you’ll notice the temperature was largely stable before Sept. 13. I was poking around in LogicMonitor for the first time on Sept. 18 (literally, the FIRST TIME) and created the ticket, which got resolved on Friday, Sept. 21. You can see the moment when the temps drop and go stable again after the new ventilation tile was fitted. (In case you’re wondering, you can click on the data sources at the bottom of the graph to toggle their appearance on the graph. I ‘turned off’ the sw-core1&2.lax6 switches since they were in another data center.)
Steve’s response to all this was, “Excellent! You’re providing value-add! Maybe we’ll keep you. Now write a blog post about it!”
And I’ll leave you with this: Monitoring can be an onerous task for SysAs. We usually have to build it and support it ourselves, and then we’re the only ones who can understand it enough to actually use it. Monitoring frequently doesn’t get the time it deserves until it’s too late and there’s an outage. LogicMonitor makes infrastructure monitoring easy and effective in a short period of time. We’ve built it, we support it, and we’ve made it easy to understand so your SysA can work on their infrastructure.
Late last night our ops team (of which I am a member) got paged about the CPU load on a Cisco 3560 switch in a new datacenter. My initial reaction was “We don’t need this alert escalated to pagers or phones: 3560s switch and route in hardware, so CPU load doesn’t matter.” Once I’d woken up a bit more, the corollary occurred to me: there is no possible way that this switch should be at a CPU level high enough to trigger an error alert. Read more »