Could this happen to you?
Someone in your company makes an erroneous entry in DNS. After a short time, some customers begin receiving ‘Server Not Found’ reports when trying to access your site. Email doesn’t seem to be going through for some users. Help tickets start trickling in.
As your TechOps team attempts to troubleshoot, the error silently propagates through the Internet. The trickle of isolated tickets turns into a flash flood. Executives begin urgently texting to find out what is happening.
Eventually, someone on your team combs through the DNS file and catches the mistake. Instead of entering “.com” in the middle of the night, John must have fat fingered in “.con.” The error is fixed! However, because your DNS is cached it could be a couple of days before service is fully restored for all users.
Customers and executives demand a root cause analysis. “How could this have happened? Why wasn’t it caught earlier? What are you doing to prevent this ever happening again?”
No one can deny the importance of DNS in the Internet age. And to help you keep on top of it, LogicMonitor, maker of the popular automated IT performance monitoring platform, has just released its first free tool, the DNS Change Tracker™ as a free hosted tool. In the near term, we plan to release this tool’s source code on GitHub so that everyone can make it even better.
What it does: Read more »
In recent years, Solid-State Drives or SSDs have become a standard part of data center architecture. They handle more simultaneous read/write operations than traditional disks and use a fraction of the power. Of course, as a leading infrastructure, software and server monitoring platform vendor, we are very interested in monitoring our SSDs, not only because we want to make sure we’re getting what we paid for, but because we would also like to avoid a disk failure on a production machine at 3:00AM in the morning…and the Shaquille O’Neal sized headache to follow. But how do we know for sure if our SSDs are performing the way we want them to? Being one of the newest members of our technical operations team, it came as no surprise that I was tasked to answer this question. Read more »
Here at LogicMonitor we love our happy hours, and since we will be speaking at the upcoming AnsibleWorks Fest (or AnsibleFest) we thought of no better way to tie it off than over a drink.
Our very own Jefff Behl, Chief Network Architect will be speaking at 5:05pm about the importance of measurement and monitoring the whole IT stack in a DevOps world. And then afterwards…. we’d love the chance to meet you over drinks at Dillons (around the corner from the event).
Look for us at the Dillons downstairs bar between 6-8pm.
This post, written by LogicMonitor’s Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013
If you are like us, you are running some type of linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don’t go as planned. Failures of puppet can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.
In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it’s easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn’t get pushed, production is impacted, and it’s the client who notices.
Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual puppet runs on the end nodes themselves. For that purpose, puppet has a built in feature to export status info Read more »
Sample SAT question: xUnit is to Continuous Integration as what is to automated server deployments?
We’ve been going through lots of growth here at LogicMonitor. Part of growth means firing up new servers to deal with more customers, but we also have been adding a variety of new services: proxies that allow our customers to route around Internet issues that BGP doesn’t catch; servers that test performance and reachability of customers sites from various locations, and so on. All of which means spinning up new servers: sometimes lots of times, in QA, staging and development environments.
As old hands in running datacenter operations, we have long adhered to the tenet of not trusting people – including ourselves. People make mistakes, and can’t remember things they did to make things work. So all our servers and applications are deployed by automated tools. We happen to use Puppet, but collectively we’ve worked with cfengine, chef, and even Rightscripts.
So, for us to bring up a new server – no problem. It’s scripted, repeatable, takes no time. But how about splitting the functions of what was one server into several? And how do we know that the servers being deployed are set up correctly, if there are changes and updates? Read more »
Our digs here at LogicMonitor are cozy. Being adjacent to sales, I get to hear our sales engineers work with new customers, and it’s not uncommon that a new customer gets a rude awakening when they first install LogicMonitor. Immediately, LogicMonitor starts showing warnings and alerts. ”Can this be right or is this a monitoring error?!”, they ask. Delicately, our engineer will respond, “I don’t think that’s a monitoring error. It looks like you have a problem there.”
This happened recently with a customer who wanted to use LogicMonitor to watch their large VMware installation. We make excellent use of the VMware API which provides a rich set of data sources for monitoring. In this instance, LogicMonitor’s default alert settings threw several warnings about an ESX host’s datastore. There were multiple warnings regarding write latency problems on the ESX datastore, and drilling down, we found that a singular VM on that datastore was an ‘I/O hog’ that was grabbing so much disk resource that it was causing disk contention among the other VMs.
Finding the rogue host was easy with LogicMonitor’s clear, easy to read graphs. With the disk I\O of the different VMs plotted on the same graph, it was easy to spot the one whose disk operations were significantly higher than the rest.
We’ve seen this particular problem with VMware enough that our founder, Steve Francis, made this short video on how to quickly identify which VM on an ESX host is hogging resources: (Caveat: You must be able to understand Austrailian)
All our monitoring data sources have default alerting levels set that you can tune to fit your needs, but they’re pretty close out of the box as they’re the product of a LOT of monitoring experience. This customer didn’t have to make any adjustments to our alert levels to find a problem they were unaware of with potential customer-facing impacts. The resolution was easy, they moved the VM to another ESX host with a different datastore, but the detection tool was the key.
If you’re wondering about your VMware infrastructure, sign up for a free trial with LogicMonitor today and see what you’ve been missing.
– This article was contributed by Jeffrey Barteet, TechOps Engineer at LogicMonitor
Recently we rolled out a new release of LogicMonitor. Among the many improvements and fixes that users saw, there were also some backend changes to the Linux systems that store monitoring data.
The rollout went smooth, no alerts were triggered – but it was pretty easy to see that something had changed: Read more »
We got a question internally about why one of our demo servers was slow, and how to use LogicMonitor to help identify the issue. The person asking comes from a VoIP, networking and Windows background, not Linux, so his questions reflect that of the less-experienced sys admin (in this case). I thought it interesting that he documented his thought processes, and I’ll intersperse my interpretation of the same data, and some thoughts on why LogicMonitor alerts as it does… Read more »
Denise Dubie wrote a recent piece in CIO magazine about “5 Must-have IT Management Technologies for 2010“, in which she identifies one of the must-haves as IT process automation. She quotes Jim Frey, research director at EMA: “On the monitoring side, automation will be able to keep up with the pace of virtual environments and recognize when changes happen in ways a human operator simply could not.”
At LogicMonitor we couldn’t agree more. It’s true that, as the article implies, virtualization and cloud computing make the need for monitoring automation more acute than previously (which is why customers use LogicMonitor to automatically detect new hosts and newly created monitor Amazon EC2 instances – having dynamic system scaling without the ability to automatically monitor the dynamic systems is just asking for undetected service affecting issues.)
However, even in traditional non-virtualized datacenters (and despite the buzz, most datacenters and services are still built on physical machines), there is often so much change going on with systems and applications that non-automated monitoring has virtually no chance of keeping up with the additions and deletions. A typical example of an automated change report of one LogicMonitor customer from last night shows:
And that was just one day’s changes. Imagine the staff costs involved with tracking and implementing all these changes, every day, in a manual fashion, that are avoided by the use of automated datacenter monitoring.
And more significantly, imagine the likelihood that one of more of these changes would NOT have made it into monitoring manually – so that when a production service has issues, there is no monitoring to detect it.
Having your customers be the first to know about issues is not a situation anyone wants to be in – and monitoring automation is the only way to avoid that. That’s one area that LogicMonitor’s datacenter monitoring excels at.
We like monitoring. We like Java. Not to slight other languages – we like Ruby, perl, php, .NET and other platforms, too, and like to monitor them, also.
However, unlike most other languages, Java provides an explicit technology for monitoring applications and system objects. JMX is supported on any platform running the JVM, but like most other monitoring protocols, there are lots of interesting nuances and ways to use it. Which means lots of nuances in how to detect it and monitor it.
We have quiet a few customers that use LogicMonitor for JMX monitoring, of both standard and custom applications, so we’ve run into quite a few little issues, and solved them.
One example is that the naming convention for JMX objects is loosely defined. Initially, the JMX collector for LogicMonitor assumed that every object would have a “type” key property, as specified in best practices. Of course, this is a rule “more honored in the breach than in the observance”, as widespread applications such as WowzaMediaServer and even Tomcat do not adhere to it.
Another example is that JMX supports complex datatypes. We have customers who do not register different Mbeans for all their classes of data, but instead expose Mbeans that return maps of maps. Our collectors and ActiveDiscovery did not initially deal with these datatypes, as we hadn’t anticipated their use. But, there are good reasons to use them in a variety of cases, so LogicMonitor should support the wishes of the user – that’s one of our tenets, that LogicMonitor enables user’s to work the way they want, instead of constraining them to a preconceived idea. So we extended our ActiveDiscovery to iterate through maps, and maps of maps, and composite datatypes.
This enables our customers to instrument their applications in the way they think is most appropriate, while automating the configuration of management and alerting. While we think we’ve got all the permutations of JMX covered, I’m not taking any bets that a new customer won’t come along with a new variant that adds a perfectly logical use case that we do not support. Of course, if that’s the case, we’ll support it within a month or so – and all our customers – current and future – will be able to immediately reap the benefits. That’s just one of the niceties of the hosted SaaS model.
Performance monitoring for all your infrastructure & applications. In minutes, not hours.
Questions? Call Us!
(888) 415-6442 or +1 (805)-617-3884