Last week I was traveling to and from our Austin office, which meant a fair amount of time for reading. Among other books, I read “The Principles of Product Development Flow” by Donald G. Reinertsen. The most interesting part of the book (to me) was the chapters on queueing theory.
Little’s law shows that the wait time for a process = queue size / processing rate. That’s a nice, intuitive formula with broad applicability.
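As a quick worked illustration of that formula, here is a minimal sketch. The function name and the numbers are made up for the example; they are not taken from the book.

```python
def wait_time(queue_size: float, processing_rate: float) -> float:
    """Average wait per Little's law: queue size divided by processing rate."""
    if processing_rate <= 0:
        raise ValueError("processing rate must be positive")
    return queue_size / processing_rate


# Hypothetical numbers: 120 items waiting, processed at 30 items per hour.
print(wait_time(120, 30))  # 4.0 (hours)
```

Halving the queue or doubling the processing rate halves the wait, which is why queue size is such a useful thing to watch.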
So what does this have to do with monitoring? Read more »
This article first appeared on blog.hipchat.com.
At LogicMonitor, we make a cloud-based monitoring solution that tracks billions of metrics a day for businesses around the world. While looking for a way to increase collaboration and community among our team members, who are spread across multiple timezones and continents, I started looking at my old friend IRC. But IRC servers are still non-trivial to set up and secure, and IRC is still not the easiest platform to use (especially for those not of the technical tilt), so I kept looking and found HipChat.
Why HipChat? To me it is graphical, outsourced IRC. It’s simple to use. There is no setup. There is no server to maintain. Just like LogicMonitor, it is a SaaS solution, accessible to everyone. It is archived. It is searchable. It is available on my phone. You can interface with it via an API. You can do things in it that are simply not possible with traditional IRC platforms, such as pasting screenshots or attaching files. So while trying hard not to say it: HipChat is pretty…hip. Nice work, guys. All departments here at LogicMonitor now use it.
So just how has HipChat helped within our Operations group? First off, it is the de-facto meeting place for all team members. People outside of Ops know where to reach the operations team, and during emergencies, it is the place where everyone in Operations is expected to be. But beyond this increase in collaboration, it was LogicMonitor’s ability to interface (via WebHooks) with HipChat that really cut down the amount of email going to our team members, reducing internal response time to issues while still keeping the entire team informed.
Part of being in operations is dealing with alerts (in our case, those generated by our own SaaS-based monitoring system, LogicMonitor – it’s LogicMonitor alerting about LogicMonitor. Very meta.) Previously, all alert levels (warning, error and critical) were sent via email to the Operations team, with the on-call engineer also receiving notifications of error or critical conditions via SMS or voice call. Now, instead of emailing all alerts to every team member (which leads to inbox clutter and extensive filtering rules), all alerts go to a Monitoring HipChat room where anybody can see them. While it is easy to see active alerts and reports of past alerts within LogicMonitor, it’s sometimes simply easier to scroll back through HipChat and see the status of things over the last few hours, or what happened during the night.
For day-to-day operations, the on-call engineer can choose how he will stay informed of ‘warning’ conditions: either by checking our LogicMonitor account via a web browser or by simply staying in the Monitoring HipChat room (he will be on HipChat anyway for company collaboration). It is his call as to how he wants to stay informed. Other engineers can keep tabs on warnings on an at-will basis as well, without having to dig through segregated folders where they have “quarantined” alert emails.
An even greater benefit is in how “error” and “critical” alerts – those that need immediate attention – are distributed. While these alerts are still sent out via SMS to the on-call engineer, they are simultaneously sent into the TechOps HipChat room where the operations engineers hang out. (This is a different room from the Monitoring room – alerts are sent there, too.) What does this mean? It means that any engineer who happens to be online is immediately informed of a higher severity error, even if he is not on-call. Our Operations team is better informed and quicker to respond to issues as on-call engineers are not always at their keyboard.
Thanks to LogicMonitor’s flexible property inheritance and WebHooks system, it is also possible to have alerts destined for specific groups (DBA, Network, etc.) sent to their respective HipChat rooms.
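The routing described above can be sketched roughly as follows. This is only an illustration: the routing table, the group room names and the `route_alert` function are hypothetical, and the real routing is configured inside LogicMonitor rather than written as code like this.

```python
from typing import Optional

# Hypothetical routing table based on the rooms described above.
ROUTES = {
    "warning": ["Monitoring"],
    "error": ["Monitoring", "TechOps"],
    "critical": ["Monitoring", "TechOps"],
}
# Levels that also page the on-call engineer (SMS or voice call).
PAGED_LEVELS = {"error", "critical"}
# Hypothetical per-group rooms (the post only says this is possible).
GROUP_ROOMS = {"DBA": "DBA", "Network": "Network"}


def route_alert(level: str, group: Optional[str] = None) -> dict:
    """Return the chat rooms an alert goes to and whether on-call is paged."""
    level = level.lower()
    if level not in ROUTES:
        raise ValueError(f"unknown alert level: {level}")
    rooms = list(ROUTES[level])  # copy so the table is never mutated
    if group in GROUP_ROOMS:
        rooms.append(GROUP_ROOMS[group])
    return {"rooms": rooms, "page_on_call": level in PAGED_LEVELS}


print(route_alert("warning"))
print(route_alert("critical", group="DBA"))
```

The point of the table is that nothing goes to email: warnings are visible to anyone who looks, and higher severities reach both the pager and the room where engineers already are.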
There is more to how we use HipChat (think automatic notifications of git commits to our production puppet code), but that’s for another time. Suffice it to say that HipChat is thriving at LogicMonitor, and HipChat coupled with LogicMonitor keeps everyone better informed.
If you want to integrate LogicMonitor with HipChat, check out our instructions on the HipChat integration page.
This blog post was written by Jeff Behl, Chief Network Architect at LogicMonitor. Follow him on Twitter.
Recently, we ran an internal challenge at LogicMonitor to see how many blog posts we could get in a week. We didn’t quite get to the 100% participation that would have led to Kevin (the C.E.O.) and me taking a ballet lesson in tutus in the street-front window, but we got a good response. Here’s the first of the series, in no particular order, from Philip Schorr, one of our great support engineers.
LogicMonitor is a great tool, and every day as I chat with different clients, I find that everyone has a different use case: some clients use monthly reports to keep track of trends, while others stay on top of database status or create a view that lets them fine-tune their application’s performance.
I want to show everyone one feature of LogicMonitor that is handy in all sorts of ways to correlate and interpret their performance data. That feature is the “flexible graph widget”. These flexible custom graphs let you make use of powerful regular expressions and can combine data from any metric (from any components of any device) that LogicMonitor is collecting. Read more »
Happy 4th of July.
In addition to the paddleboarding, lazing on the beach, kitesurfing (if the wind picks up), BBQing with friends, and fireworks that will happen later today (Santa Barbara, where LogicMonitor is based, is a great place to live. Why not move here to work for an awesome company?), I’m taking the day to answer a question some interns had yesterday: What is the difference between a derive datapoint and a counter datapoint, and when would you use one over the other?
Mr Protocol is glad you asked. (For those of you too young to remember Mr Protocol, or Sun Expert magazine, or even Sun Microsystems…. go find a crusty old sysadmin with beard and suspenders, and ask them.) Read more »
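For the impatient, here is a minimal sketch of the distinction, assuming the usual RRD-style meaning of the two types: a counter treats a drop in the raw value as a wrap of a fixed-width counter, while a derive simply reports the rate of change, negative or not. The function names and the 32-bit default are assumptions for illustration, not LogicMonitor’s implementation.

```python
def derive_rate(prev: int, curr: int, seconds: float) -> float:
    """Plain rate of change; negative if the raw value went down."""
    return (curr - prev) / seconds


def counter_rate(prev: int, curr: int, seconds: float, bits: int = 32) -> float:
    """Rate of change that treats a decrease as a single counter wrap."""
    delta = curr - prev
    if delta < 0:
        delta += 2 ** bits  # assume the counter rolled over exactly once
    return delta / seconds


# A 32-bit interface counter that wrapped between two polls 10 seconds apart.
prev, curr = 4294967000, 704
print(counter_rate(prev, curr, 10))  # 100.0
print(derive_rate(prev, curr, 10))   # large negative number
```

The flip side is that a counter will report a huge bogus spike when the raw value drops for a reason other than a wrap (a device reboot, say), which is the kind of case where a derive is the safer choice.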
Eleanor Roosevelt is reputed to have said “Learn from the mistakes of others. You can’t live long enough to make them all yourself.” In that spirit, we’re sharing a mistake we made so that you may learn.
This last weekend we had a service-impacting issue, lasting about 90 minutes, that affected a subset of customers on the East Coast. This happened despite the fact that, as you may imagine, we have very thorough monitoring of our servers; error level alerts (which are routed to people’s pagers) were triggered repeatedly during the issue; we have multiple stages of escalation for error alerts; and we ensure we always have on-call staff responsible for reacting to alerts, who are always reachable.
All these conditions were true this weekend, and yet we still had an issue whereby no person was alerted for over an hour after the first alerts were triggered. How was this possible? Read more »
One of our long-time customers, Appfolio, which makes great SaaS property management software, asked how they could use LogicMonitor to monitor the size of some files across their fleet of Linux servers. A simple request, but not as simple as one might hope. Why not? LogicMonitor usually uses SNMP to monitor Linux servers, as that way there is no need for extra software to be installed on any server. (It should be noted that some people deploy LogicMonitor collectors as agents, deploying one per server. In this case, you could use a script-based datasource to simply run ‘ls’ on arbitrary files – but that’s for a different blog entry.) While SNMP has many defined OIDs (a fancy way of saying questions that can be asked and answered), there is no defined OID for “how big is arbitrary file X?” Which means that by default, there is no way to remotely query a system, using SNMP, to determine a file size. Read more »
Have you ever been the guy in charge of storage and the dev guy and database guy come over to your desk waaaaay too early in the morning before you’ve had your caffeine and start telling you that the storage is too slow and you need to do something about it? I have. In my opinion it’s even worse when the Virtualization guy comes over and makes similar accusations, but that’s another story.
Now that I work for LogicMonitor I see this all the time. People come to us because “the NetApps are slow”. All too often we come to find that it’s actually the ESX host itself, or the SQL server having problems because of poorly designed queries. I experienced this firsthand before I worked for LogicMonitor, so it’s no surprise to me that this is a regular issue. When I experienced this problem myself, I found it was vital to monitor all systems involved so I could really figure out where the bottleneck was.
This post, written by LogicMonitor’s Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013.
If you are like us, you are running some type of Linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don’t go as planned. Failures of puppet can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.
In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it’s easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn’t get pushed, production is impacted, and it’s the client who notices.
Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual puppet runs on the end nodes themselves. For that purpose, puppet has a built in feature to export status info Read more »
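To make the idea of checking the runs themselves concrete, here is a minimal sketch. It assumes that each node can report the time of its last puppet run and how many resources failed in that run; the function name, its inputs and the one-hour threshold are all hypothetical, not part of puppet or LogicMonitor.

```python
import time
from typing import Optional


def puppet_run_state(
    last_run_epoch: float,
    failed_resources: int,
    now: Optional[float] = None,
    max_age_seconds: float = 3600,
) -> str:
    """Classify a node as 'stale', 'failed' or 'ok' from its last run."""
    now = time.time() if now is None else now
    if now - last_run_epoch > max_age_seconds:
        # Covers both a crashed agent and a forgotten "disabled" state.
        return "stale"
    if failed_resources > 0:
        return "failed"
    return "ok"


# Hypothetical node whose last run was two hours ago.
print(puppet_run_state(last_run_epoch=1000, failed_resources=0, now=1000 + 7200))
```

Staleness is checked first on purpose: a node that has not run at all is the dangerous case described above, and an old "success" says nothing about whether the latest update arrived.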
Our digs here at LogicMonitor are cozy. Being adjacent to sales, I get to hear our sales engineers work with new customers, and it’s not uncommon that a new customer gets a rude awakening when they first install LogicMonitor. Immediately, LogicMonitor starts showing warnings and alerts. “Can this be right, or is this a monitoring error?!” they ask. Delicately, our engineer will respond, “I don’t think that’s a monitoring error. It looks like you have a problem there.”
This happened recently with a customer who wanted to use LogicMonitor to watch their large VMware installation. We make excellent use of the VMware API, which provides a rich set of data sources for monitoring. In this instance, LogicMonitor’s default alert settings threw several warnings about an ESX host’s datastore. There were multiple warnings regarding write latency problems on the ESX datastore, and drilling down, we found that a single VM on that datastore was an ‘I/O hog’ that was grabbing so much disk resource that it was causing disk contention among the other VMs.
Finding the rogue VM was easy with LogicMonitor’s clear, easy-to-read graphs. With the disk I/O of the different VMs plotted on the same graph, it was easy to spot the one whose disk operations were significantly higher than the rest.
We’ve seen this particular problem with VMware enough that our founder, Steve Francis, made this short video on how to quickly identify which VM on an ESX host is hogging resources. (Caveat: You must be able to understand Australian.)
All our monitoring data sources have default alerting levels set that you can tune to fit your needs, but they’re pretty close out of the box as they’re the product of a LOT of monitoring experience. This customer didn’t have to make any adjustments to our alert levels to find a problem they were unaware of with potential customer-facing impacts. The resolution was easy: they moved the VM to another ESX host with a different datastore. But the detection tool was the key.
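What the eye does on that graph can be approximated in a few lines. This is a minimal sketch, assuming you already have a disk operations figure per VM; the VM names, the numbers and the "five times the median" rule are invented for illustration and are not how LogicMonitor’s alerting works.

```python
from statistics import median


def find_io_hogs(iops_by_vm: dict, factor: float = 5.0) -> list:
    """Return VMs whose disk ops exceed `factor` times the median across VMs."""
    if not iops_by_vm:
        return []
    typical = median(iops_by_vm.values())
    return sorted(vm for vm, iops in iops_by_vm.items() if iops > factor * typical)


# Hypothetical disk operations per second for VMs sharing one datastore.
sample = {"web01": 120, "web02": 95, "db01": 140, "batch07": 2300}
print(find_io_hogs(sample))  # ['batch07']
```

The median is used rather than the mean so that the hog itself does not drag the baseline up and hide among its neighbors.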
If you’re wondering about your VMware infrastructure, sign up for a free trial with LogicMonitor today and see what you’ve been missing.
- This article was contributed by Jeffrey Barteet, TechOps Engineer at LogicMonitor
We use SNMP a lot, and know it well. However, not every one of our customers has spent years working with OIDs in ASN.1, MIBs, access types, and so on, nor should they. (As we like to say, “Your monitoring solution should make your life easier, not harder.”) So one question we often get is what the difference is between the SNMP versions.
So here’s the quick rundown:
Note that while you may have to configure the SNMP version on the devices being monitored, you do not have to configure the version to be used in LogicMonitor. LogicMonitor will automatically try version 3; if that does not succeed, it tries version 2, and only if that does not respond will it use version 1. We try to keep the work away from you when we can.
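The fallback order described above amounts to a simple loop. Here is a minimal sketch; `probe` is a hypothetical stand-in for whatever actually sends an SNMP request with a given version, and none of this is LogicMonitor’s real code.

```python
from typing import Callable, Optional


def detect_snmp_version(probe: Callable[[str], bool]) -> Optional[str]:
    """Try SNMP versions from newest to oldest; return the first that answers."""
    for version in ("3", "2", "1"):
        if probe(version):
            return version
    return None  # the device did not answer on any version


# Hypothetical device that only speaks versions 1 and 2.
print(detect_snmp_version(lambda version: version in {"1", "2"}))  # 2
```

Trying the newest version first means a device that supports several versions is always monitored with the most capable one it offers.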