This weekend I was catching up on some New Yorker issues, when an article by one of my favorite New Yorker authors, Atul Gawande, struck me as illuminating so much about tech companies and DevOps. (This is an example of ideas coming from diverse, unrelated sources – something part of the culture of LogicMonitor. Just yesterday, in fact, our Chief Network Architect had a great idea to improve security and accountability when our support engineers are asked to log in to a customer’s account - and this idea occurred to him while he and I were charging down the Jesusita trail on mountain bikes.)
The article, Atul Gawande: How Do Good Ideas Spread? : The New Yorker, is an exploration about why some good ideas (such as anesthesia) were readily adopted, while other just as worthy ideas (antisepsis – keeping germs away from medical procedures) did not. So how does this relate to DevOps and technology companies? Read more »
LogicMonitor is happy to participate in the upcoming DevOps Days meet up in Santa Clara.
It is exciting to see how the agile software development movement is bringing development and operations together to promote efficiency. With DevOps constant changes and roles cross-over, it only makes sense you use a monitoring tool that is just as flexible as you are.
LogicMonitor is a SaaS based IT performance monitoring solution that lets you monitor your entire IT infrastructure with just a few clicks of a button. Built by exceptionally smart operations guys, our product instantly discovers system performance metrics and tells you in seconds if they are within acceptable standards. We already researched these standards for you, so you can focus on what you do best.
And since we monitor anything with an IP address, you get visibility to your full IT stack in one place. Talk about Infrastructure as Code!
Come say hi to learn more, we would love to see you!
Have you ever been the guy in charge of storage and the dev guy and database guy come over to your desk waaaaay too early in the morning before you’ve had your caffeine and start telling you that the storage is too slow and you need to do something about it? I have. In my opinion it’s even worse when the Virtualization guy comes over and makes similar accusations, but that’s another story.
Now that I work for LogicMonitor I see this all the time. People come to us because “the NetApps are slow”. All too often we come to find that it’s actually the ESX host itself, or the SQL server having problems because of poorly designed queries. I’ve experienced this first hand before I worked for LogicMonitor,so it’s no surprise to me that this is a regular issue. When I experienced this problem myself I found it was vital to monitor all systems involved so I could really figure out where the bottleneck was.
Developers are sometimes too helpful when they instrument their systems. For example, when asked to add a metric that will report the response time of a request – there are several ways that it can be done. One way that seems to make sense is to just keep a variable with the total number of requests, and another with the total processing time. Then the developer just creates a variable showing total processing time divided by total requests, and a way to expose it (an MBean to report it via JMX, or a status page via HTTP, etc). This will be a nice neat object that reports the response time in milliseconds, all pre-calculated for the user.
The problem with this? It is indeed going to report the average response time - but it’s going to be the average of all response times since the server started. So… if the server has been running with an average response time of 1 ms, and it’s been up for 1000 hours, then it starts exhibiting a response time of 100 ms per request – after an hour of this slow behavior, the pre-calculated average response time will be 1.01 milliseconds (assuming a constant rate of requests). Not even enough of a change to be discernible with the eye on a graph, Read more »
One thing we frequently say is that you need to be monitoring all sorts of metrics when you do software releases, so you can tell if things degrade, and thus head off performance issues. You need to monitor not just the basics of the server (disk IO, memory, CPU, network), and the function of the server (response time serving web pages, database queries, etc), but also the in-between metrics (cache hit rates, etc).
This also provides visibility into when things improve, as well as get worse. For example, in a recent release, we changed the way we store some data in an internal database, and reduced the number of records in some tables by thousands. As you can see, this dropped the number of times Innodb had to hit the file system for data quite a bit:
Now, if we were running on SAS disks, instead of SSDs, we would have just regained a sizable percentage of the maximum IO rate of the drive arrays back, with one software release. (Purists will note that the drop in what is graphed is InnoDB disk requests – not OS level disk requests. Some of these requests will likely be satisfied from memory, not disk.)
If I were a developer, and was on the team that effectively allowed the same servers to scale to support twice as much load with a software release….I’d want people to know that.
This post, written by LogicMonitor’s Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013
If you are like us, you are running some type of linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don’t go as planned. Failures of puppet can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.
In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it’s easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn’t get pushed, production is impacted, and it’s the client who notices.
Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual puppet runs on the end nodes themselves. For that purpose, puppet has a built in feature to export status info Read more »
You released new code with all sorts of new features and improvements. Yay!
Now, after the obvious things like “Does it actually work in production”, this is also the time to assess: did it impact my infrastructure performance (and thus my scalability, and thus my scaling costs) in any way.
This is yet another area where good monitoring and trending is essential.
As an example, we did a release last night on a small set of servers.
Did that help or hurt our scalability?
CPU load dropped for the same workload (we have other graphs showing which particular Java application this improvement was attributable to, but this shows the overall system CPU):
There was an improvement on a variety of MySQL performance metrics, such as the Table open rate (table opens are fairly intensive.)
But…not everything was improved:
While the overall disk performance and utilization is the same, the workload is much more spiky. (For those of you wondering how we get up to 2000 write operations per second – SSDs rock.)
And of course, the peak workloads are what constrain the server usage – with this change in workload, a server that was running at a steady 60% utilization may find itself spiking to 100% – leading to queuing in other parts of the system, and general Bad Things.
As it is, we saw this change in the workload and we can clearly attribute it to the code release. So now we can fix it before it is applied to more heavily loaded servers where it may have had an operational impact.
This keeps our Ops team happy, our customers happy, and, as it means we dont have to spend more money on hardware for the same level of scale, it keeps our business people happy.
Just another illustration of how comprehensive monitoring can help your business in ways you may not have predicted.
Sometimes the truth hurts. Well the truth is what we didn’t find at DevOps Days was a throng of adoring fans waiting to throw their undergarments at us. Come to think of it, that would be kind of gross anyway, especially with the DevOps crowd…no disrespect.
What we did find was:
a) our marketing table nestled so close to our competitor’s that…if our tables had been teenagers, we would have sent them to the Principal’s office (see PHOTO below…with competitor’s name shamelessly Photoshopped out and replaced with ours) … and,
b) a lot of companies and DevOps teams that were fairly embedded in their custom-rigged, hard-fought and hard-won monitoring solutions.
In our last blog post we talked about the “suck” factor in monitoring. Well, maybe for some, blessed with sizable IT budgets and IT brains, monitoring doesn’t suck so bad at all. In fact maybe for those who take pride in their ability to cobble together a patchwork of complex solutions into one grand “comprehensive” solution, it’s sort of a way of life… a job within a job, a golden chalice, a worthy opponent for any Real Mensa up to the task.
When I was a kid I entered a Soapbox Derby – a racing event where the entrants spend the better part of a year (usually with their dads) making, honing, tweaking, and polishing their own motorless downhill race cars. Well I was new in town and my dad was busy with a new job, so I saved up and bought a Soapbox Derby Car from an enticing ad in the back of Popular Mechanics. The car was amazing. It was beautiful, took me fifteen minutes to put together, and with very little time, effort, or expense I placed an easy second in the popular Derby out of more than three dozen entrants. I loved it.
When, on the trophy stand, I told everyone I’d bought the car, they called an emergency meeting and, despite having no written rule to back up their judgement…took the trophy right out of my hands and disqualified me from the race. My car was arguably better, faster, sleeker and more attractive than most of the others in the field, but I hadn’t spent hundreds of hours and piles of money and put the requisite amount of blood, sweat and tears into it… so it didn’t count.
Sometimes the truth hurts. Well the truth is I just completely made up that story. Sorry, but I was searching for something analogous to what we didn’t find at DevOps Days and that fake memory seemed to kind of fit. It seemed more rich (and fun) than just coming straight out and saying, “When I was out last week I went to DevOps Days – an event where the participants spend a good part of their year (usually with their team) searching, honing and tweaking a multitude of products like Nagios, Cacti, collectd + graphite + pnp4nagios, Muni, etc. etc. to create their own monitoring solution…” and so on.
Plus, admit it, it conjured up a nice little twinge of boyhood nostalgia for a few seconds, didn’t it? Oh well, it did for me. It also caused me to realize what to do with the rest of this quarter’s marketing & event budget – we’re taking out a full page ad in the back of Popular Mechanics.
There’s some interesting discussion around “Monitoring Sucks”, and has been for a while. (Go check the twitter hashtag #monitoringsucks). This is not a new opinion – the fact that I thought monitoring sucks is why I started LogicMonitor.
But it’s interesting to assess whether LogicMonitor meets the criteria for not sucking. Clearly our customers think we have great monitoring - but probably only 30% of our customers are SaaS type companies, and may or may not have the DevOps mentality.
So the initial criteria for why monitoring sucks, at least on the referenced blog post, were:
But does monitoring REALLY suck? Heck no! Monitoring is AWESOME. Metrics are AWESOME. I love it. Here's what I don't love: - Having my hands tied with the model of host and service bindings. - Having to set up "fake" hosts just to group arbitrary metrics together - Having to either collect metrics twice - once for alerting and another for trending - Only being able to see my metrics in 5 minute intervals - Having to chose between shitty interface but great monitoring or shitty monitoring but great interface - Dealing with a monitoring system that thinks IT is the system of truth for my environment - Perl
Let’s look at these points from the point of view of LogicMonitor
Having my hands tied with the model of host and service bindings. I’m not sure how you not tie someone’s hands to some degree, but LogicMonitor certainly tries to give flexibility. Services do generally have to associated with hosts – but can be associated by all sorts of things (hostname, group membership, SNMP agent OID, system description, WMI classes supported, kernel level, etc.)
Having to set up “fake” hosts just to group arbitrary metrics together. LogicMonitor avoids this mostly with custom graphs on dashboards, which allow you to group any metric (or set of metrics based on globs/regex’s) with any other set, filtered to the top 10, or not; aggregated together (sum, max, min, average) or not. Also, some meta-services are associated with groups, not hosts, to allow alerting on things like number of servers providing a service, rather than just whether a specific host is successfully providing the service.
Having to either collect metrics twice – once for alerting and another for trending. We certainly don’t require that. Any datapoint that is collected can be alerted on, graphed, both or neither. (Sometimes datapoints are collected as they are used in other calculated datapoints, derived from multiple inputs.)
Only being able to see my metrics in 5 minute intervals. Again, we don’t impose that restriction – you can specify the collection interval for each datasource, from 1 minute to once a day. (I know going to only 1 minute resolution is not ideal for some applications – but as a SaaS delivery model, we currently impose that limit to protect ourselves, until the next rewrite of the backend storage engine, which should remove that.)
Having to chose between shitty interface but great monitoring or shitty monitoring but great interface.I think we have a pretty good interface and great monitoring. Certainly our interface is orders of magnitude better than it was when we launched, and a lot of people give us kudos for it. But there’s lots of room for improvement.
Dealing with a monitoring system that thinks IT is the system of truth for my environment. LogicMonitor thinks it is the truth for what your monitoring should be monitoring – but it’s willing to listen. It’s easy to use the API to put hooks into puppet, kickstart, etc that automatically add hosts to monitoring, assign them to groups, etc. We’re looking at integration with Puppet Lab’s MCollective initiative and other things to get further along this issue.
Perl. Our collectors are agnostic when it comes to scripting. They support collection and discovery scripts in the native languages of whatever platform they are running on – so VBscript, powershell, C# on Windows; bash, ruby, perl, etc on linux. But as our collectors are Java based, we encourage Groovy as the scripting language for cross-platform goodness. The collectors expose a bunch of their own functionality (snmp, JMX, expect, etc) to groovy, so it makes a lot of things very easy. So it’s the language we use for writing and extending datasources for our customers. But if Perl is your thing, keep at it.
So, does LogicMonitor suck? I don’t think so, and hopefully DevOps Borat does not either.
I’ll be at the DevOps Days conference in Austin this coming week (LogicMonitor is sponsoring), so hopefully we’ll get some more feedback there.
Or post below to let us know what constitutes “non-sucky” monitoring.
At LogicMonitor we take turns learning from each other in informal sessions. One week it may be developers talking about MySql and NoSQL; or marketing guys talk about lead generation and Adwords, etc. This time we’d arrived on the topic of programming languages, and how there is a trade off: between code speed and efficiency when using assembler or C at the expense of programmer efficiency; compared with much better programmer productivity at the expense of code efficiency when using languages with higher levels of abstraction, like Ruby on Rails or Python/Django.
Someone asked if that abstraction and inefficiency matters: as in most operational issues, it matters only if it matters. By which I mean if you are writing a system that is lightly used, or is on powerful hardware – it may not matter at all. But if you suddenly have an increased workload, it may matter a lot. (See the early occurrences of Twitter’s fail whale and RoR scaling.)
Then the question was asked, how can you know whether you are improving things when you change code? Trend it, of course. You probably know what will constrain your application performance. (If not, you need better monitoring.) For many sites, an obvious constraint is likely to be database queries per second. So plot database queries per web request over time Read more »
Performance monitoring for all your infrastructure & applications. In minutes, not hours.
Questions? Call Us!
(888) 415-6442 or +1 (805)-617-3884