
What, Free Cisco?! (The real value of service)

May 10, 2012 – 10:18 am

by Cisco Arias

In the modern world of consumerism, there are so many choices, noises and deals, it’s sometimes hard to calculate the real value of the products and services we purchase. At LogicMonitor, we try to make it obvious.

I don’t like to call our solution a “product” because for some reason it makes me think of, “Sorry pal, you purchased this as is. I can’t help you.” That’s not how we roll.

We have a joke here that our product now comes with free Cisco (not the trademarked kind). As an Account Manager, here’s the way I look at it: the moment a new client signs on, we become partners in improving each other’s businesses and helping each other grow.

A service, or solution, like ours relies on synergy and strong relationships with our clients if we are to provide them the most effective and intelligent SaaS monitoring solution out there. One of my personal goals here is to build these relationships in order to understand the needs of the end user AND organization… to keep the client’s business flowing and improving.

I see monitoring as one component of the daily workflow of the average IT admin, sysadmin, or network engineer. Monitoring should provide more than just alerting; it should provide a way of proactively finding out how to prevent fires. Beyond that, it should provide intelligence that improves operations, and business metrics that are meaningful to those upstairs.

And I’m not happy until our clients actually realize all these benefits. We’re not going to sell you a “product” where it’s up to you to figure it out and get it working. We’re going to provide you with the resources needed to maximize the value of our solution.

From day one, we have a team of Engineers and Developers working with you to ensure your monitoring is made easy and effective. You also have people like me who will check in frequently to see if there is anything we can do to help.

Whether it’s letting you know your portal is not up to date, going over new features, or getting new team members up to speed, you can rest assured that you have a go-to person for anything LogicMonitor related.

So be sure to consider all of this the next time you are calculating what real value means to you and your organization… or the next time you need help on a “product” you purchased and are looking and waiting, looking and waiting…

Cisco
now free, with LogicMonitor :)


Measuring performance improvements in CentOS 6.2

April 27, 2012 – 6:11 pm

Recently, we updated some servers from CentOS 5.6 to CentOS 6.2. Of course, we carefully monitored the performance of the hosts so that we could compare before and after. The one big change we made along with the move to CentOS 6 was switching from ext3 to ext4 (although as yet without TRIM enabled).

Nothing else changed on the hosts – same hardware (Sun X4170s, in this case), same disks (Intel SSDs in RAID), same workload, same application version.

Different performance. Better performance, which is good – although the reason why is unclear.

The first thing that caught our eye was that the SSDs were busy significantly less of the time. (The load is on the machine with the dark line to start with; then the same load is transferred to the machine with the green line.) This is a plot of the amount of time the SSD was busy:

Looking at the disk operations on the SSDs, both systems are doing the same number of writes – but the physical disk reads have dropped from about 700 per second to 340 per second. (Gotta love SSDs!)

 

 

These systems don’t swap, don’t have InnoDB buffer cache misses, and have the same hardware, application and load in both cases – the only variable is the move from CentOS 5 to 6.

So while we can’t explain why there should be fewer disk reads, presumably some change in the virtual memory system made... something... more efficient.

The important thing is that we were measuring it, so we could see it. That way we know about it – and given that it could have gone the other way and made performance worse, it’s essential to be able to know about the impact changes make to systems. (Of course, we saw this change on non-production systems, before we rolled it out.)
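A before/after comparison like this one can be sketched in a few lines of Python. This is only an illustration, not how LogicMonitor computes it: the counter values are made up to match the roughly 700 and 340 reads per second mentioned above, and the 60 second sample interval and the function names are assumptions.

```python
from statistics import mean


def rates_per_second(counter_samples, interval_seconds):
    """Turn cumulative counter samples into per-second rates."""
    return [
        (later - earlier) / interval_seconds
        for earlier, later in zip(counter_samples, counter_samples[1:])
    ]


def percent_change(before, after):
    """Relative change from the mean of 'before' to the mean of 'after'."""
    return (mean(after) - mean(before)) / mean(before) * 100.0


# Hypothetical cumulative disk-read counters, sampled every 60 seconds.
before_counters = [0, 42000, 84000, 126000]  # old OS: about 700 reads/s
after_counters = [0, 20400, 40800, 61200]    # new OS: about 340 reads/s

before_rates = rates_per_second(before_counters, 60)
after_rates = rates_per_second(after_counters, 60)
change = percent_change(before_rates, after_rates)

print(f"before: {mean(before_rates):.0f}/s, "
      f"after: {mean(after_rates):.0f}/s, change: {change:.1f}%")
```

With these made-up samples the change works out to a drop of about 51%. The point is less the arithmetic than having the "before" numbers recorded at all.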

Are you tracking the impacts of the changes you make to your infrastructure?

 


What I didn’t find at DevOps Days…

April 10, 2012 – 5:26 pm

Sometimes the truth hurts.  Well the truth is what we didn’t find at DevOps Days was a throng of adoring fans waiting to throw their undergarments at us. Come to think of it, that would be kind of gross anyway, especially with the DevOps crowd…no disrespect.

What we did find was:

a) our marketing table nestled so close to our competitor’s that…if our tables had been teenagers, we would have sent them to the Principal’s office (see PHOTO below…with competitor’s name shamelessly Photoshopped out and replaced with ours) … and,

b) a lot of companies and DevOps teams that were fairly embedded in their custom-rigged, hard-fought and hard-won monitoring solutions.

If only, in real life, we could just Photoshop out the competition...

In our last blog post we talked about the “suck” factor in monitoring.  Well, maybe for some, blessed with sizable IT budgets and IT brains, monitoring doesn’t suck so bad at all.  In fact maybe for those who take pride in their ability to cobble together a patchwork of complex solutions into one grand “comprehensive” solution, it’s sort of a way of life… a job within a job, a golden chalice, a worthy opponent for any Real Mensa up to the task.

 

When I was a kid I entered a Soapbox Derby  – a racing event where the entrants spend the better part of a year (usually with their dads) making, honing, tweaking, and polishing their own motorless downhill race cars.  Well I was new in town and my dad was busy with a new job, so I saved up and bought a Soapbox Derby Car from an enticing ad in the back of Popular Mechanics. The car was amazing. It was beautiful, took me fifteen minutes to put together, and with very little time, effort, or expense I placed an easy second in the popular Derby out of more than three dozen entrants.  I loved it.

When, on the trophy stand, I told everyone I’d bought the car, they called an emergency meeting and, despite having no written rule to back up their judgement…took the trophy right out of my hands and disqualified me from the race. My car was arguably better, faster, sleeker and more attractive than most of the others in the field, but I hadn’t spent hundreds of hours and piles of money and put the requisite amount of blood, sweat and tears into it… so it didn’t count.

Sometimes the truth hurts.  Well the truth is I just completely made up that story.  Sorry, but I was searching for something analogous to what we didn’t find at DevOps Days and that fake memory seemed to kind of fit.  It seemed richer (and more fun) than just coming straight out and saying, “When I was out last week I went to DevOps Days – an event where the participants spend a good part of their year (usually with their team) searching, honing and tweaking a multitude of products like Nagios, Cacti, collectd + Graphite + pnp4nagios, Munin, etc. to create their own monitoring solution…” and so on.

Plus, admit it, it conjured up a nice little twinge of boyhood nostalgia for a few seconds, didn’t it?  Oh well, it did for me.  It also caused me to realize what to do with the rest of this quarter’s marketing & event budget –  we’re taking out a full page ad in the back of Popular Mechanics.


Not all monitoring sucks

March 31, 2012 – 12:35 pm

There’s some interesting discussion around “Monitoring Sucks”, and has been for a while. (Go check the twitter hashtag #monitoringsucks).  This is not a new opinion – the fact that I thought monitoring sucks is why I started LogicMonitor.

But it’s interesting to assess whether LogicMonitor meets the criteria for not sucking.  Clearly our customers think we have great monitoring - but probably only 30% of our customers are SaaS type companies, and may or may not have the DevOps mentality.

So the initial criteria for why monitoring sucks, at least on the referenced blog post, were:

But does monitoring REALLY suck?
Heck no! Monitoring is AWESOME. Metrics are AWESOME. I love it.
Here's what I don't love:
- Having my hands tied with the model of host and service bindings.
- Having to set up "fake" hosts just to group arbitrary metrics together
- Having to either collect metrics twice - once for alerting and another for trending
- Only being able to see my metrics in 5 minute intervals
- Having to chose between shitty interface but great monitoring or
shitty monitoring but great interface
- Dealing with a monitoring system that thinks IT is the system of truth for my environment
- Perl

Let’s look at these points from the point of view of LogicMonitor

Having my hands tied with the model of host and service bindings.  I’m not sure how you avoid tying someone’s hands to some degree, but LogicMonitor certainly tries to give flexibility.  Services do generally have to be associated with hosts – but they can be associated by all sorts of things (hostname, group membership, SNMP agent OID, system description, WMI classes supported, kernel level, etc.)

Having to set up “fake” hosts just to group arbitrary metrics together. LogicMonitor mostly avoids this with custom graphs on dashboards, which allow you to group any metric (or set of metrics based on globs/regexes) with any other set, filtered to the top 10, or not; aggregated together (sum, max, min, average) or not.  Also, some meta-services are associated with groups, not hosts, to allow alerting on things like the number of servers providing a service, rather than just whether a specific host is successfully providing the service.
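To make the idea concrete, here is a minimal Python sketch of grouping metrics by glob, aggregating them, taking the top N, and checking a group-level condition. It is not LogicMonitor’s implementation or API; the metric names, host names, and helper functions are all hypothetical.

```python
from fnmatch import fnmatchcase


def select(metrics, pattern):
    """Pick the metrics whose names match a glob pattern."""
    return {name: value for name, value in metrics.items()
            if fnmatchcase(name, pattern)}


def aggregate(metrics, how):
    """Combine a set of metrics with sum, max, min or average."""
    values = list(metrics.values())
    functions = {
        "sum": sum,
        "max": max,
        "min": min,
        "average": lambda v: sum(v) / len(v),
    }
    return functions[how](values)


def top_n(metrics, n=10):
    """Return the n largest metrics as (name, value) pairs."""
    return sorted(metrics.items(), key=lambda kv: kv[1], reverse=True)[:n]


def too_few_serving(up_by_host, minimum):
    """Group-level check: are fewer than 'minimum' hosts providing the service?"""
    return sum(1 for up in up_by_host.values() if up) < minimum


# Hypothetical latest values, keyed by metric name.
latest = {
    "web01.http.requests": 120.0,
    "web02.http.requests": 95.0,
    "web03.http.requests": 143.0,
    "db01.mysql.queries": 870.0,
}

web = select(latest, "web*.http.requests")
total = aggregate(web, "sum")
busiest = top_n(web, n=2)
alert = too_few_serving({"web01": True, "web02": False, "web03": True}, minimum=3)

print(total, busiest, alert)
```

No fake host is needed here: the grouping comes from the name pattern, and the group-level check looks at the set of hosts as a whole.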

Having to either collect metrics twice – once for alerting and another for trending. We certainly don’t require that. Any datapoint that is collected can be alerted on, graphed, both or neither. (Sometimes datapoints are collected as they are used in other calculated datapoints, derived from multiple inputs.)

Only being able to see my metrics in 5 minute intervals. Again, we don’t impose that restriction – you can specify the collection interval for each datasource, from 1 minute to once a day. (I know going to only 1 minute resolution is not ideal for some applications – but as a SaaS delivery model, we currently impose that limit to protect ourselves, until the next rewrite of the backend storage engine, which should remove that.)

Having to chose between shitty interface but great monitoring or shitty monitoring but great interface. I think we have a pretty good interface and great monitoring.  Certainly our interface is orders of magnitude better than it was when we launched, and a lot of people give us kudos for it.  But there’s lots of room for improvement.

Dealing with a monitoring system that thinks IT is the system of truth for my environment. LogicMonitor thinks it is the truth for what your monitoring should be monitoring – but it’s willing to listen. :-)  It’s easy to use the API to put hooks into Puppet, Kickstart, etc. that automatically add hosts to monitoring, assign them to groups, and so on.  We’re looking at integration with Puppet Labs’ MCollective initiative and other things to get further along on this issue.

Perl. Our collectors are agnostic when it comes to scripting. They support collection and discovery scripts in the native languages of whatever platform they are running on – so VBScript, PowerShell, C# on Windows; bash, Ruby, Perl, etc. on Linux. But as our collectors are Java based, we encourage Groovy as the scripting language for cross-platform goodness.  The collectors expose a bunch of their own functionality (SNMP, JMX, expect, etc.) to Groovy, so it makes a lot of things very easy.  So it’s the language we use for writing and extending datasources for our customers. But if Perl is your thing, keep at it.

So, does LogicMonitor suck?  I don’t think so, and hopefully DevOps Borat does not either.

I’ll be at the DevOps Days conference in Austin this coming week (LogicMonitor is sponsoring), so hopefully we’ll get some more feedback there.

Or post below to let us know what constitutes “non-sucky” monitoring.


The value of IPMI monitoring

March 23, 2012 – 10:10 am

Amongst its many monitoring methods, LogicMonitor supports IPMI.  Many people aren’t aware of IPMI, or don’t think it’s necessary. And while I’m certainly an advocate of avoiding unnecessary complexity in a data center, sometimes it is good to wear both a belt and suspenders.

A real life example from one of our own data centers conveniently occurred just this morning, when I was looking for fodder to blog about: (more…)


Visualizing NetApp disk performance and latency

March 19, 2012 – 11:51 am

When monitoring a NetApp, the thing that matters is (for most applications) the latency of requests on a volume (or LUN.)

Easy enough to get – with LogicMonitor it’s graphed and alerted on automatically, for every volume. But of course when there is an issue, the focus changes to why there is latency. Usually the cause is that the disks in the aggregate are IO bound. Assuming there is no need for a reallocate (the disks are evenly loaded – I’ll write a separate article about how to determine that), how can you tell what level of disk busy-ness is acceptable? Visualizing that performance like the below is what this post is about.

  (more…)


Avoiding a network outage with Cisco monitoring

March 6, 2012 – 10:38 am

Late last night our ops team (of which I am a member) got paged about the CPU load on a Cisco 3560 switch in a new datacenter.  My initial reaction was “We don’t need this alert escalated to pagers or phones – 3560s switch and route in hardware, so CPU load doesn’t matter.”  Once I’d woken up a bit more, the corollary – that there is no possible way that this switch should be at a CPU level to trigger an error alert – occurred to me. (more…)


Agile Monitoring Support

February 3, 2012 – 4:24 pm

We recently had a customer come into trial looking around for a new monitoring solution.  This is always good for us.  We love the takeaway (customers defecting from other monitoring systems to us).  As in most takeaway situations, this customer had specific needs.  Now there are the obvious ones where LogicMonitor easily fits the bill, such as alerting, dashboards, performance monitoring, etc. (and if you fall into that VMware, Cisco, NetApp sweet spot, game over!).  This guy, however, had a very specific need we didn’t fulfill directly out of the gate.  I think anyone who has ever worked with a monitoring solution knows that it’s hard to find one that does everything.  Well, LogicMonitor is no different.  We don’t do EVERYTHING.  I know, you thought I was going to get all high and mighty and talk about how LogicMonitor is the one monitoring tool that CAN do everything.  Well (more…)


Metrics for DevOps

January 21, 2012 – 4:33 pm

At LogicMonitor we take turns learning from each other in informal sessions.  One week it may be developers talking about MySQL and NoSQL; another, the marketing guys talking about lead generation and AdWords.  This time we’d arrived on the topic of programming languages, and how there is a trade-off: assembler or C gives you code speed and efficiency at the expense of programmer efficiency, while languages and frameworks with higher levels of abstraction, like Ruby on Rails or Python/Django, give you much better programmer productivity at the expense of code efficiency.

Someone asked if that abstraction and inefficiency matters:  as in most operational issues, it matters only if it matters.  By which I mean if you are writing a system that is lightly used, or is on powerful hardware – it may not matter at all. But if you suddenly have an increased workload, it may matter a lot. (See the early occurrences of Twitter’s fail whale and RoR scaling.)

Then the question was asked, how can you know whether you are improving things when you change code?  Trend it, of course. You probably know what will constrain your application performance. (If not, you need better monitoring.)  For many sites, an obvious constraint is likely to be database queries per second.  So plot database queries per web request over time (more…)
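As a rough illustration of trending that ratio, here is a small Python sketch. It assumes you already sample two cumulative counters, total database queries and total web requests, at regular intervals; the sample values and the function name are hypothetical.

```python
def queries_per_request(samples):
    """Compute database queries per web request for each sampling interval.

    samples: list of (timestamp, total_queries, total_requests) tuples,
    where both totals are cumulative counters.
    """
    points = []
    for (_, q0, r0), (t1, q1, r1) in zip(samples, samples[1:]):
        requests = r1 - r0
        if requests > 0:  # skip idle intervals to avoid dividing by zero
            points.append((t1, (q1 - q0) / requests))
    return points


# Hypothetical samples taken every 60 seconds.
samples = [
    (0, 0, 0),
    (60, 1200, 100),   # 12 queries per request
    (120, 2400, 200),  # 12 queries per request
    (180, 3000, 300),  # 6 queries per request, e.g. after a code change
]

series = queries_per_request(samples)
for timestamp, ratio in series:
    print(timestamp, ratio)
```

Plotting the ratio rather than raw queries per second keeps the trend comparable when traffic goes up or down, so a drop in the line reflects the code change rather than a quiet afternoon.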


How to minimize the impacts of the next Amazon reboot .. or of your own datacenter failure

January 6, 2012 – 2:45 pm

So as everyone knows, Amazon rebooted virtually all EC2 instances in December.  They emailed people to notify them, but not everyone read the emails, leading to Amazon performing the reboots on their own schedule, with the customers unaware.

For some SaaS companies, this resulted in many hours of downtime. For others, there was a short impact. What was the difference? (more…)
