Metrics for DevOps

January 21, 2012 – 4:33 pm

At LogicMonitor we take turns learning from each other in informal sessions. One week it may be developers talking about MySQL and NoSQL; another, the marketing guys talking about lead generation and AdWords. This time we arrived at the topic of programming languages and the trade-off they involve: assembler or C buys code speed and efficiency at the expense of programmer efficiency, while languages and frameworks with higher levels of abstraction, like Ruby on Rails or Python/Django, buy much better programmer productivity at the expense of code efficiency.

Someone asked if that abstraction and inefficiency matters:  as in most operational issues, it matters only if it matters.  By which I mean if you are writing a system that is lightly used, or is on powerful hardware – it may not matter at all. But if you suddenly have an increased workload, it may matter a lot. (See the early occurrences of Twitter’s fail whale and RoR scaling.)

Then the question was asked, how can you know whether you are improving things when you change code?  Trend it, of course. You probably know what will constrain your application performance. (If not, you need better monitoring.)  For many sites, an obvious constraint is likely to be database queries per second.  So plot database queries per web request over time (more…)
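As a rough illustration of what that trend looks like as data, here is a minimal Python sketch. It assumes you can periodically sample two ever-increasing counters, total database queries and total web requests; the sample layout and the function name are made up for this example and are not a LogicMonitor API.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp_seconds, total_db_queries, total_web_requests),
# where both totals are counters that only ever increase.
Sample = Tuple[float, int, int]


def queries_per_request(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (timestamp, database queries per web request) for each interval."""
    points = []
    for (_, q0, r0), (t1, q1, r1) in zip(samples, samples[1:]):
        queries = q1 - q0
        requests = r1 - r0
        if requests <= 0 or queries < 0:
            # No traffic in this interval, or a counter was reset: skip it.
            continue
        points.append((t1, queries / requests))
    return points


samples = [(0, 1000, 100), (60, 1600, 200), (120, 2800, 300)]
trend = queries_per_request(samples)
```

If a code change doubles the number of queries each page needs, the plotted ratio doubles too, even when overall traffic has not changed.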

How to minimize the impacts of the next Amazon reboot .. or of your own datacenter failure

January 6, 2012 – 2:45 pm

So as everyone knows, Amazon rebooted virtually all EC2 instances in December.  They emailed people to notify them, but not everyone read the emails, leading to Amazon performing the reboots on their own schedule, with the customers unaware.

For some SaaS companies, this resulted in many hours of downtime. For others, there was a short impact. What was the difference? (more…)

Eating our own Tomcat Monitoring dogfood

December 24, 2011 – 9:27 pm

We received some alerts tonight that one Tomcat server was using about 95% of its configured thread maximum.

The Tomcat process on http-443  on prod4 now has 96.2 %
of the max configured threads in the busy state.

These were SMS alerts, as that was close enough to exhausting the available threads to warrant waking someone up if needed.
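The check behind an alert like this is simple arithmetic. The sketch below is only an illustration: the function names, the 80% warning level, and the 95% SMS level are assumptions for the example, not our actual datasource definition or thresholds.

```python
def busy_thread_percent(busy_threads: int, max_threads: int) -> float:
    """Percentage of the configured thread maximum that is currently busy."""
    if max_threads <= 0:
        raise ValueError("max_threads must be positive")
    return 100.0 * busy_threads / max_threads


def alert_level(percent: float, warn_at: float = 80.0, sms_at: float = 95.0) -> str:
    """Map a busy percentage to a hypothetical alert level."""
    if percent >= sms_at:
        return "sms"
    if percent >= warn_at:
        return "warn"
    return "ok"


# Example with made-up numbers: 481 busy threads out of a maximum of 500.
percent = busy_thread_percent(481, 500)
level = alert_level(percent)
```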

The other alert we got was that Tomcat was taking an unusual time to process requests, as seen in this graph: (more…)

Right sizing infrastructure for VMWare migrations

December 9, 2011 – 12:03 pm

I was invited to talk to an MSP peer group the other week, and during the presentation, one of the group members who was a LogicMonitor customer described a way they use LogicMonitor to solve a previously hard-to-solve VMWare operational issue. (more…)

Use LogicMonitor, save the world.

November 25, 2011 – 5:35 pm

My wife was reading the science journal of UCSB (where she did her Master’s degree) and pointed out an article referring to the fact that “a typical server consumes as much energy in a year as an SUV”. She then asked how many servers we have….

I found this a bit dismaying at first, as we try to engage in sustainable actions both personally (we have solar panels, two Prius cars, our own chickens – the typical urban hippie) and at LogicMonitor as a corporate entity (we recycle, encourage non-car transport, etc). I didn’t like to think that our servers were undoing all the other environmental actions we were taking.

But on reflection, I realized that LogicMonitor is an environmental net positive. Our servers are not only fairly energy efficient (using SSDs, which use about half the energy of rotational disks), but they are very heavily leveraged. Each of our servers is collectively replacing about 100 individual servers that our customers would have deployed if they ran their own monitoring servers. (More, if they ran redundant monitoring servers like we do.)

So from that point of view, LogicMonitor is taking hundreds of SUVs off the road. All while freeing our customers from having to run, patch, and back up those servers, or extend their software to deal with new things like MongoDB monitoring, and so on – we do all that for them. So that means they can get out of the office earlier, and avoid the peak traffic times – saving more driving impact. Unless they have something else to do at work. :-)

Happy Thanksgiving.

Monitoring the right things

October 16, 2011 – 2:57 am

I’ve talked about this before, but I just read an article about why application performance monitoring is so screwed up, and coincidentally had just talked about it in a lecture I gave to a graduate class at UCSB on scalable computing, so figured it’s worth a mention.

The article mentions that “enterprises have confused (with vendor help) the notion of monitoring the resources that an application uses with its performance”.  The way I put it in my lecture was that:

  • Systems are limited by Disk IO, memory and less commonly CPU and network.
  • Users don’t care about Disk/memory/CPU/network… They care about web pages, and speed.

So… how to tie one to the other?

Monitor both.

Monitor what users care about (page load times, response per request, etc)



Also monitor all the limiting resources (CPU, Disk IO – or more importantly what percentage of the time a drive is busy, network, memory):

And monitor the performance of the systems that affect the limiting resources:

So while monitoring InnoDB file system reads does not tell you anything that an end user cares about, if your monitoring of Tomcat request time shows that users are experiencing poor performance, and your logical drives are suddenly 100% busy with request service time increasing, it’s good to know why that is. It may be because of InnoDB buffer misses, or it may be because of something else – but having this intermediate data will drastically reduce your time to correct the issue that users care about – response time.

Another point to note: the “user” in the phrase “monitor what users care about” may not be a human.  If a server is a memcached server – the users for this server are web servers, who care about memcached response time, availability and hit rates.  So on this class of machines, that is the thing to monitor to determine if the service is meeting the needs of users.

In short, for every machine, identify the “thing(s) to care about” for it; monitor those things; monitor the constrained resources; and monitor all aspects of the systems on that server that impact the constrained resources.
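One way to keep yourself honest about those three layers is to write them down per class of machine and check that none is empty. The Python sketch below is just that, a sketch: the class, its field names, and the metric labels are hypothetical, chosen to mirror the Tomcat/InnoDB and memcached examples above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MonitoringPlan:
    """The three layers to monitor for one class of machine (hypothetical structure)."""
    role: str
    user_facing: List[str] = field(default_factory=list)            # what users care about
    constrained_resources: List[str] = field(default_factory=list)  # disk IO, memory, CPU, network
    contributing_systems: List[str] = field(default_factory=list)   # what drives those resources

    def missing_layers(self) -> List[str]:
        """Names of the layers that have nothing monitored yet."""
        layers = {
            "user_facing": self.user_facing,
            "constrained_resources": self.constrained_resources,
            "contributing_systems": self.contributing_systems,
        }
        return [name for name, metrics in layers.items() if not metrics]


web_db = MonitoringPlan(
    role="tomcat + mysql",
    user_facing=["tomcat request time"],
    constrained_resources=["logical drive busy %", "request service time"],
    contributing_systems=["innodb buffer misses", "innodb file system reads"],
)

# Here the "users" are web servers, and the last layer has not been filled in yet.
memcached = MonitoringPlan(
    role="memcached",
    user_facing=["response time", "availability", "hit rate"],
    constrained_resources=["memory", "network"],
)
```

Calling `memcached.missing_layers()` flags the empty third layer, which is exactly the intermediate data you will want when response time degrades.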

Powershell, snap ins and 32 bit apps

October 1, 2011 – 2:05 am

A more technical article today.

In adding some more Exchange Monitoring we ran into some issues, and solutions, that may help others.  Some things in recent Exchange versions can only be monitored by Powershell. (Perfmon, WMI, Powershell, all needed for different versions of Exchange…. I wish they’d make up their mind…)

So the first issue was that Powershell scripts, when called from a LogicMonitor agent, never returned. This wasn’t too hard – simply pass the parameter -inputformat with the (undocumented) option “none”, and the agent can successfully run Powershell commands:

powershell -inputformat none dbstatus.ps1

(Why? The Microsoft.PowerShell.ConsoleHost class constructs a M.PS.WrappedDeserializer passing the STDIN TextReader as one of the parameters. By default, the WrappedDeserializer will call ReadLine() on this STDIN TextReader and wait indefinitely, effectively hanging PowerShell and the calling process. That’s why.)

With that hurdle cleared, we hit the next one:

>> powershell -inputformat none dbstatus.ps1
Add-PSSnapin : No snap-ins have been registered for Windows PowerShell version 2.

Yet running the exact same command from the command shell on the host running the agent resulted in the output we were expecting. And we could see the Exchange snap in, called by the Powershell script, was correctly registered, and in fact worked fine.

But our agent was running on a 32 bit JVM, and Exchange 2010 (in our lab, at least) is installed on 64 bit Windows. The Powershell snap in was only visible when Powershell was started from a 64 bit app. When I started Powershell from the cmd.exe in SysWOW64, I got the same error about missing snap-ins as our agent reported.

The solution – it doesn’t matter that our agent was installed as a 32 bit app, in Program files (x86). What mattered was that the Java virtual machine launched by the agent, that ultimately launched Powershell, be a 64 bit JVM, not the default 32 bit JVM installed from Java.com. (At least, a 32 bit JVM is the default when you browse to Java.com with a 32 bit browser.)

So, running the LogicMonitor agent with a 64 bit JVM, and Powershell started with “-inputformat none” gives us full access to Powershell output and all its snap ins, so expect some datasources released very shortly to take advantage of that.
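The same two concerns apply to any program that launches Powershell, not just our Java agent. As a minimal sketch in Python (not the agent's actual code), the snippet below reports the bitness of the launching process and builds the command line from this post. The function names are hypothetical, and handing the child an empty STDIN is an extra precaution I am assuming is harmless, not something the fix above depends on.

```python
import struct
import subprocess
from typing import List


def process_bitness() -> int:
    """Pointer size of the current process in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def powershell_command(script: str) -> List[str]:
    """The command line from this post: -inputformat none stops the wait on STDIN."""
    return ["powershell", "-inputformat", "none", script]


def run_script(script: str) -> subprocess.CompletedProcess:
    """Run the script and capture its output (only meaningful on a Windows host)."""
    return subprocess.run(
        powershell_command(script),
        stdin=subprocess.DEVNULL,  # assumption: belt and braces, so nothing blocks on input
        capture_output=True,
        text=True,
    )


if process_bitness() != 64:
    # A 32 bit launcher on 64 bit Windows gets the 32 bit Powershell,
    # which cannot see snap ins registered only for 64 bit.
    print("warning: 32 bit process; 64 bit-only snap ins will not be visible")
```

On the Exchange host you would call `run_script("dbstatus.ps1")` and read `stdout` from the result.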

Hosted monitoring in a hurricane

September 13, 2011 – 5:19 pm

One of our customer acquisitions recently came about because the company wanted to be assured of their I.T. infrastructure’s availability during Hurricane Irene. Their datacenter was located in the impact area, and obviously a premise-based monitoring system could not be relied on to alert them of any impacts if the monitoring system itself was going to be impacted.

They contacted us two days before the hurricane, and were completely monitoring all their infrastructure that same day, with the alerts coming from our datacenters, not theirs.

They and their infrastructure came through Irene unscathed – and they knew that they did, because they used a hosted monitoring system. Had they used a premise-based monitoring system, they would not have known if their lack of alerts was because their monitoring system had been flooded or cut off from the internet.

So while enhanced disaster preparedness is not usually the way we sell our value, it’s certainly a nice bonus.

Netapp Monitoring – too much of a good thing?

September 13, 2011 – 4:56 pm

One way LogicMonitor is different from other NetApp monitoring systems (other than being hosted monitoring, and being able to monitor the complete array of systems found in a datacenter – from AC units, through virtualization, OS’s to applications like MongoDB) is that we default to “monitoring on”.

That is, we assume you want to monitor everything, always. (You can of course turn off monitoring or alerting for specific groups, hosts or objects.) This serves us well almost always – we will detect a new volume on your NetApp once you create it, and start monitoring it for read and write latency, number and type of operations, etc. This means that when you have an issue on that volume (or other groups are blaming storage performance for an issue), the monitoring of all sorts of metrics was already in place before the issue – so you have the data and alerts to know whether the storage was or was not the issue.

However, we have found some cases where this doesn’t work so well. We have been monitoring NetApp disk performance by default, too, tracking the number of operations and the busy time for each disk. This is useful for identifying when disks need to be rebalanced (if the top 10 and bottom 10 disks by busy time are wildly different). But customers with larger NetApps often have hundreds of disks, each of which we would monitor via API queries. And while we only monitor the performance of a disk every 5 minutes (as opposed to volumes and PAM cards and things that are monitored more frequently), this apparently overloads the API subsystem of NetApp devices.
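To make the rebalancing check concrete, here is a small Python sketch that compares the busiest and least busy disks. The function names and the 30 point threshold are assumptions for illustration; what counts as “wildly different” is a judgment call for your own array.

```python
from typing import Dict


def busy_time_spread(busy_percent: Dict[str, float], n: int = 10) -> float:
    """Mean busy % of the n busiest disks minus the mean of the n least busy."""
    values = sorted(busy_percent.values())
    if not values:
        return 0.0
    # Never let the two groups overlap when there are fewer than 2 * n disks.
    n = min(n, len(values) // 2) or 1
    bottom, top = values[:n], values[-n:]
    return sum(top) / len(top) - sum(bottom) / len(bottom)


def needs_rebalance(busy_percent: Dict[str, float], threshold: float = 30.0) -> bool:
    """Hypothetical rule of thumb: flag the array if the spread exceeds the threshold."""
    return busy_time_spread(busy_percent) > threshold


disks = {"disk1": 10.0, "disk2": 20.0, "disk3": 80.0, "disk4": 90.0}
spread = busy_time_spread(disks, n=2)
```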

We’d see that when we’d restart the collection process, and the only monitoring by the API was for the volume performance, things worked great – the response to an API request from a NetApp was around 100 ms.

When the disk requests started getting added in (and we stagger and skew the requests, so they are not all hitting at once) – the API response time for a single query climbed up to 40 seconds.
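For readers wondering what staggering means here, the idea is to spread the per-disk queries across the polling interval rather than firing them together. The Python sketch below shows only the simplest even spacing; it is not our collector's actual scheduling, and the function name is made up.

```python
from typing import Dict, List


def staggered_offsets(disk_names: List[str], interval_seconds: float = 300.0) -> Dict[str, float]:
    """Assign each disk a start offset so queries are spread evenly over the interval."""
    count = len(disk_names)
    if count == 0:
        return {}
    step = interval_seconds / count
    return {name: index * step for index, name in enumerate(disk_names)}


# Three disks polled every 5 minutes are queried 100 seconds apart.
offsets = staggered_offsets(["disk1", "disk2", "disk3"])
```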

This started causing a backlog of monitoring, and was causing data to be missed in the more important volume performance metrics.
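Some rough arithmetic shows why a 40 second response time turns into a backlog. The sketch below assumes, purely for illustration, that queries to a single NetApp are issued one at a time; real collection may overlap requests, so treat the numbers as a worst-case picture. The function name is hypothetical.

```python
def max_serial_queries(interval_ms: int, response_ms: int) -> int:
    """How many back-to-back queries fit into one polling interval."""
    return interval_ms // response_ms


FIVE_MINUTES_MS = 5 * 60 * 1000

# At the healthy response time of around 100 ms:
healthy = max_serial_queries(FIVE_MINUTES_MS, 100)

# At the degraded response time of 40 seconds:
degraded = max_serial_queries(FIVE_MINUTES_MS, 40_000)
```

Under that assumption, `healthy` comes to 3000 queries per 5 minute cycle and `degraded` to 7, which is nowhere near enough for an array with hundreds of disks.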

So… while we’ll open a case with NetApp, in the interim, we’ll probably disable the monitoring of physical disk performance by default to avoid this issue.

LogicMonitor’s Hosted Monitoring – best support just got better

August 29, 2011 – 5:11 pm

One of the great things about LogicMonitor’s hosted monitoring is the support we can offer. Because we are hosted monitoring, customers can choose to grant our support staff access to their accounts so we can help them directly; they can chat with an engineer in their portal, or they can email or phone us.

Today we are announcing another support channel: support.logicmonitor.com

This is a community site, where you can post questions, report problems, suggest ideas or even give praise.

The advantage of this support channel is that it accumulates knowledge – so once we (or other community members) have answered a question, the answer will be immediately available for others to find when they ask a similar question. If a question is posted and there is no answer yet, LogicMonitor staff will be notified and we can answer the question directly.

So we encourage everyone to use this as their first line of support – it should benefit everyone, and we’ll also be using it in the future for some cool contests, like the most interesting LogicMonitor alert and solution of the month. (As a matter of fact, if you have ideas for cool contests, suggest them at support.logicmonitor.com!)

Go check it out!
