Last week I was traveling to and from our Austin office – which means a fair amount of time for reading. Amongst other books, I read “The Principles of Product Development Flow”, Reinertsen, Donald G. The most interesting part of this book (to me) was the chapters on queueing theory.
Little’s law shows the Wait time for a process = queue size/processing rate. That’s a nice intuitive formula, but with broad applicability.
So what does this have to do with monitoring? Read more »
Have you ever been the guy in charge of storage and the dev guy and database guy come over to your desk waaaaay too early in the morning before you’ve had your caffeine and start telling you that the storage is too slow and you need to do something about it? I have. In my opinion it’s even worse when the Virtualization guy comes over and makes similar accusations, but that’s another story.
Now that I work for LogicMonitor I see this all the time. People come to us because “the NetApps are slow”. All too often we come to find that it’s actually the ESX host itself, or the SQL server having problems because of poorly designed queries. I’ve experienced this first hand before I worked for LogicMonitor,so it’s no surprise to me that this is a regular issue. When I experienced this problem myself I found it was vital to monitor all systems involved so I could really figure out where the bottleneck was.
When monitoring a NetApp, the thing that matters is (for most applications) the latency of requests on a volume (or LUN.)
Easy enough to get – with LogicMonitor it’s graphed and alerted on automatically, for every volume. But of course when there is an issue, the focus changes to why there is latency. Usually it’s a limitation of the disks in the aggregate being IO bound. Assuming there is no need for a reallocate (the disks are evenly loaded – I’ll write a separate article about how to determine that), how can you tell when what level of disk busy-ness is acceptable? Visualizing that performance like the below is what this post is about.
One way LogicMonitor is different from other NetApp monitoring systems (other than being hosted monitoring, and being able to monitor the complete array of systems found in a datacenter – from AC units, through virtualization, OS’s to applications like MongoDB) is that we default to “monitoring” on”.
i.e. we assume you want to monitor everything, always. (You can of course turn off monitoring or alerting for specific groups, hosts or objects.) This serves us well almost always – we will detect a new volume on your NetApp once you create it, and start monitoring it for read and write latency, number and type of operations, etc – this means that when you have an issue on that volume (or other groups are blaming storage performance for an issue), you already have the monitoring of all sorts of metrics in place before the issue – so you have the data and alerts to know whether the storage was or was not the issue.
However, we have found some cases where this doesn’t work so well. We have been monitoring NetApp disk performance by default, too, tracking the number of operations and the busy time for each disk. However, on customers with larger NetApps, there are often hundreds of disks, each of which we would monitor via API queries. This is useful for identifying when disks need to be rebalanced (if the top 10 and bottom 10 disks by busy time are wildly different.) And while we only monitor the performance of a disk every 5 minutes (as opposed to volumes and PAM cards and things that are monitored more frequently), this apparently overloads the API subsystem of NetApp devices.
We’d see that when we’d restart the collection process, and the only monitoring by the API was for the volume performance, things worked great – the response to an API request from a NetApp was around 100 ms.
When the disk requests started getting added in (and we stagger and skew the requests, so they are not all hitting at once) – the API response time for a single query climbed up to 40 seconds.
This started causing a backlog of monitoring, and was causing data to be missed in the more important volume performance metrics.
So… while we’ll open a case with NetApp, in the interim, we’ll probably disable the monitoring of physical disk performance by default to avoid this issue.
Performance monitoring for all your infrastructure & applications. In minutes, not hours.
Questions? Call Us!
(888) 415-6442 or +1 (805)-617-3884