When monitoring a NetApp, the thing that matters for most applications is the latency of requests on a volume (or LUN).
Easy enough to get – with LogicMonitor it’s graphed and alerted on automatically, for every volume. But of course when there is an issue, the focus changes to why there is latency. Usually the cause is that the disks in the aggregate are IO bound. Assuming there is no need for a reallocate (the disks are evenly loaded – I’ll write a separate article about how to determine that), how can you tell what level of disk busyness is acceptable? Visualizing that performance is what this post is about.
We recently had a customer come into trial looking for a new monitoring solution. This is always good for us. We love the takeaway (customers defecting from other monitoring systems to us). As in most takeaway situations, this customer had specific needs. There are the obvious ones where LogicMonitor easily fits the bill, such as alerting, dashboards, performance monitoring, etc. (and if you fall into that VMware, Cisco, NetApp sweet spot, game over!). This customer, however, had a very specific need we didn’t fulfill directly out of the gate. I think anyone who has ever worked with a monitoring solution knows that it’s hard to find one that does everything. LogicMonitor is no different. We don’t do EVERYTHING. I know, you thought I was going to get all high and mighty and talk about how LogicMonitor is the one monitoring tool that CAN do everything. Well Read more »
So as everyone knows, Amazon rebooted virtually all EC2 instances in December. They emailed people to notify them, but not everyone read the emails, leading to Amazon performing the reboots on their own schedule, with the customers unaware.
For some SaaS companies, this resulted in many hours of downtime. For others, there was a short impact. What was the difference? Read more »
One way LogicMonitor is different from other NetApp monitoring systems (other than being hosted monitoring, and being able to monitor the complete array of systems found in a datacenter – from AC units, through virtualization and OSes, to applications like MongoDB) is that we default to “monitoring on”.
That is, we assume you want to monitor everything, always. (You can of course turn off monitoring or alerting for specific groups, hosts or objects.) This serves us well almost always. We will detect a new volume on your NetApp once you create it, and start monitoring it for read and write latency, number and type of operations, etc. So when you have an issue on that volume (or other groups are blaming storage performance for an issue), the monitoring of all sorts of metrics was already in place before the issue, and you have the data and alerts to know whether or not the storage was the cause.
However, we have found some cases where this doesn’t work so well. We have been monitoring NetApp disk performance by default, too, tracking the number of operations and the busy time for each disk. This is useful for identifying when disks need to be rebalanced (if the top 10 and bottom 10 disks by busy time are wildly different). But customers with larger NetApps often have hundreds of disks, each of which we would monitor via API queries. And while we only monitor the performance of a disk every 5 minutes (as opposed to volumes, PAM cards and other things that are monitored more frequently), this apparently overloads the API subsystem of NetApp devices.
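The top 10 versus bottom 10 comparison mentioned above can be sketched roughly as follows. This is only an illustration, not LogicMonitor’s actual logic: the function names, the input format (a mapping of disk name to busy percentage) and the 2.0 ratio threshold are all assumptions made for the example.

```python
from statistics import mean


def busy_imbalance(busy_pct_by_disk, n=10):
    """Compare the n busiest disks with the n least busy disks.

    busy_pct_by_disk: hypothetical mapping of disk name -> busy time (%).
    Returns (top_mean, bottom_mean, ratio).
    """
    values = sorted(busy_pct_by_disk.values())
    if len(values) < 2 * n:
        raise ValueError("need at least 2 * n disks to compare")
    bottom_mean = mean(values[:n])   # n least busy disks
    top_mean = mean(values[-n:])     # n busiest disks
    ratio = top_mean / bottom_mean if bottom_mean > 0 else float("inf")
    return top_mean, bottom_mean, ratio


def needs_rebalance(busy_pct_by_disk, n=10, max_ratio=2.0):
    """True when the busiest disks are far busier than the idlest ones.

    max_ratio is an arbitrary illustrative threshold, not a NetApp guideline.
    """
    _, _, ratio = busy_imbalance(busy_pct_by_disk, n)
    return ratio > max_ratio
```

What counts as “wildly different” depends on the workload, so the threshold would need tuning per environment.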
We’d see that when we restarted the collection process, and the only monitoring via the API was for volume performance, things worked great – the response to an API request from a NetApp took around 100 ms.
When the disk requests started getting added in (and we stagger and skew the requests, so they are not all hitting at once), the API response time for a single query climbed to 40 seconds.
This started causing a backlog of monitoring, and was causing data to be missed in the more important volume performance metrics.
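A back-of-the-envelope sketch shows why a backlog builds up at those response times. The 100 ms and 40 second response times and the 5 minute interval come from the figures above; the count of 300 disks and the concurrency values are assumptions for illustration only, and the real collector’s scheduling is not described here.

```python
import math


def cycle_seconds(num_queries, response_s, concurrency=1):
    """Time to finish one polling pass when at most `concurrency`
    queries are in flight at once (a simplified model)."""
    return math.ceil(num_queries / concurrency) * response_s


def falls_behind(num_queries, response_s, interval_s=300, concurrency=1):
    """True when one pass takes longer than the polling interval,
    so each cycle starts later than the last and a backlog grows."""
    return cycle_seconds(num_queries, response_s, concurrency) > interval_s


# Hypothetical array with 300 disks, polled every 5 minutes (300 s).
print(falls_behind(300, 0.1))                 # healthy API: about 30 s per pass
print(falls_behind(300, 40))                  # 12000 s per pass, far over 300 s
print(falls_behind(300, 40, concurrency=10))  # 1200 s per pass, still over
```

Under this simple model, even ten parallel requests cannot complete a pass within the interval once each query takes 40 seconds.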
So… while we’ll open a case with NetApp, in the interim, we’ll probably disable the monitoring of physical disk performance by default to avoid this issue.