One of our recent customer acquisitions came about because the company wanted to be assured of their IT infrastructure’s availability during Hurricane Irene. Their datacenter was located in the impact area, and obviously a premises-based monitoring system could not be relied on to alert them of any impacts if the monitoring system itself was going to be affected.
They contacted us two days before the hurricane, and had all of their infrastructure fully monitored that same day, with the alerts coming from our datacenters, not theirs.
They and their infrastructure came through Irene unscathed – and they knew that they did, because they used a hosted monitoring system. Had they used a premises-based monitoring system, they would not have known whether the lack of alerts meant all was well, or that their monitoring system had been flooded or cut off from the internet.
So while enhanced disaster preparedness is not usually the way we sell our value, it’s certainly a nice bonus.
One way LogicMonitor differs from other NetApp monitoring systems (other than being hosted, and being able to monitor the complete array of systems found in a datacenter – from AC units through virtualization and OSes to applications like MongoDB) is that we default to “monitoring on”.
That is, we assume you want to monitor everything, always. (You can of course turn off monitoring or alerting for specific groups, hosts, or objects.) This serves us well almost always: we will detect a new volume on your NetApp as soon as you create it, and start monitoring it for read and write latency, the number and type of operations, and so on. When you have an issue on that volume (or other groups are blaming storage performance for an issue), the monitoring of all these metrics was already in place before the issue arose – so you have the data and alerts to know whether the storage was, or was not, the problem.
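As a rough illustration of the “monitoring on” default, here is a simplified sketch of a discovery pass that enables monitoring for every newly found volume unless it has been explicitly opted out. The function and metric names are invented for illustration; this is not LogicMonitor’s actual implementation.

```python
# Sketch of a "monitoring on" reconciliation pass: every discovered
# instance is monitored by default, unless explicitly disabled.
def reconcile(discovered_volumes, monitored, disabled):
    """Enable default metrics for any newly discovered volume that
    is not on the explicit opt-out list; drop vanished volumes."""
    for vol in discovered_volumes:
        if vol not in monitored and vol not in disabled:
            monitored[vol] = ["read_latency", "write_latency", "ops"]
    for vol in list(monitored):
        if vol not in discovered_volumes:
            del monitored[vol]  # volume no longer exists on the filer
    return monitored

monitored = {}
monitored = reconcile({"vol0", "vol1"}, monitored, disabled={"vol1"})
# vol0 is monitored automatically; vol1 was explicitly opted out
print(sorted(monitored))
```

The key design point is that the default action on discovery is "monitor", so new volumes get coverage before anyone asks for it.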
However, we have found some cases where this doesn’t work so well. We have been monitoring NetApp disk performance by default, too, tracking the number of operations and the busy time for each disk. On customers with larger NetApps, however, there are often hundreds of disks, each of which we would monitor via API queries. This is useful for identifying when disks need to be rebalanced (for example, if the top 10 and bottom 10 disks by busy time are wildly different). And while we only poll the performance of each disk every 5 minutes (as opposed to volumes, PAM cards, and other objects that are monitored more frequently), this apparently overloads the API subsystem of NetApp devices.
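A rough way to flag the rebalancing condition described above is to compare the average busy time of the ten busiest disks against the ten least busy. This is a sketch with made-up names and an arbitrary threshold, not our actual alert logic:

```python
def needs_rebalance(busy_pct_by_disk, ratio=3.0):
    """Flag an aggregate as unbalanced when the 10 busiest disks are
    far busier (by average busy %) than the 10 least busy disks."""
    values = sorted(busy_pct_by_disk.values())
    if len(values) < 20:
        return False  # too few disks to compare deciles meaningfully
    avg_bottom = sum(values[:10]) / 10
    avg_top = sum(values[-10:]) / 10
    if avg_bottom == 0:
        return avg_top > 0  # idle disks alongside busy ones
    return avg_top / avg_bottom > ratio

# Ten disks at 10% busy, ten at 80% busy: an 8x spread
disks = {f"disk{i}": 10.0 for i in range(10)}
disks.update({f"hot{i}": 80.0 for i in range(10)})
print(needs_rebalance(disks))  # True
```

The threshold (here, a 3x ratio) would need tuning against real workloads; the point is only that per-disk busy time makes this comparison possible at all.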
We could see this when we restarted the collection process: while the only API-based monitoring was for volume performance, things worked great – the response to an API request to a NetApp was around 100 ms.
When the disk requests started getting added in (and we stagger and skew the requests, so they do not all hit at once), the API response time for a single query climbed to 40 seconds.
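The stagger-and-skew idea can be sketched like this: give each disk a deterministic offset within the polling interval, so requests spread across the 5-minute window rather than arriving in a burst. The names here are hypothetical, not the actual collector code:

```python
import hashlib

INTERVAL = 300  # poll each disk every 5 minutes

def poll_offset(disk_id, interval=INTERVAL):
    """Deterministic per-disk offset in [0, interval): hashing the
    disk id spreads poll times roughly evenly across the window."""
    digest = hashlib.sha1(disk_id.encode()).hexdigest()
    return int(digest, 16) % interval

# 300 disks: each polls at its own second within the 5-minute cycle,
# instead of all 300 API requests landing at the top of the interval
offsets = [poll_offset(f"disk{i}") for i in range(300)]
print(min(offsets), max(offsets))
```

Even with this spreading, the aggregate query rate against the filer’s API was evidently still too high, which is what drove the decision below.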
This caused a backlog in monitoring, and data began to be missed for the more important volume performance metrics.
So… while we’ll open a case with NetApp, in the interim, we’ll probably disable the monitoring of physical disk performance by default to avoid this issue.