In recent years, solid-state drives (SSDs) have become a standard part of data center architecture. They handle more simultaneous read/write operations than traditional disks and use a fraction of the power. As a leading monitoring platform vendor, we are naturally very interested in monitoring our SSDs, not only because we want to make sure we’re getting what we paid for, but also because we would like to avoid a disk failure on a production machine at 3:00 AM…and the Shaquille O’Neal-sized headache to follow. But how do we know for sure whether our SSDs are performing the way we want them to? As one of the newest members of our technical operations team, I was not surprised to be tasked with answering this question.
So what actually happens to my SSD?
Solid-state drives are different from traditional spinning platters. There are no moving parts (no drive head that can crash into a platter) and there is nothing to demagnetize, but that does not mean they are immune to failure. On the contrary, they absolutely will fail, and the cause is the same technology that makes them so fast: NAND-based flash memory (a type of storage that does not require power to retain data). When you delete or overwrite a file on a solid-state drive, the old data is marked invalid and the new data is written to a new location in the NAND. The old data is erased later, when the drive needs more space. Flash cells on an SSD can be written and erased only a limited number of times before they become unreliable. Simply put…it is like continuously writing on a piece of paper with a pencil and then erasing it. You can only write and erase so many times before the paper is worn out and unusable.
Sure, there are ways to monitor your disk. You can keep an eye on disk reads and writes and proactively watch for poor performance based on the trends you see over time. At LogicMonitor, we already measure and alert on all the basics, such as IO completion time, read and write IOPS, request service time, queue depth, etc. But none of this gives us visibility into the hardware health of the SSD itself.
What if there was a way to see real time metrics on SSD wearout?
I found that SSD vendors now expose S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) counters, or attributes, that give information on the current health of the drive. My next question was how to access this data. Thankfully, the gods who blessed us with “ctrl, alt, delete” also gave us smartmontools. The package consists of smartctl and smartd. Smartctl is the application used to test drives and to read and report their hardware S.M.A.R.T. statistics.
There are a few important counters to take into consideration when monitoring your disks, such as write amplification, reserved space, and temperature. We wanted to focus on disk life in particular. For example, the Media Wear-Out Indicator on Intel SSDs reports a normalized value that starts at 100 (when the SSD is new) and declines to a minimum value of 1.
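As a rough illustration of reading that counter, the sketch below pulls the normalized VALUE column for the wear-out attribute out of `smartctl -A` style text. The sample output and its numbers are made up for the example, and the function name `media_wearout` is hypothetical; it is not the code behind the LogicMonitor datasource.

```python
# Illustrative sample of `smartctl -A` attribute output; the values are invented.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8760
233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age   Always       -       0
"""


def media_wearout(smartctl_output, attribute="Media_Wearout_Indicator"):
    """Return the normalized VALUE for the attribute, or None if it is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows look like: ID, name, flag, value, worst, thresh, ...
        if len(fields) >= 4 and fields[1] == attribute and fields[3].isdigit():
            return int(fields[3])
    return None


if __name__ == "__main__":
    print(media_wearout(SAMPLE_OUTPUT))
```

A value near 100 suggests a nearly new drive, while a value approaching 1 suggests the drive is close to the end of its rated life.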
Making a datasource!
LogicMonitor created a new Media Wearout Indicator datasource. It is currently being used by the Operations team to monitor the SSDs in our production machines. To create the datasource, we used various smartctl commands. For example, the command smartctl -l devstat -i -A -d sat+megaraid,0 /dev/sda reports SMART health statistics for physical disk 0 behind the RAID controller.
We then had two options: use a single SNMP extend to execute a script that loops through all the available disks, or define each command as a separate SNMP extend OID. We decided to go with the latter: having individual commands gives us the option of grabbing more statistics on each disk, and it also takes away the burden of managing an external script on each machine.
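To make the one-extend-per-disk approach concrete, here is a minimal sketch that generates Net-SNMP `extend` lines, one per physical disk, wrapping the smartctl command quoted above. The extend names (`smart-disk0`, ...), the smartctl path, and the number of disks are assumptions chosen for illustration, not values from our production configuration.

```python
def smartctl_args(disk_index, device="/dev/sda"):
    """Arguments for the smartctl call quoted in the post, for one physical disk."""
    return ["-l", "devstat", "-i", "-A", "-d", f"sat+megaraid,{disk_index}", device]


def extend_line(disk_index, device="/dev/sda", smartctl_path="/usr/sbin/smartctl"):
    """Build one snmpd.conf `extend NAME PROG ARGS` line (name and path are assumed)."""
    args = " ".join(smartctl_args(disk_index, device))
    return f"extend smart-disk{disk_index} {smartctl_path} {args}"


if __name__ == "__main__":
    # Assumption: four physical disks behind the controller.
    for index in range(4):
        print(extend_line(index))
```

Each generated line could be pasted into snmpd.conf, so no external script needs to be deployed on the monitored machine.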
The active discovery portion of the LogicMonitor datasource finds all the available disks and collects the data on each. You will then be able to see media wearout data in real time and set alerts so that you know when it’s time to replace a drive!
Last week I was traveling to and from our Austin office – which means a fair amount of time for reading. Amongst other books, I read “The Principles of Product Development Flow” by Donald G. Reinertsen. The most interesting chapters of this book (to me) were those on queueing theory.
Little’s law shows that the wait time for a process = queue size / processing rate. It’s a nicely intuitive formula with broad applicability.
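As a quick worked example of that formula, the sketch below computes the average wait from a queue size and a processing rate. The function name and the numbers are illustrative assumptions, and the relationship only holds for a system in steady state.

```python
def wait_time(queue_size, processing_rate):
    """Average wait = queue size / processing rate (Little's law, steady state)."""
    if processing_rate <= 0:
        raise ValueError("processing_rate must be positive")
    return queue_size / processing_rate


if __name__ == "__main__":
    # Assumed numbers: 20 requests queued, 5 requests processed per second.
    print(wait_time(20, 5), "seconds")
```

Doubling the queue at the same processing rate doubles the wait, which is why queue depth is such a useful metric to watch.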
So what does this have to do with monitoring?
Have you ever been the guy in charge of storage and the dev guy and database guy come over to your desk waaaaay too early in the morning before you’ve had your caffeine and start telling you that the storage is too slow and you need to do something about it? I have. In my opinion it’s even worse when the Virtualization guy comes over and makes similar accusations, but that’s another story.
Now that I work for LogicMonitor I see this all the time. People come to us because “the NetApps are slow”. All too often we find that the real problem is the ESX host itself, or a SQL server struggling with poorly designed queries. I experienced this firsthand before I worked for LogicMonitor, so it’s no surprise to me that this is a regular issue. When I ran into this problem myself, I found it was vital to monitor all the systems involved so I could really figure out where the bottleneck was.
When monitoring a NetApp, the thing that matters (for most applications) is the latency of requests on a volume (or LUN).
Easy enough to get – with LogicMonitor it’s graphed and alerted on automatically, for every volume. But of course when there is an issue, the focus changes to why there is latency. Usually the cause is that the disks in the aggregate are IO bound. Assuming there is no need for a reallocate (the disks are evenly loaded – I’ll write a separate article about how to determine that), how can you tell what level of disk busy-ness is acceptable? Visualizing that performance is what this post is about.
One way LogicMonitor is different from other NetApp monitoring systems (other than being hosted monitoring, and being able to monitor the complete array of systems found in a datacenter – from AC units, through virtualization and OS’s, to applications like MongoDB) is that we default to “monitoring on”.
That is, we assume you want to monitor everything, always. (You can of course turn off monitoring or alerting for specific groups, hosts or objects.) This serves us well almost always. We will detect a new volume on your NetApp once you create it, and start monitoring it for read and write latency, number and type of operations, etc. So when you have an issue on that volume (or other groups are blaming storage performance for an issue), the monitoring for all sorts of metrics was already in place before the issue occurred, and you have the data and alerts to know whether or not the storage was the problem.
However, we have found some cases where this doesn’t work so well. We have been monitoring NetApp disk performance by default, too, tracking the number of operations and the busy time for each disk. This is useful for identifying when disks need to be rebalanced (if the top 10 and bottom 10 disks by busy time are wildly different). But customers with larger NetApps often have hundreds of disks, each of which we would monitor via API queries. And while we only monitor the performance of a disk every 5 minutes (as opposed to volumes, PAM cards and other things that are monitored more frequently), this apparently overloads the API subsystem of NetApp devices.
We’d see that when we’d restart the collection process, and the only monitoring by the API was for the volume performance, things worked great – the response to an API request from a NetApp was around 100 ms.
When the disk requests started getting added in (and we stagger and skew the requests, so they are not all hitting at once), the API response time for a single query climbed to 40 seconds.
This started causing a backlog of monitoring, and was causing data to be missed in the more important volume performance metrics.
So… while we’ll open a case with NetApp, in the interim, we’ll probably disable the monitoring of physical disk performance by default to avoid this issue.
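The rebalancing check mentioned earlier (comparing the top 10 and bottom 10 disks by busy time) can be sketched roughly as follows. The function name, the disk names, and the sample percentages are assumptions for illustration; what counts as "wildly different" is a judgment call this sketch does not make for you.

```python
from statistics import mean


def busy_gap(busy_percent_by_disk, n=10):
    """Mean busy % of the n busiest disks minus the mean of the n least busy."""
    values = sorted(busy_percent_by_disk.values())
    n = min(n, len(values) // 2)  # avoid overlapping groups on small arrays
    if n == 0:
        return 0.0
    return mean(values[-n:]) - mean(values[:n])


if __name__ == "__main__":
    # Assumed data: ten lightly loaded disks and ten heavily loaded ones.
    sample = {f"cold{i}": 10.0 for i in range(10)}
    sample.update({f"hot{i}": 90.0 for i in range(10)})
    print(busy_gap(sample))
```

A large gap would suggest unevenly loaded disks and a possible need to reallocate, while a gap near zero suggests the load is spread evenly.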