In recent years, solid-state drives (SSDs) have become a standard part of data center architecture. They handle more simultaneous read/write operations than traditional disks and use a fraction of the power. Of course, as a leading monitoring platform vendor, we are very interested in monitoring our SSDs, not only because we want to make sure we’re getting what we paid for, but because we would also like to avoid a disk failure on a production machine at 3:00 AM…and the Shaquille O’Neal-sized headache to follow. But how do we know for sure if our SSDs are performing the way we want them to? As one of the newest members of our technical operations team, it came as no surprise that I was tasked with answering this question.
So what actually happens to my SSD?
Solid-state drives are different from traditional spinning-platter drives. There are no moving parts (the drive head cannot crash into the platter) and there is nothing to demagnetize, but that does not mean they are immune to failure. On the contrary, they absolutely will fail due to the same technology that makes them so fast: NAND-based flash memory (a type of storage that does not require power to retain data). When a solid-state drive deletes or writes files, the old data is marked invalid and the new data is written to a new location in the NAND. The old data is erased later, when the drive needs more space. Flash cells on an SSD can only be written to a limited number of times before they become unreliable. Simply put…it is like continuously writing on a piece of paper with a pencil and then erasing it. You can only write and erase so many times before the paper is worn out and unusable.
Sure, there are ways to monitor your disk. You can keep an eye on disk reads/writes and proactively watch for poor performance based on trends you see over time. At LogicMonitor, we already measure and alert on all the basics, such as IO completion time, read and write IOPS, request service time, and queue depth. But none of that gives us visibility into the hardware health of the SSD itself.
What if there was a way to see real time metrics on SSD wearout?
I found that SSD vendors now include S.M.A.R.T. (Self-Monitoring, Analysis, and Reporting Technology) counters, or attributes, that report on the current health of the drive. My next question was how to access this data. Thankfully, the gods who blessed us with “ctrl, alt, delete” also gave us smartmontools. The package consists of smartctl and smartd. Smartctl is the utility used to test the drives and read/report their S.M.A.R.T. statistics.
There are a few important counters to take into consideration when monitoring your disks, such as write amplification, reserved space, and temperature. We wanted to focus on disk life in particular. For example, the Media Wearout Indicator on Intel SSDs reports a normalized value of 100 when the SSD is new, which declines to a minimum value of 1.
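To make this concrete, here is a minimal sketch of extracting that normalized value from smartctl’s attribute table with awk. The sample line below is hypothetical output in the standard smartctl -A column layout (attribute ID, name, flags, current normalized value, worst, threshold, …); on a real system you would pipe the live output of smartctl -A for your device instead of the sample string.

```shell
# Hypothetical one-line sample in the smartctl -A attribute-table format.
# Column 4 is the current normalized value (100 = new, 1 = worn out).
sample='233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age   Always       -       0'

# Print the normalized value for the Media_Wearout_Indicator attribute.
echo "$sample" | awk '$2 == "Media_Wearout_Indicator" { print $4 }'
```

On this sample, the command prints 097 — a drive with roughly 97% of its rated write endurance remaining.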
Making a datasource!
We created a new Media Wearout Indicator datasource at LogicMonitor. It is currently being used by the operations team to monitor the SSDs in our production machines. To create the datasource, we utilized various smartctl commands. For example, the command smartctl -l devstat -i -A -d sat+megaraid,0 /dev/sda retrieves SMART health statistics from physical disk 0 behind the RAID controller.
We then had two options: use a single SNMP extend to execute a script that loops through all the available disks, or configure each command as a separate SNMP extend OID. We decided to go with the latter, because not only does having individual commands give us the option of grabbing more statistics per disk, it also takes away the burden of managing an external script on each machine.
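For reference, a per-disk SNMP extend along these lines might look like the following in snmpd.conf. The extend names and disk IDs here are illustrative assumptions, not our exact production configuration:

```shell
# /etc/snmp/snmpd.conf -- hypothetical per-disk extends, one command per disk.
# Each extend is exposed under its own OID in NET-SNMP-EXTEND-MIB, so no
# external wrapper script needs to be maintained on the host.
extend ssd-smart-0 /usr/sbin/smartctl -l devstat -i -A -d sat+megaraid,0 /dev/sda
extend ssd-smart-1 /usr/sbin/smartctl -l devstat -i -A -d sat+megaraid,1 /dev/sda
```

The output can then be polled remotely, for example with snmpwalk against NET-SNMP-EXTEND-MIB::nsExtendOutputFull, which is the kind of query a collector can issue without any code living on the monitored machine.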
The active discovery portion of the LogicMonitor datasource finds all available disks and collects the data on each. You will then be able to see media wearout data in real time and set alerts so that you know when it’s time to replace a disk!
Are you not a customer but need to monitor your Solid State Drives (SSDs) and other IT Infrastructure? Try LogicMonitor free.
Have you created datasources to monitor your infrastructure? Post in comments and let us know what you’ve built.
We got a question internally about why one of our demo servers was slow, and how to use LogicMonitor to help identify the issue. The person asking comes from a VoIP, networking, and Windows background, not Linux, so his questions reflect those of a less-experienced sysadmin (in this case). I thought it interesting that he documented his thought processes, and I’ll intersperse my interpretation of the same data, along with some thoughts on why LogicMonitor alerts as it does.
When designing infrastructure architecture, there is usually a choice between complexity and fault tolerance. It’s not just an inverse relationship, however. It’s a curve. You want the minimal complexity possible to achieve your availability goals. And you may even want to reduce your availability goals to reduce your complexity (which will end up increasing your availability).
The rule to adopt is: if you don’t understand something well enough that it seems simple to you (or your staff), even in its failure modes, you are better off without it.
Back in the day, clever people suggested that most web sites would have the best availability by running everything – DB, web application, everything – on a single server. This was the simplest configuration, and the easiest to understand.
With no complexity – one of everything (one switch, one load balancer, one web server, one database, for example) – you can tolerate zero failures, but it’s easy to know when there is a failure.
With 2 of everything, connected the right way, you can keep running with one failure, but you may not be aware of the failure.
So is it a good idea to add more connections, and plan to be able to tolerate multiple failures? Not usually. For example, with a redundant pair of load balancers, you can connect one load balancer to one switch, and the other load balancer to another switch. In the event of a load balancer failure, the surviving load balancer will automatically take over, and all is good. If a switch fails, it may be the one that the active load balancer is connected to – this would also trigger a load balancer failover, and everything is still running correctly. It would be possible to connect each load balancer to each switch, so that failure of a switch does not impact the load balancers, but is it worth it?
This would allow the site to survive two simultaneous unrelated failures – one switch and one load balancer – but the added complexity of engineering the multiple traffic paths increases the likelihood that something will go wrong in one of the four possible states. There are now four possible traffic paths instead of two – so more testing is needed, more maintenance is needed on any change, and so on. The benefit seems to be outweighed by the complexity.
The same concept of “if it seems complex, it doesn’t belong” can be applied to software, too. Load balancing, whether via an appliance such as a Citrix NetScaler or software such as HAProxy, is simple enough for most people nowadays. The same is not generally true of clustered file systems, or DRBD. If you truly need these technologies, you had better have a thorough understanding of them, invest the time to induce all the failure modes you can, and train your staff so that dealing with any of those failures is not complex for them.
If you have a consultant come in and set up BGP routing, but no one on your NOC or on call staff knows how to do anything with BGP, you just greatly reduced your site’s operational availability.
The “Complexity Filter” can be applied to monitoring systems, as well. If your monitoring system stops, and you don’t have immediate staff available to troubleshoot the restart of the service processes; or the majority of your staff cannot easily interpret the monitoring, or create new checks, or use it to see trends over time – your monitoring is not contributing to your operational uptime. It is instead a resource sink, and is likely to bite you when you least expect it. Datacenter monitoring, like all things in your datacenter, should be as automated and simple as possible.
If it seems complex – it will break. Learn it until it’s not complex, or do without it.