A guest post by one of our intrepid Support Engineers in the UK, Antony Hawkins.
“Catching the little problems you never knew you had (before they cause big problems you never want to deal with).”
So, you’ve configured and tested an NTP hierarchy through your estate and now all your devices run to the same time. You can leave it alone now, safe in the knowledge it’s working.
Can’t you? Read more »
Looking for some Friday fun, we decided to run a quick analysis of the number of metrics we monitor (geeky!)
We found that we are monitoring 1 billion metrics a day from 13 million streams of data! That’s about 11,600 metrics per second that we’re analyzing, sorting and presenting to our customers. Data galore.
Can you imagine the amount of data we are analyzing while you’re reading this post and we’re getting started with our Friday happy hour? We got pretty excited about that.
All those servers, websites, networks, databases, virtual machines and applications just love to tell us about their performance… and we love to listen to all of them.
What’s even more exciting is to see our historic metric growth and realize that 1 billion metrics a day is nothing compared to the amount of data that our customers will have at their fingertips in the future. Inspiring!
Eleanor Roosevelt is reputed to have said “Learn from the mistakes of others. You can’t live long enough to make them all yourself.” In that spirit, we’re sharing a mistake we made so that you may learn.
This last weekend we had a service-impacting issue for about 90 minutes that affected a subset of customers on the East Coast. This despite the fact that, as you may imagine, we have very thorough monitoring of our servers; error level alerts (which are routed to people’s pagers) were triggered repeatedly during the issue; we have multiple stages of escalation for error alerts; and we ensure we always have on-call staff responsible for reacting to alerts, who are always reachable.
All these conditions were true this weekend, and yet we still had an issue whereby no person was alerted for over an hour after the first alerts were triggered. How was this possible? Read more »
Have you ever been the guy in charge of storage and the dev guy and database guy come over to your desk waaaaay too early in the morning before you’ve had your caffeine and start telling you that the storage is too slow and you need to do something about it? I have. In my opinion it’s even worse when the Virtualization guy comes over and makes similar accusations, but that’s another story.
Now that I work for LogicMonitor I see this all the time. People come to us because “the NetApps are slow”. All too often we come to find that it’s actually the ESX host itself, or the SQL server having problems because of poorly designed queries. I experienced this firsthand before I worked for LogicMonitor, so it’s no surprise to me that this is a regular issue. When I experienced this problem myself I found it was vital to monitor all systems involved so I could really figure out where the bottleneck was.
Developers are sometimes too helpful when they instrument their systems. For example, when asked to add a metric that will report the response time of a request – there are several ways that it can be done. One way that seems to make sense is to just keep a variable with the total number of requests, and another with the total processing time. Then the developer just creates a variable showing total processing time divided by total requests, and a way to expose it (an MBean to report it via JMX, or a status page via HTTP, etc). This will be a nice neat object that reports the response time in milliseconds, all pre-calculated for the user.
The problem with this? It is indeed going to report the average response time – but it’s going to be the average of all response times since the server started. So… if the server has been running with an average response time of 1 ms, and it’s been up for 1000 hours, then it starts exhibiting a response time of 100 ms per request – after an hour of this slow behavior, the pre-calculated average response time will be only about 1.1 milliseconds (assuming a constant rate of requests: 1000 hours at 1 ms plus 1 hour at 100 ms, averaged over 1001 hours). Not even enough of a change to be discernible with the eye on a graph, Read more »
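A minimal Python sketch of the difference between the two approaches, under stated assumptions: the server exposes two ever-growing counters (total processing time and total requests), and the monitoring system samples them periodically. The function names and the rate of one request per second are hypothetical, chosen only for illustration.

```python
def lifetime_average(total_time_ms, total_requests):
    """Average since server start: what the pre-calculated metric reports."""
    return total_time_ms / total_requests if total_requests else 0.0


def interval_average(prev, curr):
    """Average over one polling interval.

    prev and curr are (total_time_ms, total_requests) counter samples.
    """
    d_time = curr[0] - prev[0]
    d_requests = curr[1] - prev[1]
    return d_time / d_requests if d_requests > 0 else 0.0


REQUESTS_PER_HOUR = 3600  # assumed constant rate: one request per second

# Counter values after 1000 hours at 1 ms per request.
before = (1000 * REQUESTS_PER_HOUR * 1.0, 1000 * REQUESTS_PER_HOUR)
# Counter values after one more hour at 100 ms per request.
after = (before[0] + REQUESTS_PER_HOUR * 100.0, before[1] + REQUESTS_PER_HOUR)

print(f"lifetime average:  {lifetime_average(*after):.2f} ms")
print(f"last-hour average: {interval_average(before, after):.2f} ms")
```

The lifetime figure barely moves, while the per-interval figure, computed from the difference between two counter samples, shows the full slowdown. Exposing the raw counters and leaving the division to the monitoring system is what makes the second calculation possible.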
One thing we frequently say is that you need to be monitoring all sorts of metrics when you do software releases, so you can tell if things degrade, and thus head off performance issues. You need to monitor not just the basics of the server (disk IO, memory, CPU, network), and the function of the server (response time serving web pages, database queries, etc), but also the in-between metrics (cache hit rates, etc).
This also provides visibility into when things improve, as well as get worse. For example, in a recent release, we changed the way we store some data in an internal database, and reduced the number of records in some tables by thousands. As you can see, this dropped the number of times InnoDB had to hit the file system for data quite a bit:
Now, if we were running on SAS disks, instead of SSDs, we would have just regained a sizable percentage of the maximum IO rate of the drive arrays, with one software release. (Purists will note that the drop in what is graphed is InnoDB disk requests – not OS level disk requests. Some of these requests will likely be satisfied from memory, not disk.)
If I were a developer, and were on the team that effectively allowed the same servers to scale to support twice as much load with a software release… I’d want people to know that.
You released new code with all sorts of new features and improvements. Yay!
Now, after the obvious things like “Does it actually work in production?”, this is also the time to assess: did it impact my infrastructure performance (and thus my scalability, and thus my scaling costs) in any way?
This is yet another area where good monitoring and trending is essential.
As an example, we did a release last night on a small set of servers.
Did that help or hurt our scalability?
CPU load dropped for the same workload (we have other graphs showing which particular Java application this improvement was attributable to, but this shows the overall system CPU):
There was an improvement on a variety of MySQL performance metrics, such as the table open rate (table opens are fairly intensive).
But…not everything was improved:
While the overall disk performance and utilization is the same, the workload is much more spiky. (For those of you wondering how we get up to 2000 write operations per second – SSDs rock.)
And of course, the peak workloads are what constrain the server usage – with this change in workload, a server that was running at a steady 60% utilization may find itself spiking to 100% – leading to queuing in other parts of the system, and general Bad Things.
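To make the point about peaks concrete, here is a small Python sketch with made-up utilization samples (hypothetical numbers, not data from the release described above). Both series have the same average, yet only one of them ever saturates the server.

```python
from statistics import mean

# Hypothetical percent-utilization samples, one per polling interval.
steady = [60] * 10
spiky = [20, 100, 20, 100, 20, 100, 20, 100, 20, 100]


def summarize(samples):
    """Return the average and the peak of a series of utilization samples."""
    return {"mean": mean(samples), "peak": max(samples)}


for name, samples in (("steady", steady), ("spiky", spiky)):
    stats = summarize(samples)
    print(f"{name}: mean={stats['mean']}% peak={stats['peak']}%")
```

A graph or alert based only on the average would treat these two workloads as identical, which is why the peak (or a high percentile) is worth trending alongside it.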
As it is, we saw this change in the workload and we can clearly attribute it to the code release. So now we can fix it before it is applied to more heavily loaded servers where it may have had an operational impact.
This keeps our Ops team happy, our customers happy, and, as it means we don’t have to spend more money on hardware for the same level of scale, it keeps our business people happy.
Just another illustration of how comprehensive monitoring can help your business in ways you may not have predicted.
We received some alerts tonight that one Tomcat server was using about 95% of its configured thread maximum.
The Tomcat process on http-443 on prod4 now has 96.2 % of the max configured threads in the busy state.
These were SMS alerts, as that was close enough to exhausting the available threads to warrant waking someone up if needed.
The other alert we got was that Tomcat was taking an unusual time to process requests, as seen in this graph: Read more »
I’ve talked about this before, but I just read an article about why application performance monitoring is so screwed up, and coincidentally had just talked about it in a lecture I gave to a graduate class at UCSB on scalable computing, so figured it’s worth a mention.
The article mentions that “enterprises have confused (with vendor help) the notion of monitoring the resources that an application uses with its performance”. The way I put it in my lecture was that:
So… how to tie one to the other?
Monitor what users care about (page load times, response per request, etc)
Also monitor all the limiting resources (CPU, Disk IO – or more importantly what percentage of the time a drive is busy, network, memory):
And monitor the performance of the systems that affect the limiting resources:
So while monitoring InnoDB file system reads does not tell you anything that an end user cares about, if your monitoring of Tomcat request time shows that users are experiencing poor performance, and your logical drives are suddenly 100% busy and request service time is increasing, it’s good to know why that is. It may be because of InnoDB buffer misses, or it may be because of something else – but having this intermediate data will drastically reduce your time to correct the issue that users care about – response time.
Another point to note: the “user” in the phrase “monitor what users care about” may not be a human. If a server is a memcached server – the users for this server are web servers, who care about memcached response time, availability and hit rates. So on this class of machines, that is the thing to monitor to determine if the service is meeting the needs of users.
In short, for every machine, identify the “thing(s) to care about” for it; monitor those things; monitor the constrained resources; and monitor all aspects of the systems on that server that impact the constrained resources.
A more technical article today.
In adding some more Exchange monitoring we ran into some issues, and solutions, that may help others. Some things in recent Exchange versions can only be monitored by PowerShell. (Perfmon, WMI, PowerShell, all needed for different versions of Exchange… I wish they’d make up their mind…)
So the first issue was that PowerShell scripts, when called from a LogicMonitor agent, never returned. This wasn’t too hard to fix: simply pass the parameter -inputformat with the (undocumented) option “none”, and the agent can successfully run PowerShell commands:
powershell -inputformat none dbstatus.ps1
(Why? The Microsoft.PowerShell.ConsoleHost class constructs a M.PS.WrappedDeserializer passing the STDIN TextReader as one of the parameters. By default, the WrappedDeserializer will call ReadLine() on this STDIN TextReader and wait indefinitely, effectively hanging PowerShell and the calling process. That’s why.)
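For anyone launching PowerShell from their own tooling, the same idea can be sketched in Python. This is an illustration only: the LogicMonitor agent is a Java application, and the helper names here are hypothetical. The `stdin=subprocess.DEVNULL` setting and the timeout are extra safeguards assumed for this sketch, not something described above.

```python
import subprocess


def build_powershell_command(script_path):
    """Build the argument list from the post: -inputformat none, then the script."""
    return ["powershell", "-inputformat", "none", script_path]


def run_powershell(script_path, timeout_seconds=60):
    """Run a script without giving PowerShell a stdin it could block on.

    The timeout means a hang surfaces as subprocess.TimeoutExpired
    instead of stalling the caller indefinitely.
    """
    return subprocess.run(
        build_powershell_command(script_path),
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )


if __name__ == "__main__":
    # Only prints the command line; run_powershell() needs a Windows host.
    print(" ".join(build_powershell_command("dbstatus.ps1")))
```

Whatever the calling language, bounding the child process with a timeout is a useful backstop, since a child that blocks on input otherwise hangs the caller too.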
So past that hurdle, but the next one:
>> powershell -inputformat none dbstatus.ps1
Add-PSSnapin : No snap-ins have been registered for Windows PowerShell version 2.
Yet running the exact same command from the command shell on the host running the agent produced the output we were expecting. And we could see that the Exchange snap-in, called by the PowerShell script, was correctly registered, and in fact worked fine.
But our agent was running on a 32-bit JVM, and Exchange 2010 (in our lab, at least) is installed on 64-bit Windows. The PowerShell snap-in was only visible when PowerShell was started from a 64-bit app. When I started PowerShell from the cmd.exe in SysWOW64, I got the same error about missing snap-ins as our agent reported.
The solution: it doesn’t matter that our agent was installed as a 32-bit app, in Program Files (x86). What mattered was that the Java virtual machine launched by the agent, which ultimately launched PowerShell, be a 64-bit JVM, not the default 32-bit JVM installed from Java.com. (At least, a 32-bit JVM is the default when you browse to Java.com with a 32-bit browser.)
So, running the LogicMonitor agent with a 64-bit JVM, and PowerShell started with “-inputformat none”, gives us full access to PowerShell output and all its snap-ins, so expect some datasources released very shortly to take advantage of that.
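When debugging this kind of 32-bit versus 64-bit mismatch, it helps to confirm which kind of process is doing the launching. The following Python sketch is a stand-in for the equivalent check inside a JVM, and its function names are hypothetical. It relies on the fact that Windows sets the PROCESSOR_ARCHITEW6432 environment variable only for 32-bit processes running on 64-bit Windows.

```python
import os
import struct


def pointer_bits():
    """Bitness of the current process, from the size of a native pointer."""
    return struct.calcsize("P") * 8


def is_wow64(env):
    """True if env looks like a 32-bit process on 64-bit Windows.

    env is any mapping of environment variables, such as os.environ.
    """
    return "PROCESSOR_ARCHITEW6432" in env


if __name__ == "__main__":
    print(f"process is {pointer_bits()}-bit")
    print(f"32-bit process on 64-bit Windows: {is_wow64(os.environ)}")
```

A child process started from a 32-bit parent is normally 32-bit as well, which matches the behavior seen here when PowerShell was launched from a 32-bit JVM.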