You released new code with all sorts of new features and improvements. Yay!
Now, after the obvious questions like “Does it actually work in production?”, this is also the time to assess: did it impact my infrastructure performance (and thus my scalability, and thus my scaling costs) in any way?
This is yet another area where good monitoring and trending are essential.
As an example, we did a release last night on a small set of servers.
Did that help or hurt our scalability?
CPU load dropped for the same workload (we have other graphs showing which particular Java application this improvement was attributable to, but this shows the overall system CPU):
There was an improvement on a variety of MySQL performance metrics, such as the table open rate (table opens are fairly intensive).
But…not everything was improved:
While the overall disk performance and utilization is the same, the workload is much more spiky. (For those of you wondering how we get up to 2000 write operations per second – SSDs rock.)
And of course, the peak workloads are what constrain the server usage – with this change in workload, a server that was running at a steady 60% utilization may find itself spiking to 100% – leading to queuing in other parts of the system, and general Bad Things.
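To make the average-versus-peak point concrete, here is a minimal sketch. The sample values are invented for illustration (they are not the numbers from our graphs), and `summarize` and `headroom` are hypothetical helper names.

```python
from statistics import mean


def summarize(samples):
    """Return (average, peak) for a list of percent-busy samples."""
    return mean(samples), max(samples)


def headroom(samples, limit=100):
    """Capacity left at the worst moment, in percentage points."""
    return limit - max(samples)


# Hypothetical percent-busy samples, one per minute. Invented data.
steady = [60, 60, 60, 60, 60, 60]
spiky = [20, 100, 20, 100, 20, 100]

steady_avg, steady_peak = summarize(steady)
spiky_avg, spiky_peak = summarize(spiky)

# Both workloads average the same utilization, but only the steady one
# has any capacity left at its peak.
print(f"steady: avg={steady_avg} peak={steady_peak} headroom={headroom(steady)}")
print(f"spiky:  avg={spiky_avg} peak={spiky_peak} headroom={headroom(spiky)}")
```

If you only trend the average, these two servers look identical. Trending the peak (or a high percentile) is what shows that the second one is out of room.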
As it is, we saw this change in the workload and we can clearly attribute it to the code release. So now we can fix it before it is applied to more heavily loaded servers where it may have had an operational impact.
This keeps our Ops team happy, our customers happy, and, as it means we don’t have to spend more money on hardware for the same level of scale, it keeps our business people happy.
Just another illustration of how comprehensive monitoring can help your business in ways you may not have predicted.
Late last night our ops team (of which I am a member) got paged about the CPU load on a Cisco 3560 switch in a new datacenter. My initial reaction was “We don’t need this alert escalated to pagers or phones: 3560s switch and route in hardware, so CPU load doesn’t matter.” Once I’d woken up a bit more, the corollary (that there is no possible way this switch should be at a CPU level high enough to trigger an error alert) occurred to me.
I’ve talked about this before, but I just read an article about why application performance monitoring is so screwed up, and coincidentally had just talked about it in a lecture I gave to a graduate class at UCSB on scalable computing, so I figured it’s worth a mention.
The article mentions that “enterprises have confused (with vendor help) the notion of monitoring the resources that an application uses with its performance”. The way I put it in my lecture was that:
So… how to tie one to the other?
Monitor what users care about (page load times, response time per request, etc.)
Also monitor all the limiting resources (CPU, network, memory, and disk I/O – or, more importantly, what percentage of the time a drive is busy):
And monitor the performance of the systems that affect the limiting resources:
So while monitoring InnoDB file system reads does not tell you anything that an end user cares about, if your monitoring of Tomcat request time shows that users are experiencing poor performance, and your logical drives are suddenly 100% busy with request service times increasing, it’s good to know why that is. It may be because of InnoDB buffer misses, or it may be because of something else – but having this intermediate data will drastically reduce your time to correct the issue that users care about: response time.
Another point to note: the “user” in the phrase “monitor what users care about” may not be a human. If a server is a memcached server, its users are the web servers, which care about memcached response time, availability and hit rates. So on this class of machines, those are the things to monitor to determine if the service is meeting the needs of its users.
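As a rough sketch of what a hit-rate check could look like: memcached’s `stats` output includes cumulative `get_hits` and `get_misses` counters, so the hit rate over an interval is the change in hits divided by the change in total gets. The `hit_rate` helper and the sample numbers below are assumptions for illustration; how you collect the two snapshots is up to your monitoring.

```python
def hit_rate(prev, curr):
    """Hit rate between two snapshots of memcached's cumulative counters.

    prev and curr are dicts holding the get_hits and get_misses values
    from two successive `stats` polls. Returns None when there were no
    gets in the interval (or the counters reset), since no rate can be
    computed.
    """
    hits = curr["get_hits"] - prev["get_hits"]
    misses = curr["get_misses"] - prev["get_misses"]
    total = hits + misses
    if hits < 0 or misses < 0 or total == 0:
        return None
    return hits / total


# Invented snapshots, one polling interval apart.
earlier = {"get_hits": 1000, "get_misses": 100}
later = {"get_hits": 1900, "get_misses": 200}

rate = hit_rate(earlier, later)
print(f"hit rate over interval: {rate:.0%}")
```

Working from deltas rather than the raw counters matters here: the lifetime ratio barely moves on a long-running server, so a sudden drop in hit rate would be hidden.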
In short, for every machine, identify the “thing(s) to care about” for it; monitor those things; monitor the constrained resources; and monitor all aspects of the systems on that server that impact the constrained resources.
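One way to picture those three tiers is as a small per-role plan. This is only a sketch: `MonitoringPlan`, `triage`, and every metric name below are hypothetical, not the API or naming of any particular monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringPlan:
    """Three tiers of metrics for one class of machine."""
    role: str
    user_facing: list = field(default_factory=list)            # what the "users" care about
    constrained_resources: list = field(default_factory=list)  # CPU, disk busy %, and so on
    contributing: list = field(default_factory=list)           # what drives those resources

    def triage(self, breaching):
        """Group the currently breaching metric names by tier."""
        breaching = set(breaching)
        return {
            "user_facing": [m for m in self.user_facing if m in breaching],
            "constrained_resources": [m for m in self.constrained_resources if m in breaching],
            "contributing": [m for m in self.contributing if m in breaching],
        }


# Hypothetical metric names for a host running Tomcat and MySQL.
plan = MonitoringPlan(
    role="tomcat-mysql",
    user_facing=["tomcat_request_time"],
    constrained_resources=["cpu_busy_pct", "disk_busy_pct", "network_util_pct", "memory_used_pct"],
    contributing=["innodb_file_reads", "innodb_buffer_pool_misses", "table_open_rate"],
)

report = plan.triage(["tomcat_request_time", "disk_busy_pct", "innodb_buffer_pool_misses"])
for tier, metrics in report.items():
    print(tier, metrics)
```

Reading the report top to bottom follows the same path as the troubleshooting: users see slow requests, the drives are busy, and the InnoDB buffer misses are a likely reason why.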