Sample SAT question: xUnit is to Continuous Integration as what is to automated server deployments?
We’ve been going through lots of growth here at LogicMonitor. Part of growth means firing up new servers to deal with more customers, but we also have been adding a variety of new services: proxies that allow our customers to route around Internet issues that BGP doesn’t catch; servers that test performance and reachability of customers sites from various locations, and so on. All of which means spinning up new servers: sometimes lots of times, in QA, staging and development environments.
As old hands in running datacenter operations, we have long adhered to the tenet of not trusting people – including ourselves. People make mistakes, and can’t remember things they did to make things work. So all our servers and applications are deployed by automated tools. We happen to use Puppet, but collectively we’ve worked with cfengine, chef, and even Rightscripts.
So, for us to bring up a new server – no problem. It’s scripted, repeatable, takes no time. But how about splitting the functions of what was one server into several? And how do we know that the servers being deployed are set up correctly, if there are changes and updates? Read more »
You released new code with all sorts of new features and improvements. Yay!
Now, after the obvious things like “Does it actually work in production”, this is also the time to assess: did it impact my infrastructure performance (and thus my scalability, and thus my scaling costs) in any way.
This is yet another area where good monitoring and trending is essential.
As an example, we did a release last night on a small set of servers.
Did that help or hurt our scalability?
CPU load dropped for the same workload (we have other graphs showing which particular Java application this improvement was attributable to, but this shows the overall system CPU):
There was an improvement on a variety of MySQL performance metrics, such as the Table open rate (table opens are fairly intensive.)
But…not everything was improved:
While the overall disk performance and utilization is the same, the workload is much more spiky. (For those of you wondering how we get up to 2000 write operations per second – SSDs rock.)
And of course, the peak workloads are what constrain the server usage – with this change in workload, a server that was running at a steady 60% utilization may find itself spiking to 100% – leading to queuing in other parts of the system, and general Bad Things.
As it is, we saw this change in the workload and we can clearly attribute it to the code release. So now we can fix it before it is applied to more heavily loaded servers where it may have had an operational impact.
This keeps our Ops team happy, our customers happy, and, as it means we dont have to spend more money on hardware for the same level of scale, it keeps our business people happy.
Just another illustration of how comprehensive monitoring can help your business in ways you may not have predicted.
Performance monitoring for all your infrastructure & applications. In minutes, not hours.
Questions? Call Us!
(888) 415-6442 or +1 (805)-617-3884