Here at LogicMonitor we love our happy hours, and since we will be speaking at the upcoming AnsibleWorks Fest (or AnsibleFest), we could think of no better way to cap it off than over a drink.
Our very own Jeff Behl, Chief Network Architect, will be speaking at 5:05pm about the importance of measuring and monitoring the whole IT stack in a DevOps world. And afterwards… we’d love the chance to meet you over drinks at Dillons (around the corner from the event).
Look for us at the Dillons downstairs bar between 6 and 8pm.
One thing we say frequently is that you need to monitor all sorts of metrics when you do software releases, so you can tell if things degrade and head off performance issues. You need to monitor not just the basics of the server (disk IO, memory, CPU, network) and the function of the server (response time serving web pages, database queries, etc.), but also the in-between metrics (cache hit rates, etc.).
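As a minimal sketch of what tracking an "in-between" metric across releases looks like (the counter values here are made up for illustration), a cache hit rate is just two raw counters turned into a ratio, sampled before and after a deploy:

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Return the cache hit rate as a percentage (0 if there was no traffic)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0

# Hypothetical samples taken before and after a release:
before = cache_hit_rate(hits=960_000, misses=40_000)
after = cache_hit_rate(hits=880_000, misses=120_000)

# A drop of several points right after a deploy is exactly the kind of
# degradation worth catching before users feel it.
degraded = (before - after) > 5.0
print(f"before={before:.1f}% after={after:.1f}% degraded={degraded}")
```

The same delta logic applies to any derived metric: collect the raw counters, compute the ratio, and compare it across release boundaries.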
Monitoring also provides visibility into when things improve, not just when they get worse. For example, in a recent release we changed the way we store some data in an internal database, reducing the number of records in some tables by thousands. As you can see, this dropped the number of times InnoDB had to hit the file system for data quite a bit:
Now, if we were running on SAS disks instead of SSDs, we would have just regained a sizable percentage of the drive arrays’ maximum IO rate with one software release. (Purists will note that what is graphed is InnoDB disk requests, not OS-level disk requests; some of these requests will likely be satisfied from memory, not disk.)
If I were a developer on the team that effectively allowed the same servers to scale to support twice as much load with a software release… I’d want people to know that.
by Cisco Arias
In the modern world of consumerism, there are so many choices, noises and deals, it’s sometimes hard to calculate the real value of the products and services we purchase. At LogicMonitor, we try to make it obvious.
I don’t like to call our solution a “product” because for some reason it makes me think of, “Sorry pal, you purchased this as is. I can’t help you.” That’s not how we roll.
We have a joke here that our product now comes with free Cisco (not the trademarked kind). As an Account Manager, here’s the way I look at it: the moment a new client signs on, we become partners in improving each other’s businesses and helping each other grow.
A service, or solution, like ours relies on synergy and strong relationships with our clients if we are to provide the most effective and intelligent SaaS monitoring solution out there. One of my personal goals here is to build these relationships in order to understand the needs of the end user AND the organization… to keep the client’s business flowing and improving.
I see monitoring as one component of the daily workflow of the average IT admin, Sys Admin, or Network Engineer. Monitoring should provide more than just alerting: it should show you how to proactively prevent fires before they start. And beyond that, it should provide intelligence that improves operations and business metrics that are meaningful to those upstairs.
And I’m not happy until our clients actually realize all these benefits. We’re not going to sell you a “product” where it’s up to you to figure it out and get it working. We’re going to provide you with the resources needed to maximize the value of our solution.
From day one, we have a team of Engineers and Developers working with you to ensure your monitoring is made easy and effective. You also have people like me who will check in frequently to see if there is anything we can do to help.
Whether it’s letting you know your portal is not up to date, going over new features, or getting new team members up to speed, you can rest assured that you have a go-to person for anything LogicMonitor related.
So be sure to consider all of this the next time you are calculating what real value means to you and your organization… or the next time you need help on a “product” you purchased and are looking and waiting, looking and waiting…
now free, with LogicMonitor
We received some alerts tonight that one Tomcat server was using about 95% of its configured thread maximum.
The Tomcat process on http-443 on prod4 now has 96.2 % of the max configured threads in the busy state.
These were SMS alerts, as that was close enough to exhausting the available threads to warrant waking someone up if needed.
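The escalation decision behind that alert can be sketched roughly like this (the threshold values and function names are illustrative assumptions, not LogicMonitor's actual implementation): compute the percentage of busy Tomcat threads from the busy-thread and max-thread counts, then pick a delivery channel by severity.

```python
def thread_busy_percent(busy: int, max_threads: int) -> float:
    """Percentage of the configured thread maximum currently busy."""
    return 100.0 * busy / max_threads

def alert_channel(busy_pct: float, warn: float = 75.0, critical: float = 90.0) -> str:
    """Email for a warning; SMS when thread exhaustion is imminent.
    Thresholds are hypothetical examples."""
    if busy_pct >= critical:
        return "sms"      # close enough to exhaustion to wake someone up
    if busy_pct >= warn:
        return "email"
    return "none"

pct = thread_busy_percent(busy=77, max_threads=80)  # roughly the 96.2% case above
print(f"{pct:.1f}% busy -> {alert_channel(pct)}")
```

The point of the two-tier threshold is exactly what the alert text shows: a thread pool at 96% of its maximum is close enough to exhaustion that it crosses from "look at it tomorrow" into "wake someone up."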
The other alert we got was that Tomcat was taking an unusually long time to process requests, as seen in this graph:
We here at LogicMonitor use our own service to monitor the various parts of our infrastructure, and doing so demonstrates the financial value that LogicMonitor brings.
The more you instrument with LogicMonitor, the more power it has. In the cases below, the information and alerts that LogicMonitor presented allowed us to avoid spending money on more hardware – and with LogicMonitor’s availability requirements, each hardware purchase usually means 3x the hardware (active/passive at the datacenter in question, plus failover hardware in a different datacenter).
One case was relatively straightforward – a review of the MySQL performance monitoring metrics revealed that the number of rows read due to read_rnd_next operations was very high – in the tens of thousands per second. (For those of you who are not DBAs, this is the number of rows MySQL reads sequentially in order to satisfy a read request – an indicator that indexes are not being used.) A quick bit of investigation by our programmers revealed a query written in such a way that MySQL was not using the existing indexes. The query was rewritten, and upon release, MySQL table scans dropped dramatically:
This reduced the system’s CPU load and disk load, and improved response time for users.
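For those who want to reproduce this kind of check, MySQL exposes Handler_read_rnd_next as a monotonically increasing total in SHOW GLOBAL STATUS, so the per-second rate comes from the delta between two samples. The counter values below are made up for illustration:

```python
def rate_per_second(prev_count: int, curr_count: int, interval_s: float) -> float:
    """Per-second delta between two samples of a monotonic counter."""
    return (curr_count - prev_count) / interval_s

# Two hypothetical samples of Handler_read_rnd_next taken 60 seconds apart:
prev, curr = 1_200_000_000, 1_203_000_000
rate = rate_per_second(prev, curr, 60)

# Tens of thousands of sequential row reads per second on a workload that
# should be hitting indexes is a strong hint that a query is table-scanning.
print(f"{rate:,.0f} rows/sec via read_rnd_next")
```

This delta-and-divide pattern is how most monitoring systems (LogicMonitor included) turn raw cumulative counters into the rate graphs shown here.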
However, a more dramatic demonstration came a week or so later, when one cluster started getting disk bound. An increase in customers, combined with some newly released features that added extra load, meant that one cluster was reaching the capacity of its hardware (or so I thought). Average response time was hitting what we regarded as its limits, and my thought was that we’d have to throw hardware (meaning money) at the issue.
However, using custom application metrics that the LogicMonitor system exposes (in our case via JMX monitoring, as our system is written in Java, but the data could have been collected by any of a variety of mechanisms, from perfmon counters, to web page content, to log files), it was apparent that the load was solely due to one particular processing queue. Our CTO investigated the caching algorithm that is applied to the data in this queue, and was able to tune it so that it was much more effective, as can be seen from the graphs below:
This dropped the CPU load of the cluster:
And also improved the servers’ response time:
So while LogicMonitor did not directly solve the problem, the extensive application monitoring warned us that an issue was arising, pinpointed where in our system the bottleneck was, and allowed our staff to focus their investigation on one particular queue rather than on all components of the system. It also allowed us to see the effectiveness of the changes on our staging systems before we released to production.
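That pinpointing is easy once you have per-queue metrics rather than a single system-wide number. A hedged sketch, with made-up queue names and per-interval processing times standing in for the real JMX data:

```python
# Hypothetical per-queue processing time (seconds of work per interval),
# of the kind exposed via JMX custom metrics. All numbers are illustrative.
queue_work_seconds = {
    "alert-eval": 12.4,
    "graph-render": 8.1,
    "raw-data-ingest": 291.7,  # the outlier
    "report-gen": 5.3,
}

# With per-queue data, finding the bottleneck is a one-liner instead of a
# guessing game across every component of the system.
bottleneck = max(queue_work_seconds, key=queue_work_seconds.get)
share = queue_work_seconds[bottleneck] / sum(queue_work_seconds.values())
print(f"{bottleneck} accounts for {share:.0%} of queue processing time")
```

One skewed queue dominating the total is exactly the signature described above, and it tells engineers where to spend their time before anyone reaches for a hardware budget.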
LogicMonitor’s application monitoring saved us many thousands of dollars and many hours of engineering time. Both are in limited supply at any company.