Sometimes, the hardest things are not technical. They’re a result of the politics of working in large organizations, with different groups with different responsibilities. Sometimes this fragmentation of responsibilities allows 82 year old grandmothers to walk up to the walls of supposedly ultra-secure nuclear weapons facilities. And the lessons we learn from this case can apply just as much to your I.T. monitoring.
In a prior blog post, I talked about what virtual memory is, the difference between swapping and paging, and why it matters. (TL;DR: swapping is moving an entire process out to disk; paging is moving just specific pages out to disk, not an entire process. Running programs that require more memory than the system has will mean pages (or processes) are moved to/from disk and memory in order to get enough physical memory to run – and system performance will suck.)
Now I’ll talk about how to monitor virtual memory, on Linux (where it’s easy) and, next time, on Solaris (where most people and systems do it incorrectly.) Read more »
We’ve launched a new program here at LogicMonitor to help you get insight from us and from your compatriots at different corporations working in different positions solving complexities and issues with LogicMonitor. Here at LogicMonitor, we are referring to this fledgling program as the Monitoring Roundtable. We are looking to have one of these every month with invitations extended by your account managers. Of course, you are welcome to be proactive and reach out to us or to your account manager directly for an invitation. Read more »
[Kevin McGibben (CEO), Steve Francis (Founder and Chief Product Officer) and Jeff Behl (Chief Network Architect) contributed to this post.]
This week LM’s Chief Network Architect “Real Deal Jeff Behl” was featured on the DABCC podcast with Doug Brown. The interview journey covered lots of ground and sparked our interest about IT industry predictions for 2014. There are so many exciting things happening in IT Ops these days it’s hard to name just a few.
Before it’s too late, here’s our turn at early year prognosticating.
1) 2014 is (at long last) the year for public Cloud testing. The definition of what “Cloud” is depends on whom you ask. To our SaaS Ops veterans, it means a group of machines running off premise for which someone else is responsible for managing. Given Cloud can mean lots of things — from public Cloud infrastructure (Amazon), Cloud services (Dyn or SumoLogic) to Cloud apps (Google Apps) to SaaS platforms (SalesForce and LogicMonitor!). The shared definition among all things Cloud is simple: it’s off premise (i.e., outside your data center or co-lo) hosted infrastructure, applications or services. For most enterprises currently , Cloud usually represents a public data center, offering from the very generic VM compute resources to specific services such as high performance NoSQL databases and Hadoop clusters. Enterprises are starting to gear up to test how the public Cloud fits in its data center strategy. In the past month alone, several of our Fortune 1000 clients confirmed they’ve set aside 2014 budget and IT team resources to test public cloud deployments.
This weekend I was catching up on some New Yorker issues, when an article by one of my favorite New Yorker authors, Atul Gawande, struck me as illuminating so much about tech companies and DevOps. (This is an example of ideas coming from diverse, unrelated sources – something part of the culture of LogicMonitor. Just yesterday, in fact, our Chief Network Architect had a great idea to improve security and accountability when our support engineers are asked to log in to a customer’s account – and this idea occurred to him while he and I were charging down the Jesusita trail on mountain bikes.)
The article, Atul Gawande: How Do Good Ideas Spread? : The New Yorker, is an exploration about why some good ideas (such as anesthesia) were readily adopted, while other just as worthy ideas (antisepsis – keeping germs away from medical procedures) did not. So how does this relate to DevOps and technology companies? Read more »
LogicMonitor is happy to participate in the upcoming DevOps Days meet up in Santa Clara.
It is exciting to see how the agile software development movement is bringing development and operations together to promote efficiency. With DevOps constant changes and roles cross-over, it only makes sense you use a monitoring tool that is just as flexible as you are.
LogicMonitor is a SaaS based IT performance monitoring solution that lets you monitor your entire IT infrastructure with just a few clicks of a button. Built by exceptionally smart operations guys, our product instantly discovers system performance metrics and tells you in seconds if they are within acceptable standards. We already researched these standards for you, so you can focus on what you do best.
And since we monitor anything with an IP address, you get visibility to your full IT stack in one place. Talk about Infrastructure as Code!
Come say hi to learn more, we would love to see you!
Have you ever been the guy in charge of storage and the dev guy and database guy come over to your desk waaaaay too early in the morning before you’ve had your caffeine and start telling you that the storage is too slow and you need to do something about it? I have. In my opinion it’s even worse when the Virtualization guy comes over and makes similar accusations, but that’s another story.
Now that I work for LogicMonitor I see this all the time. People come to us because “the NetApps are slow”. All too often we come to find that it’s actually the ESX host itself, or the SQL server having problems because of poorly designed queries. I’ve experienced this first hand before I worked for LogicMonitor,so it’s no surprise to me that this is a regular issue. When I experienced this problem myself I found it was vital to monitor all systems involved so I could really figure out where the bottleneck was.
Developers are sometimes too helpful when they instrument their systems. For example, when asked to add a metric that will report the response time of a request – there are several ways that it can be done. One way that seems to make sense is to just keep a variable with the total number of requests, and another with the total processing time. Then the developer just creates a variable showing total processing time divided by total requests, and a way to expose it (an MBean to report it via JMX, or a status page via HTTP, etc). This will be a nice neat object that reports the response time in milliseconds, all pre-calculated for the user.
The problem with this? It is indeed going to report the average response time – but it’s going to be the average of all response times since the server started. So… if the server has been running with an average response time of 1 ms, and it’s been up for 1000 hours, then it starts exhibiting a response time of 100 ms per request – after an hour of this slow behavior, the pre-calculated average response time will be 1.01 milliseconds (assuming a constant rate of requests). Not even enough of a change to be discernible with the eye on a graph, Read more »
One thing we frequently say is that you need to be monitoring all sorts of metrics when you do software releases, so you can tell if things degrade, and thus head off performance issues. You need to monitor not just the basics of the server (disk IO, memory, CPU, network), and the function of the server (response time serving web pages, database queries, etc), but also the in-between metrics (cache hit rates, etc).
This also provides visibility into when things improve, as well as get worse. For example, in a recent release, we changed the way we store some data in an internal database, and reduced the number of records in some tables by thousands. As you can see, this dropped the number of times Innodb had to hit the file system for data quite a bit:
Now, if we were running on SAS disks, instead of SSDs, we would have just regained a sizable percentage of the maximum IO rate of the drive arrays back, with one software release. (Purists will note that the drop in what is graphed is InnoDB disk requests – not OS level disk requests. Some of these requests will likely be satisfied from memory, not disk.)
If I were a developer, and was on the team that effectively allowed the same servers to scale to support twice as much load with a software release….I’d want people to know that.
This post, written by LogicMonitor’s Director of Tech Ops, Jesse Aukeman, originally appeared on HighScalability.com on February 19, 2013
If you are like us, you are running some type of linux configuration management tool. The value of centralized configuration and deployment is well known and hard to overstate. Puppet is our tool of choice. It is powerful and works well for us, except when things don’t go as planned. Failures of puppet can be innocuous and cosmetic, or they can cause production issues, for example when crucial updates do not get properly propagated.
In the most innocuous cases, the puppet agent craps out (we run puppet agent via cron). As nice as puppet is, we still need to goose it from time to time to get past some sort of network or host resource issue. A more dangerous case is when an administrator temporarily disables puppet runs on a host in order to perform some test or administrative task and then forgets to reenable it. In either case it’s easy to see how a host may stop receiving new puppet updates. The danger here is that this may not be noticed until that crucial update doesn’t get pushed, production is impacted, and it’s the client who notices.
Monitoring is clearly necessary in order to keep on top of this. Rather than just monitoring the status of the puppet server (a necessary, but not sufficient, state), we would like to monitor the success or failure of actual puppet runs on the end nodes themselves. For that purpose, puppet has a built in feature to export status info Read more »
Performance monitoring for all your infrastructure & applications. In minutes, not hours.
Questions? Call Us!
(888) 415-6442 or +1 (805)-617-3884