[Originally appeared February 26, 2014 in the Packet Pushers online community, written by Jeff Behl, Chief Network Architect with LogicMonitor.]
LogicMonitor is a SaaS-based performance and monitoring platform servicing clients across the world. Our customers install LogicMonitor “Collectors” within their data centers to gather data from devices and services, and use a web application to analyze aggregated performance metrics and to configure alerting and reporting. This means our entire operation (and therefore the monitoring our customers depend on) relies on ISPs to ensure that we efficiently and accurately receive billions of data points a day.
One question we sometimes get is why LogicMonitor relies so little on SNMP traps. When we are writing the monitoring for a new device, we look at the traps in the MIB for the device to see the things the vendor thinks are important to notify about – but we will try to determine the state of the device by polling for those things, not relying on the traps. “Why not rely on traps?” you may ask. Good question. Read more »
This weekend I was catching up on some New Yorker issues, when an article by one of my favorite New Yorker authors, Atul Gawande, struck me as illuminating so much about tech companies and DevOps. (This is an example of ideas coming from diverse, unrelated sources – something that is part of the culture of LogicMonitor. Just yesterday, in fact, our Chief Network Architect had a great idea to improve security and accountability when our support engineers are asked to log in to a customer’s account, and this idea occurred to him while he and I were charging down the Jesusita trail on mountain bikes.)
The article, Atul Gawande: How Do Good Ideas Spread? : The New Yorker, explores why some good ideas (such as anesthesia) were readily adopted, while other just as worthy ideas (antisepsis – keeping germs away from medical procedures) were not. So how does this relate to DevOps and technology companies? Read more »
This week I’ve been off visiting customers in Atlanta – which means a lot of time on planes and in airports (especially today, when my flight was cancelled so I have a 6 hour delay…) So that means a lot of reading. One book I read on this trip was UX for Lean Startups, by Laura Klein. A good read, advocating good common sense strategies, which I will roughly paraphrase:
This is, to some degree, a similar message that you will hear from proponents of Agile methodologies like Scrum; from DevOps, and the Lean enterprise movement in general: work collaboratively; release frequently; measure the results.
How does this relate to monitoring?
Like modifying a UX, it’s easier to change code for performance and capacity reasons earlier, rather than later. If your plan to use flat files to store all your customers’ transaction history works fine for 5 customers, but not for 5000 – it’s much better to find that out when you have 5 customers. (Even better to find it out before you’ve released it to any customers.) Finding that out may require simulating the load of 5000 customers – but if you have in-depth monitoring, it is more likely to be evident in advance of the load. In the case of flat files, it would be easy to see a spike in Linux disk request latency – even if you only have a few users. If you have a less-anachronistic architect who’s decided to use MySQL, you may see no issues in disk latency, but you may see a spike in table scans. No actual problem now, but an indicator of where you may run into growing pains. If you run Redis/Memcached/Cassandra/MongoDB (hopefully not all at once), you may not see performance issues in the transactions, but you may have less memory to run the application, so it may start swapping – so now you need to split your systems.
In Lean UX, the initial steps are qualitative observations of a small subset of users to identify the worst issues, which are then addressed and iterated on. With Lean monitoring, thorough monitoring should be deployed from the start, and it will require someone with experience to identify changes in behavior that, while not a problem now, could indicate one under greater load, and to work out how to address them. (Change from MySQL to NoSQL? Add indexes? Add hardware resources? Scale horizontally?) The more thorough your monitoring is, with good graphical presentation of trends, the more likely you are to be able to find issues early, and thus scale and release without issues.
If you run infrastructure, and don’t work directly with developers, the same principles apply. You don’t move all functions from one datacenter to another at once (if you have a choice). You run a small set of applications in the new datacenter, monitor everything you can there, fix the errors you find, then move some more load. Rinse, repeat. Deploying new ESX infrastructure? Move some non-critical VMs first. New Exchange cluster? Don’t move all users at once without testing.
Nothing revolutionary, and nothing people don’t know, but it’s good to have reminders sometimes. The key to all changes is to keep them small, and measure the crap out of them.
Eleanor Roosevelt is reputed to have said “Learn from the mistakes of others. You can’t live long enough to make them all yourself.” In that spirit, we’re sharing a mistake we made so that you may learn.
This last weekend we had a service-impacting issue for about 90 minutes that affected a subset of customers on the East Coast. This despite the fact that, as you may imagine, we have very thorough monitoring of our servers; error level alerts (which are routed to people’s pagers) were triggered repeatedly during the issue; we have multiple stages of escalation for error alerts; and we ensure we always have on-call staff responsible for reacting to alerts, who are always reachable.
All these conditions were true this weekend, and yet we still had an issue whereby no person was alerted for over an hour after the first alerts were triggered. How was this possible? Read more »
Last night, our server monitoring sent me a text alert about the CPU load of a server in our infrastructure I had never seen before. (I’m not normally on the NOC escalation – but of the usual NOC team, one guy is taking advantage of our unlimited vacation policy to recharge in Europe, and two others were en route to Boston to speak at AnsibleFest.) So I got woken up; saw the errant CPU; checked the server briefly via LogicMonitor on my phone; replied to the text with “SDT 6” to put this alert into Scheduled Downtime for 6 hours, and went back to sleep with the CPU still running over 90%.
How, you may ask, did I know it was safe to just SDT this alert, when I had never come across this server before? What if it was a critical piece of our infrastructure, and its high CPU was causing Bad Things? Its name told me. Read more »
Have you ever been the guy in charge of storage and the dev guy and database guy come over to your desk waaaaay too early in the morning before you’ve had your caffeine and start telling you that the storage is too slow and you need to do something about it? I have. In my opinion it’s even worse when the Virtualization guy comes over and makes similar accusations, but that’s another story.
Now that I work for LogicMonitor I see this all the time. People come to us because “the NetApps are slow”. All too often we come to find that it’s actually the ESX host itself, or the SQL server having problems because of poorly designed queries. I experienced this firsthand before I worked for LogicMonitor, so it’s no surprise to me that this is a regular issue. When I experienced this problem myself I found it was vital to monitor all systems involved so I could really figure out where the bottleneck was.
Developers are sometimes too helpful when they instrument their systems. For example, when asked to add a metric that will report the response time of a request – there are several ways that it can be done. One way that seems to make sense is to just keep a variable with the total number of requests, and another with the total processing time. Then the developer just creates a variable showing total processing time divided by total requests, and a way to expose it (an MBean to report it via JMX, or a status page via HTTP, etc). This will be a nice neat object that reports the response time in milliseconds, all pre-calculated for the user.
The problem with this? It is indeed going to report the average response time, but it’s going to be the average of all response times since the server started. So… if the server has been running with an average response time of 1 ms, and it’s been up for 1000 hours, then it starts exhibiting a response time of 100 ms per request – after an hour of this slow behavior, the pre-calculated average response time will be only about 1.1 milliseconds (assuming a constant rate of requests), even though every current request is taking 100 times longer than before. Not even enough of a change to be discernible with the eye on a graph, Read more »
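To make the arithmetic concrete, here is a minimal sketch of the scenario above. It is an illustration only, not how LogicMonitor or any particular application computes its metrics: the function names and the rate of one request per second are assumptions. It compares the pre-calculated average since server start with an average computed from the change in the two raw counters (total processing time and total requests) between two polls.

```python
def lifetime_average_ms(total_time_ms, total_requests):
    """Average since server start: what the pre-calculated metric reports."""
    return total_time_ms / total_requests if total_requests else 0.0


def interval_average_ms(prev, curr):
    """Average over one polling interval.

    prev and curr are (total_time_ms, total_requests) counter samples
    taken at the start and end of the interval.
    """
    d_time = curr[0] - prev[0]
    d_reqs = curr[1] - prev[1]
    return d_time / d_reqs if d_reqs > 0 else 0.0


# Assumption for illustration: a constant one request per second.
REQUESTS_PER_HOUR = 3600

# Counters after 1000 hours at 1 ms per request.
before = (1000 * REQUESTS_PER_HOUR * 1.0, 1000 * REQUESTS_PER_HOUR)
# Counters after one more hour at 100 ms per request.
after = (before[0] + REQUESTS_PER_HOUR * 100.0, before[1] + REQUESTS_PER_HOUR)

lifetime = lifetime_average_ms(*after)
recent = interval_average_ms(before, after)

print(f"average since start: {lifetime:.2f} ms")
print(f"average over last hour: {recent:.2f} ms")
```

The lifetime figure barely moves, while the figure derived from counter deltas reflects the full 100 ms. That suggests exposing the two raw counters and letting the monitoring system do the division per polling interval, rather than exposing only the pre-divided average.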
Until very recently I was your client: the VP of Marketing for a business consulting firm where I doubled as the in-house IT. It was my job to bring on MSPs who could solve the bigger problems within our infrastructure. This included two complete office moves involving all new Cat 6 cabling in newly built offices, new servers, new backup, migration to MSFT Server 2008, a switch to hosted Exchange, and much more. For one reason or another we went through 3 MSPs in less than 5 years. It took some time to find that perfect MSP, but once we did we became an instant source of referrals.
Here are 5 things this MSP did really well that made us loyal customers:
1. We had less downtime
When our phones, internet, file server, network, or SaaS applications are down, we can’t work. This is the number one thing we do not want to happen, and the only way it happens is through negligence, not having a Plan B, or an accident.
Great MSPs can explain why we’re down in plain English, and don’t blame the failure on us for not knowing enough about our IT infrastructure. A great MSP understands we’re a small business and everyone is doing 10 things. I was offered a simple solution to get back up, fast. Even better, they provided me with a plan for how to avoid these things in the future. I loved hearing about redundancy scenarios; it felt like I was saving the business money before anything actually happened.
2. They kept us in the loop
Great providers discussed changes with me before making them. Sometimes seemingly minor changes have major consequences. If my MSP tells me something big is going to happen, and I can think of any reason why that might adversely affect our day-to-day operations, then I’m glad I was told before something just “happened.”
There was peace of mind knowing that there’s nothing major going on behind the scenes without my knowledge, and our MSP showed that they were considering the possible workflow impact of changes made. The side benefit for the MSP is that they’ve absolved themselves of blame — at that point it’s something we agreed to work on together.
3. Things were fixed before they broke
A great MSP is like a great doctor: they monitor our health and think holistically about what we’re trying to get done. Whenever our MSP was updating the server, or performing regular system maintenance, they would provide us with “things to be aware of” – completely outside of what they were there to do. That kind of checkup gave me the freedom to worry less about the things the MSP said they were monitoring.
Weekly reporting helped, too. We are very analytical about things like website uptime, latency, and general speed of productivity. The reports they were able to create through monitoring and site visits were remarkable and allowed us to make data-driven decisions about the business. This kind of insight and thoughtful attitude towards our business changed the way we perceived an MSP. Consequently we sought out more solutions through them.
4. They went beyond a “fix”
Being kept up-to-date on the latest and greatest can help us move the business forward. Solutions that give us a competitive edge, save us money, or otherwise move the hassle of IT off of our plate are preferable to just “fixing the glitch.”
Small businesses are typically against large capital expenditures. Solutions that let us scale quickly and keep us lightweight are attractive. Even something as simple as moving to a hosted Exchange solution will drastically change our infrastructure if we’re currently relying on an SBS 2003 server (no fun at all).
5. We were given simple explanations to complex IT problems
The technical intricacies of an error are usually lost on small business people. There are generally more pressing matters we’re obsessing about. So the more digestible an MSP can make that explanation, the better. We were always impressed by our MSP’s ability to distill something highly technical into something we could wrap our biz-dev brains around.
There’s a degree of comfort and familiarity when working with MSPs that understand our level of technical know-how. The more digestible that information is, the more likely I am to adopt whatever it is you’re suggesting.
At the end of the day, we don’t want to switch MSPs. It’s a pain to ask around, read reviews, set up appointments and listen to the sales pitches. So that means we’ll suffer through a lot of pain before saying good-bye. That said, a proactive, intelligence-driven, and quick-to-action MSP that is personable, but doesn’t outstay their welcome will retain existing clients and be referred more new ones over time.
— This guest article was contributed by Matt Harding – Business Consultant
One of the great things about being a customer of a SaaS-delivered monitoring service like LogicMonitor is that you get best practices for monitoring all sorts of technologies without having to have an expert in each technology on staff.