Archives: Best Practices


We are pleased to announce LogicMonitor’s Second Annual EU Roadshow, to be held on February 25, 2015, in London. LogicMonitor’s marketing, product, and engineering teams have put together an event that promises to be unique and informative.

The Roadshow agenda includes a roadmap presentation from LogicMonitor’s Founder and Chief Product Officer, Steve Francis; a State of the Union talk from CEO Kevin McGibben; and an overview of product releases followed by a Q&A with LogicMonitor engineers.

LogicMonitor customers and prospects are highly encouraged to attend the event to enhance their performance monitoring skill set and become a better user of the platform.

LogicMonitor will be in London and San Francisco in Q1, and would love for you to vote on where we should go next. Vote here for the next roadshow city!


Interested in attending the EU Roadshow? Email Krista Damico at krista.damico@logicmonitor.com for more info.


No one likes to talk about outages. They’re horrible to experience as an employee and they take a heavy toll in customer confidence and future revenue. But they do happen. Even publicly traded tech powerhouses, such as eBay and Microsoft, who have more technical resources than you’ll ever have, fall prey to outages. And when they do, they are closed for business, much to the chagrin of their shareholders and executive teams.

It’s not so much a question of whether an outage will occur in your company but when. The secret to surviving them is to get better at handling them and to learn from the mistakes of others. Nobody is perfect all the time (my current employer, LogicMonitor, included), but I hope that by talking about these mistakes, we can all begin the hard work required to avoid them in the future.

4 Massive Mistakes Companies Make Handling Outages:

  1. Not having a tried-and-true outage response plan

    Does this sound familiar?

    An outage occurs. A barrage of emails is fired to the Tech Ops team from Customer Support. Executives begin demanding updates every five minutes. Tech team members all run to their separate monitoring tools to see what data they can dredge up, often seeing only part of the problem. Mass confusion ensues as groups point their fingers at each other and sysadmins are unsure whether to respond to the text from their boss demanding an update or to keep troubleshooting and apply a possible fix. Marketing (“We’re getting trashed on social media! We need to send a mass email and do a blog post telling people what is happening!”) and Legal (“Don’t admit liability!”) jump in to help craft a public-facing response. Cats begin mating with dogs and the world explodes.

SSD Stats

[Written by Perry Yang, Technical Operations Engineer at LogicMonitor]

In recent years, solid-state drives (SSDs) have become a standard part of data center architecture. They handle more simultaneous read/write operations than traditional disks and use a fraction of the power. Of course, as a leading infrastructure, software and server monitoring platform vendor, we are very interested in monitoring our SSDs, not only because we want to make sure we’re getting what we paid for, but also because we would like to avoid a disk failure on a production machine at 3:00 AM… and the Shaquille O’Neal-sized headache to follow. But how do we know for sure whether our SSDs are performing the way we want them to? As one of the newest members of our technical operations team, I was, unsurprisingly, tasked with answering this question.


In a prior blog post, I talked about what virtual memory is, the difference between swapping and paging, and why it matters. (TL;DR: swapping is moving an entire process out to disk; paging is moving just specific pages out to disk, not an entire process. Running programs that require more memory than the system has will mean pages (or processes) are moved to/from disk and memory in order to get enough physical memory to run – and system performance will suck.)

Now I’ll talk about how to monitor virtual memory on Linux (where it’s easy) and, next time, on Solaris (where most people and systems do it incorrectly).
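The full Linux walkthrough is not reproduced here. As a minimal sketch of one common approach (not necessarily the one the post describes), assuming a Linux host: the kernel exposes cumulative pswpin and pswpout counters in /proc/vmstat (pages read from and written to swap space since boot), so sampling the file twice and dividing the difference by the interval gives a paging rate. The function names are illustrative.

```python
# Sketch, assuming a Linux host: turn the cumulative swap counters in
# /proc/vmstat into per-second rates. A sustained non-zero rate, especially
# of swap-ins, is a common sign the system is short of physical memory.


def parse_vmstat(text):
    """Parse /proc/vmstat-style "name value" lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def read_vmstat(path="/proc/vmstat"):
    """Read and parse /proc/vmstat (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())


def swap_rates(before, after, interval_s):
    """Pages per second swapped in and out between two parsed samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return {
        key: (after[key] - before[key]) / interval_s
        for key in ("pswpin", "pswpout")
    }
```

On a live system, call read_vmstat() twice, a few seconds apart, and pass both results and the elapsed seconds to swap_rates.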

Hi everyone,

Before the July 4th holiday, we had the opportunity to host our second LogicMonitor Monitoring Roundtable.

During this session, Mike Aracic, a senior datasource developer here at LogicMonitor, gave us insight into creating datasources for your environment and provided some resources for further education.

We’ve launched a new program here at LogicMonitor to help you gain insight from us and from your peers at other companies, working in different roles, who are tackling similar problems with LogicMonitor. We are calling this fledgling program the Monitoring Roundtable. We aim to hold one every month, with invitations extended by your account managers. Of course, you are welcome to be proactive and reach out to us or to your account manager directly for an invitation.

[Originally appeared February 26, 2014 in the Packet Pushers online community, written by Jeff Behl, Chief Network Architect with LogicMonitor.]

LogicMonitor is a SaaS-based performance and monitoring platform serving clients across the world. Our customers install LogicMonitor “Collectors” within their data centers to gather data from devices and services, and use a web application to analyze aggregated performance metrics and to configure alerting and reporting. This means our entire operation (and therefore the monitoring our customers depend on) relies on ISPs to ensure that we efficiently and accurately receive billions of data points a day.

[Figure: LogicMonitor Architecture]

One question we sometimes get is why LogicMonitor relies so little on SNMP traps. When we are writing the monitoring for a new device, we look at the traps in the MIB for the device to see the things the vendor thinks are important to notify about – but we will try to determine the state of the device by polling for those things, not relying on the traps. “Why not rely on traps?” you may ask. Good question.
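As a minimal sketch of the polling approach described above (not LogicMonitor’s implementation): IF-MIB’s linkDown and linkUp traps report changes in ifOperStatus, so the same condition can be checked by polling that column directly on every cycle. snmp_get below is a hypothetical stand-in for whatever SNMP client library you use, and poll_interfaces is an illustrative name.

```python
# Sketch: use a trap definition only as a hint about *what* matters, then
# determine state by polling. IF-MIB's linkDown/linkUp traps carry
# ifOperStatus, which can be read directly. `snmp_get` is a hypothetical
# callable (OID string -> int) standing in for a real SNMP client.

IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"  # IF-MIB::ifOperStatus column

OPER_STATUS_NAMES = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}


def poll_interfaces(snmp_get, if_indexes, previous):
    """Poll ifOperStatus for each interface and report state changes.

    Returns (current_state, alerts). The current state is re-read on every
    poll, so the result does not depend on any earlier notification arriving.
    """
    current = {}
    alerts = []
    for idx in if_indexes:
        value = snmp_get(f"{IF_OPER_STATUS}.{idx}")
        state = OPER_STATUS_NAMES.get(value, "unknown")
        current[idx] = state
        if idx in previous and previous[idx] != state:
            alerts.append(f"interface {idx}: {previous[idx]} -> {state}")
    return current, alerts
```

The trade-off is that an interface that goes down and comes back up between two polls will not be seen, so the poll interval bounds how quickly a persistent change is detected.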

This weekend I was catching up on some New Yorker issues when an article by one of my favorite New Yorker authors, Atul Gawande, struck me as illuminating so much about tech companies and DevOps. (This is an example of ideas coming from diverse, unrelated sources, something that is part of the culture of LogicMonitor. Just yesterday, in fact, our Chief Network Architect had a great idea to improve security and accountability when our support engineers are asked to log in to a customer’s account, and this idea occurred to him while he and I were charging down the Jesusita trail on mountain bikes.)

The article, Atul Gawande’s “How Do Good Ideas Spread?” in The New Yorker, explores why some good ideas (such as anesthesia) were readily adopted, while other equally worthy ideas (antisepsis: keeping germs away from medical procedures) were not. So how does this relate to DevOps and technology companies?

Lean Monitoring

Filed under Best Practices.

This week I’ve been off visiting customers in Atlanta – which means a lot of time on planes and in airports (especially today, when my flight was cancelled, leaving me with a 6-hour delay), and so a lot of reading. One book I read on this trip was UX for Lean Startups, by Laura Klein. It’s a good read that advocates common-sense strategies, which I will roughly paraphrase:

  • you will be wrong in some of your assumptions about how customers will use, and be able to use, your UX; therefore
  • start with an MVP of your UX
  • show your UX to test groups of customers as early as possible (before implementing); see where they have issues and what they like/don’t like
  • iterate on the UX with your customers
  • release it in your product; measure usage and business impact
  • rinse and repeat.

This is, to some degree, the same message you will hear from proponents of Agile methodologies like Scrum, from DevOps, and from the Lean enterprise movement in general: work collaboratively, release frequently, and measure the results.

How does this relate to monitoring?

  • you will be wrong in some of your assumptions about how your code will perform under production load; therefore
  • start with the MVP of your feature
  • run the feature in limited load: in the lab, or with a small set of live traffic. See where the performance issues are.
  • iterate on the feature and performance bottlenecks with your developers
  • release it in your product, measuring performance and capacity impact
  • rinse and repeat

    If your disk load jumps like this with 5 users – don’t put 5000 on this system…

Like modifying a UX, it’s easier to change code for performance and capacity reasons earlier rather than later. If your plan to use flat files to store all your customers’ transaction history works fine for 5 customers but not for 5000, it’s much better to find that out when you have 5 customers. (Even better to find it out before you’ve released it to any customers.) Finding that out may require simulating the load of 5000 customers, but if you have in-depth monitoring, the problem is more likely to be evident before the load arrives. In the case of flat files, it would be easy to see a spike in Linux disk request latency, even with only a few users. If you have a less anachronistic architect who has decided to use MySQL, you may see no issues in disk latency, but you may see a spike in table scans: no actual problem now, but an indicator of where you may run into growing pains. If you run Redis/Memcached/Cassandra/MongoDB (hopefully not all at once), you may not see performance issues in the transactions, but you may have less memory left for the application, so it may start swapping, and now you need to split your systems.
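To make the flat-file case concrete, here is a minimal sketch, assuming a Linux host, of deriving average disk request latency from two samples of /proc/diskstats: divide the extra milliseconds spent on reads and writes by the extra I/Os completed, roughly what iostat -x reports as await. The function names are illustrative, not from any particular tool.

```python
# Sketch, assuming Linux /proc/diskstats. After major, minor and device name,
# the fields are: reads completed, reads merged, sectors read, ms reading,
# writes completed, writes merged, sectors written, ms writing, ...
# Counters are cumulative since boot, so latency comes from differences.


def parse_diskstats(text):
    """Return {device: (ios_completed, ms_spent)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 11:
            continue  # skip blank or truncated lines
        reads, ms_reading = int(f[3]), int(f[6])
        writes, ms_writing = int(f[7]), int(f[10])
        stats[f[2]] = (reads + writes, ms_reading + ms_writing)
    return stats


def avg_latency_ms(before, after, device):
    """Mean milliseconds per I/O completed on `device` between two samples."""
    ios = after[device][0] - before[device][0]
    ms = after[device][1] - before[device][1]
    return ms / ios if ios else 0.0
```

Sampling this every minute and graphing it per device is the kind of trend that would show the flat-file design struggling while there are still only a handful of users.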

In Lean UX, the initial steps are qualitative observations of a small subset of users to identify the worst issues, which are then addressed and iterated on. With Lean monitoring, thorough monitoring should be in place from the start, and it takes someone with experience to spot changes in behavior that, while not a problem now, could become one under greater load, and to decide how to address them. (Change from MySQL to NoSQL? Add indexes? Add hardware resources? Scale horizontally?) The more thorough your monitoring is, with good graphical presentation of trends, the more likely you are to find issues early, and thus scale and release without issues.
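As a rough sketch of reading small-load measurements for signs of trouble at larger load: fit a straight line to a metric measured at a few user counts and project it forward. Linearity is an assumption, and real systems often degrade faster than linearly near saturation, so a projected breach is an early warning rather than a forecast. The function names and the sample numbers in the usage note are hypothetical.

```python
# Sketch: ordinary least-squares fit of metric vs. user count, then projection.
# ASSUMPTION: the metric grows linearly with users. Treat results as a lower
# bound on trouble, since saturation usually makes growth worse than linear.


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + b."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def projected_value(users, metric, target_users):
    """Metric value the linear fit predicts at `target_users`."""
    slope, intercept = fit_line(users, metric)
    return slope * target_users + intercept


def max_users_within(users, metric, limit):
    """Largest user count whose projected metric stays at or below `limit`."""
    slope, intercept = fit_line(users, metric)
    if slope <= 0:
        return None  # metric is not growing with load; no limit implied
    return int((limit - intercept) // slope)
```

For example, with disk utilization readings of 2, 4, 6, 8 and 10 percent at 1 to 5 users, projected_value at 5000 users gives 10000 percent, and max_users_within with a limit of 80 percent gives 40 users.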

If you run infrastructure and don’t work directly with developers, the same principles apply. You don’t move all functions from one datacenter to another at once (if you have a choice). You run a small set of applications in the new datacenter, monitor everything you can there, fix the errors you find, then move some more load. Rinse, repeat. Deploying new ESX infrastructure? Move some non-critical VMs first. New Exchange cluster? Don’t move all users at once without testing.
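The incremental move described above can be sketched as a loop that migrates work in growing batches and stops at the first failed health check. migrate and healthy are hypothetical callables standing in for your actual migration tooling and monitoring checks; starting with one item and doubling the batch size are arbitrary choices for illustration.

```python
# Sketch: staged migration with a health gate between batches.
# `migrate(item)` and `healthy()` are hypothetical placeholders for real
# tooling (e.g. moving a VM) and a monitoring check on the new environment.


def staged_migration(items, migrate, healthy, first_batch=1, growth=2):
    """Migrate `items` in batches of first_batch, first_batch*growth, ...

    Returns (migrated, halted_batch): halted_batch is None if everything was
    moved, otherwise the batch after which the health check failed.
    """
    if first_batch < 1 or growth < 1:
        raise ValueError("first_batch and growth must be at least 1")
    migrated = []
    batch_size = first_batch
    i = 0
    while i < len(items):
        batch = items[i:i + batch_size]
        for item in batch:
            migrate(item)
        migrated.extend(batch)
        if not healthy():
            return migrated, batch  # stop and investigate before moving more
        i += len(batch)
        batch_size *= growth
    return migrated, None
```

Stopping, rather than continuing, when a check fails keeps the blast radius to the most recent small batch, which is the point of keeping changes small.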

Nothing revolutionary, and nothing people don’t know, but it’s good to have reminders sometimes. The key to all changes is to keep them small, and measure the crap out of them.
