
Archives: Best Practices

SSD Stats

[Written by Perry Yang, Technical Operations Engineer at LogicMonitor]

In recent years, Solid-State Drives (SSDs) have become a standard part of data center architecture. They handle more simultaneous read/write operations than traditional disks and use a fraction of the power. Of course, as a leading infrastructure, software, and server monitoring platform vendor, we are very interested in monitoring our SSDs – not only because we want to make sure we’re getting what we paid for, but also because we would like to avoid a disk failure on a production machine at 3:00 AM… and the Shaquille O’Neal-sized headache to follow. But how do we know for sure that our SSDs are performing the way we want them to? Being one of the newest members of our technical operations team, I was, unsurprisingly, the one tasked with answering this question. Read more »
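As a rough sketch of the kind of check involved (not the answer from the full post), the smartmontools `smartctl -A` command exposes a drive’s SMART attributes, and a small script can watch the wear and reallocation counters. The attribute names vary by vendor, so the ones below are examples only:

```python
# Hedged sketch: pull SSD wear/error counters out of `smartctl -A` (smartmontools).
# Attribute names differ by vendor; the ones in WATCHED are examples, not a
# definitive list, and this is not LogicMonitor's actual SSD datasource.
import subprocess

WATCHED = ("Wear_Leveling_Count", "Media_Wearout_Indicator", "Reallocated_Sector_Ct")

def smart_attributes(device="/dev/sda"):
    out = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=True
    ).stdout
    attrs = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {"normalized": int(fields[3]), "raw": fields[9]}
    return attrs

if __name__ == "__main__":
    for name, data in smart_attributes().items():
        if name in WATCHED:
            print(f"{name}: normalized={data['normalized']} raw={data['raw']}")
```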


In a prior blog post, I talked about what virtual memory is, the difference between swapping and paging, and why it matters. (TL;DR: swapping is moving an entire process out to disk; paging is moving just specific pages out to disk, not an entire process. Running programs that require more memory than the system has will mean pages (or processes) are moved to/from disk and memory in order to get enough physical memory to run – and system performance will suck.)

Now I’ll talk about how to monitor virtual memory on Linux (where it’s easy) and, next time, on Solaris (where most people and systems do it incorrectly). Read more »
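As a small taste of the Linux side, here is a minimal sketch (mine, not the post’s) of watching swap activity via /proc/vmstat; the 10-second interval and the “any swapping is bad” threshold are illustrative assumptions:

```python
# Minimal sketch: watch Linux paging/swapping counters in /proc/vmstat.
# pswpin/pswpout count pages swapped in/out; pgmajfault counts major page
# faults (pages that had to be read back in from disk).
import time

COUNTERS = ("pswpin", "pswpout", "pgmajfault")
INTERVAL = 10  # seconds; illustrative, not a recommendation

def read_vmstat():
    stats = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in COUNTERS:
                stats[key] = int(value)
    return stats

if __name__ == "__main__":
    prev = read_vmstat()
    while True:
        time.sleep(INTERVAL)
        cur = read_vmstat()
        rates = {k: (cur[k] - prev[k]) / INTERVAL for k in COUNTERS}
        print("per-second:", rates)
        if rates["pswpin"] + rates["pswpout"] > 0:
            print("WARNING: active swapping - the system is short on physical memory")
        prev = cur
```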

Hi everyone,

Before the July 4th holiday, we had the opportunity to host our second LogicMonitor Monitoring Roundtable.

During this session, Mike Aracic, a senior datasource developer here at LogicMonitor, gave us insight into creating datasources for your environment and provided some resources for further education. Read more »

We’ve launched a new program here at LogicMonitor to help you get insight from us and from your compatriots at other companies, working in different roles, who are solving problems and issues with LogicMonitor. We are referring to this fledgling program as the Monitoring Roundtable. We are looking to have one of these every month, with invitations extended by your account managers. Of course, you are welcome to be proactive and reach out to us or to your account manager directly for an invitation. Read more »

[Originally appeared February 26, 2014 in the Packet Pushers online community, written by Jeff Behl, Chief Network Architect with LogicMonitor.]

LogicMonitor is a SaaS-based performance monitoring platform serving clients across the world. Our customers install LogicMonitor “Collectors” within their data centers to gather data from devices and services, and use a web application to analyze aggregated performance metrics and to configure alerting and reporting. This means our entire operation (and therefore the monitoring our customers depend on) relies on ISPs to ensure that we efficiently and accurately receive billions of data points a day.

LogicMonitor Architecture

Read more »

One question we sometimes get is why LogicMonitor relies so little on SNMP traps. When we are writing the monitoring for a new device, we look at the traps in the MIB for the device to see the things the vendor thinks are important to notify about – but we will try to determine the state of the device by polling for those things, not relying on the traps. “Why not rely on traps?” you may ask. Good question. Read more »
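To make the polling idea concrete, here is a minimal sketch using the pysnmp library to poll an interface’s ifOperStatus on a schedule instead of waiting for a linkDown trap; the host, community string, and interface index are placeholders, and this is not how LogicMonitor’s collector is implemented:

```python
# Minimal sketch: determine device state by polling an OID, rather than
# relying on the device to send a trap. Requires `pip install pysnmp`.
# Host, community, and ifIndex are placeholders.
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

IF_OPER_STATUS_1 = "1.3.6.1.2.1.2.2.1.8.1"  # ifOperStatus for ifIndex 1

def poll_oper_status(host="192.0.2.10", community="public"):
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity(IF_OPER_STATUS_1)),
    ))
    if error_indication or error_status:
        return None  # unreachable or SNMP error - itself worth alerting on
    return int(var_binds[0][1])  # 1 = up, 2 = down

if __name__ == "__main__":
    status = poll_oper_status()
    print("interface up" if status == 1 else f"interface status: {status}")
```

One practical point the sketch illustrates: a failed poll is itself information, whereas a trap that never arrives tells you nothing.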

This weekend I was catching up on some New Yorker issues, when an article by one of my favorite New Yorker authors, Atul Gawande, struck me as illuminating so much about tech companies and DevOps. (This is an example of ideas coming from diverse, unrelated sources – something that is part of the culture at LogicMonitor. Just yesterday, in fact, our Chief Network Architect had a great idea to improve security and accountability when our support engineers are asked to log in to a customer’s account – and this idea occurred to him while he and I were charging down the Jesusita trail on mountain bikes.)

The article, Atul Gawande’s “How Do Good Ideas Spread?” in The New Yorker, explores why some good ideas (such as anesthesia) were readily adopted, while others just as worthy (antisepsis – keeping germs away from medical procedures) were not. So how does this relate to DevOps and technology companies? Read more »

Lean Monitoring


This week I’ve been off visiting customers in Atlanta – which means a lot of time on planes and in airports (especially today, when my flight was cancelled, so I have a 6-hour delay…). So that means a lot of reading. One book I read on this trip was UX for Lean Startups, by Laura Klein. A good read, advocating good common-sense strategies, which I will roughly paraphrase:

  •  you will be wrong in some of your assumptions about how customers will use, and be able to use, your UX; therefore
  • start with an MVP of your UX
  • show your UX to test groups of customers as early as possible (before implementing); see where they have issues and what they like/don’t like
  • iterate on the UX with your customers
  • release it in your product; measure usage and business impact
  • rinse and repeat.

This is, to some degree, the same message you will hear from proponents of Agile methodologies like Scrum, from DevOps, and from the Lean enterprise movement in general: work collaboratively; release frequently; measure the results.

How does this relate to monitoring?

  • you will be wrong in some of your assumptions about how your code will perform under production load; therefore
  • start with the MVP of your feature
  • run the feature in limited load: in the lab, or with a small set of live traffic. See where the performance issues are.
  • iterate on the feature and performance bottlenecks with your developers
  • release it in your product, measuring performance and capacity impact
  • rinse and repeat

If your disk load jumps like this with 5 users – don’t put 5000 on this system…

Like modifying a UX, it’s easier to change code for performance and capacity reasons earlier rather than later. If your plan to use flat files to store all your customers’ transaction history works fine for 5 customers, but not for 5000, it’s much better to find that out when you have 5 customers. (Even better to find it out before you’ve released it to any customers.) Finding that out may require simulating the load of 5000 customers – but if you have in-depth monitoring, it is more likely to be evident in advance of the load. In the case of flat files, it would be easy to see a spike in Linux disk request latency, even if you only have a few users. If you have a less-anachronistic architect who’s decided to use MySQL, you may see no issues in disk latency, but you may see a spike in table scans: no actual problem now, but an indicator of where you may run into growing pains. If you run Redis/Memcached/Cassandra/MongoDB (hopefully not all at once), you may not see performance issues in the transactions, but you may have less memory to run the application, so it may start swapping – so now you need to split your systems.
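For the flat-file example above, the “spike in Linux disk request latency” signal is easy to derive yourself; here is a minimal sketch computing average milliseconds per completed I/O from /proc/diskstats deltas (the device name and 10-second interval are illustrative assumptions):

```python
# Minimal sketch: average disk request latency (ms per completed I/O),
# derived from /proc/diskstats counter deltas. Device and interval are
# illustrative assumptions, not a monitoring recommendation.
import time

def read_diskstats(device="sda"):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                ios = int(fields[3]) + int(fields[7])   # reads + writes completed
                ms = int(fields[6]) + int(fields[10])   # ms spent reading + writing
                return ios, ms
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

if __name__ == "__main__":
    prev_ios, prev_ms = read_diskstats()
    while True:
        time.sleep(10)
        ios, ms = read_diskstats()
        delta = ios - prev_ios
        if delta:
            print(f"avg request latency: {(ms - prev_ms) / delta:.1f} ms over {delta} I/Os")
        prev_ios, prev_ms = ios, ms
```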

In Lean UX, the initial steps are qualitative observations of a small subset of users, to identify the worst issues, which are then addressed and iterated on. With Lean monitoring, thorough monitoring should be deployed even initially, and it will require someone with experience to identify changes in behavior that, while not a problem now, could indicate one under greater load – and to decide how to address them. (Change from MySQL to NoSQL? Add indexes? Add hardware resources? Scale horizontally?) The more thorough your monitoring is, with good graphical presentation of trends, the more likely you are to find issues early, and thus scale and release without incident.

If you run infrastructure and don’t work directly with developers, the same principles apply. You don’t move all functions from one datacenter to another at once (if you have a choice). You run a small set of applications in the new datacenter, monitor everything you can there, fix the errors you find, then move some more load. Rinse, repeat. Deploying new ESX infrastructure? Move some non-critical VMs first. New Exchange cluster? Don’t move all users at once without testing.

Nothing revolutionary, and nothing people don’t know, but it’s good to have reminders sometimes. The key to all changes is to keep them small, and measure the crap out of them.

Eleanor Roosevelt is reputed to have said “Learn from the mistakes of others. You can’t live long enough to make them all yourself.” In that spirit, we’re sharing a mistake we made so that you may learn.

This last weekend we had a service-impacting issue for about 90 minutes that affected a subset of customers on the East Coast. This despite the fact that, as you may imagine, we have very thorough monitoring of our servers; error-level alerts (which are routed to people’s pagers) were triggered repeatedly during the issue; we have multiple stages of escalation for error alerts; and we ensure we always have on-call staff responsible for reacting to alerts, who are always reachable.

All these conditions were true this weekend, and yet we still had an issue whereby no person was alerted for over an hour after the first alerts were triggered. How was this possible? Read more »

Last night, our server monitoring sent me a text alert about the CPU load of a server in our infrastructure I had never seen before. (I’m not normally on the NOC escalation – but of the usual NOC team, one guy is taking advantage of our unlimited vacation policy to recharge in Europe, and two others were travelling en route to Boston to speak at AnsibleFest.) So I got woken up; saw the errant CPU; checked the server briefly via LogicMonitor on my phone; replied to the text with “SDT 6” to put this alert into Scheduled Downtime for 6 hours, and went back to sleep with the CPU still running over 90%.

How, you may ask, did I know it was safe to just SDT this alert, when I had never come across this server before? What if it was a critical piece of our infrastructure, and its high CPU was causing Bad Things? Its name told me. Read more »
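As a purely hypothetical illustration of how a naming convention can carry that information (the scheme below is invented for the example; it is not LogicMonitor’s actual convention), a hostname that encodes environment and service lets a half-asleep engineer, or a script, decide whether a box is critical:

```python
# Hypothetical illustration of a hostname convention that encodes environment,
# service, and instance number, e.g. "qa-cache-02". Invented for the example;
# not LogicMonitor's actual naming scheme.
import re

PATTERN = re.compile(r"^(?P<env>prod|qa|dev)-(?P<service>[a-z]+)-(?P<num>\d+)$")
NON_CRITICAL_SERVICES = {"batch", "report", "build"}  # assumption for the example

def classify(hostname):
    match = PATTERN.match(hostname)
    if not match:
        return "unknown - treat as critical"
    if match.group("env") != "prod" or match.group("service") in NON_CRITICAL_SERVICES:
        return "non-critical: safe to SDT a CPU alert and go back to sleep"
    return "critical: investigate now"

if __name__ == "__main__":
    for host in ("qa-cache-02", "prod-api-03", "prod-batch-01"):
        print(host, "->", classify(host))
```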
