<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>LogicMonitor Blog</title>
	<atom:link href="http://blog.logicmonitor.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.logicmonitor.com</link>
	<description>Interesting issues in datacenter monitoring</description>
	<lastBuildDate>Sat, 04 Feb 2012 00:24:33 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Agile Monitoring Support</title>
		<link>http://blog.logicmonitor.com/2012/02/03/agile-monitoring-support/</link>
		<comments>http://blog.logicmonitor.com/2012/02/03/agile-monitoring-support/#comments</comments>
		<pubDate>Sat, 04 Feb 2012 00:24:19 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=499</guid>
		<description><![CDATA[We recently had a customer come into trial looking around for a new monitoring solution.  This is always good for us.  We love the takeaway.  (Customers defecting from other monitoring systems to us.) As in most takeaway situations this customer had specific needs.  Now there are the obvious ones in which LogicMonitor easily fits the [...]]]></description>
			<content:encoded><![CDATA[<p>We recently had a customer come into trial looking for a new monitoring solution.  This is always good for us.  We love the takeaway.  (That is, customers defecting from other monitoring systems to us.)  As in most takeaway situations, this customer had specific needs.  Now there are the obvious ones where LogicMonitor easily fits the bill, such as alerting, dashboards, performance monitoring, etc. (and if you fall into that VMware, Cisco, NetApp sweet spot, game over!).  This guy, however, had a very specific need we didn’t fulfill directly out of the gates.  I think anyone who has ever worked with a monitoring solution knows that it’s hard to find one that does everything.  Well, LogicMonitor is no different.  We don’t do EVERYTHING.  I know, you thought I was going to get all high and mighty and talk about how LogicMonitor is the one monitoring tool that CAN do everything.  Well, we don’t (do everything out of the gates, that is).  But for just about everything we don’t do out of the gates, we can get there pretty easily, because we provide an easy-to-use framework that lets us quickly build almost anything you can think of on a monitoring level.</p>
<p>Back to the issue!  This guy must have been bitten at some point by interface flapping, because he needed to be alerted if an interface went down and came back up again within the minute between our polls.  I get it.  As a former admin myself, I know we all have that one quirky thing (OK, more like 20 things) that bit us badly once, and we vow never to get bitten by it again.  This type of attitude is what makes a good admin.  I think this was his one thing.  It happened to be an alert that we didn’t provide by default, and he definitely wasn’t going to trust our product if he couldn’t get it to provide this alert.  With his other product he accomplished it by collecting some trap info and doing some calculation, which in turn let him know that the interface was flapping.  It sounded complicated, and I give him kudos for making the most of what he had.  I knew we could help him set up this same solution, but at LogicMonitor we prefer not to trust traps.  We would much rather poll SNMP counters, because traps tend to get lost in transit, especially during times of duress, not to mention the configuration headache.  Anyway, I digress.  What I didn’t know was how we could make his solution better and more reliable while applying it to every interface he planned to monitor.  So I posed the problem to our support department (of which I am also a member).  Two days later I got a response from one of our smartest engineers, a.k.a. “that tech Steve”.  He suggested that we poll the SNMP value that records the uptime at the interface’s last status change, but monitor it for changes rather than for the time itself.  Genius!  Any state change, even a down-and-up that completes between two polls, alters that value, so we could now alert whenever it changes.  It is also a value that we can poll, rather than a trap that is potentially lost in the ether, and for good measure I graphed it.  Awesome, now we have ammo to use during the takeaway.</p>
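<p>For the curious, the check boils down to comparing each poll with the previous one.  Here is a minimal Python sketch of that idea, not necessarily how the LogicMonitor datasource is built: it assumes the polled value behaves like the standard IF-MIB ifLastChange object, and the function name and sample readings are made up for illustration.</p>
<pre class="programlisting"># Sketch only: the function name and the sample data are hypothetical.
def detect_state_changes(samples):
    """Return the poll times at which the last-change value differs from the previous poll.

    samples is a list of (poll_time_seconds, last_change_ticks) pairs in poll order.
    """
    changed_at = []
    previous = None
    for poll_time, last_change in samples:
        if previous is not None and last_change != previous:
            changed_at.append(poll_time)
        previous = last_change
    return changed_at


# One poll per minute; the interface bounced between the polls at 120 s and 180 s.
polls = [(0, 4200), (60, 4200), (120, 4200), (180, 915300), (240, 915300)]
print(detect_state_changes(polls))</pre>
<p>With one poll per minute, the sketch flags the poll at 180 seconds even though the interface was already back up by then.  A device reboot changes the value too, so a real check would probably want to look at system uptime as well.</p>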
<p>This scenario is a good example of a case where we didn’t provide something by default, but our flexible framework and expert support let us deliver a solution quickly.  Subsequently, our potential customer came back with an additional problem he was hoping to see us solve before making a final decision on LogicMonitor.  We were able to get him a solution quickly again, and we hope this will help close the deal, as it has many times before.</p>
<p>Where was LogicMonitor when I was a Network Admin?</p>
<p>Cevin</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2012/02/03/agile-monitoring-support/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Metrics for DevOps</title>
		<link>http://blog.logicmonitor.com/2012/01/21/metrics-for-devops/</link>
		<comments>http://blog.logicmonitor.com/2012/01/21/metrics-for-devops/#comments</comments>
		<pubDate>Sun, 22 Jan 2012 00:33:08 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=480</guid>
		<description><![CDATA[At LogicMonitor we take turns learning from each other in informal sessions.  One week it may be  developers talking about MySql and NoSQL; or marketing guys talk about lead generation and Adwords, etc.  This time we&#8217;d arrived on the topic of programming languages, and how there is a trade off: between code speed and efficiency [...]]]></description>
			<content:encoded><![CDATA[<p>At LogicMonitor we take turns learning from each other in informal sessions.  One week it may be developers talking about <a href="http://www.logicmonitor.com/monitoring/databases/mysql-monitoring-and-optimization/">MySQL</a> and <a href="http://www.logicmonitor.com/monitoring/databases/redis/">NoSQL</a>; another, marketing guys talking about lead generation and AdWords, etc.  This time we&#8217;d arrived at the topic of programming languages, and the trade-off between them: assembler or C gives you code speed and efficiency at the expense of programmer efficiency, while languages and frameworks with higher levels of abstraction, like Ruby on Rails or Python/Django, give you much better programmer productivity at the expense of code efficiency.</p>
<p>Someone asked if that abstraction and inefficiency matters.  As in most operational issues, it matters only if it matters.  By which I mean, if you are writing a system that is lightly used, or that runs on powerful hardware, it may not matter at all. But if you suddenly have an increased workload, it may matter a lot. (See the early occurrences of Twitter&#8217;s fail whale and RoR scaling.)</p>
<p>Then the question was asked: how can you know whether you are improving things when you change code?  Trend it, of course. You probably know what will constrain your application performance. (If not, you need <a href="http://www.LogicMonitor.com/">better monitoring</a>.)  For many sites, an obvious constraint is likely to be database queries per second.  So plot database queries per web request over time.<span id="more-480"></span></p>
<p>For example, at LogicMonitor, we track database activity, correlated with monitoring data being reported back from customers, with a custom dashboard graph defined like the one below. It simply divides MySQL questions per second by the number of datasets being reported back per second:</p>
<p><img class="alignnone size-full wp-image-490" title="DevOps Graph" src="http://blog.logicmonitor.com/wp-content/uploads/2012/01/DevOps-Graph1.png" alt="DevOps graph definition" width="404" height="300" /></p>
<p>And the result is:</p>
<p><img class="wp-image-483 alignnone" title="DevOps graph 2" src="http://blog.logicmonitor.com/wp-content/uploads/2012/01/DevOps-graph-2.png" alt="DevOps graph 2" width="758" height="367" /><br />
So it&#8217;s clear that in early October there was a code change that resulted in about a 50% reduction in the number of database lookups per reported dataset. Then another change made things worse, and was reverted.</p>
<p>Graphs like these are essential for tracking the performance implications of code changes. Given that this server has a constantly varying load, it&#8217;s hard to tell much just by looking at the MySQL graphs &#8211; there may be a large change in DB questions, but caused by changes to load, not code. By relating the load driver (web requests, or reported data) to the constraint (DB queries in this case), you can assess just the code impacts, not the load.</p>
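<p>As a rough illustration of the arithmetic behind such a graph, here is a short Python sketch. It is not the dashboard definition itself: the function name and all of the numbers are made up, and it simply divides the constraint by the load driver for each interval.</p>
<pre class="programlisting"># Sketch only: the metric names and sample numbers are hypothetical.
def queries_per_dataset(db_questions_per_sec, datasets_per_sec):
    """Divide the constraint (DB questions/s) by the load driver (datasets/s)."""
    ratios = []
    for questions, datasets in zip(db_questions_per_sec, datasets_per_sec):
        # Report None for intervals with no load, so we never divide by zero.
        ratios.append(questions / datasets if datasets else None)
    return ratios


# Load doubles in the third interval; a code change halves the lookups in the fourth.
questions = [400.0, 420.0, 800.0, 400.0]
datasets = [100.0, 105.0, 200.0, 200.0]
print(queries_per_dataset(questions, datasets))</pre>
<p>In this made-up data the raw question rate doubles in the third interval, but the ratio stays flat, so that jump is load. Only the fourth interval, where the ratio drops, reflects a change in the code.</p>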
<p>And the beauty of this approach is it lets you tie together your real inputs to various constraints.  You could chart:</p>
<ul>
<li>web requests to web server CPU</li>
<li>web requests to DB operations (i.e. like relating requests to questions, but counting only the questions that the query cache could not answer).</li>
<li>web requests to NetApp disk operations for the volume holding the main DB</li>
<li>any primary input to any constraint you can think of.</li>
</ul>
<p>You&#8217;ll see if your code is making your requests more or less efficient, and thus impacting your scalability. You&#8217;ll see if infrastructure changes (deploying memcached, say) are making significant differences.</p>
<p>If you want to scale, and do it without wasting a ton of money on hardware, and have real actionable information to take to your development teams to help improve code &#8211; this approach is a great step.</p>
<p>And of course, if you&#8217;re a LogicMonitor customer, and want help setting up this kind of trending or dashboards &#8211; we&#8217;ll work with you to set it up, free.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2012/01/21/metrics-for-devops/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to minimize the impacts of the next Amazon reboot .. or of your own datacenter failure</title>
		<link>http://blog.logicmonitor.com/2012/01/06/how-to-minimize-the-impacts-of-the-next-amazon-reboot-or-of-your-own-datacenter-failure/</link>
		<comments>http://blog.logicmonitor.com/2012/01/06/how-to-minimize-the-impacts-of-the-next-amazon-reboot-or-of-your-own-datacenter-failure/#comments</comments>
		<pubDate>Fri, 06 Jan 2012 22:45:41 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=461</guid>
		<description><![CDATA[So as everyone knows, Amazon rebooted virtually all EC2 instances in December.  They emailed people to notify them, but not everyone read the emails, leading to Amazon performing the reboots on their own schedule, with the customers unaware. For some SaaS companies, this resulted in many hours of downtime. For others, there was a short [...]]]></description>
			<content:encoded><![CDATA[<p>So as everyone knows, Amazon rebooted virtually all EC2 instances in December.  They emailed people to notify them, but not everyone read the emails, leading to Amazon performing the reboots on their own schedule, with the customers unaware.</p>
<p>For some SaaS companies, this resulted in many hours of downtime. For others, there was a short impact. What was the difference?<span id="more-461"></span></p>
<p><strong>Time to Notification</strong></p>
<p>One of the main differentiators was where monitoring was running.  Some companies ran their own monitoring within EC2 &#8211; so when their instances were rebooted, so was their monitoring. If it did not automatically recover, they had several hours of outage before they even knew they had an issue. (As their monitoring was down, they had nothing to alert them to the rest of their infrastructure being down too.)</p>
<p><strong>Time to Remediation</strong></p>
<p>Once people knew their servers were down, those with monitoring hosted at EC2 knew they had an issue, but were blind as to the status of their other systems until they recovered their monitoring systems.  In some cases we heard about, this added several hours to the recovery time.  Those using <a title="monitoring as a service" href="http://www.logicmonitor.com/" target="_blank">monitoring as a service</a> had immediate visibility into which systems and services had recovered, and which had not and needed their attention &#8211; then they could quickly focus on recovering their databases, or <a href="http://www.logicmonitor.com/monitoring/databases/redis/">Redis</a> systems, or whatever was needed.</p>
<p><strong>The cloud or your datacenter?</strong><br />
<img class="size-full wp-image-462 alignright" style="border-style: initial; border-color: initial;" title="toast" src="http://blog.logicmonitor.com/wp-content/uploads/2012/01/toast.jpg" alt="" width="200" height="113" /><br />
Of course, exactly the same issues apply to outages of your own datacenter.  Even the best-run datacenters can lose power or network connectivity. It should never happen with A/B power and redundant networks &#8211; but it does.  Having monitoring as a service is just as important whether you have one datacenter (which is analogous to an EC2 region) or several.  You want your monitoring outside all your datacenters (otherwise no doubt the one hosting the monitoring will be the one that fails, following the law of toast falling jam side down).</p>
<p>And you want your monitoring to be available immediately, as soon as the power or network recovers, so you can know what to focus on to restore service.</p>
<p><img class="alignleft size-full wp-image-469" style="border-style: initial; border-color: initial;" title="functional groups" src="http://blog.logicmonitor.com/wp-content/uploads/2012/01/functional-groups.png" alt="" width="210" height="161" /><strong>Organize to speed remediation</strong></p>
<p>Your infrastructure has dependencies. No point in trying to bring up your database if its data is stored on a <a href="http://www.logicmonitor.com/monitoring/storage/netapp-filers/">NetApp</a> that is in a cluster failure state.  So have a view in your monitoring that lets you see things by functional groups, and lets you assess whether the prerequisite systems are up.</p>
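<p>If you want to write that recovery order down rather than keep it in your head, a dependency map and a topological sort will do it. The following is only a Python sketch under assumed names: the four systems and their dependencies are hypothetical, and it uses the standard library graphlib module (Python 3.9 or later).</p>
<pre class="programlisting">from graphlib import TopologicalSorter

# Sketch only: these system names and dependencies are hypothetical.
# Each key depends on the systems listed in its set.
depends_on = {
    "web": {"app"},
    "app": {"database"},
    "database": {"netapp"},
    "netapp": set(),
}

# static_order() yields prerequisites before the systems that need them.
recovery_order = list(TopologicalSorter(depends_on).static_order())
print(recovery_order)</pre>
<p>For this simple chain the storage comes first and the web tier last, which is the order you would work through the functional groups in your monitoring.</p>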
<p><strong>Practice</strong></p>
<p>Practice, as the saying goes, makes perfect.  Run DR drills.  Spin up a cloud-based replica, and rudely shut it down. Make sure you know the order of your system&#8217;s dependencies.  See how long it takes you to recover. Get used to using your monitoring to guide you in what to address next.</p>
<p>Now imagine how long it would take to recover without monitoring visibility.</p>
<p><strong>Key takeaway</strong>: Move your monitoring away from a premises-based system, onto Monitoring as a Service, before your next datacenter- or cloud-impacting event.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2012/01/06/how-to-minimize-the-impacts-of-the-next-amazon-reboot-or-of-your-own-datacenter-failure/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eating our own Tomcat Monitoring dogfood</title>
		<link>http://blog.logicmonitor.com/2011/12/24/eating-our-own-tomcat-monitoring-dogfood/</link>
		<comments>http://blog.logicmonitor.com/2011/12/24/eating-our-own-tomcat-monitoring-dogfood/#comments</comments>
		<pubDate>Sun, 25 Dec 2011 05:27:49 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[application monitoring]]></category>
		<category><![CDATA[jmx]]></category>
		<category><![CDATA[tomcat]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=448</guid>
		<description><![CDATA[We received some alerts tonight that one Tomcat server was using about 95% of its configured thread maximum. The Tomcat process on http-443  on prod4 now has 96.2 % of the max configured threads in the busy state. These were SMS alerts, as that was close enough to exhausting the available threads to warrant waking someone up [...]]]></description>
			<content:encoded><![CDATA[<p>We received some alerts tonight that one Tomcat server was using about 95% of its configured thread maximum.</p>
<pre class="programlisting">The Tomcat process on http-443  on prod4 now has 96.2 %
of the max configured threads in the busy state.</pre>
<p>These were SMS alerts, as that was close enough to exhausting the available threads to warrant waking someone up if needed.</p>
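<p>The check behind an alert like this is simple arithmetic. Here is a hedged Python sketch of it: the 95% threshold, the function names and the thread counts are assumptions for illustration, not our actual alert definition.</p>
<pre class="programlisting"># Sketch only: the threshold and the thread counts are hypothetical.
def busy_thread_percent(busy_threads, max_threads):
    """Percentage of the configured thread maximum that is currently busy."""
    return 100.0 * busy_threads / max_threads


def should_alert(busy_threads, max_threads, threshold_percent=95.0):
    """True when the busy percentage reaches the alert threshold."""
    return busy_thread_percent(busy_threads, max_threads) >= threshold_percent


# 481 busy threads out of a configured maximum of 500.
print(round(busy_thread_percent(481, 500), 1), should_alert(481, 500))</pre>
<p>Alerting on the percentage rather than on a raw thread count means the same rule keeps working when the configured maximum is changed.</p>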
<p>The other alert we got was that Tomcat was taking an unusual time to process requests, as seen in this graph:<span id="more-448"></span><br />
<img class="alignnone size-full wp-image-450" title="Tomcat processing time" src="http://blog.logicmonitor.com/wp-content/uploads/2011/12/Tomcat-processing-time.png" alt="Tomcat processing time" width="504" height="251" /><br />
Normally we process requests in about 7ms, so 40 ms is a significant degradation.</p>
<p>Nothing was obviously wrong &#8211; our internal metrics showed the system processing no more data feeds, nor an unusual number of requests, nor were the disks or CPU any more heavily utilized than normal.</p>
<p>But as we trend all sorts of metrics, one did jump out, even though it was not alerting:<br />
<img class="alignnone size-full wp-image-452" title="TCP Retransmissions" src="http://blog.logicmonitor.com/wp-content/uploads/2011/12/TCP-Retransmissions.png" alt="TCP Retransmissions" width="502" height="233" /></p>
<p>Turns out there was some minor packet loss between this datacenter and some customers in an upstream ISP. Not enough to trigger any network alarms. No loss or discards on our hosts or networks. But enough that some connections to Tomcat were subject to some loss, which meant retransmissions, which meant connections were taking longer to complete than usual, which meant the server had to deal with more concurrent connections, and was in danger of running out of threads.</p>
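<p>The link between slower requests and busier threads is Little&#8217;s law: average concurrency equals arrival rate times time in the system. A rough Python sketch, assuming each request holds one thread for its whole processing time; the request rate is a made-up figure, and only the two latencies come from the graph above.</p>
<pre class="programlisting"># Sketch only: the request rate is hypothetical; the latencies are the 7 ms and 40 ms above.
def expected_busy_threads(requests_per_sec, seconds_per_request):
    """Little's law: average concurrency = arrival rate x time in system."""
    return requests_per_sec * seconds_per_request


rate = 1000.0  # hypothetical steady request rate
normal = expected_busy_threads(rate, 0.007)    # about 7 ms per request
degraded = expected_busy_threads(rate, 0.040)  # about 40 ms per request
print(round(normal, 1), round(degraded, 1), round(degraded / normal, 1))</pre>
<p>Whatever the real request rate was, the ratio is what matters: at the same load, going from 7 ms to 40 ms per request means nearly six times as many requests in flight at once.</p>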
<p>So, the simple fix: increase the maximum number of connections Tomcat supports. We have plenty of resources available. You can see that max threads was increased at 16:48 to deal with the longer sessions, giving a bit of headroom. (It was a conservative increase, as it hadn&#8217;t been thoroughly tested &#8211; we&#8217;ll be testing and rolling out bigger max limits soon, just to give more space.)<br />
<img class="alignnone size-full wp-image-453" title="Tomcat threads" src="http://blog.logicmonitor.com/wp-content/uploads/2011/12/Tomcat-threads.png" alt="Tomcat threads" width="504" height="225" /></p>
<p>That allowed Tomcat enough spare threads to remove our alerts and any danger of refusing connections, and soon enough the upstream ISP resolved the loss issue, so session response time &#8211; and thus active threads &#8211; dropped to normal for this server.</p>
<p>Without LogicMonitor watching our own servers &#8211; well, we probably would have been ignorant of Tomcat getting close to exhausting the maximum available threads, and not alerted at all. Which may have led to a more peaceful Christmas Eve, but may also have led to angry customers who couldn&#8217;t access their monitoring, or who reported gaps in data after the fact &#8211; when we would have had no idea what the issue was, or how to prevent it.  Even if we had been alerted by customers while the issue occurred, without the TCP retransmissions indicating what the underlying cause was &#8211; we&#8217;d still have been in the dark.</p>
<p>Just an example of why we call LogicMonitor <em>disruptive <strong>I</strong>n<strong>T</strong>elligence.  </em>With LogicMonitor, we avoided an outage, and had the information to diagnose the issue within a few minutes, which let us get back to family time.  That&#8217;s disruptive &#8211; in the best way.</p>
<p>Happy Holidays.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2011/12/24/eating-our-own-tomcat-monitoring-dogfood/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Right sizing infrastructure for VMWare migrations</title>
		<link>http://blog.logicmonitor.com/2011/12/09/right-sizing-infrastructure-for-vmware-migrations/</link>
		<comments>http://blog.logicmonitor.com/2011/12/09/right-sizing-infrastructure-for-vmware-migrations/#comments</comments>
		<pubDate>Fri, 09 Dec 2011 20:03:29 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=438</guid>
		<description><![CDATA[I was invited to talk to an MSP peer group the other week, and during the presentation, one of the group members who was a LogicMonitor customer described a way they use LogicMonitor to solve a previously hard-to-solve VMWare operational issue. They provide migration of their customers&#8217; physical servers onto VMware infrastructure &#8211; either the customer&#8217;s own [...]]]></description>
			<content:encoded><![CDATA[<p>I was invited to talk to an MSP peer group the other week, and during the presentation, one of the group members who was a LogicMonitor customer described a way they use LogicMonitor to solve a previously hard-to-solve VMware operational issue.<span id="more-438"></span></p>
<p>They provide migration of their customers&#8217; physical servers onto VMware infrastructure &#8211; either the customer&#8217;s own VMware infrastructure, or the MSP&#8217;s own infrastructure if the customer is using their hosted offering.</p>
<p>One of the issues they had faced in the past was right-sizing the infrastructure that they were to migrate onto. If they were migrating 100 servers onto a <a href="http://www.logicmonitor.com/monitoring/virtualization/vmware/">VMware</a> infrastructure with <a href="http://www.logicmonitor.com/monitoring/storage/netapp-filers/">NetApp</a> storage &#8211; how much memory would be needed? How much CPU? How many disk IOps (input/output operations per second) would be needed to support the combined servers&#8217; workload?<br />
Getting the combined metrics from all these servers, covering peak and average usage, used to take them a long time, and the results were far from perfect.</p>
<p>But they use LogicMonitor&#8217;s flexible graphs to make this a simple process. They add the 100 servers to monitoring (if they are not already monitored &#8211; a netscan or import from a script makes this easy). They create three flexible graphs that show the aggregate of each of the three statistics (broken down by server so they can see the big consumers, or just in the aggregate, as they prefer). This takes about a minute, using wildcards. They come back in a day, or a week, and they have the maximum workload that the infrastructure needs to support.</p>
<div id="attachment_439" class="wp-caption alignnone" style="width: 480px"><img class="size-full wp-image-439" title="Aggregategraph" src="http://blog.logicmonitor.com/wp-content/uploads/2011/12/Aggregategraph.png" alt="Graph of all windows systems disk IOps" width="470" height="233" /><p class="wp-caption-text">Graph of all windows systems disk IOps</p></div>
<p>In the above summary graph of 84 servers, you can see that the combined disk IOs (read + write) of all logical disks on all volumes on all servers peak at around 2000 IO operations per second (at least for this day).</p>
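<p>The aggregation itself is nothing exotic. As a Python sketch of what such a graph computes, with hypothetical server names and readings (this is not how LogicMonitor implements flexible graphs):</p>
<pre class="programlisting"># Sketch only: the server names and readings are hypothetical.
def aggregate_iops(per_server):
    """Sum every server's IOps reading, interval by interval."""
    return [sum(readings) for readings in zip(*per_server.values())]


samples = {
    "server-a": [120, 300, 90],
    "server-b": [400, 950, 310],
    "server-c": [80, 150, 700],
}
combined = aggregate_iops(samples)
sum_of_peaks = sum(max(readings) for readings in samples.values())
print("combined:", combined, "peak:", max(combined), "sum of peaks:", sum_of_peaks)</pre>
<p>Summing interval by interval matters: servers rarely peak at the same moment, so the peak of the combined workload (1400 here) can be well below the total of each server&#8217;s individual peak (1950 here), and sizing to the latter would mean overbuying.</p>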
<p>Knowing this makes it easy to size the performance of the NetApp that these servers will be migrated onto. Without this data, it becomes a guessing game, which can result in overspending on capital, or underspending and poor performance.</p>
<p>So just another way LogicMonitor can save you time, or money.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2011/12/09/right-sizing-infrastructure-for-vmware-migrations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Use LogicMonitor, save the world.</title>
		<link>http://blog.logicmonitor.com/2011/11/25/use-logicmonitor-save-the-world/</link>
		<comments>http://blog.logicmonitor.com/2011/11/25/use-logicmonitor-save-the-world/#comments</comments>
		<pubDate>Sat, 26 Nov 2011 01:35:07 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=433</guid>
		<description><![CDATA[My wife was reading the science journal of UCSB (where she did her Masters degree) and pointed out an article referring to the fact that &#8220;a typical server consumes as much energy in a year as an SUV&#8221;. She then asked how many servers we have&#8230;. I found this a bit dismaying at first, as [...]]]></description>
			<content:encoded><![CDATA[<p>My wife was reading the science journal of UCSB (where she did her Master&#8217;s degree) and pointed out an article referring to the fact that &#8220;a typical server consumes as much energy in a year as an SUV&#8221;. She then asked how many servers we have&#8230;.</p>
<p>I found this a bit dismaying at first, as we try to engage in sustainable actions both personally (we have solar panels, two Prius cars, our own chickens &#8211; the typical urban hippie) and as a corporate entity at LogicMonitor (we recycle, encourage non-car transport, etc).  I didn&#8217;t like to think that our servers were undoing all the other environmental actions we were taking.</p>
<p>But on reflection, I realized that LogicMonitor is an environmental net positive.  Our servers are not only fairly energy efficient (using SSDs, which use about half the energy of rotational disks), but they are very heavily leveraged.  Each of our servers replaces about 100 individual servers that our customers would have deployed if they ran their own monitoring servers. (More, if they ran redundant monitoring servers like we do.)</p>
<p>So from that point of view, LogicMonitor is taking hundreds of SUVs off the road. All while freeing our customers from having to run, patch, and back up those servers, or extend their software to deal with new things like <a href="http://www.logicmonitor.com/monitoring/databases/mongodb-monitoring/">MongoDB monitoring</a>, and so on &#8211; we do all that for them.  So that means they can get out of the office earlier, and avoid the peak traffic times &#8211; saving more driving impact. Unless they have something else to do at work. <img src='http://blog.logicmonitor.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Happy Thanksgiving.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2011/11/25/use-logicmonitor-save-the-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Monitoring the right things</title>
		<link>http://blog.logicmonitor.com/2011/10/16/monitoring-the-right-things/</link>
		<comments>http://blog.logicmonitor.com/2011/10/16/monitoring-the-right-things/#comments</comments>
		<pubDate>Sun, 16 Oct 2011 02:57:50 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=420</guid>
		<description><![CDATA[I&#8217;ve talked about this before, but I just read an article about why application performance monitoring is so screwed up, and coincidentally had just talked about it in a lecture I gave to a graduate class at UCSB on scalable computing, so figured it&#8217;s worth a mention. The article mentions that &#8220;enterprises have confused (with [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve talked about this before, but I just read an article about why <a rel="nofollow" href="http://www.virtualizationpractice.com/blog/?p=12982&amp;utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=why-is-application-performance-management-so-screwed-up">application performance monitoring is so screwed up</a>, and coincidentally had just talked about it in a lecture I gave to a graduate class at UCSB on scalable computing, so figured it&#8217;s worth a mention.</p>
<p>The article mentions that &#8220;enterprises have confused (with vendor help) the notion of monitoring the resources that an application uses with its performance&#8221;.  The way I put it in my lecture was that:</p>
<ul>
<li>Systems are limited by disk IO, memory and, less commonly, CPU and network.</li>
<li>Users don&#8217;t care about disk/memory/CPU/network&#8230; They care about web pages, and speed.</li>
</ul>
<p>So&#8230; how to tie one to the other?</p>
<p>Monitor both.</p>
<p>Monitor what users care about (page load times, response time per request, etc.):</p>
<p><img class="alignnone size-full wp-image-421" title="response" src="http://blog.logicmonitor.com/wp-content/uploads/2011/10/response.png" alt="" width="523" height="335" /></p>
<p>Also monitor all the limiting resources (CPU, network, memory, and disk IO, or more importantly the percentage of time a drive is busy):</p>
<p><img class="alignnone size-full wp-image-423" title="drive busy time" src="http://blog.logicmonitor.com/wp-content/uploads/2011/10/drive-busy-time.png" alt="" width="784" height="396" /></p>
<p>And monitor the performance of the systems that affect the limiting resources:</p>
<p><img class="alignnone size-full wp-image-424" title="Innodb reads" src="http://blog.logicmonitor.com/wp-content/uploads/2011/10/Innodb-reads.png" alt="" width="743" height="335" /></p>
<p>So while <a href="http://www.logicmonitor.com/monitoring/applications/memcached/">monitoring InnoDB</a> file system reads does not tell you anything that an end user cares about, it is good to have that data when your monitoring of Tomcat request time shows that users are experiencing poor performance, your logical drives are suddenly 100% busy, and request service time is increasing. The cause may be InnoDB buffer misses, or it may be something else, but having this intermediate data will drastically reduce the time it takes to correct the issue that users care about: response time.</p>
<p>Another point to note: the &#8220;user&#8221; in the phrase &#8220;monitor what users care about&#8221; may not be a human.  If a server is a <a href="http://www.logicmonitor.com/monitoring/applications/memcached/">memcached</a> server, its users are web servers, which care about memcached response time, availability and hit rates.  So on this class of machines, those are the things to monitor to determine whether the service is meeting the needs of its users.</p>
<p>In short, for every machine, identify the &#8220;thing(s) to care about&#8221; for it; monitor those things; monitor the constrained resources; and monitor all aspects of the systems on that server that impact the constrained resources.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2011/10/16/monitoring-the-right-things/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Powershell, snap ins and 32 bit apps</title>
		<link>http://blog.logicmonitor.com/2011/10/01/powershell-snap-ins-and-32-bit-apps/</link>
		<comments>http://blog.logicmonitor.com/2011/10/01/powershell-snap-ins-and-32-bit-apps/#comments</comments>
		<pubDate>Sat, 01 Oct 2011 02:05:39 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=414</guid>
		<description><![CDATA[A more technical article today. In adding some more Exchange Monitoring we ran into some issues, and solutions, that may help others.  Some things in recent Exchange versions can only be monitored by Powershell. (Perfmon, WMI, Powershell, all needed for different versions of Exchange&#8230;. I wish they&#8217;d make up their mind&#8230;) So the first issue [...]]]></description>
			<content:encoded><![CDATA[<p>A more technical article today.</p>
<p>In adding some more <a href="http://www.logicmonitor.com/monitoring/applications/mailserver-monitoring/microsoft-exchange-monitoring/">Exchange Monitoring</a> we ran into some issues, and solutions, that may help others.  Some things in recent Exchange versions can only be monitored by Powershell. (Perfmon, WMI, Powershell, all needed for different versions of Exchange&#8230;. I wish they&#8217;d make up their mind&#8230;)</p>
<p>So the first issue was that Powershell scripts, when called from a LogicMonitor agent, never returned. This wasn&#8217;t too hard to fix: pass the parameter -inputformat with the (undocumented) option &#8220;none&#8221;, and the agent can successfully run Powershell commands:</p>
<p>powershell -inputformat none dbstatus.ps1</p>
<p>(Why? The Microsoft.PowerShell.ConsoleHost class constructs a M.PS.WrappedDeserializer passing the STDIN TextReader as one of the parameters. By default, the WrappedDeserializer will call ReadLine() on this STDIN TextReader and wait indefinitely, effectively hanging PowerShell and the calling process. That&#8217;s why.)</p>
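<p>As a rough illustration of the fix above, here is a minimal sketch of launching Powershell from another process without letting it wait on STDIN. It is written in Python rather than in the agent&#8217;s own Java, and the function names and the timeout value are assumptions made for the example, not the agent&#8217;s actual code. Only the &#8220;-inputformat none&#8221; argument comes from the text above.</p>

```python
import subprocess


def build_powershell_command(script_path):
    """Return the argument list for a non-interactive Powershell run.

    "-inputformat none" is the option described above; it keeps
    Powershell from blocking on the parent's STDIN.
    """
    return ["powershell", "-inputformat", "none", script_path]


def run_powershell(script_path, timeout_seconds=60):
    """Run the script and return (exit_code, combined_output).

    Hypothetical helper: the timeout is an assumed safety net so the
    caller is never blocked forever if Powershell still hangs.
    """
    completed = subprocess.run(
        build_powershell_command(script_path),
        stdin=subprocess.DEVNULL,   # do not hand our own STDIN to the child
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # fold errors into the same output stream
        timeout=timeout_seconds,
        text=True,
    )
    return completed.returncode, completed.stdout
```

<p>Calling run_powershell("dbstatus.ps1") on a Windows host would return the exit code and whatever the script printed; build_powershell_command can be checked on any platform.</p>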
<p>With that hurdle cleared, we hit the next one:<br />
&gt;&gt; powershell -inputformat none dbstatus.ps1<br />
Add-PSSnapin : No snap-ins have been registered for Windows PowerShell version 2.</p>
<p>Yet running the exact same command from the command shell on the host running the agent produced the output we were expecting. And we could see that the Exchange snap-in, called by the Powershell script, was correctly registered, and in fact worked fine.</p>
<p>But our agent was running on a 32 bit JVM, and Exchange 2010 (in our lab, at least) is installed on 64 bit Windows. The Powershell snap-in was only visible when Powershell was started from a 64 bit app. When I started Powershell from the cmd.exe in <em>SysWOW64</em>, I got the same error about missing snap-ins as our agent reported.</p>
<p>The solution: it doesn&#8217;t matter that our agent was installed as a 32 bit app, in Program Files (x86). What matters is that the Java virtual machine launched by the agent, which ultimately launches Powershell, is a 64 bit JVM, not the default 32 bit JVM installed from Java.com. (At least, a 32 bit JVM is the default when you browse to Java.com with a 32 bit browser.)</p>
<p>So, running the LogicMonitor agent with a 64 bit JVM, and starting Powershell with &#8220;-inputformat none&#8221;, gives us full access to Powershell output and all its snap-ins, so expect some datasources to be released very shortly that take advantage of that.</p>
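<p>The rule above can be written down as a small pre-flight check. This is only a sketch in Python (the agent itself runs on a JVM), and the function names are hypothetical. It models the behaviour described in this post, where a 32 bit parent process on 64 bit Windows ends up with a Powershell that cannot see snap-ins registered only for 64 bit.</p>

```python
import struct


def process_bits():
    """Pointer size of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def sees_64bit_snapins(launcher_bits, os_bits):
    """True if Powershell started by this launcher runs as a 64 bit process.

    Simplified model of the behaviour described above: only a 64 bit
    launcher on a 64 bit OS gets the 64 bit Powershell and its snap-ins.
    """
    return os_bits == 64 and launcher_bits == 64


def preflight_message(launcher_bits, os_bits):
    """Return a warning string, or None when nothing looks wrong."""
    if os_bits == 64 and launcher_bits == 32:
        return ("32 bit launcher on 64 bit Windows: snap-ins registered "
                "only for 64 bit Powershell will not be visible")
    return None
```

<p>For example, preflight_message(process_bits(), 64) returns a warning when the check itself is run from a 32 bit process, and None from a 64 bit one.</p>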
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2011/10/01/powershell-snap-ins-and-32-bit-apps/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hosted monitoring in a hurricane</title>
		<link>http://blog.logicmonitor.com/2011/09/13/hosted-monitoring-in-a-hurricane/</link>
		<comments>http://blog.logicmonitor.com/2011/09/13/hosted-monitoring-in-a-hurricane/#comments</comments>
		<pubDate>Tue, 13 Sep 2011 17:19:05 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=408</guid>
		<description><![CDATA[One of our customer acquisitions recently came about because the company wanted to be assured of their I.T. infrastructure&#8217;s availability during hurricane Irene. Their datacenter was located in the impact area, and obviously a premise based monitoring could not be relied on to alert them of any impacts, if the monitoring system itself was going [...]]]></description>
			<content:encoded><![CDATA[<p>One of our customer acquisitions recently came about because the company wanted to be assured of their I.T. infrastructure&#8217;s availability during Hurricane Irene. Their datacenter was located in the impact area, and obviously a premises-based monitoring system could not be relied on to alert them of any impacts if the monitoring system itself was going to be impacted.</p>
<p>They contacted us two days before the hurricane, and were completely monitoring all their infrastructure that same day, with the alerts coming from our datacenters, not theirs.</p>
<p>They and their infrastructure came through Irene unscathed &#8211; and they knew that they did, because they used a <a href="http://www.logicmonitor.com/">hosted monitoring</a> system. Had they used a premises-based monitoring system, they would not have known whether a lack of alerts meant that all was well or that the monitoring system itself had been flooded or cut off from the internet.</p>
<p>So while enhanced disaster preparedness is not usually the way we sell our value, it&#8217;s certainly a nice bonus.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2011/09/13/hosted-monitoring-in-a-hurricane/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Netapp Monitoring &#8211; too much of a good thing?</title>
		<link>http://blog.logicmonitor.com/2011/09/13/netapp-monitoring-too-much-of-a-good-thing/</link>
		<comments>http://blog.logicmonitor.com/2011/09/13/netapp-monitoring-too-much-of-a-good-thing/#comments</comments>
		<pubDate>Tue, 13 Sep 2011 16:56:17 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.logicmonitor.com/?p=396</guid>
		<description><![CDATA[One way LogicMonitor is different from other NetApp monitoring systems (other than being hosted monitoring, and being able to monitor the complete array of systems found in a datacenter &#8211; from AC units, through virtualization, OS&#8217;s to applications like MongoDB) is that we default to &#8220;monitoring on&#8221;. i.e. we assume you want to monitor everything, [...]]]></description>
			<content:encoded><![CDATA[<p>One way LogicMonitor is different from other <a href="http://www.logicmonitor.com/monitoring/storage/netapp-filers/">NetApp monitoring</a> systems (other than being hosted monitoring, and being able to monitor the complete array of systems found in a datacenter &#8211; from AC units, through virtualization, OS&#8217;s to applications like MongoDB) is that we default to &#8220;monitoring on&#8221;.</p>
<p>That is, we assume you want to monitor everything, always. (You can of course turn off monitoring or alerting for specific groups, hosts or objects.)  This almost always serves us well. We will detect a new volume on your NetApp once you create it, and start monitoring it for read and write latency, number and type of operations, etc. This means that when you have an issue on that volume (or other groups are blaming storage performance for an issue), you already have monitoring of all sorts of metrics in place <strong>before</strong> the issue, so you have the data and alerts to know whether or not the storage was the cause.</p>
<p>However, we have found some cases where this doesn&#8217;t work so well. We have been monitoring NetApp disk performance by default, too, tracking the number of operations and the busy time for each disk.  This is useful for identifying when disks need to be rebalanced (if the top 10 and bottom 10 disks by busy time are wildly different). However, customers with larger NetApps often have hundreds of disks, each of which we would monitor via API queries. And while we only monitor the performance of a disk every 5 minutes (as opposed to volumes and PAM cards and things that are monitored more frequently), this apparently overloads the API subsystem of NetApp devices.</p>
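<p>To make the top 10 versus bottom 10 comparison mentioned above concrete, here is a minimal sketch in Python. The function names and the 3x ratio threshold are assumptions chosen for illustration; the post does not say what counts as &#8220;wildly different&#8221;, and this is not LogicMonitor&#8217;s implementation.</p>

```python
def busy_time_spread(busy_percent_by_disk, group_size=10):
    """Return (average of the busiest group, average of the least busy group).

    busy_percent_by_disk maps a disk name to its busy time in percent.
    """
    values = sorted(busy_percent_by_disk.values())
    if not values:
        raise ValueError("no disks supplied")
    n = min(group_size, len(values))
    bottom = values[:n]     # least busy disks
    top = values[-n:]       # busiest disks
    return sum(top) / n, sum(bottom) / n


def needs_rebalance(busy_percent_by_disk, group_size=10, max_ratio=3.0):
    """True if the busiest group is more than max_ratio times as busy
    as the least busy group. max_ratio is an assumed threshold."""
    top_avg, bottom_avg = busy_time_spread(busy_percent_by_disk, group_size)
    if bottom_avg == 0:
        return top_avg > 0
    return top_avg / bottom_avg > max_ratio
```

<p>With ten disks at 90% busy and ten at 10% busy, needs_rebalance reports True. With every disk at 50%, it reports False.</p>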
<p>We&#8217;d see that when we restarted the collection process, and the only monitoring via the API was for volume performance, things worked great &#8211; the response to an API request from a NetApp took around 100 ms.</p>
<p>When the disk requests started getting added in (and we stagger and skew the requests, so they are not all hitting at once), the API response time for a single query climbed to 40 seconds.</p>
<p>This caused a backlog of monitoring, and caused data to be missed in the more important volume performance metrics.</p>
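<p>A back-of-the-envelope calculation shows why slow responses turn into a backlog. The sketch below is in Python. The 5 minute interval, the 100 ms response and the 40 second response come from this post, but the disk count of 300 and the 10 concurrent queries are hypothetical numbers picked only for the example.</p>

```python
def poll_cycle_seconds(num_disks, seconds_per_query, concurrent_queries):
    """Wall-clock time needed to query every disk once."""
    return num_disks * seconds_per_query / concurrent_queries


def falls_behind(num_disks, seconds_per_query, concurrent_queries,
                 interval_seconds=300):
    """True if one full pass over the disks takes longer than the
    polling interval, so each cycle starts later than the last."""
    cycle = poll_cycle_seconds(num_disks, seconds_per_query, concurrent_queries)
    return cycle > interval_seconds


# Hypothetical filer: 300 disks, 10 queries in flight at a time.
healthy = falls_behind(300, 0.1, 10)    # 100 ms per query
overloaded = falls_behind(300, 40, 10)  # 40 s per query
```

<p>With these assumed numbers, a pass at 100 ms per query finishes in a few seconds, while a pass at 40 seconds per query needs 1200 seconds, four times the 5 minute interval, so the collector can never catch up.</p>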
<p>So&#8230; while we&#8217;ll open a case with NetApp, in the interim, we&#8217;ll probably disable the monitoring of physical disk performance by default to avoid this issue.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.logicmonitor.com/2011/09/13/netapp-monitoring-too-much-of-a-good-thing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

