
The value of IPMI monitoring

Filed under Best Practices, Tips & Troubleshooting, Virtualization.

Among its many monitoring methods, LogicMonitor supports IPMI. Many people either aren’t aware of IPMI or don’t think it’s necessary. And while I’m certainly an advocate of avoiding unnecessary complexity in a data center, sometimes it is good to wear both a belt and suspenders.

A real-life example from one of our own data centers conveniently occurred just this morning, while I was looking for something to blog about.

We received several email alerts like the one below:

Host: console.lab2.sjc.logicmonitor.com
Eventsource: IPMI SEL Logs-  BMC  Battery 0x11 Failed 6f [01 ff ff]
Level: error
Detected on: 2012-03-23 08:45:52 PDT

Looking at the host in question in our monitoring portal showed the repeated events:

[Screenshot: IPMI alerts]

And logging in to the device itself – a Dell DRAC card – shows the events logged directly:

[Screenshot: DRAC log]

This particular device was the DRAC of a Dell server running VMware ESXi – which of course was also monitored by LogicMonitor.

However, the hardware monitoring for the ESX host was not reporting any issues at all through vCenter or LogicMonitor – even though this specific component is monitored and reported by the ESXi API.

My guess is that the battery issues are so transient – you can see from the DRAC logs that they cleared themselves within 5 seconds – that the ESX hardware monitoring never picked them up.
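To make that guess concrete, here is a minimal Python sketch of why a periodically sampled status can miss a fault that clears within 5 seconds, while an event log like the one on the DRAC still has it. This is a toy model, not how ESX or LogicMonitor actually collects data: the 60-second polling interval and the function names are assumptions for illustration only. Only the 5-second duration comes from the DRAC logs.

```python
"""Toy model: periodic status polling vs. a persistent event log.

Assumptions (not from the post): the hardware status is sampled once
every POLL_INTERVAL seconds, and that interval is 60 seconds.
Only the 5-second fault duration comes from the DRAC log.
"""

FAULT_DURATION = 5    # seconds the battery fault stayed asserted (from the DRAC log)
POLL_INTERVAL = 60    # hypothetical polling interval, in seconds


def poll_sees_fault(fault_start, horizon=300):
    """True if any poll instant (0, POLL_INTERVAL, ...) lands inside the fault window."""
    fault_end = fault_start + FAULT_DURATION
    return any(fault_start <= t < fault_end
               for t in range(0, horizon + 1, POLL_INTERVAL))


def event_log_sees_fault(fault_start, read_time):
    """True if the log is read at or after the moment the fault was recorded.

    An event log keeps the entry after the fault clears, so the timing
    of the read relative to the fault's end does not matter.
    """
    return read_time >= fault_start


if __name__ == "__main__":
    # A fault that starts 2 seconds after a poll is gone before the next one.
    print("poll sees fault at t=62:", poll_sees_fault(62))
    print("log read at t=300 sees it:", event_log_sees_fault(62, 300))

    # Try every whole-second start offset within one polling interval.
    caught = sum(poll_sees_fault(start) for start in range(POLL_INTERVAL))
    print(f"polling catches {caught} of {POLL_INTERVAL} possible start offsets")
```

Under these assumed numbers, a poll only lands inside the fault window for 5 of the 60 possible start offsets, whereas a later read of the event log finds the entry every time.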

So in this case, having IPMI monitoring alongside the regular ESX hardware monitoring allowed us to identify this issue much sooner. Now we can open a case with Dell and have the issue remedied, and we can migrate VMs to other ESX servers to avoid any impact. It is likely that the ESX software will notice the storage controller battery issues once they become severe enough, at which point LogicMonitor will alert on them – but I’d rather be aware of issues that could impact the availability and performance of my ESX hosts as soon as possible. (Performance could be impacted because the controllers will almost certainly switch to write-through mode, instead of using the NVRAM cache to accelerate writes, when the storage controller battery fails.)

How many servers do you have where IPMI – or LogicMonitor’s other extensive monitoring methods – can help you avoid performance and availability issues?

Update: vCenter finally noticed the issue – about 20 hours later. Even then, vCenter only reports the issue as an error in “VMware Rollup Health State”, with no details as to what the issue is.


2 Responses to “The value of IPMI monitoring”

  1. Mike Horwath says:

    When adding IPMI to a physical server, will I need to create yet another host in the portal, or can I add the datasource and update the IP address for the IPMI interface?

    • admin says:

      As is so often the case, the answer is “It depends”.

      The IPMI collector collects data from the IP Address or DNS name of the host it is associated with (you cannot have it associated with host X but collecting data from host Y).

      Whether this is a separate host or not depends on how IPMI is configured. On some systems it’s a separate device with a separate IP for the management card; on some the management card can be configured to share the IP of the server itself. (In which case the IPMI monitoring will magically just show up when the credentials work.)
