Friday, July 06, 2012

Found my first problem with vCOps

I set up an evaluation of vCenter Operations (vCOps) in a test environment about a week ago and have been letting it run. This morning I came in and found the alert below in my email:
New alert was generated at Fri Jul 06 05:12:21 PDT 2012:
Info:Object`s demand is 104.0 percent of its available resource capacity. Disk I/O is the most constrained resource.

Alert Type : Health
Alert Sub-Type : Workload
Alert State : Critical
Resource Kind : Datastore
Resource Name : TKPD_T2_J250003_02
Alert ID : 1806

VCOps Server - 10.89.12.104
Alert details
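Since the alert body is just "Key : Value" lines, it is easy to pull the fields out for sorting or filtering. Here is a minimal sketch in Python; the function name `parse_alert` is my own and not part of vCOps, and I am assuming every alert email uses the same layout as the one above.

```python
# Alert fields copied from the email above.
ALERT_TEXT = """\
Alert Type : Health
Alert Sub-Type : Workload
Alert State : Critical
Resource Kind : Datastore
Resource Name : TKPD_T2_J250003_02
Alert ID : 1806
"""


def parse_alert(text):
    """Return a dict of the 'Key : Value' pairs found in an alert email body.

    Hypothetical helper: assumes each field sits on its own line and uses
    ' : ' (space, colon, space) as the separator, as in the sample above.
    """
    fields = {}
    for line in text.splitlines():
        # Split on the first ' : ' only, so values containing colons survive.
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


alert = parse_alert(ALERT_TEXT)
print(alert["Resource Kind"], alert["Resource Name"], alert["Alert State"])
```

Lines without the separator (such as the "New alert was generated at ..." header) are simply skipped.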


Always concerned about storage performance, I began digging into the VMware performance stats for the datastore and identified ESX host TKJ1721VK as the likely culprit.

There were only two VMs on that datastore/host combination: WTSQLSRV2 and WTDPW065C. A quick glance at the perf stats for each VM showed WTSQLSRV2 to be the cause. Because this is a SQL server, I am assuming a scheduled job launches on it at 5am and causes the CPU/memory/disk utilization to spike.
Looking through the other alerts I received this morning, I now see that there were alerts for both the VM WTSQLSRV2 and host TKJ1721VK at approximately the same time.
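In hindsight, grouping the morning's alerts by time would have gotten me to the VM and host faster. Here is a rough sketch of that idea in Python. Only the datastore timestamp comes from the email above; the other timestamps, the resource kind labels, the extra "EXAMPLE_OTHER_VM" record, and the `alerts_near` helper are made up for illustration.

```python
from datetime import datetime, timedelta

# (resource kind, resource name, alert time)
alerts = [
    # Real timestamp, taken from the alert email.
    ("Datastore", "TKPD_T2_J250003_02", datetime(2012, 7, 6, 5, 12, 21)),
    # Assumed timestamps: I only know these fired at roughly the same time.
    ("VirtualMachine", "WTSQLSRV2", datetime(2012, 7, 6, 5, 10, 0)),
    ("HostSystem", "TKJ1721VK", datetime(2012, 7, 6, 5, 11, 0)),
    # Entirely hypothetical unrelated alert, to show the filter working.
    ("VirtualMachine", "EXAMPLE_OTHER_VM", datetime(2012, 7, 6, 9, 0, 0)),
]


def alerts_near(records, anchor, window=timedelta(minutes=10)):
    """Return the records whose alert time is within `window` of `anchor`."""
    return [r for r in records if abs(r[2] - anchor) <= window]


datastore_time = alerts[0][2]
for kind, name, when in alerts_near(alerts, datastore_time):
    print(when.strftime("%H:%M:%S"), kind, name)
```

The ten-minute window is an arbitrary choice; anything that fired close to the datastore alert is worth a look.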

This is a fairly simple alert to resolve: the fix is to tune SQL and/or move the VM to dedicated storage. It does show that vCOps has some merit, though it took a lot more steps to identify the cause than I would have expected.
