Wednesday, November 11, 2009

VMware guest level monitoring and alerting

Probably 90% of the monitoring needed in any environment consists of extremely basic measures: CPU utilization, memory utilization, disk throughput, network throughput, and so on. Defining thresholds for these and alerting on them provides immeasurable insight into an environment and quickly identifies problems or bottlenecks. Amazingly, VMware provides many of these basic system monitors out of the box.

Out of the box, ESX contains 2 VM monitors; unfortunately, no alerting or other action plans are defined for them. The first monitors virtual CPU utilization: it triggers a warning when utilization has stayed above 75% for more than 5 minutes, and a critical alert when it has stayed above 90% for more than 5 minutes. The second monitors virtual memory utilization with the same thresholds: a warning above 75% for more than 5 minutes, and a critical alert above 90% for more than 5 minutes.
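To make the "above a threshold for more than 5 minutes" rule concrete, here is a minimal Python sketch of how such a sustained-threshold check could be evaluated over a series of utilization samples. This is not VMware's implementation; the function names and the 20-second sampling interval are assumptions chosen for illustration, and only the 75%/90% thresholds and the 5-minute window come from the defaults described above.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, needed: int) -> bool:
    """True if the most recent `needed` samples are all strictly above `threshold`."""
    if len(samples) < needed:
        return False  # not enough history to cover the whole window yet
    return all(s > threshold for s in samples[-needed:])


def classify(
    samples: Sequence[float],
    warn: float = 75.0,       # warning threshold (%), from the default monitor
    crit: float = 90.0,       # critical threshold (%), from the default monitor
    interval_s: int = 20,     # ASSUMPTION: seconds between samples
    hold_s: int = 300,        # condition must hold for 5 minutes
) -> str:
    """Return 'critical', 'warning' or 'normal' for a utilization series (oldest first)."""
    needed = hold_s // interval_s  # number of samples that span the hold window
    if sustained_above(samples, crit, needed):
        return "critical"
    if sustained_above(samples, warn, needed):
        return "warning"
    return "normal"


if __name__ == "__main__":
    # Five minutes of samples at 80% CPU: above 75% but not above 90%.
    print(classify([80.0] * 15))
```

A short spike does not change the state in this sketch: every sample in the trailing 5-minute window has to exceed the threshold before a warning or critical state is reported.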

Those 2 monitors identify the most common causes of system slowness I have ever seen. When either of those reaches 80% or more, a huge bottleneck occurs and can cascade into a completely unusable system. Now you can be alerted and preemptively resolve the issues, focusing your time and money on the problems that truly affect your environment. Simply configure an action plan to email you when these events are triggered and you're halfway there.

There are plenty of other monitors/triggers for the virtual machines in your ESX environment. Below is a list of available triggers and their default settings. If you are seeing a potential problem area, such as an unreliable or slow disk, then feel free to test those triggers and see if they provide insight into how your environment is working, and how it isn't.

Trigger Type               | Condition   | Warning                | Warning Condition Length | Alert       | Alert Condition Length
---------------------------+-------------+------------------------+--------------------------+-------------+-----------------------
VM CPU Ready Time (ms)     | Is above    | 4000                   | for 5 min                | 8000        | for 5 min
VM CPU Usage (%)           | Is above    | 75                     | for 5 min                | 90          | for 5 min
VM Disk Aborts             | Is above    | 10                     | for 5 min                | 25          | for 5 min
VM Disk Resets             | Is above    | 10                     | for 5 min                | 25          | for 5 min
VM Disk Usage (KBps)       | Is above    |                        | for 5 min                |             | for 5 min
VM Fault Tolerance Latency | Is equal to | Moderate               | n/a                      | High        | n/a
VM Heartbeat               | Is equal to | Intermittent Heartbeat | n/a                      | No Heartbeat | n/a
VM Memory Usage (%)        | Is above    | 75                     | for 5 min                | 90          | for 5 min
VM Network Usage (kbps)    | Is above    |                        | for 5 min                |             | for 5 min
VM Snapshot Size (GB)      | Is above    |                        | n/a                      |             | n/a
VM State                   | Is equal to | Powered On             | n/a                      | Powered Off | n/a
VM Total Disk Latency (ms) | Is above    | 50                     | for 5 min                | 75          | for 5 min
VM Total Size on Disk (GB) | Is above    |                        | n/a                      |             | n/a
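If you want to experiment with these values outside of the VMware client, the numeric defaults listed above can be captured as plain data. The following Python sketch is only an illustration: the `DEFAULT_TRIGGERS` dictionary and the `severity` function are hypothetical names, not part of any VMware API, and the check looks at a single reading only (it ignores the 5-minute condition length). Triggers with no default value in the list are left out.

```python
# Default (warning, alert) thresholds for the "Is above" triggers that
# ship with numeric defaults. Values are copied from the list above;
# the dictionary itself is a hypothetical structure for illustration.
DEFAULT_TRIGGERS = {
    "VM CPU Ready Time (ms)": (4000, 8000),
    "VM CPU Usage (%)": (75, 90),
    "VM Disk Aborts": (10, 25),
    "VM Disk Resets": (10, 25),
    "VM Memory Usage (%)": (75, 90),
    "VM Total Disk Latency (ms)": (50, 75),
}


def severity(trigger: str, value: float) -> str:
    """Classify one reading as 'alert', 'warning' or 'normal'.

    Uses strict "is above" comparisons, so a reading exactly at the
    threshold does not trip it. Raises KeyError for a trigger that has
    no numeric default in DEFAULT_TRIGGERS.
    """
    warning, alert = DEFAULT_TRIGGERS[trigger]
    if value > alert:
        return "alert"
    if value > warning:
        return "warning"
    return "normal"


if __name__ == "__main__":
    for name, reading in [("VM CPU Usage (%)", 80), ("VM Total Disk Latency (ms)", 76)]:
        print(name, reading, severity(name, reading))
```

Keeping the thresholds in one table like this makes it easy to try different warning and alert values against readings you have already collected before changing anything in the environment itself.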