Another quick post today.
I was having issues with SQL cluster monitoring at a customer. The SQL server was quite beefy: an HP server with 1.5 TB of memory running Server 2012 with 25 SQL instances, 13 of which were failover cluster instances. The others were SQL availability groups.
SCOM has to monitor this server on all levels, so you can imagine that with the number of roles (over 1000 databases hosted) and cluster resources on this server, it was hammering WMI hard. Very hard.
Although the intervals had already been adjusted for a cluster of this size, problems started to appear. SCOM eventually caused cluster resources to deadlock, and roles started failing over to the other node.
The culprit in this case was Microsoft.Windows.Server.MonitorClusterDisks.vbs, which kept timing out because it ran for over 300 seconds.
I found this post by Kevin Holman, a SCOM rockstar. He says that the WMI namespaces for clusters are apparently poorly optimized and not designed to handle this many objects. Another possibility is that this script is simply not designed to work with that number of cluster disks (100+).
In my case, I didn’t really need to monitor the cluster disks, as all my cluster disks contained SQL databases. By default, the SQL Management Pack monitors free space on a disk when a database is hosted on it (if autogrow is enabled). So I disabled the Cluster Disk discovery for SQL servers.
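Before disabling the discovery, it is worth checking that every cluster disk really is covered by the SQL Management Pack. Here is a minimal Python sketch of that check, assuming you have exported a list of cluster disk mount points and a list of database files with their autogrow setting. The `DatabaseFile` type, the `uncovered_disks` function and the sample paths are all hypothetical and not part of SCOM or any management pack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseFile:
    """One database file from a hypothetical inventory export."""
    database: str
    path: str        # full file path, e.g. "S:\\Data\\sales.mdf"
    autogrow: bool   # True if autogrow is enabled for this file


def uncovered_disks(cluster_disks, database_files):
    """Return the cluster disks that host no autogrow-enabled database file.

    Assumption (taken from the post): a disk gets free space monitoring from
    the SQL Management Pack only if such a database file lives on it.
    """
    covered = set()
    for db_file in database_files:
        if not db_file.autogrow:
            continue
        path = db_file.path.lower()
        # A file can match both a drive letter and a mounted volume below it;
        # the longest matching mount point is the disk that actually holds it.
        matches = [d for d in cluster_disks if path.startswith(d.lower())]
        if matches:
            covered.add(max(matches, key=len))
    return sorted(d for d in cluster_disks if d not in covered)


if __name__ == "__main__":
    disks = ["S:\\", "T:\\", "S:\\Mount1\\"]
    files = [
        DatabaseFile("sales", "S:\\Data\\sales.mdf", True),
        DatabaseFile("hr", "T:\\Data\\hr.mdf", False),
        DatabaseFile("crm", "S:\\Mount1\\crm.mdf", True),
    ]
    # Any disk printed here would lose free space monitoring
    # if the Cluster Disk discovery were disabled.
    print(uncovered_disks(disks, files))
```

If the list comes back empty, disabling the Cluster Disk discovery should not leave a gap in free space monitoring. If not, those disks need another monitor first.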
Follow-up analysis showed that CPU usage by the cluster service dropped significantly.
Here are graphs of the CPU usage (keep in mind that the scale is 0.1).
When cluster disk monitoring was enabled:
After disabling disk monitoring:
Although Microsoft officially supports 25 clustered SQL instances on one box, keep in mind that you will have to adjust monitoring for this kind of environment.
If you use SQL Availability Groups instead of failover clusters, I would recommend disabling the Resource Group State monitor, or disabling the discoveries for Resource Groups altogether, as this is already covered by the SQL Availability Group MP.
Apparently this problem is specific to Server 2012 and can be resolved in Server 2012 R2 by changing the mode of the Global Update Manager (GUM).