This blog post is not related to exchange but can be useful in some cases since DAG still depends on the clustering technologies. Yesterday, one of our clients had a major issue with a cluster that runs a file server. They installed some patches on the nodes and rebooted the box. Failover cluster manager won’t connect to the cluster since then. A couple of reboots on the servers were tried in a hope that it would fix a problem but that didn’t help.
All cluster groups and resources in each would stay in Pending Online state for a long time and eventually fail. Cluster IP Addresses resource won’t come online either.
Cluster.log file was full of some errors that look like below
000013f8.00000cdc::2014/03/13-08:44:45.318 ERR [RCM] RcmMonitor: Failed to create RHS process ‘C:\Windows\Cluster\rhs.exe -key SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\Rhs\73feb789-9b11-4be2-9354-46dba2a2419d -parentPid 5112 -initEvent c2b41299-69dd-44ff-99eb-4cc42ddb9a5b -replyEndpoint LRPC-1394a24a6375472e44’. Error ERROR_FILE_NOT_FOUND(2)
000013f8.00000cdc::2014/03/13-08:44:45.318 ERR [RCM] rcm::RcmMonitor::StartMonitor: ERROR_FILE_NOT_FOUND(2)’ because of ‘RcmMonitor: Failed to create RHS process.’
000013f8.00000cdc::2014/03/13-08:44:46.332 WARN [RCM] rcm::RcmMonitor::StartMonitor: Retrying…
It took us more than 4 hours and Microsoft PSS to figure out the problem since it was really rare to happen. We relooked at the cluster logs again and again and the line that says Error ERROR_FILE_NOT_FOUND(2) gave the hint. The finding was rhs.exe was missing from the C:\Windows\Cluster directory.
Since the rhs.exe was missing from this location, the cluster resources could not be brought online. What deleted this file is still a mystery. But in most of the cases, an antivirus may really eat up the rhs.exe image.
To fix a deleted or missing rhs.exe, download any of the hotfixes that are applicable to the Windows Server version that you are running and fixes the issues related to rhs.exe. Some of the hotfixes like KB2907244 which replaces the rhs.exe. If the file is missing, said hotfix would recreate it.
After applying the hotfix we were able to bring up all the resources and by virtue of it; the entire cluster.
RHS stands for Resource Host Subsystem in MSCS and is an extremely critical component that monitors the health of cluster resources. Microsoft core team has a great article here http://blogs.technet.com/b/askcore/archive/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters.aspx and here http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx