The Cluster Service Cannot Be Started. An Attempt To Read Configuration Data From Windows Registry Failed With Error ‘2’.

Today’s morning started with a little fire on some exchange 2010 server running as DAG members. One out of those 8 guys in the DAG was not able to continue the log replication and continued to keep the database copies in failed state.

After looking at the cluster manager it seemed that the server was not appearing in the failover cluster manager and a bunch of events in application logs:

Log Name:      Application
Source:        MSExchangeRepl
Date:          8/17/2013 11:39:09 AM
Event ID:      4092
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      egiex02.egi.local
Description:
Database Availability Group ‘EGI-DAG-01’ member server ‘EGIEX02’ is not completely started. Run Start-DatabaseAvailabilityGroup ‘EGI-DAG-01’ -MailboxServer ‘EGIEX02’ to start the server.

and System log showed below events when Start-DatabaseAvailabilityGroup EGI-DAG-01 –MailboxServer EGIEX02

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/17/2013 12:48:32 PM
Event ID:      1090
Task Category: Startup/Shutdown
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      EGIEX02.EGI.LOCAL
Description:
The Cluster service cannot be started. An attempt to read configuration data from the Windows registry failed with error ‘2’. Please use the Failover Cluster Management snap-in to ensure that this machine is a member of a cluster. If you intend to add this machine to an existing cluster use the Add Node Wizard. Alternatively, if this machine has been configured as a member of a cluster, it will be necessary to restore the missing configuration data that is necessary for the Cluster Service to identify that it is a member of a cluster. Perform a System State Restore of this machine in order to restore the configuration data.

This happens when a problem node is not able to communicate with the resource owner in a group. DAG uses MSCS as an underlying layer for building high availability for mailbox servers and databases using an additional logic supplied by DAG components. In an event of communication failure to another set of members in a DAG, the failover cluster will continue to attempt connections and will give up after a certain period. In my case the problem node EGIEX04 was trying to reach all 7 other members to read the configuration information but failed to do so because it could not contact either of the nodes over RPC.

Fix is fairly simple:

Open an elevated command prompt on one of the DAG members and run:

Cluster.exe Node EGIEX02 /ForceCleanUp 

After you have run above command the node will be removed from cluster.

Now open Exchange Management Shell and run:

Start-DatabaseAvailabilityGroup EGI-DAG-01 –MailboxServer EGIEX02

 

This should ideally take care of all issues related to cluster service. In case you are not able to get over the MSExchangeRepl errors after that, you may need to reseed the problem database or all of them manually.

So what causes it?

Although cluster service kept saying that it could not contact either of nodes in the cluster, all those nodes were practically contactable via remote registry, WMI, event logs, etc.

An answer lies within the XML of the event ID 4092 MSExchangeRepl.

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
    <EventID>1090</EventID>
    <Version>0</Version>
    <Level>1</Level>
    <Task>8</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2013-08-17T07:18:32.625000000Z" />
    <EventRecordID>192930</EventRecordID>
    <Correlation />
    <Execution ProcessID="3332" ThreadID="3552" />
    <Channel>System</Channel>
    <Computer>EGIEX02.EGI.LOCAL</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="Status">2</Data>
    <Data Name="NodeName">EGIEX02</Data>
  </EventData>
</Event>

S-1-5-18  is a well known security principal Local System. and cluster service on a DAG member uses this this account as a logon account so does the replication service. Every time a node in a cluster tries to contact another it has to provide perform a security handshake and that is using Kerberos by default. When these handshakes are not successful, the caller node is denied an access to the resources and any cluster information that other nodes share among each other. Troubleshooting Kerberos is a nightmare (at least for me). This Kerberos thing can be justified very well by looking at the FailoverClustering Operational logs. You will see ample of entries of the problem node trying to perform a handshake and nothing after that.

By removing and re-adding the node to the cluster, we almost reset everything related to the problem node in the cluster database.

 

I hope that helps someone finds himself in trouble with this issue.