Exchange Type attribute

I have a habit of spending a lot of time understanding how Exchange uses AD, the Windows Registry, WMI, Crypto and all the related stuff. One of my favorite things to do with any new version of Exchange Server is to look for the AD changes it makes. When Exchange 2010 was released, I went through a lot of attributes and the way their values are constructed. Almost all of them could be explained with the help of MSDN documentation, or by spending some time working out the logical link between the attributes, schema classes, etc. The one exception was the “type” attribute on the Exchange server object.

image

The value of the “type” attribute looks really weird. Initially I thought it was Chinese or Japanese, but it is not. 😛

So what is this “type” attribute on the Exchange server object in Active Directory?

This attribute holds the licensing information for the server edition that you have installed. When you install an Exchange Server 2010 role, only the Standard Edition of Exchange gets installed automatically. The edition and licensing information is stored in the type attribute in encrypted form. The Exchange edition is determined by the product key you enter during activation, and the value of this attribute changes accordingly. Since the value is encrypted, there is no recognizable pattern to the change, but you can still observe that the value has changed.

Well, that was just a geeky finding. It is of no practical use in production, though.
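If you want to look at the value yourself, a minimal read-only sketch could look like the one below. It assumes the RSAT ActiveDirectory module is installed and that Exchange server objects carry the msExchExchangeServer object class; it only displays the raw attribute and makes no attempt to decode it.

```powershell
Import-Module ActiveDirectory

# The Exchange organization lives in the configuration partition of the forest.
$configNC = (Get-ADRootDSE).configurationNamingContext

# Read-only query: list each Exchange server object with its raw "type" value.
# Assumption: server objects use the msExchExchangeServer object class.
Get-ADObject -SearchBase $configNC -LDAPFilter '(objectClass=msExchExchangeServer)' -Properties type |
    Select-Object Name, type
```

Running it before and after entering a product key is an easy way to watch the value change.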

Failed to create RHS process – Windows 2008 R2 cluster

This blog post is not related to Exchange, but it can be useful in some cases since a DAG still depends on the clustering technologies. Yesterday, one of our clients had a major issue with a cluster that runs a file server. They installed some patches on the nodes and rebooted them. Failover Cluster Manager would not connect to the cluster after that. The servers were rebooted a couple more times in the hope that it would fix the problem, but that didn’t help.

Symptoms

All cluster groups, and the resources in each, would stay in the Pending Online state for a long time and eventually fail. The Cluster IP Address resources would not come online either.

image

The Cluster.log file was full of errors like the ones below:

000013f8.00000cdc::2014/03/13-08:44:45.318 ERR   [RCM] RcmMonitor: Failed to create RHS process ‘C:\Windows\Cluster\rhs.exe -key SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\Rhs\73feb789-9b11-4be2-9354-46dba2a2419d -parentPid 5112 -initEvent c2b41299-69dd-44ff-99eb-4cc42ddb9a5b -replyEndpoint LRPC-1394a24a6375472e44’. Error ERROR_FILE_NOT_FOUND(2)
000013f8.00000cdc::2014/03/13-08:44:45.318 ERR   [RCM] rcm::RcmMonitor::StartMonitor: ERROR_FILE_NOT_FOUND(2)’ because of ‘RcmMonitor: Failed to create RHS process.’
000013f8.00000cdc::2014/03/13-08:44:46.332 WARN  [RCM] rcm::RcmMonitor::StartMonitor: Retrying…

Resolution

It took us more than 4 hours and Microsoft PSS to figure out the problem, since it happens very rarely. We looked through the cluster logs again and again, and the line that says Error ERROR_FILE_NOT_FOUND(2) gave the hint: rhs.exe was missing from the C:\Windows\Cluster directory.

image

Since rhs.exe was missing from this location, the cluster resources could not be brought online. What deleted this file is still a mystery, but in cases like this an antivirus product removing or quarantining rhs.exe is a likely suspect.

To fix a deleted or missing rhs.exe, download any hotfix that applies to the Windows Server version you are running and that updates rhs.exe. Some hotfixes, such as KB2907244, replace rhs.exe. If the file is missing, the hotfix recreates it.

After applying the hotfix we were able to bring all the resources online and, with them, the entire cluster.
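The check we did by hand can be sketched in a few lines of PowerShell. Get-MissingRhsPath is a hypothetical helper name, and the sample line is shortened from the log excerpt above; in practice you would feed it the lines of your own Cluster.log with Get-Content.

```powershell
function Get-MissingRhsPath {
    param([string[]]$LogLine)
    # Capture the executable path from "Failed to create RHS process" lines
    # that end in ERROR_FILE_NOT_FOUND.
    $pattern = 'Failed to create RHS process\W+(?<exe>[A-Za-z]:\\.+?\.exe)\b.*ERROR_FILE_NOT_FOUND'
    foreach ($line in $LogLine) {
        if ($line -match $pattern) { $Matches['exe'] }
    }
}

# Sample line, shortened from the Cluster.log excerpt above.
$sample = "ERR   [RCM] RcmMonitor: Failed to create RHS process 'C:\Windows\Cluster\rhs.exe -key SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters\Rhs\73feb789-9b11-4be2-9354-46dba2a2419d -parentPid 5112'. Error ERROR_FILE_NOT_FOUND(2)"

# Report whether each executable named in the log is actually on disk.
foreach ($exe in (Get-MissingRhsPath -LogLine $sample | Select-Object -Unique)) {
    '{0} present on disk: {1}' -f $exe, (Test-Path -LiteralPath $exe)
}
```

Had we run something like this on day one, the missing file would have shown up in minutes rather than hours.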

More information

RHS stands for Resource Hosting Subsystem in MSCS and is an extremely critical component that monitors the health of cluster resources. The Microsoft core team has a great article here http://blogs.technet.com/b/askcore/archive/2009/11/23/resource-hosting-subsystem-rhs-in-windows-server-2008-failover-clusters.aspx and another here http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx

Outlook 2013 may not connect using MAPI over HTTPs as expected

I wrote an article about Mapi over HTTPS just a few hours ago and then noticed that MS has a new KB article for an issue related to it. You may experience issues connecting to Exchange Server 2013 SP1 from Microsoft Outlook 2013 SP1 using Mapi over HTTPS, even when all settings on the server, load balancer and reverse proxy are correct.

You can follow the KB article “Outlook 2013 may not connect using MAPI over HTTPs as expected” to resolve this issue.

Open Registry Editor on the client computer, navigate to HKEY_CURRENT_USER\SOFTWARE\Microsoft\Exchange and change the value of MapiHttpDisabled to 0 as shown below.

image

If you do not see this DWORD value in registry editor, you do not need to create it manually.
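If you have more than a handful of clients to check, the same steps can be scripted. The sketch below is only one way to do it: it runs in the context of the logged-on user, changes the value only when it already exists, and otherwise leaves the registry alone.

```powershell
$key  = 'HKCU:\SOFTWARE\Microsoft\Exchange'
$name = 'MapiHttpDisabled'

# Returns $null when the key or the value does not exist.
$current = Get-ItemProperty -Path $key -Name $name -ErrorAction SilentlyContinue

if ($null -eq $current) {
    "$name is not present; nothing to change."
} elseif ($current.$name -ne 0) {
    Set-ItemProperty -Path $key -Name $name -Value 0
    "$name was $($current.$name); it is now set to 0."
} else {
    "$name is already 0."
}
```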

Exchange 2013 SP1 Mapi over Http (MapiHttp)

The Microsoft Exchange team announced general availability of Service Pack 1 for Exchange Server 2013 on 24th Feb this month. Exchange 2013 SP1 ships with some new additions. MapiHttp is one of the interesting ones from the client connectivity standpoint; it improves the stability and reliability of Outlook connections to an Exchange 2013 SP1 server. MapiHttp appears to be a replacement for the traditional RPC/HTTPS protocol for clients. RPC/HTTPS has been around since Exchange 2003 and has worked well with Outlook clients, with a few exceptions related to stability. Since the RPC traffic is encapsulated inside HTTPS packets, an RPC proxy was always needed for RPC/HTTPS to work. Although RPC/HTTPS has worked in almost every deployment, it is not dependable when the internet connection itself is unstable. RPC is known to be a heavy protocol and is not meant to run over slow or unstable connections.

Mapi over HTTP removes the RPC protocol completely and moves the client-server traffic onto industry-standard HTTP, leveraging functions of the Windows HTTP client that support pause and resume. This gives clients the ability to change networks or resume from hibernation while maintaining the same server context, much faster than with traditional RPC/HTTPS communication.

Things you should know as an administrator

We have a new protocol that looks similar to RPC/HTTPS but is more efficient and flexible. Be advised, though, that it is currently available only for Outlook 2013 SP1 connecting to Exchange Server 2013 SP1. The table below describes how each Outlook version connects to each Exchange version.

Product            Exchange 2013 SP1                  Exchange 2013 RTM   Exchange 2010 SP3       Exchange 2007 SP3
Outlook 2013 SP1   MAPI over HTTP, Outlook Anywhere   Outlook Anywhere    RPC, Outlook Anywhere   RPC, Outlook Anywhere
Outlook 2013 RTM   Outlook Anywhere                   Outlook Anywhere    RPC, Outlook Anywhere   RPC, Outlook Anywhere
Outlook 2010       Outlook Anywhere                   Outlook Anywhere    RPC, Outlook Anywhere   RPC, Outlook Anywhere
Outlook 2007       Outlook Anywhere                   Outlook Anywhere    RPC, Outlook Anywhere   RPC, Outlook Anywhere

  • Mapi over HTTP is still very new. I would recommend not implementing it in production without testing it in a lab environment first.
  • Mapi over HTTP is an organization-level setting and can be enabled with Set-OrganizationConfig -MapiHttpEnabled:$True. All Client Access servers must be upgraded to Exchange Server 2013 SP1 before you enable this setting.
  • Outlook clients may get disconnected or may require a restart after you enable this setting. In my lab, Outlook showed a pop-up saying an administrator had made a change that required a restart, and I had to restart it.
  • Although the setting is enabled at the organization level, configuration is done at the server level. The Exchange 2013 Service Pack 1 installer creates a new virtual directory called “mapi” in IIS and an associated object in Active Directory. You must configure the virtual directories using Set-MapiVirtualDirectory to set InternalUrl and ExternalUrl on individual servers. Ensure the certificate used on the Exchange server matches the names used in the InternalUrl and ExternalUrl values.
  • Make sure that the servers have enough space to accommodate the log files generated by the connections. Mapi over HTTP logs are generated and stored at:
    • %ExchangeInstallPath%\Logging\MAPI Address Book Service\
    • %ExchangeInstallPath%\Logging\MAPI Client Access\
    • %ExchangeInstallPath%\Logging\HttpProxy\Mapi\
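Put together, the server-side steps from the list above might look like the sketch below when run from the Exchange Management Shell. The server name and URL are hypothetical placeholders, and using the same URL internally and externally is just one possible design; adjust both to your own namespace and certificate.

```powershell
# Hypothetical names: replace with your own server and namespace.
$server = 'EX2013-01'
$url    = 'https://mail.contoso.com/mapi'

# Point the "mapi" virtual directory on this server at the published namespace.
Set-MapiVirtualDirectory -Identity "$server\mapi (Default Web Site)" -InternalUrl $url -ExternalUrl $url

# Review what is configured across all servers before flipping the switch.
Get-MapiVirtualDirectory | Format-List Server, InternalUrl, ExternalUrl

# Organization-wide switch; run it only once every Client Access server is on SP1.
Set-OrganizationConfig -MapiHttpEnabled $true
```

Repeat the Set-MapiVirtualDirectory step for each Client Access server before running the last line.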

In addition to this post, I strongly recommend spending a few minutes reading and watching the following:

MAPI over HTTP

Exchange 2013 and MapiHttp

Skip CA Checks during Powershell Remoting

PowerShell remoting is a really cool thing for an administrator to have. If you can spare a few bytes of your brain to remember the New-PSSession syntax, it can help you manage your entire Windows-based infrastructure without logging on to a server.

One of my colleagues was trying to connect to a Lync box today and kept getting this error:

 

 

New-PSSession : [lyncserver.exchange.local] Connecting to remote server lyncserver.exchange.local failed with the
following error message : The server certificate on the destination computer (lyncserver.exchange.local:443) has the
following errors:
The SSL certificate could not be checked for revocation. The server used to check for revocation might be unreachable.
For more information, see the about_Remote_Troubleshooting Help topic.
At line:1 char:12
+ $Session = New-PSSession -ConnectionUri https://lyncserver.exchange.local/ocspo
    + CategoryInfo          : OpenError: (System.Manageme….RemoteRunspace:RemoteRunspace) [New-PSSession], PSRemotingTransportException
    + FullyQualifiedErrorId : AccessDenied,PSSessionOpenFailed

 

This can happen when PowerShell cannot check the revocation status of the certificate on the remote server. In a way that is a good thing, since it blocks something potentially malicious and is a good reason to raise an alarm with your security guys. But if your CA really is offline and you know it, this becomes a bit of a problem. Fortunately, getting past it is pretty simple, although it is a workaround rather than a fix.

Just use the two lines below to get past it:

$SessionOptions = New-PSSessionOption -SkipCACheck -SkipCNCheck -SkipRevocationCheck

$Session = New-PSSession -ConnectionUri https://lyncserver.exchange.local/ocspowershell -Credential (Get-Credential) -SessionOption $SessionOptions

and then import the session the usual way with Import-PSSession $Session.

Remove-ActiveSyncDevice returns an error – Couldn’t find User as a recipient

Today’s blog post comes from another interesting find about the Exchange Management Shell and the removal of ActiveSync devices. A lot of customers I know prefer to keep their ActiveSync devices clean. If an employee has not used an ActiveSync device for more than a few days, they simply remove it. These periodic removals are usually automated in one way or another, and a whole lot of people use PowerShell to do that.

One such customer was seeing errors while removing old ActiveSync devices.

Issue

Running Remove-ActiveSyncDevice returns errors stating that it Couldn’t find <user identity> as a recipient or that The ActiveSyncDevice <DeviceIdentity> cannot be found. The two errors look like this:

 

Couldn’t find ‘exchange.local/New Delhi/SomeLocation/User1’ as a recipient.
    + CategoryInfo          : InvalidArgument: (:) [Remove-ActiveSyncDevice], RecipientNotFoundException
    + FullyQualifiedErrorId : 3DAABD9F,Microsoft.Exchange.Management.Tasks.RemoveMobileDevice

and

The ActiveSyncDevice exchange.local/New Delhi/SomeLocation/User1/ExchangeActiveSyncDevices/SAMSUNGGTI9100§SAMSUNG1818901812 cannot be found.
    + CategoryInfo          : NotSpecified: (2:Int32) [Remove-ActiveSyncDevice], ManagementObjectNotFoundException
    + FullyQualifiedErrorId : 1C3255A8,Microsoft.Exchange.Management.Tasks.RemoveMobileDevice

Cause

Assume that you have created a mailbox named User1 in the OU exchange.local/New Delhi/SomeLocation. After the mailbox was created, the user was allowed to configure his ActiveSync device. After successful activation, the user account stayed at that location for a while.

Due to some requirement, or a change in the user’s location or company, you move this user account to another OU using ADUC. When the user account is moved, all child objects of the user object in AD move along with it.

When an ActiveSync device is activated, Exchange creates an ActiveSync device object under the user object in AD, and this object also moves with the user account whenever the account is moved.

When you run Remove-ActiveSyncDevice in EMS, it looks for the object in two places. The first is the entry in the user’s mailbox, as shown in the figure below. The ExchangeSyncData object in the user’s mailbox (inside the mailbox database) contains all the active and inactive EAS devices the mailbox has ever synchronized with. In this example the device name is AirSync-SAMSUNGGTN7100-SEC160xxxxx

Capture1

The second place is in AD right under the user object associated with the mailbox. You can see this association using ADSIEDIT or LDP.exe

image

Like I said, when you move a user account to another OU, these EAS device objects move along with it, which changes the identity (distinguished name) of each object. However, when PowerShell looks up the device, it does not start from the device object in AD but from the one in the mailbox (shown in the first figure), and then tries to locate the device object in AD at the path recorded there. Since the user object was moved to a different location with ADUC, Exchange is not aware of the change, the path stored in the mailbox is never updated, and the lookup fails with those errors.

Workaround

Locate the EAS object under the user account in AD and remove it using ADSIEDIT, then remove the associated object in the mailbox database using MFCMAPI.
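If you prefer the shell over ADSIEDIT for the AD side, a read-only sketch like the one below lists the device objects that currently sit under the user. It assumes the RSAT ActiveDirectory module and that device objects use the msExchActiveSyncDevice object class; verify the class name in your own schema before relying on it.

```powershell
Import-Module ActiveDirectory

# 'User1' is the example mailbox from this post.
$user = Get-ADUser -Identity 'User1'

# Device objects live in the ExchangeActiveSyncDevices container under the user.
# Assumption: they use the msExchActiveSyncDevice object class.
Get-ADObject -SearchBase $user.DistinguishedName -SearchScope Subtree `
    -LDAPFilter '(objectClass=msExchActiveSyncDevice)' -Properties whenCreated, whenChanged |
    Select-Object Name, DistinguishedName, whenCreated, whenChanged
```

Comparing the DistinguishedName values returned here with the path in the error message shows the stale OU at a glance.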

Important

If a user has multiple devices partnered with his mailbox, it can be very difficult to find out which one to delete. To identify the device object that is to be deleted, you can use the following steps:

1. Run Get-ActiveSyncDevice -Mailbox "User1"

2. Make a note of the Identity and LastSuccessSync of all the devices (LastSuccessSync is reported by Get-ActiveSyncDeviceStatistics).

3. Open MFCMAPI and navigate to the screen shown in first figure.

4. Expand each device, or the device you identified, in the mailbox and select SyncStatus

You should see some properties like those shown below:

image

PR_LOCAL_COMMIT_TIME and PR_LAST_MODIFICATION_TIME are the two properties that should help you determine which device to delete.
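Steps 1 and 2 can be narrowed down with a small filter. Select-StaleDevice is a hypothetical helper, and the 30-day default is an arbitrary example threshold; the function only needs objects that expose a LastSuccessSync property, such as the output of Get-ActiveSyncDeviceStatistics.

```powershell
function Select-StaleDevice {
    param(
        [Parameter(Mandatory)] [object[]]$Device,
        [int]$OlderThanDays = 30,          # arbitrary example threshold
        [datetime]$Now = (Get-Date)
    )
    $cutoff = $Now.AddDays(-$OlderThanDays)
    # Devices that never synced (empty LastSuccessSync) are treated as stale too.
    $Device |
        Where-Object { -not $_.LastSuccessSync -or $_.LastSuccessSync -lt $cutoff } |
        Sort-Object LastSuccessSync
}

# In the Exchange Management Shell (not runnable outside it):
# Select-StaleDevice -Device (Get-ActiveSyncDeviceStatistics -Mailbox 'User1') |
#     Select-Object Identity, DeviceType, LastSuccessSync
```

The output is only a shortlist; cross-check it against the MFCMAPI properties above before deleting anything.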

 

Note: These steps are not for someone who does not know how to use MFCMAPI and ADSIEDIT, which is the only reason they are outlined at a very high level. If you have questions or need help, feel free to drop me a note.

Powershell Password Obfuscator

While writing PowerShell scripts you may have needed to store a username and password inside the script. There are a couple of ways to do so. Either you export the password to a text or XML file and read it back every time the script runs, or you generate the encoded credential with another script and save it inside the main script.

The second way is much easier, but requires another PowerShell script to be run to generate the credentials.

While working on a script I needed to store the credentials inside the same PowerShell script. There was no real need to do so, but someone wanted it that way.

This script generates the code that can be directly pasted inside the main script where you want to save your credentials.

Just enter the username (in the format you want) and the password associated with it, and click the Generate button. That is all! The code you need is ready in the text box below:

image

This really simple script can be handy in your toolbox if you are a PowerShell developer or you do some scripting for fun.

The script is available for download at the TechNet Gallery.
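For readers who just want the idea, a generator of this kind can be sketched as below. This is not the downloadable script itself: New-EmbeddedCredentialCode is a hypothetical name, and the approach shown (an AES key stored right next to the encrypted password) is an assumption about one way to do it. Because the key travels with the script, this only obfuscates the password; it does not protect it. User names containing a single quote are not handled.

```powershell
function New-EmbeddedCredentialCode {
    param(
        [Parameter(Mandatory)] [string]$UserName,
        [Parameter(Mandatory)] [securestring]$Password
    )
    # Random 256-bit AES key; it is emitted alongside the ciphertext on purpose.
    $key = New-Object byte[] 32
    [System.Security.Cryptography.RandomNumberGenerator]::Create().GetBytes($key)
    $cipher = ConvertFrom-SecureString -SecureString $Password -Key $key

    # Emit three lines that rebuild a PSCredential when pasted into another script.
    @(
        "`$key = [byte[]]($($key -join ','))"
        "`$secure = ConvertTo-SecureString '$cipher' -Key `$key"
        "`$credential = New-Object System.Management.Automation.PSCredential('$UserName', `$secure)"
    ) -join [Environment]::NewLine
}

# Interactive usage:
# New-EmbeddedCredentialCode -UserName 'EXCHANGE\svc' -Password (Read-Host 'Password' -AsSecureString)
```

Paste the generated lines into the main script and use $credential wherever a -Credential parameter is expected.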

Exchange 2010 Intermittent Password Prompts in Outlook Clients – NTLM Bottleneck

There are hundreds of articles on the internet about this commonly seen issue. If you are running Exchange 2007 or later, most of the time it is caused by a wrong certificate configuration: a wrong or missing name in the certificate compared with the URLs defined on Exchange web components like OWA, EAS, OA, OAB etc.

Exchange is a fairly complex product that runs alongside, or depends on, several components like AD, Crypto, network components, authentication modules, etc.

The particular case I am writing about had more to do with the authentication mechanisms used by Exchange 2010, which uses and supports several of them. The diagram below should help you understand the pretty simple-looking setup that one of our customers was running:

 

image

The diagram is pretty self-explanatory. It is a DAG and a CAS array with 4 domain controllers (although not all 4 are shown in the diagram).

Even after we verified all certificate, URL and authentication settings on OA, OWA, EAS, OAB, etc., users still complained that they received an annoying password prompt which simply would not go away, even after entering the correct user name and password.

Finally, we decided to look further into what happens when an authentication request is submitted to the CAS array. Interestingly, we could correlate some event IDs in the Security log of the CAS servers that pointed towards an authentication issue. After investigating the Security logs carefully, we found entries for a computer that had reported the problem. The Security log entry for this computer read as below:

Log Name: Security
Source: Microsoft-Windows-Security-Auditing
Date: 9/5/2013 10:22:59 PM
Event ID: 4625
Task Category: Logon
Level: Information
Keywords: Audit Failure
User: N/A
Computer: cas02.exchange.local
Description:
An account failed to log on.
Subject:
  Security ID: NULL SID
  Account Name: –
  Account Domain: –
  Logon ID: 0x0
Logon Type: 3
Account For Which Logon Failed:
  Security ID: NULL SID
  Account Name: username
  Account Domain: EXCHANGE
Failure Information:
  Failure Reason: An Error occurred during Logon.
  Status: 0xc000005e
  Sub Status: 0x0
Process Information:
  Caller Process ID: 0x0
  Caller Process Name: –
Network Information:
  Workstation Name:
  Source Network Address: 178.239.86.252
  Source Port: 37109
Detailed Authentication Information:
  Logon Process: NtLmSsp
  Authentication Package: NTLM
  Transited Services: –
  Package Name (NTLM only): –
  Key Length: 0

Initially it looked like the issue described in http://support.microsoft.com/kb/2157973/en-us but that was not the case, since the error code described in the KB and the error above do not match. Also, no smart card logon was in use. To find out what the error code 0xc000005e meant, we used err.exe and the output was:

C:\Tools\Err>Err.exe 0xc000005e
# for hex 0xc000005e / decimal -1073741730 :
  STATUS_NO_LOGON_SERVERS
# There are currently no logon servers available to service
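To see how often this status shows up, rather than reading entries one by one, the Security log can be filtered from an elevated PowerShell prompt on the CAS server. This is a sketch under assumptions: the field names are the standard ones for event 4625, and the limit of 1000 events is an arbitrary choice.

```powershell
# Failed logons (4625) from the local Security log; requires elevation.
$filter = @{ LogName = 'Security'; Id = 4625 }

Get-WinEvent -FilterHashtable $filter -MaxEvents 1000 -ErrorAction SilentlyContinue |
    ForEach-Object {
        $evt  = $_
        $data = @{}
        # Flatten the <EventData> name/value pairs into a hashtable.
        foreach ($d in ([xml]$evt.ToXml()).Event.EventData.Data) { $data[$d.Name] = $d.'#text' }
        [pscustomobject]@{
            Time    = $evt.TimeCreated
            Account = '{0}\{1}' -f $data['TargetDomainName'], $data['TargetUserName']
            Status  = $data['Status']
            Source  = $data['IpAddress']
            Package = $data['AuthenticationPackageName']
        }
    } |
    Where-Object { $_.Status -eq '0xc000005e' } |
    Sort-Object Time
```

A burst of these entries around the time users report prompts is a good sign you are looking at the same problem.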

Since we suspected something was wrong with NTLM, netlogon.log was the next thing to look at. Netlogon.log on the client shows:

Time [LOGON] SamLogon: Network logon of EXCHANGE\UserName from WorkstationName Returns 0xC000005E

This was again a little misleading, since the AD servers were up and running and processing logon requests. No DNS issues were identified either. After a lot of Googling and Binging, we concluded that something was wrong with the NTLM stuff. So what was it?

NTLM bottlenecks can be caused by RPC/HTTPS requests. RPC/HTTPS is definitely a key contributor to a large number of NTLM requests, since a session established using RPC/HTTPS has to be authenticated twice because of its two protocol layers. The outer HTTP layer requires authentication once and the tunneled RPC requires authentication again, generating twice the load. Moreover, HTTP is a stateless protocol, which can cause the server to handle multiple authentication requests.

Although RPC/HTTPS generates additional NTLM authentication requests, direct MAPI connections to a CAS or CAS array can also contribute if the traffic is high enough. MAPI supports Kerberos authentication, and the default setting in Outlook 2007 and later is to negotiate the strongest authentication available when not running in Outlook Anywhere mode. Unless Kerberos support is configured in the environment, Outlook will fall back to NTLM.

Considering all these factors and the research done, the only conclusion was to look for NTLM authentication issues. A quick network packet capture on the CAS servers helps determine whether it is NTLM or something else.

To capture precise results, leave the network capture running on the CAS server until a password prompt is reported. You will notice that the capture reveals something like the lines below between the CAS server and the client. (Running simultaneous captures on both the client and the servers can help gather more precise results.)

0.0000000           11198    8:13:23 PM 9/2/2013      164.8780960      OUTLOOK.EXE    ClientComputer                 198.168.36.100    MSRPC  MSRPC:c/o Request: MS Exchange Directory RFR {1544F5E0-613C-11D1-93DF-00C04FD7BD09}  Call=0x1  Opnum=0x0  Context=0x0  Hint=0xC0 Warning: Octets trailer appends to authentication token      {MSRPC:105, TCP:104, IPv4:9}     65229

0.0156250           11199    8:13:23 PM 9/2/2013      164.8937210      OUTLOOK.EXE    198.168.36.100               ClientComputer       TCP        TCP:Flags=…A…., SrcPort=6950, DstPort=3117, PayloadLen=0, Seq=3823341786, Ack=264467696, Win=63764 (scale factor 0x0) = 63764  {TCP:104, IPv4:9}               63764

0.0468750           11216    8:13:23 PM 9/2/2013      164.9405960      OUTLOOK.EXE    198.168.36.100               ClientComputer       MSRPC  MSRPC:c/o Fault:  Call=0x1  Context=0x0  Status=0x5  Cancels=0x0       {MSRPC:92, TCP:88, IPv4:9}          63364

In the above capture, Outlook is clearly trying to use the RFR interface, and the server answers with an RPC fault (Status=0x5).

Windows 2008 R2 has Netlogon performance counters that can be used to track down NTLM related issues. One of the Microsoft KB support articles describes them as follows:

Performance counter           Explanation
Semaphore Waiters             The number of threads waiting to obtain the semaphore
Semaphore Holders             The number of threads holding the semaphore
Semaphore Acquires            The total number of times that the semaphore has been obtained over the lifetime of the security channel connection, or since system startup for _Total
Semaphore Timeouts            The total number of times that a thread has timed out while it waited for the semaphore over the lifetime of the security channel connection, or since system startup for _Total
Average Semaphore Hold Time   The average time (in seconds) that the semaphore is held over the last sample

 

In the case we were troubleshooting, the value of Semaphore Timeouts was going beyond 100. As the explanation of Semaphore Timeouts says, this counter records the timeouts that have occurred: threads wait for the semaphore, then time out, and the logon is denied to the requestor. This causes authentication requests to be rejected, which is exactly what was happening on the servers.

All of these symptoms are caused by a phenomenon called “NTLM Bottleneck”. There are a couple of ways to fix this issue:
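The counters can be sampled from PowerShell instead of Performance Monitor. The sketch below assumes the counters are exposed under the Netlogon performance object with a _Total instance, as on Windows Server 2008 R2; the sample interval and count are arbitrary choices.

```powershell
$paths = '\Netlogon(_Total)\Semaphore Waiters',
         '\Netlogon(_Total)\Semaphore Holders',
         '\Netlogon(_Total)\Semaphore Timeouts',
         '\Netlogon(_Total)\Average Semaphore Hold Time'

# Twelve samples, five seconds apart (one minute in total).
Get-Counter -Counter $paths -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object {
        # One row per sample, one column per counter.
        $row = [ordered]@{ Time = $_.Timestamp }
        foreach ($s in $_.CounterSamples) {
            $row[($s.Path -split '\\')[-1]] = $s.CookedValue
        }
        [pscustomobject]$row
    } |
    Format-Table -AutoSize
```

A Semaphore Timeouts column that keeps climbing between samples is the pattern described above.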

Resolution 1

The first resolution is to increase the MaxConcurrentApi value in the registry. This DWORD value can be increased to 10 on Windows Server 2003 based DCs and member servers, and up to 150 on Windows Server 2008 SP2 and later DCs and member servers.

  1. Start Registry Editor.
  2. Locate the following registry subkey:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters

  3. Create the following registry entry:
    Name: MaxConcurrentApi
    Type: REG_DWORD
    Value: Set the value to the larger number that you tested (any number greater than the default value).
  4. At a command prompt, run net stop netlogon, and then run net start netlogon.

You may have to apply these settings both on the CAS servers and domain controllers depending upon the situation.
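The same registry change can be made from an elevated PowerShell prompt. The value 10 below is only an example; use the number you arrived at in your own testing.

```powershell
$path  = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'
$value = 10   # example only; use the number you tested

# Create the DWORD, or overwrite it if it already exists.
New-ItemProperty -Path $path -Name 'MaxConcurrentApi' -PropertyType DWord -Value $value -Force | Out-Null

# Netlogon reads the value at service start.
net stop netlogon
net start netlogon

# Confirm what is now in the registry.
Get-ItemProperty -Path $path -Name 'MaxConcurrentApi' | Select-Object MaxConcurrentApi
```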

Resolution 2

Configure the Exchange 2010 CAS array to use Kerberos instead of NTLM by following Configuring Kerberos Authentication for Load-Balanced Client Access Servers

References and Additional Reading

Is this horse dead yet: NTLM Bottlenecks and the RPC runtime

Updated: NTLM and MaxConcurrentApi Concerns

You are intermittently prompted for credentials or experience time-outs when you connect to Authenticated Services

Netlogon performance counters for Windows Server 2003

Troubleshooting SID translation failures from the obvious to the not so obvious

Script: Finding IIS Servers in Domain

One of our customers is getting ready for a security audit of their critical servers. Exchange is of course one of those, but there are a lot of others running IIS and exposed to the internet through a firewall or some other technology.

The challenge was to find out how many servers in the data center have IIS installed without the customer knowing about it. Something like this really becomes a challenge when you have hundreds of servers running inside that cold, noisy and windy storage room (the data center).

Here is a simple script that can help you find the IIS servers in an AD domain.

$Error.Clear()
Clear-Host
Import-Module ActiveDirectory

# Windows Server computer accounts; the "BLR-*" name filter is site-specific, adjust or remove it.
$Servers = Get-ADComputer -Filter * -ResultSetSize $null -Properties OperatingSystem | Where-Object { ($_.OperatingSystem -like "Windows Server*") -and ($_.Name -like "BLR-*") }

foreach ($Server in $Servers) {
    Write-Host "Connecting to" $Server.DNSHostName -ForegroundColor Blue
    # The root\MicrosoftIISv2 WMI namespace is present with IIS 6, or with the IIS 6 WMI Compatibility feature on later versions.
    if (Get-WmiObject -ComputerName $Server.DNSHostName -Namespace root -Class __NameSpace -Filter "name='MicrosoftIISv2'" -ErrorAction SilentlyContinue) {
        # Record the server in the report file.
        $Server.DNSHostName | Out-File E:\Reports\ServersWithIIS.txt -Append
    }
    else {
        Write-Host "$($Server.DNSHostName) does not seem to have IIS on it" -ForegroundColor Green
    }
}

Again, this is about the simplest code I could come up with after searching the internet for a ready-made script and failing to find one. Hope this helps others too.

The Cluster Service Cannot Be Started. An Attempt To Read Configuration Data From Windows Registry Failed With Error ‘2’.

This morning started with a little fire on some Exchange 2010 servers running as DAG members. One of the 8 members in the DAG was not able to continue log replication and kept its database copies in a failed state.

Looking at Failover Cluster Manager, the server was not appearing in the cluster, and there were a bunch of events in the Application log:

Log Name:      Application
Source:        MSExchangeRepl
Date:          8/17/2013 11:39:09 AM
Event ID:      4092
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      egiex02.egi.local
Description:
Database Availability Group ‘EGI-DAG-01’ member server ‘EGIEX02’ is not completely started. Run Start-DatabaseAvailabilityGroup ‘EGI-DAG-01’ -MailboxServer ‘EGIEX02’ to start the server.

and the System log showed the events below when Start-DatabaseAvailabilityGroup EGI-DAG-01 -MailboxServer EGIEX02 was run:

Log Name:      System
Source:        Microsoft-Windows-FailoverClustering
Date:          8/17/2013 12:48:32 PM
Event ID:      1090
Task Category: Startup/Shutdown
Level:         Critical
Keywords:     
User:          SYSTEM
Computer:      EGIEX02.EGI.LOCAL
Description:
The Cluster service cannot be started. An attempt to read configuration data from the Windows registry failed with error ‘2’. Please use the Failover Cluster Management snap-in to ensure that this machine is a member of a cluster. If you intend to add this machine to an existing cluster use the Add Node Wizard. Alternatively, if this machine has been configured as a member of a cluster, it will be necessary to restore the missing configuration data that is necessary for the Cluster Service to identify that it is a member of a cluster. Perform a System State Restore of this machine in order to restore the configuration data.

This happens when a problem node is not able to communicate with the resource owner in a group. A DAG uses MSCS as the underlying layer for building high availability for mailbox servers and databases, with additional logic supplied by the DAG components. In the event of a communication failure with the other members of a DAG, the failover cluster keeps attempting connections and gives up after a certain period. In my case the problem node EGIEX02 was trying to reach all 7 other members to read the configuration information, but failed because it could not contact any of the nodes over RPC.

The fix is fairly simple:

Open an elevated command prompt on one of the DAG members and run:

Cluster.exe Node EGIEX02 /ForceCleanUp

After you have run the above command, the node will be removed from the cluster.

Now open the Exchange Management Shell and run:

Start-DatabaseAvailabilityGroup EGI-DAG-01 -MailboxServer EGIEX02

 

This should ideally take care of all the issues related to the cluster service. If you still cannot get past the MSExchangeRepl errors after that, you may need to reseed the problem database, or all of them, manually.

So what causes it?

Although the cluster service kept saying that it could not contact any of the nodes in the cluster, all those nodes were perfectly reachable via remote registry, WMI, event logs, etc.

An answer lies within the XML of event ID 1090 from Microsoft-Windows-FailoverClustering:
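If a reseed does turn out to be necessary, it could be scripted from the Exchange Management Shell roughly as follows. Treat this as a sketch: it reseeds every copy on the node whose status starts with "Failed", and -DeleteExistingFiles discards whatever database and log files that copy still has, so review the list in $failed before letting the loop run.

```powershell
$server = 'EGIEX02'

# Find copies on the repaired node that are still Failed or FailedAndSuspended.
$failed = Get-MailboxDatabaseCopyStatus -Server $server |
    Where-Object { $_.Status -like 'Failed*' }

foreach ($copy in $failed) {
    # Replication must be suspended before a copy can be reseeded.
    Suspend-MailboxDatabaseCopy -Identity $copy.Name -Confirm:$false
    # Discard the broken copy's files and seed again from the active copy.
    Update-MailboxDatabaseCopy -Identity $copy.Name -DeleteExistingFiles -Confirm:$false
}

# Check the result afterwards.
Get-MailboxDatabaseCopyStatus -Server $server
```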

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
    <EventID>1090</EventID>
    <Version>0</Version>
    <Level>1</Level>
    <Task>8</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="2013-08-17T07:18:32.625000000Z" />
    <EventRecordID>192930</EventRecordID>
    <Correlation />
    <Execution ProcessID="3332" ThreadID="3552" />
    <Channel>System</Channel>
    <Computer>EGIEX02.EGI.LOCAL</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="Status">2</Data>
    <Data Name="NodeName">EGIEX02</Data>
  </EventData>
</Event>

S-1-5-18 is the well-known security principal Local System. The cluster service on a DAG member uses this account as its logon account, and so does the replication service. Every time a node in a cluster tries to contact another, it has to perform a security handshake, which uses Kerberos by default. When these handshakes are not successful, the calling node is denied access to the resources and to any cluster information that the other nodes share among each other. Troubleshooting Kerberos is a nightmare (at least for me). The Kerberos theory is well supported by the FailoverClustering Operational logs: you will see plenty of entries of the problem node trying to perform a handshake, and nothing after that.

By removing and re-adding the node to the cluster, we reset almost everything related to the problem node in the cluster database.
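The interesting fields can also be pulled out of the event XML with PowerShell instead of reading it by eye. The sketch below uses a trimmed copy of the XML shown above; on a live system you would get the same XML from the ToXml() method of an event returned by Get-WinEvent.

```powershell
# Trimmed copy of the event XML shown above.
$eventXml = @'
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>1090</EventID>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="Status">2</Data>
    <Data Name="NodeName">EGIEX02</Data>
  </EventData>
</Event>
'@

$xml = [xml]$eventXml
$summary = [pscustomobject]@{
    EventId = $xml.Event.System.EventID
    UserSid = $xml.Event.System.Security.UserID
    Status  = ($xml.Event.EventData.Data | Where-Object { $_.Name -eq 'Status' }).'#text'
    Node    = ($xml.Event.EventData.Data | Where-Object { $_.Name -eq 'NodeName' }).'#text'
}
$summary

# On Windows, the SID can be translated to an account name:
# (New-Object System.Security.Principal.SecurityIdentifier $summary.UserSid).Translate([System.Security.Principal.NTAccount])
```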

 

I hope this helps someone who finds himself in trouble with this issue.
