Exchange 2010 High Availability – Best Copy Selection

Everyone who has started playing with the Exchange 2010 Beta probably knows this concept by now. Compared to its predecessors, Exchange 2007 and 2003, Exchange 2010 has been completely rewritten in this area, and high availability is handled very differently than in any earlier version. Companies spend a large share of their IT budgets planning, deploying and configuring highly available services to make sure that business-critical applications do not go down and that no money is lost to an outage. Exchange, being a messaging system, has become one of the most critical applications for enterprises, because most communication happens over email: document exchange, voice mail from your phone, business correspondence, newsletters and much more are sent and received through the messaging system in every company.

Consider a financial firm that needs continuous email communication with customers, partners and government authorities, available 24x7x365. In such a demanding environment, a prolonged failure of even a single server can lead to lost revenue and dissatisfied customers. To handle such scenarios, Exchange has been providing HA capabilities to administrators and architects. From Exchange 2003’s failover clustering to Exchange 2007’s CCR, Exchange HA has been evolving with market needs and business requirements. This is one of my favorite features in Exchange. Let’s take a look at how it all works.

Exchange 2003 and 2007 provided HA based on Windows clustering, where the whole server, as a single object, failed over to another node of the cluster. With Exchange Server 2010, things have changed dramatically. Failover no longer occurs at the server level; instead, the store schema has been restructured so that failover can occur at the database level. This is made possible by the Database Availability Group, commonly known as the DAG.

In short, a DAG is a group of servers and databases that together provide high availability. It still uses Microsoft clustering services but does not rely on them completely. Instead of driving server-level failover, the clustering services are used only to group the nodes of the cluster. What fails over is only the problematic database. The article Understanding Mailbox Database Availability provides an overview of how this works. As that article explains, a database has additional copies on one or more other servers; these are passive copies and are not used for production connectivity. However, they are kept continuously in sync with the production copy. When a failure of the production copy is detected, one of the passive copies is activated and starts serving clients as the production copy. This is known as a failover. If, on the other hand, an administrator activates one of the passive copies manually, the process is known as a switchover.
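
To make the difference between the two terms concrete, here is a small toy model in Python. It is only a sketch under my own assumptions: the class, its fields and the server names (MBX1, MBX2, MBX3) are made up for illustration and are not Exchange APIs. It also simplifies by always turning the old active copy into a passive one, which a real failure may not allow.

```python
from dataclasses import dataclass, field


@dataclass
class MailboxDatabase:
    """Toy model: one active copy plus passive copies on other DAG members."""
    name: str
    active_server: str
    passive_servers: list = field(default_factory=list)

    def activate(self, target: str, initiated_by_admin: bool) -> str:
        """Make the copy on `target` active and name the kind of move."""
        if target not in self.passive_servers:
            raise ValueError(f"{target} holds no passive copy of {self.name}")
        # Only this database moves; the server itself does not fail over.
        # Simplification: the old active copy is kept as a passive copy.
        self.passive_servers.remove(target)
        self.passive_servers.append(self.active_server)
        self.active_server = target
        return "switchover" if initiated_by_admin else "failover"


db = MailboxDatabase("DB1", "MBX1", ["MBX2", "MBX3"])
print(db.activate("MBX2", initiated_by_admin=False))  # failover
print(db.active_server)                               # MBX2
```

The same `activate` step serves both cases; the only difference is who triggers it.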

With a switchover, the administrator knows which copy to bring online. But what if a failure occurs and no administrator is available to monitor or recover the situation? Don’t worry about that. The failover is managed by Exchange itself.

Exchange 2010’s Exchange Replication Service monitors the databases from time to time and determines their health. If a database failure is detected, the failover process starts. To fail over a database instead of the whole server, a component called Active Manager has been added to the Exchange Replication Service; it replaces the cluster’s server-level failover behavior. When a failover occurs and a database has more than one copy spread across multiple servers, Exchange has to choose the best available copy to mount. This process is known as Best Copy Selection. The idea is simple: choose the best available copy among all copies of a particular database. But hold on. The selection process uses 10 different sets of criteria to pick the best copy. Again, the one and only Active Manager manages all of this. Let’s see how Active Manager decides which copy to pick and activate.

As you now know, Active Manager selects the best copy and initiates the failover. It first looks for a database copy that has a status of Healthy, DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource, and that meets all of the following criteria:

  • It has a content index with a status of Healthy
  • It has a copy queue length of < 10 log files and
  • It has a replay queue length of < 50 log files
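
The first set of criteria can be written as a single check. The sketch below is illustrative only: `DatabaseCopy` and its field names are hypothetical stand-ins I made up for the values an administrator would see for each copy, not an Exchange type, and the server names are invented.

```python
from dataclasses import dataclass

# Copy statuses the article lists as eligible for automatic activation.
ELIGIBLE_STATUSES = {
    "Healthy",
    "DisconnectedAndHealthy",
    "DisconnectedAndResynchronizing",
    "SeedingSource",
}


@dataclass
class DatabaseCopy:
    """Hypothetical snapshot of one passive copy; not an Exchange API type."""
    server: str
    status: str
    content_index_state: str
    copy_queue_length: int    # log files not yet copied to this server
    replay_queue_length: int  # copied log files not yet replayed


def meets_first_criteria_set(copy: DatabaseCopy) -> bool:
    """True when the copy passes every check in the first criteria set."""
    return (
        copy.status in ELIGIBLE_STATUSES
        and copy.content_index_state == "Healthy"
        and copy.copy_queue_length < 10
        and copy.replay_queue_length < 50
    )


in_sync = DatabaseCopy("MBX2", "Healthy", "Healthy", 3, 12)
lagging = DatabaseCopy("MBX3", "Healthy", "Healthy", 25, 12)
print(meets_first_criteria_set(in_sync))  # True
print(meets_first_criteria_set(lagging))  # False
```

The second copy fails only because its copy queue length is 25, which is not below 10.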

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a content index with a status of Crawling
  • It has a copy queue length of < 10 log files and
  • It has a replay queue length of < 50 log files

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a content index with a status of Healthy and
  • It has a replay queue length of < 50 log files

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a content index with a status of Crawling and
  • It has a replay queue length of < 50 log files

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a replay queue length of < 50 log files

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a content index with a status of Healthy and
  • It has a copy queue length < 10 log files

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a content index with a status of Crawling and
  • It has a copy queue length < 10 log files

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a content index with a status of Healthy.

If Active Manager finds no copy that meets the previous set of criteria, it tries to locate a database copy that meets the next set:

  • It has a content index with a status of Crawling.

If, even after all this matching, Active Manager finds no copy that meets any of the sets above, it tries to activate any database copy with a status of Healthy, DisconnectedAndHealthy, SeedingSource, or DisconnectedAndResynchronizing. If no copy meets even that, automatic activation (failover) does not occur.
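
The whole cascade above can be condensed into a table of criteria sets that is tried from top to bottom. This is only a sketch of the logic as I described it, not Active Manager's actual code: `CopyState`, the server names and the table layout are my own inventions, and I am assuming the status requirement applies to every set, not only the first.

```python
from typing import NamedTuple, Optional


class CopyState(NamedTuple):
    """Hypothetical view of one database copy; not an Exchange type."""
    server: str
    status: str
    content_index: str
    copy_queue: int
    replay_queue: int


ELIGIBLE = {"Healthy", "DisconnectedAndHealthy",
            "DisconnectedAndResynchronizing", "SeedingSource"}
COPY_QUEUE_LIMIT = 10    # copy queue length must be below this
REPLAY_QUEUE_LIMIT = 50  # replay queue length must be below this

# (required content index state or None, check copy queue, check replay queue)
CRITERIA_SETS = [
    ("Healthy", True, True),
    ("Crawling", True, True),
    ("Healthy", False, True),
    ("Crawling", False, True),
    (None, False, True),
    ("Healthy", True, False),
    ("Crawling", True, False),
    ("Healthy", False, False),
    ("Crawling", False, False),
    (None, False, False),  # last resort: any copy with an eligible status
]


def matches(copy: CopyState, criteria: tuple) -> bool:
    index_state, check_copy_queue, check_replay_queue = criteria
    if copy.status not in ELIGIBLE:
        return False
    if index_state is not None and copy.content_index != index_state:
        return False
    if check_copy_queue and copy.copy_queue >= COPY_QUEUE_LIMIT:
        return False
    if check_replay_queue and copy.replay_queue >= REPLAY_QUEUE_LIMIT:
        return False
    return True


def select_candidates(copies: list) -> tuple:
    """Return (set number, matching copies) for the first set any copy meets."""
    for number, criteria in enumerate(CRITERIA_SETS, start=1):
        found = [c for c in copies if matches(c, criteria)]
        if found:
            return number, found
    return None, []  # no automatic activation


copies = [
    CopyState("MBX2", "Healthy", "Crawling", 2, 80),
    CopyState("MBX3", "Healthy", "Healthy", 40, 10),
    CopyState("MBX4", "Failed", "Healthy", 0, 0),
]
number, found = select_candidates(copies)
print(number, [c.server for c in found])  # 3 ['MBX3']
```

In this made-up example nothing meets the first two sets, and MBX3 is the first match at set 3, because its content index is Healthy and its replay queue is below 50 even though its copy queue is too long for sets 1 and 2.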

Another good question: what if Active Manager finds more than one copy that meets the same set of criteria? In that case, the configured ActivationPreference value is consulted, and the copy with the lowest value is activated and mounted.
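
The tie-break is easy to picture in code. Again, this is a hedged sketch: `Candidate` and the server names are hypothetical, and the only rule taken from the text above is that the lowest ActivationPreference value wins.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical record for a copy that passed the same criteria set."""
    server: str
    activation_preference: int  # lower value means more preferred


def pick_by_activation_preference(candidates: list) -> Optional[Candidate]:
    """Return the candidate with the lowest ActivationPreference, if any."""
    if not candidates:
        return None  # nothing qualified, so nothing is activated
    return min(candidates, key=lambda c: c.activation_preference)


tied = [Candidate("MBX3", 3), Candidate("MBX2", 2)]
print(pick_by_activation_preference(tied).server)  # MBX2
```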

I would like to thank Scott for presenting this information in his introductory video, and my friend Amit for suggesting that I write it up. Thanks to both!
