Chapter 5. High Availability

When you have a single Exchange server, whether that is one multi-role server or two separate servers each hosting one server role, you have a single point of failure, or SPOF. When such a server fails, it is no longer available and your users are left without their messaging service.

To overcome this problem you have to implement a high availability solution; in short, that means implementing more servers offering the same service. In the case of Exchange 2013, three distinct server roles are involved.

In Exchange 2013, high availability is implemented by means of a database availability group, or DAG. A DAG is a collection of up to 16 Mailbox servers that can host a set of mailbox databases and can provide recovery from mailbox database failures or Mailbox server failures.

To achieve high availability on the Exchange 2013 Client Access servers, you have to implement an array of Client Access servers that can service client requests. Besides an array of Client Access servers, though, you also need some form of load balancing in front of the array of Client Access servers to distribute the client requests and provide connection failover when one Client Access server fails.

To achieve high availability on the transport layer, you can implement multiple Exchange 2013 SP1 Edge Transport servers—next to multiple Exchange 2013 Mailbox servers, of course.

In this chapter, we discuss high availability on all three Exchange 2013 server roles.

Mailbox Server High Availability

In Exchange Server 2003 and earlier, it was possible to use Windows clustering to create some sort of high availability in Exchange Server. On the underlying Windows operating system, a failover cluster was created that consisted of two or more physical servers called cluster nodes. These nodes used shared storage—that is, storage that could be used by only one of the nodes at a time. Exchange Server was installed as a virtual server on this cluster. When one cluster node failed, another cluster node in the cluster could take over the Exchange virtual server. While this concept works fine for server redundancy, there’s still a single point of failure: the mailbox database.

For Exchange Server 2007, Microsoft improved the cluster technology, which led to the concept of cluster continuous replication (CCR). In a CCR cluster, two Exchange Mailbox servers are combined whereby each server hosts one copy of the mailbox database. If one server fails, the other cluster node takes over the Exchange virtual server and activates the other copy of the mailbox database.

To lower the complexity, and to minimize the downtime in case of a server failure, the CCR technology evolved into the database availability group (DAG) in Exchange Server 2010, a technology that’s also available in Exchange 2013. A DAG is a logical grouping of a set of Exchange 2013 Mailbox servers that can hold copies of each other’s mailbox databases. So, when there are six Mailbox servers in a DAG, mailbox database MBX01 can be active on the first server in the DAG, but it can have a copy on the fourth and sixth servers in the DAG. When the first server in the DAG fails, the mailbox database copy on the fourth server becomes active and continues servicing the user requests with minimal downtime for the user.

Under the hood, a DAG is using components of Windows failover clustering, and as such we have to discuss some of these components in more detail.

Cluster Nodes and the File Share Witness

A DAG usually consists of at least two Exchange 2013 Mailbox servers. It is possible to have a DAG with only one Exchange 2013 Mailbox server, but in this case there’s no redundancy. Another server is involved in a DAG as well, and this is the Witness server.

By way of explanation, the DAG uses Windows failover clustering software, and in Windows 2012 R2, there are some major changes that Exchange 2013 SP1 can take advantage of. In particular, there are two new options to discuss here:

· Dynamic Quorum In Windows 2012 and Windows 2012 R2, the quorum majority is determined by the nodes that are active members of the cluster at a given time, whereas in Windows 2008 R2, the quorum majority is fixed and determined at the moment of cluster creation. This modification means that a cluster can dynamically change from an eight-node cluster to a seven- or six-node cluster, and in case of issues the majority changes accordingly. In theory it is possible to dynamically bring down a cluster to only one (1) cluster node, also referred to as the “last man standing.” Besides this automatic adjustment, an administrator can also remove a member’s vote manually by setting that node’s NodeWeight property to 0. The official Exchange product team’s best practice is to leave the dynamic quorum enabled, but not to take it into account when designing an Exchange environment.

· Dynamic Witness Prior to Windows Server 2012 R2, when the file share witness (FSW) was not available, the cluster service would try to start the FSW resource once. If it failed, the cluster might become unavailable and all mailbox databases would be dismounted.

In Windows Server 2012 R2, though, when a cluster is configured with dynamic quorum, a new feature called dynamic witness becomes available. With a dynamic witness, the witness vote is automatically adjusted based on the status of the FSW. If the FSW is offline and not available, its witness vote is automatically set to 0, thereby eliminating the chance of an unexpected shutdown of the cluster. Just as with dynamic quorum, the recommendation is to leave the dynamic witness enabled, which is the default. Exchange 2013 SP1 is not aware of the dynamic witness, but it can take advantage of this cluster behavior. The following example shows how to inspect both settings.
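
If you want to verify these cluster settings on an existing DAG, you can query them with the FailoverClusters PowerShell module. This is a hedged sketch: the DAG name AMS-DAG01 is the one used later in this chapter, and the DynamicQuorum, WitnessDynamicWeight, and DynamicWeight properties assume Windows Server 2012 R2 cluster nodes.

Import-Module FailoverClusters
# Check whether dynamic quorum and the dynamic witness weight are enabled on the cluster
Get-Cluster -Name AMS-DAG01 | Format-List Name, DynamicQuorum, WitnessDynamicWeight
# Show the (dynamic) vote of each cluster node
Get-ClusterNode -Cluster AMS-DAG01 | Format-Table Name, State, NodeWeight, DynamicWeight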

From an Exchange Server point of view, the failover clustering software and its new features are fully transparent, so there’s no need to start worrying about clusters, and there’s no need to manage the DAG with the failover cluster manager. All management of the DAG is performed using the Exchange Management Shell or the Exchange Admin Center. In fact, I strongly recommend not using the Windows failover cluster management tool to manage a DAG.

The witness server and the file share witness (the latter being a shared directory on the witness server) are used only when there is an even number of Mailbox servers in the DAG, but as explained before, its vote is adjusted automatically. Furthermore, the witness server stores no mailbox information; it has only a cluster quorum role.

The following are the prerequisites for the witness server:

· The witness server cannot be a member of the DAG.

· The witness server must be in the same Active Directory forest as the DAG.

· The witness server must be running Windows Server 2003 or later.

· A single server can serve as a witness for multiple DAGs, but each DAG has its own witness directory.

The witness server plays an important role when problems arise in the DAG—for example, when an Exchange 2013 Mailbox server is not available anymore. The underlying principle is based on an N/2+1 number of servers in the DAG. This means that for a DAG to stay alive when disaster strikes, at least half the number of Mailbox servers plus one need to be up and running.

So, if you have a six-node DAG, the DAG can survive the loss of two Exchange 2013 Mailbox servers (6/2 +1). The file share witness, however, is an additional server or vote in this process. So, if there are six Exchange 2013 Mailbox servers in the DAG and three servers fail, the file share witness is the +1 server or vote, and the DAG will survive with four members: three Mailbox servers plus the additional file share witness.

Image Note Following the Microsoft recommendation, the dynamic quorum and dynamic witness are not involved in this example.

Microsoft recommends you use an Exchange server as a file share witness, which of course cannot be a Mailbox server that is part of the DAG. The reason for this is that an Exchange server is always managed by the Exchange administrators in the organization, and the Exchange Trusted Subsystem Universal Security Group has control over all Exchange servers in Active Directory.

When you’re using a multi-role setup, which is the Exchange 2013 Client Access server and Mailbox server on the same box, and these servers are DAG members, there’s no additional Exchange server that can hold the file share witness role. In this case, it is also possible to use another Windows server as the file share witness. The only prerequisite is that the Exchange Trusted Subsystem have full control over the Windows server, so the Exchange Trusted Subsystem needs to be a member of the local Administrators Security Group of the Windows server. As domain controllers do not have local groups, it would be necessary to add the Exchange Trusted Subsystem to the Domain Administrators Security Group. However, this imposes a security risk and it is therefore not recommended.

There’s no reason to configure the file share witness in a high-availability configuration such as on a file cluster. Exchange Server periodically checks for the file share witness—by default, every four hours—to see if the file share witness is still alive. If it’s not available at that moment, the DAG continues to run without any issues. The only time the file share witness needs to be available is during DAG changes, when an Exchange 2013 Mailbox server fails, or when Exchange 2013 Mailbox servers are added to or deleted from the DAG.

A question that pops up on a regular basis is whether or not to store the file share witness on a DFS share, especially when the company has multiple locations. This is not a good idea. Imagine this: There are two locations, A and B, and each location has three Exchange 2013 Mailbox servers, all configured in one DAG. The file share witness is located on a DFS share, and thus potentially available in both locations. Now, suppose the network connection between locations A and B fails for some reason. The DAG will notice the connection loss, and in both locations, Exchange will try to determine the number of available Mailbox servers and attempt to contact the file share witness. In location A, this will succeed and the DAG will continue to run with four votes (three Exchange 2013 Mailbox servers plus the file share witness). In location B, the same will happen, so Exchange will try to contact the file share witness as well. Since the file share witness is available via the DFS share in location B too, the DAG will claim the file share witness in location B and continue to run as well. Exchange 2013 in each location will assume that the DAG members in the other location have been shut down—which of course is not the case. This is called a split-brain scenario, a highly undesirable situation that will lead to unpredictable results, and it is not supported at all.

Image Note Using a DFS share for the file share witness is not supported and can lead to undesirable results, and should therefore never be done.

Cluster Administrative Access Point

When a Windows failover cluster is created, an access point for the cluster is created as well. An access point is a combination of a name and an IP address. This IP address can be IPv4 or IPv6; it can be statically assigned or dynamically assigned using DHCP.

The first access point that gets created is the cluster administrative access point, sometimes also referred to as the cluster name and cluster IP address.

In Exchange 2013, this cluster administrative access point is the name of the DAG and its IP address. As the name implies, this is only used for management purposes. Important to note is that clients connect to the Exchange 2013 Client Access server and the Client Access server connects to a particular mailbox database where a mailbox resides. The Client Access server does nothing whatsoever with the cluster administrative access point.

New in Windows Server 2012 R2 is the concept of failover clusters without a cluster administrative access point. In Exchange Server, this means that you create a DAG with a name and without an IP address. Is this bad? No, not at all, since nothing connects to the cluster administrative access point, except for the failover cluster manager. But since all cluster management is performed using the Exchange Management Shell, this is not needed for Exchange 2013. In the section about the DAG creation process, we will show how to create a DAG without an administrative access point.
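To create a DAG without a cluster administrative access point, you pass the special value None as the DAG IP address. The following is a hedged sketch rather than the chapter's own example: the DAG name AMS-DAG02 and the witness settings are placeholders, and it requires Exchange 2013 SP1 with DAG members running Windows Server 2012 R2.

New-DatabaseAvailabilityGroup -Name AMS-DAG02 -WitnessServer AMS-FS01 -WitnessDirectory C:\DAG02\DAG02_FSW -DatabaseAvailabilityGroupIPAddresses ([System.Net.IPAddress]::None)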

Replication

A database availability group consists of a number of Exchange 2013 Mailbox servers, and these Mailbox servers have multiple mailbox databases (see Figure 5-1). There’s only one copy of a given mailbox database on a given Mailbox server in a DAG, so the total number of copies of a specific mailbox database can never exceed the number of Mailbox servers in the DAG.

image

Figure 5-1. Schematic overview of a database availability group (DAG)

The mailbox databases can be either active or passive copies. The active copy is where all the mailbox data processing takes place, and it’s no different from a normal Exchange 2013 Mailbox server that’s not part of a DAG. Now, another Exchange 2013 Mailbox server in the DAG can host another copy of this same database; this is called a passive copy. A regular passive copy should be close to 100 percent identical to the active copy, and it is kept up to date by a technology called log shipping or log file replication.

There are two ways of replicating data from one Mailbox server to another:

· File mode replication

· Block mode replication

File Mode Replication

As explained in the previous chapter, all transactions are logged in the transaction log files. When the Mailbox server has stored all the transactions in one log file, a new log file is generated and the “old” log file is written to disk. At this moment, the log file is also copied to the second Mailbox server, where it is stored on disk. The log file is then inspected; if it’s okay, the contents of the log file are replayed into the passive copy of the mailbox database. Since the log file on the passive copy is identical to the log file on the active copy, all contents are the same in both the active and the passive copies.

The process of copying transaction log files is called file mode replication, since all log files are copied to the other Mailbox server.

Block Mode Replication

Another mode, which was actually introduced in Exchange 2010 SP1, is block mode replication. In this process the transactions are written into the active server’s log buffer (before they are flushed into the active log file), and at the same time the transactions are copied to the passive server and written into that server’s log buffer. When the log buffers are full, the information is flushed to the current log file and a new log file is used. Both servers do this at the same time. When the Mailbox server is running block mode replication, the replication of individual log files is suspended; only individual transactions are copied between the Mailbox servers. The advantage of block mode replication is that the server holding the passive copy of the mailbox database is always 100 percent up to date and therefore failover times are greatly reduced.

The default process is block mode replication, but the server falls back to file mode replication when that server is too busy to cope with replicating individual transactions. If this happens, the Exchange server can replicate the individual transaction log files at its own pace, and even queue some log files when there are not enough resources.

An active mailbox database copy can have multiple passive copies on multiple Mailbox servers (remember that one server can hold only one copy of a specific mailbox database, active or passive). The active copy of a mailbox database is where all the processing takes place and all the replication, whether it is file mode or block mode, takes place from this active copy to all passive copies of the mailbox database. There’s absolutely no possibility that one passive copy will replicate log files to another passive copy. The only exception to this is when a new copy of a mailbox database is created from another passive copy, but that’s only the initial creation, which is seeding.

Seeding

Creating the passive copy of an active mailbox database is called seeding. In this process, the mailbox database is copied from one Mailbox server to another. When seeding, the complete mailbox database (the actual .edb file) is copied from the Mailbox server hosting the active copy of this mailbox database to the second Mailbox server. This is not a simple NTFS file copy, though; the Information Store streams the file from one location to another. In this process, the Information Store reads the individual pages of the mailbox database—a process that’s very similar to the streaming backup process that was used in Exchange Server 2003 and earlier.

Here’s how it works: The Information Store reads the contents of the mailbox database page by page, automatically checking them. If there’s an error on a particular page (i.e., a corrupt page), the process stops and the error is logged. This way, Exchange prevents copying a mailbox database to another location that has corrupted pages. Since the pages of the mailbox database are copied from one Mailbox server to another Mailbox server, the passive copy is identical to the active copy. When the entire mailbox database is copied to the other Mailbox server, the remaining log files are copied to the other Mailbox server as well.

When a new mailbox database is seeded, the process takes only a couple of minutes because there’s not too much data to copy. But imagine a mailbox database of 1 TB in a normal production environment. When that has to be seeded, it can take a considerable amount of time. And not only is the timing an important factor but also the process puts additional load on the servers. The 1 TB of data needs to be read and checked, copied via the network, and written to disk on the other Mailbox server.
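Seeding happens automatically when a mailbox database copy is created, but you can also trigger a full reseed manually, for example when a passive copy has failed. The following is a minimal sketch using the database and server names from the examples later in this chapter; the copy is first suspended, and the -DeleteExistingFiles switch removes the broken copy before reseeding.

# Suspend the passive copy, then reseed it from the active copy
Suspend-MailboxDatabaseCopy -Identity AMS-MDB01\AMS-EXCH02 -Confirm:$false
Update-MailboxDatabaseCopy -Identity AMS-MDB01\AMS-EXCH02 -DeleteExistingFiles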

AutoReseed

A new feature in Exchange 2013 is automatic reseed, or AutoReseed. When a disk in an Exchange server fails, it is replaced and an Exchange administrator creates a new copy of a mailbox database on the new disk. AutoReseed is basically the same process, but automated; the idea behind AutoReseed is to get a mailbox database up and running again immediately after a disk failure. To achieve this, Exchange 2013 can use the Windows 2012 feature of multiple mount points per volume.

When AutoReseed is configured, the Exchange 2013 Mailbox server has one or more spare disks in its disk cabinet. When a disk containing a mailbox database fails, the Microsoft Exchange Replication service automatically allocates a spare disk and automatically creates a new copy of this particular mailbox database.

The DAG has three properties that are used for the AutoReseed feature:

· AutoDagVolumesRootFolderPath This is a link to the mount point that contains all available volumes—for example, C:\ExchVols. Volumes that host mailbox databases, as well as spare volumes, are located here.

· AutoDagDatabasesRootFolderPath This is a link to the mount point that contains all mailbox databases—for example, C:\ExchDBs.

· AutoDagDatabaseCopiesPerVolume This property contains the number of mailbox database copies per volume.

It is important to note that although there’s only one mailbox database in a particular location, it can be reached via two paths: the first is via the C:\ExchVols mount point; the second is via the C:\ExchDBs mount point.
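
You can check how these properties are set on an existing DAG with Get-DatabaseAvailabilityGroup; a short sketch, assuming the DAG is called AMS-DAG01 as in the rest of this chapter:

Get-DatabaseAvailabilityGroup -Identity AMS-DAG01 | Format-List Name, AutoDagDatabasesRootFolderPath, AutoDagVolumesRootFolderPath, AutoDagDatabaseCopiesPerVolume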

AutoReseed constantly monitors the mailbox database copies and comes into action using the following steps:

1. The Microsoft Exchange Replication service constantly scans for mailbox database copies that have failed—that is, that have a copy status of “FailedAndSuspended.”

2. If a mailbox database is in a “FailedAndSuspended” status, the Microsoft Exchange Replication service does some prerequisite checks to see if AutoReseed can be performed.

3. If the checks are passed successfully, the Replication service automatically allocates a spare disk and configures it into the production disk system.

4. When the disk is configured, a new seeding operation is started, thus creating a new copy of the mailbox database.

5. When seeding is done, the Replication service checks if the new copy is healthy and resumes operation.

There is one manual step left at this point: the Exchange administrator has to replace the faulty disk with a new one and format the new disk appropriately.
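
To see whether any mailbox database copies are currently in the FailedAndSuspended state that triggers this workflow, you can filter the output of Get-MailboxDatabaseCopyStatus on the Status property. A hedged example, using the server AMS-EXCH01 from later in this chapter:

Get-MailboxDatabaseCopyStatus -Server AMS-EXCH01 | Where-Object { $_.Status -eq 'FailedAndSuspended' } | Format-Table Name, Status, CopyQueueLength, ReplayQueueLength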

Image Note For AutoReseed to function correctly, the disks need to be configured in a mount point configuration. You cannot use dedicated drive letters in combination with AutoReseed.

In the section about the DAG creation process, implementing and configuring AutoReseed is explained in detail.

Replication (Copy) Queue and Replay Queue

In an ideal situation, transaction log files are replicated to the other Exchange 2013 Mailbox servers directly after the log files are written to disk, and they are processed immediately after being received by the other Exchange 2013 Mailbox server. Unfortunately, we don’t live in an ideal world, so there might be some delay somewhere in the system.

When the Exchange 2013 Mailbox servers are extremely busy, it can happen that more transaction log files are generated than the replication process can handle and transmit. If this is the case, the log files will queue on the Mailbox server holding the active copy of the mailbox database. This queue is called the replication queue. Queuing always happens, and it is normally not a reason for concern as long as the number of log files in the queue is low and the log files don’t stay there too long. However, if there are thousands of log files waiting in line, it’s time to do some further investigation.

When the transaction log files are received by the Exchange 2013 Mailbox server holding the passive copy of the mailbox database, those transaction log files are stored in the replay queue. Queuing up in the replay queue happens as well, and is generally speaking also not a reason for concern when the number of transaction log files is low. There can be small spikes in the number of transaction log files in the replay queue, but when the number of transaction log files is constantly increasing, there’s something wrong. It can happen that the disk holding the mailbox database cannot keep up with the read-and-write operations, or there may not be enough resources to flush the queue, and so the queue will grow. As long as the system is able to flush the queue in a reasonable time, and there aren’t thousands of log files in the queue, you should be fine.

Lagged Copies

Regarding the replay queue, there’s one exception to note: lagged copies. If you have implemented lagged copies in your DAG, and you experience a large number of log files in the replay queue, then there’s nothing to worry about. Lagged copies are passive copies of a mailbox database that aren’t kept up to date. This means that log files are replicated to the Exchange 2013 Mailbox server holding the lagged copy, but the log files themselves are kept in the replay queue. This lag time between replication and replaying on the server can be as little as 0 seconds (the log file is replayed immediately) or up to 14 days. A very long lag time will have a serious impact on scalability, of course. A full 14 days’ worth of log files can mean a tremendous amount of data being stored in the replay queue; also, replaying the transaction log files of a lagged copy can take quite some time when longer time frames are used.
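
A lagged copy is created with the same Add-MailboxDatabaseCopy cmdlet that is used later in this chapter, but with a replay lag time specified. The following is a hedged sketch; the third Mailbox server AMS-EXCH03 is hypothetical and the seven-day lag is just an example value.

Add-MailboxDatabaseCopy -Identity AMS-MDB01 -MailboxServer AMS-EXCH03 -ReplayLagTime 7.00:00:00 -TruncationLagTime 0.00:00:00 -ActivationPreference 3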

Lagged copies are not a high-availability solution; rather, they are a disaster recovery solution. (Lagged copies and disaster recovery are explained in detail in Chapter 7.)

Active Manager

The Active Manager is a component of Exchange 2013, and it runs inside the Microsoft Exchange Replication service on all Exchange 2013 Mailbox servers. The Active Manager is the component that’s responsible for the high availability inside the database availability group.

There are several types of active managers:

· Primary Active Manager (PAM) The PAM is the role that decides which copy of a mailbox database is the active copy and which ones are the passive copies; as such, PAM reacts to changes in the DAG, such as DAG member failures. The DAG member that holds the PAM role is always the server that also holds the quorum resource or the default cluster group.

· Standby Active Manager (SAM) The SAM is responsible for providing DAG information—for example, which mailbox database is an active copy and which copies are passive copies—to other Exchange components like the Client Access service or the Hub Transport service. If the SAM detects a failure of a mailbox database, it requests a failover to the PAM. The PAM then decides which copy to activate.

· Standalone Active Manager The Standalone Active Manager is responsible for mounting and dismounting databases on that particular server. This active manager is available only on Exchange 2013 Mailbox servers that are not members of a DAG.

DAG Across (Active Directory) Sites

In the previous examples, the DAG has always been installed in one Active Directory site. However, there’s no such boundary for a DAG, so it is possible to create a DAG that spans multiple Active Directory sites, even in different physical locations. For instance, it is possible to extend the DAG to anticipate two potential scenarios:

· Database Disaster Recovery In this scenario, mailbox databases are replicated to another location exclusively for offsite storage. These databases are safe there should disaster, like a fire or flood, strike at the primary location.

· Site Resiliency In this scenario, the DAG is (most likely) evenly distributed across two locations (see Figure 5-2). The second location, however, also has (multiple) Exchange 2013 Client Access servers with a full-blown Internet connection. When disaster strikes and the primary site is no longer available, the second site can take over all functions.

image

Figure 5-2. A DAG stretched across two locations

When using a geo-DNS solution, a single FQDN (e.g., webmail.contoso.com) can be used for both locations. For example, in Figure 5-2, there are two Active Directory sites, one location in Europe and another in North America (NA). When a user tries to contact webmail.contoso.com while traveling in Europe, he’s automatically connected to the Europe site. When he tries to access webmail.contoso.com in the United States, he’s connected to the NA site. In either case, after authentication the client is automatically proxied to the correct Mailbox server to get the mailbox information.

By default, a site failover is not an automated process. If a data-center failover needs to happen, especially when the site holding the file share witness is involved, administrative action is required. However, with Exchange 2013, it is possible to work around this limitation by placing the file share witness in a third Active Directory site.
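Placing the file share witness in a third site is simply a matter of pointing the DAG's witness settings to a server in that site. A minimal sketch, assuming a hypothetical file server AMS-FS03 in the third Active Directory site (the Exchange Trusted Subsystem group must be a local administrator on that server, as explained earlier):

Set-DatabaseAvailabilityGroup -Identity AMS-DAG01 -WitnessServer AMS-FS03.contoso.com -WitnessDirectory C:\DAG01\DAG01_FSW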

It is possible to create an active/active scenario whereby both data centers are active, servicing users and processing mail data. In this case, two DAGs have to be created; each DAG is active in one data center and its passive copies are located in the other data center. Note, however, that an Exchange Mailbox server can be a member of only one DAG at a time. This could mean that you need more servers in an active/active scenario, a downside of having two DAGs.

Creating a site-resilient configuration with multiple DAGs requires careful planning, plus asking yourself a lot of questions, both technical and organizational. Typical questions are:

· What level of service is required?

· What level of service is required when one data center fails?

· What are the objectives for recovery point and recovery time?

· How many users are on the system and which data centers are these users connecting to?

· Is the system designed to service all users when one data center fails?

· How are services moved back to the original data center?

· Are there any resources available (like IT Staff) for these scenarios?

These are just basic planning questions to be answered before you even think about implementing a site-resilient configuration. And remember: the more requirements there are and the stricter they are, the more expensive the solution will be!

DAG Networks

A DAG uses one or more networks for client connectivity and for replication. Each DAG contains at least one network for client connectivity, which is created by default, and zero or more replication networks. In Exchange 2010, this default DAG network was called the MAPI network. The MAPI protocol is no longer used in Exchange 2013 as a native client protocol, but the default DAG network is still called MapiDagNetwork.

For years, Microsoft recommended the use of multiple networks to separate the client traffic from the replication traffic. With today’s 10 Gb networks, separating client traffic from replication traffic is no longer an issue. Also, with a server blade infrastructure and its 10 Gb backbone, the separation of traffic was more a logical separation than a physical one. Therefore, Microsoft has moved away from the recommendation of separating network traffic, thereby simplifying the Mailbox server network configuration.

When you still want to separate client traffic from replication traffic, you can do so in a supported manner. In Exchange 2013, the network is automatically configured by the system. If additional networks need to be configured, you set the DAG to manual configuration, then create the additional DAG networks.

When using multiple networks, it is possible to designate a network for client connectivity and the other networks for replication traffic. When multiple networks are used for replication, Exchange automatically determines which network to use for replication traffic. When all the replication networks are offline or not available, Exchange automatically switches back to the MAPI network for the replication traffic (as was the case in Exchange 2010).

Default gateways need to be considered when you are configuring multiple network interfaces in Windows Server. The only network that should be configured with a default gateway is the client connectivity network; all other networks should not have a default gateway configured.

Other recommendations important for replication networks are the following:

· Disabling the DNS registration on the TCP/IP properties of the respective network interface.

· Disabling the protocol bindings, such as Client for Microsoft Networks and File and Printer Sharing for Microsoft Networks, on the properties of the network interface.

· Rearranging the binding order of the network interfaces so that the client connectivity network is at the top of the connection order.

When using an iSCSI storage solution, make sure that the iSCSI network is not used at all for replication purposes. Remove any iSCSI network connection from the replication networks list.
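A DAG network that maps to the iSCSI subnet can be excluded from replication, and from DAG use altogether, with Set-DatabaseAvailabilityGroupNetwork. A hedged example: the network name iSCSINetwork is hypothetical, and the DAG must first be set to manual network configuration, as shown later in this chapter.

Set-DatabaseAvailabilityGroupNetwork -Identity AMS-DAG01\iSCSINetwork -ReplicationEnabled:$false -IgnoreNetwork:$true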

DAG Creation Process

Creating a database availability group consists of several steps. The first step in the process is to create the cluster name object in Active Directory, followed by creation of the DAG itself, adding the Mailbox servers to the DAG, configuring the DAG networks when needed, and adding the mailbox database copies. But let’s take each step in order.

Creating the Cluster Name Object

As explained earlier, under the hood, a DAG is using Windows failover clustering binaries. From a failover clustering perspective, the DAG is nothing more than a failover cluster, and this cluster is using a cluster name, formally known as the cluster name object, or CNO. When creating a DAG in Exchange 2013 that’s running on Windows Server 2008 R2, the CNO is created automatically. In Windows Server 2012 or later, this is no longer the case because of tightened security, and thus the CNO needs to be created manually.

The CNO is established as a new, disabled computer object in Active Directory and is assigned the appropriate permissions. To create the CNO, you can use the following commands:

Import-Module ActiveDirectory
New-ADComputer -Name "AMS-DAG01" -Path "OU=Accounts,DC=Contoso,DC=com" -Enabled $false
Add-AdPermission -Identity "AMS-DAG01" -User "Exchange Trusted Subsystem" -AccessRights GenericAll

Coauthor Michel de Rooij has written a complete PowerShell script around the creation of this CNO, including all prerequisite checks. You can find more information regarding this script at http://bit.ly/CNOPreStage.

Creating the DAG

Now that the computer account needed for the DAG is in Active Directory, you continue with creating the DAG itself. To create the DAG using EMS, you can use the following commands:

New-DatabaseAvailabilityGroup -Name AMS-DAG01 -WitnessServer AMS-FS01.contoso.com -WitnessDirectory C:\DAG01\DAG01_FSW -DatabaseAvailabilityGroupIPAddresses 192.168.0.187

Creating the DAG is simple—it’s only an entry written in the configuration partition of Active Directory. If you want to check it, you can use ADSIEdit and navigate to

CN=AMS-DAG01, CN=Database Availability Groups, CN=Exchange Administrative Group (FYDIBOHF23SPDLT), CN=Administrative Groups, CN=Contoso,CN=Microsoft Exchange, CN=Services, CN=Configuration, DC=Contoso, DC=com.

This is shown in Figure 5-3.

image

Figure 5-3. The newly created DAG in Active Directory

The information that’s returned when running a Get-DatabaseAvailabilityGroup command is just a representation of this object in Active Directory, combined with information taken from the local registry (when using the -status parameter).

Image Note Microsoft recommends that the file share witness be another Exchange server; this Exchange server cannot be a DAG member. However, this is not always possible, and then another server must be used as the file share witness. In this example, a file server called AMS-FS01 is used. Since Exchange Server cannot control a non-Exchange server, the Exchange Trusted Subsystem security group from Active Directory should be added to the local Administrators security group on the file share witness server.

Adding the Mailbox Servers

Once the DAG exists, the Mailbox servers can be added to it, which is a straightforward process; just run the following commands to add the servers AMS-EXCH01 and AMS-EXCH02 to the DAG created in the previous step:

Add-DatabaseAvailabilityGroupServer -Identity AMS-DAG01 -MailboxServer AMS-EXCH01
Add-DatabaseAvailabilityGroupServer -Identity AMS-DAG01 -MailboxServer AMS-EXCH02

When the Windows failover clustering components are not installed on the Mailbox server, the Add-DatabaseAvailabilityGroupServer cmdlet will install these automatically, as shown in Figure 5-4.

image

Figure 5-4. The failover clustering components will be installed automatically

At this point, a DAG is created with two members using a file server as a witness server.

Adding the Mailbox Database Copies

Now that the DAG is fully up and running, it’s time for the last step: making additional copies of the mailbox databases. Initially there’s only one copy of the mailbox database, but you can create redundancy when you add multiple copies on other Mailbox servers in the DAG.

It is important to note that the location of the mailbox database is identical on all Mailbox servers holding a copy of a particular mailbox database. So if you have a mailbox database F:\MDB01\MDB01.edb on server EXCH01, the copy of the mailbox database on server EXCH02 is on F:\MDB01\MDB01.edb as well. This might sound obvious, but every now and then I talk to people who are not aware of this.

The same is true for mount points, of course. If you have a mailbox database C:\ExchDbs\MDB01\MDB01.edb on server EXCH01, the mailbox database copy on server EXCH02 will be at the same location, C:\ExchDbs\MDB01.

To create additional copies of a mailbox database in a DAG, you can use the Add-MailboxDatabaseCopy cmdlet. To add a copy of mailbox database AMS-MDB01 on Mailbox server AMS-EXCH02 and a copy of AMS-MDB02 on Mailbox server AMS-EXCH01, you can use the following commands:

Add-MailboxDatabaseCopy -Identity AMS-MDB01 -MailboxServer AMS-EXCH02 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity AMS-MDB02 -MailboxServer AMS-EXCH01 -ActivationPreference 2

The activation preference is meant for administrative purposes and for planned switchovers. It is not used by an automatic failover. In case of an automatic failover, a process called the best copy selection on the Mailbox server is used to determine the optimal passive copy for activation.
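For a planned switchover, where you move the active copy to another Mailbox server manually, you can use the Move-ActiveMailboxDatabase cmdlet. A minimal sketch using the names from this example:

Move-ActiveMailboxDatabase -Identity AMS-MDB01 -ActivateOnServer AMS-EXCH02 -Confirm:$false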

Configuring the DAG Networks

In our example, the DAG is now configured with two Mailbox servers; by default, only one DAG network is configured, the default MapiDagNetwork. You can quickly see this in the EAC, in the lower right part of the DAG view, as shown in Figure 5-5.

image

Figure 5-5. Only one network is configured by default in a DAG

To add an additional DAG network (assuming that the servers have multiple network interfaces, of course), the DAG itself should be set to manual configuration, as mentioned earlier. This can only be done using the EMS with the following command:

Set-DatabaseAvailabilityGroup -Identity AMS-DAG01 -ManualDagNetworkConfiguration $true

To create a new additional network for replication purposes you can use the following command:

New-DatabaseAvailabilityGroupNetwork -DatabaseAvailabilityGroup AMS-DAG01
-Name "Contoso Replication Network" -Subnets 192.168.0.0/24 -ReplicationEnabled:$true

To designate this new network as a dedicated replication network, you have to disable the replication feature of the regular MapiDagNetwork in the DAG. To disable this, you can use the following command:

Set-DatabaseAvailabilityGroupNetwork -Identity AMS-DAG01\MapiDagNetwork -ReplicationEnabled:$false

After running these commands, you have created a separate network in the DAG specifically for replication traffic.
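
You can verify the resulting DAG network configuration with Get-DatabaseAvailabilityGroupNetwork; a short sketch for the DAG used in this example:

Get-DatabaseAvailabilityGroupNetwork -Identity AMS-DAG01 | Format-Table Name, Subnets, ReplicationEnabled, MapiAccessEnabled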

AutoReseed Configuration

As explained earlier in this chapter, you can use AutoReseed to have Exchange 2013 automatically reseed a mailbox database when one mailbox database or a disk containing mailbox databases in a DAG fails, assuming you have configured multiple copies, of course.

Configuring AutoReseed involves several steps:

· Configure the database availability group.

· Install and configure database disks.

· Create the mailbox databases.

· Create mailbox database copies.

These steps are explained in the next sections.

Configuring the Database Availability Group

The AutoReseed feature uses a number of properties on the database availability group that need to be populated:

· AutoDagDatabasesRootFolderPath

· AutoDagVolumesRootFolderPath

· AutoDagDatabaseCopiesPerVolume

You can use the following command to set these:

Set-DatabaseAvailabilityGroup AMS-DAG01 -AutoDagDatabasesRootFolderPath "C:\ExchDbs" -AutoDagVolumesRootFolderPath "C:\ExchVols" -AutoDagDatabaseCopiesPerVolume 2

Installing and Configuring the Database Disks

To implement AutoReseed you have to create multiple disks on your Exchange 2013 Mailbox server, where the disks are configured using mount points. In this example we have two Exchange 2013 Mailbox servers, each configured with three disks: Vol1, Vol2, and Vol3. Vol1 has two mailbox databases called AMS-MDB01 and AMS-MDB02, and Vol2 has two mailbox databases called AMS-MDB03 and AMS-MDB04. Vol3 is a spare disk that will be used if either Vol1 or Vol2 fails.

These three disks will be mounted in a directory C:\ExchVols, but they will also be mounted in a directory C:\ExchDBs, as shown in Figure 5-6.

image

Figure 5-6. Schematic overview of an AutoReseed configuration

When the disks are installed, you can create the root directories for the volumes and mailbox database mount points:

MD C:\ExchVols
MD C:\ExchDBs

You format the disks and mount them into the appropriate volume folders:

· C:\ExchVols\Vol1

· C:\ExchVols\Vol2

· C:\ExchVols\Vol3

Then you create the mailbox database folders in the appropriate location:

MD C:\ExchDBs\AMS-MDB01
MD C:\ExchDBs\AMS-MDB02
MD C:\ExchDBs\AMS-MDB03
MD C:\ExchDBs\AMS-MDB04

Creating the mount points for the mailbox databases is a bit trickier. You can use the Disk Management MMC snap-in, or you can use the command-line tool Mountvol.exe to achieve this.

When using the Disk Management MMC snap-in, you have to select a disk that was created in the previous step—for example, C:\ExchVols\Vol1. To add an additional mount point, right-click the disk and select “Change Drive Letter and Path.” Use the Add button to select a mailbox database directory—for example, C:\ExchDBs\AMS-MDB01. Repeat this step for the second mailbox database directory as well, as shown in Figure 5-7:

image

Figure 5-7. Mount the disk in the database directories

The steps as shown in Figure 5-7 need to be repeated for the remaining two directories for mailbox databases AMS-MDB03 and AMS-MDB04. The second volume should be mounted in these two directories.

Instead of using the Computer Management MMC snap-in, it is possible to use the Mountvol.exe command-line utility. The Mountvol.exe utility is used as follows:

Mountvol.exe c:\ExchDbs\AMS-MDB01 \\?\Volume{GUID}\
Mountvol.exe c:\ExchDbs\AMS-MDB02 \\?\Volume{GUID}\

You can retrieve the GUIDs of the individual volumes using Mountvol.exe as well; just use a command similar to Mountvol.exe C:\ExchVols\ and you’ll see something like what’s shown in Figure 5-8.

image

Figure 5-8. Retrieve the disk GUIDs using Mountvol.exe

To add the disk as an additional mount point to both mailbox database directories, you can use the following commands:

Mountvol.exe C:\ExchDbs\AMS-MDB01 \\?\Volume{845bfe37-193d-11e4-80c6-00155d000347}\
Mountvol.exe C:\ExchDbs\AMS-MDB02 \\?\Volume{845bfe37-193d-11e4-80c6-00155d000347}\
Mountvol.exe C:\ExchDbs\AMS-MDB03 \\?\Volume{845bfe3f-193d-11e4-80c6-00155d000347}\
Mountvol.exe C:\ExchDbs\AMS-MDB04 \\?\Volume{845bfe3f-193d-11e4-80c6-00155d000347}\

You can check the results by running the Mountvol.exe utility without any parameters. The output is shown in Figure 5-9.

image

Figure 5-9. The disk is mounted in three different locations

Creating the Mailbox Databases

The next step is to create the directory structure on both Mailbox servers where the mailbox database files will be stored. This depends on your own naming convention, of course, but it could look something like this:

md c:\ExchDBs\AMS-MDB01\AMS-MDB01.db
md c:\ExchDBs\AMS-MDB01\AMS-MDB01.log

md c:\ExchDBs\AMS-MDB02\AMS-MDB02.db
md c:\ExchDBs\AMS-MDB02\AMS-MDB02.log

md c:\ExchDBs\AMS-MDB03\AMS-MDB03.db
md c:\ExchDBs\AMS-MDB03\AMS-MDB03.log

md c:\ExchDBs\AMS-MDB04\AMS-MDB04.db
md c:\ExchDBs\AMS-MDB04\AMS-MDB04.log

The mailbox database file itself will be stored in the AMS-MDB01.DB subdirectory while the accompanying transaction log files will be stored in the AMS-MDB01.log subdirectory.

New mailbox databases will be created in the directories you just created; just use the following commands in the Exchange Management Shell:

New-MailboxDatabase -Name AMS-MDB01 -Server AMS-EXCH01 -LogFolderPath C:\ExchDbs\AMS-MDB01\AMS-MDB01.log -EdbFilePath C:\ExchDbs\AMS-MDB01\AMS-MDB01.db\AMS-MDB01.edb

New-MailboxDatabase -Name AMS-MDB02 -Server AMS-EXCH01 -LogFolderPath C:\ExchDbs\AMS-MDB02\AMS-MDB02.log -EdbFilePath C:\ExchDbs\AMS-MDB02\AMS-MDB02.db\AMS-MDB02.edb

New-MailboxDatabase -Name AMS-MDB03 -Server AMS-EXCH02 -LogFolderPath C:\ExchDbs\AMS-MDB03\AMS-MDB03.log -EdbFilePath C:\ExchDbs\AMS-MDB03\AMS-MDB03.db\AMS-MDB03.edb

New-MailboxDatabase -Name AMS-MDB04 -Server AMS-EXCH02 -LogFolderPath C:\ExchDbs\AMS-MDB04\AMS-MDB04.log -EdbFilePath C:\ExchDbs\AMS-MDB04\AMS-MDB04.db\AMS-MDB04.edb

Creating the Mailbox Database Copies

Of course, you need to create an additional copy of the mailbox database on the second Exchange 2013 Mailbox server. As explained in the previous section, you can create copies of the mailbox databases by using the following commands:

Add-MailboxDatabaseCopy -Identity AMS-MDB01 -MailboxServer AMS-EXCH02 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity AMS-MDB02 -MailboxServer AMS-EXCH02 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity AMS-MDB03 -MailboxServer AMS-EXCH01 -ActivationPreference 2
Add-MailboxDatabaseCopy -Identity AMS-MDB04 -MailboxServer AMS-EXCH01 -ActivationPreference 2

At this point you have created a DAG with the AutoReseed option. If Vol1 fails, it should automatically reseed the mailbox databases to another disk. The best way to test this is to set Vol1 offline in the Computer Management MMC snap-in.
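If you prefer to simulate the disk failure from PowerShell instead of the Computer Management MMC snap-in, you can take the disk offline with the Storage module. This is a hedged sketch: the disk number is an assumption, so identify the correct disk with Get-Disk first.

# List all disks and find the one that is mounted as C:\ExchVols\Vol1
Get-Disk
# Take that disk offline (disk number 4 is only an example)
Set-Disk -Number 4 -IsOffline $true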

The AutoReseed Process

When a disk fails and goes offline, Exchange will notice almost immediately and activate the copy of the mailbox databases on the second Exchange server as expected. This is clearly visible when we execute a Get-MailboxDatabaseCopyStatus command, as shown in Figure 5-10. The mailbox databases on AMS-EXCH01 are in a FailedAndSuspended state while they are mounted on server AMS-EXCH02.

image

Figure 5-10. The mailbox databases on the first Mailbox server are FailedAndSuspended

What happens next is that a repair workflow is started. The workflow will try to resume the failed mailbox database copy, and if this fails the workflow will assign the spare volume to the failed disk. This is the exact workflow:

1. The workflow will detect a mailbox database copy that is in Failed and Suspended state for 15 minutes.

2. Exchange will try to resume the failed mailbox database copy three times with a 5-minute interval.

3. If Exchange cannot resume the failed copy, Exchange will try to assign a spare volume five times with a 1-hour interval.

4. Exchange will try an InPlaceSeed with the SafeDeleteExistingFiles option five times with a 1-hour interval.

5. If all retries are completed with no success, the workflow will stop. If it is successful, Exchange will finish the reseeding.

6. When everything fails, Exchange will wait three days and see if the mailbox database copy is still in Failed and Suspended state, then it will restart the workflow from step 1.

All events are logged in the event log. There’s a special crimson channel for this, which you can find in Applications and Services Logs | Microsoft | Exchange | HighAvailability | Seeding.
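
You can also read this crimson channel from PowerShell with Get-WinEvent. The channel name below is an assumption derived from the path shown above; if it does not match your system, run Get-WinEvent -ListLog 'Microsoft-Exchange-HighAvailability*' to find the exact name.

Get-WinEvent -LogName 'Microsoft-Exchange-HighAvailability/Seeding' -MaxEvents 20 | Format-Table TimeCreated, Id, Message -AutoSize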

The first event that’s logged is EventID 1109 from the AutoReseed manager, indicating that something is wrong and that no data can be written to location C:\ExDbs\AMS-MDB01\AMS-MDB01.log. This makes sense because the disk has actually “failed” and is no longer available. This event is shown in Figure 5-11.

image

Figure 5-11. The first AutoReseed event indicating something is wrong with the disk containing the mailbox database

Subsequent events in the event log will indicate the AutoReseed manager attempting to resume the copy of the mailbox database. As outlined earlier, it will try this three times, followed by an attempt to reassign a spare disk. This event is shown in Figure 5-12. Please note that it takes almost an hour before Exchange moves to this step.

image

Figure 5-12. Exchange is reassigning a spare disk

When the disk is successfully reassigned, Exchange will automatically start reseeding the replaced disk, indicated by EventID 1127 (still logged by the AutoReseed manager), as shown in Figure 5-13.

image

Figure 5-13. Exchange automatically reseeds the new disk

Depending on the size of your mailbox databases, it can take quite a long time for this step to finish.

You can use the Mountvol utility again to check the new configuration. If all went well, you’ll see the mailbox databases now on volume 3, as shown in Figure 5-14.

image

Figure 5-14. After AutoReseed has kicked in, volume 3 is active

At this point it is up to the administrator to replace the faulty disk, format it, and mount the disk in the C:\ExchVols directory.

Monitoring the DAG

The DAG, including the individual Mailbox servers, is monitored by Managed Availability, a new feature in Exchange 2013. (This is discussed later in this chapter.) You can also use PowerShell to monitor the mailbox database replication, which is a very powerful feature.

Mailbox Database Replication

Mailbox database replication is a key service for determining if the servers are performing as expected. When the performance of an Exchange 2013 server or a related service degrades, you’ll see the replication queues start growing. There are two types of queues available for mailbox database replication:

· Copy queue This is the queue where transaction log files reside before they are replicated (over the network) to other Mailbox servers holding passive copies of a mailbox database.

· Replay queue This queue resides on the Mailbox server holding a passive copy of the mailbox database. It holds transaction log files that are received from the active mailbox database copy but haven’t yet been replayed into the passive mailbox database copy.

Both queues fluctuate constantly, and it’s no big deal when they are momentarily increasing as long as they start decreasing in minutes.

Image Note When you have lagged copies in your DAG, especially when the lag time is long, you can expect a large number of items in the replay queues. If so, there’s no need to worry since this is expected behavior.

You can monitor the replication queues in EMS using the Get-MailboxDatabaseCopyStatus command:

1. To monitor all copies of a particular mailbox database, you can use the following command: Get-MailboxDatabaseCopyStatus -Identity DB1 | Format-List

2. To monitor all mailbox database copies on a given server, you can use the following command: Get-MailboxDatabaseCopyStatus -Server MBX1 | Format-List

3. To monitor the status and network information for a given mailbox database on a given server, you can use the following command: Get-MailboxDatabaseCopyStatus -Identity DB3\MBX3 -ConnectionStatus | Format-List

Image Note The syntax of the identity of the mailbox database copy looks a bit odd, but it is the name of the mailbox database located on the Mailbox server holding the passive copy. In this case, it is mailbox database DB3 located on Mailbox server MBX3, thus DB3\MBX3.

4. To monitor the copy status of a given mailbox database on a given server, you can use the following command: Get-MailboxDatabaseCopyStatus -Identity DB1\MBX2 | Format-List

I often combine the Get-MailboxDatabaseCopyStatus command with the Get-MailboxDatabase command to get a quick overview of all mailbox databases, their passive copies, and the status of the replication queues (see Figure 5-15). To do this, use the following command: Get-MailboxDatabase | Get-MailboxDatabaseCopyStatus.

image

Figure 5-15. Monitoring the status of mailbox database copies

Image Note When you are moving mailboxes from one Mailbox server to another Mailbox server, a lot of transaction log files are generated. It is quite common that, under these circumstances, replication cannot keep up with demand, and you will see a dramatic increase in the replication queues. Things can get even worse when you are using circular logging in a DAG, since the log files will be purged only when the transaction log files are replayed into the mailbox database and all the DAG members agree on purging the log files. When there are too many log files, replication will slow down, the disk holding the log files will fill up, and the mailbox database can potentially dismount. The only way to avoid this situation is to throttle down the mailbox moves so that replication can keep up with demand.

Health Check Commands

Another way in EMS to check for mailbox replication is to use the Test-ReplicationHealth command. This command tests the continuous replication, the availability of the Active Manager, the status of the underlying failover cluster components, the cluster quorum, and the underlying network infrastructure. To use this command against server AMS-EXCH01, you can enter the following command:

Test-ReplicationHealth -Identity AMS-EXCH01

The output of this command is shown in Figure 5-16.

image

Figure 5-16. The Test-ReplicationHealth command checks the entire replication stack

Microsoft has written two health metric scripts, located in the C:\Program Files\Microsoft\Exchange Server\v15\Scripts directory, that gather information about mailbox databases in a DAG. These scripts are:

1. CollectOverMetrics.ps1

2. CollectReplicationMetrics.ps1

The CollectOverMetrics.ps1 script reads DAG member event logs to gather information regarding mailbox database operations for a specific time period. Database operations can be mounting, dismounting, database moves (switchovers), or failovers. The script can generate an HTML file and a CSV file for later processing in Microsoft Excel, for example.

To show information in a DAG called DAG01, as well as all mailbox databases in this DAG, you can navigate to the scripts directory and use a command similar to the following:

.\CollectOverMetrics.ps1 -DatabaseAvailabilityGroup DAG01 -Database:"DB*" -GenerateHTMLReport -ShowHTMLReport

CollectReplicationMetrics.ps1 is a more advanced script, since it gathers information in real time while the script is running. It also collects information from performance monitor counters related to mailbox database replication. The script can be run to:

1. Collect data and generate a report (CollectAndReport, the default setting)

2. Collect data and store it (CollectOnly)

3. Generate a report from earlier stored data (ProcessOnly)

The script starts PowerShell jobs that gather all the information and, as such, this is a time- and resource-consuming task. The final stage of the script, when all data is processed to generate a report, can also be time- and resource-intensive. To gather one hour of performance data from a DAG using a one-minute interval and generate a report, the following command can be used:

.\CollectReplicationMetrics.ps1 -DagName DAG1 -Duration "01:00:00" -Frequency "00:01:00" -ReportPath

To read data from all files called CounterData* and generate a report, the following command can be used:

.\CollectReplicationMetrics.ps1 -SummariseFiles (dir CounterData*) -Mode ProcessOnly -ReportPath

Image Note Do not forget to navigate to the scripts directory before entering this command. This can be easily done by entering cd $exscripts in EMS.

Not directly related to monitoring an Exchange server is the RedistributeActiveDatabases.ps1 script. It can happen, especially after a failover, that the mailbox databases are not properly distributed among the Mailbox servers. For example, in such a scenario, one Mailbox server may be hosting only active copies of mailbox databases while another Mailbox server is hosting only passive copies. To redistribute the mailbox database copies over the available Mailbox servers, you can use the following command:

.\RedistributeActiveDatabases.ps1 -DagName DAG1 -BalanceDbsByActivationPreference -ShowFinalDatabaseDistribution

This command will distribute all mailbox databases by their activation preference, which was set during creation of the mailbox database copies. If you have a multi-site DAG, you can use the -BalanceDbsBySiteAndActivationPreference parameter. This will balance the mailbox databases to their most preferred copy, but also try to balance mailbox databases within each Active Directory site.

Client Access Server High Availability

In a high-availability environment, not only do the mailbox databases need to be highly available, but the Client Access servers do as well.

This is done by implementing multiple Client Access servers, with a load balancer distributing the client connections across these Client Access servers. When one Exchange 2013 Client Access server fails, another Client Access server takes over the service; the load balancer takes care of this and automatically disables the failed Client Access server in its list of servers.

Image Note Load balancing is discussed in detail in Chapter 4.

Managed Availability

In the past, you could use a separate monitoring solution like System Center Operations Manager or some less-sophisticated solution based on SNMP to monitor your Exchange environment. While these are certainly good solutions, their only task is to monitor the Exchange environment. Imagine that you’re running an environment with thousands of servers in a high-availability configuration. This will result in millions and millions of events being generated, and you would have to take action on a continuous basis. That’s no fun when you need a good night’s sleep, but there are tools to help you avoid this situation.

This is where Managed Availability comes into play. Managed Availability is a new service in Exchange 2013 that constantly monitors the Exchange servers and takes appropriate action when needed, without any system administrator intervention. Managed Availability not only monitors the various services within an Exchange server for their availability but also does end-to-end monitoring from a user’s perspective.

So, Managed Availability monitors if the Information Store is running and if the mailbox database is mounted; at the same time, it monitors if the mailbox itself is available. Similarly, Managed Availability not only checks if the Internet Information Server (IIS) is running and is offering the Outlook Web App (OWA) function, but it also tries to log in to a mailbox to see if the OWA service is actually available.

This service represents a huge difference from past versions of Exchange Server and their earlier monitoring solutions, where the only monitoring was to determine if the web server was up and running. If a logon page was shown, then OWA was assumed to be running fine. Likewise, SMTP monitoring in the past was a matter of setting up a Telnet session on port 25 so you would get a banner showing the service as up and running; however, there was no monitoring to determine if a message could actually be delivered.

According to Microsoft, Managed Availability is cloud trained, user focused, and recovery oriented.

Cloud Trained

Microsoft developed Managed Availability in Exchange Online, sometimes referred to as “the service,” and brought it back to the on-premises Exchange installations. Multiple years of running the service allowed Microsoft to incorporate experience and best practices from operating a large environment with a diverse, worldwide client base, 24/7.

In Exchange Online, developers were responsible for building, maintaining, and improving Managed Availability. Those developers also handled escalations in Exchange Online, which allowed them to gather feedback, not only on the software they were coding but also on the monitoring process itself. The developers were paged in the middle of the night when problems escalated, so they had a strong incentive to keep improving Managed Availability.

This service is included in the Exchange 2013 product and is installed out of the box by default; no additional configuration is needed. At the same time, Microsoft has the ability to make changes and improvements to Managed Availability every time a cumulative update is released.

User Focused

Managed Availability is based on end-user experience. Checking whether a server listens on port 443 for OWA or on port 25 for SMTP does not guarantee that a user can actually work with OWA or that a message can actually be delivered. Managed Availability, however, performs monitoring checks for the following:

· Availability: Is the service being monitored actually accessible and available?

· Latency: Is the service working with an acceptable degree of latency?

· Error: When accessing the service, are there any errors logged?

These items result in a customer touch point—a test that ensures the service is available, responds at or below an acceptable latency, and returns no errors when performing these operations.

Recovery Oriented

Managed Availability protects the user experience through a series of recovery actions. It’s basically the recognition that problems may arise, but the user experience should not be impacted. An example of Managed Availability’s monitoring of OWA is as follows:

1. The monitor attempts to submit a message via OWA and an error is returned.

2. The responder is notified and tries to restart the OWA application pool.

3. The monitor attempts to verify OWA and checks if it’s healthy. When healthy, the monitor again attempts to submit a message, and again an error is returned.

4. The responder now moves the active mailbox database to another Mailbox server.

5. The monitor attempts to verify OWA and checks for health status. When healthy, the monitor attempts to submit a message and now receives a success.

Managed Availability is implemented through the new Microsoft Exchange Health Manager service running on the Exchange 2013 server. How does the Health Manager service get its information? Through a series of new crimson channels. As described earlier, a crimson channel is a channel where applications can store certain events, and these events can be consumed by other applications or services. In this case, various Exchange components write events to a crimson channel and the Health Manager service consumes those events to monitor the service and take appropriate actions.
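You can browse these crimson channels yourself with Get-WinEvent. As a sketch—the exact log names can differ per Exchange version, so verify them first with the -ListLog switch:

Get-WinEvent -ListLog "Microsoft-Exchange-ActiveMonitoring/*"
Get-WinEvent -LogName "Microsoft-Exchange-ActiveMonitoring/ProbeResult" -MaxEvents 10

The first command lists the available Active Monitoring channels; the second shows the ten most recent probe results written to the ProbeResult channel.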

The configuration files that are used by Managed Availability are XML files supplied by Microsoft and stored on the local hard drive in C:\Program Files\Microsoft\Exchange Server\V15\Bin\Monitoring\Config (see Figure 5-17).

image

Figure 5-17. The XML config files for Managed Availability

Image Note While it is interesting to check out these XML files, it’s not a good idea to modify them—not even when you know 120 percent of what you’re doing. You will most likely see unexpected results and Managed Availability will do things you don’t want, like rebooting servers that have no problems, only because you incorrectly changed some configuration file.

You have to be careful. When you look at the SmtpProbes_Frontend.xml, for example, you’ll see the following in the WorkContext:

<WorkContext>
<SmtpServer>127.0.0.1</SmtpServer>
<Port>25</Port>
<HeloDomain>InboundProxyProbe</HeloDomain>
<MailFrom Username="inboundproxy@contoso.com" />
<MailTo Select="All" />
<Data AddAttributions="false">X-Exchange-Probe-Drop-Message:FrontEnd-CAT-250&#x000D;&#x000A;Subject:Inbound proxy probe</Data>
<ExpectedConnectionLostPoint>None</ExpectedConnectionLostPoint>
</WorkContext>

Above all, this tells us that the Health Manager service is using port 25 on IP address 127.0.0.1 to check the front-end proxy function on the Exchange 2013 CAS. At the same time, you know that if, for whatever reason, you have to unbind LOCALHOST from 127.0.0.1 in the server’s configuration, you’ll get unwanted complications. So you have to take special care when making changes to the server configuration!

The Architecture of Managed Availability

Managed Availability consists of three different components (illustrated in Figure 5-18):

· Probe engine

· Monitor

· Responder engine

image

Figure 5-18. Architectural overview of Managed Availability

The probe engine consists of three different components:

· Probe Determines the success of a particular service or component from an end-user perspective. A probe performs an end-to-end test, also known as a synthetic transaction. One probe may cover only a portion of the stack, such as checking if a web service is actually running, while another probe tests the full stack—that is, it checks whether data is successfully returned. Each component team in the Exchange Product Group is responsible for building its own probe.

· Check Monitors end-user activity and looks for trends within the Exchange server that might indicate (known) issues. A check is implemented against performance counters where thresholds can be monitored. A check is a passive monitoring mechanism.

· Notify Processes notifications from the system about known issues on an Exchange 2013 server. These are general notifications and do not necessarily originate from a probe. The notifier makes it possible to take immediate action when needed, instead of waiting for a probe to signal that something is wrong.

The monitor receives data from one or more probes. The feedback from a probe determines if a monitor is healthy or not. If a monitor is using multiple probes, but one probe returns unhealthy feedback, the entire monitor is considered unhealthy. Based on the frequency of the probe feedback, the monitor decides whether a responder should be triggered.
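If you want to see what the monitors themselves report, you can query them from EMS with the Get-HealthReport and Get-ServerHealth cmdlets. A sketch, using the AMS-EXCH01 example server from this chapter:

Get-HealthReport -Identity AMS-EXCH01
Get-ServerHealth -Identity AMS-EXCH01 -HealthSet OWA | Format-Table Name,AlertValue,TargetResource

The first command gives a summary per health set; the second lists the individual monitors in the OWA health set and whether they are currently healthy or unhealthy.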

In Figure 5-19, it is clear that a monitor works at different levels. The levels illustrated in the figure include:

· Mailbox self-test (MST) This first check makes sure the mailbox database is accessible. The mailbox self-test runs every 5 minutes.

· Protocol self-test (PST) This second test is for assessing the protocol used to access the mailbox database. The protocol self-test runs every 20 seconds.

· Proxy self-test (PrST) This third test is actually running on the Exchange 2013 CAS server to make sure that requests are proxied correctly to the Mailbox server. Like the protocol self-test, this proxy self-test runs every 20 seconds.

· Customer touch point (CTP) This is an end-to-end test that validates the entire accessibility of the mailbox, starting at the Exchange 2013 CAS down to the actual mailbox. The customer touch point runs every 20 minutes.

image

Figure 5-19. Monitoring occurs at different layers

The advantage of this multi-layer approach is that it is possible to check various components using different probes, and to respond in different ways using the responder. Thus, a responder is a component that responds with a predefined action when a monitor turns unhealthy. The following responders are available:

· Restart responder: This responder terminates and recycles a particular service.

· Reset AppPool responder: This responder can recycle the IIS application pool.

· Failover responder: This responder can take an Exchange 2013 Mailbox server out of service by failing all the mailbox databases on this Mailbox server over to other Mailbox servers.

· Bugcheck responder: This responder can bug-check a particular server—that is, it will restart with a “blue screen.”

· Offline responder: This responder can take a protocol running on an Exchange 2013 server out of service. This is especially important when you’re using a load balancer. When the Offline responder kicks in and a protocol component is shut down, the load balancer will notice and disable the Exchange 2013 CAS (or this particular protocol), so it stops servicing client requests.

· Escalate responder: This responder can escalate an issue to another application, like System Center 2012 Operations Manager. It is an indication that human intervention is required.

The responder sequence is stopped when the associated monitor becomes healthy again. Responders can also be throttled. Imagine that you have three Exchange 2013 Mailbox servers in a DAG. You don’t want two different responders, each on its own Mailbox server, to bug-check both of those Mailbox servers, since that would result in a complete outage of the DAG: with two Mailbox servers bug-checked, the remaining Mailbox server loses quorum and the DAG shuts down, which of course results in downtime for all users.

Exchange 2013 CAS and Managed Availability

In theory, this monitoring is nice, but how does it work in a production environment? When looking at the Exchange 2013 CAS protocols, Managed Availability dynamically generates a file called healthcheck.htm. Since this file is dynamically generated to test a particular protocol, you will not find it anywhere on the Exchange 2013 CAS’s hard disk.

This is a basic HTM file; the only thing it does is to return a 200 OK code plus the name of the server. You can easily open the file using a browser. It doesn’t reveal much information, but when it returns the information as shown in Figure 5-20, you know your server is fine from an OWA protocol perspective.

image

Figure 5-20. A dynamically generated file for health checking by a probe
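Instead of opening a browser, you can request the file from PowerShell as well. A minimal sketch, assuming PowerShell 3.0 or later for Invoke-WebRequest and a certificate that covers the server name you use:

Invoke-WebRequest -Uri https://ams-exch01.contoso.com/owa/healthcheck.htm -UseBasicParsing

A StatusCode of 200 in the response means the OWA protocol on this server is considered healthy; the Content property contains the 200 OK text plus the server name, just as shown in Figure 5-20.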

All other protocols, like ECP, EWS, Autodiscover, and Outlook Anywhere, have their own healthcheck.htm file as long as the respective protocol is running fine. A hardware load balancer, as shown in Figure 5-21, can use this information as well. When the load balancer checks for this file and a 200 OK is returned, the load balancer knows that the respective protocol is doing fine. If an error is returned, the load balancer knows there’s something wrong and it should take this protocol out of service.

image

Figure 5-21. The load balancer is using the healthcheck.htm file for checking availability

Image Note This is only a protocol check and doesn’t say anything about whether a user can actually log in or not.

What can you do with this information? The offline responder, for example, can be invoked to place a node in maintenance mode—say, when you’re patching your servers or updating to the latest cumulative update. To accomplish this, you have to change the server component state of an Exchange server. The server component state of an Exchange server can be requested using the Get-ServerComponentState command in EMS like this:

Get-ServerComponentState -Identity AMS-EXCH01

This command will show the state of all server components on the console, as displayed in Figure 5-22.

image

Figure 5-22. Requesting the component state of an Exchange 2013 server

The items listed in Figure 5-22 are all components on the Exchange 2013 server AMS-EXCH01, which is a multi-role server—that is, with both the CAS and the Mailbox server role installed on it. The components are not the individual services running on the Exchange server; rather, they are an abstraction layer on top of the individual services. For example, in Figure 5-22 you can see the Hub Transport component. In this case, the Hub Transport component represents the transport services running on the Mailbox server, as explained in Chapter 4.

The components that reside on an Exchange 2013 CAS are the ones with “Proxy” in their name, with UMCallRouter and FrontEndTransport as exceptions (these are also components of the Exchange 2013 CAS). Hub Transport is a component that belongs to the Mailbox server role, while the Monitoring and RecoveryActionsEnabled components belong to both roles.
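To quickly list only the Client Access components and their states on such a multi-role server, you can filter the output of Get-ServerComponentState; a sketch based on the columns shown in Figure 5-22:

Get-ServerComponentState -Identity AMS-EXCH01 | Where-Object { $_.Component -like "*Proxy*" } | Format-Table Component,State

This returns only the proxy components, so you can see at a glance whether any of them has been set to Inactive.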

When a component is shut down, the requester acts as a label on the actual shutdown. There are five types of requesters defined:

1. HealthAPI

2. Maintenance

3. Sidelined

4. Functional

5. Deployment

Since these act as labels, you can use them when you want to shut down a component manually. To do this, you can use the Set-ServerComponentState command in EMS:

Set-ServerComponentState -Identity AMS-EXCH01 -Component OWAProxy -State Inactive -Requester Maintenance

When this command is run, and you check the state of the components using the Get-ServerComponentState -Identity AMS-EXCH01 -Component OWAProxy command in EMS, you’ll see on the console that the OWA component is actually inactive:

Server                   Component State
------                   --------- -----
AMS-EXCH01.Contoso.com   OwaProxy  Inactive

When you open a browser and navigate to https://localhost/owa/healthcheck.htm, you’ll observe that an error message is generated. The load balancer will determine that the OWA component is no longer available and will automatically disable this server. You can see this in the load balancer configuration, since the inactive server is marked in red (the top Real Servers IP Address in Figure 5-23, which appears as a different shade of gray in the printed book).

image

Figure 5-23. The load balancer detects that OWA is not available on the Exchange 2013 CAS

Image Note All server components change their status immediately, with the exception of the Hub Transport and the Front End Transport components. When these components are disabled in EMS, they continue to run until the service is restarted. This can be confusing when you are not aware of this situation; you will think you disabled the component (actually you did!), but it continues working. However, Managed Availability will notice this inconsistency and force a restart of the service after some time.
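If you do not want to wait for Managed Availability to notice this inconsistency, you can restart the corresponding services yourself after changing the component state. The service names below are the ones Exchange 2013 uses for these components:

Restart-Service MSExchangeTransport            # Transport service (Hub Transport component) on the Mailbox server
Restart-Service MSExchangeFrontEndTransport    # Front End Transport service on the Client Access server

After the restart, the services honor the Inactive state of their component.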

The component state is stored in two places:

1. Active Directory In Active Directory, it is stored in a property of the Exchange server object in the configuration partition. You can find this server object in:

CN=Servers, CN=Exchange Administrative Group (FYDIBOHF23SPDLT), CN=Administrative Groups, CN=Contoso, CN=Microsoft Exchange, CN=Services, CN=Configuration, DC=Contoso, DC=COM.

The value in Active Directory is used when performing Set-ServerComponentState commands against a remote server.

2. Local Registry When checking the registry, you have to check:

HKEY_LOCAL_MACHINE\Software\Microsoft\ExchangeServer\v15\ServerComponentStates

and then the component you want to check. This can be seen in Figure 5-24.

image

Figure 5-24. The Server Component state is stored in the registry of the Exchange 2013 server
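You can read the same information from PowerShell instead of the Registry Editor. A small sketch that lists the component keys and the state information stored for the OwaProxy component (the exact key and value layout is shown in Figure 5-24):

Get-ChildItem 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\v15\ServerComponentStates'
Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\ExchangeServer\v15\ServerComponentStates\OwaProxy'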

When the work on the Exchange 2013 server is finished, the components can be activated again in EMS, using the following command:

Set-ServerComponentState -Identity AMS-EXCH01 -Component OWAProxy -State Active -Requester Maintenance

If you check the server component state using the Get-ServerComponentState command in EMS, you’ll see that it is active again:

Server                   Component State
------                   --------- -----
AMS-EXCH01.Contoso.com   OwaProxy  Active

When you check the load balancer, you’ll see that the Exchange 2013 server is back online again.

Front End Transport Server High Availability

The Front End Transport server (FETS) running on the Client Access server is the primary point of entry for SMTP messages from external messaging servers. These can be regular SMTP servers, but also multifunctional devices that use SMTP to send incoming faxes or scanned documents to mailboxes.

To implement high availability for FETS on the Client Access servers—that is, the “Default Frontend <<servername>>” receive connector—you need to use a load balancer to distribute the incoming requests across the available Client Access servers and to react to a failing Client Access server.

For a load balancer in front of the FETS, a “simple” layer-4 load balancer can be used: a load balancer that accepts incoming SMTP connections and distributes these across the available Client Access servers. Of course, a layer-7 configuration is supported as well.

Most major vendors have templates available for using their load-balancing solution and Exchange 2013 together in an optimal configuration. In Figure 5-25, a virtual IP address configuration on the load balancer is shown, running on layer 7 with source IP as the persistence option and a simple round-robin distribution mechanism. This configuration is based on a standard Exchange 2013 SMTP template as provided by the vendor.

image

Figure 5-25. Configuring the load balancer for SMTP use in front of the Client Access servers

The load balancer will check each Client Access server on port 25 to see if the SMTP service is still healthy. If not, the load balancer will automatically disable this Client Access server in its configuration.
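You can run the same kind of basic check yourself from PowerShell—a quick sketch, assuming PowerShell 4.0 or later for Test-NetConnection:

Test-NetConnection -ComputerName AMS-EXCH01 -Port 25

If TcpTestSucceeded returns True, the SMTP service on that Client Access server is at least accepting connections; like the load balancer check, this says nothing about actual message delivery.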

Transport High Availability

When it comes to transport high availability in Exchange 2013, there are two server roles we have to focus on:

· Exchange 2013 Mailbox Transport server role

· Exchange 2013 Edge Transport server role

Transport in the Exchange Mailbox server role has been discussed in more detail in Chapter 4; transport in the Edge Transport server role will be discussed in more detail in Chapter 6.

In this section, I discuss the high-availability options in both server roles.

Mailbox Server Transport High Availability

When it comes to transport in Exchange 2013, things are relatively easy. The Transport service is part of the Mailbox server role in Exchange 2013, so when multiple Mailbox servers are added to achieve higher availability, the transport availability increases at the same time.

If a Mailbox server fails or when mailbox databases are failed over to another Mailbox server, the processing of SMTP messages is automatically moved to another Mailbox server as well.

Is a load balancer needed to distribute SMTP messages across multiple Exchange 2013 Mailbox servers? Well, it depends on the workload—that is, the type of SMTP messages.

SMTP messages routed between Exchange 2013 Mailbox servers, or between a Mailbox server and any down-level Hub Transport server, do not need a load balancer because Exchange 2013 automatically determines the most effective path to deliver its messages. Things are different, though, when you have created receive connectors on Exchange 2013 that are used by multifunctional devices (fax machines, scanners, and printers) to route messages to other recipients. These connectors can also be used by third-party applications on your network to route messages to various recipients.

Receive connectors don’t have any knowledge about high availability, so you need to create a virtual service on a load balancer. Then, the load balancer will distribute the submitted SMTP messages across available Exchange 2013 Mailbox servers. The application or the multifunctional device needs to submit its messages to the virtual service on the load balancer.
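Such a dedicated receive connector could, for example, be created on each Mailbox server along the following lines. This is only a sketch; the connector name, port, and IP range are hypothetical and need to match your own environment and load balancer configuration:

New-ReceiveConnector -Name "MFD Relay" -Server AMS-EXCH01 -TransportRole HubTransport -Custom -Bindings 0.0.0.0:2525 -RemoteIPRanges 10.10.10.0/24

The virtual service on the load balancer would then listen on the same port and distribute the connections from the devices or applications across the Mailbox servers that have this connector.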

Edge Transport High Availability

Edge Transport servers are located in the perimeter network and are not domain joined. To achieve high availability on the Edge Transport servers, you need to implement multiple Edge Transport servers in the perimeter network, as shown in Figure 5-26.

image

Figure 5-26. Multiple Edge Transport servers in the perimeter network

Edge Transport servers are connected to the Exchange 2013 Mailbox servers using Edge subscriptions, and each Edge Transport server has its own subscription.
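Creating an Edge subscription is a per-server task: you export a subscription file on the Edge Transport server and import it on a Mailbox server in the Active Directory site you want to subscribe it to. A sketch, with a hypothetical file name and site name (Edge subscriptions are covered in more detail in Chapter 6):

New-EdgeSubscription -FileName "C:\Temp\AMS-EDGE01.xml"
New-EdgeSubscription -FileData ([Byte[]]$(Get-Content -Path "C:\Temp\AMS-EDGE01.xml" -Encoding Byte -ReadCount 0)) -Site "Amsterdam"

The first command runs in EMS on the Edge Transport server; the second runs in EMS on a Mailbox server after the file has been copied there.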

Multiple subscriptions provide a means of achieving high availability for communications between the Exchange 2013 Mailbox servers and the Edge Transport servers. Both the Mailbox servers and the Edge Transport servers automatically distribute their SMTP messages across the Edge subscriptions. The load is therefore automatically distributed; if one server fails, other servers automatically take over.

For inbound messages from the Internet to the Edge Transport servers, there are two mechanisms for achieving high availability:

1. When multiple Edge Transport servers are used, it is possible to have multiple MX records. Each MX record points to one Edge Transport server, so the messages will automatically be distributed. The downside of this is that if an Edge Transport server is not available, the sending SMTP host has no notion of this and keeps trying this particular Edge Transport server.

2. You can use a load balancer in front of the Edge Transport servers and create a virtual service on this load balancer. The MX records in public DNS point to this virtual service and external SMTP hosts deliver their messages to this virtual service. The load balancer distributes the inbound SMTP messages across the available Edge Transport servers. If one Edge Transport server fails, the load balancer automatically redistributes inbound message across the remaining Edge Transport servers.

Summary

High availability in Exchange 2013 can be implemented on Mailbox databases, Client Access, and Transport servers. When it comes to Client Access servers, it is just a matter of implementing multiple Exchange servers and using a load balancer to distribute the load across multiple servers. If one server fails, the load balancer automatically redistributes the load across the remaining servers.

Multiple Mailbox servers automatically mean multiple Transport servers in your Exchange organization. Exchange Server automatically distributes the processing of SMTP messages, so that if one server fails in a DAG configuration, the processing of SMTP messages is automatically moved to other Mailbox servers. When Edge Transport servers are used, high availability is achieved with multiple Edge Transport servers and multiple Edge subscriptions, with the Mailbox servers on the internal network.

High availability for mailboxes is achieved by implementing multiple Mailbox servers configured in a database availability group, or DAG. Multiple copies of a mailbox database are configured in a DAG, so if one mailbox database fails, or if one Mailbox server fails, another copy of the mailbox database on another Mailbox server automatically takes over.

Using a DAG is pretty complex, but doing it well will dramatically increase the availability of your Exchange environment.