
Part V

Administration

Chapter 22
Scalability and High Availability

What's in this chapter?

· Understanding architectural changes

· Understanding scale limitations

· Exploring availability solutions

· Exploring load balancing

· Getting to know configuration best practices

Scalability and high availability are very involved topics, and an entire book could be written on each of them. Every Team Foundation Server environment is unique, and every organization has a different usage pattern and availability requirements.

It's not the purpose or intent of this chapter to provide prescriptive guidance on exact configurations to support your environment or usage pattern. Rather, this chapter is intended to give you insight into the different factors that affect scalability and to offer some solutions to consider in your overall environment design.

Note

If you need advice specifically tailored to your organization's needs, your best option is to contact Microsoft Support, Microsoft Services, or a Microsoft Certified Partner in your area. These organizations have deep knowledge, extensive resources, and ample hands-on experience to best meet your individual needs.

An Evolving Architecture

Team Foundation Server 2010 introduced significant architecture and infrastructure changes. On the product team, these product changes were referred to as Enterprise TFS Management (ETM). It was a significant and necessary investment of effort to allow the product to handle the future scale demands of enterprises and the Internet.

The full value of these investments is beginning to be realized. With the release of Team Foundation Server 2013 also comes the production release of Visual Studio Online, a cloud-based version of Team Foundation Server. Although the service is running on the Windows Azure Platform, it is built from the same codebase as the on-premises product.

The introduction of team project collections was perhaps the largest architectural change. This innovation took the seven databases that used to make up a server and “folded” them into a single database that represents a collection. This database becomes the unit of isolation, and a collection can be detached and moved between different servers. Collections enable the following:

· The consolidation of multiple pre-2010 Team Foundation Server instances onto a single shared instance

· The scale-out of a single instance to multiple physical servers

Team Foundation Server 2008 and previous releases included many built-in assumptions about your deployment. For example, you could have only a single application tier. This application tier couldn't be installed on a 64-bit operating system. Analysis Services had to be installed on the same server as SQL Server (although you could move it later if you wanted to). These restrictions made the initial installation and any future configuration changes to the server quite fragile and error-prone.

One of the core deliverables of ETM was to enable configuration flexibility and remove these limitations. Team Foundation Server supports the following features critical to scalability and availability:

· Configuration flexibility

· Multiple application tiers with load balancing

· 64-bit application-tier installation

· Stateless application tier and web access

On top of all these infrastructure improvements, there was also a huge investment in the setup, configuration, and administration experiences. All this investment made Team Foundation Server 2010 an exceptionally scalable and robust release.

Because Team Foundation Server 2013 shares the same codebase as Visual Studio Online, many of the core scalability improvements in the cloud version also flow through to the on-premises version.

An example of these improvements is the memory usage per collection in the environment. For an on-premises server, it would be rare to see more than 100 collections on a single deployment. In the cloud, the target is many thousands of collections. For the server to support this number of collections, a lot of optimization and testing was done around the collection management internals. These changes mean that each collection consumes as little server memory as possible. There are also optimizations such as placing a collection in a dormant state and pausing certain jobs if the collection has not been accessed.

Another significant change that first appeared in Team Foundation Server 2012 is the generalization of the version control content store. Before the 2012 release, Work Item Tracking and Test attachments were stored in their own tables. For the cloud-based service, it is more cost effective to store content in Windows Azure blob storage than inline in Windows Azure SQL Databases. This change also allows the Team Foundation Proxy server to cache additional artifact types.

Much of this chapter is relevant to only the largest Team Foundation Server environments. However, because Team Foundation Server is built on the Microsoft platform, you might find these suggestions also useful for scaling your own applications built on the Microsoft platform.

Limiting Factors

Implementing a system that scales is all about finding the largest bottleneck, removing it, and then finding the next one. A system with as many moving parts as Team Foundation Server has many opportunities for bottlenecks. Even when you manage to remove the largest bottlenecks, you still have some inherent limitations in the architecture to consider.

Microsoft Recommendations

The officially tested and recommended system configurations for deploying Team Foundation Server 2013 are detailed in the Installation Guide at http://aka.ms/tfsInstallGuide.

As you can see in Table 22.1 (compiled from the official Installation Guide), the hardware requirements for a small team are quite modest. It's perfectly reasonable to run a server that supports 250 users on a single core machine that has a reasonable hard disk. You should, however, consider these recommendations with respect to your own individual circumstances. In general, the larger your team is, the greater your need will be for a robust hardware configuration.

Table 22.1 Recommended Hardware Configurations

Number of Users | Configuration | Tier | CPU | Memory | Hard Disk
Fewer than 250 users | Single-server (Team Foundation Server and the Database Engine on the same server) | N/A | 1 single core processor at 2.13 GHz | 2GB | 1 disk at 7.2K rpm (125GB)
250 to 500 users | Single-server | N/A | 1 dual core processor at 2.13 GHz | 4GB | 1 disk at 10K rpm (300GB)
500 to 2,200 users | Dual-server (Team Foundation Server and the Database Engine on different servers) | Application tier | 1 dual core Intel Xeon processor at 2.13 GHz | 4GB | 1 disk at 7.2K rpm (500GB)
500 to 2,200 users | Dual-server | Data tier | 1 quad core Intel Xeon processor at 2.33 GHz | 8GB | SAS disk array at 10K rpm (2TB)
2,200 to 3,600 users | Dual-server | Application tier | 1 quad core Intel Xeon processor at 2.13 GHz | 8GB | 1 disk at 7.2K rpm (500GB)
2,200 to 3,600 users | Dual-server | Data tier | 2 quad core Intel Xeon processors at 2.33 GHz | 16GB | SAS disk array at 10K rpm (3TB)

As discussed later in this chapter, the number of team project collections and team projects in an environment will also affect the performance of the system. These hardware recommendations don't give an indication of how many collections they can support, only the number of users.

One important distinction for the number of collections is the number of active collections compared to the number of dormant collections. An active collection is one that has been accessed in the past five minutes. When a collection is automatically marked as dormant in the system, it will be unloaded from memory until another request for that collection is received. Table 22.2 describes the maximum number of active collections per SQL server based upon total available memory.

Table 22.2 Recommended Maximum Active Collections per SQL Server

RAM Available to SQL Server | Active Collections
2GB | 1 to 5
4GB | 5 to 10
8GB | 30 to 75
16GB | 40 to 90
32GB | 50 to 125
64GB | 75 to 195

For a recommendation of the number of collections per SQL server, you should refer to the “Visual Studio Team Foundation Server Planning Guide” at http://vsarplanningguide.codeplex.com/. More specifically, you should refer to the “Capacity Planning” workbook, which is available to download from this site.

This planning workbook lets you enter the maximum expected users and the current number of users for your environment. Using the official hardware recommendations from Table 22.2, the workbook will tell you the recommended configuration for your expected number of users and a maximum number of active collections it can support.

Data Tier

The vast majority of work in Team Foundation Server happens on the data tier. Therefore, it makes sense that the most common bottlenecks are found on the data tier. Team Foundation Server performance is directly proportional to the performance of your SQL server. For a large environment, you must pay the same level of attention that you pay to other critical database applications in your organization, such as your Human Resources, Finance, or Sales databases.

Beware of several opportunities for bottlenecks in the data tier:

· Storage performance

· SQL query plan cache

· SQL buffer cache

Storage Performance

The single biggest factor that contributes to server performance is the storage performance. If your storage isn't matched to the demands of the system, then the performance of everything will suffer. Team Foundation Server makes heavy use of SQL Server's TempDB database for large version control commands, which makes that a common source of bottlenecks.

SQL Query Plan Cache

SQL stored procedures have query plans. These plans are precompiled and the server uses them to work out the most efficient way to execute a particular stored procedure. Some commands (such as Merge) in Team Foundation Server contain some very complex logic. This makes the query plans quite detailed, and their size adds up. Because each project collection is a separate database, a separate plan is cached in SQL for each stored procedure.

The scalability limitation here is that the plan cache is shared among all databases on the same SQL instance. In SQL Server 2008, the plan cache is sized according to this formula:

· 75 percent of visible target memory from 0 to 4GB

· plus 10 percent of visible target memory from 4GB to 64GB

· plus 5 percent of visible target memory greater than 64GB

This means that as you add more collections to a SQL server, there will be more contention for resources in the plan cache. When a stored procedure's plan isn't in the cache, it must be recalculated and recompiled. Although this is not a significant overhead, it's not optimal to be recompiling plan caches all the time.
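To get a feel for what this formula means in practice, the following T-SQL sketch (illustrative only; the actual visible target memory for your instance depends on its edition and configuration) applies the formula to a hypothetical amount of memory:

-- Illustrative sketch: applies the SQL Server 2008 plan cache sizing formula
-- described above to a hypothetical 32GB of visible target memory.
DECLARE @visibleTargetGB float = 32;
SELECT
      0.75 * (CASE WHEN @visibleTargetGB < 4 THEN @visibleTargetGB ELSE 4 END)
    + 0.10 * (CASE WHEN @visibleTargetGB <= 4 THEN 0
                   WHEN @visibleTargetGB > 64 THEN 60
                   ELSE @visibleTargetGB - 4 END)
    + 0.05 * (CASE WHEN @visibleTargetGB > 64 THEN @visibleTargetGB - 64 ELSE 0 END)
    AS PlanCacheCapGB;

For 32GB of visible target memory, this works out to 3GB plus 2.8GB plus 0GB, or roughly 5.8GB of plan cache that all collection databases on the instance must share.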

In Team Foundation Server 2010, most work item tracking queries generated ad hoc SQL and required a new query plan with every execution. In the 2012 release, some optimizations were made to improve caching of work item tracking query plans.

Note

For more information on how the SQL plan cache works, see the “Plan Caching in SQL Server 2008” whitepaper from Greg Low at http://aka.ms/SQLPlanCache.

SQL Buffer Cache

The SQL buffer cache is where recently accessed database pages are kept in memory. Having pages in memory is a good thing because this results in the best performance.

Work item tracking uses a series of tables to store work items with a set of views over those tables. When you run a query from Team Explorer, that query is translated into a SQL query and executed in the database. Because work item tracking is completely customizable and has a dynamic schema, it performs best when all the tables are in the buffer cache.

The buffer cache is shared across all databases on a SQL server. So, if your work item tracking tables are competing in the buffer cache with other tables, then they may get pushed out. When they get pushed out, work item query performance will suffer.

This can be observed as you add more project collections to a server or your collections get bigger. If you look at the SQL Server “Memory Manager\Buffer Cache Hit Ratio” performance counter, it will drop and performance may start to suffer. In particular, when using SQL Express Edition, work item queries that use the contains clause will suffer the most noticeable effects because they require a table scan and cannot make use of the SQL Full-Text indexing service. If the pages aren't in the buffer cache, then SQL must fetch them from the disk.
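If you prefer to check this value from within SQL Server rather than Performance Monitor, the following query is one way to read the same counter from the sys.dm_os_performance_counters view (a sketch only; the raw counter must be divided by its companion base counter to produce a meaningful percentage):

-- Approximates the buffer cache hit ratio from SQL Server's counter view.
SELECT CAST(100.0 * r.cntr_value / NULLIF(b.cntr_value, 0) AS decimal(5, 2))
           AS BufferCacheHitRatioPercent
FROM sys.dm_os_performance_counters AS r
JOIN sys.dm_os_performance_counters AS b
    ON b.[object_name] = r.[object_name]
WHERE r.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base';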

To summarize, the size of the SQL buffer cache (which is calculated based upon total server memory) will limit the size of the project collections that SQL server can support while maintaining reasonable performance.

Note

More information on SQL Server Memory Architecture can be found at http://aka.ms/SQLMemoryArch.

Application Tier

You're more likely to encounter bottlenecks in the data tier than the application tier. However, the two main scale limitations that affect the application tier are:

· Memory

· ASP.NET worker threads configuration

Memory

Access control checks for version control are performed on the application tier. At a high level, this is the way it works:

1. The client makes a Get request to the application tier.

2. The application tier runs the prc_Get stored procedure on the data tier.

3. The data tier executes the request and returns all the relevant files, regardless of the permissions of the requesting user.

4. The application tier then retrieves the permissions associated with the paths returned. If the permissions are in the cache, then the cached permissions are used. If the permissions are not in the cache, then they are requested from the data tier.

5. The application tier then evaluates the permissions of the requesting user against the path permissions. Any file that the user does not have access to is removed from the response.

6. The application then sends the trimmed response to the client.

What's important here is that each application-tier server keeps a cache of all version control path permissions that it has evaluated. The cache is not persisted and is reset every time the application pool restarts. Cached permissions are also invalidated when the permissions change.

Version control uses Access Control Entries (ACEs) on paths to define which users and groups have access to which files and folders. By default, these permissions are inherited to subdirectories. However, you can set explicit permissions on subdirectories. Each of these explicit permissions results in an additional ACE that the server must store, evaluate, and cache.

To summarize, if you have many paths, or many paths with explicitly set permissions, then you may run into issues where the cache isn't large enough to be effective. In this scenario, the application tier will be constantly retrieving permissions from the data tier, and this may affect version control performance.

ASP.NET Worker Threads Configuration

ASP.NET 2.0 introduced the autoConfig attribute on the processModel configuration element, which defines how many worker threads should be running to serve requests. The default configuration may not work for everyone because it limits the number of concurrently executing requests per CPU to 12.

This works well for websites with low latency. But in an application like Team Foundation Server, which has longer running requests and higher latency, it may become a bottleneck. If ASP.NET has reached these limits, then users may receive intermittent timeout or “Server does not exist” error messages.

Web Access

Similar to the 2010 release, web access is integrated into the product. However, the 2012 release fully integrated web access with the other web services and it runs in the same application pool. It is not possible to install web access by itself, or separate it from the other web services.

Warehouse

As discussed at the beginning of this chapter, Team Foundation Server 2010 introduced some major architectural changes to support server consolidation and scale-out. One of the commonly requested features from large organizations was the capability to do cross-server (and, therefore, in 2010, cross-collection) reporting. Users wanted the capability to roll up metrics into a company-wide view.

This requirement drove the architectural decision to have a single, shared relational data warehouse and Analysis Services Online Analytical Processing (OLAP) cube per Team Foundation Server instance. This, in itself, is not a big problem. The limitations of the architecture start to emerge when you have multiple project collections attached to an instance that would, by themselves, strain a dedicated data warehouse per collection architecture.

The main limitations with the data warehouse in Team Foundation Server 2013 are:

· The relational warehouse has a limit of approximately 1,000 unique reportable fields across all project collections. This is the limit of columns in a SQL Server table, less some overhead.

· The time to process the OLAP cube is proportional to the number of reportable fields.

· Different field data types will be expanded to more than one dimension in the cube. For example, a datetime field is expanded to six dimensions to support the different data slicing requirements of a date: Year Month Date, Year Week Date, Date, Month, Week, and Year.

· Analysis Services does not have a scale-out solution for processing a single cube. You can add additional query servers, or process multiple cubes on separate processing servers and swap them in later. But you cannot process a single cube across multiple servers.

In summary, if your SQL Server hardware and application-tier server are not scalability bottlenecks in your environment, some architectural limitations in the data warehouse may affect you.

Team Foundation Proxy

The Team Foundation Proxy is a very effective method of increasing version control performance for users and reducing the load on the application-tier servers. As noted earlier, the 2012 release added support for more than just version control downloads.

The most significant limitations in the performance of the proxy server are:

· Network performance—Latency and throughput

· Storage performance—Disk size and throughput

Network Performance

The largest influence on the performance of the proxy server is the network performance between the proxy server and the clients. If the clients are separated from the proxy by a slow link, then the proxy may not provide any benefit at all compared to accessing the application tier directly.

Storage Performance

The amount of disk space available for the file download cache is the next most important influence on the performance of the proxy server. If the cache size isn't large enough then the proxy will be constantly cleaning up and refilling the cache.

Periodically, the cleanup job will scan the entire directory and look for files that have not been accessed recently (more than 14 days by default). The cleanup job will then delete these files. For caches with large numbers of files, the cleanup algorithm can be quite inefficient and can take many hours to identify and clean up stale files. It's important that your disks can handle the normal proxy load in addition to this cleanup load.

To get an estimate of how long this cleanup identification process takes, you can open a command prompt and run a directory listing of your cache directory. To do so, follow these steps:

1. Open a command prompt.

2. Change to your cache directory by typing the following (all on one line):

CD /D "C:\Program Files\Microsoft Team Foundation Server 12.0\Application Tier\Web Services\_tfs_data"

3. Perform a directory listing by typing the following:

dir /s > NUL

This will retrieve the file descriptors of every file in the cache directory and redirect the output to the NUL device so that it doesn't flood your console. The time it takes for this command to return is roughly the same time it takes for the proxy to identify files for cleanup.

In the case of a cache miss, the proxy server streams the content from the SQL server and writes the stream to the cache drive simultaneously while sending it to the client. In the case of a cache hit, the proxy server streams the content from the disk to the client. This means that the memory and processor demands of the proxy server are relatively moderate. Therefore, if the network speed is not a bottleneck, the throughput of the proxy server is directly proportional to the performance of the disks.

Microsoft Developer Division Adoption

The adoption of Team Foundation Server at Microsoft is something that has steadily increased since the early days of the product's development. Brian Harry and others on the product team have been blogging the internal adoption numbers over the years and sharing them with the public. You can see an example of these numbers at http://aka.ms/TfsDogfoodStats.

The Developer Division is the division in which the Team Foundation Server product group works. Until the release of Team Foundation Server 2008, the usage was limited to the product group, and the larger division used the existing Microsoft-only internally developed tools (Product Studio and Source Depot).

Once the division had shipped the 2008 wave of developer tools, there was a huge push to move all the people and systems over to Team Foundation Server. It's fair to say that this was not without its challenges, and the server was constantly patched to meet the scalability demands of the division's 4,000 users and build lab.

These patches made up the majority of the performance-related improvements in Team Foundation Server 2008 Service Pack 1. You can get an overview of these improvements on Brian Harry's blog under the “Performance & Scale” heading at http://aka.ms/Tfs2008Sp1Changes.

Although the use of these systems was painful at times for people in the division and across the company, it has pushed the product team to ensure that the product scales well. The widely varied usage patterns and user base have proven that the product can scale in real-world use far beyond what any load simulation can do.

This internal adoption and usage continued throughout the development of the 2013 release. Perhaps the most important usage though is the usage of the Visual Studio Online service. As scalability and availability issues are found in the service, updates are developed and deployed on at least a weekly basis. By the time Team Foundation Server 2013 was released, the product had been used on a day-to-day basis by many thousands of users.

Principles

If you are designing a new Team Foundation Server environment, or if you anticipate having to scale your existing installation, you can generally apply a number of principles. When implemented, these principles will also help you achieve your goals of high availability. These principles are:

· Scale out to multiple servers.

· Eliminate single points of failure.

· Anticipate growth.

· Keep it simple.

Scale Out to Multiple Servers

The first principle is to spread out the different components that make up a Team Foundation Server environment over multiple physical or virtual servers. The biggest benefit of doing this is to allow each component to make maximum use of the hardware that it sits on without competing with other components. As bottlenecks develop, the hardware for that single component can be scaled up or scaled out, without touching the other components. It's much easier for users to accept “Reporting won't be available this weekend while we upgrade the reporting server hardware” than it is “The whole server won't be available this weekend while we upgrade the hardware.” This reduces overall risk and increases the ability to react to changing usage patterns.

Eliminate Single Points of Failure

The second principle is the well-known formula for achieving high availability. By introducing redundancy in the environment and eliminating single points of failure, you reduce the chances that a failed component will impact the overall availability of the service. Depending on your goals for availability, this can be the most costly principle to implement. However, for some organizations, the impact of a failure greatly outweighs the infrastructure cost to avoid that failure, and it's an easy decision to make.

Anticipate Growth

The third principle can be a difficult one to gauge and plan for. Team Foundation Server is a powerful system with some very compelling features. Without proper planning and preparation, the use of these features can overwhelm the planned capacity of the system. The most common limitation that people encounter in a successful Team Foundation Server environment is the lack of storage space. Once people discover the value of an integrated version control, work item tracking, build, and test case automation system, the storage requirements start to grow rapidly. Without careful growth estimates and foresight in the storage design, this can have a dramatic impact on the stability of the system.

Keep It Simple

The final principle applies not just to Team Foundation Server but also to any system. Keep it simple. Simple things are easy to get right, and they usually cost less to set up and maintain.

Solutions

Now that the limitations have been covered, it's time to discuss some of the solutions, including:

· Data tier

· Application tier and web access

· Virtualization

This section covers the different components of Team Foundation Server and some strategies to increase their availability and scalability.

Data Tier

If scalability and high availability are important to you, then the data tier is where you will need to invest most of your resources.

High Availability

Availability is impacted not only by unexpected failures but also by expected ones, such as planned maintenance work. Without a redundant system in place that can respond to requests while the primary system is undergoing maintenance, the system will be unavailable.

When planning for high availability, the most important database is the Tfs_Configuration database. Within the current architecture of the system, this is a single point of failure. An issue with this database will cause the entire instance to be unavailable.

SQL Server 2012 introduced a new comprehensive high availability and disaster recovery solution called SQL Server AlwaysOn. These features increase the high availability options for a Team Foundation Server deployment. AlwaysOn offers two core capabilities:

· AlwaysOn Availability Group—This is a capability that helps protect application databases from both planned and unplanned downtime. The key availability features are failover of a group of databases, multiple secondary copies for improved redundancy, and virtual names for fast application failover.

· AlwaysOn Failover Cluster Instance (FCI)—This provides protection for the entire instance and is an enhancement to the existing SQL Server Failover Cluster Instance. It includes multi-site clustering across subnets and TempDB on a local drive that allows better query performance.

Both of these capabilities rely on the Windows Server Failover Clustering (WSFC) infrastructure, which provides a robust and reliable high-availability platform.

These capabilities can be used by themselves or in combination with each other, depending on your availability needs. AlwaysOn Availability Group is the recommended high availability capability for database availability. It does not require shared storage as each SQL Server in the topology has its own copy of the data and does not need to share. Additionally, the replica can be used as an Active Secondary server for offloading backup operations.
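To make the shape of this more concrete, the following T-SQL is a minimal sketch of an Availability Group definition. The server names, database names, and endpoint URLs here are hypothetical, and in a real deployment you must first enable the AlwaysOn feature, restore the databases to the secondary, and create the mirroring endpoints.

-- Hypothetical sketch only; assumes the AlwaysOn feature is enabled on both
-- instances, the databases use the FULL recovery model, and database
-- mirroring endpoints already exist on port 5022.
CREATE AVAILABILITY GROUP [TfsAvailabilityGroup]
FOR DATABASE [Tfs_Configuration], [Tfs_DefaultCollection]
REPLICA ON
    N'SQL01' WITH (
        ENDPOINT_URL = 'TCP://sql01.contoso.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC),
    N'SQL02' WITH (
        ENDPOINT_URL = 'TCP://sql02.contoso.local:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC);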

Note

For more information on the deployment options, see SQL Server 2012 AlwaysOn High Availability and Disaster Recovery Design Patterns at http://aka.ms/SQL2012AlwaysOnHADRPatterns.

Be Careful of the Complexity

Although a Failover Cluster Instance is a fully supported configuration, it violates the fourth principle of achieving high availability: “Keep it simple.” The Developer Division server at Microsoft used to run a two-node, two-instance cluster configuration with the SQL Server instance running on one node and the Analysis Services instance normally running on the other. This worked fine until it came time to upgrade from SQL Server 2005 to SQL Server 2008.

The upgrade wizard supported failover cluster upgrades, but it did not support an online upgrade of a failover cluster with multiple resource groups. In the end, moving Analysis Services out of the cluster and off to its own dedicated hardware kept the configuration simple and allowed the team to use the online upgrade capabilities of the upgrade wizard.

Scalability

Earlier in this chapter, storage performance was identified as the biggest potential bottleneck of Team Foundation Server performance. In general, the same recommendations that apply to SQL Server also apply to Team Foundation Server.

You should start with the SQL Server Customer Advisory Team (CAT) “Storage Top 10 Best Practices” at http://tinyurl.com/SQLServerStorageTop10. Following are the most important of these 10 best practices:

· More or faster spindles are better for performance.

· Isolate transaction log files from data files at the physical disk level.

· Consider the configuration of the TempDB database.

· Don't overlook some of SQL Server basics.

· Don't overlook storage configuration basics.

Physical disks have physical limitations with the performance they can provide. The only way to increase your storage performance is to have faster spindles or to have more of them to spread the load out onto.

It's fairly common knowledge that SQL transaction logs, data files, and TempDB files should reside on physically separate drives. Because all of these are used at the same time, you don't want contention for resources among them. The aforementioned article includes this advice, along with many other storage configuration best practices.
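A quick way to verify that this separation is actually in place is to list where every database file lives; the following query is a simple sketch:

-- Lists each database file and its physical location, so you can confirm that
-- data files, transaction logs, and TempDB sit on separate physical drives.
SELECT DB_NAME(database_id) AS DatabaseName,
       type_desc            AS FileType,      -- ROWS (data) or LOG
       name                 AS LogicalName,
       physical_name        AS PhysicalPath
FROM sys.master_files
ORDER BY DatabaseName, FileType;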

Your storage is the most critical component of your Team Foundation Server environment. You must collaborate with your storage administrators and vendors to ensure that the storage is optimally configured for your needs.

A Storage Misconfiguration

An upgrade of the particularly large Developer Division server at Microsoft occurred during 2008. In the weeks leading up to the upgrade, the storage vendors had identified an issue on the storage array that required a firmware update. This update was supposed to have minimal impact on storage performance, and the team was told that it could be done while the server was online.

Unfortunately, this was not the case. It turns out that the firmware update reset the configuration back to factory defaults. It disabled the write cache setting on the array. It wasn't until halfway through the upgrade that a team member noticed the storage wasn't performing as expected. After some frantic phone calls and support investigations from the vendor, the missed configuration setting was identified and restored. The upgrade still failed for other reasons, but it certainly taught the team to keep the storage administrators close by during critical times.

SQL Server Enterprise Edition

If you separately license the high-end SQL Server edition, Team Foundation Server can use the extra features that it provides. The following features can be used to increase the availability and scalability of the system.

Note

For more information on the features available, see “Features Supported by the Editions of SQL Server 2012” at http://aka.ms/SQL2012Features.

· Online index operations—Index rebuilds and reorganization will be automatically done using the WITH ONLINE condition as part of the Optimize Databases job. Normally, indexes are taken offline, and operations that rely on those indexes are blocked while they are rebuilt.

· Page compression—Page compression can yield significant storage savings and increased storage performance. However, compression increases processor use, so be sure that you have enough available capacity. When you create a new collection on a SQL Server Enterprise Edition server, the majority of the version control tables and relational warehouse tables have page compression enabled. With page compression on these tables, storage usage can be reduced by up to a quarter of the uncompressed size. Additionally, the pages remain compressed in the buffer pool, which also results in increased performance.

Upgrading to Enterprise Edition

If you upgrade an existing SQL Server instance that hosts Team Foundation Server databases, compression will not be automatically used for existing collections. Existing collections will need to have compression enabled and their indexes rebuilt to see the benefits of compression. You can do this by running the following script in each of your collection databases:

EXEC prc_EnablePrefixCompression @Online = 1, @disable = 0

Newly created collections will have compression enabled automatically.
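If you want to confirm that compression has actually been applied (for example, after running the script above), the following query is a rough way to check; run it in the collection database and look for PAGE in the data_compression_desc column:

-- Shows the compression setting of every table and index partition in the
-- current database.
SELECT t.name AS TableName,
       i.name AS IndexName,
       p.data_compression_desc
FROM sys.partitions AS p
JOIN sys.tables AS t
    ON t.object_id = p.object_id
LEFT JOIN sys.indexes AS i
    ON i.object_id = p.object_id AND i.index_id = p.index_id
ORDER BY t.name, i.name;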

· Table and index partitioning—On large Team Foundation Server 2010 systems with many files and workspaces, the tbl_LocalVersion table can become very large and unwieldy. At Microsoft, this table peaked at five billion rows. This caused all kinds of problems, most notably that it would take more than a week to rebuild the index. If it finished rebuilding, it would need to start again because of the high churn in the table. The obvious solution to this was to implement table partitioning and split the table into more manageable chunks. Because this isn't documented, if you need table partitioning, you must contact Microsoft Support and they will guide you through the process. Team Foundation Server 2013 has a feature enabled by default called Local Workspaces. This removes the need to store workspace state on the server and dramatically reduces the size of the tbl_LocalVersion table.

· Larger read-ahead buffering—Enterprise Edition uses 1024KB read-ahead buffering compared to 64KB in Standard Edition. This increased buffering makes some of Team Foundation Server's expensive queries faster.

· Cube perspectives—A cube perspective is a definition that allows users to see the cube in a simpler way. If you are using the Enterprise Edition of Analysis Services, Team Foundation Server defines individual perspectives for work items, builds, and so on, in addition to the Team System cube.

Note

For more information on cube perspectives, see the article on MSDN at http://aka.ms/SQL2012CubePerspectives.

Application Tier and Web Access

As discussed at the beginning of this chapter, since the 2008 release there have been some significant changes to the architecture of the system. These changes enable the application tier to be scaled out and meet your scalability and high-availability requirements. With the exception of Web Access, the changes since Team Foundation Server 2010 are minor.

Web Access REST API

The largest limiting factor of web access scalability in Team Foundation Server 2010 was that it used the client object model to access the collections. The client object model was just not designed to operate in a high throughput web application and consumed a large amount of memory.

Therefore, in the 2012 release, the team implemented a lightweight REST API that uses the server object model. This dramatically improves the performance and scalability of web access along with reducing the overall load on the server. They have continued to expand the REST API in the 2013 release.

Stateless

In the 2010 release, aside from the change to support 64-bit architectures, changing web access to be stateless was the biggest change in the application tier from previous versions. Before this change, users would lose their sessions, along with the page they were on or the changes they were making. The lost session was usually triggered by a timeout, an application pool recycle, or by being directed to a different back-end server by a load balancer. This was an extremely frustrating experience for users.

The stateless implementation of web access dehydrates and rehydrates the client's session as required. This eliminates the session timeouts and allows any application tier to serve a user's request.

Load Balancing

Load balancing is an important feature for scalability and high availability. It allows the load to be spread across multiple servers. This increases the number of requests that can be handled by the system, as well as provides protection against planned and unplanned server downtime.

There are many load-balancing options. Whether you use Windows Network Load Balancing, the IIS Application Request Routing (ARR) extension, or a third-party load-balancer device, you need to consider some settings. Table 22.3 provides an overview of these settings and their recommended configurations.

Table 22.3 Recommended Load Balancer Configuration Settings

Setting | Recommended Value
Idle Timeout | 60 minutes
Affinity, Stickiness, or Persistence | No Affinity
IP Pass Through | Enabled

Idle Timeout

Most load balancers have an idle connection timeout setting. This is because every connection consumes memory, and they want to close the connection if it's idle. The usual default setting of five minutes can cause problems with version control in Team Foundation Server.

If the client sends a request that takes a long time to calculate in SQL Server (such as a large Get), there will be no data transferred over the connection, and it will appear to be idle. If the setting is not long enough, then the load balancer will close the connection and the client will get an error like “The connection was forcibly closed by the remote server.” In this scenario, you want to match the idle timeout setting of the load balancer with the request timeout setting of Team Foundation Server, which is 60 minutes.

Team Explorer 2013 and the forward-compatibility patches for earlier versions will send TCP Keep-Alive packets after a connection is idle for 30 seconds. This is enough to keep most load balancers from closing the connection. However, if your connection to the remote server relies on an intermediate proxy server (Microsoft Forefront Threat Management Gateway, for example), then the TCP Keep-Alive packets may not propagate to the destination server. In this situation, you may continue to have problems with long-running commands.

HTTP Keep-Alive and TCP Keep-Alive Explained

When people talk about Keep-Alive settings in the context of a web application, they normally mean HTTP Keep-Alive. HTTP Keep-Alive is a feature of the HTTP 1.1 protocol that instructs the server to keep the client's connection open after it has responded to a request. This avoids the cost of connection handshaking and is very useful for when web browsers open web pages with many images. Instead of opening and closing connections for every single image, connections are reused.

TCP Keep-Alive operates at a level below the HTTP traffic. It periodically sends a benign packet to the server over an existing connection, which then sends an acknowledgement response. This all happens without interfering with the client's HTTP conversation with the server.

Affinity

Affinity is the setting that determines if a client should be routed to the same back-end server for successive requests. Depending on the load balancer implementation, it is sometimes referred to as persistence or stickiness.

Some operations in Team Foundation Server can take a long time to process. Some operations (such as downloading version control file content) will use multiple threads. If you have affinity enabled, then it's possible that the load won't be evenly distributed between your back-end servers. In the case of a two-server, load-balanced instance, it's possible for one server to be overloaded processing most of the requests while the other server is sitting idle. For this reason, you may want to disable connection affinity.

Unfortunately, some load-balancing technologies don't handle authentication well. Users may receive generic authentication errors if they were authenticated against one back-end server, but then are load-balanced to another back-end server. In this scenario, you will need to fix the authentication issues or enable affinity.

IP Pass Through

One of the useful diagnostic tools in Team Foundation Server is the activity log. (See Chapter 25 and the Operational Intelligence Hub in Chapter 21 for more information on this.) The activity log records the IP address of each request in the system. The use of a load balancer can mask the actual IP address of the client. In this case, the activity log will show that the load balancer is the only client of the system.

To avoid this masking, you will want to enable the IP pass-through setting or some equivalent setting. If the load balancer sets the X-Forwarded-For HTTP header with the actual client IP address, then the activity log will show this address.

ASP.NET Worker Threads

As discussed earlier, the default configuration of ASP.NET limits the number of concurrently executing requests per CPU to 12. You can check to see if you are reaching this limit by monitoring the ASP.NET Applications\Requests in the Application Queue performance counter. If this is a non-zero value, then it means you definitely have a performance problem.

To enable increased concurrency, you can follow the guidance in the KB821268 article at http://aka.ms/KB821268. This describes the steps to change the maxWorkerThreads and maxIoThreads settings.

If your bottleneck is the throughput of your SQL server, then the majority of the concurrent requests will be sitting idle waiting for a response from the data tier. In this scenario, you can safely increase the settings to allow more concurrent connections.

Resource Governor

Similar to the Resource Governor feature available in SQL Server, Team Foundation Server also includes its own resource governor. Every command in the system is assigned a cost based on how resource intensive it is. Once the total cost of all the commands currently executing hits a configurable limit, subsequent requests are queued until sufficient resources are available. Each individual command can have a limit on its concurrency as well.

This resource governor prevents the server from becoming overwhelmed with a large number of requests and overloading the SQL server.

Resource Governor at Microsoft

While running the internal servers for the Developer Division at Microsoft, the team constantly battled to keep the server performance ahead of user demand. It seemed that no matter how many optimizations and hardware upgrades they did, the gains were quickly eroded. This was usually because of the sheer growth of data in the system or, more commonly, a misbehaving tool.

This problem was tackled in two ways. The first approach was to do some analysis on the server's activity logs and identify the tools or users generating the most load. In one case, a single user was responsible for more than 50 percent of the load on the system. Once identified, the team worked with the tool owners to understand their requirements and made suggestions for using the object model more efficiently.

This effort yielded some great results but left the team vulnerable and waiting for the next rogue tool to hit the system.

Toward the end of the 2010 release, the team implemented a resource governor with a default policy.

As an example, the (recursive) Merge command is assigned a cost of VeryHigh (5). The default limit for a server is 200. This means that, by default, each application tier will allow only 40 Merge operations to execute concurrently. Table 22.4 includes a listing of common commands and their costs.

Table 22.4 Resource Governor Command Costs

Method Name | Estimated Method Cost
CancelRequest | Free (0)
CheckIn | High (4)
CreateBranch | High (4)
DeleteCollection | High (4)
Destroy | VeryHigh (5)
Get (non-recursive) | Low (2)
Get (recursive) | Moderate (3)
LabelItem | High (4)
Merge (non-recursive) | Low (2)
Merge (recursive) | VeryHigh (5)
MethodologyUpload | High (4)
QueryHistory | Moderate (3)
ReadIdentityFromSource | Moderate (3)
VCDownloadHandler | VeryLow (1)

Configuration of the resource governor is performed within the Team Foundation Server registry (not the Windows Registry). The configuration isn't documented by Microsoft, so if you have a specific need to change the default configuration, you will need to contact Microsoft Support, and they will be able to guide you through.

Team Foundation Server 2013 has additions to the Windows performance counters and event logging. These changes help you identify if you are reaching a scalability limit and need to configure the resource governor. More information on these changes can be found in Chapter 25.

Note

Although the configuration isn't documented by Microsoft, if you are experienced with modifying the server registry, you can use the following information to change the configuration:

· By default, the resource governor is always enabled. You can disable it by setting the following registry key:

/Service/ResourceGovernor/Settings/Enabled = false

· To limit the number of Merge commands that can execute concurrently, create and set a key that specifies the total cost limit. For example:

/Service/ResourceGovernor/Method/Merge = 10

· To limit the total resources an individual user can consume, you will need to create and set a key that specifies the total cost limit for that user. The key name is the unique Team Foundation Identity of the user. This is different than the Windows SID and can be found by querying the tf_id column in the tbl_security_identity_cache table of the Tfs_Configuration database. The following example limits a particular user to executing commands with a total cost of 10:

/Service/ResourceGovernor/User/<tf_id> = 10

For example:

/Service/ResourceGovernor/User/0DA27F4C-61CB-4F18-B30A-3F65E62899CD = 10
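To find the tf_id value for a particular user, one simple starting point is to browse the identity cache table in the configuration database. This is a sketch only; the table isn't documented and its column layout can change between versions, so locate your user's row manually and note its tf_id value.

-- Run against the Tfs_Configuration database; find your user's row and note
-- the tf_id value to use in the resource governor registry key above.
USE Tfs_Configuration;
SELECT TOP (200) *
FROM dbo.tbl_security_identity_cache;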

File Download Cache Directory

If you have a large or busy server, the file download cache directory is going to be important for you. Previously, this cache directory was used only for version control content, but it now includes Work Item Tracking and Test attachments. By default, it lives in the \Web Services\_tfs_data directory where Team Foundation Server is installed. Depending on the usage patterns of your server, this directory can become very large, very quickly, and you should consider moving it to its own dedicated disk drive. You can move this directory in two ways:

1. Option 1—The first option doesn't require any Team Foundation Server configuration changes. You can use the Disk Management administrative tools and use a mount point to mount the additional drive to the _tfs_data path.

2. Option 2—The second option is to follow these steps:

a. On the application-tier server, create a cache folder.

b. Right-click the folder, and click Properties. The Properties dialog box for the folder opens.

c. Click the Security tab, and click Add.

d. Add the local group TFS_APPTIER_SERVICE_WPG, and click OK.

e. Select both the Read and Write check boxes, clear all other check boxes, and then click OK.

f. Open Windows Explorer, and browse to C:\Program Files\Microsoft Team Foundation Server 12.0\Application Tier\Web Services.

g. Open the Web.config file in a text or XML editor, and locate the <appSettings> section.

h. Add a new line within the <appSettings> section and change the value to match the new location:

<add key="dataDirectory" value="D:\Cache"/>

i. Save and close the Web.config file. The application pool will be recycled automatically. The next time a file is downloaded from the application tier, it will be cached to this new location.

j. If the old cache folder has files in it, you should delete it to free up disk space on the server.

Note

An alternative (and possibly better) way to configure the file download cache directory is to use a setting in the Team Foundation Server registry. The benefit of using the registry is that the configuration is maintained after a server upgrade and when additional application-tier servers are added to an environment. To configure the cache directory location, you will need to set the following Team Foundation Server registry key:

/Configuration/Application/DataDirectory = "D:\Cache"

When you use this key, it is automatically applied to any new application tiers that you configure. If the specified drive or directory does not exist on that server then that server will have an invalid cache configuration. This will impact performance until rectified by creating the specified directory with the correct permissions. The invalid configuration will manifest itself through errors in the Application Event Log on the server.

If you set the dataDirectory key in the Web.config file, that setting will override any registry settings.

Team Foundation Proxy

The proxy server is completely stateless and has always supported being in a load-balanced configuration. If your proxy server is overloaded, the simplest solution is to set up an additional server and configure network load balancing.

When designing a proxy server for scalability, you should prioritize the following:

1. Proximity to build servers and users

2. Storage size

3. Storage performance

If your application tier is in another building or another city, then having a proxy server in the same building as your biggest group of users is important. You want to keep the latency low for the most benefit.

Given the choice of storage size or storage speed, you should prioritize for storage size. For example, there's no point in having high-performance, solid-state drives for your proxy if they're not big enough to hold a day's worth of files.

The more storage that the proxy server can use, the more files and versions of files it can store. This increases the chance of a cache hit and decreases the number of times a file must be downloaded from the main server.

Like the application tier, you should also change the version control cache directory for the proxy server to be a dedicated drive.

Note

For more information, see “How to: Change Cache Settings for Team Foundation Server Proxy” at http://aka.ms/TfsProxySettings.

Virtualization

Virtualization can be a great solution for achieving your high-availability goals. You can configure your application tier and data tier servers as virtual machine guests on a highly available host machine. If the underlying host requires planned or unplanned maintenance, you can perform a quick migration of the guest machines to another host without interruption.

Note

This configuration is beyond the scope of this chapter, and you should refer to the article “Reference Architecture for Private Cloud” at http://aka.ms/PrivateCloudRefArch.

Microsoft supports virtualization of Team Foundation Server in supported virtualization environments. For more information, see the following pages on the Microsoft website:

· Microsoft server software and supported virtualization environments at http://aka.ms/VirtSupport1

· Support policy for Microsoft software running in non-Microsoft hardware virtualization software at http://aka.ms/VirtSupport2

· Support partners for non-Microsoft hardware virtualization software at http://aka.ms/VirtSupport3

· Server Virtualization (officially supported products) at http://aka.ms/VirtSupport4

You should also read the best practices and performance recommendations on “Running SQL Server 2008 in a Hyper-V Environment” at http://aka.ms/SQLHyperV along with “Running SQL Server with Hyper-V Dynamic Memory” at http://aka.ms/SQLHyperVDM. Regardless of your virtualization technology, the tests and recommendations are very relevant.

Summary

This chapter explored some of the scalability and availability limitations of Team Foundation Server, as well as the issues faced by large and busy environments. You learned that the performance of Team Foundation Server is tied directly to the performance of SQL Server. Finally, solutions and best practices were discussed for overcoming these limitations to meet your scalability and high-availability goals.

Chapter 23 discusses another important aspect of Team Foundation Server administration—disaster recovery.