Monitoring Server Health and Performance - Administration - Professional Team Foundation Server 2013 (2013)


Part V

Administration

Chapter 25
Monitoring Server Health and Performance

What's in this chapter?

· Understanding factors that affect Team Foundation Server health

· Monitoring SQL Server health

· Learning useful queries for investigating SQL Server performance

· Learning about data sources available for monitoring Team Foundation Server

· New diagnostics and monitoring functionality in Team Foundation Server 2013

· Using valuable tools and reports

The health of a Team Foundation Server can be broken down into three components:

· System health

· SQL Server health

· Team Foundation Server health

The health and performance of Team Foundation Server is largely dependent upon the health and performance of the underlying components. For example, if the storage subsystem is not performing well, then SQL Server performance will likely suffer and, in turn, affect the performance of Team Foundation Server commands.

This chapter provides an overview of how you can monitor the health and performance of Team Foundation Server.

Note

While Team Foundation Server was being developed, the entire Developer Division at Microsoft (approximately 3,000 users) started using the server as its primary source control and bug-tracking system. This onboarding process continued through the 2008 release, and the overall size and usage of the server increased. As more teams were moved onto the server, performance issues with the application were identified and fixed.

Because of this aggressive internal usage over an extended time period, the 2010 release was highly tuned based upon real-world usage, rather than synthetic load tests. In the 2012 release cycle, the Visual Studio Online service was brought online. Additional performance bottlenecks were identified and fixed through its preview period. This level of tuning continued throughout the development of the 2013 release and it means that, in most cases, the cause of a performance problem in the server is likely to be a configuration or hardware problem in the underlying systems, rather than in Team Foundation Server itself.

System Health

Server health refers to the health of the operating system and the underlying hardware. The easiest and most reliable way to monitor and measure server health is through the use of Windows performance counters.

Performance counters are generally considered an accurate representation of system performance, and they are understood across different disciplines (development, testing, and operations) and different groups (customers, vendors, and product teams). This makes them very useful for understanding the performance of a system.

If you don't already have a system (such as System Center Operations Manager) for collecting and analyzing performance counters, it's fairly easy to get started without one. You can configure a performance counter log to capture a core set of counters, once a minute, to a circular log file on each server. This will prove invaluable when you get the inevitable phone call asking, “Why is the server slow?”

Note

To configure a performance counter log in Windows Server, see the article “Create a Data Collector Set from a Template” at http://aka.ms/DataCollectorSet.

The next thing to look at is storage health. In large applications, and database applications in particular, the most common source of slow system performance or high application response times is the storage system. To determine whether you are having an issue with storage latency, use the following performance counters:

· Object—Physical Disk or Logical Disk

· Counter—Avg. Disk Sec/Transfer

· Instance—Ideally, you should collect this for individual disks. However, you may also use _Total to identify general issues. If _Total is high, then further collections can be taken to isolate the specific disks affected.

· Collection interval—Ideally, you should collect at least once every minute (but no more often than every 15 seconds). The collection should run for a significant period of time to show that an issue is ongoing, and not just a transient spike. The minimum suggested collection duration is 15 minutes.

When looking at the results, the following are the thresholds (in seconds) that you should consider:

· < 0.030—This is normal, and no storage latency issues are apparent.

· > 0.030 to 0.050—You may be somewhat concerned. Continue to collect and analyze data. Try to correlate application performance issues to these spikes.

· > 0.050 to 0.100—You should be concerned, and you should escalate to your storage provider with your data and analysis. Correlate spikes to application performance concerns.

· > 0.100—You should be very concerned, and you should escalate to your storage provider. Correlate spikes to application performance concerns.

With this data and these thresholds, you should be able to confidently identify a storage issue and work with either your server administrators or storage providers to get the issue resolved.
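The thresholds above can be sketched as a simple classifier for collected latency samples. This is a hypothetical helper (the function and label names are mine, not part of any Microsoft tooling):

```python
def classify_disk_latency(seconds):
    """Map one Avg. Disk Sec/Transfer sample to a concern level."""
    if seconds < 0.030:
        return "normal"
    elif seconds <= 0.050:
        return "watch"           # keep collecting; correlate with app issues
    elif seconds <= 0.100:
        return "concerned"       # escalate to your storage provider
    else:
        return "very concerned"  # escalate urgently

# Example: a handful of once-a-minute samples from the counter log.
samples = [0.012, 0.018, 0.045, 0.120, 0.025]
print([classify_disk_latency(s) for s in samples])
```

Running a classification like this over an extended log makes it easy to show whether high latency is sustained or just a transient spike.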

In large organizations, the storage will usually be provided by a Storage Area Network (SAN). SAN administrators usually work with the SAN vendors to ensure optimal configuration and performance. The administrators have many knobs they can tweak, and quite often it's just a matter of allocating more bandwidth and processing power from the SAN controller to your server. Sometimes, however, there just may not be enough disks available to meet the performance demands. If this is the case, it will often require a redesign and data migration to new SAN drives.

SQL Server

The majority of the Team Foundation Server application logic is implemented in SQL Server as stored procedures. The application tier itself is responsible for very little processing. For source control, the application tier performs caching and security checks. For work item tracking, the majority of the requests are passed directly to SQL Server. Because of this, the health of Team Foundation Server can largely be determined using the tools and functions provided by SQL Server.

Dynamic Management Views

Dynamic management views (DMVs) return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance. To use them, you must have the VIEW SERVER STATE permission on the SQL Server instance, or be a member of the sysadmin fixed server role.

If DMVs are new to you, the easiest way to get started is to download the sample script from Jimmy May's blog at http://aka.ms/SQLDMVAllStars. Following are the five examples included in this script:

· Expensive Queries (CPU, reads, frequency, and so on)

· Wait Stats

· Virtual File Stats (including calculations for virtual file latency)

· Plan Cache

· Blocking (real time)

Each of these queries has a series of commented-out WHERE and ORDER BY clauses that can be uncommented to surface different information.

Note

For more information and examples, see “Dynamic Management Views and Functions (Transact-SQL)” at http://aka.ms/SQL2012DMV.

Currently Running Processes

The query examined here is perhaps the single most useful query for identifying performance problems within SQL Server. It uses a combination of the dm_exec_requests, dm_exec_sql_text, and dm_exec_query_memory_grants DMVs to discover problems in real time. This query is not specific to Team Foundation Server, and it can be used on any SQL server to see what SQL is doing.

As shown here, the query will return interesting details (explained in Table 25.1) about all non-system processes. It also excludes the process running the query, and it sorts all the processes with the longest-running ones at the top.

SELECT
    @@SERVERNAME as ServerName,
    a.session_id,
    datediff(ss, a.Start_Time, getdate()) as seconds,
    a.wait_type,
    a.wait_time,
    m.requested_memory_kb / 1024 as requestedMB,
    a.granted_query_memory,
    m.dop,
    a.command,
    d.Name as DBName,
    a.blocking_session_id as blockedby,
    LTRIM(b.text) as sproc,
    substring(b.text, a.statement_start_offset / 2,
        CASE WHEN
            (a.statement_end_offset - a.statement_start_offset) / 2 > 0
        THEN
            (a.statement_end_offset - a.statement_start_offset) / 2
        ELSE 1
        END) as stmt,
    a.last_wait_type,
    a.wait_resource,
    a.reads,
    a.writes,
    a.logical_reads,
    a.cpu_time
FROM
    sys.dm_exec_requests a with (NOLOCK)
    OUTER APPLY sys.dm_exec_sql_text(a.sql_handle) b
    LEFT JOIN sys.dm_exec_query_memory_grants m with (NOLOCK)
        on m.session_id = a.session_id
        and m.request_id = a.request_id
    LEFT JOIN sys.databases d
        ON d.database_id = a.database_id
WHERE
    a.session_id > 50
    AND a.session_id <> @@spid
ORDER BY
    datediff(ss, a.Start_Time, getdate()) DESC

Table 25.1 Description of the Columns Returned by the Currently Running Processes Query

ServerName
  The name of the server that the query was executed on. When using SQL Server Management Studio to connect to multiple servers, this column is useful to verify that the query was executed against the correct server.

session_id
  The ID of the SQL process. This is commonly referred to as a SPID (SQL Process ID). It is unique while the process is running, and it can be reused by another process later.

seconds
  Total seconds since the query was started.

wait_type
  See the section "SQL Wait Types" later in this chapter.

requestedMB
  Memory requested by the query.

granted_query_memory
  Memory allocated to the query.

dop
  The degree of parallelism (DOP). This indicates how many CPU cores this process is using.

command
  Command that the query is running (for example, SELECT, INSERT, UPDATE, DELETE, or BACKUP DATABASE).

DBName
  Name of the database that the query is running on.

blockedby
  The ID of the process that this process is blocked by or waiting for.

sproc
  The text of the currently running query.

stmt
  The currently executing statement within the currently running query.

last_wait_type
  The wait type that the process was previously waiting for.

wait_resource
  The resource that the process is currently blocked by or waiting for.

reads
  Physical read operations of the process.

writes
  Physical write operations of the process.

logical_reads
  Logical read operations of the process.

cpu_time
  CPU time (in milliseconds) used by the process.

The detailed information provided by this query can be used to identify many common SQL Server issues.

Long-Running Processes

The seconds column will tell you how long a process has been running. Team Foundation Server has a default SQL time-out setting of one hour (3,600 seconds). If there are processes running against the Tfs_Configuration or the Tfs_Collection databases that are anywhere near 3,600 seconds, then you likely have a problem.

Once a Team Foundation Server process runs for 3,600 seconds in SQL, it will be cancelled on the application tier, and clients will receive an error message. The exception to this is queries executed by the Team Foundation Server Background Job Agent. These queries usually have a much longer time-out.

Note

There may be processes that did not originate from Team Foundation Server that run for longer than one hour. An example is a SQL backup or other database maintenance task.

High Memory Usage

The granted_query_memory column tells you how much memory SQL has allocated to a specific process. Each process will require different amounts of memory to perform its work.

In general, the commonly executed Team Foundation Server commands use less than 4GB of memory, and anything consistently higher is worth further investigation. If a process is using large amounts of memory, it can mean that SQL has chosen an inefficient query plan. If the total of the column is close to the total physical memory of the server, you may want to consider adding more memory.

Upgrades and SQL Query Plans

When SQL compiles a stored procedure, it uses the query optimizer engine to generate a query plan. Upon execution, the query plan is used to determine the most efficient way to run the query. Based upon index statistics, data histograms, and other metrics, it determines things like whether it is more efficient to use a scan (iterate through all rows) or a seek (skip to a specific location based upon an index). As a SQL developer, you can set query hints in your stored procedure that force the query optimizer to choose a specific plan.

Although the SQL team strives to ensure that query plans remain stable, many things can cause SQL to pick a new query plan. During the development of Team Foundation Server 2010, there were a couple of events that caused query plan instability on the busy internal servers. The first was a hardware and operating system upgrade. The SQL server was moved from a 32-bit to a 64-bit machine with more capable storage.

The second change was an upgrade from SQL Server 2005 to SQL Server 2008. Changes in the query optimization engine caused SQL to overestimate memory for some of the important commands. For example, Get commands started consuming 10GB of memory each. Get was the most commonly executed command, and there was only 64GB of memory available on the server. This meant that everything ground to a halt until the inefficient query plan was identified and a query hint was added to force the old behavior.

Sometimes other changes, such as index fragmentation and growing data sizes, will cause a query plan to become inefficient over time. Fortunately, most of these types of issues have been flushed out through Microsoft's own internal usage, and you are unlikely to encounter them on your own server. In other cases, restarting SQL Server or running DBCC FREEPROCCACHE will cause the query plans to be flushed and regenerated.

For much more detailed information on how SQL Server allocates memory for plan caching, see the “Plan Caching in SQL Server 2008” white paper by Greg Low at http://aka.ms/SQLPlanCache.

High Processor Usage

The dop column indicates the degree of parallelism for a query. SQL Server has a setting for the maximum degree of parallelism (referred to as MAXDOP), which controls how many processors any individual process can use. The default setting is zero, which allows a process to run in parallel on all processors.

If a process is using all processors, it means that it is making the most of the CPU hardware available. This is great. However, it also reduces the concurrency of the system and will block other processes from running until it is finished.

On busy servers, you may want to reduce the MAXDOP setting to allow increased concurrency. For example, do you want one large Merge command to block all the smaller commands while it executes? Or, could you live with a slightly slower Merge command that doesn't block all the other smaller commands?

Note

For more information, see the article “max degree of parallelism Option” at http://aka.ms/MAXDOP.

Performance Problems in a Specific Collection Database

The DBName column indicates in which database the process is currently executing. If you are experiencing a performance problem for a particular team project collection, this column will help you identify what commands are currently running for that collection.

Blocking Processes

The blockedby column indicates the ID of the process for which this process is waiting. If one process has a lock on a table or an index, and another process requires that same lock, it will be blocked until the first process releases the lock.

An example of this is a check-in lock. Because all check-ins are serialized, there must be a lock until SQL has committed the changes. If a check-in is large and requires lots of processing, it can hold the lock for a period of time. This can frustrate users who are just trying to check in a single file.

Another example is a lock in the tbl_LocalVersion table. If a user has a large workspace with many files, a DeleteWorkspace command may cause blocking of other commands such as a Get. This is because SQL Server does lock escalation. Team Foundation Server will request row locks (which lock only the affected rows), but SQL may determine that a page lock (which affects all rows on the same page) or a table lock (which affects all rows in the table) is more efficient.

Locking (and, therefore, blocking) was a significant problem during the internal usage of Team Foundation Server 2008 and 2010. The 2010 release eliminated the most common causes of blocking (for example, undo, edit, and check-in commands), which results in a much improved user experience.

Resource Contention

The wait_resource column is an identifier for the resource that the process is waiting for. If the wait_type is PAGEIOLATCH_*, this value will likely be a set of three colon-separated numbers, such as 6:1:35162, in which:

· The first number is the database ID.

· The second number is the physical file ID.

· The third number is the page number.
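A small sketch (the helper name is mine) that splits such a value into its three parts, ready for the database and file lookups that follow:

```python
def parse_page_resource(wait_resource):
    """Split a PAGEIOLATCH_* wait_resource value like '6:1:35162'."""
    database_id, file_id, page_number = (int(p) for p in wait_resource.split(":"))
    return {"database_id": database_id, "file_id": file_id, "page_number": page_number}

print(parse_page_resource("6:1:35162"))
# {'database_id': 6, 'file_id': 1, 'page_number': 35162}
```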

You can look up the name of the database from the database ID by running the following query:

SELECT database_id, name FROM sys.databases

You can look up the physical file path by running the following query:

USE [Database_Name]
GO

SELECT file_id, type_desc, name, physical_name
FROM sys.database_files

Warning

For large operations, Team Foundation Server makes heavy use of SQL's TempDB. The database ID of TempDB is usually 2. If you see processes waiting on TempDB as a resource, this may indicate that you have a storage throughput problem. The general scalability recommendation for TempDB is that you should have one equal-sized data file per CPU.

SQL Wait Types

A SQL process will be either running or waiting. When the process is waiting, SQL will record the wait type and wait time. Specific wait types and times can indicate bottlenecks or hot spots within the server.

The wait_type column of the currently running requests query will indicate what each process is waiting for (if anything). If you see many processes with the same value, this may indicate a system-wide bottleneck.

If it's not clear from the list of processes, you can also use the dm_os_wait_stats DMV, which collects cumulative wait statistics since the server was last restarted (or the statistics were reset). The following command will give you an output similar to Table 25.2:

-- What is SQL waiting for the most?
SELECT TOP 5 wait_type, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC

Table 25.2 Sample Output from dm_os_wait_stats

wait_type                      wait_time_ms
FT_IFTS_SCHEDULER_IDLE_WAIT         2669883
DISPATCHER_QUEUE_SEMAPHORE          2316915
BACKUPBUFFER                        2029392
CXPACKET                            1292475
XE_TIMER_EVENT                       932119

You can also manually reset the wait statistics for a server by running the following command:

-- Clear wait stats for this instance
DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR)

By looking at the results of the dm_os_wait_stats DMV, you can determine the most likely bottleneck in the system. Table 25.3 describes the common wait types.

Note

For more information and a description of each of the wait types, see “sys.dm_os_wait_stats” at http://aka.ms/SQL2012WaitTypes.

Table 25.3 Common Wait Types

CXPACKET
  Indicates time spent waiting for multiple processors to synchronize work. You may consider lowering the degree of parallelism or increasing the number of processors if contention on this wait type becomes a problem.

PAGEIOLATCH_*
  Indicates time spent waiting for storage operations to complete. You may have a storage throughput problem if this is consistently high.

LOGBUFFER
  Indicates time spent waiting for the transaction log. Consistently high values may indicate that the transaction log devices cannot keep up with the amount of logging being generated by the server. You will also see this wait type if your transaction log is full and has triggered an auto-grow. In this case, you should check that your transaction log backups are working and correctly truncating the log files.
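To judge whether a single wait type dominates, the cumulative wait_time_ms values can be expressed as a share of the total. A quick sketch, using the sample numbers from Table 25.2:

```python
# Sample cumulative wait statistics (from Table 25.2).
waits = {
    "FT_IFTS_SCHEDULER_IDLE_WAIT": 2669883,
    "DISPATCHER_QUEUE_SEMAPHORE": 2316915,
    "BACKUPBUFFER": 2029392,
    "CXPACKET": 1292475,
    "XE_TIMER_EVENT": 932119,
}

total = sum(waits.values())
for wait_type, ms in sorted(waits.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{wait_type:<30}{ms / total:6.1%}")
```

Note that several of the top wait types in this sample (such as FT_IFTS_SCHEDULER_IDLE_WAIT and XE_TIMER_EVENT) are idle or timer waits, which is why clearing the statistics before a measurement window can give a more meaningful picture.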

Storage Health

SQL Server provides the dm_io_virtual_file_stats DMV for keeping track of various storage metrics. The following query will list each of the physical database files in descending latency order:

SELECT
    --virtual file latency
    vLatency
        = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
            THEN 0 ELSE (io_stall/(num_of_reads + num_of_writes)) END
    , vReadLatency
        = CASE WHEN num_of_reads = 0
            THEN 0 ELSE (io_stall_read_ms/num_of_reads) END
    , vWriteLatency
        = CASE WHEN num_of_writes = 0
            THEN 0 ELSE (io_stall_write_ms/num_of_writes) END
    --avg bytes per IOP
    , BytesperRead
        = CASE WHEN num_of_reads = 0
            THEN 0 ELSE (num_of_bytes_read/num_of_reads) END
    , BytesperWrite
        = CASE WHEN num_of_writes = 0
            THEN 0 ELSE (num_of_bytes_written/num_of_writes) END
    , BytesperTransfer
        = CASE WHEN (num_of_reads = 0 AND num_of_writes = 0)
            THEN 0 ELSE (
                (num_of_bytes_read+num_of_bytes_written)/
                (num_of_reads + num_of_writes)) END
    , LEFT(mf.physical_name,2) as Drive
    , DB_NAME(vfs.database_id) as DB
    , vfs.*
    , mf.physical_name
FROM sys.dm_io_virtual_file_stats(NULL,NULL) as vfs
JOIN sys.master_files as mf
    ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
ORDER BY vLatency DESC

In the results of this query, you should pay particular attention to the vLatency and physical_name columns. The vLatency column indicates the combined average read and write latency for a file (in milliseconds). The physical_name column indicates which database file the results are for.

Following are some general points to consider when looking at these results:

· A latency of more than 30 milliseconds is something worth investigating for large databases with lots of activity.

· Latency in TempDB will affect overall server performance.

· High write latency (greater than 50 milliseconds) may be an indication that write caching is disabled or not working correctly. If this is the case, you will need to work with either your server administrators or storage providers to get the issue resolved.
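The CASE expressions in the query guard against dividing by zero for files that have seen no I/O. The same per-file latency arithmetic, as a sketch (the function name is mine):

```python
def file_latency(io_stall, num_of_reads, num_of_writes):
    """Combined average latency in ms, mirroring the vLatency column.

    io_stall is the total time (ms) that all I/O against the file waited;
    integer division matches the T-SQL behavior for these bigint columns.
    """
    transfers = num_of_reads + num_of_writes
    return 0 if transfers == 0 else io_stall // transfers

print(file_latency(io_stall=45000, num_of_reads=800, num_of_writes=200))  # 45
print(file_latency(io_stall=0, num_of_reads=0, num_of_writes=0))          # 0
```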

Memory Contention

SQL Server (and, therefore, Team Foundation Server) performs best when the data pages that it requires are in memory. SQL Server maintains a buffer pool of pages. You can see how much of each database is available in the buffer pool by using the dm_os_buffer_descriptors DMV. If your database is not in the buffer pool, or if its share is lower than you expect, you may need to add more memory to your SQL Server, or move the database to another SQL Server instance.

The following query produces an output similar to Table 25.4:

-- How much of the databases are in memory?
SELECT
    db_name(database_id) as dbName,
    COUNT(*)*8/1024 as BufferPoolMB
FROM sys.dm_os_buffer_descriptors
GROUP BY db_name(database_id)
ORDER BY 2 DESC

Table 25.4 Example Output of dm_os_buffer_descriptors

dbName                 BufferPoolMB
Tfs_Collection1               92251
Tfs_Collection2               15252
tempdb                         2914
Tfs_Warehouse                  1175
Tfs_Configuration               231
Tfs_Collection3                 129
ReportServer                     27
master                            2
ReportServerTempDB                2
model                             0

Team Foundation Server

Ever since the very first release of Team Foundation Server, the server has included rich command logging and tracing functionality. This level of logging and tracing is invaluable in identifying and measuring server performance.

With the release of the Visual Studio Online service, built on the Windows Azure platform, the monitoring and diagnostics functionality of the product has been enhanced. Because the product team members are responsible for the smooth operation of the service, it is in their best interest for the product to be easy to monitor and diagnose. These new responsibilities led to new tracing infrastructure and enhancements to how the Windows Event Log is used in the on-premises version as well.

Command Log

The application tier keeps track of who executed what command at what time. It logs information such as the user name, IP address, user agent, execution time, and execution count for each request. In Team Foundation Server 2013, the command log also shows activity performed using the Web Access interface.

Note

The command log data is also presented in Team Web Access through the Operational Intelligence Hub. For more information, see the Operational Intelligence Hub section of Chapter 21.

In the 2005 and 2008 versions, this data was recorded in the TfsActivityLogging database. In Team Foundation Server 2010, the tables were moved into the Tfs_Configuration and Team Project Collection databases. Following are the two tables used to record this data:

· tbl_Command

· tbl_Parameter

Approximately every 30 seconds, the application tier flushes recent requests to the command log tables in the database, where they can be queried. There is also an internal job that trims the command log to the past 14 days of data.

To show all the commands run by a particular user in the past 24 hours, you can run the following query:

-- Recent commands from a particular user
USE [Tfs_DefaultCollection]
GO

SELECT *
FROM [dbo].[tbl_Command] WITH (NOLOCK)
WHERE StartTime > DATEADD(HOUR, -24, GETUTCDATE())
    AND IdentityName = 'DOMAIN\Username'
ORDER BY StartTime DESC

For commands that run longer than 30 seconds, or commands that fail, the parameters are also logged to tbl_Parameter. This is useful for identifying whether a user is trying to do something unreasonable. One such example is a QueryHistory call on the root folder ($/) with the Recursive flag set. To retrieve the parameters for a command, you must join or filter on the CommandId column, as shown in the following example:

-- Parameters for a particular CommandId
USE [Tfs_DefaultCollection]
GO

SELECT *
FROM tbl_Parameter WITH (NOLOCK)
WHERE CommandId = 12345678
ORDER BY ParameterIndex

The data in the command log is useful for seeing how many users actively use the server. For example, if you want to know how many distinct users have actively used the server in the past seven days, you can run the following query:

-- Recent active users
USE [Tfs_DefaultCollection]
GO

SELECT
    COUNT(DISTINCT IdentityName) as DistinctUsers,
    SUM(ExecutionCount) as TotalExecutionCount
FROM [dbo].[tbl_Command] WITH (NOLOCK)
WHERE StartTime > DATEADD(DAY, -7, GETUTCDATE())
    AND Command IN
        ('UpdateLocalVersion', 'PendChanges', 'Get', 'CheckIn', 'Update', 'GetWorkItem')

This will include any user who has refreshed his or her workspace, checked out a file, saved a work item, or opened a work item. For measuring active users, it's important to filter based upon the command. Otherwise, if a user happens to select a collection in the “Connect to Team Foundation Server” dialog box, he or she will be included in the count, even though that user is not actively using that collection. This can lead to inflated numbers.

Note

With the introduction of the “Local Workspaces” feature in Team Foundation Server 2012, it is no longer necessary for clients to contact the server to update their workspace and check out files. As such, the activity log may not accurately reflect all the active users of the system. You can read more about Local Workspaces in Chapter 6.

The command log is incredibly useful for identifying performance problems for particular commands. For example, you can use the ExecutionTime and ExecutionCount columns to determine the average response time for each command. So, if you want to know the top ten slowest commands for the past seven days, you can run the following query:

-- Top 10 commands with the highest average response time
USE [Tfs_DefaultCollection]
GO

SELECT TOP 10
    Application,
    Command,
    ROUND(SUM(Cast(ExecutionTime AS float) / 1000000) / SUM(ExecutionCount), 3)
        AS ResponseTimeSeconds
FROM [dbo].[tbl_Command] WITH (NOLOCK)
WHERE StartTime > DATEADD(DAY, -7, GETUTCDATE())
GROUP BY Application, Command
ORDER BY
    SUM(Cast(ExecutionTime AS Float) / 1000000) / SUM(ExecutionCount) DESC

Using the information in the command log, you can determine whether complaints of slow performance are widespread or specific to a particular user.
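The ROUND expression in the previous query converts ExecutionTime, which is stored in microseconds, to seconds before averaging. The same arithmetic as a sketch (the function name is mine):

```python
def avg_response_seconds(total_execution_time_us, total_execution_count):
    """Average response time in seconds from the command log's
    ExecutionTime (microseconds) and ExecutionCount columns."""
    return round((total_execution_time_us / 1_000_000) / total_execution_count, 3)

# Five executions totalling 12.5 million microseconds average 2.5 seconds each.
print(avg_response_seconds(total_execution_time_us=12_500_000, total_execution_count=5))  # 2.5
```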

Note

The ExecutionTime in the command log starts when the server receives the first byte of the request. It finishes when the server starts transmitting the last packet of the response to the client.

Because of this, it shows only the server's view of the request, and there may be additional time spent on the client to complete processing of a request. For a more accurate view from a particular client, you can use client-side tracing. An example of this behavior is the GetMetadataEx command, which is called when Team Explorer connects to a Team Project Collection for the first time. When connecting to a collection with a large number of Team Projects, the metadata will be relatively large (sometimes hundreds of megabytes). Once the server has processed the request and sent it to the client, the client will spend some time processing the response. This processing time won't be reflected in the command log.

Active Server Requests

Team Foundation Server provides a web service that lists the currently executing requests on an application tier server. This web service can be used to see real-time blocking, as well as which users are currently using the server. The Team Foundation Server URL for the web service is http://localhost:8080/tfs/TeamFoundation/Administration/v3.0/AdministrationService.asmx.

Warning

If you have multiple application tiers in a network load-balancing configuration, you will need to query each server. The active requests are local to each server and are not aggregated between servers.

No tools are provided with Team Foundation Server for calling this web service. You will need to write your own, or use another method. The simplest (but not necessarily the prettiest) way to view the requests is to use Windows PowerShell.

The following script will dynamically generate a web service proxy object, execute the request, retrieve the results, and print them out (as shown in Figure 25.1).

$tfsadmin = New-WebServiceProxy -UseDefaultCredential `
    -URI http://tfsserver:8080/tfs/TeamFoundation/administration/v3.0/AdministrationService.asmx?WSDL

$tfsadmin.QueryActiveRequests($null, "False") | %{ $_.ActiveRequests } |
    sort StartTime | ft StartTime,UserName,MethodName,RemoteComputer


Figure 25.1 Example output from PowerShell script

If you are not a member of the Team Foundation Server Administrators security group, you will see only your own requests. Otherwise, you will see the requests for all users in all collections.

Health Monitoring Events

Querying the active requests list periodically is a very effective way to identify blocking and other problems in real time. In Team Foundation Server 2013, an internal task exists that identifies long-running requests. By default, it runs every 30 seconds and looks for the following:

· Requests that have been queued for longer than 15 seconds

· When there are more than 10 requests in the queue

· Requests that have been executing for longer than 60 seconds

If any of these conditions is met, a warning event with ID 7005 is logged to the Windows Application event log describing the condition. This allows monitoring systems (such as System Center) to raise an alert. The event text will look something like this:

Detailed Message: A request for service host DefaultCollection
has been executing for 34 seconds, exceeding the warning
threshold of 30.
Request details: Request Context Details
    Url: /tfs/DefaultCollection/VersionControl/v1.0/repository.asmx
    Method: QueryHistory
    Parameters: itemSpec = $/ (Full)
        versionItem = T
        maxCount = 256
        includeFiles = False
        slotMode = True
        generateDownloadUrls = False
        sortAscending = False
    User Name: VSALM\Administrator
    User Agent: Team Foundation (TF.exe, 11.0.50727.1, Other, SKU:9)
    Unique Id: af139ed8-3526-422a-a9ee-16fa084ba5c6

One of the problems often encountered with monitoring systems is that alerts fire too often, are never closed, and eventually are deemed too noisy and disabled. In Team Foundation Server 2013, once the condition has passed, an additional event is logged with ID 7006. This allows the monitoring system to be intelligent and auto-resolve the alert. The event text will look something like this:

Detailed Message: There are no active requests for service host DefaultCollection
that exceed the warning threshold of 30.
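Because the 7005/7006 pair is written to the standard Windows Application log, you can also check for recent occurrences ad hoc with PowerShell, without a full monitoring system. The following is a sketch using the built-in Get-WinEvent cmdlet; it simply lists the most recent warning and auto-resolve events described above:

```
# List recent service host monitor warnings (7005) and their
# matching auto-resolve events (7006) from the Application log.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 7005, 7006 } -MaxEvents 50 |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName -AutoSize
```

If no such events have been logged yet, Get-WinEvent reports that no events match the filter; that is expected on a healthy, lightly loaded server.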

The default thresholds are designed to be reasonable for most environments. However, in some environments, it might be perfectly reasonable for many commands to run longer than 60 seconds. These thresholds can be overridden under the path /Configuration/ServiceHostMonitor/ with the following Team Foundation Server registry keys (default values shown):

· QueuedRequestElapsedThreshold = 15

· QueuedRequestThreshold = 10

· TotalExecutionElapsedThreshold = 60
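Note that these are Team Foundation Server registry keys (stored in the database), not Windows registry keys, so they are not edited with regedit. One approach is to use the ITeamFoundationRegistry service from the client object model; the following PowerShell sketch assumes that service interface and its GetValue/SetValue methods are available in your version of the object model:

```
# Load the TFS 2013 client object model.
[void][Reflection.Assembly]::Load("Microsoft.TeamFoundation.Client, Version=11.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a")

# Connect to the configuration server (not an individual collection).
$server = new-object Microsoft.TeamFoundation.Client.TfsConfigurationServer "http://tfsserver:8080/tfs"
$registry = $server.GetService([Microsoft.TeamFoundation.Framework.Client.ITeamFoundationRegistry])

# Raise the execution-time warning threshold from 60 to 120 seconds,
# then read it back to confirm.
$registry.SetValue("/Configuration/ServiceHostMonitor/TotalExecutionElapsedThreshold", "120")
$registry.GetValue("/Configuration/ServiceHostMonitor/TotalExecutionElapsedThreshold")
```

You must be a Team Foundation Server administrator to change configuration-level registry values, and the exact service interface name is an assumption here; treat this as a starting point rather than a supported procedure.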

Performance Counters

All versions of Team Foundation Server have included performance counters that allow administrators to monitor various aspects of the system. The release of Visual Studio Online required the product team to add further performance counters so that it could monitor the behavior of the system. Unlike an on-premises server, it is not possible to view the SQL performance counters with SQL Azure. This necessitated the addition of the following performance counters in the \TFS Services category:

· Average SQL Connect Time

· Current SQL Connection Failures/Sec

· Current SQL Connection Retries/Sec

· Current SQL Execution Retries/Sec

· Current SQL Executions/Sec

· Current SQL Notification Queries/Sec

These counters can be used to diagnose connectivity and transient errors with the SQL Server environment being used by Team Foundation Server.
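Like any Windows performance counters, these can be collected with Performance Monitor or sampled directly with PowerShell's Get-Counter cmdlet. As a sketch, the following watches SQL execution and retry rates on an application tier for one minute (the counter paths assume the \TFS Services category shown above exists on the machine):

```
# Sample two of the \TFS Services counters every 5 seconds, 12 times.
Get-Counter -Counter '\TFS Services\Current SQL Executions/Sec',
                     '\TFS Services\Current SQL Execution Retries/Sec' `
            -SampleInterval 5 -MaxSamples 12
```

A sustained non-zero retry rate alongside normal execution rates usually points at transient SQL connectivity problems rather than load.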

Server Tracing

Team Foundation Server also includes low-level tracing for diagnosing complex server problems. Typically, this is used only by Microsoft support personnel when investigating a bug or strange behavior.

With the introduction of Visual Studio Online, the product team required a more detailed and flexible tracing mechanism. Because the service shares the same codebase, this tracing infrastructure is available in the on-premises product as well. However, because it requires an intimate knowledge of how the product works, it should be used only under the direction of Microsoft support personnel.

Tracing at Internet Scale

To move an application from an on-premises product to an Internet-scale service often requires design changes. One such example of this is the server tracing in Team Foundation Server.

For Team Foundation Server 2010 and prior versions, tracing was pretty much an “all or nothing” approach. You could turn it on for a subsystem (for example, Work Item Tracking), but then everything within that subsystem produced reams of tracing data. Additionally, the tracing was scoped to a single application tier in a load-balanced environment.

For Visual Studio Online, the team needed more flexibility and much finer grained central control. Some of the scenarios that were considered were:

· A single user is having problems checking in a file to his or her project.

· Many users are having problems with a particular part of the system.

· One of the jobs is not completing successfully but doesn't log enough information in the job result message.

· A very small number of executions of a particular command over a period of time are failing and it's not known why.

· Some users of a particular version of a particular client are experiencing a performance problem.

With these scenarios, the team came up with a flexible solution driven by the prc_CreateTrace and prc_QueryTraces stored procedures in the Tfs_Configuration database. Additionally, thousands of trace messages are spread throughout the code, each with unique TracePoint identifiers. They all remain dormant until enabled in the central database configuration.

This solution allows support personnel to enable tracing for specific code TracePoints within a method, whole methods, individual users, and specific user agents. It is also possible to enable tracing for whole layers of the system, such as BusinessLogic, and within areas, such as CheckIn and CreateWorkspace.

The output of this tracing infrastructure is logged to an Event Tracing for Windows (ETW) session. By default, an Event Log trace listener is available in the Windows Event Viewer under \Applications and Services Logs\Microsoft-Team Foundation Server\Debug.
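Debug channels such as this one are disabled by default in Event Viewer. As a sketch, you can enable the channel and read its events from an elevated prompt; the channel name below is taken from the path above and may differ slightly on your server:

```
# Enable the debug channel, then read from it.
# (Get-WinEvent requires -Oldest when reading debug/analytic logs.)
wevtutil sl "Microsoft-Team Foundation Server/Debug" /e:true
Get-WinEvent -LogName "Microsoft-Team Foundation Server/Debug" -Oldest -MaxEvents 20
```

Remember that the channel produces no output until tracing has been enabled in the central database configuration, which should be done only under the direction of Microsoft support personnel.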

Client Performance Tracing

Similar to server tracing, tracing is also available in the client object model.

Note

For more information on enabling client-side tracing, see the “Team Foundation Server Client Tracing” blog post at http://aka.ms/TfsClientTracing.

If you want a convenient way to see the web service calls your Team Foundation Server clients are making to the server, you can enable the PerfTraceListener trace listener on your client. This is done by adding the following configuration in the appropriate app.config file:

<configuration>
  <appSettings>
    <add key="TFTrace.Writer" value="true" />
    <add key="TFTrace.DirectoryName" value="C:\Temp" />
    <add key="VersionControl.EnableSoapTracing" value="true" />
  </appSettings>
  <system.diagnostics>
    <switches>
      <add name="TeamFoundationSoapProxy" value="4" />
      <add name="VersionControl" value="4" />
      <add name="Download" value="2" />
      <add name="LocalWorkspaces" value="4" />
    </switches>
    <trace autoflush="true" indentsize="3">
      <listeners>
        <add name="perfListener"
             type="Microsoft.TeamFoundation.Client.PerfTraceListener,
                   Microsoft.TeamFoundation.Client,
                   Version=11.0.0.0, Culture=neutral,
                   PublicKeyToken=b03f5f7f11d50a3a"
        />
      </listeners>
    </trace>
  </system.diagnostics>
</configuration>

Once the trace listener is enabled and the application is started, a dialog box will appear, as shown in Figure 25.2. In the dialog box, you'll see how long each call takes (in milliseconds), the number of calls made, and the average time for each call.

image

Figure 25.2 Example of the performance trace listener dialog box

The top section of the dialog box shows the aggregated information. The bottom section shows the list of web service methods in the order in which they were called, the elapsed time in milliseconds, and the time of day when the method completed execution. If the method has not completed, it will display Running for the completion time. If you move the mouse over the entries in the bottom section, a tooltip will show you the stack trace, so you can see what part of the application made the call.

Note

For more information on how to interpret the results of the dialog box, see Buck Hodges's blog post, “How to see the TFS server calls made by the client,” at http://aka.ms/TfsPerfListener.

Job History

Team Foundation Server 2013 includes a background job agent, just as the previous versions did. Jobs are defined in the configuration and in each collection database. The history for each of these jobs is stored in the configuration database. The supported way of accessing this history is through the object model.

Note

The background job data is also presented in Team Web Access through the Operational Intelligence Hub. For more information, see the Job Monitoring topic in the Operational Intelligence Hub section of Chapter 21.

To list the currently defined jobs for a server, you can use the following PowerShell script. The output will be similar to that shown in Figure 25.3:

$ErrorActionPreference = "Stop"
[void][Reflection.Assembly]::Load("Microsoft.TeamFoundation.Client, Version=11.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a")

# Modify the collection URL as necessary.
$tpc = new-object Microsoft.TeamFoundation.Client.TfsTeamProjectCollection "http://localhost:8080/tfs/DefaultCollection"
$jobService = $tpc.GetService([Microsoft.TeamFoundation.Framework.Client.ITeamFoundationJobService])

# List all the jobs and their JobIds
$jobService.QueryJobs() | sort Name | select JobId, Name | ft -a

image

Figure 25.3 PowerShell example of currently defined jobs

To get the recent execution history of a job, as shown in Figure 25.4, you can use the following PowerShell command after running the previous one:

# Get the 20 latest execution results for a particular JobId
$jobService.QueryJobHistory([Guid[]] @('a4804dcf-4bb6-4109-b61c-e59c2e8a9ff7')) |
    select -last 20 | ft ExecutionStartTime,Result,ResultMessage

Note

For more information, see Chris Sidi's blog post, “TFS 2010: Introducing the TFS Background Job Agent and Service,” at http://aka.ms/TfsJobAgent.

image

Figure 25.4 PowerShell example of recent job history

Storage Usage

With a system like Team Foundation Server, it's not uncommon for the storage usage to grow rapidly as people discover the value of the system and start using it for more things. As a Team Foundation administrator, you may want to know what is causing your databases to grow. Because all Team Foundation Server data is stored in SQL Server, you can use the SQL sp_spaceused stored procedure to identify where the data growth is occurring.

The following script will list the total database size and the top 10 largest tables. You will need to run it for each collection database.

-- Database total space used
EXEC sp_spaceused

-- Table rows and data sizes
CREATE TABLE #t (
    [name] NVARCHAR(128),
    [rows] CHAR(11),
    reserved VARCHAR(18),
    data VARCHAR(18),
    index_size VARCHAR(18),
    unused VARCHAR(18)
)
INSERT #t
EXEC [sys].[sp_MSforeachtable] 'EXEC sp_spaceused ''?'''
SELECT TOP 10
    name AS TableName,
    [rows],
    ROUND(CAST(REPLACE(reserved, ' KB', '') AS float) / 1024, 2) AS ReservedMB,
    ROUND(CAST(REPLACE(data, ' KB', '') AS float) / 1024, 2) AS DataMB,
    ROUND(CAST(REPLACE(index_size, ' KB', '') AS float) / 1024, 2) AS IndexMB,
    ROUND(CAST(REPLACE(unused, ' KB', '') AS float) / 1024, 2) AS UnusedMB
FROM #t
ORDER BY CAST(REPLACE(reserved, ' KB', '') AS float) DESC
DROP TABLE #t

You can then use the information in Table 25.5 to match table names to their purposes, and implement strategies to reduce the storage used and control the growth.

Note

In Team Foundation Server 2013, the Work Item Tracking Attachments and Test Attachments are contained in the tbl_Content table that Version Control also uses.

Table 25.5 Largest Tables within a Collection Database

Table Name | Used for | How to Reduce
tbl_Content | All blob content, including version control files, test attachments, and work item tracking attachments | Destroy version control content, delete team projects, or run the Test Attachment Cleanup tool.
tbl_LocalVersion | Version control workspaces | Switch users to Local Workspaces; delete workspaces or reduce the number of folders for which they have mappings; upgrade to a SQL Server edition that supports data compression.
tbl_PropertyValue | Version control code churn metrics | Upgrade to a SQL Server edition that supports data compression.
WorkItemsWere | Work item tracking historical revisions | Destroy work items.
WorkItemLongTexts | Work item tracking long text field data | Destroy work items.
WorkItemsLatest | Work item tracking latest revisions | Destroy work items.
WorkItemsAre | Work item tracking latest revisions | Destroy work items.
tbl_tmpLobParameter | Temporary storage for large in-progress check-ins | N/A

Data Warehouse

One of the key components of Team Foundation Server is the data warehouse. In general, people don't have a problem with the performance or operation of the data warehouse. However, there are two classes of problems that you're more likely to run into as your servers grow larger:

· Processing time—As the number of reportable fields increases, the number of dimensions that Analysis Services must process also increases. This increases the time it takes to process the cube and, therefore, the latency of the data is higher.

· Schema conflicts—In the simple case, when there are two fields in different collections (for example, Priority) with the same name, but a different type (for example, String versus Integer), this results in a schema conflict. That project collection is then blocked from processing warehouse updates, and the data in the relational warehouse and cube become stale.

You can use two reports (“Cube Status” and “Blocked Fields”) to monitor the health and performance of the Team Foundation Server data warehouse. They display the following information:

· Recent processing times

· Current status (whether the cube is processing now and, if not, when it is scheduled to process next)

· Schema conflicts

· Most recent time that each warehouse adapter successfully ran

Note

For more information on how to download, install, and interpret the reports, see “Administrative Report Pack for Team Foundation Server 2010 and 2012” at http://aka.ms/TfsWarehouseReports and “Monitoring the TFS Data Warehouse—FAQ” at http://aka.ms/WarehouseReportsFAQ. The same reports continue to work for Team Foundation Server 2013 without modifications.

Tools

A few tools are useful for monitoring server health and performance. Some are specific to Team Foundation Server, and some are not.

Performance Analysis of Logs Tool

The Performance Analysis of Logs (PAL) tool knows how to analyze a performance counter log file, look for threshold violations, and produce a server health report. It is not specific to Team Foundation Server, and it can identify SQL Server issues.

The tool encapsulates the collective wisdom of Microsoft engineers and other experts to identify possible problems with your servers. Figure 25.5 shows an example of a CPU utilization threshold violation. You can use such a report to identify potential problems at the system level that warrant a deeper look into SQL Server and Team Foundation Server.

Note

For more information and to download the tool, see the PAL project site on CodePlex at http://pal.codeplex.com/.

image

Figure 25.5 Example PAL report

Team Foundation Server Best Practices Analyzer

The Team Foundation Server Best Practices Analyzer (BPA) is a tool used by Microsoft support personnel to help diagnose customer issues. When executed, the BPA tool will connect to your Team Foundation Server, download the event logs and command logs, and run queries against the database. With each Power Tool release, the tool is updated to include rules that detect the causes of common support requests.

For the most complete results, you should run the tool as an administrator on one of your application tier servers. It can also be run remotely if you are an administrator and remote administration is enabled for Windows and SQL Server.

Note

To run the BPA tool, you must download and install the latest Team Foundation Server Power Tools from http://aka.ms/TFS2013PowerTools.

Once it has finished collecting the data, it will parse it and run a series of rules that look for known problems. It displays a report similar to Figure 25.6. Each of the rules has an expected result and a help topic that describes how to rectify an unexpected result.

image

Figure 25.6 Best Practices Analyzer scan report

Team Foundation Server Management Pack for System Center Operations Manager

System Center Operations Manager (SCOM) is an enterprise-level monitoring product from Microsoft. A management pack defines monitors and rules for monitoring specific applications.

The Team Foundation Server 2012 management pack provides both proactive and reactive monitoring of Team Foundation Server 2012. It monitors application tier servers, team project collections, build servers, and proxy servers.

You can download the management pack from http://aka.ms/TFS2013SCOM. Once you have downloaded the management pack, you should review the MPGuide_TFS2013.docx document. This document includes important information on how to set up and use the management pack.

Note

You will need to create a Run As Profile and an associated Run As Account that has administrative access within Team Foundation Server to be able to use the management pack. Refer to the installation guide.

If everything is configured correctly, the management pack will automatically discover Team Foundation Server instances and start monitoring them. It has a series of rules and monitors that look for problems in the event log and check the health of the system.

When the Best Practices Analyzer tool is installed, you can also initiate a BPA scan from the Operator Console.

Summary

In this chapter, you learned about the factors that influence the health of Team Foundation Server. You learned that Windows performance counters are a useful way to record system health, and you learned how to use the built-in SQL Server Dynamic Management Views to understand many aspects of SQL Server Performance. You also learned about the different data sources available within Team Foundation Server, along with some useful queries and reports for determining the health of the system. Additionally, you looked at the tracing and Windows Events available in Team Foundation Server 2013. Finally, this chapter covered three useful tools for monitoring server health and performance.

Chapter 26 takes a look at the new Testing and Lab Management features, and how they can be used to build high-quality software.