Sams Teach Yourself PHP, MySQL and Apache All in One (2012)
Part VI. Administration and Fine-Tuning
Chapter 29. Apache Performance Tuning and Virtual Hosting
In this chapter, you learn the following:
• Which operating system and Apache-related settings can limit the server’s ability to scale or degrade performance
• Several tools for load testing Apache
• How to fine-tune Apache for optimum performance
• How to configure Apache to detect and prevent abusive behavior from clients
• How to configure name-based virtual hosts, IP-based virtual hosts, and the difference between the two
• The dependencies virtual hosting has on DNS
• How to set up scaled-up cookie-cutter virtual hosts
This administration-related chapter focuses on increasing the performance and scalability of your Apache server. In addition, you learn about name-based and IP-based virtual hosting, along with the Domain Name System (DNS) and web browser issues that relate to them. This chapter also explains different mechanisms that you can use to isolate clients from each other and the associated security trade-offs.
Performance and Scalability Issues
This section covers scalability problems and how to prevent them. This section is more of a “don’t do this” list, explaining limiting factors that can degrade performance or prevent the server from scaling. You also learn about the proactive tuning of Apache for optimal performance.
Operating System Limits
Several operating system factors limit Apache’s performance. These factors relate to process creation, memory limits, and the maximum number of simultaneously open files or connections.
The UNIX ulimit command enables you to set several of the limits covered in this section on a per-process basis. Refer to your operating system documentation for details on ulimit’s syntax.
Apache provides settings for preventing the number of server processes and threads from exceeding certain limits. These settings affect scalability because they limit the number of simultaneous connections to the web server, which in turn affects the number of visitors you can service simultaneously from one server.
The Apache Multi-Processing Module (MPM) settings are in turn constrained by OS settings limiting the number of processes and threads. How to change those limits varies from operating system to operating system. On Linux, the system-wide limit on the number of threads is exposed at runtime in the /proc/sys/kernel/threads-max file (very old kernels fixed the limit at compile time through the NR_TASKS constant). You can read the contents of the file with this command:
# cat /proc/sys/kernel/threads-max
You can write to the file using this command:
# echo value > /proc/sys/kernel/threads-max
In Linux (unlike most other UNIX versions), there is a mapping between threads and processes, and they are similar from the point of view of the OS.
In Solaris, those parameters can be changed in the /etc/system file. Those changes do not require rebuilding the kernel but might require a reboot to take effect. You can change the total number of processes by changing the max_nprocs entry and the number of processes allowed for a given user with maxuproc.
Whenever a process opens a file (or a socket), a structure called a file descriptor is assigned until the file is closed. The OS limits the number of file descriptors that a given process can open, thus limiting the number of simultaneous connections the web server can have. How those settings are changed depends on the operating system. On Linux systems, you can read or modify the system-wide limit in /proc/sys/fs/file-max, and you can adjust the per-process limit with ulimit. On Solaris systems, you must edit the value for rlim_fd_max in the /etc/system file. This change requires a reboot to take effect.
You can find additional information at http://httpd.apache.org/docs/2.4/vhosts/fd-limits.html.
Controlling External Processes
Apache provides several directives to control the amount of resources external processes use. Such processes include CGI scripts spawned from the server and programs executed via server-side includes, but do not include PHP scripts that are invoked using the module version because the module is part of the server process.
Following the installation instructions in the initial chapters of this book will result in PHP being installed as a module. Therefore, these directives will not apply in your situation, unless you modified the installation type on your own or are in a virtual hosting situation in which PHP is not installed as a module. However, in the latter situation, it is unlikely you would be able to modify these directives anyway.
Support for the following Apache directives (used in httpd.conf) is available only on UNIX and varies from system to system:
• RLimitCPU—Accepts two parameters: the soft limit and the hard limit for the amount of CPU time in seconds that a process is allowed. If the max keyword is used, it indicates the maximum setting allowed by the operating system. The hard limit is optional. The soft limit can be changed between restarts, and the hard limit specifies the maximum allowed value for that setting.
• RLimitMem—The syntax is identical to RLimitCPU, but this directive specifies the amount (in bytes) of memory used per process.
• RLimitNProc—The syntax is identical to RLimitCPU, but this directive specifies the number of processes.
These three directives help to prevent malicious or poorly written programs from running out of control.
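As a rough illustration, the three directives might appear in httpd.conf as follows. The numeric values are arbitrary assumptions chosen for the example, not recommendations; suitable limits depend on your scripts and your operating system.

```apache
# Illustrative values only (assumptions, not tuned recommendations).

# CPU time per spawned process: soft limit 30 seconds, hard limit 60 seconds
RLimitCPU 30 60

# Memory per spawned process: soft limit 64MB (given in bytes),
# hard limit set to the maximum the operating system allows
RLimitMEM 67108864 max

# Number of processes: soft limit 20, hard limit 30
RLimitNPROC 20 30
```

Because the hard limit is optional, a line such as RLimitCPU 30 sets only the soft limit.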
Performance-Related Apache Settings
This section presents different Apache settings that affect performance.
File System Access
From a resource standpoint, accessing files on disk is an expensive process, so you should try to minimize the number of disk accesses required for serving a request. Symbolic links, per-directory configuration files, and content negotiation are some of the factors that affect the number of disk accesses:
• Symbolic links—In UNIX, a symbolic link (or symlink) is a special kind of file that points to another file. It is created with the UNIX ln command and is useful for making a certain file appear in different places.
Two of the parameters that the Options directive allows are FollowSymLinks and SymLinksIfOwnerMatch. Symbolic links can be used to bypass security settings: for example, you can create a symbolic link from a public part of the website to a restricted file or directory not otherwise accessible via the Web. For that reason, when FollowSymLinks is not enabled, Apache does not follow symbolic links and must check that each file it serves is not one. If SymLinksIfOwnerMatch is present, Apache follows a symbolic link only if the user who owns the link also owns the target file.
Because those tests must be performed for every path element and for every path that refers to a filesystem object, they can be taxing on your system. If you control the content creation, you should add an Options +FollowSymLinks directive to your configuration and avoid the SymLinksIfOwnerMatch argument. In this way, the tests won’t take place, and performance isn’t affected.
• Per-directory configuration files—As explained in Chapter 3, “Installing and Configuring Apache,” it is possible to have per-directory configuration files. These files, usually named .htaccess, provide a convenient way of configuring the server and allow for some degree of delegated administration. However, if this feature is enabled, Apache has to look for these files in each directory in the path leading to the file being requested, resulting in taxing filesystem accesses. If you don’t have a need for per-directory configuration files, you can disable this feature by adding AllowOverride none to your configuration. Doing so avoids the performance penalty associated with accessing the filesystem looking for .htaccess files.
• Content negotiation—Apache can serve different versions of a file depending on client language or preferences. This can be accomplished using specific language-related file extensions, but in that case, Apache must access the filesystem for every request, looking for files with such extensions. If you need to use content negotiation, make sure that you at least use a type-map file, minimizing accesses to disk. Some application-based alternatives to Apache content negotiation for internationalization purposes can be found in Chapter 27, “Application Localization.”
• Scoreboard file—This is a special file that the main Apache process uses to communicate with its child processes on older operating systems. You can specify its location using the ScoreBoardFile directive, but most modern platforms do not require the use of this file. If this file is required, you might find improved performance if you place it on a RAM disk. A RAM disk is a mechanism that allows a portion of the system memory to be accessed as a filesystem. The details on creating a RAM disk vary from system to system.
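A minimal sketch of the symbolic-link and per-directory settings described in this list, assuming the document root used elsewhere in this chapter (/usr/local/apache2/htdocs) and that you control all content beneath it:

```apache
<Directory "/usr/local/apache2/htdocs">
    # Follow symbolic links, so Apache can skip the extra symlink checks
    Options +FollowSymLinks

    # Do not look for .htaccess files anywhere in this tree
    AllowOverride None
</Directory>
```

If some part of the tree still needs delegated configuration, a narrower Directory block for that subdirectory can turn AllowOverride back on there without paying the cost everywhere else.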
Network and Status Settings
A number of network and status options can degrade performance:
• HostnameLookups—When HostnameLookups is set to on or double, Apache performs a DNS lookup to capture the hostname of the client each time the client makes a request. This constant lookup introduces a delay into the response process. The default setting for this directive is off. If you want to capture the hostname of the requestor, you can always process the request logs with a log resolver later, offline, and not in real time.
• Accept mechanism—Apache can use different mechanisms to control how Apache children arbitrate requests. The optimal mechanism depends on the specific platform and number of processors. You can find additional information at http://httpd.apache.org/docs/2.4/misc/perf-tuning.html.
• mod_status—This module collects statistics about the server, connections, and requests, which slows down Apache. For optimal performance, disable this module, or at least make sure that ExtendedStatus is set to off. Note that in Apache 2.4, loading mod_status changes the ExtendedStatus default from off to on, so set it to off explicitly.
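The two directive-based settings in this list could be written as follows in httpd.conf. This is a sketch of the conservative choices described here, not a complete configuration.

```apache
# Do not resolve client IP addresses to hostnames at request time
# (Off is the default; resolve hostnames later from the logs if needed)
HostnameLookups Off

# Keep the detailed per-request status tracking turned off
ExtendedStatus Off
```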
Load Testing with ApacheBench
You can test the performance of your site with benchmarking and traffic-generation tools. Many commercial and open source tools are available, each with varying degrees of sophistication. In general, it is difficult to accurately simulate real-world request traffic because visitors have different navigation patterns, access the Internet using connections with different speeds, stop a download if it is taking too long, click the reload button repeatedly if they get impatient, and so on. As such, some tools record actual network traffic for later replay.
However, for a quick—but accurate—glimpse at basic information regarding your server’s capability to handle heavy traffic, the Apache server comes with a simple, but useful, load-testing tool called ApacheBench, or ab. You can find it in the bin directory of the Apache distribution.
This tool enables you to request a certain URL a number of times and display a summary of the result. The following command requests the main page of Google 1,000 times, with 10 simultaneous clients at any given time:
# /usr/local/apache2/bin/ab -n 1000 -c 10 http://www.google.com/
If you invoke ab without any arguments, you get a complete listing of command-line options and syntax. In addition, the trailing slash on the target URL is required, unless a specific page is named.
The result will look similar to the following:
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking www.google.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Finished 1000 requests
Server Software: gws
Server Hostname: www.google.com
Server Port: 80
Document Path: /
Document Length: 11955 bytes
Concurrency Level: 10
Time taken for tests: 50.751 seconds
Complete requests: 1000
Failed requests: 669
(Connect: 0, Receive: 0, Length: 669, Exceptions: 0)
Write errors: 0
Total transferred: 12710814 bytes
HTML transferred: 11974814 bytes
Requests per second: 19.70 [#/sec] (mean)
Time per request: 507.515 [ms] (mean)
Time per request: 50.751 [ms] (mean, across all concurrent requests)
Transfer rate: 244.58 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 31 50 11.3 46 179
Processing: 88 454 53.8 449 803
Waiting: 84 285 119.0 282 694
Total: 136 504 56.5 499 850
Percentage of the requests served within a certain time (ms)
100% 850 (longest request)
These requests were made over the open Internet; you should get many more requests per second if you conduct the test against a server on the same machine or over a local network, so bear that in mind as you test.
The output of the tool should be self-explanatory; what will be most relevant are the number of requests per second and the average time it takes to fully service a request (the Total time). You can also see how all the requests were served in less than 1 second.
You can play with different settings for the number of requests and with the number of simultaneous clients to find the point at which your server slows down significantly.
Proactive Performance Tuning
The previous sections explained which settings might prevent Apache from scaling. Now it’s time for you to learn some techniques to proactively increase the performance of your server.
Mapping Files to Memory
As explained previously, accesses to disk affect performance significantly. Although most modern operating systems keep a cache of the most frequently accessed files, Apache also enables you to explicitly map a file into memory so that access to disk is not necessary. The module that performs this mapping is mod_file_cache. You can specify a list of files to memory map by using the MMapFile directive, which applies to the server as a whole. An additional directive in Apache 2.x, CacheFile, takes a list of files, caches the file descriptors at startup, and keeps them around between requests, saving time and resources for frequently requested files.
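A short sketch of how these two directives might be used. The file paths are hypothetical, and the LoadModule line assumes mod_file_cache was built as a shared module.

```apache
# Assumes mod_file_cache is available as a shared module
LoadModule file_cache_module modules/mod_file_cache.so

# Map this file into memory at server startup (hypothetical path)
MMapFile /usr/local/apache2/htdocs/index.html

# Open this file at startup and keep its file descriptor cached (hypothetical path)
CacheFile /usr/local/apache2/htdocs/images/logo.png
```

Both directives act at startup, so a file that changes on disk is not picked up until the server is restarted. This makes them best suited to static files that rarely change.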
Distributing the Load
Another way to increase the overall performance of your web services for your end users is to distribute the load among several servers. You can do so in a variety of ways:
• Use a hardware load balancer to direct network and HTTP traffic across several servers, making it look like a single server from the outside.
• Use a software load-balancer solution using a reverse proxy with mod_rewrite.
• Use separate servers to provide images, large download files, and other static material. For example, you can place your images in a server called images.example.com and link to them from your main server.
The fastest way to serve content is not to serve it at all! This can be achieved by using appropriate HTTP headers that instruct clients and proxies of the validity in time of the requested resources. In this way, some resources that appear in multiple pages but do not change frequently, such as logos or navigation buttons, are transmitted only once for a certain period of time.
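One way to send such headers is with mod_expires. The text does not name a specific module, so treat the module choice, the content types, and the one-month lifetime below as assumptions made for illustration.

```apache
# Assumes mod_expires is available as a shared module
LoadModule expires_module modules/mod_expires.so

# Turn on generation of Expires and Cache-Control: max-age headers
ExpiresActive On

# Let clients and proxies reuse images for a month after they fetch them
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/gif "access plus 1 month"
```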
In addition, you can use mod_cache in Apache 2.x to cache dynamic content so that it does not have to be created for every request. This is potentially a big performance boost because dynamic content usually requires accessing databases, processing templates, and so on, which can take significant resources.
Apache 2.4 has many caching features that were considered experimental in earlier versions of Apache. See the Apache Caching Guide for more information on this topic: http://httpd.apache.org/docs/2.4/caching.html.
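A minimal sketch of a disk-backed mod_cache setup for Apache 2.4. The cache directory is a hypothetical path, and the one-hour default lifetime is an arbitrary value picked for the example.

```apache
# Assumes both modules are available as shared modules
LoadModule cache_module modules/mod_cache.so
LoadModule cache_disk_module modules/mod_cache_disk.so

# Cache responses for every URL under / using the disk provider
CacheEnable disk /

# Directory that holds cached responses (hypothetical path; must be writable by the server)
CacheRoot /usr/local/apache2/cacheroot

# Lifetime in seconds for responses that carry no expiry information
CacheDefaultExpire 3600
```

Responses that must never be shared between users should be marked as such by the application (for example, with Cache-Control: private) before a configuration like this is enabled.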
Reducing Transmitted Data
Another method for reducing server load is to reduce the amount of data being transferred to the client. This in turn makes your website load faster for clients, especially those connecting over slow links. You can do a number of things to achieve this:
• Reduce the number of images.
• Reduce the size of your images.
• Compress large, downloadable files.
• Precompress static HTML and use content negotiation.
• Use mod_deflate to compress HTML content. This can be useful if CPU power is available and clients are connecting over slow links. The content will be delivered quicker, and the process will be free sooner to answer additional requests.
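The mod_deflate item in this list can be sketched as follows. The list of content types is an assumption; in Apache 2.4 the AddOutputFilterByType directive is provided by mod_filter, which is why that module is loaded too.

```apache
# Assumes both modules are available as shared modules
LoadModule filter_module modules/mod_filter.so
LoadModule deflate_module modules/mod_deflate.so

# Compress text-based responses before sending them to the client
AddOutputFilterByType DEFLATE text/html text/plain text/css
```

Images and archives are usually already compressed, so they are left out of the list.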
HTTP 1.1 allows multiple requests to be served over a single connection. HTTP 1.0 enables the same thing with keepalive extensions. The KeepAliveTimeout directive enables you to specify the maximum time in seconds that the server waits before closing an inactive connection. Increasing the timeout means that you increase the chance of the connection being reused. However, it also ties up the connection and Apache process during the waiting time, which can degrade performance, as discussed earlier in the chapter.
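For reference, the keepalive behavior is controlled by three directives. The values below are a sketch showing a short timeout; the best setting depends on your traffic pattern.

```apache
# Allow several requests over one connection
KeepAlive On

# Maximum number of requests served over a single connection
MaxKeepAliveRequests 100

# Seconds to wait for the next request before closing an idle connection
KeepAliveTimeout 5
```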
Denial of service (DoS) attacks work by swamping your web server with a great number of simultaneous requests, slowing down the server or preventing access altogether. DoS attacks are difficult to prevent in general, and usually the most effective way to address them is at the network or operating system level. One example is to block specific addresses from making requests to the server; although you can block addresses at the web server level, it is more efficient to block them at the network firewall/router or with the operating system network filters.
Other kinds of abuse include posting extremely big requests or opening many simultaneous connections. You can limit the size of requests and timeouts to minimize the effect of attacks. The default request timeout is 60 seconds in Apache 2.4 (it was 300 seconds in earlier versions), and you can change it with the TimeOut directive. A number of directives enable you to control the size of the request body and headers: LimitRequestBody, LimitRequestFields, LimitRequestFieldSize, LimitRequestLine, and LimitXMLRequestBody.
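The directives named here might be combined as in the following sketch. All the numbers are illustrative assumptions; limits that are too tight will reject legitimate uploads or long URLs.

```apache
# Seconds to wait on network I/O before giving up on a request
TimeOut 60

# Largest request body accepted, in bytes (here 1MB)
LimitRequestBody 1048576

# Maximum number of request header fields
LimitRequestFields 50

# Maximum size in bytes of any one request header field
LimitRequestFieldSize 4094

# Maximum size in bytes of the request line (method, URL, and protocol)
LimitRequestLine 4094

# Largest XML request body accepted, in bytes
LimitXMLRequestBody 1000000
```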
Robots, web spiders, and web crawlers are names that describe a category of programs that access pages in your website, recursively following your site’s links. Web search engines use these programs to scan the Internet for web servers, download their content, and index it. Real-life users use these types of programs to download an entire website or portion of a website for later offline browsing. Normally, these programs are well behaved, but sometimes they can be aggressive and swamp your website with too many simultaneous connections or become caught in cyclic loops.
Well-behaved spiders request a special file, called robots.txt, that contains instructions about how to access your website and which parts of the website won’t be available to them. You can find the syntax for the file at http://www.robotstxt.org/.
By placing a properly formatted robots.txt file in your web server document root, you can control spider activity. In addition, you can stop the requests at the router or operating system level.
Implementing Virtual Hosting
The first generation of web servers was designed to handle the contents of a single site. The standard way of hosting several websites in the same machine was to install and configure separate web server instances for each site. As the Internet grew, so did the need for hosting multiple websites, and a more efficient solution was developed: virtual hosting. Virtual hosting allows a single instance of Apache to serve different websites, identified by their domain names or IP addresses. IP-based virtual hosting means that each domain is assigned a different IP address; name-based virtual hosting means that several domains share a single IP address.
Web clients use the DNS to translate hostnames into IP addresses and vice versa. Several mappings are possible:
• One-to-one—Each hostname is assigned a single, unique IP address. This is the foundation for IP-based virtual hosting.
• One-to-many—A single hostname is assigned to several IP addresses. This is useful for having several Apache instances serving the same website. If each of the servers is installed in a different machine, it is possible to balance the web traffic among them, improving scalability.
• Many-to-one—You can assign the same IP address to several hostnames. The client specifies the website it is accessing by using the Host: header in the request. This is the foundation for name-based virtual hosting.
When a one-to-many mapping is in place, a DNS server can usually be configured to respond with a different IP address for each DNS query, which helps to distribute the load. This is known as round-robin DNS. However, if you have the opportunity to use a load-balancing device instead of relying on a DNS server, doing so will alleviate any problems that may arise when tying your web server to your DNS server. Using a load balancer also eliminates the possibility that high traffic to your web server will bring down your DNS server.
IP-Based Virtual Hosting
The simplest virtual host configuration is when each host is assigned a unique IP address. Each IP address maps the HTTP requests that Apache handles to separate content trees in their own VirtualHost containers, as shown in the following snippet:
[other configurations specific to this host]
[other configurations specific to this host]
If a DocumentRoot or any other configurations are not specified for a given virtual host, the global setting, specified outside any <VirtualHost> section, is used. In this example, each virtual host has its own DocumentRoot. When a request arrives, Apache uses the destination IP address to direct the request to the appropriate host. For example, if a request comes for IP 192.168.128.10, Apache returns the documents from /usr/local/apache2/htdocs/host1.
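A minimal sketch of an IP-based configuration matching this description. Only the first address and its content tree come from the text; the second address, the host2 directory, and the Listen lines are assumptions added to make the fragment complete.

```apache
# Each address must be configured on a network interface of the machine
Listen 192.168.128.10:80
Listen 192.168.129.10:80

<VirtualHost 192.168.128.10:80>
    DocumentRoot /usr/local/apache2/htdocs/host1
    # [other configurations specific to this host]
</VirtualHost>

<VirtualHost 192.168.129.10:80>
    # Second address and content tree are hypothetical
    DocumentRoot /usr/local/apache2/htdocs/host2
    # [other configurations specific to this host]
</VirtualHost>
```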
If the host operating system cannot resolve an IP address used as the VirtualHost container’s name and there’s no ServerName directive, Apache complains at server startup time that it cannot map the IP addresses to hostnames. This complaint is not a fatal error. Apache still runs, but the error indicates that there might be some work to be done with the DNS configuration so that web browsers can find your server. You can use a fully qualified domain name (FQDN) rather than an IP address as the VirtualHost container name and the Listen directive binding (if the domain name resolves in DNS to an IP address configured on the machine and Apache can bind to it).
Name-Based Virtual Hosts
As a way to mitigate the consumption of IP addresses for virtual hosts, the HTTP/1.1 protocol version introduced the Host header, which enables a browser to specify the exact host for which the request is intended. This allows several hostnames to share a single IP address. Most browsers nowadays provide HTTP/1.1 support.
You cannot use SSL with name-based virtual hosts, except in tightly controlled circumstances. Please see http://httpd.apache.org/docs/2.4/ssl/ssl_faq.html#vhosts for more information. You can, however, use SSL with IP-based virtual hosts.
Listing 29.1 shows a typical set of request headers from the Mozilla Firefox browser. If the URL were entered with a port number, it would be part of the Host header contents as well.
Listing 29.1 Request Headers
GET / HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko)
Apache uses the Host: header for configurations in which multiple hostnames share a single IP address (the many-to-one scenario outlined earlier in this chapter), hence the description name-based virtual hosts.
Prior to Apache 2.4, the NameVirtualHost directive enabled you to specify the IP address and port combinations on which the server receives requests for name-based virtual hosts. In Apache 2.2 and earlier, this is a required directive for name-based virtual hosts.
If you are using Apache 2.4, the NameVirtualHost directive is not used; instead, simply be sure you have matched the proper IP and ServerName in the VirtualHost container.
Listing 29.2 has Apache dispatch all connections to 192.168.128.10 based on the Host header contents.
Listing 29.2 Name-Based Virtual Hosts
#Only use NameVirtualHost if pre-Apache 2.4
For every hostname that resolves to 192.168.128.10, Apache can support another name-based virtual host. If a request comes for that IP address for a hostname that is not included in the configuration file, say host3.example.com, Apache simply associates the request to the first container in the configuration file; in this case, host1.example.com. The same behavior is applied to requests that are not accompanied by a Host header; whichever container is first in the configuration file is the one that gets the request.
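A minimal sketch of a name-based configuration consistent with this description. The address and host1.example.com come from the text; host2.example.com and both DocumentRoot paths are assumptions used to show a second container.

```apache
# Only use NameVirtualHost if pre-Apache 2.4
# NameVirtualHost 192.168.128.10

# First container: also receives requests with an unknown or missing Host header
<VirtualHost 192.168.128.10>
    ServerName host1.example.com
    DocumentRoot /usr/local/apache2/htdocs/host1
</VirtualHost>

<VirtualHost 192.168.128.10>
    # Hypothetical second hostname resolving to the same address
    ServerName host2.example.com
    DocumentRoot /usr/local/apache2/htdocs/host2
</VirtualHost>
```

Because the first container acts as the fallback, some administrators place a deliberately generic virtual host first so that unmatched requests do not land on a real site.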
An end user from the example.com domain might have his machine set up with example.com as his default domain. In that case, he might direct his browser to http://host1/ rather than the fully qualified http://host1.example.com/. The Host header would simply have host1 in it rather than host1.example.com. To make sure that the correct virtual host container gets the request, you can use the ServerAlias directive as shown in Listing 29.3.
Listing 29.3 The ServerAlias Directive
#Only use NameVirtualHost if pre-Apache 2.4
In fact, you can give ServerAlias a space-separated list of other names that might show up in the Host header so that you don’t need a separate VirtualHost container with a bunch of common directives just to handle all the name variants.
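A sketch of a container that answers to several names. The short name host1 comes from the scenario described here; the additional alias and the DocumentRoot path are assumptions.

```apache
# Only use NameVirtualHost if pre-Apache 2.4
# NameVirtualHost 192.168.128.10

<VirtualHost 192.168.128.10>
    ServerName host1.example.com
    # Space-separated list of other names that may appear in the Host header
    # (www.host1.example.com is a hypothetical extra variant)
    ServerAlias host1 www.host1.example.com
    DocumentRoot /usr/local/apache2/htdocs/host1
</VirtualHost>
```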
HTTP 1.1 forces the use of the Host header. If the protocol version is identified as 1.1 in the HTTP request line, the request must be accompanied by a Host header. In the early days of name-based virtual hosts, Host headers were considered a trade-off: Fewer IP resources were required, but legacy browsers that did not send Host headers were still in use and, therefore, could not access all the server’s virtual hosts. Today, that is not a consideration; there is no statistically significant number of such legacy browsers in use.
Mass Virtual Hosting
In the previous listings, the DocumentRoot directives follow a simple pattern:
DocumentRoot /usr/local/apache2/htdocs/hostname
where hostname is the hostname portion of the FQDN used in the virtual host’s ServerName. For just a few virtual hosts, this configuration is fine. But what if there are dozens, hundreds, or even thousands of these virtual hosts? The configuration file can become difficult to maintain. Apache provides a good solution for cookie-cutter virtual hosts with mod_vhost_alias. You can configure Apache to map the virtual host requests to separate content trees with pattern-matching rules in the VirtualDocumentRoot directive. This functionality proves especially useful for Internet service providers (ISPs) that want to provide a virtual host for each one of their users. The following example provides a simple mass virtual host configuration:
#Only use NameVirtualHost if pre-Apache 2.4
The %1 token used in this example’s VirtualDocumentRoot directive is substituted for the first portion of the FQDN. The mod_vhost_alias directives have a language for mapping FQDN components to filesystem locations, including characters within the FQDN.
If all the VirtualHost containers are eliminated and our configuration is simplified to the one shown here, the server serves requests for any subdirectories created in the /usr/local/apache2/htdocs directory. If the hostname portion of the FQDN is matched as a subdirectory, Apache looks there for content when it translates the request to a filesystem location.
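A minimal sketch of the simplified configuration without VirtualHost containers. The LoadModule line assumes mod_vhost_alias was built as a shared module, and UseCanonicalName Off is an assumption added so that the name is taken from the Host header of each request.

```apache
# Assumes mod_vhost_alias is available as a shared module
LoadModule vhost_alias_module modules/mod_vhost_alias.so

# Take the server name from the Host header sent by the client
UseCanonicalName Off

# %1 is the first dot-separated part of the requested hostname, so a request
# for host1.example.com is served from /usr/local/apache2/htdocs/host1
VirtualDocumentRoot /usr/local/apache2/htdocs/%1
```

With this in place, adding a new virtual host amounts to creating a new subdirectory and a matching DNS entry; no configuration change or restart is needed.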
Although virtual hosts normally inherit directives from the main server context, some of them, such as Alias directives, do not get propagated. For instance, the virtual hosts will not inherit this filesystem mapping:
Alias /icons /usr/local/apache2/icons
The FollowSymLinks flag for the Options directive is also disabled in this context. However, a variant of the ScriptAlias directive is supported.
The VirtualScriptAlias directive shown in the following snippet treats requests for any resources under /cgi-bin as containing CGI scripts:
#Only use NameVirtualHost if pre-Apache 2.4
Note that cgi-bin is a special token for that directive; calling the directory just cgi won’t work; it must be cgi-bin.
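A sketch of how VirtualScriptAlias might sit next to VirtualDocumentRoot. The per-host CGI directory layout shown is hypothetical.

```apache
UseCanonicalName Off
VirtualDocumentRoot /usr/local/apache2/htdocs/%1

# Requests for /cgi-bin/... on host1.example.com run scripts found in
# /usr/local/apache2/cgi-bin/host1 (hypothetical layout)
VirtualScriptAlias /usr/local/apache2/cgi-bin/%1
```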
For IP-based virtual hosting needs, there are variants of these directives: VirtualDocumentRootIP and VirtualScriptAliasIP.
This chapter provided information on Apache and operating system settings that can affect scalability and performance. In most cases, the problems in website performance relate to dynamic content generation and database access. Writing efficient scripts can help alleviate issues in those categories. Hardware-related improvements, such as high-quality network cards and drivers, increased memory, and disk arrays can also provide enhanced performance.
With regard to virtual hosting, you can configure Apache to handle virtual hosts in a variety of ways. Whether you need a large number of cookie-cutter virtual hosts, need a varied set of virtual host configurations, or have only a limited number of IP addresses to use, there’s a way to configure Apache for your application. Name-based virtual hosting is a common technique for deploying virtual hosts without using up IP addresses. IP-based virtual hosting is another method when you have plenty of IP addresses available and you want to keep your configuration tidy, with a one-to-one mapping of IP addresses to virtual hosts. In addition, if you cannot change your DNS configuration, you have the recourse of using separate port numbers for your virtual hosts.
Q. How can I measure whether my site is fast enough?
A. Many developers test their sites locally or over an internal network, but if you run a public website, chances are good that many of your users will access it over slow links. Try navigating your website from a dialup account and make sure that your pages load fast enough, with the rule of thumb being that pages should load in less than three seconds.
Q. How can I migrate an existing name-based virtual host to its own machine while maintaining continuous service?
A. If a virtual host is destined to move to a neighboring machine, which by definition cannot have the same IP address, there are some extra measures to take. A common practice is to do something like the following, although many variations on these steps are possible:
1. Set the time-to-live (TTL) of the DNS mapping to a very low number. This increases the frequency of client lookups of the hostname.
2. Configure an IP alias on the old host with the new IP address.
3. Configure the virtual host’s content to be served by both name- and IP-address-based virtual hosts.
4. After all the requests for the virtual host at the old IP address diminish (due to DNS caches expiring their old lookups), migrate the server.
Q. Can I mix IP- and name-based virtual hosting?
A. Yes. If multiple IP addresses are bound, you can allocate their usage a number of different ways. A family of name-based virtual hosts might be associated with each; just use a separate NameVirtualHost directive for each IP (if pre-Apache 2.4) or a controlled set of ServerName directives. One IP might be dedicated as an IP-based virtual host for SSL, for instance, whereas another might be dedicated to a family of name-based virtual hosts.
The workshop is designed to help you review what you’ve learned.
1. Name some Apache settings that might affect Apache performance.
2. Name some operating system settings that might limit scalability and performance.
3. Name some approaches to improve performance.
4. Is the ServerName directive necessary in a VirtualHost container?
1. Some of the Apache settings that might affect performance include the FollowSymLinks and SymLinksIfOwnerMatch arguments to the Options directive, enabling per-directory configuration files, hostname lookups, having a scoreboard file, and statistics collection with mod_status.
2. Some operating system settings that might affect scalability and performance include limits for number of processes, open file descriptors, and memory allowed per process.
3. The following are some suggestions for improving performance: load distribution via a hardware load balancer or reverse proxy, data compression, caching, mapping files to memory, and compiling modules statically.
4. The ServerName directive is necessary in a VirtualHost container only when name-based virtual hosts are used. The Host header contents are compared to the contents of the ServerName directive. If no match is found, the VirtualHost containers’ ServerAlias directive value(s) are checked for matches.