eCommerce in the Cloud (2014)

Part II. The Rise of Cloud Computing

Chapter 7. Content Delivery Networks

Content Delivery Networks, known as CDNs, are large distributed networks of servers that accelerate the delivery of your platform to your customers, provide security, assign customers to a data center if you operate more than one, provide throttling, and offer a host of other value-adds. In modern large-scale ecommerce they are ubiquitous and often a necessity.

The largest CDN in the world has 137,000 servers in 87 countries.[49] Servers belonging to CDNs are often colocated directly in ISP/backbone vendors’ data centers and plugged directly into their high-speed networks. Chances are, a CDN has servers within a few milliseconds of where you live. It’s because of this proximity to customers that CDNs are often called edge computing. Often CDNs are entirely transparent to your customers.

An example of the value CDNs offer is in their acceleration of HTML-based web pages.

For most web pages, less than 10%–20% of the end-user response time is spent getting the HTML document from the web server to the web browser. If you want to dramatically reduce the response times of your web pages, you have to focus on the other 80%–90% of the end-user experience.

— Steve Souders High Performance Websites

Figure 7-1 shows a breakdown of the time it takes to load the home page of a popular US-based ecommerce website for an anonymous customer.

Figure 7-1. Frontend versus backend HTTP requests for SamsClub.com home page

Waiting for a server-side response accounts for just 2.4% of the total page view time. This is very representative of most ecommerce web pages. CDNs are responsible for delivering the remaining 97.6% in this example. CDNs can also accelerate the delivery of the server-side response, as we’ll discuss later in the chapter. CDNs play an increasingly important role when transitioning to the cloud.

TIP

While CDNs have long been associated with delivering websites (HTML, images, CSS, JavaScript), their role has greatly expanded to the point where they’re now crucial to delivering entire platforms.

Let’s look at their multifaceted roles and how they can improve your customers’ experience.

What Is a CDN?

CDNs first started to be used in the late 1990s to deliver static content at scale. At the time, most eCommerce websites were delivered from in-house corporate data centers belonging to eCommerce vendors. Serving large amounts of static content requires an enormous amount of Internet bandwidth and specialized network infrastructure that was prohibitively expensive and complicated. A side effect of delivering all of this content from a CDN was that performance improved. Improved performance leads to improved customer satisfaction, higher conversion rates, and increased brand loyalty. As performance became more important to the customers of CDNs, and to eCommerce vendors in particular, CDN vendors shifted their attention to improving performance.

For a while, CDNs were basically web servers serving up static content. Their value proposition was that they provided availability and scale by offloading static content and delivering it from machines near the end user. Over time, CDN vendors have evolved their offerings to:

§ Proxy HTTP requests back to your data centers (called an origin), effectively taking your platform off the Internet by forcing all HTTP requests through the CDN first.

§ Optimize the delivery of content through advanced functionality like content pre-fetching, network optimization, compression, image resizing, geolocation, and modifying the HTML of pages to improve the rendering and browser performance.

§ Cache entire responses (including HTML pages) at the edge, such as the home page for anonymous customers.

§ Cache API calls at the edge. Responses are typically XML or JSON.

§ Offer value adds like a web application firewall, protection against distributed denial-of-service attacks, and a full Global Server Load Balancing solution.

§ Reduce CPU usage at your origin by multiplexing HTTP connections and keeping the connections alive for longer periods of time.

NOTE

An origin is a term used by Content Delivery Network vendors to refer to the data center(s) that actually generate the content that the Content Delivery Network serves. This typically means the data centers where you have your application servers.

Are CDNs Clouds?

CDNs are clouds, albeit a lesser form of clouds. To be considered part of the cloud, an offering must be described by the following three adjectives:

§ Elastic

§ On demand

§ Metered

CDNs always meet the first and second requirements but not always the third. Some vendors permit the use of their services only with contracts that last a year or longer.

WARNING

Fixed long-term contracts are a common way of paying CDN vendors, so a CDN sold that way may technically not be considered cloud computing. Pragmatism should rule your decision making. Go for the best vendor, regardless of how they bill you.

With their core business now being fairly mature, CDNs are moving down into the space traditionally owned by cloud vendors. Cloud vendors are also moving up into the CDN space as they seek to offer their customers full vertically integrated solutions and increase their share of their customers’ technology spending. All technology vendors seek to provide their customers with one-stop, full vertically integrated solutions, as opposed to point products/services.

The real difference between CDNs and Infrastructure-as-a-Service vendors is that CDNs still don’t originate the content they’re serving. They just accelerate the delivery of content originated elsewhere. Infrastructure-as-a-Service vendors originate content and can accelerate its delivery to some degree. While there’s a lot of overlap, both still do fundamentally different things.

For more information on Content Delivery Networks and optimization, read Steve Souders’ High Performance Websites and Even Faster Websites (O’Reilly).

Serving Static Content

The first and original value proposition of CDNs is that they almost entirely eliminate latency. When you pull up your favorite ecommerce website, you make a single HTTP request to http://www.walgreens.com. In response to your request, you’ll get an HTML file that’s probably under 100 kilobytes in size. If that was it, you wouldn’t need a CDN. The average latency between Tokyo and London is only 242 milliseconds.[50] That latency could be tolerated for one HTTP request.

A web browser, however, has to make many more HTTP requests than that. When your web browser gets the HTML back from the origin, it parses it to find out what other content it needs to fetch to render the page:

<script type="text/javascript" src='/gomez-tag.js'></script>
<script type="text/javascript" src='http://img.website.com/scripts/mbox.js'></script>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico"/>
<script type="text/javascript" src="http://www.website.com/scripts/common.js"></script>
<script type="text/javascript" src="http://img.website.com/scripts/menu.js"></script>

Each of these includes requires an HTTP request, at least until the object is cached locally on the client’s web browser. The number of HTTP requests can range from dozens to several hundred. Table 7-1 shows a random sampling of the page weight and number of HTTP requests needed from the top 100 largest ecommerce websites in the US.

Table 7-1. Page weight and number of HTTP requests for a sample of large ecommerce websites in the US

Website                       Page weight   Number of HTTP requests
http://www.chicos.com         1.9 MB        79
http://www.1800flowers.com    2.6 MB        176
http://www.jcrew.com          1.2 MB        134
http://www.walgreens.com      1.1 MB        180
http://www.samsclub.com       1.6 MB        145
http://www.shutterfly.com     1.6 MB        172
http://www.lowes.com          1.2 MB        108
http://www.ebay.com           4.1 MB        286
http://www.hsn.com            2.2 MB        108
http://www.staples.com        1.1 MB        124
Mean                          1.86 MB       151

The problem with loading all of these objects in under a few seconds is that web browsers load objects in serial batches of roughly 10 HTTP requests. Let’s look at an example. Using the mean number of HTTP requests from the preceding sample, a customer in Los Angeles accessing a website served from New York would need about 15 serial round-trips. At roughly 67 milliseconds each, that is about one additional second of overhead due solely to the latency between the two cities, as shown in Figure 7-2.

Figure 7-2. Browsers making HTTP requests in batches

This gets even worse for websites that are served to audiences on high latency networks, such as those not physically near to the origin, cellular networks, etc. This doesn’t include the time it takes for the origin to actually generate the response. It can take one second or more just to generate the HTTP response for a dynamic page.

Without a CDN, loading a page consists of what’s shown in Figure 7-3.

Figure 7-3. Page rendering without Content Delivery Network

Now, let’s see what happens when you use a CDN, as in Figure 7-4.

Figure 7-4. Page rendering with Content Delivery Network

As you can see, only one HTTP request incurs the 67-millisecond overhead. The remaining 150 HTTP requests are served directly from the customer’s local CDN data center in Los Angeles. The latency is almost entirely eliminated. It is this principle that allows websites to be served from one data center to a global audience. Assuming the same 15 round-trips back to the origin data center to render a page, a website served from London to Tokyo with 260 milliseconds of latency would incur 3.9 seconds of overhead just due to latency. When you factor in the time it takes to render the HTML, serve the static content, and so forth, response times of 6+ seconds are to be expected. In addition to substantially reducing latency, the static content is downloaded faster because the packets must travel a shorter distance across fewer network hops. With average home pages approaching 2 MB, clients have to download a lot of data very quickly.
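The arithmetic behind these figures can be sketched in a few lines. This is a simplified model rather than a measurement: it assumes the 150 static objects are fetched in serial batches of 10 and that every batch costs exactly one round-trip. The function name and parameters are illustrative.

```python
import math


def latency_overhead_ms(static_requests: int, batch_size: int, round_trip_ms: float) -> float:
    """Latency overhead when static objects are fetched from the origin in serial batches."""
    batches = math.ceil(static_requests / batch_size)
    return batches * round_trip_ms


# 150 static objects in batches of 10, using the latencies quoted in the text
print(latency_overhead_ms(150, 10, 67))   # Los Angeles to New York: 1005 ms
print(latency_overhead_ms(150, 10, 260))  # Tokyo to London: 3900 ms
```

With a CDN serving the static objects locally, only the single request for the HTML pays the long round-trip.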

The business advantages of this are clear: larger, more interactive pages can be delivered with less latency than if you didn’t use a CDN.

Serving Dynamic Content

The technology behind serving static content is fairly straightforward. CDNs build dozens or even hundreds of data centers and serve up static content from web servers in those data centers under your own domain (e.g., http://www.website.com/images) or a subdomain of yours (e.g., http://images.website.com). It used to be that images had to be served from the vendor’s domain (e.g., http://customer.cdn-vendor.com). Content is accelerated primarily because it’s so close to customers.

To move beyond static content delivery, CDNs can act as a proxy in front of your entire website. This approach requires your DNS record to point to the CDN vendor so that all traffic passes through the CDN (Figure 7-5).

Figure 7-5. Content Delivery Network as a reverse proxy

Because your DNS record points to your CDN vendor, requests for http://www.website.com go through your CDN. The benefits to performance and security are enormous. Let’s explore the different things this enables in the following sections.

Caching Entire Pages

The vast majority of HTTP requests are for static content. Of the average 151 HTTP requests needed to render a page for the first time, the first HTTP request is for the HTML of the page. Until the web browser loads and parses that HTML, none of the other 150 HTTP requests are made. In other words, it is on the critical path. The requests for HTML need not always get passed back to the origin, because the HTML is usually the same for a given set of input parameters. For example, HTTP requests made by anonymous customers for a home page (e.g., http://www.website.com) are likely to return the exact same HTML unless you employ some advanced targeting based on a user’s geography.

The vast majority of traffic to an ecommerce platform is cacheable because of what’s known as the ecommerce traffic funnel, as shown in Figure 7-6.

Figure 7-6. ecommerce traffic funnel

This doesn’t even include traffic from nonhuman bots, which now accounts for 61% of HTTP requests.[51] All responses to bots can be served directly from a CDN. The remaining HTTP requests depicted in this funnel are from anonymous customers for the home page, category pages, product pages, and so forth. Again, most of those pages can be served from a CDN too. Only a relatively small fraction of total traffic is from real customers who are actually logged in. An even smaller percentage of customers actually buy anything.

Many customers visit websites with persistent login cookies. Websites welcome back customers by saying, “Welcome Back, [First Name]!” or something similar. If the personalization isn’t too substantial, you can simply cache the entire HTML page on the CDN but make an AJAX callback to your origin to populate it with dynamic content, like the customer’s name. Code-wise, it would look something like this:

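<head>
  <script src="/app/jquery/jquery.min.js"></script>
  <script>
    // Wait for the DOM so that #WelcomeMessage exists before it is updated
    $(function () {
      $.ajax({
        url: "/app/RetrieveWelcomeMessage",
        success: function (result) {
          // result is, for example, "Welcome Back, Kelly!"
          // text() rather than html() so the name is never interpreted as markup
          $("#WelcomeMessage").text(result);
        }
      });
    });
  </script>
</head>
<body>
  <h2><span id="WelcomeMessage">Please Log In</span></h2>
  ... rest of web page
</body>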

You can repeat this for other dynamic areas on your web page, like the “You Might Also Like” section or the main image on the home page. You could also make just one callback to your origin, with a single JSON or XML response containing all of the data required to properly personalize the page.

The advantage of this approach is that it takes the origin out of the critical path for the initial HTML page: the page itself comes from the CDN, and the AJAX request is asynchronous. You can still employ limited personalization. Customers get fast performance, and your origin is barely touched. It’s a great approach that’s discussed further in Chapter 11.
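The single-callback variation mentioned above could be served by an origin handler along these lines. This is a minimal sketch, not the book's implementation: the function name, the JSON field names, and the sample values are all hypothetical.

```python
import json
from typing import List


def personalization_payload(first_name: str, recommended_skus: List[str], hero_image_url: str) -> str:
    """Bundle every personalized fragment of the page into one JSON document."""
    return json.dumps({
        "welcomeMessage": f"Welcome Back, {first_name}!",  # fills the welcome banner
        "youMightAlsoLike": recommended_skus,              # fills the recommendations area
        "heroImage": hero_image_url,                       # main image on the home page
    })


# Hypothetical customer data; a real origin would look this up from the session
body = personalization_payload("Kelly", ["sku-1001", "sku-2002"], "/images/hero-spring.jpg")
```

The cached page then needs only one asynchronous request to the origin, however many personalized areas it contains.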

A slight variation is to cache different versions of each page in a CDN. CDNs are all capable of looking more deeply into the HTTP request at fields such as your source IP address, user agent, URL parameters, and cookies. This information can then be used by CDNs to discover facts like these:

§ Whether the customer is logged in

§ Web browser/user agent

§ Physical location (sometimes accurate to zip + 4 within the US or post code outside the US)

§ Internet connection speed

§ Locale

§ Operating system

§ Screen dimensions

§ Flash support

§ Capabilities supported by each device

If you have variations of your pages based on these attributes, you can just store each variation in a CDN and have it pull the right version of the page for each customer. For example, a retailer selling online internationally could have country-specific versions of each home page, with each locale having its own copy. That would save dozens, if not hundreds, of milliseconds in just latency while allowing for the pages to be heavier and more dynamic.
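Conceptually, storing one copy per variation amounts to widening the cache key. The sketch below is a toy in-memory model under that assumption; the class, the choice of country and device as the varying attributes, and the stand-in origin function are hypothetical and do not reflect any vendor's API.

```python
from typing import Callable, Dict, Tuple

VariantKey = Tuple[str, str, str]


class VariantCache:
    """Toy edge cache that keeps one copy of a page per (path, country, device) variant."""

    def __init__(self, fetch_from_origin: Callable[[str, str, str], str]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[VariantKey, str] = {}
        self.origin_hits = 0

    def get(self, path: str, country: str, device: str) -> str:
        # Normalize the attributes so "us"/"US" do not create duplicate variants
        key = (path, country.upper(), device.lower())
        if key not in self._store:
            # Cache miss: go back to the origin once for this variant
            self.origin_hits += 1
            self._store[key] = self._fetch(*key)
        return self._store[key]


def render_home_page(path: str, country: str, device: str) -> str:
    # Stand-in for the origin; a real origin would render the full page
    return f"<html><!-- {path} for {country} on {device} --></html>"


cache = VariantCache(render_home_page)
cache.get("/", "us", "mobile")
cache.get("/", "US", "Mobile")   # same variant, served from the cache
cache.get("/", "JP", "desktop")  # different variant, fetched from the origin
```

The more attributes that go into the key, the more copies the cache must hold and the lower the hit rate for each copy, so it pays to vary only on attributes that actually change the page.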

Even many search result pages can now be cached. For large eCommerce platforms, 20% of search terms account for 80% of the traffic. So long as you can pull out the search parameters and put them in the URL, you can cache the pages. This requires search result pages to have URLs like search.jsp?query=shirt&size=xl&onsale=true. This trick can result in even more of your platform being served directly from a CDN.
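Caching by URL works best when equivalent searches map to the same URL. One way to approach that, sketched here with Python's standard urllib.parse module, is to sort the parameters and normalize their case before using the URL as a cache key. Treating search values as case-insensitive is an assumption made for this example, and the function name is illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_cache_url(url: str) -> str:
    """Return a normalized form of a search URL suitable for use as a cache key."""
    parts = urlsplit(url)
    # Sort parameters and lowercase them so reordered or re-cased queries share one entry
    params = sorted((name.lower(), value.strip().lower()) for name, value in parse_qsl(parts.query))
    # Drop the fragment: it is never sent to the server
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(params), ""))


a = canonical_cache_url("http://www.website.com/search.jsp?query=shirt&size=xl&onsale=true")
b = canonical_cache_url("http://www.website.com/search.jsp?size=XL&onsale=true&query=Shirt")
# a and b are identical, so both requests can be answered by a single cached page
```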

Pre-fetching Static Content

Some pages just cannot be statically cached. For example, checkout pages are inherently dynamic and cannot be easily cached. When used as a reverse proxy, CDNs can speed up the delivery of the static content for all pages. Around 150 of an average of 151 HTTP requests to initially load a page are for static content.

Because the HTTP response passes back through the CDN on its return to your customers, the CDN can parse the HTML and proactively make concurrent HTTP requests to the origin for the static content it doesn’t already have. CDNs have dozens or even hundreds of data centers, and each data center generally maintains its own autonomous cache. A CDN can make all of the HTTP requests concurrently over a lightning-fast network, whereas your web browser has to make HTTP requests in batches of 10 over a slow “last mile” network before it even hits the CDN’s optimized network (see Figure 7-7).

Figure 7-7. Retrieving a web page through a CDN with pre-fetching

Pre-fetching is wise to use and can yield substantial benefits.
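The discovery step of pre-fetching, finding which static objects a page references and which of them are not yet cached, can be sketched with Python's standard html.parser module. This is an illustration only: the class and function names are hypothetical, only script, img, and link tags are considered, and the concurrent fetching itself is left out.

```python
from html.parser import HTMLParser
from typing import List, Set
from urllib.parse import urljoin


class StaticAssetFinder(HTMLParser):
    """Collect absolute URLs of scripts, images, and linked resources in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.assets: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("script", "img") and attributes.get("src"):
            self.assets.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            self.assets.append(urljoin(self.base_url, attributes["href"]))


def assets_to_prefetch(html: str, base_url: str, already_cached: Set[str]) -> List[str]:
    """Return the referenced assets that the edge does not hold yet."""
    finder = StaticAssetFinder(base_url)
    finder.feed(html)
    return [url for url in finder.assets if url not in already_cached]


page = (
    '<script type="text/javascript" src="/gomez-tag.js"></script>'
    '<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico"/>'
    '<script type="text/javascript" src="http://img.website.com/scripts/menu.js"></script>'
)
missing = assets_to_prefetch(page, "http://www.website.com/", {"http://www.website.com/favicon.ico"})
```

The URLs in the returned list are the ones an edge server would request from the origin concurrently while the HTML is still on its way to the browser.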

Security

CDNs are able to provide exceptional security by basically erasing your origin data center(s) from the Internet. To get to your origin, everybody must first go through the CDN. That alone provides enormous value by reducing your attack profile. Defense in depth, or adding security in layers, is an excellent defense against attacks. CDNs have a few tricks that they are able to employ to keep you secure.

Distributed denial-of-service attacks, whereby attackers flood your origin with traffic in an attempt to knock you offline, are a big problem. In addition to special techniques to prevent and stop distributed denial-of-service attacks, CDNs have many thousands of physical servers across dozens or even hundreds of data centers that can soak up traffic from an attack. For example, this can be helpful for US-based eCommerce vendors, as many attacks originate from Asia. The CDN’s servers in Asia soak up the traffic from the attack, leaving the US servers and origin to continue serving traffic to the US and other customers around the world as normal. Also, attacks tend to target one website at a time, leaving the excess capacity in a CDN available to handle the onslaught of traffic from an attack. CloudFlare famously handled 118 gigabits of data per second,[52] despite being a fairly small CDN relative to its competitors.

Most ecommerce platforms have some form of distributed denial-of-service attack mitigation in place, whether from a dedicated Software-as-a-Service vendor or a CDN.

It is exceptionally rare these days that attackers are able to gain root access to your operating system. Use of CDNs and other intermediaries, coupled with strong firewalls, has largely prevented those attacks. Attacks like SQL injection (forcing the database to execute your own arbitrary SQL), cross-site scripting (which allows sessions and the permissions tied to them to be stolen), and code injection (executing your own arbitrary code) are far more likely. For example, SQL injection is a common vulnerability:

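<%@ page import="java.sql.*" %>
<%
    // Deliberately vulnerable: the request parameter is concatenated straight into the SQL text.
    // conn is assumed to be an open java.sql.Connection.
    String userId = request.getParameter("userId");
    String query = "SELECT * FROM user WHERE userId='" + userId + "'";
    Statement st = conn.createStatement();
    ResultSet res = st.executeQuery(query);
%>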

Setting userId to 12345' or '1'='1 by requesting a URL containing &userId=12345'%20or%20'1'%3D'1 turns the WHERE clause into userId='12345' or '1'='1', which is true for every row. The application then prints the details of every single user in the database without any system being explicitly compromised.
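The usual application-side defense is to bind user input as a query parameter instead of concatenating it into the SQL text. The sketch below demonstrates the difference using Python's built-in sqlite3 module and an in-memory database. The table contents are made up for the demonstration, and this approach complements whatever filtering a CDN performs at the edge rather than replacing it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (userId TEXT, name TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?)", [("12345", "Kelly"), ("67890", "Sam")])

malicious = "12345' or '1'='1"

# Vulnerable: the input is concatenated into the SQL text, so the OR clause is executed
unsafe_rows = conn.execute("SELECT * FROM user WHERE userId='" + malicious + "'").fetchall()

# Safer: the input is bound as a parameter and is only ever compared as a value
safe_rows = conn.execute("SELECT * FROM user WHERE userId = ?", (malicious,)).fetchall()

print(len(unsafe_rows))  # 2: every user is returned
print(len(safe_rows))    # 0: no user has that literal ID
```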

Many CDNs have full web application firewalls in place to inspect each HTTP request and evaluate it for attacks. For example, any parameter with a value of select%20*%20from%20credit_card (select * from credit_card) should never be allowed to be passed back to the application.
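To make the idea concrete, here is a deliberately naive sketch of parameter screening. Real web application firewalls use far richer rule sets; the patterns and the function name below are assumptions chosen only to catch the two examples in this section, and they would produce false positives on legitimate text.

```python
import re
from urllib.parse import parse_qsl

# Hypothetical, intentionally small rule set
SUSPICIOUS = re.compile(
    r"(\bselect\b.+\bfrom\b|\bunion\b.+\bselect\b|'\s*or\s*')",
    re.IGNORECASE,
)


def is_request_allowed(query_string: str) -> bool:
    """Reject a request if any decoded parameter value looks like injected SQL."""
    # parse_qsl URL-decodes values, so %20 and %3D are seen as ' ' and '='
    for _name, value in parse_qsl(query_string, keep_blank_values=True):
        if SUSPICIOUS.search(value):
            return False
    return True


is_request_allowed("query=shirt&size=xl")                  # allowed
is_request_allowed("id=select%20*%20from%20credit_card")   # blocked
is_request_allowed("userId=12345'%20or%20'1'%3D'1")        # blocked
```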

When you operate a large-scale ecommerce platform, you’ll find that certain bots can wreak havoc by requesting too many pages too quickly. Since most bots don’t understand HTTP sessions, they’ll end up creating a new HTTP session for each page view. Most CDNs allow you to blacklist by IP, user agent, subnet, and so forth.
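A blacklist check of this kind might look like the following sketch, which uses Python's standard ipaddress module. The addresses come from ranges reserved for documentation, and the user-agent fragment is made up; real CDNs expose this through their own configuration rather than code like this.

```python
import ipaddress

# Hypothetical blacklist entries
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_IPS = {ipaddress.ip_address("198.51.100.7")}
BLOCKED_AGENT_FRAGMENTS = ("badbot",)


def is_blocked(client_ip: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked IP, subnet, or user agent."""
    address = ipaddress.ip_address(client_ip)
    if address in BLOCKED_IPS:
        return True
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)
```

Checking at the edge means a blocked bot never creates an HTTP session at the origin in the first place.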

Many CDNs also offer full compliance with common security frameworks such as FedRAMP,[53] PCI DSS,[54] HIPAA,[55] and ISO.[56] Compliance with these frameworks helps to demonstrate that these vendors can be trusted with your most sensitive data.

Security will be discussed in detail in Chapter 9.

Additional CDN Offerings

In addition to performance and security, CDNs offer many ancillary services such as DNS and storage. CDNs are strategically placed by having a large footprint of servers around the world plugged directly into backbone networks. From that vantage point, it’s easy to push other add-ons to the edge, using the considerable infrastructure they have in place.

Frontend Optimization

The frontend code of most ecommerce websites is very poorly written. Individual developers are working on their own small page fragments, often with nobody looking at the big picture. By the time anybody cares about performance, it’s usually too late to go back and fix things. Many CDNs now offer HTML rewriting (see Figure 7-8), whereby they will dynamically rewrite your HTML at the edge for each specific customer based on factors such as device type, resolution, web browser, and connection speed.

Figure 7-8. HTML optimization performed by the CDN

Optimizations can include the following:

§ Reducing the number of HTTP requests by combining CSS and JavaScript files, and by inlining

§ Pushing commonly referenced static items down to the web browser before the HTTP request is even made to the CDN

§ Making browser-specific optimizations

§ Deferring the loading of third-party JavaScript beacons (e.g., analytics and ads) until after the page has fully rendered

§ Using just-in-time or on-demand image loading, which loads images as the customer scrolls

§ Retrieving images from multiple subdomains to allow the web browser to download more in parallel

§ DNS pre-fetching

§ Reducing whitespace

§ Resizing images

§ Using compression

§ Rewriting HTML to leverage browser-specific features

If you’re unable to perform these optimizations on your own, it is highly recommended that you use these services.
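As one illustration of these rewrites, spreading images across multiple subdomains can be sketched as a deterministic mapping from image path to subdomain. The img0 to img3 subdomain names are hypothetical. A stable hash (CRC-32 here) is used so that a given image always maps to the same subdomain and stays cacheable.

```python
import zlib


def shard_url(image_path: str, shard_count: int = 4, domain: str = "website.com") -> str:
    """Map an image path to one of several subdomains in a stable, repeatable way."""
    shard = zlib.crc32(image_path.encode("utf-8")) % shard_count
    return f"http://img{shard}.{domain}{image_path}"


# The same path always yields the same URL; different paths spread across subdomains
url = shard_url("/images/shirt.jpg")
```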

DNS/GSLB

DNS is an area that CDNs have invested heavily in, both in standard DNS hosting and more-advanced Global Server Load Balancing (GSLB). We’ll discuss DNS and GSLB extensively in Chapter 10. DNS is something no ecommerce vendor should host themselves. Disadvantages of self hosting include the following:

§ Cost to properly build and maintain DNS

§ Challenge of deploying DNS across multiple data centers or multiple networks

§ Security concerns:

    § DNS is often targeted for exploits.

    § DNS can be brought down with distributed denial-of-service attacks.

    § DNS can be tricked into flooding a distributed denial-of-service attack victim with traffic.

§ Latency involved, with customers querying DNS servers

Properly hosted DNS solutions, whether in a CDN or not, are able to overcome these challenges primarily through their ability to focus. Vendors who sell this service are able to hire the best experts in the world, use the best technology, and employ the best security techniques. The marginal cost of a new consumer of their service is very low, allowing them to make money while saving you money.

CDNs offer the ability to respond to DNS queries from the edge, likely just a few milliseconds away from each customer, as shown in Figure 7-9.

Figure 7-9. DNS resolution with a CDN

With traditional DNS, you may have to go cross-country or even transcontinental to retrieve an IP address (see Figure 7-10).

Figure 7-10. DNS resolution without a CDN

DNS is much more complicated than this, but as you can see, serving DNS from the edge has advantages.

In addition to mapping domain names back to IP addresses, DNS can also be used to assign customers to data centers. Each data center you’re running an ecommerce platform out of generally presents one IP address to the world. If you’re running out of multiple data centers, you need a way of deciding which data center each customer should be assigned to. This is called Global Server Load Balancing (GSLB) and it’s effectively enhanced DNS.

GSLB works by continually directing customers to the right data center based on a combination of factors, including availability, data center capacity/utilization, arbitrary weights, and real-time performance. Again, we’ll discuss GSLB and DNS in more detail in Chapter 10.
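How those factors might be combined is sketched below. The scoring formula is an assumption made purely for illustration, since vendors do not publish theirs, and the data center names, field names, and the 90% utilization cutoff are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCenter:
    name: str
    available: bool
    utilization: float   # 0.0 (idle) to 1.0 (full)
    weight: float        # arbitrary operator-assigned preference
    response_ms: float   # recently measured response time, assumed to be positive


def pick_data_center(centers: List[DataCenter], max_utilization: float = 0.9) -> Optional[DataCenter]:
    """Pick the best data center that is up and has spare capacity, or None."""
    candidates = [c for c in centers if c.available and c.utilization < max_utilization]
    if not candidates:
        return None
    # Higher weight and spare capacity raise the score; slower responses lower it
    return max(candidates, key=lambda c: c.weight * (1.0 - c.utilization) / c.response_ms)


centers = [
    DataCenter("new-york", True, 0.5, 1.0, 40.0),
    DataCenter("london", True, 0.2, 1.0, 120.0),
    DataCenter("tokyo", False, 0.1, 1.0, 10.0),  # down, so never chosen
]
chosen = pick_data_center(centers)
```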

CDNs have a fairly unique advantage over traditional DNS hosting vendors: because they have so many servers connected to so many different networks around the world, they are able to accurately map the real-time performance of the Internet. It is often the case that the fastest route between any two points is not the shortest distance-wise. Network capacity, speed, congestion, hops, and interference by governments all affect network throughput and latency. Another advantage CDNs have is they’re able to monitor the actual time it’s taking each data center to respond to HTTP requests, because they’re often serving as proxies.

Throttling

Prior to cloud computing, each platform had a fixed amount of capacity it could handle without breaking. For example, if you deployed 500 servers and each server got you 10,000 concurrent customers, you knew you couldn’t handle more than 5,000,000 concurrent customers. There was no use in letting anybody access your origin if you knew that it wouldn’t work. CDNs offer the ability to throttle so that the 5,000,001st customer would get directed to a virtual waiting room. At a minimum, these waiting rooms offer helpful messages about the situation, including estimates as to when the site will be accessible again. Waiting rooms can also have games or even full catalogs so customers can begin shopping and then finish after your website comes back online again. Employing throttling protects your origin from overuse while keeping your customers happier than if they simply received an error message.
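The admission decision itself is simple, as the following sketch shows. It uses the capacity figures from the example above; the class name and the "origin" and "waiting-room" labels are illustrative, and a real CDN would track sessions across many edge servers rather than in a single counter.

```python
class WaitingRoom:
    """Admit customers to the origin until a fixed capacity is reached."""

    def __init__(self, servers: int, customers_per_server: int) -> None:
        self.capacity = servers * customers_per_server
        self.active = 0

    def admit(self) -> str:
        if self.active < self.capacity:
            self.active += 1
            return "origin"
        # Over capacity: serve the waiting room from the edge instead
        return "waiting-room"

    def leave(self) -> None:
        self.active = max(0, self.active - 1)


room = WaitingRoom(servers=500, customers_per_server=10_000)
# room.capacity is 5,000,000; the next customer after that is sent to the waiting room
```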

Traditional hardware load balancers also offer throttling, but there are two disadvantages. The first is that load balancers themselves can be overwhelmed. Like any physical system, they have their limits. They’re also connected to networks and other physical infrastructure that can themselves be overwhelmed. The second disadvantage is that traditional load balancers must direct overflow traffic to a waiting room page. If that page is within the same data center that’s overwhelmed, there’s a good chance the waiting room itself won’t work. CDNs themselves can serve content directly from the edge, regardless of what’s happening in your data center(s).

There may be some software that cannot scale past a certain point or software that cannot be deployed in a cloud and therefore has a fixed capacity of hardware behind it. No platform scales infinitely. For these reasons, even with the elasticity that a cloud brings, it is advised that you throttle.

Summary

In this chapter, we discussed the multifaceted role CDNs play in today’s Internet and how they can be beneficial to eCommerce. In the next part of the book, we’ll focus on how to actually adopt the cloud, beginning with its architecture.


[49] Akamai Technologies, “Facts and Figures,” (2014) http://www.akamai.com/html/about/facts_figures.html

[50] AT&T, “Global Network Latency Averages,” (2014) http://www.akamai.com/html/technology/dataviz2.html.

[51] Leo Kelion, “Bots Now Account for 61% of Web Traffic,” BBC News (2013), http://bbc.in/1gAdpoV.

[52] Matthew Prince, “The DDoS That Knocked Spamhaus Offline (And How We Mitigated It),” CloudFlare (20 March 2013), http://bit.ly/1gAdrNp.

[53] US General Services Administration, “About FedRAMP,” http://www.gsa.gov/portal/category/102375.

[54] PCI Security Standards Council, https://www.pcisecuritystandards.org.

[55] US Department of Health & Human Services, “Health Information Privacy,” http://www.hhs.gov/ocr/privacy/.

[56] ISO/IEC 27001—Information Security Management, http://bit.ly/1gAdt7W.