Fetching URLs and Web Resources - Programming Google App Engine (2012)

Chapter 13. Fetching URLs and Web Resources

An App Engine application can connect to other sites on the Internet to retrieve data and communicate with web services. It does this not by opening a connection to the remote host from the application server, but through a scalable service called the URL Fetch service. This takes the burden of maintaining connections away from the app servers, and ensures that resource fetching performs well regardless of how many request handlers are fetching resources simultaneously. As with other parts of the App Engine infrastructure, the URL Fetch service is used by other Google applications to fetch web pages.

The URL Fetch service supports fetching URLs by using the HTTP protocol, and using HTTP with SSL (HTTPS). Other methods sometimes associated with URLs (such as FTP) are not supported.

Because the URL Fetch service is based on Google infrastructure, the service inherits a few restrictions that were put in place in the original design of the underlying HTTP proxy. The service supports the five most common HTTP actions (GET, POST, PUT, HEAD, and DELETE) but does not allow for others or for using a nonstandard action. Also, it can only connect to TCP ports in several allowed ranges: 80–90, 440–450, and 1024–65535. By default, it uses port 80 for HTTP, and port 443 for HTTPS. The proxy uses HTTP 1.1 to connect to the remote host.

The outgoing request can contain URL parameters, a request body, and HTTP headers. A few headers cannot be modified for security reasons, which mostly means that an app cannot issue a malformed request, such as a request whose Content-Length header does not accurately reflect the actual content length of the request body. In these cases, the service uses the correct values, or does not include the header.

Request and response sizes are limited, but generous. A request can be up to 5 megabytes in size (including headers), and a response can be up to 32 megabytes in size.

The service waits for a response up to a time limit, or “deadline.” The default fetch deadline is 5 seconds, but you can increase this on a per-request basis. The maximum deadline is 60 seconds during a user request, or 10 minutes during a task queue or scheduled task or from a backend. That is, the fetch deadline can be up to the request handler’s own deadline, except for backends (which have none).

Both the Python and Java runtime environments offer implementations of standard libraries used for fetching URLs that call the URL Fetch service behind the scenes. For Python, these are the urllib, httplib, and urllib2 modules. For Java, this is the java.net.URLConnection set of APIs, including java.net.URL. These implementations give you a reasonable degree of portability and interoperability with other libraries.

Naturally, the standard interfaces do not give you complete access to the service’s features. When using the standard libraries, the service uses the following default behaviors:

• If the remote host doesn’t respond within five seconds, the request is canceled and a service exception is raised.

• The service follows HTTP redirects up to five times before returning the response to the application.

• Responses from remote hosts that exceed 32 megabytes in size are truncated to 32 megabytes. The application is not told whether the response is truncated.

• HTTP over SSL (HTTPS) URLs will use SSL to make the connection, but the service will not validate the server’s security certificate. (The App Engine team has said certificate validation will become the default for the standard libraries in a future release, so check the App Engine website.)

All of these behaviors can be customized when calling the service APIs directly. You can increase the fetch response deadline, disable the automatic following of redirects, cause an exception to be thrown for responses that exceed the maximum size, and enable validation of certificates for HTTPS connections.

The development server simulates the URL Fetch service by making HTTP connections directly from your computer. If the remote host might behave differently when your app connects from your computer rather than from Google’s proxy servers, be sure to test your URL Fetch calls on App Engine.

In this chapter, we introduce the standard-library and direct interfaces to the URL Fetch service, in Python and in Java. We also examine several features of the service, and how to use them from the direct APIs.

TIP

Fetching resources from remote hosts can take quite a bit of time. Like several other services, the URL Fetch service offers a way to call the service asynchronously, so your application can issue fetch requests and do other things while remote servers take their time to respond. See Chapter 17 for more information.

Fetching URLs in Python

In Python, you can call the URL Fetch service by using the google.appengine.api.urlfetch module, or you can use Python standard libraries such as urllib2.

The Python runtime environment overrides portions of the urllib, urllib2, and httplib modules in the Python standard library so that HTTP and HTTPS connections made with these modules use the URL Fetch service. This allows existing software that depends on these libraries to function on App Engine, as long as the requests function within certain limitations. urllib2 has rich extensible support for features of remote web servers such as HTTP authentication and cookies. We won’t go into the details of this module here, but Example 13-1 shows a brief example using the module’s urlopen() convenience function.

Example 13-1. A simple example of using the urllib2 module to access the URL Fetch service

import urllib2
from google.appengine.api import urlfetch

# ...

try:
    newsfeed = urllib2.urlopen('http://ae-book.appspot.com/blog/atom.xml/')
    newsfeed_xml = newsfeed.read()
except urllib2.URLError, e:
    # Handle urllib2 error...
    pass
except urlfetch.Error, e:
    # Handle urlfetch error...
    pass

In this example, we catch both exceptions raised by urllib2 and exceptions raised from the URL Fetch Python API, google.appengine.api.urlfetch. The service may throw one of its own exceptions for conditions that urllib2 doesn’t catch, such as a request exceeding its deadline.

Because the service follows redirect responses by default (up to five times) when using urllib2, a urllib2 redirect handler will not see all redirects, only the final response.

If you use the service API directly, you can customize these behaviors. Example 13-2 shows a similar example using the urlfetch module, with several options changed.

Example 13-2. Customizing URL Fetch behaviors, using the urlfetch module

from google.appengine.api import urlfetch

# ...

try:
    newsfeed = urlfetch.fetch('http://ae-book.appspot.com/blog/atom.xml/',
                              allow_truncated=False,
                              follow_redirects=False,
                              deadline=10)
    newsfeed_xml = newsfeed.content
except urlfetch.Error, e:
    # Handle urlfetch error...
    pass

We’ll consider the direct URL Fetch API for the rest of this chapter.

Fetching URLs in Java

In Java, the direct URL Fetch service API is provided by the com.google.appengine.api.urlfetch package. You can also use standard java.net calls to fetch URLs. The Java runtime includes a custom implementation of the URLConnection class in the java.net package that calls the URL Fetch service instead of making a direct socket connection. As with the other standard interfaces, you can use this interface and rest assured that you can port your app to another platform easily.

Example 13-3 shows a simple example of using a convenience method in the URL class, which in turn uses the URLConnection class to fetch the contents of a web page. The openStream() method of the URL object returns an input stream of bytes. As shown, you can use an InputStreamReader (from java.io) to process the byte stream as a character stream. The BufferedReader class makes it easy to read lines of text from the InputStreamReader.

Example 13-3. Using java.net.URL to call the URL Fetch service

import java.net.URL;
import java.net.MalformedURLException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.BufferedReader;

// ...

try {
    URL url = new URL("http://ae-book.appspot.com/blog/atom.xml/");
    InputStream inStream = url.openStream();
    InputStreamReader inStreamReader = new InputStreamReader(inStream);
    BufferedReader reader = new BufferedReader(inStreamReader);
    // ... read characters or lines with reader ...
    reader.close();
} catch (MalformedURLException e) {
    // ...
} catch (IOException e) {
    // ...
}

Note that the URL Fetch service has already buffered the entire response into the application’s memory by the time the app begins to read. The app reads the response data from memory, not from a network stream from the socket or the service.

You can use other features of the URLConnection interface, as long as they operate within the functionality of the service API. Notably, the URL Fetch service does not maintain a persistent connection with the remote host, so features that require such a connection will not work.

By default, the URL Fetch service waits up to five seconds for a response from the remote server. If the server does not respond by the deadline, the service throws an IOException. You can adjust the amount of time to wait using the setConnectTimeout() method of the URLConnection. (The setReadTimeout() method has the same effect; the service uses the greater of the two values.) The deadline can be up to 60 seconds during user requests, or up to 10 minutes (600 seconds) for task queue and scheduled tasks and when running on a backend.

When using the URLConnection interface, the URL Fetch service follows HTTP redirects automatically, up to five consecutive redirects. The app does not see the intermediate redirect responses, only the last one. If there are more than five redirects, the service returns the fifth redirect response to the app.

The low-level API for the URL Fetch service lets you customize several behaviors of the service. Example 13-4 demonstrates how to fetch a URL with this API with options specified. As shown, the FetchOptions object tells the service not to follow any redirects, and to throw a ResponseTooLargeException if the response exceeds the maximum size of 32 megabytes instead of truncating the data.

Example 13-4. Using the low-level API to call the URL Fetch service, with options

import java.io.IOException;
import java.net.URL;
import java.net.MalformedURLException;
import com.google.appengine.api.urlfetch.FetchOptions;
import com.google.appengine.api.urlfetch.HTTPMethod;
import com.google.appengine.api.urlfetch.HTTPRequest;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.ResponseTooLargeException;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

// ...

try {
    URL url = new URL("http://ae-book.appspot.com/blog/atom.xml/");
    // Return redirects as-is, and fail rather than truncate large responses.
    FetchOptions options = FetchOptions.Builder
        .doNotFollowRedirects()
        .disallowTruncate();
    HTTPRequest request = new HTTPRequest(url, HTTPMethod.GET, options);
    URLFetchService urlfetch = URLFetchServiceFactory.getURLFetchService();
    HTTPResponse response = urlfetch.fetch(request);
    // ... process response.getContent() ...
} catch (ResponseTooLargeException e) {
    // ...
} catch (MalformedURLException e) {
    // ...
} catch (IOException e) {
    // ...
}

You use the FetchOptions to adjust many of the service’s features. You get an instance of this class by calling a static method of FetchOptions.Builder, and then set options by calling methods on the instance. For convenience, there is a static method for each option, and every method returns the instance, so your code can build the full set of options with a single statement of chained method calls.

We will use the direct urlfetch API for the remainder of this chapter.

Outgoing HTTP Requests

An HTTP request can consist of a URL, an HTTP method, request headers, and a payload. Only the URL and HTTP method are required, and the API assumes you mean the HTTP GET method if you only provide a URL.

In Python, you fetch a URL using HTTP GET by passing the URL to the fetch() function in the google.appengine.api.urlfetch module:

from google.appengine.api import urlfetch

# ...

response = urlfetch.fetch('http://www.example.com/feed.xml')

In Java, you prepare an instance of the HTTPRequest class from the com.google.appengine.api.urlfetch package with the URL as a java.net.URL instance, then you pass the request object to the service’s fetch() method. (Notice that this HTTPRequest class is different from the J2EE class you use with your request handler servlets.)

import java.net.URL;
import com.google.appengine.api.urlfetch.HTTPRequest;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

// ...

HTTPRequest outRequest =
    new HTTPRequest(new URL("http://www.example.com/feed.xml"));
URLFetchService urlfetch = URLFetchServiceFactory.getURLFetchService();
HTTPResponse response = urlfetch.fetch(outRequest);

The URL

The URL consists of a scheme, a domain, an optional port, and a path. For example:

https://www.example.com:8081/private/feed.xml

In this example, https is the scheme, www.example.com is the domain, 8081 is the port, and /private/feed.xml is the path.

The URL Fetch service supports the http and https schemes. Other schemes, such as ftp, are not supported.

If no port is specified, the service will use the default port for the scheme: port 80 for HTTP, and port 443 for HTTPS. If you specify a port, it must be within 80–90, 440–450, or 1024–65535.
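The scheme and port rules above can be checked before a fetch is attempted. The following is a minimal sketch, not part of the App Engine API: check_fetchable() is a hypothetical helper, and the service applies its own checks regardless of what the app does.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

# Port ranges and default ports as described in the text.
ALLOWED_PORT_RANGES = ((80, 90), (440, 450), (1024, 65535))
DEFAULT_PORTS = {'http': 80, 'https': 443}


def check_fetchable(url):
    """Return (ok, reason) for a URL, using the scheme and port rules above.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        return False, 'unsupported scheme: %s' % (scheme or '(none)')
    try:
        port = parts.port
    except ValueError:
        # Raised for a port that is not a valid number.
        return False, 'invalid port'
    if port is None:
        # No explicit port: fall back to the default for the scheme.
        port = DEFAULT_PORTS[scheme]
    for low, high in ALLOWED_PORT_RANGES:
        if low <= port <= high:
            return True, 'ok'
    return False, 'port %d is outside the allowed ranges' % port
```

With this sketch, the example URL https://www.example.com:8081/private/feed.xml passes, while an ftp URL or a URL on port 100 is rejected with a reason string.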

As a safety measure against accidental request loops in an application, the URL Fetch service will refuse to fetch the URL that maps to the request handler doing the fetching. An app can make connections to other URLs of its own, so request loops are still possible, but this restriction provides a simple sanity check.

As shown above, the Python API takes the URL as a string passed to the fetch() function as its first positional argument. The Java API accepts a java.net.URL object as an argument to the HTTPRequest constructor.

The HTTP Method and Payload

The HTTP method describes the general nature of the request, as codified by the HTTP standard. For example, the GET method asks for the data associated with the resource identified by the URL (such as a document or database record). The server is expected to verify that the request is allowed, then return the data in the response, without making changes to the resource. The POST method asks the server to modify records or perform an action, and the client usually includes a payload of data with the request.

The URL Fetch service can send requests using the GET, POST, PUT, HEAD, and DELETE methods. No other methods are supported.

In Python, you set the method by providing the method keyword argument to the fetch() function. The possible values are provided as constants by the urlfetch module. If the argument is omitted, it defaults to urlfetch.GET. To provide a payload, you set the payload keyword argument:

new_profile_data = profile.get_field_data()

response = urlfetch.fetch('http://www.example.com/profile/126542',
                          method=urlfetch.POST,
                          payload=new_profile_data)

In Java, the method is an optional second argument to the HTTPRequest constructor. Its value is from the enum HTTPMethod, whose values are named GET, POST, PUT, HEAD, and DELETE. To add a payload, you call the setPayload() method of the HTTPRequest, passing in a byte[]:

import com.google.appengine.api.urlfetch.HTTPMethod;

// ...

byte[] profileData = profile.getFieldData();
HTTPRequest request = new HTTPRequest(url, HTTPMethod.POST);
request.setPayload(profileData);
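When the payload is HTML form data, it has to be encoded before it is handed to fetch(). The following is a minimal sketch of one way to prepare such a payload in Python; build_form_post() and the field names are made up for illustration, and the remote server decides which Content-Type it expects.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2


def build_form_post(fields):
    """Return (payload, headers) for a form-encoded POST.

    `fields` is a list of (name, value) pairs, so the order is predictable.
    Hypothetical helper for illustration only.
    """
    payload = urlencode(fields)
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return payload, headers


payload, headers = build_form_post(
    [('name', 'Ada Lovelace'), ('city', 'London')])

# On App Engine, these values would be passed along as:
#   urlfetch.fetch(url, method=urlfetch.POST,
#                  payload=payload, headers=headers)
```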

Request Headers

Requests can include headers, a set of key-value pairs distinct from the payload that describe the client, the request, and the expected response. App Engine sets several headers automatically, such as Content-Length. Your app can provide additional headers that may be expected by the server.

In Python, the fetch() function accepts additional headers as the headers keyword argument. Its value is a mapping of header names to values:

response = urlfetch.fetch('http://www.example.com/article/roof_on_fire',
                          headers={'Accept-Charset': 'utf-8'})

In Java, you set a request header by calling the setHeader() method on the HTTPRequest. Its sole argument is an instance of the HTTPHeader class, whose constructor takes the header name and value as strings:

import com.google.appengine.api.urlfetch.HTTPHeader;

// ...

HTTPRequest request =
    new HTTPRequest(new URL("http://www.example.com/article/roof_on_fire"));
request.setHeader(new HTTPHeader("Accept-Charset", "utf-8"));

Some headers cannot be set directly by the application. This is primarily to discourage request forgery or invalid requests that could be used as an attack on some servers. Disallowed headers include Content-Length (which is set by App Engine automatically to the actual size of the request), Host, Vary, Via, X-Forwarded-For, and X-ProxyUser-IP.

The User-Agent header, which most servers use to identify the software of the client, can be set by the app. However, App Engine will append a string to this value identifying the request as coming from App Engine. This string includes your application ID. This is usually enough to allow an app to coax a server into serving content intended for a specific type of client (such as a specific brand or version of web browser), but it won’t be a complete impersonation of such a client.
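An app that builds request headers from configuration or user input may want to notice disallowed headers before calling fetch(). The following is a minimal sketch under that assumption; split_headers() is a hypothetical helper, and the set of names is simply the list given above.

```python
# Header names the text lists as not settable by the application.
DISALLOWED_HEADERS = frozenset(name.lower() for name in (
    'Content-Length', 'Host', 'Vary', 'Via',
    'X-Forwarded-For', 'X-ProxyUser-IP'))


def split_headers(headers):
    """Split a header mapping into (allowed, rejected) dicts.

    Names are compared case-insensitively, as HTTP header names are.
    Hypothetical helper for illustration only.
    """
    allowed, rejected = {}, {}
    for name, value in headers.items():
        target = rejected if name.lower() in DISALLOWED_HEADERS else allowed
        target[name] = value
    return allowed, rejected
```

The allowed dict can be passed as the headers argument, and the rejected dict can be logged so that a dropped header does not go unnoticed.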

HTTP Over SSL (HTTPS)

When the scheme of a URL is https, the URL Fetch service uses HTTP over SSL to connect to the remote server, encrypting both the request and the response.

The SSL protocol also allows the client to verify the identity of the remote host, to ensure it is talking directly to the host and traffic is not being intercepted by a malicious host (a “man in the middle” attack). This protocol involves security certificates and a process for clients to validate certificates.

By default, the URL Fetch service does not validate SSL certificates. With validation disabled, traffic is still encrypted, but the remote host’s certificates are not validated before sending the request data. You can tell the URL Fetch service to enable validation of security certificates.

To enable certificate validation in Python, you provide the validate_certificate=True argument to fetch():

response = urlfetch.fetch('https://secure.example.com/profile/126542',
                          validate_certificate=True)

In Java, you use a FetchOptions instance with the request, and call its validateCertificate() method. Its opposite is doNotValidateCertificate(), which is the default:

FetchOptions options = FetchOptions.Builder
    .validateCertificate();

HTTPRequest request = new HTTPRequest(
    new URL("https://secure.example.com/profile/126542"),
    HTTPMethod.GET, options);

The standard libraries use the default behavior and do not validate certificates. The App Engine team has said they will change this default for the standard libraries in a future release. See the official App Engine website for updates.

Request and Response Sizes

The request can be up to 5 megabytes in size, including the headers and payload. The response can be up to 32 megabytes in size.

The URL Fetch service can do one of two things if the remote host returns a response larger than 32 megabytes: it can truncate the response (delete everything after the first 32 megabytes), or it can raise an exception in your app. You control this behavior with an option.

In Python, the fetch() function accepts an allow_truncated=True keyword argument. The default is False, which tells the service to raise a urlfetch.ResponseTooLargeError if the response is too large:

response = urlfetch.fetch('http://www.example.com/firehose.dat',
                          allow_truncated=True)

In Java, the FetchOptions method allowTruncate() enables truncation, and disallowTruncate() tells the service to throw a ResponseTooLargeException if the response is too large:

FetchOptions options = FetchOptions.Builder
    .allowTruncate();

HTTPRequest request = new HTTPRequest(
    new URL("http://www.example.com/firehose.dat"),
    HTTPMethod.GET, options);

The standard libraries tell the URL Fetch service to allow truncation. This ensures that the standard libraries won’t raise an unfamiliar exception when third-party code fetches a URL, at the expense of returning unexpectedly truncated data when responses are too large.

Request Deadlines

The URL Fetch service issues a request, waits for the remote host to respond, and then makes the response available to the app. But the service won’t wait on the remote host forever. By default, the service will wait 5 seconds before terminating the connection and raising an exception in your app.

You can adjust the amount of time the service will wait (the “deadline”) as an option to the fetch call. You can set a deadline up to 60 seconds for fetches made during user requests, and up to 10 minutes (600 seconds) for requests made during tasks. That is, you can wait up to the maximum amount of time your request handler can run. Typically, you’ll want to set a fetch deadline shorter than your request handler’s deadline, so it can react to a failed fetch.

To set the fetch deadline in Python, provide the deadline keyword argument, whose value is a number of seconds. If a fetch exceeds its deadline, the service raises a urlfetch.DeadlineExceededError:

response = urlfetch.fetch('http://www.example.com/users/ackermann',
                          deadline=30)

In Java, the FetchOptions class provides a setDeadline() method, which takes a java.lang.Double. The Builder static method is slightly different, named withDeadline() and taking a double. The value is a number of seconds:

FetchOptions options = FetchOptions.Builder
    .withDeadline(30);

HTTPRequest request = new HTTPRequest(
    new URL("http://www.example.com/users/ackermann"),
    HTTPMethod.GET, options);
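The advice to keep the fetch deadline shorter than the request handler’s own deadline can be expressed as a small calculation. The following Python sketch assumes the app already knows roughly how many seconds its handler has left; choose_fetch_deadline() and the two-second margin are illustrative choices, not part of the service.

```python
USER_REQUEST_MAX = 60.0  # seconds, the limit during user requests
TASK_MAX = 600.0         # seconds (10 minutes), the limit during tasks


def choose_fetch_deadline(handler_seconds_left, in_task=False, margin=2.0):
    """Pick a fetch deadline that leaves `margin` seconds for error handling.

    `margin` is an arbitrary illustrative value, not a service setting.
    Returns None when too little time is left to attempt the fetch at all.
    """
    ceiling = TASK_MAX if in_task else USER_REQUEST_MAX
    deadline = min(ceiling, handler_seconds_left - margin)
    if deadline <= 0:
        return None
    return deadline
```

The result would be passed as the deadline argument in Python, or to withDeadline() in Java. A None result signals that the handler should skip the fetch and fall back to whatever failure handling it has.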

Handling Redirects

You can tell the service to follow redirects automatically, if HTTP redirect responses are returned by the remote server. The service will follow up to five redirects, then return the last response to the app (regardless of whether the last response is a redirect or not).

In Python, urlfetch.fetch() accepts a follow_redirects keyword argument. The default is True, which follows redirects; pass follow_redirects=False to get the first response even if it’s a redirect. When using urllib2, redirects are followed automatically, up to five times:

response = urlfetch.fetch('http://www.example.com/bounce',
                          follow_redirects=True)

In Java, the FetchOptions.Builder has a followRedirects() method, and its opposite doNotFollowRedirects(). The low-level API follows redirects by default, so call doNotFollowRedirects() to receive the first response even if it is a redirect. When using java.net.URLConnection, redirects are also followed automatically, up to five times:

FetchOptions options = FetchOptions.Builder
    .followRedirects();

HTTPRequest request = new HTTPRequest(
    new URL("http://www.example.com/bounce"),
    HTTPMethod.GET, options);

When following redirects, the service does not retain or use cookies set in the responses of the intermediate steps. If you need requests to honor cookies during a redirect chain, you must disable the automatic redirect feature, and process redirects manually in your application code.
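Processing redirects manually might look something like the following Python sketch. It is deliberately simplified: fetch_with_cookies() is a hypothetical helper, the fetch argument is a caller-supplied function (on App Engine, a thin wrapper around urlfetch.fetch() with follow_redirects=False), only one Set-Cookie value per response is handled, and cookie attributes such as Path and Domain are ignored.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

REDIRECT_CODES = (301, 302, 303, 307)


def fetch_with_cookies(fetch, url, max_redirects=5):
    """Follow redirects by hand, resending cookies set along the way.

    `fetch(url, headers)` must perform one request without following
    redirects and return an object with `status_code` and `headers`.
    Returns (last_response, url_of_last_response).
    """
    cookies = {}
    response = None
    for attempt in range(max_redirects + 1):
        headers = {}
        if cookies:
            headers['Cookie'] = '; '.join(
                '%s=%s' % pair for pair in sorted(cookies.items()))
        response = fetch(url, headers)
        set_cookie = response.headers.get('Set-Cookie')
        if set_cookie:
            # Simplification: keep the leading name=value pair only.
            name, _sep, value = set_cookie.split(';', 1)[0].partition('=')
            cookies[name.strip()] = value.strip()
        location = response.headers.get('Location')
        is_redirect = response.status_code in REDIRECT_CODES and location
        if not is_redirect or attempt == max_redirects:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return response, url
```

Like the service, this sketch gives up after five redirects and returns the last response it received, which may itself be a redirect.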

Response Objects

In Python, the fetch() function returns an object with response data available on several named properties. (The class name for response objects is _URLFetchResult, which implies that only the fetch() function should be constructing these objects—or relying on the class name.)

In Java, the fetch() service method returns an HTTPResponse instance, with getter methods for the response data.

The response fields are as follows:

content / getContent()

The response body. A Python str or Java byte[].

status_code / getResponseCode()

The HTTP status code. An int.

headers / getHeaders()

The response headers. In Python, this value is a mapping of names to values. In Java, this is a List<HTTPHeader>, where each header has getName() and getValue() methods (returning strings).

final_url / getFinalUrl()

The URL that corresponds to the response data. If automatic redirects were enabled and the server issued one or more redirects, this is the URL of the final destination, which may differ from the request URL. A Python str or a Java java.net.URL.

content_was_truncated (Python only)

True if truncation was enabled and the response data was larger than 32 megabytes. (There is no Java equivalent.)
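To show how these fields fit together, here is a minimal Python sketch that summarizes a response. It uses a namedtuple as a stand-in for the object returned by fetch(), so it runs outside App Engine; FakeResponse and summarize_response() are illustrative names, not part of the API.

```python
import collections

# Stand-in for the object returned by fetch(), using the Python field
# names listed above. On App Engine the real response would be used.
FakeResponse = collections.namedtuple(
    'FakeResponse',
    ['content', 'status_code', 'headers', 'final_url',
     'content_was_truncated'])


def summarize_response(response):
    """Return a short, human-readable summary of a fetch response."""
    parts = ['HTTP %d' % response.status_code,
             '%d bytes' % len(response.content)]
    if response.content_was_truncated:
        parts.append('truncated')
    if response.final_url:
        parts.append('final URL %s' % response.final_url)
    return ', '.join(parts)


example = FakeResponse(content='<feed/>', status_code=200,
                       headers={'Content-Type': 'application/atom+xml'},
                       final_url=None, content_was_truncated=False)
print(summarize_response(example))  # prints "HTTP 200, 7 bytes"
```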