Fetching URLs and Web Resources - Programming Google App Engine with Python (2015)

Chapter 13. Fetching URLs and Web Resources

An App Engine application can connect to other sites on the Internet to retrieve data and communicate with web services. It does this not by opening a connection to the remote host from the application server, but through a scalable service called the URL Fetch service. This takes the burden of maintaining connections away from the app servers, and ensures that resource fetching performs well regardless of how many request handlers are fetching resources simultaneously. As with other parts of the App Engine infrastructure, the URL Fetch service is used by other Google applications to fetch web pages.

The URL Fetch service supports fetching URLs using the HTTP protocol as well as using HTTP with SSL (HTTPS). Other schemes sometimes associated with URLs (such as FTP) are not supported.

Because the URL Fetch service is based on Google infrastructure, the service inherits a few restrictions that were put in place in the original design of the underlying HTTP proxy. The service supports the five most common HTTP actions (GET, POST, PUT, HEAD, and DELETE) but does not allow for others or for using a nonstandard action. Also, it can only connect to TCP ports in several allowed ranges: 80–90, 440–450, and 1024–65535. By default, it uses port 80 for HTTP, and port 443 for HTTPS. The proxy uses HTTP 1.1 to connect to the remote host.

The outgoing request can contain URL parameters, a request body, and HTTP headers. A few headers cannot be modified for security reasons, which mostly means that an app cannot issue a malformed request, such as a request whose Content-Length header does not accurately reflect the actual content length of the request body. In these cases, the service uses the correct values, or does not include the header.

Request and response sizes are limited, but generous. A request can be up to 5 megabytes in size (including headers), and a response can be up to 32 megabytes in size.

The service waits for a response up to a time limit, or “deadline.” The default fetch deadline is 5 seconds, but you can increase this on a per-request basis. The maximum deadline is 60 seconds during a user request, or 10 minutes during a task queue task, a scheduled task, or a backend request. That is, the fetch deadline can be as long as the request handler’s own deadline, except on backends (whose handlers have no deadline).

The Python runtime environment offers implementations of standard libraries used for fetching URLs that call the URL Fetch service behind the scenes. These are the urllib, httplib, and urllib2 modules. These implementations give you a reasonable degree of portability and interoperability with other libraries.

Naturally, the standard interfaces do not give you complete access to the service’s features. When using the standard libraries, the service uses the following default behaviors:

§ If the remote host doesn’t respond within 5 seconds, the request is canceled and a service exception is raised.

§ The service follows HTTP redirects up to five times before returning the response to the application.

§ Responses from remote hosts that exceed 32 megabytes in size are truncated to 32 megabytes. The application is not told whether the response is truncated.

§ HTTP over SSL (HTTPS) URLs will use SSL to make the connection, but the service will not validate the server’s security certificate. (The App Engine team has said certificate validation will become the default for the standard libraries in a future release, so check the App Engine website.)

All of these behaviors can be customized when calling the service APIs directly. You can increase the fetch response deadline, disable the automatic following of redirects, cause an exception to be thrown for responses that exceed the maximum size, and enable validation of certificates for HTTPS connections.

The development server simulates the URL Fetch service by making HTTP connections directly from your computer. If the remote host might behave differently when your app connects from your computer rather than from Google’s proxy servers, be sure to test your URL Fetch calls on App Engine.

In this chapter, we introduce the standard-library and direct interfaces to the URL Fetch service. We also examine several features of the service, and how to use them from the direct APIs.

TIP

Fetching resources from remote hosts can take quite a bit of time. Like several other services, the URL Fetch service offers a way to call the service asynchronously, so your application can issue fetch requests and do other things while remote servers take their time to respond. See Chapter 17 for more information.

Fetching URLs

You call the URL Fetch service by using the google.appengine.api.urlfetch module, or you can use Python standard libraries such as urllib2.

The Python runtime environment overrides portions of the urllib, urllib2, and httplib modules in the Python standard library so that HTTP and HTTPS connections made with these modules use the URL Fetch service. This allows existing software that depends on these libraries to function on App Engine, as long as the requests function within certain limitations. urllib2 has rich extensible support for features of remote web servers such as HTTP authentication and cookies. We won’t go into the details of this module here, but Example 13-1 shows a brief example using the module’s urlopen() convenience function.

Example 13-1. A simple example of using the urllib2 module to access the URL Fetch service

import urllib2

from google.appengine.api import urlfetch

# ...

try:
    newsfeed = urllib2.urlopen('http://ae-book.appspot.com/blog/atom.xml/')
    newsfeed_xml = newsfeed.read()
except urllib2.URLError as e:
    # Handle urllib2 error...
    pass
except urlfetch.Error as e:
    # Handle urlfetch error...
    pass

In this example, we catch both exceptions raised by urllib2 and exceptions raised from the URL Fetch Python API, google.appengine.api.urlfetch. The service may throw one of its own exceptions for conditions that urllib2 doesn’t catch, such as a request exceeding its deadline.

Because the service follows redirect responses by default (up to five times) when using urllib2, a urllib2 redirect handler will not see all redirects, only the final response.

If you use the service API directly, you can customize these behaviors. Example 13-2 shows a similar example using the urlfetch module, with several options changed.

Example 13-2. Customizing URL Fetch behaviors, using the urlfetch module

from google.appengine.api import urlfetch

# ...

try:
    newsfeed = urlfetch.fetch('http://ae-book.appspot.com/blog/atom.xml/',
                              allow_truncated=False,
                              follow_redirects=False,
                              deadline=10)
    newsfeed_xml = newsfeed.content
except urlfetch.Error as e:
    # Handle urlfetch error...
    pass

We’ll consider the direct URL Fetch API for the rest of this chapter.

Outgoing HTTP Requests

An HTTP request can consist of a URL, an HTTP method, request headers, and a payload. Only the URL and HTTP method are required, and the API assumes you mean the HTTP GET method if you only provide a URL.

You fetch a URL using HTTP GET by passing the URL to the fetch() function in the google.appengine.api.urlfetch module:

from google.appengine.api import urlfetch

# ...

response = urlfetch.fetch('http://www.example.com/feed.xml')

The URL

The URL consists of a scheme, a domain, an optional port, and a path. For example:

https://www.example.com:8081/private/feed.xml

In this example, https is the scheme, www.example.com is the domain, 8081 is the port, and /private/feed.xml is the path.

The URL Fetch service supports the http and https schemes. Other schemes, such as ftp, are not supported.

If no port is specified, the service will use the default port for the scheme: port 80 for HTTP, and port 443 for HTTPS. If you specify a port, it must be within 80–90, 440–450, or 1024–65535.
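The scheme and port rules can be checked in application code before a fetch is attempted. The following is a minimal sketch, not part of the urlfetch API: the names is_fetchable_url, ALLOWED_PORT_RANGES, and DEFAULT_PORTS are hypothetical, and the ranges are simply copied from the text above.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2.7

# Port ranges quoted in the text; hypothetical constants, not part of the API.
ALLOWED_PORT_RANGES = [(80, 90), (440, 450), (1024, 65535)]
DEFAULT_PORTS = {'http': 80, 'https': 443}


def is_fetchable_url(url):
    """Return True if the URL's scheme and port fit the documented limits."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS:
        return False
    try:
        port = parts.port
    except ValueError:
        # Some Python versions raise for a port outside 0-65535.
        return False
    if port is None:
        # No explicit port: the service uses the scheme's default.
        port = DEFAULT_PORTS[parts.scheme]
    return any(low <= port <= high for low, high in ALLOWED_PORT_RANGES)
```

A check like this only mirrors the documented limits. The service itself remains the authority on which URLs it will fetch.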

As a safety measure against accidental request loops in an application, the URL Fetch service will refuse to fetch the URL that maps to the request handler doing the fetching. An app can make connections to other URLs of its own, so request loops are still possible, but this restriction provides a simple sanity check.

As just shown, the API takes the URL as a string passed to the fetch() function as its first positional argument.

The HTTP Method and Payload

The HTTP method describes the general nature of the request, as codified by the HTTP standard. For example, the GET method asks for the data associated with the resource identified by the URL (such as a document or database record). The server is expected to verify that the request is allowed, then return the data in the response, without making changes to the resource. The POST method asks the server to modify records or perform an action, and the client usually includes a payload of data with the request.

The URL Fetch service can send requests using the GET, POST, PUT, HEAD, and DELETE methods. No other methods are supported.

You set the method by providing the method keyword argument to the fetch() function. The possible values are provided as constants by the urlfetch module. If the argument is omitted, it defaults to urlfetch.GET. To provide a payload, you set the payload keyword argument:

new_profile_data = profile.get_field_data()

response = urlfetch.fetch('http://www.example.com/profile/126542',
                          method=urlfetch.POST,
                          payload=new_profile_data)
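The payload is sent as given, so a server that expects HTML form data typically needs the payload URL-encoded and a matching Content-Type header. Here is a small sketch of preparing those two values; build_form_post is a hypothetical helper, and the field names are invented for illustration.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2.7


def build_form_post(fields):
    """Return (payload, headers) for a form-encoded POST.

    fields is a list of (name, value) pairs, so the encoded order is stable.
    """
    payload = urlencode(fields)
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return payload, headers


payload, headers = build_form_post([('name', 'Ada Lovelace'),
                                    ('city', 'London')])
# payload is now 'name=Ada+Lovelace&city=London'

# On App Engine, these values would be passed to the service like so:
# response = urlfetch.fetch('http://www.example.com/profile/126542',
#                           method=urlfetch.POST,
#                           payload=payload,
#                           headers=headers)
```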

Request Headers

Requests can include headers, a set of key-value pairs distinct from the payload that describe the client, the request, and the expected response. App Engine sets several headers automatically, such as Content-Length. Your app can provide additional headers that may be expected by the server.

The fetch() function accepts additional headers as the headers keyword argument. Its value is a mapping of header names to values:

response = urlfetch.fetch('http://www.example.com/article/roof_on_fire',
                          headers={'Accept-Charset': 'utf-8'})

Some headers cannot be set directly by the application. This is primarily to discourage request forgery or invalid requests that could be used as an attack on some servers. Disallowed headers include Content-Length (which is set by App Engine automatically to the actual size of the request), Host, Vary, Via, X-Forwarded-For, and X-ProxyUser-IP.
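If header values come from configuration or from another library, it can be useful to see which of them the service will not honor. This is one possible sketch, assuming only the list of names given above; split_headers and DISALLOWED_HEADERS are hypothetical names, and the comparison ignores case because HTTP header names are case-insensitive.

```python
# Header names the text lists as not settable by the application.
DISALLOWED_HEADERS = frozenset(name.lower() for name in (
    'Content-Length', 'Host', 'Vary', 'Via',
    'X-Forwarded-For', 'X-ProxyUser-IP'))


def split_headers(headers):
    """Split a header mapping into (allowed, rejected) dicts."""
    allowed, rejected = {}, {}
    for name, value in headers.items():
        target = rejected if name.lower() in DISALLOWED_HEADERS else allowed
        target[name] = value
    return allowed, rejected


allowed, rejected = split_headers({
    'Accept-Charset': 'utf-8',
    'host': 'other.example.com',
    'Content-Length': '0',
})
# allowed holds only Accept-Charset; the other two land in rejected.
```

The allowed mapping can then be passed as the headers argument, and the rejected one logged so the omission is not a surprise.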

The User-Agent header, which most servers use to identify the software of the client, can be set by the app. However, App Engine will append a string to this value identifying the request as coming from App Engine. This string includes your application ID. This is usually enough to allow an app to coax a server into serving content intended for a specific type of client (such as a specific brand or version of web browser), but it won’t be a complete impersonation of such a client.

HTTP over SSL (HTTPS)

When the scheme of a URL is https, the URL Fetch service uses HTTP over SSL to connect to the remote server, encrypting both the request and the response.

The SSL protocol also allows the client to verify the identity of the remote host, to ensure it is talking directly to the host and traffic is not being intercepted by a malicious host (a “man in the middle” attack). This protocol involves security certificates and a process for clients to validate certificates.

By default, the URL Fetch service does not validate SSL certificates. With validation disabled, traffic is still encrypted, but the remote host’s certificates are not validated before sending the request data. You can tell the URL Fetch service to enable validation of security certificates.

To enable certificate validation in Python, you provide the validate_certificate=True argument to fetch():

response = urlfetch.fetch('https://secure.example.com/profile/126542',
                          validate_certificate=True)

The standard libraries use the default behavior and do not validate certificates. If you need to validate certificates, you must use the urlfetch API.

Request and Response Sizes

The request can be up to 5 megabytes in size, including the headers and payload. The response can be up to 32 megabytes in size.

The URL Fetch service can do one of two things if the remote host returns a response larger than 32 megabytes: it can truncate the response (delete everything after the first 32 megabytes), or it can raise an exception in your app. You control this behavior with an option.

The fetch() function accepts an allow_truncated=True keyword argument. The default is False, which tells the service to raise a urlfetch.ResponseTooLargeError if the response is too large:

response = urlfetch.fetch('http://www.example.com/firehose.dat',
                          allow_truncated=True)

The standard libraries tell the URL Fetch service to allow truncation. This ensures that the standard libraries won’t raise an unfamiliar exception when third-party code fetches a URL, at the expense of returning unexpectedly truncated data when responses are too large.

Request Deadlines

The URL Fetch service issues a request, waits for the remote host to respond, and then makes the response available to the app. But the service won’t wait on the remote host forever. By default, the service will wait 5 seconds before terminating the connection and raising an exception in your app.

You can adjust the amount of time the service will wait (the “deadline”) as an option to the fetch call. You can set a deadline up to 60 seconds for fetches made during user requests, and up to 10 minutes (600 seconds) for requests made during tasks. That is, you can wait up to the maximum amount of time your request handler can run. Typically, you’ll want to set a fetch deadline shorter than your request handler’s deadline, so it can react to a failed fetch.

To set the fetch deadline, provide the deadline keyword argument, whose value is a number of seconds. If a fetch exceeds its deadline, the service raises a urlfetch.DeadlineExceededError:

response = urlfetch.fetch('http://www.example.com/users/ackermann',
                          deadline=30)
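Choosing a deadline that respects both the service limit and the time the handler has left is simple arithmetic. The sketch below shows one way to do it. The function choose_deadline, its parameters, and the two-second reserve are all assumptions made for illustration, and the limits are the ones quoted in the text.

```python
# Limits quoted in the text, in seconds.
USER_REQUEST_LIMIT = 60
TASK_LIMIT = 600


def choose_deadline(wanted, in_task=False, handler_time_left=None,
                    reserve=2.0):
    """Pick a fetch deadline that fits the service limit and the handler.

    wanted            -- the deadline the caller would like, in seconds
    in_task           -- True when running in a task queue or scheduled task
    handler_time_left -- seconds the request handler has left, if known
    reserve           -- seconds kept back so the handler can react to a
                         failed fetch
    """
    limit = TASK_LIMIT if in_task else USER_REQUEST_LIMIT
    deadline = min(wanted, limit)
    if handler_time_left is not None:
        deadline = min(deadline, handler_time_left - reserve)
    if deadline <= 0:
        raise ValueError('not enough time left to attempt a fetch')
    return deadline
```

The result would be passed as the deadline argument to fetch(). How the handler learns its remaining time is left open here.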

Handling Redirects

You can tell the service to follow redirects automatically, if HTTP redirect responses are returned by the remote server. The service will follow up to five redirects, then return the last response to the app (regardless of whether the last response is a redirect or not).

urlfetch.fetch() accepts a follow_redirects keyword argument. The default is True, which follows redirects as just described. Pass follow_redirects=False to get the first response even if it’s a redirect. When using urllib2, redirects are followed automatically, up to five times:

response = urlfetch.fetch('http://www.example.com/bounce',
                          follow_redirects=True)

When following redirects, the service does not retain or use cookies set in the responses of the intermediate steps. If you need requests to honor cookies during a redirect chain, you must disable the automatic redirect feature, and process redirects manually in your application code.
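Processing redirects manually might look something like the following. This is a deliberately simplified sketch, not a complete cookie implementation: it ignores cookie domain, path, and expiry, so it only suits a redirect chain that stays on one host, and it assumes any Set-Cookie value arrives as a single header string. The function fetch_with_cookies and the stand-in FakeResponse and fake_fetch are hypothetical. The fetch function is passed in as a parameter so the loop can be tried without the service.

```python
try:
    from urllib.parse import urljoin      # Python 3
    from http.cookies import SimpleCookie
except ImportError:
    from urlparse import urljoin          # Python 2.7
    from Cookie import SimpleCookie

REDIRECT_CODES = (301, 302, 303, 307)


def fetch_with_cookies(fetch, url, max_redirects=5):
    """Follow redirects by hand, replaying cookies set along the way.

    fetch is any callable shaped like urlfetch.fetch().
    """
    jar = SimpleCookie()
    response = None
    for _ in range(max_redirects + 1):
        request_headers = {}
        if jar:
            request_headers['Cookie'] = '; '.join(
                '%s=%s' % (name, morsel.value)
                for name, morsel in jar.items())
        response = fetch(url, headers=request_headers,
                         follow_redirects=False)
        # Normalize header names so the lookups below ignore case.
        received = dict((k.lower(), v)
                        for k, v in response.headers.items())
        if 'set-cookie' in received:
            jar.load(received['set-cookie'])
        location = received.get('location')
        if response.status_code not in REDIRECT_CODES or not location:
            break
        # Location may be relative; resolve it against the current URL.
        url = urljoin(url, location)
    return response


# A stand-in for urlfetch.fetch() so the loop can be tried off App Engine.
class FakeResponse(object):
    def __init__(self, status_code, headers, content=''):
        self.status_code = status_code
        self.headers = headers
        self.content = content


calls = []


def fake_fetch(url, headers=None, follow_redirects=True):
    calls.append((url, dict(headers or {})))
    if url.endswith('/bounce'):
        return FakeResponse(302, {'Location': '/landing',
                                  'Set-Cookie': 'session=abc123; Path=/'})
    return FakeResponse(200, {}, 'welcome')


result = fetch_with_cookies(fake_fetch, 'http://www.example.com/bounce')
```

On App Engine, urlfetch.fetch would be passed as the first argument in place of fake_fetch. Like the service, the loop gives up after five redirects and returns the last response it received.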

Response Objects

The fetch() function returns an object with response data available on several named properties. (The class name for response objects is _URLFetchResult. The leading underscore implies that application code should neither construct these objects itself nor rely on the class name.)

The response fields are as follows:

content

The response body. A Python str.

status_code

The HTTP status code. An int.

headers

The response headers, as a mapping of names to values.

final_url

The URL that corresponds to the response data. If automatic redirects were enabled and the server issued one or more redirects, this is the URL of the final destination, which may differ from the request URL. A Python str.

content_was_truncated

True if truncation was enabled and the response data was larger than 32 megabytes.
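These fields are usually checked together before the content is used. Below is a short sketch of such a check. FakeResult is a stand-in with the same field names so the function can be tried without the service, and usable_content and FetchProblem are hypothetical names. Treating anything other than status 200 as a failure is an assumption that will not suit every application.

```python
from collections import namedtuple

# Stand-in with the fields listed above; the real object comes from fetch().
FakeResult = namedtuple(
    'FakeResult',
    'content status_code headers final_url content_was_truncated')


class FetchProblem(Exception):
    """Hypothetical application-level error for an unusable response."""


def usable_content(response):
    """Return the response body, or raise if it should not be trusted."""
    if response.status_code != 200:
        raise FetchProblem('unexpected status %d' % response.status_code)
    if response.content_was_truncated:
        raise FetchProblem('response was truncated')
    return response.content


ok = FakeResult('<feed/>', 200, {}, 'http://www.example.com/feed.xml', False)
body = usable_content(ok)
```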