Java Network Programming, 4th Edition (2013)

Chapter 5. URLs and URIs

In the last chapter, you learned how to address hosts on the Internet via host names and IP addresses. In this chapter, we increase the granularity by addressing resources, any number of which may reside on any given host.

HTML is a hypertext markup language because it includes a way to specify links to other documents identified by URLs. A URL unambiguously identifies the location of a resource on the Internet. A URL is the most common type of URI, or Uniform Resource Identifier. A URI can identify a resource by its network location, as in a URL, or by its name, number, or other characteristics.

The URL class is the simplest way for a Java program to locate and retrieve data from the network. You do not need to worry about the details of the protocol being used, or how to communicate with the server; you simply tell Java the URL and it gets the data for you.

URIs

A Uniform Resource Identifier (URI) is a string of characters in a particular syntax that identifies a resource. The resource identified may be a file on a server; but it may also be an email address, a news message, a book, a person’s name, an Internet host, the current stock price of Oracle, or something else.

A resource is a thing that is identified by a URI. A URI is a string that identifies a resource. Yes, it is exactly that circular. Don’t spend too much time worrying about what a resource is or isn’t, because you’ll never see one anyway. All you ever receive from a server is a representation of a resource, which comes in the form of bytes. However, a single resource may have different representations. For instance, https://www.un.org/en/documents/udhr/ identifies the Universal Declaration of Human Rights; but there are representations of the declaration in plain text, XML, PDF, and other formats. There are also representations of this resource in English, French, Arabic, and many other languages. Some of these representations may themselves be resources. For instance, https://www.un.org/en/documents/udhr/ identifies specifically the English version of the Universal Declaration of Human Rights.

TIP

One of the key principles of good web architecture is to be profligate with URIs. If anyone might want to address something or refer to something, give it a URI (and in practice a URL). Just because a resource is a part of another resource, or a collection of other resources, or a state of another resource at a particular time, doesn’t mean it can’t have its own URI. For instance, in an email service, every user, every message received, every message sent, every filtered view of the inbox, every contact, every filter rule, and every single page a user might ever look at should have a unique URI.

Although architecturally URIs are opaque strings, in practice it’s useful to design them with human-readable substructure. For instance, http://mail.example.com/ might be a particular mail server, http://mail.example.com/johndoe might be John Doe’s mailbox on that server, and http://mail.example.com/johndoe?messageID=162977.1361.JavaMail.nobody%40meetup.com might be a particular message in that mailbox.

The syntax of a URI is composed of a scheme and a scheme-specific part, separated by a colon, like this:

scheme:scheme-specific-part

The syntax of the scheme-specific part depends on the scheme being used. Current schemes include:

data

Base64-encoded data included directly in a link; see RFC 2397

file

A file on a local disk

ftp

An FTP server

http

A World Wide Web server using the Hypertext Transfer Protocol

mailto

An email address

magnet

A resource available for download via peer-to-peer networks such as BitTorrent

telnet

A connection to a Telnet-based service

urn

A Uniform Resource Name

In addition, Java makes heavy use of nonstandard custom schemes such as rmi, jar, jndi, and doc for various purposes.

There is no specific syntax that applies to the scheme-specific parts of all URIs. However, many have a hierarchical form, like this:

//authority/path?query

The authority part of the URI names the authority responsible for resolving the rest of the URI. For instance, the URI http://www.ietf.org/rfc/rfc3986.txt has the scheme http, the authority www.ietf.org, and the path /rfc/rfc3986.txt (initial slash included). This means the server at www.ietf.org is responsible for mapping the path /rfc/rfc3986.txt to a resource. This URI does not have a query part. The URI http://www.powells.com/cgi-bin/biblio?inkey=62-1565928709-0 has the scheme http, the authority www.powells.com, the path /cgi-bin/biblio, and the query inkey=62-1565928709-0. The URI urn:isbn:156592870 has the scheme urn but doesn’t follow the hierarchical //authority/path?query form for scheme-specific parts.
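The decomposition described above can be checked with the java.net.URI class. This is only a minimal sketch: the class name UriParts and the helper describe() are made up for illustration, and the three URIs are the ones used in the preceding paragraph.

```java
import java.net.URI;

class UriParts {

  // Returns "scheme | authority | path | query" for a URI string.
  // Parts that are absent come back from java.net.URI as null.
  static String describe(String s) {
    URI u = URI.create(s);
    return u.getScheme() + " | " + u.getAuthority() + " | "
        + u.getPath() + " | " + u.getQuery();
  }

  public static void main(String[] args) {
    System.out.println(describe("http://www.ietf.org/rfc/rfc3986.txt"));
    System.out.println(describe(
        "http://www.powells.com/cgi-bin/biblio?inkey=62-1565928709-0"));

    // A urn is opaque: it has a scheme and a scheme-specific part,
    // but no authority, path, or query.
    URI urn = URI.create("urn:isbn:156592870");
    System.out.println(urn.getScheme() + " opaque=" + urn.isOpaque()
        + " ssp=" + urn.getSchemeSpecificPart());
  }
}
```

For the urn, getAuthority(), getPath(), and getQuery() all return null, which is how the class signals that the hierarchical form does not apply.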

Although most current examples of URIs use an Internet host as an authority, future schemes may not. However, if the authority is an Internet host, optional usernames and ports may also be provided to make the authority more specific. For example, the URI ftp://mp3:mp3@ci43198-a.ashvil1.nc.home.com:33/VanHalen-Jump.mp3 has the authority mp3:mp3@ci43198-a.ashvil1.nc.home.com:33. This authority has the username mp3, the password mp3, the host ci43198-a.ashvil1.nc.home.com, and the port 33. It has the scheme ftp and the path /VanHalen-Jump.mp3. (In most cases, including the password in the URI is a big security hole unless, as here, you really do want everyone in the universe to know the password.)
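To see how an authority breaks down into its pieces, the following sketch parses the FTP URI from the paragraph above. The class AuthorityParts and its parse() helper are hypothetical names. Note that java.net.URI exposes the username and password together as a single user-info string, so splitting them at the first colon is done by hand here.

```java
import java.net.URI;

class AuthorityParts {

  // Returns {username, password, host, port}.
  // username and password are null when the URI has no user info;
  // password is also null when the user info has no colon.
  static String[] parse(String s) {
    URI u = URI.create(s);
    String user = null;
    String password = null;
    String userInfo = u.getUserInfo();
    if (userInfo != null) {
      String[] pieces = userInfo.split(":", 2);
      user = pieces[0];
      if (pieces.length > 1) password = pieces[1];
    }
    return new String[] {user, password, u.getHost(), String.valueOf(u.getPort())};
  }

  public static void main(String[] args) {
    String[] parts = parse(
        "ftp://mp3:mp3@ci43198-a.ashvil1.nc.home.com:33/VanHalen-Jump.mp3");
    System.out.println("user:     " + parts[0]);
    System.out.println("password: " + parts[1]);
    System.out.println("host:     " + parts[2]);
    System.out.println("port:     " + parts[3]);
  }
}
```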

The path is a string that the authority can use to determine which resource is identified. Different authorities may interpret the same path to refer to different resources. For instance, the path /index.html means one thing when the authority is www.landoverbaptist.org and something very different when the authority is www.churchofsatan.com. The path may be hierarchical, in which case the individual parts are separated by forward slashes, and the . and .. operators are used to navigate the hierarchy. These are derived from the pathname syntax on the Unix operating systems where the Web and URLs were invented. They conveniently map to a filesystem stored on a Unix web server. However, there is no guarantee that the components of any particular path actually correspond to files or directories on any particular filesystem. For example, in the URI http://www.amazon.com/exec/obidos/ISBN%3D1565924851/cafeaulaitA/002-3777605-3043449, all the pieces of the hierarchy are just used to pull information out of a database that’s never stored in a filesystem. ISBN%3D1565924851 selects the particular book from the database by its ISBN number, cafeaulaitA specifies who gets the referral fee if a purchase is made from this link, and 002-3777605-3043449 is a session key used to track the visitor’s path through the site.

Some URIs aren’t at all hierarchical, at least in the filesystem sense. For example, snews://secnews.netscape.com/netscape.devs-java has a path of /netscape.devs-java. Although there’s some hierarchy to the newsgroup names indicated by the period between netscape and devs-java, it’s not encoded as part of the URI.

The scheme part is composed of lowercase letters, digits, and the plus sign, period, and hyphen. The other three parts of a typical URI (authority, path, and query) should each be composed of the ASCII alphanumeric characters (i.e., the letters A–Z, a–z, and the digits 0–9). In addition, the punctuation characters - _ . ! and ~ may also be used. Delimiters such as / ? & and = may be used for their predefined purposes. All other characters, including non-ASCII alphanumerics such as á and ζ, as well as delimiters not being used as delimiters, should be escaped: each byte of the character’s UTF-8 encoding is written as a percent sign (%) followed by two hexadecimal digits. For instance, in UTF-8, á is the two bytes 0xC3 0xA1, so it would be encoded as %C3%A1. The Chinese character 木 is Unicode code point 0x6728. In UTF-8, this is encoded as the three bytes E6, 9C, and A8. Thus, in a URI it would be encoded as %E6%9C%A8.
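The escaping rule can be written out in a few lines. What follows is a hand-rolled sketch, not a library API: the class PercentEncoder and its encode() method are made-up names, and it is meant for a single path segment or query value, because it escapes every delimiter, including the slash. It emits uppercase hex digits; the hex digits in a percent escape are case-insensitive.

```java
import java.nio.charset.StandardCharsets;

class PercentEncoder {

  // Punctuation the text lists as safe to leave unescaped
  private static final String SAFE_PUNCTUATION = "-_.!~";

  // Percent-encodes every byte of the UTF-8 form of s that is not an
  // ASCII letter, an ASCII digit, or one of the safe punctuation marks.
  static String encode(String s) {
    StringBuilder out = new StringBuilder();
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      int v = b & 0xFF;          // treat the byte as unsigned
      char c = (char) v;
      boolean asciiAlphanumeric = (c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9');
      if (asciiAlphanumeric || SAFE_PUNCTUATION.indexOf(c) >= 0) {
        out.append(c);
      } else {
        out.append('%').append(String.format("%02X", v));
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Unicode escapes keep this source file pure ASCII
    System.out.println(encode("\u00e1"));   // the letter a with acute accent
    System.out.println(encode("\u6728"));   // the Chinese character from the text
    System.out.println(encode("Java I/O")); // space and slash are both escaped
  }
}
```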

If you don’t hexadecimally encode non-ASCII characters like this, but just include them directly, then instead of a URI you have an IRI (an Internationalized Resource Identifier). IRIs are easier to type and much easier to read, but a lot of software and protocols expect and support only ASCII URIs.
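The java.net.URI class will accept a string containing unescaped non-ASCII characters and can convert it to the all-ASCII form on request with toASCIIString(). The sketch below assumes a placeholder host, www.example.com; the class name IriToUri and the helper toAscii() are invented for this illustration.

```java
import java.net.URI;

class IriToUri {

  // Parses a string that may contain unescaped non-ASCII characters and
  // returns the equivalent URI with those characters percent-escaped
  // as UTF-8 bytes.
  static String toAscii(String iri) {
    return URI.create(iri).toASCIIString();
  }

  public static void main(String[] args) {
    // \u6728 is the Chinese character used as an example in the text
    String iri = "http://www.example.com/\u6728";
    System.out.println(toAscii(iri));
  }
}
```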

Punctuation characters such as / and @ must also be encoded with percent escapes if they are used in any role other than what’s specified for them in the scheme-specific part of a particular URL. For example, the forward slashes in the URI http://www.cafeaulait.org/books/javaio2/ do not need to be encoded as %2F because they serve to delimit the hierarchy as specified for the http URI scheme. However, if a filename includes a / character—for instance, if the last directory were named Java I/O instead of javaio2 to more closely match the name of the book—the URI would have to be written as http://www.cafeaulait.org/books/Java%20I%2FO/. This is not as far-fetched as it might sound to Unix or Windows users. Mac filenames frequently include a forward slash. Filenames on many platforms often contain characters that need to be encoded, including @, $, +, =, and many more. And of course URLs are, more often than not, not derived from filenames at all.

URLs

A URL is a URI that, as well as identifying a resource, provides a specific network location for the resource that a client can use to retrieve a representation of that resource. By contrast, a generic URI may tell you what a resource is, but not actually tell you where or how to get that resource. In the physical world, it’s the difference between the title “Harry Potter and The Deathly Hallows” and the library location “Room 312, Row 28, Shelf 7”. In Java, it’s the difference between the java.net.URI class that only identifies resources and the java.net.URL class that can both identify and retrieve resources.

The network location in a URL usually includes the protocol used to access a server (e.g., FTP, HTTP), the hostname or IP address of the server, and the path to the resource on that server. A typical URL looks like http://www.ibiblio.org/javafaq/javatutorial.html. This specifies that there is a file called javatutorial.html in a directory called javafaq on the server www.ibiblio.org, and that this file can be accessed via the HTTP protocol.

The syntax of a URL is:

protocol://userInfo@host:port/path?query#fragment

Here the protocol is another word for what was called the scheme of the URI. (Scheme is the word used in the URI RFC. Protocol is the word used in the Java documentation.) In a URL, the protocol part can be file, ftp, http, https, magnet, telnet, or various other strings (though not urn).

The host part of a URL is the name of the server that provides the resource you want. It can be a hostname such as www.oreilly.com or utopia.poly.edu or an IP address, such as 204.148.40.9 or 128.238.3.21.

The userInfo is optional login information for the server. If present, it contains a username and, rarely, a password.

The port number is also optional. It’s not necessary if the service is running on its default port (port 80 for HTTP servers).

Together, the userInfo, host, and port constitute the authority.

The path points to a particular resource on the specified server. It often looks like a filesystem path such as /forum/index.php. However, it may or may not actually map to a filesystem on the server. If it does map to a filesystem, the path is relative to the document root of the server, not necessarily to the root of the filesystem on the server. As a rule, servers that are open to the public do not show their entire filesystem to clients. Rather, they show only the contents of a specified directory. This directory is called the document root, and all paths and filenames are relative to it. Thus, on a Unix server, all files that are available to the public might be in /var/public/html, but to somebody connecting from a remote machine, this directory looks like the root of the filesystem.

The query string provides additional arguments for the server. It’s commonly used only in http URLs, where it contains form data for input to programs running on the server.

Finally, the fragment references a particular part of the remote resource. If the remote resource is HTML, the fragment identifier names an anchor in the HTML document. If the remote resource is XML, the fragment identifier is an XPointer. Some sources refer to the fragment part of the URL as a “section”. Java rather unaccountably refers to the fragment identifier as a “Ref”. Fragment identifier targets are created in an HTML document with an id attribute, like this:

<h3 id="xtocid1902914">Comments</h3>

This tag identifies a particular point in a document. To refer to this point, a URL includes not only the document’s filename but the fragment identifier separated from the rest of the URL by a #:

http://www.cafeaulait.org/javafaq.html#xtocid1902914

TIP

Technically, a string that contains a fragment identifier is a URL reference, not a URL. Java, however, does not distinguish between URLs and URL references.
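Each piece of the URL syntax has a corresponding getter in java.net.URL. In this sketch the first URL is hypothetical, made up solely so that every part is present; the second is the fragment example shown earlier. The class name UrlParts and the helper parse() are illustrative only.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts {

  // Wraps the checked exception so callers can use this in one line
  static URL parse(String s) {
    try {
      return new URL(s);
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    // Hypothetical URL that exercises every part of the syntax
    URL u = parse("http://admin@www.example.com:8080/forum/index.php?topic=5#comments");
    System.out.println("protocol:  " + u.getProtocol());  // http
    System.out.println("userInfo:  " + u.getUserInfo());  // admin
    System.out.println("host:      " + u.getHost());      // www.example.com
    System.out.println("port:      " + u.getPort());      // 8080
    System.out.println("authority: " + u.getAuthority()); // admin@www.example.com:8080
    System.out.println("path:      " + u.getPath());      // /forum/index.php
    System.out.println("query:     " + u.getQuery());     // topic=5
    System.out.println("fragment:  " + u.getRef());       // comments

    // With no explicit port, getPort() returns -1;
    // getDefaultPort() reports the protocol's default.
    URL d = parse("http://www.cafeaulait.org/javafaq.html#xtocid1902914");
    System.out.println(d.getPort() + " " + d.getDefaultPort() + " " + d.getRef());
  }
}
```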

Relative URLs

A URL tells a web browser a lot about a document: the protocol used to retrieve the document, the host where the document lives, and the path to the document on that host. Most of this information is likely to be the same for other URLs that are referenced in the document. Therefore, rather than requiring each URL to be specified in its entirety, a URL may inherit the protocol, hostname, and path of its parent document (i.e., the document in which it appears). URLs that aren’t complete but inherit pieces from their parent are called relative URLs. In contrast, a completely specified URL is called an absolute URL. In a relative URL, any pieces that are missing are assumed to be the same as the corresponding pieces from the URL of the document in which the URL is found. For example, suppose that while browsing http://www.ibiblio.org/javafaq/javatutorial.html you click on a hyperlink whose href attribute is the relative URL javafaq.html.

The browser cuts javatutorial.html off the end of http://www.ibiblio.org/javafaq/javatutorial.html to get http://www.ibiblio.org/javafaq/. Then it attaches javafaq.html onto the end of http://www.ibiblio.org/javafaq/ to get http://www.ibiblio.org/javafaq/javafaq.html. Finally, it loads that document.

If the relative link begins with a /, then it is relative to the document root instead of relative to the current file. Thus, if you click on a link whose href attribute is /projects/ipv6/ while browsing http://www.ibiblio.org/javafaq/javatutorial.html, the browser would throw away /javafaq/javatutorial.html and attach /projects/ipv6/ to the end of http://www.ibiblio.org to get http://www.ibiblio.org/projects/ipv6/.
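The two resolution rules just described can be reproduced outside a browser. This sketch uses the resolve() method of java.net.URI rather than anything browser-specific; the class name RelativeResolver is invented, and the relative references javafaq.html and /projects/ipv6/ are the ones the surrounding text works through.

```java
import java.net.URI;

class RelativeResolver {

  // Resolves a relative reference against the URI of the document
  // in which it appears and returns the absolute result.
  static String resolve(String base, String relative) {
    return URI.create(base).resolve(relative).toString();
  }

  public static void main(String[] args) {
    String base = "http://www.ibiblio.org/javafaq/javatutorial.html";

    // Relative to the current document: the last path segment is replaced
    System.out.println(resolve(base, "javafaq.html"));

    // Leading slash: relative to the document root, so the whole path is replaced
    System.out.println(resolve(base, "/projects/ipv6/"));
  }
}
```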

Relative URLs have a number of advantages. First—and least important—they save a little typing. More importantly, relative URLs allow a single document tree to be served by multiple protocols: for instance, both HTTP and FTP. HTTP might be used for direct surfing, while FTP could be used for mirroring the site. Most importantly of all, relative URLs allow entire trees of documents to be moved or copied from one site to another without breaking all the internal links.

The URL Class

The java.net.URL class is an abstraction of a Uniform Resource Locator such as http://www.lolcats.com/ or ftp://ftp.redhat.com/pub/. It extends java.lang.Object, and it is a final class that cannot be subclassed. Rather than relying on inheritance to configure instances for different kinds of URLs, it uses the strategy design pattern. Protocol handlers are the strategies, and the URL class itself forms the context through which the different strategies are selected.

Although storing a URL as a string would be trivial, it is helpful to think of URLs as objects with fields that include the scheme (a.k.a. the protocol), hostname, port, path, query string, and fragment identifier (a.k.a. the ref), each of which may be set independently. Indeed, this is almost exactly how the java.net.URL class is organized, though the details vary a little between different versions of Java.

URLs are immutable. After a URL object has been constructed, its fields do not change. This has the side effect of making them thread safe.

Creating New URLs

Unlike the InetAddress objects in Chapter 4, you can construct instances of java.net.URL. The constructors differ in the information they require:

public URL(String url) throws MalformedURLException

public URL(String protocol, String hostname, String file)

throws MalformedURLException

public URL(String protocol, String host, int port, String file)

throws MalformedURLException

public URL(URL base, String relative) throws MalformedURLException

Which constructor you use depends on the information you have and the form it’s in. All these constructors throw a MalformedURLException if you try to create a URL for an unsupported protocol or if the URL is syntactically incorrect.

Exactly which protocols are supported is implementation dependent. The only protocols that have been available in all virtual machines are http and file, and the latter is notoriously flaky. Today, Java also supports the https, jar, and ftp protocols. Some virtual machines support mailto and gopher as well as some custom protocols like doc, netdoc, systemresource, and verbatim used internally by Java.

TIP

If the protocol you need isn’t supported by a particular VM, you may be able to install a protocol handler for that scheme to enable the URL class to speak that protocol. In practice, this is way more trouble than it’s worth. You’re better off using a library that exposes a custom API just for that protocol.

Other than verifying that it recognizes the URL scheme, Java does not check the correctness of the URLs it constructs. The programmer is responsible for making sure that URLs created are valid. For instance, Java does not check that the hostname in an HTTP URL does not contain spaces or that the query string is x-www-form-URL-encoded. It does not check that a mailto URL actually contains an email address. You can create URLs for hosts that don’t exist and for hosts that do exist but that you won’t be allowed to connect to.
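A quick way to see how little the URL constructor checks is to hand the same strings to both java.net.URL and the stricter java.net.URI. The sketch below is illustrative only: the class LenientUrl and its helpers are made-up names, www.example.com and nonexistent.invalid are placeholder hosts, and wibble is an invented scheme chosen because no virtual machine should have a handler for it.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class LenientUrl {

  // True if the java.net.URL constructor accepts the string
  static boolean urlAccepts(String s) {
    try {
      new URL(s);
      return true;
    } catch (MalformedURLException ex) {
      return false;
    }
  }

  // True if the java.net.URI constructor accepts the string
  static boolean uriAccepts(String s) {
    try {
      new URI(s);
      return true;
    } catch (URISyntaxException ex) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The space in the query string has not been encoded
    String unencoded = "http://www.example.com/search?q=two words";
    System.out.println("URL accepts unencoded query: " + urlAccepts(unencoded));
    System.out.println("URI accepts unencoded query: " + uriAccepts(unencoded));

    // No lookup happens at construction time, so a host that
    // does not exist is accepted too
    System.out.println("URL accepts unknown host: "
        + urlAccepts("http://nonexistent.invalid/"));

    // The scheme, however, is checked
    System.out.println("URL accepts unknown scheme: "
        + urlAccepts("wibble://www.example.com/"));
  }
}
```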

Constructing a URL from a string

The simplest URL constructor just takes an absolute URL in string form as its single argument:

public URL(String url) throws MalformedURLException

Like all constructors, this may only be called after the new operator, and like all URL constructors, it can throw a MalformedURLException. The following code constructs a URL object from a String, catching the exception that might be thrown:

try {

URL u = new URL("http://www.audubon.org/");

} catch (MalformedURLException ex) {

System.err.println(ex);

}

Example 5-1 is a simple program for determining which protocols a virtual machine supports. It attempts to construct a URL object for each of 16 protocols (9 standard protocols, 3 custom protocols for various Java APIs, and 4 undocumented protocols used internally by Java). If the constructor succeeds, you know the protocol is supported. Otherwise, a MalformedURLException is thrown and you know the protocol is not supported.

Example 5-1. Which protocols does a virtual machine support?

import java.net.*;

public class ProtocolTester {

public static void main(String[] args) {

// hypertext transfer protocol

testProtocol("http://www.adc.org");

// secure http

testProtocol("https://www.amazon.com/exec/obidos/order2/");

// file transfer protocol

testProtocol("ftp://ibiblio.org/pub/languages/java/javafaq/");

// Simple Mail Transfer Protocol

testProtocol("mailto:elharo@ibiblio.org");

// telnet

testProtocol("telnet://dibner.poly.edu/");

// local file access

testProtocol("file:///etc/passwd");

// gopher

testProtocol("gopher://gopher.anc.org.za/");

// Lightweight Directory Access Protocol

testProtocol(

"ldap://ldap.itd.umich.edu/o=University%20of%20Michigan,c=US?postalAddress");

// JAR

testProtocol(

"jar:http://cafeaulait.org/books/javaio/ioexamples/javaio.jar!"

+ "/com/macfaq/io/StreamCopier.class");

// NFS, Network File System

testProtocol("nfs://utopia.poly.edu/usr/tmp/");

// a custom protocol for JDBC

testProtocol("jdbc:mysql://luna.ibiblio.org:3306/NEWS");

// rmi, a custom protocol for remote method invocation

testProtocol("rmi://ibiblio.org/RenderEngine");

// custom protocols for HotJava

testProtocol("doc:/UsersGuide/release.html");

testProtocol("netdoc:/UsersGuide/release.html");

testProtocol("systemresource://www.adc.org/+/index.html");

testProtocol("verbatim:http://www.adc.org/");

}

private static void testProtocol(String url) {

try {

URL u = new URL(url);

System.out.println(u.getProtocol() + " is supported");

} catch (MalformedURLException ex) {

String protocol = url.substring(0, url.indexOf(':'));

System.out.println(protocol + " is not supported");

}

}

}

The results of this program depend on which virtual machine runs it. Here are the results from Java 7 on Mac OS X:

http is supported

https is supported

ftp is supported

mailto is supported

telnet is not supported

file is supported

gopher is not supported

ldap is not supported

jar is supported

nfs is not supported

jdbc is not supported

rmi is not supported

doc is not supported

netdoc is supported

systemresource is not supported

verbatim is not supported

The nonsupport of RMI and JDBC is actually a little deceptive; in fact, the JDK does support these protocols. However, that support is through various parts of the java.rmi and java.sql packages, respectively. These protocols are not accessible through the URL class like the other supported protocols (although I have no idea why Sun chose to wrap up RMI and JDBC parameters in URL clothing if it wasn’t intending to interface with these via Java’s quite sophisticated mechanism for handling URLs).

Other Java 7 virtual machines will show similar results. VMs that are not derived from the Oracle codebase may vary somewhat in which protocols they support. For example, Android’s Dalvik VM only supports the required http, https, file, ftp, and jar protocols.

Constructing a URL from its component parts

You can also build a URL by specifying the protocol, the hostname, and the file:

public URL(String protocol, String hostname, String file)

throws MalformedURLException

This constructor sets the port to -1 so the default port for the protocol will be used. The file argument should begin with a slash and include a path, a filename, and optionally a fragment identifier. Forgetting the initial slash is a common mistake, and one that is not easy to spot. Like all URL constructors, it can throw a MalformedURLException. For example:

try {

URL u = new URL("http", "www.eff.org", "/blueribbon.html#intro");

} catch (MalformedURLException ex) {

throw new RuntimeException("shouldn't happen; all VMs recognize http");

}

This creates a URL object that points to http://www.eff.org/blueribbon.html#intro, using the default port for the HTTP protocol (port 80). The file specification includes a reference to a named anchor. The code catches the exception that would be thrown if the virtual machine did not support the HTTP protocol. However, this shouldn’t happen in practice.

For the rare occasions when the default port isn’t correct, the next constructor lets you specify the port explicitly as an int. The other arguments are the same. For example, this code fragment creates a URL object that points to http://fourier.dur.ac.uk:8000/~dma3mjh/jsci/, specifying port 8000 explicitly:

try {

URL u = new URL("http", "fourier.dur.ac.uk", 8000, "/~dma3mjh/jsci/");

} catch (MalformedURLException ex) {

throw new RuntimeException("shouldn't happen; all VMs recognize http");

}
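The effect of these constructors on the port, and the consequence of forgetting the initial slash, can both be observed through the getter methods. This is a small sketch; the class ComponentUrls and the helper build() are hypothetical names, while the host and file come from the example above.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ComponentUrls {

  // Wraps the three-argument constructor so the checked exception
  // does not clutter the demonstration
  static URL build(String protocol, String host, String file) {
    try {
      return new URL(protocol, host, file);
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
  }

  public static void main(String[] args) {
    URL u = build("http", "www.eff.org", "/blueribbon.html#intro");
    System.out.println(u.getPort());        // -1: no port was specified
    System.out.println(u.getDefaultPort()); // 80: the default for http
    System.out.println(u.getPath());        // /blueribbon.html
    System.out.println(u.getRef());         // intro

    // The constructor does not add the missing slash for you:
    // the path is stored exactly as given.
    URL bad = build("http", "www.eff.org", "blueribbon.html");
    System.out.println(bad.getPath());      // blueribbon.html
  }
}
```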

Constructing relative URLs

This constructor builds an absolute URL from a relative URL and a base URL:

public URL(URL base, String relative) throws MalformedURLException

For instance, you may be parsing an HTML document at http://www.ibiblio.org/javafaq/index.html and encounter a link to a file called mailinglists.html with no further qualifying information. In this case, you use the URL to the document that contains the link to provide the missing information. The constructor computes the new URL as http://www.ibiblio.org/javafaq/mailinglists.html. For example:

try {

URL u1 = new URL("http://www.ibiblio.org/javafaq/index.html");

URL u2 = new URL(u1, "mailinglists.html");

} catch (MalformedURLException ex) {

System.err.println(ex);

}

The filename is removed from the path of u1 and the new filename mailinglists.html is appended to make u2. This constructor is particularly useful when you want to loop through a list of files that are all in the same directory. You can create a URL for the first file and then use this initial URL to create URL objects for the other files by substituting their filenames.
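Looping over sibling files might look something like the following sketch. Only index.html and mailinglists.html come from the text; books.html and links.html are invented filenames, and the class SiblingUrls and its siblings() method are hypothetical helpers.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class SiblingUrls {

  // Builds absolute URLs for files in the same directory as the first URL
  static List<String> siblings(String first, String... names) {
    List<String> result = new ArrayList<>();
    try {
      URL base = new URL(first);
      for (String name : names) {
        // Each name replaces the filename at the end of the base URL's path
        result.add(new URL(base, name).toString());
      }
    } catch (MalformedURLException ex) {
      throw new IllegalArgumentException(ex);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> urls = siblings("http://www.ibiblio.org/javafaq/index.html",
        "mailinglists.html", "books.html", "links.html");
    for (String url : urls) {
      System.out.println(url);
    }
  }
}
```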

Other sources of URL objects

Besides the constructors discussed here, a number of other methods in the Java class library return URL objects. In applets, getDocumentBase() returns the URL of the page that contains the applet and getCodeBase() returns the URL of the applet .class file.

The java.io.File class has a toURL() method that returns a file URL matching the given file. The exact format of the URL returned by this method is platform dependent. For example, on Windows it may return something like file:/D:/JAVA/JNP4/05/ToURLTest.java. On Linux and other Unixes, you’re likely to see file:/home/elharo/books/JNP4/05/ToURLTest.java. In practice, file URLs are heavily platform and program dependent. Java file URLs often cannot be interchanged with the URLs used by web browsers and other programs, or even with Java programs running on different platforms.
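One caveat worth knowing: File.toURL() has been deprecated since Java 6 because it does not escape characters that are illegal in URLs, and the API documentation recommends going through toURI() first. A minimal sketch of that route follows; the filename My Notes.txt is made up, the file does not need to exist, and FileToUrl is a hypothetical class name.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

class FileToUrl {

  // Converts a File to a file URL by way of a URI so that characters
  // such as spaces are percent-escaped
  static URL toUrl(File f) {
    try {
      return f.toURI().toURL();
    } catch (MalformedURLException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    // Hypothetical file in the current working directory
    File f = new File("My Notes.txt");
    // The directory portion of the output depends on the platform
    // and on where the program is run
    System.out.println(toUrl(f));
  }
}
```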

Class loaders are used not only to load classes but also to load resources such as images and audio files. The static ClassLoader.getSystemResource(String name) method returns a URL from which a single resource can be read. The ClassLoader.getSystemResources(String name) method returns an Enumeration containing a list of URLs from which the named resource can be read. And finally, the instance method getResource(String name) searches the path used by the referenced class loader for a URL to the named resource. The URLs returned by these methods may be file URLs, HTTP URLs, or some other scheme. The full path of the resource is a package qualified Java name with slashes instead of periods such as /com/macfaq/sounds/swale.au or com/macfaq/images/headshot.jpg. The Java virtual machine will attempt to find the requested resource in the classpath, potentially inside a JAR archive.

There are a few other methods that return URL objects here and there throughout the class library, but most are simple getter methods that return a URL you probably already know because you used it to construct the object in the first place; for instance, the getPage() method of javax.swing.JEditorPane and the getURL() method of java.net.URLConnection.

Retrieving Data from a URL

Naked URLs aren’t very exciting. What’s interesting is the data contained in the documents they point to. The URL class has several methods that retrieve data from a URL:

public InputStream openStream() throws IOException

public URLConnection openConnection() throws IOException

public URLConnection openConnection(Proxy proxy) throws IOException

public Object getContent() throws IOException

public Object getContent(Class[] classes) throws IOException

The most basic and most commonly used of these methods is openStream(), which returns an InputStream from which you can read the data. If you need more control over the download process, call openConnection() instead, which gives you a URLConnection which you can configure, and then get an InputStream from it. We’ll take this up in Chapter 7. Finally, you can ask the URL for its content with getContent() which may give you a more complete object such as String or an Image. Then again, it may just give you an InputStream anyway.

public final InputStream openStream() throws IOException

The openStream() method connects to the resource referenced by the URL, performs any necessary handshaking between the client and the server, and returns an InputStream from which data can be read. The data you get from this InputStream is the raw (i.e., uninterpreted) content the URL references: ASCII if you’re reading an ASCII text file, raw HTML if you’re reading an HTML file, binary image data if you’re reading an image file, and so forth. It does not include any of the HTTP headers or any other protocol-related information. You can read from this InputStream as you would read from any other InputStream. For example:

try {

URL u = new URL("http://www.lolcats.com");

InputStream in = u.openStream();

int c;

while ((c = in.read()) != -1) System.out.write(c);

in.close();

} catch (IOException ex) {

System.err.println(ex);

}

The preceding code fragment catches an IOException, which also catches the MalformedURLException that the URL constructor can throw, since MalformedURLException subclasses IOException.

As with most network streams, reliably closing the stream takes a bit of effort. In Java 6 and earlier, we use the dispose pattern: declare the stream variable outside the try block, set it to null, and then close it in the finally block if it’s not null. For example:

InputStream in = null;

try {

URL u = new URL("http://www.lolcats.com");

in = u.openStream();

int c;

while ((c = in.read()) != -1) System.out.write(c);

} catch (IOException ex) {

System.err.println(ex);

} finally {

try {

if (in != null) {

in.close();

}

} catch (IOException ex) {

// ignore

}

}

Java 7 makes this somewhat cleaner by using a nested try-with-resources statement:

try {

URL u = new URL("http://www.lolcats.com");

try (InputStream in = u.openStream()) {

int c;

while ((c = in.read()) != -1) System.out.write(c);

}

} catch (IOException ex) {

System.err.println(ex);

}

Example 5-2 reads a URL from the command line, opens an InputStream from that URL, chains the resulting InputStream to an InputStreamReader using the default encoding, and then uses InputStreamReader’s read() method to read successive characters from the file, each of which is printed on System.out. That is, it prints the raw data located at the URL: if the URL references an HTML file, the program’s output is raw HTML.

Example 5-2. Download a web page

import java.io.*;

import java.net.*;

public class SourceViewer {

public static void main (String[] args) {

if (args.length > 0) {

InputStream in = null;

try {

// Open the URL for reading

URL u = new URL(args[0]);

in = u.openStream();

// buffer the input to increase performance

in = new BufferedInputStream(in);

// chain the InputStream to a Reader

Reader r = new InputStreamReader(in);

int c;

while ((c = r.read()) != -1) {

System.out.print((char) c);

}

} catch (MalformedURLException ex) {

System.err.println(args[0] + " is not a parseable URL");

} catch (IOException ex) {

System.err.println(ex);

} finally {

if (in != null) {

try {

in.close();

} catch (IOException e) {

// ignore

}

}

}

}

}

}

And here are the first few lines of output when SourceViewer downloads http://www.oreilly.com:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<head>

<title>oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books,

software conferences, online publishing</title>

books, UNIX, unix, Perl, Java, Linux, Internet, Web, C, C++, Windows, Windows

NT, Security, Sys Admin, System Administration, Oracle, PL/SQL, online books,

books online, computer book online, e-books, ebooks, Perl Conference, Open Source

Conference, Java Conference, open source, free software, XML, Mac OS X, .Net, dot

net, C#, PHP, CGI, VB, VB Script, Java Script, javascript, Windows 2000, XP,

There are quite a few more lines in that web page; if you want to see them, you can fire up your web browser.

The shakiest part of this program is that it blithely assumes that the URL points to text, which is not necessarily true. It could well be pointing to a GIF or JPEG image, an MP3 sound file, or something else entirely. Even if it does resolve to text, the document encoding may not be the same as the default encoding of the client system. The remote host and local client may not have the same default character set. As a general rule, for pages that use a character set radically different from ASCII, the HTML will include a META tag in the header specifying the character set in use. For instance, this META tag specifies the Big-5 encoding for Chinese:

<meta http-equiv="Content-Type" content="text/html; charset=big5">

An XML document will likely have an XML declaration instead:

<?xml version="1.0" encoding="Big5"?>

In practice, there’s no easy way to get at this information other than by parsing the file and looking for a header like this one, and even that approach is limited. Many HTML files handcoded in Latin alphabets don’t have such a META tag. Since Windows, Mac, and most Unixes have somewhat different interpretations of the characters from 128 to 255, the extended characters in these documents do not translate correctly on platforms other than the one on which they were created.

And as if this isn’t confusing enough, the HTTP header that precedes the actual document is likely to have its own encoding information, which may completely contradict what the document itself says. You can’t read this header using the URL class, but you can with the URLConnection object returned by the openConnection() method. Encoding detection and declaration is one of the thornier parts of the architecture of the Web.
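As a rough sketch of what honoring the HTTP header might look like, the following program asks the URLConnection for its Content-type and picks a character set from it before building the Reader. The class name CharsetSniffer, the helper charsetOf(), and the ISO-8859-1 fallback are assumptions made for illustration; they are not part of the java.net API, and real servers may send headers this simple parser does not handle.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSniffer {

  // Hypothetical helper: extracts the charset parameter from a Content-type
  // value such as "text/html; charset=Big5". Returns the fallback when the
  // header is absent, has no charset parameter, or names an unknown charset.
  static Charset charsetOf(String contentType, Charset fallback) {
    if (contentType == null) return fallback;
    for (String param : contentType.split(";")) {
      String p = param.trim();
      if (p.regionMatches(true, 0, "charset=", 0, 8)) {
        String name = p.substring(8).trim().replace("\"", "");
        try {
          return Charset.forName(name);
        } catch (IllegalArgumentException ex) {
          // illegal or unsupported charset name
          return fallback;
        }
      }
    }
    return fallback;
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.err.println("Usage: java CharsetSniffer url");
      return;
    }
    URLConnection uc = new URL(args[0]).openConnection();
    // The fallback encoding is an assumption, not something the server promised
    Charset cs = charsetOf(uc.getContentType(), StandardCharsets.ISO_8859_1);
    try (Reader r = new InputStreamReader(
        new BufferedInputStream(uc.getInputStream()), cs)) {
      int c;
      while ((c = r.read()) != -1) {
        System.out.print((char) c);
      }
    }
  }
}
```

This only consults the HTTP header. It does not look inside the document for a META tag or XML declaration, so the two sources of encoding information can still disagree.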

public URLConnection openConnection() throws IOException

The openConnection() method opens a socket to the specified URL and returns a URLConnection object. A URLConnection represents an open connection to a network resource. If the call fails, openConnection() throws an IOException. For example:

try {

URL u = new URL("https://news.ycombinator.com/");

try {

URLConnection uc = u.openConnection();

InputStream in = uc.getInputStream();

// read from the connection...

} catch (IOException ex) {

System.err.println(ex);

}

} catch (MalformedURLException ex) {

System.err.println(ex);

}

You should use this method when you want to communicate directly with the server. The URLConnection gives you access to everything sent by the server: in addition to the document itself in its raw form (e.g., HTML, plain text, binary image data), you can access all the metadata specified by the protocol. For example, if the scheme is HTTP or HTTPS, the URLConnection lets you access the HTTP headers as well as the raw HTML. The URLConnection class also lets you write data to as well as read from a URL—for instance, in order to send email to a mailto URL or post form data. The URLConnection class will be the primary subject of Chapter 7.

An overloaded variant of this method specifies the proxy server to pass the connection through:

public URLConnection openConnection(Proxy proxy) throws IOException

This overrides any proxy server set with the usual socksProxyHost, socksProxyPort, http.proxyHost, http.proxyPort, http.nonProxyHosts, and similar system properties. If the protocol handler does not support proxies, the argument is ignored and the connection is made directly if possible.
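For instance, a program might route one particular connection through an HTTP proxy roughly as follows. This is a minimal sketch: the proxy host proxy.example.com and port 8080 are placeholders, and the class and method names are invented for the example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.SocketAddress;
import java.net.URL;
import java.net.URLConnection;

class ProxyConnection {

  // Creates (but does not yet use) a connection that will go through the
  // given HTTP proxy. The address is left unresolved, so no DNS lookup
  // happens here.
  static URLConnection viaProxy(URL u, String proxyHost, int proxyPort)
      throws IOException {
    SocketAddress address = InetSocketAddress.createUnresolved(proxyHost, proxyPort);
    Proxy proxy = new Proxy(Proxy.Type.HTTP, address);
    return u.openConnection(proxy);
  }

  public static void main(String[] args) throws IOException {
    URL u = new URL("http://www.oreilly.com/");
    // proxy.example.com:8080 is a placeholder, not a real proxy server
    URLConnection proxied = viaProxy(u, "proxy.example.com", 8080);
    System.out.println(proxied.getClass().getName());

    // Proxy.NO_PROXY requests a direct connection for this URL only
    URLConnection direct = u.openConnection(Proxy.NO_PROXY);
    System.out.println(direct.getURL());
  }
}
```

Nothing is sent over the network until the program reads from or writes to the returned URLConnection.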

public final Object getContent() throws IOException

The getContent() method is the third way to download data referenced by a URL. The getContent() method retrieves the data referenced by the URL and tries to make it into some type of object. If the URL refers to some kind of text such as an ASCII or HTML file, the object returned is usually some sort of InputStream. If the URL refers to an image such as a GIF or a JPEG file, getContent() usually returns a java.awt.ImageProducer. What unifies these two disparate classes is that they are not the thing itself but a means by which a program can construct the thing:

URL u = new URL("http://mesola.obspm.fr/");

Object o = u.getContent();

// cast the Object to the appropriate type

// work with the Object...

getContent() operates by looking at the Content-type field in the header of the data it gets from the server. If the server does not use MIME headers or sends an unfamiliar Content-type, getContent() returns some sort of InputStream with which the data can be read. An IOException is thrown if the object can’t be retrieved. Example 5-3 demonstrates this.

Example 5-3. Download an object

import java.io.*;

import java.net.*;

public class ContentGetter {

public static void main (String[] args) {

if (args.length > 0) {

// Open the URL for reading

try {

URL u = new URL(args[0]);

Object o = u.getContent();

System.out.println("I got a " + o.getClass().getName());

} catch (MalformedURLException ex) {

System.err.println(args[0] + " is not a parseable URL");

} catch (IOException ex) {

System.err.println(ex);

}

}

}

}

Here’s the result of trying to get the content of http://www.oreilly.com:

% java ContentGetter http://www.oreilly.com/

I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

The exact class may vary from one version of Java to the next (in earlier versions, it’s been java.io.PushbackInputStream or sun.net.www.http.KeepAliveStream) but it should be some form of InputStream.

Here’s what you get when you try to load a header image from that page:

% java ContentGetter http://www.oreilly.com/graphics_new/animation.gif

I got a sun.awt.image.URLImageSource

Here’s what happens when you try to load a Java applet using getContent():

% java ContentGetter http://www.cafeaulait.org/RelativeURLTest.class

I got a sun.net.www.protocol.http.HttpURLConnection$HttpInputStream

Here’s what happens when you try to load an audio file using getContent():

% java ContentGetter http://www.cafeaulait.org/course/week9/spacemusic.au

I got a sun.applet.AppletAudioClip

The last result is the most unusual because it is as close as the Java core API gets to a class that represents a sound file. It’s not just an interface through which you can load the sound data.

This example demonstrates the biggest problem with using getContent(): it’s hard to predict what kind of object you’ll get. You could get some kind of InputStream or an ImageProducer or perhaps an AudioClip. It’s easy to check which one you have using the instanceof operator, and this information should be enough to let you read a text file or display an image.

public final Object getContent(Class[] classes) throws IOException

A URL’s content handler may provide different views of a resource. This overloaded variant of the getContent() method lets you choose which class you’d like the content to be returned as. The method attempts to return the URL’s content in the first available format. For instance, if you prefer an HTML file to be returned as a String, but your second choice is a Reader and your third choice is an InputStream, write:

URL u = new URL("http://www.nwu.org");

Class<?>[] types = new Class[3];

types[0] = String.class;

types[1] = Reader.class;

types[2] = InputStream.class;

Object o = u.getContent(types);

If the content handler knows how to return a string representation of the resource, then it returns a String. If it doesn’t know how to return a string representation of the resource, then it returns a Reader. And if it doesn’t know how to present the resource as a reader, then it returns an InputStream. You have to test for the type of the returned object using instanceof. For example:

if (o instanceof String) {

System.out.println(o);

} else if (o instanceof Reader) {

int c;

Reader r = (Reader) o;

while ((c = r.read()) != -1) System.out.print((char) c);

r.close();

} else if (o instanceof InputStream) {

int c;

InputStream in = (InputStream) o;

while ((c = in.read()) != -1) System.out.write(c);

in.close();

} else {

System.out.println("Error: unexpected type " + o.getClass());

}

Splitting a URL into Pieces

URLs are composed of five pieces:

§ The scheme, also known as the protocol

§ The authority

§ The path

§ The fragment identifier, also known as the section or ref

§ The query string

For example, in the URL http://www.ibiblio.org/javafaq/books/jnp/index.html?isbn=1565922069#toc, the scheme is http, the authority is www.ibiblio.org, the path is /javafaq/books/jnp/index.html, the fragment identifier is toc, and the query string is isbn=1565922069. However, not all URLs have all these pieces. For instance, the URL http://www.faqs.org/rfcs/rfc3986.html has a scheme, an authority, and a path, but no fragment identifier or query string.

The authority may further be divided into the user info, the host, and the port. For example, in the URL http://admin@www.blackstar.com:8080/, the authority is admin@www.blackstar.com:8080. This has the user info admin, the host www.blackstar.com, and the port 8080.

Read-only access to these parts of a URL is provided by nine public methods: getFile(), getHost(), getPort(), getProtocol(), getRef(), getQuery(), getPath(), getUserInfo(), and getAuthority().

public String getProtocol()

The getProtocol() method returns a String containing the scheme of the URL (e.g., “http”, “https”, or “file”). For example, this code fragment prints https:

URL u = new URL("https://xkcd.com/727/");

System.out.println(u.getProtocol());

public String getHost()

The getHost() method returns a String containing the hostname of the URL. For example, this code fragment prints xkcd.com:

URL u = new URL("https://xkcd.com/727/");

System.out.println(u.getHost());

public int getPort()

The getPort() method returns the port number specified in the URL as an int. If no port was specified in the URL, getPort() returns -1 to signify that the URL does not specify the port explicitly, and will use the default port for the protocol. For example, if the URL is http://www.userfriendly.org/, getPort() returns -1; if the URL is http://www.userfriendly.org:80/, getPort() returns 80. The following code prints -1 for the port number because it isn’t specified in the URL:

URL u = new URL("http://www.ncsa.illinois.edu/AboutUs/");

System.out.println("The port part of " + u + " is " + u.getPort());

public int getDefaultPort()

The getDefaultPort() method returns the default port used for this URL’s protocol when none is specified in the URL. If no default port is defined for the protocol, then getDefaultPort() returns -1. For example, if the URL is http://www.userfriendly.org/, getDefaultPort() returns 80; if the URL is ftp://ftp.userfriendly.org:8000/, getDefaultPort() returns 21.
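Taken together, getPort() and getDefaultPort() tell you which port a client would actually connect to. A minimal sketch, in which the class name and the effectivePort() helper are invented for this illustration:

```java
import java.net.MalformedURLException;
import java.net.URL;

class EffectivePort {

  // Returns the explicit port if the URL has one; otherwise the default
  // port for the URL's protocol (which may itself be -1).
  static int effectivePort(URL u) {
    int port = u.getPort();
    return port != -1 ? port : u.getDefaultPort();
  }

  public static void main(String[] args) throws MalformedURLException {
    URL implicit = new URL("http://www.userfriendly.org/");
    URL explicit = new URL("ftp://ftp.userfriendly.org:8000/");
    System.out.println(effectivePort(implicit)); // 80
    System.out.println(effectivePort(explicit)); // 8000
  }
}
```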

public String getFile()

The getFile() method returns a String that contains the path portion of a URL; remember that Java does not break a URL into separate path and file parts. Everything from the first slash (/) after the hostname until the character preceding the # sign that begins a fragment identifier is considered to be part of the file. For example:

URL page = this.getDocumentBase();

System.out.println("This page's path is " + page.getFile());

If the URL does not have a file part, Java sets the file to the empty string.

public String getPath()

The getPath() method is a near synonym for getFile(); that is, it returns a String containing the path and file portion of a URL. However, unlike getFile(), it does not include the query string in the String it returns, just the path.

WARNING

Note that the getPath() method does not return only the directory path and getFile() does not return only the filename, as you might expect. Both getPath() and getFile() return the full path and filename. The only difference is that getFile() also returns the query string and getPath() does not.
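A short program makes the difference concrete. The URL is the query-string example used elsewhere in this chapter with a #top fragment added purely for illustration; the class and helper names are likewise made up.

```java
import java.net.MalformedURLException;
import java.net.URL;

class FileVersusPath {

  // Formats the two values side by side so they are easy to compare
  static String summarize(URL u) {
    return "file=" + u.getFile() + " path=" + u.getPath();
  }

  public static void main(String[] args) throws MalformedURLException {
    URL u = new URL(
        "http://www.ibiblio.org/nywc/compositions.phtml?category=Piano#top");
    System.out.println(u.getFile());  // /nywc/compositions.phtml?category=Piano
    System.out.println(u.getPath());  // /nywc/compositions.phtml
    System.out.println(u.getQuery()); // category=Piano
    System.out.println(u.getRef());   // top
  }
}
```

Neither method includes the fragment identifier; that is available only through getRef().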

public String getRef()

The getRef() method returns the fragment identifier part of the URL. If the URL doesn’t have a fragment identifier, the method returns null. In the following code, getRef() returns the string xtocid1902914:

URL u = new URL(

"http://www.ibiblio.org/javafaq/javafaq.html#xtocid1902914");

System.out.println("The fragment ID of " + u + " is " + u.getRef());

public String getQuery()

The getQuery() method returns the query string of the URL. If the URL doesn’t have a query string, the method returns null. In the following code, getQuery() returns the string category=Piano:

URL u = new URL(

"http://www.ibiblio.org/nywc/compositions.phtml?category=Piano");

System.out.println("The query string of " + u + " is " + u.getQuery());

public String getUserInfo()

Some URLs include usernames and occasionally even password information. This information comes after the scheme and before the host; an @ symbol delimits it. For instance, in the URL http://elharo@java.oreilly.com/, the user info is elharo. Some URLs also include passwords in the user info. For instance, in the URL ftp://mp3:secret@ftp.example.com/c%3a/stuff/mp3/, the user info is mp3:secret. However, most of the time, including a password in a URL is a security risk. If the URL doesn’t have any user info, getUserInfo() returns null.

Mailto URLs may not behave like you expect. In a URL like mailto:elharo@ibiblio.org, “elharo@ibiblio.org” is the path, not the user info and the host. That’s because the URL specifies the remote recipient of the message rather than the username and host that’s sending the message.

public String getAuthority()

Between the scheme and the path of a URL, you’ll find the authority. This part of the URI indicates the authority that resolves the resource. In the most general case, the authority includes the user info, the host, and the port. For example, in the URL ftp://mp3:mp3@138.247.121.61:21000/c%3a/, the authority is mp3:mp3@138.247.121.61:21000, the user info is mp3:mp3, the host is 138.247.121.61, and the port is 21000. However, not all URLs have all parts. For instance, in the URL http://conferences.oreilly.com/java/speakers/, the authority is simply the hostname conferences.oreilly.com. The getAuthority() method returns the authority as it exists in the URL, with or without the user info and port.

Example 5-4 uses these methods to split URLs entered on the command line into their component parts.

Example 5-4. The parts of a URL

import java.net.*;

public class URLSplitter {

public static void main(String args[]) {

for (int i = 0; i < args.length; i++) {

try {

URL u = new URL(args[i]);

System.out.println("The URL is " + u);

System.out.println("The scheme is " + u.getProtocol());

System.out.println("The user info is " + u.getUserInfo());

String host = u.getHost();

if (host != null) {

int atSign = host.indexOf('@');

if (atSign != -1) host = host.substring(atSign+1);

System.out.println("The host is " + host);

} else {

System.out.println("The host is null.");

}

System.out.println("The port is " + u.getPort());

System.out.println("The path is " + u.getPath());

System.out.println("The ref is " + u.getRef());

System.out.println("The query string is " + u.getQuery());

} catch (MalformedURLException ex) {

System.err.println(args[i] + " is not a URL I understand.");

}

System.out.println();

}

}

}

Here’s the result of running this against several of the URL examples in this chapter:

% java URLSplitter \

ftp://mp3:mp3@138.247.121.61:21000/c%3a/ \

http://www.oreilly.com \

http://www.ibiblio.org/nywc/compositions.phtml?category=Piano \

http://admin@www.blackstar.com:8080/

The URL is ftp://mp3:mp3@138.247.121.61:21000/c%3a/

The scheme is ftp

The user info is mp3:mp3

The host is 138.247.121.61

The port is 21000

The path is /c%3a/

The ref is null

The query string is null

The URL is http://www.oreilly.com

The scheme is http

The user info is null

The host is www.oreilly.com

The port is -1

The path is

The ref is null

The query string is null

The URL is http://www.ibiblio.org/nywc/compositions.phtml?category=Piano

The scheme is http

The user info is null

The host is www.ibiblio.org

The port is -1

The path is /nywc/compositions.phtml

The ref is null

The query string is category=Piano

The URL is http://admin@www.blackstar.com:8080/

The scheme is http

The user info is admin

The host is www.blackstar.com

The port is 8080

The path is /

The ref is null

The query string is null

Equality and Comparison

The URL class contains the usual equals() and hashCode() methods. These behave almost as you’d expect. Two URLs are considered equal if and only if both URLs point to the same resource on the same host, port, and path, with the same fragment identifier and query string. However there is one surprise here. The equals() method actually tries to resolve the host with DNS so that, for example, it can tell that http://www.ibiblio.org/ and http://ibiblio.org/ are the same.

WARNING

This means that equals() on a URL is potentially a blocking I/O operation! For this reason, you should avoid storing URLs in data structures that depend on equals(), such as java.util.HashMap. Prefer java.net.URI for this, and convert back and forth from URIs to URLs when necessary.

On the other hand, equals() does not go so far as to actually compare the resources identified by two URLs. For example, http://www.oreilly.com/ is not equal to http://www.oreilly.com/index.html; and http://www.oreilly.com:80 is not equal to http://www.oreilly.com/.

Example 5-5 creates URL objects for http://www.ibiblio.org/ and http://ibiblio.org/ and tells you if they’re the same using the equals() method.

Example 5-5. Are http://www.ibiblio.org and http://ibiblio.org the same?

import java.net.*;

public class URLEquality {

public static void main (String[] args) {

try {

URL www = new URL ("http://www.ibiblio.org/");

URL ibiblio = new URL("http://ibiblio.org/");

if (ibiblio.equals(www)) {

System.out.println(ibiblio + " is the same as " + www);

} else {

System.out.println(ibiblio + " is not the same as " + www);

}

} catch (MalformedURLException ex) {

System.err.println(ex);

}

}

}

When you run this program, you discover:

% java URLEquality

http://ibiblio.org/ is the same as http://www.ibiblio.org/

URL does not implement Comparable.

The URL class also has a sameFile() method that checks whether two URLs point to the same resource:

public boolean sameFile(URL other)

The comparison is essentially the same as with equals(), DNS queries included, except that sameFile() does not consider the fragment identifier. Thus sameFile() returns true when comparing http://www.oreilly.com/index.html#p1 and http://www.oreilly.com/index.html#q2 while equals() would return false.

Here’s a fragment of code that uses sameFile() to compare two URLs:

URL u1 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS");

URL u2 = new URL("http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD");

if (u1.sameFile(u2)) {

System.out.println(u1 + " is the same file as \n" + u2);

} else {

System.out.println(u1 + " is not the same file as \n" + u2);

}

The output is:

http://www.ncsa.uiuc.edu/HTMLPrimer.html#GS is the same file as

http://www.ncsa.uiuc.edu/HTMLPrimer.html#HD

Conversion

URL has three methods that convert an instance to another form: toString(), toExternalForm(), and toURI().

Like all good classes, java.net.URL has a toString() method. The String produced by toString() is always an absolute URL, such as http://www.cafeaulait.org/javatutorial.html. It’s uncommon to call toString() explicitly. Print statements call toString() implicitly. Outside of print statements, it’s more proper to use toExternalForm() instead:

public String toExternalForm()

The toExternalForm() method converts a URL object to a string that can be used in an HTML link or a web browser’s Open URL dialog.

The toExternalForm() method returns a human-readable String representing the URL. It is identical to the toString() method. In fact, all the toString() method does is return toExternalForm().

Finally, the toURI() method converts a URL object to an equivalent URI object:

public URI toURI() throws URISyntaxException

We’ll take up the URI class shortly. In the meantime, the main thing you need to know is that the URI class provides much more accurate, specification-conformant behavior than the URL class. For operations like absolutization and encoding, you should prefer the URI class where you have the option. You should also prefer the URI class if you need to store URLs in a hashtable or other data structure, since its equals() method is not blocking. The URL class should be used primarily when you want to download content from a server.
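As a sketch of that advice, the following class keeps a set of visited locations keyed by URI rather than URL. The class name VisitedSet and its markVisited() method are hypothetical. The point is only that adding to or searching the set never triggers a DNS lookup, because URI compares the strings themselves (ignoring case in the scheme and host).

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

class VisitedSet {

  private final Set<URI> visited = new HashSet<>();

  // Returns true the first time a location is seen, false afterward.
  // URI.equals() and URI.hashCode() never touch the network.
  boolean markVisited(URI location) {
    return visited.add(location);
  }

  public static void main(String[] args) throws MalformedURLException {
    VisitedSet seen = new VisitedSet();
    System.out.println(seen.markVisited(URI.create("http://www.ibiblio.org/"))); // true
    System.out.println(seen.markVisited(URI.create("HTTP://WWW.IBIBLIO.ORG/"))); // false
    System.out.println(seen.markVisited(URI.create("http://ibiblio.org/")));     // true

    // Convert to a URL only at the moment you want to download something
    URL download = URI.create("http://www.ibiblio.org/").toURL();
    System.out.println(download);
  }
}
```

Notice the trade-off: unlike URL.equals(), this treats http://www.ibiblio.org/ and http://ibiblio.org/ as different keys, because no attempt is made to resolve the hosts.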

The URI Class

A URI is a generalization of a URL that includes not only Uniform Resource Locators but also Uniform Resource Names (URNs). Most URIs used in practice are URLs, but most specifications and standards such as XML are defined in terms of URIs. In Java, URIs are represented by the java.net.URI class. This class differs from the java.net.URL class in three important ways:

§ The URI class is purely about identification of resources and parsing of URIs. It provides no methods to retrieve a representation of the resource identified by its URI.

§ The URI class is more conformant to the relevant specifications than the URL class.

§ A URI object can represent a relative URI. The URL class absolutizes all URIs before storing them.

In brief, a URL object is a representation of an application layer protocol for network retrieval, whereas a URI object is purely for string parsing and manipulation. The URI class has no network retrieval capabilities. The URL class has some string parsing methods, such as getFile() and getRef(), but many of these are broken and don’t always behave exactly as the relevant specifications say they should. Normally, you should use the URL class when you want to download the content at a URL and the URI class when you want to use the URL for identification rather than retrieval, for instance, to represent an XML namespace. When you need to do both, you may convert from a URI to a URL with the toURL() method, and from a URL to a URI using the toURI() method.
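The conversion is not always possible in the URI-to-URL direction: toURL() needs an absolute URI whose scheme Java has a protocol handler for. The sketch below wraps that in a hypothetical helper, toUrlOrNull(), that returns null rather than throwing. The class name and helper are assumptions made for this example.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class Conversions {

  // Hypothetical helper: converts a URI to a URL, or returns null if the
  // URI is relative (IllegalArgumentException) or uses a scheme with no
  // installed protocol handler (MalformedURLException).
  static URL toUrlOrNull(URI uri) {
    try {
      return uri.toURL();
    } catch (IllegalArgumentException | MalformedURLException ex) {
      return null;
    }
  }

  public static void main(String[] args) throws URISyntaxException {
    URI web = URI.create("http://www.xml.com/pub/a/2003/09/17/stax.html");
    URL u = toUrlOrNull(web);
    System.out.println(u);                      // usable for downloading
    System.out.println(u.toURI().equals(web));  // round trip back to a URI

    System.out.println(toUrlOrNull(URI.create("urn:isbn:1-565-92870-9")));
    System.out.println(toUrlOrNull(URI.create("/javafaq/index.shtml")));
  }
}
```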

Constructing a URI

URIs are built from strings. You can either pass the entire URI to the constructor in a single string, or the individual pieces:

public URI(String uri) throws URISyntaxException

public URI(String scheme, String schemeSpecificPart, String fragment)

throws URISyntaxException

public URI(String scheme, String host, String path, String fragment)

throws URISyntaxException

public URI(String scheme, String authority, String path, String query,

String fragment) throws URISyntaxException

public URI(String scheme, String userInfo, String host, int port,

String path, String query, String fragment) throws URISyntaxException

Unlike the URL class, the URI class does not depend on an underlying protocol handler. As long as the URI is syntactically correct, Java does not need to understand its protocol in order to create a representative URI object. Thus, unlike the URL class, the URI class can be used for new and experimental URI schemes.

The first constructor creates a new URI object from any convenient string. For example:

URI voice = new URI("tel:+1-800-9988-9938");

URI web = new URI("http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc");

URI book = new URI("urn:isbn:1-565-92870-9");

If the string argument does not follow URI syntax rules—for example, if the URI begins with a colon—this constructor throws a URISyntaxException. This is a checked exception, so either catch it or declare that the method where the constructor is invoked can throw it. However, one syntax rule is not checked. In contradiction to the URI specification, the characters used in the URI are not limited to ASCII. They can include other Unicode characters, such as ø and é. Syntactically, there are very few restrictions on URIs, especially once the need to encode non-ASCII characters is removed and relative URIs are allowed. Almost any string can be interpreted as a URI.
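One way to see where the parser draws the line is to feed it a few strings and report what it objects to. In this sketch, the class name, the problemWith() helper, and the sample strings other than the tel URI are invented for illustration. The exact wording of the error messages is left to the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriCheck {

  // Returns null if s parses as a URI reference; otherwise a short
  // description of what the parser rejected and where.
  static String problemWith(String s) {
    try {
      new URI(s);
      return null;
    } catch (URISyntaxException ex) {
      return ex.getReason() + " at index " + ex.getIndex();
    }
  }

  public static void main(String[] args) {
    String[] samples = {
      "tel:+1-800-9988-9938",        // legal opaque URI
      ":broken",                     // begins with a colon
      "http://www.example.com/a b",  // unescaped space in the path
      "r\u00e9sum\u00e9.html"        // non-ASCII letters are accepted
    };
    for (String s : samples) {
      String problem = problemWith(s);
      System.out.println(s + ": " + (problem == null ? "OK" : problem));
    }
  }
}
```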

The second constructor that takes a scheme specific part is mostly used for nonhierarchical URIs. The scheme is the URI’s protocol, such as http, urn, tel, and so forth. It must be composed exclusively of ASCII letters and digits and the three punctuation characters +, -, and .. It must begin with a letter. Passing null for this argument omits the scheme, thus creating a relative URI. For example:

URI absolute = new URI("http", "//www.ibiblio.org" , null);

URI relative = new URI(null, "/javafaq/index.shtml", "today");

The scheme-specific part depends on the syntax of the URI scheme; it’s one thing for an http URL, another for a mailto URL, and something else again for a tel URI. Because the URI class encodes illegal characters with percent escapes, there’s effectively no syntax error you can make in this part.

Finally, the third argument contains the fragment identifier, if any. Again, characters that are forbidden in a fragment identifier are escaped automatically. Passing null for this argument simply omits the fragment identifier.
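To see the automatic escaping at work, you can build a few URIs with this constructor and print them. The first two are the examples shown above. The mailto address containing a space is a made-up value chosen only because a space must be escaped, and the class and helper names are likewise hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SchemeSpecific {

  // Thin wrapper around the three-argument constructor
  static URI build(String scheme, String schemeSpecificPart, String fragment)
      throws URISyntaxException {
    return new URI(scheme, schemeSpecificPart, fragment);
  }

  public static void main(String[] args) throws URISyntaxException {
    URI absolute = build("http", "//www.ibiblio.org", null);
    URI relative = build(null, "/javafaq/index.shtml", "today");
    // The space is illegal in a URI, so the constructor percent-escapes it
    URI mail = build("mailto", "el haro@ibiblio.org", null);

    System.out.println(absolute); // http://www.ibiblio.org
    System.out.println(relative); // /javafaq/index.shtml#today
    System.out.println(mail);     // mailto:el%20haro@ibiblio.org
  }
}
```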

The third constructor is used for hierarchical URIs such as http and ftp URLs. The host and path together (separated by a /) form the scheme-specific part for this URI. For example:

URI today= new URI("http", "www.ibiblio.org", "/javafaq/index.html", "today");

This produces the URI http://www.ibiblio.org/javafaq/index.html#today.

If the constructor cannot form a legal hierarchical URI from the supplied pieces—for instance, if there is a scheme so the URI has to be absolute but the path doesn’t start with /—then it throws a URISyntaxException.

The fourth constructor is basically the same as the third, with the addition of a query string. For example:

URI today = new URI("http", "www.ibiblio.org", "/javafaq/index.html",

"referrer=cnet&date=2014-02-23", "today");

As usual, any unescapable syntax errors cause a URISyntaxException to be thrown and null can be passed to omit any of the arguments.

The fifth constructor is the master hierarchical URI constructor that the previous two invoke. It divides the authority into separate user info, host, and port parts, each of which has its own syntax rules. For example:

URI styles = new URI("ftp", "anonymous:elharo@ibiblio.org",

"ftp.oreilly.com", 21, "/pub/stylesheet", null, null);

However, the resulting URI still has to follow all the usual rules for URIs; and again null can be passed for any argument to omit it from the result.

If you’re sure your URIs are legal and do not violate any of the rules, you can use the static factory URI.create() method instead. Unlike the constructors, it does not throw a URISyntaxException. For example, this invocation creates a URI for anonymous FTP access using an email address as password:

URI styles = URI.create(

"ftp://anonymous:elharo%40ibiblio.org@ftp.oreilly.com:21/pub/stylesheet");

If the URI does prove to be malformed, then an IllegalArgumentException is thrown by this method. This is a runtime exception, so you don’t have to explicitly declare it or catch it.

The Parts of the URI

A URI reference has up to three parts: a scheme, a scheme-specific part, and a fragment identifier. The general format is:

scheme:scheme-specific-part#fragment

If the scheme is omitted, the URI reference is relative. If the fragment identifier is omitted, the URI reference is a pure URI. The URI class has getter methods that return these three parts of each URI object. The getRawFoo() methods return the encoded forms of the parts of the URI, while the equivalent getFoo() methods first decode any percent-escaped characters and then return the decoded part:

public String getScheme()

public String getSchemeSpecificPart()

public String getRawSchemeSpecificPart()

public String getFragment()

public String getRawFragment()

TIP

There’s no getRawScheme() method because the URI specification requires that all scheme names be composed exclusively of URI-legal ASCII characters and does not allow percent escapes in scheme names.

These methods all return null if the particular URI object does not have the relevant component: for example, a relative URI without a scheme or an http URI without a fragment identifier.

A URI that has a scheme is an absolute URI. A URI without a scheme is relative. The isAbsolute() method returns true if the URI is absolute, false if it’s relative:

public boolean isAbsolute()

The details of the scheme-specific part vary depending on the type of the scheme. For example, in a tel URL, the scheme-specific part has the syntax of a telephone number. However, in many useful URIs, including the very common file and http URLs, the scheme-specific part has a particular hierarchical format divided into an authority, a path, and a query string. The authority is further divided into user info, host, and port. The isOpaque() method returns false if the URI is hierarchical, true if it’s not hierarchical—that is, if it’s opaque:

public boolean isOpaque()
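The two tests combine naturally into a three-way classification, since an opaque URI always has a scheme. A minimal sketch, with an invented class name and kindOf() helper, using URIs that appear in this chapter:

```java
import java.net.URI;

class UriKind {

  // Classifies a URI as opaque, absolute hierarchical, or relative
  static String kindOf(URI u) {
    if (u.isOpaque()) {
      return "opaque";
    }
    return u.isAbsolute() ? "absolute hierarchical" : "relative";
  }

  public static void main(String[] args) {
    String[] samples = {
      "tel:+1-800-9988-9938",
      "urn:isbn:1-565-92870-9",
      "http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc",
      "/javafaq/index.shtml"
    };
    for (String s : samples) {
      System.out.println(s + " is " + kindOf(URI.create(s)));
    }
  }
}
```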

If the URI is opaque, all you can get is the scheme, scheme-specific part, and fragment identifier. However, if the URI is hierarchical, there are getter methods for all the different parts of a hierarchical URI:

public String getAuthority()

public String getFragment()

public String getHost()

public String getPath()

public int getPort()

public String getQuery()

public String getUserInfo()

These methods all return the decoded parts; in other words, percent escapes, such as %3C, are changed into the characters they represent, such as <. If you want the raw, encoded parts of the URI, there are five parallel getRawFoo() methods:

public String getRawAuthority()

public String getRawFragment()

public String getRawPath()

public String getRawQuery()

public String getRawUserInfo()

Remember the URI class differs from the URI specification in that non-ASCII characters such as é and ü are never percent escaped in the first place, and thus will still be present in the strings returned by the getRawFoo() methods unless the strings originally used to construct the URI object were encoded.
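The following sketch prints the raw and decoded forms of the same URI side by side. The URI itself (on www.example.com) is invented for the illustration, as are the class name and the describe() helper.

```java
import java.net.URI;

class RawVersusDecoded {

  // Shows the raw path followed by its decoded equivalent
  static String describe(URI u) {
    return u.getRawPath() + " -> " + u.getPath();
  }

  public static void main(String[] args) {
    URI u = URI.create(
        "http://www.example.com/a%20b/c.html?q=%3Ctag%3E#sec%201");
    System.out.println(describe(u));          // /a%20b/c.html -> /a b/c.html
    System.out.println(u.getRawQuery());      // q=%3Ctag%3E
    System.out.println(u.getQuery());         // q=<tag>
    System.out.println(u.getRawFragment());   // sec%201
    System.out.println(u.getFragment());      // sec 1
  }
}
```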

TIP

There are no getRawPort() and getRawHost() methods because these components are always guaranteed to be made up of ASCII characters.

In the event that the specific URI does not contain this information—for instance, the URI http://www.example.com has no user info, path, port, or query string—the relevant methods return null. getPort() is the single exception. Since it’s declared to return an int, it can’t return null. Instead, it returns -1 to indicate an omitted port.

For various technical reasons that don’t have a lot of practical impact, Java can’t always initially detect syntax errors in the authority component. The immediate symptom of this failing is normally an inability to return the individual parts of the authority, port, host, and user info. In this event, you can call parseServerAuthority() to force the authority to be reparsed:

public URI parseServerAuthority() throws URISyntaxException

The original URI does not change (URI objects are immutable), but the URI returned will have separate authority parts for user info, host, and port. If the authority cannot be parsed, a URISyntaxException is thrown.

Example 5-6 uses these methods to split URIs entered on the command line into their component parts. It’s similar to Example 5-4 but works with any syntactically correct URI, not just the ones Java has a protocol handler for.

Example 5-6. The parts of a URI

import java.net.*;

public class URISplitter {

public static void main(String args[]) {

for (int i = 0; i < args.length; i++) {

try {

URI u = new URI(args[i]);

System.out.println("The URI is " + u);

if (u.isOpaque()) {

System.out.println("This is an opaque URI.");

System.out.println("The scheme is " + u.getScheme());

System.out.println("The scheme specific part is "

+ u.getSchemeSpecificPart());

System.out.println("The fragment ID is " + u.getFragment());

} else {

System.out.println("This is a hierarchical URI.");

System.out.println("The scheme is " + u.getScheme());

try {

u = u.parseServerAuthority();

System.out.println("The host is " + u.getHost());

System.out.println("The user info is " + u.getUserInfo());

System.out.println("The port is " + u.getPort());

} catch (URISyntaxException ex) {

// Must be a registry based authority

System.out.println("The authority is " + u.getAuthority());

}

System.out.println("The path is " + u.getPath());

System.out.println("The query string is " + u.getQuery());

System.out.println("The fragment ID is " + u.getFragment());

}

} catch (URISyntaxException ex) {

System.err.println(args[i] + " does not seem to be a URI.");

}

System.out.println();

}

}

}

Here’s the result of running this against three of the URI examples in this section:

% java URISplitter tel:+1-800-9988-9938 \

http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc \

urn:isbn:1-565-92870-9

The URI is tel:+1-800-9988-9938

This is an opaque URI.

The scheme is tel

The scheme specific part is +1-800-9988-9938

The fragment ID is null

The URI is http://www.xml.com/pub/a/2003/09/17/stax.html#id=_hbc

This is a hierarchical URI.

The scheme is http

The host is www.xml.com

The user info is null

The port is -1

The path is /pub/a/2003/09/17/stax.html

The query string is null

The fragment ID is id=_hbc

The URI is urn:isbn:1-565-92870-9

This is an opaque URI.

The scheme is urn

The scheme specific part is isbn:1-565-92870-9

The fragment ID is null

Resolving Relative URIs

The URI class has three methods for converting back and forth between relative and absolute URIs:

public URI resolve(URI uri)

public URI resolve(String uri)

public URI relativize(URI uri)

The resolve() methods compare the uri argument to this URI and use it to construct a new URI object that wraps an absolute URI. For example, consider these three lines of code:

URI absolute = new URI("http://www.example.com/");

URI relative = new URI("images/logo.png");

URI resolved = absolute.resolve(relative);

After they’ve executed, resolved contains the absolute URI http://www.example.com/images/logo.png.

If the invoking URI does not contain an absolute URI itself, the resolve() method resolves as much of the URI as it can and returns a new relative URI object as a result. For example, take these two statements:

URI top = new URI("javafaq/books/");

URI resolved = top.resolve("jnp3/examples/07/index.html");

After they’ve executed, resolved now contains the relative URI javafaq/books/jnp3/examples/07/index.html with no scheme or authority.

It’s also possible to reverse this procedure; that is, to go from an absolute URI to a relative one. The relativize() method creates a new URI object from the uri argument that is relative to the invoking URI. The argument is not changed. For example:

URI absolute = new URI("http://www.example.com/images/logo.png");

URI top = new URI("http://www.example.com/");

URI relative = top.relativize(absolute);

The URI object relative now contains the relative URI images/logo.png.
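The following sketch pulls the resolve() and relativize() methods together and shows a few cases the fragments above don’t cover, such as a base URI that names a file and a relative URI containing a .. segment. The class name ResolveDemo and all the example.com and example.org addresses are invented for the illustration:

```java
import java.net.URI;

class ResolveDemo {

  public static void main(String[] args) {
    URI base = URI.create("http://www.example.com/docs/guide/index.html");

    // A relative path replaces everything after the last slash in the base path
    System.out.println(base.resolve("images/logo.png"));
    // http://www.example.com/docs/guide/images/logo.png

    // ".." segments are removed as part of resolution
    System.out.println(base.resolve("../api/"));
    // http://www.example.com/docs/api/

    // A path that begins with a slash replaces the whole base path
    System.out.println(base.resolve("/top.html"));
    // http://www.example.com/top.html

    // relativize() reverses resolve() when the argument lies under the invoking URI
    URI top = URI.create("http://www.example.com/docs/");
    URI relative = top.relativize(base);
    System.out.println(relative);                             // guide/index.html
    System.out.println(top.resolve(relative).equals(base));   // true

    // If the argument is not under the invoking URI, it is returned as is
    URI elsewhere = URI.create("http://www.example.org/other.html");
    System.out.println(top.relativize(elsewhere));
    // http://www.example.org/other.html
  }
}
```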

Equality and Comparison

URIs are tested for equality pretty much as you’d expect. It’s not quite direct string comparison. Equal URIs must both either be hierarchical or opaque. The scheme and authority parts are compared without considering case. That is, http and HTTP are the same scheme, and www.example.com is the same authority as www.EXAMPLE.com. The rest of the URI is case sensitive, except for hexadecimal digits used to escape illegal characters. Escapes are not decoded before comparing. http://www.example.com/A and http://www.example.com/%41 are unequal URIs.

The hashCode() method is consistent with equals. Equal URIs do have the same hash code and unequal URIs are fairly unlikely to share the same hash code.
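These rules are easy to check for yourself. The sketch below compares a few pairs of URIs; the class name UriEqualityDemo and its same() helper are hypothetical names chosen for this example:

```java
import java.net.URI;

class UriEqualityDemo {

  // Hypothetical helper: parses both strings and compares the resulting URIs.
  static boolean same(String a, String b) {
    return URI.create(a).equals(URI.create(b));
  }

  public static void main(String[] args) {
    // Scheme and host are compared without regard to case
    System.out.println(same("http://www.example.com/A", "HTTP://www.EXAMPLE.com/A"));  // true

    // The path is case sensitive
    System.out.println(same("http://www.example.com/A", "http://www.example.com/a"));  // false

    // Escapes are not decoded before comparing
    System.out.println(same("http://www.example.com/A", "http://www.example.com/%41")); // false

    // The hexadecimal digits inside an escape are compared without regard to case
    System.out.println(same("http://www.example.com/%7efoo", "http://www.example.com/%7Efoo")); // true

    // Equal URIs have equal hash codes
    URI lower = URI.create("http://www.example.com/A");
    URI upper = URI.create("HTTP://www.EXAMPLE.com/A");
    System.out.println(lower.hashCode() == upper.hashCode());  // true
  }
}
```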

URI implements Comparable, and thus URIs can be ordered. The ordering is based on string comparison of the individual parts, in this sequence:

1. If the schemes are different, the schemes are compared, without considering case.

2. Otherwise, if the schemes are the same, a hierarchical URI is considered to be less than an opaque URI with the same scheme.

3. If both URIs are opaque URIs, they’re ordered according to their scheme-specific parts.

4. If both the scheme and the opaque scheme-specific parts are equal, the URIs are compared by their fragments.

5. If both URIs are hierarchical, they’re ordered according to their authority components, which are themselves ordered according to user info, host, and port, in that order. Hosts are case insensitive.

6. If the schemes and the authorities are equal, the path is used to distinguish them.

7. If the paths are also equal, the query strings are compared.

8. If the query strings are equal, the fragments are compared.

URIs are not comparable to any type except themselves. Comparing a URI to anything except another URI causes a ClassCastException.
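As a rough illustration of the ordering rules listed above, the following sketch sorts a handful of URIs by their natural order. The class name UriOrderingDemo, the sorted() helper, and the sample URIs are all invented for the example:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class UriOrderingDemo {

  // Hypothetical helper: parses the strings and sorts them using URI.compareTo()
  static List<URI> sorted(String... uris) {
    List<URI> list = new ArrayList<>();
    for (String s : uris) {
      list.add(URI.create(s));
    }
    Collections.sort(list);
    return list;
  }

  public static void main(String[] args) {
    List<URI> result = sorted(
        "urn:isbn:1-565-92870-9",    // opaque, scheme urn
        "http:opaque-part",          // opaque, scheme http
        "http://www.example.com/b",  // hierarchical, scheme http
        "HTTP://WWW.EXAMPLE.COM/a",  // same scheme and host, ignoring case
        "ftp://ftp.example.com/");   // hierarchical, scheme ftp
    for (URI u : result) {
      System.out.println(u);
    }
    // Prints:
    // ftp://ftp.example.com/
    // HTTP://WWW.EXAMPLE.COM/a
    // http://www.example.com/b
    // http:opaque-part
    // urn:isbn:1-565-92870-9
  }
}
```

The schemes sort first (ftp, then http, then urn). Within the http scheme, the two hierarchical URIs come before the opaque one, and because their hosts differ only in case, the path decides which of them is first.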

String Representations

Two methods convert URI objects to strings, toString() and toASCIIString():

public String toString()

public String toASCIIString()

The toString() method returns an unencoded string form of the URI (i.e., characters like é and \ are not percent escaped). Therefore, the result of calling this method is not guaranteed to be a syntactically correct URI, though it is in fact a syntactically correct IRI. This form is sometimes useful for display to human beings, but usually not for retrieval.

The toASCIIString() method returns an encoded string form of the URI. Characters like é and \ are always percent escaped whether or not they were originally escaped. This is the string form of the URI you should use most of the time. Even if the form returned by toString() is more legible for humans, they may still copy and paste it into areas that are not expecting an illegal URI. toASCIIString() always returns a syntactically correct URI.
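A short sketch makes the difference concrete. The class name UriStringsDemo and the ascii() helper are hypothetical, and the é is written as the Unicode escape \u00e9 so the source file itself can stay pure ASCII:

```java
import java.net.URI;

class UriStringsDemo {

  // Hypothetical helper: returns the fully escaped form of a URI string
  static String ascii(String uri) {
    return URI.create(uri).toASCIIString();
  }

  public static void main(String[] args) {
    URI u = URI.create("http://www.example.com/caf\u00e9");

    // toString() leaves the é alone (how it displays depends on your console)
    System.out.println(u.toString());

    // toASCIIString() percent escapes the UTF-8 bytes of the é
    System.out.println(u.toASCIIString());  // http://www.example.com/caf%C3%A9

    // A URI that is already pure ASCII comes back unchanged
    System.out.println(ascii("http://www.example.com/cafe"));  // http://www.example.com/cafe
  }
}
```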

x-www-form-urlencoded

One of the challenges faced by the designers of the Web was dealing with the differences between operating systems. These differences can cause problems with URLs: for example, some operating systems allow spaces in filenames; some don’t. Most operating systems won’t complain about a # sign in a filename; but in a URL, a # sign indicates that the filename has ended, and a fragment identifier follows. Other special characters, nonalphanumeric characters, and so on, all of which may have a special meaning inside a URL or on another operating system, present similar problems. Furthermore, Unicode was not yet ubiquitous when the Web was invented, so not all systems could handle characters such as é and 本. To solve these problems, characters used in URLs must come from a fixed subset of ASCII, specifically:

§ The capital letters A–Z

§ The lowercase letters a–z

§ The digits 0–9

§ The punctuation characters - _ . ! ~ * ' ( ) and ,

The characters : / & ? @ # ; $ + = and % may also be used, but only for their specified purposes. If these characters occur as part of a path or query string, they and all other characters should be encoded.

The encoding is very simple. Any characters that are not ASCII numerals, letters, or the punctuation marks specified earlier are converted into bytes and each byte is written as a percent sign followed by two hexadecimal digits. Spaces are a special case because they’re so common. Besides being encoded as %20, they can be encoded as a plus sign (+). The plus sign itself is encoded as %2B. The / # = & and ? characters should be encoded when they are used as part of a name, and not as a separator between parts of the URL.
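To see the byte-by-byte nature of the encoding, here is a hand-rolled sketch of the percent-escaping step. It is only an illustration, not how URLEncoder or the URI class is implemented. The class name PercentEncoder is made up, and the sketch assumes UTF-8 as the character encoding and treats only letters, digits, and the marks - _ . ! ~ * ' ( ) as safe. It always writes a space as %20, never as a plus sign:

```java
import java.nio.charset.StandardCharsets;

class PercentEncoder {

  // Assumption for this sketch: these marks are passed through unescaped
  private static final String SAFE_PUNCTUATION = "-_.!~*'()";

  static String encode(String s) {
    StringBuilder out = new StringBuilder();
    // Convert the characters to bytes first, then look at each byte in turn
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      int value = b & 0xFF;        // treat the byte as unsigned
      char c = (char) value;
      boolean safe = (c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9')
          || SAFE_PUNCTUATION.indexOf(c) >= 0;
      if (safe) {
        out.append(c);
      } else {
        // A percent sign followed by exactly two hexadecimal digits
        out.append('%').append(String.format("%02X", value));
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(encode("a b"));       // a%20b
    System.out.println(encode("I/O"));       // I%2FO
    System.out.println(encode("100%"));      // 100%25
    System.out.println(encode("caf\u00e9")); // caf%C3%A9 (é is two bytes in UTF-8)
  }
}
```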

The URL class does not encode or decode automatically. You can construct URL objects that use illegal ASCII and non-ASCII characters and/or percent escapes. Such characters and escapes are not automatically encoded or decoded when output by methods such as getPath() and toExternalForm(). You are responsible for making sure all such characters are properly encoded in the strings used to construct a URL object.

Luckily, Java provides URLEncoder and URLDecoder classes to encode and decode strings in this format.

URLEncoder

To URL encode a string, pass the string and the character set name to the URLEncoder.encode() method. For example:

String encoded = URLEncoder.encode("This*string*has*asterisks", "UTF-8");

URLEncoder.encode() returns a copy of the input string with a few changes. Any nonalphanumeric characters are converted into % sequences (except the space, underscore, hyphen, period, and asterisk characters). It also encodes all non-ASCII characters. The space is converted into a plus sign. This method is a little overaggressive; it also converts tildes, single quotes, exclamation points, and parentheses to percent escapes, even though they don’t absolutely have to be. However, this change isn’t forbidden by the URL specification, so web browsers deal reasonably with these excessively encoded URLs.

Although this method allows you to specify the character set, the only such character set you should ever pick is UTF-8. UTF-8 is compatible with the IRI specification, the URI class, modern web browsers, and more additional software than any other encoding you could choose.

Example 5-7 is a program that uses URLEncoder.encode() to print various encoded strings.

Example 5-7. x-www-form-urlencoded strings

import java.io.*;

import java.net.*;

public class EncoderTest {

public static void main(String[] args) {

try {

System.out.println(URLEncoder.encode("This string has spaces",

"UTF-8"));

System.out.println(URLEncoder.encode("This*string*has*asterisks",

"UTF-8"));

System.out.println(URLEncoder.encode("This%string%has%percent%signs",

"UTF-8"));

System.out.println(URLEncoder.encode("This+string+has+pluses",

"UTF-8"));

System.out.println(URLEncoder.encode("This/string/has/slashes",

"UTF-8"));

System.out.println(URLEncoder.encode("This\"string\"has\"quote\"marks",

"UTF-8"));

System.out.println(URLEncoder.encode("This:string:has:colons",

"UTF-8"));

System.out.println(URLEncoder.encode("This~string~has~tildes",

"UTF-8"));

System.out.println(URLEncoder.encode("This(string)has(parentheses)",

"UTF-8"));

System.out.println(URLEncoder.encode("This.string.has.periods",

"UTF-8"));

System.out.println(URLEncoder.encode("This=string=has=equals=signs",

"UTF-8"));

System.out.println(URLEncoder.encode("This&string&has&ampersands",

"UTF-8"));

System.out.println(URLEncoder.encode("Thiséstringéhasénon-ASCII characters",

"UTF-8"));

} catch (UnsupportedEncodingException ex) {

throw new RuntimeException("Broken VM does not support UTF-8");

}

}

}

Here is the output (note that the code needs to be saved in something other than ASCII, and the encoding chosen should be passed as an argument to the compiler to account for the non-ASCII characters in the source code):

% javac -encoding UTF8 EncoderTest.java

% java EncoderTest

This+string+has+spaces

This*string*has*asterisks

This%25string%25has%25percent%25signs

This%2Bstring%2Bhas%2Bpluses

This%2Fstring%2Fhas%2Fslashes

This%22string%22has%22quote%22marks

This%3Astring%3Ahas%3Acolons

This%7Estring%7Ehas%7Etildes

This%28string%29has%28parentheses%29

This.string.has.periods

This%3Dstring%3Dhas%3Dequals%3Dsigns

This%26string%26has%26ampersands

This%C3%A9string%C3%A9has%C3%A9non-ASCII+characters

Notice in particular that this method encodes the forward slash, the ampersand, the equals sign, and the colon. It does not attempt to determine how these characters are being used in a URL. Consequently, you have to encode URLs piece by piece rather than encoding an entire URL in one method call. This is an important point, because the most common use of URLEncoder is preparing query strings for communicating with server-side programs that use GET. For example, suppose you want to encode this URL for a Google search:

https://www.google.com/search?hl=en&as_q=Java&as_epq=I/O

This code fragment encodes it:

String query = URLEncoder.encode(

"https://www.google.com/search?hl=en&as_q=Java&as_epq=I/O", "UTF-8");

System.out.println(query);

Unfortunately, the output is:

https%3A%2F%2Fwww.google.com%2Fsearch%3Fhl%3Den%26as_q%3DJava%26as_epq%3DI%2FO

The problem is that URLEncoder.encode() encodes blindly. It can’t distinguish between special characters used as part of the URL or query string, like / and =, and characters that need to be encoded. Consequently, URLs need to be encoded a piece at a time like this:

String url = "https://www.google.com/search?";

url += URLEncoder.encode("hl", "UTF-8");

url += "=";

url += URLEncoder.encode("en", "UTF-8");

url += "&";

url += URLEncoder.encode("as_q", "UTF-8");

url += "=";

url += URLEncoder.encode("Java", "UTF-8");

url += "&";

url += URLEncoder.encode("as_epq", "UTF-8");

url += "=";

url += URLEncoder.encode("I/O", "UTF-8");

System.out.println(url);

The output of this is what you actually want:

https://www.google.com/search?hl=en&as_q=Java&as_epq=I%2FO

In this case, you could have skipped encoding several of the constant strings such as “Java” because you know from inspection that they don’t contain any characters that need to be encoded. However, in general, these values will be variables, not constants; and you’ll need to encode each piece to be safe.

Example 5-8 is a QueryString class that uses URLEncoder to encode successive name and value pairs in a Java object, which will be used for sending data to server-side programs. To add name-value pairs, call the add() method, which takes two strings as arguments and encodes them. The getQuery() method returns the accumulated list of encoded name-value pairs.

Example 5-8. The QueryString class

import java.io.UnsupportedEncodingException;

import java.net.URLEncoder;

public class QueryString {

private StringBuilder query = new StringBuilder();

public QueryString() {

}

public synchronized void add(String name, String value) {

// Separate pairs with &, but don't begin the query string with one

if (query.length() > 0) {

query.append('&');

}

encode(name, value);

}

private synchronized void encode(String name, String value) {

try {

query.append(URLEncoder.encode(name, "UTF-8"));

query.append('=');

query.append(URLEncoder.encode(value, "UTF-8"));

} catch (UnsupportedEncodingException ex) {

throw new RuntimeException("Broken VM does not support UTF-8");

}

}

public synchronized String getQuery() {

return query.toString();

}

@Override

public String toString() {

return getQuery();

}

}

Using this class, we can now encode the previous example:

QueryString qs = new QueryString();

qs.add("hl", "en");

qs.add("as_q", "Java");

qs.add("as_epq", "I/O");

String url = "http://www.google.com/search?" + qs;

System.out.println(url);

URLDecoder

The corresponding URLDecoder class has a static decode() method that decodes strings encoded in x-www-form-urlencoded format. That is, it converts all plus signs to spaces and all percent escapes to their corresponding character:

public static String decode(String s, String encoding)

throws UnsupportedEncodingException

If you have any doubt about which encoding to use, pick UTF-8. It’s more likely to be correct than anything else.

An IllegalArgumentException should be thrown if the string contains a percent sign that isn’t followed by two hexadecimal digits or decodes into an illegal sequence.

Since URLDecoder does not touch non-escaped characters (apart from converting plus signs to spaces), you can pass an entire URL to it rather than splitting it into pieces first. For example:

String input = "https://www.google.com/" +

"search?hl=en&as_q=Java&as_epq=I%2FO";

String output = URLDecoder.decode(input, "UTF-8");

System.out.println(output);
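The sketch below shows decode() handling both well-formed and malformed input. The class name DecoderDemo and the decodeOrNull() helper are hypothetical; returning null for bad input is simply one way a program might choose to handle the IllegalArgumentException:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecoderDemo {

  // Hypothetical helper: decodes s as UTF-8, or returns null if s is malformed
  static String decodeOrNull(String s) {
    try {
      return URLDecoder.decode(s, "UTF-8");
    } catch (IllegalArgumentException ex) {
      // A percent sign that isn't followed by two hexadecimal digits
      return null;
    } catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
  }

  public static void main(String[] args) {
    System.out.println(decodeOrNull("I%2FO"));       // I/O
    System.out.println(decodeOrNull("Java+I%2FO"));  // Java I/O
    System.out.println(decodeOrNull("100%25"));      // 100%
    System.out.println(decodeOrNull("100%"));        // null (incomplete escape)
    System.out.println(decodeOrNull("%zz"));         // null (not hexadecimal)
  }
}
```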

Proxies

Many systems access the Web and sometimes other non-HTTP parts of the Internet through proxy servers. A proxy server receives a request for a remote server from a local client. The proxy server makes the request to the remote server and forwards the result back to the local client. Sometimes this is done for security reasons, such as to prevent remote hosts from learning private details about the local network configuration. Other times it’s done to prevent users from accessing forbidden sites by filtering outgoing requests and limiting which sites can be viewed. For instance, an elementary school might want to block access to http://www.playboy.com. And still other times it’s done purely for performance, to allow multiple users to retrieve the same popular documents from a local cache rather than making repeated downloads from the remote server.

Java programs based on the URL class can work through most common proxy servers and protocols. Indeed, this is one reason you might want to choose to use the URL class rather than rolling your own HTTP or other client on top of raw sockets.

System Properties

For basic operations, all you have to do is set a few system properties to point to the addresses of your local proxy servers. If you are using a pure HTTP proxy, set http.proxyHost to the domain name or the IP address of your proxy server and http.proxyPort to the port of the proxy server (the default is 80). There are several ways to do this, including calling System.setProperty() from within your Java code or using the -D options when launching the program. This example sets the proxy server to 192.168.254.254 and the port to 9000:

% java -Dhttp.proxyHost=192.168.254.254 -Dhttp.proxyPort=9000 com.domain.Program

If the proxy requires a username and password, you’ll need to install an Authenticator, as we’ll discuss shortly in Accessing Password-Protected Sites.

If you want to exclude a host from being proxied and connect directly instead, set the http.nonProxyHosts system property to its hostname or IP address. To exclude multiple hosts, separate their names by vertical bars. For example, this code fragment proxies everything except java.oreilly.com and xml.oreilly.com:

System.setProperty("http.proxyHost", "192.168.254.254");

System.setProperty("http.proxyPort", "9000");

System.setProperty("http.nonProxyHosts", "java.oreilly.com|xml.oreilly.com");

You can also use an asterisk as a wildcard to indicate that all the hosts within a particular domain or subdomain should not be proxied. For example, to proxy everything except hosts in the oreilly.com domain:

% java -Dhttp.proxyHost=192.168.254.254 -Dhttp.nonProxyHosts=*.oreilly.com com.domain.Program

If you are using an FTP proxy server, set the ftp.proxyHost, ftp.proxyPort, and ftp.nonProxyHosts properties in the same way.

Java does not support any other application layer proxies, but if you’re using a transport layer SOCKS proxy for all TCP connections, you can identify it with the socksProxyHost and socksProxyPort system properties. Java does not provide an option for nonproxying with SOCKS. It’s an all-or-nothing decision.
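Setting the SOCKS properties from code looks much like the HTTP case. In this sketch the class name SocksProxySetup, the helper method, and the proxy host socks.example.com are all hypothetical; 1080 is the conventional SOCKS port, but your proxy may well use a different one:

```java
class SocksProxySetup {

  // Hypothetical helper: routes TCP connections opened afterward through a SOCKS proxy
  static void useSocksProxy(String host, int port) {
    System.setProperty("socksProxyHost", host);
    // System properties are strings, so the port number has to be converted
    System.setProperty("socksProxyPort", Integer.toString(port));
  }

  public static void main(String[] args) {
    useSocksProxy("socks.example.com", 1080);
    System.out.println(System.getProperty("socksProxyHost") + ":"
        + System.getProperty("socksProxyPort"));  // socks.example.com:1080
  }
}
```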

The Proxy Class

The Proxy class allows more fine-grained control of proxy servers from within a Java program. Specifically, it allows you to choose different proxy servers for different remote hosts. The proxies themselves are represented by instances of the java.net.Proxy class. There are still only three kinds of proxies (HTTP, SOCKS, and direct connections, meaning no proxy at all), represented by three constants in the Proxy.Type enum:

§ Proxy.Type.DIRECT

§ Proxy.Type.HTTP

§ Proxy.Type.SOCKS

Besides its type, the other important piece of information about a proxy is its address and port, given as a SocketAddress object. For example, this code fragment creates a Proxy object representing an HTTP proxy server on port 80 of proxy.example.com:

SocketAddress address = new InetSocketAddress("proxy.example.com", 80);

Proxy proxy = new Proxy(Proxy.Type.HTTP, address);

Although there are only three kinds of proxy objects, there can be many proxies of the same type for different proxy servers on different hosts.
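For instance, a program might hold several Proxy objects at once, as in the following sketch. The class name ProxyObjectsDemo, the httpProxy() helper, and the proxy hostnames are invented for the example. It uses InetSocketAddress.createUnresolved() so that merely constructing the objects does not trigger a DNS lookup:

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.SocketAddress;

class ProxyObjectsDemo {

  // Hypothetical helper: builds an HTTP proxy description without resolving the host
  static Proxy httpProxy(String host, int port) {
    SocketAddress address = InetSocketAddress.createUnresolved(host, port);
    return new Proxy(Proxy.Type.HTTP, address);
  }

  public static void main(String[] args) {
    // Two proxies of the same type on different hosts
    Proxy web = httpProxy("proxy.example.com", 80);
    Proxy backup = httpProxy("proxy2.example.com", 8080);
    System.out.println(web.type());          // HTTP
    System.out.println(web.equals(backup));  // false

    // A SOCKS proxy is built the same way with a different type
    Proxy socks = new Proxy(Proxy.Type.SOCKS,
        InetSocketAddress.createUnresolved("socks.example.com", 1080));
    System.out.println(socks.type());        // SOCKS

    // The predefined NO_PROXY object stands for a direct connection
    System.out.println(Proxy.NO_PROXY.type());     // DIRECT
    System.out.println(Proxy.NO_PROXY.address());  // null
  }
}
```

A Proxy object built this way can also be passed to the URL class’s openConnection(Proxy) method if you want to use it for just one connection.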

The ProxySelector Class

Each running virtual machine has a single java.net.ProxySelector object it uses to locate the proxy server for different connections. The default ProxySelector merely inspects the various system properties and the URL’s protocol to decide how to connect to different hosts. However, you can install your own subclass of ProxySelector in place of the default selector and use it to choose different proxies based on protocol, host, path, time of day, or other criteria.

The key to this class is the abstract select() method:

public abstract List<Proxy> select(URI uri)

Java passes this method a URI object (not a URL object) representing the host to which a connection is needed. For a connection made with the URL class, this object typically has the form http://www.example.com/ or ftp://ftp.example.com/pub/files/, for example. For a pure TCP connection made with the Socket class, this URI will have the form socket://host:port; for instance, socket://www.example.com:80. The ProxySelector object then chooses the right proxies for this type of object and returns them in a List<Proxy>.

The second abstract method in this class you must implement is connectFailed():

public abstract void connectFailed(URI uri, SocketAddress address, IOException ex)

This is a callback method used to warn a program that the proxy server isn’t actually making the connection. Example 5-9 demonstrates with a ProxySelector that attempts to use the proxy server at proxy.example.com for all HTTP connections unless the proxy server has previously failed to resolve a connection to a particular URL. In that case, it suggests a direct connection instead.

Example 5-9. A ProxySelector that remembers what it can connect to

import java.io.*;

import java.net.*;

import java.util.*;

public class LocalProxySelector extends ProxySelector {

private List<URI> failed = new ArrayList<URI>();

public List<Proxy> select(URI uri) {

List<Proxy> result = new ArrayList<Proxy>();

if (failed.contains(uri)

|| !"http".equalsIgnoreCase(uri.getScheme())) {

result.add(Proxy.NO_PROXY);

} else {

SocketAddress proxyAddress

= new InetSocketAddress( "proxy.example.com", 8000);

Proxy proxy = new Proxy(Proxy.Type.HTTP, proxyAddress);

result.add(proxy);

}

return result;

}

public void connectFailed(URI uri, SocketAddress address, IOException ex) {

failed.add(uri);

}

}

As I said, each virtual machine has exactly one ProxySelector. To change the ProxySelector, pass the new selector to the static ProxySelector.setDefault() method, like so:

ProxySelector selector = new LocalProxySelector();

ProxySelector.setDefault(selector);

From this point forward, all connections opened by that virtual machine will ask the ProxySelector for the right proxy to use. You normally shouldn’t use this in code running in a shared environment. For instance, you wouldn’t change the ProxySelector in a servlet because that would change the ProxySelector for all servlets running in the same container.
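If you do need to swap the selector temporarily, one cautious pattern is to save the previous selector and restore it when you’re done. The sketch below shows that pattern, along with a selector written in the spirit of Example 5-9 that can be exercised without any network access. The names FallbackSelectorDemo, FallbackSelector, and firstChoice() are hypothetical, as is the proxy address. Unlike Example 5-9, this variant stores its failures in a concurrent set, on the assumption that connections may be opened from several threads:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.SocketAddress;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class FallbackSelectorDemo {

  // Hypothetical selector: proxies HTTP until a URI fails, then connects directly
  static class FallbackSelector extends ProxySelector {

    private final Set<URI> failed = ConcurrentHashMap.newKeySet();
    private final Proxy proxy = new Proxy(Proxy.Type.HTTP,
        InetSocketAddress.createUnresolved("proxy.example.com", 8000));

    @Override
    public List<Proxy> select(URI uri) {
      if (failed.contains(uri) || !"http".equalsIgnoreCase(uri.getScheme())) {
        return Collections.singletonList(Proxy.NO_PROXY);
      }
      return Collections.singletonList(proxy);
    }

    @Override
    public void connectFailed(URI uri, SocketAddress address, IOException ex) {
      failed.add(uri);
    }
  }

  // Hypothetical helper: the type of the first proxy the selector suggests
  static Proxy.Type firstChoice(ProxySelector selector, String uri) {
    return selector.select(URI.create(uri)).get(0).type();
  }

  public static void main(String[] args) {
    ProxySelector previous = ProxySelector.getDefault();
    FallbackSelector selector = new FallbackSelector();
    ProxySelector.setDefault(selector);
    try {
      URI page = URI.create("http://www.example.com/");
      System.out.println(firstChoice(selector, "http://www.example.com/"));  // HTTP

      // Simulate the callback Java makes when the proxy can't be reached
      selector.connectFailed(page,
          InetSocketAddress.createUnresolved("proxy.example.com", 8000),
          new IOException("simulated failure"));
      System.out.println(firstChoice(selector, "http://www.example.com/"));  // DIRECT

      // Non-HTTP schemes are never proxied by this selector
      System.out.println(firstChoice(selector, "ftp://ftp.example.com/"));   // DIRECT
    } finally {
      // Put the original selector back so other code is unaffected
      ProxySelector.setDefault(previous);
    }
  }
}
```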

Communicating with Server-Side Programs Through GET

The URL class makes it easy for Java applets and applications to communicate with server-side programs such as CGIs, servlets, PHP pages, and others that use the GET method. (Server-side programs that use the POST method require the URLConnection class and are discussed in Chapter 7.) All you need to know is what combination of names and values the program expects to receive. Then you can construct a URL with a query string that provides the requisite names and values. All names and values must be x-www-form-urlencoded, as by the URLEncoder.encode() method discussed earlier in this chapter.

There are a number of ways to determine the exact syntax for a query string that talks to a particular program. If you’ve written the server-side program yourself, you already know the name-value pairs it expects. If you’ve installed a third-party program on your own server, the documentation for that program should tell you what it expects. If you’re talking to a documented external network API such as the eBay Shopping API, then the service usually provides fairly detailed documentation to tell you exactly what data to send for which purposes.

Many programs are designed to process form input. If this is the case, it’s straightforward to figure out what input the program expects. The method the form uses should be the value of the METHOD attribute of the FORM element. This value should be either GET, in which case you use the process described here, or POST, in which case you use the process described in Chapter 7. The part of the URL that precedes the query string is given by the value of the ACTION attribute of the FORM element. Note that this may be a relative URL, in which case you’ll need to determine the corresponding absolute URL. Finally, the names in the name-value pairs are simply the values of the NAME attributes of the INPUT elements. The values of the pairs are whatever the user types into the form.

For example, consider this HTML form for the local search engine on my Cafe con Leche site. You can see that it uses the GET method. The program that processes the form is accessed via the URL http://www.google.com/search. It has four separate name-value pairs, three of which have default values:

<br />

<input type="image" height="22" width="55"

src="images/search_blue.gif" alt="search" border="0"

name="search-image" />

</form>

The type of the INPUT field doesn’t matter. For instance, it doesn’t matter if it’s a set of checkboxes, a pop-up list, or a text field. Only the name of each INPUT field and the value you give it is significant. The submit input tells the web browser when to send the data but does not give the server any extra information. Sometimes you find hidden INPUT fields that must have particular required default values. This form has three hidden INPUT fields. There are many different form tags in HTML that produce pop-up menus, radio buttons, and more. However, although these input widgets appear different to the user, the format of data they send to the server is the same. Each form element provides a name and an encoded string value.
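The mapping from a form to a GET URL can be sketched in a few lines of code. Everything specific in this example is hypothetical: the class name FormUrlBuilder, the buildGetUrl() method, the page and ACTION URLs, and the field names q and lang. The sketch assumes the ACTION value does not already contain a query string:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormUrlBuilder {

  // pageUrl is the page containing the form; action is the ACTION attribute's value
  static String buildGetUrl(String pageUrl, String action, Map<String, String> fields) {
    // A relative ACTION is resolved against the URL of the page
    URI target = URI.create(pageUrl).resolve(action);
    StringBuilder url = new StringBuilder(target.toString());
    char separator = '?';
    try {
      for (Map.Entry<String, String> field : fields.entrySet()) {
        url.append(separator)
           .append(URLEncoder.encode(field.getKey(), "UTF-8"))
           .append('=')
           .append(URLEncoder.encode(field.getValue(), "UTF-8"));
        separator = '&';  // every pair after the first is joined with &
      }
    } catch (UnsupportedEncodingException ex) {
      throw new RuntimeException("Broken VM does not support UTF-8");
    }
    return url.toString();
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps the fields in the order they were added
    Map<String, String> fields = new LinkedHashMap<>();
    fields.put("q", "java network programming");
    fields.put("lang", "en");
    System.out.println(buildGetUrl(
        "http://www.example.com/forms/page.html", "../cgi/search", fields));
    // http://www.example.com/cgi/search?q=java+network+programming&lang=en
  }
}
```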

In some cases, the program you’re talking to may not be able to handle arbitrary text strings for values of particular inputs. However, since the form is meant to be read and filled in by human beings, it should provide sufficient clues to figure out what input is expected; for instance, that a particular field is supposed to be a two-letter state abbreviation or a phone number. Sometimes the inputs may not have such obvious names. There may not even be a form, just links to follow. In this case, you have to do some experimenting, first copying some existing values and then tweaking them to see what values are and aren’t accepted. You don’t need to do this in a Java program. You can simply edit the URL in the address or location bar of your web browser window.

TIP

The likelihood that other hackers may experiment with your own server-side programs in such a fashion is a good reason to make them extremely robust against unexpected input.

Regardless of how you determine the set of name-value pairs the server expects, communicating with it once you know them is simple. All you have to do is create a query string that includes the necessary name-value pairs, then form a URL that includes that query string. Send the query string to the server and read its response using the same methods you use to connect to a server and retrieve a static HTML page. There’s no special protocol to follow once the URL is constructed. (There is a special protocol to follow for the POST method, however, which is why discussion of that method will have to wait until Chapter 7.)

To demonstrate this procedure, let’s write a very simple command-line program to look up topics in the Open Directory. This site is shown in Figure 5-1 and it has the advantage of being really simple.

Figure 5-1. The user interface for the Open Directory

The Open Directory interface is a simple form with one input field named q; input typed in this field is sent to a program at http://www.dmoz.org/search, which does the actual search. The HTML for the form looks like this:

<input style="*vertical-align:middle; *padding-top:1px;" value="Search"

class="btn" type="submit">

<a href="search?type=advanced"><span class="advN">advanced</span></a>

</form>

There are only two input fields in this form: the Submit button and a text field named q. Thus, to submit a search request to the Open Directory, you just need to append q=searchTerm to http://www.dmoz.org/search. For example, to search for “java”, you would open a connection to the URL http://www.dmoz.org/search/?q=java and read the resulting input stream. Example 5-10 does exactly this.

Example 5-10. Do an Open Directory search

import java.io.*;

import java.net.*;

public class DMoz {

public static void main(String[] args) {

String target = "";

for (int i = 0; i < args.length; i++) {

target += args[i] + " ";

}

target = target.trim();

QueryString query = new QueryString();

query.add("q", target);

try {

URL u = new URL("http://www.dmoz.org/search/?" + query);

try (InputStream in = new BufferedInputStream(u.openStream())) {

InputStreamReader theHTML = new InputStreamReader(in);

int c;

while ((c = theHTML.read()) != -1) {

System.out.print((char) c);

}

}

} catch (MalformedURLException ex) {

System.err.println(ex);

} catch (IOException ex) {

System.err.println(ex);

}

}

}

Of course, a lot more effort could be expended on parsing and displaying the results. But notice how simple the code was to talk to this server. Aside from the funky-looking URL and the slightly greater likelihood that some pieces of it need to be x-www-form-urlencoded, talking to a server-side program that uses GET is no harder than retrieving any other HTML page.

Accessing Password-Protected Sites

Many popular sites require a username and password for access. Some sites, such as the W3C member pages, implement this through HTTP authentication. Others, such as the New York Times website, implement it through cookies and HTML forms. Java’s URL class can access sites that use HTTP authentication, although you’ll of course need to tell it which username and password to use.

Supporting sites that use nonstandard, cookie-based authentication is more challenging, not least because this varies a lot from one site to another. Implementing cookie authentication is hard short of implementing a complete web browser with full HTML forms and cookie support; we’ll discuss Java’s cookie support in Chapter 7. Accessing sites protected by standard HTTP authentication is much easier.

The Authenticator Class

The java.net package includes an Authenticator class you can use to provide a username and password for sites that protect themselves using HTTP authentication:

public abstract class Authenticator extends Object

Since Authenticator is an abstract class, you must subclass it. Different subclasses may retrieve the information in different ways. For example, a character mode program might just ask the user to type the username and password on System.in. A GUI program would likely put up a dialog box like the one shown in Figure 5-2. An automated robot might read the username out of an encrypted file.

Figure 5-2. An authentication dialog

To make the URL class use the subclass, install it as the default authenticator by passing it to the static Authenticator.setDefault() method:

public static void setDefault(Authenticator a)

For example, if you’ve written an Authenticator subclass named DialogAuthenticator, you’d install it like this:

Authenticator.setDefault(new DialogAuthenticator());

You only need to do this once. From this point forward, when the URL class needs a username and password, it will ask the DialogAuthenticator using the static Authenticator.requestPasswordAuthentication() method:

public static PasswordAuthentication requestPasswordAuthentication(
    InetAddress address, int port, String protocol, String prompt, String scheme)
    throws SecurityException

The address argument is the host for which authentication is required. The port argument is the port on that host, and the protocol argument is the application layer protocol by which the site is being accessed. The HTTP server provides the prompt. It’s typically the name of the realm for which authentication is required. (Some large web servers such as www.ibiblio.org have multiple realms, each of which requires different usernames and passwords.) The scheme is the authentication scheme being used. (Here the word scheme is not being used as a synonym for protocol. Rather, it is an HTTP authentication scheme, typically basic.)
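As a minimal sketch of how installation and lookup fit together, the following program installs a trivial authenticator and then calls requestPasswordAuthentication() directly, much as the URL class does on your behalf. The class names FixedAuthenticatorDemo and FixedAuthenticator, the credentials, and the realm string are all hypothetical, and a real program would not hardcode a password.

```java
import java.net.Authenticator;
import java.net.InetAddress;
import java.net.PasswordAuthentication;

public class FixedAuthenticatorDemo {

  // Hypothetical authenticator that always returns the same credentials.
  static class FixedAuthenticator extends Authenticator {
    @Override
    protected PasswordAuthentication getPasswordAuthentication() {
      return new PasswordAuthentication("alice", "secret".toCharArray());
    }
  }

  // Installs the authenticator, then asks it for credentials directly.
  public static PasswordAuthentication lookup() {
    Authenticator.setDefault(new FixedAuthenticator());
    return Authenticator.requestPasswordAuthentication(
        InetAddress.getLoopbackAddress(), // host requesting authentication
        80,                               // port on that host
        "http",                           // application layer protocol
        "Members Only",                   // prompt, typically the realm name
        "basic");                         // HTTP authentication scheme
  }

  public static void main(String[] args) {
    PasswordAuthentication auth = lookup();
    System.out.println(auth.getUserName()); // prints "alice"
  }
}
```

Calling the static method yourself is mostly useful for testing an Authenticator subclass without a password-protected server at hand.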

Untrusted applets are not allowed to ask the user for a name and password. Trusted applets can do so, but only if they possess the requestPasswordAuthentication NetPermission. Otherwise, Authenticator.requestPasswordAuthentication() throws a SecurityException.

The Authenticator subclass must override the getPasswordAuthentication() method. Inside this method, you collect the username and password from the user or some other source and return it as an instance of the java.net.PasswordAuthentication class:

protected PasswordAuthentication getPasswordAuthentication()

If you don’t want to authenticate this request, return null, and Java will tell the server it doesn’t know how to authenticate the connection. If you submit an incorrect username or password, Java will call getPasswordAuthentication() again to give you another chance to provide the right data. You normally have five tries to get the username and password correct; after that, openStream() throws a ProtocolException.
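A character-mode authenticator along these lines might look like the following sketch. The class name ReaderAuthenticator is hypothetical, and reading from a BufferedReader is an assumption made so the class can be driven from System.in or from a test string. It returns null when the user supplies no username, which declines to authenticate.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.InetAddress;
import java.net.PasswordAuthentication;

public class ReaderAuthenticator extends Authenticator {

  private final BufferedReader in;

  public ReaderAuthenticator(BufferedReader in) {
    this.in = in;
  }

  @Override
  protected PasswordAuthentication getPasswordAuthentication() {
    try {
      System.out.print("Username for " + getRequestingPrompt() + ": ");
      String username = in.readLine();
      System.out.print("Password: ");
      String password = in.readLine();
      // A missing or empty answer means "do not authenticate this request".
      if (username == null || username.isEmpty() || password == null) {
        return null;
      }
      return new PasswordAuthentication(username, password.toCharArray());
    } catch (IOException ex) {
      return null;
    }
  }

  public static void main(String[] args) {
    BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in));
    Authenticator.setDefault(new ReaderAuthenticator(stdin));
    // Simulate a request instead of contacting a real server.
    PasswordAuthentication auth = Authenticator.requestPasswordAuthentication(
        InetAddress.getLoopbackAddress(), 80, "http", "Members Only", "basic");
    System.out.println(auth == null
        ? "No credentials supplied"
        : "Got credentials for " + auth.getUserName());
  }
}
```

This sketch echoes the password as it is typed and holds it briefly in a String. Where a real console is attached, System.console().readPassword() avoids both problems because it disables echoing and returns a char array.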

Usernames and passwords are cached within the same virtual machine session. Once you set the correct password for a realm, you shouldn’t be asked for it again unless you’ve explicitly deleted the password by zeroing out the char array that contains it.

You can get more details about the request by invoking any of these methods inherited from the Authenticator superclass:

protected final InetAddress getRequestingSite()

protected final int getRequestingPort()

protected final String getRequestingProtocol()

protected final String getRequestingPrompt()

protected final String getRequestingScheme()

protected final String getRequestingHost()

protected URL getRequestingURL()

protected Authenticator.RequestorType getRequestorType()

These methods either return the information as given in the last call to requestPasswordAuthentication() or return null if that information is not available. (If the port isn’t available, getRequestingPort() returns -1.)

The getRequestingURL() method returns the complete URL for which authentication has been requested, an important detail if a site uses different names and passwords for different files. The getRequestorType() method returns one of two named constants (i.e., Authenticator.RequestorType.PROXY or Authenticator.RequestorType.SERVER) to indicate whether the proxy server or the origin server is requesting the authentication.
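These getters let one authenticator serve several sites. The following sketch keeps separate credentials per host and a separate set for a proxy. The class name PerHostAuthenticator, its addServer(), setProxy(), and ask() methods, and the example hostnames are hypothetical. The ask() helper simulates a request through the longer overload of requestPasswordAuthentication(), which also accepts the hostname, URL, and requestor type.

```java
import java.net.Authenticator;
import java.net.PasswordAuthentication;
import java.util.HashMap;
import java.util.Map;

public class PerHostAuthenticator extends Authenticator {

  // Hypothetical in-memory credential store, keyed by hostname.
  private final Map<String, PasswordAuthentication> servers = new HashMap<>();
  private PasswordAuthentication proxy;

  public void addServer(String host, String user, char[] password) {
    servers.put(host, new PasswordAuthentication(user, password));
  }

  public void setProxy(String user, char[] password) {
    proxy = new PasswordAuthentication(user, password);
  }

  @Override
  protected PasswordAuthentication getPasswordAuthentication() {
    // A proxy challenge gets the proxy credentials, whatever the target host.
    if (getRequestorType() == Authenticator.RequestorType.PROXY) {
      return proxy;
    }
    String host = getRequestingHost();
    // Unknown host: return null to decline authentication.
    return host == null ? null : servers.get(host);
  }

  // Simulates a challenge from the given host via the default authenticator.
  public static PasswordAuthentication ask(String host,
      Authenticator.RequestorType type) {
    return Authenticator.requestPasswordAuthentication(
        host, null, 443, "https", "Members Only", "basic", null, type);
  }

  public static void main(String[] args) {
    PerHostAuthenticator auth = new PerHostAuthenticator();
    auth.addServer("members.example.com", "alice", "secret".toCharArray());
    auth.setProxy("proxyuser", "proxypass".toCharArray());
    Authenticator.setDefault(auth);

    System.out.println(ask("members.example.com",
        Authenticator.RequestorType.SERVER).getUserName());
    System.out.println(ask("other.example.com",
        Authenticator.RequestorType.SERVER));
    System.out.println(ask("other.example.com",
        Authenticator.RequestorType.PROXY).getUserName());
  }
}
```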

The PasswordAuthentication Class

PasswordAuthentication is a very simple final class that supports two read-only properties: username and password. The username is a String. The password is a char array so that the password can be erased when it’s no longer needed. A String would have to wait to be garbage collected before it could be erased, and even then it might still exist somewhere in memory on the local system, possibly even on disk if the block of memory that contained it had been swapped out to virtual memory at one point. Both username and password are set in the constructor:

public PasswordAuthentication(String userName, char[] password)

Each is accessed via a getter method:

public String getUserName()

public char[] getPassword()
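As a brief sketch of the erasure described above, the following program builds a PasswordAuthentication and then overwrites the password with zeros. The class name PasswordErasure and its erase() helper are hypothetical. The sketch relies on the documented behavior that getPassword() returns a reference to the object's own array, so filling that array erases the stored password; it also clears the array passed to the constructor, to be safe.

```java
import java.net.PasswordAuthentication;
import java.util.Arrays;

public class PasswordErasure {

  // Overwrites the password held inside the object with zeros.
  public static void erase(PasswordAuthentication auth) {
    Arrays.fill(auth.getPassword(), '\0');
  }

  public static void main(String[] args) {
    char[] typed = {'s', 'e', 'c', 'r', 'e', 't'};
    PasswordAuthentication auth = new PasswordAuthentication("alice", typed);

    // ... use auth to authenticate ...

    erase(auth);              // clear the array held by the object
    Arrays.fill(typed, '\0'); // clear the array we created ourselves

    System.out.println(auth.getUserName());           // the username is unaffected
    System.out.println(auth.getPassword()[0] == '\0'); // the password is gone
  }
}
```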

The JPasswordField Class

One useful tool for asking users for their passwords in a more or less secure fashion is the JPasswordField component from Swing:

public class JPasswordField extends JTextField

This lightweight component behaves almost exactly like a text field. However, anything the user types into it is echoed as an asterisk. This way, the password is safe from anyone looking over the user’s shoulder at what’s being typed on the screen.

JPasswordField also stores the password as a char array so that you can overwrite it with zeros when you’re done with it. The getPassword() method returns this array:

public char[] getPassword()

Otherwise, you mostly use the methods it inherits from the JTextField superclass. Example 5-11 demonstrates a Swing-based Authenticator subclass that brings up a dialog to ask the user for a username and password. Most of this code handles the GUI. A JPasswordField collects the password and a simple JTextField collects the username. Flip back to Figure 5-2 to see the rather simple dialog box this produces.

Example 5-11. A GUI authenticator

import java.awt.*;
import java.awt.event.*;
import java.net.*;
import javax.swing.*;

public class DialogAuthenticator extends Authenticator {

  private JDialog passwordDialog;
  private JTextField usernameField = new JTextField(20);
  private JPasswordField passwordField = new JPasswordField(20);
  private JButton okButton = new JButton("OK");
  private JButton cancelButton = new JButton("Cancel");
  private JLabel mainLabel
      = new JLabel("Please enter username and password: ");

  public DialogAuthenticator() {
    this("", new JFrame());
  }

  public DialogAuthenticator(String username) {
    this(username, new JFrame());
  }

  public DialogAuthenticator(JFrame parent) {
    this("", parent);
  }

  public DialogAuthenticator(String username, JFrame parent) {
    // A modal dialog: setVisible(true) blocks until the dialog is hidden.
    this.passwordDialog = new JDialog(parent, true);
    Container pane = passwordDialog.getContentPane();
    pane.setLayout(new GridLayout(4, 1));

    JLabel userLabel = new JLabel("Username: ");
    JLabel passwordLabel = new JLabel("Password: ");
    pane.add(mainLabel);

    JPanel p2 = new JPanel();
    p2.add(userLabel);
    p2.add(usernameField);
    usernameField.setText(username);
    pane.add(p2);

    JPanel p3 = new JPanel();
    p3.add(passwordLabel);
    p3.add(passwordField);
    pane.add(p3);

    JPanel p4 = new JPanel();
    p4.add(okButton);
    p4.add(cancelButton);
    pane.add(p4);
    passwordDialog.pack();

    // Pressing Enter in either field is the same as clicking OK.
    ActionListener al = new OKResponse();
    okButton.addActionListener(al);
    usernameField.addActionListener(al);
    passwordField.addActionListener(al);
    cancelButton.addActionListener(new CancelResponse());
  }

  private void show() {
    String prompt = this.getRequestingPrompt();
    if (prompt == null) {
      // No realm was supplied, so build a prompt from the request details.
      InetAddress address = this.getRequestingSite();
      String site = (address == null) ? null : address.getHostName();
      String protocol = this.getRequestingProtocol();
      int port = this.getRequestingPort();
      if (site != null && protocol != null) {
        prompt = protocol + "://" + site;
        if (port > 0) prompt += ":" + port;
      } else {
        prompt = "";
      }
    }

    mainLabel.setText("Please enter username and password for "
        + prompt + ": ");
    passwordDialog.pack();
    passwordDialog.setVisible(true);
  }

  PasswordAuthentication response = null;

  class OKResponse implements ActionListener {

    @Override
    public void actionPerformed(ActionEvent e) {
      passwordDialog.setVisible(false);
      // The password is returned as an array of
      // chars for security reasons.
      char[] password = passwordField.getPassword();
      String username = usernameField.getText();
      // Erase the password in case this is used again.
      passwordField.setText("");
      response = new PasswordAuthentication(username, password);
    }
  }

  class CancelResponse implements ActionListener {

    @Override
    public void actionPerformed(ActionEvent e) {
      passwordDialog.setVisible(false);
      // Erase the password in case this is used again.
      passwordField.setText("");
      response = null;
    }
  }

  @Override
  public PasswordAuthentication getPasswordAuthentication() {
    this.show();
    return this.response;
  }
}

Example 5-12 is a revised SourceViewer program that asks the user for a name and password using the DialogAuthenticator class.

Example 5-12. A program to download password-protected web pages

import java.io.*;
import java.net.*;

public class SecureSourceViewer {

  public static void main(String[] args) {

    Authenticator.setDefault(new DialogAuthenticator());

    for (int i = 0; i < args.length; i++) {
      try {
        // Open the URL for reading
        URL u = new URL(args[i]);
        try (InputStream in = new BufferedInputStream(u.openStream())) {
          // chain the InputStream to a Reader
          Reader r = new InputStreamReader(in);
          int c;
          while ((c = r.read()) != -1) {
            System.out.print((char) c);
          }
        }
      } catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a parseable URL");
      } catch (IOException ex) {
        System.err.println(ex);
      }

      // print a blank line to separate pages
      System.out.println();
    }

    // Since we used the AWT, we have to explicitly exit.
    System.exit(0);
  }
}